llvm-project

vector-to-mma-ops-mma-sync.mlir	[mlir][nvgpu] Fix affine maps computing indices for LdMatrixOp srcMemref	2022-12-01 18:26:33 -08:00
vector-to-mma-ops.mlir	Revert "[WIP] Add support for MMA conversion for 1-D vector.transfer followed by a broadcast to 2-D"	2022-12-01 02:57:03 -08:00

Manish Gupta 9774cd17e8 [mlir][nvgpu] Fix affine maps computing indices for LdMatrixOp srcMemref

This patch fixes and simplifies the ldmatrix affine map arithmetic by
abstracting the affine expressions in terms of pitch-linear layout
(strided and contiguous dimensions). It then applies the maps for the
strided and contiguous dimensions in row-major and col-major layouts.

LdMatrixOp collaboratively (32 threads in a warp) loads tiles
(8 rows x 128b columns) of data. It can load x1, x2, or x4 tiles.
Additionally, it can transpose at 16-bit granularity when moving
data from shared memory to registers.

This patch fixes the affine map:
(laneid -> coordinate index that a thread points to in a tile).

- Loading x4 tiles needs all 32 lanes (T0-31) to point to a contiguous
  chunk of 128b. The issue was exposed when running this case.
- Loading x2 and x1 tiles needs threads T0-15 and T0-7, respectively, to
  point to a contiguous chunk of 128b. The patch is NFC for these cases.

Differential Revision: https://reviews.llvm.org/D138978
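
The lane-to-coordinate mapping described above can be illustrated with a
small Python sketch. This is not code from the patch: the function name,
its parameters, and the exact formula (lane modulo the strided extent for
the strided coordinate, lane floor-divided by the strided extent times the
number of elements per 128b for the contiguous coordinate) are assumptions
chosen to match the pitch-linear description in the commit message.

```python
def ldmatrix_lane_to_coord(lane_id, strided_extent, element_bits=16, row_major=True):
    """Hypothetical model of the laneid -> (row, col) map for ldmatrix.

    Assumption: each lane points at the start of one 128b chunk, with lanes
    walking the strided dimension first and then stepping along the
    contiguous dimension in units of 128b.
    """
    elements_per_128b = 128 // element_bits
    strided = lane_id % strided_extent
    contiguous = (lane_id // strided_extent) * elements_per_128b
    # Row-major: the strided dimension is the row index.
    # Col-major: the two coordinates swap roles.
    return (strided, contiguous) if row_major else (contiguous, strided)


# x4 case on a 16x16 tile of 16-bit elements: all 32 lanes are used and
# each lane points at a distinct 128b (8-element) chunk.
x4_coords = [ldmatrix_lane_to_coord(lane, strided_extent=16) for lane in range(32)]

# x2 and x1 cases: only lanes T0-15 and T0-7 supply addresses.
x2_coords = [ldmatrix_lane_to_coord(lane, strided_extent=16) for lane in range(16)]
x1_coords = [ldmatrix_lane_to_coord(lane, strided_extent=8) for lane in range(8)]
```

Under these assumptions, lanes 0-15 of the x4 case cover rows 0-15 at
column 0 and lanes 16-31 cover rows 0-15 at column 8, so no two lanes
share a chunk.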

2022-12-01 18:26:33 -08:00
