#### **PASI Summer School**

### Advanced Algorithmic Techniques for GPUs

# Lecture 3: Blocking/Tiling for Locality

# Objective

- Reuse each data accessed from the global memory multiple times
  - Across threads shared memory blocking
  - Within a thread register tiling
- Register tiling is also often used to re-use computation results for increased efficiency.



# **Basic Concept of Blocking/Tiling**

- In a congested traffic system, significant reduction of vehicles can greatly improve the delay seen by all vehicles
  - Carpooling for commuters
  - Blocking/Tiling for global memory accesses
    - drivers = threads,
    - cars = data





# Some computations are more challenging to block/tile than others.

- Some carpools may be easier than others
  - More efficient if neighbors are also classmates or coworkers
  - Some vehicles may be more suitable for carpooling
- Similar variations exist in blocking/tiling





# Carpools need synchronization.

Good – when people have similar schedule



# Same with Blocking/Tiling

Good – when threads have similar access timing



• Bad – when threads have very different timing

# **Outline of Technique**

- Identify a block/tile of global memory content that are accessed by multiple threads
- Load the block/tile from global memory into onchip memory
- Have the multiple threads to access their data from the on-chip memory
- Move on to the next block/tile

# **Tiled Matrix Multiply**

- Each row of Md is accessed by multiple threads
- Problem: some threads can be much further along than others
  - An entire row may need to be in on-chip memory
  - Not enough on-chip memory for
     large input matrices
     Thread 1



2

9

# A Small Example

- Can we use two onchip memory locations to reduce the number of M accesses by the two threads?
  - Not if the two threads can have very different timing!



### Every M and N Element is used exactly twice in generating a 2X2 tile of P

|                 | P <sub>0,0</sub>                    | P <sub>1,0</sub>                    | P <sub>0,1</sub>                    | P <sub>1,1</sub>                    |
|-----------------|-------------------------------------|-------------------------------------|-------------------------------------|-------------------------------------|
|                 | thread <sub>0,0</sub>               | thread <sub>1,0</sub>               | thread <sub>0,1</sub>               | thread <sub>1,1</sub>               |
|                 | M <sub>0,0</sub> * N <sub>0,0</sub> | $M_{0,0} * N_{1}$                   | M <sub>0,1</sub> * N <sub>0,0</sub> | M <sub>0,1</sub> * N <sub>1</sub>   |
| Access<br>order | $M_{10} * N_{0,1}$                  | $M_{1,0} * N_{1,1}$                 | M <sub>1,1</sub> * N <sub>0,1</sub> | M <sub>1,1</sub> * N <sub>1,1</sub> |
|                 | M <sub>2,0</sub> * N <sub>0,2</sub> | M <sub>2,0</sub> * N <sub>1,2</sub> | M <sub>2,1</sub> * N <sub>0,2</sub> | M <sub>2,1</sub> * N <sub>1,2</sub> |
|                 | M <sub>3,0</sub> * N <sub>0,3</sub> | M <sub>3,0</sub> * N <sub>1,3</sub> | M <sub>3,1</sub> * N <sub>0,3</sub> | M <sub>3,1</sub> * N <sub>1,3</sub> |

## Breaking Md and Nd into Tiles



# Breaking Md and Nd into Tiles (cont.)



| Each phase uses one tile from Md and |                                              |                                              |                                                                                                                 |                                              |                                              |                                                                                                                 |  |  |  |
|--------------------------------------|----------------------------------------------|----------------------------------------------|-----------------------------------------------------------------------------------------------------------------|----------------------------------------------|----------------------------------------------|-----------------------------------------------------------------------------------------------------------------|--|--|--|
|                                      | one from Nd                                  |                                              |                                                                                                                 |                                              |                                              |                                                                                                                 |  |  |  |
| Phase 1                              |                                              |                                              | Phase 2                                                                                                         |                                              |                                              |                                                                                                                 |  |  |  |
| T <sub>0,0</sub>                     | Md <sub>0,0</sub><br>↓<br>Mds <sub>0,0</sub> | Nd <sub>0,0</sub><br>↓<br>Nds <sub>0,0</sub> | $PValue_{0,0} += Mds_{0,0}^*Nds_{0,0} + Mds_{1,0}^*Nds_{0,1}$                                                   | Md <sub>2,0</sub><br>↓<br>Mds <sub>0,0</sub> | Nd <sub>0,2</sub><br>↓<br>Nds <sub>0,0</sub> | $PValue_{0,0} += Mds_{0,0}*Nds_{0,0} + Mds_{1,0}*Nds_{0,1}$                                                     |  |  |  |
| T <sub>1,0</sub>                     | Md <sub>1,0</sub><br>↓<br>Mds <sub>1,0</sub> | Nd <sub>1,0</sub><br>↓<br>Nds <sub>1,0</sub> | PValue <sub>1,0</sub> +=<br>Mds <sub>0,0</sub> *Nds <sub>1,0</sub> +<br>Mds <sub>1,0</sub> *Nds <sub>1,1</sub>  | Md <sub>3,0</sub><br>↓<br>Mds <sub>1,0</sub> | Nd <sub>1,2</sub><br>↓<br>Nds <sub>1,0</sub> | $PValue_{1,0} += Mds_{0,0}*Nds_{1,0} + Mds_{1,0}*Nds_{1,1}$                                                     |  |  |  |
| <b>T</b> <sub>0,1</sub>              | $Md_{0,1}$ $\downarrow$ $Mds_{0,1}$          | Nd <sub>0,1</sub><br>↓<br>Nds <sub>0,1</sub> | PdValue <sub>0,1</sub> +=<br>MdS <sub>0,1</sub> *NdS <sub>0,0</sub> +<br>MdS <sub>1,1</sub> *NdS <sub>0,1</sub> | Md <sub>2,1</sub><br>↓<br>Mds <sub>0,1</sub> | Nd <sub>0,3</sub><br>↓<br>Nds <sub>0,1</sub> | $\begin{array}{l} PdValue_{0,1} += \\ Mds_{0,1}^* Nds_{0,0} + \\ Mds_{1,1}^* Nds_{0,1} \end{array}$             |  |  |  |
| T <sub>1,1</sub>                     | Md <sub>1,1</sub><br>↓<br>Mds <sub>1,1</sub> | Nd <sub>1,1</sub><br>↓<br>Nds <sub>1,1</sub> | PdValue <sub>11</sub> +=<br>Mds <sub>0,1</sub> *Nds <sub>1,0</sub> +<br>Mds <sub>1,1</sub> *Nds <sub>1,1</sub>  | Md <sub>3,1</sub><br>↓<br>Mds <sub>1,1</sub> | Nd <sub>1,3</sub><br>↓<br>Nds <sub>1,1</sub> | PdValue <sub>1,1</sub> +=<br>Mds <sub>0,1</sub> *Nds <sub>1,0</sub> +<br>Mds <sub>1,1</sub> *Nds <sub>1,1</sub> |  |  |  |

©Wen-mei W. Hwu and David Kirk/NVIDIA, Chile, January 5-7, 2011 time

# Tiled Multiply – Large Matrices

- Make sure that tiles are all loaded in vertical patters from the global memory
- Md data can then be accessed from shared memory in horizontal direction

0



2



### **First-order Size Considerations**

#### • Assume

- TILE\_WIDTH of 16 gives 16\*16 = 256 threads
- A 1024\*1024 Pd gives 64\*64 = 4096 Thread Blocks
- Each thread block perform 2\*256 = 512 float loads from global memory for 256 \* (2\*16) = 8,192 mul/add operations.
  - Memory bandwidth no longer a limiting factor
  - Could use thread coarsening to further reduce traffic
- Each thread block can have up to 1024 threads
   Can use 32\*32 tiles to further reduce traffic



# Memory Layout of a Matrix in C

Access direction in Kernel code





# Memory Layout of a Matrix in C



# Loading a Tile

- All threads in a block participate
  - Each thread loads one Md element and one Nd element in based tiled code
- Assign the loaded element to each thread such that the accesses within each warp is coalesced

# CUDA Code – Kernel Execution Configuration

// Setup the execution configuration

dim3 dimBlock(TILE\_WIDTH, TILE\_WIDTH);

dim3 dimGrid(Width / TILE\_WIDTH,

Width / TILE\_WIDTH);

# **Tiled Multiply**

bv

K

m

12

F WIDTH

- Each block computes one square sub-matrix Pd<sub>sub</sub> of size TILE\_WIDTH
- Each thread computes one element of Pd<sub>sub</sub>

0





by

```
Tiled Matrix Multiplication Kernel
  _global___ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)
{
    shared float Mds[TILE_WIDTH][TILE_WIDTH];
2.
     shared float Nds[TILE WIDTH][TILE WIDTH];
    int bx = blockIdx.x; int by = blockIdx.y;
3.
    int tx = threadIdx.x; int ty = threadIdx.y;
4.
   Identify the row and column of the Pd element to work on
11
5.
    int Row = by * TILE_WIDTH + ty;
    int Col = bx * TILE WIDTH + tx;
6.
    float Pvalue = 0;
7.
  Loop over the Md and Nd tiles required to compute the Pd element
     for (int m = 0; m < Width/TILE_WIDTH; ++m) {</pre>
8.
   Coolaborative loading of Md and Nd tiles into shared memory
//
9.
       Mds[tx][ty] = Md[Row*Width + m*TILE WIDTH + tx];
      Nds[tx][ty] = Nd[(m*TILE_WIDTH + ty) * Width + Col)];
10.
      _____syncthreads();
11.
      for (int k = 0; k < TILE_WIDTH; ++k)
12.
13.
         Pvalue += Mds[tx][k] * Nds[k][ty];
14.
      ____syncthreads();
15.}
16.
     Pd[Row*Width+Col] = Pvalue;
```

23

## Shared Memory and Threading

- Each SM in Fermi has 64KB on-chip SRAM, partitioned into 48KB L1 cache and 16KB shared memory, or vice versa
  - SM shared memory size is implementation dependent!
  - For TILE\_WIDTH = 16, each thread block uses 2\*256\*4B = 2KB of shared memory.
  - Can potentially have up to 8 Thread Blocks actively executing
    - This allows up to 8\*512 = 4,096 pending loads. (2 per thread, 256 threads per block)
  - The next TILE\_WIDTH 32 would lead to 2\*32\*32\*4B= 8KB shared memory usage per thread block, allowing 2 or 6 thread blocks active at the same time (Problem with earlier GPUs!)
- Using 16x16 tiling, we reduce the accesses to the global memory by a factor of 16
  - A 150GB/s bandwidth can now support (150/4)\*16 = 600 GFLOPS!

# **ANY MORE QUESTIONS?**