
Learning from Examples » Matrix Multiplication with CUDA GPU


Following up on Matrix Multiplication, this page studies how to accelerate a matrix multiplication workload on a GPU using tf::cudaGraph.

Define a Matrix Multiplication Kernel

A GPU can perform far more parallel computations than a CPU, which makes it especially well suited to data-intensive workloads such as matrix multiplication. On a GPU, we express the parallel pattern at a fine-grained level. The kernel, written in CUDA, is as follows:

// CUDA kernel to perform matrix multiplication
__global__ void matmul(int *A, int *B, int *C, int M, int K, int N) {
  int row = blockIdx.y * blockDim.y + threadIdx.y;
  int col = blockIdx.x * blockDim.x + threadIdx.x;
  int sum = 0;
  if(col < N && row < M) {
    for(int i = 0; i < K; i++) {
      sum += A[row * K + i] * B[i * N + col];
    }
    C[row * N + col] = sum;
  }
}

Each CUDA thread corresponds to one element of C and computes its result. Instead of storing each matrix in a 2D array, we use a 1D layout to ease data transfer between CPU and GPU. In a row-major layout, an element (x, y) of a 2D matrix of the given width is addressed at x * width + y in the flattened 1D layout.

Define a CUDA Graph for Matrix Multiplication

The next step is to allocate GPU memory for A, B, and C. We create three tasks, each calling cudaMalloc to allocate space for one matrix. Then we create a CUDA graph to offload the matrix multiplication to a GPU. The entire code is as follows:

void matrix_multiplication(int* A, int* B, int* C, int M, int K, int N) {

  tf::Taskflow taskflow;
  tf::Executor executor;

  int *da, *db, *dc;

  // allocate the gpu storage for A
  tf::Task allocate_a = taskflow.emplace([&](){
    cudaMalloc(&da, M*K*sizeof(int));
  }).name("allocate_a");

  // allocate the gpu storage for B
  tf::Task allocate_b = taskflow.emplace([&](){
    cudaMalloc(&db, K*N*sizeof(int));
  }).name("allocate_b");

  // allocate the gpu storage for C
  tf::Task allocate_c = taskflow.emplace([&](){
    cudaMalloc(&dc, M*N*sizeof(int));
  }).name("allocate_c");

  // create a CUDA graph task to run the matrix multiplication
  tf::Task cudaFlow = taskflow.emplace([&](){

    tf::cudaGraph cg;

    // copy data to da and db, and copy the result from dc back to C
    tf::cudaTask copy_da = cg.copy(da, A, M*K);
    tf::cudaTask copy_db = cg.copy(db, B, K*N);
    tf::cudaTask copy_hc = cg.copy(C, dc, M*N);

    dim3 grid ((N+16-1)/16, (M+16-1)/16);
    dim3 block(16, 16);

    tf::cudaTask kmatmul = cg.kernel(grid, block, 0, matmul, da, db, dc, M, K, N);

    kmatmul.succeed(copy_da, copy_db)
           .precede(copy_hc);

    // dump the CUDA graph
    cg.dump(std::cout);

    // instantiate an executable CUDA graph and run it through a stream
    tf::cudaStream stream;
    tf::cudaGraphExec exec(cg);
    stream.run(exec).synchronize();
  }).name("cudaFlow");

  // free the gpu storage
  auto free = taskflow.emplace([&](){
    cudaFree(da);
    cudaFree(db);
    cudaFree(dc);
  }).name("free");

  // create dependencies
  cudaFlow.succeed(allocate_a, allocate_b, allocate_c)
          .precede(free);

  // run the taskflow
  executor.run(taskflow).wait();
}

Within the cudaFlow, we create two host-to-device (H2D) tasks that copy data from A and B to da and db, one device-to-host (D2H) task that copies the result from dc to C, and one kernel task that launches matmul on the GPU (by default, GPU 0). The H2D tasks precede the kernel, and the kernel precedes the D2H task. These GPU operations form a GPU task graph managed by the cudaFlow. The first dump of the taskflow gives the following graph:

[Taskflow graph: allocate_a, allocate_b, and allocate_c each precede cudaFlow, which precedes free]

A cudaFlow encapsulates a GPU task dependency graph, similar to a tf::Subflow (see Subflow Tasking). To visualize it, we need to execute the graph first and then dump the taskflow.

[Taskflow graph after execution: the cudaFlow node expands into a subgraph in which H2D_a and H2D_b precede matmul, and matmul precedes D2H_c]

Benchmarking

We run three versions of matrix multiplication, sequential CPU, parallel CPUs, and one GPU, on a machine with 12 Intel Core i7-8700 CPU cores at 3.20 GHz and an NVIDIA RTX 2080 GPU, using various sizes of A, B, and C.

| A         | B         | C         | CPU Sequential | CPU Parallel | GPU Parallel |
|-----------|-----------|-----------|----------------|--------------|--------------|
| 10x10     | 10x10     | 10x10     | 0.142 ms       | 0.414 ms     | 82 ms        |
| 100x100   | 100x100   | 100x100   | 1.641 ms       | 0.733 ms     | 83 ms        |
| 1000x1000 | 1000x1000 | 1000x1000 | 1532 ms        | 504 ms       | 85 ms        |
| 2000x2000 | 2000x2000 | 2000x2000 | 25688 ms       | 4387 ms      | 133 ms       |
| 3000x3000 | 3000x3000 | 3000x3000 | 104838 ms      | 16170 ms     | 214 ms       |
| 4000x4000 | 4000x4000 | 4000x4000 | 250133 ms      | 39646 ms     | 427 ms       |

As the matrix size increases, the GPU's speed-up over the CPU becomes prominent. For example, at 4000x4000, the GPU runtime is 585.8 times faster than the sequential CPU runtime and 92.8 times faster than the parallel CPU solution.