| | Taskflow: A General-purpose Task-parallel Programming System |

Searching...

No Matches

Matrix Multiplication with CUDA GPU

Following Matrix Multiplication, we accelerate matrix multiplication on a CUDA GPU using tf::cudaGraph. The GPU's massive thread-level parallelism reduces large problem runtimes from minutes to milliseconds.

CUDA Kernel

Unlike the CPU version where each task processes one row, on the GPU we assign one CUDA thread to each element of C. We store all matrices in 1D row-major layout to simplify host-to-device transfers: element (x, y) in a matrix of width W is stored at index x * W + y.

The CUDA kernel is:

__global__ void matmul(int* A, int* B, int* C, int M, int K, int N) {

int row = blockIdx.y * blockDim.y + threadIdx.y;

int col = blockIdx.x * blockDim.x + threadIdx.x;

int sum = 0;

if(row < M && col < N) {

for(int i = 0; i < K; i++) {

sum += A[row * K + i] * B[i * N + col];

}

C[row * N + col] = sum;

}

Each thread computes one element of C by iterating over the full inner dimension K.

CUDA Graph Task

We build a Taskflow that allocates GPU memory in parallel, runs the CUDA graph, and frees GPU memory when done:

void matrix_multiplication(int* A, int* B, int* C, int M, int K, int N) {

tf::Taskflow taskflow;

tf::Executor executor;

int *da, *db, *dc;

// allocate GPU memory for A, B, and C in parallel

tf::Task allocate_a = taskflow.emplace(& {

cudaMalloc(&da, M * K * sizeof(int));

}).name("allocate_a");

tf::Task allocate_b = taskflow.emplace(& {

cudaMalloc(&db, K * N * sizeof(int));

}).name("allocate_b");

tf::Task allocate_c = taskflow.emplace(& {

cudaMalloc(&dc, M * N * sizeof(int));

}).name("allocate_c");

// build and execute the CUDA graph

tf::Task cuda = taskflow.emplace(& {

tf::cudaGraph cg;

// H2D transfers for A and B

tf::cudaTask copy_da = cg.copy(da, A, M * K);

tf::cudaTask copy_db = cg.copy(db, B, K * N);

// kernel: one thread per element of C

dim3 grid ((N + 15) / 16, (M + 15) / 16);

dim3 block(16, 16);

tf::cudaTask kmatmul = cg.kernel(grid, block, 0,

matmul, da, db, dc, M, K, N

);

// D2H transfer for C

tf::cudaTask copy_hc = cg.copy(C, dc, M * N);

kmatmul.succeed(copy_da, copy_db)

.precede(copy_hc);

tf::cudaStream stream;

tf::cudaGraphExec exec(cg);

stream.run(exec).synchronize();

}).name("cuda");

// free GPU memory

tf::Task free_mem = taskflow.emplace(& {

cudaFree(da);

cudaFree(db);

cudaFree(dc);

}).name("free");

cuda.succeed(allocate_a, allocate_b, allocate_c)

.precede(free_mem);

executor.run(taskflow).wait();

}

tf::Executor

class to create an executor

Definition executor.hpp:62

tf::Executor::run

tf::Future< void > run(Taskflow &taskflow)

runs a taskflow once

tf::FlowBuilder::emplace

Task emplace(C &&callable)

creates a static task

Definition flow_builder.hpp:1571

tf::Task

class to create a task handle over a taskflow node

Definition task.hpp:569

tf::Task::succeed

Task & succeed(Ts &&... tasks)

adds precedence links from other tasks to this

Definition task.hpp:1266

tf::Task::precede

Task & precede(Ts &&... tasks)

adds precedence links from this to other tasks

Definition task.hpp:1258

tf::Taskflow

class to create a taskflow object

Definition taskflow.hpp:64

tf::cudaGraphBase::copy

cudaTask copy(T *tgt, const T *src, size_t num)

creates a memcopy task that copies typed data

Definition cuda_graph.hpp:1075

tf::cudaGraphBase::kernel

cudaTask kernel(dim3 g, dim3 b, size_t s, F f, ArgsT... args)

creates a kernel task

Definition cuda_graph.hpp:1010

tf::cudaStreamBase::synchronize

cudaStreamBase & synchronize()

synchronizes the associated stream

Definition cuda_stream.hpp:232

tf::cudaStreamBase::run

cudaStreamBase & run(const cudaGraphExecBase< C, D > &exec)

runs the given executable CUDA graph

tf::cudaTask::succeed

cudaTask & succeed(Ts &&... tasks)

adds precedence links from other tasks to this

Definition cuda_graph.hpp:418

tf::cudaTask::precede

cudaTask & precede(Ts &&... tasks)

adds precedence links from this to other tasks

Definition cuda_graph.hpp:407

tf::cudaGraphExec

cudaGraphExecBase< cudaGraphExecCreator, cudaGraphExecDeleter > cudaGraphExec

default smart pointer type to manage a cudaGraphExec_t object with unique ownership

Definition cudaflow.hpp:23

tf::cudaGraph

cudaGraphBase< cudaGraphCreator, cudaGraphDeleter > cudaGraph

default smart pointer type to manage a cudaGraph_t object with unique ownership

Definition cudaflow.hpp:18

tf::cudaStream

cudaStreamBase< cudaStreamCreator, cudaStreamDeleter > cudaStream

default smart pointer type to manage a cudaStream_t object with unique ownership

Definition cuda_stream.hpp:340

The outer Taskflow manages CPU-side orchestration: the three allocation tasks run in parallel, then the CUDA graph task runs, and finally GPU memory is freed. Inside the CUDA graph, two H2D copy tasks feed the kernel and the kernel feeds the D2H copy task. The CPU taskflow graph is shown below:

Embedded content

After execution, the full task graph including the CUDA sub-graph can be visualised:

Embedded content

Benchmarking

We compare three versions — sequential CPU, parallel CPU, and one GPU — on a 12-core Intel i7-8700 at 3.20 GHz and a Nvidia RTX 2080:

Matrix size	CPU sequential	CPU parallel	GPU
10×10	0.142 ms	0.414 ms	82 ms
100×100	1.641 ms	0.733 ms	83 ms
1000×1000	1532 ms	504 ms	85 ms
2000×2000	25688 ms	4387 ms	133 ms
3000×3000	104838 ms	16170 ms	214 ms
4000×4000	250133 ms	39646 ms	427 ms

For small matrices the GPU's data transfer overhead dominates and CPU solutions are faster. As problem size grows, the GPU's thread-level parallelism dominates completely. At 4000×4000, the GPU is 585× faster than the sequential CPU and 92× faster than the parallel CPU solution.

Learning from Examples
Maintained by Dr. Tsung-Wei Huang — Generated by 1.13.1