docs/matrix_multiplication.html
| | Taskflow: A General-purpose Task-parallel Programming System |
Loading...
Searching...
No Matches
Matrix Multiplication
We study parallel 2D matrix multiplication on a CPU, showing how Taskflow's parallel-for algorithm turns a sequential triple-nested loop into a scalable parallel program with minimal code change.
We multiply matrix A (M×K) by matrix B (K×N) to produce output matrix C (M×N). The number of columns in A must equal the number of rows in B. Each element C[m][n] is the dot product of row m of A and column n of B:
The sequential triple-nested loop is:
for(int m = 0; m < M; m++) {
for(int n = 0; n < N; n++) {
C[m][n] = 0;
for(int k = 0; k < K; k++) {
C[m][n] += A[m][k] * B[k][n];
}
}
}
The rows of C are completely independent of one another — computing row m does not affect any other row. We exploit this independence by assigning one task per row. This coarse-grained decomposition gives each task enough work to amortise the overhead of task creation and scheduling:
void matrix_multiplication(int** A, int** B, int** C, int M, int K, int N) {
tf::Taskflow taskflow;
tf::Executor executor;
// one task per row of C
for(int m = 0; m < M; m++) {
taskflow.emplace(m, & {
for(int n = 0; n < N; n++) {
for(int k = 0; k < K; k++) {
C[m][n] += A[m][k] * B[k][n];
}
}
});
}
executor.run(taskflow).wait();
}
class to create an executor
Definition executor.hpp:62
tf::Future< void > run(Taskflow &taskflow)
runs a taskflow once
Task emplace(C &&callable)
creates a static task
Definition flow_builder.hpp:1571
class to create a taskflow object
Definition taskflow.hpp:64
The same parallelism can be expressed more concisely using tf::Taskflow::for_each_index, which handles partitioning and scheduling automatically:
taskflow.for_each_index(0, M, 1, [&](int m) {
for(int n = 0; n < N; n++) {
for(int k = 0; k < K; k++) {
C[m][n] += A[m][k] * B[k][n];
}
}
});
tf::FlowBuilder::for_each_index
Task for_each_index(B first, E last, S step, C callable, P part=P())
constructs an index-based parallel-for task
See Parallel Iterations for partitioning strategies and chunk-size tuning that can further improve performance.
The table below compares sequential and parallel CPU runtimes on a 12-core Intel i7-8700 at 3.2 GHz:
| Matrix size | CPU sequential | CPU parallel |
|---|---|---|
| 10×10 | 0.142 ms | 0.414 ms |
| 100×100 | 1.641 ms | 0.733 ms |
| 1000×1000 | 1532 ms | 504 ms |
| 2000×2000 | 25688 ms | 4387 ms |
| 3000×3000 | 104838 ms | 16170 ms |
| 4000×4000 | 250133 ms | 39646 ms |
For small matrices the task-creation overhead dominates and the sequential version is faster. As the matrix size grows, task overhead becomes negligible relative to the computation, and the parallel speed-up grows toward the number of available cores — reaching 6.3× at 4000×4000.
NoteFor GPU-accelerated matrix multiplication that achieves over 90× speed-up at large sizes, see Matrix Multiplication with CUDA GPU.