Back to Taskflow

Learning from Examples » Matrix Multiplication

docs/matrix_multiplication.html

4.0.03.4 KB
Original Source

Learning from Examples » Matrix Multiplication

We study the classic problem, 2D matrix multiplication. We will start with a short introduction about the problem and then discuss how to solve it parallel CPUs.

Problem Formulation

We are multiplying two matrices, A (MxK) and B (KxN). The numbers of columns of A must match the number of rows of B. The output matrix C has the shape of (MxN) where M is the rows of A and N the columns of B. The following example multiplies a 3x3 matrix with a 3x2 matrix to derive a 3x2 matrix.

As a general view, for each element of C we iterate a complete row of A and a complete column of B, multiplying each element and summing them.

We can implement matrix multiplication using three nested loops.

for(int m=0; m\<M; m++) {for(int n=0; n\<N; n++) {C[m][n] = 0;for(int k=0; k\<K; k++) {C[m][n] += A[m][k] \* B[k][n];}}}

Parallel Patterns

At a fine-grained level, computing each element of C is independent of each other. Similarly, computing each row of C or each column of C is also independent of one another. With task parallelism, we prefer coarse-grained model to have each task perform rather large computation to amortize the overhead of creating and scheduling tasks. In this case, we avoid intensive tasks each working on only a single element. by creating a task per row of C to multiply a row of A by every column of B.

// C = A \* B// A is a MxK matrix, B is a KxN matrix, and C is a MxN matrixvoid matrix\_multiplication(int\*\* A, int\*\* B, int\*\* C, int M, int K, int N) {tf::Taskflow taskflow;tf::Executor executor;for(int m=0; m\<M; ++m) {taskflow.emplace([m, &] () {for(int n=0; n\<N; n++) {for(int k=0; k\<K; k++) {C[m][n] += A[m][k] \* B[k][n];// inner product}}});}executor.run(taskflow).wait();}

Instead of creating tasks one-by-one over a loop, you can leverage Taskflow::for_each_index to create a parallel-for task. A parallel-for task spawns a subflow to perform parallel iterations over the given range.

// perform parallel iterations on the range [0, M) with the step size of 1tf::Task task = taskflow.for\_each\_index(0, M, 1, [&] (int m) {for(int n=0; n\<N; n++) {for(int k=0; k\<K; k++) {C[m][n] += A[m][k] \* B[k][n];}}});

Please visit Parallel Iterations for more details.

Benchmarking

Based on the discussion above, we compare the runtime of computing various matrix sizes of A, B, and C between a sequential CPU and parallel CPUs on a machine of 12 Intel i7-8700 CPUs at 3.2 GHz.

ABCCPU SequentialCPU Parallel
10x1010x1010x100.142 ms0.414 ms
100x100100x100100x1001.641 ms0.733 ms
1000x10001000x10001000x10001532 ms504 ms
2000x20002000x20002000x200025688 ms4387 ms
3000x30003000x30003000x3000104838 ms16170 ms
4000x40004000x40004000x4000250133 ms39646 ms

The speed-up of parallel execution becomes clean as we increase the problem size. For example, at 4000x4000, the parallel runtime is 6.3 times faster than the sequential runtime.