Matrix multiplication in parallel streams

Consider an algorithm for multiplying rectangular matrices using Java Streams. Let’s take the optimized version of the algorithm using three nested loops and replace the outer loop with a stream. Let’s compare the operating time of two algorithms.

We process the rows of the resulting matrix in parallel mode, and populate each row layerwise. Due to the parallel use of cache of the execution environment on multi-core machines, the computation time can be reduced by more than two times. To check, let’s take two rectangular matrices {L×M} and {M×N}.

Further optimization: Winograd — Strassen algorithm.

Parallel stream #

We bypass the rows of the first matrix L in parallel mode. In each thread there are two nested loops: across the common side of two matrices M and across the columns of the second matrix N. Processing of the rows of the resulting matrix occurs independently of each other in parallel streams.

/**
 * @param l rows of matrix 'a'
 * @param m columns of matrix 'a'
 *          and rows of matrix 'b'
 * @param n columns of matrix 'b'
 * @param a first matrix 'l×m'
 * @param b second matrix 'm×n'
 * @return resulting matrix 'l×n'
 */
public static int[][] parallelMatrixMultiplication(int l, int m, int n, int[][] a, int[][] b) {
    // resulting matrix
    int[][] c = new int[l][n];
    // bypass the indexes of the rows of matrix 'a' in parallel mode
    IntStream.range(0, l).parallel().forEach(i -> {
        // bypass the indexes of the common side of two matrices:
        // the columns of matrix 'a' and the rows of matrix 'b'
        for (int k = 0; k < m; k++)
            // bypass the indexes of the columns of matrix 'b'
            for (int j = 0; j < n; j++)
                // the sum of the products of the elements of the i-th
                // row of matrix 'a' and the j-th column of matrix 'b'
                c[i][j] += a[i][k] * b[k][j];
    });
    return c;
}

Three nested loops #

We take the optimized variant of algorithm, that uses cache of the execution environment better than others. The outer loop bypasses the rows of the first matrix L, then there is a loop across the common side of the two matrices M and it is followed by a loop across the columns of the second matrix N. Writing to the resulting matrix occurs row-wise, and each row is filled in layers.

/**
 * @param l rows of matrix 'a'
 * @param m columns of matrix 'a'
 *          and rows of matrix 'b'
 * @param n columns of matrix 'b'
 * @param a first matrix 'l×m'
 * @param b second matrix 'm×n'
 * @return resulting matrix 'l×n'
 */
public static int[][] sequentialMatrixMultiplication(int l, int m, int n, int[][] a, int[][] b) {
    // resulting matrix
    int[][] c = new int[l][n];
    // bypass the indexes of the rows of matrix 'a'
    for (int i = 0; i < l; i++)
        // bypass the indexes of the common side of two matrices:
        // the columns of matrix 'a' and the rows of matrix 'b'
        for (int k = 0; k < m; k++)
            // bypass the indexes of the columns of matrix 'b'
            for (int j = 0; j < n; j++)
                // the sum of the products of the elements of the i-th
                // row of matrix 'a' and the j-th column of matrix 'b'
                c[i][j] += a[i][k] * b[k][j];
    return c;
}

Testing #

To check, we take two matrices A=[900×1000] and B=[1000×750], filled with random numbers. First, we compare the correctness of the implementation of the two algorithms — matrix products must match. Then we execute each method 10 times and calculate the average execution time.

// start the program and output the result
public static void main(String[] args) {
    // incoming data
    int l = 900, m = 1000, n = 750, steps = 10;
    int[][] a = randomMatrix(l, m), b = randomMatrix(m, n);
    // matrix products
    int[][] c1 = parallelMatrixMultiplication(l, m, n, a, b);
    int[][] c2 = sequentialMatrixMultiplication(l, m, n, a, b);
    // check the correctness of the results
    System.out.println("The results match: " + Arrays.deepEquals(c1, c2));
    // measure the execution time of two methods
    benchmark("Stream", steps, () -> {
        int[][] c = parallelMatrixMultiplication(l, m, n, a, b);
        if (!Arrays.deepEquals(c, c1)) System.out.print("error");
    });
    benchmark("Loops", steps, () -> {
        int[][] c = sequentialMatrixMultiplication(l, m, n, a, b);
        if (!Arrays.deepEquals(c, c2)) System.out.print("error");
    });
}

Helper methods

// helper method, returns a matrix of the specified size
private static int[][] randomMatrix(int row, int col) {
    int[][] matrix = new int[row][col];
    for (int i = 0; i < row; i++)
        for (int j = 0; j < col; j++)
            matrix[i][j] = (int) (Math.random() * row * col);
    return matrix;
}

// helper method for measuring the execution time of the passed code
private static void benchmark(String title, int steps, Runnable runnable) {
    long time, avg = 0;
    System.out.print(title);
    for (int i = 0; i < steps; i++) {
        time = System.currentTimeMillis();
        runnable.run();
        time = System.currentTimeMillis() - time;
        // execution time of one step
        System.out.print(" | " + time);
        avg += time;
    }
    // average execution time
    System.out.println(" || " + (avg / steps));
}

Output depends on the execution environment, time in milliseconds:

The results match: true
Stream | 113 | 144 | 117 | 114 | 114 | 117 | 120 | 125 | 111 | 113 || 118
Loops | 1357 | 530 | 551 | 569 | 535 | 538 | 525 | 517 | 518 | 514 || 615

Comparing algorithms #

On an eight-core Linux x64 computer, we create a Windows x64 virtual machine for tests. All other things being equal, in the settings, we change the number of processors. We run the above test 100 times instead of 10 — we get a summary table of results. Time in milliseconds.

   CPU |   1 |   2 |   3 |   4 |   5 |   6 |   7 |   8 |
-------|-----|-----|-----|-----|-----|-----|-----|-----|
Stream | 522 | 276 | 179 | 154 | 128 | 112 | 110 |  93 |
Loops  | 519 | 603 | 583 | 571 | 545 | 571 | 559 | 467 |

Results: with a large number of processors in the system, the multithreaded algorithm works out noticeably faster than three nested loops. The computation time can be reduced by more than two times.

All the methods described above, including collapsed blocks, can be placed in one class.

Required imports

import java.util.Arrays;
import java.util.stream.IntStream;