Consider an algorithm for multiplying rectangular matrices using Java Streams. Let’s take the optimized version of the algorithm using three nested loops and replace the outer loop with a stream. Let’s compare the operating time of two algorithms.
We process the rows of the resulting matrix in parallel mode, and populate each row layerwise. Due to
the parallel use of cache of the execution environment on multi-core machines, the computation time can
be reduced by more than two times. To check, let’s take two rectangular matrices {L×M
} and {M×N
}.
Further optimization: Winograd — Strassen algorithm.
We bypass the rows of the first matrix L
in parallel mode. In each thread there are two nested loops:
across the common side of two matrices M
and across the columns of the second matrix N
. Processing
of the rows of the resulting matrix occurs independently of each other in parallel streams.
/**
* @param l rows of matrix 'a'
* @param m columns of matrix 'a'
* and rows of matrix 'b'
* @param n columns of matrix 'b'
* @param a first matrix 'l×m'
* @param b second matrix 'm×n'
* @return resulting matrix 'l×n'
*/
public static int[][] parallelMatrixMultiplication(int l, int m, int n, int[][] a, int[][] b) {
// resulting matrix
int[][] c = new int[l][n];
// bypass the indexes of the rows of matrix 'a' in parallel mode
IntStream.range(0, l).parallel().forEach(i -> {
// bypass the indexes of the common side of two matrices:
// the columns of matrix 'a' and the rows of matrix 'b'
for (int k = 0; k < m; k++)
// bypass the indexes of the columns of matrix 'b'
for (int j = 0; j < n; j++)
// the sum of the products of the elements of the i-th
// row of matrix 'a' and the j-th column of matrix 'b'
c[i][j] += a[i][k] * b[k][j];
});
return c;
}
We take the optimized variant of algorithm, that uses cache of the execution environment better
than others. The outer loop bypasses the rows of the first matrix L
, then there is a loop across
the common side of the two matrices M
and it is followed by a loop across the columns of the
second matrix N
. Writing to the resulting matrix occurs row-wise, and each row is filled in layers.
/**
* @param l rows of matrix 'a'
* @param m columns of matrix 'a'
* and rows of matrix 'b'
* @param n columns of matrix 'b'
* @param a first matrix 'l×m'
* @param b second matrix 'm×n'
* @return resulting matrix 'l×n'
*/
public static int[][] sequentialMatrixMultiplication(int l, int m, int n, int[][] a, int[][] b) {
// resulting matrix
int[][] c = new int[l][n];
// bypass the indexes of the rows of matrix 'a'
for (int i = 0; i < l; i++)
// bypass the indexes of the common side of two matrices:
// the columns of matrix 'a' and the rows of matrix 'b'
for (int k = 0; k < m; k++)
// bypass the indexes of the columns of matrix 'b'
for (int j = 0; j < n; j++)
// the sum of the products of the elements of the i-th
// row of matrix 'a' and the j-th column of matrix 'b'
c[i][j] += a[i][k] * b[k][j];
return c;
}
To check, we take two matrices A=[900×1000]
and B=[1000×750]
, filled with random numbers.
First, we compare the correctness of the implementation of the two algorithms — matrix products
must match. Then we execute each method 10 times and calculate the average execution time.
// start the program and output the result
public static void main(String[] args) {
// incoming data
int l = 900, m = 1000, n = 750, steps = 10;
int[][] a = randomMatrix(l, m), b = randomMatrix(m, n);
// matrix products
int[][] c1 = parallelMatrixMultiplication(l, m, n, a, b);
int[][] c2 = sequentialMatrixMultiplication(l, m, n, a, b);
// check the correctness of the results
System.out.println("The results match: " + Arrays.deepEquals(c1, c2));
// measure the execution time of two methods
benchmark("Stream", steps, () -> {
int[][] c = parallelMatrixMultiplication(l, m, n, a, b);
if (!Arrays.deepEquals(c, c1)) System.out.print("error");
});
benchmark("Loops", steps, () -> {
int[][] c = sequentialMatrixMultiplication(l, m, n, a, b);
if (!Arrays.deepEquals(c, c2)) System.out.print("error");
});
}
// helper method, returns a matrix of the specified size
private static int[][] randomMatrix(int row, int col) {
int[][] matrix = new int[row][col];
for (int i = 0; i < row; i++)
for (int j = 0; j < col; j++)
matrix[i][j] = (int) (Math.random() * row * col);
return matrix;
}
// helper method for measuring the execution time of the passed code
private static void benchmark(String title, int steps, Runnable runnable) {
long time, avg = 0;
System.out.print(title);
for (int i = 0; i < steps; i++) {
time = System.currentTimeMillis();
runnable.run();
time = System.currentTimeMillis() - time;
// execution time of one step
System.out.print(" | " + time);
avg += time;
}
// average execution time
System.out.println(" || " + (avg / steps));
}
Output depends on the execution environment, time in milliseconds:
The results match: true
Stream | 113 | 144 | 117 | 114 | 114 | 117 | 120 | 125 | 111 | 113 || 118
Loops | 1357 | 530 | 551 | 569 | 535 | 538 | 525 | 517 | 518 | 514 || 615
On an eight-core Linux x64 computer, we create a Windows x64 virtual machine for tests. All other things being equal, in the settings, we change the number of processors. We run the above test 100 times instead of 10 — we get a summary table of results. Time in milliseconds.
CPU | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
-------|-----|-----|-----|-----|-----|-----|-----|-----|
Stream | 522 | 276 | 179 | 154 | 128 | 112 | 110 | 93 |
Loops | 519 | 603 | 583 | 571 | 545 | 571 | 559 | 467 |
Results: with a large number of processors in the system, the multithreaded algorithm works out noticeably faster than three nested loops. The computation time can be reduced by more than two times.
All the methods described above, including collapsed blocks, can be placed in one class.
import java.util.Arrays;
import java.util.stream.IntStream;
© Golovin G.G., Code with comments, translation from Russian, 2022