Fast Matrix Multiplication - Part 1
July 2021
Introduction
Inspired by a few blog posts about micro-optimizations on HN lately, I decided I want to take a stab at it myself. I picked matrix multiplication also because of a comment on HN mentioning you could spend a few months on it. This project will have absolutely no use in the real world as matrix multiplication is a solved problem, but I have too much time on my hand.
The code is in Rust because it’s gonna attract all the views and I’ll become famous. You can find the repo here.
Benchmark Setup
I use criterion for benchmarking, it’s quite simple and very pleasant to use. The benchmarks try to multiply 100x120 and 120x200 matrices as many times as possible. And my computer runs on a Ryzen 7 3700X.
Naive Implementation
We’ll start with the simplest approach: three nested for loops, to loop over all rows of the first matrix, all cols of the second matrix and calculate the dot product between those rows and columns.
struct Matrix {
data: Vec<Vec<i32>>,
}
impl Matrix {
fn naive(a: &Matrix, b: &Matrix) -> Matrix {
let mut mat = Matrix::new((a.dim.0, b.dim.1));
for i in 0..a.dim.0 {
let row = a.row(i);
for j in 0..b.dim.1 {
let sum = row.iter().zip(b.col(j).iter())
.map(|(x, y)| x * y).sum();
mat.set(i, j, sum);
}
}
mat
}
fn row(&self, idx: usize) -> &[i32] {
&self.data[idx]
}
fn col(&self, idx: usize) -> Vec<i32> {
let mut vec = Vec::with_capacity(self.dim.0);
for row in 0..self.dim.0 {
vec.push(self.get(row, idx));
}
vec
}
}
Let’s talk about the most obvious problem in the code.
Cache lines
When the CPU loads or caches data from RAM, it doesn’t do so for just the piece of data that you ask for. It instead will load a contiguous block of memory to fit its cache lines. For example, in our matrix, when we ask for self.data[i][j]
, self.data[i][j+1]..self.data[i][j+15]
are also loaded. So if we were to write a piece of code like
for k in 0..16 {
print!(self.data[i][j+k])
}
the CPU would only load the data from RAM once because the rest is cached!
Now if you look at the function col
above, that will load a few entries of each row, take the first one, and throw away the rest, to build the column. It just doesn’t take advantage of this design of the CPU at all, thus wasting work and slowing things down.
Performance
Naive Multiplication time: 2.9190 ms / iteration
It’s pretty terrible.
Cache Line Optimized Columns
With cache lines in mind, we can calculate dot products between one row and multiple columns at a time, instead of one row and one column. First we need to find out how many columns at a time.
getconf -a # Prints all system configuration variables
> LEVEL1_DCACHE_LINESIZE 64
> LEVEL2_DCACHE_LINESIZE 64
> LEVEL3_DCACHE_LINESIZE 64
My CPU cache has line size of 64 bytes (16 4-byte integers), so we expect self.data[i][j]
would load all the data until self.data[i][j + 15]
. So let’s re-write out matrix multiplication code into
const LINESIZE: usize = 64;
const INTS_PER_LINE: usize = 16; // 64 / 4
pub fn cacheline_optimized_col_mul(a: &Matrix, b: &Matrix) -> Matrix {
let mut mat = Matrix::new((a.dim.0, b.dim.1));
for i in 0..mat.dim.0 {
let row = a.row(i);
for j in (0..mat.dim.1).step_by(INTS_PER_LINE) {
// An interator that returns the ith element of 16 columns each iteration
let cols = b.cols_iter(j);
// A buffer to store the running result of
// row[i].dot(col[j])..row[i].dot(col[j+15])
let mut buffer = [0i32; INTS_PER_LINE];
row.iter().zip(cols)
.for_each(|(row_ele, cols_ele)|
Matrix::vec_scalar_mul(row_ele, cols_ele, &mut buffer));
for k in j..j + INTS_PER_LINE {
mat.set(i, k, buffer[k - j]);
}
}
}
mat
}
// Perform scalar multiplication on vec, then adds the result to output
fn vec_scalar_mul(scalar: &i32, vec: &[i32], output: &mut [i32]) {
for (i, col_ele) in vec.iter().enumerate() {
output[i] += scalar * col_ele;
}
}
Performance
Cache Line Optimized Columns Multiplication time: 920.28 us / iteration
That’s awesome! Over 3x faster than our naive implementation! But how fast can we get before deep diving into assembly and perf
?
SIMD
When calculating scalar multiplication on a vector, we are doing a lot of repetitive work. We multiply all elements of the vector with a scalar value, and add all the results into our buffer.
And that is a perfect candidate for SIMD. In short, if you’re performing the same operation to an array of values, your CPU might have specialized instructions to do it all at once instead of sequentially in a loop. It’s one of the easier optimizations as your CPU is doing all the work for you, there are no weird hacks to achieve performance gain.
So let’s re-write scalar multiplication with SIMD. You’ll notice in the code below, we’re doing calculations on blocks of 8 integers instead of 16 (our cache line), that is because my CPU only supports SIMD instructions for 256-bit vectors. It’s lengthy code, but basically instead of doing math in a loop, it’s trying to do the same thing in two blocks of 8 integers.
// Tell Rust to only compile this code iff
// - the architecture is either x86 or x86_64
// - the CPU supports avx2, a feature set in the SIMD world
#[cfg(all(any(target_arch = "x86", target_arch = "x86_64"),
target_feature = "avx2"))]
fn vec_scalar_mul(scalar: &i32, vec: &[i32], output: &mut [i32]) {
unsafe {
// Import from different modules based on architecture
#[cfg(target_arch = "x86")]
use std::arch::x86::{
_mm256_set_epi32, _mm256_load_si256, _mm256_mullo_epi32,
_mm256_add_epi32, __m256i};
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::{
_mm256_set_epi32, _mm256_load_si256, _mm256_mullo_epi32,
_mm256_add_epi32, __m256i};
// Load the scalar value into a SIMD vector of 8 integers
let scalar = _mm256_set_epi32(
scalar.clone(), scalar.clone(), scalar.clone(), scalar.clone(),
scalar.clone(), scalar.clone(), scalar.clone(), scalar.clone());
// Load first 8 ints into a SIMD vector
let simd_vec = _mm256_load_si256(vec.as_ptr() as *const __m256i);
// Multiply the scalar and the first 8 ints, element wise
let simd_vec = _mm256_mullo_epi32(scalar, simd_vec);
// Load first 8 ints of the buffer into a SIMD vector
let buffer = _mm256_load_si256(output.as_ptr() as *const __m256i);
// Add the result of previous multiplication to the buffer, element wise
let buffer = _mm256_add_epi32(simd_vec, buffer);
// Copy the result back to our output buffer
output[0..8].copy_from_slice(
&std::mem::transmute::<__m256i, [i32; 8]>(buffer));
// And similarly for the next 8 integers
let simd_vec = _mm256_load_si256(vec.as_ptr().add(8) as *const __m256i);
let simd_vec = _mm256_mullo_epi32(scalar, simd_vec);
let buffer = _mm256_load_si256(output.as_ptr().add(8) as *const __m256i);
let buffer = _mm256_add_epi32(simd_vec, buffer);
output[8..INTS_PER_LINE].copy_from_slice(
&std::mem::transmute::<__m256i, [i32; 8]>(buffer));
}
}
Performance
Cache Line Optimized with SIMD time: 159.08 us / iteration
Holy cow! That’s more than 5.5x faster than our previous implementation! It’s a crazy improvement for just using some specialized instructions instead of your regular for loop.
Closing Thoughts
I hope you enjoyed the project so far! In this post, I only talked about some rather simple improvements. But that doesn’t mean I didn’t have failed attempts to get here. In the next post, I’m planning to talk about some failed ideas that I tried. So stay tuned!
If you have any feedback for my code, my writing, or anything, please let me know via sieut@umich.edu!