Standard triply nested for-loop on the CPU.
- On Dell XPS 13: FLOPs: 2147483648; Execution time: 5.25 seconds; GFLOPS: 0.4090;
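A minimal sketch of this baseline, assuming square row-major `float` matrices (the reported FLOP count, 2147483648 = 2·1024³, suggests N = 1024; the function name is illustrative):

```cpp
// Naive O(N^3) multiply: C = A * B for square N x N row-major matrices.
void matmul_naive(const float* A, const float* B, float* C, int N) {
    for (int i = 0; i < N; ++i) {
        for (int j = 0; j < N; ++j) {
            float acc = 0.0f;
            for (int k = 0; k < N; ++k)
                acc += A[i * N + k] * B[k * N + j];  // B is walked column-wise: poor locality
            C[i * N + j] = acc;
        }
    }
}
```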
Using the CPU with blocking for better temporal and spatial locality; the tiles are sized to fit in the L1 cache. More details here.
- On Dell XPS 13: FLOPs: 2147483648; Execution time: 3.35 seconds; GFLOPS: 0.6402;
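A sketch of the blocking idea, assuming row-major matrices, a zero-initialized `C`, `N` divisible by the block size, and a block size `BS` chosen so three `BS`×`BS` tiles fit in the L1 cache:

```cpp
// Blocked multiply: iterate over BS x BS tiles so each tile's working set
// stays resident in L1 while it is reused. C must be zeroed by the caller.
void matmul_blocked(const float* A, const float* B, float* C, int N, int BS) {
    for (int ii = 0; ii < N; ii += BS)
        for (int kk = 0; kk < N; kk += BS)
            for (int jj = 0; jj < N; jj += BS)
                for (int i = ii; i < ii + BS; ++i)
                    for (int k = kk; k < kk + BS; ++k) {
                        float a = A[i * N + k];  // reused across the whole j loop
                        for (int j = jj; j < jj + BS; ++j)
                            C[i * N + j] += a * B[k * N + j];
                    }
}
```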
Using the GPU with a basic OpenCL kernel.
- Platform: NVIDIA CUDA / Device: NVIDIA TITAN Xp: FLOPs: 2147483648; Execution time: 0.04 seconds; GFLOPS: 51.8588;
- Platform: Intel(R) OpenCL / Device: Intel(R) Core(TM) i7-10710U CPU @ 1.10GHz: FLOPs: 2147483648; Execution time: 0.10 seconds; GFLOPS: 20.7495;
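A plausible shape for such a kernel, with one work-item per element of `C` and every operand read straight from global memory (a sketch; the repo's actual kernel may differ):

```cpp
// OpenCL C kernel source, embedded as a C++ raw string literal.
const char* kMatmulSource = R"CLC(
__kernel void matmul(__global const float* A,
                     __global const float* B,
                     __global float* C,
                     const int N) {
    int row = get_global_id(0);
    int col = get_global_id(1);
    float acc = 0.0f;
    for (int k = 0; k < N; ++k)
        acc += A[row * N + k] * B[k * N + col];  // 2N global reads per element
    C[row * N + col] = acc;
}
)CLC";
```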
Using cuBLAS.
- Platform: Tesla K80: FLOPs: 2147483648; Execution time: 0.01 seconds; GFLOPS: 304.7607;
- Platform: RTX 3090: FLOPs: 2147483648; Execution time: 0.00 seconds; GFLOPS: 773.4415;
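A minimal sketch of the cuBLAS call, assuming `dA`, `dB`, `dC` are device buffers holding N×N row-major single-precision matrices (buffer names are illustrative). cuBLAS uses column-major storage, so the standard trick is to swap the operands:

```cuda
#include <cublas_v2.h>

// Computes row-major C = A * B by asking cuBLAS for column-major B * A:
// a row-major matrix reinterpreted as column-major is its transpose, and
// (A*B)^T = B^T * A^T.
void matmul_cublas(const float* dA, const float* dB, float* dC, int N) {
    cublasHandle_t handle;
    cublasCreate(&handle);
    const float alpha = 1.0f, beta = 0.0f;
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                N, N, N, &alpha, dB, N, dA, N, &beta, dC, N);
    cublasDestroy(handle);
}
```

In practice the handle would be created once and reused; creating it per call as above is only for brevity.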
Load blocks into GPU shared memory to reduce global memory accesses. Explained in detail here.
- Platform: RTX 3090: FLOPs: 2147483648; Execution time: 0.00 seconds; GFLOPS: 951.0809;
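The tiling idea, sketched here as a CUDA kernel for concreteness (OpenCL's `__local` memory works the same way); assumes N is a multiple of TILE:

```cuda
#define TILE 16

// Each thread block stages one TILE x TILE tile of A and B in shared memory,
// so every global element is read N/TILE times instead of N times.
__global__ void matmul_tiled(const float* A, const float* B, float* C, int N) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];
    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;
    for (int t = 0; t < N / TILE; ++t) {
        As[threadIdx.y][threadIdx.x] = A[row * N + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
        __syncthreads();                 // wait until the tile is fully loaded
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();                 // finish reading before the next load
    }
    C[row * N + col] = acc;
}
```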
Similar to BlockMatrixMultiplier, but loads matrix B into memory transposed and uses SIMD instructions to perform the block dot products.
- TODO
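One way this could look, sketched with AVX2/FMA intrinsics and with the outer blocking omitted for brevity (assumes N divisible by 8, `Bt` holding B transposed, and compilation with `-mavx2 -mfma`):

```cpp
#include <immintrin.h>

// With B pre-transposed (Bt[j*N + k] == B[k*N + j]), each C[i][j] is a dot
// product of two contiguous rows, which maps directly onto SIMD loads.
void matmul_bt_simd(const float* A, const float* Bt, float* C, int N) {
    for (int i = 0; i < N; ++i) {
        for (int j = 0; j < N; ++j) {
            __m256 acc = _mm256_setzero_ps();
            for (int k = 0; k < N; k += 8) {
                __m256 a = _mm256_loadu_ps(&A[i * N + k]);
                __m256 b = _mm256_loadu_ps(&Bt[j * N + k]);
                acc = _mm256_fmadd_ps(a, b, acc);  // acc += a * b, elementwise
            }
            float lanes[8];
            _mm256_storeu_ps(lanes, acc);          // horizontal sum of 8 lanes
            C[i * N + j] = lanes[0] + lanes[1] + lanes[2] + lanes[3]
                         + lanes[4] + lanes[5] + lanes[6] + lanes[7];
        }
    }
}
```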
Multiply the matrices in Python using NumPy, for comparison.
- TODO
Multiply with a GPU shader.
- TODO