A small case study comparing popular DGEMM libraries with NumS. Here, we explore different kinds of optimizations with MPI, MKL, etc. All benchmarks were run on an Intel Xeon processor with 32 physical cores; more details about the CPU are below.
- Install essential tools for compiling and debugging:
sudo snap install cmake --classic
sudo apt-get install gfortran
- Install Anaconda to isolate Python packages when benchmarking.
Skip this installation on Intel architectures: Intel has its own distribution of ScaLAPACK optimized with MKL and Intel MPI. Installing ScaLAPACK should pull in all the other packages it depends on, such as LAPACK, PBLAS, BLAS, and BLACS (the same goes for the Intel MKL installation).
- Install OpenMPI:
sudo apt-get install openmpi-bin libopenmpi-dev
sudo apt-get install libblas-dev liblapack-dev libatlas-base-dev
- Build and compile:
rm CMakeCache.txt
mkdir build
cd build
cmake ..
make
- Install MKL. A guide for how to install it on Ubuntu Linux is here. With Ubuntu 20.04, simply install it with:
sudo apt install intel-mkl
- Additionally, you have to install Intel MPI for the communication algorithms in MKL's ScaLAPACK to work properly. An installation guide can be found here.
- Set environment variables in the shell. This will also properly set up the Intel MPI environment. (These will be needed later if you want to compile C/C++ programs with the MKL library):
export MKL_LIB_DIR=/opt/intel/compilers_and_libraries/linux/mkl/lib/intel64
export MKL_INCLUDE_DIR=/opt/intel/compilers_and_libraries/linux/mkl/include
export LD_LIBRARY_PATH=$MKL_LIB_DIR:$LD_LIBRARY_PATH
export MKLROOT=/opt/intel/mkl
export CPATH=$CPATH:/opt/intel/oneapi/mkl/2021.3.0/include
source /opt/intel/bin/compilervars.sh intel64
source /opt/intel/oneapi/setvars.sh
- Setting an environment variable for the number of threads can also be helpful if MKL happens not to be running multithreaded. Make sure to set the number of threads to the number of physical cores (not hyperthreads, on a hyperthreaded system). It can also be customized to run with `mpirun` ranks. As a best practice on a high-performance single-node machine, set the number of MPI ranks to the number of sockets and the MKL/OpenMP threads to the cores per socket; a quick sanity check follows the commands below:
export MKL_NUM_THREADS=32
# Run ScaLAPACK with MPI and OpenMP/MKL configuration
export MKL_NUM_THREADS=16; mpirun -np 2 ./program
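To confirm the thread configuration actually took effect, here is a minimal sanity check. This is a sketch: `mkl_get_max_threads()` is a real MKL support call, but the file name and build line are only illustrative.

```c
/* check_threads.c -- sanity-check the MKL thread count (illustrative).
 * Build (sketch): gcc check_threads.c -lmkl_rt -o check_threads */
#include <stdio.h>
#include <mkl.h>

int main(void) {
    // Reflects MKL_NUM_THREADS, or MKL's default of one thread per physical core.
    printf("MKL max threads: %d\n", mkl_get_max_threads());
    return 0;
}
```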
- Do not install any other MPI versions or BLAS libraries, as they can fail to link correctly to Intel-specific programs. To load everything automatically at a new shell login, it is also helpful to add the environment variables and setup shell scripts to `~/.bashrc`:
echo 'export MKL_LIB_DIR=/opt/intel/compilers_and_libraries/linux/mkl/lib/intel64' >> ~/.bashrc
echo 'export MKL_INCLUDE_DIR=/opt/intel/compilers_and_libraries/linux/mkl/include' >> ~/.bashrc
echo 'export LD_LIBRARY_PATH=$MKL_LIB_DIR:$LD_LIBRARY_PATH' >> ~/.bashrc
echo 'export MKLROOT=/opt/intel/mkl' >> ~/.bashrc
echo 'export CPATH=$CPATH:/opt/intel/oneapi/mkl/2021.3.0/include' >> ~/.bashrc
echo 'source /opt/intel/bin/compilervars.sh intel64' >> ~/.bashrc
echo 'source /opt/intel/oneapi/setvars.sh' >> ~/.bashrc
- Create a new environment, and then install and link MKL for Python.
conda create --name numpy python=3.7 -y
conda install -f mkl
conda install -f numpy
conda install mkl-service
- Confirm that it has been installed:
python3
Python 3.6.9 (default, Jan 26 2021, 15:33:00)
[GCC 8.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import numpy as np
>>> np.show_config()
blas_mkl_info:
libraries = ['mkl_rt', 'pthread']
library_dirs = ['/home/brian/anaconda3/envs/nums/lib']
define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
include_dirs = ['/home/brian/anaconda3/envs/nums/include']
blas_opt_info:
libraries = ['mkl_rt', 'pthread']
library_dirs = ['/home/brian/anaconda3/envs/nums/lib']
define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
include_dirs = ['/home/brian/anaconda3/envs/nums/include']
lapack_mkl_info:
libraries = ['mkl_rt', 'pthread']
library_dirs = ['/home/brian/anaconda3/envs/nums/lib']
define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
include_dirs = ['/home/brian/anaconda3/envs/nums/include']
lapack_opt_info:
libraries = ['mkl_rt', 'pthread']
library_dirs = ['/home/brian/anaconda3/envs/nums/lib']
define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
include_dirs = ['/home/brian/anaconda3/envs/nums/include']
By default, NumPy may be linked against OpenBLAS instead, so always check before running benchmarks.
- Installation is as follows from the NumS setup guide:
pip install nums
# Run below to have NumPy use MKL.
conda install -fy mkl
conda install -fy numpy scipy
- Also create a development environment:
cd nums
conda create --name nums python=3.7 -y
conda activate nums
pip install -e ".[testing]"
- Create a new environment and install necessary packages.
conda create --name jax python=3.7 -y
conda activate jax
pip3 install jax jaxlib numpy
- Installation instructions here.
- Note that all MKL environment variables should be set correctly; it is probably safest to enter these commands again:
export MKL_LIB_DIR=/opt/intel/compilers_and_libraries/linux/mkl/lib/intel64
export MKL_INCLUDE_DIR=/opt/intel/compilers_and_libraries/linux/mkl/include
export LD_LIBRARY_PATH=$MKL_LIB_DIR:$LD_LIBRARY_PATH
export MKLROOT=/opt/intel/mkl
export CPATH=$CPATH:/opt/intel/oneapi/mkl/2021.3.0/include
source /opt/intel/bin/compilervars.sh intel64
source /opt/intel/oneapi/setvars.sh
- Install with these commands. Cloning with the recursive flag is VERY important!
git clone --recursive https://github.com/eth-cscs/COSMA cosma && cd cosma
mkdir build && cd build
cmake -DCOSMA_BLAS=MKL -DCOSMA_SCALAPACK=MKL -DCMAKE_INSTALL_PREFIX=../install/cosma ..
make -j
make install
- Installation instructions: here.
- Install with these commands:
git clone --recursive https://bitbucket.org/icl/slate
git pull
git submodule update
- Before running, make sure that the environment variables are set correctly by running these commands:
export MKL_LIB_DIR=/opt/intel/compilers_and_libraries/linux/mkl/lib/intel64
export MKL_INCLUDE_DIR=/opt/intel/compilers_and_libraries/linux/mkl/include
export LD_LIBRARY_PATH=$MKL_LIB_DIR:$LD_LIBRARY_PATH
export MKLROOT=/opt/intel/mkl
export CPATH=$CPATH:/opt/intel/oneapi/mkl/2021.3.0/include
source /opt/intel/bin/compilervars.sh intel64
source /opt/intel/oneapi/setvars.sh
- Create a `make.inc` file to properly link the installed libraries. With Intel MPI and MKL, the file should look like this:
CXX = mpicxx
FC = mpif90
blas = mkl
mkl_blacs = intelmpi
gpu_backend = none
- Compile:
make -j && sudo make install -j
Make sure optimizations are on, and import MKL before NumPy just to be safe:
import mkl
import numpy as np
If MKL isn't installed, Python will throw an `ImportError`.
- A special Makefile was made to include and link Intel MKL's header files; guides for writing one are here and here for reference. Additionally, there is a link advisor that calculates which flags you need depending on what you have installed: link advisor. Make sure the environment variables are written to `~/.bashrc` (if using bash):
export MKL_LIB_DIR=/opt/intel/compilers_and_libraries/linux/mkl/lib/intel64
export MKL_INCLUDE_DIR=/opt/intel/compilers_and_libraries/linux/mkl/include
export LD_LIBRARY_PATH=$MKL_LIB_DIR:$LD_LIBRARY_PATH
export MKLROOT=/opt/intel/mkl
- Compile and run:
make cblas
export MKL_NUM_THREADS=32 #if not already set
./cblas <dimension size>
Currently, the implementation only handles square matrices for the convenience of benchmarking.
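For reference, the core of a square-matrix benchmark like this might look like the sketch below. It is an illustrative reconstruction, not the repository's actual cblas source; `cblas_dgemm()`, `mkl_malloc()`, and `dsecnd()` are standard MKL calls, while the file name and setup are assumptions.

```c
/* cblas_sketch.c -- square-matrix DGEMM timing (illustrative, not the repo's code).
 * Build flags come from the Intel link line advisor mentioned above. */
#include <stdio.h>
#include <stdlib.h>
#include <mkl.h>

int main(int argc, char **argv) {
    int n = (argc > 1) ? atoi(argv[1]) : 1000;
    // 64-byte alignment matches the 512-bit vector width discussed below.
    double *A = (double *)mkl_malloc((size_t)n * n * sizeof(double), 64);
    double *B = (double *)mkl_malloc((size_t)n * n * sizeof(double), 64);
    double *C = (double *)mkl_malloc((size_t)n * n * sizeof(double), 64);
    for (long i = 0; i < (long)n * n; i++) { A[i] = 1.0; B[i] = 2.0; C[i] = 0.0; }

    double t0 = dsecnd();  // MKL's wall-clock timer
    // C = 1.0 * A * B + 0.0 * C, row-major n x n matrices
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, 1.0, A, n, B, n, 0.0, C, n);
    double t1 = dsecnd();

    // A square DGEMM performs 2*n^3 floating-point operations.
    printf("n=%d  time=%.3f s  GFLOP/s=%.2f\n",
           n, t1 - t0, 2.0 * n * n * n / (t1 - t0) / 1e9);
    mkl_free(A); mkl_free(B); mkl_free(C);
    return 0;
}
```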
- Additionally, `mkl_malloc()` is an optimized version of `malloc()` that supports SIMD vectorization with Intel intrinsics. Its additional parameter is the requested alignment in bytes. Since the CPU used supports 512-bit-wide vector instructions, a 64-byte alignment is used (as in the sketch above) to squeeze out a bit more data parallelism. Reference
- (Additional Information) Notes for the MKL installation, clarifying whether to use the ILP or LP interface: ILP vs LP interface layer. Most applications will use 32-bit (4-byte) integers. This means the MKL 32-bit integer interface should be selected (which gives the _lp64 suffixes seen in the examples above). Applications which require, e.g., very large array indices (greater than 2^31-1 elements) need the 64-bit integer interface instead. This gives rise to _ilp64 appended to library names, and may also require -DMKL_ILP64 at the compilation stage. Check the Intel link line advisor for specific cases.
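One quick way to confirm which integer interface a build picked up is to inspect MKL's `MKL_INT` typedef. This is just a sketch; the file name is hypothetical.

```c
/* ilp_check.c -- reports the MKL integer width.
 * Under LP64 this prints 4; under ILP64 (compiled with -DMKL_ILP64 and
 * linked against the _ilp64 libraries) it prints 8. */
#include <stdio.h>
#include <mkl.h>

int main(void) {
    printf("sizeof(MKL_INT) = %zu bytes\n", sizeof(MKL_INT));
    return 0;
}
```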
- Make sure COSMA is compiled. COSMA has a miniapp that benchmarks both COSMA and ScaLAPACK. Check the flags for usage.
mpirun -np 4 ./build/miniapp/pxgemm_miniapp -m 10000 -n 10000 -k 10000 --block_a=100,100 --block_b=100,100 --block_c=100,100 --p_grid=2,2 --transpose=NN --type=double --algorithm=cosma
make summa
mpirun -np 4 ./summa <dimensions of i> <dimensions of j> <dimensions of k>
- MPI matrix multiplication implementation: http://www.umsl.edu/~siegelj/CS4740_5740/AlgorithmsII/mpi_mm.html
- SUMMA?: https://github.com/irifed/mpi101 and https://github.com/JGU-HPC/parallelprogrammingbook/tree/master/chapter9/matrix_matrix_mult
- Cannon's algorithm: https://github.com/Schlaubischlump/cannons-algorithm.git and https://github.com/andadiana/cannon-algorithm-mpi.git
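For orientation, a row-distributed MPI matrix multiply in the spirit of the links above might look like the sketch below. It is not taken from any of those repositories, and it assumes n is divisible by the number of ranks.

```c
/* mpi_mm_sketch.c -- row-distributed matrix multiply (illustrative only).
 * Build: mpicc mpi_mm_sketch.c -o mpi_mm_sketch
 * Run:   mpirun -np 4 ./mpi_mm_sketch 512 */
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    int n = (argc > 1) ? atoi(argv[1]) : 512;   // assumes n % size == 0
    int rows = n / size;                        // block rows owned by each rank

    double *A = NULL, *C = NULL;
    double *B    = malloc((size_t)n * n * sizeof(double));
    double *Aloc = malloc((size_t)rows * n * sizeof(double));
    double *Cloc = calloc((size_t)rows * n, sizeof(double));
    if (rank == 0) {
        A = malloc((size_t)n * n * sizeof(double));
        C = malloc((size_t)n * n * sizeof(double));
        for (long i = 0; i < (long)n * n; i++) { A[i] = 1.0; B[i] = 2.0; }
    }

    // Scatter block rows of A; broadcast all of B (simple, not memory-scalable).
    MPI_Scatter(A, rows * n, MPI_DOUBLE, Aloc, rows * n, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    MPI_Bcast(B, n * n, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    // Local triple loop; a real benchmark would call an optimized DGEMM here.
    for (int i = 0; i < rows; i++)
        for (int k = 0; k < n; k++)
            for (int j = 0; j < n; j++)
                Cloc[(long)i * n + j] += Aloc[(long)i * n + k] * B[(long)k * n + j];

    // Gather the result block rows back on rank 0.
    MPI_Gather(Cloc, rows * n, MPI_DOUBLE, C, rows * n, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    if (rank == 0) printf("C[0][0] = %.1f (expect %.1f)\n", C[0], 2.0 * n);
    free(B); free(Aloc); free(Cloc);
    if (rank == 0) { free(A); free(C); }
    MPI_Finalize();
    return 0;
}
```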
These packages are only supported on shared-memory systems.
Libraries experimented with are `numpy`, `mkl_cblas`, and `jax`. MKL BLAS should have near-identical performance to NumPy, as both call the same function `cblas_dgemm()`. NumPy's performance does not seem to be hurt by the fact that it is running in Python. JAX seems to be the fastest, due to its low-level optimization and JIT compilation using XLA.
These algorithms can be run on both shared and distributed memory systems.
coming soon
The benchmarking was done on an Azure instance using the Intel architecture listed below, as reported by `lscpu`.
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 64
On-line CPU(s) list: 0-63
Thread(s) per core: 2
Core(s) per socket: 16
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 85
Model name: Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
Stepping: 7
CPU MHz: 2593.904
BogoMIPS: 5187.80
Virtualization: VT-x
Hypervisor vendor: Microsoft
Virtualization type: full
L1d cache: 32K
L1i cache: 32K
L2 cache: 1024K
L3 cache: 36608K
NUMA node0 CPU(s): 0-31
NUMA node1 CPU(s): 32-63
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology cpuid pni pclmulqdq vmx ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single tpr_shadow vnmi ept vpid ept_ad fsgsbase bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves avx512_vnni md_clear arch_capabilities
Limit benchmarks to n where each matrix is at most 8 GB (for doubles, n = 32768 gives 32768^2 * 8 bytes = 8 GiB per matrix).
Discussion about parallel linear algebra packages: https://stackoverflow.com/questions/10025866/parallel-linear-algebra-for-multicore-system