Blue Gene Performance
This page collects the performance figures we are currently seeing on the BG/Q. We use:
- a volume of 64^4
- bg_size = 32
Hybrid version:
- using 4 threads per MPI process
- NrXProcs=8, NrYProcs=8, NrZProcs=2
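As a rough sanity check of this layout, the sketch below works out the process count in the time direction and the local lattice per MPI process. The figure of 16 ranks per node is an assumption that is not stated here (it is what fills the 64 hardware threads of a BG/Q node at 4 threads per process), as is the idea that the T-direction process count is simply whatever remains after X, Y and Z are fixed.

```python
# Lattice and partition as given in the text.
L = 64                       # lattice extent in each of the 4 directions
bg_size = 32                 # number of BG/Q nodes
nr_x, nr_y, nr_z = 8, 8, 2   # NrXProcs, NrYProcs, NrZProcs

# ASSUMPTION (not stated in the text): 16 MPI ranks per node, so that
# 16 ranks x 4 threads use all 64 hardware threads of a node.
ranks_per_node = 16

total_ranks = bg_size * ranks_per_node

# ASSUMPTION: the T-direction process count is the remaining factor.
nr_t, remainder = divmod(total_ranks, nr_x * nr_y * nr_z)
assert remainder == 0, "process grid does not divide the rank count"

# Local lattice extents per MPI process, ordered (T, X, Y, Z).
local_dims = (L // nr_t, L // nr_x, L // nr_y, L // nr_z)
local_volume = 1
for extent in local_dims:
    local_volume *= extent

print("total ranks:  ", total_ranks)
print("process grid: ", (nr_t, nr_x, nr_y, nr_z))
print("local lattice:", local_dims, "=", local_volume, "sites")
```

If the actual run used a different number of ranks per node, only `ranks_per_node` needs changing; the rest follows from the values quoted above.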
Numbers are quoted per node:
- MPI: 8704 Mflops
- MPI + intrinsics: 12160 Mflops
- MPI + OMP: 10048 Mflops
- MPI + OMP + XLC prefetch: 12160 Mflops
- MPI + OMP + intrinsics: 14528 Mflops
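To make the relative gains easier to read, the per-node numbers above can be expressed as a ratio to the plain MPI figure. This is only a small illustrative sketch; the dictionary and variable names are made up for the example, and the Mflops values are copied from the list.

```python
# Per-node performance in Mflops, copied from the list above.
mflops_per_node = {
    "MPI": 8704,
    "MPI + intrinsics": 12160,
    "MPI + OMP": 10048,
    "MPI + OMP + XLC prefetch": 12160,
    "MPI + OMP + intrinsics": 14528,
}

# Use the plain MPI build as the baseline for the comparison.
baseline = mflops_per_node["MPI"]
speedup = {name: value / baseline for name, value in mflops_per_node.items()}

for name, factor in speedup.items():
    print(f"{name:26s} {mflops_per_node[name]:6d} Mflops  x{factor:.2f}")
```

On these numbers, intrinsics give roughly a factor 1.4 over plain MPI, and the hybrid build with intrinsics roughly 1.67.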
Note that the mapping of the machine was not yet optimised.
I compiled with:
--with-alignment=32 --without-bgldram --with-limedir=/work/pra067/pra06700/juqueen/programs/lime_c --enable-mpi --enable-qpx --with-mpidimension=4 --enable-omp --enable-gaugecopy --disable-halfspinor --enable-largefile --with-lapack="-L/bgsys/local/lib/ -L/usr/local/bg_soft/lapack/3.3.0/lib -lesslbg -llapack -lesslbg -lxlf90_r -L/opt/ibmcmp/xlf/bg/14.1/lib64 -lxl -lxlopt -lxlf90_r -lxlfmath -L/opt/ibmcmp/xlsmp/bg/3.1/bglib64 -lxlsmp -lpthread" CC=/bgsys/drivers/ppcfloor/comm/xl/bin/mpixlc_r CFLAGS="-I/bgsys/drivers/ppcfloor/arch/include/ -I/bgsys/drivers/ppcfloor/comm/xl/include -O5 -qprefetch=aggressive -qarch=qp -qtune=qp -qmaxmem=-1 -qsimd=noauto -qsmp=noauto -qstrict=all -DBGQ" F77=bgf77 LDFLAGS="-L/opt/ibmcmp/xlf/bg/14.1/lib64 -L/usr/local/bg_soft/lapack/3.3.0 -lxl -lxlopt -lxlf90_r -L/bgsys/drivers/ppcfloor/bgpm/lib/ -lxlfmath -L/opt/ibmcmp/xlsmp/bg/3.1/bglib64 -lxlsmp -lpthread -L/bgsys/ibm_essl/prod/opt/ibmmath/lib64" FC=bgxlf_r
Actually, I see that the compiler uses -O3.
I obtain 770 Mflops per rank, i.e. 12320 Mflops per node.
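The conversion from the per-rank to the per-node figure implies a fixed number of ranks per node. The short sketch below recovers it from the two quoted values; the variable names are illustrative only.

```python
# Figures quoted in the text.
mflops_per_rank = 770
mflops_per_node = 12320

# The number of ranks per node implied by the two figures.
ranks_per_node, remainder = divmod(mflops_per_node, mflops_per_rank)

print("implied ranks per node:", ranks_per_node)
print("remainder:", remainder)
```

The division is exact, so the per-node number corresponds to 16 ranks per node.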
I compiled using:
module load bgq-xl essl lapack blas
../configure --with-alignment=32 --without-bgldram --with-limedir=/gpfs/scratch/userexternal/fsanfili/programs/lime --enable-mpi --enable-qpx --with-mpidimension=4 --enable-omp --enable-gaugecopy --disable-halfspinor --enable-largefile --with-lapack="-lesslbg -llapack -lxl -lblas -L$BLAS_LIB -lxlf90_r -L/opt/ibmcmp/xlf/bg/14.1/lib64/ -lrt -L$ESSL_LIB -L$LAPACK_LIB -lxlopt -lxlfmath -lxlsmp -lpthread -lxlf90_r" CC=mpixlc_r CFLAGS="-O5 -qarch=qp -qtune=qp -qmaxmem=-1 -qsimd=noauto -qsmp=noauto -qstrict=all -DBGQ" F77=bgf77 LDFLAGS="-lxl -lxlopt -lxlsmp -lpthread" FC=bgxlf_r
Performance is the same as on JUQUEEN: 771 Mflops per rank.