Blue Gene Performance

sunpho84 edited this page Aug 19, 2012 · 6 revisions

This page collects the performance figures we are currently seeing on the BG/Q. We use:

  • a volume of 64^4
  • bg_size = 32

Hybrid version:

  • using 4 threads per MPI process
  • NrXProcs=8, NrYProcs=8, NrZProcs=2

Numbers are quoted per node:

  1. MPI: 8704 Mflops
  2. MPI + intrinsics: 12160 Mflops
  3. MPI + OMP: 10048 Mflops
  4. MPI + OMP + XLC prefetch: 12160 Mflops
  5. MPI + OMP + intrinsics: 14528 Mflops
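To make the comparison easier to read, here is a small sketch that turns the per-node numbers above into speedups relative to the plain MPI build. The values are copied from the list; the variable names are only for illustration.

```python
# Per-node performance in Mflops, copied from the list above.
mflops_per_node = {
    "MPI": 8704,
    "MPI + intrinsics": 12160,
    "MPI + OMP": 10048,
    "MPI + OMP + XLC prefetch": 12160,
    "MPI + OMP + intrinsics": 14528,
}

# Use the plain MPI build as the reference.
baseline = mflops_per_node["MPI"]
speedup = {name: value / baseline for name, value in mflops_per_node.items()}

for name, factor in speedup.items():
    print(f"{name:26s} {mflops_per_node[name]:6d} Mflops  {factor:.2f}x")
```

The fastest variant, MPI + OMP + intrinsics, comes out at roughly 1.67 times the plain MPI figure.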

Note that the mapping of the machine was not yet optimised.

Francesco

On JUQUEEN

I compiled with

--with-alignment=32 --without-bgldram --with-limedir=/work/pra067/pra06700/juqueen/programs/lime_c --enable-mpi --enable-qpx --with-mpidimension=4 --enable-omp --enable-gaugecopy --disable-halfspinor --enable-largefile --with-lapack="-L/bgsys/local/lib/ -L/usr/local/bg_soft/lapack/3.3.0/lib -lesslbg -llapack -lesslbg -lxlf90_r -L/opt/ibmcmp/xlf/bg/14.1/lib64 -lxl -lxlopt -lxlf90_r -lxlfmath -L/opt/ibmcmp/xlsmp/bg/3.1/bglib64 -lxlsmp -lpthread" CC=/bgsys/drivers/ppcfloor/comm/xl/bin/mpixlc_r CFLAGS="-I/bgsys/drivers/ppcfloor/arch/include/ -I/bgsys/drivers/ppcfloor/comm/xl/include -O5 -qprefetch=aggressive -qarch=qp -qtune=qp -qmaxmem=-1 -qsimd=noauto -qsmp=noauto -qstrict=all -DBGQ" F77=bgf77 LDFLAGS="-L/opt/ibmcmp/xlf/bg/14.1/lib64 -L/usr/local/bg_soft/lapack/3.3.0 -lxl -lxlopt -lxlf90_r -L/bgsys/drivers/ppcfloor/bgpm/lib/ -lxlfmath -L/opt/ibmcmp/xlsmp/bg/3.1/bglib64 -lxlsmp -lpthread -L/bgsys/ibm_essl/prod/opt/ibmmath/lib64" FC=bgxlf_r

Actually, I see that the compiler uses -O3 rather than the requested -O5.

I obtain 770 Mflops per rank, i.e. 12320 Mflops per node.
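The per-rank and per-node figures together imply the number of MPI ranks per node. The ranks-per-node count is not stated explicitly, so the sketch below infers it from the two quoted numbers rather than assuming it.

```python
# Figures quoted above for JUQUEEN.
mflops_per_rank = 770
mflops_per_node = 12320

# Infer how many ranks share a node; a zero remainder means the
# two quoted figures are exactly consistent with each other.
ranks_per_node, remainder = divmod(mflops_per_node, mflops_per_rank)

print(f"ranks per node: {ranks_per_node} (remainder {remainder})")
```

The division is exact and gives 16 ranks per node.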

On FERMI

I compiled using:

module load bgq-xl essl lapack blas

../configure --with-alignment=32 --without-bgldram --with-limedir=/gpfs/scratch/userexternal/fsanfili/programs/lime --enable-mpi --enable-qpx --with-mpidimension=4 --enable-omp --enable-gaugecopy --disable-halfspinor --enable-largefile --with-lapack="-lesslbg -llapack -lxl -lblas -L$BLAS_LIB -lxlf90_r -L/opt/ibmcmp/xlf/bg/14.1/lib64/ -lrt -L$ESSL_LIB -L$LAPACK_LIB -lxlopt -lxlfmath -lxlsmp -lpthread -lxlf90_r" CC=mpixlc_r CFLAGS="-O5 -qarch=qp -qtune=qp -qmaxmem=-1 -qsimd=noauto -qsmp=noauto -qstrict=all -DBGQ" F77=bgf77 LDFLAGS="-lxl -lxlopt -lxlsmp -lpthread" FC=bgxlf_r

Performance is the same as on JUQUEEN: 771 Mflops per rank.