MAPL IO Benchmarks
The MAPL library includes several low-level I/O benchmarks to measure read/write speeds and MPI performance. This page describes the various tests. The raw_io benchmark takes the following parameters:
- nx: number of cells along an edge of the cubed sphere
- n_levs: number of vertical levels
- n_writers: number of independent I/O streams
- n_tries: number of times to write the data to file (just for gathering better statistics)
Essentially, a distributed 1D array of size nx * nx * n_levs is allocated across n_writers processes. At each iteration, a file is opened and all (local) data are written to the file. The raw_io layer is intended to measure the I/O bandwidth of multiple streams writing to independent files.
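As a rough illustration of the data volumes involved, here is a minimal sketch (Python; the parameter values and the 4-byte element size are assumptions for illustration, not values used by the benchmark itself) of how much data each writer handles:

# Sketch: estimate raw_io data volumes.
# Assumes 4-byte single-precision values; all numbers are illustrative.
nx = 90          # cells along a cube edge (example value)
n_levs = 137     # vertical levels (example value)
n_writers = 6    # independent I/O streams (example value)
bytes_per_value = 4

total_bytes = nx * nx * n_levs * bytes_per_value
per_writer_bytes = total_bytes / n_writers

print(f"total data:      {total_bytes / 2**20:.2f} MiB")
print(f"data per writer: {per_writer_bytes / 2**20:.2f} MiB")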
MAPL has two executables that simulate writing (checkpoint_simulator.x) and reading (restart_simulator.x) GEOS cubed-sphere restarts. These simulate the MPI and I/O strategy employed for GEOS checkpoints.
Note that restart_simulator.x requires an input file, so if you want to run both it makes sense to run checkpoint_simulator.x first.
checkpoint_simulator.x can be configured to write at any cubed-sphere resolution, with any number of levels, and with any number of individual 3D fields in the file. In addition, the user can tweak other parameters of the I/O strategy, such as the number of processes that write or read the file, or even write to multiple files. Finally, you can have it perform multiple trials.
Note that the actual number of cores you must use in your mpirun command is 6*NX*NY, where NX and NY are defined in the input files. For example, with NX: 4 and NY: 4 you need 6*4*4 = 96 cores.
For checkpoint_simulator.x the input file must be named checkpoint_benchmark.rc. Note that the text after a # is a comment explaining what the line means.
NX: 4 # NX and NY are the decomposition of each face of the cubed sphere
NY: 4
IM_WORLD: 90 # the cubed-sphere resolution to write
LM: 137 # number of levels in each 3D variable
NUM_WRITERS: 1 # number of processes that will write (must be a multiple of 6)
NUM_ARRAYS: 5 # number of 3D arrays to write
NTRIALS: 2 # number of trials
# the rest of these are optional
SPLIT_FILE: .false. # whether each process writes to its own file or all write to the same file; default false
GATHER_3D: .false. # whether to gather a level at a time or full variables, default false
WRITE_BARRIER: .false. # put a barrier after the write
RANDOM_DATA: .true. # whether to put random data in the array to be written
DO_WRITES: .true. # whether to perform the writes; set to .false. to skip them and time just the MPI (writes are done by default)
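As a sanity check on these settings, here is a minimal sketch (Python, assuming the fields are single-precision 4-byte values and reporting binary megabytes; both are assumptions on my part) that reproduces the core count and the total data volume that appear in the sample output further down:

# Sketch: derive core count and data volume from the checkpoint_benchmark.rc example above.
# Assumes 4-byte (single-precision) values and binary megabytes.
NX, NY = 4, 4
IM_WORLD = 90
LM = 137
NUM_ARRAYS = 5
BYTES_PER_VALUE = 4  # assumption: single precision

cores = 6 * NX * NY  # one process per face subdomain, 6 cubed-sphere faces
total_bytes = 6 * IM_WORLD * IM_WORLD * LM * NUM_ARRAYS * BYTES_PER_VALUE

print(f"cores for mpirun:       {cores}")                    # 96
print(f"total data volume (MB): {total_bytes / 2**20:.5f}")  # ~126.99509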
For restart_simulator.x the input file must be named restart_benchmark.rc. Again, the text after a # is a comment explaining what the line means.
NX: 4 # NX and NY are the decomposition of each face of the cubed sphere
NY: 4
IM_WORLD: 90 # the cubed-sphere resolution of the input file (must match input file)
LM: 137 # the number of levels in each variable in the input file (must match input file)
NUM_READERS: 1 # the number of processes that will read (must be a multiple of 6)
NUM_ARRAYS: 5 # number of 3D variables in the input file (must match input file)
NTRIALS: 2
# The rest of these are optional
SPLIT_FILE: .false. # whether each reading process reads its piece from its own file or all read from the same file; default false
SCATTER_3D: .false. # whether to scatter each level one at a time or whole 3D variables, default false
READ_BARRIER: .false. # put a barrier after reading each variable, default false
DO_READS: .true. # whether to perform the reads; set to .false. to skip them if you just want to time the MPI (reads are done by default)
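Since IM_WORLD, LM, and NUM_ARRAYS must match the input file, it can be worth inspecting the file header first, e.g. with ncdump -h checkpoint.nc4 (assuming the single, non-split output file written by checkpoint_simulator.x, as in the run script below), before launching the read benchmark.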
For checkpoint_simulator.x, most of the output simply reports the inputs that were used and should be self-explanatory. The important timing information comes after the header line that begins with Real throughput ....
The first number is the data transfer rate including the MPI: given that your application has that total volume of data distributed across all processes, how fast can it do any MPI needed to gather the data and write it. The next number is the standard deviation of that rate if you ran more than one trial. Next is the file system throughput; this is JUST the NetCDF throughput, i.e., once the data has been gathered onto the writing processes, the total write rate across all of those processes, followed by its standard deviation. Because it excludes the MPI gather, this number will always be greater than the real throughput.
Summary of run:
Total data volume in megabytes: 126.99509
Num writers: 1
Total cores: 96
Cube size: 90 137
Split file, 3D_gather, chunk, extra, netcdf output, write barrier, do writes: FFTFTFT
Number of trial: 2
Application time: 4.1891060
Real throughput MB/s, Std Real throughput MB/s, file system MB/S, std file system MB/s
902.47152 9.4040515 2244.2126 26.151517
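As a back-of-the-envelope way to read these numbers (this is not something the simulator prints; throughput is simply volume divided by time), the per-trial timings implied by the sample output above are:

# Sketch: interpret the sample checkpoint_simulator.x output above.
volume_mb = 126.99509   # total data volume from the run summary
real_mbps = 902.47152   # throughput including the MPI gather
fs_mbps   = 2244.2126   # NetCDF-only (file system) throughput

print(f"time per trial incl. MPI gather:   {volume_mb / real_mbps:.3f} s")  # ~0.141 s
print(f"time per trial, NetCDF write only: {volume_mb / fs_mbps:.3f} s")    # ~0.057 s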
For restart_simulator.x the description above applies with write replaced by read and gather replaced by scatter. So the real throughput is the rate at which the data gets off disk and onto the processes it ultimately needs to be on, including the MPI scatter, whereas the file system throughput is simply the rate of the NetCDF reads themselves.
Summary of run:
Total data volume in megabytes: 126.99509
Num readers: 1
Total cores: 96
Cube size/LM: 90 137
Split file, 3D_scatter, extra, netcdf output, write barrier, do writes: FFFTFT
Number of trial: 2
Application time: 3.6086680
Real throughput MB/s, Std Real throughput MB/s, file system MB/S, std file system MB/s
725.60777 94.535516 1864.8030 85.786595
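Below is an example SLURM batch script that runs both simulators back to back. The GEOSBIN path, the module environment, and the SLURM settings (constraint, partition, QOS, account) are specific to one site and will need to be adapted to yours.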
#!/bin/csh -f
#######################################################################
# Batch Parameters for Run Job
#######################################################################
#SBATCH --time=5:00
#SBATCH --nodes=1
#SBATCH --job-name=foos_RUN
#SBATCH --constraint=mil
#SBATCH --partition=preops
#SBATCH --qos=benchmark
#SBATCH --account=g0620
#SBATCH --mail-type=ALL
umask 022
limit stacksize unlimited
setenv ARCH `uname`
setenv GEOSBIN /discover/swdev/bmauer/models/mapl_dev/MAPL/install-release/bin
source $GEOSBIN/g5_modules
rm -f checkpoint.nc4   # remove any checkpoint file left over from a previous run
mpirun -np 96 $GEOSBIN/checkpoint_simulator.x
mpirun -np 96 $GEOSBIN/restart_simulator.x
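The script should be submitted with sbatch from a directory containing the checkpoint_benchmark.rc and restart_benchmark.rc files described above; with NX: 4 and NY: 4 in both, the 96 ranks requested here (6*4*4) are consistent with the decomposition.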