Status: This experiment has been tested once on GKE, and the configurations here are updated accordingly.
For each experiment (crd in ./crd):
1. Create the MiniCluster and shell in.
2. Connect to the Flux broker, loading the spack environment if needed.
3. Create an output directory for logs.
4. For each experiment size to run (with custom parameters), and for iterations 1..N (likely 1 for now), run the experiment and save the output to a log.
5. Compress the results with oras.
6. Push to an OCI registry for the results (the whole loop is sketched below).
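In shell terms, that loop looks roughly like the following sketch (the app name, sizes, and command are placeholders; each section below defines the real ones):

app=example-app      # placeholder
output=./results/$app
mkdir -p $output
for i in $(seq 1 1); do
    flux submit --setattr=user.study_id=$app-iter-$i -N 4 -n 224 <application command>
done
# ...collect each job's output, jobspec, and eventlog (see the loops below), then push:
oras push ghcr.io/converged-computing/metrics-operator-experiments/performance:gke-cpu-$app $output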
We will want to either run this on a GKE instance (that we all have access to) OR create the cluster and share the kubeconfig with multiple people, in case someone's computer crashes. We also need a means to programmatically monitor the container creation times, etc.
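As a starting point for that monitoring, something like the following works (a sketch; creationTimestamp and the Ready condition are standard pod fields, and the job-name label matches the MiniCluster pods used later in this document):

# Print name, creation time, and ready time for each MiniCluster pod
kubectl get pods -l job-name=flux-sample -o json | jq -r \
  '.items[] | [.metadata.name, .metadata.creationTimestamp, (.status.conditions[]? | select(.type=="Ready") | .lastTransitionTime)] | @tsv'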
Bring up the cluster (with some number of nodes) and install the drivers. Have your GitHub packages (or other registry) credential / token ready. This does not work.
GOOGLE_PROJECT=myproject
NODES=4
gcloud compute networks create mtu9k --mtu=8896
gcloud compute firewall-rules create mtu9k-firewall --network mtu9k --allow tcp,udp,icmp --source-ranges 0.0.0.0/0
time gcloud container clusters create test-cluster \
--threads-per-core=1 \
--num-nodes=$NODES \
--machine-type=c2d-standard-112 \
--network-performance-configs=total-egress-bandwidth-tier=TIER_1 \
--enable-gvnic \
--network=mtu9k \
--placement-type=COMPACT \
--system-config-from-file=./system-config.yaml \
--region=us-central1-a \
--project=${GOOGLE_PROJECT}
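Then pull credentials for kubectl and sanity-check that all nodes registered (the count should match $NODES):

gcloud container clusters get-credentials test-cluster --region=us-central1-a --project=${GOOGLE_PROJECT}
kubectl get nodes --no-headers | grep -c " Ready"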
Install the Flux Operator (container digest pinned on August 2, 2024)
kubectl apply -f ./flux-operator.yaml
Now we are ready for the different MiniCluster setups. For each of the below, to shell into the lead broker (index 0):
kubectl exec -it flux-sample-0-xxx bash
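Since the pod suffix varies, it can be easier to look the name up (a small convenience; this assumes the default flux-sample naming used throughout):

pod=$(kubectl get pods -o name | grep flux-sample-0)
kubectl exec -it ${pod} -- bash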
Note that we are still getting unique nodes without specifying resources!
kubectl get pods -o json | jq -r .items[].spec.nodeName | uniq | wc -l
32
Note that the configs are currently set to 8 nodes with 8 GPUs each, on 32-vCPU (16-core) instances (n1-standard-32).
Monitoring:
git clone https://github.com/resmoio/kubernetes-event-exporter
cd kubernetes-event-exporter
kubectl create namespace monitoring
# edit deploy/<config> yaml
kubectl apply -f deploy
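To confirm the exporter came up and is logging events (the namespace comes from above; the deployment name is an assumption based on the repository's deploy manifests, so adjust if yours differs):

kubectl get pods --namespace monitoring
kubectl logs --namespace monitoring deployment/event-exporter -f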
We are going to run this via flux batch, running the job across nodes, and then retrieving the logs from Flux when the jobs are complete.
IMPORTANT: change the size in the MiniCluster YAML to the correct cluster size.
kubectl apply -f ./crd/single-node.yaml
time kubectl wait --for=condition=ready pod -l job-name=flux-sample --timeout=600s
flux proxy local:///mnt/flux/view/run/flux/local bash
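Once inside the broker, it's worth confirming that all brokers connected before submitting anything (flux resource list shows the nodes the instance can see):

flux resource list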
oras login ghcr.io --username vsoch
app=single-node
output=./results/$app
# This is the number of nodes - 1 (hosts are zero indexed)
nodes=31
mkdir -p $output
for node in $(seq 0 $nodes); do
flux submit --requires="hosts:flux-sample-$node" -N 1 --setattr=user.study_id=$app-node-$node /bin/bash /entrypoint.sh
done
# When they are done:
for jobid in $(flux jobs -a --json | jq -r .jobs[].id)
do
# Get the job study id
study_id=$(flux job info $jobid jobspec | jq -r ".attributes.user.study_id")
echo "Parsing jobid ${jobid} and study id ${study_id}"
flux job attach $jobid &> $output/${study_id}-${jobid}.out
echo "START OF JOBSPEC" >> $output/${study_id}-${jobid}.out
flux job info $jobid jobspec >> $output/${study_id}-${jobid}.out
echo "START OF EVENTLOG" >> $output/${study_id}-${jobid}.out
flux job info $jobid guest.exec.eventlog >> $output/${study_id}-${jobid}.out
done
oras push ghcr.io/converged-computing/metrics-operator-experiments/performance:gke-cpu-$app $output
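Since this collect-and-save loop is identical for every app below, one convenience (optional; the script name here is hypothetical) is to save it as a small helper and call it with the output directory:

#!/bin/bash
# collect-results.sh (hypothetical helper): usage ./collect-results.sh <output-dir>
output=$1
mkdir -p $output
for jobid in $(flux jobs -a --json | jq -r .jobs[].id); do
    study_id=$(flux job info $jobid jobspec | jq -r ".attributes.user.study_id")
    echo "Parsing jobid ${jobid} and study id ${study_id}"
    flux job attach $jobid &> $output/${study_id}-${jobid}.out
    echo "START OF JOBSPEC" >> $output/${study_id}-${jobid}.out
    flux job info $jobid jobspec >> $output/${study_id}-${jobid}.out
    echo "START OF EVENTLOG" >> $output/${study_id}-${jobid}.out
    flux job info $jobid guest.exec.eventlog >> $output/${study_id}-${jobid}.out
done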
Create the MiniCluster and shell in. Note that this first pull takes the longest (about 5 minutes).
kubectl apply -f ./crd/amg2023.yaml
time kubectl wait --for=condition=ready pod -l job-name=flux-sample --timeout=600s
This one requires sourcing spack:
. /etc/profile.d/z10_spack_environment.sh
flux proxy local:///mnt/flux/view/run/flux/local bash
Test size run:
# 14.15 seconds
time flux run --env OMP_NUM_THREADS=3 -N 4 -n 224 -o cpu-affinity=per-task amg -n 128 128 64 -P 4 7 8 -problem 2
# 1m 38 seconds
time flux run --env OMP_NUM_THREADS=3 -N 4 -n 224 -o cpu-affinity=per-task amg -n 256 256 128 -P 4 7 8 -problem 2
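Note that the process grid given to -P multiplies out to the MPI task count given to -n; a quick arithmetic check of the grids used in the test runs above and the scaled runs below:

echo $((4*7*8))     # 224 tasks  (-N 4)
echo $((7*8*16))    # 896 tasks  (-N 32)
echo $((8*14*16))   # 1792 tasks (-N 64)
echo $((16*14*16))  # 3584 tasks (-N 128)
echo $((16*28*16))  # 7168 tasks (-N 256)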
oras login ghcr.io --username vsoch
app=amg2023
output=./results/$app
mkdir -p $output
for i in $(seq 1 15); do
echo "Running iteration $i"
time flux run --env OMP_NUM_THREADS=2 --setattr=user.study_id=$app-32-iter-$i -N 32 -n 896 -o cpu-affinity=per-task amg -n 256 256 128 -P 7 8 16 -problem 2
time flux run --env OMP_NUM_THREADS=2 --setattr=user.study_id=$app-64-iter-$i -N 64 -n 1792 -o cpu-affinity=per-task amg -n 256 256 128 -P 8 14 16 -problem 2
time flux run --env OMP_NUM_THREADS=2 --setattr=user.study_id=$app-128-iter-$i -N 128 -n 3584 -o cpu-affinity=per-task amg -n 256 256 128 -P 16 14 16 -problem 2
time flux run --env OMP_NUM_THREADS=2 --setattr=user.study_id=$app-256-iter-$i -N 256 -n 7168 -o cpu-affinity=per-task amg -n 256 256 128 -P 16 28 16 -problem 2
done
# When they are done:
for jobid in $(flux jobs -a --json | jq -r .jobs[].id)
do
# Get the job study id
study_id=$(flux job info $jobid jobspec | jq -r ".attributes.user.study_id")
echo "Parsing jobid ${jobid} and study id ${study_id}"
flux job attach $jobid &> $output/${study_id}-${jobid}.out
echo "START OF JOBSPEC" >> $output/${study_id}-${jobid}.out
flux job info $jobid jobspec >> $output/${study_id}-${jobid}.out
echo "START OF EVENTLOG" >> $output/${study_id}-${jobid}.out
flux job info $jobid guest.exec.eventlog >> $output/${study_id}-${jobid}.out
done
oras push ghcr.io/converged-computing/metrics-operator-experiments/performance:gke-cpu-$app $output
kubectl delete -f ./crd/amg2023.yaml
kubectl apply -f ./crd/kripke.yaml
time kubectl wait --for=condition=ready pod -l job-name=flux-sample --timeout=600s
flux proxy local:///mnt/flux/view/run/flux/local bash
Testing on 4 nodes:
# 1m 48 seconds
time flux run --env OMP_NUM_THREADS=1 --setattr=user.study_id=$app-32-iter-$i -N 4 -n 64 kripke --layout DGZ --dset 16 --zones 128,128,128 --gset 16 --groups 16 --niter 10 --legendre 2 --quad 16 --procs 4,4,4
Dane and Google runs (Dan in Slack, LDRD channel, August 20, 2024; 112 vCPUs / 56 cores per node): 32 nodes, 1792 tasks: --layout DGZ --dset 16 --zones 448,168,256 --gset 16 --groups 16 --niter 400 --legendre 2 --quad 16 --procs 8,14,16
Important: for each final command we also need to capture the final output of flux job info and the submit attributes (the collection loop below does this):
oras login ghcr.io --username vsoch
app=kripke
output=./results/$app
mkdir -p $output
for i in $(seq 1 5); do
echo "Running iteration $i"
time flux run --env OMP_NUM_THREADS=1 --setattr=user.study_id=$app-32-iter-$i -N 32 -n 1792 kripke --layout DGZ --dset 16 --zones 448,168,256 --gset 16 --groups 16 --niter 500 --legendre 2 --quad 16 --procs 8,14,16
time flux run --env OMP_NUM_THREADS=1 --setattr=user.study_id=$app-64-iter-$i -N 64 -n 3584 kripke --layout DGZ --dset 16 --zones 448,168,256 --gset 16 --groups 16 --niter 500 --legendre 2 --quad 16 --procs 16,14,16
time flux run --env OMP_NUM_THREADS=1 --setattr=user.study_id=$app-128-iter-$i -N 128 -n 7168 kripke --layout DGZ --dset 16 --zones 448,168,256 --gset 16 --groups 16 --niter 500 --legendre 2 --quad 16 --procs 32,14,16
time flux run --env OMP_NUM_THREADS=1 --setattr=user.study_id=$app-256-iter-$i -N 256 -n 14336 kripke --layout DGZ --dset 16 --zones 448,168,256 --gset 16 --groups 16 --niter 500 --legendre 2 --quad 16 --procs 32,14,32
done
# When they are done:
for jobid in $(flux jobs -a --json | jq -r .jobs[].id)
do
# Get the job study id
study_id=$(flux job info $jobid jobspec | jq -r ".attributes.user.study_id")
echo "Parsing jobid ${jobid} and study id ${study_id}"
flux job attach $jobid &> $output/${study_id}-${jobid}.out
echo "START OF JOBSPEC" >> $output/${study_id}-${jobid}.out
flux job info $jobid jobspec >> $output/${study_id}-${jobid}.out
echo "START OF EVENTLOG" >> $output/${study_id}-${jobid}.out
flux job info $jobid guest.exec.eventlog >> $output/${study_id}-${jobid}.out
done
oras push ghcr.io/converged-computing/metrics-operator-experiments/performance:gke-cpu-$app $output
kubectl delete -f ./crd/kripke.yaml
kubectl apply -f ./crd/laghos.yaml
time kubectl wait --for=condition=ready pod -l job-name=flux-sample --timeout=600s
flux proxy local:///mnt/flux/view/run/flux/local bash
Testing on 4 nodes:
# 1 minute 24 seconds
time flux run -o cpu-affinity=per-task -N4 -n 224 /opt/laghos/laghos -pa -p 1 -tf 0.6 -pt 311 -m /opt/laghos/data/cube_311_hex.mesh --ode-solver 7 --max-steps 10 --cg-tol 0 -cgm 50 -ok 3 -ot 2 -rs 4 -rp 2 --fom
oras login ghcr.io --username vsoch
app=laghos
output=./results/$app
mkdir -p $output
for i in $(seq 1 5); do
echo "Running iteration $i"
time flux run -o cpu-affinity=per-task --setattr=user.study_id=$app-32-iter-$i -N32 -n 1792 /opt/laghos/laghos -pa -p 1 -tf 0.6 -pt 311 -m /opt/laghos/data/cube_311_hex.mesh --ode-solver 7 --max-steps 400 --cg-tol 0 -cgm 50 -ok 3 -ot 2 -rs 4 -rp 2 --fom
time flux run -o cpu-affinity=per-task --setattr=user.study_id=$app-64-iter-$i -N64 -n 3584 /opt/laghos/laghos -pa -p 1 -tf 0.6 -pt 311 -m /opt/laghos/data/cube_311_hex.mesh --ode-solver 7 --max-steps 400 --cg-tol 0 -cgm 50 -ok 3 -ot 2 -rs 4 -rp 2 --fom
# This works (one core per task)
time flux run --exclusive --env OMP_NUM_THREADS=1 --cores-per-task 1 -o cpu-affinity=per-task --setattr=user.study_id=$app-128-iter-$i -N128 -n 6144 /opt/laghos/laghos -pa -p 1 -tf 0.6 -pt 311 -m /opt/laghos/data/cube_311_hex.mesh --ode-solver 7 --max-steps 400 --cg-tol 0 -cgm 50 -ok 3 -ot 2 -rs 4 -rp 2 --fom
# Alternative to try (two cores per task, half the tasks)
time flux run --exclusive --env OMP_NUM_THREADS=2 --cores-per-task 2 -o cpu-affinity=per-task --setattr=user.study_id=$app-128-iter-$i -N128 -n 3584 /opt/laghos/laghos -pa -p 1 -tf 0.6 -pt 311 -m /opt/laghos/data/cube_311_hex.mesh --ode-solver 7 --max-steps 400 --cg-tol 0 -cgm 50 -ok 3 -ot 2 -rs 4 -rp 2 --fom
time flux run -o cpu-affinity=per-task --setattr=user.study_id=$app-256-iter-$i -N256 -n 14336 /opt/laghos/laghos -pa -p 1 -tf 0.6 -pt 311 -m /opt/laghos/data/cube_311_hex.mesh --ode-solver 7 --max-steps 400 --cg-tol 0 -cgm 50 -ok 3 -ot 2 -rs 4 -rp 2 --fom
done
# When they are done:
for jobid in $(flux jobs -a --json | jq -r .jobs[].id)
do
# Get the job study id
study_id=$(flux job info $jobid jobspec | jq -r ".attributes.user.study_id")
echo "Parsing jobid ${jobid} and study id ${study_id}"
flux job attach $jobid &> $output/${study_id}-${jobid}.out
echo "START OF JOBSPEC" >> $output/${study_id}-${jobid}.out
flux job info $jobid jobspec >> $output/${study_id}-${jobid}.out
echo "START OF EVENTLOG" >> $output/${study_id}-${jobid}.out
flux job info $jobid guest.exec.eventlog >> $output/${study_id}-${jobid}.out
done
oras push ghcr.io/converged-computing/metrics-operator-experiments/performance:gke-cpu-$app $output
kubectl delete -f ./crd/laghos.yaml --wait
kubectl apply -f ./crd/lammps.yaml
time kubectl wait --for=condition=ready pod -l job-name=flux-sample --timeout=600s
flux proxy local:///mnt/flux/view/run/flux/local bash
Important: for each final command we also need to capture the final output of flux job info and the submit attributes (the collection loop below does this):
time flux run -o cpu-affinity=per-task -N4 -n 224 lmp -k on -sf kk -pk kokkos newton on neigh half -in in.snap.test -var snapdir 2J8_W.SNAP -v x 128 -v y 128 -v z 128 -var nsteps 1000
time flux run -o cpu-affinity=per-task -N4 -n 224 lmp -k on -sf kk -pk kokkos newton on neigh half -in in.snap.test -var snapdir 2J8_W.SNAP -v x 128 -v y 128 -v z 128 -var nsteps 10000
oras login ghcr.io --username vsoch
app=lammps
output=./results/$app
# NOTE: the below takes 4 minutes. If taking too long, drop back to 3 iterations
# IMPORTANT: Ani is testing if 128 works on lassen and 1500 vs 1000 steps
mkdir -p $output
for i in $(seq 1 5); do
echo "Running iteration $i"
time flux run --setattr=user.study_id=$app-32-iter-$i -o cpu-affinity=per-task -N32 -n 1792 lmp -k on -sf kk -pk kokkos newton on neigh half -in in.snap.test -var snapdir 2J8_W.SNAP -v x 128 -v y 128 -v z 128 -var nsteps 20000
time flux run --setattr=user.study_id=$app-64-iter-$i -o cpu-affinity=per-task -N64 -n 3584 lmp -k on -sf kk -pk kokkos newton on neigh half -in in.snap.test -var snapdir 2J8_W.SNAP -v x 128 -v y 128 -v z 128 -var nsteps 20000
time flux run --setattr=user.study_id=$app-128-iter-$i -o cpu-affinity=per-task -N128 -n 7168 lmp -k on -sf kk -pk kokkos newton on neigh half -in in.snap.test -var snapdir 2J8_W.SNAP -v x 128 -v y 128 -v z 128 -var nsteps 20000
time flux run --setattr=user.study_id=$app-256-iter-$i -o cpu-affinity=per-task -N228 -n 12768 lmp -k on -sf kk -pk kokkos newton on neigh half -in in.snap.test -var snapdir 2J8_W.SNAP -v x 128 -v y 128 -v z 128 -var nsteps 20000
done
# When they are done:
for jobid in $(flux jobs -a --json | jq -r .jobs[].id)
do
# Get the job study id
study_id=$(flux job info $jobid jobspec | jq -r ".attributes.user.study_id")
echo "Parsing jobid ${jobid} and study id ${study_id}"
flux job attach $jobid &> $output/${study_id}-${jobid}.out
echo "START OF JOBSPEC" >> $output/${study_id}-${jobid}.out
flux job info $jobid jobspec >> $output/${study_id}-${jobid}.out
echo "START OF EVENTLOG" >> $output/${study_id}-${jobid}.out
flux job info $jobid guest.exec.eventlog >> $output/${study_id}-${jobid}.out
done
oras push ghcr.io/converged-computing/metrics-operator-experiments/performance:gke-cpu-$app $output
Note that for "opposite scaling" apps like lammps, we are going to need to decide a maximum time to wait for something to run, otherwise we will get in trouble. Given the closeness with the affinity/without affinity times and how it improved the larger sizes, I recommend using the flag over not.
kubectl delete -f ./crd/lammps.yaml
kubectl apply -f ./crd/minife.yaml
time kubectl wait --for=condition=ready pod -l job-name=flux-sample --timeout=600s
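As with the other apps, shell into the lead broker pod and connect to the Flux instance before running these:

flux proxy local:///mnt/flux/view/run/flux/local bash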
time flux run -N4 -n 224 -o cpu-affinity=per-task miniFE.x nx=230 ny=230 nz=230 use_locking=1 elem_group_size=10 use_elem_mat_fields=300 verify_solution=0
time flux run -N4 -n 224 -o cpu-affinity=per-task miniFE.x nx=640 ny=640 nz=640 use_locking=1 elem_group_size=10 use_elem_mat_fields=300 verify_solution=0
oras login ghcr.io --username vsoch
app=minife
output=./results/$app
mkdir -p $output
for i in $(seq 1 5); do
echo "Running iteration $i"
time flux run --setattr=user.study_id=$app-32-iter-$i -N32 -n 1792 -o cpu-affinity=per-task miniFE.x nx=230 ny=230 nz=230 use_locking=1 elem_group_size=10 use_elem_mat_fields=300 verify_solution=0
time flux run --setattr=user.study_id=$app-64-iter-$i -N64 -n 3584 -o cpu-affinity=per-task miniFE.x nx=230 ny=230 nz=230 use_locking=1 elem_group_size=10 use_elem_mat_fields=300 verify_solution=0
time flux run --setattr=user.study_id=$app-128-iter-$i -N128 -n 7168 -o cpu-affinity=per-task miniFE.x nx=230 ny=230 nz=230 use_locking=1 elem_group_size=10 use_elem_mat_fields=300 verify_solution=0
time flux run --setattr=user.study_id=$app-256-iter-$i -N256 -n 14336 -o cpu-affinity=per-task miniFE.x nx=230 ny=230 nz=230 use_locking=1 elem_group_size=10 use_elem_mat_fields=300 verify_solution=0
done
# When they are done:
for jobid in $(flux jobs -a --json | jq -r .jobs[].id)
do
# Get the job study id
study_id=$(flux job info $jobid jobspec | jq -r ".attributes.user.study_id")
echo "Parsing jobid ${jobid} and study id ${study_id}"
flux job attach $jobid &> $output/${study_id}-${jobid}.out
echo "START OF JOBSPEC" >> $output/${study_id}-${jobid}.out
flux job info $jobid jobspec >> $output/${study_id}-${jobid}.out
echo "START OF EVENTLOG" >> $output/${study_id}-${jobid}.out
flux job info $jobid guest.exec.eventlog >> $output/${study_id}-${jobid}.out
done
oras push ghcr.io/converged-computing/metrics-operator-experiments/performance:gke-cpu-$app $output
kubectl delete -f ./crd/minife.yaml
kubectl apply -f ./crd/mixbench.yaml
time kubectl wait --for=condition=ready pod -l job-name=flux-sample --timeout=600s
flux proxy local:///mnt/flux/view/run/flux/local bash
Testing:
oras login ghcr.io --username vsoch
app=mixbench
output=./results/$app
# Set to the number of nodes - 1 (hosts are zero indexed)
nodes=N
mkdir -p $output
# each single run takes about 4.6 minutes
for i in $(seq 1 5); do
echo "Running iteration $i"
for node in $(seq 0 $nodes); do
flux submit --requires="hosts:flux-sample-$node" --env OMP_NUM_THREADS=96 --setattr=user.study_id=$app-iter-$i -l -N1 -n 1 mixbench-cpu 32
done
done
# When they are done:
for jobid in $(flux jobs -a --json | jq -r .jobs[].id)
do
# Get the job study id
study_id=$(flux job info $jobid jobspec | jq -r ".attributes.user.study_id")
echo "Parsing jobid ${jobid} and study id ${study_id}"
flux job attach $jobid &> $output/${study_id}-${jobid}.out
echo "START OF JOBSPEC" >> $output/${study_id}-${jobid}.out
flux job info $jobid jobspec >> $output/${study_id}-${jobid}.out
echo "START OF EVENTLOG" >> $output/${study_id}-${jobid}.out
flux job info $jobid guest.exec.eventlog >> $output/${study_id}-${jobid}.out
done
oras push ghcr.io/converged-computing/metrics-operator-experiments/performance:gke-cpu-$app $output
kubectl delete -f ./crd/mixbench.yaml
kubectl apply -f ./crd/mt-gemm.yaml
time kubectl wait --for=condition=ready pod -l job-name=flux-sample --timeout=600s
flux proxy local:///mnt/flux/view/run/flux/local bash
Testing:
time flux run -N4 -n 224 -o cpu-affinity=per-task /opt/dense_linear_algebra/gemm/mpi/build/1_dense_gemm_mpi
oras login ghcr.io --username vsoch
app=mt-gemm
output=./results/$app
mkdir -p $output
for i in $(seq 1 2); do
echo "Running iteration $i"
time flux run --setattr=user.study_id=$app-32-iter-$i -N32 -n 1792 -o cpu-affinity=per-task /opt/dense_linear_algebra/gemm/mpi/build/1_dense_gemm_mpi
time flux run --setattr=user.study_id=$app-64-iter-$i -N64 -n 3584 /opt/dense_linear_algebra/gemm/mpi/build/1_dense_gemm_mpi
time flux run --setattr=user.study_id=$app-128-iter-$i -N128 -n 7168 -o cpu-affinity=per-task /opt/dense_linear_algebra/gemm/mpi/build/1_dense_gemm_mpi
time flux run --setattr=user.study_id=$app-256-iter-$i -N256 -n 14336 /opt/dense_linear_algebra/gemm/mpi/build/1_dense_gemm_mpi
done
# When they are done:
for jobid in $(flux jobs -a --json | jq -r .jobs[].id)
do
# Get the job study id
study_id=$(flux job info $jobid jobspec | jq -r ".attributes.user.study_id")
echo "Parsing jobid ${jobid} and study id ${study_id}"
flux job attach $jobid &> $output/${study_id}-${jobid}.out
echo "START OF JOBSPEC" >> $output/${study_id}-${jobid}.out
flux job info $jobid jobspec >> $output/${study_id}-${jobid}.out
echo "START OF EVENTLOG" >> $output/${study_id}-${jobid}.out
flux job info $jobid guest.exec.eventlog >> $output/${study_id}-${jobid}.out
done
oras push ghcr.io/converged-computing/metrics-operator-experiments/performance:gke-cpu-$app $output
kubectl delete -f ./crd/mt-gemm.yaml
kubectl apply -f ./crd/osu.yaml
time kubectl wait --for=condition=ready pod -l job-name=flux-sample --timeout=600s
flux proxy local:///mnt/flux/view/run/flux/local bash
Write this script to the filesystem as flux-run-combinations.sh:
#!/bin/bash
nodes=$1
app=$2
# At most 28 combinations, 8 nodes 2 at a time
hosts=$(flux run -N $1 hostname | shuf -n 8 | tr '\n' ' ')
list=${hosts}
dequeue_from_list() {
  shift;
  list=$@
}
iter=0
for i in $hosts; do
  dequeue_from_list $list
  for j in $list; do
    echo "${i} ${j}"
    time flux run -N 2 -n 2 \
      --setattr=user.study_id=$app-2-iter-$iter \
      --requires="hosts:${i},${j}" \
      -o cpu-affinity=per-task \
      /opt/osu-benchmark/build.openmpi/mpi/pt2pt/osu_latency
    time flux run -N 2 -n 2 \
      --setattr=user.study_id=$app-2-iter-$iter \
      --requires="hosts:${i},${j}" \
      -o cpu-affinity=per-task \
      /opt/osu-benchmark/build.openmpi/mpi/pt2pt/osu_bw
    iter=$((iter+1))
  done
done
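If the script was pasted in with an editor or a heredoc, make it executable before the testing step:

chmod +x flux-run-combinations.sh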
Testing:
./flux-run-combinations.sh 4 $app
# 25 seconds
time flux run -N4 -n 224 -o cpu-affinity=per-task /opt/osu-benchmark/build.openmpi/mpi/collective/osu_allreduce
And then run as follows.
oras login ghcr.io --username vsoch
app=osu
output=./results/$app
./flux-run-combinations.sh 32 $app
mkdir -p $output
for i in $(seq 1 5); do
echo "Running iteration $i"
time flux run --setattr=user.study_id=$app-32-iter-$i -N32 -n 1792 -o cpu-affinity=per-task /opt/osu-benchmark/build.openmpi/mpi/collective/osu_allreduce
time flux run --setattr=user.study_id=$app-64-iter-$i -N64 -n 3584 -o cpu-affinity=per-task /opt/osu-benchmark/build.openmpi/mpi/collective/osu_allreduce
time flux run --setattr=user.study_id=$app-128-iter-$i -N128 -n 7168 -o cpu-affinity=per-task /opt/osu-benchmark/build.openmpi/mpi/collective/osu_allreduce
time flux run --setattr=user.study_id=$app-256-iter-$i -N256 -n 14336 -o cpu-affinity=per-task /opt/osu-benchmark/build.openmpi/mpi/collective/osu_allreduce
done
# When they are done:
for jobid in $(flux jobs -a --json | jq -r .jobs[].id)
do
# Get the job study id
study_id=$(flux job info $jobid jobspec | jq -r ".attributes.user.study_id")
echo "Parsing jobid ${jobid} and study id ${study_id}"
flux job attach $jobid &> $output/${study_id}-${jobid}.out
echo "START OF JOBSPEC" >> $output/${study_id}-${jobid}.out
flux job info $jobid jobspec >> $output/${study_id}-${jobid}.out
echo "START OF EVENTLOG" >> $output/${study_id}-${jobid}.out
flux job info $jobid guest.exec.eventlog >> $output/${study_id}-${jobid}.out
done
oras push ghcr.io/converged-computing/metrics-operator-experiments/performance:gke-cpu-$app $output
kubectl delete -f ./crd/osu.yaml
kubectl apply -f ./crd/quicksilver.yaml
time kubectl wait --for=condition=ready pod -l job-name=flux-sample --timeout=600s
flux proxy local:///mnt/flux/view/run/flux/local bash
For testing I used the smaller problem size for AKS from Abhik:
flux run --env OMP_NUM_THREADS=3 -N2 -n 64 qs --inputFile /opt/quicksilver/Examples/CORAL2_Benchmark/Problem1/Coral2_P1.inp -X 64 -Y 32 -Z 32 -x 64 -y 32 -z 32 -I 4 -J 4 -K 4 -n 10485760
TODO: need to figure this out!
# Testing...
time flux run --env OMP_NUM_THREADS=3 --setattr=user.study_id=$app-32-iter-$i -N32 -n 1024 qs --inputFile /opt/quicksilver/Examples/CORAL2_Benchmark/Problem1/Coral2_P1.inp -X 128 -Y 128 -Z 64 -x 128 -y 128 -z 64 -I 16 -J 8 -K 8 -n 335544320
time flux run --cores-per-task 7 --exclusive --env OMP_NUM_THREADS=7 -N64 -n 512 qs --inputFile /opt/quicksilver/Examples/CORAL2_Benchmark/Problem1/Coral2_P1.inp -X 128 -Y 64 -Z 64 -x 128 -y 64 -z 64 -I 8 -J 8 -K 8 -n 83886080
That seemed to start working (the matrix started getting printed), but I didn't want to wait for it to finish.
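For reference, the scaled runs below keep the MPI rank count equal to the product I*J*K, and hold the particle count -n at 320x the number of zones X*Y*Z (weak scaling); a quick check of the first few sizes:

echo $((8*8*4)) $((8*8*8)) $((16*8*8)) $((16*16*8))   # ranks per size: 256 512 1024 2048
echo $((320*64*64*64)) $((320*128*64*64))             # particles at 32 and 64 nodes: 83886080 167772160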
oras login ghcr.io --username vsoch
app=quicksilver
output=./results/$app
mkdir -p $output
for i in $(seq 1 5); do
echo "Running iteration $i"
# 32 nodes (done)
time flux run --env OMP_NUM_THREADS=7 --cores-per-task=7 --exclusive --setattr=user.study_id=$app-32-iter-$i -N32 -n 256 qs --inputFile /opt/quicksilver/Examples/CORAL2_Benchmark/Problem1/Coral2_P1.inp -X 64 -Y 64 -Z 64 -x 64 -y 64 -z 64 -I 8 -J 8 -K 4 -n 83886080
# 64 nodes
time flux run --env OMP_NUM_THREADS=7 --cores-per-task=7 --exclusive --setattr=user.study_id=$app-64-iter-$i -N64 -n 512 qs --inputFile /opt/quicksilver/Examples/CORAL2_Benchmark/Problem1/Coral2_P1.inp -X 128 -Y 64 -Z 64 -x 128 -y 64 -z 64 -I 8 -J 8 -K 8 -n 167772160
# 128 nodes
time flux run --env OMP_NUM_THREADS=7 --cores-per-task=7 --exclusive --setattr=user.study_id=$app-128-iter-$i -N128 -n 1024 qs --inputFile /opt/quicksilver/Examples/CORAL2_Benchmark/Problem1/Coral2_P1.inp -X 128 -Y 128 -Z 64 -x 128 -y 128 -z 64 -I 16 -J 8 -K 8 -n 335544320
# 256 nodes
time flux run --env OMP_NUM_THREADS=7 --cores-per-task=7 --exclusive --setattr=user.study_id=$app-256-iter-$i -N256 -n 2048 qs --inputFile /opt/quicksilver/Examples/CORAL2_Benchmark/Problem1/Coral2_P1.inp -X 128 -Y 128 -Z 128 -x 128 -y 128 -z 128 -I 16 -J 16 -K 8 -n 671088640
done
# When they are done:
for jobid in $(flux jobs -a --json | jq -r .jobs[].id)
do
# Get the job study id
study_id=$(flux job info $jobid jobspec | jq -r ".attributes.user.study_id")
echo "Parsing jobid ${jobid} and study id ${study_id}"
flux job attach $jobid &> $output/${study_id}-${jobid}.out
echo "START OF JOBSPEC" >> $output/${study_id}-${jobid}.out
flux job info $jobid jobspec >> $output/${study_id}-${jobid}.out
echo "START OF EVENTLOG" >> $output/${study_id}-${jobid}.out
flux job info $jobid guest.exec.eventlog >> $output/${study_id}-${jobid}.out
done
oras push ghcr.io/converged-computing/metrics-operator-experiments/performance:gke-cpu-$app $output
kubectl delete -f ./crd/quicksilver.yaml
kubectl apply -f ./crd/stream.yaml
time kubectl wait --for=condition=ready pod -l job-name=flux-sample --timeout=600s
flux proxy local:///mnt/flux/view/run/flux/local bash
Testing:
# 4 seconds
time flux run -N1 -n 56 -o cpu-affinity=per-task stream_c.exe
oras login ghcr.io --username vsoch
app=stream
output=./results/$app
# This should be the number of nodes - 1 (hosts are zero indexed)
nodes=N
mkdir -p $output
for i in $(seq 1 5); do
echo "Running iteration $i"
for node in $(seq 0 $nodes); do
flux submit --requires="hosts:flux-sample-$node" --setattr=user.study_id=$app-1-iter-$i-node-$node -N1 -n 96 -o cpu-affinity=per-task stream_c.exe
done
done
# When they are done:
for jobid in $(flux jobs -a --json | jq -r .jobs[].id)
do
# Get the job study id
study_id=$(flux job info $jobid jobspec | jq -r ".attributes.user.study_id")
echo "Parsing jobid ${jobid} and study id ${study_id}"
flux job attach $jobid &> $output/${study_id}-${jobid}.out
echo "START OF JOBSPEC" >> $output/${study_id}-${jobid}.out
flux job info $jobid jobspec >> $output/${study_id}-${jobid}.out
echo "START OF EVENTLOG" >> $output/${study_id}-${jobid}.out
flux job info $jobid guest.exec.eventlog >> $output/${study_id}-${jobid}.out
done
oras push ghcr.io/converged-computing/metrics-operator-experiments/performance:gke-cpu-$app $output
kubectl delete -f ./crd/stream.yaml
When you are done:
gcloud container clusters delete test-cluster --region=us-central1-a