-
Just to clarify: what we want to test here is whether, on a large, fully loaded system with interactive users, all typical interactive commands (such as `flux resource list`, `flux jobs`, and `flux top`) remain responsive.

An idea is to run a real or simulated "large" instance (O(16K nodes)), start a job workload with some target throughput (e.g., it might be interesting to see the difference between a system running 1 job/s vs 10 jobs/s vs 50 jobs/s), and then have a script or set of scripts (perhaps launched on different nodes, as suggested by @wihobbs) that issue the interactive commands mentioned in the original post. We'd then need some way to capture the "interactivity" performance of the workload, e.g. the timing for each type of command could be captured, and the max, min, mean, and stddev reported. I think this would give us good insight into how a large, busy system would respond to lots of users.
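To make the measurement concrete, here is a minimal sketch of such a timing harness, assuming a hypothetical command list and sample count (nothing below exists in a test suite yet): each interactive command is run repeatedly and the min, max, mean, and stddev are reported per command.

```bash
#!/bin/bash
# Hypothetical interactivity probe: time each user-facing command NSAMPLES
# times and report min/max/mean/stddev per command. The command list is
# just an example set of read-only, non-interactive commands.
NSAMPLES=${NSAMPLES:-20}
COMMANDS=("flux resource list" "flux jobs -a" "flux queue status")

for cmd in "${COMMANDS[@]}"; do
    samples=()
    for i in $(seq $NSAMPLES); do
        t0=$(date +%s.%N)
        $cmd >/dev/null 2>&1
        t1=$(date +%s.%N)
        samples+=($(echo "$t1 - $t0" | bc -l))
    done
    # Summarize the collected samples with awk
    printf '%s\n' "${samples[@]}" | awk -v cmd="$cmd" '
        { sum += $1; sumsq += $1 * $1
          if (NR == 1 || $1 < min) min = $1
          if (NR == 1 || $1 > max) max = $1 }
        END { mean = sum / NR
              stddev = sqrt(sumsq / NR - mean * mean)
              printf "%-20s min=%.3f max=%.3f mean=%.3f stddev=%.3f\n",
                     cmd, min, max, mean, stddev }'
done
```

Several copies of something like this, launched on different nodes while the job workload is running, could then be compared across the different target throughputs.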
-
As a very simple example of testing with a simulated instance, the following script was used in a previous round of scale testing:

```bash
#!/bin/bash
PROG=$(basename $0)
VERBOSE=0
RPC=${1-resource.sched-status}

log() {
    test $VERBOSE -eq 0 && return
    local fmt=$1
    shift
    printf >&2 "$PROG: $fmt\n" $@
}

rpc() {
    flux python -c "import flux, json; print(json.dumps(flux.Flux().rpc(\"$1\").get()))"
}

runtest() {
    SCHEDULER=$1
    NNODES=$2

    log "Starting test of ${SCHEDULER} with ${NNODES} nodes"

    log "Removing modules..."
    flux module remove -f sched-fluxion-qmanager
    flux module remove -f sched-fluxion-resource
    flux module remove -f sched-simple
    flux module remove resource

    log "Loading fake resources via config..."
    flux config load <<EOF
[resource]
noverify = true
norestrict = true

[[resource.config]]
hosts = "test[1-${NNODES}]"
cores = "0-63"
gpus = "0-8"
EOF

    log "Reloading resource module..."
    flux module load resource noverify monitor-force-up

    log "Loading ${SCHEDULER} modules..."
    if test "$SCHEDULER" = "sched-simple"; then
        flux module load sched-simple
    else
        flux module load sched-fluxion-resource
        flux module load sched-fluxion-qmanager
    fi

    log "Starting some active jobs..."
    flux submit --quiet -xN1 --cc=1-${NNODES} \
        --setattr=exec.test.run_duration=\"600\" --wait-event=start \
        hostname

    if test "$SCHEDULER" = "fluxion"; then
        # allow fluxion to initialize graph?
        rpc $RPC >/dev/null
        rpc $RPC >/dev/null
    fi

    log "Timing $RPC"
    t0=$(date +%s.%N)
    rpc $RPC >/dev/null
    t1=$(date +%s.%N)
    dt1=$(echo "$t1 - $t0" | bc -l)

    log "Timing flux resource list"
    t0=$(date +%s.%N)
    flux resource list >/dev/null
    t1=$(date +%s.%N)
    dt2=$(echo "$t1 - $t0" | bc -l)

    printf "%-13s %8s %24.3f %22.3f\n" $SCHEDULER $NNODES $dt1 $dt2

    flux cancel --all --quiet 2>/dev/null
    flux queue idle --quiet
    flux module unload -f sched-fluxion-qmanager
    flux module unload -f sched-fluxion-resource
}

printf "%-13s %8s %18s %22s\n" \
    SCHEDULER NNODES "T($RPC)" "T(flux resource list)"

for scheduler in sched-simple fluxion; do
    for nnodes in 128 256 512 1024 2048 4096 8192 16384; do
        runtest $scheduler $nnodes
    done
done

# vi: ts=4 sw=4 expandtab
```
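For reference, and as an assumption on my part since the invocation isn't shown above, a script like this can be run under a small test instance; the fake `[resource.config]` table means the real broker size doesn't matter:

```bash
# Hypothetical invocation (sched-rpc-bench.sh is a made-up name for the
# script above); the optional argument selects the RPC topic to time.
flux start -s 1 ./sched-rpc-bench.sh resource.sched-status
```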
-
This weekend I took a node of … and ran this test.

Some (potential) confounding variables here: …

One thought I had was to make this test part of flux-test-collective, maybe on a less frequent (weekly? monthly?) basis. We could compare future numbers to these baselines, and fail the test if the mean time for an interactive command increases by a set percentage.
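As a sketch of what that pass/fail criterion might look like (the file names and threshold below are placeholders, not an existing flux-test-collective interface), the check could compare the measured mean against a stored baseline and fail when it grows by more than a set percentage:

```bash
#!/bin/bash
# Hypothetical regression gate: fail if the current mean interactive
# command time exceeds the stored baseline by more than PCT percent.
PCT=${PCT:-25}
baseline=$(cat baseline-mean.txt)   # e.g. 1.250 (seconds), recorded earlier
current=$(cat current-mean.txt)     # mean measured in this run

exceeds=$(echo "$current > $baseline * (1 + $PCT / 100)" | bc -l)
if test "$exceeds" -eq 1; then
    echo "FAIL: mean ${current}s is more than ${PCT}% over baseline ${baseline}s" >&2
    exit 1
fi
echo "OK: mean ${current}s is within ${PCT}% of baseline ${baseline}s"
```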
-
This is a visualization of the heap, generated with massif-visualizer. At the peak snapshot (no. 42), the heap was 4.5GiB. I periodically checked the memory utilization with …

I want to spend some time cleaning up the way this test is run: continuing the work @grondo suggested above, splitting RPC and command timing, and doing more appropriate logging instead of checking binary versions after the fact. But since it took some work just to get to this first-pass chart, I thought I'd put it up for public inspection.
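For anyone who wants to reproduce this kind of heap profile, the recipe is roughly the following; the exact target is an assumption on my part (the run above may have profiled a different process or used different options):

```bash
# Profile the broker (and its children) under massif while running the
# hypothetical benchmark script, then inspect the results graphically.
valgrind --tool=massif --trace-children=yes \
    --massif-out-file=massif.out.%p \
    flux start -s 1 ./sched-rpc-bench.sh
massif-visualizer massif.out.*
```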
-
We recently had the opportunity to do scale testing on the Dane cluster, and in a similar vein, it might be a good idea for us to do some scale testing with synthetic workloads to check throughput. This was brought up on the coffee call today.

Also, some recent issues, such as the >30s hang on `flux resource list` in #5819, show that some user-level commands might take a while on a system instance with thousands of nodes and jobs running all at once, so it'd be good to check other user commands too, such as `flux jobs` and `flux top`.

This discussion is open so we can come up with a design plan for the two above test cases.
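For the throughput side, a very rough sketch of a synthetic workload check might look like the following (the job count and no-op command are placeholders; a realistic workload would vary job sizes and durations):

```bash
# Submit N trivial jobs, wait for them to complete, and report
# end-to-end throughput in jobs/s.
N=${N:-1000}
t0=$(date +%s.%N)
flux submit --quiet --cc=1-$N --wait-event=clean true
t1=$(date +%s.%N)
echo "throughput: $(echo "$N / ($t1 - $t0)" | bc -l) jobs/s"
```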