Exploring variability in identical workflows #295
Comments
Hi @rsignell. Not sure if this is the root cause, but one thing that might be relevant is that in both cases the cluster started doing work while some of the workers were still coming up. There's some expected variance in how long workers take to come up and be ready, and that's visible in the worker startup timelines for your two clusters. This can then affect how tasks get distributed. For large/long-running workloads I'd expect this to make less of a relative difference, but since you're only running a few minutes of work I wouldn't be surprised if it matters more here.

If you want to make this more consistent for comparative benchmarks on small workloads, you might try waiting for the full set of workers before starting the computation (see the sketch below).

Does that help / address what you're looking for?
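(The specific option named in the original comment was lost in this copy of the thread; the snippet below is only an illustrative sketch of one way to wait for workers, using `Client.wait_for_workers` from dask.distributed. The cluster name and worker count are placeholders.)

```python
from coiled import Cluster
from dask.distributed import Client

# Placeholder cluster configuration, not the one from the thread.
cluster = Cluster(name="benchmark-run", n_workers=20)
client = Client(cluster)

# Block until all 20 workers have connected, so every benchmark run
# starts from the same cluster state instead of ramping up mid-computation.
client.wait_for_workers(n_workers=20)

# ...submit the workload here...
```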
When I see long stragglers like this, my default assumption is variance in S3 response times. Some S3 requests just take a minute or two for some reason.
Yeah, cluster boot-up seems unlikely here. The task was just running for a very long time, though I'm not sure why. We don't see any network traffic either, so S3 seems unlikely as well.
@martindurant, in your dealings with fsspec and GCS, have you occasionally seen requests to object storage taking much longer than the rest? I vaguely remember probing this issue with AWS and S3, where we were considering setting fsspec parameters like:
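(The original parameter snippet was dropped from this copy of the thread; the following is only an illustrative sketch of the kind of timeout/retry settings that can be passed through s3fs to botocore, with placeholder values.)

```python
import s3fs

# Illustrative values only -- not the parameters from the original comment.
fs = s3fs.S3FileSystem(
    config_kwargs={
        "connect_timeout": 5,   # seconds allowed to establish a connection
        "read_timeout": 30,     # seconds to wait for a response
        "retries": {"max_attempts": 10, "mode": "adaptive"},
    },
)
```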
Not sure if there are similar settings for GCS... I will check...
Some people have commented that requests can take a long time to fail (fsspec/gcsfs#633), but no, I don't hear much about occasional stragglers; I think it's assumed to be a fair price to pay. It should be possible to time out requests by passing similar arguments to aiohttp, and you could combine this with gcsfs retries or Dask retries, though I can't say whether that really gets you the bytes any faster.

What we are really missing is a good async -> parallel model here. It should be possible to launch a large number of IO requests at once and farm out the CPU-bound processing of the bytes as they arrive, but this is complex to do across a cluster. An interesting idea among some GRIB-oriented people was to have a dedicated IO machine on the cluster with a fat NIC, which does light processing on the incoming data and provides it to the other machines on the internal network for the real work.
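(A minimal sketch of what combining task-level retries with Dask looks like, assuming a connected `dask.distributed` client; the read function and retry count are placeholders, not recommendations from the thread.)

```python
import dask
from dask.distributed import Client

client = Client()  # or a client attached to an existing Coiled cluster

@dask.delayed
def read_chunk(key):
    # Placeholder for the actual object-store read; if a slow request
    # eventually raises (e.g. hits a timeout), the retry below re-runs it.
    return b"..."

tasks = [read_chunk(k) for k in ("a", "b", "c")]

# retries=3 tells the scheduler to re-run a failed task up to three times,
# which papers over occasional requests that are slow to fail.
futures = client.compute(tasks, retries=3)
results = client.gather(futures)
```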
I ran the same workflow twice, which just extracts a time series from a collection of files in object storage.
The first time I ran the workflow, all the tasks except for one finished in 30s or so, but the last task didn't complete for another 2 minutes:
https://cloud.coiled.io/clusters/549132/account/esip-lab/information?organization=esip-lab
I then ran the workflow again and the whole thing completed in 30s:
https://cloud.coiled.io/clusters/549139/account/esip-lab/information?organization=esip-lab
Is there a way to use the Coiled diagnostics to help figure out what was going on in the first case?
Obviously I'd like to avoid having all the workers sitting idle while one task takes a super long time!
The reproducible notebook is here: https://nbviewer.org/gist/rsignell/8321c6e3f8f30ec70cdb6d768734e458