Exploring variability in identical workflows #295

rsignell opened this issue Aug 6, 2024 · 5 comments

rsignell commented Aug 6, 2024

I ran the same workflow twice, which just extracts a time series from a collection of files in object storage.

The first time I ran the workflow, all the tasks except for one finished in 30s or so, but the last task didn't complete for another 2 minutes:
https://cloud.coiled.io/clusters/549132/account/esip-lab/information?organization=esip-lab

I then ran the workflow again and the whole thing completed in 30s:
https://cloud.coiled.io/clusters/549139/account/esip-lab/information?organization=esip-lab

Is there a way to use the Coiled diagnostics to help figure out what was going on in the first case?

Obviously I'd like to avoid having all the workers sit idle while one task takes a very long time!

The reproducible notebook is here: https://nbviewer.org/gist/rsignell/8321c6e3f8f30ec70cdb6d768734e458

ntabris commented Aug 6, 2024

Hi @rsignell.

Not sure if this is the root cause, but one thing that might be relevant is that in both cases, the cluster started doing work while some of the workers were still coming up.

There's some expected variance in how long workers take to come up and be ready. Here's what I see on your two clusters:

[two screenshots: worker startup timelines for each cluster]

This can then affect how tasks get distributed.

For large/long-running workloads I'd expect this to matter less in relative terms, but since you're only running a few minutes of work, I wouldn't be surprised if it matters more here.

If you want to make this more consistent for comparative benchmarks on small workloads, you might try using coiled.Cluster(..., wait_for_workers=True) to wait for all the workers before the cluster is considered "ready".
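
For example, a minimal sketch (the worker count is just a placeholder, and it's worth checking the coiled docs for the exact signature in your version):

import coiled
from dask.distributed import Client

# Don't return the cluster until every requested worker has started,
# so each benchmark run begins with the same number of workers.
cluster = coiled.Cluster(
    n_workers=10,            # placeholder; match your workload
    wait_for_workers=True,   # block until all workers are up
)
client = Client(cluster)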

Does that help / address what you're looking for?

mrocklin commented Aug 6, 2024

When I see long stragglers like this my default assumption is variance in S3 response times. Some S3 requests just take a minute or two for some reason.

phofl commented Aug 6, 2024

Yeah, cluster boot-up seems unlikely here. The task was just running for a very long time, though I'm not sure why. We don't see any network traffic either, so S3 seems unlikely as well.

rsignell commented Aug 6, 2024

@martindurant, in your dealings with fsspec and GCS, have you occasionally seen requests to object storage taking much longer than the rest? I vaguely remember probing this issue with AWS and S3, where we were considering setting fsspec parameters like:

import fsspec

# Tighten the botocore timeouts and allow more retries, so a slow S3
# request fails fast and is retried instead of hanging.
fs_aws = fsspec.filesystem('s3', anon=True,
                           config_kwargs={'connect_timeout': 5,
                                          'read_timeout': 5,
                                          'retries': {'max_attempts': 10}})

not sure if there are similar settings for GCS... I will check...

@martindurant

Some people have commented that requests can take a long time to fail (fsspec/gcsfs#633), but no, I don't hear much about occasional stragglers. I think it's assumed to be a fair price to pay.

It should be possible to time out requests by passing similar arguments to aiohttp, and you could combine this with gcsfs retries or dask retries. I can't say whether that really gets you the bytes faster or not.
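
Something along these lines might work (a sketch only: I'm assuming gcsfs forwards session_kwargs to the underlying aiohttp.ClientSession and that you're on a distributed scheduler for per-task retries, so check both against your installed versions):

import aiohttp
import fsspec

# Cap how long any single request to GCS may run; gcsfs should pass
# session_kwargs through to aiohttp.ClientSession.
fs_gcs = fsspec.filesystem(
    'gcs',
    token='anon',   # public bucket; use real credentials otherwise
    session_kwargs={'timeout': aiohttp.ClientTimeout(total=30)},
)

# Then let dask retry tasks that fail when the timeout fires, e.g.
# client.compute(result, retries=3)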

What we are really missing is a good async-to-parallel model here. It should be possible to launch a large number of IO requests at once and farm out the CPU-bound processing of the bytes as they arrive, but this is complex to do across a cluster. An interesting idea among some GRIB-oriented people was to have a dedicated IO machine on the cluster with a fat NIC, which does light processing on the incoming data and provides it to the other machines on the internal network for the real work.
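
As a toy, single-machine sketch of that pattern (the URLs and the process() function are just placeholders for the real reads and CPU work; doing the same thing across a cluster is the hard part):

import asyncio
from concurrent.futures import ProcessPoolExecutor

import aiohttp

URLS = [f"https://example.com/chunk-{i}" for i in range(100)]  # placeholders


def process(data: bytes) -> int:
    # Stand-in for the real CPU-bound work on each chunk of bytes.
    return len(data)


async def fetch(session: aiohttp.ClientSession, url: str) -> bytes:
    async with session.get(url) as resp:
        return await resp.read()


async def main() -> list:
    loop = asyncio.get_running_loop()
    with ProcessPoolExecutor() as pool:
        async with aiohttp.ClientSession() as session:
            # Launch all the IO requests at once...
            fetches = [asyncio.create_task(fetch(session, u)) for u in URLS]
            processed = []
            # ...and hand each payload to a CPU worker as soon as it arrives.
            for fut in asyncio.as_completed(fetches):
                data = await fut
                processed.append(loop.run_in_executor(pool, process, data))
            return list(await asyncio.gather(*processed))


if __name__ == "__main__":
    print(sum(asyncio.run(main())))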
