s3 performance is slow #23
Related: iterative/dvc#5683
FYI I'm seeing very slow transfer speeds to S3 when doing a … Here's the output from …
@diehl Are you also using MinIO?
@efiop I'm not. I don't know what MinIO is.
@diehl So real AWS S3 then, right?
@efiop That is correct, using the AWS CLI.
@diehl Btw, what does the data that you have look like? Whole directories tracked by DVC? How many files are there typically, and what approximate size? IIRC you were talking about GeoJSON before, so I suppose thousands of ~100M files?
Chatted with @diehl privately. For the record, there are two directories: one 5G total with 224 files, and the other 14G total with 63 files. Inside there are misc files, with 5G being the biggest one in the second dataset. So likely we need to look at the scenario of transferring individual large files. Need to reproduce locally and see if there is anything on the surface.
This is probably an s3fs issue similar to the adlfs one we had recently.
@pmrowla Do you think we should prioritize it and solve it? What would be your estimate there?
Handling it the same way we did with adlfs (where we only optimize the worst-case single large file upload/download scenario) should be relatively quick to do in s3fs, but I should note that it looks like they have tried to do concurrent multipart uploads before in some of the s3fs calls but ran into problems that made them revert to only sequential operations: https://github.com/fsspec/s3fs/blob/b1d98806952485be86379f0f4574ee4de24568a1/s3fs/core.py#L1768C31-L1769
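The concurrent approach described above can be approximated with a thread pool over the individual part uploads. This is a minimal sketch, not s3fs's actual implementation: `upload_part` is a hypothetical stand-in for a real S3 `UploadPart` call, and a real client would finish with a `CompleteMultipartUpload` call using the collected part ETags.

```python
from concurrent.futures import ThreadPoolExecutor

def upload_part(part_number, data):
    # Hypothetical stand-in for an S3 UploadPart request; a real client
    # (e.g. the aiobotocore session inside s3fs) would return a real ETag.
    return {"PartNumber": part_number, "ETag": f"etag-{part_number}"}

def concurrent_multipart_upload(chunks, max_workers=4):
    # Upload all parts concurrently instead of sequentially; the caller
    # would then complete the multipart upload with the collected ETags.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [
            pool.submit(upload_part, i + 1, chunk)
            for i, chunk in enumerate(chunks)
        ]
        return [f.result() for f in futures]
```

The per-part results come back in part order because the futures list preserves submission order, even though the uploads themselves run concurrently.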
It looks like multipart upload requires a minimum part size of 5 MiB for all but the last part: https://docs.aws.amazon.com/AmazonS3/latest/userguide/qfacts.html Also, to set expectations, the azure approach currently will only help when transferring individual large files.
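Those S3 multipart rules (every part except the last must be at least 5 MiB, and an upload may have at most 10,000 parts, per the linked quotas page) constrain how a file can be chunked. A hedged sketch of a part planner under those constraints; `plan_parts` and its default part size are illustrative, not s3fs code:

```python
MIN_PART = 5 * 2**20   # S3 minimum for every part except the last
MAX_PARTS = 10_000     # S3 limit on the number of parts per upload

def plan_parts(total_size, part_size=8 * 2**20):
    """Split total_size into part sizes satisfying S3 multipart rules."""
    part_size = max(part_size, MIN_PART)
    # Grow the part size if the file would otherwise need > 10,000 parts.
    part_size = max(part_size, -(-total_size // MAX_PARTS))  # ceil division
    sizes = []
    remaining = total_size
    while remaining > 0:
        sizes.append(min(part_size, remaining))
        remaining -= sizes[-1]
    return sizes
```

Only the final part may fall under 5 MiB, which the loop guarantees since every earlier part is exactly `part_size`.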
Yes that's correct |
Should be resolved by fsspec/s3fs#848. There are some raw s3fs numbers in the upstream PR, but for reference, the same single-file transfer was timed with DVC against s3fs main and against the s3fs PR with the default settings.
This is merged upstream and will be available in the next s3fs release |
@pmrowla do you have an ETA for the next s3fs release by chance? |
I don't have an ETA but generally the fsspec/s3fs maintainers are fairly quick about doing releases for work we've contributed upstream cc @efiop |
Roger that - thanks @pmrowla |
Bug Report
s3 performance is slow
Description
This ticket is based on a topic in Discord (the "need-help" channel).
The problem is that I tried to use DVC with MinIO (S3-compatible storage) and noticed that its performance is very slow.
My env:
When I ran dvc pull -j 20, the maximum speed was 80 Mbit/s, but the average was about 40 Mbit/s. Downloading the bucket took 220 minutes.
What else I tried:
- dvc pull -j 80 - no improvement.
- awscli - the aws tool reaches at most 160 Mbit/s download speed. I tried different settings, but I couldn't exceed that limit.
- s4cmd - the maximum I got is about 130 Mbit/s, 64 minutes to get the bucket.
- s5cmd - maximum 960 Mbit/s and less than 10 minutes to download the whole bucket (written in Go).
So you can see that the storage performance is fine, but the download tools written in Python cannot reach the maximum speed.
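As a sanity check on the figures above, the reported rate times the wall time implies a bucket size of roughly 60-70 GB for each tool, so the numbers are mutually consistent and the gap really is tool throughput. A small sketch (the s4cmd and s5cmd rates are reported maxima, so their implied sizes are upper bounds):

```python
# Reported figures from the issue: (Mbit/s, minutes for the whole bucket).
runs = {
    "dvc pull -j 20": (40, 220),   # average rate
    "s4cmd":          (130, 64),   # maximum rate
    "s5cmd":          (960, 10),   # maximum rate, "less than 10 minutes"
}

def implied_gigabytes(mbit_per_s, minutes):
    # bits transferred = rate * time; divide by 8 for bytes, 1e9 for GB.
    return mbit_per_s * 1e6 * minutes * 60 / 8 / 1e9

for tool, (rate, minutes) in runs.items():
    print(f"{tool}: ~{implied_gigabytes(rate, minutes):.0f} GB")
```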
Reproduce
Profiler stats: https://disk.yandex.ru/d/XNajwHgWYlPSHA