Possible issue with latest s3fs (2024.3.0): `max_concurrency` kwarg #80
Fixes iterative/dvc-s3#80. It is similar to `gcsfs` and `adlfs`. On our end, `max_concurrency` is passed here: https://github.com/iterative/dvc-objects/blob/main/src/dvc_objects/fs/generic.py#L210, and since the new version has this attr we now pass it, which most likely leads to this error. I'm not sure why the `_get_file` part was not yet implemented. @pmrowla might have a better idea on whether it was complicated or just lower priority. It seems like the natural next step.
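For illustration, a hypothetical minimal reproduction of that failure mode (toy code, not the actual dvc-objects or s3fs implementation): once a `max_concurrency` attribute appears on the filesystem class, the caller starts forwarding the kwarg, and a per-file method that never grew `**kwargs` raises:

```python
# Toy sketch of the failure mode. The caller feature-sniffs the new attribute
# and forwards the kwarg; get() fans kwargs out to the per-file method, whose
# signature never grew **kwargs, so the call fails.
class ToyFS:
    max_concurrency = 10  # attribute new in the toy "2024.3.0"

    def get(self, rpath, lpath, **kwargs):
        # fsspec-style: extra kwargs are passed through per file
        self._get_file(rpath, lpath, **kwargs)

    def _get_file(self, rpath, lpath):  # does not accept max_concurrency
        print(f"downloading {rpath} -> {lpath}")

fs = ToyFS()
kwargs = {}
if getattr(fs, "max_concurrency", None):  # caller-side feature sniff
    kwargs["max_concurrency"] = fs.max_concurrency
fs.get("bucket/key", "key", **kwargs)
# TypeError: _get_file() got an unexpected keyword argument 'max_concurrency'
```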
Thanks @ryan-williams. Should be fixed by fsspec/s3fs#863. For now we should probably also do a workaround here (in the s3 case, avoid passing `max_concurrency`).
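A hedged sketch of what such a caller-side guard could look like (a hypothetical helper, not dvc-objects code): inspect the target method's signature and drop `max_concurrency` unless it can actually be accepted:

```python
# Hypothetical guard: forward kwargs like max_concurrency only when the
# filesystem's _get_file() can take them (directly or via **kwargs).
import inspect

def filter_get_kwargs(fs, **kwargs):
    params = inspect.signature(fs._get_file).parameters
    has_var_kw = any(
        p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()
    )
    return {k: v for k, v in kwargs.items() if has_var_kw or k in params}

# Usage: fs.get(rpath, lpath, **filter_get_kwargs(fs, max_concurrency=4))
```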
Supporting concurrent chunked downloads is more complex than uploads, so it was not implemented due to time constraints. For uploads (fsspec `_put_file`), S3's multipart upload API handles re-assembly on the server side: chunks can be uploaded concurrently in any order, and completing the multipart upload produces the final object. For downloads (fsspec `_get_file`), S3 can serve specific byte ranges, but there is no server-side equivalent for re-assembling them. Since there is no guarantee that downloads will be completed in order, this means completed chunks need to either be kept in memory or written out to temp files on disk before re-assembling them at the end of the download operation (once all chunks are available, or at least when the "next" sequential chunk is available). This is something that can be implemented in s3fs, but IMO it would be better for it to be implemented at the outer client level (i.e. the client calling fsspec, in this case dvc-objects), which would make chunked downloads supported for all filesystems that support downloading a specific byte range (which is essentially every fsspec implementation).
I think all local filesystems support seeking beyond the end of the file to write data, so reassembly should not be hard. On Linux this even produces "sparse" files (on Windows you get padding), which is not important in this case because we intend to fill all the gaps.
Should we create a separate issue for concurrent chunked downloads? Edit: And can we then close this issue? I see @martindurant released 2024.3.1.
It's worth making a note, but someone should benchmark whether it actually makes a difference. Currently we go concurrent over all files (subject to a throttle limiting the number of file descriptors), but each file streams. Although Windows does allow seeking beyond a file's end to extend it, I wonder if it allows multiple writers on a single file. If not, each task would need to open/seek/write/close every time.
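For concreteness, a minimal sketch of the client-level chunked download discussed above, assuming only fsspec's public `size()` and ranged-read `cat_file(path, start=, end=)` interface, and using the per-task open/seek/write/close pattern just mentioned:

```python
# Sketch of client-side concurrent chunked download over any fsspec
# filesystem that supports ranged reads; not the dvc-objects implementation.
from concurrent.futures import ThreadPoolExecutor

def chunked_get(fs, rpath, lpath, chunk_size=8 * 2**20, max_concurrency=4):
    size = fs.size(rpath)
    # Pre-size the destination so each worker can seek to its own offset.
    with open(lpath, "wb") as f:
        f.truncate(size)

    def fetch(start):
        end = min(start + chunk_size, size)
        data = fs.cat_file(rpath, start=start, end=end)
        # Each task opens/seeks/writes/closes its own handle, so no two
        # writers share a file object (the Windows concern above).
        with open(lpath, "r+b") as f:
            f.seek(start)
            f.write(data)

    with ThreadPoolExecutor(max_workers=max_concurrency) as pool:
        # list() forces iteration so worker exceptions propagate.
        list(pool.map(fetch, range(0, size, chunk_size)))
```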
Closed by fsspec/s3fs#863, and released in 2024.3.1.
…rsion 2024.3.1

```
commit efbe1e4c23a06e65b3df6a82f28fc49bab0dbd78
Author: Martin Durant <[email protected]>
Date:   Mon Mar 18 15:42:28 2024 -0400

    changelog (#864)

commit 5cf759d2e670eb4cb79d978491bf42ed0eff23a5
Author: Ivan Shcheklein <[email protected]>
Date:   Mon Mar 18 07:40:19 2024 -0700

    fix(core): accept kwargs in get file (#863)

    Fixes iterative/dvc-s3#80
```
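Per the commit title ("accept kwargs in get file"), the shape of the fix is simply widening the download method's signature so extra keyword arguments like `max_concurrency` are tolerated; a hedged sketch of that shape (not the literal s3fs diff, whose real method is async and takes more parameters):

```python
class S3LikeFS:
    def _get_file(self, rpath, lpath, **kwargs):
        # Extra kwargs (e.g. max_concurrency) are now accepted rather than
        # raising TypeError; they can be ignored until chunked downloads land.
        ...
```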
s3fs 2024.3.0 (released yesterday) added a `max_concurrency` kwarg (fsspec/s3fs#848), and today I have a job failing during a `dvc pull` from S3, referencing `an unexpected keyword argument 'max_concurrency'` (GHA link).

I was unable to repro it locally (with most of the same relevant versions: `dvc{,_s3}`, `*boto*`, `s3fs`), but pinning `s3fs<=2024.2` allowed the same `dvc pull` to succeed (GHA link).

Mentioning here in case others run into it / can better triage.