Read from http - httppathlib? #455
Comments
I do like this idea, and it is not the first time we have heard this. We've been sort of on the fence about HTTP since at a protocol level, it doesn't map to most path operations except for working with individual files. Some thoughts on a potential mapping to the abstract methods used by
I wouldn't be surprised if even beyond that most servers/scenarios people work with are limited to GET and maybe HEAD, but that may not be a total deal breaker. Given our philosophy of trying to keep official cloud SDKs as the only dependencies of Happy to consider a PR for this, or if someone wanted to make a
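To make the GET/HEAD point concrete, here is a minimal sketch of how two single-file operations might sit on top of those verbs using only the standard library. `HttpFile`, `exists` and `read_bytes` are hypothetical names chosen for illustration; they are not part of cloudpathlib, and this is not the mapping the maintainers had in mind.

```python
import urllib.error
import urllib.request


class HttpFile:
    """Hypothetical read-only handle for one HTTP resource (not cloudpathlib API)."""

    def __init__(self, url: str) -> None:
        self.url = url

    def _request(self, method: str) -> urllib.request.Request:
        # Build (but do not send) a request that uses the given HTTP verb.
        return urllib.request.Request(self.url, method=method)

    def exists(self) -> bool:
        # HEAD asks for headers only. As a simplification, any HTTP error
        # status (404, 403, 500, ...) is reported as "does not exist".
        try:
            with urllib.request.urlopen(self._request("HEAD")):
                return True
        except urllib.error.HTTPError:
            return False

    def read_bytes(self) -> bytes:
        # GET downloads the whole response body.
        with urllib.request.urlopen(self._request("GET")) as response:
            return response.read()
```

Operations that work on collections of files, such as listing a directory or globbing, have no generic HTTP equivalent, which is why they are left out of this sketch.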
Glad to see that there's some willingness to explore options outside the current scope of cloud providers. HTTP presents some unique issues vs the more-like-a-real-filesystem cloud provider options already supported - hopefully those don't prove to be more than an annoyance here.

One thing I'm wondering about is range reads. In boto, ranges can be read like this:

    import boto3

    s3 = boto3.client('s3')
    bucket_name = 'your-bucket-name'
    object_key = 'path/to/your/object'
    start_byte = 0
    end_byte = 1023  # First 1KB
    response = s3.get_object(
        Bucket=bucket_name,
        Key=object_key,
        Range=f'bytes={start_byte}-{end_byte}'
    )
    data = response['Body'].read()  # Just the bytes we want

I'm new to the lib and certainly haven't gone through the source in detail but I wonder how well the

    from pathlib import Path

    file_path = Path('path/to/your/file')
    start_byte = 0
    end_byte = 1023  # First 1KB
    with file_path.open('rb') as file:
        file.seek(start_byte)
        data = file.read(end_byte - start_byte + 1)

It occurs to me that the expected behavior in this instance is a bit ambiguous, right? Like, if the file is remote, would
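For comparison, the same first-1KB read over plain HTTP could look roughly like the sketch below, which uses only `urllib.request`. The helper names `range_request` and `read_range` are made up for this example. It assumes the server supports range requests; a server is allowed to ignore the `Range` header and answer with the full body, so the sketch checks for status 206.

```python
import urllib.request


def range_request(url: str, start_byte: int, end_byte: int) -> urllib.request.Request:
    """Build a GET request for bytes start_byte..end_byte (inclusive)."""
    if start_byte < 0 or end_byte < start_byte:
        raise ValueError("expected 0 <= start_byte <= end_byte")
    request = urllib.request.Request(url)
    # Same 'bytes=first-last' syntax that the boto3 example passes as Range.
    request.add_header("Range", f"bytes={start_byte}-{end_byte}")
    return request


def read_range(url: str, start_byte: int, end_byte: int) -> bytes:
    """Fetch only the requested bytes, failing if the server ignores Range."""
    with urllib.request.urlopen(range_request(url, start_byte, end_byte)) as response:
        # 206 Partial Content: the range was honoured.
        # 200 OK: the header was ignored and the whole body is being sent.
        if response.status != 206:
            raise OSError(f"server did not honour Range (status {response.status})")
        return response.read()
```

Raising on a 200 response is one possible design choice; a path library could instead fall back to downloading the whole file and slicing it locally, which is part of the ambiguity raised above.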
Best approach to support reading data from http via a `pathlib`-like class, i.e. `httppathlib`?

In the pangeo / xarray community we do a lot of reading of remote scientific data (particularly netCDF and Zarr). We generally want to treat 3 cases the same way: local filesystems, cloud storage, and http urls. The latter is important partly because a lot of archival scientific data is still only available from servers via http (e.g. via openDAP urls), and we often want to pull it out and deposit it onto cloud storage (e.g. using pangeo-forge).
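As a rough illustration of the "three cases" idea, the sketch below sorts a location string into local, cloud, or http by its URL scheme. The `classify` function and the scheme sets are assumptions made for this example, not an existing API; the cloud prefixes listed are the ones cloudpathlib is assumed to handle.

```python
from urllib.parse import urlparse

# Assumed scheme sets for illustration only.
CLOUD_SCHEMES = {"s3", "gs", "az"}
HTTP_SCHEMES = {"http", "https"}


def classify(location: str) -> str:
    """Return 'http', 'cloud' or 'local' for a path or URL string."""
    scheme = urlparse(location).scheme.lower()
    if scheme in HTTP_SCHEMES:
        return "http"
    if scheme in CLOUD_SCHEMES:
        return "cloud"
    # No scheme, file://, or a one-letter "scheme" (a Windows drive such as C:).
    if scheme in ("", "file") or len(scheme) == 1:
        return "local"
    raise ValueError(f"unsupported scheme: {scheme!r}")
```

A caller could use the result to pick `pathlib.Path`, a cloud path class, or some future HTTP path class behind a single entry point.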
We currently use `fsspec` to abstract over these different filesystems, but despite much engagement upstream we have unfortunately experienced chronic reliability issues stemming from ill-defined interfaces.

CloudPathlib looks really nice, especially the strict typing and clear interface. (I'm in awe of the `AnyPath` virtual superclass trick too - and with #347 would be even cooler!) The `Path` abstraction also just seems more like the minimally-useful one, rather than trying to emulate a whole filesystem.

Rather than trying to support every filesystem under the sun as `fsspec` does, I'm wondering if we could just use `pathlib`, `cloudpathlib`, and some new `httppathlib`?

Do you have any thoughts on:

- `httppathlib` to conform to the `pathlib` interface?
- `cloudpathlib` or in a separate repository?
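One part of the `pathlib` interface that carries over to HTTP without any network access is pure path manipulation (`name`, `suffix`, `parent`, the `/` operator). The sketch below is only an illustration of that idea: `PureHttpPath` is a hypothetical class, not something that exists in cloudpathlib, and it borrows `PurePosixPath` for the path component of the URL.

```python
from __future__ import annotations

from pathlib import PurePosixPath
from urllib.parse import urlsplit, urlunsplit


class PureHttpPath:
    """Hypothetical pure (no I/O) path over an http(s) URL."""

    def __init__(self, url: str) -> None:
        parts = urlsplit(url)
        if parts.scheme not in ("http", "https"):
            raise ValueError(f"not an http(s) URL: {url!r}")
        self._parts = parts
        # Reuse pathlib's POSIX semantics for the URL's path component.
        self._path = PurePosixPath(parts.path or "/")

    def _with_path(self, path: PurePosixPath) -> PureHttpPath:
        # Derived paths keep scheme and host but drop query and fragment.
        new_parts = self._parts._replace(path=str(path), query="", fragment="")
        return PureHttpPath(urlunsplit(new_parts))

    @property
    def name(self) -> str:
        return self._path.name

    @property
    def suffix(self) -> str:
        return self._path.suffix

    @property
    def parent(self) -> PureHttpPath:
        return self._with_path(self._path.parent)

    def __truediv__(self, other: str) -> PureHttpPath:
        return self._with_path(self._path / other)

    def __str__(self) -> str:
        return urlunsplit(self._parts)
```

One caveat of this design: `PurePosixPath` discards trailing slashes, while HTTP servers can treat `/data` and `/data/` as different resources, so a real implementation would need to decide how to handle that.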