add an OutputURL type with a streaming download/upload #2021
base: async
I think conceptually this is a clean solution.
Though it does expand the surface area of Cog. (This is a concern I had about repurposing URLFile too, but that already existed in a sort of clunky way.)
We tried to extend the io.IOBase approach to keep the interface consistent, but if I understand correctly we need to rework that for async anyway? So perhaps we need an entirely new File and Path implementation?
I think it'd be worth getting a second review from someone else familiar with cog.
if isinstance(fh, OutputURL):
    async with self.file_client.stream("GET", fh.url) as resp:
        content = resp.aiter_bytes()
        resp = await self.file_client.put(url, content=content, headers=headers)
else:
    resp = await self.file_client.put(
        url, content=ChunkFileReader(fh), headers=headers
    )
Thanks for pulling this together. I'm trying to understand this, as I'm not very familiar with how async/await works in Python. I understand that by going with the httpx async client plus an iterator we get a non-blocking implementation, but how does the existing ChunkFileReader accomplish the same thing?
If it doesn't, and we still need to solve that issue, we should probably figure out an interface that works for both.
The PUT method takes bytes or an async iterator that yields bytes. The result of aiter_bytes qualifies, as does ChunkFileReader, which is implemented earlier in the same file. ChunkFileReader does do blocking disk reads, but each read is at most 1MB, which is likely short enough that all the other networking we need can happen in between.
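To make the "blocking but short" point concrete, here is a rough stdlib-only sketch. It is not cog's actual ChunkFileReader: `iter_file_chunks`, `upload`, and `other_work` are made-up names, and `asyncio.sleep(0)` stands in for the client awaiting a network write between chunks.

```python
import asyncio
import io
from typing import AsyncIterator, BinaryIO, List, Tuple

CHUNK_SIZE = 1024 * 1024  # 1 MiB, the chunk size discussed above


async def iter_file_chunks(
    fh: BinaryIO, chunk_size: int = CHUNK_SIZE
) -> AsyncIterator[bytes]:
    """Yield the contents of fh in chunks of at most chunk_size bytes.

    Each read() blocks the event loop, but only for one bounded chunk.
    """
    fh.seek(0)
    while True:
        chunk = fh.read(chunk_size)
        if not chunk:
            return
        yield chunk


async def upload(
    fh: BinaryIO, events: List[Tuple[str, int]], chunk_size: int
) -> None:
    """Hypothetical consumer standing in for an HTTP client sending the body."""
    async for chunk in iter_file_chunks(fh, chunk_size):
        events.append(("chunk", len(chunk)))
        # Stand-in for awaiting the network write; other tasks can run here.
        await asyncio.sleep(0)


async def other_work(events: List[Tuple[str, int]], steps: int) -> None:
    """Hypothetical unrelated coroutine sharing the same event loop."""
    for i in range(steps):
        events.append(("other", i))
        await asyncio.sleep(0)


async def main() -> List[Tuple[str, int]]:
    events: List[Tuple[str, int]] = []
    fh = io.BytesIO(b"abcdefghij")
    # A tiny chunk size so the 10-byte payload spans several chunks.
    await asyncio.gather(upload(fh, events, 4), other_work(events, 3))
    return events


if __name__ == "__main__":
    for event in asyncio.run(main()):
        print(event)
```

The reads themselves never yield to the loop; the interleaving comes from the consumer awaiting between chunks, which is why keeping each read small matters.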
If you wanted to be fancy, you could have a single FileIterator that takes a local or remote URI, but that's kind of annoying to do while holding the context manager for the download request, and this approach is simpler.
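For reference, a rough sketch of what that single iterator might look like. This is not part of the PR: `iter_bytes`, `RemoteOpener`, and `fake_open_remote` are hypothetical names, and the remote side is injected as a parameter so the sketch runs without a network. With httpx, the opener could presumably wrap `client.stream("GET", url)` and hand back `resp.aiter_bytes()`.

```python
import asyncio
import contextlib
import io
from typing import AsyncContextManager, AsyncIterator, BinaryIO, Callable, Union

# Hypothetical type: given a URL, returns an async context manager whose
# value is an async iterator of bytes.
RemoteOpener = Callable[[str], AsyncContextManager[AsyncIterator[bytes]]]


async def iter_bytes(
    source: Union[str, BinaryIO],
    open_remote: RemoteOpener,
    chunk_size: int = 1024 * 1024,
) -> AsyncIterator[bytes]:
    """Yield bytes from either a remote URL (str) or a local file handle."""
    if isinstance(source, str):
        # The download context stays open while the consumer keeps iterating,
        # and is closed when this generator finishes or is closed.
        async with open_remote(source) as chunks:
            async for chunk in chunks:
                yield chunk
    else:
        source.seek(0)
        while True:
            chunk = source.read(chunk_size)
            if not chunk:
                return
            yield chunk


@contextlib.asynccontextmanager
async def fake_open_remote(url: str):
    """Stand-in for a streaming GET; yields two fixed chunks."""

    async def chunks() -> AsyncIterator[bytes]:
        for part in (b"remote-", b"bytes"):
            await asyncio.sleep(0)  # stand-in for waiting on the network
            yield part

    yield chunks()


async def drain(source: Union[str, BinaryIO]) -> bytes:
    """Hypothetical consumer standing in for the upload request."""
    return b"".join([chunk async for chunk in iter_bytes(source, fake_open_remote)])
```

The catch is the one mentioned above: the download's lifetime is now tied to the generator's, so if the consumer stops iterating early the generator has to be closed explicitly for the connection to be released promptly.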
Co-authored-by: Aron Carroll <[email protected]>
Signed-off-by: technillogue <[email protected]>
#1987 and followups aim to improve returning remote URLs. however, they use sync downloads, which can cause coroutines to queue up in the event loop under higher levels of concurrency. they also don't use a connection pool and can't overlap download+upload. these problems can be avoided by adding a new output type that's handled differently by ClientManager.upload_file. conveniently, ClientManager already holds a file_client with a connection pool suitable for downloading those images well, and httpx provides good ergonomics for streaming downloads/uploads.