add an OutputURL type with a streaming download/upload #2021

Open
wants to merge 2 commits into
base: async

technillogue
Contributor

#1987 and its followups aim to improve returning remote URLs. However, they use sync downloads, which can cause coroutines to queue up in the event loop under higher concurrency. They also don't use a connection pool and can't overlap download and upload. These problems can be avoided by adding a new output type that ClientManager.upload_file handles differently. Conveniently, ClientManager already holds a file_client with a connection pool suitable for downloading those images, and httpx provides good ergonomics for streaming downloads and uploads.
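To illustrate the queueing concern, here is a minimal standard-library sketch. It is not Cog code: `ticker`, `blocking_download`, `async_download`, and `ticks_during` are hypothetical names, and the sleeps merely stand in for real downloads. It counts how often another coroutine gets to run while a "download" is in progress.

```python
import asyncio
import time
from typing import Awaitable, Callable, List


async def ticker(counter: List[int], stop: asyncio.Event) -> None:
    # Stands in for any other coroutine that wants event-loop time.
    while not stop.is_set():
        counter[0] += 1
        await asyncio.sleep(0.005)


async def blocking_download() -> None:
    # Assumption: time.sleep models a sync HTTP download inside a coroutine.
    time.sleep(0.05)


async def async_download() -> None:
    # Assumption: asyncio.sleep models an awaited, non-blocking download.
    await asyncio.sleep(0.05)


async def ticks_during(download: Callable[[], Awaitable[None]]) -> int:
    counter = [0]
    stop = asyncio.Event()
    task = asyncio.create_task(ticker(counter, stop))
    await asyncio.sleep(0)  # let the ticker take its first step
    before = counter[0]
    await download()
    ticks = counter[0] - before  # ticker steps that ran during the download
    stop.set()
    await task
    return ticks
```

Under these assumptions, `asyncio.run(ticks_during(blocking_download))` reports zero ticks because the loop never regains control during the call, while the `async_download` variant lets the ticker run several times.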

Contributor

@aron aron left a comment

I think conceptually this is a clean solution.

Though it does expand the surface area of Cog. (This is a concern I had about repurposing URLFile too, but it already existed in a sort of clunky way.)

We tried to extend the io.IOBase approach to keep the interface consistent, but if I understand correctly we need to rework that for async anyway? So perhaps we need an entirely new File and Path implementation?

I think it'd be worth getting a second review from someone else familiar with Cog.

Comment on lines +251 to +258
if isinstance(fh, OutputURL):
    async with self.file_client.stream("GET", fh.url) as resp:
        content = resp.aiter_bytes()
        resp = await self.file_client.put(url, content=content, headers=headers)
else:
    resp = await self.file_client.put(
        url, content=ChunkFileReader(fh), headers=headers
    )
Contributor

Thanks for pulling this together. I'm trying to understand this, as I'm not very familiar with how async/await works in Python. I understand that by going with the httpx async client + iterator we get a non-blocking implementation, but how does the existing ChunkFileReader accomplish the same thing?

If it doesn't, and we still need to solve that issue, we should probably figure out an interface that works for both.

Contributor Author

The PUT method takes bytes or an async iterator that yields bytes. aiter_bytes counts, as does ChunkFileReader, which is implemented earlier in the same file. ChunkFileReader does do blocking disk reads, but each read is only 1 MB, which is likely short enough that we can do all the other networking we need in between.
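As a rough illustration of the pattern described above, the following standard-library sketch wraps a sync file handle in an async iterable that reads fixed-size chunks. It is not the ChunkFileReader from this PR: `ChunkedReader`, `CHUNK_SIZE`, and `collect` are hypothetical names chosen for this example.

```python
import asyncio
import io
from typing import AsyncIterator, BinaryIO, List

CHUNK_SIZE = 1024 * 1024  # 1 MB, the chunk size mentioned in the comment


class ChunkedReader:
    """Hypothetical async-iterable wrapper around a sync file handle."""

    def __init__(self, fh: BinaryIO, chunk_size: int = CHUNK_SIZE) -> None:
        self.fh = fh
        self.chunk_size = chunk_size

    async def __aiter__(self) -> AsyncIterator[bytes]:
        self.fh.seek(0)
        while True:
            # Blocking read, but bounded by chunk_size; the consumer can
            # await network I/O between one chunk and the next.
            chunk = self.fh.read(self.chunk_size)
            if not chunk:
                break
            yield chunk


async def collect(reader: ChunkedReader) -> List[bytes]:
    # Consumes the reader the way an async uploader might: chunk by chunk.
    return [chunk async for chunk in reader]
```

For example, `asyncio.run(collect(ChunkedReader(io.BytesIO(b"abcdefghij"), chunk_size=4)))` yields the chunks `b"abcd"`, `b"efgh"`, and `b"ij"`. Any object with this shape can be handed to a consumer that expects an async iterator of bytes.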

If you wanted to be fancy, you could have a single FileIterator that takes a local or remote URI, but that's kind of annoying to do while holding the context manager for the download request, and this approach is simpler.

python/cog/server/clients.py (review thread outdated and resolved)
Co-authored-by: Aron Carroll <[email protected]>
Signed-off-by: technillogue <[email protected]>