add an OutputURL type with a streaming download/upload #2021

technillogue · 2024-10-24T02:40:17Z

#1987 and its followups aim to improve returning remote URLs. however, they use sync downloads, which block the event loop and cause other coroutines to queue up under higher levels of concurrency. they also don't use a connection pool and can't overlap download and upload. these problems can be avoided by adding a new output type that ClientManager.upload_file handles differently. conveniently, ClientManager already holds a file_client with a connection pool that is well suited to downloading those images, and httpx provides good ergonomics for streaming downloads and uploads.
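
To illustrate the queueing concern, here is a minimal stdlib-only sketch, not code from this PR. All names in it (`heartbeat`, `blocking_download`, `async_download`, `max_gap`) are hypothetical, and `time.sleep` / `asyncio.sleep` merely stand in for a sync and an async download. It measures how long a background coroutine is starved while each kind of download runs.

```python
import asyncio
import contextlib
import time


async def heartbeat(ticks: list, interval: float = 0.01) -> None:
    """Append a timestamp every `interval` seconds, whenever the loop is free."""
    while True:
        ticks.append(time.monotonic())
        await asyncio.sleep(interval)


async def blocking_download(duration: float) -> None:
    # Stand-in for a sync HTTP download: time.sleep never yields to the loop.
    time.sleep(duration)


async def async_download(duration: float) -> None:
    # Stand-in for an async download: awaiting hands control back to the loop.
    await asyncio.sleep(duration)


async def max_gap(download, duration: float = 0.3) -> float:
    """Return the longest pause between heartbeat ticks while `download` runs."""
    ticks: list = []
    task = asyncio.create_task(heartbeat(ticks))
    await asyncio.sleep(0.05)  # let a few ticks land first
    await download(duration)
    await asyncio.sleep(0.05)  # and a few more afterwards
    task.cancel()
    with contextlib.suppress(asyncio.CancelledError):
        await task
    return max(later - earlier for earlier, later in zip(ticks, ticks[1:]))


if __name__ == "__main__":
    print("sync :", asyncio.run(max_gap(blocking_download)))
    print("async:", asyncio.run(max_gap(async_download)))
```

With the blocking stand-in, the heartbeat cannot tick for the whole duration of the download; with the awaiting stand-in, it should keep ticking at roughly its normal interval.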

aron

I think conceptually this is a clean solution.

Though it does expand the surface area of Cog. (This is a concern I had about repurposing URLFile too, but it already existed in a sort of clunky way.)

We tried to extend the io.IOBase approach to keep the interface consistent, but if I understand correctly we need to rework that for async anyway? So perhaps we need an entirely new File and Path implementation?

I think it'd be worth getting a second review from someone else familiar with Cog.

aron · 2024-10-25T21:14:29Z

python/cog/server/clients.py

+        if isinstance(fh, OutputURL):
+            async with self.file_client.stream("GET", fh.url) as resp:
+                content = resp.aiter_bytes()
+                resp = await self.file_client.put(url, content=content, headers=headers)
+        else:
+            resp = await self.file_client.put(
+                url, content=ChunkFileReader(fh), headers=headers
+            )


Thanks for pulling this together. I'm trying to understand this, as I'm not very familiar with how async/await works in Python. I understand that by going with the httpx async client + iterator we get a non-blocking implementation, but how does the existing ChunkFileReader accomplish the same thing?

If it doesn't, and we still need to solve that issue, we should probably figure out an interface that works for both.

technillogue · 2024-10-28

The put method takes bytes or an async iterator that yields bytes. aiter_bytes counts, as does ChunkFileReader, which is implemented earlier in the same file. ChunkFileReader does do blocking disk reads, but reading 1MB at a time is likely short enough that we can do all the other networking we need in between.

if you wanted to be fancy you could have a single FileIterator that takes a local or remote URI, but that's kind of annoying to do while holding the context manager for the download request, and this approach is simpler
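
For readers unfamiliar with the pattern, here is a rough sketch of an async iterator that wraps blocking reads in bounded chunks. This is not Cog's actual ChunkFileReader: `ChunkedReader` and `fake_upload` are hypothetical names, and `fake_upload` only stands in for an HTTP client consuming a request body. The assumption it illustrates is that each read blocks for at most one chunk, and the consumer awaits between chunks, so other coroutines get a turn.

```python
import asyncio
import io
from typing import AsyncIterator, BinaryIO

CHUNK_SIZE = 1024 * 1024  # 1 MB, the chunk size mentioned in the thread


class ChunkedReader:
    """Hypothetical stand-in for ChunkFileReader; not Cog's real code."""

    def __init__(self, fh: BinaryIO, chunk_size: int = CHUNK_SIZE) -> None:
        self.fh = fh
        self.chunk_size = chunk_size

    def __aiter__(self) -> "ChunkedReader":
        return self

    async def __anext__(self) -> bytes:
        # Blocking read, but bounded: at most `chunk_size` bytes per call.
        chunk = self.fh.read(self.chunk_size)
        if not chunk:
            raise StopAsyncIteration
        return chunk


async def fake_upload(content: AsyncIterator[bytes]) -> int:
    """Stand-in for a request-body consumer; returns the number of bytes seen."""
    sent = 0
    async for chunk in content:
        # Stand-in for awaiting a socket write; other coroutines can run here.
        await asyncio.sleep(0)
        sent += len(chunk)
    return sent


if __name__ == "__main__":
    payload = io.BytesIO(b"x" * (2 * CHUNK_SIZE + 5))
    print(asyncio.run(fake_upload(ChunkedReader(payload))))
```

Because both this kind of reader and a streamed response body are async iterators of bytes, the upload side can treat them the same way, which is why the two branches in the diff differ only in where the content comes from.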

add an OutputURL type with a streaming download/upload

3644869

technillogue force-pushed the syl/url-outputs branch from 0bd49a9 to 3644869 Compare October 24, 2024 06:06

technillogue requested review from aron and a team October 24, 2024 16:19

aron reviewed Oct 25, 2024

Update python/cog/server/clients.py

4da3e62

Co-authored-by: Aron Carroll <[email protected]>
Signed-off-by: technillogue <[email protected]>

technillogue commented Oct 24, 2024

aron left a comment

aron Oct 25, 2024

technillogue Oct 28, 2024
