Resumable downloads #274
I just started fetching the .zim file used for that wikipedia root with Lassie ( It's going
vs the mirror. The mirror download was started at 6:43pm PST, lassie at 6:46pm PST.
What's happening with the graphsync errors is that it's attempting multiple protocols but eventually giving up on the ones that aren't yielding results. Because this content is stored on multiple Filecoin providers, it tries each of them at the same time as fetching over bitswap, but as they all fail for various reasons only bitswap is left. But I keep on getting
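For context, here's a minimal sketch of that race-the-candidates pattern, assuming nothing about lassie's internals: the `attempt` type and `raceRetrievals` function below are hypothetical stand-ins for the graphsync attempt against each provider plus a bitswap attempt, where failures are simply discarded and the first success cancels everything still in flight.

```go
// Hypothetical sketch of racing several retrieval attempts (e.g. graphsync
// from each Filecoin provider plus bitswap) and keeping the first success.
// This is illustrative only, not lassie's actual code.
package race

import (
	"context"
	"errors"
)

// attempt is a hypothetical stand-in for one retrieval strategy.
type attempt func(ctx context.Context) ([]byte, error)

// raceRetrievals runs all attempts concurrently. Failing attempts are
// dropped; the first success cancels everything still in flight.
func raceRetrievals(ctx context.Context, attempts []attempt) ([]byte, error) {
	ctx, cancel := context.WithCancel(ctx)
	defer cancel()

	results := make(chan []byte, len(attempts))
	errs := make(chan error, len(attempts))

	for _, a := range attempts {
		go func(a attempt) {
			data, err := a(ctx)
			if err != nil {
				errs <- err // e.g. a graphsync provider that isn't yielding results
				return
			}
			results <- data
		}(a)
	}

	for i := 0; i < len(attempts); i++ {
		select {
		case data := <-results:
			return data, nil
		case <-errs:
			// keep waiting on whatever is still running (often just bitswap)
		case <-ctx.Done():
			return nil, ctx.Err()
		}
	}
	return nil, errors.New("all retrieval attempts failed")
}
```

The buffered channels mean a failing attempt never blocks; once every graphsync candidate has errored out, the loop is effectively just waiting on bitswap, which matches the behaviour described above.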
@hannahhoward had the idea of a
We currently have no option to restart a download, which makes lassie pretty fussy and problematic for large downloads. If you fail, you have to start from scratch. At least with Kubo, you have the data in a blockstore so it can resume from there.
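A rough sketch of what resume-from-blockstore looks like, with hypothetical `Blockstore` and `Fetcher` interfaces standing in for whatever a real implementation would use (this is not lassie's actual API): each block is checked against a persistent local store before going to the network, so a restarted download only re-fetches what it doesn't already have.

```go
// Minimal sketch of resume-from-blockstore behaviour. The interfaces here
// are hypothetical placeholders, not lassie's real types.
package resume

import (
	"context"

	"github.com/ipfs/go-cid"
)

// Blockstore is a hypothetical persistent local store of blocks.
type Blockstore interface {
	Has(ctx context.Context, c cid.Cid) (bool, error)
	Get(ctx context.Context, c cid.Cid) ([]byte, error)
	Put(ctx context.Context, c cid.Cid, data []byte) error
}

// Fetcher is a hypothetical network retriever (bitswap, graphsync, HTTP, ...).
type Fetcher interface {
	Fetch(ctx context.Context, c cid.Cid) ([]byte, error)
}

// getBlock returns a block, preferring what is already on disk so that a
// restarted download only re-fetches blocks it does not have yet.
func getBlock(ctx context.Context, bs Blockstore, f Fetcher, c cid.Cid) ([]byte, error) {
	if ok, err := bs.Has(ctx, c); err == nil && ok {
		return bs.Get(ctx, c) // already fetched on a previous run
	}
	data, err := f.Fetch(ctx, c)
	if err != nil {
		return nil, err
	}
	// Persist immediately so a later crash or timeout doesn't lose progress.
	if err := bs.Put(ctx, c, data); err != nil {
		return nil, err
	}
	return data, nil
}
```

This is roughly what Kubo gets for free from its on-disk blockstore; the sketch just makes the check-before-fetch step explicit.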
Challenges to be solved:
As an experiment I've been trying to download a copy of wikipedia (`bafybeiaysi4s6lnjev27ln5icwm6tueaw2vdykrtjkwiphwekaywqhcjze`) and can't get more than ~500MB in with lassie before I hit timeouts or other errors, and I have no way of resuming. Kubo gets much further, although it slows to a crawl for me at a certain point, but at least I know I can cancel it and start again and it'll still have what it already fetched in its blockstore.

There's a general problem set of "large data" that I don't think lassie is up to the challenge of solving yet.