No "pick up where you left off" option for failed downloads #27

Open
baron-de-montblanc opened this issue Oct 22, 2024 · 4 comments
Labels: enhancement (New feature or request)

@baron-de-montblanc

Hello, I am trying to download some rather large observations from ASVO to our group's supercomputer through giant-squid. It is very common for the download to fail (see attached screenshot for example), probably due to the connection getting interrupted.

[Screenshot: giant-squid download failure, 2024-10-22]

My question is: is there an option/flag one can use with giant-squid to tell it to resume the download from where it crashed? (Or, alternatively, how could I successfully download these ~50 GB observations without it crashing?)

@gsleap added the enhancement (New feature or request) label on Oct 23, 2024
@d3v-null (Contributor) commented Oct 23, 2024

Hey Jade,
That must be frustrating.
We have a little bit of retry / error handling logic in giant-squid, but it's clearly not doing its job.

In the meantime, here's how you can use wget to handle the download instead.

giant-squid list --json $query

will give you a bunch of metadata about the jobs matching $query, including a download link.

{
   "801409":{
      "obsid":1413666792,
      "jobId":801409,
      "jobType":"DownloadVisibilities",
      "jobState":"Ready",
      "files":[
         {
            "jobType":"Acacia",
            "fileUrl":"https://projects.pawsey.org.au/mwa-asvo/1413666792_801409_vis.tar?AWSAccessKeyId=...",
            "filePath":null,
            "fileSize":152505477120,
            "fileHash":"d6dfb7391a495b0eb07cc885808e9e8058e90ec3"
         }
      ]
   }
}

You can chuck fileUrl straight into wget, which has a lot of options around retrying downloads. I use --wait=60 --random-wait.
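
For a single job, the command could look something like this (a sketch only: the URL is the truncated fileUrl from the JSON above, and the output filename is my own choice):

# Download one job's tarball with polite spacing between retries.
# Paste the full fileUrl from the JSON output in place of the truncated one.
wget --wait=60 --random-wait --progress=dot:giga \
    -O 1413666792_801409_vis.tar \
    "https://projects.pawsey.org.au/mwa-asvo/1413666792_801409_vis.tar?AWSAccessKeyId=..."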

If you want to automate this for many jobs, you can use jq, e.g.

giant-squid list -j --states=ready -- $obslist \
    | jq -r '.[]|[.jobId,.files[0].fileUrl//"",.files[0].fileSize//"",.files[0].fileHash//""]|@tsv' \
    | while read -r jobid url size hash; do
        [ -f "${jobid}.tar" ] && continue
        wget "$url" -O "${jobid}.tar" --progress=dot:giga --wait=60 --random-wait
done
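
If you want to sanity-check a finished download, the listing already gives you fileSize and fileHash. The example hash above is 40 hex characters, which looks like SHA-1, but that's an assumption worth checking against the ASVO docs. Here's a sketch of something you could drop into the loop body after the wget call:

# Verify size and checksum against the values read from giant-squid list.
# Assumes SHA-1 hashes and GNU coreutils (on macOS/BSD use stat -f%z).
actual_size=$(stat -c%s "${jobid}.tar")
actual_hash=$(sha1sum "${jobid}.tar" | awk '{print $1}')
if [ "$actual_size" = "$size" ] && [ "$actual_hash" = "$hash" ]; then
    echo "${jobid}.tar verified"
else
    echo "${jobid}.tar failed verification" >&2
fi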

@gsleap (Member) commented Oct 23, 2024

Hi Jade,

As Dev says, we currently don't have a continue-from-where-you-left-off feature as such, but it would be extremely valuable especially for large downloads. So it will definitely be on our roadmap for a future release.

In the meantime, I think Dev has used the above technique successfully, so please give that a go and let us know how it goes!

@gsleap (Member) commented Oct 23, 2024

Oh, and @baron-de-montblanc @d3v-null - FYI you can also pass to wget:
-c, --continue to "resume getting a partially-downloaded file"
I only just found it and it does appear to work quite nicely!
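
If anyone wants to fold that into the loop above, dropping the skip-if-exists check and adding -c lets a partially downloaded tarball resume rather than be skipped. An untested sketch (note that some older wget versions handled -c together with -O inconsistently, so try it on a small job first):

# Resume partial downloads with wget -c instead of skipping existing files.
giant-squid list -j --states=ready -- $obslist \
    | jq -r '.[]|[.jobId,.files[0].fileUrl//""]|@tsv' \
    | while read -r jobid url; do
        wget -c "$url" -O "${jobid}.tar" --progress=dot:giga --wait=60 --random-wait
done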

@d3v-null (Contributor) commented
A friendly reminder to anyone who comes across this issue: we take pull requests!

The main download loop is here.

It's wrapped in an exponential backoff here.

Compared to wget, this is download handling from the stone age.
