Mass org file export #1647
Renamed slightly to avoid confusion with #890
Most major clients don't support mutable torrents. Similarly, while I got desktop WebTorrent to work with them, it's very slow, and it cannot work in browsers due to the lack of a DHT. There's also a variation on mutable torrents where a newly downloaded torrent file can "update" on top of an older one. This doesn't do automatic updates, but I believe it's more widely supported.
I was asked to grab some large crawls out of our account since we hit our quota. I ended up writing an ad hoc Python utility to do this, but along the way a few things occurred to me:
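The ad hoc utility mentioned above wasn't shared, but the general shape is easy to sketch. This is a minimal illustration, not the author's actual script: the `all-crawls` download path and the bearer-token auth scheme are assumptions about the Browsertrix API layout, so check the instance's OpenAPI docs before relying on them.

```python
"""Rough sketch of an ad hoc crawl-export utility.
The endpoint path below is an assumption, not a confirmed API route."""
import urllib.request


def crawl_download_url(base_url: str, org_id: str, crawl_id: str) -> str:
    """Build the (assumed) per-crawl download URL."""
    base = base_url.rstrip("/")
    return f"{base}/api/orgs/{org_id}/all-crawls/{crawl_id}/download"


def download_crawl(base_url: str, org_id: str, crawl_id: str,
                   token: str, dest_path: str) -> None:
    """Stream one crawl's WACZ archive to disk using a bearer token."""
    req = urllib.request.Request(
        crawl_download_url(base_url, org_id, crawl_id),
        headers={"Authorization": f"Bearer {token}"},
    )
    with urllib.request.urlopen(req) as resp, open(dest_path, "wb") as out:
        while chunk := resp.read(1 << 16):  # 64 KiB chunks
            out.write(chunk)
```

Looping `download_crawl` over a list of crawl ids gathered from the API would reproduce the "click a lot of stuff" workflow in one command.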
Hi Ed, this is really cool! Thanks for sharing :)
rclone is very flexible! I think you'd need the credentials of the Browsertrix S3 bucket if you wanted to copy files directly from there though, which we likely won't want to share out for our production app. We use rclone internally in Browsertrix in the background jobs to replicate content from the primary s3 bucket to backups/replica locations. Still worth thinking through the possibilities though!
That's a great idea, Python and/or Node Browsertrix API client libraries could be very useful for use cases like this! Definitely worth looking into the degree to which we could use the OpenAPI docs to help with that.
That is generally the best endpoint if you already have the crawl ids or have gathered them from other crawl endpoints such as
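As a starting point for the client-library idea above, the service's OpenAPI document can be enumerated mechanically. The sketch below assumes a standard OpenAPI 3 spec structure; the sample paths in the test are invented for illustration, not real Browsertrix routes.

```python
"""Sketch: enumerate operations from an OpenAPI 3 spec dict as a first
step toward a generated thin client. Spec structure is standard OpenAPI;
nothing here is specific to Browsertrix."""

HTTP_METHODS = {"get", "post", "put", "patch", "delete"}


def list_operations(spec: dict) -> list[tuple[str, str]]:
    """Return sorted (METHOD, path) pairs from an OpenAPI spec dict."""
    ops = []
    for path, methods in spec.get("paths", {}).items():
        for method in methods:
            if method.lower() in HTTP_METHODS:
                ops.append((method.upper(), path))
    return sorted(ops)
```

Tools like openapi-python-client can take this further and generate typed client code directly from the same document.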
Thanks @tw4l! I thought maybe rclone could be used programmatically to pull from the set of signed URLs instead of from a bucket, and then it could write to the many endpoints it already supports? https://rclone.org/commands/rclone_copyurl/ But it wasn't a fully formed thought.
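The signed-URL idea could be driven from a script: `rclone copyurl` and its `-a` (auto-filename) flag are real rclone features, but the remote name below is a placeholder and the list of signed URLs is assumed to come from the API.

```python
"""Sketch: push each signed download URL to any rclone-supported backend
via `rclone copyurl`. The remote name is a placeholder."""
import subprocess


def copyurl_cmd(signed_url: str, remote_dest: str) -> list[str]:
    """Build the rclone invocation; -a derives the filename from the URL."""
    return ["rclone", "copyurl", "-a", signed_url, remote_dest]


def push_all(signed_urls: list[str], remote_dest: str = "myremote:exports") -> None:
    """Copy every signed URL to the destination, failing fast on errors."""
    for url in signed_urls:
        subprocess.run(copyurl_cmd(url, remote_dest), check=True)
```

Because rclone handles the destination side, the same script could target S3, Google Drive, SFTP, or local disk without code changes.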
Ah, that's a good call! I misunderstood, but that makes a lot of sense to me :)
Thanks for sharing this @edsu. I think #578 will essentially provide built-in support for this, we hope to get to it soon.
Context
Our service generates data, but right now users must click through many steps to download all of it. Given how easy it is to create data in bulk, we should make mass data export comparably easy.
Possible methods of downloading data