Mass org file export #1647

Open

Shrinks99 opened this issue Apr 3, 2024 · 7 comments
Labels
investigation Research and/or prototyping before dev work question Further information is requested, label should be removed once answered

Comments

@Shrinks99
Member

Shrinks99 commented Apr 3, 2024

Context

Our service generates a lot of data, but right now users have to click through a lot of UI to download it all. Given the ease of mass data creation, we should offer a similar level of ease when it comes to mass data export.

Possible methods of downloading data

  • Split into many ZIP files
    • Presumably increases our storage costs? Maybe we could do the same file streaming thing we do for WACZs? Otherwise I imagine we'll have to implement a whole thing that creates all the files and makes them available for a certain time period...
  • Migrating to another S3 bucket
    • A good solution for technical users who have this kind of infrastructure
    • Will have to be implemented if we want to allow users to switch their S3 bucket to their own infra (Custom S3 Buckets for Orgs #578)
  • Torrents???
    • Fairly nice fault-tolerant option for downloading a lot of data with a reasonably low barrier to entry for users?
    • We don't have to develop a desktop app!
    • Mutable torrents are a thing! Unsure about support for them in clients...
    • Object Torrent not supported by Digital Ocean Spaces 😭
  • Small dedicated downloading app
    • An opportunity for Tessa to learn Go?
  • Mailing people storage
    • I think this was a joke, but it's a thing and we could charge for it... Not sure we have the scale to support it well 🙃
@Shrinks99 Shrinks99 added the question and investigation labels Apr 3, 2024
@ikreymer ikreymer moved this from Triage to Todo in Webrecorder Projects Apr 3, 2024
@tw4l tw4l changed the title Org data export Mass org file export Apr 4, 2024
@tw4l
Member

tw4l commented Apr 4, 2024

Renamed slightly to avoid confusion with #890

@RangerMauve

Most major clients don't support mutable torrents. Similarly, while I got desktop WebTorrent to work with them, it's very slow, and it can't work in browsers due to the lack of DHT support. There's also a variation on mutable torrents where a newly downloaded torrent file can "update" on top of an older one. This doesn't do automatic updates, but I believe it's more widely supported.

@edsu
Collaborator

edsu commented Jul 18, 2024

I was asked to grab some large crawls out of our account since we hit our quota. I ended up writing an ad hoc Python utility to do this (a rough sketch of that kind of script follows the list below), but along the way a few things occurred to me:

  • it would be nice for the script to have some options to limit what was copied
  • maybe lean on rclone to be able to write to more than just S3?
  • is it feasible to build a better browsertrix client from the openapi doc?
  • is orgs/{org_id}/crawls/{crawl_id}/replay.json the best place to find the list of WACZ files?
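
For illustration, here is a minimal sketch of that kind of ad hoc export script, not the actual utility. The base URL, the auth header, and the `resources` / `path` / `name` field names in the replay.json response are assumptions about the API shape:

```python
# Rough sketch of an ad hoc crawl export (assumed field names, not verified).
import os
import requests

API = "https://app.browsertrix.com/api"   # assumed base URL
HEADERS = {"Authorization": f"Bearer {os.environ['BTRIX_TOKEN']}"}

def export_crawl(org_id: str, crawl_id: str, dest: str = ".") -> None:
    """Download every WACZ listed in a crawl's replay.json."""
    url = f"{API}/orgs/{org_id}/crawls/{crawl_id}/replay.json"
    resources = requests.get(url, headers=HEADERS).json().get("resources", [])
    for res in resources:
        out = os.path.join(dest, os.path.basename(res["name"]))
        # res["path"] is assumed to be a presigned URL pointing at the WACZ
        with requests.get(res["path"], stream=True) as r:
            r.raise_for_status()
            with open(out, "wb") as fh:
                for chunk in r.iter_content(chunk_size=1 << 20):
                    fh.write(chunk)
```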

@tw4l
Copy link
Member

tw4l commented Jul 19, 2024

Hi Ed, this is really cool! Thanks for sharing :)

  • it would be nice for the script to have some options to limit what was copied
  • maybe lean on rclone to be able to more than just s3?

rclone is very flexible! I think you'd need the credentials of the Browsertrix S3 bucket if you wanted to copy files directly from there, though, which we likely won't want to share out for our production app. We use rclone internally in Browsertrix in the background jobs to replicate content from the primary S3 bucket to backups/replica locations. Still worth thinking through the possibilities though!

  • is it feasible to build a better browsertrix client from the openapi doc?

That's a great idea, Python and/or Node Browsertrix API client libraries could be very useful for use cases like this! Definitely worth looking into how far the OpenAPI docs could get us there.
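
As a rough illustration of how that might be bootstrapped, one option is pointing an off-the-shelf generator such as openapi-python-client at the backend's OpenAPI schema. The schema URL below is an assumption (FastAPI apps typically publish one), not a confirmed Browsertrix endpoint:

```python
# Hypothetical: generate a Python client package from the OpenAPI schema.
# The schema URL is an assumption about where the backend publishes its spec;
# adjust it to the real location before running.
import subprocess

SPEC_URL = "https://app.browsertrix.com/api/openapi.json"  # assumed

subprocess.run(
    ["openapi-python-client", "generate", "--url", SPEC_URL],
    check=True,
)
```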

  • is orgs/{org_id}/crawls/{crawl_id}/replay.json the best place to find the list of WACZ files?

That is generally the best endpoint if you already have the crawl ids or have gathered them from other crawl endpoints such as orgs/{org_id}/crawls. One note: if you use orgs/{org_id}/all-crawls/{crawl_id}/replay.json, the same endpoint will work for both crawls and uploads as long as you have the right id for the crawl/upload, whereas orgs/{org_id}/crawls/{crawl_id}/replay.json or orgs/{org_id}/uploads/{crawl_id}/replay.json will only work for that archived item type. Similarly, orgs/{org_id}/all-crawls will return both crawls and uploads in the same paginated list, and you can tell which is which by the type field value.
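
To make that concrete, here is a minimal sketch of walking the all-crawls endpoints described above. The pagination parameters and the items / type / resources / path field names are assumptions about the response shape rather than verified API details:

```python
# Sketch: collect presigned WACZ URLs for every crawl and upload in an org.
import requests

API = "https://app.browsertrix.com/api"          # assumed base URL
HEADERS = {"Authorization": "Bearer <token>"}    # placeholder token

def list_wacz_urls(org_id: str) -> list[str]:
    urls: list[str] = []
    page = 1
    while True:
        resp = requests.get(
            f"{API}/orgs/{org_id}/all-crawls",
            headers=HEADERS,
            params={"page": page, "pageSize": 100},
        ).json()
        items = resp.get("items", [])
        if not items:
            break
        for item in items:
            # item["type"] says whether this is a crawl or an upload, but the
            # all-crawls replay.json endpoint is assumed to work for both
            replay = requests.get(
                f"{API}/orgs/{org_id}/all-crawls/{item['id']}/replay.json",
                headers=HEADERS,
            ).json()
            urls.extend(res["path"] for res in replay.get("resources", []))
        page += 1
    return urls
```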

@edsu
Collaborator

edsu commented Jul 24, 2024

Thanks @tw4l! I thought maybe rclone could be used programmatically to pull the set of signed URLs instead of a bucket, and then it could write to the many endpoints it already supports?

https://rclone.org/commands/rclone_copyurl/

But it wasn't a fully formed thought.
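
For what it's worth, a sketch of that idea might drive rclone copyurl from Python so each presigned WACZ URL can be written to any backend rclone supports. The remote name here is a placeholder for whatever the user has configured:

```python
# Sketch: push presigned WACZ URLs to an rclone remote via `rclone copyurl`.
import subprocess
from pathlib import PurePosixPath
from urllib.parse import urlparse

def copy_signed_urls(signed_urls: list[str], remote: str = "dest:webarchives") -> None:
    for url in signed_urls:
        # Derive a filename from the URL path (ignores the signed query string)
        filename = PurePosixPath(urlparse(url).path).name
        subprocess.run(
            ["rclone", "copyurl", url, f"{remote}/{filename}"],
            check=True,
        )
```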

@tw4l
Member

tw4l commented Jul 24, 2024

Thanks @tw4l! I thought maybe rclone could be used programmatically to pull the set of signed URLs instead of a bucket, and then it could write to the many endpoints it already supports?

Ah, that's a good call! I misunderstood, but that makes a lot of sense to me :)

@ikreymer
Member

Thanks for sharing this @edsu. I think #578 will essentially provide built-in support for this, we hope to get to it soon.
The idea would be to allow users to switch primary (and possibly secondary) storage to a bucket of their choosing, which would use rclone internally to copy all of the data. There are still a few things to figure out with that, but it should generally be doable, and I think it will probably cover your specific use case. Yeah, perhaps worth thinking about how it could be made more flexible; we were thinking of this as a one-time switch (i.e. bring-your-own bucket instead of using ours) rather than periodic backups with filtering.
