Download files in fetch.txt #118

kba · 2018-11-15T15:43:37Z

How would I go about completing an incomplete bag, which has files referenced in fetch.txt not present in /data?

Is this outside the domain of the tool or just not implemented? Or have I missed something?

If the latter, would this be an interesting feature for bagit-python or should we implement it on our side?


kba · 2018-11-15T15:51:49Z

BTW, an ad-hoc solution without any checks etc is this bash one-liner:

while read -r url size fpath; do mkdir -p "${fpath%/*}"; wget -O "$fpath" "$url"; done < fetch.txt
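For readers who would rather stay in Python, the same idea can be sketched roughly as below. This is not code from bagit-python or from any pull request mentioned in this thread; `parse_fetch_line` and `fetch_missing` are hypothetical names, and the sketch assumes fetch.txt lines have the form `url length path`. The downloader is passed in as a parameter so it can be swapped for something with retries or authentication.

```python
import os
import urllib.request


def parse_fetch_line(line):
    """Split one fetch.txt line into (url, length, path).

    The path is everything after the second whitespace run, so it may
    itself contain spaces. Length stays a string because it can be "-".
    """
    url, length, path = line.strip().split(None, 2)
    return url, length, path


def fetch_missing(bag_dir, download=urllib.request.urlretrieve):
    """Download fetch.txt entries that are not yet present in the bag.

    Hypothetical helper: returns the list of paths that were fetched.
    """
    fetched = []
    with open(os.path.join(bag_dir, "fetch.txt"), encoding="utf-8") as handle:
        for line in handle:
            if not line.strip():
                continue  # tolerate blank lines
            url, _length, path = parse_fetch_line(line)
            # Refuse paths that could escape the bag directory.
            if os.path.isabs(path) or ".." in path.split("/"):
                raise ValueError("unsafe path in fetch.txt: %r" % path)
            target = os.path.join(bag_dir, path)
            if os.path.exists(target):
                continue  # already present, nothing to do
            os.makedirs(os.path.dirname(target), exist_ok=True)
            download(url, target)
            fetched.append(path)
    return fetched
```

Like the one-liner, this does no checksum or size verification; running the regular bag validation afterwards would still be needed.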


bruth · 2019-04-04T11:06:35Z

Hi @kba and @acdha, I am cross-posting here from the issue @sevein referenced (visible above), archivematica/Issues#583. I also want support for the fetch.txt file; however, I only need validation, not automatic downloading of the files. For my use case I have bags that reference TBs of data that are already in archive-quality, content-addressable storage.

My team and I are happy to contribute reviews or code to get, at a minimum, the validation functionality in, even without fetching of the files. My feeling is that the default should be to validate without fetching, and that a fetch should happen only when a parameter requests it.

acdha · 2019-04-04T13:38:41Z

Once the files have been downloaded, the regular bag validation process will handle it. We've been hesitant to put download support into bagit-python because it generally tends to get into a fair amount of code — people tend to ask for things like queuing, retries, concurrency controls, credentials & session management, storage management & cross-bag caching for identical files, etc. and have different opinions about what the answers to those look like.

I think there's a fairly reasonable argument to finish #119 and basically tell people that if they need anything more advanced it's probably best to use whatever system they prefer and simply use bagit-python to validate the final results.

bruth · 2019-04-04T14:34:32Z

> Once the files have been downloaded, the regular bag validation process will handle it. We've been hesitant to put download support into bagit-python because it generally tends to get into a fair amount of code

Yes, I agree with that. I support doing only the validation (looking up a data file entry in fetch.txt if found in the manifest file) and not downloading anything.

> I think there's a fairly reasonable argument to finish #119

But this does involve downloading the files. Doesn't this contradict what you said above?

acdha · 2019-04-04T14:41:33Z

I was just explaining why it hasn't happened before now. I do think there is a valid convenience argument for having a basic downloader for people who don't want anything fancy, however, so I'm open to accepting that pull-request as long as it doesn't get too complicated.

bruth · 2019-04-04T14:46:05Z

OK, understood. The #119 PR doesn't seem to validate the contents of the fetch.txt with respect to the manifest, so that could be a separate PR to perform that task, correct? If so, my team would be happy to contribute this.

acdha · 2019-04-04T14:53:59Z

I think the idea is that we'd have a simple fetch function and then immediately call validate() afterwards. It looks like #119 (comment) also has some additional validation checks for things listed in fetch.txt which aren't in the manifests, which we should probably handle now but probably as a separate PR, too.

bruth · 2019-04-04T15:06:24Z

Right, and the follow-up comment from @kba asserts the need to validate the fetch.txt file regardless of whether the files are downloaded. Again, for my use case, we don't want to download them simply for validation. So just to reiterate the scope of this feature, there are two goals:

1. Validate the fetch.txt file if present
   - Check that URLs are valid (well-formed)
   - Assert the path is listed in the manifest
   - This would be baked into the existing validation step for the bag
2. Support for downloading the files in fetch.txt
   - This would be separate from validation
   - Fetched files would be materialized into the data/ directory

Is this accurate?
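To make the first goal concrete, here is a minimal sketch of what a download-free check could look like. It is an illustration only, not the bagit-python API or the code in #119: `check_fetch_entries` is a hypothetical name, "well-formed" is approximated as "has a scheme and a host", and the set of manifest paths is assumed to have been read from the bag's manifest files beforehand.

```python
from urllib.parse import urlparse


def check_fetch_entries(fetch_lines, manifest_paths):
    """Return a list of problems found in fetch.txt lines.

    Hypothetical helper: performs no network access at all.
    fetch_lines    -- iterable of raw lines from fetch.txt
    manifest_paths -- set of payload paths listed in the manifests
    """
    problems = []
    for number, line in enumerate(fetch_lines, start=1):
        if not line.strip():
            continue  # tolerate blank lines
        parts = line.strip().split(None, 2)
        if len(parts) != 3:
            problems.append("line %d: expected 'url length path'" % number)
            continue
        url, _length, path = parts
        parsed = urlparse(url)
        # Assumption: a usable URL has both a scheme and a network location.
        if not parsed.scheme or not parsed.netloc:
            problems.append("line %d: malformed URL %r" % (number, url))
        if path not in manifest_paths:
            problems.append(
                "line %d: %s is not listed in any manifest" % (number, path)
            )
    return problems
```

An empty result would mean fetch.txt is consistent with the manifests; the checksums of the referenced files remain unverified until the files are actually present.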

kba added a commit to kba/bagit-python that referenced this issue Nov 27, 2018

Minimal implementation of fetching entries of fetch.txt, LibraryOfCon…

3c14161

…gress#118

kba mentioned this issue Nov 27, 2018

Download files in fetch.txt #119

Open

kba added a commit to kba/bagit-python that referenced this issue Dec 10, 2018

Minimal implementation of fetching entries of fetch.txt, LibraryOfCon…

6f0ae02

…gress#118

kba added a commit to kba/bagit-python that referenced this issue Dec 10, 2018

Minimal implementation of fetching entries of fetch.txt, LibraryOfCon…

dae7b40

…gress#118

sevein mentioned this issue Apr 3, 2019

Respect bag-it fetch.txt file archivematica/Issues#583

Open

terrywbrady mentioned this issue Feb 11, 2020

Explore the idea of a Reverse Manifest Download of an Object/Version CDLUC3/mrt-doc#233

Open

3 tasks
