Download files in fetch.txt #118

Open
kba opened this issue Nov 15, 2018 · 8 comments

Comments

@kba
Contributor

kba commented Nov 15, 2018

How would I go about completing an incomplete bag, which has files referenced in fetch.txt not present in /data?

Is this outside the domain of the tool or just not implemented? Or have I missed something?

If the latter, would this be an interesting feature for bagit-python or should we implement it on our side?

@kba
Contributor Author

kba commented Nov 15, 2018

BTW, an ad-hoc solution, without any checks, is this bash one-liner:

while read -r url size fpath; do mkdir -p "${fpath%/*}"; wget -O "$fpath" "$url"; done < fetch.txt
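
For anyone who prefers Python to shell, the parsing half of that one-liner might look roughly like the sketch below. It uses only the standard library. `FetchEntry` and `parse_fetch_line` are names made up for this illustration and are not part of bagit-python. It assumes each line has the form `url length path`, as the one-liner does, and that `-` stands for an unknown length.

```python
from typing import NamedTuple, Optional


class FetchEntry(NamedTuple):
    """One line of fetch.txt (hypothetical helper type, not a bagit-python class)."""
    url: str
    length: Optional[int]  # None when fetch.txt gives "-" (unknown size)
    path: str


def parse_fetch_line(line: str) -> FetchEntry:
    """Split one fetch.txt line into URL, length and bag-relative path."""
    # Split on whitespace at most twice so the path may contain spaces,
    # mirroring `read -r url size fpath` in the shell one-liner.
    parts = line.strip().split(None, 2)
    if len(parts) != 3:
        raise ValueError(f"malformed fetch.txt line: {line!r}")
    url, size, path = parts
    length = None if size == "-" else int(size)
    return FetchEntry(url, length, path)
```

Unlike the shell version, this rejects lines with fewer than three fields instead of passing empty values on to the downloader.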

@bruth

bruth commented Apr 4, 2019

Hi @kba and @acdha, I am cross-posting here from the issue @sevein referenced (visible above): archivematica/Issues#583. I also want support for the fetch.txt file, but I only need validation, not automatic downloading of the files. For my use case I have bags that contain (reference) TBs of data that are already in archive-quality, content-addressable storage.

My team and I are happy to contribute reviews or code to get, at a minimum, the validation functionality in, rather than fetching of the files. My feeling is that the default should be to validate without fetching, and to rely on a parameter to trigger a fetch.

@acdha
Member

acdha commented Apr 4, 2019

Once the files have been downloaded, the regular bag validation process will handle it. We've been hesitant to put download support into bagit-python because it generally tends to get into a fair amount of code — people tend to ask for things like queuing, retries, concurrency controls, credentials & session management, storage management & cross-bag caching for identical files, etc. and have different opinions about what the answers to those look like.

I think there's a fairly reasonable argument to finish #119 and basically tell people that if they need anything more advanced it's probably best to use whatever system they prefer and simply use bagit-python to validate the final results.

@bruth

bruth commented Apr 4, 2019

> Once the files have been downloaded, the regular bag validation process will handle it. We've been hesitant to put download support into bagit-python because it generally tends to get into a fair amount of code

Yes, I agree with that. I am in support of only doing the validation (looking up a data file's entry in fetch.txt if it is found in the manifest file) and not downloading anything.

> I think there's a fairly reasonable argument to finish #119

But this does involve downloading the files. Doesn't this contradict what you said above?

@acdha
Member

acdha commented Apr 4, 2019

I was just explaining why it hasn't happened before now. I do think there is a valid convenience argument for having a basic downloader for people who don't want anything fancy, however, so I'm open to accepting that pull-request as long as it doesn't get too complicated.

@bruth

bruth commented Apr 4, 2019

OK, understood. The #119 PR doesn't seem to validate the contents of fetch.txt against the manifest, so that could be a separate PR, correct? If so, my team would be happy to contribute it.

@acdha
Member

acdha commented Apr 4, 2019

I think the idea is that we'd have a simple fetch function and then immediately call validate() afterwards. It looks like #119 (comment) also has some additional validation checks for things listed in fetch.txt which aren't in the manifests, which we should handle now, but probably as a separate PR too.
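
As a rough illustration of "fetch, then immediately validate", a deliberately basic downloader could look like the sketch below. This is not the code in #119. `fetch_missing` is a hypothetical name, and the validation step is passed in as a callable so that the sketch runs with the standard library alone. With bagit-python installed, one would presumably pass something like `lambda d: bagit.Bag(d).validate()`.

```python
import os
import urllib.request
from typing import Callable, List


def fetch_missing(bag_dir: str, validate: Callable[[str], None]) -> List[str]:
    """Download fetch.txt entries that are absent, then run `validate`."""
    fetched = []
    with open(os.path.join(bag_dir, "fetch.txt"), encoding="utf-8") as handle:
        for line in handle:
            if not line.strip():
                continue
            url, _size, rel_path = line.strip().split(None, 2)
            # Refuse paths that would land outside the bag directory.
            if os.path.isabs(rel_path) or ".." in rel_path.split("/"):
                raise ValueError(f"unsafe path in fetch.txt: {rel_path!r}")
            target = os.path.join(bag_dir, rel_path)
            if os.path.exists(target):
                continue  # already present, nothing to fetch
            os.makedirs(os.path.dirname(target), exist_ok=True)
            # No retries, queuing, concurrency or credentials: deliberately basic.
            with urllib.request.urlopen(url) as response, open(target, "wb") as out:
                out.write(response.read())
            fetched.append(rel_path)
    # Validation runs once, after all downloads have finished.
    validate(bag_dir)
    return fetched
```

The function returns the bag-relative paths it actually downloaded, so a caller can tell a completed bag from one that was already whole.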

@bruth

bruth commented Apr 4, 2019

Right, and the follow-up comment from @kba asserts the need to validate the fetch.txt file regardless of whether the files are downloaded. Again, for my use case, we don't want to download them simply for validation. So, just to reiterate the scope of this feature, there are two goals:

  • Validate the fetch.txt file if present
    • Check that URLs are valid (well-formed)
    • Assert the path is listed in the manifest
    • This would be baked into the existing validation step for the bag
  • Support for downloading the files in fetch.txt
    • This would be separate from validation
    • Fetched files would be materialized into the data/ directory

Is this accurate?
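
The two checks under the first goal above can be sketched without downloading anything. The code below is only an illustration under stated assumptions: `check_fetch_entries` is a hypothetical name, "well-formed" is taken to mean nothing more than "parses and has a scheme plus a host or path", and the manifest is assumed to have been read already into a set of bag-relative paths.

```python
from typing import Iterable, List, Set
from urllib.parse import urlsplit


def check_fetch_entries(fetch_lines: Iterable[str], manifest_paths: Set[str]) -> List[str]:
    """Return the problems found; an empty list means both checks passed."""
    problems = []
    for number, line in enumerate(fetch_lines, start=1):
        if not line.strip():
            continue
        parts = line.strip().split(None, 2)
        if len(parts) != 3:
            problems.append(f"line {number}: expected 'url length path'")
            continue
        url, _length, path = parts
        # Check 1: the URL is well-formed (assumed meaning: scheme + host or path).
        try:
            split = urlsplit(url)
            well_formed = bool(split.scheme) and bool(split.netloc or split.path)
        except ValueError:
            well_formed = False
        if not well_formed:
            problems.append(f"line {number}: URL is not well-formed: {url}")
        # Check 2: the path must also be listed in a manifest.
        if path not in manifest_paths:
            problems.append(f"line {number}: {path} is not listed in the manifest")
    return problems
```

Returning a list of messages rather than raising on the first problem lets the existing validation step report every bad fetch.txt entry in one pass.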
