Respect bag-it fetch.txt file #583
Comments
Thanks for this idea @bruth. You are correct that Archivematica has never implemented a fetch file in its bags.
Thanks @sromkey. I believe this check is performed here? Are there other client scripts that would need to be aware of the fetch.txt file natively? In our use case, we will be validating the bag prior to uploading it to the AM transfer space, but we just don't want validation to fail due to the scenario I stated above. One workaround would be to simply not choose "zipped/unzipped bag" as the transfer type. However, I was not sure of the ramifications of how the contents would be processed/re-structured if AM doesn't know it's a bag.
Ohhh I'm sorry, I misunderstood completely. I thought your request was for the bag that Archivematica makes as the AIP to support fetch! Do you have a sample bag I could test with by chance? I'm curious to see the behaviour. If you can, contact me off GitHub: sromkey [at] artefactual.com
Ah! Right that would be more difficult to manage for sure. Here is a minimal bag example with a single file listed in the manifest to the path The output error from AM (v1.7) was:
So it did not acknowledge the entry in the
Hi @bruth. We've recently pushed changes (not released yet) to use bagit-python where bagit-java v4 (via CLI) was used before. I thought that could bring different results, but I did a quick test using your minimal bag example and the library raises a validation error:
>>> import os
>>> import bagit
>>> bagit.VERSION
'1.7.0'
>>> bagit.Bag(os.getcwd()).validate()
data/TechCrunchcontinentalUSA.csv exists in manifest but was not found on filesystem
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/jesus/.local/lib/python3.6/site-packages/bagit.py", line 603, in validate
    processes=processes, fast=fast, completeness_only=completeness_only
  File "/home/jesus/.local/lib/python3.6/site-packages/bagit.py", line 785, in _validate_contents
    self._validate_completeness()
  File "/home/jesus/.local/lib/python3.6/site-packages/bagit.py", line 853, in _validate_completeness
    raise BagValidationError(_("Bag validation failed"), errors)
bagit.BagValidationError: Bag validation failed: data/TechCrunchcontinentalUSA.csv exists in manifest but was not found on filesystem
So maybe this could be done, but if I understood correctly that would be a change upstream in bagit-python. This issue seems related: LibraryOfCongress/bagit-python#118. We could also have Archivematica download the files before validation, but I understand that wouldn't always be desirable or work consistently.
Thanks @sevein. I will look into that bagit-python issue and see if I can help move that issue along.
We are working with bags containing many TBs of data (genomic data in this case), which is why we are using the fetch.txt file to begin with. We have an external process for managing that data in content-addressable storage, which makes it much easier to produce a fetch.txt file and manifest entries (containing the hash) for those large files.
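To illustrate the kind of output such an external process might produce, here is a minimal sketch in Python. It is not our actual tooling: the `RemoteFile` record and both helper functions are hypothetical names, the example URL is made up, and it assumes paths and URLs contain no characters that would need percent-encoding. The line layouts follow the BagIt spec: `<checksum> <path>` for a manifest and `<url> <length> <path>` for fetch.txt.

```python
from typing import NamedTuple


class RemoteFile(NamedTuple):
    """Hypothetical record describing one object in content-addressable storage."""
    sha256: str  # hex digest already known to the store
    size: int    # payload length in octets
    url: str     # location the payload can be retrieved from
    path: str    # path inside the bag, e.g. "data/sample.csv"


def manifest_line(f: RemoteFile) -> str:
    # One line of manifest-sha256.txt: checksum, whitespace, path.
    return f"{f.sha256} {f.path}"


def fetch_line(f: RemoteFile) -> str:
    # One line of fetch.txt: url, length, path.
    return f"{f.url} {f.size} {f.path}"


if __name__ == "__main__":
    example = RemoteFile("ab" * 32, 1024,
                         "https://example.org/objects/abab", "data/sample.csv")
    print(manifest_line(example))
    print(fetch_line(example))
```

Because the digest and size are already recorded by the store, neither line requires reading the multi-terabyte payload again.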
nb. now supported in Steffen's Golang bag tool: steffenfritz/bagit#6
Please describe the problem you'd like to be solved.
The BagIt spec defines a fetch.txt file for referring to remote files that should be considered as part of the bag. Per this linked section:
It does not appear that this file is respected. I only tested this on AM 1.7, but the 1.9 demo does not include a bag example utilizing the fetch.txt file either.
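For reference, each fetch.txt line has three whitespace-separated fields: a URL, a length in octets (or `-` when unknown), and the file path inside the bag. The following is a rough sketch of a parser, not code from Archivematica or bagit-python; `FetchEntry` and `parse_fetch` are names made up for this example, and percent-decoding of the path is left out for brevity.

```python
from typing import List, NamedTuple, Optional


class FetchEntry(NamedTuple):
    url: str
    length: Optional[int]  # None when fetch.txt gives the length as "-"
    path: str


def parse_fetch(text: str) -> List[FetchEntry]:
    """Parse the contents of a fetch.txt file into a list of entries."""
    entries = []
    for line in text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        # Split on the first two runs of whitespace only, so that a path
        # containing spaces stays in one piece. A line with fewer than
        # three fields raises ValueError here.
        url, length, path = line.split(None, 2)
        entries.append(
            FetchEntry(url, None if length == "-" else int(length), path.strip())
        )
    return entries
```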
Describe the solution you'd like to see implemented.
A basic solution would be to respect this file as a fallback if a file in the manifest is not physically present in the data/ directory. If a fetch.txt file exists and the path from the manifest exists in fetch.txt, then validation should pass.
An additional check (which should be optional) would be to validate the checksums of the remote files. This may not be desirable or feasible (auth required, file size is huge, etc.). However, it could be a nice additional option.
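The fallback described above could look roughly like the sketch below. This is only an illustration of the idea, not how Archivematica or bagit-python implements completeness checking; `missing_payload` is a hypothetical helper, and it assumes the manifest paths and fetch.txt paths have already been read into plain lists of bag-relative paths.

```python
import os
from typing import Iterable, List


def missing_payload(bag_dir: str,
                    manifest_paths: Iterable[str],
                    fetch_paths: Iterable[str]) -> List[str]:
    """Return manifest paths that are neither on disk nor listed in fetch.txt."""
    fetchable = set(fetch_paths)
    missing = []
    for rel in manifest_paths:
        # Bag paths always use "/" as the separator, whatever the platform.
        on_disk = os.path.isfile(os.path.join(bag_dir, *rel.split("/")))
        if not on_disk and rel not in fetchable:
            missing.append(rel)
    return missing
```

An empty result would mean the bag is complete in the relaxed sense proposed here: every manifest entry is either present under data/ or accounted for in fetch.txt. The optional remote checksum validation is deliberately not covered.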
Describe alternatives you've considered.
n/a
Additional context
n/a
For Artefactual use:
Please make sure these steps are taken before moving this issue from Review to Verified in Waffle: