Respect bag-it fetch.txt file #583

Open
bruth opened this issue Mar 17, 2019 · 7 comments
Labels
Type: enhancement An improvement to existing functionality.

Comments

@bruth

bruth commented Mar 17, 2019

Please describe the problem you'd like to be solved.

The BagIt spec defines a fetch.txt file for referring to remote files that should be considered as part of the bag. Per this linked section:

Every file listed in the fetch file MUST be listed in every payload manifest. A fetch file MUST NOT list any tag files.

It does not appear that this file is respected. I only tested this on AM 1.7, but the 1.9 demo does not include a bag example utilizing the fetch.txt file either.

Describe the solution you'd like to see implemented.

A basic solution would be to respect this file as a fallback if a file in the manifest is not physically present in the data/ directory. If a fetch.txt file exists and the path from the manifest exists in fetch.txt then validation should pass.
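The fallback described above can be sketched roughly as follows. This is a minimal illustration, not Archivematica's or bagit-python's code: the function names are hypothetical, and it assumes the usual whitespace-separated layouts ("<checksum> <path>" per manifest line, "<url> <length> <path>" per fetch.txt line) with no percent-encoded characters in the paths.

```python
from pathlib import Path


def read_manifest_paths(manifest):
    """Return the payload paths listed in one manifest-<alg>.txt file."""
    paths = set()
    for line in manifest.read_text(encoding="utf-8").splitlines():
        if line.strip():
            # Split once so that paths containing spaces stay intact.
            _checksum, path = line.split(None, 1)
            paths.add(path.strip())
    return paths


def read_fetch_paths(fetch):
    """Return the payload paths listed in fetch.txt, or an empty set if absent."""
    if not fetch.is_file():
        return set()
    paths = set()
    for line in fetch.read_text(encoding="utf-8").splitlines():
        if line.strip():
            _url, _length, path = line.split(None, 2)
            paths.add(path.strip())
    return paths


def missing_payload_files(bag_dir):
    """List manifest paths that are neither on disk nor covered by fetch.txt."""
    bag = Path(bag_dir)
    fetchable = read_fetch_paths(bag / "fetch.txt")
    missing = set()
    for manifest in bag.glob("manifest-*.txt"):
        for path in read_manifest_paths(manifest):
            if not (bag / path).is_file() and path not in fetchable:
                missing.add(path)
    return sorted(missing)
```

An empty result would mean the completeness check passes; any path returned is missing locally and also has no fetch.txt entry.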

An additional check (which should be optional) would be to validate the checksums of the remote files. This may not be desirable or feasible (auth required, file size is huge, etc). However it could be a nice additional option.
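As a rough sketch of what the optional remote check could look like, the snippet below streams a URL through a hash without holding the whole file in memory. The function names and the `enabled` switch are hypothetical, and it assumes the URL is readable without authentication, which, as noted, will not always be the case.

```python
import hashlib
import urllib.request


def remote_checksum(url, algorithm="sha256", chunk_size=1 << 20):
    """Stream the resource at `url` and return its hex digest."""
    digest = hashlib.new(algorithm)
    with urllib.request.urlopen(url) as response:
        # Read in fixed-size chunks so very large files are never fully buffered.
        for chunk in iter(lambda: response.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_remote(url, expected, algorithm="sha256", enabled=True):
    """Return True/False for a checked file, or None when the check is disabled."""
    if not enabled:
        return None
    return remote_checksum(url, algorithm) == expected.lower()
```

Returning `None` when the check is switched off keeps "not checked" distinct from "checked and failed".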

Describe alternatives you've considered.

n/a

Additional context

n/a


For Artefactual use:
Please make sure these steps are taken before moving this issue from Review to Verified in Waffle:

  • All PRs related to this issue are properly linked 👍
  • All PRs related to this issue have been merged 👍
  • Test plan for this issue has been implemented and passed 👍
  • Documentation regarding this issue has been written and it has been added to the release notes, if needed 👍
@sromkey sromkey added the Type: enhancement An improvement to existing functionality. label Mar 17, 2019
@sromkey
Contributor

sromkey commented Mar 17, 2019

Thanks for this idea, @bruth. You are correct that Archivematica has never implemented a fetch file in its bags.

@bruth
Author

bruth commented Mar 19, 2019

Thanks @sromkey. I believe this check is performed here? Are there other client scripts that would need to be aware of the fetch.txt file natively? In our use case, we will be validating the bag prior to uploading it to the AM transfer space, but we just don't want validation to fail due to the scenario I stated above.

One workaround would be to simply not choose "zipped/unzipped bag" as the transfer type. However, I was not sure how the contents would be processed or restructured if AM doesn't know it's a bag.

@sromkey
Contributor

sromkey commented Mar 19, 2019

Ohhh I'm sorry, I misunderstood completely. I thought your request was for the bag that Archivematica makes as the AIP support fetch! Do you have a sample bag I could test with, by chance? I'm curious to see the behaviour. If you can, contact me off GitHub: sromkey [at] artefactual.com

@bruth
Author

bruth commented Mar 19, 2019

I thought your request was for the bag that Archivematica makes as the AIP support fetch!

Ah! Right, that would be more difficult to manage for sure. Here is a minimal bag example: the manifest lists a single file at the path data/TechCrunchcontinentalUSA.csv (with the correct checksum), and fetch.txt has an entry mapping a remote URL to that same path.

The output error from AM (v1.7) was:

Result is false.
(error) Payload manifest manifest-sha256.txt contains missing file(s): [data/TechCrunchcontinentalUSA.csv]

So it did not acknowledge the entry in the fetch.txt even though that data path is listed.

@sevein
Contributor

sevein commented Apr 3, 2019

Hi @bruth. We've recently pushed changes (not released yet) to use bagit-python where bagit-java v4 (via CLI) was used before. I thought that could bring different results but I did a quick test using your minimal bag example and the library raises a bagit.BagValidationError:

>>> import os
>>> import bagit
>>> bagit.VERSION
'1.7.0'
>>> bagit.Bag(os.getcwd()).validate()
data/TechCrunchcontinentalUSA.csv exists in manifest but was not found on filesystem
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/jesus/.local/lib/python3.6/site-packages/bagit.py", line 603, in validate
    processes=processes, fast=fast, completeness_only=completeness_only
  File "/home/jesus/.local/lib/python3.6/site-packages/bagit.py", line 785, in _validate_contents
    self._validate_completeness()
  File "/home/jesus/.local/lib/python3.6/site-packages/bagit.py", line 853, in _validate_completeness
    raise BagValidationError(_("Bag validation failed"), errors)
bagit.BagValidationError: Bag validation failed: data/TechCrunchcontinentalUSA.csv exists in manifest but was not found on filesystem

validate() did not raise when the file had been previously downloaded.

A basic solution would be to respect this file as a fallback if a file in the manifest is not physically present in the data/ directory. If a fetch.txt file exists and the path from the manifest exists in fetch.txt then validation should pass.

So maybe this could be done, but if I understood correctly that'd be a change upstream in bagit-python. This issue seems related: LibraryOfCongress/bagit-python#118.

We could also have Archivematica download the files before validation but I understand that wouldn't be always desirable or something that would work consistently.
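For bags small enough that downloading is acceptable, the "download before validation" option might look roughly like this. It is only a sketch under stated assumptions: `fetch_missing` is a hypothetical helper, the URLs are assumed to need no authentication, and the length column of fetch.txt is ignored.

```python
import shutil
import urllib.request
from pathlib import Path


def fetch_missing(bag_dir):
    """Download fetch.txt entries that are not yet on disk; return their paths."""
    bag = Path(bag_dir).resolve()
    fetch_file = bag / "fetch.txt"
    downloaded = []
    if not fetch_file.is_file():
        return downloaded
    for line in fetch_file.read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        url, _length, rel_path = line.split(None, 2)
        rel_path = rel_path.strip()
        dest = (bag / rel_path).resolve()
        # Refuse entries that would be written outside the payload directory.
        if (bag / "data") not in dest.parents:
            raise ValueError(f"unsafe fetch.txt path: {rel_path}")
        if dest.exists():
            continue
        dest.parent.mkdir(parents=True, exist_ok=True)
        with urllib.request.urlopen(url) as response, open(dest, "wb") as out:
            shutil.copyfileobj(response, out)
        downloaded.append(rel_path)
    return downloaded
```

After a call such as `fetch_missing(os.getcwd())`, the `bagit.Bag(os.getcwd()).validate()` call shown earlier would find the files on the filesystem, which matches the observation that validate() did not raise once the file had been downloaded.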

@bruth
Author

bruth commented Apr 4, 2019

Thanks @sevein. I will look into that bagit-python issue and see if I can help move that issue along.

We could also have Archivematica download the files before validation but I understand that wouldn't be always desirable or something that would work consistently.

We are working with bags containing many TBs of data (genomic data in this case), which is why we are using the fetch.txt file to begin with. We have an external process for managing that data in content-addressable storage, which makes it much easier to produce a fetch.txt file and manifest entries (containing the hash) for those large files.
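To illustrate why content-addressable storage makes this easy, here is a small sketch that derives both files from the stored hashes. The `objects` mapping, the `base_url` layout (one URL per hash), and the function name are all assumptions made for the example; the actual external process is not described in this thread. Paths are assumed to contain no characters that would need percent-encoding.

```python
def bag_entries(objects, base_url):
    """Build manifest-sha256.txt and fetch.txt text from known hashes.

    `objects` maps a payload path to a (sha256_hex, size_in_bytes) tuple.
    """
    manifest_lines = []
    fetch_lines = []
    for path, (sha256, size) in sorted(objects.items()):
        # Manifest line: "<checksum> <path>"
        manifest_lines.append(f"{sha256}  {path}")
        # fetch.txt line: "<url> <length> <path>"
        fetch_lines.append(f"{base_url}/{sha256} {size} {path}")
    return "\n".join(manifest_lines) + "\n", "\n".join(fetch_lines) + "\n"
```

Because the hash is already known to the store, neither file requires reading the payload bytes again.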

@ross-spencer
Contributor

nb. now supported in Steffen's Golang bag tool: steffenfritz/bagit#6
