Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Check all available checksum algorithm in DataVerse registry population #437

Open
wants to merge 6 commits into
base: main
Choose a base branch
from

Conversation

dokempf
Copy link
Contributor

@dokempf dokempf commented Oct 1, 2024

Fixes #435.

@dokempf
Copy link
Contributor Author

dokempf commented Oct 1, 2024

Failures are unrelated. Should I look into fixing them?

@santisoler
Copy link
Member

santisoler commented Oct 3, 2024

Hi @dokempf! Thanks for taking care of this.

I don't think this fix solves the issue: with it the populate_registry doesn't throws an error, but it doesn't build the registry either. For example, if you run this code you'll see that the registry is still empty:

import pooch

example = pooch.create(
    path=pooch.os_cache("example"),
    base_url="doi:10.34894/5SOKTV",
)
example.load_registry_from_doi()
print(example.registry)
{}

After inspecting the content of self.api_response.json()["data"]["latestVersion"]["files"][0] (assigned to filedata in the populate_registry method), I see that the dictionary doesn't contain the "md5" key anymore. Instead the checksum data is stored inside another dictionary under the "checksum" key:

{
    "id": 425558,
    "persistentId": "",
    "filename": "README.md",
    "contentType": "text/markdown",
    "friendlyType": "Markdown Text",
    "filesize": 2324,
    "storageIdentifier": "file://190251cc731-ca82ca1d341b",
    "rootDataFileId": -1,
    "checksum": {"type": "SHA-1", "value": "0e4b27fd3d76c75c37303f47e25d74ef407d0752"},
    "tabularData": False,
    "creationDate": "2024-06-17",
    "publicationDate": "2024-06-28",
    "fileAccessRequest": True,
}

I suspect there might be a change in the DataVerse API and how they return information about the checksum.

Therefore I think the fix should check if the filedata has either a "checksum" key and get the info from there, or if it has keys like the one you included in the fix. Would be nice to find some docs on the DataVerse API to understand the changes they made (sorry, I didn't search for it yet).

On a side note, don't worry about the pylint complain, I'll fix it in another PR. We should replace it for some modern alternative (ruff or flake8) at some point.

@santisoler
Copy link
Member

My bad. I just saw that you added a link to the docs in the issue. Thanks for that!

In the sample response they added in https://guides.dataverse.org/en/latest/api/native-api.html#import-a-dataset-into-a-dataverse-collection there's the "checksum" key I mentioned.

So, my take would be:

  • Maintain support for the "md5" key (in case DataVerse instances are using the old API).
  • Add support for other algorithm keys: "sha256", "sha1", etc.
  • Add support for the new API with the "checksum" key.
  • Raise a verbose error in case we couldn't find a checksum in the API response. I think if we couldn't populate the registry we should raise an error instead of quietly exiting.

@santisoler
Copy link
Member

BTW, I merged #438. If you update this branch you won't find the pylint error.

@dokempf
Copy link
Contributor Author

dokempf commented Nov 13, 2024

So, I updated the code to include both of your recommendations (new API response, error throwing). Unfortunately, I do not reach the coverage target with this change, but mocking different API results seems quite an effort for little benefit.

Restructure how the two APIs were being supported: the previous
implementation would raise an error when "md5" and "checksum" were not
available in the response, and not check for any other algorithms.
Correctly parse the checksum algorithm that is available in the new
Dataverse API: we need to convert "SHA-1" to "sha1" to name one.
@santisoler
Copy link
Member

Hi @dokempf. Sorry for the delay, these weeks were a little bit busy.

I noticed that your implementation was raising an error when "md5" and "checksum" were not in the response. Even if "sha1" (for example) was there, the code would just raise an error because the first two if statements were false. I just pushed a change that would fix it.

I also pushed a commit that adds some tests. I think we should include tests for this feature because of two reasons:

  1. To make sure that the implementation works as expected both for the old and new APIs.
  2. To ensure that any future refactor of this piece of code supports the same behaviours.

I know it's tedious to write these tests, but sometimes they are they only way to ensure things work as expected.

I still would like to add some more things to the tests:

  • In the fake response I should use "SHA-1" instead of "sha1" (the current behaviour).
  • Add a test that checks if the error is being raised.

But at least for now, the following example works:

import pooch

example = pooch.create(
    path=pooch.os_cache("example"),
    base_url="doi:10.34894/5SOKTV",
)
example.load_registry_from_doi()
print(example.registry)

example.fetch("README.md")

"id": 12345,
"filename": "foobar.txt",
"checksum": {
"type": self._algorithm.upper(),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be "SHA-1", "SHA-256", etc. Not just "SHA1" and "SHA256".

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

DOI registry assumes md5 hashing algorithm
2 participants