Support Dataverse files without a persistentID #355
Add support for downloading Dataverse files that don't have a persistent ID. Use the file ID instead.
We should add some tests for this bugfix that download a file from a Dataverse repository that doesn't provide persistent IDs for its files. I would avoid using preexisting repositories: we want very small files, since these tests will be run multiple times. Creating a version of the Pooch test data in another Dataverse repository would be nice.
More details in the comment I left but I'd suggest always downloading files using the (database) ID.
Please feel free to hit me up on https://chat.dataverse.org if you have any questions. Thanks for teaching Pooch to download from Dataverse! 🐶 ❤️
pooch/downloaders.py
Outdated
persistent_id = files[file_name]["persistentId"]
if persistent_id:
The file ID will always be there. It's the primary key in the database.
The persistent ID (DOI or Handle) is optional so you can't rely on it being there. I would simply avoid even checking for it if all you want to do is download the file.
I see. I thought `persistentId` was always there, just empty if the file doesn't have one. Sorry for the misunderstanding.
If that's the case, you're right, we shouldn't assume that `persistentId` will always be there. I'll change the `if` statement then.
BTW, do you have any example where `persistentId` is not even included in the response (I'm thinking for testing purposes)?
Well, from a quick test it looks like `persistentId` is always present but can be an empty string. I'll put an example below.
I think we're saying the same thing. Always there. Sometimes an empty string. So I'd suggest checking for `id` instead which will always be there and always be a number. I hope this helps! 😄
curl -s 'https://dataverse.unc.edu/api/datasets/:persistentId?persistentId=doi:10.15139/S3/TRSZ3X' | jq '.data.latestVersion.files[0]'
{
  "description": "summary data file",
  "label": "CureTB data summary and statistics.tab",
  "restricted": false,
  "version": 3,
  "datasetVersionId": 32878,
  "dataFile": {
    "id": 7527010,
    "persistentId": "",
    "pidURL": "",
    "filename": "CureTB data summary and statistics.tab",
    "contentType": "text/tab-separated-values",
    "filesize": 3660,
    "description": "summary data file",
    "storageIdentifier": "s3://unc-dataverse-prod:18704bbebb7-8e1788510d33",
    "originalFileFormat": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
    "originalFormatLabel": "MS Excel Spreadsheet",
    "originalFileSize": 335887,
    "originalFileName": "CureTB data summary and statistics.xlsx",
    "UNF": "UNF:6:P0vi0QJZpFxCwM3pcX9YJw==",
    "rootDataFileId": -1,
    "md5": "4804fa6347742d850b0e1753c1668882",
    "checksum": {
      "type": "MD5",
      "value": "4804fa6347742d850b0e1753c1668882"
    },
    "creationDate": "2023-03-21"
  }
}
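For illustration, a lookup keyed on the numeric `id` could be built from a response shaped like the one above. This is only a minimal sketch: `file_ids_by_name` is a hypothetical helper (not Pooch's actual code), and it assumes the list passed in has the same structure as `data.latestVersion.files` in the `jq` output.

```python
def file_ids_by_name(files):
    """Map each file name to its numeric database ID.

    `files` is assumed to be the list found under
    `data.latestVersion.files` in the dataset API response.
    """
    return {
        entry["dataFile"]["filename"]: entry["dataFile"]["id"]
        for entry in files
    }


# Trimmed copy of the entry shown above: persistentId is present but empty.
sample_files = [
    {
        "label": "CureTB data summary and statistics.tab",
        "dataFile": {
            "id": 7527010,
            "persistentId": "",
            "filename": "CureTB data summary and statistics.tab",
        },
    }
]

ids = file_ids_by_name(sample_files)
print(ids["CureTB data summary and statistics.tab"])
```

The empty `persistentId` never enters the lookup, so the sketch behaves the same whether or not the installation mints file-level PIDs.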
> Well, from a quick test it looks like `persistentId` is always present but can be an empty string.

Great then! The following two lines actually work both when `persistentId` is missing and when it is an empty string. So I think we could leave them as they are, just in case `persistentId` is dropped from any Dataverse API in the future.
Lines 1049 to 1050 in 1e670cf
persistent_id = files[file_name].get("persistentId")
if persistent_id:
> I think we're saying the same thing. Always there. Sometimes an empty string. So I'd suggest checking for `id` instead which will always be there and always be a number.

My strategy is to check for the `id` only if the `persistentId` is missing. This follows from what I commented above about defaulting to `persistentId`, since it is the first option offered in the Dataverse docs.
Maybe I'm being too conservative about it... I'm trying to keep the chances of breaking backward compatibility as low as possible, while still supporting the cases where `persistentId` is missing.
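The fallback strategy described in this comment can be sketched roughly as follows. `pick_identifier` is a hypothetical name used only for illustration, and the DOI in the usage lines is a made-up placeholder, not a real dataset.

```python
def pick_identifier(file_entry):
    """Prefer a non-empty persistentId, otherwise fall back to the file id.

    Returns a (kind, value) tuple so the caller knows which one was chosen.
    """
    persistent_id = file_entry.get("persistentId")
    # A missing key gives None and an empty PID gives "": both are falsy,
    # so a single truthiness check covers the two cases.
    if persistent_id:
        return ("persistentId", persistent_id)
    return ("id", file_entry["id"])


# Hypothetical entries covering the three situations discussed in the thread.
with_pid = {"id": 1, "persistentId": "doi:10.0000/EXAMPLE"}  # placeholder DOI
empty_pid = {"id": 7527010, "persistentId": ""}
missing_pid = {"id": 3}

print(pick_identifier(with_pid))
print(pick_identifier(empty_pid))
print(pick_identifier(missing_pid))
```

Note that the `id` branch is the only one that works for all three entries, which is the argument for checking `id` alone.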
I would suggest changing your strategy to this:
- Only check for `id`
Simple! 😄
The `persistentId` key might be missing from the API response, while the `id` is always there, so don't assume `persistentId` exists when deciding which ID to use to download the files.
Whether `persistent_id` is `None` or an empty string, `if persistent_id:` evaluates it as false.
Since the API docs say:
Basic access URI:
/api/access/datafile/$id
and only after that they have a box saying you can also use the persistent ID, I think we can go with @pdurbin's advice of only using the ID instead of the PID. That would simplify the testing and the code, and it would only break if Dataverse were to break their API. That should be fine: if we assume they could break things by removing the ID, then they could break in so many other ways that we have no way to predict. If it does happen, we can always issue a patch. But I think it's unlikely that they will without bumping the API version. The case of Zenodo from #373 seems like a good example that things may break, but unintentionally, in which case we probably only have to report the issue.
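As a rough sketch of the ID-only approach, the basic access URI quoted from the docs can be assembled from the server address and the numeric file ID. `datafile_url` is a hypothetical helper, not the function that ended up in Pooch; the server and file ID are the ones from the example response earlier in this thread.

```python
def datafile_url(base_url, file_id):
    """Build the basic access URI (/api/access/datafile/$id) for a file."""
    # Strip a trailing slash so the path is not doubled up.
    return f"{base_url.rstrip('/')}/api/access/datafile/{file_id}"


url = datafile_url("https://dataverse.unc.edu", 7527010)
print(url)  # https://dataverse.unc.edu/api/access/datafile/7527010
```

Because only `id` is involved, there is no branch on `persistentId` left to test.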
I can merge in main and make the changes since @santisoler is on vacation.
Plus, finding a Dataverse instance that doesn't have persistent IDs and would be willing to host the Pooch test data would be non-trivial. And they could always enable the PIDs and break our tests without any warning.
15f7536 looks like a nice simplification. By the way, we have a new changelog for breaking changes to the Dataverse API, which we hope to keep short! Here's how it looks as of Dataverse 6.1: https://guides.dataverse.org/en/6.1/api/changelog.html
Thanks for sharing @pdurbin!
Add support for downloading Dataverse files that don't have a persistent ID. Use the file ID instead.
Relevant issues/PRs:
Fixes #354
TODO