Move datasets to delete first in line #261

Open: wants to merge 2 commits into base: master

jbrown-xentity

We have confirmed reports at data.gov of datasets that get re-harvested
with an extra `1` appended to the URL.
The harvester seems to do its best to determine whether
a dataset is new or not,
but it still gets this wrong in some circumstances.
This change probably won't fix the bug, but it should mitigate it.
If the spatial harvester is essentially doing a "delete and add"
when it should be replacing, then processing the dataset removals first
means the name of the new dataset
won't collide with the one that is marked for deletion
but still in the system. This keeps the URL the same and breaks fewer workflows.

We have confirmed reports of datasets that get re-harvested
with an extra `1` appended to the URL.
The harvester seems to do its best to determine whether
a dataset is new or not,
but it still gets this wrong in some circumstances.
This change probably won't fix the bug, but it should mitigate it.
If the spatial harvester is essentially doing a "delete and add"
when it should be replacing, then processing the dataset removals first
should mean the name of the new dataset
won't collide with the one that is marked for deletion
but still in the system.
@amercader
Member

@jbrown-xentity It's been a long time since I worked on this, but IIRC the harvesters call package_delete to delete a dataset, which marks it as deleted but leaves it in the database (as opposed to a package_purge call). That means the dataset name can't be reused when creating a new one. Can you expand on why changing the order in which "to delete" harvest objects are created helps in this case? (I'm sure the changes help, I just want to understand better.)
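To make the name collision described above concrete, here is a minimal sketch using a toy in-memory catalog. It is not CKAN or ckanext-harvest code: `ToyCatalog` and its methods are hypothetical, and the "append a number when the name is taken" rule is an assumption modelled on the extra `1` reported in this PR.

```python
# Toy model only. ToyCatalog, create, soft_delete and purge are hypothetical
# names for illustration; they are not the CKAN API.
class ToyCatalog:
    def __init__(self):
        # name -> state, where state is "active" or "deleted"
        self.packages = {}

    def create(self, name):
        """Create a dataset; if the name is taken, append a number (assumed rule)."""
        candidate, suffix = name, 0
        while candidate in self.packages:
            suffix += 1
            candidate = f"{name}{suffix}"
        self.packages[candidate] = "active"
        return candidate

    def soft_delete(self, name):
        """Mark as deleted but keep the row, so the name stays reserved."""
        self.packages[name] = "deleted"

    def purge(self, name):
        """Remove the row entirely, which frees the name."""
        del self.packages[name]


# "Delete and add" with a soft delete: the old name is still reserved.
soft = ToyCatalog()
soft.create("roads")
soft.soft_delete("roads")
name_after_soft_delete = soft.create("roads")

# "Delete and add" with a purge: the old name is free again.
hard = ToyCatalog()
hard.create("roads")
hard.purge("roads")
name_after_purge = hard.create("roads")

print(name_after_soft_delete, name_after_purge)
```

In this toy model, the re-created dataset keeps its original name only on the purge path, whatever order the operations are queued in.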

If the harvester is managing the datasets in CKAN, the
harvest source should be the "source of truth".
In that case we shouldn't need the "revive" capability that comes with
soft-deleting packages/datasets in CKAN. I propose to actually purge
the dataset within CKAN.
Since it's difficult, if not nearly impossible, to track these files without a
unique id, the harvester will sometimes delete and create a new item if
the WAF or files change in any way. Purging would keep that behind the
scenes and allow the end user to get to the same dataset at the old URL.
@jbrown-xentity
Author

@amercader no, I believe you're right: we would need to purge the dataset. I forgot about that functionality. I believe we actually should be purging; I don't see a likely scenario where a user would want to keep or "revive" a dataset that was harvested and has been removed from source... I updated the PR to include the "purge" command instead of "delete".

@ccancellieri
Contributor

I'm running into a problem after purging a harvested dataset.
On the next run it is not harvested again, because the HarvestObject is still there tracking the date of last modification.
As a result you have to go into the DB (as I'm doing) and remove the harvest object by GUID.

I think the purge should eventually take care of the harvest object as well, or (since core can't depend on an extension) we have to provide a purge for the harvest object table.

3 participants