
Allow full clear of completed jobs #503

Open · wants to merge 18 commits into master
Conversation

bonnland
Contributor

@bonnland bonnland commented May 18, 2022

The current behavior of the source clear-history -k true command is to retain all job ids where some harvested dataset came from that job id. This seems unnecessary, and for our CKAN instance, with many thousands of harvested datasets that get added/updated in small batches, it means that there is always a very long job history list.

This PR requires a small two-line change to the ckan/ckanext-spatial WAF harvesting code to handle the case where a harvest job ID has been cleared. Other harvesters may need similar small adjustments; I've only tested the WAF harvesting code because that is my organization's particular use case.

I have also changed the options for this command so that currently running jobs are always kept. Clearing the history of jobs that are still running seems confusing, potentially unhelpful, and unnecessary.

I have also removed the option of clearing currently harvested objects as part of the "history clear" behavior. See discussion below about why this may be a good idea.

I know that this initial PR version has failing tests. I do not have testing set up on my current Vagrant VM, and will try to fix the tests by looking at the GitHub test output.

Please feel free to suggest possible alternatives or edits. Thank you!

@bonnland
Contributor Author

Related PR: ckan/ckanext-spatial#284

@bonnland
Contributor Author

Note: I've verified on my Vagrant VM that currently running harvest jobs are not cleared when the job-history clear command is given. It is a source-based CKAN 2.9.5 install running uwsgi+nginx, with WAF harvesting enabled.

@bonnland
Contributor Author

Related to #293

@@ -356,7 +357,7 @@ def define_harvester_tables():
     Column('state', types.UnicodeText, default=u'WAITING'),
     Column('metadata_modified_date', types.DateTime),
     Column('retry_times', types.Integer, default=0),
-    Column('harvest_job_id', types.UnicodeText, ForeignKey('harvest_job.id')),
+    Column('harvest_job_id', types.UnicodeText, ForeignKey('harvest_job.id', ondelete='SET NULL'), nullable=True),
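The effect of the proposed ondelete='SET NULL' change can be sketched in isolation. Below is a minimal standalone model, not the real ckanext-harvest tables (columns are trimmed and names are illustrative), assuming SQLAlchemy 1.4+ with an in-memory SQLite database for the demo:

```python
# Minimal sketch of ForeignKey(..., ondelete='SET NULL'): deleting a harvest
# job leaves its harvest objects in place with harvest_job_id cleared.
# Toy schema only -- not the actual ckanext-harvest model definitions.
from sqlalchemy import create_engine, event, Column, Text, ForeignKey
from sqlalchemy.orm import declarative_base, Session

Base = declarative_base()

class HarvestJob(Base):
    __tablename__ = 'harvest_job'
    id = Column(Text, primary_key=True)

class HarvestObject(Base):
    __tablename__ = 'harvest_object'
    id = Column(Text, primary_key=True)
    harvest_job_id = Column(Text,
                            ForeignKey('harvest_job.id', ondelete='SET NULL'),
                            nullable=True)

engine = create_engine('sqlite://')

# SQLite only enforces foreign keys (and ON DELETE actions) with this pragma on.
@event.listens_for(engine, 'connect')
def _enable_fks(dbapi_conn, _record):
    dbapi_conn.execute('PRAGMA foreign_keys=ON')

Base.metadata.create_all(engine)

with Session(engine) as s:
    s.add(HarvestJob(id='job-1'))
    s.add(HarvestObject(id='obj-1', harvest_job_id='job-1'))
    s.commit()
    # Delete the job at the SQL level; the FK action fires in the database.
    s.execute(HarvestJob.__table__.delete().where(HarvestJob.id == 'job-1'))
    s.commit()
    obj = s.get(HarvestObject, 'obj-1')
    job_id_after = obj.harvest_job_id
    print(job_id_after)  # None: the object survives, but its job link is cleared
```

Note that this only demonstrates the column behavior; as pointed out below, existing installs would still need a migration script to pick up the altered constraint.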
Member

Changing the table structure would require a migration script; otherwise the change would not be applied to existing instances.

Contributor Author

I agree that a schema change creates some challenges for existing databases. However, the ckanapi package provides very good support for exporting and re-importing users, packages, and organizations, so standing up a fresh database instance with the imported data is not very time consuming. I could write up a set of instructions for the wiki if that would be useful.

@seitenbau-govdata
Member

Hi @bonnland. Thanks for the interesting pull request. Could you explain the advantage of deleting the harvest jobs and keeping the harvest objects?

The current implementation with the option -k true preserves the harvest jobs with current harvest objects as long as a harvest job has at least one current harvest object. And these harvest jobs with their reports are still available in the UI.

@bonnland
Contributor Author

I just realized that the user option --keep_current for keeping current harvest objects would no longer be needed with this pull request.

If there are organizations who want duplicate packages in the package table, this pull request should not be accepted. It's hard to think of a reason why duplicate records would be good, though, from my perspective.

@bonnland
Contributor Author

bonnland commented May 18, 2022

Hi @bonnland. Thanks for the interesting pull request. Could you explain the advantage of deleting the harvest jobs and keeping the harvest objects?

Hi @seitenbau-govdata, thanks for the attention and interest. Keeping the harvest objects allows tracking of what has been harvested already. If all harvest objects are cleared as part of the history clear, then the harvester will re-harvest datasets that already appear in the package table, and duplicate rows in the package table will be created. For our organization, this is costly because our harvest sources have thousands of datasets. The package table fills up quickly with duplicate rows if harvest objects are not kept, and it can take hours to re-harvest everything.

So I would argue that keeping current harvested objects should at the very least be the default behavior. Currently, it requires passing a flag. Users who do not realize they must pass a flag eventually discover that their package table grows with repeated uses of the "history clear" command. They also discover that every harvested dataset URL has a strange integer at the end, which increments every time the "history clear" command is used.

Also, the advantage of clearing all completed harvest jobs is that it becomes much easier to track how often a harvest source is being updated. Without this pull request, even when using the -k flag for keeping harvest objects, I end up with a "cleared" job history list that is very large. It seems unnecessary and confusing to not have the option to fully clear all completed jobs.

@bonnland
Contributor Author

And these harvest jobs with their reports are still available in the UI.

For our purposes, the harvest job reports help in the short term, to fix harvesting errors. Their usefulness rarely goes past a few days.

Is it useful to have job reports for cases where the datasets were harvested successfully? Because it seems that harvest job ids are kept when datasets are harvested successfully. If you can explain how this is helpful, I would appreciate knowing the advantages.

@@ -216,18 +216,12 @@ def clear_harvest_source_history(source_id, keep_current):
     if source_id is not None:
         tk.get_action("harvest_source_job_history_clear")(context, {
             "id": source_id,
-            "keep_current": keep_current
         })
         return "Cleared job history of harvest source: {0}".format(source_id)
Contributor Author

@bonnland bonnland May 18, 2022

Note the language used in the return statement. This command is most useful if it clears only the "job history", not the entire source history. Perhaps renaming the command to "clear-job-history" would be better, as it more clearly states the eventual outcome of the command.

@bonnland
Contributor Author

I should add that our organization adds new records individually, at an average rate of 3-4 per week. That means I have a "cleared" job history list with over 100 entries for some of our WAFs. This might not be the usual case, so the urgency of this PR is probably less relevant for others than it is for us.

@Zharktas
Member

And these harvest jobs with their reports are still available in the UI.

For our purposes, the harvest job reports help in the short term, to fix harvesting errors. Their usefulness rarely goes past a few days.

Is it useful to have job reports for cases where the datasets were harvested successfully? Because it seems that harvest job ids are kept when datasets are harvested successfully. If you can explain how this is helpful, I would appreciate knowing the advantages.

Not all instances harvest daily. For you the history might be relevant for only a few days, but others might need to see what happened three months ago.

What I would do is add a separate command for what you are trying to achieve, or at least an option on the current one. It would be a nasty surprise for someone to run this command and find that its functionality has changed.

@bonnland
Contributor Author

bonnland commented May 19, 2022

What I would do is have separate command for what you are trying to achieve or at least an option in the current one. It would be a nasty surprise for someone running this command and noticing the functionality has changed.

I am not sure I understand. The command is run when someone wants to clear the job history. If they don't want to clear the job history, then they do not run the command.

What this pull request does is make this command behave as it did before 2016 or 2017. I am not sure why it was changed; it worked very well before.

@Zharktas
Member

There's discussion on #484 and #397 why the change was made originally.

@bonnland
Contributor Author

bonnland commented May 19, 2022

There's discussion on #484 and #397 why the change was made originally.

OK, perhaps it would be better to make a new command available. How does "harvester source clear-job-history" sound? It would still require a change to the foreign key constraint on the harvest_object table.

@seitenbau-govdata
Member

seitenbau-govdata commented May 19, 2022

What this pull request does is make this command behave as it did before 2016 or 2017. I am not sure why it was changed; it worked very well before.

We introduced the command clearsource_history (as click command source clear-history) with #268 at the end of 2016. As far as I know, the only change afterwards was the fix #484.

@bonnland
Contributor Author

We have introduced the command clearsource_history (as click command source clear-history) with #268 at the end of 2016. And I mean the only change afterwards was the fix #484.

There used to be a command that would clear all completed jobs. It was very helpful. I suppose it disappeared when the clearsource_history command was added.

@seitenbau-govdata
Member

seitenbau-govdata commented May 19, 2022

There used to be a command that would clear all completed jobs. It was very helpful. I suppose it disappeared when the clearsource_history command was added.

No, no command was removed when the clearsource_history command was added. The only command was, and still is, clearsource (as click command source clear), which deletes the source together with all its datasets. That was the reason we introduced the clearsource_history command: there was no command for deleting only the harvest job history. If such a command had existed, we would not have introduced clearsource_history.

@bonnland
Contributor Author

No, no command was removed when the clearsource_history command was added. The only command was, and still is, clearsource (as click command source clear), which deletes the source together with all its datasets. That was the reason we introduced the clearsource_history command: there was no command for deleting only the harvest job history. If such a command had existed, we would not have introduced clearsource_history.

I remember a time when I could clear all completed jobs, and it would not cause duplicate rows in the package table after it was used. This issue of the package table filling up is potentially serious, and I am surprised it has not come up somewhere before.

@seitenbau-govdata
Member

I remember a time when I could clear all completed jobs, and it would not cause duplicate rows in the package table after it was used. This issue of the package table filling up is potentially serious, and I am surprised it has not come up somewhere before.

Maybe somewhere else, in a fork? But unfortunately not in ckanext-dcat. Yes, I agree: with many thousands of datasets it is a really serious and huge problem. After going into production in early 2016 with more than 20,000 datasets and a harvesting interval of 2 days, we noticed after a few months that the size of our database was increasing and the harvest UI was getting slower. That is why we started to implement the new command.

@bonnland
Contributor Author

With many thousands of datasets it is really serious and a huge problem.

It is helpful to know that our organization is not the only one who has had this problem. The current interface does not provide any way of clearing past jobs without creating an entire new set of duplicate rows in the package table.

@seitenbau-govdata
Member

It is helpful to know that our organization is not the only one who has had this problem. The current interface does not provide any way of clearing past jobs without creating an entire new set of duplicate rows in the package table.

Actually, this should be possible with the new option --keep-current (short form -k). We use harvester source clear-history -k true to delete the old harvest jobs and keep the latest harvest jobs with the current harvest objects. Does this not work for you?

@bonnland
Contributor Author

bonnland commented May 19, 2022

Actually, this should be possible with the new option --keep-current (short form -k). We use harvester source clear-history -k true to delete the old harvest jobs and keep the latest harvest jobs with the current harvest objects. Does this not work for you?

It keeps hundreds of harvest jobs when I use this command: any harvest job with a successfully harvested dataset is retained. For us, that is hundreds of jobs, many of which did nothing more than add a single new record to the harvest source. There is very little useful information in the jobs that remain, because they all represent successful harvests. An ever-growing list of successful harvests carries little useful information for our organization, and it makes it difficult to track changes to the harvesting behavior.

After more thought, perhaps some organizations want to track the rate of successful harvests over time. Maybe this is useful to some, but that is not true for us. And it seems that some organizations would want to "reset" their tracking by clearing all jobs at some point, without creating a full set of duplicate rows in the package table.

@bonnland
Contributor Author

bonnland commented May 19, 2022

We use harvester source clear-history -k true to delete the old harvest jobs and keep the latest harvest jobs with the current harvest objects.

@seitenbau-govdata Do you find the remaining job reports useful in any way? I am really trying to understand the benefits of keeping old job reports that have no errors in them. EDIT: I can see how tracking harvest history could be valuable. See below for possible ways to allow no changes to keep_current=true.

@bonnland
Contributor Author

bonnland commented May 19, 2022

If the keep_current=true behavior as it is now is useful to some organizations, there are a few ways we could go that would also work for our organization. Here are some possible choices for how to change the current interface behavior that could work:

  • When keep_current=false, also delete and purge harvested datasets in the package table. This would prevent duplicate entries from being created in the package table over the long term. But then it would be very similar to the behavior of the "clear source" command, and it would make datasets disappear from CKAN until they are harvested again.
  • When keep_current=false, do not delete all information about what has been harvested. Retain all harvest objects that are still current, but clear the job_id field. This would prevent duplicate entries in the package table by preserving the prior state of harvesting for the source.
  • Add a new command "clear-job-history" that removes past jobs without losing information about what has been harvested already. This would also prevent duplicate entries in the package table. Retain all harvest objects that are still current, but clear the job_id field in the harvest_object table.

At first, it seemed to me that the second and third choices would require setting the "job_id" field in the harvest_object table to NULL, but it might also work very well to set the job_ids to be the same (non-NULL) value, perhaps the value of a constant PRIOR_JOBS_ID. This would create a single "prior job" to replace (and summarize!) the potentially large number of "add new record" jobs that existed before. It might be an elegant and simple solution if there are not other foreign key constraints to prevent it.
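The sentinel-job idea in the previous paragraph can be sketched with plain SQL over a toy schema. Everything here is illustrative (the table layout is trimmed, and names such as PRIOR_JOBS_ID and clear_job_history are hypothetical, not part of ckanext-harvest):

```python
# Hypothetical sketch: fold every finished job's harvest objects onto one
# sentinel "prior jobs" row, then delete the finished jobs. No schema change
# is needed because harvest_job_id stays non-NULL. Toy schema, not the real
# ckanext-harvest model.
import sqlite3

PRIOR_JOBS_ID = 'prior-jobs'  # illustrative sentinel job id

def clear_job_history(conn, source_id):
    # Ensure the sentinel job exists for this source.
    conn.execute("INSERT OR IGNORE INTO harvest_job (id, source_id, status) "
                 "VALUES (?, ?, 'Finished')", (PRIOR_JOBS_ID, source_id))
    # Re-point harvest objects from finished jobs to the sentinel job,
    # preserving the record of what has already been harvested.
    conn.execute("UPDATE harvest_object SET harvest_job_id = ? "
                 "WHERE harvest_job_id IN (SELECT id FROM harvest_job "
                 "WHERE source_id = ? AND status = 'Finished' AND id <> ?)",
                 (PRIOR_JOBS_ID, source_id, PRIOR_JOBS_ID))
    # The finished jobs now have no objects pointing at them and can go.
    conn.execute("DELETE FROM harvest_job "
                 "WHERE source_id = ? AND status = 'Finished' AND id <> ?",
                 (source_id, PRIOR_JOBS_ID))

conn = sqlite3.connect(':memory:')
conn.executescript("""
    CREATE TABLE harvest_job (id TEXT PRIMARY KEY, source_id TEXT, status TEXT);
    CREATE TABLE harvest_object (id TEXT PRIMARY KEY,
        harvest_job_id TEXT REFERENCES harvest_job(id));
    INSERT INTO harvest_job VALUES ('j1', 'src', 'Finished'),
                                   ('j2', 'src', 'Running');
    INSERT INTO harvest_object VALUES ('o1', 'j1'), ('o2', 'j2');
""")
clear_job_history(conn, 'src')
jobs = [r[0] for r in conn.execute("SELECT id FROM harvest_job ORDER BY id")]
o1_job = conn.execute("SELECT harvest_job_id FROM harvest_object "
                      "WHERE id = 'o1'").fetchone()[0]
print(jobs)    # ['j2', 'prior-jobs']: the running job is kept
print(o1_job)  # 'prior-jobs': the harvest object survives, re-pointed
```

Because every remaining harvest object still carries a non-NULL job id, this variant would avoid both the schema migration and the duplicate-package problem, at the cost of collapsing all past job reports into one synthetic entry.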

I would really like to know how the current behavior of keep_current=false is helpful to organizations. It does almost the same thing as the "clear source" command, except that it keeps harvested datasets in the package table and makes it very easy to create a huge package table over time. It also has the side-effect of changing the dataset URLs because of existing URL collisions in the package table. Every time the command is run with keep_current=false, all datasets are re-harvested and dataset URLs get a new integer ending. The end-users at our organization have sharp eyes, and in the past, before I knew I had to pass the -k flag, they asked why the URLs were always changing.

@bonnland
Contributor Author

bonnland commented May 21, 2022

Maybe somewhere else in a fork?

You may be correct that clearsource_history without keep_current=true has always led to the package table growing over time. Perhaps it was harder to notice before because the harvested dataset URLs did not change on every re-harvest after the command was used.

If this pull request is too controversial for existing users, then I will create a second pull request with a new command called clear_job_history or something similar. I will aim for setting out-of-date job_id values to some predetermined constant, so that it avoids changing the database schema.
