-
Notifications
You must be signed in to change notification settings - Fork 204
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Allow full clear of completed jobs #503
base: master
Are you sure you want to change the base?
Conversation
Related PR: ckan/ckanext-spatial#284 |
Note: I've verified on my vagrant VM that currently running harvest jobs are not cleared when the "job history clear" command is given. It is a source-based CKAN 2.9.5 install running uwsgi+nginx, with WAF harvesting enabled. |
Related to #293 |
@@ -356,7 +357,7 @@ def define_harvester_tables(): | |||
Column('state', types.UnicodeText, default=u'WAITING'), | |||
Column('metadata_modified_date', types.DateTime), | |||
Column('retry_times', types.Integer, default=0), | |||
Column('harvest_job_id', types.UnicodeText, ForeignKey('harvest_job.id')), | |||
Column('harvest_job_id', types.UnicodeText, ForeignKey('harvest_job.id', ondelete='SET NULL'), nullable=True), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changing table structure would require migration script, otherwise it would not be applied to any instance.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that a schema change creates some challenges for existing databases. However, the ckanapi
package gives very nice support for exporting and re-importing users, packages, and organizations, so that a fresh database instance with imported users, etc is not very time consuming. I could write up a set of instructions for the Wiki if this is useful.
Hi @bonnland. Thanks for the interesting pull request. Could you explain the advantage of deleting the harvest jobs and keeping the harvest objects? The current implementation with the option |
I just realized that the user option If there are organizations who want duplicate packages in the package table, this pull request should not be accepted. It's hard to think of a reason why duplicate records would be good, though, from my perspective. |
Hi @seitenbau-govdata, thanks for the attention and interest. Keeping the harvest objects allows tracking of what has been harvested already. If all harvest objects are cleared as part of the history clear, then the harvester will re-harvest datasets that already appear in the package table, and duplicate rows in the package table will be created. For our organization, this is costly because our harvest sources have thousands of datasets. The package table fills up quickly with duplicate rows if harvest objects are not kept, and it can take hours to re-harvest everything. So I would argue that keeping current harvested objects should be the default behavior, at the very least. Currently, it requires passing a flag. Users who do not realize they must pass a flag discover eventually that their package table gets large with repeated uses of the "history clear" command. They also discover that every harvested dataset URL has this strange integer at the end of it, which increments every time the "history clear" command is used. Also, the advantage of clearing all completed harvest jobs is that it becomes much easier to track how often a harvest source is being updated. Without this pull request, even when using the |
For our purposes, the harvest job reports help in the short term, to fix harvesting errors. Their usefulness rarely goes past a few days. Is it useful to have job reports for cases where the datasets were harvested successfully? Because it seems that harvest job ids are kept when datasets are harvested successfully. If you can explain how this is helpful, I would appreciate knowing the advantages. |
@@ -216,18 +216,12 @@ def clear_harvest_source_history(source_id, keep_current): | |||
if source_id is not None: | |||
tk.get_action("harvest_source_job_history_clear")(context, { | |||
"id": source_id, | |||
"keep_current": keep_current | |||
}) | |||
}) | |||
return "Cleared job history of harvest source: {0}".format(source_id) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note the language used in the return statement. This command is most useful if it clears the "job history", not the entire source history. Perhaps a change in the command name to "clear-job-history" would be better, as it more clearly states the eventual outcome of the command.
I should probably not forget to add that our organization adds new records individually, at an average rate of 3-4 per week. That means I have a "cleared" job history list with over 100 entries for some of our WAFs. This might not be the usual case for others, so the urgency for this PR is probably not as relevant for others as it might be for us. |
Not all instances harvest daily. For you the history might be relevant only for few days, but other might have a need for what happened three months ago. What I would do is have separate command for what you are trying to achieve or at least an option in the current one. It would be a nasty surprise for someone running this command and noticing the functionality has changed. |
I am not sure I understand. The command is run when someone wants to clear the job history. If they don't want to clear the job history, then they do not run the command. What this pull request does is make this command behave as it did before 2016 or 2017. I am not sure why it was changed; it worked very well before. |
We have introduced the command |
There used to be a command that would clear all completed jobs. It was very helpful. I suppose it disappeared when the |
No, there wasn't removed any command when adding the command |
I remember a time when I could clear all completed jobs, and it would not cause duplicate rows in the package table after it was used. This issue of the package table filling up is potentially serious, and I am surprised it has not come up somewhere before. |
Maybe somewhere else in a fork? But unfortunately not in |
It is helpful to know that our organization is not the only one who has had this problem. The current interface does not provide any way of clearing past jobs without creating an entire new set of duplicate rows in the package table. |
Actually this should be possible with the new option |
It keeps hundreds of harvest jobs when I use this command. Any harvest job with a successfully harvested dataset is kept. For us, this is hundreds of harvest jobs. Many of the jobs involve adding a single new record to the harvest source. All of these jobs are kept. There is very little useful information in the jobs that are kept, because they all represent successful harvests. An always-growing, long list of successful harvests has very little useful information in it for our organization, and it becomes difficult to track changes to the harvesting behavior. After more thought, perhaps some organizations want to track the rate of successful harvests over time. Maybe this is useful to some, but that is not true for us. And it seems that some organizations would want to "reset" their tracking by clearing all jobs at some point, without creating a full set of duplicate rows in the package table. |
@seitenbau-govdata Do you find the remaining job reports useful in any way? I am really trying to understand the benefits of keeping old job reports that have no errors in them. EDIT: I can see how tracking harvest history could be valuable. See below for possible ways to allow no changes to |
If the
At first, it seemed to me that the second and third choices would require setting the "job_id" field in the harvest_object table to NULL, but it might also work very well to set the job_ids to be the same (non-NULL) value, perhaps the value of a constant I would really like to know how the current behavior of |
You may be correct that If this pull request is too controversial for existing users, then I will create a second pull request with a new command called |
The current behavior of the
source clear-history -k true
command is to retain all job ids where some harvested dataset came from that job id. This seems unnecessary, and for our CKAN instance, with many thousands of harvested datasets that get added/updated in small batches, it means that there is always a very long job history list.This PR will require a small 2-line change to the ckan/ckanext-spatial WAF harvesting code, to handle the case where a harvest job ID has been cleared. It is possible that other harvesters will need small adjusts like this; I've only tested the WAF harvesting code because this is my organization's particular use case.
I have also made a suggested change to the options for this command, so that "keep currently running jobs" is always done. It seems potentially confusing, possibly unhelpful, and unnecessary to clear the history of jobs that are still running.
I have also removed the option of clearing currently harvested objects as part of the "history clear" behavior. See discussion below about why this may be a good idea.
I know that this initial PR version has failing tests. I do not have testing set up on my current vagrant VM instance, and will try to fix the tests by looking at the Github tests output.
Please feel free to suggest possible alternatives or edits. Thank you!