Rate-limit the updates? #2279
If we rate-limit, we could propose different modes:
cc https://huggingface.slack.com/archives/C04BP5S7858/p1710243357259019 (private)
Another way is to deprioritize the dataset by moving the job's date into the future, based on the time between its last two or three commits, and by filtering on the date when picking a job, so that dates in the future are ignored.
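A minimal sketch of this idea, assuming an in-memory job list; the function names, the job `date` field, and the one-hour cap are illustrative assumptions, not the actual queue implementation:

```python
# Minimal sketch, assuming an in-memory job list; all names and the
# one-hour cap are illustrative, not the actual queue implementation.
from datetime import datetime, timedelta


def compute_job_date(commit_dates: list[datetime], now: datetime) -> datetime:
    """Push the job's date into the future when commits are very frequent."""
    if len(commit_dates) < 3:
        return now  # not enough history: run as soon as possible
    last_three = sorted(commit_dates)[-3:]
    intervals = [(b - a).total_seconds() for a, b in zip(last_three, last_three[1:])]
    avg_interval = sum(intervals) / len(intervals)
    # the shorter the interval between recent commits, the longer the penalty
    delay = max(0.0, 3600.0 - avg_interval)
    return now + timedelta(seconds=delay)


def pick_next_job(jobs: list[dict], now: datetime) -> dict | None:
    """Filter on the date when picking a job: dates in the future are ignored."""
    ready = [job for job in jobs if job["date"] <= now]
    return min(ready, key=lambda job: job["date"]) if ready else None
```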
Proposal: create two new collections, blockedDatasets and pastJobs.
When we select the next job in the queue: 1. we filter out the blockedDatasets, and 2. once we have selected a job, we compute the sum of the durations of all its pastJobs. If it's over a threshold of z hours, we add the dataset to blockedDatasets. What do you think?
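A rough sketch of this selection logic, with in-memory stand-ins for the two proposed collections; the variable names and the threshold value are assumptions:

```python
# Rough sketch; blocked_datasets and past_jobs stand in for the two proposed
# collections, and BLOCK_THRESHOLD is a placeholder for the "z hours" value.
from datetime import timedelta

BLOCK_THRESHOLD = timedelta(hours=1)  # placeholder for "z hours"


def pick_next_job(
    queue: list[dict],
    blocked_datasets: set[str],
    past_jobs: dict[str, list[timedelta]],
) -> dict | None:
    for job in queue:
        # 1. filter out the blocked datasets
        if job["dataset"] in blocked_datasets:
            continue
        # 2. sum the durations of all the dataset's past jobs
        total = sum(past_jobs.get(job["dataset"], []), timedelta())
        if total > BLOCK_THRESHOLD:
            # over the threshold: block the dataset and skip its job
            blocked_datasets.add(job["dataset"])
            continue
        return job
    return None
```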
Is it possible to say that a job should wait, e.g., 10 minutes before being run? And if a commit happens in the meantime, the job is deleted and replaced with a new job with the same wait penalty.
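This would amount to a debounce; a small sketch under the same caveats (the job fields and the 10-minute penalty are assumptions):

```python
# Sketch of the "wait before running, replace on new commit" idea (a debounce).
# The job fields and the 10-minute penalty are illustrative assumptions.
from datetime import datetime, timedelta

WAIT_PENALTY = timedelta(minutes=10)


def on_commit(pending: dict[str, dict], dataset: str, revision: str, now: datetime) -> None:
    """Replace any pending job for the dataset and restart the wait penalty."""
    pending[dataset] = {
        "revision": revision,              # always points at the latest commit
        "not_before": now + WAIT_PENALTY,  # the job is ignored until this date
    }


def is_ready(job: dict, now: datetime) -> bool:
    """A job only becomes runnable once its wait penalty has elapsed."""
    return job["not_before"] <= now
```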
I think we need to keep computing the small datasets instantaneously, to help the users have a feedback loop with the viewer. We don't want to penalize anybody, apart from the very big datasets that are updated every x minutes.
Hopefully #2933 fixes most of the issues. I think we still need a fix, though: the autoscaling is based on the number of jobs, but we need to remove the blocked datasets from the count.
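Something like this, as a sketch (the names are assumptions):

```python
# Sketch: base the autoscaling on the number of runnable jobs only,
# i.e. excluding the jobs of blocked datasets. Names are assumptions.
def autoscaling_jobs_count(jobs: list[dict], blocked_datasets: set[str]) -> int:
    return sum(1 for job in jobs if job["dataset"] not in blocked_datasets)
```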
Good news: it's not necessary, because if the dataset is updated while it's blocked (which is often the case: blocked for 1 hour, 1 commit every 5 minutes), all the cache entries and all the pending jobs are deleted, so it no longer influences the metrics.
(until backfill does its job, maybe?)
No, because it only creates two jobs (dataset-config-names and dataset-filetypes, the root steps), and they are never run while the dataset is blocked.
Hmmm, I changed my mind and opened #2945. I think it's urgent, to fix the autoscaling.
Fixed.
See https://huggingface.co/datasets/lunaluan/chatbox8_history/commits/main: two commits are created every minute (47K commits). This dataset is currently blocked manually (#2184).
Should we create a rule to automatically limit the number of updates per hour/day? Note that doing so might imply changing how we handle the "revision". If we decide to process a specific commit, all the steps should use the same commit, even if new ones have appeared since.
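A hedged sketch of what such a rule plus the revision pinning could look like; the hourly budget and all names are assumptions, not existing code:

```python
# Hedged sketch of an automatic per-dataset update budget plus revision
# pinning. The hourly budget and all names are illustrative assumptions.
from datetime import datetime, timedelta
from typing import Callable

MAX_UPDATES_PER_HOUR = 4  # illustrative budget


def should_process(update_dates: list[datetime], now: datetime) -> bool:
    """Skip the update if the dataset already spent its hourly budget."""
    recent = [d for d in update_dates if now - d < timedelta(hours=1)]
    return len(recent) < MAX_UPDATES_PER_HOUR


def process_update(dataset: str, revision: str, steps: list[Callable]) -> None:
    """Pin all the steps to the same commit, even if new ones appear meanwhile."""
    for step in steps:
        step(dataset=dataset, revision=revision)
```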