Rate-limit the updates? #2279

Closed
severo opened this issue Jan 11, 2024 · 11 comments
Labels
infra P1 Not as needed as P0, but still important/wanted

Comments

severo (Collaborator) commented Jan 11, 2024

See https://huggingface.co/datasets/lunaluan/chatbox8_history/commits/main: two commits are created every minute (47K commits). This dataset is currently blocked manually (#2184).

Should we create a rule to automatically limit the number of updates per hour/day? Note that doing so might imply changing how we handle the "revision". If we decide to process a specific commit, all the steps should use the same commit, even if new ones have appeared since.
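Such a rule could look like the following sketch (all names and the threshold are hypothetical, not the actual datasets-server implementation). A dataset only gets processed when it is under an hourly update budget, and the selected revision is pinned so every step works on the same commit:

```python
from datetime import datetime, timedelta

MAX_UPDATES_PER_HOUR = 4  # hypothetical budget

def should_process(update_times: list[datetime], now: datetime) -> bool:
    """Return True if the dataset is still under its hourly update budget.

    update_times: timestamps of the dataset's recent update events.
    A dataset committing every minute (like the one above) would be
    rejected after the first few updates in the window.
    """
    window_start = now - timedelta(hours=1)
    recent = [t for t in update_times if t >= window_start]
    return len(recent) < MAX_UPDATES_PER_HOUR
```

If the check passes, the commit sha at that moment would be recorded on the job, and all downstream steps would read that pinned revision instead of `main`, even if newer commits have appeared since.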

@severo severo added question Further information is requested P2 Nice to have labels Jan 11, 2024
severo (Collaborator) commented Mar 22, 2024

If we rate-limit, we could propose different modes:

  • limited
  • full

cc https://huggingface.slack.com/archives/C04BP5S7858/p1710243357259019 (private)

severo (Collaborator) commented May 3, 2024

Another way is to deprioritize the dataset by moving the job date into the future, based on the time between the last two or three commits, and filtering on the date when picking a job, so that jobs dated in the future are ignored.

@severo severo added infra P1 Not as needed as P0, but still important/wanted and removed question Further information is requested P2 Nice to have labels May 3, 2024
severo (Collaborator) commented Jun 19, 2024

Proposal:

Create two new collections:

  • pastJobs: it will contain all the finished jobs, including the duration. It will have a TTL of x hours.
  • blockedDatasets: it will contain a list of currently blocked datasets, with a TTL of y hours.

When we select the next job in the queue: 1. we filter out the datasets in blockedDatasets, and 2. once a job is selected, we compute the sum of the durations of all its pastJobs. If the sum exceeds a threshold of z hours, we add the dataset to blockedDatasets and go back to selecting the next job.

What do you think?

lhoestq (Member) commented Jun 19, 2024

Is it possible to say that a job should wait e.g. 10 min before being run? And if a commit happens in the meantime, the job is deleted and replaced with a new job with the same wait penalty.
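This is essentially a debounce. A sketch of the mechanism (names and the 10-minute delay are illustrative, not an actual API): each new commit replaces the pending job for that dataset and restarts the timer, so only the latest revision ever runs, and only once commits have quieted down:

```python
from datetime import datetime, timedelta

WAIT = timedelta(minutes=10)  # hypothetical debounce delay

# pending jobs: dataset -> (revision, earliest start time)
pending: dict[str, tuple[str, datetime]] = {}

def on_commit(dataset: str, revision: str, now: datetime) -> None:
    """Replace any pending job for this dataset and restart the wait penalty."""
    pending[dataset] = (revision, now + WAIT)

def runnable(now: datetime) -> list[tuple[str, str]]:
    """Return (dataset, revision) pairs whose debounce delay has elapsed."""
    return [(d, rev) for d, (rev, start) in pending.items() if now >= start]
```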

severo (Collaborator) commented Jun 19, 2024

> Is it possible to say that a job should wait e.g. 10 min before being run?

I think we need to keep processing the small datasets instantly, to give users a fast feedback loop with the viewer. We don't want to penalize anybody, apart from the very big datasets that are updated every x minutes.

severo (Collaborator) commented Jun 20, 2024

Hopefully #2933 fixes most of the issues.

I think we need a fix though: the autoscaling is based on the number of jobs. But we need to remove the blocked datasets from the count.

severo (Collaborator) commented Jun 21, 2024

> I think we need a fix though: the autoscaling is based on the number of jobs. But we need to remove the blocked datasets from the count.

Good news: it's not necessary, because if the dataset is updated while it's blocked (which is often the case: blocked for 1 hour, 1 commit every 5 minutes), all the cache entries and all the pending jobs are deleted, so it no longer influences the metrics.

lhoestq (Member) commented Jun 21, 2024

(until backfill does its job maybe ?)

severo (Collaborator) commented Jun 21, 2024

No, because backfill only creates two jobs (dataset-config-names and dataset-filetypes, the root steps), and they are never run while the dataset is blocked.

severo (Collaborator) commented Jun 24, 2024

Hmm, I changed my mind and opened #2945. I think it's urgent, to fix the autoscaling.

severo (Collaborator) commented Jul 30, 2024

fixed

@severo severo closed this as completed Jul 30, 2024