Rate-limit the updates? #2279

Closed
severo opened this issue Jan 11, 2024 · 11 comments
Labels
infra P1 Not as needed as P0, but still important/wanted

Comments

severo (Collaborator) commented Jan 11, 2024

See https://huggingface.co/datasets/lunaluan/chatbox8_history/commits/main: two commits are created every minute (47K commits). This dataset is currently blocked manually (#2184).

Should we create a rule to automatically limit the number of updates per hour/day? Note that doing so might imply changing how we handle the "revision". If we decide to process a specific commit, all the steps should use the same commit, even if new ones have appeared since.
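Such a rule could look like the following sketch (all names and the threshold are hypothetical, not the actual datasets-server implementation). A dataset only gets processed when it is under an hourly update budget, and the selected revision is pinned so every step works on the same commit:

```python
from datetime import datetime, timedelta

MAX_UPDATES_PER_HOUR = 4  # hypothetical budget

def should_process(update_times: list[datetime], now: datetime) -> bool:
    """Return True if the dataset is still under its hourly update budget.

    update_times: timestamps of the dataset's recent update events.
    A dataset committing every minute (like the one above) would be
    rejected after the first few updates in the window.
    """
    window_start = now - timedelta(hours=1)
    recent = [t for t in update_times if t >= window_start]
    return len(recent) < MAX_UPDATES_PER_HOUR
```

If the check passes, the commit sha at that moment would be recorded on the job, and all downstream steps would read that pinned revision instead of `main`, even if newer commits have appeared since.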

@severo severo added question Further information is requested P2 Nice to have labels Jan 11, 2024
severo (Collaborator) commented Mar 22, 2024

If we rate-limit, we could propose different modes:

  • limited
  • full

cc https://huggingface.slack.com/archives/C04BP5S7858/p1710243357259019 (private)

severo (Collaborator) commented May 3, 2024

Another way is to deprioritize the dataset by moving the job date into the future, based on the time between the last two or three commits, and filtering on the date when picking a job, so that jobs dated in the future are ignored.

@severo severo added infra P1 Not as needed as P0, but still important/wanted and removed question Further information is requested P2 Nice to have labels May 3, 2024
severo (Collaborator) commented Jun 19, 2024

Proposal:

Create two new collections:

  • pastJobs: it will contain all the finished jobs, including the duration. It will have a TTL of x hours.
  • blockedDatasets: it will contain a list of currently blocked datasets, with a TTL of y hours.

When we select the next job in the queue: 1. we filter out the datasets in blockedDatasets, and 2. once a job is selected, we compute the sum of the durations of all its pastJobs. If the sum exceeds a threshold of z hours, we add the dataset to blockedDatasets and go back to selecting the next job.

What do you think?

lhoestq (Member) commented Jun 19, 2024

Is it possible to say that a job should wait e.g. 10 min before being run? And if a commit happens in the meantime, the job is deleted and replaced with a new job with the same wait penalty.
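This is essentially a debounce. A sketch of the mechanism (names and the 10-minute delay are illustrative, not an actual API): each new commit replaces the pending job for that dataset and restarts the timer, so only the latest revision ever runs, and only once commits have quieted down:

```python
from datetime import datetime, timedelta

WAIT = timedelta(minutes=10)  # hypothetical debounce delay

# pending jobs: dataset -> (revision, earliest start time)
pending: dict[str, tuple[str, datetime]] = {}

def on_commit(dataset: str, revision: str, now: datetime) -> None:
    """Replace any pending job for this dataset and restart the wait penalty."""
    pending[dataset] = (revision, now + WAIT)

def runnable(now: datetime) -> list[tuple[str, str]]:
    """Return (dataset, revision) pairs whose debounce delay has elapsed."""
    return [(d, rev) for d, (rev, start) in pending.items() if now >= start]
```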

severo (Collaborator) commented Jun 19, 2024

> Is it possible to say that a job should wait e.g. 10 min before being run?

I think we need to keep processing the small datasets instantly, to give users a fast feedback loop with the viewer. We don't want to penalize anybody, apart from the very big datasets that are updated every x minutes.

severo (Collaborator) commented Jun 20, 2024

Hopefully #2933 fixes most of the issues.

I think we need a fix though: the autoscaling is based on the number of jobs. But we need to remove the blocked datasets from the count.

severo (Collaborator) commented Jun 21, 2024

> I think we need a fix though: the autoscaling is based on the number of jobs. But we need to remove the blocked datasets from the count.

Good news: it's not necessary, because if the dataset is updated while it's blocked (which is often the case: blocked for 1 hour, 1 commit every 5 minutes), all the cache entries and all the pending jobs are deleted, so it no longer influences the metrics.

lhoestq (Member) commented Jun 21, 2024

(until backfill does its job maybe ?)

severo (Collaborator) commented Jun 21, 2024

No, because backfill only creates two jobs (dataset-config-names and dataset-filetypes, the root steps), and they are never run while the dataset is blocked.

severo (Collaborator) commented Jun 24, 2024

Hmm, I changed my mind and opened #2945. I think it's urgent, to fix the autoscaling.

severo (Collaborator) commented Jul 30, 2024

fixed

@severo severo closed this as completed Jul 30, 2024