Fix new scheduled tasks jumping the queue #17962

Merged: 2 commits into develop from rav/schedule_ready_tasks on Nov 28, 2024

Conversation

richvdh (Member) commented on Nov 26, 2024:

Currently, when a new scheduled task is added and its scheduled time has already passed, we set it to ACTIVE. This is problematic, because it means it will jump the queue ahead of all other SCHEDULED tasks; furthermore, if the Synapse process gets restarted, it will jump ahead of any ACTIVE tasks which have been started but are taking a while to run.

Instead, we leave it set to SCHEDULED, but kick off a call to _launch_scheduled_tasks, which will decide if we actually have capacity to start a new task, and start the newly-added task if so.
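For illustration, here is a minimal, self-contained sketch of the new behavior (the class and field names loosely mirror Synapse's `TaskScheduler`, but this is an illustrative model, not the actual implementation):

import time
from dataclasses import dataclass
from enum import Enum

class TaskStatus(Enum):
    SCHEDULED = "scheduled"
    ACTIVE = "active"

@dataclass
class ScheduledTask:
    id: str
    timestamp: float
    status: TaskStatus = TaskStatus.SCHEDULED

class TaskScheduler:
    MAX_CONCURRENT_RUNNING_TASKS = 5  # illustrative value

    def __init__(self) -> None:
        self._tasks: list[ScheduledTask] = []
        self._running_tasks: set[str] = set()

    def schedule_task(self, task_id: str, timestamp: float | None = None) -> None:
        now = time.time()
        task = ScheduledTask(
            id=task_id,
            timestamp=now if timestamp is None else timestamp,
        )
        # Always insert as SCHEDULED, even if the scheduled time has
        # already passed...
        self._tasks.append(task)
        if task.timestamp <= now:
            # ...and instead poke the scheduler, which starts tasks in
            # fair (timestamp) order, subject to the concurrency cap.
            self._launch_scheduled_tasks()

    def _launch_scheduled_tasks(self) -> None:
        now = time.time()
        ready = sorted(
            (
                t
                for t in self._tasks
                if t.status == TaskStatus.SCHEDULED and t.timestamp <= now
            ),
            key=lambda t: t.timestamp,
        )
        for task in ready:
            if len(self._running_tasks) >= self.MAX_CONCURRENT_RUNNING_TASKS:
                return
            task.status = TaskStatus.ACTIVE
            self._running_tasks.add(task.id)  # actual task execution elided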

richvdh requested a review from a team as a code owner on November 26, 2024 13:12
Comment on lines -129 to -130
# We need to wait for the next run of the scheduler loop
self.reactor.advance((TaskScheduler.SCHEDULE_INTERVAL_MS / 1000))
richvdh (Member, Author) commented:

This hasn't actually changed; I'm just tightening up the test a bit. We didn't need to wait for the next run of the loop, we just had to wait 0.1 seconds. It looks like this changed in matrix-org/synapse#16313 and matrix-org/synapse#16660 without the test being updated.
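For context, the tightened wait looks roughly like this (a sketch of the test change; the surrounding test code is elided):

# Before: wait for a full pass of the scheduler loop
self.reactor.advance(TaskScheduler.SCHEDULE_INTERVAL_MS / 1000)

# After: the test task itself only sleeps for 0.1s, so that is all we
# actually need to wait for
self.reactor.advance(0.1)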

richvdh force-pushed the rav/schedule_ready_tasks branch from 2865a43 to afb7d08 on November 26, 2024 15:43
@@ -495,7 +495,7 @@ def to_line(self) -> str:


class NewActiveTaskCommand(_SimpleCommand):
Contributor commented:

Should we change the NAME to TASK_READY (and related methods) to match the new behavior?

Pro: it's cleaner. Con: we may lose a message when doing a rolling restart. Not sure that's a big deal, since the scheduler will go over all the tasks in the DB when starting up.
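For illustration, the suggested rename would look roughly like this (a hypothetical sketch; the PR keeps the existing name):

class TaskReadyCommand(_SimpleCommand):
    """Sent when a ready-to-run task has been added to the queue."""

    NAME = "TASK_READY"

# During a rolling restart, an old worker receiving TASK_READY would not
# recognize the command and would drop it; likely harmless, since the
# scheduler re-scans all tasks in the DB on startup anyway.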

richvdh (Member, Author) replied:

No strong opinions from me on this. Will await insight from @element-hq/synapse-core

erikjohnston (Member) replied:

I don't think it's particularly useful for now, and I think we can always add it in later if it is useful (e.g. for metrics, maybe?)

richvdh (Member, Author) replied:

so what exactly are you suggesting, Erik? We remove the command altogether?

erikjohnston (Member) replied:

Ooops, sorry, misunderstood what was going on 🤦

Eh, I don't mind. No one really ever sees what's on the wire anyway 🤷

# Don't bother trying to launch new tasks if we're already at capacity.
if len(self._running_tasks) >= TaskScheduler.MAX_CONCURRENT_RUNNING_TASKS:
return

Member commented:

Any reason to remove this? It does avoid spinning up a background process, even if it's redundant.

richvdh (Member, Author) replied:

I don't quite follow.

This method was previously (only) called when a NEW_ACTIVE_TASK replication command was received, which (previously) meant that another worker had added a new scheduled task and set it to ACTIVE immediately, so we should start running it if possible.

The problem is that now, the other worker doesn't make the decision about whether to set the task to ACTIVE or not -- we first have to check how many tasks are already running, so the other worker has to delegate the SCHEDULED -> ACTIVE transition to the background-tasks worker. Soooo, NEW_ACTIVE_TASK now just means "there is a new task, and its scheduled time has already passed", which is a hint to the background-tasks worker that it might want to run its scheduler now rather than waiting 60s for it to come round again.

All of which is a long way of saying: launch_task_by_id isn't very useful any more.
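In other words, the sending side now looks roughly like this (a sketch with the task-creation details elided; `send_new_active_task` is my reading of the replication handler's method, so treat the names as illustrative):

async def schedule_task(self, action: str, timestamp: int) -> str:
    # ... build `task` with status=SCHEDULED (never ACTIVE up front) ...
    await self._store.insert_scheduled_task(task)
    if task.timestamp <= self._clock.time_msec():
        # NEW_ACTIVE_TASK is now just a hint: "a ready-to-run task
        # exists". The background-tasks worker decides whether there is
        # capacity to make it ACTIVE.
        self._hs.get_replication_command_handler().send_new_active_task(task.id)
    return task.id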

richvdh (Member, Author) replied:

ohh, it was just the `if len(self._running_tasks) >= TaskScheduler.MAX_CONCURRENT_RUNNING_TASKS` check you were talking about?

Right, yes, we could reinstate that. Bear with me.

richvdh (Member, Author) replied:

065b11b changes this for both the 60s timer and for the NEW_ACTIVE_TASK command handler.
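For reference, after that commit the check presumably sits at the top of `_launch_scheduled_tasks`, so that both callers benefit (a sketch, not the exact diff):

@wrap_as_background_process("launch_scheduled_tasks")
async def _launch_scheduled_tasks(self) -> None:
    # Early-out: both the 60s timer and the NEW_ACTIVE_TASK handler
    # funnel through here, so one check covers both paths.
    if len(self._running_tasks) >= TaskScheduler.MAX_CONCURRENT_RUNNING_TASKS:
        return
    # ... fetch SCHEDULED tasks whose time has passed and launch them ...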

def on_new_task(self, task_id: str) -> None:
"""Handle a notification that a new ready-to-run task has been added to the queue"""
# Just run the scheduler
self._launch_scheduled_tasks()

@wrap_as_background_process("launch_scheduled_tasks")
richvdh (Member, Author) commented on Nov 28, 2024:

can I just say: I am not a fan of this decorator. I keep forgetting that `_launch_scheduled_tasks` runs as a background process.

erikjohnston enabled auto-merge (squash) on November 28, 2024 16:51
erikjohnston merged commit d80cd57 into develop on Nov 28, 2024
39 checks passed
erikjohnston deleted the rav/schedule_ready_tasks branch on November 28, 2024 18:06
MatMaul pushed a commit to MatMaul/synapse that referenced this pull request Dec 4, 2024
yingziwu added a commit to yingziwu/synapse that referenced this pull request Dec 20, 2024
(The commit message is a full Synapse release changelog; the entry relevant here is: "Fix new scheduled tasks jumping the queue. ([\#17962](element-hq/synapse#17962))".)