Use proper threading to encourage work completion of AMQP subscribers in a predictable manner. #95

Open
wants to merge 2 commits into
base: trunk
from

Conversation

TreyE
Contributor

@TreyE TreyE commented Sep 13, 2023

Underlying Issues and Justification

Currently, when it comes to AMQP, Event Source is characterized by three properties:

  1. It is single process, but multi-threaded.
  2. It runs all consumers for AMQP events in the same process, but in (theoretically) different threads.
  3. It is 'greedy': it attempts to allow its consumers to process multiple messages simultaneously.

However, a problem can arise when multiple consumers are allowed to perform work simultaneously, without coordination, in a multi-threaded environment: the system can switch away from the working thread while a consumer is in the middle of its work, and there is no guarantee it will return to that message. Under low load this usually isn't a problem for Event Source, but it becomes one when:

  1. Event Source is facing a high volume of messages.
  2. The messages are of different types, meaning multiple subscribers will not only be receiving messages, but also processing those different message types simultaneously under different consumers and threads.
  3. One type of worker performs a complex, work-intensive task.

Under these circumstances, because workers are not protected from interruption and AMQP subscribers do not coordinate when their work may be interrupted, a worker can be suspended in the middle of a work-intensive task with no guarantee that it will ever be resumed.

This can result in:

  1. Event Source workers beginning work they may never complete, while leaving the message in the 'unacked' state.
  2. Process bloat, as multiple Event Source workers are interrupted mid-work and never finish - thus never releasing their memory.
  3. Unpredictable system behaviour - if starting work gives no guarantee of when or how it will finish, messages and their associated work can be processed at arbitrary times.

The Fix

This can be fixed by marking the unit of work performed by an Event Source worker as atomic - so that it cannot be interrupted.

However, two constraints must be respected so that this approach does not cripple performance:

  1. Only prevent interruption during the minimal portion of worker execution needed to ensure the unit of work is completed successfully.
  2. Use a re-entrant synchronization primitive to avoid deadlocks.
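
To illustrate the second point, here is a minimal sketch (not code from this PR) comparing two primitives from Ruby's standard library. It assumes nothing about Event Source itself; the variable names are made up for the example. A `Mutex` refuses to be locked twice by the thread that already owns it, whereas a `Monitor` allows the owning thread to enter again:

```ruby
require "monitor"

# Mutex is not re-entrant: locking it again from the owning thread
# raises ThreadError instead of proceeding.
mutex = Mutex.new
mutex_result =
  begin
    mutex.synchronize { mutex.synchronize { :nested_ok } }
  rescue ThreadError
    :recursive_lock_rejected
  end

# Monitor is re-entrant: the owning thread may enter the same monitor again,
# so nested synchronized calls on one thread do not block themselves.
monitor = Monitor.new
monitor_result = monitor.synchronize { monitor.synchronize { :nested_ok } }
```

This matters if synchronized code can end up calling other synchronized code on the same thread, which is hard to rule out once several subscribers share one lock.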

In this case, the solution offered here is a Ruby Monitor, synchronized only around the portion of the AMQP subscriber where work is actually being performed.
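
As a rough sketch of that shape (again, not the actual diff), the example below wraps only the work portion of a subscriber callback in a shared monitor. `work_monitor`, `handle_delivery`, and the `log` array are hypothetical names invented for illustration, and the `sleep` stands in for a work-intensive task; message acknowledgement is omitted.

```ruby
require "monitor"

# Hypothetical shared monitor; the name is illustrative only.
work_monitor = Monitor.new

# Hypothetical stand-in for an AMQP subscriber callback.
handle_delivery = lambda do |payload, log|
  message = payload.strip # cheap preparation stays outside the monitor

  work_monitor.synchronize do
    # Only the unit of work is serialized: no other subscriber thread can
    # enter this block until the thread holding the monitor has finished.
    log << [:start, message]
    sleep 0.01 # simulated work-intensive task
    log << [:finish, message]
  end
end

# Three simulated subscribers receiving messages concurrently.
log = []
threads = %w[a b c].map do |payload|
  Thread.new { handle_delivery.call(payload, log) }
end
threads.each(&:join)
```

With the monitor in place, each `:start` entry in `log` is immediately followed by the `:finish` entry for the same message, whatever order the threads happen to run in. Note that the monitor does not stop the scheduler from switching threads; it only ensures that no other subscriber begins its own unit of work until the current one is done.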

This ticket is tracked as: https://www.pivotaltracker.com/story/show/186036844

Caveats

Please note that although this fix introduces a monitor that can be reused later, it does not attempt to manage or constrain the behaviour of the HTTP worker portion of Event Source. I was less certain of how that might behave in isolation and would rather exercise caution and handle that issue in a separate submission.

@TreyE TreyE force-pushed the finish_what_you_started branch from 6dd2e24 to 0016215 Compare September 13, 2023 20:49
@TreyE TreyE added the bug Something isn't working label Oct 16, 2023