Fine-tune messenger retry strategy #1218

Merged
melroy89 merged 3 commits into main from reduce_messenger_retries on Nov 20, 2024

Conversation

@melroy89 (Member) commented Nov 12, 2024

RabbitMQ cannot handle very large queue sizes and backlogs: once a queue grows above roughly 200k messages, RabbitMQ slowly drains to a halt and CPU usage spikes.

To prevent this, Jerry and I have instead been running the following retry strategy for the past months. Reducing the retries looks very promising to us as a proactive way to avoid stability issues with RabbitMQ, even when running RabbitMQ v4.0.

  • Change max retries from 7 to 5
  • With that, change the max delay from 86400000 ms to 76800000 ms (see the config sketch below)
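
In Symfony Messenger terms this maps onto the retry_strategy block of the transport configuration. Below is a minimal sketch of the kind of change, assuming the usual config/packages/messenger.yaml layout, a 5-minute base delay and a multiplier of 4; the transport name "async" is illustrative, not necessarily what the project uses:

```yaml
# config/packages/messenger.yaml -- sketch only, not the exact diff of this PR
framework:
    messenger:
        transports:
            async:
                dsn: '%env(MESSENGER_TRANSPORT_DSN)%'
                retry_strategy:
                    max_retries: 5        # was 7
                    delay: 300000         # first retry after 5 minutes (in ms)
                    multiplier: 4         # each following retry waits 4x longer
                    max_delay: 76800000   # cap at ~21.3 hours, was 86400000 (24 h)
```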

We have observed, in case of network/congestion issues on my server and/or when running a large instance like @Jerry-infosecexchange's, that reducing the number of RabbitMQ retry queues benefits RabbitMQ's stability a lot.

The changes above still leave plenty of retries with a long enough total delay.

Note: when this PR is applied, these changes will not seem to take effect right away. The reason is simple: RabbitMQ still needs to slowly process the existing delay queues, which can take a while (unless you clean them up manually).

@melroy89 added the "enhancement" and "backend" labels Nov 12, 2024
@melroy89 added this to the v1.7.3 milestone Nov 12, 2024
@melroy89 enabled auto-merge (squash) November 12, 2024 18:45
@BentiGorlich (Member)

So this changes the delay pattern to:

  1. 5 minutes
  2. 20 minutes
  3. 80 minutes
  4. 5.33 hours (320 minutes)
  5. 21.33 hours (1280 minutes)

In total these are ~29 hours (5 + 20 + 80 + 320 + 1280 minutes ≈ 28.4 hours). So if a server is down for longer than that, it will only get the messages from the last ~29 hours; anything older will have exhausted its retries.

I guess your goal was to keep fewer messages around in total. Since before the messages could be in rabbit for about a week and with this change they are in for a maximum of 29 hours.

Imo the messengers are a very delicate topic and if rabbit actually gets a bunch of problems with messages being a week in there we should think about introducing new queues that are running in psql. What do you think about that?
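
For reference, Symfony Messenger already ships a Doctrine transport that can keep queued messages in PostgreSQL, so such a fallback queue could look roughly like this (a sketch only; the transport and queue names are made up and not something this PR defines):

```yaml
# Sketch: a PostgreSQL-backed queue via Symfony's Doctrine transport
framework:
    messenger:
        transports:
            long_retry_psql:
                dsn: 'doctrine://default?queue_name=long_retry'
```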

I think a server should be able to be down for a few days without dropping anything really. It's a reliability thing and imo lemmy handles this very well. Without this change we are (again imo) doing very well regarding delivery reliability too, but with this change we drop some of it (imo too much).

@melroy89 (Member, Author) commented Nov 20, 2024

I guess your goal was to keep fewer messages around in total. Since before the messages could be in rabbit for about a week and with this change they are in for a maximum of 29 hours.

Correct, this was and still is my goal. RabbitMQ can't handle that many queued messages; if something is wrong with delivery, your rabbitmq message pile grows very fast, after which CPU spikes and RabbitMQ comes to a halt!

If you don't trust me, ask @Jerry-infosecexchange for his opinion.

I want to reduce the stress on the system and have a working platform, hence I still believe these values in the PR are better defaults.

I think a server should be able to be down for a few days without dropping anything really.

We disagree here. If you want this, you can increase the retries to 7 or 10 again on your server.

This PR reduces the number of retries to get a safer default. I have a beefy server (128 GB RAM, SSDs & NVMes, 32 cores) and @Jerry-infosecexchange has a beefy server as well; even with this kind of hardware we saw significant performance drops (or even outages in some situations) due to RabbitMQ having well over 300k messages in the queue. This PR prevents that situation from happening again.

@BentiGorlich (Member)

I have a beefy server (128 GB RAM, SSDs & NVMes, 32 cores) and @Jerry-infosecexchange has a beefy server as well; even with this kind of hardware we saw significant performance drops

I know, that is why I tried to phrase my message accordingly and suggested a middle ground or "compromise":

if rabbit actually gets a bunch of problems with messages being a week in there we should think about introducing new queues that are running in psql. What do you think about that?


I think a server should be able to be down for a few days without dropping anything really.

We disagree here.

So you think that if a server is down it should just lose traffic? Or what exactly is your opinion about that? Right now I just know it is not the same as mine :D

@melroy89 (Member, Author) commented Nov 20, 2024

So you think that if a server is down it should just lose traffic? Or what exactly is your opinion about that? Right now I just know it is not the same as mine :D

If your server is down, you are missing data by design. Even other implementations like Mastodon only retry for up to 24 hours: https://github.com/mastodon/mastodon/blob/main/app/workers/activitypub/delivery_worker.rb#L14

If a server is down for several days, you probably have bigger problems than missing a comment, like or post here and there.

So in an ideal world you don't lose traffic and every client and server tries to help you recover. In this situation I choose the reliability and stability of the server itself over potentially missing data in case of downtime.

@Jerry-infosecexchange

I agree - if a server is down for over 24 hours, the problems associated with trying to catch up are going to cause more trouble than it's worth. Fedia.io is already hitting rate limits on nearly every lemmy instance without being down. The other fediverse software works in similar ways - after a period of non-response, the instance is flagged and no further delivery attempts are made until it gets an incoming message that indicates it's alive again.

@Jerry-infosecexchange

I should note that even with limiting retries the way I am now, my rabbitmq queues still race back up into the hundreds of thousands. I think it's hard to talk about losing messages when I am having to purge 100000 messages from rabbitmq every day just to keep it running.

@BentiGorlich (Member)

Maybe we have to find another AMQP server if rabbit is scaling so badly...

@melroy89 merged commit e434f16 into main Nov 20, 2024
7 checks passed
@melroy89 deleted the reduce_messenger_retries branch November 20, 2024 15:47
@melroy89 (Member, Author) commented Nov 20, 2024

I think it's hard to talk about losing messages when I am having to purge 100000 messages from rabbitmq every day just to keep it running

Yeah, this PR is not going to solve that of course, but at least it provides better defaults for now...

The large amount of messages Jerry is seeing could still be caused by other ActivityPub implementation issues in our code. We should handle errors better, e.g. rejecting messages instead of retrying in situations where that makes sense.

Also some of these open issues will reduce the load on rabbitmq: #1175.
