Fine-tune messenger retry strategy #1218

Merged
melroy89 merged 3 commits into main from reduce_messenger_retries on Nov 20, 2024

Conversation

@melroy89 (Member) commented Nov 12, 2024

RabbitMQ cannot handle very large queue sizes and backlogs: once a queue grows above roughly 200k messages, RabbitMQ slowly drains to a halt and CPU usage spikes.

To prevent this, Jerry and I have instead been running the following retry strategy for the past months. Reducing the retries looks very promising to us as a proactive way to avoid stability issues with RabbitMQ, even when running RabbitMQ v4.0.

  • Change max retries from 7 to 5
  • With that, change the max delay from 86400000 ms to 76800000 ms (see the config sketch below)
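
In Symfony Messenger terms this maps onto the retry_strategy block of the transport configuration. Below is a minimal sketch of the kind of change, assuming the usual config/packages/messenger.yaml layout, a 5-minute base delay and a multiplier of 4; the transport name "async" is illustrative, not necessarily what the project uses:

```yaml
# config/packages/messenger.yaml -- sketch only, not the exact diff of this PR
framework:
    messenger:
        transports:
            async:
                dsn: '%env(MESSENGER_TRANSPORT_DSN)%'
                retry_strategy:
                    max_retries: 5        # was 7
                    delay: 300000         # first retry after 5 minutes (in ms)
                    multiplier: 4         # each following retry waits 4x longer
                    max_delay: 76800000   # cap at ~21.3 hours, was 86400000 (24 h)
```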

We have observed, in case of network/congestion issues on my server and/or when running a large instance like @Jerry-infosecexchange's, that reducing the number of RabbitMQ retry queues benefits RabbitMQ's stability a lot.

The changes above still leave plenty of retries with a long enough total delay.

Note: when this PR is applied, these changes will not seem to take effect right away. The reason is simple: RabbitMQ still needs to slowly process the existing delay queues, which can take a while (unless you clean them up manually).

@melroy89 added the "enhancement" and "backend" labels Nov 12, 2024
@melroy89 added this to the v1.7.3 milestone Nov 12, 2024
@melroy89 enabled auto-merge (squash) November 12, 2024 18:45
@BentiGorlich (Member)

So this changes the delay pattern to:

  1. 5 minutes
  2. 20 minutes
  3. 80 minutes
  4. 5.33 hours (320 minutes)
  5. 21.33 hours (1280 minutes)

In total these are ~29 hours (5 + 20 + 80 + 320 + 1280 minutes ≈ 28.4 hours). So if a server is down for longer than that, it will only get the messages from the last ~29 hours; anything older will have exhausted its retries.

I guess your goal was to keep fewer messages around in total. Since before the messages could be in rabbit for about a week and with this change they are in for a maximum of 29 hours.

Imo the messengers are a very delicate topic and if rabbit actually gets a bunch of problems with messages being a week in there we should think about introducing new queues that are running in psql. What do you think about that?
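
For reference, Symfony Messenger already ships a Doctrine transport that can keep queued messages in PostgreSQL, so such a fallback queue could look roughly like this (a sketch only; the transport and queue names are made up and not something this PR defines):

```yaml
# Sketch: a PostgreSQL-backed queue via Symfony's Doctrine transport
framework:
    messenger:
        transports:
            long_retry_psql:
                dsn: 'doctrine://default?queue_name=long_retry'
```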

I think a server should be able to be down for a few days without dropping anything really. It's a reliability thing and imo lemmy handles this very well. Without this change we are (again imo) doing very well regarding delivery reliability too, but with this change we drop some of it (imo too much).

@melroy89 (Member, Author) commented Nov 20, 2024

I guess your goal was to keep fewer messages around in total. Since before the messages could be in rabbit for about a week and with this change they are in for a maximum of 29 hours.

Correct, this was and still is my goal. RabbitMQ can't handle that many queued messages; if something is wrong with delivery, your rabbitmq message pile grows very fast, after which CPU spikes and RabbitMQ comes to a halt!

If you don't trust me, ask @Jerry-infosecexchange for his opinion.

I want to reduce the stress on the system and have a working platform, hence I still believe these values in the PR are better defaults.

I think a server should be able to be down for a few days without dropping anything really.

We disagree here. If you want this, you can increase the retries to 7 or 10 again on your server.

This PR reduces the number of retries to get a safer default. I have a beefy server (128 GB RAM, SSDs & NVMes, 32 cores) and @Jerry-infosecexchange has a beefy server as well; even with this kind of hardware we saw significant performance drops (or even outages in some situations) due to RabbitMQ having well over 300k messages in the queue. This PR prevents that situation from happening again.

@BentiGorlich (Member)

I have a beefy server (128 GB RAM, SSDs & NVMes, 32 cores) and @Jerry-infosecexchange has a beefy server as well; even with this kind of hardware we saw significant performance drops

I know, that is why I tried to phrase my message accordingly and suggested a middle ground or "compromise":

if rabbit actually gets a bunch of problems with messages being a week in there we should think about introducing new queues that are running in psql. What do you think about that?


I think a server should be able to be down for a few days without dropping anything really.

We disagree here.

So you think that if a server is down it should just lose traffic? Or what exactly is your opinion about that? Right now I just know it is not the same as mine :D

@melroy89 (Member, Author) commented Nov 20, 2024

So you think that if a server is down it should just lose traffic? Or what exactly is your opinion about that? Right now I just know it is not the same as mine :D

If your server is down, you are missing data by design. Even other implementations like Mastodon only retry for up to 24 hours: https://github.com/mastodon/mastodon/blob/main/app/workers/activitypub/delivery_worker.rb#L14

If a server is down for several days, you probably have bigger problems than missing a comment, like or post here and there.

So in an ideal world you don't lose traffic and every client and server tries to help you recover. In this situation I choose the reliability and stability of the server itself over potentially missing data in case of downtime.

@Jerry-infosecexchange

I agree - if a server is down for over 24 hours, the problems associated with trying to catch up are going to cause more trouble than it's worth. Fedia.io is already hitting rate limits on nearly every lemmy instance without being down. The other fediverse software works in similar ways - after a period of non-response, the instance is flagged and no further delivery attempts are made until it gets an incoming message that indicates it's alive again.

@Jerry-infosecexchange

I should note that even with limiting retries the way I am now, my rabbitmq queues still race back up into the hundreds of thousands. I think it's hard to talk about losing messages when I am having to purge 100000 messages from rabbitmq every day just to keep it running.

@BentiGorlich (Member)

Maybe we have to find another AMQP server if rabbit is scaling so badly...

@melroy89 merged commit e434f16 into main Nov 20, 2024
7 checks passed
@melroy89 deleted the reduce_messenger_retries branch November 20, 2024 15:47
@melroy89 (Member, Author) commented Nov 20, 2024

I think it's hard to talk about losing messages when I am having to purge 100000 messages from rabbitmq every day just to keep it running

Yeah, this PR is not going to solve that of course, but at least it provides better defaults for now...

The large amount of messages Jerry is seeing could still be caused by other ActivityPub implementation issues in our code. We should handle errors better, e.g. rejecting messages instead of retrying in situations where that makes sense.

Also some of these open issues will reduce the load on rabbitmq: #1175.
