Fine-tune messenger retry strategy #1218
Conversation
So this changes the delay pattern to:
In total these are ~29 hours, so if a server is down for longer than that, it will miss everything older than that window. I guess your goal was to keep fewer messages around in total: before, messages could sit in RabbitMQ for about a week, and with this change they stay for a maximum of ~29 hours. IMO the messengers are a very delicate topic, and if RabbitMQ actually gets into trouble with messages sitting there for a week, we should think about introducing new queues that run in PostgreSQL instead. What do you think about that? I think a server should be able to be down for a few days without really dropping anything. It's a reliability thing, and IMO Lemmy handles this very well. Without this change we are (again, IMO) doing very well regarding delivery reliability too, but with this change we give up some of it (IMO too much).
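As a back-of-the-envelope check on the "~29 hours" figure: assuming a Symfony-Messenger-style exponential backoff, with an assumed base delay of 5 minutes and multiplier of 4 (assumptions chosen because 300000 × 4⁴ reproduces the 76800000 ms cap from this PR exactly), the cumulative delay works out as follows:

```python
# Sanity check for the "~29 hours" total retry window.
# Assumed: base delay 300000 ms (5 min) and multiplier 4; these are not
# stated in the PR, but 300000 * 4**4 == 76800000, matching the new cap.
base_delay_ms = 300_000       # assumed base delay
multiplier = 4                # assumed multiplier
max_retries = 5               # from the PR (was 7)
max_delay_ms = 76_800_000     # from the PR (was 86400000, i.e. 24 h)

delays = [min(base_delay_ms * multiplier**n, max_delay_ms)
          for n in range(max_retries)]
total_hours = sum(delays) / 3_600_000

print(delays)                  # [300000, 1200000, 4800000, 19200000, 76800000]
print(round(total_hours, 1))   # 28.4 -> i.e. roughly the "~29 hours" quoted
```

Under those assumptions the five retries span about 28.4 hours in total, which matches the rough figure discussed above.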
Correct, this was and still is my goal. RabbitMQ can't handle that many queued messages; if something is wrong with delivery, the RabbitMQ message pile fills up very fast, after which CPU usage spikes and RabbitMQ grinds to a halt! If you don't trust me, ask @Jerry-infosecexchange for his opinion. I want to reduce the stress on the system and have a working platform, hence I still believe the values in this PR are better defaults.
We disagree here. If you want this, you can increase the retries to 7 or 10 again on your server. This PR reduces the number of retries to get a safer default. I have a beefy server (128GB RAM, SSDs & NVMes, and 32 cores) and @Jerry-infosecexchange has a beefy server as well; even with this kind of hardware we would see significant performance drops (or even outages in some situations) due to RabbitMQ having well over 300k messages in the queue. This PR prevents that situation from happening again.
I know, that is why I tried to phrase my message accordingly and suggested a middle ground or "compromise":
So you think that if a server is down it should just lose traffic? Or what exactly is your opinion about that? Right now I just know it is not the same as mine :D
If your server is down, you are missing data by design. Even other implementations like Mastodon only retry for up to about 24 hours: https://github.com/mastodon/mastodon/blob/main/app/workers/activitypub/delivery_worker.rb#L14 If a server is down for several days, you probably have bigger problems than a missing comment, like, or post here and there. In an ideal world you don't lose traffic and every client and server tries to recover you. In this situation I choose the reliability and stability of the server itself over potentially missing some data in case of downtime.
I agree - if a server is down for over 24 hours, trying to catch up causes more problems than it's worth. Fedia.io is already hitting rate limits on nearly every Lemmy instance without being down. Other fediverse software works in similar ways: after a period of non-response, the instance is flagged and no further delivery attempts are made until an incoming message indicates it's alive again.
I should note that even with limiting retries the way I am now, my RabbitMQ queues still race back up into the hundreds of thousands. I think it's hard to talk about losing messages when I am having to purge 100,000 messages from RabbitMQ every day just to keep it running.
Maybe we have to find another AMQP server if RabbitMQ scales this badly...
Yea, this PR is not going to solve that of course, but it at least provides better defaults for now. The large amount of messages Jerry is seeing could still be caused by other ActivityPub implementation issues in our code, e.g. we could handle errors better by rejecting messages instead of retrying in situations where that makes sense. Also, some of these open issues will reduce the load on RabbitMQ: #1175.
RabbitMQ is unable to handle huge queue sizes and backlogs: above roughly 200k queued messages, it slowly drains to a halt and CPU usage spikes.
To prevent this, Jerry and I have been running the following retry strategy for the past months instead. Reducing the retries looks very promising to us as a way to proactively avoid stability issues with RabbitMQ, even when running RabbitMQ v4.0.
- Retries: 7 → 5
- Maximum delay: 86400000 ms (24 h) → 76800000 ms (~21.3 h)
We have observed, in cases of network/congestion issues on my server and when running a large instance like @Jerry-infosecexchange's, that shrinking the RabbitMQ retry queues benefits the stability of RabbitMQ a lot.
The changes above still allow plenty of retries with a long enough delay duration.
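For context, this is roughly what the change looks like in a Symfony Messenger transport configuration. This is a sketch, not the actual diff: the file path, transport name, and the `delay`/`multiplier` values are assumptions; only `max_retries` and `max_delay` reflect the values changed in this PR.

```yaml
# config/packages/messenger.yaml (path and surrounding keys are assumptions)
framework:
    messenger:
        transports:
            async:
                dsn: '%env(MESSENGER_TRANSPORT_DSN)%'
                retry_strategy:
                    max_retries: 5        # was 7
                    delay: 300000         # base delay in ms (assumed)
                    multiplier: 4         # assumed
                    max_delay: 76800000   # was 86400000 (24 h -> ~21.3 h)
```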
Note: when this PR is applied, the changes will not seem to take effect immediately. The reason is simple: RabbitMQ still needs to slowly process the existing delay queues, which can take a while (unless you clean them up manually).