Failing verification of LD-Signature does not remove it from inbox #14587
Labels
⚠️bug?
This might be a bug
🌌Federation
The Federation/ActivityPub feature
packages/backend
Server side specific issue/PR
💡 Summary
A job that failed with "skip: LD-Signatureの検証に失敗しました" ("skip: LD-Signature verification failed") was marked as an "unrecoverableerror", but the message was not removed from the inbox pipeline, which stalled it.
🥰 Expected Behavior
If an inbox message can't be accepted, it should be removed from the pipeline immediately (or at least rescheduled for a later retry) so that other messages don't get starved of workers.
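To make the expected behavior concrete, here is a minimal in-memory sketch of the policy I have in mind. It is not Misskey's or BullMQ's actual code: `PermanentError`, `settle`, `drain`, and the attempt limit of 3 are all hypothetical names and values chosen for illustration.

```typescript
// Hypothetical sketch; none of these names come from Misskey or BullMQ.
class PermanentError extends Error {
  constructor(message: string) {
    super(message);
    this.name = "PermanentError";
  }
}

interface Job {
  id: string;
  attempts: number;
}

type Outcome = "done" | "dropped" | "retry";

// Run the handler once and decide what happens to the job afterwards.
function settle(job: Job, handler: (job: Job) => void, maxAttempts = 3): Outcome {
  try {
    handler(job);
    return "done";
  } catch (err) {
    job.attempts += 1;
    // A failed signature check will not succeed on retry, so drop it at once.
    if (err instanceof Error && err.name === "PermanentError") return "dropped";
    // Anything else is treated as transient until the attempt budget runs out.
    return job.attempts < maxAttempts ? "retry" : "dropped";
  }
}

// Drain the queue; retried jobs go to the back so they cannot starve the rest.
function drain(queue: Job[], handler: (job: Job) => void): Map<string, Outcome> {
  const results = new Map<string, Outcome>();
  while (queue.length > 0) {
    const job = queue.shift()!;
    const outcome = settle(job, handler);
    if (outcome === "retry") queue.push(job);
    else results.set(job.id, outcome);
  }
  return results;
}
```

The point of the sketch is only that a permanently failing job frees its worker slot right away, and that a retried job waits behind the jobs already queued instead of holding a slot.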
🤬 Actual Behavior
Today I found that my instance had a delay of more than 8 hours in receiving external notes. I opened the Bull Dashboard and saw 9000 pending inbox messages, and the number was growing. CPU and memory utilization were low. I investigated the running jobs and found they had been stalled for ~20 minutes. All of them show an error but are still in the "running" state (see the reproduction section). This probably stalled the pipeline and caused the pile-up.
Someone on Misskey suggested it might be retry behavior, but (1) a signature mismatch should be a permanent error, and (2) no job should occupy a worker slot for 25 minutes without releasing it to another job first.
Deleting the offending server alone did not resolve the issue (it temporarily unblocks the queue, but it gets blocked again quickly). I needed to manually purge the Redis inbox of that relay for the problem to resolve.
Screenshot: https://mi.yumechi.jp/notes/9ydk0fspg51f000e (sorry, I forgot to screenshot the "Active" tab, but it was the 16 jobs on the Active tab that got stalled).
📝 Steps to Reproduce
I don't know how to inject events manually into the queue, so I'm not sure how to reproduce it exactly, but I saved the event JSON and the error from the Bull Dashboard. (I see an option to manually submit JSON data and options, but I don't know the underlying logic of Misskey, so I wasn't comfortable doing that on my real instance. However, I am willing to try it out if someone on the team could give me a JSON to put in.)
All activities stalled on the queue look like this (they are different events, but of the same type and data structure, and from the same relay).
The Bull Dashboard shows this error:
For some reason, docker and docker-compose logs stop working as soon as the queue is stalled ... I hope the above is enough information.
💻 Frontend Environment
🛰 Backend Environment (for server admin)
Do you want to address this bug yourself?
I don't have time to look at the source today, but I am willing to submit a PR if it is within my capabilities.