Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Failing verification of LD-Signature does not remove it from inbox #14587

Open
1 task
eternal-flame-AD opened this issue Sep 20, 2024 · 4 comments
Open
1 task
Labels
⚠️bug? This might be a bug 🌌Federation The Federation/ActivityPub feature packages/backend Server side specific issue/PR

Comments

@eternal-flame-AD
Copy link
Contributor

eternal-flame-AD commented Sep 20, 2024

💡 Summary

"skip: LD-Signatureの検証に失敗しました" was marked as a "unrecoverableerror" but doesn't remove the message from inbox pipeline, stalling it.

🥰 Expected Behavior

If an inbox message can't be accepted, it should be removed from the pipeline immediately (or at least a retry that happens later in time) so as other messages don't get starved of workers.

🤬 Actual Behavior

I found today my instance has a >8h delay in receiving external notes. I went to see the Bull Dashboard and say 9000 pending inbox messages and growing. The CPU and memory utility was low. I investigated the running jobs and found they have been stalled for ~20 minutes. All show an error but are still in the "running state" (see reproduction section). Probably this stalled the pipeline causing the pile up.

Someone on misskey suggested it might be retrying behavior but (1) If it's a signature mismatch it should be permanent error (2) No job should take up a slot in the workers for 25 minutes without releasing it to someone else first.

Deleting the offending server alone did not resolve the issue (temporarily unblocks it but get's blocked again quickly), I need to manually purge the redis inbox of that relay for the problem to resolve.

Screenshot: https://mi.yumechi.jp/notes/9ydk0fspg51f000e (sorry I forgot to screenshot the "Active" tab, but it was the 16 jobs on the Active tab that got stalled.

📝 Steps to Reproduce

I don't know how to inject events manually into the queue so not sure how to exacly reproduce it again but I saved the event json, and the error on the Bull Dashboard. (I see an option to manually put in a JSON data+options but I don't know the underlying logic of misskey so wasn't comfortable just putting it in on my real instance, however I am will to try it out if someone on the team could give me a JSON to put in)

All activities stalled on the queue look like this (it's different events but same type and data structure, from the same relay).

{
  "jobData": {
    "activity": {
      "@context": [
        "https://www.w3.org/ns/activitystreams",
        "https://w3id.org/security/v1"
      ],
      "actor": "https://relay.infosec.exchange/actor",
      "id": "https://relay.infosec.exchange/activity/f390340e-a6d0-4b5b-83c4-130e025d770f",
      "object": "https://e-komik.org/chapter/413697",
      "to": [
        "https://relay.infosec.exchange/followers"
      ],
      "type": "Announce"
    },
    "signature": {
      "scheme": "Signature",
      "params": {
        "keyId": "https://relay.infosec.exchange/actor#main-key",
        "algorithm": "hs2019",
        "headers": [
          "(request-target)",
          "accept",
          "content-type",
          "date",
          "digest",
          "host"
        ],
        "signature": "pYJrmyxIXqEFqpJl/BZEyiNKcESqvrLNCjoHC7nWYk2cs51sC4EsdVIkBKYtJOVSwjC2F6s9YGixUiYMQs1P2NN6dBXWZzCz6EcXqjg/4wa83/a4sXi41ZTe4Zg25FtzrP4nmKcSXWnpqC0Yj9M1+Jhp066KRl4g948rN6YRG8UbCeqfyBO+3GsszlUspMy2it4MpSYUZkiYDgltNcsjtiJ6fQvcI7wYzTYPJU/YM1FRgxWJRNHrYQbUvuhZhbFXUDQgsxxA1+ntwv6umu4sDZSmSI9Kz2rq+fdsQseOEeVO7OjPF/SmFQaVBPPw13WFH3BA0G35nUCh/ZRnYQ5h0TInKG7C0bkw16V1aN16MOscz4ut3x1g9nxP5A+PerGIXjza5fRKrPdzcTOcPe6yRaR+N61OBEZLqSy77Z1ektCj4rhHY+HqJQwJWvP1vuCry0Jhukrp4pZOhJXePOBUy2rGqRkdTy7nMjh+5HVMwOywxM/lRyHrsDVG+iG/SFK7esSibZ491mH32lF1w+6CW5UCDgjsU0Qe3BXslXEkWiSJW57U+5IEvhgcwfKQnhaOY9xInFYSIAbyENnDvCNQKQM7lr4eLSXZwHTFHyLe0juoW7ClXkr4IGsEw1XY1xBUPpq6PbMklmZi/jwSjuCfhR1Hzq/IzVqY4R0e3rTx8kM="
      },
      "signingString": "(request-target): post /inbox\naccept: application/activity+json\ncontent-type: application/activity+json\ndate: Thu, 19 Sep 2024 18:12:52 GMT\ndigest: SHA-256=IWErZ8d8jaeErrZ7ICJ/l+3lfdNDR6w2Z9hbakNtSLY=\nhost: mi.yumechi.jp",
      "algorithm": "HS2019",
      "keyId": "https://relay.infosec.exchange/actor#main-key"
    }
  },
  "returnValue": null
}

Bulls dashboard shows this error:

Error: quote resolve failed
    at ApNoteService.createNote (file:///misskey/packages/backend/built/core/activitypub/models/ApNoteService.js:239:27)
    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
    at async ApNoteService.resolveNote (file:///misskey/packages/backend/built/core/activitypub/models/ApNoteService.js:324:20)
    at async ApNoteService.createNote (file:///misskey/packages/backend/built/core/activitypub/models/ApNoteService.js:197:40)
    at async ApNoteService.resolveNote (file:///misskey/packages/backend/built/core/activitypub/models/ApNoteService.js:324:20)
    at async ApInboxService.announceNote (file:///misskey/packages/backend/built/core/activitypub/ApInboxService.js:282:26)
    at async ApInboxService.announce (file:///misskey/packages/backend/built/core/activitypub/ApInboxService.js:261:36)
    at async ApInboxService.performOneActivity (file:///misskey/packages/backend/built/core/activitypub/ApInboxService.js:165:20)
    at async ApInboxService.performActivity (file:///misskey/packages/backend/built/core/activitypub/ApInboxService.js:134:22)
    at async InboxProcessorService.process (file:///misskey/packages/backend/built/queue/processors/InboxProcessorService.js:187:28)
    at async Worker.processJob (/misskey/node_modules/.pnpm/[email protected]/node_modules/bullmq/dist/cjs/classes/worker.js:445:28)
    at async Worker.retryIfFailed (/misskey/node_modules/.pnpm/[email protected]/node_modules/bullmq/dist/cjs/classes/worker.js:634:24)

  e: {
    stack: 'UnrecoverableError: skip: LD-Signatureの検証に失敗しました\n' +
      '    at InboxProcessorService.process (file:///misskey/packages/backend/built/queue/processors/InboxProcessorService.js:1
32:27)\n' +
      '    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)\n' +
      '    at async Worker.processJob (/misskey/node_modules/.pnpm/[email protected]/node_modules/bullmq/dist/cjs/classes/worker.js
:445:28)\n' +
      '    at async Worker.retryIfFailed (/misskey/node_modules/.pnpm/[email protected]/node_modules/bullmq/dist/cjs/classes/worker
.js:634:24)',
    message: 'skip: LD-Signatureの検証に失敗しました',
    name: 'UnrecoverableError'
  }
}

docker and docker-compose logs for some reason stops working as soon as it's stalled ... i hope the above is enough information.

💻 Frontend Environment

Not a frontend problem. Firefox and aria both show delayed messages. 

git describe=2024.8.0-2-g882c8b93c

My instance is https://mi.yumechi.jp/

Here's note featuring some screenshots: https://mi.yumechi.jp/notes/9ydpeizaxdq8000f https://mi.yumechi.jp/notes/9ydpeizaxdq8000f

🛰 Backend Environment (for server admin)

Deployed with official docker-compose file, redis:7-alpine, postgres:15-alpine.

git describe=2024.8.0-2-g882c8b93c

OS is synology DSM, kernel 4.4, x86_64

Do you want to address this bug yourself?

I don't have time looking at the source today but I am willing to submit a PR if it is within my capabilities.

  • Yes, I will patch the bug myself and send a pull request
@eternal-flame-AD eternal-flame-AD added the ⚠️bug? This might be a bug label Sep 20, 2024
@eternal-flame-AD eternal-flame-AD changed the title Input queue stalled by certain relay messages Inbox queue stalled by certain relay messages Sep 20, 2024
@CHN-beta
Copy link

Same issue.

@KisaragiEffective KisaragiEffective added packages/backend Server side specific issue/PR 🌌Federation The Federation/ActivityPub feature labels Sep 23, 2024
@KisaragiEffective KisaragiEffective changed the title Inbox queue stalled by certain relay messages Failing verification of LD-Signature does not remove it from inbox Sep 23, 2024
@turkeysanwich
Copy link

I had the same issue, stuff from e-komik.org through a relay stalled the whole queue. blocking the relay didn't work but blocking e-komik.org seems to have cleared it up.

@KisaragiEffective
Copy link
Collaborator

permalink:

throw new Bull.UnrecoverableError('skip: LD-Signatureの検証に失敗しました');

@eternal-flame-AD
Copy link
Contributor Author

eternal-flame-AD commented Sep 23, 2024

Update: I read the source and could not find exactly where the issue was. However I see similar code paths to 'skip: LD-Signatureの検証に失敗しました' in today's log after I removed the offending relay and there was no stall. It seems to be something specific to the e-komik event.

Maybe add a hard timeout to the jobs, I think if one job takes more than 1 minute most instance owners would rather not receive it? ref: https://docs.bullmq.io/patterns/timeout-jobs

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
⚠️bug? This might be a bug 🌌Federation The Federation/ActivityPub feature packages/backend Server side specific issue/PR
Projects
Development

No branches or pull requests

4 participants