=> Bluesky: handle duplicate image blobs when some URLs aren't still serving #1650
Comments
Hmm, interesting bug, thanks for filing! Images (blobs) in ATProto are content-addressed by their CID. In these cases, evidently you uploaded them to your Mastodon instance earlier but didn't use them, and then uploaded them again later and used them in new posts. The first uploads had different URLs and evidently eventually got garbage collected, but they had the same CIDs, so our blobs still point to the earlier URLs. Example: https://roddie.social/@ghostintheshell/113701671131291688 has image https://cdn.masto.host/roddiesocial/media_attachments/files/113/701/671/033/940/176/original/4c0e5692a6ced709.jpg . We stored our blob for that image at 2024-12-23 10:30:32 UTC with CID bafkreic4pr5u3pcea454drzc5gencumlqklwpt2uxo7n7oxyfaggl6z3ry. However, we'd also seen that same image before, at URL https://cdn.masto.host/roddiesocial/media_attachments/files/113/527/096/578/357/345/original/f9fcd28bd82a59c6.jpg, and created a blob for it at that URL (with the same CID!) at 2024-11-22 15:12:59 UTC. So, when a The fix will go into arroba, somewhere around https://github.com/snarfed/arroba/blob/061d385ef4bd6dec18d818dce66a5ad65b113ab4/arroba/datastore_storage.py#L365-L371
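To illustrate why two uploads of the same file collapse onto one blob, here is a minimal sketch of content addressing. It assumes the blob CID is a CIDv1 with the raw codec and a SHA-256 multihash, base32-encoded (which matches the `bafkrei` prefix of the CID quoted above). The function name and the example URLs are made up for illustration and are not arroba's code.

```python
import base64
import hashlib


def raw_sha256_cid(data: bytes) -> str:
    """Build a CIDv1 string (raw codec, sha2-256) from the bytes alone."""
    digest = hashlib.sha256(data).digest()
    # 0x01 = CID version 1, 0x55 = raw codec,
    # 0x12 = sha2-256 multihash code, 0x20 = digest length (32 bytes)
    cid_bytes = bytes([0x01, 0x55, 0x12, 0x20]) + digest
    # "b" is the multibase prefix for lowercase, unpadded base32
    encoded = base64.b32encode(cid_bytes).decode("ascii").lower().rstrip("=")
    return "b" + encoded


# Hypothetical stand-in for the image file; both uploads carry identical bytes.
image_bytes = b"identical image bytes"

first_upload = {
    "url": "https://media.example/files/111/original/aaa.jpg",  # hypothetical
    "cid": raw_sha256_cid(image_bytes),
}
second_upload = {
    "url": "https://media.example/files/222/original/bbb.jpg",  # hypothetical
    "cid": raw_sha256_cid(image_bytes),
}

# The URLs differ, but the CID depends only on the content.
same_cid = first_upload["cid"] == second_upload["cid"]
```

Because the URL never enters the hash, a store keyed by CID has no way to tell the two uploads apart unless it also tracks which URL it saw most recently.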
Thanks for figuring it out so quickly. I'm trying to get my head around what is happening here. The account is a bot that randomly selects images to post from a list of URLs in a CSV, and posts without interaction are deleted after two weeks. The image would've always been used when it was uploaded, because it's a single script that retrieves an image from a URL, uploads it to Mastodon to get a media ID, and then creates a post referencing it. So it's when we get an image selected that has already been selected before, but the post for it has subsequently been deleted? I'm surprised if it has the same ID, because even if it randomly selects an image it's already selected, it still goes through the process of uploading it to Mastodon and getting a media ID.
ATProto seems to force blobs to be (account, content)-addressed only, i.e. the distinct ID gets entirely erased on the Bluesky side. I think it's reasonable to serve the most recent URL seen for that (DID, CID) pair.
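The "most recent URL per (DID, CID) pair" idea could look roughly like the sketch below. It is an in-memory illustration only: `LatestUrlIndex`, its methods, and the DID are hypothetical names, not arroba's datastore model. The URLs, CID, and timestamps are the ones from the example earlier in this thread.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Dict, Optional, Tuple


@dataclass
class SeenUrl:
    url: str
    seen_at: datetime


class LatestUrlIndex:
    """Remember only the newest URL seen for each (DID, CID) pair."""

    def __init__(self) -> None:
        self._by_key: Dict[Tuple[str, str], SeenUrl] = {}

    def record(self, did: str, cid: str, url: str, seen_at: datetime) -> None:
        key = (did, cid)
        current = self._by_key.get(key)
        # Overwrite only when this sighting is at least as new as the stored one.
        if current is None or seen_at >= current.seen_at:
            self._by_key[key] = SeenUrl(url, seen_at)

    def url_for(self, did: str, cid: str) -> Optional[str]:
        entry = self._by_key.get((did, cid))
        return entry.url if entry else None


DID = "did:plc:example"  # hypothetical DID for illustration
CID = "bafkreic4pr5u3pcea454drzc5gencumlqklwpt2uxo7n7oxyfaggl6z3ry"
OLD_URL = "https://cdn.masto.host/roddiesocial/media_attachments/files/113/527/096/578/357/345/original/f9fcd28bd82a59c6.jpg"
NEW_URL = "https://cdn.masto.host/roddiesocial/media_attachments/files/113/701/671/033/940/176/original/4c0e5692a6ced709.jpg"

index = LatestUrlIndex()
index.record(DID, CID, OLD_URL, datetime(2024, 11, 22, 15, 12, 59, tzinfo=timezone.utc))
index.record(DID, CID, NEW_URL, datetime(2024, 12, 23, 10, 30, 32, tzinfo=timezone.utc))

served_url = index.url_for(DID, CID)  # the December URL
```

Comparing timestamps rather than relying on insertion order keeps the result stable if sightings are recorded out of order.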
Sure! Bridgy Fed is already misbehaving here and not really complying with the protocol. Technically, we should be storing the blob ourselves and serving it directly, instead of redirecting to another URL. We've already caused minor problems for Bluesky themselves when external image URLs changed and served a different resolution or type. I don't really want to take on the liability of hosting media, though, so I'm ok with this minor non-compliance and handling resulting issues like this one.
To be fair that is untrusted data even within their system, so they should be able to handle the file being switched (somewhat) gracefully.
Definitely! And they do now. It's still a problem for us, though, since the image stops serving in Bluesky when that happens. 🤷
We may also need a better approach than just using the latest URL for a given blob. In the example in #1650 (comment), the first URL stopped serving while the second URL is still ok, but that may not always be the case. We could easily see the opposite, i.e. the first URL serves ok, but the second one gets garbage collected. We could keep all URLs we've seen for all blobs, regularly re-check them, and switch them all to URLs that still serve, but that's pretty heavyweight. Not sure of a better answer yet though.
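As a rough sketch of the heavier "keep every URL and re-check" alternative, the class below stores all URLs seen per (DID, CID) and returns the newest one that a caller-supplied check reports as still serving. Everything here is hypothetical: `BlobUrlCandidates`, the `is_serving` callback (which in practice might be an HTTP HEAD request), and the example values are assumptions for illustration, not a proposed arroba design.

```python
from typing import Callable, Dict, List, Optional, Tuple


class BlobUrlCandidates:
    """Keep every URL seen for a (DID, CID) pair; prefer the newest live one."""

    def __init__(self, is_serving: Callable[[str], bool]) -> None:
        # is_serving is injected so the liveness check can be swapped or faked.
        self._is_serving = is_serving
        self._urls: Dict[Tuple[str, str], List[str]] = {}

    def add(self, did: str, cid: str, url: str) -> None:
        urls = self._urls.setdefault((did, cid), [])
        if url in urls:
            urls.remove(url)  # re-seen URL moves to the newest position
        urls.append(url)

    def pick(self, did: str, cid: str) -> Optional[str]:
        # Walk from newest to oldest and return the first URL that still serves.
        for url in reversed(self._urls.get((did, cid), [])):
            if self._is_serving(url):
                return url
        return None


# Hypothetical example values.
DID_EXAMPLE = "did:plc:example"
CID_EXAMPLE = "bafkreiexample"
FIRST = "https://media.example/files/111/original/aaa.jpg"
SECOND = "https://media.example/files/222/original/bbb.jpg"

gone = {SECOND}  # pretend the newer URL was garbage collected
candidates = BlobUrlCandidates(is_serving=lambda url: url not in gone)
candidates.add(DID_EXAMPLE, CID_EXAMPLE, FIRST)
candidates.add(DID_EXAMPLE, CID_EXAMPLE, SECOND)

picked = candidates.pick(DID_EXAMPLE, CID_EXAMPLE)  # falls back to FIRST
```

This handles both failure directions described above, at the cost of extra storage and a liveness check per lookup or per periodic sweep.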
So they actually check the blob CID instead of just using it for addressing? Hm… I suppose it does help with non-custodial (i.e. the user signs each update) PDS hosting, yeah that's fair.
Yup. snarfed/arroba#35
Hello, I've recently noticed posts coming from a Mastodon account are sometimes being bridged but the included image is missing (it just shows a grey box in bsky). I think the first one I noticed was eleven days ago. They're not typified by being extra-large images (first one I checked was 128KB) and I can't see anything on its https://fed.brid.gy/ap/ page (the successful ones and the unsuccessful ones both show as "posted"). Going to the direct URL for the image just shows "Source image is unreachable". Any help in investigating or mitigating this issue would be much appreciated.
These are some example posts:
https://bsky.app/profile/ghostintheshell.roddie.social/post/3ldxpydeza222
https://bsky.app/profile/ghostintheshell.roddie.social/post/3ldwaztjxpqf2
https://bsky.app/profile/ghostintheshell.roddie.social/post/3ldv7je7g5ky2
https://bsky.app/profile/ghostintheshell.roddie.social/post/3ldqsz76qh4k2
https://bsky.app/profile/ghostintheshell.roddie.social/post/3ldl5nowg3fu2
https://bsky.app/profile/ghostintheshell.roddie.social/post/3ldhz2rsfi5y2
https://bsky.app/profile/ghostintheshell.roddie.social/post/3ldhewwd3vbv2
https://bsky.app/profile/ghostintheshell.roddie.social/post/3ld4oxi7q36f2