=> Bluesky: handle duplicate image blobs when some URLs aren't still serving #1650

Open
roddie-digital opened this issue Dec 24, 2024 · 9 comments
Labels
compat Protocol differences that need special handling.

Comments

@roddie-digital

Hello, I've recently noticed posts coming from a Mastodon account are sometimes being bridged but the included image is missing (it just shows a grey box in bsky). I think the first one I noticed was eleven days ago. They're not typified by being extra-large images (first one I checked was 128KB) and I can't see anything on its https://fed.brid.gy/ap/ page (the successful ones and the unsuccessful ones both show as "posted"). Going to the direct URL for the image just shows "Source image is unreachable". Any help in investigating or mitigating this issue would be much appreciated.

These are some example posts:

https://bsky.app/profile/ghostintheshell.roddie.social/post/3ldxpydeza222
https://bsky.app/profile/ghostintheshell.roddie.social/post/3ldwaztjxpqf2
https://bsky.app/profile/ghostintheshell.roddie.social/post/3ldv7je7g5ky2
https://bsky.app/profile/ghostintheshell.roddie.social/post/3ldqsz76qh4k2
https://bsky.app/profile/ghostintheshell.roddie.social/post/3ldl5nowg3fu2
https://bsky.app/profile/ghostintheshell.roddie.social/post/3ldhz2rsfi5y2
https://bsky.app/profile/ghostintheshell.roddie.social/post/3ldhewwd3vbv2
https://bsky.app/profile/ghostintheshell.roddie.social/post/3ld4oxi7q36f2

@snarfed
Owner

snarfed commented Dec 24, 2024

Hmm, interesting bug, thanks for filing!

Images (blobs) in ATProto are content-addressed by their CID. In these cases, evidently you uploaded them to your Mastodon instance earlier, but didn't use them, and then uploaded them again later and used them in new posts. The first uploads had different URLs, and evidently eventually got garbage collected, but had the same CIDs, so our blobs still point to the earlier URLs.

Example: https://roddie.social/@ghostintheshell/113701671131291688 has image https://cdn.masto.host/roddiesocial/media_attachments/files/113/701/671/033/940/176/original/4c0e5692a6ced709.jpg . We stored our blob for that image at 2024-12-23 10:30:32 UTC with CID bafkreic4pr5u3pcea454drzc5gencumlqklwpt2uxo7n7oxyfaggl6z3ry.

However, we'd also seen that same image before, at URL https://cdn.masto.host/roddiesocial/media_attachments/files/113/527/096/578/357/345/original/f9fcd28bd82a59c6.jpg, and created a blob for it at that URL (with the same CID!) at 2024-11-22 15:12:59 UTC.

So, when a getBlob call comes in for that CID, we return the old URL, which isn't serving any more.
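
To make the collision concrete, here is a minimal Python sketch of how a CID of this shape can be derived from the image bytes alone. It assumes CIDv1 with the raw codec and a SHA-256 multihash, which is consistent with the `bafkrei` prefix above. `raw_sha256_cid`, `blob_urls` and the `cdn.example` URLs are hypothetical names for illustration, not arroba's code.

```python
import base64
import hashlib


def raw_sha256_cid(data: bytes) -> str:
    """Return a CIDv1 string (raw codec, sha2-256, base32) for data."""
    digest = hashlib.sha256(data).digest()
    # 0x01 = CID version 1, 0x55 = raw codec,
    # 0x12 = sha2-256 multihash code, 0x20 = digest length (32 bytes)
    cid_bytes = bytes([0x01, 0x55, 0x12, 0x20]) + digest
    b32 = base64.b32encode(cid_bytes).decode("ascii").lower().rstrip("=")
    return "b" + b32  # "b" is the multibase prefix for lowercase base32


image = b"stand-in for the JPEG bytes"

# The same bytes uploaded twice get two different media URLs but one CID.
first_cid = raw_sha256_cid(image)
second_cid = raw_sha256_cid(image)

# Illustration of the behaviour described above: a store keyed only by CID
# keeps whichever URL it saw first, even after that URL stops serving.
blob_urls: dict[str, str] = {}
blob_urls.setdefault(first_cid, "https://cdn.example/old.jpg")
blob_urls.setdefault(second_cid, "https://cdn.example/new.jpg")

print(first_cid == second_cid)
print(blob_urls[first_cid])
```

Because the URL never enters the hash, nothing in the CID distinguishes the second upload from the first.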

The fix will go into arroba, somewhere around https://github.com/snarfed/arroba/blob/061d385ef4bd6dec18d818dce66a5ad65b113ab4/arroba/datastore_storage.py#L365-L371

@roddie-digital
Author

roddie-digital commented Dec 24, 2024

Thanks for figuring it out so quickly. Trying to get my head around what is happening here: the account is a bot that randomly selects images to post from a list of URLs in a CSV, and posts without interaction are deleted after two weeks. The image would've always been used when it was uploaded, because it's a single script that retrieves an image from a URL, uploads it to Mastodon to get a media ID and then creates a post referencing it. So it happens when an image gets selected that has already been selected before, but the post for it has subsequently been deleted? I'm surprised if it has the same ID, because even if it randomly selects an image it has already selected, it still goes through the process of uploading it to Mastodon and getting a media ID.

@Tamschi
Collaborator

Tamschi commented Dec 24, 2024

ATProto seems to force blobs to be (account, content)-addressed only, i.e. the distinct ID gets entirely erased on the Bluesky side.

I think it's reasonable to serve the most recent URL seen for that (DID, CID) pair.
It does have to be keyed on the DID as well as the CID because otherwise it's trivial for anyone to post the same file and then take it offline, i.e. denial of service against other users.
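
A rough Python sketch of that idea, assuming an in-memory mapping rather than the real datastore. `BlobIndex`, `record` and `lookup` are hypothetical names, and the DIDs, CID and URLs are placeholders.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class BlobIndex:
    """Maps (DID, CID) to the most recent URL seen for that pair."""

    urls: dict[tuple[str, str], str] = field(default_factory=dict)

    def record(self, did: str, cid: str, url: str) -> None:
        # A later sighting overwrites the earlier one, so lookups
        # always return the newest URL for this account and blob.
        self.urls[(did, cid)] = url

    def lookup(self, did: str, cid: str) -> Optional[str]:
        return self.urls.get((did, cid))


index = BlobIndex()
index.record("did:plc:alice", "bafkrei-example", "https://cdn.example/old.jpg")
index.record("did:plc:alice", "bafkrei-example", "https://cdn.example/new.jpg")

# Another account posting the same bytes gets its own entry, so taking
# its copy offline cannot affect alice's blob.
index.record("did:plc:mallory", "bafkrei-example", "https://evil.example/gone.jpg")

print(index.lookup("did:plc:alice", "bafkrei-example"))
```

Keying on the pair is what keeps one account's dead URL from being served for another account's post.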

@snarfed
Owner

snarfed commented Dec 25, 2024

Sure!

Bridgy Fed is already misbehaving here and not really complying with the protocol. Technically, we should be storing the blob ourselves and serving it directly, instead of redirecting to another URL. We've already caused minor problems for Bluesky themselves when external image URLs changed and served a different resolution or type.

I don't really want to take on the liability of hosting media, though, so I'm ok with this minor non-compliance and handling resulting issues like this one.

@Tamschi
Collaborator

Tamschi commented Dec 25, 2024

To be fair that is untrusted data even within their system, so they should be able to handle the file being switched (somewhat) gracefully.

@snarfed
Owner

snarfed commented Dec 26, 2024

> To be fair that is untrusted data even within their system, so they should be able to handle the file being switched (somewhat) gracefully.

Definitely! And they do now. It's still a problem for us, though, since the image stops serving in Bluesky when that happens. 🤷

@snarfed
Owner

snarfed commented Dec 26, 2024

We may also need a better approach than just using the latest URL for a given blob. In the example in #1650 (comment), the first URL stopped serving while the second URL is still ok, but that may not always be the case. We could easily see the opposite, i.e. the first URL serves ok, but the second one gets garbage collected.

We could keep all URLs we've seen for all blobs, and regularly re-check them and switch them all to URLs that still serve, but that's pretty heavyweight. Not sure of a better answer yet though.

@snarfed snarfed changed the title from "Images missing in posts bridged from fediverse to Bluesky" to "=> Bluesky: handle duplicate image blobs when some URLs aren't still serving" Dec 26, 2024
@Tamschi
Collaborator

Tamschi commented Dec 29, 2024

> > To be fair that is untrusted data even within their system, so they should be able to handle the file being switched (somewhat) gracefully.
>
> Definitely! And they do now. It's still a problem for us, though, since the image stops serving in Bluesky when that happens. 🤷

So they actually check the blob CID instead of just using it for addressing? Hm…

I suppose it does help with non-custodial (i.e. the user signs each update) PDS hosting, yeah that's fair.

@Tamschi Tamschi added the compat Protocol differences that need special handling. label Dec 29, 2024
@snarfed
Owner

snarfed commented Dec 29, 2024

Yup. snarfed/arroba#35
