-
-
Notifications
You must be signed in to change notification settings - Fork 2.1k
MSC2246 async uploads #12484
MSC2246 async uploads #12484
Conversation
f643b25
to
43ef05e
Compare
@sumnerevans had some questions in #synapse-dev about how to implement parts of this. See the PR description / comments from @sumnerevans about areas of question. |
We have Be warned that What async code do you have in mind? |
@squahtx I was specifically trying to use However, the media repo's |
synapse-core: We've discussed the first question in #synapse-devs and decided that The second, third and fourth questions could do with some guidance. |
Is the concern is that users on a homeserver without MSC2246 enabled will get 404s when fetching async uploaded media from a remote homeserver? I think the remote homeserver will wait for the default
For specced behaviour, complement or sytest tests would be desired. We're trying to move away from sytest long term, so complement tests would be preferred.
The failing test is: The test case in question can be found at https://github.com/matrix-org/sytest/blob/develop/tests/51media/48admin-quarantine.pl. It:
If you click on Summary and scroll down, you can download synapse logs from the sytest run under the Artifacts section: local homeserver:
remote homeserver:
/thumbnail/ on the remote homeserver is not returning a 404. It's returning a 200 after about 10 seconds. |
If I'm reading the conversation correctly, I think we've given all of Sumner's questions a response. I'll take this out of the review queue and mark it as waiting on changes. If that's not right, please let me know! |
I tried to use Here's my attempt: https://github.com/sumnerevans/synapse/blob/sumner/msc2246-async-uploads-try-lock/synapse/rest/media/v1/upload_resource.py#L143-L171 If I send the requests sequentially as fast as possible, it does not properly linearize and so it updates the content twice. So, the question is whether or not this is good enough or if I should continue making the locking mechanism more foolproof? |
Does it sometimes linearize correctly? And are you testing using postgres or sqlite? I wonder if you've found a bug in
I can't see anything wrong with your attempt at a glance. I think we should try to figure out why it's broken. It might be something to do with how |
I think
from twisted.internet import defer
from twisted.internet.defer import Deferred
...
class LockTestCase(unittest.HomeserverTestCase):
...
def test_acquire_contention(self):
# Track the number of tasks holding the lock.
# Should be at most 1.
in_lock = 0
max_in_lock = 0
release_lock: "Deferred[None]" = Deferred()
async def task():
nonlocal in_lock
nonlocal max_in_lock
lock = await self.store.try_acquire_lock("name", "key")
if not lock:
return
async with lock:
in_lock += 1
max_in_lock = max(max_in_lock, in_lock)
# Block to allow other tasks to attempt to take the lock.
await release_lock
in_lock -= 1
# Start 3 tasks.
defer.ensureDeferred(task())
defer.ensureDeferred(task())
defer.ensureDeferred(task())
# All three are now waiting for the database.
# Give the reactor a kick so that the database transaction returns.
self.pump()
release_lock.callback(None)
# At most one task should have held the lock at a time.
self.assertEqual(max_in_lock, 1) The above test fails with "twisted.trial.unittest.FailTest: 3 != 1". ie. 3 things held the lock at once. We'll probably want to fix that bug in a separate PR first. Sorry for the scope creep! |
5b1d225
to
394b1c1
Compare
f996c9f
to
12eddde
Compare
c8c752b
to
dbdff27
Compare
dbdff27
to
f13152c
Compare
Signed-off-by: Sumner Evans <[email protected]>
Signed-off-by: Sumner Evans <[email protected]>
Signed-off-by: Sumner Evans <[email protected]>
This allows for better handling of all of the versioned media endpoints Signed-off-by: Sumner Evans <[email protected]>
Signed-off-by: Sumner Evans <[email protected]>
This depends on the new newly-fixed locking mechanism from matrix-org#12832 to prevent races. Signed-off-by: Sumner Evans <[email protected]>
The stall will default to 20s if MSC2246 is not enabled. For remote media, the max_stall_ms parameter through to remote server. This way, if servers over federation attempt to request the media, the request will only 404 if the media is large (or the client doesn't upload as soon as it can). Signed-off-by: Sumner Evans <[email protected]>
Signed-off-by: Sumner Evans <[email protected]>
Signed-off-by: Sumner Evans <[email protected]>
Signed-off-by: Sumner Evans <[email protected]>
f13152c
to
0ab87e5
Compare
@sumnerevans could you split this up into separate PRs please? It is a bit too unwieldy for me to follow I'm afraid. I'd suggest something like:
|
@@ -0,0 +1,84 @@ | |||
# Copyright 2014-2016 OpenMarket Ltd | |||
# Copyright 2020-2021 The Matrix.org Foundation C.I.C. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a new file so should be copy righted to 2022
to whatever legal entity you're contributing as
@sumnerevans Should we close this in favor of #15503? |
Closing in favor of #15503 |
Implements MSC2246 by @tulir.
Note to reviewer: in general, you should be able to go commit-by-commit through this, but there are quite a few fixes at the end of the commit log since rebasing was getting difficult.
Outstanding questions:
Updating the content of an existing MXC URI must be atomic. Right now, it is not. I'm usingLockStore
, but there appears to be a bug (see MSC2246 async uploads #12484 (comment)).Blocked by LockStore: fix acquiring a lock viaLockStore.try_acquire_lock
#12832Should the(Decision: default to 20000ms)?fi.mau.msc2246.max_stall_ms
do anything when MSC2246 support is not enabled on the homeserver? My concern is that in the interim period when some homeservers have this enabled and some don't, that users could have a bad experience retrieving media over federation.Where is the best place to put tests for this. Should I create both unit tests and sytest tests?Not going to test this for now, since not in spec.There's one failing sytest test. I tried to look at the logs, but I couldn't really understand what the problem was. If anyone has any suggestions on how to debug, please let me know.(Test has been fixed)Where should theunused_expires_at
config option be placed? (See first comment below)Sign-off
Signed-off-by: Sumner Evans [email protected]
Pull Request Checklist
EventStore
toEventWorkerStore
.".code blocks
.(run the linters)