fix(tee): fix race condition in batch locking #3342
base: main
After [scaling][1] [zksync-tee-prover][2] to two instances/replicas on Azure for azure-stage2, azure-testnet2, and azure-mainnet2, we started experiencing [duplicated proving for some batches][3]. While this is not an erroneous situation, it is wasteful from a resource perspective. This was due to a race condition in batch locking. This PR fixes the issue by adding atomic batch locking. [1]: https://github.com/matter-labs/gitops-kubernetes/pull/7033/files [2]: https://github.com/matter-labs/zksync-era/blob/aaca32b6ab411d5cdc1234c20af8b5c1092195d7/core/bin/zksync_tee_prover/src/main.rs [3]: https://grafana.matterlabs.dev/goto/M1I_Bq7HR?orgId=1
Dumb question: How is the locking made atomic in this PR? AFAIU, the first `SELECT` statement, if queried concurrently, can still return the same L1 batch number unless some kind of row-level locking is implemented (cf. `SELECT FOR UPDATE SKIP LOCKED` in this contract verifier query). I'm not even sure the `UPDATE` query will fail for the transaction committed last in case of a race (maybe it would with the serializable isolation level, but I'd argue that erroring is not the best course of action here; row-level locks seem to work better).
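The concern above can be illustrated with a small in-memory analogy. This is only a sketch, not the PR's code: `BatchQueue`, `claim_next`, and `drain_concurrently` are hypothetical names, and a `Mutex` stands in for the row-level lock that `FOR UPDATE SKIP LOCKED` would take in the database. The point it shows is that the "find an unpicked batch" step and the "mark it picked" step must happen under the same lock, otherwise two provers can pick the same batch.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

/// Toy stand-in for a table of L1 batches awaiting proving (hypothetical).
struct BatchQueue {
    /// (batch number, already picked by a prover) pairs guarded by one lock.
    rows: Mutex<Vec<(u32, bool)>>,
}

impl BatchQueue {
    fn new(batches: impl IntoIterator<Item = u32>) -> Self {
        Self {
            rows: Mutex::new(batches.into_iter().map(|n| (n, false)).collect()),
        }
    }

    /// Finds the first unpicked batch and marks it picked while holding the lock,
    /// so the "select" and the "update" cannot interleave with another caller.
    fn claim_next(&self) -> Option<u32> {
        let mut rows = self.rows.lock().unwrap();
        let row = rows.iter_mut().find(|row| !row.1)?;
        row.1 = true;
        Some(row.0)
    }
}

/// Runs `provers` threads that drain the queue and returns every claimed batch number.
fn drain_concurrently(queue: Arc<BatchQueue>, provers: usize) -> Vec<u32> {
    let handles: Vec<_> = (0..provers)
        .map(|_| {
            let queue = Arc::clone(&queue);
            thread::spawn(move || {
                let mut claimed = Vec::new();
                while let Some(batch) = queue.claim_next() {
                    claimed.push(batch);
                }
                claimed
            })
        })
        .collect();
    handles
        .into_iter()
        .flat_map(|handle| handle.join().unwrap())
        .collect()
}
```

With two threads draining the queue, every batch is claimed exactly once. If `claim_next` released the lock between the lookup and the update, both threads could observe the same unpicked batch, which is the duplicated proving described in this PR.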
    let batch_number = match batch_number {
        Some(batch) => batch.l1_batch_number,
        None => {
            transaction.commit().await?;
There are seemingly no changes in the transaction at this point. Is the commit here just to clean up DB resources faster?
To be honest, I just presumed every `.start_transaction()` needed to end with either `.commit()` or `.rollback()`. I'm not sure how you determine that it's OK to skip it altogether, but I trust your judgment here. I removed this line here. ✅
A transaction that is dropped w/o being committed is rolled back implicitly. You can add an explicit call to be more precise (but IMO it should be `rollback` so that it's clear that no modifying actions have been performed).
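The rollback-on-drop behaviour mentioned above follows the usual Rust guard pattern. The sketch below is a toy model under stated assumptions, not the project's DAL or any real database driver: `Db`, `Transaction`, `begin`, `set`, and the `rollbacks` counter are all hypothetical names introduced for illustration.

```rust
/// Toy in-memory "database" holding a single committed value (hypothetical).
struct Db {
    committed: i64,
    /// Counts transactions that were dropped with uncommitted changes.
    rollbacks: u32,
}

/// Guard that stages a change and applies it only on `commit`.
struct Transaction<'a> {
    db: &'a mut Db,
    staged: Option<i64>,
}

impl Db {
    fn begin(&mut self) -> Transaction<'_> {
        Transaction { db: self, staged: None }
    }
}

impl Transaction<'_> {
    /// Stages a new value; nothing is visible in `Db` until `commit`.
    fn set(&mut self, value: i64) {
        self.staged = Some(value);
    }

    /// Consumes the guard and applies the staged change, if any.
    fn commit(mut self) {
        if let Some(value) = self.staged.take() {
            self.db.committed = value;
        }
    }
}

impl Drop for Transaction<'_> {
    fn drop(&mut self) {
        // Anything still staged here was never committed: discard it.
        if self.staged.take().is_some() {
            self.db.rollbacks += 1;
        }
    }
}
```

In this model, a guard that goes out of scope without `commit` leaves `committed` untouched, so an early return needs no explicit cleanup. An explicit rollback call would only serve to document intent, as suggested in the comment above.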
Not a dumb question at all! The dumb one here was me! ;P I totally misunderstood what SQL transactions can actually handle in this context. Had to brush up on the finer details of SQL locking. Thanks for steering me in the right direction! These two links were super helpful:
    FOR UPDATE OF p
    SKIP LOCKED
I had to explicitly use `OF p` here. If you leave it out or switch it to `OF tee`, you will run into this error:
error: error returned from database: FOR UPDATE cannot be applied to the nullable side of an outer join
The error is understandable. Just to clarify, since it's the `tee_proof_generation_details` table that is updated below: Do I understand correctly that this new query can actually have the `tee_proof_generation_details` side of the join absent (I guess if the query is invoked for a certain batch for the first time)? If so, I'd suggest adding a comment explaining why rows are locked on a table that's not modified in the transaction.
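For readers following this thread, the shape of query being discussed might look roughly like the sketch below. It is an assumption-laden illustration, not the query from this PR: the table name `proof_generation_details`, the join condition, and the `WHERE` clause are hypothetical; only the aliases `p` and `tee`, the `tee_proof_generation_details` table, and the `FOR UPDATE OF p SKIP LOCKED` clause come from the conversation.

```rust
/// Hypothetical query shape. Everything except the `p`/`tee` aliases,
/// `tee_proof_generation_details`, and the locking clause is an assumption.
const LOCK_NEXT_BATCH_SQL: &str = r#"
SELECT p.l1_batch_number
FROM proof_generation_details p
LEFT JOIN tee_proof_generation_details tee
    ON p.l1_batch_number = tee.l1_batch_number
WHERE tee.l1_batch_number IS NULL
ORDER BY p.l1_batch_number
LIMIT 1
-- Lock only the non-nullable side of the outer join. A batch picked for the
-- first time has no `tee` row yet, so there would be nothing to lock there.
FOR UPDATE OF p
SKIP LOCKED
"#;
```

The `WHERE tee.l1_batch_number IS NULL` condition models the case raised above, where the `tee` side of the join is absent. That is why the row lock has to sit on `p` even though `p` itself is not modified later in the transaction.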
@slowli, I’ve addressed your code review comments. Take a look when you get a chance. It’s kinda hard to test properly without deploying it to
What ❔
After scaling `zksync-tee-prover` to two instances/replicas on Azure for `azure-stage2`, `azure-testnet2`, and `azure-mainnet2`, we started experiencing duplicated proving for some batches. While this is not an erroneous situation, it is wasteful from a resource perspective. This was due to a race condition in batch locking. This PR fixes the issue by adding atomic batch locking.
Why ❔
To fix the bug that only activates after running `zksync-tee-prover` on multiple instances.
Checklist
`zkstack dev fmt` and `zkstack dev lint`.