fix(tee): fix race condition in batch locking #3342

pbeza · 2024-11-28T11:55:23Z

What ❔

After scaling zksync-tee-prover to two instances/replicas on Azure for azure-stage2, azure-testnet2, and azure-mainnet2, we started experiencing duplicated proving for some batches.

While this is not an erroneous situation, it is wasteful from a resource perspective. This was due to a race condition in batch locking. This PR fixes the issue by adding atomic batch locking.

Why ❔

To fix a bug that only manifests when zksync-tee-prover runs on multiple instances.

Checklist

PR title corresponds to the body of PR (we generate changelog entries from PRs).
Tests for the changes have been added / updated.
Documentation comments have been added / updated.
Code has been formatted via zkstack dev fmt and zkstack dev lint.

Links from the description:

scaling: https://github.com/matter-labs/gitops-kubernetes/pull/7033/files
zksync-tee-prover: https://github.com/matter-labs/zksync-era/blob/aaca32b6ab411d5cdc1234c20af8b5c1092195d7/core/bin/zksync_tee_prover/src/main.rs
duplicated proving for some batches: https://grafana.matterlabs.dev/goto/M1I_Bq7HR?orgId=1

slowli

Dumb question: How is the locking made atomic in this PR? AFAIU, the first SELECT statement, if queried concurrently, can still return the same L1 batch number unless some kind of row-level locking is implemented (cf. SELECT FOR UPDATE SKIP LOCKED in this contract verifier query). I'm not even sure the UPDATE query will fail for the transaction committed last in case of a race (maybe it would with the serializable isolation level, but I'd argue that erroring is not the best course of action here; row-level locks seem to work better).
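To make the race described above concrete, here is a minimal in-memory sketch in Rust. It is not the DAL code from this PR: `Table`, `select_unpicked`, `mark_picked`, and `claim_atomically` are hypothetical names, and a `Mutex` only stands in for a row-level lock. One difference worth noting: with `SKIP LOCKED`, a concurrent transaction skips a locked row instead of waiting for it, whereas the `Mutex` here makes the second caller wait.

```rust
use std::collections::BTreeMap;
use std::sync::Mutex;

/// Hypothetical stand-in for the batch table: batch number -> prover that picked it.
struct Table {
    rows: Mutex<BTreeMap<u32, Option<&'static str>>>,
}

impl Table {
    fn new(batches: &[u32]) -> Self {
        let rows: BTreeMap<u32, Option<&'static str>> =
            batches.iter().map(|&b| (b, None)).collect();
        Table { rows: Mutex::new(rows) }
    }

    /// Step 1 of the racy flow: a plain SELECT of the lowest unpicked batch.
    fn select_unpicked(&self) -> Option<u32> {
        let rows = self.rows.lock().unwrap();
        rows.iter().find(|(_, owner)| owner.is_none()).map(|(&b, _)| b)
    }

    /// Step 2 of the racy flow: an unconditional UPDATE marking the batch as picked.
    fn mark_picked(&self, batch: u32, prover: &'static str) {
        self.rows.lock().unwrap().insert(batch, Some(prover));
    }

    /// Select and mark under a single lock, so no other prover can
    /// observe the row between the two steps.
    fn claim_atomically(&self, prover: &'static str) -> Option<u32> {
        let mut rows = self.rows.lock().unwrap();
        let batch = rows.iter().find(|(_, owner)| owner.is_none()).map(|(&b, _)| b)?;
        rows.insert(batch, Some(prover));
        Some(batch)
    }
}

/// Replays the problematic interleaving: both provers SELECT before either UPDATEs.
fn racy_interleaving() -> (u32, u32) {
    let table = Table::new(&[1, 2]);
    let a = table.select_unpicked().unwrap();
    let b = table.select_unpicked().unwrap();
    table.mark_picked(a, "prover-a");
    table.mark_picked(b, "prover-b");
    (a, b)
}

/// The same two provers, each claiming a batch in one atomic step.
fn atomic_claims() -> (u32, u32) {
    let table = Table::new(&[1, 2]);
    let a = table.claim_atomically("prover-a").unwrap();
    let b = table.claim_atomically("prover-b").unwrap();
    (a, b)
}
```

In the replayed interleaving both provers end up with batch 1, which is the duplicated proving from the PR description; with the atomic claim they get batches 1 and 2.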

core/lib/dal/src/models/storage_tee_proof.rs

slowli · 2024-11-29T15:41:42Z

core/lib/dal/src/tee_proof_generation_dal.rs

+        let batch_number = match batch_number {
+            Some(batch) => batch.l1_batch_number,
+            None => {
+                transaction.commit().await?;


There are seemingly no changes in the transaction at this point. Is the commit here just to clean up DB resources faster?

pbeza · 2024-11-29

To be honest, I just presumed every .start_transaction() needed to either end with .commit() or .rollback(). I'm not sure how you determine that it's OK to skip it altogether, but I trust your judgment here. I removed this line here. ✅

slowli · 2024-12-02

A transaction that is dropped w/o being committed is rolled back implicitly. You can add an explicit call to be more precise (but IMO it should be rollback so that it's clear that no modifying actions have been performed).
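The drop behaviour can be illustrated with a small, self-contained sketch. `Db` and `Tx` are hypothetical types, not the DAL or database-driver API; they only mimic the pattern of a transaction guard whose `Drop` implementation discards pending work unless `commit` was called.

```rust
use std::cell::RefCell;

/// Hypothetical in-memory "database": committed values plus an event log.
#[derive(Default)]
struct Db {
    committed: RefCell<Vec<u32>>,
    log: RefCell<Vec<&'static str>>,
}

/// Hypothetical transaction guard that buffers writes until commit.
struct Tx<'a> {
    db: &'a Db,
    pending: Vec<u32>,
    done: bool,
}

impl<'a> Tx<'a> {
    fn begin(db: &'a Db) -> Self {
        Tx { db, pending: Vec::new(), done: false }
    }

    fn insert(&mut self, value: u32) {
        self.pending.push(value);
    }

    /// Makes the buffered writes visible and marks the guard as finished.
    fn commit(mut self) {
        self.db.committed.borrow_mut().append(&mut self.pending);
        self.db.log.borrow_mut().push("commit");
        self.done = true;
    }

    /// Explicitly discards the buffered writes.
    fn rollback(mut self) {
        self.pending.clear();
        self.db.log.borrow_mut().push("rollback");
        self.done = true;
    }
}

impl Drop for Tx<'_> {
    fn drop(&mut self) {
        // Neither commit nor rollback was called: pending writes are discarded.
        if !self.done {
            self.db.log.borrow_mut().push("implicit rollback");
        }
    }
}

fn demo() -> (Vec<u32>, Vec<&'static str>) {
    let db = Db::default();
    {
        let mut tx = Tx::begin(&db);
        tx.insert(1);
        // Early exit without commit: the guard is dropped at the end of this scope.
    }
    let mut tx = Tx::begin(&db);
    tx.insert(2);
    tx.commit();
    let tx = Tx::begin(&db);
    tx.rollback();
    let committed = db.committed.borrow().clone();
    let log = db.log.borrow().clone();
    (committed, log)
}
```

Calling `demo()` leaves only the value 2 committed. The explicit `rollback` call changes nothing functionally compared with dropping the guard, but it documents the intent at the call site, which is the point made in the comment above.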


pbeza · 2024-11-29T18:13:12Z

Dumb question: How is the locking made atomic in this PR? (...)

Not a dumb question at all! The dumb one here was me! ;P I totally misunderstood what SQL transactions can actually handle in this context. Had to brush up on the finer details of SQL locking. Thanks for steering me in the right direction! These two links were super helpful:


pbeza · 2024-11-29T18:42:39Z

core/lib/dal/src/tee_proof_generation_dal.rs

+            FOR UPDATE OF p
+            SKIP LOCKED


I had to explicitly use OF p here. If you leave it out or switch it to OF tee, you will run into this error:

error: error returned from database: FOR UPDATE cannot be applied to the nullable side of an outer join

slowli · 2024-12-02

The error is understandable. Just to clarify, since it's the tee_proof_generation_details table that is updated below: Do I understand correctly that this new query can actually have the tee_proof_generation_details side of the join absent (I guess when the query is invoked for a certain batch for the first time)? If so, I'd suggest adding a comment explaining why rows are locked in a table that's not modified in the transaction.

pbeza · 2024-11-29T18:56:52Z

@slowli, I’ve addressed your code review comments. Take a look when you get a chance.

It’s kinda hard to test properly without deploying it to stage and letting it run for a while. Specifically, let me know if locking rows in the proof_generation_details table is okay (instead of just locking tee_proof_generation_details rows).

pbeza requested review from haraldh and slowli November 28, 2024 14:03

pbeza force-pushed the tee/fix/atomic-batch-locking branch from a86fc98 to e95cb27 Compare November 28, 2024 18:10

pbeza requested a review from RomanBrodetski November 29, 2024 11:59

pbeza force-pushed the tee/fix/atomic-batch-locking branch from e95cb27 to 46dcfde Compare November 29, 2024 12:06

pbeza force-pushed the tee/fix/atomic-batch-locking branch from 46dcfde to 7d96c1c Compare November 29, 2024 12:10

slowli reviewed Nov 29, 2024


pbeza added 2 commits November 29, 2024 17:44

Merge remote-tracking branch 'origin/main' into tee/fix/atomic-batch-locking (17977dc)

Addressed review comments (464b5d4)

pbeza added 2 commits November 29, 2024 19:19

fixup! Addressed review comments (574896b)

Merge remote-tracking branch 'origin/main' into tee/fix/atomic-batch-locking (66a9154)

pbeza commented Nov 29, 2024


pbeza requested a review from slowli November 29, 2024 19:00

slowli approved these changes Dec 2, 2024
