
Inspect timing-out aead tests #548

Closed
th4s opened this issue Jul 29, 2024 · 11 comments · Fixed by #575

th4s (Member) commented Jul 29, 2024

PR #547 comments out the aead tests, which are currently timing out. This needs further investigation, as the issue is not reproducible locally.

th4s (Member, Author) commented Aug 13, 2024

PR #558 is being used for the investigation.

th4s (Member, Author) commented Aug 21, 2024

This deadlock is hard to fix. Things I tried so far:

  • I could not reproduce the bug on my local machine, even when using act and running it in Docker.
  • The deadlock seems to vanish when executing a single AEAD test. Strangely, either all 4 of the started aead tests hang or all 4 pass; I never observed one test passing while another hangs.
  • Looking for static variables, i.e. shared state between unit tests -> nothing found.
  • Manual code inspection, especially of mpz-garble/deap -> nothing found so far.
  • Replacing all .lock().unwrap() calls with try_lock().unwrap() (see the sketch after this list) -> the deadlock vanishes: Deadlock vanish due to trylock th4s/tlsn#5
  • Adding tracing.
    • There seems to be a problem with set_default in that it omits some logging statements. I couldn't make any sense of it and will probably create an issue for this. However, sometimes I was able to capture a little bit of output, which can be found in the failed runs in these PRs.
    • When trying to add proper tracing via set_global_default, the deadlock vanishes.
  • Removing tracing and using println! statements instead -> the deadlock vanishes: Deadlock vanish due to prints th4s/tlsn#4
  • Trying to use Tokio's task dump feature -> does not work because the running future does not support it; you get an empty log.
  • Using shuttle testing locally to trigger the deadlock -> all tests pass.

Saying "the deadlock vanishes" means that I ran the tests several times (4-5) and did not encounter it.
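
For reference, the try_lock swap mentioned in the list is roughly this shape (a minimal sketch assuming a plain std::sync::Mutex at the lock sites; the actual types in mpz-garble may differ):

```rust
use std::sync::Mutex;

// Hypothetical lock site: swapping a blocking `lock()` for a panicking
// `try_lock()` makes contention visible immediately instead of hanging.
fn with_state(shared: &Mutex<Vec<u8>>) -> usize {
    // Original: let guard = shared.lock().unwrap();
    let guard = shared
        .try_lock()
        .expect("mutex already held: would have blocked here");
    guard.len()
}
```

Note that this changes behaviour (it panics instead of waiting), so it is a diagnostic, not a fix.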

th4s (Member, Author) commented Aug 22, 2024

Making some progress by carefully adding println! statements. This run is interesting: https://github.com/th4s/tlsn/actions/runs/10506339441/job/29106915720
It hangs in decrypt_private -> verify_tag -> compute_tag_share -> share_keystream_block -> compute -> execute

It does not seem to hang in setup_assigned_values.
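
A small helper in the spirit of this println! bracketing (my own sketch, not code from the investigation branch): wrap each await in the suspected chain so the CI log shows which step was entered but never finished.

```rust
use std::future::Future;

// Wrap a future with enter/done prints; the deepest "enter" without a
// matching "done" in the hanging run points at the stuck call.
async fn traced<F: Future>(name: &str, fut: F) -> F::Output {
    println!("enter {name}");
    let out = fut.await;
    println!("done  {name}");
    out
}

// Hypothetical call sites, following the chain above:
// let tag_share = traced("compute_tag_share", compute_tag_share(...)).await;
// let block = traced("share_keystream_block", share_keystream_block(...)).await;
```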

th4s (Member, Author) commented Aug 22, 2024

New hanging run with more logging: https://github.com/th4s/tlsn/actions/runs/10507218244/job/29108495136
It hangs inside generate (also in evaluate, but that is probably just a consequence of waiting for generate).

My current bet is that it has to do with feed, flush, and the limited backpressure of the TestSTExecutor.
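
To make that hypothesis concrete, here is a toy model (my own sketch, not the actual TestSTExecutor, whose internals may differ) of how limited backpressure alone can produce this hang shape: two parties over bounded duplex channels, each sending a burst larger than the channel capacity before draining its inbox.

```rust
use futures::{channel::mpsc, executor::LocalPool, task::SpawnExt, SinkExt, StreamExt};

fn main() {
    // Bounded duplex channels between party A and party B.
    let (mut a_tx, mut b_rx) = mpsc::channel::<u32>(1);
    let (mut b_tx, mut a_rx) = mpsc::channel::<u32>(1);

    let mut pool = LocalPool::new();
    let spawner = pool.spawner();

    // Party A: send a burst, only then read.
    spawner
        .spawn(async move {
            for i in 0..10 {
                a_tx.send(i).await.unwrap(); // blocks once the buffer is full
            }
            while a_rx.next().await.is_some() {}
        })
        .unwrap();

    // Party B: mirror image, same ordering.
    spawner
        .spawn(async move {
            for i in 0..10 {
                b_tx.send(i).await.unwrap();
            }
            while b_rx.next().await.is_some() {}
        })
        .unwrap();

    // `run_until_stalled` returns instead of hanging forever: neither task can
    // make progress because both are stuck in `send` waiting for the other to
    // read -- the classic backpressure deadlock shape.
    pool.run_until_stalled();
    println!("stalled: both tasks are stuck in `send`");
}
```

A real executor driving the same futures would simply hang instead of printing the final line.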

th4s (Member, Author) commented Aug 22, 2024

th4s (Member, Author) commented Aug 23, 2024

th4s (Member, Author) commented Aug 23, 2024

When running tests from the workspace root, the rayon feature, although not specified by tlsn-aead, is activated by tlsn-verifier and tlsn-prover through Cargo's feature unification. This means the aead tests are run with the rayon feature enabled.
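
For context on why the unified feature changes behaviour at all, here is a hypothetical sketch (not the actual mpz-common source) of the kind of compile-time switch a rayon feature typically guards:

```rust
// Hypothetical shape of a feature-gated code path: once any workspace member
// enables "rayon", every crate sharing the dependency in the same build gets
// the rayon branch, including the tlsn-aead test binary.
#[cfg(feature = "rayon")]
fn spawn_work(f: impl FnOnce() + Send + 'static) {
    rayon::spawn(f);
}

#[cfg(not(feature = "rayon"))]
fn spawn_work(f: impl FnOnce() + Send + 'static) {
    let _ = std::thread::spawn(f);
}
```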

th4s (Member, Author) commented Aug 26, 2024

What happens is the following:

  1. When running our test suite from the workspace root, tlsn-verifier/tlsn-prover activate the rayon feature of mpz-common. This means that, although not specified, the tlsn-aead unit tests are run with this feature enabled.
  2. When running unit tests, Rust uses 4 threads to parallelize test execution, and in each aead test the leader AND the follower make use of rayon::spawn.
  3. Rayon uses std::thread::available_parallelism to estimate the amount of parallelism when the rayon threadpool is left unconfigured (as we do), i.e. how many tasks run in parallel in the rayon threadpool. I determined this to be 4 for our GitHub CI.
  4. All 4 rayon-spawned evaluator tasks use expect_next in evaluate to get garbling batches from the generator. So when all 4 evaluator tasks start before a single generator task can run, the rayon threadpool is saturated. The generators can only queue their rayon tasks to send garbling batches, but those tasks are never executed because the evaluator tasks never finish (they are waiting for garbling batches) -> deadlock. A minimal reproduction sketch follows below.
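
The sketch (my own toy model under the assumptions above; "evaluator" and "generator" mirror the description, not the tlsn code):

```rust
use std::sync::mpsc;

fn main() {
    // Match the CI conditions: a rayon pool with 4 threads.
    let threads = 4;
    let pool = rayon::ThreadPoolBuilder::new()
        .num_threads(threads)
        .build()
        .unwrap();

    let mut generators = Vec::new();

    for i in 0..threads {
        let (tx, rx) = mpsc::channel::<u32>();
        // "Evaluator": occupies a pool thread and blocks, like `expect_next`
        // waiting for a garbling batch that never arrives.
        pool.spawn(move || {
            let batch = rx.recv().unwrap();
            println!("evaluator {i} got batch {batch}");
        });
        generators.push(tx);
    }

    // "Generators": queued only after the pool is already saturated, so these
    // closures never run and the receivers above wait forever.
    for (i, tx) in generators.into_iter().enumerate() {
        pool.spawn(move || {
            tx.send(i as u32).unwrap();
        });
    }

    std::thread::sleep(std::time::Duration::from_secs(2));
    println!("no evaluator ever finished: the pool is deadlocked");
}
```

The hang disappears as soon as the pool has more threads than blocked evaluator tasks, which matches why the deadlock is so sensitive to available_parallelism on the CI runner.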

sinui0 (Member) commented Aug 26, 2024

🫠

Good work. Short-term fix is to increase the thread count, or decrease parallelism?

th4s (Member, Author) commented Aug 27, 2024

Yeah I think I am leaning towards increasing the thread count for the aead unit tests.
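
One possible way to do that (a sketch under my own assumptions, not necessarily what the fix in #575 ended up doing) is to initialize rayon's global pool with a larger thread count before any aead test spawns work:

```rust
use std::sync::Once;

static INIT: Once = Once::new();

// Hypothetical test helper: raise the rayon thread count once per test binary,
// before any test calls rayon::spawn. `build_global` errors if the global pool
// was already initialized, which is fine to ignore here.
fn init_rayon_for_tests() {
    INIT.call_once(|| {
        let _ = rayon::ThreadPoolBuilder::new()
            .num_threads(8)
            .build_global();
    });
}
```

Alternatively, rayon honours the RAYON_NUM_THREADS environment variable, so the count could also be raised for the CI job without touching the test code.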

th4s linked a pull request on Aug 27, 2024 that will close this issue
heeckhau (Member) commented

Fixed in #575
