use direct executor to deflake tests #33187

m-trieu · 2024-11-21T21:05:11Z

MoreExecutors.directExecutor()/newDirectExecutorService() run all tasks on the calling thread (without offloading to another thread for async work), so calls to submit and execute block until the submitted task returns (i.e., until Runnable.run() completes).
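
For readers unfamiliar with direct executors, the behavior described above can be sketched with the JDK alone. This is a minimal illustration, not the code in this PR: `DirectExecutorSketch` and `threadThatRanTask` are hypothetical names, and `Runnable::run` is used as a stand-in for Guava's `MoreExecutors.directExecutor()`.

```java
import java.util.concurrent.Executor;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch: a JDK-only stand-in for Guava's MoreExecutors.directExecutor().
final class DirectExecutorSketch {
  // Runs every task inline on the thread that calls execute().
  static final Executor DIRECT = Runnable::run;

  // Returns the name of the thread that ran the task.
  // Only meaningful for a direct executor: execute() returns after the task
  // has finished, so the result can be read without any waiting.
  static String threadThatRanTask(Executor executor) {
    AtomicReference<String> ranOn = new AtomicReference<>();
    executor.execute(() -> ranOn.set(Thread.currentThread().getName()));
    return ranOn.get();
  }

  public static void main(String[] args) {
    String caller = Thread.currentThread().getName();
    // The task ran on the calling thread, so the names match.
    System.out.println(caller.equals(threadThatRanTask(DIRECT)));
  }
}
```

Because the task completes before `execute` returns, a test can assert on its side effects immediately instead of polling or sleeping.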

Use this in test implementations of ChannelCache and FanOutStreamingEngineWorkerHarness to prevent threads waiting on each other. The old implementation seems to work locally but has become increasingly flaky in the test runner environment.

Flakiness is referenced in #28957


github-actions · 2024-11-21T22:06:09Z

Checks are failing. Will not request review until checks are succeeding. If you'd like to override that behavior, comment assign set of reviewers

github-actions · 2024-11-26T03:42:30Z

Assigning reviewers. If you would like to opt out of this review, comment assign to next reviewer:

R: @Abacn added as fallback since no labels match configuration


codecov · 2024-11-26T03:42:46Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 58.93%. Comparing base (a06454a) to head (48a048e).
Report is 19 commits behind head on master.

Additional details and impacted files

@@            Coverage Diff            @@
##             master   #33187   +/-   ##
=========================================
  Coverage     58.93%   58.93%           
  Complexity     3112     3112           
=========================================
  Files          1133     1133           
  Lines        174989   174989           
  Branches       3343     3343           
=========================================
  Hits         103136   103136           
  Misses        68508    68508           
  Partials       3345     3345


m-trieu · 2024-11-26T08:46:46Z

R: @Abacn

github-actions · 2024-11-26T08:48:02Z

Stopping reviewer notifications for this pull request: review requested by someone other than the bot, ceding control. If you'd like to restart, comment assign set of reviewers

Abacn

Thanks for the fix. I think we need to understand whether the test failure was purely a testing issue or could also happen in production.

Abacn · 2024-11-26T15:03:26Z

...pache/beam/runners/dataflow/worker/streaming/harness/FanOutStreamingEngineWorkerHarness.java

-            getDataMetricTracker);
+            getDataMetricTracker,
+            // Run the workerMetadataConsumer on the direct calling thread to make testing more
+            // deterministic.


"to make testing more deterministic" gives the impression that the change only fixes tests; however, the test code path then diverges from the real one.

Please explain in this comment why the race observed in the test does not affect production, for future reference.

If this indeed could happen in production then we should fix the code.

In prod, this hands the task off from a thread that may perform network IO. We do not want the task to block that thread, since the task acquires a lock to do its work. The handoff is not needed in testing, where the task can logically be called inline.

Added comment.
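
The production/test split described in this reply can be sketched as executor injection. This is a hedged illustration under assumptions, not the actual FanOutStreamingEngineWorkerHarness code: `MetadataHarnessSketch`, `onNewMetadata`, and `applied` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Executor;

// Hypothetical harness; names are illustrative and not Beam's actual classes.
final class MetadataHarnessSketch {
  private final Executor consumerExecutor;
  private final List<String> applied = new ArrayList<>();

  private MetadataHarnessSketch(Executor consumerExecutor) {
    this.consumerExecutor = consumerExecutor;
  }

  // Production-style wiring: the caller supplies an executor backed by another
  // thread, so a thread doing network IO is not blocked while the lock is held.
  static MetadataHarnessSketch create(Executor handoffExecutor) {
    return new MetadataHarnessSketch(handoffExecutor);
  }

  // Test wiring: run the consumer inline on the calling thread.
  static MetadataHarnessSketch forTesting() {
    return new MetadataHarnessSketch(Runnable::run);
  }

  // Called when new metadata arrives; the work is delegated to the executor.
  void onNewMetadata(String metadata) {
    consumerExecutor.execute(() -> apply(metadata));
  }

  // Acquires the instance lock to do its work.
  private synchronized void apply(String metadata) {
    applied.add(metadata);
  }

  // Returns a snapshot of everything applied so far.
  synchronized List<String> applied() {
    return new ArrayList<>(applied);
  }
}
```

With `forTesting()`, `applied()` reflects the update as soon as `onNewMetadata` returns. With `create(...)` and a thread-backed executor, the caller returns immediately and the update lands later, which is the ordering a test would otherwise have to wait for.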

Abacn · 2024-11-26T15:06:24Z

...in/java/org/apache/beam/runners/dataflow/worker/windmill/client/grpc/stubs/ChannelCache.java

@@ -85,7 +86,9 @@ static ChannelCache forTesting(
        notification -> {
          shutdownChannel(notification.getValue());
          onChannelShutdown.run();
-        });
+        },
+        // Run the removal on the calling thread for better determinism in tests.


Added. This doesn't change any behavior; we just want the removal to run synchronously so we don't have to rely on waiting in tests.
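
To illustrate why a synchronous removal listener removes the need to wait in tests, here is a minimal sketch. It assumes a simplified, hypothetical cache (`ListenerCacheSketch`) rather than Beam's ChannelCache or Guava's cache; only the idea of passing the listener's executor in is taken from the diff above.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.Executor;
import java.util.function.Consumer;

// Hypothetical cache; not Beam's ChannelCache and not Guava's Cache.
final class ListenerCacheSketch<K, V> {
  private final Map<K, V> entries = new HashMap<>();
  private final Consumer<V> onRemoval;
  private final Executor listenerExecutor;

  ListenerCacheSketch(Consumer<V> onRemoval, Executor listenerExecutor) {
    this.onRemoval = onRemoval;
    this.listenerExecutor = listenerExecutor;
  }

  synchronized void put(K key, V value) {
    entries.put(key, value);
  }

  private synchronized V removeEntry(K key) {
    return entries.remove(key);
  }

  // Removes the entry and notifies the listener on the configured executor.
  // With a direct executor the listener has finished before remove() returns.
  void remove(K key) {
    V removed = removeEntry(key);
    if (removed != null) {
      listenerExecutor.execute(() -> onRemoval.accept(removed));
    }
  }
}
```

A test that constructs the cache with `Runnable::run` as the executor can assert on the listener's side effects right after `remove` returns, with no latch, sleep, or polling.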

Abacn · 2024-11-26T15:14:26Z

The old implementation seems to work locally but has become increasingly flaky in the test runner environment.

We've seen a similar scenario in different tests. This is because CI/CD is often busier, with heavier CPU and thread pressure, which arguably more closely resembles production workers.

m-trieu · 2024-11-26T19:08:52Z

Thanks for the fix. I think we need to understand whether the test failure was purely a testing issue or could also happen in production.

This has only shown up in these test suites (I haven't run into it in load testing). I wonder if it's due to the threads waiting to be scheduled while resources are consumed by other tests.

m-trieu · 2024-11-26T19:22:47Z

done, back to you @Abacn thanks!

Abacn

Thank you!

github-actions bot added runners dataflow labels Nov 21, 2024

m-trieu force-pushed the mt-fix-flaky-tests branch from b47e630 to 6dc5803 Compare November 21, 2024 21:31

m-trieu force-pushed the mt-fix-flaky-tests branch from 6dc5803 to 782fc35 Compare November 22, 2024 00:07

use direct executor to deflake tests

48a048e

m-trieu force-pushed the mt-fix-flaky-tests branch from 782fc35 to 48a048e Compare November 26, 2024 03:11

github-actions bot added the Next Action: Reviewers label Nov 26, 2024

Abacn reviewed Nov 26, 2024

address PR comments

e8972f6

Abacn approved these changes Nov 26, 2024

Abacn merged commit 720b824 into apache:master Nov 26, 2024
17 checks passed
