
Reduce setDefaultPipelineOptions call on creating TestPipelineOptions #32723

Merged
merged 1 commit into apache:master on Oct 14, 2024

Conversation

Abacn
Contributor

@Abacn Abacn commented Oct 9, 2024

Fix #30512

Problem statement

Random tests in this test suite fail because a temp file gets deleted halfway through the run, likely due to a race condition.

Investigation

This is a longstanding and tricky issue; see the earlier investigation in

#30512 (comment)

As the validates-runner test set adds/enables more tests, the suite becomes increasingly flaky.

setDefaultPipelineOptions

In the TestPipeline scenario, setDefaultPipelineOptions turns out to be called at least twice:

  1. When the TestPipelineOptions object is created:

    FileSystems.setDefaultPipelineOptions(options);

  2. Right before the pipeline runs:

    FileSystems.setDefaultPipelineOptions(options);

Note that if "options.revision" does not change, the second call should normally be skipped. However, there are multiple places that modify the pipeline options before submission (amending some options). I checked that in the first call revision=2 and in the second call revision=4, so both calls end up rebuilding the map.
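A minimal sketch of the revision guard described above; the class, field, and method names here are illustrative and do not match Beam's actual FileSystems internals:

    // Illustrative sketch of the revision guard; not Beam's actual FileSystems code.
    final class RevisionGuardedFileSystems {
      // Revision of the options used for the most recent rebuild (null = never built).
      private static Integer lastRevision = null;

      static synchronized void setDefaultPipelineOptions(int optionsRevision) {
        if (lastRevision != null && lastRevision == optionsRevision) {
          // Same revision as last time: skip the expensive ServiceLoader scan and map rebuild.
          return;
        }
        lastRevision = optionsRevision;
        rebuildSchemeToFileSystemMap();
      }

      private static void rebuildSchemeToFileSystemMap() {
        // In Beam this is where the scheme-to-FileSystem map is rebuilt via ServiceLoader.
      }
    }

In the failing scenario above, however, the revision moves from 2 to 4 between the two calls, so the guard never skips and the map is rebuilt both times.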

If the pipeline is run via "runWithAdditionalArgument", there is a third call:

FileSystems.setDefaultPipelineOptions(options);

The first call is even triggered when the test is filtered by a wildcard, since JUnit still evaluates @Rule TestPipelineOptions options = TestPipelineOptions.create() (see the standard rule usage sketched below).
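For reference, this is the standard JUnit rule usage of TestPipeline that triggers that first call; the test class and method below are made up for illustration:

    import org.apache.beam.sdk.testing.TestPipeline;
    import org.junit.Rule;
    import org.junit.Test;

    public class ExampleTransformTest {
      // Per the observation above, this field initializer runs whenever JUnit instantiates
      // the test class, even when the test method is filtered out by a --tests wildcard,
      // so TestPipeline.create() (and, before this PR, its call into
      // FileSystems.setDefaultPipelineOptions) happens for filtered tests too.
      @Rule public final transient TestPipeline pipeline = TestPipeline.create();

      @Test
      public void testSomething() {
        // apply transforms and assertions here, then run:
        pipeline.run().waitUntilFinish();
      }
    }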

Assumptions of what happened

Assumption 1

The Spark runner first puts jars in a temp directory (/temp/.../userFiles/...), loads them into a class loader (#30512 (comment)), and deletes these files after the pipeline run (likely asynchronously; a "shutdown hook called..." message appears in the log).

At the same time, the tests call setDefaultPipelineOptions, which invokes ServiceLoader to look up registered services in the class loader. For PortableSparkRunner tests this causes a race condition when the class loader still holds jar entries from a previous test whose files have already been deleted.
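To make the suspected failure mode concrete, here is a hedged sketch of the ServiceLoader scan involved. FileSystemRegistrar is the registrar interface Beam discovers this way; the surrounding class and the class-loader lookup are illustrative rather than Beam's exact code path:

    import java.util.ServiceLoader;
    import org.apache.beam.sdk.io.FileSystemRegistrar;

    final class ServiceLoaderRaceSketch {
      // Illustrative only: shows why a ServiceLoader scan can fail against deleted jars.
      static void scanRegistrars() {
        ServiceLoader<FileSystemRegistrar> loader =
            ServiceLoader.load(
                FileSystemRegistrar.class, Thread.currentThread().getContextClassLoader());
        for (FileSystemRegistrar registrar : loader) {
          // Iterating forces META-INF/services/* entries to be read from every jar visible
          // to the class loader. If that loader still references Spark's temp userFiles jars
          // after they were deleted, the read fails (a FileNotFoundException surfacing as a
          // ServiceConfigurationError).
        }
      }
    }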

Assumption 2

Executing the tests locally (on macOS) produces other intermittent test failures, but there the FileNotFoundException carries a hint about too many open files. It could also be that ServiceLoader opens a large number of files when invoked frequently.

Nevertheless, both observations agree with an excessive number of setDefaultPipelineOptions calls at test pipeline submission time.

Solution

This PR reduces the calls to setDefaultPipelineOptions in two places in TestPipeline.java, cutting the number of setDefaultPipelineOptions calls at least in half.
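The spirit of the change can be summarized as a "set once" guard: propagate options to FileSystems only if they have not been set yet, instead of on every TestPipelineOptions creation and every run. Below is a minimal sketch of that idea; the guard class is illustrative and not the literal diff, while the setDefaultPipelineOptionsOnce path mentioned in the comments further down plays a similar role in the real code:

    import java.util.concurrent.atomic.AtomicBoolean;
    import org.apache.beam.sdk.io.FileSystems;
    import org.apache.beam.sdk.options.PipelineOptions;

    // Illustrative "at most once" guard; not the literal change made in TestPipeline.java.
    final class DefaultOptionsGuard {
      private static final AtomicBoolean ALREADY_SET = new AtomicBoolean(false);

      static void setDefaultPipelineOptionsIfUnset(PipelineOptions options) {
        // Only the first caller pays for the ServiceLoader scan; later calls are no-ops.
        if (ALREADY_SET.compareAndSet(false, true)) {
          FileSystems.setDefaultPipelineOptions(options);
        }
      }
    }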

Risk

There is a potential risk that the PipelineOptions stored by FileSystems gets out of sync between TestPipelineOptions.create and testPipeline.run(). I would argue this is of little concern because:

  • TestPipeline / TestPipelineOptions are meant for tests; this change does not affect dev and prod settings.

  • Under the current setting, out-of-sync state can already happen. Consider the following code:

TestPipeline p1 = TestPipeline.create();
// at this moment, FileSystems' options are set to p1's
TestPipeline p2 = TestPipeline.create();
// now FileSystems' options are set to p2's; any operation on p1 involving FileSystems uses p2's options

Under the current implementation, which uses a static map of FileSystems initialized from PipelineOptions, the options can always become out of sync whenever multiple pipelines exist.


@Abacn Abacn force-pushed the defaultPipelineOptions branch 2 times, most recently from 482a93c to 17ed411, on October 9, 2024 19:09
@Abacn
Contributor Author

Abacn commented Oct 9, 2024

Tested with ./gradlew :runners:spark:3:job-server:validatesPortableRunnerBatch --tests "*ParDoTest*TimerTests*"

Added logging to "setDefaultPipelineOptionsOnce" as well as inside "setDefaultPipelineOptions". Found:

  • "setDefaultPipelineOptions" executed 14 times, that is one called from TestPipeline and 13 from

    FileSystems.setDefaultPipelineOptions(options);
    , that is each time pipeline run

  • "setDefaultPipelineOptionsOnce" called 256 times in total (note - this is because filtered out test also call). Only one time goes to actual "setDefaultPipelineOptions"

This suggests there were excessive calls to "setDefaultPipelineOptions", which becomes a source of exceptions when there are lots of jars in the class loader (added by Spark).

@Abacn
Contributor Author

Abacn commented Oct 10, 2024

PostCommit Java PVR Spark Batch passed on the first run:

https://github.com/apache/beam/actions/runs/11261482935

@Abacn Abacn force-pushed the defaultPipelineOptions branch 3 times, most recently from 514c8cb to a9d504f, on October 10, 2024 20:22
@Abacn Abacn force-pushed the defaultPipelineOptions branch from a9d504f to 0bdb00a on October 10, 2024 20:59
@Abacn Abacn marked this pull request as ready for review on October 10, 2024 21:02
Contributor

Checks are failing. Will not request review until checks are succeeding. If you'd like to override that behavior, comment assign set of reviewers

@Abacn
Contributor Author

Abacn commented Oct 11, 2024

assign set of reviewers

Contributor

Assigning reviewers. If you would like to opt out of this review, comment assign to next reviewer:

R: @m-trieu for label java.
R: @damccorm for label build.

Available commands:

  • stop reviewer notifications - opt out of the automated review tooling
  • remind me after tests pass - tag the comment author after tests pass
  • waiting on author - shift the attention set back to the author (any comment or push by the author will return the attention set to the reviewers)

The PR bot will only process comments in the main thread (not review comments).

@Abacn Abacn changed the title from "Attempt to fix PostCommit PVR Spark" to "Reduce setDefaultPipelineOptions call on creating TestPipelineOptions" on Oct 11, 2024
Contributor

@damccorm damccorm left a comment


LGTM, thanks

@Abacn Abacn merged commit 1f575d4 into apache:master Oct 14, 2024
27 of 28 checks passed
@Abacn Abacn deleted the defaultPipelineOptions branch October 14, 2024 15:38
reeba212 pushed a commit to reeba212/beam that referenced this pull request Dec 4, 2024

Successfully merging this pull request may close these issues.

The PostCommit Java PVR Spark Batch job is flaky