The PostRelease Nightly Snapshot job is flaky #30505

Closed
github-actions bot opened this issue Mar 5, 2024 · 15 comments

Comments

@github-actions
Contributor

github-actions bot commented Mar 5, 2024

The PostRelease Nightly Snapshot job is failing over 50% of the time.
Please visit https://github.com/apache/beam/actions/workflows/beam_PostRelease_NightlySnapshot.yml?query=is%3Afailure+branch%3Amaster to see the logs.

@shunping
Contributor

Related to #30447

@Abacn
Contributor

Abacn commented Mar 14, 2024

Still failing:

Container image gcr.io/cloud-dataflow/v1beta3/beam_java8_sdk:beam-master-20240306 not downloaded yet.

It is strange that the container gets resolved to "beam_java8_sdk:beam-master-20240306". What happens is that it picks the label for the legacy runner but actually tries to pull the runner v2 image. This is likely because Dataflow switched to runner v2 by default in Beam 2.55.0+.

ext.dataflowLegacyContainerVersion = 'beam-master-20240306'
ext.dataflowFnapiContainerVersion = 'beam-master-20240125'

Opened #30634
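
To illustrate the mismatch described above, here is a minimal Python sketch. It is not Beam's build logic: the function names `pick_tag` and `image_ref` and the `runner_v2` flag are hypothetical. Only the two version strings and the image path come from the comment, and the assumption that the Fnapi version is the one meant for runner v2 is inferred from the comment rather than confirmed.

```python
# Hypothetical illustration only; this is not how Beam's build resolves images.
LEGACY_VERSION = "beam-master-20240306"  # ext.dataflowLegacyContainerVersion
FNAPI_VERSION = "beam-master-20240125"   # ext.dataflowFnapiContainerVersion
SDK_IMAGE = "gcr.io/cloud-dataflow/v1beta3/beam_java8_sdk"  # image path from the log


def pick_tag(runner_v2: bool) -> str:
    """Pick the container tag that matches the runner (assumed mapping)."""
    return FNAPI_VERSION if runner_v2 else LEGACY_VERSION


def image_ref(tag: str) -> str:
    """Build the full image reference for the runner v2 SDK image."""
    return f"{SDK_IMAGE}:{tag}"


# The failing combination: legacy tag applied to the runner v2 image.
observed = image_ref(pick_tag(runner_v2=False))
# What a runner v2 job would presumably need instead.
expected = image_ref(pick_tag(runner_v2=True))
```

Here `observed` matches the image reference in the error message, while `expected` uses the Fnapi tag.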

@liferoad
Collaborator

https://github.com/apache/beam/actions/runs/8619063045

java.lang.RuntimeException: com.google.api.client.googleapis.json.GoogleJsonResponseException: 404 Not Found
POST https://bigquery.googleapis.com/bigquery/v2/projects/apache-beam-testing/datasets/beam_postrelease_mobile_gaming/tables/leaderboard_DataflowRunner_team/insertAll?prettyPrint=false
{
  "code" : 404,
  "errors" : [ {
    "domain" : "global",
    "message" : "Not found: Table apache-beam-testing:beam_postrelease_mobile_gaming.leaderboard_DataflowRunner_team",
    "reason" : "notFound"
  } ],
  "message" : "Not found: Table apache-beam-testing:beam_postrelease_mobile_gaming.leaderboard_DataflowRunner_team",
  "status" : "NOT_FOUND"
}

@liferoad
Collaborator

Looks much better. Closing this now.

@github-actions github-actions bot added this to the 2.56.0 Release milestone Apr 13, 2024
@Abacn Abacn reopened this May 21, 2024
@Abacn Abacn removed this from the 2.56.0 Release milestone May 21, 2024
@Abacn
Contributor

Abacn commented May 21, 2024

Currently there is flakiness because artifact downloads from the Maven snapshot repository are not retried. This is a limitation of the Maven tooling, but we could probably build first (with retry) so that the artifacts get cached in the local Maven repository.
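
The "build first, with retry" idea could be sketched roughly as follows. This is a generic Python wrapper, not the workflow's actual implementation: the helper name `run_with_retry`, the attempt count, and the backoff delays are all assumptions chosen for illustration.

```python
import subprocess
import time
from typing import Callable, Sequence


def run_with_retry(
    cmd: Sequence[str],
    attempts: int = 3,
    base_delay: float = 5.0,
    runner: Callable[..., subprocess.CompletedProcess] = subprocess.run,
    sleep: Callable[[float], None] = time.sleep,
) -> subprocess.CompletedProcess:
    """Run cmd, retrying on a non-zero exit code with exponential backoff.

    Returns the result of the first successful run, or of the last
    attempt if every attempt failed.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    result = None
    for attempt in range(attempts):
        if attempt:
            # Wait base_delay, then 2x, 4x, ... before each retry.
            sleep(base_delay * 2 ** (attempt - 1))
        result = runner(list(cmd))
        if result.returncode == 0:
            break
    return result
```

A call such as `run_with_retry(["mvn", "-B", "compile"])` before the real test step would give the snapshot downloads several chances to succeed and leave the artifacts in the local repository. The exact Maven command is an assumption; the thread does not say which build step would be used.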

@liferoad
Collaborator

@shunping please check this when you have time.

@damondouglas
Contributor

Related to the Maven snapshot issue: I wonder if we could use Artifact Registry's ability to store Java packages (https://cloud.google.com/artifact-registry/docs/java/store-java) instead of relying on Maven Central.

@liferoad
Collaborator

liferoad commented Jun 8, 2024


[ERROR] Failed to execute goal org.codehaus.mojo:exec-maven-plugin:1.6.0:java (default-cli) on project word-count-beam: An exception occured while executing the Java class. java.lang.RuntimeException: com.google.api.client.googleapis.json.GoogleJsonResponseException: 404 Not Found
[ERROR] POST https://bigquery.googleapis.com/bigquery/v2/projects/apache-beam-testing/datasets/beam_postrelease_mobile_gaming/tables/leaderboard_DirectRunner_team/insertAll?prettyPrint=false
[ERROR] {
[ERROR]   "code" : 404,
[ERROR]   "errors" : [ {
[ERROR]     "domain" : "global",
[ERROR]     "message" : "Not found: table Table is deleted: 844138762903:beam_postrelease_mobile_gaming.leaderboard_DirectRunner_team",
[ERROR]     "reason" : "notFound"
[ERROR]   } ],
[ERROR]   "message" : "Not found: table Table is deleted: 844138762903:beam_postrelease_mobile_gaming.leaderboard_DirectRunner_team",
[ERROR]   "status" : "NOT_FOUND"
[ERROR] }
[ERROR] -> [Help 1]
[ERROR]


@liferoad
Collaborator

liferoad commented Jun 8, 2024

Can we just add a retry to this task?

@chamikaramj
Contributor

Looking at some of the recent failures, it seems like the Java command was just crashing?

https://github.com/apache/beam/actions/runs/9537373049/job/26285395593
https://ge.apache.org/s/pmba6vnub3yz4

"Process 'command '/opt/hostedtoolcache/Java_Temurin-Hotspot_jdk/8.0.412-8/x64/bin/java'' finished with non-zero exit value 1"

@chamikaramj
Contributor

I also see the 404 error from BQ mentioned above in other failed runs, so it seems like there are at least two failure modes.
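
For triaging runs by these two failure modes, a small log classifier might look like the Python sketch below. The patterns are copied from the log excerpts quoted in this thread; the function name `classify_failure` and the category labels are hypothetical, and real logs may need more patterns.

```python
import re

# Patterns taken from the log excerpts quoted in this thread.
BQ_NOT_FOUND = re.compile(r"GoogleJsonResponseException: 404 Not Found")
JAVA_EXIT = re.compile(r"finished with non-zero exit value (\d+)")


def classify_failure(log_text: str) -> str:
    """Return a coarse label for the failure seen in a job log.

    The BigQuery check runs first because a 404 during the pipeline
    can also surface later as a non-zero exit of the Java process.
    """
    if BQ_NOT_FOUND.search(log_text):
        return "bigquery_table_not_found"
    if JAVA_EXIT.search(log_text):
        return "java_nonzero_exit"
    return "unknown"
```

Counting these labels over recent failed runs would show which mode dominates before deciding between adding retries and raising memory.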

@chamikaramj
Contributor

I wonder if the Java failure was due to an OOM. Can we increase the memory available to the VMs running these tests?

@damccorm
Contributor

damccorm commented Jul 2, 2024

Trying this with #31749

@github-actions github-actions bot added this to the 2.59.0 Release milestone Aug 20, 2024
@github-actions github-actions bot reopened this Oct 22, 2024
Contributor Author

Reopening since the workflow is still flaky

@liferoad
Collaborator

Green now.
