[Bug]: Bundling beam job into jar for SparkRunner requires local execution #30214

Closed
1 of 16 tasks
sagget opened this issue Feb 5, 2024 · 1 comment
Labels: awaiting triage, bug, done & done, P3, python

Comments


sagget commented Feb 5, 2024

What happened?

Hi all,

When attempting to bundle a Beam job such as the wordcount example into a jar for execution on Spark, it appears that this step requires local execution. I am following https://beam.apache.org/documentation/runners/spark/#kubernetes. When running:

python -m beam_example_wc \
    --runner=SparkRunner \
    --output_executable_path=./wc_job.jar \
    --environment_type=PROCESS \
    --environment_config='{\"command\": \"/opt/apache/beam/boot\"}' \
    --spark_version=3

I get an error: OSError: No files found based on the file pattern s3://bucket/path. The local machine running this command does not have access to the input files on S3, but Spark does.

The comment on output_executable_path states that it builds the jar rather than running the pipeline. If that's the case, I'm confused about why I get an access error. If the Beam code is not being run, my local machine should not need to talk to S3.

Any help would be appreciated! Thanks!

PS: I'm able to build the jar and run it on Spark successfully if I use an input path on S3 that my local machine can access.

Issue Priority

Priority: 3 (minor)

Issue Components

  • Component: Python SDK
  • Component: Java SDK
  • Component: Go SDK
  • Component: Typescript SDK
  • Component: IO connector
  • Component: Beam YAML
  • Component: Beam examples
  • Component: Beam playground
  • Component: Beam katas
  • Component: Website
  • Component: Spark Runner
  • Component: Flink Runner
  • Component: Samza Runner
  • Component: Twister2 Runner
  • Component: Hazelcast Jet Runner
  • Component: Google Cloud Dataflow Runner

sagget commented Feb 6, 2024

I think this is not a bug.

There is a validate flag on ReadFromText, which checks whether the input files exist when the pipeline is constructed. Setting it to False (its default is True) means the pipeline does not check for access during jar bundling.
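For anyone landing here later, a minimal sketch of what that change could look like. The `--input`, `--output` and `--validate_input` flags and the helper names are my own illustration, not part of Beam; only `ReadFromText(..., validate=...)` is the actual switch discussed above. Validation is off by default in this sketch, so the machine building the jar never needs to reach the input path.

```python
import argparse


def parse_known_args(argv=None):
    """Split job-specific flags from the flags meant for the Beam runner."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--input", required=True, help="File pattern to read.")
    parser.add_argument("--output", required=True, help="Output path prefix.")
    # Hypothetical flag (not a Beam option): opt back in to the existence
    # check when the machine constructing the pipeline can reach the input.
    parser.add_argument("--validate_input", action="store_true")
    return parser.parse_known_args(argv)


def run(argv=None):
    # Imported here so that the flag parsing above works without Beam installed.
    import apache_beam as beam
    from apache_beam.io import ReadFromText, WriteToText
    from apache_beam.options.pipeline_options import PipelineOptions

    known_args, pipeline_args = parse_known_args(argv)
    with beam.Pipeline(options=PipelineOptions(pipeline_args)) as p:
        (
            p
            # validate=False skips the file-existence check that ReadFromText
            # otherwise performs while the pipeline is being constructed.
            | "Read" >> ReadFromText(
                known_args.input, validate=known_args.validate_input)
            | "Split" >> beam.FlatMap(lambda line: line.split())
            | "Count" >> beam.combiners.Count.PerElement()
            | "Format" >> beam.MapTuple(lambda word, count: f"{word}: {count}")
            | "Write" >> WriteToText(known_args.output)
        )
```

Call `run()` from the module's entry point and pass the same runner flags as in the command above (`--runner=SparkRunner`, `--output_executable_path=...` and so on); they are forwarded untouched to `PipelineOptions`. The trade-off is that a mistyped input path is no longer caught while bundling and only fails once the job runs on Spark.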

@sagget sagget closed this as completed Feb 6, 2024
@github-actions github-actions bot added this to the 2.55.0 Release milestone Feb 6, 2024
@damccorm damccorm added the done & done Issue has been reviewed after it was closed for verification, followups, etc. label Feb 6, 2024