
Portable runners should be able to issue checkpoints to Splittable DoFn #20979

Open

damccorm opened this issue Jun 4, 2022 · 9 comments

damccorm (Contributor) commented Jun 4, 2022

To execute an unbounded Splittable DoFn over the Fn API in streaming mode properly, portable runners should regularly issue a split (ProcessBundleSplitRequest with fraction_of_remainder > 0) or simply a checkpoint (ProcessBundleSplitRequest with fraction_of_remainder == 0) to the SDK, so that the current bundle finishes processing instead of running forever.

Imported from Jira BEAM-11998. Original Jira may contain additional context.
Reported by: boyuanz.
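For context, a checkpoint over the Fn API control channel is just a split request whose fraction_of_remainder is 0. Below is a minimal sketch of what such a request could look like, built with the standard Beam portability protos; all ids here ('split-1', 'bundle-1', 'read-node') are hypothetical placeholders, not values from the original report.

from apache_beam.portability.api import beam_fn_api_pb2

# A checkpoint is a split with fraction_of_remainder == 0: the SDK should stop
# as soon as possible and hand the unprocessed remainder back as residuals.
checkpoint = beam_fn_api_pb2.InstructionRequest(
    instruction_id='split-1',  # hypothetical id for this control request
    process_bundle_split=beam_fn_api_pb2.ProcessBundleSplitRequest(
        instruction_id='bundle-1',  # hypothetical id of the running bundle
        desired_splits={
            'read-node': beam_fn_api_pb2.ProcessBundleSplitRequest.DesiredSplit(
                fraction_of_remainder=0,
                estimated_input_elements=1),
        }))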

LemonU commented Aug 18, 2022

I have managed to get the workaround of adding --experiments=use_deprecated_read mentioned in the original Jira ticket working.

First, you'll need to start the Java expansion service yourself. If you are deploying the expansion service on Docker, you can simply pull the Flink job server image (apache/beam_flink1.14_job_server:2.40.0 is what I used) and override the entrypoint with the following command:

java -cp '/opt/apache/beam/jars/*' org.apache.beam.sdk.expansion.service.ExpansionService 8097 \
--javaClassLookupAllowlistFile="*" \
--defaultEnvironmentType=<your environment type here> \
--defaultEnvironmentConfig=<your environment config here> \
--experiments=use_deprecated_read

Explanation of each of the flags:

-cp '/opt/apache/beam/jars/*': this is where the expansion service jars are located in the container (quoted so the shell passes the wildcard to the JVM instead of expanding it itself)

8097: this specifies the port the expansion service listens on

--javaClassLookupAllowlistFile="*": this is so that all transforms registered under the expansion service can be requested for external expansion

--defaultEnvironmentType=<your environment type here> and --defaultEnvironmentConfig=<your environment config here>: these specify the Environment that the Java transforms you request from this expansion service will execute in. Be advised, your pipeline's environment configs will not affect this value; the values set here for the expansion service override those of your pipeline.
That is, say you are running a Python pipeline with --environment_type=EXTERNAL --environment_config=localhost:50000, the expansion service was started with --defaultEnvironmentType=DOCKER, and you request expansion of the Kafka IO transforms from the expansion service. The resulting pipeline protobuf payload will have every stage's environment set to the EXTERNAL environment except the Kafka IO transforms you requested from the expansion service, which will be set to the DOCKER environment.

--experiments=use_deprecated_read: this is so that the legacy Read transform replaces the new SDF-based Kafka Read transform when the expansion service expands the Kafka IO stage (see the sketch below).
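To tie it together, here is a minimal sketch, assuming such an expansion service is running on localhost:8097, of a Python pipeline requesting the Kafka expansion from it. The broker address and topic are placeholders, and the use_deprecated_read behavior comes from the expansion service's flags above, not from this code.

import apache_beam as beam
from apache_beam.io.kafka import ReadFromKafka
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(['--streaming'])
with beam.Pipeline(options=options) as p:
    _ = (
        p
        | ReadFromKafka(
            consumer_config={'bootstrap.servers': 'broker:9092'},  # placeholder
            topics=['my-topic'],  # placeholder
            # Point at the manually started expansion service instead of
            # letting the SDK auto-start the default one.
            expansion_service='localhost:8097')
        | beam.Map(print))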

kennknowles (Member) commented

Since this is new functionality, I think P2 is the right level. This is still an important priority for portable runners to function properly with SDF.

kennknowles (Member) commented

CC @chamikaramj since this is tagged with xlang

Abacn (Contributor) commented Feb 3, 2023

This is coming back as we push forward with the SDF implementations for various sources (Kafka, GenerateSequence, etc.) and move toward runner v2 of the Dataflow runner. I have read the context of BEAM-11998 and would like to work on this.

Abacn (Contributor) commented Mar 2, 2023

Another problem related to this issue: when running PeriodicSequence on the Flink runner, the pipeline first runs for ~1 minute but then fails with this error:

Traceback (most recent call last):
  File ".../periodictest.py", line 82, in <module>
    test0(p)
  File ".../py38beam/lib/python3.8/site-packages/apache_beam/pipeline.py", line 601, in __exit__
    self.result.wait_until_finish()
  File "...py38beam/lib/python3.8/site-packages/apache_beam/runners/portability/portable_runner.py", line 614, in wait_until_finish
    raise self._runtime_exception
RuntimeError: Pipeline ...-6ce3c5fde435 failed in state FAILED: java.util.concurrent.TimeoutException: Heartbeat of TaskManager with id 100.81.162.100:53640-99d457 timed out.

The code is pretty simple:

import apache_beam as beam
from apache_beam.transforms.periodicsequence import PeriodicImpulse

with beam.Pipeline(options=pipeline_options) as p:
    result = (
        p
        | PeriodicImpulse(fire_interval=1.0)
        | beam.Reshuffle()
        | beam.Map(print))
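(pipeline_options is not shown in the comment; a minimal guess at a streaming Flink portable-runner setup, with localhost:8081 as a placeholder Flink master address, would be something like the following.)

from apache_beam.options.pipeline_options import PipelineOptions

pipeline_options = PipelineOptions([
    '--runner=FlinkRunner',
    '--flink_master=localhost:8081',  # placeholder address
    '--streaming',
])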

portikCoder commented

Guys, maybe it's a completely stupid question, but as far as I understand it, the workaround only works when a single worker/machine is doing the job. What about cases like Dataproc, where the load is theoretically spread and you cannot control everywhere how the expansion service is spun up?

kennknowles (Member) commented

The expansion happens before the job is launched. It isn't per-worker.

portikCoder commented May 23, 2024

Thanks!
BTW, I figured out that if I run the Beam job like

python -m beam_streamer \
  --runner=FlinkRunner \
  --flink_master=${FLINK_MASTER_URL} \
  --flink_submit_uber_jar \
  --environment_type=DOCKER \
  --experiments=use_deprecated_read  # <-- the relevant flag

then it passes the flag downwards correctly:

/usr/local/lib/python3.10/site-packages/apache_beam/runners/worker/sdk_worker_main.py:135 [] - Pipeline_options: {'streaming': True, 'project': ..., 'experiments': ['use_deprecated_read', 'beam_fn_api'], ...

Well, scratch the above... I reinvented the wheel only to figure out that it's not enough; you definitely need that extra expansion service running manually, externally. 🤷

hemavenkatarangan-2 commented

Hi, is this issue resolved? I'm also facing the same issue as of now. Need a fix.
