[Bug]: Python SDK gets stuck when using Unbounded PCollection in streaming mode on GroupByKey after ReadFromKafka on DirectRunner, FlinkRunner and DataflowRunner #22809
Comments
I can also confirm the same problem does not happen when using the Java SDK, pointing to some sort of issue in the portability layer.
I can also confirm the same works in DataflowRunner with GCP PubSubLite, which is a Java cross-language IO adapter, so it is not the cross-language portability layer either; could it be the way KafkaIO handles timestamps? On DirectRunner, though, messages remain pending watermarks on the PubSubLiteIO component and windows never get triggered. On FlinkRunner the environment variable with the GCP credentials path is not passed through to the Java SDK and it fails at that point.
CC: @chamikaramj
I have a very similar issue, but instead of using the Kafka module in Beam I am using the Kafka module in beam_nuggets, a wrapper of the Kafka Python client. With this source I have to add the timestamp by hand using beam.window.TimestampedValue. I tried to analyze the data after applying the window transformation by using the AnalyzeElement class defined here (Example 2). Data is correctly assigned to a window. I do not know if it is related, but I have also tried reading from a file with ReadFromText and the streaming pipeline option: data is processed line by line up to the
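For reference, a minimal sketch of the manual timestamping step described above, assuming the consumed records are plain values and that processing time is acceptable as event time (assumptions mine, not from the comment):

```python
import time
import apache_beam as beam

# Wrap each record in a TimestampedValue so that downstream windowing
# has an event timestamp to work with (here: the processing time).
add_timestamps = beam.Map(
    lambda msg: beam.window.TimestampedValue(msg, time.time()))
```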
Ok, I've dug into this quite a bit:
cc: @lukecwik
I've tried plain Java with a Kafka source, and it looks like SDF on Runner v2 exhibits this as well, while unbounded on Runner v1 works normally. I'm going to try SDF on v1 and unbounded on v2 to try to isolate.
For whatever reason,
and there isn't a way to try to use SDF on v1, as there is an override.
Well, the above is wrong. with
I am now seeing elements coming through. This appears to be an issue with the SDF source, not the unified worker.
Which doesn't make much sense, as that was the original case that prevented data on Python...
I've re-confirmed the base Python case does not work. So: Python x-lang with an unbounded underlying source does not work. I don't really know what is causing the issues.
Ok, it looks like Python with unbounded actually is working?
with configs:
So this is certainly an SDF error.
Ok, after some experimentation, this appears to be caused when data is 'sparse' on Kafka. We only advance the SDF watermark when we see records, so if there are no records for a partition (which can be 1:1 with Beam splits), then we will never advance the watermark for that DoFn. As a result the watermark never advances far enough to trigger the window, and no data is emitted downstream.
tyvm to @damccorm for the idea
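One practical way to confirm (or temporarily dodge) the sparsity problem described above is to keep every partition non-empty. A minimal sketch, assuming a local broker at localhost:9092, a topic named test, and the kafka-python client (all assumptions, none taken from the thread):

```python
import time
from kafka import KafkaProducer  # kafka-python; an assumption, not used in the thread

producer = KafkaProducer(bootstrap_servers='localhost:9092')  # placeholder broker
topic = 'test'  # placeholder topic

# Send a heartbeat record to every partition every 10 seconds so that even
# otherwise-empty partitions see data and their splits can advance the watermark.
while True:
    for partition in producer.partitions_for(topic):
        producer.send(topic, value=b'heartbeat', partition=partition)
    producer.flush()
    time.sleep(10)
```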
If I understood correctly, #24205 fixes the SDF implementation of kafka read, i.e.
still persist, i.e.,
Edit: missed #22809 (comment)
Any updates on this issue?
The issue on the Java side (SDF implementation) should be fixed in #24205 and will be shipped with the upcoming v2.45.0. @johnjcasey could not reproduce the issue of Python using the legacy implementation, but there is also another report, #25114 (we temporarily changed the Python SDK to use the legacy implementation from v2.42.0). I am going to take a look.
Giving up testing locally... The Python direct runner does not quite support streaming and gets stuck indefinitely; the Flink runner does not see incoming data, waits for ~1 min, then the following error happens and the job fails:
Setting max_records or max_read_time does make records pop up once the record count is reached or the timeout expires. Will test on the Dataflow runner tomorrow. We definitely need to improve the test infrastructure; the missing piece in the direct runner has generated substantial confusion...
Thank you for your reply.
Confirmed that the master branch works as expected on the Dataflow runner.

Job: https://ci-beam.apache.org/job/beam_PerformanceTests_xlang_KafkaIO_Python/47/console

Pipeline setup hacked from the Python x-lang Kafka performance test: https://github.com/apache/beam/blob/8978e6375a52d9e676539bfaef2a4e35775443bb/sdks/python/apache_beam/io/external/xlang_kafkaio_perf_test.py

```python
_ = (
    self.pipeline
    | 'ReadFromKafka' >> kafka.ReadFromKafka(
        consumer_config={
            'bootstrap.servers': self.test_options.bootstrap_servers,
            'auto.offset.reset': 'earliest'
        },
        topics=[self.kafka_topic])
    | beam.Map(lambda x: ('0', x))
    | "Fixed window 5s" >> beam.WindowInto(window.FixedWindows(5))
    | "Group by key" >> beam.GroupByKey()
    | 'Print Record Fn' >> beam.Map(printRec))
```

Update: tested the SDF read version on Dataflow; elements emitted as expected as well.
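For context, running a pipeline like the one above in streaming mode on Dataflow needs the streaming option set. A minimal sketch of the pipeline options, with placeholder values that are not taken from the thread:

```python
from apache_beam.options.pipeline_options import PipelineOptions

# All values below are placeholders, not taken from the test job above.
options = PipelineOptions(
    runner='DataflowRunner',
    project='my-project',                # placeholder GCP project
    region='us-central1',                # placeholder region
    temp_location='gs://my-bucket/tmp',  # placeholder bucket
    streaming=True,                      # required for an unbounded Kafka source
)
```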
Note that portable runners may run into #20979 when reading from Kafka.
Thanks @chamikaramj for stepping in. Is the unbounded read also affected by #20979? I tested on a local Flink runner; neither the unbounded nor the SDF read emitted records.
It only affects SDF, but UnboundedSources get converted to SDFs when used on portable runners. Non-portable Spark/Flink should not be affected by that bug.
Hello everyone. I am trying to build a Python Dataflow pipeline with Kafka as the input. I am experiencing issues with consuming from Kafka, both with the DirectRunner and the DataflowRunner. If I add max_records I can get data from the DirectRunner, but I haven't been able to consume messages with the DataflowRunner. I think the Dataflow issue might actually be related to networking between GCP and my on-prem cluster; I am working on that, but it looks like others have struggled to get Dataflow working correctly. I can see a couple of different tickets related to this issue and I wanted to ask for some clarity on the situation, as there is a lot of information:
Thanks a lot for any help!
This likely indicates that the issue you are running into for Dataflow is unrelated to other issues mentioned here. Possibly it's due to Dataflow workers not being able to connect to your on-prem cluster, but it's hard to say without looking at the job. If you file a Dataflow support ticket, they should be able to look at your specific job. The issue #20979 mentioned above should not affect Dataflow.
Sorry, I am not setting max_records on my Dataflow jobs. I can try that to separate out networking issues, though.
I have been able to consume with v2.43.0 from Dataflow.
What happened?
Consider the trivial example pipeline below:
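A minimal sketch of such a pipeline, assuming a broker at localhost:9092 and a topic named test (placeholders, not taken from the report):

```python
import apache_beam as beam
from apache_beam.io.kafka import ReadFromKafka
from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions

options = PipelineOptions()
options.view_as(StandardOptions).streaming = True  # unbounded source => streaming

with beam.Pipeline(options=options) as p:
    _ = (
        p
        | 'ReadFromKafka' >> ReadFromKafka(
            consumer_config={'bootstrap.servers': 'localhost:9092'},  # placeholder
            topics=['test'])                                          # placeholder
        | 'KeyByConstant' >> beam.Map(lambda record: ('0', record))
        | 'FixedWindow5s' >> beam.WindowInto(beam.window.FixedWindows(5))
        | 'GroupByKey' >> beam.GroupByKey()
        | 'Print' >> beam.Map(print))
```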
When this pipeline is run in at least these 3 environments:

- DataflowRunner (streaming mode)
- FlinkRunner (streaming mode, locally, not on a cluster; haven't tested with a cluster)
- DirectRunner (streaming mode)

all of them get stuck on the GroupByKey PTransform. The trigger is apparently never fired, though it is impossible to see it from the logging I get.

When adding max_num_records to the ReadFromKafka step, effectively transforming the source collection into a bounded collection, this works, both in batch and streaming mode, in all of the environments listed above (see the sketch below).

Data is timestamped in Kafka using process time, although it is unclear from the documentation whether the KafkaIO adapter in Beam automatically timestamps entries in the source PCollection it generates. I have also tried timestamping them manually using with_metadata and the msg.timestamp property returned, to no avail.

If I look at the Beam test suite, I see the ReadFromKafka PTransform is only tested without windowing and without grouping in Python. Should this maybe be added?

This impacts all Python workloads running on Kafka, and it seems rather surprising that no one else has run into this yet.
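For comparison, the bounded variant that the report says does work only differs in the read step; a hedged sketch (broker and topic are placeholders, not from the report):

```python
from apache_beam.io.kafka import ReadFromKafka

# Capping the number of records turns the otherwise unbounded Kafka source
# into a bounded collection, which is the configuration reported to work.
bounded_read = ReadFromKafka(
    consumer_config={'bootstrap.servers': 'localhost:9092'},  # placeholder
    topics=['test'],                                          # placeholder
    max_num_records=1000)
```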
Issue Priority
Priority: 3
Issue Component
Component: io-java-kafka