[Bug]: ReadFromKafka not forwarding in streaming mode on portable runners #25114
Comments
Streaming by definition will not end. Besides, the Python DirectRunner is not meant for production and does not have full streaming support. This is most likely working as intended.
True, but then how can I use an Apache Beam pipeline in streaming mode if it only gathers data and never sends it to the next step?
#24528 tracks various issues related to the streaming DirectRunner. I am not sure it can run even a simple KafkaIO pipeline. Are you able to use a portable Flink Runner, by chance?
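(For reference, a minimal sketch of pipeline options for the portable Flink Runner; the flink_master address assumes a local Flink cluster at localhost:8081, and LOOPBACK keeps the Python SDK workers in-process, which is convenient for local testing:)

from apache_beam.options.pipeline_options import PipelineOptions

beam_options = PipelineOptions(
    streaming=True,
    runner="FlinkRunner",
    flink_master="localhost:8081",
    environment_type="LOOPBACK",  # run Python SDK workers in the launching process
)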
I am trying to implement it, but so far I am facing the same issue. I can see my pipeline in the Apache Flink UI at localhost:8081, but nothing happens. I am debugging it to see if I made any mistake.
After researching and testing, I found that the Flink Runner does not help either, because Apache Flink buffers a lot of data before releasing it. My use case requires every message received from Kafka to be forwarded immediately to the next step in the pipeline (locally).
FYI, if the Flink Runner has the same issue, it may be hitting #22809; the issue on the Python side may still persist.
To reproduce:
I am using this producer config: producer = Producer(json.load(open("producer_config.json")))
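(A minimal sketch of the producer side, assuming confluent-kafka-python and a producer_config.json containing at least "bootstrap.servers"; the topic name is illustrative:)

import json

from confluent_kafka import Producer

producer = Producer(json.load(open("producer_config.json")))
for i in range(10):
    # Queue test messages for the topic the Beam pipeline reads from.
    producer.produce("my_topic", value=("message %d" % i).encode("utf-8"))
producer.flush()  # block until all queued messages are delivered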
Please note that if I use Flink Runner 1.14 I get:
ERROR:apache_beam.utils.subprocess_server:Starting job service with ['java', '-jar', '/root/.apache_beam/cache/jars/beam-runners-flink-1.14-job-server-2.44.0.jar', '--flink-master', 'http://localhost:8081', '--artifacts-dir', '/tmp/beam-temp1of29sbe/artifactsz8je29uo', '--job-port', '33755', '--artifact-port', '0', '--expansion-port', '0']
So there are bugs in the portable runners? :(
Yes, or a missing feature.
The implementation of Kafka in the Python SDK + Portable Runner is unfortunately rather broken for streaming use cases. I don't understand why there isn't a native Python implementation based on https://github.com/confluentinc/confluent-kafka-python that doesn't have to deal with the portability layer. It would be much more reliable, even if perhaps less capable of parallel compute. Our company has abandoned Beam and Dataflow for this very reason.

The last bug I opened, in August 2022 (#22809), was closed today but still depends on two other issues, one of which (#25114) remains unsolved half a year later. The Python SDK is clearly not a priority for the core team. Maybe they're too busy focusing on GCP-specific products like PubSub to put in the effort to make open source tools, like Kafka, work properly in Beam's Python SDK. There isn't even a single unit test in the test suite for an unbounded Kafka stream being windowed and keyed.

As someone who really believes in Beam as a great portable standard for data engineering, it's sad to see the lack of interest from the core team in anything that is not making Google money (although we would still be paying for Dataflow if it worked).
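(For context, a plain confluent-kafka-python consumer loop with no Beam and no portability layer involved looks roughly like the sketch below; all config values are illustrative:)

from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",  # illustrative
    "group.id": "beam-issue-demo",          # illustrative
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["my_topic"])
while True:
    msg = consumer.poll(timeout=1.0)  # None if no message arrived within the timeout
    if msg is None:
        continue
    if msg.error():
        print(msg.error())
        continue
    print("received!", msg.value())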
Hi @alexmreis, sorry if there is any misunderstanding. #22809 was closed because the issue on the KafkaIO side was fixed by #24205 (its comment says it closes #22809: #24205 (comment)). That said, the Dataflow Runner use case should be fixed in the upcoming Beam v2.45.0.

The issues still seen on the portable runners (Flink, streaming DirectRunner) are not limited to the Kafka source; they affect every "splittable DoFn" streaming source. This functionality is not yet supported by the portable runners (#20979). I also got bitten by this issue quite often (when validating the fix of #24205; see the comments I left on #22809). The gap between Dataflow and the local runners is definitely an important thing to improve, as it has a direct impact on developers.

Besides, the absence of unit tests in the Python Kafka IO is intentional. Within the cross-language framework, the code running the Kafka read is Java's KafkaIO, and the unit tests are exercised there. We have Cross-Language Validates Runner (XVR) tests for each xlang IO and each SDK, exercised on a schedule, and I recently added a Python KafkaIO performance test as well. That said, KafkaIO in both Java and Python is our team's priority.
Was this issue addressed in the new version 2.45? |
Not yet. This is a feature gap in the portable runners and may need substantial effort. I am currently trying to work on it, though.
Any update for this issue in version 2.47? |
Any update for this issue in version 2.48? |
Any update? |
I have not been able to get to this.
Any update for this issue in version 2.49? |
Any idea where the problem is, so we can try to make a workaround?
Any news for version 2.50? |
Almost a year has passed without any clear response on whether this bug will be fixed. Seven versions later, from 2.44 to 2.51, the bug remains.
For anybody stumbling upon this issue: a year later, this bug is still present.
There is no plan to fix this for the Python DirectRunner. We are moving to the Prism Runner (#29650); the goal is to make it the default local runner for all SDKs, so that users can do local testing and development with it. This work is currently ongoing. @kennknowles FYI for Beam on Flink.
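(For anyone reading this on a recent Beam release: selecting Prism from Python is expected to look like the sketch below, assuming a Beam version that ships PrismRunner; check the release notes for your version:)

from apache_beam.options.pipeline_options import PipelineOptions

beam_options = PipelineOptions(streaming=True, runner="PrismRunner")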
What happened?
ReadFromKafka is not forwarding messages in streaming mode.
Using apache-beam 2.44.0:
import json

import apache_beam as beam
from apache_beam.io.kafka import ReadFromKafka
from apache_beam.options.pipeline_options import PipelineOptions

beam_options = PipelineOptions(streaming=True)
pipeline = beam.Pipeline(options=beam_options)
messages = (
    pipeline
    # ReadFromKafka is a cross-language transform backed by Java's KafkaIO.
    | 'Read from Kafka' >> ReadFromKafka(
        consumer_config=json.load(open("config/consumer_config_beam.json")),
        topics=[topic]  # ReadFromKafka expects a list of topic names
    )
    | 'Print messages' >> beam.Map(lambda message: print("received!"))
)
pipeline.run().wait_until_finish()
Hello, the code above gets stuck on ReadFromKafka.
Adding max_num_records only waits for that specific number of records, then forwards them to the next step and ends the pipeline.
(I am using the DirectRunner because I need to run the code locally.)
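(For reference, the bounded-read workaround mentioned above replaces the read step with something like the sketch below; max_num_records makes the source bounded, so the pipeline drains and exits instead of streaming forever. Config values are illustrative:)

from apache_beam.io.kafka import ReadFromKafka

read_step = ReadFromKafka(
    consumer_config={"bootstrap.servers": "localhost:9092"},  # illustrative
    topics=["my_topic"],  # illustrative
    max_num_records=100,  # stop after 100 records instead of reading forever
)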
Issue Priority
Priority: 1 (data loss / total loss of function)