[Bug]: KinesisIO source on FlinkRunner initializes the same splits twice #31313
Comments
Adding a dump of the replication Flink code here: pom.xml, Java class.
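For readers without access to the attachments, the sketch below shows the general shape of a KinesisIO-on-FlinkRunner pipeline of the kind being discussed; the class name, stream name, and starting position are placeholders and not the attached repro code:

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.aws2.kinesis.KinesisIO;
import org.apache.beam.sdk.io.aws2.kinesis.KinesisRecord;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.TypeDescriptors;
import org.joda.time.Instant;

public class KinesisRepro {
  public static void main(String[] args) {
    // Run with --runner=FlinkRunner; AWS region/credentials come from the usual
    // AWS pipeline options. Stream name and start position are placeholders.
    Pipeline p = Pipeline.create(
        PipelineOptionsFactory.fromArgs(args).withValidation().create());

    p.apply("ReadKinesis",
            KinesisIO.read()
                .withStreamName("my-stream")
                .withInitialTimestampInStream(Instant.now()))
        .apply("PartitionKeys",
            MapElements.into(TypeDescriptors.strings())
                .via((KinesisRecord r) -> r.getPartitionKey()));

    p.run();
  }
}
```

Stopping a job like this with a savepoint and restarting from it is the scenario in which the duplicated split initialization is reported.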
I suppose this is a (similar, but) different issue, probably caused by the same underlying bug. #30903 fixed Impulse only. Does using … help?
@je-ik Hi, yes that did fix the issue, thank you! For my understanding, what does this option do exactly? And should I expect any performance degradation?
I am actually noticing a lot of backpressure using this approach, despite downstream operators having low CPU usage. Is the fix to the root cause relatively straightforward, in which case I could implement it in a forked version of the repo, or is it more involved?
I don't know the root cause; it seems that Flink does not send the snapshot state after restore from a savepoint. I observed this on the Impulse (I suspected that it affects only bounded sources running in unbounded mode, but it seems that is not the case). It might be a Beam bug or a Flink bug.
The flag turns on a different expansion for the Read transform: it uses splittable DoFn (SDF), which uses Impulse, which was fixed earlier. Performance should be similar to the classical Read.
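For reference, Beam experiments can be passed on the command line via --experiments or set programmatically. The sketch below uses beam_fn_api only because that is the experiment named later in this thread; the exact flag suggested here is not preserved in the page, and the class name is made up for illustration:

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.ExperimentalOptions;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class EnableExperiment {
  // Builds a pipeline with an experiment enabled programmatically.
  public static Pipeline createPipeline(String[] args) {
    PipelineOptions options =
        PipelineOptionsFactory.fromArgs(args).withValidation().create();
    // "beam_fn_api" is an assumption taken from a later comment in this thread.
    ExperimentalOptions.addExperiment(options.as(ExperimentalOptions.class), "beam_fn_api");
    return Pipeline.create(options);
  }
}
```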
Can you please provide a minimal example and setup to reproduce the behavior?
You can drain the Pipeline, see https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/cli/#terminating-a-job
This is related to how Flink computes target splits. It is affected by the maximal parallelism (which is computed automatically if not specified). You can try increasing it via …
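The option being referred to is cut off above. Assuming it is the Flink runner's maximum-parallelism setting exposed through FlinkPipelineOptions (an assumption, not confirmed by the comment), raising it might look like this:

```java
import org.apache.beam.runners.flink.FlinkPipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class MaxParallelismExample {
  // Parses args and bumps Flink's maximum parallelism; 128 is an arbitrary example value.
  static FlinkPipelineOptions fromArgs(String[] args) {
    FlinkPipelineOptions options =
        PipelineOptionsFactory.fromArgs(args).withValidation().as(FlinkPipelineOptions.class);
    options.setMaxParallelism(128);
    return options;
  }
}
```

If this is indeed the setting in question, it should also be settable from the command line (--maxParallelism=128).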
Thanks for the suggestions, I will give them a try. I believe the first comment of the ticket provides a simple pipeline that exhibits this behavior on the Flink runner, but if that doesn't work, I'm happy to provide another. The example also submits the job in detached mode, which may be related, although I have seen similar behavior without it. I appreciate your help looking into this; if there's anything I can assist with, please let me know.
Just to mimic the local setup I used: I ran … and then stopped the job with a savepoint, then restarted it using run with the savepoint path. When doing this, I looked inside the task manager logs and searched for … I also switched to Kafka and noticed the same behavior, so it seems to be related to the runner.

I was unable to fix the performance issues with beam_fn_api and noticed the backpressure was causing my data to come in waves. Looking at a CPU chart, it was very cyclic, with peaks of 99% CPU and troughs of 8% CPU, leading me to believe that this pipeline option was causing some sort of build-up and then a rush of data that spiked the CPU.

I can make do with Kafka offset commits for now, but if there are any pointers on how to fix this in the Beam source code, I'd be happy to take a look and even submit a PR to be included in version 2.57. Although I'm still hoping the issue is somewhere on my end that can be fixed fairly easily.
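As an aside on the Kafka offset-commit fallback mentioned above, this is roughly what that usually looks like with KafkaIO; the topic, bootstrap servers, and group id below are placeholders, not values from this pipeline:

```java
import java.util.Map;
import org.apache.beam.sdk.io.kafka.KafkaIO;
import org.apache.kafka.common.serialization.StringDeserializer;

public class KafkaOffsetCommitExample {
  // A KafkaIO read that commits consumed offsets back to Kafka when bundles are
  // finalized, so a restarted job can resume from the committed offsets instead
  // of relying solely on Flink state. All connection values are placeholders.
  static KafkaIO.Read<String, String> read() {
    return KafkaIO.<String, String>read()
        .withBootstrapServers("broker:9092")
        .withTopic("my-topic")
        .withKeyDeserializer(StringDeserializer.class)
        .withValueDeserializer(StringDeserializer.class)
        .withConsumerConfigUpdates(Map.<String, Object>of("group.id", "my-group"))
        .commitOffsetsInFinalize();
  }
}
```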
Hi @akashk99, just to be sure: do you observe the same behavior when not using …?
Hi @je-ik, I was just able to reproduce the issue by manually running the jar file. I started the job by running …

This was a few seconds after the job was submitted. I trimmed the output, but these two logs were there for all of my shards.
Hi, we are seeing the same behavior on our pipeline.
In our case we are using: …
What happened?

Bug description

Setup details:
- KinesisIO from beam-sdks-java-io-amazon-web-services2 (…)

Bug details:
org.apache.beam.sdk.io.aws2.kinesis.KinesisReader is assigned the same splits twice, once with snapshot state and once without. This leads to duplicate data being processed.

Replication steps:
…

Logs:
- shardId-000000000000 to shardId-000000000003 are first initialized with checkpoint state AFTER_SEQUENCE_NUMBER (correct).
- The same shards are then initialized a second time, without checkpoint state, at AT_TIMESTAMP (not correct).

Issue Priority
Priority: 3 (minor)
Issue Components