Task doesn't fail after read.timeout.ms is exceeded #534
Comments
I'm experiencing the same condition as well. I was able to reproduce this with TRACE-level logging, and it shows interesting behavior in the retry portion. I've got max.retries set to 19 and didn't set retry.backoff.ms, so it defaults to 100. I have a reverse load balancer on the Kafka Connect hosts that connects to Elasticsearch, so after indexing a few documents I just shut that off on the node hosting the single task I have configured. This is what the log shows:
So it looks like the retry process is entered, but then the behavior doesn't make a lot of sense. There's only a single retry, then 0 records are put into Elasticsearch, then nothing. There are actually more records waiting, but nothing happens.
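For context, the setup described above corresponds to a sink configuration roughly like the following. This is a hedged sketch, not the poster's actual config: the connector name, topic, and connection.url are placeholders, max.retries is set to 19 as mentioned above, and retry.backoff.ms and read.timeout.ms are shown at what I believe are their defaults.

```json
{
  "name": "es-sink-test",
  "config": {
    "connector.class": "io.confluent.connect.elasticsearch.ElasticsearchSinkConnector",
    "topics": "my-topic",
    "connection.url": "http://my-reverse-proxy:9200",
    "max.retries": "19",
    "retry.backoff.ms": "100",
    "read.timeout.ms": "3000",
    "key.ignore": "true",
    "schema.ignore": "true"
  }
}
```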
I've tried all manner of configuration options (increasing linger.ms, max.inflight.requests, etc.) and I simply cannot get the BulkProcessor portion of this code to do a retry. I've gone as far as adding a couple of debug statements to the code (inside ElasticSearchClient.java and RetryUtil.java) to try to understand what's going on here. It looks like we're running into an issue with the native Elasticsearch Java client, with an open issue here: #71159. I then saw the PR that was made as a result, PR #513, which was the attempt to make this work outside of the native client.
Question is, has anyone actually seen this retry behavior work outside of the initial index check? From what I can tell, the retry method implemented in the sink code here should be working, but it falls victim to whatever async thread condition seems to show up randomly in the Elasticsearch native client BulkProcessor. Are there other recovery options we could use within the Kafka Connect framework, such as inducing a retryable task-level failure? That would at least allow for some automated response when tasks enter this state, instead of requiring manual intervention to restart the task based on outside monitoring (we're currently having to monitor the consumer group offset for this instead of relying on the status of the connector and associated tasks).
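For what it's worth, here is a minimal sketch of the kind of task-level failure handling suggested above, assuming we could hook into the sink task's put() path. The class name, the bulkIndex() hook, and the failure threshold are hypothetical; RetriableException and ConnectException are the standard Kafka Connect mechanisms for asking the framework to redeliver a batch or to fail the task outright.

```java
// Hypothetical sketch only -- not the connector's actual code. It shows how a
// Kafka Connect sink task can hand a failure back to the framework instead of
// retrying indefinitely inside the Elasticsearch client.
import java.util.Collection;

import org.apache.kafka.connect.errors.ConnectException;
import org.apache.kafka.connect.errors.RetriableException;
import org.apache.kafka.connect.sink.SinkRecord;
import org.apache.kafka.connect.sink.SinkTask;

public abstract class FailFastEsSinkTask extends SinkTask {

  // Placeholder for the real bulk-indexing call; assumed to throw on I/O errors.
  protected abstract void bulkIndex(Collection<SinkRecord> records) throws Exception;

  private static final int MAX_CONSECUTIVE_FAILURES = 5; // illustrative threshold
  private int consecutiveFailures = 0;

  @Override
  public void put(Collection<SinkRecord> records) {
    try {
      bulkIndex(records);
      consecutiveFailures = 0;
    } catch (Exception e) {
      consecutiveFailures++;
      if (consecutiveFailures >= MAX_CONSECUTIVE_FAILURES) {
        // Non-retriable: the task transitions to FAILED, which shows up in the
        // connector status and can trigger an automated restart.
        throw new ConnectException("Elasticsearch still unreachable, failing task", e);
      }
      // Retriable: the framework backs off and redelivers the same records
      // on a later put() call.
      throw new RetriableException("Transient Elasticsearch failure, will retry", e);
    }
  }
}
```

The point is just that a retriable or hard failure raised at the task level is visible in the connector status, so monitoring wouldn't have to fall back to watching consumer group offsets.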
@thomas-tomlinson I am experiencing similar issues with retries hanging. I noticed there is some active development going on in #575 and I hope it will fix the issue.
I built the sink connector from the 11.0.x branch (which has the #575 PR merged) and my initial testing shows retry behavior that works during a BulkProcessor request. I'm inducing this failure by stopping the local reverse proxy that the connect node is configured to use for the search cluster address. Here's the logging on my test node that shows the desired retry behavior:
Did this ever get resolved? I am also seeing this one, along with #739, and I am using the latest version of the connector.
TL;DR
Elasticsearch becomes unavailable in the middle of a connection, which causes read.timeout.ms to be exceeded. I expected the task to fail, but the task keeps the status RUNNING.
Description
Hello! I'm trying to set up this Elasticsearch connector to interact with tenant-owned systems, hence I need it to be as resilient as possible to failures of the tenants' systems.
While testing for Elasticsearch unavailability, I tried killing its only master node, and I get this error:
The error seems OK, but the task doesn't fail as I expected.
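(For reference, the mismatch between the RUNNING status and the growing backlog can be observed roughly like this. A hedged sketch: it assumes a Connect worker on localhost:8083, the default connect-<connector-name> consumer group naming, and a placeholder connector name.)

```sh
# Task state still reports RUNNING even though nothing is being indexed
curl -s http://localhost:8083/connectors/es-sink/status | jq '.tasks[].state'

# ...while the sink's consumer group lag keeps growing
kafka-consumer-groups --bootstrap-server localhost:9092 \
  --describe --group connect-es-sink
```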
My connector configuration:
Versions
Docker image: confluentinc/cp-kafka-connect:6.1.0
kafka-connect-elasticsearch: confluentinc/kafka-connect-elasticsearch:11.0.4
Is this supposed to happen? Thanks in advance!!! <3