
[uReplicator] uReplicator worker is crashed with specific topic #313

Open
binhtd opened this issue Jun 12, 2020 · 9 comments

Comments

@binhtd

binhtd commented Jun 12, 2020

We deploy uReplicator on Kubernetes in GCP to replicate data between our data center and the cloud. One of the replicated topics is large, with a single partition and high throughput. It worked normally for a long time, but one day the uReplicator worker crashed with exit code 255 and kept crashing after several restarts.

We found this error in the pod logs:

[2020-06-08 17:00:35,440] INFO [Producer clientId=group-uReplicator-kafka-d1-prod1-null-0] Proceeding to force close the producer since pending requests could not be completed within timeout 9223372036854775807 ms. (org.apache.kafka.clients.producer.KafkaProducer:1078)
[2020-06-08 17:00:35,441] ERROR [group-uReplicator-kafka-d1-prod1-null-0] Closing producer due to send failure. topic: evoucher.event.voucher_serial (com.uber.stream.ureplicator.worker.DefaultProducer:123)
java.lang.IllegalStateException: Producer is closed forcefully


We worked around the issue temporarily by removing the failing topic from the list of replicated topics; with a different topic in its place, the worker came back up and replicated normally. We don't know exactly what happened in that case.

@maxtpham

maxtpham commented Jun 12, 2020

I analyzed the issue and can provide some more information.

Code Analysis

  1. DefaultProducer.send() calls this.producer.send(record, new UReplicatorProducerCallback(record.topic(), srcPartition, srcOffset));
  2. KafkaProducer.send() throws the exception and logs log.debug("Exception occurred during message send:", e);
  3. UReplicatorProducerCallback.onCompletion() → LOG('Closing producer due to send failure') and closes the producer.
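The failure path in the three steps above can be sketched with a minimal stand-in (the class names ProducerCallbackSketch and FakeProducer are my own simplifications, not the real uReplicator source; the actual classes live in org.apache.kafka.clients.producer and com.uber.stream.ureplicator.worker):

```java
import java.util.concurrent.atomic.AtomicBoolean;

public class ProducerCallbackSketch {

    // Stand-in for KafkaProducer: here send() simulates a send that
    // fails after the producer was force-closed.
    static class FakeProducer {
        final AtomicBoolean closed = new AtomicBoolean(false);

        void send(byte[] record, Callback cb) {
            cb.onCompletion(new IllegalStateException("Producer is closed forcefully"));
        }

        void close() { closed.set(true); }
    }

    interface Callback { void onCompletion(Exception e); }

    public static void main(String[] args) {
        FakeProducer producer = new FakeProducer();
        // Mirrors what UReplicatorProducerCallback.onCompletion() does:
        // on a send failure it logs the error and closes the producer,
        // which aborts all in-flight batches for every replicated topic.
        producer.send(new byte[0], e -> {
            if (e != null) {
                System.out.println("Closing producer due to send failure: " + e.getMessage());
                producer.close();
            }
        });
        System.out.println("closed=" + producer.closed.get());
    }
}
```

The point of the sketch is that a single failing topic can take down the whole worker, because the callback closes the shared producer rather than failing only that topic's sends.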

Surrounding log:

2020-06-08 22:53:35.346 ICT java.lang.IllegalStateException: Producer is closed forcefully.
	at org.apache.kafka.clients.producer.internals.RecordAccumulator.abortBatches(RecordAccumulator.java:696)
	at org.apache.kafka.clients.producer.internals.RecordAccumulator.abortIncompleteBatches(RecordAccumulator.java:683)
	at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:185)
	at java.lang.Thread.run(Thread.java:748)
[2020-06-08 15:53:35,346] INFO [Producer clientId=group-uReplicator-kafka-d1-prod1-null-0] Closing the Kafka producer with timeoutMillis = 9223372036854775807 ms. (org.apache.kafka.clients.producer.KafkaProducer:1054)
[2020-06-08 15:53:35,346] WARN [Producer clientId=group-uReplicator-kafka-d1-prod1-null-0] Overriding close timeout 9223372036854775807 ms to 0 ms in order to prevent useless blocking due to self-join. This means you have incorrectly invoked close with a non-zero timeout from the producer call-back. (org.apache.kafka.clients.producer.KafkaProducer:1060)
[2020-06-08 15:53:35,346] INFO [Producer clientId=group-uReplicator-kafka-d1-prod1-null-0] Proceeding to force close the producer since pending requests could not be completed within timeout 9223372036854775807 ms. (org.apache.kafka.clients.producer.KafkaProducer:1078)
[2020-06-08 15:53:35,346] ERROR [group-uReplicator-kafka-d1-prod1-null-0] Closing producer due to send failure. topic: evoucher.event.voucher_serial (com.uber.stream.ureplicator.worker.DefaultProducer:123)
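One note on the odd timeout value in these logs (this is my reading, not something from the uReplicator docs): 9223372036854775807 ms is simply Long.MAX_VALUE, the default used when KafkaProducer.close() is called without an explicit timeout; the WARN line then shows Kafka overriding it to 0 because close() was invoked from inside a producer callback.

```java
public class CloseTimeoutNote {
    public static void main(String[] args) {
        // The "timeoutMillis = 9223372036854775807 ms" in the log is
        // Long.MAX_VALUE, i.e. close() was called with no explicit timeout.
        System.out.println(Long.MAX_VALUE); // prints: 9223372036854775807
    }
}
```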

POSSIBILITY

  • Judging from the WARN line above, producer.close() is being invoked from inside the producer callback, so Kafka overrides the close timeout to 0 and force-aborts all incomplete batches ("Producer is closed forcefully").

SOLUTIONS

  • Temporary: remove the affected topic from replication and use another topic instead
  • Final solution: NONE - please someone help, thanks!

@yangy0000
Collaborator

It looks like uReplicator crashed because of a produce timeout. Can you share your producer configuration?

@binhtd
Author

binhtd commented Jun 15, 2020

Hi @yangy0000. Here are the consumer.properties and producer.properties files from inside the worker.

consumer.properties
root@d1-kafka-ureplicator-worker-98ddf5cbb-7lb5f:/uReplicator/config# cat consumer.properties
zookeeper.connect=10.100.3.101:2181,10.100.3.102:2181,10.100.3.103:2181
bootstrap.servers=10.100.3.101:9092,10.100.3.102:9092,10.100.3.103:9092

zookeeper.connection.timeout.ms=30000
zookeeper.session.timeout.ms=30000

group.id=group-uReplicator-d1-kafka-test

consumer.id=consume-uReplicator-d1-kafka
socket.receive.buffer.bytes=1048576
fetch.message.max.bytes=10000000
queued.max.message.chunks=5
key.deserializer=org.apache.kafka.common.serialization.ByteArrayDeserializer
value.deserializer=org.apache.kafka.common.serialization.ByteArrayDeserializer
auto.offset.reset=earliest


producer.properties

root@d1-kafka-ureplicator-worker-98ddf5cbb-7lb5f:/uReplicator/config# cat producer.properties

bootstrap.servers=xxx

client.id=group-uReplicator-d1-kafka-test

producer.type=async

compression.type=none
key.serializer=org.apache.kafka.common.serialization.ByteArraySerializer
value.serializer=org.apache.kafka.common.serialization.ByteArraySerializer

batch.size=262144
linger.ms=1000
buffer.memory=167772160
send.buffer.bytes=62914560
delivery.timeout.ms=600000
request.timeout.ms=30000

queue.buffering.max.messages=10
max.in.flight.requests.per.connection=5
max.request.size=104857600

security.protocol=SSL
ssl.truststore.location=/uReplicator/bin/kafka.truststore.jks
ssl.truststore.password=xxx

ssl.keystore.location=/uReplicator/bin/manager-clients.int.vinid.net.keystore.jks
ssl.keystore.password=xxx

If you need more information or have more suggestions, please let me know. Thanks.

@yangy0000
Collaborator

Can you try increasing request.timeout.ms to 120000? My suspicion is the worker crashes because of a request timeout.
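For reference, the suggested tweak would look like this in producer.properties (a sketch, not a tested fix; note the Java producer requires delivery.timeout.ms >= linger.ms + request.timeout.ms, which still holds against the existing delivery.timeout.ms=600000):

```properties
# Raise the per-request timeout from 30s to 120s to rule out
# request timeouts as the trigger for the producer force-close.
request.timeout.ms=120000

# Unchanged, but must satisfy:
# delivery.timeout.ms >= linger.ms + request.timeout.ms
# (600000 >= 1000 + 120000 holds)
delivery.timeout.ms=600000
linger.ms=1000
```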

@binhtd
Author

binhtd commented Jun 17, 2020

@yangy0000 Thanks for your suggestion. I will try it and get back to you when I have any information.

@dungnt081191

Hi @binhtd @thanhptr, any update on this issue?

@binhtd
Author

binhtd commented Jun 29, 2020

@yangy0000 @dungnt081191 It is quite hard to reproduce this error on our side. I tried setting firewall rules on the source-topic and target-topic VMs to simulate uReplicator losing connectivity to the source and target clusters; the controller and worker pods in k8s then restarted continuously, which is quite similar to our case. We will increase request.timeout.ms in the producer config and see what happens next time.

@tranthechinh


Hi all!
I would like to share some updates on this issue:

  • Our uReplicator cluster is serving tens of topics (including 'evoucher.event.voucher_serial').
  • After removing this topic from the topic mapping with 'curl -X DELETE ...' (blacklisting the topic), the worker pod works fine again.
  • I also tried changing the Helix cluster name and whitelisting the topic again, but the worker pod still crashes.

So I think our issue may come from the topic 'evoucher.event.voucher_serial' itself. If the issue came from the request.timeout.ms parameter in producer.properties, the error log should mention other topics too, and the worker pod should have crashed again after the topic was removed (blacklisted); in our case, the worker pod works fine after removal.

@yangy0000
Collaborator

Hi,
Thanks for the follow-up. A few more questions: what version of Kafka are you running on the source and destination clusters? What's the log message format? Since you mentioned it is a big topic, what's the throughput (bytes per second) on that topic?
