
[uReplicator] uReplicator worker is crashed with specific topic #313

Open
binhtd opened this issue Jun 12, 2020 · 9 comments

Comments

@binhtd

binhtd commented Jun 12, 2020

We deploy uReplicator on Kubernetes in GCP to replicate data between our data center and the cloud. One of the replicated topics is large, with a single partition and high throughput. It worked normally for a long time, but one day the uReplicator worker crashed with exit code 255 and kept crashing after several restarts.

We found this error in the pod logs:

[2020-06-08 17:00:35,440] INFO [Producer clientId=group-uReplicator-kafka-d1-prod1-null-0] Proceeding to force close the producer since pending requests could not be completed within timeout 9223372036854775807 ms. (org.apache.kafka.clients.producer.KafkaProducer:1078)
[2020-06-08 17:00:35,441] ERROR [group-uReplicator-kafka-d1-prod1-null-0] Closing producer due to send failure. topic: evoucher.event.voucher_serial (com.uber.stream.ureplicator.worker.DefaultProducer:123)
java.lang.IllegalStateException: Producer is closed forcefully


We worked around the issue temporarily by removing the failing topic from the list of replicated topics; with a different topic in its place, the worker came back up and replicated normally. We don't know exactly what happened in that case.

@maxtpham

maxtpham commented Jun 12, 2020

I analyzed the issue and can provide some more information.

Code Analysis

  1. DefaultProducer.send() calls this.producer.send(record, new UReplicatorProducerCallback(record.topic(), srcPartition, srcOffset));
  2. KafkaProducer.send() throws the exception and logs log.debug("Exception occurred during message send:", e);
  3. UReplicatorProducerCallback.onCompletion() → LOG('Closing producer due to send failure') and closes the producer.
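The failure path in the three steps above can be sketched with a minimal stand-in (the class names ProducerCallbackSketch and FakeProducer are my own simplifications, not the real uReplicator source; the actual classes live in org.apache.kafka.clients.producer and com.uber.stream.ureplicator.worker):

```java
import java.util.concurrent.atomic.AtomicBoolean;

public class ProducerCallbackSketch {

    // Stand-in for KafkaProducer: here send() simulates a send that
    // fails after the producer was force-closed.
    static class FakeProducer {
        final AtomicBoolean closed = new AtomicBoolean(false);

        void send(byte[] record, Callback cb) {
            cb.onCompletion(new IllegalStateException("Producer is closed forcefully"));
        }

        void close() { closed.set(true); }
    }

    interface Callback { void onCompletion(Exception e); }

    public static void main(String[] args) {
        FakeProducer producer = new FakeProducer();
        // Mirrors what UReplicatorProducerCallback.onCompletion() does:
        // on a send failure it logs the error and closes the producer,
        // which aborts all in-flight batches for every replicated topic.
        producer.send(new byte[0], e -> {
            if (e != null) {
                System.out.println("Closing producer due to send failure: " + e.getMessage());
                producer.close();
            }
        });
        System.out.println("closed=" + producer.closed.get());
    }
}
```

The point of the sketch is that a single failing topic can take down the whole worker, because the callback closes the shared producer rather than failing only that topic's sends.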

Surrounding log:

2020-06-08 22:53:35.346 ICT java.lang.IllegalStateException: Producer is closed forcefully.
	at org.apache.kafka.clients.producer.internals.RecordAccumulator.abortBatches(RecordAccumulator.java:696)
	at org.apache.kafka.clients.producer.internals.RecordAccumulator.abortIncompleteBatches(RecordAccumulator.java:683)
	at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:185)
	at java.lang.Thread.run(Thread.java:748)
[2020-06-08 15:53:35,346] INFO [Producer clientId=group-uReplicator-kafka-d1-prod1-null-0] Closing the Kafka producer with timeoutMillis = 9223372036854775807 ms. (org.apache.kafka.clients.producer.KafkaProducer:1054)
[2020-06-08 15:53:35,346] WARN [Producer clientId=group-uReplicator-kafka-d1-prod1-null-0] Overriding close timeout 9223372036854775807 ms to 0 ms in order to prevent useless blocking due to self-join. This means you have incorrectly invoked close with a non-zero timeout from the producer call-back. (org.apache.kafka.clients.producer.KafkaProducer:1060)
[2020-06-08 15:53:35,346] INFO [Producer clientId=group-uReplicator-kafka-d1-prod1-null-0] Proceeding to force close the producer since pending requests could not be completed within timeout 9223372036854775807 ms. (org.apache.kafka.clients.producer.KafkaProducer:1078)
[2020-06-08 15:53:35,346] ERROR [group-uReplicator-kafka-d1-prod1-null-0] Closing producer due to send failure. topic: evoucher.event.voucher_serial (com.uber.stream.ureplicator.worker.DefaultProducer:123)
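One note on the odd timeout value in these logs (this is my reading, not something from the uReplicator docs): 9223372036854775807 ms is simply Long.MAX_VALUE, the default used when KafkaProducer.close() is called without an explicit timeout; the WARN line then shows Kafka overriding it to 0 because close() was invoked from inside a producer callback.

```java
public class CloseTimeoutNote {
    public static void main(String[] args) {
        // The "timeoutMillis = 9223372036854775807 ms" in the log is
        // Long.MAX_VALUE, i.e. close() was called with no explicit timeout.
        System.out.println(Long.MAX_VALUE); // prints: 9223372036854775807
    }
}
```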

POSSIBILITY

  • Judging from the WARN line above, producer.close() is being invoked from inside the producer callback, so Kafka overrides the close timeout to 0 and force-aborts all incomplete batches ("Producer is closed forcefully").

SOLUTIONS

  • Temporary: remove the affected topic from replication and use another topic instead
  • Final solution: NONE - please someone help, thanks!

@yangy0000
Collaborator

It looks like uReplicator crashed because of a produce timeout. Can you share your producer configuration?

@binhtd
Author

binhtd commented Jun 15, 2020

Hi @yangy0000. Here are the consumer.properties and producer.properties files from inside the worker.

consumer.properties
root@d1-kafka-ureplicator-worker-98ddf5cbb-7lb5f:/uReplicator/config# cat consumer.properties
zookeeper.connect=10.100.3.101:2181,10.100.3.102:2181,10.100.3.103:2181
bootstrap.servers=10.100.3.101:9092,10.100.3.102:9092,10.100.3.103:9092

zookeeper.connection.timeout.ms=30000
zookeeper.session.timeout.ms=30000

group.id=group-uReplicator-d1-kafka-test

consumer.id=consume-uReplicator-d1-kafka
socket.receive.buffer.bytes=1048576
fetch.message.max.bytes=10000000
queued.max.message.chunks=5
key.deserializer=org.apache.kafka.common.serialization.ByteArrayDeserializer
value.deserializer=org.apache.kafka.common.serialization.ByteArrayDeserializer
auto.offset.reset=earliest


producer.properties

root@d1-kafka-ureplicator-worker-98ddf5cbb-7lb5f:/uReplicator/config# cat producer.properties

bootstrap.servers=xxx

client.id=group-uReplicator-d1-kafka-test

producer.type=async

compression.type=none
key.serializer=org.apache.kafka.common.serialization.ByteArraySerializer
value.serializer=org.apache.kafka.common.serialization.ByteArraySerializer

batch.size=262144
linger.ms=1000
buffer.memory=167772160
send.buffer.bytes=62914560
delivery.timeout.ms=600000
request.timeout.ms=30000

queue.buffering.max.messages=10
max.in.flight.requests.per.connection=5
max.request.size=104857600

security.protocol=SSL
ssl.truststore.location=/uReplicator/bin/kafka.truststore.jks
ssl.truststore.password=xxx

ssl.keystore.location=/uReplicator/bin/manager-clients.int.vinid.net.keystore.jks
ssl.keystore.password=xxx

If you need more information or have more suggestions, please let me know. Thanks.

@yangy0000
Collaborator

Can you try increasing request.timeout.ms to 120000? My suspicion is the worker crashes because of a request timeout.
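For reference, the suggested tweak would look like this in producer.properties (a sketch, not a tested fix; note the Java producer requires delivery.timeout.ms >= linger.ms + request.timeout.ms, which still holds against the existing delivery.timeout.ms=600000):

```properties
# Raise the per-request timeout from 30s to 120s to rule out
# request timeouts as the trigger for the producer force-close.
request.timeout.ms=120000

# Unchanged, but must satisfy:
# delivery.timeout.ms >= linger.ms + request.timeout.ms
# (600000 >= 1000 + 120000 holds)
delivery.timeout.ms=600000
linger.ms=1000
```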

@binhtd
Author

binhtd commented Jun 17, 2020

@yangy0000 Thanks for your suggestion. I will try it and get back to you when I have any information.

@dungnt081191

Hi @binhtd @thanhptr, any update on this issue?

@binhtd
Author

binhtd commented Jun 29, 2020

@yangy0000 @dungnt081191 It is quite hard to reproduce this error on our side. I tried setting firewall rules on the source-topic and target-topic VMs to simulate uReplicator losing connectivity to the source and target clusters; the controller and worker pods in k8s then restarted continuously, which is quite similar to our case. We will increase request.timeout.ms in the producer config and see what happens next time.

@tranthechinh


Hi all!
I would like to share some updates on this issue:

  • Our uReplicator cluster is serving tens of topics (including 'evoucher.event.voucher_serial').
  • After removing this topic from the topic mapping with 'curl -X DELETE ...' (blacklisting the topic), the worker pod works fine again.
  • I also tried changing the Helix cluster name and whitelisting the topic again, but the worker pod still crashes.

So I think our issue may come from the topic 'evoucher.event.voucher_serial' itself. If the issue came from the request.timeout.ms parameter in producer.properties, the error log should mention other topics too, and the worker pod should have crashed again after the topic was removed (blacklisted); in our case, the worker pod works fine after removal.

@yangy0000
Collaborator

Hi,
Thanks for the follow-up. A few more questions: what version of Kafka are you running on the source and destination clusters? What's the log message format? Since you mentioned it is a big topic, what's the throughput (bytes per second) on that topic?
