
Kafka Connect: significant delay in connector availability after restart #1509

Open · berezinsn opened this issue Nov 2, 2024 · 0 comments

Description

After restarting the Kafka Connect application, the Mongo sink connectors only become available (appear in the Kafka Connect UI) after a prolonged delay, ranging from 5 to 15 minutes. All connectors then become available simultaneously, and there are no subsequent issues with their operation.

Environment Details

  • Number of Connectors: 200
  • Deployment: single-instance Kafka Connect worker on Kubernetes
  • Pod Resource Limits:
    • Memory Limits: 8Gi
    • Memory Requests: 4Gi
    • Kafka Heap Options: -Xmx4096m -Xms2048m

Kafka Distributed Properties

key.converter=org.apache.kafka.connect.json.JsonConverter
value.converter=org.apache.kafka.connect.json.JsonConverter
key.converter.schemas.enable=false
value.converter.schemas.enable=false
offset.flush.interval.ms=10000
max.poll.records=10
plugin.path=/opt/bitnami/kafka/plugins
cleanup.policy=compact
group.id=kafka-connect-group
config.storage.topic=kafka-connect-configs
config.storage.replication.factor=1
offset.storage.topic=kafka-connect-offsets
offset.storage.replication.factor=1
status.storage.topic=kafka-connect-status
status.storage.replication.factor=1
producer.max.request.size=104857600
errors.retry.timeout=300000
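
For context, one property that is not in our file (shown for reference only, at its default value): the 25 partitions on the offsets topic mentioned in the Observations below come from this worker-level setting, which is fixed at topic creation time.

offset.storage.partitions=25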

Observations

Based on monitoring data, there are no network delays, and CPU and memory appear sufficient, so the application does not seem to be resource-bound.
While researching this problem, a second test environment was set up. On that environment a restart completes almost instantly, and connectors become available within seconds.
Configuration settings between the environments are identical. The only difference is the size of the kafka-connect-offsets system topic, which Kafka Connect creates automatically with 25 partitions by default. On the newly created environment the topic is only a few MB in size, while on the original environment it ranges from 0.8 GB to 2.4 GB, depending on compaction timing.
Below is a screenshot with 30 days of statistics on the kafka-connect-offsets topic's size.
[Screenshot 2024-11-02 at 14:22:48: 30-day size history of the kafka-connect-offsets topic]
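
To check whether the slow startup is dominated by replaying this topic (as far as I understand, Connect reads the offsets topic end-to-end on startup before connectors are reported as available), a rough standalone measurement could look like the sketch below. This is my own check, not Connect's code; the bootstrap address is a placeholder and the client is kafka-python.

import time
from kafka import KafkaConsumer, TopicPartition

TOPIC = "kafka-connect-offsets"
consumer = KafkaConsumer(
    bootstrap_servers="kafka:9092",  # placeholder, replace with real brokers
    enable_auto_commit=False,
)
parts = [TopicPartition(TOPIC, p) for p in consumer.partitions_for_topic(TOPIC)]
consumer.assign(parts)
consumer.seek_to_beginning()
end = consumer.end_offsets(parts)

start, total = time.time(), 0
# Poll until every partition has been read up to its end offset,
# mirroring the read-to-end that Connect performs before startup completes.
while any(consumer.position(tp) < end[tp] for tp in parts):
    batch = consumer.poll(timeout_ms=1000)
    total += sum(len(records) for records in batch.values())
print(f"replayed {total} records in {time.time() - start:.1f}s")

If this takes minutes on the original environment but seconds on the test one, that would confirm the topic-size hypothesis.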

Kafka Connect Offsets Topic Settings

compression.type gzip
leader.replication.throttled.replicas
min.insync.replicas 1
message.downconversion.enable true
segment.jitter.ms 0
cleanup.policy compact
flush.ms 1000
follower.replication.throttled.replicas
segment.bytes 1073741824
retention.ms 604800000
flush.messages 10000
message.format.version 2.7-IV2
max.compaction.lag.ms 9223372036854775807
file.delete.delay.ms 60000
max.message.bytes 1000012
min.compaction.lag.ms 0
message.timestamp.type CreateTime
preallocate false
index.interval.bytes 4096
min.cleanable.dirty.ratio 0.5
unclean.leader.election.enable false
retention.bytes 1073741824
delete.retention.ms 86400000
segment.ms 604800000
message.timestamp.difference.max.ms 9223372036854775807
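
One thing stands out to me in these settings: as far as I know, the Kafka log cleaner never compacts the active segment, and with segment.bytes=1073741824 (1 GiB) and segment.ms=604800000 (7 days), each partition can hold up to a week of heartbeat duplicates before any of them become eligible for compaction. If that reading is correct, rolling segments more often should keep the topic much smaller. The overrides below are illustrative values I am considering, not tested recommendations:

segment.ms=3600000
min.cleanable.dirty.ratio=0.1

These can be applied per topic with kafka-configs.sh (--alter --entity-type topics --entity-name kafka-connect-offsets --add-config ...).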

Compaction Observations

Although compaction runs periodically, the first offset remains at 0, and the total number of messages continues to grow even as the size of the topic decreases. There are numerous duplicates, especially among the heartbeat messages.

Typical Heartbeat Message Structure

Key:

[
    "mongo-source",
    {
        "ns": "mongo-source"
    }
]

Value:

{
    "_id": "{\"_data\": \"\"}",
    "HEARTBEAT": "true"
}

Every 10 seconds, each of the 200 connectors sends a message like this, resulting in 1,200 messages per minute (see the back-of-the-envelope check below).
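
A back-of-the-envelope check (the per-record size is my rough guess, not a measurement):

200 connectors × 6 messages/min = 1,200 messages/min
1,200 messages/min × 60 × 24 ≈ 1.73M messages/day
1.73M messages/day × 7 days (segment.ms) ≈ 12M messages per roll window
12M messages × ~100 bytes ≈ 1.2 GB

That is the same order of magnitude as the observed 0.8-2.4 GB, which fits the theory that the growth is uncompacted heartbeats sitting in active segments. Note also that all heartbeats from a given connector share the same key, so once their segments do roll, compaction should collapse them to one record per connector per partition.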

The topic continues to grow, which leads me to suspect an issue with compaction, since the size decreases after each pass while the total message count never goes down. Is it also normal that the first offset is still equal to 0? My current understanding (please correct me if this is wrong) is that it may be: compaction preserves the offsets of surviving records and does not advance the log start offset, so a first offset of 0 and an unchanged end-minus-start offset span would not by themselves prove that compaction is broken.
[Screenshot 2024-11-02 at 15:14:29]

Summary

What strategies can we employ to reduce the restart time of the Kafka Connect application?
We suspect the delay comes from Kafka Connect replaying the high volume of messages in the offsets topic on startup; increasing resources hasn't noticeably improved restart speed.
Has anyone else faced similar challenges, or does anyone have optimization tips?

Any assistance would be greatly appreciated.
