Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug]: Potential performance regression in KafkaIO and schema registry #26262

Open
1 of 15 tasks
aromanenko-dev opened this issue Apr 13, 2023 · 1 comment
Open
1 of 15 tasks

Comments

@aromanenko-dev
Copy link
Contributor

aromanenko-dev commented Apr 13, 2023

What needs to happen?

From email thread:

"I am trying to understand the effect of schema registry on our pipeline's performance. In order to do sowe created a very simple pipeline that reads from kafka, runs a simple transformation of adding new field and writes of kafka. the messages are in avro format

I ran this pipeline with 3 different options on same configuration : 1 kafka partition, 1 task manager, 1 slot, 1 parallelism:

  • when i used apicurio as the schema registry i was able to process only 2000 messages per second
  • when i used confluent schema registry i was able to process 7000 messages per second
  • when I did not use any schema registry and used plain avro deserializer/serializer i was able to process 30K messages per second.
KafkaIO.<String, T>read()
        .withBootstrapServers(bootstrapServers)
        .withTopic(topic)
        .withConsumerConfigUpdates(Map.ofEntries(
                Map.entry("schema.registry.url", registryURL),
                Map.entry(ConsumerConfig.GROUP_ID_CONFIG, consumerGroup+ UUID.randomUUID()),
        ))
        .withKeyDeserializer(StringDeserializer.class)
        .withValueDeserializerAndCoder((Class) io.confluent.kafka.serializers.KafkaAvroDeserializer.class, AvroCoder.of(avroClass));

I have made the suggested change and used ConfluentSchemaRegistryDeserializerProvider
the results are slightly better.. average of 8000 msg/sec "

We need to investigate and find out the cause of this performance issue.

Issue Priority

Priority: 2 (default / most normal work should be filed as P2)

Issue Components

  • Component: Python SDK
  • Component: Java SDK
  • Component: Go SDK
  • Component: Typescript SDK
  • Component: IO connector
  • Component: Beam examples
  • Component: Beam playground
  • Component: Beam katas
  • Component: Website
  • Component: Spark Runner
  • Component: Flink Runner
  • Component: Samza Runner
  • Component: Twister2 Runner
  • Component: Hazelcast Jet Runner
  • Component: Google Cloud Dataflow Runner
@aromanenko-dev aromanenko-dev changed the title [Task]: Potential performance regression in KafkaIO and schema registry [Bug]: Potential performance regression in KafkaIO and schema registry Apr 18, 2023
@aromanenko-dev aromanenko-dev added bug and removed task labels Apr 18, 2023
@sigalite
Copy link

any update on this issue?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants