Lambda Snapstart kafka connection errors #2715

Open
hamburml opened this issue Aug 5, 2024 · 11 comments
@hamburml

hamburml commented Aug 5, 2024

Describe the bug

copy from quarkusio/quarkus#42286

maybe here is a better place :)

Hi,

We use SnapStart on our Quarkus Lambdas. Some of them use smallrye-messaging to write messages to or receive messages from Kafka. This works as expected, but unfortunately our logs contain warnings that the connection to a Kafka node was lost, either due to an auth error or a firewall blocking the traffic.

    "loggerClassName": "org.apache.kafka.common.utils.LogContext$LocationAwareKafkaLogger",
    "loggerName": "org.apache.kafka.clients.NetworkClient",
    "level": "WARN",
    "message": "[Producer clientId=kafka-producer-event-xxxx] Connection to node xx (hxxxx.amazonaws.com/xxx:9096) terminated during authentication. This may happen due to any of the following reasons: (1) Authentication failed due to invalid credentials with brokers older than 1.0.0, (2) Firewall blocking Kafka TLS traffic (eg it may only allow HTTPS traffic), (3) Transient network issue.",

AFAIK, at the end of the init phase the whole memory of a started Quarkus Lambda is stored as a snapshot, and when the Lambda is reused the snapshot is loaded back into memory to skip the init phase. That also means that pooled connections are "stored" too, although in reality they are already closed by the time the snapshot is restored.

I thought I simply needed to close all open Kafka connections before the snapshot is created. I did this with an org.crac.Resource and its beforeCheckpoint method. Now the warnings in the log are gone, but it looks like no new connections are initiated, and therefore all messages sent via a channel fail. I also tried KafkaProducer::flush, but that didn't help.

Any ideas?

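import io.quarkus.runtime.StartupEvent;
import io.smallrye.mutiny.Uni;
import io.smallrye.reactive.messaging.kafka.KafkaClientService;
import io.smallrye.reactive.messaging.kafka.KafkaProducer;
import jakarta.enterprise.context.ApplicationScoped;
import jakarta.enterprise.event.Observes;
import jakarta.inject.Inject;
import java.time.Duration;
import lombok.extern.slf4j.Slf4j;
import org.crac.Core;
import org.crac.Resource;

@ApplicationScoped
@Slf4j
public class KafkaHelper implements Resource {

    @Inject
    KafkaClientService kafkaClientService;

    // Register this bean with CRaC so the checkpoint/restore hooks below are called.
    void onStart(@Observes StartupEvent ev) {
        Core.getGlobalContext().register(this);
    }

    // Called before the snapshot is taken.
    @Override
    public void beforeCheckpoint(org.crac.Context<? extends Resource> context)
            throws Exception {
        log.info("kafkaproducer {}", kafkaClientService.getProducerChannels());
        log.info("kafkaconsumer {}", kafkaClientService.getConsumerChannels());

        log.info("going to sleep");
        // Flush every producer known to the connector, one Uni per producer channel.
        var listOfProducer = kafkaClientService.getProducerChannels().stream()
                .map(kafkaClientService::getProducer)
                .map(KafkaProducer::flush) // with KafkaProducer::close log warnings are gone but all future messages fail
                .toList();

        // Wait (up to 10 seconds) until all flushes have completed.
        Uni.combine().all().unis(listOfProducer)
                .combinedWith(unused -> null)
                .await().atMost(Duration.ofSeconds(10));
        log.info("going to sleep 2");
    }

    // Called after the snapshot has been restored.
    @Override
    public void afterRestore(org.crac.Context<? extends Resource> context)
            throws Exception {

        // is there a 'init connection' method?
        log.info("i am back");

    }
}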

I found quarkusio/quarkus#31401 which is the same issue but with database connections.

@cescoffier
Contributor

Our Kafka support does not support snapstart or CRAC. How Kafka works makes it very hard to snapshot it. I would recommend, for safety reasons, to only initialize after the restore.

@hamburml
Author

hamburml commented Aug 19, 2024

Thanks for your reply! There is an initial attempt, apache/kafka#13619, which tries to handle CRaC, but it looks like there is not that much interest.

"I would recommend, for safety reasons, to only initialize after the restore."

Exactly, this is what I want. I do not need a snapshot of a working Kafka client; I need a method to call on the Kafka client so that it reconnects and/or verifies its current connections. This would remove the old connections that are gone (because they only existed when the snapshot was taken) and create new ones. Can you point me to a method which I could call in the afterRestore method?

@cescoffier
Contributor

You cannot use reactive messaging, but you can create a low-level Kafka client in the afterRestore, or create a lazy producer and not use it during the snapshot phase (so basically, initialize it during the first HTTP call)
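A minimal, library-free sketch of the lazy approach described above: the client is built on first use instead of at startup, so nothing is connected when the snapshot is taken. `LazyClient`, `isInitialized` and `reset` are hypothetical names used only for illustration, and the "client" here is a plain string standing in for a real Kafka producer; this is not how SmallRye implements it.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

public class LazyClientDemo {

    /** Holds a client that is created on first use rather than at startup. */
    static final class LazyClient<T> {
        private final Supplier<T> factory;
        private T instance; // guarded by "this"

        LazyClient(Supplier<T> factory) {
            this.factory = factory;
        }

        /** Returns the client, creating it on the first call. */
        synchronized T get() {
            if (instance == null) {
                instance = factory.get();
            }
            return instance;
        }

        synchronized boolean isInitialized() {
            return instance != null;
        }

        /**
         * Forgets the current client so the next get() builds a fresh one.
         * Could be called from an afterRestore hook; a real client would
         * also need to be closed here.
         */
        synchronized void reset() {
            instance = null;
        }
    }

    public static void main(String[] args) {
        AtomicInteger connections = new AtomicInteger();
        LazyClient<String> producer =
                new LazyClient<>(() -> "connection-" + connections.incrementAndGet());

        // Init phase: nothing touches the producer, so no connection exists yet.
        System.out.println("initialized during init: " + producer.isInitialized());

        // First requests after restore: the client is created once and then reused.
        System.out.println("first send uses:  " + producer.get());
        System.out.println("second send uses: " + producer.get());

        // Explicit reset, e.g. after a restore: the next call builds a new client.
        producer.reset();
        System.out.println("after reset uses: " + producer.get());
    }
}
```

With a real low-level client, the supplier would construct the producer from its configuration, and the first request handled after restore would pay the connection cost.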

@hamburml
Author

hamburml commented Aug 19, 2024

Hm yeah, but I still want to use this dependency...

@cescoffier
Contributor

If it's only to produce, you can use the lazy feature (@ogunalp it should delay the initialization of the producer right?)

@ozangunalp
Collaborator

Indeed, I forgot about the lazy-client flag. It should work for producers. And maybe even for consumers combined with pausable-channels, but I need to check.

@hamburml
Author

Thanks, I'll try it with lazy-client and come back to you.

@hamburml
Author

hamburml commented Sep 6, 2024

lazy-client worked! Thanks. Snapshot is created without a kafka connection, so there is no exception anymore.
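For readers looking for the setting itself, a minimal configuration sketch, assuming an outgoing channel named `events` (a hypothetical name; substitute the real channel) on the smallrye-kafka connector:

```properties
# "events" is a placeholder channel name, not taken from this thread
mp.messaging.outgoing.events.connector=smallrye-kafka
# create the Kafka producer on first use instead of at startup
mp.messaging.outgoing.events.lazy-client=true
```

With this set, the producer should only be created when the first message is sent, which is after the snapshot has been restored.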

@hamburml hamburml closed this as completed Sep 6, 2024
@hamburml
Author

hamburml commented Sep 6, 2024

Sorry, closed it. @ozangunalp mentioned a test with pausable-channels.

@hamburml hamburml reopened this Sep 6, 2024
@ozangunalp
Collaborator

@hamburml is there a test repository that you can share?
I suspect that injecting a @Channel would work, because the subscription is lazy too. But consuming in an @Incoming channel would still create the client at startup.

@hamburml
Author

@ozangunalp not right now. I'll try to prepare one this evening. My service only writes into a @Channel and it works now without exceptions :) But I cannot prepare it with working CRaC, because I was not able to get a JDK that creates snapshots running on my end. I let AWS Lambda do this for me.

3 participants