
panic if tls: bad certificate #45

Open
estahn opened this issue Jan 4, 2021 · 9 comments

@estahn
Owner

estahn commented Jan 4, 2021

On the first run, it sometimes shows up with this. This should cause a panic and a restart of the process so it can pick up the certificate. I assume a race condition with the Helm hook. Need to investigate.

2021/01/04 03:06:21 http: TLS handshake error from 100.96.2.0:30084: remote error: tls: bad certificate
2021/01/04 03:06:25 http: TLS handshake error from 100.96.2.0:9203: remote error: tls: bad certificate
2021/01/04 03:06:27 http: TLS handshake error from 100.96.2.0:13730: remote error: tls: bad certificate
2021/01/04 03:06:29 http: TLS handshake error from 100.96.2.0:9523: remote error: tls: bad certificate
2021/01/04 03:06:29 http: TLS handshake error from 100.96.2.0:32653: remote error: tls: bad certificate
2021/01/04 03:06:30 http: TLS handshake error from 100.96.2.0:33301: remote error: tls: bad certificate
2021/01/04 03:06:30 http: TLS handshake error from 100.96.2.0:6242: remote error: tls: bad certificate
@estahn estahn added the bug Something isn't working label Jan 4, 2021
@estahn estahn self-assigned this Jan 4, 2021
@adamstrawson

To add to this @estahn, you're along the right lines with "I assume a race condition with the helm hook". We were seeing errors on the patch job saying the serviceaccount didn't exist, because it was being removed as part of the hook before the job had been created, so the job failed to run.

I've not had a chance to investigate further, but thought I'd mention this.

@estahn
Owner Author

estahn commented Aug 16, 2021

@adamstrawson Have you seen this issue come up after applying this fix (estahn/charts@25cb0ca)?

@project0
Contributor

We faced the same issue yesterday; it turned out the certificate had the wrong DNS names (subject alternative names), which did not match the webhook URL (we use cert-manager, by the way).
I would rather have a better error message to help people identify the real problem faster and more easily.
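A quick way to check which certificate the webhook is actually serving (SANs and expiry) is to dial it and inspect the leaf certificate. A minimal Go sketch, with a placeholder address for the webhook service:

```go
package main

import (
	"crypto/tls"
	"fmt"
	"log"
)

func main() {
	// Placeholder address; replace with the service your
	// MutatingWebhookConfiguration points at.
	addr := "k8s-image-swapper.default.svc:443"

	// InsecureSkipVerify is fine here: we only want to *inspect* the
	// certificate the server presents, not trust it.
	conn, err := tls.Dial("tcp", addr, &tls.Config{InsecureSkipVerify: true})
	if err != nil {
		log.Fatalf("dial %s: %v", addr, err)
	}
	defer conn.Close()

	leaf := conn.ConnectionState().PeerCertificates[0]
	fmt.Println("Subject:   ", leaf.Subject)
	fmt.Println("DNS names: ", leaf.DNSNames) // must include the webhook's service DNS name
	fmt.Println("Not before:", leaf.NotBefore)
	fmt.Println("Not after: ", leaf.NotAfter) // already expired? see the rotation scenario below
}
```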

@Jasper-Ben

Jasper-Ben commented Sep 8, 2023

Seeing the same issue:

k8s-image-swapper was running on AWS spot instances, which caused it to be rescheduled onto a new node during a spot instance shutdown.
It then got stuck in the "bad certificate" loop, and a vital GitLab CI job failed because an image was no longer available upstream.

Bummer, as we use k8s-image-swapper to protect us from this exact scenario, but at the moment we cannot be certain that it is running correctly. Since the pod is still running, our monitoring will not report any issues either.

@Jasper-Ben

On further investigation, I believe in our case this is actually an issue with reloading a renewed cert-manager certificate (a minimal illustration of step 3 follows below):

  1. The pod gets its certificate via a secret from cert-manager, which is valid until date x.
  2. At date x-y: cert-manager replaces the soon-to-expire certificate and updates the secret.
  3. k8s-image-swapper does NOT reload the updated certificate from file.
  4. At date x: the old certificate expires -> k8s-image-swapper runs into this issue.
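For context, the common one-shot TLS setup looks like the sketch below (purely illustrative, not a claim about k8s-image-swapper's actual code, and the paths are placeholders): the key pair is parsed once at startup, so later updates to the mounted secret files are never seen by the running server.

```go
package main

import (
	"crypto/tls"
	"log"
	"net/http"
)

func main() {
	// The key pair is parsed exactly once here, so a rotated file on disk
	// changes nothing for the lifetime of the process.
	cert, err := tls.LoadX509KeyPair("/certs/tls.crt", "/certs/tls.key")
	if err != nil {
		log.Fatal(err)
	}
	srv := &http.Server{
		Addr:      ":8443",
		TLSConfig: &tls.Config{Certificates: []tls.Certificate{cert}},
	}
	// The same applies to ListenAndServeTLS(certFile, keyFile): the files
	// are read only once, when the server starts.
	log.Fatal(srv.ListenAndServeTLS("", ""))
}
```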

@estahn
Owner Author

estahn commented Sep 8, 2023

@Jasper-Ben Thanks for investigating this. We could possibly use https://github.com/dyson/certman to circumvent this issue. If you have time to contribute that would be amazing, otherwise, I will see if I can squeeze this in ASAP.
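Wiring certman in would look roughly like the sketch below; the file paths are placeholders and the calls follow the certman README, so treat it as a starting point rather than a finished patch.

```go
package main

import (
	"crypto/tls"
	"log"
	"net/http"

	"github.com/dyson/certman"
)

func main() {
	// certman watches the cert/key files and swaps the parsed key pair
	// when they change; the paths here are placeholders.
	cm, err := certman.New("/certs/tls.crt", "/certs/tls.key")
	if err != nil {
		log.Fatal(err)
	}
	if err := cm.Watch(); err != nil {
		log.Fatal(err)
	}

	srv := &http.Server{
		Addr: ":8443",
		TLSConfig: &tls.Config{
			// GetCertificate is consulted on every handshake, so a renewed
			// certificate is picked up without restarting the process.
			GetCertificate: cm.GetCertificate,
		},
	}
	// Cert/key args are empty because the certificate comes from TLSConfig.
	log.Fatal(srv.ListenAndServeTLS("", ""))
}
```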

@Jasper-Ben

👋 @estahn,

We could possibly use https://github.com/dyson/certman to circumvent this issue.

I had a look at certman. There is an open issue that seems relevant to this use case: dyson/certman#2
There is also an open PR to address it, but it hasn't been merged since the end of 2021: dyson/certman#1

Might be easier to go the "Kubernetes way" of just panicking, thus triggering a pod recreation?

Or to put it this way: IMO the "panic on TLS error" (as this ticket describes) should happen in any case, to catch any odd misbehavior. If we, in addition, want to be fancy about certificate rotation then we could look into some reload logic.
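One possible way to get the "panic on TLS error" behaviour: net/http reports handshake failures through http.Server.ErrorLog, so a custom log writer could count "tls: bad certificate" messages and exit once a threshold is reached, letting Kubernetes recreate the pod. A rough sketch, where the threshold and the matched string are assumptions, not anything k8s-image-swapper currently does:

```go
package main

import (
	"bytes"
	"crypto/tls"
	"log"
	"net/http"
	"os"
	"sync/atomic"
)

// tlsErrorExiter counts "tls: bad certificate" handshake errors and terminates
// the process once maxErrors is reached, so Kubernetes recreates the pod.
type tlsErrorExiter struct {
	count     int64
	maxErrors int64
}

func (w *tlsErrorExiter) Write(p []byte) (int, error) {
	os.Stderr.Write(p) // keep the original log line visible
	if bytes.Contains(p, []byte("tls: bad certificate")) {
		if atomic.AddInt64(&w.count, 1) >= w.maxErrors {
			log.Println("too many TLS handshake failures, exiting so the pod restarts")
			os.Exit(1)
		}
	}
	return len(p), nil
}

func main() {
	cert, err := tls.LoadX509KeyPair("/certs/tls.crt", "/certs/tls.key") // placeholder paths
	if err != nil {
		log.Fatal(err)
	}
	srv := &http.Server{
		Addr:      ":8443",
		TLSConfig: &tls.Config{Certificates: []tls.Certificate{cert}},
		// net/http logs TLS handshake errors through ErrorLog; route them
		// through the exiter above. The threshold of 10 is an arbitrary example.
		ErrorLog: log.New(&tlsErrorExiter{maxErrors: 10}, "", log.LstdFlags),
	}
	log.Fatal(srv.ListenAndServeTLS("", ""))
}
```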

If you have time to contribute that would be amazing, otherwise, I will see if I can squeeze this in ASAP.

I might be able to take a look at it, but can't promise anything right now (busy schedule, you know how it is). If I manage, I'll let you know; otherwise feel free to pick it up if you find the time 🙂

@estahn
Owner Author

estahn commented Sep 8, 2023

@Jasper-Ben Fair enough.

This is related and can probably be used as guidance:
golang/go#38877 (comment)
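The approach from that thread boils down to a tls.Config.GetCertificate callback that re-reads the key pair from disk, so renewed certificates are picked up on the next handshake. A minimal sketch with placeholder paths (naively reloading on every handshake; a real implementation would cache and only reload when the files change):

```go
package main

import (
	"crypto/tls"
	"log"
	"net/http"
)

func main() {
	certFile, keyFile := "/certs/tls.crt", "/certs/tls.key" // placeholder paths

	srv := &http.Server{
		Addr: ":8443",
		TLSConfig: &tls.Config{
			// GetCertificate runs on every handshake, so a certificate
			// rotated by cert-manager is served without a process restart.
			GetCertificate: func(_ *tls.ClientHelloInfo) (*tls.Certificate, error) {
				cert, err := tls.LoadX509KeyPair(certFile, keyFile)
				if err != nil {
					return nil, err
				}
				return &cert, nil
			},
		},
	}
	log.Fatal(srv.ListenAndServeTLS("", ""))
}
```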

@martin31821

FWIW, it might be easier for you to use https://github.com/stakater/Reloader, which can trigger a pod restart when the secret behind the cert changes.
