Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

API-1690: KMS Encryption Provider for Etcd Secrets #1682

Open
wants to merge 7 commits into
base: master
Choose a base branch
from

Conversation

swghosh
Copy link
Member

@swghosh swghosh commented Sep 18, 2024

[TP] Support Kube KMS Integration in OCP (User-Provided)

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 18, 2024
Copy link
Contributor

openshift-ci bot commented Sep 18, 2024

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

Signed-off-by: Swarup Ghosh <[email protected]>
@swghosh swghosh changed the title WIP: KMS Encryption Provider for Etcd Secrets API-1690: KMS Encryption Provider for Etcd Secrets Sep 23, 2024
@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Sep 23, 2024
@openshift-ci-robot
Copy link

openshift-ci-robot commented Sep 23, 2024

@swghosh: This pull request references API-1690 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the task to target the "4.18.0" version, but no target version was set.

In response to this:

[TP] Support Kube KMS Integration in OCP (User-Provided)

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@swghosh
Copy link
Member Author

swghosh commented Sep 26, 2024

/cc @dgrisonnet

@openshift-ci openshift-ci bot requested a review from dgrisonnet September 26, 2024 13:46
- "@tkashem"
- "@rvanderp"
approvers:
- ""
Copy link
Contributor

@JoelSpeed JoelSpeed Oct 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tkashem should be the default approver for anything api related as the team lead

You can add either myself or @deads2k for API approvers. But you need one of us. I've already looked at the API so probably me on this occasion.

@tkashem please review this EP and let me know that you're happy conceptually, and then I'll start reviewing from an API perspective

We should only merge the feature gate PR once this has had an initial round of review

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will add more details on this EP so we can have it ready for the initial round of review.

Signed-off-by: Swarup Ghosh <[email protected]>
Copy link
Contributor

openshift-ci bot commented Nov 15, 2024

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign yuqi-zhang for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Signed-off-by: Swarup Ghosh <[email protected]>
Signed-off-by: Damien Grisonnet <[email protected]>
@swghosh
Copy link
Member Author

swghosh commented Nov 21, 2024

/cc @tkashem

@openshift-ci openshift-ci bot requested a review from tkashem November 21, 2024 13:15
@swghosh swghosh marked this pull request as ready for review November 22, 2024 08:42
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Nov 22, 2024
Comment on lines +11 to +13
- "@tkashem"
- "@deads2k"
- "@derekwaynecarr"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Normally you pick one approver, else it becomes difficult to rally everyone around, perhaps some of these should be reviewers instead?

Comment on lines +7 to +9
- "@TrilokGeer"
- "@tkashem"
- "@rvanderp"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What expertise do each of the reviewers bring? What are you asking them to think about, see the template for examples of why we add context about reviewers

One aspect of the upstream feature that wasn't mentioned yet is that it requires a third-party application called a KMS plugin to bridge between the apiservers and the KMS. In OpenShift, these plugins will be configured and managed by the kube-apiserver-operator. There are multiple reasons behind this choice:

1. Reduces the complexity for the users that want to use the KMS feature
2. Simplifies key rotation when users manually rotate the key because it requires creating a second instance of the plugin that would use the new key while the old plugin would still allow using the old key as a read key
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd be interested to see the mechanics of managing two instances fleshed out if they aren't already

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you elaborate on what "manually rotate the key" entails?

I was reading through the KMS provider documentation and came across this: https://kubernetes.io/docs/tasks/administer-cluster/kms-provider/#developing-a-kms-plugin-gRPC-server-notes-kms-v2

It sounds like in KMS v2 the expectation is that the KMS plugin will return a new key_id in the Status procedure call response when the KMS key encryption key (KEK) has rotated. If the key_id changes in the response received from the Status procedure call, the API server is signaled that the KEK has changed and the data encrypted with the old KEK should be marked as stale.

I do see that https://github.com/openshift/aws-encryption-provider?tab=readme-ov-file#rotation mentions this dual instance plugin approach, but to my untrained eye the documentation there seems to be tailored around the older KMS v1 implementation.

Please correct me if I misunderstood the KMS v2 approach, but from what I can tell it does seem like KMS v2 remediates the need to do something like this.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I see now that for something like the aws-encryption-provider the KMS master key is passed as a CLI flag in the Pod, which would explain the need to have one deployed with the old key reference and one deployed with the new.

Apologies for any noise :)

Copy link

@everettraven everettraven Dec 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do wonder if there is some room for improvements that could be proposed upstream to facilitate master key rotation in these plugins without having to run multiple instances to perform the migration. (assuming my understanding of being able to rotate the key_id in the KMS v2 API is true)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd be interested to see the mechanics of managing two instances fleshed out if they aren't already

I have some flowschemas detailing the different mechanisms, I'll add them once they are fully complete.

Copy link
Member

@dgrisonnet dgrisonnet Dec 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So far, I've identified two different ways for user to rotate keys:

  1. The cloud KMS periodically rotates the key. In this case, the cloud resource for the key will remain the same, but the underlying key and its id will change. (https://docs.aws.amazon.com/kms/latest/developerguide/rotate-keys.html). For this particular scenario, we don't need a second instance nor do we need to restart the apiserver because we track the key_id and pass it down in the Decrypt procedure call. But we still want to migrate all the data to use the latest instance of the key to reduce the attack surface area.
  2. The user points the KMS provider to a new key (cloud resource). In this case you need to create a second instance because each instance is bound to a specific key and until you've migrated all the data to use the new key, the original key need remain usable as a read-key.

For 1) there is definitely room for improvement upstream and https://github.com/kubernetes/enhancements/tree/master/keps/sig-api-machinery/4192-svm-in-tree could help with automating the data migration process.

For 2) since we are essentially pointing to two different keys, it is consistent with the other encryption provider (aesgcm, aescbc, ... ) to have two different instances of KMS plugins. It is that way by design

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is useful, lets make sure this context gets into the doc so it doesn't end up lost to GitHub comments eternally

* https://github.com/openshift/aws-encryption-provider/
* https://github.com/openshift/azure-kubernetes-kms/

For the plugins we can't distribute, an `image` field will be available in the relevant KMS API to allow users to configure the plugin.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you provide an example of one that we cannot distribute/expand on why we can't distribute, and how we will support users using these? Where does out support boundary start/end?


During the tech-preview support of the KMS feature, it will be placed behind the `KMSEncryptionProvider` feature-gate.

OpenShift would need to align closer with KMS evolution upstream with respect to the different Kubernetes Encryption Providers available today.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't really follow what you're meaning by this, can you expand?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is referring to the various providers that are supported upstream but not downstream. Namely coming closer with this list: https://github.com/kubernetes/apiserver/blob/cccad306d649184bf2a0e319ba830c53f65c445c/pkg/apis/apiserver/types_encryption.go#L89-L101. I'll add some more info about that.

enhancements/kube-apiserver/kms-encryption-provider.md Outdated Show resolved Hide resolved

## Drawbacks

1. Core apiserver operators need host access to mount and manage permissions for the directory where the kms plugin runs.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this a net new permission that they didn't need before?

## Drawbacks

1. Core apiserver operators need host access to mount and manage permissions for the directory where the kms plugin runs.
2. Can cause problems during disaster recovery and backups in case KMS keys becomes unavailable after
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How can you mitigate this risk?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is still WIP, I'll elaborate on the risks from a high-level here, but https://issues.redhat.com/browse/API-1825 is a GA requirement which requires us to write a doc going in full depth about the risks and the mitigations that we have for them.
We will list all the failures scenarios there and explain in detail the actions that users need to take to detect and fix the issues. We will turn this document into KCS later on.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If that's the place to track more, might be worth linking that issue in the GA requirements part of the doc, and saying here that you can read more in jira


### Non-Goals

1. Allow the user to force key rotation

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are there any caveats to this non-goal?
What happens if a user does an on-demand key rotation in their KMS instance?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This shouldn't be here. We copied it from aescbc/aesgcm goals where users can't force rotation in a supported way.
For KMS, users will be able to force the rotation either in their KMS instance or manually by pointing the encryption configuration to a new key.

3. Support for hardware security modules
4. The user has an in-depth understanding of each phase of the encryption process
5. Completely recover the cluster in the event of the KMS instance itself going down or keys getting lost
6. Allow users to configure which resources will be encrypted

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this something planned for the future?

I think this is a reasonable non-goal for a first pass, but taking a look at the EncryptionConfiguration resource type it appears to have support for this kind of configuration. I would imagine there is use case(s) for customers to have more control over the set of resources that are encrypted.

Copy link
Member

@dgrisonnet dgrisonnet Dec 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No it is not right now and I don't remember seeing a RFE about that since we've supported encryption.
For OpenShift we are choosing which resources are encrypted and users just choose if they want to turn on encryption or not.
In case you've not seen the doc yet, https://docs.openshift.com/container-platform/4.17/security/encrypting-etcd.html shows the procedure to enable encryption and from a user perspective it is super simple and very high-level.

Copy link
Contributor

openshift-ci bot commented Dec 3, 2024

@swghosh: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/markdownlint 8bacc6c link true /test markdownlint

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants