
feat: Added code for Internal-tls #36865

Merged
merged 6 commits into from
Nov 19, 2024

Conversation

nish112022
Contributor

@nish112022 nish112022 commented Oct 14, 2024

issue : #36864

I have a few questions regarding my approach. I will consolidate them here for feedback and review. Thanks!

@sre-ci-robot sre-ci-robot added do-not-merge/work-in-progress Don't merge even CI passed. size/XXL Denotes a PR that changes 1000+ lines. labels Oct 14, 2024
@mergify mergify bot added dco-passed DCO check passed. kind/feature Issues related to feature request from users labels Oct 14, 2024
@xiaofan-luan
Collaborator

/assign @xiaofan-luan

@nish112022 please check the reason for the UT failure.

@xiaofan-luan
Collaborator

I will help with the review of this PR.

@nish112022
Contributor Author

@xiaofan-luan I have a few questions before I proceed with my changes.

  1. I've created a cert1 folder with custom certificates for testing in cluster mode and am using separate files, milvus_internaltls.yaml and docker-compose.yaml, to run the containers manually. Is there anything specific I need to consider or configure in this setup? Can I provide these in deployments/docker?
  2. I've added the internal TLS configs under the grpcparams struct, as it contains all gRPC-related configurations. Should I keep it there, or would it be better to move these configs to a more general structure like commonCfg, or create a new struct (e.g., grpcCommonCfg) specifically for common gRPC configurations?
  3. For each type of node (datacoord, datanode, etc.), should the TLS configuration be uniform across all nodes, or should each node type have its own specific TLS settings?

paramtable.Init()
validPath := "../../../../configs/cert1/ca.pem"
ctx := context.Background()
client, err := NewClient(ctx, "test", 1)
Contributor Author

Also, here I think I have written the test completely wrong.

I should test the NewClient() with TLS configs before replacing it with the mock client.

paramtable.Get().Save(Params.TLSMode.Key, "1")

This will successfully create the NewClient().

How should I test if the NewClient is behaving in the same manner it is supposed to work?
One option I can think of is writing a mock datanode server and trying to connect to that. Are there any better suggestions in your mind, @xiaofan-luan?

Collaborator

one thing you can do is to add an integration test in tests/integration.

You can create a MiniCluster and try to connect between

Contributor Author

@nish112022 nish112022 Nov 6, 2024

I went through the code in tests/integration. I see a mini cluster already present.

I'll create a separate test suite for internal TLS, including the necessary certificates. The suite will cover basic operations like creating collections, inserting data, and running queries to ensure secure communication between nodes. This way we can validate TLS functionality without affecting other test cases.

@nish112022
Contributor Author

@xiaofan-luan Can you please take a look here?

@@ -0,0 +1,175 @@
version: '3.5'
Contributor

The configs folder seems like the wrong place for this docker compose file. There is a folder deployments/docker/cluster-distributed-deployment which seems to contain Ansible code for deploying Milvus in a distributed fashion, but I'm not sure if that's actively used.

I'd suggest removing this compose file from the PR for now. If we find a good place for it you can always raise another PR.

Similarly, I would also remove the new certs added to configs/cert1 for now, since they're not currently used anywhere else.

Contributor Author

@nish112022 nish112022 Oct 28, 2024

Yes, I will be removing these certs, the docker-compose file, and the YAML for now. These were all part of my test setup.

case "querynode":
serverCfg = &Params.QueryNodeGrpcServerCfg
case "rootcoord":
serverCfg = &Params.RootCoordGrpcServerCfg
Contributor

The code here is a bit awkward. The startGrpcLoop function for each of the components calls this function by passing in a string indicating the component name, and then this function uses a switch to decide which component param to use.

However, in milvus.yaml, you have one common section for internalTLS configs. To match with this I suggest that you do one of the following:

  • Remove the common internal TLS parameters from grpcConfig and move them to commonCfg, or
  • Create a new config struct (say, internalTLSConfig) and add it to ComponentParam

One possible drawback of this approach is that you won't be able to specify different certificate paths for the different nodes, but this shouldn't be a problem. If the nodes are deployed in separate containers / machines, the same path can have different certificates, and if the nodes are on the same machine (for example in standalone) a single certificate should be enough.

Contributor Author

I can create a struct and do what you suggest, but the TLS configs are also part of the gRPC configs, which is why I added the internal TLS configs there. Let me add a new struct to make this better.

Collaborator

Thanks for the comment; I was just going to point out the same thing.

return p.IP + ":" + p.InternalPort.GetValue()
}
fmt.Println("address: ", p.Address.GetValue())
return p.Address.GetValue() + ":" + p.InternalPort.GetValue()
Contributor

Do we need separate configurations for p.IP and p.Address? Can we reuse the existing configuration p.IP, with the understanding that you can use a hostname instead of the IP address if desired?

Collaborator

That's what I'm wondering.
What if we simply use the IP instead?

Contributor Author

It works; I have modified the code.
It will pick up the IP configuration from milvus.yaml.

if !p.InternalTLSEnabled.GetAsBool() {
return p.IP + ":" + p.Port.GetValue()
}
fmt.Println("address: ", p.Address.GetValue())
Collaborator

remove the fmt.Println

Contributor Author

Removed this config itself.



if !certPool.AppendCertsFromPEM(b) {
log.Error("credentials: failed to append certificates")
return nil, err // Cert pool is invalid, return nil and the error
Collaborator

the err will be nil in this case

Collaborator

might need to create an error here

Contributor Author

Created a new error here.

log.Error("Unknown NodeType")
return grpc.Creds(nil)
}
certFile := serverCfg.InternalTLSServerPemPath.GetValue()
Collaborator

Can we use one PemPath for every node?
Is it necessary to create a path for each node?

Contributor Author

Modified this. I have moved the internal TLS config out of the gRPC config into its own InternalTLSConfig struct, which removed the need for the switch statement here.



@xiaofan-luan
Collaborator

Overall I like this PR; there are a few other things to point out.

  1. Milvus may have multiple querynodes/datanodes, so I guess using one internalAdd for each role won't work.
  2. We see a docker compose file. To enable distributed deployment, the K8s operator and Helm chart also need to be fixed. @LoveEachDay could help with it.

Let me know if anything needs to be discussed. Feel free to reach out to me at [email protected], and I'd be happy to discuss it.

@xiaofan-luan
Collaborator

Right now there is an e2e issue on master that needs to be fixed. Please wait and rebase.

@nish112022 nish112022 force-pushed the internal_tls branch 2 times, most recently from 4507d2f to c4742ff Compare November 4, 2024 20:10
@sre-ci-robot sre-ci-robot added size/XL Denotes a PR that changes 500-999 lines. and removed size/XXL Denotes a PR that changes 1000+ lines. labels Nov 4, 2024
@nish112022
Contributor Author

nish112022 commented Nov 6, 2024

Right now there is an e2e issue on master that needs to be fixed. Please wait and rebase.

@xiaofan-luan
Can you please tell me which one I should specifically wait for? I can see #37465 opened recently; the rest all seem to be fixed.

@nish112022
Contributor Author

nish112022 commented Nov 6, 2024

Overall I like this PR; there are a few other things to point out.

  1. Milvus may have multiple querynodes/datanodes, so I guess using one internalAdd for each role won't work.

@xiaofan-luan
Now that you mention it, I have removed internalAdd and am using the IP field already present in milvus.yaml instead. The issue will still be there if the user specifies an IP for a particular node and then wants to scale it. I wanted to ask: how is this currently handled?

  2. We see a docker compose file. To enable distributed deployment, the K8s operator and Helm chart also need to be fixed. @LoveEachDay could help with it.

Sure, we can take this up as a separate effort and work on it.

@sre-ci-robot sre-ci-robot added area/test sig/testing test/integration integration test and removed size/XL Denotes a PR that changes 500-999 lines. labels Nov 14, 2024
Contributor

mergify bot commented Nov 19, 2024

@nish112022 Thanks for your contribution. Please submit with DCO, see the contributing guide https://github.com/milvus-io/milvus/blob/master/CONTRIBUTING.md#developer-certificate-of-origin-dco.

@mergify mergify bot added needs-dco DCO is missing in this pull request. and removed dco-passed DCO check passed. labels Nov 19, 2024
@mergify mergify bot added dco-passed DCO check passed. and removed needs-dco DCO is missing in this pull request. labels Nov 19, 2024
@nish112022 nish112022 marked this pull request as ready for review November 19, 2024 09:39
@sre-ci-robot sre-ci-robot removed the do-not-merge/work-in-progress Don't merge even CI passed. label Nov 19, 2024

codecov bot commented Nov 19, 2024

Codecov Report

Attention: Patch coverage is 72.18543% with 42 lines in your changes missing coverage. Please review.

Project coverage is 81.03%. Comparing base (3bbbf3c) to head (2a62871).
Report is 3 commits behind head on master.

Files with missing lines Patch % Lines
internal/distributed/utils/util.go 67.56% 9 Missing and 3 partials ⚠️
internal/distributed/datacoord/client/client.go 50.00% 3 Missing and 1 partial ⚠️
internal/distributed/datanode/client/client.go 50.00% 3 Missing and 1 partial ⚠️
internal/distributed/indexnode/client/client.go 50.00% 3 Missing and 1 partial ⚠️
internal/distributed/proxy/client/client.go 50.00% 3 Missing and 1 partial ⚠️
internal/distributed/querycoord/client/client.go 50.00% 3 Missing and 1 partial ⚠️
internal/distributed/querynode/client/client.go 50.00% 3 Missing and 1 partial ⚠️
internal/distributed/rootcoord/client/client.go 50.00% 3 Missing and 1 partial ⚠️
internal/util/mock/grpcclient.go 0.00% 2 Missing ⚠️
Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #36865      +/-   ##
==========================================
- Coverage   82.88%   81.03%   -1.85%     
==========================================
  Files        1067     1356     +289     
  Lines      164522   190146   +25624     
==========================================
+ Hits       136362   154084   +17722     
- Misses      22698    30592    +7894     
- Partials     5462     5470       +8     
Components Coverage Δ
Client 72.16% <ø> (ø)
Core 68.87% <ø> (∅)
Go 83.20% <72.18%> (+0.03%) ⬆️
Files with missing lines Coverage Δ
internal/distributed/datacoord/service.go 90.44% <100.00%> (+1.18%) ⬆️
internal/distributed/datanode/service.go 82.57% <100.00%> (+0.14%) ⬆️
internal/distributed/indexnode/service.go 75.53% <100.00%> (+0.26%) ⬆️
internal/distributed/proxy/service.go 85.36% <100.00%> (+0.04%) ⬆️
internal/distributed/querycoord/service.go 77.20% <100.00%> (+0.16%) ⬆️
internal/distributed/querynode/service.go 82.48% <100.00%> (+0.16%) ⬆️
internal/distributed/rootcoord/service.go 79.80% <100.00%> (+0.13%) ⬆️
internal/util/grpcclient/client.go 98.18% <100.00%> (+11.27%) ⬆️
pkg/util/paramtable/base_table.go 79.87% <ø> (ø)
pkg/util/paramtable/component_param.go 98.35% <100.00%> (+<0.01%) ⬆️
... and 10 more

... and 321 files with indirect coverage changes


@nish112022
Contributor Author

Hi @xiaofan-luan, I have marked this as ready for review. Please take a look at your convenience.

Contributor

mergify bot commented Nov 19, 2024

@nish112022 E2e jenkins job failed, comment /run-cpu-e2e can trigger the job again.

@nish112022
Contributor Author

/run-cpu-e2e

@mergify mergify bot added the ci-passed label Nov 19, 2024
@xiaofan-luan
Collaborator

/lgtm
/approve

@sre-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: nish112022, xiaofan-luan

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@xiaofan-luan
Collaborator

@nish112022
this all looks good to me, thanks for the amazing contribution.

Next step: we need to document how to enable this. @haorenfsa could you work with nish to support TLS in both the K8s operator and the Helm chart?

@sre-ci-robot sre-ci-robot merged commit 484c6b5 into milvus-io:master Nov 19, 2024
19 of 20 checks passed
@nish112022
Contributor Author

I will start looking into the K8s operator and Helm chart; I am not very familiar with those.
Then I can work with @haorenfsa on this.

@xiaofan-luan
Collaborator

Thanks for the contribution.
Good job, and I'm sure haorenfsa can help. Maybe you can set up a meeting when possible.

JsDove pushed a commit to JsDove/milvus that referenced this pull request Nov 26, 2024
issue : milvus-io#36864

I have a few questions regarding my approach. I will consolidate them here for feedback and review. Thanks!

---------

Signed-off-by: Nischay Yadav <[email protected]>
Signed-off-by: Nischay <[email protected]>
Labels
approved area/test ci-passed dco-passed DCO check passed. kind/feature Issues related to feature request from users lgtm sig/testing size/XL Denotes a PR that changes 500-999 lines. test/integration integration test