
Dask cluster creation issue with TLS #857

Closed · ame307 opened this issue Jan 31, 2024 · 1 comment

ame307 commented Jan 31, 2024

Describe the issue

We are trying to create a Dask cluster secured by TLS with KubeCluster (both classic and operator) in our Kubernetes cluster, with limited success. A self-signed certificate and key are generated and stored in a Secret as .pem files.
The Secret is mounted into the client, scheduler, and worker pods.
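
For reference, the Secret is created roughly like this (a sketch using the kubernetes Python client; the namespace and the local file paths are assumptions):

import base64
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() when run inside the cluster

# Read the previously generated self-signed CA/certificate and key
with open("myca.pem", "rb") as f:
    ca_pem = f.read()
with open("mykey.pem", "rb") as f:
    key_pem = f.read()

# Secret data must be base64-encoded strings
secret = client.V1Secret(
    metadata=client.V1ObjectMeta(name="my-tls-secret"),
    type="Opaque",
    data={
        "myca.pem": base64.b64encode(ca_pem).decode(),
        "mykey.pem": base64.b64encode(key_pem).decode(),
    },
)
client.CoreV1Api().create_namespaced_secret(namespace="default", body=secret)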

Classic scenario

The following Python commands are executed in a pod based on ghcr.io/dask/dask:latest-py3.8, where the Secret is mounted under /certs and the TLS-related environment variables are also set.

from dask_kubernetes.classic import KubeCluster
from distributed.security import Security

sec = Security(
    tls_ca_file='/certs/myca.pem',
    tls_client_cert='/certs/myca.pem',
    tls_client_key='/certs/mykey.pem',
    require_encryption=True
)

cluster = KubeCluster(
    '/worker-spec.yaml',
    security=sec,
    protocol="tls"
)

where the worker-spec.yaml is:

kind: Pod
metadata:
  labels:
    foo: bar
spec:
  restartPolicy: Never
  containers:
  - image: ghcr.io/dask/dask:latest-py3.8
    imagePullPolicy: IfNotPresent
    args: [dask-worker, --nthreads, '1', --no-dashboard, --memory-limit, 1GB, --death-timeout, '60']
    name: dask-worker
    env:
    - name: DASK_DISTRIBUTED__COMM__DEFAULT_SCHEME
      value: "tls"
    - name: DASK_DISTRIBUTED__COMM__REQUIRE_ENCRYPTION
      value: "true"
    - name: DASK_DISTRIBUTED__COMM__TLS__CA_FILE
      value: "/certs/myca.pem"
    - name: DASK_DISTRIBUTED__COMM__TLS__SCHEDULER__KEY
      value: "/certs/mykey.pem"
    - name: DASK_DISTRIBUTED__COMM__TLS__SCHEDULER__CERT
      value: "/certs/myca.pem"
    - name: DASK_DISTRIBUTED__COMM__TLS__WORKER__KEY
      value: "/certs/mykey.pem"
    - name: DASK_DISTRIBUTED__COMM__TLS__WORKER__CERT
      value: "/certs/myca.pem"
    - name: DASK_DISTRIBUTED__COMM__TLS__CLIENT__KEY
      value: "/certs/mykey.pem"
    - name: DASK_DISTRIBUTED__COMM__TLS__CLIENT__CERT
      value: "/certs/myca.pem"
    volumeMounts:
    - mountPath: /certs/
      name: certificates
      readOnly: true
    resources:
      limits:
        cpu: "1"
        memory: 1G
      requests:
        cpu: "1"
        memory: 1G
  volumes:
  - name: certificates
    secret:
      defaultMode: 420
      secretName: my-tls-secret

Current result

The dask_kubernetes.classic scheduler is created and appears to listen with TLS on port 8786, but dask_kubernetes.classic.KubeCluster throws the following exception:

RuntimeError: encryption required by Dask configuration, refusing communication from/to 'tcp://dask-root-36eed16d-9.cswopt-proto:8786' 

This seems to be caused by a mismatch between the connection_args and the address (the address uses tcp in the exception). The scheduler can be reached with dask.distributed.Client when it is given the proper tls:// address.
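
For example, connecting directly works along these lines (reusing the Security object from above and the service name from the exception, but with the tls:// scheme):

from dask.distributed import Client

# Same service name as in the exception above, but with the tls:// scheme
client = Client("tls://dask-root-36eed16d-9.cswopt-proto:8786", security=sec)
print(client.scheduler_info())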

NOTE: However, if the KubeCluster is started in local deploy mode, everything works fine.

cluster = KubeCluster(
    '/worker-spec.yaml',
    security=sec,
    protocol="tls",
    deploy_mode='local'
)

No exception is raised, the cluster can be scaled, etc.
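
For instance, something like the following runs without errors against the local-deploy cluster (a sketch building on the objects above):

from dask.distributed import Client

cluster.scale(2)                        # workers start and register over TLS
client = Client(cluster, security=sec)  # security passed explicitly for clarity
print(client.submit(lambda x: x + 1, 1).result())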

Operator scenario

When it comes to the operator, the following Python script is executed:

from dask_kubernetes.operator import KubeCluster

cluster = KubeCluster(custom_cluster_spec='/daskcluster-spec.yaml')

where the daskcluster-spec.yaml is:

apiVersion: kubernetes.dask.org/v1
kind: DaskCluster
metadata:
  name: example
  labels:
    foo: bar
spec:
  worker:
    replicas: 2
    spec:
      restartPolicy: Always
      containers:
      - name: worker
        image: "ghcr.io/dask/dask:latest-py3.8"
        imagePullPolicy: "IfNotPresent"
        args: [dask-worker, --nthreads, '1', --no-dashboard, --memory-limit, 1GB, --death-timeout, '60']
        env:
        - name: DASK_DISTRIBUTED__COMM__DEFAULT_SCHEME
          value: "tls"
        - name: DASK_DISTRIBUTED__COMM__REQUIRE_ENCRYPTION
          value: "true"
        - name: DASK_DISTRIBUTED__COMM__TLS__CA_FILE
          value: "/certs/myca.pem"
        - name: DASK_DISTRIBUTED__COMM__TLS__SCHEDULER__KEY
          value: "/certs/mykey.pem"
        - name: DASK_DISTRIBUTED__COMM__TLS__SCHEDULER__CERT
          value: "/certs/myca.pem"
        - name: DASK_DISTRIBUTED__COMM__TLS__WORKER__KEY
          value: "/certs/mykey.pem"
        - name: DASK_DISTRIBUTED__COMM__TLS__WORKER__CERT
          value: "/certs/myca.pem"
        - name: DASK_DISTRIBUTED__COMM__TLS__CLIENT__KEY
          value: "/certs/mykey.pem"
        - name: DASK_DISTRIBUTED__COMM__TLS__CLIENT__CERT
          value: "/certs/myca.pem"
        volumeMounts:
        - mountPath: /certs/
          name: certificates
          readOnly: true
        resources:
          limits:
            cpu: "1"
            memory: 1G
          requests:
            cpu: "1"
            memory: 1G
      volumes:
      - name: certificates
        secret:
          defaultMode: 420
          secretName: my-tls-secret
  scheduler:
    spec:
      containers:
      - name: scheduler
        image: "ghcr.io/dask/dask:latest-py3.8"
        imagePullPolicy: "IfNotPresent"
        args:
          - dask-scheduler
        env:
        - name: DASK_DISTRIBUTED__COMM__DEFAULT_SCHEME
          value: "tls"
        - name: DASK_DISTRIBUTED__COMM__REQUIRE_ENCRYPTION
          value: "true"
        - name: DASK_DISTRIBUTED__COMM__TLS__CA_FILE
          value: "/certs/myca.pem"
        - name: DASK_DISTRIBUTED__COMM__TLS__SCHEDULER__KEY
          value: "/certs/mykey.pem"
        - name: DASK_DISTRIBUTED__COMM__TLS__SCHEDULER__CERT
          value: "/certs/myca.pem"
        - name: DASK_DISTRIBUTED__COMM__TLS__WORKER__KEY
          value: "/certs/mykey.pem"
        - name: DASK_DISTRIBUTED__COMM__TLS__WORKER__CERT
          value: "/certs/myca.pem"
        - name: DASK_DISTRIBUTED__COMM__TLS__CLIENT__KEY
          value: "/certs/mykey.pem"
        - name: DASK_DISTRIBUTED__COMM__TLS__CLIENT__CERT
          value: "/certs/myca.pem"
        volumeMounts:
        - mountPath: /certs/
          name: certificates
          readOnly: true
        ports:
          - name: tcp-comm
            containerPort: 8786
            protocol: TCP
          - name: http-dashboard
            containerPort: 8787
            protocol: TCP
        readinessProbe:
          httpGet:
            port: http-dashboard
            path: /health
          initialDelaySeconds: 5
          periodSeconds: 10
        livenessProbe:
          httpGet:
            port: http-dashboard
            path: /health
          initialDelaySeconds: 15
          periodSeconds: 20
      volumes:
      - name: certificates
        secret:
          defaultMode: 420
          secretName: my-tls-secret
    service:
      type: ClusterIP
      selector:
        dask.org/cluster-name: example
        dask.org/component: scheduler
      ports:
      - name: tcp-comm
        protocol: TCP
        port: 8786
        targetPort: "tcp-comm"
      - name: http-dashboard
        protocol: TCP
        port: 8787
        targetPort: "http-dashboard"

Current result

The scheduler is created and listening with TLS on port 8786. Workers are created too, but they cannot connect to the scheduler:

RuntimeError: encryption required by Dask configuration, refusing communication from/to 'tcp://10.240.116.72:0

Additionally, the scheduler reports periodic TLS handshake issues with the client.

Listener on 'tls://0.0.0.0:8786': TLS handshake failed with remote 'tls://10.240.80.34:55512': TLS/SSL connection has been closed (EOF)
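
For what it's worth, the raw handshake can be exercised from a pod that has the same certificates mounted, e.g. with Python's ssl module (the scheduler Service hostname below is a placeholder):

import socket
import ssl

# Build a context from the same CA, certificate, and key the scheduler uses
ctx = ssl.create_default_context(ssl.Purpose.SERVER_AUTH, cafile="/certs/myca.pem")
ctx.check_hostname = False  # the self-signed certificate does not carry the service name
ctx.load_cert_chain(certfile="/certs/myca.pem", keyfile="/certs/mykey.pem")

# "example-scheduler" is a placeholder for the scheduler Service DNS name
with socket.create_connection(("example-scheduler", 8786), timeout=10) as sock:
    with ctx.wrap_socket(sock) as tls_sock:
        print("TLS handshake OK:", tls_sock.version())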

Expected result

No exceptions are thrown, and workers as well as clients communicate properly over TLS.
Any suggestions to get this working? I am aware that the classic KubeCluster is no longer supported and will be phased out, so no fix is expected there, but what about the operator-based KubeCluster?

Anything else we need to know?

The same happens with the latest dask and dask-kubernetes (2024.1.0).

Environment

  • Dask versions: 2023.5.0, 2024.1.0
  • Dask-kubernetes versions: 2023.3.2, 2024.1.0
  • Python versions: 3.8, 3.10
  • Operating System: Linux
  • Install method (conda, pip, source): pip
jacobtomlinson (Member) commented

This issue appears to be a duplicate of #836. I'm going to close this out in favour of that one.

jacobtomlinson closed this as not planned (duplicate) on Feb 1, 2024