Show specific error messages in the UI related to connection errors #3808

Closed
upgradingdave opened this issue Aug 28, 2023 · 26 comments
Labels: enhancement (New feature or request), spring cleaning (Could be cleaned up one day), ux
@upgradingdave

upgradingdave commented Aug 28, 2023

Problem you would like to solve

Currently, the Desktop Modeler shows a generic error message of "Unknown error: Please Check Zeebe cluster status".

[Screenshot: Screen Shot 2023-08-28 at 1 56 37 PM]

Tailing the logs shows the more specific connection error messages such as:

RequestError: certificate has expired

Proposed solution

Display the specific error messages in the ui to make troubleshooting easier.

Alternatives considered

It's possible to tail the Desktop Modeler logs. However this is tedious and not intuitive for most customers.

Additional context

No response

@upgradingdave upgradingdave added the enhancement New feature or request label Aug 28, 2023
@nikku
Member

nikku commented Aug 28, 2023

We're using NodeJS/OpenSSL under the hood. To my knowledge they are as unspecific as stated in the case of a broken SSL handshake.

If there are additional details in the response we get, we're happy to present them to the user.

@upgradingdave
Author

upgradingdave commented Aug 28, 2023

Hi @nikku 👋

Yeah, that'd be great. Ideally, it'd be really convenient to be able to see the more specific reasons as to why the Desktop Modeler is unable to connect to the Zeebe Gateway without having to look at the logs.

So, any additional details that can be parsed out of a response, or out of an exception trace, and shown in the UI could help a lot.

Or, I had another idea ... maybe it could be possible to show the more detailed error messages inside the Log panel? Perhaps the human readable portion of the stack trace (such as RequestError: certificate has expired) could be shown in the Log panel?
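As a stopgap until the UI surfaces these messages, the human-readable lines can be filtered out of the log output. The following is only a minimal sketch: the function name `extract_error_lines` is made up for illustration, and it assumes errors show up in the log as `<Name>Error: <message>`, as in the `RequestError: certificate has expired` sample above.

```shell
#!/usr/bin/env bash

# Hypothetical helper: read log text on stdin and print only the parts that
# look like "<Name>Error: <message>", dropping stack frames and other noise.
# The log format is an assumption based on the sample quoted in this issue.
extract_error_lines() {
    grep -E -o '[A-Za-z]*Error: .*'
}
```

It could be used as `extract_error_lines < modeler.log`, where `modeler.log` stands in for wherever your installation writes its log.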

@nikku nikku added the spring cleaning Could be cleaned up one day label Aug 29, 2023
@nikku
Member

nikku commented Aug 29, 2023

Parsing the log for special character streams does not seem to me like a satisfying (and robust) solution.

zeebe-node error handling is what we'd need to plug into.

Maybe, if you have the chance, you could give it a debugging session yourself, and figure out whether there are pragmatic improvements we can make.

I've tagged this as spring cleaning.

@nikku nikku added ux backlog Queued in backlog labels Aug 29, 2023
@nikku nikku added the ready Ready to be worked on label Aug 29, 2023 — with bpmn-io-tasks
@nikku nikku removed the backlog Queued in backlog label Aug 29, 2023
@jessesimpson36

Could we perhaps add additional checks prior to making a request to ensure that the TLS certificate is valid and display an error message if it's not?

This form and lack of error response is a big frustration on my end, causing about 3-5 hours of debugging effort every time I have to deploy a model, which I often have to do for testing. (I have this issue both in web modeler and desktop modeler)

Additionally, I would like to offer my support. If a change can be made in the helm chart (I suppose this would be for the web modeler) so that a user does not need to configure their OAuth URL / Zeebe gateway URL, for example environment variables I can set so that these fields are auto-filled, I would be more than happy to write a helm-chart patch for that.

From my perspective, this form has shown the red box for:

  1. Untrusted TLS certificate in the gateway
  2. Untrusted TLS certificate in keycloak
  3. Failing to write to a cache file that I never even knew existed
  4. Invalid client credentials

There might be more reasons. In my debugging I've also messed around with the Audience form element and found that whatever you put in that field is completely ignored by the application.

Also, I do know that there is a troubleshoot link that goes to the documentation, and while I am grateful for this, it is not good enough in my opinion.

@nikku
Member

nikku commented Aug 31, 2023

@jessesimpson36 Thanks for chiming in, and sorry to hear that you are frustrated about SSL configuration issues in Camunda 8.

Before we can try to improve the situation, let me better understand some of your feedback:

> This form and lack of error response is a big frustration on my end, causing about 3-5 hours of debugging effort every time I have to deploy a model, which I often have to do for testing. (I have this issue both in web modeler and desktop modeler)

Could you elaborate on what you deploy, and why it always takes such a long time to deploy + debug? Is what you do a common thing ordinary users do, and if so, how frequently do you do it? Which documentation / guidelines do you follow as you do it?

@jessesimpson36

jessesimpson36 commented Sep 11, 2023

Hi @nikku ,

I am a developer on the Distribution team who works on the helm charts. For me, it's pretty common to deploy the helm chart locally and do basic testing, especially as it relates to support tickets, new features, helping others internally, and the occasional customer calls where customers struggle to do similar things.

> Could you elaborate on what you deploy, and why it always takes such a long time to deploy + debug?

What I deploy is often a values.yaml for the https://github.com/camunda/camunda-platform-helm/tree/main/charts/camunda-platform repo, and many times, I take a customer's values.yaml and modify it so that I can test things locally with their configuration. It takes me a long time because I don't know why the error happens, and there are many possible causes behind the same error message (we basically just get a red box and something like "Unknown error. Please check Zeebe cluster status. Troubleshoot").

So what am I doing that it takes so long for me to debug?

Once I get this error, I have to wonder whether the issue has to do with the networking, the deployment configuration, or the application code.

  1. I check to see if the cluster endpoint matches the external url in the ingress configuration
  2. I check that the OAuth2 url host name matches the keycloak host name designated in the ingress configuration
  3. I modify the ending of the oauth2 url: /auth/realms/camunda-platform/protocol/openid-connect/token. I often try removing the auth part, or playing around with different urls because I have no idea how I'm supposed to get this magic url.
  4. I verify the client ID and client secret in identity. Sometimes I will make a new Application in identity with all privileges, and sometimes I will just use the Zeebe client.
  5. I test all my TLS certs to ensure they are all valid
  6. I check the logs for Zeebe, Zeebe-gateway, and the web modeler restapi. The logs have always been worthless for me in debugging this, but I check them anyway.
  7. I modify the Keycloak url to use the kubernetes service name as the hostname instead of the external-facing url
  8. I modify the OAuth url to use the kubernetes service name as the hostname instead of the external-facing url
  9. I try the desktop modeler with previous steps to see if that's any different
  10. I refer to Dave's message here: https://camunda.slack.com/archives/C05764N4VNZ/p1690906641310499

Usually, I go through those steps, they don't always help, and then I just make panic changes because there are no logs or error messages. I have gone to the troubleshoot link before; sometimes it helps, but most of the time it does not. It did help with my most recent frustration when I was trying to configure a read-only root filesystem though. That was when I learned about the magic file ZEEBE_CLIENT_CONFIG_PATH=/path/to/credentials/cache.txt using the docs link.

> Is what you do a common thing ordinary users do, and if so, how frequently do you do it?

Every user who installs C8 will need to verify that their installation is correct, and the only way to do that is to deploy a model and access each of the web components. Users will only have to debug this once, but I have to deploy the helm charts many times. So ordinary users will not be as frustrated as internal devs who are testing their installation.

@nikku
Member

nikku commented Sep 12, 2023

@jessesimpson36 Thanks for your feedback.

> I modify the ending of the oauth2 url: /auth/realms/camunda-platform/protocol/openid-connect/token. I often try removing the auth part, or playing around with different urls because I have no idea how I'm supposed to get this magic url.

I see two things we want to address rather soonish (check if properly documented, and/or can be validated):

  • Partially in scope of the modeler: What OAuth URL should be entered, how do I learn about it, how can we validate it?
  • In scope of the (Desktop) modeler: Can we return more meaningful error messages?

> Every user who installs C8 will need to verify that their installation is correct, and the only way to do that is to deploy a model and access each of the web components.

What I wonder is if we can provide a CLI utility that verifies the proper configuration of a C8 (self-managed) instance, using tools equipped to do the job? For example, to verify the proper configuration of my mail server I turn to detailed diagnostics utilities (i.e. this) when just sending or receiving an email proves inconclusive.

@jessesimpson36

> What I wonder is if we can provide a CLI utility that verifies the proper configuration of a C8 (self-managed) instance, using tools equipped to do the job?

I'm not sure this is a good way to handle it. We have zbctl, but zbctl will often work regardless of whether the web modeler / desktop modeler can work. Or vice versa.

To me, a better solution would simply be to have environment variables that would pre-fill that form, and for us to set them as part of the helm chart. For example:

CLUSTER_ENDPOINT=http://<RELEASE>-zeebe-gateway:26500
IDENTITY_OAUTH_URL=http://<RELEASE>-keycloak/auth/realms/camunda-platform/protocol/openid-connect/token
DEFAULT_AUDIENCE=test

Then the user only puts in the client id and secret. The helm chart can then properly set those environment variables.

That's more of a Web Modeler suggestion. I'm not sure if that's a good idea or not, or if that idea could be translated to the desktop modeler.
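To illustrate the suggestion, the defaults could be derived from nothing but the Helm release name. This is a sketch only: `CLUSTER_ENDPOINT` and `IDENTITY_OAUTH_URL` are the variable names proposed in this comment, not settings the modelers are known to read, and the helper name `modeler_defaults` is hypothetical.

```shell
#!/usr/bin/env bash

# Hypothetical helper: print the proposed pre-fill variables for a given
# Helm release name, following the service naming used in the example above.
# The variable names are a proposal from this thread, not an existing API.
modeler_defaults() {
    local release="$1"
    printf 'CLUSTER_ENDPOINT=http://%s-zeebe-gateway:26500\n' "$release"
    printf 'IDENTITY_OAUTH_URL=http://%s-keycloak/auth/realms/camunda-platform/protocol/openid-connect/token\n' "$release"
}
```

Calling `modeler_defaults my-release` prints both lines with `my-release` substituted for `<RELEASE>`.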

I still also think better error messages make the most sense directly inside the modeler / web modeler.

@nikku
Member

nikku commented Sep 14, 2023

What we'd accomplish with a test kit is to verify the remote end(s) are configured correctly, independent of zbctl (being extremely forgiving with SSL certificates) and the modelers (being fairly strict).

@jessesimpson36

What sort of test would make sense here? Some openssl command to verify the SSL cert? SSL isn't the only problem that can trigger this error. Perhaps accessing an OIDC endpoint to ensure the keycloak url is properly set (https://developer.okta.com/docs/reference/api/oidc/#well-known-openid-configuration). Perhaps something that tests if a port is open on the zeebe gateway and whether we can get a certain response out of it.

@nikku
Member

nikku commented Sep 19, 2023

If you ask me there are a couple of steps involved:

  • Ensure remote endpoints are reachable
  • SSL: Ensure remote endpoints are trusted / properly configured
  • Ensure remote endpoints are correct (right OAuth callback url)

The second step is fully supported by openssl connection diagnosis.

@CatalinaMoisuc
Member

@nikku can we also apply the same changes within the scope of Web Modeler where possible?

@jessesimpson36

Ok, so we have the openssl command,

timeout 1 openssl s_client -alpn h2 -connect modeler.dev.jlscode.com:443 -servername modeler.dev.jlscode.com -brief 
CONNECTION ESTABLISHED
Protocol version: TLSv1.3
Ciphersuite: TLS_AES_256_GCM_SHA384
Peer certificate: CN = modeler.dev.jlscode.com
Hash used: SHA256
Signature type: RSA-PSS
Verification: OK
Server Temp Key: X25519, 253 bits

And the openid-configuration endpoint which can be queried like so:

> curl --silent \
         -X GET \
        https://keycloak.dev.jlscode.com/auth/realms/camunda-platform/.well-known/openid-configuration  \
        | jq .authorization_endpoint

"https://keycloak.dev.jlscode.com/auth/realms/camunda-platform/protocol/openid-connect/auth"

I'd say this is a solid start. Are there ways to verify a GRPC connection? Like some sort of curl / healthcheck endpoint over GRPC to verify the gateway url connection works?
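The same discovery document also lists a `token_endpoint`, which is the kind of URL (ending in `/protocol/openid-connect/token`) discussed earlier for the OAuth URL field. A minimal sketch for pulling it out follows; the function name is hypothetical, and plain grep/sed is used only to keep the sketch dependency-free, so it assumes the value contains no escaped quotes. Where jq is installed, `jq -r .token_endpoint` would be the more robust choice.

```shell
#!/usr/bin/env bash

# Hypothetical helper: read an OpenID discovery document (JSON) on stdin
# and print the value of its first "token_endpoint" key.
# The closing quote in the pattern keeps it from matching longer keys such
# as "token_endpoint_auth_methods_supported".
token_endpoint_from_discovery() {
    grep -o '"token_endpoint"[[:space:]]*:[[:space:]]*"[^"]*"' \
        | head -n 1 \
        | sed -e 's/^"token_endpoint"[[:space:]]*:[[:space:]]*"//' -e 's/"$//'
}
```

For example: `curl --silent https://keycloak.dev.jlscode.com/auth/realms/camunda-platform/.well-known/openid-configuration | token_endpoint_from_discovery`.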

@jessesimpson36

With those two things passing, I still struggle to deploy a model.

@nikku
Member

nikku commented Sep 19, 2023

You want to run the openssl command against all endpoints, keycloak, modeler, and zeebe gateway. alpn is, to my knowledge, only required for Zeebe.

Once Zeebe is connected you'd want to try to query the cluster topology using zbctl status.

If zbctl status succeeds then you may give the Desktop Modeler a try.
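The sequence above could be scripted roughly as follows. This is a sketch under assumptions: the helper names `s_client_args` and `check_tls` are invented for illustration, the endpoint kinds are just labels, and ALPN (`h2`) is requested only for the Zeebe gateway, as described above.

```shell
#!/usr/bin/env bash

# Hypothetical helper: print the openssl arguments for one endpoint.
# $1 is a label (zeebe, keycloak, modeler), $2 is "host:port".
s_client_args() {
    local kind="$1" host_port="$2"
    local args=(s_client -connect "$host_port" -servername "${host_port%%:*}" -brief)
    if [[ "$kind" == "zeebe" ]]; then
        args+=(-alpn h2)   # only the gRPC endpoint needs ALPN
    fi
    printf '%s\n' "${args[*]}"
}

# Run the handshake check for one endpoint; the exit status comes from
# timeout/openssl. Closing stdin makes s_client return after the handshake.
check_tls() {
    local -a args
    read -r -a args <<< "$(s_client_args "$1" "$2")"
    timeout 5 openssl "${args[@]}" < /dev/null
}
```

A run might then look like `check_tls keycloak keycloak.example.com:443 && check_tls zeebe zeebe.example.com:443`, followed by `zbctl status` and, only if that succeeds, a deployment attempt from the Desktop Modeler. The host names here are placeholders.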

@nikku
Member

nikku commented Sep 19, 2023

To verify the full OID endpoint you want to assert (in a shell script) that the token-endpoint is reachable + that it provides you with a token, cf. this stackoverflow answer. Adapted to use client id and secret, of course.

@jessesimpson36

I'm realizing now that the desktop modeler can deploy a model more easily for me than the web modeler can. With the same configuration for both, the desktop modeler deploys successfully but the web modeler does not.

zbctl status succeeds for me, as does the openssl command against the gateway.

I need to look up what ALPN is, but the output from openssl has

ALPN protocol: h2

so I think that part is fine.

@jessesimpson36

> Parsing the log for special character streams does not seem to me like a satisfying (and robust) solution.
>
> zeebe-node error handling is what we'd need to plug into.
>
> Maybe, if you have the chance, you could give it a debugging session yourself, and figure out whether there are pragmatic improvements we can make.
>
> I've tagged this as spring cleaning.

Just saw this message. Yeah, I'll give this a try and post here if I find something useful.

@nikku
Member

nikku commented Sep 20, 2023

ALPN stands for Application-Layer-Protocol-Negotiation. It is being used by GRPC (Zeebe) to negotiate the binary GRPC protocol on top of HTTP(S)/2 + TLS connection to the server.

It must be appropriately configured for Zeebe only. Other endpoints use standard REST, where the protocol (HTTP(S)) is settled on the protocol layer, before initiating communication, after TLS.
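To see which protocol was actually negotiated, the `ALPN protocol:` line mentioned earlier in this thread can be picked out of the openssl output. A small sketch, with a made-up helper name, assuming openssl prints that line as quoted above:

```shell
#!/usr/bin/env bash

# Hypothetical helper: read `openssl s_client` output on stdin and print the
# negotiated ALPN protocol (for example "h2"). Prints nothing if the output
# contains no "ALPN protocol:" line.
negotiated_alpn() {
    sed -n 's/^ALPN protocol: //p' | head -n 1
}
```

For example: `openssl s_client -alpn h2 -connect zeebe.example.com:443 -servername zeebe.example.com < /dev/null 2>&1 | negotiated_alpn`, where the host name is a placeholder. An empty result against the Zeebe gateway would suggest that `h2` was not negotiated.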

@nikku
Member

nikku commented Sep 20, 2023

To test deployment to Zeebe I'd suggest using the desktop modeler, with DEBUG logging enabled.

This gives you detailed output (even if hidden via the UI) on things that go awry.

@nikku
Member

nikku commented Sep 20, 2023

Last point: Desktop modeler trusts your OS root certificates. In web modeler you may need to double check the behavior.

@nikku
Member

nikku commented Sep 20, 2023

Based on my investigation in Modeler land I followed up on three things:

@barmac
Collaborator

barmac commented Sep 20, 2023

ChatGPT-produced script for connection validation:

#!/bin/bash

# Define the URLs of the remote endpoints
ZEEBE_ENDPOINT="https://example.com/zeebe"
OAUTH_ENDPOINT="https://example.com/oauth"
CLIENT_ID="your_client_id"
CLIENT_SECRET="your_client_secret"
SSL_ENABLED="true" # Set to "true" to enable SSL validation

# Function to check if an endpoint is reachable
check_endpoint_reachability() {
    local endpoint="$1"
    if curl -Is --connect-timeout 5 "$endpoint" >/dev/null; then
        echo "Endpoint $endpoint is reachable."
    else
        echo "Endpoint $endpoint is not reachable."
        exit 1
    fi
}

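# Function to check the certificate served by an endpoint if SSL_ENABLED is set to "true"
validate_ssl() {
    local endpoint="$1"
    local ssl_enabled="$SSL_ENABLED"
    local host_port

    if [[ "$ssl_enabled" == "true" ]]; then
        # Reduce the URL to host[:port]; s_client needs an explicit port, so default to 443
        host_port="${endpoint#*://}"
        host_port="${host_port%%/*}"
        [[ "$host_port" == *:* ]] || host_port="$host_port:443"

        # -servername sends SNI; -checkend 0 only fails if the certificate has already expired
        if openssl s_client -connect "$host_port" -servername "${host_port%%:*}" < /dev/null 2>/dev/null | openssl x509 -noout -checkend 0; then
            echo "SSL certificate for endpoint $endpoint has not expired."
        else
            echo "SSL certificate for endpoint $endpoint could not be retrieved or has expired."
            exit 1
        fi
    else
        echo "SSL validation is disabled."
    fi
}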

# Function to check if the OAuth callback URL is correct and obtain a token
check_oauth_callback() {
    local oauth_callback_url="$1"
    local oauth_token_url="$oauth_callback_url/token"
    local response http_status body

    # Request a token with the client ID and secret (client credentials grant).
    # -w appends the HTTP status code on its own line, since -s alone prints only the body.
    response=$(curl -s -X POST "$oauth_token_url" \
        -d "grant_type=client_credentials" \
        -d "client_id=$CLIENT_ID" \
        -d "client_secret=$CLIENT_SECRET" \
        -w '\n%{http_code}')
    http_status=${response##*$'\n'}
    body=${response%$'\n'*}

    if [[ "$http_status" == "200" ]]; then
        if echo "$body" | jq -e '.access_token' >/dev/null; then
            echo "OAuth callback URL is correct: received a 200 status code and a response containing an access_token field."
        else
            echo "OAuth callback URL is correct and returned a 200 status code, but the response does not contain an access_token field."
            exit 1
        fi
    else
        echo "OAuth callback URL is incorrect, or the request returned a non-200 status code (got $http_status)."
        exit 1
    fi
}

# Check reachability of remote endpoints
check_endpoint_reachability "$ZEEBE_ENDPOINT" || exit 1
check_endpoint_reachability "$OAUTH_ENDPOINT" || exit 1

# Validate SSL for remote endpoints (if enabled)
validate_ssl "$ZEEBE_ENDPOINT"
validate_ssl "$OAUTH_ENDPOINT"

# Check if the OAuth callback URL is correct and obtain a token
check_oauth_callback "$OAUTH_ENDPOINT" || exit 1

@nikku
Member

nikku commented Sep 20, 2023

Based on ChatGPT prior art camunda/zeebe-connection-test#10 adds a basic C8 connection checker, validating reachability, SSL and oauth. Can be extended in a fairly simple manner.

@nikku
Member

nikku commented Sep 21, 2023

Another review of our deploy flow (against C8 SaaS) uncovered that the C8 SaaS /token endpoint indicates "client not found for CLIENT_ID" with HTTP 404, making it indistinguishable for us from "endpoint not found" (ref).

An update to C8 SaaS fixes the issue. In the future it will indicate wrong credentials as HTTP 401 as suggested by the OpenID specification (Page 45).
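With wrong credentials reported as 401 and a wrong URL as 404, a connection checker can tell the two cases apart. A minimal sketch of that mapping follows; the function name and the message texts are illustrative only, and `000` is what curl's `%{http_code}` reports when no HTTP response was received at all.

```shell
#!/usr/bin/env bash

# Hypothetical helper: turn the HTTP status of a token request into a hint.
# Obtain the status e.g. via: curl -s -o /dev/null -w '%{http_code}' ...
explain_token_status() {
    case "$1" in
        200) echo "ok: token endpoint answered the request" ;;
        401) echo "credentials rejected: check client ID and secret" ;;
        404) echo "endpoint not found: check the OAuth URL" ;;
        000) echo "no HTTP response: check reachability and TLS" ;;
        *)   echo "unexpected status: $1" ;;
    esac
}
```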

@nikku
Member

nikku commented Sep 22, 2023

Thanks for all the feedback folks, especially @jessesimpson36! I think we were able to move the "deploy to C8 experience" forward substantially.

We'll ship with the next release of the Desktop Modeler:

We've otherwise worked on / improved:

We collected the following potential follow-ups:

Closing this issue. Let's handle further improvements in individual follow-ups.

@nikku nikku closed this as completed Sep 22, 2023
@bpmn-io-tasks bpmn-io-tasks bot removed the in progress Currently worked on label Sep 22, 2023