When deploying Grafana Loki in Simple Scalable mode with multiple read pods in a Kubernetes cluster, you sometimes end up with a Loki read pod that cannot execute any queries. The problem shows up in Grafana as a 504 Gateway Timeout similar to this issue, and it is also linked to the 499 nginx issue found here.
Expected behavior
Grafana Loki should deploy without problems and should not end up with a "tainted" read pod for no reason.
Environment:
Deployed using the official Loki Helm chart, version 6.10.0
Grafana Loki version 3.2.1
Deployed on an internal cloud, using Cilium version 1.16.4
Storage is Azure Blob Storage
How to replicate:
Deploy using helm upgrade or helm install. Here is my final Loki config file:
If you have no query problems, redeploy using helm upgrade. Repeat the process multiple times (it can take 1 attempt or 20); see the probe sketch after these steps.
At some point you will get a "bad deployment" and one of the read pods will no longer be able to execute queries.
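For repeated runs, the upgrade-then-probe loop can be automated. This is only a sketch under assumptions that are not part of the report: the release name, namespace, values file, and gateway URL are placeholders for my environment, and it assumes auth is disabled (no X-Scope-OrgID header), which matches the org_id=fake seen in the logs below.

```python
# Hypothetical reproduction helper: run `helm upgrade` repeatedly and probe the
# read path until one deployment produces queries that hang.
import subprocess
import requests

RELEASE = "loki"                     # placeholder Helm release name
NAMESPACE = "loki"                   # placeholder namespace
VALUES_FILE = "loki-values.yaml"     # placeholder path to the config file above
GATEWAY = "http://loki-gateway.internal.example"  # placeholder gateway/ingress URL

def probe(timeout_s: float = 10.0) -> bool:
    """Return True if a simple labels query answers within the timeout."""
    try:
        resp = requests.get(f"{GATEWAY}/loki/api/v1/labels", timeout=timeout_s)
        return resp.status_code == 200
    except requests.RequestException:
        return False

for attempt in range(1, 21):  # the bad deployment can show up on try 1 or try 20
    subprocess.run(
        ["helm", "upgrade", "--install", RELEASE, "grafana/loki",
         "-n", NAMESPACE, "-f", VALUES_FILE, "--wait"],
        check=True,
    )
    # Probe several times so the load balancer cycles through all read pods.
    failures = sum(1 for _ in range(10) if not probe())
    if failures:
        print(f"attempt {attempt}: {failures}/10 probes failed or hung -> likely a tainted read pod")
        break
    print(f"attempt {attempt}: all probes OK")
```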
Screenshots, Promtail config, or terminal output
At first, I thought my problem was linked to the previously mentioned issue and to what some users there are facing, since this problem does end in a 504 Gateway Timeout when it happens. It also produces this error in the nginx pods that back our ingress:
10.171.106.139 - - [28/Nov/2024:20:59:22 +0000] "GET /loki/api/v1/labels?start=1732826612721000000&end=1732827512721000000 HTTP/1.1" 499 0 "-" "Grafana/11.1.3" 3985 49.972 [6723a512e7641cd9c37269ed-loki-read-3100] [] 172.16.6.5:3100 0 49.971 - 8ee94e8c2e23c4261ecbd3e8c037d4bb
The 499 is nginx's "client closed request" status: Grafana gave up after roughly 50 seconds (the 49.972 upstream time in the log) without the read pod ever answering. In this case, the bad pod is 172.16.6.5 among my 3 read pods. Seeing this error, I tried the suggested fix on my Cilium setup; it did not help.
I then tried to execute the query directly against the pod. Using port forwarding, I sent the following request to the "tainted" read pod:
http://localhost:50697/loki/api/v1/query_range?direction=backward&end=1732889832897000000&limit=1000&query=%7Bubi_project_id%3D%2227edfc6c-eb78-4790-a6c8-ed82a0478f7c%22%7D+%7C%3D+%60%60&start=1732886232897000000&step=2000ms
This results in a request that just returns... nothing. I waited 30 minutes in Postman and only got this log line confirming that the pod received the query:
level=info ts=2024-11-29T15:58:49.425174467Z caller=roundtrip.go:364 org_id=fake traceID=761b6d90af6ae5c8 msg="executing query" type=range query="{ubi_project_id=\"27edfc6c-eb78-4790-a6c8-ed82a0478f7c\"} |= ``" start=2024-11-29T13:17:12.897Z end=2024-11-29T14:17:12.897Z start_delta=2h41m36.528171657s end_delta=1h41m36.528171836s length=1h0m0s step=2000 query_hash=2028412130
When I try the same request against a different pod from the same Helm release, I get the correct HTTP response:
And I can see the Grafana Loki logs telling me the query was a success.
If you query through Grafana, at some point you will hit the bad pod and get either an EOF error or a 504 Gateway Timeout. I already posted the nginx error log above.
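To compare the pods without waiting indefinitely, the same check can be scripted. A minimal sketch, assuming one kubectl port-forward per read pod; the pod names and local ports are placeholders, and the query parameters are the ones from the request above.

```python
# Hypothetical per-pod check: send the same query_range request to each
# port-forwarded read pod with a deadline, so a tainted pod shows up as a
# timeout instead of a request that never returns.
import requests

PODS = {
    "loki-read-a": 50697,  # placeholder names/ports from `kubectl port-forward <pod> <port>:3100`
    "loki-read-b": 50698,
    "loki-read-c": 50699,
}

# Same parameters as the query_range request above.
PARAMS = {
    "direction": "backward",
    "end": "1732889832897000000",
    "limit": "1000",
    "query": '{ubi_project_id="27edfc6c-eb78-4790-a6c8-ed82a0478f7c"} |= ``',
    "start": "1732886232897000000",
    "step": "2000ms",
}

for pod, port in PODS.items():
    url = f"http://localhost:{port}/loki/api/v1/query_range"
    try:
        resp = requests.get(url, params=PARAMS, timeout=30)
        print(f"{pod}: HTTP {resp.status_code} in {resp.elapsed.total_seconds():.1f}s")
    except requests.Timeout:
        print(f"{pod}: no response within 30s -> likely the tainted pod")
```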
How to fix
This is only a temporary fix, but the only known workaround is to restart the read pods. That's it. If you run helm upgrade again, there is a chance the same scenario repeats itself. This should not happen, and at this point I am almost certain it is a Loki problem and not a networking problem. That said, if a Grafana Loki dev could help me find a way to check what my read pod is missing or why it is in a bad state, I will take any suggestion.
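For completeness, here is a minimal sketch of that workaround using the official Kubernetes Python client. The namespace and the app.kubernetes.io/component=read label selector are assumptions about how the chart labels the read pods, so verify them against your release before running anything.

```python
# Minimal sketch of the workaround: delete the read pods so their controller
# recreates them. Namespace and label selector are assumptions; check with
# `kubectl get pods --show-labels` first.
from kubernetes import client, config

NAMESPACE = "loki"                                   # placeholder namespace
READ_SELECTOR = "app.kubernetes.io/component=read"   # assumed chart label for read pods

config.load_kube_config()  # or config.load_incluster_config() when run in-cluster
core = client.CoreV1Api()

pods = core.list_namespaced_pod(NAMESPACE, label_selector=READ_SELECTOR)
for pod in pods.items:
    print(f"restarting {pod.metadata.name}")
    core.delete_namespaced_pod(pod.metadata.name, NAMESPACE)
```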