Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Discussion: deprecate the old user-error reporting via Prometheus #15192

Closed
fuyufjh opened this issue Feb 22, 2024 · 4 comments · Fixed by #15544
Closed

Discussion: deprecate the old user-error reporting via Prometheus #15192

fuyufjh opened this issue Feb 22, 2024 · 4 comments · Fixed by #15544
Assignees
Milestone

Comments

@fuyufjh
Copy link
Member

fuyufjh commented Feb 22, 2024

Previously, we chose to report streaming user errors via Prometheus. Particularly, a counter metric compute_error_count was introduced with an error type label. Details here: #7831.

However, the usage is a bit counter-intuitive, so not all connectors are well integrated with it. IMO, it's not well maintained.

Now we have introduced structured logging, and it appears to be a better way than that. It's easier and also more natural for this case.

I'd like to deprecate the old one and completely move to structured logging.

UPDATE: See #15192 (comment)

@github-actions github-actions bot added this to the release-1.7 milestone Feb 22, 2024
@fuyufjh fuyufjh changed the title Discussion: deprecate user error reporting (compute_error_count) via Prometheus Discussion: deprecate the old user-error reporting via Prometheus Feb 22, 2024
@BugenZhao
Copy link
Member

Are we going to encourage users (especially open-source deployments) to analyze the logs on their own with their logging pipelines? I'm concerned that some users may be proficient in metrics tools like Grafana but pay less attention to logging.

BTW, I think it might be a good time to expose and advertise the JSON logging format to open-source users, rather than gating itself behind the RISINGWAVE_CLOUD environment variable (#10633).

@fuyufjh
Copy link
Member Author

fuyufjh commented Feb 22, 2024

Are we going to encourage users (especially open-source deployments) to analyze the logs on their own with their logging pipelines? I'm concerned that some users may be proficient in metrics tools like Grafana but pay less attention to logging.

I think this is probably because we only provide Prometheus + Grafana in our out-of-box Kubernetes deployment. If Loki is included, things might be different.

@fuyufjh
Copy link
Member Author

fuyufjh commented Feb 22, 2024

Besides, today I just talked with @xiangjinwu, and he changed my mind.

Rather than removing the Prometheus user-error reporting code completely, we might simplify it by only recording the error type, such as expression_error, sink_error, source_error, etc. In my mind, there should be only around 10 kinds of errors.

The previous design of Prometheus user-error reporting was becoming too complex because we want to record some fine-grain information, e.g. which expression fails or what the error message is. By combining structured logging and Prometheus metrics, we can just record a label=expression_error and ask the users to look into the logs.

@fuyufjh fuyufjh self-assigned this Feb 22, 2024
@fuyufjh fuyufjh modified the milestones: release-1.7, release-1.8 Mar 6, 2024
@hzxa21
Copy link
Collaborator

hzxa21 commented Mar 8, 2024

Now we have introduced structured logging, and it appears to be a better way than that. It's easier and also more natural for this case.

I'd like to deprecate the old one and completely move to structured logging.

+1. At the very least we don't need error message in the metrics for cloud deployments.

By combining structured logging and Prometheus metrics, we can just record a label=expression_error and ask the users to look into the logs.

However, should we make the change to avoid showing error message in the prom metric's label after we have provided out-of-box persistent structure logging solution for user? IIRC, the original motivation for using label to store error message is to make it easier for us and user to find out what the errors are when the logs are not persistent or lost in their environment. Maybe not applied to all kinds of error metrics, but at least the sink error message in the sink error label has helped us identify sink issues quickly in such caess.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants