Discussion: deprecate the old user-error reporting via Prometheus #15192

fuyufjh · 2024-02-22T07:49:21Z

Previously, we chose to report streaming user errors via Prometheus. Particularly, a counter metric compute_error_count was introduced with an error type label. Details here: #7831.

However, the usage is a bit counter-intuitive, so not all connectors are well integrated with it. IMO, it's not well maintained.

Now we have introduced structured logging, and it appears to be a better way than that. It's easier and also more natural for this case.

I'd like to deprecate the old one and completely move to structured logging.

UPDATE: See #15192 (comment)

The text was updated successfully, but these errors were encountered:

BugenZhao · 2024-02-22T09:14:19Z

Are we going to encourage users (especially open-source deployments) to analyze the logs on their own with their logging pipelines? I'm concerned that some users may be proficient in metrics tools like Grafana but pay less attention to logging.

BTW, I think it might be a good time to expose and advertise the JSON logging format to open-source users, rather than gating itself behind the RISINGWAVE_CLOUD environment variable (#10633).

fuyufjh · 2024-02-22T10:07:37Z

Are we going to encourage users (especially open-source deployments) to analyze the logs on their own with their logging pipelines? I'm concerned that some users may be proficient in metrics tools like Grafana but pay less attention to logging.

I think this is probably because we only provide Prometheus + Grafana in our out-of-box Kubernetes deployment. If Loki is included, things might be different.

fuyufjh · 2024-02-22T10:12:32Z

Besides, today I just talked with @xiangjinwu, and he changed my mind.

Rather than removing the Prometheus user-error reporting code completely, we might simplify it by only recording the error type, such as expression_error, sink_error, source_error, etc. In my mind, there should be only around 10 kinds of errors.

The previous design of Prometheus user-error reporting was becoming too complex because we want to record some fine-grain information, e.g. which expression fails or what the error message is. By combining structured logging and Prometheus metrics, we can just record a label=expression_error and ask the users to look into the logs.

hzxa21 · 2024-03-08T08:14:47Z

Now we have introduced structured logging, and it appears to be a better way than that. It's easier and also more natural for this case.

I'd like to deprecate the old one and completely move to structured logging.

+1. At the very least we don't need error message in the metrics for cloud deployments.

By combining structured logging and Prometheus metrics, we can just record a label=expression_error and ask the users to look into the logs.

However, should we make the change to avoid showing error message in the prom metric's label after we have provided out-of-box persistent structure logging solution for user? IIRC, the original motivation for using label to store error message is to make it easier for us and user to find out what the errors are when the logs are not persistent or lost in their environment. Maybe not applied to all kinds of error metrics, but at least the sink error message in the sink error label has helped us identify sink issues quickly in such caess.

github-actions bot added this to the release-1.7 milestone Feb 22, 2024

fuyufjh changed the title ~~Discussion: deprecate user error reporting (compute_error_count) via Prometheus~~ Discussion: deprecate the old user-error reporting via Prometheus Feb 22, 2024

fuyufjh self-assigned this Feb 22, 2024

fuyufjh modified the milestones: release-1.7, release-1.8 Mar 6, 2024

fuyufjh mentioned this issue Mar 8, 2024

feat(metrics): simplify user error metrics #15544

Merged

9 tasks

fuyufjh closed this as completed in #15544 Mar 11, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Discussion: deprecate the old user-error reporting via Prometheus #15192

Discussion: deprecate the old user-error reporting via Prometheus #15192

fuyufjh commented Feb 22, 2024 •

edited

Loading

BugenZhao commented Feb 22, 2024

fuyufjh commented Feb 22, 2024

fuyufjh commented Feb 22, 2024 •

edited

Loading

hzxa21 commented Mar 8, 2024

Discussion: deprecate the old user-error reporting via Prometheus #15192

Discussion: deprecate the old user-error reporting via Prometheus #15192

Comments

fuyufjh commented Feb 22, 2024 • edited Loading

BugenZhao commented Feb 22, 2024

fuyufjh commented Feb 22, 2024

fuyufjh commented Feb 22, 2024 • edited Loading

hzxa21 commented Mar 8, 2024

fuyufjh commented Feb 22, 2024 •

edited

Loading

fuyufjh commented Feb 22, 2024 •

edited

Loading