Bug 2214524: [release-4.13] Add deployment check for rgw gateway pods #2184

Open: 1 commit to merge into base branch release-4.13
6 changes: 4 additions & 2 deletions metrics/deploy/prometheus-ocs-rules.yaml
@@ -181,12 +181,14 @@ spec:
     rules:
     - alert: ClusterObjectStoreState
       annotations:
-        description: RGW endpoint of the Ceph object store is in a failure state for more than 15s. Please check the health of the Ceph cluster.
-        message: Cluster Object Store is in unhealthy state. Please check Ceph cluster health.
+        description: RGW endpoint of the Ceph object store is in a failure state or one or more Rook Ceph RGW deployments have fewer ready replicas than required for more than 15s. Please check the health of the Ceph cluster and the deployments.
+        message: Cluster Object Store is in unhealthy state or number of ready replicas for Rook Ceph RGW deployments is less than the desired replicas.
         severity_level: error
         storage_type: RGW
       expr: |
         ocs_rgw_health_status{job="ocs-metrics-exporter"} == 2
+        or
+        kube_deployment_status_replicas_ready{deployment=~"rook-ceph-rgw-.*"} < kube_deployment_spec_replicas{deployment=~"rook-ceph-rgw-.*"}
       for: 15s
       labels:
         severity: critical
6 changes: 4 additions & 2 deletions metrics/mixin/alerts/services.libsonnet
@@ -8,14 +8,16 @@
             alert: 'ClusterObjectStoreState',
             expr: |||
               ocs_rgw_health_status{%(ocsExporterSelector)s} == 2
+              or
+              kube_deployment_status_replicas_ready{deployment=~"rook-ceph-rgw-.*"} < kube_deployment_spec_replicas{deployment=~"rook-ceph-rgw-.*"}
             ||| % $._config,
             'for': $._config.clusterObjectStoreStateAlertTime,
             labels: {
               severity: 'critical',
             },
             annotations: {
-              message: 'Cluster Object Store is in unhealthy state. Please check Ceph cluster health.',
-              description: 'RGW endpoint of the Ceph object store is in a failure state for more than %s. Please check the health of the Ceph cluster.' % $._config.clusterObjectStoreStateAlertTime,
+              message: 'Cluster Object Store is in unhealthy state or number of ready replicas for Rook Ceph RGW deployments is less than the desired replicas.',
+              description: 'RGW endpoint of the Ceph object store is in a failure state or one or more Rook Ceph RGW deployments have fewer ready replicas than required for more than %s. Please check the health of the Ceph cluster and the deployments.' % $._config.clusterObjectStoreStateAlertTime,
               storage_type: $._config.objectStorageType,
               severity_level: 'error',
             },
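For reviewers who want to reason about the new expression outside Prometheus, the condition can be approximated in a few lines of Python. This is only a rough sketch, not part of the PR: the `Deployment` type, the function names, and the example deployment names are hypothetical, and the model ignores the `for` duration and PromQL label matching. It assumes only what the diff shows: the alert condition holds when `ocs_rgw_health_status` equals 2, or when any deployment whose name matches `rook-ceph-rgw-.*` has fewer ready replicas than its spec requests.

```python
from dataclasses import dataclass

# Values taken from the expression in the diff.
RGW_PREFIX = "rook-ceph-rgw-"   # PromQL regexes are fully anchored, so
                                # "rook-ceph-rgw-.*" is a prefix match
FAILURE_STATUS = 2              # value ocs_rgw_health_status is compared against


@dataclass(frozen=True)
class Deployment:
    """Hypothetical stand-in for one deployment's replica metrics."""
    name: str
    ready_replicas: int     # models kube_deployment_status_replicas_ready
    desired_replicas: int   # models kube_deployment_spec_replicas


def degraded_rgw_deployments(deployments):
    """Return names of RGW deployments with fewer ready replicas than desired."""
    return [
        d.name
        for d in deployments
        if d.name.startswith(RGW_PREFIX) and d.ready_replicas < d.desired_replicas
    ]


def alert_condition(health_status, deployments):
    """True when either branch of the `or` in the expression would match."""
    return health_status == FAILURE_STATUS or bool(degraded_rgw_deployments(deployments))


if __name__ == "__main__":
    # Hypothetical deployment names, for illustration only.
    sample = [
        Deployment("rook-ceph-rgw-example-a", ready_replicas=0, desired_replicas=1),
        Deployment("rook-ceph-mgr-a", ready_replicas=0, desired_replicas=1),
    ]
    print(degraded_rgw_deployments(sample))
    print(alert_condition(0, sample))
```

In this model the non-RGW deployment is ignored even though it is also short of replicas, which mirrors the `deployment=~"rook-ceph-rgw-.*"` selector on both sides of the comparison.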