Add deployment check for rgw gateway pods
This commit adds a check on the number of ready replicas of the
rook-ceph-rgw-* deployments. ODF/Rook used to run a routine that regularly created a bucket and then wrote to and read from it to test RGW health; that status check has since been removed. The alert now needs to reflect both the "Readiness" and the "Connected" status of the CephObjectStore.

Signed-off-by: Divyansh Kamboj <dkamboj@redhat.com>
weirdwiz authored and openshift-cherrypick-robot committed Sep 7, 2023
1 parent 618b50b commit 42ce94e
Showing 2 changed files with 8 additions and 4 deletions.
6 changes: 4 additions & 2 deletions metrics/deploy/prometheus-ocs-rules.yaml
@@ -181,12 +181,14 @@ spec:
   rules:
   - alert: ClusterObjectStoreState
     annotations:
-      description: RGW endpoint of the Ceph object store is in a failure state for more than 15s. Please check the health of the Ceph cluster.
-      message: Cluster Object Store is in unhealthy state. Please check Ceph cluster health.
+      description: RGW endpoint of the Ceph object store is in a failure state or one or more Rook Ceph RGW deployments have fewer ready replicas than required for more than 15s. Please check the health of the Ceph cluster and the deployments.
+      message: Cluster Object Store is in unhealthy state or number of ready replicas for Rook Ceph RGW deployments is less than the desired replicas.
       severity_level: error
       storage_type: RGW
     expr: |
       ocs_rgw_health_status{job="ocs-metrics-exporter"} == 2
+      or
+      kube_deployment_status_replicas_ready{deployment=~"rook-ceph-rgw-.*"} < kube_deployment_spec_replicas{deployment=~"rook-ceph-rgw-.*"}
     for: 15s
     labels:
       severity: critical
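To make the semantics of the new `expr` easier to review, here is a minimal Python sketch of how the two halves combine. It is a simplified model, not how Prometheus evaluates the rule: instant vectors are plain dicts keyed by label sets, the `job` selector and the `for: 15s` hold time are left out, and all helper names (`labels`, `select_rgw`, `filter_eq`, `filter_lt`, `union_or`, `alert_condition`) are hypothetical. It assumes the usual PromQL behaviour that regex matchers are fully anchored, that a comparison without `bool` filters the left-hand vector on identical label sets, and that `or` keeps every left-hand series plus right-hand series whose label set is not already present.

```python
import re
from typing import Dict, FrozenSet, Tuple

# A label set is modelled as a frozenset of (name, value) pairs,
# and an instant vector as a dict from label set to sample value.
Labels = FrozenSet[Tuple[str, str]]
Vector = Dict[Labels, float]

# PromQL regex matchers are fully anchored, hence fullmatch() below.
RGW_DEPLOYMENT = re.compile(r"rook-ceph-rgw-.*")


def labels(**kv: str) -> Labels:
    """Build a hashable label set (hypothetical helper)."""
    return frozenset(kv.items())


def select_rgw(vector: Vector) -> Vector:
    """Model the selector {deployment=~"rook-ceph-rgw-.*"}."""
    return {k: v for k, v in vector.items()
            if RGW_DEPLOYMENT.fullmatch(dict(k).get("deployment", ""))}


def filter_eq(vector: Vector, value: float) -> Vector:
    """Model `vector == scalar`: keep only samples equal to the scalar."""
    return {k: v for k, v in vector.items() if v == value}


def filter_lt(left: Vector, right: Vector) -> Vector:
    """Model `left < right`: keep left samples with a matching label set
    on the right whose value is strictly smaller."""
    return {k: v for k, v in left.items() if k in right and v < right[k]}


def union_or(left: Vector, right: Vector) -> Vector:
    """Model `left or right`: all of left, plus right series not in left."""
    result = dict(left)
    for k, v in right.items():
        result.setdefault(k, v)
    return result


def alert_condition(rgw_health: Vector, ready: Vector, desired: Vector) -> Vector:
    """Series for which the ClusterObjectStoreState condition holds."""
    unhealthy = filter_eq(rgw_health, 2)
    under_replicated = filter_lt(select_rgw(ready), select_rgw(desired))
    return union_or(unhealthy, under_replicated)
```

In this model the condition holds when either the exporter reports status 2 or any `rook-ceph-rgw-*` deployment has fewer ready replicas than its spec requests. Deployments that do not match the prefix are ignored even when under-replicated.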
6 changes: 4 additions & 2 deletions metrics/mixin/alerts/services.libsonnet
@@ -8,14 +8,16 @@
   alert: 'ClusterObjectStoreState',
   expr: |||
     ocs_rgw_health_status{%(ocsExporterSelector)s} == 2
+    or
+    kube_deployment_status_replicas_ready{deployment=~"rook-ceph-rgw-.*"} < kube_deployment_spec_replicas{deployment=~"rook-ceph-rgw-.*"}
   ||| % $._config,
   'for': $._config.clusterObjectStoreStateAlertTime,
   labels: {
     severity: 'critical',
   },
   annotations: {
-    message: 'Cluster Object Store is in unhealthy state. Please check Ceph cluster health.',
-    description: 'RGW endpoint of the Ceph object store is in a failure state for more than %s. Please check the health of the Ceph cluster.' % $._config.clusterObjectStoreStateAlertTime,
+    message: 'Cluster Object Store is in unhealthy state or number of ready replicas for Rook Ceph RGW deployments is less than the desired replicas.',
+    description: 'RGW endpoint of the Ceph object store is in a failure state or one or more Rook Ceph RGW deployments have fewer ready replicas than required for more than %s. Please check the health of the Ceph cluster and the deployments.' % $._config.clusterObjectStoreStateAlertTime,
     storage_type: $._config.objectStorageType,
     severity_level: 'error',
   },
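The mixin builds the rule text with Jsonnet's `%` operator, which follows Python's string-formatting rules, so the substitution can be sketched directly in Python. The two config values below are assumptions inferred from the generated YAML in this commit (`job="ocs-metrics-exporter"` and `15s`); the actual `$._config` definition is not part of this diff, and the names `config`, `EXPR_TEMPLATE`, and `DESCRIPTION_TEMPLATE` are illustrative only.

```python
# Assumed values, inferred from prometheus-ocs-rules.yaml in this commit.
config = {
    "ocsExporterSelector": 'job="ocs-metrics-exporter"',
    "clusterObjectStoreStateAlertTime": "15s",
}

# Same text as the ||| block in services.libsonnet.
EXPR_TEMPLATE = (
    "ocs_rgw_health_status{%(ocsExporterSelector)s} == 2\n"
    "or\n"
    'kube_deployment_status_replicas_ready{deployment=~"rook-ceph-rgw-.*"}'
    ' < kube_deployment_spec_replicas{deployment=~"rook-ceph-rgw-.*"}\n'
)

DESCRIPTION_TEMPLATE = (
    "RGW endpoint of the Ceph object store is in a failure state or one or "
    "more Rook Ceph RGW deployments have fewer ready replicas than required "
    "for more than %s. Please check the health of the Ceph cluster and the "
    "deployments."
)

# Named placeholders take a mapping, positional placeholders take a value.
expr = EXPR_TEMPLATE % config
description = DESCRIPTION_TEMPLATE % config["clusterObjectStoreStateAlertTime"]

print(expr)
print(description)
```

Under those assumed values the rendered strings line up with the `expr` and `description` fields in the YAML file, which suggests the two changed files are meant to stay in sync.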