Chapter 8: Failure Detection Methods - Sam #96

samwu4166 · 2022-08-02T17:26:47Z

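This chapter opens by pointing out that a distributed system is not an ideal world and that unexpected errors happen all the time. The text spends a lot of its attention on the network, but I have often run into node failures that are not caused by a single network problem. Quite often A (say, GKE) -> B (Pod/container) is fine, but on a few machines B -> C (internal service) occasionally misbehaves or dies outright. For now I use the following odd trick to actively detect unexpected crashes: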

          livenessProbe:
            exec:
              command:
              - /bin/sh
              - -c
              - "cat `find ./health.json -mmin -1440 | awk -v def=default-cannot-cat-file '{print} END { if (NR==0) {print def} }'`"
            initialDelaySeconds: 60
            periodSeconds: 60
            failureThreshold: 5
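As I read it, the probe above passes only while `./health.json` has been modified within the last 1440 minutes (24 hours); when `find` matches nothing, `awk` prints a placeholder name, `cat` fails, and the probe fails. The application side is not shown in the post, so the following is only a minimal sketch of how such a file might be kept fresh, assuming the service refreshes it whenever its B -> C check succeeds. `checkDependency` and the 30-second interval are hypothetical placeholders, not the poster's actual code.

```javascript
const fs = require('node:fs');

// Hypothetical dependency check: resolves to true when B can reach C.
// Replace the body with a real call to the internal service.
async function checkDependency() {
  return true;
}

// Write the heartbeat via a temp file plus rename, so the probe never
// reads a half-written health.json.
function writeHeartbeat(file, now = new Date()) {
  const tmp = `${file}.tmp`;
  fs.writeFileSync(tmp, JSON.stringify({ ok: true, checkedAt: now.toISOString() }));
  fs.renameSync(tmp, file);
}

// Refresh the heartbeat only when the dependency check passes.
// A failed or throwing check leaves the old file untouched, so it ages out
// and the exec probe eventually fails. Returns whether the check passed.
async function heartbeatOnce(file, check = checkDependency) {
  let healthy = false;
  try {
    healthy = (await check()) === true;
  } catch {
    healthy = false;
  }
  if (healthy) writeHeartbeat(file);
  return healthy;
}

// Example wiring (commented out so the sketch has no side effects):
// setInterval(() => heartbeatOnce('./health.json'), 30_000);
```

With a writer like this, the `-mmin` window in the probe decides how long a "living zombie" can survive, so a much shorter window than 24 hours would probably detect failures sooner.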

Does anyone have other ways to detect this? Or how do you all tell whether a system is alive, or is just a living zombie (?)

kylemocode · 2022-08-03T09:30:13Z

We do much the same, and also rely on k8s configuration:

    livenessProbe:
      httpGet:
        path: /.healthcheck
        port: http
      initialDelaySeconds: 10
      periodSeconds: 2
      failureThreshold: 10

    // server...
    server.get('/.healthcheck', (_req, res) => {
      res.send('OK');
    });

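As I recall, it looks at the returned status code: 200 <= status < 400 counts as success. Anything outside that range counts as a failed probe, and once failureThreshold consecutive probes fail, the container is killed and a new one is started.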
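An endpoint that always answers `OK` only proves that the HTTP server is running, which does not catch the "living zombie" case raised in the first post. A minimal sketch of a deeper check follows, using only Node's built-in `http` module and assuming a hypothetical `internalService` check; it returns 503 when a dependency check fails or times out. The check names and the 1-second timeout are illustrative assumptions.

```javascript
const http = require('node:http');

// Hypothetical dependency checks, keyed by name. Each returns a promise
// that resolves to true when the dependency is reachable.
const defaultChecks = {
  internalService: async () => true,
};

// Run every check with a timeout, so a hung dependency counts as a failure.
async function runChecks(checks, timeoutMs = 1000) {
  const results = {};
  for (const [name, check] of Object.entries(checks)) {
    let timer;
    const timeout = new Promise((resolve) => {
      timer = setTimeout(() => resolve(false), timeoutMs);
    });
    try {
      results[name] = (await Promise.race([check(), timeout])) === true;
    } catch {
      results[name] = false;
    } finally {
      clearTimeout(timer);
    }
  }
  return results;
}

// Answer /.healthcheck with 200 only when every check passes, else 503.
function createHealthServer(checks = defaultChecks) {
  return http.createServer(async (req, res) => {
    if (req.url !== '/.healthcheck') {
      res.writeHead(404).end();
      return;
    }
    const results = await runChecks(checks);
    const healthy = Object.values(results).every(Boolean);
    res.writeHead(healthy ? 200 : 503, { 'Content-Type': 'application/json' });
    res.end(JSON.stringify(results));
  });
}

// Example wiring (commented out so the sketch has no side effects):
// createHealthServer().listen(8080);
```

One design caveat: if this endpoint backs a livenessProbe, an outage of C restarts every B container even though restarting may not help. Pointing a readinessProbe at it instead would take the Pod out of rotation without killing it, which may be the better fit depending on the failure mode.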

0x171-0 · 2022-08-03T12:42:10Z

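Health check mechanisms
- Actively write a health check file (though the node can still be alive while the service is failing)
  - When k8s decides it is dead, it kills and restarts it
- Send an HTTP request and check what the status code is
- Spring has several automated mechanisms worth a look, but they are harder to customize
- Grafana can monitor all services by periodically probing each of them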
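The last item, probing every service on a schedule from the outside, can be sketched without tying it to any monitoring product. The code below is not Grafana's API; it is a hypothetical stand-alone prober that counts consecutive failures per service, in the same spirit as `failureThreshold`. It assumes Node 18+ for the built-in `fetch`, and the `services` map in the wiring comment is a made-up placeholder.

```javascript
// Track consecutive probe failures per service, similar to failureThreshold.
class FailureDetector {
  constructor(failureThreshold = 3) {
    this.failureThreshold = failureThreshold;
    this.failures = new Map(); // service name -> consecutive failure count
  }

  // Record one probe result and return the service's current state.
  record(service, ok) {
    const count = ok ? 0 : (this.failures.get(service) ?? 0) + 1;
    this.failures.set(service, count);
    if (count >= this.failureThreshold) return 'failed';
    return count > 0 ? 'suspect' : 'healthy';
  }
}

// Probe one URL. fetch follows redirects by default, so this checks the
// final status; network errors and timeouts count as failures.
async function probe(url, timeoutMs = 2000) {
  try {
    const res = await fetch(url, { signal: AbortSignal.timeout(timeoutMs) });
    return res.status >= 200 && res.status < 400;
  } catch {
    return false;
  }
}

// Example wiring (commented out; `services` is a hypothetical name -> URL map):
// const detector = new FailureDetector(3);
// setInterval(async () => {
//   for (const [name, url] of Object.entries(services)) {
//     console.log(name, detector.record(name, await probe(url)));
//   }
// }, 10_000);
```

Requiring several consecutive failures before declaring a service failed trades detection speed for fewer false alarms from a single dropped request.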

kylemocode added the "question" label (Further information is requested) on Aug 3, 2022
