LEAP prometheus server is down / scheduler failing #2248
I'm still learning about the system, but my expectation for the support namespace was that it would just be for monitoring etc., yet I'm seeing a lot of GKE nodes when I run `kubectl get nodes -n support`:

```
NAME STATUS ROLES AGE VERSION gke-leap-cluster-core-pool-1cc6bf7d-5fnp Ready 132d v1.24.5-gke.600 gke-leap-cluster-core-pool-1cc6bf7d-5fwt Ready 64d v1.24.5-gke.600 gke-leap-cluster-core-pool-1cc6bf7d-7wg2 Ready 34d v1.24.5-gke.600 gke-leap-cluster-core-pool-1cc6bf7d-g77w Ready 27h v1.24.5-gke.600 gke-leap-cluster-core-pool-1cc6bf7d-vlbq Ready 45h v1.24.5-gke.600 gke-leap-cluster-dask-huge-b3a90f26-24j4 Ready 29m v1.24.5-gke.600 gke-leap-cluster-dask-huge-b3a90f26-29kn Ready 28m v1.24.5-gke.600 gke-leap-cluster-dask-huge-b3a90f26-2dxm Ready 29m v1.24.5-gke.600 gke-leap-cluster-dask-huge-b3a90f26-2g5c Ready 29m v1.24.5-gke.600 gke-leap-cluster-dask-huge-b3a90f26-2gc4 Ready 29m v1.24.5-gke.600 gke-leap-cluster-dask-huge-b3a90f26-2khr Ready 29m v1.24.5-gke.600 gke-leap-cluster-dask-huge-b3a90f26-2tzz Ready 29m v1.24.5-gke.600 gke-leap-cluster-dask-huge-b3a90f26-2vcj Ready 29m v1.24.5-gke.600 gke-leap-cluster-dask-huge-b3a90f26-2wzw Ready 29m v1.24.5-gke.600 gke-leap-cluster-dask-huge-b3a90f26-465z Ready 29m v1.24.5-gke.600 gke-leap-cluster-dask-huge-b3a90f26-46df Ready 29m v1.24.5-gke.600 gke-leap-cluster-dask-huge-b3a90f26-46zs Ready 29m v1.24.5-gke.600 gke-leap-cluster-dask-huge-b3a90f26-49x2 Ready 28m v1.24.5-gke.600 gke-leap-cluster-dask-huge-b3a90f26-49zh Ready 28m v1.24.5-gke.600 gke-leap-cluster-dask-huge-b3a90f26-4hfp Ready 29m v1.24.5-gke.600 gke-leap-cluster-dask-huge-b3a90f26-4jdv Ready 29m v1.24.5-gke.600 gke-leap-cluster-dask-huge-b3a90f26-4v49 Ready 29m v1.24.5-gke.600 gke-leap-cluster-dask-huge-b3a90f26-4vsf Ready 28m v1.24.5-gke.600 gke-leap-cluster-dask-huge-b3a90f26-4wbr Ready 29m v1.24.5-gke.600 gke-leap-cluster-dask-huge-b3a90f26-4zd9 Ready 28m v1.24.5-gke.600 gke-leap-cluster-dask-huge-b3a90f26-574h Ready 29m v1.24.5-gke.600 gke-leap-cluster-dask-huge-b3a90f26-59tc Ready 29m v1.24.5-gke.600 gke-leap-cluster-dask-huge-b3a90f26-5glt Ready 29m v1.24.5-gke.600 gke-leap-cluster-dask-huge-b3a90f26-5k64 Ready 29m v1.24.5-gke.600 gke-leap-cluster-dask-huge-b3a90f26-5lmk Ready 29m v1.24.5-gke.600 gke-leap-cluster-dask-huge-b3a90f26-5mb2 Ready 28m v1.24.5-gke.600 gke-leap-cluster-dask-huge-b3a90f26-5v5s Ready 29m v1.24.5-gke.600 gke-leap-cluster-dask-huge-b3a90f26-5wks Ready 29m v1.24.5-gke.600 gke-leap-cluster-dask-huge-b3a90f26-5zbh Ready 29m v1.24.5-gke.600 gke-leap-cluster-dask-huge-b3a90f26-5zjr Ready 29m v1.24.5-gke.600 gke-leap-cluster-dask-huge-b3a90f26-6mkt Ready 28m v1.24.5-gke.600 gke-leap-cluster-dask-huge-b3a90f26-6q4b Ready 29m v1.24.5-gke.600 gke-leap-cluster-dask-huge-b3a90f26-6shh Ready 29m v1.24.5-gke.600 gke-leap-cluster-dask-huge-b3a90f26-7cbz Ready 29m v1.24.5-gke.600 gke-leap-cluster-dask-huge-b3a90f26-7dk6 Ready 29m v1.24.5-gke.600 gke-leap-cluster-dask-huge-b3a90f26-7h7g Ready 29m v1.24.5-gke.600 gke-leap-cluster-dask-huge-b3a90f26-7ptq Ready 29m v1.24.5-gke.600 gke-leap-cluster-dask-huge-b3a90f26-7vjc Ready 29m v1.24.5-gke.600 gke-leap-cluster-dask-huge-b3a90f26-84d6 Ready 28m v1.24.5-gke.600 gke-leap-cluster-dask-huge-b3a90f26-88rc Ready 29m v1.24.5-gke.600 gke-leap-cluster-dask-huge-b3a90f26-8hks Ready 29m v1.24.5-gke.600 gke-leap-cluster-dask-huge-b3a90f26-8pzd Ready 29m v1.24.5-gke.600 gke-leap-cluster-dask-huge-b3a90f26-8s7b Ready 29m v1.24.5-gke.600 gke-leap-cluster-dask-huge-b3a90f26-8wvn Ready 28m v1.24.5-gke.600 gke-leap-cluster-dask-huge-b3a90f26-9js4 Ready 29m v1.24.5-gke.600 gke-leap-cluster-dask-huge-b3a90f26-9l9h Ready 29m v1.24.5-gke.600 gke-leap-cluster-dask-huge-b3a90f26-9vbf Ready 29m v1.24.5-gke.600 
gke-leap-cluster-dask-huge-b3a90f26-9w2f Ready 29m v1.24.5-gke.600 gke-leap-cluster-dask-huge-b3a90f26-b7lc Ready 29m v1.24.5-gke.600 gke-leap-cluster-dask-huge-b3a90f26-b86b Ready 29m v1.24.5-gke.600 gke-leap-cluster-dask-huge-b3a90f26-bknz Ready 29m v1.24.5-gke.600 gke-leap-cluster-dask-huge-b3a90f26-blnl Ready 29m v1.24.5-gke.600 gke-leap-cluster-dask-huge-b3a90f26-bmrb Ready 29m v1.24.5-gke.600 gke-leap-cluster-dask-huge-b3a90f26-brzw Ready 29m v1.24.5-gke.600 gke-leap-cluster-dask-huge-b3a90f26-c4mc Ready 29m v1.24.5-gke.600 gke-leap-cluster-dask-huge-b3a90f26-c58s Ready 29m v1.24.5-gke.600 gke-leap-cluster-dask-huge-b3a90f26-c68s Ready 29m v1.24.5-gke.600 gke-leap-cluster-dask-huge-b3a90f26-c92c Ready 29m v1.24.5-gke.600 gke-leap-cluster-dask-huge-b3a90f26-clbd Ready 29m v1.24.5-gke.600 gke-leap-cluster-dask-huge-b3a90f26-cnjz Ready 29m v1.24.5-gke.600 gke-leap-cluster-dask-huge-b3a90f26-cnr4 Ready 29m v1.24.5-gke.600 gke-leap-cluster-dask-huge-b3a90f26-djjm Ready 28m v1.24.5-gke.600 gke-leap-cluster-dask-huge-b3a90f26-djz2 Ready 28m v1.24.5-gke.600 gke-leap-cluster-dask-huge-b3a90f26-dppr Ready 29m v1.24.5-gke.600 gke-leap-cluster-dask-huge-b3a90f26-dr7q Ready 29m v1.24.5-gke.600 gke-leap-cluster-dask-huge-b3a90f26-dv9d Ready 29m v1.24.5-gke.600 gke-leap-cluster-dask-huge-b3a90f26-dvxx Ready 29m v1.24.5-gke.600 gke-leap-cluster-dask-huge-b3a90f26-f4w9 Ready 29m v1.24.5-gke.600 gke-leap-cluster-dask-huge-b3a90f26-f7mm Ready 28m v1.24.5-gke.600 gke-leap-cluster-dask-huge-b3a90f26-f9mc Ready 29m v1.24.5-gke.600 gke-leap-cluster-dask-huge-b3a90f26-fbpp Ready 29m v1.24.5-gke.600 gke-leap-cluster-dask-huge-b3a90f26-fj2z Ready 28m v1.24.5-gke.600 gke-leap-cluster-dask-huge-b3a90f26-fljh Ready 29m v1.24.5-gke.600 gke-leap-cluster-dask-huge-b3a90f26-fmr2 Ready 29m v1.24.5-gke.600 gke-leap-cluster-dask-huge-b3a90f26-fr2c Ready 28m v1.24.5-gke.600 gke-leap-cluster-dask-huge-b3a90f26-fwnp Ready 28m v1.24.5-gke.600 gke-leap-cluster-dask-huge-b3a90f26-g6c9 Ready 28m v1.24.5-gke.600 gke-leap-cluster-dask-huge-b3a90f26-g6ph Ready 29m v1.24.5-gke.600 gke-leap-cluster-dask-huge-b3a90f26-gdll Ready 28m v1.24.5-gke.600 gke-leap-cluster-dask-huge-b3a90f26-gf6p Ready 29m v1.24.5-gke.600 gke-leap-cluster-dask-huge-b3a90f26-gfbl Ready 29m v1.24.5-gke.600 gke-leap-cluster-dask-huge-b3a90f26-gkrr Ready 28m v1.24.5-gke.600 gke-leap-cluster-dask-huge-b3a90f26-glvb Ready 29m v1.24.5-gke.600 gke-leap-cluster-dask-huge-b3a90f26-grp2 Ready 29m v1.24.5-gke.600 gke-leap-cluster-dask-huge-b3a90f26-gvpq Ready 29m v1.24.5-gke.600 gke-leap-cluster-dask-huge-b3a90f26-gxkv Ready 29m v1.24.5-gke.600 gke-leap-cluster-dask-huge-b3a90f26-gzbd Ready 29m v1.24.5-gke.600 gke-leap-cluster-dask-huge-b3a90f26-h67b Ready 28m v1.24.5-gke.600 gke-leap-cluster-dask-huge-b3a90f26-h9kc Ready 28m v1.24.5-gke.600 gke-leap-cluster-dask-huge-b3a90f26-hkd6 Ready 29m v1.24.5-gke.600 gke-leap-cluster-dask-huge-b3a90f26-hkwm Ready 29m v1.24.5-gke.600 gke-leap-cluster-dask-huge-b3a90f26-hrwh Ready 29m v1.24.5-gke.600 gke-leap-cluster-dask-huge-b3a90f26-hrzd Ready 28m v1.24.5-gke.600 gke-leap-cluster-dask-huge-b3a90f26-hvpb Ready 28m v1.24.5-gke.600 gke-leap-cluster-dask-huge-b3a90f26-j729 Ready 28m v1.24.5-gke.600 gke-leap-cluster-dask-huge-b3a90f26-jbqb Ready 29m v1.24.5-gke.600 gke-leap-cluster-dask-huge-b3a90f26-jglq Ready 29m v1.24.5-gke.600 gke-leap-cluster-dask-huge-b3a90f26-jk72 Ready 29m v1.24.5-gke.600 gke-leap-cluster-dask-huge-b3a90f26-jkj5 Ready 28m v1.24.5-gke.600 gke-leap-cluster-dask-huge-b3a90f26-jlfc Ready 29m v1.24.5-gke.600 
gke-leap-cluster-dask-huge-b3a90f26-js67 Ready 29m v1.24.5-gke.600 gke-leap-cluster-dask-huge-b3a90f26-jsvf Ready 28m v1.24.5-gke.600 gke-leap-cluster-dask-huge-b3a90f26-k7rp Ready 29m v1.24.5-gke.600 gke-leap-cluster-dask-huge-b3a90f26-k8k9 Ready 29m v1.24.5-gke.600 gke-leap-cluster-dask-huge-b3a90f26-k9kg Ready 28m v1.24.5-gke.600 gke-leap-cluster-dask-huge-b3a90f26-kn65 Ready 29m v1.24.5-gke.600 gke-leap-cluster-dask-huge-b3a90f26-kqdw Ready 29m v1.24.5-gke.600 gke-leap-cluster-dask-huge-b3a90f26-kvlm Ready 29m v1.24.5-gke.600 gke-leap-cluster-dask-huge-b3a90f26-l7j6 Ready 29m v1.24.5-gke.600 gke-leap-cluster-dask-huge-b3a90f26-lpdm Ready 29m v1.24.5-gke.600 gke-leap-cluster-dask-huge-b3a90f26-lrx4 Ready 29m v1.24.5-gke.600 gke-leap-cluster-dask-huge-b3a90f26-lwzr Ready 28m v1.24.5-gke.600 gke-leap-cluster-dask-huge-b3a90f26-m2kg Ready 29m v1.24.5-gke.600 gke-leap-cluster-dask-huge-b3a90f26-m6hj Ready 28m v1.24.5-gke.600 gke-leap-cluster-dask-huge-b3a90f26-m7vx Ready 28m v1.24.5-gke.600 gke-leap-cluster-dask-huge-b3a90f26-mk8k Ready 28m v1.24.5-gke.600 gke-leap-cluster-dask-huge-b3a90f26-mt8t Ready 29m v1.24.5-gke.600 gke-leap-cluster-dask-huge-b3a90f26-n5w6 Ready 28m v1.24.5-gke.600 gke-leap-cluster-dask-huge-b3a90f26-n8js Ready 28m v1.24.5-gke.600 gke-leap-cluster-dask-huge-b3a90f26-n92c Ready 29m v1.24.5-gke.600 gke-leap-cluster-dask-huge-b3a90f26-ncgt Ready 29m v1.24.5-gke.600 gke-leap-cluster-dask-huge-b3a90f26-nfws Ready 29m v1.24.5-gke.600 gke-leap-cluster-dask-huge-b3a90f26-nkw4 Ready 28m v1.24.5-gke.600 gke-leap-cluster-dask-huge-b3a90f26-nqqx Ready 29m v1.24.5-gke.600 gke-leap-cluster-dask-huge-b3a90f26-nqt5 Ready 29m v1.24.5-gke.600 gke-leap-cluster-dask-huge-b3a90f26-nr6s Ready 29m v1.24.5-gke.600 gke-leap-cluster-dask-huge-b3a90f26-nx2v Ready 29m v1.24.5-gke.600 gke-leap-cluster-dask-huge-b3a90f26-p7sj Ready 29m v1.24.5-gke.600 gke-leap-cluster-dask-huge-b3a90f26-pdkm Ready 29m v1.24.5-gke.600 gke-leap-cluster-dask-huge-b3a90f26-pn8q Ready 29m v1.24.5-gke.600 gke-leap-cluster-dask-huge-b3a90f26-ppd8 Ready 28m v1.24.5-gke.600 gke-leap-cluster-dask-huge-b3a90f26-q8nt Ready 28m v1.24.5-gke.600 gke-leap-cluster-dask-huge-b3a90f26-q9cb Ready 28m v1.24.5-gke.600 gke-leap-cluster-dask-huge-b3a90f26-qb6k Ready 29m v1.24.5-gke.600 gke-leap-cluster-dask-huge-b3a90f26-qg9r Ready 28m v1.24.5-gke.600 gke-leap-cluster-dask-huge-b3a90f26-qh9g Ready 28m v1.24.5-gke.600 gke-leap-cluster-dask-huge-b3a90f26-qm5b Ready 29m v1.24.5-gke.600 gke-leap-cluster-dask-huge-b3a90f26-qngf Ready 29m v1.24.5-gke.600 gke-leap-cluster-dask-huge-b3a90f26-qpdb Ready 29m v1.24.5-gke.600 gke-leap-cluster-dask-huge-b3a90f26-qrmv Ready 29m v1.24.5-gke.600 gke-leap-cluster-dask-huge-b3a90f26-qtz6 Ready 29m v1.24.5-gke.600 gke-leap-cluster-dask-huge-b3a90f26-qv46 Ready 28m v1.24.5-gke.600 gke-leap-cluster-dask-huge-b3a90f26-qv8q Ready 29m v1.24.5-gke.600 gke-leap-cluster-dask-huge-b3a90f26-qxnf Ready 29m v1.24.5-gke.600 gke-leap-cluster-dask-huge-b3a90f26-qzcq Ready 29m v1.24.5-gke.600 gke-leap-cluster-dask-huge-b3a90f26-r4f7 Ready 29m v1.24.5-gke.600 gke-leap-cluster-dask-huge-b3a90f26-r826 Ready 28m v1.24.5-gke.600 gke-leap-cluster-dask-huge-b3a90f26-rbw7 Ready 29m v1.24.5-gke.600 gke-leap-cluster-dask-huge-b3a90f26-rg2m Ready 29m v1.24.5-gke.600 gke-leap-cluster-dask-huge-b3a90f26-rg5l Ready 29m v1.24.5-gke.600 gke-leap-cluster-dask-huge-b3a90f26-rgv7 Ready 29m v1.24.5-gke.600 gke-leap-cluster-dask-huge-b3a90f26-rqsz Ready 29m v1.24.5-gke.600 gke-leap-cluster-dask-huge-b3a90f26-rrww Ready 29m v1.24.5-gke.600 
gke-leap-cluster-dask-huge-b3a90f26-rtrh Ready 29m v1.24.5-gke.600 gke-leap-cluster-dask-huge-b3a90f26-s7x5 Ready 28m v1.24.5-gke.600 gke-leap-cluster-dask-huge-b3a90f26-sk5v Ready 29m v1.24.5-gke.600 gke-leap-cluster-dask-huge-b3a90f26-slbj Ready 29m v1.24.5-gke.600 gke-leap-cluster-dask-huge-b3a90f26-srpz Ready 29m v1.24.5-gke.600 gke-leap-cluster-dask-huge-b3a90f26-ss25 Ready 29m v1.24.5-gke.600 gke-leap-cluster-dask-huge-b3a90f26-swlb Ready 29m v1.24.5-gke.600 gke-leap-cluster-dask-huge-b3a90f26-swqf Ready 28m v1.24.5-gke.600 gke-leap-cluster-dask-huge-b3a90f26-t2k9 Ready 29m v1.24.5-gke.600 gke-leap-cluster-dask-huge-b3a90f26-t68d Ready 29m v1.24.5-gke.600 gke-leap-cluster-dask-huge-b3a90f26-t7wg Ready 28m v1.24.5-gke.600 gke-leap-cluster-dask-huge-b3a90f26-t92m Ready 29m v1.24.5-gke.600 gke-leap-cluster-dask-huge-b3a90f26-td44 Ready 29m v1.24.5-gke.600 gke-leap-cluster-dask-huge-b3a90f26-tdbk Ready 29m v1.24.5-gke.600 gke-leap-cluster-dask-huge-b3a90f26-thzt Ready 29m v1.24.5-gke.600 gke-leap-cluster-dask-huge-b3a90f26-tmxx Ready 29m v1.24.5-gke.600 gke-leap-cluster-dask-huge-b3a90f26-tzs6 Ready 29m v1.24.5-gke.600 gke-leap-cluster-dask-huge-b3a90f26-vbrd Ready 28m v1.24.5-gke.600 gke-leap-cluster-dask-huge-b3a90f26-vc6g Ready 29m v1.24.5-gke.600 gke-leap-cluster-dask-huge-b3a90f26-vhwq Ready 28m v1.24.5-gke.600 gke-leap-cluster-dask-huge-b3a90f26-vkk2 Ready 29m v1.24.5-gke.600 gke-leap-cluster-dask-huge-b3a90f26-vnxq Ready 29m v1.24.5-gke.600 gke-leap-cluster-dask-huge-b3a90f26-vsp5 Ready 28m v1.24.5-gke.600 gke-leap-cluster-dask-huge-b3a90f26-vw6w Ready 29m v1.24.5-gke.600 gke-leap-cluster-dask-huge-b3a90f26-vxfp Ready 28m v1.24.5-gke.600 gke-leap-cluster-dask-huge-b3a90f26-wbpj Ready 29m v1.24.5-gke.600 gke-leap-cluster-dask-huge-b3a90f26-wkb5 Ready 28m v1.24.5-gke.600 gke-leap-cluster-dask-huge-b3a90f26-wmj8 Ready 28m v1.24.5-gke.600 gke-leap-cluster-dask-huge-b3a90f26-wq8f Ready 29m v1.24.5-gke.600 gke-leap-cluster-dask-huge-b3a90f26-wrml Ready 29m v1.24.5-gke.600 gke-leap-cluster-dask-huge-b3a90f26-wwqk Ready 29m v1.24.5-gke.600 gke-leap-cluster-dask-huge-b3a90f26-x4rz Ready 29m v1.24.5-gke.600 gke-leap-cluster-dask-huge-b3a90f26-x972 Ready 29m v1.24.5-gke.600 gke-leap-cluster-dask-huge-b3a90f26-xd4s Ready 29m v1.24.5-gke.600 gke-leap-cluster-dask-huge-b3a90f26-xl7d Ready 29m v1.24.5-gke.600 gke-leap-cluster-dask-huge-b3a90f26-xlhh Ready 29m v1.24.5-gke.600 gke-leap-cluster-dask-huge-b3a90f26-xlr8 Ready 29m v1.24.5-gke.600 gke-leap-cluster-dask-huge-b3a90f26-xp58 Ready 29m v1.24.5-gke.600 gke-leap-cluster-dask-huge-b3a90f26-xxrq Ready 29m v1.24.5-gke.600 gke-leap-cluster-dask-huge-b3a90f26-z2mf Ready 29m v1.24.5-gke.600 gke-leap-cluster-dask-huge-b3a90f26-z6c7 Ready 29m v1.24.5-gke.600 gke-leap-cluster-dask-huge-b3a90f26-z6xh Ready 29m v1.24.5-gke.600 gke-leap-cluster-dask-huge-b3a90f26-zdbs Ready 28m v1.24.5-gke.600 gke-leap-cluster-dask-huge-b3a90f26-zjh4 Ready 29m v1.24.5-gke.600 gke-leap-cluster-dask-huge-b3a90f26-zwkn Ready 29m v1.24.5-gke.600 gke-leap-cluster-dask-huge-b3a90f26-zwmc Ready 29m v1.24.5-gke.600 gke-leap-cluster-dask-huge-b3a90f26-zx6b Ready 29m v1.24.5-gke.600 gke-leap-cluster-nb-huge-cf099c13-2zkf Ready 58m v1.24.5-gke.600 gke-leap-cluster-nb-huge-cf099c13-zssb Ready 5h7m v1.24.5-gke.600 gke-leap-cluster-nb-large-3ab858ce-gmz9 Ready 20m v1.24.5-gke.600 gke-leap-cluster-nb-large-3ab858ce-l7dv Ready 34m v1.24.5-gke.600 gke-leap-cluster-nb-medium-b9c8ba20-7thc Ready 131m v1.24.5-gke.600 gke-leap-cluster-nb-medium-b9c8ba20-hrhk Ready 3h15m v1.24.5-gke.600 
gke-leap-cluster-nb-medium-b9c8ba20-wjmj Ready 4h15m v1.24.5-gke.600
```
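Nodes are cluster-scoped, so the `-n support` flag has no effect here: this lists every node in the cluster, including all the dask worker nodes. To scope the view to one pool, something like the following should work on GKE (the pool name `core-pool` is inferred from the node names above):

```bash
# List only the core-pool nodes via GKE's node-pool label
kubectl get nodes -l cloud.google.com/gke-nodepool=core-pool

# Count nodes per pool to spot a runaway scale-up at a glance
kubectl get nodes \
  -o custom-columns='POOL:.metadata.labels.cloud\.google\.com/gke-nodepool' \
  --no-headers | sort | uniq -c
```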
Adding a node to the core pool to see if that gets prometheus scheduled.
Added; the server is back up.
@damianavila and @yuvipanda, tagging you here as the formal on-calls. I've added a node (from 5 to 6) to gke-leap-cluster-core-pool-1cc6bf7d-grp; this clears the paging condition, but I've not done an extensive analysis of why this occurred. TODO:
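For reference, a resize like the one described can be done with gcloud; a sketch, with the location flag illustrative (and since the pool size may be terraform-managed, as later comments suggest, the change should land there too to avoid drift):

```bash
# Grow the core pool from 5 to 6 nodes.
# Cluster and pool names are taken from this thread; --region is an assumption.
gcloud container clusters resize leap-cluster \
  --node-pool core-pool \
  --num-nodes 6 \
  --region us-central1
```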
Looking at the events, it looks like the pod was preempted (logs below). Scheduling then failed, and scale-up failed / was not triggered: max node group size reached.
```
k get events -n support
LAST SEEN  TYPE     REASON                  OBJECT                                   MESSAGE
26m Normal Scheduled pod/support-cryptnono-6rmkq Successfully assigned support/support-cryptnono-6rmkq to gke-leap-cluster-core-pool-1cc6bf7d-tmq0
26m Warning FailedCreatePodSandBox pod/support-cryptnono-6rmkq Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "f53c22cf29bd154c5839d963af765266c5c2076f94b490a251100ab3693f5c7e": plugin type="calico" failed (add): stat /var/lib/calico/nodename: no such file or directory: check that the calico/node container is running and has mounted /var/lib/calico/
26m Normal Pulling pod/support-cryptnono-6rmkq Pulling image "yuvipanda/kubectl-trace-init:9af18f1197b1b377be4816c2761049745b64c6b8"
26m Normal Pulled pod/support-cryptnono-6rmkq Successfully pulled image "yuvipanda/kubectl-trace-init:9af18f1197b1b377be4816c2761049745b64c6b8" in 6.317139676s
26m Normal Created pod/support-cryptnono-6rmkq Created container kubectl-trace-init
26m Normal Started pod/support-cryptnono-6rmkq Started container kubectl-trace-init
24m Normal Pulling pod/support-cryptnono-6rmkq Pulling image "quay.io/iovisor/bpftrace:1c295d899035aaded34d5861375051ec1add1d3a-vanilla_llvm12_clang_glibc2.23"
24m Normal Pulled pod/support-cryptnono-6rmkq Successfully pulled image "quay.io/iovisor/bpftrace:1c295d899035aaded34d5861375051ec1add1d3a-vanilla_llvm12_clang_glibc2.23" in 3.724225512s
24m Normal Created pod/support-cryptnono-6rmkq Created container trace
24m Normal Started pod/support-cryptnono-6rmkq Started container trace
18m Normal Killing pod/support-cryptnono-jxqgh Stopping container trace
18m Warning FailedPreStopHook pod/support-cryptnono-jxqgh Exec lifecycle hook ([/bin/bash -c kill -SIGINT $(pidof bpftrace) && sleep 30]) for Container "trace" in Pod "support-cryptnono-jxqgh_support(668e877f-e240-454f-b701-969e684aaa12)" failed - error: command '/bin/bash -c kill -SIGINT $(pidof bpftrace) && sleep 30' exited with 137: , message: ""
16m Normal Scheduled pod/support-cryptnono-k9lvj Successfully assigned support/support-cryptnono-k9lvj to gke-leap-cluster-dask-huge-b3a90f26-574h
16m Warning FailedCreatePodSandBox pod/support-cryptnono-k9lvj Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "68bd292ac298b53878944d07d1fdb1a4b221ab7b2a4269474d2c80ffe787bb88": plugin type="calico" failed (add): stat /var/lib/calico/nodename: no such file or directory: check that the calico/node container is running and has mounted /var/lib/calico/
16m Normal Pulling pod/support-cryptnono-k9lvj Pulling image "yuvipanda/kubectl-trace-init:9af18f1197b1b377be4816c2761049745b64c6b8"
16m Normal Pulled pod/support-cryptnono-k9lvj Successfully pulled image "yuvipanda/kubectl-trace-init:9af18f1197b1b377be4816c2761049745b64c6b8" in 4.904876655s
16m Normal Created pod/support-cryptnono-k9lvj Created container kubectl-trace-init
16m Normal Started pod/support-cryptnono-k9lvj Started container kubectl-trace-init
15m Normal Pulling pod/support-cryptnono-k9lvj Pulling image "quay.io/iovisor/bpftrace:1c295d899035aaded34d5861375051ec1add1d3a-vanilla_llvm12_clang_glibc2.23"
15m Normal Pulled pod/support-cryptnono-k9lvj Successfully pulled image "quay.io/iovisor/bpftrace:1c295d899035aaded34d5861375051ec1add1d3a-vanilla_llvm12_clang_glibc2.23" in 3.016818465s
15m Normal Created pod/support-cryptnono-k9lvj Created container trace
15m Normal Started pod/support-cryptnono-k9lvj Started container trace
57m Normal Scheduled pod/support-cryptnono-pqfxr Successfully assigned support/support-cryptnono-pqfxr to gke-leap-cluster-nb-large-3ab858ce-gmz9
57m Warning FailedCreatePodSandBox pod/support-cryptnono-pqfxr Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "e87524ae545827b1bb95712db0b0d42131f31faf0bb35d58f9cc798d40c46b41": plugin type="calico" failed (add): stat /var/lib/calico/nodename: no such file or directory: check that the calico/node container is running and has mounted /var/lib/calico/
57m Normal Pulling pod/support-cryptnono-pqfxr Pulling image "yuvipanda/kubectl-trace-init:9af18f1197b1b377be4816c2761049745b64c6b8"
53m Normal Pulled pod/support-cryptnono-pqfxr Successfully pulled image "yuvipanda/kubectl-trace-init:9af18f1197b1b377be4816c2761049745b64c6b8" in 3m50.935963258s
53m Normal Created pod/support-cryptnono-pqfxr Created container kubectl-trace-init
53m Normal Started pod/support-cryptnono-pqfxr Started container kubectl-trace-init
51m Normal Pulling pod/support-cryptnono-pqfxr Pulling image "quay.io/iovisor/bpftrace:1c295d899035aaded34d5861375051ec1add1d3a-vanilla_llvm12_clang_glibc2.23"
49m Normal Pulled pod/support-cryptnono-pqfxr Successfully pulled image "quay.io/iovisor/bpftrace:1c295d899035aaded34d5861375051ec1add1d3a-vanilla_llvm12_clang_glibc2.23" in 1m29.808725295s
49m Normal Created pod/support-cryptnono-pqfxr Created container trace
49m Normal Started pod/support-cryptnono-pqfxr Started container trace
57m Normal SuccessfulCreate daemonset/support-cryptnono (combined from similar events): Created pod: support-cryptnono-pqfxr
26m Normal SuccessfulCreate daemonset/support-cryptnono Created pod: support-cryptnono-6rmkq
18m Warning FailedDaemonPod daemonset/support-cryptnono Found failed daemon pod support/support-cryptnono-jxqgh on node gke-leap-cluster-dask-huge-b3a90f26-574h, will try to kill it
18m Normal SuccessfulDelete daemonset/support-cryptnono Deleted pod: support-cryptnono-jxqgh
16m Normal SuccessfulCreate daemonset/support-cryptnono Created pod: support-cryptnono-k9lvj
16m Normal UpdatedLoadBalancer service/support-ingress-nginx-controller Updated load balancer with new hosts
26m Normal Scheduled pod/support-prometheus-node-exporter-bzh9k Successfully assigned support/support-prometheus-node-exporter-bzh9k to gke-leap-cluster-core-pool-1cc6bf7d-tmq0
26m Normal Pulling pod/support-prometheus-node-exporter-bzh9k Pulling image "quay.io/prometheus/node-exporter:v1.5.0"
26m Normal Pulled pod/support-prometheus-node-exporter-bzh9k Successfully pulled image "quay.io/prometheus/node-exporter:v1.5.0" in 11.452735399s
26m Normal Created pod/support-prometheus-node-exporter-bzh9k Created container node-exporter
26m Normal Started pod/support-prometheus-node-exporter-bzh9k Started container node-exporter
16m Normal Scheduled pod/support-prometheus-node-exporter-hnhmf Successfully assigned support/support-prometheus-node-exporter-hnhmf to gke-leap-cluster-dask-huge-b3a90f26-574h
16m Normal Pulling pod/support-prometheus-node-exporter-hnhmf Pulling image "quay.io/prometheus/node-exporter:v1.5.0"
16m Normal Pulled pod/support-prometheus-node-exporter-hnhmf Successfully pulled image "quay.io/prometheus/node-exporter:v1.5.0" in 4.595472136s
16m Normal Created pod/support-prometheus-node-exporter-hnhmf Created container node-exporter
16m Normal Started pod/support-prometheus-node-exporter-hnhmf Started container node-exporter
18m Normal Killing pod/support-prometheus-node-exporter-pgd4t Stopping container node-exporter
57m Normal Scheduled pod/support-prometheus-node-exporter-s967n Successfully assigned support/support-prometheus-node-exporter-s967n to gke-leap-cluster-nb-large-3ab858ce-gmz9
57m Normal Pulling pod/support-prometheus-node-exporter-s967n Pulling image "quay.io/prometheus/node-exporter:v1.5.0"
57m Normal Pulled pod/support-prometheus-node-exporter-s967n Successfully pulled image "quay.io/prometheus/node-exporter:v1.5.0" in 5.175019332s
57m Normal Created pod/support-prometheus-node-exporter-s967n Created container node-exporter
57m Normal Started pod/support-prometheus-node-exporter-s967n Started container node-exporter
57m Normal SuccessfulCreate daemonset/support-prometheus-node-exporter (combined from similar events): Created pod: support-prometheus-node-exporter-s967n
26m Normal SuccessfulCreate daemonset/support-prometheus-node-exporter Created pod: support-prometheus-node-exporter-bzh9k
18m Warning FailedDaemonPod daemonset/support-prometheus-node-exporter Found failed daemon pod support/support-prometheus-node-exporter-pgd4t on node gke-leap-cluster-dask-huge-b3a90f26-574h, will try to kill it
18m Normal SuccessfulDelete daemonset/support-prometheus-node-exporter Deleted pod: support-prometheus-node-exporter-pgd4t
16m Normal SuccessfulCreate daemonset/support-prometheus-node-exporter Created pod: support-prometheus-node-exporter-hnhmf
57m Normal Preempted pod/support-prometheus-server-548d55b8bc-hfkvk Preempted by kube-system/calico-node-rnths on node gke-leap-cluster-core-pool-1cc6bf7d-g77w
57m Normal Killing pod/support-prometheus-server-548d55b8bc-hfkvk Stopping container prometheus-server-configmap-reload
57m Normal Killing pod/support-prometheus-server-548d55b8bc-hfkvk Stopping container prometheus-server
57m Warning FailedScheduling pod/support-prometheus-server-548d55b8bc-mhppb 0/212 nodes are available: 200 node(s) had untolerated taint {k8s.dask.org_dedicated: worker}, 5 Insufficient cpu, 7 node(s) had untolerated taint {hub.jupyter.org_dedicated: user}, 8 Insufficient memory. preemption: 0/212 nodes are available: 207 Preemption is not helpful for scheduling, 5 No preemption victims found for incoming pod.
57m Normal NotTriggerScaleUp pod/support-prometheus-server-548d55b8bc-mhppb pod didn't trigger scale-up: 4 node(s) had untolerated taint {hub.jupyter.org_dedicated: user}, 2 node(s) had untolerated taint {k8s.dask.org_dedicated: worker}, 2 Insufficient memory, 1 Insufficient cpu, 2 max node group size reached
52m Normal NotTriggerScaleUp pod/support-prometheus-server-548d55b8bc-mhppb pod didn't trigger scale-up: 3 node(s) had untolerated taint {k8s.dask.org_dedicated: worker}, 5 node(s) had untolerated taint {hub.jupyter.org_dedicated: user}, 2 max node group size reached
54m Warning FailedScheduling pod/support-prometheus-server-548d55b8bc-mhppb 0/212 nodes are available: 200 node(s) had untolerated taint {k8s.dask.org_dedicated: worker}, 5 Insufficient cpu, 7 Insufficient memory, 7 node(s) had untolerated taint {hub.jupyter.org_dedicated: user}. preemption: 0/212 nodes are available: 207 Preemption is not helpful for scheduling, 5 No preemption victims found for incoming pod.
36m Normal NotTriggerScaleUp pod/support-prometheus-server-548d55b8bc-mhppb pod didn't trigger scale-up: 5 node(s) had untolerated taint {hub.jupyter.org_dedicated: user}, 3 node(s) had untolerated taint {k8s.dask.org_dedicated: worker}, 2 max node group size reached
31m Normal NotTriggerScaleUp pod/support-prometheus-server-548d55b8bc-mhppb (combined from similar events): pod didn't trigger scale-up: 2 node(s) had untolerated taint {k8s.dask.org_dedicated: worker}, 1 Insufficient cpu, 1 Insufficient memory, 2 max node group size reached, 5 node(s) had untolerated taint {hub.jupyter.org_dedicated: user}
29m Warning FailedScheduling pod/support-prometheus-server-548d55b8bc-mhppb 0/212 nodes are available: 11 Insufficient cpu, 200 node(s) had untolerated taint {k8s.dask.org_dedicated: worker}, 7 Insufficient memory, 7 node(s) had untolerated taint {hub.jupyter.org_dedicated: user}. preemption: 0/212 nodes are available: 207 Preemption is not helpful for scheduling, 5 No preemption victims found for incoming pod.
26m Warning FailedScheduling pod/support-prometheus-server-548d55b8bc-mhppb 0/213 nodes are available: 1 node(s) had volume node affinity conflict, 11 Insufficient cpu, 200 node(s) had untolerated taint {k8s.dask.org_dedicated: worker}, 7 Insufficient memory, 7 node(s) had untolerated taint {hub.jupyter.org_dedicated: user}. preemption: 0/213 nodes are available: 208 Preemption is not helpful for scheduling, 5 No preemption victims found for incoming pod.
26m Normal SuccessfulAttachVolume pod/support-prometheus-server-548d55b8bc-mhppb AttachVolume.Attach succeeded for volume "pvc-9f8c30e3-b63f-4c82-a819-80ea211e77fe"
26m Normal Pulling pod/support-prometheus-server-548d55b8bc-mhppb Pulling image "jimmidyson/configmap-reload:v0.8.0"
26m Normal Pulled pod/support-prometheus-server-548d55b8bc-mhppb Successfully pulled image "jimmidyson/configmap-reload:v0.8.0" in 950.407348ms
26m Normal Created pod/support-prometheus-server-548d55b8bc-mhppb Created container prometheus-server-configmap-reload
26m Normal Started pod/support-prometheus-server-548d55b8bc-mhppb Started container prometheus-server-configmap-reload
26m Normal Pulling pod/support-prometheus-server-548d55b8bc-mhppb Pulling image "quay.io/prometheus/prometheus:v2.41.0"
26m Normal Pulled pod/support-prometheus-server-548d55b8bc-mhppb Successfully pulled image "quay.io/prometheus/prometheus:v2.41.0" in 5.241799705s
26m Normal Created pod/support-prometheus-server-548d55b8bc-mhppb Created container prometheus-server
26m Normal Started pod/support-prometheus-server-548d55b8bc-mhppb Started container prometheus-server
24m Warning Unhealthy pod/support-prometheus-server-548d55b8bc-mhppb Readiness probe failed: HTTP probe failed with statuscode: 503
57m Normal SuccessfulCreate replicaset/support-prometheus-server-548d55b8bc Created pod: support-prometheus-server-548d55b8bc-mhppb
```
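When triaging this, the noise can be cut by filtering the events to the stuck pod; a sketch using the pod name from the log above:

```bash
# Only events involving the unschedulable prometheus-server pod, oldest first
kubectl get events -n support \
  --field-selector involvedObject.name=support-prometheus-server-548d55b8bc-mhppb \
  --sort-by=.lastTimestamp
```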
**Guessed diagnosis**

**Mitigation ideas**
@consideRatio thanks for your input. I was planning to look at this further today, as the issue came in late in my day yesterday. Can you walk me through the diagnosis steps used to build the timeline for (1), e.g. what logs and graphs you consulted, commands you ran, etc.? This would be helpful for learning effective debugging techniques for the system.
In this case, I extracted information you had provided above and deduced most things from that, based on previous experience. In the k8s Events associated with the prometheus-server pod, we are informed that a huge number of nodes are running; this could also be discovered via https://grafana.leap.2i2c.cloud under the "Cluster something" dashboard that lists nodes.
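The node explosion is also visible straight from Prometheus once it is back up; a sketch, assuming kube-state-metrics is installed and scraped (`kube_node_info` is its standard per-node metric):

```bash
# Port-forward to the prometheus-server deployment seen in the events above
kubectl -n support port-forward deploy/support-prometheus-server 9090:9090 &

# Count the nodes currently registered with the cluster
curl -s 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=count(kube_node_info)'
```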
Looking at this again, it seems that
Autoscaler logs are in GCP and can be queried, e.g.:
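Something along these lines should pull the scale-up/scale-down decision records (a sketch: the project ID is illustrative, and the filter targets GKE's cluster-autoscaler visibility logs):

```bash
# Read recent cluster-autoscaler decision logs for the cluster.
# --project is an assumption; use the LEAP GCP project ID.
gcloud logging read '
  resource.type="k8s_cluster"
  resource.labels.cluster_name="leap-cluster"
  logName:"cluster-autoscaler-visibility"
' --project=leap-gcp-project --freshness=1d --limit=50
```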
Oh, it looks like the node pool has already been reduced, but I don't see an update in this issue (or the checklist item checked) for doing this.
Time of the downsize is 06:16 UTC.
Ok terraform |
I only see one scale-down in the time window.
Given that the autoscaler has downsized and the mitigation issues have been filed as per #2248 (comment), I am closing this as no longer active.
The Prometheus server on the LEAP cluster is down.
Reported via ticket https://2i2c.freshdesk.com/a/tickets/482
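A first-response check could look like this (a sketch: the deployment name is taken from the events later in this thread, 9090 is the Prometheus default port, and /-/ready is Prometheus's built-in readiness endpoint):

```bash
# Is the prometheus-server pod running and ready?
kubectl -n support get pods | grep prometheus-server

# If it is running, port-forward and probe Prometheus directly
kubectl -n support port-forward deploy/support-prometheus-server 9090:9090 &
curl -s http://localhost:9090/-/ready
```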