This repository has been archived by the owner on Mar 23, 2020. It is now read-only.

master don't come back online after reboot in 99_post_install - etcd member pod not running #100

Open
sreichar opened this issue Sep 9, 2019 · 8 comments

Comments

@sreichar
Collaborator

sreichar commented Sep 9, 2019

[kni@worker-0 logs]$ cat 99_post_install-2019-09-07-190210.log

  • source common.sh
    ++++ dirname common.sh
    +++ cd .
    +++ pwd
    ++ SCRIPTDIR=/home/kni/install-scripts/OpenShift
    +++ whoami
    ++ USER=kni
    ++ '[' -z '' ']'
    ++ '[' -f /home/kni/install-scripts/OpenShift/config_kni.sh ']'
    ++ echo 'Using CONFIG /home/kni/install-scripts/OpenShift/config_kni.sh'
    Using CONFIG /home/kni/install-scripts/OpenShift/config_kni.sh
    ++ CONFIG=/home/kni/install-scripts/OpenShift/config_kni.sh
    ++ source /home/kni/install-scripts/OpenShift/config_kni.sh
    +++ set +x
    +++ export INT_IF=eno2
    +++ INT_IF=eno2
    +++ export PRO_IF=eno1
    +++ PRO_IF=eno1
    ++ export LIBVIRT_DEFAULT_URI=qemu:///system
    ++ LIBVIRT_DEFAULT_URI=qemu:///system
    ++ '[' kni '!=' root -a /run/user/1000 == /run/user/0 ']'
    ++ sudo -n uptime
    +++ awk -F= '/^VERSION_ID=/ { print $2 }' /etc/os-release
    +++ cut -f1 -d.
    +++ tr -d '"'
    ++ VER=8
    +++ tr -d '"'
    +++ awk -F= '/^ID=/ { print $2 }' /etc/os-release
    ++ [[ rhel != \r\h\e\l ]]
    ++ [[ 8 -ne 8 ]]
    ++ '[' 3940 = 0 ']'
  • export KUBECONFIG=ocp/auth/kubeconfig
  • KUBECONFIG=ocp/auth/kubeconfig
  • POSTINSTALL_ASSETS_DIR=./assets/post-install
  • IFCFG_INTERFACE=./assets/post-install/ifcfg-interface.template
  • IFCFG_BRIDGE=./assets/post-install/ifcfg-bridge.template
  • BREXT_FILE=./assets/post-install/99-brext-master.yaml
  • export bridge=brext
  • bridge=brext
  • create_bridge
  • echo 'Deploying Bridge brext...'
    Deploying Bridge brext...
    ++ head -1
    ++ oc get node -o 'custom-columns=IP:.status.addresses[0].address' --no-headers
  • FIRST_MASTER=10.19.1.231
    ++ ssh -q -o StrictHostKeyChecking=no [email protected] 'ip r | grep default | grep -Po '''(?<=dev )(\S+)''''
  • export interface=eno2
  • interface=eno2
  • '[' eno2 == '' ']'
  • '[' eno2 '!=' brext ']'
  • echo 'Using interface eno2'
    Using interface eno2
    ++ envsubst
    ++ base64 -w0
  • export interface_content=REVWSUNFPWVubzIKQlJJREdFPWJyZXh0Ck9OQk9PVD15ZXMKTk1fQ09OVFJPTExFRD15ZXMKQk9PVFBST1RPPW5vbmUK
  • interface_content=REVWSUNFPWVubzIKQlJJREdFPWJyZXh0Ck9OQk9PVD15ZXMKTk1fQ09OVFJPTExFRD15ZXMKQk9PVFBST1RPPW5vbmUK
    ++ envsubst
    ++ base64 -w0
  • export bridge_content=REVWSUNFPWJyZXh0Ck5BTUU9YnJleHQKVFlQRT1CcmlkZ2UKT05CT09UPXllcwpOTV9DT05UUk9MTEVEPXllcwpCT09UUFJPVE89ZGhjcApCUklER0lOR19PUFRTPXZsYW5fZmlsdGVyaW5nPTEKQlJJREdFX1ZMQU5TPSIxIHB2aWQgdW50YWdnZWQsMjAsMzAwLTQwMCB1bnRhZ2dlZCIK
  • bridge_content=REVWSUNFPWJyZXh0Ck5BTUU9YnJleHQKVFlQRT1CcmlkZ2UKT05CT09UPXllcwpOTV9DT05UUk9MTEVEPXllcwpCT09UUFJPVE89ZGhjcApCUklER0lOR19PUFRTPXZsYW5fZmlsdGVyaW5nPTEKQlJJREdFX1ZMQU5TPSIxIHB2aWQgdW50YWdnZWQsMjAsMzAwLTQwMCB1bnRhZ2dlZCIK
  • envsubst
  • echo 'Done creating bridge definition'
    Done creating bridge definition
  • apply_mc
  • for node_type in master worker
  • oc patch --type=merge '--patch={"spec":{"paused":true}}' machineconfigpool/master
    machineconfigpool.machineconfiguration.openshift.io/master patched
  • for node_type in master worker
  • oc patch --type=merge '--patch={"spec":{"paused":true}}' machineconfigpool/worker
    machineconfigpool.machineconfiguration.openshift.io/worker patched
  • '[' '' '!=' '' ']'
  • for node_type in master worker
    ++ find ./assets/post-install -iname '*-master.yaml' -type f
  • test ./assets/post-install/99-brext-master.yaml
  • echo 'Applying machine configs...'
    Applying machine configs...
  • oc create -f ./assets/post-install/99-brext-master.yaml
    machineconfig.machineconfiguration.openshift.io/99-brext-master created
  • oc patch --type=merge '--patch={"spec":{"paused":false}}' machineconfigpool/master
    machineconfigpool.machineconfiguration.openshift.io/master patched
  • echo 'Rebooting nodes...'
    Rebooting nodes...
  • sleep 30
  • oc wait mcp/master --for condition=updated --timeout 600s
    error: timed out waiting for the condition on machineconfigpools/master
  • sleep 1
  • oc wait mcp/master --for condition=updated --timeout 600s
    Error from server (Timeout): the server was unable to return a response in the time allotted, but may still be processing the request
  • sleep 1
  • oc wait mcp/master --for condition=updated --timeout 600s
    Unable to connect to the server: unexpected EOF
  • sleep 1
  • oc wait mcp/master --for condition=updated --timeout 600s
    Error from server (NotFound): the server could not find the requested resource (get machineconfigpools.machineconfiguration.openshift.io master)
  • sleep 1
  • oc wait mcp/master --for condition=updated --timeout 600s
    Error from server (NotFound): the server could not find the requested resource (get machineconfigpools.machineconfiguration.openshift.io master)
  • sleep 1
  • oc wait mcp/master --for condition=updated --timeout 600s
    Error from server (NotFound): the server could not find the requested resource (get machineconfigpools.machineconfiguration.openshift.io master)
  • sleep 1
  • oc wait mcp/master --for condition=updated --timeout 600s
    Error from server (NotFound): the server could not find the requested resource (get machineconfigpools.machineconfiguration.openshift.io master)
  • sleep 1
  • oc wait mcp/master --for condition=updated --timeout 600s
    Error from server (NotFound): the server could not find the requested resource (get machineconfigpools.machineconfiguration.openshift.io master)
  • sleep 1
  • oc wait mcp/master --for condition=updated --timeout 600s
    Error from server (NotFound): the server could not find the requested resource (get machineconfigpools.machineconfiguration.openshift.io master)
  • sleep 1
  • oc wait mcp/master --for condition=updated --timeout 600s
    Error from server (NotFound): the server could not find the requested resource (get machineconfigpools.machineconfiguration.openshift.io master)
  • sleep 1
  • oc wait mcp/master --for condition=updated --timeout 600s
    Error from server (NotFound): the server could not find the requested resource (get machineconfigpools.machineconfiguration.openshift.io master)
  • sleep 1
  • oc wait mcp/master --for condition=updated --timeout 600s
    Error from server (NotFound): the server could not find the requested resource (get machineconfigpools.machineconfiguration.openshift.io master)
  • sleep 1
  • oc wait mcp/master --for condition=updated --timeout 600s
    Error from server (NotFound): the server could not find the requested resource (get machineconfigpools.machineconfiguration.openshift.io master)
  • sleep 1
  • oc wait mcp/master --for condition=updated --timeout 600s
    Error from server (NotFound): the server could not find the requested resource (get machineconfigpools.machineconfiguration.openshift.io master)
  • sleep 1
  • oc wait mcp/master --for condition=updated --timeout 600s
    Error from server (NotFound): the server could not find the requested resource (get machineconfigpools.machineconfiguration.openshift.io master)
  • sleep 1
  • oc wait mcp/master --for condition=updated --timeout 600s
    Error from server (NotFound): the server could not find the requested resource (get machineconfigpools.machineconfiguration.openshift.io master)
  • sleep 1
  • oc wait mcp/master --for condition=updated --timeout 600s
    Error from server (NotFound): the server could not find the requested resource (get machineconfigpools.machineconfiguration.openshift.io master)
  • sleep 1
  • oc wait mcp/master --for condition=updated --timeout 600s
    Error from server (NotFound): the server could not find the requested resource (get machineconfigpools.machineconfiguration.openshift.io master)
  • sleep 1
  • oc wait mcp/master --for condition=updated --timeout 600s
    Error from server (NotFound): the server could not find the requested resource (get machineconfigpools.machineconfiguration.openshift.io master)
  • sleep 1
  • oc wait mcp/master --for condition=updated --timeout 600s
    Error from server (NotFound): the server could not find the requested resource (get machineconfigpools.machineconfiguration.openshift.io master)
  • sleep 1
  • oc wait mcp/master --for condition=updated --timeout 600s
    Error from server (NotFound): the server could not find the requested resource (get machineconfigpools.machineconfiguration.openshift.io master)
  • sleep 1
  • oc wait mcp/master --for condition=updated --timeout 600s
    Error from server (NotFound): the server could not find the requested resource (get machineconfigpools.machineconfiguration.openshift.io master)
  • sleep 1
  • oc wait mcp/master --for condition=updated --timeout 600s
    Error from server (NotFound): the server could not find the requested resource (get machineconfigpools.machineconfiguration.openshift.io master)
  • sleep 1
  • oc wait mcp/master --for condition=updated --timeout 600s
    Error from server (NotFound): the server could not find the requested resource (get machineconfigpools.machineconfiguration.openshift.io master)
  • sleep 1
  • oc wait mcp/master --for condition=updated --timeout 600s
    Error from server (NotFound): the server could not find the requested resource (get machineconfigpools.machineconfiguration.openshift.io master)
  • sleep 1
  • oc wait mcp/master --for condition=updated --timeout 600s
    Error from server (NotFound): the server could not find the requested resource (get machineconfigpools.machineconfiguration.openshift.io master)
  • sleep 1
  • oc wait mcp/master --for condition=updated --timeout 600s
    Error from server (NotFound): the server could not find the requested resource (get machineconfigpools.machineconfiguration.openshift.io master)
  • sleep 1
  • oc wait mcp/master --for condition=updated --timeout 600s
    Error from server (NotFound): the server could not find the requested resource (get machineconfigpools.machineconfiguration.openshift.io master)
  • sleep 1
  • oc wait mcp/master --for condition=updated --timeout 600s
    Error from server (NotFound): the server could not find the requested resource (get machineconfigpools.machineconfiguration.openshift.io master)
  • sleep 1
  • oc wait mcp/master --for condition=updated --timeout 600s
    Error from server (NotFound): the server could not find the requested resource (get machineconfigpools.machineconfiguration.openshift.io master)
  • sleep 1
  • oc wait mcp/master --for condition=updated --timeout 600s
    Error from server (NotFound): the server could not find the requested resource (get machineconfigpools.machineconfiguration.openshift.io master)
  • sleep 1
  • oc wait mcp/master --for condition=updated --timeout 600s
    Error from server (NotFound): the server could not find the requested resource (get machineconfigpools.machineconfiguration.openshift.io master)
  • sleep 1
  • oc wait mcp/master --for condition=updated --timeout 600s
    Error from server (NotFound): the server could not find the requested resource (get machineconfigpools.machineconfiguration.openshift.io master)
  • sleep 1
  • oc wait mcp/master --for condition=updated --timeout 600s
    Error from server (NotFound): the server could not find the requested resource (get machineconfigpools.machineconfiguration.openshift.io master)
  • sleep 1
  • oc wait mcp/master --for condition=updated --timeout 600s
    Error from server (NotFound): the server could not find the requested resource (get machineconfigpools.machineconfiguration.openshift.io master)
  • sleep 1
  • oc wait mcp/master --for condition=updated --timeout 600s
    Error from server (NotFound): the server could not find the requested resource (get machineconfigpools.machineconfiguration.openshift.io master)
  • sleep 1
  • oc wait mcp/master --for condition=updated --timeout 600s
    Error from server (NotFound): the server could not find the requested resource (get machineconfigpools.machineconfiguration.openshift.io master)
  • sleep 1
  • oc wait mcp/master --for condition=updated --timeout 600s
    Error from server (NotFound): the server could not find the requested resource (get machineconfigpools.machineconfiguration.openshift.io master)
  • sleep 1
  • oc wait mcp/master --for condition=updated --timeout 600s
    Error from server (NotFound): the server could not find the requested resource (get machineconfigpools.machineconfiguration.openshift.io master)
  • sleep 1
  • oc wait mcp/master --for condition=updated --timeout 600s
    Error from server (NotFound): the server could not find the requested resource (get machineconfigpools.machineconfiguration.openshift.io master)
  • sleep 1
  • oc wait mcp/master --for condition=updated --timeout 600s
    Error from server (NotFound): the server could not find the requested resource (get machineconfigpools.machineconfiguration.openshift.io master)
  • sleep 1
  • oc wait mcp/master --for condition=updated --timeout 600s
    Error from server (NotFound): the server could not find the requested resource (get machineconfigpools.machineconfiguration.openshift.io master)
  • sleep 1
  • oc wait mcp/master --for condition=updated --timeout 600s
    Error from server (NotFound): the server could not find the requested resource (get machineconfigpools.machineconfiguration.openshift.io master)
  • sleep 1
  • oc wait mcp/master --for condition=updated --timeout 600s
    Error from server (NotFound): the server could not find the requested resource (get machineconfigpools.machineconfiguration.openshift.io master)
  • sleep 1
  • oc wait mcp/master --for condition=updated --timeout 600s
    Error from server (NotFound): the server could not find the requested resource (get machineconfigpools.machineconfiguration.openshift.io master)
  • sleep 1
  • oc wait mcp/master --for condition=updated --timeout 600s
    Error from server (NotFound): the server could not find the requested resource (get machineconfigpools.machineconfiguration.openshift.io master)
  • sleep 1
  • oc wait mcp/master --for condition=updated --timeout 600s
    Error from server (NotFound): the server could not find the requested resource (get machineconfigpools.machineconfiguration.openshift.io master)
  • sleep 1
  • oc wait mcp/master --for condition=updated --timeout 600s
    Error from server (NotFound): the server could not find the requested resource (get machineconfigpools.machineconfiguration.openshift.io master)
  • sleep 1
  • oc wait mcp/master --for condition=updated --timeout 600s
    Error from server (NotFound): the server could not find the requested resource (get machineconfigpools.machineconfiguration.openshift.io master)
  • sleep 1
  • oc wait mcp/master --for condition=updated --timeout 600s
    Error from server (NotFound): the server could not find the requested resource (get machineconfigpools.machineconfiguration.openshift.io master)
  • sleep 1
  • oc wait mcp/master --for condition=updated --timeout 600s
    Error from server (NotFound): the server could not find the requested resource
  • sleep 1
  • oc wait mcp/master --for condition=updated --timeout 600s
    Error from server (NotFound): the server could not find the requested resource (get machineconfigpools.machineconfiguration.openshift.io master)
  • sleep 1
  • oc wait mcp/master --for condition=updated --timeout 600s
    error: timed out waiting for the condition on machineconfigpools/master
  • sleep 1
  • oc wait mcp/master --for condition=updated --timeout 600s
    error: timed out waiting for the condition on machineconfigpools/master
  • sleep 1
  • oc wait mcp/master --for condition=updated --timeout 600s
    error: timed out waiting for the condition on machineconfigpools/master
  • sleep 1
  • oc wait mcp/master --for condition=updated --timeout 600s
    error: timed out waiting for the condition on machineconfigpools/master
  • sleep 1
  • oc wait mcp/master --for condition=updated --timeout 600s
    error: timed out waiting for the condition on machineconfigpools/master
  • sleep 1
  • oc wait mcp/master --for condition=updated --timeout 600s
    error: timed out waiting for the condition on machineconfigpools/master
  • sleep 1
  • oc wait mcp/master --for condition=updated --timeout 600s
    error: timed out waiting for the condition on machineconfigpools/master
    .
    .
    .
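The log above suggests that the script retries `oc wait` in a loop with no upper bound, so a stuck master pool keeps it spinning indefinitely. A minimal sketch of a bounded retry follows. The helper name `retry_with_limit` and its arguments are hypothetical and are not part of the install scripts.

```shell
# retry_with_limit MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Hypothetical helper: runs COMMAND until it succeeds or MAX_ATTEMPTS is reached.
retry_with_limit() {
    local max_attempts=$1 delay=$2
    shift 2
    local attempt=1
    while true; do
        # Stop as soon as the command succeeds.
        if "$@"; then
            return 0
        fi
        # Give up with a non-zero status instead of looping forever.
        if [ "$attempt" -ge "$max_attempts" ]; then
            echo "giving up after ${attempt} attempts: $*" >&2
            return 1
        fi
        attempt=$((attempt + 1))
        sleep "$delay"
    done
}
```

Under these assumptions the wait step could be written as `retry_with_limit 10 1 oc wait mcp/master --for condition=updated --timeout 600s`, so that the script exits with an error after ten failed attempts rather than repeating the same message.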
@e-minguez
Member

e-minguez commented Sep 9, 2019

I've observed in my environment that no host is even rebooted while applying the machine-configs:

$ oc get nodes
NAME                                         STATUS                     ROLES           AGE   VERSION
kni1-master-0.env.mydomain.example.com   Ready,SchedulingDisabled   master,worker   79m   v1.14.6+82219910a
kni1-master-1.env.mydomain.example.com   Ready                      master,worker   80m   v1.14.6+82219910a
kni1-master-2.env.mydomain.example.com   Ready                      master,worker   79m   v1.14.6+82219910a

$ for node in $(oc get nodes -o jsonpath="{.items[*].metadata.name}"); do echo -n ${node}; ssh core@${node} uptime; done
kni1-master-0.env.mydomain.example.com 12:47:43 up  1:20,  0 users,  load average: 0.41, 0.33, 0.43
kni1-master-1.env.mydomain.example.com 12:47:51 up  1:21,  0 users,  load average: 0.42, 1.48, 1.38
kni1-master-2.env.mydomain.example.com 12:47:45 up  1:20,  0 users,  load average: 0.44, 0.70, 0.81

Digging a bit deeper, I see that the machine-config-daemon pod running on kni1-master-0 is complaining about the pod disruption budget for etcd-quorum-guard:

$ oc get pods -n openshift-machine-config-operator -o wide | grep kni1-master-0
etcd-quorum-guard-59f44bc47d-sxg8j           1/1     Running   0          77m   10.19.138.11   kni1-master-0.env.mydomain.example.com   <none>           <none>
machine-config-daemon-hgglp                  1/1     Running   0          78m   10.19.138.11   kni1-master-0.env.mydomain.example.com   <none>           <none>
machine-config-server-8sbx4                  1/1     Running   0          78m   10.19.138.11   kni1-master-0.env.mydomain.example.com   <none>           <none>
$ oc logs machine-config-daemon-hgglp
...
I0909 12:53:06.955223   13858 update.go:89] error when evicting pod "etcd-quorum-guard-59f44bc47d-sxg8j" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
I0909 12:53:11.961294   13858 update.go:89] error when evicting pod "etcd-quorum-guard-59f44bc47d-sxg8j" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
I0909 12:53:16.966611   13858 update.go:89] error when evicting pod "etcd-quorum-guard-59f44bc47d-sxg8j" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
$ oc get pods -o wide | grep etcd-quorum-guard
etcd-quorum-guard-59f44bc47d-br7qq           0/1     Running   0          78m   10.19.138.12   kni1-master-1.env.mydomain.example.com   <none>           <none>
etcd-quorum-guard-59f44bc47d-sxg8j           1/1     Running   0          78m   10.19.138.11   kni1-master-0.env.mydomain.example.com   <none>           <none>
etcd-quorum-guard-59f44bc47d-xd854           1/1     Running   0          78m   10.19.138.13   kni1-master-2.env.mydomain.example.com   <none>           <none>
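The eviction errors are consistent with one guard pod already being not ready (0/1). As a rough sketch of the arithmetic, assuming the etcd-quorum-guard disruption budget uses a `maxUnavailable` of 1 (an assumption, the budget itself is not shown in this thread), the number of evictions still allowed can be worked out as below. The function name `allowed_disruptions` is hypothetical.

```shell
# allowed_disruptions EXPECTED_PODS HEALTHY_PODS MAX_UNAVAILABLE
# Hypothetical helper: how many more pods may be evicted under a
# maxUnavailable-style budget. MAX_UNAVAILABLE=1 is an assumption here.
allowed_disruptions() {
    local expected=$1 healthy=$2 max_unavailable=$3
    # Minimum number of healthy pods the budget wants to keep.
    local desired_healthy=$((expected - max_unavailable))
    local allowed=$((healthy - desired_healthy))
    # The result never goes below zero.
    if [ "$allowed" -lt 0 ]; then
        allowed=0
    fi
    echo "$allowed"
}

# Three guard pods expected, two ready, budget of one unavailable pod.
allowed_disruptions 3 2 1   # prints 0
```

With zero evictions allowed, the drain of kni1-master-0 cannot proceed until the guard pod on kni1-master-1 becomes ready again.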

$ oc get events | grep etcd-quorum-guard-59f44bc47d-br7qq
79m         Normal    Scheduled           pod/etcd-quorum-guard-59f44bc47d-br7qq            Successfully assigned openshift-machine-config-operator/etcd-quorum-guard-59f44bc47d-br7qq to kni1-master-1.env.mydomain.example.com
79m         Normal    Pulled              pod/etcd-quorum-guard-59f44bc47d-br7qq            Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:dfd89e339168edd91af05acd0f575474212f09b97d2b02b235bd1c17e7ae4802" already present on machine
79m         Normal    Created             pod/etcd-quorum-guard-59f44bc47d-br7qq            Created container guard
79m         Normal    Started             pod/etcd-quorum-guard-59f44bc47d-br7qq            Started container guard
22s         Warning   Unhealthy           pod/etcd-quorum-guard-59f44bc47d-br7qq            Readiness probe failed:
79m         Normal    SuccessfulCreate    replicaset/etcd-quorum-guard-59f44bc47d           Created pod: etcd-quorum-guard-59f44bc47d-br7qq

$ oc logs etcd-quorum-guard-59f44bc47d-br7qq
$ 

$ oc get pods -A | grep -v -E 'Running|Complete'
NAMESPACE                                               NAME                                                                  READY   STATUS             RESTARTS   AGE
openshift-etcd                                          etcd-member-kni1-master-1.env.mydomain.example.com                1/2     CrashLoopBackOff   17         86m
$ oc get pods -n openshift-etcd
NAME                                                     READY   STATUS             RESTARTS   AGE
etcd-member-kni1-master-0.env.mydomain.example.com   2/2     Running            0          87m
etcd-member-kni1-master-1.env.mydomain.example.com   1/2     CrashLoopBackOff   17         87m
etcd-member-kni1-master-2.env.mydomain.example.com   2/2     Running            0          87m
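To spot such pods without reading the table by eye, the READY column can be filtered with awk. This is only a sketch: the function name `not_ready_pods` is hypothetical, and it assumes single-namespace output in which the second column is READY (so it would need adjusting for `oc get pods -A`).

```shell
# not_ready_pods: reads `oc get pods` style output on stdin and prints the
# names of pods whose READY column (ready/total) is not fully ready.
# Hypothetical helper; assumes column 1 is NAME and column 2 is READY.
not_ready_pods() {
    awk '$2 ~ /^[0-9]+\/[0-9]+$/ {
        split($2, parts, "/")
        if (parts[1] != parts[2]) print $1
    }'
}
```

It could be used as `oc get pods -n openshift-etcd | not_ready_pods`; the header line is skipped because its second field is not of the form ready/total.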

$ oc logs etcd-member-kni1-master-1.env.mydomain.example.com -n openshift-etcd
Error from server (BadRequest): a container name must be specified for pod etcd-member-kni1-master-1.env.mydomain.example.com, choose one of: [etcd-member etcd-metrics] or one of the init containers: [discovery certs]

$ oc logs etcd-member-kni1-master-1.env.mydomain.example.com -c etcd-member -n openshift-etcd
/bin/sh: line 3: /run/etcd/environment: Permission denied

On the nodes:

$ for node in $(oc get nodes -o jsonpath="{.items[*].metadata.name}"); do ssh core@${node} sudo cat /run/etcd/environment; ssh core@${node} sudo ls -lZ /run/etcd/environment; done

export ETCD_DISCOVERY_SRV=kni1.env.mydomain.example.com
ETCD_WILDCARD_DNS_NAME=*.kni1.env.mydomain.example.com
ETCD_IPV4_ADDRESS=10.19.138.11
ETCD_DNS_NAME=etcd-0.kni1.env.mydomain.example.com
-rw-r--r--. 1 root root system_u:object_r:container_var_run_t:s0 205 Sep  9 11:29 /run/etcd/environment

export ETCD_DISCOVERY_SRV=kni1.env.mydomain.example.com
ETCD_IPV4_ADDRESS=10.19.138.12
ETCD_DNS_NAME=etcd-1.kni1.env.mydomain.example.com
ETCD_WILDCARD_DNS_NAME=*.kni1.env.mydomain.example.com
-rw-r--r--. 1 root root system_u:object_r:container_var_run_t:s0 205 Sep  9 11:29 /run/etcd/environment

export ETCD_DISCOVERY_SRV=kni1.env.mydomain.example.com
ETCD_WILDCARD_DNS_NAME=*.kni1.env.mydomain.example.com
ETCD_IPV4_ADDRESS=10.19.138.13
ETCD_DNS_NAME=etcd-2.kni1.env.mydomain.example.com
-rw-r--r--. 1 root root system_u:object_r:container_var_run_t:s0 205 Sep  9 11:28 /run/etcd/environment

@jparrill
Contributor

jparrill commented Sep 9, 2019

Same here. I've been hitting this error after the bridge patch was applied and the node rebooted. The etcd member never comes back up.

  • The bad master-0:
[core@master-0 ~]$ ls -lahZ /run/etcd/environment
-rw-r--r--. 1 root root system_u:object_r:container_var_run_t:s0 183 Sep  9 11:46 /run/etcd/environment

[core@master-0 ~]$ getfacl /run/etcd/environment
getfacl: Removing leading '/' from absolute path names
# file: run/etcd/environment
# owner: root
# group: root
user::rw-
group::r--
other::r--
  • A good node (master-2):
[core@master-2 ~]$ ls -alhZ /run/etcd/environment
-rw-r--r--. 1 root root system_u:object_r:container_var_run_t:s0 183 Sep  9 11:32 /run/etcd/environment

[core@master-2 ~]$ getfacl /run/etcd/environment
getfacl: Removing leading '/' from absolute path names
# file: run/etcd/environment
# owner: root
# group: root
user::rw-
group::r--
other::r--

@e-minguez
Member

I've just moved the etcd-member static pod definition out of the manifests directory on the affected host and back again (to simulate an oc delete, but for a static pod), and it seems to fix it:

$ ssh [email protected] sudo mv /etc/kubernetes/manifests/etcd-member.yaml /root/
$ ssh [email protected] sudo mv /root/etcd-member.yaml /etc/kubernetes/manifests/etcd-member.yaml
$ oc get pods
NAME                                                     READY   STATUS    RESTARTS   AGE
etcd-member-kni1-master-0.env.mydomain.example.com   2/2     Running   2          129m
etcd-member-kni1-master-1.env.mydomain.example.com   2/2     Running   28         23m
etcd-member-kni1-master-2.env.mydomain.example.com   2/2     Running   0          129m
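The two `mv` commands can be wrapped in a small helper to run on the affected node itself. This is a sketch under stated assumptions, not a confirmed fix: the name `bounce_static_pod` and the pause length are hypothetical, and the holding directory must be outside the manifests directory so that the kubelet stops seeing the file.

```shell
# bounce_static_pod MANIFEST HOLDING_DIR [PAUSE_SECONDS]
# Hypothetical helper: moves a static pod manifest out of the kubelet's
# manifests directory, waits, then moves it back so the pod is recreated.
# The default pause of 20 seconds is an assumption, not a measured value.
bounce_static_pod() {
    local manifest=$1 holding_dir=$2 pause_seconds=${3:-20}
    local parked
    parked="${holding_dir}/$(basename "$manifest")"
    if [ ! -f "$manifest" ]; then
        echo "manifest not found: $manifest" >&2
        return 1
    fi
    # Park the manifest so the kubelet tears the static pod down.
    mv "$manifest" "$parked" || return 1
    sleep "$pause_seconds"
    # Restore it so the kubelet starts the pod again.
    mv "$parked" "$manifest"
}
```

With the paths used above, this would be run as root on the node as `bounce_static_pod /etc/kubernetes/manifests/etcd-member.yaml /root`.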

@sreichar
Collaborator Author

sreichar commented Sep 9, 2019

@e-minguez That also worked for me.

Is this something we need to raise with OpenShift?

@e-minguez
Member

@e-minguez That also worked for me.

Is this something we need to raise with OpenShift?

I think that would be a good idea. That said, the issue title seems misleading: I believe the main problem here is that the etcd-member pod is not running even though the install appears to have finished successfully.

@e-minguez
Member

@sreichar sreichar changed the title master don't come back online after reboot in 99_post_install master don't come back online after reboot in 99_post_install - etcd member pod not running Sep 9, 2019
@hardys

hardys commented Sep 11, 2019

I just tried rebooting a dev-scripts VM and cannot reproduce the same issue - could this be related to the other configuration changes made prior to the reboot?

@e-minguez
Member

See the Bugzilla report. It seems there is a weird issue under the hood.
