etcd is the cluster database in Kubernetes and OpenShift. It is a critical component for keeping your cluster up and running. By design it is fault tolerant, but some failures still require administrator intervention. In OpenShift, etcd runs on the master nodes, colocated with the API servers and controllers.
To stay fully operational, etcd needs more than half of its cluster members running (a quorum). If half or fewer of the members are running, the cluster effectively switches to read-only mode, and in practice you won't be able to manage it with kubectl/oc or the web/admin console. In that situation you must recover the failed nodes or add new cluster members until more than half of the members are running again.
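The majority rule above comes down to simple arithmetic. Here is a minimal sketch of it; the function names are my own for illustration and are not part of etcd or OpenShift:

```python
def quorum(members: int) -> int:
    """Smallest number of running members that forms a majority."""
    if members < 1:
        raise ValueError("a cluster needs at least one member")
    return members // 2 + 1


def fault_tolerance(members: int) -> int:
    """How many members can fail while the cluster still has quorum."""
    return members - quorum(members)


def has_quorum(members: int, running: int) -> bool:
    """True when strictly more than half of the members are running."""
    return running >= quorum(members)


if __name__ == "__main__":
    for size in (1, 3, 4, 5):
        print(
            f"{size} members: quorum={quorum(size)}, "
            f"tolerated failures={fault_tolerance(size)}"
        )
```

Note that a 4-member cluster tolerates only one failure, the same as a 3-member cluster, which is one reason odd cluster sizes are the usual choice.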
Here are the etcd failure scenarios and how to recover from each in an OpenShift environment:
- A majority of masters are up, a minority are down.
When the failed masters come back, they automatically recover and rejoin the cluster.
- Network partition.
If one side of the partition still has a majority of the members running, that side remains fully operational. Once the network partition clears, the minority side automatically recognizes the leader from the majority side and recovers its state.
- A minority of masters are up, a majority are down.
First, create a single-node etcd cluster by following the "Restoring etcd quorum for static pods" procedure.
Second, add more cluster members by following the "Adding etcd nodes after restoring" procedure. A three-master cluster is recommended.
- All masters are down.
If all masters are down, first check whether you can get any of them up and running reasonably quickly. If you can, proceed as in scenario 3 (a minority of masters up). In my experience this works well in most cases, provided there is no file system corruption. Otherwise you'll need to recover your etcd cluster from a backup:
First, restore etcd from a snapshot, which creates a single-node etcd cluster.
Second, add more cluster members by following the "Adding etcd nodes after restoring" procedure. A three-master cluster is recommended.
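To tie the scenarios together, here is a rough sketch that maps the number of running masters to the recovery path described above. The constants and function names are hypothetical, chosen only for this illustration; it also treats "exactly half running" as quorum loss, which is my reading of the majority rule rather than something spelled out in the list.

```python
from typing import Optional, Sequence

# Hypothetical labels for the recovery paths in the list above.
HEALTHY = "healthy: nothing to recover"
MAJORITY_UP = "majority up: failed masters rejoin automatically when they return"
MINORITY_UP = "quorum lost: restore quorum as a single-node cluster, then add members"
ALL_DOWN = "all down: bring a master back if possible, otherwise restore from snapshot"


def recovery_path(total: int, running: int) -> str:
    """Pick the recovery path for a cluster with `running` of `total` masters up."""
    if total < 1 or not 0 <= running <= total:
        raise ValueError("expected 0 <= running <= total and total >= 1")
    if running == total:
        return HEALTHY
    if running > total // 2:  # strictly more than half: quorum is intact
        return MAJORITY_UP
    if running > 0:
        return MINORITY_UP
    return ALL_DOWN


def side_with_quorum(total: int, side_sizes: Sequence[int]) -> Optional[int]:
    """Index of the partition side holding a majority, or None if no side does."""
    for index, size in enumerate(side_sizes):
        if size > total // 2:
            return index
    return None


if __name__ == "__main__":
    for up in range(4):
        print(f"{up}/3 masters running -> {recovery_path(3, up)}")
    print("3-member cluster split 2/1, majority side:", side_with_quorum(3, [2, 1]))
```

In a network partition at most one side can hold a majority, so `side_with_quorum` returns at most one index; when it returns None, neither side stays fully operational until the partition clears.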
---