Recently I came across the following issue in OpenShift: when a pod has an RWO persistent volume attached and the node where this pod is running goes down for whatever reason, the persistent volume is never detached from the node and the pod never gets automatically evicted to another node.
Here is how this looks in the CLI:
A PV-backed pod is running on the node:
$ kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE
postgresql-1-6ppzs 1/1 Running 1 6d 10.129.2.141 ip-10-0-12-219.ec2.internal
$ kubectl get nodes
NAME STATUS ROLES AGE VERSION
ip-10-0-1-12.ec2.internal Ready master 280d v1.11.0+d4cacc0
ip-10-0-12-219.ec2.internal Ready compute 246d v1.11.0+d4cacc0
ip-10-0-8-129.ec2.internal Ready compute 280d v1.11.0+d4cacc0
ip-10-0-9-236.ec2.internal Ready infra 280d v1.11.0+d4cacc0
Then the node hosting the pod goes down and becomes NotReady:
$ kubectl get nodes
NAME STATUS ROLES AGE VERSION
ip-10-0-1-12.ec2.internal Ready master 280d v1.11.0+d4cacc0
ip-10-0-12-219.ec2.internal NotReady compute 246d v1.11.0+d4cacc0
ip-10-0-8-129.ec2.internal Ready compute 280d v1.11.0+d4cacc0
ip-10-0-9-236.ec2.internal Ready infra 280d v1.11.0+d4cacc0
A replacement pod is scheduled on another node but gets stuck in ContainerCreating, while the original pod is reported as Unknown:
$ kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE
postgresql-1-4jtfz 0/1 ContainerCreating 0 6m <none> ip-10-0-8-129.ec2.internal
postgresql-1-6ppzs 1/1 Unknown 1 7d 10.129.2.141 ip-10-0-12-219.ec2.internal
You will also see the following event in the event log: Multi-Attach error for volume "pvc-53cd2ba8-a496-11e9-b701-0ea4b5a6d9c6" Volume is already used by pod(s) postgresql-1-6ppzs. This happens because RWO (ReadWriteOnce) volumes can be mounted read-write by only a single node at a time. RWO is the most common storage access mode and is provided by many popular storage technologies, including AWS EBS and VMware VMDK disks. You can find a detailed list of RWO volumes here. If you are using RWO storage, your stateful pod won't be automatically evicted to another node. This problem is identified and tracked in this Kubernetes issue.
Manual Failover
Fortunately there is a quite straightforward manual procedure to recover from this situation. You simply need to force delete the pod in "Unknown" status, without any grace period, using the following command:
$ kubectl delete pod postgresql-1-6ppzs --grace-period=0 --force
After this command is executed the pod is immediately deleted, and after 6 minutes (this value is hardcoded in Kubernetes) the persistent volume is detached from the failed node and attached to the node where the new pod replica has been scheduled.
$ kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE
postgresql-1-qd7q2 1/1 Running 0 10m 10.129.2.145 ip-10-0-8-129.ec2.internal
Automated Failover
With Kubernetes self-healing in mind, we would like to automate this procedure so that all pods using RWO persistent volumes are automatically evicted in case of node failure or a maintenance window. Here is the proposed solution:
1. Implement a shutdown taint for nodes.
2. Write an external controller (it could also be a cron job running a Python/Ruby script) which watches node objects carrying the shutdown taint and force deletes pods stuck in "Unknown" state on those nodes (a sketch follows below).
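Below is a minimal sketch of such a controller in Python, using the official kubernetes client. The taint key "node.kubernetes.io/shutdown", the 30-second poll interval and the check on the pod's "Unknown"/"NodeLost" status are assumptions; adjust them to whatever taint and conditions you actually use in your cluster.

import time

from kubernetes import client, config

# Assumed taint key marking nodes that have been shut down.
SHUTDOWN_TAINT_KEY = "node.kubernetes.io/shutdown"

def tainted_nodes(core):
    """Return names of nodes carrying the shutdown taint."""
    names = []
    for node in core.list_node().items:
        for taint in (node.spec.taints or []):
            if taint.key == SHUTDOWN_TAINT_KEY:
                names.append(node.metadata.name)
    return names

def force_delete_stuck_pods(core, node_name):
    """Force delete (grace period 0) pods reported as Unknown on the given node."""
    pods = core.list_pod_for_all_namespaces(
        field_selector="spec.nodeName={}".format(node_name))
    for pod in pods.items:
        if pod.status.phase == "Unknown" or pod.status.reason == "NodeLost":
            core.delete_namespaced_pod(
                pod.metadata.name,
                pod.metadata.namespace,
                grace_period_seconds=0,
                body=client.V1DeleteOptions(grace_period_seconds=0))

def main():
    # Use config.load_kube_config() when running outside the cluster.
    config.load_incluster_config()
    core = client.CoreV1Api()
    while True:
        for node_name in tainted_nodes(core):
            force_delete_stuck_pods(core, node_name)
        time.sleep(30)

if __name__ == "__main__":
    main()

Run in the cluster (for example as a CronJob with a service account allowed to list nodes and delete pods), this achieves the same effect as the manual kubectl delete --grace-period=0 --force shown above.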
Any other options?
This issue is specific to RWO storage. Another solution is to use RWX (ReadWriteMany) volumes, where each volume can be mounted on multiple nodes at the same time. You can check again here which storage technologies support the RWX access mode. As you can see, only a few RWX storage technologies are available. In my experience, Software Defined Storage technologies like CephFS or GlusterFS are a very good choice. On the other hand, the easiest option, NFS, doesn't offer enough quality of service, at least in some use cases such as database storage or storage for systems with a large number of small-file read/write operations, e.g. Prometheus or Elasticsearch.
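If your cluster does provide an RWX-capable storage class, requesting such a volume is only a matter of setting the access mode on the claim. Here is a minimal sketch using the kubernetes Python client; the claim name "shared-data", the size and the storage class name "glusterfs-storage" are assumptions and depend on what your cluster actually offers.

from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

# Hypothetical claim for a ReadWriteMany volume; the storage class name
# must match an RWX-capable class available in your cluster.
pvc = client.V1PersistentVolumeClaim(
    metadata=client.V1ObjectMeta(name="shared-data"),
    spec=client.V1PersistentVolumeClaimSpec(
        access_modes=["ReadWriteMany"],
        storage_class_name="glusterfs-storage",
        resources=client.V1ResourceRequirements(
            requests={"storage": "10Gi"})))
core.create_namespaced_persistent_volume_claim(namespace="default", body=pvc)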
With Red Hat OpenShift Container Storage you can take this a step further and leverage OpenShift nodes to run an RWX storage cluster based on GlusterFS or CephFS, depending on which OpenShift version you use. You can learn more about OpenShift Container Storage here.