Friday, July 19, 2019

Dealing with RWO storage limitations


Recently I came across the following issue in Openshift: when a pod has an RWO persistent volume attached and the node where this pod is running goes down for whatever reason, the persistent volume is never detached from the node and the pod never gets automatically evicted to another node.

Here is how this looks in the CLI:

A PV-backed pod is running on the node:

$ kubectl get pods -o wide

NAME                 READY     STATUS    RESTARTS   AGE       IP            NODE                         
postgresql-1-6ppzs   1/1       Running   1          6d        10.129.2.141   ip-10-0-12-219.ec2.internal


$ kubectl get nodes

NAME                          STATUS     ROLES     AGE       VERSION
ip-10-0-1-12.ec2.internal     Ready      master    280d      v1.11.0+d4cacc0
ip-10-0-12-219.ec2.internal   Ready      compute   246d      v1.11.0+d4cacc0
ip-10-0-8-129.ec2.internal    Ready      compute   280d      v1.11.0+d4cacc0
ip-10-0-9-236.ec2.internal    Ready      infra     280d      v1.11.0+d4cacc0


When the node went down, the pod status changed to "Unknown" and the new replica remains in ContainerCreating status forever:

$ kubectl get nodes

NAME                          STATUS     ROLES     AGE       VERSION
ip-10-0-1-12.ec2.internal     Ready      master    280d      v1.11.0+d4cacc0
ip-10-0-12-219.ec2.internal   NotReady   compute   246d      v1.11.0+d4cacc0
ip-10-0-8-129.ec2.internal    Ready      compute   280d      v1.11.0+d4cacc0
ip-10-0-9-236.ec2.internal    Ready      infra     280d      v1.11.0+d4cacc0

$ kubectl get pods -o wide

NAME                 READY     STATUS              RESTARTS   AGE       IP             NODE                         
postgresql-1-4jtfz   0/1       ContainerCreating   0          6m        <none>         ip-10-0-8-129.ec2.internal
postgresql-1-6ppzs   1/1       Unknown             1          7d        10.129.2.141   ip-10-0-12-219.ec2.internal


You can also see the following event in the event log: Multi-Attach error for volume "pvc-53cd2ba8-a496-11e9-b701-0ea4b5a6d9c6" Volume is already used by pod(s) postgresql-1-6ppzs. This is because RWO (ReadWriteOnce) volumes can be mounted as read-write by only a single node at a time. RWO is the most common storage access mode and is provided by many popular storage technologies, including AWS EBS or VMware VMDK disks. You can find a detailed list of RWO volumes here. If you are using RWO storage, your stateful pod won't be automatically evicted to another node. This problem is identified and tracked in this Kubernetes issue.
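You can confirm that the volume in question is indeed RWO by checking the access mode on the claim. A quick check, assuming the claim is named postgresql (adjust to your PVC name):

$ kubectl get pvc postgresql -o jsonpath='{.spec.accessModes[0]}'
ReadWriteOnce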

Manual Failover


Fortunately, there is a quite straightforward manual procedure to fail over from this situation. You simply need to force delete the pod in "Unknown" status, without any grace period, using the following command:

$ kubectl delete pod postgresql-1-6ppzs --grace-period=0 --force

After this command is executed the pod is immediately deleted, and after 6 minutes (this value is hardcoded in Kubernetes) the persistent volume is detached from the failed node and attached to the node where the new pod replica has been scheduled.
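While waiting for the reattach you can follow progress in the event log, for example by watching cluster events (the exact event wording differs between Kubernetes versions):

$ kubectl get events -w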

$ kubectl get pods -o wide

NAME                 READY     STATUS    RESTARTS   AGE       IP             NODE                         
postgresql-1-qd7q2   1/1       Running   0          10m       10.129.2.145   ip-10-0-8-129.ec2.internal 


Automated Failover


With Kubernetes self-healing in mind, we would like to automate this procedure so that all pods using RWO persistent volumes are automatically evicted in case of a node failure or a maintenance window. Here is the proposed solution:

1. Implement a shutdown taint for nodes.

2. Write an external controller (it could also be a cronjob Python/Ruby script) which watches node objects carrying the shutdown taint and force deletes pods stuck in the "Unknown" state on those nodes (a sketch follows below).
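Here is a minimal sketch of such a cleanup job as a plain shell script. The taint key node.kubernetes.io/shutdown is an assumption (use whatever key your shutdown hook or automation applies), it requires jq, and the awk column positions assume the kubectl 1.11 output shown above:

#!/bin/bash
# Sketch: force delete pods stuck in "Unknown" on nodes carrying the shutdown taint.
# The taint key is an assumption; replace it with the key your automation applies.
TAINT_KEY="node.kubernetes.io/shutdown"

# Collect nodes that carry the shutdown taint
nodes=$(kubectl get nodes -o json | \
  jq -r --arg key "$TAINT_KEY" '.items[] | select(.spec.taints[]?.key == $key) | .metadata.name')

for node in $nodes; do
  # Find pods reported as Unknown on that node (STATUS is column 4, NODE is column 8)
  # and force delete them without a grace period, as in the manual procedure above
  kubectl get pods --all-namespaces -o wide --no-headers | \
    awk -v n="$node" '$4 == "Unknown" && $8 == n {print $1, $2}' | \
    while read ns pod; do
      kubectl delete pod "$pod" -n "$ns" --grace-period=0 --force
    done
done

Running this from a Kubernetes CronJob (or plain cron on a management host) every minute or so is enough, since the volume detach itself only happens 6 minutes after the pod is gone anyway.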

Any other options?


This issue is specific to RWO storage. Another solution would be to use RWX (ReadWriteMany) volumes, where each volume can be mounted on multiple nodes at the same time. You can check again here which storage technologies support the RWX access mode. As you can see, there are only a few RWX storage technologies available. From my experience, Software Defined Storage technologies like CephFS or GlusterFS are a very good choice. On the other hand, the easiest option, NFS, doesn't offer enough quality of service, at least in some use cases such as database storage or storage for systems with a large number of small-file read/write operations, e.g. Prometheus or Elasticsearch.

With Red Hat Openshift Container Storage you can take this a step further and leverage Openshift nodes to run an RWX storage cluster based on GlusterFS or CephFS, depending on which Openshift version you use. You can learn more about Openshift Container Storage here.