Wednesday, May 5, 2021

Optimize Kubernetes/OpenShift cluster resource utilization

It is a quite common situation: your OpenShift or Kubernetes cluster's resource utilization is very low, yet at some point you are unable to deploy new pods, which remain in Pending state due to an insufficient resources error. It can be even more frustrating when the number of pods running on your cluster nodes is significantly below the limit of 500 pods/node in OpenShift or 110 pods/node in upstream Kubernetes.

In most cases the solution is proper assignment of hardware resources to the pods in your cluster. First, you should analyze whether your cluster has pod deployments that are significantly overestimated. In OpenShift you can leverage the built-in monitoring, which will instantly show you whether your cluster is overestimated and which pod deployments are overestimated:
[Screenshots: OpenShift console monitoring views showing cluster-wide Utilization/Usage and Request Commitment values, and the same breakdown per pod deployment]

Let's try to analyze where these numbers come from. Utilization/Usage values show the actual consumption of hardware resources by running pods, while Request Commitment values are based on what has been set as the resource request in the pod deployment definition, or on the defaults defined in a LimitRange object. It is very important to understand that requested resources are reserved by the Kubernetes scheduler exclusively for the pod and, if not consumed, are wasted. Hence, the difference between actual utilization and the requested resource value should be as small as possible to minimize the waste of hardware resources.
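
As a minimal sketch, this is how a resource request is declared on a container, together with a LimitRange that supplies defaults for containers that don't declare their own (all names and values here are illustrative):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app                 # hypothetical name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
      - name: my-app
        image: quay.io/example/my-app:latest   # hypothetical image
        resources:
          requests:
            cpu: 100m          # reserved exclusively for this pod by the scheduler
            memory: 256Mi
---
apiVersion: v1
kind: LimitRange
metadata:
  name: default-requests
spec:
  limits:
  - type: Container
    defaultRequest:            # applied to containers that omit requests
      cpu: 100m
      memory: 256Mi

The closer the request is to the usage you observe in monitoring, the less capacity is reserved but wasted.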

This leads to the next question: how do you calculate the optimal resource request value for a pod deployment? Of course, the best way would be to load test your application over a longer period and come up with the most accurate value. There is also a great tool, the Vertical Pod Autoscaler (VPA), which can calculate recommended resource request and limit values for your pod deployments based on their historical resource consumption. VPA can even apply its recommendations to running pods automatically, but I don't recommend this for production clusters.
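
As a sketch, a VPA object in recommendation-only mode might look like the following (this assumes the VPA controller or operator is installed on the cluster; the target name is illustrative):

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: my-app-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app           # hypothetical target deployment
  updatePolicy:
    updateMode: "Off"      # compute recommendations only, never evict pods

With updateMode set to "Off", the recommendations appear in the object's status (for example via oc describe vpa my-app-vpa) and can be applied to the deployment manually.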

Moving forward, let's see how pod scaling might impact resource utilization. Generally, you have two ways to scale your pods:

1. Scale up pods by setting the resource limit in the pod deployment to a higher value than the resource request (see the first sketch after this list). This might lead to a node overcommit situation, where the sum of the resource limits set on pods is higher than the amount of resources available on the node. This can negatively impact node stability when resource demand grows above what the node can provide. If you want to allow node overcommit on your cluster, you should always remember to reserve some resources for node system processes and to define pod disruption budgets and pod priorities to hint to the scheduler which pods are running critical workloads. I don't recommend allowing resource overcommit on production clusters.

2. Scale out using the Horizontal Pod Autoscaler (HPA). With HPA you can configure pod autoscaling based on selected metrics, e.g. CPU or memory utilization as a percentage of the resource request configured in the pod deployment (a sketch follows the next paragraph). Hence, again, it is very important to set the resource request to an optimal value, as described above.
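
For the first strategy, here is a minimal sketch of a pod whose limit is set above its request (names and values are illustrative). The scheduler places the pod based on the request, but the container may burst up to the limit, which is exactly how a node becomes overcommitted:

apiVersion: v1
kind: Pod
metadata:
  name: burstable-example      # hypothetical name
spec:
  containers:
  - name: app
    image: quay.io/example/my-app:latest   # hypothetical image
    resources:
      requests:
        cpu: 100m        # what the scheduler reserves
        memory: 256Mi
      limits:
        cpu: 500m        # what the container may actually consume
        memory: 512Mi

Note that memory, unlike CPU, is not compressible: if an overcommitted node runs out of memory, pods get OOM-killed, which is one more reason to avoid overcommit in production.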

Especially for production workloads, I recommend implementing a scale-out strategy for pod scaling.
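
For the second strategy, here is a sketch of an HPA using the autoscaling/v2 API (v2beta2 on older clusters); the target name and thresholds are illustrative:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app             # hypothetical target deployment
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70   # percent of the configured CPU request

Because averageUtilization is expressed relative to the request, an unrealistically high request will prevent the HPA from ever scaling out.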

Another important factor related to resource utilization is node workload balancing. Over time, for various reasons, some nodes might become significantly more utilized than others, which can negatively impact overall cluster capacity. To address this challenge you can deploy the Descheduler to your cluster. Using the Descheduler you can enable different profiles that detect different pod deployment patterns (a configuration sketch follows the list):

1. AffinityAndTaints: This profile evicts pods that violate inter-pod anti-affinity, node affinity, and node taints.

2. TopologyAndDuplicates: This profile evicts pods in an effort to evenly spread similar pods, or pods of the same topology domain, among nodes.

3. LifecycleAndUtilization: This profile evicts long-running pods and balances resource usage between nodes.
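
On OpenShift the Descheduler is configured through a KubeDescheduler custom resource, roughly as sketched below; this assumes the Kube Descheduler Operator is already installed, and the interval is illustrative:

apiVersion: operator.openshift.io/v1
kind: KubeDescheduler
metadata:
  name: cluster
  namespace: openshift-kube-descheduler-operator
spec:
  deschedulingIntervalSeconds: 3600   # run the descheduling cycle hourly
  profiles:
  - LifecycleAndUtilization           # evict long-running pods, rebalance nodes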

I expect that over time more advanced deschedulers will be created, based on more sophisticated patterns that use AI/ML for optimal resource utilization and balancing. I recommend combining the Descheduler with the Cluster Autoscaler (CA). The Cluster Autoscaler increases the size of the cluster when there are pods that fail to schedule on cluster nodes due to insufficient resources, or when another node is necessary to meet deployment needs.
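
A sketch of the OpenShift ClusterAutoscaler resource, with illustrative limits (on OpenShift it works together with per-MachineSet MachineAutoscaler objects that define the scaling range for each node pool):

apiVersion: autoscaling.openshift.io/v1
kind: ClusterAutoscaler
metadata:
  name: default            # this resource must be named "default"
spec:
  resourceLimits:
    maxNodesTotal: 12      # illustrative upper bound on cluster size
  scaleDown:
    enabled: true          # also remove nodes when they become unneeded
    unneededTime: 10m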

So far I've briefly described cluster-level tools to optimize resource utilization, but there is almost always huge room for improvement in the applications themselves. Using cloud-native architectures like microservices, cloud-native and container-optimized frameworks like Quarkus (especially in native mode), and advanced deployment patterns like serverless for automated scaling will also help to significantly optimize cluster resource utilization. That is a great topic for another post...

All of the aforementioned optimizations will require some time and effort, so it is good to be able to estimate the financial costs and savings of these optimizations in advance. In OpenShift we have created a very handy Cost Management dashboard, which is available for connected clusters in Red Hat Cloud.