This is an expert opinion on Business Continuity Planning (BCP) for IT applications, focusing on the High Availability (HA) framework. This post is sponsored by Kuberneteslab. Kuberneteslab helps startups manage the horizontal layer of the application lifecycle, which includes DevOps, BCP, and app modernization.
High Availability and Disaster Recovery are two distinct but crucial concepts for IT applications.
Enterprises adopt an HA (High Availability) framework to make sure all mission-critical applications stay running and available to end users at all times. Netflix is a well-known example of a company that runs across multiple regions for disaster recovery.
Let’s discuss the steps of the HA framework. We cannot make everything highly available in one day. There is a process to follow, and it is better to adopt a top-down approach.
There are four major steps to reach the highest level of availability and easy disaster recovery. Let’s go through them one by one.
1. Node Level HA: Everything runs on a node (aka machine, instance, VM). It can be a physical server or a virtual one. The first step toward HA is to make sure each replica of the process/pod/container runs on a different node, so that losing one node does not take down every replica. Cluster orchestrators like Kubernetes provide pod affinity/anti-affinity to achieve this.
Following is an example of a hard rule using Kubernetes pod anti-affinity.

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: cache-app
    spec:
      selector:
        matchLabels:
          app: store
      replicas: 3
      template:
        metadata:
          labels:
            app: store
        spec:
          affinity:
            podAntiAffinity:
              # Hard rule: never place two "app: store" pods on the same node.
              requiredDuringSchedulingIgnoredDuringExecution:
              - labelSelector:
                  matchExpressions:
                  - key: app
                    operator: In
                    values:
                    - store
                topologyKey: "kubernetes.io/hostname"
          containers:
          - name: cache-server
            image: cache-server:1.0
2. Rack Level HA: Once you have achieved Node Level HA, you should think about Rack Level HA. Why do you need this? Because even in a rack-level disaster, such as a power or switch failure that takes out a whole rack, you need to make sure everything keeps running smoothly.
Following is an example that combines Node Level HA (hard rule) and Rack Level HA (soft rule).

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: cache-app
    spec:
      selector:
        matchLabels:
          app: store
      replicas: 3
      template:
        metadata:
          labels:
            app: store
        spec:
          affinity:
            podAntiAffinity:
              # Hard rule: never place two "app: store" pods on the same node.
              requiredDuringSchedulingIgnoredDuringExecution:
              - labelSelector:
                  matchExpressions:
                  - key: app
                    operator: In
                    values:
                    - store
                topologyKey: "kubernetes.io/hostname"
              # Soft rule: prefer placing "app: store" pods on different racks.
              # Assumes nodes are labeled with this rack key.
              preferredDuringSchedulingIgnoredDuringExecution:
              - weight: 100
                podAffinityTerm:
                  labelSelector:
                    matchExpressions:
                    - key: app
                      operator: In
                      values:
                      - store
                  topologyKey: "topology.kubernetes.io/rack"
          containers:
          - name: cache-server
            image: cache-server:1.0
Rack Level HA is good to have for stateful applications. It keeps the application available through a rack failure.
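The rack rule above relies on nodes carrying a label with the key topology.kubernetes.io/rack. Kubernetes defines well-known topology labels for zone and region, but to my knowledge not for racks, so this label is assumed to be set by whoever operates the cluster. As a rough sketch, a labeled node could look like this (the node name and rack value are hypothetical):

```yaml
# Fragment of a Node object, for illustration only.
# "worker-1" and "rack-a" are hypothetical values.
apiVersion: v1
kind: Node
metadata:
  name: worker-1
  labels:
    kubernetes.io/hostname: worker-1
    # Custom rack label used as topologyKey in the anti-affinity rule.
    topology.kubernetes.io/rack: rack-a
```

Nodes in the same physical rack would share the same label value, which is what lets the scheduler treat a rack as one failure domain.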
3. Availability Zone Level HA: It’s also called Zone Level HA. Here the processes/pods/containers are spread across multiple availability zones (AZs), so the application survives the loss of a single zone.
Following is a Kubernetes node affinity example that limits pods to specific availability zones, as you would on AWS. You can add Rack and Node Level HA rules too.

    apiVersion: v1
    kind: Pod
    metadata:
      name: with-node-affinity
    spec:
      affinity:
        nodeAffinity:
          # Hard rule: only schedule onto nodes labeled azname=az1 or azname=az2.
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: azname
                operator: In
                values:
                - az1
                - az2
          # Soft rule: among those nodes, prefer ones carrying this label.
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 1
            preference:
              matchExpressions:
              - key: another-node-label-key
                operator: In
                values:
                - another-node-label-value
      containers:
      - name: with-node-affinity
        image: us.gcr.io/k8s-artifacts-prod/pause:2.0
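Node affinity like the example above restricts which zones a pod may land in, but it does not by itself spread replicas evenly across those zones. One way to get the spreading is a topology spread constraint. The following is a minimal sketch, not a definitive setup: it reuses the cache-app Deployment from earlier for illustration and assumes the nodes carry the well-known topology.kubernetes.io/zone label, which cloud providers usually set.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cache-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: store
  template:
    metadata:
      labels:
        app: store
    spec:
      topologySpreadConstraints:
      # Keep the number of "app: store" pods per zone within 1 of each other.
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        # DoNotSchedule makes this a hard rule;
        # ScheduleAnyway would turn it into a soft rule.
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app: store
      containers:
      - name: cache-server
        image: cache-server:1.0
```

With three replicas and three zones, this aims for one pod per zone. Pod anti-affinity with the zone label as topologyKey is an alternative way to express a similar intent.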
4. Region Level HA: This is the highest level of availability for any application, short of a multi-cloud approach. Here two sets of the application run in two different regions, and load balancing between them is a crucial factor. This type of HA can later be extended to a multi-cloud approach. However, stateful applications need extra configuration, for example to keep data consistent between regions. Region Level HA is challenging to achieve, and it costs more to run and manage.
Recommendations for the HA framework:
- Always treat Rack Level HA as the minimum level of availability.
- In Kubernetes, you can configure a soft rule. It’s good to have a soft rule for Rack and Zone Level HA, so that scaling is not blocked when fewer racks are available or a zone is unavailable.
- Take extra care with stateful applications. For example, etcd/ZooKeeper spread over multiple zones needs to be verified with latency tests.
- When spreading the cluster's master nodes across zones or regions, run proper latency and functionality tests under the highest expected load.
- Set proper monitoring and alerting for the application and infrastructure.
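The soft-rule recommendation above could be sketched as follows. This is only an illustration under a few assumptions: the app: store label comes from the earlier examples, the rack label is a custom one that nodes must already carry, and the weights (valid range 1 to 100) are arbitrary choices that rank zone spreading above rack spreading.

```yaml
# Fragment of a pod template spec (spec.template.spec in a Deployment).
affinity:
  podAntiAffinity:
    # Soft rules only: the scheduler tries to honor them, but still places
    # the pod if no rack or zone satisfies them, so scaling is not blocked.
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      podAffinityTerm:
        labelSelector:
          matchLabels:
            app: store
        topologyKey: topology.kubernetes.io/zone
    - weight: 50
      podAffinityTerm:
        labelSelector:
          matchLabels:
            app: store
        # Custom label; assumed to be set on the nodes by the operator.
        topologyKey: topology.kubernetes.io/rack
```

Keeping the node-level rule hard and the rack and zone rules soft is a trade-off: replicas may end up sharing a rack or zone during an outage, but new pods can still be scheduled.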
Matrix for the HA framework:
You can use the matrix table to define your application's HA compliance with the HA framework.