Affinity and Anti-Affinity Scheduling
A nodeSelector provides a very simple way to constrain pods to nodes with specific labels, as mentioned in DaemonSets. Affinity and anti-affinity expand the types of constraints you can define.
Kubernetes supports node-level and pod-level affinity and anti-affinity. You can configure custom rules for affinity and anti-affinity scheduling. For example, you can deploy frontend pods and backend pods together, deploy the same type of applications onto specific nodes, or deploy applications onto different nodes.
Node Affinity
Node affinity is conceptually similar to a nodeSelector as it allows you to constrain which nodes your pod is eligible to be scheduled on, based on labels of the node. The following output lists the labels of a node.
$ kubectl describe node 192.168.0.212
Name:               192.168.0.212
Roles:              <none>
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/os=linux
                    failure-domain.beta.kubernetes.io/is-baremetal=false
                    failure-domain.beta.kubernetes.io/region=cn-east-3
                    failure-domain.beta.kubernetes.io/zone=cn-east-3a
                    kubernetes.io/arch=amd64
                    kubernetes.io/availablezone=cn-east-3a
                    kubernetes.io/eniquota=12
                    kubernetes.io/hostname=192.168.0.212
                    kubernetes.io/os=linux
                    node.kubernetes.io/subnetid=fd43acad-33e7-48b2-a85a-24833f362e0e
                    os.architecture=amd64
                    os.name=EulerOS_2.0_SP5
                    os.version=3.10.0-862.14.1.5.h328.eulerosv2r7.x86_64
These labels are automatically added by CCE during node creation. The following describes a few that are frequently used during scheduling.
- failure-domain.beta.kubernetes.io/region: region where the node is located. In the preceding output, the label value is cn-east-3, which indicates that the node is located in the CN East-Shanghai1 region.
- failure-domain.beta.kubernetes.io/zone: availability zone to which the node belongs.
- kubernetes.io/hostname: hostname of the node.
In addition to these automatically added labels, you can tailor labels to your service requirements, as introduced in Label for Managing Pods. Generally, large Kubernetes clusters have various kinds of labels.
When you deploy pods, you can use a nodeSelector, as described in DaemonSets, to constrain pods to nodes with specific labels. The following example shows how to use a nodeSelector to deploy pods only on the nodes with the gpu=true label.
apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  nodeSelector:     # Node selection. A pod is deployed on a node only when the node is labeled with gpu=true.
    gpu: "true"     # Quoted because label values must be strings; an unquoted true is parsed as a boolean.
...
Node affinity rules can achieve the same result, as shown in the following example.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu
  labels:
    app: gpu
spec:
  selector:
    matchLabels:
      app: gpu
  replicas: 3
  template:
    metadata:
      labels:
        app: gpu
    spec:
      containers:
      - image: nginx:alpine
        name: gpu
        resources:
          requests:
            cpu: 100m
            memory: 200Mi
          limits:
            cpu: 100m
            memory: 200Mi
      imagePullSecrets:
      - name: default-secret
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: gpu
                operator: In
                values:
                - "true"
Even though the node affinity rules require more lines of code, they are more expressive, which will be further described later.
requiredDuringSchedulingIgnoredDuringExecution seems to be complex, but it can be easily understood as a combination of two parts.
- requiredDuringScheduling indicates that pods can be scheduled to the node only when all the defined rules are met (required).
- IgnoredDuringExecution indicates that pods already running on the node do not need to meet the defined rules. If a label is removed from the node, the pods that require the node to contain that label will not be rescheduled.
In addition, the value of operator is In, indicating that the label value must be in the values list. Other available operator values are as follows:
- NotIn: The label value is not in a list.
- Exists: A specific label exists.
- DoesNotExist: A specific label does not exist.
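- Gt: The label value is greater than a specified value. Both the label value and the single entry in values are parsed as integers before comparison.
- Lt: The label value is less than a specified value. Both the label value and the single entry in values are parsed as integers before comparison.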
Note that there is no such thing as nodeAntiAffinity because the NotIn and DoesNotExist operators provide the same function.
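The operator semantics listed above can be sketched in a few lines of Python. This is only an illustrative model, not the scheduler's code: the function names matches and term_matches are hypothetical, and the node labels are taken from the kubectl describe output shown earlier plus the gpu=true label used in this section.

```python
def matches(labels, expr):
    """Return True if a node's labels satisfy one matchExpressions entry."""
    key, op = expr["key"], expr["operator"]
    values = expr.get("values", [])
    if op == "In":
        return key in labels and labels[key] in values
    if op == "NotIn":
        # A node that lacks the key also satisfies NotIn.
        return key not in labels or labels[key] not in values
    if op == "Exists":
        return key in labels
    if op == "DoesNotExist":
        return key not in labels
    if op in ("Gt", "Lt"):
        # Both sides are parsed as integers; a missing or non-numeric label never matches.
        if key not in labels or len(values) != 1:
            return False
        try:
            have, want = int(labels[key]), int(values[0])
        except ValueError:
            return False
        return have > want if op == "Gt" else have < want
    raise ValueError(f"unknown operator: {op}")


def term_matches(labels, term):
    """All expressions inside one term must hold (logical AND)."""
    return all(matches(labels, e) for e in term["matchExpressions"])


# Hypothetical node: labels borrowed from the examples in this section.
node = {"gpu": "true", "kubernetes.io/eniquota": "12"}
gpu_term = {"matchExpressions": [{"key": "gpu", "operator": "In", "values": ["true"]}]}
quota_gt = {"key": "kubernetes.io/eniquota", "operator": "Gt", "values": ["8"]}
```

With these inputs, term_matches(node, gpu_term) holds, and matches(node, quota_gt) holds because 12 is greater than 8 as an integer, even though the string "12" sorts before "8".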
Now, check whether the node affinity rule takes effect. Add the gpu=true label to the 192.168.0.212 node.
$ kubectl label node 192.168.0.212 gpu=true
node/192.168.0.212 labeled

$ kubectl get node -L gpu
NAME            STATUS   ROLES    AGE   VERSION                            GPU
192.168.0.212   Ready    <none>   13m   v1.15.6-r1-20.3.0.2.B001-15.30.2   true
192.168.0.94    Ready    <none>   13m   v1.15.6-r1-20.3.0.2.B001-15.30.2
192.168.0.97    Ready    <none>   13m   v1.15.6-r1-20.3.0.2.B001-15.30.2
Create a Deployment. You can find that all pods are deployed on the 192.168.0.212 node.
$ kubectl create -f affinity.yaml
deployment.apps/gpu created

$ kubectl get pod -o wide
NAME                   READY   STATUS    RESTARTS   AGE   IP            NODE
gpu-6df65c44cf-42xw4   1/1     Running   0          15s   172.16.0.37   192.168.0.212
gpu-6df65c44cf-jzjvs   1/1     Running   0          15s   172.16.0.36   192.168.0.212
gpu-6df65c44cf-zv5cl   1/1     Running   0          15s   172.16.0.38   192.168.0.212
Node Preference Rule
The preceding requiredDuringSchedulingIgnoredDuringExecution rule is a hard selection rule. There is another type of selection rule, preferredDuringSchedulingIgnoredDuringExecution, which specifies the nodes that are preferred during scheduling.
To demonstrate its effect, add a node in a different AZ from other nodes to the cluster. Then, check the AZ of the node. As shown in the following output, the newly added node is in cn-east-3c.
$ kubectl get node -L failure-domain.beta.kubernetes.io/zone,gpu
NAME            STATUS   ROLES    AGE     VERSION                            ZONE         GPU
192.168.0.100   Ready    <none>   7h23m   v1.15.6-r1-20.3.0.2.B001-15.30.2   cn-east-3c
192.168.0.212   Ready    <none>   8h      v1.15.6-r1-20.3.0.2.B001-15.30.2   cn-east-3a   true
192.168.0.94    Ready    <none>   8h      v1.15.6-r1-20.3.0.2.B001-15.30.2   cn-east-3a
192.168.0.97    Ready    <none>   8h      v1.15.6-r1-20.3.0.2.B001-15.30.2   cn-east-3a
Define a Deployment. Use the preferredDuringSchedulingIgnoredDuringExecution rule to set the weight of nodes in cn-east-3a to 80 and nodes with the gpu=true label to 20. In this way, pods are preferentially deployed onto the nodes in cn-east-3a.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu
  labels:
    app: gpu
spec:
  selector:
    matchLabels:
      app: gpu
  replicas: 10
  template:
    metadata:
      labels:
        app: gpu
    spec:
      containers:
      - image: nginx:alpine
        name: gpu
        resources:
          requests:
            cpu: 100m
            memory: 200Mi
          limits:
            cpu: 100m
            memory: 200Mi
      imagePullSecrets:
      - name: default-secret
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 80
            preference:
              matchExpressions:
              - key: failure-domain.beta.kubernetes.io/zone
                operator: In
                values:
                - cn-east-3a
          - weight: 20
            preference:
              matchExpressions:
              - key: gpu
                operator: In
                values:
                - "true"
After the Deployment is created, you can find that five pods are deployed on the 192.168.0.212 node, three pods on the 192.168.0.97 node, and two pods on the 192.168.0.100 node.
$ kubectl create -f affinity2.yaml
deployment.apps/gpu created

$ kubectl get po -o wide
NAME                   READY   STATUS    RESTARTS   AGE     IP            NODE
gpu-585455d466-5bmcz   1/1     Running   0          2m29s   172.16.0.44   192.168.0.212
gpu-585455d466-cg2l6   1/1     Running   0          2m29s   172.16.0.63   192.168.0.97
gpu-585455d466-f2bt2   1/1     Running   0          2m29s   172.16.0.79   192.168.0.100
gpu-585455d466-hdb5n   1/1     Running   0          2m29s   172.16.0.42   192.168.0.212
gpu-585455d466-hkgvz   1/1     Running   0          2m29s   172.16.0.43   192.168.0.212
gpu-585455d466-mngvn   1/1     Running   0          2m29s   172.16.0.48   192.168.0.97
gpu-585455d466-s26qs   1/1     Running   0          2m29s   172.16.0.62   192.168.0.97
gpu-585455d466-sxtzm   1/1     Running   0          2m29s   172.16.0.45   192.168.0.212
gpu-585455d466-t56cm   1/1     Running   0          2m29s   172.16.0.64   192.168.0.100
gpu-585455d466-t5w5x   1/1     Running   0          2m29s   172.16.0.41   192.168.0.212
In the preceding example, a node with both the cn-east-3a and gpu=true labels has the first (highest) priority (combined weight: 100), a node with only the cn-east-3a label has the second priority (weight: 80), a node with only the gpu=true label has the third priority (weight: 20), and a node with neither label has the fourth (lowest) priority.
The preceding output shows that no pods of the Deployment are scheduled to node 192.168.0.94, even though this node is in cn-east-3a. This is because the node already has many pods on it and its resource usage is high. It also shows that the preferredDuringSchedulingIgnoredDuringExecution rule defines a preference rather than a hard requirement.
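The priority order described above follows from adding up the weights of the preference terms that each node satisfies. The following Python sketch reproduces that arithmetic for the four nodes in this example. It is a simplified model under stated assumptions: the function name preference_score is hypothetical, only the In operator is modelled, and the real scheduler combines this value with other scoring factors such as resource usage.

```python
def preference_score(labels, preferred_terms):
    """Sum the weights of every preferred term whose expressions all match."""
    score = 0
    for term in preferred_terms:
        exprs = term["preference"]["matchExpressions"]
        # Only the In operator is modelled, which is all this example uses.
        if all(labels.get(e["key"]) in e["values"] for e in exprs):
            score += term["weight"]
    return score


ZONE = "failure-domain.beta.kubernetes.io/zone"

# The two preference terms from the Deployment in this section.
preferred = [
    {"weight": 80, "preference": {"matchExpressions": [
        {"key": ZONE, "operator": "In", "values": ["cn-east-3a"]}]}},
    {"weight": 20, "preference": {"matchExpressions": [
        {"key": "gpu", "operator": "In", "values": ["true"]}]}},
]

# Node labels as listed by kubectl get node -L in this section.
nodes = {
    "192.168.0.100": {ZONE: "cn-east-3c"},
    "192.168.0.212": {ZONE: "cn-east-3a", "gpu": "true"},
    "192.168.0.94": {ZONE: "cn-east-3a"},
    "192.168.0.97": {ZONE: "cn-east-3a"},
}

scores = {name: preference_score(labels, preferred) for name, labels in nodes.items()}
for name, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(name, score)
```

In this model, 192.168.0.94 and 192.168.0.97 receive the same affinity score, so the difference in their pod counts must come from factors outside the affinity rule, which is consistent with the rule being a preference rather than a requirement.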
Workload Affinity (podAffinity)
Node affinity rules affect only the affinity between pods and nodes. Kubernetes also supports configuring inter-pod affinity rules. For example, the frontend and backend of an application can be deployed together on one node to reduce access latency. There are also two types of inter-pod affinity rules: requiredDuringSchedulingIgnoredDuringExecution and preferredDuringSchedulingIgnoredDuringExecution.
For workload affinity, topologyKey cannot be left blank when either requiredDuringSchedulingIgnoredDuringExecution or preferredDuringSchedulingIgnoredDuringExecution is used.
Assume that the backend of an application has been created and has the app=backend label.
$ kubectl get po -o wide
NAME                       READY   STATUS    RESTARTS   AGE     IP            NODE
backend-658f6cb858-dlrz8   1/1     Running   0          2m36s   172.16.0.67   192.168.0.100
You can configure the following pod affinity rule to deploy the frontend pods of the application to the same node as the backend pods.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: frontend
  labels:
    app: frontend
spec:
  selector:
    matchLabels:
      app: frontend
  replicas: 3
  template:
    metadata:
      labels:
        app: frontend
    spec:
      containers:
      - image: nginx:alpine
        name: frontend
        resources:
          requests:
            cpu: 100m
            memory: 200Mi
          limits:
            cpu: 100m
            memory: 200Mi
      imagePullSecrets:
      - name: default-secret
      affinity:
        podAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - topologyKey: kubernetes.io/hostname
            labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - backend
Deploy the frontend and you can find that the frontend is deployed on the same node as the backend.
$ kubectl create -f affinity3.yaml
deployment.apps/frontend created

$ kubectl get po -o wide
NAME                        READY   STATUS    RESTARTS   AGE     IP            NODE
backend-658f6cb858-dlrz8    1/1     Running   0          5m38s   172.16.0.67   192.168.0.100
frontend-67ff9b7b97-dsqzn   1/1     Running   0          6s      172.16.0.70   192.168.0.100
frontend-67ff9b7b97-hxm5t   1/1     Running   0          6s      172.16.0.71   192.168.0.100
frontend-67ff9b7b97-z8pdb   1/1     Running   0          6s      172.16.0.72   192.168.0.100
The topologyKey field names a node label that is used to divide nodes into topology domains, which define the selection range. Nodes that carry this label key with the same value are considered to be in the same topology domain. The rules that follow topologyKey are then evaluated within each domain. The effect of topologyKey is not fully demonstrated in the preceding example because every node has the kubernetes.io/hostname label with its own unique value, so each node forms a topology domain by itself.
To see how topologyKey works, assume that the backend of the application has two pods, which are running on different nodes.
$ kubectl get po -o wide
NAME                       READY   STATUS    RESTARTS   AGE     IP            NODE
backend-658f6cb858-5bpd6   1/1     Running   0          23m     172.16.0.40   192.168.0.97
backend-658f6cb858-dlrz8   1/1     Running   0          2m36s   172.16.0.67   192.168.0.100
Add the prefer=true label to nodes 192.168.0.97 and 192.168.0.94.
$ kubectl label node 192.168.0.97 prefer=true
node/192.168.0.97 labeled
$ kubectl label node 192.168.0.94 prefer=true
node/192.168.0.94 labeled

$ kubectl get node -L prefer
NAME            STATUS   ROLES    AGE   VERSION                            PREFER
192.168.0.100   Ready    <none>   44m   v1.15.6-r1-20.3.0.2.B001-15.30.2
192.168.0.212   Ready    <none>   91m   v1.15.6-r1-20.3.0.2.B001-15.30.2
192.168.0.94    Ready    <none>   91m   v1.15.6-r1-20.3.0.2.B001-15.30.2   true
192.168.0.97    Ready    <none>   91m   v1.15.6-r1-20.3.0.2.B001-15.30.2   true
If the topologyKey of podAffinity is set to prefer, the nodes are divided into topology domains as shown in Figure 2.
affinity:
  podAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - topologyKey: prefer
      labelSelector:
        matchExpressions:
        - key: app
          operator: In
          values:
          - backend
During scheduling, nodes are divided into topology domains based on the prefer label. In this example, 192.168.0.97 and 192.168.0.94 both have prefer=true and therefore belong to the same topology domain. If a pod with the app=backend label runs anywhere in that domain, the frontend can be deployed on any node in the domain (192.168.0.97 or 192.168.0.94), even if not every node in the domain runs such a pod (in this example, only 192.168.0.97 does). Nodes without the prefer label (192.168.0.100 and 192.168.0.212) belong to no topology domain for this rule, so the frontend is not scheduled to them even though 192.168.0.100 runs a backend pod.
$ kubectl create -f affinity3.yaml
deployment.apps/frontend created

$ kubectl get po -o wide
NAME                        READY   STATUS    RESTARTS   AGE     IP            NODE
backend-658f6cb858-5bpd6    1/1     Running   0          26m     172.16.0.40   192.168.0.97
backend-658f6cb858-dlrz8    1/1     Running   0          5m38s   172.16.0.67   192.168.0.100
frontend-67ff9b7b97-dsqzn   1/1     Running   0          6s      172.16.0.70   192.168.0.97
frontend-67ff9b7b97-hxm5t   1/1     Running   0          6s      172.16.0.71   192.168.0.97
frontend-67ff9b7b97-z8pdb   1/1     Running   0          6s      172.16.0.72   192.168.0.97
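The way topologyKey narrows the set of eligible nodes can be modelled as a grouping step followed by a membership check. The sketch below is a rough approximation of a required podAffinity rule with a single label match, not the scheduler's implementation. The function name affinity_candidates and the data layout are hypothetical, while the node names, labels, and backend pod placements are taken from the example above.

```python
def affinity_candidates(nodes, pods, topology_key, pod_label):
    """Return the nodes that share a topology domain with a matching pod."""
    key, value = pod_label
    # Collect the topology-domain values that already host a matching pod.
    domains = set()
    for pod in pods:
        node_labels = nodes[pod["node"]]
        if pod["labels"].get(key) == value and topology_key in node_labels:
            domains.add(node_labels[topology_key])
    # Nodes without the topology key belong to no domain and are never candidates.
    return sorted(
        name for name, labels in nodes.items()
        if labels.get(topology_key) in domains
    )


HOST = "kubernetes.io/hostname"
nodes = {
    "192.168.0.100": {HOST: "192.168.0.100"},
    "192.168.0.212": {HOST: "192.168.0.212"},
    "192.168.0.94": {HOST: "192.168.0.94", "prefer": "true"},
    "192.168.0.97": {HOST: "192.168.0.97", "prefer": "true"},
}
pods = [
    {"node": "192.168.0.97", "labels": {"app": "backend"}},
    {"node": "192.168.0.100", "labels": {"app": "backend"}},
]

by_prefer = affinity_candidates(nodes, pods, "prefer", ("app", "backend"))
by_host = affinity_candidates(nodes, pods, HOST, ("app", "backend"))
print(by_prefer)  # ['192.168.0.94', '192.168.0.97']
print(by_host)    # ['192.168.0.100', '192.168.0.97']
```

With topologyKey set to prefer, both labeled nodes are candidates even though only 192.168.0.97 runs a backend pod. With kubernetes.io/hostname, only the two nodes that actually run a backend pod qualify.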
Workload Anti-Affinity (podAntiAffinity)
Sometimes the opposite is needed: instead of being scheduled onto the same node, certain pods should be kept apart. For example, some pods may degrade each other's performance when they are deployed together.
For workload anti-affinity, when requiredDuringSchedulingIgnoredDuringExecution is used, the Kubernetes admission controller LimitPodHardAntiAffinityTopology requires that topologyKey be kubernetes.io/hostname. To use other custom topology logic, modify or disable the admission controller.
The following is an example of defining an anti-affinity rule. This rule divides nodes into topology domains by the kubernetes.io/hostname label. If a pod with the app=frontend label already exists on a node in a topology domain, pods with the same label cannot be scheduled to any node in that domain.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: frontend
  labels:
    app: frontend
spec:
  selector:
    matchLabels:
      app: frontend
  replicas: 5
  template:
    metadata:
      labels:
        app: frontend
    spec:
      containers:
      - image: nginx:alpine
        name: frontend
        resources:
          requests:
            cpu: 100m
            memory: 200Mi
          limits:
            cpu: 100m
            memory: 200Mi
      imagePullSecrets:
      - name: default-secret
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - topologyKey: kubernetes.io/hostname   # Topology key of the node
            labelSelector:                        # Pod label matching rule
              matchExpressions:
              - key: app
                operator: In
                values:
                - frontend
Create the Deployment and view the result. In the example, nodes are divided into topology domains by the kubernetes.io/hostname label. Because each node has a different value for this label, each topology domain contains exactly one node. Once a frontend pod is running in a domain, no other pod with the same label is scheduled to that domain. The cluster has only four nodes but the Deployment requests five replicas, so one pod remains in the Pending state and cannot be scheduled.
$ kubectl create -f affinity4.yaml
deployment.apps/frontend created

$ kubectl get po -o wide
NAME                        READY   STATUS    RESTARTS   AGE   IP            NODE
frontend-6f686d8d87-8dlsc   1/1     Running   0          18s   172.16.0.76   192.168.0.100
frontend-6f686d8d87-d6l8p   0/1     Pending   0          18s   <none>        <none>
frontend-6f686d8d87-hgcq2   1/1     Running   0          18s   172.16.0.54   192.168.0.97
frontend-6f686d8d87-q7cfq   1/1     Running   0          18s   172.16.0.47   192.168.0.212
frontend-6f686d8d87-xl8hx   1/1     Running   0          18s   172.16.0.23   192.168.0.94
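The Pending pod in the preceding output is a direct consequence of having more replicas than topology domains. The following Python sketch illustrates this with a greedy placement loop. It is a toy model under stated assumptions: the function name place_with_anti_affinity is hypothetical, and nodes are picked in dictionary order, whereas the real scheduler ranks nodes by score, so the pod-to-node mapping will differ.

```python
def place_with_anti_affinity(node_domains, replicas):
    """Greedily place replicas so that no two share a topology domain."""
    used_domains = set()
    placements = []
    for _ in range(replicas):
        # Pick the first node whose domain does not yet host a replica.
        target = next(
            (node for node, domain in node_domains.items()
             if domain not in used_domains),
            None,
        )
        if target is not None:
            used_domains.add(node_domains[target])
        placements.append(target)  # None models a pod left in Pending
    return placements


# With kubernetes.io/hostname as the topology key, every node is its own domain.
node_ips = ["192.168.0.100", "192.168.0.212", "192.168.0.94", "192.168.0.97"]
by_hostname = {ip: ip for ip in node_ips}

placements = place_with_anti_affinity(by_hostname, replicas=5)
pending = placements.count(None)
print(placements, pending)
```

Four replicas each land on a different node and the fifth is left unplaced, matching the single Pending pod shown above. Grouping the same nodes by a coarser label, such as the zone, would yield fewer domains and therefore more unplaced replicas.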