Configuring Conditional Automatic Traffic Switchover

This section describes how to configure conditional automatic traffic switchover to identify CoreDNS faults in a cluster and automatically redirect traffic.

Installing CPD for a Cluster to Identify Faults

Before configuring automatic traffic switchover, you need to install cluster-problem-detector (CPD) in a cluster to automatically detect whether CoreDNS runs normally and report the results.

CPD periodically checks whether CoreDNS can resolve kubernetes.default and writes the result to the conditions of the node object. The active CPD pod collects the conditions from all nodes, determines whether domain name resolution in the cluster is normal, and reports the result to the federation control plane of the cluster.

CPD needs to be independently deployed as a DaemonSet on all nodes in each cluster. The following is an example CPD configuration file. You can modify the parameters by referring to Table 1.

**Table 1** CPD parameters
Parameter	Description
<federation-version>	Version of the federation that the cluster belongs to. On the Fleets tab, click the fleet name to obtain the version.
<your-cluster-name>	Name of the cluster where CPD is to be installed.
<kubeconfig-of-karmada>	The kubeconfig file of the federation control plane. For details about how to download a kubeconfig file that meets the requirements, see kubeconfig. CAUTION: When downloading the kubeconfig file, select the VPC where the cluster resides, or a VPC that can communicate with that VPC over a Cloud Connect or VPC peering connection. If the server address of the federation control plane in the kubeconfig file is a domain name, you also need to configure hostAliases in the YAML file.
hostAliases	Required only if the server address of the federation control plane in the kubeconfig file is a domain name. If the server address is an IP address, delete hostAliases from the YAML file. Replace <host name of karmada server> with the domain name of the federation control plane, which you can find in the server field of the kubeconfig file. Replace <ip of host name of karmada server> with the IP address that this domain name resolves to. To obtain the IP address, log in to the cluster node where CPD is to be deployed and run the ping {Domain name of the federation control plane} command.
coredns-detect-period	Interval at which CPD checks CoreDNS and reports the result. The default and recommended value is 5s. A smaller value means more frequent detection and reporting.
coredns-success-threshold	Threshold for the duration of successful domain name resolution by CoreDNS. The default and recommended value is 30s. If CoreDNS has been resolving the domain name successfully for longer than this threshold, it is considered normal. A higher value makes detection more stable but less sensitive. A lower value makes detection more sensitive but less stable.
coredns-failure-threshold	Threshold for the duration of failed domain name resolution by CoreDNS. The default and recommended value is 30s. If CoreDNS has been failing to resolve the domain name for longer than this threshold, it is considered faulty. A higher value makes detection more stable but less sensitive. A lower value makes detection more sensitive but less stable.
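
The two hostAliases placeholders in Table 1 can be derived from the kubeconfig file itself. The following is a minimal sketch, not part of CPD: karmada_server_host is a hypothetical helper name, and it assumes that the kubeconfig file contains a server entry in the usual scheme://host:port form.

```shell
# Hypothetical helper (not part of CPD): print the host part of the first
# "server:" entry found in the kubeconfig file passed as $1.
karmada_server_host() {
  sed -n 's/^[[:space:]]*server:[[:space:]]*//p' "$1" \
    | head -n 1 \
    | sed -e 's#^[a-z]*://##' -e 's#[:/].*$##'
}
```

For example, running ping -c 1 "$(karmada_server_host ./kubeconfig.yaml)" on the cluster node shows the IP address that the domain name resolves to. The path ./kubeconfig.yaml is an assumed local file name. If the helper prints an IP address rather than a domain name, hostAliases is not needed.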

kind: DaemonSet
apiVersion: apps/v1
metadata:
  name: cluster-problem-detector
  namespace: kube-system
  labels:
    app: cluster-problem-detector
spec:
  selector:
    matchLabels:
      app: cluster-problem-detector
  template:
    metadata:
      labels:
        app: cluster-problem-detector
    spec:
      containers:
        - image: swr.ap-southeast-3.myhuaweicloud.com/hwofficial/cluster-problem-detector:<federation-version>
          name: cluster-problem-detector
          command:
            - /bin/sh
            - '-c'
            - /var/paas/cluster-problem-detector/cluster-problem-detector
              --karmada-kubeconfig=/tmp/config
              --karmada-context=federation
              --cluster-name=<your-cluster-name>
              --host-name=${HOST_NAME}
              --bind-address=${POD_ADDRESS}
              --healthz-port=8081
              --detectors=*
              --coredns-detect-period=5s
              --coredns-success-threshold=30s
              --coredns-failure-threshold=30s
              --coredns-stale-threshold=60s
          env:
            - name: POD_ADDRESS
              valueFrom:
                fieldRef:
                  apiVersion: v1
                  fieldPath: status.podIP
            - name: POD_NAME
              valueFrom:
                fieldRef:
                  apiVersion: v1
                  fieldPath: metadata.name
            - name: POD_NAMESPACE
              valueFrom:
                fieldRef:
                  apiVersion: v1
                  fieldPath: metadata.namespace
            - name: HOST_NAME
              valueFrom:
                fieldRef:
                  apiVersion: v1
                  fieldPath: spec.nodeName
          livenessProbe:
            httpGet:
              path: /healthz
              port: 8081
              scheme: HTTP
            initialDelaySeconds: 3
            timeoutSeconds: 3
            periodSeconds: 5
            successThreshold: 1
            failureThreshold: 3
          readinessProbe:
            httpGet:
              path: /healthz
              port: 8081
              scheme: HTTP
            initialDelaySeconds: 3
            timeoutSeconds: 3
            periodSeconds: 5
            successThreshold: 1
            failureThreshold: 3
          volumeMounts:
            - mountPath: /tmp
              name: karmada-config
      serviceAccountName: cluster-problem-detector
      volumes:
        - configMap:
            name: karmada-kubeconfig
            items:
              - key: kubeconfig
                path: config
          name: karmada-config
      securityContext:
        fsGroup: 10000
        runAsUser: 10000
        seccompProfile:
          type: RuntimeDefault
      hostAliases:
      - hostnames:
          - <host name of karmada server>
        ip: <ip of host name of karmada server>
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: cluster-problem-detector
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: cpd-binding
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: system:cluster-problem-detector
subjects:
  - kind: ServiceAccount
    name: cluster-problem-detector
    namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: system:cluster-problem-detector
rules:
  - apiGroups:
      - ""
    resources:
      - nodes
    verbs:
      - get
      - list
      - watch
  - apiGroups:
      - ""
    resources:
      - nodes/status
    verbs:
      - patch
      - update
  - apiGroups:
      - ""
      - events.k8s.io
    resources:
      - events
    verbs:
      - create
      - patch
      - update
  - apiGroups:
      - coordination.k8s.io
    resources:
      - leases
    verbs:
      - get
      - list
      - watch
      - create
      - update
      - patch
      - delete
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: karmada-kubeconfig
  namespace: kube-system
data:
  kubeconfig: |+
    <kubeconfig-of-karmada>
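
Once the placeholders are replaced, the manifest can be applied with kubectl. The following is a minimal sketch under stated assumptions: the manifest above is saved to a local file, kubectl is configured for the member cluster, and deploy_cpd is a hypothetical helper name. The 120s timeout is an arbitrary illustrative value.

```shell
# Hypothetical helper: apply the CPD manifest passed as $1 and wait for the
# DaemonSet rollout to finish.
deploy_cpd() {
  # Refuse to apply while a <placeholder> from the example is still present.
  if grep -q '<[a-z][a-z -]*>' "$1"; then
    echo "unreplaced placeholders remain in $1" >&2
    return 1
  fi
  kubectl apply -f "$1" &&
    kubectl -n kube-system rollout status daemonset/cluster-problem-detector --timeout=120s
}
```

A call such as deploy_cpd cpd.yaml (cpd.yaml being an assumed file name) returns a non-zero status if a placeholder is left over or if the rollout does not complete in time.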

Checking Whether CPD Runs Normally

After deploying CPD, check whether CPD runs normally.

Run the following command to check whether the ServiceDomainNameResolutionReady condition exists in the conditions of the node and whether the lastHeartbeatTime of this condition is updated in a timely manner:
kubectl get node <node-name> -oyaml | grep -B4 ServiceDomainNameResolutionReady
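
Because CPD runs on every node, it can be convenient to list the condition for all nodes at once. The following sketch uses the JSONPath output of kubectl. The function name is hypothetical, and status and lastHeartbeatTime are the standard fields of a node condition.

```shell
# Hypothetical helper: for every node, print the node name, the status of the
# ServiceDomainNameResolutionReady condition, and its last heartbeat time.
list_node_dns_conditions() {
  cond='.status.conditions[?(@.type=="ServiceDomainNameResolutionReady")]'
  kubectl get nodes \
    -o jsonpath="{range .items[*]}{.metadata.name}{'\t'}{${cond}.status}{'\t'}{${cond}.lastHeartbeatTime}{'\n'}{end}"
}
```

A node whose second and third columns are empty does not have the condition yet.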

If the condition does not exist or its lastHeartbeatTime has not been updated for a long time:
1. Check whether the CPD pod is in the Ready state.
2. Check whether there is a LoadCorednsConditionFailed or StoreCorednsConditionFailed event in the member cluster. If the event exists, rectify the fault based on the error message in the event.
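
The two checks above can be sketched as follows. The pod label is taken from the example DaemonSet. It is an assumption that the event names appear in the output of kubectl get events, and because the namespace of the events is not stated, all namespaces are searched. check_cpd_health is a hypothetical helper name.

```shell
# Hypothetical helper: show CPD pod readiness and any CPD failure events.
check_cpd_health() {
  # The READY column should show 1/1 for each CPD pod.
  kubectl -n kube-system get pods -l app=cluster-problem-detector -o wide
  # Print matching events, or a short note if none are found.
  kubectl get events --all-namespaces \
    | grep -E 'LoadCorednsConditionFailed|StoreCorednsConditionFailed' \
    || echo "no CPD failure events found"
}
```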

Run the following command to check whether the ServiceDomainNameResolutionReady condition exists in the federation cluster object:
kubectl --kubeconfig <kubeconfig-of-federation> get cluster <cluster-name> -oyaml | grep ServiceDomainNameResolutionReady

If the cluster object does not contain the preceding condition:
1. Check whether the CPD log contains "failed to sync corendns condition to control plane, requeuing".
2. Check the kubeconfig file configuration. If the kubeconfig file configuration has been updated, deploy CPD again.
3. Check the network connectivity between the node where CPD resides and the VPC of the cluster that you selected when downloading the kubeconfig file.
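
The log check in step 1 can be sketched as follows. The pod label comes from the example DaemonSet. The function name and the 500-line tail are illustrative assumptions, and the search uses only the leading words of the message.

```shell
# Hypothetical helper: search recent logs of all CPD pods for sync failures.
# --prefix adds the pod name to each line so the affected node can be found.
find_cpd_sync_errors() {
  kubectl -n kube-system logs -l app=cluster-problem-detector \
    --tail=500 --prefix \
    | grep 'failed to sync'
}
```

The function returns a non-zero status when no matching line is found.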

Configuring a Policy for Conditional Automatic Traffic Switchover

Once CPD is deployed and runs normally, you need to create a Remedy object to perform specific actions when certain conditions are met. For example, if CoreDNS in a cluster is faulty, the cluster traffic will be redirected to an available cluster.

The following is an example configuration file of the Remedy object. It applies to the clusters member1 and member2. If CPD reports that CoreDNS in either cluster is faulty, the traffic of that cluster is automatically redirected to an available cluster. For details about the parameters of the Remedy object, see Table 2.

apiVersion: remedy.karmada.io/v1alpha1
kind: Remedy
metadata:
  name: foo
spec:
  clusterAffinity:
    clusterNames:
      - member1
      - member2
  decisionMatches:
  - clusterConditionMatch:
      conditionType: ServiceDomainNameResolutionReady
      operator: Equal
      conditionStatus: "False"
  actions:
  - TrafficControl

**Table 2** Remedy parameters
Parameter	Description
spec.clusterAffinity.clusterNames	List of clusters controlled by the policy. The specified action is performed only for clusters in the list. If this parameter is left blank, no action is performed.
spec.decisionMatches	Trigger condition list. When a cluster in the cluster list meets any trigger condition, the specified action is performed. If this parameter is left blank, the specified action is triggered unconditionally.
conditionType	Type of a trigger condition. Only ServiceDomainNameResolutionReady (domain name resolution of CoreDNS reported by CPD) is supported.
operator	Judgment logic. Only Equal (equal to) and NotEqual (not equal to) are supported.
conditionStatus	Status of the trigger condition that the operator compares against, for example, "False".
actions	Action to be performed by the policy. Currently, only TrafficControl (traffic control) is supported.
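
Creating the Remedy object can be sketched as follows. This assumes that the example above is saved to a local file and that the object is created in the federation control plane with the same kubeconfig file used in the earlier cluster check. apply_remedy is a hypothetical helper name.

```shell
# Hypothetical helper: create the Remedy object and read it back.
# $1 is the federation kubeconfig file, $2 is the Remedy manifest.
apply_remedy() {
  kubectl --kubeconfig "$1" apply -f "$2" &&
    kubectl --kubeconfig "$1" get -f "$2" -o yaml
}
```

For example, apply_remedy ./kubeconfig.yaml remedy.yaml, where both file names are assumed, prints the stored object so that clusterAffinity, decisionMatches, and actions can be compared with Table 2.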