Updated on 2024-11-13 GMT+08:00

How Do I Locate the Fault When a Cluster Is Unavailable?

If a cluster is unavailable, perform the following operations to locate the fault.

Troubleshooting Process

The check items below are listed in descending order of likelihood.

Work through them one by one until you find the cause of the fault.

If the fault persists after all checks, submit a service ticket so that customer service can help you locate it.

Check Item 1: Whether the Security Group Is Modified

  1. Log in to the management console and choose Service List > Networking > Virtual Private Cloud. In the navigation pane, choose Access Control > Security Groups to find the security group of the master node in the cluster.

    The name of this security group is in the format of Cluster name-cce-control-ID.

  2. Click the security group. On the details page displayed, ensure that the security group rules of the master node are correct.

    For details, see How Can I Configure a Security Group Rule in a Cluster?
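When an account has many security groups, the naming format above can be used to narrow the list. The following is a minimal sketch, not part of any CCE tooling: the function names and the sample group names are hypothetical, and it assumes only that the master node security group name starts with the cluster name followed by -cce-control- and a non-empty ID.

```python
from typing import Iterable, List


def is_master_security_group(group_name: str, cluster_name: str) -> bool:
    """Return True if group_name follows "<cluster name>-cce-control-<ID>"."""
    prefix = f"{cluster_name}-cce-control-"
    # The ID part must be present, so the name must be longer than the prefix.
    return group_name.startswith(prefix) and len(group_name) > len(prefix)


def find_master_security_groups(group_names: Iterable[str], cluster_name: str) -> List[str]:
    """Filter a list of security group names down to the master node group(s)."""
    return [name for name in group_names if is_master_security_group(name, cluster_name)]


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    names = ["demo-cce-control-ab12", "demo-cce-node-ab12", "other-cce-control-x9"]
    print(find_master_security_groups(names, "demo"))
```

The list of names would come from the Security Groups page in the console; how you export it is up to you.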

Check Item 2: Whether There Are Residual Listeners and Backend Server Groups on the Load Balancer

Reproducing the Problem

A cluster exception occurs while a LoadBalancer Service is being created or deleted. After the fault is rectified, the Service is deleted successfully, but residual listeners and backend server groups remain on the load balancer.

  1. Create a CCE cluster in advance. In the cluster, create a workload from the official Nginx image, and prepare a load balancer, a Service, and an ingress.
  2. Ensure that the cluster is running properly and the Nginx workload is stable.
  3. Create and delete 10 LoadBalancer Services every 20 seconds.
  4. Inject a fault into the cluster while the Services are being created and deleted. For example, make etcd unavailable or hibernate the cluster.
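The batch of 10 LoadBalancer Services in step 3 can be generated rather than written by hand. The following is a rough sketch under stated assumptions: the function names, the lb-test name prefix, the app: nginx selector, and port 80 are all hypothetical, and any annotations your cluster needs to bind a Service to the preset load balancer are deliberately omitted and must be added for your environment.

```python
import json
from typing import Dict, List


def build_loadbalancer_services(count: int = 10,
                                app_label: str = "nginx",
                                prefix: str = "lb-test") -> List[Dict]:
    """Build `count` Service manifests of type LoadBalancer.

    The selector label, name prefix, and port are assumptions for illustration.
    """
    services = []
    for i in range(count):
        services.append({
            "apiVersion": "v1",
            "kind": "Service",
            "metadata": {"name": f"{prefix}-{i}"},
            "spec": {
                "type": "LoadBalancer",
                "selector": {"app": app_label},
                "ports": [{"port": 80, "targetPort": 80, "protocol": "TCP"}],
            },
        })
    return services


def as_manifest_list(services: List[Dict]) -> str:
    """Wrap the Services in a single List object serialized as JSON."""
    return json.dumps({"apiVersion": "v1", "kind": "List", "items": services}, indent=2)


if __name__ == "__main__":
    print(as_manifest_list(build_loadbalancer_services()))
```

One way to use the output is to save it to a file, then alternate `kubectl apply -f` and `kubectl delete -f` on that file at 20-second intervals while the fault in step 4 is injected.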

Possible Causes

There are residual listeners and backend server groups on the load balancer.

Solution

Manually clear residual listeners and backend server groups.

  1. Log in to the management console and choose Networking > Elastic Load Balance from the service list.
  2. In the load balancer list, click the name of the target load balancer to go to the details page. On the Listeners tab, locate the target listener and delete it.
  3. On the Backend Server Groups page, locate the target backend server group and delete it.

Check Item 3: Whether the KMS Key Used for Secret Encryption Is Valid

If a cluster is unavailable, check the cluster events to locate the fault.

If KMS key status abnormal is displayed in the events, check whether the key used by the cluster is in the Disabled or Pending deletion state.

Solution

  1. Log in to the DEW console.
  2. In the custom key list, find the KMS key used by the cluster.

    • For a key in the Pending deletion state, click Cancel Deletion in the Operation column. If the key is still in the Disabled state after the deletion is canceled, enable the key as well.
    • For a key in the Disabled state, click Enable in the Operation column.

  3. Verify that the key is enabled, then wait for the cluster to recover automatically. Recovery takes about 5 to 10 minutes.
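The decision in step 2 can be summarized as a small lookup. This is an illustrative sketch only: the function name is hypothetical, it calls no DEW or KMS API, and the state strings are simply the labels quoted in this section.

```python
from typing import List


def recovery_actions(key_state: str) -> List[str]:
    """Map a KMS key state shown in the console to the console actions to take."""
    state = key_state.strip().lower()
    if state == "pending deletion":
        # Canceling deletion may leave the key Disabled, so Enable can be needed too.
        return ["Cancel Deletion", "Enable (only if the key is still Disabled)"]
    if state == "disabled":
        return ["Enable"]
    # Any other state needs no action from this check item.
    return []


if __name__ == "__main__":
    for state in ("Pending deletion", "Disabled", "Enabled"):
        print(state, "->", recovery_actions(state))
```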