High-Risk Operations

During service deployment or operation, you may perform high-risk operations at different levels that cause service faults or interruptions. To help you estimate and avoid these risks, this section describes the impact of and solutions to high-risk operations in the following areas: clusters and nodes, networking, containers, load balancing, logs, monitoring, EVS disks, and add-ons.

Clusters and Nodes

**Table 1** High-risk operations and solutions
Category	Operation	Impact	Solution
Cluster	Using kube-apiserver to retrieve large volumes of data at a time. For example, a large number of LIST requests are initiated at a time, or a single LIST request is used to retrieve a large amount of data.	The master node's memory is overloaded, affecting system stability.	Stop executing a large number of queries. To prevent a cluster from being overloaded, scale up the cluster. For details, see Changing a Cluster Scale.
Master node	Modifying the security group of master nodes in a cluster NOTE: Naming rule of a master node's security group: Cluster name-cce-control-{Random ID}	The master nodes may be unavailable.	Restore the security group by referring to "Creating a Cluster" and allow traffic from and to the master nodes. For details, see How Can I Configure a Security Group Rule for a Cluster?
	Letting a master node expire or destroying a master node	The master node will become unavailable.	This operation cannot be undone.
	Reinstalling the OS	The master node components will be deleted.	This operation cannot be undone.
	Upgrading master node components or the etcd version	The cluster may be unavailable.	Roll back to the original version.
	Deleting or formatting core directories such as /etc/kubernetes on a node	The master node will become unavailable.	This operation cannot be undone.
	Changing the IP address of a master node	The master node will become unavailable.	Change the IP address back to the original one.
	Modifying parameters of core components (such as etcd, kube-apiserver, and Docker)	The master node may be unavailable.	Restore the parameter settings to the recommended values. For details, see Modifying Cluster Configurations.
	Replacing a master node or etcd certificate	The cluster may be unavailable.	This operation cannot be undone.
Worker node	Modifying the security group of worker nodes in a cluster NOTE: Naming rule of a worker node's security group: Cluster name-cce-node-{Random ID}	The node may be unavailable.	Restore the security group by referring to "Creating a Cluster" and allow traffic from and to the worker nodes. For details, see How Can I Configure a Security Group Rule for a Cluster?
	Modifying the DNS configuration (/etc/resolv.conf) of a node	Internal domain names cannot be accessed, and some functions such as add-ons and in-place node upgrades become abnormal. NOTE: If your service needs to use an on-premises DNS, configure the DNS in the workload. Do not change the node's DNS address. For details, see DNS Configuration.	Restore the DNS configuration based on the DNS configuration of a new node.
	Deleting the node	The node will become unavailable.	This operation cannot be undone.
	Deleting the elastic network interface used by the node	The container network on the node is unavailable.	This operation cannot be undone.
	Reinstalling the OS	Node components are deleted, and the node becomes unavailable.	Reset the node. For details, see Resetting a Node.
	Upgrading the kernel or components on which the container platform depends (such as Open vSwitch, IPvlan, Docker, and containerd)	The node may be unavailable or the network may be abnormal. NOTE: Node operation depends on the system kernel version. Do not use the yum update command to update or reinstall the kernel of a node unless necessary. (Reinstalling the operating system kernel using the original image or other images is a risky operation.)	Reset the node. For details, see Resetting a Node.
	Changing the IP address of a node	The node will become unavailable.	Change the IP address back to the original one.
	Modifying parameters of core components (such as kubelet and kube-proxy)	The node may become unavailable, and components may be insecure if security-related configurations are modified.	Restore the parameter settings to the recommended values. For details, see Configuring a Node Pool.
	Modifying the OS configuration	The node may be unavailable.	Restore the configuration items or reset the node. For details, see Resetting a Node.
	Deleting or modifying the /opt/cloud/cce and /var/paas directories, and deleting a data disk	The node will become unavailable.	Reset the node. For details, see Resetting a Node.
	Modifying the permissions of node directories and container directories. The following paths are involved: /usr/lib/systemd/system/kubelet.service, /usr/lib/systemd/system/containerd-monit.service, /usr/lib/systemd/system/docker-monit.service, /opt/cloud/cce, /var/paas, /var/paas/script, /var/paas/sys/log, /var/paas/kubernetes, /var/script/docker, /var/script/kubelet, /etc/containerd, /etc/rc.local, /etc/sudoers.d/sudoerspaas, /etc/sysconfig/docker, /etc/docker/daemon.json, /var/lib/docker, /mnt/paas/kubernetes, and /mnt/paas/runtime	The permissions will be abnormal.	Do not modify the permissions. Restore the permissions if they have been modified.
	Formatting or partitioning system disks, Docker disks, and kubelet disks on a node	The node may be unavailable.	Reset the node. For details, see Resetting a Node.
	Installing other software on nodes	This may cause exceptions on Kubernetes components installed on the node, and the node may be unavailable.	Uninstall the software and restore or reset the node. For details, see Resetting a Node.
	Modifying NetworkManager configurations	The node will become unavailable.	Reset the node. For details, see Resetting a Node.
	Deleting system images such as cce-pause from a node	Containers cannot be created and system images cannot be pulled.	Copy the image from a node that functions normally.
	Changing the flavor of a node in a node pool on the ECS console	If a node's flavor differs from the flavor specified for its node pool, the number of nodes added during a node pool scale-out differs from the expected number.	Change the node flavor to the one specified in the node pool, or delete the node and perform a node pool scale-out again.
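
Table 1 lists bulk LIST requests against kube-apiserver as a cause of master node memory overload. One common way to keep each response small, which this section does not itself describe, is to use the Kubernetes list parameters `limit` and `continue`. The following is a minimal Python sketch under that assumption. `list_in_pages`, `fetch_page`, and `fake_fetch_page` are hypothetical names introduced for illustration, and `fetch_page` stands in for whatever client call you actually use.

```python
from typing import Callable, Iterator, List, Optional, Tuple

# Hypothetical callback type: takes (limit, continue_token) and returns
# (items, next_continue_token). It stands in for a real client call that
# passes the Kubernetes list parameters `limit` and `continue`.
FetchPage = Callable[[int, Optional[str]], Tuple[List[dict], Optional[str]]]


def list_in_pages(fetch_page: FetchPage, page_size: int = 500) -> Iterator[dict]:
    """Yield items page by page instead of issuing one very large LIST."""
    if page_size <= 0:
        raise ValueError("page_size must be positive")
    token: Optional[str] = None
    while True:
        items, token = fetch_page(page_size, token)
        yield from items
        if not token:  # An empty or missing token marks the last page.
            break


def fake_fetch_page(limit: int, token: Optional[str]) -> Tuple[List[dict], Optional[str]]:
    """In-memory stand-in for the API server, for demonstration only."""
    data = [{"name": f"pod-{i}"} for i in range(7)]
    start = int(token) if token else 0
    end = min(start + limit, len(data))
    next_token = str(end) if end < len(data) else None
    return data[start:end], next_token
```

Because `list_in_pages` is a generator, the caller can process and discard each item as it arrives. The default page size of 500 is an arbitrary illustrative value, not a recommendation from this section.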

Network

**Table 2** Network
Operation	Impact	Solution
Changing the value of the kernel parameter net.ipv4.ip_forward to 0	The network becomes inaccessible.	Change the value to 1.
Changing the value of the kernel parameter net.ipv4.tcp_tw_recycle to 1	The NAT service becomes abnormal.	Change the value to 0.
Changing the value of the kernel parameter net.ipv4.tcp_tw_reuse to 1	The network becomes abnormal.	Change the value to 0.
Not configuring the node security group to allow UDP traffic to the container CIDR blocks over port 53	The DNS in the cluster cannot work properly.	Restore the security group by referring to the operations provided for a newly created cluster and allow traffic to the container CIDR blocks. For details, see How Can I Configure a Security Group Rule for a Cluster?
Deleting network-attachment-definitions CRD resources of default-network	The container network is disconnected, or the cluster fails to be deleted.	If the resources are deleted by mistake, use the correct configurations to create the default-network resources.
Enabling the iptables firewall	By default, the iptables firewall is disabled on CCE. Enabling the firewall can leave the network inaccessible. NOTE: Do not enable the iptables firewall. If the iptables firewall must be enabled, first check in a test environment whether the rules configured in /etc/sysconfig/iptables and /etc/sysconfig/ip6tables will affect the network.	Disable the iptables firewall and check the rules configured in /etc/sysconfig/iptables and /etc/sysconfig/ip6tables.
Upgrading the kernel or Open vSwitch on the nodes in a cluster that uses the tunnel network	The container network is disconnected. NOTE: The data plane of the cluster uses the Open vSwitch component optimized by Huawei and cannot be upgraded to Open vSwitch provided by the OS.	Refer to What Can I Do If the Container Network Becomes Unavailable After yum update Is Used to Upgrade the OS?
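
The kernel parameter rows in Table 2 can be checked on a node with a short script. The following Python sketch compares the current values with the values that the table gives as solutions. It assumes that the parameters are readable under /proc/sys, and the names `EXPECTED_SYSCTLS`, `read_sysctl`, and `find_sysctl_drift` are hypothetical.

```python
from pathlib import Path
from typing import Callable, Dict, List, Optional

# Values that Table 2 gives as the settings to restore.
EXPECTED_SYSCTLS: Dict[str, str] = {
    "net.ipv4.ip_forward": "1",
    "net.ipv4.tcp_tw_recycle": "0",
    "net.ipv4.tcp_tw_reuse": "0",
}


def read_sysctl(name: str, root: str = "/proc/sys") -> Optional[str]:
    """Read a kernel parameter from procfs; return None if it is absent."""
    path = Path(root, *name.split("."))
    try:
        return path.read_text().strip()
    except OSError:
        return None


def find_sysctl_drift(
    reader: Callable[[str], Optional[str]] = read_sysctl,
) -> List[str]:
    """Return one message per parameter whose value differs from Table 2."""
    problems = []
    for name, expected in EXPECTED_SYSCTLS.items():
        actual = reader(name)
        if actual is None:
            continue  # Parameter not present on this kernel; nothing to compare.
        if actual != expected:
            problems.append(f"{name} is {actual}, expected {expected}")
    return problems
```

Absent parameters are skipped because not every kernel exposes all three (net.ipv4.tcp_tw_recycle was removed in Linux 4.12). The comparison is a strict string match against the table, so any other value is reported.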

Containers

**Table 3** Containers
Operation	Impact	Solution
Configuring privileged containers for a workload and directly operating the host hardware. System files on the node may be changed by mistake. For example, if you set the startup command to /usr/sbin/init and run systemctl in containers, the system files in the /lib directory of the node may be damaged.	All mount points of the node will be unmounted. As a result, the node will malfunction, pods will fail, and storage add-on functions will be affected.	Do not remove the mount points in the /lib directory of a node. Reset the node for recovery. For details, see Resetting a Node.
Mounting the system component directory using hostPath, for example, mounting files in /var/lib/docker	After the system component is restarted, the system component directory on the node may be rebuilt, which interrupts the inode link of the mounted directory. As a result, the content in the mounted directory cannot be used in the container.	Do not mount the system component directory.
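
Both rows in Table 3 can be detected before a workload is deployed by inspecting its pod spec. The following Python sketch flags privileged containers and hostPath volumes that point into system component directories. It assumes the pod spec has already been loaded into a dictionary. `SYSTEM_DIRS` and `audit_pod_spec` are hypothetical names, and only /var/lib/docker in the directory list comes from Table 3 (the other entries are taken from Table 1 as examples).

```python
from typing import List

# Example system component directories. Only /var/lib/docker is named in
# Table 3; the other two are illustrative entries taken from Table 1.
SYSTEM_DIRS = ("/var/lib/docker", "/var/paas", "/opt/cloud/cce")


def audit_pod_spec(pod_spec: dict) -> List[str]:
    """Return warnings for privileged containers and risky hostPath volumes."""
    warnings = []
    containers = pod_spec.get("containers", []) + pod_spec.get("initContainers", [])
    for container in containers:
        security = container.get("securityContext") or {}
        if security.get("privileged") is True:
            warnings.append(f"container {container.get('name')} is privileged")
    for volume in pod_spec.get("volumes", []):
        host_path = (volume.get("hostPath") or {}).get("path", "")
        normalized = host_path.rstrip("/")
        for system_dir in SYSTEM_DIRS:
            # Match the directory itself or anything beneath it.
            if normalized == system_dir or normalized.startswith(system_dir + "/"):
                warnings.append(f"volume {volume.get('name')} mounts {host_path}")
                break
    return warnings
```

An empty result means only that neither pattern was found. It does not prove that the workload is safe.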

Load Balancing

**Table 4** Load balancing
Operation	Impact	Solution
Deleting a load balancer that has been associated with a CCE cluster	Accessing the target Service or ingress will fail.	Do not delete such a load balancer.
Disabling a load balancer that has been associated with a CCE cluster	Accessing the target Service or ingress will fail.	Do not disable such a load balancer. If a load balancer has been disabled, enable it.
Changing the private IPv4 address of a load balancer	The network traffic forwarded using the private IPv4 addresses will be interrupted. The IP addresses in the status field of Service or ingress YAML files will be changed.	Do not change private IPv4 addresses of load balancers. Change them back if they have been changed.
Unbinding the IPv4 EIP from a load balancer	After the EIP is unbound from the load balancer, the load balancer will not be able to forward Internet traffic.	Restore the EIP binding.
Adding a listener to a load balancer that has been associated with a CCE cluster	If a load balancer is automatically created when a Service or an ingress is created, any listener added to the load balancer on the ELB console cannot be deleted when the Service or ingress is deleted. In this case, the load balancer cannot be automatically deleted.	Use the listener automatically created when a Service or an ingress is created. If a listener added on the ELB console is used, manually delete this load balancer.
Deleting a listener automatically added by CCE	Accessing the target Service or ingress will fail. When master nodes are restarted due to reasons such as a cluster upgrade, all your modifications will be reset by CCE.	Re-create or update the Service or ingress.
Modifying the basic configurations such as the name, access control, timeout, or description of a listener added by CCE	When master nodes are restarted due to reasons such as a cluster upgrade, all your modifications will be reset by CCE.	Do not modify the basic configurations of the listener created by CCE. Restore the configurations if they have been modified.
Modifying the backend server group of a listener added by CCE, including adding or deleting backend servers to or from the server group	Accessing the target Service or ingress will fail. When master nodes are restarted due to reasons such as a cluster upgrade, all your modifications will be reset by CCE. Deleted backend servers will be restored. Added backend servers will be removed.	Re-create or update the Service or ingress.
Replacing the backend server group of a listener added by CCE	Accessing the target Service or ingress will fail. After master nodes are restarted due to reasons such as a cluster upgrade, all servers in the backend server group will be reset by CCE.	Re-create or update the Service or ingress.
Modifying the forwarding policy of a listener added by CCE, including adding or deleting forwarding rules	Accessing the target Service or ingress will fail. After master nodes are restarted due to reasons such as a cluster upgrade, all your modifications will be reset by CCE if the forwarding rules are added using an ingress.	Do not modify the forwarding policy of such a listener. Restore the configurations if they have been modified.
Replacing the certificate of the listener created by CCE on the ELB console or modifying the server certificate created by CCE using a TLS key on the Certificates page of ELB	In scenarios where a master node needs to be restarted, such as during a cluster upgrade, the modification will be reset by CCE. As a result, the Service or ingress may become inaccessible.	Use the CCE console or YAML to update the certificate associated with the Service or ingress, or update the TLS key associated with the Service or ingress.

Logs

**Table 5** Logs
Operation	Impact	Solution
Deleting the /tmp/ccs-log-collector/pos directory on the host machine	Logs are collected repeatedly.	None
Deleting the /tmp/ccs-log-collector/buffer directory on the host machine	Logs are lost.	None

Monitoring

**Table 6** Monitoring
Operation	Impact	Solution
Configuring a larger number of collection shards in Cloud Native Cluster Monitoring than the recommended value (one collection shard per 50 nodes)	Excessive shards may overload the master node's memory, affecting system stability.	Change the number of collection shards to the recommended value for Cloud Native Cluster Monitoring.
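
The recommended ratio in Table 6 (one collection shard per 50 nodes) can be written as a small calculation. The Python sketch below assumes that partial groups are rounded up and that at least one shard is always needed. Neither assumption is stated in this section, and the function names are hypothetical.

```python
import math

NODES_PER_SHARD = 50  # Ratio recommended in Table 6.


def recommended_shards(node_count: int) -> int:
    """Return the shard count for a cluster, rounding up (assumption)."""
    if node_count < 0:
        raise ValueError("node_count must not be negative")
    # Assumption: a cluster always needs at least one shard.
    return max(1, math.ceil(node_count / NODES_PER_SHARD))


def exceeds_recommendation(configured_shards: int, node_count: int) -> bool:
    """Report whether the configured shard count is above the recommendation."""
    return configured_shards > recommended_shards(node_count)
```

Under these assumptions, a 120-node cluster gets three shards, and configuring four or more would be flagged.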

EVS Disks

**Table 7** EVS disks
Operation	Impact	Solution	Remarks
Manually unmounting an EVS disk on the console	An I/O error occurs when data is written into a pod.	Delete the mount path from the node and schedule the pod again.	The file in the pod records the location where files are to be collected.
Unmounting the disk mount path on the node	Pod data is written into a local disk.	Remount the corresponding path to the pod.	The buffer contains log cache files to be consumed.
Operating EVS disks on the node	Pod data is written into a local disk.	None	None
Creating a PV with parameters that are not declared in the file. For example, if the YAML file contains parameters such as status, spec.claimRef, and annotation.everest.io/set-disk-metadata during PV creation, the PV may be abnormal.	This operation may bypass some standard processes for creating PVs. As a result, the created PVs may become unavailable or be deleted unexpectedly.	Before such PVs are deleted, manually delete related parameters in their YAML files.	None
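
The PV row in Table 7 names three parameters that can make a PV abnormal. The following Python sketch checks a PV manifest for them before it is applied. It assumes the manifest has already been loaded into a dictionary, and it interprets annotation.everest.io/set-disk-metadata as the key everest.io/set-disk-metadata under metadata.annotations, which is an assumption about the notation used in the table. The function name is hypothetical.

```python
from typing import List


def find_risky_pv_fields(pv: dict) -> List[str]:
    """Return the fields from Table 7 that are present in a PV manifest."""
    found = []
    if "status" in pv:
        found.append("status")
    if "claimRef" in (pv.get("spec") or {}):
        found.append("spec.claimRef")
    # Assumption: the annotation lives under metadata.annotations.
    annotations = (pv.get("metadata") or {}).get("annotations") or {}
    if "everest.io/set-disk-metadata" in annotations:
        found.append("metadata.annotations[everest.io/set-disk-metadata]")
    return found
```

The table introduces these three parameters with "such as", so the list may not be complete. An empty result does not guarantee that the manifest is safe.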

Add-ons

**Table 8** Add-ons
Operation	Impact	Solution
Modifying add-on resources in the backend	Add-on exceptions or other unexpected issues may occur. For example, parameter settings are overwritten after an upgrade.	Perform operations on the add-on configuration page or using open add-on management APIs.