Updated on 2023-10-27 GMT+08:00

Checklist for Deploying Containerized Applications in the Cloud

Overview

Security, efficiency, stability, and availability are common requirements for all cloud services. Meeting them requires coordinated attention to system availability, data reliability, and O&M stability. This checklist describes the items to check before deploying containerized applications on the cloud, helping you migrate services to CCE efficiently and reducing the risk of cluster or application exceptions caused by improper use.

Check Items

Table 1 System availability

Each entry below lists the check item together with its category, type, impact, and related FAQ or example.

Category: Cluster

Check Item: Before creating a cluster, properly plan the node network and container network based on service requirements to allow for subsequent service expansion.
Type: Network planning
Impact: If the subnet or container CIDR block where the cluster resides is too small, the cluster may support fewer nodes than you need.

Check Item: Before creating a cluster, properly plan the CIDR blocks for the related Direct Connect connection, peering connection, container network, Service network, and subnet to avoid IP address conflicts.
Type: Network planning
Impact: If CIDR blocks are not properly planned and IP address conflicts occur, service access will be affected.

Check Item: When a cluster is created, a default security group is automatically created and bound to it. You can set custom security group rules based on service requirements.
Type: Deployment
Impact: Security groups are key to security isolation. An improper security policy may cause security risks and service connectivity problems.

Check Item: Enable the multi-master mode and set the number of master nodes to 3 when creating a cluster.
Type: Reliability
Impact: With the multi-master mode enabled, three master nodes are created. If one master node becomes faulty, the cluster remains available and service functions are not affected. Enabling the multi-master mode is recommended for commercial scenarios.
FAQ & Example: How Do I Check Whether a Cluster Is in Multi-Master Mode?

Note: Once a cluster is created, the number of master nodes cannot be changed. Exercise caution when setting it.

Check Item: When creating a cluster, select a proper network model, such as the container tunnel network or VPC network.
Type: Deployment
Impact: After a cluster is created, the network model cannot be changed. Exercise caution when selecting one.
FAQ & Example: Network Model Comparison

Category: Workload

Check Item: When creating a workload, set CPU and memory limits to improve service robustness.
Type: Deployment
Impact: When multiple applications are deployed on the same node, an application without resource requests and limits may consume resources without bound. This resource leak prevents resources from being allocated to other applications and makes the application's monitoring data inaccurate.
FAQ & Example: None
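As a sketch, requests and limits are set in the container spec of the workload. The workload name, image, and values below are placeholders; tune them to your service:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app                 # placeholder name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
      - name: my-app
        image: nginx:1.25      # example image
        resources:
          requests:            # minimum resources guaranteed for scheduling
            cpu: 250m
            memory: 256Mi
          limits:              # hard ceiling; exceeding the memory limit kills the container
            cpu: 500m
            memory: 512Mi
```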

Check Item: When creating a workload, set probes for container health checks, including a liveness probe and a readiness probe.
Type: Reliability
Impact: If health checks are not configured, a pod cannot detect service exceptions or restart the service automatically to recover. The pod status may appear normal while the service inside it is abnormal.
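A minimal container-level sketch of both probes, assuming the service exposes HTTP health endpoints (the paths and port are placeholders):

```yaml
# Fragment of a pod template; values are illustrative
containers:
- name: my-app
  image: nginx:1.25
  livenessProbe:              # failing this probe restarts the container
    httpGet:
      path: /healthz
      port: 80
    initialDelaySeconds: 10
    periodSeconds: 10
  readinessProbe:             # failing this probe removes the pod from Service endpoints
    httpGet:
      path: /ready
      port: 80
    periodSeconds: 5
```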

Check Item: When creating a workload, select a proper access mode (Service). The following Service types are supported: ClusterIP, NodePort, DNAT, and LoadBalancer.
Type: Deployment
Impact: An improper Service configuration may confuse internal and external access paths and waste resources.
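For illustration, a ClusterIP Service for cluster-internal access might look like the sketch below; the names and ports are placeholders, and the type would change to NodePort or LoadBalancer for external access:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-app-svc
spec:
  type: ClusterIP        # internal access only; NodePort/LoadBalancer expose externally
  selector:
    app: my-app          # must match the workload's pod labels
  ports:
  - port: 80             # port the Service exposes
    targetPort: 8080     # port the container listens on
```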

Check Item: When creating a workload, do not run only a single pod replica. Set a proper node scheduling policy based on your service requirements.
Type: Reliability
Impact: If a workload has only one pod replica, the service becomes unavailable whenever that node or pod fails. To ensure that your pods can be scheduled successfully, confirm that nodes have idle resources for container scheduling after you set the scheduling rule.
FAQ & Example: None
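One way to sketch this is multiple replicas spread across nodes with a topology spread constraint (an alternative to affinity rules; the label is a placeholder):

```yaml
# Fragment of a Deployment spec
spec:
  replicas: 3                        # more than one replica tolerates single-node failure
  template:
    spec:
      topologySpreadConstraints:     # spread replicas across different nodes
      - maxSkew: 1
        topologyKey: kubernetes.io/hostname
        whenUnsatisfiable: ScheduleAnyway
        labelSelector:
          matchLabels:
            app: my-app
```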

Check Item: Properly set affinity and anti-affinity policies.
Type: Reliability
Impact: If both affinity and anti-affinity are configured for an application that provides Services externally, the Services may become inaccessible after the application is upgraded or restarted.
FAQ & Example: Scheduling Policy (Affinity/Anti-affinity)

Negative example: For application A, nodes 1 and 2 are set as affinity nodes, and nodes 3 and 4 as anti-affinity nodes. Application A exposes a Service through the ELB, which listens on nodes 1 and 2. When application A is upgraded, it may be scheduled to a node other than nodes 1, 2, 3, and 4, making it inaccessible through the Service.

Cause: Application A's scheduling needs to satisfy only one of the two policies, not both. In this example, the node was selected according to the anti-affinity policy, so application A could land on any node other than nodes 3 and 4, including nodes the ELB does not listen on.
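To avoid the pitfall above, a hard (required) node affinity rule pins scheduling to the intended nodes; the node names below are placeholders:

```yaml
# Fragment of a pod template spec
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:   # hard rule, not a preference
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.io/hostname
          operator: In
          values: ["node-1", "node-2"]   # pods can only be scheduled onto these nodes
```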

Check Item: When creating a workload, set a pre-stop command (Lifecycle > Pre-Stop) so that in-flight work in the pods can finish before a pod is deleted or the application is upgraded.
Type: Reliability
Impact: If no pre-stop command is configured, the pod is killed immediately during an application upgrade and services are interrupted.
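A minimal sketch of a pre-stop hook that delays shutdown so connections can drain (the sleep duration and grace period are placeholders):

```yaml
# Fragment of a pod template spec
spec:
  terminationGracePeriodSeconds: 30       # total time allowed for graceful shutdown
  containers:
  - name: my-app
    image: nginx:1.25
    lifecycle:
      preStop:                            # runs before the container receives SIGTERM
        exec:
          command: ["/bin/sh", "-c", "sleep 10"]   # e.g. let in-flight requests drain
```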

Table 2 Data reliability

Each entry below lists the check item together with its category, type, impact, and related FAQ or example.

Category: Container data persistency

Check Item: Select a proper data volume type based on service requirements.
Type: Reliability
Impact: If a node becomes faulty and cannot be recovered, data on its local disks cannot be recovered either. Use cloud storage volumes to ensure data reliability.
FAQ & Example: What Are the Differences Among CCE Storage Classes in Terms of Persistent Storage and Multi-node Mounting?

Category: Backup

Check Item: Back up application data.
Type: Reliability
Impact: Data cannot be restored once lost.
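As a sketch, a cloud storage volume is requested through a PersistentVolumeClaim; the storage class name below is an assumption to verify against the classes available in your cluster:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-data
spec:
  accessModes: ["ReadWriteOnce"]   # single-node mounting; use ReadWriteMany for shared storage
  storageClassName: csi-disk       # assumed EVS-backed class; check kubectl get storageclass
  resources:
    requests:
      storage: 10Gi
```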

Table 3 O&M reliability

Each entry below lists the check item together with its category, type, impact, and related FAQ or example.

Category: Project

Check Item: The quotas of ECS, VPC, subnet, EIP, and EVS resources must meet customer requirements.
Type: Deployment
Impact: If a quota is insufficient, resources will fail to be created. In particular, users who have configured auto scaling must ensure sufficient resource quotas.

Check Item: Do not modify kernel parameters, system configurations, cluster core component versions, security groups, or ELB-related parameters on cluster nodes, and do not install unverified software.
Type: Deployment
Impact: Exceptions may occur on CCE clusters or on the Kubernetes components of a node, making the node unavailable for application deployment. For details, see High-Risk Operations and Solutions.

Negative examples:
  1. The container network is interrupted after the node kernel is upgraded.
  2. The container network is interrupted after an open-source Kubernetes network add-on is installed on a node.
  3. The /var/paas or /mnt/paas/kubernetes directory is deleted from a node, causing exceptions on the node.

Check Item: Do not modify resources created by CCE, such as security groups and EVS disks. Resources created by CCE are labeled cce.
Type: Deployment
Impact: CCE cluster functions may become abnormal.

Negative examples:
  1. On the ELB console, a user changes the name of a listener created by CCE.
  2. On the VPC console, a user modifies a security group created by CCE.
  3. On the EVS console, a user deletes or detaches data disks attached to CCE cluster nodes.
  4. On the IAM console, a user deletes cce_admin_trust.

All of these actions will cause CCE cluster functions to become abnormal.

Category: Proactive O&M

Check Item: CCE provides multi-dimensional monitoring and alarm reporting functions, allowing you to locate and rectify faults as soon as possible.
  • Application Operations Management (AOM): provides CCE's default basic resource monitoring, covering detailed container-related metrics, and reports alarms.
  • Open-source Prometheus: a monitoring tool for cloud-native applications. It integrates an independent alarm system for more flexible monitoring and alarm reporting.
Type: Monitoring
Impact: If alarms are not configured, you cannot establish a performance baseline for the container cluster. When an exception occurs, no alarm is reported and you must locate the fault manually.
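If you use open-source Prometheus, alarms are defined as alerting rules. The sketch below assumes kube-state-metrics is installed (it exports the kube_pod_status_ready metric); the group name, threshold, and labels are placeholders:

```yaml
# Prometheus alerting rule fragment (assumes kube-state-metrics)
groups:
- name: container-alerts
  rules:
  - alert: PodNotReady
    expr: kube_pod_status_ready{condition="true"} == 0
    for: 5m                      # fire only after 5 minutes in the failing state
    labels:
      severity: warning
    annotations:
      summary: "Pod {{ $labels.pod }} has not been ready for 5 minutes"
```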