Updated on 2023-10-27 GMT+08:00

Checklist for Deploying Containerized Applications in the Cloud

Overview

Security, efficiency, stability, and availability are common requirements for all cloud services. Meeting them requires coordinated attention to system availability, data reliability, and O&M stability. This checklist describes the items to check before deploying containerized applications on the cloud, helping you migrate services to CCE efficiently and reducing the risk of cluster or application exceptions caused by improper use.

Check Items

Table 1 System availability

Each entry below lists the check item together with its category, type, impact, and related FAQ or example.

Category: Cluster

Check Item: Before creating a cluster, properly plan the node network and container network based on service requirements to allow for subsequent service expansion.
Type: Network planning
Impact: If the subnet or container CIDR block where the cluster resides is too small, the cluster may support fewer nodes than you need.

Check Item: Before creating a cluster, properly plan the CIDR blocks for the related Direct Connect connection, peering connection, container network, Service network, and subnet to avoid IP address conflicts.
Type: Network planning
Impact: If CIDR blocks are not properly planned and IP address conflicts occur, service access will be affected.

Check Item: When a cluster is created, a default security group is automatically created and bound to it. You can set custom security group rules based on service requirements.
Type: Deployment
Impact: Security groups are key to security isolation. An improper security policy may cause security risks and service connectivity problems.

Check Item: Enable the multi-master mode and set the number of master nodes to 3 when creating a cluster.
Type: Reliability
Impact: With the multi-master mode enabled, three master nodes are created. If one master node becomes faulty, the cluster remains available and service functions are not affected. Enabling the multi-master mode is recommended for commercial scenarios.
FAQ & Example: How Do I Check Whether a Cluster Is in Multi-Master Mode?

Note: Once a cluster is created, the number of master nodes cannot be changed. Exercise caution when setting it.

Check Item: When creating a cluster, select a proper network model, such as the container tunnel network or VPC network.
Type: Deployment
Impact: After a cluster is created, the network model cannot be changed. Exercise caution when selecting one.
FAQ & Example: Network Model Comparison

Category: Workload

Check Item: When creating a workload, set CPU and memory limits to improve service robustness.
Type: Deployment
Impact: When multiple applications are deployed on the same node, an application without resource requests and limits may consume resources without bound. This resource leak prevents resources from being allocated to other applications and makes the application's monitoring data inaccurate.
FAQ & Example: None
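As a sketch, requests and limits are set in the container spec of the workload. The workload name, image, and values below are placeholders; tune them to your service:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app                 # placeholder name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
      - name: my-app
        image: nginx:1.25      # example image
        resources:
          requests:            # minimum resources guaranteed for scheduling
            cpu: 250m
            memory: 256Mi
          limits:              # hard ceiling; exceeding the memory limit kills the container
            cpu: 500m
            memory: 512Mi
```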

Check Item: When creating a workload, set probes for container health checks, including a liveness probe and a readiness probe.
Type: Reliability
Impact: If health checks are not configured, a pod cannot detect service exceptions or restart the service automatically to recover. The pod status may appear normal while the service inside it is abnormal.
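A minimal container-level sketch of both probes, assuming the service exposes HTTP health endpoints (the paths and port are placeholders):

```yaml
# Fragment of a pod template; values are illustrative
containers:
- name: my-app
  image: nginx:1.25
  livenessProbe:              # failing this probe restarts the container
    httpGet:
      path: /healthz
      port: 80
    initialDelaySeconds: 10
    periodSeconds: 10
  readinessProbe:             # failing this probe removes the pod from Service endpoints
    httpGet:
      path: /ready
      port: 80
    periodSeconds: 5
```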

Check Item: When creating a workload, select a proper access mode (Service). The following Service types are supported: ClusterIP, NodePort, DNAT, and LoadBalancer.
Type: Deployment
Impact: An improper Service configuration may confuse internal and external access paths and waste resources.
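For illustration, a ClusterIP Service for cluster-internal access might look like the sketch below; the names and ports are placeholders, and the type would change to NodePort or LoadBalancer for external access:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-app-svc
spec:
  type: ClusterIP        # internal access only; NodePort/LoadBalancer expose externally
  selector:
    app: my-app          # must match the workload's pod labels
  ports:
  - port: 80             # port the Service exposes
    targetPort: 8080     # port the container listens on
```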

Check Item: When creating a workload, do not run only a single pod replica. Set a proper node scheduling policy based on your service requirements.
Type: Reliability
Impact: If a workload has only one pod replica, the service becomes unavailable whenever that node or pod fails. To ensure that your pods can be scheduled successfully, confirm that nodes have idle resources for container scheduling after you set the scheduling rule.
FAQ & Example: None
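One way to sketch this is multiple replicas spread across nodes with a topology spread constraint (an alternative to affinity rules; the label is a placeholder):

```yaml
# Fragment of a Deployment spec
spec:
  replicas: 3                        # more than one replica tolerates single-node failure
  template:
    spec:
      topologySpreadConstraints:     # spread replicas across different nodes
      - maxSkew: 1
        topologyKey: kubernetes.io/hostname
        whenUnsatisfiable: ScheduleAnyway
        labelSelector:
          matchLabels:
            app: my-app
```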

Check Item: Properly set affinity and anti-affinity policies.
Type: Reliability
Impact: If both affinity and anti-affinity are configured for an application that provides Services externally, the Services may become inaccessible after the application is upgraded or restarted.
FAQ & Example: Scheduling Policy (Affinity/Anti-affinity)

Negative example: For application A, nodes 1 and 2 are set as affinity nodes, and nodes 3 and 4 as anti-affinity nodes. Application A exposes a Service through the ELB, which listens on nodes 1 and 2. When application A is upgraded, it may be scheduled to a node other than nodes 1, 2, 3, and 4, making it inaccessible through the Service.

Cause: Application A's scheduling needs to satisfy only one of the two policies, not both. In this example, the node was selected according to the anti-affinity policy, so application A could land on any node other than nodes 3 and 4, including nodes the ELB does not listen on.
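To avoid the pitfall above, a hard (required) node affinity rule pins scheduling to the intended nodes; the node names below are placeholders:

```yaml
# Fragment of a pod template spec
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:   # hard rule, not a preference
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.io/hostname
          operator: In
          values: ["node-1", "node-2"]   # pods can only be scheduled onto these nodes
```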

Check Item: When creating a workload, set a pre-stop command (Lifecycle > Pre-Stop) so that in-flight work in the pods can finish before a pod is deleted or the application is upgraded.
Type: Reliability
Impact: If no pre-stop command is configured, the pod is killed immediately during an application upgrade and services are interrupted.
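A minimal sketch of a pre-stop hook that delays shutdown so connections can drain (the sleep duration and grace period are placeholders):

```yaml
# Fragment of a pod template spec
spec:
  terminationGracePeriodSeconds: 30       # total time allowed for graceful shutdown
  containers:
  - name: my-app
    image: nginx:1.25
    lifecycle:
      preStop:                            # runs before the container receives SIGTERM
        exec:
          command: ["/bin/sh", "-c", "sleep 10"]   # e.g. let in-flight requests drain
```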

Table 2 Data reliability

Each entry below lists the check item together with its category, type, impact, and related FAQ or example.

Category: Container data persistency

Check Item: Select a proper data volume type based on service requirements.
Type: Reliability
Impact: If a node becomes faulty and cannot be recovered, data on its local disks cannot be recovered either. Use cloud storage volumes to ensure data reliability.
FAQ & Example: What Are the Differences Among CCE Storage Classes in Terms of Persistent Storage and Multi-node Mounting?

Category: Backup

Check Item: Back up application data.
Type: Reliability
Impact: Data cannot be restored once lost.
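As a sketch, a cloud storage volume is requested through a PersistentVolumeClaim; the storage class name below is an assumption to verify against the classes available in your cluster:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-data
spec:
  accessModes: ["ReadWriteOnce"]   # single-node mounting; use ReadWriteMany for shared storage
  storageClassName: csi-disk       # assumed EVS-backed class; check kubectl get storageclass
  resources:
    requests:
      storage: 10Gi
```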

Table 3 O&M reliability

Each entry below lists the check item together with its category, type, impact, and related FAQ or example.

Category: Project

Check Item: The quotas of ECS, VPC, subnet, EIP, and EVS resources must meet customer requirements.
Type: Deployment
Impact: If a quota is insufficient, resources will fail to be created. In particular, users who have configured auto scaling must ensure sufficient resource quotas.

Check Item: Do not modify kernel parameters, system configurations, cluster core component versions, security groups, or ELB-related parameters on cluster nodes, and do not install unverified software.
Type: Deployment
Impact: Exceptions may occur on CCE clusters or on the Kubernetes components of a node, making the node unavailable for application deployment. For details, see High-Risk Operations and Solutions.

Negative examples:
  1. The container network is interrupted after the node kernel is upgraded.
  2. The container network is interrupted after an open-source Kubernetes network add-on is installed on a node.
  3. The /var/paas or /mnt/paas/kubernetes directory is deleted from a node, causing exceptions on the node.

Check Item: Do not modify resources created by CCE, such as security groups and EVS disks. Resources created by CCE are labeled cce.
Type: Deployment
Impact: CCE cluster functions may become abnormal.

Negative examples:
  1. On the ELB console, a user changes the name of a listener created by CCE.
  2. On the VPC console, a user modifies a security group created by CCE.
  3. On the EVS console, a user deletes or detaches data disks attached to CCE cluster nodes.
  4. On the IAM console, a user deletes cce_admin_trust.

All of these actions will cause CCE cluster functions to become abnormal.

Category: Proactive O&M

Check Item: CCE provides multi-dimensional monitoring and alarm reporting functions, allowing you to locate and rectify faults as soon as possible.
  • Application Operations Management (AOM): provides CCE's default basic resource monitoring, covering detailed container-related metrics, and reports alarms.
  • Open-source Prometheus: a monitoring tool for cloud-native applications. It integrates an independent alarm system for more flexible monitoring and alarm reporting.
Type: Monitoring
Impact: If alarms are not configured, you cannot establish a performance baseline for the container cluster. When an exception occurs, no alarm is reported and you must locate the fault manually.
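If you use open-source Prometheus, alarms are defined as alerting rules. The sketch below assumes kube-state-metrics is installed (it exports the kube_pod_status_ready metric); the group name, threshold, and labels are placeholders:

```yaml
# Prometheus alerting rule fragment (assumes kube-state-metrics)
groups:
- name: container-alerts
  rules:
  - alert: PodNotReady
    expr: kube_pod_status_ready{condition="true"} == 0
    for: 5m                      # fire only after 5 minutes in the failing state
    labels:
      severity: warning
    annotations:
      summary: "Pod {{ $labels.pod }} has not been ready for 5 minutes"
```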