Updated on 2025-05-22 GMT+08:00

Design Principles

Building a Culture of Continuous Improvement with a Standardized O&M System

A strong team culture is essential for operational excellence. Operations is a continuous journey: learning from incidents and iterating on improvements are key to achieving operational excellence. When incidents arise, the priority should be system enhancement rather than individual blame. Punishing individuals often backfires. For example, the O&M team may conceal incidents or their root causes out of fear, hindering improvements to system processes and operational capabilities. Fear of punishment can discourage frontline engineers from taking responsibility or initiating system changes. Similarly, departments may shift blame onto others to protect their own teams. This creates a culture of fear, leading to organizational rigidity and stagnant operations. Without continuous iteration, code and architecture deteriorate, eventually making the system unmanageable. Culturally, the emphasis should be on learning and on assigning responsibility at the organizational level, not on blaming individuals.

Experience should be standardized by translating it into automation tools, processes, and organizational systems—forming a standardized O&M system. Non-standard O&M operations hinder large-scale success. Disorganized operations, with team members relying on individual approaches, lead to passive responses, low efficiency, frequent errors, and difficult troubleshooting. Standardized O&M is an efficient management approach that consolidates best practices and routine operations. By implementing standardized, process-driven, and tool-based O&M management, the O&M team can shift from disorder to order, simplify delivery, reduce skill dependency, improve efficiency, and lower operational costs.

Frequent, Small, Reversible Changes Through CI/CD

In software development, the phases of requirement analysis, design, development, testing, and deployment should be as short as possible, with frequent small iterative changes. Microservices, coupled with CI/CD, are a widely adopted practice. Known for its flexibility, scalability, and ease of maintenance, the microservice architecture has become a leading choice for modern application development. It breaks an application into small, independent services, each handling specific business functions. These services can use different technology stacks and are developed, tested, deployed, and scaled by separate teams. They communicate through lightweight mechanisms. In CI/CD, the same team manages the development, testing, deployment, release, and O&M of microservices across different regions using pipelines.
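The service-decomposition idea above can be sketched in Python: two toy services (the "inventory" and "order" names, the port, and the data are invented for illustration) where the order service reaches inventory only through its HTTP API, never through shared state:

```python
# Minimal sketch of two microservices communicating over HTTP with JSON.
# Service names, port, and stock data are illustrative assumptions.
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

STOCK = {"widget": 5}  # in-memory state owned solely by the inventory service


class InventoryHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # GET /stock/<item> returns the quantity as JSON.
        item = self.path.rsplit("/", 1)[-1]
        body = json.dumps({"item": item, "qty": STOCK.get(item, 0)}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # silence per-request logging in the sketch
        pass


def start_inventory_service(port=8901):
    server = HTTPServer(("127.0.0.1", port), InventoryHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def order_service_check(item, port=8901):
    # The order service only talks to inventory through its HTTP API,
    # never by touching STOCK directly.
    with urllib.request.urlopen(f"http://127.0.0.1:{port}/stock/{item}") as resp:
        return json.loads(resp.read())


server = start_inventory_service()
print(order_service_check("widget"))  # {'item': 'widget', 'qty': 5}
server.shutdown()
```

Because each service owns its own state and exposes only an API, the two sides can be developed, deployed, and scaled independently, which is what makes the frequent small releases described above practical.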

DevOps-adopting organizations should extend the model beyond software project management. From an O&M perspective, frequent small iterations enable faster issue detection and easier rollbacks, minimizing the risk of widespread failures in the event of deployment issues.

X as Code: Automate Where Possible

Unlike traditional applications, a cloud application is defined as code—including the application itself, cloud infrastructure, security policies, and O&M. This means all aspects of operational excellence can be automated through code. For example, you can define infrastructure, deploy applications, update configurations, set security policies, and even automate routine O&M tasks and fault resolution.
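As a rough illustration of the code-defined approach (the resource names and attributes below are invented, not any real cloud API), desired state can be declared as data and reconciled against actual state by an idempotent planning step:

```python
# Hedged sketch of "X as Code": state is declared as data, and a plan step
# computes only the changes needed. All resources here are illustrative.

desired = {
    "security_group/web": {"ingress": ["80/tcp", "443/tcp"]},
    "vm/app-1": {"flavor": "small", "image": "app:v2"},
}

actual = {
    "security_group/web": {"ingress": ["80/tcp"]},
    "vm/app-1": {"flavor": "small", "image": "app:v1"},
    "vm/old-1": {"flavor": "small", "image": "app:v0"},
}


def plan(desired, actual):
    """Diff desired vs. actual state into create/update/delete actions."""
    actions = []
    for name, spec in desired.items():
        if name not in actual:
            actions.append(("create", name))
        elif actual[name] != spec:
            actions.append(("update", name))
    for name in actual:
        if name not in desired:
            actions.append(("delete", name))
    return actions


print(plan(desired, actual))
# [('update', 'security_group/web'), ('update', 'vm/app-1'), ('delete', 'vm/old-1')]
```

Because the plan is derived from declared state rather than hand-typed commands, applying it twice produces no extra changes, which is the property that makes code-defined operations safe to automate.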

Automation is key to building O&M expertise and establishing standardized practices. It limits human error, streamlines processes, and boosts efficiency. Even when manual intervention is needed, such as at decision points, automation can first verify permissions and provide the necessary context to support informed decisions, significantly reducing errors compared to fully manual processes.
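A minimal sketch of such a gated decision point, assuming a hypothetical service-restart runbook in which the operator roles, permission names, and monitoring context are all invented:

```python
# Sketch of a semi-automated runbook step: before a human decides,
# automation verifies the operator's permission and gathers context.
# Operators, permissions, and context values are illustrative assumptions.

OPERATORS = {"alice": {"restart"}, "bob": set()}


def gather_context(service):
    # In practice this would query monitoring; static values for the sketch.
    return {"service": service, "error_rate": 0.12, "last_deploy": "v2.3.1"}


def request_restart(operator, service, approve):
    # Step 1: automation checks permissions before any human is involved.
    if "restart" not in OPERATORS.get(operator, set()):
        return {"status": "denied", "reason": "missing 'restart' permission"}
    # Step 2: automation collects the context the decision needs.
    context = gather_context(service)
    # Step 3: the human decision point, made with full context in hand.
    if not approve(context):
        return {"status": "aborted", "context": context}
    return {"status": "restarted", "context": context}


print(request_restart("bob", "checkout", approve=lambda ctx: True)["status"])
# denied
print(request_restart("alice", "checkout",
                      approve=lambda ctx: ctx["error_rate"] > 0.05)["status"])
# restarted
```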

Continuous Improvement Through Observability

Observability is the ability to infer the internal state of a system by analyzing its external output. The observability of cloud applications is typically assessed through three core signals. The first is metrics: quantitative measurements of system performance, such as TPS, request latency, and the number of calls. The second is logs: human-readable records of system events, including runtime information, errors, and security incidents. The third is traces: the path of a single request or transaction through the system, which reveals how a request is actually executed end to end.
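The three signals can be sketched together in a few lines of Python for a single request (the metric names, logger name, and trace-ID scheme below are illustrative, not any particular observability product):

```python
# Minimal sketch of the three observability signals for one request.
# Metric names and the trace-ID scheme are illustrative assumptions.
import logging
import time
import uuid

logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")
log = logging.getLogger("orders")

metrics = {"requests_total": 0, "latency_ms": []}


def handle_request(payload):
    trace_id = uuid.uuid4().hex  # trace: one ID ties all events of a request together
    start = time.perf_counter()
    log.info("trace=%s received payload=%r", trace_id, payload)  # log: readable event
    result = payload.upper()  # stand-in for the business logic
    elapsed_ms = (time.perf_counter() - start) * 1000
    metrics["requests_total"] += 1  # metric: quantitative measurement
    metrics["latency_ms"].append(elapsed_ms)
    log.info("trace=%s completed in %.2f ms", trace_id, elapsed_ms)
    return result


handle_request("hello")
print(metrics["requests_total"])  # 1
```

In a real deployment each signal would be exported to a dedicated backend, but the division of labor is the same: metrics answer "how much", logs answer "what happened", and the trace ID lets you stitch both back to one request.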

For cloud-based applications, observability enables faults to be detected and resolved quickly, shortening recovery time. Observability also allows for the early detection of system issues, such as performance and capacity bottlenecks. Additionally, alarms triggered by observability can be linked to automated processes, enabling proactive responses like dynamic scaling, flow control, traffic switching, and node migration to resolve issues.
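One way to picture linking an alarm to an automated response is a simple threshold-based scaling policy (the thresholds, replica limits, and CPU figures below are invented for illustration):

```python
# Sketch of alarm-driven automation: an observability metric (average CPU)
# crosses a threshold and triggers a scaling action. Thresholds are illustrative.

def scaling_decision(avg_cpu, replicas, high=0.80, low=0.20, max_replicas=10):
    """Return the new replica count for a simple threshold-based policy."""
    if avg_cpu > high and replicas < max_replicas:
        return replicas + 1  # scale out on high load
    if avg_cpu < low and replicas > 1:
        return replicas - 1  # scale in when mostly idle
    return replicas  # within bounds: no change


print(scaling_decision(0.92, replicas=3))  # 4
print(scaling_decision(0.10, replicas=3))  # 2
print(scaling_decision(0.50, replicas=3))  # 3
```

Production autoscalers add smoothing, cooldown periods, and multi-metric policies, but the core loop is the same: observe, compare against a threshold, and act automatically.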