Updated on 2025-05-22 GMT+08:00

Concepts

Term

Definition

Deterministic Operations

Deterministic operations aims to build an O&M management system in which risks are avoidable, controllable, and manageable. It minimizes risks through high-quality R&D and well-designed O&M processes and regulations. Technical measures make faults preventable, controllable, and rectifiable, and make the intervals between them, their impact scope, and their recovery time predictable. This reduces the uncertainty associated with digital transformation.

Infrastructure as Code (IaC)

Infrastructure as Code (IaC) lets you manage infrastructure through code rather than manual intervention. Modern application environments depend on a variety of components, from operating systems and database connections to storage configurations. Developers must regularly set up, update, and maintain this infrastructure to develop, test, and deploy applications. Doing so manually is time-consuming and error-prone, especially when applications are managed at scale.
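
As a rough illustration of managing infrastructure with code, the sketch below declares a desired state as data and computes the actions needed to reach it from the current state. The `Resource` type, the `plan` function, and the sample resources are hypothetical names invented for this example; they do not correspond to any particular IaC tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resource:
    """Hypothetical description of one infrastructure component."""
    kind: str
    size: str


def plan(desired, actual):
    """Return the actions that would move `actual` to `desired`."""
    actions = []
    for name in sorted(desired):
        if name not in actual:
            actions.append(("create", name))
        elif actual[name] != desired[name]:
            actions.append(("update", name))
    for name in sorted(actual):
        if name not in desired:
            actions.append(("delete", name))
    return actions


# Desired state lives in version control; actual state is what exists now.
desired = {
    "web": Resource("vm", "large"),
    "db": Resource("database", "medium"),
}
actual = {
    "web": Resource("vm", "small"),
    "cache": Resource("vm", "small"),
}

actions = plan(desired, actual)
for action, name in actions:
    print(action, name)
```

Because the desired state is plain data, it can be reviewed and versioned like any other code, and running `plan` again after the actions are applied yields an empty list.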

CI/CD

Continuous Integration (CI) promotes frequent small code commits to version control, with automated builds, tests, and packaging to improve collaboration and code quality.

Continuous Delivery (CD) automates deploying code to environments (production, development, and testing) after CI completes.

Telemetering

Telemetering (telemetry) is the remote measurement of data at a target and its transmission to a receiving station for recording, display, and analysis.

CMDB

The configuration management database (CMDB) is a term from the Information Technology Infrastructure Library (ITIL). It is a database that organizations use to store information about software and hardware assets, usually called configuration items (CIs). It tracks the status of assets (products, systems, software, devices, and personnel) and the relationships between them, and it exposes this data through open APIs so that IT management services can consume it.
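
The idea of storing assets together with their relationships can be sketched as a small in-memory structure. This is only an illustration under simplifying assumptions: the `Cmdb` class, its method names, and the sample assets are hypothetical, and a real CMDB would persist its data and expose it through an API.

```python
from collections import defaultdict, deque


class Cmdb:
    """Hypothetical in-memory store of configuration items (CIs)."""

    def __init__(self):
        self.items = {}                      # CI id -> attributes
        self.dependents = defaultdict(set)   # CI id -> CIs that depend on it

    def add_ci(self, ci_id, ci_type, status="in_use"):
        self.items[ci_id] = {"type": ci_type, "status": status}

    def add_dependency(self, ci_id, depends_on):
        """Record that `ci_id` depends on `depends_on`."""
        self.dependents[depends_on].add(ci_id)

    def impacted_by(self, ci_id):
        """Return every CI that directly or indirectly depends on `ci_id`."""
        seen = {ci_id}
        queue = deque([ci_id])
        while queue:
            current = queue.popleft()
            for dependent in self.dependents.get(current, ()):
                if dependent not in seen:
                    seen.add(dependent)
                    queue.append(dependent)
        return seen - {ci_id}


cmdb = Cmdb()
cmdb.add_ci("host-1", "device")
cmdb.add_ci("db-1", "software")
cmdb.add_ci("shop-app", "system")
cmdb.add_dependency("db-1", depends_on="host-1")
cmdb.add_dependency("shop-app", depends_on="db-1")

impacted = cmdb.impacted_by("host-1")
print(sorted(impacted))
```

Keeping relationships alongside the assets is what allows questions such as "what is affected if this host fails?" to be answered from the database.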

MTTR

MTTR (Mean Time to Repair) is the average time from the occurrence of a fault to the confirmation of fault recovery. It comprises three components: Mean Time to Identify (MTTI), the time taken to detect a fault; Mean Time to Know (MTTK), the time taken to diagnose it; and Mean Time to Fix (MTTF), the time taken to implement a resolution. Here MTTF does not mean Mean Time to Failure, a reliability metric that shares the abbreviation.
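
The relationship between the three components can be checked with a short calculation. The sketch below assumes that the detection, diagnosis, and fix phases of each incident run back to back without overlapping, in which case MTTR equals MTTI + MTTK + MTTF. The incident timestamps and field names are made up for illustration.

```python
from datetime import datetime
from statistics import mean


def minutes(delta):
    """Convert a timedelta to minutes."""
    return delta.total_seconds() / 60


# Illustrative incident records; all timestamps are invented.
incidents = [
    {
        "occurred": datetime(2025, 1, 1, 10, 0),
        "detected": datetime(2025, 1, 1, 10, 5),
        "diagnosed": datetime(2025, 1, 1, 10, 20),
        "recovered": datetime(2025, 1, 1, 10, 30),
    },
    {
        "occurred": datetime(2025, 1, 2, 9, 0),
        "detected": datetime(2025, 1, 2, 9, 3),
        "diagnosed": datetime(2025, 1, 2, 9, 13),
        "recovered": datetime(2025, 1, 2, 9, 43),
    },
]

mtti = mean(minutes(i["detected"] - i["occurred"]) for i in incidents)
mttk = mean(minutes(i["diagnosed"] - i["detected"]) for i in incidents)
mttf = mean(minutes(i["recovered"] - i["diagnosed"]) for i in incidents)
mttr = mean(minutes(i["recovered"] - i["occurred"]) for i in incidents)

print(f"MTTI={mtti} MTTK={mttk} MTTF={mttf} MTTR={mttr}")
```

Breaking MTTR down this way shows which phase dominates recovery time. In the sample data the fix phase contributes the most, so that is where improvement would pay off first.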

Change risk control

Change risk control involves pre-event checks, in-event interception, and post-event verification to prevent abnormal behaviors.
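
The three stages can be sketched as a wrapper around a change. This is a minimal illustration, not a description of any product's implementation: `run_change`, `ChangeBlocked`, and the sample check and interception rule are hypothetical names chosen for this example.

```python
class ChangeBlocked(Exception):
    """Raised when any control stage rejects the change."""


def run_change(steps, pre_checks, is_allowed, execute, verify):
    # Pre-event check: every check must pass before anything runs.
    failed = [name for name, check in pre_checks.items() if not check(steps)]
    if failed:
        raise ChangeBlocked(f"pre-event checks failed: {failed}")

    # In-event interception: inspect each step just before executing it.
    for step in steps:
        if not is_allowed(step):
            raise ChangeBlocked(f"intercepted step: {step!r}")
        execute(step)

    # Post-event verification: confirm the system is healthy afterwards.
    if not verify():
        raise ChangeBlocked("post-event verification failed")


log = []  # stands in for real execution; each executed step is recorded here
pre_checks = {"has_steps": lambda steps: len(steps) > 0}


def is_allowed(step):
    """Hypothetical interception rule that blocks recursive deletes."""
    return "rm -rf" not in step


run_change(["restart nginx"], pre_checks, is_allowed, log.append, lambda: True)
print(log)
```

A change that contains a blocked step stops at that step: earlier steps have already run, which is why the post-event verification and a rollback plan still matter.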

Safe production

Safe production aims to ensure the continuous stability, security, and quality of live networks by implementing end-to-end (E2E) management across personnel, tools, product capabilities, and processes. This includes preventive safety measures, real-time monitoring, and post-audit checks to minimize or prevent network failures, with a key focus on avoiding incidents triggered by abnormal behaviors.

Rapid fault recovery

Rapid fault recovery uses a library of predefined failure patterns to develop emergency plans, which improves efficiency and reduces recovery time. Chaos engineering drills make otherwise unpredictable recovery times predictable.

Resource life cycle management

Resource life cycle management covers every stage of a resource's life: application, creation, delivery, O&M, and final destruction and release.
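
The stages listed above can be modeled as an ordered sequence in which a resource only moves forward. The sketch below assumes a strictly linear progression, which is a simplification; `Stage` and `next_stage` are hypothetical names used for illustration.

```python
from enum import Enum


class Stage(Enum):
    APPLICATION = "application"
    CREATION = "creation"
    DELIVERY = "delivery"
    OM = "O&M"
    RELEASE = "destruction and release"


# Enum members iterate in definition order, which is the life cycle order.
ORDER = list(Stage)


def next_stage(stage):
    """Return the stage that follows `stage`, or fail at the end."""
    index = ORDER.index(stage)
    if index == len(ORDER) - 1:
        raise ValueError(f"{stage.value!r} is the final stage")
    return ORDER[index + 1]


stage = Stage.APPLICATION
path = [stage]
while stage is not Stage.RELEASE:
    stage = next_stage(stage)
    path.append(stage)

print(" -> ".join(s.value for s in path))
```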

Fault drills

Fault drills involve replicating common failure scenarios in a controlled environment to expose vulnerabilities. Through continuous drills and regression testing, they validate and improve systems, tools, processes, and team capabilities. This proactive approach identifies and resolves preventable critical issues in advance or verifies fault detection and repair mechanisms to shorten recovery times.

Managed O&M

Managed O&M is a professional service that comprehensively manages and maintains the IT infrastructure of enterprises or organizations. It aims to improve the availability, reliability, and security of IT systems. This service covers multiple aspects, including system monitoring, troubleshooting, system optimization, and security protection.