Updated on 2025-05-22 GMT+08:00

Questions and Checklists

Use this checklist to identify areas for improvement and to plan how to address them. Each item in the checklist is a best practice that is explained in detail in the next section.

Question

Checklist/Best Practice

OPS01 Have you established a culture of continuous improvement with a standardized O&M system?

1. Build a growth-oriented culture.

2. Design a standardized O&M organization.

3. Standardize O&M processes and tools.

OPS02 Do you use CI/CD to implement frequent, small, and reversible changes?

1. Manage requirements and sprint development.

2. Link source code versions to deployed applications and apply code quality best practices.

OPS03 Do you have a comprehensive test and verification system?

1. Promote developer testing.

2. Perform integration tests in multiple environments and build a pre-production environment that is identical to the production environment.

3. Perform performance stress testing.

4. Perform dialing tests in the production environment.

5. Perform chaos testing and drills.

OPS04 Is the automated build and deployment process complete?

1. Effectively implement continuous integration.

2. Use continuous deployment models.

3. Implement infrastructure as code.

4. Automate O&M tasks.

OPS05 Is there an O&M preparation and change management system?

1. Conduct production readiness reviews.

2. Control change risks.

3. Define a change process.

OPS06 Has an observability system been established?

1. Establish an observability system.

2. Define observable objects.

3. Develop and implement observability metrics.

4. Standardize application logs.

5. Implement dependency telemetry.

6. Implement distributed tracing.

7. Introduce automation measures based on observability metrics.

OPS07 Are fault analysis and management conducted?

1. Create alarms.

2. Create a dashboard.

3. Support event management.

4. Support the fault recovery process.

OPS08 Is there an operating status measurement and continuous improvement mechanism?

1. Use metrics to measure operational objectives.

2. Review incidents and make improvements.

3. Manage knowledge.