Well-Architected Framework
Well-Architected Framework and Practices
Well-Architected Framework (WAF) Introduction
Resilience Pillar
Introduction to the Resilience Pillar
Concepts
Application Resilience
Shared Responsibility Model for Resiliency
Availability Objectives
Availability and SLO
RTO and RPO
Data Durability
Availability Requirements
Design Principles
Questions and Checklists
HA Design
RES01 Redundancy
Overview
RES01-01 High-Availability Deployment of Application Components
RES01-02 Multi-Location Deployment of Application Components
RES01-03 Anti-Affinity for Cloud Servers
RES02 Backup
Overview
RES02-01 Identifying and Backing Up Critical Data That Needs to Be Backed Up
RES02-02 Automatically Backing Up Data
RES02-03 Periodically Restoring Data from Backups
RES03 Cross-AZ DR
Overview
RES03-01 Deploying Clusters Across AZs
RES03-02 Synchronizing Data Across AZs
RES03-03 Interconnecting with DR Arbitrator for Automatic Switchover
RES03-04 Supporting DR Management
RES04 Cross-Region/Cross-Cloud DR
Overview
RES04-01 Defining the RPO and RTO for an Application System
RES04-02 Deploying a DR System to Meet the DR Objectives
RES04-03 Automating the DR Process
RES04-04 Regularly Performing DR Drills to Check Whether the Recovery Can Meet DR Objectives
RES05 Network HA
Overview
RES05-01 Ensuring High Availability of Network Connections
RES05-02 Avoiding Unnecessarily Exposing Network Addresses
RES05-03 Isolating Network Bandwidth Among Services of Different Traffic Models
RES05-04 Reserving IP Resources for Expansion and HA
Comprehensive Fault Detection
RES06 Fault Detection
Overview
RES06-01 Fault Mode Analysis
RES06-02 Fault Detection
RES06-03 Subhealth Detection
RES07 Monitoring Alarms
Overview
RES07-01 Defining Key Metrics and Thresholds and Monitoring Such Metrics
RES07-02 Monitoring Logging
RES07-03 Sending Notifications When Exceptions Are Detected
RES07-04 Storing and Analyzing Monitoring Data
RES07-05 Tracking Requests End-to-End
Rapid Fault Recovery
RES08 Dependency Reduction and Degradation
Overview
RES08-01 Reducing Strong Dependencies
RES08-02 Using Loose Coupling to Reduce Dependencies
RES08-03 Minimizing the Impacts of Dependency Failures
RES09 Retries After Failures
Overview
RES09-01 Designing a Retry Mechanism for API Calls and Command Executions
RES09-02 Determining Whether to Retry from the Client Based on the Comprehensive Evaluation Results
RES09-03 Avoiding Creating Too Much Traffic Pressure from Excessive Retries
RES10 Fault Isolation
Overview
RES10-01 Isolating the Control Plane from the Data Plane
RES10-02 Deploying Application Systems in Multiple Locations
RES10-03 Adopting a Grid Architecture
RES10-04 Configuring Health Check and Automatic Isolation
RES11 Reliability Testing
Overview
RES11-01 Chaos Testing
RES11-02 Stress Testing
RES11-03 Long-Term Stability Testing
RES11-04 DR Drills
RES11-05 Red/Blue Attack/Defense
RES12 Emergency Recovery
Overview
RES12-01 Setting Up an Emergency Recovery Team
RES12-02 Developing an Emergency Response Plan
RES12-03 Conducting Emergency Recovery Drills Periodically
RES12-04 Recovering from Faults Immediately
RES12-05 Organizing Emergency Recovery Backtracking
Overload Control
RES13 Overload Protection
Overview
RES13-01 Automatic Elastic Scaling
RES13-02 Application System Load Balancing
RES13-03 Overload Detection and Control
RES13-04 Proactive Capacity Expansion
RES13-05 Quota Restrictions on Automatic Capacity Expansion
RES13-06 Load Testing
Change Error Prevention
RES14 Configuration Error Prevention
Overview
RES14-01 Fool-proofing Check for Changes
RES14-02 Automatic Changes
RES14-03 Data Backup Before a Change
RES14-04 Runbooks for Standardized Changes
RES15 Hitless Upgrade
Overview
RES15-01 Automatic Deployment and Upgrade
RES15-02 Automatic Checks
RES15-03 Automatic Rollback
RES15-04 Gray Deployment and Upgrade
Reference Architecture
Overview
Typical Deployment Architecture for Internal Tools or OBT Applications (99% Availability)
Typical Deployment Architecture for Internal Knowledge Management Applications (99.9% Availability)
Typical Deployment Architecture for Information Management Applications (99.95% Availability)
Typical Deployment Architecture for E-commerce Applications (99.99% Availability)
Single-Region Solution
Dual-Region Solution
Typical Deployment Architecture for Core Financial Applications (99.999% Availability)
Cross-Cloud Scenarios (99.99% Availability)
Overview
Cross-Cloud DR
Cross-Cloud Active-Active DR
Cloud Service Reliability
Overview
ECS
Reliability Functions
Common Faults
BMS
Reliability Functions
Common Faults
CCE
Reliability Functions
Common Faults
ELB
Reliability Functions
Common Faults
AS
Reliability Functions
Common Faults
DCS
Reliability Functions
Common Faults
DMS
Reliability Functions
Common Faults
RDS
Reliability Functions
Common Faults
TaurusDB
Reliability Functions
Common Faults
OBS
Reliability Functions
Common Faults
Security Pillar
Overview
Introduction to the Security Pillar
Shared Responsibility Model
Concepts
Concept Description
Conceptual Models
Design Principles
Questions and Checklists
Cloud Security Governance Policies
SEC01 Cloud Security Governance Policies
SEC01-01 Establishing a Security Management Team
SEC01-02 Establishing a Security Baseline
SEC01-03 Compiling an Asset List
SEC01-04 Separating Workloads
SEC01-05 Performing Threat Modeling Analysis
SEC01-06 Identifying and Validating Security Measures
Infrastructure Security
SEC02 Identity Authentication
SEC02-01 Account Protection
SEC02-02 Secure Login Mechanism
SEC02-03 Security Management and Credential Usage
SEC02-04 Integrated Identity Management
SEC03 Permission Management
SEC03-01 Defining Access Control Requirements
SEC03-02 Assigning Appropriate Permissions on Demand
SEC03-03 Regularly Reviewing Permissions
SEC03-04 Securely Sharing Resources
SEC04 Network Security
SEC04-01 Partitioning the Network
SEC04-02 Controlling Network Traffic Access
SEC04-03 Minimizing Network Access Permissions
SEC05 Runtime Environment Security
SEC05-01 Cloud Service Security Configuration
SEC05-02 Vulnerability Management
SEC05-03 Reducing Attack Surfaces of Resources
SEC05-04 Key Security Management
SEC05-05 Certificate Security Management
SEC05-06 Using Hosted Cloud Services
Application Security
SEC06 Application Security
SEC06-01 Ensuring Open-Source Software Security
SEC06-02 Establishing Secure Coding Specifications
SEC06-03 Implementing White-Box Code Reviews
SEC06-04 Configuring Application Security
SEC06-05 Performing Penetration Testing
Data Security and Privacy Protection
SEC07 General Data Security
SEC07-01 Identifying Data in Workloads
SEC07-02 Data Protection Control
SEC07-03 Monitoring Data Use
SEC07-04 Static Data Encryption
SEC07-05 Transmission Data Encryption
SEC08 Data Privacy Protection
SEC08-01 Specifying Privacy Protection Policies and Principles
SEC08-02 Proactive Notification to Data Subjects
SEC08-03 Data Subjects' Choice and Consent
SEC08-04 Data Collection Compliance
SEC08-05 Compliance of Data Usage, Retention, and Disposal
SEC08-06 Compliance of Personal Data Disclosure to Third Parties
SEC08-07 Data Subjects Have Rights to Access Their Privacy Data
Security Operations
SEC09 Security Awareness and Analysis
SEC09-01 Implementing Standardized Log Management
SEC09-02 Logging and Analyzing Security Incidents
SEC09-03 Implementing Security Audits
SEC09-04 Security Situation Awareness
SEC10 Security Incident Response
SEC10-01 Establishing a Security Response Team
SEC10-02 Developing an Incident Response Plan
SEC10-03 Automatic Response to Security Incidents
SEC10-04 Security Incident Drill
SEC10-05 Establishing a Review Mechanism
Reference Architecture
Organization-Level Reference Architecture
Workload-Level Reference Architecture
Security Services
Performance Efficiency Pillar
Introduction
Concepts
Design Principles
Questions and Checklists
PERF01 Processes and Specifications
Full-lifecycle Performance Management
PERF01-01 Managing Performance Throughout the Lifecycle
Application Performance Programming Specifications
PERF01-02 Developing Application Performance Programming Specifications
PERF02 Performance Planning
Performance Planning
PERF02-01 Setting Performance Targets
PERF02-02 Capacity Planning
PERF03 Performance Modeling
Selecting Appropriate Compute Resources
PERF03-01 Selecting Appropriate Compute Services
PERF03-02 Selecting VMs and Container Nodes of Appropriate Specifications
PERF03-03 Applying Auto Scaling
Selecting Appropriate Network Services
PERF03-04 Selecting Appropriate Networking Services
Selecting Appropriate Storage Services
PERF03-05 Selecting Appropriate Storage Services
Selecting Appropriate Application Middleware Services
PERF03-06 Selecting Appropriate Message Queuing Services
PERF03-07 Selecting an Appropriate Kafka
PERF03-08 Selecting an Appropriate RocketMQ
PERF03-09 Selecting an Appropriate RabbitMQ
Selecting Appropriate Database Resources
PERF03-10 Selecting Appropriate Relational Databases
PERF03-11 Selecting Appropriate Non-Relational Databases
PERF04 Performance Analysis
Performance Testing
PERF04-01 Defining Acceptance Criteria
PERF04-02 Selecting Appropriate Test Methods
PERF04-03 Performance Test Procedure
Collecting Performance Data
PERF04-04 Collecting Resource Performance Data
PERF04-05 Collecting Application Performance Data
Developing a Performance Observability System
PERF04-06 Developing a Performance Observability System
PERF05 Performance Optimization
Design Optimization
PERF05-01 Optimizing Design
Algorithm Optimization
PERF05-02 Optimizing General Algorithms
Resource Optimization
PERF05-03 Optimizing Resources for Web Scenarios
PERF05-04 Optimizing Resources for Big Data Scenarios
PERF06 Performance Assurance
Performance Assurance
PERF06-01 Applying Layered Assurance
PERF06-02 Automatically Demarcating and Locating Performance Issues
PERF06-03 Automating Alarming
Cloud Service Performance Optimization
Optimizing Cache Performance
Optimizing Message Queue Performance
Optimizing Kafka Performance
Optimizing RabbitMQ Performance
Optimizing Serverless Performance
Optimizing Database Performance
Optimizing AI Performance
Optimizing Big Data Performance
Optimizing Hive
Optimizing Spark Performance
Optimizing Flink Performance
Cost Optimization Pillar
Overview
Concepts
Design Principles
Questions and Checklists
COST01 Planning Organizations and Processes According to Cost Optimization Requirements
COST01-01 Planning Enterprise Organizations, and Aligning the Organizational Structure, Process, and Cost Management
COST01-02 Designing a Structured IT Governance System to Optimize Management Efficiency
COST01-03 Clarifying Team Responsibilities, and Cultivating and Sustaining an Organizational Cost-consciousness Culture
COST01-04 Establishing Cloud Resource and Permissions Management Policies
COST02 Planning and Managing Budgets
COST02-01 Establishing Cloud Budgets and Forecasts
COST02-02 Refining Budget Management and Tracking
COST03 Allocating Costs
COST03-01 Developing Cost Allocation Principles
COST03-02 Visualizing Cost Allocation Results
COST03-03 Allocating Shared Costs
COST04 Performing Continuous Cost Governance
COST04-01 Establishing Governance Frameworks to Continuously Optimize the Cost Allocation Ratio
COST04-02 Proactively Monitoring Costs
COST05 Setting Optimization Strategies and Objectives
COST05-01 Analyzing Service Trends and Optimization Benefits
COST05-02 Establishing Measurable Optimization Objectives
COST05-03 Regularly Reviewing and Verifying Optimization Results
COST06 Choosing the Right Billing Mode
COST06-01 Understanding the Characteristics of Different Billing Modes on the Cloud
COST06-02 Selecting an Appropriate Billing Mode for Your Workloads
COST06-03 Tracking and Monitoring the Usage of Special Offerings
COST07 Managing and Optimizing Resources
COST07-01 Continuously Monitoring Resource Utilization
COST07-02 Releasing Idle Resources
COST07-03 Conducting Comprehensive Cloud Technology Selection Analysis
COST07-04 Changing Specifications Based on Workload Patterns and Resource Utilization
COST08 Optimizing Architectures
COST08-02 Implementing Cloud-Native Architecture Transformation
COST08-03 Decoupling Storage and Compute
COST08-04 Exploring Serverless
Cloud Services Used for Cost Optimization
Operational Excellence Pillar
Operational Excellence
Concepts
Design Principles
Questions and Checklists
OPS01 Building a Culture of Continuous Improvement with a Standardized O&M System
OPS01-01 Building a Growing Culture
OPS01-02 Designing a Standardized O&M Organization
OPS01-03 Standardizing O&M Processes and Tools
OPS02 Frequent, Small, Reversible Changes Through CI/CD
OPS02-01 Managing Requirements and Sprint Development
OPS02-02 Linking Source Code Versions to Deployed Applications and Applying Code Quality Best Practices
OPS03 Comprehensive Test and Verification System
OPS03-01 Promoting Developer Testing
OPS03-02 Performing Integration Testing in Multiple Environments and Mimicking Production with Staging Environments
OPS03-03 Performing Performance Testing
OPS03-04 Performing Echo Testing in the Production Environment
OPS03-05 Performing Chaos Testing and Drills
OPS04 Automated Build and Deployment
OPS04-01 Effectively Implementing Continuous Integration
OPS04-02 Using Continuous Deployment Models
OPS04-03 Implementing Infrastructure as Code
OPS04-04 Automating O&M Tasks
OPS05 O&M Preparation and Change Management
OPS05-01 Conducting a PRR
OPS05-02 Conducting a Change Risk Control
OPS05-03 Defining a Change Process
OPS06 Observability System
OPS06-01 Establishing an Observability System
OPS06-02 Defining Observable Objects
OPS06-03 Developing and Implementing Observability Indicators
OPS06-04 Standardizing Application Logs
OPS06-05 Implementing Dependency Telemetry
OPS06-06 Performing Distributed Tracing
OPS06-07 Automation Measures Based on Observability Metrics
OPS07 Fault Analysis and Management
OPS07-01 Creating Alarms
OPS07-02 Creating Monitoring Dashboards
OPS07-03 Managing Events
OPS07-04 Supporting Fault Recovery Process
OPS08 Operations Status Measurement and Continuous Improvement
OPS08-01 Using Metrics to Measure Operations Objectives
OPS08-02 Reviewing and Improving Events
OPS08-03 Managing Knowledge
Examples
Improving System O&M Capabilities to Reduce O&M Costs and Difficulties with AOM
Gathering Various Device Logs Using LTS for Full-Link Issue Tracking and Service Operations Analysis
Performing Daily Service O&M and Compliance with LTS
Operational Excellence
CodeArts
RFS
COC
Cloud Eye
LTS
AOM 2.0
APM
CBH
ServiceStage
MAS
General Reference
Glossary
Service Level Agreement
White Papers
Endpoints
Permissions