RES10-03 Adopting a Grid Architecture

Adopting a grid architecture can minimize the impacts of workload faults.

Risk level
High
Key strategies
In an application system, use multiple grids with the same functions. Each grid provides complete service functions and runs its own replica of the workload, serving only the requests routed to it. The grids do not interact with each other. If a grid becomes faulty, only the services processed by that grid are affected. Other grids keep running, which reduces the blast radius.

The following figure shows a typical grid architecture.

Implementation procedure:
1. Determine partition keys based on the following principles:
  - Partition keys must match service granularity or minimize cross-grid interactions. For a multi-user system, use user IDs as partition keys. For a system in which resources are objects, use resource IDs as partition keys.
  - Partition keys must be directly included in all APIs or commands, or be derivable from other parameters.
  - Ensure that each grid processes services independently, avoiding or reducing grid interactions.
2. Determine the number of grids and the size of each grid. More grids can be added later if the number of grids is insufficient. Some tips:
  - More grids mean smaller grid size. This limits blast radius, simplifies fault locating, and improves service availability. However, more grids lower resource utilization and increase costs.
  - Fewer grids mean more resources in each grid. This simplifies O&M, improves resource usage, and reduces costs. Fewer grids are right for large customers with high resource demands.
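The trade-off above can be made concrete with a small calculation. This is only an illustrative sketch: the function name `blast_radius` is hypothetical, and it assumes that partition keys and load are spread evenly across grids, which a real system may not achieve.

```python
def blast_radius(num_grids: int) -> float:
    """Fraction of partition keys affected when exactly one grid fails.

    Assumption: keys (and load) are spread evenly across all grids.
    """
    if num_grids < 1:
        raise ValueError("num_grids must be at least 1")
    return 1 / num_grids


# Compare a few candidate grid counts.
for n in (2, 4, 10, 20):
    print(f"{n:>2} grids -> {blast_radius(n):.0%} of keys affected by one grid failure")
```

The per-grid fixed overhead that drives cost is not modeled here because the source gives no figures for it.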
3. Determine the grid mapping algorithm. The following are some options:
  - Naive modulo mapping: The partition key is taken modulo the number of grids. This algorithm distributes data evenly, but it is not suitable when grids are added or deleted, because changing the number of grids remaps most keys and requires service migration.
  - Range-hash or hash mapping: Data is first partitioned by partition key range and then hashed, or the partition key is hashed directly. Metadata management is complex.
  - Full mapping: Every partition key is explicitly mapped to a grid. This creates a hard read and write dependency on the mapping table and requires read-your-writes consistency. Generally, a metadata service needs to be introduced.
  - Prefix range-based mapping: Partition keys are mapped to grids by prefix range. This algorithm offsets the downsides of the full mapping algorithm while providing flexibility.
  - Mapping replacement: A specific key is forcibly allocated to a specific grid, which facilitates testing and isolation.
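The following sketch illustrates two of the options above: naive modulo mapping (including how many keys move when a grid is added) and prefix range-based mapping with a per-key mapping replacement. It is a minimal illustration, not a reference implementation. All names (`stable_hash`, `modulo_grid`, `moved_fraction`, `PrefixRangeMap`) are hypothetical, hashing string keys with SHA-256 before the modulo is an assumption, and prefix ranges are assumed to be compared as plain strings in lexicographic order.

```python
import bisect
import hashlib
from typing import Dict, Iterable, List


def stable_hash(key: str) -> int:
    """Deterministic 64-bit hash (Python's built-in str hash is salted per process)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def modulo_grid(key: str, num_grids: int) -> int:
    """Naive modulo mapping: grid index = hash(key) mod number of grids."""
    return stable_hash(key) % num_grids


def moved_fraction(keys: Iterable[str], old_n: int, new_n: int) -> float:
    """Share of keys whose grid changes when the grid count goes from old_n to new_n."""
    key_list = list(keys)
    moved = sum(modulo_grid(k, old_n) != modulo_grid(k, new_n) for k in key_list)
    return moved / len(key_list)


class PrefixRangeMap:
    """Prefix range-based mapping with optional per-key replacement (override)."""

    def __init__(self, lower_bounds: List[str], grids: List[str]) -> None:
        # lower_bounds[i] is the inclusive lower bound of the key range served by grids[i].
        if not lower_bounds or len(lower_bounds) != len(grids):
            raise ValueError("need one grid per range")
        if lower_bounds[0] != "" or lower_bounds != sorted(lower_bounds):
            raise ValueError("bounds must be sorted and start with the empty string")
        self._bounds = list(lower_bounds)
        self._grids = list(grids)
        self._overrides: Dict[str, str] = {}

    def override(self, key: str, grid: str) -> None:
        """Mapping replacement: pin one key to a specific grid (testing, isolation)."""
        self._overrides[key] = grid

    def lookup(self, key: str) -> str:
        if key in self._overrides:
            return self._overrides[key]
        # Find the last range whose lower bound is <= key.
        index = bisect.bisect_right(self._bounds, key) - 1
        return self._grids[index]


if __name__ == "__main__":
    users = [f"user-{i}" for i in range(1000)]
    print("share of keys moved, 4 -> 5 grids:", moved_fraction(users, 4, 5))

    ranges = PrefixRangeMap(["", "h", "q"], ["grid-a", "grid-b", "grid-c"])
    ranges.override("alice", "grid-test")
    print(ranges.lookup("alice"), ranges.lookup("mallory"), ranges.lookup("zed"))
```

With modulo mapping, most keys change grid when the count changes, which is why migration is needed. With the range map, only the small table of bounds must be kept consistent, and splitting one range moves only the keys in that range.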
4. Design the grid routing layer as follows:
  - The grid routing layer is the only shared component, so you must minimize changes to it for stability.
  - Minimize changes to the service logic to maintain stability.
  - Because a fault in the grid routing layer has a large blast radius, keep the layer lightweight and simple while still providing all the functions it needs.
  - In some cases, avoid sending every call through the grid routing layer. This reduces latency and keeps the layer small.
  - Allow horizontal expansion to prevent the grid routing layer from becoming a performance bottleneck.
5. Provide a grid migration function that quickly reassigns keys to grids when a grid is added or deleted. The steps are as follows:
  - Copy data from the old grid mapped by the key to the new grid.
  - Update the grid routing layer route to map the partition key to the new grid.
  - Delete data from the old grid.
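The three migration steps above can be sketched as follows. This is a simplified, in-memory illustration: `migrate_key`, the `routes` table, and the dictionary-backed grids are all hypothetical stand-ins, and the sketch assumes no writes arrive for the key while it is being migrated. A real system would need to handle concurrent writes and failures between steps.

```python
from typing import Any, Dict


def migrate_key(
    key: str,
    new_grid: str,
    routes: Dict[str, str],
    grids: Dict[str, Dict[str, Any]],
) -> None:
    """Move one partition key to new_grid: copy, switch the route, then delete."""
    old_grid = routes[key]
    if old_grid == new_grid:
        return  # Nothing to do.

    # Step 1: copy data from the old grid to the new grid.
    grids[new_grid][key] = grids[old_grid][key]

    # Step 2: update the routing layer so the key maps to the new grid.
    routes[key] = new_grid

    # Step 3: delete the data from the old grid only after the route has switched.
    del grids[old_grid][key]


if __name__ == "__main__":
    grids = {"grid-a": {"user-1": {"plan": "basic"}}, "grid-b": {}}
    routes = {"user-1": "grid-a"}
    migrate_key("user-1", "grid-b", routes, grids)
    print(routes, grids)
```

Deleting last means that, at every point in the sequence, the grid the route points to holds the data for the key.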
6. Deploy and update grid code as follows:
  - Combine the grid code deployment with cross-AZ and cross-region deployment to reduce the impact scope of faults through multi-layer isolation.
  - Use canary deployments to update the grid service unit code to reduce the possibility that multiple grid service units are faulty at the same time due to version problems.
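One possible way to apply canary deployment across grids is to update them in growing waves and stop at the first unhealthy wave. The sketch below is an assumption about how this could look, not a prescribed procedure: `canary_rollout`, the `deploy` and `is_healthy` callbacks, and the wave-doubling policy are all hypothetical.

```python
from typing import Callable, List, Sequence, Tuple


def canary_rollout(
    grids: Sequence[str],
    deploy: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    first_wave: int = 1,
) -> Tuple[List[str], bool]:
    """Deploy to grids in waves of doubling size; halt when a wave is unhealthy.

    Returns (grids that received the new version, whether the rollout completed).
    """
    if first_wave < 1:
        raise ValueError("first_wave must be at least 1")
    updated: List[str] = []
    wave_size = first_wave
    start = 0
    while start < len(grids):
        batch = list(grids[start:start + wave_size])
        for grid in batch:
            deploy(grid)
            updated.append(grid)
        # Check the whole wave before touching any further grid.
        if not all(is_healthy(grid) for grid in batch):
            return updated, False
        start += len(batch)
        wave_size *= 2
    return updated, True


if __name__ == "__main__":
    deployed: List[str] = []
    result = canary_rollout(
        ["grid-1", "grid-2", "grid-3", "grid-4", "grid-5"],
        deploy=deployed.append,
        is_healthy=lambda grid: grid != "grid-2",  # Simulate a bad canary result.
    )
    print(result)
```

Because the rollout halts after an unhealthy wave, grids in later waves keep running the previous version and continue to serve their keys.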