Using Pod Snapshots for Fast Recovery of LLM Inference Services

Context

As large language model (LLM) inference is deployed at scale in cloud-native environments, auto scaling and fast recovery have become critical to meeting service level agreements (SLAs). However, when serving models with hundreds of billions of parameters, such as DeepSeek-V3, the traditional pod restart process faces significant challenges:

- Prolonged model loading time: Take a Mixture-of-Experts (MoE) architecture like DeepSeek-V3 as an example. The complete loading process involves multiple serial steps: reading weight files from disk, dequantization, weight format conversion (for example, to NPU NZ format), and communication domain initialization. Cold starts typically take minutes. In scenarios requiring rapid capacity expansion to handle traffic spikes or recover from faults, this delay directly causes service interruption or degradation.
- Non-migratable NPU device state: Large model inference heavily depends on the runtime state of NPUs, including dispatched compute tasks, HCCL communication channels, KV cache in device memory, and model weights. Native container snapshot technologies (such as CRIU) can only capture CPU-side process state and cannot detect or preserve hardware context on the NPU side. Consequently, device state is lost after recovery, and inference cannot resume.
- Distributed communication topology disruption: Large model inference relies on cross-node parallelism strategies such as tensor parallelism (TP) and expert parallelism (EP), where the HCCL communication domain is tightly bound to the pod IP address. When a pod is rescheduled to a new node after fault recovery, the IP address change renders the original HCCL communication channel invalid. The distributed inference capability can only be restored by rebuilding the entire communication domain.
- Loss of quantization weight processing state: In quantization scenarios such as W8A8, model weights must undergo post-processing after loading, including transposition, NZ format conversion, and quantization parameter expansion to generate derived states. These derived states are not included in the original checkpoint snapshot. If the pod is restored and the original weights are reloaded, these time-consuming post-processing steps must be performed again.

How Pod Snapshot Works

To address the challenges described above, the vLLM-Ascend engine integrates a pod snapshot mechanism for fast recovery. The following figure illustrates the implementation principle.


The core idea is to persist the complete runtime state of the model to shared storage while the inference pod is running normally. When the pod needs to be restored, the CANN runtime snapshot API reconstructs the NPU device state from the frozen snapshot. Model weights are loaded directly from shared storage into NPU memory, and the distributed communication topology is rebuilt. This bypasses disk parsing and full weight loading, reducing restoration time from minutes to seconds.

Snapshot phase: full-stack state freezing
After the model is loaded and weight post-processing (such as quantization parameter derivation and format conversion) is complete, the system serializes the complete state from the application layer through the hardware layer into shared storage.
- Compute cache persistence: Non-persistent caches, such as the cosine and sine tables for rotary position embedding (RoPE), as well as key parameters like scaling factors (scale) and offsets (offset) generated by the quantization layer, are cached on the CPU side to avoid recomputation after restoration.
- Device state freezing: The underlying layer invokes the CANN runtime API to lock in-flight tasks on the NPU and back up the complete device context to snapshot storage, achieving full-stack state freezing.
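
The compute cache persistence step can be illustrated with a minimal sketch. It is not the vLLM-Ascend implementation: the file layout (a length-prefixed JSON manifest followed by raw float32 payloads) and the names build_rope_tables, save_snapshot, load_snapshot, and w8a8_scale are assumptions made for illustration, and only the Python standard library is used. The sketch derives RoPE cosine and sine tables once and writes them, together with a quantization scale, to a single file so that a restored process can read them back instead of recomputing them.

```python
import json
import math
import struct
import tempfile
from array import array
from pathlib import Path


def build_rope_tables(head_dim, max_pos, base=10000.0):
    """Return (cos, sin) tables, flattened row-major, as float32 arrays."""
    half = head_dim // 2
    inv_freq = [base ** (-2.0 * i / head_dim) for i in range(half)]
    cos, sin = array("f"), array("f")
    for pos in range(max_pos):
        for freq in inv_freq:
            cos.append(math.cos(pos * freq))
            sin.append(math.sin(pos * freq))
    return cos, sin


def save_snapshot(path, tensors):
    """Write a length-prefixed JSON manifest, then the raw float32 payloads.

    The layout is hypothetical; it only illustrates persisting derived state.
    """
    manifest, offset = {}, 0
    for name, data in tensors.items():
        manifest[name] = {"offset": offset, "count": len(data)}
        offset += len(data) * data.itemsize
    header = json.dumps(manifest).encode("utf-8")
    with open(path, "wb") as fh:
        fh.write(struct.pack("<Q", len(header)))
        fh.write(header)
        for data in tensors.values():
            fh.write(data.tobytes())


def load_snapshot(path):
    """Read the manifest and rebuild each float32 array without recomputation."""
    raw = Path(path).read_bytes()
    (header_len,) = struct.unpack_from("<Q", raw, 0)
    manifest = json.loads(raw[8:8 + header_len])
    payload_start = 8 + header_len
    restored = {}
    for name, meta in manifest.items():
        arr = array("f")
        start = payload_start + meta["offset"]
        arr.frombytes(raw[start:start + meta["count"] * arr.itemsize])
        restored[name] = arr
    return restored
```

In this sketch the path would point at shared storage mounted into both the original pod and the restored pod. Freezing the NPU device context itself is done through the CANN runtime and is not modeled here.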

Restoration phase: memory passthrough and on-demand post-processing
After the new pod starts, the system bypasses the traditional disk parsing and loading process.
- Hardware state restoration: The device-side context is restored first through the CANN API, and the underlying HCCL communication tasks are reloaded.
- Memory-mapped direct read: mmap is used to load model states from shared storage and transfer weights and buffer data directly from CPU memory to NPU memory, eliminating deserialization overhead.
- On-demand derivation logic: For the W8A8 quantization layer, the MLA attention mechanism, and the attention mask, a recovery flag causes the system to run the post-processing logic (matrix transposition, NZ format conversion, and projection matrix reconstruction) automatically, and only during the first forward pass after restoration. In addition, the mask cache matching the current sequence length is forcibly refreshed to ensure inference state consistency and correctness.
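
The memory-mapped read and the flag-driven, first-pass post-processing described above can be sketched as follows. This is an illustrative model under stated assumptions, not the engine's code: RestoredLinear, write_weights, and needs_post_process are hypothetical names, the weight file is a flat float32 dump, and a plain transpose stands in for the real derivation steps (NZ format conversion and projection matrix reconstruction are not modeled). The copy from CPU memory to NPU memory is also out of scope.

```python
import mmap
import tempfile
from array import array
from pathlib import Path


def write_weights(path, values):
    """Dump values as raw float32, standing in for a snapshot file."""
    Path(path).write_bytes(array("f", values).tobytes())


class RestoredLinear:
    """Hypothetical layer holding a row-major (rows x cols) restored weight."""

    def __init__(self, path, rows, cols, restored=True):
        self.rows, self.cols = rows, cols
        self._fh = open(path, "rb")
        self._map = mmap.mmap(self._fh.fileno(), 0, access=mmap.ACCESS_READ)
        # Typed view over the mapped pages: no parsing or deserialization step.
        self.weight = memoryview(self._map).cast("f")
        self.needs_post_process = restored  # stands in for the recovery flag
        self.weight_t = None
        self.post_process_calls = 0

    def _post_process(self):
        # Derive the transposed (cols x rows) layout once.
        w, r, c = self.weight, self.rows, self.cols
        self.weight_t = [w[i * c + j] for j in range(c) for i in range(r)]
        self.post_process_calls += 1
        self.needs_post_process = False

    def forward(self, x):
        # Post-processing is triggered lazily, on the first forward pass only.
        if self.needs_post_process:
            self._post_process()
        r, c, wt = self.rows, self.cols, self.weight_t
        return [sum(wt[j * r + i] * x[j] for j in range(c)) for i in range(r)]

    def close(self):
        # The view must be released before the mapping can be closed.
        self.weight.release()
        self._map.close()
        self._fh.close()
```

Deferring the derivation to the first forward pass keeps the restore path itself short: the pod can be marked restored as soon as the mapped data is in place, and the derived layout is built when it is first needed.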

Communication domain reconstruction: seamless integration in distributed environments
- Topology cleanup and update: The original Ascend parallel communication groups (TP, EP, network projection, and more) are completely cleared, and the distributed environment is updated using the actual network information (IP address and master node address) of the new pod. An incremental port initialization policy effectively avoids network conflicts.
- MoE route rebinding: In EP scenarios, the parallel communication group and scheduling configuration are rebound, and the HCCL communication identifier is updated to ensure correct routing distribution for large models.
- Computational graph reuse: Leveraging the Graph Engine (GE) caching mechanism enabled by TorchAir, the restored computational graph can be reused directly, avoiding the significant performance loss associated with recompilation and achieving efficient bridging between communication and computation.
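
The topology cleanup and update step can be sketched as below. The source does not specify how the incremental port initialization policy works; this sketch assumes it means stepping upward from the port used by the previous communication domain until an unused port is found. DistEnv, rebuild_env, port_is_free, and the groups dictionary are hypothetical names, and no HCCL or torch.distributed calls are made.

```python
import socket
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class DistEnv:
    """Hypothetical record of the network settings behind a communication domain."""
    master_addr: str
    master_port: int
    pod_ip: str
    generation: int = 0


def port_is_free(host, port):
    """Probe by binding locally; only meaningful when run on the master node."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
            return True
        except OSError:
            return False


def rebuild_env(old, new_pod_ip, new_master_addr, groups,
                max_tries=16, is_free=port_is_free):
    """Clear stale groups and return settings for the new communication domain."""
    # Drop every handle that was bound to the old pod IP address.
    groups.clear()
    # Assumed policy: start one past the previous port and step upward.
    for step in range(1, max_tries + 1):
        port = old.master_port + step
        if is_free(new_master_addr, port):
            return replace(old, master_addr=new_master_addr, master_port=port,
                           pod_ip=new_pod_ip, generation=old.generation + 1)
    raise RuntimeError("no free port found for the new communication domain")
```

Never reusing the previous port means a lingering listener from the old domain cannot collide with the new one, which is one plausible reading of how the policy avoids network conflicts.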

Core Components

Grus

Grus is a container snapshot engine that integrates with Kubernetes to provide high-performance, highly reliable container snapshot capabilities.

- Compatible with the checkpoint interface defined by the CRI API specification.
- Interoperates with OCI runtimes via grus-agent to enable seamless restore workflows.
- Efficiently checkpoints and restores container read-write layer data.

CRIU

Checkpoint/Restore in Userspace (CRIU) is the leading open-source project for checkpointing and restoring Linux processes. It can freeze a running application (a tree of processes), checkpoint its state, and persist it as a collection of files on disk. The application can later be restored from the frozen state, enabling seamless migration and fault recovery. CRIU is implemented primarily in userspace and remains the most feature-rich and actively maintained checkpoint/restore solution for Linux. It has been integrated into OpenVZ, LXC/LXD, Docker, and other platforms, and is available in the software repositories of major Linux distributions.

Building on open-source CRIU, CCE has customized and extended CRIU with NPU checkpoint/restore capabilities, packaging it as an RPM. The package consists of a core binary and a .so library, which work together to implement checkpointing and restoration for container-level NPU workloads.

- criu (core binary): Provides dump and restore capabilities for container-level snapshots, handling resources associated with container processes.
- npu_plugin.so: Built on the CRIU plugin mechanism, it interfaces with NPU-mapped host resources (such as NPU device file handles, device-mapped VMAs, and NPU-shared memory regions) to enable the import and export of NPU device context.
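
For orientation, the dump and restore capabilities of the core binary correspond to the dump and restore subcommands of upstream CRIU. The sketch below only builds the argument lists, using options from the upstream command line (--tree, --images-dir, --leave-running, --restore-detached). The helper names criu_dump_cmd and criu_restore_cmd are hypothetical, the options accepted by the CCE-customized package may differ, and how npu_plugin.so is loaded is not shown. In the workflow described here, the snapshot engine drives CRIU rather than a user invoking it by hand.

```python
def criu_dump_cmd(pid, images_dir, leave_running=True):
    """Build an upstream-style 'criu dump' command for a process tree."""
    cmd = ["criu", "dump", "--tree", str(pid), "--images-dir", images_dir]
    if leave_running:
        # Keep the workload running after its state has been written out.
        cmd.append("--leave-running")
    return cmd


def criu_restore_cmd(images_dir):
    """Build an upstream-style 'criu restore' command from saved image files."""
    return ["criu", "restore", "--images-dir", images_dir, "--restore-detached"]
```

On a host with CRIU installed and sufficient privileges, either list could be passed to subprocess.run(cmd, check=True).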
