Updated on 2025-06-05 GMT+08:00

Solution Overview

Scenarios

In recent years, AI has developed rapidly and been applied to many fields, such as autonomous driving, large models, AI-generated content (AIGC), and scientific AI. Implementing AI requires a large amount of infrastructure, including high-performance compute as well as high-speed storage and networking. Compute, storage, and network performance must develop in balance for AI compute power to be fully utilized.

From the classic AI of the past to today's large models and autonomous driving, the parameters and compute requirements of AI models have grown exponentially, posing new challenges to the storage infrastructure.

  1. High-throughput data access: As enterprises deploy more GPUs and NPUs, they expect the storage system to deliver throughput high enough to keep that compute fully utilized. Specifically, training data should be read faster to reduce the time compute spends waiting on I/O, and checkpoints should be saved and loaded in as little time as possible to shorten training interruptions.
  2. Shared data access through file interfaces: AI architectures require large-scale compute clusters (GPU and NPU servers) whose data comes from a unified source: a shared storage space. Shared data access ensures data consistency across servers and avoids keeping redundant copies on each server. PyTorch, a popular open-source deep learning framework in the AI ecosystem, is a good example: it accesses data through file interfaces by default. AI algorithm developers are also accustomed to file interfaces, which makes them the friendliest way to access shared storage.
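The file-interface access pattern described above can be sketched in a few lines. This is a minimal illustration, not part of the solution's documented API: the class follows PyTorch's map-style Dataset protocol (`__len__`/`__getitem__`), so it could be handed to `torch.utils.data.DataLoader` unchanged, and the mount path used in the usage note below is an assumed location where the shared file system is mounted on each server.

```python
# Minimal sketch of shared data access through ordinary file interfaces.
# Every server in the cluster lists the same directory on the shared
# mount and therefore sees an identical, consistent view of the data.
import os

class SharedFileDataset:
    """Reads training samples as plain files from a shared POSIX mount."""

    def __init__(self, root):
        # Sorting gives every server the same sample ordering.
        self.paths = sorted(
            os.path.join(root, name) for name in os.listdir(root)
        )

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        # Ordinary file I/O -- no SDK calls or protocol adaptation needed.
        with open(self.paths[idx], "rb") as f:
            return f.read()
```

On a real cluster, each training process would simply pass the mount point, for example `SharedFileDataset("/mnt/sfs-turbo/train")` (an illustrative path), and wrap the dataset in a DataLoader.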

To learn more about this solution, or if you have any questions when using it, seek support through Solution Consultation.

Solution Architecture

To address these challenges in AI training, Huawei Cloud provides an AI cloud storage solution based on Object Storage Service (OBS) and Scalable File Service Turbo (SFS Turbo). As shown in Figure 1, SFS Turbo works with OBS: you can use SFS Turbo HPC file systems to speed up data access to OBS and asynchronously store the generated data to OBS for low-cost, long-term persistent storage.

Figure 1 Huawei Cloud AI cloud storage solution based on OBS+SFS Turbo

Solution Advantages

Table 1 describes the advantages of the Huawei Cloud AI cloud storage solution.

Table 1 Advantages of the Huawei Cloud AI cloud storage solution

1. Decoupled storage and compute improves resource utilization.

   GPU and NPU compute resources are decoupled from SFS Turbo storage, so compute and storage can be scaled separately as needed, improving resource utilization.

2. High-performance SFS Turbo accelerates the training process.

   • Training datasets can be read at high speed, so GPUs/NPUs do not have to wait for storage I/Os to complete, improving GPU/NPU utilization.
   • TB-scale checkpoint files of large models can be saved and loaded in seconds, reducing training interruption time.
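The checkpoint save path described above can be sketched generically. This is an illustration of checkpointing to a shared file system, not SFS Turbo's own mechanism: the directory is assumed to be a shared mount point, and `pickle` stands in for what would be `torch.save`/`torch.load` in a PyTorch job. Writing to a temporary file and then renaming makes the checkpoint appear atomically, so a resuming job never reads a half-written file.

```python
# Sketch of checkpointing training state to a shared file system.
import os
import pickle
import tempfile

def save_checkpoint(state, path):
    """Atomically persist a training-state dict to shared storage."""
    d = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=d)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, path)  # rename is atomic on POSIX file systems

def load_checkpoint(path):
    """Load the latest checkpoint to resume an interrupted job."""
    with open(path, "rb") as f:
        return pickle.load(f)
```

Because every server sees the same shared path, any node in the cluster can resume from the checkpoint another node wrote.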

3. Asynchronous data import and export does not consume training time, and no external migration tools need to be deployed.

   • Before a training job starts, data is imported from OBS to SFS Turbo. During training, checkpoint data written to SFS Turbo is asynchronously exported to OBS. None of these operations takes up training time.
   • Data can be imported and exported directly between SFS Turbo and OBS. No external data copy machines or tools are required.

4. Automatic data flow between hot and cold tiers reduces storage costs.

   • SFS Turbo allows you to define a data eviction duration. Cold data is then automatically deleted from the high-performance storage and retained in OBS, making room for hot data.
   • When cold data needs to be accessed, SFS Turbo automatically loads it back from OBS to restore access performance.
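To make the eviction-duration idea concrete, the selection of cold files can be sketched as an age check on last-access time. This is a conceptual illustration only, not SFS Turbo's implementation; in the actual service, eviction and reload from OBS happen automatically, and evicted data is assumed to already have a persistent copy in OBS.

```python
# Conceptual sketch: identify files whose last access is older than a
# configured eviction duration, mirroring the behavior described above.
import os
import time

def find_cold_files(root, eviction_seconds):
    """Return files in root not accessed within the eviction duration."""
    now = time.time()
    cold = []
    for name in os.listdir(root):
        path = os.path.join(root, name)
        # atime records when the file was last read.
        if now - os.path.getatime(path) > eviction_seconds:
            cold.append(path)
    return cold
```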

5. Compatibility with mainstream AI development platforms and ecosystems.

   Mainstream AI frameworks such as PyTorch and MindSpore, Kubernetes containers, and algorithm development tools can all access shared data in SFS Turbo through file semantics. No development adaptation is required.
