Updated on 2025-05-29 GMT+08:00

Storage

Description

Storage options differ in performance, usability, and cost, and no single storage medium suits every scenario. Understanding the application scenarios of each in-cloud storage type helps you choose the right one.

This document describes DevEnviron storage SLAs and storage application scenarios.

Storage SLAs

  • Availability issues caused by unavailability of your OBS buckets or SFS for storing data are not included in service unavailability.
  • Service exceptions caused by unavailability of SFS managed by you are not included in service unavailability.
  • If you use EVS or default storage in the development environment, you are responsible for backing up data. Data loss caused by the deletion of the development environment instance is not included in service unavailability.

Storage Application Scenarios

Table 1 Storage application scenarios

Storage Type     Application Scenario                              Advantages and Disadvantages
EVS              Single environment, large files                   Used in a single development environment
PFS              Object storage, large files                       Average performance in frequent read and write of small files
SFS              Dedicated resource pool, multiple environments    Lifecycle association
Local storage    First choice for heavy-duty training jobs         Lifecycle association
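
The selection logic summarized in Table 1 can be sketched as a small helper. This is only an illustration: the `Workload` class, its field names, and the order in which the checks run are assumptions made for this example, not part of any ModelArts API.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    """Hypothetical description of a job; the field names are illustrative only."""
    heavy_duty_training: bool = False         # heavy-duty training job
    shared_across_environments: bool = False  # data shared by several environments
    dedicated_resource_pool: bool = False     # job runs in a dedicated resource pool
    data_in_obs: bool = False                 # data already lives in object storage


def suggest_storage(w: Workload) -> str:
    """Return the storage type from Table 1 that best matches the workload.

    The priority order of the checks is an assumption of this sketch.
    """
    if w.heavy_duty_training:
        return "Local storage"
    if w.shared_across_environments and w.dedicated_resource_pool:
        return "SFS"
    if w.data_in_obs:
        return "PFS"
    # Single development environment with no special requirement.
    return "EVS"
```

For example, `suggest_storage(Workload(data_in_obs=True))` returns `"PFS"`, while a workload with no flags set falls back to `"EVS"`.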

  • EVS

    Application scenarios: Data and algorithm exploration only in the development environment

    Advantages: EVS is SSD-backed block storage with better overall I/O performance than NFS. As persistent storage, EVS disks are mounted to /home/ma-user/work, and the data in this directory is retained after the instance is stopped. The storage capacity can be expanded online on demand, up to 4096 GB.

    Disadvantages: This type of storage can only be used in a single development environment.
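
Because EVS capacity is expanded on demand, it can help to check how full the mounted directory is before deciding to expand. The following is a minimal sketch that uses only the Python standard library. The `usage_report` function and the 0.9 warning threshold are assumptions for illustration, not service settings.

```python
import shutil
from pathlib import Path

EVS_MOUNT = Path("/home/ma-user/work")  # mount point stated in this document
GIB = 1024 ** 3


def usage_report(path: Path = EVS_MOUNT, warn_ratio: float = 0.9) -> dict:
    """Summarize disk usage for the file system that holds `path`.

    `warn_ratio` is an arbitrary illustrative threshold, not a service limit.
    """
    usage = shutil.disk_usage(path)
    used_ratio = usage.used / usage.total if usage.total else 0.0
    return {
        "total_gib": round(usage.total / GIB, 1),
        "free_gib": round(usage.free / GIB, 1),
        "used_ratio": round(used_ratio, 3),
        "should_expand": used_ratio >= warn_ratio,
    }


if __name__ == "__main__":
    # Fall back to the current directory when run outside a notebook instance.
    target = EVS_MOUNT if EVS_MOUNT.exists() else Path(".")
    print(usage_report(target))
```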

  • PFS
    Application scenarios:
    • Storage for datasets. Mount an OBS parallel file system that stores a dataset to a notebook instance. The mounted file system can then be used directly during training.
    • Storage for code. After debugging on a notebook instance, specify the OBS path as the code path for starting training, which makes temporary modifications easier.
    • Storage for checking training progress. Mount the storage to the training output path, such as the training log path, so that you can view and check training on the notebook instance in real time. This is especially suitable for analyzing training job output using TensorBoard or a notebook.

    Advantages: PFS is an optimized high-performance object storage file system with low storage costs and large throughput. It can quickly process high-performance computing (HPC) workloads. PFS mounting is recommended if OBS is used.

    Disadvantages: Performance is weak for frequent reads and writes of small files. Object storage semantics differ from POSIX semantics, so you need to understand the differences before use.
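
One common way to work around weak small-file performance on object-backed storage is to bundle many small files into a single archive before writing it to the mounted path. The sketch below shows this general technique with the Python standard library. It is not a procedure prescribed by this document, and `pack_small_files` and any paths passed to it are hypothetical.

```python
import tarfile
from pathlib import Path


def pack_small_files(src_dir: Path, archive_path: Path) -> int:
    """Pack every regular file under `src_dir` into one uncompressed tar archive.

    Returns the number of files packed. Writing one large archive replaces
    many small writes with a single sequential write.
    """
    archive_path.parent.mkdir(parents=True, exist_ok=True)
    archive_resolved = archive_path.resolve()
    count = 0
    with tarfile.open(archive_path, "w") as tar:
        for f in sorted(src_dir.rglob("*")):
            # Skip directories and the archive itself if it lives under src_dir.
            if not f.is_file() or f.resolve() == archive_resolved:
                continue
            tar.add(str(f), arcname=f.relative_to(src_dir).as_posix())
            count += 1
    return count
```

The archive can later be extracted onto faster storage with `tarfile.open(archive_path).extractall(target_dir)` before the files are read individually.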

  • SFS

    Application scenarios: Available only in dedicated resource pools. Use SFS in informal scenarios such as exploration and experiments rather than in formal production. One SFS file system can be mounted to both a development environment and a training environment, so you do not need to download data each time a training job starts. This type of storage is not suitable for heavy I/O training on more than 32 cards.

    Advantages: SFS is implemented as NFS and can be shared among multiple development environments and between development and training environments. It is the preferred storage for distributed training jobs that are not heavy-duty, especially when you want to avoid downloading data again each time a training job starts.

    Disadvantages: The storage lifecycle is associated with the container lifecycle. Data needs to be downloaded each time the training job starts.

  • Local storage

    Application scenarios: First choice for heavy-duty training jobs.

    Advantages: Local storage uses the high-performance SSDs of the target VM or BMS and offers high file I/O throughput. For heavy-duty training jobs, store data in the target directory and then start training. By default, the storage is mounted to the /cache directory of the container, with 500 GB of available space.

    Disadvantages: The storage lifecycle is associated with the container lifecycle. Data needs to be downloaded each time the training job starts.
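
Because data on local storage must be copied in again whenever a new container starts, a training script often begins with a staging step. The following is a minimal sketch of such a step, assuming the dataset is reachable at some source directory. `stage_dataset`, `dir_size`, and the source path in the usage note are hypothetical names for illustration. Only the default /cache mount point comes from this document.

```python
import shutil
from pathlib import Path


def dir_size(path: Path) -> int:
    """Return the total size in bytes of all regular files under `path`."""
    return sum(f.stat().st_size for f in path.rglob("*") if f.is_file())


def stage_dataset(src: Path, cache_root: Path = Path("/cache")) -> Path:
    """Copy `src` under `cache_root` and return the path of the local copy.

    If a directory with the same name already exists, it is reused as-is.
    This sketch does not verify that the existing copy is complete.
    """
    dest = cache_root / src.name
    if dest.exists():
        return dest
    needed = dir_size(src)
    free = shutil.disk_usage(cache_root).free
    if needed > free:
        raise OSError(
            f"{src} needs {needed} bytes but {cache_root} has only {free} free"
        )
    shutil.copytree(src, dest)
    return dest
```

A training script might call `stage_dataset(Path("/path/to/dataset"))` once at startup and then read all training data from the returned directory.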