Updated on 2024-10-29 GMT+08:00

Creating a Custom Training Image

If you have developed a model or training script locally but the AI engine you used is not supported by ModelArts, create a custom image and upload it to SWR. Then, use this image to create a training job on ModelArts and use the resources provided by ModelArts to train models.

Procedure

Figure 1 Creating a custom image for a training job

Scenario 1: If the preset images meet ModelArts training constraints but lack necessary code dependencies, install additional software packages.

For details, see Creating a Custom Training Image Using a Preset Image.
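As a sketch of this scenario, the Dockerfile below extends a preset image with additional software packages. The base image address and package names are placeholders, not actual ModelArts addresses; obtain the real preset image address from the guide linked above.

```dockerfile
# Placeholder for the SWR address of a ModelArts preset image;
# replace it with the address of the preset image you actually use.
FROM swr.<region>.example.com/<organization>/pytorch:<tag>

# Install the extra Python dependencies your training code needs.
# The package names here are illustrative only.
RUN pip install --no-cache-dir transformers datasets
```

After building, the new image keeps the ModelArts compatibility of the preset image while adding your code dependencies.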

Scenario 2: If the local images meet code dependency requirements but not ModelArts training constraints, adapt them to ModelArts.

For details, see Migrating Existing Images to ModelArts.
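A minimal sketch of adapting an existing local image to the constraints listed below (notably, a default user with UID 1000). The image name, user name ma-user, and group ID 100 are assumptions based on common ModelArts examples; check the migration guide for the exact values.

```dockerfile
# Start from your existing local image (the name is a placeholder).
FROM my-local-training-image:latest

# ModelArts requires the default user of the image to have UID 1000.
# "ma-user" and GID 100 are assumed here for illustration.
RUN useradd -m -d /home/ma-user -s /bin/bash -u 1000 -g 100 ma-user

# Make the UID-1000 user the default user of the image.
USER ma-user
WORKDIR /home/ma-user
```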

Scenario 3: If neither the preset nor local images meet your needs, create an image that has the necessary code dependencies and meets ModelArts constraints. For details, see the following cases:

Creating a Custom Training Image (PyTorch + CPU/GPU)

Creating a Custom Training Image (MPI + CPU/GPU)

Creating a Custom Training Image (TensorFlow + GPU)

Constraints on Custom Images of the Training Framework

  • Use Ubuntu 18.04 for custom images to avoid version compatibility issues.
  • Do not use a custom image larger than 15 GB, and ensure its size does not exceed half of the container engine space of the resource pool. Otherwise, the start time of the training job is affected.

    The container engine space of the ModelArts public resource pool is 50 GB. By default, the container engine space of a dedicated resource pool is also 50 GB. You can customize the container engine space when creating a dedicated resource pool.

  • The UID of the default user in a custom image must be 1000.
  • The GPU or Ascend driver cannot be installed in a custom image. When you select GPU resources to run training jobs, ModelArts automatically places the GPU driver in the /usr/local/nvidia directory in the training environment. When you select Ascend resources to run training jobs, ModelArts automatically places the Ascend driver in the /usr/local/Ascend/driver directory.
  • x86- or Arm-based custom images can run only on specifications that match their architecture.
    Run the following command to check the CPU architecture of a custom image:
    docker inspect {Custom image path} | grep Architecture
    The following is the command output for an Arm-based custom image:
    "Architecture": "arm64"
    • If the name of a specification contains Arm, this specification is an Arm-based CPU architecture.
    • If the name of a specification does not contain Arm, this specification is an x86-based CPU architecture.
  • The ModelArts backend does not support the download of open source installation packages. Install the dependency packages required for training in the custom image.
  • Custom images can be used to train models in ModelArts only after they are uploaded to Software Repository for Container (SWR).
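As a sketch of the upload step, the commands below tag a local image with its SWR address and push it. The region, organization, image name, and credentials are all placeholders; obtain the actual login command from the SWR console.

```shell
# 1. Log in to SWR (get the exact login command from the SWR console).
docker login -u <region>@<AK> -p <login-key> swr.<region>.myhuaweicloud.com

# 2. Tag the local image with its SWR address.
docker tag my-training-image:v1 swr.<region>.myhuaweicloud.com/<organization>/my-training-image:v1

# 3. Push the image to SWR.
docker push swr.<region>.myhuaweicloud.com/<organization>/my-training-image:v1
```

Once the push succeeds, the image can be selected when creating a training job on ModelArts.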