Updated on 2024-04-11 GMT+08:00

Troubleshooting Process

Symptom

A training job using a custom image failed.

Locating Method

  1. Determine the image source.
    • Check whether the custom image is built on a base image provided by ModelArts. You are advised to use a ModelArts base image to create a custom image. For details, see Using a Base Image to Create a Training Image.
    • If the image is from a third party, ask the creator of the custom image how to use it.
  2. Determine the size of the custom image.

    Do not use a custom image larger than 15 GB. In addition, the image size should not exceed half of the container engine space of the resource pool. A larger image slows down the startup of the training job.

    The container engine space of a ModelArts public resource pool is 50 GB. By default, the container engine space of a dedicated resource pool is also 50 GB. You can customize the container engine space when creating a dedicated resource pool.

  3. Determine the error type.
    • If an error message is displayed indicating that a file could not be found, see Error Message "No such file or directory" Displayed in Training Job Logs.
    • If an error message is displayed indicating that a package could not be found, see Error Message "No module named .*" Displayed in Training Job Logs.
    • If an error occurred in the Ascend startup script or initialization script:

      Check whether the script was obtained from the official website and whether it is used in strict accordance with the official documentation. For example, check whether the script name and path are correct.

    • If the driver version in the custom image is incompatible with the underlying driver:

      Before upgrading the driver in a custom image, obtain the driver versions supported by the underlying driver and check that the target version is among them.

    • If access to a file is denied:

      A possible cause is that the user in the custom image differs from the user of the job container. In this case, add the following instruction to the Dockerfile:

      # Create ma-user (UID 1000) in ma-group (GID 1000) if the user is missing,
      # then give the group access to /home/ma-user and /root.
      RUN if id -u ma-user > /dev/null 2>&1 ; \
          then echo 'The ModelArts user already exists.' ; \
          else echo 'The ModelArts user does not exist.' && \
              groupadd ma-group -g 1000 && \
              useradd -d /home/ma-user -m -u 1000 -g 1000 -s /bin/bash ma-user ; fi && \
          chmod 770 /home/ma-user && \
          chmod 770 /root && \
          usermod -a -G root ma-user
    • For other issues, search for solutions in training failure cases.
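
The size check in step 2 and the first error types in step 3 can be sketched as a small triage helper. This is a minimal illustration, not part of ModelArts: the function names check_image_size and classify_log_line are hypothetical, "GB" is assumed to mean 2**30 bytes, and the "Permission denied" pattern is an assumption based on the usual Linux error text for a denied file access.

```python
import re

GIB = 1024 ** 3  # assumption: "GB" in the text is treated as 2**30 bytes

MAX_IMAGE_GB = 15             # upper limit stated in step 2
DEFAULT_ENGINE_SPACE_GB = 50  # default container engine space stated in step 2


def check_image_size(image_bytes, engine_space_gb=DEFAULT_ENGINE_SPACE_GB):
    """Return a list of warnings for an image size (hypothetical helper)."""
    size_gb = image_bytes / GIB
    warnings = []
    if size_gb > MAX_IMAGE_GB:
        warnings.append(
            f"image is {size_gb:.1f} GB, larger than {MAX_IMAGE_GB} GB"
        )
    if size_gb > engine_space_gb / 2:
        warnings.append(
            f"image is {size_gb:.1f} GB, more than half of the "
            f"{engine_space_gb} GB container engine space"
        )
    return warnings


# Log patterns for three error types from step 3. The first two strings come
# from the linked case titles; "Permission denied" is an assumed log text.
ERROR_PATTERNS = [
    ("file not found", re.compile(r"No such file or directory")),
    ("package not found", re.compile(r"No module named \S+")),
    ("file access denied", re.compile(r"Permission denied")),
]


def classify_log_line(line):
    """Return the first matching error type for a log line, or None."""
    for label, pattern in ERROR_PATTERNS:
        if pattern.search(line):
            return label
    return None


if __name__ == "__main__":
    # A 20 GB image in a pool with the default 50 GB engine space.
    for warning in check_image_size(20 * GIB):
        print(warning)
    print(classify_log_line("ModuleNotFoundError: No module named 'torch'"))
```

If a dedicated resource pool uses a customized container engine space, pass that value as engine_space_gb. Script errors and driver incompatibility are not covered here because the text gives no fixed log message for them.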

Summary and Suggestions

Before using a custom image for training jobs, create the image by following the custom image specifications, which also provide end-to-end examples for your reference.