Updated on 2024-06-15 GMT+08:00

Troubleshooting a Training Job Failure

Symptom

A training job is in Failed state.

Cause Analysis and Solution

  • The error "MoxFileNotExistsException(resp, 'file or directory or bucket not found.')" is displayed in the training logs.
    • Cause: The train_data_obs directory is not found when MoXing copies files.
    • Solution: Correct the address of the train_data_obs directory and restart the training job.

      Do not delete any objects from the OBS directory while MoXing is downloading them. Otherwise, the download fails.

  • The error "CUDA capability sm_80 is not compatible with the current PyTorch installation. The current PyTorch install supports CUDA capabilities sm_37 sm_50 sm_60 sm_70" is displayed in the training logs.
    • Cause: The CUDA and PyTorch versions in the image used by the training job support only accelerator cards with the sm_37, sm_50, sm_60, and sm_70 compute capabilities. Accelerator cards with the sm_80 compute capability are not supported.
    • Solution: Create the training job from a custom image in which CUDA and PyTorch versions that support sm_80 are installed.
  • The error "ERROR:root:label_map.pbtxt cannot be found. It will take a long time to open every annotation files to generate a tmp label_map.pbtxt." is displayed in the training logs.
    • If you use an algorithm that you subscribed to from AI Gallery, make sure the data is labeled correctly.
    • If you use an object detection algorithm, make sure the label boxes of the data are rectangular.

      Object detection algorithms support only rectangular label boxes.

  • The error "RuntimeError: The server socket has failed to listen on any local network address. The server socket has failed to bind to [::]:29500 (errno: 98 - Address already in use). The server socket has failed to bind to 0.0.0.0:29500 (errno: 98 - Address already in use)." is displayed in the training logs.
    • Cause: The port that the training job tries to listen on (29500 in this example) is already in use by another process.
    • Solution: Change the port number in the code and restart the training job.
  • The error "WARNING:root:Retry=7, Wait=0.4, Timestamp=1697620658.6282516" is displayed in the training logs.
    • Cause: The MoXing version is too old.
    • Solution: Contact technical support engineers to upgrade MoXing to 2.1.6 or later.
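
For the MoxFileNotExistsException case, it can help to confirm that the OBS path exists before starting the copy, so that a wrong train_data_obs address is reported with a clear message. The following is a minimal sketch, not an official procedure. It assumes that MoXing is installed in the training image and uses mox.file.exists and mox.file.copy_parallel. The function name copy_training_data and the paths in the usage note are hypothetical.

```python
def copy_training_data(src_url, dst_dir, exists=None, copy=None):
    """Copy src_url to dst_dir only if src_url exists.

    `exists` and `copy` are injectable so the logic can be tried without
    MoXing. When they are omitted, MoXing is imported (assumption: it is
    installed in the training image).
    """
    if exists is None or copy is None:
        import moxing as mox  # assumption: available in the training image
        exists = exists or mox.file.exists
        copy = copy or mox.file.copy_parallel

    # Fail early with the offending path instead of a generic copy error.
    if not exists(src_url):
        raise FileNotFoundError(f"OBS path not found: {src_url}")

    copy(src_url, dst_dir)
    return dst_dir
```

In a training script, a call might look like copy_training_data("obs://my-bucket/train-data/", "/cache/train-data"), where both paths are placeholders for your own values.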
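
For the "CUDA capability sm_80 is not compatible" case, the mismatch can be checked inside the image before a full training run. The following is a rough sketch under stated assumptions: the helper is_capability_supported is hypothetical, and it treats a card as supported when the PyTorch build contains an sm_ target with the same major version, which is a simplification. It relies on torch.cuda.get_device_capability and torch.cuda.get_arch_list, and skips the live check if PyTorch or a GPU is unavailable.

```python
def is_capability_supported(capability, arch_list):
    """Return True if a build target with the same major version exists.

    capability: (major, minor) tuple, for example (8, 0) for sm_80.
    arch_list:  strings such as "sm_70"; non-"sm_" entries are ignored.
    """
    major, _minor = capability
    built_majors = {
        int(arch.split("_")[1]) // 10
        for arch in arch_list
        if arch.startswith("sm_")
    }
    return major in built_majors


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        torch = None

    if torch is not None and torch.cuda.is_available():
        cap = torch.cuda.get_device_capability(0)
        archs = torch.cuda.get_arch_list()
        print("device capability:", cap)
        print("build targets:", archs)
        print("supported:", is_capability_supported(cap, archs))
```

If the check reports that the card is not supported, the image needs CUDA and PyTorch versions built for that card, as described in the solution for this error.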
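
For the "Address already in use" case, one way to change the port in the code is to test whether a port is occupied and, if so, ask the operating system for a free one. The following is a minimal sketch that uses only the Python standard library. The helper names are hypothetical, and how the chosen port is passed to the training framework depends on your code. For example, a PyTorch job that uses environment-variable initialization reads the MASTER_PORT variable, which is an assumption about your setup.

```python
import socket


def is_port_in_use(port, host="127.0.0.1"):
    """Return True if binding to (host, port) fails, i.e. the port is taken."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind((host, port))
        except OSError:
            return True
        return False


def find_free_port(host="127.0.0.1"):
    """Ask the OS for a currently unused port by binding to port 0."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind((host, 0))
        return s.getsockname()[1]


if __name__ == "__main__":
    port = 29500  # the port from the error message
    if is_port_in_use(port):
        port = find_free_port()
    print("port to use:", port)
```

In a multi-node job, all nodes must agree on the same port, so the port would need to be chosen once and passed to every node rather than picked independently on each one.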