Troubleshooting a Training Job Failure

Updated on 2024-06-15 GMT+08:00

Symptom

A training job is in the Failed state.

Cause Analysis and Solution

The error "MoxFileNotExistsException(resp, 'file or directory or bucket not found.')" is displayed in the training logs.
- Cause: The train_data_obs directory is not found when MoXing copies files.
- Solution: Correct the address of the train_data_obs directory and restart the training job.
  NOTICE:
  
  Do not delete any objects from the OBS directory while MoXing is downloading them. Doing so causes the download to fail.
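A pre-flight check can catch a wrong train_data_obs address before the job starts copying files. The following is a minimal sketch, not part of MoXing: `check_obs_dir` is a hypothetical helper, the bucket and paths are made up, and the `exists` argument is assumed to be any callable that reports whether a path is present.

```python
from urllib.parse import urlparse


def check_obs_dir(url, exists=None):
    """Return a list of problems found with an OBS directory URL (hypothetical helper)."""
    problems = []
    parsed = urlparse(url)
    if parsed.scheme != "obs":
        problems.append(f"expected an obs:// URL, got {url!r}")
    elif not parsed.netloc:
        problems.append("bucket name is missing")
    # Only query the storage backend when the URL itself looks valid.
    if not problems and exists is not None and not exists(url):
        problems.append(f"{url} was not found")
    return problems


# Offline demonstration: a set of made-up paths stands in for the real existence check.
known_paths = {"obs://my-bucket/data/train/"}
print(check_obs_dir("obs://my-bucket/data/train/", exists=known_paths.__contains__))
print(check_obs_dir("obs://my-bucket/data/trian/", exists=known_paths.__contains__))
```

In a training script, passing `exists=mox.file.exists` (after `import moxing as mox`) would run the same check against OBS, assuming that function is available in the MoXing version installed in the image.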
The error "CUDA capability sm_80 is not compatible with the current PyTorch installation. The current PyTorch install supports CUDA capabilities sm_37 sm_50 sm_60 sm_70" is displayed in the training logs.
- Cause: The CUDA and PyTorch versions in the image used by the training job support only sm_37, sm_50, sm_60, and sm_70 accelerator cards. The sm_80 accelerator card is not supported.
- Solution: Use a custom image to create the training job, and install CUDA and PyTorch versions that support sm_80 in that image.
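To confirm the mismatch before rebuilding an image, the accelerator's compute capability can be compared with the architectures the installed PyTorch was built for. The sketch below assumes a simplified rule (a card is treated as supported when the build includes any `sm_` target with the same major version), which is an approximation rather than PyTorch's exact logic. `capability_supported` and `check_current_gpu` are hypothetical helper names. `torch.cuda.get_device_capability` and `torch.cuda.get_arch_list` are existing PyTorch calls.

```python
import re


def capability_supported(capability, arch_list):
    """Return True if any sm_XY entry in arch_list shares the device's major version."""
    major, _minor = capability
    built_majors = set()
    for arch in arch_list:
        match = re.fullmatch(r"sm_(\d+)[a-z]?", arch)
        if match:  # skip non-sm entries such as PTX targets
            built_majors.add(int(match.group(1)) // 10)
    return major in built_majors


def check_current_gpu():
    """Compare GPU 0 with the installed PyTorch build (requires torch and a GPU)."""
    import torch  # imported lazily so the helper above works without PyTorch

    capability = torch.cuda.get_device_capability(0)
    arch_list = torch.cuda.get_arch_list()
    return capability_supported(capability, arch_list), capability, arch_list


# Values taken from the error message: an sm_80 card and a build without sm_80.
print(capability_supported((8, 0), ["sm_37", "sm_50", "sm_60", "sm_70"]))
```

Calling `check_current_gpu()` inside the training container returns the verdict together with the detected capability and architecture list, which can be logged at the start of the job.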
The error "ERROR:root:label_map.pbtxt cannot be found. It will take a long time to open every annotation files to generate a tmp label_map.pbtxt." is displayed in the training logs.
- If you use an algorithm that you subscribed to from AI Gallery, make sure the data is labeled correctly.
- If you use an object detection algorithm, make sure the label boxes of the data are rectangular.
  NOTE:
  
  Object detection algorithms support only rectangular label boxes.
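Non-rectangular label boxes can be found with a small script before training. The sketch below is only an illustration under stated assumptions: it assumes each label box is available as a list of (x, y) corner points and that "rectangular" means an axis-aligned box. The actual annotation format of your dataset may differ, and `is_axis_aligned_rectangle` is a hypothetical helper.

```python
def is_axis_aligned_rectangle(points):
    """Return True if four (x, y) corner points form an axis-aligned rectangle."""
    corners = set(map(tuple, points))
    if len(points) != 4 or len(corners) != 4:
        return False
    xs = {x for x, _ in corners}
    ys = {y for _, y in corners}
    if len(xs) != 2 or len(ys) != 2:
        return False
    # Every combination of the two x values and the two y values must be a corner.
    return {(x, y) for x in xs for y in ys} == corners


# Made-up label boxes: the first is a rectangle, the second is a triangle.
boxes = {
    "img_001": [(0, 0), (4, 0), (4, 3), (0, 3)],
    "img_002": [(0, 0), (4, 0), (2, 3)],
}
for name, corners in boxes.items():
    print(name, is_axis_aligned_rectangle(corners))
```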
The error "RuntimeError: The server socket has failed to listen on any local network address. The server socket has failed to bind to [::]:29500 (errno: 98 - Address already in use). The server socket has failed to bind to 0.0.0.0:29500 (errno: 98 - Address already in use)." is displayed in the training logs.
- Cause: The port that the training job tries to listen on (29500 in this example) is already in use by another process.
- Solution: Change the port number in the code and restart the training job.
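One way to avoid a hard-coded port is to test whether it is free and fall back to a port chosen by the operating system. The following is a minimal sketch using only the Python standard library. It assumes the training code reads its rendezvous port from the `MASTER_PORT` environment variable, which may not match how your job is launched. `find_free_port` and `is_port_free` are hypothetical helper names.

```python
import os
import socket


def find_free_port():
    """Ask the OS for a TCP port that is currently unused on this host."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind(("", 0))  # port 0 lets the OS choose a free port
        return sock.getsockname()[1]


def is_port_free(port):
    """Return True if the port can be bound on all IPv4 interfaces right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind(("", port))
        except OSError:
            return False
        return True


# Keep 29500 when it is available, otherwise pick another port.
port = 29500 if is_port_free(29500) else find_free_port()
os.environ["MASTER_PORT"] = str(port)  # assumption: the launcher reads MASTER_PORT
print(f"Using port {port}")
```

The check is not atomic: another process could take the port between the check and the start of training. In a multi-node job, every node must also be given the same port value, so the port would normally be chosen once and passed to all workers.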
The error "WARNING: root: Retry=7, Wait=0.4, Timestamp=1697620658.6282516" is displayed in the training logs.
- Cause: The MoXing version is too old.
- Solution: Contact technical support engineers to upgrade MoXing to 2.1.6 or later.
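Before contacting support, it can help to confirm whether the installed version is below 2.1.6. The sketch below only shows the comparison. How the installed MoXing version string is obtained is not covered here, because the source does not specify it. `parse_version` and `meets_minimum` are hypothetical helpers, and they assume a purely numeric dotted version string.

```python
def parse_version(text):
    """Turn a dotted version string such as '2.1.6' into a tuple of integers."""
    return tuple(int(part) for part in text.strip().split("."))


def meets_minimum(installed, minimum="2.1.6"):
    """Return True if the installed version is at least the minimum version."""
    return parse_version(installed) >= parse_version(minimum)


# Made-up version strings for illustration.
for version in ("2.0.10", "2.1.6", "2.10.0"):
    print(version, meets_minimum(version))
```

Comparing tuples of integers avoids the pitfall of plain string comparison, where "2.10.0" would sort before "2.1.6".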
