Troubleshooting a Training Job Failure
Updated on 2024-06-15 GMT+08:00
Symptom
A training job is in Failed state.
Cause Analysis and Solution
- The error "MoxFileNotExistsException(resp, 'file or directory or bucket not found.')" is displayed in the training logs.
- Cause: The train_data_obs directory is not found when MoXing copies files.
- Solution: Correct the address of the train_data_obs directory and restart the training job.
Do not delete any objects from the OBS directory while MoXing is downloading them. This will cause the download to fail.
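A path check before the copy can surface a wrong train_data_obs address early. The following is a minimal sketch, not the MoXing implementation: `check_obs_dir` is a hypothetical helper, and the existence check is passed in as a callable so the example runs without MoXing installed. In a training job, that callable would presumably be `mox.file.exists`, which is an assumption to verify against your MoXing version.

```python
def check_obs_dir(url, exists):
    """Validate an OBS directory URL before copying from it.

    `exists` is any callable that takes the URL and returns True or False.
    In a training job this is assumed to be something like mox.file.exists.
    """
    if not url.startswith("obs://"):
        raise ValueError(f"Not an OBS URL (expected obs://...): {url!r}")
    if not exists(url):
        raise FileNotFoundError(f"OBS directory not found: {url!r}")
    return url


if __name__ == "__main__":
    # Stand-in for a real existence check, for illustration only.
    known_dirs = {"obs://my-bucket/train-data/"}
    print(check_obs_dir("obs://my-bucket/train-data/", known_dirs.__contains__))
```

Failing fast with a clear message is easier to diagnose than a MoxFileNotExistsException raised partway through the copy.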
- The error "CUDA capability sm_80 is not compatible with the current PyTorch installation. The current PyTorch install supports CUDA capabilities sm_37 sm_50 sm_60 sm_70" is displayed in the training logs.
- Cause: The CUDA and PyTorch versions in the image used by the training job support only accelerator cards with the sm_37, sm_50, sm_60, and sm_70 compute capabilities. Accelerator cards with the sm_80 compute capability are not supported.
- Solution: Use a custom image to create a training job and install the target CUDA and PyTorch versions.
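To confirm whether an image matches the accelerator card before submitting a job, the card's compute capability can be compared with the architectures the PyTorch build was compiled for. The sketch below is a simplified check under stated assumptions: `is_capability_supported` and `check_local_gpu` are hypothetical helper names, and the matching rule (same major version, build minor not newer than the device minor) is an approximation of CUDA binary compatibility that ignores PTX fallback. `torch.cuda.get_device_capability` and `torch.cuda.get_arch_list` are existing PyTorch calls.

```python
def is_capability_supported(capability, arch_list):
    """Return True if a (major, minor) capability is covered by arch_list.

    arch_list holds strings such as "sm_70". Entries that are not plain
    "sm_<digits>" (for example "compute_80") are ignored in this sketch.
    """
    major, minor = capability
    for arch in arch_list:
        digits = arch[3:] if arch.startswith("sm_") else ""
        if not digits.isdigit():
            continue
        arch_major, arch_minor = int(digits[:-1]), int(digits[-1])
        if arch_major == major and arch_minor <= minor:
            return True
    return False


def check_local_gpu():
    """Run the check against the local GPU. Requires PyTorch and a CUDA device."""
    import torch  # imported here so the helper above works without PyTorch

    capability = torch.cuda.get_device_capability(0)
    return is_capability_supported(capability, torch.cuda.get_arch_list())
```

With the values from the error message, `is_capability_supported((8, 0), ["sm_37", "sm_50", "sm_60", "sm_70"])` returns `False`, which matches the reported incompatibility.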
- The error "ERROR:root:label_map.pbtxt cannot be found. It will take a long time to open every annotation files to generate a tmp label_map.pbtxt." is displayed in the training logs.
- If you use an algorithm that you subscribed to from AI Gallery, make sure the data labels are accurate.
- If you use an object detection algorithm, make sure the label boxes of the data are rectangular.
Object detection algorithms support only rectangular label boxes.
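The rectangular-box requirement can be checked before training by scanning the annotation files. The sketch below assumes Pascal VOC style XML annotations, where a rectangular box is an `<object>` containing a `<bndbox>` with `xmin`, `ymin`, `xmax`, and `ymax`. That format is an assumption, not something this page states, and `find_non_rectangular_objects` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

CORNER_TAGS = ("xmin", "ymin", "xmax", "ymax")


def find_non_rectangular_objects(xml_text):
    """Return the names of objects that lack a complete <bndbox>.

    Assumes a Pascal VOC style annotation document passed as a string.
    """
    root = ET.fromstring(xml_text)
    bad = []
    for obj in root.iter("object"):
        box = obj.find("bndbox")
        complete = box is not None and all(
            box.find(tag) is not None for tag in CORNER_TAGS
        )
        if not complete:
            bad.append(obj.findtext("name", default="<unnamed>"))
    return bad
```

Any annotation file for which this returns a non-empty list contains labels that would need to be redrawn as rectangles.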
- The error "RuntimeError: The server socket has failed to listen on any local network address. The server socket has failed to bind to [::]:29500 (errno: 98 - Address already in use). The server socket has failed to bind to 0.0.0.0:29500 (errno: 98 - Address already in use)." is displayed in the training logs.
- Cause: The port that the training job tries to listen on (29500 in this log) is already in use, so the port number of the training job is not unique.
- Solution: Change the port number in the code and restart the training job.
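One way to change the port in code is to test whether the preferred port is free and fall back to one assigned by the operating system. The sketch below uses only the Python standard library. `choose_master_port` is a hypothetical helper, and exporting the result as `MASTER_PORT` assumes the training script initializes `torch.distributed` through environment variables. If the port is set elsewhere in your code, pass the returned value there instead.

```python
import os
import socket


def is_port_free(port, host="0.0.0.0"):
    """Return True if a TCP socket can currently bind to host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            return False
    return True


def find_free_port(host="0.0.0.0"):
    """Ask the operating system for an unused TCP port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind((host, 0))  # port 0 lets the OS pick a free port
        return sock.getsockname()[1]


def choose_master_port(preferred=29500):
    """Use the preferred port if it is free, otherwise pick another one."""
    port = preferred if is_port_free(preferred) else find_free_port()
    os.environ["MASTER_PORT"] = str(port)  # assumption: env-based init
    return port
```

The check and the later bind are not atomic, so another process can still claim the port in between. In a multi-node job every node must use the same port, so the value should be chosen once and passed to all nodes.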
- The error "WARNING: root: Retry=7, Wait=0.4, Timestamp=1697620658.6282516" is displayed in the training logs.
- Cause: The MoXing version is too old.
- Solution: Contact technical support engineers to upgrade MoXing to 2.1.6 or later.