Training Performance Deteriorated

Updated on 2024-06-11 GMT+08:00

Symptom

When a ModelArts algorithm is used for training, the training job takes longer than expected to complete.

Possible Causes

The possible causes are as follows:

  1. The job code or training parameters have been modified.
  2. The GPU hardware used for training is faulty.

Solution

  1. Check whether the training code and parameters have been modified.
  2. Check whether the allocated CPU, memory, GPU, Snt9, or InfiniBand resources match what the job expects.
  3. Use CloudShell to log in to the Linux server running the training job and check the GPU status (see the sketch after this list).
    • Run the nvidia-smi command to check whether the GPUs are working properly.
    • Run the nvidia-smi -q -d TEMPERATURE command to check the GPU temperature. If the temperature is too high, training performance deteriorates.
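
The following is a minimal sketch of how the temperature check in step 3 could be scripted after logging in via CloudShell. It assumes nvidia-smi is available on the PATH inside the training environment; the 85°C warning threshold is an illustrative value, not a ModelArts-documented limit.

    # check_gpu_temp.py - query per-GPU temperature via nvidia-smi (sketch)
    import subprocess

    def gpu_temperatures():
        """Return {gpu_index: temperature_celsius} reported by nvidia-smi."""
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=index,temperature.gpu",
             "--format=csv,noheader"],
            capture_output=True, text=True, check=True,
        ).stdout
        temps = {}
        for line in out.strip().splitlines():
            idx, temp = (field.strip() for field in line.split(","))
            temps[int(idx)] = int(temp)
        return temps

    if __name__ == "__main__":
        # 85 C is an assumed illustrative threshold, not an official limit.
        for idx, temp in gpu_temperatures().items():
            status = "WARN: high temperature" if temp >= 85 else "OK"
            print(f"GPU {idx}: {temp} C ({status})")

If any GPU reports an unusually high temperature or is missing from the nvidia-smi output, suspect a hardware fault and contact technical support.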