Training Performance Deteriorated

Updated on 2024-06-11 GMT+08:00

Symptom

When a ModelArts algorithm is used for training, the training job takes longer than expected to complete.

Possible Causes

The possible causes are as follows:

  1. The job code or training parameters have been modified.
  2. The GPU hardware used for training is faulty.

Solution

  1. Check whether the training code and parameters have been modified.
  2. Check whether the allocated CPU, memory, GPU, Snt9, or InfiniBand resources match what the job expects.
  3. Use CloudShell to log in to the Linux server running the training job and check the GPU status (see the sketch after this list).
    • Run the nvidia-smi command to check whether the GPUs are working properly.
    • Run the nvidia-smi -q -d TEMPERATURE command to check the GPU temperature. If the temperature is too high, training performance deteriorates.
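
The following is a minimal sketch of how the temperature check in step 3 could be scripted after logging in via CloudShell. It assumes nvidia-smi is available on the PATH inside the training environment; the 85°C warning threshold is an illustrative value, not a ModelArts-documented limit.

    # check_gpu_temp.py - query per-GPU temperature via nvidia-smi (sketch)
    import subprocess

    def gpu_temperatures():
        """Return {gpu_index: temperature_celsius} reported by nvidia-smi."""
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=index,temperature.gpu",
             "--format=csv,noheader"],
            capture_output=True, text=True, check=True,
        ).stdout
        temps = {}
        for line in out.strip().splitlines():
            idx, temp = (field.strip() for field in line.split(","))
            temps[int(idx)] = int(temp)
        return temps

    if __name__ == "__main__":
        # 85 C is an assumed illustrative threshold, not an official limit.
        for idx, temp in gpu_temperatures().items():
            status = "WARN: high temperature" if temp >= 85 else "OK"
            print(f"GPU {idx}: {temp} C ({status})")

If any GPU reports an unusually high temperature or is missing from the nvidia-smi output, suspect a hardware fault and contact technical support.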