Updated on 2023-11-10 GMT+08:00

Training Performance Deteriorated

Symptom

A training job that uses a ModelArts algorithm takes longer than expected to complete.

Possible Causes

The possible causes are as follows:

  1. The job code or training parameters have been modified.
  2. The GPU hardware used for training is malfunctioning.

Solution

  1. Check whether the training code and parameters have been modified.
  2. Check whether the CPU, memory, GPU, snt9, and InfiniBand resources allocated to the job match what you expect.
  3. Use CloudShell to log in to the Linux environment of the training job and check the GPU working status.
    • Run the nvidia-smi command to check whether the GPU is working properly.
    • Run the nvidia-smi -q -d TEMPERATURE command to check the GPU temperature. If the temperature is too high, training performance deteriorates.
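
The GPU checks in step 3 can also be scripted so that the readings are easy to compare across runs. The following is a minimal sketch, not an official ModelArts tool. It assumes that Python 3 and nvidia-smi are available in the CloudShell session. The function names (parse_gpu_csv, query_gpus, hot_gpus) and the 85 degree Celsius limit are hypothetical choices for illustration; use the temperature limits that apply to your GPU model.

```python
import subprocess

# Fields requested through nvidia-smi's --query-gpu option.
QUERY_FIELDS = "index,temperature.gpu,utilization.gpu"


def _to_int(value):
    """Return an int, or None when nvidia-smi reports a non-numeric value."""
    value = value.strip()
    return int(value) if value.isdigit() else None


def parse_gpu_csv(text):
    """Parse 'csv,noheader,nounits' output into one dict per GPU."""
    gpus = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        index, temp, util = line.split(",")
        gpus.append({
            "index": _to_int(index),
            "temp_c": _to_int(temp),
            "util_pct": _to_int(util),
        })
    return gpus


def query_gpus():
    """Run nvidia-smi once and return the parsed readings."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY_FIELDS}",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_gpu_csv(result.stdout)


def hot_gpus(gpus, limit_c=85):
    """Return indexes of GPUs at or above limit_c (an assumed threshold)."""
    return [g["index"] for g in gpus
            if g["temp_c"] is not None and g["temp_c"] >= limit_c]
```

Calling hot_gpus(query_gpus()) in the session returns the indexes of the GPUs that exceed the chosen limit. A GPU that shows low utilization while the job is running can also indicate that the slowdown comes from data loading or the CPU rather than from the GPU itself.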