Detecting Training Job Suspension

Updated on 2024-12-26 GMT+08:00

Overview

A training job may be suspended for unknown reasons. If the suspension is not detected promptly, resources are not released, which wastes them. To reduce resource costs and improve user experience, ModelArts provides suspension detection for training jobs. With this function, suspension is automatically detected and displayed on the log details page. You can also enable event notification to be promptly informed of job suspension.

Detection Rules

The system determines whether a job is suspended based on the monitored job process status and resource usage. A dedicated process periodically monitors changes in these two metrics.

  • Job process status: If the process I/O of a training job changes, the next detection period starts. If the process I/O remains unchanged across multiple detection periods, resource usage detection starts.
  • Resource usage: If the process I/O remains unchanged, the system collects the GPU or NPU usage over a period of time and uses the variance and median of the usage within that period to determine whether resource usage has changed. If the usage has not changed, the job is considered suspended.
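The resource-usage check above can be illustrated with a small sketch. The window length and threshold values below are illustrative assumptions, not the actual ModelArts internals:

```python
import statistics

def usage_looks_suspended(gpu_usage_samples, var_threshold=1.0, median_threshold=5.0):
    """Illustrative check: flat, near-zero GPU usage over a window
    suggests suspension. Thresholds here are hypothetical."""
    variance = statistics.pvariance(gpu_usage_samples)
    median = statistics.median(gpu_usage_samples)
    # Usage that barely varies and stays low is treated as "unchanged".
    return variance < var_threshold and median < median_threshold

# A busy GPU fluctuates; a suspended job's usage stays flat and low.
print(usage_looks_suspended([85, 92, 78, 88, 95]))  # False
print(usage_looks_suspended([0, 0, 1, 0, 0]))       # True
```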

The environment variable MA_HANG_DETECT_TIME is set to 30 by default, which means a job is considered suspended if its process I/O does not change for 30 minutes. To adjust this, update the value of the MA_HANG_DETECT_TIME variable. For details, see Managing Environment Variables of a Training Container.
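If you want to confirm the effective detection window inside the training container, the value can be read like any other environment variable. This is a minimal sketch, assuming the variable is visible to your training script:

```python
import os

# MA_HANG_DETECT_TIME is the suspension-detection window in minutes;
# ModelArts sets it to 30 by default.
hang_detect_minutes = int(os.environ.get("MA_HANG_DETECT_TIME", "30"))
print(f"Job is flagged as suspended after {hang_detect_minutes} minutes of unchanged I/O")
```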

CAUTION:
  • Due to the limitations of the detection rules, suspension detection may produce false positives. If the reported suspension is caused by the job code itself (for example, a long sleep), ignore the report.

Constraints

Suspension can be detected only for training jobs that run on GPUs or NPUs.

Procedure

Suspension detection is automatically performed during job running. No additional configuration is required. After detecting that a job is suspended, the system displays a message on the training job details page, indicating that the job may be suspended. If you want to be notified of suspension (by SMS or email), enable event notification on the job creation page.

Case: Data Replication Suspension

Symptoms

The system stopped responding when mox.file.copy_parallel was called to copy data.

Solution

  • Before copying files or folders, run the following commands:
    import moxing as mox
    mox.file.set_auth(is_secure=False)
  • To copy a single file larger than 5 GB, run the following commands:
    from moxing.framework.file import file_io

    Check file_io._LARGE_FILE_METHOD to determine the MoXing API version: output 1 indicates V1 and 2 indicates V2.

    For the V1 API, set file_io._NUMBER_OF_PROCESSES=1 to resolve the issue.

    For the V2 API, either set file_io._LARGE_FILE_METHOD = 1 to switch to V1 and perform the V1 operations described above, or set file_io._LARGE_FILE_TASK_NUM=1 to resolve the issue.

  • To copy a folder, run the following command:
    mox.file.copy_parallel(threads=0, is_processing=False)

Case: Suspension Before Training

If a job was trained on multiple nodes and suspension occurred before the job started, add os.environ["NCCL_DEBUG"] = "INFO" to the code to view the NCCL debugging information.

  • Symptom 1

    The job was suspended before the NCCL debugging information was printed in logs.

    Solution 1

    Check the code for parameters such as master_ip and rank. Ensure that these parameters are specified.

  • Symptom 2
    According to the distributed training logs, some nodes contain GDR information, but some nodes do not. The suspension may be caused by GDR.
    # Logs of node A
    modelarts-job-a7305e27-d1cf-4c71-ae6e-a12da6761d5a-worker-1:1136:1191 [2] NCCL INFO Channel 00 : 3[5f000] -> 10[5b000] [receive] via NET/IB/0/GDRDMA
    modelarts-job-a7305e27-d1cf-4c71-ae6e-a12da6761d5a-worker-1:1140:1196 [6] NCCL INFO Channel 00 : 14[e1000] -> 15[e9000] via P2P/IPC
    modelarts-job-a7305e27-d1cf-4c71-ae6e-a12da6761d5a-worker-1:1141:1187 [7] NCCL INFO Channel 00 : 15[e9000] -> 11[5f000] via P2P/IPC
    modelarts-job-a7305e27-d1cf-4c71-ae6e-a12da6761d5a-worker-1:1138:1189 [4] NCCL INFO Channel 00 : 12[b5000] -> 14[e1000] via P2P/IPC
    modelarts-job-a7305e27-d1cf-4c71-ae6e-a12da6761d5a-worker-1:1137:1197 [3] NCCL INFO Channel 00 : 11[5f000] -> 16[2d000] [send] via NET/IB/0/GDRDMA
    
    # Logs of node B
    modelarts-job-a7305e27-d1cf-4c71-ae6e-a12da6761d5a-worker-2:1139:1198 [2] NCCL INFO Channel 00 : 18[5b000] -> 19[5f000] via P2P/IPC
    modelarts-job-a7305e27-d1cf-4c71-ae6e-a12da6761d5a-worker-2:1144:1200 [7] NCCL INFO Channel 00 : 23[e9000] -> 20[b5000] via P2P/IPC
    modelarts-job-a7305e27-d1cf-4c71-ae6e-a12da6761d5a-worker-2:1142:1196 [5] NCCL INFO Channel 00 : 21[be000] -> 17[32000] via P2P/IPC
    modelarts-job-a7305e27-d1cf-4c71-ae6e-a12da6761d5a-worker-2:1143:1194 [6] NCCL INFO Channel 00 : 22[e1000] -> 21[be000] via P2P/IPC
    modelarts-job-a7305e27-d1cf-4c71-ae6e-a12da6761d5a-worker-2:1141:1191 [4] NCCL INFO Channel 00 : 20[b5000] -> 22[e1000] via P2P/IPC

    Solution 2

    Set os.environ["NCCL_NET_GDR_LEVEL"] = '0' at the beginning of the program to disable GDR, or ask the O&M personnel to add the GDR information to the affected nodes.
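    To spot this asymmetry without reading the logs by eye, you can scan each node's NCCL INFO lines for GDRDMA. The helper below is a hypothetical convenience for illustration, not part of ModelArts or NCCL:

```python
def nodes_missing_gdr(log_lines_by_node):
    """Return node names whose NCCL INFO lines never mention GDRDMA.

    log_lines_by_node: dict mapping a node name to its list of log lines.
    """
    return [
        node
        for node, lines in log_lines_by_node.items()
        if not any("GDRDMA" in line for line in lines)
    ]

logs = {
    "worker-1": ["... [receive] via NET/IB/0/GDRDMA", "... via P2P/IPC"],
    "worker-2": ["... via P2P/IPC", "... via P2P/IPC"],
}
print(nodes_missing_gdr(logs))  # ['worker-2']
```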

  • Symptom 3

    Communication errors such as "Got completion with error 12, opcode 1, len 32478, vendor err 129" were displayed, indicating that the network was unstable.

    Solution 3

    Add the following environment variables:

    • NCCL_IB_GID_INDEX=3: enables RoCEv2. RoCEv1 is enabled by default, but it does not support congestion control on switches, which may lead to packet loss. In addition, later-version switches do not support RoCEv1, causing RoCEv1 to fail.
    • NCCL_IB_TC=128: sends data packets through queue 4 of switches, which is RoCE-compliant.
    • NCCL_IB_TIMEOUT=22: sets a longer timeout interval. An unstable network typically causes an interruption of about 5s before the timeout message is returned. Setting the timeout to 22 means the timeout message is returned after about 20s (4.096 µs x 2^timeout).
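    These variables can be set in the job code before the communication backend initializes. The sketch below also works out the timeout arithmetic quoted above; the exact placement in your script is up to you:

```python
import os

# Set before initializing NCCL (e.g., before torch.distributed.init_process_group).
os.environ["NCCL_IB_GID_INDEX"] = "3"   # use RoCEv2 instead of the default RoCEv1
os.environ["NCCL_IB_TC"] = "128"        # route packets through RoCE-compliant queue 4
os.environ["NCCL_IB_TIMEOUT"] = "22"    # extend the InfiniBand timeout

# NCCL's IB timeout is 4.096 µs * 2^NCCL_IB_TIMEOUT.
timeout_s = 4.096e-6 * 2 ** int(os.environ["NCCL_IB_TIMEOUT"])
print(f"Effective IB timeout: {timeout_s:.1f} s")  # about 17 s, i.e. roughly 20 s
```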

Case: Suspension During Training

  • Symptom 1

    According to the logs of the nodes on which the training job ran, an error occurred on one node but the job did not exit, causing the job to hang.

    Solution 1

    Check the error cause and rectify the fault.

  • Symptom 2

    The job was stuck in sync-batch-norm, or the training speed slowed down. If sync-batch-norm is enabled for PyTorch, training slows down because all node data must be synchronized on every batch normalization layer in each iteration, which generates heavy communication traffic.

    Solution 2

    Disable sync-batch-norm, or upgrade the PyTorch version to 1.10.

  • Symptom 3
    The job is stuck in TensorBoard, and an error is reported on the following line:
    writer = SummaryWriter('./path/to/log')

    Solution 3

    Set a local path for storage, for example, cache/tensorboard. Do not store data in OBS.

  • Symptom 4

    When PyTorch DataLoader is used to read data, the job gets stuck during data reading and logs stop updating.

    Solution 4

    When using the DataLoader to read data, reduce the value of num_workers.

Case: Suspension in the Last Training Epoch

Symptoms

Logs showed that an error occurred when splitting data. As a result, processes were in different epochs, and the processes that had not finished were suspended because they received no response from the other processes. As shown in the following logs, some processes were in epoch 48 while others were in epoch 49 at the same time.

loss exit lane:0.12314446270465851
step loss is 0.29470521211624146
[2022-04-26 13:57:20,757][INFO][train_epoch]:Rank:2 Epoch:[48][20384/all] Data Time 0.000(0.000) Net Time 0.705(0.890) Loss 0.3403(0.3792)LR 0.00021887
[2022-04-26 13:57:20,757][INFO][train_epoch]:Rank:1 Epoch:[48][20384/all] Data Time 0.000(0.000) Net Time 0.705(0.891) Loss 0.3028(0.3466) LR 0.00021887
[2022-04-26 13:57:20,757][INFO][train_epoch]:Rank:4 Epoch:[49][20384/all] Data Time 0.000(0.147) Net Time 0.705(0.709) Loss 0.3364(0.3414)LR 0.00021887
[2022-04-26 13:57:20,758][INFO][train_epoch]:Rank:3 Epoch:[49][20384/all] Data Time 0.000 (0.115) Net Time 0.706(0.814) Loss 0.3345(0.3418) LR 0.00021887
[2022-04-26 13:57:20,758][INFO][train_epoch]:Rank:0 Epoch:[49][20384/all] Data Time 0.000(0.006) Net Time 0.704(0.885) Loss 0.2947(0.3566) LR 0.00021887
[2022-04-26 13:57:20,758][INFO][train_epoch]:Rank:7 Epoch:[49][20384/all] Data Time 0.001 (0.000) Net Time 0.706 (0.891) Loss 0.3782(0.3614) LR 0.00021887
[2022-04-26 13:57:20,759][INFO][train_epoch]:Rank:5 Epoch:[48][20384/all] Data Time 0.000(0.000) Net Time 0.706(0.891) Loss 0.5471(0.3642) LR 0.00021887
[2022-04-26 13:57:20,763][INFO][train_epoch]:Rank:6 Epoch:[49][20384/all] Data Time 0.000(0.000) Net Time 0.704(0.891) Loss 0.2643(0.3390)LR 0.00021887
stage 1 loss 0.4600560665130615 mul_cls_loss loss:0.01245919056236744 mul_offset_loss 0.44759687781333923 origin stage2_loss 0.048592399805784225
stage 1 loss:0.4600560665130615 stage 2 loss:0.048592399805784225 loss exit lane:0.10233864188194275

Solution

Split tensors to align data.
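One common way to keep every rank on the same epoch is to pad the sample indices so each rank receives the same number of samples (this is essentially what PyTorch's DistributedSampler does when drop_last=False). The sketch below shows only the padding arithmetic, as an illustration:

```python
import math

def pad_indices_for_ranks(num_samples, world_size):
    """Pad sample indices so every rank gets the same count,
    preventing ranks from finishing an epoch at different times."""
    per_rank = math.ceil(num_samples / world_size)
    total = per_rank * world_size
    indices = list(range(num_samples))
    indices += indices[: total - num_samples]  # reuse leading samples as padding
    # Rank r takes every world_size-th index starting at offset r.
    return [indices[r::world_size] for r in range(world_size)]

shards = pad_indices_for_ranks(10, 4)  # 10 samples across 4 ranks
print([len(s) for s in shards])  # [3, 3, 3, 3] -- every rank gets 3 samples
```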
