Updated on 2024-12-26 GMT+08:00

Detecting Training Job Suspension

Overview

A training job may be suspended for unknown reasons. If the suspension is not detected promptly, resources cannot be released, resulting in waste. To minimize resource costs and improve user experience, ModelArts provides suspension detection for training jobs. With this function, suspension is automatically detected and displayed on the log details page. You can also enable notification so that you are promptly notified of job suspension.

Detection Rules

ModelArts determines whether a job is suspended based on the monitored job process status and resource usage. A monitoring process is started to periodically check changes in these two metrics.

  • Job process status: If the process I/O of a training job changes, the next detection period starts. If the process I/O remains unchanged across multiple detection periods, resource usage detection starts.
  • Resource usage: If the process I/O remains unchanged, the system collects the GPU or NPU usage over a certain period and uses the variance and median of that usage to determine whether the resource usage has changed. If the GPU or NPU usage does not change, the job is considered suspended.

The environment variable MA_HANG_DETECT_TIME is set to 30 by default, which means a job is considered suspended if its process I/O does not change for 30 minutes. To adjust this, update the value of the MA_HANG_DETECT_TIME variable. For details, see Managing Environment Variables of a Training Container.

  • Due to the limitations of the detection rules, suspension detection may occasionally produce false positives. If the apparent suspension is caused by the job code itself (for example, a long sleep), ignore the report.
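The following is a conceptual sketch of the usage-based check described above; it is for illustration only and is not the actual ModelArts implementation. The sampling routine and thresholds are hypothetical.

    import statistics

    # Hypothetical illustration of a usage-based suspension check.
    # samples: GPU or NPU utilization values (percent) collected over the
    # detection window, for example one value per minute for
    # MA_HANG_DETECT_TIME minutes.
    def looks_suspended(samples, var_threshold=1.0, median_threshold=5.0):
        if len(samples) < 2:
            return False
        variance = statistics.pvariance(samples)  # near zero: usage is flat
        median = statistics.median(samples)       # low median: little real work
        return variance < var_threshold and median < median_threshold

    # Example: utilization stuck at 0-1% for the whole window is flagged.
    print(looks_suspended([0.0, 1.0, 0.0, 0.0, 1.0]))  # True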

Constraints

Suspension can be detected only for training jobs that run on GPUs or NPUs.

Procedure

Suspension detection is automatically performed during job running. No additional configuration is required. After detecting that a job is suspended, the system displays a message on the training job details page, indicating that the job may be suspended. If you want to be notified of suspension (by SMS or email), enable event notification on the job creation page.

Case: Data Replication Suspension

Symptoms

The system stopped responding when mox.file.copy_parallel was called to copy data.

Solution

  • Run the following commands to copy files or folders:
    import moxing as mox
    mox.file.set_auth(is_secure=False)
  • Run the following command to copy a single file larger than 5 GB:
    from moxing.framework.file import file_io

    Check file_io._LARGE_FILE_METHOD to determine the MoXing API version. Output value 1 indicates V1 and 2 indicates V2.

    For the V1 API, set file_io._NUMBER_OF_PROCESSES = 1 to resolve the issue.

    For the V2 API, set file_io._LARGE_FILE_METHOD = 1 to switch to V1 and perform the operations required for V1. Alternatively, set file_io._LARGE_FILE_TASK_NUM = 1.

  • To copy a folder, set threads=0 and is_processing=False in the call (a combined sketch of these workarounds follows this list):
    mox.file.copy_parallel(threads=0, is_processing=False)
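The following is a minimal sketch that combines the workarounds above, assuming the MoXing SDK is available in the training image. The OBS and local paths are placeholders, and whether the V1 or V2 branch applies depends on the value of file_io._LARGE_FILE_METHOD in your environment.

    import moxing as mox
    from moxing.framework.file import file_io

    # Apply the set_auth workaround described above.
    mox.file.set_auth(is_secure=False)

    # Workaround for single files larger than 5 GB: limit copy parallelism.
    if file_io._LARGE_FILE_METHOD == 1:       # V1 API
        file_io._NUMBER_OF_PROCESSES = 1
    else:                                     # V2 API
        file_io._LARGE_FILE_TASK_NUM = 1      # or set file_io._LARGE_FILE_METHOD = 1 to switch to V1

    # Workaround for copying a folder: threads=0 and is_processing=False.
    # The OBS and local paths below are placeholders.
    mox.file.copy_parallel('obs://bucket-name/data/', '/cache/data/',
                           threads=0, is_processing=False)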

Case: Suspension Before Training

If a job was trained on multiple nodes and suspension occurred before the job started, add os.environ["NCCL_DEBUG"] = "INFO" to the code to view the NCCL debugging information.
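For example, the switch can be set before the process group is initialized. The following is a minimal sketch assuming a PyTorch job that uses torch.distributed; the initialization call is a placeholder that relies on your own launch configuration (MASTER_ADDR, RANK, WORLD_SIZE, and so on).

    import os

    # Enable NCCL debugging output before any NCCL communication is set up.
    os.environ["NCCL_DEBUG"] = "INFO"

    import torch.distributed as dist

    # Placeholder initialization; rank, world size, and master address come
    # from the job launch configuration.
    dist.init_process_group(backend="nccl")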

  • Symptom 1

    The job was suspended before the NCCL debugging information was printed in logs.

    Solution 1

    Check the code for parameters such as master_ip and rank. Ensure that these parameters are specified.

  • Symptom 2
    The distributed training logs show that some nodes contain GDR (GPUDirect RDMA) information while others do not. The suspension may be caused by GDR.
    # Logs of node A
    modelarts-job-a7305e27-d1cf-4c71-ae6e-a12da6761d5a-worker-1:1136:1191 [2] NCCL INFO Channel 00 : 3[5f000] -> 10[5b000] [receive] via NET/IB/0/GDRDMA
    modelarts-job-a7305e27-d1cf-4c71-ae6e-a12da6761d5a-worker-1:1140:1196 [6] NCCL INFO Channel 00 : 14[e1000] -> 15[e9000] via P2P/IPC
    modelarts-job-a7305e27-d1cf-4c71-ae6e-a12da6761d5a-worker-1:1141:1187 [7] NCCL INFO Channel 00 : 15[e9000] -> 11[5f000] via P2P/IPC
    modelarts-job-a7305e27-d1cf-4c71-ae6e-a12da6761d5a-worker-1:1138:1189 [4] NCCL INFO Channel 00 : 12[b5000] -> 14[e1000] via P2P/IPC
    modelarts-job-a7305e27-d1cf-4c71-ae6e-a12da6761d5a-worker-1:1137:1197 [3] NCCL INFO Channel 00 : 11[5f000] -> 16[2d000] [send] via NET/IB/0/GDRDMA
    
    # Logs of node B
    modelarts-job-a7305e27-d1cf-4c71-ae6e-a12da6761d5a-worker-2:1139:1198 [2] NCCL INFO Channel 00 : 18[5b000] -> 19[5f000] via P2P/IPC
    modelarts-job-a7305e27-d1cf-4c71-ae6e-a12da6761d5a-worker-2:1144:1200 [7] NCCL INFO Channel 00 : 23[e9000] -> 20[b5000] via P2P/IPC
    modelarts-job-a7305e27-d1cf-4c71-ae6e-a12da6761d5a-worker-2:1142:1196 [5] NCCL INFO Channel 00 : 21[be000] -> 17[32000] via P2P/IPC
    modelarts-job-a7305e27-d1cf-4c71-ae6e-a12da6761d5a-worker-2:1143:1194 [6] NCCL INFO Channel 00 : 22[e1000] -> 21[be000] via P2P/IPC
    modelarts-job-a7305e27-d1cf-4c71-ae6e-a12da6761d5a-worker-2:1141:1191 [4] NCCL INFO Channel 00 : 20[b5000] -> 22[e1000] via P2P/IPC

    Solution 2

    Set os.environ["NCCL_NET_GDR_LEVEL"] = '0' at the beginning of the program to disable GDR or ask the O&M personnel to add the GDR information to the affected nodes.

  • Symptom 3

    Communication information such as "Got completion with error 12, opcode 1, len 32478, vendor err 129" was displayed, indicating that the network was unstable.

    Solution 3

    Add the following environment variables (a sketch of applying them follows this list):

    • NCCL_IB_GID_INDEX=3: enables RoCEv2. RoCEv1 is enabled by default, but it does not support congestion control on switches, which may lead to packet loss. In addition, later-version switches do not support RoCEv1, so RoCEv1 may fail.
    • NCCL_IB_TC=128: enables data packets to be transmitted through queue 4 of the switches, which is RoCE-compliant.
    • NCCL_IB_TIMEOUT=22: extends the timeout interval. If the network is unstable, there is usually an interruption of about 5s before the timeout message is returned. Setting the value to 22 extends the timeout to about 20s (4.096 µs x 2^timeout).
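A minimal sketch of applying these settings in the training code is shown below; the variables must be set before NCCL is initialized (for example, before torch.distributed.init_process_group is called), and they can equally be configured as environment variables on the job creation page. The NCCL_NET_GDR_LEVEL line from Solution 2 is included as an optional, commented-out example.

    import os

    # Set the NCCL tuning variables before NCCL is initialized.
    os.environ["NCCL_IB_GID_INDEX"] = "3"     # use RoCEv2
    os.environ["NCCL_IB_TC"] = "128"          # send packets through switch queue 4
    os.environ["NCCL_IB_TIMEOUT"] = "22"      # extend the timeout to about 20s
    # os.environ["NCCL_NET_GDR_LEVEL"] = "0"  # optional: disable GDR (Solution 2)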

Case: Suspension During Training

  • Symptom 1

    According to the logs of the nodes on which a training job ran, an error occurred on a node but the job did not exit, leading to the job suspension.

    Solution 1

    Check the error cause and rectify the fault.

  • Symptom 2

    The job was stuck in sync-batch-norm, or training slowed down. If sync-batch-norm is enabled in PyTorch, training slows down because the data of all nodes must be synchronized on every batch normalization layer in each iteration, which generates heavy communication traffic.

    Solution 2

    Disable sync-batch-norm, or upgrade the PyTorch version to 1.10.

  • Symptom 3

    The job was stuck when the TensorBoard summary writer was created, as shown in the following code:
    writer = SummaryWriter('./path/to/log')

    Solution 3

    Set a local path for storage, for example, cache/tensorboard. Do not store the data in OBS (see the sketch after this list).

  • Symptom 4

    When the PyTorch DataLoader was used to read data, the job got stuck during data reading and the logs stopped updating.

    Solution 4

    When the DataLoader is used to read data, reduce the value of num_workers (see the sketch after this list).
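The following minimal sketch combines the workarounds for Symptom 3 and Symptom 4. The log path, dataset, batch size, and worker count are placeholder values.

    import torch
    from torch.utils.data import DataLoader, TensorDataset
    from torch.utils.tensorboard import SummaryWriter

    # Symptom 3: write TensorBoard summaries to a local path (for example,
    # under /cache), not to an OBS path.
    writer = SummaryWriter('/cache/tensorboard')

    # Symptom 4: reduce num_workers if data loading hangs. The dataset below
    # is a placeholder for your own dataset.
    dataset = TensorDataset(torch.randn(128, 3), torch.randint(0, 2, (128,)))
    loader = DataLoader(dataset, batch_size=32, shuffle=True, num_workers=2)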

Case: Suspension in the Last Training Epoch

Symptoms

Logs showed that an error occurred when the data was split. As a result, the processes were in different epochs, and the processes that had not finished were suspended because they received no response from the other processes. As shown in the following logs, some processes were in epoch 48 while others were already in epoch 49 at the same time.

loss exit lane:0.12314446270465851
step loss is 0.29470521211624146
[2022-04-26 13:57:20,757][INFO][train_epoch]:Rank:2 Epoch:[48][20384/all] Data Time 0.000(0.000) Net Time 0.705(0.890) Loss 0.3403(0.3792)LR 0.00021887
[2022-04-26 13:57:20,757][INFO][train_epoch]:Rank:1 Epoch:[48][20384/all] Data Time 0.000(0.000) Net Time 0.705(0.891) Loss 0.3028(0.3466) LR 0.00021887
[2022-04-26 13:57:20,757][INFO][train_epoch]:Rank:4 Epoch:[49][20384/all] Data Time 0.000(0.147) Net Time 0.705(0.709) Loss 0.3364(0.3414)LR 0.00021887
[2022-04-26 13:57:20,758][INFO][train_epoch]:Rank:3 Epoch:[49][20384/all] Data Time 0.000 (0.115) Net Time 0.706(0.814) Loss 0.3345(0.3418) LR 0.00021887
[2022-04-26 13:57:20,758][INFO][train_epoch]:Rank:0 Epoch:[49][20384/all] Data Time 0.000(0.006) Net Time 0.704(0.885) Loss 0.2947(0.3566) LR 0.00021887
[2022-04-26 13:57:20,758][INFO][train_epoch]:Rank:7 Epoch:[49][20384/all] Data Time 0.001 (0.000) Net Time 0.706 (0.891) Loss 0.3782(0.3614) LR 0.00021887
[2022-04-26 13:57:20,759][INFO][train_epoch]:Rank:5 Epoch:[48][20384/all] Data Time 0.000(0.000) Net Time 0.706(0.891) Loss 0.5471(0.3642) LR 0.00021887
[2022-04-26 13:57:20,763][INFO][train_epoch]:Rank:6 Epoch:[49][20384/all] Data Time 0.000(0.000) Net Time 0.704(0.891) Loss 0.2643(0.3390)LR 0.00021887
stage 1 loss 0.4600560665130615 mul_cls_loss loss:0.01245919056236744 mul_offset_loss 0.44759687781333923 origin stage2_loss 0.048592399805784225
stage 1 loss:0.4600560665130615 stage 2 loss:0.048592399805784225 loss exit lane:0.10233864188194275

Solution

Split the tensors so that the data is aligned across processes and all ranks run the same number of iterations in each epoch.
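One way to keep the processes aligned is sketched below: it uses PyTorch's DistributedSampler with drop_last=True so that every rank processes the same number of batches per epoch. This is an illustrative workaround rather than the only possible fix, and the dataset and batch size are placeholders.

    import torch
    from torch.utils.data import DataLoader, TensorDataset
    from torch.utils.data.distributed import DistributedSampler

    # Placeholder dataset; replace it with your own.
    dataset = TensorDataset(torch.randn(1000, 3), torch.randint(0, 2, (1000,)))

    # Assumes torch.distributed has already been initialized in the job.
    # drop_last=True gives every rank the same number of samples, so all
    # ranks finish each epoch after the same number of iterations.
    sampler = DistributedSampler(dataset, drop_last=True)
    loader = DataLoader(dataset, batch_size=32, sampler=sampler, drop_last=True)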