Training Container Lifecycle

Overview

When you run a training job on ModelArts Standard, the platform encapsulates each sub-task as a Kubernetes pod and continuously monitors the pod's health. If a node fails, the platform automatically completes the closed loop of "failure eviction, node replenishment, and job resumption", so the training job continues without manual intervention.

Job and Container Structure

A training job you create is referred to as a job. For distributed jobs, multiple tasks are generated based on the number of instances you select. Each task runs as a pod on a compute node.

The relationship between them is shown in the following diagram.

Figure 1 Job and container structure

Fault Recovery

A training job can fail not only because of logic errors in the job itself but also because of faults in the platform's compute resources. The training platform continuously monitors the resources that support the job. When it detects a software or hardware issue that affects execution (for example, an accelerator card failure), it pauses the training job and reassigns it to an available compute node.

Under the job rescheduling policy, when a job fails, the platform shuts down all pods that belong to the job and completely rebuilds the job instance. The process is illustrated in the following diagram.

Figure 2 Job rescheduling process example

If your training job already supports resumable training, you can use the pod rescheduling policy to speed up recovery.

Compared to job rescheduling, which directly deletes and recreates the entire Job, pod rescheduling retains the existing Job instance and only recreates the faulty pod.

Figure 3 Pod rescheduling process example
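
Pod rescheduling relies on the training script being able to continue from where it stopped. The following is a minimal sketch of that idea, not ModelArts code: the function names (load_checkpoint, save_checkpoint, train), the JSON checkpoint format, and the fail_at parameter used to simulate a fault are all hypothetical. A real job would normally save model and optimizer state with its framework's own checkpoint API.

```python
import json
import os
import tempfile


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists."""
    if os.path.exists(path):
        with open(path, "r", encoding="utf-8") as f:
            return json.load(f)
    return {"step": 0, "loss": None}


def save_checkpoint(path, state):
    """Write to a temp file, then rename, so a crash cannot leave a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        json.dump(state, f)
    os.replace(tmp_path, path)


def train(path, total_steps, save_every=10, fail_at=None):
    """Toy training loop that resumes from the last saved step."""
    state = load_checkpoint(path)
    for step in range(state["step"], total_steps):
        if fail_at is not None and step == fail_at:
            # Hypothetical stand-in for a process being terminated by a fault.
            raise RuntimeError("simulated fault")
        state = {"step": step + 1, "loss": 1.0 / (step + 1)}
        if state["step"] % save_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state
```

If the process is terminated at step 25 with save_every=10, the next start resumes from step 20 and repeats only the steps since the last checkpoint, which is why a shorter save interval reduces the work lost per restart.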

User Job Perspective

From the perspective of your training process, each task goes through different states during rescheduling. In the example shown below, the job first undergoes job rescheduling: all three task pods are stopped and then rescheduled. Later, during pod rescheduling, your processes in task 0 and task 1 are terminated, but their container environments are retained. Once the pod of task 2 has been rescheduled, all three tasks start running again, and your processes in task 0 and task 1 are restarted at that point.

Figure 4 Association of job rescheduling and pod rescheduling

ModelArts injects environment variables into your process to mark the lifecycle stages the job has gone through. Among these, MA_SCHEDULE_CNT represents the number of times the pod of the task has been scheduled, and MA_PROC_START_CNT represents the number of times your process has been started.
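
A training script can read these two variables at startup, for example to log which lifecycle stage it is in. The sketch below is illustrative only: the helper names (read_lifecycle_counters, is_restart) are hypothetical, and the assumption that both variables hold the value 1 on the very first start should be verified on your platform before you rely on it.

```python
import os

COUNTER_NAMES = ("MA_SCHEDULE_CNT", "MA_PROC_START_CNT")


def read_lifecycle_counters(environ=None):
    """Return both counters as ints; None if a variable is unset or not an integer."""
    env = os.environ if environ is None else environ
    counters = {}
    for name in COUNTER_NAMES:
        raw = env.get(name)
        try:
            counters[name] = int(raw) if raw is not None else None
        except ValueError:
            counters[name] = None
    return counters


def is_restart(counters, first_value=1):
    """Return True if any known counter is above its first-start value.

    first_value=1 is an assumption, not documented behavior.
    """
    return any(v is not None and v > first_value for v in counters.values())


if __name__ == "__main__":
    counters = read_lifecycle_counters()
    print("lifecycle counters:", counters)
    if is_restart(counters):
        print("restart detected; look for an existing checkpoint")
```

When the variables are absent, for example in a local run, every counter is None and is_restart returns False, so the same script still works outside the platform.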

For example, in Figure 4, at the point after pod rescheduling, the pod of task 0 has been started twice, and your process has been started twice since the most recent pod start. The pod of task 2 has been started three times, and your process has been started once since the most recent pod start.
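
The counts in this example can be reproduced with a small simulation. This is a toy model under stated assumptions, not the platform's implementation: the Task class and the three event functions are hypothetical, and the rule that the process counter resets whenever a pod is scheduled is inferred from the example above.

```python
from dataclasses import dataclass


@dataclass
class Task:
    schedule_cnt: int = 0    # times the pod of this task has been scheduled
    proc_start_cnt: int = 0  # process starts since the most recent pod start

    def schedule_pod(self):
        # A newly scheduled pod starts with a fresh process counter (assumption).
        self.schedule_cnt += 1
        self.proc_start_cnt = 0

    def start_process(self):
        self.proc_start_cnt += 1


def initial_start(tasks):
    """First launch: every pod is scheduled and its process is started."""
    for task in tasks:
        task.schedule_pod()
        task.start_process()


def job_reschedule(tasks):
    """Job rescheduling: all pods are stopped and rebuilt."""
    for task in tasks:
        task.schedule_pod()
        task.start_process()


def pod_reschedule(tasks, faulty):
    """Pod rescheduling: only faulty pods are rebuilt; other tasks keep
    their container and have only their process restarted."""
    for index, task in enumerate(tasks):
        if index in faulty:
            task.schedule_pod()
        task.start_process()


tasks = [Task(), Task(), Task()]
initial_start(tasks)
job_reschedule(tasks)
pod_reschedule(tasks, faulty={2})
for index, task in enumerate(tasks):
    print(index, task.schedule_cnt, task.proc_start_cnt)
```

In this model, task 0 and task 1 end with a pod count of 2 and a process count of 2, while task 2 ends with a pod count of 3 and a process count of 1, matching the description above.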
