
Running a Job Failed Due to Persistently Rising Memory Usage

Symptom

A training job is in the Failed state.

Possible Causes

The memory usage of the training job keeps rising until the available memory is exhausted, causing the job to fail.

Solution

  1. Check the training job logs and monitoring data for out-of-memory (OOM) errors (see the log-scanning example after this list).
    • If yes, go to 2.
    • If there are no OOM errors but the monitoring metrics show anomalies, go to 3.
  2. Check whether any code in the training script keeps holding memory without releasing it, for example by accumulating tensors or cached data across iterations (see the memory-leak example after this list).
    • If yes, optimize the code, then rerun the job and confirm that it completes normally.
    • If no, either upgrade the resource specifications allocated to the training job or contact technical support.
  3. Restart the training job, then use CloudShell to log in to the training container and watch the memory metrics for spikes (see the memory-sampling example after this list).
    • If yes, check the logs generated around the time of each spike and rework the related code logic to reduce memory consumption.
    • If no, either upgrade the resource specifications allocated to the training job or contact technical support.
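
For the log check in step 1, the minimal sketch below scans a downloaded training log for common out-of-memory indicators. The log path and the pattern list are assumptions; adjust them to where your job logs are stored and to the framework you use.

```python
import re

# Hypothetical path to a locally saved training log; adjust as needed.
LOG_PATH = "train.log"

# Common OOM indicators from Python, PyTorch, and the Linux OOM killer;
# extend the list for your framework.
OOM_PATTERNS = [
    r"CUDA out of memory",
    r"MemoryError",
    r"Out of memory",
    r"oom-kill",
    r"Killed",
]

pattern = re.compile("|".join(OOM_PATTERNS), re.IGNORECASE)

with open(LOG_PATH, encoding="utf-8", errors="replace") as f:
    for line_no, line in enumerate(f, start=1):
        if pattern.search(line):
            # Print each matching line with its line number for review.
            print(f"{line_no}: {line.rstrip()}")
```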
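
A frequent cause of the behavior described in step 2, assuming a PyTorch training script, is accumulating loss tensors and their autograd history across iterations instead of plain Python numbers, which makes memory grow with every batch. A minimal sketch of the fix:

```python
import torch

def train_epoch(model, loader, optimizer, criterion):
    """Sketch of a training loop that avoids a common memory leak."""
    running_loss = 0.0
    for inputs, targets in loader:
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, targets)
        loss.backward()
        optimizer.step()

        # Leak pattern: `running_loss += loss` keeps the computation
        # history of every batch alive, so memory rises each iteration.
        # Accumulate a detached Python float instead:
        running_loss += loss.item()
    return running_loss / len(loader)
```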
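
For the in-container check in step 3, the following sketch samples the container's memory usage from the cgroup accounting files and prints it at a fixed interval; run it from the CloudShell session alongside the training process. The cgroup file paths and the 10-second interval are assumptions and may need adjusting to the container's setup.

```python
import time

# Candidate cgroup files for the container's current memory usage:
# cgroup v2 first, then cgroup v1. Which one exists depends on the host.
CGROUP_FILES = [
    "/sys/fs/cgroup/memory.current",                # cgroup v2
    "/sys/fs/cgroup/memory/memory.usage_in_bytes",  # cgroup v1
]

def container_memory_mib():
    """Return the container's memory usage in MiB, or None if unavailable."""
    for path in CGROUP_FILES:
        try:
            with open(path) as f:
                return int(f.read().strip()) / (1024 * 1024)
        except (FileNotFoundError, PermissionError, ValueError):
            continue
    return None

# Sample every 10 seconds (arbitrary interval); a steady climb across
# samples suggests the training code is not releasing memory. Stop with Ctrl+C.
while True:
    usage = container_memory_mib()
    stamp = time.strftime("%H:%M:%S")
    if usage is None:
        print(f"{stamp}  cgroup memory files not readable")
        break
    print(f"{stamp}  container memory: {usage:.0f} MiB")
    time.sleep(10)
```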