Updated on 2024-04-11 GMT+08:00

ECC Error Occurs in the Log, Causing Training Job Failure

Symptom

The training job fails, and the following error is reported in the training job log: RuntimeError: CUDA error: uncorrectable ECC error encountered
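To confirm the symptom in a downloaded log, the text can be searched for the message quoted above. The following is a minimal sketch, not part of ModelArts: the function name find_ecc_errors is hypothetical, and it assumes the log is available as plain text.

```python
import re

# Pattern built from the error message quoted in the Symptom section.
ECC_PATTERN = re.compile(r"CUDA error: uncorrectable ECC error encountered")


def find_ecc_errors(log_text):
    """Return (line_number, line) pairs for log lines that contain the ECC error.

    find_ecc_errors is a hypothetical helper for illustration only.
    """
    hits = []
    for number, line in enumerate(log_text.splitlines(), start=1):
        if ECC_PATTERN.search(line):
            hits.append((number, line.strip()))
    return hits
```

An empty result means the quoted message does not appear in the log text that was passed in, so the failure likely has a different cause.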

Possible Cause

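The GPU memory on a node used by the training job reported an uncorrectable ECC (error-correcting code) error.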

Solution

If more than 64 ECC errors occur, the system automatically isolates the faulty nodes. After the nodes are isolated, restart the training job and check whether the fault is rectified. If the training job fails again, or is suspended because a faulty node has not been isolated, contact technical support.