Updated on 2024-04-11 GMT+08:00

Insufficient Container Space for Copying Data

Symptom

While a ModelArts training job is running, the following error is printed in the log, and data fails to be copied to the container.

OSError: [Errno 28] No space left on device

Possible Causes

The container space is insufficient for downloading data.
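
One way to confirm this cause is to look at both the free bytes and the free inodes of the filesystem that receives the data. The following is a minimal sketch, not a ModelArts API: the helper names storage_report and has_room are hypothetical, and the /cache path is assumed to be the download target, as described in the Solution section. It relies only on the Python standard library and on a Unix-like container, where os.statvfs is available.

```python
import os
import shutil


def storage_report(path):
    """Return free bytes and free inodes for the filesystem containing path."""
    usage = shutil.disk_usage(path)  # total, used, and free space in bytes
    stats = os.statvfs(path)         # f_files: total inodes, f_favail: inodes free for unprivileged users
    return {
        "path": path,
        "total_bytes": usage.total,
        "free_bytes": usage.free,
        "total_inodes": stats.f_files,
        "free_inodes": stats.f_favail,
    }


def has_room(path, needed_bytes, min_free_inodes=1):
    """Return True if path has at least needed_bytes free and enough free inodes."""
    report = storage_report(path)
    # Some filesystems report 0 total inodes (no fixed inode table); skip the inode check there.
    inodes_ok = report["total_inodes"] == 0 or report["free_inodes"] >= min_free_inodes
    return report["free_bytes"] >= needed_bytes and inodes_ok


if __name__ == "__main__":
    # Assumption: /cache is the download target; fall back to the current directory elsewhere.
    target = "/cache" if os.path.isdir("/cache") else "."
    print(storage_report(target))
```

If free_bytes is large but free_inodes is close to zero, the error is more likely caused by the number of files than by their total size.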

Solution

  1. Check whether data is downloaded to the /cache directory. Each GPU node has a /cache directory with 4 TB of storage. Also check whether too many files are being created in the directory at the same time. Doing so can exhaust the available inodes, in which case writes fail with this error even though free space remains.
  2. Check whether GPU resources are used. If CPU resources are used, /cache and the code directory share 10 GB of storage, which may be insufficient for the data. In this case, use GPU resources instead.
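  3. Set the following environment variable in the code so that temporary files are written to /cache:
    import os
    # Set TMPDIR for this process and its children, before any temporary files are created.
    os.environ['TMPDIR'] = '/cache'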
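
To check that temporary files really land in the intended directory, the setting can be verified from Python itself. The sketch below is an illustration under stated assumptions: use_temp_dir is a hypothetical helper, not part of ModelArts, and it sets TMPDIR through os.environ so that the value is visible to the running process. Python's tempfile module caches its choice of directory, so the sketch clears that cache before asking where temporary files will go.

```python
import os
import tempfile


def use_temp_dir(path):
    """Point TMPDIR at path for this process and return the directory tempfile will use."""
    os.environ["TMPDIR"] = path
    # tempfile remembers the first directory it picked; reset so TMPDIR is read again.
    tempfile.tempdir = None
    return tempfile.gettempdir()


if __name__ == "__main__":
    # Assumption: /cache exists and is writable, as on the GPU nodes described above.
    chosen = use_temp_dir("/cache")
    print("Temporary files will be written to:", chosen)
```

If the printed directory is not /cache, the path is probably missing or not writable, and tempfile has fallen back to another location.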