Updated on 2024-06-15 GMT+08:00

Error Message "No space left on device" Displayed in Logs

Symptom

When data, code, or a model is copied during training, the error message "No space left on device" is displayed in the logs.

Figure 1 Error log

Possible Causes

The possible causes are as follows:

  • The disk space is insufficient.
  • When a distributed job is executed, the docker base size configuration does not take effect on certain nodes. As a result, the root directory (/) in the container has only the default 10 GB of storage space instead of the expected 50 GB, and the training job fails.
  • The storage space is sufficient, but the error message "No space left on device" is still displayed.

    If there are a large number of files in the same directory, the kernel creates an index table to accelerate file retrieval. If a large number of files are created in a short period of time, the number of indexes reaches the upper limit, and an error occurs.

    The issue occurs depending on the following factors:

    • A longer file name leads to a smaller upper limit for the number of files.
    • A smaller block size leads to a smaller upper limit for the number of files. (There are three block sizes, 1024 bytes, 2048 bytes, and 4096 bytes. The default size is 4096 bytes.)
    • This issue is more likely to occur if files are created in a shorter period of time.
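
To tell these causes apart, it can help to check the free space and the number of entries in the target directory from inside the training container. The following is a minimal sketch using only the Python standard library. The function names and the min_free_gib threshold are illustrative assumptions and are not part of ModelArts.

```python
import os
import shutil

GIB = 1024 ** 3


def disk_report(path="/"):
    """Return total and free space, in GiB, of the file system that holds path."""
    usage = shutil.disk_usage(path)
    return {"total_gib": usage.total / GIB, "free_gib": usage.free / GIB}


def count_entries(directory):
    """Count the entries directly inside directory (not recursive)."""
    with os.scandir(directory) as entries:
        return sum(1 for _ in entries)


def likely_cause(path, directory, min_free_gib=1.0):
    """Rough triage: low free space versus a crowded directory.

    min_free_gib is a hypothetical threshold chosen for illustration.
    """
    report = disk_report(path)
    if report["free_gib"] < min_free_gib:
        return "disk space is insufficient"
    entries = count_entries(directory)
    return f"space looks sufficient; {directory} holds {entries} entries"
```

On a distributed job, comparing disk_report("/")["total_gib"] across nodes against the expected size of about 50 GB may reveal the nodes where the docker base size configuration did not take effect.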

Solution

  1. If the disk space is insufficient, rectify the fault by following the operations described in Error Message "write line error" Displayed in Logs.
  2. If the issue occurs only on certain nodes used by the distributed job, submit a service ticket to isolate the faulty nodes.
  3. If the storage space is sufficient and the issue is caused by the EulerOS restriction on directory indexes, take the following measures:
    • Reduce the number of files in a single directory.
    • Slow down the file creation speed.
    • Disable the dir_index attribute of the Ext4 file system. This may slow down file retrieval. For details, see https://access.redhat.com/solutions/29894.
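
The first two measures can be applied in the training code itself. The following is a minimal sketch, assuming the files are written by your own Python code: it spreads files across subdirectories named after a hash prefix of the file name and pauses briefly after every batch of files. The function names, the two-level layout, and the pause values are illustrative assumptions, not ModelArts settings.

```python
import hashlib
import os
import time


def sharded_path(root, filename, levels=2, width=2):
    """Map filename to root/<aa>/<bb>/filename using a hash prefix of its name."""
    digest = hashlib.sha256(filename.encode("utf-8")).hexdigest()
    parts = [digest[i * width:(i + 1) * width] for i in range(levels)]
    return os.path.join(root, *parts, filename)


def write_many(root, items, pause_every=1000, pause_seconds=0.1):
    """Write (filename, data) pairs into sharded subdirectories.

    Sleeps for pause_seconds after every pause_every files to slow down
    file creation. Both values are hypothetical and should be tuned.
    """
    paths = []
    for n, (filename, data) in enumerate(items, start=1):
        path = sharded_path(root, filename)
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "wb") as f:
            f.write(data)
        paths.append(path)
        if n % pause_every == 0:
            time.sleep(pause_seconds)
    return paths
```

With two levels of two hexadecimal characters each, files are spread over up to 65,536 subdirectories, so each directory holds far fewer entries. Code that reads the files back must call sharded_path with the same arguments to locate them.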

Summary and Suggestions

Before creating a training job, debug the training code in the ModelArts development environment to eliminate as many code migration errors as possible.