Downloading Files Timed Out or No Space Left for Reading Data
Symptom
When data, code, or a model is copied during training, the error message "No space left on device" is displayed.
Possible Causes
The possible causes are as follows:
- The disk space is insufficient.
- In a distributed job, the Docker base size configuration does not take effect on some nodes. As a result, the root directory (/) of the container on those nodes has only the default 10 GB of storage instead of the expected 50 GB, and the training job fails.
- The storage space is sufficient, but the error message "No space left on device" is still displayed.
If a single directory contains a large number of files, the kernel creates an index table to accelerate file retrieval. If many files are created in a short period of time, the number of index entries reaches the upper limit, and the error occurs even though free space remains.
Whether the issue occurs depends on the following factors:
- A longer file name leads to a smaller upper limit for the number of files.
- A smaller block size leads to a smaller upper limit for the number of files. (There are three block sizes: 1024 bytes, 2048 bytes, and 4096 bytes. The default size is 4096 bytes.)
- The issue is more likely to occur if files are created in a shorter period of time. The kernel uses a cache whose size is determined by the preceding two factors. The cache is enabled when the directory contains a large number of files, and its resources are released only when they are no longer in use, so rapid file creation can fill the cache before anything is released.
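To tell these causes apart, it can help to check the capacity of the container root directory and the number of entries in the directory being written to. The following is a minimal sketch, not an official tool: the function names filesystem_report and count_entries are hypothetical, and the 50 GB figure used for comparison is the expected value quoted above.

```python
import os
import shutil

GIB = 1024 ** 3
EXPECTED_ROOT_GIB = 50  # expected size of / quoted in this document


def filesystem_report(path="/"):
    """Return capacity, free space, and inode counters for the filesystem holding path."""
    usage = shutil.disk_usage(path)  # total, used, free (bytes)
    stats = os.statvfs(path)         # POSIX only; exposes inode counters
    return {
        "total_gib": usage.total / GIB,
        "free_gib": usage.free / GIB,
        "inodes_total": stats.f_files,
        "inodes_free": stats.f_ffree,
    }


def count_entries(directory):
    """Count the entries directly inside directory, without recursing."""
    with os.scandir(directory) as entries:
        return sum(1 for _ in entries)


if __name__ == "__main__":
    report = filesystem_report("/")
    print(f"/ total: {report['total_gib']:.1f} GiB, free: {report['free_gib']:.1f} GiB")
    print(f"/ inodes free: {report['inodes_free']} of {report['inodes_total']}")
    if report["total_gib"] < EXPECTED_ROOT_GIB:
        print("Root directory is smaller than expected; the base size may not have taken effect.")
    print(f"Entries in current directory: {count_entries('.')}")
```

If free space and free inodes are both plentiful but one directory holds a very large number of entries, the directory index limit described above is the more likely cause.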
Solution
- If the disk space is insufficient, rectify the fault by following the operations described in Error Message "write line error" Displayed in Logs.
- If the issue occurs only on certain nodes used by the distributed job, submit a service ticket to isolate the faulty nodes.
- If the storage space is sufficient and the issue is caused by the EulerOS restriction on the number of files in a directory, take the following measures:
  - Reduce the number of files in a single directory.
  - Slow down the file creation speed.
  - Disable the dir_index attribute of the Ext4 file system. Note that this may reduce file retrieval performance. For details, see https://access.redhat.com/solutions/29894.
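The first two measures can be combined in the code that writes the files. The sketch below is one possible approach, not a prescribed one: it spreads files across hashed subdirectories so that no single directory grows too large, and optionally pauses after each write. The helper names sharded_path and write_sharded, the two-level layout, and the pause_s parameter are all assumptions made for illustration.

```python
import hashlib
import os
import time


def sharded_path(root, filename, levels=2, width=2):
    """Map filename to root/<xx>/<yy>/filename, using a hash of the name for the subdirectories."""
    digest = hashlib.sha256(filename.encode("utf-8")).hexdigest()
    parts = [digest[i * width:(i + 1) * width] for i in range(levels)]
    return os.path.join(root, *parts, filename)


def write_sharded(root, filename, data, pause_s=0.0):
    """Write bytes under a hashed subdirectory and optionally pause afterwards."""
    target = sharded_path(root, filename)
    os.makedirs(os.path.dirname(target), exist_ok=True)
    with open(target, "wb") as handle:
        handle.write(data)
    if pause_s > 0:
        time.sleep(pause_s)  # crude throttle on the file creation rate
    return target
```

With two levels of two hexadecimal characters each, files are distributed over up to 65,536 leaf directories. Because the path is derived from the file name alone, a reader can locate a file later by calling sharded_path with the same arguments.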
Summary and Suggestions
- Use the online notebook environment for debugging. For details, see Using JupyterLab to Develop a Model.
- Use the local IDE (PyCharm or VS Code) to access the cloud environment for debugging. For details, see Using the Local IDE to Develop a Model.