Error Message "No such file or directory" Displayed in Training Job Logs
Symptom
If a training job fails, the error message "No such file or directory" is displayed in the logs.
If a training input path is unreachable, the error message "No such file or directory" is displayed.
If a training boot file is unavailable, the error message "No such file or directory" is displayed.
Possible Causes
- If the training input path is unreachable, the path is incorrect. To locate the fault, see Checking Whether the Affected Path Is an OBS Path and Checking Whether the Affected Path Is Available.
- If the training boot file is unavailable, the path in the training job boot command is incorrect. Rectify the fault by referring to Checking the File Boot Path of a Training Job Created Using a Custom Image.
- Multiple processes or workers read and write the same file at the same time. If SFS is used, check whether multiple nodes write the same file concurrently. Also review the code to check whether multiple processes write the same file. As a good practice, prevent multiple processes or nodes from reading and writing the same file concurrently.
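One way to follow the last practice is to give each worker its own output file and to publish files with a write-then-rename step. The following is a minimal Python sketch under stated assumptions: the helper names worker_output_path and atomic_write_text and the ".rankN" naming scheme are hypothetical, and the rank value is assumed to be supplied by your distributed training framework. os.replace is atomic on local POSIX file systems; whether the same guarantee holds on a shared file system such as SFS is not confirmed here and should be verified.

```python
import os
import tempfile


def worker_output_path(base_dir, name, rank):
    """Return a file path unique to one worker (hypothetical naming scheme)."""
    stem, ext = os.path.splitext(name)
    return os.path.join(base_dir, f"{stem}.rank{rank}{ext}")


def atomic_write_text(path, text):
    """Write text to a temporary file in the target directory, then rename it."""
    directory = os.path.dirname(os.path.abspath(path))
    os.makedirs(directory, exist_ok=True)
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as handle:
            handle.write(text)
        # Rename over the target so readers do not open a half-written file
        # (assumption: the file system performs the rename atomically).
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temporary file if the write or rename failed.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

With this layout, worker 0 writes metrics.rank0.json, worker 1 writes metrics.rank1.json, and no two workers share an output file.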
Checking Whether the Affected Path Is an OBS Path
When you use ModelArts, store your data in an OBS bucket. However, the training code cannot use an OBS path to read data while it is running.
The reason is as follows:
After a training job is created, the training performance is poor if the running container is directly connected to OBS. To prevent this issue, the system automatically downloads the training data to the local path of the running container. Therefore, an error occurs if an OBS path is used in training code. For example, if the OBS path to the training code is obs://bucket-A/training/, the training code will be automatically downloaded to ${MA_JOB_DIR}/training/.
For example, the OBS path to the training code is obs://bucket-A/XXX/{training-project}/, where {training-project} is the name of the folder where the training code is stored. During training, the system automatically downloads the contents of {training-project} from OBS to the local path of the training container ($MA_JOB_DIR/{training-project}/).
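The mapping described above can be sketched in Python as follows. This is an illustration only, not the platform's implementation: the function name local_code_dir is hypothetical, and the job directory is passed in explicitly rather than assumed to have any particular value.

```python
import posixpath


def local_code_dir(obs_path, job_dir):
    """Map an OBS code directory to its expected path in the container.

    Only the last-level directory of the OBS path is kept, as described above.
    """
    prefix = "obs://"
    if not obs_path.startswith(prefix):
        raise ValueError(f"not an OBS path: {obs_path}")
    # Split into bucket and directory names, ignoring empty segments.
    parts = [p for p in obs_path[len(prefix):].split("/") if p]
    if len(parts) < 2:
        raise ValueError(f"expected obs://bucket/directory, got: {obs_path}")
    return posixpath.join(job_dir, parts[-1])
```

Inside a training container, the job directory would typically be read from the MA_JOB_DIR environment variable, for example local_code_dir("obs://bucket-A/training/", os.environ["MA_JOB_DIR"]).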
If the affected path is to the training data, perform the following operations to resolve this issue (see Parsing Input and Output Paths for details):
- When creating an algorithm, set the code path parameter, which defaults to data_url, in the input path mapping configuration.
- Add a hyperparameter, which defaults to data_url, to the training code. Use data_url as the local path for inputting the training data.
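The second step can be sketched in Python as follows. This is a minimal example, assuming the platform passes the hyperparameter to the boot file as a --data_url command-line argument; the function names parse_args and list_training_files are hypothetical.

```python
import argparse
import os


def parse_args(argv=None):
    """Read the data_url hyperparameter from the command line."""
    parser = argparse.ArgumentParser(description="Training entry point (sketch)")
    parser.add_argument("--data_url", type=str, required=True,
                        help="Local directory that holds the training data")
    # Ignore any other arguments that may be passed to the boot file.
    args, _unknown = parser.parse_known_args(argv)
    return args


def list_training_files(data_dir):
    """Fail early with a clear message if the local data directory is missing."""
    if not os.path.isdir(data_dir):
        raise FileNotFoundError(f"training data directory not found: {data_dir}")
    return sorted(os.listdir(data_dir))
```

In the boot file, call args = parse_args() and read the data from args.data_url instead of from an obs:// path.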
Checking Whether the Affected Path Is Available
Code developed locally must be uploaded to the ModelArts backend, so the path to a dependency file is easily set incorrectly in the training code.
It is recommended that you use the following general solution, which obtains the absolute path to a dependency file through the OS API.
Example:
|---project_root                  # Root directory for code
    |---BootfileDirectory         # Directory where the boot file is located
        |---bootfile.py           # Boot file
    |---otherfileDirectory        # Directory where other dependency files are located
        |---otherfile.py          # Other dependency files
Do as follows to obtain the path to a dependency file, otherfile_path in this example, in the boot file:
import os
current_path = os.path.dirname(os.path.realpath(__file__))  # Directory where the boot file is located
project_root = os.path.dirname(current_path)  # Root directory of the project, which is the code directory set on the ModelArts training console
otherfile_path = os.path.join(project_root, "otherfileDirectory", "otherfile.py")
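To make a wrong path easier to diagnose, the same lookup can be wrapped in a helper that checks whether the file exists before it is used. The following is a sketch only; the function name resolve_from_project_root is hypothetical, and it assumes the directory layout shown in the example above.

```python
import os


def resolve_from_project_root(boot_file, *relative_parts):
    """Build an absolute path under the project root and verify that it exists."""
    current_path = os.path.dirname(os.path.realpath(boot_file))  # Boot file directory
    project_root = os.path.dirname(current_path)                 # Parent of that directory
    candidate = os.path.join(project_root, *relative_parts)
    if not os.path.exists(candidate):
        raise FileNotFoundError(
            f"{candidate} not found (project root resolved to {project_root})"
        )
    return candidate
```

In the boot file, the call would be otherfile_path = resolve_from_project_root(__file__, "otherfileDirectory", "otherfile.py").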
Checking the File Boot Path of a Training Job Created Using a Custom Image
Take OBS path obs://obs-bucket/training-test/demo-code as an example. The training code in this path will be automatically downloaded to ${MA_JOB_DIR}/demo-code in the training container, where demo-code is the last-level directory of the OBS path and can be customized.
If you use a custom image to create a training job, the system will automatically run the image boot command after the code directory is downloaded. The boot command must comply with the following rules:
- If the training startup script is a .py file, train.py for example, the boot command can be python ${MA_JOB_DIR}/demo-code/train.py.
- If the training startup script is an .sh file, main.sh for example, the boot command can be bash ${MA_JOB_DIR}/demo-code/main.sh.
In both commands, demo-code is the last-level directory of the OBS path and can be customized.
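Before submitting a job, the boot command can be checked against these rules with a short script run in an environment where MA_JOB_DIR is set. The following Python sketch is an assumption-laden illustration: the function names boot_file_path and boot_command are hypothetical, and the env parameter exists only so the check can be exercised outside a training container.

```python
import os


def boot_file_path(code_dir_name, script_name, env=None):
    """Return the expected location of the boot file under MA_JOB_DIR."""
    env = os.environ if env is None else env
    job_dir = env.get("MA_JOB_DIR")
    if not job_dir:
        raise RuntimeError("MA_JOB_DIR is not set in this environment")
    return os.path.join(job_dir, code_dir_name, script_name)


def boot_command(code_dir_name, script_name, env=None):
    """Build the boot command and fail early if the boot file is missing."""
    path = boot_file_path(code_dir_name, script_name, env)
    if not os.path.isfile(path):
        raise FileNotFoundError(f"boot file not found: {path}")
    # .sh scripts are started with bash; everything else is treated as Python here.
    interpreter = "bash" if script_name.endswith(".sh") else "python"
    return f"{interpreter} {path}"
```

For example, boot_command("demo-code", "train.py") returns a python command that points at the train.py file under the demo-code directory, or raises FileNotFoundError if that file is not present.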
Summary and Suggestions
- Use an in-cloud notebook for debugging. For details, see JupyterLab Overview and Common Operations.
- Use a local IDE (PyCharm or VS Code) to access the cloud environment for debugging. For details, see Operation Process in a Local IDE.