In Multi-Node Training, the TensorFlow PS Node Acts as a Server and Remains Suspended. How Does ModelArts Determine Whether the Training Is Complete? Which Node Is a Worker?
In TensorFlow distributed training, both a PS task and a worker task are started. The worker task is the key task: ModelArts uses the exit code of the worker task's process to determine whether the training job is complete.
ModelArts uses the task name to determine which node is a worker. The training is issued as a Volcano job that contains a PS task and a worker task, each with its own startup command. The task_name hyperparameter is generated automatically: it is ps for the PS task and worker for the worker task.
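The role dispatch described above can be sketched in a training script as follows. This is a minimal illustration, not ModelArts code: it assumes the task_name hyperparameter reaches the script as a `--task_name` command-line argument, and the helper names `parse_task_name` and `run`, as well as the `train_fn` and `serve_fn` callbacks, are hypothetical. The point it shows is that only the worker's return value matters as an exit code, because the PS process normally never exits on its own.

```python
import argparse
import sys


def parse_task_name(argv):
    """Read the task_name hyperparameter (assumed to arrive as --task_name)."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--task_name", choices=["ps", "worker"], required=True)
    # Other hyperparameters may be present; ignore the ones not declared here.
    args, _unknown = parser.parse_known_args(argv)
    return args.task_name


def run(task_name, train_fn, serve_fn):
    """Dispatch by role and return the exit code for this process.

    train_fn and serve_fn are hypothetical callbacks supplied by the caller.
    """
    if task_name == "ps":
        # A real PS would block here serving variables, for example with
        # tf.distribute.Server(...).join(), and would not return by itself.
        serve_fn()
        return 0
    try:
        train_fn()
    except Exception as exc:
        # A non-zero exit code from the worker signals a failed training job.
        print(f"training failed: {exc}", file=sys.stderr)
        return 1
    # Exit code 0 from the worker signals that training completed.
    return 0
```

A script entry point would then typically call `sys.exit(run(parse_task_name(sys.argv[1:]), train, serve))`, so that the worker process ends with the exit code that reflects the training result.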