How Are Training Jobs Queued?
Training jobs are queued on a first in, first out (FIFO) basis. A job can be executed only after the job ahead of it in the queue is complete. This may lead to starvation of small jobs.
Job starvation occurs as follows: Suppose a 64-card training job is queuing, and a 1-card training job is queued behind it. The 1-card training job can be executed only after 64 cards are idle for the job ahead of it. Even if 30 cards are available, the 1-card training job cannot be executed.
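The starvation scenario above can be illustrated with a minimal sketch of a strict FIFO queue. This is a simplified model, not ModelArts code: the function name fifo_schedule, the job names, and the (name, cards) tuple layout are hypothetical, and the sketch only models whether a job can start, not how long it runs.

```python
from collections import deque


def fifo_schedule(free_cards, queue):
    """Start queued jobs in strict FIFO order (hypothetical helper).

    queue is a list of (name, cards) tuples, oldest job first.
    Scheduling stops at the first job that does not fit, so every
    job behind it keeps waiting even if it would fit on its own.
    """
    pending = deque(queue)
    started = []
    while pending and pending[0][1] <= free_cards:
        name, cards = pending.popleft()
        free_cards -= cards
        started.append(name)
    waiting = [name for name, _ in pending]
    return started, waiting, free_cards


# 30 cards are idle, but the 64-card job at the head of the queue
# does not fit, so the 1-card job behind it cannot start either.
started, waiting, free = fifo_schedule(30, [("job-64", 64), ("job-1", 1)])
print(started, waiting, free)  # [] ['job-64', 'job-1'] 30
```

In this model, the 1-card job starts only once enough cards are idle for the 64-card job ahead of it to leave the head of the queue.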
Resource Pools FAQs
- Can I Use ECSs to Create a Dedicated Resource Pool for ModelArts?
- Can I Deploy Multiple Services on One Dedicated Resource Pool Node?
- How Is a Node Newly Added to a Dedicated Resource Pool Billed?
- What Are the Differences Between a Public Resource Pool and a Dedicated Resource Pool?
- How Do I Log In to a Dedicated Resource Pool Node Through SSH?
- How Are Training Jobs Queued?
- What Do I Do If Resources Are Insufficient for Starting a New Real-Time Service After I Stop a Real-Time Service in a Dedicated Resource Pool?
- Can a Public Resource Pool Be Used for Network Connection Between ModelArts and the Authentication Service for Running Algorithms?
- Why Is a Dedicated Resource Pool That Fails to Be Created Still Displayed on the Console After It Is Deleted?
- How Do I Add a VPC Peering Connection Between a Dedicated Resource Pool and an SFS?