Updated on 2024-04-30 GMT+08:00

What Do I Do If Resources Are Insufficient When a Service Is Deployed, Started, Upgraded, or Modified?

Symptom

The service fails to start, and an error message indicates that resources are insufficient and service scheduling failed, for example, "Schedule failed due to insufficient resources. Retry later." or "ModelArts.3976: No resources are available for the selected specification."

Figure 1 Schedule failed due to insufficient resources

Possible Causes

The configured instance specifications exceed the remaining CPU or memory resources (indicated by "insufficient CPU" or "insufficient memory").
The disk capacity cannot meet the requirements of the model (indicated by "x node(s) had taint {node.kubernetes.io/disk-pressure: }" or "No space").
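
The two causes can be told apart by the keywords quoted above. The following is a minimal sketch, assuming the error text is already available as a string; the function name classify_cause and the returned labels are illustrative only and are not part of ModelArts.

```python
# Keywords are the ones quoted in this article; matching is case-insensitive.
# The cause labels ("cpu_or_memory", "disk") are made up for this sketch.
CAUSE_PATTERNS = {
    "cpu_or_memory": ("insufficient cpu", "insufficient memory"),
    "disk": ("disk-pressure", "no space"),
}


def classify_cause(message: str) -> str:
    """Return 'cpu_or_memory', 'disk', or 'unknown' for an error message."""
    text = message.lower()
    for cause, patterns in CAUSE_PATTERNS.items():
        if any(pattern in text for pattern in patterns):
            return cause
    return "unknown"


if __name__ == "__main__":
    print(classify_cause("insufficient memory"))
    print(classify_cause("x node(s) had taint {node.kubernetes.io/disk-pressure: }"))
```

A message that matches neither group, such as the generic "Schedule failed due to insufficient resources. Retry later.", is reported as unknown, in which case the detailed error information needs to be checked manually.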

Solution

When resources are insufficient, ModelArts retries three times. If resources are released during these retries, the service can be successfully deployed.

If resources are still insufficient after three retries, the service deployment fails. In this case, perform the following operations to resolve this issue:

If the service is to be deployed in a public resource pool, wait until other users release resources.
If the service is to be deployed in a dedicated resource pool, select lower container specifications or custom specifications to deploy the service, provided that the model requirements are still met.
Expand the capacity of the current resource pool before deploying the service. To expand the capacity of the public resource pool, contact the system administrator. To expand the capacity of the dedicated resource pool, refer to Resizing a Resource Pool.
If the disk space is insufficient, deploy the service again so that the instance can be scheduled to another node. If the disk space of a single instance is still insufficient, contact the system administrator to select suitable specifications.
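
Waiting for resources to be released and then trying again can also be scripted on the client side. The following is a minimal sketch, assuming a caller-supplied function that attempts the deployment and returns True on success; deploy_with_retries, deploy, attempts, and wait_seconds are hypothetical names for illustration and do not correspond to any ModelArts API. It is separate from the three retries that ModelArts performs on its own.

```python
import time
from typing import Callable


def deploy_with_retries(
    deploy: Callable[[], bool],
    attempts: int = 3,
    wait_seconds: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call deploy() up to `attempts` times, waiting between failed attempts.

    `deploy` is a placeholder supplied by the caller; it should return True
    when the service was deployed and False when resources were insufficient.
    """
    for attempt in range(1, attempts + 1):
        if deploy():
            return True
        # Do not wait after the final failed attempt.
        if attempt < attempts:
            sleep(wait_seconds)
    return False
```

If the function returns False, resources were still unavailable after all attempts, and the remaining options above apply: lower the specifications or expand the resource pool.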

If an AI application imported through a large model is used to deploy the service, ensure that the disk space of the dedicated resource pool is greater than 1 TB (1000 GB).