Updated on 2024-04-30 GMT+08:00

Storage Volume Failed to Be Mounted to the Pod During Training Job Creation

Symptom

The training job remains in the Creating state. The events of the training job show the error message "Unable to mount volumes for pod xxx ... list of unmounted volumes=[nfs-x]".

Possible Cause

The SFS Turbo file system must reside in a VPC that is interconnected with the network of the dedicated resource pool. Only then can SFS Turbo be mounted to training jobs that run in the dedicated resource pool. If the two networks are not connected, mounting may fail.

Procedure

  1. Go to the training job details page and obtain the SFS Turbo name.
    Figure 1 Obtaining SFS Turbo name
  2. Log in to the SFS console, locate the SFS Turbo mounted to the training job, and click it to go to the details page. Obtain the VPC, security group, and endpoint information.
    • VPC: value of VPC
    • Security group: value of Security Group
    • Endpoint: value of Shared Path without the trailing ":/". For example, if the shared path is 4ab556b5-d689-44f1-9302-24c09daxxxxc.sfsturbo.internal:/, the SFS Turbo endpoint is 4ab556b5-d689-44f1-9302-24c09daxxxxc.sfsturbo.internal.
  3. Check whether the VPC CIDR block meets the following requirements:

    Requirement 1: To prevent CIDR block conflicts with the dedicated resource pool, the SFS Turbo CIDR block cannot overlap with the CIDR block of the dedicated resource pool (192.168.20.0/24 by default). To obtain the actual CIDR block of the dedicated resource pool, go to the resource pool details page and check Network.

    Requirement 2: To prevent network conflicts with the container, the SFS Turbo CIDR block cannot overlap with the 172 CIDR block, which is used by the container network.

    • If the requirements are not met, modify the VPC CIDR block of SFS Turbo. The recommended value is 10.X.X.X. For details, see Modifying the CIDR Block of a VPC.
    • If the requirements are met, go to the next step.
  4. Check whether the VPC CIDR block of SFS Turbo is limited by a security group rule.
    Create a training job in the selected dedicated resource pool without mounting SFS Turbo. Once the job is in the Running state, access the worker-0 instance using Cloud Shell. Run the command curl {sfs-turbo-endpoint}:{port} for each required port to check whether the port is open. The ports that SFS Turbo requires for inbound traffic are 111, 445, 2049, 2051, 2052, and 20048. For details, see Security Group in Create a File System. For details about how to use Cloud Shell, see Logging In to a Training Container Using Cloud Shell.
    • If any required port is blocked by a security group rule, modify the security group configuration. For details, see Modifying a Security Group Rule.
    • If no such security group rule exists, go to the next step.
  5. Check whether SFS Turbo is normal.
    Create an ECS that uses the same CIDR block as SFS Turbo and mount SFS Turbo to the ECS. If the mounting fails, SFS Turbo is abnormal.
    • If SFS Turbo is abnormal, contact SFS technical support.
    • If SFS Turbo is normal, contact ModelArts technical support.
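
The checks in steps 2 to 4 can also be scripted. The following is a minimal Python sketch that uses only the standard library and is not part of any ModelArts or SFS tooling. The function names are hypothetical. The container range 172.16.0.0/12 in the usage example is an assumption, because this page only refers to the "172 CIDR block"; replace it with the actual container network range. The port check only tests TCP connectivity, which roughly corresponds to the curl probe in step 4 and does not cover any UDP traffic.

```python
import ipaddress
import socket

# Ports listed in step 4 as required for SFS Turbo inbound traffic.
SFS_TURBO_PORTS = (111, 445, 2049, 2051, 2052, 20048)


def endpoint_from_shared_path(shared_path: str) -> str:
    """Step 2: strip the trailing ":/" from the Shared Path value."""
    suffix = ":/"
    if shared_path.endswith(suffix):
        return shared_path[: -len(suffix)]
    return shared_path


def overlapping_blocks(sfs_cidr: str, reserved_cidrs: list[str]) -> list[str]:
    """Step 3: return the reserved CIDR blocks that overlap with sfs_cidr."""
    sfs_net = ipaddress.ip_network(sfs_cidr, strict=False)
    return [
        cidr
        for cidr in reserved_cidrs
        if sfs_net.overlaps(ipaddress.ip_network(cidr, strict=False))
    ]


def check_tcp_ports(host: str, ports=SFS_TURBO_PORTS, timeout: float = 3.0) -> dict[int, bool]:
    """Step 4: try a TCP connection to each port; True means it succeeded."""
    results = {}
    for port in ports:
        try:
            with socket.create_connection((host, port), timeout=timeout):
                results[port] = True
        except OSError:
            # Refused, timed out, or unreachable: treat the port as not open.
            results[port] = False
    return results


if __name__ == "__main__":
    # Shared path taken from the example in step 2.
    print(endpoint_from_shared_path(
        "4ab556b5-d689-44f1-9302-24c09daxxxxc.sfsturbo.internal:/"))
    # 192.168.20.0/24 is the default pool block from step 3;
    # 172.16.0.0/12 is an ASSUMED container range, not taken from this page.
    print(overlapping_blocks("10.0.0.0/16", ["192.168.20.0/24", "172.16.0.0/12"]))
```

An empty list from overlapping_blocks means that the SFS Turbo CIDR block meets both requirements in step 3. To reproduce step 4, run check_tcp_ports(endpoint) from the worker-0 instance of a running training job. Any port that maps to False is worth checking against the security group rules.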