Error Message "No space left" Displayed When a TensorFlow Multi-node Job Downloads Data to /cache

Updated on 2024-06-11 GMT+08:00

Symptom

During training job creation, the error message "No space left" is displayed when a TensorFlow multi-node job downloads data to /cache.

Possible Cause

A TensorFlow multi-node job starts two roles: parameter server (ps) and worker. The ps and worker roles are scheduled to the same machine, so they write to the same /cache. The ps does not use the training data, so the ps-related logic in the code does not need to download it. If the ps also downloads the data to /cache, the amount of data actually downloaded is doubled. For example, if the training data is only 2.5 TB, the ps and the worker together download 5 TB. Because /cache has only 4 TB of available space, the program reports that space is insufficient.
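The arithmetic behind this example can be checked with a short sketch. It assumes one ps and one worker on the machine, each downloading one full copy of the data; the helper name `cache_demand_tb` is hypothetical and used only for illustration.

```python
def cache_demand_tb(dataset_tb: float, downloading_roles: int) -> float:
    """Space written to /cache when each co-located role downloads its own copy."""
    return dataset_tb * downloading_roles


CACHE_CAPACITY_TB = 4.0  # available /cache space in the example
DATASET_TB = 2.5         # size of the training data in the example

# Assumption: one ps and one worker share the machine.
both_roles = cache_demand_tb(DATASET_TB, 2)   # ps and worker both download
worker_only = cache_demand_tb(DATASET_TB, 1)  # only the worker downloads

print(both_roles, both_roles > CACHE_CAPACITY_TB)    # 5.0 True: does not fit
print(worker_only, worker_only > CACHE_CAPACITY_TB)  # 2.5 False: fits
```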

Solution

When a TensorFlow multi-node job is used to download data, the correct download logic is as follows:

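import argparse

# Each role (ps or worker) is started with its own --job_name argument.
parser = argparse.ArgumentParser()
parser.add_argument("--job_name", type=str, default="")
# parse_known_args() returns (namespace, unknown_args); keep only the namespace.
args, _ = parser.parse_known_args()

# Only non-ps roles need the training data, so skip the download on ps.
if args.job_name != "ps":
    copy..............................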
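
The same guard can be wrapped in small helpers, which makes it easier to check in isolation. This is a minimal sketch and not the ModelArts implementation: `parse_job_name`, `maybe_download`, and the `copy_fn` callable are hypothetical names, and the free-space check with `shutil.disk_usage` is an optional addition that the original logic does not include. The actual copy call is left to the caller.

```python
import argparse
import shutil


def parse_job_name(argv=None):
    """Read --job_name and ignore any other arguments passed to the script."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--job_name", type=str, default="")
    args, _unknown = parser.parse_known_args(argv)
    return args.job_name


def maybe_download(job_name, copy_fn, cache_dir="/cache", required_bytes=0):
    """Run copy_fn unless this process is a ps. Return True if the copy ran.

    copy_fn is a hypothetical zero-argument callable that performs the copy.
    required_bytes is an optional size estimate used for a pre-check.
    """
    if job_name == "ps":
        return False  # ps does not use training data, so nothing is copied
    free = shutil.disk_usage(cache_dir).free
    if free < required_bytes:
        raise OSError(
            f"{cache_dir} has {free} bytes free, but {required_bytes} are needed"
        )
    copy_fn()
    return True
```

A training script could then call `maybe_download(parse_job_name(), copy_fn)` with its own copy routine. The ps process returns immediately, and a worker fails early with a clear message if the space estimate does not fit.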