
Updated on 2024-04-30 GMT+08:00


ExeML-powered Training Job Failed

A training job is created successfully but fails during execution.

To rectify this fault, first check whether your account is in arrears. If the account is in good standing, rectify the fault based on the job type.

- For training job faults related to Image Classification, Sound Classification, and Text Classification, see Checking Whether Data Exists in OBS, Checking the OBS Access Permission, and Checking Whether the Images Meet the Requirements.
- For training job faults related to Object Detection, see Checking Whether Data Exists in OBS, Checking the OBS Access Permission, Checking Whether the Images Meet the Requirements, and Checking Whether the Marking Boxes Meet the Object Detection Requirements.
- For training job faults related to Predictive Analytics, see Checking Whether Data Exists in OBS, Checking the OBS Access Permission, and Troubleshooting of a Predictive Analytics Job Failure.

Checking Whether Data Exists in OBS

If images or data stored in OBS are deleted and the change is not synchronized to ModelArts ExeML or datasets, the task fails.

Check whether data exists in OBS. For Image Classification, Sound Classification, Text Classification, and Object Detection, you can click Synchronize Data Source on the Data Labeling page of ExeML to synchronize data from OBS to ModelArts.

Checking the OBS Access Permission

If the access permission of the OBS bucket cannot meet the training requirements, the training fails. Do the following to check the OBS permissions:

Check whether the current account has been granted read and write permissions on the OBS bucket (specified in bucket ACLs).
1. Go to the OBS management console, select the OBS bucket used by the ExeML project, and click the bucket name to go to the Overview page.
2. In the navigation pane, choose Permissions and click Bucket ACLs. Then, check whether the current account has the read and write permissions. If it does not, contact the bucket owner to obtain the permissions.
Check whether the OBS bucket is unencrypted.
1. Go to the OBS management console, select the OBS bucket used by the ExeML project, and click the bucket name to go to the Overview page.
2. Ensure that the default encryption function is disabled for the OBS bucket. If the OBS bucket is encrypted, click Default Encryption and change its encryption status.
  Figure 1 Default encryption status
Check whether the direct reading function of archived data is disabled.
1. Go to the OBS management console, select the OBS bucket used by the ExeML project, and click the bucket name to go to the Overview page.
2. Ensure that the direct reading function is disabled for the archived data in the OBS bucket. If this function is enabled, click Direct Reading and disable it.
Figure 2 Disabled direct reading
Ensure that files in OBS are not encrypted.
Do not select KMS encryption when uploading images or files. Otherwise, the dataset cannot read the data. Encryption cannot be removed from a file that has already been uploaded. If files were uploaded with encryption, disable bucket encryption and upload the images or files again.

Figure 3 File encryption status

Checking Whether the Images Meet the Requirements

Currently, ExeML does not support four-channel images. Check your data and exclude or delete any images in this format.
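
A local copy of the dataset can be screened for four-channel images before it is uploaded. The following is a minimal sketch that covers PNG files only: it reads the color type from the PNG header, where color type 6 (RGB with alpha) means four channels. The names png_channel_count and find_four_channel_pngs are illustrative and are not part of ModelArts. Other formats, such as CMYK JPEG files, would need an imaging library and are not handled here.

```python
from pathlib import Path

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
# PNG color type -> number of channels, as defined by the PNG specification.
PNG_CHANNELS = {0: 1, 2: 3, 3: 1, 4: 2, 6: 4}


def png_channel_count(header: bytes):
    """Return the channel count from the first 26 bytes of a PNG, or None if not a PNG."""
    if len(header) < 26 or header[:8] != PNG_SIGNATURE or header[12:16] != b"IHDR":
        return None
    # Byte 25 is the color type field of the IHDR chunk.
    return PNG_CHANNELS.get(header[25])


def find_four_channel_pngs(folder):
    """List .png files under a local folder that have four channels."""
    flagged = []
    for path in sorted(Path(folder).rglob("*.png")):
        with open(path, "rb") as f:
            if png_channel_count(f.read(26)) == 4:
                flagged.append(path)
    return flagged
```

Calling find_four_channel_pngs on a local dataset folder returns the paths to exclude or convert. The pattern "*.png" is case-sensitive on most Linux file systems.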

Checking Whether the Marking Boxes Meet the Object Detection Requirements

Currently, object detection supports only rectangular labeling boxes. Ensure that the labeling boxes of all images are rectangular ones.

If a non-rectangular labeling box is used, the following error message may be displayed:

Error bandbox.

For other types of projects (such as image classification and sound classification), skip this checking item.
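
For object detection projects, the rectangle check can be sketched as follows. This example assumes that the labels are stored as PASCAL VOC-style XML in which each rectangular box is an object element with a bndbox child. That file layout is an assumption and is not stated in this document, and the function name non_rectangular_objects is illustrative.

```python
import xml.etree.ElementTree as ET


def non_rectangular_objects(voc_xml: str):
    """Return the names of labeled objects that have no <bndbox> child.

    Assumption: rectangles are stored as <object><bndbox>...</bndbox></object>.
    """
    root = ET.fromstring(voc_xml)
    flagged = []
    for obj in root.iter("object"):
        if obj.find("bndbox") is None:
            flagged.append(obj.findtext("name", default="(unnamed)"))
    return flagged
```

An empty result means that every labeled object in the file has a rectangular box under the stated assumption.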

Troubleshooting of a Predictive Analytics Job Failure

Check whether the data used for predictive analytics meets the following requirements.
The predictive analytics task releases datasets without using the data management function. If the data does not meet the requirements of the training job, the job will fail to run.

Check whether the data used for training meets the requirements of the predictive analytics job. The following lists the requirements. If the requirements are met, go to the next step. If the requirements are not met, adjust the data based on the requirements and then perform the training again.
- File names in a dataset consist of letters, digits, hyphens (-), and underscores (_), and the file name suffix is .csv. The files must be stored in a folder in the OBS bucket rather than in the root directory of the bucket, for example, /obs-xxx/data/input.csv.
- The files are saved in CSV format. Use newline characters (\n or LF) to separate lines and commas (,) to separate columns of the file content. The file content cannot contain Chinese characters. The column content cannot contain special characters such as commas (,) and newline characters. Quotation marks are not supported. It is recommended that the column content consist of letters and digits.
- Every row has the same number of training columns.
- There are at least 100 different data records in total (records that differ in the value of any feature are considered different).
- The training columns cannot contain data in timestamp format (such as yy-mm-dd or yyyy-mm-dd).
- The specified label column contains at least two values and has no missing data.
- In addition to the label column, the dataset contains at least two valid feature columns. Each feature column contains at least two values, and the percentage of missing data in it is lower than 10%.
- The training data CSV file cannot contain a table header. Otherwise, the training fails.
- Due to the limitation of the feature filtering algorithm, place the label column in the last column of the dataset. Otherwise, the training may fail.
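
Most of the requirements above can be checked locally before a job is started. The following is a minimal sketch that uses only the Python standard library. The function name check_training_csv and the patterns are illustrative and are not part of ModelArts. The sketch treats any non-ASCII character as a problem, which is stricter than the rule about Chinese characters, and it does not try to detect a table header because a header cannot be told apart from data reliably.

```python
import csv
import io
import re

NAME_PATTERN = re.compile(r"^[A-Za-z0-9_-]+\.csv$")
DATE_PATTERN = re.compile(r"^\d{2}(\d{2})?-\d{1,2}-\d{1,2}$")  # yy-mm-dd or yyyy-mm-dd


def check_training_csv(file_name: str, text: str):
    """Return a list of problems found; an empty list means none were detected."""
    problems = []
    if not NAME_PATTERN.match(file_name):
        problems.append("file name must use letters, digits, - and _ and end with .csv")
    if '"' in text:
        problems.append("quotation marks are not supported")
    if not text.isascii():
        problems.append("non-ASCII characters found")

    # QUOTE_NONE: split on commas only, without interpreting quotation marks.
    rows = [r for r in csv.reader(io.StringIO(text), quoting=csv.QUOTE_NONE) if r]
    if not rows:
        return problems + ["file is empty"]
    width = len(rows[0])
    if any(len(r) != width for r in rows):
        return problems + ["rows have different numbers of columns"]
    if width < 3:
        problems.append("need at least two feature columns plus the label column")
    if len({tuple(r) for r in rows}) < 100:
        problems.append("fewer than 100 different data records")
    if any(DATE_PATTERN.match(v.strip()) for r in rows for v in r):
        problems.append("timestamp-like values found")

    columns = list(zip(*rows))
    label = columns[-1]  # the label column is expected to be the last column
    if any(v.strip() == "" for v in label):
        problems.append("label column has missing values")
    if len(set(label)) < 2:
        problems.append("label column needs at least two values")
    for index, column in enumerate(columns[:-1]):
        present = [v for v in column if v.strip() != ""]
        missing = len(column) - len(present)
        if len(set(present)) < 2:
            problems.append(f"feature column {index} has fewer than two values")
        if 10 * missing >= len(column):
            problems.append(f"feature column {index} is missing 10% or more of its data")
    return problems
```

For example, check_training_csv("input.csv", open("input.csv", encoding="utf-8").read()) returns the list of detected problems for a local file.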
ModelArts automatically filters data and then starts the training job. If the preprocessed data does not meet the training requirements, the training job fails to be executed.
Filter policies for columns in a dataset:
- If the vacancy rate of a column is greater than the threshold (0.9) set by the system, the data in this column will be deleted during training.
- If a column has only one value (that is, the data in each row is the same), the data in this column will be deleted during training.
- For a non-numeric column, if the number of values in this column is equal to the number of rows (that is, the values in each row are different), the data in this column will be deleted during training.
After the preceding filtering, if the data in the dataset does not meet the training requirements in Item 1, the training fails or cannot be executed. Complete the data before starting the training.
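
To preview which columns would remain after filtering, the three policies above can be sketched as follows. This is an approximation written from the descriptions in this section, not the filtering code used by ModelArts. The function names are illustrative, and treating an empty cell as a vacancy is an assumption.

```python
def is_number(value: str) -> bool:
    """Return True if the cell text parses as a floating-point number."""
    try:
        float(value)
        return True
    except ValueError:
        return False


def columns_kept_after_filtering(rows, vacancy_threshold=0.9):
    """Return the indexes of columns that survive the three filter policies."""
    kept = []
    for index, column in enumerate(zip(*rows)):
        present = [v for v in column if v.strip() != ""]
        vacancy_rate = 1 - len(present) / len(column)
        if vacancy_rate > vacancy_threshold:
            continue  # policy 1: vacancy rate above the threshold
        if len(set(column)) == 1:
            continue  # policy 2: the same value in every row
        non_numeric = any(not is_number(v) for v in present)
        if non_numeric and len(set(column)) == len(column):
            continue  # policy 3: non-numeric column with a different value in every row
        kept.append(index)
    return kept
```

If fewer than two feature columns remain in the result, the dataset no longer meets the requirements in Item 1.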
Restrictions for a dataset file:
- If you use the 2U8G flavor (2 vCPUs and 8 GB of memory), it is recommended that the size of the dataset file be less than 10 MB. If the file size meets the requirements but the data volume (product of the number of rows and the number of columns) is extremely large, the training may still fail. It is recommended that the product be less than 10,000.
- If you use the 8U32G flavor (8 vCPUs and 32 GB of memory), it is recommended that the size of the dataset file be less than 100 MB. If the file size meets the requirements but the data volume (product of the number of rows and the number of columns) is extremely large, the training may still fail. It is recommended that the product be less than 1,000,000.
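
The recommended limits above can be expressed as a small check. This is a sketch only: the function name within_recommended_limits is illustrative, and interpreting 1 MB as 1024 x 1024 bytes is an assumption.

```python
# Recommended limits per flavor: (maximum file size in bytes, maximum rows x columns).
RECOMMENDED_LIMITS = {
    "2U8G": (10 * 1024 * 1024, 10_000),
    "8U32G": (100 * 1024 * 1024, 1_000_000),
}


def within_recommended_limits(flavor: str, file_size_bytes: int, rows: int, columns: int) -> bool:
    """Return True if both the file size and the data volume are below the recommended limits."""
    max_bytes, max_cells = RECOMMENDED_LIMITS[flavor]
    return file_size_bytes < max_bytes and rows * columns < max_cells
```

For example, a 5 MB file with 2,000 rows and 10 columns has a data volume of 20,000, which exceeds the recommendation for 2U8G but is within the recommendation for 8U32G.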
If the fault persists, contact HUAWEI CLOUD technical support.