Preparing Data
Before using ModelArts ExeML to build a model, upload data to an OBS bucket. The OBS bucket and ModelArts must be in the same region.
Requirements on Datasets
- Files must be in TXT or CSV format, and cannot exceed 8 MB.
- Separate rows with line feed characters. Each row of data represents one labeled object.
- Currently, only Chinese is supported.
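The format and size requirements above can be checked locally before uploading. The following is a minimal sketch, not part of ModelArts or OBS: the function name check_dataset_file is hypothetical, and the sketch assumes that "8 MB" means 8 x 1024 x 1024 bytes and that the format is identified by the file name extension.

```python
import os

# Assumption: "8 MB" is read as 8 * 1024 * 1024 bytes.
MAX_BYTES = 8 * 1024 * 1024
# Assumption: the format is judged by the file name extension only.
ALLOWED_EXTENSIONS = {".txt", ".csv"}


def check_dataset_file(path):
    """Return a list of problems found for one local file (empty if none)."""
    problems = []
    ext = os.path.splitext(path)[1].lower()
    if ext not in ALLOWED_EXTENSIONS:
        problems.append(f"unsupported format: {ext or '(no extension)'}")
    size = os.path.getsize(path)
    if size > MAX_BYTES:
        problems.append(f"file is {size} bytes, limit is {MAX_BYTES}")
    return problems
```

Calling check_dataset_file on each local file and uploading only the files that return an empty list is one way to catch oversized or wrongly named files early. The sketch does not check the language of the text.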
Uploading Data to OBS
In this section, the OBS console is used to upload data.
- If you do not need to upload training data in advance, create an empty folder to store files generated in the future.
- If you need to upload files to be labeled in advance, create an empty folder, and save the files in the folder. An example of the file directory structure is /bucketName/data/text.csv.
- A label name can contain a maximum of 32 characters, including Chinese characters, letters, digits, hyphens (-), and underscores (_).
- If you want to upload labeled text files to an OBS bucket, upload them according to the following specifications:
- The labeled objects and their label files must be in the same directory and must correspond one to one. For example, if the object file name is COMMENTS_114745.txt, the label file name must be COMMENTS_114745_result.txt.
The following is an example of the data file structure.
├─<dataset-import-path>
│      COMMENTS_114732.txt
│      COMMENTS_114732_result.txt
│      COMMENTS_114745.txt
│      COMMENTS_114745_result.txt
│      COMMENTS_114945.txt
│      COMMENTS_114945_result.txt
- The labeled objects and the label files are text files and correspond to each other row by row. For example, the first row in the label file is the label of the first row in the labeled object.
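The naming and pairing rules in the list above can be verified on a local directory before uploading. The following is a minimal sketch with hypothetical helper names (is_valid_label_name, check_label_pairs). It assumes that the files are UTF-8 encoded and that "Chinese characters" means the basic CJK Unified Ideographs range (U+4E00 to U+9FFF). Neither assumption comes from the ModelArts documentation.

```python
import os
import re

RESULT_SUFFIX = "_result.txt"
# Assumption: "Chinese characters" is taken as U+4E00..U+9FFF.
LABEL_NAME = re.compile(r"[\u4e00-\u9fffA-Za-z0-9_-]{1,32}")


def is_valid_label_name(name):
    """True if the name has 1 to 32 allowed characters and nothing else."""
    return LABEL_NAME.fullmatch(name) is not None


def count_rows(path):
    """Count rows in a text file (assumed UTF-8), one row per line."""
    with open(path, encoding="utf-8") as f:
        return sum(1 for _ in f)


def check_label_pairs(directory):
    """Return a list of pairing problems for the .txt files in a directory."""
    problems = []
    txt = {n for n in os.listdir(directory) if n.endswith(".txt")}
    labels = {n for n in txt if n.endswith(RESULT_SUFFIX)}
    objects = sorted(txt - labels)
    for obj in objects:
        label = obj[: -len(".txt")] + RESULT_SUFFIX
        if label not in labels:
            problems.append(f"{obj}: missing {label}")
            continue
        labels.discard(label)
        n_obj = count_rows(os.path.join(directory, obj))
        n_lab = count_rows(os.path.join(directory, label))
        if n_obj != n_lab:
            problems.append(f"{obj}: {n_obj} rows but {label} has {n_lab}")
    # Any label file left over has no object file next to it.
    for label in sorted(labels):
        problems.append(f"{label}: no matching object file")
    return problems
```

An empty list from check_label_pairs means that every object file has a label file with the same number of rows. The sketch does not inspect the label values themselves.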
Procedure for uploading data to OBS:
Perform the following operations to import data to the dataset for model training and building.
- Log in to OBS Console and create a bucket in the same region as ModelArts. If you use an existing bucket, ensure that it is in the same region as ModelArts.
- Upload the local data to the OBS bucket. If you have a large amount of data, use OBS Browser+ to upload data or folders. The uploaded data must meet the dataset requirements of the ExeML project.
- Use an unencrypted bucket for the data. Otherwise, training will fail because the data cannot be decrypted.
- Training text files must be classified into at least two classes, and each class must contain at least 20 rows.
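The class requirement in the last item can be checked by counting labels before training. The following is a minimal sketch with a hypothetical function name (check_class_balance). It assumes that each row carries exactly one label and that the labels have already been read into a list of strings.

```python
from collections import Counter

MIN_CLASSES = 2
MIN_ROWS_PER_CLASS = 20


def check_class_balance(labels):
    """Check per-row labels against the class and row minimums.

    Assumption: one label per row; blank rows are ignored.
    """
    counts = Counter(label.strip() for label in labels if label.strip())
    problems = []
    if len(counts) < MIN_CLASSES:
        problems.append(f"found {len(counts)} class(es), need {MIN_CLASSES}")
    for name, n in sorted(counts.items()):
        if n < MIN_ROWS_PER_CLASS:
            problems.append(f"class {name!r} has {n} rows, need {MIN_ROWS_PER_CLASS}")
    return problems
```

For example, passing the lines of all label files to check_class_balance returns an empty list only when at least two classes are present and each has at least 20 rows.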
Creating a Dataset
After data preparation is complete, create a dataset of the type supported by the project. For details, see Creating a Dataset.