Updated on 2023-09-06 GMT+08:00

Introduction to Data Processing

ModelArts provides the data processing function to extract valuable, meaningful data from large volumes of disordered, difficult-to-understand data. Data that has just been collected and imported usually does not meet training requirements directly. To ensure data quality and avoid negative impact on subsequent operations (such as data labeling and model training), the data must be processed first. Common data processing types are as follows:

  • Data validation: Generally, data needs to be verified after being collected to ensure data validity.

    Data validation is the process of determining and verifying data availability. Collected data often cannot be processed further because of format problems. Take image recognition as an example: users often gather training images from the Internet, so image quality cannot be guaranteed. The name, path, or extension of an image may not meet the requirements of the training algorithm, and some images may be partially damaged, making them impossible for the algorithm to decode or process. Data validation is therefore essential. It helps AI developers detect data problems in advance and effectively prevents precision deterioration or training failures caused by noisy data.
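    The checks described above can be sketched in a few lines. This is an illustration, not ModelArts's actual implementation; the allowed extensions and magic-byte signatures below are assumptions a real pipeline would define for itself.

    ```python
    from pathlib import Path

    # Hypothetical validation rules; adjust to the training algorithm's needs.
    ALLOWED_EXTENSIONS = {".jpg", ".jpeg", ".png", ".bmp"}
    MAGIC_BYTES = (
        b"\xff\xd8\xff",        # JPEG files start with FF D8 FF
        b"\x89PNG\r\n\x1a\n",   # PNG 8-byte signature
        b"BM",                  # BMP signature
    )

    def validate_image(path: Path) -> list:
        """Return a list of problems found with one image file (empty = valid)."""
        problems = []
        if path.suffix.lower() not in ALLOWED_EXTENSIONS:
            problems.append("unsupported extension: " + path.suffix)
        try:
            header = path.read_bytes()[:8]
        except OSError as exc:
            return problems + ["unreadable file: " + str(exc)]
        if not any(header.startswith(sig) for sig in MAGIC_BYTES):
            problems.append("header does not match any supported image format")
        return problems
    ```

    Checking the file header rather than trusting the extension catches the "image cannot be decoded" case early, before the training job fails.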

  • Data cleansing: Data cleansing refers to the process of removing, correcting, or supplementing data.

    Data cleansing builds on data validation: it checks data consistency and corrects or removes invalid values. For example, in the deep learning field, data may be cleansed based on positive and negative samples provided by a user, retaining the categories the user wants and removing those the user does not.
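    A minimal sketch of category-based cleansing, assuming records are dictionaries with a `label` field (the record shape and the `keep_labels`/`drop_labels` parameters are illustrative, not a ModelArts API):

    ```python
    def cleanse(records, keep_labels, drop_labels):
        """Filter labeled records: retain wanted categories, remove unwanted
        ones, and drop records whose label is missing or empty."""
        cleaned = []
        for rec in records:
            label = rec.get("label")
            if not label or label in drop_labels:
                continue  # invalid value or explicitly unwanted category
            if keep_labels and label not in keep_labels:
                continue  # not in the user's wanted set
            cleaned.append(rec)
        return cleaned
    ```

    Here the positive samples define `keep_labels` and the negative samples define `drop_labels`, mirroring the retain/remove behavior described above.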

  • Data selection: Data selection refers to the process of selecting data subsets from full data.

    Data can be selected based on similarity measures or a deep learning algorithm. Data selection avoids problems such as duplicate and near-duplicate images introduced during manual image collection. When a batch of inference data is input to an existing model, selecting data with built-in rules can further improve that model's precision.
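    Similarity-based selection can be illustrated with a tiny perceptual hash: images whose hashes differ by only a few bits are treated as near-duplicates and skipped. The hash size, threshold, and flat pixel-list input are assumptions for the sketch; production systems use richer features.

    ```python
    def average_hash(pixels):
        """Tiny perceptual hash: one bit per pixel, set if above the mean."""
        mean = sum(pixels) / len(pixels)
        return tuple(p > mean for p in pixels)

    def hamming(a, b):
        """Number of differing bits between two hashes."""
        return sum(x != y for x, y in zip(a, b))

    def select_distinct(images, threshold=2):
        """Keep only images whose hash differs from every kept image by
        more than `threshold` bits (a stand-in for similarity selection)."""
        kept = []
        for name, pixels in images:
            h = average_hash(pixels)
            if all(hamming(h, kh) > threshold for _, kh in kept):
                kept.append((name, h))
        return [name for name, _ in kept]
    ```

    Raising `threshold` selects more aggressively (fewer, more diverse samples); lowering it keeps more borderline cases.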

  • Data augmentation: Data augmentation increases the data volume directly or indirectly, using either of the following methods:

    Data amplification: increases the data volume directly through simple operations on existing samples, such as scaling, cropping, transformation, and composition.

    Image generation: increases the data volume indirectly by training a deep learning model on the original dataset and using it to generate a new dataset.
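    The amplification operations above can be sketched on a plain 2-D pixel grid; the function names (`hflip`, `crop`, `augment`) are illustrative, and a real pipeline would use an image library instead of nested lists.

    ```python
    def hflip(img):
        """Horizontal flip: mirror each row."""
        return [row[::-1] for row in img]

    def crop(img, top, left, h, w):
        """Crop an h-by-w window whose top-left corner is at (top, left)."""
        return [row[left:left + w] for row in img[top:top + h]]

    def augment(img):
        """Produce simple variants of one image (original, flip, center
        crop), directly multiplying the training data volume."""
        variants = [img, hflip(img)]
        h, w = len(img), len(img[0])
        if h > 2 and w > 2:
            variants.append(crop(img, 1, 1, h - 2, w - 2))
        return variants
    ```

    Each input image yields several training samples, which is the "direct" amplification path; the generative path instead learns the data distribution and samples new images from it.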