Overview
MgC allows you to verify the consistency of data migrated from various big data computing and storage engines, such as Hive, HBase, Doris, and MaxCompute. Consistency verification ensures data accuracy and reliability and enables you to migrate big data to Huawei Cloud with confidence.
Precautions
- A pair of verification tasks for the source and the target must use the same verification method.
- If the source and target HBase clusters use different security authentication modes, do not execute the two verification tasks at the same time; otherwise, they will fail. This is because the authentication information must be handled differently in each cluster. The secured cluster requires authentication information to be loaded, whereas the non-secured cluster needs that information cleared.
- If the source Lindorm or HBase service is locked due to arrears, you can still create data connections and verification tasks, but data access and operations will be restricted, preventing verification tasks from being executed. Before starting data verification, ensure that the source Lindorm or HBase service is active and your account balance is sufficient. If the service is locked, promptly pay the fee to unlock it. Once the service is unlocked, you can execute the data verification tasks again.
- The verification results of data migrated between Hive 2.x and Hive 3.x may be inaccurate. In Hive 2.x, when you query data of the fixed-length type CHAR(N) and the actual data is shorter than the specified length N, Hive pads the string with spaces to reach the required length. In Hive 3.x, this padding does not occur during queries. As a result, the same data can produce different query results in the two versions. To avoid this issue, you are advised to use Beeline to perform the verification.
- When you use YARN to run data verification at the source and target MRS clusters, execute the verification tasks separately. Ensure that one task is completed before starting another.
- When you verify data consistency for clusters of MRS 3.3.0 or later, do not use cluster nodes as executors, or the verification will fail.
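The CHAR(N) padding difference between Hive 2.x and Hive 3.x noted in the precautions can be illustrated with a minimal sketch. It is not MgC code and does not call Hive. The helper `pad_char` and the sample values are hypothetical, and the sketch assumes the padding consists of trailing spaces. It only shows why a direct comparison of query results can report a mismatch for identical stored data.

```python
def pad_char(value: str, n: int) -> str:
    """Hypothetical helper: mimic CHAR(N) padding by adding trailing spaces up to length n."""
    return value.ljust(n)


# Same logical value, as it might be returned by the two versions (assumed sample data).
source_value = pad_char("abc", 5)  # padded result, as described for Hive 2.x
target_value = "abc"               # unpadded result, as described for Hive 3.x

# A direct comparison reports a difference even though the stored data is the same.
raw_match = source_value == target_value

# Comparing after removing trailing spaces shows the values are logically equal.
normalized_match = source_value.rstrip(" ") == target_value.rstrip(" ")

print(raw_match, normalized_match)
```

The sketch explains the symptom only. The recommended way to avoid it remains using Beeline for the verification.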
Notes and Constraints
- Before verifying data migrated from EMR Delta Lake to MRS Delta Lake, note the following:
  - If the source EMR cluster uses Spark 3.3.1, data verification is supported regardless of whether the source cluster contains metadata storage.
  - If the source EMR cluster uses Spark 2.4.8, data verification is supported only when the source cluster contains metadata storage.
- Verification is not available for HBase tables that only store cold data.
- A verification task must be completed within the same day. If the task execution extends past midnight (00:00), the verification results may be inaccurate. Plan verification tasks carefully to avoid executing across days.
- Field verification is not supported if the source Alibaba Cloud cluster uses ClickHouse 21.8.15.7 and the target Huawei Cloud cluster uses ClickHouse 23.3.2.37. This is because the two versions process IPv4 and IPv6 data types and function calculation results differently.
- During the daily incremental verification, hourly incremental verification, and date-based verification for Hive, partitions with a Date-type partition field that does not follow the standard YYYY-MM-DD format cannot be verified.
- MgC cannot verify the consistency of data migrated between secured HBase 2.x clusters. The accuracy of verification is impacted by version compatibility restrictions, differences in security authentication mechanisms, protocol and interface inconsistencies, as well as variations in feature support and configuration between different versions.
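Two of the constraints above lend themselves to a quick pre-check before a task is created: the YYYY-MM-DD requirement for Date-type partition values and the requirement that a task finish on the same day it starts. The following is a minimal sketch, not part of MgC. The function names and the idea of estimating a task duration in advance are assumptions made for illustration.

```python
from datetime import datetime, timedelta


def is_standard_date_partition(value: str) -> bool:
    """Return True only if value is a valid calendar date written strictly as YYYY-MM-DD."""
    try:
        parsed = datetime.strptime(value, "%Y-%m-%d")
    except ValueError:
        return False
    # strptime tolerates non-zero-padded input such as "2024-1-5";
    # formatting the parsed date back and comparing rejects those values.
    return parsed.strftime("%Y-%m-%d") == value


def crosses_midnight(start: datetime, expected_duration: timedelta) -> bool:
    """Return True if a task starting at `start` would end on a later calendar day."""
    return (start + expected_duration).date() > start.date()


print(is_standard_date_partition("2024-01-05"))
print(crosses_midnight(datetime(2024, 1, 5, 23, 30), timedelta(hours=1)))
```

A task that is estimated to end exactly at 00:00 is treated here as crossing midnight, which is the more conservative reading of the constraint.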
Verification Methods
- Full verification: The consistency of all inventory data is verified.
- Daily incremental verification: The consistency of incremental data is verified based on the creation or update time. You can choose to verify incremental data for one day or several consecutive days.
- Hourly incremental verification: Data consistency is verified based on the creation time or update time multiple times within 24 hours. The verification automatically stops at 00:00 on the next day.
- Date-based verification: This method applies only to tables partitioned by date in the year, month, and day format. You can choose to verify consistency of such tables for one day or several consecutive days. Tables that are not partitioned by date are not verified.
- Selective verification: This method can be used to verify the consistency of the data within a specified time period. You can only select a period going backward from the current time for verification.
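The time ranges these methods operate on can be pictured with a short sketch. It is illustrative only: how MgC computes its ranges internally is not described here, and the function names `consecutive_days` and `backward_window` are hypothetical. The first function models "one day or several consecutive days" for daily incremental or date-based verification. The second models a selective verification period that goes backward from the current time.

```python
from datetime import date, datetime, timedelta


def consecutive_days(last_day: date, count: int) -> list[date]:
    """Return `count` consecutive calendar days ending at `last_day`, oldest first."""
    if count < 1:
        raise ValueError("count must be at least 1")
    return [last_day - timedelta(days=offset) for offset in range(count - 1, -1, -1)]


def backward_window(now: datetime, length: timedelta) -> tuple[datetime, datetime]:
    """Return (start, end) for a period of `length` that ends at `now`."""
    if length <= timedelta(0):
        raise ValueError("length must be positive")
    return now - length, now


print(consecutive_days(date(2024, 1, 5), 3))
print(backward_window(datetime(2024, 1, 5, 12, 0), timedelta(hours=6)))
```

Because the period in `backward_window` always ends at `now`, it cannot extend into the future, which matches the restriction stated for selective verification.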
Supported Source and Target Components
Source Component | Target Component
---|---
 |
Verification Methods Available for Each Component
Component | Verification Method
---|---
Hive |
DLI |
MaxCompute |
Doris |
HBase |
ClickHouse | Full verification
ApsaraDB for ClickHouse | Full verification
CloudTable (HBase) |
CloudTable (ClickHouse) | Full verification
Delta |
Hudi |