Overview
MgC enables you to verify the consistency of data migrated from various big data computing and storage engines, such as Hive, HBase, Doris, and MaxCompute. Consistency verification ensures data accuracy and reliability and enables you to migrate big data to Huawei Cloud with confidence.
Precautions
- A pair of verification tasks for the source and the target must use the same verification method.
- If the data volume to be verified is large, a 99.5% success rate is considered normal.
- If the source and target HBase clusters use different security authentication modes, do not run their verification tasks at the same time, or the tasks will fail. The two clusters require authentication information to be handled differently: the secured cluster needs the information loaded, whereas the non-secured cluster needs it cleared.
- If the source Lindorm or HBase service is locked due to arrears, you can still create data connections and verification tasks, but data access and operations will be restricted, preventing verification tasks from being executed. Before starting data verification, ensure that the source big data service is active and your account balance is sufficient. If the service is locked, promptly pay the overdue amount to unlock it. Once the service is unlocked, you can run the data verification tasks again.
- The verification results of data migrated between Hive 2.x and Hive 3.x may be inaccurate. When you query a fixed-length CHAR(N) column in Hive 2.x, any value shorter than N is padded with spaces to length N. Hive 3.x does not apply this padding during queries. As a result, the same data can produce different query results in the two versions. To avoid this issue, you are advised to use Beeline to perform the verification.
- If you use Yarn to run verification tasks in the source and target MRS clusters, execute the verification tasks separately, and ensure that one task is completed before starting another.
- When you verify data consistency for clusters of MRS 3.3.0 or later, do not use cluster nodes as executors, or the verification will fail.
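The CHAR(N) padding difference between Hive 2.x and Hive 3.x described above can be illustrated with a minimal Python sketch. It does not show how MgC compares data, and the helper names `pad_char` and `normalize_char` are hypothetical. It only demonstrates why a padded value and an unpadded value fail a direct comparison, and how they match once trailing spaces are ignored.

```python
def pad_char(value: str, n: int) -> str:
    """Mimic CHAR(N) right-padding with spaces (illustrative only)."""
    return value.ljust(n)[:n]


def normalize_char(value: str) -> str:
    """Strip trailing spaces so padded and unpadded values compare equal."""
    return value.rstrip(" ")


# Hypothetical values for the same CHAR(5) field read from two engines.
source_value = pad_char("abc", 5)  # padded to five characters
target_value = "abc"               # returned without padding

raw_match = source_value == target_value
normalized_match = normalize_char(source_value) == normalize_char(target_value)
```

In this sketch, `raw_match` is false while `normalized_match` is true, which is the kind of spurious difference the precaution warns about.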
Constraints
- Before verifying data migrated from EMR Delta Lake to MRS Delta Lake, please note:
  - If the source EMR cluster uses Spark 3.3.1, data verification is supported regardless of whether the source cluster contains metadata storage.
  - If the source EMR cluster uses Spark 2.4.8, data verification is supported only when the source cluster contains metadata storage.
- Verification is not supported for HBase tables that only store cold data.
- A verification task must be completed within one day. If the task extends past midnight (00:00), the verification results may be inaccurate. Plan verification tasks carefully to avoid execution across days.
- Field verification is not supported if the source Alibaba Cloud cluster uses ClickHouse 21.8.15.7 and the target Huawei Cloud cluster uses ClickHouse 23.3.2.37. This is because the two versions process IPv4 and IPv6 data types and function calculation results differently.
- In daily incremental verification, hourly incremental verification, and date-based verification for Hive, date partitions whose values do not follow the standard YYYY-MM-DD format cannot be verified.
- Content verification is supported for unsecured HBase clusters, regardless of whether the clusters are self-built or created using cloud services like EMR for HBase, MRS (HBase), and CloudTable (HBase).
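Before you create Hive verification tasks, it can help to check which partition values follow the YYYY-MM-DD format required above. The following Python sketch is one possible pre-check, not part of MgC. The function name `is_standard_date_partition` and the sample partition values are hypothetical.

```python
from datetime import datetime


def is_standard_date_partition(value: str) -> bool:
    """Return True only if value is a valid calendar date written as YYYY-MM-DD."""
    try:
        parsed = datetime.strptime(value, "%Y-%m-%d")
    except ValueError:
        return False
    # strptime accepts unpadded fields such as "2024-1-5",
    # so require an exact round trip to enforce zero padding.
    return parsed.strftime("%Y-%m-%d") == value


# Hypothetical partition values collected from a Hive table.
partitions = ["2024-01-05", "20240105", "2024-1-5", "2024/01/05"]
unverifiable = [p for p in partitions if not is_standard_date_partition(p)]
```

Here `unverifiable` collects the values that do not match the standard format, so you can identify the affected partitions in advance.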
Verification Methods
- Full verification: The consistency of all inventory data is verified.
- Daily incremental verification: The consistency of incremental data is verified based on the creation or update time. You can choose to verify incremental data for one day or several consecutive days.
- Hourly incremental verification: Data consistency is verified based on the creation time or update time multiple times within 24 hours. The verification automatically stops at 00:00 on the next day.
- Date-based verification: This method applies only to tables partitioned by date in the year, month, and day format. You can choose to verify consistency of such tables for one day or several consecutive days. Tables that are not partitioned by date are not verified.
- Selective verification: The consistency of data within a specified time period is verified. The period you select must extend backward from the current time.
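Because hourly incremental verification stops at 00:00 on the next day, and a verification task must be completed within one day, you may want to estimate whether a task will finish before midnight. The following Python sketch shows one way to do this. The function names and the estimated duration are assumptions for illustration, and the sketch uses local time without time zone handling.

```python
from datetime import datetime, timedelta


def next_midnight(start: datetime) -> datetime:
    """Return 00:00 on the day after start."""
    return datetime.combine(start.date() + timedelta(days=1), datetime.min.time())


def finishes_same_day(start: datetime, estimated_duration: timedelta) -> bool:
    """Return True if a task started at start is expected to end before the next midnight."""
    return start + estimated_duration < next_midnight(start)


# Hypothetical task start time and duration estimates.
start = datetime(2024, 1, 5, 22, 0)
short_task_ok = finishes_same_day(start, timedelta(hours=1))
long_task_ok = finishes_same_day(start, timedelta(hours=3))
```

With a 22:00 start, a one-hour task stays within the same day, whereas a three-hour task would run past 00:00 and should be scheduled earlier.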
Supported Source and Target Components
Source Component | Target Component |
---|---|
| |
Verification Methods Available for Each Component
Component | Verification Method |
---|---|
Hive | |
DLI | |
MaxCompute | |
Doris | |
HBase | |
ClickHouse | Full verification |
ApsaraDB for ClickHouse | Full verification |
CloudTable (HBase) | |
CloudTable (ClickHouse) | Full verification |
Delta | |
Hudi | |