
How Do I Optimize the Verification Task When the Delta Lake Data Volume Is Large?

This section describes how to use MgC to verify data consistency when the source Delta Lake cluster holds a large volume of data (for example, more than 10,000 tables).

Procedure

  1. Create a metadata connection to the source Delta Lake cluster.
  2. Use the metadata connection created in step 1 to create a metadata synchronization task that synchronizes metadata from the source cluster to MgC.
  3. Create several more metadata connections to the source Delta Lake cluster, each using the IP address and port of a different executor. Keep all other parameter settings the same as those of the metadata connection created in step 1.

    • The appropriate number of metadata connections depends on how many executors are available and how many tables need to be verified. If executor resources are sufficient and there are many tables to verify, creating more metadata connections can improve verification efficiency (see the sizing example after these notes).
    • To avoid duplicate metadata, create a synchronization task only for the metadata connection created in step 1. Do not create synchronization tasks for the additional connections.
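
    For example (an illustrative sizing only, not a product recommendation): if the source cluster has four executors with spare resources and about 40,000 tables to verify, you might create four metadata connections, one per executor, so that metadata requests are spread across all the executors instead of being funneled through a single one.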

  4. Create a table group and add the source tables to it. When creating the table group, select the metadata connection created in step 1.
  5. Create separate connections to the source and target executors. For details, see Creating an Executor Connection.
  6. Create and execute a data verification task for the source Delta Lake cluster and another for the target Delta Lake cluster. For more information, see Creating and Executing Verification Tasks. When configuring each task, add the parameter mgc.delta.metadata.client.ips in the spark-submit area and set its value to the IP addresses and ports of all the metadata connections, separated by commas (,), as shown in the examples below.

    For example, mgc.delta.metadata.client.ips = xx.xx.xx.xx:22,xx.xx.xx.xx:22
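
    To make the format explicit: if you created three metadata connections in step 3, the value contains one ip:port pair per connection. The following sketch uses placeholder endpoints; the actual values are the executor IP addresses and ports you entered when creating those connections:

    mgc.delta.metadata.client.ips = <executor-1-ip>:<port>,<executor-2-ip>:<port>,<executor-3-ip>:<port>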