Updated on 2024-12-20 GMT+08:00

Creating and Executing Verification Tasks

You can use the created source and target connections to create verification tasks.

For details about the supported big data components and verification methods, see Overview.

Precautions

  • A pair of verification tasks for the source and the target must use the same verification method.
  • If the source and target HBase clusters use different security authentication modes, do not execute their verification tasks at the same time, or the tasks will fail. This is because authentication information must be handled differently in each cluster: the secured cluster requires the authentication information to be loaded, whereas the non-secured cluster requires it to be cleared.
  • A verification task must be completed within a single day. If task execution extends past midnight (00:00), the verification results may be inaccurate. Plan verification tasks carefully so that they do not run across days.
  • If the source Lindorm service is locked due to arrears, you can still create data connections and verification tasks, but data access and operations are restricted, so verification tasks cannot be executed. Before starting data verification, ensure that the source Lindorm service is active and your account balance is sufficient. If the service is locked, pay the outstanding fee promptly to unlock it. Once the service is unlocked, you can execute the data verification tasks again.

Notes and Constraints

  • Before verifying data migrated from EMR Delta Lake to MRS Delta Lake, please note:
    • If the source EMR cluster uses Spark 3.3.1, data verification is supported regardless of whether the source cluster contains metadata storage.
    • If the source EMR cluster uses Spark 2.4.8, data verification is supported only when the source cluster contains metadata storage.
  • Verification is not available for HBase tables that only store cold data.
  • The verification results of data migrated between Hive 2.x and Hive 3.x may be inaccurate. In Hive 2.x, when you query a fixed-length CHAR(N) column, if the actual data is shorter than the specified length N, Hive pads the string with trailing spaces to that length. In Hive 3.x, this padding does not occur during queries. To avoid this issue, you are advised to use Beeline to perform the verification.
  • Field verification is not supported if the source Alibaba Cloud cluster uses ClickHouse 21.8.15.7 and the target Huawei Cloud cluster uses ClickHouse 23.3.2.37. This is because the two versions process IPv4 and IPv6 data types and function calculation results differently.
  • During daily incremental verification, hourly incremental verification, and date-based verification for Hive, partitions whose Date-type partition field does not follow the standard YYYY-MM-DD format cannot be verified.
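The CHAR(N) padding difference between Hive 2.x and Hive 3.x noted above can be illustrated outside Hive. This sketch only emulates the two behaviors with shell string padding; the value and length are hypothetical examples, not output from either Hive version:

```shell
# Emulate how the two Hive versions return a CHAR(5) value holding 'ab'.
# Hive 2.x pads the value with trailing spaces to the declared length N;
# Hive 3.x returns it as stored, so checksums over the column can differ.
hive2_value=$(printf '%-5s' 'ab')   # "ab   " (padded to 5 characters)
hive3_value='ab'                    # "ab" (no padding on read)
echo "Hive 2.x: [${hive2_value}]"
echo "Hive 3.x: [${hive3_value}]"
```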

Prerequisites

Procedure

  1. Sign in to the MgC console.
  2. In the navigation pane on the left, choose Migrate > Big Data Verification. Select a migration project in the upper left corner of the page.
  3. In the Features area, click Task Management.
  4. Click Create Task in the upper right corner of the page.
  5. Select a big data component and verification method as needed and click Next.
  6. Configure task parameters based on the selected big data component and verification method.

    The task parameters vary with the big data component.

    Table 1 Parameters for creating a full verification task

    Area

    Parameter

    Configuration

    Basic Info

    Task Name

    The default name is Component-full-verification-4 random characters (including letters and numbers). You can also customize a name.

    Task Settings

    Table Groups

    Select the table groups that contain the tables to be verified.

    Executor Connection

    This parameter is available for Hive, Delta Lake, and Hudi.

    • To create a verification task for the source, select the source executor connection.
    • To create a verification task for the target, select the target executor connection.

    DLI Connection

    If DLI is selected in the previous step, the task can only be created for the target. You need to select the created DLI connection.

    Doris Connection

    If Doris is selected in the previous step:

    • To create a verification task for the source, select the source Doris connection.
    • To create a verification task for the target, select the target Doris connection.

    HBase Connection

    This parameter is available for HBase and CloudTable (HBase).

    • To create a verification task for the source, select the source HBase connection.
    • To create a verification task for the target, select the target HBase or CloudTable (HBase) connection.

    ClickHouse Connection

    This parameter is available for ClickHouse, Alibaba Cloud ApsaraDB for ClickHouse, and CloudTable (ClickHouse).

    • To create a verification task for the source, select the source MRS ClickHouse or Alibaba Cloud ApsaraDB for ClickHouse connection.
    • To create a verification task for the target, select the target MRS ClickHouse or CloudTable (ClickHouse) connection.

    Metadata Connection (Optional)

    This parameter is available for Delta Lake.

    To improve processing efficiency, you can select metadata connections to the source Delta Lake cluster that the table groups to be verified belong to.

    Execution Time

    Specify when the task will be executed. After the task is activated, it will be automatically executed at the specified time every day.

    NOTE:

    You are advised to run the verification during off-peak hours.

    Advanced Options

    • Concurrency: Specify the number of concurrent threads on an executor for the verification task. The default value is 3. The value ranges from 1 to 10.
      CAUTION:

      If you are creating a verification task for an Alibaba Cloud EMR Hive cluster, set this parameter based on the source data volume and master node specifications. Consider the following rules:

      • The total number of concurrent threads for verification tasks running simultaneously in the source cluster cannot exceed 70% of the total number of cores on the metadata node.
      • The total resources allocated to verification tasks cannot exceed the resources of the execution queue. The total resources allocated to tasks can be calculated as follows:

        Allocated memory = Number of executors × Memory on an executor × Concurrency

        Allocated cores = Number of executors × Cores on an executor × Concurrency

      Assume that the total source data volume is 500 GB spread across 10,000 tables, in which there are 8 large tables with 50 GB data and 100,000 partitions. The master node has 8 vCPUs and 32 GB of memory.

      • According to rule 1, the maximum concurrency is 5, which is 5.6 (0.7 × 8) rounded down.
      • According to rule 2, you need to select spark-sql in the Execution Command area and set the following parameters:

        executor-memory = 4G

        master = yarn

        num-executors = 20

        executor-cores = 2

        driver-memory = 10G

    • Max. SQL Statements Per File: Each time the task is executed, files are created for storing the SQL statements generated for querying tables. You can control how many SQL statements can be stored in a single file. The default value is 10. The recommended value ranges from 1 to 50.
    • Timeout (s): indicates the maximum time allowed for an SQL statement to complete normally. The unit is second (s). The default value is 600. The value ranges from 600 to 7,200.
    • Send SMN Notifications: Determine whether to use SMN to notify you of the task status in a timely manner through emails, SMS messages, or customized URLs.
      NOTICE:
      • Before enabling this function, you need to create a topic on the SMN console. For details, see Creating a Topic.
      • Using this function may incur a small amount of fees, which are billed by SMN. For details, see SMN Billing.
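The concurrency rules in the caution note above can be checked with simple arithmetic. This sketch plugs in the example values from that note (8 cores on the metadata node, num-executors=20, executor-memory=4G, executor-cores=2):

```shell
# Rule 1: concurrency cannot exceed 70% of the metadata node's cores, rounded down.
cores=8
max_concurrency=$(awk -v c="$cores" 'BEGIN { printf "%d", 0.7 * c }')
echo "max concurrency: $max_concurrency"   # floor(5.6) = 5

# Rule 2: total resources allocated to the task, per the formulas above.
num_executors=20
executor_memory_gb=4
executor_cores=2
echo "allocated memory: $((num_executors * executor_memory_gb * max_concurrency)) GB"
echo "allocated cores:  $((num_executors * executor_cores * max_concurrency))"
```

Compare the two totals against the resources of the execution queue before settling on a concurrency value.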

    Data Filtering

    Non-partitioned Table Verification

    Decide how to verify non-partitioned tables.

    • Verify all: All non-partitioned tables are verified.
    • Skip all: All non-partitioned tables are skipped during consistency verification.
    • Filter by update time: Only non-partitioned tables whose update time falls within the specified time range are verified. The recorded update time of a non-partitioned table may be inaccurate if the table contains data written by operations other than INSERT, and such tables may be excluded from verification.

    Advanced Options

    Max Fields Per SQL Statement: Limit how many fields can be queried by one SQL statement. Too many or too few fields in an SQL statement results in low query efficiency. The default value 0 means no limit is set. The value ranges from 100 to 500.

    OBS Bucket Check

    -

    • If you need to upload task logs and content verification results to an OBS bucket for management and analysis, configure an OBS bucket. After the bucket is configured, the task logs and content verification results will be automatically uploaded to the specified OBS bucket.
    • If you do not need to upload the task logs and content verification results to OBS, select I confirm that I only need to view logs and data verification results on Edge and do not need to upload them to OBS.

    Command Parameters

    (This parameter is only available for Hive.)

    Security Authentication

    If security authentication (for example, Kerberos authentication) is enabled for the big data cluster, select this option and configure the security authentication command. You must first manually upload the .keytab file that contains the authentication key to the executor.

    • Keytab Path: Enter the path where the .keytab file is stored on the executor.
    • Keytab Principal: Enter the principal of the .keytab file.
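As a sketch of what the executor does with these settings, Kerberos authentication with a keytab typically looks like the following. The path and principal below are placeholders; use the values configured for your cluster:

```shell
# Obtain a Kerberos ticket from the keytab before running verification SQL.
# Both values are hypothetical examples.
kinit -kt /opt/mgc/user.keytab hive/hadoop.example.com@EXAMPLE.COM

# Confirm that the ticket cache now holds a valid ticket.
klist
```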

    Execution Command

    You can configure Beeline or Spark SQL command parameters to run SQL statements for verifying data consistency.

    • Beeline: a command line tool used to interact with Hive.
    • Spark SQL: used to execute SQL statements to query and analyze data.
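For reference, the two tools run verification SQL roughly as follows. The JDBC URL, file name, and SQL statement are illustrative placeholders, not values produced by MgC:

```shell
# Beeline: connect to HiveServer2 over JDBC and run a generated SQL file.
beeline -u "jdbc:hive2://hive-server.example.com:10000/default" \
        -f /tmp/verification_0001.sql

# Spark SQL: run the same kind of statement through the Spark engine.
spark-sql --master yarn \
          -e "SELECT COUNT(*) FROM demo_db.demo_table"
```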

    MaxCompute Parameters

    (This item is only available for MaxCompute.)

    -

    Add MaxCompute parameters as needed. For details about the parameters, see MaxCompute Documentation.

    Execution Settings

    (This item is only available for DLI.)

    Parameter

    Set the parameters as required. For details about the supported custom parameters, see Custom Parameters.

    Execution Settings

    (This item is only available for HBase.)

    Run Mode

    The supported run modes include:

    • Yarn: This mode applies to large-scale distributed environments. It can make full use of cluster resources and improve task concurrency and efficiency.
    • Local: This mode applies to small-scale datasets or development and test environments for quick debugging and verification.

    Parameters

    Add command parameters based on the selected run mode and requirements.

    Command Parameters (This item is only available for Delta Lake and Hudi.)

    Security Authentication (available for target Delta Lake clusters)

    If security authentication (for example, Kerberos authentication) is enabled for the big data cluster, select this option and configure the security authentication command. You must first manually upload the .keytab file that contains the authentication key to the executor.

    • Keytab Path: Enter the path where the .keytab file is stored on the executor.
    • Keytab Principal: Enter the principal of the .keytab file.

    spark-sql

    Spark SQL is a module used in Spark for structured data processing. In Spark applications, you can seamlessly use SQL statements or DataSet APIs to query structured data. For more information, see SparkSQL Principles. Retain the default settings.

    spark-submit

    This is a basic Spark shell command used to submit Spark applications. The command is as follows:

    ./bin/spark-submit \
      --class <main-class> \
      --master <master-url> \
      ... # other options
      <application-jar> \
      [application-arguments]

    Parameter description:

    • --class: indicates the entry-point class of the Spark application.
    • --master: indicates the master URL that the Spark application connects to, such as yarn-client or yarn-cluster.
    • application-jar: indicates the path of the JAR file of the Spark application.
    • application-arguments: indicates the arguments passed when submitting the Spark application. (This parameter can be empty.)

    The parameters that need to be added depend on the scenario:

    • If you are creating a verification task for an Alibaba Cloud EMR Delta Lake cluster that uses Spark 3, add the following parameters:
      • Parameter: jars
      • Value: '/opt/apps/DELTALAKE/deltalake-current/spark3-delta/delta-core_2.12-*.jar,/opt/apps/DELTALAKE/deltalake-current/spark3-delta/delta-storage-*.jar'
        CAUTION:

        Replace the parameter values with the actual environment directory and Delta Lake version.

    • If you are creating a verification task for an Alibaba Cloud EMR Delta Lake 2.1.0 cluster that uses Spark 2.4.8, add the following parameter:
      • Parameter: mgc.delta.spark.version
      • Value: 2
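When the jars parameter above is applied, the underlying launch is roughly equivalent to passing --jars to spark-sql. This is only a sketch of how the configured value reaches Spark; the directory and wildcarded file names must match your actual EMR environment and Delta Lake version:

```shell
# Illustration only: how the configured jars parameter reaches Spark.
# Replace the directory and version with the ones in your cluster.
spark-sql \
  --master yarn \
  --jars '/opt/apps/DELTALAKE/deltalake-current/spark3-delta/delta-core_2.12-*.jar,/opt/apps/DELTALAKE/deltalake-current/spark3-delta/delta-storage-*.jar'
```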
    Table 2 Parameters for creating a daily incremental verification task

    Area

    Parameter

    Configuration

    Basic Info

    Task Name

    The default name is Component-daily-incremental-verification-4 random characters (including letters and numbers). You can also customize a name.

    Task Settings

    Table Groups

    Select the table groups that contain the tables to be verified.

    Executor Connection

    This parameter is available for Hive, Delta Lake, and Hudi.

    • To create a verification task for the source, select the source executor connection.
    • To create a verification task for the target, select the target executor connection.

    DLI Connection

    If DLI is selected in the previous step, the task can only be created for the target. You need to select the created DLI connection.

    Doris Connection

    If Doris is selected in the previous step:

    • To create a verification task for the source, select the source Doris connection.
    • To create a verification task for the target, select the target Doris connection.

    HBase Connection

    If HBase is selected in the previous step:

    • To create a verification task for the source, select the source HBase connection.
    • To create a verification task for the target, select the target HBase or CloudTable (HBase) connection.

    Metadata Connection (Optional)

    This parameter is available for Hive and Delta Lake.

    If you are creating a verification task for the target Hive cluster, select the metadata connection to this Hive cluster. The connection is used to check whether the partitions to be verified can be found in the target cluster.

    If you are creating a verification task for the source Delta Lake cluster, to improve processing efficiency, you can select metadata connections to the source Delta Lake cluster.

    Execution Time

    Specify when the task will be executed. After the task is activated, it will be automatically executed at the specified time every day.

    Advanced Options

    • Concurrency: Specify the number of concurrent threads on an executor for the verification task. The default value is 3. The value ranges from 1 to 10.
      CAUTION:

      If you are creating a verification task for an Alibaba Cloud EMR Hive cluster, set this parameter based on the source data volume and master node specifications. Consider the following rules:

      • The total number of concurrent threads for verification tasks running simultaneously in the source cluster cannot exceed 70% of the total number of cores on the metadata node.
      • The total resources allocated to verification tasks cannot exceed the resources of the execution queue. The total resources allocated to tasks can be calculated as follows:

        Allocated memory = Number of executors × Memory on an executor × Concurrency

        Allocated cores = Number of executors × Cores on an executor × Concurrency

      Assume that the total source data volume is 500 GB spread across 10,000 tables, in which there are 8 large tables with 50 GB data and 100,000 partitions. The master node has 8 vCPUs and 32 GB of memory.

      • According to rule 1, the maximum concurrency is 5, which is 5.6 (0.7 × 8) rounded down.
      • According to rule 2, you need to select spark-sql in the Execution Command area and set the following parameters:

        executor-memory = 4G

        master = yarn

        num-executors = 20

        executor-cores = 2

        driver-memory = 10G

    • Max. SQL Statements Per File: Each time the task is executed, files are created for storing the SQL statements generated for querying tables. You can control how many SQL statements can be stored in a single file. The default value is 10. The recommended value ranges from 1 to 50.
    • Timeout (s): indicates the maximum time allowed for an SQL statement to complete normally. The unit is second (s). The default value is 600. The value ranges from 600 to 7,200.
    • Send SMN Notifications: Determine whether to use SMN to notify you of the task status in a timely manner through emails, SMS messages, or customized URLs.
      NOTICE:
      • Before enabling this function, you need to create a topic on the SMN console. For details, see Creating a Topic.
      • Using this function may incur a small amount of fees, which are billed by SMN. For details, see SMN Billing.

    Data Filtering

    Incremental Scope

    Select the time period for which incremental data is verified. By default, a 24-hour period is selected. T indicates the task execution time, and T-n indicates a time n × 24 hours before the execution time.

    If you select Consecutive days, the system verifies the consistency of the incremental data generated during these consecutive days.

    Non-partitioned Table Verification

    Decide how to verify non-partitioned tables.

    • Verify all: All non-partitioned tables are verified.
    • Skip all: All non-partitioned tables are skipped during consistency verification.
    • Filter by update time: Only non-partitioned tables whose update time falls within the specified time range are verified. The recorded update time of a non-partitioned table may be inaccurate if the table contains data written by operations other than INSERT, and such tables may be excluded from verification.

    Advanced Options

    • Partition Filtering: Determine whether to filter table partitions by creation time or by update time.
      • By update time: An update time is the timestamp when a table partition was last modified or updated. Choose this option if you are concerned about the latest status or changes of data in a partition.
      • By creation time: A creation time is the timestamp when a partition was created. Choose this option if you are concerned about the data generated from the time the partition was created up to a certain time point.
    • Max. Partitions: Limit how many partitions in a table are verified. The default value is 3. The value ranges from 1 to 50.

      For example, if this parameter is set to 3, the system verifies the consistency of only the first three partitions that are sorted by ID in descending order.

    • Max Fields Per SQL Statement: Limit how many fields can be queried by one SQL statement. Too many or too few fields in an SQL statement results in low query efficiency. The default value 0 means no limit is set. The value ranges from 100 to 500.

    OBS Bucket Check

    -

    • If you need to upload task logs and content verification results to an OBS bucket for management and analysis, configure an OBS bucket. After the bucket is configured, the task logs and content verification results will be automatically uploaded to the specified OBS bucket.
    • If you do not need to upload the task logs and content verification results to OBS, select I confirm that I only need to view logs and data verification results on Edge and do not need to upload them to OBS.

    Execution Script

    (This parameter is only available for Hive.)

    Security Authentication

    If security authentication (for example, Kerberos authentication) is enabled for the big data cluster, select this option and configure the security authentication command. You must first manually upload the .keytab file that contains the authentication key to the executor.

    • Keytab Path: Enter the path where the .keytab file is stored on the executor.
    • Keytab Principal: Enter the principal of the .keytab file.

    Execution Command

    You can configure Beeline or Spark SQL command parameters to run SQL statements for verifying data consistency.

    • Beeline: a command line tool used to interact with Hive.
    • Spark SQL: a command line tool used to execute SQL statements to query and analyze Hive data.

    MaxCompute Parameters

    (This item is only available for MaxCompute.)

    -

    Add MaxCompute parameters as needed. For details about the parameters, see MaxCompute Documentation.

    Execution Settings

    (This item is only available for DLI.)

    Parameter

    Set the parameters as required. For details about the supported custom parameters, see Custom Parameters.

    Statistics Configuration

    (This item is only available for HBase.)

    Run Mode

    The supported run modes include:

    • Yarn: This mode applies to large-scale distributed environments. It can make full use of cluster resources and improve task concurrency and efficiency.
    • Local: This mode applies to small-scale datasets or development and test environments. It enables rapid debugging and verification.

    Parameter

    Add command parameters based on the selected run mode and requirements.

    Command Parameters (available for Delta Lake and Hudi)

    Security Authentication (Only available for target Delta Lake clusters)

    If security authentication (for example, Kerberos authentication) is enabled for the big data cluster, select this option and configure the security authentication command. You must first manually upload the .keytab file that contains the authentication key to the executor.

    • Keytab Path: Enter the path where the .keytab file is stored on the executor.
    • Keytab Principal: Enter the principal of the .keytab file.

    spark-sql

    Spark SQL is a module used in Spark for structured data processing. In Spark applications, you can seamlessly use SQL statements or DataSet APIs to query structured data. For more information, see SparkSQL Principles. Retain the default settings.

    spark-submit

    This is a basic Spark shell command used to submit Spark applications. The command is as follows:

    ./bin/spark-submit \
      --class <main-class> \
      --master <master-url> \
      ... # other options
      <application-jar> \
      [application-arguments]

    Parameter description:

    • --class: indicates the entry-point class of the Spark application.
    • --master: indicates the master URL that the Spark application connects to, such as yarn-client or yarn-cluster.
    • application-jar: indicates the path of the JAR file of the Spark application.
    • application-arguments: indicates the arguments passed when submitting the Spark application. (This parameter can be empty.)

    The parameters that need to be added depend on the scenario:

    • If you are creating a verification task for an Alibaba Cloud EMR Delta Lake cluster that uses Spark 3, add the following parameters:
      • Parameter: jars
      • Value: '/opt/apps/DELTALAKE/deltalake-current/spark3-delta/delta-core_2.12-*.jar,/opt/apps/DELTALAKE/deltalake-current/spark3-delta/delta-storage-*.jar'
        CAUTION:

        Replace the parameter values with the actual environment directory and Delta Lake version.

    • If you are creating a verification task for an Alibaba Cloud EMR Delta Lake 2.1.0 cluster that uses Spark 2.4.8, add the following parameter:
      • Parameter: mgc.delta.spark.version
      • Value: 2
    Table 3 Parameters for creating an hourly incremental verification task

    Area

    Parameter

    Configuration

    Basic Info

    Task Name

    The default name is Component-hourly-incremental-verification-4 random characters (including letters and numbers). You can also customize a name.

    Task Settings

    Table Groups

    Select the table groups that contain the tables to be verified.

    Executor Connection

    This parameter is available for Hive, Delta Lake, and Hudi.

    • To create a verification task for the source, select the source executor connection.
    • To create a verification task for the target, select the target executor connection.

    DLI Connection

    If DLI is selected in the previous step, the task can only be created for the target. You need to select the created DLI connection.

    Doris Connection

    If Doris is selected in the previous step:

    • To create a verification task for the source, select the source Doris connection.
    • To create a verification task for the target, select the target Doris connection.

    Metadata Connection (Optional)

    This parameter is available for Hive and Delta Lake.

    If you are creating a verification task for the target Hive cluster, select the metadata connection to this Hive cluster. The connection is used to check whether the partitions to be verified can be found in the target cluster.

    To improve processing efficiency, you can select metadata connections to the source Delta Lake cluster that the table groups to be verified belong to.

    Start Time

    Specify when the task will be executed. After the task is activated, it will be automatically executed at the specified time.

    Advanced Options

    • Concurrency: Specify the number of concurrent threads on an executor for the verification task. The default value is 3. The value ranges from 1 to 10.
      CAUTION:

      If you are creating a verification task for an Alibaba Cloud EMR Hive cluster, set this parameter based on the source data volume and master node specifications. Consider the following rules:

      • The total number of concurrent threads for verification tasks running simultaneously in the source cluster cannot exceed 70% of the total number of cores on the metadata node.
      • The total resources allocated to verification tasks cannot exceed the resources of the execution queue. The total resources allocated to tasks can be calculated as follows:

        Allocated memory = Number of executors × Memory on an executor × Concurrency

        Allocated cores = Number of executors × Cores on an executor × Concurrency

      Assume that the total source data volume is 500 GB spread across 10,000 tables, in which there are 8 large tables with 50 GB data and 100,000 partitions. The master node has 8 vCPUs and 32 GB of memory.

      • According to rule 1, the maximum concurrency is 5, which is 5.6 (0.7 × 8) rounded down.
      • According to rule 2, you need to select spark-sql in the Execution Command area and set the following parameters:

        executor-memory = 4G

        master = yarn

        num-executors = 20

        executor-cores = 2

        driver-memory = 10G

    • Max. SQL Statements Per File: Each time the task is executed, files are created for storing the SQL statements generated for querying tables. You can control how many SQL statements can be stored in a single file. The default value is 10. The recommended value ranges from 1 to 50.
    • Timeout (s): indicates the maximum time allowed for an SQL statement to complete normally. The unit is second (s). The default value is 600. The value ranges from 600 to 7,200.
    • Send SMN Notifications: Determine whether to use SMN to notify you of the task status in a timely manner through emails, SMS messages, or customized URLs.
      NOTICE:
      • Before enabling this function, you need to create a topic on the SMN console. For details, see Creating a Topic.
      • Using this function may incur a small amount of fees, which are billed by SMN. For details, see SMN Billing.

    Data Filtering

    Execution Interval

    Control how frequently the task will be executed.

    Non-partitioned Table Verification

    Decide how to verify non-partitioned tables.

    • Verify all: All non-partitioned tables are verified.
    • Skip all: All non-partitioned tables are skipped during consistency verification.
    • Filter by update time: Only non-partitioned tables whose update time falls within the specified time range are verified. The recorded update time of a non-partitioned table may be inaccurate if the table contains data written by operations other than INSERT, and such tables may be excluded from verification.

    Advanced Options

    • Partition Filtering: Determine whether to filter table partitions by creation time or by update time.
      • By update time: An update time is the timestamp when a table partition was last modified or updated. Choose this option if you are concerned about the latest status or changes of data in a partition.
      • By creation time: A creation time is the timestamp when a partition was created. Choose this option if you are concerned about the data generated from the time the partition was created up to a certain time point.
    • Max. Partitions: Limit how many partitions in a table are verified. The default value is 3. The value ranges from 1 to 50.

      For example, if this parameter is set to 3, the system verifies the consistency of only the first three partitions that are sorted by ID in descending order.

    • Max Fields Per SQL Statement: Limit how many fields can be queried by one SQL statement. Too many or too few fields in an SQL statement results in low query efficiency. The default value 0 means no limit is set. The value ranges from 100 to 500.

    OBS Bucket Check

    -

    • If you need to upload task logs and content verification results to an OBS bucket for management and analysis, configure an OBS bucket. After the bucket is configured, the task logs and content verification results will be automatically uploaded to the specified OBS bucket.
    • If you do not need to upload the task logs and content verification results to OBS, select I confirm that I only need to view logs and data verification results on Edge and do not need to upload them to OBS.

    Execution Script

    (This parameter is only available for Hive.)

    Security Authentication

    If security authentication (for example, Kerberos authentication) is enabled for the big data cluster, select this option and configure the security authentication command. You must first manually upload the .keytab file that contains the authentication key to the executor.

    • Keytab Path: Enter the path where the .keytab file is stored on the executor.
    • Keytab Principal: Enter the principal of the .keytab file.

    Execution Command

    You can configure Beeline or Spark SQL command parameters to run SQL statements for consistency verification.

    • Beeline: a command line tool used to interact with Hive.
    • Spark SQL: a command line tool used to execute SQL statements to query and analyze Hive data.

    MaxCompute Parameters

    (This item is only available for MaxCompute.)

    -

    Add MaxCompute parameters as needed. For details about the parameters, see MaxCompute Documentation.

    Execution Settings

    (This item is only available for DLI.)

    Parameter

    Set the parameters as required. For details about the supported custom parameters, see Custom Parameters.

    Command Parameters (available for Delta Lake and Hudi)

    Security Authentication (only available for target Delta Lake clusters)

    If security authentication (for example, Kerberos authentication) is enabled for the big data cluster, select this option and configure the security authentication command. You must first manually upload the .keytab file that contains the authentication key to the executor.

    • Keytab Path: Enter the path where the .keytab file is stored on the executor.
    • Keytab Principal: Enter the principal of the .keytab file.

    spark-sql

    Spark SQL is a module used in Spark for structured data processing. In Spark applications, you can seamlessly use SQL statements or DataSet APIs to query structured data. For more information, see SparkSQL Principles. Retain the default settings.

    spark-submit

    This is the basic command used to submit Spark applications. Its general form is as follows:

    ./bin/spark-submit \
      --class <main-class> \
      --master <master-url> \
      ... # other options
      <application-jar> \
      [application-arguments]

    Parameter description:

    • --class: indicates the entry-point class of the Spark application.
    • --master: indicates the master that the Spark application connects to, such as yarn-client or yarn-cluster.
    • application-jar: indicates the path of the JAR file of the Spark application.
    • application-arguments: indicates the arguments passed to the Spark application. (This parameter can be empty.)

    The parameters that need to be added depend on the scenario:

    • If you are creating a verification task for an Alibaba Cloud EMR Delta Lake cluster that uses Spark 3, add the following parameter:
      • Parameter: jars
      • Value: '/opt/apps/DELTALAKE/deltalake-current/spark3-delta/delta-core_2.12-*.jar,/opt/apps/DELTALAKE/deltalake-current/spark3-delta/delta-storage-*.jar'
        CAUTION:

        Replace the parameter values with the actual environment directory and Delta Lake version.

    • If you are creating a verification task for an Alibaba Cloud EMR Delta Lake 2.1.0 cluster that uses Spark 2.4.8, add the following parameter:
      • Parameter: mgc.delta.spark.version
      • Value: 2
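    For the Spark 3 scenario above, the resulting spark-submit invocation might look like the following sketch. The application class and JAR name are illustrative placeholders; the --jars value is the one from the table above and must be replaced with your environment's actual directory and Delta Lake version.

    ```shell
    # Illustrative only: the application class and JAR below are placeholders.
    DELTA_JARS='/opt/apps/DELTALAKE/deltalake-current/spark3-delta/delta-core_2.12-*.jar,/opt/apps/DELTALAKE/deltalake-current/spark3-delta/delta-storage-*.jar'

    # Build the command as a string so it can be reviewed before running.
    CMD="./bin/spark-submit --master yarn --jars ${DELTA_JARS} --class com.example.VerifyJob verify.jar"
    echo "${CMD}"
    ```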
    Table 4 Parameters for creating a date-based verification task

    Area

    Parameter

    Configuration

    Basic Info

    Task Name

    The default name is Component-date-based-Verification-4 random characters (including letters and numbers). You can also customize a name.

    Task Settings

    Table Groups

    Select the table groups that contain the tables to be verified.

    Executor Connection

    This parameter is available for Hive, Delta Lake, and Hudi.

    • To create a verification task for the source, select the source executor connection.
    • To create a verification task for the target, select the target executor connection.

    DLI Connection

    If DLI is selected in the previous step, the task can only be created for the target. You need to select the created DLI connection.

    Metadata Connection (Optional)

    This parameter is available for Hive and Delta Lake.

    If you are creating a verification task for the target Hive cluster, select the metadata connection to this Hive cluster. The connection is used to check whether the partitions to be verified can be found in the target cluster.

    If you are creating a verification task for the source Delta Lake cluster, to improve processing efficiency, you can select metadata connections to the source Delta Lake cluster.

    Execution Time

    Specify when the task will be executed. After the task is activated, it will be automatically executed at the specified time every day.

    Advanced Options

    • Concurrency: Specify the number of concurrent threads on an executor for the verification task. The default value is 3. The value ranges from 1 to 10.
      CAUTION:

      If you are creating a verification task for an Alibaba Cloud EMR Hive cluster, set this parameter based on the source data volume and master node specifications. Consider the following rules:

      • The total number of concurrent threads for verification tasks running simultaneously in the source cluster cannot exceed 70% of the total number of cores on the metadata node.
      • The total resources allocated to verification tasks cannot exceed the resources of the execution queue. The total resources allocated to tasks can be calculated as follows:

        Allocated memory = Number of executors × Memory on an executor × Concurrency

        Allocated cores = Number of executors × Cores on an executor × Concurrency

      Assume that the total source data volume is 500 GB spread across 10,000 tables, in which there are 8 large tables with 50 GB data and 100,000 partitions. The master node has 8 vCPUs and 32 GB of memory.

      • According to rule 1, the maximum concurrency is 5, the rounded-down value of 5.6 (0.7 × 8).
      • According to rule 2, you need to select spark-sql in the Execution Command area and set the following parameters:

        executor-memory = 4G

        master = yarn

        num-executors = 20

        executor-cores = 2

        driver-memory = 10G

    • Max. SQL Statements Per File: Each time the task is executed, files are created for storing the SQL statements generated for querying tables. You can control how many SQL statements can be stored in a single file. The default value is 10. The recommended value ranges from 1 to 50.
    • Timeout (s): indicates the maximum time allowed for an SQL statement to finish execution. The unit is second (s). The default value is 600. The value ranges from 600 to 7,200.
    • Send SMN Notifications: Determine whether to use SMN to notify you of the task status in a timely manner through emails, SMS messages, or customized URLs.
      NOTICE:
      • Before enabling this function, you need to create a topic on the SMN console. For details, see Creating a Topic.
      • Using this function may incur a small amount of fees, which are billed by SMN. For details, see SMN Billing.
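    The two concurrency rules above can be checked with simple arithmetic. The sketch below uses the example figures from this section (an 8-vCPU metadata node and the suggested spark-sql settings); the variable names are ours for illustration, not product parameters.

    ```shell
    META_NODE_CORES=8     # vCPUs on the metadata (master) node
    NUM_EXECUTORS=20
    EXECUTOR_CORES=2
    EXECUTOR_MEMORY_GB=4
    CONCURRENCY=5         # the rule-1 cap, computed below

    # Rule 1: concurrency cap = floor(0.7 x metadata-node cores),
    # via integer arithmetic.
    MAX_CONCURRENCY=$(( META_NODE_CORES * 7 / 10 ))
    echo "Max concurrency: ${MAX_CONCURRENCY}"

    # Rule 2: resources allocated to the task at that concurrency.
    ALLOC_CORES=$(( NUM_EXECUTORS * EXECUTOR_CORES * CONCURRENCY ))
    ALLOC_MEM_GB=$(( NUM_EXECUTORS * EXECUTOR_MEMORY_GB * CONCURRENCY ))
    echo "Allocated cores: ${ALLOC_CORES}"
    echo "Allocated memory (GB): ${ALLOC_MEM_GB}"
    ```

    Compare the allocated totals against the capacity of your execution queue before activating the task.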

    Data Filtering

    Time Range

    Select the time period in which the incremental data needs to be verified as consistent. By default, a 24-hour period is selected. T indicates the execution time of the task, and T-n indicates a time n × 24 hours before the execution time.

    If you select Consecutive days, the system verifies consistency of the incremental data generated during the specified days.

    Non-partitioned Table Verification

    Decide how to verify non-partitioned tables.

    • Verify all: All non-partitioned tables are verified as consistent.
    • Skip all: All non-partitioned tables are skipped during consistency verification.
    • Filter by update time: Only non-partitioned tables whose update time falls within the specified time range are verified as consistent. The update time of a non-partitioned table may be inaccurate if the table contains data written by operations other than INSERT, so such a table may be incorrectly excluded from verification.

    Advanced Options

    Max Fields Per SQL Statement: Limit the number of fields that can be queried in a single SQL statement. Too many or too few fields in an SQL statement can reduce query efficiency. The default value 0 indicates that no limit is set. Otherwise, the value ranges from 100 to 500.

    OBS Bucket Check

    -

    • If you need to upload task logs and content verification results to an OBS bucket for management and analysis, configure an OBS bucket. After the bucket is configured, the task logs and content verification results will be automatically uploaded to the specified OBS bucket.
    • If you do not need to upload the task logs and content verification results to OBS, select I confirm that I only need to view logs and data verification results on Edge and do not need to upload them to OBS.

    Execution Script

    (Hive)

    Security Authentication

    If security authentication (for example, Kerberos authentication) is enabled for the big data cluster, select this option and configure the security authentication command. You must first manually upload the .keytab file that contains the authentication key to the executor.

    • Keytab Path: Enter the path where the .keytab file is stored on the executor.
    • Keytab Principal: Enter the user name corresponding to the .keytab file.

    Execution Command

    You can configure Beeline or Spark SQL command parameters to run SQL statements for consistency verification.

    • Beeline: a command line tool used to interact with Hive.
    • Spark SQL: a command line tool used to execute SQL statements to query and analyze Hive data.

    MaxCompute Parameters

    (MaxCompute)

    -

    Add MaxCompute parameters as needed. For details about the parameters, see MaxCompute Documentation.

    Execution Settings

    (This item is only available for DLI.)

    Parameter

    Set the parameters as required. For details about the supported custom parameters, see Custom Parameters.

    Command Parameters (available for Delta Lake and Hudi)

    Security Authentication (only available for target Delta Lake clusters)

    If security authentication (for example, Kerberos authentication) is enabled for the big data cluster, select this option and configure the security authentication command. You must first manually upload the .keytab file that contains the authentication key to the executor.

    • Keytab Path: Enter the path where the .keytab file is stored on the executor.
    • Keytab Principal: Enter the principal of the .keytab file.

    spark-sql

    Spark SQL is a module used in Spark for structured data processing. In Spark applications, you can seamlessly use SQL statements or DataSet APIs to query structured data. For more information, see SparkSQL Principles. Retain the default settings.

    spark-submit

    This is the basic command used to submit Spark applications. Its general form is as follows:

    ./bin/spark-submit \
      --class <main-class> \
      --master <master-url> \
      ... # other options
      <application-jar> \
      [application-arguments]

    Parameter description:

    • --class: indicates the entry-point class of the Spark application.
    • --master: indicates the master that the Spark application connects to, such as yarn-client or yarn-cluster.
    • application-jar: indicates the path of the JAR file of the Spark application.
    • application-arguments: indicates the arguments passed to the Spark application. (This parameter can be empty.)

    The parameters that need to be added depend on the scenario:

    • If you are creating a verification task for an Alibaba Cloud EMR Delta Lake cluster that uses Spark 3, add the following parameter:
      • Parameter: jars
      • Value: '/opt/apps/DELTALAKE/deltalake-current/spark3-delta/delta-core_2.12-*.jar,/opt/apps/DELTALAKE/deltalake-current/spark3-delta/delta-storage-*.jar'
        CAUTION:

        Replace the parameter values with the actual environment directory and Delta Lake version.

    • If you are creating a verification task for an Alibaba Cloud EMR Delta Lake 2.1.0 cluster that uses Spark 2.4.8, add the following parameter:
      • Parameter: mgc.delta.spark.version
      • Value: 2
    Table 5 Parameters for creating a selective verification task

    Area

    Parameter

    Configuration

    Basic Info

    Task Name

    The default name is Component-selective verification-4 random characters (including letters and numbers). You can also customize a name.

    Task Settings

    Table Groups

    Select the table groups that contain the tables to be verified.

    HBase Connection

    This parameter is available for HBase and CloudTable (HBase).

    • To create a verification task for the source, select the source HBase connection.
    • To create a verification task for the target, select the target HBase or CloudTable (HBase) connection.

    Advanced Options

    • Concurrency: Specify the number of concurrent threads on an executor for the verification task. The default value is 3. The value ranges from 1 to 10.
    • Timeout (s): indicates the maximum time allowed for an SQL statement to finish execution. The unit is second (s). The default value is 600. The value ranges from 600 to 7,200.
    • Send SMN Notifications: Determine whether to use SMN to notify you of the task status in a timely manner through emails, SMS messages, or customized URLs.
      NOTICE:
      • Before enabling this function, you need to create a topic on the SMN console. For details, see Creating a Topic.
      • Using this function may incur a small amount of fees, which are billed by SMN. For details, see SMN Billing.

    Data Filtering

    Time Range

    Select the time period in which the data needs to be verified as consistent.

    OBS Bucket Check

    -

    • If you need to upload task logs and content verification results to an OBS bucket for management and analysis, configure an OBS bucket. After the bucket is configured, the task logs and content verification results will be automatically uploaded to the specified OBS bucket.
    • If you do not need to upload the task logs and content verification results to OBS, select I confirm that I only need to view logs and data verification results on Edge and do not need to upload them to OBS.

    Execution Settings

    (HBase)

    Run Mode

    The supported run modes include:

    • Yarn: For large-scale distributed environments. It can make full use of cluster resources and improve verification concurrency and efficiency.
    • Local: For small-scale datasets or testing and development environments. It enables rapid debugging and verification.

    Parameter

    Add command parameters based on the selected run mode and requirements.

  7. Click Save. After the creation is successful, the system automatically synchronizes the task settings to the Edge device. Then in the task list, you can view the created task and its settings synchronization status.
  8. After the settings synchronization is complete, execute the task using either of the following methods:

    • Automatic execution: The task will be executed at the specified time automatically.
      1. In the task list, locate the task and click Activate in the Schedule Status column.
      2. In the displayed dialog box, click OK to activate the task.
    • Manual execution: You can manually execute the task immediately.
      1. In the task list, locate the task and click Execute in the Operation column.
      2. In the displayed dialog box, click OK to execute the task immediately.

  9. Click View Executions in the Operation column. On the executions list page, you can:

    • View the status, progress statistics, and execution start time and end time of each task execution.

      If a task execution takes a long time or the page is incorrectly displayed, set the log level of the executor's Driver to ERROR.

    • Upload the logs of a task execution to your OBS bucket for review and analysis by clicking Upload Log in the Operation column. Before uploading logs, you need to configure a log bucket on the Edge console. For details, see Configuring a Log Bucket.
    • Cancel a running execution or terminate an execution using the Cancel/Terminate button.
    • If there are tables whose verification results are not obtained, obtain the results again by clicking Path in the Statistics column.