Updated on 2024-10-21 GMT+08:00

Creating and Executing Verification Tasks

You can use the created source and target connections to create verification tasks.

For details about the supported big data components and verification methods, see Overview.

Notes

  • A pair of verification tasks for the source and the target must use the same verification method.
  • If the source and target HBase clusters use different security authentication modes, do not execute the two verification tasks at the same time, or they will fail. The authentication information must be handled differently for each cluster: the secured cluster requires the authentication information to be loaded, whereas the non-secured cluster requires it to be cleared.
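
The two notes above can be read as a pre-flight check on a pair of tasks. The following is a minimal sketch under stated assumptions, not MgC code: the `VerificationTask` class, its field names, and the `check_pair` helper are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VerificationTask:
    """Hypothetical stand-in for one verification task (not an MgC type)."""
    component: str  # for example "HBase" or "Hive"
    method: str     # for example "full" or "daily-incremental"
    secured: bool   # True if the cluster uses security authentication


def check_pair(source: VerificationTask, target: VerificationTask) -> list[str]:
    """Return the issues found for a source/target pair of tasks."""
    issues = []
    if source.method != target.method:
        issues.append("source and target must use the same verification method")
    both_hbase = source.component == "HBase" and target.component == "HBase"
    if both_hbase and source.secured != target.secured:
        issues.append("authentication modes differ: do not run the two tasks at the same time")
    return issues


# Example values are invented for the illustration.
src = VerificationTask("HBase", "full", secured=True)
dst = VerificationTask("HBase", "full", secured=False)
print(check_pair(src, dst))
```

In this sketch, a pair that triggers the second message can still be verified, provided the two tasks are executed one after the other.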

Prerequisites

Procedure

  1. Sign in to the MgC console.
  2. In the navigation pane on the left, choose Migrate > Big Data Verification. Select a migration project in the upper left corner of the page.
  3. In the Features area, click Task Management.
  4. Click Create Task in the upper right corner of the page.
  5. Select a big data component and verification method as needed and click Next.
  6. Configure task parameters based on the selected big data component and verification method.

    The task parameters vary with the big data component.

    Table 1 Parameters for creating a full verification task

    Area

    Parameter

    Configuration

    Basic Info

    Task Name

    The default name is Component-full-verification- followed by four random characters (letters and digits). You can also enter a custom name.

    Task Settings

    Table Groups

    Select the table groups that contain the tables to be verified.

    Executor Connection

    This parameter is available for Hive, Delta Lake, and Hudi.

    • To create a verification task for the source, select the source executor connection.
    • To create a verification task for the target, select the target executor connection.

    DLI Connection

    If DLI is selected in the previous step, the task can only be created for the target. You need to select the created DLI connection.

    Doris Connection

    If Doris is selected in the previous step:

    • To create a verification task for the source, select the source Doris connection.
    • To create a verification task for the target, select the target Doris connection.

    HBase Connection

    This parameter is available for HBase and CloudTable (HBase).

    • To create a verification task for the source, select the source HBase connection.
    • To create a verification task for the target, select the target HBase or CloudTable (HBase) connection.

    ClickHouse Connection

    This parameter is available for ClickHouse, Alibaba Cloud ApsaraDB for ClickHouse, and CloudTable (ClickHouse).

    • To create a verification task for the source, select the source MRS ClickHouse or Alibaba Cloud ApsaraDB for ClickHouse connection.
    • To create a verification task for the target, select the target MRS ClickHouse or CloudTable (ClickHouse) connection.

    Execution Time

    Specify when the task will be executed. After the task is activated, it will be automatically executed at the specified time every day.

    NOTE:

    You are advised to run the verification during off-peak hours.

    Advanced Options

    • Concurrency: Limit how many tasks an executor can run concurrently. The default value is 3. The value ranges from 1 to 10.
    • Max. SQL Statements Per File: Each time the task is executed, files are created for storing the SQL statements generated for querying tables. You can control how many SQL statements can be stored in a single file. The default value is 10. The recommended value ranges from 1 to 50.
    • Timeout (s): The maximum time allowed for an SQL statement to complete, in seconds. The default value is 600. The value ranges from 600 to 7,200.

    Data Filtering

    Non-partitioned Table Verification

    Decide how to verify non-partitioned tables.

    • Verify all: All non-partitioned tables are verified.
    • Skip all: All non-partitioned tables are skipped during consistency verification.
    • Filter by update time: Only non-partitioned tables whose update time falls within the specified time range are verified. The recorded update time of a non-partitioned table may be inaccurate if the table contains data that was not written by insert operations, in which case the table may be excluded from verification.

    Advanced Options

    Max Fields Per SQL Statement: Limit how many fields can be queried by one SQL statement. Too many or too few fields in an SQL statement reduce query efficiency. The default value is 0, which means no limit is set. If you set a limit, the value ranges from 100 to 500.

    OBS Bucket Check

    -

    • If you need to upload task logs and content verification results to an OBS bucket for management and analysis, configure an OBS bucket. After the bucket is configured, the task logs and content verification results will be automatically uploaded to the specified OBS bucket.
    • If you do not need to upload the task logs and content verification results to OBS, select I confirm that I only need to view logs and data verification results on Edge and do not need to upload them to OBS.

    Execution Script

    (This parameter is only available for Hive.)

    Security Authentication

    If security authentication (for example, Kerberos authentication) is enabled for the big data cluster, select this option and configure the security authentication command. You must first manually upload the .keytab file that contains the authentication key to the executor.

    • Keytab Path: Enter the path where the .keytab file is stored on the executor.
    • Keytab Principal: Enter the principal of the .keytab file.

    Execution Command

    You can configure Beeline or Spark SQL command parameters to run SQL statements for verifying data consistency.

    • Beeline: a command line tool used to interact with Hive.
    • Spark SQL: a command line tool used to execute SQL statements to query and analyze Hive data.

    MaxCompute Parameters

    (This item is only available for MaxCompute.)

    -

    Add MaxCompute parameters as needed. For details about the parameters, see MaxCompute Documentation.

    Execution Settings

    (This item is only available for HBase.)

    Run Mode

    The supported run modes include:

    • Yarn: This mode applies to large-scale distributed environments. It can make full use of cluster resources and improve task concurrency and efficiency.
    • Local: This mode applies to small-scale datasets or development and test environments for quick debugging and verification.

    Parameters

    Add command parameters based on the selected run mode and requirements.

    Command Parameters (This item is only available for Delta Lake and Hudi.)

    spark-sql

    Spark SQL is a module used in Spark for structured data processing. In Spark applications, you can seamlessly use SQL statements or DataSet APIs to query structured data. For more information, see SparkSQL Principles. Retain the default settings.

    spark-submit

    This is a basic Spark shell command used to submit Spark applications. The command is as follows:

    ./bin/spark-submit \
      --class <main-class> \
      --master <master-url> \
      ... # other options
      <application-jar> \
      [application-arguments]

    Parameter description:

    --class: the main class of the Spark application.

    --master: the master that the Spark application connects to, such as Yarn-client or Yarn-cluster.

    application-jar: the path to the JAR file of the Spark application.

    application-arguments: the arguments passed to the Spark application. (This parameter can be empty.)
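
To make the spark-submit template above concrete, the sketch below assembles a command line from the four parts described in the parameter description. The class name, JAR path, and application argument are placeholders invented for the example, and `yarn` is assumed as the master value. Use the values that apply to your cluster.

```python
import shlex


def build_spark_submit(main_class, master, app_jar, app_args=()):
    """Assemble a spark-submit command line as a list of arguments."""
    cmd = ["./bin/spark-submit", "--class", main_class, "--master", master]
    cmd.append(app_jar)   # the application JAR comes after all options
    cmd.extend(app_args)  # application arguments are optional and come last
    return cmd


# Placeholder values, not taken from MgC or any real application.
cmd = build_spark_submit(
    "com.example.VerifyApp", "yarn", "/opt/app/verify.jar", ["2024-10-21"]
)
print(shlex.join(cmd))
```

Building the command as a list and joining it with `shlex.join` keeps arguments that contain spaces correctly quoted.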

    Table 2 Parameters for creating a daily incremental verification task

    Area

    Parameter

    Configuration

    Basic Info

    Task Name

    The default name is Component-daily-incremental-verification- followed by four random characters (letters and digits). You can also enter a custom name.

    Task Settings

    Table Groups

    Select the table groups that contain the tables to be verified.

    Executor Connection

    This parameter is available for Hive, Delta Lake, and Hudi.

    • To create a verification task for the source, select the source executor connection.
    • To create a verification task for the target, select the target executor connection.

    DLI Connection

    If DLI is selected in the previous step, the task can only be created for the target. You need to select the created DLI connection.

    Doris Connection

    If Doris is selected in the previous step:

    • To create a verification task for the source, select the source Doris connection.
    • To create a verification task for the target, select the target Doris connection.

    HBase Connection

    If HBase is selected in the previous step:

    • To create a verification task for the source, select the source HBase connection.
    • To create a verification task for the target, select the target HBase or CloudTable (HBase) connection.

    Execution Time

    Specify when the task will be executed. After the task is activated, it will be automatically executed at the specified time every day.

    Advanced Options

    • Concurrency: Limit how many tasks an executor can run concurrently. The default value is 3. The value ranges from 1 to 10.
    • Max. SQL Statements Per File: Each time the task is executed, files are created for storing the SQL statements generated for querying tables. You can control how many SQL statements can be stored in a single file. The default value is 10. The recommended value ranges from 1 to 50.
    • Timeout (s): The maximum time allowed for an SQL statement to complete, in seconds. The default value is 600. The value ranges from 600 to 7,200.

    Data Filtering

    Incremental Scope

    Select the time period for which incremental data is verified for consistency. By default, a 24-hour period is selected. T indicates the execution time of the task, and T-n indicates the time n × 24 hours before the execution time.

    If you select Consecutive days, the system verifies consistency of the incremental data generated during these consecutive days.

    Non-partitioned Table Verification

    Decide how to verify non-partitioned tables.

    • Verify all: All non-partitioned tables are verified.
    • Skip all: All non-partitioned tables are skipped during consistency verification.
    • Filter by update time: Only non-partitioned tables whose update time falls within the specified time range are verified. The recorded update time of a non-partitioned table may be inaccurate if the table contains data that was not written by insert operations, in which case the table may be excluded from verification.

    Advanced Options

    • Partition Filtering: Choose whether to filter table partitions by creation time or update time.
      • By update time: The update time is the timestamp when a table partition was last modified. Choose this option if you are concerned about the latest status or changes of data in a partition.
      • By creation time: The creation time is the timestamp when a partition was created. Choose this option if you are concerned about the data generated from the time when the partition was created to a certain time point.
    • Max. Partitions: Limit how many partitions in a table are verified. The default value is 3. The value ranges from 1 to 50.

      For example, if this parameter is set to 3, the system sorts the partitions by ID in descending order and verifies the consistency of only the first three.

    • Max Fields Per SQL Statement: Limit how many fields can be queried by one SQL statement. Too many or too few fields in an SQL statement reduce query efficiency. The default value is 0, which means no limit is set. If you set a limit, the value ranges from 100 to 500.

    OBS Bucket Check

    -

    • If you need to upload task logs and content verification results to an OBS bucket for management and analysis, configure an OBS bucket. After the bucket is configured, the task logs and content verification results will be automatically uploaded to the specified OBS bucket.
    • If you do not need to upload the task logs and content verification results to OBS, select I confirm that I only need to view logs and data verification results on Edge and do not need to upload them to OBS.

    Execution Script

    (This parameter is only available for Hive.)

    Security Authentication

    If security authentication (for example, Kerberos authentication) is enabled for the big data cluster, select this option and configure the security authentication command. You must first manually upload the .keytab file that contains the authentication key to the executor.

    • Keytab Path: Enter the path where the .keytab file is stored on the executor.
    • Keytab Principal: Enter the principal of the .keytab file.

    Execution Command

    You can configure Beeline or Spark SQL command parameters to run SQL statements for verifying data consistency.

    • Beeline: a command line tool used to interact with Hive.
    • Spark SQL: a command line tool used to execute SQL statements to query and analyze Hive data.

    MaxCompute Parameters

    (MaxCompute)

    -

    Add MaxCompute parameters as needed. For details about the parameters, see MaxCompute Documentation.

    Statistics Configuration

    (HBase)

    Run Mode

    The supported run modes include:

    • Yarn: This mode applies to large-scale distributed environments. It can make full use of cluster resources and improve task concurrency and efficiency.
    • Local: This mode applies to small-scale datasets or testing and development environments. It enables rapid debugging and verification.

    Parameter

    Add command parameters based on the selected run mode and requirements.

    Command Parameters (Delta Lake and Hudi)

    spark-sql

    Spark SQL is a module used in Spark for structured data processing. In Spark applications, you can seamlessly use SQL statements or DataSet APIs to query structured data. For more information, see SparkSQL Principles. Retain the default settings.

    spark-submit

    This is a basic Spark shell command used to submit Spark applications. The command is as follows:

    ./bin/spark-submit \
      --class <main-class> \
      --master <master-url> \
      ... # other options
      <application-jar> \
      [application-arguments]

    Parameter description:

    --class: the main class of the Spark application.

    --master: the master that the Spark application connects to, such as Yarn-client or Yarn-cluster.

    application-jar: the path to the JAR file of the Spark application.

    application-arguments: the arguments passed to the Spark application. (This parameter can be empty.)
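
The T-n notation used by the Incremental Scope parameter in Table 2 comes down to simple date arithmetic. The sketch below is illustrative only: the function names are hypothetical, and it assumes that the scope runs from T-n up to the execution time T, which the table does not spell out.

```python
from datetime import datetime, timedelta


def t_minus(execution_time: datetime, n: int) -> datetime:
    """Return T-n: the time n x 24 hours before the execution time T."""
    return execution_time - timedelta(hours=24 * n)


def incremental_window(execution_time: datetime, n: int = 1):
    """Return (start, end) for a scope assumed to run from T-n to T."""
    return t_minus(execution_time, n), execution_time


# Example: a task scheduled for 02:00 on 2024-10-21 with the default 24-hour scope.
start, end = incremental_window(datetime(2024, 10, 21, 2, 0))
print(start, end)
```

With n set to 1, the window covers exactly the 24 hours before the scheduled execution time.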

    Table 3 Parameters for creating an hourly incremental verification task

    Area

    Parameter

    Configuration

    Basic Info

    Task Name

    The default name is Component-hourly incremental verification- followed by four random characters (letters and digits). You can also enter a custom name.

    Task Settings

    Table Groups

    Select the table groups that contain the tables to be verified.

    Executor Connection

    This parameter is available for Hive, Delta Lake, and Hudi.

    • To create a verification task for the source, select the source executor connection.
    • To create a verification task for the target, select the target executor connection.

    DLI Connection

    If DLI is selected in the previous step, the task can only be created for the target. You need to select the created DLI connection.

    Doris Connection

    If Doris is selected in the previous step:

    • To create a verification task for the source, select the source Doris connection.
    • To create a verification task for the target, select the target Doris connection.

    Start Time

    Specify when the task will be executed. After the task is activated, it will be automatically executed at the specified time.

    Advanced Options

    • Concurrency: Limit how many tasks an executor can run concurrently. The default value is 3. The value ranges from 1 to 10.
    • Max. SQL Statements Per File: Each time the task is executed, files are created for storing the SQL statements generated for querying tables. You can control how many SQL statements can be stored in a single file. The default value is 10. The recommended value ranges from 1 to 50.
    • Timeout (s): The maximum time allowed for an SQL statement to complete, in seconds. The default value is 600. The value ranges from 600 to 7,200.

    Data Filtering

    Execution Interval

    Control how frequently the task is executed.

    Non-partitioned Table Verification

    Decide how to verify non-partitioned tables.

    • Verify all: All non-partitioned tables are verified.
    • Skip all: All non-partitioned tables are skipped during consistency verification.
    • Filter by update time: Only non-partitioned tables whose update time falls within the specified time range are verified. The recorded update time of a non-partitioned table may be inaccurate if the table contains data that was not written by insert operations, in which case the table may be excluded from verification.

    Advanced Options

    • Partition Filtering: Choose whether to filter table partitions by creation time or update time.
      • By update time: The update time is the timestamp when a table partition was last modified. Choose this option if you are concerned about the latest status or changes of data in a partition.
      • By creation time: The creation time is the timestamp when a partition was created. Choose this option if you are concerned about the data generated from the time when the partition was created to a certain time point.
    • Max. Partitions: Limit how many partitions in a table are verified. The default value is 3. The value ranges from 1 to 50.

      For example, if this parameter is set to 3, the system sorts the partitions by ID in descending order and verifies the consistency of only the first three.

    • Max Fields Per SQL Statement: Limit how many fields can be queried by one SQL statement. Too many or too few fields in an SQL statement reduce query efficiency. The default value is 0, which means no limit is set. If you set a limit, the value ranges from 100 to 500.

    OBS Bucket Check

    -

    • If you need to upload task logs and content verification results to an OBS bucket for management and analysis, configure an OBS bucket. After the bucket is configured, the task logs and content verification results will be automatically uploaded to the specified OBS bucket.
    • If you do not need to upload the task logs and content verification results to OBS, select I confirm that I only need to view logs and data verification results on Edge and do not need to upload them to OBS.

    Execution Script

    (This parameter is only available for Hive.)

    Security Authentication

    If security authentication (for example, Kerberos authentication) is enabled for the big data cluster, select this option and configure the security authentication command. You must first manually upload the .keytab file that contains the authentication key to the executor.

    • Keytab Path: Enter the path where the .keytab file is stored on the executor.
    • Keytab Principal: Enter the user name corresponding to the .keytab file.

    Execution Command

    You can configure Beeline or Spark SQL command parameters to run SQL statements for consistency verification.

    • Beeline: a command line tool used to interact with Hive.
    • Spark SQL: a command line tool used to execute SQL statements to query and analyze Hive data.

    MaxCompute Parameters

    (MaxCompute)

    -

    Add MaxCompute parameters as needed. For details about the parameters, see MaxCompute Documentation.

    Command Parameters (Delta Lake and Hudi)

    spark-sql

    Spark SQL is a module used in Spark for structured data processing. In Spark applications, you can seamlessly use SQL statements or DataSet APIs to query structured data. For more information, see SparkSQL Principles. Retain the default settings.

    spark-submit

    This is a basic Spark shell command used to submit Spark applications. The command is as follows:

    ./bin/spark-submit \
      --class <main-class> \
      --master <master-url> \
      ... # other options
      <application-jar> \
      [application-arguments]

    Parameter description:

    --class: the main class of the Spark application.

    --master: the master that the Spark application connects to, such as Yarn-client or Yarn-cluster.

    application-jar: the path to the JAR file of the Spark application.

    application-arguments: the arguments passed to the Spark application. (This parameter can be empty.)
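
The Max. Partitions example in Table 3 (sort by ID in descending order, keep the first three) can be sketched as follows. The function name is hypothetical, and integer partition IDs are assumed for illustration. How MgC identifies partitions internally is not described here.

```python
def select_partitions(partition_ids, max_partitions=3):
    """Keep the first max_partitions IDs after sorting in descending order."""
    if not 1 <= max_partitions <= 50:
        raise ValueError("max_partitions must be between 1 and 50")
    return sorted(partition_ids, reverse=True)[:max_partitions]


# Invented partition IDs for the example.
ids = [101, 105, 102, 104, 103]
print(select_partitions(ids))  # [105, 104, 103]
```

If a table has fewer partitions than the limit, all of its partitions are returned.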

    Table 4 Parameters for creating a date-based verification task

    Area

    Parameter

    Configuration

    Basic Info

    Task Name

    The default name is Component-date-based-Verification- followed by four random characters (letters and digits). You can also enter a custom name.

    Task Settings

    Table Groups

    Select the table groups that contain the tables to be verified.

    Executor Connection

    This parameter is available for Hive, Delta Lake, and Hudi.

    • To create a verification task for the source, select the source executor connection.
    • To create a verification task for the target, select the target executor connection.

    DLI Connection

    If DLI is selected in the previous step, the task can only be created for the target. You need to select the created DLI connection.

    Execution Time

    Specify when the task will be executed. After the task is activated, it will be automatically executed at the specified time every day.

    Advanced Options

    • Concurrency: Limit how many tasks an executor can run concurrently. The default value is 3. The value ranges from 1 to 10.
    • Max. SQL Statements Per File: Each time the task is executed, files are created for storing the SQL statements generated for querying tables. You can control how many SQL statements can be stored in a single file. The default value is 10. The recommended value ranges from 1 to 50.
    • Timeout (s): The maximum time allowed for an SQL statement to complete, in seconds. The default value is 600. The value ranges from 600 to 7,200.

    Data Filtering

    Time Range

    Select the time period for which incremental data is verified for consistency. By default, a 24-hour period is selected. T indicates the execution time of the task, and T-n indicates the time n × 24 hours before the execution time.

    If you select Consecutive days, the system verifies consistency of the incremental data generated during the specified days.

    Non-partitioned Table Verification

    Decide how to verify non-partitioned tables.

    • Verify all: All non-partitioned tables are verified.
    • Skip all: All non-partitioned tables are skipped during consistency verification.
    • Filter by update time: Only non-partitioned tables whose update time falls within the specified time range are verified. The recorded update time of a non-partitioned table may be inaccurate if the table contains data that was not written by insert operations, in which case the table may be excluded from verification.

    Advanced Options

    Max Fields Per SQL Statement: Limit how many fields can be queried by one SQL statement. Too many or too few fields in an SQL statement reduce query efficiency. The default value is 0, which means no limit is set. If you set a limit, the value ranges from 100 to 500.

    OBS Bucket Check

    -

    • If you need to upload task logs and content verification results to an OBS bucket for management and analysis, configure an OBS bucket. After the bucket is configured, the task logs and content verification results will be automatically uploaded to the specified OBS bucket.
    • If you do not need to upload the task logs and content verification results to OBS, select I confirm that I only need to view logs and data verification results on Edge and do not need to upload them to OBS.

    Execution Script

    (Hive)

    Security Authentication

    If security authentication (for example, Kerberos authentication) is enabled for the big data cluster, select this option and configure the security authentication command. You must first manually upload the .keytab file that contains the authentication key to the executor.

    • Keytab Path: Enter the path where the .keytab file is stored on the executor.
    • Keytab Principal: Enter the user name corresponding to the .keytab file.

    Execution Command

    You can configure Beeline or Spark SQL command parameters to run SQL statements for consistency verification.

    • Beeline: a command line tool used to interact with Hive.
    • Spark SQL: a command line tool used to execute SQL statements to query and analyze Hive data.

    MaxCompute Parameters

    (MaxCompute)

    -

    Add MaxCompute parameters as needed. For details about the parameters, see MaxCompute Documentation.

    Command Parameters (Delta Lake and Hudi)

    spark-sql

    Spark SQL is a module used in Spark for structured data processing. In Spark applications, you can seamlessly use SQL statements or DataSet APIs to query structured data. For more information, see SparkSQL Principles. Retain the default settings.

    spark-submit

    This is a basic Spark shell command used to submit Spark applications. The command is as follows:

    ./bin/spark-submit \
      --class <main-class> \
      --master <master-url> \
      ... # other options
      <application-jar> \
      [application-arguments]

    Parameter description:

    --class: the main class of the Spark application.

    --master: the master that the Spark application connects to, such as Yarn-client or Yarn-cluster.

    application-jar: the path to the JAR file of the Spark application.

    application-arguments: the arguments passed to the Spark application. (This parameter can be empty.)
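
Both Max Fields Per SQL Statement and Max. SQL Statements Per File are chunking limits: one caps the fields queried per statement, the other caps the statements stored per file. The sketch below only illustrates that idea. The helper names, field names, and placeholder statements are hypothetical, and the way MgC actually generates and groups statements is not documented here.

```python
def chunk(items, size):
    """Split a list into consecutive pieces of at most size items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return [items[i:i + size] for i in range(0, len(items), size)]


def split_fields(fields, max_fields_per_sql=0):
    """Group fields per SQL statement; 0 means no limit (a single group)."""
    if max_fields_per_sql == 0:
        return [list(fields)]
    return chunk(list(fields), max_fields_per_sql)


# 250 invented field names, limited to 100 fields per statement.
fields = [f"col_{i}" for i in range(250)]
groups = split_fields(fields, max_fields_per_sql=100)

# 25 placeholder statements, stored 10 per file (the default).
statements = [f"-- statement {i}" for i in range(25)]
files = chunk(statements, 10)

print([len(g) for g in groups], [len(f) for f in files])
# [100, 100, 50] [10, 10, 5]
```

A smaller per-statement field limit produces more statements, which in turn may produce more files for the same table.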

    Table 5 Parameters for creating a selective verification task

    Area

    Parameter

    Configuration

    Basic Info

    Task Name

    The default name is Component-selective verification- followed by four random characters (letters and digits). You can also enter a custom name.

    Task Settings

    Table Groups

    Select the table groups that contain the tables to be verified.

    HBase Connection

    This parameter is available for HBase and CloudTable (HBase).

    • To create a verification task for the source, select the source HBase connection.
    • To create a verification task for the target, select the target HBase or CloudTable (HBase) connection.

    Advanced Options

    • Concurrency: Limit how many tasks an executor can run concurrently. The default value is 3. The value ranges from 1 to 10.
    • Timeout (s): The maximum time allowed for an SQL statement to complete, in seconds. The default value is 600. The value ranges from 600 to 7,200.

    Data Filtering

    Time Range

    Select the time period for which data is verified for consistency.

    OBS Bucket Check

    -

    • If you need to upload task logs and content verification results to an OBS bucket for management and analysis, configure an OBS bucket. After the bucket is configured, the task logs and content verification results will be automatically uploaded to the specified OBS bucket.
    • If you do not need to upload the task logs and content verification results to OBS, select I confirm that I only need to view logs and data verification results on Edge and do not need to upload them to OBS.

    Execution Settings

    (HBase)

    Run Mode

    The supported run modes include:

    • Yarn: For large-scale distributed environments. It can make full use of cluster resources and improve verification concurrency and efficiency.
    • Local: For small-scale datasets or testing and development environments. It enables rapid debugging and verification.

    Parameter

    Add command parameters based on the selected run mode and requirements.
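
The Advanced Options in the tables above share the same defaults and ranges, so they can be checked before a task is saved. The following is a sketch with a hypothetical helper name. It treats the range for Max. SQL Statements Per File as a recommendation, as the tables do, and it is not part of MgC.

```python
def check_advanced_options(concurrency=3, max_sql_per_file=10,
                           timeout_s=600, max_fields_per_sql=0):
    """Return messages for values outside the documented ranges."""
    problems = []
    if not 1 <= concurrency <= 10:
        problems.append("Concurrency must be between 1 and 10")
    if not 1 <= max_sql_per_file <= 50:
        problems.append("Max. SQL Statements Per File: 1 to 50 is recommended")
    if not 600 <= timeout_s <= 7200:
        problems.append("Timeout (s) must be between 600 and 7200")
    # 0 is the default and means no limit; otherwise the range is 100 to 500.
    if max_fields_per_sql != 0 and not 100 <= max_fields_per_sql <= 500:
        problems.append("Max Fields Per SQL Statement must be 0 or between 100 and 500")
    return problems


print(check_advanced_options())  # the defaults pass, so the list is empty
print(check_advanced_options(concurrency=12, timeout_s=300))
```

Not every option applies to every verification method. For example, Table 5 lists only Concurrency and Timeout (s).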

  7. Click Save. After the task is created, the system automatically synchronizes the task settings to the Edge device. You can then view the created task and its settings synchronization status in the task list.
  8. After the settings synchronization is complete, execute the task using either of the following methods:

    • Automatic execution: The task will be executed at the specified time automatically.
      1. In the task list, locate the task and click Activate in the Schedule Status column.
      2. In the displayed dialog box, click OK to activate the task.
    • Manual execution: You can manually execute the task immediately.
      1. In the task list, locate the task and click Execute in the Operation column.
      2. In the displayed dialog box, click OK to execute the task immediately.

  9. Click View Executions in the Operation column. On the executions list page, you can:

    • View the status, progress statistics, and execution start time and end time of each task execution.

      If a task execution takes a long time or the page is incorrectly displayed, set the log level of the executor's Driver to ERROR.

    • Upload the logs of a task execution to your OBS bucket for review and analysis by clicking Upload Log in the Operation column. Before uploading logs, you need to configure a log bucket on the Edge console. For details, see Configuring a Log Bucket.
    • Cancel a running execution or terminate an execution using the Cancel/Terminate button.
    • If the verification results of some tables were not obtained, obtain them again by clicking Path in the Statistics column.