Updated on 2024-10-28 GMT+08:00

Creating a Data Comparison Job

Data comparison is critical to ensure data consistency in data development and migration. The cross-source data comparison capability is the key to checking consistency of the data before and after migration or processing.

Comparison jobs in Quality Monitoring support cross-source data comparison. You can apply created rules to two tables for quality monitoring and output the comparison result.

Creating a Job

  1. On the DataArts Studio console, locate a workspace and click DataArts Quality.
  1. (Optional) In the left navigation pane, choose Quality Monitoring > Comparison Jobs and create a directory. If a directory exists, you do not need to create one. Note that rule templates, quality jobs, and comparison job are in the same directory.

    Currently, you can create a directory using either of the following methods:

    Click and enter a directory name in the displayed dialog box. In this way, you can create a maximum of seven layers of directories.
    Figure 1 Creating a directory
    You can also click to synchronize the subjects in DataArts Architecture as directories. (Only published subjects can be synchronized.) The synchronized directories are consistent with the published subjects in DataArts Architecture and are displayed by layer, such as and .
    1. The directories you create are not affected by the synchronization. If the name of a created directory conflicts with that of a subject:
      • If they conflict during the first synchronization, a subject layer (such as and ) is added to the name of the directory.
      • If they conflict after the subject is modified, the synchronization fails.
    2. Changes to subjects or subject layers in DataArts Architecture cannot be automatically synchronized. You must click again to synchronize them.

      If a subject or subject layer in DataArts Architecture is deleted and synchronized to DataArts Quality, the corresponding directory will not be deleted. Instead, the subject attributes will be deleted from the directory.

    3. After the synchronization is complete, the system automatically displays the synchronization result details. You can view the names of the subjects that fail to be synchronized.
    Figure 2 Synchronizing subjects from DataArts Architecture
  2. On the Quality Jobs page, click Create. In the displayed dialog box, set the parameters listed in Table 1.
    Table 1 Comparison job parameters

    Parameter

    Description

    Name

    Comparison job name

    Description

    Information to better identify a comparison job. It cannot exceed 1,024 characters.

    Tag

    Select desired tags from the list of tags that were defined Data Map. If Data Map is disabled, tags do not take effect.

    Directory

    The directory for storing the comparison job to create. You can select a created directory. For details about how to create a directory, see (Optional) Creating a Directory.

    Job Level

    The options are Warning, Minor, Major, and Critical. The job level determines the template for sending notification messages.

    Timeout

    Timeout duration. Enter a value from 5 to 1440. The unit is minute. If this parameter is left empty or the default value 1440 is used, the timeout duration is 24 hours by default and can be changed.

  3. Click Next to go to the Define Rule page. Click on the rule card and configure it based on Table 2. You can also add comparison rules.
    Figure 3 Configuring rules for a comparison job
    Table 2 Parameters for configuring a rule template

    Module

    Parameter

    Description

    Basic Information

    Subjob Name

    In the job execution result, each rule corresponds to a subjob. You are advised to set the subjob information so that you can view the job execution result and locate faults through logs more easily.

    Description

    Information to better identify the subjob It can contain a maximum of 1,024 characters.

    Object

    Rule Type

    The options are Table rule, Field rule, and Custom rule. Field-level rules can be used to configure monitoring rules for specific fields in tables. For example, set this parameter to Table rule, and set other configuration items on the page to table-level rule configuration items correspondingly.

    The rule type of the destination object is automatically generated based on that of the source object.

    Data Connection

    Source and destination objects support the following data connection types: DWS, MRS Hive, MRS Spark, DLI, RDS (MySQL and PostgreSQL), Hetu, Oracle, Doris, MRS Spark (Hudi), and MRS ClickHouse.

    Select a created data connection from the drop-down list box.

    NOTE:
    • Rules are based on data connections. Therefore, you must create data connections in Management Center before creating data quality rules.
    • For MRS Hive connected through a proxy, select the MRS API mode or proxy mode.
      • MRS API mode: An MRS API is used for submission. By default, historical jobs are submitted through MRS APIs. You are advised to keep the default settings when editing the job.
      • Proxy mode: A username and a password are used for submission. You are advised to select this mode for new jobs to prevent job submission failures caused by permission issues.
    • The strict mode of the MRS Hive component is not supported.

    Database

    Select the database to which the configured data quality rules are applied.

    NOTE:
    • The database is tailored to the created data connection.
    • When Rule Type is set to Custom rule, set the data object to the corresponding database.

    Data Object

    The data table selected for the source object is compared with the data table of the destination object on the right. Select the table to which the configured comparison rule applies.

    NOTE:

    The table is closely related to the database. The database is tailored to the created data connection.

    SQL

    This parameter is mandatory if you select Custom rule for Rule Type. Enter a complete SQL statement to define how to monitor the quality of data objects.

    Compute Engine

    Queue Name

    Select the engine for running the comparison job. This parameter is valid only for DLI or Hetu data connections. Enter a queue name.

    If the connection type is Hetu and rule type is cross-source, the queue name is the resource queue name of the Hetu engine. To view the resource queue name of the Hetu engine, log in to FusionInsight Manager of MRS and click HetuEngine in the navigation pane. In the Basic Information area, click the HSConsole WebUI link and view the resource queue name of the Hetu engine in the computing instance list.

    Rule Template

    Template

    This parameter defines how to monitor the quality of data objects.

    The template name of the source object contains the system rule template and custom rule template.

    The template name of the destination object is automatically generated based on the rule type of the source object.

    NOTE:

    The template type is closely related to the rule type. For details, see Table 1. In addition to system rule templates, you can select the custom rule template created in Creating a Data Quality Rule.

    If Rule Type is set to Field rule and Rule Template is set to Regular expression verification or Regular expression verification ignore null, the regular expression rule can contain a maximum of 1,024 characters.

    Version

    This parameter is required only when you select a custom rule template. Select the version of the published custom rule template.

    Object Scope

    Scanning Scope

    You can select All or Partial. The default value is All.

    If you want only part of data to be computed or comparison jobs to be executed periodically based on a timestamp, you can set a where clause for scanning.

    WHERE Clause

    Enter a WHERE clause. The system will scan the data that matches the clause.

    For example, if you want to filter out the data for which the value range of the age field is (18, 60], enter the following WHERE clause:

    age > 18 and age <= 60

    You can also enter a dynamic SQL expression. For example, if you want to filter out the data generated 24 hours ago based on the time field, enter the following WHERE clause:

    time >= (date_trunc('hour', now()) - interval '24 h') and time <= (date_trunc('hour', now()))

    Alarm Condition

    Alarm Expression

    Set this parameter if you want to set an alarm condition for the current rule.

    After the alarm conditions are configured, the system determines whether to generate an alarm based on the value of Parameter and the alarm condition. Apart from a single alarm expression, you can also use more complex alarm conditions consisting of logical operators. The alarm expression supports the following logical operators, which can be enclosed by "(" and ")".

    • +: addition
    • -: subtraction
    • *: multiplying
    • /: division
    • ==: equal to
    • !=: not equal to
    • >: greater than
    • <: less than
    • >=: greater than or equal to
    • <=: less than or equal to
    • !: non
    • ||: or
    • &&: and
    • abs: absolute value

    For example, if Rule Template of the source and destination of the comparison job is set to Table Rows, you can configure the alarm expression as follows:

    • To configure an alarm to be generated when the number of rows in the source table is less than 100, enter ${1_1}<100, where ${1_1} indicates the total number of rows in the source table.
    • To configure an alarm to be generated when the number of rows in the source table is not equal to that in the destination table, enter ${1_1}!=${2_1}, where ${1_1} indicates the total number of rows in the source table and ${2_1} indicates the total number of rows in the destination table.
    • To configure an alarm to be generated when the number of rows in the source table is less than 100 or when the number of rows in the source table is not equal to that in the destination table, enter (${1_1}<100)||(${1_1}!=${2_1}), where ${1_1} and ${2_1} indicate the total number of rows in the source and destination tables, respectively, and || indicates that an alarm is generated if either condition is met.
    • To configure an alarm to be generated when the absolute value of the number of rows in the source table minus the number of rows in the destination table divided by the number of rows in the source table is greater than 0.1, enter abs(${1_1}-${2_1})/${1_1}>0.1, where ${1_1} and ${2_1} indicate the total number of rows in the source and destination tables, respectively.

    Parameter

    The value of this parameter is obtained from the output of the rule template. If you can click a parameter, the expression of the parameter is displayed in Alarm Expression.

    For example, if Template is set to Table Rows, ${1_1} is displayed in Alarm Expression when you click alarm parameter Table Rows.

    Logical Operator

    This parameter is optional. You can perform logical operations on the result of an alarm expression to generate more complex alarm conditions.

    You can move the cursor between two alarm expressions in Alarm Expression and click one of the following operators to insert them. You can also manually enter an operator. The current expression supports the following logical operators which can be enclosed by brackets ().

    • +: addition
    • -: subtraction
    • *: multiplying
    • /: division
    • ==: equal to
    • !=: not equal to
    • >: greater than
    • <: less than
    • >=: greater than or equal to
    • <=: less than or equal to
    • !: non
    • ||: or
    • &&: and
    • abs: absolute value

    For example, if Template is Table Rows and if you want to generate an alarm when the number of rows in the source table is less than 100 or when the number of rows in the source table is not equal to that in the destination table, enter (${1_1}<100)||(${1_1}!=${2_1}), where ${1_1} and ${2_1} indicate the total number of rows in the source and destination tables, respectively, and || indicates that an alarm is generated if either condition is met.

  4. Click Next and set the subscription configuration. If the SMN notification is required, enable Notification, and set Notification Type and Topic. See Figure 4.
    Figure 4 Subscription configuration

    When notification is enabled, a notification is sent for all the subjobs of the configured notification type.

    If you enable alarming, you do not need to set the notifications for failures. Alarms will be automatically reported if a task fails.

    Currently, only SMS and email are available for subscribing to topics.

    You can select Alarm triggered or Run successfully for Notification Type.

    If you enable Notification Policy, you can set the condition for sending an alarm notification, that is, the number of consecutive alarms within a certain period of time (minutes). The period ranges from 1 minute to 360 minutes, and the number of consecutive times ranges from 1 to 10.

  5. Click Next to go to the page where you can select a scheduling mode. Currently, Once and On schedule are supported. Set parameters for scheduling periodically by referring to Table 3. Click Submit.
    1. If Once is selected, a manual task instance is generated. A manual task has no dependency on scheduling and must be manually triggered.
    2. If On schedule is selected, a periodic instance is generated. A periodic instance is an instance snapshot that is automatically scheduled when a periodic task reaches the scheduled execution time.
    3. When a periodic task is scheduled once, an instance workflow is generated. You can perform routine O&M management on scheduled instance tasks, such as viewing the running status, stopping and rerunning the scheduled tasks.
    4. Only MRS clusters that support job submission through an agency support periodic scheduling of comparison jobs. MRS clusters that support job submission through an agency are as follows:
      • Non-security MRS cluster
      • MRS security cluster whose version is later than 2.1.0, and that has MRS 2.1.0.1 or later installed
    Table 3 Parameters for setting the scheduling mode

    Parameter

    Description

    Effective

    Effective date of a scheduling task.

    Cycle

    The frequency at which a scheduling task is executed. Related parameters are:

    • Minutes
    • Hours
    • Days
    • Weeks
      NOTE:
      • If Cycle is set to Minutes or Hours, set the start time, end time, and interval for the scheduling task.
      • If Cycle is set to Days, set the start time of the scheduling task.
      • If Cycle is set to Weeks, set Scheduling Time and Start from for the scheduling task, that is, XX o'clock XX minutes on XXX every week.

    After a comparison job is created, you can view it in the job list. You can also filter jobs by job name, creator, and time range. The system supports fuzzy search.

    After a comparison job is created, you can edit, delete, run, start scheduling, and stop scheduling it.

    You cannot start scheduling a one-off comparison job.

Running a Comparison Job

To run a comparison job, perform the following operations:

  1. In the left navigation pane on the DataArts Quality page, choose Quality Monitoring > Comparison Jobs. In the right pane, locate a comparison job.
  2. Click Run in the Operation column.
  3. In enterprise mode, select the development environment or production environment.
  4. Click OK.

Exporting Comparison Jobs

You can export a maximum of 200 comparison jobs. Each cell of the exported file can contain a maximum of 65,534 characters.

  1. In the left navigation pane on the DataArts Quality page, choose Quality Monitoring > Comparison Jobs. In the right pane, select the comparison jobs to export.
  2. Click Export. The Export Comparison Job dialog box is displayed.
  3. Click Export to switch to the Export Records tab.
  4. In the list of exported files, locate an exported comparison job and click Download in the Operation column to download the Excel file of the job to the local PC.

Importing Comparison Jobs

You can import a file containing a maximum of 4 MB data. Each cell of the file to be imported can contain a maximum of 65,534 characters.

  1. In the left navigation pane on the DataArts Quality page, choose Quality Monitoring > Comparison Jobs. In the right pane, click Import. The Import Comparison Job dialog box is displayed.

    Figure 5 Importing comparison jobs

  2. On the Import Configurations tab page, set Duplicate Name Policy.

    • Terminate: If comparison job names repeat, all comparison jobs will fail to be imported.
    • Skip: If comparison job names repeat, the comparison jobs will still be imported.
    • Overwrite: If comparison job names repeat, new jobs will replace existing ones with the same names.

      If you select Overwrite, stop job scheduling before uploading a file. Otherwise, the upload will fail.

  3. Click Upload and select the prepared data file.

    Edit the data file using either of the following methods:

    • (Recommended) Click Export to export data and import the data to the system directly or import it after modification.
    • Click Download Template, fill in the template with the data to import, and import the data to the system.

  4. Configure resource mapping for the data connection, cluster, directory, and topic. If you do not configure the resource mapping, the original mapping is used by default.

    Figure 6 Configuring the resource mapping

    • Data Connection: Select the type of the imported data connection.
    • Cluster: If the data connection type is DLI, select the corresponding queue.
    • Directory: Select the directory where the imported comparison job is stored.
    • Topic: If SMN is configured, you need to select a topic.

  5. Click Import to import the Excel template to the system.
  6. Click the Import Records tab to view the import records.

Running Comparison Jobs

You can run a maximum of 200 comparison jobs at a time.

  1. In the left navigation pane on the DataArts Quality page, choose Quality Monitoring > Comparison Jobs. In the right pane, select the comparison jobs you want to run.
  2. Above the job list, click More and select Run to run the selected comparison jobs.

    Figure 7 Running jobs

  3. In enterprise mode, select the development environment or production environment.
  4. Click OK.

Scheduling Comparison Jobs

You can schedule a maximum of 200 comparison jobs at a time.

  1. In the left navigation pane on the DataArts Quality page, choose Quality Monitoring > Comparison Jobs. In the right pane, select the comparison jobs you want to schedule.
  2. Above the job list, click More and select Start Schedule to schedule the selected comparison jobs.

    Figure 8 Scheduling jobs

Stopping Scheduling Comparison Jobs

You can stop scheduling a maximum of 200 comparison jobs at a time.

  1. In the left navigation pane on the DataArts Quality page, choose Quality Monitoring > Comparison Jobs. In the right pane, select the comparison jobs you want to stop scheduling.
  2. Above the job list, click More and select Stop Scheduling to stop scheduling the selected comparison jobs.

    Figure 9 Stopping scheduling jobs

Stopping Comparison Jobs

You can stop a maximum of 200 comparison jobs at a time.

Only comparison jobs in Running state can be stopped.

  1. In the left navigation pane on the DataArts Quality page, choose Quality Monitoring > O&M. In the right pane, select the comparison jobs you want to stop.
  2. Click Stop. In the displayed Stop Instance dialog box, confirm the instances to stop and click Yes.

    Figure 10 Stopping instances

    Figure 11 Stopping instances