Updated on 2022-09-23 GMT+08:00

Creating a Comparison Job

Data comparison is critical to ensure data consistency in data development and migration. The cross-source data comparison capability is the key to checking consistency of the data before and after migration or processing.

Comparison jobs in Quality Monitoring support cross-source data comparison. You can apply created rules to two tables for quality monitoring and output the comparison result.

Prerequisites

You have created a directory for storing the comparison job. To create a directory, choose Quality Monitoring > Comparison Jobs in the navigation pane. Before creating a comparison job for a data connection, select a directory to store the comparison job. For details, see Figure 1.

Table 1 describes the directory-related operations.

Figure 1 Directory that stores the comparison job to create
Table 1 Buttons in the navigation bar

No.

Description

1

Create Directory

2

Refresh Directory

3

Select All. Right-click to create, delete, and rename directories.

Creating a Job

  1. On the DataArts Studio console, locate an instance and click Access. On the displayed page, locate a workspace and click DataArts Quality.
  1. Choose Quality Monitoring > Comparison Job from the left navigation bar.
  2. Click Create. In the dialog box displayed, set the parameters based on Table 2.
    Table 2 Comparison job parameters

    Parameter

    Description

    Name

    Comparison job name

    Description

    Information to better identify a comparison job. It cannot exceed 256 characters.

    Directory

    The directory for storing the comparison job to create. You can select a created directory. Figure 1 shows the directory.

    Job Level

    The options are Warning, Minor, Major, and Critical. The job level determines the template for sending notification messages.

  3. Click Next to go to the Define Rule page. Click on the rule card and configure it based on Table 3. You can also add comparison rules.
    Figure 2 Configuring rules for a comparison job
    Table 3 Parameters for configuring a rule template

    Module

    Parameter

    Description

    Basic Information

    Subjob Name

    In the job execution result, each rule corresponds to a subjob. You are advised to set the subjob information so that you can view the job execution result and locate faults through logs more easily.

    Description

    Information to better identify the subjob

    Object

    Rule Type

    The options are Table rule, Field rule, and Custom rule. Field-level rules can be used to configure monitoring rules for specific fields in tables. For example, set this parameter to Table rule, and set other configuration items on the page to table-level rule configuration items correspondingly.

    The rule type of the destination object is automatically generated based on that of the source object.

    Data Connection

    Source and destination objects support the following data connection types: DWS, MRS Hive, DLI, RDS (MySQL and PostgreSQL), Oracle, and MRS Spark (Hudi).

    Select a created data connection from the drop-down list box.

    NOTE:
    • Rules are based on data connections. Therefore, you must create data connections in Management Center before creating data quality rules.
    • For MRS Hive connected through a proxy, select the MRS API mode or proxy mode.
      • MRS API mode: An MRS API is used for submission. By default, historical jobs are submitted through MRS APIs. You are advised keep the default settings when editing the job.
      • Proxy mode: A username and a password are used for submission. You are advised to select this mode for new jobs to prevent job submission failures caused by permission issues.

    Data Object

    The data table selected for the source object is compared with the data table of the destination object on the right. Select the table to which the configured comparison rule applies.

    NOTE:

    The table is closely related to the database. The database is tailored to the created data connection.

    SQL

    This parameter is mandatory if you select Custom rule for Rule Type. Enter a complete SQL statement to define how to monitor the quality of data objects.

    Compute Engine

    Cluster Name

    Select the engine for running the comparison job. This parameter is valid only for DLI data connections.

    Rule Template

    Template

    This parameter defines how to monitor the quality of data objects.

    The template name of the source object contains the system rule template and custom rule template.

    The template name of the destination object is automatically generated based on the rule type of the source object.

    NOTE:

    The template type is closely related to the rule type. For details, see Table 1. In addition to system rule templates, you can select the custom rule template created in Creating Rule Templates.

    Version

    This parameter is required only when you select a custom rule template. Select the version of the published custom rule template.

    Object Scope

    Scanning Scope

    You can select All or Partial. The default value is All.

    If you want only part of data to be computed or quality jobs to be executed periodically based on a timestamp, you can set a WHERE condition for scanning.

    WHERE Clause

    Enter a WHERE clause. The system will scan the data that matches the clause.

    For example, if you want to filter out the data for which the value range of the age field is (18, 60], enter the following WHERE clause:

    age > 18 and age <= 60

    You can also enter a dynamic SQL expression. For example, if you want to filter out the data generated 24 hours ago based on the time field, enter the following WHERE clause:

    time >= (date_trunc('hour', now()) - interval '24 h') and time <= (date_trunc('hour', now()))

    Alarm Condition

    Alarm Expression

    Set this parameter if you want to set an alarm condition for the current rule.

    After the alarm conditions are configured, the system determines whether to generate an alarm based on the value of Parameter and the alarm condition. Apart from a single alarm expression, you can also use more complex alarm conditions consisting of logical operators. The alarm expression supports the following logical operators, which can be enclosed by "(" and ")".

    • +: addition
    • -: subtraction
    • *: multiplying
    • /: division
    • ==: equal to
    • !=: not equal to
    • >: greater than
    • <: less than
    • >=: greater than or equal to
    • <=: less than or equal to
    • !: non
    • ||: or
    • &&: and

    For example, if Rule Template of the source and destination of the comparison job is set to Table Rows, you can configure the alarm expression as follows:

    • To generate an alarm when the number of rows in the source table is less than 100, enter ${1_1}<100, where ${1_1} indicates the total number of rows in the source table.
    • To generate an alarm when the number of rows in the source table is not equal to that in the destination table, enter ${1_1}!=${2_1}, where ${1_1} indicates the total number of rows in the source table and ${2_1} indicates the total number of rows in the destination table.
    • To generate an alarm when the number of rows in the source table is less than 100 or when the number of rows in the source table is not equal to that in the destination table, enter (${1_1}<100)||(${1_1}!=${2_1}), where ${1_1} and ${2_1} indicate the total number of rows in the source and destination tables, respectively, and || indicates that an alarm is generated if either condition is met.

    Parameter

    The value of this parameter is obtained from the output of the rule template. If you can click a parameter, the expression of the parameter is displayed in Alarm Expression.

    For example, if Template is set to Table Rows, ${1_1} is displayed in Alarm Expression when you click alarm parameter Table Rows.

    Logical Operator

    This parameter is optional. You can perform logical operations on the result of an alarm expression to generate more complex alarm conditions.

    You can move the cursor between two alarm expressions in Alarm Expression and click one of the following operators to insert them. You can also manually enter an operator. The current expression supports the following logical operators which can be enclosed by brackets ().

    • +: addition
    • -: subtraction
    • *: multiplying
    • /: division
    • ==: equal to
    • !=: not equal to
    • >: greater than
    • <: less than
    • >=: greater than or equal to
    • <=: less than or equal to
    • !: non
    • ||: or
    • &&: and

    For example, if Template is Table Rows and if you want to generate an alarm when the number of rows in the source table is less than 100 or when the number of rows in the source table is not equal to that in the destination table, enter (${1_1}<100)||(${1_1}!=${2_1}), where ${1_1} and ${2_1} indicate the total number of rows in the source and destination tables, respectively, and || indicates that an alarm is generated if either condition is met.

  4. Click Next and set the subscription configuration. If the SMN notification is required, enable Notification, and set Notification Type and Topic. See Figure 3.
    Figure 3 Subscription configuration
  5. Click Next to go to the page where you can select a scheduling mode. Currently, Once and On schedule are supported. Set parameters for scheduling periodically by referring to Table 4. Click Submit.
    1. If Once is selected, a manual task instance is generated. A manual task has no dependency on scheduling and must be manually triggered.
    2. If On schedule is selected, a periodic instance is generated. A periodic instance is an instance snapshot that is automatically scheduled when a periodic task reaches the scheduled execution time.
    3. When a periodic task is scheduled once, an instance workflow is generated. You can perform routine O&M management on scheduled instance tasks, such as viewing the running status, stopping and rerunning the scheduled tasks.
    4. Only MRS clusters that support job submission through an agency support periodic scheduling of comparison jobs. MRS clusters that support job submission through an agency are as follows:
      • Non-security MRS cluster
      • MRS security cluster whose version is later than 2.1.0, and that has MRS 2.1.0.1 or later installed
    Table 4 Parameters for setting the scheduling mode

    Parameter

    Description

    Effective

    Effective date of a scheduling task.

    Cycle

    The frequency at which a scheduling task is executed. Related parameters are:

    • Minutes
    • Hours
    • Days
    • Weeks
      NOTE:
      • If Cycle is set to Minutes or Hours, set the start time, end time, and interval for the scheduling task.
      • If Cycle is set to Days, set the start time of the scheduling task.
      • If Cycle is set to Weeks, set Scheduling Time and Start from for the scheduling task, that is, XX o'clock XX minutes on XXX every week.

Exporting Comparison Jobs

You can export a maximum of 200 comparison jobs.

  1. In the left navigation pane on the DataArts Quality page, choose Quality Monitoring > Comparison Jobs. In the right pane, select the comparison jobs to export.
  2. Click Export. The Export Comparison Job dialog box is displayed.
  3. Click Export to switch to the Export Records tab.
  4. In the list of exported files, locate an exported quality job and click Download in the Operation column to download the Excel file of the job to the local PC.

Importing Comparison Jobs

You can import a file containing a maximum of 4 MB data.

  1. In the left navigation pane on the DataArts Quality page, choose Quality Monitoring > Comparison Jobs. In the right pane, click Import. The Import Comparison Job dialog box is displayed.

  2. On the Import Configurations tab page, set Duplicate Name Policy.

    • Terminate: If comparison job names repeat, all comparison jobs will fail to be imported.
    • Skip: If comparison job names repeat, the comparison jobs will still be imported.
    • Overwrite: If comparison job names repeat, new jobs will replace existing ones with the same names.

  3. Click Upload and select the prepared data file.

    Edit the data file using either of the following methods:

    • (Recommended) Click Export to export data and import the data to the system directly or import it after modification.
    • Click Download Template, fill in the template with the data to import, and import the data to the system.

  4. Configure resource mapping for the data connection, cluster, directory, and topic. Click Import to import the Excel template to the system.

    • Data Connection: Select the type of the imported data connection.
    • Cluster: If the data connection type is DLI, select the corresponding queue.
    • Directory: Select the directory where the imported comparison job is stored.
    • Topic: If SMN is configured, you need to select a topic.

  5. Click the Import Records tab to view the import records.