Creating a Comparison Job
Data comparison is critical to ensuring data consistency in data development and migration. Cross-source data comparison is the key capability for checking whether data remains consistent before and after migration or processing.
Comparison jobs in Quality Monitoring support cross-source data comparison. You can apply created rules to two tables for quality monitoring and output the comparison result.
Prerequisites
You have created a directory for storing the comparison job. You can create directories in the navigation pane. Before creating a comparison job for a data connection, select a directory to store the job. Table 1 describes the directory-related operations shown in Figure 1.
| No. | Description |
|---|---|
| 1 | Create Directory |
| 2 | Refresh Directory |
| 3 | Select All. Right-click to create, delete, and rename directories. |
Creating a Job
- On the DataArts Studio console, locate an instance and click Access. On the displayed page, locate a workspace and click DataArts Quality.
- Choose Quality Monitoring > Comparison Jobs from the left navigation bar.
- Click Create. In the dialog box displayed, set the parameters based on Table 2.
Table 2 Comparison job parameters

| Parameter | Description |
|---|---|
| Name | Comparison job name |
| Description | Information to better identify a comparison job. It cannot exceed 256 characters. |
| Directory | Directory for storing the comparison job. You can select an existing directory (see Figure 1). |
| Job Level | The options are Warning, Minor, Major, and Critical. The job level determines the template used for notification messages. |
- Click Next to go to the Define Rule page. Click the configuration icon on the rule card and configure the rule based on Table 3. You can also add more comparison rules.
Figure 2 Configuring rules for a comparison job
Table 3 Parameters for configuring a rule template Module
Parameter
Description
Basic Information
Subjob Name
In the job execution result, each rule corresponds to a subjob. You are advised to set the subjob information so that you can view the job execution result and locate faults through logs more easily.
Description
Information to better identify the subjob
Object
Rule Type
The options are Table rule, Field rule, and Custom rule. Field-level rules configure monitoring rules for specific fields in tables. The other configuration items on the page change to match the selected rule type; for example, if you set this parameter to Table rule, table-level configuration items are displayed.
The rule type of the destination object is automatically generated based on that of the source object.
Data Connection
Source and destination objects support the following data connection types: DWS, MRS Hive, DLI, RDS (MySQL and PostgreSQL), Oracle, and MRS Spark (Hudi).
Select a created data connection from the drop-down list box.
NOTE:
- Rules are based on data connections. Therefore, you must create data connections in Management Center before creating data quality rules.
- For MRS Hive connected through a proxy, select the MRS API mode or proxy mode.
- MRS API mode: An MRS API is used for submission. By default, historical jobs are submitted through MRS APIs. You are advised to keep the default settings when editing such jobs.
- Proxy mode: A username and password are used for submission. You are advised to select this mode for new jobs to prevent job submission failures caused by permission issues.
Data Object
The data table selected for the source object is compared with the data table of the destination object on the right. Select the table to which the configured comparison rule applies.
NOTE: The available tables depend on the database, and the available databases depend on the selected data connection.
SQL
This parameter is mandatory if you select Custom rule for Rule Type. Enter a complete SQL statement to define how to monitor the quality of data objects.
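As a sketch of what a custom-rule SQL statement might look like, the following runs one against an in-memory SQLite table (the `users` table, its columns, and the NULL-email check are hypothetical; a real statement runs on your selected data connection, such as DWS or MRS Hive):

```python
import sqlite3

# Hypothetical sample table; in DataArts Quality the statement runs on the
# selected data connection (DWS, MRS Hive, DLI, ...), not on SQLite.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, email TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)",
                 [(1, "a@example.com"), (2, None), (3, "c@example.com")])

# A complete SQL statement for a custom rule: it aggregates the monitored
# condition (rows with a NULL email) into a single numeric output.
custom_rule_sql = "SELECT COUNT(*) AS null_emails FROM users WHERE email IS NULL"
null_emails = conn.execute(custom_rule_sql).fetchone()[0]
print(null_emails)  # 1
```

A custom rule typically aggregates the monitored condition into one or more numeric outputs so that alarm expressions can reference them.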
Compute Engine
Cluster Name
Select the engine for running the comparison job. This parameter is valid only for DLI data connections.
Rule Template
Template
This parameter defines how to monitor the quality of data objects.
For the source object, you can select a system rule template or a custom rule template.
The template of the destination object is automatically generated based on the rule type of the source object.
NOTE: The template type is closely related to the rule type. For details, see Table 1. In addition to system rule templates, you can select a custom rule template created in Creating Rule Templates.
Version
This parameter is required only when you select a custom rule template. Select the version of the published custom rule template.
Object Scope
Scanning Scope
You can select All or Partial. The default value is All.
If you want only part of the data to be computed, or quality jobs to be executed periodically based on a timestamp, you can set a WHERE condition to limit the scanning scope.
WHERE Clause
Enter a WHERE clause. The system will scan the data that matches the clause.
For example, to select only the data whose age value falls in the range (18, 60], enter the following WHERE clause:
age > 18 and age <= 60
You can also enter a dynamic SQL expression. For example, to select the data generated in the past 24 hours based on the time field, enter the following WHERE clause:
time >= (date_trunc('hour', now()) - interval '24 h') and time <= (date_trunc('hour', now()))
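To illustrate how a WHERE clause narrows the scanning scope, the following sketch applies the age filter above to an in-memory SQLite table (table and column names are hypothetical; the dynamic `date_trunc`/`now()` example is engine-specific SQL and is not reproduced here):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT, age INTEGER)")
conn.executemany("INSERT INTO people VALUES (?, ?)",
                 [("ann", 18), ("bob", 19), ("cat", 60), ("dan", 61)])

# The WHERE clause from the example above selects ages in (18, 60].
where_clause = "age > 18 and age <= 60"
rows = conn.execute(f"SELECT name FROM people WHERE {where_clause}").fetchall()
print([name for (name,) in rows])  # ['bob', 'cat']: 18 and 61 fall outside (18, 60]
```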
Alarm Condition
Alarm Expression
Set this parameter if you want to set an alarm condition for the current rule.
After the alarm conditions are configured, the system determines whether to generate an alarm based on the value of Parameter and the alarm condition. Apart from a single alarm expression, you can build more complex alarm conditions using logical operators. The alarm expression supports the following operators, which can be grouped using parentheses:
- +: addition
- -: subtraction
- *: multiplication
- /: division
- ==: equal to
- !=: not equal to
- >: greater than
- <: less than
- >=: greater than or equal to
- <=: less than or equal to
- !: logical NOT
- ||: logical OR
- &&: logical AND
For example, if Rule Template of the source and destination of the comparison job is set to Table Rows, you can configure the alarm expression as follows:
- To generate an alarm when the number of rows in the source table is less than 100, enter ${1_1}<100, where ${1_1} indicates the total number of rows in the source table.
- To generate an alarm when the number of rows in the source table is not equal to that in the destination table, enter ${1_1}!=${2_1}, where ${1_1} indicates the total number of rows in the source table and ${2_1} indicates the total number of rows in the destination table.
- To generate an alarm when the number of rows in the source table is less than 100 or when the number of rows in the source table is not equal to that in the destination table, enter (${1_1}<100)||(${1_1}!=${2_1}), where ${1_1} and ${2_1} indicate the total number of rows in the source and destination tables, respectively, and || indicates that an alarm is generated if either condition is met.
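The expression semantics described above can be sketched with a small evaluator. This is an illustrative re-implementation, not the actual DataArts Quality engine; the function name and substitution logic are assumptions:

```python
import re

def evaluate_alarm(expression, values):
    """Illustrative evaluator for alarm expressions (not the real engine)."""
    # Substitute ${i_j} placeholders with their rule-output values.
    expr = re.sub(r"\$\{(\w+)\}", lambda m: str(values[m.group(1)]), expression)
    # Map the documented operators onto Python equivalents; replace "!" only
    # when it is not part of "!=".
    expr = expr.replace("&&", " and ").replace("||", " or ")
    expr = re.sub(r"!(?!=)", " not ", expr)
    return bool(eval(expr))  # eval is acceptable for a sketch, not production

# ${1_1}: rows in the source table, ${2_1}: rows in the destination table.
print(evaluate_alarm("(${1_1}<100)||(${1_1}!=${2_1})", {"1_1": 95, "2_1": 100}))
# True: the source table has fewer than 100 rows, so an alarm is generated.
```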
Parameter
The value of this parameter is obtained from the output of the rule template. When you click a parameter, its expression is inserted into Alarm Expression.
For example, if Template is set to Table Rows, ${1_1} is displayed in Alarm Expression when you click alarm parameter Table Rows.
Logical Operator
This parameter is optional. You can perform logical operations on the result of an alarm expression to generate more complex alarm conditions.
You can move the cursor between two alarm expressions in Alarm Expression and click an operator to insert it. You can also enter an operator manually. The expression supports the following operators, which can be grouped using parentheses:
- +: addition
- -: subtraction
- *: multiplication
- /: division
- ==: equal to
- !=: not equal to
- >: greater than
- <: less than
- >=: greater than or equal to
- <=: less than or equal to
- !: logical NOT
- ||: logical OR
- &&: logical AND
For example, if Template is Table Rows and if you want to generate an alarm when the number of rows in the source table is less than 100 or when the number of rows in the source table is not equal to that in the destination table, enter (${1_1}<100)||(${1_1}!=${2_1}), where ${1_1} and ${2_1} indicate the total number of rows in the source and destination tables, respectively, and || indicates that an alarm is generated if either condition is met.
- Click Next and configure the subscription. To receive SMN notifications, enable Notification and set Notification Type and Topic. See Figure 3.
- Click Next to go to the page where you can select a scheduling mode. Currently, Once and On schedule are supported. Set parameters for scheduling periodically by referring to Table 4. Click Submit.
- If Once is selected, a manual task instance is generated. A manual task has no dependency on scheduling and must be manually triggered.
- If On schedule is selected, a periodic instance is generated. A periodic instance is an instance snapshot that is automatically scheduled when a periodic task reaches the scheduled execution time.
- Each time a periodic task is scheduled, an instance workflow is generated. You can perform routine O&M on scheduled task instances, such as viewing their running status and stopping or rerunning them.
- Only MRS clusters that support job submission through an agency support periodic scheduling of comparison jobs. MRS clusters that support job submission through an agency are as follows:
- Non-security MRS cluster
- MRS security cluster whose version is later than 2.1.0, and that has MRS 2.1.0.1 or later installed
Table 4 Parameters for setting the scheduling mode Parameter
Description
Effective
Effective date of a scheduling task.
Cycle
The frequency at which a scheduling task is executed. Related parameters are:
- Minutes
- Hours
- Days
- Weeks
NOTE:
- If Cycle is set to Minutes or Hours, set the start time, end time, and interval for the scheduling task.
- If Cycle is set to Days, set the start time of the scheduling task.
- If Cycle is set to Weeks, set Scheduling Time and Start from for the scheduling task, that is, the time of day and the days of the week on which the task runs.
Exporting Comparison Jobs
You can export a maximum of 200 comparison jobs.
- In the left navigation pane on the DataArts Quality page, choose Quality Monitoring > Comparison Jobs. In the right pane, select the comparison jobs to export.
- Click Export. The Export Comparison Job dialog box is displayed.
- Click Export to switch to the Export Records tab.
- In the list of exported files, locate the exported comparison job and click Download in the Operation column to download its Excel file to your local PC.
Importing Comparison Jobs
You can import a file containing a maximum of 4 MB data.
- In the left navigation pane on the DataArts Quality page, choose Quality Monitoring > Comparison Jobs. In the right pane, click Import. The Import Comparison Job dialog box is displayed.
- On the Import Configurations tab page, set Duplicate Name Policy.
- Terminate: If comparison job names repeat, all comparison jobs will fail to be imported.
- Skip: If comparison job names repeat, the duplicate jobs are skipped and the remaining jobs are imported.
- Overwrite: If comparison job names repeat, new jobs will replace existing ones with the same names.
- Click Upload and select the prepared data file.
Edit the data file using either of the following methods:
- (Recommended) Click Export to export existing comparison jobs, then import the exported file directly or modify it before importing.
- Click Download Template, fill in the template with the data to import, and import the data to the system.
- Configure resource mapping for the data connection, cluster, directory, and topic. Click Import to import the Excel template to the system.
- Data Connection: Select the type of the imported data connection.
- Cluster: If the data connection type is DLI, select the corresponding queue.
- Directory: Select the directory where the imported comparison job is stored.
- Topic: If SMN is configured, you need to select a topic.
- Click the Import Records tab to view the import records.