Creating a Data Comparison Job
Data comparison is critical to ensure data consistency in data development and migration. The cross-source data comparison capability is the key to checking consistency of the data before and after migration or processing.
Comparison jobs in Quality Monitoring support cross-source data comparison. You can apply created rules to two tables for quality monitoring and output the comparison result.
Creating a Job
- On the DataArts Studio console, locate a workspace and click DataArts Quality.
- (Optional) In the left navigation pane, choose  and create a directory. If a directory exists, you do not need to create one. Note that rule templates, quality jobs, and comparison job are in the same directory.
    Currently, you can create a directory using either of the following methods: Click and enter a directory name in the displayed dialog box. In this way, you can create a maximum of seven layers of directories.
     Figure 1 Creating a directory and enter a directory name in the displayed dialog box. In this way, you can create a maximum of seven layers of directories.
     Figure 1 Creating a directory You can also click You can also click to synchronize the subjects in DataArts Architecture as directories. (Only published subjects can be synchronized.) The synchronized directories are consistent with the published subjects in DataArts Architecture and are displayed by layer, such as to synchronize the subjects in DataArts Architecture as directories. (Only published subjects can be synchronized.) The synchronized directories are consistent with the published subjects in DataArts Architecture and are displayed by layer, such as and and . .  - The directories you create are not affected by the synchronization. If the name of a created directory conflicts with that of a subject:
         - If they conflict during the first synchronization, a subject layer (such as  and and ) is added to the name of the directory. ) is added to the name of the directory.
- If they conflict after the subject is modified, the synchronization fails.
 
- If they conflict during the first synchronization, a subject layer (such as 
- Changes to subjects or subject layers in DataArts Architecture cannot be automatically synchronized. You must click  again to synchronize them. again to synchronize them.If a subject or subject layer in DataArts Architecture is deleted and synchronized to DataArts Quality, the corresponding directory will not be deleted. Instead, the subject attributes will be deleted from the directory. 
- After the synchronization is complete, the system automatically displays the synchronization result details. You can view the names of the subjects that fail to be synchronized.
 Figure 2 Synchronizing subjects from DataArts Architecture  
- The directories you create are not affected by the synchronization. If the name of a created directory conflicts with that of a subject:
         
- On the Quality Jobs page, click Create. In the displayed dialog box, set the parameters listed in Table 1. 
    Table 1 Comparison job parameters Parameter Description Name Comparison job name Description Information to better identify a comparison job. It cannot exceed 1,024 characters. Tag Select desired tags from the list of tags that were defined Data Map. If Data Map is disabled, tags do not take effect. Directory The directory for storing the comparison job to create. You can select a created directory. For details about how to create a directory, see (Optional) Creating a Directory. Job Level The options are Warning, Minor, Major, and Critical. The job level determines the template for sending notification messages. Timeout Timeout duration. Enter a value from 5 to 1440. The unit is minute. If this parameter is left empty or the default value 1440 is used, the timeout duration is 24 hours by default and can be changed. 
- Click Next to go to the Define Rule page. Click  on the rule card and configure it based on Table 2. You can also add comparison rules.
    Figure 3 Configuring rules for a comparison job on the rule card and configure it based on Table 2. You can also add comparison rules.
    Figure 3 Configuring rules for a comparison job  Table 2 Parameters for configuring a rule template Module Parameter Description Basic Information Subjob Name In the job execution result, each rule corresponds to a subjob. You are advised to set the subjob information so that you can view the job execution result and locate faults through logs more easily. Description Information to better identify the subjob It can contain a maximum of 1,024 characters. Object Rule Type The options are Table rule, Field rule, and Custom rule. Field-level rules can be used to configure monitoring rules for specific fields in tables. For example, set this parameter to Table rule, and set other configuration items on the page to table-level rule configuration items correspondingly. The rule type of the destination object is automatically generated based on that of the source object. Data Connection Source and destination objects support the following data connection types: DWS, MRS Hive, MRS Spark, DLI, RDS (MySQL and PostgreSQL), Hetu, Oracle, Doris, MRS Spark (Hudi), and MRS ClickHouse. Select a created data connection from the drop-down list box. NOTE:- Rules are based on data connections. Therefore, you must create data connections in Management Center before creating data quality rules. For MRS Hive and MRS Spark data sources, if Rule Type is set to Table rule or Field rule, Connection Type must be set to Proxy connection (MRS API connection is not supported) during the creation of connections to these data sources.
- For MRS Hive and MRS Spark data sources, select MRS API connection or Proxy connection.
             - MRS API connection: An MRS API is used for submission. By default, historical jobs are submitted through MRS APIs. You are advised to keep the default settings when editing the job.
- Proxy connection: A username and a password are used for submission. You are advised to select this mode for new jobs to prevent job submission failures caused by permission issues.
 
- The strict mode of the MRS Hive component is not supported.
 Database Select the database to which the configured data quality rules are applied. NOTE:- The database is tailored to the created data connection.
- When Rule Type is set to Custom rule, set the data object to the corresponding database.
 Data Object The data table selected for the source object is compared with the data table of the destination object on the right. Select the table to which the configured comparison rule applies. NOTE:The table is closely related to the database. The database is tailored to the created data connection. SQL This parameter is mandatory if you select Custom rule for Rule Type. Enter a complete SQL statement to define how to monitor the quality of data objects. Default Parameter Value Input parameters for the custom SQL statement. The sequence of the parameters must be consistent with that of the default parameter values (when a data quality job is executed in DataArts Factory). NOTE:When a data quality operator is scheduled by a data development job, the parameter values defined in the data development job are preferentially used. Compute Engine Queue Name Select the engine for running the comparison job. This parameter is valid only for Hive, DLI, or Hetu data connections. Enter a queue name. If the connection type is Hetu, and the rule type is not database rule, the queue name is the resource queue name of the Hetu engine. To view the resource queue name of the Hetu engine, log in to FusionInsight Manager of MRS and click HetuEngine in the navigation pane. In the Basic Information area, click the HSConsole WebUI link and view the resource queue name of the Hetu engine in the computing instance list. Rule Template Template This parameter defines how to monitor the quality of data objects. The template name of the source object contains the system rule template and custom rule template. The template name of the destination object is automatically generated based on the rule type of the source object. NOTE:The template type is closely related to the rule type. For details, see Table 1. In addition to system rule templates, you can select the custom rule template created in Creating a Data Quality Rule. If Rule Type is set to Field rule and Rule Template is set to Regular expression verification or Regular expression verification ignore null, the regular expression rule can contain a maximum of 1,024 characters. Version This parameter is required only when you select a custom rule template. Select the version of the published custom rule template. Object Scope Scanning Scope You can select All or Partial. The default value is All. If you want only part of data to be computed or comparison jobs to be executed periodically based on a timestamp, you can set a where clause for scanning. WHERE Clause Enter a WHERE clause. The system will scan the data that matches the clause. NOTE:Add and at the beginning of the clause for syntax verification during SQL statement generation. Otherwise, a syntax error will be reported. For example, if you want to filter out the data for which the value range of the age field is (18, 60], enter the following WHERE clause: and age > 18 and age <= 60 You can also enter a dynamic SQL expression. For example, if you want to filter out the data generated 24 hours ago based on the time field, enter the following WHERE clause: and time >= (date_trunc('hour', now()) - interval '24 h') and time <= (date_trunc('hour', now()))Default Parameter Value This parameter is required when you select Partial for Scanning Scope. Enter the default values of the parameters in the where clause text box. NOTE:The default parameter value is preferentially transferred by DataArts Factory. If the value is empty, the quality job may encounter an error. After DataArts Factory transfers parameters to DataArts Quality and the job is executed, you can click View SQL to view the parameters and their values transferred by DataArts Factory. Alarm Condition Alarm Expression Set this parameter if you want to set an alarm condition for the current rule. After the alarm conditions are configured, the system determines whether to generate an alarm based on the value of Parameter and the alarm condition. Apart from a single alarm expression, you can also use more complex alarm conditions consisting of logical operators. The alarm expression supports the following logical operators, which can be enclosed by "(" and ")". - +: addition
- -: subtraction
- *: multiplying
- /: division
- ==: equal to
- !=: not equal to
- >: greater than
- <: less than
- >=: greater than or equal to
- <=: less than or equal to
- !: non
- ||: or
- &&: and
- abs: absolute value
 For example, if Rule Template of the source and destination of the comparison job is set to Table Rows, you can configure the alarm expression as follows: - To configure an alarm to be generated when the number of rows in the source table is less than 100, enter ${1_1}<100, where ${1_1} indicates the total number of rows in the source table.
- To configure an alarm to be generated when the number of rows in the source table is not equal to that in the destination table, enter ${1_1}!=${2_1}, where ${1_1} indicates the total number of rows in the source table and ${2_1} indicates the total number of rows in the destination table.
- To configure an alarm to be generated when the number of rows in the source table is less than 100 or when the number of rows in the source table is not equal to that in the destination table, enter (${1_1}<100)||(${1_1}!=${2_1}), where ${1_1} and ${2_1} indicate the total number of rows in the source and destination tables, respectively, and || indicates that an alarm is generated if either condition is met.
- To configure an alarm to be generated when the absolute value of the number of rows in the source table minus the number of rows in the destination table divided by the number of rows in the source table is greater than 0.1, enter abs(${1_1}-${2_1})/${1_1}>0.1, where ${1_1} and ${2_1} indicate the total number of rows in the source and destination tables, respectively.
 Parameter The value of this parameter is obtained from the output of the rule template. If you can click a parameter, the expression of the parameter is displayed in Alarm Expression. For example, if Template is set to Table Rows, ${1_1} is displayed in Alarm Expression when you click alarm parameter Table Rows. Logical Operator This parameter is optional. You can perform logical operations on the result of an alarm expression to generate more complex alarm conditions. You can move the cursor between two alarm expressions in Alarm Expression and click one of the following operators to insert them. You can also manually enter an operator. The current expression supports the following logical operators which can be enclosed by brackets (). - +: addition
- -: subtraction
- *: multiplying
- /: division
- ==: equal to
- !=: not equal to
- >: greater than
- <: less than
- >=: greater than or equal to
- <=: less than or equal to
- !: non
- ||: or
- &&: and
- abs: absolute value
 For example, if Template is Table Rows and if you want to generate an alarm when the number of rows in the source table is less than 100 or when the number of rows in the source table is not equal to that in the destination table, enter (${1_1}<100)||(${1_1}!=${2_1}), where ${1_1} and ${2_1} indicate the total number of rows in the source and destination tables, respectively, and || indicates that an alarm is generated if either condition is met. 
- Click Next and set the subscription configuration. If the SMN notification is required, enable Notification, and set Notification Type and Topic. See Figure 4.
    
    
      When notification is enabled, a notification is sent for all the subjobs of the configured notification type. If you enable alarming, you do not need to set the notifications for failures. Alarms will be automatically reported if a task fails. Currently, only SMS and email are available for subscribing to topics. You can select Alarm triggered or Run successfully for Notification Type. If you enable Suppress Notifications, you can set the condition for sending an alarm notification, that is, the number of consecutive alarms within a certain period of time (minutes). The period ranges from 1 minute to 360 minutes, and the number of consecutive times ranges from 1 to 10. If you enable Combine Notifications, notifications of all subjobs in the job are sent together. For example, if subjob A and B both fail, a job-level notification is sent based on the notification type. This function is disabled by default. If you enable Message Attributes, notifications will be sent based on the message attributes. Message attributes specify the message release scope. The system sends messages based on the configured attribute names and values. For example, you can configure different alarm receiving conditions for different personnel. You can configure a maximum of 10 String.Array values. For details about how to configure message attributes, see Using a Subscription Filtering Policy. 
- Click Next to go to the page where you can select a scheduling mode. Currently, Once and On schedule are supported. Set parameters for scheduling periodically by referring to Table 3. Click Submit.
      - If Once is selected, a manual task instance is generated. A manual task has no dependency on scheduling and must be manually triggered.
- If On schedule is selected, a periodic instance is generated. A periodic instance is an instance snapshot that is automatically scheduled when a periodic task reaches the scheduled execution time.
- When a periodic task is scheduled once, an instance workflow is generated. You can perform routine O&M management on scheduled instance tasks, such as viewing the running status, stopping and rerunning the scheduled tasks.
- Only MRS clusters that support job submission through an agency support periodic scheduling of comparison jobs. MRS clusters that support job submission through an agency are as follows:
        - Non-security MRS cluster
- MRS security cluster whose version is later than 2.1.0, and that has MRS 2.1.0.1 or later installed
 
 Table 3 Parameters for setting the scheduling mode Parameter Description Effective Effective date of a scheduling task. Cycle The frequency at which a scheduling task is executed. Related parameters are: - Minutes
- Hours
- Days
- Weeks
           NOTE:- If Cycle is set to Minutes or Hours, set the start time, end time, and interval for the scheduling task.
- If Cycle is set to Days, set the start time of the scheduling task.
- If Cycle is set to Weeks, set Scheduling Time and Start from for the scheduling task, that is, XX o'clock XX minutes on XXX every week.
 
 After a comparison job is created, you can view it in the job list. You can also filter jobs by job name, creator, and time range. The system supports fuzzy search. After a comparison job is created, you can edit, delete, run, start scheduling, and stop scheduling it.   You cannot start scheduling a one-off comparison job. 
Running a Comparison Job
To run a comparison job, perform the following operations:
- In the left navigation pane on the DataArts Quality page, choose Quality Monitoring > Comparison Jobs. In the right pane, locate a comparison job.
- Click Run in the Operation column.
- In enterprise mode, select the development environment or production environment.
- Click OK.
Exporting Comparison Jobs
You can export a maximum of 200 comparison jobs. Each cell of the exported file can contain a maximum of 65,534 characters.
- In the left navigation pane on the DataArts Quality page, choose Quality Monitoring > Comparison Jobs. In the right pane, select the comparison jobs to export.
- Click Export. The Export Comparison Job dialog box is displayed.
- Click Export to switch to the Export Records tab.
- In the list of exported files, locate an exported comparison job and click Download in the Operation column to download the Excel file of the job to the local PC.
Importing Comparison Jobs
You can import a file containing a maximum of 4 MB data. Each cell of the file to be imported can contain a maximum of 65,534 characters.
- In the left navigation pane on the DataArts Quality page, choose Quality Monitoring > Comparison Jobs. In the right pane, click Import. The Import Comparison Job dialog box is displayed.
    
    Figure 5 Importing comparison jobs  
- On the Import Configurations tab page, set Duplicate Name Policy.
    
    - Terminate: If comparison job names repeat, all comparison jobs will fail to be imported.
- Skip: If comparison job names repeat, the comparison jobs will still be imported.
- Overwrite: If comparison job names repeat, new jobs will replace existing ones with the same names.
        If you select Overwrite, stop job scheduling before uploading a file. Otherwise, the upload will fail. 
 
- Click Upload and select the prepared data file.
    
      Edit the data file using either of the following methods: - (Recommended) Click Export to export data and import the data to the system directly or import it after modification.
- Click Download Template, fill in the template with the data to import, and import the data to the system.
 
- Configure resource mapping for the data connection, cluster, directory, and topic. If you do not configure the resource mapping, the original mapping is used by default.
    
    Figure 6 Configuring the resource mapping  - Data Connection: Select the type of the imported data connection.
- Cluster: If the data connection type is DLI, select the corresponding queue.
- Directory: Select the directory where the imported comparison job is stored.
- Topic: If SMN is configured, you need to select a topic.
 
- Click Import to import the Excel template to the system.
- Click the Import Records tab to view the import records.
Running Comparison Jobs
You can run a maximum of 200 comparison jobs at a time.
- In the left navigation pane on the DataArts Quality page, choose Quality Monitoring > Comparison Jobs. In the right pane, select the comparison jobs you want to run.
- Above the job list, click More and select Run to run the selected comparison jobs.
    
    Figure 7 Running jobs  
- In enterprise mode, select the development environment or production environment.
- Click OK.
Scheduling Comparison Jobs
You can schedule a maximum of 200 comparison jobs at a time.
- In the left navigation pane on the DataArts Quality page, choose Quality Monitoring > Comparison Jobs. In the right pane, select the comparison jobs you want to schedule.
- Above the job list, click More and select Start Schedule to schedule the selected comparison jobs.
    
    Figure 8 Scheduling jobs  
Stopping Scheduling Comparison Jobs
You can stop scheduling a maximum of 200 comparison jobs at a time.
- In the left navigation pane on the DataArts Quality page, choose Quality Monitoring > Comparison Jobs. In the right pane, select the comparison jobs you want to stop scheduling.
- Above the job list, click More and select Stop Scheduling to stop scheduling the selected comparison jobs.
    
    Figure 9 Stopping scheduling jobs  
Stopping Comparison Jobs
You can stop a maximum of 200 comparison jobs at a time.
Only comparison jobs in Running state can be stopped.
- In the left navigation pane on the DataArts Quality page, choose Quality Monitoring > O&M. In the right pane, select the comparison jobs you want to stop.
- Click Stop. In the displayed Stop Instance dialog box, confirm the instances to stop and click Yes.
    
    Figure 10 Stopping instances Figure 11 Stopping instances Figure 11 Stopping instances  
Feedback
Was this page helpful?
Provide feedbackThank you very much for your feedback. We will continue working to improve the documentation.See the reply and handling status in My Cloud VOC.
For any further questions, feel free to contact us through the chatbot.
Chatbot 
     
      