Updated on 2022-09-23 GMT+08:00

Creating Quality Jobs

You can create quality jobs to apply the created rules to existing tables.

Prerequisites

You have created a directory for storing the quality job. To create a directory, choose Quality Monitoring > Quality Jobs in the navigation pane. Before creating a quality job for a data connection, select a directory to store the quality job. For details, see Figure 1.

Figure 1 Directory that stores the quality job to create
Table 1 Buttons in the navigation bar

• 1: Create Directory
• 2: Refresh Directory
• 3: Select All. Right-click to create, delete, and rename directories.

Procedure

  1. On the DataArts Studio console, locate an instance and click Access. On the displayed page, locate a workspace and click DataArts Quality.
  2. Choose Quality Monitoring > Quality Jobs in the left navigation pane.
  3. Click Create. In the displayed dialog box, set the parameters based on Table 2.
    Table 2 Quality job parameters

    • Name: Quality job name.
    • Description: Information to better identify the quality job. It cannot exceed 256 characters.
    • Directory: Directory for storing the quality job. You can select a created directory (see Figure 1).
    • Job Level: The options are Warning, Minor, Major, and Critical. The job level determines the template for sending notification messages.

  4. Click Next to go to the Define Rule page, where each rule card corresponds to a subjob. Click a rule card and configure it based on Table 3. You can also add more quality rules, then click Next to apply them to a created database or table.
    Figure 2 Configuring rules for a quality job
    Table 3 Parameters for configuring a rule

    Basic Information

    • Subjob Name: In the job execution result, each rule corresponds to a subjob. You are advised to set the subjob information so that you can view the job execution result and locate faults in logs more easily.
    • Description: Information to better identify the subjob.

    Object

    • Rule Type: Database rules, table rules, field rules, cross-field rules, or custom rules configured for specific fields in a table.
    • Data Connection: Select a created data connection from the drop-down list box. Source and destination objects support the following data connection types: DWS, MRS Hive, DLI, RDS (MySQL and PostgreSQL), Oracle, and MRS Spark (Hudi).
      NOTE:
      • Rules are based on data connections. Therefore, you must create data connections in Management Center before creating data quality rules.
      • For MRS Hive connected through a proxy, select the MRS API mode or proxy mode.
        • MRS API mode: Jobs are submitted through an MRS API. By default, historical jobs are submitted through MRS APIs. You are advised to keep the default setting when editing such jobs.
        • Proxy mode: Jobs are submitted with a username and password. You are advised to select this mode for new jobs to prevent job submission failures caused by permission issues.

    • Database: Select the database to which the configured data quality rules are applied.
      NOTE:
      • The available databases depend on the selected data connection.
      • When Rule Type is set to Database rule, set the data object to the corresponding database.
    • Data Table: Select the table to which the configured data quality rules apply.
      NOTE:
      • The available tables depend on the selected database.
      • When Rule Type is set to Table rule, set the data object to the corresponding table.
    • SQL: This parameter is mandatory if you select Custom rule for Rule Type. Enter a complete SQL statement to define how to monitor the quality of data objects (see the example after this group of parameters).
    • Failure Policy: Select Ignore rule errors as required.
    • Select Fields: This parameter is mandatory if you select Field rule for Rule Type. Select a field in the corresponding data table.
      NOTE:
      Field names containing only one letter (such as a, b, c, and d) cannot be verified.
    • Reference Data Object: This parameter is mandatory if you select Cross-field rule for Rule Type. Select a reference data field.
    • Dimension: This parameter is mandatory if you select Custom rule for Rule Type. It associates the custom rule with one of the six quality attributes: completeness, validity, timeliness, consistency, accuracy, and uniqueness.
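
    For example, if Rule Type is set to Custom rule, the SQL statement might look like the following minimal sketch. The sales.orders table and the order_amount column are hypothetical placeholders; replace them with objects from your own database:

    -- Hypothetical custom rule: count the rows whose amount is missing or negative.
    SELECT COUNT(*) AS invalid_rows
    FROM sales.orders
    WHERE order_amount IS NULL OR order_amount < 0;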

    Compute Engine

    • Cluster Name: Select the engine for running the quality job. This parameter is valid only for DLI data connections.

    Rule Template

    • Template: Select a system or custom rule template.
      NOTE:
      The template type is closely related to the rule type. For details, see Table 1. In addition to system rule templates, you can select a custom rule template created in Creating Rule Templates.
    • Version: This parameter is required only when you select a custom rule template. Select the version of the published custom rule template.
    • Scoring Weight: Set the weight for the rule based on the field level. The value is an integer from 1 to 9. The default value is 5.

    Object Scope

    • Scanning Scope: You can select All or Partial. The default value is All. If you want only part of the data to be computed, or want quality jobs to be executed periodically based on a timestamp, set a WHERE condition for scanning.
    • WHERE Clause: Enter a WHERE clause. The system scans only the data that matches the clause.
      For example, if you want to scan only the data whose age field is in the range (18, 60], enter the following WHERE clause:
      age > 18 and age <= 60
      You can also enter a dynamic SQL expression. For example, if you want to scan only the data generated within the last 24 hours based on the time field, enter the following WHERE clause:
      time >= (date_trunc('hour', now()) - interval '24 h') and time <= (date_trunc('hour', now()))

    Alarm Condition

    • Alarm Expression: Set this parameter if you want to set an alarm condition for the current rule. If you want to use logical operations across multiple rules to build a unified alarm condition expression, you do not need to set this parameter. Instead, set it on the Set Alarm page in the next step.
      After the alarm condition is configured, the system determines whether to generate an alarm based on the value of Parameter and the alarm condition. Apart from a single alarm expression, you can build more complex alarm conditions with the following operators. Sub-expressions can be grouped with parentheses.
      • +: addition
      • -: subtraction
      • *: multiplication
      • /: division
      • ==: equal to
      • !=: not equal to
      • >: greater than
      • <: less than
      • >=: greater than or equal to
      • <=: less than or equal to
      • !: logical NOT
      • ||: logical OR
      • &&: logical AND
      For example, if Rule Template is set to Null value, you can set this parameter as follows:
      • If you want an alarm to be generated when the number of rows with a null value is greater than 10, enter ${1}>10 (${1} is the number of rows with a null value).
      • If you want an alarm to be generated when the ratio of fields with a null value is greater than 80%, enter ${3}>0.8 (${3} is the ratio of fields with a null value).
      • If you want an alarm to be generated when either of the preceding conditions is met, enter (${1}>10)||(${3}>0.8) (|| indicates that an alarm is generated if either condition is met).
    • Parameter: The value of this parameter is obtained from the output of the rule template. When you click a parameter, its expression is displayed in Alarm Expression.
      For example, if Template is set to Null value, ${1} is displayed in Alarm Expression when you click the alarm parameter Null Value Rows.
    • Logical Operator: This parameter is optional. You can perform logical operations on the results of alarm expressions to build more complex alarm conditions.
      Move the cursor between two alarm expressions in Alarm Expression and click an operator to insert it, or enter an operator manually. The expression supports the same operators listed above, and sub-expressions can be grouped with parentheses.
      For example, if Template is set to Null value and you want an alarm to be generated when the number of rows with a null value is greater than 10 or the ratio of fields with a null value is greater than 80%, enter (${1}>10)||(${3}>0.8) for Alarm Expression.

    • Score Quality: This parameter is mandatory if you select Custom rule for Rule Type.
    • Generate Anomaly Data: Enable Generate Anomaly Data and click Select next to Anomaly Table to store the anomaly data that does not comply with the preset rules.
      NOTE:
      • For a field rule, the average, total, maximum, and minimum values of a field in a field-level rule template cannot be used to generate anomaly data.
      • If periodic scheduling or re-execution is configured for a quality job, the abnormal data detected by each instance scan is inserted into the anomaly table. You are advised to periodically delete the data in the anomaly table to reduce costs and maintain good performance.
    • Anomaly Table: Select a database table. You can configure the prefix and suffix of the output table name.
    • Output Settings:
      • Output Rule Settings: If you select this option, the quality job rule settings are written to the anomaly tables so that you can easily trace the sources of the anomaly data.
      • Output null: If you select this option, null values that do not comply with the preset rules are also written to the anomaly tables.
    • Anomaly Data Amount: You can export all anomaly data or a specified amount of anomaly data.
    • Anomaly Table SQL: This parameter is mandatory if you select Custom rule for Rule Type. Enter a complete SQL statement that specifies the abnormal data to be exported, as in the example below.
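
    For example, if Rule Type is set to Custom rule, the anomaly table SQL could select the offending rows themselves. This sketch reuses the hypothetical sales.orders table from the earlier example, and order_id is likewise a placeholder column:

    -- Hypothetical anomaly export: select the rows that violate the custom rule.
    SELECT order_id, order_amount
    FROM sales.orders
    WHERE order_amount IS NULL OR order_amount < 0;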

    • View Duplicate Rules: Click this option to check for duplicate rules.
      • Rule repetition is determined based on tables and fields.
      • The related sub-rules and existing quality jobs are displayed.

    Object Scope

    • Scanning Scope: Specify the scope to which the created rule applies.
      • If All is selected, all tables are scanned.
      • If Partial is selected, you must enter a WHERE condition expression to precisely locate the partitions to be queried, as in the example below.
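
    For example, for a table partitioned by day with a hypothetical partition column dt, the following WHERE condition limits the scan to a single partition:

    dt = '2022-09-23'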
  5. Click Next and set alarm information. If you configured an alarm expression in the previous step, it is displayed automatically. If there are two or more sub-rules, you can use either of the following methods to configure alarms:
    1. Use the alarm conditions of the sub-rules to report alarms.
    2. Perform mathematical and logical operations on the alarm parameter values to generate a universal alarm expression that specifies whether to report alarms for the job.
    The alarm expression supports the same arithmetic, comparison, and logical operators as the Alarm Condition parameters in Table 3, and sub-expressions can be grouped with parentheses.
  6. Click Next and set the subscription information. If an SMN notification is required, enable Notification, and set Notification Type and Topic.

    After notification is enabled, a notification of the configured type is sent for all subjobs.

  7. Click Next to go to the page where you can select a scheduling mode. Currently, Once and On schedule are supported. Set the parameters for periodic scheduling by referring to Table 4, and click Submit.
    • If Once is selected, a manual task instance is generated. A manual task has no scheduling dependency and must be triggered manually.
    • If On schedule is selected, a periodic instance is generated. A periodic instance is an instance snapshot that is automatically scheduled when a periodic task reaches its scheduled execution time.
    • Each time a periodic task is scheduled, an instance workflow is generated. You can perform routine O&M on scheduled instance tasks, such as viewing the running status and stopping or rerunning the tasks.
    • Only MRS clusters that support job submission through an agency support periodic scheduling of quality jobs. Such clusters are as follows:
      • Non-security MRS clusters
      • MRS security clusters whose version is later than 2.1.0 and that have MRS 2.1.0.1 or later installed
    Table 4 Parameters

    • Effective: Effective date of the scheduling task.
    • Cycle: The frequency at which the scheduling task is executed. The options are:
      • Minutes
      • Hours
      • Days
      • Weeks
      NOTE:
      • If Cycle is set to Minutes or Hours, set the start time, end time, and interval for the scheduling task. Currently, the start time is accurate to the minute for staggered scheduling.
      • If Cycle is set to Days, set the time at which the scheduling task runs every day.
      • If Cycle is set to Weeks, set Scheduling Time and Start from for the scheduling task, that is, the task runs at XX:XX on XXX every week.

Exporting Quality Jobs

You can export a maximum of 200 quality jobs.

  1. In the left navigation pane on the DataArts Quality page, choose Quality Monitoring > Quality Jobs. In the right pane, select the quality jobs to export.
  2. Click Export. The Export Quality Job dialog box is displayed.
  3. Click Export to switch to the Export Records tab.
  4. In the list of exported files, locate an exported quality job and click Download in the Operation column to download the Excel file of the job to the local PC.

Importing Quality Jobs

You can import a file containing a maximum of 4 MB of data.

  1. In the left navigation pane on the DataArts Quality page, choose Quality Monitoring > Quality Jobs. In the right pane, click Import. The Import Quality Job dialog box is displayed.

  2. On the Import Configurations tab page, set Duplicate Name Policy.

    • Terminate: If any quality job name is duplicate, the import fails for all quality jobs.
    • Skip: Quality jobs with duplicate names are skipped, and the other quality jobs are still imported.
    • Overwrite: If quality job names are duplicate, the new jobs replace the existing ones with the same names.

  3. Click Upload and select the prepared data file.

    You can prepare the data file using either of the following methods:

    • (Recommended) Click Export to export existing quality jobs, then import the exported file directly or modify it before importing it.
    • Click Download Template, fill in the template with the data to import, and then import the file.

  4. Configure resource mapping for the data connection, cluster, directory, and topic.

    • Data Connection: Select the type of the imported data connection.
    • Cluster: If the data connection type is DLI, select the corresponding queue.
    • Directory: Select the directory where the imported quality job is stored.
    • Topic: If SMN is configured, you need to select a topic.

  5. Click Import to import the Excel template to the system.
  6. Click the Import Records tab to view the import records.