Updated on 2023-09-27 GMT+08:00

Creating Quality Jobs

You can create quality jobs to apply the created rules to existing tables.

Procedure

  1. On the DataArts Studio console, locate an instance and click Access. On the displayed page, locate a workspace and click DataArts Quality.
    Figure 1 DataArts Quality

  2. (Optional) In the left navigation pane, choose Quality Monitoring > Quality Jobs and create a directory. If a directory exists, you do not need to create one. Note that rule templates, quality jobs, and comparison jobs are stored in the same directories.

    Currently, you can create a directory using either of the following methods:

    Click the create directory button and enter a directory name in the displayed dialog box. You can create a maximum of seven layers of directories in this way.
    Figure 2 Creating a directory
    You can also click the synchronization button to synchronize the subjects in DataArts Architecture as directories. (Only published subjects can be synchronized.) The synchronized directories are consistent with the published subjects in DataArts Architecture and are displayed by layer.
    1. The directories you create are not affected by the synchronization. If the name of a created directory conflicts with that of a subject:
      • If they conflict during the first synchronization, a subject layer is added to the name of the directory.
      • If they conflict after the subject is modified, the synchronization fails.
    2. Changes to subjects or subject layers in DataArts Architecture are not automatically synchronized. You must click the synchronization button again to synchronize them.

      If a subject or subject layer in DataArts Architecture is deleted and synchronized to DataArts Quality, the corresponding directory will not be deleted. Instead, the subject attributes will be deleted from the directory.

    3. After the synchronization is complete, the system automatically displays the synchronization result details. You can view the names of the subjects that fail to be synchronized.
    Figure 3 Synchronizing subjects from DataArts Architecture
  3. On the Quality Jobs page, click Create. In the displayed dialog box, set the parameters based on Table 1.
    Table 1 Quality job parameters

    Name: Quality job name.

    Description: Information to better identify the quality job. It cannot exceed 1,024 characters.

    Directory: Directory for storing the quality job. You can select a created directory. For details about how to create a directory, see (Optional) Creating a Directory.

    Job Level: The options are Warning, Minor, Major, and Critical. The job level determines the template for sending notification messages.

  4. Click Next to go to the Define Rule page, on which each rule card corresponds to a subjob. Click the edit icon on a rule card and configure the rule based on Table 2. You can also add more quality rules and click Next to apply them to a created database or table.
    Figure 4 Configuring rules for a quality job
    Table 2 Parameters for configuring a rule

    Basic Information

    Subjob Name: In the job execution result, each rule corresponds to a subjob. You are advised to set the subjob information so that you can view the job execution result and locate faults through logs more easily.

    Description: Information to better identify the subjob. It cannot exceed 1,024 characters.

    Object

    Rule Type: Database rules, table rules, field rules, cross-field rules, multi-table and multi-field rules, or custom rules configured for specific fields in a table.

    Data Connection: Source and destination objects support the following data connection types: DWS, MRS Hive, DLI, RDS (MySQL and PostgreSQL), ClickHouse, Oracle, and MRS Spark (Hudi). Select a created data connection from the drop-down list box.

    NOTE:
    • Rules are based on data connections. Therefore, you must create data connections in Management Center before creating data quality rules.
    • For MRS Hive connected through a proxy, select the MRS API mode or proxy mode.
      • MRS API mode: An MRS API is used for submission. By default, historical jobs are submitted through MRS APIs. You are advised to keep the default settings when editing such jobs.
      • Proxy mode: A username and a password are used for submission. You are advised to select this mode for new jobs to prevent job submission failures caused by permission issues.

    Database: Select the database to which the configured data quality rules are applied.

    NOTE:
    • The available databases depend on the selected data connection.
    • When Rule Type is set to Database rule, set the data object to the corresponding database.

    Data Table: Select the table to which the configured data quality rules apply.

    NOTE:
    • The available tables depend on the selected database.
    • When Rule Type is set to Table rule, set the data object to the corresponding table.

    SQL: This parameter is mandatory if you select Custom rule for Rule Type. Enter a complete SQL statement to define how to monitor the quality of data objects.
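
    For example, a hypothetical custom rule might count the rows whose email value lacks an at sign. The following is a minimal sketch; the database, table, and field names are examples only:
    SELECT COUNT(*) AS invalid_email_rows FROM demo_db.customer WHERE email NOT LIKE '%@%'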

    Failure Policy: Select Ignore rule errors as required.

    Select Fields: This parameter is mandatory if you select Field rule for Rule Type. Select a field in the corresponding data table.

    NOTE:

    Field names containing only one letter (such as a, b, c, and d) cannot be verified.

    Data Object: This parameter is mandatory if you select Cross-field rule for Rule Type. Select the data field to be checked.

    When you select a table name, the search box is case sensitive.

    Reference Data Object: This parameter is mandatory if you select Cross-field rule for Rule Type. Select a reference data field.

    When you select a table name, the search box is case sensitive.

    Dimension: This parameter is mandatory if you select Custom rule for Rule Type. It associates the custom rule with one of the six quality attributes: completeness, validity, timeliness, consistency, accuracy, and uniqueness.

    Compute Engine

    Cluster Name: Select the engine for running the quality job. This parameter is valid only for DLI data connections.

    Rule Template

    Template: Select a system or custom rule template.

    NOTE:

    The available templates depend on the rule type. In addition to system rule templates, you can select a custom rule template created in Creating Rule Templates.

    Version: This parameter is required only when you select a custom rule template. Select the version of the published custom rule template.

    Scoring Weight: Set the weight for the rule based on the field level. The value is an integer from 1 to 9. The default value is 5.

    Object Scope

    Scanning Scope: You can select All or Partial. The default value is All.

    If you want only part of the data to be computed, or quality jobs to be executed periodically based on a timestamp, you can set a WHERE condition for scanning.

    You can transfer environment variables from data quality jobs to data development jobs.

    If rules can be configured for multiple tables, the data range of each table can be set independently. If both Data Object and Reference Data Object are set, you need to configure the data to scan in the scanning scope.

    WHERE Clause: Enter a WHERE clause. The system will scan the data that matches the clause.

    For example, to scan only the data whose age value is in the range (18, 60], enter the following WHERE clause:
    age > 18 and age <= 60

    You can also enter a dynamic SQL expression. For example, to scan only the data generated in the past 24 hours based on the time field, enter the following WHERE clause:

    time >= (date_trunc('hour', now()) - interval '24 h') and time <= (date_trunc('hour', now()))

    To transfer parameters through a data quality job, you can enter a condition expression. For example, to transfer time parameters to a data development job, enter the following expression:

    p_date=${target_date}

    Parameter Value: This parameter is required when you select Partial for Scanning Scope.

    Enter the default values of the parameters in the WHERE clause.

    If the values of the parameters in the WHERE clause are obtained from Configuration in DataArts Factory, this parameter is optional.

    If the values of the parameters in the WHERE clause are obtained from DataArts Quality, this parameter is mandatory.
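
    For example, assuming the WHERE clause transfers a time parameter as shown above (the parameter name and value here are illustrative only), you can provide its default value as follows:
    target_date=2023-09-01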

    Alarm Condition

    Alarm Expression: Set this parameter if you want to set an alarm condition for the current rule. If you want to use the logical operations of multiple rules to set a unified alarm condition expression, skip this parameter and set it on the Set Alarm page in the next step.

    After the alarm conditions are configured, the system determines whether to generate an alarm based on the value of Parameter and the alarm condition. Apart from a single alarm expression, you can also use more complex alarm conditions consisting of logical operators. The alarm expression supports the following operators, and subexpressions can be enclosed in parentheses:

    • +: addition
    • -: subtraction
    • *: multiplication
    • /: division
    • ==: equal to
    • !=: not equal to
    • >: greater than
    • <: less than
    • >=: greater than or equal to
    • <=: less than or equal to
    • !: logical NOT
    • ||: logical OR
    • &&: logical AND

    For example, if Rule Template is set to Null value, you can set this parameter as follows:

    • If you want an alarm to be generated when the number of rows with a null value is greater than 10, enter ${1}>10 (${1} is the number of rows with a null value).
    • If you want an alarm to be generated when the ratio of fields with a null value is greater than 80%, enter ${3}>0.8 (${3} is the ratio of fields with a null value).
    • If you want an alarm to be generated when the number of rows with a null value is greater than 10 or the ratio of fields with a null value is greater than 80%, enter (${1}>10)||(${3}>0.8) (${1} is the number of rows with a null value, ${3} is the ratio of fields with a null value, and || indicates that an alarm will be generated if either of the conditions is met).

    Parameter: The value of this parameter is obtained from the output of the rule template. If you click a parameter, its expression is displayed in Alarm Expression.

    For example, if Template is set to Null value, ${1} is displayed in Alarm Expression when you click the alarm parameter Null Value Rows.

    Logical Operator: This parameter is optional. You can perform logical operations on the results of alarm expressions to generate more complex alarm conditions.

    You can move the cursor between two alarm expressions in Alarm Expression and click one of the following operators to insert it. You can also manually enter an operator. The expression supports the following operators, and subexpressions can be enclosed in parentheses:

    • +: addition
    • -: subtraction
    • *: multiplication
    • /: division
    • ==: equal to
    • !=: not equal to
    • >: greater than
    • <: less than
    • >=: greater than or equal to
    • <=: less than or equal to
    • !: logical NOT
    • ||: logical OR
    • &&: logical AND

    For example, if Template is set to Null value and you want an alarm to be generated when the number of rows with a null value is greater than 10 or the ratio of fields with a null value is greater than 80%, enter (${1}>10)||(${3}>0.8) for Alarm Expression (${1} is the number of rows with a null value, ${3} is the ratio of fields with a null value, and || indicates that an alarm will be generated if either of the conditions is met).

    Score Quality: This parameter is mandatory if you select Custom rule for Rule Type.

    Generate Anomaly Data: Enable Generate Anomaly Data and click Select next to Anomaly Table to store the anomaly data that does not comply with the preset rules.

    NOTE:
    • For a field rule, the average value, total value, maximum value, and minimum value of a field in the field-level rule template cannot be used to generate anomaly data.
    • If periodic scheduling or re-execution is configured for a quality job, the abnormal data detected in each instance scan is inserted into the anomaly table. You are advised to periodically delete data from the anomaly table to reduce costs and ensure good performance.

    Anomaly Table: Select a database table. You can configure the prefix and suffix of the output table name.

    NOTE:

    When you set an anomaly table, the system adds the suffix err to the table name by default.

    Output Settings:

    • Output Rule Settings: If you select this option, the quality job settings are written to the anomaly tables so that you can easily trace the sources of anomaly data.
    • Output null: If you select this option, null values that do not comply with the preset rules are also written to the anomaly tables.

    Anomaly Data Amount: You can choose to export all anomaly data or a specified amount of anomaly data.

    Anomaly Table SQL: This parameter is mandatory if you select Custom rule for Rule Type. Enter a complete SQL statement to specify the abnormal data to be exported.
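
    For example, continuing the hypothetical email rule above (the database, table, and field names are examples only), the following statement exports the offending records:
    SELECT * FROM demo_db.customer WHERE email NOT LIKE '%@%'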

    View Duplicate Rules: Click it to check for duplicate rules.

    • Rule repetition is determined based on tables and fields.
    • You can view the related sub-rules and quality jobs that already exist.
  5. Click Next and set alarm information. If you configured an alarm expression in the previous step, that expression is automatically displayed. If there are two or more sub-rules, you can use either of the following methods to configure alarms:
    1. Use the alarm conditions of sub-rules to report alarms.
    2. Perform mathematical and logical operations on the alarm parameter values to generate a universal alarm expression to specify whether to report alarms for jobs.
    The alarm expression supports the following operators, and subexpressions can be enclosed in parentheses:
    • +: addition
    • -: subtraction
    • *: multiplication
    • /: division
    • ==: equal to
    • !=: not equal to
    • >: greater than
    • <: less than
    • >=: greater than or equal to
    • <=: less than or equal to
    • !: logical NOT
    • ||: logical OR
    • &&: logical AND
  6. Click Next and set the subscription information. If SMN notifications are required, enable Notification and set Notification Type and Topic.

    After notification is enabled, a notification is sent for all the subjobs of the configured notification type.

    Currently, only SMS and email are available for subscribing to topics.

  7. Click Next to go to the page where you can select a scheduling mode. Currently, Once and On schedule are supported. Set the parameters for periodic scheduling by referring to Table 3. Then click Submit.
    1. If Once is selected, a manual task instance is generated. A manual task has no dependency on scheduling and must be manually triggered.
    2. If On schedule is selected, a periodic instance is generated. A periodic instance is an instance snapshot that is automatically scheduled when a periodic task reaches the scheduled execution time.
    3. When a periodic task is scheduled once, an instance workflow is generated. You can perform routine O&M on scheduled instance tasks, such as viewing their running status and stopping or rerunning them.
    4. Only MRS clusters that support job submission through an agency support periodic scheduling of quality jobs. Such clusters are as follows:
      • Non-security MRS clusters
      • MRS security clusters whose version is later than 2.1.0 and that have MRS 2.1.0.1 or later installed
    Table 3 Parameters for periodic scheduling

    Effective: Effective date of the scheduling task.

    Cycle: The frequency at which the scheduling task is executed. The options are as follows:

    • Minutes
    • Hours
    • Days
    • Weeks
      NOTE:
      • If Cycle is set to Minutes or Hours, set the start time, end time, and interval for the scheduling task. Currently, the start time can be set to the minute so that scheduling can be staggered.
      • If Cycle is set to Days, set the time at which the scheduling task runs every day.
      • If Cycle is set to Weeks, set Scheduling Time and Start from for the scheduling task, that is, a specific time on one or more days of each week.
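
    For example, to run a quality job at 02:30 every day, set Cycle to Days and the scheduling time to 02:30. (The time here is illustrative only.)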

    After a quality job is created, you can view it in the job list. You can also filter jobs by job name, creator, owner, table name, and time range. The system supports fuzzy search.

Exporting Quality Jobs

You can export a maximum of 200 quality jobs at a time.

  1. In the left navigation pane on the DataArts Quality page, choose Quality Monitoring > Quality Jobs. In the right pane, select the quality jobs to export.

    Figure 5 Export

  2. Click Export. The Export Quality Job dialog box is displayed.

    Figure 6 Exporting Quality Jobs

  3. Click the Export Records tab to view the export result.

    Figure 7 Export Records

  4. In the list of exported files, locate an exported quality job and click Download in the Operation column to download the Excel file of the job to the local PC.

Exporting All Quality Jobs

To export all quality jobs, perform the following operations:

  1. Choose Quality Monitoring > Quality Jobs and click Export All.

    Figure 8 Export All

  2. In the displayed Export Quality Job dialog box, click Export.

    Figure 9 Exporting all quality jobs

  3. Click the Export Records tab to view the export result.

    Figure 10 Export Records

  4. In the list of exported files, locate an exported quality job and click Download in the Operation column to download the Excel file of the job to the local PC.

Importing Quality Jobs

You can import a file containing a maximum of 4 MB of data.

  1. In the left navigation pane on the DataArts Quality page, choose Quality Monitoring > Quality Jobs. In the right pane, click Import. The Import Quality Job dialog box is displayed.

    Figure 11 Importing quality jobs

  2. On the Import Configurations tab page, set Duplicate Name Policy.

    • Terminate: If quality job names repeat, all quality jobs will fail to be imported.
    • Skip: If quality job names repeat, the duplicate jobs are skipped, and the remaining jobs are still imported.
    • Overwrite: If quality job names repeat, new jobs will replace existing ones with the same names.

      If you select Overwrite, stop job scheduling before uploading a file. Otherwise, the upload will fail.

  3. Click Upload and select the prepared data file.

    Edit the data file using either of the following methods:

    • (Recommended) Click Export to export the data, and then import the exported file directly or import it after modification.
    • Click Download Template, fill in the template with the data to import, and import the data to the system.

  4. Configure resource mapping for the data connection, cluster, directory, and topic. If you do not configure the resource mapping, the original mapping is used by default.

    Figure 12 Configuring the resource mapping

    • Data Connection: Select the type of the imported data connection.
    • Cluster: If the data connection type is DLI, select the corresponding queue.
    • Directory: Select the directory where the imported quality job is stored.
    • Topic: If SMN is configured, you need to select a topic.

  5. Click Import to import the Excel template to the system.
  6. Click the Import Records tab to view the import records.

Running Quality Jobs

You can run a maximum of 200 quality jobs at a time.

  1. In the left navigation pane on the DataArts Quality page, choose Quality Monitoring > Quality Jobs. In the right pane, select the quality jobs you want to run.
  2. Above the job list, click More and select Run to run the selected quality jobs.

    Figure 13 Running jobs

Scheduling Quality Jobs

You can schedule a maximum of 200 quality jobs at a time.

  1. In the left navigation pane on the DataArts Quality page, choose Quality Monitoring > Quality Jobs. In the right pane, select the quality jobs you want to schedule.
  2. Above the job list, click More and select Start Schedule to schedule the selected quality jobs.

    Figure 14 Scheduling jobs

Stopping Scheduling Quality Jobs

You can stop scheduling a maximum of 200 quality jobs at a time.

  1. In the left navigation pane on the DataArts Quality page, choose Quality Monitoring > Quality Jobs. In the right pane, select the quality jobs you want to stop scheduling.
  2. Above the job list, click More and select Stop Scheduling to stop scheduling the selected quality jobs.

    Figure 15 Stopping scheduling jobs

Stopping Quality Jobs

You can stop a maximum of 200 quality jobs at a time.

Only quality jobs in the Running state can be stopped.

  1. In the left navigation pane on the DataArts Quality page, choose Quality Monitoring > O&M. In the right pane, select the quality jobs you want to stop.
  2. Click Stop. In the displayed Stop Instance dialog box, confirm the instances to stop and click Yes.

    Figure 16 Stopping instances
