Creating Quality Jobs
You can create quality jobs to apply the created rules to existing tables.
Prerequisites
You have created a directory for storing the quality job. To create a directory, click Create Directory (callout 1 in Figure 1) in the navigation pane. Before creating a quality job for a data connection, select a directory to store the quality job.
| No. | Description |
|---|---|
| 1 | Create Directory |
| 2 | Refresh Directory |
| 3 | Select All. Right-click to create, delete, and rename directories. |
Procedure
- On the DataArts Studio console, locate an instance and click Access. On the displayed page, locate a workspace and click DataArts Quality.
- Choose Quality Monitoring > Quality Jobs in the left navigation bar.
- Click Create. In the dialog box displayed, set the parameters based on Table 2.
Table 2 Quality job parameters

| Parameter | Description |
|---|---|
| Name | Quality job name |
| Description | Information to better identify the quality job. It cannot exceed 256 characters. |
| Directory | Directory for storing the quality job. You can select a created directory (see Figure 1). |
| Job Level | The options are Warning, Minor, Major, and Critical. The job level determines the template for sending notification messages. |
- Click Next to go to the Define Rule page, on which each rule card corresponds to a subjob. Click a rule card and configure it based on Table 3. You can also add more quality rules. Then click Next to apply the rules to a created database or table.
Figure 2 Configuring rules for a quality job
Table 3 Parameters for configuring a rule

- Basic Information
  - Subjob Name: In the job execution result, each rule corresponds to a subjob. You are advised to set the subjob information so that you can view the job execution result and locate faults through logs more easily.
  - Description: Information to better identify the subjob.
- Object
  - Rule Type: Database rules, table rules, field rules, cross-field rules, or custom rules configured for specific fields in a table.
  - Data Connection: Source and destination objects support the following data connection types: DWS, MRS Hive, DLI, RDS (MySQL and PostgreSQL), Oracle, and MRS Spark (Hudi). Select a created data connection from the drop-down list box.
    NOTE:
    - Rules are based on data connections. Therefore, you must create data connections in Management Center before creating data quality rules.
    - For MRS Hive connected through a proxy, select the MRS API mode or the proxy mode.
      - MRS API mode: An MRS API is used for submission. By default, historical jobs are submitted through MRS APIs. You are advised to keep the default settings when editing such a job.
      - Proxy mode: A username and a password are used for submission. You are advised to select this mode for new jobs to prevent job submission failures caused by permission issues.
  - Database: Select the database to which the configured data quality rules are applied.
    NOTE:
    - The available databases depend on the created data connection.
    - When Rule Type is set to Database rule, set the data object to the corresponding database.
  - Data Table: Select the table to which the configured data quality rules apply.
    NOTE:
    - The available tables depend on the selected database.
    - When Rule Type is set to Table rule, set the data object to the corresponding table.
  - SQL: This parameter is mandatory if you select Custom rule for Rule Type. Enter a complete SQL statement to define how to monitor the quality of data objects.
  - Failure Policy: Select Ignore rule errors as required.
  - Select Fields: This parameter is mandatory if you select Field rule for Rule Type. Select a field in the corresponding data table.
    NOTE: Field names consisting of only one letter (such as a, b, c, and d) cannot be verified.
  - Reference Data Object: This parameter is mandatory if you select Cross-field rule for Rule Type. Select a reference data field.
  - Dimension: This parameter is mandatory if you select Custom rule for Rule Type. It associates the custom rule with one of the six quality attributes: completeness, validity, timeliness, consistency, accuracy, and uniqueness.
- Compute Engine
  - Cluster Name: Select the engine for running the quality job. This parameter is valid only for DLI data connections.
- Rule Template
  - Template: Select a system or custom rule template.
    NOTE: The template type is closely related to the rule type. For details, see Table 1. In addition to system rule templates, you can select a custom rule template created in Creating Rule Templates.
  - Version: This parameter is required only when you select a custom rule template. Select the version of the published custom rule template.
  - Scoring Weight: Set the weight for the rule based on the field level. The value is an integer from 1 to 9. The default value is 5.
- Object Scope
  - Scanning Scope: You can select All or Partial. The default value is All. If you want only part of the data to be computed, or quality jobs to be executed periodically based on a timestamp, set a WHERE condition for scanning.
  - WHERE Clause: Enter a WHERE clause. The system scans only the data that matches the clause.
    For example, to scan only the data whose age value is in the range (18, 60], enter the following WHERE clause: age > 18 and age <= 60
    You can also enter a dynamic SQL expression. For example, to scan only the data generated in the past 24 hours based on the time field, enter the following WHERE clause: time >= (date_trunc('hour', now()) - interval '24 h') and time <= (date_trunc('hour', now()))
- Alarm Condition
  - Alarm Expression: Set this parameter if you want to set an alarm condition for the current rule. If you want to use logical operations across multiple rules to build a unified alarm condition expression, you do not need to set this parameter here. Instead, set it on the Set Alarm page in the next step.
    After the alarm conditions are configured, the system determines whether to generate an alarm based on the value of Parameter and the alarm condition. Apart from a single alarm expression, you can build more complex alarm conditions. The alarm expression supports the following arithmetic, comparison, and logical operators, and sub-expressions can be enclosed in "(" and ")":
    - +: addition
    - -: subtraction
    - *: multiplication
    - /: division
    - ==: equal to
    - !=: not equal to
    - >: greater than
    - <: less than
    - >=: greater than or equal to
    - <=: less than or equal to
    - !: NOT
    - ||: OR
    - &&: AND
    For example, if Rule Template is set to Null value, you can set this parameter as follows:
    - If you want an alarm to be generated when the number of rows with a null value is greater than 10, enter ${1}>10 (${1} is the number of rows with a null value).
    - If you want an alarm to be generated when the ratio of fields with a null value is greater than 80%, enter ${3}>0.8 (${3} is the ratio of fields with a null value).
    - If you want an alarm to be generated when either of the preceding conditions is met, enter (${1}>10)||(${3}>0.8) (|| indicates that an alarm is generated if either condition is met).
  - Parameter: The value of this parameter is obtained from the output of the rule template. When you click a parameter, its expression is inserted into Alarm Expression. For example, if Template is set to Null value, ${1} is displayed in Alarm Expression when you click the alarm parameter Null Value Rows.
  - Logical Operator: This parameter is optional. You can perform logical operations on the results of alarm expressions to build more complex alarm conditions. Move the cursor between two alarm expressions in Alarm Expression and click one of the operators to insert it, or enter an operator manually. The same operators as those listed for Alarm Expression are supported, and sub-expressions can be enclosed in brackets ().
    For example, if Template is set to Null value and you want an alarm to be generated when the number of rows with a null value is greater than 10 or the ratio of fields with a null value is greater than 80%, enter (${1}>10)||(${3}>0.8) for Alarm Expression.
- Score Quality: This parameter is mandatory if you select Custom rule for Rule Type.
- Generate Anomaly Data: Enable Generate Anomaly Data and click Select next to Anomaly Table to store the anomaly data that does not comply with the preset rules.
  NOTE:
  - For a field rule, the average, total, maximum, and minimum values of a field in a field-level rule template cannot be used to generate anomaly data.
  - If periodic scheduling or re-execution is configured for a quality job, the abnormal data detected in each instance scan is inserted into the anomaly table. You are advised to periodically delete data from the anomaly table to reduce costs and maintain performance.
  - Anomaly Table: Select a database table. You can configure the prefix and suffix of the output table name.
  - Output Settings:
    - Output Rule Settings: If you select this option, the quality job settings are written to the anomaly tables so that you can easily trace the sources of the anomaly data.
    - Output null: If you select this option, null values that do not comply with the preset rules are also written to the anomaly tables.
  - Anomaly Data Amount: You can choose to export all anomaly data or a specified amount of anomaly data.
  - Anomaly Table SQL: This parameter is mandatory if you select Custom rule for Rule Type. Enter a complete SQL statement to specify the abnormal data to be exported.
- View Duplicate Rules: Click this option to view duplicate rules. Duplicate rules are identified based on tables and fields. You can view the related sub-rules and quality jobs that already exist.
- Object Scope
  - Scanning Scope: Specify the scope to which the created rule applies.
    - If All is selected, all tables are scanned.
    - If Partial is selected, you must enter a WHERE condition expression to precisely locate the partitions whose data is to be queried.
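The WHERE-based scanning scope and the Null value rule template described above can be illustrated outside DataArts with a small SQLite sketch. The users table, its columns, and the sample rows are all hypothetical, chosen only to show what such a check computes: the number of in-scope rows, the number of null rows, and the null ratio.

```python
import sqlite3

# Hypothetical sample table; not part of DataArts.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, age INTEGER)")
conn.executemany("INSERT INTO users VALUES (?, ?)",
                 [("a", 20), (None, 30), ("c", 61), (None, 59)])

# Scanning scope: only rows whose age is in the range (18, 60].
scope = "age > 18 and age <= 60"
total = conn.execute(
    f"SELECT COUNT(*) FROM users WHERE {scope}").fetchone()[0]
null_rows = conn.execute(
    f"SELECT COUNT(*) FROM users WHERE ({scope}) AND name IS NULL").fetchone()[0]
null_ratio = null_rows / total

print(total, null_rows, null_ratio)  # in-scope rows, null rows, null ratio
```

Here null_rows and null_ratio roughly correspond to the ${1} and ${3} alarm parameters of the Null value template.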
- Click Next and set the alarm information. If you configured an alarm expression in the previous step, it is displayed automatically. If there are two or more sub-rules, you can use either of the following methods to configure alarms:
  - Use the alarm conditions of the sub-rules to report alarms.
  - Perform mathematical and logical operations on the alarm parameter values to generate a universal alarm expression that determines whether to report alarms for the job.

  The alarm expression supports the same arithmetic, comparison, and logical operators as those listed for Alarm Expression in Table 3, and sub-expressions can be enclosed in "(" and ")".
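As an illustration of how such a composite alarm expression behaves, the ${n} placeholders can be substituted with the rule outputs and the result evaluated with the listed operators. This evaluator is a hypothetical sketch for understanding the semantics, not DataArts code:

```python
import re

def evaluate_alarm(expression: str, params: dict) -> bool:
    """Evaluate an alarm expression such as "(${1}>10)||(${3}>0.8)".

    params maps parameter numbers to the values produced by the rule
    template, for example {1: 25, 3: 0.5}.
    """
    # Substitute each ${n} placeholder with its numeric value.
    expr = re.sub(r"\$\{(\d+)\}",
                  lambda m: repr(params[int(m.group(1))]), expression)
    # Map the documented operators to their Python equivalents.
    expr = expr.replace("||", " or ").replace("&&", " and ")
    expr = re.sub(r"!(?!=)", " not ", expr)  # "!" (NOT), keeping "!=" intact
    # Evaluated without builtins; an illustration only, not for untrusted input.
    return bool(eval(expr, {"__builtins__": {}}))

# 25 null rows (> 10) triggers the alarm even though the ratio is below 0.8.
print(evaluate_alarm("(${1}>10)||(${3}>0.8)", {1: 25, 3: 0.5}))  # True
print(evaluate_alarm("(${1}>10)&&(${3}>0.8)", {1: 25, 3: 0.5}))  # False
```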
- Click Next and set the subscription information. If SMN notifications are required, enable Notification and set Notification Type and Topic.
  After notification is enabled, a notification of the configured type is sent for all subjobs.
- Click Next to go to the page where you can select a scheduling mode. Currently, Once and On schedule are supported. To schedule the job periodically, set the parameters listed in Table 4. Then click Submit.
  - If Once is selected, a manual task instance is generated. A manual task has no scheduling dependencies and must be triggered manually.
  - If On schedule is selected, periodic instances are generated. A periodic instance is an instance snapshot that is automatically scheduled when a periodic task reaches its scheduled execution time.
  - Each time a periodic task is scheduled, an instance workflow is generated. You can perform routine O&M on scheduled instance tasks, such as viewing their running status and stopping or rerunning them.
  - Only MRS clusters that support job submission through an agency support periodic scheduling of quality jobs. Such clusters are:
    - Non-security MRS clusters
    - MRS security clusters whose version is later than 2.1.0 and that have MRS 2.1.0.1 or later installed
Table 4 Parameters

| Parameter | Description |
|---|---|
| Effective | Effective date of the scheduling task. |
| Cycle | Frequency at which the scheduling task is executed. The options are Minutes, Hours, Days, and Weeks. |

NOTE:
- If Cycle is set to Minutes or Hours, set the start time, end time, and interval for the scheduling task. The start time is accurate to the minute, so you can stagger scheduling.
- If Cycle is set to Days, set the time at which the scheduling task runs every day.
- If Cycle is set to Weeks, set Scheduling Time and Start from for the scheduling task, that is, XX o'clock XX minutes on XXX every week.
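For instance, a Minutes cycle with a start time, end time, and interval expands to a series of trigger times. The sketch below is an outside illustration of that expansion (the 15-minute interval between 08:00 and 09:00 is an assumed example), not how DataArts is implemented:

```python
from datetime import datetime, timedelta

def expand_minute_cycle(start, end, interval_minutes):
    """List the trigger times of a Minutes cycle from start to end."""
    times, t = [], start
    while t <= end:
        times.append(t)
        t += timedelta(minutes=interval_minutes)
    return times

# Assumed example: run every 15 minutes between 08:00 and 09:00.
runs = expand_minute_cycle(datetime(2024, 1, 1, 8, 0),
                           datetime(2024, 1, 1, 9, 0), 15)
print([t.strftime("%H:%M") for t in runs])
# ['08:00', '08:15', '08:30', '08:45', '09:00']
```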
Exporting Quality Jobs
You can export a maximum of 200 quality jobs.
- In the left navigation pane on the DataArts Quality page, choose Quality Monitoring > Quality Jobs. In the right pane, select the quality jobs to export.
- Click Export. The Export Quality Job dialog box is displayed.
- Click Export to switch to the Export Records tab.
- In the list of exported files, locate an exported quality job and click Download in the Operation column to download the Excel file of the job to your local PC.
Importing Quality Jobs
You can import a file of up to 4 MB.
- In the left navigation pane on the DataArts Quality page, choose Quality Monitoring > Quality Jobs. In the right pane, click Import. The Import Quality Job dialog box is displayed.
- On the Import Configurations tab page, set Duplicate Name Policy.
- Terminate: If quality job names repeat, all quality jobs will fail to be imported.
- Skip: If quality job names repeat, the quality jobs will still be imported.
- Overwrite: If quality job names repeat, new jobs will replace existing ones with the same names.
- Click Upload and select the prepared data file.
  You can prepare the data file in either of the following ways:
  - (Recommended) Click Export to export existing data, and then import the exported file directly or after modifying it.
  - Click Download Template, fill in the template with the data to import, and then import the filled-in template.
- Configure resource mapping for the data connection, cluster, directory, and topic.
- Data Connection: Select the type of the imported data connection.
- Cluster: If the data connection type is DLI, select the corresponding queue.
- Directory: Select the directory where the imported quality job is stored.
- Topic: If SMN is configured, you need to select a topic.
- Click Import to import the Excel template to the system.
- Click the Import Records tab to view the import records.