Updated on 2024-04-29 GMT+08:00

Configuring Data Lineages

DataArts Studio supports two ways of generating data lineages: automatic lineage parsing and manual lineage configuration. Automatic lineage parsing is recommended because lineages are generated without any manual configuration. Configure lineages manually only for nodes that automatic lineage parsing does not support.
  • Automatic lineage parsing: Lineages are automatically generated after the system parses the data processing and data migration nodes in data development jobs. No manual configuration is required. For details about the node types and scenarios that support automatic lineage parsing, see Automatic Lineage Parsing.
  • Manual lineage configuration: You customize the input and output tables of lineages on the nodes of a data development job. If you configure a lineage manually for a node, automatic lineage parsing no longer takes effect for that node. For details about the node types that support manual lineage configuration, see Manually Configuring a Lineage.

Constraints

Currently, field-level lineage parsing is not supported.

Automatic Lineage Parsing

Automatic lineage parsing does not require manual configuration. When a data development job contains the nodes and scenarios listed in Table 1, the system can automatically parse lineages.

For SQL nodes, a script that contains multiple SQL statements can be parsed, and column-level lineage parsing is supported. However, a single SQL statement cannot itself contain a semicolon (;).
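The semicolon restriction can be illustrated with a minimal Python sketch. It assumes that statements in a script are delimited by semicolons; the function `split_statements` and all table names are hypothetical, and this is not the DataArts Studio parser.

```python
def split_statements(script: str) -> list:
    """Naively split a SQL script into statements on semicolons (assumed delimiter)."""
    return [part.strip() for part in script.split(";") if part.strip()]


# A script with two statements: each one can be analyzed on its own.
script = """
INSERT INTO dst_a SELECT id FROM src_a;
INSERT INTO dst_b SELECT id FROM src_b;
"""
statements = split_statements(script)

# A single statement with a semicolon inside a string literal:
# a naive split cuts it into two fragments, neither of which is valid SQL.
bad = "INSERT INTO dst_c SELECT 'x;y' AS v FROM src_c"
fragments = split_statements(bad)

print(len(statements), len(fragments))
```

Under this assumption, the first script yields two complete statements, whereas the single statement with an embedded semicolon is broken into two unusable fragments.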

Table 1 Job nodes and scenarios that support automatic lineage parsing

| Job Node | Supported Scenario |
|---|---|
| DLI SQL | Lineages generated by data insertion between DLI tables; lineages between OBS files generated by table creation statements and DLI tables |
| DWS SQL | Lineages between DWS tables generated by DML operations such as "Insert into" |
| MRS Hive SQL | Lineages between MRS tables generated by DML operations such as "Insert into/overwrite" |
| MRS Spark SQL | Lineages between MRS tables generated by DML operations such as "Insert into/overwrite" |
| CDM Job | Lineages generated during table file migration between MRS Hive, DLI, RDS, CSS, DWS, and OBS |
| ETL Job | Data lineages generated by ETL tasks between DLI, OBS, MySQL, and DWS |
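To make the idea of a table-level lineage derived from an "Insert into/overwrite" statement concrete, the following Python sketch extracts the output table and the input tables from one statement with regular expressions. It is only an illustration: the function `table_lineage`, the patterns, and the table names are assumptions, and the actual DataArts Studio parser is certainly more complete (this sketch ignores subqueries, CTEs, and comments).

```python
import re
from typing import Optional

# Output table: the name after INSERT INTO / INSERT OVERWRITE [TABLE].
TARGET = re.compile(
    r"\binsert\s+(?:into|overwrite)\s+(?:table\s+)?([\w.]+)", re.IGNORECASE
)
# Input tables: names that directly follow FROM or JOIN.
SOURCE = re.compile(r"\b(?:from|join)\s+([\w.]+)", re.IGNORECASE)


def table_lineage(statement: str) -> Optional[dict]:
    """Return {'inputs': [...], 'output': ...} for an INSERT statement, else None."""
    target = TARGET.search(statement)
    if target is None:
        return None  # not an INSERT INTO/OVERWRITE statement
    inputs = sorted(set(SOURCE.findall(statement)))
    return {"inputs": inputs, "output": target.group(1)}


# Hypothetical statement and table names, for illustration only.
statement = (
    "INSERT OVERWRITE TABLE dw.orders_daily "
    "SELECT o.id, c.name FROM ods.orders o "
    "JOIN ods.customers c ON o.cid = c.id"
)
result = table_lineage(statement)
print(result)
# {'inputs': ['ods.customers', 'ods.orders'], 'output': 'dw.orders_daily'}
```

The result reads as "ods.customers and ods.orders flow into dw.orders_daily", which is the kind of table-to-table relationship listed in Table 1.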

Manually Configuring a Lineage

In a DataArts Studio data development job, you can customize the input and output tables of lineages on the nodes of the job. If you configure a lineage manually for a node, automatic lineage parsing no longer takes effect for that node.

The following types of job nodes support manual lineage configuration.

To configure a lineage manually, set the input and output tables of the lineage on the Lineage tab page of the node. The data source of an input or output table can be DLI, DWS, Hive, CSS, OBS, or CUSTOM. CUSTOM indicates a custom type: use it to add a data source that is not otherwise supported.
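The two rules above (a manual lineage takes precedence over automatic parsing for a node, and unlisted data sources are added as CUSTOM) can be modeled with a short Python sketch. All class and function names here are hypothetical and do not correspond to a DataArts Studio API; the data source types assigned to the example tables are also assumptions.

```python
from dataclasses import dataclass
from typing import List, Optional

# Data source types listed in the text above.
LISTED_SOURCE_TYPES = {"DLI", "DWS", "Hive", "CSS", "OBS", "CUSTOM"}


@dataclass(frozen=True)
class LineageTable:
    source_type: str
    name: str


@dataclass
class NodeLineage:
    inputs: List[LineageTable]
    outputs: List[LineageTable]


def make_table(source_type: str, name: str) -> LineageTable:
    """Register a table; a data source type that is not listed becomes CUSTOM."""
    if source_type not in LISTED_SOURCE_TYPES:
        source_type = "CUSTOM"
    return LineageTable(source_type, name)


def effective_lineage(
    manual: Optional[NodeLineage], parsed: Optional[NodeLineage]
) -> Optional[NodeLineage]:
    """A manually configured lineage, if present, replaces the parsed one."""
    return manual if manual is not None else parsed


# Hypothetical node: input table "hive" (Hive), output table "a" (unlisted source).
manual = NodeLineage(
    inputs=[make_table("Hive", "hive")],
    outputs=[make_table("MySQL", "a")],
)
parsed = NodeLineage(
    inputs=[make_table("DLI", "t1")],
    outputs=[make_table("DLI", "t2")],
)
print(effective_lineage(manual, parsed).outputs[0])
```

In this model, the output table `a` is stored with the CUSTOM type, and the manual lineage is the one that takes effect for the node.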

Figure 1 Example of manual configuration of lineage relationships

For example, suppose you need to manually configure a lineage for an MRS Spark node in a pipeline data development job, because this node does not support automatic lineage parsing. The procedure is as follows:

  1. On the DataArts Factory console, choose Data Development > Develop Job. Double-click the name of the job for which you want to configure a lineage to open the job canvas.
  2. Click the MRS Spark node on the job canvas, and then open the lineageInfo page.

    Figure 2 lineageInfo page

  3. Configure the lineage input table. For example, you can configure input table hive, as shown in Figure 3.

    Figure 3 Configuring the lineage input

  4. Click OK and configure the lineage output table. For example, you can configure output table a, as shown in Figure 4.

    Figure 4 Configuring the lineage output

  5. Click OK. The lineage for the MRS Spark node is now configured. To view the lineage, collect metadata as described in Viewing Data Lineages and schedule the job. You can then view the manually configured lineage of the MRS Spark node in DataArts Catalog.