What Is Data Lineage?

In the era of big data, data of many types is generated at an explosive rate. This massive, complex data is converged, transformed, and transferred, producing new data that accumulates into an ocean of data.

During this process, relationships form between pieces of data. These relationships are the data's lineage, analogous to the genetic relationships between people. Unlike human lineages, however, data lineages have the following distinct features:

Belongingness: Specific data belongs to a specific organization or individual.
Multi-source: One piece of data can have multiple sources. It may be generated by processing multiple pieces of data, and more than one such process may contribute to it.
Traceability: The data lineage is traceable. It reflects the data lifecycle and the entire process from data generation to data disappearance.
Hierarchy: The data lineage is hierarchical. Classifying and summarizing data produces new data, and different levels of description result in data layers.
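
The four features above can be pictured as fields on a lineage record. The following is a minimal sketch under stated assumptions, not the DataArts Studio data model: the class LineageRecord, its fields, the function trace_origins, and all sample names (orders_raw, team_a, join_job, and so on) are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class LineageRecord:
    """Hypothetical record for one piece of data; not a DataArts Studio type."""
    name: str
    owner: str   # belongingness: the organization or individual the data belongs to
    level: int   # hierarchy: 0 = raw data, higher = more summarized
    # multi-source: each entry is one producing process and the inputs it read
    derivations: list[tuple[str, list[str]]] = field(default_factory=list)


def trace_origins(name: str, catalog: dict[str, LineageRecord]) -> set[str]:
    """Traceability: walk back through derivations to data with no recorded sources."""
    record = catalog[name]
    if not record.derivations:
        return {name}
    origins: set[str] = set()
    for _process, inputs in record.derivations:
        for source in inputs:
            origins |= trace_origins(source, catalog)
    return origins


# Hypothetical sample catalog (assumed acyclic).
catalog = {
    "orders_raw": LineageRecord("orders_raw", owner="team_a", level=0),
    "users_raw": LineageRecord("users_raw", owner="team_b", level=0),
    "orders_by_user": LineageRecord(
        "orders_by_user", owner="team_a", level=1,
        derivations=[("join_job", ["orders_raw", "users_raw"])],
    ),
    "daily_summary": LineageRecord(
        "daily_summary", owner="team_c", level=2,
        derivations=[("summary_job", ["orders_by_user"]),
                     ("backfill_job", ["orders_raw"])],
    ),
}

print(sorted(trace_origins("daily_summary", catalog)))
# prints ['orders_raw', 'users_raw']
```

In this sketch, daily_summary is produced by two processes, and tracing it back reaches both raw inputs, which illustrates the multi-source and traceability features together.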

Figure 1 shows the lineage relationship graph for DataArts Studio.

In the graph, one icon indicates a data table and another indicates a job node. The nodes are connected by arrows. As shown in the graph, the data in table wk_01 is processed on the hive_1 job node and then written to table wk_02. The data in table wk_02 is processed on the hive_2 job node and written to tables wk_03, wk_04, and wk_05.
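
The lineage described for Figure 1 can be written down as a small directed graph and walked in either direction. The sketch below is illustrative only and does not use any DataArts Studio API. The table and job names come from the example, while EDGES, JOBS, build_adjacency, and reachable_tables are hypothetical names introduced here.

```python
from collections import defaultdict, deque

# Edges taken from the example: table -> job -> table(s).
EDGES = [
    ("wk_01", "hive_1"),
    ("hive_1", "wk_02"),
    ("wk_02", "hive_2"),
    ("hive_2", "wk_03"),
    ("hive_2", "wk_04"),
    ("hive_2", "wk_05"),
]
JOBS = {"hive_1", "hive_2"}


def build_adjacency(edges, reverse=False):
    """Map each node to its successors (or predecessors when reverse=True)."""
    graph = defaultdict(list)
    for src, dst in edges:
        if reverse:
            src, dst = dst, src
        graph[src].append(dst)
    return graph


def reachable_tables(start, edges, reverse=False):
    """Breadth-first walk from start; returns the tables reached, skipping job nodes."""
    graph = build_adjacency(edges, reverse)
    seen = {start}
    queue = deque([start])
    tables = []
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
                if nxt not in JOBS:
                    tables.append(nxt)
    return tables


print(reachable_tables("wk_01", EDGES))                # downstream of wk_01
# prints ['wk_02', 'wk_03', 'wk_04', 'wk_05']
print(reachable_tables("wk_05", EDGES, reverse=True))  # upstream of wk_05
# prints ['wk_02', 'wk_01']
```

Walking forward from a table answers "what is affected if this data changes?", and walking backward answers "where did this data come from?".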

Figure 1 Data lineage example
