
Updated on 2025-10-11 GMT+08:00


Creating a Lineage Collection Task

Prerequisites

You have collected metadata.

Procedure

Sign in to the MgC console. In the navigation pane, under Project, select your big data migration project from the drop-down list.
In the navigation pane, choose Migration Preparations.
Choose Metadata Management. Under Big Data Lineage, click Create Collection Task.

Figure 1 Creating a lineage collection task

Select a job type and configure the parameters shown.


Type: Lineage template
Parameter: File
Configuration: Download the lineage template to the local PC and set parameters in the template. Formulas are not allowed in template cells. If a cell contains a formula, file parsing fails. The following fields are required:

Target Dataset (TargetDataset): The name of the target database for lineage collection. The value can contain a maximum of 128 characters.
Target Table (TargetTable): The name of the target table for lineage collection. The value can contain a maximum of 256 characters.
Target Connection Name (TargetConnectionName): The name of the connection to the target database for lineage collection. The value can contain a maximum of 255 characters.
Target Component Type (TargetComponentType): The type of the target component for lineage collection. Only Hive SQL and MaxCompute are supported.
Source Dataset (SourceDataset): The name of the upstream database for the target table. The value can contain a maximum of 128 characters.
Source Table (SourceTable): The name of the upstream table for the target table. The value can contain a maximum of 256 characters.
Source Connection Name (SourceConnectionName): The name of the connection to the upstream database for the target table. The value can contain a maximum of 255 characters.
Source Component Type (SourceComponentType): The type of the upstream component for the target table. Only Hive SQL and MaxCompute are supported.
Job Name (JobName): The name of the involved DataArts Studio or DataWorks job.
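
Before uploading, it can help to check each row against the required-field rules listed above. The following is a minimal sketch, not part of MgC: the function name `validate_required_fields`, the representation of a row as a dict keyed by field identifier, and the rule that a value starting with "=" indicates a formula are all assumptions made for illustration. The length limits and supported component types are taken from the list above.

```python
# Hypothetical pre-upload check; not an MgC API.
# Length limits come from the required-field descriptions in this guide.
REQUIRED_MAX_LEN = {
    "TargetDataset": 128,
    "TargetTable": 256,
    "TargetConnectionName": 255,
    "SourceDataset": 128,
    "SourceTable": 256,
    "SourceConnectionName": 255,
}
COMPONENT_FIELDS = ("TargetComponentType", "SourceComponentType")
SUPPORTED_COMPONENTS = {"Hive SQL", "MaxCompute"}


def validate_required_fields(row):
    """Return a list of problems found in one template row (a dict)."""
    problems = []
    for field, max_len in REQUIRED_MAX_LEN.items():
        value = str(row.get(field) or "").strip()
        if not value:
            problems.append(f"{field} is required")
        elif len(value) > max_len:
            problems.append(f"{field} exceeds {max_len} characters")
    for field in COMPONENT_FIELDS:
        if row.get(field) not in SUPPORTED_COMPONENTS:
            problems.append(f"{field} must be Hive SQL or MaxCompute")
    if not str(row.get("JobName") or "").strip():
        problems.append("JobName is required")
    # Assumed heuristic: a formula typed into a cell starts with "=".
    for field, value in row.items():
        if isinstance(value, str) and value.startswith("="):
            problems.append(f"{field} looks like a formula")
    return problems


# Illustrative row; all values are made up.
sample_row = {
    "TargetDataset": "dw",
    "TargetTable": "orders_daily",
    "TargetConnectionName": "hive_conn",
    "TargetComponentType": "Hive SQL",
    "SourceDataset": "ods",
    "SourceTable": "orders",
    "SourceConnectionName": "hive_conn",
    "SourceComponentType": "Hive SQL",
    "JobName": "load_orders_daily",
}
print(validate_required_fields(sample_row))
```

An empty list means the row passed these checks. MgC may apply further validation that this sketch does not cover.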

The following fields are optional:

Job ID (JobId): The ID of the involved DataArts Studio or DataWorks job. To obtain the ID of a DataArts Studio job, right-click the job on the job development page of DataArts Studio and copy the ID.
Job Type (JobType): The type of the involved DataArts Studio or DataWorks job. If the involved job is a DataArts Studio job, the job type can be MRS Hive SQL, MRS Presto SQL, or MRS Spark SQL.
Job Cron (JobCron): The job execution period, which is a Cron expression.
Job Workspace (JobWorkspace): The workspace of the involved job. A workspace is a basic unit used to manage tasks, members, roles, and permissions. All development work is completed in workspaces. Multiple types of data sources can be connected to a single workspace.
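
The optional fields can be left blank, but a filled-in Job Type is worth checking when the job comes from DataArts Studio. The sketch below assumes a row is a dict keyed by field identifier. The helper names `filled_optional_fields` and `check_job_type` are hypothetical. Job types for DataWorks jobs are not listed in this guide, so the sketch does not check them, and it does not validate the Cron expression because the exact Cron dialect is not specified here.

```python
# Hypothetical helpers; not an MgC API.
# Job types listed in this guide for DataArts Studio jobs.
DATAARTS_JOB_TYPES = {"MRS Hive SQL", "MRS Presto SQL", "MRS Spark SQL"}
OPTIONAL_FIELDS = ("JobId", "JobType", "JobCron", "JobWorkspace")


def filled_optional_fields(row):
    """Return the optional fields that have a non-blank value in this row."""
    return [f for f in OPTIONAL_FIELDS if str(row.get(f) or "").strip()]


def check_job_type(row, is_dataarts_studio_job):
    """Return a warning string, or None if JobType is blank or acceptable."""
    job_type = str(row.get("JobType") or "").strip()
    if not job_type or not is_dataarts_studio_job:
        return None  # DataWorks job types are not listed in the guide
    if job_type in DATAARTS_JOB_TYPES:
        return None
    return f"JobType {job_type!r} is not a listed DataArts Studio job type"


# Illustrative row; values are made up.
row = {"JobName": "daily_load", "JobType": "MRS Spark SQL", "JobCron": ""}
print(filled_optional_fields(row))
print(check_job_type(row, is_dataarts_studio_job=True))
```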

Go back to the console and click Add to upload the saved file to MgC.

CAUTION:

The file to be uploaded must be in .xlsx format. It can contain a maximum of 300,000 rows, and the file size cannot exceed 50 MB.

A maximum of 1,000 metadata connections can be included.

A single table can have a maximum of 1,000 upstream tables.
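
The file-level limits in this caution can also be checked before upload. The following sketch works on rows that have already been read into dicts and makes several assumptions that the guide does not confirm: 50 MB is taken as 50 × 1024 × 1024 bytes, the header row is not counted toward the 300,000-row limit, connections are counted as distinct names across both the target and source columns, and a table is identified by its connection name, dataset, and table name. The function name `check_file_limits` is hypothetical.

```python
from collections import defaultdict

# Limits from the caution above; the byte conversion is an assumption.
MAX_FILE_BYTES = 50 * 1024 * 1024
MAX_ROWS = 300_000
MAX_CONNECTIONS = 1_000
MAX_UPSTREAM_PER_TABLE = 1_000


def check_file_limits(filename, size_bytes, rows):
    """Return a list of file-level problems for a filled-in lineage template."""
    problems = []
    if not filename.lower().endswith(".xlsx"):
        problems.append("file must be in .xlsx format")
    if size_bytes > MAX_FILE_BYTES:
        problems.append("file exceeds 50 MB")
    if len(rows) > MAX_ROWS:
        problems.append(f"file has {len(rows)} rows, limit is {MAX_ROWS}")

    connections = set()
    upstream = defaultdict(set)  # target table -> set of upstream tables
    for row in rows:
        connections.add(row["TargetConnectionName"])
        connections.add(row["SourceConnectionName"])
        target = (row["TargetConnectionName"], row["TargetDataset"], row["TargetTable"])
        source = (row["SourceConnectionName"], row["SourceDataset"], row["SourceTable"])
        upstream[target].add(source)

    if len(connections) > MAX_CONNECTIONS:
        problems.append(f"file references {len(connections)} connections")
    for target, sources in upstream.items():
        if len(sources) > MAX_UPSTREAM_PER_TABLE:
            problems.append(f"{'.'.join(target)} has {len(sources)} upstream tables")
    return problems
```

In practice, `size_bytes` could come from `os.path.getsize` and `rows` from whatever spreadsheet reader is used to load the template. Both are left to the caller here.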

Lineage template.
1. Click Template Download to download the template to the local PC.
2. Complete the lineage template. The following parameters are mandatory:
  - Target Dataset (TargetDataset)
  - Target Table (TargetTable)
  - Target Connection Name (TargetConnectionName)
  - Target Component Type (TargetComponentType)
  - Source Dataset (SourceDataset)
  - Source Table (SourceTable)
  - Source Connection Name (SourceConnectionName)
  - Source Component Type (SourceComponentType)
  - Job Name (JobName)
  
  The value of Target Component Type and Source Component Type in the template can be Hive SQL or MaxCompute.
  Formulas are not allowed in cells in the template. Otherwise, parsing fails.
3. Go back to the console and click Add to upload the saved file to MgC.
  
  The file size cannot exceed 50 MB.

Click Confirm. The data lineage collection task is created. The system automatically starts collecting data lineage. On the Big Data Lineage tab, click a lineage collection task name or click View in the Operation column to open the Task Details page.
Wait until the task status changes to Completed. Then click View Lineage above the task list to view the lineage graph.

