Migrating Metadata to LakeFormation Using Metadata Discovery
Scenario
If data is stored in OBS parallel file systems but is not associated with metadata in LakeFormation, you can use the metadata discovery function to construct the metadata that corresponds to the data, so that SQL engines and user applications can compute and analyze it.
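For example, once discovery has created a table, a Spark job can query it through the LakeFormation catalog. The following is a minimal sketch; the catalog, database, and table names (my_catalog, sales_db, orders) are illustrative placeholders, and it assumes a Spark session that is already configured to use the LakeFormation instance as its metastore.

```python
from pyspark.sql import SparkSession

# Assumes Spark is already configured to use the LakeFormation
# instance as its metastore; the catalog, database, and table
# names below are illustrative placeholders.
spark = SparkSession.builder.appName("lakeformation-demo").getOrCreate()

# Query a table whose metadata was constructed by metadata discovery.
df = spark.sql("SELECT * FROM my_catalog.sales_db.orders LIMIT 10")
df.show()
```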
Constraints
Metadata discovery is currently in open beta testing (OBT) and is free of charge. However, once it is officially launched, you will be billed for the resources consumed by metadata discovery jobs.
Currently, metadata discovery supports only Spark on Hudi.
Prerequisites
- The data to be discovered has been uploaded to the OBS parallel file system, that is, migrated from S3 or HDFS to the planned path of the OBS parallel file system in the region where the LakeFormation instance is located. For one way to script an upload, see the sketch after this list.
- The catalog and database used for metadata discovery have been created.
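For reference, the following is a minimal sketch of scripting an upload with the OBS Python SDK (esdk-obs-python); the endpoint, credentials, bucket name, and paths are placeholders that you must replace with your own.

```python
from obs import ObsClient  # esdk-obs-python

# The endpoint, credentials, bucket, and paths below are placeholders;
# use the OBS endpoint of the region hosting your LakeFormation instance.
client = ObsClient(
    access_key_id="YOUR_AK",
    secret_access_key="YOUR_SK",
    server="https://obs.<region>.myhuaweicloud.com",
)
try:
    # Upload a local file into the planned path of the parallel file system.
    resp = client.putFile(
        "my-parallel-fs",
        "warehouse/sales_db/orders/part-0000.parquet",
        file_path="/data/orders/part-0000.parquet",
    )
    if resp.status < 300:
        print("Upload succeeded")
    else:
        print("Upload failed:", resp.errorMessage)
finally:
    client.close()
```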
Procedure
- Log in to the LakeFormation console.
- Select the target LakeFormation instance from the drop-down list on the left and choose Jobs > Job Authorization in the navigation pane.
Click Authorize to grant the job management permissions of LakeFormation to the current user. If authorization has already been completed, skip this step.
To revoke the permissions, click Cancel Authorization.
After authorization is granted, LakeFormation automatically creates an agency named lakeformation_job_trust. Do not delete this agency while jobs are running.
- In the navigation pane, choose Jobs > Metadata Discovery.
- Click Create Discovery Job, set related parameters, and click Submit.
Table 1 Parameters for creating a discovery job

- Job Name: Name of the metadata discovery job.
- Description: Description of the metadata discovery job.
- Data Storage Location: Location in the OBS parallel file system where the tables produced by discovery are stored. Click the path selection icon, select a location, and click OK.
- Discovery File Type: Type of the files to be discovered. The options include:
  - Automatic discovery (covers Parquet, ORC, JSON, Avro, and CSV)
  - Parquet
  - ORC
  - JSON
  - CSV (if you select this type, you must also configure parameters such as Delimiter, Escape Character, Quotation Character, and Use first row as column name; see the parsing sketch after this table)
  - Avro
  Recommended configurations:
  - If the files in the data storage location share one file name extension, choose the matching file type.
  - If the files have a variety of extensions, select Automatic discovery.
  - If the files have no extension, explicitly select the appropriate type. Automatic discovery identifies files as Parquet by default and may not recognize other formats.
- Log Path: Storage location of the logs generated when a metadata discovery job runs. Click the path selection icon to select a path. The path must already exist in OBS; if you enter a custom path that does not exist, the discovery job will fail.
- Target Catalog: Name of the catalog to which the discovered metadata will belong.
- Target Database: Name of the database to which the discovered metadata will belong.
- Conflict Resolution: How duplicate metadata names are handled during metadata discovery.
  - Create and update metadata
  - Create metadata only
- Default Owner: Default owner assigned to the metadata after a metadata discovery job runs. To avoid authorization failures, ensure that the selected entity's name does not contain hyphens (-).
- File Sampling Rate: (Optional) Proportion of files sampled during discovery. When the sampling rate is 0 and an empty file is found, the remaining partitions of the current partitioned table are skipped. This shortens the job run time but reduces accuracy.
- Rediscovery Method: Discovery policy used when the job is executed again.
  - Full discovery: When the job runs again, all files in the data storage location are discovered.
  - Incremental discovery: When the job runs again, only files added to the data storage location after the last successfully executed job started are discovered.
- Execution Policy: How the discovery job is triggered.
  - Manual: The job is triggered manually. If you select this mode, click Run in the Operation column to run the job after it is created.
  - Scheduled: The job runs automatically on a schedule. If you select this mode, set the execution period (monthly, weekly, daily, or hourly) and related parameters as required.
- Entity Type: (Optional) The selected entity is granted read permission on the data storage location by default.
- Event Notification Policy: (Optional) If configured, a notification is sent by SMS or email when a specified event (such as job success or failure) occurs.
  - Event Notification: Enables event notifications.
  - Event Notification Topic: Topic to which notifications are published. You can configure the topic using Simple Message Notification (SMN) on the management console.
  - Event: Job status that triggers a notification. The value can be Job succeeded or Job failed.
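To illustrate how the CSV options (Delimiter, Escape Character, Quotation Character, and Use first row as column name) interact during parsing, here is a sketch using Python's standard csv module. The sample content and option values are illustrative only; they are not LakeFormation defaults.

```python
import csv
import io

# Illustrative CSV content: comma delimiter, double-quote quotation
# character, backslash escape character, first row used as column names.
sample = 'id,name,comment\n1,"Ada","said \\"hi\\""\n2,"Bob","plain text"\n'

reader = csv.reader(
    io.StringIO(sample),
    delimiter=",",      # Delimiter
    quotechar='"',      # Quotation Character
    escapechar="\\",    # Escape Character
    doublequote=False,
)
rows = list(reader)
header, data = rows[0], rows[1:]  # Use first row as column name
print(header)  # ['id', 'name', 'comment']
for row in data:
    print(dict(zip(header, row)))  # e.g. {'id': '1', 'name': 'Ada', 'comment': 'said "hi"'}
```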
- Click Run in the Operation column to run the discovery job. If the execution policy is set to Scheduled, you do not need to run the job manually.
- Click Stop to stop a running job.
- Click View Log in the Operation column to view the logs generated while the job runs. Click Click here to view complete log to view the full log.
- View Job may be displayed on the page instead of View Log. In this case, perform the following operations to view logs:
  - Click View Job in the Operation column to view the job execution status.
  - In the displayed dialog box, click Click here to view complete log to view the logs generated while the job runs.
- Click Edit or Delete in the Operation column to modify or delete a job.
- After the discovery job is complete, choose Metadata > Table. In the upper right corner, select the target catalog and database from the Catalog and Database drop-down lists to view the discovered tables. You can also verify the result from a Spark session, as sketched below.
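As a quick sanity check, the following sketch lists the discovered tables from a Spark session. It reuses the illustrative my_catalog and sales_db placeholders and assumes the session is configured against the LakeFormation metastore.

```python
from pyspark.sql import SparkSession

# Assumes the session is configured against the LakeFormation metastore;
# my_catalog and sales_db are illustrative placeholders.
spark = SparkSession.builder.appName("verify-discovery").getOrCreate()

# List the tables that metadata discovery created in the target database.
spark.sql("SHOW TABLES IN my_catalog.sales_db").show(truncate=False)
```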