Updated on 2024-07-22 GMT+08:00

Using the Metadata Discovery Function

Scenario

If data is stored in OBS parallel file systems but is not associated with metadata in LakeFormation, you can use the metadata discovery function to construct metadata corresponding to the data to support the computing and analysis of SQL engines or user applications.

The metadata discovery feature is currently in open beta testing (OBT) and is free of charge. After it is officially launched, fees will be charged based on the resources consumed by metadata discovery tasks.

Prerequisites

  • The data to be discovered has been uploaded to the OBS parallel file system. That is, the data has been uploaded from S3 or HDFS to the planned path of the OBS parallel file system in the region where the LakeFormation instance is located.
  • The catalog and database for metadata discovery have been created.

Procedure

  1. Log in to the management console.
  2. In the upper left corner, click the service list icon and choose Analytics > LakeFormation to access the LakeFormation console.
  3. Select the LakeFormation instance to be operated from the drop-down list on the left and choose Tasks > Metadata Discovery in the navigation pane.
  4. Click Create Discovery Task, set related parameters, and click Submit.

    Table 1 Creating a discovery task

    Parameter

    Description

    Task Name

    Name of the metadata discovery task.

    Description

    Description of the created metadata discovery task.

    Data Storage Location

    Location in the OBS parallel file system of the data whose metadata is to be discovered.

    Click the icon next to the field, select a location, and click OK.

    Discovery File Type

    Type of the discovered file. The options include:

    • Automatic discovery (including Parquet, ORC, JSON, Avro, and CSV)
    • Parquet
    • ORC
    • JSON
    • CSV (If you select this type, you also need to configure parameters such as Delimiter, Escape Character, Quotation Character, and Use first row as column name.)
    • Avro
    NOTE:
    • If all files in the data storage location have the same file name extension, select the matching discovery file type.
    • If the files have different file name extensions, select Automatic discovery.
    • If the files have no file name extension, select the type that matches their actual format. In this case, Automatic discovery identifies Parquet files by default and may not recognize files of other formats.

    Log Path

    Storage location of the logs generated when the metadata discovery task is executed. Click the icon next to the field to select a path.

    The path must already exist in OBS. If you enter a custom path that does not exist, the discovery task will fail.

    Target Catalog

    Name of the catalog to which the metadata to be discovered belongs.

    Target Database

    Name of the database to which the metadata to be discovered belongs.

    Conflict Resolution

    Method used to handle metadata whose name already exists in the target database during metadata discovery.

    • Create and update metadata
    • Create metadata only

    Default Owner

    Default owner of metadata after a metadata discovery task is executed.

    To avoid authorization failure, ensure that the selected entity's name does not contain hyphens (-).

    File Sampling Rate

    (Optional) Rate at which files are sampled during discovery.

    When the sampling rate is 0, if an empty file is found, all subsequent partitions of the current partitioned table are skipped. This shortens the running time but reduces accuracy.

    Rediscovery Method

    Policy applied when the metadata discovery task is executed again.

    • Full discovery: When the task is executed again, all files in the data storage location are discovered.
    • Incremental discovery: When the task is executed again, only the files added to the data storage location after the start of the last successfully executed task are discovered.

    Entity Type

    (Optional) By default, selecting an entity assigns it read permission on the data storage location.

    • You can choose user groups, roles, or users to be the authorization entity.

      To avoid authorization failure, ensure that the selected entity's name does not contain hyphens (-).

    • If you want to grant the write permission as well, select Write Permission.

  5. Click Run in the Operation column to run the metadata discovery task.

    • Click Stop to stop a running task.
    • Click View Log to view the logs generated during task running.

      By default, the latest 50 lines of logs are displayed.

      You can click the hyperlink at the bottom of the log to view the complete log. For details about the configuration, see section "Downloading an Object" in Object Storage Service 3.0 (OBS) 3.24.3h&s User Guide (for Huawei Cloud Stack 8.3.1) in Object Storage Service 3.0 (OBS) 3.24.3h&s Usage Guide (for Huawei Cloud Stack 8.3.1).

    • Click Edit or Delete in the Operation column to modify or delete a task.

  6. After the metadata discovery task is complete, choose Metadata > Table. In the upper right corner, select the target catalog and database from the Catalog and Database drop-down lists to view the discovered tables.
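
The notes under Discovery File Type in Table 1 amount to a small decision rule, which the following Python sketch illustrates. It is not part of LakeFormation: the function name, the extension-to-type mapping, and the choice to return None when a file has no (or an unknown) extension are all assumptions made for illustration.

```python
from pathlib import PurePosixPath

# Hypothetical mapping from file name extension to the discovery file
# types listed in Table 1. The extensions are an assumption.
EXTENSION_TO_TYPE = {
    ".parquet": "Parquet",
    ".orc": "ORC",
    ".json": "JSON",
    ".csv": "CSV",
    ".avro": "Avro",
}


def suggest_discovery_file_type(object_keys):
    """Suggest a Discovery File Type for the given object keys.

    Returns a type name, "Automatic discovery", or None when the type
    should be chosen by hand (no files, a file without an extension,
    or an extension this sketch does not know).
    """
    suffixes = {PurePosixPath(key).suffix.lower() for key in object_keys}
    if not suffixes or "" in suffixes:
        return None  # no extension: pick the real format manually
    types = {EXTENSION_TO_TYPE.get(suffix) for suffix in suffixes}
    if None in types:
        return None  # unknown extension: pick manually
    if len(types) == 1:
        return types.pop()  # one extension type: choose the matching type
    return "Automatic discovery"  # mixed extensions


print(suggest_discovery_file_type(["sales/a.parquet", "sales/b.parquet"]))
print(suggest_discovery_file_type(["sales/a.orc", "sales/b.csv"]))
print(suggest_discovery_file_type(["sales/part-0000"]))
```

Returning None for extensionless files reflects the note that Automatic discovery may not recognize such files unless they are Parquet.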
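
The CSV parameters named in Table 1 (Delimiter, Escape Character, Quotation Character, and Use first row as column name) can be hard to picture without an example. The sketch below uses Python's standard csv module to show how such settings change what is read from a file. It only illustrates the concepts: LakeFormation's own CSV parsing may behave differently, and the function name and the placeholder column names (col0, col1, ...) are hypothetical.

```python
import csv
import io


def parse_csv(text, delimiter=",", quote_char='"', escape_char=None,
              first_row_as_column_names=True):
    """Parse CSV text and return (column_names, data_rows)."""
    reader = csv.reader(
        io.StringIO(text),
        delimiter=delimiter,
        quotechar=quote_char,
        escapechar=escape_char,
    )
    rows = list(reader)
    if not rows:
        return [], []
    if first_row_as_column_names:
        return rows[0], rows[1:]
    # Placeholder names are an assumption of this sketch.
    columns = [f"col{i}" for i in range(len(rows[0]))]
    return columns, rows


# Semicolon delimiter, single-quote quotation character, backslash escape.
sample = "id;name\n1;'Smith; John'\n2;O\\'Brien\n"
columns, data = parse_csv(sample, delimiter=";", quote_char="'",
                          escape_char="\\")
print(columns)
print(data)
```

With a mismatched delimiter or quotation character, the same file would be split into different columns, which is why these parameters need to match how the CSV files were written.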
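
The difference between the two Rediscovery Method options in Table 1 can be sketched as a file filter. This is an illustration only, not the service's implementation: it assumes that "files added after the last successful task starts" can be approximated by comparing each file's last-modified time with that start time, and the function and parameter names are hypothetical.

```python
from datetime import datetime, timezone


def select_files(files, method, last_successful_start=None):
    """Return the object keys a rerun would scan.

    files: iterable of (object_key, last_modified) pairs.
    method: "full" or "incremental".
    last_successful_start: start time of the last successfully executed
        task, or None if no task has succeeded yet.
    """
    if method not in ("full", "incremental"):
        raise ValueError(f"unknown rediscovery method: {method}")
    if method == "full" or last_successful_start is None:
        # Full discovery, or nothing to compare against: scan everything.
        return [key for key, _ in files]
    # Incremental discovery: only files newer than the last successful start.
    return [key for key, modified in files if modified > last_successful_start]


files = [
    ("sales/2024-07-01.parquet", datetime(2024, 7, 1, tzinfo=timezone.utc)),
    ("sales/2024-07-20.parquet", datetime(2024, 7, 20, tzinfo=timezone.utc)),
]
last_start = datetime(2024, 7, 10, tzinfo=timezone.utc)

print(select_files(files, "full", last_start))
print(select_files(files, "incremental", last_start))
```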