Using the Metadata Discovery Function
Scenario
If data is stored in OBS parallel file systems but is not associated with metadata in LakeFormation, you can use the metadata discovery function to construct metadata corresponding to the data to support the computing and analysis of SQL engines or user applications.
The metadata discovery feature is currently in open beta testing (OBT) and is free of charge. After it is officially launched, fees will be charged based on the resources consumed by metadata discovery tasks.
Prerequisites
- Authorization has been enabled by referring to Granting the Job Management Permission.
- The data to be discovered has been uploaded to the OBS parallel file system. That is, the data has been uploaded from S3 or HDFS to the planned path of the OBS parallel file system in the region where the LakeFormation instance is located.
- The catalog and database for metadata discovery have been created.
Procedure
- Log in to the LakeFormation console.
- In the upper left corner, click the service list icon and choose Analytics > LakeFormation to access the LakeFormation console.
- Select the LakeFormation instance to be operated from the drop-down list on the left and choose Tasks > Metadata Discovery in the navigation pane.
- Click Create Discovery Task, set related parameters, and click Submit.
Table 1 Parameters for creating a discovery task
Parameter
Description
Task Name
Name of the metadata discovery task.
Description
Description of the created metadata discovery task.
Data Storage Location
Location in the OBS parallel file system of the data to be discovered.
Click the selection icon next to the field, select a location, and click OK.
Discovery File Type
Type of the discovered file. The options include:
- Automatic discovery (including Parquet, ORC, JSON, Avro, and CSV)
- Parquet
- ORC
- JSON
- CSV (If you select this type, you also need to configure parameters such as Delimiter, Escape Character, Quotation Character, and Use first row as column name.)
- Avro
NOTE:
- If all files in the data storage location have the same file name extension, select the matching discovery file type.
- If the files have several different file name extensions, select Automatic discovery.
- If the files have no file name extension, select the specific type that matches their format. Automatic discovery identifies such files as Parquet by default and may not recognize files of other formats.
Log Path
Storage location of the logs generated when a metadata discovery task is executed. Click the selection icon next to the field to select a path.
The path must already exist in OBS. If you enter a custom path that does not exist, the discovery task fails.
Target Catalog
Name of the catalog to which the metadata to be discovered belongs.
Target Database
Name of the database to which the metadata to be discovered belongs.
Conflict Resolution
Method used to resolve the issue of duplicate metadata names during metadata discovery.
- Create and update metadata
- Create metadata only
Default Owner
Default owner of metadata after a metadata discovery task is executed.
To avoid authorization failure, ensure that the selected entity's name does not contain hyphens (-).
File Sampling Rate
(Optional) Rate at which files are sampled during discovery.
When the sampling rate is 0 and an empty file is found, all remaining partitions of the current partitioned table are skipped. This shortens the run time but reduces accuracy.
Rediscovery Method
Policy applied when the metadata discovery task is run again.
- Full discovery: When you perform the discovery operation again, all files in the data storage location are discovered.
- Incremental discovery: When you perform the discovery operation again, the system discovers only the files added to the data storage location since the start of the last successfully executed task.
Entity Type
(Optional) By default, selecting an entity assigns it read permission on the data storage location.
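The guidance in the Discovery File Type note can be checked before you create a task. The following is a minimal sketch in Python, not part of LakeFormation: the function name `recommend_discovery_type`, the extension mapping, and the handling of unrecognized extensions are assumptions made for illustration. It takes a list of object names that you have obtained yourself from the data storage location.

```python
from pathlib import PurePosixPath

# Hypothetical mapping from file name extensions to the Discovery File Type
# options listed in Table 1. The mapping is an assumption for illustration.
KNOWN_TYPES = {
    ".parquet": "Parquet",
    ".orc": "ORC",
    ".json": "JSON",
    ".csv": "CSV",
    ".avro": "Avro",
}


def recommend_discovery_type(object_names):
    """Suggest a Discovery File Type for a list of object names."""
    suffixes = {PurePosixPath(name).suffix.lower() for name in object_names}
    if not suffixes:
        return "no files listed"
    if "" in suffixes:
        # Files without an extension: Automatic discovery treats them as
        # Parquet by default, so the type should be chosen by hand.
        return "choose the type manually (files without an extension)"
    if len(suffixes) == 1:
        (suffix,) = suffixes
        # A single unrecognized extension is not covered by the note;
        # falling back to a manual choice is an assumption.
        return KNOWN_TYPES.get(suffix, "choose the type manually (unrecognized extension)")
    # Several different extensions are present.
    return "Automatic discovery"
```

For example, a location that holds only `.parquet` files yields `Parquet`, while a mix of `.csv` and `.json` files yields `Automatic discovery`.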
- Click Run in the Operation column to run the discovery task.
- Click Stop to stop a running task.
- Click View Log to view the logs generated during task running.
By default, the latest 50 lines of logs are displayed.
To view the complete log, click the hyperlink at the bottom of the log. For details, see "Downloading an Object" in Object Storage Service 3.0 (OBS) 3.24.3h&s User Guide (for Huawei Cloud Stack 8.5.0) in the Object Storage Service 3.0 (OBS) 3.24.3h&s Usage Guide (for Huawei Cloud Stack 8.5.0).
- Click Edit or Delete in the Operation column to modify or delete a task.
- After the discovery task is complete, choose Metadata > Table. In the upper right corner, select the target catalog and database from the Catalog and Database drop-down lists to view the discovered tables.
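The difference between the two Rediscovery Method options in Table 1 can be illustrated with a short Python sketch. It is not the service's implementation. The function `files_to_discover` is hypothetical, it uses each file's last-modified time as a stand-in for "added to the data storage location", and it falls back to discovering everything when no earlier successful run exists. Both of these are assumptions.

```python
from datetime import datetime, timezone


def files_to_discover(files, method, last_success_start=None):
    """Return the names of files a rerun would scan.

    files: iterable of (name, last_modified) pairs, last_modified a datetime.
    method: "full" or "incremental".
    last_success_start: start time of the last successfully executed task.
    """
    if method not in ("full", "incremental"):
        raise ValueError(f"unknown rediscovery method: {method!r}")
    if method == "full" or last_success_start is None:
        # Full discovery: every file in the data storage location.
        # Assumption: with no earlier successful run, incremental
        # discovery has no baseline and also covers every file.
        return [name for name, _ in files]
    # Incremental discovery: only files that appeared after the last
    # successfully executed task started.
    return [name for name, modified in files if modified > last_success_start]


# Usage example with made-up file names and timestamps.
last_start = datetime(2024, 1, 1, tzinfo=timezone.utc)
listing = [
    ("orders/old.parquet", datetime(2023, 12, 31, tzinfo=timezone.utc)),
    ("orders/new.parquet", datetime(2024, 1, 2, tzinfo=timezone.utc)),
]
print(files_to_discover(listing, "full", last_start))
print(files_to_discover(listing, "incremental", last_start))
```

With this sample listing, full discovery returns both files and incremental discovery returns only `orders/new.parquet`.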