Configuring a Metadata Collection Task

Updated on 2025-02-27 GMT+08:00

You can create collection tasks by configuring metadata collection policies. Different types of data sources require different collection policies. Metadata management allows you to collect technical metadata using the configured collection policies.

Constraints

  • If the collection scope is not specified for a metadata collection task, all data tables and files of a data connection are collected by default. After the collection task is complete, if data tables or files are added to the data connection, you must run the metadata collection task again to collect the new data tables or files.
  • Before collecting Oracle metadata, ensure that the database user of the data connection has the permission to read and write data tables and read metadata. For details, see how to assign permissions to users in Oracle Connection Parameters.
  • Due to MRS cluster restrictions, metadata collection tasks cannot directly collect metadata of Hive partitioned tables by default.

    To collect metadata of Hive partitioned tables, add parameter hive-ext.display.desc.statistic.stats and value true to hive.server.customized.configs in HiveServer(Role) > Customization of the MRS cluster. For details, see Enabling Metadata Collection from Hive Partitioned Tables of an MRS Cluster.

Prerequisites

  • Metadata of the following types of data sources can be collected: DWS, DLI, MRS HBase, MRS Hive, RDS, and Oracle. To obtain metadata, you must first create data connections in Management Center. To collect metadata from other data sources (such as OBS, CSS, and GES), you do not need to create data connections in Management Center.

  • Before you can collect the metadata of Hudi tables by collecting the MRS Hive metadata, you must enable synchronization of the Hive table configuration for Hudi tables.
  • To collect metadata of Hive partitioned tables, add parameter hive-ext.display.desc.statistic.stats and value true to hive.server.customized.configs in HiveServer(Role) > Customization of the MRS cluster. For details, see Enabling Metadata Collection from Hive Partitioned Tables of an MRS Cluster.

Creating a Collection Task

  1. On the DataArts Studio console, locate a workspace, click DataArts Catalog, and choose Metadata Collection > Task Management from the left navigation bar.
  2. Select the directory for the collection task. If no directory is available, create one as Figure 1 shows.
    Figure 1 Directory that stores the collection task to create
  3. Click Create in the upper part of the displayed page or right-click Task name and choose Add Task from the shortcut menu. On the page displayed, set the parameters.

    Figure 2 shows the entries for creating a task.

    Figure 2 Entries for creating a collection task
    1. Set the basic configuration based on Table 1.
      Table 1 Basic configuration parameters

      • Task Name: Name of a collection task. The value can contain only letters, numbers, and underscores (_), and cannot exceed 62 characters.
      • Description: Information to better identify the collection task. Length of the description cannot exceed 255 characters.
      • Select Directory: The directory that stores the collection task. You can select an existing one. Figure 1 shows the directory.
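
The limits in Table 1 can be checked before a task is submitted. The following is a minimal sketch, not part of DataArts Studio: the function name `validate_task_fields` is hypothetical, and it assumes that "letters" means ASCII letters, which the table does not state.

```python
import re

# Limits taken from Table 1. Treating "letters" as ASCII-only is an assumption.
TASK_NAME_PATTERN = re.compile(r"[A-Za-z0-9_]{1,62}")
MAX_DESCRIPTION_LENGTH = 255


def validate_task_fields(name: str, description: str = "") -> list:
    """Return a list of problems; an empty list means both fields pass."""
    problems = []
    if not TASK_NAME_PATTERN.fullmatch(name):
        problems.append("task name must be 1-62 letters, digits, or underscores")
    if len(description) > MAX_DESCRIPTION_LENGTH:
        problems.append("description must not exceed 255 characters")
    return problems


if __name__ == "__main__":
    print(validate_task_fields("dws_daily_01"))   # passes both checks
    print(validate_task_fields("bad-name"))       # hyphen is not allowed
```

A check like this is only a local convenience. The console remains the authority on which names it accepts.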

    2. Configure data source information based on Table 2.
      Table 2 Data source parameters

      • Data Connection Type: Select a data connection type from the drop-down list box. The options are DWS, DLI, MRS HBase, MRS Hive, ORACLE, and RDS.

        NOTE: Metadata of the following types of data sources can be collected: DWS, DLI, MRS HBase, MRS Hive, RDS, and Oracle. To obtain metadata, you must first create data connections in Management Center. To collect metadata from other data sources (such as OBS, CSS, and GES), you do not need to create data connections in Management Center.

      • Data Connection Name:
        • To use an existing data connection, select a value from the drop-down list.
        • To use a data connection that does not exist, click Create to add one.

      • Database (or Database and Schema, or Namespace) and Table: Database, schema, or namespace and data table from which data will be collected.
        • Click Set next to Database (or Database and Schema or Namespace) to set the range of databases (or databases and schemas or namespaces) to be scanned by the collection task. If this parameter is not set, all databases (or databases and schemas or namespaces) under the data connection are scanned by default.
        • Click Set next to Table to set the range of tables to be scanned by the collection task. If this parameter is not set, all tables in the database (or database and schema or namespace) are scanned by default.
        • If neither the database (or database and schema or namespace) nor the data table is set, the task scans all data tables of the selected data connection.
        • Click Clear to delete the selected database (or database and schema or namespace) and data table.

      • CSS
        • Cluster: Select the CSS cluster for storing the data to be collected. You can also click Create to create a CSS cluster. After the CSS cluster is created, click Refresh and select the new CSS cluster.
        • CDM Cluster: Select the agent provided by the CDM cluster. You can also click Create to create an agent. After the agent is created, click Refresh and select the new agent.
        • Index: Index, similar to "database" in the relational database (RDB), stores Elasticsearch data. It is a logical space that consists of one or more shards.

      • GES
        • Graph: Select graphs that store structured data based on "relationships".
        • CDM Cluster: Select the agent provided by the CDM cluster. You can also click Create to create an agent. After the agent is created, click Refresh and select the new agent.

      • OBS
        • OBS Bucket: Select the OBS bucket from which data will be collected.
        • OBS Path: Select the path of the OBS bucket from which data will be collected.
        • Collection Scope: Select the range of data to be collected.
          • If you select This folder, the collection task collects only the objects in the folder set in the OBS path.
          • If you select This folder and subfolders, the collection task collects all objects in the folder set in the OBS path, including the objects in the sub-folders.
        • Collected Content: Select the content of data to be collected.
          • If you select Folders and objects, the collection task collects folders and objects.
          • If you select Folders, the collection task collects only folders.

      • DIS
        • Collect Dump Task: If Yes is selected, the dump task is collected.
        • Collection Channel: A DIS instance is a stream. This parameter is used to specify a stream used for data collection.
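
The OBS Collection Scope and Collected Content options in Table 2 can be read as two filters over the entries under the OBS path. The sketch below is an illustration only and does not call any OBS or DataArts Studio API. The function name `select_obs_entries`, the convention that folder keys end with a slash, and the choice to list direct subfolders under This folder are all assumptions.

```python
def select_obs_entries(keys, obs_path, include_subfolders, folders_only):
    """Filter object keys the way the two OBS options are described.

    keys: iterable of key strings; a key ending in "/" is treated as a folder
          (an assumed convention for this sketch).
    include_subfolders: True models "This folder and subfolders".
    folders_only: True models "Folders"; False models "Folders and objects".
    """
    prefix = obs_path.rstrip("/") + "/"
    selected = []
    for key in keys:
        # Skip entries outside the configured path and the path folder itself.
        if not key.startswith(prefix) or key == prefix:
            continue
        relative = key[len(prefix):]
        is_folder = relative.endswith("/")
        depth = relative.rstrip("/").count("/")  # 0 means directly in the folder
        if depth > 0 and not include_subfolders:
            continue
        if folders_only and not is_folder:
            continue
        selected.append(key)
    return selected


if __name__ == "__main__":
    sample = ["logs/", "logs/a.csv", "logs/2024/", "logs/2024/b.csv", "other/c.csv"]
    print(select_obs_entries(sample, "logs", include_subfolders=False, folders_only=False))
    print(select_obs_entries(sample, "logs", include_subfolders=True, folders_only=True))
```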

    3. Set parameters under Metadata Collection. See Table 3.
      NOTE:

      Metadata collection parameters are available only for DWS, DLI, MRS HBase, MRS Hive, RDS, or Oracle connections.

      Table 3 Parameters for metadata collection

      • The data source metadata has been updated: When metadata in a data connection changes, you can configure an update policy to set how the metadata is updated in the data catalog. Note that the configured update and deletion policies apply only to the databases and data tables that you have configured.
        • If you select Update metadata in the data directory only, the collection task updates only the metadata that has been collected in the data catalog.
        • If you select Add new metadata to the data directory only, the collection task collects only metadata that exists in the data source but does not exist in the data catalog.
        • If you select Update metadata in the data directory and add metadata, the collection task fully synchronizes metadata from the data source.
        • If you select Ignore the update and addition operations, the metadata in the data source is not collected.

      • The data source metadata has been deleted: When metadata is deleted from a data connection, you can configure a deletion policy to set whether the deletion is applied to the data catalog.
        • If you select Delete metadata from data directory, when some metadata in the data source is deleted, the corresponding metadata is also deleted from the data catalog.
        • If you select Ignore the deletion, when some metadata in the data source is deleted, the corresponding metadata is not deleted from the data catalog.
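
The update and deletion policies in Table 3 amount to set operations between the data source and the data catalog. The following sketch models them with plain dictionaries. It is an interpretation of the option descriptions, not the service's implementation, and the policy keys (`update_only`, `add_only`, `update_and_add`, `ignore`, `delete`) are hypothetical names chosen for this example.

```python
def apply_policies(catalog, source, update_policy, delete_policy):
    """Return the catalog contents after one collection run.

    catalog, source: dicts mapping a table name to its metadata.
    update_policy: "update_only", "add_only", "update_and_add", or "ignore".
    delete_policy: "delete" or "ignore".
    """
    result = dict(catalog)
    for name, meta in source.items():
        already_collected = name in catalog
        if already_collected and update_policy in ("update_only", "update_and_add"):
            result[name] = meta      # refresh metadata that is already in the catalog
        elif not already_collected and update_policy in ("add_only", "update_and_add"):
            result[name] = meta      # add metadata that exists only in the source
    if delete_policy == "delete":
        for name in catalog:
            if name not in source:
                del result[name]     # source entry is gone, so drop it from the catalog
    return result


if __name__ == "__main__":
    catalog = {"t1": "v1", "t2": "v1"}
    source = {"t1": "v2", "t3": "v1"}
    print(apply_policies(catalog, source, "update_and_add", "delete"))
```

In this model, combining `update_and_add` with `delete` leaves the catalog identical to the source, while `ignore` with `ignore` leaves the catalog untouched.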
    4. Set parameters when Data Summary is selected. See Table 4 for details.
      NOTE:
      • Data Summary parameters are available only for DWS and DLI connections.
      • You are advised not to select Data Summary unless necessary. Selecting this option will increase the SQL execution workload. As a result, the metadata collection task may take a longer time than expected.
      Table 4 Parameters

      • Full data: If this option is selected, a data profile is generated in the data catalog based on all data collected. This mode applies to scenarios where the data volume is less than 1 million.
      • Sampled data, first x rows: If this option is selected, a data profile is generated in the data catalog based on the first x rows of the data. This mode is applicable to scenarios with a large amount of data.
      • Randomly collect x% records of data from all data: If this option is selected, a data profile is generated in the data catalog based on a random sample of x% of the records. This mode is applicable to scenarios with a large amount of data.
      • Data Lake Insight Queue: The queue used to obtain profile data and execute DLI SQL statements.

      If you select Collect unique value, the number of unique values in the collected table is calculated and displayed on the Profile tab page in the data catalog.
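
The three Data Summary modes differ only in which rows feed the data profile. The sketch below illustrates that difference on an in-memory list. It is not how the service samples data (the service runs SQL on the data source), and the mode names `full`, `first_x`, and `random_percent`, as well as both function names, are hypothetical.

```python
import random


def profile_rows(rows, mode, x=None, seed=None):
    """Pick the rows a profile would be based on, for each Data Summary mode."""
    rows = list(rows)
    if mode == "full":
        return rows                                   # Full data
    if mode == "first_x":
        return rows[:x]                               # Sampled data, first x rows
    if mode == "random_percent":
        k = round(len(rows) * x / 100)                # x% of all records
        return random.Random(seed).sample(rows, k)    # sample without replacement
    raise ValueError(f"unknown mode: {mode}")


def count_unique(rows, column):
    """Count distinct values in one column, as Collect unique value describes."""
    return len({row[column] for row in rows})


if __name__ == "__main__":
    data = [{"id": i, "city": "A" if i % 2 else "B"} for i in range(10)]
    print(len(profile_rows(data, "first_x", x=3)))
    print(len(profile_rows(data, "random_percent", x=50, seed=1)))
    print(count_unique(data, "city"))
```

Sampling reduces the amount of data scanned, which is why the two sampled modes are suggested for large tables.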

    5. Set parameters when Data Classification is selected. (This option is available only when DataArts Catalog provides data security functions. The data classification cannot be associated with a sensitive data identification rule created in the independent DataArts Security module.)
      • If you select Data Classification and create a classification rule group or select an existing classification rule group by referring to Creating a Data Classification (To Be Removed), data will be automatically identified and a classification will be added.
      • If you select Update the data table security level based on the data classification result, the table security level must be the same as the highest security level of the matched classification rules.
      • If you select Manually for Synchronize Data, classification rules and security levels are not automatically added to Column Attributes of Data Catalog under Data Map. Go to the Task Monitoring page. Locate the target instance and choose More > View Scanning Result to view the execution result of the collection task and check whether the classification result matches. Select the check box of the classification matching field and click Synchronize to manually synchronize the classification rule and security level.
      NOTE:

      You can add data classifications for automatic data identification only when you select a DWS or DLI data source. In addition, you can add classification rules only for columns in the data tables and OBS objects.

  4. Click Next and select a scheduling mode.

    Once: If the execution duration of a task exceeds the configured timeout duration, the task is considered failed.

    Repeating: See Table 5 for details.
    NOTE:
    1. If Once is selected, a manual task instance is generated. A manual task has no dependency on scheduling and must be manually triggered.
    2. If Repeating is selected, a periodic instance is generated. A periodic instance is an instance snapshot that is automatically scheduled when the scheduled execution time arrives.
    3. When a periodic task is scheduled once, an instance workflow is generated. You can perform routine O&M on scheduled instance tasks, such as viewing the running status, stopping and rerunning the scheduled tasks.
    Table 5 Parameters

    • Scheduling Date: The period during which a scheduling task takes effect.
    • Scheduling Cycle: The frequency at which the scheduling task is executed, which can be Minutes, Hours, Days, or Weeks.
    • Start Time: Start time of periodic scheduling, which is used together with the start time in Scheduling Date.
    • Time Interval: Interval between two periodic scheduling operations. A scheduling task instance starts even if the previous scheduling task instance has not ended. A collection task supports concurrent running of multiple instances.
    • End Time: End time of periodic scheduling, which is used together with the end time in Scheduling Date.
    • Timeout: Timeout duration for a task instance. If a task instance runs longer than the value of this parameter, the instance is considered failed.
    • Start: If this check box is selected, the task is scheduled immediately.

  5. Click Submit. The collection task is created.
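
One way to read the Repeating parameters in Table 5 for a minute-based cycle is that each day in Scheduling Date gets a run at Start Time and then one run per Time Interval until End Time. The sketch below lists the trigger times under that reading. The reading itself and the function name `trigger_times` are assumptions, and the actual scheduler may behave differently.

```python
from datetime import date, datetime, time, timedelta


def trigger_times(first_day, last_day, start, end, interval_minutes):
    """List trigger times for every day in [first_day, last_day].

    Each day runs at `start`, then every `interval_minutes`, up to and
    including `end`. An instance is triggered even if the previous one is
    still running, because instances may run concurrently.
    """
    if interval_minutes <= 0:
        raise ValueError("interval_minutes must be positive")
    step = timedelta(minutes=interval_minutes)
    times = []
    day = first_day
    while day <= last_day:
        current = datetime.combine(day, start)
        day_end = datetime.combine(day, end)
        while current <= day_end:
            times.append(current)
            current += step
        day += timedelta(days=1)
    return times


if __name__ == "__main__":
    for t in trigger_times(date(2025, 3, 1), date(2025, 3, 2), time(8, 0), time(9, 0), 30):
        print(t.isoformat())
```

Because instances can overlap, a Timeout shorter than the Time Interval keeps at most one instance of the task running at a time in this model.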

Managing a Collection Task

  1. On the DataArts Studio console, locate a workspace, click DataArts Catalog, and choose Metadata Collection > Task Management from the left navigation bar.

Then, you can view all created collection tasks.

Table 6 Parameters for managing collection tasks

  • Task Name: The name of a collection task. Click a collection task name to view the collection policies and scheduling properties.
  • Type: The name of a data connection.
  • Scheduling Status: The scheduling status of a collection task. You can click to view only tasks of the specified statuses.
  • Scheduling Cycle: The scheduling frequency of a collection task. You can click to view only tasks of the specified frequencies.
  • Description: The description of a collection task.
  • Creator: The creator of a collection task.
  • Last Executed On: The last time when the collection task ran.
  • Operation: You can perform the following operations on a created collection task:
    • Edit: Modify the parameters that are closely related to the policies of collection tasks whose status is Started, Not started, or Failed. The data source type cannot be modified.
    • Run: Click Run to run a collection task once and view its status and related logs on the Task Monitoring page.
    • Start Scheduling: If the status of a task is Stopped, you can start scheduling the task based on the configured scheduling mode.
    • Stop Scheduling: When the scheduling status is Scheduling, you can stop the scheduling.

Enabling Metadata Collection from Hive Partitioned Tables of an MRS Cluster

  1. Log in to MRS Manager as user admin.
  2. On FusionInsight Manager, choose Cluster > Services > Hive and click the Configurations tab and then All Configurations. Choose HiveServer(Role) > Customization. Add hive-ext.display.desc.statistic.stats to the value of hive.server.customized.configs and set the value of hive-ext.display.desc.statistic.stats to true.

    Figure 3 Adding a custom parameter

  3. After setting the parameter, click Save in the upper left corner and then OK in the dialog box to save the configuration.

    Figure 4 Saving the configuration

  4. After saving the configuration, switch to the Instances tab page, select the instance whose configuration has expired, click More, and select Instance Rolling Restart to make the configuration take effect.

    Figure 5 Performing a rolling instance restart
