Interconnecting Spark with LakeFormation

Updated on 2024-12-19 GMT+08:00
NOTE:

When using PySpark, remove the spark.hadoop prefix from each of the following parameters (keep the rest of the parameter name unchanged) and add them to the hive-site.xml configuration file.

Adding Interconnection Configuration Items

Add the following configuration items to the spark/conf/spark-defaults.conf file:

# Project ID. This parameter is mandatory. The value is for reference only.
spark.hadoop.lakeformation.project.id=Project ID
# LakeFormation instance ID. This parameter is optional. You can obtain the value from the LakeFormation instance page. If this parameter is not specified, the default instance is connected. The value configured here is for reference only.
spark.hadoop.lakeformation.instance.id=LakeFormation Instance ID
# AK for LakeFormation IAM authentication. This parameter is optional. Ignore it if you plan to use the custom authentication information obtaining class.
spark.hadoop.lakeformation.authentication.access.key=AK
# SK for LakeFormation IAM authentication. This parameter is optional. Ignore it if you plan to use the custom authentication information obtaining class.
spark.hadoop.lakeformation.authentication.secret.key=SK
# Security token for LakeFormation IAM authentication. This parameter is optional and is used together with a temporary AK/SK. Ignore it if a permanent AK/SK or the custom authentication information obtaining class is used.
spark.hadoop.lakeformation.authentication.security.token=securitytoken information
NOTE:

The project ID is mandatory; the other parameters are optional. Set them based on site requirements.

These configuration items can also take effect if they are added to hive-site.xml or core-site.xml. Remember to remove the spark.hadoop prefix when adding them there.
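
The prefix-trimming step described in the notes above can be automated. The following is a minimal Python sketch (not part of the LakeFormation client; the spark/conf/spark-defaults.conf path and the key=value format are assumptions matching the examples in this section) that keeps only the lakeformation.* entries, strips the spark.hadoop prefix, and prints them as hive-site.xml <property> elements ready to paste into the file.

# Hypothetical helper: convert the spark.hadoop.lakeformation.* entries from
# spark-defaults.conf into hive-site.xml <property> elements (prefix removed).
from xml.sax.saxutils import escape

DEFAULTS_FILE = "spark/conf/spark-defaults.conf"  # adjust to your Spark installation
PREFIX = "spark.hadoop."

with open(DEFAULTS_FILE, encoding="utf-8") as f:
    entries = [line.strip() for line in f
               if line.strip() and not line.lstrip().startswith("#")]

print("<configuration>")
for entry in entries:
    key, sep, value = entry.partition("=")
    key = key.strip()
    # Keep only the LakeFormation-related spark.hadoop.* entries.
    if not sep or not key.startswith(PREFIX + "lakeformation."):
        continue
    print("  <property>")
    print("    <name>%s</name>" % escape(key[len(PREFIX):]))
    print("    <value>%s</value>" % escape(value.strip()))
    print("  </property>")
print("</configuration>")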

Interconnecting with OBS

Add the following configuration items to the spark/conf/spark-defaults.conf file:

# Fixed configuration for interconnecting with OBS. The endpoint needs to be configured based on the region.
spark.hadoop.fs.obs.impl=org.apache.hadoop.fs.obs.OBSFileSystem
spark.hadoop.fs.AbstractFileSystem.obs.impl=org.apache.hadoop.fs.obs.OBS
spark.hadoop.fs.obs.endpoint=obs.xxx.huawei.com

# Specify LakeFormationObsCredentialProvider as the class for obtaining OBS credentials.
spark.hadoop.fs.obs.credentials.provider=com.huawei.cloud.dalf.lakecat.client.obs.LakeFormationObsCredentialProvider

# Optional parameter. Disable the OBS file system cache. This configuration needs to be added for long tasks to prevent the temporary AK/SK in the cache from becoming invalid.
spark.hadoop.fs.obs.impl.disable.cache=true
NOTE:

Endpoint: Endpoints vary depending on the service and region. Obtain the value from Regions and Endpoints.

These configuration items can also take effect if they are added to core-site.xml. Remember to remove the spark.hadoop prefix when adding them there.
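
After the OBS configuration is in place, a short PySpark read can confirm that credentials are obtained through LakeFormationObsCredentialProvider. The sketch below is illustrative only: it assumes the hadoop-obs and LakeFormation client JARs are on the driver and executor classpath and that the configuration above is already in spark-defaults.conf (or, without the spark.hadoop prefix, in core-site.xml); the bucket and object path are placeholders.

# Minimal smoke test: read an object from OBS using the configured credential provider.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lakeformation-obs-check").getOrCreate()

# Replace with an OBS path that the LakeFormation-issued credentials can read.
df = spark.read.text("obs://your-bucket/your-path/sample.txt")
df.show(5, truncate=False)

spark.stop()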

Interconnecting with LakeFormation Metadata

You can use either of the following methods to connect Spark to LakeFormation. Choose the one that suits your requirements.

  • Interconnection using SparkCatalogPlugin: based on Spark SessionCatalogV2, this method allows you to connect to different catalogs in the same session. The feature is still experimental and does not support some SQL commands.
  • Interconnection using MetastoreClient: based on Spark's HiveExternalCatalog and Hive's MetastoreClient mechanisms, this method supports most Hive SQL commands but cannot connect to different catalogs at the same time.
Interconnection using SparkCatalogPlugin:
  1. Add the following configuration items to the spark/conf/spark-defaults.conf file. If multiple catalogs need to be connected at the same time, repeat the following configuration items for each catalog:
    # Specify the catalog implementation class. This parameter is mandatory. spark_catalog_name indicates the catalog name in Spark. Replace it as required.
    spark.sql.catalog.${spark_catalog_name}=com.huawei.cloud.dalf.lakecat.client.spark.LakeFormationSparkCatalog
    # Name of the catalog to be connected (lakeformation_catalog_name is the catalog in LakeFormation). This parameter is optional. If it is not set, the Hive catalog is connected instead. The value here is for reference only.
    spark.sql.catalog.${spark_catalog_name}.lakecat.catalogname.default=${lakeformation_catalog_name}
  2. Verify the interconnection.

    After the interconnection, you can access LakeFormation through spark-shell, spark-submit, or spark-sql. The following uses spark-sql as an example; a PySpark equivalent is sketched after these steps.

    • Switch the database. (You need to specify the catalog name during the switchover. The database corresponding to database_name must exist in LakeFormation.)

      use spark_catalog_name.database_name;

    • View the table information.

      show tables;

    • Create a database. (You cannot directly create a database with the same name as the catalog. You need to specify the catalog.)

      create database spark_catalog_name.test;
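
The verification above can also be done from PySpark by issuing the same SQL statements through spark.sql(). This is a sketch only: it assumes the catalog configuration from step 1 is already in spark-defaults.conf and the LakeFormation client JARs are on the classpath; spark_catalog_name, database_name, and test are placeholders.

# PySpark equivalent of the spark-sql verification above.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lakeformation-catalog-check").getOrCreate()

# Switch to a database that already exists in LakeFormation; the catalog name is required.
spark.sql("USE spark_catalog_name.database_name")

# List the tables in that database.
spark.sql("SHOW TABLES").show(truncate=False)

# Create a database; the catalog must be specified explicitly.
spark.sql("CREATE DATABASE IF NOT EXISTS spark_catalog_name.test")

spark.stop()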

Interconnection using MetastoreClient:
  1. Add the following configuration items to spark-defaults.conf:
    spark.sql.catalogImplementation=hive
  2. Add a hive-site.xml file to the spark/conf/ folder (edit the file if it already exists) and add the following configuration to it:
    <configuration>
      <!-- Fixed configuration. Enable the custom metastore client. -->
      <property>
        <name>hive.metastore.session.client.class</name>
        <value>com.huawei.cloud.dalf.lakecat.client.hiveclient.LakeCatMetaStoreClient</value>
      </property>
      <!-- Name of the LakeFormation catalog to be connected. This parameter is optional. If it is not set, the Hive catalog is connected instead. The value here is for reference only. -->
      <property>
        <name>lakecat.catalogname.default</name>
        <value>hive</value>
      </property>
      <!-- Hive execution path. This parameter is optional. If HDFS is not connected, the local path /tmp/hive is used by default. The value here is for reference only. -->
      <property>
        <name>hive.exec.scratchdir</name>
        <value>/tmp/hive</value>
      </property>
    </configuration>

    In addition to hive-site.xml, these configurations can also be added to the spark-defaults.conf file with the spark.hadoop prefix, for example: spark.hadoop.hive.metastore.session.client.class=com.huawei.cloud.dalf.lakecat.client.hiveclient.LakeCatMetaStoreClient.

    NOTE:
    • The permission on the hive.exec.scratchdir path must be changed to 777. Otherwise, the Hive client initialization will be abnormal.
    • You need to create a database named default in the catalog corresponding to lakecat.catalogname.default (skip this if the database already exists). Otherwise, spark-sql initialization will fail or spark-shell will not be able to connect.
  3. Verify the interconnection.

    After the interconnection, you can access LakeFormation through spark-shell or by executing SQL statements. The following uses spark-sql as an example; a PySpark equivalent is sketched after these steps.

    • Switch the database. (You do not need to specify the catalog name during the switchover.)

      use database_name;

    • View the table information.

      show tables;
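
The verification can also be done from PySpark. The sketch below is illustrative only: it assumes spark.sql.catalogImplementation=hive is set (or Hive support is enabled when the session is built) and hive-site.xml is configured as in step 2; database_name is a placeholder for a database that exists in the connected LakeFormation catalog.

# PySpark equivalent of the spark-sql verification above (MetastoreClient approach).
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("lakeformation-metastore-check")
    .enableHiveSupport()  # same effect as spark.sql.catalogImplementation=hive
    .getOrCreate()
)

# No catalog name is needed with this approach.
spark.sql("USE database_name")
spark.sql("SHOW TABLES").show(truncate=False)

spark.stop()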

Integrating the SQL Authentication Plug-in

  1. To use the authentication plug-in, you must implement and specify a custom user information obtaining class. For details, see Custom User Information Obtaining Class.
  2. Add the following configuration to the spark-defaults.conf configuration file:

    # If your client package targets Spark 3.1, the extension class may instead be com.huawei.cloud.dalf.lakecat.client.spark.v31.authorizer.LakeFormationSparkSQLExtension.
    spark.sql.extensions=com.huawei.cloud.dalf.lakecat.client.spark.authorizer.LakeFormationSparkSQLExtension

NOTE:
  • After the permission plug-in is integrated, if the current user (specified by Custom User Information Obtaining Class) does not have the corresponding metadata permission, an exception is thrown when the SQL statement is executed.
  • If the current user has the IAM LakeFormation:policy:create permission, and the current user (specified by Custom User Information Obtaining Class) and the authentication information (specified by Custom Authentication Information Obtaining Class) belong to the same user, SQL authentication is skipped.
  • Currently, filtering functions are not supported. Databases, tables, and rows cannot be filtered, and columns cannot be masked.

Log Printing

You can add log4j.logger.org.apache=WARN to the log4j.properties file to disable the HttpClient request logging function of the LakeFormation client.
