
Interconnecting Spark with LakeFormation

When using PySpark, add the parameters described below to the hive-site.xml configuration file instead, removing the spark.hadoop prefix from each parameter name and keeping the rest unchanged.

Adding Interconnection Configuration Items

Add the following configuration items to the spark/conf/spark-defaults.conf file:

# Project ID. This parameter is mandatory. The value is for reference only.
spark.hadoop.lakeformation.project.id=Project ID
# LakeFormation instance ID. This parameter is optional. You can obtain the value from the LakeFormation instance page. If this parameter is not specified, the default instance is connected. The value configured here is for reference only.
spark.hadoop.lakeformation.instance.id=LakeFormation Instance ID
# AK for LakeFormation IAM authentication. This parameter is optional. Ignore it if you plan to use the custom authentication information obtaining class.
spark.hadoop.lakeformation.authentication.access.key=AK
# SK for LakeFormation IAM authentication. This parameter is optional. Ignore it if you plan to use the custom authentication information obtaining class.
spark.hadoop.lakeformation.authentication.secret.key=SK
# IAM authentication information securitytoken for accessing LakeFormation. This parameter is optional and is used together with a temporary AK/SK. If a permanent AK/SK or the custom authentication information obtaining class is used, ignore this parameter.
spark.hadoop.lakeformation.authentication.security.token=securitytoken information

The project ID is mandatory; the other parameters are optional. Set them based on site requirements.

These configuration items can also take effect after being added to hive-site.xml or core-site.xml. Remember to trim off the spark.hadoop prefix when adding them.
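
For reference, after trimming the prefix, the equivalent hive-site.xml fragment might look as follows (the values are placeholders; set them to your actual project and instance IDs):

<configuration>
  <property>
    <name>lakeformation.project.id</name>
    <value>Project ID</value>
  </property>
  <property>
    <name>lakeformation.instance.id</name>
    <value>LakeFormation Instance ID</value>
  </property>
</configuration>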

Interconnecting with OBS

Add the following configuration items to the spark/conf/spark-defaults.conf file:

# Fixed configuration for interconnecting with OBS. The endpoint needs to be configured based on the region.
spark.hadoop.fs.obs.impl=org.apache.hadoop.fs.obs.OBSFileSystem
spark.hadoop.fs.AbstractFileSystem.obs.impl=org.apache.hadoop.fs.obs.OBS
spark.hadoop.fs.obs.endpoint=obs.xxx.huawei.com

# Specify LakeFormationObsCredentialProvider as the class for obtaining OBS credentials.
spark.hadoop.fs.obs.credentials.provider=com.huawei.cloud.dalf.lakecat.client.obs.LakeFormationObsCredentialProvider

# Optional parameter. Disable the OBS file system cache. This configuration needs to be added for long tasks to prevent the temporary AK/SK in the cache from becoming invalid.
spark.hadoop.fs.obs.impl.disable.cache=true

Endpoint: Endpoints vary by service and region. Obtain the value of this parameter from Regions and Endpoints.

These configuration items can also take effect after being added to core-site.xml. Remember to trim off the spark.hadoop prefix when adding them.
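
For reference, the equivalent core-site.xml fragment with the prefix trimmed might look as follows (the endpoint is a placeholder; replace it with the OBS endpoint of your region):

<configuration>
  <property>
    <name>fs.obs.impl</name>
    <value>org.apache.hadoop.fs.obs.OBSFileSystem</value>
  </property>
  <property>
    <name>fs.AbstractFileSystem.obs.impl</name>
    <value>org.apache.hadoop.fs.obs.OBS</value>
  </property>
  <property>
    <name>fs.obs.endpoint</name>
    <value>obs.xxx.huawei.com</value>
  </property>
  <property>
    <name>fs.obs.credentials.provider</name>
    <value>com.huawei.cloud.dalf.lakecat.client.obs.LakeFormationObsCredentialProvider</value>
  </property>
</configuration>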

Interconnecting with LakeFormation Metadata

You can use either of the following methods to connect Spark to LakeFormation. Choose the method that fits your scenario.

  • Interconnection using SparkCatalogPlugin: Spark SessionCatalogV2 allows you to connect to different catalogs in the same session. This feature is still experimental and does not support some SQL commands.
  • Interconnection using MetastoreClient: MetastoreClient relies on Spark HiveExternalCatalog and Hive MetastoreClient mechanisms to execute most Hive SQL commands. However, it does not allow connecting to different catalogs simultaneously.
Interconnection using SparkCatalogPlugin:
  1. Add the following configuration items to the spark/conf/spark-defaults.conf file. If multiple catalogs need to be interconnected at the same time, repeat these configuration items for each catalog (see the multi-catalog example after the verification step):
    # Specify the catalog implementation class. This parameter is mandatory. spark_catalog_name indicates the catalog name in Spark. Replace it as required.
    spark.sql.catalog.${spark_catalog_name}=com.huawei.cloud.dalf.lakecat.client.spark.LakeFormationSparkCatalog
    # Name of the catalog to be connected (lakeformation_catalog_name is the catalog in LakeFormation). This parameter is optional. If it is not set, the Hive catalog is connected instead. The value here is for reference only.
    spark.sql.catalog.${spark_catalog_name}.lakecat.catalogname.default=${lakeformation_catalog_name}
  2. Verify the interconnection.

    After the interconnection, you can access LakeFormation through spark-shell, spark-submit, or spark-sql. The following uses spark-sql as an example.

    • Switch the database. (You need to specify the catalog name during the switchover. The database corresponding to database_name must exist in LakeFormation.)

      use spark_catalog_name.database_name;

    • View the table information.

      show tables;

    • Create a database. (You cannot directly create a database with the same name as the catalog. You need to specify the catalog.)

      create database spark_catalog_name.test;
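
For example, to interconnect two catalogs in the same session, the spark-defaults.conf fragment might look as follows (catalog1/catalog2 and the LakeFormation catalog names are placeholders):

spark.sql.catalog.catalog1=com.huawei.cloud.dalf.lakecat.client.spark.LakeFormationSparkCatalog
spark.sql.catalog.catalog1.lakecat.catalogname.default=lakeformation_catalog_1
spark.sql.catalog.catalog2=com.huawei.cloud.dalf.lakecat.client.spark.LakeFormationSparkCatalog
spark.sql.catalog.catalog2.lakecat.catalogname.default=lakeformation_catalog_2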

Interconnection using MetastoreClient:
  1. Add the following configuration items to spark-defaults.conf:
    spark.sql.catalogImplementation=hive
  2. Add the hive-site.xml file to the spark/conf/ folder (edit this file if it already exists) and add the following configurations to the hive-site.xml file:
    <configuration>
      <!-- Fixed configuration. Enable the custom metastore client. -->
      <property>
        <name>hive.metastore.session.client.class</name>
        <value>com.huawei.cloud.dalf.lakecat.client.hiveclient.LakeCatMetaStoreClient</value>
      </property>
      <!-- Name of the LakeFormation catalog to be connected. This parameter is optional. If it is not set, the Hive catalog is connected instead. The value here is for reference only. -->
      <property>
        <name>lakecat.catalogname.default</name>
        <value>hive</value>
      </property>
      <!-- Hive execution path. This parameter is optional. If HDFS is not connected, the local path /tmp/hive is used by default. The value here is for reference only. -->
      <property>
        <name>hive.exec.scratchdir</name>
        <value>/tmp/hive</value>
      </property>
    </configuration>

    In addition to adding these configurations to hive-site.xml, you can add them to the spark-defaults.conf configuration file with the spark.hadoop prefix, for example spark.hadoop.hive.metastore.session.client.class=com.huawei.cloud.dalf.lakecat.client.hiveclient.LakeCatMetaStoreClient (see the sketch after the verification step).

    • The permission on the hive.exec.scratchdir path must be changed to 777. Otherwise, Hive client initialization will fail.
    • A database named default must exist in the catalog corresponding to lakecat.catalogname.default (skip this step if it already exists). Otherwise, spark-sql initialization will fail and spark-shell will be unable to connect.
  3. Verify the interconnection.

    After the interconnection, you can use spark-shell or execute SQL statements to access LakeFormation. The following uses spark-sql as an example.

    • Switch the database. (You do not need to specify the catalog name during the switchover.)

      use database_name;

    • View the table information.

      show tables;
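
For reference, an equivalent spark-defaults.conf fragment (assuming the optional hive-site.xml parameters can also be passed with the spark.hadoop prefix) might look as follows:

spark.sql.catalogImplementation=hive
spark.hadoop.hive.metastore.session.client.class=com.huawei.cloud.dalf.lakecat.client.hiveclient.LakeCatMetaStoreClient
spark.hadoop.lakecat.catalogname.default=hive
spark.hadoop.hive.exec.scratchdir=/tmp/hive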

Integrating the SQL Authentication Plug-in

  1. To use the authentication plug-in, you must implement and specify a custom user information obtaining class. For details, see Custom User Information Obtaining Class.
  2. Add the following configuration to the spark-defaults.conf configuration file. The extension class name is for reference only; if your client package provides the class under the v31 package (com.huawei.cloud.dalf.lakecat.client.spark.v31.authorizer.LakeFormationSparkSQLExtension), use that class name instead:

    spark.sql.extensions=com.huawei.cloud.dalf.lakecat.client.spark.authorizer.LakeFormationSparkSQLExtension

  • After the permission plug-in is integrated, if the current user (specified by Custom User Information Obtaining Class) does not have the corresponding metadata permission, an exception is thrown when the SQL statement is executed.
  • If the current user has the IAM LakeFormation:policy:create permission, and the current user (specified by Custom User Information Obtaining Class) and the authentication information (specified by Custom Authentication Information Obtaining Class) refer to the same user, SQL authentication will be skipped.
  • Currently, filtering functions are not supported. Databases, tables, and rows cannot be filtered, and columns cannot be masked.

Log Printing

You can add log4j.logger.org.apache=WARN to the log4j.properties file to suppress the HttpClient request logs printed by the LakeFormation client.
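
For example, a minimal log4j.properties fragment (typically spark/conf/log4j.properties; the location depends on your deployment) might contain:

# Raise the org.apache logger to WARN to suppress HttpClient request logs
log4j.logger.org.apache=WARN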