
Interconnecting Spark with OBS Using an IAM Agency

After configuring decoupled storage and compute for a cluster by referring to Interconnecting an MRS Cluster with OBS Using an IAM Agency, you can create tables with OBS paths as their location on the Spark client.
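
For example, once the agency is in place, a table's location can point directly to an OBS path (a minimal sketch; the bucket and directory below are placeholders you must replace with your own):

  create table demo(id int) location 'obs://your-bucket/demo/';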

Interconnecting with OBS Using Spark Beeline

  1. Log in to MRS Manager and choose Cluster > Services > Spark2x > Configurations > All Configurations. For details about how to log in to MRS Manager, see Accessing MRS Manager.

    In the left navigation pane, choose JDBCServer2x > Customization. Add the configuration item dfs.namenode.acls.enabled to the spark.hdfs-site.customized.configs parameter and set the value to false to disable the HDFS ACL function.

    Figure 1 Adding Spark custom parameters

  2. Search for the spark.sql.statistics.fallBackToHdfs parameter and set its value to false. This prevents Spark from falling back to the file system to calculate table sizes when table statistics are unavailable, an operation that can be slow or fail for tables stored in OBS.

    Figure 2 Setting spark.sql.statistics.fallBackToHdfs

  3. Save the configurations and restart the JDBCServer2x instance.
  4. Log in to the client installation node as the client installation user.

    For details about how to download and install the cluster client, see Installing an MRS Cluster Client.

  5. Run the following command to configure environment variables:

    source Client installation directory/bigdata_env

  6. If Kerberos authentication is enabled for the cluster, run the following command to authenticate the user. Skip this step if Kerberos authentication is disabled.

    kinit Username

  7. Access OBS using Spark Beeline. The following example creates a table named test in the obs://mrs-word001/table/ directory.

    Start Spark Beeline.

    spark-beeline
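
    Optionally, confirm that the parameter configured in 2 has taken effect; in Spark SQL, running set with a parameter name prints its current value:

    set spark.sql.statistics.fallBackToHdfs;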

    Create the test table.

    create table test(id int) location 'obs://mrs-word001/table/';

  8. Run the following command to query all tables. If table test is returned, OBS access is successful.

    show tables;
    Figure 3 Returned table names
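
    To further verify that data can be written to and read from OBS, you can insert a row into the test table and query it back (a quick optional check):

    insert into test values (1);
    select * from test;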

  9. Press Ctrl+C to exit Spark Beeline.

Interconnecting with OBS Using Spark SQL

  1. Log in to the client installation node as the client installation user.

    For details about how to download and install the cluster client, see Installing an MRS Cluster Client.

  2. Run the following command to configure environment variables:

    source Client installation directory/bigdata_env

  3. Modify the configuration file:

    vim Client installation directory/Spark2x/spark/conf/hdfs-site.xml

    Add or modify the following property to disable the HDFS ACL function (the dfs.namenode.acls.enabled parameter specifies whether HDFS ACLs are enabled):

    <property>
      <name>dfs.namenode.acls.enabled</name>
      <value>false</value>
    </property>

  4. If Kerberos authentication is enabled for the cluster, run the following command to authenticate the user. Skip this step if Kerberos authentication is disabled.

    kinit Username

  5. Access OBS using the Spark SQL CLI. For example, create a table named test in the obs://mrs-word001/table/ directory.

    1. Go to the Spark bin directory.
      cd Client installation directory/Spark2x/spark/bin
    2. Log in to the Spark SQL CLI.
      ./spark-sql
    3. Run the following command in the Spark SQL CLI to create a table:
      create table test(id int) location 'obs://mrs-word001/table/';

  6. Check whether the table exists.

    show tables;
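
    You can also confirm that the table's Location points to the OBS path (desc formatted prints the table metadata, including the Location field):

    desc formatted test;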

  7. Run exit; to exit the Spark SQL CLI.

    If the OBS client prints a large number of logs, read and write performance may be affected. You can adjust the log level of the OBS client as follows (setting it to WARN suppresses routine INFO messages while keeping warnings and errors):

    1. Go to the conf directory.
      cd Client installation directory/Spark2x/spark/conf
    2. Edit the file log4j.properties.
      vi log4j.properties
    3. Add the OBS log level configuration to the file.
      log4j.logger.org.apache.hadoop.fs.obs=WARN
      log4j.logger.com.obs=WARN
      Figure 4 Adding an OBS log level

Using Spark Shell to Read OBS Files

  1. Log in to the client installation node as the client installation user.

    For details about how to download and install the cluster client, see Installing an MRS Cluster Client.

  2. Run the following command to configure environment variables:

    source Client installation directory/bigdata_env

  3. Modify the configuration file:

    vim Client installation directory/Spark2x/spark/conf/hdfs-site.xml

    Add or modify the following property to disable the HDFS ACL function (the dfs.namenode.acls.enabled parameter specifies whether HDFS ACLs are enabled):

    <property>
      <name>dfs.namenode.acls.enabled</name>
      <value>false</value>
    </property>

  4. If Kerberos authentication is enabled for the cluster, run the following command to authenticate the user. Skip this step if Kerberos authentication is disabled.

    kinit Username

  5. Run the following commands to log in to the Spark SQL CLI:

    Go to the bin directory.

    cd Client installation directory/Spark2x/spark/bin

    Log in to the Spark SQL CLI.

    ./spark-sql

  6. Create a table in OBS and import data to the table.

    1. Create a database.
      create database test location "obs://Parallel file system path/test";
    2. Switch to the new database.
      use test;
    3. Create a table.
      create table test1(a int,b int) using parquet;
    4. Insert data into the table.
      insert into test1 values(1,2);
    5. View table details.
      desc formatted test1;

      The Location of the table shown in Figure 5 is an OBS path.

      Figure 5 Checking the location of the table
    6. Run exit; to exit the Spark SQL CLI.
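
    As an optional check before exiting in 6.f, you can also query the table directly in SQL; the statement below should return the row (1,2) inserted in 6.d:

    select * from test1;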

  7. Run the following command to go to the Spark bin directory:

    cd Client installation directory/Spark2x/spark/bin

  8. Start the Spark Shell CLI.

    ./spark-shell

  9. In the Spark Shell CLI, run the following command to query the table created in 6.c:

    spark.read.format("parquet").load("obs://Parallel file system path/test/test1").show();

    The table data is displayed, as shown below.

    Figure 6 Viewing table data

  10. Run the :quit command to exit the Spark Shell CLI.