
Interconnecting Spark2x with OBS

The OBS file system can be interconnected with Spark2x after an MRS cluster is installed.

Before performing the following operations, ensure that you have configured a storage-compute decoupled cluster by referring to Configuring a Storage-Compute Decoupled Cluster (Agency) or Configuring a Storage-Compute Decoupled Cluster (AK/SK).

Verifying OBS Access with Spark Beeline

  1. Log in to FusionInsight Manager and choose Cluster > Services > Spark2x > Configurations > All Configurations.

    In the left navigation tree, choose JDBCServer2x > Customization. Add dfs.namenode.acls.enabled to the spark.hdfs-site.customized.configs parameter and set its value to false.

    Figure 1 Adding Spark custom parameters

  2. Search for the spark.sql.statistics.fallBackToHdfs parameter and set its value to false.

    Figure 2 Setting spark.sql.statistics.fallBackToHdfs

  3. Save the configurations and restart the JDBCServer2x instance.
  4. Log in to the client installation node as the client installation user.
  5. Run the following command to configure environment variables:

    source Client installation directory/bigdata_env
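    For example, if the client is installed in /opt/hadoopclient (an assumed path; substitute the actual client installation directory), run:

    source /opt/hadoopclient/bigdata_env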

  6. For a security cluster, run the following command to authenticate the user. If Kerberos authentication is not enabled for the current cluster, skip this step.

    kinit Username
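    For example, for a hypothetical user named sparkuser (substitute the actual username), run the following command and enter the password as prompted:

    kinit sparkuser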

  7. Access OBS in Spark Beeline. Run spark-beeline to log in to Spark Beeline, and then run the following command to create a table named test in the obs://mrs-word001/table/ directory:

    create table test(id int) location 'obs://mrs-word001/table/';
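    To further verify read and write access, you can insert a row into the table created above and query it back (an optional smoke test):

    insert into test values(1);

    select * from test;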

  8. Run the following command to query all tables. If table test is returned, OBS access is successful.

    show tables;

    Figure 3 Returned table names

  9. Press Ctrl+C to exit Spark Beeline.

Verifying OBS Access with Spark SQL

  1. Log in to the client installation node as the client installation user.
  2. Run the following command to configure environment variables:

    source Client installation directory/bigdata_env

  3. Modify the client configuration file to set dfs.namenode.acls.enabled to false:

    vim Client installation directory/Spark2x/spark/conf/hdfs-site.xml

    <property>
        <name>dfs.namenode.acls.enabled</name>
        <value>false</value>
    </property>
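    hdfs-site.xml is a standard Hadoop configuration file, so the property must sit inside the existing <configuration> root element. A minimal sketch of the resulting structure:

    <configuration>
        <property>
            <name>dfs.namenode.acls.enabled</name>
            <value>false</value>
        </property>
    </configuration>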

  4. For a security cluster, run the following command to authenticate the user. If Kerberos authentication is not enabled for the current cluster, skip this step.

    kinit Username

  5. Access OBS using the Spark SQL CLI. For example, create a table named test in the obs://mrs-word001/table/ directory.

    1. Run cd Client installation directory/Spark2x/spark/bin to go to the Spark bin directory, and then run ./spark-sql to log in to the Spark SQL CLI.
    2. Run the following command in the Spark SQL CLI:

      create table test(id int) location 'obs://mrs-word001/table/';

  6. Run the show tables; command to confirm that the table is created successfully.
  7. Run exit; to exit the Spark SQL CLI.

    If a large number of logs are generated when the OBS file system is read from or written to, performance may be affected. You can adjust the log level of the OBS client as follows:

    cd Client installation directory/Spark2x/spark/conf

    vi log4j.properties

    Add the OBS log level configuration to the file as follows. The two loggers cover the Hadoop OBS connector (org.apache.hadoop.fs.obs) and the OBS SDK (com.obs):

    log4j.logger.org.apache.hadoop.fs.obs=WARN

    log4j.logger.com.obs=WARN

    Figure 4 Adding an OBS log level

Using Spark Shell to Read OBS Files

  1. Log in to the client installation node as the client installation user.
  2. Run the following command to configure environment variables:

    source Client installation directory/bigdata_env

  3. Modify the client configuration file to set dfs.namenode.acls.enabled to false:

    vim Client installation directory/Spark2x/spark/conf/hdfs-site.xml

    <property>
        <name>dfs.namenode.acls.enabled</name>
        <value>false</value>
    </property>

  4. For a security cluster, run the following command to authenticate the user. If Kerberos authentication is not enabled for the current cluster, skip this step.

    kinit Username

  5. Create a file on OBS by creating a table and inserting data into it.

    1. Run the following commands to go to the Spark bin directory and log in to the Spark SQL CLI:

      cd Client installation directory/Spark2x/spark/bin

      ./spark-sql

    2. Run the following commands to create a table and import data to the table:

      create database test location "obs://Parallel file system path/test";

      use test;

      create table test1(a int,b int) using parquet;

      insert into test1 values(1,2);

      desc formatted test1;

      Figure 5 Checking the location of the table
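      As an optional check before exiting, you can query the row inserted above; the query should return (1,2):

      select * from test1;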

  6. Run the following command to go to the Spark bin directory:

    cd Client installation directory/Spark2x/spark/bin

    Run ./spark-shell to log in to the Spark Shell CLI.

  7. In the Spark Shell CLI, run the following command to query the table created in 5.b:

    spark.read.format("parquet").load("obs://Parallel file system path/test/test1").show()

    Figure 6 Viewing table data
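    Because the table is also registered in the metastore, an equivalent check in the same session is to query it by name rather than by path:

    spark.sql("select * from test.test1").show()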

  8. Run the :quit command to exit the Spark Shell CLI.