Interconnecting Spark2x with OBS
The OBS file system can be interconnected with Spark2x after an MRS cluster is installed.
Before performing the following operations, ensure that you have configured a storage-compute decoupled cluster by referring to Configuring a Storage-Compute Decoupled Cluster (Agency) or Configuring a Storage-Compute Decoupled Cluster (AK/SK).
Verifying OBS Access with Spark Beeline
- Log in to FusionInsight Manager and choose Cluster > Services > Spark2x > Configurations > All Configurations. In the left navigation tree, choose JDBCServer2x > Customization, add dfs.namenode.acls.enabled to the spark.hdfs-site.customized.configs parameter, and set its value to false.
Figure 1 Adding Spark custom parameters
- Search for the spark.sql.statistics.fallBackToHdfs parameter and set its value to false.
Figure 2 Setting spark.sql.statistics.fallBackToHdfs
- Save the configurations and restart the JDBCServer2x instance.
- Log in to the client installation node as the client installation user.
- Run the following commands to configure environment variables:
source Client installation directory/bigdata_env
- For a security cluster, run the following command to perform user authentication. If Kerberos authentication is not enabled for the current cluster, you do not need to run this command.
kinit Username
- Access OBS using Spark beeline. Run spark-beeline to start the CLI, then create a table named test in the obs://mrs-word001/table/ directory (a consolidated session sketch follows these steps):
spark-beeline
create table test(id int) location 'obs://mrs-word001/table/';
- Run the following command to query all tables. If table test is returned, OBS access is successful.
show tables;
Figure 3 Returned table names
- Press Ctrl+C to exit Spark beeline.
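For reference, the whole verification can be run as the session below (a sketch; Client installation directory, Username, and the OBS path are the placeholders used in the steps above):
# set up the client environment
source Client installation directory/bigdata_env
# authenticate only if Kerberos is enabled
kinit Username
# start the Spark beeline CLI
spark-beeline
-- create an OBS-backed table, then list tables to confirm access
create table test(id int) location 'obs://mrs-word001/table/';
show tables;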
Verifying OBS Access with Spark SQL
- Log in to the client installation node as the client installation user.
- Run the following commands to configure environment variables:
source Client installation directory/bigdata_env
- Modify the configuration file:
vim Client installation directory/Spark2x/spark/conf/hdfs-site.xml
<property>
  <name>dfs.namenode.acls.enabled</name>
  <value>false</value>
</property>
- For a security cluster, run the following command to perform user authentication. If Kerberos authentication is not enabled for the current cluster, you do not need to run this command.
kinit Username
- Access OBS using the Spark SQL CLI. Run the following commands to start the CLI, then create a table named test in the obs://mrs-word001/table/ directory (a write verification sketch follows these steps):
cd Client installation directory/Spark2x/spark/bin
./spark-sql
create table test(id int) location 'obs://mrs-word001/table/';
- Run the show tables; command to confirm that the table is created successfully.
- Run exit; to exit the Spark SQL CLI.
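To confirm that writes to OBS work as well as reads, you can insert a row into the table and read it back from the same Spark SQL CLI (a sketch; table test is the one created above):
-- write one row to the OBS-backed table, then read it back
insert into test values(1);
select * from test;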
If the OBS client prints a large number of logs during reads and writes, performance may be affected. You can adjust the log level of the OBS client as follows:
cd Client installation directory/Spark2x/spark/conf
vi log4j.properties
Add the OBS log level configuration to the file as follows:
log4j.logger.org.apache.hadoop.fs.obs=WARN
log4j.logger.com.obs=WARN
Figure 4 Adding an OBS log level
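If you prefer a non-interactive edit, the same two settings can be appended from the shell (a sketch; the path assumes the client layout used above):
cd Client installation directory/Spark2x/spark/conf
# append the OBS client log levels to log4j.properties
cat >> log4j.properties <<'EOF'
log4j.logger.org.apache.hadoop.fs.obs=WARN
log4j.logger.com.obs=WARN
EOF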
Using Spark Shell to Read OBS Files
- Log in to the client installation node as the client installation user.
- Run the following commands to configure environment variables:
source Client installation directory/bigdata_env
- Modify the configuration file:
vim Client installation directory/Spark2x/spark/conf/hdfs-site.xml
<property>
  <name>dfs.namenode.acls.enabled</name>
  <value>false</value>
</property>
- For a security cluster, run the following command to perform user authentication. If Kerberos authentication is not enabled for the current cluster, you do not need to run this command.
kinit Username
- Create a file on OBS by creating a table and writing data to it.
- Run the following commands to log in to the Spark SQL CLI:
cd Client installation directory/Spark2x/spark/bin
./spark-sql
- Run the following commands to create a table and import data to the table:
create database test location "obs://Parallel file system path/test";
use test;
create table test1(a int,b int) using parquet;
insert into test1 values(1,2);
desc formatted test1;
Figure 5 Checking the location of the table
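In the desc formatted output, check the Location row; it should point to the table directory on OBS, for example (illustrative; the actual value depends on your parallel file system name):
Location    obs://Parallel file system path/test/test1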
- Run the following commands to log in to the Spark Shell CLI:
cd Client installation directory/Spark2x/spark/bin
./spark-shell
- In the Spark Shell CLI, run the following command to query the table test1 created earlier (a catalog-based query sketch follows these steps):
spark.read.format("parquet").load("obs://Parallel file system path/test/test1").show();
Figure 6 Viewing table data
- Run the :quit command to exit the Spark Shell CLI.
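Because the table is registered in the catalog, the same data can also be queried by name from the Spark Shell CLI (a sketch; database test and table test1 are the ones created above):
spark.sql("select * from test.test1").show()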