Spark JDBCServer APIs

Overview

JDBCServer is another implementation of HiveServer2 in Hive. It processes SQL statements with Spark SQL, which provides higher performance than Hive.

JDBCServer provides a JDBC interface. Users connect to JDBCServer through JDBC to access Spark SQL data. Starting JDBCServer also starts a Spark SQL application, and all clients connected through JDBC share the resources of that application, so different users can share data. JDBCServer also starts a listener that waits for JDBC clients to submit connection and query requests. Therefore, when configuring JDBCServer, you must configure at least its host name and port number. To use Hive data, you must also provide the URIs of the Hive MetaStore.

By default, JDBCServer starts a JDBC service on port 22550 of the node where it is installed. To change the port, configure the hive.server2.thrift.port parameter. You can connect to JDBCServer and run SQL statements by using Beeline or JDBC client code.

For other information about the JDBCServer, visit the Spark official website: http://archive.apache.org/dist/spark/docs/3.3.1/sql-programming-guide.html#distributed-sql-engine.

Beeline

For connection methods of the Beeline provided by the open-source community, visit https://cwiki.apache.org/confluence/display/Hive/HiveServer2+Clients.

To resolve Beeline connection issues caused by expired authentication, add authentication information to the Beeline connection by specifying the user.keytab and user.principal parameters in the JDBC URL. When the authentication expires, the client's login information is obtained again automatically and the connection is restored.

This removes the need to authenticate manually with the kinit command, because authentication obtained that way expires every 24 hours. The keytab file and principal information can be obtained from the administrator. The following command is an example of a Beeline connection:

sh CLIENT_HOME/spark/bin/beeline -u "jdbc:hive2://<zkNode1_IP>:<zkNode1_Port>,<zkNode2_IP>:<zkNode2_Port>,<zkNode3_IP>:<zkNode3_Port>;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=sparkthriftserver2x;user.principal=spark2x/hadoop.<System domain name>@<System domain name>;saslQop=auth-conf;auth=KERBEROS;principal=spark2x/hadoop.<System domain name>@<System domain name>;"

In the preceding command, <zkNode1_IP>:<zkNode1_Port>,<zkNode2_IP>:<zkNode2_Port>,<zkNode3_IP>:<zkNode3_Port> is the ZooKeeper URL. Use commas (,) to separate multiple addresses, for example, 192.168.81.37:2181,192.168.195.232:2181,192.168.169.84:2181.
sparkthriftserver2x is the ZooKeeper directory from which a JDBCServer instance is randomly selected for the client to connect to.
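
As an illustration of how the URL in the command above is assembled, the following Java sketch joins a list of ZooKeeper addresses and appends the same parameters as the Beeline example. The class and method names (JdbcServerUrl, build) are hypothetical and are not part of any Spark or Hive API. The principal values are placeholders that must be replaced with the values supplied by the administrator.

```java
import java.util.Arrays;
import java.util.List;

public class JdbcServerUrl {
    // Hypothetical helper: assembles a hive2 JDBC URL that uses ZooKeeper
    // service discovery, mirroring the parameters of the Beeline example.
    static String build(List<String> zkNodes, String namespace,
                        String userPrincipal, String serverPrincipal) {
        if (zkNodes == null || zkNodes.isEmpty()) {
            throw new IllegalArgumentException("at least one ZooKeeper address is required");
        }
        // Multiple ZooKeeper addresses are separated by commas.
        String quorum = String.join(",", zkNodes);
        return "jdbc:hive2://" + quorum
                + ";serviceDiscoveryMode=zooKeeper"
                + ";zooKeeperNamespace=" + namespace
                + ";user.principal=" + userPrincipal
                + ";saslQop=auth-conf;auth=KERBEROS"
                + ";principal=" + serverPrincipal + ";";
    }

    public static void main(String[] args) {
        // Placeholder principal copied from the example command; replace before use.
        String principal = "spark2x/hadoop.<System domain name>@<System domain name>";
        String url = build(
                Arrays.asList("192.168.81.37:2181", "192.168.195.232:2181", "192.168.169.84:2181"),
                "sparkthriftserver2x", principal, principal);
        System.out.println(url);
    }
}
```

The resulting string can be passed to Beeline with -u or to DriverManager.getConnection in JDBC client code.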

JDBC Client Code

Log in to JDBCServer by using JDBC client code to access Spark SQL data. For details, see Sample Projects for Accessing Spark SQL Through JDBC.

Enhanced Features

Compared with the open-source community version, Huawei provides two enhanced features: the JDBCServer HA solution and a configurable JDBCServer connection timeout interval.

In the JDBCServer HA solution, multiple active JDBCServer nodes provide services at the same time. When one node is faulty, new client connections are allocated to the other active nodes so that cluster services are not interrupted. The connection procedure is the same for Beeline and JDBC client code.
The timeout interval for the connection between the client and JDBCServer can be set as follows:
- Beeline
  During network congestion, this feature prevents Beeline from hanging while it waits indefinitely for a response from the server. The configuration method is as follows:
  
  When starting Beeline, add --socketTimeOut=n, where n is the time to wait for the server to respond, in seconds. The default value is 0 (never time out). Set the maximum wait time as required.
- JDBC Client Code
  During network congestion, this feature prevents the client from hanging while it waits indefinitely for a response from the server. The configuration method is as follows:
  
  Before calling DriverManager.getConnection to obtain the JDBC connection, call DriverManager.setLoginTimeout(n) to configure the timeout interval, where n is the time to wait for the server to respond, in seconds. The value is of type int, and the default value is 0 (never time out). Set the maximum wait time as required.
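
The JDBC client code setting described above can be sketched in Java as follows. This is a minimal example under stated assumptions: the class and method names (LoginTimeoutExample, connect) and the 30-second value are hypothetical, the JDBC URL is passed in as a command-line argument, and a suitable JDBC driver is assumed to be on the classpath. Only DriverManager.setLoginTimeout and DriverManager.getConnection are standard java.sql calls.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

public class LoginTimeoutExample {
    // Hypothetical helper: sets the login timeout (in seconds) before
    // requesting the connection, as described in the text above.
    static Connection connect(String url, int timeoutSeconds) throws SQLException {
        // 0 means never time out; the setting applies to all later getConnection calls.
        DriverManager.setLoginTimeout(timeoutSeconds);
        return DriverManager.getConnection(url);
    }

    public static void main(String[] args) {
        if (args.length == 0) {
            System.err.println("usage: LoginTimeoutExample <jdbc-url>");
            return;
        }
        // 30 seconds is an arbitrary illustrative value.
        try (Connection conn = connect(args[0], 30)) {
            System.out.println("Connection open: " + !conn.isClosed());
        } catch (SQLException e) {
            System.err.println("Connection failed or timed out: " + e.getMessage());
        }
    }
}
```

Because DriverManager.setLoginTimeout is a static, process-wide setting, it affects every connection obtained through DriverManager afterwards, not only the next one.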