Updated on 2024-10-23 GMT+08:00

Spark JDBCServer APIs

Overview

JDBCServer is another implementation of HiveServer2 in Hive. It uses Spark SQL as the underlying engine to process SQL statements, and therefore performs better than Hive.

JDBCServer provides a JDBC interface. You can log in to JDBCServer and access Spark SQL data through JDBC. When JDBCServer starts, it launches a Spark SQL application, and all clients connected through JDBC share the resources of that application, so different users can share data. JDBCServer also starts a listener that waits for JDBC client connections and submits queries once a client is connected. Therefore, at least the host name and port of JDBCServer must be configured. If Hive data is required, the URIs of the Hive metastore must also be provided.

By default, JDBCServer starts a JDBC service on port 22550 of the installation node. To change the port, configure the hive.server2.thrift.port parameter. You can connect to JDBCServer with Beeline or with JDBC client code to run SQL statements.

For more information about JDBCServer, see the official Spark documentation at http://archive.apache.org/dist/spark/docs/3.3.1/sql-programming-guide.html#distributed-sql-engine.

Beeline

For the Beeline connection methods provided by the open-source community, see https://cwiki.apache.org/confluence/display/Hive/HiveServer2+Clients.

To avoid connection failures caused by expired credentials, authentication information is added to the Beeline connection: the user.keytab and user.principal parameters are added to the JDBC URL. When the ticket expires, the client's login information is read automatically and the connection is re-established.

This is useful if you do not want to authenticate by running the kinit command, because a ticket obtained that way expires every 24 hours. Obtain the keytab file and principal information from the administrator. The following command is an example of a Beeline connection.

sh CLIENT_HOME/spark/bin/beeline -u "jdbc:hive2://<zkNode1_IP>:<zkNode1_Port>,<zkNode2_IP>:<zkNode2_Port>,<zkNode3_IP>:<zkNode3_Port>;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=sparkthriftserver2x;user.principal=spark2x/hadoop.<system domain name>@<system domain name>;saslQop=auth-conf;auth=KERBEROS;principal=spark2x/hadoop.<system domain name>@<system domain name>;"

  • <zkNode1_IP>:<zkNode1_Port>,<zkNode2_IP>:<zkNode2_Port>,<zkNode3_IP>:<zkNode3_Port> indicates the addresses of the ZooKeeper nodes. Separate multiple addresses with commas, for example, 192.168.81.37:2181,192.168.195.232:2181,192.168.169.84:2181.
  • sparkthriftserver2x indicates the ZooKeeper directory from which a JDBCServer instance is randomly selected for the client connection.
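The structure of the connection string above can be sketched in Java as follows. This is only an illustration of how the pieces fit together, not code from the product samples: the class name JdbcServerUrlSketch and the method buildUrl are hypothetical, and the principal values remain placeholders that must be replaced with the values for your cluster.

```java
import java.util.Arrays;
import java.util.List;

public class JdbcServerUrlSketch {

    /**
     * Joins the parts of the ZooKeeper service-discovery URL in the same order
     * as the Beeline example. All values are supplied by the caller.
     * (Hypothetical helper, for illustration only.)
     */
    static String buildUrl(List<String> zkNodes, String namespace,
                           String userPrincipal, String serverPrincipal) {
        return "jdbc:hive2://" + String.join(",", zkNodes)  // comma-separated ZooKeeper addresses
                + ";serviceDiscoveryMode=zooKeeper"
                + ";zooKeeperNamespace=" + namespace         // directory that lists JDBCServer instances
                + ";user.principal=" + userPrincipal
                + ";saslQop=auth-conf"
                + ";auth=KERBEROS"
                + ";principal=" + serverPrincipal
                + ";";
    }

    public static void main(String[] args) {
        // ZooKeeper addresses taken from the example in the text above.
        List<String> zkNodes = Arrays.asList(
                "192.168.81.37:2181", "192.168.195.232:2181", "192.168.169.84:2181");
        // Placeholder principal, exactly as written in the Beeline example.
        String principal = "spark2x/hadoop.<system domain name>@<system domain name>";
        System.out.println(buildUrl(zkNodes, "sparkthriftserver2x", principal, principal));
    }
}
```

The resulting string is what would be passed to Beeline after -u, or to a JDBC client as the connection URL.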

JDBC Client Code

Log in to JDBCServer using JDBC client code to access Spark SQL data. For details, see Sample Projects for Accessing Spark SQL Through JDBC.
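As a rough idea of what such client code looks like, the following minimal sketch uses only the standard java.sql API. It is not taken from the sample projects. It assumes that the Hive JDBC driver JAR is on the classpath and that a complete connection URL is passed as the first argument; the default URL contains a placeholder host, and the class name JdbcServerQuerySketch and the SHOW DATABASES statement are chosen for illustration only.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.List;

public class JdbcServerQuerySketch {

    /** Runs a statement and returns the first column of every result row as text. */
    static List<String> firstColumn(Connection conn, String sql) throws SQLException {
        List<String> values = new ArrayList<>();
        try (Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(sql)) {
            while (rs.next()) {
                values.add(rs.getString(1));
            }
        }
        return values;
    }

    public static void main(String[] args) {
        // Placeholder URL: replace the host, and add authentication parameters
        // if the cluster requires them. 22550 is the default JDBCServer port.
        String url = args.length > 0 ? args[0] : "jdbc:hive2://<JDBCServer_host>:22550/";
        try (Connection conn = DriverManager.getConnection(url)) {
            for (String name : firstColumn(conn, "SHOW DATABASES")) {
                System.out.println(name);
            }
        } catch (SQLException e) {
            // Reached when the driver is missing or the server cannot be contacted.
            System.err.println("Could not query JDBCServer: " + e.getMessage());
        }
    }
}
```

The try-with-resources blocks close the statement, result set, and connection even if the query fails, which matters here because all clients share the resources of one Spark SQL application.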

Enhanced Features

Compared with the open-source community version, Huawei provides two enhanced features: the JDBCServer HA solution and a configurable timeout for JDBCServer connections.

  • The JDBCServer HA solution is described as follows:

    When multiple active JDBCServer nodes provide services at the same time and one node becomes faulty, new client connections are routed to another active node, ensuring continuous service for the cluster. The procedure is the same for Beeline and JDBC client code.

  • Configure the timeout of the connection between the client and JDBCServer.
    • Beeline

      During network congestion, this feature prevents Beeline from hanging while it waits indefinitely for a response from the server. To use it:

      When starting Beeline, add --socketTimeOut=n, where n is the timeout for waiting for a response from the service. The unit is seconds and the default value is 0 (never time out). Set the maximum waiting time as required.

    • JDBC Client Code

      During network congestion, this feature prevents the client from hanging while it waits indefinitely for a response from the server. To use it:

      Before obtaining the JDBC connection with the DriverManager.getConnection method, call the DriverManager.setLoginTimeout(n) method to configure the timeout, where n is the timeout for waiting for a response from the service. The unit is seconds and the type is int. The default value is 0 (never time out). Set the maximum waiting time as required.
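The call order described above can be sketched as follows. DriverManager.setLoginTimeout and DriverManager.getConnection are standard java.sql methods; the class name LoginTimeoutSketch, the helper connectWithTimeout, the 30-second value, and the placeholder URL are assumptions made for this illustration.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

public class LoginTimeoutSketch {

    /**
     * Sets the login timeout first, then opens the connection, so that the
     * timeout is already in effect when the driver connects.
     * (Hypothetical helper, for illustration only.)
     */
    static Connection connectWithTimeout(String url, int timeoutSeconds) throws SQLException {
        DriverManager.setLoginTimeout(timeoutSeconds); // unit: seconds; 0 means never time out
        return DriverManager.getConnection(url);
    }

    public static void main(String[] args) {
        // Placeholder URL: replace the host, and add authentication parameters
        // if the cluster requires them.
        String url = args.length > 0 ? args[0] : "jdbc:hive2://<JDBCServer_host>:22550/";
        try (Connection conn = connectWithTimeout(url, 30)) {
            System.out.println("Connected, closed=" + conn.isClosed());
        } catch (SQLException e) {
            // Reached on timeout, on a missing driver, or when the server is unreachable.
            System.err.println("Connection failed: " + e.getMessage());
        }
    }
}
```

Note that DriverManager.setLoginTimeout is a static setting, so it applies to every connection subsequently opened through DriverManager in the same JVM, not only to the JDBCServer connection.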