
Rules

Load the Hive JDBC Driver

A client program connects to HiveServer using JDBC, so the Hive JDBC driver class org.apache.hive.jdbc.HiveDriver must be loaded first.

Therefore, load the driver class with the current class loader before the client program starts.

If the driver JAR file is not in the classpath, the client program throws a ClassNotFoundException and exits.

Example:

Class.forName("org.apache.hive.jdbc.HiveDriver").newInstance();

Obtain a Database Connection

Use the driver management class java.sql.DriverManager of the JDK to obtain a Hive database connection.

The Hive database URL is url="jdbc:hive2://xxx.xxx.xxx.xxx:2181,xxx.xxx.xxx.xxx:2181,xxx.xxx.xxx.xxx:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver;sasl.qop=auth-conf;auth=KERBEROS;principal=hive/hadoop.hadoop.com@HADOOP.COM;user.principal=hive/hadoop.hadoop.com;user.keytab=conf/hive.keytab";

The username and password of the Hive database are left empty ("") because authentication has already been performed successfully.

Example:

//Set up a connection.

connection = DriverManager.getConnection(url, "", "");

Execute HiveQL

Execute HiveQL. Note that a HiveQL statement submitted through JDBC cannot end with a semicolon (;).

Correct example:

String sql = "SELECT COUNT(*) FROM employees_info";
Connection connection = DriverManager.getConnection(url, "", "");
PreparedStatement statement = connection.prepareStatement(sql);
ResultSet resultSet = statement.executeQuery();

Incorrect example:

String sql = "SELECT COUNT(*) FROM employees_info;";
Connection connection = DriverManager.getConnection(url, "", "");
PreparedStatement statement = connection.prepareStatement(sql);
ResultSet resultSet = statement.executeQuery();

Close a Database Connection

After executing the HiveQL, the client must close the database connection to prevent memory leaks.

Both the JDBC Statement and Connection objects must be closed.

Example:

finally {
    // Close the JDBC statement.
    if (null != statement) {
        statement.close();
    }

    // Close the JDBC connection.
    if (null != connection) {
        connection.close();
    }
}
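
Alternatively, a try-with-resources statement can close both objects automatically. The following is a minimal sketch (not part of the product sample); it assumes the url variable from the earlier example and that authentication has already been performed:

// Sketch: statement and connection are closed automatically, in reverse order of
// creation, when the block exits, even if an exception is thrown.
try (Connection connection = DriverManager.getConnection(url, "", "");
     PreparedStatement statement = connection.prepareStatement("SELECT COUNT(*) FROM employees_info");
     ResultSet resultSet = statement.executeQuery()) {
    while (resultSet.next()) {
        System.out.println(resultSet.getLong(1));
    }
}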

Use HiveQL Syntax to Check Whether a Field is Null

Use is null to check whether a field is NULL, that is, the field has no value. Use is not null to check whether a field is not NULL, that is, the field has a value.

If you use is null on a field whose type is String and whose value is an empty string (length 0), false is returned. In this case, use col = '' to match empty character strings and col != '' to match non-empty character strings.

Correct example:

select * from default.tbl_src where id is null;
select * from default.tbl_src where id is not null;
select * from default.tbl_src where name = '';
select * from default.tbl_src where name != '';

Incorrect example:

select * from default.tbl_src where id = null;
select * from default.tbl_src where id != null;
select * from default.tbl_src where name is null;
select * from default.tbl_src where name is not null;

Note: The id field in the tbl_src table is of the Int type, and the name field is of the String type.

Ensure that Parameter Configurations Are Consistent Between the Client and Server

If the configuration parameters of the Hive, YARN, or HDFS servers in the cluster are modified, the related parameters in client programs must be modified accordingly. Check whether the configuration parameters that were submitted to HiveServer before the modification are still consistent with those on the servers; if they are inconsistent, modify them on the client and submit them to HiveServer again. In the following example, the YARN configuration parameters of the cluster are modified, so the configuration parameters submitted to HiveServer from Hive clients and sample programs before the modification must be reviewed and updated.

Initial state

The parameter configuration of YARN in the cluster is as follows:

mapreduce.reduce.java.opts=-Xmx2048M

The parameter configuration on the client is as follows:

mapreduce.reduce.java.opts=-Xmx2048M

The parameter configuration of YARN in the cluster after the modification is as follows:

mapreduce.reduce.java.opts=-Xmx1024M

If the parameter in the client program is not modified accordingly, the old client value still takes effect when jobs are submitted. As a result, the reduce task memory is insufficient and the MapReduce job fails to run.
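
One way to keep the client consistent is to submit the updated value together with the job. The following is a minimal sketch (assuming the url variable from the earlier examples): a set statement is executed on the JDBC connection before the query, so the value sent to HiveServer matches the value configured on the server.

// Sketch: push the updated parameter with the job so that the client-side value
// matches the YARN configuration on the server.
Connection connection = DriverManager.getConnection(url, "", "");
Statement statement = connection.createStatement();
statement.execute("set mapreduce.reduce.java.opts=-Xmx1024M");
ResultSet resultSet = statement.executeQuery("SELECT COUNT(*) FROM employees_info");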

Multithread Security Login Mode

If multiple threads perform login operations, only the first successful login of the application uses the login mode; all subsequent logins of any thread must use the relogin mode.

Login sample code:

  private Boolean login(Configuration conf){
    boolean flag = false;
    UserGroupInformation.setConfiguration(conf);

    try {
      // PRINCIPAL and KEYTAB are the configuration keys that hold the user
      // principal and the keytab file path.
      UserGroupInformation.loginUserFromKeytab(conf.get(PRINCIPAL), conf.get(KEYTAB));
      System.out.println("UserGroupInformation.isLoginKeytabBased(): " + UserGroupInformation.isLoginKeytabBased());
      flag = true;
    } catch (IOException e) {
      e.printStackTrace();
    }
    return flag;
  }

Relogin sample code:

public Boolean relogin(){
    boolean flag = false;
    try {
        UserGroupInformation.getLoginUser().reloginFromKeytab();
        System.out.println("UserGroupInformation.isLoginKeytabBased(): " + UserGroupInformation.isLoginKeytabBased());
        flag = true;
    } catch (IOException e) {
        e.printStackTrace();
    }
    return flag;
}
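
The following sketch shows one way the two methods above might be coordinated across threads. The lock and flag are assumptions for illustration, not part of the product sample: the first thread performs the login, and every later call from any thread performs a relogin.

// Sketch: only the first successful call performs loginUserFromKeytab();
// all later calls, from any thread, call reloginFromKeytab() instead.
private static final Object LOGIN_LOCK = new Object();
private static volatile boolean loggedIn = false;

public boolean ensureLogin(Configuration conf) {
    synchronized (LOGIN_LOCK) {
        if (!loggedIn) {
            // First successful login of the application.
            loggedIn = login(conf);
        } else {
            // Subsequent logins of all threads use the relogin mode.
            relogin();
        }
        return loggedIn;
    }
}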

Prerequisites for Using REST APIs of WebHCat to Submit MapReduce Jobs in Streaming Mode

This API depends on the Hadoop streaming package. Before submitting a MapReduce job to WebHCat in streaming mode, perform the following steps to upload the hadoop-streaming-x.x.x.jar package to the specified path in HDFS (hdfs:///apps/templeton/hadoop-streaming-xxx.jar).

  1. Log in to the node where the client and the Hive service are installed. Assume that the client installation path is /opt/client. Run the following command to configure environment variables:

    source /opt/client/bigdata_env

  2. Run the kinit command to log in as a human-machine or machine-machine user of the cluster.
  3. Run the following command to save the streaming package to the /apps/templeton directory in HDFS:

    • For MRS 1.9.2 or later, run the following command:

      hdfs dfs -put /opt/Bigdata/MRS_x.x.x/install/FusionInsight-Hadoop-x.x.x/hadoop/share/hadoop/tools/lib/hadoop-streaming-x.x.x-mrs-x.x.jar /apps/templeton

      Change /apps/templeton/ according to the instance. The default instance uses /apps/templeton/, the Hive1 instance uses /apps1/templeton/, and so on.


For example, submit a streaming MapReduce job as follows:
  1. Create an HDFS directory.

    hdfs dfs -mkdir /user/root/input/

  2. Upload the custom data file new.txt to the HDFS directory.

    hdfs dfs -put new.txt /user/root/input/

  3. Submit the MapReduce job.
    • For MRS 1.9.2 or later, run the following command:

      $HADOOP_HOME/bin/hadoop jar /opt/client/HDFS/hadoop/share/hadoop/tools/lib/hadoop-streaming-x.x.x-mrs-x.x.jar -input input/new.txt -output output -mapper 'cut -f 2 -d ,' -reducer 'uniq'

      The parameters in the preceding command are described as follows:

      • The second output in -output output specifies the directory for storing the generated files. The directory is created automatically after the command is executed. Before running the command, ensure that the output directory does not exist under /user/root/; otherwise, an error is reported indicating that the directory already exists.
      • Change the hadoop-streaming JAR file in the preceding command to the actual JAR file provided in the cluster.
      • The parameters after -mapper and -reducer can be customized.
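
The same streaming job can also be submitted through the WebHCat mapreduce/streaming REST API. The following is a minimal sketch only: the WebHCat host, the default open-source port 50111, the /templeton/v1 path, and the user.name parameter are assumptions, and a security cluster additionally requires Kerberos (SPNEGO) authentication, which is omitted here.

// Sketch: POST the streaming job definition to WebHCat.
URL endpoint = new URL(
        "http://webhcat-host:50111/templeton/v1/mapreduce/streaming?user.name=root");
HttpURLConnection conn = (HttpURLConnection) endpoint.openConnection();
conn.setRequestMethod("POST");
conn.setDoOutput(true);
conn.setRequestProperty("Content-Type", "application/x-www-form-urlencoded");
String body = "input=" + URLEncoder.encode("/user/root/input/new.txt", "UTF-8")
        + "&output=" + URLEncoder.encode("output", "UTF-8")
        + "&mapper=" + URLEncoder.encode("cut -f 2 -d ,", "UTF-8")
        + "&reducer=" + URLEncoder.encode("uniq", "UTF-8");
try (OutputStream out = conn.getOutputStream()) {
    out.write(body.getBytes("UTF-8"));
}
// A 200 response indicates that WebHCat accepted the job; the body contains the job ID.
System.out.println("WebHCat response code: " + conn.getResponseCode());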

Do Not Perform Read and Write Operations on the Same Table at the Same Time

In the current version, Hive does not support concurrent operations. Therefore, do not perform read and write operations on the same table at the same time. Otherwise, query results may be inaccurate or tasks may even fail.

Bucket Table Does Not Support insert into

A bucket table does not support insert into and supports only insert overwrite. If you use insert into, the number of files is inconsistent with the number of buckets.
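
For example, loading data into a bucket table through JDBC would use insert overwrite. A minimal sketch, assuming an existing connection and the example tables tbl_bucket and tbl_src:

// Sketch: write a bucket table with INSERT OVERWRITE, not INSERT INTO, so that the
// number of generated files stays consistent with the number of buckets.
Statement statement = connection.createStatement();
statement.execute("INSERT OVERWRITE TABLE tbl_bucket SELECT * FROM default.tbl_src");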

Prerequisites for Using Some REST APIs of WebHCat

Some REST APIs of WebHCat depend on the JobHistoryServer instance of MapReduce. The APIs are as follows:

  • mapreduce/jar(POST)
  • mapreduce/streaming(POST)
  • hive(POST)
  • jobs(GET)
  • jobs/:jobid(GET)
  • jobs/:jobid(DELETE)

Hive Authorization Description

It is recommended that Hive authorization (databases, tables, or views) be performed on the Manager authorization page. Authorization through the CLI is not recommended, except for the scenario alter databases database_name set owner='user_name';.

Hive on HBase Partition Tables Cannot Be Created

Data of Hive on HBase tables is stored in HBase. Because HBase tables are divided into multiple regions that are scattered across RegionServers, Hive on HBase partition tables cannot be created in Hive.

Hive on HBase Tables Do Not Support insert overwrite

HBase uses a RowKey to uniquely identify a record. If data to be inserted has the same RowKey as the existing data, HBase will use the new data to overwrite the existing data. If insert overwrite is performed for a Hive on HBase table on Hive, only data with the same RowKey will be overwritten.