Hive Application Development Suggestions

HQL - implicit type conversion

If the query statements use the field value for filtering, do not use the implicit type conversion of Hive to compile HQL. The reason is that the implicit type conversion is not conducive to code reading and migration.

Correct:

select * from default.tbl_src where id = 10001; 
 select * from default.tbl_src where name = 'TestName';

Incorrect:

select * from default.tbl_src where id = '10001'; 
 select * from default.tbl_src where name = TestName;

Note that the type of the id field in the tbl_src table is Int, and the type of the name field is String.

HQL - object name length

The HQL object names include table names, field names, view names, and index names. It is recommended that the object name not exceed 30 bytes.

An error is reported if an object name of Oracle exceeds 30 bytes. PT also limits object names to 30 bytes.

Excessive long object names are not conductive to code reading, migration, and maintenance.

HQL - statistics of data records

To count the total number of records in a table, use select count(1) from table_name.

To count the number of valid records for a field in a table, use select count(column_name) from table_name.

JDBC - timeout limit

The JDBC provided by Hive supports timeout limit. The default value is 5 minutes. Users can use java.sql.DriverManager.setLoginTimeout(int seconds) to change the value. The unit of seconds is second.

UDF Management

It is recommended that the administrator creates permanent UDF. This is done to avoid repeated execution of the add jar statement and UDF redefining.

UDF of Hive has some default properties. For example, the default value of deterministic is true (indicating that the same result will be returned for the same input), and the default value of stateful is true. Corresponding annotates should be added when user-defined UDF conducts an internal data summary. The following is an example:

@UDFType(deterministic = false)   
Public class MyGenericUDAFEvaluator implements Closeable {

Suggestions on Optimizing Table Partitions

It is advised to use partition tables and store data by day when data volume is large and statistics need to be collected on a daily basis.
In order to avoid excessive small files, add distribute by to the partition field during dynamic partition data insertion.

Suggestions on Optimizing Storage File Formats

Hive supports multiple storage formats, including TextFile, RCFile, ORC, Sequence, and Parquet. If you want to save storage space or query certain fields for the most of time, use columnar storage, for example, ORC files, to create tables.

Parent topic: Hive

Previous topic: Hive Application Development Rules

Next topic: Hudi