Creating Hive Table Partitions to for Faster Queries

Scenario

During the Select query, Hive generally scans the entire table, which is time-consuming. To improve query efficiency, create table partitions based on service requirements and query dimensions.

Procedure

For versions earlier than MRS 3.x:

Log in to the MRS console. In the left navigation pane, choose Active Clusters and click a cluster name. Choose Nodes > Node. The ECS page is displayed. Click Remote Login to log in to the Hive node.

For MRS 3.x and later:

Log in to the node where the Hive client has been installed as user root.
Run the following command to go to the client installation directory, for example, /opt/client:

cd /opt/client
Run the source bigdata_env command to configure environment variables for the client.
Run the following command on the client for login:

kinit Username
Run the following command to log in to the client tool:

beeline
Select the static or dynamic partition.
- Static partition:
  Manually enter a partition name, and use the keyword PARTITIONED BY to specify partition column name and data type when creating a table. During application development, use the ALTER TABLE ADD PARTITION statement to add a partition and use the LOAD DATA INTO PARTITION statement to load data to the partition, which supports only static partitions.
- Dynamic partition: Use a query command to insert results to a partition of a table. The partition can be a dynamic partition.
  The dynamic partition can be enabled on the client tool by running the following command:
  
  set hive.exec.dynamic.partition=true;
  
  The default mode of the dynamic partition is strict. That is, at least a column must be specified as a static partition, under which dynamic sub-partitions can be created. You can run the following command to enable a completely dynamic partition:
  
  set hive.exec.dynamic.partition.mode=nonstrict;
- The dynamic partition may cause a DML statement to create a large number of partitions and new mapping folders, which deteriorates system performance.
- If there are a large number of files, it takes a long time to run a SQL statement. You can run the set mapreduce.input.fileinputformat.list-status.num-threads = 100; statement before running a SQL statement to shorten the time. The parameter mapreduce.input.fileinputformat.list-status.num-threads can be set only after being added to the Hive whitelist.