Creating Hive Table Partitions to for Faster Queries

Scenario

During the Select query, Hive generally scans the entire table, which is time-consuming. To improve query efficiency, create table partitions based on service requirements and query dimensions.

Procedure

For versions earlier than MRS 3.x:

Log in to the MRS console. In the left navigation pane, choose Active Clusters and click a cluster name. Choose Nodes > Node. The ECS page is displayed. Click Remote Login to log in to the Hive node.

For MRS 3.x and later:

Log in to the node where the Hive client has been installed as user root.
Run the following command to go to the client installation directory, for example, /opt/client:

cd /opt/client
Run the source bigdata_env command to configure environment variables for the client.
Run the following command on the client for login:

kinit Username
Run the following command to log in to the client tool:

beeline
Select the static or dynamic partition.
- Static partition:
  Manually enter a partition name, and use the keyword PARTITIONED BY to specify partition column name and data type when creating a table. During application development, use the ALTER TABLE ADD PARTITION statement to add a partition and use the LOAD DATA INTO PARTITION statement to load data to the partition, which supports only static partitions.
- Dynamic partition: Use a query command to insert results to a partition of a table. The partition can be a dynamic partition.
  The dynamic partition can be enabled on the client tool by running the following command:
  
  set hive.exec.dynamic.partition=true;
  
  The default mode of the dynamic partition is strict. That is, at least a column must be specified as a static partition, under which dynamic sub-partitions can be created. You can run the following command to enable a completely dynamic partition:
  
  set hive.exec.dynamic.partition.mode=nonstrict;
- Dynamic partitioning may cause a DML statement to create a large number of partitions and new mapping folders, which deteriorates system performance.
- If there are a large number of files, it takes a long time to run a SQL statement. You can run the set mapreduce.input.fileinputformat.list-status.num-threads = 100; statement before running a SQL statement to shorten the time. The parameter mapreduce.input.fileinputformat.list-status.num-threads can be set only after being added to the Hive whitelist.

Parent topic: Hive Performance Tuning

Previous topic: Hive Performance Tuning

Next topic: Hive Join Optimization

Feedback

Was this page helpful?

Helpful Not helpful

Provide feedback

Thank you very much for your feedback. We will continue working to improve the documentation.See the reply and handling status in My Cloud VOC.

The system is busy. Please try again later.

For any further questions, feel free to contact us through the chatbot.

Chatbot