
Creating Table Partitions

Scenario

When running a SELECT query, Hive generally scans the entire table, which is time-consuming. To improve query efficiency, create table partitions based on service requirements and query dimensions.

Procedure

  1. Log in as user root to the node where the Hive client has been installed.
  2. Run the following command to go to the client installation directory, for example, /opt/client:

    cd /opt/client

  3. Run the source bigdata_env command to configure environment variables for the client.
  4. Run the following command on the client to log in:

    kinit Username

  5. Run the following command to log in to the client tool:

    beeline

  6. Choose static or dynamic partitioning.

    • Static partition:

      Specify the partition name manually. When creating a table, use the PARTITIONED BY keyword to define the partition column name and data type. During application development, use the ALTER TABLE ADD PARTITION statement to add a partition, and use the LOAD DATA INTO PARTITION statement to load data into it; LOAD DATA supports only static partitions. (A sketch of static partitioning is given after this procedure.)

    • Dynamic partition: Use an INSERT statement to write query results into a table partition; the target partition can be determined dynamically from the data being inserted. (A sketch is given after this procedure.)

      Dynamic partitioning can be enabled in the client tool by running the following command:

      set hive.exec.dynamic.partition=true;

      Dynamic partitioning defaults to strict mode: at least one column must be specified as a static partition, under which dynamic subpartitions can be created. You can run the following command to enable fully dynamic partitioning:

      set hive.exec.dynamic.partition.mode=nonstrict;

    • Dynamic partitioning may cause a DML statement to create a large number of partitions and the corresponding new directories, which deteriorates system performance.
    • If there are a large number of files, running a SQL statement takes a long time. You can run the set mapreduce.input.fileinputformat.list-status.num-threads=100; statement before the SQL statement to shorten the time. The mapreduce.input.fileinputformat.list-status.num-threads parameter can be set only after it is added to the Hive whitelist.
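
The following is a minimal HiveQL sketch of the static partitioning described above. The table name employees, the partition column dept, and the HDFS path are illustrative assumptions, not names from this product:

    -- Create a table with a partition column (illustrative names).
    CREATE TABLE employees (id INT, name STRING)
    PARTITIONED BY (dept STRING);

    -- Add a static partition explicitly.
    ALTER TABLE employees ADD PARTITION (dept='sales');

    -- Load data into the static partition (the HDFS path is illustrative).
    LOAD DATA INPATH '/tmp/employees_sales.txt'
    INTO TABLE employees PARTITION (dept='sales');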
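
Similarly, a dynamic-partition insert might look as follows once the two settings above are enabled. The source table employees_src is an illustrative assumption; the dept partition value for each row is taken from the last column of the SELECT list:

    set hive.exec.dynamic.partition=true;
    set hive.exec.dynamic.partition.mode=nonstrict;

    -- Each row's dept partition is derived from the last SELECT column.
    INSERT INTO TABLE employees PARTITION (dept)
    SELECT id, name, dept FROM employees_src;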