Updated on 2022-02-22 GMT+08:00

Creating Table Partitions

Scenario

During the Select query, Hive generally scans the entire table, which is time-consuming. To improve query efficiency, create table partitions based on service requirements and query dimensions.

Procedure

  1. For versions earlier than MRS 3.x:

    Log in to the MRS console. In the left navigation pane, choose Clusters > Active Clusters, and click a cluster name. Choose Nodes > Node. The ECS page is displayed. Click Remote Login to log in to the Hive node.

    For MRS 3.x or later:

    Log in to the node where the Hive client has been installed as user root.

  2. Run the following command to go to the client installation directory, for example, /opt/client.

    cd /opt/client

  3. Run the source bigdata_env command to configure environment variables for the client.
  4. Run the following command on the client for login:

    kinit Username

  5. Run the following command to log in to the client tool:

    beeline

  6. Select the static or dynamic partition.

    • Static partition:

      Manually enter a partition name, and use the keyword PARTITIONED BY to specify partition column name and data type when creating a table. During application development, use the ALTER TABLE ADD PARTITION statement to add a partition and use the LOAD DATA INTO PARTITION statement to load data to the partition, which supports only static partitions.

    • Dynamic partition: Use a query command to insert results to a partition of a table. The partition can be a dynamic partition.

      The dynamic partition can be enabled on the client tool by running the following command:

      set hive.exec.dynamic.partition=true;

      The default mode of the dynamic partition is strict. That is, at least a column must be specified as a static partition, under which dynamic sub-partitions can be created. You can run the following command to enable a completely dynamic partition:

      set hive.exec.dynamic.partition.mode=nonstrict;

    • The dynamic partition may cause a DML statement to create a large number of partitions and new mapping folders, which deteriorates system performance.
    • If there are a large number of files, it takes a long time to run a SQL statement. You can run the set mapreduce.input.fileinputformat.list-status.num-threads = 100; statement before running a SQL statement to shorten the time. The parameter mapreduce.input.fileinputformat.list-status.num-threads can be set only after being added to the Hive whitelist.