
Using Hive from Scratch

Hive is a data warehouse framework built on Hadoop. It maps structured data files to database tables and provides SQL-like functions for analyzing and processing the data. It also lets you run simple MapReduce statistics with SQL-like statements, without developing a dedicated MapReduce application, which makes it well suited to the statistical analysis of data warehouses.

Background

Suppose a user develops an application to manage the users of service A in an enterprise. The procedure for operating on service A data from the Hive client is as follows:

Operations on internal tables:

  • Create the user_info table.
  • Add users' educational backgrounds and professional titles to the table.
  • Query user names and addresses by user ID.
  • Delete the user information table after service A ends.
Table 1 User information

ID             Name   Gender   Age   Address
-----------    ----   ------   ---   -------
12005000201    A      Male     19    City A
12005000202    B      Female   23    City B
12005000203    C      Male     26    City C
12005000204    D      Male     18    City D
12005000205    E      Female   21    City E
12005000206    F      Male     32    City F
12005000207    G      Female   29    City G
12005000208    H      Female   30    City H
12005000209    I      Male     26    City I
12005000210    J      Female   25    City J

Procedure

  1. Download the client configuration file.

    1. Log in to FusionInsight Manager and click Download Client in the upper right corner of the home page.
    2. Set Select Client Type to Configuration Files Only, select a platform type, set the download destination to Server, and click OK to generate the client configuration file. The generated file is saved in the /tmp/FusionInsight-Client/ directory on the active management node by default.

  2. Log in to the active management node of Manager.

    1. Log in to any node where Manager is deployed as user root.
    2. Run the following command to identify the active and standby nodes:

      sh ${BIGDATA_HOME}/om-server/om/sbin/status-oms.sh

      In the command output, the value of HAActive for the active management node is active, and that for the standby management node is standby. In the following example, node-master1 is the active management node, and node-master2 is the standby management node.

      HAMode 
      double 
      NodeName             HostName        HAVersion          StartTime                HAActive             HAAllResOK           HARunPhase  
      192-168-0-30         node-master1    V100R001C01        2020-05-01 23:43:02      active               normal               Actived     
      192-168-0-24         node-master2    V100R001C01        2020-05-01 07:14:02      standby              normal               Deactived 
    3. Log in to the active management node as user root and run the following command to switch to user omm:

      sudo su - omm

  3. Run the following command to go to the client installation directory:

    cd /opt/client

    This step assumes that the cluster client has been installed in advance. The installation directory /opt/client is used as an example; change it based on site requirements.

  4. Run the following command on the active management node to update the client configuration:

    sh refreshConfig.sh /opt/client Full path of the client configuration file package

    For example, run the following command:

    sh refreshConfig.sh /opt/client /tmp/FusionInsight-Client/FusionInsight_Cluster_1_Services_Client.tar

    If the following information is displayed, the configurations have been updated successfully.

     ReFresh components client config is complete.
     Succeed to refresh components client config.

  5. Use the client on a Master node.

    1. On the active management node, for example, 192-168-0-30, run the following command to switch to the client directory, for example, /opt/client:

      cd /opt/client

    2. Run the following command to configure environment variables:

      source bigdata_env

    3. If Kerberos authentication is enabled for the current cluster, run the following command to authenticate the current user:

      kinit MRS cluster user

      Example: kinit hiveuser

      The current user must have the permission to create Hive tables. If Kerberos authentication is disabled, skip this step.

    4. Run the following command to log in to the Hive client CLI:

      beeline

      Hive allows you to add an extension identifier to the JDBC connection string. The identifier is printed in HiveServer audit logs so that SQL sources can be distinguished. To use it, append the following to the connection URL:

      auditAddition=xxx

      xxx is the custom identifier. It can contain a maximum of 256 bytes, and only letters, digits, underscores (_), commas (,), and colons (:) are allowed.

      For details about how to set the extension identifier using code, see the Hive Development Guide. The client connection can be set in either of the following ways:

      • Modify the Client installation directory/Hive/component_env file, add \;auditAddition=xxx to the end of the CLIENT_HIVE_URI parameter, and run the source bigdata_env command again to apply the changes.
      • When using a specified JDBC URL to connect to the Hive client, add ;auditAddition=xxx at the end of the URL. The following is an example:
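        The host, port, and identifier below are placeholders for illustration only; use the connection values (address, port, and security settings) from your own cluster configuration:

        beeline -u "jdbc:hive2://192.168.0.30:10000/default;auditAddition=groupA"   # placeholder host/port/identifier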

  6. Run Hive client commands to implement service A.

    Operations on internal tables:

    1. Create the user_info user information table according to Table 1 and add data to it.

      create table user_info(id string,name string,gender string,age int,addr string);

      insert into table user_info(id,name,gender,age,addr) values("12005000201","A","Male",19,"City A");
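      The remaining rows in Table 1 can be inserted in the same way. As a sketch, Hive (0.14 and later) also accepts multiple value tuples in a single statement, for example for the next two users:

      insert into table user_info values
      ("12005000202","B","Female",23,"City B"),
      ("12005000203","C","Male",26,"City C");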

    2. Add users' educational backgrounds and professional titles to the user_info table.

      For example, to record educational background and professional title information for users such as 12005000201, first add education and technical columns to the table; a sketch for then populating the values follows the command:

      alter table user_info add columns(education string,technical string);
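      Hive tables are not updatable by default, so one way to record the new values for an existing user is to rewrite the table from itself. This is a sketch only; the education and title values ("Bachelor", "Engineer") are assumed sample data, not part of the original procedure:

      -- rewrite every row, filling the new columns for user 12005000201
      insert overwrite table user_info
      select id,name,gender,age,addr,
             case when id='12005000201' then 'Bachelor' else education end,
             case when id='12005000201' then 'Engineer' else technical end
      from user_info;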

    3. Query user names and addresses by user ID.

      For example, to query the name and address of user 12005000201, run the following command:

      select name,addr from user_info where id='12005000201';

    4. Delete the user information table.

      drop table user_info;

    Operations on external partition tables:

    Create an external partition table and import data.

    1. Create a path for storing external table data.

      hdfs dfs -mkdir /hive/

      hdfs dfs -mkdir /hive/user_info

    2. Create a table.

      create external table user_info(id string,name string,gender string,age int,addr string) partitioned by(year string) row format delimited fields terminated by ' ' lines terminated by '\n' stored as textfile location '/hive/user_info';

      fields terminated by specifies the field delimiter, a space in this example.

      lines terminated by specifies the line separator, \n in this example.

      location '/hive/user_info' specifies the HDFS path where the table data is stored.
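      To confirm that the table was created as an external, partitioned table, standard Hive commands can be used to inspect its definition (output details vary by Hive version):

      show create table user_info;
      describe formatted user_info;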

    3. Import data.
      1. Execute the insert statement to insert data.

        insert into user_info partition(year="2018") values ("12005000201","A","Male",19,"City A");

      2. Run the load data command to import file data.
        1. Create a file based on the data in Table 1, for example, txt.log. Separate fields with spaces and end each record with a line break; a sample is shown below.
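          A sample of the first two records follows. The addresses are written without spaces (CityA) because space is the field delimiter; if your data contains spaces, choose a different delimiter in the table definition:

          12005000201 A Male 19 CityA
          12005000202 B Female 23 CityB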
        2. Upload the file to HDFS.

          hdfs dfs -put txt.log /tmp

        3. Load the data into the table. Note that load data inpath moves (rather than copies) the file from /tmp/txt.log into the table's storage location.

          load data inpath '/tmp/txt.log' into table user_info partition (year='2011');

    4. Query the imported data.

      select * from user_info;
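      Because the table is partitioned by year, a query can also be limited to a single partition so that the other partitions are not scanned, for example:

      select * from user_info where year='2018';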

    5. Delete the user information table.

      drop table user_info;

    6. Run the following command to exit:

      !q