Using Hive for Data Analysis

Hive is a data warehouse framework built on Hadoop. It maps structured data files to a database table and provides SQL-like functions to analyze and process data. It also allows you to quickly perform simple MapReduce statistics using SQL-like statements without the need of developing a specific MapReduce application. It is suitable for statistical analysis of data warehouses.

Background

Suppose a user develops an application to manage users who use service A in an enterprise. The procedure of operating service A on the Hive client is as follows:

Operations on common tables:

Create the user_info table.
Add users' educational backgrounds and professional titles to the table.
Query user names and addresses by user ID.
Delete the user information table after service A ends.

**Table 1** User information
ID	Name	Gender	Age	Address
12005000201	A	Male	19	City A
12005000202	B	Female	23	City B
12005000203	C	Male	26	City C
12005000204	D	Male	18	City D
12005000205	E	Female	21	City E
12005000206	F	Male	32	City F
12005000207	G	Female	29	City G
12005000208	H	Female	30	City H
12005000209	I	Male	26	City I
12005000210	J	Female	25	City J

Procedure

Log in to the node where the client is installed as the client installation user. For details about how to install the client, see Installing a Client (MRS 3.x or Later).
Run the following command to switch to the client directory, for example, /opt/client.

cd /opt/client
Run the following command to configure environment variables:

source bigdata_env
If Kerberos authentication is enabled for the current cluster, run the following command to authenticate the current user:

kinit MRS cluster user

Example: user kinit hiveuser

The current user must have the permission to create Hive tables. To create a role with the permission, see Creating a Role To bind the role to the current user, see Creating a User. If Kerberos authentication is disabled, skip this step.
Run the Hive client command to implement service A.
- Operations on internal tables
  1. Run the following command to log in to the Hive client CLI:
    beeline
  2. Create the user_info user information table according to Table 1 and add data to it.
    create table user_info(id string,name string,gender string,age int,addr string);
    
    insert into table user_info(id,name,gender,age,addr) values("12005000201","A","Male",19,"City A");
  3. Add users' educational backgrounds and professional titles to the user_info table.
    For example, to add educational background and title information about user 12005000201, run the following command:
    
    alter table user_info add columns(education string,technical string);
  4. Query user names and addresses by user ID.
    For example, to query the name and address of user 12005000201, run the following command:
    
    select name,addr from user_info where id='12005000201';
  5. Delete the user information table.
    drop table user_info;
  6. Run the following command to exit:
    !q
- Operations on external partition tables
  1. You can run the insert statement to insert data into an external table or run the load data command to import file data from HDFS to an external table. To run the load data command to import file data, perform the following operations:
    1. Create a file based on the data in Table 1. For example, the file name is txt.log. Fields are separated by space, and the line feed characters are used as the line breaks.
    2. Run the following command to upload the file to HDFS.
      hdfs dfs -put txt.log /tmp
  2. Run the following command to create a path for storing external table data.
    hdfs dfs -mkdir /hive/
    
    hdfs dfs -mkdir /hive/user_info
  3. Run the following command to create a table.
    create external table user_info(id string,name string,gender string,age int,addr string) partitioned by(year string) row format delimited fields terminated by ' ' lines terminated by '\n' stored as textfile location '/hive/user_info';
    fields terminated: indicates delimiters, for example, spaces.
    
    lines terminated: indicates line breaks, for example, \n.
    
    /hive/user_info: indicates the path of the data file.
  4. Import data.
    - Execute the insert statement to insert data.
      insert into user_info partition(year="2018") values ("12005000201","A","Male",19,"City A");
    - Run the load data command to import file data.
      load data inpath '/tmp/txt.log' into table user_info partition (year='2011');
      
      In the preceding command, /tmp/txt.log indicates the data file uploaded to the HDFS in 5.a.
  5. Query the imported data.
    select * from user_info;
  6. Delete the user information table.
    drop table user_info;
  7. Run the following command to exit:
    !q