Using Hive for Data Analysis
Hive is a data warehouse framework built on Hadoop. It maps structured data files to database tables and provides SQL-like statements to analyze and process the data. Hive lets you quickly perform simple MapReduce statistics using SQL-like statements, without developing a dedicated MapReduce application, which makes it well suited to the statistical analysis of data warehouses.
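For example, a statistic that would otherwise require a custom MapReduce job can be expressed in a single SQL-like statement. The following is a minimal sketch, assuming the user_info table created later in this section:
select gender, count(*) as cnt from user_info group by gender;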
Background
Suppose a user develops an application to manage users of service A in an enterprise. The procedure for operating service A on the Hive client is as follows:
Operations on internal tables:
- Create the user_info table.
- Add users' educational backgrounds and professional titles to the table.
- Query user names and addresses by user ID.
- Delete the user information table after service A ends.
Table 1 User information

| ID | Name | Gender | Age | Address |
|---|---|---|---|---|
| 12005000201 | A | Male | 19 | City A |
| 12005000202 | B | Female | 23 | City B |
| 12005000203 | C | Male | 26 | City C |
| 12005000204 | D | Male | 18 | City D |
| 12005000205 | E | Female | 21 | City E |
| 12005000206 | F | Male | 32 | City F |
| 12005000207 | G | Female | 29 | City G |
| 12005000208 | H | Female | 30 | City H |
| 12005000209 | I | Male | 26 | City I |
| 12005000210 | J | Female | 25 | City J |
Procedure
- Log in to the node where the client is installed as the client installation user. For details about how to install the client, see Installing a Client (MRS 3.x or Later).
- Run the following command to switch to the client directory, for example, /opt/client.
cd /opt/client
- Run the following command to configure environment variables:
source bigdata_env
- If Kerberos authentication is enabled for the current cluster, run the following command to authenticate the current user. If Kerberos authentication is disabled, skip this step.
kinit MRS cluster user
Example: kinit hiveuser
The current user must have the permission to create Hive tables. To create a role with this permission, see Creating a Role. To bind the role to the current user, see Creating a User.
- Run the Hive client command to implement service A.
- Operations on internal tables
- Run the following command to log in to the Hive client CLI:
beeline
- Create the user_info user information table according to Table 1 and add data to it.
create table user_info(id string,name string,gender string,age int,addr string);
insert into table user_info(id,name,gender,age,addr) values("12005000201","A","Male",19,"City A");
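The remaining rows of Table 1 can be added in the same way. As a sketch, Hive (0.14 and later) also accepts multiple rows in a single insert statement:
insert into table user_info values ("12005000202","B","Female",23,"City B"),("12005000203","C","Male",26,"City C");
Repeat for the other rows of Table 1.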
- Add users' educational backgrounds and professional titles to the user_info table.
For example, to add educational background and title information about user 12005000201, run the following command:
alter table user_info add columns(education string,technical string);
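To confirm that the two columns were added, the table structure can be checked, for example:
desc user_info;
The output should now list education and technical after the original five columns.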
- Query user names and addresses by user ID.
For example, to query the name and address of user 12005000201, run the following command:
select name,addr from user_info where id='12005000201';
- Delete the user information table.
drop table user_info;
- Run the following command to exit the Hive client:
!q
- Operations on external partition tables
- You can insert data into an external table with the insert statement, or import file data from HDFS with the load data command. To import file data with load data, perform the following operations:
- Create a data file based on the data in Table 1, for example, a file named txt.log. Separate fields with spaces and end each line with a line feed (\n).
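Assuming space-separated fields, the first lines of txt.log might look like the following sketch. Note that address values such as City A themselves contain a space; with a space delimiter they would be split into two fields, so write them without the space (or choose a delimiter, such as a comma, that does not appear in the data):
12005000201 A Male 19 CityA
12005000202 B Female 23 CityB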
- Run the following command to upload the file to HDFS, for example, to the /tmp directory:
hdfs dfs -put txt.log /tmp
- Run the following command to create a path for storing external table data.
hdfs dfs -mkdir /hive/user_info
- Run the following command to create a table.
create external table user_info(id string,name string,gender string,age int,addr string) partitioned by(year string) row format delimited fields terminated by ' ' lines terminated by '\n' stored as textfile location '/hive/user_info';
- fields terminated by: specifies the field delimiter, for example, a space.
- lines terminated by: specifies the line break character, for example, \n.
- /hive/user_info: specifies the HDFS path where the table data files are stored.
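Alternatively, if data files are already placed under the table location, a partition can be registered explicitly. A sketch, assuming the directory /hive/user_info/year=2018 already exists:
alter table user_info add partition(year='2018') location '/hive/user_info/year=2018';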
- Import data.
- Execute the insert statement to insert data.
insert into user_info partition(year="2018") values ("12005000201","A","Male",19,"City A");
- Run the load data command to import file data.
load data inpath '/tmp/txt.log' into table user_info partition (year='2011');
In the preceding command, /tmp/txt.log indicates the data file uploaded to HDFS in 5.a.
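To confirm that both partitions now exist, list them, for example:
show partitions user_info;
The output should include year=2018 and year=2011.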
- Query the imported data.
select * from user_info;
- Delete the user information table.
drop table user_info;
Because user_info is an external table, this statement deletes only the table metadata; the data files remain in /hive/user_info in HDFS.
- Run the following command to exit the Hive client:
!q