Using Hive to Load HDFS Data and Analyze Book Scores
MRS offline processing clusters enable you to analyze and process massive amounts of data and store the results for later use.
Offline processing is not time-sensitive, but it usually handles large volumes of data and therefore occupies a large number of compute and storage resources. Generally, offline processing is implemented through Hive/SparkSQL or MapReduce/Spark2x.
This practice describes how to import and analyze raw data using Hive after you create an MRS cluster and how to implement elastic and low-cost offline big data analysis.
You can get started by reading the following topics:
- Creating an MRS Offline Query Cluster
- Importing Local Data to HDFS
- Creating a Hive Table
- Importing Raw Data to Hive for Analysis
Scenario
Hive is a data warehouse built on Hadoop. It provides batch computing capability for the big data platform and can analyze and summarize structured and semi-structured data in batches. Hive operates on structured data using Hive Query Language (HQL), a SQL-like language. HQL statements are automatically converted into MapReduce tasks that query and analyze massive data in the Hadoop cluster.
Hive is able to:
- Analyze massive structured data and summarize analysis results.
- Allow complex MapReduce jobs to be written in a SQL-like language.
- Support flexible data storage formats, including JavaScript object notation (JSON), comma separated values (CSV), TextFile, RCFile, SequenceFile, and Optimized Row Columnar (ORC).
In this practice, user comments exported from the backend of a book website are used as the raw data. After the data is imported to a Hive table, you can run SQL commands to find the most popular books.
Creating an MRS Offline Query Cluster
- Go to the Buy Cluster page.
- Click the Quick Config tab and set configuration parameters.
Table 1 Software parameters (for reference only)
| Parameter | Value |
|---|---|
| Region | EU-Dublin |
| Billing Mode | Pay-per-use |
| Cluster Name | MRS_demo |
| Version Type | Normal |
| Cluster Version | MRS 3.1.0 |
| Component | Hadoop Analysis Cluster |
| AZ | AZ1 |
| VPC | vpc-01 |
| Subnet | subnet-01 |
| Enterprise Project | default |
| Kerberos Authentication | Disabled |
| Username | root/admin |
| Password | Set the password for logging in to the cluster management page and ECS node, for example, Test!@12345. |
| Confirm Password | Enter the password again. |
| Secure Communications | Select Enable. |
Figure 1 Buying a Hadoop analysis cluster
- Click Buy Now and wait until the MRS cluster is created.
Figure 2 Cluster purchased
Importing Local Data to HDFS
- Obtain the book comments file book_score.txt from the backend of the book website and save it on the local host.
The file contains the following fields: user ID, book ID, book score, and remarks.
Some data is as follows:
202001,242,3,Good!
202002,302,3,Test.
202003,377,1,Bad!
220204,51,2,Bad!
202005,346,1,aaa
202006,474,4,None
202007,265,2,Bad!
202008,465,5,Good!
202009,451,3,Bad!
202010,86,3,Bad!
202011,257,2,Bad!
202012,465,4,Good!
202013,465,4,Good!
202014,465,4,Good!
202015,302,5,Good!
202016,302,3,Good!
...
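Because the Hive table created later in this practice splits each record on commas, a malformed line can load as NULL values or with truncated remarks. The following is a minimal shell sketch, not part of the original procedure, for checking the file on the local host before it is uploaded. The function name check_book_scores is hypothetical, and the check assumes that remarks never contain commas.

```shell
#!/bin/sh
# check_book_scores FILE   (hypothetical helper, not an MRS tool)
# Prints every line that does not look like "userid,bookid,score,remarks"
# and returns a non-zero status if at least one such line is found.
# Assumption: remarks never contain commas, so each record has exactly 4 fields.
check_book_scores() {
    awk -F',' '
        NF != 4 || $1 !~ /^[0-9]+$/ || $2 !~ /^[0-9]+$/ || $3 !~ /^[0-9]+$/ {
            printf "line %d: %s\n", NR, $0
            bad = 1
        }
        END { exit bad + 0 }
    ' "$1"
}

# Example:
#   check_book_scores book_score.txt && echo "format OK"
```

If the function prints nothing and returns 0, every record has four fields and the first three are unsigned integers.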
- Log in to the OBS console, click Create Bucket, set the following parameters, and click Create Now.
Table 2 Bucket parameters
| Parameter | Value |
|---|---|
| Region | EU-Dublin |
| Bucket Name | mrs-hive |
| Default Storage Class | Standard |
| Bucket Policy | Private |
| Direct Reading | Disable |
| Enterprise Project | default |
| Tags | - |
After the bucket is created, click the bucket name. In the navigation pane on the left, choose Objects and click Upload Object to upload the data file.
Figure 3 Uploading an object
- Switch back to the MRS console and click the name of the created MRS cluster. On the Dashboard page, click Synchronize next to IAM User Sync. The synchronization takes about five minutes.
Figure 4 Synchronizing IAM users
- Upload the data file to the HDFS.
- On the Files page, click the HDFS File List and go to the data storage directory, for example, /tmp/test.
The /tmp/test directory is only an example. You can use any directory on the page or create a new one.
- Click Import Data.
- OBS Path: Select the name of the created OBS bucket, locate the book_score.txt file, select the check box "I confirm that the selected script is secure, and I understand the potential risks and accept the possible exceptions or impacts on the cluster", and click OK.
- HDFS Path: Select the /tmp/test directory and click OK.
Figure 5 Importing data from OBS to HDFS
- Click OK. When the import is complete, the data file is stored in HDFS of the MRS cluster.
Figure 6 Data imported
Creating a Hive Table
- Download the cluster client, and install it, for example, in the /opt/client directory of the active master node.
You can also use the cluster client provided in the /opt/Bigdata/client directory of the master node.
- Bind an EIP to the active master node and enable port 22 in the security group. Then, log in to the active master node as user root, go to the directory where the client is located, and load variables.
cd /opt/client
source bigdata_env
- Run the beeline -n 'hdfs' command to open the Hive Beeline command line.
Run the following command to create a Hive table whose fields match the raw data fields:
create table bookscore (userid int,bookid int,score int,remarks string) row format delimited fields terminated by ',' stored as textfile;
- Run the following command to check whether the table is successfully created:
show tables;
+------------+
| tab_name   |
+------------+
| bookscore  |
+------------+
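The table definition can also be kept in a script file so that it can be replayed without typing it interactively. The following is a minimal sketch under stated assumptions: the function name write_bookscore_ddl and the file name create_bookscore.hql are hypothetical, "if not exists" is an addition that makes the script safe to rerun, and the cluster client is assumed to accept the standard Beeline -f option for running a script file.

```shell
#!/bin/sh
# write_bookscore_ddl FILE   (hypothetical helper)
# Writes the table definition and a verification statement to FILE.
# "if not exists" is not in the original statement; it only prevents
# an error when the script is run a second time.
write_bookscore_ddl() {
    cat > "$1" <<'EOF'
create table if not exists bookscore (
  userid  int,
  bookid  int,
  score   int,
  remarks string
)
row format delimited fields terminated by ','
stored as textfile;

show tables;
EOF
}

# Example (on the active master node, after "source bigdata_env"):
#   write_bookscore_ddl create_bookscore.hql
#   beeline -n 'hdfs' -f create_bookscore.hql
```

Keeping the statement in a file makes it easier to review and to recreate the table on another cluster.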
Importing Raw Data to Hive for Analysis
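- Run the following command in Hive Beeline to load the raw data from HDFS into the Hive table:
load data inpath '/tmp/test/book_score.txt' into table bookscore;
This statement moves book_score.txt from /tmp/test into the storage directory of the table rather than copying it, so the file no longer appears in /tmp/test afterward.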
- After the data is imported, run the following command to view the content of the Hive table:
select * from bookscore;
+-------------------+-------------------+------------------+--------------------+
| bookscore.userid  | bookscore.bookid  | bookscore.score  | bookscore.remarks  |
+-------------------+-------------------+------------------+--------------------+
| 202001            | 242               | 3                | Good!              |
| 202002            | 302               | 3                | Test.              |
| 202003            | 377               | 1                | Bad!               |
| 220204            | 51                | 2                | Bad!               |
| 202005            | 346               | 1                | aaa                |
| 202006            | 474               | 4                | None               |
| 202007            | 265               | 2                | Bad!               |
| 202008            | 465               | 5                | Good!              |
| 202009            | 451               | 3                | Bad!               |
| 202010            | 86                | 3                | Bad!               |
| 202011            | 257               | 2                | Bad!               |
| 202012            | 465               | 4                | Good!              |
| 202013            | 465               | 4                | Good!              |
| 202014            | 465               | 4                | Good!              |
| 202015            | 302               | 5                | Good!              |
| 202016            | 302               | 3                | Good!              |
...
Run the following command to count the number of rows in the table:
select count(*) from bookscore;
+------+
| _c0  |
+------+
| 32   |
+------+
- Run the following command to find the three books with the highest total scores in the raw data. The result is displayed after the MapReduce tasks are complete.
select bookid,sum(score) as summarize from bookscore group by bookid order by summarize desc limit 3;
Finally, the following information is displayed:
...
INFO : 2021-10-14 19:53:42,427 Stage-2 map = 0%, reduce = 0%
INFO : 2021-10-14 19:53:49,572 Stage-2 map = 100%, reduce = 0%, Cumulative CPU 2.15 sec
INFO : 2021-10-14 19:53:56,713 Stage-2 map = 100%, reduce = 100%, Cumulative CPU 4.19 sec
INFO : MapReduce Total cumulative CPU time: 4 seconds 190 msec
INFO : Ended Job = job_1634197207682_0025
INFO : MapReduce Jobs Launched:
INFO : Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 4.24 sec HDFS Read: 7872 HDFS Write: 322 SUCCESS
INFO : Stage-Stage-2: Map: 1 Reduce: 1 Cumulative CPU: 4.19 sec HDFS Read: 5965 HDFS Write: 143 SUCCESS
INFO : Total MapReduce CPU Time Spent: 8 seconds 430 msec
INFO : Completed executing command(queryId=omm_20211014195310_cf669633-5b58-4bd5-9837-73286ea83409); Time taken: 47.388 seconds
INFO : OK
INFO : Concurrency mode is disabled, not creating a lock manager
+---------+------------+
| bookid  | summarize  |
+---------+------------+
| 465     | 170        |
| 302     | 110        |
| 474     | 88         |
+---------+------------+
3 rows selected (47.469 seconds)
The books with IDs 465, 302, and 474 have the three highest total scores.
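To cross-check the Hive result against the raw file, the same aggregation can be approximated on the local host. The following is a minimal shell sketch, not part of the original procedure. The function name top_books is hypothetical, and ties are broken by ascending book ID, an ordering that the Hive query itself does not guarantee.

```shell
#!/bin/sh
# top_books FILE [N]   (hypothetical helper)
# Sums the score column (field 3) per book ID (field 2) and prints the
# N highest totals as "bookid,total" lines. N defaults to 3.
# This mirrors: group by bookid order by summarize desc limit 3
top_books() {
    awk -F',' '
        { total[$2] += $3 }
        END { for (id in total) print id "," total[id] }
    ' "$1" |
        sort -t',' -k2,2nr -k1,1n |
        head -n "${2:-3}"
}

# Example:
#   top_books book_score.txt
```

Run against only the 16 sample rows listed in this practice, the function prints 465,17, then 302,11, then 474,4. The totals differ from the Hive output because the sample is only part of the file, but the ranking of the three books is the same.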