Updated on 2022-07-11 GMT+08:00

Example

Scenario

The sample project illustrates how to develop MapReduce jobs that access multiple service components, including HDFS, HBase, and Hive, helping users understand key operations such as authentication and configuration loading.

The logic of the sample project is as follows:

The input is an HDFS text file named log1.txt.

YuanJing,male,10
GuoYijun,male,5
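Each line of log1.txt is a comma-separated record of user name, gender, and stay time. A minimal parse of one such record might look like the following (the class and field names are illustrative, not part of the sample project):

```java
// Minimal sketch: parse one log1.txt line of the form name,gender,stayTime.
// The class name and fields are illustrative, not part of the sample project.
public class InputRecord {
    final String name;
    final String gender;
    final int stayTime;

    InputRecord(String line) {
        String[] fields = line.split(",");
        name = fields[0];
        gender = fields[1];
        stayTime = Integer.parseInt(fields[2]);
    }
}
```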

Map:

  1. Read one line of the input data and extract the user name.
  2. Query one record from HBase.
  3. Query one record from Hive.
  4. Combine the record queried from HBase with the record queried from Hive as the map output.

Reduce:

  1. Obtain the last record of the map output.
  2. Write the record to HBase.
  3. Save the record to HDFS.
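The map and reduce logic above can be sketched in plain Java, using in-memory maps as hypothetical stand-ins for the HBase table1 (cf:cid) and the Hive person table. This is a simulation of the join logic only, under those assumptions, not the actual Hadoop job:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class MultiComponentSketch {
    // Hypothetical stand-ins for HBase table1 (rowkey -> cf:cid value)
    // and the Hive person table (name -> stayTime); not real client calls.
    static final Map<String, String> HBASE_CID = Map.of("1", "123");
    static final Map<String, Integer> HIVE_STAY_TIME =
            Map.of("YuanJing", 10, "GuoYijun", 5);

    // Map phase: for each input line, extract the user name, look up
    // HBase and Hive, and emit the combined record.
    static List<String> map(List<String> lines) {
        List<String> out = new ArrayList<>();
        for (String line : lines) {
            String name = line.split(",")[0];
            String cid = HBASE_CID.get("1");          // HBase lookup (single row)
            Integer stay = HIVE_STAY_TIME.get(name);  // Hive lookup by name
            out.add(name + "," + cid + "," + stay);
        }
        return out;
    }

    // Reduce phase: keep only the last record of the map output.
    static String reduce(List<String> mapped) {
        return mapped.get(mapped.size() - 1);
    }

    public static void main(String[] args) {
        List<String> input = List.of("YuanJing,male,10", "GuoYijun,male,5");
        System.out.println(reduce(map(input)));
    }
}
```

In the real sample project the lookups are HBase and Hive client calls inside the mapper, and the reduce output is written back to HBase and HDFS.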

Data Planning

  1. Create an HDFS data file.
    1. Create a text file named data.txt on the node where the Linux HDFS client is installed, and copy the content of log1.txt into data.txt.
    2. Run the following commands to create the directory /tmp/examples/multi-components/mapreduce/input/ and upload data.txt to it:
      1. hdfs dfs -mkdir -p /tmp/examples/multi-components/mapreduce/input/
      2. hdfs dfs -put data.txt /tmp/examples/multi-components/mapreduce/input/
  2. Create an HBase table and insert data into it.
    1. Run the source bigdata_env command on the Linux HBase client, and then run the hbase shell command.
    2. Run the create 'table1', 'cf' command in the HBase shell to create table1 with column family cf.
    3. Run the put 'table1', '1', 'cf:cid', '123' command to insert a record whose rowkey is 1, column is cf:cid, and value is 123.
    4. Run the quit command to exit the HBase shell.
  3. Create a Hive table and load data to it.
    1. Run the beeline command on the Linux Hive client.
    2. In the Hive beeline interaction window, run the CREATE TABLE person(name STRING, gender STRING, stayTime INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' stored as textfile; command to create the table person with three fields (name, gender, stayTime).
    3. In the Hive beeline interaction window, run the LOAD DATA INPATH '/tmp/examples/multi-components/mapreduce/input/' OVERWRITE INTO TABLE person; command to load the data file into the person table.
    4. Run the !q command to exit beeline.
  4. Loading data to the Hive table in the preceding step moves the data file out of the HDFS input directory, leaving it empty. Therefore, perform step 1 again.