
Preparing Initial Data

Scenario

Before commissioning the program, you need to prepare the data to be processed.

Planning MapReduce Statistics Sample Program Data

Store the log files to be processed in the HDFS system.

  1. Create text files in the Linux system and copy the data to be processed into them. For example, copy the content of log1.txt in Typical Scenarios and save it as input_data1.txt, and copy the content of log2.txt and save it as input_data2.txt.
  2. Create the /tmp/input directory in HDFS, and upload input_data1.txt and input_data2.txt to it.

    1. Run the following commands to go to the HDFS client directory and authenticate the user:

      cd HDFS client installation directory

      source bigdata_env

      kinit Component service user (The user must have permission to perform HDFS operations. Change the password upon the first authentication.)
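
      For example, if the HDFS client is installed in /opt/client and the service user is developuser (both values are examples; replace them with your actual client installation path and service user), the commands would be:

      cd /opt/client

      source bigdata_env

      kinit developuser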

    2. Run the following command to create the /tmp/input directory:

      hdfs dfs -mkdir /tmp/input

    3. Run the following commands to upload the prepared files to the /tmp/input directory in HDFS:

      hdfs dfs -put local_filepath/input_data1.txt /tmp/input

      hdfs dfs -put local_filepath/input_data2.txt /tmp/input
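
      You can optionally verify that both files were uploaded by listing the directory:

      hdfs dfs -ls /tmp/input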

Planning MapReduce Accessing Multi-Component Sample Program Data

  1. Create an HDFS data file.

    1. Create a text file in the Linux system and copy the data to be processed into it. For example, copy the content of log1.txt in Instance and save it as data.txt.
    2. Run the following commands to go to the HDFS client directory and authenticate the user:

      cd HDFS client installation directory

      source bigdata_env

      kinit Component service user (The user must have permission to perform HDFS operations. Change the password upon the first authentication.)

    3. Create the /tmp/examples/multi-components/mapreduce/input/ directory in HDFS, and upload the data.txt file to it as follows:
      1. On the HDFS client, run the following command to create a directory:

        hdfs dfs -mkdir -p /tmp/examples/multi-components/mapreduce/input/

      2. Run the following command to upload the file to HDFS:

        hdfs dfs -put local_filepath/data.txt /tmp/examples/multi-components/mapreduce/input/
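
        You can optionally confirm the upload by listing the directory:

        hdfs dfs -ls /tmp/examples/multi-components/mapreduce/input/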

  2. Create an HBase table and insert data.

    1. Run the following commands to log in to the HBase client:

      cd HBase client installation directory

      source bigdata_env

      kinit Component service user

      hbase shell

    2. Run the following command in the HBase shell interaction window to create a data table named table1 with a column family cf:

      create 'table1', 'cf'

    3. Run the following command to insert a data record whose rowkey is 1, column name is cid, and data value is 123:

      put 'table1', '1', 'cf:cid', '123'
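
      To optionally verify the inserted record, you can scan the table in the same HBase shell session:

      scan 'table1'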

    4. Run the following command to exit the HBase client:

      quit

  3. Create a Hive table and insert data.

    1. Run the following commands to log in to the Hive client:

      cd Hive client installation directory

      source bigdata_env

      kinit Component service user

      beeline

    2. Run the following command to create the person data table in the Hive beeline interaction window. The table contains three fields: name, gender, and stayTime.

      CREATE TABLE person(name STRING, gender STRING, stayTime INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;
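
      You can optionally check the table definition in the same beeline session:

      DESCRIBE person;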

    3. Run the following command in the Hive beeline interaction window to load the data file into the table:

      LOAD DATA INPATH '/tmp/examples/multi-components/mapreduce/input/' OVERWRITE INTO TABLE person;
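
      To optionally confirm that the data was loaded, you can query the table before exiting:

      SELECT * FROM person;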

    4. Run the !q command to exit.

  4. Loading data into Hive moves the data file out of the HDFS source directory, so the directory is cleared. Therefore, you need to perform step 1 again.