Updated on 2022-12-09 GMT+08:00

Importing Data in Batches

Scenario

Import data in batches to HBase in custom mode by running commands.

You can define multiple import methods in configuration.xml. Indexes do not need to be created during the import.

  • The column name consists of letters, digits, and underscores (_) and cannot contain any special characters.
  • If the MapReduce job fails, rectify the fault by following the instructions in "What Do I Do If a Physical Memory Overflow Occurs on ApplicationMaster?".
  • BulkLoad supports text files with separators as data sources.
  • The client must already be installed. The examples in this section use /opt/hadoopclient as the installation directory. Replace it with the actual installation directory.
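The column-name rule above can be checked before the names are written into configuration.xml. The following is a minimal sketch, not part of the client: the function name is_valid_column_name is hypothetical, and the sample names are taken from the configuration example in this section plus one deliberately invalid name.

```shell
#!/bin/sh
# Use the C locale so that the ranges A-Z and a-z match ASCII letters only.
LC_ALL=C
export LC_ALL

# Hypothetical helper: return 0 if the name consists only of
# letters, digits, and underscores; return 1 otherwise (including empty names).
is_valid_column_name() {
    case "$1" in
        ''|*[!A-Za-z0-9_]*) return 1 ;;
        *) return 0 ;;
    esac
}

# Check a few candidate names; the last one contains a hyphen and is rejected.
for name in SMS_ID SMS_NAME 'SMS-ADDRESS'; do
    if is_valid_column_name "$name"; then
        echo "$name: ok"
    else
        echo "$name: invalid"
    fi
done
```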

Procedure

  1. Log in to the node where the client is installed as the client installation user.
  2. Run the following command to go to the client directory:

    cd /opt/hadoopclient

  3. Run the following command to configure environment variables:

    source bigdata_env

  4. If Kerberos authentication is enabled for the current cluster, run the following command to authenticate the current user. The current user must have the permissions to create HBase tables and operate HDFS.

    kinit <component service user>

    If Kerberos authentication is not enabled for the current cluster, run the following command to set the Hadoop username:

    export HADOOP_USER_NAME=hbase

  5. Run the following commands to upload the data file to HDFS:

    hdfs dfs -mkdir <inputdir>

    hdfs dfs -put <local_data_file> <inputdir>

    For example, define data file data.txt as follows:

    001,Hadoop,citya
    002,HBaseFS,cityb
    003,HBase,cityc
    004,Hive,cityd
    005,Streaming,citye
    006,MapReduce,cityf
    007,Kerberos,cityg
    008,LdapServer,cityh

    Then run the following commands:

    hdfs dfs -mkdir /datadirImport

    hdfs dfs -put data.txt /datadirImport
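Because each row of the data file is split by the separator, a row with a missing or extra field is easy to overlook. As a minimal sketch, the hypothetical helper check_field_count below reports such rows locally before the upload. It assumes a single-character separator and writes a small temporary sample file for illustration only.

```shell
#!/bin/sh
# Hypothetical helper: print every line whose field count differs from the
# expected number and return 1 if at least one such line exists.
# $1: data file, $2: single-character separator, $3: expected number of fields
check_field_count() {
    awk -F"$2" -v n="$3" '
        NF != n { printf "line %d: %d fields\n", NR, NF; bad = 1 }
        END { exit bad + 0 }
    ' "$1"
}

# Illustration with a temporary file: the third row lacks its last field.
tmp=$(mktemp)
printf '%s\n' '001,Hadoop,citya' '002,HBaseFS,cityb' '003,HBase' > "$tmp"
check_field_count "$tmp" ',' 3 || echo "fix the rows listed above before uploading"
rm -f "$tmp"
```

For the sample file in this step, the call would be check_field_count data.txt ',' 3.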

  6. In HBase shell, create the table ImportTable. Then create the file configuration.xml, using the template file /opt/hadoopclient/HBase/hbase/conf/import.xml.template as a reference.

    For example, run the following command to create the table:

    create 'ImportTable', {NAME => 'f1',COMPRESSION => 'SNAPPY', DATA_BLOCK_ENCODING => 'FAST_DIFF'},{NAME=>'f2'}

    For example, customize configuration.xml based on the template file as follows:

    • The value of column_num must match the number of columns in the data file.
    • The specified family must correspond to a column family of the table.

    <?xml version="1.0" encoding="UTF-8"?> 
      
     <configuration> 
             <import id="first" column_num="3"> 
                     <columns> 
                             <column index="1" type="int">SMS_ID</column> 
                             <column index="2" type="string">SMS_NAME</column> 
                             <column index="3" type="string">SMS_ADDRESS</column> 
                     </columns> 
      
                     <rowkey> 
                             SMS_ID+'_'+substring(SMS_NAME,1,4)+'_'+reverse(SMS_ADDRESS) 
                     </rowkey> 
      
                     <qualifiers> 
                             <normal family="f1"> 
                                     <qualifier column="SMS_ID">H_ID</qualifier> 
                                     <qualifier column="SMS_NAME">H_NAME</qualifier> 
                                     <qualifier column="SMS_ADDRESS">H_ADDRESS</qualifier> 
                             </normal> 
      
                             <!-- Define composite columns --> 
                             <composite family="f2"> 
                                     <qualifier class="com.huawei.H_COMBINE_1">H_COMBINE_1</qualifier> 
                                     <columns> 
                                             <column>SMS_ADDRESS</column> 
                                             <column>SMS_NAME</column> 
                                     </columns> 
                             </composite> 
      
                     </qualifiers> 
                     <badlines>SMS_ID &lt; 7000 &amp;&amp; SMS_NAME == 'HBase'</badlines>
             </import> 
     </configuration>     
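To get a rough idea of which rows the badlines condition in this sample would flag, the condition can be approximated locally. This is only a sketch under assumptions: it supposes that the condition is evaluated once per row and that SMS_ID is compared as a number, as its type="int" suggests. The import tool performs the real evaluation, and the helper name preview_bad_lines is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print the rows of a comma-separated file for which
# the first field is numerically below 7000 and the second field is "HBase".
# This mirrors the sample condition: SMS_ID < 7000 && SMS_NAME == 'HBase'
preview_bad_lines() {
    awk -F',' '($1 + 0) < 7000 && $2 == "HBase"' "$1"
}

# Illustration with the first rows of the sample data file.
tmp=$(mktemp)
printf '%s\n' '001,Hadoop,citya' '002,HBaseFS,cityb' \
              '003,HBase,cityc' '004,Hive,cityd' > "$tmp"
preview_bad_lines "$tmp"
rm -f "$tmp"
```

Under these assumptions, only the row 003,HBase,cityc matches the condition.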

  7. Run the following command to generate HFiles:

    hbase com.huawei.hadoop.hbase.tools.bulkload.ImportData -Dimport.skip.bad.lines=true -Dimport.separator=<separator> -Dimport.bad.lines.output=</path/badlines/output> -Dimport.hfile.output=</path/for/output> <configuration xmlfile> <tablename> <inputdir>

    • -Dimport.skip.bad.lines: specifies whether to skip inapplicable rows. If this parameter is set to false, the command execution stops when an inapplicable row occurs. If it is set to true, the inapplicable row is skipped and the command execution continues. If no inapplicable row is defined in configuration.xml, this parameter does not need to be added.
    • -Dimport.separator: indicates the separator used in the data file, for example, -Dimport.separator=','.
    • -Dimport.bad.lines.output=</path/badlines/output>: indicates the output path of inapplicable rows. If no inapplicable row is defined in configuration.xml, this parameter does not need to be added.
    • -Dimport.hfile.output=</path/for/output>: indicates the output path of the generated HFiles.
    • <configuration xmlfile>: indicates the path of the configuration file.
    • <tablename>: indicates the name of the table to import data to.
    • <inputdir>: indicates the HDFS directory that contains the data files to import.

    For example, run the following command:

    hbase com.huawei.hadoop.hbase.tools.bulkload.ImportData -Dimport.skip.bad.lines=true -Dimport.separator=',' -Dimport.bad.lines.output=/badline -Dimport.hfile.output=/hfile configuration.xml ImportTable /datadirImport

    • After transparent encryption is configured for HBase, the HFile path specified by -Dimport.hfile.output must be a subdirectory of /HBase root directory/extdata, for example, /hbase/extdata/bulkloadTmp/hfile.
    • After transparent encryption is configured for HBase, the HBase user who runs the bulkload command must be added to the hadoop user group of the corresponding cluster and must have the read permission on the encryption key of the HBase root directory. For a cluster that is not the first one installed on FusionInsight Manager, the user group is c<Cluster ID>_hadoop, for example, c2_hadoop.
    • Check the permission on the /tmp/hbase directory and manually grant the write permission on the directory to the current user.
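The path requirement for transparent encryption can be checked before the job is submitted. The following is a minimal sketch with a hypothetical helper, is_under_extdata. It compares path strings only, does not resolve components such as "..", and assumes /hbase as the HBase root directory when none is given, as in the example path above.

```shell
#!/bin/sh
# Hypothetical helper: return 0 if the HFile output path is a subdirectory
# of <HBase root directory>/extdata; return 1 otherwise.
# $1: HFile output path, $2: HBase root directory (assumed /hbase if omitted)
is_under_extdata() {
    root=${2:-/hbase}
    case "$1" in
        "${root%/}/extdata/"?*) return 0 ;;
        *) return 1 ;;
    esac
}

# The first path is taken from the note above; the second is the path used
# in the example command, which is valid only without transparent encryption.
for p in /hbase/extdata/bulkloadTmp/hfile /hfile; do
    if is_under_extdata "$p"; then
        echo "$p: under extdata"
    else
        echo "$p: not under extdata"
    fi
done
```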

  8. Run the following command to import the HFiles to HBase:

    hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles </path/for/output> <tablename>

    For example, run the following command:

    hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles /hfile ImportTable
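Steps 7 and 8 can be chained in a small script so that the load runs only if HFile generation succeeds. The following is a sketch, not a definitive implementation: the helper names run and bulk_import and the DRY_RUN variable are hypothetical, the bad-line output path /badline is taken from the example above, and the script assumes that the ImportData command returns a non-zero exit status on failure. DRY_RUN defaults to 1, so the commands are only printed.

```shell
#!/bin/sh
# With DRY_RUN=1 (the default here), commands are printed instead of executed.
DRY_RUN=${DRY_RUN:-1}

# Hypothetical helper: print or execute the given command line.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Hypothetical helper: generate HFiles, then load them into the table.
# $1: configuration file, $2: table name, $3: input directory, $4: HFile output path
bulk_import() {
    run hbase com.huawei.hadoop.hbase.tools.bulkload.ImportData \
        -Dimport.skip.bad.lines=true \
        -Dimport.separator=',' \
        -Dimport.bad.lines.output=/badline \
        -Dimport.hfile.output="$4" \
        "$1" "$2" "$3" || return 1
    run hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles "$4" "$2"
}

bulk_import configuration.xml ImportTable /datadirImport /hfile
```

To execute the commands instead of printing them, run the script with DRY_RUN=0 on a node where the client environment has been configured as described in steps 1 to 4.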