Updated on 2024-10-08 GMT+08:00

Importing HBase Data in Batches Using BulkLoad

Scenario

You can run commands to import data into HBase in batches and, if needed, create indexes during the import.

You can define multiple import methods in configuration.xml. Creating indexes during data import is optional.

  • The column name consists of letters, digits, and underscores (_) and cannot contain any special characters.
  • If a MapReduce task fails, rectify the fault by referring to "Why Physical Memory Overflow Occurs If a MapReduce Task Fails?"
  • The data sources supported by BulkLoad are text files with separators.
  • You have installed the client. For example, the installation directory is /opt/client. The client directory in the following operations is only an example. Change it to the actual installation directory.
  • If you want to create a secondary index when importing data in batches, pay attention to the following:
    • If the column type is set to string, the string length cannot be set. For example, <column index="1" type="string" length="1" >COLOUMN_1</column> is not supported.
    • If the column type is set to date, the date format cannot be set. For example, <column index="13" type="date" format="yyyy-MM-dd hh:mm:ss">COLOUMN_13</column> is not supported.
    • Secondary indexes cannot be created for composite (combined) columns.
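The column naming rule above can be checked before the configuration file is written. The following is a minimal sketch, not part of the product tooling: is_valid_column_name is a hypothetical helper that accepts a name only if it consists of letters, digits, and underscores.

```shell
# Hypothetical helper (not shipped with the client): succeeds only if the
# name is non-empty and contains nothing but letters, digits, and underscores.
is_valid_column_name() {
  case "$1" in
    ''|*[!A-Za-z0-9_]*) return 1 ;;
    *) return 0 ;;
  esac
}

# Check a few candidate names; the last two break the rule.
for name in SMS_ID SMS-NAME 'SMS ADDRESS'; do
  if is_valid_column_name "$name"; then
    echo "ok: $name"
  else
    echo "invalid: $name"
  fi
done
```

Only SMS_ID passes. The hyphen and the space in the other two names are rejected.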

Importing HBase Data in Batches Using BulkLoad

  1. Log in to the node where the client is installed as the client installation user.
  2. Run the following command to go to the client directory:

    cd /opt/client

  3. Run the following command to configure environment variables:

    source bigdata_env

  4. Run the following command to authenticate the current user if Kerberos authentication is enabled for the current cluster. The current user must have the permissions to create HBase tables and operate HDFS.

    kinit <component service user>

    Run the following command to set the Hadoop username if Kerberos authentication is not enabled for the current cluster:

    export HADOOP_USER_NAME=hbase
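The two authentication cases in this step can be wrapped in one function. This is only a sketch: KERBEROS_ENABLED and KERBEROS_USER are hypothetical variables introduced for illustration, and the function must run in the current shell (not a subshell) so that the exported variable remains visible to later commands.

```shell
# Hypothetical variables for this sketch:
#   KERBEROS_ENABLED  "true" if Kerberos authentication is enabled for the cluster
#   KERBEROS_USER     the component service user passed to kinit
setup_hbase_auth() {
  if [ "${KERBEROS_ENABLED:-false}" = "true" ]; then
    # Stops with an error message if KERBEROS_USER is unset or empty.
    kinit "${KERBEROS_USER:?set KERBEROS_USER to the component service user}"
  else
    export HADOOP_USER_NAME=hbase
  fi
}

setup_hbase_auth
```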

  5. Run the following commands to upload the data to HDFS:

    hdfs dfs -mkdir <inputdir>

    hdfs dfs -put <local_data_file> <inputdir>

    For example, define data file data.txt as follows:

    001,Hadoop,citya
    002,HBaseFS,cityb
    003,HBase,cityc
    004,Hive,cityd
    005,Streaming,citye
    006,MapReduce,cityf
    007,Kerberos,cityg
    008,LdapServer,cityh

    Run the following commands:

    hdfs dfs -mkdir /datadirImport

    hdfs dfs -put data.txt /datadirImport
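The two upload commands can be combined into a small function with a dry-run mode, which is convenient for reviewing the commands before they touch HDFS. This is a sketch under assumptions: stage_input and DRY_RUN are hypothetical names, and the -p option of hdfs dfs -mkdir is added so that the call does not fail if the directory already exists.

```shell
# stage_input LOCAL_FILE HDFS_DIR
# Hypothetical helper. If DRY_RUN is set to any non-empty value, the
# commands are printed instead of executed.
stage_input() {
  local_file=$1
  hdfs_dir=$2
  run=${DRY_RUN:+echo}
  $run hdfs dfs -mkdir -p "$hdfs_dir" &&
  $run hdfs dfs -put "$local_file" "$hdfs_dir"
}

# Print the commands for the example data file without running them.
DRY_RUN=1 stage_input data.txt /datadirImport
```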

  6. Go to the HBase shell and create the table ImportTable. Then create the configuration.xml file. You can edit this file by referring to the template file /opt/client/HBase/hbase/conf/import.xml.template.

    For example, run the following command to create the table:

    create 'ImportTable', {NAME => 'f1',COMPRESSION => 'SNAPPY', DATA_BLOCK_ENCODING => 'FAST_DIFF'},{NAME=>'f2'}

    For example, customize the configuration.xml file based on the template. Note the following:

    • The value of column_num must be consistent with the number of columns in the data file.
    • The specified family must correspond to the column family of the table.
    • The following parameters need to be set only when a secondary index is created during batch data import. The first letter of the index type must be capitalized, for example, type="String". In the following segment, length="30" indicates that the value of the index column H_ID cannot exceed 30 characters.

      <indices>
          <index name="IDX1">
              <index_column family="f1">
                  <qualifier type="String" length="30">H_ID</qualifier>
              </index_column>
          </index>
      </indices>

    The complete example file is as follows:

    <?xml version="1.0" encoding="UTF-8"?> 
      
     <configuration> 
             <import id="first" column_num="3"> 
                     <columns> 
                             <column index="1" type="int">SMS_ID</column> 
                             <column index="2" type="string">SMS_NAME</column> 
                             <column index="3" type="string">SMS_ADDRESS</column> 
                     </columns> 
      
                     <rowkey> 
                             SMS_ID+'_'+substring(SMS_NAME,1,4)+'_'+reverse(SMS_ADDRESS) 
                     </rowkey> 
      
                     <qualifiers> 
                             <normal family="f1"> 
                                     <qualifier column="SMS_ID">H_ID</qualifier> 
                                     <qualifier column="SMS_NAME">H_NAME</qualifier> 
                                     <qualifier column="SMS_ADDRESS">H_ADDRESS</qualifier> 
                             </normal> 
      
                             <!-- Define composite columns --> 
                             <composite family="f2"> 
                                     <qualifier class="com.huawei.H_COMBINE_1">H_COMBINE_1</qualifier> 
                                     <columns> 
                                             <column>SMS_ADDRESS</column> 
                                             <column>SMS_NAME</column> 
                                     </columns> 
                             </composite> 
      
                             <indices>
                                     <index name="IDX1">
                                             <index_column family="f1">
                                                     <qualifier type="String" length="30">H_ID</qualifier>
                                             </index_column>
                                     </index>
                             </indices>
      
                     </qualifiers> 
                     <badlines>SMS_ID &lt; 7000 &amp;&amp; SMS_NAME == 'HBase'</badlines>
             </import> 
     </configuration>     
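Because column_num must match the number of columns in the data file, it can help to count mismatching rows locally before uploading. The following sketch assumes a single-character separator. count_bad_rows is a hypothetical helper, not part of the BulkLoad tool.

```shell
# count_bad_rows SEPARATOR EXPECTED_COLUMNS
# Hypothetical helper: reads rows from standard input and prints how many
# of them do not have exactly EXPECTED_COLUMNS fields.
count_bad_rows() {
  awk -F"$1" -v n="$2" 'NF != n { bad++ } END { print bad + 0 }'
}

# The third row has only two fields, so one mismatching row is reported.
count_bad_rows , 3 <<'EOF'
001,Hadoop,citya
002,HBaseFS,cityb
003,HBase
EOF
```

For a real file, redirect it to the function, for example count_bad_rows , 3 < data.txt. A result of 0 means every row has the expected number of columns.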

  7. Run the following command to generate an HFile:

    hbase com.huawei.hadoop.hbase.tools.bulkload.ImportData -Dimport.skip.bad.lines=true -Dimport.separator=<separator> -Dimport.bad.lines.output=</path/badlines/output> -Dimport.hfile.output=</path/for/output> <configuration xmlfile> <tablename> <inputdir>

    • -Dimport.skip.bad.lines: If this parameter is set to false, the command stops when it encounters an inapplicable row (a bad line). If this parameter is set to true, the inapplicable row is skipped and the command continues. If no inapplicable row is defined in configuration.xml, you do not need to add this parameter.
    • -Dimport.separator: specifies the field separator used in the data file, for example, -Dimport.separator=','.
    • -Dimport.bad.lines.output=</path/badlines/output>: specifies the output path for inapplicable rows. If no inapplicable row is defined in configuration.xml, you do not need to add this parameter.
    • -Dimport.hfile.output=</path/for/output>: specifies the output path of the generated HFile.
    • <configuration xmlfile>: points to the configuration file.
    • <tablename>: indicates the name of the table to be operated.
    • <inputdir>: indicates the data directory to be uploaded in batches.
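The parameter rules above can be sketched as a function that assembles the ImportData command line and adds the two bad-line parameters only when a bad-line output path is given. build_import_cmd is a hypothetical helper. It only prints the command for review and does not run it.

```shell
# build_import_cmd SEPARATOR HFILE_OUTPUT CONFIG_XML TABLE INPUT_DIR [BADLINES_OUTPUT]
# Hypothetical helper: prints the ImportData command line. The bad-line
# parameters are included only if BADLINES_OUTPUT is given.
build_import_cmd() {
  sep=$1 hfile_out=$2 conf=$3 table=$4 input=$5 bad_out=${6:-}
  opts="-Dimport.separator='$sep'"
  if [ -n "$bad_out" ]; then
    opts="-Dimport.skip.bad.lines=true $opts -Dimport.bad.lines.output=$bad_out"
  fi
  printf '%s\n' "hbase com.huawei.hadoop.hbase.tools.bulkload.ImportData $opts -Dimport.hfile.output=$hfile_out $conf $table $input"
}

# With a bad-line output path
build_import_cmd , /hfile configuration.xml ImportTable /datadirImport /badline
# Without bad-line handling
build_import_cmd , /hfile configuration.xml ImportTable /datadirImport
```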

    For example, run the following command:

    • hbase com.huawei.hadoop.hbase.tools.bulkload.ImportData -Dimport.skip.bad.lines=true -Dimport.separator=',' -Dimport.bad.lines.output=/badline -Dimport.hfile.output=/hfile configuration.xml ImportTable /datadirImport
    • hbase com.huawei.hadoop.hbase.tools.bulkload.IndexImportData -Dimport.skip.bad.lines=true -Dimport.separator=',' -Dimport.bad.lines.output=/badline -Dimport.hfile.output=/hfile configuration_index.xml IndexImportTable /datadirIndexImport
    • After transparent encryption is configured for HBase, when you run the bulkload command to generate an HFile, the HFile path specified by -Dimport.hfile.output must be a subdirectory in /HBase root directory/extdata, for example, /hbase/extdata/bulkloadTmp/hfile.
    • To use transparent encryption for HBase, the HBase user who runs the bulkload command must be added to the hadoop user group of the cluster and must have the read permission on the encryption key of the HBase root directory. If the cluster is not the first one installed on FusionInsight Manager, the user group is c<Cluster ID>_hadoop, for example, c2_hadoop.
    • Check the permission on the /tmp/hbase directory and manually grant the write permission on the directory to the current user.
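When transparent encryption is configured, the HFile output path can be checked against the extdata rule before the job is submitted. The following is a minimal sketch: is_under_extdata is a hypothetical helper, and the default root /hbase is an assumption taken from the example path above. The check is purely textual and does not resolve ".." components.

```shell
# is_under_extdata PATH [HBASE_ROOT]
# Hypothetical helper: succeeds if PATH is a subdirectory of
# HBASE_ROOT/extdata. HBASE_ROOT defaults to /hbase (an assumption).
is_under_extdata() {
  root=${2:-/hbase}
  case "$1" in
    "$root"/extdata/?*) return 0 ;;
    *) return 1 ;;
  esac
}

for path in /hbase/extdata/bulkloadTmp/hfile /hfile; do
  if is_under_extdata "$path"; then
    echo "ok: $path"
  else
    echo "not under extdata: $path"
  fi
done
```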

  8. Run one of the following commands to import the HFile into HBase:

    • Importing data in batches

      hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles </path/for/output> <tablename>

    • Creating a secondary index when importing data in batches

      hbase org.apache.hadoop.hbase.hindex.mapreduce.HIndexLoadIncrementalHFiles </path/for/output> <tablename>

    For example, run the following command:

    • hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles /hfile ImportTable
    • hbase org.apache.hadoop.hbase.hindex.mapreduce.HIndexLoadIncrementalHFiles /hfile IndexImportTable
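The choice between the two loader classes in this step can be sketched as a small function. build_load_cmd and its optional "index" argument are hypothetical. The function only prints the command so that it can be reviewed before being run.

```shell
# build_load_cmd HFILE_DIR TABLE [index]
# Hypothetical helper: prints the load command. Pass "index" as the third
# argument when a secondary index was created during the import.
build_load_cmd() {
  if [ "${3:-}" = "index" ]; then
    class=org.apache.hadoop.hbase.hindex.mapreduce.HIndexLoadIncrementalHFiles
  else
    class=org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles
  fi
  printf 'hbase %s %s %s\n' "$class" "$1" "$2"
}

build_load_cmd /hfile ImportTable
build_load_cmd /hfile IndexImportTable index
```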