Updated on 2022-09-22 GMT+08:00

Creating a Secondary Index When Importing Data in Batches

Scenario

Import data in batches to HBase in custom mode by running commands.

  • A column name can contain only letters, digits, and underscores (_). Other special characters are not allowed.
  • If the column type is set to string, the string length cannot be set. For example, <column index="1" type="string" length="1" >COLOUMN_1</column> is not supported.
  • If the column type is set to date, the date format cannot be set. For example, <column index="13" type="date" format="yyyy-MM-dd hh:mm:ss">COLOUMN_13</column> is not supported.
  • Secondary indexes cannot be created for composite (combined) columns.

Procedure

  1. Run the following commands to import data to HDFS:

    hdfs dfs -mkdir <inputdir>

    hdfs dfs -put <local_data_file> <inputdir>

    For example, create a data file named data.txt with the following content:

    001,Hadoop,citya
    002,HBaseFS,cityb
    003,HBase,cityc
    004,Hive,cityd
    005,Streaming,citye
    006,Mapreduce,cityf
    007,Kerberos,cityg
    008,LdapServer,cityh

    Run the following commands:

    hdfs dfs -mkdir /datadirIndexImport

    hdfs dfs -put data.txt /datadirIndexImport
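Before uploading, it can help to confirm that every line of the data file has the same number of fields, because the import configuration in the next step declares a fixed column count. The following is a minimal sketch, assuming a comma separator and plain splitting without quoting rules; the helper name field_counts is hypothetical and is not part of any HBase tool.

```python
# Contents of data.txt from this step.
SAMPLE = """\
001,Hadoop,citya
002,HBaseFS,cityb
003,HBase,cityc
004,Hive,cityd
005,Streaming,citye
006,Mapreduce,cityf
007,Kerberos,cityg
008,LdapServer,cityh
"""


def field_counts(text, separator=","):
    """Return the distinct per-line field counts, ignoring blank lines."""
    return {
        len(line.split(separator))
        for line in text.splitlines()
        if line.strip()
    }


counts = field_counts(SAMPLE)
if counts == {3}:
    print("every line has 3 fields")
else:
    print("inconsistent field counts:", sorted(counts))
```

A result containing more than one count means that at least one line does not match the others and should be fixed before the upload.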

  2. Create the table IndexImportTable and the configuration file configuration_index.xml. You can create the file by editing the reference template ${client path}/HBase/hbase/conf/index_import.xml.template.

    For example, run the following command in the HBase shell to create the table:

    create 'IndexImportTable', {NAME => 'f1',COMPRESSION => 'SNAPPY', DATA_BLOCK_ENCODING => 'FAST_DIFF'},{NAME=>'f2'}

    For example, customize configuration_index.xml as follows, observing these rules:

    • The value of column_num must match the number of columns in the data file.
    • Each specified family must be a column family of the table.
    • The first letter of the index type must be capitalized, for example, type="String".

    <?xml version="1.0" encoding="UTF-8"?>
      
     <configuration> 
             <import id="first" column_num="3"> 
                     <columns> 
                             <column index="1" type="int">SMS_ID</column> 
                             <column index="2" type="string">SMS_NAME</column> 
                             <column index="3" type="string">SMS_ADDRESS</column> 
                     </columns> 
      
                     <rowkey> 
                             SMS_ID+'_'+substring(SMS_NAME,1,4)+'_'+reverse(SMS_ADDRESS) 
                     </rowkey> 
      
                     <qualifiers> 
                             <normal family="f1"> 
                                     <qualifier column="SMS_ID">H_ID</qualifier> 
                                     <qualifier column="SMS_NAME">H_NAME</qualifier> 
                                     <qualifier column="SMS_ADDRESS">H_ADDRESS</qualifier> 
                             </normal> 
      
                             <!-- Define composite columns --> 
                             <composite family="f2"> 
                                     <qualifier class="com.huawei.H_COMBINE_1">H_COMBINE_1</qualifier> 
                                     <columns> 
                                             <column>SMS_ADDRESS</column> 
                                             <column>SMS_NAME</column> 
                                     </columns> 
                             </composite> 
      
                     </qualifiers> 
      
                     <indices>
                             <index name="IDX1"> 
                                     <index_column family="f1"> 
                                             <qualifier type="String" length="30">H_ID</qualifier> 
                                     </index_column> 
                             </index> 
                     </indices> 
      
                     <badlines>SMS_ID &lt; 7000 &amp;&amp; SMS_NAME == 'HBase'</badlines> 
             </import> 
     </configuration>     

    In the preceding configuration, length="30" indicates that the value of the index column H_ID can contain a maximum of 30 characters.
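The three rules listed for configuration_index.xml can be checked mechanically before the import is run. The following is a rough sketch using Python's standard xml.etree.ElementTree module; the function name check_import_config and the trimmed sample are illustrative assumptions, and the check covers only the rules stated in this step, not everything the import tool may validate.

```python
import xml.etree.ElementTree as ET

# Trimmed version of the sample configuration (XML declaration omitted).
SAMPLE = """
<configuration>
  <import id="first" column_num="3">
    <columns>
      <column index="1" type="int">SMS_ID</column>
      <column index="2" type="string">SMS_NAME</column>
      <column index="3" type="string">SMS_ADDRESS</column>
    </columns>
    <qualifiers>
      <normal family="f1">
        <qualifier column="SMS_ID">H_ID</qualifier>
      </normal>
    </qualifiers>
    <indices>
      <index name="IDX1">
        <index_column family="f1">
          <qualifier type="String" length="30">H_ID</qualifier>
        </index_column>
      </index>
    </indices>
  </import>
</configuration>
"""


def check_import_config(xml_text, table_families):
    """Return a list of rule violations found in the configuration text."""
    problems = []
    root = ET.fromstring(xml_text)
    for imp in root.iter("import"):
        # Rule 1: column_num must equal the number of declared source columns.
        declared = int(imp.get("column_num"))
        columns = imp.findall("./columns/column")
        if declared != len(columns):
            problems.append(
                f"column_num={declared} but {len(columns)} columns declared"
            )
        # Rule 2: every referenced family must be a column family of the table.
        for elem in imp.iter():
            family = elem.get("family")
            if family is not None and family not in table_families:
                problems.append(f"unknown column family: {family}")
        # Rule 3: the index type must start with an uppercase letter.
        for qual in imp.findall("./indices/index/index_column/qualifier"):
            qtype = qual.get("type", "")
            if not qtype[:1].isupper():
                problems.append(f"index type not capitalized: {qtype!r}")
    return problems


# IndexImportTable was created with the column families f1 and f2.
print(check_import_config(SAMPLE, {"f1", "f2"}))
```

An empty list means that none of the three rules is violated for the given set of column families.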

  3. Run the following command to generate HFiles:

    hbase com.huawei.hadoop.hbase.tools.bulkload.IndexImportData -Dimport.skip.bad.lines=true -Dimport.separator=<separator> -Dimport.bad.lines.output=</path/badlines/output> -Dimport.hfile.output=</path/for/output> <configuration xmlfile> <tablename> <inputdir>

    • -Dimport.skip.bad.lines: specifies how bad lines (inapplicable rows) are handled. If this parameter is set to false, command execution stops when a bad line is encountered. If it is set to true, the bad line is skipped and execution continues. If the configuration file defines no bad lines, this parameter can be omitted.
    • -Dimport.separator: specifies the field separator used in the data file, for example, -Dimport.separator=','.
    • -Dimport.bad.lines.output=</path/badlines/output>: specifies the output path for bad lines. If the configuration file defines no bad lines, this parameter can be omitted.
    • -Dimport.hfile.output=</path/for/output>: specifies the output path of the generated HFiles.
    • <configuration xmlfile>: specifies the configuration file.
    • <tablename>: specifies the name of the table to import data into.
    • <inputdir>: specifies the directory that contains the data to be imported in batches.
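To make the effect of -Dimport.skip.bad.lines concrete, the following sketch models it on the sample data. It assumes that rows for which the badlines expression of the sample configuration (SMS_ID < 7000 && SMS_NAME == 'HBase') is true are treated as bad lines. The function names are hypothetical, and the code is a model of the documented behavior, not the import tool's implementation.

```python
ROWS = [
    ("001", "Hadoop", "citya"),
    ("002", "HBaseFS", "cityb"),
    ("003", "HBase", "cityc"),
    ("004", "Hive", "cityd"),
    ("005", "Streaming", "citye"),
    ("006", "Mapreduce", "cityf"),
    ("007", "Kerberos", "cityg"),
    ("008", "LdapServer", "cityh"),
]


def is_bad_line(sms_id, sms_name):
    """Python rendering of: SMS_ID < 7000 && SMS_NAME == 'HBase'."""
    # SMS_ID is declared with type="int" in the sample configuration.
    return int(sms_id) < 7000 and sms_name == "HBase"


def split_rows(rows, skip_bad_lines):
    """Separate good rows from bad ones, or stop at the first bad row."""
    good, bad = [], []
    for row in rows:
        if is_bad_line(row[0], row[1]):
            if not skip_bad_lines:
                # Models skip.bad.lines=false: execution stops here.
                raise ValueError(f"bad line encountered: {row}")
            # Models skip.bad.lines=true: the row is set aside.
            bad.append(row)
        else:
            good.append(row)
    return good, bad


good, bad = split_rows(ROWS, skip_bad_lines=True)
print(len(good), "rows kept;", "bad lines:", bad)
```

Under this assumption only the row with the name HBase is set aside. With skip_bad_lines=False the same row raises an error instead, which mirrors the command stopping.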

    For example, run the following command:

    hbase com.huawei.hadoop.hbase.tools.bulkload.IndexImportData -Dimport.skip.bad.lines=true -Dimport.separator=',' -Dimport.bad.lines.output=/badline -Dimport.hfile.output=/hfile configuration_index.xml IndexImportTable /datadirIndexImport

    • After transparent encryption is configured for HBase, the HFile path specified by -Dimport.hfile.output in the bulkload command must be a subdirectory of <HBase root directory>/extdata, for example, /hbase/extdata/bulkloadTmp/hfile.
    • After transparent encryption is configured for HBase, the HBase user who runs the bulkload command must belong to the hadoop user group of the corresponding cluster and must have the read permission on the encryption key of the HBase root directory. For a cluster that is not the first one installed on FusionInsight Manager, the user group is c<Cluster ID>_hadoop, for example, c2_hadoop.
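The path restriction for transparent encryption can be checked before the command is run. The following is a minimal sketch, assuming the HBase root directory is /hbase as in the example path above; the function name is_under_extdata is hypothetical, and the check is purely textual (it does not contact HDFS).

```python
import posixpath


def is_under_extdata(hfile_output, hbase_root="/hbase"):
    """Return True if hfile_output is a subdirectory of <hbase_root>/extdata."""
    extdata = posixpath.normpath(posixpath.join(hbase_root, "extdata"))
    # Normalize first so that ".." segments cannot escape the extdata directory.
    path = posixpath.normpath(hfile_output)
    return path.startswith(extdata + "/")


for candidate in ("/hbase/extdata/bulkloadTmp/hfile", "/hfile"):
    print(candidate, is_under_extdata(candidate))
```

The extdata directory itself is rejected because the requirement asks for a subdirectory of it.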

  4. Run the following command to import the generated HFiles to HBase:

    hbase org.apache.hadoop.hbase.hindex.mapreduce.HIndexLoadIncrementalHFiles </path/for/output> <tablename>

    For example, run the following command:

    hbase org.apache.hadoop.hbase.hindex.mapreduce.HIndexLoadIncrementalHFiles /hfile IndexImportTable
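When steps 3 and 4 are scripted, the two command lines can be assembled from the same parameters so that the HFile output path and table name stay consistent. The following sketch only builds the argument lists and prints them; it does not execute anything. The function names are hypothetical, while the class names and -D options are the ones used in this procedure.

```python
import shlex


def build_index_import_command(config_xml, table, input_dir, hfile_output,
                               separator=",", skip_bad_lines=None,
                               bad_lines_output=None):
    """Assemble the step 3 command that generates the HFiles."""
    args = ["hbase", "com.huawei.hadoop.hbase.tools.bulkload.IndexImportData"]
    # The bad-line options are added only when the caller supplies them.
    if skip_bad_lines is not None:
        flag = "true" if skip_bad_lines else "false"
        args.append(f"-Dimport.skip.bad.lines={flag}")
    args.append(f"-Dimport.separator={separator}")
    if bad_lines_output is not None:
        args.append(f"-Dimport.bad.lines.output={bad_lines_output}")
    args.append(f"-Dimport.hfile.output={hfile_output}")
    args += [config_xml, table, input_dir]
    return args


def build_load_command(hfile_output, table):
    """Assemble the step 4 command that loads the HFiles into HBase."""
    return [
        "hbase",
        "org.apache.hadoop.hbase.hindex.mapreduce.HIndexLoadIncrementalHFiles",
        hfile_output,
        table,
    ]


generate = build_index_import_command(
    "configuration_index.xml", "IndexImportTable", "/datadirIndexImport",
    "/hfile", skip_bad_lines=True, bad_lines_output="/badline",
)
load = build_load_command("/hfile", "IndexImportTable")
# shlex.join quotes arguments where the shell requires it.
print(shlex.join(generate))
print(shlex.join(load))
```

Keeping the arguments as a list avoids shell quoting mistakes if the commands are later passed to a process launcher.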