Importing HBase Data in Batches Using BulkLoad

Scenario

You can run commands to import data to HBase in batches and create indexes.

You can define multiple methods in configuration.xml for importing data in batches. Creating indexes during the import is optional.

The column name consists of letters, digits, and underscores (_) and cannot contain any special characters.
If the MapReduce job fails, rectify the fault by following the instructions in "Why Physical Memory Overflow Occurs If a MapReduce Task Fails?".
BulkLoad supports delimited text files (text files with separators) as data sources.
The client has been installed. For example, the installation directory is /opt/hadoopclient. The client directory in the following operations is only an example. Change it to the actual installation directory.
If secondary indexes are created when data is imported in batches, pay attention to the following points:
- If the column type is set to string, the string length cannot be set. For example, <column index="1" type="string" length="1" >COLOUMN_1</column> is not supported.
- If the column type is set to date, the date format cannot be set. For example, <column index="13" type="date" format="yyyy-MM-dd hh:mm:ss">COLOUMN_13</column> is not supported.
- Secondary indexes cannot be created for combined columns.
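
The column-name rule above (letters, digits, and underscores only) can be checked before the import configuration is written. The following is a minimal sketch in POSIX shell. The function name is_valid_column_name is hypothetical and is not part of the BulkLoad tool.

```shell
# Hypothetical helper, not part of the BulkLoad tool: succeeds only if the
# name is non-empty and contains nothing but letters, digits, and underscores.
is_valid_column_name() {
  case "$1" in
    '' | *[!A-Za-z0-9_]*) return 1 ;;
    *) return 0 ;;
  esac
}

# Check a few candidate names and report the result for each one.
for name in SMS_ID SMS-NAME "SMS ADDRESS"; do
  if is_valid_column_name "$name"; then
    echo "ok: $name"
  else
    echo "invalid: $name"
  fi
done
```

Only SMS_ID passes here, because the other two names contain a hyphen and a space.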

Importing HBase Data in Batches Using the BulkLoad Tool

Log in to the node where the client is installed as the client installation user.
Run the following command to go to the client directory:

cd /opt/hadoopclient
Run the following command to configure environment variables:

source bigdata_env
Run the following command to authenticate the current user if Kerberos authentication is enabled for the current cluster. The current user must have the permissions to submit YARN jobs, create and write HBase tables, and use HDFS.

kinit <Component service user>

Run the following command to set the Hadoop username if Kerberos authentication is not enabled for the current cluster:

export HADOOP_USER_NAME=hbase
Run the following commands to import data to HDFS:

hdfs dfs -mkdir <inputdir>

hdfs dfs -put <local_data_file> <inputdir>

For example, create a data file named data.txt with the following content:
```
001,Hadoop,citya
002,HBaseFS,cityb
003,HBase,cityc
004,Hive,cityd
005,Streaming,citye
006,MapReduce,cityf
007,Kerberos,cityg
008,LdapServer,cityh
```
Run the following commands to upload the file to HDFS:

hdfs dfs -mkdir /datadirImport

hdfs dfs -put data.txt /datadirImport
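
Because the import configuration must declare the same number of columns as the data file contains (column_num), it can help to check the local data file for rows with an unexpected field count. The following sketch uses awk. The function name check_field_count is hypothetical, and the second input row is deliberately truncated for illustration.

```shell
# Hypothetical pre-check, not part of the BulkLoad tool: print every line
# whose number of fields differs from the expected column count.
check_field_count() {
  # $1 = single-character separator, $2 = expected number of columns
  # The data is read from standard input.
  awk -F"$1" -v n="$2" 'NF != n { printf "line %d: %d fields: %s\n", NR, NF, $0 }'
}

# Sample input based on data.txt, with the second row truncated on purpose.
check_field_count ',' 3 <<'EOF'
001,Hadoop,citya
002,HBaseFS
003,HBase,cityc
EOF
```

To check a real file, redirect it into the function, for example check_field_count ',' 3 < data.txt. No output means that every row has the expected number of fields.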

Open the HBase shell and create the table ImportTable. Then create the file configuration.xml, which you can edit based on the template file in the client directory, for example, /opt/hadoopclient/HBase/hbase/conf/import.xml.template.

For example, run the following command to create the table:

create 'ImportTable', {NAME => 'f1',COMPRESSION => 'SNAPPY', DATA_BLOCK_ENCODING => 'FAST_DIFF'},{NAME=>'f2'}

The following notes apply to the customized import configuration file configuration.xml:

- The value of column_num must match the number of columns in the data file.
- Each specified family must be an existing column family of the table.
- The indices element needs to be set only when secondary indexes are created during batch data import. The first letter of the index type must be capitalized, for example, type="String". In the following snippet, length="30" indicates that the value of the index column H_ID cannot exceed 30 characters:

<indices>
    <index name="IDX1">
        <index_column family="f1">
            <qualifier type="String" length="30">H_ID</qualifier>
        </index_column>
    </index>
</indices>

For example, the content of configuration.xml is as follows:

<?xml version="1.0" encoding="UTF-8"?>

<configuration>
    <import id="first" column_num="3">
        <columns>
            <column index="1" type="int">SMS_ID</column>
            <column index="2" type="string">SMS_NAME</column>
            <column index="3" type="string">SMS_ADDRESS</column>
        </columns>

        <rowkey>
            SMS_ID+'_'+substring(SMS_NAME,1,4)+'_'+reverse(SMS_ADDRESS)
        </rowkey>

        <qualifiers>
            <normal family="f1">
                <qualifier column="SMS_ID">H_ID</qualifier>
                <qualifier column="SMS_NAME">H_NAME</qualifier>
                <qualifier column="SMS_ADDRESS">H_ADDRESS</qualifier>
            </normal>

            <!-- Define composite columns -->
            <composite family="f2">
                <qualifier class="com.huawei.H_COMBINE_1">H_COMBINE_1</qualifier>
                <columns>
                    <column>SMS_ADDRESS</column>
                    <column>SMS_NAME</column>
                </columns>
            </composite>
        </qualifiers>

        <indices>
            <index name="IDX1">
                <index_column family="f1">
                    <qualifier type="String" length="30">H_ID</qualifier>
                </index_column>
            </index>
        </indices>

        <badlines>SMS_ID &lt; 7000 &amp;&amp; SMS_NAME == 'HBase'</badlines>
    </import>
</configuration>
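
To see which rows of a data file the sample badlines condition would flag, the condition can be emulated locally. The following sketch rests on two assumptions: that a row is treated as inapplicable when the condition is true, and that SMS_ID is compared numerically because its column type is int. It uses awk rather than the BulkLoad tool, so the function name find_bad_lines is hypothetical and the tool's exact expression semantics are not verified here.

```shell
# Illustration only: emulates the sample <badlines> condition
#   SMS_ID < 7000 && SMS_NAME == 'HBase'
# with awk. Field 1 is SMS_ID and field 2 is SMS_NAME.
find_bad_lines() {
  # Adding 0 forces a numeric comparison for the first field.
  awk -F',' '($1 + 0) < 7000 && $2 == "HBase"'
}

# The first four rows of data.txt. Only the row whose name is exactly
# "HBase" satisfies the condition.
find_bad_lines <<'EOF'
001,Hadoop,citya
002,HBaseFS,cityb
003,HBase,cityc
004,Hive,cityd
EOF
```

Under these assumptions, the only row printed is 003,HBase,cityc.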

Run the following command to generate HFiles:

hbase com.huawei.hadoop.hbase.tools.bulkload.ImportData -Dimport.skip.bad.lines=true -Dimport.separator=<separator> -Dimport.bad.lines.output=</path/badlines/output> -Dimport.hfile.output=</path/for/output> <configuration xmlfile> <tablename> <inputdir>
- -Dimport.skip.bad.lines: specifies how inapplicable rows (bad lines) are handled. If this parameter is set to false, command execution stops when an inapplicable row is encountered. If it is set to true, the inapplicable row is skipped and command execution continues. Omit this parameter if no inapplicable row is defined in configuration.xml.
- -Dimport.separator: specifies the field separator of the data file, for example, -Dimport.separator=','.
- -Dimport.bad.lines.output=</path/badlines/output>: specifies the output path for inapplicable rows. Omit this parameter if no inapplicable row is defined in configuration.xml.
- -Dimport.hfile.output=</path/for/output>: specifies the output path of the generated HFiles.
- <configuration xmlfile>: specifies the configuration file.
- <tablename>: specifies the name of the table to be operated.
- <inputdir>: specifies the directory that contains the data to be imported in batches.
For example, run one of the following commands:
- To import data in batches:
  hbase com.huawei.hadoop.hbase.tools.bulkload.ImportData -Dimport.skip.bad.lines=true -Dimport.separator=',' -Dimport.bad.lines.output=/badline -Dimport.hfile.output=/hfile configuration.xml ImportTable /datadirImport
- To import data in batches and create a secondary index:
  hbase com.huawei.hadoop.hbase.tools.bulkload.IndexImportData -Dimport.skip.bad.lines=true -Dimport.separator=',' -Dimport.bad.lines.output=/badline -Dimport.hfile.output=/hfile configuration_index.xml IndexImportTable /datadirIndexImport
Note the following when running the command:
- After transparent encryption is configured for HBase, the HFile path specified by -Dimport.hfile.output must be a subdirectory of /HBase root directory/extdata, for example, /hbase/extdata/bulkloadTmp/hfile.
- After transparent encryption is configured for HBase, the HBase user who runs the bulkload command must be added to the hadoop user group of the corresponding cluster and must have the read permission on the encryption key of the HBase root directory. For a cluster that is not the first one installed on FusionInsight Manager, the user group is c<Cluster ID>_hadoop, for example, c2_hadoop.
- Check the permission on the /tmp/hbase directory. If the current user does not have the write permission on the directory, grant it manually.
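
The path rule for transparent encryption can be checked before a job is submitted. The following sketch tests whether an output path is a subdirectory of the extdata directory under the HBase root directory. The function name is_under_extdata is hypothetical, and /hbase is assumed to be the HBase root directory only because the example path above uses it.

```shell
# Hypothetical helper, not part of the BulkLoad tool: succeeds only if the
# HFile output path lies below <HBase root directory>/extdata.
is_under_extdata() {
  # $1 = HFile output path, $2 = HBase root directory (assumed: /hbase)
  case "$1" in
    "$2"/extdata/?*) return 0 ;;
    *) return 1 ;;
  esac
}

# Compare the example path from the note with the unencrypted example path.
for path in /hbase/extdata/bulkloadTmp/hfile /hfile; do
  if is_under_extdata "$path" /hbase; then
    echo "allowed with transparent encryption: $path"
  else
    echo "not under /hbase/extdata: $path"
  fi
done
```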
Run one of the following commands to import the HFiles to HBase:
- To import data in batches:
  hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles </path/for/output> <tablename>
- To import data in batches and create a secondary index:
  hbase org.apache.hadoop.hbase.hindex.mapreduce.HIndexLoadIncrementalHFiles </path/for/output> <tablename>
For example, run one of the following commands:
- hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles /hfile ImportTable
- hbase org.apache.hadoop.hbase.hindex.mapreduce.HIndexLoadIncrementalHFiles /hfile IndexImportTable
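
The two stages above, generating the HFiles and then loading them, can be previewed with a small dry-run script that only prints the commands. The following is a sketch for the import without secondary indexes. The function name print_bulkload_cmds and its argument order are hypothetical. The class names and options are the ones used in the steps above.

```shell
# Dry-run sketch: print the two BulkLoad commands for one table without
# running them. Function and variable names are illustrative only.
print_bulkload_cmds() {
  # $1 = configuration file, $2 = table name, $3 = HDFS input directory,
  # $4 = HFile output path, $5 = bad-lines output path
  conf=$1; table=$2; input=$3; hfile_out=$4; badlines_out=$5
  # Stage 1: generate the HFiles from the delimited input data.
  printf '%s\n' "hbase com.huawei.hadoop.hbase.tools.bulkload.ImportData -Dimport.skip.bad.lines=true -Dimport.separator=',' -Dimport.bad.lines.output=$badlines_out -Dimport.hfile.output=$hfile_out $conf $table $input"
  # Stage 2: load the generated HFiles into the table.
  printf '%s\n' "hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles $hfile_out $table"
}

print_bulkload_cmds configuration.xml ImportTable /datadirImport /hfile /badline
```

With these arguments, the printed lines match the ImportTable example commands above, so they can be reviewed and then run one at a time on the client node.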