Updated on 2024-11-29 GMT+08:00

Loading Index Data in Batches

Scenarios

HBase allows you to use the ImportTsv and LoadIncrementalHFiles tools to load user data in batches. You can also use the GlobalIndexImportTsv and GlobalIndexBulkLoadHFilesTool tools to load both user data and global index data in batches. GlobalIndexImportTsv inherits all functions of the HBase batch data loading tool ImportTsv.

If the target table does not exist when the GlobalIndexImportTsv tool runs, the tool creates the table automatically, creates the global index along with it, and generates index data together with the user data. However, automatic table creation does not support pre-splitting, which may cause performance problems such as region hotspots. Therefore, create the table and its indexes before you run the GlobalIndexImportTsv tool to load data.

Procedure

  1. Log in to the node where the client is installed as the client installation user and run the following commands:

    cd Client installation directory

    source bigdata_env

    kinit Component service user (Skip this step if Kerberos authentication is disabled for the cluster, that is, the cluster is in normal mode.)
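
    For example, assuming the client is installed in /opt/hadoopclient and the component service user is hbaseuser (both values are hypothetical and depend on your environment), the commands would be:

    cd /opt/hadoopclient

    source bigdata_env

    kinit hbaseuser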

  2. Run the following commands to import data to HDFS:

    hdfs dfs -mkdir <inputdir>

    hdfs dfs -put <local_data_file> <inputdir>

    For example, the data file data.txt is defined as follows:

    12005000201,Zhang San,Male,19,City a,Province a
    12005000202,Li Wanting,Female,23,City b,Province b
    12005000203,Wang Ming,Male,26,City c,Province c
    12005000204,Li Gang,Male,18,City d,Province d
    12005000205,Zhao Enru,Female,21,City e,Province e
    12005000206,Chen Long,Male,32,City f,Province f
    12005000207,Zhou Wei,Female,29,City g,Province g
    12005000208,Yang Yiwen,Female,30,City h,Province h
    12005000209,Xu Bing,Male,26,City i,Province i
    12005000210,Xiao Kai,Male,25,City j,Province j

    Run the following commands to upload data.txt to HDFS:

    hdfs dfs -mkdir /datadirImport

    hdfs dfs -put data.txt /datadirImport
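
    Optionally, run the following command to confirm that the file has been uploaded (the path follows the example above):

    hdfs dfs -ls /datadirImport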

  3. Run the following commands to open the HBase shell and create the bulkTable table:

    hbase shell

    create 'bulkTable', {NAME => 'info', COMPRESSION => 'SNAPPY', DATA_BLOCK_ENCODING => 'FAST_DIFF'}, {NAME => 'address'}

    After the table is created, exit the HBase shell command line.
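
    Before exiting, you can optionally verify the table definition in the HBase shell; this is only a verification suggestion, not a required part of the procedure:

    describe 'bulkTable'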

  4. Run the following command to create the global index:

    hbase org.apache.hadoop.hbase.hindex.global.mapreduce.GlobalTableIndexer -Dtablename.to.index='bulkTable' -Dindexspecs.to.add='index_bulk=>info:[age->String]' -Dindexspecs.coveredallcolumn.to.add='index_bulk=>true' -Dindexspecs.splitkeys.to.set='index_bulk=>[\x010,\x011,\x012]'

    For details about how to use the command, see Creating Indexes.

  5. Run the following command to generate HFiles (StoreFiles):

    hbase org.apache.hadoop.hbase.hindex.global.mapreduce.GlobalIndexImportTsv -Dimporttsv.separator=<separator> -Dimporttsv.bulk.output=</path/for/output> <columns> <tablename> <inputdir>

    • -Dimporttsv.separator: indicates the separator, for example, -Dimporttsv.separator=','.
    • -Dimporttsv.bulk.output=</path/for/output>: indicates the output path of the execution result. You need to specify a path that does not exist.
    • <columns>: indicates the mapping between the imported data and the table columns, for example, -Dimporttsv.columns=HBASE_ROW_KEY,info:name,info:gender,info:age,address:city,address:province.
    • <tablename>: indicates the name of the table to be operated on.
    • <inputdir>: indicates the directory that contains the data to be loaded in batches.
    • (Optional) -Dindexspecs.covered.to.add: indicates the columns of the data table that are redundantly stored in the index table, that is, the covered columns. Example: -Dindexspecs.covered.to.add='IDX1=>cf1:[q1];cf2:[q1]#IDX2=>cf0:[q5]'.
    • (Optional) -Dindexspecs.covered.family.to.add: indicates the column families of the data table that are redundantly stored in the index table, that is, the covered column families. Example: -Dindexspecs.covered.family.to.add='IDX1=>cf_0#IDX2=>cf_1;cf_2'.
    • (Optional) -Dindexspecs.coveredallcolumn.to.add: indicates that the index table redundantly stores all columns of the data table, that is, all columns are covered. Example: -Dindexspecs.coveredallcolumn.to.add='IDX1=>true#IDX2=>true'.
    • (Optional) -Dindexspecs.splitkeys.to.set: indicates the pre-splitting keys of the index table. Specify this parameter to prevent region hotspots. The format is as follows:
      • '#' separates indexes.
      • '[]' encloses the splitkeys of an index.
      • ',' separates splitkeys.

      For example: -Dindexspecs.splitkeys.to.set='IDX1=>[1,2,3]#IDX2=>[a,b,c]'

    • (Optional) -Dindexspecs.to.add=<indexspecs>: indicates the mapping between an index name and columns, for example, -Dindexspecs.to.add='index_bulk=>info:[age->String]'. The value format is as follows:

      indexNameN=>familyN:[columnQualifierN->columnQualifierDataType],[columnQualifierM->columnQualifierDataType];familyM:[columnQualifierO->columnQualifierDataType]#indexNameM=>familyM:[columnQualifierO->columnQualifierDataType]

      The parameters are as follows:

      • Column qualifiers are separated by commas (,). Example: index1=>f1:[c1->String],[c2->String]
      • Column families are separated by semicolons (;). Example: index1=>f1:[c1->String],[c2->String];f2:[c3->Long]
      • Multiple indexes are separated by number signs (#). Example: index1=>f1:[c1->String],[c2->String];f2:[c3->Long]#index2=>f2:[c3->Long]
      • The following column data types are supported:

        STRING, INTEGER, FLOAT, LONG, DOUBLE, SHORT, BYTE, and CHAR

    • Data types are not case-sensitive.
    • The indexspecs.covered.to.add, indexspecs.covered.family.to.add, indexspecs.coveredallcolumn.to.add, indexspecs.splitkeys.to.set, and indexspecs.to.add parameters take effect only when the table to be operated on does not exist and needs to be created automatically.

    The following is an example:

    hbase org.apache.hadoop.hbase.hindex.global.mapreduce.GlobalIndexImportTsv -Dimporttsv.separator=',' -Dimporttsv.bulk.output=/dataOutput -Dimporttsv.columns=HBASE_ROW_KEY,info:name,info:gender,info:age,address:city,address:province bulkTable /datadirImport/data.txt
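
    For reference, if bulkTable had not been created in advance, the table-creation parameters described above could be added to the same command so that the tool creates the table and index automatically. The following is an illustrative sketch only, combining the values used in 4 and 5; because automatic creation does not support pre-splitting of the data table, pre-creating the table as in 3 and 4 remains the recommended approach:

    hbase org.apache.hadoop.hbase.hindex.global.mapreduce.GlobalIndexImportTsv -Dimporttsv.separator=',' -Dimporttsv.bulk.output=/dataOutput -Dimporttsv.columns=HBASE_ROW_KEY,info:name,info:gender,info:age,address:city,address:province -Dindexspecs.to.add='index_bulk=>info:[age->String]' -Dindexspecs.coveredallcolumn.to.add='index_bulk=>true' -Dindexspecs.splitkeys.to.set='index_bulk=>[\x010,\x011,\x012]' bulkTable /datadirImport/data.txt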

  6. Run the following command to import the generated HFiles to HBase:

    hbase org.apache.hadoop.hbase.tool.GlobalIndexBulkLoadHFilesTool </path/for/output> <tablename>

    The following is an example:

    hbase org.apache.hadoop.hbase.tool.GlobalIndexBulkLoadHFilesTool /dataOutput bulkTable

    The command output is similar to the following:

    2024-01-13 18:29:03,043 INFO  [GlobalIndexBulkLoadHFiles-0] hdfs.DFSClient: Created token for admintest: HDFS_DELEGATION_TOKEN owner=admintest@HADOOP.COM, renewer=renewer, realUser=, issueDate=1705141743030, maxDate=1705746543030, sequenceNumber=4261, masterKeyId=5 on ha-hdfs:hacluster
    2024-01-13 18:29:03,123 INFO  [LoadIncrementalHFiles-0] compress.CodecPool: Got brand-new decompressor [.snappy]
    2024-01-13 18:29:03,127 INFO  [LoadIncrementalHFiles-0] compress.CodecPool: Got brand-new decompressor [.snappy]
    2024-01-13 18:29:03,127 INFO  [LoadIncrementalHFiles-1] compress.CodecPool: Got brand-new decompressor [.snappy]
    2024-01-13 18:29:03,127 INFO  [LoadIncrementalHFiles-4] compress.CodecPool: Got brand-new decompressor [.snappy]
    2024-01-13 18:29:03,128 INFO  [LoadIncrementalHFiles-0] tool.LoadIncrementalHFiles: Trying to load hfile=hdfs://hacluster/dataOutput/bulkTable.index_bulk/0/8610217824254455849576409ebf8f53 first=Optional[\x0118\x00\x0112005000204\x00] last=Optional[\x0119\x00\x0112005000201\x00]
    2024-01-13 18:29:03,128 INFO  [LoadIncrementalHFiles-1] tool.LoadIncrementalHFiles: Trying to load hfile=hdfs://hacluster/dataOutput/bulkTable.index_bulk/0/fa17bc8e753341ffa0ba9e702200c04a first=Optional[\x0121\x00\x0112005000205\x00] last=Optional[\x0132\x00\x0112005000206\x00]
    2024-01-13 18:29:03,129 INFO  [LoadIncrementalHFiles-2] tool.LoadIncrementalHFiles: Trying to load hfile=hdfs://hacluster/dataOutput/bulkTable.index_bulk/address/7a0308810d264d61bda32c385f50260c first=Optional[\x0121\x00\x0112005000205\x00] last=Optional[\x0132\x00\x0112005000206\x00]
    2024-01-13 18:29:03,129 INFO  [LoadIncrementalHFiles-4] tool.LoadIncrementalHFiles: Trying to load hfile=hdfs://hacluster/dataOutput/bulkTable.index_bulk/info/27cb42f48cb14597badb6cf8b302d4e8 first=Optional[\x0118\x00\x0112005000204\x00] last=Optional[\x0119\x00\x0112005000201\x00]
    2024-01-13 18:29:03,130 INFO  [LoadIncrementalHFiles-3] tool.LoadIncrementalHFiles: Trying to load hfile=hdfs://hacluster/dataOutput/bulkTable.index_bulk/address/fe8487c5e2cf4bbaaeb9e638b8acc2c1 first=Optional[\x0118\x00\x0112005000204\x00] last=Optional[\x0119\x00\x0112005000201\x00]
    2024-01-13 18:29:03,131 INFO  [LoadIncrementalHFiles-5] tool.LoadIncrementalHFiles: Trying to load hfile=hdfs://hacluster/dataOutput/bulkTable.index_bulk/info/657937b1edd6401b8f5575e42e7ec92b first=Optional[\x0121\x00\x0112005000205\x00] last=Optional[\x0132\x00\x0112005000206\x00]
    2024-01-13 18:29:03,539 INFO  [GlobalIndexBulkLoadHFiles-0] hdfs.DFSClient: Cancelling token for admintest: HDFS_DELEGATION_TOKEN owner=admintest@HADOOP.COM, renewer=renewer, realUser=, issueDate=1705141743030, maxDate=1705746543030, sequenceNumber=4261, masterKeyId=5 on ha-hdfs:hacluster
    2024-01-13 18:29:03,571 INFO  [GlobalIndexBulkLoadHFiles-0] client.ConnectionImplementation: Closing master protocol: MasterService
    2024-01-13 18:29:03,678 INFO  [GlobalIndexBulkLoadHFiles-0-EventThread] zookeeper.ClientCnxn: EventThread shut down for session: 0x3201ef383210e59e
    2024-01-13 18:29:03,678 INFO  [GlobalIndexBulkLoadHFiles-0] zookeeper.ZooKeeper: Connection: 0x3201ef383210e59e closed
    2024-01-13 18:29:03,679 INFO  [GlobalIndexBulkLoadHFiles-0] client.ConnectionImplementation: Connection has been closed by GlobalIndexBulkLoadHFiles-0.
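
    After the load is complete, you can optionally verify the result in the HBase shell, for example by counting the rows of the data table and scanning a few records (a verification suggestion, not part of the tool output):

    hbase shell

    count 'bulkTable'

    scan 'bulkTable', {LIMIT => 2}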

    During index data generation and loading, do not modify indexes, including but not limited to adding or deleting indexes and changing the index status. Otherwise, running tasks may fail due to data inconsistency. In this case, execute the tasks again after the indexes become stable.