Updated on 2024-10-08 GMT+08:00

Using TableIndexer to Generate a Local HBase Secondary Index

Scenarios

TableIndexer allows you to quickly index data in HBase. With this tool, you can create, add, and delete indexes using MapReduce functions. The application scenarios are as follows:

  • You want to add an index on a specified column of a table that already contains a large amount of data. Adding the index with the addIndicesWithData() API also generates index data for the existing table data, which is time-consuming. Adding it with addIndices() does not generate index data for the existing table data. In this case, you can use the TableIndexer tool to create the index.
  • If the index data is inconsistent with the table data, the tool can be used to rebuild index data.

    If you temporarily disable an index, put new data into the indexed column while the index is disabled, and then enable the index directly from the disabled state, the index data and user data may be inconsistent. You must therefore rebuild all index data before using the index again.

  • You can use the TableIndexer tool to completely delete a large amount of existing index data from a table.
  • For tables that do not have indexes, this tool allows you to add and build indexes at the same time.

How to Use TableIndexer

  • Adding a new index to a user table

    The command is as follows:

    hbase org.apache.hadoop.hbase.hindex.mapreduce.TableIndexer -Dtablename.to.index=tablename -Dindexspecs.to.add='idx_0=>cf_0:[q_0->string],[q_1];cf_1:[q_2],[q_3]#idx_1=>cf_1:[q_4]'

    The parameters are as follows:

    • tablename.to.index: indicates the name of a table for which an index is created.
    • indexspecs.to.add: indicates the mappings between the index name and the column in the table.
    • (Optional) scan.caching: indicates the number of rows to cache, which is passed to the scanner when the data table is scanned. The value is an integer.

    The parameters in the preceding command are described as follows:

    • idx_1: Indicates an index name.
    • cf_0: Indicates the name of a column family.
    • q_0: Indicates the name of a column.
    • string: indicates the data type. The value can be STRING, INTEGER, FLOAT, LONG, DOUBLE, SHORT, BYTE, or CHAR.
    • The pound key (#) is used to separate indexes. The semicolon (;) is used to separate column families. The comma (,) is used to separate column qualifiers.
    • The column name and its data type must be included in '[]'.
    • Column names and their data types are separated by '->'.
    • If the data type of a specific column is not specified, the default data type (string) is used.
    • If scan.caching is not configured, the default value 1000 is used.
    • The user table must exist.
    • The specified index must not already exist in the table.
    • If a column family named d exists in the user table, you must use the TableIndexer tool to build index data.

    After the preceding command is executed, the specified indexes are added to the table in the INACTIVE state. This behavior is similar to that of the addIndices() API.
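    The separator rules above can be easier to follow when the indexspecs.to.add value is assembled piece by piece. The following is a minimal sketch, not part of the tool itself: the variable names are illustrative, the table, index, and column names are the placeholders used in this section, and passing scan.caching as a -D property is an assumption based on how the other parameters are passed.

```shell
# Sketch: assemble an indexspecs.to.add value from its parts.
# All table, index, column family, and qualifier names are placeholders.
table="tablename"

# One column family entry: family name, a colon, then bracketed qualifiers
# joined by commas. A qualifier is "[name]" or "[name->type]"; the type
# defaults to string when it is omitted.
cf0='cf_0:[q_0->string],[q_1]'
cf1='cf_1:[q_2],[q_3]'

# Column families within one index are joined by ';'.
idx0="idx_0=>${cf0};${cf1}"
idx1='idx_1=>cf_1:[q_4]'

# Indexes are joined by '#'.
indexspecs="${idx0}#${idx1}"

# Print the command for review instead of running it.
# Assumption: scan.caching is passed as a -D property like the other parameters.
echo hbase org.apache.hadoop.hbase.hindex.mapreduce.TableIndexer \
  "-Dtablename.to.index=${table}" \
  "-Dindexspecs.to.add='${indexspecs}'" \
  "-Dscan.caching=500"
```

    The sketch only prints the command so that the quoting can be checked first. Keep the single quotes around the index specification when running the printed command, because characters such as ; and > are otherwise interpreted by the shell.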

  • Creating index data for existing indexes in a user table

    The command is as follows:

    hbase org.apache.hadoop.hbase.hindex.mapreduce.TableIndexer -Dtablename.to.index=tablename -Dindexnames.to.build='idx_0#idx_1'

    • tablename.to.index: Indicates the name of the table for which index data is built.
    • indexnames.to.build: Indicates the names of the indexes for which data is built.
    • scan.caching (optional): Indicates the number of rows to cache, which is passed to the scanner when the data table is scanned. The value is an integer.

    The parameters in the preceding command are described as follows:

    • idx_1: Indicates an index name.
    • The pound key (#) is used to separate index names.
    • If scan.caching is not configured, the default value 1000 is used.
    • The user table must exist.

    After the preceding command is executed, the specified indexes are set to the ACTIVE state and can be used when data is scanned.
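    When the index names come from a list, the value of indexnames.to.build can be produced by joining them with the pound key. The following is a minimal sketch under that assumption; join_index_names is a hypothetical helper, not part of TableIndexer, and the table and index names are placeholders.

```shell
# Sketch: join index names with '#', the separator used by indexnames.to.build.
# join_index_names is a hypothetical helper written for this example.
join_index_names() {
  joined=""
  for name in "$@"; do
    if [ -z "$joined" ]; then
      joined="$name"
    else
      joined="${joined}#${name}"
    fi
  done
  printf '%s' "$joined"
}

table="tablename"
names=$(join_index_names idx_0 idx_1)

# Print the command for review instead of running it.
echo hbase org.apache.hadoop.hbase.hindex.mapreduce.TableIndexer \
  "-Dtablename.to.index=${table}" \
  "-Dindexnames.to.build='${names}'"
```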

  • Deleting the existing indexes and their data from a user table

    The command is as follows:

    hbase org.apache.hadoop.hbase.hindex.mapreduce.TableIndexer -Dtablename.to.index=tablename -Dindexnames.to.drop='idx_0#idx_1'

    • tablename.to.index: Indicates the name of the table from which the indexes are deleted.
    • indexnames.to.drop: Indicates the names of the indexes to delete together with their data. The indexes must exist in the table.
    • scan.caching (optional): Indicates the number of rows to cache, which is passed to the scanner when the data table is scanned. The value is an integer.

    The parameters in the preceding command are described as follows:

    • idx_1: Indicates an index name.
    • The pound key (#) is used to separate index names.
    • If scan.caching is not configured, the default value 1000 is used.
    • The user table must exist.

    After the preceding command is executed, the specified indexes and their data are deleted from the table.

  • Adding new indexes to user tables and building data based on existing data

    The command is as follows:

    hbase org.apache.hadoop.hbase.hindex.mapreduce.TableIndexer -Dtablename.to.index=tablename -Dindexspecs.to.add='idx_0=>cf_0:[q_0->string],[q_1];cf_1:[q_2],[q_3]#idx_1=>cf_1:[q_4]' -Dindexnames.to.build='idx_0'

    • The user table must exist.
    • The indexes specified in indexspecs.to.add must not exist in the table.
    • The index names specified in indexnames.to.build must exist in the table or be part of the value of indexspecs.to.add.

    After the preceding command is executed, all indexes specified in indexspecs.to.add are added to the table, and index data is built for all indexes specified in indexnames.to.build.
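    Before running the combined command, it can help to check which names in indexnames.to.build are defined in indexspecs.to.add, because any other name must already exist in the table. The following is a minimal local sketch of that check. It only parses the two strings and does not query HBase, and the variable names are illustrative.

```shell
# Sketch: find names in indexnames.to.build that are not defined in
# indexspecs.to.add. Such names must already exist in the table.
indexspecs='idx_0=>cf_0:[q_0->string],[q_1];cf_1:[q_2],[q_3]#idx_1=>cf_1:[q_4]'
to_build='idx_0'

# Index names defined in indexspecs.to.add: split on '#', keep the text
# before '=>' (dropping any spaces in front of the arrow).
added_names=$(printf '%s\n' "$indexspecs" | tr '#' '\n' | sed 's/[[:space:]]*=>.*//')

# Collect names from indexnames.to.build that indexspecs.to.add does not define.
not_added=""
for name in $(printf '%s\n' "$to_build" | tr '#' ' '); do
  if ! printf '%s\n' "$added_names" | grep -qxF -- "$name"; then
    not_added="${not_added} ${name}"
  fi
done

if [ -n "$not_added" ]; then
  echo "Not in indexspecs.to.add (must already exist in the table):${not_added}"
else
  echo "All names in indexnames.to.build are defined in indexspecs.to.add."
fi
```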