Updated on 2023-05-06 GMT+08:00

Using an Index Generation Tool

Scenarios

HBase provides the TableIndexer tool, which uses MapReduce to quickly add indexes, build index data, and delete indexes for user data. The application scenarios are as follows:

  • You want to add an index on a column of a table that already holds a large amount of data. The addIndicesWithData() API adds the index and generates index data for all existing user data, which is time-consuming. The addIndices() API adds the index but does not generate index data for existing user data. In this case, you can use the TableIndexer tool to build the index data for the existing user data.
  • If the index data is inconsistent with the user data, the tool can be used to rebuild index data.

    If you temporarily disable an index, write new data to the indexed column while the index is disabled, and then enable the index directly from the disabled state, the index data and user data may be inconsistent. In this case, you must rebuild all index data before using the index again.

  • You can use the TableIndexer tool to completely delete a large amount of existing index data from a user table.
  • For user tables that do not have indexes, this tool allows you to add and build indexes at the same time.

How to Use

  • Adding a new index to a user table

    The command is as follows:

    hbase org.apache.hadoop.hbase.hindex.mapreduce.TableIndexer -Dtablename.to.index=tablename -Dindexspecs.to.add='idx_0=>cf_0:[q_0->string],[q_1];cf_1:[q_2],[q_3]#idx_1=>cf_1:[q_4]'

    The parameters are as follows (required unless marked as optional):

    • tablename.to.index: Indicates the name of the table to which the index is added.
    • indexspecs.to.add: Indicates the mapping between each index name and the columns of the user table that the index covers.
    • scan.caching (optional): An integer indicating the number of rows cached for the scanner at a time when the data table is scanned.

    The elements of the preceding command and the related constraints are as follows:

    • idx_1: Indicates an index name.
    • cf_0: Indicates the name of a column family.
    • q_0: Indicates the name of a column.
    • string: Indicates a data type. The parameter value can be STRING, INTEGER, FLOAT, LONG, DOUBLE, SHORT, BYTE, or CHAR.
    • The number sign (#) separates indexes, the semicolon (;) separates column families, and the comma (,) separates column qualifiers.
    • Each column name, together with its data type, must be enclosed in square brackets ([]).
    • Column names and their data types are separated by '->'.
    • If the data type of a specific column is not specified, the default data type (string) is used.
    • If scan.caching is not configured, the default value 1000 is used.
    • The user table must exist.
    • The index specified in the table must not exist.
    • If a column family named d exists in the user table, you must use the TableIndexer tool to build index data.

    After the preceding command is executed, the specified indexes are added to the table and are in the INACTIVE state. This behavior is similar to that of the addIndices() API.
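
    The separator rules above can be sketched as a small shell script that assembles the indexspecs.to.add value piece by piece. This is only an illustration: join_by is a hypothetical helper, the table, index, column family, and column names are the placeholders from the preceding command, the value 500 for scan.caching is arbitrary, and passing scan.caching with -D is an assumption based on how the other parameters are passed. The script prints the command instead of running it.

```shell
# join_by SEP ITEM... : join the items with a single-character separator.
# (Hypothetical helper; the subshell body keeps the IFS change local.)
join_by() (
  IFS="$1"
  shift
  printf '%s\n' "$*"
)

# Columns of one column family: each in [], separated by commas.
cf0_cols=$(join_by , '[q_0->string]' '[q_1]')
cf1_cols=$(join_by , '[q_2]' '[q_3]')

# Column families of one index: separated by semicolons.
idx0="idx_0=>$(join_by ';' "cf_0:${cf0_cols}" "cf_1:${cf1_cols}")"
idx1="idx_1=>cf_1:[q_4]"

# Indexes: separated by the number sign.
index_specs=$(join_by '#' "$idx0" "$idx1")

# Print the command for review rather than executing it.
# Assumption: scan.caching is passed with -D like the other parameters.
printf '%s %s %s %s\n' \
  "hbase org.apache.hadoop.hbase.hindex.mapreduce.TableIndexer" \
  "-Dtablename.to.index=tablename" \
  "-Dscan.caching=500" \
  "-Dindexspecs.to.add='${index_specs}'"
```

    The assembled value is the same string that is passed to indexspecs.to.add in the preceding command.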

  • Creating index data for existing indexes in a user table

    The command is as follows:

    hbase org.apache.hadoop.hbase.hindex.mapreduce.TableIndexer -Dtablename.to.index=tablename -Dindexnames.to.build='idx_0#idx_1'

    The parameters are as follows (required unless marked as optional):

    • tablename.to.index: Indicates the name of the table for which index data is built.
    • indexnames.to.build: Indicates the names of the indexes for which index data is built.
    • scan.caching (optional): An integer indicating the number of rows cached for the scanner at a time when the data table is scanned.

    The parameters in the preceding command are described as follows:

    • idx_1: Indicates an index name.
    • The number sign (#) is used to separate index names.
    • If scan.caching is not configured, the default value 1000 is used.
    • The user table must exist.

    After the preceding command is executed, the specified indexes are set to the ACTIVE state and can be used when data is scanned.
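
    When the index names come from a script, the value of indexnames.to.build can be assembled rather than typed by hand. The following is a minimal sketch, not part of the tool: join_index_names and build_index_cmd are hypothetical helper names, and tablename, idx_0, and idx_1 are the placeholders used above. The command is printed, not executed.

```shell
# join_index_names NAME... : join index names with the number sign.
# (Hypothetical helper; the subshell body keeps the IFS change local.)
join_index_names() (
  IFS='#'
  printf '%s\n' "$*"
)

# build_index_cmd TABLE NAME... : print the command that builds index data
# for the given indexes of the given table.
build_index_cmd() (
  table="$1"
  shift
  names=$(join_index_names "$@")
  printf '%s\n' "hbase org.apache.hadoop.hbase.hindex.mapreduce.TableIndexer -Dtablename.to.index=${table} -Dindexnames.to.build='${names}'"
)

# Print the command for review.
build_index_cmd tablename idx_0 idx_1
```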

  • Deleting the existing indexes and their data from a user table

    The command is as follows:

    hbase org.apache.hadoop.hbase.hindex.mapreduce.TableIndexer -Dtablename.to.index=tablename -Dindexnames.to.drop='idx_0#idx_1'

    The parameters are as follows (required unless marked as optional):

    • tablename.to.index: Indicates the name of the table from which the indexes are deleted.
    • indexnames.to.drop: Indicates the names of the indexes to be deleted together with their data. The indexes must exist in the table.
    • scan.caching (optional): An integer indicating the number of rows cached for the scanner at a time when the data table is scanned.

    The parameters in the preceding command are described as follows:

    • idx_1: Indicates an index name.
    • The number sign (#) is used to separate index names.
    • If scan.caching is not configured, the default value 1000 is used.
    • The user table must exist.

    After the preceding command is executed, the specified indexes and their data are deleted from the table.

  • Adding new indexes to user tables and building data based on existing data

    The command is as follows:

    hbase org.apache.hadoop.hbase.hindex.mapreduce.TableIndexer -Dtablename.to.index=tablename -Dindexspecs.to.add='idx_0=>cf_0:[q_0->string],[q_1];cf_1:[q_2],[q_3]#idx_1=>cf_1:[q_4]' -Dindexnames.to.build='idx_0'

    • The parameters are the same as those described for the preceding commands.
    • The user table must exist.
    • The indexes specified in indexspecs.to.add must not exist in the table.
    • The index names specified in indexnames.to.build must exist in the table or be part of the value of indexspecs.to.add.

    After the preceding command is executed, all indexes specified in indexspecs.to.add are added to the table, and index data is built for all indexes specified in indexnames.to.build.
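
    Before running the combined command, it can help to check that a name passed to indexnames.to.build is declared in the indexspecs.to.add value. The following is a minimal sketch under stated assumptions: spec_index_names and in_spec are hypothetical helpers, not part of TableIndexer, and they only parse the string. A name that is missing from the specification may still be valid if the index already exists in the table, which this check cannot see.

```shell
# spec_index_names SPEC : print the index names declared in an
# indexspecs.to.add value, one per line.
spec_index_names() {
  # Split the indexes on '#', drop everything from "=>" onward,
  # and remove any whitespace around the name.
  printf '%s\n' "$1" | tr '#' '\n' | sed -e 's/=>.*$//' -e 's/[[:space:]]//g'
}

# in_spec NAME SPEC : succeed if NAME is declared in SPEC.
in_spec() {
  spec_index_names "$2" | grep -Fxq -- "$1"
}

# Placeholder specification taken from the command above.
spec='idx_0=>cf_0:[q_0->string],[q_1];cf_1:[q_2],[q_3]#idx_1=>cf_1:[q_4]'

if in_spec idx_0 "$spec"; then
  echo "idx_0 is declared in indexspecs.to.add"
else
  echo "idx_0 is not declared; it must already exist in the table"
fi
```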