Updated on 2023-11-03 GMT+08:00

Introduction to HIndex

Scenarios

HBase is a distributed storage database of the Key-Value type. Data in tables is sorted by dictionary based on row keys. If you query data by specifying a row key or scan data in a specific row key range, HBase can help you quickly locate the data to be read. In most cases, you need to query data whose column value is XXX. HBase provides the filter function to enable you to query data with a specific column value. All data is scanned in the sequence of row keys and is matched with the specific column value until the required data is found. To obtain the required data, the filter will scan some unnecessary data. As a result, the filter function cannot meet the requirements for high-performance, frequent queries.

HBase HIndex is designed to address these issues. HBase HIndex provides HBase with the capability of indexing based on specific column values, making queries faster.

Figure 1 HBase HIndex
  • Rolling upgrade is not supported for index data.
  • Composite index: You must add or delete all columns that participate in composite indexes. Otherwise, the data may be inconsistent.
  • You should not explicitly configure any split policy to a data table where an index has been created.
  • The mutation operations are not supported, such as increment and append.
  • Index of the column with maxVersions greater than 1 is not supported.
  • The value size of a column for which an index is added cannot exceed 32 KB.
  • When the user data is deleted because TTL of the column family is invalid, the corresponding index data will not be deleted immediately. The index data will be deleted during major compaction.
  • After an index is created, the TTL of the user column family must not be changed.
    • If the TTL of the column family is changed to a larger value after an index is created, delete the index and create one again. Otherwise, some generated index data may be deleted before the deletion of user data.
    • If the TTL of the column family is changed to a smaller value after an index is created, the index may be deleted after the deletion of user data.
  • After disaster recovery is enabled for HBase tables, a secondary index is created in the active cluster and index table changes are not automatically synchronized to the standby cluster. To implement disaster recovery in this case, perform the following operations:
    1. After the secondary index is created in the active table, create a secondary index with the same schema and name using the same method in the standby cluster.
    2. In the active cluster, manually set REPLICATION_SCOPE of the index column family (default value: d) to 1.

Parameter Configuration

  1. Log in to the MRS console, click a cluster name and choose Components.
  2. Go to the All Configurations page of the HBase service. For details, see Modifying Cluster Service Configuration Parameters.

  3. View parameters on the HBase configurations page.

    Navigation Path

    Parameter

    Default Value

    Description

    HMaster > System

    hbase.coprocessor.master.classes

    org.apache.hadoop.hbase.hindex.server.master.HIndexMasterCoprocessor,com.xxx.hadoop.hbase.backup.services.RecoveryCoprocessor,org.apache.ranger.authorization.hbase.RangerAuthorizationCoprocessor,org.apache.hadoop.hbase.security.access.ReadOnlyClusterEnabler,org.apache.hadoop.hbase.rsgroup.RSGroupAdminEndpoint

    This coprocessor is used to handle Master-level operations after the HIndex function is enabled, for example, creating an index meta table, adding an index, and deleting an index, a table, and index metadata.

    RegionServer > RegionServer

    hbase.coprocessor.regionserver.classes

    org.apache.hadoop.hbase.hindex.server.regionserver.HIndexRegionServerCoprocessor,org.apache.hadoop.hbase.JMXListener,org.apache.ranger.authorization.hbase.RangerAuthorizationCoprocessor

    This coprocessor is used to handle the operations that the Master delivers to RegionServer after the HIndex function is enabled.

    hbase.coprocessor.region.classes

    org.apache.hadoop.hbase.hindex.server.regionserver.HIndexRegionCoprocessor,org.apache.hadoop.hbase.security.token.TokenProvider,com.xxx.hadoop.hbase.backup.services.RecoveryCoprocessor,org.apache.ranger.authorization.hbase.RangerAuthorizationCoprocessor,org.apache.hadoop.hbase.security.access.SecureBulkLoadEndpoint,org.apache.hadoop.hbase.security.access.ReadOnlyClusterEnabler,org.apache.hadoop.hbase.coprocessor.MetaTableMetrics

    This coprocessor is used to operate data in the Region after the HIndex function is enabled.

    hbase.coprocessor.wal.classes

    org.apache.hadoop.hbase.hindex.server.regionserver.HIndexRegionServerCoprocessor,org.apache.hadoop.hbase.JMXListener,org.apache.ranger.authorization.hbase.RangerAuthorizationCoprocessor

    This coprocessor is used for Replication, which filters index data to prevent the index data from being sent to the peer cluster. The peer cluster generates index data by itself.

    This parameter is supported only in versions earlier than MRS 3.x.

    1. The preceding default values need to be configured after the HBase HIndex function is enabled. In MRS clusters that support the HBase HIndex function, the values have been configured by default.
    2. Ensure that the Master parameter is configured on HMaster and the region and regionserver parameters are configured on RegionServer.

Related Interfaces

The APIs that use HIndex are in the org.apache.hadoop.hbase.hindex.client.HIndexAdmin class. The following table describes the related APIs.

Operation

API

Description

Precautions

Add an index.

addIndices()

Add an index to a table without data. Calling this API will add the specified index to a table but skips index data generation. Therefore, after this operation, the index cannot be used for the scanning and filtering operations. This API applies to scenarios where users want to add indexes in batches to tables that have a large amount of pre-existing user data. The specific operation is to use external tools such as the TableIndexer tool to build index data.

  • An index cannot be modified once it is added. To modify the index, you need to delete the old index and then create a new one.
  • Do not create two indexes on the same column with different index names. Otherwise, storage and processing resources will be wasted.
  • Indexes cannot be added to a system table.
  • The append and increment operations are not supported when data is put into the index column.
  • If any fault occurs on the client except DoNotRetryIOException, you need to try again.
  • An index column family is selected from the following conditions in sequence based on availability:
    • Typically, the default index column family is d. However, if the value of hindex.default.family.name is set, the value will be used.
    • Symbol #, @, $, or %
    • #0, @ 0, $ 0, %0, #1, @ 1 ...to #255, @ 255, $ 255, %255
    • Throw exceptions.
  • You can use the HIndex TableIndexer tool to add indexes without building index data.

addIndicesWithData()

Add an index to a table with data. This API is used to add the specified index to the table and create index data for the existing user data. Alternatively, the API can be called to generate an index and then generate index data when the user data is being stored. Therefore, after this operation, the index can be used for the scanning and filtering operations immediately.

Delete an index.

dropIndices()

This API is used to delete an index only. It deletes the specified index from a table but skips the corresponding index data. After this operation, the index cannot be used for the scanning and filtering operations. The cluster automatically deletes old index data during major compaction.

This API applies to scenarios where a table contains a large amount of index data and dropIndicesWithData() is unavailable. In addition, you can use the TableIndexer tool to delete indexes and index data.

  • An index can be disabled when it is in the ACTIVE, INACTIVE, or DROPPING state.
  • If you use dropIndices() to delete an index, ensure that the index data has been deleted before the index is added to the table with the same index name (that is, major compaction has been completed).
  • If you delete an index, the following information will also be deleted:
    • A column family with an index
    • Any one of column families in a combination index
  • Indexes and index data can be deleted together using the HIndex TableIndexer tool.

dropIndicesWithData()

Delete index data. This API deletes the specified index and all index data corresponding to the index in a user table. After this operation, the index is completely deleted from the table and is no longer used for the scanning and filtering operations.

Enable/Disable an index.

disableIndices()

This API disables all indexes specified by a user so that they are no longer used for the scanning and filtering operations.

  • An index can be enabled when the index is in the ACTIVE, INACTIVE, or BUILDING state.
  • An index can be disabled when the index is in the ACTIVE or INACTIVE state.
  • Before disabling an index, ensure that the index data is consistent with the user data. If no new data is added to the table when the index is disabled, the index data is consistent with the user data.
  • When enabling an index, you can use the TableIndexer tool to build index data to ensure data consistency.

enableIndices()

This API enables all indexes specified by a user so that they can be used for the scanning and filtering operations.

View the created index.

listIndices()

This API is used to list all indexes of a specified table.

N/A

Querying Data Based on Indexes

You can use a filter to query data in a user table with an index. The query result of a user table with a single or combination index is the same as that of a table without an index, but the table with an index provides higher data query performance than the table without an index.

The index usage rules are as follows:

  • Scenario 1: A single index is created for one or more columns.
    • When this column is used for AND or OR query filtering, an index can improve query performance.

      Example: Filter_Condition(IndexCol1)AND / OR Filter_Condition(IndexCol2)

    • When you use Index Column AND Non-Index Column for filtering in the query, the index can improve query performance.

      Example: Filter_Condition(IndexCol1)AND Filter_Condition(IndexCol2)AND Filter_Condition(NonIndexCol1)

    • When you use Index Column OR Non-Index Column for filtering in the query but do not use an index, query performance will not be improved.

      Example: Filter_Condition(IndexCol1)AND / OR Filter_Condition(IndexCol2) OR Filter_Condition(NonIndexCol1)

  • Scenario 2: A combination index is created for multiple columns.
    • When the columns to be queried are all or part of the combination index and have the same order as the combination index, using the index improves query performance.

      For example, create a combination index for C1, C2, and C3.

      • The index takes effect in the following situations:

        Filter_Condition(IndexCol1)AND Filter_Condition(IndexCol2)AND Filter_Condition(IndexCol3)

        Filter_Condition(IndexCol1)AND Filter_Condition(IndexCol2)

        FILTER_CONDITION(IndexCol1)

      • The index does not take effect in the following situations:

        Filter_Condition(IndexCol2)AND Filter_Condition(IndexCol3)

        Filter_Condition(IndexCol1)AND Filter_Condition(IndexCol3)

        FILTER_CONDITION(IndexCol2)

        FILTER_CONDITION(IndexCol3)

    • When you use Index Column AND Non-Index Column for filtering in the query, the index can improve query performance.

      Examples:

      Filter_Condition(IndexCol1)AND Filter_Condition(NonIndexCol1)

      Filter_Condition(IndexCol1)AND Filter_Condition(IndexCol2)AND Filter_Condition(NonIndexCol1)

    • When you use Index Column OR Non-Index Column for filtering in the query but do not use an index, query performance will not be improved.

      Examples:

      Filter_Condition(IndexCol1)OR Filter_Condition(NonIndexCol1)

      (Filter_Condition(IndexCol1)AND Filter_Condition(IndexCol2))OR(Filter_Condition(NonIndexCol1))

    • When multiple columns are used for query, you can specify a value range for only the last column in the combination index and set other columns to specified values

      For example, create a combination index for C1, C2, and C3. In a range query, only the value range of C3 can be set. The filter criteria are "C1 = XXX, C2 = XXX, and C3 = Value range."

Query Policy Selection

Use SingleColumnValueFilter or SingleColumnRangeFilter. It will provide the definite value column_family:qualifierpair (called col1) in filter criteria.

If col1 is the first index column in the table, any index in the table can be a candidate index used during the query. The following provides an example:

If there is an index on col1, the index can be used as a candidate index because col1 is the first and the only column of the index. If there is another index on col1 and col2, you can consider this index as a candidate index because col1 is the first column in the index list. However, if there is an index on col2 and col1, this index cannot be used as a candidate index because the first column in the index list is not col1.

The most suitable method to use the index now is that when there are multiple candidate indexes, select the most suitable index for scanning data.

You can use the following solutions to learn how to select the best index policy.

  • Use the fully matched index.

    Scenario: There are two indexes available, one for col1&col2 and the other for col1.

    In this case, the second index is better than the first one, because it scans less index data.

  • If there are multiple candidate multi-column indexes, select an index with fewer index columns.

    Scenario: There are two indexes available, one for col1&col2 and the other for col1&col2&col3.

    In this case, use the index on col1&col2, because it scans less index data.

  • During a query based on an index, the index state must be ACTIVE. You can call the listIndices() API to view the index state.
  • To query the correct data based on the index, ensure the consistency between index data and user data.
  • Run the following command to perform a complex query on the HBase shell client (assuming that an index has been created for the specified column):

    scan 'tablename', {FILTER => "SingleColumnValueFilter(family, qualifier, compareOp, comparator, filterIfMissing, latestVersionOnly)"}

    Example: scan 'test', {FILTER => "SingleColumnValueFilter('info', 'age', =, 'binary:26', true, true)"}

    In the preceding scenario, if you want to save the row where no column is found in the result, you should not create any index in any such column, because if the column to be queried does not exist, the row will be filtered out when SCVF is used to scan the index columns. When the SCVF whose filterIfMissingset is false (default value) scans non-index columns, rows where no column is queried will also be returned in the result. Therefore, to avoid inconsistent query results, you are advised to set filterIfMissing to true after creating SCVF for the index column.

  • Run the following command on the HBase shell client to view the index data created for user data:

    scan 'tablename', {ATTRIBUTES => {'FETCH_INDEX_DATA' => 'true'}}