
Introduction to HIndex

Updated at: Apr 28, 2020 GMT+08:00

Scenario

HBase is a distributed key-value store. Rows in a table are sorted lexicographically by rowkey. If you query data by a specific rowkey or scan a specific rowkey range, HBase can quickly locate the data to be read. In many cases, however, you need to query rows in which a column has a particular value. HBase provides filters for this purpose: rows are scanned in rowkey order and each row is compared against the specified column value until the required data is found. Because a filter has to scan data that is not needed in order to find the data that is, filters cannot meet the requirements of frequent, high-performance queries.

HBase HIndex is designed to address this issue. It enables HBase to index data by specific column values, which makes such queries faster, as shown in Figure 1.
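The difference between a filter scan and an index lookup can be illustrated with a toy model. The following sketch is not HBase or HIndex code: the table is a plain sorted list, the index is a plain dictionary, and all function names are hypothetical. It only shows why matching on a column value without an index touches every row.

```python
from collections import defaultdict

# Toy model only: a "table" is a list of (rowkey, column value) pairs sorted by
# rowkey, and the "index" is a dict from column value to rowkeys.
rows = sorted((f"row{i:04d}", i % 50) for i in range(1000))

def filter_scan(rows, wanted):
    """Visit every row in rowkey order, as a filter does; count rows examined."""
    examined = 0
    matches = []
    for rowkey, value in rows:
        examined += 1
        if value == wanted:
            matches.append(rowkey)
    return matches, examined

def build_index(rows):
    """Map each column value to the rowkeys that hold it."""
    index = defaultdict(list)
    for rowkey, value in rows:
        index[value].append(rowkey)
    return index

def index_lookup(index, wanted):
    """Read only the index entries for the wanted value."""
    matches = index.get(wanted, [])
    return matches, len(matches)

scan_matches, scan_cost = filter_scan(rows, 26)
idx_matches, idx_cost = index_lookup(build_index(rows), 26)
print(scan_cost, idx_cost)  # prints "1000 20"
```

Both approaches return the same rowkeys, but the scan examines all 1000 rows while the lookup reads only the 20 matching entries.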

Figure 1 HBase HIndex

Note the following when using HIndex:
  • Rolling upgrade is not supported for index data.
  • Composite index: All columns that participate in a composite index must be added or deleted together. Otherwise, data may be inconsistent.
  • Do not explicitly configure a split policy for a data table on which an index has been created.
  • Mutation operations such as increment and append are not supported.
  • Indexes on columns with maxVersions greater than 1 are not supported.
  • The value size of a column for which an index is added cannot exceed 32 KB.
  • When user data is deleted because the TTL of the column family has expired, the corresponding index data is not deleted immediately. It is deleted during major compaction.
  • After an index is created, the TTL of the user column family must not be changed.
    • If the TTL of the column family is changed to a larger value after the index is created, delete the index and create it again. Otherwise, some generated index data may be deleted before the user data is deleted.
    • If the TTL of the column family is changed to a smaller value after the index is created, the index data may be deleted later than the user data.
  • If a secondary index is created in the active cluster after disaster recovery is enabled for HBase tables, index table changes are not automatically synchronized to the standby cluster. To implement disaster recovery in this case, perform the following operations:
    1. After the secondary index is created in the active cluster, create a secondary index with the same schema and name using the same method in the standby cluster.
    2. In the active cluster, manually set REPLICATION_SCOPE of the index column family (default value: d) to 1.

Parameter settings

  1. On the MRS cluster details page, click Components.

    For MRS 1.8.10 or earlier, log in to MRS Manager. For details, see Accessing MRS Manager. Then, choose Services.

  2. Choose HBase > Service Configuration and set Type to All. The HBase configuration page is displayed.

Navigation Path: HMaster > System
Configuration Item: hbase.coprocessor.master.classes
Default Value: org.apache.hadoop.hbase.hindex.server.master.HIndexMasterCoprocessor
Description: This coprocessor is used to handle Master-level operations after the HIndex function is enabled, for example, creating an index meta table, adding an index, and deleting an index, a table, and index metadata.

Navigation Path: RegionServer > RegionServer
Configuration Item: hbase.coprocessor.regionserver.classes
Default Value: org.apache.hadoop.hbase.hindex.server.regionserver.HIndexRegionServerCoprocessor
Description: This coprocessor is used to handle the operations that the Master delivers to RegionServer after the HIndex function is enabled.

Navigation Path: RegionServer > RegionServer
Configuration Item: hbase.coprocessor.region.classes
Default Value: org.apache.hadoop.hbase.hindex.server.regionserver.HIndexRegionCoprocessor
Description: This coprocessor is used to operate data in the Region after the HIndex function is enabled.

Navigation Path: RegionServer > RegionServer
Configuration Item: hbase.coprocessor.wal.classes
Default Value: org.apache.hadoop.hbase.hindex.server.regionserver.HIndexWALCoprocessor
Description: This coprocessor is used for replication. It filters out index data so that index data is not sent to the peer cluster. The peer cluster generates index data by itself.

1. The default values are the values that must be configured to enable the HBase HIndex function. They are already configured by default in MRS clusters that support the HBase HIndex function.

2. Ensure that the master parameter is configured on HMaster and the region and regionserver parameters are configured on RegionServer.

Related APIs

The HIndex APIs are provided by the org.apache.hadoop.hbase.hindex.client.HIndexAdmin class. The related APIs are described below.

Operation: Add an index.

API: addIndices()

Description: Adds an index without building index data. Calling this API adds the specified index to a table but skips index data generation. Therefore, after this operation, the index cannot be used for scanning and filtering. This API applies to scenarios where you want to add indexes in batches to tables that already hold a large amount of user data. In that case, use an external tool such as the TableIndexer tool to build the index data.

Precautions (apply to addIndices() and addIndicesWithData()):

  1. An index cannot be modified once it is added. To modify the index, you need to delete the old index and then create a new one again.
  2. Do not create two indexes on the same column with different index names. Otherwise, storage and processing resources will be wasted.
  3. Indexes cannot be added to a system table.
  4. The append and increment operations are not supported when data is put into the index column.
  5. If the client encounters any exception other than DoNotRetryIOException, retry the operation.
  6. The index column family is chosen from the following candidates in order. The first one that is available is used:
    1. The default index column family, d. If hindex.default.family.name is set, its value is used instead.
    2. One of the symbols #, @, $, or %
    3. #0, @0, $0, %0, #1, @1, and so on up to #255, @255, $255, %255
    4. If none of the above is available, an exception is thrown.
  7. You can use the HIndex TableIndexer tool to add indexes without building index data.
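The order in which the index column family name is chosen (precaution 6) can be sketched as follows. This is an illustration only, not HIndex code. It assumes that "available" means the name is not already used by a column family of the table and that the numbered names are written without spaces. The function names are hypothetical.

```python
def candidate_family_names(default_family="d"):
    """Yield candidate index column family names in the documented order."""
    # 1. The default family "d", or the value of hindex.default.family.name.
    yield default_family
    # 2. The single symbols.
    symbols = ["#", "@", "$", "%"]
    yield from symbols
    # 3. #0, @0, $0, %0, #1, @1, ... up to %255.
    for n in range(256):
        for symbol in symbols:
            yield f"{symbol}{n}"

def select_index_family(existing_families, default_family="d"):
    """Return the first candidate not already taken (assumed meaning of 'available')."""
    taken = set(existing_families)
    for name in candidate_family_names(default_family):
        if name not in taken:
            return name
    # 4. No candidate is left.
    raise RuntimeError("no index column family name is available")

print(select_index_family(["info"]))        # prints "d"
print(select_index_family(["info", "d"]))   # prints "#"
```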

API: addIndicesWithData()

Description: Adds an index and builds its index data. This method adds the specified index to the table and creates index data for the existing user data. It can also be called before user data is stored, in which case index data is generated as the user data is stored. After this operation, the index can be used for scanning and filtering immediately.

Operation: Delete an index.

API: dropIndices()

Description: Deletes an index only. This API deletes the specified index from the table but does not delete the corresponding index data. After this operation, the index can no longer be used for scanning and filtering. The cluster automatically deletes the old index data during major compaction.

This API applies to scenarios where a table contains a large amount of index data and dropIndicesWithData() cannot be used. You can also use the TableIndexer tool to delete indexes and index data.

Precautions (apply to dropIndices() and dropIndicesWithData()):

  1. An index can be deleted when it is in the ACTIVE, INACTIVE, or DROPPING state.
  2. If you use dropIndices() to delete an index, ensure that the index data has been deleted (that is, major compaction has been completed) before you add an index with the same name to the table again.
  3. If you delete an index, the following information will be deleted:
    1. A column family with an index
    2. Any one of column families in a combination index
  4. Indexes and index data can be deleted together using the HIndex TableIndexer tool.

API: dropIndicesWithData()

Description: Deletes an index together with its index data. This API deletes the specified index and all index data corresponding to the index in the user table. After this operation, the index is completely removed from the table and is no longer used for scanning and filtering.

Operation: Enable or disable an index.

API: disableIndices()

Description: Disables all indexes specified by the user so that they are no longer used for scanning and filtering.

Precautions (apply to disableIndices() and enableIndices()):

  1. An index can be enabled when it is in the ACTIVE, INACTIVE, or BUILDING state.
  2. An index can be disabled when it is in the ACTIVE or INACTIVE state.
  3. Before disabling an index, ensure that the index data is consistent with the user data. If no new data is added to the table while the index is disabled, the index data remains consistent with the user data.
  4. When enabling an index, you can use the TableIndexer tool to build index data to ensure data consistency.

API: enableIndices()

Description: Enables all indexes specified by the user so that they can be used for scanning and filtering.

Operation: View the created indexes.

API: listIndices()

Description: Lists all indexes of a specified table.

Precautions: None
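The state preconditions stated in the precautions above can be summarized as a lookup table. The following sketch is not part of the HIndexAdmin API; the names ALLOWED_STATES and can_apply are hypothetical. The "drop" entry assumes that the first precaution for deleting an index refers to deletion.

```python
# Index states in which each operation is permitted, taken from the precautions.
ALLOWED_STATES = {
    "enable": {"ACTIVE", "INACTIVE", "BUILDING"},
    "disable": {"ACTIVE", "INACTIVE"},
    # Assumption: the delete precaution lists the states that allow deletion.
    "drop": {"ACTIVE", "INACTIVE", "DROPPING"},
}

def can_apply(operation, state):
    """Return True if the operation is permitted for an index in the given state."""
    return state in ALLOWED_STATES[operation]

print(can_apply("disable", "BUILDING"))  # prints "False"
print(can_apply("enable", "BUILDING"))   # prints "True"
```

A client could run such a check against the state reported by listIndices() before it calls an enable, disable, or drop API.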

Querying data based on indexes

You can use a filter to query data in a user table with an index. The query result of a user table with a single or combination index is the same as that of a table without an index, but the table with an index provides higher data query performance than the table without an index.

The index usage rules are as follows:

1. Scenario 1: Single-column indexes are created for one or more columns.

When these columns are used for AND or OR query filtering, the indexes can improve query performance.

Example: Filter_Condition(IndexCol1) AND/OR Filter_Condition(IndexCol2)

When you use "Index Column AND Non-Index Column" for filtering in the query, the index can improve query performance.

Example: Filter_Condition(IndexCol1) AND Filter_Condition(IndexCol2) AND Filter_Condition(NonIndexCol1)

When you use "Index Column OR Non-Index Column" for filtering in the query, the index is not used and query performance is not improved.

Example: Filter_Condition(IndexCol1) AND/OR Filter_Condition(IndexCol2) OR Filter_Condition(NonIndexCol1)

2. Scenario 2: A combination index is created for multiple columns.

When the columns to be queried are all or part of the combination index and appear in the same order as in the combination index, using the index improves query performance.

For example, create a combination index for C1, C2, and C3 (IndexCol1, IndexCol2, and IndexCol3 in the conditions below). The index takes effect in the following situations:

Filter_Condition(IndexCol1) AND Filter_Condition(IndexCol2) AND Filter_Condition(IndexCol3)

Filter_Condition(IndexCol1) AND Filter_Condition(IndexCol2)

Filter_Condition(IndexCol1)

The index does not take effect in the following situations:

Filter_Condition(IndexCol2) AND Filter_Condition(IndexCol3)

Filter_Condition(IndexCol1) AND Filter_Condition(IndexCol3)

Filter_Condition(IndexCol2)

Filter_Condition(IndexCol3)

When you use "Index Column AND Non-Index Column" for filtering in the query, the index can improve query performance.

Example:

Filter_Condition(IndexCol1) AND Filter_Condition(NonIndexCol1)

Filter_Condition(IndexCol1) AND Filter_Condition(IndexCol2) AND Filter_Condition(NonIndexCol1)

When you use "Index Column OR Non-Index Column" for filtering in the query, the index is not used and query performance is not improved.

Example:

Filter_Condition(IndexCol1) OR Filter_Condition(NonIndexCol1)

(Filter_Condition(IndexCol1) AND Filter_Condition(IndexCol2)) OR (Filter_Condition(NonIndexCol1))

When multiple columns are used for query, you can specify a value range only for the last column in the combination index. The other columns must be set to specific values.

For example, create a combination index for C1, C2, and C3. In a range query, only the value range of C3 can be set. The filter criteria are "C1 = XXX, C2 = XXX, and C3 = Value range."
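The combination index rules above can be checked with a small sketch. It is a reading of the listed examples, not the HIndex implementation. It covers AND-only filters and does not model OR conditions. The function names are hypothetical.

```python
def composite_index_applies(index_cols, and_filter_cols):
    """Return True if a combination index can serve an AND-only filter.

    The filtered columns that belong to the index must be exactly its first
    k columns (k >= 1). Non-index columns combined with AND are ignored.
    """
    index_cols = list(index_cols)
    used = {col for col in and_filter_cols if col in index_cols}
    k = len(used)
    return k > 0 and used == set(index_cols[:k])

def range_allowed(index_cols, col):
    """Only the last column of the combination index may carry a value range."""
    return col == list(index_cols)[-1]

index = ["C1", "C2", "C3"]
print(composite_index_applies(index, ["C1", "C2"]))  # prints "True"
print(composite_index_applies(index, ["C1", "C3"]))  # prints "False"
print(range_allowed(index, "C3"))                    # prints "True"
```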

Best query policy

Use SingleColumnValueFilter or SingleColumnRangeFilter. Either filter specifies a definite column_family:qualifier pair (referred to as col1 below) in the filter criteria.

Any index in the table whose first index column is col1 is a candidate index for the query. Example:

If there is an index on col1, it is a candidate index because col1 is the first and only column of the index. If there is another index on col1 and col2, it is also a candidate index because col1 is the first column in its column list. On the other hand, an index on col2 and col1 is not a candidate index because the first column in its column list is not col1.

When there are multiple candidate indexes, the most suitable one for scanning data must be selected from them.

The following guidelines show how to select the best index.

1. Prefer an index that fully matches the query.

Scenario: There are two indexes available, one for col1&col2 and the other for col1.

In this scenario, the second index is better than the first one, because it scans less index data.

2. If there are multiple candidate multi-column indexes, select an index with fewer index columns.

Scenario: There are two indexes available, one for col1&col2 and the other for col1&col2&col3.

In this case, it is better to use the index on col1 and col2, because it scans less index data.
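The two guidelines can be expressed as a short selection routine. This is one reading of the guidelines above, not the selection logic that HIndex actually runs. Indexes are modeled as tuples of column names, and the function names are hypothetical.

```python
def candidate_indexes(indexes, col1):
    """Keep only the indexes whose first column is col1."""
    return [idx for idx in indexes if idx and idx[0] == col1]

def pick_index(indexes, filter_cols):
    """Pick an index for a filter whose first column is filter_cols[0].

    A full match with the filter columns is preferred. Otherwise the
    candidate with the fewest columns is chosen.
    """
    candidates = candidate_indexes(indexes, filter_cols[0])
    if not candidates:
        return None
    target = tuple(filter_cols)
    # False sorts before True, so a full match wins before length is compared.
    return min(candidates, key=lambda idx: (tuple(idx) != target, len(idx)))

# Guideline 1: the index on col1 alone is chosen for a filter on col1.
print(pick_index([("col1", "col2"), ("col1",)], ["col1"]))
# Guideline 2: the index with fewer columns is chosen.
print(pick_index([("col1", "col2"), ("col1", "col2", "col3")], ["col1", "col2"]))
```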

1. During a query based on an index, the index state must be ACTIVE. You can invoke the listIndices() API to view the index state.

2. To ensure that index-based queries return correct data, keep the index data consistent with the user data.

3. Run the following command to perform a complex query on the HBase shell client (assuming that an index has been created for the specified column):

scan 'tablename', {FILTER => "SingleColumnValueFilter(family, qualifier, compareOp, comparator, filterIfMissing, latestVersionOnly)"}

Example: scan 'test', {FILTER => "SingleColumnValueFilter('info', 'age', =, 'binary:26', true, true)"}

In the preceding scenario, if the result must include rows in which the queried column does not exist, do not create an index on that column. When SingleColumnValueFilter (SCVF) scans an indexed column, rows that lack the column are filtered out. By contrast, when an SCVF whose filterIfMissing is set to false (the default value) scans a non-indexed column, rows that lack the column are also returned in the result. Therefore, to avoid inconsistent query results, you are advised to set filterIfMissing to true when creating an SCVF for an indexed column.

4. Run the following command in hbase shell to view the index data created for user data:

scan 'tablename', {ATTRIBUTES => {'FETCH_INDEX_DATA' => 'true'}}
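The filterIfMissing behavior described in note 3 can be illustrated with a toy model. The sketch below does not call HBase. Rows are plain dictionaries and both functions are hypothetical stand-ins for a SingleColumnValueFilter scan over a non-indexed column and over an indexed column.

```python
def scvf_non_indexed(rows, column, value, filter_if_missing=False):
    """Toy SCVF over a non-indexed column; each row is a dict of column values."""
    result = []
    for row in rows:
        if column not in row:
            # A row that lacks the column passes unless filter_if_missing is True.
            if not filter_if_missing:
                result.append(row["id"])
            continue
        if row[column] == value:
            result.append(row["id"])
    return result

def scvf_indexed(rows, column, value):
    """Toy index-backed scan: rows that lack the column never appear in the index."""
    return [row["id"] for row in rows if column in row and row[column] == value]

rows = [{"id": "r1", "age": "26"}, {"id": "r2", "age": "30"}, {"id": "r3"}]
print(scvf_non_indexed(rows, "age", "26"))        # prints "['r1', 'r3']"
print(scvf_non_indexed(rows, "age", "26", True))  # prints "['r1']"
print(scvf_indexed(rows, "age", "26"))            # prints "['r1']"
```

With filterIfMissing set to true, the non-indexed and indexed scans return the same rows, which is why that setting is advised for indexed columns.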
