
Introduction to HIndex

Updated on 2024-07-19 GMT+08:00

Scenarios

HBase is a distributed key-value storage database. Data in a table is sorted lexicographically by row key. If you query data by specifying a row key or scan data within a specific row key range, HBase can quickly locate the data to be read. In most cases, however, you need to query data by column value. HBase provides filters for this purpose, but a filter scans all data in row key order and matches each row against the specified column value until the required data is found. Because the filter scans unnecessary data in the process, it cannot meet the requirements of frequent, high-performance queries.

HBase HIndex is designed to address these issues. HBase HIndex provides HBase with the capability of indexing based on specific column values, making queries faster.
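The difference between the two access paths can be illustrated with a toy in-memory model. This is only a sketch: the sample table and the helpers scan_with_filter, build_index, and query_with_index are hypothetical and do not use the HBase or HIndex API.

```python
# Toy model only: not the HBase API. Rows are visited in row-key order,
# as in an HBase table.
rows = {
    "row001": {"info:age": "26", "info:name": "Ann"},
    "row002": {"info:age": "31", "info:name": "Bob"},
    "row003": {"info:age": "26", "info:name": "Cid"},
    "row004": {"info:age": "45", "info:name": "Dee"},
}


def scan_with_filter(table, column, value):
    """Filter-style query: visit every row in row-key order and test the column."""
    visited = 0
    matches = []
    for row_key in sorted(table):
        visited += 1
        if table[row_key].get(column) == value:
            matches.append(row_key)
    return matches, visited


def build_index(table, column):
    """Map each column value to the sorted row keys that hold it."""
    index = {}
    for row_key in sorted(table):
        if column in table[row_key]:
            index.setdefault(table[row_key][column], []).append(row_key)
    return index


def query_with_index(index, value):
    """Index-style query: only the matching row keys are touched."""
    matches = index.get(value, [])
    return matches, len(matches)


age_index = build_index(rows, "info:age")
filter_result = scan_with_filter(rows, "info:age", "26")
index_result = query_with_index(age_index, "26")
```

Both queries return the same row keys, but the filter-style scan visits every row of the toy table while the index-style lookup touches only the matching rows.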

Figure 1 HBase HIndex
NOTE:
  • Rolling upgrade is not supported for index data.
  • Composite index: All columns that participate in a composite index must be added or deleted together. Otherwise, the data may be inconsistent.
  • Do not explicitly configure a split policy for a data table on which an index has been created.
  • Mutation operations such as increment and append are not supported.
  • Indexes are not supported on columns whose maxVersions is greater than 1.
  • The value size of a column for which an index is added cannot exceed 32 KB.
  • When user data is deleted because the TTL of the column family has expired, the corresponding index data is not deleted immediately. The index data is deleted during major compaction.
  • After an index is created, the TTL of the user column family must not be changed.
    • If the TTL of the column family is changed to a larger value after an index is created, delete the index and create it again. Otherwise, some generated index data may be deleted before the corresponding user data is deleted.
    • If the TTL of the column family is changed to a smaller value after an index is created, the index data may be deleted later than the user data.
  • If a secondary index is created in the active cluster after disaster recovery is enabled for HBase tables, index table changes are not automatically synchronized to the standby cluster. To implement disaster recovery in this case, perform the following operations:
    1. After the secondary index is created in the active cluster, create a secondary index with the same schema and name using the same method in the standby cluster.
    2. In the active cluster, manually set REPLICATION_SCOPE of the index column family (default value: d) to 1.

Parameter Configuration

  1. Log in to the MRS console, click a cluster name and choose Components.
  2. Go to the All Configurations page of the HBase service. For details, see Modifying Cluster Service Configuration Parameters.

  3. View parameters on the HBase configurations page.

    Navigation path: HMaster > System

      Parameter: hbase.coprocessor.master.classes

      Default value: org.apache.hadoop.hbase.hindex.server.master.HIndexMasterCoprocessor,com.xxx.hadoop.hbase.backup.services.RecoveryCoprocessor,org.apache.ranger.authorization.hbase.RangerAuthorizationCoprocessor,org.apache.hadoop.hbase.security.access.ReadOnlyClusterEnabler,org.apache.hadoop.hbase.rsgroup.RSGroupAdminEndpoint

      Description: This coprocessor is used to handle Master-level operations after the HIndex function is enabled, for example, creating an index meta table, adding an index, and deleting an index, a table, and index metadata.

    Navigation path: RegionServer > RegionServer

      Parameter: hbase.coprocessor.regionserver.classes

      Default value: org.apache.hadoop.hbase.hindex.server.regionserver.HIndexRegionServerCoprocessor,org.apache.hadoop.hbase.JMXListener,org.apache.ranger.authorization.hbase.RangerAuthorizationCoprocessor

      Description: This coprocessor is used to handle the operations that the Master delivers to RegionServer after the HIndex function is enabled.

      Parameter: hbase.coprocessor.region.classes

      Default value: org.apache.hadoop.hbase.hindex.server.regionserver.HIndexRegionCoprocessor,org.apache.hadoop.hbase.security.token.TokenProvider,com.xxx.hadoop.hbase.backup.services.RecoveryCoprocessor,org.apache.ranger.authorization.hbase.RangerAuthorizationCoprocessor,org.apache.hadoop.hbase.security.access.SecureBulkLoadEndpoint,org.apache.hadoop.hbase.security.access.ReadOnlyClusterEnabler,org.apache.hadoop.hbase.coprocessor.MetaTableMetrics

      Description: This coprocessor is used to operate data in the Region after the HIndex function is enabled.

      Parameter: hbase.coprocessor.wal.classes

      Default value: org.apache.hadoop.hbase.hindex.server.regionserver.HIndexRegionServerCoprocessor,org.apache.hadoop.hbase.JMXListener,org.apache.ranger.authorization.hbase.RangerAuthorizationCoprocessor

      Description: This coprocessor is used for Replication, which filters index data to prevent the index data from being sent to the peer cluster. The peer cluster generates index data by itself. This parameter is supported only in versions earlier than MRS 3.x.

    NOTE:

    1. The preceding default values need to be configured after the HBase HIndex function is enabled. In MRS clusters that support the HBase HIndex function, the values have been configured by default.

    2. Ensure that the master parameter is configured on HMaster and the region and regionserver parameters are configured on RegionServer.
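As a rough way to review exported configuration values against the parameters listed above, the following sketch checks that each coprocessor parameter contains its HIndex class. The helper missing_hindex_coprocessors and the sample dictionary are hypothetical; the code only parses strings and does not read a live cluster configuration.

```python
# Hypothetical helper: checks that each coprocessor parameter lists the
# HIndex class shown in the default values above. It only parses strings.
REQUIRED = {
    "hbase.coprocessor.master.classes":
        "org.apache.hadoop.hbase.hindex.server.master.HIndexMasterCoprocessor",
    "hbase.coprocessor.regionserver.classes":
        "org.apache.hadoop.hbase.hindex.server.regionserver.HIndexRegionServerCoprocessor",
    "hbase.coprocessor.region.classes":
        "org.apache.hadoop.hbase.hindex.server.regionserver.HIndexRegionCoprocessor",
}


def missing_hindex_coprocessors(config):
    """Return the parameters whose comma-separated value lacks the HIndex class."""
    missing = []
    for param, required_class in REQUIRED.items():
        configured = [c.strip() for c in config.get(param, "").split(",") if c.strip()]
        if required_class not in configured:
            missing.append(param)
    return missing


# Sample values (assumed for illustration): the region parameter is incomplete.
sample_config = {
    "hbase.coprocessor.master.classes":
        "org.apache.hadoop.hbase.hindex.server.master.HIndexMasterCoprocessor,"
        "org.apache.hadoop.hbase.rsgroup.RSGroupAdminEndpoint",
    "hbase.coprocessor.regionserver.classes":
        "org.apache.hadoop.hbase.hindex.server.regionserver.HIndexRegionServerCoprocessor",
    "hbase.coprocessor.region.classes":
        "org.apache.hadoop.hbase.security.token.TokenProvider",
}
missing = missing_hindex_coprocessors(sample_config)
```

For the sample values, the returned list names only hbase.coprocessor.region.classes.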

Related Interfaces

The APIs that use HIndex are in the org.apache.hadoop.hbase.hindex.client.HIndexAdmin class. The following table describes the related APIs.

Operation: Add an index.

  API: addIndices()

  Description: Adds an index without building index data. Calling this API adds the specified index to a table but skips index data generation. Therefore, after this operation, the index cannot be used for scanning and filtering operations. This API applies to scenarios where you want to add indexes in batches to tables that already hold a large amount of user data. In this case, use an external tool such as the TableIndexer tool to build the index data.

  API: addIndicesWithData()

  Description: Adds an index and builds index data. This API adds the specified index to the table and creates index data for the existing user data. Alternatively, you can call this API to create an index first so that index data is generated when user data is stored. After this operation, the index can be used for scanning and filtering operations immediately.

  Precautions:

  • An index cannot be modified once it is added. To modify the index, delete the old index and then create a new one.
  • Do not create two indexes on the same column with different index names. Otherwise, storage and processing resources will be wasted.
  • Indexes cannot be added to a system table.
  • The append and increment operations are not supported when data is put into the index column.
  • If the client encounters any exception other than DoNotRetryIOException, retry the operation.
  • An index column family is selected from the following options in sequence, based on availability:
    • The default index column family, which is typically d. If hindex.default.family.name is set, its value is used instead.
    • The symbol #, @, $, or %
    • #0, @0, $0, %0, #1, @1, and so on, up to #255, @255, $255, %255
    • If none of these is available, an exception is thrown.
  • You can use the HIndex TableIndexer tool to add indexes without building index data.

Operation: Delete an index.

  API: dropIndices()

  Description: Deletes an index only. This API deletes the specified index from a table but skips the corresponding index data. After this operation, the index cannot be used for scanning and filtering operations. The cluster automatically deletes old index data during major compaction.

  This API applies to scenarios where a table contains a large amount of index data and dropIndicesWithData() is unavailable. In addition, you can use the TableIndexer tool to delete indexes and index data.

  API: dropIndicesWithData()

  Description: Deletes an index and its index data. This API deletes the specified index and all index data corresponding to the index in a user table. After this operation, the index is completely deleted from the table and is no longer used for scanning and filtering operations.

  Precautions:

  • An index can be deleted when it is in the ACTIVE, INACTIVE, or DROPPING state.
  • If you use dropIndices() to delete an index, ensure that the index data has been deleted (that is, major compaction has been completed) before an index with the same name is added to the table again.
  • If you delete an index, the following information will also be deleted:
    • A column family with an index
    • Any one of column families in a combination index
  • Indexes and index data can be deleted together using the HIndex TableIndexer tool.

Operation: Enable/Disable an index.

  API: disableIndices()

  Description: Disables all indexes specified by a user so that they are no longer used for scanning and filtering operations.

  API: enableIndices()

  Description: Enables all indexes specified by a user so that they can be used for scanning and filtering operations.

  Precautions:

  • An index can be enabled when it is in the ACTIVE, INACTIVE, or BUILDING state.
  • An index can be disabled when it is in the ACTIVE or INACTIVE state.
  • Before enabling an index, ensure that the index data is consistent with the user data. If no new data was added to the table while the index was disabled, the index data is consistent with the user data.
  • When enabling an index, you can use the TableIndexer tool to build index data to ensure data consistency.

Operation: View the created index.

  API: listIndices()

  Description: Lists all indexes of a specified table.

  Precautions: N/A
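The selection order for the index column family described in the precautions for adding an index can be sketched as follows. This is an illustration only: select_index_family is a hypothetical helper, and it assumes that "available" means the name is not already used by a column family of the table. It does not call the HIndex API.

```python
def select_index_family(existing_families, default_family="d"):
    """Pick an index column family name in the documented order.

    existing_families: names already used by the table (assumption: a name
    is "available" if it is not in this set).
    default_family: "d", or the value of hindex.default.family.name if set.
    """
    candidates = [default_family]
    candidates += ["#", "@", "$", "%"]
    for n in range(256):
        # #0, @0, $0, %0, #1, @1, ... up to #255, @255, $255, %255
        candidates += [f"{symbol}{n}" for symbol in "#@$%"]
    for name in candidates:
        if name not in existing_families:
            return name
    # Documented last resort: an exception is thrown.
    raise ValueError("no index column family name is available")


first_choice = select_index_family({"info"})
second_choice = select_index_family({"info", "d"})
numbered_choice = select_index_family({"d", "#", "@", "$", "%"})
```

In this sketch, a table that already has a family named d falls through to #, and a table that uses d and all four symbols falls through to #0.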

Querying Data Based on Indexes

You can use a filter to query data in a user table with an index. The query result of a user table with a single or combination index is the same as that of a table without an index, but the table with an index provides higher data query performance than the table without an index.

The index usage rules are as follows:

  • Scenario 1: A single index is created for one or more columns.
    • When these columns are used in AND or OR filter conditions, the index improves query performance.

      Example: Filter_Condition(IndexCol1) AND/OR Filter_Condition(IndexCol2)

    • When the filter combines index columns and non-index columns with AND, the index improves query performance.

      Example: Filter_Condition(IndexCol1) AND Filter_Condition(IndexCol2) AND Filter_Condition(NonIndexCol1)

    • When the filter combines index columns and non-index columns with OR, the index is not used and query performance is not improved.

      Example: Filter_Condition(IndexCol1) AND/OR Filter_Condition(IndexCol2) OR Filter_Condition(NonIndexCol1)

  • Scenario 2: A combination index is created for multiple columns.
    • When the columns to be queried are all or part of the combination index and have the same order as the combination index, using the index improves query performance.

      For example, create a combination index for C1, C2, and C3 (IndexCol1, IndexCol2, and IndexCol3 in the following conditions).

      • The index takes effect in the following situations:

        Filter_Condition(IndexCol1) AND Filter_Condition(IndexCol2) AND Filter_Condition(IndexCol3)

        Filter_Condition(IndexCol1) AND Filter_Condition(IndexCol2)

        Filter_Condition(IndexCol1)

      • The index does not take effect in the following situations:

        Filter_Condition(IndexCol2) AND Filter_Condition(IndexCol3)

        Filter_Condition(IndexCol1) AND Filter_Condition(IndexCol3)

        Filter_Condition(IndexCol2)

        Filter_Condition(IndexCol3)

    • When the filter combines index columns and non-index columns with AND, the index improves query performance.

      Examples:

      Filter_Condition(IndexCol1) AND Filter_Condition(NonIndexCol1)

      Filter_Condition(IndexCol1) AND Filter_Condition(IndexCol2) AND Filter_Condition(NonIndexCol1)

    • When the filter combines index columns and non-index columns with OR, the index is not used and query performance is not improved.

      Examples:

      Filter_Condition(IndexCol1) OR Filter_Condition(NonIndexCol1)

      (Filter_Condition(IndexCol1) AND Filter_Condition(IndexCol2)) OR (Filter_Condition(NonIndexCol1))

    • When multiple columns are used in a query, you can specify a value range only for the last column in the combination index. The other columns must be set to specific values.

      For example, create a combination index for C1, C2, and C3. In a range query, only the value range of C3 can be set. The filter criteria are "C1 = XXX, C2 = XXX, and C3 = Value range."
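The combination index rules above can be condensed into a small predicate. This is a sketch under assumptions, not HIndex code: index_takes_effect is a hypothetical function, a query is modeled as a set of columns joined by AND plus an optional range column, and the range rule is read literally (a range is accepted only on the last column of the index, with every other index column set to a specific value).

```python
def index_takes_effect(index_columns, and_columns, range_column=None,
                       or_with_non_index=False):
    """Return True if a combination index is expected to help the query.

    index_columns: ordered columns of the combination index, e.g. ["C1", "C2", "C3"].
    and_columns: columns with equality conditions joined by AND (may include
        non-index columns).
    range_column: column filtered by a value range, if any.
    or_with_non_index: True if a non-index column is joined with OR.
    """
    if or_with_non_index:
        return False  # index column OR non-index column: index is not used
    used = set(and_columns)
    if range_column is not None:
        used.add(range_column)
    indexed_used = [c for c in index_columns if c in used]
    if not indexed_used:
        return False
    # The queried index columns must form a leading part of the index.
    if indexed_used != list(index_columns[:len(indexed_used)]):
        return False
    if range_column is not None and range_column in index_columns:
        # Literal reading: range only on the last index column, all others fixed.
        return (range_column == index_columns[-1]
                and all(c in and_columns for c in index_columns[:-1]))
    return True


COMBO = ["C1", "C2", "C3"]
prefix_ok = index_takes_effect(COMBO, {"C1", "C2"})
gap_fails = index_takes_effect(COMBO, {"C1", "C3"})
range_ok = index_takes_effect(COMBO, {"C1", "C2"}, range_column="C3")
```

With the index on C1, C2, and C3, the sketch accepts conditions on C1 and C2, rejects conditions on C1 and C3, and accepts "C1 = XXX, C2 = XXX, C3 in a range".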

Query Policy Selection

Use SingleColumnValueFilter or SingleColumnRangeFilter, which specifies a definite column_family:qualifier pair (called col1 here) in the filter criteria.

Any index in the table whose first index column is col1 can be a candidate index for the query. The following provides an example:

If there is an index on col1, the index can be used as a candidate index because col1 is the first and only column of the index. If there is another index on col1 and col2, this index is also a candidate index because col1 is the first column in its index list. However, if there is an index on col2 and col1, this index cannot be used as a candidate index because the first column in its index list is not col1.

When there are multiple candidate indexes, select the most suitable one for scanning data.

The following rules help you select the best index:

  • An exact match is preferred.

    Scenario: There are two indexes available, one for col1&col2 and the other for col1.

    In this scenario, the second index is better than the first one, because it scans less index data.

  • If there are multiple candidate multi-column indexes, select an index with fewer index columns.

    Scenario: There are two indexes available, one for col1&col2 and the other for col1&col2&col3.

    In this case, use the index on col1&col2, because it scans less index data.
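The selection rules above can be sketched as a small function. The function choose_index and the index names are hypothetical; the sketch only restates the rules (candidates start with col1, an exact match is preferred, otherwise fewer index columns win) and is not how HIndex is implemented.

```python
def choose_index(indexes, filter_columns):
    """Pick an index name for a filter whose first column is col1.

    indexes: mapping of index name to its ordered column list.
    filter_columns: ordered columns used in the filter; the first is col1.
    Returns None when no index starts with col1.
    """
    col1 = filter_columns[0]
    candidates = [(name, cols) for name, cols in indexes.items()
                  if cols and cols[0] == col1]
    if not candidates:
        return None
    # Rule 1: an exact match is preferred.
    for name, cols in candidates:
        if list(cols) == list(filter_columns):
            return name
    # Rule 2: otherwise prefer the candidate with fewer index columns.
    return min(candidates, key=lambda item: len(item[1]))[0]


scenario_1 = choose_index(
    {"idx_col1_col2": ["col1", "col2"], "idx_col1": ["col1"]}, ["col1"])
scenario_2 = choose_index(
    {"idx_col1_col2": ["col1", "col2"],
     "idx_col1_col2_col3": ["col1", "col2", "col3"]}, ["col1"])
no_candidate = choose_index({"idx_col2_col1": ["col2", "col1"]}, ["col1"])
```

For a filter on col1, the sketch picks the index on col1 in the first scenario and the index on col1&col2 in the second, and finds no candidate when the only index starts with col2.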

NOTE:
  • During a query based on an index, the index state must be ACTIVE. You can call the listIndices() API to view the index state.
  • To query the correct data based on the index, ensure the consistency between index data and user data.
  • Run the following command to perform a complex query on the HBase shell client (assuming that an index has been created for the specified column):

    scan 'tablename', {FILTER => "SingleColumnValueFilter(family, qualifier, compareOp, comparator, filterIfMissing, latestVersionOnly)"}

    Example: scan 'test', {FILTER => "SingleColumnValueFilter('info', 'age', =, 'binary:26', true, true)"}

    In the preceding scenario, if you want rows in which the queried column does not exist to be kept in the result, do not create an index on that column. When SCVF (SingleColumnValueFilter) scans an index column, rows in which the column does not exist are filtered out. In contrast, when an SCVF whose filterIfMissing is set to false (default value) scans a non-index column, rows in which the column does not exist are also returned in the result. Therefore, to avoid inconsistent query results, you are advised to set filterIfMissing to true when creating an SCVF for an index column.

  • Run the following command on the HBase shell client to view the index data created for user data:

    scan 'tablename', {ATTRIBUTES => {'FETCH_INDEX_DATA' => 'true'}}
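The filterIfMissing advice in the note above can be illustrated with a toy model. The functions full_scan and index_scan are hypothetical stand-ins for a SingleColumnValueFilter scan on a non-index column and on an index column; they do not use the HBase API.

```python
# Toy table: row2 has no info:age column.
table = {
    "row1": {"info:age": "26"},
    "row2": {"info:name": "Bob"},
    "row3": {"info:age": "30"},
}


def full_scan(rows, column, value, filter_if_missing):
    """Models SCVF on a non-index column."""
    result = []
    for row_key in sorted(rows):
        row = rows[row_key]
        if column not in row:
            # filterIfMissing=false keeps rows that lack the column.
            if not filter_if_missing:
                result.append(row_key)
            continue
        if row[column] == value:
            result.append(row_key)
    return result


def index_scan(rows, column, value):
    """Models SCVF on an index column: rows without the column are filtered out."""
    return [k for k in sorted(rows) if rows[k].get(column) == value]


without_index_default = full_scan(table, "info:age", "26", filter_if_missing=False)
without_index_strict = full_scan(table, "info:age", "26", filter_if_missing=True)
with_index = index_scan(table, "info:age", "26")
```

With filterIfMissing set to false, the non-index scan also returns row2, which the index-based scan never returns. With filterIfMissing set to true, both paths return the same rows.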
