Overview of Full-Text Search

HBase-Elasticsearch stores user source data in HBase and uses the Elasticsearch engine of Cloud Search Service (CSS) to add full-text search on top of the key-value query capabilities of HBase. You can define which fields in HBase need full-text search based on service requirements. When you create an HBase table, the table is automatically connected to the CSS cluster you specify and an index is created in Elasticsearch. Index data is stored in Elasticsearch. In addition, the native HBase APIs (Put and Scan) support writing and querying index data.

How to Use

Working Principles

As a big data storage service, CloudTable stores user data in bytes and provides efficient random key-value queries. You can customize a schema that assigns a data type (generally the text type) to some fields, which extends CloudTable with full-text search. CloudTable is well suited as the primary storage system for massive amounts of source data of any type, because it separates computing from storage, scales out easily, and stores data cost-effectively. CSS (Elasticsearch) keeps only lightweight index data to support keyword search. The following figure shows the working principles.

Figure 1 Working principles
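
The split between full source data in CloudTable and lightweight index data in CSS can be illustrated with a minimal sketch. The row layout (a dictionary keyed by family:qualifier) and the function name to_index_document are hypothetical and chosen only for illustration; they do not describe CloudTable internals.

```python
# Hypothetical sketch: derive a lightweight index document from a full row.
# Only qualifiers listed in the schema are copied; all other cells stay in HBase only.

def to_index_document(row, schema):
    """Return the subset of a row that would be indexed, keyed by Elasticsearch field name."""
    doc = {}
    for field in schema:
        qualifier = field["hbaseQualifier"]
        if qualifier in row:
            doc[field["name"]] = row[qualifier]
    return doc

schema = [
    {"name": "contentEng", "type": "text", "hbaseQualifier": "cf2:contentEng"},
    {"name": "id", "type": "long", "hbaseQualifier": "cf1:id"},
]

row = {
    "cf1:id": 42,
    "cf2:contentEng": "full-text search on HBase",
    "cf1:payload": b"\x00\x01\x02",  # source data that is not part of the index
}

index_doc = to_index_document(row, schema)
print(index_doc)
```

In this sketch the cf1:payload cell never reaches the index, which mirrors the idea that Elasticsearch holds only the fields needed for search.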

If you enable a full-text index for specified fields when creating an HBase table, HBase automatically synchronizes the full-text index data to CSS whenever data is written. In addition, the native HBase read API, Scan, supports common full-text searches on top of key-value reads. For more advanced searches, you can call the Elasticsearch APIs first and then the CloudTable read APIs to complete your service logic.
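
The two-step flow for advanced searches (query Elasticsearch, then read the matching rows from CloudTable) might look like the following sketch. It assumes that the ID of each index document equals the HBase row key, which this section does not state. The search and get_row callables are hypothetical wrappers around whichever Elasticsearch and HBase clients you use, and in-memory stand-ins replace them here so that the sketch runs without a cluster.

```python
# Sketch of "search in Elasticsearch, then read from CloudTable".
# Assumption: each index document's _id is the HBase row key.

def build_match_query(field, keywords, size=10):
    """Build an Elasticsearch match query body for a text field."""
    return {
        "size": size,
        "_source": False,  # only the document IDs are needed
        "query": {"match": {field: keywords}},
    }

def extract_row_keys(search_response):
    """Collect document IDs from an Elasticsearch search response."""
    return [hit["_id"] for hit in search_response["hits"]["hits"]]

def search_then_read(search, get_row, field, keywords):
    """Run the search, then fetch each matching row by key."""
    response = search(build_match_query(field, keywords))
    return {key: get_row(key) for key in extract_row_keys(response)}

# In-memory stand-ins for the real clients (illustration only).
fake_table = {
    "row1": {"cf2:contentEng": "hbase full-text search"},
    "row2": {"cf2:contentEng": "key-value storage"},
}

def fake_search(body):
    """Very rough imitation of a match query: any shared word counts as a hit."""
    words = body["query"]["match"]["contentEng"].lower().split()
    hits = [
        {"_id": key}
        for key, row in fake_table.items()
        if any(w in row["cf2:contentEng"].lower().split() for w in words)
    ]
    return {"hits": {"hits": hits}}

results = search_then_read(fake_search, fake_table.get, "contentEng", "search")
print(results)
```

Keeping the query builder and the row-key extraction separate from the client calls makes it possible to swap in real clients without changing the service logic.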

Application Scenarios

Massive amounts of user service data require HBase to function as an online big data storage system that provides basic key-value queries with high efficiency, high concurrency, and low latency. In addition, the data contains many fields of many types because the corresponding services are diverse. For example, in a single row of a table, some text fields need keyword-based full-text search, some fields serve as secondary indexes, and some fields are used for tag bitmap indexes. In this case, enable the Elasticsearch full-text search function for CloudTable while preserving the other service expansion capabilities. Examples:

  1. A search website stores massive amounts of search information, user environment information, and basic information in real time, extracts user information based on goods keywords, and resells the information to a third-party e-commerce platform.
  2. A smart hospital's case system stores patients' medical treatment information, including basic information, health status, the doctor's professional information, symptom descriptions, diagnosis results, and medicine. A hospital information platform uses keywords related to current epidemics, prohibited drugs, or technical breakthroughs to search historical treatment records or collect statistics on them. The results are used to track discharged patients or to contact patients about new technologies for secondary diagnosis and other innovative services.
  3. A government's intelligent public opinion governance system stores massive amounts of data about users of mainstream media platforms, such as seditious speech, user information, and forwarding counts. It also searches for hot events in real time. If an event is a rumor, the system automatically informs the user of the authenticity of the event, the social impact of the content that the user published or forwarded, relevant legal provisions, and similar cases. This intelligent feedback mechanism deters rumormongers and guides public opinion in a positive direction.

HBase Elasticsearch Schema Definition

HBase uses metadata of a table to store the definition of the Elasticsearch schema.

Table 1 Schema definition

| Field | Description | Mandatory |
| --- | --- | --- |
| hbase.index.es.enabled | Whether to create a full-text index for the HBase table in Elasticsearch. The value true indicates that the full-text index is created. The default value is false. | Yes |
| hbase.index.es.endpoint | Access address of the CSS cluster (Elasticsearch engine), for example, ip1:port,ip2:port | Yes |
| hbase.index.es.indexname | Index name of the HBase table in Elasticsearch. The index name must be in lower case. | Yes |
| hbase.index.es.shards | Number of index shards in Elasticsearch. The default value is 5. The value is an integer greater than or equal to 1. | No |
| hbase.index.es.replicas | Number of index replicas in Elasticsearch. The default value is 1. The value is an integer greater than or equal to 0. | No |
| hbase.index.es.schema | Field mapping between HBase and Elasticsearch. The value is a string in JSON array format. Each element contains the following fields: name (name of the field in Elasticsearch); type (type of the field in Elasticsearch); hbaseQualifier (HBase qualifier of the data source); analyzer (specifies an analyzer for fields of the text type; the ik_smart analyzer is typically used for Chinese text, and the default is the standard analyzer, which supports English text). Example: '[ {"name":"contentCh","type":"text","hbaseQualifier":"cf1:contentCh","analyzer":"ik_smart"}, {"name":"contentEng","type":"text","hbaseQualifier":"cf2:contentEng"},{"name":"id","type":"long","hbaseQualifier":"cf1:id"} ]' | Yes |
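
The fields in Table 1 can be assembled and checked before a table is created. The following is a minimal sketch: the helper name build_es_table_attributes is hypothetical, the endpoint is the placeholder from the table, and passing every value as a string is an assumption. How the resulting key-value pairs are attached to the table metadata is outside the scope of this sketch.

```python
import json

def build_es_table_attributes(endpoint, index_name, schema, shards=5, replicas=1):
    """Assemble the full-text index settings from Table 1 as string key-value pairs."""
    if index_name != index_name.lower():
        raise ValueError("index name must be in lower case")
    if not isinstance(shards, int) or shards < 1:
        raise ValueError("shards must be an integer greater than or equal to 1")
    if not isinstance(replicas, int) or replicas < 0:
        raise ValueError("replicas must be an integer greater than or equal to 0")
    return {
        "hbase.index.es.enabled": "true",
        "hbase.index.es.endpoint": endpoint,
        "hbase.index.es.indexname": index_name,
        "hbase.index.es.shards": str(shards),
        "hbase.index.es.replicas": str(replicas),
        # The schema is stored as a JSON array string.
        "hbase.index.es.schema": json.dumps(schema),
    }

schema = [
    {"name": "contentCh", "type": "text", "hbaseQualifier": "cf1:contentCh", "analyzer": "ik_smart"},
    {"name": "contentEng", "type": "text", "hbaseQualifier": "cf2:contentEng"},
    {"name": "id", "type": "long", "hbaseQualifier": "cf1:id"},
]

attrs = build_es_table_attributes(
    endpoint="ip1:port,ip2:port",  # placeholder address, replace with the CSS cluster address
    index_name="demo_index",       # hypothetical index name
    schema=schema,
)
for key, value in attrs.items():
    print(key, "=", value)
```

Validating the index name, shard count, and replica count up front catches the constraints listed in the table before any request reaches the cluster.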

HBase-Elasticsearch full-text search supports the following data types, which are the valid values of type in the schema: text, long, integer, short, byte, double, float, and boolean. text indicates the text type in Elasticsearch. Full-text search typically targets data of the text type, and exact-match search is also supported for data of the basic types.
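
A schema string can be checked against the supported types before it is used. The sketch below is illustrative only: the function name validate_schema is hypothetical, and two of its rules are assumptions rather than documented behavior, namely that name, type, and hbaseQualifier are required in every element and that an analyzer on a non-text field is reported as a problem.

```python
import json

SUPPORTED_TYPES = {"text", "long", "integer", "short", "byte", "double", "float", "boolean"}
REQUIRED_KEYS = ("name", "type", "hbaseQualifier")  # assumed to be required

def validate_schema(schema_json):
    """Return a list of problems found in an hbase.index.es.schema value."""
    problems = []
    for i, field in enumerate(json.loads(schema_json)):
        for key in REQUIRED_KEYS:
            if key not in field:
                problems.append(f"element {i}: missing {key}")
        field_type = field.get("type")
        if field_type is not None and field_type not in SUPPORTED_TYPES:
            problems.append(f"element {i}: unsupported type {field_type}")
        # Assumption: analyzers are only meaningful for text fields.
        if "analyzer" in field and field_type != "text":
            problems.append(f"element {i}: analyzer applies only to text fields")
    return problems

good = ('[ {"name":"contentCh","type":"text","hbaseQualifier":"cf1:contentCh","analyzer":"ik_smart"}, '
        '{"name":"contentEng","type":"text","hbaseQualifier":"cf2:contentEng"},'
        '{"name":"id","type":"long","hbaseQualifier":"cf1:id"} ]')
bad = '[ {"name":"tag","type":"keyword","hbaseQualifier":"cf1:tag"}, {"name":"id","type":"long"} ]'

print(validate_schema(good))
print(validate_schema(bad))
```

The first call uses the example schema from Table 1 and reports no problems; the second reports one unsupported type and one missing hbaseQualifier.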