Help Center/ GeminiDB/ GeminiDB Cassandra API/ FAQs/ Database Usage/ How Can I Use the Search Index of Lucene?
Updated on 2024-09-04 GMT+08:00

How Can I Use the Search Index of Lucene?

GeminiDB Cassandra API supports Lucene search indexes, which are used for multi-dimensional queries, text retrieval, document counting and analysis. Search indexes are used the same way as native secondary indexes, but search indexes support more syntax types.

Secondary Index Pain Points

When you create a secondary index, open-source Cassandra creates a hidden table where the primary key becomes one of the columns. The following constraints are inevitable:

  • Only "=" can be used to query the first primary key.
  • "=", ">", "<", ">=", and ''<=" can be used to query the second primary key.
  • Only "=" can be used to query the index column.
  • Indexes cannot be created for columns that are frequently deleted or updated.
  • Creating an index on a high-cardinality column is not recommended.

So you can see the query function provided by Cassandra secondary indexes is somewhat limited.

Lucene Search Index Architecture

Key technologies:

Search indexes are provided based on integration of the search engine of Lucene and the storage engine of a wide table.

Lucene is compatible with Apache Cassandra syntax and supports more SQL statements. Lucene provides capabilities such as multi-dimensional query, text retrieval, fuzzy query, and COUNT query, comprehensively improving user experience in querying massive volumes of data.

Figure 1 Lucene search index architecture

Usage Example of Lucene Search Indexes

Figure 2 Lucene search index usage

Create a table structure.

CREATE TABLE example (pk1 text, pk2 bigint, ck1 int,ck2 text,col1 int, col2 int, col3 text, col4 text, PRIMARY KEY ((pk1,pk2),ck1, ck2));

Create a Lucene search index using four attribute columns.

CREATE CUSTOM INDEX index_lucene ON test.example(col1,col2,col3,col4) USING LuceneGlobalIndex' 
WITH OPTIONS = {
'table_tokens': '3', 
'analyzed_columns': 'col4', 
'disable_doc_value': 'col4', 
'ordered_columns': 'col3,col4', 
'ordered_sequences': 'desc,asc', 
'analyzer_class': 'StandardAnalyzer'
};
Table 1 Optional parameters of OPTIONS

Parameter

Description

table_tokens

Number of initialized Lucene search index shards. If this parameter is not specified, default value 3 is used. Shards occupy CPU and memory resources and increase with data volumes.

analyzed_columns

Column used for full-text search

analyzer_class

Analyzer used for full-text search

Chinese parser:

'analyzer_class': 'SmartChineseAnalyzer'

Standard parser:

'analyzer_class': 'StandardAnalyzer'

IK parser:

'analyzer_class': 'IKAnalyzer'

ordered_columns

Default sorting order of Lucene search index columns. If this parameter is not specified, the sorting order of the index columns is the same as that of table data in an Apache Cassandra database by default. In this case, queries can be efficiently executed. Multiple values need to be separated by commas (,).

ordered_sequences

Ascending or descending order of index columns specified in ordered_columns. asc indicates the ascending order and desc descending order.

disable_doc_value

docValues is set to false for index columns that do not require operations such as sorting and aggregate query.

Multi-dimensional query: Implement a nested query based on any combination of index columns. Both exact query and range query are supported.

SELECT * from example WHERE pk1>='a' and pk2>=1000 and ck2 in ('a','b','c') and col1 <= 4 and col2 >= 2;

COUNT query: The number of rows in a data table or the number of matched rows is obtained based on a specific query criteria of index columns.

SELECT count(*) FROM example WHERE col1 > 3 AND EXPR(index_lucene, 'count');

Index column sorting order: Multiple index column sorting rules can be specified. A result set is returned based on the multi-dimensional query. For details about supported extended JSON syntax, see Extended JSON Syntax.

Fuzzy query: Prefix query and wildcard query are supported.

SELECT * FROM example WHERE col3 LIKE 'test%'; 
SELECT * FROM example WHERE col3 LIKE 'start*end';

Aggregate analysis: Perform simple aggregate analysis using functions SUM, MAX, MIN, and AVG based on the combination of index columns.

SELECT sum(col1) from example WHERE pk1>='a' and pk2>=1000 and col1 <= 4 and col2 >= 2;

Full-text search: Specify a Chinese or an English analyzer to perform word segmentation and obtain data related to the word segmentation results.

SELECT * FROM example WHERE col4 LIKE '%+test -index%';

Extended JSON Syntax

Table 2 Extended JSON syntax

Keyword

Function

filter

Keyword of queries in the extended JSON format

term

Whether a document contains a specific value in a query

match

Segments a queried value and performs full-text search.

range

Queries a specified field in a specified range (range query sub-keywords: eq, gte, gt, lte, and lt).

bool

Complex query that uses the combination of must, should and must not

must

A query clause in a bool query, including term, match, and range queries. The results must match all queries.

should

A query clause in a bool query, including term, match, and range queries. The results must match at least one of the queries.

must not

A query clause in a bool query, including term, match, and range queries. All matches are excluded from the results.

sort

Global index columns can be sorted in ascending or descending order.

Example of a typical JSON query statement

{
  "filter": {
    "bool": {
      "should": [
        {"term": {"col1": 1, "col1": 2, "col1": 3, "col3": "testcase7"}}
      ], 
      "must": [
        {"range": {"col2": {"lte": 7, "gt": 0}, "ck1": {"gte": 2}}},
        {"match": {"col4": "+lucene -index"}}
      ]
    }
  }, 
  "sort": [{"col1":"desc"}, {"col2":"asc"}]
}

Complete CQL statement

SELECT * from example where expr(index_lucene, '{"filter": {"bool": {"should": [{"term": {"col1": 1, "col1": 2, "col1": 3, "col3": "testcase7"}}], "must": [{"range": {"col2": {"lte": 7, "gt": 0}, "ck1": {"gte": 2}}},{"match": {"col4": "+lucene -index"}}]}}, "sort": [{"col1":"desc"}, {"col2":"asc"}]}');

Example of comparing a CQL statement with a JSON statement for typical queries

1. When partition keys (pk1 and pk2) are specified in a query, pk1 and pk2 must be separated from the JSON query condition. Otherwise, the performance will be affected.

SELECT * from example where pk1=*** and pk2=*** and expr(index_lucene, 'json');

2. Query condition: col1=1

SELECT * from example WHERE col1=1;
SELECT * from example WHERE expr(index_lucene, '{"filter": {"term": {"col1": 1}}}');
SELECT * from example WHERE expr(index_lucene, '{"filter": {"bool": {"must": [{"term": {"col1": 1}}]}}}');

These three statements are equivalent and recommended in sequence. You are advised to execute the first common CQL query. If the common CQL query is not supported, execute the extended JSON query.

3. Query condition: col1=1 and col2>=2

SELECT * from example WHERE col1=1 and col2>=2;
SELECT * from example WHERE expr(index_lucene, '{"filter": {"term": {"col1": 1},"range": {"col2": {"gte": 2}}}}');
SELECT * from example WHERE expr(index_lucene, '{"filter": {"bool": {"must": [{"term": {"col1": 1}}, {"range": {"col2": {"gte": 2}}}]}}}');

The common CQL query is recommended.

4. Query condition: col1=1 and (col2<2 or col2>3)

SELECT * from example WHERE expr(index_lucene, '{"filter": {"bool": {"must": [{"term": {"col1": 1}}], "should": [{"range": {"col2": {"lt": 2}, "col2": {"gt": 3}}}]}}}');
SELECT * from example WHERE expr(index_lucene, '{"filter": {"bool": {"must": [{"term": {"col1": 1}}], "must_not": [{"range": {"col2": {"gte": 2, "lte": 3}}}]}}}');

These two statements are equivalent, but must_not is not recommended because of its lower query performance than should.

5. Query condition: col1 in (1,2,3,4) and (col2<2 or col2>3)

SELECT * from example WHERE expr(index_lucene, '{"filter": {"bool": {"should": [{"term": {"col1": 1, "col1": 2, "col1": 3, "col1": 4}}], "should": [{"range": {"col2": {"lt": 2}, "col2": {"gt": 3}}}]}}}');
SELECT * from example WHERE expr(index_lucene, '{"filter": {"bool": {"should": [{"term": {"col1": 1, "col1": 2, "col1": 3, "col1": 4}}], "must_not": [{"range": {"col2": {"gte": 2, "lte": 3}}}]}}}');

These two statements are equivalent, but must_not is not recommended because of its lower query performance than should.

6. Single query with partitions specified: pk1='a' and pk2=1000 and col1 in (1,2,3,4) and (col2<2 or col2>3)

SELECT * from example WHERE pk1='a' and pk2=1000 and expr(index_lucene, '{"filter": {"bool": {"should": [{"term": {"col1": 1, "col1": 2, "col1": 3, "col1": 4}}], "should": [{"range": {"col2": {"lt": 2}, "col2": {"gt": 3}}}]}}}');

7. Query condition: (((ck1<2 or ck1>=4) and (col1<2 or col1 >3)) or (pk1 in ('a', 'b', 'c'))) or (5<=col2<15 and pk2 > 2000)

SELECT * from example WHERE expr(index_lucene, '{"filter": {"bool": {"should": [{"bool": {"should": [{"bool": {"must": [{"bool": {"should": [{"range": {"ck1": {"lt": 2}, "ck1": {"gte": 4}}}]}}, {"bool": {"should": [{"range": {"col1": {"lt": 2}, "col1": {"gt": 3}}}]}}]}}, {"bool": {"should": [{"term": {"pk1": "a", "pk1": "b", "pk1": "c"}}]}}]}}, {"bool": {"must": [{"range": {"col2": {"gte":5, "lte": 15}, "pk2": {"gt": 2000}}}]}}]}}}');

8. COUNT quey: The statement is as follows. You can also create the preceding query conditions in JSON format.

SELECT count(*) from example WHERE expr(index_lucene, '{"filter": {"bool": {"should": [{"bool": {"should": [{"bool": {"must": [{"bool": {"should": [{"range": {"ck1": {"lt": 2}, "ck1": {"gte": 4}}}]}}, {"bool": {"should": [{"range": {"col1": {"lt": 2}, "col1": {"gt": 3}}}]}}]}}, {"bool": {"should": [{"term": {"pk1": "a", "pk1": "b", "pk1": "c"}}]}}]}}, {"bool": {"must": [{"range": {"col2": {"gte":5, "lte": 15}, "pk2": {"gt": 2000}}}]}}]}}}');

Precautions

  • If common CQL statements can be executed, avoid JSON query statements as much as possible.
  • For a single-partition query, the partition key condition must be used independently instead of being added to the JSON query condition. Otherwise, the single query performance will be affected.
  • Avoid the must_not clause as much as possible.
  • If query results always need to be sorted based on sorting orders of some index columns, you can specify their sorting orders as the default when creating indexes to improve performance.