Help Center/ GeminiDB/ GeminiDB Cassandra API/ FAQs/ Database Usage/ How Can I Use the Search Index of Lucene?

Updated on 2025-10-17 GMT+08:00

View PDF

How Can I Use the Search Index of Lucene?

GeminiDB Cassandra API supports Lucene search indexes, which are used for multi-dimensional queries, text retrieval, document counting and analysis. Search indexes are used the similar way as native secondary indexes, but search indexes support more syntax types.

Pain Points of Open-Source Cassandra Secondary Indexes

Cassandra secondary indexes are stored in an implicit table. This table's primary key is an index column with values from the original table's primary keys, which is easy to implement. The following constraints are inevitably introduced:

Only "=" can be used to query the first primary key.
"=", ">", "<", ">=", and ''<=" can be used to query the second primary key.
Only "=" can be used to query the index column.
Indexes cannot be created for columns that are frequently deleted or updated.
Creating an index on a high-cardinality column is not recommended.

So you can see the query function provided by Cassandra secondary indexes is somewhat limited.

Lucene Search Index Architecture

Key technologies:

Search indexes are provided based on integration of the search engine of Lucene and the storage engine of a wide table.
Lucene is compatible with Apache Cassandra syntax and supports more SQL statements. Lucene provides capabilities such as multi-dimensional query, text retrieval, fuzzy query, and COUNT query, comprehensively improving user experience in querying massive volumes of data.

Figure 1 Lucene search index architecture
Click to enlarge

Usage Example of Lucene Search Indexes

Figure 2 Lucene search index usage
Click to enlarge

Example of creating a table structure:

CREATE TABLE example (pk1 text, pk2 bigint, ck1 int,ck2 text,col1 int, col2 int, col3 text, col4 text, PRIMARY KEY ((pk1,pk2),ck1, ck2));

Example of creating Lucene search indexes:

# Creates Lucene indexes in col1, col2, col3, and col4.
CREATE CUSTOM INDEX index_lucene ON test.example(col1,col2,col3,col4) USING 'LuceneGlobalIndex'  
WITH OPTIONS = {
'table_tokens': '3', # Sets the number of Lucene search index shards to be initialized to 3.
'analyzed_columns': 'col4', # Specifies col4 for full-text search.
'disable_doc_value': 'col4', # Disables DocValues for col4.
'ordered_columns': 'col3,col2', # Specifies col3 as the sorting column.
'ordered_sequences': 'desc,asc', # Sorts col3 in the descending order and col2 in the ascending order.
'analyzer_class': 'StandardAnalyzer', # Specifies StandardAnalyzer as the analyzer for full-text search.
'case_insensitive': 'col3' # Sets col3 to be case-insensitive for characters.
};

**Table 1** Optional parameters of OPTIONS
Parameter	Description
table_tokens	Number of initialized Lucene search index shards. If this parameter is not specified, default value 3 is used. Shards occupy CPU and memory resources and increase with data volumes.
analyzed_columns	Column used for full-text search. By default, characters are converted to lowercase letters for storage. To enable case-insensitive searches, you need to specify case_insensitive.
analyzer_class	Analyzer used for full-text search Chinese parser: 'analyzer_class': 'SmartChineseAnalyzer' Standard parser: 'analyzer_class': 'StandardAnalyzer' IK parser: 'analyzer_class': 'IKAnalyzer'
ordered_columns	Default sorting columns of Lucene search index columns. ordered_sequences is configured for columns specified for this parameter. If this parameter is not specified, the sorting mode is the same as that of sorting keys of GeminiDB Cassandra instances by default. Multiple index columns are separated by commas (,). This parameter cannot be used with disable_doc_value. Note: The query efficiency is the highest only in the default sorting mode.
ordered_sequences	Sorting order of Lucene search index columns specified in ordered_columns. asc indicates the ascending order and desc descending order.
disable_doc_value	docValues is set to false for index columns that do not require operations such as sorting and aggregate query. This parameter cannot be used with ordered_columns.
case_insensitive	Character case is ignored in specified index columns. The characters are automatically converted to lowercase letters during storage and query.

Multi-dimensional query: Implement a nested query based on any combination of index columns. Both exact query and range query are supported.

SELECT * from example WHERE pk1>='a' and pk2>=1000 and ck2 in ('a','b','c') and col1 <= 4 and col2 >= 2;

COUNT query: The number of rows in a data table or the number of matched rows is obtained based on a specific query criteria of index columns.

SELECT count(*) FROM example WHERE col1 > 3 AND EXPR(index_lucene, 'count');

Index column sorting order: Multiple index column sorting rules can be specified. A result set is returned based on the multi-dimensional query. For details about supported extended JSON syntax, see Extended JSON Syntax.

Fuzzy query: Prefix query and wildcard query are supported.

SELECT * FROM example WHERE col3 LIKE 'test%'; 
SELECT * FROM example WHERE col3 LIKE 'start*end';

Aggregate analysis: Perform simple aggregate analysis using functions SUM, MAX, MIN, and AVG based on the combination of index columns.

SELECT sum(col1) from example WHERE pk1>='a' and pk2>=1000 and col1 <= 4 and col2 >= 2;

Full-text search: Specify a Chinese or an English analyzer to perform word segmentation and obtain data related to the word segmentation results.

SELECT * FROM example WHERE col4 LIKE '%+test -index%';

Extended JSON Syntax

**Table 2** Extended JSON syntax
Keyword	Function
filter	Keyword of queries in the extended JSON format
term	Whether a document contains a specific value in a query
match	Segments a queried value and performs full-text search.
range	Queries a specified field in a specified range (range query sub-keywords: eq, gte, gt, lte, and lt).
bool	Complex query that uses the combination of must, should and must not
must	A query clause in a bool query, including term, match, and range queries. The results must match all queries.
should	A query clause in a bool query, including term, match, and range queries. The results must match at least one of the queries.
must not	A query clause in a bool query, including term, match, and range queries. All matches are excluded from the results.
sort	Global index columns can be sorted in ascending or descending order.

Example of a typical JSON query statement

{
  "filter": {
    "bool": {
      "should": [
        {"term": {"col1": 1, "col1": 2, "col1": 3, "col3": "testcase7"}}
      ], 
      "must": [
        {"range": {"col2": {"lte": 7, "gt": 0}, "ck1": {"gte": 2}}},
        {"match": {"col4": "+lucene -index"}}
      ]
    }
  }, 
  "sort": [{"col1":"desc"}, {"col2":"asc"}]
}

Complete CQL statement

SELECT * from example where expr(index_lucene, '{"filter": {"bool": {"should": [{"term": {"col1": 1, "col1": 2, "col1": 3, "col3": "testcase7"}}], "must": [{"range": {"col2": {"lte": 7, "gt": 0}, "ck1": {"gte": 2}}},{"match": {"col4": "+lucene -index"}}]}}, "sort": [{"col1":"desc"}, {"col2":"asc"}]}');

Example of comparing a CQL statement with a JSON statement for typical queries

1. When partition keys (pk1 and pk2) are specified in a query, pk1 and pk2 must be separated from the JSON query condition. Otherwise, the performance will be affected.

SELECT * from example where pk1=*** and pk2=*** and expr(index_lucene, 'json');

2. Query condition: col1=1

SELECT * from example WHERE col1=1;
SELECT * from example WHERE expr(index_lucene, '{"filter": {"term": {"col1": 1}}}');
SELECT * from example WHERE expr(index_lucene, '{"filter": {"bool": {"must": [{"term": {"col1": 1}}]}}}');

These three statements are equivalent and recommended in sequence. You are advised to execute the first common CQL query. If the common CQL query is not supported, execute the extended JSON query.

3. Query condition: col1=1 and col2>=2

SELECT * from example WHERE col1=1 and col2>=2;
SELECT * from example WHERE expr(index_lucene, '{"filter": {"term": {"col1": 1},"range": {"col2": {"gte": 2}}}}');
SELECT * from example WHERE expr(index_lucene, '{"filter": {"bool": {"must": [{"term": {"col1": 1}}, {"range": {"col2": {"gte": 2}}}]}}}');

The common CQL query is recommended.

4. Query condition: col1=1 and (col2<2 or col2>3)

SELECT * from example WHERE expr(index_lucene, '{"filter": {"bool": {"must": [{"term": {"col1": 1}}], "should": [{"range": {"col2": {"lt": 2}, "col2": {"gt": 3}}}]}}}');
SELECT * from example WHERE expr(index_lucene, '{"filter": {"bool": {"must": [{"term": {"col1": 1}}], "must_not": [{"range": {"col2": {"gte": 2, "lte": 3}}}]}}}');

These two statements are equivalent, but must_not is not recommended because of its lower query performance than should.

5. Query condition: col1 in (1,2,3,4) and (col2<2 or col2>3)

SELECT * from example WHERE expr(index_lucene, '{"filter": {"bool": {"should": [{"term": {"col1": 1, "col1": 2, "col1": 3, "col1": 4}}], "should": [{"range": {"col2": {"lt": 2}, "col2": {"gt": 3}}}]}}}');
SELECT * from example WHERE expr(index_lucene, '{"filter": {"bool": {"should": [{"term": {"col1": 1, "col1": 2, "col1": 3, "col1": 4}}], "must_not": [{"range": {"col2": {"gte": 2, "lte": 3}}}]}}}');

These two statements are equivalent, but must_not is not recommended because of its lower query performance than should.

6. Single query with partitions specified: pk1='a' and pk2=1000 and col1 in (1,2,3,4) and (col2<2 or col2>3)

SELECT * from example WHERE pk1='a' and pk2=1000 and expr(index_lucene, '{"filter": {"bool": {"should": [{"term": {"col1": 1, "col1": 2, "col1": 3, "col1": 4}}], "should": [{"range": {"col2": {"lt": 2}, "col2": {"gt": 3}}}]}}}');

7. Query condition: (((ck1<2 or ck1>=4) and (col1<2 or col1 >3)) or (pk1 in ('a', 'b', 'c'))) or (5<=col2<15 and pk2 > 2000)

SELECT * from example WHERE expr(index_lucene, '{"filter": {"bool": {"should": [{"bool": {"should": [{"bool": {"must": [{"bool": {"should": [{"range": {"ck1": {"lt": 2}, "ck1": {"gte": 4}}}]}}, {"bool": {"should": [{"range": {"col1": {"lt": 2}, "col1": {"gt": 3}}}]}}]}}, {"bool": {"should": [{"term": {"pk1": "a", "pk1": "b", "pk1": "c"}}]}}]}}, {"bool": {"must": [{"range": {"col2": {"gte":5, "lte": 15}, "pk2": {"gt": 2000}}}]}}]}}}');

8. COUNT quey: The statement is as follows. You can also create the preceding query conditions in JSON format.

SELECT count(*) from example WHERE expr(index_lucene, '{"filter": {"bool": {"should": [{"bool": {"should": [{"bool": {"must": [{"bool": {"should": [{"range": {"ck1": {"lt": 2}, "ck1": {"gte": 4}}}]}}, {"bool": {"should": [{"range": {"col1": {"lt": 2}, "col1": {"gt": 3}}}]}}]}}, {"bool": {"should": [{"term": {"pk1": "a", "pk1": "b", "pk1": "c"}}]}}]}}, {"bool": {"must": [{"range": {"col2": {"gte":5, "lte": 15}, "pk2": {"gt": 2000}}}]}}]}}}');

Precautions

If common CQL statements can be executed, avoid JSON query statements as much as possible.
For a single-partition query, the partition key condition must be used independently instead of being added to the JSON query condition. Otherwise, the single query performance will be affected.
Avoid the must_not clause as much as possible.
If query results always need to be sorted based on sorting orders of some index columns, you can specify their sorting orders as the default when creating indexes to improve performance.

Parent Topic: Database Usage

Previous topic: How Do I Use a Secondary Index?

Next topic: How Do I Set Paging Query with Java?