How Can I Use the Search Index of Lucene?
GeminiDB Cassandra API supports Lucene search indexes, which are used for multi-dimensional queries, text retrieval, document counting and analysis. Search indexes are used the same way as native secondary indexes, but search indexes support more syntax types.
Secondary Index Pain Points
When you create a secondary index, open-source Cassandra creates a hidden table where the primary key becomes one of the columns. The following constraints are inevitable:
- Only "=" can be used to query the first primary key.
- "=", ">", "<", ">=", and ''<=" can be used to query the second primary key.
- Only "=" can be used to query the index column.
- Indexes cannot be created for columns that are frequently deleted or updated.
- Creating an index on a high-cardinality column is not recommended.
So you can see the query function provided by Cassandra secondary indexes is somewhat limited.
Lucene Search Index Architecture
Key technologies:
Search indexes are provided based on integration of the search engine of Lucene and the storage engine of a wide table.
Lucene is compatible with Apache Cassandra syntax and supports more SQL statements. Lucene provides capabilities such as multi-dimensional query, text retrieval, fuzzy query, and COUNT query, comprehensively improving user experience in querying massive volumes of data.
Usage Example of Lucene Search Indexes
Create a table structure.
CREATE TABLE example (pk1 text, pk2 bigint, ck1 int,ck2 text,col1 int, col2 int, col3 text, col4 text, PRIMARY KEY ((pk1,pk2),ck1, ck2));
Create a Lucene search index using four attribute columns.
CREATE CUSTOM INDEX index_lucene ON test.example(col1,col2,col3,col4) USING LuceneGlobalIndex' WITH OPTIONS = { 'table_tokens': '3', 'analyzed_columns': 'col4', 'disable_doc_value': 'col4', 'ordered_columns': 'col3,col4', 'ordered_sequences': 'desc,asc', 'analyzer_class': 'StandardAnalyzer' };
Parameter |
Description |
---|---|
table_tokens |
Number of initialized Lucene search index shards. If this parameter is not specified, default value 3 is used. Shards occupy CPU and memory resources and increase with data volumes. |
analyzed_columns |
Column used for full-text search |
analyzer_class |
Analyzer used for full-text search Chinese parser: 'analyzer_class': 'SmartChineseAnalyzer' Standard parser: 'analyzer_class': 'StandardAnalyzer' IK parser: 'analyzer_class': 'IKAnalyzer' |
ordered_columns |
Default sorting order of Lucene search index columns. If this parameter is not specified, the sorting order of the index columns is the same as that of table data in an Apache Cassandra database by default. In this case, queries can be efficiently executed. Multiple values need to be separated by commas (,). |
ordered_sequences |
Ascending or descending order of index columns specified in ordered_columns. asc indicates the ascending order and desc descending order. |
disable_doc_value |
docValues is set to false for index columns that do not require operations such as sorting and aggregate query. |
Multi-dimensional query: Implement a nested query based on any combination of index columns. Both exact query and range query are supported.
SELECT * from example WHERE pk1>='a' and pk2>=1000 and ck2 in ('a','b','c') and col1 <= 4 and col2 >= 2;
COUNT query: The number of rows in a data table or the number of matched rows is obtained based on a specific query criteria of index columns.
SELECT count(*) FROM example WHERE col1 > 3 AND EXPR(index_lucene, 'count');
Index column sorting order: Multiple index column sorting rules can be specified. A result set is returned based on the multi-dimensional query. For details about supported extended JSON syntax, see Extended JSON Syntax.
Fuzzy query: Prefix query and wildcard query are supported.
SELECT * FROM example WHERE col3 LIKE 'test%'; SELECT * FROM example WHERE col3 LIKE 'start*end';
Aggregate analysis: Perform simple aggregate analysis using functions SUM, MAX, MIN, and AVG based on the combination of index columns.
SELECT sum(col1) from example WHERE pk1>='a' and pk2>=1000 and col1 <= 4 and col2 >= 2;
Full-text search: Specify a Chinese or an English analyzer to perform word segmentation and obtain data related to the word segmentation results.
SELECT * FROM example WHERE col4 LIKE '%+test -index%';
Extended JSON Syntax
Keyword |
Function |
---|---|
filter |
Keyword of queries in the extended JSON format |
term |
Whether a document contains a specific value in a query |
match |
Segments a queried value and performs full-text search. |
range |
Queries a specified field in a specified range (range query sub-keywords: eq, gte, gt, lte, and lt). |
bool |
Complex query that uses the combination of must, should and must not |
must |
A query clause in a bool query, including term, match, and range queries. The results must match all queries. |
should |
A query clause in a bool query, including term, match, and range queries. The results must match at least one of the queries. |
must not |
A query clause in a bool query, including term, match, and range queries. All matches are excluded from the results. |
sort |
Global index columns can be sorted in ascending or descending order. |
Example of a typical JSON query statement
{ "filter": { "bool": { "should": [ {"term": {"col1": 1, "col1": 2, "col1": 3, "col3": "testcase7"}} ], "must": [ {"range": {"col2": {"lte": 7, "gt": 0}, "ck1": {"gte": 2}}}, {"match": {"col4": "+lucene -index"}} ] } }, "sort": [{"col1":"desc"}, {"col2":"asc"}] }
Complete CQL statement
SELECT * from example where expr(index_lucene, '{"filter": {"bool": {"should": [{"term": {"col1": 1, "col1": 2, "col1": 3, "col3": "testcase7"}}], "must": [{"range": {"col2": {"lte": 7, "gt": 0}, "ck1": {"gte": 2}}},{"match": {"col4": "+lucene -index"}}]}}, "sort": [{"col1":"desc"}, {"col2":"asc"}]}');
Example of comparing a CQL statement with a JSON statement for typical queries
1. When partition keys (pk1 and pk2) are specified in a query, pk1 and pk2 must be separated from the JSON query condition. Otherwise, the performance will be affected.
SELECT * from example where pk1=*** and pk2=*** and expr(index_lucene, 'json');
2. Query condition: col1=1
SELECT * from example WHERE col1=1; SELECT * from example WHERE expr(index_lucene, '{"filter": {"term": {"col1": 1}}}'); SELECT * from example WHERE expr(index_lucene, '{"filter": {"bool": {"must": [{"term": {"col1": 1}}]}}}');
These three statements are equivalent and recommended in sequence. You are advised to execute the first common CQL query. If the common CQL query is not supported, execute the extended JSON query.
3. Query condition: col1=1 and col2>=2
SELECT * from example WHERE col1=1 and col2>=2; SELECT * from example WHERE expr(index_lucene, '{"filter": {"term": {"col1": 1},"range": {"col2": {"gte": 2}}}}'); SELECT * from example WHERE expr(index_lucene, '{"filter": {"bool": {"must": [{"term": {"col1": 1}}, {"range": {"col2": {"gte": 2}}}]}}}');
The common CQL query is recommended.
4. Query condition: col1=1 and (col2<2 or col2>3)
SELECT * from example WHERE expr(index_lucene, '{"filter": {"bool": {"must": [{"term": {"col1": 1}}], "should": [{"range": {"col2": {"lt": 2}, "col2": {"gt": 3}}}]}}}'); SELECT * from example WHERE expr(index_lucene, '{"filter": {"bool": {"must": [{"term": {"col1": 1}}], "must_not": [{"range": {"col2": {"gte": 2, "lte": 3}}}]}}}');
These two statements are equivalent, but must_not is not recommended because of its lower query performance than should.
5. Query condition: col1 in (1,2,3,4) and (col2<2 or col2>3)
SELECT * from example WHERE expr(index_lucene, '{"filter": {"bool": {"should": [{"term": {"col1": 1, "col1": 2, "col1": 3, "col1": 4}}], "should": [{"range": {"col2": {"lt": 2}, "col2": {"gt": 3}}}]}}}'); SELECT * from example WHERE expr(index_lucene, '{"filter": {"bool": {"should": [{"term": {"col1": 1, "col1": 2, "col1": 3, "col1": 4}}], "must_not": [{"range": {"col2": {"gte": 2, "lte": 3}}}]}}}');
These two statements are equivalent, but must_not is not recommended because of its lower query performance than should.
6. Single query with partitions specified: pk1='a' and pk2=1000 and col1 in (1,2,3,4) and (col2<2 or col2>3)
SELECT * from example WHERE pk1='a' and pk2=1000 and expr(index_lucene, '{"filter": {"bool": {"should": [{"term": {"col1": 1, "col1": 2, "col1": 3, "col1": 4}}], "should": [{"range": {"col2": {"lt": 2}, "col2": {"gt": 3}}}]}}}');
7. Query condition: (((ck1<2 or ck1>=4) and (col1<2 or col1 >3)) or (pk1 in ('a', 'b', 'c'))) or (5<=col2<15 and pk2 > 2000)
SELECT * from example WHERE expr(index_lucene, '{"filter": {"bool": {"should": [{"bool": {"should": [{"bool": {"must": [{"bool": {"should": [{"range": {"ck1": {"lt": 2}, "ck1": {"gte": 4}}}]}}, {"bool": {"should": [{"range": {"col1": {"lt": 2}, "col1": {"gt": 3}}}]}}]}}, {"bool": {"should": [{"term": {"pk1": "a", "pk1": "b", "pk1": "c"}}]}}]}}, {"bool": {"must": [{"range": {"col2": {"gte":5, "lte": 15}, "pk2": {"gt": 2000}}}]}}]}}}');
8. COUNT quey: The statement is as follows. You can also create the preceding query conditions in JSON format.
SELECT count(*) from example WHERE expr(index_lucene, '{"filter": {"bool": {"should": [{"bool": {"should": [{"bool": {"must": [{"bool": {"should": [{"range": {"ck1": {"lt": 2}, "ck1": {"gte": 4}}}]}}, {"bool": {"should": [{"range": {"col1": {"lt": 2}, "col1": {"gt": 3}}}]}}]}}, {"bool": {"should": [{"term": {"pk1": "a", "pk1": "b", "pk1": "c"}}]}}]}}, {"bool": {"must": [{"range": {"col2": {"gte":5, "lte": 15}, "pk2": {"gt": 2000}}}]}}]}}}');
Precautions
- If common CQL statements can be executed, avoid JSON query statements as much as possible.
- For a single-partition query, the partition key condition must be used independently instead of being added to the JSON query condition. Otherwise, the single query performance will be affected.
- Avoid the must_not clause as much as possible.
- If query results always need to be sorted based on sorting orders of some index columns, you can specify their sorting orders as the default when creating indexes to improve performance.
Feedback
Was this page helpful?
Provide feedbackThank you very much for your feedback. We will continue working to improve the documentation.See the reply and handling status in My Cloud VOC.
For any further questions, feel free to contact us through the chatbot.
Chatbot