How to Use Lucene Search Indexes

Updated: 2025-03-12 GMT+08:00

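GeminiDB Cassandra supports Lucene search indexes, which provide multi-dimensional queries, text search, and statistical analysis. They are used much like native secondary indexes but support a richer query syntax.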

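Pain Points of Native Cassandra Secondary Indexes

In native Cassandra, a secondary index is implemented as a hidden table whose primary key is the indexed column and whose value is the primary key of the base table. This simple design inevitably comes with constraints:

The first primary key column can be queried only with "=".
The second primary key column can be queried with "=", ">", "<", ">=", and "<=".
Indexed columns support only "=" queries.
Columns that are deleted or updated very frequently are poor candidates for indexing.
High-cardinality columns are poor candidates for indexing.

Because of these constraints, the query capabilities of native Cassandra secondary indexes are limited.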

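Lucene Search Index Architecture

Key technical points:

An embedded Lucene search engine works alongside the storage engine, deeply integrating the wide-column storage engine with the search engine.
A unified SQL layer stays compatible with native Cassandra syntax while adding multi-dimensional queries, text search, fuzzy queries, and statistical analysis, which improves the query experience on massive data sets.

Figure 1 Lucene search index architecture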

Examples of Using Lucene Search Indexes

Figure 2 Using Lucene search indexes

Example table schema:

CREATE TABLE example (pk1 text, pk2 bigint, ck1 int,ck2 text,col1 int, col2 int, col3 text, col4 text, PRIMARY KEY ((pk1,pk2),ck1, ck2));

Example of creating a Lucene search index:

-- Create a Lucene index on col1, col2, col3, and col4
CREATE CUSTOM INDEX index_lucene ON test.example(col1,col2,col3,col4) USING 'LuceneGlobalIndex'
WITH OPTIONS = {
'table_tokens': '3',  -- initialize the Lucene search index with 3 shards
'analyzed_columns': 'col4',  -- use col4 for full-text search
'disable_doc_value': 'col4',  -- do not store DocValues for col4
'ordered_columns': 'col3,col4',  -- use col3 and col4 as the sort columns
'ordered_sequences': 'desc,asc',  -- sort col3 in descending order and col4 in ascending order
'analyzer_class': 'StandardAnalyzer',  -- use StandardAnalyzer as the full-text search analyzer
'case_insensitive': 'col3'  -- make col3 case-insensitive for English characters
};

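Table 1 Optional OPTIONS parameters
Parameter	Description
table_tokens	Initial number of Lucene search index shards. The default is 3. Shards consume CPU and memory resources, and that consumption grows with data volume.
analyzed_columns	Columns used for full-text search. English characters are stored in lowercase by default. If the workload needs case-insensitive matching of English characters, also set the case_insensitive option.
analyzer_class	Analyzer used for full-text search. Chinese analyzer: 'analyzer_class': 'SmartChineseAnalyzer'. Standard analyzer: 'analyzer_class': 'StandardAnalyzer'. IK analyzer: 'analyzer_class': 'IKAnalyzer'.
ordered_columns	Default sort columns of the Lucene search index. Entries must correspond one-to-one with ordered_sequences. If not specified, the sort order matches that of GeminiDB Cassandra. Separate multiple index columns with commas. Note: queries are most efficient only when their sort order matches the default sort order.
ordered_sequences	Sort direction of each sort column: asc for ascending, desc for descending. Entries must correspond one-to-one with ordered_columns.
disable_doc_value	Index columns that are not stored as DocValues. DocValues storage can be disabled for index columns that are not used for sorting, aggregation, or similar operations.
case_insensitive	Index columns that are case-insensitive for English characters. Values in these columns are automatically converted to lowercase when stored and queried.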
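
Because ordered_columns and ordered_sequences must correspond one-to-one, it can help to assemble the OPTIONS map in code and check the two lists before the statement is sent. The Python sketch below is only an illustration: build_index_options is a hypothetical helper, not part of any GeminiDB client, and it assumes that option values contain no single quotes.

```python
def build_index_options(ordered_columns, ordered_sequences, **extra):
    """Return a CQL map literal for WITH OPTIONS (hypothetical helper).

    Raises ValueError when the sort columns and sort directions do not
    line up one-to-one, or when a direction is not 'asc' or 'desc'.
    """
    if len(ordered_columns) != len(ordered_sequences):
        raise ValueError("ordered_columns and ordered_sequences must have the same length")
    for direction in ordered_sequences:
        if direction not in ("asc", "desc"):
            raise ValueError(f"unknown sort direction: {direction!r}")

    options = dict(extra)  # e.g. table_tokens, analyzed_columns
    options["ordered_columns"] = ",".join(ordered_columns)
    options["ordered_sequences"] = ",".join(ordered_sequences)
    body = ", ".join(f"'{key}': '{value}'" for key, value in options.items())
    return "{" + body + "}"


options_literal = build_index_options(["col3", "col4"], ["desc", "asc"], table_tokens="3")
print("WITH OPTIONS = " + options_literal + ";")
```

The returned literal can be appended to a CREATE CUSTOM INDEX statement such as the one above.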

Multi-dimensional queries: nested queries over any combination of index columns, with support for exact-match and range queries.

SELECT * from example WHERE pk1>='a' and pk2>=1000 and ck2 in ('a','b','c') and col1 <= 4 and col2 >= 2;

Count: returns the total number of rows in a table, or the number of rows that match the conditions on index columns.

SELECT count(*) FROM example WHERE col1 > 3 AND EXPR(index_lucene, 'count');

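Sorting by index columns: sort rules can be specified on multiple index columns and combined with multi-dimensional queries to return a result set in the requested order. (This is supported through the extended JSON semantics; see the section "Extended JSON semantics".)

Fuzzy queries: prefix queries and wildcard queries are supported.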

SELECT * FROM example WHERE col3 LIKE 'test%'; 
SELECT * FROM example WHERE col3 LIKE 'start*end';

Aggregation: simple aggregations (sum/max/min/avg) over conditions that combine index columns.

SELECT sum(col1) from example WHERE pk1>='a' and pk2>=1000 and col1 <= 4 and col2 >= 2;

Full-text search: a Chinese or English analyzer can be specified for tokenized search, which returns highly relevant results.

SELECT * FROM example WHERE col4 LIKE '%+test -index%';

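Extended JSON semantics:

Table 2 Extended JSON semantics
Keyword	Description
filter	Keyword that introduces a JSON query in a query statement.
term	Checks whether a document contains a specific value.
match	Tokenizes the queried value and performs a full-text search.
range	Matches a field within a given range. (Range sub-keywords: "eq"/"gte"/"gt"/"lte"/"lt")
bool	Must be combined with "must", "should", or "must_not" to build complex queries.
must	Sub-query of bool that wraps "term", "match", and "range" queries.
should	Sub-query of bool that wraps "term", "match", and "range" queries.
must_not	Sub-query of bool that wraps "term", "match", and "range" queries.
sort	Sorts results by global index columns.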

Example of a typical JSON query:

{
  "filter": {
    "bool": {
      "should": [
        {"term": {"col1": 1, "col1": 2, "col1": 3, "col3": "testcase7"}}
      ], 
      "must": [
        {"range": {"col2": {"lte": 7, "gt": 0}, "ck1": {"gte": 2}}},
        {"match": {"col4": "+lucene -index"}}
      ]
    }
  }, 
  "sort": [{"col1":"desc"}, {"col2":"asc"}]
}

The complete CQL statement is as follows:

SELECT * from example where expr(index_lucene, '{"filter": {"bool": {"should": [{"term": {"col1": 1, "col1": 2, "col1": 3, "col3": "testcase7"}}], "must": [{"range": {"col2": {"lte": 7, "gt": 0}, "ck1": {"gte": 2}}},{"match": {"col4": "+lucene -index"}}]}}, "sort": [{"col1":"desc"}, {"col2":"asc"}]}');
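
The JSON above repeats a key inside one object (for example, "col1" appears three times under "term"), which a plain Python dict cannot represent. The sketch below shows one possible way to assemble such a query string in Python by keeping each object as an ordered list of key/value pairs. Obj, render, and expr_query are hypothetical names introduced for illustration only. The single-quote doubling follows the standard CQL string literal rule.

```python
import json


class Obj:
    """A JSON object kept as ordered (key, value) pairs so keys may repeat."""

    def __init__(self, *pairs):
        self.pairs = pairs


def render(value):
    """Serialize Obj instances, lists, numbers, and strings to JSON text."""
    if isinstance(value, Obj):
        items = ", ".join(f"{json.dumps(k)}: {render(v)}" for k, v in value.pairs)
        return "{" + items + "}"
    if isinstance(value, list):
        return "[" + ", ".join(render(v) for v in value) + "]"
    return json.dumps(value)


def expr_query(table, index, query):
    """Wrap the rendered JSON in a SELECT ... WHERE expr(...) statement."""
    # Double single quotes so the JSON stays inside one CQL string literal.
    escaped = render(query).replace("'", "''")
    return f"SELECT * FROM {table} WHERE expr({index}, '{escaped}');"


should = [Obj(("term", Obj(("col1", 1), ("col1", 2), ("col1", 3), ("col3", "testcase7"))))]
must = [
    Obj(("range", Obj(("col2", Obj(("lte", 7), ("gt", 0))), ("ck1", Obj(("gte", 2)))))),
    Obj(("match", Obj(("col4", "+lucene -index")))),
]
query = Obj(
    ("filter", Obj(("bool", Obj(("should", should), ("must", must))))),
    ("sort", [Obj(("col1", "desc")), Obj(("col2", "asc"))]),
)
statement = expr_query("example", "index_lucene", query)
print(statement)
```

Building the string from pairs rather than from a dict keeps the repeated keys in the order they were written.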

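The following examples compare plain CQL statements with their JSON forms for typical query scenarios:

1. Query with partition keys (pk1 and pk2 specified). Keep pk1 and pk2 out of the JSON condition; otherwise performance suffers.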

SELECT * from example where pk1=*** and pk2=*** and expr(index_lucene, 'json');

2. Query condition: col1=1.

SELECT * from example WHERE col1=1;
SELECT * from example WHERE expr(index_lucene, '{"filter": {"term": {"col1": 1}}}');
SELECT * from example WHERE expr(index_lucene, '{"filter": {"bool": {"must": [{"term": {"col1": 1}}]}}}');

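The three statements above are equivalent. In cases like this, prefer the first form, plain CQL, and use the JSON extension only when plain CQL cannot express the query. The statements are listed in order of preference, from top to bottom.

3. Query condition: col1=1 and col2>=2.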

SELECT * from example WHERE col1=1 and col2>=2;
SELECT * from example WHERE expr(index_lucene, '{"filter": {"term": {"col1": 1},"range": {"col2": {"gte": 2}}}}');
SELECT * from example WHERE expr(index_lucene, '{"filter": {"bool": {"must": [{"term": {"col1": 1}}, {"range": {"col2": {"gte": 2}}}]}}}');

As in the previous scenario, plain CQL is recommended.

4. Query condition: col1=1 and (col2<2 or col2>3).

SELECT * from example WHERE expr(index_lucene, '{"filter": {"bool": {"must": [{"term": {"col1": 1}}], "should": [{"range": {"col2": {"lt": 2}, "col2": {"gt": 3}}}]}}}');
SELECT * from example WHERE expr(index_lucene, '{"filter": {"bool": {"must": [{"term": {"col1": 1}}], "must_not": [{"range": {"col2": {"gte": 2, "lte": 3}}}]}}}');

The two statements above return the same results, but "must_not" is not recommended because it performs worse than "should".

5. Query condition: col1 in (1,2,3,4) and (col2<2 or col2>3).

SELECT * from example WHERE expr(index_lucene, '{"filter": {"bool": {"should": [{"term": {"col1": 1, "col1": 2, "col1": 3, "col1": 4}}], "should": [{"range": {"col2": {"lt": 2}, "col2": {"gt": 3}}}]}}}');
SELECT * from example WHERE expr(index_lucene, '{"filter": {"bool": {"should": [{"term": {"col1": 1, "col1": 2, "col1": 3, "col1": 4}}], "must_not": [{"range": {"col2": {"gte": 2, "lte": 3}}}]}}}');

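As in scenario 4, the two statements above return the same results, but "must_not" is not recommended because it performs worse than "should".

6. Single-partition query with partition keys: pk1='a' and pk2=1000 and col1 in (1,2,3,4) and (col2<2 or col2>3).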

SELECT * from example WHERE pk1='a' and pk2=1000 and expr(index_lucene, '{"filter": {"bool": {"should": [{"term": {"col1": 1, "col1": 2, "col1": 3, "col1": 4}}], "should": [{"range": {"col2": {"lt": 2}, "col2": {"gt": 3}}}]}}}');

7. Query condition: (((ck1<2 or ck1>=4) and (col1<2 or col1>3)) or (pk1 in ('a', 'b', 'c'))) or (5<=col2<=15 and pk2>2000).

SELECT * from example WHERE expr(index_lucene, '{"filter": {"bool": {"should": [{"bool": {"should": [{"bool": {"must": [{"bool": {"should": [{"range": {"ck1": {"lt": 2}, "ck1": {"gte": 4}}}]}}, {"bool": {"should": [{"range": {"col1": {"lt": 2}, "col1": {"gt": 3}}}]}}]}}, {"bool": {"should": [{"term": {"pk1": "a", "pk1": "b", "pk1": "c"}}]}}]}}, {"bool": {"must": [{"range": {"col2": {"gte":5, "lte": 15}, "pk2": {"gt": 2000}}}]}}]}}}');

8. Count queries can also use JSON to express the conditions. A count query over the conditions above looks like this:

SELECT count(*) from example WHERE expr(index_lucene, '{"filter": {"bool": {"should": [{"bool": {"should": [{"bool": {"must": [{"bool": {"should": [{"range": {"ck1": {"lt": 2}, "ck1": {"gte": 4}}}]}}, {"bool": {"should": [{"range": {"col1": {"lt": 2}, "col1": {"gt": 3}}}]}}]}}, {"bool": {"should": [{"term": {"pk1": "a", "pk1": "b", "pk1": "c"}}]}}]}}, {"bool": {"must": [{"range": {"col2": {"gte":5, "lte": 15}, "pk2": {"gt": 2000}}}]}}]}}}');
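
Scenarios 4 and 5 above state that the "should" form and the "must_not" form return the same results. The Python sketch below checks the underlying logic on sample values: (col2<2 or col2>3) and not (2<=col2<=3) agree for every number. It models only the boolean predicates, not the index itself, and it does not cover rows where col2 has no value. The function names are made up for this illustration.

```python
def should_form(col2):
    """(col2 < 2) OR (col2 > 3), the predicate written with "should"."""
    return col2 < 2 or col2 > 3


def must_not_form(col2):
    """NOT (2 <= col2 <= 3), the predicate written with "must_not"."""
    return not (2 <= col2 <= 3)


# Compare both predicates on integer and fractional sample values.
samples = [-1, 0, 1, 1.5, 2, 2.5, 3, 3.5, 4, 100]
mismatches = [v for v in samples if should_form(v) != must_not_form(v)]
print(mismatches)
```

An empty list means the two predicates select the same sample values, so the choice between them comes down to performance, as noted above.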

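Notes:

When plain CQL can express the query conditions, avoid relying on JSON queries.
For single-partition queries, specify the partition key conditions as separate query conditions rather than inside the JSON; otherwise single-partition query performance suffers.
Avoid "must_not" wherever possible.
If queries always need their output sorted by certain index columns, consider declaring that order as the default sort order when creating the index to improve performance.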
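
The note on single-partition queries can be sketched as a small query builder that keeps the partition key conditions in plain CQL and passes only the remaining conditions to expr(). This is an illustrative Python sketch: single_partition_query is a hypothetical helper, and it assumes that partition key values are either strings or numbers.

```python
def single_partition_query(table, index, partition_key, json_filter):
    """Build a SELECT with partition keys outside the JSON (hypothetical helper)."""
    conditions = []
    for column, value in partition_key.items():
        if isinstance(value, str):
            # CQL string literal: wrap in single quotes, double any inner quote.
            literal = "'" + value.replace("'", "''") + "'"
        else:
            literal = str(value)
        conditions.append(f"{column} = {literal}")
    # Only the non-partition-key conditions go into the expr() JSON.
    conditions.append(f"expr({index}, '" + json_filter.replace("'", "''") + "')")
    return f"SELECT * FROM {table} WHERE " + " AND ".join(conditions) + ";"


cql = single_partition_query(
    "example",
    "index_lucene",
    {"pk1": "a", "pk2": 1000},
    '{"filter": {"term": {"col1": 1}}}',
)
print(cql)
```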
