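Vector Search on Nested Fields

When working with long documents such as technical manuals or legal texts, you typically split each document into multiple semantic chunks, by paragraph or by fixed length, and generate an embedding vector for each chunk. With a conventional flat storage layout, the mapping between document metadata (such as ID and title) and the document's many vectors is hard to maintain, and a search cannot tell which paragraph produced the match. To address this, the CSS vector database supports vector search on nested fields (the nested type). A single parent document can hold multiple "vector sub-documents" in a one-parent, many-children structure. At query time, one request scans all nested vectors, and a score mode lets the most relevant paragraph determine each document's score, which simplifies the architecture of long-text retrieval.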

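Feature Description

In Elasticsearch/OpenSearch, an ordinary object field is flattened internally, so the association between the sub-fields of different objects in an array is lost. The nested type instead stores each nested object as a separate hidden document, so the paragraph text in each nested object stays associated with its own vector during a search.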
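The difference between the two layouts can be illustrated with a small sketch. It is a conceptual model only, not how the engine is implemented: the helper names (`flatten_object_array`, `flat_matches`, `nested_matches`) and the chunk labels are hypothetical, and the flattened form is modeled as one unordered set of values per sub-field.

```python
# Conceptual model: an object array flattened into per-field value sets,
# compared with nested objects that are kept as separate units.
chunks = [
    {"chunk": "intro", "emb": [1, 1]},
    {"chunk": "details", "emb": [2, 2]},
]


def flatten_object_array(field, items):
    """Collapse an array of objects into one set of values per sub-field."""
    flat = {}
    for item in items:
        for key, value in item.items():
            hashable = tuple(value) if isinstance(value, list) else value
            flat.setdefault(f"{field}.{key}", set()).add(hashable)
    return flat


def flat_matches(flat, chunk, emb):
    """Match against the flattened form: each sub-field is checked on its own."""
    return chunk in flat["embedding.chunk"] and tuple(emb) in flat["embedding.emb"]


def nested_matches(items, chunk, emb):
    """Match against the nested form: both conditions must hold in one object."""
    return any(i["chunk"] == chunk and i["emb"] == emb for i in items)


flat = flatten_object_array("embedding", chunks)

# "intro" and [2, 2] belong to different objects.
print(flat_matches(flat, "intro", [2, 2]))      # True  (false positive)
print(nested_matches(chunks, "intro", [2, 2]))  # False (association preserved)
```

The flattened form reports a match even though no single chunk has both values, which is the association loss that the nested type avoids.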

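When you run a vector search on a nested field, the system goes through the following steps:

Sub-document scan: The vectors of all sub-documents inside each parent document are traversed.
Local scoring: The similarity between each sub-document vector and the query vector is computed.
Score aggregation: score_mode determines the final score of the parent document. With score_mode set to max, the parent document takes the score of its best-matching paragraph, so a document ranks high as long as one of its paragraphs is highly relevant to the query.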
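The three steps can be sketched in a few lines of Python. This is a model under stated assumptions, not the engine's implementation: the similarity formula `1 / (1 + d²)` for the euclidean metric is an assumption inferred from the sample scores shown in the query results in this section, and the function names are hypothetical. The data is the same three documents used in the import example.

```python
def euclidean_score(query, vector):
    """Assumed scoring for the euclidean metric: 1 / (1 + squared distance)."""
    dist_sq = sum((q - v) ** 2 for q, v in zip(query, vector))
    return 1.0 / (1.0 + dist_sq)


def score_parent(query, child_vectors):
    """Scan all child vectors, score each one, and aggregate with max."""
    child_scores = [euclidean_score(query, v) for v in child_vectors]
    return max(child_scores)


# Parent id -> vectors of its nested sub-documents.
docs = {
    1: [[1, 1], [2, 2]],
    2: [[2, 2], [3, 3]],
    3: [[3, 3], [4, 4]],
}
query = [1, 1]

ranked = sorted(
    ((score_parent(query, vecs), doc_id) for doc_id, vecs in docs.items()),
    reverse=True,
)
for score, doc_id in ranked:
    print(doc_id, round(score, 4))
# 1 1.0
# 2 0.3333
# 3 0.1111
```

Under this assumption the ranking and scores agree with the sample response for the standard query (1.0, 0.33333334, 0.11111111).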

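Constraints

Vector search on nested fields is supported only in Elasticsearch 7.10.2 and OpenSearch 2.19.0 clusters.
Nested fields increase the total number of underlying documents. For very large datasets, evaluate the load each shard has to carry.
Nested fields do not support re-scoring with the rescore clause. Only the query.vector clause (see Standard Vector Query) is supported.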
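To get a rough sense of how much nested fields inflate the document count, the following sketch assumes that each nested object is stored as one hidden document in addition to its parent. The function name is hypothetical, and replicas and deleted documents are not taken into account.

```python
def estimate_total_docs(parent_count, chunks_per_parent):
    """Estimate stored documents: each parent plus one hidden doc per chunk."""
    return parent_count * (1 + chunks_per_parent)


# The sample data in this section: 3 parents with 2 chunks each.
print(estimate_total_docs(3, 2))           # 9

# A larger, purely illustrative case: 1,000,000 parents with 20 chunks each.
print(estimate_total_docs(1_000_000, 20))  # 21000000
```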

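Creating a Vector Index with a Nested Field

Run the following command to create a vector index with a nested field. The index contains an id field of type keyword and an embedding field of type nested. The embedding field has two sub-fields: chunk, of type keyword, and emb, of type vector.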

PUT my_index
{
  "settings": {
    "index.vector": true
  },
  "mappings": {
    "properties": {
      "id": {
        "type": "keyword"
      },
      "embedding": {
        "type": "nested",        // Declare the nested type
        "properties": {
          "chunk": {
            "type": "keyword"
          },
          "emb": {               // Vector sub-field
            "type": "vector",
            "dimension": 2,
            "indexing": true,
            "algorithm": "GRAPH",
            "metric": "euclidean"
          }
        }
      }
    }
  }
}

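Importing Multi-Vector Data

Run the following command to write each parent document together with its chunk vectors, passed as an array, in a single request. Each document contains two vectors.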

POST my_index/_bulk
{"index":{}}
{"id": 1, "embedding": [{"chunk":1,"emb": [1, 1]}, {"chunk":2,"emb": [2, 2]}]}
{"index":{}}
{"id": 2, "embedding": [{"chunk":1,"emb": [2, 2]}, {"chunk":2,"emb": [3, 3]}]}
{"index":{}}
{"id": 3, "embedding": [{"chunk":1,"emb": [3, 3]}, {"chunk":2,"emb": [4, 4]}]}
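When the chunks come out of an application pipeline, the request body above can be generated rather than written by hand. The sketch below builds the same newline-delimited body with the standard library only; the helper names `to_parent_doc` and `build_bulk_body` are hypothetical, and sending the body to the cluster is left out.

```python
import json


def to_parent_doc(doc_id, chunk_vectors):
    """Build one parent document; chunks are numbered from 1 in input order."""
    return {
        "id": doc_id,
        "embedding": [
            {"chunk": n, "emb": list(vec)}
            for n, vec in enumerate(chunk_vectors, start=1)
        ],
    }


def build_bulk_body(docs):
    """Serialize documents as action/source line pairs for the _bulk API."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {}}))
        lines.append(json.dumps(doc))
    # The _bulk body must end with a newline.
    return "\n".join(lines) + "\n"


chunked = {
    1: [[1, 1], [2, 2]],
    2: [[2, 2], [3, 3]],
    3: [[3, 3], [4, 4]],
}
docs = [to_parent_doc(doc_id, vecs) for doc_id, vecs in chunked.items()]
body = build_bulk_body(docs)
print(body, end="")
```

Keeping all chunks of a document in one source line is what makes the write atomic per parent: the parent and its nested vectors are indexed together.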

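Running a Nested Vector Search

Search the nested path for the most similar paragraphs and rank documents by their best match. A nested field must be queried with a nested query. The query must specify the path parameter, which names the nested path to search, and must set score_mode to max, which means that a document's score is the highest similarity between the query vector and any vector in that document.

Standard Query

Find the top 10 documents most similar to the vector [1, 1].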

GET my_index/_search
{
  "_source": {"excludes": ["embedding"]}, // Return only the parent document's id and omit the bulky vector array
  "query": {
    "nested": {
      "path": "embedding",                // Required: path of the nested field
      "score_mode": "max",                // How child scores are merged into the parent: take the best-matching paragraph
      "query": {
        "vector": {
          "embedding.emb": {
            "vector": [1, 1],
            "topk": 10
          }
        }
      }
    }
  }
}

The following is an example response:

{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 3,
      "relation" : "eq"
    },
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "my_index",
        "_type" : "_doc",
        "_id" : "Hc4Vc5QBSxCnghau22AE",
        "_score" : 1.0,
        "_source" : {
          "id" : 1
        }
      },
      {
        "_index" : "my_index",
        "_type" : "_doc",
        "_id" : "Hs4Vc5QBSxCnghau22AE",
        "_score" : 0.33333334,
        "_source" : {
          "id" : 2
        }
      },
      {
        "_index" : "my_index",
        "_type" : "_doc",
        "_id" : "H84Vc5QBSxCnghau22AE",
        "_score" : 0.11111111,
        "_source" : {
          "id" : 3
        }
      }
    ]
  }
}
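On the client side, a response of this shape can be reduced to a ranked list of parent ids and scores. The sketch below is a minimal illustration using only the standard library; the function name `ranked_ids` is hypothetical, and the response literal is an abridged copy of the example above.

```python
import json

# Abridged copy of the example response (only the fields used below).
response_text = """
{
  "hits": {
    "total": {"value": 3, "relation": "eq"},
    "max_score": 1.0,
    "hits": [
      {"_id": "Hc4Vc5QBSxCnghau22AE", "_score": 1.0, "_source": {"id": 1}},
      {"_id": "Hs4Vc5QBSxCnghau22AE", "_score": 0.33333334, "_source": {"id": 2}},
      {"_id": "H84Vc5QBSxCnghau22AE", "_score": 0.11111111, "_source": {"id": 3}}
    ]
  }
}
"""


def ranked_ids(response):
    """Return (business id, score) pairs in the order the hits were returned."""
    return [(hit["_source"]["id"], hit["_score"]) for hit in response["hits"]["hits"]]


results = ranked_ids(json.loads(response_text))
print(results)
# [(1, 1.0), (2, 0.33333334), (3, 0.11111111)]
```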

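Pre-Filter Query

First filter the documents whose id is one of ["2", "3"], then return the top 10 of those documents most similar to the query vector [1, 1].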

GET my_index/_search
{
  "query": {
    "nested": {
      "path": "embedding",                // Required: path of the nested field
      "score_mode": "max",                // How child scores are merged into the parent: take the best-matching paragraph
      "query": {
        "vector": {
          "embedding.emb": {
            "vector": [1, 1],
            "topk": 10,
            "filter": {
              "terms": {"id": ["2", "3"]}
            }
          }
        }
      }
    }
  }
}

The following is an example response:

{
  "took" : 3,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 0.33333334,
    "hits" : [
      {
        "_index" : "my_index",
        "_type" : "_doc",
        "_id" : "3t0ZypcB-Tff59gMTZO2",
        "_score" : 0.33333334,
        "_source" : {
          "id" : 2,
          "embedding" : [
            {
              "chunk" : 1,
              "emb" : [
                2,
                2
              ]
            },
            {
              "chunk" : 2,
              "emb" : [
                3,
                3
              ]
            }
          ]
        }
      },
      {
        "_index" : "my_index",
        "_type" : "_doc",
        "_id" : "390ZypcB-Tff59gMTZO2",
        "_score" : 0.11111111,
        "_source" : {
          "id" : 3,
          "embedding" : [
            {
              "chunk" : 1,
              "emb" : [
                3,
                3
              ]
            },
            {
              "chunk" : 2,
              "emb" : [
                4,
                4
              ]
            }
          ]
        }
      }
    ]
  }
}
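The standard and pre-filter queries differ only in the optional filter clause, so an application can build both from one helper. The sketch below only constructs the request body as a Python dictionary, following the structure of the two requests in this section. The function name and parameters are hypothetical, and score_mode is fixed to max as this section requires.

```python
import json


def nested_vector_query(path, vector_field, query_vector, topk, filter_clause=None):
    """Build a nested vector query body; filter_clause is an optional pre-filter."""
    vector_clause = {"vector": list(query_vector), "topk": topk}
    if filter_clause is not None:
        vector_clause["filter"] = filter_clause
    return {
        "query": {
            "nested": {
                "path": path,
                "score_mode": "max",
                "query": {"vector": {f"{path}.{vector_field}": vector_clause}},
            }
        }
    }


standard = nested_vector_query("embedding", "emb", [1, 1], 10)
prefiltered = nested_vector_query(
    "embedding", "emb", [1, 1], 10,
    filter_clause={"terms": {"id": ["2", "3"]}},
)
print(json.dumps(prefiltered, indent=2))
```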
