Last updated: 2025-09-05 GMT+08:00

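Using Nested Fields for Vector Search

Nested fields let a single document store multiple vectors. In RAG scenarios, for example, a source document is usually split by paragraph or by length, and each chunk is embedded separately, which yields several semantic vectors. With a nested field, all of these vectors can be written to the same index document. When a document contains multiple vectors, it is returned if any one of them is similar to the query vector.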

Constraints

Only clusters running OpenSearch 2.19.0 support vector indexes in nested fields.

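Creating a Vector Index

Create a vector index with a nested field. The index contains an id field of type keyword and an embedding field of type nested. The embedding nested field has two subfields: chunk, of type keyword, and emb, of type vector.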

PUT my_index
{
  "settings": {
    "index.vector": true
  },
  "mappings": {
    "properties": {
      "id": {
        "type": "keyword"
      },
      "embedding": {
        "type": "nested",
        "properties": {
          "chunk": {
            "type": "keyword"
          },
          "emb": {
            "type": "vector",
            "dimension": 2,
            "indexing": true,
            "algorithm": "GRAPH",
            "metric": "euclidean"
          }
        }
      }
    }
  }
}
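
Real embeddings rarely have two dimensions, so the request body above usually has to be generated for a given model. The following is a minimal Python sketch that rebuilds the same body with a configurable dimension. The function name build_nested_vector_index_body and its parameters are hypothetical; only the JSON keys and values come from the example above.

```python
import json


def build_nested_vector_index_body(dimension, metric="euclidean"):
    """Return the index-creation body from the example above.

    Hypothetical helper: only `dimension` and `metric` are configurable;
    every other key and value is copied from the documented request.
    """
    if dimension <= 0:
        raise ValueError("dimension must be a positive integer")
    return {
        "settings": {"index.vector": True},
        "mappings": {
            "properties": {
                "id": {"type": "keyword"},
                "embedding": {
                    "type": "nested",
                    "properties": {
                        "chunk": {"type": "keyword"},
                        "emb": {
                            "type": "vector",
                            "dimension": dimension,
                            "indexing": True,
                            "algorithm": "GRAPH",
                            "metric": metric,
                        },
                    },
                },
            }
        },
    }


if __name__ == "__main__":
    # Print the body for the 2-dimensional example used in this section.
    print(json.dumps(build_nested_vector_index_body(2), indent=2))
```

The printed JSON can be sent as the body of PUT my_index with any HTTP client.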

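Importing Vector Data

Use the Bulk API to write the data. The embedding field is written as an array, and each document contains two vectors.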

POST my_index/_bulk
{"index":{}}
{"id": 1, "embedding": [{"chunk":1,"emb": [1, 1]}, {"chunk":2,"emb": [2, 2]}]}
{"index":{}}
{"id": 2, "embedding": [{"chunk":1,"emb": [2, 2]}, {"chunk":2,"emb": [3, 3]}]}
{"index":{}}
{"id": 3, "embedding": [{"chunk":1,"emb": [3, 3]}, {"chunk":2,"emb": [4, 4]}]}
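
In a RAG pipeline the bulk body is normally generated from chunk embeddings rather than written by hand. The sketch below assumes the vectors are already computed and shows one way to produce the newline-delimited body in the same shape as the request above. The helper name build_bulk_body and the (doc_id, vectors) input format are hypothetical, and chunks are simply numbered from 1 as in the example.

```python
import json


def build_bulk_body(docs):
    """Build a newline-delimited _bulk body.

    `docs` is an iterable of (doc_id, vectors) pairs, where `vectors` is a
    list of equal-length numeric lists, one per chunk (assumed input shape).
    """
    lines = []
    for doc_id, vectors in docs:
        # Action line: index the document and let the cluster assign _id.
        lines.append(json.dumps({"index": {}}))
        # Source line: one nested object per chunk, numbered from 1.
        nested = [
            {"chunk": chunk_no, "emb": list(vec)}
            for chunk_no, vec in enumerate(vectors, start=1)
        ]
        lines.append(json.dumps({"id": doc_id, "embedding": nested}))
    # A bulk body must end with a newline character.
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    sample = [
        (1, [[1, 1], [2, 2]]),
        (2, [[2, 2], [3, 3]]),
        (3, [[3, 3], [4, 4]]),
    ]
    print(build_bulk_body(sample), end="")
```

The returned string can be sent as the body of POST my_index/_bulk.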

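Vector Search

A nested field must be searched with a nested query. The query must specify the path parameter, which identifies the nested path to search, and must set score_mode to max, so that a document's score is the highest similarity between the query vector and any vector in that document.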
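
The structure described above (a nested query with path and score_mode wrapping a vector query on the nested subfield) can be assembled programmatically. The following is a minimal sketch; the function name build_nested_vector_query and its parameters are hypothetical, while the JSON keys follow the request bodies shown in this section. The optional filter_query argument corresponds to the pre-filter variant.

```python
import json


def build_nested_vector_query(path, vector_field, query_vector, topk,
                              filter_query=None):
    """Build a nested vector search body (hypothetical helper).

    path          nested field name, for example "embedding"
    vector_field  vector subfield under that path, for example "emb"
    filter_query  optional query clause applied as a pre-filter
    """
    vector_clause = {"vector": list(query_vector), "topk": topk}
    if filter_query is not None:
        vector_clause["filter"] = filter_query
    return {
        "query": {
            "nested": {
                "path": path,
                # Required setting: score a document by its best-matching vector.
                "score_mode": "max",
                "query": {
                    "vector": {f"{path}.{vector_field}": vector_clause}
                },
            }
        }
    }


if __name__ == "__main__":
    body = build_nested_vector_query(
        "embedding", "emb", [1, 1], 10,
        filter_query={"terms": {"id": ["2", "3"]}},
    )
    print(json.dumps(body, indent=2))
```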

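  • Standard query

    Query the top 10 documents most similar to the vector [1, 1]. The embedding field is excluded from _source to keep the response compact.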

    GET my_index/_search
    {
      "_source": {"excludes": ["embedding"]},
      "query": {
        "nested": {
          "path": "embedding",
          "score_mode": "max",
          "query": {
            "vector": {
              "embedding.emb": {
                "vector": [1, 1],
                "topk": 10
              }
            }
          }
        }
      }
    }

    An example response is as follows:

    {
      "took" : 2,
      "timed_out" : false,
      "_shards" : {
        "total" : 1,
        "successful" : 1,
        "skipped" : 0,
        "failed" : 0
      },
      "hits" : {
        "total" : {
          "value" : 3,
          "relation" : "eq"
        },
        "max_score" : 1.0,
        "hits" : [
          {
            "_index" : "my_index",
            "_type" : "_doc",
            "_id" : "Hc4Vc5QBSxCnghau22AE",
            "_score" : 1.0,
            "_source" : {
              "id" : 1
            }
          },
          {
            "_index" : "my_index",
            "_type" : "_doc",
            "_id" : "Hs4Vc5QBSxCnghau22AE",
            "_score" : 0.33333334,
            "_source" : {
              "id" : 2
            }
          },
          {
            "_index" : "my_index",
            "_type" : "_doc",
            "_id" : "H84Vc5QBSxCnghau22AE",
            "_score" : 0.11111111,
            "_source" : {
              "id" : 3
            }
          }
        ]
      }
    }
  • Pre-filter query

    First select the documents whose id is in ["2", "3"], then return the top 10 of them most similar to the query vector [1, 1].

    GET my_index/_search
    {
      "query": {
        "nested": {
          "path": "embedding",
          "score_mode": "max",
          "query": {
            "vector": {
              "embedding.emb": {
                "vector": [1, 1],
                "topk": 10,
                "filter": {
                  "terms": {"id": ["2", "3"]}
                }
              }
            }
          }
        }
      }
    }

    An example response is as follows:

    {
      "took" : 3,
      "timed_out" : false,
      "_shards" : {
        "total" : 1,
        "successful" : 1,
        "skipped" : 0,
        "failed" : 0
      },
      "hits" : {
        "total" : {
          "value" : 2,
          "relation" : "eq"
        },
        "max_score" : 0.33333334,
        "hits" : [
          {
            "_index" : "my_index",
            "_type" : "_doc",
            "_id" : "3t0ZypcB-Tff59gMTZO2",
            "_score" : 0.33333334,
            "_source" : {
              "id" : 2,
              "embedding" : [
                {
                  "chunk" : 1,
                  "emb" : [
                    2,
                    2
                  ]
                },
                {
                  "chunk" : 2,
                  "emb" : [
                    3,
                    3
                  ]
                }
              ]
            }
          },
          {
            "_index" : "my_index",
            "_type" : "_doc",
            "_id" : "390ZypcB-Tff59gMTZO2",
            "_score" : 0.11111111,
            "_source" : {
              "id" : 3,
              "embedding" : [
                {
                  "chunk" : 1,
                  "emb" : [
                    3,
                    3
                  ]
                },
                {
                  "chunk" : 2,
                  "emb" : [
                    4,
                    4
                  ]
                }
              ]
            }
          }
        ]
      }
    }
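
The scores in both sample responses can be reproduced by hand, which helps show what score_mode max does. The sketch below is a brute-force illustration only, not the cluster's graph-based search. It assumes that, for the euclidean metric, the score is 1 / (1 + squared Euclidean distance). This formula is inferred from the sample scores in this section (1.0, 0.33333334, 0.11111111) and is not stated by the source. The function names are hypothetical.

```python
def similarity(query, vec):
    """Assumed score: 1 / (1 + squared Euclidean distance)."""
    squared_distance = sum((q - v) ** 2 for q, v in zip(query, vec))
    return 1.0 / (1.0 + squared_distance)


def rank(docs, query, topk=10, allowed_ids=None):
    """Score each document by its best-matching vector (score_mode max).

    `docs` maps a document id to its list of vectors. `allowed_ids`, when
    given, is applied before scoring, mirroring the pre-filter query.
    """
    scored = []
    for doc_id, vectors in docs.items():
        if allowed_ids is not None and doc_id not in allowed_ids:
            continue
        scored.append((doc_id, max(similarity(query, v) for v in vectors)))
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:topk]


# The three documents imported earlier in this section.
DOCS = {
    1: [[1, 1], [2, 2]],
    2: [[2, 2], [3, 3]],
    3: [[3, 3], [4, 4]],
}

if __name__ == "__main__":
    print(rank(DOCS, [1, 1]))                      # standard query
    print(rank(DOCS, [1, 1], allowed_ids={2, 3}))  # pre-filter query
```

For document 2, the closest vector to [1, 1] is [2, 2] at squared distance 2, so the document's score is 1/3. The second vector [3, 3] would score 1/9 and is ignored because only the maximum counts.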