Performing Vector Search

The CSS vector database supports several query methods: standard query, hybrid query, script_score query, rescore query, and the Painless syntax extension. Together they cover different search needs, from plain nearest-neighbor search to filtered search and custom scoring.

Standard Query: retrieves documents that are most similar to the query vector.
Hybrid Query: combines vector search with traditional OpenSearch queries, such as pre-filtering and Boolean queries.
Script Score Query: enables custom similarity calculations for vector searches by executing a custom script.
Rescore Query: rescores and reranks the top results returned by an initial query to improve recall.
Painless Syntax Extension: allows the use of vector distance or similarity calculation functions in custom scripts.

Standard Query

Standard query is used to retrieve documents that are most similar to the query vector.

The following command returns the k records closest to the query vector, where k is set by size and topk.

POST my_index/_search
{
  "size":2,
  "_source": false, 
  "query": {
    "vector": {
      "my_vector": {
        "vector": [1, 1],
        "topk":2
      }
    }
  }
}

**Table 1** Parameters for standard query
Parameter	Mandatory	Type	Description
size	Yes	Integer	Number of search results to return. Default value: 10
_source	No	Boolean	Whether to return the source text of documents. To reduce data transmission and improve query performance, set this parameter to false. Value range: true (default): returns the source text; false: does not return the source text.
query	Yes	Map	Specifies the query. Parameters: vector (mandatory): indicates a vector query (vector similarity-based search) and contains the vector field and the query vector value. my_vector (mandatory): name of the vector field to query; my_vector is the field name used in this example.
vector (sub-parameter)	Yes	Array/String	Query vector value. It is used to calculate the similarity between indexed vectors and the query vector. The value can be an array (for example, [1, 1]) or Base64-encoded value (for example, AAABAAACAAAD).
topk	Yes	Integer	The number of the most similar or relevant results to be returned. Default value: same as size.
ef	No	Integer	Size of the candidate neighbor queue explored in the graph during a query. A larger value indicates a higher query accuracy yet slower query speed. This parameter is available only when algorithm is set to GRAPH, GRAPH_PQ, GRAPH_SQ8, or GRAPH_SQ4. Value range: 0–100000 Default value: 200
max_scan_num	No	Integer	Maximum number of graph nodes to scan during search. A larger value indicates a higher query accuracy yet slower query speed. This parameter is available only when algorithm is set to GRAPH, GRAPH_PQ, GRAPH_SQ8, or GRAPH_SQ4. Value range: 0–1000000 Default value: 10000
nprobe	No	Integer	Number of centroids to explore during an IVF index query. A larger value indicates a higher query accuracy yet slower query speed. This parameter is available only when algorithm is IVF_GRAPH or IVF_GRAPH_PQ. Value range: 0–100000 Default value: 100
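
The parameters in Table 1 can be assembled programmatically. The following Python sketch builds a standard query body and checks the optional tuning parameters against the value ranges listed in the table. The helper name build_standard_query is hypothetical, and placing ef, max_scan_num, and nprobe next to topk inside the vector field clause is an assumption that the table does not state explicitly.

```python
import json

# Value ranges copied from Table 1.
TUNING_RANGES = {
    "ef": (0, 100000),
    "max_scan_num": (0, 1000000),
    "nprobe": (0, 100000),
}


def build_standard_query(field, vector, topk, size=None, include_source=False, **tuning):
    """Build the request body of a standard vector query (hypothetical helper)."""
    clause = {"vector": vector, "topk": topk}
    for name, value in tuning.items():
        if name not in TUNING_RANGES:
            raise ValueError(f"unknown tuning parameter: {name}")
        low, high = TUNING_RANGES[name]
        if not low <= value <= high:
            raise ValueError(f"{name} must be in [{low}, {high}], got {value}")
        # Assumption: tuning parameters sit beside topk in the field clause.
        clause[name] = value
    return {
        "size": topk if size is None else size,
        "_source": include_source,
        "query": {"vector": {field: clause}},
    }


body = build_standard_query("my_vector", [1, 1], topk=2, ef=500)
print(json.dumps(body, indent=2))
```

The printed JSON can be sent as the body of POST my_index/_search. Checking ranges on the client side only catches out-of-range values early; the cluster remains the authority on which parameters apply to a given index algorithm.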

Hybrid Query

Hybrid query combines vector search with traditional OpenSearch queries, such as pre-filtering and Boolean queries.

Only OpenSearch 2.19.0 clusters support pre-filtering.

The following examples each request up to 10 records whose my_label value is red.

Pre-filtering query

First, filters are applied to retrieve matching results. Then, vector search is performed on these results to retrieve the most relevant vectors based on similarity.

The following is an example:

POST my_index/_search
{
  "size": 10,
  "query": {
    "vector": {
      "my_vector": {
        "vector": [1, 2],
        "topk": 10,
        "filter": {
          "term": { "my_label": "red" }
        }
      }
    }
  }
}

**Table 2** Parameters for pre-filtering query
Parameter	Mandatory	Type	Description
filter	Yes	Map	Vector query filters. Standard OpenSearch query filters are supported, such as term and range. If the filter is so restrictive that the intermediate result set is small, set the index.vector.exact_search_threshold parameter: when the intermediate result set is smaller than this threshold, the pre-filtering query automatically switches to a brute-force query (FLAT algorithm), which ensures a high recall rate. For more information, see Creating a Vector Index.
term	No	Map	Term query, a type of exact query. Only documents that contain the exact term are returned. For example, {"term": {"my_label": "red"}} returns only documents whose my_label value is red.

Boolean query

A Boolean query is effectively a post-filtering method. Filtering and vector similarity-based search are performed separately, and their results are then combined using the Boolean logic defined by clauses such as must, should, and filter.

The following is an example:

POST my_index/_search
{
  "size": 10,
  "query": {
    "bool": {
      "must": {
        "vector": {
          "my_vector": {
            "vector": [1, 2],
            "topk": 10
          }
        }
      },
      "filter": {
        "term": { "my_label": "red" }
      }
    }
  }
}

**Table 3** Boolean query parameters
Parameter	Mandatory	Type	Description
bool	Yes	Map	A compound query clause that combines subqueries using configured Boolean logic. Parameter description: must: clauses that must match for documents to be included in the results. filter: similar to must, but does not contribute to the relevance score. should: clauses that should match but are not required. must_not: clauses that must not match for documents to be included in the results.
bool.must	Yes	Map	Clauses that must match for documents to be included in the results. Parameter description: vector: query vector; my_vector: vector field; topk: number of results to return.
bool.filter	Yes	Map	Clauses that must match, but do not contribute to the relevance score. Standard OpenSearch query filters are supported, such as term and range.
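
The difference between pre-filtering and the Boolean (post-filtering) query can be illustrated with a small in-memory simulation. The following Python sketch is not how the cluster executes either query; the four documents and the helper names are made up for illustration, and an exact Euclidean search stands in for the vector index.

```python
import math

# Hypothetical toy data: two blue documents sit closest to the query vector.
docs = [
    {"id": 1, "my_label": "blue", "my_vector": [1.0, 2.0]},
    {"id": 2, "my_label": "blue", "my_vector": [1.1, 2.1]},
    {"id": 3, "my_label": "red", "my_vector": [3.0, 3.0]},
    {"id": 4, "my_label": "red", "my_vector": [5.0, 5.0]},
]


def nearest(candidates, query, topk):
    """Exact top-k by Euclidean distance (stand-in for the vector index)."""
    return sorted(candidates, key=lambda d: math.dist(d["my_vector"], query))[:topk]


def pre_filter(docs, query, topk, label):
    """Filter first, then run the vector search on the matching documents."""
    matching = [d for d in docs if d["my_label"] == label]
    return nearest(matching, query, topk)


def post_filter(docs, query, topk, label):
    """Run the vector search first, then drop results that fail the filter."""
    return [d for d in nearest(docs, query, topk) if d["my_label"] == label]


query = [1.0, 2.0]
pre_ids = [d["id"] for d in pre_filter(docs, query, 2, "red")]
post_ids = [d["id"] for d in post_filter(docs, query, 2, "red")]
print(pre_ids)   # [3, 4]
print(post_ids)  # []
```

In this toy data the two vectors nearest to the query are both blue, so post-filtering discards the entire top 2 and returns nothing, while pre-filtering still returns the two nearest red documents. With a post-filtering query, a larger topk reduces the chance of ending up with fewer results than size.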

Script Score Query

Script_score query enables custom similarity calculations for vector searches by executing a user-defined script. It works as follows:

Any query can serve as the pre-filter. script_score then calculates the vector similarity of each pre-filtered result and ranks the results. This query method does not use vector indexes, so its performance depends on the size of the intermediate result set left after pre-filtering. If the pre-filtering condition is set to match_all, a brute-force search is performed on all data.

The following is an example:

POST my_index/_search
{
  "size": 2,
  "query": {
    "script_score": {
      "query": {
        "match_all": {}
      },
      "script": {
        "source": "vector_score",
        "lang": "vector",
        "params": {
          "field": "my_vector",
          "vector": [1.0, 2.0],
          "metric": "euclidean"
        }
      }
    }
  }
}

**Table 4** script_score query parameters
Parameter	Mandatory	Type	Description
script_score	Yes	Map	A root parameter for the script_score query. Parameter description: query: pre-filtering criteria. When it is set to match_all, a brute-force search is performed on all data. script: custom script that calculates similarity scores.
source	Yes	String	Script name. The value is fixed to vector_score, indicating that a built-in script is used for calculating similarity.
lang	Yes	String	Script language type. The value is fixed to vector.
field	Yes	String	Queried vector field, for example, my_vector.
vector	Yes	Array/String	Query vector value. It is used to calculate the similarity between indexed vectors and the query vector. The value can be an array (for example, [1, 1]) or Base64-encoded value (for example, AAABAAACAAAD).
metric	Yes	String	Vector distance metric, which measures the similarity or distance between vectors. Value range: euclidean (default): Euclidean distance; inner_product: inner product distance; cosine: cosine distance; hamming: Hamming distance, which can be used only when dim_type is set to binary.
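
Because script_score scores every document that passes the pre-filter, its cost grows with the size of the intermediate result set. The following Python sketch simulates that behavior on made-up data. The function name and documents are hypothetical, and ranking by ascending Euclidean distance is an assumption, since the exact score that vector_score derives from the distance is not described here.

```python
import math

# Hypothetical documents; only the field names follow the examples above.
docs = [
    {"id": 1, "my_label": "blue", "my_vector": [1.0, 2.0]},
    {"id": 2, "my_label": "blue", "my_vector": [2.0, 2.0]},
    {"id": 3, "my_label": "red", "my_vector": [4.0, 6.0]},
    {"id": 4, "my_label": "red", "my_vector": [1.0, 4.0]},
]


def script_score_search(docs, pre_filter, field, query, size):
    """Score every pre-filtered document; no vector index is involved."""
    candidates = [d for d in docs if pre_filter(d)]
    # One distance calculation per candidate: this is the brute-force part.
    scored = sorted((math.dist(d[field], query), d["id"]) for d in candidates)
    return [doc_id for _, doc_id in scored[:size]], len(candidates)


def match_all(doc):
    return True


def only_red(doc):
    return doc["my_label"] == "red"


all_ids, all_scanned = script_score_search(docs, match_all, "my_vector", [1.0, 2.0], 2)
red_ids, red_scanned = script_score_search(docs, only_red, "my_vector", [1.0, 2.0], 2)
print(all_ids, all_scanned)  # [1, 2] 4
print(red_ids, red_scanned)  # [4, 3] 2
```

With match_all, all four documents are scored. With the more selective pre-filter, only two distance calculations are needed, which is why a restrictive pre-filter keeps this query method affordable.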

Rescore Query

Rescore query rescores and reranks the top results returned by an initial query to improve recall.

When the GRAPH_PQ or IVF_GRAPH_PQ indexing algorithm is used, query results are ranked by the asymmetric distance calculated by PQ, which is an approximation of the true distance. A rescore query recalculates the scores of the top initial results with the metric specified in vector_rescore and reranks them, which improves recall.

The following is an example of rescore query on a PQ index named my_index:

GET my_index/_search
{
  "size": 10,
  "query": {
    "vector": {
      "my_vector": {
        "vector": [1.0, 2.0],
        "topk": 100
      }
    }
  },
  "rescore": {
    "window_size": 100,
    "vector_rescore": {
      "field": "my_vector",
      "vector": [1.0, 2.0],
      "metric": "euclidean"
    }
  }
}

**Table 5** Rescore query parameters
Parameter	Mandatory	Type	Description
rescore	Yes	Map	Defines rescoring parameters. Key parameters: window_size: rescoring/reranking window size. vector_rescore: other vector rescoring settings.
window_size	Yes	Integer	Rescoring/reranking window size. The vector search returns the top k results, but only the first window_size results are rescored and reranked. A larger value indicates a larger reranking scope and hence a higher recall rate, but it also leads to higher computational overhead. Default value: 100
field	Yes	String	Queried vector field, for example, my_vector.
vector	Yes	Array/String	Query vector value. It is used to calculate the similarity between indexed vectors and the query vector. The value can be an array (for example, [1, 1]) or Base64-encoded value (for example, AAABAAACAAAD).
metric	Yes	String	Vector distance metric, which measures the similarity or distance between vectors. Value range: euclidean (default): Euclidean distance; inner_product: inner product distance; cosine: cosine distance; hamming: Hamming distance, which can be used only when dim_type is set to binary.
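
The effect of rescoring can be shown with a small simulation. In the following Python sketch, rounding each vector component to the nearest integer is a crude stand-in for PQ compression; it is not the PQ algorithm itself, and the vectors and function names are made up. The initial pass ranks by distance to the lossy codes, and the rescore pass reranks the first window_size results by exact distance.

```python
import math

# Hypothetical stored vectors, keyed by document ID.
vectors = {
    "a": [1.4, 2.4],
    "b": [1.6, 2.1],
    "c": [1.6, 2.6],
    "d": [3.0, 3.0],
}


def coarse(vec):
    """Lossy stand-in for a PQ code: snap each component to an integer."""
    return [float(round(x)) for x in vec]


def search_with_rescore(vectors, query, topk, window_size, size):
    # Initial pass: the query stays exact and the stored vectors are lossy,
    # which mirrors the asymmetric distance used with PQ indexes.
    initial = sorted(vectors, key=lambda k: math.dist(coarse(vectors[k]), query))[:topk]
    # Rescore pass: rerank only the first window_size hits by exact distance.
    window, rest = initial[:window_size], initial[window_size:]
    window.sort(key=lambda k: math.dist(vectors[k], query))
    return initial[:size], (window + rest)[:size]


before, after = search_with_rescore(vectors, [1.2, 2.0], topk=3, window_size=2, size=2)
print(before)  # ['a', 'b']
print(after)   # ['b', 'a']
```

The approximate pass places a ahead of b, although b is actually closer to the query. Rescoring the window restores the exact order. Results outside the window keep their approximate order, which is why a larger window_size raises recall at the cost of more distance calculations.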

Painless Syntax Extension

The Painless syntax extension allows the use of vector distance or similarity calculation functions in custom scripts. The CSS extension provides several such functions, which can be called directly in custom Painless scripts to build flexible scoring formulas.

The following is an example:

POST my_index/_search
{
  "size": 10,
  "query": {
    "script_score": {
      "query": {
        "match_all": {}
      },
      "script": {
        "source": "1 / (1 + euclidean(params.vector, doc[params.field]))",
        "params": {
          "field": "my_vector",
          "vector": [1, 2]
        }
      }
    }
  }
}

**Table 6** Supported vector distance/similarity calculation functions
Function Signature	Description
euclidean(Float[], DocValues)	Euclidean distance
cosine(Float[], DocValues)	Cosine similarity
innerproduct(Float[], DocValues)	Inner product
hamming(String, DocValues)	Hamming distance. Only vectors whose dim_type is binary are supported. The input query vector must be a Base64-encoded character string. Only OpenSearch 1.3.6 clusters support this function.
hammings(String, DocValues)	Hamming distance. Only vectors whose dim_type is binary are supported. The input query vector must be a Base64-encoded character string. Only OpenSearch 2.19.0 clusters support this function.
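
To make the functions in Table 6 concrete, the following Python sketch gives plain reference versions of each calculation and evaluates the scoring formula from the example script. These are illustrations of the underlying math, not the CSS implementations. In particular, it is an assumption that euclidean returns the plain (not squared) distance, that cosine returns a similarity, and that a binary vector is compared bit by bit after Base64 decoding.

```python
import base64
import math


def euclidean(query, doc):
    """Straight-line distance between two vectors."""
    return math.dist(query, doc)


def innerproduct(query, doc):
    """Sum of the element-wise products."""
    return sum(q * d for q, d in zip(query, doc))


def cosine(query, doc):
    """Cosine of the angle between two non-zero vectors."""
    return innerproduct(query, doc) / (math.hypot(*query) * math.hypot(*doc))


def hamming(query_b64, doc_b64):
    """Number of differing bits between two Base64-encoded binary vectors."""
    q, d = base64.b64decode(query_b64), base64.b64decode(doc_b64)
    if len(q) != len(d):
        raise ValueError("binary vectors must have the same length")
    return sum(bin(x ^ y).count("1") for x, y in zip(q, d))


# The formula from the example script: 1 / (1 + euclidean(query, doc)).
score_same = 1 / (1 + euclidean([1, 2], [1, 2]))
score_far = 1 / (1 + euclidean([1, 2], [4, 6]))
print(score_same)  # 1.0
print(score_far)   # 0.16666666666666666

bits_a = base64.b64encode(b"\x0f").decode()  # bit pattern 00001111
bits_b = base64.b64encode(b"\x00").decode()  # bit pattern 00000000
print(hamming(bits_a, bits_b))  # 4
```

The formula 1 / (1 + distance) maps a distance of 0 to a score of 1 and pushes larger distances toward 0, so closer vectors rank higher.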