Using Vector Indexes for Data Search in an OpenSearch Cluster
Standard Query
Standard vector query syntax is provided for vector indexes. The following command will return n (specified by size/topk) data records that are the closest matches in the queried vector data.
POST my_index/_search { "size":2, "_source": false, "query": { "vector": { "my_vector": { "vector": [1, 1], "topk":2 } } } }
Parameter |
Description |
---|---|
vector (the first one) |
Indicates that the query type is VectorQuery. |
my_vector |
Indicates the name of the vector field you want to query. |
vector (the second one) |
Indicates the vector value you want to query, which can be an array or a Base64 string |
topk |
Same as the value of size generally. |
Other optional parameters |
Indicates optional query parameters. You can adjust the vector index parameters to achieve higher query performance or precision. For more information, see Table 2. |
Parameter |
Parameter |
Description |
---|---|---|
Graph index configuration parameters |
ef |
Queue size of the neighboring node during the query. A larger value indicates a higher query precision and slower query speed. The default value is 200. Value range: (0, 100000] |
max_scan_num |
Maximum number of scanned nodes. A larger value indicates a higher query precision and slower query speed. The default value is 10000. Value range: (0, 1000000] |
|
IVF index configuration parameters |
nprobe |
Number of center points. A larger value indicates a higher query precision and slower query speed. The default value is 100. Value range: (0, 100000] |
Compound Query
Vector search can be used with other OpenSearch subqueries to implement a compound query, such as Boolean query.
A Boolean query is in fact a post-filtering query method. Filtering and vector similarity-search are performed separately. Then, the results of the two are combined. The logic of combination is determined by clauses like must, should, and filter.
In the following example, the top 10 records whose my_label value is red are queried.
POST my_index/_search { "size": 10, "query": { "bool": { "must": { "vector": { "my_vector": { "vector": [1, 2], "topk": 10 } } }, "filter": { "term": { "my_label": "red" } } } } }
ScriptScore Query
You can use script_score to perform Nearest Neighbor Search (NSS) on vectors. The query syntax is provided below.
The pre-filtering condition can be any query. script_score traverses only the pre-filtered results, calculates the vector similarity, and sorts and returns the results. This query method does not use vector indexes. Its performance depends on the size of the intermediate result set after the pre-filtering. If the pre-filtering condition is set to match_all, brute-force search is performed on all data.
POST my_index/_search { "size":2, "query": { "script_score": { "query": { "match_all": {} }, "script": { "source": "vector_score", "lang": "vector", "params": { "field": "my_vector", "vector": [1.0, 2.0], "metric": "euclidean" } } } } }
Parameter |
Description |
---|---|
source |
Script description. Its value is vector_score if the vector similarity is used for scoring. |
lang |
Script syntax description. Its value is vector. |
field |
Vector field name |
vector |
Vector data to be queried |
metric |
Measurement method, which can be euclidean, inner_product, cosine, and hamming. Default value: euclidean |
Re-Score Query
If the GRAPH_PQ or IVF_GRAPH_PQ index is used, the query results are sorted based on the asymmetric distance calculated by PQ. CSS supports re-scoring and ranking of query results to improve the recall rate.
Assuming that my_index is a PQ index, an example of re-scoring the query results is as follows:
GET my_index/_search { "size": 10, "query": { "vector": { "my_vector": { "vector": [1.0, 2.0], "topk": 100 } } }, "rescore": { "window_size": 100, "vector_rescore": { "field": "my_vector", "vector": [1.0, 2.0], "metric": "euclidean" } } }
Parameter |
Description |
---|---|
window_size |
Vector search returns topk search results and ranks the first window_size results. |
field |
Vector field name |
vector |
Vector data to be queried |
metric |
Measurement method, which can be euclidean, inner_product, cosine, and hamming. Default value: euclidean |
Painless Syntax Extension
CSS extension supports multiple vector distance calculation functions, which can be directly used in customized painless scripts to build flexible re-score formulas.
The following is an example:
POST my_index/_search { "size": 10, "query": { "script_score": { "query": { "match_all": {} }, "script": { "source": "1 / (1 + euclidean(params.vector, doc[params.field]))", "params": { "field": "my_vector", "vector": [1, 2] } } } } }
Function Signature |
Description |
---|---|
euclidean(Float[], DocValues) |
Euclidean distance function |
cosine(Float[], DocValues) |
Cosine similarity function |
innerproduct(Float[], DocValues) |
Inner product function |
hamming(String, DocValues) |
Hamming distance function Only vectors whose dim_type is binary are supported. The input query vector must be a Base64-encoded character string. |
Feedback
Was this page helpful?
Provide feedbackThank you very much for your feedback. We will continue working to improve the documentation.See the reply and handling status in My Cloud VOC.
For any further questions, feel free to contact us through the chatbot.
Chatbot