Overview

Updated on 2024-10-14 GMT+08:00

A dictionary is used to define stop words, that is, words to be ignored in full-text retrieval.

A dictionary can also be used to normalize words so that different derived forms of the same word will match. A normalized word is called a lexeme.

In addition to improving retrieval quality, normalization and removal of stop words reduce the size of the tsvector representation of a document, thereby improving performance. Normalization and stop-word removal rules do not need to be linguistically meaningful. Users can define them in dictionary definition files to suit the application environment.
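The effect on the tsvector can be seen with a short query. This is a minimal sketch that assumes the predefined english text search configuration is available; the sample sentence is illustrative only.

```sql
-- Stop words ("in", "the", "of") are dropped and "words" is normalized
-- to the lexeme "word". Positions still reflect the original word order.
SELECT to_tsvector('english', 'in the list of stop words');
-- Result: 'list':3 'stop':5 'word':6
```

Only three of the six words are stored, which is where the size reduction comes from.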

A dictionary is a program that receives a token as input and returns:

An array of lexemes if the input token is known to the dictionary (note that one token can produce more than one lexeme).
A single lexeme with the TSL_FILTER flag set, which replaces the original token with a new token that is passed to subsequent dictionaries. A dictionary that does this is called a filtering dictionary. The flag is set automatically by a filtering dictionary and is not visible to users.
An empty array if the input token is known to the dictionary but is a stop word.
NULL if the dictionary does not recognize the token.
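Two of these return values can be observed directly with the ts_lexize function, which runs a single dictionary against a single token. The sketch below assumes the predefined english_stem Snowball dictionary. Because a Snowball dictionary recognizes every token, it never returns NULL, so that case is not shown here.

```sql
-- Known word: returns an array containing the normalized lexeme.
SELECT ts_lexize('english_stem', 'stars');
-- Result: {star}

-- Stop word: returns an empty array.
SELECT ts_lexize('english_stem', 'a');
-- Result: {}
```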

GaussDB provides predefined dictionaries for many languages. It also provides five predefined dictionary templates: Simple, Synonym, Thesaurus, Ispell, and Snowball. These templates can be used to create new dictionaries with custom parameters.
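As a rough illustration, a new dictionary can be created from the Simple template as follows. The dictionary name public.simple_dict is hypothetical, and the sketch assumes that a stop word file named english is available to the server. Depending on the deployment, additional parameters for locating dictionary files may be required.

```sql
-- Hypothetical dictionary built from the Simple template.
-- STOPWORDS names the stop word file to use (assumed to exist).
CREATE TEXT SEARCH DICTIONARY public.simple_dict (
    TEMPLATE = pg_catalog.simple,
    STOPWORDS = english
);

-- Check how the new dictionary treats a single token.
SELECT ts_lexize('public.simple_dict', 'YeS');
```

A Simple dictionary converts the token to lowercase and checks it against the stop word list, so a token that is not a stop word should be returned in lowercase.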

A text search configuration binds a parser to a set of dictionaries that process the parser's output tokens. For each token type that the parser can return, the configuration specifies a separate list of dictionaries. When the parser finds a token of that type, each dictionary in the list is consulted in turn until one of them recognizes the token as a known word. If the token is identified as a stop word, or if no dictionary recognizes it, the token is discarded and is neither indexed nor searched for. Generally, the first dictionary that returns a non-NULL output determines the result, and the remaining dictionaries are not consulted. However, a filtering dictionary can replace the input token with a modified one, which is then passed to subsequent dictionaries.
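To see which dictionary handled each token, the ts_debug function can be used. This is a sketch that assumes the predefined english configuration; the input text is illustrative only.

```sql
-- For each token: its type (alias), the dictionaries configured for that
-- type, the dictionary that recognized the token, and the lexemes produced.
SELECT alias, token, dictionaries, dictionary, lexemes
FROM ts_debug('english', 'The stars are shining');
```

A stop word appears with an empty lexemes array, and a token that no dictionary recognizes appears with NULL in the dictionary and lexemes columns.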

The general rule for configuring a list of dictionaries is to place the narrowest, most specific dictionary first, followed by more general dictionaries, and to finish with a very general dictionary that recognizes everything, such as a Snowball stemmer dictionary or a Simple dictionary. In the following example, for an astronomy-specific search (astro_en configuration), the token type asciiword (ASCII word) is mapped to a Synonym dictionary of astronomical terms, a general English Ispell dictionary, and a Snowball English stemmer dictionary:

     
    openGauss=# ALTER TEXT SEARCH CONFIGURATION astro_en
        ADD MAPPING FOR asciiword WITH astro_syn, english_ispell, english_stem;
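The ALTER statement above assumes that the astro_en configuration and the astro_syn and english_ispell dictionaries already exist. One possible way to prepare the configuration, sketched here under the assumption that copying the predefined english configuration is an acceptable starting point, is:

```sql
-- Create astro_en as a copy of the predefined english configuration.
-- The astro_syn and english_ispell dictionaries are not created here;
-- they must be defined separately from their dictionary files.
CREATE TEXT SEARCH CONFIGURATION astro_en (COPY = pg_catalog.english);
```

A configuration copied this way may already contain a mapping for asciiword. In that case, ALTER MAPPING FOR can be used to replace the existing mapping.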

A filtering dictionary can be placed anywhere in the list, except at the end where it would be useless. Filtering dictionaries are useful to partially normalize words to simplify the task of later dictionaries.
