Updated on 2024-04-19 GMT+08:00

Managing Word Dictionaries

You can configure a custom word dictionary so that specified words are recognized as whole terms during word segmentation. For example, you can make company names (such as Huawei) and internet buzzwords searchable as keywords.

  • Hot update is supported. The updated custom word dictionary can take effect without cluster restart.
  • Custom word dictionaries are generally used for Chinese word segmentation. They can also be used to segment English words based on special characters except #&+-.@_
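The rule for English text can be pictured with a small sketch. It assumes that any character that is neither a letter, a digit, nor one of #&+-.@_ acts as a separator. This is only one reading of the rule stated above, not the IK analyzer's actual implementation, and split_english is a hypothetical helper name.

```python
import re
from typing import List

# Characters listed in the text as NOT acting as separators.
KEPT_SPECIAL_CHARS = "#&+-.@_"

# Assumption for illustration: any run of characters that is neither an ASCII
# letter/digit nor a kept special character is treated as a separator.
_SEPARATORS = re.compile(r"[^0-9A-Za-z" + re.escape(KEPT_SPECIAL_CHARS) + r"]+")


def split_english(text: str) -> List[str]:
    """Split text on separator runs and drop empty pieces."""
    return [piece for piece in _SEPARATORS.split(text) if piece]


print(split_english("c++/c# user@example.com;big_data"))
```

In this sketch, "/", the space, and ";" separate words, while "+", "#", "@", ".", and "_" stay inside them.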

Context

The custom word dictionary function uses the IK analyzer and the synonym analyzer.

The IK analyzer has a main word dictionary and a stop word dictionary. The synonym analyzer has a synonym word dictionary. Before configuring a custom word dictionary, upload the prepared word dictionary file to OBS. For details, see Uploading the Word Dictionary File to OBS.

The IK analyzer uses the ik_max_word and ik_smart word segmentation policies. The synonym analyzer uses the ik_synonym word segmentation policy.

  • ik_max_word: splits the text at a fine granularity.
  • ik_smart: splits the text at a coarse granularity.
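To compare the two policies on the same text, you could send each one to the cluster's standard `_analyze` API. The sketch below only builds the JSON request bodies and does not contact a cluster. The sample text and the helper name build_analyze_request are assumptions for illustration.

```python
import json


def build_analyze_request(analyzer: str, text: str) -> str:
    """Return the JSON body for a `POST /_analyze` request."""
    if analyzer not in ("ik_max_word", "ik_smart"):
        raise ValueError(f"unexpected analyzer: {analyzer}")
    # ensure_ascii=False keeps Chinese text readable in the body.
    return json.dumps({"analyzer": analyzer, "text": text}, ensure_ascii=False)


sample_text = "智能手机"  # hypothetical sample text
for name in ("ik_max_word", "ik_smart"):
    print(name, build_analyze_request(name, sample_text))
```

Sending both bodies to the same cluster and comparing the returned token lists shows how much finer ik_max_word splits than ik_smart.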

Prerequisites

  • To use the custom word dictionary, the account or IAM user used for logging in to the CSS management console must have both of the following permissions:
    • OBS Administrator for project OBS in region Global service
    • Elasticsearch Administrator in the current region
  • Prepare the word dictionary file on your local PC according to the requirements in Uploading the Word Dictionary File to OBS.

Uploading the Word Dictionary File to OBS

Before configuring a custom word dictionary, upload the word dictionary to an OBS bucket.

  1. Prepare the word dictionary file according to Table 1.
    Table 1 Dictionary description

    | Word Dictionary Type | Introduction | Requirement |
    |---|---|---|
    | Main Word Dictionary | Main words are the words on which users want to perform word segmentation. The main word dictionary is a collection of main words. | The main word dictionary file must be a text file encoded using UTF-8 without BOM, with one subword per line. Letters must be in lowercase. The maximum size of a main word dictionary file is 100 MB. |
    | Stop Word Dictionary | Stop words are the words which users can ignore. A stop word dictionary is a collection of stop words. | The stop word dictionary file must be a text file encoded using UTF-8 without BOM, with one subword per line. The maximum size of a stop word dictionary file is 20 MB. |
    | Synonym Dictionary | Synonyms are words with the same meaning. A synonym dictionary is a collection of synonyms. | The synonym dictionary file must be a text file encoded using UTF-8 without BOM, with a pair of comma-separated synonyms per line. The maximum size of a synonym dictionary file is 20 MB. |

  2. Upload the word dictionary file to an OBS bucket. For details, see Uploading an Object. The OBS bucket to which data is uploaded must be in the same region as the cluster.
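Before uploading, you may want to check a file against the requirements in Table 1. The following is a minimal sketch of such a check, not an official tool. The function name check_dictionary, the "main"/"stop"/"synonym" labels, the choice to skip blank lines, and the reading of "pair" as exactly two entries per line are all assumptions.

```python
import codecs

MB = 1024 * 1024

# Size limits taken from Table 1.
MAX_SIZE = {"main": 100 * MB, "stop": 20 * MB, "synonym": 20 * MB}


def check_dictionary(data: bytes, kind: str) -> list:
    """Return a list of problems found in a dictionary file's raw bytes."""
    problems = []
    if len(data) > MAX_SIZE[kind]:
        problems.append("file exceeds the size limit")
    if data.startswith(codecs.BOM_UTF8):
        problems.append("file starts with a UTF-8 BOM")
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError:
        return problems + ["file is not valid UTF-8"]
    for number, line in enumerate(text.splitlines(), start=1):
        entry = line.strip()
        if not entry:
            continue  # blank lines are skipped here (an assumption)
        if kind == "main" and entry != entry.lower():
            problems.append(f"line {number}: letters must be lowercase")
        if kind == "synonym":
            # "A pair" is read strictly as exactly two non-empty entries.
            parts = [part.strip() for part in entry.split(",")]
            if len(parts) != 2 or not all(parts):
                problems.append(f"line {number}: expected one comma-separated pair")
    return problems


print(check_dictionary("huawei\n华为\n".encode("utf-8"), "main"))
```

An empty list means that none of the checked requirements was violated. To check a local file, read it in binary mode and pass the bytes to the function.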

Managing Word Dictionaries

  1. Log in to the CSS management console.
  2. In the navigation pane, choose Clusters > OpenSearch.
  3. On the Clusters page, click the name of the target cluster.
  4. Click the Word Dictionaries tab.
  5. On the displayed Word Dictionaries page, set the switch to enable or disable the custom word dictionary function. To enable it, configure the following parameters:
    • OBS Bucket: indicates the OBS bucket where the main word dictionary file, stop word dictionary file, and synonym dictionary file are stored. If no OBS bucket is available, create one by referring to Creating a Bucket. The OBS bucket must be in the same region as the cluster.
    • Main word dictionary object: The main word dictionary file must be a text file encoded using UTF-8 without BOM. One subword occupies a line. Letters must be in lowercase. The maximum size of a main word dictionary file is 100 MB.
    • Stop word dictionary object: The stop word dictionary file must be a text file encoded using UTF-8 without BOM, with one subword per line. The maximum size of a stop word dictionary file is 20 MB.
    • Synonym word dictionary object: The synonym dictionary file must be a text file encoded using UTF-8 without BOM. One pair of comma-separated synonyms occupies a line. The maximum size of a synonym dictionary file is 20 MB.
    Figure 1 Configuring a custom word dictionary
  6. Click Save. In the displayed Confirm dialog box, click OK. The word dictionary information is displayed in the lower part of the page. The word dictionary status is Updating. Wait for about one minute. After the word dictionary configuration is complete, the word dictionary status will change to Succeeded, indicating that the configured word dictionary has taken effect in the cluster.
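After the status changes to Succeeded, one way to confirm that a custom word is in effect is to analyze a text that contains it with the `_analyze` API and look for the word among the returned tokens. The sketch below only inspects a response body. The sample response is hypothetical, and the helper names are assumptions.

```python
def tokens_in_response(response: dict) -> list:
    """Extract the token strings from an `_analyze` response body."""
    return [item["token"] for item in response.get("tokens", [])]


def word_is_single_token(response: dict, word: str) -> bool:
    """Return True if `word` appears as one whole token in the response."""
    return word in tokens_in_response(response)


# Hypothetical response, shaped like the output of `POST /_analyze`.
sample_response = {
    "tokens": [
        {"token": "华为", "start_offset": 0, "end_offset": 2, "position": 0},
        {"token": "云", "start_offset": 2, "end_offset": 3, "position": 1},
    ]
}

print(word_is_single_token(sample_response, "华为"))
```

If the word is missing from the main word dictionary, it would typically show up split into smaller tokens instead, and the check would return False.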

Modifying a Word Dictionary

You can separately update the main word dictionary, the stop word dictionary, and the synonym dictionary.

On the Word Dictionaries page, modify OBS Bucket, Main Word Dictionary, Stop Word Dictionary, or Synonym Word Dictionary, and click Save. In the displayed dialog box, click OK. When the word dictionary status changes from Updating to Succeeded, the modified custom word dictionary has taken effect.

Figure 2 Configuring a custom word dictionary
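When updating the main word dictionary, the new file still has to meet the format requirements (one entry per line, lowercase letters, UTF-8 without BOM). A minimal sketch of merging new words into existing dictionary text follows. The helper name merge_main_words and the deduplication step are assumptions, not part of the service.

```python
def merge_main_words(existing: str, new_words: list) -> str:
    """Merge new words into main-dictionary text.

    Entries are lowercased, blank entries are dropped, and duplicates are
    removed while keeping first-seen order.
    """
    seen = set()
    merged = []
    for word in existing.splitlines() + list(new_words):
        entry = word.strip().lower()
        if entry and entry not in seen:
            seen.add(entry)
            merged.append(entry)
    return "\n".join(merged) + "\n"


updated = merge_main_words("huawei\n", ["Huawei", "OpenSearch"])
print(updated, end="")
```

Writing the result with `open(path, "w", encoding="utf-8")` produces UTF-8 without a BOM. The file can then be uploaded to the OBS bucket and selected on the Word Dictionaries page.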

Disabling a Word Dictionary

You can disable the custom word dictionary when it is no longer needed.

On the Word Dictionaries page, disable the function and click OK in the displayed dialog box. After the word dictionary is disabled, the word dictionary configuration information will not be displayed.