Creating and Modifying a KooSearch Knowledge Base
To use the KooSearch experience platform, start by creating a knowledge base. Once set up, you can upload your data to it, search the data, and ask questions.
Accessing the KooSearch Console
- Log in to the CSS management console.
- In the navigation pane on the left, choose KooSearch > KooSearch Document Q&A.
- Select a document Q&A service created earlier, and click Q&A in the Operation column to switch to the KooSearch console.
Creating a Knowledge Base
- On the KooSearch console, choose Knowledge Bases from the left navigation pane.
The Knowledge Bases page is displayed.
- Click Create Knowledge Base in the upper-right corner.
On the displayed page, set the knowledge base information.
- On the Create tab, set the parameters and click Next.
Table 1 Parameters for creating a knowledge base
Knowledge Base Name
Name of the knowledge base. The value can contain 1 to 64 characters, including letters, digits, hyphens (-), and underscores (_), and must start with a letter or digit.
Knowledge Base Language
Language of the knowledge base. The following languages are supported:
- Chinese
- English
- Thai
- Arabic
- Spanish
- Portuguese
Description
A brief description of the knowledge base. A maximum of 100 characters are allowed.
Knowledge Base Tags
Tags that identify the knowledge base. You can search for knowledge bases by tags or grant access to knowledge bases to different users by their tags.
- Key: custom
- Value: custom
Custom Fields of Structured Data
Adds custom fields for structured data. Click Add Custom Field, and set Field and Value. After the knowledge base is created, you can upload structured data to the knowledge base based on these custom fields.
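The naming rule for Knowledge Base Name in Table 1 can be checked before submitting the form. Below is a minimal Python sketch assuming the rule is exactly as stated (1 to 64 characters, letters, digits, hyphens, underscores, starting with a letter or digit); the console's actual server-side validation may differ.

```python
import re

# Pattern derived from the documented naming rule; an assumption,
# not the platform's actual validator.
KB_NAME_PATTERN = re.compile(r"^[A-Za-z0-9][A-Za-z0-9_-]{0,63}$")

def is_valid_kb_name(name: str) -> bool:
    """Return True if the name satisfies the documented naming rule."""
    return bool(KB_NAME_PATTERN.fullmatch(name))
```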
- On the Parse and Split Settings tab, configure Parsing Settings and Splitting Settings, and click Next.
- Parsing Settings: Select the needed capabilities.
Table 2 Parsing settings
OCR Enhancement
Calls the OCR service for intelligent document recognition, such as table parsing and file scanning.
Image Parsing
If unselected, images in documents will be skipped by default.
If selected, two parsing methods are available:
- Extract Image Text: Recognize and extract the text in images.
- Retain Original Images: Recognize image content and then upload the original images to OBS. The original images will be used in answers.
Header and Footer Parsing
If unselected, the parsing result does not contain document headers or footers.
If selected, the parsing result contains document headers and footers.
Contents Page Parsing
If unselected, the parsing result does not contain the contents page.
If selected, the parsing result contains the contents page.
- Splitting Settings: Select a segmentation method.
Table 3 Splitting settings
Auto Segmentation
The system automatically selects a proper segmentation method based on the characteristics of the document.
By Length
By default, a document is segmented and merged by paragraph. If a paragraph is too long, it is segmented and merged by identifiers. You need to further set the following parameters:
- Segment Identifier: A paragraph is segmented wherever a selected identifier occurs; the selected identifiers have no relative priority. Short segments are merged up to the specified maximum length. If none of the user-specified segment identifiers is found, segmentation fails.
- Estimated Segment Length: Specifies the maximum segment length. A document is split into segments of this length, and two adjacent segments share a certain number of overlapping characters.
By Hierarchy
Splits the document by title hierarchy, and then splits and merges it by paragraph. Overly long paragraphs are split by identifiers. Set the following parameters to control the splitting:
Hierarchical Parsing Mode: Select Automatic Parsing or Rule Parsing. If you select Rule Parsing, you need to define the rules.
For more information about the Rule Parsing mode, see Table 4.
Table 4 By Hierarchy parameters
Hierarchical Parsing Mode
Automatic Parsing: Automatically parses documents by system-defined rules.
Rule explanation:
Different types of documents have different hierarchical structures. You can customize parsing rules for different types of documents to enable better parsing and splitting of the documents, thus improving the accuracy of document-based Q&A.
- Default rules
Define the most typical rules as default rules. For details, see Examples of Default Rules.
- Custom rules
Define custom rules using regular expressions. For details, see Table 6.
Title Level
Select the title level depth of documents.
Title Saving Mode
Select Save Multi-Title or Save Last-Level Title.
Segment Identifier
A paragraph is segmented wherever a selected identifier occurs; the selected identifiers have no relative priority. Short segments are merged up to the specified maximum length. If none of the user-specified segment identifiers is found, segmentation fails.
Estimated Segment Length
Specifies the maximum segment length. A document is split into segments of this length, and two adjacent segments share a certain number of overlapping characters.
Cross-Title Merge
When two short adjacent paragraphs appear under different section titles, they will be automatically merged into a single segment of predefined length. This helps AI generate better, more comprehensive answers. When Cross-Title Merge is disabled, paragraphs under different titles will not be merged automatically.
NOTE:
- This setting is available under the By Hierarchy segmentation method, where you can enable or disable it.
- It is unavailable under Auto Segmentation, where it is enabled by default.
- It does not apply when the By Length segmentation method is used.
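The By Length behavior described above (cut at segment identifiers, merge short pieces up to the estimated segment length, and overlap adjacent segments) can be sketched as follows. This illustrates the general technique only, not the platform's implementation; the identifier set, length, and overlap values are hypothetical.

```python
import re

def split_by_length(text, identifiers=("\n", "。", "."), max_len=500, overlap=50):
    """Sketch of length-based splitting: cut at identifier characters,
    merge short pieces up to max_len, and overlap adjacent segments."""
    # Split at any identifier, keeping the identifier with its piece.
    pattern = "(" + "|".join(re.escape(i) for i in identifiers) + ")"
    parts = re.split(pattern, text)
    pieces = ["".join(parts[i:i + 2]) for i in range(0, len(parts), 2)]
    segments, current = [], ""
    for piece in pieces:
        if current and len(current) + len(piece) > max_len:
            segments.append(current)
            current = current[-overlap:]  # carry overlap into the next segment
        current += piece
    if current:
        segments.append(current)
    return segments
```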
Table 5 Examples of default parsing rules
- Chapter 1: ^Chapter([01234567891-9]{1,7})
- Section 1: ^Section([01234567891-9]{1,7})
- Article 1: ^Article([01234567891-9]{1,7})
Take the rules for chapters as an example:
- Characters indicating numbers in square brackets can be identified as chapter numbers.
- Arabic numerals from 1 to 9 can be identified as chapter numbers.
- The maximum number of characters indicating a chapter number is 7.
The rules for sections and articles are similar.
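A quick way to see how a default rule behaves is to test it with a regular-expression engine. The sketch below applies the chapter rule from Table 5 verbatim in Python; note that, as written, the pattern only matches when the number immediately follows the word "Chapter".

```python
import re

# The default chapter rule, copied verbatim from Table 5.
CHAPTER_RULE = re.compile(r"^Chapter([01234567891-9]{1,7})")

def chapter_number(heading: str):
    """Return the captured chapter number, or None if the rule does not match."""
    m = CHAPTER_RULE.match(heading)
    return m.group(1) if m else None
```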
Table 6 Examples of custom parsing rules
Rule set 1 (Chapter/Section/Article headings):
- Chapter 1: ^Chapter([01234567891-9]{1,7})
- Section 1: ^Section([01234567891-9]{1,7})
- Article 1: ^Article([01234567891-9]{1,7})
Rule set 2 (numeric headings): matches paragraphs that start with a digit.
- 1: ^(\d+\.)(?=\s)
- 1.1: ^(\d+)(\.\d+)(?!\.)(?=\s)
- 1.1.1: ^(\d+)(\.\d+)(\.\d+)(?!\.)(?=\s)
Note: [\u4e00-\u9fa5]+ matches Chinese characters.
Example:
1. Overview
1.1 Description
1.1.1 Detailed Explanation
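The three numeric-heading rules in Table 6 discriminate between levels: the negative lookahead (?!\.) and the lookahead (?=\s) ensure each pattern matches only headings of its own depth. A small Python sketch using the documented rules verbatim:

```python
import re

# Level rules copied verbatim from Table 6.
LEVEL_RULES = (
    (3, re.compile(r"^(\d+)(\.\d+)(\.\d+)(?!\.)(?=\s)")),
    (2, re.compile(r"^(\d+)(\.\d+)(?!\.)(?=\s)")),
    (1, re.compile(r"^(\d+\.)(?=\s)")),
)

def heading_level(line: str) -> int:
    """Return the documented level a heading line matches, or 0 for none."""
    for level, rule in LEVEL_RULES:
        if rule.match(line):
            return level
    return 0
```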
- On the Model Settings tab, configure the models to use. Then click Next.
Table 7 Model settings
Search Model Settings
- Embedding model: A Pangu-based text representation model. It converts text into numeric vectors used for purposes such as text retrieval, clustering, and recommendations.
- Reranking model: A Pangu-based model that re-scores retrieved results by relevance. In semantic search, the reranking model improves the quality of the final results.
- Search planning model: The model provides capabilities such as intent classification, multi-turn query rewriting, complex query decomposition, and time extraction. In a retrieval augmented generation (RAG) task, intent classification enables queries to be routed to the correct logic and processes; query rewriting and decomposition help improve search accuracy.
NOTE:
- There is a strong connection between the embedding model and the cache generation model. When an embedding model is created, the system automatically generates a cache generation model. If any configuration information is deleted by mistake, the model must be rebuilt using the same configuration parameters. For example, if the name of the embedding model is pangu_embedding, the name of the matching cache generation model is pangu_embedding_faq.
- When creating a knowledge base, both the embedding model (pangu_embedding) and cache generation model (pangu_embedding_faq) are required. If the cache generation model (pangu_embedding_faq) does not exist or is not accessible, an error is returned. In this case, the administrator needs to check whether the pangu_embedding_faq model exists or whether the knowledge base user has access to it. If the model is missing, create it. If the knowledge base user does not have access to it, grant them the access permission.
NLP Model Settings
NLP model: Select an NLP model. The Pangu NLP model can be used for interactive dialogues, question answering, and content creation.
Extend Long Context: When enabled, the context length may be extended during document parsing to generate more comprehensive results. Additionally, Effective Context Length needs to be set to ensure optimal outputs.
AI Search Settings
Search service type: Select Web search engine service or Enhanced web search service.
Select search service: Select an available search engine service.
Deep Thinking Model: Select a deep thinking model.
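The "_faq" naming convention described in the note in Table 7 (embedding model pangu_embedding pairs with cache generation model pangu_embedding_faq) can be expressed as a one-line helper. A trivial sketch; the function name is hypothetical.

```python
def cache_model_name(embedding_model: str) -> str:
    """Derive the paired cache generation model name from an embedding
    model name, following the documented "_faq" suffix convention."""
    return embedding_model + "_faq"
```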
- Go to the Advanced Settings tab, set the parameters, and then click OK.
Table 8 Advanced settings
Reference Location
When enabled, generated answers will contain hyperlinks that point to the source text.
Image + Text
When enabled, the answer will include both text and images from the original documents. There are three image recalling methods:
- Recall only semantically related images (default): An image in a referenced paragraph is recalled only if its context is semantically related to the generated text.
- All images: Recall all images in the referenced text.
- AI recall: Recall images using AI.
NOTE:
- To enable this function, choose Parse and Split Settings > Parsing Settings > Image Parsing and select Retain Original Images.
- If you are modifying an existing knowledge base, this setting may be unavailable because an old knowledge base version is in use. To use it, ensure that you have purchased the image + text service, set the document parsing mode to Retain Original Images, and then reconstruct the knowledge base version based on the latest document configuration or retry the required documents.
- Currently, only AI recall is supported for knowledge bases whose language is not Chinese.
Tabular Q&A
When enabled, documents can be converted into tables, and NL2SQL can enable more accurate statistical analysis.
Knowledge Base Cache
When enabled, Q&A history will be cached. This enables the knowledge base to answer similar questions faster in the future. To use a knowledge base cache, you need to further set the following parameters:
- Cache Generation Model: Choose a model.
- Cache Threshold: Triggers the cache policy when this threshold is reached. Select a value from 0.1 to 1.
- Cache Policy: Select Highest Score or Random. This is the policy for choosing from multiple answers.
- Expiration Policy: Specifies how the cache is cleared. There are three options:
- Least Recently Used: Delete items that are least recently used.
- First In First Out: Delete the oldest data.
- Least Frequently Used: Delete the least frequently accessed cache items (with the least hits) when the cache capacity is about to run out.
- Keepalive Time (s): TTL of the cache. You can set it to Permanent.
Directory Management
When enabled, the default directory management function will be used to manage documents.
CAUTION: Reconfiguring the directory management settings will overwrite the existing settings.
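The knowledge base cache described above combines a similarity threshold with an expiration policy. The sketch below illustrates the general mechanism with a Least Recently Used policy and a placeholder word-overlap similarity; the platform uses its cache generation model for similarity, so every name and default value here is illustrative.

```python
from collections import OrderedDict

class QACache:
    """Sketch of a Q&A cache with a similarity threshold and an LRU
    expiration policy. Illustration only, not the platform's design."""

    def __init__(self, threshold=0.9, capacity=100):
        self.threshold = threshold
        self.capacity = capacity
        self._store = OrderedDict()  # question -> answer

    @staticmethod
    def _similarity(a: str, b: str) -> float:
        # Placeholder: word-overlap (Jaccard) similarity, NOT the real model.
        wa, wb = set(a.lower().split()), set(b.lower().split())
        return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

    def get(self, question: str):
        """Return a cached answer for a sufficiently similar question."""
        for cached_q, answer in self._store.items():
            if self._similarity(question, cached_q) >= self.threshold:
                self._store.move_to_end(cached_q)  # refresh LRU order
                return answer
        return None

    def put(self, question: str, answer: str):
        """Cache an answer, evicting the least recently used entry if full."""
        self._store[question] = answer
        self._store.move_to_end(question)
        if len(self._store) > self.capacity:
            self._store.popitem(last=False)  # evict least recently used
```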
You can check the basic information about the newly created knowledge base on the knowledge base management page, including the knowledge base ID, name, and status.
Modifying Knowledge Base Settings
You can modify the settings of an existing knowledge base.

New document parsing and splitting rules will be applied only to newly uploaded documents or retried documents.
- On the KooSearch console, choose Knowledge Bases from the left navigation pane.
The Knowledge Bases page is displayed.
- On the Knowledge Bases page, select an existing knowledge base, and click Manage Documents in the Operation column.
The Document Management page is displayed.
- Click Configure in the upper-right corner to modify parsing and splitting settings, and more.
- Parsing and splitting settings
- Recall policies
Recall policies are classified into text recall policies and FAQ recall policies.
Table 9 Recall policies
Text Recall Policy
Recall policy used for document searches. Options include semantic search, hybrid search, and keyword search.
- Semantic search: Queries document chunks using vector search, and FAQs using query-to-query similarity-based search.
- Hybrid search: Queries document chunks using the hybrid of vector search and keyword search, and FAQs using query-to-query similarity-based search.
- Keyword search: Queries document chunks using inverted index search, and FAQs using query-to-query similarity-based search.
Top-k recalls for semantic search: the number of recalls for each semantic search. If not specified, the default value 50 is used.
Top-k recalls by keyword: the number of recalls for each keyword-based search.
FAQ recalls: Obtains the similarity score through query-to-query similarity-based search and recalls the specified number of results. The default value is 2.
Refined Ranking: Filters and ranks search results before displaying them.
Reranking is enabled by default. When reranking is disabled, the relevance score ranges from 0 to 200; when it is enabled, the score ranges from 0 to 1. After enabling or disabling reranking, you must reconfigure the relevance threshold and reference relevance threshold; otherwise, relevance-based result filtering will be affected.
- Search Page Correlation Threshold: Only search results with a relevance score higher than the correlation (relevance) threshold will be displayed on the search results page.
- Q&A Reference Correlation Threshold: The search results with a relevance score higher than the correlation (relevance) threshold will be submitted to the LLM for summarization.
FAQ Recall Policy
Recall policy used for FAQ searches.
FAQ Recall Similarity Threshold: Obtains the similarity score through query-to-query similarity-based search and recalls results based on a similarity threshold. The default value is 0.8.
FAQs with a relevance score exceeding a separate, higher threshold are provided as answers directly, without LLM summarization. Default value: 0.95.
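The thresholds above interact as follows: an FAQ hit at or above the direct-answer threshold (default 0.95) bypasses the LLM, while document hits above the Q&A reference correlation threshold are passed to the LLM for summarization. A sketch assuming reranking is enabled (scores in 0 to 1); 0.95 is the documented default, and the 0.6 reference threshold is illustrative.

```python
def route_answer(faq_hits, doc_hits,
                 faq_direct_threshold=0.95, qa_reference_threshold=0.6):
    """Sketch of the documented routing, with reranking enabled.
    faq_hits: list of (answer, score); doc_hits: list of (text, score)."""
    # A sufficiently confident FAQ hit is returned as-is, no LLM needed.
    for answer, score in faq_hits:
        if score >= faq_direct_threshold:
            return ("direct", answer)
    # Otherwise, relevant document chunks go to the LLM for summarization.
    references = [text for text, score in doc_hits
                  if score >= qa_reference_threshold]
    return ("summarize", references)
```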
- More settings
Modify Search Model Settings, NLP Model Settings, AI Search Settings, and Advanced Settings. For details, see step 5 and step 6 in section "Creating a Knowledge Base."
You can also configure the settings under Others.
Table 10 Other settings
Reference Documents
Sets the number of reference documents for the RAG model.
If not configured, the default value 3 is used.
Query Rewriting
User queries are split and rewritten based on the multi-turn dialog. The rewritten queries are used for document retrieval only.
Intent Classification
Select an intent category.
- Human interaction: What's your name?
- Weather: What is the weather today?
- Industry knowledge: Prefix matching is recommended, allowing for future extensions. For example, "Industry knowledge-Finance: What is the definition of loan restructuring?"
- Industry knowledge-Manufacturing: What is the current stage of China's manufacturing?
- Industry knowledge-Healthcare: What types of medical errors are there?
- Industry knowledge-Government: What are the main guidelines in the New-Generation Artificial Intelligence Development Plan issued by the State Council of China?
- Industry knowledge-Finance: How is the stock market doing today?
- NLP task: Please write an email of about 460 words asking for details about a new IT project. This email will be sent to the company's IT project manager.
- General knowledge: What is the difference between soybean juice and soy milk?
- Chit-chat: It's so exhausting taking a long-distance train.
NOTE: Questions with identified intents are answered by the LLM directly. Questions with unidentified intents are first searched in the knowledge base, and the LLM then summarizes the search results to generate answers.
Refuse Certain Questions
When enabled, you can set Response When Refusing a Question. If no answer is found for a question, this preset response is returned.
General Prompt
- Use scenarios: non-RAG. In non-RAG scenarios, there are no search processes. The generative AI model generates answers directly.
- Elements: The prompt must contain the question, task instructions, and other requirements.
- Usage: The prompt can be customized. If not specified, the default prompt is used. Refer to the format of the default prompt when you write a custom prompt.
Custom Prompt for Question Generation
You are an expert in question extraction. Please summarize and generate up to {0} high-quality questions based on the content of the document text provided below. The requirements are as follows: (1) The generated questions should be answerable based on the provided document text. (2) Present the questions in a conversational and personalized manner, suitable for a knowledge base Q&A format. (3) Avoid revealing that your answer is based on some reference material. (4) Make sure the questions are diverse in terms of the knowledge points they cover. (5) Avoid overly simple questions; maintain high quality in the generated questions. Document text: {1}
Note: {0} and {1} are placeholders in a fixed sequence. The retrieved document content will be filled to the location indicated by {1}. The format is as follows: [Document name]: {title1} [Document content]: {content1} [Document name]: {title2} [Document content]: {content2} ...... The number of questions generated will be filled to the location indicated by {0}.
Custom Prompt for Answer Generation
You are an expert in question extraction. Please summarize and generate up to {0} high-quality questions based on the content of the document text provided below. The requirements are as follows: (1) The generated questions should be answerable based on the provided document text. (2) Present the questions in a conversational and personalized manner, suitable for a knowledge base Q&A format. (3) Avoid revealing that your answer is based on some reference material. (4) Make sure the questions are diverse in terms of the knowledge points they cover. (5) Avoid overly simple questions; maintain high quality in the generated questions. Document text: {1}
Note: {0} and {1} are placeholders in a fixed sequence. The retrieved document content will be filled to the location indicated by {1}. The format is as follows: [Document name]: {title} [Document content]: {content} The number of questions generated will be filled to the location indicated by {0} before answers are generated.
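The placeholder convention above ({0} for the question count, {1} for the retrieved documents) works like standard positional string formatting. A Python sketch with an abbreviated version of the documented template; the helper name and document-pair input format are illustrative.

```python
# Abbreviated version of the documented question-generation template.
PROMPT_TEMPLATE = (
    "You are an expert in question extraction. Please summarize and "
    "generate up to {0} high-quality questions based on the content of "
    "the document text provided below. Document text: {1}"
)

def build_prompt(num_questions, documents):
    """Fill {0} with the question count and {1} with the retrieved
    documents, formatted as documented name/content pairs."""
    doc_text = " ".join(
        "[Document name]: {} [Document content]: {}".format(title, content)
        for title, content in documents
    )
    return PROMPT_TEMPLATE.format(num_questions, doc_text)
```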
- Click OK to confirm the modification.
- After the modification, you need to re-import the required documents and files for the new knowledge base settings to take effect.