Updated on 2025-08-13 GMT+08:00

File content parsing

Function

Split and merge document parsing results.

URI

POST /v1/koosearch/doc-search/parse-result/split

Request Parameters

Table 1 Request header parameters

| Parameter | Mandatory | Type | Description |
| --- | --- | --- | --- |
| X-Auth-Token | Yes | String | Parameter description: Token used for API authentication. For how to obtain the token, see section 3.2 "Authentication." Constraints: N/A. |

Table 2 Request body parameters

| Parameter | Mandatory | Type | Description |
| --- | --- | --- | --- |
| doc | Yes | ParsedDocument object | Document parsing information |
| mode | No | Integer | Splitting mode. 0: raw; 1: split by contents; 2: split by section rules; 3: split by length; 4: auto split |
| rule_regexs | No | Array of strings | Regular expressions used for matching |
| chunk_size | No | Integer | Segment size |
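The request body in Table 2 can be assembled with a small helper. The following is a minimal sketch rather than official client code: `SplitMode` and `build_split_body` are hypothetical names, and the enum member names are only paraphrases of the mode descriptions in the table.

```python
from enum import IntEnum
from typing import Any, Optional


class SplitMode(IntEnum):
    """Values of the `mode` parameter as listed in Table 2 (names are illustrative)."""
    RAW = 0
    BY_CONTENTS = 1
    BY_SECTION_RULES = 2
    BY_LENGTH = 3
    AUTO = 4


def build_split_body(
    doc: dict[str, Any],
    mode: Optional[int] = None,
    rule_regexs: Optional[list[str]] = None,
    chunk_size: Optional[int] = None,
) -> dict[str, Any]:
    """Assemble the request body; optional fields are omitted when unset."""
    body: dict[str, Any] = {"doc": doc}  # `doc` is the only mandatory field
    if mode is not None:
        # SplitMode(...) raises ValueError for values outside 0-4.
        body["mode"] = int(SplitMode(mode))
    if rule_regexs is not None:
        body["rule_regexs"] = list(rule_regexs)
    if chunk_size is not None:
        body["chunk_size"] = chunk_size
    return body
```

For example, `build_split_body(doc, mode=SplitMode.BY_LENGTH, chunk_size=500)` produces a body with `mode` set to 3. Pairing `chunk_size` with mode 3 (and `rule_regexs` with mode 2) is a guess based on the parameter names; the table does not state which modes read which parameters.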

Table 3 ParsedDocument

| Parameter | Mandatory | Type | Description |
| --- | --- | --- | --- |
| doc_id | Yes | String | Document ID, which is generated based on the UUID |
| doc_name | Yes | String | Document name |
| doc_type | Yes | String | Document type, for example, PDF or DOCX |
| preview_file_url | No | String | Preview file address |
| original_file | No | String | Original document path |
| file_size | No | Integer | Original document size, in bytes |
| pages | No | Array of ParsedDocumentPage objects | Document page information |
| images | No | Array of ParsedDocumentImage objects | Document image information |
| original_tables | No | Array of OriginalTable objects | Original table information |
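Only three ParsedDocument fields are mandatory, so the smallest valid object is short. The sketch below is an illustration under assumptions: `new_parsed_document` is a hypothetical helper, and the ID format (a dash-free UUID4 hex string) is inferred from the 32-character IDs in the example response, since the table only says the ID is "generated based on the UUID". Whether callers are expected to generate the ID themselves is not stated either.

```python
import uuid
from typing import Any


def new_parsed_document(doc_name: str, doc_type: str) -> dict[str, Any]:
    """Return a ParsedDocument holding only the mandatory fields from Table 3."""
    return {
        # Assumption: dash-free UUID4 hex, matching the example response IDs.
        "doc_id": uuid.uuid4().hex,
        "doc_name": doc_name,
        "doc_type": doc_type,
    }
```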

Table 4 ParsedDocumentPage

| Parameter | Mandatory | Type | Description |
| --- | --- | --- | --- |
| page_num | Yes | Integer | Page number, which indicates the sequence number of a page in the document |
| preview_image_url | No | String | Address of the document page preview image |
| components | No | Array of ParsedDocumentComponent objects | Paragraph information on the page |

Table 5 ParsedDocumentComponent

| Parameter | Mandatory | Type | Description |
| --- | --- | --- | --- |
| id | Yes | String | Paragraph ID, which is generated based on the UUID |
| text | Yes | String | Paragraph content |
| component_num | Yes | Integer | Paragraph number, which indicates the sequence number of a paragraph in the document. The value starts from 1. |
| pdf_coordinate | Yes | Array<Array<Integer>> | Coordinates of the paragraph on the page, given as the upper-left, upper-right, lower-right, and lower-left corners in that order. Used for highlighting. |
| original_table_id | No | String | ID of the original table. This parameter has a value only when a table has been split. It points to the saved original long table to support the small2big feature. |
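Because `pdf_coordinate` lists four corners, a highlighting client often wants a single rectangle instead. The helper below is a sketch, not part of the API: `bounding_box` is a hypothetical name, and it assumes each inner array is an `[x, y]` pair, which the table does not spell out.

```python
def bounding_box(pdf_coordinate: list[list[int]]) -> tuple[int, int, int, int]:
    """Return (min_x, min_y, max_x, max_y) for a four-corner coordinate list.

    Assumption: each inner array is an [x, y] pair. The source does not
    state the layout of the inner arrays or the origin of the axes.
    """
    if len(pdf_coordinate) != 4 or any(len(p) != 2 for p in pdf_coordinate):
        raise ValueError("expected four [x, y] points")
    xs = [p[0] for p in pdf_coordinate]
    ys = [p[1] for p in pdf_coordinate]
    return min(xs), min(ys), max(xs), max(ys)
```

Taking the minimum and maximum of each axis keeps the helper independent of whether the y axis grows upward or downward.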

Table 6 ParsedDocumentImage

| Parameter | Mandatory | Type | Description |
| --- | --- | --- | --- |
| image_id | Yes | String | Image ID, which consists of the prefix img- followed by a UUID |
| url | No | String | OBS path to which the image is uploaded |
| data | No | String | Base64-encoded image data |
| title | No | String | Image title |
| desc | No | String | Image description |
| width | No | Integer | Image width |
| height | No | Integer | Image height |
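The `data` field carries the image as Base64 text, so it must be decoded before the bytes can be written or displayed. A minimal sketch follows; `image_bytes` is a hypothetical helper, and it assumes the standard Base64 alphabet with no data-URI prefix, which the table does not confirm.

```python
import base64
import binascii
from typing import Optional


def image_bytes(image: dict) -> Optional[bytes]:
    """Decode the `data` field of a ParsedDocumentImage, if present."""
    data = image.get("data")
    if not data:
        # `data` is optional; the image may only be reachable through `url`.
        return None
    try:
        # validate=True rejects characters outside the Base64 alphabet.
        return base64.b64decode(data, validate=True)
    except binascii.Error as exc:
        raise ValueError(
            f"image {image.get('image_id')!r} has invalid Base64 data"
        ) from exc
```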

Table 7 OriginalTable

| Parameter | Mandatory | Type | Description |
| --- | --- | --- | --- |
| id | Yes | String | Table ID. ParsedDocumentComponent references this identifier to avoid storing multiple copies of the table. |
| content | Yes | String | Table content |
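The link between a split table fragment and its original table can be followed with a simple lookup. This is a sketch of one way to do it, with `original_table_for` as a hypothetical helper name; it relies only on the `original_table_id`, `original_tables`, `id`, and `content` fields described in the tables above.

```python
from typing import Any, Optional


def original_table_for(
    component: dict[str, Any], doc: dict[str, Any]
) -> Optional[str]:
    """Return the content of the original table a component was split from."""
    table_id = component.get("original_table_id")
    if not table_id:
        return None  # the component is not a fragment of a split table
    # Index the document's original tables by ID for the lookup.
    tables = {t["id"]: t["content"] for t in doc.get("original_tables") or []}
    return tables.get(table_id)
```

A retrieval pipeline could match on the small fragment and then pass the full table returned here to the next stage, which is the usual meaning of a small-to-big lookup.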

Response Parameters

Status code: 200

Table 8 Response body parameters

| Parameter | Type | Description |
| --- | --- | --- |
| doc_id | String | Document ID, which is generated based on the UUID |
| doc_name | String | Document name |
| doc_type | String | Document type, for example, PDF or DOCX |
| preview_file_url | String | Preview file address |
| original_file | String | Original document path |
| file_size | Integer | Original document size, in bytes |
| pages | Array of ParsedDocumentPage objects | Document page information |
| images | Array of ParsedDocumentImage objects | Document image information |
| original_tables | Array of OriginalTable objects | Original table information |

Table 9 ParsedDocumentPage

| Parameter | Type | Description |
| --- | --- | --- |
| page_num | Integer | Page number, which indicates the sequence number of a page in the document |
| preview_image_url | String | Address of the document page preview image |
| components | Array of ParsedDocumentComponent objects | Paragraph information on the page |

Table 10 ParsedDocumentComponent

| Parameter | Type | Description |
| --- | --- | --- |
| id | String | Paragraph ID, which is generated based on the UUID |
| text | String | Paragraph content |
| component_num | Integer | Paragraph number, which indicates the sequence number of a paragraph in the document. The value starts from 1. |
| pdf_coordinate | Array<Array<Integer>> | Coordinates of the paragraph on the page, given as the upper-left, upper-right, lower-right, and lower-left corners in that order. Used for highlighting. |
| original_table_id | String | ID of the original table. This parameter has a value only when a table has been split. It points to the saved original long table to support the small2big feature. |

Table 11 ParsedDocumentImage

| Parameter | Type | Description |
| --- | --- | --- |
| image_id | String | Image ID, which consists of the prefix img- followed by a UUID |
| url | String | OBS path to which the image is uploaded |
| data | String | Base64-encoded image data |
| title | String | Image title |
| desc | String | Image description |
| width | Integer | Image width |
| height | Integer | Image height |

Table 12 OriginalTable

| Parameter | Type | Description |
| --- | --- | --- |
| id | String | Table ID. ParsedDocumentComponent references this identifier to avoid storing multiple copies of the table. |
| content | String | Table content |

Status code: 400

Table 13 Response body parameters

| Parameter | Type | Description |
| --- | --- | --- |
| error_code | String | Error code |
| error_msg | String | Error description |

Status code: 401

Table 14 Response body parameters

| Parameter | Type | Description |
| --- | --- | --- |
| error_code | String | Error code |
| error_msg | String | Error description |

Status code: 500

Table 15 Response body parameters

| Parameter | Type | Description |
| --- | --- | --- |
| error_code | String | Error code |
| error_msg | String | Error description |

Example Requests

None
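The source lists no example request. As a rough, unofficial illustration, a request could be assembled as in the sketch below. The endpoint host, the `Content-Type` header, the `make_split_request` helper, and the `chunk_size` value are all assumptions; only the path, the `X-Auth-Token` header, and the body field names come from this page.

```python
import json
import urllib.request

# Hypothetical endpoint; substitute the one for your region and service.
ENDPOINT = "https://koosearch.example.com"


def make_split_request(token: str, body: dict) -> urllib.request.Request:
    """Build (but do not send) the POST request for the split API."""
    return urllib.request.Request(
        url=ENDPOINT + "/v1/koosearch/doc-search/parse-result/split",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "X-Auth-Token": token,
            "Content-Type": "application/json",  # assumed; not listed in Table 1
        },
        method="POST",
    )


request = make_split_request(
    token="<token>",
    body={
        "doc": {
            "doc_id": "844f805a7255437b8c139f4331ec3012",
            "doc_name": "Test Title No..docx",
            "doc_type": "DOCX",
        },
        "mode": 3,          # split by length
        "chunk_size": 500,  # illustrative value
    },
)
# Sending is left to the caller, for example urllib.request.urlopen(request).
```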

Example Responses

Status code: 200

Document splitting and merging result

{
  "pages" : [ {
    "components" : [ {
      "id" : "393c1d28cd9c40f5ad9f7a2d33dffb80",
      "text" : "1--- Level-1 title 1\n1.1 Level-2 title 1\nContent\n1.2 Level-2 title 2\nContent\n2 --- Level-1 title 2\n2.1 Level-2 title 3\nContent\n2.2 Level-2 title 4\nContent\nlistItemByLevel: stores elements of each level, for example, 0, 1, and 2. 0 indicates a level-1 title. The corresponding value is the level-1 title in the document. New group titles will be cleared.\nlistContextMap: stores the mapping between numId and listContext.\nNew title of itemContext. The member variable number indicates the sequence number of the title. There is a parent-child relationship.\nListItemContext: nb in parent is used to count the number of children.\nThe numId of the level-1 and level-2 titles of this document is the same.\n3 --- Level-1 title 3\n3.1 Level-2 title 1\nContent\n3.2 Level-2 title 2\nContent\n4 --- Level-1 title 4\n4.1 Level-2 title 3\nContent\n4.2 Level-2 title 4\nContent\nlistItemByLevel stores elements of each level, for example, 0, 1, and 2. 0 indicates a level-1 title. The corresponding value is the level-1 title in the document. New group titles will be cleared.\nlistContextMap: stores the mapping between numId and listContext.\nNew title of itemContext. The member variable number indicates the sequence number of the title. There is a parent-child relationship.\nListItemContext: nb in parent is used to count the number of children.\nThe numId of the level-1 and level-2 titles of this document is the same.\n5 --- Level-1 title 5\n5.1 Level-2 title 1\nContent\n5.2 Level-2 title 2\nContent\n6 --- Level-1 title 6\n6.1 Level-2 title 3\nContent\n6.2 Level-2 title 4\nContent\nlistItemByLevel stores elements of each level, for example, 0, 1, and 2. 0 indicates a level-1 title. The corresponding value is the level-1 title in the document. New group titles will be cleared.\nlistContextMap: stores the mapping between numId and listContext.\nNew title of itemContext. The member variable number indicates the sequence number of the title. There is a parent-child relationship.\nListItemContext: nb in parent is used to count the number of children.\nThe numId of the level-1 and level-2 titles of this document is the same.\n7--- Level-1 title 7\n7.1 Level-2 title 1\nContent\n7.2 Level-2 title 2\nContent\n8--- Level-1 title 8\n8.1 Level-2 title 3\nContent\n8.2 Level-2 title 4\nContent\nlistItemByLevel stores elements of each level, for example, 0, 1, and 2. 0 indicates a level-1 title. The corresponding value is the level-1 title in the document. New group titles will be cleared.\nlistContextMap: stores the mapping between numId and listContext.\nNew title of itemContext. The member variable number indicates the sequence number of the title. There is a parent-child relationship.\nListItemContext: nb in parent is used to count the number of children.\nThe numId of the level-1 and level-2 titles of this document is the same.",
      "component_num" : 1
    } ],
    "page_num" : 0
  } ],
  "doc_id" : "844f805a7255437b8c139f4331ec3012",
  "doc_name" : "Test Title No..docx",
  "doc_type" : "DOCX",
  "original_file" : "uni-search/files/729cbd739854470da5426ed26bd900ca/fb9731ab-7085-474f-b6c7-64473586f0f3/c5e7dc40-9d43-49fd-8b5f-12c906ed66d2/d5a4ced94f07050841eb9424f87096af/Test Title No..docx",
  "file_size" : 68621
}
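A response like the one above is typically consumed by walking `pages` and then `components`. The sketch below shows one way to do that; `iter_components` is a hypothetical helper, and the embedded JSON is a trimmed-down stand-in for the full response body, not additional API output.

```python
import json
from typing import Iterator


def iter_components(doc: dict) -> Iterator[tuple[int, int, str]]:
    """Yield (page_num, component_num, text) for every component in a response."""
    for page in doc.get("pages") or []:
        for component in page.get("components") or []:
            yield page["page_num"], component["component_num"], component["text"]


# A shortened copy of the example response, kept small for readability.
response_text = """
{
  "pages": [{"components": [{"id": "393c1d28cd9c40f5ad9f7a2d33dffb80",
                             "text": "1--- Level-1 title 1\\n1.1 Level-2 title 1\\nContent",
                             "component_num": 1}],
             "page_num": 0}],
  "doc_id": "844f805a7255437b8c139f4331ec3012",
  "doc_type": "DOCX"
}
"""
chunks = list(iter_components(json.loads(response_text)))
```

Each tuple in `chunks` carries the page number, the paragraph sequence number, and the paragraph text, which is usually what an indexing step needs.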

Status Codes

| Status Code | Description |
| --- | --- |
| 200 | Document splitting and merging result |
| 400 | Invalid request parameters |
| 401 | Authentication error |
| 500 | Internal error |
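Since every non-200 status returns the same `error_code` and `error_msg` pair, a client can handle them in one place. The following is a minimal sketch under that reading; `SplitApiError` and `parse_response` are hypothetical names and not part of any SDK.

```python
import json


class SplitApiError(Exception):
    """Raised for a non-200 response; the class name is hypothetical."""

    def __init__(self, status: int, error_code: str, error_msg: str):
        super().__init__(f"HTTP {status} [{error_code}]: {error_msg}")
        self.status = status
        self.error_code = error_code
        self.error_msg = error_msg


def parse_response(status: int, body: bytes) -> dict:
    """Return the parsed document on 200, otherwise raise SplitApiError."""
    if status == 200:
        return json.loads(body)
    try:
        payload = json.loads(body)
    except ValueError:
        payload = {}  # the error body was empty or not valid JSON
    if not isinstance(payload, dict):
        payload = {}
    raise SplitApiError(
        status,
        str(payload.get("error_code", "")),
        str(payload.get("error_msg", "")),
    )
```

Callers can branch on `status` (400, 401, or 500) to decide whether to fix the request, refresh the token, or retry later.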

Error Codes

See Error Codes.