File content parsing
Function
Split and merge document parsing results.
URI
POST /v1/koosearch/doc-search/parse-result/split
Request Parameters
Parameter | Mandatory | Type | Description |
---|---|---|---|
X-Auth-Token | Yes | String | Parameter description: Token used for API authentication. For how to obtain the token, see section 3.2 "Authentication." Constraints: N/A. |
Parameter | Mandatory | Type | Description |
---|---|---|---|
doc | Yes | ParsedDocument object | Document parsing information |
mode | No | Integer | Splitting mode. 0: raw; 1: split by contents; 2: split by section rules; 3: split by length; 4: auto split |
rule_regexs | No | Array of strings | Regular expressions used for matching |
chunk_size | No | Integer | Segment size |
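The mode values above are bare integers, so client code may be easier to read if they are named. The following is a minimal sketch under stated assumptions: the enum `SplitMode`, its member names, and the helper `split_options` are illustrative and not part of the API. Only the integer values and the field names `mode`, `rule_regexs`, and `chunk_size` come from the table.

```python
from enum import IntEnum


class SplitMode(IntEnum):
    """Illustrative names for the documented `mode` values (names are assumptions)."""
    RAW = 0               # 0: raw
    BY_CONTENTS = 1       # 1: split by contents
    BY_SECTION_RULES = 2  # 2: split by section rules
    BY_LENGTH = 3         # 3: split by length
    AUTO = 4              # 4: auto split


def split_options(mode, rule_regexs=None, chunk_size=None):
    """Return the optional body fields, omitting any that were not supplied."""
    # SplitMode(...) raises ValueError for values outside 0..4.
    options = {"mode": int(SplitMode(mode))}
    if rule_regexs is not None:
        options["rule_regexs"] = list(rule_regexs)
    if chunk_size is not None:
        options["chunk_size"] = int(chunk_size)
    return options
```

For example, `split_options(SplitMode.BY_LENGTH, chunk_size=500)` produces the optional fields for length-based splitting, which can then be merged into the request body next to `doc`.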
Parameter | Mandatory | Type | Description |
---|---|---|---|
doc_id | Yes | String | Document ID, generated from a UUID |
doc_name | Yes | String | Document name |
doc_type | Yes | String | Document type, for example, PDF or DOCX |
preview_file_url | No | String | Preview file address |
original_file | No | String | Original document path |
file_size | No | Integer | Original document size, in bytes |
pages | No | Array of ParsedDocumentPage objects | Document page information |
images | No | Array of ParsedDocumentImage objects | Document image information |
original_tables | No | Array of OriginalTable objects | Original table information |
Parameter | Mandatory | Type | Description |
---|---|---|---|
page_num | Yes | Integer | Page number, which indicates the sequence number of a page in the document |
preview_image_url | No | String | Address of the document page preview image |
components | No | Array of ParsedDocumentComponent objects | Paragraph information on the page |
Parameter | Mandatory | Type | Description |
---|---|---|---|
id | Yes | String | Paragraph ID, generated from a UUID |
text | Yes | String | Paragraph content |
component_num | Yes | Integer | Paragraph number, which indicates the sequence number of a paragraph in the document. The value starts from 1. |
pdf_coordinate | Yes | Array<Array<Integer>> | Coordinates of the paragraph on the page, given as four points in the order upper left, upper right, lower right, and lower left. Used for highlighting. |
original_table_id | No | String | Set only when the paragraph comes from a table that was split. It identifies the original long table, which is saved to support the small2big feature. |
Parameter | Mandatory | Type | Description |
---|---|---|---|
image_id | Yes | String | Image ID, consisting of the prefix img- followed by a UUID |
url | No | String | OBS path to which the image is uploaded |
data | No | String | Base64-encoded image data |
title | No | String | Image title |
desc | No | String | Image description |
width | No | Integer | Image width |
height | No | Integer | Image height |
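The request parameters above can be assembled into a call roughly as follows. This is a sketch rather than an official example: the host in `ENDPOINT` is a placeholder, the `Content-Type` header is an assumption (the page lists only `X-Auth-Token`), and `build_split_request` is a hypothetical helper name. The request is built but not sent.

```python
import json
import urllib.request

# Placeholder host (assumption); replace with the real service endpoint.
ENDPOINT = "https://koosearch.example.com"
SPLIT_PATH = "/v1/koosearch/doc-search/parse-result/split"


def build_split_request(token, doc, mode=None, rule_regexs=None, chunk_size=None):
    """Build (but do not send) the POST request for the split API."""
    # doc_id, doc_name, and doc_type are the mandatory ParsedDocument fields.
    missing = [key for key in ("doc_id", "doc_name", "doc_type") if key not in doc]
    if missing:
        raise ValueError(f"doc is missing mandatory fields: {missing}")

    body = {"doc": doc}
    # Optional fields are only included when the caller supplies them.
    if mode is not None:
        body["mode"] = mode
    if rule_regexs is not None:
        body["rule_regexs"] = list(rule_regexs)
    if chunk_size is not None:
        body["chunk_size"] = chunk_size

    return urllib.request.Request(
        ENDPOINT + SPLIT_PATH,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "X-Auth-Token": token,
            "Content-Type": "application/json",  # assumed, not stated on this page
        },
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` would send it; any other HTTP client works equally well with the same URL, header, and JSON body.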
Response Parameters
Status code: 200
Parameter | Type | Description |
---|---|---|
doc_id | String | Document ID, generated from a UUID |
doc_name | String | Document name |
doc_type | String | Document type, for example, PDF or DOCX |
preview_file_url | String | Preview file address |
original_file | String | Original document path |
file_size | Integer | Original document size, in bytes |
pages | Array of ParsedDocumentPage objects | Document page information |
images | Array of ParsedDocumentImage objects | Document image information |
original_tables | Array of OriginalTable objects | Original table information |
Parameter | Type | Description |
---|---|---|
page_num | Integer | Page number, which indicates the sequence number of a page in the document |
preview_image_url | String | Address of the document page preview image |
components | Array of ParsedDocumentComponent objects | Paragraph information on the page |
Parameter | Type | Description |
---|---|---|
id | String | Paragraph ID, generated from a UUID |
text | String | Paragraph content |
component_num | Integer | Paragraph number, which indicates the sequence number of a paragraph in the document. The value starts from 1. |
pdf_coordinate | Array<Array<Integer>> | Coordinates of the paragraph on the page, given as four points in the order upper left, upper right, lower right, and lower left. Used for highlighting. |
original_table_id | String | Set only when the paragraph comes from a table that was split. It identifies the original long table, which is saved to support the small2big feature. |
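Because `pdf_coordinate` lists four corner points, a highlight rectangle can be derived by taking the extremes of those points. The sketch below assumes that each inner array is an `[x, y]` pair; the page states only the corner order, not the layout of each point or the axis orientation, so the result is expressed as neutral minimum and maximum values. The function name `bounding_box` is illustrative.

```python
def bounding_box(pdf_coordinate):
    """Return (min_x, min_y, max_x, max_y) covering the four corner points.

    Assumes each point is an [x, y] pair, which this page does not confirm.
    """
    if len(pdf_coordinate) != 4:
        raise ValueError("expected four corner points")
    xs = [point[0] for point in pdf_coordinate]
    ys = [point[1] for point in pdf_coordinate]
    return min(xs), min(ys), max(xs), max(ys)
```

Using the extremes rather than trusting the first and third points keeps the rectangle correct even if a paragraph's corners are slightly skewed.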
Parameter | Type | Description |
---|---|---|
image_id | String | Image ID, consisting of the prefix img- followed by a UUID |
url | String | OBS path to which the image is uploaded |
data | String | Base64-encoded image data |
title | String | Image title |
desc | String | Image description |
width | Integer | Image width |
height | Integer | Image height |
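An image entry may carry its content inline in `data` or only reference it through `url`. The following sketch decodes the inline case. It assumes standard Base64 with no data-URI prefix, which the page does not specify, and `image_bytes` is a hypothetical helper name.

```python
import base64
import binascii


def image_bytes(image):
    """Return decoded bytes from the `data` field, or None when it is absent."""
    data = image.get("data")
    if not data:
        return None  # the caller may fall back to the `url` field
    try:
        # validate=True rejects characters outside the Base64 alphabet.
        return base64.b64decode(data, validate=True)
    except binascii.Error as exc:
        raise ValueError(
            f"image {image.get('image_id')} has invalid Base64 data"
        ) from exc
```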
Parameter | Type | Description |
---|---|---|
id | String | Table ID. ParsedDocumentComponent objects reference this ID so that the table content is not stored multiple times. |
content | String | Table content |
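Since components point at original tables by ID instead of embedding them, a client that wants the full table for a split fragment has to join the two lists. The sketch below shows only that lookup; how the small2big feature uses the result is not described on this page, and `resolve_original_tables` is an illustrative name.

```python
def resolve_original_tables(document):
    """Map each component ID to the content of the original table it was split from."""
    tables = {
        table["id"]: table["content"]
        for table in document.get("original_tables") or []
    }
    resolved = {}
    for page in document.get("pages") or []:
        for component in page.get("components") or []:
            table_id = component.get("original_table_id")
            # Components that are not table fragments have no original_table_id.
            if table_id in tables:
                resolved[component["id"]] = tables[table_id]
    return resolved
```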
Status code: 400
Parameter | Type | Description |
---|---|---|
error_code | String | Error code |
error_msg | String | Error description |
Status code: 401
Parameter | Type | Description |
---|---|---|
error_code | String | Error code |
error_msg | String | Error description |
Status code: 500
Parameter | Type | Description |
---|---|---|
error_code | String | Error code |
error_msg | String | Error description |
Example Requests
None
Example Responses
Status code: 200
Document splitting and merging result
{
  "pages" : [ {
    "components" : [ {
      "id" : "393c1d28cd9c40f5ad9f7a2d33dffb80",
      "text" : "1--- Level-1 title 1\n1.1 Level-2 title 1\nContent\n1.2 Level-2 title 2\nContent\n2 --- Level-1 title 2\n2.1 Level-2 title 3\nContent\n2.2 Level-2 title 4\nContent\nlistItemByLevel: stores elements of each level, for example, 0, 1, and 2. 0 indicates a level-1 title. The corresponding value is the level-1 title in the document. New group titles will be cleared.\nlistContextMap: stores the mapping between numId and listContext.\nNew title of itemContext. The member variable number indicates the sequence number of the title. There is a parent-child relationship.\nListItemContext: nb in parent is used to count the number of children.\nThe numId of the level-1 and level-2 titles of this document is the same.\n3 --- Level-1 title 3\n3.1 Level-2 title 1\nContent\n3.2 Level-2 title 2\nContent\n4 --- Level-1 title 4\n4.1 Level-2 title 3\nContent\n4.2 Level-2 title 4\nContent\nlistItemByLevel stores elements of each level, for example, 0, 1, and 2. 0 indicates a level-1 title. The corresponding value is the level-1 title in the document. New group titles will be cleared.\nlistContextMap: stores the mapping between numId and listContext.\nNew title of itemContext. The member variable number indicates the sequence number of the title. There is a parent-child relationship.\nListItemContext: nb in parent is used to count the number of children.\nThe numId of the level-1 and level-2 titles of this document is the same.\n5 --- Level-1 title 5\n5.1 Level-2 title 1\nContent\n5.2 Level-2 title 2\nContent\n6 --- Level-1 title 6\n6.1 Level-2 title 3\nContent\n6.2 Level-2 title 4\nContent\nlistItemByLevel stores elements of each level, for example, 0, 1, and 2. 0 indicates a level-1 title. The corresponding value is the level-1 title in the document. New group titles will be cleared.\nlistContextMap: stores the mapping between numId and listContext.\nNew title of itemContext. The member variable number indicates the sequence number of the title. There is a parent-child relationship.\nListItemContext: nb in parent is used to count the number of children.\nThe numId of the level-1 and level-2 titles of this document is the same.\n7--- Level-1 title 7\n7.1 Level-2 title 1\nContent\n7.2 Level-2 title 2\nContent\n8--- Level-1 title 8\n8.1 Level-2 title 3\nContent\n8.2 Level-2 title 4\nContent\nlistItemByLevel stores elements of each level, for example, 0, 1, and 2. 0 indicates a level-1 title. The corresponding value is the level-1 title in the document. New group titles will be cleared.\nlistContextMap: stores the mapping between numId and listContext.\nNew title of itemContext. The member variable number indicates the sequence number of the title. There is a parent-child relationship.\nListItemContext: nb in parent is used to count the number of children.\nThe numId of the level-1 and level-2 titles of this document is the same.",
      "component_num" : 1
    } ],
    "page_num" : 0
  } ],
  "doc_id" : "844f805a7255437b8c139f4331ec3012",
  "doc_name" : "Test Title No..docx",
  "doc_type" : "DOCX",
  "original_file" : "uni-search/files/729cbd739854470da5426ed26bd900ca/fb9731ab-7085-474f-b6c7-64473586f0f3/c5e7dc40-9d43-49fd-8b5f-12c906ed66d2/d5a4ced94f07050841eb9424f87096af/Test Title No..docx",
  "file_size" : 68621
}
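A response of this shape can be flattened into individual text chunks by walking `pages` and then `components`. The following is a minimal sketch; the function name `iter_chunks` is illustrative, and only the field names are taken from the response tables.

```python
import json


def iter_chunks(response_body):
    """Yield (page_num, component_num, text) for every component in the response."""
    document = json.loads(response_body)
    # pages and components may be absent, so default to empty lists.
    for page in document.get("pages", []):
        for component in page.get("components", []):
            yield page["page_num"], component["component_num"], component["text"]
```

Each yielded tuple corresponds to one split segment, which is typically what gets indexed or embedded downstream.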
Status Codes
Status Code | Description |
---|---|
200 | Document splitting and merging result |
400 | Invalid request parameters |
401 | Authentication error |
500 | Internal error |
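The status codes above, together with the `error_code` and `error_msg` fields of the error responses, can be turned into a single exception type on the client side. This is a sketch only: `SplitApiError` and `check_response` are hypothetical names, and the body is assumed to have already been parsed from JSON into a dictionary.

```python
STATUS_DESCRIPTIONS = {
    200: "Document splitting and merging result",
    400: "Invalid request parameters",
    401: "Authentication error",
    500: "Internal error",
}


class SplitApiError(Exception):
    """Raised for any non-200 status; carries the documented error fields."""

    def __init__(self, status, error_code=None, error_msg=None):
        self.status = status
        self.error_code = error_code
        self.error_msg = error_msg
        summary = STATUS_DESCRIPTIONS.get(status, "Unexpected status")
        super().__init__(f"{status} {summary}: {error_code} {error_msg}")


def check_response(status, body):
    """Return the parsed body on 200; raise SplitApiError otherwise."""
    if status == 200:
        return body
    raise SplitApiError(status, body.get("error_code"), body.get("error_msg"))
```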
Error Codes
See Error Codes.