Word Segmentation

Introduction

This API segments input text into words and can optionally tag each word with its part of speech (POS).

For details about endpoints, see Endpoints.

Calling NLP APIs incurs fees. NLP packages come in basic and domain-specific editions. Before purchasing a package, check which APIs each edition supports in the Natural Language Processing Price Calculator.

URI

  • URI format
    POST /v1/{project_id}/nlp-fundamental/segment
  • Parameter description
    Table 1 URI parameters

    | Parameter | Mandatory | Description |
    |---|---|---|
    | project_id | Yes | Project ID. For details about how to obtain the project ID, see Obtaining a Project ID. |
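As a minimal sketch, the request URL can be assembled from the URI format above as follows. The helper name build_segment_url and the sample endpoint and project ID are placeholders chosen for illustration; they are not values defined by this reference.

```python
from urllib.parse import quote


def build_segment_url(endpoint: str, project_id: str) -> str:
    """Return the full URL of the word segmentation API."""
    if not endpoint or not project_id:
        raise ValueError("endpoint and project_id are both required")
    # Percent-encode the project ID so it is safe to use as a path segment.
    return f"https://{endpoint}/v1/{quote(project_id, safe='')}/nlp-fundamental/segment"


# Placeholder values; substitute your own endpoint and project ID.
url = build_segment_url("nlp.example.com", "my-project-id")
```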

Request

Table 2 describes the request parameters.

Table 2 Request parameters

| Parameter | Type | Mandatory | Description |
|---|---|---|---|
| text | String | Yes | Text to be segmented. The text must be UTF-8 encoded and contain 1 to 512 characters. |
| pos_switch | Integer | No | Whether to enable part-of-speech (POS) tagging. The options are 1 (yes) and 0 (no). The default value is 0. |
| lang | String | No | Language of the text. Currently, Chinese (zh) and English (en) are supported. The default value is zh. |
| criterion | String | No | Word segmentation criterion. Currently, the Peking University standard (PKU) and the Chinese Penn Treebank (CTB) are supported. The default value is PKU. For English text, the Penn Treebank criterion is used by default, and you do not need to configure this parameter. |
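The constraints in Table 2 can be checked on the client before a request is sent. The following is a sketch under stated assumptions: the function name build_segment_body is hypothetical, and Python's len() (which counts Unicode code points) is assumed to match how the service counts characters.

```python
def build_segment_body(text, pos_switch=0, lang="zh", criterion="PKU"):
    """Build the JSON body for a word segmentation request."""
    # Assumption: the 1-512 limit is counted in Unicode code points.
    if not 1 <= len(text) <= 512:
        raise ValueError("text must contain 1 to 512 characters")
    if pos_switch not in (0, 1):
        raise ValueError("pos_switch must be 0 or 1")
    if lang not in ("zh", "en"):
        raise ValueError("lang must be 'zh' or 'en'")

    body = {"text": text, "pos_switch": pos_switch, "lang": lang}
    # criterion is only sent for Chinese text; English text uses the
    # Penn Treebank criterion by default, so the field is left out.
    if lang == "zh":
        if criterion not in ("PKU", "CTB"):
            raise ValueError("criterion must be 'PKU' or 'CTB'")
        body["criterion"] = criterion
    return body


body = build_segment_body("Text to segment", pos_switch=1)
english_body = build_segment_body("Text to segment", lang="en")
```

Rejecting invalid input locally avoids a billable call that would only return an error.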

Response

Table 3 describes the response parameters.

Table 3 Response parameters

| Parameter | Type | Description |
|---|---|---|
| words | Array of words | Word segmentation result. For details, see Table 4. |
| error_code | String | Error code returned when the API call fails. For details, see Error Code. This parameter is not included when the API call succeeds. |
| error_msg | String | Error message returned when the API call fails. This parameter is not included when the API call succeeds. |

Table 4 Word field data structure

| Parameter | Type | Description |
|---|---|---|
| content | String | Word text. |
| pos | String | Part of speech (POS) of the word. For details, see Table 5, Table 6, and Table 7. |
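A response body can be handled along the following lines. This is only a sketch: SegmentError and parse_segment_response are hypothetical names, and reading pos with .get() reflects the assumption that the field may be absent when POS tagging is disabled.

```python
import json


class SegmentError(Exception):
    """Raised when the response carries error_code and error_msg."""

    def __init__(self, code, message):
        super().__init__(f"{code}: {message}")
        self.code = code
        self.message = message


def parse_segment_response(raw):
    """Return a list of (content, pos) pairs, or raise SegmentError."""
    data = json.loads(raw)
    # error_code is only present when the call failed.
    if "error_code" in data:
        raise SegmentError(data["error_code"], data.get("error_msg", ""))
    # Assumption: pos may be missing, so fall back to None.
    return [(w["content"], w.get("pos")) for w in data.get("words", [])]


pairs = parse_segment_response(
    '{"words": [{"content": "word-1", "pos": "t"}, {"content": "word-2", "pos": "n"}]}'
)
```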

Table 5 Part of speech (POS) description (PKU)

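| Class-1 POS | Class-2 POS | Class-3 POS |
|---|---|---|
| n: Noun | nr: Name of a person | nr1: Chinese surname; nr2: Chinese given name; nrj: Japanese name; nrf: Transliterated name |
| n: Noun | ns: Place name | nsf: Transliterated place name |
| n: Noun | nt: Organization or group name | - |
| n: Noun | nz: Other proper noun | - |
| n: Noun | nl: Nominal locution | - |
| n: Noun | ng: Nominal morpheme | - |
| t: Time word | tg: Time morpheme | - |
| s: Locative word | - | - |
| f: Positional word | - | - |
| v: Verb | vd: Adverbial form of a verb | - |
| v: Verb | vn: Gerund | - |
| v: Verb | vshi: Copula verb | - |
| v: Verb | vyou: Verb indicating "has/have" | - |
| v: Verb | vf: Directional verb | - |
| v: Verb | vx: Formal verb | - |
| v: Verb | vi: Intransitive verb | - |
| v: Verb | vl: Verbal locution | - |
| v: Verb | vg: Verbal morpheme | - |
| a: Adjective | ad: Adverbial adjective | - |
| a: Adjective | an: Nominal adjective | - |
| a: Adjective | ag: Adjective morpheme | - |
| a: Adjective | al: Adjective locution | - |
| b: Distinguishing word | bl: Distinguishing locution | - |
| z: Status word | - | - |
| r: Pronoun | rr: Personal pronoun | - |
| r: Pronoun | rz: Demonstrative pronoun | rzt: Demonstrative pronoun for time; rzs: Demonstrative pronoun for location; rzv: Demonstrative pronoun for predicate |
| r: Pronoun | ry: Interrogative pronoun | ryt: Interrogative pronoun for time; rys: Interrogative pronoun for location; ryv: Interrogative pronoun for predicate |
| r: Pronoun | rg: Pronominal morpheme | - |
| m: Numeral | mq: Number word | - |
| m: Numeral | mg: A, B, C, D, E, F, G, H, N, and G | - |
| q: Classifier | qv: Verbal classifier | - |
| q: Classifier | qt: Time classifier | - |
| d: Adverb | - | - |
| p: Preposition | pba: Preposition ba | - |
| p: Preposition | pbei: Preposition bei | - |
| c: Conjunction | cc: Coordinating conjunction | - |
| u: Particle | uzhe: Particle | - |
| u: Particle | ule: Particle | - |
| u: Particle | uguo: Particle | - |
| u: Particle | ude1: Particle | - |
| u: Particle | ude2: Particle | - |
| u: Particle | ude3: Particle | - |
| u: Particle | usuo: Particle | - |
| u: Particle | udeng: Particle | - |
| u: Particle | uyy: Particle | - |
| u: Particle | udh: Particle | - |
| u: Particle | uls: Particle | - |
| u: Particle | uzhi: Particle | - |
| u: Particle | ulian: Particle | - |
| e: Interjection | - | - |
| y: Discourse word | - | - |
| o: Onomatopoeia | - | - |
| h: Prefix | - | - |
| k: Suffix | - | - |
| x: Character string | xe: Email character string | - |
| x: Character string | xs: Weibo session separator | - |
| x: Character string | xm: Emoticon | - |
| x: Character string | xu: Website URL | - |
| w: Punctuation | wkz: Chinese left brackets | - |
| w: Punctuation | wky: Chinese right brackets | - |
| w: Punctuation | wyz: Chinese left quotation marks | - |
| w: Punctuation | wyy: Chinese right quotation marks | - |
| w: Punctuation | wj: Chinese full stop | - |
| w: Punctuation | ww: Question marks | - |
| w: Punctuation | wt: Exclamation marks | - |
| w: Punctuation | wd: Commas | - |
| w: Punctuation | wf: Semicolons | - |
| w: Punctuation | wn: Enumeration comma | - |
| w: Punctuation | wm: Colons | - |
| w: Punctuation | ws: Ellipsis | - |
| w: Punctuation | wp: Dashes | - |
| w: Punctuation | wb: Percent and per mille signs | - |
| w: Punctuation | wh: Unit symbol | - |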
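In Table 5, every class-2 and class-3 tag starts with the letter of its class-1 tag (for example, nr1 belongs to nr, which belongs to n). Based on that observation, fine-grained PKU tags can be collapsed to their class-1 tag with a sketch like the one below. The names pku_class1 and count_by_class1 are hypothetical helpers, not part of the API.

```python
from collections import Counter

# The 22 class-1 tags listed in Table 5.
PKU_CLASS1_TAGS = frozenset("ntsfvabzrmqdpcueyohkxw")


def pku_class1(tag):
    """Return the class-1 tag for a PKU tag, or None if it is not recognized."""
    if tag and tag[0] in PKU_CLASS1_TAGS:
        return tag[0]
    return None


def count_by_class1(words):
    """Count the entries of a words array by their class-1 PKU tag."""
    counts = Counter()
    for word in words:
        cls = pku_class1(word.get("pos", ""))
        if cls is not None:
            counts[cls] += 1
    return counts


counts = count_by_class1([
    {"content": "word-1", "pos": "t"},
    {"content": "word-2", "pos": "n"},
    {"content": "word-3", "pos": "nr1"},
])
```

This only applies to results produced with the PKU criterion; CTB and Penn Treebank tags use different tag sets.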

Table 6 POS description (CTB)

| POS | Description | Example |
|---|---|---|
| AD | Adverb | word-1, word-2, word-3 |
| AS | Aspect particle | word-4, word-5, word-6 |
| BA | "ba" structure | word-7 |
| CC | Coordinating conjunction | word-8, word-9 |
| CD | Cardinal number | One, two, three |
| CS | Subordinating conjunction | Although, if, when |
| DEC | Complementizer or nominalizer de | word-10, word-11 |
| DEG | Associative or possessive de | word-12, word-13 |
| DER | Complement de | de |
| DEV | Adverbial di | di |
| DT | Determiner | word-14, word-15, word-16 |
| ETC | Word meaning "and so on" | word-17, word-18 |
| FW | Foreign word | A E B |
| IJ | Interjection | word-18, word-19 |
| JJ | Noun modifier | Big, new, small |
| LB | Long bei structure | word-20, word-21, word-22 |
| LC | Positional word | Middle, upper |
| M | Classifier | Unit, year, dollar |
| MSP | Particle | Particle-1, particle-2, particle-3 |
| NN | Noun | Economy, enterprise, person |
| NR | Proper noun | China, Zhejiang |
| NT | Time noun | Present, last year |
| OD | Ordinal number | First, second, top |
| ON | Onomatopoeia | O |
| P | Preposition | Preposition-1, preposition-2, preposition-3 |
| PN | Pronoun | He, I, myself |
| PU | Punctuation | Chinese comma, Chinese full stop |
| SB | Short bei structure | word-23, word-24 |
| SP | Sentence-final particle | Particle-1, particle-2, particle-3 |
| VA | Predicative adjective | Big, many, good |
| VC | Linking verb | Verb-1, verb-2, verb-3 |
| VE | Verb indicating "has/have" | Verb-4, verb-5, verb-6 |
| VV | Verb | Verb-7, verb-8, verb-9 |

Table 7 POS description (Penn TreeBank)

| POS | Description | Example |
|---|---|---|
| CC | Coordinating conjunction | and, but, or |
| CD | Cardinal number | one, two |
| DT | Determiner | a, the |
| EX | Existential "there" | there |
| FW | Foreign word | mea, culpa |
| IN | Preposition or subordinating conjunction | of, in, by |
| JJ | Adjective | yellow |
| JJR | Adjective, comparative | bigger |
| JJS | Adjective, superlative | wildest |
| LS | List item marker | 1, 2, One |
| MD | Modal verb | can, could, might |
| NN | Noun, singular or mass | llama |
| NNS | Noun, plural | llamas |
| NNP | Proper noun, singular | IBM |
| NNPS | Proper noun, plural | Carolinas |
| PDT | Predeterminer | all, both |
| POS | Possessive ending | 's |
| PRP | Personal pronoun | I, me, you |
| PRP$ | Possessive pronoun | my, your, yours |
| RB | Adverb | quickly |
| RBR | Adverb, comparative | faster |
| RBS | Adverb, superlative | fastest |
| RP | Particle | up, off |
| SYM | Symbol (mathematical or scientific) | +, %, & |
| TO | to | to |
| UH | Interjection | ah, oops |
| VB | Verb, base form | eat |
| VBD | Verb, past tense | ate |
| VBG | Verb, gerund or present participle | eating |
| VBN | Verb, past participle | eaten |
| VBP | Verb, non-third-person singular present | eat |
| VBZ | Verb, third-person singular present | eats |
| WDT | wh-determiner | which, that |
| WP | wh-pronoun | what, who |
| WP$ | Possessive wh-pronoun | whose |
| WRB | wh-adverb | how, where |
| PU | Punctuation | , . : |
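For English results, related tags in Table 7 share a two-letter prefix (NN for nouns, VB for verbs, JJ for adjectives, RB for adverbs). One possible way to group them is sketched below. The family names and the helper penn_family are illustrative choices, not part of the API, and the mapping is not meant for CTB tags.

```python
# Two-letter prefixes shared by related Penn Treebank tags in Table 7.
PENN_FAMILIES = {
    "NN": "noun",       # NN, NNS, NNP, NNPS
    "VB": "verb",       # VB, VBD, VBG, VBN, VBP, VBZ
    "JJ": "adjective",  # JJ, JJR, JJS
    "RB": "adverb",     # RB, RBR, RBS
}


def penn_family(tag):
    """Return a coarse word family for a Penn Treebank tag, or None."""
    return PENN_FAMILIES.get(tag[:2])


families = [penn_family(tag) for tag in ("NNS", "VBZ", "JJR", "RBS", "RP", "PRP$")]
```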

Example

  • Example request
    POST https://{endpoint}/v1/{project_id}/nlp-fundamental/segment
    
    Request Header:
        Content-Type: application/json
        X-Auth-Token: MIINRwYJKoZIhvcNAQcCoIINODCCDTQCAQExDTALBglghkgBZQMEAgEwgguVBgkqhkiG...
    
    Request Body:
        {
            "text":"Text to segment",
            "pos_switch":1,
            "lang":"zh",
            "criterion":"PKU"
        }
  • Example response
    • Successful response example
      {
          "words": [
              {
                  "content": "word-1",
                  "pos": "t"
              },
              {
                  "content": "word-2",
                  "pos": "n"
              },
              {
                  "content": "word-3",
                  "pos": "d"
              },
              {
                  "content": "word-4",
                  "pos": "a"
              }
          ]
      }
    • Failed response example
      {
          "error_code": "NLP.0301",
          "error_msg": "The length of text should be in the range of 1-512"
      }
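The example request above could be prepared with the Python standard library roughly as follows. This sketch only builds the request object and does not send it. The function name make_segment_request and the sample endpoint, project ID, and token are placeholders, not values from this reference.

```python
import json
import urllib.request


def make_segment_request(endpoint, project_id, token, body):
    """Build (but do not send) a POST request for the segmentation API."""
    url = f"https://{endpoint}/v1/{project_id}/nlp-fundamental/segment"
    # The body is sent as UTF-8 encoded JSON; ensure_ascii=False keeps
    # non-ASCII text such as Chinese readable in the payload.
    data = json.dumps(body, ensure_ascii=False).encode("utf-8")
    headers = {
        "Content-Type": "application/json",
        "X-Auth-Token": token,
    }
    return urllib.request.Request(url, data=data, headers=headers, method="POST")


request = make_segment_request(
    "nlp.example.com",   # placeholder endpoint
    "my-project-id",     # placeholder project ID
    "<your-token>",      # placeholder token
    {"text": "Text to segment", "pos_switch": 1, "lang": "zh", "criterion": "PKU"},
)
```

The request can then be sent with urllib.request.urlopen(request). If the service answers with a non-2xx HTTP status, urlopen raises urllib.error.HTTPError, and the JSON body containing error_code and error_msg can be read from that exception.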

Status Code

For details about status codes, see Status Code.

Error Code

For details about error codes, see Error Code.