Chinese Word Segmentation Evaluation Toolkit

By Joy (joy@cs.cmu.edu). Jan 2004


Introduction

Unlike most Western languages, written Chinese has no separators (such as spaces) between words. If English were written this way, you would read a sentence like this: itisveryimportanttohaveagoodwordsegmenter. Native speakers usually have no trouble reading such a sentence, but most language processing technologies cannot handle it. Language technology applications such as Machine Translation and Information Retrieval use words as their basic units, so a Chinese sentence must first be segmented into a sequence of words. It is therefore very important to have a good word segmenter.

Word segmentation is a non-trivial task, and a "good" segmenter is hard to build. It is almost impossible to segment every sentence perfectly; even humans have trouble segmenting some ambiguous sentences. There are many word segmentation approaches: frequency-based, rule-based and maximum-entropy-based, to name a few. To find the "best" segmenter for a given NLP application, we need to evaluate the performance of each segmenter.


Evaluation method and the data set

The evaluation method used here is the Edit Distance of the Word Separator (EDWS).

Given a Chinese sentence without segmentation, we can represent it as a sequence of characters: C1C2....CiCi+1...Cn
A segmenter turns this sentence into a sequence of words by inserting separation marks S between characters. For example, a segmented sentence may look like: C1C2SC3C4C5SC6C7C8C9SC10S....CiCi+1...SCn. Two segmentations may have separation marks inserted at different positions in the sentence. The Edit Distance of the Word Separator (EDWS) thus measures how many edit operations (insertion, deletion and substitution) are needed to turn one segmentation into the standard segmentation (reference).

Example:

    Standard (reference):  C1C2SC3C4C5SC6C7 C8C9SC10SC11C12C13SC14
    From one segmenter:    C1C2SC3C4C5SC6C7SC8C9SC10 C11C12C13SC14

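In this example there are substitutions after C2, C5, C9 and C13 (a separator at the same position in both segmentations), an insertion after C10 (a separator that is missing from the segmenter's output) and a deletion after C7 (an extra separator in the segmenter's output).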
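The counting in the example above can be sketched in a few lines of Python. This is only an illustration, not the evaluation script itself: the function names are made up here, the separator is assumed to be a plain space, the input is assumed to be already decoded text rather than raw GB bytes, and the end of the sentence is assumed not to count as a separator.

```python
def separator_positions(segmented, sep=" "):
    """Return the set of character offsets after which a separator occurs."""
    positions = set()
    n_chars = 0
    for token in segmented.split(sep):
        if not token:
            continue  # skip empty tokens caused by repeated separators
        n_chars += len(token)
        positions.add(n_chars)
    positions.discard(n_chars)  # assumption: sentence end is not a separator
    return positions


def edws_counts(reference, hypothesis, sep=" "):
    """Count (sub, ins, del) needed to turn hypothesis into reference."""
    if reference.replace(sep, "") != hypothesis.replace(sep, ""):
        raise ValueError("the two segmentations contain different characters")
    ref = separator_positions(reference, sep)
    hyp = separator_positions(hypothesis, sep)
    sub = len(ref & hyp)   # separator present in both
    ins = len(ref - hyp)   # separator missing from the hypothesis
    dele = len(hyp - ref)  # extra separator in the hypothesis
    return sub, ins, dele


# The letters a..n stand in for the characters C1..C14 of the example.
reference = "ab cde fghi j klm n"
hypothesis = "ab cde fg hi jklm n"
print(edws_counts(reference, hypothesis))  # (4, 1, 1)
```

Using sets of positions works here because the two segmentations share the same characters, so the edit distance reduces to comparing where the separators fall.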

We can define the segmentation precision, recall and harmonic mean F as:

Prec = (number of sub) / (number of separators in Hyp)
Recall = (number of sub) / (number of separators in Ref)
F = 2 * Prec * Recall / (Prec + Recall)
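As a worked example, the three formulas can be written as a small Python function. The function name is hypothetical, and the sketch assumes that the number of separators in Hyp equals Sub + Del and the number of separators in Ref equals Sub + Ins, which is consistent with the example above and with the lrsegmenter row of the results table (Sub=22173, Ins=880, Del=3259).

```python
def precision_recall_f(sub, ins, dele):
    """Compute precision, recall and F from the three edit counts."""
    hyp_separators = sub + dele  # assumption: separators in Hyp = Sub + Del
    ref_separators = sub + ins   # assumption: separators in Ref = Sub + Ins
    precision = sub / hyp_separators if hyp_separators else 0.0
    recall = sub / ref_separators if ref_separators else 0.0
    total = precision + recall
    f = 2 * precision * recall / total if total else 0.0
    return precision, recall, f


# Counts taken from the lrsegmenter row of the results table.
p, r, f = precision_recall_f(22173, 880, 3259)
print(f"{p:.2%} {r:.2%} {f:.2f}")  # 87.19% 96.18% 0.91
```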

Depending on the application, one can derive other metrics from the numbers of substitutions, insertions and deletions. For example, if over-segmentation is more problematic, a larger penalty can be assigned to deletions.
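One possible form of such a derived metric is a weighted error count. The sketch below is purely illustrative: the function name and the weights are assumptions and are not part of the toolkit. It compares two systems from the results table with equal weights and with deletions penalized twice as much.

```python
def weighted_error(ins, dele, w_ins=1.0, w_del=1.0):
    """Weighted sum of insertions and deletions (hypothetical metric)."""
    return w_ins * ins + w_del * dele


# (Ins, Del) counts taken from the results table.
systems = {"lrsegmenter": (880, 3259), "GBMMseg": (2178, 1410)}

for name, (ins, dele) in systems.items():
    equal = weighted_error(ins, dele)
    del_heavy = weighted_error(ins, dele, w_del=2.0)  # punish over-segmentation
    print(name, equal, del_heavy)
```

With equal weights the two systems score 4139 and 3588; doubling the deletion weight widens the gap to 7398 versus 4998, because lrsegmenter makes more deletion errors.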

The data set used here was extracted from the Chinese Treebank. Out of 1715 sentences with lengths between 10 and 50 words, we randomly selected about half as the development set (845 sentences) and kept the rest as the testing set (870 sentences). You can tune your segmenter (if it is tunable) on the dev set. The test set should not be touched for tuning purposes.

To make the reference set more robust, we used three word segmenters (lrsegmenter, GBMMseg and CTseg) to segment the dev/tst data and create "artificial" references. Together with the human-segmented treebank data, this gives 4 "reliable" segmentations to use as references. The intersection of the 4 segmentations, i.e. inserting a separation mark between two characters only when all 4 segmentations agree, is used to measure the optimal Recall. The union of the 4 segmentations, i.e. inserting a separation mark when any of the 4 segmentations has it, is used to measure the optimal Precision.
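The intersection and union references described above can be sketched as set operations on separator positions. This is a minimal illustration, not the code used to build the distributed files: the function names are hypothetical, the separator is assumed to be a space, and all segmentations of a sentence are assumed to contain the same characters.

```python
def separator_positions(segmented, sep=" "):
    """Return the set of character offsets after which a separator occurs."""
    positions = set()
    n_chars = 0
    for token in segmented.split(sep):
        if not token:
            continue
        n_chars += len(token)
        positions.add(n_chars)
    positions.discard(n_chars)  # the sentence end is not a separator
    return positions


def resegment(chars, positions, sep=" "):
    """Rebuild a segmented sentence from characters and separator offsets."""
    out = []
    for i, ch in enumerate(chars, start=1):
        out.append(ch)
        if i in positions:
            out.append(sep)
    return "".join(out)


def combine_references(segmentations, sep=" "):
    """Return the (intersection, union) segmentations of one sentence."""
    chars = segmentations[0].replace(sep, "")
    position_sets = [separator_positions(s, sep) for s in segmentations]
    intersect = set.intersection(*position_sets)  # all references agree
    union = set.union(*position_sets)             # any reference has it
    return resegment(chars, intersect, sep), resegment(chars, union, sep)


# Four toy segmentations of the same six characters.
refs = ["ab cd ef", "ab cdef", "ab cd e f", "ab cd ef"]
print(combine_references(refs))  # ('ab cdef', 'ab cd e f')
```

The intersection keeps few separators, so a hypothesis is rarely penalized for missing one (optimal Recall); the union keeps many, so a hypothesis is rarely penalized for an extra one (optimal Precision).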

| | Unsegmented Text | Single Reference (segmented by human) | Multiple References |
|---|---|---|---|
| Development Data Set (encoded in GB) | dev.gb | dev-ref.gb | dev.intersect.gb (for optimal recall), dev.union.gb (for optimal precision) |
| Testing Data Set (encoded in GB) | tst.gb | Submit your result here | |


Evaluation scripts

Download the evaluation script (perl5) here

To run the evaluation: 

Step 1: Download the unsegmented data (dev.gb or tst.gb).
Step 2: Segment it with your segmenter, making sure that the segmented data has the same number of lines as the unsegmented data. Suppose the segmented data is called hyp.gb.
Step 3: Run the following:
        perl5 segmentationPrec.pl dev-ref.gb hyp.gb
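
Before running the script, it can help to confirm the line-count requirement from Step 2. The following Python sketch is an optional helper, not part of the toolkit; the function names are hypothetical. It reads the files in binary mode so that it does not depend on which GB encoding variant is used.

```python
def count_lines(path):
    """Count the lines of a file without decoding it."""
    with open(path, "rb") as handle:
        return sum(1 for _ in handle)


def same_line_count(unsegmented_path, hypothesis_path):
    """Return True when both files have the same number of lines."""
    return count_lines(unsegmented_path) == count_lines(hypothesis_path)
```

For example, `same_line_count("dev.gb", "hyp.gb")` should return True before hyp.gb is passed to segmentationPrec.pl.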

The result looks like this:

    Total: Sub=21341
    Ins=3142
    Del=807
    Precision: 84.39%


Submit segmented testing data for evaluation

Step 1: Download the testing data (tst.gb).
Step 2: Segment it with your segmenter. Suppose the segmented file is called hyp.gb.
Step 3: Upload your results and fill in the information below to publish them.

The submission form asks for your name, your email, a system description and the hypothesis file (segmentation results).


Word segmenters and their performance (on test set)

Sub, Ins, Del, Precision, Recall and F are measured against the single reference (treebank). The last two columns are measured against the 4 references.

| Segmenter | Developer | Contact | Sub | Ins | Del | Precision (%) | Recall (%) | F | Precision (%), 4 references | Recall (%), 4 references |
|---|---|---|---|---|---|---|---|---|---|---|
| lrsegmenter | Joy | joy@cs.cmu.edu | 22173 | 880 | 3259 | 87.19 | 96.18 | 0.91 | | |
| GBMMseg | Bing Zhao | bzhao@cs.cmu.edu | 20875 | 2178 | 1410 | 93.67 | 90.55 | 0.92 | | |
| CTSeg | Erik Peterson | eepeter+@cs.cmu.edu | 20688 | 2365 | 1523 | 93.14 | 89.74 | 0.91 | | |
| LDC | Zhibiao Wu | | 22077 | 976 | 3326 | 86.91 | 95.77 | 0.91 | 77.88~99.59 | 93.46~99.77 |
| SegTag | Mandel Shi | mandel@xmu.edu.cn | 22187 | 866 | 593 | 97.40 | 96.24 | 0.97 | 86.74~99.69 | 83.90~99.65 |
| DataparkSearch segmenter | Maxim Zakharov | maxime@sochi.net.ru | 22077 | 976 | 3342 | 86.85 | 95.77 | 0.91 | 77.83~99.56 | 93.49~99.56 |
| DataparkSearch segmenter v4.34 submission 1 | Maxim Zakharov | maxime@sochi.net.ru | 23042 | 11 | 17220 | 57.23 | 99.95 | 0.73 | 49.28~67.16 | 99.89~100.00 |
| DataparkSearch segmenter v4.34 submission 2 | Maxim Zakharov | maxime@sochi.net.ru | 22188 | 865 | 3246 | 87.24 | 96.25 | 0.92 | 77.94~99.74 | 93.72~99.97 |
| S-MSRSeg | Jianfeng Gao | jfgao@microsoft.com | 21932 | 1121 | 3903 | 84.89 | 95.14 | 0.90 | 76.56~95.74 | 91.38~99.75 |
| POSEG | Cheongjae Lee | lcj80@postech.ac.kr | 21419 | 1634 | 470 | 97.85 | 92.91 | 0.95 | 87.46~99.41 | 80.39~96.55 |