By Joy (joy@cs.cmu.edu). Jan 2004
Unlike most Western languages, written Chinese has no separators (e.g. spaces) between words. If this were the case in English, you would read a sentence like this: itisveryimportanttohaveagoodwordsegmenter. Native speakers usually have no trouble reading such a sentence without any separators, but most language processing technologies cannot handle it. Language technology applications such as Machine Translation and Information Retrieval use words as their basic units. Segmenting a Chinese sentence into a sequence of words is therefore a necessary first step, so it is very important to have a good word segmenter.
Word segmentation is a non-trivial task, and it is hard to build a "good" segmenter. It is almost impossible to segment every sentence perfectly; in fact, even humans have trouble segmenting some ambiguous sentences. There are quite a few word segmentation approaches: frequency-based, rule-based and maximum-entropy-based, to name just a few. To find the "best" segmenter for a given NLP application, we need to evaluate segmenter performance.
The evaluation method used here is the Edit Distance of the Word Separator (EDWS).
Given a Chinese sentence without segmentation, we can represent
it as a sequence of characters: C1C2....CiCi+1...Cn.
A segmenter segments this sentence into a sequence of words by inserting
separation marks S between characters. For example, a segmented
sentence may look like: C1C2SC3C4C5SC6C7C8C9SC10S....CiCi+1...SCn.
Two segmentations may have separation marks inserted at different positions in
the sentence. The Edit Distance of the Word Separator (EDWS)
thus measures how many edit operations (insertion, deletion and substitution)
are needed to transform a segmenter's output into the standard segmentation (the reference).
Example:
Standard (reference): C1C2SC3C4C5SC6C7C8C9SC10SC11C12C13SC14
From one segmenter:   C1C2SC3C4C5SC6C7SC8C9SC10C11C12C13SC14
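In this example the separators after C2, C5, C9 and C13 appear in both segmentations and are counted as substitutions; a separator has to be inserted after C10 (an insertion); and the separator after C7 has to be deleted (a deletion).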
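The counting in this example can be sketched in a few lines of Python. This is only an illustration, not the evaluation script itself: it assumes that words are separated by whitespace, uses the Latin letters a to n as stand-ins for C1 to C14, and compares the sets of separator positions instead of running a general edit-distance algorithm, which reproduces the counts listed above. The function names are hypothetical.

```python
def separator_positions(segmented):
    """Return the set of character offsets that are followed by a separator."""
    positions = set()
    offset = 0
    for word in segmented.split():
        offset += len(word)
        positions.add(offset)
    positions.discard(offset)  # the end of the sentence is not a separator
    return positions


def count_edits(reference, hypothesis):
    """Count (sub, ins, del) as the terms are used in the example above."""
    ref = separator_positions(reference)
    hyp = separator_positions(hypothesis)
    sub = len(ref & hyp)   # separators present in both segmentations
    ins = len(ref - hyp)   # separators that must be inserted into the hypothesis
    dele = len(hyp - ref)  # separators that must be deleted from the hypothesis
    return sub, ins, dele


# a..n stand in for C1..C14 (an assumption made for readability).
reference = "ab cde fghi j klm n"   # C1C2 S C3C4C5 S C6C7C8C9 S C10 S C11C12C13 S C14
hypothesis = "ab cde fg hi jklm n"  # C1C2 S C3C4C5 S C6C7 S C8C9 S C10C11C12C13 S C14
print(count_edits(reference, hypothesis))  # (4, 1, 1)
```

The set comparison only makes sense when both segmentations contain exactly the same characters in the same order.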
We can define the segmentation precision, recall and their harmonic mean F as follows, where Hyp is the segmenter output and Ref is the reference:
Prec = (number of sub) / (number of separators in Hyp)
Recall = (number of sub) / (number of separators in Ref)
F = 2 * Prec * Recall / (Prec + Recall)
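As a rough sketch, the three formulas above can be computed directly from the Sub, Ins and Del counts. The code assumes that the number of separators in Hyp equals Sub + Del and the number in Ref equals Sub + Ins, which follows from how the operations are counted in the example; the function name is hypothetical. The counts used for the check are those of the lrsegmenter row in the results table.

```python
def precision_recall_f(sub, ins, dele):
    """Return (precision, recall, F) as fractions between 0 and 1."""
    hyp_separators = sub + dele  # assumption: Hyp separators = matched + to-be-deleted
    ref_separators = sub + ins   # assumption: Ref separators = matched + to-be-inserted
    precision = sub / hyp_separators if hyp_separators else 0.0
    recall = sub / ref_separators if ref_separators else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f = 2 * precision * recall / (precision + recall)
    return precision, recall, f


# Counts reported for lrsegmenter on the test set (single reference).
p, r, f = precision_recall_f(22173, 880, 3259)
print(f"Precision: {p * 100:.2f}%  Recall: {r * 100:.2f}%  F: {f:.2f}")
# Precision: 87.19%  Recall: 96.18%  F: 0.91
```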
Depending on the application, one can derive other metrics from the number of substitutions, insertions and deletions. For example, if over-segmentation is more problematic, a larger penalty can be assigned to deletions.
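One possible way to express such an application-specific penalty is sketched below. The metric, its name and its weights are purely hypothetical and are not part of the evaluation script; the sketch only shows how raising the deletion weight makes over-segmentation more costly. The value is a penalty, not a rate, so it can exceed 1 when the weights are larger than 1.

```python
def weighted_penalty(sub, ins, dele, w_ins=1.0, w_del=1.0):
    """Hypothetical metric: weighted insertions and deletions per separator decision."""
    total = sub + ins + dele
    if total == 0:
        return 0.0
    return (w_ins * ins + w_del * dele) / total


# Same counts, but deletions (over-segmentation) cost twice as much in the second call.
print(weighted_penalty(8, 1, 1))             # 0.2
print(weighted_penalty(8, 1, 1, w_del=2.0))  # 0.3
```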
The data set used here is extracted from the Chinese Treebank. Out of 1715 sentences with a length between 10 and 50 words, we randomly selected about half as the development set (845 sentences) and kept the rest as the test set (870 sentences). You can tune your segmenter (if it is tunable) on the dev set. The test set should not be touched for tuning purposes.
To make the reference set more robust, we used three word segmenters (lrsegmenter, GBMMseg and CTseg) to segment the dev/tst data and create "artificial" references. Together with the human-segmented treebank data, this gives 4 "reliable" segmentations to use as references. The intersection of the 4 segmentations, i.e. inserting a separation mark between two characters only when all 4 segmentations agree, is used to measure the optimal Recall. The union of the 4 segmentations, i.e. inserting a separation mark when any of the 4 segmentations has it, is used to measure the optimal Precision.
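The intersection and union references can be sketched as set operations on separator positions. This is an illustration under assumptions, not the procedure actually used to build the released files: it assumes whitespace-separated words, identical characters in every reference, and hypothetical function names, and it uses short Latin-letter strings in place of Chinese text.

```python
def separator_positions(segmented):
    """Return the set of character offsets that are followed by a separator."""
    positions = set()
    offset = 0
    for word in segmented.split():
        offset += len(word)
        positions.add(offset)
    positions.discard(offset)  # the end of the sentence is not a separator
    return positions


def render(chars, positions):
    """Rebuild a segmented string from raw characters and separator offsets."""
    pieces = []
    for index, char in enumerate(chars, start=1):
        pieces.append(char)
        if index in positions:
            pieces.append(" ")
    return "".join(pieces)


def combine_references(references):
    """Return (intersection, union) segmentations of the same sentence."""
    chars = "".join(references[0].split())
    sets = [separator_positions(ref) for ref in references]
    intersection = set.intersection(*sets)  # a mark only where all references agree
    union = set.union(*sets)                # a mark where any reference has one
    return render(chars, intersection), render(chars, union)


refs = ["ab cd e", "ab cde", "a b cd e"]
print(combine_references(refs))  # ('ab cde', 'a b cd e')
```

The intersection contains the fewest separators, so it is the easiest reference to recall fully; the union contains the most, so a hypothesis separator is most likely to be found in it.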
| | Unsegmented Text | Single Reference (segmented by human) | Multiple References |
| Development Data Set (encoded in GB) | | | dev.intersect.gb (for optimal recall), dev.union.gb (for optimal precision) |
| Testing Data Set (encoded in GB) | | | |
Download the evaluation script (perl5) here
To run the evaluation:
Step 1: Download the unsegmented data (dev.gb or tst.gb).
Step 2: Use your segmenter to segment it. Make sure that the segmented data has the same number of lines as the unsegmented data. Suppose the segmented data is called hyp.gb.
Step 3: Run the following:
perl5 segmentationPrec.pl dev-ref.gb hyp.gb
The result looks like this:
Total: Sub=21341 Ins=3142 Del=807
Precision: 84.39%
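Before running the script it may help to check that the segmented file really lines up with the unsegmented one, as Step 2 requires. The following is a minimal sketch, not part of the evaluation package: the function names are hypothetical, and reading the files with Python's `gb2312` codec is an assumption about what "encoded in GB" means.

```python
def check_alignment(unsegmented_lines, segmented_lines):
    """Return a list of problems; an empty list means the two files line up."""
    problems = []
    if len(unsegmented_lines) != len(segmented_lines):
        problems.append(
            f"line count differs: {len(unsegmented_lines)} vs {len(segmented_lines)}"
        )
        return problems
    pairs = zip(unsegmented_lines, segmented_lines)
    for number, (raw, seg) in enumerate(pairs, start=1):
        # Ignoring whitespace, both lines must contain the same characters.
        if "".join(raw.split()) != "".join(seg.split()):
            problems.append(f"line {number}: characters differ")
    return problems


def read_lines(path, encoding="gb2312"):
    """Read a file into a list of lines (the encoding is an assumption)."""
    with open(path, encoding=encoding) as handle:
        return handle.read().splitlines()


# Example with in-memory lines; for real data use read_lines("dev.gb") etc.
print(check_alignment(["abcde", "fgh"], ["ab cde", "f gh"]))  # []
```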
Step 1: Download the testing data (tst.gb).
Step 2: Use your segmenter to segment it. Suppose the segmented file is called hyp.gb.
Step 3: Upload your results and fill in the information to publish your results.
| Segmenter | Developer | Contact | Sub | Ins | Del | Precision (%), single reference (treebank) | Recall (%), single reference (treebank) | F, single reference (treebank) | Precision (%), 4 references | Recall (%), 4 references |
| lrsegmenter | Joy | joy@cs.cmu.edu | 22173 | 880 | 3259 | 87.19 | 96.18 | 0.91 | | |
| GBMMseg | Bing Zhao | bzhao@cs.cmu.edu | 20875 | 2178 | 1410 | 93.67 | 90.55 | 0.92 | | |
| CTSeg | Erik Peterson | eepeter+@cs.cmu.edu | 20688 | 2365 | 1523 | 93.14 | 89.74 | 0.91 | | |
| LDC | Zhibiao Wu | | 22077 | 976 | 3326 | 86.91 | 95.77 | 0.91 | 77.88~99.59 | 93.46~99.77 |
| SegTag | Mandel Shi | mandel@xmu.edu.cn | 22187 | 866 | 593 | 97.40 | 96.24 | 0.97 | 86.74~99.69 | 83.90~99.65 |
| DataparkSearch segmenter | Maxim Zakharov | maxime@sochi.net.ru | 22077 | 976 | 3342 | 86.85 | 95.77 | 0.91 | 77.83~99.56 | 93.49~99.56 |
| DataparkSearch segmenter v4.34 submission 1 | Maxim Zakharov | maxime@sochi.net.ru | 23042 | 11 | 17220 | 57.23 | 99.95 | 0.73 | 49.28~67.16 | 99.89~100.00 |
| DataparkSearch segmenter v4.34 submission 2 | Maxim Zakharov | maxime@sochi.net.ru | 22188 | 865 | 3246 | 87.24 | 96.25 | 0.92 | 77.94~99.74 | 93.72~99.97 |
| S-MSRSeg | Jianfeng Gao | jfgao@microsoft.com | 21932 | 1121 | 3903 | 84.89 | 95.14 | 0.90 | 76.56~95.74 | 91.38~99.75 |
| POSEG | Cheongjae Lee | lcj80@postech.ac.kr | 21419 | 1634 | 470 | 97.85 | 92.91 | 0.95 | 87.46~99.41 | 80.39~96.55 |