Joy, joy@cs.cmu.edu
(Oct 1, 2003)
Download Scripts (latest version, v11, in accordance with the NIST newest mteval-v11, Oct 25th 2004):
Basically, NIST/Bleu metrics report a score based on a single "sample": the hypothesis itself. Without multiple "experiments" we cannot tell how "confident" we can be in this score. Say system A has a NIST score of 6.98 and system B has a NIST score of 6.92: is A "better" than B? How confident can we be that the new method used in A really "improved" system performance?
One way to measure the confidence or statistical significance of NIST/Bleu scores is to compare systems A and B at the sentence level. For example, we can calculate a sentence-level score for each of system A's 993 sentences, score_a(1~993), and the same for B, score_b(1~993). Applying a paired Student's t-test to score_a and score_b gives a sentence-level statistical significance. The paired t-test focuses on the differences between the paired data and reports how probable the observed mean difference would be if the true mean difference were zero. This comparison is aided by the reduction in variance achieved by taking the differences. It tells us how significantly system A differs from B at the sentence level.
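As a rough illustration of the paired t-test described above, here is a minimal Python sketch. It is not part of the downloadable scripts; the function name `paired_t_statistic` and the five sentence scores are made up for this example. It computes only the t statistic and the degrees of freedom, and the p-value would then be looked up in a t-table or with a statistics package.

```python
import math

def paired_t_statistic(scores_a, scores_b):
    """Return (t, degrees_of_freedom) for a paired t-test on two score lists."""
    if len(scores_a) != len(scores_b):
        raise ValueError("paired samples must have the same length")
    n = len(scores_a)
    if n < 2:
        raise ValueError("need at least two pairs")
    # The test works on the per-sentence differences, not on the raw scores.
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    mean_diff = sum(diffs) / n
    # Sample variance of the differences (n - 1 in the denominator).
    var_diff = sum((d - mean_diff) ** 2 for d in diffs) / (n - 1)
    if var_diff == 0.0:
        raise ValueError("all differences are identical; t is undefined")
    t = mean_diff / math.sqrt(var_diff / n)
    return t, n - 1

# Hypothetical sentence-level scores for five sentences (illustrative only).
score_a = [0.42, 0.31, 0.55, 0.28, 0.47]
score_b = [0.40, 0.27, 0.53, 0.29, 0.41]
t, dof = paired_t_statistic(score_a, score_b)
print(f"t = {t:.3f} with {dof} degrees of freedom")
```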
Yet NIST/Bleu scores calculated at the sentence level are different from the scores calculated at the test-set level. To measure confidence intervals for these test-set-level metrics, for which we have only one sample, we use resampling (bootstrapping) to create a much larger population.
Here is how it works:
Suppose we developed a translation system A and translated the 993 dry-run
test sentences with it, resulting in a NIST score of 7.25. We want to know how
confident we are about the score 7.25.
Now we sample the test sentences with replacement, each together with its
corresponding reference(s). For example, we create a new hypothesis/testing set
like this:
New hyp/testing set No 1.
A new hypothesis: hyp_sent 65, hyp_sent 8, hyp_sent 124,
hyp_sent 98, hyp_sent 65, .....
and new reference1: ref1_sent 65, ref1_sent 8, ref1_sent 124,
ref1_sent 98, ref1_sent 65, .....
and new reference2: ref2_sent 65, ref2_sent 8, ref2_sent 124,
ref2_sent 98, ref2_sent 65, .....
and new reference3: ref3_sent 65, ref3_sent 8, ref3_sent 124,
ref3_sent 98, ref3_sent 65, .....
and new reference4: ref4_sent 65, ref4_sent 8, ref4_sent 124,
ref4_sent 98, ref4_sent 65, .....
Notice that since we sample with replacement, sentence 65 has two occurrences in this new sample.
We repeat this resampling many times, say 1000 times. Now we have New hyp/testing set No 1, New hyp/testing set No 2, ..., New hyp/testing set No 1000. For each hyp/testing set, we can calculate the standard NIST/Bleu score at the test-set level.
So we have a sequence of 1000 NIST/Bleu scores like this: 7.10 7.26 7.24 7.25 7.30 ......... 7.22. The median of this sequence should be the same as, or very close to, the original value (7.25 in this example). If we sort this sequence in ascending order as v1, v2, ..., v998, v999, v1000, then we can say with 95% confidence that system A's NIST score is in the range [v25, v975].
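The resampling and the percentile interval can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code of the scripts: `bootstrap_scores`, `percentile_interval`, `toy_stats` and `toy_corpus_score` are hypothetical names, and the toy metric (matched words divided by hypothesis length over the whole sample) merely stands in for a real test-set-level NIST/Bleu scorer.

```python
import random

def bootstrap_scores(n_sentences, corpus_score, n_resamples=1000, seed=0):
    """Score n_resamples test sets, each drawn with replacement."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_resamples):
        # Draw sentence indices with replacement. One index selects the
        # hypothesis and all of its references, so they stay aligned.
        sample = [rng.randrange(n_sentences) for _ in range(n_sentences)]
        scores.append(corpus_score(sample))
    return scores

def percentile_interval(scores, lower=0.025, upper=0.975):
    """Return (v_low, v_high), e.g. (v25, v975) for 1000 sorted scores."""
    ordered = sorted(scores)
    n = len(ordered)
    # The text counts from v1, Python lists count from 0, hence the "- 1".
    low_index = max(round(lower * n) - 1, 0)
    high_index = max(round(upper * n) - 1, 0)
    return ordered[low_index], ordered[high_index]

# Hypothetical per-sentence (matched_words, hypothesis_length) pairs.
toy_stats = [(3, 5), (4, 6), (2, 4), (5, 5), (1, 3), (6, 8)]

def toy_corpus_score(sample):
    # Toy test-set-level metric: pooled matches over pooled length.
    matched = sum(toy_stats[i][0] for i in sample)
    length = sum(toy_stats[i][1] for i in sample)
    return matched / length

scores = bootstrap_scores(len(toy_stats), toy_corpus_score)
low, high = percentile_interval(scores)
print(f"95% interval: [{low:.4f}, {high:.4f}]")
```

Fixing the seed makes the interval reproducible from run to run, which is handy when comparing settings.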
If you don't like the idea of using the 2.5%~97.5% percentiles to measure the 95% confidence interval, here is another way (it may be more reliable):
Hmm, so ... how about this:
We can assume these 1000 NIST/Bleu scores are actually distributed according to
a normal distribution (or a t-distribution, if the number of bootstrap samples
is not very large). This assumption is pretty accurate (see Figure 1,
normplot(), and Figure 2, the CDF of 5000 Bleu scores).
Figure 1. normplot(5000 Bleu scores)
Figure 2. CDF of 5000 Bleu scores
From these 5000 Bleu scores, we can calculate the mean μ and the standard deviation σ. Under the normal assumption, the 95% confidence interval is [μ-1.96σ, μ+1.96σ].
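A minimal Python sketch of the normal-approximation interval follows. It assumes the bootstrap scores are already available as a list; the name `normal_interval` and the ten example scores are invented for illustration. It uses z = 1.96, the two-sided 95% quantile of the standard normal distribution, which a t-distribution approaches when the number of bootstrap scores is large.

```python
import statistics

def normal_interval(scores, z=1.96):
    """Return (low, high) = (mean - z*stdev, mean + z*stdev)."""
    mean = statistics.mean(scores)
    # Sample standard deviation of the bootstrap scores themselves.
    stdev = statistics.stdev(scores)
    return mean - z * stdev, mean + z * stdev

# Hypothetical bootstrap scores (illustrative values only).
example_scores = [7.10, 7.26, 7.24, 7.25, 7.30, 7.19, 7.21, 7.28, 7.17, 7.22]
low, high = normal_interval(example_scores)
mean = statistics.mean(example_scores)
rsd = statistics.stdev(example_scores) / abs(mean)
print(f"mean = {mean:.4f}, interval = [{low:.4f}, {high:.4f}], RSD = {rsd:.2%}")
```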
That's cool!
But isn't this computationally too expensive? We all know how long it takes to
calculate Bleu scores: parsing the SGML tags and going through the hypothesis
to find the n-gram matches. NIST? It takes even longer because it needs to
calculate the NIST info-gain for each n-gram in the reference set.
Let's forget about the NIST info-gain for a moment. We know that Bleu/NIST scores are calculated from two parts: precision and length penalty. Both are computed from information accumulated at the sentence level. We need only one pass through the hyp/ref files to collect all the information we need; the resampling can then be done over this information instead of over the hypothesis/reference text.
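To make the idea concrete, here is a hedged Python sketch for Bleu only. It is not the code of generateLog.pl or bootstrapSingle.pl, and the layout of the per-sentence record is an assumption made for this example. It uses the textbook Bleu definition (clipped n-gram precision up to 4-grams, brevity penalty against the closest reference length) on whitespace-tokenized text, so the numbers will not match the NIST or IBM scripts exactly.

```python
import math
import random
from collections import Counter

MAX_N = 4

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def sentence_stats(hyp, refs):
    """One pass over a hypothesis and its references: all that Bleu needs."""
    hyp_tok = hyp.split()
    ref_toks = [r.split() for r in refs]
    matched, total = [], []
    for n in range(1, MAX_N + 1):
        hyp_counts = ngrams(hyp_tok, n)
        # Clip each hypothesis n-gram by its maximum count in any reference.
        max_ref = Counter()
        for ref in ref_toks:
            for gram, count in ngrams(ref, n).items():
                max_ref[gram] = max(max_ref[gram], count)
        matched.append(sum(min(c, max_ref[g]) for g, c in hyp_counts.items()))
        total.append(sum(hyp_counts.values()))
    # Reference length closest to the hypothesis length (ties: the shorter).
    ref_len = min((len(r) for r in ref_toks),
                  key=lambda l: (abs(l - len(hyp_tok)), l))
    return matched, total, len(hyp_tok), ref_len

def bleu_from_stats(stats):
    """Test-set-level Bleu from a list of per-sentence records."""
    matched = [sum(s[0][n] for s in stats) for n in range(MAX_N)]
    total = [sum(s[1][n] for s in stats) for n in range(MAX_N)]
    hyp_len = sum(s[2] for s in stats)
    ref_len = sum(s[3] for s in stats)
    if hyp_len == 0 or min(matched) == 0:
        return 0.0
    log_precision = sum(math.log(m / t) for m, t in zip(matched, total)) / MAX_N
    brevity = 1.0 if hyp_len > ref_len else math.exp(1.0 - ref_len / hyp_len)
    return brevity * math.exp(log_precision)

def resample_bleu(stats, n_resamples=1000, seed=0):
    """Bootstrap over the cached records; the text is never touched again."""
    rng = random.Random(seed)
    n = len(stats)
    return [bleu_from_stats([stats[rng.randrange(n)] for _ in range(n)])
            for _ in range(n_resamples)]

# Hypothetical two-sentence test set with two references per sentence.
hyps = ["the cat sat on the mat", "he reads a book every day"]
refs = [["the cat sat on the mat", "a cat was sitting on the mat"],
        ["he reads a book each day", "he reads one book every day"]]
stats = [sentence_stats(h, r) for h, r in zip(hyps, refs)]
print(f"Bleu on the original set: {bleu_from_stats(stats):.4f}")
print(f"first resampled scores: {resample_bleu(stats, n_resamples=3)}")
```

The expensive n-gram matching happens once per sentence in `sentence_stats`; each of the 1000 resamples then costs only a few additions per sentence.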
Ok, here is how it works:
Now, you have a hypothesis from your best system A. Let's call it a.hyp. This is the translation of test set T, whose source sentences are in an SGML-tagged file src.sgm. We also have the SGML-tagged reference set ref.sgm. If the hypothesis is plain text, first convert it to SGML:
perl5 /afs/cs.cmu.edu/user/joy/Joy-work/sgml4eval/generateSGMLfromText.perl a.hyp sysId src.sgm > hyp.sgm
Step1: collect sentence level information
perl5 /afs/cs.cmu.edu/user/joy/Joy-work/bootstrapMTeval/generateLog.pl [-h] -r ref.sgm -s src.sgm -t hyp.sgm [-m] > a.log
Required arguments:
-r ref.sgm is a file containing the reference translations for
the documents to be evaluated.
-s src.sgm is a file containing the source documents for which
translations are to be evaluated
-t hyp.sgm is a file containing the translations to be evaluated
Optional arguments:
-h prints this help message to STDOUT
-m normalization method:
0 (default) NIST MTeval script
1 Bleu MTeval script
Step2: Resampling over the sentence level information
perl5 /afs/cs.cmu.edu/user/joy/Joy-work/bootstrapMTeval/bootstrapSingle.pl [ResampleTimes] < a.log
Step3: To compare the scores between two systems
From Step1, get a.log for system A and do the same thing for system B, resulting in b.log. Then,
perl5 /afs/cs.cmu.edu/user/joy/Joy-work/bootstrapMTeval/bootstrapCompare.pl a.log b.log [ResampleTimes]
NIST Metric
Median=7.1868 nonPara interval: [7.0649,7.3195]
: median value of NIST scores; [lowerBound at 2.5%, highBound at 97.5%]
Mean =7.1889 t-interval: [7.0636, 7.3142] Var=0.00398726 STDEV=0.0631 RSD=0.88%
: mean of NIST scores; 95% interval assuming t-distribution; variance; standard deviation; Relative Standard Deviation (RSD = standard deviation / |mean|)
and the same for BLEU and M-Bleu scores.
NIST Value:
-----------------
Sys1 Mean=7.1842
Median=7.1845 [7.0644,7.3009]
---
Sys2 Mean=7.4638
Median=7.4633 [7.3369,7.5988]
---
Diff(Sys1-Sys2):Median=-0.2791 [-0.1509,-0.4046]
: median and confidence interval
---
Paired t test for two systems:
Degree of freedom: 999
t=-142.2970
p=0.0000
Confidence of two systems are not equal: 100.000%
Bleu Value:
-----------------
Sys1 Mean=0.1839
Median=0.1838 [0.1762,0.1919]
---
Sys2 Mean=0.2404
Median=0.2402 [0.2318,0.2494]
---
Diff(Sys1-Sys2):Median=-0.0565 [-0.0481,-0.0649]
---
Paired t test for two systems:
Degree of freedom: 999
t=-28.6707
p=0.0000
Confidence of two systems are not equal: 100.000%
Modified Bleu Value:
-----------------
Sys1 Mean=0.2756
Median=0.2755 [0.2694,0.2818]
---
Sys2 Mean=0.3197
Median=0.3196 [0.3124,0.3270]
---
Diff(Sys1-Sys2):Median=-0.0441 [-0.0375,-0.0503]
---
Paired t test for two systems:
Degree of freedom: 999
t=-22.3654
p=0.0000
Confidence of two systems are not equal: 100.000%
-----------------
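The comparison output above can be approximated with the following Python sketch. How bootstrapCompare.pl pairs the resamples is not shown on this page; the sketch assumes that each resample draws one set of sentence indices and applies it to both systems, which fits the reported "Paired t test" with 999 degrees of freedom for 1000 resamples. All names (`paired_bootstrap`, `summarize_difference`, `toy_score`) and the toy per-sentence records are hypothetical.

```python
import math
import random
import statistics

def paired_bootstrap(stats_a, stats_b, corpus_score, n_resamples=1000, seed=0):
    """Score both systems on the same resampled sentence indices."""
    if len(stats_a) != len(stats_b):
        raise ValueError("both systems must cover the same test sentences")
    rng = random.Random(seed)
    n = len(stats_a)
    scores_a, scores_b = [], []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        scores_a.append(corpus_score([stats_a[i] for i in idx]))
        scores_b.append(corpus_score([stats_b[i] for i in idx]))
    return scores_a, scores_b

def summarize_difference(scores_a, scores_b):
    """Median, 95% percentile interval and paired t for Sys1 - Sys2."""
    diffs = sorted(a - b for a, b in zip(scores_a, scores_b))
    m = len(diffs)
    low = diffs[max(round(0.025 * m) - 1, 0)]
    high = diffs[max(round(0.975 * m) - 1, 0)]
    stdev = statistics.stdev(diffs)
    # t is undefined when every resample gives exactly the same difference.
    t = statistics.mean(diffs) / (stdev / math.sqrt(m)) if stdev > 0 else None
    return {"median": statistics.median(diffs), "interval": (low, high),
            "t": t, "dof": m - 1}

def toy_score(stats):
    # Toy test-set-level metric: pooled matches over pooled length.
    return sum(s[0] for s in stats) / sum(s[1] for s in stats)

# Hypothetical per-sentence (matched_words, hypothesis_length) records.
sys1 = [(3, 5), (4, 6), (2, 4), (5, 5), (1, 3), (6, 8)]
sys2 = [(4, 5), (4, 6), (3, 4), (5, 5), (2, 3), (7, 8)]
a_scores, b_scores = paired_bootstrap(sys1, sys2, toy_score)
print(summarize_difference(a_scores, b_scores))
```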
Since Bleu and NIST use different text normalization, the generateLog.pl script has an option -m to select the normalization style.
If the log file is generated using Bleu text normalization, the reported Bleu score and its confidence interval are reliable for Bleu. If you want to report the NIST confidence interval or to compare NIST scores between two systems, it is recommended to use NIST text normalization instead.
To calculate the NIST score, the n-gram info gain has to be calculated first from the reference set. Since we resample the hypothesis and the reference set together, each new sample may have a different reference set, which would result in a slightly different n-gram info gain for each sample. In this approach, we ignore this problem and calculate the n-gram info gain from the original reference set. I do not think this causes a big difference in the NIST score.
Keywords: machine translation, automatic evaluation, confidence intervals, bootstrapping, NIST MTeval, IBM BLEU scores
$Rev: 2376 $
$LastChangedDate: 2007-01-21 01:44:51 -0500 (Sunday, 21 January 2007) $