Using bootstrapping for NIST/Bleu confidence intervals

Joy, joy@cs.cmu.edu
(Oct 1, 2003)



Motivation and background:

Basically, NIST/Bleu metrics report a single score based on one "sample", the hypothesis itself. Without multiple "experiments" we cannot tell how much confidence to place in that score. Say system A has a NIST score of 6.98 and system B has a NIST score of 6.92: is A "better" than B? How confident can we be that the new method used in A really "improved" system performance?

One way to measure confidence or statistical significance for NIST/Bleu is to compare systems A and B at the sentence level. For example, we can calculate a sentence score for each of system A's 993 sentences, score_a(1~993), and do the same for B: score_b(1~993). Applying a paired Student's t-test to score_a and score_b gives a sentence-level statistical significance. The paired t-test focuses on the differences between the paired data and reports how probable the observed mean difference would be if the true mean difference were zero. This comparison is aided by the reduction in variance achieved by taking the differences. It tells us how "significantly" system A differs from B at the sentence level.
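The paired test above can be sketched in a few lines of Python. This is only an illustration using the standard library: the function name paired_t_statistic and the toy scores are made up, and the sketch returns the t statistic and degrees of freedom rather than a p-value (which would need a t-distribution table or a statistics package).

```python
import math
import statistics

def paired_t_statistic(score_a, score_b):
    """Return (t, df) for a paired t-test on two equal-length score lists."""
    if len(score_a) != len(score_b):
        raise ValueError("need one score per sentence for both systems")
    # The test works on the per-sentence differences, not on the raw scores.
    diffs = [a - b for a, b in zip(score_a, score_b)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    t = mean_diff / (sd_diff / math.sqrt(n))
    return t, n - 1

# Toy sentence-level scores (made up for illustration only).
score_a = [0.42, 0.35, 0.51, 0.48, 0.39, 0.44]
score_b = [0.40, 0.33, 0.50, 0.45, 0.40, 0.41]
t, df = paired_t_statistic(score_a, score_b)
print(f"t = {t:.3f} with {df} degrees of freedom")
```

A large |t| relative to the critical value for df degrees of freedom suggests the sentence-level difference is unlikely to be zero.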

Yet NIST/Bleu scores calculated at the sentence level are different from the scores calculated at the test-set level. To measure confidence intervals for these metrics, where the whole test set yields only one score, we use resampling/bootstrapping to create a much larger population of test sets.

Here is how it works:
Suppose we developed a translation system A and translated the 993 dryrun test sentences with A, resulting in a NIST score of 7.25. We want to know how confident we are about the score 7.25.
Now we sample the test sentences with replacement, each together with its corresponding reference(s). For example, we create a new hypothesis/testing set like this:

New hyp/testing set No 1.
A new hypothesis:    hyp_sent 65, hyp_sent 8, hyp_sent 124, hyp_sent 98, hyp_sent 65, .....
and new reference1:  ref1_sent 65, ref1_sent 8, ref1_sent 124, ref1_sent 98, ref1_sent 65, .....
and new reference2:  ref2_sent 65, ref2_sent 8, ref2_sent 124, ref2_sent 98, ref2_sent 65, .....
and new reference3:  ref3_sent 65, ref3_sent 8, ref3_sent 124, ref3_sent 98, ref3_sent 65, .....
and new reference4:  ref4_sent 65, ref4_sent 8, ref4_sent 124, ref4_sent 98, ref4_sent 65, .....

Notice that since we sample with replacement, sentence 65 has two occurrences in this new sample.
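A minimal Python sketch of this resampling step follows. The function name resample_test_set and the toy sentence labels are hypothetical; the point is only that one list of sampled indices is applied to the hypothesis and to every reference, so each hypothesis sentence stays aligned with its references.

```python
import random

def resample_test_set(hyp, refs, rng):
    """Draw len(hyp) sentence indices with replacement and apply them
    to the hypothesis and to every reference translation."""
    n = len(hyp)
    idx = [rng.randrange(n) for _ in range(n)]  # same index may repeat
    new_hyp = [hyp[i] for i in idx]
    new_refs = [[ref[i] for i in idx] for ref in refs]
    return idx, new_hyp, new_refs

# Toy test set: 10 sentences, 4 references (labels only, for illustration).
hyp = [f"hyp_sent {i}" for i in range(1, 11)]
refs = [[f"ref{r}_sent {i}" for i in range(1, 11)] for r in range(1, 5)]

rng = random.Random(0)  # fixed seed so the run is repeatable
idx, new_hyp, new_refs = resample_test_set(hyp, refs, rng)
print(new_hyp[:5])
print(new_refs[0][:5])
```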

We repeat this resampling many times, say 1000 times. Now we have New hyp/testing set No 1, New hyp/testing set No 2, ..., New hyp/testing set No 1000. For each hyp/testing set, we can calculate the standard NIST/Bleu score at the test-set level.

So we have a sequence of 1000 NIST/Bleu scores like this: 7.10 7.26 7.24 7.25 7.30 ......... 7.22. The median value of this sequence should be the same as or very close to the original value (7.25 in this example). If we sort this sequence in ascending order v1, v2, ..., v998, v999, v1000, then we can say with 95% confidence that system A's NIST score is in the range [v25, v975].
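The percentile interval can be read off the sorted scores as sketched below. The helper name percentile_interval is hypothetical, and the rank rounding is one simple choice among several; for 1000 scores it picks v25 and v975 as in the text.

```python
import random

def percentile_interval(scores, confidence=0.95):
    """Return the (lower, upper) percentile bounds of a list of bootstrap scores."""
    ordered = sorted(scores)
    n = len(ordered)
    tail = (1.0 - confidence) / 2.0            # 0.025 for a 95% interval
    lo_rank = max(round(n * tail), 1)          # 1-based rank, 25 when n = 1000
    hi_rank = min(round(n * (1.0 - tail)), n)  # 1-based rank, 975 when n = 1000
    return ordered[lo_rank - 1], ordered[hi_rank - 1]

# Stand-in for 1000 bootstrap scores: the integers 1..1000 in random order,
# so the returned values are exactly the ranks that were selected.
fake_scores = list(range(1, 1001))
random.Random(0).shuffle(fake_scores)
print(percentile_interval(fake_scores))
```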

If you don't like the idea of using the 2.5% and 97.5% percentiles to measure the 95% confidence interval, here is another way (it may be more reliable):

Hmm, so ... how about this:
We can assume these 1000 NIST/Bleu scores are approximately normally distributed (or follow a t-distribution, if the number of bootstrap samples is not very large). This assumption is pretty accurate (see Figure 1, normplot(), and Figure 2, the CDF of 5000 Bleu scores).

Figure 1. normplot(5000 Bleu scores)

Figure 2. CDF of 5000 Bleu scores

From these 5000 Bleu scores, we can calculate the mean μ and the standard deviation σ. The 95% confidence interval is [μ-1.96σ, μ+1.96σ]; the multiplier is slightly larger when a t-distribution is used.
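As a sketch of the normal-approximation interval: for a two-sided 95% interval the multiplier is the 97.5% quantile of the standard normal distribution, about 1.96. The function name normal_interval is hypothetical, and the six scores are just the example values quoted earlier on this page, not a real bootstrap run.

```python
import statistics

def normal_interval(scores, confidence=0.95):
    """Mean +/- z * standard deviation of the bootstrap scores."""
    mu = statistics.mean(scores)
    sigma = statistics.stdev(scores)
    # Two-sided critical value of the standard normal (about 1.96 for 95%).
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2.0)
    return mu - z * sigma, mu + z * sigma

scores = [7.10, 7.26, 7.24, 7.25, 7.30, 7.22]
lower, upper = normal_interval(scores)
print(f"[{lower:.4f}, {upper:.4f}]")
```

With only a handful of scores a t-distribution critical value would be more appropriate; with thousands of bootstrap scores the two are nearly identical.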

That's cool!
But isn't this computationally too expensive? We all know how long it takes to calculate Bleu scores: parsing the SGML tags and going through the hypothesis to find the n-gram matches. NIST? It takes even longer because it needs to calculate the NIST info-gain for each n-gram in the reference set.

Let's forget about the NIST info-gain for a moment. We know that Bleu/NIST calculate their scores from two parts: precision and length penalty. Both are calculated from information accumulated at the sentence level. We need only one pass through the hyp/ref to collect all the information we need; the resampling can then be done over this information instead of over the hyp/reference text.
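The idea can be sketched for Bleu as follows. This is not the code of the scripts described on this page: the function names, whitespace tokenization, and the choice of the closest reference length for the length penalty are all assumptions made for the illustration. What it shows is that each sentence is reduced once to a few counts, and every bootstrap sample is then scored by summing counts only.

```python
import math
import random
from collections import Counter

MAX_N = 4  # Bleu n-gram order

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def sentence_stats(hyp, refs):
    """One pass over a sentence: clipped matches and totals per n, plus lengths."""
    h = hyp.split()
    rs = [r.split() for r in refs]
    matches, totals = [], []
    for n in range(1, MAX_N + 1):
        h_counts = ngrams(h, n)
        max_ref = Counter()  # highest count of each n-gram in any single reference
        for r in rs:
            for gram, c in ngrams(r, n).items():
                max_ref[gram] = max(max_ref[gram], c)
        matches.append(sum(min(c, max_ref[g]) for g, c in h_counts.items()))
        totals.append(sum(h_counts.values()))
    # Assumption: use the reference length closest to the hypothesis length.
    ref_len = min((abs(len(r) - len(h)), len(r)) for r in rs)[1]
    return matches, totals, len(h), ref_len

def bleu_from_stats(stats):
    """Test-set level Bleu computed only from per-sentence statistics."""
    m = [sum(s[0][k] for s in stats) for k in range(MAX_N)]
    t = [sum(s[1][k] for s in stats) for k in range(MAX_N)]
    hyp_len = sum(s[2] for s in stats)
    ref_len = sum(s[3] for s in stats)
    if min(m) == 0:
        return 0.0
    log_prec = sum(math.log(m[k] / t[k]) for k in range(MAX_N)) / MAX_N
    bp = 1.0 if hyp_len >= ref_len else math.exp(1.0 - ref_len / hyp_len)
    return bp * math.exp(log_prec)

def bootstrap_bleu(stats, times, rng):
    """Resample the statistics (not the text) and score each sample."""
    n = len(stats)
    return [bleu_from_stats([stats[rng.randrange(n)] for _ in range(n)])
            for _ in range(times)]

# Toy data, made up for illustration.
hyps = ["the cat sat on the mat", "there is dog in a garden today"]
refs = [["the cat sat on the mat", "a cat was sitting on the mat"],
        ["there is a dog in the garden now", "a dog is in the garden today"]]
stats = [sentence_stats(h, r) for h, r in zip(hyps, refs)]
scores = bootstrap_bleu(stats, 1000, random.Random(0))
print(min(scores), max(scores))
```

The expensive text processing happens once in sentence_stats; each of the 1000 samples costs only a few additions.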

Ok, here is how it works:


Calculating confidence intervals step-by-step

Now, you have a hypothesis from your best system A. Let's call it a.hyp. This is the translation of test set T, whose source sentences are in an SGML-tagged file src.sgm. We also have the SGML-tagged reference set, ref.sgm, somewhere you know.

Step0: tag the hyp file

    perl5 /afs/cs.cmu.edu/user/joy/Joy-work/sgml4eval/generateSGMLfromText.perl a.hyp sysId src.sgm > hyp.sgm

Step1: collect sentence level information

    perl5 /afs/cs.cmu.edu/user/joy/Joy-work/bootstrapMTeval/generateLog.pl [-h] -r ref.sgm -s src.sgm -t hyp.sgm [-m] > a.log

Required arguments:
-r ref.sgm is a file containing the reference translations for
the documents to be evaluated.
-s src.sgm is a file containing the source documents for which
translations are to be evaluated
-t hyp.sgm is a file containing the translations to be evaluated

Optional arguments:
-h prints this help message to STDOUT
-m normalization method:
0 (default) NIST MTeval script
1 Bleu MTeval script

Step2: Resampling over the sentence level information

        perl5 /afs/cs.cmu.edu/user/joy/Joy-work/bootstrapMTeval/bootstrapSingle.pl  [ResampleTimes] < a.log

Step3: To compare the scores between two systems

        From Step1, get a.log for system A and do the same for system B, resulting in b.log. Then,

        perl5 /afs/cs.cmu.edu/user/joy/Joy-work/bootstrapMTeval/bootstrapCompare.pl a.log b.log [ResampleTimes]
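This page does not describe what bootstrapCompare.pl computes internally. One common way to compare two systems with bootstrapping, sketched here purely as an assumption, is to resample the same sentence indices for both systems and look at the interval of the score difference. The names paired_bootstrap_diff and precision and the toy counts are hypothetical; precision is a stand-in for the real test-set metric.

```python
import random

def paired_bootstrap_diff(stats_a, stats_b, score_fn, times, rng):
    """95% percentile interval of score(A) - score(B) over paired resamples."""
    if len(stats_a) != len(stats_b):
        raise ValueError("both systems must cover the same sentences")
    n = len(stats_a)
    diffs = []
    for _ in range(times):
        # The same sampled sentences are used for both systems.
        idx = [rng.randrange(n) for _ in range(n)]
        diffs.append(score_fn([stats_a[i] for i in idx])
                     - score_fn([stats_b[i] for i in idx]))
    diffs.sort()
    lo = diffs[max(round(times * 0.025), 1) - 1]
    hi = diffs[min(round(times * 0.975), times) - 1]
    return lo, hi

def precision(stats):
    """Toy test-set metric: total matched n-grams / total n-grams."""
    return sum(m for m, _ in stats) / sum(t for _, t in stats)

# Per-sentence (matched, total) counts, made up for illustration.
stats_a = [(5, 6), (3, 7), (8, 9), (4, 5), (6, 8)]
stats_b = [(4, 6), (3, 7), (6, 9), (4, 5), (5, 8)]
lo, hi = paired_bootstrap_diff(stats_a, stats_b, precision, 1000, random.Random(1))
print(f"difference interval: [{lo:.4f}, {hi:.4f}]")
```

If the interval of the difference excludes zero, the two systems differ at roughly the 95% level under this scheme.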
    


Interpret the results


NIST_N=5 Bleu_N=4 Max_N=5     : the n-gram size used to calculate NIST/Bleu
RefNumber=4                                  : number of references used
Total 993 segments read.                 : number of segments for each sample
Create 10000 Samples                     : number of resampled test sets
Critical value = 1.9842                    : critical value for df=(sample size-1)

NIST Metric
Median=7.1868 nonPara interval: [7.0649,7.3195]    : median value of NIST scores; [lowerBound at 2.5%, highBound at 97.5%]
Mean =7.1889   t-interval: [7.0636, 7.3142] Var=0.00398726 STDEV=0.0631 RSD=0.88%
: mean of NIST scores; 95% interval assuming t-distribution; variance; standard deviation; relative standard deviation (RSD = standard deviation / |mean|)

and the same for BLEU and M-Bleu scores.
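As a quick arithmetic check, the t-interval and RSD in the sample output above can be recomputed from the reported mean, variance, and critical value. The variable names are just for this sketch.

```python
import math

# Values copied from the sample output above.
mean = 7.1889
variance = 0.00398726
critical = 1.9842

stdev = math.sqrt(variance)
lower = mean - critical * stdev
upper = mean + critical * stdev
rsd = stdev / abs(mean)

print(f"STDEV={stdev:.4f}  t-interval: [{lower:.4f}, {upper:.4f}]  RSD={rsd:.2%}")
```

This prints STDEV=0.0631, the interval [7.0636, 7.3142], and RSD=0.88%, matching the reported line: the t-interval is simply mean ± critical value × standard deviation.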

 


Since Bleu and NIST use different text normalization, there is an option -m in the generateLog.pl script to select the normalization style.

The catch is that if the log file is generated using Bleu text normalization, then only the reported Bleu score and its confidence interval are reliable. If one wants to report the NIST confidence interval or to compare NIST scores between two systems, it is recommended to use NIST text normalization.


To calculate the NIST score, the n-gram info gain has to be calculated first from the reference set. Since we resample the hypothesis and the reference set together, each new sample may have a different reference set, which would result in a slightly different n-gram info gain for each sample. In this approach, we ignore this problem and calculate the n-gram info gain from the original reference set. I do not think this causes a big difference in the NIST score.
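For reference, here is a sketch of how the n-gram info gain can be computed once from the original reference set, using the commonly cited NIST definition info(w1..wn) = log2(count(w1..wn-1) / count(w1..wn)), with the total word count as the numerator for unigrams. The function name ngram_info and the whitespace tokenization are assumptions; the exact counting in the scripts above is not shown on this page.

```python
import math
from collections import Counter

def ngram_info(ref_sentences, max_n=5):
    """Info weight of every n-gram (n <= max_n) seen in the reference set."""
    counts = Counter()
    total_words = 0
    for sent in ref_sentences:
        toks = sent.split()
        total_words += len(toks)
        for n in range(1, max_n + 1):
            for i in range(len(toks) - n + 1):
                counts[tuple(toks[i:i + n])] += 1
    info = {}
    for gram, c in counts.items():
        # Unigrams are weighted against the total number of reference words,
        # longer n-grams against the count of their (n-1)-gram prefix.
        parent = total_words if len(gram) == 1 else counts[gram[:-1]]
        info[gram] = math.log2(parent / c)
    return info

# Toy reference set, made up for illustration.
info = ngram_info(["a b", "a c"])
print(info[("a",)], info[("b",)], info[("a", "b")])
```

Because this table is built once and reused for every bootstrap sample, the resampling itself stays cheap.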


Keywords: machine translation, automatic evaluation, confidence intervals, bootstrapping, NIST MTeval, IBM BLEU scores

$Rev: 2376 $
$LastChangedDate: 2007-01-21 01:44:51 -0500 (Sun, 21 Jan 2007) $