Example: NGramMatchingStat4TestSet

Training data train.en:

a b c d
a c d b e
b c d
b a b e

Testing data test.en:

b a b c
b c d
g a b

train.en has already been indexed by IndexSA.

> NGramMatchingStat4TestSet.O32 train.en < test.en

Loading data...
Loading Vocabulary...
Loading existing vocabulary file: train.en.id_voc
Total 105 word types loaded
Max VocID=105
Vocabulary loaded in 0 seconds.
Loading corpus...
Corpus loaded in 0 seconds.
Loading suffix...
Initialize level-1 buckets...
Suffix loaded in 0 seconds.
Total: 4 sentences loaded.
Input sentences:
N=1: 9 / 10 90.0 OccInTrain= 35
N=2: 6 / 7 85.7 OccInTrain= 12
N=3: 3 / 4 75.0 OccInTrain= 4

Out of 3 input sentences, 1 can be found in the training data.
Time cost:0 seconds

Interpretation:

9 out of the 10 unigram tokens in test.en can be found in train.en. OccInTrain is the sum, over the matched tokens, of how often each token's type occurs in train.en, so each matched unigram's type occurs on average 35/9 (about 3.9) times in the training corpus.

6 out of the 7 bigram tokens in test.en can be found in the training data, and 3 out of the 4 trigram tokens exist in the training data. No 4-grams from test.en can be found in the training data.
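
The figures above can be checked by hand with a small brute-force count. The Python sketch below is not part of the toolkit and does not use the suffix array; the names ngrams and matching_stats are made up for illustration. It assumes that sentences are split on whitespace and that OccInTrain adds up the training frequency of every matched test n-gram token.

```python
from collections import Counter

# The toy corpora from the example above.
train = ["a b c d", "a c d b e", "b c d", "b a b e"]
test = ["b a b c", "b c d", "g a b"]


def ngrams(sentence, n):
    """Return all n-gram tokens of a whitespace-tokenized sentence."""
    words = sentence.split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def matching_stats(train, test, n):
    """Return (matched tokens, total tokens, OccInTrain) for order n.

    Hypothetical helper: a brute-force stand-in for the suffix-array lookup.
    """
    # Frequency of every n-gram type in the training data.
    train_counts = Counter(g for s in train for g in ngrams(s, n))
    # Every n-gram token in the test data, duplicates included.
    test_tokens = [g for s in test for g in ngrams(s, n)]
    matched = [g for g in test_tokens if g in train_counts]
    # Sum of training frequencies over the matched tokens.
    occ = sum(train_counts[g] for g in matched)
    return len(matched), len(test_tokens), occ


for n in range(1, 5):
    matched, total, occ = matching_stats(train, test, n)
    print(f"N={n}: {matched} / {total} OccInTrain= {occ}")
```

For N=1 to N=3 this gives the same counts as the tool output (9 / 10 with 35, 6 / 7 with 12, 3 / 4 with 4). For N=4 the only test 4-gram, "b a b c", has no match in train.en.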