Training data train.en:
a b c d
a c d b e
b c d
b a b e
Testing data test.en:
b a b c
b c d
g a b
train.en has already been indexed by IndexSA:
> NGramMatchingStat4TestSet.O32 train.en < test.en
Loading data...
Loading Vocabulary...
Loading existing vocabulary file: train.en.id_voc
Total 105 word types loaded
Max VocID=105
Vocabulary loaded in 0 seconds.
Loading corpus...
Corpus loaded in 0 seconds.
Loading suffix...
Initialize level-1 buckets...
Suffix loaded in 0 seconds.
Total: 4 sentences loaded.
Input sentences:
N=1: 9 / 10 90.0 OccInTrain= 35
N=2: 6 / 7 85.7 OccInTrain= 12
N=3: 3 / 4 75.0 OccInTrain= 4
Out of 3 input sentences, 1 can be found in the training data.
Time cost:0 seconds
Interpretation:
9 out of the 10 1-gram tokens in test.en can be found in train.en. For each matched unigram token, its corresponding type occurs on average 35/9 (about 3.9) times in the corpus.
6 out of the 7 2-gram tokens in test.en can be found in the training data, and 3 out of the 4 trigram tokens exist in the training data. No 4-grams can be found in the training data: the only 4-gram in test.en, "b a b c", does not occur in train.en.
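The counts above can be cross-checked by brute force. The following is a minimal Python sketch, not the tool's implementation: it uses plain dictionary counting instead of a suffix array, and the names ngrams and matching_stats are hypothetical helpers introduced only for this illustration. It assumes that an n-gram never crosses a sentence boundary and that OccInTrain is the sum, over all test tokens, of the number of times the same n-gram occurs in the training data.

```python
from collections import Counter

# Toy data copied from train.en and test.en above.
train = ["a b c d", "a c d b e", "b c d", "b a b e"]
test = ["b a b c", "b c d", "g a b"]

def ngrams(tokens, n):
    """Return all n-gram tokens of one sentence, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def matching_stats(train_sents, test_sents, max_n):
    """Map n -> (matched test tokens, total test tokens, occurrences in train)."""
    stats = {}
    for n in range(1, max_n + 1):
        # Count every n-gram type in the training data.
        train_counts = Counter(
            g for s in train_sents for g in ngrams(s.split(), n)
        )
        # Collect every n-gram token in the test data.
        test_grams = [g for s in test_sents for g in ngrams(s.split(), n)]
        matched = sum(1 for g in test_grams if g in train_counts)
        # Counter returns 0 for unseen n-grams, so unmatched tokens add nothing.
        occ = sum(train_counts[g] for g in test_grams)
        stats[n] = (matched, len(test_grams), occ)
    return stats

stats = matching_stats(train, test, 4)
for n, (matched, total, occ) in stats.items():
    print(f"N={n}: {matched} / {total} {100 * matched / total:.1f} OccInTrain= {occ}")

# Whole test sentences that appear verbatim as a training sentence.
whole = sum(1 for s in test if s in set(train))
print(f"Out of {len(test)} input sentences, {whole} can be found in the training data.")
```

Under these assumptions the sketch reproduces the N=1 to N=3 figures in the log above. It also reports the N=4 case explicitly (0 matched out of 1), which the log shown above does not list.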