Locate Embedded n-grams of a Sentence in an Indexed Corpus

Given an indexed corpus, this program locates the embedded n-grams of a test sentence (every contiguous word sequence in the sentence) in that corpus. This is useful in online phrase extraction/alignment approaches such as the PESA system [Vogel 2005]. In PESA, the source side of the bilingual corpus is indexed by IndexSA. For each test sentence (in the source language), the findPhrasesInASentence() function, as demonstrated in this example, finds the occurrences of the embedded source n-grams in the corpus. Knowing the locations of a source n-gram, PESA extracts its alignments from the bilingual corpus as translations of that source n-gram/phrase.

Example:

Corpus: train.en
a b c d e f
a b a b
c d e f d e
a b f h
f d e g
b d e f d e
b a e g
e f a b c
b c d e f d e g

>LocateEmbeddedNgramsInCorpus.O32 train.en
Loading Vocabulary...
Loading existing vocabulary file: train.en.id_voc
Total 108 word types loaded
Max VocID=108
Vocabulary loaded in 0 seconds.
Loading corpus...
Corpus loaded in 0 seconds.
Loading suffix...
Initialize level-1 buckets...
Suffix loaded in 0 seconds.
Loading offset...
Offset loaded in 0 seconds.
Total: 9 sentences loaded.
Input sentences:

a b c d

N-gram [1, 1]: a found in corpus: SentId=2 Pos=3
N-gram [1, 1]: a found in corpus: SentId=2 Pos=1
N-gram [1, 1]: a found in corpus: SentId=8 Pos=3
N-gram [1, 1]: a found in corpus: SentId=1 Pos=1
N-gram [1, 1]: a found in corpus: SentId=4 Pos=1
N-gram [1, 1]: a found in corpus: SentId=7 Pos=2
N-gram [2, 2]: b found in corpus: SentId=2 Pos=4
N-gram [2, 2]: b found in corpus: SentId=2 Pos=2
N-gram [2, 2]: b found in corpus: SentId=7 Pos=1
N-gram [2, 2]: b found in corpus: SentId=8 Pos=4
N-gram [2, 2]: b found in corpus: SentId=1 Pos=2
N-gram [2, 2]: b found in corpus: SentId=9 Pos=1
N-gram [2, 2]: b found in corpus: SentId=6 Pos=1
N-gram [2, 2]: b found in corpus: SentId=4 Pos=2
N-gram [3, 3]: c found in corpus: SentId=8 Pos=5
N-gram [3, 3]: c found in corpus: SentId=1 Pos=3
N-gram [3, 3]: c found in corpus: SentId=3 Pos=1
N-gram [3, 3]: c found in corpus: SentId=9 Pos=2
N-gram [4, 4]: d found in corpus: SentId=3 Pos=5
N-gram [4, 4]: d found in corpus: SentId=6 Pos=5
N-gram [4, 4]: d found in corpus: SentId=1 Pos=4
N-gram [4, 4]: d found in corpus: SentId=3 Pos=2
N-gram [4, 4]: d found in corpus: SentId=6 Pos=2
N-gram [4, 4]: d found in corpus: SentId=9 Pos=3
N-gram [4, 4]: d found in corpus: SentId=9 Pos=6
N-gram [4, 4]: d found in corpus: SentId=5 Pos=2
N-gram [1, 2]: a b found in corpus: SentId=2 Pos=3
N-gram [1, 2]: a b found in corpus: SentId=2 Pos=1
N-gram [1, 2]: a b found in corpus: SentId=8 Pos=3
N-gram [1, 2]: a b found in corpus: SentId=1 Pos=1
N-gram [1, 2]: a b found in corpus: SentId=4 Pos=1
N-gram [2, 3]: b c found in corpus: SentId=8 Pos=4
N-gram [2, 3]: b c found in corpus: SentId=1 Pos=2
N-gram [2, 3]: b c found in corpus: SentId=9 Pos=1
N-gram [3, 4]: c d found in corpus: SentId=1 Pos=3
N-gram [3, 4]: c d found in corpus: SentId=3 Pos=1
N-gram [3, 4]: c d found in corpus: SentId=9 Pos=2
N-gram [1, 3]: a b c found in corpus: SentId=8 Pos=3
N-gram [1, 3]: a b c found in corpus: SentId=1 Pos=1
N-gram [2, 4]: b c d found in corpus: SentId=1 Pos=2
N-gram [2, 4]: b c d found in corpus: SentId=9 Pos=1
N-gram [1, 4]: a b c d found in corpus: SentId=1 Pos=1
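
The listing above can be cross-checked with a brute-force scan. The following Python sketch is only an illustration of what "locating embedded n-grams" means. It does not use a suffix array and is not the toolkit's implementation, and the names CORPUS and locate_embedded_ngrams are made up for this example. Hits for each n-gram are reported in corpus order, whereas the tool reports them in a different order.

```python
# Brute-force illustration only; the real program searches a suffix array index.
# CORPUS and locate_embedded_ngrams are hypothetical names used for this sketch.
CORPUS = [
    "a b c d e f",
    "a b a b",
    "c d e f d e",
    "a b f h",
    "f d e g",
    "b d e f d e",
    "b a e g",
    "e f a b c",
    "b c d e f d e g",
]


def locate_embedded_ngrams(sentence, corpus):
    """Return (start, end, ngram, sent_id, pos) tuples; all indices are 1-based."""
    words = sentence.split()
    hits = []
    # Shorter n-grams first, then left to right within the test sentence.
    for n in range(1, len(words) + 1):
        for start in range(len(words) - n + 1):
            ngram = words[start:start + n]
            for sent_id, line in enumerate(corpus, start=1):
                tokens = line.split()
                for pos in range(len(tokens) - n + 1):
                    if tokens[pos:pos + n] == ngram:
                        hits.append(
                            (start + 1, start + n, " ".join(ngram), sent_id, pos + 1)
                        )
    return hits


if __name__ == "__main__":
    for start, end, ngram, sent_id, pos in locate_embedded_ngrams("a b c d", CORPUS):
        print(f"N-gram [{start}, {end}]: {ngram} found in corpus: "
              f"SentId={sent_id} Pos={pos}")
```

For the test sentence "a b c d" this scan finds the same set of (SentId, Pos) pairs as the listing above, only in a different order within each n-gram.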

As the example above shows, the matched embedded n-grams may have too many occurrences in the corpus. For online phrase extraction, it is not necessary to find every location of a word such as "the" and extract its translation from the bilingual corpus for each occurrence.

To reduce the amount of location information returned, there are several parameters one can set in the C_SuffixArraySearchApplicationBase class, namely:

Run the previous example again, but now with constraints such that

> LocateEmbeddedNgramsInCorpus.O32 train.en 5 4 2 3

Loading Vocabulary...
Loading existing vocabulary file: train.en.id_voc
Total 108 word types loaded
Max VocID=108
Vocabulary loaded in 0 seconds.
Loading corpus...
Corpus loaded in 0 seconds.
Loading suffix...
Initialize level-1 buckets...
Suffix loaded in 0 seconds.
Loading offset...
Offset loaded in 0 seconds.
Total: 9 sentences loaded.
Input sentences:

a b c

N-gram [1, 2]: a b found in corpus: SentId=2 Pos=3
N-gram [1, 2]: a b found in corpus: SentId=2 Pos=1
N-gram [1, 2]: a b found in corpus: SentId=8 Pos=3
N-gram [1, 2]: a b found in corpus: SentId=1 Pos=1
N-gram [2, 3]: b c found in corpus: SentId=8 Pos=4
N-gram [2, 3]: b c found in corpus: SentId=1 Pos=2
N-gram [2, 3]: b c found in corpus: SentId=9 Pos=1
N-gram [1, 3]: a b c found in corpus: SentId=8 Pos=3
N-gram [1, 3]: a b c found in corpus: SentId=1 Pos=1
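
In the constrained run above, only n-grams of two or more words are reported, and "a b" is reported at four locations although it occurs five times in the corpus. The Python sketch below shows one way such limits could be applied. It is a guess at the general idea, not the toolkit's logic: the parameters min_len and max_hits are hypothetical, they are not the parameters of C_SuffixArraySearchApplicationBase, and they are not claimed to correspond to the command-line arguments shown above.

```python
# Hypothetical sketch: limit reported n-grams by length and by hits per n-gram.
# min_len and max_hits are illustrative names, not the toolkit's parameters.
CORPUS = [
    "a b c d e f",
    "a b a b",
    "c d e f d e",
    "a b f h",
    "f d e g",
    "b d e f d e",
    "b a e g",
    "e f a b c",
    "b c d e f d e g",
]


def find_occurrences(ngram, corpus):
    """Return 1-based (sent_id, pos) pairs for every occurrence of ngram."""
    n = len(ngram)
    found = []
    for sent_id, line in enumerate(corpus, start=1):
        tokens = line.split()
        for pos in range(len(tokens) - n + 1):
            if tokens[pos:pos + n] == ngram:
                found.append((sent_id, pos + 1))
    return found


def locate_with_limits(sentence, corpus, min_len=2, max_hits=4):
    """Report n-grams of at least min_len words, at most max_hits hits each."""
    words = sentence.split()
    results = []
    for n in range(min_len, len(words) + 1):
        for start in range(len(words) - n + 1):
            ngram = words[start:start + n]
            # Keep only the first max_hits occurrences, in corpus order.
            for sent_id, pos in find_occurrences(ngram, corpus)[:max_hits]:
                results.append((start + 1, start + n, " ".join(ngram), sent_id, pos))
    return results


if __name__ == "__main__":
    for start, end, ngram, sent_id, pos in locate_with_limits("a b c", CORPUS):
        print(f"N-gram [{start}, {end}]: {ngram} found in corpus: "
              f"SentId={sent_id} Pos={pos}")
```

Which occurrences survive the cap depends on the order in which they are enumerated. This sketch keeps the first ones in corpus order, so the retained locations for "a b" differ from those in the listing above.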


Revision $Rev: 3665 $ Last updated $LastChangedDate: 2007-06-16 15:40:59 -0400 (Sat, 16 Jun 2007) $