What is SALM?
SALM is a C++ package that provides functions to locate n-grams in a large corpus and estimate their statistics. The SALM toolkit includes example applications for estimating type/token frequency, locating n-gram occurrences, and building a suffix array language model that can use an arbitrarily long history over a very large training corpus.
Empirical natural language processing (EMNLP) estimates statistics of natural language from a large amount of text (a corpus). These statistics are used to train statistical models for applications such as information retrieval (IR), statistical machine translation (SMT), and automatic speech recognition (ASR).
One of the fundamental kinds of information used in EMNLP is n-gram statistics. An n-gram is a contiguous sequence of n words. We cannot deal with the whole corpus as one unit because it is too large, so we usually derive information about a corpus from its building blocks: n-grams. Much EMNLP research is based on n-grams, for example phrase-based SMT and n-gram based language modeling.
Information about an n-gram, such as its frequency in a corpus, can tell us a lot about the data. For example, the much higher frequency of the n-gram "the Grand Canyon" compared with the rare occurrence of "the Great Canyon" is a good indicator that "the Great Canyon" is probably an incorrect expression. N-gram statistics are useful in many tasks, including Chinese word segmentation, noun-phrase identification, language modeling, and spell checking, to name a few.
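To make the frequency comparison concrete, here is a minimal C++ sketch that counts n-grams by storing every one of them in a map. It is not part of SALM; the function names `tokenize` and `countNgrams` are hypothetical and chosen only for illustration, and the sketch assumes whitespace-separated words.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Split a whitespace-delimited text into word tokens.
std::vector<std::string> tokenize(const std::string& text) {
    std::istringstream in(text);
    std::vector<std::string> words;
    std::string w;
    while (in >> w) words.push_back(w);
    return words;
}

// Count every n-gram of exactly n words in the token sequence.
// Each key is the n words joined by single spaces.
// (Hypothetical helper for illustration; not a SALM function.)
std::map<std::string, std::size_t> countNgrams(
        const std::vector<std::string>& words, std::size_t n) {
    assert(n > 0);
    std::map<std::string, std::size_t> counts;
    for (std::size_t i = 0; i + n <= words.size(); ++i) {
        std::string key = words[i];
        for (std::size_t j = 1; j < n; ++j) key += " " + words[i + j];
        ++counts[key];
    }
    return counts;
}
```

Calling `countNgrams(tokenize(text), 3)` and looking up "the Grand Canyon" and "the Great Canyon" gives the two frequencies to compare. Note that this approach keeps one map entry per distinct n-gram, so its memory use grows quickly with corpus size and with n.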
When the corpus is large, the total number of n-grams (especially for larger n, say up to 5-grams or 6-grams) can become very large, and storing all of them in memory becomes infeasible. SALM indexes the corpus according to its suffix array order and provides efficient algorithms to collect n-gram statistics on the fly.
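The idea of a suffix array index can be sketched as follows. This is an illustrative toy, not SALM's implementation or API: the names `Tokens`, `buildSuffixArray`, `comparePrefix`, and `countOccurrences` are hypothetical, words are kept as strings for readability, and the suffix array is built with a plain comparison sort rather than a construction algorithm suited to very large corpora.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

using Tokens = std::vector<std::string>;

// Build a word-level suffix array: the start positions of all suffixes of
// the corpus, sorted by the lexicographic order of the word sequences.
// Illustration only: a comparison sort is too slow for very large corpora.
std::vector<std::size_t> buildSuffixArray(const Tokens& corpus) {
    std::vector<std::size_t> sa(corpus.size());
    for (std::size_t i = 0; i < sa.size(); ++i) sa[i] = i;
    std::sort(sa.begin(), sa.end(), [&](std::size_t a, std::size_t b) {
        return std::lexicographical_compare(corpus.begin() + a, corpus.end(),
                                            corpus.begin() + b, corpus.end());
    });
    return sa;
}

// Compare the first ngram.size() words of the suffix starting at pos with
// ngram. Returns a negative, zero, or positive value, like strcmp.
int comparePrefix(const Tokens& corpus, std::size_t pos, const Tokens& ngram) {
    for (std::size_t k = 0; k < ngram.size(); ++k) {
        if (pos + k >= corpus.size()) return -1;  // suffix ended: sorts first
        int c = corpus[pos + k].compare(ngram[k]);
        if (c != 0) return c;
    }
    return 0;  // the suffix starts with the n-gram
}

// Count occurrences of ngram with two binary searches over the suffix array.
// All suffixes that start with the n-gram form one contiguous range.
std::size_t countOccurrences(const Tokens& corpus,
                             const std::vector<std::size_t>& sa,
                             const Tokens& ngram) {
    assert(!ngram.empty());
    auto lo = std::lower_bound(sa.begin(), sa.end(), ngram,
        [&](std::size_t pos, const Tokens& q) {
            return comparePrefix(corpus, pos, q) < 0;
        });
    auto hi = std::upper_bound(lo, sa.end(), ngram,
        [&](const Tokens& q, std::size_t pos) {
            return comparePrefix(corpus, pos, q) > 0;
        });
    return static_cast<std::size_t>(hi - lo);
}
```

The index stores one position per corpus word regardless of n, and no n-gram table is ever materialized: the frequency of an n-gram of any length is the size of the range found by the two binary searches, and the entries in that range are the positions where it occurs.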
The SALM toolkit has been under development since 2002 and has been widely used by the SMT group at CMU in our research on statistical, data-driven machine translation. It has proven to be very efficient and can handle very large corpora (on a Linux machine with 12 GB of RAM, we have indexed a corpus of 1 billion words).
SALM has been extended to a client/server architecture to make use of arbitrarily large corpora. [see Zhang, Hildebrand, Vogel 2006]
If you are interested in developing with SALM and contributing to the toolkit, please email me.
Report bugs
Join the mailing list for future announcements:
| Subscribe to the Suffix Array Toolkit for Empirical Language Processing and Modeling Mailing List |
| Visit this group |
Revision $Rev: 4015 $
$LastChangedDate: 2007-08-09 20:36:38 -0400 (Thu, 09 Aug 2007) $