SALM: Suffix Array and its Applications in Empirical Language Processing

By Joy, Email me



What is SALM?
SALM is a C++ package that provides functions to locate n-grams in a large corpus and estimate their statistics. The SALM toolkit includes example applications such as estimating type/token frequency, locating n-gram occurrences, and a suffix array language model that supports arbitrarily long histories over a very large training corpus.

Google Groups Beta
Subscribe to the Suffix Array Toolkit for Empirical Language Processing and Modeling Mailing List

Introduction

Empirical natural language processing (EMNLP) estimates statistics of natural language from a large amount of text (corpus). These statistics are used to train different statistical models for applications such as information retrieval (IR), statistical machine translation (SMT) and automatic speech recognition (ASR).

One of the fundamental kinds of information used in EMNLP is n-gram statistics. An n-gram is a contiguous sequence of n words. We cannot treat the whole corpus as a single unit because it is too large, so we usually derive information about a corpus from its building blocks: n-grams. Much EMNLP research is based on n-grams, for example phrase-based SMT and n-gram-based language modeling.
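To make the definition concrete, here is a minimal sketch that lists the n-grams of a whitespace-tokenized sentence. The function names tokenize and extractNgrams are made up for this illustration and are not part of SALM.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Split a sentence on whitespace into word tokens.
std::vector<std::string> tokenize(const std::string& sentence) {
    std::istringstream in(sentence);
    std::vector<std::string> words;
    std::string w;
    while (in >> w) words.push_back(w);
    return words;
}

// Return every contiguous n-word sequence in `words`, joined by single spaces.
// (Hypothetical helper for illustration; not a SALM function.)
std::vector<std::string> extractNgrams(const std::vector<std::string>& words,
                                       std::size_t n) {
    std::vector<std::string> ngrams;
    if (n == 0 || words.size() < n) return ngrams;
    for (std::size_t i = 0; i + n <= words.size(); ++i) {
        std::string gram = words[i];
        for (std::size_t j = 1; j < n; ++j) gram += " " + words[i + j];
        ngrams.push_back(gram);
    }
    return ngrams;
}
```

For a sentence of L words this yields L - n + 1 n-grams, so calling extractNgrams(tokenize("we visited the Grand Canyon"), 3) returns three trigrams, the last of which is "the Grand Canyon".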

Information about an n-gram, such as its frequency in a corpus, can tell us a lot about the data. For example, the much higher frequency of the n-gram "the Grand Canyon" compared with the rare occurrence of "the Great Canyon" is a good indicator that "the Great Canyon" is probably an incorrect expression. Useful information can be derived from n-gram statistics in many ways: Chinese word segmentation, noun-phrase identification, language modeling, and spell checking, to name a few.

When the corpus is large, the total number of n-grams (especially when n is large, say up to 5-grams or 6-grams) can become very large, and storing all of them in memory becomes infeasible. SALM indexes the corpus according to its suffix array order and provides efficient algorithms to collect n-gram statistics on the fly.
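The indexing idea can be sketched in a few lines. The code below is a simplified illustration, not SALM's implementation: it keeps the corpus as a vector of words, sorts the start positions of all suffixes, and then finds the frequency of any n-gram with two binary searches. All names (Corpus, buildSuffixArray, locateRange, ngramFrequency) are assumptions made for this example.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <string>
#include <utility>
#include <vector>

using Corpus = std::vector<std::string>;

// Word-level suffix array: start positions of all suffixes of the corpus,
// sorted lexicographically by word sequence. A plain comparison sort is used
// here for brevity; it needs O(N log N) suffix comparisons.
std::vector<std::size_t> buildSuffixArray(const Corpus& corpus) {
    std::vector<std::size_t> sa(corpus.size());
    std::iota(sa.begin(), sa.end(), 0);
    std::sort(sa.begin(), sa.end(), [&](std::size_t a, std::size_t b) {
        return std::lexicographical_compare(corpus.begin() + a, corpus.end(),
                                            corpus.begin() + b, corpus.end());
    });
    return sa;
}

// Compare the first ngram.size() words of the suffix at `pos` with `ngram`.
// Returns <0, 0 or >0; a suffix that ends early sorts before the n-gram.
int comparePrefix(const Corpus& corpus, std::size_t pos, const Corpus& ngram) {
    for (std::size_t k = 0; k < ngram.size(); ++k) {
        if (pos + k >= corpus.size()) return -1;
        const int c = corpus[pos + k].compare(ngram[k]);
        if (c != 0) return c;
    }
    return 0;
}

// Half-open range [first, second) of suffix-array slots whose suffixes start
// with `ngram`. Two binary searches, each comparing at most n words per step.
std::pair<std::size_t, std::size_t> locateRange(
        const Corpus& corpus, const std::vector<std::size_t>& sa,
        const Corpus& ngram) {
    auto lo = std::lower_bound(sa.begin(), sa.end(), ngram,
        [&](std::size_t pos, const Corpus& g) {
            return comparePrefix(corpus, pos, g) < 0;
        });
    auto hi = std::upper_bound(lo, sa.end(), ngram,
        [&](const Corpus& g, std::size_t pos) {
            return comparePrefix(corpus, pos, g) > 0;
        });
    return {static_cast<std::size_t>(lo - sa.begin()),
            static_cast<std::size_t>(hi - sa.begin())};
}

// Frequency of `ngram` = width of its range in the suffix array.
std::size_t ngramFrequency(const Corpus& corpus,
                           const std::vector<std::size_t>& sa,
                           const Corpus& ngram) {
    const auto range = locateRange(corpus, sa, ngram);
    return range.second - range.first;
}
```

Because all suffixes that begin with the same n-gram are adjacent in the sorted order, no n-gram table has to be stored: the frequency is the width of the range, and the entries of the suffix array inside the range are the corpus positions where the n-gram occurs. On a corpus that contains "the Grand Canyon" twice, ngramFrequency returns 2 for that trigram and 0 for "the Great Canyon".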

The SALM toolkit has been under development since 2002 and has been widely used by the SMT group at CMU in our research on statistical, data-driven machine translation. It has proven to be very efficient and can handle very large corpora (on a Linux machine with 12 GB of RAM, we have indexed a corpus of 1 billion words).

SALM has been extended to a client/server architecture to make use of arbitrarily large corpora. [see Zhang, Hildebrand, Vogel 2006]


Latest Update

Download

| Category | Description | Complexity | Linux32 | Linux64 | Win32 |
|---|---|---|---|---|---|
| Index | | O(N log N) | IndexSA.O32 | IndexSA.O64 | IndexSA.exe |
| Search | | O(n log N) | FrequencyOfNgrams.O32 | FrequencyOfNgrams.O64 | FrequencyOfNgrams.exe |
| Search | | | NGramMatchingStat4TestSet.O32 | NGramMatchingStat4TestSet.O64 | NGramMatchingStat4TestSet.exe |
| Search | | | NgramTypeInTestSetMatchedInCorpus.O32 | NgramTypeInTestSetMatchedInCorpus.O64 | NgramTypeInTestSetMatchedInCorpus.exe |
| Search | | O(L^2 log N) | NgramMatchingFreq4Sent.O32 | NgramMatchingFreq4Sent.O64 | NgramMatchingFreq4Sent.exe |
| Search | Output the non-compositionalities of the embedded n-grams in a sentence | | NgramMatchingFreqAndNonCompositionality4Sent.O32 | NgramMatchingFreqAndNonCompositionality4Sent.O64 | NgramMatchingFreqAndNonCompositionality4Sent.exe |
| Search | | | FilterDuplicatedSentences.O32 | FilterDuplicatedSentences.O64 | FilterDuplicatedSentences.exe |
| Search | | | CollectNgramFreqCount.O32 | CollectNgramFreqCount.O64 | CollectNgramFreqCount.exe |
| Search | | O(n log N) | LocateNgramInCorpus.O32 | LocateNgramInCorpus.O64 | LocateNgramInCorpus.exe |
| Search | | | LocateEmbeddedNgramsInCorpus.O32 | LocateEmbeddedNgramsInCorpus.O64 | LocateEmbeddedNgramsInCorpus.exe |
| Scan | | O(n N) | CalcCountOfCounts.O32 | CalcCountOfCounts.O64 | CalcCountOfCounts.exe |
| Scan | | O(n N) | OutputHighFreqNgram.O32 | OutputHighFreqNgram.O64 | OutputHighFreqNgram.exe |
| Scan | | O(n N) | TypeTokenFreqInCorpus.O32 | TypeTokenFreqInCorpus.O64 | TypeTokenFreqInCorpus.exe |
| Language Modeling | | | EvaluateLM.O32 | EvaluateLM.O64 | EvaluateLM.exe |
| Utility | Initial vocabulary | | InitializeVocabulary.O32 | InitializeVocabulary.O64 | InitializeVocabulary.exe |
| Utility | Update universal vocabulary given a corpus | | UpdateUniversalVoc.O32 | UpdateUniversalVoc.O64 | UpdateUniversalVoc.exe |
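As an illustration of the "Scan" category, the sketch below counts n-gram types and tokens for n = 1..maxN in one pass over a word-level suffix array. It is a simplified example written for this page, not the code of TypeTokenFreqInCorpus; the names TypeToken, commonPrefix and typeTokenCounts are assumptions. The scan relies on the fact that identical n-grams are adjacent in suffix array order, so an n-gram is a new type exactly when its suffix shares fewer than n leading words with the previous suffix.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <string>
#include <vector>

using Corpus = std::vector<std::string>;

struct TypeToken {
    std::size_t types = 0;   // number of distinct n-grams
    std::size_t tokens = 0;  // number of n-gram occurrences
};

// Start positions of all suffixes, sorted lexicographically by word sequence.
std::vector<std::size_t> buildSuffixArray(const Corpus& corpus) {
    std::vector<std::size_t> sa(corpus.size());
    std::iota(sa.begin(), sa.end(), 0);
    std::sort(sa.begin(), sa.end(), [&](std::size_t a, std::size_t b) {
        return std::lexicographical_compare(corpus.begin() + a, corpus.end(),
                                            corpus.begin() + b, corpus.end());
    });
    return sa;
}

// Number of leading words shared by the suffixes at a and b, capped at maxLen.
std::size_t commonPrefix(const Corpus& c, std::size_t a, std::size_t b,
                         std::size_t maxLen) {
    std::size_t k = 0;
    while (k < maxLen && a + k < c.size() && b + k < c.size() &&
           c[a + k] == c[b + k]) {
        ++k;
    }
    return k;
}

// result[n - 1] holds the type and token counts of n-grams, for n = 1..maxN.
std::vector<TypeToken> typeTokenCounts(const Corpus& corpus, std::size_t maxN) {
    std::vector<TypeToken> result(maxN);
    const std::vector<std::size_t> sa = buildSuffixArray(corpus);
    for (std::size_t i = 0; i < sa.size(); ++i) {
        const std::size_t len = corpus.size() - sa[i];  // words in this suffix
        const std::size_t shared =
            (i == 0) ? 0 : commonPrefix(corpus, sa[i - 1], sa[i], maxN);
        for (std::size_t n = 1; n <= maxN && n <= len; ++n) {
            ++result[n - 1].tokens;
            if (shared < n) ++result[n - 1].types;  // first suffix of its group
        }
    }
    return result;
}
```

Once the suffix array exists, the scan itself touches each of the N suffixes once and compares at most n words per suffix, which is consistent with the O(n N) complexity listed for the scanning tools. For the five-word corpus "a b a b a", the sketch reports 2 types and 5 tokens for unigrams, 2 types and 4 tokens for bigrams, and 2 types and 3 tokens for trigrams.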

 


Source Code

  1. License agreement
  2. Citation: Ying Zhang, Stephan Vogel, "Suffix Array and its Applications in Empirical Natural Language Processing," In the Technical Report CMU-LTI-06-010, Pittsburgh PA, USA, Dec 2006. [Bibtex entry]
  3. The SALM toolkit is written in C++ using the STL.
    Download the source code for SALM: salm-src.tar.gz
  4. To compile the example programs:
  5. To develop your own applications using SALM, you might want to study the header files of the classes C_SuffixArraySearchApplicationBase and C_SuffixArrayScanningBase. These two classes provide most of the functions needed for n-gram-based NLP.
  6. To integrate the suffix array language model, check the class C_SuffixArrayLanguageModel and follow the example program EvaluateLM.cpp.
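For readers who want the intuition behind item 6 before opening the headers, here is a toy sketch of a language model that looks up counts in a suffix array and therefore is not limited to a fixed history length. It does not use the SALM API: the class ToySuffixArrayLM and its methods are hypothetical, and the estimate is a plain unsmoothed relative frequency over the longest matching history, which is an assumption of this example and not a description of how C_SuffixArrayLanguageModel computes probabilities.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <string>
#include <utility>
#include <vector>

using Words = std::vector<std::string>;

// Hypothetical toy class for illustration only; not part of SALM.
class ToySuffixArrayLM {
public:
    explicit ToySuffixArrayLM(Words corpus)
        : corpus_(std::move(corpus)), sa_(corpus_.size()) {
        std::iota(sa_.begin(), sa_.end(), 0);
        std::sort(sa_.begin(), sa_.end(), [this](std::size_t a, std::size_t b) {
            return std::lexicographical_compare(
                corpus_.begin() + a, corpus_.end(),
                corpus_.begin() + b, corpus_.end());
        });
    }

    // Number of occurrences of `ngram` in the training corpus.
    std::size_t count(const Words& ngram) const {
        auto lo = std::lower_bound(sa_.begin(), sa_.end(), ngram,
            [this](std::size_t pos, const Words& g) { return compare(pos, g) < 0; });
        auto hi = std::upper_bound(lo, sa_.end(), ngram,
            [this](const Words& g, std::size_t pos) { return compare(pos, g) > 0; });
        return static_cast<std::size_t>(hi - lo);
    }

    // Relative-frequency estimate of P(word | history), using the longest
    // suffix of `history` that is followed by `word` somewhere in the corpus.
    // Returns 0 for a word that never occurs. No smoothing or discounting.
    double probability(const Words& history, const std::string& word) const {
        for (std::size_t start = 0; start <= history.size(); ++start) {
            const Words context(history.begin() + start, history.end());
            Words full = context;
            full.push_back(word);
            const std::size_t joint = count(full);
            if (joint == 0) continue;  // shorten the history and retry
            const std::size_t denom =
                context.empty() ? corpus_.size() : count(context);
            return static_cast<double>(joint) / static_cast<double>(denom);
        }
        return 0.0;
    }

private:
    // Compare the first g.size() words of the suffix at `pos` with `g`.
    int compare(std::size_t pos, const Words& g) const {
        for (std::size_t k = 0; k < g.size(); ++k) {
            if (pos + k >= corpus_.size()) return -1;
            const int c = corpus_[pos + k].compare(g[k]);
            if (c != 0) return c;
        }
        return 0;
    }

    Words corpus_;
    std::vector<std::size_t> sa_;
};
```

With the training text "the cat sat on the mat the cat ran", the sketch gives probability({"the"}, "cat") = 2/3, and for the longer history {"on", "the", "cat"} followed by "ran" it falls back to the history "the cat" and returns 1/2. A real model would add smoothing so that the estimates form a proper probability distribution.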


If you are interested in developing with SALM and contributing to the toolkit, please email me.
Report bugs

Join the mailing list for future announcements: the Suffix Array Toolkit for Empirical Language Processing and Modeling Mailing List (Google Groups).


Documentation

  1. SALM API Documentation
  2. Tutorial
  3. "Suffix Array and its Applications in Empirical Natural Language Processing," Technical Report CMU-LTI-06-010

References

Links


Revision 4015
Last changed: 2007-08-09 20:36:38 -0400 (Thu, 09 Aug 2007)