Fundamental Statistical Techniques

Tong Zhang

The statistical approach to natural language processing (NLP) has become increasingly important in recent years. This chapter gives an overview of some fundamental statistical techniques that have been widely used in different NLP tasks. Methods for statistical NLP mainly come from machine learning, a scientific discipline concerned with learning from data: extracting information, discovering patterns, predicting missing information based on observed information, or more generally constructing probabilistic models of the data. Machine learning techniques covered in this chapter can be divided into two types: supervised and unsupervised. Supervised learning is mainly concerned with predicting missing information based on observed information, for example predicting part-of-speech tags from sentences. It employs statistical methods to construct a prediction rule from labeled training data. Supervised learning algorithms discussed in this chapter include Naive Bayes, support vector machines, and logistic regression. The goal of unsupervised learning is to group data into clusters. The main statistical techniques for this purpose are mixture models and the expectation maximization algorithm. This chapter will also cover methods used in sequence analysis, such as hidden Markov models (HMMs), conditional random fields, and the Viterbi decoding algorithm.

BibTeX Citation

  @incollection{zhang-handbook10,
    author = {Tong Zhang},
    title = {Fundamental Statistical Techniques},
    booktitle = {Handbook of Natural Language Processing, Second Edition},
    editor = {Nitin Indurkhya and Fred J. Damerau},
    publisher = {CRC Press, Taylor and Francis Group},
    address = {Boca Raton, FL},
    year = {2010},
    note = {ISBN 978-1420085921}
  }