Alignment

Dekai Wu

In this chapter we discuss the work done on automatic alignment of parallel texts for various purposes. Fundamentally, an alignment algorithm accepts as input a bitext and produces as output a bisegmentation relation that identifies corresponding segments between the texts. Bitext alignment lies at the heart of all data-driven machine translation methods, and the rapid research progress on alignment since 1990 reflects the advent of statistical machine translation (SMT) and example-based machine translation (EBMT) approaches. Yet the importance of alignment extends as well to many other practical applications for translators, bilingual lexicographers, and even ordinary readers. A wide variety of techniques now exist, ranging from the simplest (counting characters or words) to the more sophisticated, sometimes involving linguistic data (lexicons) which may or may not have been automatically induced themselves. Techniques have been developed for aligning passages of various granularities: documents, paragraphs, sentences, constituents, collocations or phrases, words, and characters. Some techniques work on precisely translated parallel corpora, while others work on noisy, comparable, or non-parallel corpora. Some techniques make use of apparent morphological features, while others rely on cognates and loan-words; of particular interest is work done on languages which do not have a common writing system. Some techniques align only shallow, flat chunks, while others align compositional, hierarchical structures. The robustness and generality of different techniques have generated much discussion.

BibTeX Citation

  @incollection{wu-handbook10,
    author = {Dekai Wu},
    title = {Alignment},
    booktitle = {Handbook of Natural Language Processing, Second Edition},
    editor = {Nitin Indurkhya and Fred J. Damerau},
    publisher = {CRC Press, Taylor and Francis Group},
    address = {Boca Raton, FL},
    year = {2010},
    note = {ISBN 978-1420085921}
  }