Corpus Creation

Richard Xiao

A corpus can be defined as a collection of machine-readable authentic texts (including transcripts of spoken data) which is sampled to be representative of a particular natural language or language variety. Corpora provide a material basis and a test bed for building NLP systems.

As a corpus is always designed for a particular purpose, the usefulness of a ready-made corpus must be judged with regard to the purpose to which a user intends to put it. Consequently, although many corpora are readily available, readers will often find that they cannot address their research questions using them. In such circumstances, one must build one’s own corpus. This chapter covers the principal considerations involved in creating such DIY (‘do-it-yourself’) corpora as well as the issues that come up in major corpus creation projects.

This chapter discusses core issues in corpus creation such as corpus size (section 7.2), representativeness, balance and sampling (section 7.3), data capture and copyright (section 7.4), markup and annotation (section 7.5), as well as peripheral issues such as multilingual (section 7.6) and multimodal (section 7.7) corpora.

BibTeX Citation

  @incollection{xiao-handbook10,
    author = {Richard Xiao},
    title = {Corpus Creation},
    booktitle = {Handbook of Natural Language Processing, Second Edition},
    editor = {Nitin Indurkhya and Fred J. Damerau},
    publisher = {CRC Press, Taylor and Francis Group},
    address = {Boca Raton, FL},
    year = {2010},
    note = {ISBN 978-1420085921}
  }

Supplementary Material

S7.1. Survey of Existing Corpus Resources

We note in the chapter that corpus creation costs time and money. Before you embark on a corpus-building project, it is therefore advisable to make sure that no existing corpus can satisfy your needs, or that any suitable corpus is not readily available. One way to do this is to check my recent survey of well-known and influential corpora of various types for English as well as many other languages (Xiao 2008). This is one of the most comprehensive and up-to-date surveys of existing corpus resources. An earlier version of the article is available online at the companion website of Corpus-Based Language Studies (McEnery, Xiao and Tono 2006):

http://www.ling.lancs.ac.uk/corplang/cbls/corpora.asp

or

http://www.routledge.com/textbooks/0415286239/Resources/corpa.htm

S7.2. Developing Linguistic Corpora

Developing Linguistic Corpora: a Guide to Good Practice (Oxford: Oxbow Books, 2005), edited by Martin Wynne, brings together contributions from leading experts in the field of corpus linguistics. It provides an easy-to-read guide to good practice in developing linguistic corpora, covering a range of key concepts and practicalities of corpus creation. The book is available online at the website of the UK Arts and Humanities Data Service (AHDS).

Cover and Table of Contents

Preface By Martin Wynne

Chapter 1. Corpus and Text: Basic Principles By John Sinclair

Chapter 2. Adding Linguistic Annotation By Geoffrey Leech

Chapter 3. Metadata for Corpus Work By Lou Burnard

Chapter 4. Character Encoding in Corpus Construction By Tony McEnery and Richard Xiao

Chapter 5. Spoken Language Corpora By Paul Thompson

Chapter 6. Archiving, Distribution and Preservation By Martin Wynne

Appendix to chapter one. How to make a corpus By John Sinclair

Bibliography

S7.3. Useful Internet Links

CLAWS part-of-speech tagger

Corpora List

Corpus-Based Language Studies

Corpus4u Community

Corpus Encoding Standard

The Dublin Core Metadata Initiative

European Language Resources Association

Extensible Markup Language (XML): A Gentle Introduction

The ICTCLAS Chinese lexical analysis system

The Linguistic Data Consortium

Multilingual Corpus Tools

Open Language Archives

Oxford Text Archive

Project Gutenberg

TEI: Text Encoding Initiative

Unicode

The WordSmith Tools

XML Aware Indexing and Retrieval Architecture (Xaira)

XML Tutorial

References

McEnery, A., Xiao, R. and Tono, Y. (2006) Corpus-based Language Studies: An Advanced Resource Book. London: Routledge.

Xiao, R. (2008) Well-known and influential corpora. In A. Lüdeling and M. Kytö (eds.) Corpus Linguistics: An International Handbook [Volume 1]. Berlin: Mouton de Gruyter. 383-458.