Normalized Web Distance and Word Similarity

Rudi Cilibrasi
Paul Vitanyi

This chapter introduces the normalized web distance (NWD), a method for determining the similarity between words and phrases. It is a general way to tap the amorphous, low-grade knowledge available for free on the Internet: content typed in by individual users pursuing diverse personal objectives, which collectively forms what is effectively the largest semantic electronic database in the world.
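As a rough illustration of the idea, the NWD between two terms x and y is commonly written in terms of page counts as NWD(x, y) = (max{log f(x), log f(y)} - log f(x, y)) / (log N - min{log f(x), log f(y)}), where f(x) is the number of pages containing x, f(x, y) is the number containing both terms, and N is a normalizing count such as the total number of indexed pages. The sketch below is a minimal version of that computation, not the chapter's own code. The function name `nwd` and its parameter names are hypothetical, and the counts are assumed to be supplied by the caller; no search engine is queried.

```python
import math


def nwd(fx: int, fy: int, fxy: int, n: int) -> float:
    """Normalized web distance from page counts (illustrative sketch).

    fx, fy -- pages containing term x and term y, respectively
    fxy    -- pages containing both terms
    n      -- normalizing count, assumed here to be the total number
              of indexed pages
    """
    if fx <= 0 or fy <= 0:
        raise ValueError("each term must occur on at least one page")
    if n <= min(fx, fy):
        raise ValueError("n must exceed the smaller single-term count")
    if fxy <= 0:
        # Terms that never co-occur are treated as infinitely far apart.
        return math.inf

    log_fx, log_fy, log_fxy = math.log(fx), math.log(fy), math.log(fxy)
    numerator = max(log_fx, log_fy) - log_fxy
    denominator = math.log(n) - min(log_fx, log_fy)
    return numerator / denominator
```

With made-up counts, `nwd(1000, 100, 10, 10**6)` comes out near 0.5, while two terms that always appear together, as in `nwd(1000, 1000, 1000, 10**6)`, have distance 0. Smaller values indicate terms that co-occur more often than their individual frequencies would suggest.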

BibTeX Citation

  @incollection{cilibrasi-handbook10,
    author = {Rudi Cilibrasi and Paul Vitanyi},
    title = {Normalized Web Distance and Word Similarity},
    booktitle = {Handbook of Natural Language Processing, Second Edition},
    editor = {Nitin Indurkhya and Fred J. Damerau},
    publisher = {CRC Press, Taylor and Francis Group},
    address = {Boca Raton, FL},
    year = {2010},
    note = {ISBN 978-1420085921}
  }

Online Resources

CompLearn is an open-source software system for analyzing data using data compressors.

Categories can be determined automatically using web page counts and support vector machines. See the 100 WordNet experiments in automatic category learning. For more information on how these categories were recognized, see "Automatic Extraction of Meaning from the Web."