Puneet Varma (Editor)

Corpus of Contemporary American English


The freely searchable 450-million-word Corpus of Contemporary American English (COCA) is the largest corpus of American English currently available, and the only publicly available corpus of American English to contain a wide array of texts from a number of genres.

It was created by Mark Davies, Professor of Corpus Linguistics at Brigham Young University.

Content

The corpus is composed of more than 450 million words from more than 160,000 texts, including 20 million words each year from 1990 to 2015. The most recent update was made in December 2015. The corpus is used by tens of thousands of people each month, which may make it the most widely used "structured" corpus currently available.

For each year, the corpus is evenly divided between the five genres: spoken, fiction, popular magazines, newspapers, and academic journals. The texts come from a variety of sources:

  • Spoken: (85 million words) Transcripts of unscripted conversation from nearly 150 different TV and radio programs.
  • Fiction: (81 million words) Short stories and plays, first chapters of books 1990–present, and movie scripts.
  • Popular magazines: (86 million words) Nearly 100 different magazines, from a range of domains such as news, health, home and gardening, women's, financial, religion, and sports.
  • Newspapers: (81 million words) Ten newspapers from across the US, with text from different sections of the newspapers, such as local news, opinion, sports, and the financial section.
  • Academic journals: (81 million words) Nearly 100 different peer-reviewed journals. These were selected to cover the entire range of the Library of Congress Classification system.

Queries

  • The interface is the same as the BYU-BNC interface for the 100-million-word British National Corpus, the 100-million-word TIME Magazine corpus, and the 400-million-word Corpus of Historical American English (COHA), 1810s–2000s
  • Queries by word, phrase, alternates, substring, part of speech, lemma, synonyms (see below), and customized lists (see below)
  • The corpus is tagged by CLAWS, the same part-of-speech tagger that was used for the BNC and the TIME corpus
  • Chart listings (totals for all matching forms in each genre or year, 1990–present, as well as for subgenres) and table listings (frequency for each matching form in each genre or year)
  • Full collocates searching (up to ten words left and right of node word)
  • Re-sortable concordances, showing the most common words/strings to the left and right of the searched word
  • Comparisons between genres or time periods (e.g. collocates of 'chair' in fiction or academic, nouns with 'break the [N]' in newspapers or academic, adjectives that occur primarily in sports magazines, or verbs that are more common 2005–2010 than previously)
  • One-step comparisons of collocates of related words, to study semantic or cultural differences between words (e.g. comparison of collocates of 'small' and 'little', or 'Democrats' and 'Republicans', or 'men' and 'women', or 'rob' vs 'steal')
  • Users can include semantic information from a 60,000-entry thesaurus directly as part of the query syntax (e.g. frequency and distribution of synonyms of 'beautiful', synonyms of 'strong' occurring in fiction but not academic, synonyms of 'clean' + noun ('clean the floor', 'washed the dishes'))
  • Users can also create their own 'customized' word lists, and then re-use these as part of subsequent queries (e.g. lists related to a particular semantic category (clothes, foods, emotions), or a user-defined part of speech)
  • Note that the corpus is only available through the web interface, due to copyright restrictions.
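
The collocates search listed above (counting the words that occur within a fixed window of a node word) can be illustrated with a minimal Python sketch. This is not COCA's implementation. Because the corpus itself is only reachable through the web interface, the sketch runs on a small made-up sample. The function name `collocates`, the simple regex tokenizer, and the sample sentences are all assumptions introduced here for illustration; the limit of ten words on either side mirrors the limit described in the list.

```python
import re
from collections import Counter


def collocates(text, node, span=4):
    """Count words found within `span` tokens to the left or right of `node`.

    Hypothetical helper for illustration only; not part of any COCA tooling.
    """
    # Mirror the "up to ten words left and right" limit described above.
    if not 1 <= span <= 10:
        raise ValueError("span must be between 1 and 10")

    # Naive tokenizer (an assumption): lowercase runs of letters/apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())

    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        left = tokens[max(0, i - span):i]      # up to `span` tokens before
        right = tokens[i + 1:i + 1 + span]     # up to `span` tokens after
        counts.update(left + right)
    return counts


# Made-up sample text, not drawn from the corpus.
sample = (
    "She pulled a chair up to the table. "
    "The committee chair opened the meeting. "
    "He sat in the chair by the window."
)

if __name__ == "__main__":
    for word, freq in collocates(sample, "chair", span=2).most_common(3):
        print(word, freq)
```

Running the same function separately on texts from two genres, and comparing the resulting counts, would give a rough analogue of the genre comparisons described in the list, though without the part-of-speech tags or lemma information that the real interface uses.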

References

    Corpus of Contemporary American English Wikipedia