Rahul Sharma (Editor)

List of text corpora


The following is a list of text corpora in various languages. "Text corpora" is the plural of "text corpus". A text corpus is a large, structured set of texts, nowadays usually stored and processed electronically. Text corpora are used for statistical analysis and hypothesis testing, for checking occurrences, and for validating linguistic rules within a specific language territory.
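As a rough illustration of "checking occurrences", the sketch below counts word frequencies in a tiny made-up corpus. The three sentences, the `tokenize` helper, and the `relative_frequency` function are all hypothetical and not taken from any corpus listed here. Real corpora are far larger and usually come with their own tokenization and annotation.

```python
from collections import Counter
import re

# Hypothetical three-sentence corpus; a real corpus would hold millions of words.
corpus = [
    "The cat sat on the mat.",
    "The dog sat on the log.",
    "A cat and a dog met.",
]

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

# Count how often each word occurs across all texts in the corpus.
counts = Counter(token for text in corpus for token in tokenize(text))
total = sum(counts.values())

def relative_frequency(word):
    """Return the share of all tokens in the corpus that are `word`."""
    return counts[word.lower()] / total

print(counts.most_common(3))
print(relative_frequency("cat"))
```

The regex tokenizer only handles unaccented Latin letters, so it is a simplification. Corpora in other scripts or with rich morphology would need a language-appropriate tokenizer.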

English language

  • Google Books Ngram Corpus
  • American National Corpus
  • Bank of English
  • British National Corpus
  • Corpus Juris Secundum
  • Corpus of Contemporary American English (COCA): 425 million words, 1990–2011. Freely searchable online.
  • Brown Corpus, forming part of the "Brown Family" of corpora, together with LOB, Frown and F-LOB.
  • International Corpus of English
  • Oxford English Corpus
  • Scottish Corpus of Texts & Speech
  • Corpus Resource Database (CoRD): more than 80 English-language corpora.

European languages

  • Bulgarian National Corpus
  • CETENFolha
  • Croatian Language Corpus
  • Croatian National Corpus
  • Czech National Corpus
  • Google Books Ngram Corpus
  • Russian National Corpus
  • General Internet Corpus of Russian
  • Slovenian National Corpus
  • Thesaurus Linguae Graecae (Ancient Greek)
  • Eastern Armenian National Corpus (EANC): 110 million words. Freely searchable online.
  • National Corpus of Polish
  • German Reference Corpus (DeReKo): more than 4 billion words of contemporary written German.
  • Free corpus of German mistakes from people with dyslexia
  • Spanish text corpus by Molino de Ideas, which contains 660 million words.
  • CorALit: the Corpus of Academic Lithuanian. Academic texts published in 1999–2009 (approx. 9 million words), compiled at the University of Vilnius, Lithuania.
  • Reference Corpus of Contemporary Portuguese (CRPC)
  • Turkish National Corpus

Middle Eastern Languages

  • Hamshahri Corpus (Persian a.k.a. Farsi)
  • Persian in MULTEXT-EAST corpus (Persian a.k.a. Farsi)
  • Amarna letters (for Akkadian, Egyptian, Sumerograms, etc.)
  • TEP: Tehran English-Persian Parallel Corpus
  • TMC: Tehran Monolingual Corpus, a standard corpus for Persian language modeling
  • Persian Today Corpus: The Most Frequent Words of Today Persian, based on a one-million-word corpus (in Persian: Vāže-hā-ye Porkārbord-e Fārsi-ye Emrūz), Hamid Hassani, Tehran, Iran Language Institute (ILI), 322 pp. ISBN 964-8699-32-1
  • Kurdish-corpus.uok.ac.ir (Kurdish corpus, Sorani dialect), University of Kurdistan, Department of English Language and Linguistics
  • Bijankhan Corpus: a contemporary Persian corpus for NLP research, University of Tehran, 2012
  • Neo-Assyrian Text Corpus Project
  • Quranic Arabic Corpus (Classical Arabic)

East Asian Languages

  • Kotonoha Japanese language corpus
  • LIVAC Synchronous Corpus (Chinese)

Parallel corpora of diverse languages

  • Europarl Corpus - proceedings of the European Parliament from 1996 to 2011
  • EUR-Lex corpus - collection of texts in all official languages of the European Union, created from the EUR-Lex database
  • OPUS: Open source Parallel Corpus in many languages
  • Tatoeba: a parallel corpus containing about 2,288,000 sentences in 122 languages.
  • NTU-Multilingual Corpus in 7 languages (ara, eng, ind, jpn, kor, cmn, vie) (legacy repo)
  • SeedLing corpus - A Seed Corpus for the Human Language Project with 1000+ languages from various sources.
  • GRALIS: parallel texts for various Slavic languages, compiled by the institute for Slavic languages at Graz University (Branko Tošović et al.)

Comparable Corpora

  • WaCky - The Web-As-Corpus Kool Yinitiative (eng, fre, deu, ita)
  • Disambiguating Similar Language Corpora Collection (DSLCC) (Bosnian, Croatian, Serbian, Indonesian, Malay, Czech, Slovak, Brazilian Portuguese, European Portuguese, Peninsular Spanish, Argentine Spanish)
  • Wikipedia Comparable Corpora (41 million aligned Wikipedia articles for 253 language pairs)

