List of text corpora

Updated on Feb 05, 2026

The following is a list of text corpora in various languages. "Text corpora" is the plural of "text corpus". A text corpus is a large, structured set of texts, nowadays usually stored and processed electronically. Text corpora are used for statistical analysis and hypothesis testing, such as checking occurrences or validating linguistic rules within a particular language.
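As an illustration of the "checking occurrences" use mentioned above, here is a minimal sketch of a word-frequency count in Python. The two-sentence `toy_corpus` and the helper name `word_frequencies` are made up for this example and do not come from any corpus in the list; real corpora are far larger and usually ship with their own tokenization and query tools.

```python
import re
from collections import Counter


def word_frequencies(texts):
    """Count lower-cased word tokens across an iterable of strings."""
    counts = Counter()
    for text in texts:
        # Very rough tokenizer (an assumption for this sketch):
        # runs of ASCII letters and apostrophes count as words.
        counts.update(re.findall(r"[a-z']+", text.lower()))
    return counts


# Hypothetical toy corpus, for illustration only.
toy_corpus = [
    "The cat sat on the mat.",
    "The dog sat on the log.",
]

freqs = word_frequencies(toy_corpus)
print(freqs.most_common(3))
```

The same counting idea scales to a real corpus by passing an iterator over its documents instead of a short list, although a language-aware tokenizer would be needed for anything beyond simple English text.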

English language

Google Books Ngram Corpus

American National Corpus

Bank of English

British National Corpus

Corpus Juris Secundum

Corpus of Contemporary American English (COCA) 425 million words, 1990–2011. Freely searchable online.

Brown Corpus, forming part of the "Brown Family" of corpora, together with LOB, Frown and F-LOB.

International Corpus of English

Oxford English Corpus

Scottish Corpus of Texts & Speech

Corpus Resource Database (CoRD), covering more than 80 English-language corpora.

European languages

Bulgarian National Corpus

CETENFolha

Croatian Language Corpus

Croatian National Corpus

Czech National Corpus

Google Books Ngram Corpus

Russian National Corpus

General Internet Corpus of Russian

Slovenian National Corpus

Thesaurus Linguae Graecae (Ancient Greek)

Eastern Armenian National Corpus (EANC) 110 million words. Freely searchable online.

National Corpus of Polish

German Reference Corpus (DeReKo). More than 4 billion words of contemporary written German.

Free corpus of German mistakes from people with dyslexia

Spanish text corpus by Molino de Ideas, which contains 660 million words.

CorALit: the Corpus of Academic Lithuanian. Academic texts published in 1999–2009 (approx. 9 million words). Compiled at the University of Vilnius, Lithuania.

Reference Corpus of Contemporary Portuguese (CRPC)

Turkish National Corpus

Middle Eastern languages

Hamshahri Corpus (Persian a.k.a. Farsi)

Persian in MULTEXT-EAST corpus (Persian a.k.a. Farsi)

Amarna letters (for Akkadian, Egyptian, Sumerograms, etc.)

TEP: Tehran English-Persian Parallel Corpus

TMC: Tehran Monolingual Corpus, a standard corpus for Persian language modeling

Persian Today Corpus: The Most Frequent Words of Today Persian, based on a one-million-word corpus (in Persian: Vāže-hā-ye Porkārbord-e Fārsi-ye Emrūz), Hamid Hassani, Tehran, Iran Language Institute (ILI), 322 pp. ISBN 964-8699-32-1

Kurdish-corpus.uok.ac.ir (Kurdish corpus, Sorani dialect), University of Kurdistan, Department of English Language and Linguistics

Bijankhan Corpus: a contemporary Persian corpus for NLP research, University of Tehran, 2012

Neo-Assyrian Text Corpus Project

Quranic Arabic Corpus (Classical Arabic)

East Asian languages

Kotonoha Japanese language corpus

LIVAC Synchronous Corpus (Chinese)

Parallel corpora of diverse languages

Europarl Corpus - proceedings of the European Parliament from 1996 to 2011

EUR-Lex corpus - collection of texts in all official languages of the European Union, created from the EUR-Lex database

OPUS: Open source Parallel Corpus in many languages

Tatoeba: a parallel corpus containing about 2,288,000 sentences in 122 languages.

NTU-Multilingual Corpus in 7 languages (ara, eng, ind, jpn, kor, cmn, vie) (legacy repo)

SeedLing corpus - A Seed Corpus for the Human Language Project with 1000+ languages from various sources.

GRALIS parallel texts for various Slavic languages, compiled by the institute for Slavic languages at Graz University (Branko Tošović et al.)

Comparable corpora

WaCky - The Web-As-Corpus Kool Yinitiative (eng, fre, deu, ita)

Disambiguating Similar Language Corpora Collection (DSLCC) (Bosnian, Croatian, Serbian, Indonesian, Malay, Czech, Slovak, Brazilian Portuguese, European Portuguese, Peninsular Spanish, Argentine Spanish)

Wikipedia Comparable Corpora (41 million aligned Wikipedia articles for 253 language pairs)

References

List of text corpora Wikipedia

(Text) CC BY-SA
