The Bulgarian National Corpus (BulNC) is a large representative corpus of Bulgarian comprising about 200,000 texts and amounting to over 1 billion words.
History
The Bulgarian National Corpus was created at the Institute for Bulgarian Language „Prof. L. Andreychin” by research associates from the Department of Computational Linguistics and the Department of Bulgarian Lexicology and Lexicography. BulNC incorporates several individual electronic corpora developed between 2001 and 2009 for the purposes of the two departments. The Corpus is continually enlarged with new texts.
Contents
The Bulgarian National Corpus consists of a monolingual (Bulgarian) part and 47 parallel corpora. The Bulgarian part includes about 1.2 billion words in over 240,000 text samples. The materials in the Corpus reflect the state of the Bulgarian language (mainly in its written form) from the middle of the 20th century (1945) to the present.
The parallel corpora vary in size and cover 47 foreign languages.
BulNC is annotated at various linguistic levels.
Applications
The Bulgarian National Corpus supports applications in several areas of linguistics: computational linguistics, lexicography, theoretical studies of specific linguistic phenomena, observation of the characteristics of individual language domains, extraction of example sentences for teaching Bulgarian, and others.
Some of the more specific applications of the Corpus are listed below:
Access
Access to BulNC is free of charge for public use and includes: