Samiksha Jaiswal (Editor)

Bulgarian National Corpus

The Bulgarian National Corpus (BulNC) is a large representative corpus of Bulgarian comprising about 200,000 texts and amounting to over 1 billion words.

History

The Bulgarian National Corpus was created at the Institute for Bulgarian Language “Prof. L. Andreychin” by research associates from the Department of Computational Linguistics and the Department of Bulgarian Lexicology and Lexicography. BulNC incorporates several individual electronic corpora developed between 2001 and 2009 for the purposes of the two departments. The corpus is continually enlarged with new texts.

Contents

The Bulgarian National Corpus consists of a monolingual (Bulgarian) part and 47 parallel corpora. The Bulgarian part includes about 1.2 billion words in over 240,000 text samples. The materials in the Corpus reflect the state of the Bulgarian language (mainly in its written form) from the middle of the 20th century (1945) to the present.

The parallel corpora vary in size and cover 47 foreign languages.

BulNC is annotated at various linguistic levels.

Applications

The Bulgarian National Corpus supports applications in a range of linguistic areas: computational linguistics, lexicography, theoretical studies of specific linguistic phenomena, observation of the characteristics of individual language domains, extraction of example sentences for Bulgarian language teaching, and others.

Some of the more specific applications of the Corpus are listed below:

  • Extraction of specific or general sub-corpora according to particular criteria (subject, author, year or period of publication, source, etc.). These can serve as training corpora for applications such as grammatical and semantic tagging, as well as for other research purposes.
  • Observations on the usage frequency of words or language constructions, generation of frequency lists, etc.
  • Searches in the Corpus for instances of particular linguistic phenomena, for lexicographic examples, or for material to use in Bulgarian language instruction (available over the Internet).
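
As a rough illustration of the first two applications, the sketch below selects a sub-corpus by metadata criteria and builds a frequency list from it. It does not use any BulNC interface: the `TextSample` fields, the function names `select_subcorpus` and `frequency_list`, and the three toy sentences are all assumptions made up for this example, and the tokenization (lowercased runs of word characters) is deliberately naive.

```python
import re
from collections import Counter
from dataclasses import dataclass


@dataclass
class TextSample:
    # Hypothetical metadata fields; not the actual BulNC schema.
    author: str
    year: int
    subject: str
    text: str


def select_subcorpus(samples, subject=None, year_from=None, year_to=None):
    """Keep the samples that match every criterion that was given."""
    selected = []
    for s in samples:
        if subject is not None and s.subject != subject:
            continue
        if year_from is not None and s.year < year_from:
            continue
        if year_to is not None and s.year > year_to:
            continue
        selected.append(s)
    return selected


def frequency_list(samples):
    """Count lowercased word tokens across the given samples."""
    counts = Counter()
    for s in samples:
        # \w matches Unicode word characters, so Cyrillic text is handled.
        counts.update(re.findall(r"\w+", s.text.lower()))
    return counts


# Toy samples invented for illustration only.
samples = [
    TextSample("A", 1950, "fiction", "Котката спи. Котката яде."),
    TextSample("B", 2005, "news", "Парламентът гласува закона."),
    TextSample("C", 2008, "fiction", "Кучето спи."),
]

fiction = select_subcorpus(samples, subject="fiction")
freq = frequency_list(fiction)
print(freq.most_common(3))
```

A real frequency study would usually count lemmas rather than surface forms, which is where the linguistic annotation of a corpus becomes useful.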

Access

Access to BulNC is free of charge for public use and includes:

  • Access to the BulNC search engine
  • Certain subcorpora are available for download