Rahul Sharma (Editor)

LIVAC Synchronous Corpus

Initial release: July 1995

Type: Corpus

Operating system: Cross-platform

Website: www.livac.org

Available in: English, Traditional and Simplified Chinese

LIVAC is an uncommon language corpus that has been dynamically maintained since 1995. Unlike other existing corpora, LIVAC has adopted a rigorous and regular "Windows" approach to processing and filtering massive media texts from representative communities in the Pan-Chinese region, including Hong Kong, Macau, Taipei, Singapore, Shanghai, Beijing, Guangzhou, and Shenzhen. The contents are thus deliberately repetitive in most cases, consisting of textual samples drawn from editorials, local and international news, cross-Formosan Straits news, as well as news on finance, sports and entertainment. By 2017, 2.5 billion characters of news media texts had been filtered, of which 600 million characters had been processed and analyzed, yielding an expanding Pan-Chinese dictionary of 2 million words from the Pan-Chinese printed media. Through rigorous analysis based on computational methodology, LIVAC has also accumulated a large amount of accurate and meaningful statistical data on the Chinese language and its speech communities in the Pan-Chinese region, and the results show considerable and important variations.

Contents

The "Windows" approach is the most representative feature of LIVAC and has enabled Pan-Chinese media texts to be quantitatively analyzed according to attributes such as location, time and subject domain. This has made possible various types of comparative studies, applications in information technology, and the development of related innovative applications. Moreover, LIVAC allows longitudinal developments to be taken into account, facilitating Key Word in Context (KWIC) searches and the comprehensive study of target words, their underlying concepts and linguistic structures over the past 20 years, based on variables such as region, duration and content. Results from the extensive, cumulative data analysis in LIVAC have enabled the cultivation of textual databases of proper names, place names, organization names and new words, as well as bi-weekly and annual rosters of media figures. Related applications have included the establishment of verb and adjective databases, the formulation of sentiment indices and related opinion mining to measure and compare the popularity of global media figures in the Chinese media (LIVAC Annual Pan-Chinese Celebrity Rosters, later renamed the Pan-Chinese Media Personalities Rosters), and the construction of monthly new word lexicons (LIVAC Annual Pan-Chinese New Word Rosters). On this basis, the analysis of the emergence, diffusion and transformation of new words, and the publication of dictionaries of neologisms, have been made possible.
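
To make the idea of querying by location, time and subject domain more concrete, the following is a minimal Python sketch of a KWIC search restricted to a "window" of attributes. It is not LIVAC's own software: the `Doc` record, the `in_window` and `kwic` functions, and the three sample documents are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Doc:
    """A hypothetical, already segmented news text with its window attributes."""
    region: str
    year: int
    domain: str
    tokens: tuple


def in_window(doc, regions=None, years=None, domains=None):
    """Return True if the document falls inside the requested window.

    A filter left as None places no restriction on that attribute.
    """
    return ((regions is None or doc.region in regions)
            and (years is None or doc.year in years)
            and (domains is None or doc.domain in domains))


def kwic(docs, keyword, width=2, **window):
    """Yield (region, year, left context, keyword, right context) per hit."""
    for doc in docs:
        if not in_window(doc, **window):
            continue
        for i, tok in enumerate(doc.tokens):
            if tok == keyword:
                left = doc.tokens[max(0, i - width):i]
                right = doc.tokens[i + 1:i + 1 + width]
                yield doc.region, doc.year, " ".join(left), tok, " ".join(right)


# Invented sample data, not taken from the corpus.
docs = [
    Doc("Hong Kong", 1997, "finance", ("股市", "今日", "上升", "百", "點")),
    Doc("Beijing", 1997, "finance", ("股市", "今天", "上漲", "百", "點")),
    Doc("Hong Kong", 2008, "sports", ("奧運", "今日", "開幕")),
]

hits = list(kwic(docs, "今日", width=1, regions={"Hong Kong"}))
for hit in hits:
    print(hit)
```

Changing the keyword arguments (for example `years={1997}` or `domains={"finance"}`) narrows or widens the window without touching the search itself, which is the kind of comparison across region, time and content that the paragraph above describes.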

Corpus data processing

  1. Accessing media texts, manual input, etc.
  2. Text unification, including conversion from simplified to traditional Chinese characters, with texts stored in Big5 and Unicode versions
  3. Automatic word segmentation
  4. Automatic alignment of parallel texts
  5. Manual verification and part-of-speech tagging
  6. Extraction of words and addition to regional sub-corpora
  7. Combination of regional sub-corpora to update the LIVAC corpus and the master lexical database
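
Steps 3, 6 and 7 above can be sketched in a few lines of Python. The source does not say which segmentation algorithm LIVAC uses, so this example substitutes greedy forward maximum matching, a common textbook baseline, and should not be read as the project's actual method. The `segment` function, the toy lexicon and the sample sentences are assumptions made for illustration.

```python
from collections import Counter


def segment(text, lexicon, max_len=4):
    """Greedy forward maximum matching (a baseline, not LIVAC's segmenter).

    At each position the longest lexicon entry is taken; a character that
    starts no known entry becomes a single-character word.
    """
    words = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                i += size
                break
    return words


# Toy lexicon and regional texts, invented for this sketch.
lexicon = {"香港", "股市", "今日", "上升"}
sub_corpora = {
    "Hong Kong": ["香港股市今日上升"],
    "Taipei": ["股市今日上升"],
}

# Step 6: extract words into per-region frequency tables.
regional_counts = {
    region: Counter(w for text in texts for w in segment(text, lexicon))
    for region, texts in sub_corpora.items()
}

# Step 7: combine the regional tables into one master table.
master = sum(regional_counts.values(), Counter())
print(master)
```

Keeping the regional tables separate before merging preserves the per-community counts that comparative studies need, while the merged table plays the role of the master lexical database.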

Labeling for data curation

  1. Categories used include general terms and proper names, such as: general names, surnames, semi-titles; geographical names, organizations and commercial entities, etc.; time, prepositions, locations, etc.; stack-words; loanwords; case-words; numerals, etc.
  2. Construction of databases of proper names, place names, and specific terms, etc.
  3. Generation of rosters: "new word rosters", "celebrity or media personality rosters", "place name rosters", as well as compound words and matched words
  4. Part-of-speech tagging of other categories for sub-databases, such as common nouns, numerals, numeral classifiers, different types of verbs and adjectives, pronouns, adverbs, prepositions, conjunctions, mood particles, onomatopoeia, interjections, etc.
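
One simple way to picture roster generation is as a frequency count over a time window, filtered against what is already known. The sketch below builds a hypothetical "new word roster" this way. The `new_word_roster` function, the `min_count` threshold and the sample tokens are illustrative assumptions; the source does not describe LIVAC's actual selection criteria.

```python
from collections import Counter


def new_word_roster(window_tokens, known_lexicon, min_count=2):
    """Rank words from one time window that are absent from the known lexicon.

    Words seen fewer than min_count times are dropped to reduce noise from
    one-off segmentation errors (an assumed heuristic).
    """
    counts = Counter(t for t in window_tokens if t not in known_lexicon)
    return [(word, n) for word, n in counts.most_common() if n >= min_count]


# Invented tokens for a single window; not real corpus data.
known_lexicon = {"股市", "上升"}
window_tokens = ["股市", "微博", "上升", "微博", "網民", "網民", "微博", "點讚"]

roster = new_word_roster(window_tokens, known_lexicon)
print(roster)
```

The same counting pattern could be applied to tokens tagged as personal names or place names to produce the other rosters listed above, assuming such tags are available from the labeling step.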

Applications

  1. Compilation of Pan-Chinese dictionaries or local dictionaries
  2. Information technology research, such as predictive Chinese text input for mobile phones, automatic speech-to-text conversion, and opinion mining
  3. Comparative studies on linguistic and cultural developments in the Pan-Chinese regions
  4. Language teaching and learning research, and speech-to-text conversion
  5. Customized services in linguistic research and lexical search for international corporations and government agencies
