Neha Patil (Editor)

Constraint Grammar

Constraint Grammar (CG) is a methodological paradigm for natural language processing (NLP). Linguist-written, context-dependent rules are compiled into a grammar that assigns grammatical tags ("readings") to words or other tokens in running text. Typical tags address lemmatisation (lexeme or base form), inflexion, derivation, syntactic function, dependency, valency, case roles, semantic type, etc. Each rule adds, removes, selects or replaces a tag or a set of grammatical tags in a given sentence context. Context conditions can be linked to any tag or tag set of any word anywhere in the sentence, either locally (at defined distances) or globally (at undefined distances). Context conditions in the same rule may be linked (i.e. conditioned upon each other), negated, or blocked by interfering words or tags. Typical CGs consist of thousands of rules that are applied set-wise in progressive steps, covering ever more advanced levels of analysis. Within each level, safe rules are used before heuristic rules, and no rule is allowed to remove the last reading of a given kind, which makes the analysis highly robust.
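To make the rule mechanics concrete, here is a minimal Python sketch of two rule types (remove and select) with a single local context condition. It is an illustration only, not the API of any CG implementation: the names `Cohort`, `context_holds`, `remove` and `select` are hypothetical, readings are modelled as plain sets of tags, and a context is treated as matching if any reading at the given relative position carries the tag.

```python
from dataclasses import dataclass


@dataclass
class Cohort:
    """One token plus its candidate readings (each reading is a frozenset of tags)."""
    wordform: str
    readings: list


def context_holds(sentence, i, offset, tag):
    """True if any reading of the cohort at position i + offset carries `tag`."""
    j = i + offset
    return 0 <= j < len(sentence) and any(tag in r for r in sentence[j].readings)


def _apply(sentence, keep_reading, offset, ctx_tag):
    """Shared logic: filter readings where the context matches, never leaving a cohort empty."""
    changed = False
    for i, cohort in enumerate(sentence):
        if not context_holds(sentence, i, offset, ctx_tag):
            continue
        kept = [r for r in cohort.readings if keep_reading(r)]
        # Robustness guard: skip the rule if it would discard every reading.
        if kept and len(kept) < len(cohort.readings):
            cohort.readings = kept
            changed = True
    return changed


def remove(sentence, target, offset, ctx_tag):
    """Discard readings carrying `target` wherever the context condition holds."""
    return _apply(sentence, lambda r: target not in r, offset, ctx_tag)


def select(sentence, target, offset, ctx_tag):
    """Keep only readings carrying `target` wherever the context condition holds."""
    return _apply(sentence, lambda r: target in r, offset, ctx_tag)


if __name__ == "__main__":
    sentence = [
        Cohort("the", [frozenset({"DET"})]),
        Cohort("walk", [frozenset({"N"}), frozenset({"V"})]),
    ]
    # "Remove the verb reading if the preceding word can be a determiner."
    remove(sentence, "V", -1, "DET")
    print(sentence[1].readings)
```

The guard in `_apply` mirrors the robustness property described above: a rule that would delete a cohort's only remaining reading simply does not apply to that cohort.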

The Constraint Grammar concept was launched by Fred Karlsson in 1990 (Karlsson 1990; Karlsson et al., eds, 1995), and CG taggers and parsers have since been written for a large variety of languages, routinely achieving F-scores of over 99% for part of speech (word class). A number of syntactic CG systems have reported F-scores of around 95% for syntactic function labels. CG systems can be used to create full syntactic trees in other formalisms by adding small, non-terminal-based phrase structure grammars or dependency grammars, and a number of treebank projects have used Constraint Grammar for automatic annotation. CG methodology has also been used in a number of language technology applications, such as spell checkers and machine translation systems.

CG-1

The first CG implementation was CGP by Fred Karlsson in the early 1990s. It was purely LISP-based, and the syntax was based on LISP s-expressions (Karlsson 1990).

CG-2

Pasi Tapanainen's CG-2 implementation, mdis, removed some of the parentheses in the grammar format and was implemented in C++, interpreting the grammar as a finite-state transducer for speed.

CG-2 was later reimplemented (with a non-FST method) by the VISL group at Syddansk Universitet as the open-source VISL CG [1], keeping the same format as Tapanainen's closed-source mdis.

CG-3

The VISL project later turned into VISL CG-3, which brought further changes and additions to the grammar format, e.g.:

  • full Unicode support through International Components for Unicode
  • different interpretation of negation (NOT)
  • named relations in addition to plain dependency relations
  • variable-setting
  • full regex matching
  • wrappers for reading/writing Apertium and HFST formats
  • support for subreadings (where one reading has several "parts", used for multi-word expressions and compounds)
  • scanning past point of origin or even window boundaries

Unlike the Tapanainen implementation, the VISL implementations do not use finite-state transducers. Rules are ordered within sections, which gives more predictability when writing grammars, but at the cost of slower parsing and the possibility of endless loops.
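The trade-off between ordered sections and possible endless loops can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not the scheduling used by VISL CG-3 or any other implementation: each token is reduced to one set of candidate tags, a rule is any function that mutates the sentence and reports whether it changed anything, and the hypothetical `run_sections` repeats each section until it reaches a fixed point or hits an arbitrary pass limit.

```python
from typing import Callable, List, Set

Sentence = List[Set[str]]           # one set of candidate tags per token (simplified)
Rule = Callable[[Sentence], bool]   # returns True if the rule changed the sentence


def run_sections(sentence: Sentence, sections: List[List[Rule]], max_passes: int = 100) -> int:
    """Run sections in order; inside a section, apply rules in written order until stable.

    Returns the total number of passes. Raises RuntimeError if a section keeps
    changing the sentence for `max_passes` passes (a likely rule loop).
    """
    total = 0
    for rules in sections:
        for _ in range(max_passes):
            total += 1
            changed = False
            for rule in rules:          # explicit loop: every rule runs on every pass
                if rule(sentence):
                    changed = True
            if not changed:
                break                   # fixed point reached, move to the next section
        else:
            raise RuntimeError("section did not converge; possible rule loop")
    return total


def remove_v_after_det(s: Sentence) -> bool:
    """Safe rule: drop V after a possible DET, unless V is the only tag left."""
    changed = False
    for i in range(1, len(s)):
        if "DET" in s[i - 1] and "V" in s[i] and len(s[i]) > 1:
            s[i].discard("V")
            changed = True
    return changed


def prefer_noun(s: Sentence) -> bool:
    """Heuristic rule: if a token is still ambiguous and can be N, keep only N."""
    changed = False
    for tags in s:
        if len(tags) > 1 and "N" in tags:
            tags.intersection_update({"N"})
            changed = True
    return changed


def add_x(s: Sentence) -> bool:
    """Adds a marker tag X to the first token if it is missing."""
    if "X" not in s[0]:
        s[0].add("X")
        return True
    return False


def drop_x(s: Sentence) -> bool:
    """Removes the marker tag X again, undoing add_x on every pass."""
    if "X" in s[0] and len(s[0]) > 1:
        s[0].discard("X")
        return True
    return False
```

Placing `remove_v_after_det` in an earlier section than `prefer_noun` reflects the safe-before-heuristic ordering, while putting `add_x` and `drop_x` in the same section never stabilises; the pass limit is one simple way to turn such a loop into an error instead of a hang.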

More recently, there have been experimental open-source FST-based implementations that, for small grammars, reach the speed of VISL CG-3, if not that of mdis.

List of systems

Free software
  • VISL CG-3 Constraint Grammar compiler/parser
  • North and Lule Sami, Faroese, Komi and Greenlandic from the University of Tromsø (more information, Northern Sami documentation)
  • Fred Karlsson's original Finnish FinCG is also available from the University of Tromsø as GPL, both in the original CG1 and in a converted CG3 version.
  • Estonian [2]
  • Norwegian Nynorsk and Bokmål online, Oslo-Bergen tagger (source code)
  • Breton, Welsh, Irish Gaelic and Norwegian (converted from the above) in Apertium (see CG in Apertium)

Non-free software
  • Basque [3]
  • Catalan CATCG
  • Danish DanGram
  • English ENGCG, ENGCG-2, VISL-ENGCG
  • Esperanto EspGram
  • French FrAG
  • German GerGram
  • Irish online
  • Italian ItaGram
  • Spanish HISPAL
  • Swedish SWECG
  • Swahili
  • Portuguese PALAVRAS

References

Constraint Grammar Wikipedia