Scannerless parsing

In computer science, scannerless parsing (also called lexerless parsing) refers to performing both tokenization (breaking a stream of characters into words) and parsing (arranging the words into phrases) in a single step, rather than splitting the work into a pipeline of a lexer followed by a parser. It also refers to the associated grammar: using a single formalism to express both the lexical (word-level) grammar and the phrase-level grammar used to parse a language.
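As a rough illustration of the idea (plain Python, not tied to any of the tools listed below), the following sketch expresses both the lexical rule for numbers and the phrase rules for expressions in a single formalism: recursive functions over a character stream, with no separate token stream in between.

    # Character-level (scannerless) recursive-descent parser sketch.
    # Lexical rules (number) and phrase rules (expr, term) are written in the
    # same way: as functions that consume characters directly.
    class Parser:
        def __init__(self, text):
            self.text = text
            self.pos = 0

        def peek(self):
            return self.text[self.pos] if self.pos < len(self.text) else ""

        def skip_ws(self):
            while self.peek().isspace():
                self.pos += 1

        # lexical level: a "token" is just another rule over characters
        def number(self):
            self.skip_ws()
            start = self.pos
            while self.peek().isdigit():
                self.pos += 1
            if start == self.pos:
                raise SyntaxError(f"expected digit at {self.pos}")
            return int(self.text[start:self.pos])

        # phrase level, written in exactly the same style
        def expr(self):            # expr := term ('+' term)*
            value = self.term()
            self.skip_ws()
            while self.peek() == "+":
                self.pos += 1
                value += self.term()
                self.skip_ws()
            return value

        def term(self):            # term := number | '(' expr ')'
            self.skip_ws()
            if self.peek() == "(":
                self.pos += 1
                value = self.expr()
                self.skip_ws()
                if self.peek() != ")":
                    raise SyntaxError(f"expected ')' at {self.pos}")
                self.pos += 1
                return value
            return self.number()

    print(Parser("1 + (2 + 30)").expr())   # 33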

Dividing processing into a lexer followed by a parser is generally viewed as better design because it is more modular, so scannerless parsing is primarily used when a clear lexer–parser distinction is unneeded or unwanted. Examples of when this is appropriate include TeX, most wiki grammars, makefiles, simple per-application scripting languages, and Perl 6.

Advantages

  • Only one metalanguage is needed
  • Non-regular lexical structure is handled easily (for example, nested comments; see the sketch after this list)
  • "Token classification" is unneeded, which removes the need for design accommodations such as "the lexer hack" and language reserved words (such as "while" in C)
  • Grammars can be compositional (can be merged without human intervention) [1]
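Nested block comments are a classic example of non-regular lexical structure: a regex-based lexer cannot match balanced nesting, but a character-level rule can simply call itself. A minimal Python sketch, assuming a hypothetical (* ... *) comment syntax:

    # Recognize nested comments of the form (* ... (* ... *) ... *).
    def comment(text, pos=0):
        """Return the index just past a nested comment starting at pos."""
        if text[pos:pos + 2] != "(*":
            raise SyntaxError(f"expected '(*' at {pos}")
        pos += 2
        while pos < len(text):
            if text[pos:pos + 2] == "*)":
                return pos + 2                # this nesting level is closed
            if text[pos:pos + 2] == "(*":
                pos = comment(text, pos)      # recurse into the nested comment
            else:
                pos += 1
        raise SyntaxError("unterminated comment")

    src = "(* outer (* inner *) still outer *) rest"
    print(src[comment(src):])   # " rest"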

Disadvantages

  • Because lexical scanning and syntactic parsing are combined, the resulting parser tends to be harder to understand and debug for more complex languages
  • Most parsers of character-level grammars are nondeterministic
  • There is no guarantee that the resulting grammar is unambiguous

Required extensions

Eelco Visser identified five key extensions to classical context-free syntax that handle almost all common non-context-free constructs arising in practice (the first two are illustrated in the sketch after this list):

  • Follow restrictions, a limited form of "longest match"
  • Reject productions, a limited form of negative matching (as found in boolean grammars)
  • Preference attributes to handle the dangling else construct in C-like languages
  • Per-production transitions rather than per-nonterminal transitions in order to facilitate:
      • Associativity attributes, which prevent a self-reference in a particular production of a nonterminal from producing that same production
      • Precedence/priority rules, which prevent self-references in higher-precedence productions from producing lower-precedence productions
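The first two extensions can be shown with a short Python sketch (plain Python, not the syntax of SDF or any other tool; the keyword "while" and the Identifier rule are just assumed examples):

    import re

    IDENT = re.compile(r"[A-Za-z][A-Za-z0-9]*")

    def keyword_while(text, pos):
        """Match the keyword 'while' at pos, subject to a follow restriction."""
        if not text.startswith("while", pos):
            return None
        end = pos + len("while")
        if end < len(text) and text[end].isalnum():
            return None                # follow restriction: "while" -/- [A-Za-z0-9]
        return "while", end

    def identifier(text, pos):
        """Match an Identifier at pos, treating 'while' as a reject production."""
        m = IDENT.match(text, pos)
        if m is None or m.group() == "while":   # "while" as an Identifier is rejected
            return None
        return m.group(), m.end()

    print(keyword_while("while (x)", 0))    # ('while', 5)
    print(keyword_while("whiles (x)", 0))   # None: blocked by the follow restriction
    print(identifier("whiles (x)", 0))      # ('whiles', 6)
    print(identifier("while (x)", 0))       # None: 'while' is not an identifier

The follow restriction gives longest-match behaviour without a separate lexer, and the reject production removes the need for a reserved-word table.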

Implementations

  • SGLR is a parser for the modular Syntax Definition Formalism SDF, and is part of the ASF+SDF Meta-Environment and the Stratego/XT program transformation system.
  • JSGLR, a pure Java implementation of SGLR, also based on SDF.
  • TXL supports character-level parsing.
  • dparser generates ANSI C code for scannerless GLR parsers.
  • Spirit allows for both scannerless and scanner-based parsing.
  • SBP is a scannerless parser for boolean grammars (a superset of context-free grammars), written in Java.
  • Laja is a two-phase scannerless parser generator with support for mapping the grammar rules into objects, written in Java.
  • The Grammars feature of the general-purpose programming language Perl 6.
  • PyParsing is a scannerless parser written in pure Python.
  • META II has built-in token parsing functions.
  • TREE-META, like META II, is scannerless, with built-in lexer functions.
  • CWIC (Compiler for Writing and Implementing Compilers) has token rules as part of its language. Rules in CWIC were compiled into boolean functions returning success or failure (the general idea is sketched below).
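The general idea of grammar rules becoming boolean success-or-failure functions can be shown with a short Python sketch (illustrative only, not CWIC's actual syntax or code generation):

    class State:
        def __init__(self, text):
            self.text, self.pos = text, 0

    def char(state, ch):
        if state.pos < len(state.text) and state.text[state.pos] == ch:
            state.pos += 1
            return True
        return False

    def digit(state):
        if state.pos < len(state.text) and state.text[state.pos].isdigit():
            state.pos += 1
            return True
        return False

    # token rule: one digit followed by zero or more digits
    def number(state):
        if not digit(state):
            return False
        while digit(state):
            pass
        return True

    # syntax rule: sum = number ('+' number)*
    def sum_rule(state):
        if not number(state):
            return False
        while char(state, "+"):
            if not number(state):
                return False
        return True

    s = State("12+345")
    print(sum_rule(s) and s.pos == len(s.text))   # True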

References
