Sweble

Original author(s): OSR Group
Operating system: Cross-platform
Written in: Java
Type: Parser
Initial release: May 1, 2011
Stable release: 2.0 / September 14, 2014

The Sweble Wikitext parser is an open-source tool that parses the Wikitext markup language used by MediaWiki, the software behind Wikipedia. The initial development was done by Hannes Dohrn as a Ph.D. thesis project at the Open Source Research Group of Professor Dirk Riehle at the University of Erlangen-Nuremberg from 2009 until 2011. The results were presented to the public for the first time at the WikiSym conference in 2011. Before that, the work had undergone independent scientific peer review and was published by ACM Press.

According to the statistics at Ohloh, the parser is mainly written in the Java programming language. It was open-sourced in May 2011. The parser itself is generated from a parsing expression grammar (PEG) using the Rats! parser generator. Encoding validation is performed by a flex-like lexical analyser generated with JFlex.

A preprint version of the paper on the design of the Sweble Wikitext parser can be found on the project's homepage. In addition, a summary page exists on MediaWiki's future pages.

The current state of parsing

The parser used in MediaWiki converts the content directly from Wikitext into HTML. This process is done in two stages:

  1. Searching for and expanding templates (such as infoboxes), variables, and meta-information (e.g. {{lc:ABC}} is converted into lower-case abc). Template pages can themselves contain such meta-information, so these have to be evaluated as well (recursion). This approach is similar to macro expansion as used, for example, in programming languages like C++.
  2. Parsing and rendering of the now fully expanded text. Here, the text is processed by a sequence of built-in MediaWiki functions that each recognise a single construct. These analyse the content using regular expressions and replace, e.g., = HEAD = with its HTML equivalent <h1>HEAD</h1>. In most cases these steps are performed line by line, with tables and lists being exceptions (a sketch of this regex-based approach follows below).
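
The following is a minimal, illustrative Java sketch of such a regex-based, line-by-line transformation. It is not MediaWiki's actual code (MediaWiki's parser is written in PHP); the class name and the pattern are invented purely for illustration.

  import java.util.regex.Matcher;
  import java.util.regex.Pattern;

  /** Illustrative sketch of a regex-based, line-by-line heading transform. */
  public class HeadingTransform {

      // Matches lines like "= HEAD =" or "== Sub ==", capturing the level and the text.
      private static final Pattern HEADING =
              Pattern.compile("^(={1,6})\\s*(.*?)\\s*\\1\\s*$");

      /** Replaces wiki headings with their HTML equivalents, one line at a time. */
      public static String transform(String wikitext) {
          StringBuilder out = new StringBuilder();
          for (String line : wikitext.split("\n", -1)) {
              Matcher m = HEADING.matcher(line);
              if (m.matches()) {
                  int level = m.group(1).length();
                  out.append("<h").append(level).append(">")
                     .append(m.group(2))
                     .append("</h").append(level).append(">\n");
              } else {
                  out.append(line).append("\n");
              }
          }
          return out.toString();
      }

      public static void main(String[] args) {
          // Prints "<h1>HEAD</h1>" followed by the unchanged text line.
          System.out.print(transform("= HEAD =\nSome text."));
      }
  }

Because each such function only sees a single construct and often a single line, no function has a complete picture of the surrounding document structure, which is the root of the scoping problems described next.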

As the authors of Sweble write in their paper, an analysis of the source code of MediaWiki's parser showed that the strategy of using separate transformation steps leads to new problems: most of the functions used do not take the scope of the surrounding elements into account. This consequently leads to incorrect nesting in the resulting HTML output. As a result, its evaluation and rendering can be ambiguous and depend on the rendering engine of the web browser used. They state:

"The individual processing steps often lead to unexpected and inconsistent behavior of the parser. For example, lists are recognized inside table cells. However, if the table itself appears inside a framed image, lists are not recognized."

As argued at the WikiSym conference in 2008, a lack of language precision and component decoupling hinders the evolution of wiki software. If wiki content had a well-specified representation that is fully machine-processable, this would not only make the content more accessible but also improve and extend the ways in which it can be processed.

In addition, a well-defined object model for wiki content would allow further tools to operate on it. There have been numerous attempts at implementing a new parser for MediaWiki (see [1]), but none of them has succeeded so far. The authors of Sweble state that this might be "due to their choice of grammar, namely the well-known LALR(1) and LL(k) grammars. While these grammars are only a subset of context-free grammars, Wikitext requires global parser state and can therefore be considered a context-sensitive language." As a result, they base their parser on a parsing expression grammar (PEG).
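
As a rough illustration of what distinguishes a PEG from the classic grammar classes: a PEG rule is an ordered choice whose alternatives are tried in sequence, with backtracking on failure. The following minimal Java sketch is purely illustrative (it is not Sweble's generated parser code); it shows an ordered choice that tries the bold marker ''' before the italic marker ''.

  /** Minimal illustration of PEG-style ordered choice with backtracking. */
  public class OrderedChoiceDemo {

      /** Tries the alternatives in order at the given position; returns the match or null. */
      static String matchEmphasisMarker(String input, int pos) {
          // Ordered choice: the longer bold marker ''' is tried before the italic marker ''.
          // With a single character of lookahead the two cannot be told apart; trying the
          // alternatives in a fixed order and backtracking on failure resolves this trivially.
          String[] alternatives = { "'''", "''" };
          for (String alt : alternatives) {
              if (input.startsWith(alt, pos)) {
                  return alt;  // commit to the first alternative that succeeds
              }
              // otherwise "backtrack": the position is unchanged and the next alternative is tried
          }
          return null;         // no alternative matched
      }

      public static void main(String[] args) {
          System.out.println(matchEmphasisMarker("'''bold'''", 0)); // '''
          System.out.println(matchEmphasisMarker("''italic''", 0)); // ''
          System.out.println(matchEmphasisMarker("plain text", 0)); // null
      }
  }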

How Sweble works

Sweble parses the Wikitext and produces an abstract syntax tree (AST) as output. This helps to avoid errors caused by incorrect markup (e.g. a link spanning multiple cells of a table). A detailed description of the abstract syntax tree model can be found in a technical report on the Wikitext Object Model (WOM).

Steps of parsing

The parser processes Wikitext in five stages:

1. Encoding validation 
Since not all possible characters are allowed in Wikitext (e.g. control characters in Unicode), a cleaning step is needed before the actual parsing starts. In addition, some internal naming is performed to facilitate the later steps by making the resulting names for entities unique. In this process it must be ensured that characters used as prefixes by the parser are not escaped or altered. However, this stage should not lead to information loss through the stripping of characters from the input.
2. Pre-processing 
After cleaning the text of illegal characters, the resulting Wikitext is prepared for expansion. For this purpose it is scanned for XML-like comments, meta-information such as redirections, conditional tags, and tag extensions. The latter are XML elements that are treated in a similar way to parser functions and variables. XML elements with unknown names are treated as generic text.
The result of this stage is an AST which consists mostly of text nodes, but also contains redirect links, transclusion nodes and the nodes of tag extensions.
3. Expansion 
Pages in a MediaWiki are often built using templates, magic words, parser functions and tag extensions. To use the AST in a WYSIWYG editor, one would skip expansion in order to see the unexpanded transclusion statements and parser function calls of the original page. However, for rendering the content, e.g. as an HTML page, these must be processed to obtain the complete output. Moreover, pages used as templates can themselves transclude other pages, which makes expansion a recursive process (a sketch of this recursion is given after this list).
4. Parsing 
Before parsing starts, the AST has to be converted back into Wikitext. Once this step is done, a PEG parser analyzes the text and generates an AST capturing the syntax and semantics of the wiki page.
5. Post-processing 
In this stage tags are matched to form complete output elements. Moreover, apostrophes are analyzed to decide which of them are real prose apostrophes and which have to be interpreted as markup for bold or italic text in Wikitext. The assembly of paragraphs is also handled in this step. To do so, the AST is processed using a depth-first traversal of the tree structure.
The rendering of the different kinds of output as well as the analysis functions are realized as visitors. This helps to separate the AST data structure from the algorithms that operate on the data (a sketch of this separation also follows below).
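
To illustrate the recursive nature of the expansion stage (step 3), the following minimal Java sketch resolves simple parameter-less transclusions of the form {{Name}} from an in-memory template map. The class, method and template names are hypothetical and do not correspond to Sweble's actual API.

  import java.util.Map;
  import java.util.regex.Matcher;
  import java.util.regex.Pattern;

  /** Illustrative sketch of recursive template expansion (not Sweble's actual API). */
  public class ExpansionSketch {

      // Matches simple transclusions such as {{Greeting}} (no parameters, for brevity).
      private static final Pattern TRANSCLUSION = Pattern.compile("\\{\\{([^{}|]+)\\}\\}");

      /** Expands transclusions recursively, since templates may transclude other pages. */
      static String expand(String wikitext, Map<String, String> templates) {
          Matcher m = TRANSCLUSION.matcher(wikitext);
          StringBuffer out = new StringBuffer();
          while (m.find()) {
              String body = templates.getOrDefault(m.group(1).trim(), "");
              // Recursive call: the template body may itself contain transclusions.
              // A real implementation would also guard against cyclic transclusions.
              m.appendReplacement(out, Matcher.quoteReplacement(expand(body, templates)));
          }
          m.appendTail(out);
          return out.toString();
      }

      public static void main(String[] args) {
          Map<String, String> templates = Map.of(
                  "Greeting", "Hello, {{Target}}!",
                  "Target", "world");
          System.out.println(expand("{{Greeting}}", templates)); // Hello, world!
      }
  }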
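The separation of the AST data structure from the algorithms that operate on it can be sketched with the visitor pattern as follows. The node types and the renderer below are invented for illustration and are not Sweble's actual WOM or AST classes.

  import java.util.List;

  /** Hypothetical AST node types and a visitor-based HTML renderer (not Sweble's WOM). */
  interface AstVisitor<T> {
      T visit(TextNode node);
      T visit(HeadingNode node);
  }

  abstract class AstNode {
      abstract <T> T accept(AstVisitor<T> visitor);
  }

  class TextNode extends AstNode {
      final String text;
      TextNode(String text) { this.text = text; }
      <T> T accept(AstVisitor<T> visitor) { return visitor.visit(this); }
  }

  class HeadingNode extends AstNode {
      final int level;
      final List<AstNode> children;
      HeadingNode(int level, List<AstNode> children) { this.level = level; this.children = children; }
      <T> T accept(AstVisitor<T> visitor) { return visitor.visit(this); }
  }

  /** Rendering lives in the visitor, so the node classes stay free of output logic. */
  class HtmlRenderer implements AstVisitor<String> {
      public String visit(TextNode node) { return node.text; }
      public String visit(HeadingNode node) {
          StringBuilder sb = new StringBuilder("<h" + node.level + ">");
          for (AstNode child : node.children) { sb.append(child.accept(this)); }
          return sb.append("</h").append(node.level).append(">").toString();
      }
  }

  public class VisitorSketch {
      public static void main(String[] args) {
          AstNode heading = new HeadingNode(1, List.of(new TextNode("HEAD")));
          System.out.println(heading.accept(new HtmlRenderer())); // <h1>HEAD</h1>
      }
  }

Adding a new output format in this design only requires another visitor implementation; the node classes themselves remain unchanged.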
