In markup languages and the digital humanities, overlap occurs when a document has two or more structures that interact in a non-hierarchical manner. A document with overlapping markup cannot be represented as a tree. Overlapping markup is also known as concurrent markup. Overlap happens, for instance, in poetry, where there may be a metrical structure of feet and lines; a linguistic structure of sentences and quotations; and a physical structure of volumes and pages, together with editorial annotations.
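The definition can be made concrete by treating each structural unit as a character range: two units overlap when their ranges intersect but neither contains the other. The following is a minimal sketch under that assumption; the `Span` type, the `overlaps` function, and the sample text are invented for illustration and are not taken from any markup standard.

```python
from typing import NamedTuple


class Span(NamedTuple):
    """A half-open character range [start, end) with a structural label."""
    label: str
    start: int
    end: int


def overlaps(a: Span, b: Span) -> bool:
    """Return True when a and b intersect but neither contains the other."""
    intersect = a.start < b.end and b.start < a.end
    a_in_b = b.start <= a.start and a.end <= b.end
    b_in_a = a.start <= b.start and b.end <= a.end
    return intersect and not (a_in_b or b_in_a)


text = "The night is long and cold. Dawn comes late."

# Metrical structure: two verse lines (hypothetical line break after "long").
line1 = Span("line", 0, 17)
line2 = Span("line", 18, 44)

# Linguistic structure: two sentences.
sentence1 = Span("sentence", 0, 27)
sentence2 = Span("sentence", 28, 44)

# line1 nests inside sentence1, so a tree could hold both.
nested_case = overlaps(line1, sentence1)

# line2 starts inside sentence1 and ends after it: a true overlap.
overlap_case = overlaps(line2, sentence1)
```

Here the first sentence runs across the line break, so no single tree can contain both the lines and the sentences as intact elements.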
History
The problem of non-hierarchical structures in documents has been recognised since 1988; resolving this problem against the dominant paradigm of text as a single hierarchy (an ordered hierarchy of content objects or OHCO) was initially thought to be merely a technical issue, but has, in fact, proven much more difficult. In 2008, Jeni Tennison identified markup overlap as "the main remaining problem area for markup technologists".
Properties and types
A distinction exists between schemes that allow non-contiguous overlap, and those that allow only contiguous overlap. Often, 'markup overlap' strictly means the latter. Contiguous overlap can always be represented as a linear document with milestones, without the need for fragmentation and pointers to fragments, but non-contiguous overlap may require document fragmentation. Another distinction in overlapping markup schemes is whether elements can overlap with other elements of the same kind (self-overlap).
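The claim that contiguous overlap can always be linearized with milestones can be illustrated with a short sketch. It assumes every component is a single non-empty character range; the element names `start` and `end` and the attributes `id`, `ref`, and `type` are hypothetical and do not belong to any particular scheme.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr


def to_milestones(text, spans):
    """Serialize (label, start, end) ranges as empty start/end markers.

    Assumes each range is contiguous and non-empty.
    """
    events = []
    for i, (label, start, end) in enumerate(spans):
        # At equal offsets, end markers (kind 0) sort before start markers (kind 1).
        events.append((end, 0, i, f'<end ref="m{i}"/>'))
        events.append((start, 1, i, f'<start id="m{i}" type={quoteattr(label)}/>'))
    events.sort()

    out, pos = [], 0
    for offset, _kind, _i, tag in events:
        out.append(escape(text[pos:offset]))  # text between two markers
        out.append(tag)
        pos = offset
    out.append(escape(text[pos:]))
    return "<doc>" + "".join(out) + "</doc>"


text = "The night is long and cold. Dawn comes late."
spans = [
    ("line", 0, 17),
    ("line", 18, 44),
    ("sentence", 0, 27),
    ("sentence", 28, 44),
]

linear = to_milestones(text, spans)

# The result is well-formed XML even though lines and sentences overlap,
# and the original text can be read back unchanged.
recovered = "".join(ET.fromstring(linear).itertext())
```

Because every marker is an empty element, no component has to be split. A non-contiguous component would need more than one range, which this sketch deliberately does not handle.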
A scheme may have a privileged hierarchy. Some XML-based schemes, for example, represent one hierarchy directly in the XML document tree, and represent other, overlapping, structures by another means; these are said to be non-privileged.
Approaches and implementations
DeRose (2004, Evaluation criteria) identifies several criteria for judging solutions to the overlap problem: readability and maintainability, tool support and compatibility with XML, possible validation schemes, and ease of processing.
Tag soup is, strictly speaking, not overlapping markup—it is malformed HTML, which is a non-overlapping language, and may be ill-defined. HTML5 defines how processors should deal with such mis-nested markup in the HTML syntax and turn it into a single hierarchy. With XHTML and SGML-based HTML, however, mis-nested markup is a strict error and makes processing by standards-compliant systems impossible.
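The strictness of XML-based syntaxes can be observed with any conforming XML parser. The sketch below uses Python's standard `xml.etree.ElementTree` module; the helper name `is_well_formed` and the sample fragments are illustrative.

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup: str) -> bool:
    """Return True if an XML parser accepts the markup, False otherwise."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        return False
    return True


# Properly nested: the italic range is split so each piece sits in the tree.
nested = "<p><b>bold <i>both</i></b><i> italic</i></p>"

# Mis-nested: <b> is closed while <i> is still open.
misnested = "<p><b>bold <i>both</b> italic</i></p>"

nested_ok = is_well_formed(nested)
misnested_ok = is_well_formed(misnested)
```

An XML parser rejects the second fragment outright, whereas the HTML syntax defined by HTML5 specifies a recovery procedure that yields a single tree.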
SGML, on which early versions of HTML were based, has a feature called CONCUR that allows multiple independent hierarchies to co-exist without privileging any. CONCUR has several drawbacks: DTD validation is a challenge, validation across hierarchies is hard if not impossible, self-overlap is not supported, and the feature interacts poorly with commonly used SGML features. It was poorly supported by tools and saw very little actual use; using CONCUR to represent document overlap was not a recommended use case, according to a commentary by the standard's editor.
Within hierarchical languages
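There are several approaches to representing overlap in a non-overlapping language: multiple documents, each encoding one internally consistent hierarchy of the same text; milestones, empty elements that mark where a component begins and ends; joins, in which a component is split into fragments that pointers link back together; and stand-off markup, in which the markup is kept apart from the content and points into it.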
The Text Encoding Initiative, as an XML-based markup scheme, cannot directly represent overlapping markup; it suggests all four of the above approaches. The Open Scripture Information Standard is another XML-based scheme, designed to mark up the Bible. It uses empty milestone elements to encode non-privileged components.
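How a processor might recover non-privileged components from milestones can be sketched as follows. The document is invented, and the `sID` and `eID` attribute names are modelled on OSIS-style milestone pairs; treat both as illustrative assumptions rather than a faithful excerpt of either scheme.

```python
import xml.etree.ElementTree as ET

# The quotation <q> is in the privileged hierarchy; verses are milestones.
DOC = (
    '<p><verse sID="v1"/>She said, <q>Wait here.<verse eID="v1"/>'
    '<verse sID="v2"/> I will return.</q> Then she left.<verse eID="v2"/></p>'
)


def milestone_spans(root):
    """Return (plain_text, {id: (start, end)}) for sID/eID milestone pairs."""
    chunks = []
    offset = 0
    starts, spans = {}, {}

    def add(text):
        nonlocal offset
        if text:
            chunks.append(text)
            offset += len(text)

    def walk(el):
        if "sID" in el.attrib:
            starts[el.attrib["sID"]] = offset          # component opens here
        elif "eID" in el.attrib:
            key = el.attrib["eID"]
            spans[key] = (starts.pop(key), offset)     # component closes here
        add(el.text)
        for child in el:
            walk(child)
            add(child.tail)

    walk(root)
    return "".join(chunks), spans


plain, verses = milestone_spans(ET.fromstring(DOC))
first_verse = plain[slice(*verses["v1"])]
```

The first verse ends inside the quotation and the second begins there, yet the document stays well-formed because only the quotation is encoded as a container element.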
New languages
Another approach is to design an entirely new markup language. Such languages forgo the tool support available for existing languages in exchange for a less complicated semantic model and a more convenient syntax.
Graph-based formalisms
Rather than grounding markup information in a tree, standoff XML employs a data model based on directed graphs. As an alternative to traditional markup, such graph-based data models can be represented with formalisms originally developed for generalized directed multigraphs, most notably the Resource Description Framework (RDF). EARMARK is an early RDF/OWL representation that encompasses GODDAGs (general ordered-descendant directed acyclic graphs).
RDF provides different linearizations, including an XML format that can be modeled to mirror conventional standoff XML, and a linearization that lets RDF be expressed in XML attributes (RDFa). Although RDF is semantically equivalent to standoff XML, it does not require special-purpose technology for storing, parsing and querying. Multiple interlinked RDF files representing a document or a corpus may constitute an example of Linguistic Linked Open Data.
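The graph model can be approximated without an RDF library by storing annotations as subject-predicate-object triples that point into an unmodified text. This is only a sketch: apart from `rdf:type`, every name with the `ex:` prefix is a hypothetical vocabulary term, and plain Python tuples stand in for a real triple store.

```python
TEXT = "The night is long and cold. Dawn comes late."

# Each annotation is a graph node described by triples; offsets are
# half-open character ranges into TEXT, which itself carries no markup.
TRIPLES = [
    ("ex:line1", "rdf:type", "ex:Line"),
    ("ex:line1", "ex:start", 0),
    ("ex:line1", "ex:end", 17),
    ("ex:line2", "rdf:type", "ex:Line"),
    ("ex:line2", "ex:start", 18),
    ("ex:line2", "ex:end", 44),
    ("ex:sent1", "rdf:type", "ex:Sentence"),
    ("ex:sent1", "ex:start", 0),
    ("ex:sent1", "ex:end", 27),
    ("ex:sent2", "rdf:type", "ex:Sentence"),
    ("ex:sent2", "ex:start", 28),
    ("ex:sent2", "ex:end", 44),
]


def value(subject, predicate):
    """Return the first object stored for (subject, predicate)."""
    for s, p, o in TRIPLES:
        if s == subject and p == predicate:
            return o
    raise KeyError((subject, predicate))


def covering(offset):
    """Return the sorted annotation nodes whose range contains the offset."""
    subjects = {s for s, p, _o in TRIPLES if p == "ex:start"}
    return sorted(
        s for s in subjects
        if value(s, "ex:start") <= offset < value(s, "ex:end")
    )


# Offset 20 falls in the second line and in the first sentence at once.
at_20 = covering(20)
```

Since no annotation is a container for the text, overlapping lines and sentences coexist as ordinary nodes, and the same query logic applies to every kind of component.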