hOCR is an open standard for representing formatted text obtained from optical character recognition (OCR). It encodes text, style, layout information, recognition confidence metrics and other information using Extensible Markup Language (XML) in the form of Hypertext Markup Language (HTML) or XHTML.
Software
The following OCR software can output recognition results as hOCR files:
Example
The following example is an extract of an hOCR file:
The recognized text is stored in ordinary text nodes of the HTML file. The division into separate lines and words is given by the surrounding span tags. The usual HTML elements are also used, for example the p tag for a paragraph. Additional information is given in properties such as:
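To illustrate the structure described here, the following minimal Python sketch reads words and their properties out of a small hOCR fragment. The fragment in `SAMPLE` is a hypothetical one written for this illustration, not output from any particular OCR engine, and the helper names (`parse_title`, `WordCollector`, `extract_words`) are likewise made up. The sketch assumes that properties are stored in the `title` attribute as semicolon-separated entries, and that words are marked with the class `ocrx_word`.

```python
from html.parser import HTMLParser

# Hypothetical hOCR fragment, written for illustration only.
SAMPLE = """<p class='ocr_par'>
<span class='ocr_line' title='bbox 10 20 200 40'>
<span class='ocrx_word' title='bbox 10 20 80 40; x_wconf 93'>Hello</span>
<span class='ocrx_word' title='bbox 90 20 200 40; x_wconf 87'>world</span>
</span>
</p>"""


def parse_title(title):
    """Split a title attribute such as 'bbox 10 20 80 40; x_wconf 93'
    into a dict mapping each property name to its list of values."""
    props = {}
    for part in title.split(";"):
        fields = part.split()
        if fields:
            props[fields[0]] = fields[1:]
    return props


class WordCollector(HTMLParser):
    """Collects the text and properties of every span with class ocrx_word."""

    def __init__(self):
        super().__init__()
        self.words = []
        # One entry per open span: a dict for word spans, None otherwise.
        self._stack = []

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        attributes = dict(attrs)
        classes = (attributes.get("class") or "").split()
        if "ocrx_word" in classes:
            props = parse_title(attributes.get("title") or "")
            self._stack.append({"text": "", "props": props})
        else:
            self._stack.append(None)

    def handle_data(self, data):
        # Only text directly inside a word span is recorded.
        if self._stack and self._stack[-1] is not None:
            self._stack[-1]["text"] += data

    def handle_endtag(self, tag):
        if tag == "span" and self._stack:
            entry = self._stack.pop()
            if entry is not None:
                entry["text"] = entry["text"].strip()
                self.words.append(entry)


def extract_words(hocr):
    """Return a list of {'text': ..., 'props': ...} dicts, one per word."""
    collector = WordCollector()
    collector.feed(hocr)
    collector.close()
    return collector.words


if __name__ == "__main__":
    for word in extract_words(SAMPLE):
        print(word["text"], word["props"])
```

The standard-library `html.parser` module is used so that the sketch has no dependencies. Real hOCR files may be better handled with a full HTML or XML parser, particularly if word spans contain nested markup.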
References
HOCR Wikipedia(Text) CC BY-SA