Supriya Ghosh (Editor)

HOCR

hOCR is an open standard for representing formatted text obtained from optical character recognition (OCR). It encodes text, style, layout information, recognition confidence metrics and other information using Extensible Markup Language (XML) in the form of Hypertext Markup Language (HTML) or XHTML.

Software

The following OCR software can output its recognition results as an hOCR file:

  • OCRopus
  • Tesseract
  • Cuneiform
  • HebOCR

Example

The following example is an extract of an hOCR file:

The recognized text is stored in ordinary text nodes of the HTML file. The division into separate lines and words is given by the surrounding span tags. The usual HTML elements are also used, for example the p tag for a paragraph. Additional information is carried in the attributes of these elements, such as:

  • different layout elements such as "ocr_par", "ocr_line", "ocrx_word"
  • geometric information for each element with a bounding box "bbox"
  • language information "lang"
  • some confidence values "x_wconf"
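
To make the list above concrete, here is a minimal Python sketch that reads the word-level information out of an hOCR fragment. The fragment in `SAMPLE_HOCR` is a hypothetical one written for illustration, not the output of a real OCR run, and the names `parse_title`, `WordExtractor` and `extract_words` are made up for this sketch. It assumes that "bbox" and "x_wconf" are stored in the `title` attribute as semicolon-separated properties, and that word spans contain no nested elements.

```python
from html.parser import HTMLParser

# Hypothetical hOCR fragment, written by hand for illustration only.
SAMPLE_HOCR = """
<p class='ocr_par' lang='eng' title='bbox 10 10 300 60'>
  <span class='ocr_line' title='bbox 10 10 300 60'>
    <span class='ocrx_word' title='bbox 10 10 120 60; x_wconf 93'>Hello</span>
    <span class='ocrx_word' title='bbox 140 10 300 60; x_wconf 88'>world</span>
  </span>
</p>
"""


def parse_title(title):
    """Split a title such as 'bbox 1 2 3 4; x_wconf 90' into {name: [values]}."""
    props = {}
    for part in title.split(";"):
        tokens = part.split()
        if tokens:
            props[tokens[0]] = tokens[1:]
    return props


class WordExtractor(HTMLParser):
    """Collect text, bounding box and confidence for each 'ocrx_word' span."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._current = None  # word being read, or None when outside a word

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if "ocrx_word" in classes:
            props = parse_title(attrs.get("title") or "")
            self._current = {
                "text": "",
                "bbox": tuple(int(v) for v in props.get("bbox", [])),
                "conf": float(props["x_wconf"][0]) if "x_wconf" in props else None,
            }

    def handle_data(self, data):
        if self._current is not None:
            self._current["text"] += data

    def handle_endtag(self, tag):
        # Assumption: a word span has no nested elements, so the first
        # closing span after its start tag ends the word.
        if tag == "span" and self._current is not None:
            self.words.append(self._current)
            self._current = None


def extract_words(hocr_text):
    parser = WordExtractor()
    parser.feed(hocr_text)
    parser.close()
    return parser.words


if __name__ == "__main__":
    for word in extract_words(SAMPLE_HOCR):
        print(word["text"], word["bbox"], word["conf"])
```

The same approach could be extended to "ocr_line" and "ocr_par" elements by checking for those class names as well. A real hOCR file may carry further properties in the title attribute, which `parse_title` keeps as lists of strings.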