Machine learning for document structure recognition

Paaß, Gerhard; Konya, Iuliu

doi:10.1007/978-3-642-22613-7_12

2011

Book Article

Abstract

The backbone of the information age is digital information, which can be searched, accessed, and transferred instantaneously. The digitization of paper documents is therefore of great interest. This chapter describes approaches to document structure recognition that detect the hierarchy of physical components in document images, such as pages, paragraphs, and figures, and transform it into a hierarchy of logical components, such as titles, authors, and sections. This structural information improves readability and is useful for indexing and retrieving the information contained in documents. First, we present a rule-based system that segments the document image and estimates the logical role of the resulting zones. It is used extensively for processing newspaper collections and shows world-class performance. In the second part, we introduce several machine learning approaches that explore large numbers of interrelated features. They can be adapted to geometrical models of the document structure, which may be set up as a linear sequence or a general graph. These advanced models require far more computational resources but perform better than simpler alternatives and might be used in the future.

Author(s)

Paaß, Gerhard

Konya, Iuliu

Mainwork

Modeling, learning, and processing of text-technological data structures
