Fraunhofer-Gesellschaft

Publica

Here you will find scientific publications from the Fraunhofer Institutes.

Machine learning for document structure recognition

 
Authors: Paaß, Gerhard; Konya, Iuliu

Preprint urn:nbn:de:0011-n-1945657 (4.4 MByte PDF)
MD5 Fingerprint: 8031e6a937c6ee2823deea336f372f4b
The original publication is available at springerlink.com
Created on: 16.2.2012


Mehler, A.:
Modeling, learning, and processing of text-technological data structures
Berlin: Springer, 2011 (Studies in computational intelligence 370)
ISBN: 3-642-22612-4
ISBN: 978-3-642-22612-0
ISBN: 978-3-642-22613-7
ISSN: 1860-949X
pp. 221-247
English
Book Article, Electronic Publication
Fraunhofer IAIS
machine learning; digitizing paper documents; document structure recognition; rule-based; document segmentation

Abstract
The backbone of the information age is digital information, which can be searched, accessed, and transferred instantaneously. Digitizing paper documents is therefore of great interest. This chapter describes approaches to document structure recognition, which detect the hierarchy of physical components in document images, such as pages, paragraphs, and figures, and transform it into a hierarchy of logical components, such as titles, authors, and sections. This structural information improves readability and is useful for indexing and retrieving the information contained in documents. First, we present a rule-based system that segments the document image and estimates the logical role of the resulting zones. It is used extensively for processing newspaper collections and shows world-class performance. In the second part, we introduce several machine learning approaches that explore large numbers of interrelated features. They can be adapted to geometrical models of the document structure, which may be set up as a linear sequence or a general graph. These advanced models require far more computational resources but perform better than simpler alternatives and might be used in the future.

URL: http://publica.fraunhofer.de/documents/N-194565.html