2011
Book Article
Title

Machine learning for document structure recognition

Abstract
The backbone of the information age is digital information, which can be searched, accessed, and transferred instantaneously. The digitization of paper documents is therefore of great interest. This chapter describes approaches to document structure recognition that detect the hierarchy of physical components in document images, such as pages, paragraphs, and figures, and transform it into a hierarchy of logical components, such as titles, authors, and sections. This structural information improves readability and is useful for indexing and retrieving the information contained in documents. First, we present a rule-based system that segments the document image and estimates the logical role of the resulting zones. It is used extensively for processing newspaper collections, where it shows world-class performance. In the second part, we introduce several machine learning approaches that explore large numbers of interrelated features. They can be adapted to geometrical models of the document structure, which may be set up as a linear sequence or a general graph. These advanced models require far more computational resources but perform better than simpler alternatives and might be used in the future.
Author(s)
Paaß, Gerhard  
Konya, Iuliu
Mainwork
Modeling, learning, and processing of text-technological data structures  
Open Access
DOI
10.24406/publica-r-226963
10.1007/978-3-642-22613-7_12
File(s)
Download (4.42 MB)
Rights
Under Copyright
Language
English
Fraunhofer-Institut für Intelligente Analyse- und Informationssysteme IAIS  
Keyword(s)
  • machine learning
  • digitizing paper documents
  • document structure recognition
  • rule-based
  • document segmentation
