An incremental approach to document structure recognition

Under Copyright
Neuhold, E.J.; Encarnacao, J.L.; Xu, Y.; Y.Xu
31.07.2002
1998
https://publica.fraunhofer.de/handle/publica/273336
10.24406/publica-fhg-273336

Various electronic information sources offer their documents in different forms. In particular, the structure of these documents is often available only in a provider-specific format. For further processing, exchange, and archiving, this structure has to be extracted. This dissertation develops an approach for automatically recognizing the structure of electronic documents on the basis of only a few manually structured example documents. To this end, a rule-oriented language for specifying recognition programs is introduced. On this basis, machine learning techniques (version space and grammatical inference) are developed that generate recognition programs from examples.

Most of the electronic documents available from today's huge number of electronic information sources have an implicit structure. In order to manipulate, exchange, and archive these documents, it is important to extract their logical structure and make it explicitly available. Many researchers have noted the importance of recognizing the logical structure of documents, yet we still lack an easy method for recognizing the implicit structure of electronic documents. The two most widely used methods are recognizing structure by hand and recognizing it with structure recognition programs. Because of the large number of documents, the manual approach is tedious and error-prone, although in principle it is very simple. Writing a complete recognition program is much more effective, but it requires significant intellectual effort. To combine the advantages of both methods, this thesis presents an approach that automates the learning of recognition grammars from manually structured examples.
The approach uses two techniques from the field of machine learning: version space, to abstract from the concrete contents of the structured examples in order to recognize examples with different content, and grammatical inference, to generalize the syntactic structure of the structured examples in order to recognize examples with slightly deviating structure. These two techniques are embedded in an incremental structure learning system, MarkItUp!, which allows a recognition grammar to be conveniently refined for new examples with unanticipated structure. This dissertation presents the design, analysis, and implementation of MarkItUp!. Its characteristics are as follows: (1) it gives the user a simple way to obtain a suitable recognition grammar; (2) it uses incremental learning, so that the recognition grammar can be efficiently modified using additional structured examples. Experimental results on combining the version-space method with a grammatical inference approach in the learning cycle are also presented.

Abstract
Kurzfassung (German abstract)
Acknowledgements
Introduction
  1.1 Problem Domain and Overall Goals
  1.2 Overall Approach
  1.3 Contribution
  1.4 A Guide to this Dissertation
Preliminaries
  2.1 Formal Languages
  2.2 Graphs
  2.3 Document Structures
  2.4 Markup
  2.5 Document Description Language - SGML
An Approach to Document-Structure Recognition
  3.1 DREAM
  3.2 System Overview of MarkItUp!
  3.3 Demonstration of MarkItUp!
  3.4 Summary
Learning
  4.1 Learning Problems and Learning Levels
  4.2 Learning at Content Level
  4.3 Learning at Structure Level
  4.4 Summary
Sequence of Arbitrary Ordering
  5.1 Problem and Goal
  5.2 Basic Concepts and Notations
  5.3 Inferring a General Expression
  5.4 Summary
Implementation
  6.1 System Architecture of MarkItUp!
  6.2 User Interface
  6.3 From Marked-up Example to a Grammar
  6.4 Implementation of Learning Component
  6.5 DREAM Grammar Generator
  6.6 Learning Strategies
  6.7 Experimental Evaluation of the System
  6.8 Summary
Related Work
  7.1 Editing-By-Example
  7.2 Inductive Learning and Learning Methods
  7.3 Application to Wrapping Semi-Structured Data
Conclusion
Bibliography
Appendix: List of Figures and Tables

en
document structure recognition; machine learning
006; 004; 005
doctoral thesis