1998
Doctoral Thesis
Title
An incremental approach to document structure recognition
Abstract
Different electronic information sources offer their documents in different forms. In particular, the structure of these documents is often available only in a provider-specific format. This structure must be extracted before the documents can be further processed, exchanged, or archived. This dissertation develops an approach to automatically recognizing the structure of electronic documents on the basis of only a few manually structured example documents. To this end, it introduces a rule-oriented language for specifying recognition programs. On this basis, it develops machine learning techniques (version space and grammatical inference) that generate recognition programs from examples.
Most of the electronic documents available from today's huge number of electronic information sources have only an implicit structure. In order to manipulate, exchange, and archive these documents, it is important to extract their logical structure and to make it explicitly available. Many researchers have noted the importance of document logical structure recognition, yet we still lack an easy method for recognizing the implicit structure of electronic documents. The two most widely used methods are recognizing structure by hand and recognizing it through structure recognition programs. The manual approach is very simple in principle, but because of the large number of documents it is tedious and error-prone. Writing a complete recognition program is much more effective, but it requires significant intellectual effort. To combine the advantages of both methods, this thesis presents an approach to automate the learning of recognition grammars from manually structured examples. The approach uses two techniques from the field of machine learning: version space, which abstracts from the concrete contents of the structured examples in order to recognize examples with different content, and grammatical inference, which generalizes the syntactic structure of the structured examples in order to recognize examples with slightly deviating structure. These two techniques are embedded into an incremental structure learning system, MarkItUp!, which allows for a convenient refinement of a recognition grammar towards new examples with unanticipated structure. This dissertation presents the design, analysis, and implementation of MarkItUp!. The characteristics of MarkItUp! are as follows: (1) it supports a simple way for the user to obtain a suitable recognition grammar; (2) it uses incremental learning so that the recognition grammar can be efficiently modified using additional structured examples.
Experimental results on combining the version-space method with a grammatical inference approach in the learning cycle are also presented.
ThesisNote
Also published as: Darmstadt, TU, Diss., 1998