Robust Information Extraction From Unstructured Documents
This thesis analyzes the key components of the information extraction workflow and presents several novel techniques and enhancements that improve the robustness of this process. First, it presents a deep learning-based text recognition method that can be trained almost exclusively on synthetically generated documents, together with a novel data augmentation technique that improves text recognition accuracy on low-quality documents. Second, it introduces a novel noise-aware training method that encourages neural network models to build a noise-resistant latent representation of the input. This approach is shown to improve the accuracy of sequence labeling on misrecognized and mistyped text. Further robustness gains are achieved by applying noisy language modeling to learn a meaningful representation of misrecognized and mistyped natural language tokens. Third, to restore structural information from documents, a holistic table extraction system is presented. It exhibits high recognition accuracy in a scenario where raw documents are used as input and the target information is contained in tables. Finally, the thesis introduces a novel method for evaluating table recognition that works when the exact location of table objects on a page is not available in the ground-truth annotations. Experimental results on optical character recognition, named entity recognition, part-of-speech tagging, syntactic chunking, and table recognition and interpretation demonstrate the advantages and utility of the presented approaches. Moreover, the code and resources from most of the experiments have been made publicly available to facilitate future research on improving the robustness of information extraction systems.
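To make the idea of training on misrecognized and mistyped text more concrete, the following is a minimal sketch of artificial error generation for sequence labeling data. It is not the error model used in the thesis, which is not specified in this abstract. It assumes a uniform character-level edit model (substitution, insertion, deletion) over a lowercase Latin alphabet, and the names `inject_noise`, `make_training_pairs`, and `error_rate` are hypothetical.

```python
import random
import string


def inject_noise(tokens, error_rate=0.1, seed=None):
    """Return a noisy copy of `tokens`, one noisy token per input token.

    Assumption: each character is independently corrupted with probability
    `error_rate` by a uniformly chosen edit operation. This is a simplified
    stand-in for OCR errors and misspellings, not an empirical error model.
    """
    rng = random.Random(seed)
    alphabet = string.ascii_lowercase
    noisy_tokens = []
    for token in tokens:
        chars = []
        for ch in token:
            if rng.random() >= error_rate:
                chars.append(ch)  # keep the character unchanged
                continue
            op = rng.choice(("substitute", "insert", "delete"))
            if op == "substitute":
                chars.append(rng.choice(alphabet))
            elif op == "insert":
                chars.append(ch)
                chars.append(rng.choice(alphabet))
            # "delete": emit nothing for this character
        # Fall back to the clean token if every character was deleted,
        # so that token-level labels stay aligned with the noisy tokens.
        noisy_tokens.append("".join(chars) or token)
    return noisy_tokens


def make_training_pairs(tokens, labels, error_rate=0.1, seed=None):
    """Pair each clean token with its noisy variant and its original label."""
    noisy = inject_noise(tokens, error_rate, seed)
    return list(zip(tokens, noisy, labels))


if __name__ == "__main__":
    tokens = ["Angela", "Merkel", "visited", "Paris"]
    labels = ["B-PER", "I-PER", "O", "B-LOC"]
    for clean, noisy, label in make_training_pairs(tokens, labels, 0.2, seed=0):
        print(clean, noisy, label)
```

Because the sketch keeps exactly one noisy token per clean token, the original labels can be reused unchanged. A model can then be trained on both the clean and the noisy view of each sentence.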
computational linguistics
NLP
OCR
noise-aware training
sequence labeling
NER
language modeling
robustness
information extraction
natural language processing
text recognition
optical character recognition
data augmentation
alpha compositing
synthetic document generation
named entity recognition
embeddings
OCR errors
misspellings
artificial error generation
empirical error modeling
error correction
unsupervised data generation
noisy language modeling
parallel data generation
table extraction
table recognition
semantic table interpretation
maximum weight matching