2021
Master Thesis
Title
Evaluation and re-usable implementation of DL-based approaches for Entity Recognition
Abstract
Named-entity recognition (NER) aims to identify and label instances of predefined entities in a chunk of text. Even though conceptually simple, it is a challenging task that requires some amount of context and a good understanding of what constitutes an entity. Since NER is a precursor to other natural language applications such as question answering and text summarization, high-quality NER systems are essential. For a long time, these systems have relied on domain-specific knowledge or resources such as gazetteers to perform well. In the past decade, deep learning (DL) techniques have been applied to NER; they do not resort to any external resources and have achieved state-of-the-art results. This thesis investigates several aspects of DL-based approaches for NER. Recent improvements mainly come from utilizing unsupervised language model pre-training to produce representations that depend on a word's contextual use. Intuitively, more informative embeddings lead to better generalization, that is, detecting mentions that do not appear in the training data for NER models. Several word representations are evaluated for German, using the two biggest available datasets, CoNLL-03 and GermEval. The results show that recent contextualized representations improve the entity extraction performance on both datasets, as they are more robust against entity type ambiguities (e.g., is "Washington" a person or a location?) and lengthy entities (e.g., publication titles). Such embeddings are useful for multilingual NER too: two approaches for generating multilingual embeddings are pitted against each other in order to find out which is the most useful for extracting entities from a mix of German, English and Dutch data. Another investigated aspect is improving the performance on "low-data" domains through transfer learning, using fine-tuning. This is motivated by the fact that neural models tend to underperform when sufficient training data is lacking. Fine-tuning a pre-trained model using contextualized embeddings significantly improves the performance on a relatively small annotated German dataset from Europarl. The final step of this work is providing a re-usable implementation of a DL-based NER model within a framework for building NLP pipelines such as DKPro Core. Challenges of integrating an external Python-based model into a Java-based framework are investigated.
Thesis Note
Bonn, Univ., Master Thesis, 2021
Publishing Place
Bonn
Rights
Under Copyright
Language
English