Noise Reduction in Distant Supervision for Relation Extraction Using Probabilistic Soft Logic

Kirsch, Birgit; Niyazova, Zamira; Mock, Michael; Rüping, Stefan

doi:10.1007/978-3-030-43887-6_6

2019

Conference Paper

Abstract

The performance of modern relation extraction systems depends to a great degree on the size and quality of the underlying training corpus, and in particular on the labels. Since generating these labels by human annotators is expensive, Distant Supervision has been proposed to automatically align entities in a knowledge base with a text corpus to generate annotations. However, this approach introduces noise, which negatively affects the performance of relation extraction systems. To tackle this problem, we propose a probabilistic graphical model which simultaneously incorporates different sources of knowledge, such as domain experts' knowledge about the context and linguistic knowledge about the sentence structure, in a principled way. The model is defined using the declarative language provided by Probabilistic Soft Logic. Experimental results show that the proposed approach, compared to the original distantly supervised set, improves not only the quality of the generated training data sets but also the performance of the final relation extraction model.