FactRunner: A new system for NLP-based information extraction from Wikipedia

Sutoyo, R.; Quix, C.; Kastrati, F.

2014

Conference Paper

Abstract

Wikipedia is playing an increasing role as a source of human-readable knowledge, because it contains an enormous amount of high-quality information written by human authors. Finding a relevant piece of information in this huge collection of natural language text is often a time-consuming process, as a keyword-based search interface is the main method for querying. Therefore, an iterative process to explore the document collection to find the information of interest is required. In this paper, we present an approach to extract structured information from unstructured documents to enable structured queries. Information Extraction (IE) systems have been proposed for this task, but due to the complexity of natural language, they often produce unsatisfying results. As Wikipedia contains, in addition to the plain natural language text, links between documents and other metadata, we propose an approach which exploits this information to extract more accurate structured information. Our proposed system FactRunner focusses on extracting structured information from sentences containing such links, because the links may indicate more accurate information than other sentences. We evaluated our system with a subset of documents from Wikipedia and compared the results with another existing system. The results show that a natural language parser combined with Wikipedia markup can be exploited for extracting facts in the form of triple statements with high accuracy.

Author(s)

Sutoyo, R.

Quix, C.

Kastrati, F.

Parent work

Web information systems and technologies. 9th international conference, WEBIST 2013

Conference

International Conference on Web Information Systems and Technologies (WEBIST) 2013
