Here you will find scientific publications from the Fraunhofer Institutes.

Entity resolution by kernel methods

Authors: Pilz, A.; Molzberger, L.; Paaß, G.

Postprint urn:nbn:de:0011-n-932347 (127 KByte PDF)
MD5 Fingerprint: a22f206f3655aee4bf8e165d98c10ec7
Created on: 9.4.2009

Heyer, G.:
Text Mining Services - Building and applying text mining based service infrastructures in research and industry : Proceedings of the Conference on Text Mining Services, TMS 2009, at Leipzig University, 23.-25.3.09
Leipzig: Universität Leipzig, 2009 (Leipziger Beiträge zur Informatik 14)
ISBN: 978-3-94-1608-01-6
Conference on Text Mining Services (TMS) <2009, Leipzig>
Conference Paper, Electronic Publication
Fraunhofer IAIS
text mining; entity recognition; entity disambiguation

An important problem in text mining and semantic retrieval is entity resolution, which aims to determine the identity of a named entity. The name of a unique entity may be written in several variant ways, and different unique entities may share the same name. The term "bush", for instance, may refer to a woody plant, a mechanical fixing, 52 persons and 8 places covered in Wikipedia, and thousands of other persons. To our knowledge, we are the first to apply a kernel-based entity resolution approach to the German Wikipedia as a reference for named entities. We describe both the context of named entities in Wikipedia and the context of a detected name phrase in a new document by a context vector of relevant features. These features contain not only the name itself and its variant spellings, but also relevant key terms, other identified named entities, and topic indicators generated by an LDA topic model. We formulate different kernels for comparing these context vectors and use kernel classifiers, e.g. rank classifiers, to determine the right match. Compared with a baseline approach that uses only text similarity, adding topic features gives a much higher F-value, which is comparable to the results published for English. The procedure is also able to detect with high reliability whether a person is not covered by Wikipedia.
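
The idea of comparing context vectors with a kernel can be illustrated by a minimal sketch. It is not the authors' implementation: the feature groups ("terms", "entities", "topics"), the weights, the threshold, and the toy data are all assumptions made for illustration, and a fixed weighted sum of cosine similarities stands in for the trained kernel classifiers mentioned in the abstract.

```python
import math

def cosine(a, b):
    """Cosine similarity of two sparse vectors given as dicts (a normalized linear kernel)."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def combined_kernel(x, y, weights):
    """Weighted sum of per-group cosine kernels; a non-negative sum of kernels is again a kernel."""
    return sum(w * cosine(x[group], y[group]) for group, w in weights.items())

def resolve(mention, candidates, weights, threshold):
    """Return the best-matching candidate name, or None if no candidate scores high enough.

    Returning None models the case that the entity is not covered by the reference.
    """
    if not candidates:
        return None
    best_name, best_score = max(
        ((name, combined_kernel(mention, ctx, weights)) for name, ctx in candidates.items()),
        key=lambda pair: pair[1],
    )
    return best_name if best_score >= threshold else None

# Hypothetical feature weights and decision threshold (assumed, not learned).
WEIGHTS = {"terms": 1.0, "entities": 1.0, "topics": 1.0}
THRESHOLD = 1.0

# Hypothetical reference contexts; "topics" maps a topic id to its probability,
# as an LDA model might produce.
CANDIDATES = {
    "George W. Bush": {
        "terms": {"president": 1, "texas": 1},
        "entities": {"Washington": 1, "Texas": 1},
        "topics": {0: 0.8, 1: 0.2},
    },
    "Bush (plant)": {
        "terms": {"shrub": 1, "garden": 1},
        "entities": {},
        "topics": {0: 0.1, 1: 0.9},
    },
}

# Hypothetical context of the name phrase "Bush" detected in a new document.
MENTION = {
    "terms": {"president": 1, "election": 1},
    "entities": {"Washington": 1},
    "topics": {0: 0.9, 1: 0.1},
}

print(resolve(MENTION, CANDIDATES, WEIGHTS, THRESHOLD))
```

In this sketch the topic group lets two contexts match even when they share few literal terms, which is the intuition behind adding topic indicators to a pure text-similarity baseline. A learned ranking model would replace the fixed weights and threshold.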