Investigating word correlation at different scopes

Heinrich, G.; Kindermann, J.; Lauth, C.; Paaß, G.; Monzon, J.S.

2005

Conference Paper

Abstract

This paper presents work in progress on clustering methods that identify semantic concepts in a document collection. These methods are based on the observation that semantically related words tend to occur close together. We investigate which neighborhood size should be taken into account for this purpose: sentences or entire documents. We further investigate how local co-occurrence affects clustering quality by including word bigrams as additional terms. We apply two latent-concept models, probabilistic latent semantic analysis (PLSA) and latent Dirichlet allocation (LDA), to a corpus of German news stories. The resulting soft clusterings are compared with a given a priori classification of the documents using an information-based distance metric. Preliminary results show that this cluster distance is smaller when using (1) entire documents (compared with combinations of documents and sentences) and (2) combinations of unigrams and bigrams (compared with the exclusive use of either unigrams or bigrams).
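To make the two design dimensions in the abstract concrete (co-occurrence scope and term type), the following is a minimal sketch of how term counts might be prepared before fitting a latent-concept model such as PLSA or LDA. It is an illustration under stated assumptions, not the authors' preprocessing: the tokenizer, the sentence splitter, the underscore-joined bigram format, the function names, and the two-document toy corpus are all hypothetical, and the record does not say how documents and sentences were combined.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase and keep runs of letters (a deliberately naive tokenizer; assumption)."""
    return re.findall(r"[a-zäöüß]+", text.lower())


def terms(tokens, use_unigrams=True, use_bigrams=True):
    """Build the term list for one co-occurrence unit from unigrams and/or bigrams."""
    out = []
    if use_unigrams:
        out.extend(tokens)
    if use_bigrams:
        # Adjacent word pairs, joined with "_" (hypothetical naming convention).
        out.extend(f"{a}_{b}" for a, b in zip(tokens, tokens[1:]))
    return out


def units(document, scope="document"):
    """Split a document into co-occurrence units: the whole document or its sentences."""
    if scope == "document":
        return [document]
    # Naive sentence split on '.', '!' and '?' (assumption).
    return [s for s in re.split(r"[.!?]+", document) if s.strip()]


def count_units(corpus, scope="document", **term_options):
    """Return one term-count Counter per co-occurrence unit in the corpus."""
    return [
        Counter(terms(tokenize(unit), **term_options))
        for doc in corpus
        for unit in units(doc, scope)
    ]


# Toy corpus, invented for illustration only.
corpus = [
    "Die Bank erhöht die Zinsen. Die Zinsen steigen.",
    "Der Verein gewinnt das Spiel.",
]

doc_counts = count_units(corpus, scope="document")
sent_counts = count_units(corpus, scope="sentence")
unigram_only = count_units(corpus, scope="document", use_bigrams=False)
```

At document scope, a bigram such as `zinsen_die` can span a sentence boundary; at sentence scope it cannot, which is one way the choice of scope changes the terms a model sees. The resulting counters could then be passed to any PLSA or LDA implementation.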

Author(s)

Heinrich, G.

Kindermann, J.

Lauth, C.

Paaß, G.

Monzon, J.S.

Mainwork

Learning and extending lexical ontologies by using machine learning methods

Conference

Workshop on Learning and Extending Lexical Ontologies by Using Machine Learning Methods 2005
