2005
Conference Paper
Title
Investigating word correlation at different scopes
Title Supplement
A latent-concept approach
Abstract
This paper presents work in progress on clustering methods that identify semantic concepts in a document collection. These methods are based on the observation that semantically related words occur close together. We investigate which size of neighborhood should be taken into account for this purpose: sentences or entire documents. We further investigate how local co-occurrence affects clustering quality by including word bigrams as additional terms. We apply two latent-concept models, probabilistic latent semantic analysis (PLSA) and latent Dirichlet allocation (LDA), to a corpus of German news stories. The resulting soft clusterings are compared with a given a priori classification of the documents using an information-based distance metric. Preliminary results show that this cluster distance is smaller when using (1) entire documents (compared with combinations of documents and sentences) and (2) combinations of unigrams and bigrams (compared with exclusive use of either unigrams or bigrams).
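The two factors varied in the abstract, neighborhood scope (sentence or document) and term type (unigrams, bigrams, or both), can be illustrated with a minimal sketch of how the term counts fed to a latent-concept model might be built. This is not the authors' implementation: the function names, the naive tokenizer and sentence splitter, the underscore-joined bigram terms, and the choice to keep bigrams within sentence boundaries are all assumptions made for illustration.

```python
import re
from collections import Counter

def tokenize(text):
    # Lowercased word tokens; a deliberately naive tokenizer (assumption).
    return re.findall(r"\w+", text.lower())

def split_sentences(text):
    # Naive split after sentence-final punctuation (assumption).
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def terms(tokens, use_unigrams=True, use_bigrams=True):
    # Unigrams and/or adjacent-word bigrams, the latter joined by "_".
    out = []
    if use_unigrams:
        out.extend(tokens)
    if use_bigrams:
        out.extend(f"{a}_{b}" for a, b in zip(tokens, tokens[1:]))
    return out

def term_counts(docs, scope="document", use_unigrams=True, use_bigrams=True):
    # Returns one Counter per unit, where a unit is either a whole
    # document or a single sentence, depending on `scope`.
    units = []
    for doc in docs:
        per_sentence = [
            Counter(terms(tokenize(s), use_unigrams, use_bigrams))
            for s in split_sentences(doc)
        ]
        if scope == "sentence":
            units.extend(per_sentence)
        else:
            # Document scope: sum the sentence-level counts.
            units.append(sum(per_sentence, Counter()))
    return units

docs = ["Der Hund bellt. Die Katze schläft.", "Der Hund schläft."]
doc_units = term_counts(docs, scope="document")
sent_units = term_counts(docs, scope="sentence")
unigram_only = term_counts(docs, scope="document", use_bigrams=False)
```

Each returned counter corresponds to one row of a term-count matrix. With `scope="document"` the two example texts yield two rows, and with `scope="sentence"` they yield three, which is the kind of difference in co-occurrence neighborhood the abstract compares.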