2012
Conference Paper
Title
The tell-tale title: How to track topics over time with a two-step-approach
Abstract
We suggest a system that can identify and track topics over time by clustering documents in a recent time period and connecting these topics with those from previous periods. In contrast to most related work, this is achieved by a two-step algorithm in which the document clusters and the connections of these clusters across time are calculated independently. Both steps learn from older data in order to handle current or more recent data. One of our main findings in building the approach is that clustering based on title and authors performs best, since it is neither disturbed by insufficient data coverage nor hindered by overly large clusters caused by small differences in the topic term distributions. A comparison of the identified topics with conference track labels suggests that the system can successfully cluster and track topics over time.

Introduction

The manual exploration and analysis of bibliometric data is part of the daily routine of many scientists. New and old publications that might be relevant for their scientific field, or for the current or new problem they are working on, have to be found and investigated. The possibilities to do this are manifold: starting from an interesting paper and analyzing the documents that cite it (to get newer ones) or that are referenced in it (to get older ones), searching with keywords and/or a field restriction in a bibliometric database, subscribing to journals and newsletters, etc.

In this paper, we propose a support system to discover scientific topics and track them over time. Documents of separate time periods, i.e. publication years, are first clustered according to common term distributions using the well-established topic modeling algorithm Latent Dirichlet Allocation (LDA, Blei et al., 2003). Next, the clusters of these different time periods are connected independently of their temporal distribution. Based on these results, the evolution of different topics over time, i.e. their growth or decline, can be measured automatically.
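The second step described above, connecting the clusters of one publication year with those of the previous year and then measuring growth or decline, can be illustrated with a minimal sketch. It assumes the first step has already produced clusters, given here as lists of document titles per cluster id. The cosine similarity over title term counts, the threshold of 0.3, and all function and variable names are illustrative assumptions, not the method or parameters used in the paper.

```python
from collections import Counter
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def cluster_profile(titles):
    """Aggregate the title terms of one cluster into a single term-count vector."""
    return Counter(word for title in titles for word in title.lower().split())


def link_clusters(previous, current, threshold=0.3):
    """Link each current cluster to its most similar previous cluster.

    `previous` and `current` map cluster ids to lists of titles.
    A cluster with no previous cluster at or above `threshold`
    (an assumed cut-off) is treated as a new topic and mapped to None.
    """
    prev_profiles = {pid: cluster_profile(t) for pid, t in previous.items()}
    links = {}
    for cid, titles in current.items():
        profile = cluster_profile(titles)
        best_id, best_sim = None, threshold
        for pid, prev in prev_profiles.items():
            sim = cosine(profile, prev)
            if sim >= best_sim:
                best_id, best_sim = pid, sim
        links[cid] = best_id
    return links


def growth(previous, current, links):
    """Change in document count for every linked pair of clusters."""
    return {cid: len(current[cid]) - len(previous[pid])
            for cid, pid in links.items() if pid is not None}


# Hypothetical toy clusters for two consecutive publication years.
clusters_2010 = {
    "a": ["topic models for text", "latent topic models"],
    "b": ["citation network analysis"],
}
clusters_2011 = {
    "x": ["dynamic topic models", "topic models over time", "online topic models"],
    "y": ["protein folding simulation"],
}

links = link_clusters(clusters_2010, clusters_2011)
changes = growth(clusters_2010, clusters_2011, links)
```

In this toy data, cluster "x" is linked to "a" and has grown by one document, while "y" shares no terms with any earlier cluster and is treated as a new topic. Because linking uses only the finished clusters of each year, the clustering step can be replaced without touching this code, which reflects the independence of the two steps.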