  • Publication
    Deterministic CUR for improved large-scale data analysis: An empirical study
    Low-rank approximations computed from selected rows and columns of a given data matrix have attracted considerable attention lately. They have been proposed as an alternative to the SVD because they naturally lead to interpretable decompositions, which have been shown to be successful in applications such as fraud detection, fMRI segmentation, and collaborative filtering. The CUR decomposition of large matrices, for example, samples rows and columns according to a probability distribution that depends on the Euclidean norm of rows or columns or on other measures of statistical leverage. At the same time, there are various deterministic approaches that do not resort to sampling and were often found to yield factorizations of superior quality with respect to reconstruction accuracy. However, these are hardly applicable to large matrices as they typically suffer from high computational costs. Consequently, many practitioners in the field of data mining have abandoned deterministic approaches in favor of randomized ones when dealing with today's large-scale data sets. In this paper, we empirically disprove this prejudice. We do so by introducing a novel, linear-time, deterministic CUR approach that adopts the recently introduced Simplex Volume Maximization approach for column selection. The latter has already been proven successful for NMF-like decompositions of matrices with billions of entries. Our exhaustive empirical study on more than 30 synthetic and real-world data sets demonstrates that it is also beneficial for CUR-like decompositions. Compared to other deterministic CUR-like methods, it provides comparable reconstruction quality but operates much faster, so that it easily scales to matrices of billions of elements. Compared to sampling-based methods, it provides competitive reconstruction quality while staying in the same run-time complexity class.
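    To make the CUR structure concrete, here is a minimal sketch in NumPy. It is not the paper's method: the deterministic column and row selection via Simplex Volume Maximization is replaced by a simple norm-based stand-in, and the link matrix U is computed as the Frobenius-optimal pseudoinverse combination.

```python
import numpy as np

def cur_decomposition(A, k):
    """Approximate A ~= C @ U @ R from k selected columns and k selected rows.

    Column/row selection here is a norm-based heuristic, not the SiVM-based
    deterministic selection proposed in the paper.
    """
    col_idx = np.argsort(np.linalg.norm(A, axis=0))[-k:]   # k largest-norm columns
    row_idx = np.argsort(np.linalg.norm(A, axis=1))[-k:]   # k largest-norm rows
    C = A[:, col_idx]
    R = A[row_idx, :]
    # U minimizes ||A - C U R||_F given the chosen C and R.
    U = np.linalg.pinv(C) @ A @ np.linalg.pinv(R)
    return C, U, R

A = np.random.rand(200, 100)
C, U, R = cur_decomposition(A, 10)
err = np.linalg.norm(A - C @ U @ R) / np.linalg.norm(A)
print(f"relative reconstruction error: {err:.3f}")
```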
  • Publication
    Early drought stress detection in cereals: Simplex volume maximization for hyperspectral image analysis
    (2012)
    Römer, Christoph
    ;
    Ballvora, Agim
    ;
    Pinto, Francisco
    ;
    Rossini, Micol
    ;
    Panigada, Cinzia
    ;
    Behmann, Jan
    ;
    Léon, Jens
    ;
    Rascher, Uwe
    ;
    Plümer, Lutz
    Early water stress recognition is of great relevance in precision plant breeding and production. Hyperspectral imaging sensors can be a valuable tool for early stress detection with high spatio-temporal resolution. They gather large, high-dimensional data cubes that pose a significant challenge to data analysis. Classical supervised learning algorithms often fail in applied plant sciences due to their need for labelled datasets, which are difficult to obtain. Therefore, new approaches for unsupervised learning of relevant patterns are needed. We apply for the first time a recent matrix factorisation technique, simplex volume maximisation (SiVM), to hyperspectral data. It is an unsupervised classification approach, optimised for fast computation on massive datasets. It allows calculation of how similar each spectrum is to observed typical spectra, which provides a means to express how likely it is that a plant is suffering from stress. The method was tested for drought stress, applied to potted barley plants in a controlled rain-out shelter experiment and to agricultural corn plots subjected to a two-factorial field setup altering water and nutrient availability. Both experiments were conducted at the canopy level. SiVM performed significantly better than a combination of established vegetation indices. In the corn plots, SiVM clearly separated the different treatments, even though the effects on leaf and canopy traits were subtle.
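    A rough sketch of the SiVM-style workflow on spectra, assuming NumPy and SciPy: greedily pick a few mutually extreme "typical" spectra, then express every pixel as non-negative coefficients over them. The farthest-point rule below is a simplification of the actual simplex-volume criterion, and the data are synthetic.

```python
import numpy as np
from scipy.optimize import nnls

def select_archetypes(X, k):
    """Greedily pick k rows of X (spectra) that are mutually far apart."""
    idx = [int(np.argmax(np.linalg.norm(X - X.mean(axis=0), axis=1)))]
    while len(idx) < k:
        # distance of every spectrum to its nearest already-selected archetype
        d = np.min(np.linalg.norm(X[:, None, :] - X[idx][None, :, :], axis=2), axis=1)
        idx.append(int(np.argmax(d)))
    return X[idx]

def abundances(X, W):
    """Non-negative coefficients expressing each spectrum via the archetypes."""
    return np.array([nnls(W.T, x)[0] for x in X])

X = np.random.rand(500, 120)   # toy cube: 500 pixels, 120 spectral bands
W = select_archetypes(X, 4)    # 4 "typical" spectra
H = abundances(X, W)           # per-pixel similarity to each archetype
print(H.shape)                 # (500, 4)
```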
  • Publication
    How players lose interest in playing a game: An empirical study based on distributions of total playing times
    (2012)
    Drachen, Anders
    ;
    Canossa, Alessandro
    Analyzing telemetry data of player behavior in computer games is a topic of increasing interest for industry and research alike. When applied to game telemetry data, pattern recognition and statistical analysis provide valuable business intelligence tools for game development. An important problem in this area is to characterize how player engagement in a game evolves over time. Reliable models are of pivotal interest since they allow for assessing the long-term success of game products and can provide estimates of how long players may be expected to keep actively playing a game. In this paper, we introduce methods from random process theory into game data mining in order to draw inferences about player engagement. Given large samples (over 250,000 players) of behavioral telemetry data from five different action-adventure and shooter games, we extract information as to how long individual players have played these games and apply techniques from lifetime analysis to identify common patterns. In all five cases, we find that the Weibull distribution gives a good account of the statistics of total playing times. This implies that an average player's interest in playing one of the games considered evolves according to a non-homogeneous Poisson process. Therefore, given data on the initial playtime behavior of the players of a game, it becomes possible to predict when they stop playing.
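    A minimal sketch of the lifetime-analysis step, assuming SciPy: fit a Weibull distribution to a sample of total playing times and inspect the shape parameter. The sample below is synthetic and only stands in for real telemetry.

```python
import numpy as np
from scipy import stats

# Synthetic "total playing time" sample in hours (stand-in for real telemetry).
playtimes = stats.weibull_min.rvs(c=0.8, scale=20.0, size=10_000, random_state=0)

# Fit shape and scale; the location is pinned to zero because playtimes are lifetimes.
shape, loc, scale = stats.weibull_min.fit(playtimes, floc=0)
print(f"Weibull shape={shape:.2f}, scale={scale:.2f}")

# A shape parameter below 1 means a decreasing hazard rate: the longer someone
# has already played, the less likely they are to quit in the next time unit.
ks = stats.kstest(playtimes, "weibull_min", args=(shape, loc, scale))
print(f"Kolmogorov-Smirnov statistic: {ks.statistic:.4f}")
```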
  • Publication
    More influence means less work: Fast Latent Dirichlet Allocation by influence scheduling
    Name ambiguity arises from the polysemy of names and causes uncertainty about the true identity of entities referenced in unstructured text. This is a major problem in areas like information retrieval or knowledge management, for example when searching for a specific entity or updating an existing knowledge base. We approach this problem of named entity disambiguation (NED) using thematic information derived from Latent Dirichlet Allocation (LDA) to compare the entity mention's context with candidate entities in Wikipedia, represented by their respective articles. We evaluate various distances over topic distributions in a supervised classification setting to find the best-suited candidate entity, which is either covered in Wikipedia or unknown. We compare our approach to a state-of-the-art method and show that it achieves significantly better predictive performance, both for entities covered in Wikipedia and for uncovered entities. We show that our approach is, in general, language-independent, as we obtain equally good results for named entity disambiguation using the English, the German and the French Wikipedia.
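    A hedged sketch of the core idea, assuming scikit-learn and SciPy: represent a mention's context and each candidate Wikipedia article as LDA topic distributions and rank candidates by a distance between those distributions. The texts, candidate names, and the choice of Jensen-Shannon distance are illustrative only, not the paper's exact setup.

```python
import numpy as np
from scipy.spatial.distance import jensenshannon
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical candidate articles for the ambiguous mention "jaguar".
candidate_articles = {
    "Jaguar (animal)": "large cat species native to the americas rainforest predator",
    "Jaguar Cars": "british automobile manufacturer luxury vehicles coventry",
}
mention_context = "the jaguar stalked its prey through the dense rainforest"

# Fit a tiny LDA model on the candidates plus the mention context (toy corpus).
texts = list(candidate_articles.values()) + [mention_context]
counts = CountVectorizer(stop_words="english").fit_transform(texts)
lda = LatentDirichletAllocation(n_components=5, random_state=0).fit(counts)
topics = lda.transform(counts)

# Rank candidates by topic-distribution distance to the mention context.
context_topics = topics[-1]
for name, cand_topics in zip(candidate_articles, topics[:-1]):
    print(name, round(float(jensenshannon(context_topics, cand_topics)), 3))
```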
  • Publication
    Yes we can - simplex volume maximization for descriptive web-scale matrix factorization
    Matrix factorization methods are among the most common techniques for detecting latent components in data. Popular examples include the Singular Value Decomposition or Non-negative Matrix Factorization. Unfortunately, most methods suffer from high computational complexity and therefore do not scale to massive data. In this paper, we present a linear-time algorithm for the factorization of gigantic matrices that iteratively yields latent components. We consider a constrained matrix factorization such that the latent components form a simplex that encloses most of the remaining data. The algorithm maximizes the volume of that simplex and thereby reduces the displacement of data from the space spanned by the latent components. Hence, it also lowers the Frobenius norm, a common criterion for matrix factorization quality. Our algorithm is efficient, well-grounded in distance geometry, and easily applicable to matrices with billions of entries. In addition, the resulting factors allow for an intuitive interpretation of data: every data point can now be expressed as a convex combination of the most extreme and thereby often most descriptive instances in a collection of data. Extensive experimental validations on web-scale data, including 80 million images and 1.5 million Twitter tweets, demonstrate superior performance compared to related factorization or clustering techniques.
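    The distance-geometry core of the approach can be illustrated with a short sketch, assuming NumPy: the volume of the simplex spanned by selected data points is computed from pairwise distances via the Cayley-Menger determinant, and points are added greedily to grow that volume. The exhaustive greedy loop below is for illustration only; the paper's algorithm relies on a linear-time approximation of this criterion.

```python
import math
import numpy as np

def simplex_volume(points):
    """Volume of the (k-1)-simplex spanned by k points, via the Cayley-Menger determinant."""
    k = len(points)
    D = np.square(np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2))
    CM = np.ones((k + 1, k + 1))
    CM[0, 0] = 0.0
    CM[1:, 1:] = D
    coef = (-1) ** k / (2 ** (k - 1) * math.factorial(k - 1) ** 2)
    return math.sqrt(max(coef * np.linalg.det(CM), 0.0))

def sivm_select(X, k):
    """Greedily pick k rows of X so that their simplex volume grows at every step."""
    idx = [int(np.argmax(np.linalg.norm(X, axis=1)))]   # start from an extreme point
    while len(idx) < k:
        volumes = [
            (simplex_volume(X[idx + [j]]), j) for j in range(len(X)) if j not in idx
        ]
        idx.append(max(volumes)[1])                     # point that maximizes the volume
    return idx

X = np.random.rand(1000, 3)       # toy data
print(sivm_select(X, 4))          # indices of 4 extreme, descriptive instances
```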