  • Publication
    A Comparison of Methods for Player Clustering via Behavioral Telemetry
    The analysis of user behavior in digital games has been aided by the introduction of user telemetry in game development, which provides unprecedented access to quantitative data on user behavior from the installed game clients of the entire population of players. Player behavior telemetry datasets can be exceptionally complex, with features recorded for a varying population of users over a temporal segment that can reach years in duration. Categorization of behaviors, whether through descriptive methods (e.g. segmentation) or unsupervised/supervised learning techniques, is valuable for finding patterns in the behavioral data and for developing profiles that are actionable to game developers. There are numerous methods for unsupervised clustering of user behavior, e.g. k-means/c-means, Non-negative Matrix Factorization, or Principal Component Analysis. Although all yield behavior categorizations, interpretation of the resulting categories in terms of actual play behavior can be difficult if not impossible. In this paper, a range of unsupervised techniques are applied together with Archetypal Analysis to develop behavioral clusters from playtime data of 70,014 World of Warcraft players, covering a five-year interval. The techniques are evaluated with respect to their ability to develop actionable behavioral profiles from the dataset.
  • Publication
    Deterministic CUR for improved large-scale data analysis: An empirical study
    Low-rank approximations which are computed from selected rows and columns of a given data matrix have attracted considerable attention lately. They have been proposed as an alternative to the SVD because they naturally lead to interpretable decompositions, which have been shown to be successful in applications such as fraud detection, fMRI segmentation, and collaborative filtering. The CUR decomposition of large matrices, for example, samples rows and columns according to a probability distribution that depends on the Euclidean norm of rows or columns or on other measures of statistical leverage. At the same time, there are various deterministic approaches that do not resort to sampling and were found to often yield factorizations of superior quality with respect to reconstruction accuracy. However, these are hardly applicable to large matrices as they typically suffer from high computational costs. Consequently, many practitioners in the field of data mining have abandoned deterministic approaches in favor of randomized ones when dealing with today's large-scale data sets. In this paper, we empirically disprove this prejudice. We do so by introducing a novel, linear-time, deterministic CUR approach that adopts the recently introduced Simplex Volume Maximization approach for column selection. The latter has already been proven to be successful for NMF-like decompositions of matrices of billions of entries. Our exhaustive empirical study on more than 30 synthetic and real-world data sets demonstrates that it is also beneficial for CUR-like decompositions. Compared to other deterministic CUR-like methods, it provides comparable reconstruction quality but operates much faster so that it easily scales to matrices of billions of elements. Compared to sampling-based methods, it provides competitive reconstruction quality while staying in the same run-time complexity class.
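The norm-based sampling step that the abstract attributes to randomized CUR can be sketched as follows. This is a minimal illustration, not the paper's method: it assumes the squared Euclidean column norm as the sampling weight (one common choice), and the function names and the toy matrix `A` are hypothetical.

```python
import random

def column_sampling_probabilities(matrix):
    """Return p_j = ||column j||^2 / ||matrix||_F^2 for a list-of-rows matrix."""
    n_cols = len(matrix[0])
    sq_norms = [sum(row[j] ** 2 for row in matrix) for j in range(n_cols)]
    total = sum(sq_norms)
    return [s / total for s in sq_norms]

def sample_columns(matrix, c, seed=0):
    """Draw c column indices (with replacement) from the norm-based distribution."""
    rng = random.Random(seed)
    probs = column_sampling_probabilities(matrix)
    return rng.choices(range(len(probs)), weights=probs, k=c)

# Hypothetical toy matrix: the middle column is all zeros and is never sampled.
A = [[3.0, 0.0, 1.0],
     [4.0, 0.0, 1.0]]
probs = column_sampling_probabilities(A)
picked = sample_columns(A, c=2)
```

A deterministic selection rule, as studied in the paper, would replace `sample_columns` with a rule that needs no random draws.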
  • Publication
    Early drought stress detection in cereals: Simplex volume maximization for hyperspectral image analysis
    (2012) Römer, Christoph; Ballvora, Agim; Pinto, Francisco; Rossini, Micol; Cinzia, Panigada; Behmann, Jan; Léon, Jens; Rascher, Uwe; Plümer, Lutz
    Early water stress recognition is of great relevance in precision plant breeding and production. Hyperspectral imaging sensors can be a valuable tool for early stress detection with high spatio-temporal resolution. They gather large, high-dimensional data cubes, posing a significant challenge to data analysis. Classical supervised learning algorithms often fail in applied plant sciences due to their need for labelled datasets, which are difficult to obtain. Therefore, new approaches for unsupervised learning of relevant patterns are needed. We apply for the first time a recent matrix factorisation technique, simplex volume maximisation (SiVM), to hyperspectral data. It is an unsupervised classification approach, optimised for fast computation of massive datasets. It allows calculation of how similar each spectrum is to observed typical spectra. This provides the means to express how likely it is that one plant is suffering from stress. The method was tested for drought stress, applied to potted barley plants in a controlled rain-out shelter experiment and to agricultural corn plots subjected to a two-factorial field setup altering water and nutrient availability. Both experiments were conducted on the canopy level. SiVM was significantly better than using a combination of established vegetation indices. In the corn plots, SiVM clearly separated the different treatments, even though the effects on leaf and canopy traits were subtle.
  • Publication
    How players lose interest in playing a game: An empirical study based on distributions of total playing times
    (2012) Drachen, Anders; Canossa, Alessandro
    Analyzing telemetry data of player behavior in computer games is a topic of increasing interest for industry and research alike. When applied to game telemetry data, pattern recognition and statistical analysis provide valuable business intelligence tools for game development. An important problem in this area is to characterize how player engagement in a game evolves over time. Reliable models are of pivotal interest since they allow for assessing the long-term success of game products and can provide estimates of how long players may be expected to keep actively playing a game. In this paper, we introduce methods from random process theory into game data mining in order to draw inferences about player engagement. Given large samples (over 250,000 players) of behavioral telemetry data from five different action-adventure and shooter games, we extract information as to how long individual players have played these games and apply techniques from lifetime analysis to identify common patterns. In all five cases, we find that the Weibull distribution gives a good account of the statistics of total playing times. This implies that an average player's interest in playing one of the games considered evolves according to a non-homogeneous Poisson process. Therefore, given data on the initial playtime behavior of the players of a game, it becomes possible to predict when they stop playing.
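To make the Weibull claim concrete, the sketch below evaluates the standard Weibull survival and hazard functions and the conditional probability that a player who has already played `t_now` hours is still playing at `t_future`. It is an illustration under stated assumptions, not the paper's fitting procedure: the function names are hypothetical, and the shape and scale values used in the example are made up rather than taken from the study.

```python
import math

def weibull_survival(t, shape, scale):
    """P(total playtime > t) under a Weibull model."""
    return math.exp(-((t / scale) ** shape))

def weibull_hazard(t, shape, scale):
    """Instantaneous quitting rate at playtime t (t > 0)."""
    return (shape / scale) * (t / scale) ** (shape - 1)

def prob_still_playing(t_future, t_now, shape, scale):
    """P(playtime > t_future | playtime > t_now), for t_future >= t_now."""
    return weibull_survival(t_future, shape, scale) / weibull_survival(t_now, shape, scale)

# Hypothetical parameters: shape 1.0 reduces to the exponential (constant-rate) case.
constant_rate = weibull_hazard(5.0, shape=1.0, scale=10.0)
retention = prob_still_playing(20.0, 10.0, shape=1.0, scale=10.0)
```

A shape parameter other than 1 makes the hazard depend on the time already played, which is what allows early playtime behavior to inform predictions of when players stop.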
  • Publication
    Adapting information theoretic clustering to binary images
    We consider the problem of finding points of interest along local curves of binary images. Information theoretic vector quantization is a clustering algorithm that shifts cluster centers towards the modes of principal curves of a data set. Its runtime characteristics, however, do not allow for efficient processing of many data points. In this paper, we show how to solve this problem when dealing with data on a 2D lattice. Borrowing concepts from signal processing, we adapt information theoretic clustering to the quantization of binary images and gain significant speedup.
  • Publication
    Age recognition in the wild
    In this paper, we present a novel approach to age recognition from facial images. The method we propose combines several established features in order to characterize facial characteristics and aging patterns. Since we explicitly consider age recognition in the wild, i.e. vast amounts of unconstrained Internet images, the methods we employ are tailored towards speed and efficiency. For evaluation, we test different classifiers on common benchmark data and a new data set of unconstrained images harvested from the Internet. Extensive experimental evaluation shows state-of-the-art performance on the benchmarks, very high accuracy for the novel data set, and superior runtime performance; to our knowledge, this is the first time that automatic age recognition is carried out on a large Internet data set.
  • Publication
    Yes we can - simplex volume maximization for descriptive web-scale matrix factorization
    Matrix factorization methods are among the most common techniques for detecting latent components in data. Popular examples include the Singular Value Decomposition or Non-negative Matrix Factorization. Unfortunately, most methods suffer from high computational complexity and therefore do not scale to massive data. In this paper, we present a linear time algorithm for the factorization of gigantic matrices that iteratively yields latent components. We consider a constrained matrix factorization s.t. the latent components form a simplex that encloses most of the remaining data. The algorithm maximizes the volume of that simplex and thereby reduces the displacement of data from the space spanned by the latent components. Hence, it also lowers the Frobenius norm, a common criterion for matrix factorization quality. Our algorithm is efficient, well-grounded in distance geometry, and easily applicable to matrices with billions of entries. In addition, the resulting factors allow for an intuitive interpretation of data: every data point can now be expressed as a convex combination of the most extreme and thereby often most descriptive instances in a collection of data. Extensive experimental validations on web-scale data, including 80 million images and 1.5 million Twitter tweets, demonstrate superior performance compared to related factorization or clustering techniques.
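The idea of describing data through its most extreme instances can be illustrated with a small sketch. The abstract does not spell out the selection rule, so the code below uses a plain greedy farthest-point heuristic as a simplified stand-in; it is not the Simplex Volume Maximization algorithm itself. All names and the toy points are hypothetical.

```python
import math

def dist(a, b):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def select_extremes(points, k):
    """Greedy farthest-point selection of k indices (a stand-in, not SiVM)."""
    dim = len(points[0])
    mean = [sum(p[d] for p in points) / len(points) for d in range(dim)]
    # Start from the point farthest from the data mean.
    chosen = [max(range(len(points)), key=lambda i: dist(points[i], mean))]
    while len(chosen) < k:
        remaining = [i for i in range(len(points)) if i not in chosen]
        # Add the point with the largest summed distance to the chosen ones.
        chosen.append(max(remaining,
                          key=lambda i: sum(dist(points[i], points[j]) for j in chosen)))
    return chosen

def convex_combination(weights, vertices):
    """Mix vertices with non-negative weights that sum to one."""
    assert all(w >= 0 for w in weights) and abs(sum(weights) - 1.0) < 1e-9
    dim = len(vertices[0])
    return tuple(sum(w * v[d] for w, v in zip(weights, vertices)) for d in range(dim))

points = [(0.0, 0.0), (4.0, 0.0), (0.0, 4.0), (1.0, 1.0), (2.0, 1.0)]
extremes = select_extremes(points, 3)
# The interior point (1, 1) as a mixture of the three triangle corners.
mixed = convex_combination([0.5, 0.25, 0.25], [points[0], points[1], points[2]])
```

On this toy data the three selected indices are the corners of the triangle, and the interior point `(1.0, 1.0)` is recovered as a weighted mixture of them, which mirrors the interpretability argument made in the abstract.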
  • Publication
    The snippet statistics of font recognition
    This paper considers the topic of automatic font recognition. The task is to recognize a specific font from a text snippet. Unlike previous contributions, we evaluate how the frequencies of certain letters or words influence automatic recognition systems. The evaluation provides estimates on the general feasibility of font recognition under various changing conditions. Results on a data set containing 747 different fonts show that precision can vary between 16% and 94%, depending on (i) which letters are provided, (ii) how many letters are provided, and (iii) which language is used, as these factors considerably influence the text snippet statistics. As a second contribution, we introduce a novel bag-of-features based approach to font recognition.
  • Publication
    The good, the bad, and the ugly
    Automatic classification of the aesthetic content of a picture is one of the challenges in the emerging discipline of computational aesthetics. Any suitable solution must cope with the facts that aesthetic experiences are highly subjective and that a commonly agreed upon theory of their psychological constituents is still missing. In this paper, we present results obtained from an empirical basis of several thousand images. We train SVM-based classifiers to predict aesthetic adjectives rather than aesthetic scores and we introduce a probabilistic postprocessing step that alleviates effects due to misleadingly labeled training data. Extensive experimentation indicates that aesthetics classification is possible to a large extent. In particular, we find that previously established low-level features are well suited to recognize beauty. Robust recognition of unseemliness, on the other hand, appears to require more high-level analysis.
  • Publication
    Analyzing the evolution of social groups in World of Warcraft
    This paper investigates the evolution of social structures in the game World of Warcraft. We analyze 192 million recordings of 18 million characters belonging to 1.4 million teams, spanning a period of 4 years. Using a recent matrix factorization method, we extract lower dimensional data embeddings. The embeddings provide intuitively interpretable categorizations and we find a tendency towards guilds comprised of casual gamers. To our knowledge, this is the first study considering such a vast amount of data for analyzing groups in MMORPGs.