Convex nonnegative matrix factorization for massive datasets
 Knowledge and information systems 29 (2011), No.2, pp.457478 ISSN: 02191377 ISSN: 02193116 

 English 
 Journal Article 
 Fraunhofer IAIS () 
 matrix factorization; lowrank approximation; data mining; information retrieval; largescale data analysis 
Abstract
Nonnegative matrix factorization (NMF) has become a standard tool in data mining, information retrieval, and signal processing. It is used to factorize a nonnegative data matrix into two nonnegative matrix factors that contain basis elements and linear coefficients, respectively. Often, the columns of the first resulting factor are interpreted as "cluster centroids" of the input data, and the columns of the second factor are understood to contain cluster membership indicators. When analyzing data such as collections of gene expressions, documents, or images, it is often beneficial to ensure that the resulting cluster centroids are meaningful, for instance, by restricting them to be convex combinations of data points. However, known approaches to convexNMF suffer from high computational costs and therefore hardly apply to largescale data analysis problems. This paper presents a new framework for convexNMF that allows for an efficient factorization of data matrices of millions of data points. Triggered by the simple observation that each data point can be expressed as a convex combination of vertices of the data convex hull, we require the basic factors to be vertices of the data convex hull. The benefits of convexhull NMF are twofold. First, for a growing number of data points the expected size of the convex hull, i.e. the number of its vertices, grows much slower than the dataset. Second, distance preserving lowdimensional embeddings allow us to efficiently sample the convex hull and hence to quickly determine candidate vertices. Our extensive experimental evaluation on large datasets shows that convexhull NMF compares favorably to convexNMF in terms of both speed and reconstruction quality. We demonstrate that our method can easily be applied to largescale, realworld datasets, in our case consisting of 750,000 DBLP entries, 4,000,000 digital images, and 150,000,000 votes on World of Warcraft ®guilds, respectively.