Publica
Here you will find scientific publications from the Fraunhofer Institutes.

Yes we can: simplex volume maximization for descriptive web-scale matrix factorization
Postprint urn:nbn:de:0011n1486459 (767 KByte PDF) MD5 Fingerprint: 445918142c1623d76cc65373ac2e1bcd © ACM 2010 This is the author's version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution. Created on: 22.12.2010
 Huang, X.J.; Jones, Gareth; Koudas, Nick; Wu, Xindong; Collins-Thompson, Kevyn; Association for Computing Machinery ACM, Special Interest Group on Information Retrieval SIGIR; Association for Computing Machinery ACM, Special Interest Group on Hypertext, Hypermedia, and Web; Association for Computing Machinery ACM, Special Interest Group on Knowledge Discovery and Data Mining SIGKDD: CIKM 2010, 19th International Conference on Information & Knowledge Management and Co-located Workshops. CD-ROM: October 26-30, 2010, Toronto, Ontario, Canada, proceedings. New York: ACM, 2010. ISBN: 978-1-4503-0099-5, pp. 1785-1788
 International Conference on Information and Knowledge Management (CIKM) <19, 2010, Toronto> 

 English
 Conference paper, electronic publication
 Fraunhofer IAIS
Abstract
Matrix factorization methods are among the most common techniques for detecting latent components in data. Popular examples include the Singular Value Decomposition or Non-negative Matrix Factorization. Unfortunately, most methods suffer from high computational complexity and therefore do not scale to massive data. In this paper, we present a linear-time algorithm for the factorization of gigantic matrices that iteratively yields latent components. We consider a constrained matrix factorization such that the latent components form a simplex that encloses most of the remaining data. The algorithm maximizes the volume of that simplex and thereby reduces the displacement of data from the space spanned by the latent components. Hence, it also lowers the Frobenius norm, a common criterion for matrix factorization quality. Our algorithm is efficient, well-grounded in distance geometry, and easily applicable to matrices with billions of entries. In addition, the resulting factors allow for an intuitive interpretation of data: every data point can now be expressed as a convex combination of the most extreme, and thereby often most descriptive, instances in a collection of data. Extensive experimental validations on web-scale data, including 80 million images and 1.5 million Twitter tweets, demonstrate superior performance compared to related factorization or clustering techniques.
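The core idea of the abstract, selecting a few extreme data points whose simplex has maximal volume, so that all other points can be approximated as convex combinations of them, can be illustrated with a small toy sketch. This is not the paper's linear-time algorithm (which relies on distance-geometric bounds to avoid recomputing volumes); it is a naive greedy version under our own assumptions: function names, the Cayley-Menger volume formula, and the centroid-based seed heuristic are illustrative choices, not taken from the paper.

```python
import numpy as np
from math import factorial

def simplex_volume(points):
    """Volume of the simplex spanned by the rows of `points`,
    computed from pairwise distances via the Cayley-Menger determinant."""
    n = len(points)
    # squared pairwise distances between all selected points
    d2 = np.sum((points[:, None, :] - points[None, :, :]) ** 2, axis=-1)
    cm = np.ones((n + 1, n + 1))
    cm[0, 0] = 0.0
    cm[1:, 1:] = d2
    k = n - 1  # dimension of the simplex spanned by n points
    coef = (-1) ** (k + 1) / (2 ** k * factorial(k) ** 2)
    vol_sq = coef * np.linalg.det(cm)
    return np.sqrt(max(vol_sq, 0.0))  # clip tiny negatives from round-off

def greedy_simplex_selection(X, k):
    """Greedily pick k rows of X so that the simplex they span has
    (approximately) maximal volume; the chosen rows are extreme points."""
    # seed with the point farthest from the centroid (a common heuristic)
    selected = [int(np.argmax(np.linalg.norm(X - X.mean(axis=0), axis=1)))]
    while len(selected) < k:
        best_i, best_vol = -1, -1.0
        for i in range(len(X)):
            if i in selected:
                continue
            vol = simplex_volume(X[selected + [i]])
            if vol > best_vol:
                best_i, best_vol = i, vol
        selected.append(best_i)
    return selected

# toy data: three triangle vertices followed by interior points
X = np.array([[0, 0], [4, 0], [0, 4], [1, 1], [2, 1], [1, 2]], dtype=float)
print(sorted(greedy_simplex_selection(X, 3)))  # → [0, 1, 2], the vertices
```

Once the extreme points are found, each remaining row of X can be approximated by a convex combination of them (e.g., via least squares with non-negativity and sum-to-one constraints), which is what gives the factors their interpretability: every data point is described in terms of the most extreme instances.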