Efficient learning for hashing proportional data

Xu, Zhao; Kersting, Kristian; Bauckhage, Christian

doi:10.1109/ICDM.2012.142

2012

Conference Paper

Abstract

Spectral hashing (SH) seeks compact binary codes of data points so that Hamming distances between codes correlate with data similarity. Quickly learning such codes typically boils down to principal component analysis (PCA). However, this is only justified for normally distributed data. For proportional data (normalized histograms), this is not the case. Due to the sum-to-unity constraint, features that are as independent as possible will not all be uncorrelated. In this paper, we show that a linear-time transformation efficiently copes with sum-to-unity constraints: first, we select a small number K of diverse data points by maximizing the volume of the simplex spanned by these prototypes; second, we represent each data point by means of its cosine similarities to the K selected prototypes. This maximum volume hashing is sensible since each dimension in the transformed space is likely to follow a von Mises (vM) distribution, and, in very high dimensions, the vM distribution closely resembles a Gaussian distribution. This justifies applying PCA to the transformed data. Our extensive experiments validate this: maximum volume hashing outperforms spectral hashing and other state-of-the-art techniques.
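
The two-step transformation described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The prototype selection below is a simple greedy heuristic swapped in for the paper's maximum-volume procedure (each step picks the point farthest from the affine hull of the prototypes chosen so far), and the final binarization thresholds PCA projections at zero rather than reproducing spectral hashing in full. The function names `select_prototypes`, `cosine_features`, and `pca_codes` are hypothetical, and the Dirichlet-sampled histograms are synthetic stand-ins for real proportional data.

```python
import numpy as np


def select_prototypes(X, k):
    """Greedily pick up to k rows of X that span a large simplex.

    Assumption: this greedy rule is a stand-in for the paper's
    maximum-volume selection, not a reproduction of it.
    """
    # Start from the point farthest from the data mean.
    first = int(np.argmax(np.linalg.norm(X - X.mean(axis=0), axis=1)))
    chosen = [first]
    # Offsets of every point from the first prototype.
    R = X - X[first]
    for _ in range(k - 1):
        # Residual norm = distance to the affine hull of chosen prototypes.
        norms = np.linalg.norm(R, axis=1)
        nxt = int(np.argmax(norms))
        if norms[nxt] < 1e-12:
            break  # every point already lies in the current affine hull
        chosen.append(nxt)
        # Remove the newly covered direction from all residuals.
        q = R[nxt] / norms[nxt]
        R = R - np.outer(R @ q, q)
    return X[chosen]


def cosine_features(X, P):
    """Represent each row of X by its cosine similarity to each prototype."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    Pn = P / np.linalg.norm(P, axis=1, keepdims=True)
    return Xn @ Pn.T


def pca_codes(F, n_bits):
    """Binary codes from the signs of the top n_bits PCA projections.

    Assumption: a simplified PCA-based hashing step, not full spectral hashing.
    """
    Fc = F - F.mean(axis=0)
    _, _, Vt = np.linalg.svd(Fc, full_matrices=False)
    return (Fc @ Vt[:n_bits].T > 0).astype(np.uint8)


# Synthetic proportional data: 200 normalized histograms with 20 bins each.
rng = np.random.default_rng(0)
X = rng.dirichlet(np.ones(20), size=200)

P = select_prototypes(X, k=8)   # K = 8 prototypes
F = cosine_features(X, P)       # 200 x 8 transformed representation
codes = pca_codes(F, n_bits=4)  # 200 x 4 binary codes
```

Each greedy step costs time linear in the number of data points, which is in the spirit of the linear-time claim in the abstract, although the exact selection rule here is an assumption.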

Mainwork

IEEE 12th International Conference on Data Mining, ICDM 2012. Proceedings. Pt.2

Conference

International Conference on Data Mining (ICDM) 2012
