Scalable environmental sound analysis
This paper describes a method for analyzing environmental audio events. Audio events are modeled with a common universal codebook built on the bag-of-frames (BOF) approach. Frame-level features extracted from all audio files are grouped into clusters with the k-means algorithm, and each audio file is then modeled by the normalized histogram of cluster assignments over its frames, so that every file is described by a single vector. The audio data are arranged into a feature-file matrix analogous to the term-document matrix in Latent Semantic Indexing (LSI), and LSI is applied to map the data into a latent semantic space. The primary file descriptions are then converted into vectors of similarities to anchor reference data, for which the training files are used: each component of such a vector is a probabilistic similarity between the target file and the anchor reference file associated with that component. LSI is applied once more to this new feature-file matrix, mapping the data into a latent semantic space over the anchor references. The nearest-neighbor (NN) algorithm is used for audio recognition and retrieval. The proposed representation improves the results of both audio retrieval and recognition.
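The codebook construction and the first LSI step described above can be sketched as follows. This is a minimal numpy illustration, not the paper's implementation: the synthetic random frames stand in for real acoustic features (e.g. MFCCs), and the codebook size k=8 and latent dimensionality r=3 are arbitrary choices for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

def kmeans(X, k, iters=20):
    # Initialize centroids from k randomly chosen frames.
    centroids = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        # Assign each frame to its nearest centroid.
        d = np.linalg.norm(X[:, None] - centroids[None], axis=2)
        labels = d.argmin(axis=1)
        # Recompute each centroid as the mean of its cluster.
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return centroids

def bof_histogram(frames, codebook):
    # Map each frame to its nearest codeword; normalize the counts
    # so each file is described by a single distribution vector.
    d = np.linalg.norm(frames[:, None] - codebook[None], axis=2)
    counts = np.bincount(d.argmin(axis=1), minlength=len(codebook))
    return counts / counts.sum()

# Synthetic frame-level features for six files (placeholders for
# real features extracted from audio).
files = [rng.normal(size=(50, 4)) + i % 3 for i in range(6)]
codebook = kmeans(np.vstack(files), k=8)

# Feature-file matrix: one normalized BOF histogram per column,
# analogous to a term-document matrix.
F = np.stack([bof_histogram(f, codebook) for f in files], axis=1)

# LSI via truncated SVD: the scaled right singular vectors give the
# file coordinates in the latent semantic space.
U, s, Vt = np.linalg.svd(F, full_matrices=False)
r = 3  # latent dimensionality (a free parameter)
latent_files = (np.diag(s[:r]) @ Vt[:r]).T  # one row per file
```

The second stage would proceed analogously: similarities between `latent_files` rows and the training (anchor) files form a new matrix, to which the same SVD-based mapping is applied before NN search.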