2020
Conference Paper
Title
On open-set classification with L3-net embeddings for machine listening applications
Abstract
Obtaining labeled data for machine listening applications is expensive because labeling audio data requires humans to listen to recordings. However, state-of-the-art deep learning based systems usually require large amounts of labeled training data. One solution to this problem is to train a neural network on a large collection of unlabeled data to extract embeddings and then use these embeddings to train a shallow classifier on a small labeled dataset suited to the application. One example is Look, Listen, and Learn (L3-Net) embeddings, which are trained in a self-supervised manner to capture audio-visual correspondence in videos. Since shallow classifiers are trained discriminatively and thus tacitly assume a closed-set classification task, they do not perform well in open-set classification tasks. This paper presents a neural network that combines all L3-Net embeddings belonging to one recording into a single vector by using an x-vector mechanism, as well as an open-set classification system based on this network. In experiments conducted on the open-set acoustic scene classification task of the DCASE 2019 challenge, the proposed system significantly outperforms a shallow discriminative classifier and all other previously published systems, while performing as well as a shallow classifier on multiple closed-set machine listening datasets.
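To make the two ideas in the abstract concrete, the following is a minimal sketch and not the authors' implementation. It assumes that the x-vector mechanism aggregates frame-level embeddings by statistics pooling (per-dimension mean and standard deviation over time), which is the pooling step commonly associated with x-vectors. It also illustrates open-set behavior with a generic score threshold. The function names `stats_pool` and `open_set_predict`, the threshold value, and the toy inputs are all hypothetical. The abstract does not specify the network layers around the pooling step or the actual open-set decision rule.

```python
import math
from typing import List, Optional, Sequence


def stats_pool(frames: Sequence[Sequence[float]]) -> List[float]:
    """Collapse a variable number of frame-level embeddings into one vector.

    Returns the per-dimension mean followed by the per-dimension
    (population) standard deviation, so the output has 2 * dim entries.
    Assumption: this stands in for the pooling step of an x-vector network.
    """
    if not frames:
        raise ValueError("need at least one frame embedding")
    n = len(frames)
    dim = len(frames[0])
    means = [sum(f[d] for f in frames) / n for d in range(dim)]
    stds = [
        math.sqrt(sum((f[d] - means[d]) ** 2 for f in frames) / n)
        for d in range(dim)
    ]
    return means + stds


def open_set_predict(scores: Sequence[float], threshold: float) -> Optional[int]:
    """Return the index of the best-scoring known class, or None for 'unknown'.

    Assumption: a plain threshold on the top score, used here only to show
    what an open-set decision looks like; it is not the paper's rule.
    """
    best = max(range(len(scores)), key=lambda i: scores[i])
    return best if scores[best] >= threshold else None


# Two toy 2-dimensional "frame embeddings" from one recording.
recording_vector = stats_pool([[1.0, 2.0], [3.0, 2.0]])

# A confident score vector is accepted, a flat one is rejected as unknown.
accepted = open_set_predict([0.2, 0.7, 0.1], threshold=0.5)
rejected = open_set_predict([0.4, 0.35, 0.25], threshold=0.5)
```

A closed-set classifier would always return the top-scoring class. The `None` branch is what lets the system flag recordings from classes it did not see during training.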