Acoustic scene classification by combining autoencoder-based dimensionality reduction and convolutional neural networks
Motivated by the recent success of deep learning techniques in various audio analysis tasks, this work presents a distributed sensor-server system for acoustic scene classification in urban environments based on deep convolutional neural networks (CNN). Stacked autoencoders are used to compress extracted spectrogram patches on the sensor side before being transmitted to and classified on the server side. In our experiments, we compare two state-of-the-art CNN architectures subject to their classification accuracy under the presence of environmental noise, the dimensionality reduction in the encoding stage, as well as a reduced number of filters in the convolution layers. Our results show that the best model configuration leads to a classification accuracy of 75% for 5 acoustic scenes. We furthermore discuss which confusions among particular classes can be ascribed to particular sound event types, which are present in multiple acoustic scene classes.