March 2025
Conference Paper
Title
Variance-maximizing acoustic dimensions for machine learning-based soundscape assessment
Abstract
Machine learning (ML) models used in audio signal processing commonly rely on time-frequency representations, such as linear and log-Mel spectrograms, as input features. Although these two-dimensional representations allow advanced machine-vision techniques to be applied to audio tasks, they reduce the auditory and acoustic information to averages of sound energy (or amplitude) over a relatively coarse time-frequency grid. Prior research has indicated that a substantial amount of the information in audio signals can be represented by a few fundamental acoustic dimensions. These dimensions are derived from time series of low-level signal features motivated by human auditory perception within the soundscape framework, and they are constructed to maximize the variance explained across the selected sample. This study examines the use of these acoustic dimensions as inputs for ML tasks, including automatic acoustic scene classification (ASC) and acoustic event detection (AED), comparing them with traditional spectrogram representations. The comparison focuses on accuracy, robustness to unseen acoustic environments and events, computational complexity, and explainability. Finally, the potential advantages of employing these dimensions for robust, efficient, and automated soundscape assessment are discussed in relation to existing environmental monitoring tools.
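For concreteness, the sketch below contrasts the two input types the abstract refers to: a log-Mel spectrogram on a coarse time-frequency grid, and a compact set of variance-maximizing dimensions obtained by projecting low-level feature time series onto their principal components. This is not the paper's implementation; the specific feature set, parameter values, and the PCA-based projection are illustrative assumptions only.

```python
# Illustrative sketch (assumed feature set and parameters, not the paper's method):
# compare a conventional log-Mel spectrogram input with a few low-level feature
# time series reduced to variance-maximizing dimensions via PCA.
import numpy as np
import librosa
from sklearn.decomposition import PCA

sr = 22050
y = np.random.randn(2 * sr).astype(np.float32)  # stand-in for a field recording

# (1) Conventional 2-D input: log-Mel spectrogram, averaging energy over a
#     relatively coarse time-frequency grid.
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=2048,
                                     hop_length=512, n_mels=64)
log_mel = librosa.power_to_db(mel, ref=np.max)       # shape: (64, n_frames)

# (2) Low-level feature time series (an assumed, perception-motivated set).
feats = np.vstack([
    librosa.feature.rms(y=y),                        # energy envelope
    librosa.feature.spectral_centroid(y=y, sr=sr),   # brightness
    librosa.feature.spectral_flatness(y=y),          # tonal vs. noise-like
    librosa.feature.zero_crossing_rate(y),           # coarse noisiness cue
])                                                   # shape: (4, n_frames)

# Standardize per feature, then project frames onto the directions of maximal
# variance, yielding a compact set of "acoustic dimensions" per frame.
X = (feats - feats.mean(axis=1, keepdims=True)) / feats.std(axis=1, keepdims=True)
dims = PCA(n_components=2).fit_transform(X.T)        # shape: (n_frames, 2)

print(log_mel.shape, dims.shape)
```

Under these assumptions, the dimensional representation is far smaller per frame (here 2 values versus 64 Mel bands), which hints at the computational-complexity and explainability comparisons the study undertakes.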