Speech-dependent data augmentation for own voice reconstruction with hearable microphones in noisy environments

Ohlenbusch, Mattes; Rollwage, Christian; Doclo, Simon

doi:10.1186/s13636-025-00418-1

July 31, 2025

Journal Article

Abstract

Hearable devices, equipped with one or more microphones, can be used to capture the user’s own voice in noisy environments. In such environments, an own voice reconstruction (OVR) system is needed to enhance the quality and intelligibility of the recorded own voice. In this work, we aim to estimate clean broadband speech from a microphone at the outer face of the hearable and an in-ear microphone, which captures the own voice at a higher signal-to-noise ratio than the outer microphone, but with a limited bandwidth and additive body-produced noise. Training a supervised deep learning-based OVR system requires a substantial amount of own voice signals as training data. Such training data can be collected by recording many utterances from different talkers wearing the hearable, which is costly, or generated by augmenting existing clean speech datasets. In this paper, we investigate several data augmentation techniques to simulate a large amount of in-ear own voice signals from a limited amount of recorded own voice signals. More specifically, we consider different models for the own voice transfer characteristics between the outer microphone and the in-ear microphone, ranging from a fixed talker-averaged relative transfer function to a phoneme-dependent individual model. We investigate the influence of the amount of recorded own voice signals on the performance of an OVR system based on the FT-JNF architecture, either by directly using the recorded signals for training or by using the recorded signals to generate augmented data for training (with and without fine-tuning with recorded signals). Experimental results show that training using the proposed speech-dependent individual data augmentation technique and additional fine-tuning with recorded signals yields the best performance in terms of objective metrics, even when only a few recorded own voice signals are available.

Author(s)

Ohlenbusch, Mattes

Fraunhofer-Institut für Digitale Medientechnologie IDMT

Rollwage, Christian

Fraunhofer-Institut für Digitale Medientechnologie IDMT

Doclo, Simon

Fraunhofer-Institut für Digitale Medientechnologie IDMT

Journal

EURASIP Journal on Audio, Speech, and Music Processing : EURASIP JASMP
