Text Vs. Speech? Detecting Audio Deepfakes on Instagram

Schäfer, Karla

doi:10.1007/978-3-032-21300-6_32

2026

Conference Paper

Abstract

As the use of AI grows, deepfakes are becoming an increasingly prevalent threat. At the same time, the performance of most detectors drops significantly on unseen data, while generation models keep improving and leave fewer artefacts. We examined deepfakes published on Instagram using the SocialDF dataset. In addition to analysing the deepfakes in the frequency domain with audio deepfake detectors, we transcribed the speech and analysed both the text (e.g. emotion and topics) and the audio content (e.g. emotion and music genre). We found that audio deepfake detectors struggle to identify real-world deepfakes on Instagram. Furthermore, current audio deepfake detection relies on audio artefacts only; content is not used for detection. We suggest using both the speech recording and the content. This approach improves results on real-world data and provides an explanation for the classification. Using content information, we outperformed frequency-based detection, achieving an F1-score of 74.3%.

Author(s)

Schäfer, Karla

Fraunhofer-Institut für Sichere Informationstechnologie SIT

Mainwork

Advances in Information Retrieval. 48th European Conference on Information Retrieval, ECIR 2026. Proceedings. Part II

Conference

European Conference on Information Retrieval 2026
