Diarizing large corpora using multi-modal speaker linking

Ferràs, M.; Masneri, S.; Schreer, O.; Bourlard, H.

2014

Conference Paper

Abstract

Speaker diarization of a collection of recordings with uniquely identified speakers is a challenging task. A system addressing such a task must account for the inter-session variability present from recording to recording, and it must scale well to massive amounts of data. In this paper we use a two-stage approach to corpus-wide speaker diarization, involving a speaker diarization stage and a speaker linking stage. The speaker linking system agglomeratively clusters speaker factor posterior distributions obtained via Joint Factor Analysis, using the Ward method with the Hotelling t-square statistic as the distance measure. We extend this framework to link speakers based on both speech and visual modalities to improve the robustness of the system. The system is evaluated using the data collected for the Augmented Multiparty Interaction (AMI) project, involving over one hundred meetings. We provide results in terms of within-recording and across-recording diarization error rates (DER) to support the effectiveness of multi-modal speaker linking to enable large-scale speaker diarization.
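The linking stage described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes each within-recording speaker is summarized by a hypothetical tuple `(n, mean, var)` holding a sample count, a mean vector, and a diagonal variance vector, and it uses the standard two-sample Hotelling T-square statistic with a pooled variance. The function names, the diagonal-covariance simplification, and the stopping threshold are all assumptions made for illustration; the visual modality is not modeled.

```python
from itertools import combinations


def hotelling_t2(a, b):
    """Two-sample Hotelling T^2 between clusters a and b.

    Each cluster is (n, mean, var); var is a diagonal variance
    vector (an assumption made to keep the sketch dependency-free).
    """
    n1, m1, v1 = a
    n2, m2, v2 = b
    total = 0.0
    for x1, x2, s1, s2 in zip(m1, m2, v1, v2):
        # Pooled variance of the two samples in this dimension.
        pooled = ((n1 - 1) * s1 + (n2 - 1) * s2) / (n1 + n2 - 2)
        total += (x1 - x2) ** 2 / pooled
    # The n1*n2/(n1+n2) factor is the same weighting that appears
    # in Ward's merging cost for Euclidean data.
    return n1 * n2 / (n1 + n2) * total


def merge(a, b):
    """Combine the statistics of two clusters into one."""
    n1, m1, v1 = a
    n2, m2, v2 = b
    n = n1 + n2
    mean = [(n1 * x1 + n2 * x2) / n for x1, x2 in zip(m1, m2)]
    var = [
        ((n1 - 1) * s1 + (n2 - 1) * s2 + n1 * n2 / n * (x1 - x2) ** 2) / (n - 1)
        for x1, x2, s1, s2 in zip(m1, m2, v1, v2)
    ]
    return (n, mean, var)


def link_speakers(stats, threshold):
    """Greedy agglomerative linking: merge the closest pair until
    the smallest T^2 exceeds the (hypothetical) threshold."""
    stats = list(stats)
    members = [[i] for i in range(len(stats))]
    while len(stats) > 1:
        i, j = min(
            combinations(range(len(stats)), 2),
            key=lambda p: hotelling_t2(stats[p[0]], stats[p[1]]),
        )
        if hotelling_t2(stats[i], stats[j]) > threshold:
            break
        merged_stats = merge(stats[i], stats[j])
        merged_members = sorted(members[i] + members[j])
        # Delete the larger index first so index i stays valid.
        del stats[j], members[j]
        del stats[i], members[i]
        stats.append(merged_stats)
        members.append(merged_members)
    return sorted(members)


# Three within-recording speakers; the first two are assumed to be
# the same person seen in different recordings.
speakers = [
    (50, [0.0, 0.0], [1.0, 1.0]),
    (50, [0.1, -0.1], [1.0, 1.0]),
    (50, [5.0, 5.0], [1.0, 1.0]),
]
linked = link_speakers(speakers, threshold=10.0)
```

With these toy inputs, the first two speakers are linked into one corpus-wide identity while the third stays separate. In practice the threshold would have to be tuned on held-out data.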

Published in

INTERSPEECH 2014, 15th Annual Conference of the International Speech Communication Association. Online resource

Conference

International Speech Communication Association (INTERSPEECH Annual Conference) 2014
