Baseline System
The baseline recipe of NOTSOFAR-1, inspired by Yoshioka et al. and Raj et al., consists of three steps: continuous speech separation (CSS), automatic speech recognition (ASR), and speaker diarization. This section briefly describes each step.
CSS: The objective of CSS is to convert an input audio stream, consisting of either 1 or 7 channels, into a set of overlap-free streams. In our case, we generate N=3 distinct speech streams, which can support up to 3 speakers overlapping at the same time. An effective CSS essentially monitors the input audio stream and, when overlapping utterances are found, distributes them to the different output channels. We follow the conventional CSS framework, first training a speech separation network with a permutation invariant training (PIT) loss; the network takes fixed-length audio segments as input and outputs N speech masks and one noise mask in the short-time Fourier transform (STFT) domain. For the network architecture, we select the Conformer model.
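To make the PIT objective concrete, below is a minimal sketch of a permutation-invariant MSE loss over the N speech masks in the STFT magnitude domain (the noise mask is omitted here). The function and variable names are illustrative assumptions, not the baseline's actual training code.

```python
from itertools import permutations
import torch

def pit_mse_loss(pred_masks, ref_mags, mix_mag):
    """Permutation invariant MSE between masked mixtures and reference magnitudes.

    pred_masks: (B, N, T, F) predicted speech masks
    ref_mags:   (B, N, T, F) reference magnitude spectrograms, one per speaker stream
    mix_mag:    (B, T, F)    mixture magnitude spectrogram
    """
    N = pred_masks.shape[1]
    est_mags = pred_masks * mix_mag.unsqueeze(1)        # masked mixture per output stream
    losses = []
    for perm in permutations(range(N)):                 # try all N! speaker assignments
        err = ((est_mags[:, list(perm)] - ref_mags) ** 2).mean(dim=(1, 2, 3))
        losses.append(err)
    losses = torch.stack(losses, dim=1)                 # (B, N!)
    return losses.min(dim=1).values.mean()              # best permutation per example
```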
In the inference phase, the network is applied in a block-wise streaming fashion to overlapping fixed-length segments. Since the order of the N speech mask outputs may not be consistent across segments, we align every pair of adjacent segments. To estimate the best order, we consider all permutations and select the one with the lowest mean squared error (MSE) between the masked magnitude spectrograms, calculated over the frames shared by the two segments. After stitching the N speech masks and one noise mask over time, we proceed to generate each of the N output streams. For the single-channel variant, this consists of multiplying the input mixture by the target speech mask. For the multi-channel variant, we rely on mask-based minimum variance distortionless response (MVDR) beamforming. As part of this scheme, the spatial covariance matrices (SCMs) of the target and interference signals are computed, where the interference signal is defined as the sum of all non-target speakers and the background noise.
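The stitching step can be illustrated with a small sketch: for each pair of adjacent segments, every permutation of the current segment's masks is scored by MSE against the previous segment's masked magnitudes over the shared frames, and the best order is kept. Shapes and names below are assumptions for illustration only.

```python
from itertools import permutations
import numpy as np

def align_segment(prev_masked, cur_masked):
    """prev_masked, cur_masked: (N, T_overlap, F) masked magnitude spectrograms,
    restricted to the frames shared by the two adjacent segments."""
    N = prev_masked.shape[0]
    best_perm, best_err = None, np.inf
    for perm in permutations(range(N)):
        err = np.mean((cur_masked[list(perm)] - prev_masked) ** 2)
        if err < best_err:
            best_perm, best_err = perm, err
    return best_perm  # reorder the current segment's masks with this permutation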
ASR: We employ Whisper “large-v2”, which supports word-level timestamps, and apply it independently to each audio stream produced by CSS.
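As a rough illustration, the open-source openai-whisper package exposes word-level timestamps through its word_timestamps flag. The file names below are placeholders, and this sketch is not the baseline's actual inference script.

```python
import whisper

model = whisper.load_model("large-v2")

# placeholder paths for the N=3 CSS output streams
for stream_path in ["css_stream_0.wav", "css_stream_1.wav", "css_stream_2.wav"]:
    result = model.transcribe(stream_path, word_timestamps=True)
    for segment in result["segments"]:
        for word in segment.get("words", []):
            # each word carries text and start/end times, later consumed by diarization
            print(stream_path, word["word"], word["start"], word["end"])
```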
Speaker Diarization: The task of the diarization module is to assign a speaker label to every word transcribed by the ASR module. Speaker labels are unique identifiers such as spk0, spk1, etc., and they are not required to be identical to the reference speakers. We first apply an offline diarization method to each of the audio streams produced by CSS. Then, we assign each word to a speaker label based on its word boundary information and the diarization output of its source stream.
For offline diarization, we adopt the diarization recipe of the NeMo toolkit. Two configurations are supported. The first configuration is pre-SR diarization, which performs diarization as a first stage. It includes the “nmesc” variant, which performs offline clustering using the normalized maximum eigengap-based spectral clustering (NME-SC) algorithm, and the “nmesc-msdd” variant, which performs NME-SC followed by the multi-scale diarization decoder (MSDD). To assign a speaker label to an ASR word, we look up the active speakers within its time boundaries in the diarization output of the corresponding audio stream, as sketched below. In most cases, there is only one active speaker within the word’s boundaries, and it is assigned to the word. If there is no active speaker within the word’s boundaries (i.e., diarization did not detect speech), the speaker label of the nearest word in time is assigned. If there are multiple active speakers within the word’s boundaries (i.e., diarization detected overlapped speech), the speaker who is active for the longest duration is assigned.
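A minimal sketch of this assignment rule, assuming the diarization output of the word's source stream is available as a list of (start, end, speaker) segments:

```python
def assign_speaker(word_start, word_end, diar_segments):
    """diar_segments: list of (seg_start, seg_end, speaker_label) tuples from the
    offline diarization of the word's source stream."""
    overlap = {}
    for seg_start, seg_end, spk in diar_segments:
        dur = min(word_end, seg_end) - max(word_start, seg_start)
        if dur > 0:
            overlap[spk] = overlap.get(spk, 0.0) + dur
    if overlap:
        # single active speaker, or the longest-active one under overlapped speech
        return max(overlap, key=overlap.get)
    return None  # caller falls back to the speaker label of the nearest word in time
```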
The second configuration is post-SR diarization, which utilizes the word boundaries from ASR. For each word, speaker embedding vectors are extracted at multiple scales, each scale using a different window size. The final affinity matrix is a simple average of the affinity matrices of all the scales, and NME-SC clustering is performed on it.
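To illustrate the averaging step, here is a sketch that builds a cosine affinity matrix per scale from assumed per-word embeddings and averages them; the clustering call at the end is a placeholder for an NME-SC implementation (e.g., the one in NeMo), not an actual API.

```python
import numpy as np

def average_affinity(embeddings_per_scale):
    """embeddings_per_scale: list of (num_words, dim) arrays, one per scale."""
    affinities = []
    for emb in embeddings_per_scale:
        unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
        affinities.append(unit @ unit.T)      # cosine affinity at this scale
    return np.mean(affinities, axis=0)        # simple average across scales

# speaker_labels = nme_sc_cluster(average_affinity(embs))  # placeholder NME-SC call
```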
Note that the sole distinction between the single-channel and multi-channel variants of our system lies in the CSS module, which processes either 1 or 7 channels as its input and applies mask multiplication or MVDR beamforming, respectively.
📊 Baseline Results on NOTSOFAR dev-set-1
Values are presented in tcpWER / tcORC-WER (session count) format.
Systems are ranked based on the speaker-attributed tcpWER, while the speaker-agnostic tcORC-WER serves as a supplementary metric for analysis.
We include analysis based on a selection of hashtags from our metadata, providing insights into how different conditions affect system performance.
| | Single-Channel | Multi-Channel |
|---|---|---|
| All Sessions | 46.8 / 38.5 (177) | 32.4 / 26.7 (106) |
| #NaturalMeeting | 47.6 / 40.2 (30) | 32.3 / 26.2 (18) |
| #DebateOverlaps | 54.9 / 44.7 (39) | 38.0 / 31.4 (24) |
| #TurnsNoOverlap | 32.4 / 29.7 (10) | 21.2 / 18.8 (6) |
| #TransientNoise=high | 51.0 / 43.7 (10) | 33.6 / 29.1 (5) |
| #TalkNearWhiteboard | 55.4 / 43.9 (40) | 39.9 / 31.2 (22) |
References
- T. Yoshioka et al., “Advances in Online Audio-Visual Meeting Transcription,” 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Singapore, 2019, pp. 276-283.
- D. Raj et al., “Integration of Speech Separation, Diarization, and Recognition for Multi-Speaker Meetings: System Description, Comparison, and Analysis,” 2021 IEEE Spoken Language Technology Workshop (SLT), Shenzhen, China, 2021, pp. 897-904.
- T. J. Park, N. R. Koluguri, J. Balam, and B. Ginsburg, “Multi-scale Speaker Diarization with Dynamic Scale Weighting,” https://arxiv.org/abs/2203.15974, 2022.