Rules
Summary of the rules for systems participating in the challenge:
- The participants can use the development set to evaluate model performance during system development. It can be used to select the best model checkpoint, tune hyperparameters, and compare different system configurations. However, the dev set must not be used to train the model or update its internal parameters in any way.
- For system development, participants are permitted to use the MCoRec training subset, as well as the external data and pre-trained models listed in the Data and Pre-trained Models subsection. If you believe a public dataset or model is missing from this list, you may propose its addition until September 30, 2025.
- For evaluation, each recording must be considered separately. The system must not be fine-tuned on the entire evaluation set in any way (e.g., by computing global statistics or gathering speaker information across multiple recordings).
- Participants must submit a system description detailing their submitted system. Publishing code is not required but encouraged.
Systems that do not comply with these rules (e.g., by using a private dataset) may still be submitted but will be excluded from the final rankings.
Evaluation
Overview
The system is evaluated on three main metrics:
- Individual Speaker’s WER
- Conversation Clustering Performance (Pairwise F1 Score)
- Joint ASR-Clustering Error Rate (primary evaluation metric)
1. Individual Speaker’s WER
- Output Required: For each speaker, the system must produce a `.vtt` file containing their speech transcript, time-aligned to the video.
- Reference: Ground-truth `.vtt` files are provided for each speaker.
- Evaluation Steps:
  - For each speaker:
    - Extract the reference and hypothesis transcripts from their respective `.vtt` files.
    - Restrict evaluation to the time intervals specified by the speaker's UEM (Un-partitioned Evaluation Map) in the session metadata.
    - Normalize the text (including removal of disfluencies and standard text normalization).
    - Compute the speaker's WER:
      - WER = (Substitutions + Deletions + Insertions) / Number of words in reference
  - Average WER is calculated across all speakers across all sessions.
- Implementation: script/evaluate.evaluate_speaker_transcripts (a simplified sketch follows below)
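The snippet below is a minimal sketch of the per-speaker WER computation, not the official script/evaluate.evaluate_speaker_transcripts code. It assumes the reference and hypothesis strings have already been extracted from the `.vtt` files, restricted to the speaker's UEM intervals, and text-normalized.

```python
# Illustrative per-speaker WER sketch (not the official scoring script).
# `reference` and `hypothesis` are assumed to be UEM-restricted, normalized strings.

def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Word-level Levenshtein distance via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                      # deleting all remaining reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j                      # inserting all remaining hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution or match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# Reported number: the mean of word_error_rate over all speakers in all sessions.
```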
2. Conversation Clustering Performance (Pairwise F1 Score)
- Output Required: The system must output a mapping (`speaker_to_cluster.json`) assigning each speaker to a conversation cluster (cluster ID).
- Reference: Ground-truth cluster assignments are provided for each speaker.
- Evaluation Steps:
  - For all unordered pairs of speakers in a session:
    - Determine whether the pair is in the same cluster in both the system output and the ground truth.
  - Compute the following:
    - True Positives (TP): pairs correctly predicted to be in the same cluster.
    - False Positives (FP): pairs predicted to be in the same cluster but not in the same cluster in the ground truth.
    - False Negatives (FN): pairs in the same cluster in the ground truth but not predicted as such.
  - Calculate:
    - Precision: TP / (TP + FP)
    - Recall: TP / (TP + FN)
    - Pairwise F1 Score: 2 * (Precision * Recall) / (Precision + Recall)
  - Average F1 Score is reported across all sessions.
- Implementation: script/evaluate.evaluate_conversation_clustering (a simplified sketch follows below)
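For illustration only (not the official scorer), the sketch below computes the pairwise F1 for one session. It assumes the predicted and ground-truth assignments are plain dictionaries mapping speaker IDs to cluster IDs; the exact schema of speaker_to_cluster.json is an assumption here.

```python
# Pairwise clustering F1 sketch for a single session (illustrative only).
# Assumed input format, e.g. loaded with json.load:
#   pred  = {"spk1": 0, "spk2": 0, "spk3": 1}   # hypothetical speaker/cluster IDs
#   truth = {"spk1": 0, "spk2": 1, "spk3": 1}
from itertools import combinations

def pairwise_f1(pred: dict, truth: dict) -> float:
    tp = fp = fn = 0
    for a, b in combinations(sorted(truth), 2):   # all unordered speaker pairs
        true_same = truth[a] == truth[b]
        pred_same = a in pred and b in pred and pred[a] == pred[b]
        if pred_same and true_same:
            tp += 1
        elif pred_same and not true_same:
            fp += 1
        elif true_same and not pred_same:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# Reported number: the mean of pairwise_f1 over all sessions.
```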
3. Joint ASR-Clustering Error Rate (Primary Metric)
This is the main evaluation metric that combines both transcription and clustering performance at the speaker level.
3.1 Per-Speaker Clustering F1 Score
For each speaker, a clustering F1 score is computed using a one-vs-rest approach:
- For speaker i:
  - Consider all other speakers j in the same session.
  - For each pair (i, j):
    - Check whether they are in the same cluster in the ground truth: `true_same`
    - Check whether they are in the same cluster in the prediction: `pred_same`
  - Count:
    - TP: cases where `pred_same = True` and `true_same = True`
    - FP: cases where `pred_same = True` and `true_same = False`
    - FN: cases where `pred_same = False` and `true_same = True`
  - Compute speaker i's clustering F1:
    - Precision: TP / (TP + FP)
    - Recall: TP / (TP + FN)
    - F1: 2 * (Precision * Recall) / (Precision + Recall)
- Implementation: script/evaluate.evaluate_speaker_clustering (a simplified sketch follows below)
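As with the previous sketches, the following is a simplified illustration of the one-vs-rest per-speaker clustering F1, not the official script/evaluate.evaluate_speaker_clustering implementation. It again assumes dictionaries mapping speaker IDs to cluster IDs for one session.

```python
# One-vs-rest clustering F1 for a single speaker (illustrative only).

def speaker_clustering_f1(speaker: str, pred: dict, truth: dict) -> float:
    tp = fp = fn = 0
    for other in truth:
        if other == speaker:
            continue
        true_same = truth[speaker] == truth[other]
        pred_same = (speaker in pred and other in pred
                     and pred[speaker] == pred[other])
        if pred_same and true_same:
            tp += 1
        elif pred_same and not true_same:
            fp += 1
        elif true_same and not pred_same:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0
```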
3.2 Combined Metric Calculation
For each speaker:
Joint ASR-Clustering Error Rate = 0.5 * Speaker_WER + 0.5 * (1 - Per_Speaker_Clustering_F1)
This metric:
- Ranges from 0 (perfect) to 1 (worst possible)
- Equally weights transcription accuracy and clustering accuracy
- Lower values are better
The final primary metric is the average Joint ASR-Clustering Error Rate across all speakers in all sessions.
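A minimal sketch of how the pieces combine, reusing the hypothetical word_error_rate and speaker_clustering_f1 helpers from the sketches above:

```python
# Joint ASR-Clustering Error Rate for one speaker (illustrative only).

def joint_error_rate(speaker_wer: float, clustering_f1: float) -> float:
    # 0.5 * WER + 0.5 * (1 - per-speaker clustering F1)
    return 0.5 * speaker_wer + 0.5 * (1.0 - clustering_f1)

# Primary metric: the mean of joint_error_rate over all speakers in all sessions.
```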
External data and pre-trained models
Besides the MCoRec dataset published with this challenge, participants are allowed to use the public datasets and pre-trained models listed below. If you want to propose an additional dataset or pre-trained model to be added to these lists, do so by contacting us on Slack or by email at mcorecchallenge@gmail.com by September 30, 2025. If you want to use a private dataset or model, you may still submit your system to the challenge, but we will not include it in the final rankings.
Participants may use the following publicly available datasets for building their systems:
- AVA
- Lip Reading Sentences 2
- AMI
- LibriSpeech
- TEDLIUM
- MUSAN
- RWCP Sound Scene Database
- REVERB Challenge RIRs
- Aachen AIR dataset
- BUT Reverb database
- SLR28 RIR and Noise Database (contains Aachen AIR, MUSAN noise, RWCP sound scene database and REVERB challenge RIRs, plus simulated ones)
- VoxCeleb 1&2
- FSD50k
- WSJ0-2mix, WHAM, WHAMR, WSJ
- SINS
- LibriCSS acoustic transfer functions (ATF)
- NOTSOFAR1 simulated CSS dataset
- Ego4D
- Project Aria Datasets
- DNS challenge noises
In addition, the following pre-trained models may be used:
- Audio-Visual Speech Recognition:
- Wav2vec:
- Wav2vec 2.0:
  - Fairseq:
    - All models including Wav2Vec 2.0 Large (LV-60 + CV + SWBD + FSH) and the multi-lingual XLSR-53 56k
  - Torchaudio:
  - Huggingface:
    - facebook/wav2vec2-base-960h
    - facebook/wav2vec2-large-960h
    - facebook/wav2vec2-large-960h-lv60-self
    - facebook/wav2vec2-base
    - facebook/wav2vec2-large-lv60
    - facebook/wav2vec2-large-xlsr-53
    - wav2vec2-large lv60 + speaker verification
    - Other models on Huggingface using the same weights as the Fairseq ones.
  - S3PRL
- HuBERT
- WavLM
- Tacotron2
- ECAPA-TDNN
- X-vector extractor
- Pyannote Segmentation
- Pyannote Diarization (Pyannote Segmentation + ECAPA-TDNN from SpeechBrain)
- NeMo toolkit ASR pre-trained models:
- NeMo toolkit speaker ID embeddings models:
- NeMo toolkit VAD models:
- NeMo toolkit diarization models:
- Whisper
- OWSM: Open Whisper-style Speech Model
- Icefall Zipformer
- RWKV Transducer