Rules
Summary of the rules for systems participating in the challenge:
- The participants can use the development set to evaluate model performance during system development. It can be used to select the best model checkpoint, tune hyperparameters, and compare different system configurations. However, the dev set must not be used to train the model or update its internal parameters in any way.
- For system development, participants are permitted to use the MCoRec training subset, as well as the external data and pre-trained models listed in the Data and Pre-trained Models subsection. If you believe a public dataset or model is missing from this list, you may propose its addition until September 30, 2025.
- For evaluation, each recording must be considered separately. The system must not be fine-tuned on the entire evaluation set in any way (e.g., by computing global statistics or gathering speaker information across multiple recordings).
- Participants must submit a system description detailing their submitted system. Publishing code is not required but encouraged.
Systems that do not comply with these rules (e.g., by using a private dataset) may still be submitted but will be excluded from the final rankings.
Evaluation
Overview
The system is evaluated on three main metrics:
- Individual Speaker’s WER
- Conversation Clustering Performance (Pairwise F1 Score)
- Joint ASR-Clustering Error Rate (primary evaluation metric)
1. Individual Speaker’s WER
- Output Required: For each speaker, the system must produce a `.vtt` file containing their speech transcript, time-aligned to the video.
- Reference: Ground-truth `.vtt` files are provided for each speaker.
- Evaluation Steps:
  - For each speaker:
    - Extract the reference and hypothesis transcripts from their respective `.vtt` files.
    - Restrict evaluation to the time intervals specified by the speaker's UEM (Un-partitioned Evaluation Map) in the session metadata.
    - Normalize the text (including removal of disfluencies and standard text normalization).
    - Compute the speaker's WER:
      - WER = (Substitutions + Deletions + Insertions) / Number of words in reference
  - Average WER is calculated across all speakers across all sessions.
- Implementation: script/evaluate.evaluate_speaker_transcripts (a simplified sketch follows below)
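The snippet below is a minimal sketch of the per-speaker WER computation, not the official script/evaluate.evaluate_speaker_transcripts code. It assumes the reference and hypothesis strings have already been extracted from the `.vtt` files, restricted to the speaker's UEM intervals, and text-normalized.

```python
# Illustrative per-speaker WER sketch (not the official scoring script).
# `reference` and `hypothesis` are assumed to be UEM-restricted, normalized strings.

def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Word-level Levenshtein distance via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                      # deleting all remaining reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j                      # inserting all remaining hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution or match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# Reported number: the mean of word_error_rate over all speakers in all sessions.
```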
2. Conversation Clustering Performance (Pairwise F1 Score)
- Output Required: The system must output a mapping (`speaker_to_cluster.json`) assigning each speaker to a conversation cluster (cluster ID).
- Reference: Ground-truth cluster assignments are provided for each speaker.
- Evaluation Steps:
  - For all unordered pairs of speakers in a session:
    - Determine whether the pair is in the same cluster in both the system output and the ground truth.
  - Compute the following:
    - True Positives (TP): pairs correctly predicted to be in the same cluster.
    - False Positives (FP): pairs predicted to be in the same cluster but not in the same cluster in the ground truth.
    - False Negatives (FN): pairs in the same cluster in the ground truth but not predicted as such.
  - Calculate:
    - Precision: TP / (TP + FP)
    - Recall: TP / (TP + FN)
    - Pairwise F1 Score: 2 * (Precision * Recall) / (Precision + Recall)
  - Average F1 Score is reported across all sessions.
- Implementation: script/evaluate.evaluate_conversation_clustering (a simplified sketch follows below)
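For illustration only (not the official scorer), the sketch below computes the pairwise F1 for one session. It assumes the predicted and ground-truth assignments are plain dictionaries mapping speaker IDs to cluster IDs; the exact schema of speaker_to_cluster.json is an assumption here.

```python
# Pairwise clustering F1 sketch for a single session (illustrative only).
# Assumed input format, e.g. loaded with json.load:
#   pred  = {"spk1": 0, "spk2": 0, "spk3": 1}   # hypothetical speaker/cluster IDs
#   truth = {"spk1": 0, "spk2": 1, "spk3": 1}
from itertools import combinations

def pairwise_f1(pred: dict, truth: dict) -> float:
    tp = fp = fn = 0
    for a, b in combinations(sorted(truth), 2):   # all unordered speaker pairs
        true_same = truth[a] == truth[b]
        pred_same = a in pred and b in pred and pred[a] == pred[b]
        if pred_same and true_same:
            tp += 1
        elif pred_same and not true_same:
            fp += 1
        elif true_same and not pred_same:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# Reported number: the mean of pairwise_f1 over all sessions.
```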
3. Joint ASR-Clustering Error Rate (Primary Metric)
This is the main evaluation metric that combines both transcription and clustering performance at the speaker level.
3.1 Per-Speaker Clustering F1 Score
For each speaker, a clustering F1 score is computed using a one-vs-rest approach:
- For speaker i:
  - Consider all other speakers j in the same session.
  - For each pair (i, j):
    - Check whether they are in the same cluster in the ground truth: `true_same`
    - Check whether they are in the same cluster in the prediction: `pred_same`
  - Count:
    - TP: cases where `pred_same = True` and `true_same = True`
    - FP: cases where `pred_same = True` and `true_same = False`
    - FN: cases where `pred_same = False` and `true_same = True`
  - Compute speaker i's clustering F1:
    - Precision: TP / (TP + FP)
    - Recall: TP / (TP + FN)
    - F1: 2 * (Precision * Recall) / (Precision + Recall)
- Implementation: script/evaluate.evaluate_speaker_clustering (a simplified sketch follows below)
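As with the previous sketches, the following is a simplified illustration of the one-vs-rest per-speaker clustering F1, not the official script/evaluate.evaluate_speaker_clustering implementation. It again assumes dictionaries mapping speaker IDs to cluster IDs for one session.

```python
# One-vs-rest clustering F1 for a single speaker (illustrative only).

def speaker_clustering_f1(speaker: str, pred: dict, truth: dict) -> float:
    tp = fp = fn = 0
    for other in truth:
        if other == speaker:
            continue
        true_same = truth[speaker] == truth[other]
        pred_same = (speaker in pred and other in pred
                     and pred[speaker] == pred[other])
        if pred_same and true_same:
            tp += 1
        elif pred_same and not true_same:
            fp += 1
        elif true_same and not pred_same:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0
```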
3.2 Combined Metric Calculation
For each speaker:
Joint ASR-Clustering Error Rate = 0.5 * Speaker_WER + 0.5 * (1 - Per_Speaker_Clustering_F1)
This metric:
- Ranges from 0 (perfect) to 1 (worst possible)
- Equally weights transcription accuracy and clustering accuracy
- Lower values are better
The final primary metric is the average Joint ASR-Clustering Error Rate across all speakers in all sessions.
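A minimal sketch of how the pieces combine, reusing the hypothetical word_error_rate and speaker_clustering_f1 helpers from the sketches above:

```python
# Joint ASR-Clustering Error Rate for one speaker (illustrative only).

def joint_error_rate(speaker_wer: float, clustering_f1: float) -> float:
    # 0.5 * WER + 0.5 * (1 - per-speaker clustering F1)
    return 0.5 * speaker_wer + 0.5 * (1.0 - clustering_f1)

# Primary metric: the mean of joint_error_rate over all speakers in all sessions.
```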
External data and pre-trained models
Besides the MCoRec dataset published with this challenge, participants are allowed to use the public datasets and pre-trained models listed below. If you want to propose an additional dataset or pre-trained model to be added to these lists, do so by contacting us on Slack or by email at mcorecchallenge@gmail.com by September 30, 2025. If you want to use a private dataset or model, you may still submit your system to the challenge, but we will not include it in the final rankings.
Participants may use the following publicly available datasets for building their systems:
- AVA
- Lip Reading Sentences 2
- AMI
- LibriSpeech
- TEDLIUM
- MUSAN
- RWCP Sound Scene Database
- REVERB Challenge RIRs
- Aachen AIR dataset
- BUT Reverb database
- SLR28 RIR and Noise Database (contains Aachen AIR, MUSAN noise, RWCP sound scene database and REVERB challenge RIRs, plus simulated ones)
- VoxCeleb 1&2
- FSD50k
- WSJ0-2mix, WHAM, WHAMR, WSJ
- SINS
- LibriCSS acoustic transfer functions (ATF)
- NOTSOFAR1 simulated CSS dataset
- Ego4D
- Project Aria Datasets
- DNS challenge noises
In addition, the following pre-trained models may be used:
- Audio-Visual Speech Recognition:
- Wav2vec:
- Wav2vec 2.0:
  - Fairseq:
    - All models including Wav2Vec 2.0 Large (LV-60 + CV + SWBD + FSH) and the multi-lingual XLSR-53 56k
  - Torchaudio:
  - Huggingface:
    - facebook/wav2vec2-base-960h
    - facebook/wav2vec2-large-960h
    - facebook/wav2vec2-large-960h-lv60-self
    - facebook/wav2vec2-base
    - facebook/wav2vec2-large-lv60
    - facebook/wav2vec2-large-xlsr-53
    - wav2vec2-large lv60 + speaker verification
    - Other models on Huggingface using the same weights as the Fairseq ones.
  - S3PRL
- HuBERT
- WavLM
- Tacotron2
- ECAPA-TDNN
- X-vector extractor
- Pyannote Segmentation
- Pyannote Diarization (Pyannote Segmentation + ECAPA-TDNN from SpeechBrain)
- NeMo toolkit ASR pre-trained models:
- NeMo toolkit speaker ID embeddings models:
- NeMo toolkit VAD models:
- NeMo toolkit diarization models:
- Whisper
- OWSM: Open Whisper-style Speech Model
- Icefall Zipformer
- RWKV Transducer