Results

On this page we present the results of all submitted systems together with the published baseline. We thank all participants for their submissions, which improve performance on the MCoRec task and explore a range of modeling approaches.

Metrics follow the definitions on the Rules page. Systems are scored on word error rate (WER) averaged over speakers, per-speaker conversation clustering F1 (reported here as Conv F1), and the joint ASR–clustering error (reported here as Joint Err), which combines WER and clustering errors and serves as the primary ranking metric. The primary leaderboard below lists each system's main score for these three metrics.
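As context for the WER column, the speaker-averaged score can be sketched as follows. This is a minimal illustration, not the official scoring script: it assumes a standard Levenshtein word alignment and a plain macro average over speakers, and all function names are hypothetical.

```python
# Hypothetical sketch of speaker-averaged WER (not the official scorer).

def word_errors(ref: list[str], hyp: list[str]) -> int:
    """Levenshtein distance between reference and hypothesis word lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution / match
        prev = cur
    return prev[-1]

def speaker_wer(ref: str, hyp: str) -> float:
    """WER for one speaker: word errors / reference words."""
    ref_words, hyp_words = ref.split(), hyp.split()
    return word_errors(ref_words, hyp_words) / max(len(ref_words), 1)

def average_wer(pairs: list[tuple[str, str]]) -> float:
    """Macro average of per-speaker WERs, as on the leaderboard."""
    return sum(speaker_wer(r, h) for r, h in pairs) / len(pairs)
```

The macro average weights every speaker equally regardless of how much they speak, which matches the "WER averaged over speakers" phrasing above.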

Primary results

| Rank | Team name | System name | WER ↓ | Conv F1 ↑ | Joint Err ↓ |
|------|-----------|-------------|-------|-----------|-------------|

Detailed analyses on the evaluation set

By number of conversations

This table reports system performance grouped by the number of concurrent conversations in a session. Each column corresponds to sessions containing 1, 2, or 3 conversations, and reports the average speaker WER and conversation clustering F1. On the evaluation set, 14.9% of sessions contain a single conversation, while 44.8% and 40.3% contain two and three conversations, respectively. This breakdown shows how system performance changes as conversational structure becomes more complex.

| Team Name | System Name | WER ↓ (1 conv.) | F1 ↑ (1 conv.) | WER ↓ (2 conv.) | F1 ↑ (2 conv.) | WER ↓ (3 conv.) | F1 ↑ (3 conv.) |
|-----------|-------------|-----------------|-----------------|-----------------|-----------------|-----------------|-----------------|

By number of speakers

This table reports system performance grouped by the total number of speakers in a session. For each speaker-count condition, it reports the average speaker WER and conversation clustering F1.

On the evaluation set, the distribution is as follows: 12 sessions with 2 speakers (17.9%), 1 with 3 speakers (1.5%), 13 with 4 speakers (19.4%), 12 with 5 speakers (17.9%), 25 with 6 speakers (37.3%), 1 with 7 speakers (1.5%), and 3 with 8 speakers (4.5%).

This breakdown provides a view of how recognition and clustering performance change as the number of participants increases.

| Team Name | System Name | WER ↓ (2 spk) | F1 ↑ (2 spk) | WER ↓ (3 spk) | F1 ↑ (3 spk) | WER ↓ (4 spk) | F1 ↑ (4 spk) | WER ↓ (5 spk) | F1 ↑ (5 spk) | WER ↓ (6 spk) | F1 ↑ (6 spk) | WER ↓ (7 spk) | F1 ↑ (7 spk) | WER ↓ (8 spk) | F1 ↑ (8 spk) |
|-----------|-------------|---------------|---------------|---------------|---------------|---------------|---------------|---------------|---------------|---------------|---------------|---------------|---------------|---------------|---------------|

Error rate by type (overall)

This table decomposes overall WER errors into substitution, insertion, and deletion components for each system. SubR, InsR, and DelR are the normalized rates of each error type, while Sub%, Ins%, and Del% show their shares among all recognition errors. This shows an overall error profile for each system.
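The decomposition above can be sketched as follows. This is an illustrative sketch, not the official scoring script: it assumes a minimum-edit word alignment and the rate definition used later on this page (errors divided by reference words); function names are hypothetical.

```python
# Illustrative sketch of the error-type decomposition (not the official scorer).

def align_counts(ref: list[str], hyp: list[str]) -> tuple[int, int, int]:
    """Return (sub, ins, del) counts of a minimum-edit word alignment."""
    # dp[i][j] = (total, sub, ins, del) for ref[:i] vs hyp[:j]
    dp = [[(0, 0, 0, 0)] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(1, len(ref) + 1):
        dp[i][0] = (i, 0, 0, i)                      # all deletions
    for j in range(1, len(hyp) + 1):
        dp[0][j] = (j, 0, j, 0)                      # all insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = int(ref[i - 1] != hyp[j - 1])
            t, s, n, d = dp[i - 1][j - 1]
            best = (t + cost, s + cost, n, d)        # match / substitution
            t, s, n, d = dp[i][j - 1]
            best = min(best, (t + 1, s, n + 1, d))   # insertion
            t, s, n, d = dp[i - 1][j]
            best = min(best, (t + 1, s, n, d + 1))   # deletion
            dp[i][j] = best
    _, sub, ins, dele = dp[-1][-1]
    return sub, ins, dele

def error_profile(ref: str, hyp: str) -> dict[str, float]:
    """Normalized rates (errors / reference words) and error-type shares."""
    sub, ins, dele = align_counts(ref.split(), hyp.split())
    n_ref = max(len(ref.split()), 1)
    total = max(sub + ins + dele, 1)
    return {"SubR": sub / n_ref, "InsR": ins / n_ref, "DelR": dele / n_ref,
            "Sub%": sub / total, "Ins%": ins / total, "Del%": dele / total}
```

Note that SubR + InsR + DelR equals the WER of the pair, while Sub% + Ins% + Del% sums to one over the error counts.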

| Team Name | System Name | SubR ↓ | InsR ↓ | DelR ↓ | Sub % | Ins % | Del % |
|-----------|-------------|--------|--------|--------|-------|-------|-------|

By speaking activity (speaker-level ratio)

This table shows performance grouped by how much each speaker talks. The speaking activity ratio is the amount of time a speaker talks (from reference transcripts) divided by the total session duration. Speakers are grouped into low (<45%), mid (45–60%), and high (>60%) activity levels. On the evaluation set, this corresponds to 320 speakers in total, with 24.1% low-activity, 26.9% mid-activity, and 49.1% high-activity speakers. This analysis provides a speaker-level view of how speaking activity affects recognition and clustering performance.
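The grouping above can be sketched as follows. This is a minimal sketch using the thresholds stated in the text; the treatment of values exactly at 45% and 60% is an assumption, and the function names are hypothetical.

```python
# Sketch of the speaker-level activity grouping described above.

def activity_ratio(speaking_time_s: float, session_duration_s: float) -> float:
    """Fraction of the session during which the speaker talks."""
    return speaking_time_s / session_duration_s

def activity_group(ratio: float) -> str:
    """Bucket a speaking ratio into low (<45%), mid (45-60%), high (>60%).
    Boundary handling at exactly 0.45 / 0.60 is an assumption."""
    if ratio < 0.45:
        return "low"
    if ratio <= 0.60:
        return "mid"
    return "high"
```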

| Team Name | System Name | Low Activity (WER ↓) | Low Activity (F1 ↑) | Mid Activity (WER ↓) | Mid Activity (F1 ↑) | High Activity (WER ↓) | High Activity (F1 ↑) |
|-----------|-------------|----------------------|---------------------|----------------------|---------------------|-----------------------|----------------------|

By speaking activity (speaker-level ratio, error type)

This table analyzes speaker-level performance by error type across different speaking activity groups. Speakers are divided into low (<45%), mid (45–60%), and high (>60%) activity based on their speaking ratio. For each group, it reports substitution, insertion, and deletion rates, computed as the total number of each error type divided by the total number of reference words. This breakdown shows how different types of recognition errors vary with speaker activity, providing an error composition view at the speaker level.

| Team Name | System Name | Low Activity (SubR ↓) | Low Activity (InsR ↓) | Low Activity (DelR ↓) | Mid Activity (SubR ↓) | Mid Activity (InsR ↓) | Mid Activity (DelR ↓) | High Activity (SubR ↓) | High Activity (InsR ↓) | High Activity (DelR ↓) |
|-----------|-------------|-----------------------|-----------------------|-----------------------|-----------------------|-----------------------|-----------------------|------------------------|------------------------|------------------------|

By activity imbalance (session-level)

This table reports performance grouped by session-level activity imbalance, which measures how unevenly speaking time is distributed among speakers in a session. For each session, the speaking ratio of each speaker is computed, and the imbalance score is defined as the standard deviation of these ratios divided by their mean. This yields a normalized measure of participation unevenness: lower values indicate more balanced participation, while higher values indicate that a few speakers dominate.

Sessions are grouped into three ranges: low (≤ 0.15), mid (0.15–0.30), and high (> 0.30) imbalance. On the evaluation set, most sessions fall into the mid range (52.2%), followed by low (26.9%) and high (20.9%) imbalance.

These groups correspond to distinct interaction patterns. Low-imbalance sessions show balanced participation, with similar speaking ratios across speakers (average max–min spread ≈ 0.12). Mid-imbalance sessions exhibit moderate unevenness, where one or a few speakers are more active (spread ≈ 0.32). High-imbalance sessions show strong dominance, with large gaps between active and inactive speakers (spread ≈ 0.52), often including many low-activity speakers alongside a few dominant ones.

This grouping provides a session-level view of how participation imbalance relates to recognition and clustering performance.
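The imbalance score and grouping above can be sketched as follows. This is a minimal illustration: the score is the coefficient of variation of the per-speaker speaking ratios as defined in the text, but the choice of population (rather than sample) standard deviation is an assumption, and the function names are hypothetical.

```python
# Sketch of the session-level imbalance score: std of per-speaker speaking
# ratios divided by their mean (a coefficient of variation).
import statistics

def imbalance_score(ratios: list[float]) -> float:
    # Population standard deviation assumed; the official script may differ.
    return statistics.pstdev(ratios) / statistics.fmean(ratios)

def imbalance_group(score: float) -> str:
    """Bucket into low (<= 0.15), mid (0.15-0.30), high (> 0.30) imbalance."""
    if score <= 0.15:
        return "low"
    if score <= 0.30:
        return "mid"
    return "high"
```

A perfectly balanced session (all ratios equal) scores 0 and falls in the low group, while a session dominated by one speaker scores well above 0.30.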

| Team Name | System Name | Low Imbalance (WER ↓) | Low Imbalance (F1 ↑) | Mid Imbalance (WER ↓) | Mid Imbalance (F1 ↑) | High Imbalance (WER ↓) | High Imbalance (F1 ↑) |
|-----------|-------------|-----------------------|-----------------------|-----------------------|-----------------------|------------------------|------------------------|

By speaking activity (session average ratio)

This table reports performance grouped by session-level average speaking activity. For each session, the mean speaking ratio across all speakers is computed, and sessions are divided into low (<45%), mid (45–60%), and high (≥60%) activity levels. On the evaluation set, 5 sessions (7.5%) fall into the low category, 34 (50.7%) into mid, and 28 (41.8%) into high.

This grouping captures the overall activity level of a session and provides insight into how system performance varies under different levels of conversational activity.

| Team Name | System Name | Low Activity (WER ↓) | Low Activity (F1 ↑) | Mid Activity (WER ↓) | Mid Activity (F1 ↑) | High Activity (WER ↓) | High Activity (F1 ↑) |
|-----------|-------------|----------------------|---------------------|----------------------|---------------------|-----------------------|----------------------|
