Baseline System
Overview
The baseline system for CHiME-9 Task 1 addresses the challenging problem of Multi-Modal Context-aware Recognition (MCoRec) in single-room environments with multiple concurrent conversations. The system processes 360° video and audio recordings to both transcribe each speaker’s speech and identify which speakers belong to the same conversation.
Task Requirements
The baseline system provides an initial framework for addressing two interconnected challenges:
- Individual Speaker Transcription: Generate time-aligned transcripts (.vtt files) for each target speaker within specified evaluation intervals
- Conversation Clustering: Group participants into their respective conversations by generating speaker-to-cluster mappings
Input: Single 360° video with audio, plus bounding boxes identifying target participants
Output: Per-speaker transcriptions and conversation cluster assignments
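For concreteness, the sketch below writes the two required output artifacts for a hypothetical session with three target speakers. File names, speaker IDs, and transcript contents are illustrative only; the data description defines the exact naming scheme.

```python
# Illustrative only: file names and speaker IDs are hypothetical,
# not the official naming scheme of the challenge data.
import json

# Per-speaker WebVTT transcript (one file per target speaker).
vtt = """WEBVTT

00:00:01.200 --> 00:00:04.800
hello how was your weekend

00:00:07.500 --> 00:00:09.100
yeah we should do that again
"""
with open("speaker_01.vtt", "w") as f:
    f.write(vtt)

# Speaker-to-cluster mapping: speakers sharing a cluster ID
# are predicted to belong to the same conversation.
with open("speaker_to_cluster.json", "w") as f:
    json.dump({"speaker_01": 0, "speaker_02": 0, "speaker_03": 1}, f, indent=2)
```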
Baseline System Architecture
The baseline system is provided on GitHub. Please refer to the README there for instructions on installing and running the system.
Core Components
1. Active Speaker Detection
- Purpose: Determines which speaker is actively speaking at any given moment
- Baseline Model: A Light Weight Model for Active Speaker Detection
- Input: Face crop video and audio extracted from the 360° video
- Output: Active speaker detection scores for each frame of the corresponding track video. These scores are used to determine when a speaker is talking, allowing the audio-visual speech recognition system to run only during the speaking segments.
2. Face Landmark Detection and Mouth Cropping
- Purpose: Extracts mouth region from face crop videos.
- Models:
- Face detector based on RetinaFace
- 2D facial landmark detector based on FAN (Face Alignment Network)
- Input: Face crop video and audio extracted from the 360° video
- Output: Mouth crop video with precise mouth region extraction
- Processing Pipeline: Based on the Auto-AVSR/preparation repository
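A minimal sketch of the mouth-cropping step, assuming 68-point facial landmarks are already available from the detector stage. The crop size and the detect_landmarks helper in the usage comment are illustrative assumptions; the Auto-AVSR/preparation scripts define the actual procedure.

```python
import numpy as np

MOUTH_CROP_SIZE = 96  # assumed ROI size; check the Auto-AVSR/preparation scripts

def crop_mouth(frame: np.ndarray, landmarks: np.ndarray,
               size: int = MOUTH_CROP_SIZE) -> np.ndarray:
    """Crop a fixed-size patch centred on the mouth.

    `frame` is an HxWx3 image and `landmarks` a (68, 2) array from a
    2D facial landmark detector; points 48-67 outline the mouth.
    """
    cx, cy = landmarks[48:68].mean(axis=0).astype(int)
    half = size // 2
    # Clamp so the crop stays inside the frame.
    x0 = np.clip(cx - half, 0, frame.shape[1] - size)
    y0 = np.clip(cy - half, 0, frame.shape[0] - size)
    return frame[y0:y0 + size, x0:x0 + size]

# Usage (per video frame):
# landmarks = detect_landmarks(frame)   # hypothetical RetinaFace + FAN wrapper
# mouth_frame = crop_mouth(frame, landmarks)
```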
3. Video Segmentation and Chunking
- Purpose: Splits long mouth crop videos into smaller segments (≤15 seconds) based on active speaker detection scores for efficient processing
- Algorithm (see the sketch after this list):
- Hysteresis Thresholding: Uses onset and offset thresholds to identify speech regions from ASD scores
- Duration Filtering: Removes short speech segments and fills short gaps between speech regions
- Segment Splitting: Divides long continuous speech regions into manageable chunks
- Input: Mouth crop video and active speaker detection scores
- Output: List of video segments with start/end timestamps
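A minimal sketch of the segmentation logic, assuming per-frame ASD scores at a known video frame rate; the thresholds and durations below are illustrative placeholders rather than the baseline's actual configuration.

```python
import numpy as np

def speech_regions(scores: np.ndarray, fps: float,
                   onset: float = 0.6, offset: float = 0.4,
                   min_speech: float = 0.3, min_gap: float = 0.3,
                   max_len: float = 15.0):
    """Turn per-frame ASD scores into (start, end) segments in seconds."""
    # 1) Hysteresis thresholding: open a region when the score rises above
    #    `onset`, keep it open until the score drops below `offset`.
    active, regions, start = False, [], 0
    for i, s in enumerate(scores):
        if not active and s >= onset:
            active, start = True, i
        elif active and s < offset:
            active = False
            regions.append([start / fps, i / fps])
    if active:
        regions.append([start / fps, len(scores) / fps])

    # 2) Duration filtering: fill short gaps, then drop short regions.
    merged = []
    for seg in regions:
        if merged and seg[0] - merged[-1][1] < min_gap:
            merged[-1][1] = seg[1]
        else:
            merged.append(seg)
    merged = [seg for seg in merged if seg[1] - seg[0] >= min_speech]

    # 3) Segment splitting: cut anything longer than `max_len` seconds.
    chunks = []
    for start_t, end_t in merged:
        t = start_t
        while t < end_t:
            chunks.append((t, min(t + max_len, end_t)))
            t += max_len
    return chunks
```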
4. Audio-Visual Speech Recognition
- Purpose: Combines audio and visual cues to transcribe speech into text for each video segment
- Input: Segmented mouth crop videos with corresponding audio streams and timestamps
- Output: Time-aligned transcriptions in WebVTT format with start/end timestamps for each segment
- Baseline Models: AV-HuBERT CTC/Attention, Muavic-EN, and Auto-AVSR variants (see the results table below)
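Since recognition runs on chunks, the chunk-level hypotheses have to be placed on the recording's timeline and serialized as WebVTT cues. Below is a minimal sketch, assuming each chunk already carries its absolute start and end times from the segmentation step; function names are illustrative.

```python
def to_vtt_timestamp(seconds: float) -> str:
    """Format seconds as an HH:MM:SS.mmm WebVTT timestamp."""
    ms = int(round(seconds * 1000))
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d}.{ms:03d}"

def write_vtt(chunks, path):
    """`chunks` is a list of (start_sec, end_sec, hypothesis_text) tuples
    already expressed in the full recording's timeline."""
    with open(path, "w") as f:
        f.write("WEBVTT\n\n")
        for start, end, text in chunks:
            f.write(f"{to_vtt_timestamp(start)} --> {to_vtt_timestamp(end)}\n{text}\n\n")

# write_vtt([(1.2, 4.8, "hello how was your weekend")], "speaker_01.vtt")
```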
5. Time-based Conversation Clustering
- Purpose: Groups speakers into their respective conversations based on temporal speaking patterns and overlap analysis
- Input: Active speaker detection scores
- Processing Workflow (see the sketch after this list):
- Speaker Activity Extraction: Uses ASD scores from the Active Speaker Detection component to identify time segments where each speaker is actively talking
- Conversation Score Calculation: For each pair of speakers, computes interaction scores based on temporal overlap patterns:
- Simultaneous speech (overlap) → indicates different conversations
- Sequential speech (non-overlap) → indicates same conversation
- Score formula: 1 - (overlap_duration / total_duration)
- Distance Matrix Construction: Converts conversation scores to distances for clustering:
- Higher scores (less overlap) = smaller distances = higher probability of same conversation
- Lower scores (more overlap) = larger distances = lower probability of same conversation
- Agglomerative Clustering: Hierarchically groups speakers using the distance matrix
- Output: Speaker-to-cluster mapping in JSON format (speaker_to_cluster.json)
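For example, two speakers whose speech overlaps for 2 of their combined 20 seconds of talk time get a conversation score of 0.9 and a distance of 0.1, so they are likely to end up in the same cluster. Below is a minimal sketch of the clustering step, using SciPy's hierarchical clustering on the pairwise distance matrix; the interval representation and the distance threshold are illustrative assumptions rather than the baseline's exact configuration.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

def overlap(a, b):
    """Total overlap in seconds between two lists of (start, end) intervals."""
    return sum(max(0.0, min(e1, e2) - max(s1, s2)) for s1, e1 in a for s2, e2 in b)

def total(a, b):
    """Combined speaking duration of both speakers, in seconds."""
    return sum(e - s for s, e in a) + sum(e - s for s, e in b)

def cluster_speakers(activity, distance_threshold=0.5):
    """`activity` maps speaker ID -> list of (start, end) speech intervals."""
    ids = sorted(activity)
    n = len(ids)
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            a, b = activity[ids[i]], activity[ids[j]]
            # Conversation score: 1 - overlap/total; a high score means little
            # simultaneous speech, i.e. likely the same conversation.
            score = 1.0 - overlap(a, b) / max(total(a, b), 1e-9)
            dist[i, j] = dist[j, i] = 1.0 - score  # distance = overlap ratio
    labels = fcluster(linkage(squareform(dist), method="average"),
                      t=distance_threshold, criterion="distance")
    return {spk: int(lab) for spk, lab in zip(ids, labels)}

# mapping = cluster_speakers(activity)  # e.g. {"speaker_01": 1, "speaker_02": 1, ...}
```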
This baseline establishes a reference implementation that participants can build upon and improve with more sophisticated approaches, in order to better handle multi-conversation scenarios with high speech overlap and complex acoustic environments.
Results
The results for the baseline systems on the dev subset are as follows:
| System | AVSR Model | Fine-tuned on MCoRec | Conversation Clustering | Conversation Clustering F1 Score | Speaker WER | Joint ASR-Clustering Error Rate |
| --- | --- | --- | --- | --- | --- | --- |
| BL1 | AV-HuBERT CTC/Attention | No | Time-based | 0.8153 | 0.5536 | 0.3821 |
| BL2 | Muavic-EN | No | Time-based | 0.8153 | 0.7180 | 0.4643 |
| BL3 | Auto-AVSR | No | Time-based | 0.8153 | 0.8315 | 0.5211 |
| BL4 | AV-HuBERT CTC/Attention | Yes | Time-based | 0.8153 | 0.4990 | 0.3548 |
For detailed implementation and inference instructions, please refer to the baseline repository on GitHub.