Task 3 - MMCSG

ASR for multimodal conversations in smart glasses


TL;DR

The CHiME-8 MMCSG task involves natural conversations between two people recorded with smart glasses. The goal is to obtain speaker-attributed transcriptions in a streaming fashion, using audio, video, and IMU input modalities. See the dedicated pages to learn more about the data, the rules and evaluation, and the baseline system. As in the other tasks, we are interested in practical and innovative solutions, not only in strong performance. We encourage participants to submit novel ideas or simple systems even if their performance is not particularly strong. We are also interested in systems that do not use any in-domain data for training or adaptation, as this is an interesting practical scenario.

For communication, please join the channel #chime-8-dasr in the dedicated Slack workspace. If you are considering participating in the challenge, please register your interest using this form.

Overview

Smart glasses are growing in popularity, especially for speech and audio use cases like audio playback and communication. Equipped with multiple microphones, cameras, and other sensors, and worn on the head, they offer various advantages compared to other devices such as phones or static smart speakers. One particularly interesting application is closed captioning of live conversations, which could eventually lead to applications such as real-time translation between languages. Such a system has to solve many problems jointly, including target speaker identification/localization, activity detection, speech recognition, and diarization. Additional signals, such as continuous accelerometer and gyroscope readings, combined with the audio modality can potentially aid in all of these tasks.

This challenge focuses on transcribing both sides of a conversation where one participant is wearing smart glasses equipped with a microphone array and other sensors. The conversations consist of natural, spontaneous speech between two participants, and some of the conversations include noise. Given the use case of real-time captioning, both transcription and diarization need to happen in a streaming fashion with as low a latency as possible.

The main research questions are: how well such a system can perform; to what extent technologies such as target speaker identification/localization, speaker activity detection, speech enhancement, speech recognition, and diarization contribute to successful recognition and diarization of live conversations; and to what extent signals from other modalities, such as cameras, accelerometer, and gyroscope, can improve performance over audio-only systems. Another question is the impact of a non-static microphone array on the wearer’s head: head movements cause motion blur in audio and video, which makes it harder to identify and separate the wearer’s speech from that of conversation partners and potential bystanders.

Important dates

February 15th, ’24 Challenge begins; release of train and dev datasets and baseline system
March 20th, ’24 Deadline for proposing additional public datasets and pre-trained models
June 15th, ’24 Evaluation set released
June 28th, ’24 Deadline for submitting results
June 28th, ’24 Deadline for submitting technical reports
September 6th, ’24 CHiME-8 Workshop
