Distant Automatic Speech Recognition with Multiple Devices in Diverse Scenarios.
This Task focuses on distant automatic speech transcription and segmentation with multiple recording devices, and is an update of last year's CHiME-7 DASR Task.
The goal for each participant is to devise an automated system that can tackle this problem and generalize across:
- different array topologies,
- acoustic scenarios: meetings, dinner parties and interviews,
- a wide variety of speaker counts (this year up to 7 speakers),
- variations in language style, e.g. more formal vs. colloquial.
Participants may exploit commonly used open-source datasets (e.g. LibriSpeech) and pre-trained models. In particular, this includes popular self-supervised learning representation (SSLR) models (see the Rules section for a complete list).
What's new compared to CHiME-7?
- An additional scenario from the NOTSOFAR-1 Task, featuring short office meetings with up to 7 participants. This scenario alone should make an already challenging task considerably more difficult.
- More relaxed rules about the a-priori information participants can use regarding the array topology (e.g. which channels belong to the same device).
- We want to encourage more research towards DNN-based front-ends.
- No oracle diarization track: a single track for joint ASR and diarization (same as last year's main track).
CHiME-7 DASR only partially achieved the scientific goals set at the time of its organization. Participants' submissions demonstrated that it is possible to devise a single system that is array-topology agnostic, generalizes to multiple speakers, and performs well across all scenarios. The top-ranking systems improved considerably over the CHiME-6 Challenge results in the CHiME-6 scenario, while also achieving high performance in the other two scenarios.
All participants also found the use of pre-trained models crucial to obtaining such impressive results.
There were also some shortcomings:
- Most submissions were largely based on ensembling techniques, making the approaches impractical for real-world deployment.
- Evaluation data was not fully blind.
- The challenge failed to produce significant advances in the speech separation component: all participants relied mostly on guided source separation (GSS).
This year, evaluation data from the NOTSOFAR-1 Task will be fully blind, and we will offer an incentive for participants to produce more practically viable systems that rely on novel, effective techniques rather than brute force (details will be announced later, so stay tuned).
Since this year there will be two closely related Tasks, DASR and NOTSOFAR-1, our goal is also to compare submissions across the two. The idea is to see how a generalizable transcription system (as required in DASR) compares to a transcription system specialized for the particular NOTSOFAR-1 scenario. We encourage participants to enroll in both tasks, to find out how best to adapt a general system to the NOTSOFAR-1 scenario and vice versa.
For this reason, there will be cross-task coordination with NOTSOFAR-1 to make it as easy as possible to run experiments in both.