- [June 2, 2023] Added details regarding the evaluation.
- Teams do not need to pre-register, but if you are considering participating or just want to learn more, please sign up to the CHiME Google Group. We will use this group to send general announcements that keep prospective participants updated as the challenge progresses. Participants are also invited to use the “chime-7-task-2-udase” channel of the CHiME Slack workspace for discussions about the UDASE task.
- Teams can be from one or more institutions.
- The organizers may enter the challenge themselves but will not be eligible to win.
Teams must provide a technical report with a complete description of the submitted system, including a description of any external resource (software, model, etc.) that might have been used. See the Submission section for more details.
All technical documents will be published.
Teams are encouraged – but not required – to make their code open source.
The LibriSpeech and WHAM! datasets (train and dev sets) - from which LibriMix is built - can be used individually, for instance to learn isolated speech and noise signal models. They can also be used to create synthetic mixtures differently from the original LibriMix dataset.
- Participants are allowed to create synthetic mixtures using noise-only segments that would be extracted from the binaural recordings of the CHiME-5 training set, only if this extraction does not rely on the CHiME-5 ground-truth transcription.
- Participants are also allowed to use room impulse responses (RIRs) to create reverberant utterances from LibriSpeech, only if the RIRs are synthetic.
- Using other datasets of clean speech signals, noise signals, or measured RIRs in order to create synthetic noisy speech mixtures labeled with the clean speech reference signal is not allowed.
- Kinect recordings of the CHiME-5 dataset cannot be used.
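As an illustration of the kind of synthetic mixture creation permitted above, here is a minimal sketch that adds a noise segment to a clean utterance at a chosen signal-to-noise ratio. The function name `mix_at_snr` and the power-based scaling are assumptions made for this example; this is not the LibriMix recipe, and the signals are expected to come from sources allowed by the rules (for instance LibriSpeech speech and WHAM! noise).

```python
import numpy as np


def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so that the speech-to-noise power ratio is `snr_db`, then add it.

    Hypothetical helper for illustration; returns (mixture, scaled_noise).
    """
    speech = np.asarray(speech, dtype=np.float64)
    # Keep only as many noise samples as there are speech samples.
    noise = np.asarray(noise, dtype=np.float64)[: len(speech)]
    if len(noise) < len(speech):
        raise ValueError("noise segment is shorter than the speech signal")

    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    if noise_power == 0.0:
        raise ValueError("noise segment is silent")

    # Gain g such that speech_power / (g**2 * noise_power) == 10 ** (snr_db / 10).
    gain = np.sqrt(speech_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    scaled_noise = gain * noise
    return speech + scaled_noise, scaled_noise
```

The clean speech signal passed to the function serves as the reference label for the resulting mixture.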
A synthetic dataset that better matches the real CHiME-5 data could be created with more engineering effort and knowledge about the target domain. However, the goal of the UDASE task is to simulate more realistic conditions where such knowledge is not available. The motivation for the above rules is to encourage participants to use largely the same synthetic dataset and to show that models trained with out-of-domain labeled data can be adapted using unsupervised learning from in-domain unlabeled data.
To allow the submitted systems to be comparable, the use of external datasets of noisy speech recordings is not allowed, even without ground-truth clean speech reference.
All system parameters should be tuned on the training or development sets of the LibriMix, CHiME-5 and reverberant LibriCHiME-5 datasets as described and provided in the Data section, or variations that comply with the above rules.
- The only data that systems can use during evaluation is the noisy speech input signal. Noisy speech input signals should be processed independently of each other.
- Participants can use external voice activity detector, diarization, speaker counting, or signal-to-noise ratio estimation systems.
Participants may choose to use all, some, or none of the parts of the baseline model.
There are no latency or computational constraints on the submitted systems.
The submitted systems will follow a two-step evaluation process:
- They will first be evaluated with the scale-invariant signal-to-distortion ratio (SI-SDR) on the reverberant LibriCHiME-5 dataset (complete eval set) and with the DNS-MOS P.835 on the single-speaker segments of the CHiME-5 dataset (eval/1 subset).
- The 4 best-performing systems in terms of SI-SDR or overall MOS for DNS-MOS will then be evaluated with a listening test using audio samples from the eval/listening_test subset of the CHiME-5 dataset.
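For readers unfamiliar with the first objective metric, the following is a minimal sketch of SI-SDR following its usual definition: the estimate is projected onto the reference, and the energy of that projection is compared with the energy of the residual. The function name, the mean removal, and the `eps` constant are assumptions of this example, and the official scoring scripts may differ in such details.

```python
import numpy as np


def si_sdr(estimate, reference, eps=1e-8):
    """Scale-invariant signal-to-distortion ratio in dB (illustrative sketch)."""
    estimate = np.asarray(estimate, dtype=np.float64)
    reference = np.asarray(reference, dtype=np.float64)

    # Assumption: remove the mean of both signals, as many implementations do.
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()

    # Optimal scaling of the reference, which makes the metric
    # insensitive to the overall gain of the estimate.
    alpha = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = alpha * reference
    residual = estimate - target

    ratio = (np.dot(target, target) + eps) / (np.dot(residual, residual) + eps)
    return 10.0 * np.log10(ratio)
```

Because of the projection step, multiplying the estimate by any non-zero constant leaves the score essentially unchanged, which is what distinguishes SI-SDR from a plain SDR.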
The listening test will be inspired by the ITU-T Recommendation P.835 “Subjective test methodology for evaluating speech communication systems that include noise suppression algorithm”. The experiment will involve using headphones to listen to short audio samples (about 4 seconds). Each trial will include three repetitions of the same sample. For one repetition, the participant will be instructed to focus on the speech signal and rate how natural it sounds. For another, he/she will be instructed to focus on the background and rate how noticeable or intrusive it sounds. For the third repetition, he/she will be instructed to attend to both the speech signal and the background and rate his/her opinion of the overall quality of the sample for purposes of everyday speech communication. Ratings will involve 5-point Likert scales, allowing us to compute mean opinion scores (MOS).
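To make the last step concrete, here is a minimal sketch of how mean opinion scores could be computed from 5-point ratings, with one score per system and per rating scale. The data layout, the system name, and the scale labels (`"speech"`, `"background"`, `"overall"`) are hypothetical; the organizers' actual analysis may differ.

```python
from collections import defaultdict
from statistics import mean


def mean_opinion_scores(ratings):
    """Average 1-to-5 ratings per (system, scale) pair.

    `ratings` is an iterable of (system, scale, score) tuples
    (a hypothetical layout chosen for this example).
    """
    grouped = defaultdict(list)
    for system, scale, score in ratings:
        if not 1 <= score <= 5:
            raise ValueError(f"rating out of range: {score}")
        grouped[(system, scale)].append(score)
    return {key: mean(scores) for key, scores in grouped.items()}


# Example with made-up ratings for one hypothetical system.
example_ratings = [
    ("sys_a", "speech", 4),
    ("sys_a", "speech", 5),
    ("sys_a", "background", 3),
    ("sys_a", "overall", 4),
    ("sys_a", "overall", 3),
]
example_mos = mean_opinion_scores(example_ratings)
```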
The final ranking of the systems will be based on the results of the listening test.
- A team can submit up to two entries.
- If a team submits multiple entries, only the one that obtains the best performance during the first evaluation stage (objective metrics) can qualify for the second evaluation stage (listening test).
As a condition of submission, entrants grant the organizers a perpetual, irrevocable, worldwide, royalty-free, and non-exclusive licence to use, reproduce, adapt, modify, publish, distribute, publicly perform, create a derivative work from, and publicly display the submitted audio signals and CSV files (containing the performance scores).
The main motivation for this condition is to donate the submitted audio signals and corresponding human listening scores to the community as a voice quality dataset.
In case of doubt regarding the task rules, please contact the organizers ahead of the submission deadline.