Data

Updates:

  • [June 2, 2023] Added description of the evaluation sets for the CHiME-5 and Reverberant LibriCHiME-5 datasets.

The UDASE task builds upon the following datasets:

  • The CHiME-5 in-domain unlabeled data for training, development and evaluation;
  • The LibriMix out-of-domain labeled data for training and development;
  • The reverberant LibriCHiME-5 close-to-in-domain labeled data for development and evaluation.

Each dataset is described below. Each description includes a link to a GitHub repository with instructions to generate the data.

CHiME-5: In-domain unlabeled data

The CHiME-5 data consists of 20 dinner parties/sessions (four participants in each party), each recorded in a different home (Barker et al., 2018). Please visit the website of the 5th CHiME challenge for a full description of the data, in particular the Overview and Data sections.

For the UDASE task, we extracted from the binaural recordings single-channel audio segments where the participant wearing the microphone does not speak. Each audio segment therefore contains up to three simultaneously-active speakers and background noise. We used the right channel of the binaural recordings, as the left channel is not always reliable. We also discarded unreliable portions of the recordings. The noisy speech signals are not labeled with the clean speech reference signals. The main objective of the UDASE task is to develop new approaches that can leverage this in-domain unlabeled dataset for speech enhancement.

Compared with the original CHiME-5 dataset, sessions S07 and S17 were moved from the training set to the development set, so that a sufficient number of noise-only segments would be available to create the reverberant LibriCHiME-5 dataset (see below for more information). The training (train), development (dev) and evaluation (eval) sessions for the UDASE task are the following:

  • train = [S03, S04, S05, S06, S08, S12, S13, S16, S18, S19, S20, S22, S23, S24];
  • dev = [S02, S09, S07, S17];
  • eval = [S01, S21].

Training set

The training set consists of the raw single-channel audio segments extracted from the binaural recordings where the participant wearing the microphone does not speak. This extraction was done using the transcription files.
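
As a rough illustration of this extraction (not the official script), the sketch below assumes the transcription has been flattened to per-utterance (speaker, start, end) tuples in seconds; the actual CHiME-5 JSON stores per-device timestamp strings, which the scripts in the GitHub repository below handle:

```python
import soundfile as sf

def extract_nonwearer_segments(wav_path, utterances, wearer, min_dur=3.0):
    """Extract the right-channel portions where the microphone wearer
    does not speak.

    `utterances` is a list of (speaker, start, end) tuples in seconds,
    a simplified stand-in for the CHiME-5 transcription JSON.
    """
    audio, sr = sf.read(wav_path)
    right = audio[:, 1]  # right channel only: the left one is not always reliable

    # The wearer's speech intervals, sorted by start time.
    spans = sorted((start, end) for spk, start, end in utterances if spk == wearer)

    # Keep the gaps between the wearer's utterances (illustrative threshold).
    segments, cursor = [], 0.0
    for start, end in spans:
        if start - cursor >= min_dur:
            segments.append(right[int(cursor * sr):int(start * sr)])
        cursor = max(cursor, end)
    if len(right) / sr - cursor >= min_dur:
        segments.append(right[int(cursor * sr):])
    return segments, sr
```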

The table below summarizes the data in the train set.

Number of segments | Mean segment duration (sec) | Std segment duration (sec) | Total duration (HH:MM:SS)
27 517             | 10.91                       | 14.10                      | 83:22:29

In the GitHub repository (see Instructions below), we provide the option to cut the raw segments into consecutive chunks of at most 10 seconds, along with the output of a pre-trained voice activity detector applied to the raw audio segments.
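
The chunking itself is straightforward; a minimal sketch, assuming a raw segment given as a NumPy array `segment` sampled at rate `sr`:

```python
def cut_into_chunks(segment, sr, max_dur=10.0):
    """Cut a raw segment into consecutive chunks of at most max_dur seconds."""
    max_len = int(max_dur * sr)
    return [segment[i:i + max_len] for i in range(0, len(segment), max_len)]
```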

Development set

The dev set can be used to tune the speech enhancement system (e.g., its hyperparameters), which requires being able to compute objective performance metrics. This is a difficult problem as the dataset contains noisy multi-speaker speech recordings that are not labeled with the clean speech reference signals.

We used the transcription of the CHiME-5 recordings to extract short audio segments of at least 3 seconds. This extraction was done as follows:

  1. extract all segments where no speaker is active (i.e., noise-only segments);
  2. extract all segments that were not extracted previously and without overlapping speakers (i.e., single-speaker segments);
  3. extract all segments that were not extracted previously and with at most two overlapping speakers;
  4. extract all segments that were not extracted previously and with at most three overlapping speakers.

The extracted audio segments are therefore labeled with the maximum number of simultaneously-active speakers: 0, 1, 2 or 3. Note that this only corresponds to a maximum value, i.e., over the duration of a segment the number of simultaneously-active speakers can vary between 0 and this maximum value. Moreover, a segment might contain more speakers than the labeled maximum number of simultaneously-active speakers. For instance, a segment labeled as single-speaker might actually contain two active speakers who do not speak simultaneously.
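
For illustration, the maximum number of simultaneously-active speakers in a segment can be computed from the utterance intervals with a simple sweep over start/end events; a minimal sketch:

```python
def max_simultaneous_speakers(intervals):
    """Maximum number of simultaneously-active speakers.

    `intervals` is a list of (start, end) utterance times in seconds,
    possibly from several speakers, overlapping the segment.
    """
    # Sweep over start (+1) and end (-1) events in time order; ends sort
    # before starts at the same instant, so touching intervals do not
    # count as overlapping.
    events = [(s, 1) for s, e in intervals] + [(e, -1) for s, e in intervals]
    events.sort(key=lambda t: (t[0], t[1]))
    count = best = 0
    for _, delta in events:
        count += delta
        best = max(best, count)
    return best
```

For example, `max_simultaneous_speakers([(0.0, 2.0), (1.0, 3.0), (2.5, 4.0)])` returns 2: the first two utterances overlap, but never all three at once.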

Note that the indicated maximum number of simultaneously-active speakers may not be fully accurate because the utterance start and end times in the transcription of the CHiME-5 recordings were manually annotated.

As a post-processing step, Brouhaha (Lavechin et al., 2022) was applied to the 0- and 1-speaker segments to verify the absence and presence of speech, respectively (misclassified segments were reviewed and removed when appropriate).
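
For reference, a sketch of such a check, following the usage shown on the pyannote/brouhaha model card (the exact interface may differ across pyannote.audio versions, and the access token is a placeholder):

```python
from pyannote.audio import Inference, Model

# Load the pre-trained Brouhaha model (Hugging Face access token required).
model = Model.from_pretrained("pyannote/brouhaha", use_auth_token="hf_...")  # placeholder token
inference = Inference(model)

def speech_ratio(wav_path):
    """Fraction of frames that Brouhaha classifies as speech."""
    frames = [vad > 0.5 for _, (vad, snr, c50) in inference(wav_path)]
    return sum(frames) / len(frames)

# A 0-speaker segment with a high speech ratio, or a 1-speaker segment
# with a very low one, is a candidate for manual review.
```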

Noise-only (0-speaker) segments are used to create the reverberant LibriCHiME-5 dataset (see below), which allows objective metrics such as the SI-SDR to be computed on close-to-in-domain data. Single-speaker segments are used to compute DNS-MOS. There is no official guideline on how to exploit the 2- and 3-speaker segments for development.
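
For reference, the SI-SDR (scale-invariant signal-to-distortion ratio) can be computed with a few lines of NumPy; a minimal sketch of one common variant, which mean-centers both signals before projecting:

```python
import numpy as np

def si_sdr(estimate, reference, eps=1e-8):
    """Scale-invariant SDR in dB between an estimated and a reference signal."""
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()
    # Optimal scaling of the reference (projection of the estimate onto it).
    alpha = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = alpha * reference
    residual = estimate - target
    return 10 * np.log10((np.dot(target, target) + eps) / (np.dot(residual, residual) + eps))
```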

The table below summarizes the data in the dev set. The name of the subset indicates the maximum number of simultaneously-active speakers.

Subset | Number of segments | Mean segment duration (sec) | Std segment duration (sec) | Total duration (HH:MM:SS)
dev/0  | 912                | 6.50                        | 4.10                       | 01:38:49
dev/1  | 5 719              | 5.89                        | 3.49                       | 09:21:53
dev/2  | 3 835              | 5.23                        | 2.43                       | 05:34:33
dev/3  | 667                | 4.61                        | 1.84                       | 00:51:14

One may wonder why this extraction is performed on the dev set and not on the training set. The reason is that we need to be able to compute objective performance metrics for development, because tuning a system cannot rely only on listening tests. We are simulating the reasonable scenario where we can afford to manually annotate a small amount of data with speaker-count labels for development (and evaluation, see below), but where this procedure cannot easily be applied to a large training set.

Evaluation set

The evaluation set was created following a procedure similar to the development set. The table below summarizes the data in the eval set. The name of the subset indicates the maximum number of simultaneously-active speakers.

Subset | Number of segments | Mean segment duration (sec) | Std segment duration (sec) | Total duration (HH:MM:SS)
eval/0 | 977                | 5.73                        | 3.35                       | 01:33:19
eval/1 | 3 013              | 5.54                        | 2.94                       | 04:38:05
eval/2 | 1 552              | 4.88                        | 2.04                       | 02:06:07
eval/3 | 233                | 4.21                        | 1.17                       | 00:16:21

The eval/1 subset will be used to evaluate the submitted systems in terms of DNS-MOS (evaluation stage 1). The eval/2 and eval/3 subsets are not used to evaluate the systems and can be ignored by the participants (they are only provided for consistency with the dev set).

In addition, we provide to the participants the eval/listening_test subset, which contains the audio samples that will be used for the listening test (evaluation stage 2, for systems that passed the first evaluation stage):

Subset              | Number of segments | Mean segment duration (sec) | Std segment duration (sec) | Total duration (HH:MM:SS)
eval/listening_test | 241                | 4.72                        | 0.34                       | 00:18:58

These audio samples were extracted from the CHiME-5 eval set by looking for segments of 4 to 5 seconds that contain at least 3 seconds of speech, with 0.25 seconds without speech at the beginning and at the end of the segment (a sketch of these checks follows the list below). Additional constraints were taken into account to ensure a balanced subset in terms of speaker gender, recording location, and session. Among the selected segments:

  • 184 have both male and female speakers, 29 have only female speakers, 28 have only male speakers.
  • 121 are taken from session S01 and 120 from S21.
  • 81 are recorded in the kitchen, 80 in the dining room and 80 in the living room.
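
As an illustration of the duration and edge-silence checks mentioned above, a minimal sketch assuming a per-sample NumPy boolean voice activity mask `vad` at sampling rate `sr` (the gender/location/session balancing is a separate selection step):

```python
def valid_listening_test_segment(vad, sr, min_dur=4.0, max_dur=5.0,
                                 min_speech=3.0, edge_silence=0.25):
    """Check the duration/speech/edge-silence constraints for a candidate
    segment, given a per-sample boolean voice activity mask `vad`."""
    dur = len(vad) / sr
    if not (min_dur <= dur <= max_dur):
        return False                      # segment must last 4 to 5 seconds
    if vad.sum() / sr < min_speech:
        return False                      # at least 3 seconds of speech
    edge = int(edge_silence * sr)         # 0.25 s without speech at both ends
    return not vad[:edge].any() and not vad[-edge:].any()
```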

Instructions

The original CHiME-5 dataset is available under license here.

Scripts to extract the audio segments from the original CHiME-5 binaural recordings are available in this GitHub repository.

LibriMix: Out-of-domain labeled data

Solving the speech enhancement task from unlabeled in-domain data alone is difficult, which is why a labeled dataset is also provided. We chose LibriMix (Cosentino et al., 2020) because it is a standard open-source dataset in the community. LibriMix was originally developed for speech separation in noisy environments; it is derived from LibriSpeech clean utterances (Panayotov et al., 2015) and WHAM! noises (Wichern et al., 2019).

Two versions of LibriMix exist: Libri2Mix and Libri3Mix, which contain noisy speech mixtures with two and three overlapping speakers, respectively, labeled with the corresponding clean speech signals. A single-speaker version of LibriMix can be obtained easily by discarding one of the two speakers in Libri2Mix, as sketched below.
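
For instance, a minimal sketch of this recipe (hypothetical file paths; the baseline repository referenced below contains the official adaptation): keep the s1 source as the clean reference and remix it with the noise, discarding s2.

```python
import soundfile as sf

def make_single_speaker_example(s1_path, noise_path, mix_path, ref_path):
    """Remix the s1 source with the noise, discarding s2 (hypothetical paths)."""
    s1, sr = sf.read(s1_path)
    noise, sr_noise = sf.read(noise_path)
    assert sr == sr_noise
    n = min(len(s1), len(noise))                # truncate to the common length
    sf.write(mix_path, s1[:n] + noise[:n], sr)  # 1-speaker noisy mixture
    sf.write(ref_path, s1[:n], sr)              # clean speech reference
```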

Participants can optionally report results on the LibriMix dataset (see Submission section), but these will not be used to rank the systems.

Instructions

The scripts to create LibriMix are available in the official GitHub repository of the dataset.

For more information on how to adapt the LibriMix dataset for the UDASE task, participants are referred to the baseline GitHub repository (please visit the Baseline section).

Reverberant LibriCHiME-5: Close-to-in-domain labeled data

In real-world conditions, particularly for the CHiME-5 recordings, it is impossible to have access to the ground-truth clean speech reference signals associated with the noisy speech mixtures. Yet, when developing and evaluating a speech enhancement algorithm it is necessary to compute objective performance metrics. For this purpose, we created the reverberant LibriCHiME-5 dataset for development and evaluation only.

It consists of synthetic labeled mixtures of reverberant speech and noise, with up to three simultaneously-active speakers. Noise signals were extracted from the CHiME-5 recordings (see above). Clean speech utterances were taken from the LibriSpeech dataset and convolved with room impulse responses (RIRs) measured in domestic environments, provided in the VoiceHome corpus (Bertin et al., 2016). The utterances were mixed following speech activity patterns taken from the CHiME-5 data (provided by the transcription files), to simulate a conversation between multiple speakers. The per-speaker signal-to-noise ratio (SNR) was chosen to approximately match that of the CHiME-5 single-speaker segments, as estimated by Brouhaha.
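
As a rough illustration of the mixing recipe (not the official generation script, which also handles speaker activity patterns and alignment), one can convolve an utterance with an RIR and scale it to a target per-speaker SNR before adding the noise:

```python
import numpy as np
from scipy.signal import fftconvolve

def reverberate_and_mix(speech, rir, noise, snr_db, eps=1e-8):
    """Convolve a clean utterance with an RIR, scale it to a target
    per-speaker SNR, and add the noise (all signals are 1-D NumPy
    arrays at the same sampling rate)."""
    reverberant = fftconvolve(speech, rir)[:len(noise)]  # assume noise is long enough
    power_speech = np.mean(reverberant ** 2)
    power_noise = np.mean(noise ** 2)
    # Gain such that 10*log10(power(gain * speech) / power(noise)) = snr_db.
    gain = np.sqrt(power_noise * 10 ** (snr_db / 10) / (power_speech + eps))
    reverberant = gain * reverberant
    if len(reverberant) < len(noise):
        reverberant = np.pad(reverberant, (0, len(noise) - len(reverberant)))
    return reverberant + noise, reverberant  # noisy mixture and its clean-speech label
```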

Development set

The development set is created from:

  • the dev-clean subset of LibriSpeech;
  • noise-only segments from the dev set of CHiME-5;
  • a subset of VoiceHome RIRs.

The table below summarizes the data in the dev set. The name of the subset indicates the maximum number of simultaneously-active speakers.

Subset | Number of examples | Mean example duration (sec) | Std example duration (sec) | Total duration (HH:MM:SS)
dev/1  | 1 187              | 7.14                        | 4.67                       | 02:21:09
dev/2  | 565                | 5.37                        | 2.24                       | 00:50:31
dev/3  | 65                 | 4.81                        | 1.66                       | 00:05:12

Evaluation set

The evaluation set is created from:

  • the test-clean subset of LibriSpeech;
  • noise-only segments from the eval set of CHiME-5;
  • a subset of VoiceHome RIRs.

The table below summarizes the data in the eval set. The name of the subset indicates the maximum number of simultaneously-active speakers.

Subset | Number of examples | Mean example duration (sec) | Std example duration (sec) | Total duration (HH:MM:SS)
eval/1 | 1 394              | 6.25                        | 3.75                       | 02:25:17
eval/2 | 494                | 4.44                        | 1.34                       | 00:36:35
eval/3 | 64                 | 4.21                        | 1.07                       | 00:04:29

This evaluation set will be used to evaluate the submitted systems in terms of SI-SDR (evaluation stage 1).

Disclaimer

Despite the effort to generate a synthetic dataset that matches the distribution of the target domain as closely as possible, there still exists a mismatch between the reverberant LibriCHiME-5 dataset and the CHiME-5 dataset, e.g., read speech for the former and spontaneous speech for the latter. It is indeed impossible to create synthetic labeled data that perfectly match real-world unlabeled recordings, hence the UDASE task. Nevertheless, as already mentioned, being able to compute objective performance metrics, complementary to listening tests, is required for development and evaluation. DNS-MOS provides a way to evaluate the performance on single-speaker segments of the CHiME-5 data without having access to the clean speech reference signals, but this is not sufficient as a non-negligible amount of the CHiME-5 data is multi-speaker.

We believe it is reasonable to expect systems that successfully leverage the unlabeled CHiME-5 data to obtain better results on the reverberant LibriCHiME-5 dataset than fully-supervised systems trained only on the labeled LibriMix dataset. Indeed, in reverberant LibriCHiME-5, the speech utterances were convolved with real RIRs measured in domestic environments, the noise signals were extracted from the CHiME-5 recordings, the per-speaker SNR was chosen to approximately match that of the CHiME-5 data (as estimated by Brouhaha), and the speech utterances were mixed to simulate a conversation using the CHiME-5 transcription. We can thus hope that the performance computed on the reverberant LibriCHiME-5 dataset is an imperfect estimate of the performance on the CHiME-5 dataset. This should be confirmed by the results of the listening tests.

Instructions

Scripts to generate the reverberant LibriCHiME-5 dataset are available in this GitHub repository.

References

Barker, J., Watanabe, S., Vincent, E., & Trmal, J. (2018). The fifth ‘CHiME’ Speech Separation and Recognition Challenge: Dataset, task and baselines. In Interspeech.

Lavechin, M., Métais, M., Titeux, H., Boissonnet, A., Copet, J., Rivière, M., … & Bredin, H. (2022). Brouhaha: multi-task training for voice activity detection, speech-to-noise ratio, and C50 room acoustics estimation. arXiv preprint arXiv:2210.13248.

Cosentino, J., Pariente, M., Cornell, S., Deleforge, A., & Vincent, E. (2020). LibriMix: An open-source dataset for generalizable speech separation. arXiv preprint arXiv:2005.11262.

Panayotov, V., Chen, G., Povey, D., & Khudanpur, S. (2015). LibriSpeech: an ASR corpus based on public domain audio books. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

Wichern, G., Antognini, J., Flynn, M., Zhu, L. R., McQuinn, E., Crow, D., … & Roux, J. L. (2019). WHAM!: Extending speech separation to noisy environments. In Interspeech.

Bertin, N., Camberlein, E., Vincent, E., Lebarbenchon, R., Peillon, S., Lamandé, É., … & Jamet, E. (2016). A French corpus for distant-microphone speech processing in real homes. In Interspeech.