Track 2, Medium vocabulary


Task overview

The task considers the problem of recognizing utterances spoken in a noisy living room, from recordings made using a binaural manikin. The task uses the same setup as the 2011 CHiME Challenge in terms of reverberation and noise conditions, but the target utterances are taken from the speaker-independent medium-vocabulary (5k) subset of the Wall Street Journal (WSJ0) corpus, a well-known corpus of read speech.

Data

Mixing process

The target utterances are speech utterances from the Linguistic Data Consortium's CSR-I (WSJ0) dataset. As in the 2011 CHiME Challenge, each utterance has been convolved with a fixed Binaural Room Impulse Response (BRIR) corresponding to a frontal position at a distance of 2 m, then mixed with binaural recordings of genuine room noise made over a period of days in the same family living room. The temporal placement of the utterances within the noise background has been controlled so as to produce mixtures at 6 different SNR ranges (-6, -3, 0, 3, 6 and 9 dB) with limited rescaling of the speech and noise signals.
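
For concreteness, here is a minimal sketch of mixing a reverberated utterance with a noise segment at a target SNR by rescaling the noise (in Python, using NumPy). It is an illustration only, not the official mixing procedure: as described above, the challenge data were produced mainly by controlling where each utterance is placed within the continuous noise recording, with only limited rescaling.

    import numpy as np

    def mix_at_snr(speech, noise, target_snr_db):
        """Mix a reverberated utterance with a noise segment at a target SNR (in dB).

        Illustrative sketch only: here the noise is simply rescaled, whereas the
        official data were created mainly by selecting the temporal placement of
        each utterance within the continuous noise recording.
        """
        # Both inputs are float arrays of shape (num_samples, 2), i.e. binaural audio.
        assert speech.shape == noise.shape
        speech_power = np.mean(speech ** 2)
        noise_power = np.mean(noise ** 2)
        # Gain such that 10 * log10(speech_power / (gain**2 * noise_power)) = target_snr_db.
        gain = np.sqrt(speech_power / (noise_power * 10 ** (target_snr_db / 10)))
        return speech + gain * noise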

More details about the background noise and BRIR recording process can be found here. Some audio demos using the Grid corpus as target speech are available here.

Training, development and test data

The data are now available through the LDC as LDC2017S10.

All data are provided as 16-bit stereo WAV files sampled at 16 kHz. The noisy utterances are provided both in isolated form and in embedded form; the latter includes 5 s of background noise before and after each utterance.
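
As a quick illustration of the embedded format, the following sketch trims an embedded file back to the isolated utterance. It is a sketch only: it assumes exactly 5 s of context on each side, a 16 kHz sampling rate and a hypothetical file name, and uses SciPy's WAV reader for convenience.

    from scipy.io import wavfile

    # Sketch only: the file name is hypothetical and exactly 5 s of background
    # context is assumed on each side of the utterance.
    rate, embedded = wavfile.read("embedded_utterance_example.wav")  # int16 samples, shape (n, 2)
    context = 5 * rate                                               # 5 s of context, in samples
    isolated = embedded[context:-context, :]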

  • Training set: 7138 reverberated utterances from the 83 speakers of the WSJ0 SI-84 training set, plus the same utterances each mixed with noise at one randomly chosen SNR.
  • Development set: 409 noisy utterances from 10 other speakers, forming the “no verbal punctuation” (NVP) part of the WSJ0 speaker-independent 5k vocabulary development set, at each of the 6 SNR ranges.
  • Test set: 330 noisy utterances from 12 other speakers, forming the Nov’92 ARPA WSJ evaluation set (NVP, 5k vocabulary), at each of the 6 SNR ranges.

Different BRIRs are used for the training, development and test sets; these are the same BRIRs as in the 2011 CHiME Challenge.

In addition to the above data, we also provide 7 hours of noise background and an optional set of noisy utterances with a larger vocabulary, derived from the WSJ0 speaker-independent 20k vocabulary development and test sets at each of the 6 SNR ranges. These data are not part of the challenge, but you are welcome to use them, provided that you also report the results obtained using the official training and development sets.

Public data subset

A "public" subset of the development test set is made available here to all participants, for evaluation purposes only, under agreement with the LDC. It consists of 240 stereo audio files: noisy versions of 40 utterances by one female speaker and one male speaker of the WSJ0 development set (si_dt_05) at the 6 different SNRs.

(For FTP downloads, log in with the user name 'anonymous'.)

References

If you use these data in any published research, please cite:

  • Vincent, E., Barker, J., Watanabe, S., Le Roux, J., Nesta, F. and Matassoni, M., "The second CHiME Speech Separation and Recognition Challenge: Datasets, tasks and baselines", in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Vancouver, 2013.
  • Barker, J. P., Vincent, E., Ma, N., Christensen, H. and Green, P. D., "The PASCAL CHiME Speech Separation and Recognition Challenge", Computer Speech and Language, 27(3):621-633, 2013.

Baseline software tools

The task is to transcribe all test utterances. Success is measured in terms of Word Error Rate (WER), i.e., the number of word substitutions, insertions and deletions as a fraction of the number of target words.
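
For reference, the WER is obtained from a standard Levenshtein (minimum edit distance) alignment between the reference and hypothesis word sequences. The sketch below illustrates that computation in Python; official results should of course be produced with the provided scoring tools described below.

    def word_error_rate(reference, hypothesis):
        """Return WER = (substitutions + deletions + insertions) / number of reference words.

        Minimal illustration via Levenshtein alignment over words; use the
        provided HTK-based scoring tools for official results.
        """
        ref, hyp = reference.split(), hypothesis.split()
        # d[i][j] = minimum number of edits turning ref[:i] into hyp[:j].
        d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            d[i][0] = i
        for j in range(len(hyp) + 1):
            d[0][j] = j
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                cost = 0 if ref[i - 1] == hyp[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,         # deletion
                              d[i][j - 1] + 1,         # insertion
                              d[i - 1][j - 1] + cost)  # substitution or match
        return d[len(ref)][len(hyp)] / len(ref)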

Baseline scoring, decoding and retraining tools based on HTK and on Keith Vertanen's recipes are also available in LDC2017S10.

These tools include 3 baseline speaker-independent recognition systems trained on clean, reverberated and noisy data, respectively, and a number of scripts allowing you to

  • train a baseline recognition system from the training data after processing by your own denoising front end (through retraining, not flat start),
  • transcribe utterances in the development and test sets using one of the 3 provided systems or your own trained system,
  • score the resulting transcriptions in terms of word error rate.

While these tools have been extensively tested, please do not hesitate to contact us if you encounter any problems installing or using them.

Instructions

In order to reach a broad audience we have tried to avoid setting rules that might artificially disadvantage one research community over another. However, to keep the task as close to an application scenario as possible, and to allow systems to be broadly comparable, there are some guidelines that we would like participants to follow.

Which information can I use?

You are encouraged to use the embedded data in any way that may help, e.g., to learn about the acoustic environment in general or about the immediate acoustic context of each utterance. However, you should not train models of the noise background of a given test utterance on other test utterances: because the noise signals in different utterances temporally overlap, this would lead to strong overfitting.

Which information shall I not use?

The systems should not exploit:
  • the SNR labels in the test data,
  • the fact that the same utterances are used at each SNR,
  • the fact that the same noise backgrounds are used in the development and test sets,
  • the fact that the same utterances are used within the clean, reverberated and noisy training sets.

Joint processing of all the test utterances is allowed, but the fact that the BRIRs are identical across test utterances must not be explicitly exploited.

All parameters should be tuned on the training set or the development set. Once you are satisfied with your system's tuning, run it only once on the final test set.

Can I use different features, a different recognizer or more data?

You are entirely free in the development of your system, from the front end to the back end and beyond, and you may even use extra data, e.g., derived from the provided 7 hours of noise background. However, if you change the features or the recognizer compared to the baseline, or if you use extra data, you should provide enough information, results and comparisons so that one can understand where the performance gains obtained by your system come from. For example, if your system is made of multiple blocks, we encourage you to separately evaluate and report the influence of each block on performance. If you use extra data, you should also report the results obtained from the official training and development sets alone.
