Track 1, Small vocabulary


Task overview

The task considers the problem of recognising commands spoken in a noisy living room, from recordings made using a binaural mannikin. As in the 2011 CHiME Challenge, the target utterances are taken from the small-vocabulary Grid corpus. However, while the target speaker was previously fixed at a position 2 m directly in front of the mannikin, he/she is now allowed to make small head movements within a square zone of ±10 cm around that position.


Recording and mixing process

The target utterances consist of 34 speakers reading simple 6-word sequences of the form <command:4><color:4><preposition:4><letter:25><number:10><adverb:4> where the numbers in brackets indicate the number of choices at each point.
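
With these class sizes the grammar admits 4 × 4 × 4 × 25 × 10 × 4 = 64,000 distinct sentences. A minimal sketch of the grammar in Python; the word lists come from the Grid corpus and are shown here for illustration rather than as a normative specification:

```python
import math

# Grid word classes; the sizes match the grammar above.
COMMANDS = "bin lay place set".split()
COLORS = "blue green red white".split()
PREPOSITIONS = "at by in with".split()
LETTERS = list("abcdefghijklmnopqrstuvxyz")  # 25 letters, W excluded
NUMBERS = "zero one two three four five six seven eight nine".split()
ADVERBS = "again now please soon".split()

CLASSES = [COMMANDS, COLORS, PREPOSITIONS, LETTERS, NUMBERS, ADVERBS]

# 4 * 4 * 4 * 25 * 10 * 4 = 64000 distinct sentences
N_SENTENCES = math.prod(len(c) for c in CLASSES)

def sentence(indices):
    """Build one sentence from one index per word class."""
    return " ".join(cls[i] for cls, i in zip(CLASSES, indices))
```

For example, `sentence([0, 0, 0, 5, 2, 1])` yields "bin blue at f two now".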

Each utterance has been convolved with a set of binaural room impulse responses (BRIRs) simulating speaker movements and reverberation. The target speaker is static at the beginning of each utterance, then he/she moves once, and finally he/she is static again. Movements follow a straight left-right line at fixed front-back distance from the mannikin and each movement is at most 5cm at a speed of at most 15cm/s. These movements have been simulated by interpolating a set of fixed BRIRs recorded at closely spaced positions in a way that has been shown to provide a reasonable approximation to actual time-varying BRIRs.

The reverberated utterances have then been mixed with binaural recordings of genuine room noise made over a period of days in the same family living room. The temporal placement of the utterances within the noise background has been controlled so as to produce mixtures at 6 different ranges of SNR (-6, -3, 0, 3, 6 and 9 dB) without rescaling the speech and noise signals.
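
The placement idea can be sketched as follows: slide the reverberated utterance along the background and keep the offsets whose natural SNR falls inside the desired range. This is an illustrative reconstruction of the principle, not the actual mixing code:

```python
import numpy as np

def snr_db(speech, noise):
    """Wideband SNR in dB of equal-length speech and noise signals."""
    return 10.0 * np.log10(np.sum(speech ** 2) / np.sum(noise ** 2))

def offsets_in_range(speech, background, lo, hi, hop=1600):
    """Candidate start samples at which mixing the utterance into the
    background yields an SNR inside [lo, hi] dB without any rescaling.
    Assumes the background has non-zero energy everywhere."""
    hits = []
    for start in range(0, len(background) - len(speech), hop):
        segment = background[start:start + len(speech)]
        if lo <= snr_db(speech, segment) <= hi:
            hits.append(start)
    return hits
```

Because the noise level fluctuates naturally over the recording, different offsets of the same utterance land in different SNR ranges.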

More details about the background noise and BRIR recording process can be found here and there are some audio demos here.

Training, development and test data

The data are now available through the LDC as LDC2017S07.

These include:

  • training set: 500 utterances from each of 34 speakers,
  • development set: 600 utterances at each of 6 ranges of SNR,
  • test set: 600 utterances at each of 6 ranges of SNR,
  • 7 h of additional noise background that is not part of the training set.

The noise-free reverberated utterances of the development set are provided for benchmarking purposes only, e.g., for computing the SNR achieved by the denoising front-end, and shall not be exploited to obtain the output transcripts in any way. You are welcome to use the additional noise background data, provided that you also report the results obtained using the official training and development sets.
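
For instance, one simple way to use the noise-free reverberated signals is to measure the SNR at the output of a denoising front-end, counting everything that deviates from the reference as residual noise and distortion. This is a sketch of one common definition; the challenge does not prescribe a particular measure:

```python
import numpy as np

def output_snr_db(enhanced, reference):
    """SNR in dB of an enhanced signal against its noise-free
    reverberated reference; any deviation from the reference is
    counted as residual noise and distortion."""
    residual = enhanced - reference
    return 10.0 * np.log10(np.sum(reference ** 2) / np.sum(residual ** 2))
```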

All data are provided as 16 bit WAV files sampled at 16 kHz. The noisy utterances are provided both in isolated form and in embedded form. The latter either involve 5 s of background noise before and after the utterance (in the training set) or they are mixed in continuous 5 min noise background recordings (in the development and test sets).
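
Cutting an isolated utterance (or an utterance plus some surrounding context) back out of an embedded segment only requires the start sample and duration given in the annotation files. A sketch using the standard-library wave module for the 16-bit stereo WAV files:

```python
import wave
import numpy as np

def extract_utterance(segment_wav, start_sample, duration, context=0):
    """Cut an utterance out of an embedded noise segment.

    start_sample and duration are in samples, as in the annotation
    files; context adds extra samples of acoustic context on each
    side, clamped to the segment boundaries."""
    with wave.open(segment_wav, "rb") as w:
        n_channels = w.getnchannels()  # 2 for the binaural data
        frames = w.readframes(w.getnframes())
    audio = np.frombuffer(frames, dtype=np.int16).reshape(-1, n_channels)
    lo = max(0, start_sample - context)
    hi = min(len(audio), start_sample + duration + context)
    return audio[lo:hi]
```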


If you use the data in any published research, please cite:

  • Vincent, E., Barker, J., Watanabe, S., Le Roux, J., Nesta, F. and Matassoni, M. "The second CHiME Speech Separation and Recognition Challenge: Datasets, tasks and baselines", in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2013), Vancouver
  • Barker, J.P., Vincent, E., Ma, N., Christensen, H. and Green, P.D. "The PASCAL CHiME Speech Separation and Recognition Challenge", Computer Speech and Language, 27:3 (2013) pages 621-633

Filenaming conventions and annotation files

Isolated utterances are named as <GridUtt> or s<Speaker>_<GridUtt>, where <GridUtt> is a 6-character code encoding the word sequence and <Speaker> is the speaker ID. Embedded utterances are split up into 5-minute segments named as CR_lounge_<Date>_<Time>.s<SegmentNumber>, where <Time> is the start time of the recording and <SegmentNumber> is the offset in seconds after the beginning of the recording.
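
Both conventions are easy to pick apart with regular expressions. The exact <Date> and <Time> formats are not spelled out above, so the patterns below are deliberately loose, and the example names used later (such as "s1_bbaf2n") are illustrative:

```python
import re

# Loose patterns for the two filenaming conventions described above.
ISOLATED = re.compile(r"^(?:s(?P<speaker>\d+)_)?(?P<gridutt>[a-z0-9]{6})$")
EMBEDDED = re.compile(
    r"^CR_lounge_(?P<date>[^_]+)_(?P<time>[^.]+)\.s(?P<offset>\d+)$")

def parse_name(name):
    """Return a dict of the fields encoded in an utterance or segment name."""
    for pattern in (ISOLATED, EMBEDDED):
        m = pattern.match(name)
        if m:
            return m.groupdict()
    raise ValueError(f"unrecognised filename: {name}")
```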

The data are accompanied by one annotation file per speaker (training set) or per SNR (development and test sets). Each line encodes the available information about one utterance in the format <Utt> <NoiseSegment> <StartSample> <Duration> <SNR> <Y> <XStart> <XEnd> <TStart> <TEnd> where

  • <Utt> is the utterance filename
  • <NoiseSegment> the noise background segment filename
  • <StartSample> the position (start sample) of the utterance within the noise background segment
  • <Duration> the duration of the utterance in samples
  • <SNR> the range of SNR in dB (available for training only)
  • <Y> the front-back distance from the microphones in meters
  • <XStart> the initial left-right position in meters compared to the front direction of the mannikin (before the move)
  • <XEnd> the final left-right position in meters compared to the front direction of the mannikin (after the move)
  • <TStart> the starting time of the move in samples after the beginning of the utterance
  • <TEnd> the ending time of the move in samples after the beginning of the utterance

Baseline software tools

Recognition systems are evaluated on their ability to correctly recognise the letter and digit tokens. Baseline scoring, decoding and training tools based on HTK are also available in LDC2017S07.

These tools include 3 baseline speaker-dependent recognition systems trained on clean, reverberated and noisy data, respectively, and a number of scripts allowing you to

  • train a baseline recognition system from the training data after processing by your own denoising front end,
  • transcribe utterances in the development and test sets using one of the 3 provided systems or your own trained system,
  • score the resulting transcriptions in terms of keyword recognition rates.
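
Stripped of the HTK machinery, the scoring principle is keyword accuracy over the letter and number slots. This sketch assumes one whole-sentence hypothesis per reference utterance; the official scripts in LDC2017S07 are authoritative:

```python
def keyword_recognition_rate(refs, hyps):
    """Percentage of letter and number keywords recognised correctly.
    refs and hyps are equal-length lists of 6-word sentences in
    command-color-preposition-letter-number-adverb order."""
    correct = total = 0
    for ref, hyp in zip(refs, hyps):
        ref_words, hyp_words = ref.split(), hyp.split()
        for i in (3, 4):  # 0-based letter and number positions
            total += 1
            correct += ref_words[i] == hyp_words[i]
    return 100.0 * correct / total
```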

Running the baseline systems should produce the following keyword recognition rates (%):

Development test set

SNR (dB)        -6     -3      0      3      6      9
clean        11.83  12.33  16.50  17.50  21.75  23.50
reverberated 32.08  36.33  50.33  64.00  75.08  83.50
noisy        49.67  57.92  67.83  73.67  80.75  82.67

Evaluation test set

SNR (dB)        -6     -3      0      3      6      9
clean        10.58  11.17  13.33  17.75  21.17  24.42
reverberated 32.17  38.33  52.08  62.67  76.08  83.83
noisy        49.33  58.67  67.50  75.08  78.83  82.92


In order to reach a broad audience we have tried to avoid setting rules that might artificially disadvantage one research community over another. However, to keep the task as close to an application scenario as possible, and to allow systems to be broadly comparable, there are some guidelines that we would like participants to follow.

Which information can I use?

You are encouraged to use the embedded data in any way that may help, e.g. to learn about the acoustic environment in general or about the immediate acoustic context of each utterance. Also, the recognition system is allowed to assume that the speaker identity is known and to use a corresponding model. Finally, information about the movements of the target speaker may be used, although participants are encouraged to also consider conditions where the head position is not explicitly known.