Track 1, Small vocabulary
Task overview
The task considers the problem of recognising commands spoken in a noisy living room, from recordings made using a binaural manikin. As in the 2011 CHiME Challenge, the target utterances are taken from the small-vocabulary Grid corpus. However, while the target speaker was previously located at a fixed position 2 m directly in front of the manikin, he/she is now allowed to make small head movements within a square zone of ±10 cm around that position.
Data
Recording and mixing process
The target utterances consist of 34 speakers reading simple 6-word sequences of the form <command:4> <color:4> <preposition:4> <letter:25> <number:10> <adverb:4>, where the numbers in brackets indicate the number of choices at each point. Each utterance has been convolved with a set of binaural room impulse responses (BRIRs) simulating speaker movements and reverberation. The target speaker is static at the beginning of each utterance, then he/she moves once, and finally he/she is static again. Movements follow a straight left-right line at a fixed front-back distance from the manikin, and each movement spans at most 5 cm at a speed of at most 15 cm/s. These movements have been simulated by interpolating a set of fixed BRIRs recorded at closely spaced positions, in a way that has been shown to provide a reasonable approximation to actual time-varying BRIRs.
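As a rough illustration of how such a movement could be simulated, the sketch below crossfades between the utterance convolved with the start-position and end-position BRIRs (one channel shown). This is a simplification of the BRIR interpolation actually used for the challenge data; the function and its arguments are assumptions for illustration only.

```python
import numpy as np
from scipy.signal import fftconvolve

def simulate_move(speech, brir_start, brir_end, fs, t_start, t_end):
    """Approximate a speaker moving once during the utterance by crossfading
    between the reverberated signals at the start and end positions.

    speech          : 1-D clean speech signal
    brir_start/_end : one channel of the BRIRs at the two positions
    t_start, t_end  : movement interval in seconds
    """
    rev_start = fftconvolve(speech, brir_start)
    rev_end = fftconvolve(speech, brir_end)
    t = np.arange(len(rev_start)) / fs
    # Static before the move, linear crossfade during it, static after it.
    gain = np.clip((t - t_start) / (t_end - t_start), 0.0, 1.0)
    return (1.0 - gain) * rev_start + gain * rev_end
```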
The reverberated utterances have then been mixed with binaural recordings of genuine room noise made over a period of days in the same family living room. The temporal placement of the utterances within the noise background has been controlled so as to produce mixtures at 6 different ranges of SNR (-6, -3, 0, 3, 6 and 9 dB) without rescaling the speech and noise signals.
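As an illustration of placing an utterance without rescaling, the sketch below scans a long noise recording for a start sample at which the natural SNR of the reverberated utterance against the co-occurring noise falls in a target range. The function names and hop size are illustrative assumptions, not the official mixing scripts.

```python
import numpy as np

def snr_db(speech, noise_segment):
    """SNR in dB between a reverberated utterance and the co-occurring noise."""
    return 10 * np.log10(np.sum(speech ** 2) / np.sum(noise_segment ** 2))

def find_placement(speech, noise, snr_lo, snr_hi, hop=1600):
    """Return the first start sample whose natural SNR lies in [snr_lo, snr_hi) dB,
    or None if no such placement exists in this noise recording."""
    for start in range(0, len(noise) - len(speech), hop):
        if snr_lo <= snr_db(speech, noise[start:start + len(speech)]) < snr_hi:
            return start
    return None
```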
More details about the background noise and BRIR recording process can be found here and there are some audio demos here.
Training, development and test data
The data are now available through the LDC as LDC2017S07.
These include:
- training set: 500 utterances from each of 34 speakers,
- development set: 600 utterances at each of 6 ranges of SNR,
- test set: 600 utterances at each of 6 ranges of SNR,
- 7 hours of background noise recordings that are not part of the training set.
All data are provided as 16 bit WAV files sampled at 16 kHz. The noisy utterances are provided both in isolated form and in embedded form. In the embedded form, each training utterance is surrounded by 5 s of background noise, while the development and test utterances are embedded in continuous 5 min background noise recordings.
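For example, a minimal sketch of recovering the isolated noisy utterance from a training-set embedded file, assuming the 5 s noise context described above (the file path and the use of the soundfile package are assumptions):

```python
import soundfile as sf

# Read an embedded training utterance (hypothetical path) and trim the 5 s of
# background noise on each side to recover the isolated noisy utterance.
embedded, fs = sf.read("train/embedded/s1_bgaa9a.wav")  # 16 kHz, 2 channels
context = 5 * fs
isolated = embedded[context:-context]
sf.write("s1_bgaa9a_isolated.wav", isolated, fs)
```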
References
If you use the data in any published research, please cite:
- Vincent, E., Barker, J., Watanabe, S., Le Roux, J., Nesta, F. and Matassoni, M., "The second CHiME Speech Separation and Recognition Challenge: Datasets, tasks and baselines", in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Vancouver, 2013.
- Barker, J. P., Vincent, E., Ma, N., Christensen, H. and Green, P. D., "The PASCAL CHiME Speech Separation and Recognition Challenge", Computer Speech and Language, 27(3):621-633, 2013.
Filenaming conventions and annotation files
Isolated utterances are named <GridUtt> or s<Speaker>_<GridUtt>, where <GridUtt> is a 6-character code for the word sequence and <Speaker> is the speaker ID. Embedded utterances are split into 5 minute segments named CR_lounge_<Date>_<Time>.s<SegmentNumber>, where <Time> is the start time of the recording and <SegmentNumber> is the offset in seconds from the beginning of the recording. The data are accompanied by one annotation file per speaker or per SNR (a minimal parsing sketch is given after the field list below). Each line encodes the available information about one utterance in the format <Utt> <NoiseSegment> <StartSample> <Duration> <SNR> <Y> <XStart> <XEnd> <TStart> <TEnd> where
- <Utt> is the utterance filename
- <NoiseSegment> the noise background segment filename
- <StartSample> the position (start sample) of the utterance within the noise background segment
- <Duration> the duration of the utterance in samples
- <SNR> the range of SNR in dB (available for training only)
- <Y> the front-back distance from the microphones in meters
- <XStart> the initial left-right position in meters relative to the front direction of the manikin (before the move)
- <XEnd> the final left-right position in meters relative to the front direction of the manikin (after the move)
- <TStart> the starting time of the move in samples after the beginning of the utterance
- <TEnd> the ending time of the move in samples after the beginning of the utterance
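A minimal sketch of reading such an annotation file and cutting the corresponding noisy utterance out of its background segment is shown below. The field names, directory layout and file extensions are assumptions for illustration, not part of the official tools.

```python
import soundfile as sf

FIELDS = ["utt", "noise_segment", "start_sample", "duration",
          "snr", "y", "x_start", "x_end", "t_start", "t_end"]

def parse_annotation(path):
    """Read an annotation file into a list of per-utterance field dictionaries,
    assuming all ten fields are present on each line (<SNR> is only given for
    the training set, so other sets may differ)."""
    with open(path) as f:
        return [dict(zip(FIELDS, line.split())) for line in f if line.strip()]

def cut_utterance(ann, segment_dir):
    """Extract the noisy utterance from its 5 min background segment using the
    <StartSample> and <Duration> fields (extension handling may differ)."""
    segment, fs = sf.read(f"{segment_dir}/{ann['noise_segment']}.wav")
    start, dur = int(ann["start_sample"]), int(ann["duration"])
    return segment[start:start + dur], fs
```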
Baseline software tools
Recognition systems are evaluated on their ability to correctly recognise the letter and digit tokens. Baseline scoring, decoding and training tools based on HTK are also available in LDC2017S07. These tools include 3 baseline speaker-dependent recognition systems trained on clean, reverberated and noisy data, respectively, and a number of scripts allowing you to
- train a baseline recognition system from the training data after processing by your own denoising front end,
- transcribe utterances in the development and test sets using one of the 3 provided systems or your own trained system,
- score the resulting transcriptions in terms of keyword recognition rates.
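As an illustration of this metric (not the provided HTK scoring scripts), the keyword recognition rate counts only the letter and digit tokens:

```python
def keyword_recognition_rate(ref_pairs, hyp_pairs):
    """Percentage of letter and digit keywords recognised correctly.

    ref_pairs / hyp_pairs: lists of (letter, digit) tuples, one per utterance,
    in the same order; the names and interface are illustrative assumptions.
    """
    correct = sum((rl == hl) + (rd == hd)
                  for (rl, rd), (hl, hd) in zip(ref_pairs, hyp_pairs))
    return 100.0 * correct / (2 * len(ref_pairs))
```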