Instructions

(N.B. the PASCAL challenge is now officially closed and the results are available here. The instructions and data are being left on-line for the benefit of groups wishing to compare their algorithms with those that have been submitted.)

Task overview

The task considers the problem of recognising commands spoken in a noisy living room, from recordings made using a binaural manikin at a distance of 2 metres. To simplify the problem, participants may exploit the fact that the recording manikin maintains a fixed position in the room and that the target speech comes from a fixed position relative to the manikin (2 m distance, directly in front). However, no assumptions can be made about the distractor sounds other than that they are recordings of genuine room noise made over a period of days in the same family living room.

Participants may either construct recognition systems and return recognition results directly, or they may build source separation systems and return denoised signals for further processing by a standard recognition system that has been provided.

Data description and organisation

The target utterances are the same as those used in the 1st PASCAL Speech Separation Challenge, namely, 600 utterances from the Grid corpus. This corpus consists of 34 speakers reading sentences that are simple sequences of the form:

     <command:4><color:4><preposition:4><letter:25><number:10><adverb:4>

     e.g. "place white at L 3 now"

     (the numbers in brackets indicate the number of choices at each point).

Recognition systems are evaluated on their ability to correctly recognise the letter and digit tokens.
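
Since scoring depends only on these two tokens, a minimal Python sketch of extracting them from a Grid word sequence may be helpful (an illustrative helper, not part of the challenge tools; it relies only on the fixed word order of the grammar above):

    def scored_keywords(transcript):
        """Return the (letter, digit) tokens from a six-word Grid utterance.

        Word order is fixed by the grammar:
        command, colour, preposition, letter, digit, adverb.
        """
        words = transcript.lower().split()
        if len(words) != 6:
            raise ValueError("expected 6 words, got: %r" % transcript)
        return words[3], words[4]

    # e.g. scored_keywords("place white at l 3 now") -> ("l", "3")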

The test set has been convolved with a binaural room impulse response (BRIR) and mixed with binaural recordings from the CHiME domestic audio corpus. The BRIR was measured at a position 2 metres directly in front of the manikin. The temporal placement of the Grid utterances within the 20 hours of CHiME data has been controlled in a manner that produces mixtures at 6 different SNRs (-6, -3, 0, 3, 6 and 9 dB), giving 3,600 test utterances in total. Note that the range of SNRs has not been constructed by scaling the speech or noise amplitudes, but instead by choosing different noise segments for each SNR point. More details of the CHiME audio corpus and the CHiME-Grid mixing process can be found in this Interspeech 2010 paper. A concise description of the Grid corpus is presented here.
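
The exact mixing procedure is described in the paper above, but as a rough illustration of the idea — selecting noise segments at their natural levels rather than rescaling — a sketch in Python might look like this:

    import numpy as np

    def segment_snr_db(speech, noise):
        """SNR in dB between a speech signal and an equal-length noise segment."""
        return 10.0 * np.log10(np.mean(speech ** 2) / np.mean(noise ** 2))

    def find_segment_at_snr(speech, noise_stream, target_db, tol_db=0.5, hop=1600):
        """Scan a long noise recording for a start point giving roughly target_db."""
        n = len(speech)
        for start in range(0, len(noise_stream) - n, hop):
            seg = noise_stream[start:start + n]
            if abs(segment_snr_db(speech, seg) - target_db) < tol_db:
                return start, speech + seg   # mix at natural levels
        return None                          # no suitable segment found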

The development test data is provided as a set of 3,600 stereo 16-bit WAV files, available at either 16 kHz or 48 kHz. The data is available as a tar file which unpacks into a folder for each SNR. Each file contains an end-pointed noisy utterance with a name that indicates the speaker identity and the utterance word sequence. The development set is also available in an unsegmented form, i.e. the Grid utterances embedded in the continuous CHiME audio. The unsegmented data is accompanied by a file storing the position (start sample and duration) of each utterance to be recognised. Participants are encouraged to use the unsegmented data in any way that may help, e.g. to learn about the acoustic environment in general, or about the immediate acoustic context of each utterance.
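
For example, given the start sample and duration from the position file, an utterance together with some preceding acoustic context could be read with the soundfile package (the function and argument names here are illustrative):

    import soundfile as sf

    def read_with_context(wav_path, start, duration, context_s=5.0):
        """Cut an utterance plus preceding context from the unsegmented audio."""
        sr = sf.info(wav_path).samplerate
        begin = max(0, start - int(context_s * sr))
        audio, _ = sf.read(wav_path, start=begin, frames=start - begin + duration)
        return audio, sr, start - begin   # offset of the utterance in the slice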

A reverberated copy of the Grid corpus training set has also been provided. This has been produced by convolving the 17,000-utterance Grid training set with a BRIR measured at the same position as the one used to construct the evaluation data. Note, however, that although the same position is used, the response was measured at a different time and with a different room configuration, e.g. doors open/closed, curtains drawn/undrawn. The training data contains 500 utterances from each of the 34 Grid speakers and can be used to construct speaker-dependent models. The recognition system is allowed to assume that the speaker identity is known and to use a corresponding model. The speaker identities can be seen from the test set file names.
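
The reverberated training set is supplied ready-made, but for participants who want to reproduce the operation on other material, the convolution itself is straightforward, e.g. with scipy (the file names below are placeholders, not files from the distribution):

    import numpy as np
    import soundfile as sf
    from scipy.signal import fftconvolve

    def reverberate(clean, brir):
        """Convolve a mono utterance with a 2-channel BRIR -> binaural signal."""
        return np.stack([fftconvolve(clean, brir[:, 0]),
                         fftconvolve(clean, brir[:, 1])], axis=1)

    clean, sr = sf.read("clean_utterance.wav")   # mono Grid utterance (placeholder)
    brir, _ = sf.read("brir_2m_front.wav")       # 2-channel BRIR (placeholder)
    sf.write("reverberated.wav", reverberate(clean, brir), sr)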

A further test set will be released in February for the final evaluation. This will be produced in the same way as the development test set, but will use different Grid utterances, temporally positioned at different points in the CHiME background, and mixed using a different measurement of the 2-metre BRIR.

Required output

To attract as wide an audience as possible, the challenge has three entry levels: signals, features or recognition output.

For participants producing separated signals:

Output should be in the form of 16 kHz single-channel 16-bit WAV files. Participants will be provided with a Grid-utterance recognition system based on that used in the 1st PASCAL Speech Separation Challenge. The system will have been trained on the reverberant but additive-noise-free training data. Alternatively, there will be the option to upload signals to the organisers for remote evaluation. (An adaptation dataset and script will also be made available to allow participants to compensate for any distortion that their separation technique may introduce.)
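
As an illustration of the required format, a separated signal could be written out as follows (assuming a float-valued output array; the resampling step is only needed if the system works at 48 kHz):

    import numpy as np
    import soundfile as sf
    from scipy.signal import resample_poly

    def write_submission(path, audio, sr):
        """Write a separated signal as a 16 kHz single-channel 16-bit WAV file."""
        if audio.ndim == 2:
            audio = audio.mean(axis=1)              # downmix to a single channel
        if sr != 16000:
            audio = resample_poly(audio, 16000, sr)
        sf.write(path, np.clip(audio, -1.0, 1.0), 16000, subtype="PCM_16")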

For participants producing robust features:

Participants producing robust features should process both the training set and the test set. Scripts will be supplied for training models and performing the recognition evaluations. Alternatively, there will be the option to upload feature sets to the organisers for remote evaluation. If participants wish to optimise the recogniser to their features, they can enter a full system.
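
The feature file format expected by the supplied scripts will be specified with the scripts themselves. If they follow the common HTK parameter-file convention (an assumption here, not a statement about the challenge tools), a feature matrix could be written like this:

    import struct
    import numpy as np

    def write_htk(path, feats, frame_shift_s=0.010, parm_kind=9):
        """Write an (n_frames, n_dims) feature matrix as an HTK parameter file.

        parm_kind 9 is HTK's USER type; the 12-byte header is the standard
        HTK layout (nSamples, sampPeriod in 100 ns units, sampSize, parmKind),
        all big-endian, followed by big-endian float32 data.
        """
        feats = np.asarray(feats, dtype=">f4")
        n_frames, n_dims = feats.shape
        with open(path, "wb") as f:
            f.write(struct.pack(">iihh", n_frames,
                                int(frame_shift_s * 1e7), 4 * n_dims, parm_kind))
            f.write(feats.tobytes())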

For participants producing recognition output:

Participants supplying a complete recognition system should use the supplied scoring script, which will calculate and report keyword accuracy.
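
For orientation, keyword accuracy here is simply the fraction of scored letter and digit tokens recognised correctly. A sketch of the computation follows; the official script should of course be used for reported results:

    def keyword_accuracy(pairs):
        """pairs: iterable of ((ref_letter, ref_digit), (hyp_letter, hyp_digit))."""
        correct = total = 0
        for (ref_l, ref_d), (hyp_l, hyp_d) in pairs:
            correct += int(ref_l == hyp_l) + int(ref_d == hyp_d)
            total += 2
        return correct / total if total else 0.0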

Challenge guidelines

In order to reach a broad audience we have tried to avoid setting rules that might artificially disadvantage one research community over another. However, to keep the task as close to an application scenario as possible, and to allow systems to be broadly comparable, there are some guidelines that we would like participants to follow.
  • Systems should not exploit the SNR labels in the test data.
  • Systems should not exploit the fact that the same utterances are used at each SNR.
  • As stated above, the speaker identity is available from the utterance file name and can be assumed to be known.
  • Noise background. The temporal location of each utterance has been given so that the utterance can be located in the continuous CHiME audio. It is acceptable to exploit knowledge of the utterance's acoustic context (i.e. the audio occurring before and perhaps even after the utterance). There will probably be a lot of variation in the extent to which systems use this knowledge, so please be clear about the extent to which the context is used.
  • Parameter tuning. We have provided the labels for the final test set so that participants can evaluate their own systems. However, please do not abuse this by tuning free parameters on the final test data. Tune free parameters on the development set and only run the final test set when you are satisfied with your system's tuning. We will expect to see results for both the development and test sets in the final papers.

Submit your results

Click here to submit your results to the CHiME challenge workshop.