Overview


CHiME-5 targets the problem of distant-microphone conversational speech recognition in everyday home environments. Speech material has been collected from twenty real dinner parties that have taken place in real homes. The parties have been recorded using multiple 4-channel microphone arrays and have been fully transcribed.

The challenge features:

  • simultaneous recordings from multiple microphone arrays;
  • real conversation, i.e. talkers speaking in a relaxed and unscripted fashion;
  • a range of room acoustics from 20 different homes each with two or three separate recording areas;
  • real domestic noise backgrounds, e.g. kitchen appliances, air conditioning and movement.

Fully transcribed utterances are provided in continuous audio with ground truth speaker labels and start/end time annotations for segmentation.

The scenario


The dataset is made up of recordings of twenty separate dinner parties that took place in real homes. Each dinner party has four participants: two acting as hosts and two as guests. The party members are all friends who know each other well and who have been instructed to behave naturally.

Efforts have been made to keep the parties as natural as possible. The only constraints are that each party should last a minimum of 2 hours and should be composed of three phases, each corresponding to a different location:

  • kitchen - preparing the meal in the kitchen area;
  • dining - eating the meal in the dining area;
  • living - a post-dinner period in a separate living room area.

Participants have been allowed to move naturally from one location to another but with the instruction that each phase should last at least 30 minutes.

Participants are free to converse on any topics of their choosing; no artificial scenario has been imposed. Some personally identifying material has been redacted post-recording as part of the consent process. Background television and commercial music have been disallowed in order to avoid capturing copyrighted content.

The recording set up


Each party has been recorded with a set of six Microsoft Kinect devices. The devices have been strategically placed such that there are always at least two capturing the activity in each location.

Each Kinect device has a linear array of 4 sample-synchronised microphones and a camera. The microphone array geometry is illustrated in the figure below.

[Figure: The 4-channel Kinect microphone array.]

The raw microphone signals and video have been recorded. Each Kinect is recorded onto a separate laptop computer.

In addition to the Kinects, to facilitate transcription, each participant is wearing a set of Soundman OKM II Classic Studio binaural microphones. The audio from these is recorded via a Soundman A3 adapter onto Tascam DR-05 stereo recorders being worn by the participants.

A 'beep' has been played at the start of each party to provide an approximate initial synchronisation. However, as usual in ad-hoc microphone array scenarios, a significant drift occurs over time due to differences in the clock speeds (for both Kinects and binaural microphone pairs) and to frames being dropped (for Kinects only). Due to this, the provided Kinect and binaural recordings all have a different number of samples.
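To give a feel for the scale of the drift, the arithmetic can be sketched as below. This is an illustration only: the 50 ppm clock offset is a hypothetical figure, not a measured property of the recording devices, and the 16 kHz sample rate is an assumption. Only the 2-hour minimum party duration comes from the description above.

```python
def accumulated_drift(duration_s, clock_offset_ppm, sample_rate=16000):
    """Drift accumulated between two recorders whose clocks differ by a fixed offset.

    duration_s       -- recording length in seconds
    clock_offset_ppm -- relative clock speed difference in parts per million
                        (hypothetical value, chosen only for illustration)
    sample_rate      -- assumed sample rate in Hz

    Returns (drift in seconds, drift in samples).
    """
    drift_s = duration_s * clock_offset_ppm * 1e-6
    drift_samples = round(drift_s * sample_rate)
    return drift_s, drift_samples


# A 2-hour party with an assumed 50 ppm clock mismatch.
drift_s, drift_samples = accumulated_drift(2 * 60 * 60, 50)
print(f"{drift_s:.2f} s, {drift_samples} samples")
```

Under these assumptions the two recordings would end up about 0.36 s apart by the end of the party, which is long compared with a typical word, so a single initial synchronisation point is not sufficient on its own.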

Rather than providing realigned microphone signals, we provide a separate set of transcriptions for each Kinect and each binaural microphone pair. Starting from the reference transcription of each speaker obtained manually from the corresponding binaural signal, we shift the start and end times of all utterances by the estimated delays with respect to the Kinects and the other binaural microphone pairs. These delays have been obtained by means of simple cross-correlation between the signals. The resulting start and end times may not be fully accurate. This approach should be considered as a baseline to be improved as part of the challenge.
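The idea of shifting annotation times by a cross-correlation delay can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure actually used to produce the provided transcriptions: the function names `estimate_delay` and `shift_utterances` are hypothetical, utterances are represented as plain `(start, end)` tuples in seconds, and a single delay is estimated over the whole excerpt, which ignores drift within it.

```python
import numpy as np


def estimate_delay(reference, target, sample_rate):
    """Estimate the delay of `target` relative to `reference`, in seconds.

    A positive value means the same event occurs later in `target`.
    Both inputs are 1-D arrays sampled at `sample_rate` Hz.
    """
    reference = np.asarray(reference, dtype=float)
    target = np.asarray(target, dtype=float)
    # Full cross-correlation: output index 0 corresponds to a lag of
    # -(len(reference) - 1) samples.
    corr = np.correlate(target, reference, mode="full")
    lag = int(np.argmax(corr)) - (len(reference) - 1)
    return lag / sample_rate


def shift_utterances(utterances, delay):
    """Shift every (start, end) pair, in seconds, by `delay` seconds."""
    return [(start + delay, end + delay) for start, end in utterances]


# Synthetic example: the "Kinect" signal is the "binaural" signal
# delayed by 25 samples.
rng = np.random.default_rng(0)
binaural = rng.standard_normal(1000)
kinect = np.concatenate([np.zeros(25), binaural])[:1000]

delay = estimate_delay(binaural, kinect, sample_rate=16000)
kinect_times = shift_utterances([(0.010, 0.040)], delay)
```

Because the drift grows over time, a practical version would presumably estimate the delay locally (for example, per utterance or per window) rather than once per recording. `np.correlate` computes the correlation directly, so an FFT-based method would be preferable for long signals.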

The recordings have been divided into training, development test and evaluation test sets. The sets share no homes or speakers with one another. For full details of the data sets please continue to the next page.