Data
The CHiME-5 data consists of 20 parties each recorded in a different home.
To refer to these data in a publication, please cite:
- Jon Barker, Shinji Watanabe, Emmanuel Vincent, and Jan Trmal, "The fifth 'CHiME' Speech Separation and Recognition Challenge: Dataset, task and baselines", Interspeech, 2018.
The data have been split into training, development test, and evaluation test sets as follows.
| Dataset | Parties | Speakers | Hours | Utterances |
| --- | --- | --- | --- | --- |
| Train | 16 | 32 | 40:33 | 79,980 |
| Dev | 2 | 8 | 4:27 | 7,440 |
| Eval | 2 | 8 | 5:12 | 11,028 |
The audio data and the transcriptions follow this directory structure:
├── audio
│   ├── dev
│   ├── eval
│   └── train
└── transcriptions
    ├── dev
    ├── eval
    └── train
Each audio/transcription directory has subdirectories for training, development, and evaluation sets. The evaluation data will be released later.
Audio
All audio data are distributed as WAV files with a sampling rate of 16 kHz. Each session consists of the recordings made by the binaural microphones worn by each participant (4 participants per session), and by 6 microphone arrays with 4 microphones each. Therefore, the total number of microphones per session is 32 (2 x 4 + 4 x 6). These WAV files are named as follows:
- Binaural microphones: <session ID>_<speaker ID>.wav, e.g., S02_P05.wav
- Array microphones: <session ID>_<array ID>.CH<channel ID>.wav, e.g., S02_U05.CH1.wav
Note:
- The recordings made by the binaural microphones are stereo WAV files which include both left and right channels, while the recordings made by array microphones are decomposed into one mono WAV file per channel.
- We will not distribute the binaural microphone recordings for the evaluation set during the challenge.
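To make the naming scheme concrete, the following Python sketch (not part of the distribution; the data root and speaker list are illustrative assumptions) enumerates the WAV files expected for one session:

```python
import os

def session_wavs(root, session, speakers,
                 arrays=("U01", "U02", "U03", "U04", "U05", "U06")):
    """Return the WAV file paths expected for one session."""
    # One stereo WAV per participant's binaural microphone pair.
    paths = [os.path.join(root, f"{session}_{spk}.wav") for spk in speakers]
    # One mono WAV per array channel (CH1..CH4 for each of the 6 arrays).
    paths += [os.path.join(root, f"{session}_{arr}.CH{ch}.wav")
              for arr in arrays for ch in range(1, 5)]
    return paths

# 4 stereo binaural files + 6 x 4 mono array files = 28 files (32 microphones).
print(len(session_wavs("audio/dev", "S02", ["P05", "P06", "P07", "P08"])))  # 28
```

Note that a few sessions lack one array (see the session tables below), so not every listed file exists on disk.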
The following tables provide more detailed statistics and notes about each session:
Training sessions
| Session ID | Participants (bold = male) | Duration | #Utts | Notes |
| --- | --- | --- | --- | --- |
| S03 | P09, P10, P11, P12 | 2:11:22 | 4,090 | P11 dropped from minute ~15 to ~30 |
| S04 | P09, P10, P11, P12 | 2:29:36 | 5,563 | |
| S05 | P13, **P14**, **P15**, P16 | 2:31:44 | 4,939 | U03 missing (crashed) |
| S06 | P13, **P14**, **P15**, P16 | 2:30:06 | 5,097 | |
| S07 | **P17**, P18, **P19**, P20 | 2:26:53 | 3,656 | |
| S17 | **P17**, P18, **P19**, P20 | 2:32:16 | 5,892 | |
| S08 | P21, P22, P23, P24 | 2:31:35 | 6,175 | |
| S16 | P21, P22, P23, P24 | 2:32:19 | 5,004 | |
| S12 | P33, P34, P35, **P36** | 2:29:24 | 3,300 | Last 15 minutes of U05 missing (Kinect was accidentally turned off) |
| S13 | P33, P34, P35, **P36** | 2:30:11 | 4,193 | |
| S19 | **P49**, P50, P51, **P52** | 2:32:38 | 4,292 | P52 mic unreliable |
| S20 | **P49**, P50, P51, **P52** | 2:18:04 | 5,365 | |
| S18 | **P41**, P42, **P43**, **P44** | 2:42:23 | 4,907 | |
| S22 | **P41**, P42, **P43**, **P44** | 2:35:44 | 4,758 | U03 missing |
| S23 | **P53**, P54, P55, **P56** | 2:58:43 | 7,054 | Neighbour interrupts |
| S24 | **P53**, P54, P55, **P56** | 2:37:09 | 5,695 | P54 mic unreliable; P53 disconnects for bathroom |
Development sessions
| Session ID | Participants (bold = male) | Duration | #Utts | Notes |
| --- | --- | --- | --- | --- |
| S02 | **P05**, P06, P07, **P08** | 2:28:24 | 3,822 | |
| S09 | **P25**, **P26**, **P27**, **P28** | 1:59:21 | 3,618 | U05 missing |
Evaluation sessions
| Session ID | Participants (bold = male) | Duration | #Utts | Notes |
| --- | --- | --- | --- | --- |
| S01 | **P01**, **P02**, P03, **P04** | 2:39:04 | 5,797 | No registration tone |
| S21 | P45, P46, P47, **P48** | 2:33:20 | 5,231 | |
Transcriptions
The transcriptions are provided in JSON format for each session as <session ID>.json. The JSON file includes the following pieces of information for each utterance:
- Session ID ("session_id")
- Location ("location": "kitchen", "dining", or "living")
- Speaker ID ("speaker")
- Transcription ("words")
- Start time ("start_time") and end time ("end_time"):
  - for the binaural microphone recording of that speaker ("original")
  - for all array recordings ("U01", etc.)
  - for all binaural microphone recordings ("P01", etc.)
- Reference microphone array ID ("ref")
The following is an example annotation of one utterance in a JSON file:
{
    "end_time": {
        "original": "0:00:43.82",
        "U01": "0:00:43.85",
        "U02": "0:00:43.84",
        "U03": "0:00:43.83",
        "U04": "0:00:43.83",
        "U05": "0:00:43.82",
        "U06": "0:00:43.82",
        "P05": "0:00:43.82",
        "P06": "0:00:43.82",
        "P07": "0:00:43.82",
        "P08": "0:00:43.82"
    },
    "start_time": {
        "original": "0:00:40.60",
        "U01": "0:00:40.63",
        "U02": "0:00:40.62",
        "U03": "0:00:40.61",
        "U04": "0:00:40.61",
        "U05": "0:00:40.60",
        "U06": "0:00:40.60",
        "P05": "0:00:40.60",
        "P06": "0:00:40.60",
        "P07": "0:00:40.60",
        "P08": "0:00:40.60"
    },
    "words": "[laughs] It's the blue, I think. I think.",
    "speaker": "P05",
    "ref": "U02",
    "location": "kitchen",
    "session_id": "S02"
},
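For illustration, here is a minimal Python sketch of how such a file might be read and one utterance segment cut out of an array recording. The paths and the use of the soundfile package are assumptions; any WAV reader would do:

```python
import json

import soundfile as sf  # assumed third-party WAV reader; any reader works

FS = 16000  # sampling rate of all distributed WAV files

def to_seconds(t):
    """Convert an "H:MM:SS.ss" timestamp to seconds."""
    h, m, s = t.split(":")
    return int(h) * 3600 + int(m) * 60 + float(s)

# Each <session ID>.json file holds a list of utterance annotations
# shaped like the example above.
with open("transcriptions/dev/S02.json") as f:
    utterances = json.load(f)

utt = utterances[0]
start = to_seconds(utt["start_time"]["U01"])  # times shifted to array U01
end = to_seconds(utt["end_time"]["U01"])

# Cut the corresponding segment out of channel 1 of array U01.
segment, fs = sf.read("audio/dev/S02_U01.CH1.wav",
                      start=int(start * FS), stop=int(end * FS))
print(utt["words"], segment.shape)
```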
Note:
- "location" and "ref" are provided for the development and evaluation data only (not for training), because all array channels and binaural microphones may be used for training.
- Transcriptions include the following tags:
  - [noise] denotes any non-language noise made by the speaker (e.g., grunts, coughing, loud chewing, loud lip smacking).
  - [inaudible] denotes speech that is audible but not clear enough to be transcribed.
  - [laughs] denotes an instance where the participant laughs.
  - [redacted] denotes parts of the signals that have been zeroed out for privacy reasons. For redacted utterances, "speaker" information is not provided.
- The start and end times have been manually annotated for the binaural microphone pair worn by the speaker ("original"). The other microphones have not been manually aligned; we therefore also provide start and end times for each array and each binaural microphone pair, obtained by shifting the original times by the appropriate delays. These delays have been estimated by simple cross-correlation between the signals (as sketched after this list) and may not be fully accurate; this approach should be considered a baseline to be improved as part of the challenge.
- The reference microphone array ID ("ref") is fixed for a given session and location.
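As an illustration of the simple cross-correlation alignment mentioned above, here is a minimal sketch assuming two already-loaded mono signals at 16 kHz; this is not the script that produced the shifted times:

```python
import numpy as np
from scipy.signal import correlate, correlation_lags

def estimate_delay(ref, sig, fs=16000, max_delay=1.0):
    """Estimate the delay (in seconds) of `sig` relative to `ref` by
    picking the lag that maximises their cross-correlation."""
    corr = correlate(sig, ref, mode="full", method="fft")
    lags = correlation_lags(len(sig), len(ref), mode="full")
    keep = np.abs(lags) <= int(max_delay * fs)  # restrict to plausible lags
    best = lags[keep][np.argmax(corr[keep])]
    return best / fs
```

A positive return value means `sig` lags `ref`; in practice the correlation would be computed between, e.g., a binaural channel and an array channel over a window around each utterance.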
All data are available under licence via the download page.