Data
The challenge uses the CHiME-5 dataset which consists of 20 parties each recorded in a different home.
To refer to these data in a publication, please cite:
- Jon Barker, Shinji Watanabe, Emmanuel Vincent, and Jan Trmal
The fifth 'CHiME' Speech Separation and Recognition Challenge: Dataset, task and baselines
Interspeech, 2018.
Note that the signals for both Track 1 and Track 2 are generated from the CHiME-5 data, so if you already have the CHiME-5 data there is no need to download it again. The software provided modifies the signals to improve their alignment, in particular by compensating for frame drops and clock skew. These modified signals form the starting point for CHiME-6.
The data have been split into training, development test, and evaluation test sets as follows.
Dataset | Parties | Speakers | Hours | Utterances |
Train | 16 | 32 | 40:33 | 79,980 |
Dev | 2 | 8 | 4:27 | 7,440 |
Eval | 2 | 8 | 5:12 | 11,028 |
The audio data and the transcriptions follow this directory structure:
├── audio
│   ├── dev
│   ├── eval
│   └── train
└── transcriptions
    ├── dev
    ├── eval
    └── train
Each audio/transcription directory has subdirectories for training, development, and evaluation sets.
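The layout above can be navigated with a few lines of Python. This is only a minimal sketch: the helper names `split_dir` and `list_wavs` are hypothetical, the root directory is whatever location you unpacked the data into, and the recursive search is used because this page does not state whether the WAV files sit directly inside each split directory.

```python
from pathlib import Path

# Directory names taken from the tree shown above.
KINDS = ("audio", "transcriptions")
SPLITS = ("train", "dev", "eval")


def split_dir(root, kind, split):
    """Return <root>/<kind>/<split>, e.g. <root>/audio/dev."""
    if kind not in KINDS:
        raise ValueError(f"unknown kind: {kind!r}")
    if split not in SPLITS:
        raise ValueError(f"unknown split: {split!r}")
    return Path(root) / kind / split


def list_wavs(root, split):
    """List the WAV files of one split, sorted by path.

    rglob searches recursively, so the sketch does not assume whether
    files sit directly in the split directory or in a deeper subfolder.
    """
    return sorted(split_dir(root, "audio", split).rglob("*.wav"))
```

For example, `list_wavs("/path/to/data", "dev")` would return the development-set recordings, where `/path/to/data` is a placeholder for your local copy.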
Audio
All audio data are distributed as WAV files with a sampling rate of 16 kHz. Each session consists of the recordings made by the binaural microphones worn by each participant (4 participants per session) and by 6 microphone arrays with 4 microphones each. The total number of microphone channels per session is therefore 32 (2 x 4 binaural + 4 x 6 array). These WAV files are named as follows:
- Binaural microphones
<session ID>_<speaker ID>.wav, e.g., S02_P05.wav
- Array microphones
<session ID>_<array ID>.CH<channel ID>.wav, e.g., S02_U05.CH1.wav
Note:
- The recordings made by the binaural microphones are stereo WAV files which include both left and right channels, while the recordings made by array microphones are decomposed into one mono WAV file per channel.
- The binaural microphone recordings for the evaluation set can be used for array synchronization only. They shall not be used for diarization, enhancement, or recognition.
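The naming scheme above lends itself to a small parser. The following is a sketch rather than part of the official tools: `Recording` and `parse_wav_name` are hypothetical names, and the patterns assume that every ID is a letter prefix (S, P, U) followed by digits, as in the examples S02_P05.wav and S02_U05.CH1.wav.

```python
import re
from typing import NamedTuple, Optional

# Assumed ID shapes, inferred from the examples on this page.
BINAURAL_RE = re.compile(r"(S\d+)_(P\d+)\.wav")
ARRAY_RE = re.compile(r"(S\d+)_(U\d+)\.CH(\d+)\.wav")


class Recording(NamedTuple):
    session: str            # e.g. "S02"
    device: str             # speaker ID ("P05") or array ID ("U05")
    channel: Optional[int]  # None for stereo binaural files


def parse_wav_name(name):
    """Split a WAV file name into session, device and channel."""
    m = ARRAY_RE.fullmatch(name)
    if m:
        return Recording(m.group(1), m.group(2), int(m.group(3)))
    m = BINAURAL_RE.fullmatch(name)
    if m:
        # Binaural files hold both left and right channels in one file.
        return Recording(m.group(1), m.group(2), None)
    raise ValueError(f"unrecognised file name: {name!r}")
```

A `channel` of `None` marks a binaural recording, which makes it easy to exclude those files from evaluation-set processing other than array synchronization.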
The following tables provide more detailed statistics and notes about each session:
Training sessions
Session ID | Participants (Bold=Male) | Duration | #Utts | Notes |
S03 | P09, P10, P11, P12 | 2:11:22 | 4,090 | P11 dropped from min ~15 to ~30 |
S04 | P09, P10, P11, P12 | 2:29:36 | 5,563 | |
S05 | P13, p14, p15, P16 | 2:31:44 | 4,939 | U03 and U04 missing (crashed) |
S06 | P13, p14, p15, P16 | 2:30:06 | 5,097 | |
S07 | p17, P18, p19, P20 | 2:26:53 | 3,656 | |
S17 | p17, P18, p19, P20 | 2:32:16 | 5,892 | |
S08 | P21, P22, P23, P24 | 2:31:35 | 6,175 | |
S16 | P21, P22, P23, P24 | 2:32:19 | 5,004 | |
S12 | P33, P34, P35, p36 | 2:29:24 | 3,300 | Last 15 minutes of U05 missing (Kinect was accidentally turned off) |
S13 | P33, P34, P35, p36 | 2:30:11 | 4,193 | |
S19 | p49, P50, P51, p52 | 2:32:38 | 4,292 | P52 mic unreliable |
S20 | p49, P50, P51, p52 | 2:18:04 | 5,365 | |
S18 | p41, P42, p43, p44 | 2:42:23 | 4,907 | |
S22 | p41, P42, p43, p44 | 2:35:44 | 4,758 | U03 missing |
S23 | p53, P54, P55, p56 | 2:58:43 | 7,054 | Neighbour interrupts |
S24 | p53, P54, P55, p56 | 2:37:09 | 5,695 | P54 mic unreliable, P53 disconnects for bathroom |
Development sessions
Session ID | Participants (Bold=Male) | Duration | #Utts | Notes |
S02 | p05, P06, P07, p08 | 2:28:24 | 3,822 | |
S09 | p25, p26, p27, p28 | 1:59:21 | 3,618 | U05 missing |
Evaluation sessions
Session ID | Participants (Bold=Male) | Duration | #Utts | Notes |
S01 | p01, p02, P03, p04 | 2:39:04 | 5,797 | No registration tone |
S21 | P45, P46, P47, p48 | 2:33:20 | 5,231 | |
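The session notes can be combined with the naming scheme to predict which WAV files a session should contain. The sketch below rests on several assumptions that this page does not confirm: that arrays are numbered U01 to U06 and channels CH1 to CH4, and that an array listed as "missing" has no files at all. The helper name `expected_wavs` is hypothetical. Partial losses, such as the last 15 minutes of U05 in S12, are not modelled.

```python
# Assumed ID ranges: the page states 6 arrays with 4 microphones each,
# and shows U03, U04, U05 and CH1 as examples, but not the full ranges.
ARRAYS = [f"U{i:02d}" for i in range(1, 7)]
CHANNELS = range(1, 5)

# Arrays reported as entirely missing in the session tables above.
MISSING_ARRAYS = {
    "S05": {"U03", "U04"},
    "S22": {"U03"},
    "S09": {"U05"},
}


def expected_wavs(session, speakers):
    """Return the WAV file names expected for one session."""
    # One stereo file per participant's binaural microphones.
    names = [f"{session}_{spk}.wav" for spk in speakers]
    missing = MISSING_ARRAYS.get(session, set())
    for array in ARRAYS:
        if array in missing:
            continue
        # One mono file per array channel.
        names += [f"{session}_{array}.CH{ch}.wav" for ch in CHANNELS]
    return names
```

Under these assumptions a complete session such as S02 yields 28 files (4 stereo binaural files plus 24 mono array files), which together carry the 32 microphone channels.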
All data are available for download under licence.