Data

The challenge uses the CHiME-5 dataset, which consists of 20 parties, each recorded in a different home.

To refer to these data in a publication, please cite:

Jon Barker, Shinji Watanabe, Emmanuel Vincent, and Jan Trmal
The fifth 'CHiME' Speech Separation and Recognition Challenge: Dataset, task and baselines
Interspeech, 2018.

Note that the transcriptions and signals for Track 1 (and Track 2) are generated from the CHiME-5 data. If you already have the CHiME-5 data, there is no need to download it again. The CHiME-6 software provided will modify the signals and transcripts to generate an improved alignment, in particular compensating for frame drops and clock skew. These modified signals form the starting point for CHiME-6.

The data have been split into training, development test, and evaluation test sets as follows.

Dataset	Parties	Speakers	Duration (hh:mm)	Utterances
Train	16	32	40:33	79,980
Dev	2	8	4:27	7,440
Eval	2	8	5:12	11,028

The audio data and the transcriptions follow this directory structure:

CHiME5/
├── audio
│ ├── dev
│ ├── eval
│ └── train
└── transcriptions
   ├── dev
   ├── eval
   └── train

The audio and transcriptions directories each contain subdirectories for the training, development, and evaluation sets.
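As a minimal sketch of how this layout might be navigated, the following Python helper builds the expected locations for one session. The function name `session_paths` is hypothetical, and the assumption that files sit directly inside each split directory is an illustration rather than part of the official tools.

```python
from pathlib import Path

SPLITS = ("train", "dev", "eval")


def session_paths(root, split, session_id):
    """Return (audio_dir, transcription_file) for one session.

    Assumes files sit directly inside audio/<split>/ and
    transcriptions/<split>/ (an assumption about the layout).
    """
    if split not in SPLITS:
        raise ValueError(f"unknown split: {split!r}")
    root = Path(root)
    audio_dir = root / "audio" / split
    transcription = root / "transcriptions" / split / f"{session_id}.json"
    return audio_dir, transcription
```

For example, `session_paths("CHiME5", "dev", "S02")` would point at `CHiME5/audio/dev` and `CHiME5/transcriptions/dev/S02.json`.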

Audio

All audio data are distributed as WAV files with a sampling rate of 16 kHz. Each session consists of the recordings made by the binaural microphones worn by each participant (4 participants per session) and by 6 microphone arrays with 4 microphones each. The total number of microphones per session is therefore 32 (2 x 4 binaural + 4 x 6 array). These WAV files are named as follows:

Binaural microphones
<session ID>_<speaker ID>.wav , e.g., S02_P05.wav
Array microphones
<session ID>_<array ID>.CH<channel ID>.wav , e.g., S02_U05.CH1.wav

Note:

The recordings made by the binaural microphones are stereo WAV files which include both left and right channels, while the recordings made by array microphones are decomposed into one mono WAV file per channel.
The binaural microphone recordings for the evaluation set may be used for array synchronization only. They shall not be used for diarization, enhancement, or recognition.
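The naming scheme and the stereo/mono distinction described above can be sketched in Python as follows. This is an illustration only: the helper names `parse_wav_name` and `split_binaural` are hypothetical, the two-digit ID widths are inferred from the examples (S02, P05, U05), and 16-bit PCM samples are assumed, which the text above does not state.

```python
import re
import sys
import wave
from array import array

# ID widths are an assumption based on the examples S02, P05 and U05.
BINAURAL_RE = re.compile(r"^(S\d{2})_(P\d{2})\.wav$")
ARRAY_RE = re.compile(r"^(S\d{2})_(U\d{2})\.CH(\d+)\.wav$")


def parse_wav_name(name):
    """Classify a WAV file name as a binaural or array recording."""
    match = BINAURAL_RE.match(name)
    if match:
        return {"kind": "binaural", "session": match.group(1),
                "speaker": match.group(2)}
    match = ARRAY_RE.match(name)
    if match:
        return {"kind": "array", "session": match.group(1),
                "array": match.group(2), "channel": int(match.group(3))}
    raise ValueError(f"unrecognised file name: {name!r}")


def split_binaural(path):
    """Return (left, right, sample_rate) from a stereo WAV file.

    Assumes 16-bit PCM samples (not stated in the data description).
    """
    with wave.open(str(path), "rb") as wav:
        if wav.getnchannels() != 2:
            raise ValueError("expected a stereo file")
        if wav.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM")
        rate = wav.getframerate()
        frames = wav.readframes(wav.getnframes())
    samples = array("h")
    samples.frombytes(frames)
    if sys.byteorder == "big":
        samples.byteswap()  # WAV sample data is little-endian
    # Stereo samples are interleaved: L0, R0, L1, R1, ...
    return samples[0::2], samples[1::2], rate
```

Array recordings need no such splitting, since each channel is already stored as its own mono file.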

The following tables provide more detailed statistics and notes about each session:

Training sessions

Session ID	Participants (Bold=Male)	Duration	#Utts	Notes
S03	P09, P10, P11, P12	2:11:22	4,090	P11 dropped from min ~15 to ~30
S04	P09, P10, P11, P12	2:29:36	5,563
S05	P13, p14, p15, P16	2:31:44	4,939	U03 and U04 missing (crashed)
S06	P13, p14, p15, P16	2:30:06	5,097
S07	p17, P18, p19, P20	2:26:53	3,656
S17	p17, P18, p19, P20	2:32:16	5,892
S08	P21, P22, P23, P24	2:31:35	6,175
S16	P21, P22, P23, P24	2:32:19	5,004
S12	P33, P34, P35, p36	2:29:24	3,300	Last 15 minutes of U05 missing (Kinect was accidentally turned off)
S13	P33, P34, P35, p36	2:30:11	4,193
S19	p49, P50, P51, p52	2:32:38	4,292	P52 mic unreliable
S20	p49, P50, P51, p52	2:18:04	5,365
S18	p41, P42, p43, p44	2:42:23	4,907
S22	p41, P42, p43, p44	2:35:44	4,758	U03 missing
S23	p53, P54, P55, p56	2:58:43	7,054	Neighbour interrupts
S24	p53, P54, P55, p56	2:37:09	5,695	P54 mic unreliable, P53 disconnects for bathroom

Development sessions

Session ID	Participants (Bold=Male)	Duration	#Utts	Notes
S02	p05, P06, P07, p08	2:28:24	3,822
S09	p25, p26, p27, p28	1:59:21	3,618	U05 missing

Evaluation sessions

Session ID	Participants (Bold=Male)	Duration	#Utts	Notes
S01	p01, p02, P03, p04	2:39:04	5,797	No registration tone
S21	P45, P46, P47, p48	2:33:20	5,231

Transcriptions

The transcriptions are provided in JSON format for each session as <session ID>.json. The JSON file includes the following pieces of information for each utterance:

Session ID ("session_id")
Location ("kitchen", "dining", or "living")
Speaker ID ("speaker")
Transcription ("words")
Start time ("start_time")
    For the binaural microphone recording of that speaker ("original")
    For all array recordings ("U01", etc.)
    For all binaural microphone recordings ("P01", etc.)
End time ("end_time")
Reference microphone array ID ("ref")

The following is an example annotation of one utterance in a JSON file:


    {
        "end_time": "0:00:43.82",
        "start_time": "0:00:40.60",
        "words": "[laughs] It's the blue, I think. I think.",
        "speaker": "P05",
        "ref": "U02",
        "location": "kitchen",
        "session_id": "S02"
    },
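The time fields in the example are strings of the form H:MM:SS.ss. A minimal sketch of converting them to seconds, for instance to compute an utterance duration, might look like this. The helper name `to_seconds` is hypothetical and not part of the CHiME-6 software.

```python
def to_seconds(timestamp):
    """Convert an "H:MM:SS.ss" string to seconds as a float."""
    hours, minutes, seconds = timestamp.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


# Times taken from the example utterance above.
utterance = {"start_time": "0:00:40.60", "end_time": "0:00:43.82"}
duration = to_seconds(utterance["end_time"]) - to_seconds(utterance["start_time"])
print(f"{duration:.2f} s")  # prints "3.22 s"
```

Multiplying the result by the 16 kHz sampling rate and rounding gives the corresponding sample index in the audio file.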

Note

"location" and "ref" are provided for development and evaluation data only (not for training), because you are allowed to use all array channels and binaural microphones for training.
Transcriptions include the following tags:

[noise] denotes any non-language noise made by the speaker (e.g., grunts, coughing, loud chewing, or loud lip smacking).
[inaudible] denotes speech that is audible but not clear enough to be transcribed.
[laughs] denotes an instance where the participant laughs.
[redacted] denotes parts of the signals that have been zeroed out for privacy reasons.

For redacted utterances, "speaker" information is not provided.
The start and end times have been manually annotated for the binaural microphone pair worn by the speaker ("original").
The reference microphone array ID ("ref") is fixed for a given session and location.
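The notes above can be sketched in Python as follows: redacted utterances are skipped by checking for the missing "speaker" field, and the bracketed tags are removed from the text. This is an illustration under assumptions: the helper names are hypothetical, each session file is assumed to hold a JSON list of utterance objects, and whether tags should be removed at all depends on how you intend to use the transcripts.

```python
import json
import re

# The four tags listed in the notes above.
TAG_RE = re.compile(r"\[(noise|inaudible|laughs|redacted)\]")


def clean_words(words):
    """Remove the bracketed tags and collapse the leftover whitespace."""
    return " ".join(TAG_RE.sub(" ", words).split())


def spoken_utterances(json_text):
    """Yield (speaker, text) pairs, skipping redacted or tag-only entries.

    Assumes the JSON text is a list of utterance objects.
    """
    for utterance in json.loads(json_text):
        # Redacted utterances carry no "speaker" field.
        if "speaker" not in utterance:
            continue
        text = clean_words(utterance["words"])
        if text:
            yield utterance["speaker"], text
```

Applied to the example utterance shown earlier, this would yield the speaker "P05" with the text "It's the blue, I think. I think.".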

All data are available for download under licence.