Data

CHiME-7 DASR "Official" Datasets

As explained on the Task 1 page, this Task makes use of three different datasets:

  1. CHiME-6 Challenge [1]
  2. Amazon Alexa Dinner Party Corpus (DiPCO) [2]
  3. LDC Mixer 6 Speech [3]

We will refer to these three datasets as the Task's official datasets, as they are the datasets on which participants will be evaluated.

Together with these “official” datasets, participants are free to use (and propose!) additional external datasets.
See the Rules page for a complete list of the datasets and pre-trained models participants are allowed to use, and for how to propose new ones. Each of these three “official” datasets is briefly described below.

Important
Please note that the Mixer 6 Speech data as used here is different from the one available on the LDC website.
Even if you already have it, please download the correct version by following the procedure outlined below in Getting the Data.

CHiME-6 Challenge Data

The CHiME-6 corpus [1] features more than 40 hours of recordings organized in 20 different sessions: 16 for training, 2 for development and 2 for evaluation.
Each session consists of a dinner party among 4 participants, recorded with 6 far-field Kinect microphone arrays, each with 4 microphones.
Due to the informal setting, it features highly conversational speech with lots of overlap and back-channel responses.
Speakers are free to roam in different rooms within the apartment where the dinner party takes place, and arrays are placed across different rooms.
Binaural microphone signals, worn by each of the four participants, are also provided.
These were employed as close-talk references to aid ground-truth transcription.
You can refer to the previous CHiME-6 Challenge data description page for more detailed info.

Note that in CHiME-7 DASR we re-arranged the training and evaluation partitions, so you should always use our provided scripts to generate the CHiME-6 portion of this Task's dataset. In particular, we moved two sessions from training to evaluation, removed the reference array information in evaluation (for both the main and sub-track), and use a slightly different text normalization.
These changes are detailed below in Detailed CHiME-7 DASR Data Description.

Dinner Party Corpus (DiPCo)

DiPCo [2] features a scenario similar to CHiME-6, consisting of 10 recorded dinner-party sessions, each among 4 speakers, recorded by 5 far-field devices each with a 7-mic circular array (six microphones plus one at the center). Per-speaker close-talk microphones are also provided. Compared to CHiME-6, DiPCo has no synchronization issues and the close-talk microphones have less cross-talk. Each session is shorter, lasting between 15 and 47 minutes, and is recorded in a single room.
More information is available in the DiPCo paper.

In CHiME-7 DASR, this data is largely kept the same and is used for development and evaluation only, with the same splits as in the original DiPCo paper.

Mixer 6 Speech

Mixer 6 Speech features a quite different scenario. The data used in this challenge is a subset of the unlabeled data provided in LDC2013S03 for which we collected transcripts. Only the data provided as part of this challenge should be used.

The original data consists of a total of 1425 recording sessions with native English speakers. Each session takes place in one of two rooms, called the LDC and HRM rooms, and lasts approximately 45 minutes. Each session includes an interview (15 min), a telephone call (10 min), and prompt dictation. The interview is between a subject and an interviewer, and both sides of the conversation are recorded. The interviewer speaks about 30% of the time across all sessions, though there is significant variance across sessions, interviewers, and subjects. One side of the call (spoken by the interviewed subject) is recorded by the same microphone configuration. We emphasize that this recording is not telephony, but rather the speech recorded by microphones in the room where one side of the telephone conversation is taking place.

There is a multitude of microphones in each room (14 channels, named CH01-CH14) of varying types, directionality, and relative position to the sources (subject and interviewer). The microphones in the HRM room were placed according to the same configuration as in the LDC room.

In this challenge, we created 4 sets of data: train_intv, train_call, dev, and test. The test set will be distributed sometime in May or June, no later than June 12th. The train_intv, train_call, and dev sets are available from the LDC free of charge (see Getting the Data for more info). The training transcripts (i.e., those provided in the train_intv and train_call sets) cover the subject's speech only. All sessions in the training and dev sets occur in the LDC room. The dev transcripts correspond to the full interview portions of 59 sessions and are speaker-disjoint from the training set. Channels CH01-CH13 are available for training and development. CH01, CH02 (lapel), and CH03 (headset) are close-talk microphones; they are provided for training and development but will not be available for the final evaluation.
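
For instance, if you want to restrict your pipeline to channels that will also be available at evaluation time, a minimal sketch of the channel lists could look like the following (the exact split into close-talk and far-field channels is our reading of the description above, not an official constant):

# Illustrative only: the split into close-talk (CH01-CH03) and far-field
# (CH04-CH13) channels is an assumption based on the description above;
# CH14 is not provided for this challenge.
MIXER6_CLOSE_TALK = [f"CH{i:02d}" for i in range(1, 4)]   # lapel/headset mics
MIXER6_FAR_FIELD = [f"CH{i:02d}" for i in range(4, 14)]   # assumed usable also at eval

print(MIXER6_FAR_FIELD)
# ['CH04', 'CH05', 'CH06', 'CH07', 'CH08', 'CH09', 'CH10', 'CH11', 'CH12', 'CH13']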

The train_intv set consists of 189 sessions (36.09 hours of transcribed speech) from 85 unique speakers (77 subjects and 10 interviewers).
The train_call set consists of 54 sessions (26.57 hours of transcribed speech) from 93 speakers (8 additional; 81 subjects and 14 interviewers).
The dev set consists of 59 sessions (14.27 hours of transcribed speech) from 27 speakers (26 subjects and 1 interviewer).
The test set consists of 23 sessions (5.75 hours of speech) from 18 speakers (15 subjects and 3 interviewers).
See the Mixer 6 Speech page for further info on the audio collection, and the README.md in the downloaded challenge data for more info on the transcripts.

Getting the Data

In the Task Baseline we provide a convenient data generation script for generating the official Challenge dataset. However, due to dataset licenses, this script will automatically download only DiPCo.
Participants have to manually obtain the CHiME-5 and Mixer 6 Speech data as outlined in the following.
The CHiME-5 data will be used to generate the CHiME-6 data, as was done in the previous CHiME-6 Challenge.

Once you have downloaded and unpacked Mixer 6 Speech and CHiME-5 you can use our provided CHiME-DASR dataset generation scripts to generate the official datasets.

Important
If you already have CHiME-6 you don’t need to re-obtain the CHiME-5 data; you can use it directly with our provided CHiME-DASR dataset generation scripts.
You can skip the Obtaining CHiME-5 Challenge Data section below.
If you have any questions, don’t hesitate to reach out to us.

Obtaining Mixer 6 Speech data

Each team should file a request for the Mixer 6 Speech data by filling out this PDF form and submitting it to the Linguistic Data Consortium (LDC) via email at ldc@ldc.upenn.edu, as specified in the document. After approval, LDC will provide the training and development partitions of Mixer 6 Speech (which, again, differ from the data described on the LDC website).
The evaluation partition will be sent to participants by LDC by June 12th, 2023.

Obtaining CHiME-5 Challenge Data

The CHiME-5 Challenge data is available under license from the University of Sheffield.
Please refer to the CHiME-6 Challenge page for how to obtain the CHiME-5 data.

Once the CHiME-5 data is obtained, you can use our data-generation scripts in the Task baseline to generate the CHiME-6 data required for CHiME-7 DASR from CHiME-5. The script makes use of the original CHiME-6 Challenge synchronization scripts.

Detailed CHiME-7 DASR Data Description

CHiME-7 DASR official data has been split into training, development, and evaluation sets as follows.
(Note that information about the evaluation sets is not made available at this stage; also note that the CHiME-6 evaluation set in CHiME-7 DASR differs from the CHiME-6 Challenge one.)

Dataset    Split    Num. Speakers    Hours (hh:mm)
CHiME-6    train    18               30:57
CHiME-6    dev      8                4:27
CHiME-6    eval     -                -
DiPCo      dev      16               2:42
DiPCo      eval     -                -
Mixer 6    train    93               63:06
Mixer 6    dev      27               14:16
Mixer 6    eval     -                -

The CHiME-7 DASR data follows this directory structure:

.
├── chime6
│   ├── audio
│   │   ├── dev
│   │   ├── eval
│   │   └── train
│   ├── transcriptions
│   │   ├── dev
│   │   ├── eval
│   │   └── train
│   ├── transcriptions_scoring
│   │   ├── dev
│   │   ├── eval
│   │   └── train
│   └── uem
│       ├── dev
│       ├── eval
│       └── train
├── dipco
│   ├── audio
│   │   ├── dev
│   │   └── eval
│   ├── transcriptions
│   │   ├── dev
│   │   └── eval
│   ├── transcriptions_scoring
│   │   ├── dev
│   │   └── eval
│   └── uem
│       ├── dev
│       └── eval
└── mixer6
    ├── audio
    │   ├── dev
    │   ├── eval
    │   ├── train_call
    │   └── train_intv
    ├── transcriptions
    │   ├── dev
    │   ├── eval
    │   ├── train_call
    │   └── train_intv
    ├── transcriptions_scoring
    │   ├── dev
    │   ├── eval
    │   ├── train_call
    │   └── train_intv
    └── uem
        ├── dev
        ├── eval
        ├── train_call
        └── train_intv

Note that the evaluation folders will be populated when the evaluation data is released. Each audio/transcription directory has subdirectories for the training, development, and evaluation sets (the latter released later during the challenge; the CHiME-6 eval folders will be empty at this stage).
Each scenario sub-folder (chime6, dipco, mixer6) has an additional sub-directory for universal evaluation map (uem) files.
These files indicate, for each session, the start and stop times (in seconds) between which your system will be scored (outside these boundaries your predictions won't be scored, but you can still use the information).
For example, the UEM file for the CHiME-6 development set is shown below: for session S02, evaluation is performed approximately from second 40 to second 8906; the other parts are not scored.

S02 1 40.600 8906.744
S09 1 65.580 7162.197
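
As a concrete illustration, here is a minimal sketch of how such a UEM file could be parsed and used to check whether an utterance overlaps the scored region (the file name below is hypothetical; the column layout is session, channel, start, end, as shown above):

# Minimal sketch: parse a UEM file (columns: session, channel, start, end,
# times in seconds) and check whether an utterance overlaps the scored region.
def load_uem(path):
    scored = {}
    with open(path) as f:
        for line in f:
            if not line.strip():
                continue
            session, _channel, start, end = line.split()
            scored[session] = (float(start), float(end))
    return scored

def is_scored(uem, session, utt_start, utt_end):
    start, end = uem[session]
    return utt_end > start and utt_start < end

uem = load_uem("chime6/uem/dev/all.uem")   # hypothetical file name
print(is_scored(uem, "S02", 39.0, 41.5))   # overlaps the scored region -> True
print(is_scored(uem, "S02", 10.0, 20.0))   # entirely before second 40.600 -> False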

The annotation for this Task is similar to that of the previous CHiME-6 Challenge.
We provide two types of annotation in the form of a JSON file for each session:

  1. plain transcriptions (transcriptions sub-folders)
  2. transcriptions used for scoring (transcriptions_scoring)

The former contain all the metadata as provided in the respective datasets, e.g. reference device, location (as in the CHiME-6 development set), speaker gender, native vs. non-native speaker (as in DiPCo), punctuation in the transcripts, and non-word events (e.g. [noise]).
These annotations are provided as JSON files, one for each session (CHiME-6, DiPCo or Mixer 6 session).
For each utterance there is an entry with, at least, this information:

  • Session ID ("session_id")
  • Speaker ID ("speaker")
  • Transcription ("words")
  • Start time ("start_time")
  • End time ("end_time")

This is an example (note that, contrary to CHiME-6, start and end times here are in seconds and are stored as strings):

{
  "end_time": "11.370",
  "start_time": "11.000",
  "words": "So, um [noise]",
  "speaker": "P03",
  "session_id": "S05"
}
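
As a concrete illustration, a minimal sketch of how a session file could be loaded, converting the string timestamps to seconds and to samples at 16 kHz (the path is hypothetical, and we assume each session file is a JSON array of utterance entries like the one above):

import json

SAMPLE_RATE = 16000  # all audio in this Task is 16 kHz

# Hypothetical path, following the directory structure shown earlier.
with open("chime6/transcriptions/dev/S02.json") as f:
    utterances = json.load(f)  # assumed to be a list of utterance dicts

for utt in utterances:
    start = float(utt["start_time"])  # times are stored as strings, in seconds
    end = float(utt["end_time"])
    start_sample = int(round(start * SAMPLE_RATE))
    end_sample = int(round(end * SAMPLE_RATE))
    print(utt["session_id"], utt["speaker"],
          f"{start:.3f}-{end:.3f}s", f"({end_sample - start_sample} samples)",
          utt["words"])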

As mentioned, there can be additional entries depending on the dataset and the split.
For example, the CHiME-6 development set annotation has additional reference device and location entries (see here). The DiPCo development set, in particular, additionally includes:

  • Mother tongue ("mother_tongues")
  • Native speaker or not ("nativeness")
  • Reference device ("ref")
  • Gender ("gender")

where the reference device indicates the presumed best device for that particular utterance (e.g., the closest one or the one directly in front of the speaker; see the DiPCo paper and the CHiME-6 description).

The transcriptions_scoring files instead contain the transcripts we actually use for evaluating your submission (of course we won’t provide them for evaluation, but they are available for training and dev).
In these files the transcriptions are normalized (and we use a slightly different normalization compared to the CHiME-6 Challenge):

{
  "end_time": "11.370",
  "start_time": "11.000",
  "words": "so ummm",
  "speaker": "P03",
  "session_id": "S05"
}

As stated previously, the normalization used for scoring here differs slightly from the one used in the previous CHiME-6 Challenge.
The difference is evident by looking at our data creation script.
We basically followed the DiPCo convention and normalized all “uhm”, “um”, etc. to “ummm”/“hmmm”, as there were many inconsistencies across the datasets regarding these very common interjections. All letters are lowercased and most punctuation is removed.
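
To make this concrete, here is a rough sketch of that kind of normalization; it is only an approximation of what our data creation script does (the interjection mapping below is partly guessed), so refer to the script itself for the exact rules:

import re

# Rough approximation of the scoring normalization described above:
# drop non-word event tags like [noise], lowercase, strip most punctuation,
# and map common interjection spellings to a single form.
INTERJECTIONS = {"um": "ummm", "uhm": "ummm", "umm": "ummm",
                 "hm": "hmmm", "hmm": "hmmm", "mhm": "hmmm"}

def normalize_for_scoring(text):
    text = re.sub(r"\[[^\]]*\]", " ", text)   # remove [noise], [laughs], ...
    text = text.lower()
    text = re.sub(r"[^\w\s']", " ", text)     # remove most punctuation
    words = [INTERJECTIONS.get(w, w) for w in text.split()]
    return " ".join(words)

print(normalize_for_scoring("So, um [noise]"))  # -> "so ummm"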

Details for CHiME-7 DASR CHiME-6 and DiPCo data

Note that in CHiME-7 DASR, DiPCo sessions are renamed with an offset of 24 (e.g., DiPCo S01 becomes S25 here). The same applies to speaker identities. This is done to avoid conflicts with the CHiME-6 data, which uses the same naming convention.
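
A minimal sketch of this renaming, assuming the plain S/P prefix plus zero-padded number convention used in the tables below (the speaker example is purely illustrative):

# Minimal sketch: rename an original DiPCo session or speaker ID to its
# CHiME-7 DASR counterpart by adding the offset of 24 (e.g. "S01" -> "S25").
def rename_dipco_id(orig_id, offset=24):
    prefix, number = orig_id[0], int(orig_id[1:])
    return f"{prefix}{number + offset:02d}"

print(rename_dipco_id("S01"))  # S25
print(rename_dipco_id("P05"))  # P29 (illustrative only)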

Here we provide more information about which sessions from CHiME-6 and DiPCo we use for the training and development splits in CHiME-7 DASR.
These are summarized in this table, where we also report speaker ids and the duration of each session.

Dataset    Split    Session    Speakers (p = male, P = female)    Hours (hh:mm)
CHiME-6    train    S03        p09, p10, p11, p12                 2:11
CHiME-6    train    S04        p09, p10, p11, p12                 2:29
CHiME-6    train    S05        p13, P14, P15, p16                 2:31
CHiME-6    train    S06        p13, P14, P15, p16                 2:30
CHiME-6    train    S07        P17, p18, P19, p20                 2:26
CHiME-6    train    S08        p21, p22, p23, p24                 2:31
CHiME-6    train    S12        p33, p34, p35, P36                 2:29
CHiME-6    train    S13        p33, p34, p35, P36                 2:30
CHiME-6    train    S16        p21, p22, p23, p24                 2:32
CHiME-6    train    S17        P17, p18, P19, p20                 2:32
CHiME-6    train    S18        P41, p42, P43, P44                 2:42
CHiME-6    train    S22        P41, p42, P43, P44                 2:35
CHiME-6    train    S23        P53, p54, p55, P56                 2:58
CHiME-6    train    S24        P53, p54, p55, P56                 2:37
CHiME-6    dev      S02        P05, p06, p07, P08                 2:28
CHiME-6    dev      S09        p45, P46, p47, P48                 2:33
DiPCo      dev      S26        p29, p30, P31, P32                 0:30
DiPCo      dev      S28        p37, P38, p39, P40                 0:45
DiPCo      dev      S29        p41, p42, p43, P44                 0:45
DiPCo      dev      S33        P53, p54, P55, p56                 0:22
DiPCo      dev      S34        P53, p54, P55, p56                 0:20

Audio

Audio data for CHiME-6 and DiPCo are distributed similarly, as WAV files with a 16 kHz sample rate. For each session, close-talk microphone recordings for each participant and far-field recordings made by the arrays are available.

In CHiME-6 the close-talk microphones are binaural and thus stereo; in DiPCo they are mono. In both datasets there are 4 participants per session, so there are always 4 close-talk microphone files, with the following naming convention:

<session ID>_<speaker ID>.wav

where speaker_ID is the ID of the speaker wearing that close-talk microphone, e.g. S02_P05.wav for speaker P05 in session S02.

Regarding the far-field devices, in CHiME-6 there are 6 Kinect devices with 4 microphones each, while in DiPCo there are 5 circular-array devices with 7 microphones each. The naming convention for the audio files belonging to these devices is the same:

<session ID>_<array ID>.CH<channel ID>.wav

so, for example, S02_U05.CH1.wav is the audio from the first microphone (CH1) of array U05 in session S02.
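
As an illustration, a minimal sketch that gathers all far-field channels of a CHiME-6 session following this naming convention (paths are hypothetical and the soundfile dependency is an assumption, not a challenge requirement):

import glob
import os

import soundfile as sf  # assumed dependency for reading WAV files

# Minimal sketch: collect all far-field WAV files of one CHiME-6 session
# following the <session>_<array>.CH<channel>.wav naming convention.
audio_dir = "chime6/audio/dev"   # hypothetical root, see the tree above
session = "S02"

far_field = sorted(glob.glob(os.path.join(audio_dir, f"{session}_U*.CH*.wav")))
for path in far_field:
    info = sf.info(path)
    print(os.path.basename(path), info.samplerate, info.frames)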

Very Important
In all scenarios, because the audio comes from different devices that are not sample-synchronized, there can be significant misalignment (on the order of thousands of samples) between the audio from two devices, due to clock drift etc. This is particularly evident in CHiME-6.
CHiME-6 data also has some known issues in some sessions for some recording devices; we encourage participants to have a look here, where these issues are listed exhaustively.
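
If you want to get a rough feel for this misalignment, one possible check (purely illustrative, not part of the official tooling) is to estimate the lag that maximizes the cross-correlation between short snippets from two devices; because of clock drift, a single global lag only partially describes the problem. The file names and the soundfile dependency below are assumptions:

import numpy as np
import soundfile as sf  # assumed dependency

# Purely illustrative: estimate the lag (in samples) between short snippets
# of two devices via cross-correlation. Because of clock drift, the lag can
# change over a session, so a single estimate is only indicative.
def estimate_lag(ref_path, other_path, start_s=60.0, dur_s=10.0, sr=16000):
    start, num = int(start_s * sr), int(dur_s * sr)
    ref, _ = sf.read(ref_path, start=start, frames=num)
    oth, _ = sf.read(other_path, start=start, frames=num)
    ref, oth = ref - ref.mean(), oth - oth.mean()
    corr = np.correlate(oth, ref, mode="full")
    return int(np.argmax(corr)) - (len(ref) - 1)  # > 0: other is delayed w.r.t. ref

print(estimate_lag("S02_U01.CH1.wav", "S02_U02.CH1.wav"))  # hypothetical files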

Details for CHiME-7 DASR Mixer 6 Speech Data

As explained, the Mixer 6 Speech data used in this Task is different from the plain version available through an LDC subscription.
The audio is the same, but considerable effort went into collecting transcripts and improving other annotations (many thanks to Dr. Matthew Wiesner).

In the table below we report Mixer 6 Speech data details:

Split         Num. subjects    Num. interviewers    Hours (hh:mm)
train_intv    77               10                   36:09
train_call    81               14                   26:57
dev           26               1                    14:16
eval          -                -                    -

As explained above, the provided transcripts cover only the subject’s speech in the train_intv and train_call sets. You are free to produce pseudo-annotations for the interviewer’s speech in the train_intv and train_call files and leverage these in training; we did not do so for our baseline system. The train_call set contains 54 sessions that are not in the train_intv set. The interviewers in these sessions appear in the dev set, so we do not recommend using the audio from the interview portion of these extra sessions if you want to avoid over-fitting to the dev set.

Any Questions?

Please contact us via the Google Group or in the Slack Workspace.

References

[1] Watanabe, S., Mandel, M., Barker, J., Vincent, E., Arora, A., Chang, X., et al. (2020). CHiME-6 challenge: Tackling multispeaker speech recognition for unsegmented recordings. https://arxiv.org/abs/2004.09249
[2] Van Segbroeck, M., Zaid, A., Kutsenko, K., Huerta, C., Nguyen, T., Luo, X., et al. (2019). DiPCo–Dinner Party Corpus. https://arxiv.org/abs/1909.13447
[3] Brandschain, L., Graff, D., Cieri, C., Walker, K., Caruso, C., & Neely, A. (2010, May). Mixer 6. In Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC’10).