Data

⚠️ Do not use CHiME-7 DASR data
The training and evaluation portions are slightly different from those in CHiME-7 DASR.
To generate CHiME-8 DASR core data:

  1. Each team should file a request for Mixer 6 Speech data by filling out this PDF form and submitting it to the Linguistic Data Consortium (LDC) via email at ldc@ldc.upenn.edu, as specified in the document.
  2. You can generate the rest of the data (CHiME-6, DiPCo and NOTSOFAR1) using chime-utils (automatic download and generation).
    • E.g. to generate all data in one go: chime-utils dgen dasr ./download /path/to/mixer6_root ./chime8_dasr --part train,dev --download

💪 In addition to the core datasets, participants are free to use external datasets (and to propose new ones!). See the Rules page for how to propose additional datasets.

Core Datasets Description

CHiME-8 DASR consists of 4 different scenarios on which participants' systems will be evaluated.

  • All scenarios have a training portion, a development portion and an evaluation portion.
    • The original DiPCo and Mixer 6 development sets have been split into training and development portions.
  • Mixer 6 Speech also has additional, partially annotated data for training (train_intv, train_call).
  • NOTSOFAR1 also has additional single-channel far-field data (train_sc) for training (taken after array pre-processing).

In the table below, we report, for each scenario and split, the total number of utterances, speakers and sessions, as well as the percentage of silence, single-speaker speech (1-spk) and overlapped speech (ovl) over the total data duration.
Evaluation values are intentionally omitted.

Scenario    Split       Size (hh:mm)  Utterances  Speakers  Sessions  silence (%)  1-spk (%)  ovl (%)
CHiME-6     train       15:49         79967       32        16        22.1         53.1       24.8
CHiME-6     dev         4:25          7437        8         2         12.5         43.7       43.8
CHiME-6     eval        5:12          x           x         2         x            x          x
DiPCo       train       1:12          1379        8         3         8.3          72.1       19.6
DiPCo       dev         1:30          2294        8         2         7.2          62.1       30.7
DiPCo       eval        2:36          x           x         5         x            x          x
Mixer 6     train_call  36:09         27280       81        243       N.A.         N.A.       N.A.
Mixer 6     train_intv  26:57         29893       77        189       N.A.         N.A.       N.A.
Mixer 6     train       5:51          3785        19        24        2.8          78           19.2
Mixer 6     dev         8:28          5903        22        35        3.4          76           20.6
Mixer 6     eval        5:45          x           x         23        x            x          x
NOTSOFAR1   train       10:59         26606       110       11        4.9          67.4       27.7
NOTSOFAR1   train_sc    17:59         43682       180       11        4.9          67.4       27.7
NOTSOFAR1   dev         10:39         x           106       12        x            x          x
NOTSOFAR1   eval        30:00         x           x         x         x            x          x

Note that we cannot compute these statistics for Mixer 6 train_intv and train_call because those sets are only partially annotated.
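If you want to reproduce similar statistics on your own splits, a minimal sketch in Python is shown below; it estimates the silence, single-speaker and overlap percentages from one SegLST annotation file via a simple sweep over segment boundaries (the file path is only an example, and the session length is crudely taken as the last segment end):

import json

def overlap_stats(seglst_path):
    # Estimate silence / 1-spk / overlap percentages from a SegLST JSON file.
    with open(seglst_path) as f:
        segments = json.load(f)

    # Turn every segment into a +1 (start) / -1 (end) event and sweep over time.
    events = []
    for seg in segments:
        events.append((float(seg["start_time"]), +1))
        events.append((float(seg["end_time"]), -1))
    events.sort()

    total = max(float(seg["end_time"]) for seg in segments)  # crude session length
    silence = single = overlap = 0.0
    active, prev_t = 0, 0.0
    for t, delta in events:
        span = t - prev_t
        if active == 0:
            silence += span
        elif active == 1:
            single += span
        else:
            overlap += span
        active += delta
        prev_t = t

    return {key: 100.0 * value / total for key, value in
            {"silence": silence, "1-spk": single, "ovl": overlap}.items()}

# e.g. overlap_stats("chime8_dasr/chime6/transcriptions/dev/S02.json")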

CHiME-6

⚠️ In CHiME-8 DASR, the CHiME-6 scenario uses the same splits as the original CHiME-6 challenge.

🛠 We also offer Kaldi-derived forced-alignment annotation for the whole CHiME-6 scenario: CHiME6_falign.

CHiME-6 features more than 40 h of multi-device recordings organized in 20 different sessions.
Each session consists of a dinner party among 4 friends. Due to the informal setting, it features highly conversational speech with lots of overlap and back-channel responses.
Speakers are free to roam between different rooms of the apartment where the dinner party takes place.

Recording Setup

Each session is recorded with 6 far-field Kinect microphone arrays, each with 4 microphones. These arrays are placed in different rooms.
Binaural microphone signals, worn by each of the four participants, are also provided, although they have significant cross-talk.
They were employed to provide close-talk references to aid the ground-truth transcription.
The audio files are named as follows:

  • close_talk microphones: <session_id>_<spk_id>.wav
  • far-field array microphones: <session_id>_<array_number>.<channel_number>.wav

⚠️ Some sessions may lack some devices, e.g. S02 lacks array U04. See the CHiME-6 data page.
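For instance, here is a small sketch (paths are illustrative) that groups the far-field channels of one session by array while simply skipping devices that are missing:

from pathlib import Path
from collections import defaultdict

def collect_array_channels(audio_dir, session_id):
    # Group far-field wavs named <session_id>_<array_number>.<channel_number>.wav by array.
    # Arrays that are absent for a session (e.g. U04 in S02) simply produce no entries.
    channels = defaultdict(list)
    for wav in sorted(Path(audio_dir).glob(f"{session_id}_U*.*.wav")):
        array_id = wav.stem.split(".")[0].split("_")[-1]  # e.g. "U01"
        channels[array_id].append(wav)
    return dict(channels)

# e.g. collect_array_channels("chime8_dasr/chime6/audio/dev", "S02")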

DiPCo

DiPCo features a scenario similar to CHiME-6, consisting of 10 recorded dinner-party sessions, each among 4 work colleagues.
Sessions are shorter than in CHiME-6, and every session is recorded in the same room.

Recording Setup
Each session is recorded by 5 far-field devices, each with a 7-mic circular array (six microphones in a circle plus one at the center).
Per-speaker close-talk microphones are also provided; these have less cross-talk than in CHiME-6.
Audio files formatting is the same as in CHiME-6:

  • close_talk microphones: <session_id>_<spk_id>.wav
  • far-field array microphones: <session_id>_<array_number>.<channel_number>.wav

Mixer 6 Speech

Mixer 6 Speech consists of 594 distinct native English speakers participating in a total of 1425 sessions, each lasting approximately 45 minutes.
Sessions include an interview between an "interviewer" and a "subject" (15 min.), a telephone call (10 min.), and prompt reading.

❗❗ Here we make use of only the interview portion. Full conversation annotation is available only for the train, development and evaluation sets. There are two additional training sets with only partial annotation:

  • train_call: only annotation for the "subject" is available; "interviewer" speech is not annotated.
  • train_intv: only annotation for the "subject" is available; "interviewer" speech is not annotated.

This year, the development portion annotation was manually checked.
It was then split into a train and a development set to allow for adaptation. ❗❗ Note that these two sets always have one speaker in common (the "interviewer" P181 is the same).
Be careful to avoid overfitting!

To learn more about the Mixer 6 Speech annotation process, have a look at the CHiME-7 DASR paper.

Recording Setup

Each session is recorded by 13 heterogeneous recording devices:

  • 3 close-talk devices (CH1-CH3 devices): <session_id>_<channel_number>.wav
  • 10 far-field devices (CH4-CH13 devices): <session_id>_<channel_number>.wav
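
For instance, a tiny helper (a sketch, assuming the zero-padded channel naming used in the devices JSON shown later on this page) that lists the expected far-field file names for a Mixer 6 session:

def mixer6_far_field_wavs(session_id):
    # Far-field devices are CH04 through CH13, named <session_id>_<channel_number>.wav.
    return [f"{session_id}_CH{ch:02d}.wav" for ch in range(4, 14)]

# e.g. mixer6_far_field_wavs("20090714_134807_LDC_120290")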

NOTSOFAR1

The NOTSOFAR1 scenario features very short meetings (~6 mins) with 4 to 8 participants in an office environment.
For the training set, NOTSOFAR1 also offers word-level alignments, which you may find useful.

❗❗ The development set of NOTSOFAR1 must be scored via the leaderboard (see the results page). Also note that the development set and the training set have speakers in common.
The following speakers appear in both the training and development sets:

{'P199', 'P204', 'P198', 'P194', 'P197', 'P195', 'P202', 'P192', 'P193', 'P200', 'P203'}

As such, be careful when using the NOTSOFAR1 training split for fine-tuning or for training diarization systems: you may end up overfitting to the development set.
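A simple precaution, sketched below (assuming the per-session SegLST layout described later on this page), is to flag which NOTSOFAR1 training sessions contain any of the shared speakers before using them for fine-tuning:

import json
from pathlib import Path

SHARED_SPEAKERS = {"P199", "P204", "P198", "P194", "P197", "P195",
                   "P202", "P192", "P193", "P200", "P203"}

def sessions_with_shared_speakers(transcriptions_dir):
    # Return {session: shared speakers} for training sessions that overlap with the dev set.
    # The directory layout is an assumption for illustration.
    flagged = {}
    for json_file in sorted(Path(transcriptions_dir).glob("*.json")):
        segments = json.loads(json_file.read_text())
        common = {seg["speaker"] for seg in segments} & SHARED_SPEAKERS
        if common:
            flagged[json_file.stem] = sorted(common)
    return flagged

# e.g. sessions_with_shared_speakers("chime8_dasr/notsofar1/transcriptions/train")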

Recording Setup

Each session is recorded by 1 distant circular-array device.
For training, additional single-channel data is provided (train_sc). This data comes either from another single-channel far-field device or from an array device after some post-processing.
Close-talk microphones are also available for the training portion. These have low cross-talk.
Audio files formatting is the same as in CHiME-6:

  • close_talk microphones: <session_id>_<spk_id>.wav
  • far-field array microphones: <session_id>_U01.<channel_number>.wav
    • only one array device per session is available.

More details are available on the NOTSOFAR task data page.
Note that we use different data and metadata names in DASR to make it simpler for participants to parse the data across the 4 scenarios (NOTSOFAR1 is made as similar as possible to CHiME-6 by our data generation scripts), but the underlying data is the same.

Core Datasets Metadata and Organization

Once prepared with chime-utils or by running stage 0 in the baselines, the core data folder should look like the tree below.
You can check whether data generation was successful with chime-utils dgen checksum ./chime8_dasr.

.
├── chime6
│   ├── audio
│   ├── devices
│   ├── transcriptions
│   ├── transcriptions_scoring
│   └── uem
├── dipco
│   ├── audio
│   ├── devices
│   ├── transcriptions
│   ├── transcriptions_scoring
│   └── uem
├── mixer6
│   ├── audio
│   ├── devices
│   ├── transcriptions
│   ├── transcriptions_scoring
│   └── uem
└── notsofar1
    ├── audio
    ├── devices
    ├── transcriptions
    ├── transcriptions_scoring
    └── uem

⚠️ Transcriptions and transcriptions_scoring are available only for the training and development portions.
For NOTSOFAR1, ground truth is available only for training; the development portion is blind and must be scored via the leaderboard (see the results page).

SegLST (CHiME-6 style JSON) annotation

Each session is annotated with a SegLST (Segment-wise Long-form Speech Transcription) formatted JSON file.

{
  "end_time": "11.370",
  "start_time": "11.000",
  "words": "So, um [noise]",
  "speaker": "P03",
  "session_id": "S05"
}

🚧 Some datasets may have additional fields, since they offer more metadata (e.g. gender, reference array, word alignment and so on).
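
A minimal way to consume these files (the path below is just an example) is to load the list of segments and coerce the string timestamps to floats:

import json

def load_seglst(path):
    # Load a SegLST JSON file; any extra dataset-specific fields are kept untouched.
    with open(path) as f:
        segments = json.load(f)
    for seg in segments:
        seg["start_time"] = float(seg["start_time"])
        seg["end_time"] = float(seg["end_time"])
    return sorted(segments, key=lambda s: (s["session_id"], s["start_time"]))

# e.g. segments = load_seglst("chime8_dasr/chime6/transcriptions/dev/S02.json")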

We have two metadata folders:

  1. transcriptions
    • this contains the original annotation with no changes.
  2. transcriptions_scoring
    • this contains the annotation in the form that will be used for scoring, with text normalization applied.

For example, here is how the utterance above appears after text normalization:

{
  "end_time": "11.370",
  "start_time": "11.000",
  "words": "so hmm",
  "speaker": "P03",
  "session_id": "S05"
}

In CHiME-8 DASR we use a modified Whisper text normalization. See the submission page for more info about normalization and the ranking score.
We will apply text normalization automatically to your predictions before scoring.
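
As a rough illustration only (we use a modified normalizer, so the official scoring output may differ slightly), you can get a feel for this kind of normalization with the stock Whisper English normalizer:

# pip install openai-whisper
from whisper.normalizers import EnglishTextNormalizer

normalizer = EnglishTextNormalizer()
# Brackets, punctuation and casing are stripped; the challenge's modified version differs slightly.
print(normalizer("So, um [noise]"))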

Devices JSONs

For each scenario and each part (training, dev and eval), we provide a JSON file for each session with information about each device (and thus about the corresponding wav or flac file).
Here is an example from Mixer 6 for session 20090717_113617_LDC_120278.
Note that close-talk devices ("is_close_talk": true) are available only in training.

    "20090717_113617_LDC_120278_CH02": {
        "is_close_talk": true,
        "speaker": "P78",
        "channel": 1,
        "tot_channels": 1,
        "device_type": "headmic"
    },
    "20090717_113617_LDC_120278_CH03": {
        "is_close_talk": true,
        "speaker": "P181",
        "channel": 1,
        "tot_channels": 1,
        "device_type": "lavaliere"
    },
    "20090717_113617_LDC_120278_CH04": {
        "is_close_talk": false,
        "speaker": null,
        "channel": 1,
        "tot_channels": 7,
        "device_type": "podium_mic"

You can use this file to help parse the data, as we do in chime-utils for e.g. lhotse manifest preparation.
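
For example, a short sketch (the path is illustrative) that partitions the devices of a session into far-field and close-talk lists using this JSON:

import json

def split_devices(devices_json_path):
    # Partition a session's devices into far-field and close-talk device names.
    with open(devices_json_path) as f:
        devices = json.load(f)
    far_field = [name for name, info in devices.items() if not info["is_close_talk"]]
    close_talk = [name for name, info in devices.items() if info["is_close_talk"]]
    return far_field, close_talk

# e.g. split_devices("path/to/mixer6/devices/20090717_113617_LDC_120278.json")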

UEM files

A universal evaluation map (UEM) file is provided in the uem folder of each corpus; there is one all.uem file per scenario.
Each line in a UEM file has the format: <session_id> 1 <start_time> <end_time>.

20090714_134807_LDC_120290 1 45.120 893.365
20090716_155120_LDC_120269 1 46.943 828.543
20090722_115429_LDC_120271 1 44.060 964.302
20090723_111806_LDC_120290 1 34.030 790.550
20090803_111429_LDC_120225 1 36.370 987.002
20090804_165853_LDC_120269 1 31.500 835.840
20090807_143559_LDC_120338 1 29.290 916.630
20090811_150119_LDC_120354 1 62.530 829.630

For each session, it reports the time boundaries within which your system will be scored. See the NIST documentation for UEM files.
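
As a quick illustration (the path is only an example), the snippet below parses an all.uem file and clips a SegLST-style segment to the scored region of its session:

def load_uem(uem_path):
    # Parse a UEM file into {session_id: (start, stop)}, one scored region per session.
    regions = {}
    with open(uem_path) as f:
        for line in f:
            if not line.strip():
                continue
            session_id, _channel, start, stop = line.split()
            regions[session_id] = (float(start), float(stop))
    return regions

def clip_to_uem(segment, regions):
    # Clip a segment to its session's scored region; returns None if it falls outside.
    lo, hi = regions[segment["session_id"]]
    start = max(float(segment["start_time"]), lo)
    end = min(float(segment["end_time"]), hi)
    return (start, end) if start < end else None

# e.g. regions = load_uem("path/to/mixer6/uem/all.uem")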

Allowed External Data

We allow the use of the following existing open-source datasets:

  1. AMI
  2. LibriSpeech
  3. MUSAN
  4. RWCP Sound Scene Database
  5. REVERB Challenge RIRs
  6. Aachen AIR dataset
  7. BUT Reverb database
  8. SLR28 RIR and Noise Database (contains Aachen AIR, MUSAN noise,
    RWCP sound scene database and REVERB challenge RIRs, plus simulated ones).
  9. VoxCeleb 1&2
  10. FSD50k
  11. WSJ0-2mix, WHAM, WHAMR, WSJ
  12. SINS
  13. LibriCSS acoustic transfer functions (ATF)
  14. NOTSOFAR1 simulated CSS dataset

Note that some of the aforementioned external datasets may have overlapping data (e.g. SLR28 database containing noises from MUSAN).

📩 Contact

For questions or help, you can reach the organizers via CHiME Google Group or via CHiME Slack Workspace.