Distant Meeting Transcription with a Single Device: Dual Tracks - (1) Single-Channel and (2) Known-Geometry Multi-Channel


The NOTSOFAR-1 (Natural Office Talkers in Settings Of Far-field Audio Recordings) Challenge focuses on distant (far-field) speaker diarization and automatic speech recognition using a single recording device, emphasizing realistic conversational scenarios. The challenge features two tracks: (1) single-channel and (2) known-geometry multi-channel. We are also introducing two new datasets to the community: a benchmarking dataset that is unparalleled in scope and quality, and a matched simulated training dataset that closes major train-test gaps observed in prior work.

For evaluation purposes, we collected and released a new benchmarking dataset, comprising roughly 280 distinct meetings[1] across a variety of conference rooms. This dataset aims to capture the real-world complexities that a successful system must generalize to:

  1. A variety of acoustic situations, e.g., speech near a whiteboard (affecting the acoustic transfer function), motion during speech, speakers at varying distances and volumes.
  2. Meeting dynamics: overlapping speech, interruptions, rapid speaker changes.
  3. The presence of transient and stationary noises.
  4. 30 different rooms of various dimensions, layouts, construction materials and acoustic characteristics.
  5. 4-8 attendees per meeting, covering a broad range of speaker counts and involving over 30 unique speakers in total.

For training purposes, we released a specially simulated training dataset of roughly 1000 hours, incorporating 15,000 real acoustic transfer functions (ATFs) to power DNN-based front-ends and facilitate generalization to realistic settings.

Participants can exploit the various supervision signals available in the simulated training set and the training portion of the benchmarking dataset, as well as commonly used open-source datasets and pre-trained models (see the Rules section).

🧭 Getting Started

  • Register to participate.
  • Read below to learn more about the challenge, tracks and goals.
  • Dev-sets are initially blind (no ground-truth). Submit to the Hugging Face Leaderboard to evaluate your system.
    • Note the two dev-sets and their properties in the Data section.
    • Follow the timeline for important events and data release schedule.
    • Learn about submission formats.
  • Our GitHub repo has everything you need, including the baseline system (see description), and data download instructions.
  • Check out the Rules. Note the text normalization, metric, and use of a single device.
  • Contact us: join the discussion in the chime-8-notsofar channel on the CHiME Slack, or chat with the organizers directly. You can also open a GitHub issue.

Introducing Two New Datasets to the Community

High-quality datasets are a key driver in advancing machine learning research. As part of this challenge, we at Microsoft are excited to release two datasets for the benefit of the wider research community.

I. Benchmarking Dataset: Natural Meeting Recordings

This dataset comprises approximately 280 unique meetings, each lasting on average 6 minutes, featuring English conversations recorded in about 30 different conference rooms at Microsoft offices. To increase acoustic diversity, each meeting was captured with several devices, each positioned differently. This setup, per meeting, involved around 5 single-channel (SC) devices producing a single internally processed stream each, and 4 multi-channel (MC) devices producing 7 raw streams each. A recording from a single device during one meeting is referred to as a ‘session’. Across all sessions, there are roughly 150 hours of SC data and 110 hours of MC data. Although multiple devices were used for recording, processing during inference is restricted to just one device (session). We divide the dataset into training, development, and evaluation sets. The latter, to be used as the fully blind evaluation set in this challenge, is entirely disjoint from the other sets, with no overlap in speakers or rooms. This dataset captures a wide range of real-world complexities and offers unique features, rendering it an excellent benchmark for conversational scenarios:

  1. Meetings last between 5 and 7 minutes, totaling roughly 280 different meetings. Since each meeting offers a unique mix of acoustics, conference-room setting, and conversational dynamics, we obtain a highly diverse sample. Each meeting is approximately an independent and identically distributed (i.i.d.) element, facilitating the computation of meaningful confidence intervals.
  2. Attendees wore close-talk microphones and a multi-stage annotation strategy was employed to ensure accurate transcription and machine-bias mitigation.
  3. To enable deeper analysis of systems, each meeting was annotated with metadata.
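Because each meeting can be treated as an approximately i.i.d. element, per-meeting scores lend themselves to simple resampling statistics. The sketch below shows a percentile-bootstrap confidence interval over hypothetical per-meeting WERs; the function name and all numbers are illustrative assumptions, not part of any official toolkit.

```python
import random

def bootstrap_ci(per_meeting_wers, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile-bootstrap CI for the mean WER, treating each
    meeting as an i.i.d. sample. Hypothetical helper for
    illustration only."""
    rng = random.Random(seed)
    n = len(per_meeting_wers)
    # Resample meetings with replacement and collect bootstrap means.
    means = sorted(
        sum(rng.choices(per_meeting_wers, k=n)) / n
        for _ in range(n_boot)
    )
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# Example with made-up per-meeting WERs:
# lo, hi = bootstrap_ci([0.20, 0.30, 0.25, 0.40, 0.35, 0.22, 0.28, 0.31])
```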

See more details in the Data section.

II. Training Dataset: Unlocking DNN-based Front-ends

Participants in last year’s CHiME-7 DASR challenge relied mostly on guided source separation (GSS), an unsupervised method that cannot leverage a large training set. To foster innovation and encourage the adoption of data-driven solutions for speech separation/enhancement, the NOTSOFAR-1 challenge takes the following approach to ensure alignment between training and testing phases:

  1. In the multi-channel track, we fix the array geometry of the device to be processed, enabling participants to develop solutions targeted to this exact configuration.
  2. We introduce an innovative simulated training set that bridges several gaps between training and testing conditions, intended for use in both the single- and multi-channel tracks.

The training set consists of about 1000 hours simulated with the target geometry, designed to generalize to realistic settings, featuring elements such as:

  1. Acoustic transfer functions (ATFs) recorded in real-world conference rooms with objects arranged to mimic real meeting environments. A total of 15,000 real ATFs were measured with the target geometry in various positions and rooms, employing a mouth simulator to account for speech source directivity.
  2. During the mixing process, speech overlap was designed to follow realistic patterns, such as short interruptions and rapid speaker changes.
  3. A variety of transient and stationary noises, recorded in real rooms with the target geometry, were added to the mixture.
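A minimal sketch of the kind of mixing described above: convolve dry speech with a measured ATF (impulse response) and add recorded noise at a target SNR. The function name, SNR convention, and all signals below are assumptions for illustration, not the actual simulation pipeline.

```python
import numpy as np

def simulate_channel(dry_speech, atf, noise, snr_db):
    """Sketch of one simulated microphone channel: convolve dry
    speech with a measured ATF (room impulse response), then add
    recorded noise scaled to a target SNR. Illustrative only."""
    # Reverberant speech, truncated to the dry-speech length.
    reverberant = np.convolve(dry_speech, atf)[: len(dry_speech)]
    noise = noise[: len(reverberant)]
    # Scale noise so that speech/noise power ratio matches snr_db.
    speech_pow = np.mean(reverberant ** 2)
    noise_pow = np.mean(noise ** 2) + 1e-12
    gain = np.sqrt(speech_pow / (noise_pow * 10 ** (snr_db / 10)))
    return reverberant + gain * noise
```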

Challenge Tracks

NOTSOFAR-1 features two main tracks, both centered on the use of a single device:

  1. Single-channel (SC) speech transcription, with recordings from a variety of commercial single-channel devices.
  2. Multi-channel (MC) speech transcription, with recordings from commercial devices utilizing a specific, known circular geometry (a central microphone surrounded by 6 others).

Participants may submit to either one track or both.
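For illustration, the known MC geometry (one central microphone surrounded by 6 on a circle) can be written down as coordinates. The radius below is a placeholder assumption; the exact dimensions are not stated in this section.

```python
import math

def circular_array_positions(radius_m=0.05, n_ring=6):
    """(x, y) positions in meters for a 7-mic array: one center
    mic plus n_ring mics evenly spaced on a circle. The radius is
    a placeholder, not the actual device specification."""
    positions = [(0.0, 0.0)]  # central microphone
    for k in range(n_ring):
        theta = 2 * math.pi * k / n_ring
        positions.append((radius_m * math.cos(theta),
                          radius_m * math.sin(theta)))
    return positions
```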

In each track, systems will be ranked according to the speaker-attributed tcpWER metric, which captures the impact of both speaker diarization errors and word errors (see Rules). In addition, our baseline system reports the speaker-agnostic tcORC WER as a supplementary metric to isolate the impact of word errors.
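To make the word-error component concrete, here is a plain word error rate computed via edit distance. This is a simplification for illustration only: the ranking metric, tcpWER, additionally enforces speaker attribution and time constraints on word matches.

```python
def word_error_rate(reference, hypothesis):
    """Plain WER via Levenshtein distance over words. The actual
    challenge metric (tcpWER) builds on this with speaker and
    word-timing constraints."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

# word_error_rate("a b c", "a x c") -> one substitution out of 3 words
```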

To assist participants in getting started, we provide baseline models for each track, complete with inference, training, and evaluation code.

Scientific Goals

In collaboration with this year’s related CHiME-8 DASR task, which focuses on generalizing to unknown array geometries and varying scenarios, we aim to answer fundamental questions in far-field speech transcription:

  1. What is the performance gain when transitioning from the single-channel setting (NOTSOFAR-SC) to the geometry-specific multi-channel setting (NOTSOFAR-MC)? How is that gain impacted when transitioning to the geometry-agnostic multi-channel setting (CHiME-8 DASR) instead?

    We note that to enable this comparison, the NOTSOFAR natural conference meetings dataset is featured in all three tracks. Additionally, each entry in CHiME-8 DASR (geometry-agnostic) will also be automatically considered as an entry in NOTSOFAR-MC (geometry-specific).

Furthermore, NOTSOFAR poses the following research questions:

  1. Can the introduced simulated training dataset lead to data-driven front-end solutions that generalize well to realistic acoustic scenarios?
  2. How can the various available supervision signals for training be leveraged to improve algorithms? Namely, the separated speech components within the simulated training dataset, along with the close-talk recordings, transcriptions, and speaker labels found in the real meeting recordings dataset.
  3. Can the data verticals analysis enabled by the metadata reveal potential avenues for progress?

We collaborate closely with CHiME-8 DASR to make it easy to experiment with both tasks. We encourage participants interested in the multi-channel setting to enroll in both tasks and explore ways to transition their systems from the geometry-specific setting to the geometry-agnostic one and vice versa.

🏅 Jury Award for "Most Practical and Efficient System"

This year we are introducing a special jury award/mention to encourage the development of innovative and practical systems, moving away from performance-squeezing, brute-force approaches such as ensembling or iterative inference-time pseudo-labeling and retraining.
We highly recommend submitting your system even if its performance is not state-of-the-art: it still stands a chance to win this jury award if it is practically and scientifically interesting.

Important Dates

The challenge is hosted by CHiME-8, a satellite event of the Interspeech Conference.

| Date | Event |
| --- | --- |
| February 1st, ‘24 | Challenge begins with the release of baseline model code, the simulated training set of roughly 1000 hours, 32 hours of train-set-1, and 32 hours of dev-set-1 (no ground-truth) |
| Mid February, ‘24 | Live leaderboard launches, allowing each team up to 5 daily submissions on the dev-set |
| March 1st, ‘24 | Second batch of 30 hours of recorded meeting train-set released |
| Mid April, ‘24 | dev-set-2 released (no ground-truth); dev-set-1 ground-truth revealed. dev-set-2 becomes the official dev set, and dev-set-1 joins the training set |
| June 13th, ‘24 | dev-set-2 ground-truth revealed |
| July 1st, ‘24 | Blind evaluation set released (no ground-truth). Teams may submit up to 4 systems per track for both final ranking and the jury award |
| July 17th, ‘24 | Challenge ends → teams submit their technical reports |
| TBD, ‘24 | A modified version of the NOTSOFAR datasets is released for open academic research |
| September 6th, ‘24 | CHiME-8 Satellite Workshop → an exciting event at Interspeech! (Dates and details TBD) |

Please note: The dates and the number of hours specified for each release are subject to slight adjustments.

[1] Initially, we announced a total of 315 meetings. However, we have revised this figure to 280 to reserve some meetings for potential future challenges.
