Task 2 - NOTSOFAR-1

Distant Meeting Transcription with a Single Device: Dual Tracks - (1) Single-Channel and (2) Known-Geometry Multi-Channel


The NOTSOFAR-1 (Natural Office Talkers in Settings Of Far-field Audio Recordings) Challenge concentrates on distant (far-field) speaker diarization and automatic speech recognition using a single recording device, emphasizing realistic conversational scenarios. The challenge features two tracks: (1) single-channel and (2) known-geometry multi-channel. We are also introducing two new datasets to the community - a benchmarking dataset that is unparalleled in scope and quality, and a matched simulated training dataset that closes major train-test gaps observed in prior work.

For evaluation purposes, we collected and released a new benchmarking dataset, comprising roughly 280 distinct meetings[1] across a variety of conference rooms. This dataset aims to capture the real-world complexities that a successful system must generalize to:

  1. A variety of acoustic situations, e.g., speech near a whiteboard (affecting the acoustic transfer function), motion during speech, speakers at varying distances and volumes.
  2. Meeting dynamics: overlapping speech, interruptions, and rapid speaker changes.
  3. The presence of transient and stationary noises.
  4. 30 different rooms of various dimensions, layouts, construction materials and acoustic characteristics.
  5. 4-8 attendees per meeting, covering a broad range of speaker counts and involving over 30 unique speakers in total.

For training purposes, we released a specially simulated training dataset of roughly 1000 hours, incorporating 15,000 real acoustic transfer functions (ATFs) to power DNN-based front-ends and facilitate generalization to realistic settings.

Participants can exploit the various supervision signals available in the simulated training set and the training portion of the benchmarking dataset, as well as commonly used open-source datasets and pre-trained models (see the Rules section).

🧭 Getting Started

  • Register to participate.
  • Read below to learn more about the challenge, tracks and goals.
  • Dev-sets are initially blind (no ground-truth). Submit to the Hugging Face Leaderboard to evaluate your system.
    • Note the two dev-sets and their properties in the Data section.
    • Follow the timeline for important events and data release schedule.
    • Learn about submission formats.
  • Our GitHub repo has everything you need, including the baseline system (see description) and data download instructions.
  • Check out the Rules. Note the text normalization, metric, and use of a single device.
  • Contact us: join the discussion in the chime-8-notsofar channel on the CHiME Slack, or chat with the organizers directly. You can also open a GitHub issue.

Introducing Two New Datasets to the Community

High-quality datasets are a key driver in advancing machine learning research. As part of this challenge, we at Microsoft are excited to release two datasets for the benefit of the wider research community.

I. Benchmarking Dataset: Natural Meeting Recordings

This dataset comprises approximately 280 unique meetings, each lasting on average 6 minutes, featuring English conversations recorded in about 30 different conference rooms at Microsoft offices. To increase acoustic diversity, each meeting was captured with several devices, each positioned differently. This setup, per meeting, involved around 5 single-channel (SC) devices producing a single internally processed stream each, and 4 multi-channel (MC) devices producing 7 raw streams each. A recording from a single device during one meeting is referred to as a ‘session’. Across all sessions, there are roughly 150 hours of SC data and 110 hours of MC data. Although multiple devices were used for recording, processing during inference is restricted to just one device (session). We divide the dataset into training, development, and evaluation sets. The latter, to be used as the fully blind evaluation set in this challenge, is entirely disjoint from the other sets, with no overlap in speakers or rooms. This dataset captures a wide range of real-world complexities and offers unique features, rendering it an excellent benchmark for conversational scenarios:

  1. Meetings last 5 to 7 minutes, totaling roughly 280 different meetings. Since each meeting offers a unique mix of acoustics, conference-room setting, and conversational dynamics, we obtain a highly diverse sample. Each meeting is approximately an independent and identically distributed (i.i.d.) element, facilitating the computation of meaningful confidence intervals (see the bootstrap sketch after this list).
  2. Attendees wore close-talk microphones, and a multi-stage annotation strategy was employed to ensure accurate transcription and mitigate machine bias.
  3. To enable deeper analysis of systems, each meeting was annotated with metadata.
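
Because each meeting can be treated as an approximately i.i.d. sample, per-meeting scores can be aggregated with standard resampling techniques. Below is a minimal sketch of a percentile-bootstrap confidence interval over per-meeting error rates; the function, the 95% level, and the illustrative numbers are assumptions for exposition, not part of the official evaluation.

```python
import numpy as np

def bootstrap_ci(per_meeting_wer, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of per-meeting error rates."""
    rng = np.random.default_rng(seed)
    wer = np.asarray(per_meeting_wer, dtype=float)
    # Resample meetings with replacement and recompute the mean each time.
    idx = rng.integers(0, len(wer), size=(n_boot, len(wer)))
    boot_means = wer[idx].mean(axis=1)
    lo, hi = np.quantile(boot_means, [alpha / 2, 1 - alpha / 2])
    return wer.mean(), (lo, hi)

# Illustrative numbers only: one hypothetical tcpWER value per meeting.
example = np.random.default_rng(1).uniform(0.2, 0.5, size=280)
mean, (lo, hi) = bootstrap_ci(example)
print(f"mean tcpWER = {mean:.3f}, 95% CI = [{lo:.3f}, {hi:.3f}]")
```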

See more details in the Data section.

II. Training Dataset: Unlocking DNN-based Front-ends

Participants in last year’s CHiME-7 DASR challenge relied mostly on guided source separation (GSS), which, not being a supervised learning method, cannot leverage a large training set. To foster innovation and encourage the adoption of data-driven solutions for speech separation/enhancement, the NOTSOFAR-1 challenge takes the following approach to ensure alignment between training and testing phases:

  1. In the multi-channel track, we fix the array geometry of the device to be processed, enabling participants to develop solutions targeted to this exact configuration.
  2. We introduce an innovative simulated training set that bridges several gaps between training and testing conditions, intended for use in both the single- and multi-channel tracks.

The training set consists of about 1000 hours simulated with the target geometry and designed to generalize to realistic settings, featuring elements such as the following (a rough mixing sketch appears after the list):

  1. Acoustic transfer functions (ATFs) recorded in real-world conference rooms with objects arranged to mimic real meeting environments. A total of 15,000 real ATFs were measured with the target geometry in various positions and rooms, employing a mouth simulator to account for speech source directivity.
  2. During the mixing process, speech overlap was designed to follow realistic patterns such as short interruptions and rapid speaker changes.
  3. A variety of transient and stationary noises, recorded in real rooms with the target geometry, were added to the mixture.
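
To make the ingredients above concrete, here is a rough, hypothetical sketch of how a multi-channel mixture could be assembled from dry utterances, measured ATFs, and recorded noise. The function, the SNR handling, and the array shapes are illustrative assumptions, not the released simulation pipeline.

```python
import numpy as np
from scipy.signal import fftconvolve

def simulate_mixture(dry_utterances, atfs, offsets, noise, snr_db=10.0):
    """Sketch of a multi-channel meeting mixture (not the official pipeline).

    dry_utterances: list of 1-D arrays, one dry signal per speaker turn.
    atfs:           list of 2-D arrays (n_mics, atf_len), one measured ATF set per turn.
    offsets:        list of start samples per turn; overlaps arise when turns collide.
    noise:          2-D array (n_mics, n_samples) of noise recorded with the target geometry.
    """
    n_mics, n_samples = noise.shape
    mix = np.zeros((n_mics, n_samples))
    for dry, atf, start in zip(dry_utterances, atfs, offsets):
        # Spatialize each turn by convolving the dry signal with its per-microphone ATFs.
        spatial = np.stack(
            [fftconvolve(dry, atf[m])[: n_samples - start] for m in range(n_mics)]
        )
        mix[:, start:start + spatial.shape[1]] += spatial
    # Scale the noise to a target SNR and add it to the overlapped speech.
    speech_pow = np.mean(mix ** 2) + 1e-12
    noise_pow = np.mean(noise ** 2) + 1e-12
    noise_gain = np.sqrt(speech_pow / (noise_pow * 10 ** (snr_db / 10)))
    return mix + noise_gain * noise
```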

Challenge Tracks

NOTSOFAR-1 features two main tracks, both centered on the use of a single device:

  1. Single-channel (SC) speech transcription, with recordings from a variety of commercial single-channel devices.
  2. Multi-channel (MC) speech transcription, with recordings from commercial devices utilizing a specific, known circular geometry (a central microphone surrounded by 6 others; a code sketch of this layout appears below).

Participants may submit to either one track or both.
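
Because the MC geometry is fixed and known, front-ends can precompute array-dependent quantities offline. The sketch below builds nominal microphone coordinates for a center-plus-six-on-a-circle layout and a far-field delay-and-sum steering vector; the 4.25 cm radius is a placeholder assumption - use the geometry published with the challenge data.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s

def circular_array_positions(radius_m):
    """Nominal (x, y) microphone positions: one at the center, six on a circle."""
    angles = np.arange(6) * np.pi / 3
    ring = radius_m * np.stack([np.cos(angles), np.sin(angles)], axis=1)
    return np.vstack([[0.0, 0.0], ring])  # shape (7, 2)

def steering_vector(positions, doa_deg, freq_hz):
    """Far-field delay-and-sum steering vector for a planar array."""
    doa = np.deg2rad(doa_deg)
    direction = np.array([np.cos(doa), np.sin(doa)])
    delays = positions @ direction / SPEED_OF_SOUND  # seconds of delay per microphone
    return np.exp(-2j * np.pi * freq_hz * delays) / len(positions)

mics = circular_array_positions(radius_m=0.0425)  # placeholder radius, not an official value
w = steering_vector(mics, doa_deg=60.0, freq_hz=1000.0)
```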

In each track, systems will be ranked according to the speaker-attributed tcpWER metric, which reflects the impact of both speaker diarization errors and word errors (see Rules). In addition, our baseline system reports the speaker-agnostic tcORC WER as a supplementary metric that isolates the impact of word errors.
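
Both metrics are implemented in the open-source meeteval package. The snippet below is a minimal sketch of scoring STM-formatted reference and hypothesis files with it; the file names and the 5-second collar are assumptions here, so consult the Rules and the baseline code for the exact evaluation protocol.

```python
import meeteval  # pip install meeteval

# Speaker-attributed, time-constrained WER; the 5-second collar is an assumption here.
# (Check meeteval's documentation for exact input handling and result formatting.)
per_session = meeteval.wer.tcpwer(reference='ref.stm', hypothesis='hyp.stm', collar=5)
print(per_session)  # per-session error-rate breakdown

# Roughly equivalent command-line form:
#   meeteval-wer tcpwer -r ref.stm -h hyp.stm --collar 5
```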

To assist participants in getting started, we provide baseline models for each track, complete with inference, training, and evaluation code.

Scientific Goals

In collaboration with this year’s related CHiME-8 DASR task, which focuses on generalizing to unknown array geometries and varying scenarios, we aim to answer fundamental questions in far-field speech transcription:

  1. What is the performance gain when transitioning from the single-channel setting (NOTSOFAR-SC) to the geometry-specific multi-channel setting (NOTSOFAR-MC)? How is that gain affected when transitioning to the geometry-agnostic multi-channel setting (CHiME-8 DASR) instead?

    We note that, to enable this comparison, the NOTSOFAR natural conference-meetings dataset is featured in all three tracks. Additionally, each entry in CHiME-8 DASR (geometry-agnostic) will also automatically be considered an entry in NOTSOFAR-MC (geometry-specific).

Furthermore, NOTSOFAR poses the following research questions:

  1. Can the introduced simulated training dataset lead to data-driven front-end solutions that generalize well to realistic acoustic scenarios?
  2. How can the various supervision signals available for training be leveraged to improve algorithms? These include the separated speech components in the simulated training dataset, as well as the close-talk recordings, transcriptions, and speaker labels in the real meeting recordings dataset (see the SI-SDR sketch after this list).
  3. Can the data verticals analysis enabled by the metadata reveal potential avenues for progress?
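
As one entirely illustrative way to use the separated speech components as supervision, a front-end's output could be trained or validated against them with a signal-level objective such as scale-invariant SDR. The minimal sketch below assumes this setup; it is not part of the challenge tooling.

```python
import numpy as np

def si_sdr(estimate, target, eps=1e-8):
    """Scale-invariant SDR (in dB) between an enhanced signal and its clean target."""
    target = target - target.mean()
    estimate = estimate - estimate.mean()
    # Project the estimate onto the target to remove any scale difference.
    scale = np.dot(estimate, target) / (np.dot(target, target) + eps)
    projection = scale * target
    residual = estimate - projection
    return 10 * np.log10((np.sum(projection ** 2) + eps) / (np.sum(residual ** 2) + eps))

# Illustrative use: score a hypothetical front-end output against the
# separated speech component provided as supervision in the simulated set.
rng = np.random.default_rng(0)
clean = rng.standard_normal(16000)
enhanced = clean + 0.1 * rng.standard_normal(16000)
print(f"SI-SDR: {si_sdr(enhanced, clean):.1f} dB")
```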

We collaborate closely with CHiME-8 DASR to make it easy to experiment with both tasks. We encourage participants interested in the multi-channel setting to enroll in both tasks and explore ways to transition their systems from the geometry-specific setting to the geometry-agnostic one and vice versa.

🏅 Practical System Highlights

We aim to promote the development of innovative, practical systems rather than performance-squeezing approaches that are more brute-force in nature. When submitting results on the evaluation set, we ask participants to include their system’s runtime and hardware specifications. Systems deemed practical and efficient will be featured on a dedicated leaderboard at the end of the challenge.

Additionally, following the submission deadline for official ranking, we will allow an exploration period of about two weeks during which participants are encouraged to submit results from alternative systems - more efficient or simpler approaches, ablation studies, etc. These can be described in an extended technical report and will be considered for highlighting on the dedicated leaderboards for efficient systems and potentially other categories of interest.

Important Dates

The challenge is hosted by CHiME-8, a satellite event of the Interspeech Conference.

  • February 1st, ‘24 - Challenge begins with the release of baseline model code, simulated training set of roughly 1000 hours, 32 hours of train-set-1, and 32 hours of dev-set-1 (no ground-truth)
  • Mid March, ‘24 - Live leaderboard launches, allowing each team up to 5 daily submissions on the dev-set
  • March 1st, ‘24 - Second batch of 30 hours of recorded meeting train-set released
  • Mid April, ‘24 - dev-set-2 released (no ground-truth), dev-set-1 ground-truth revealed
  • June 1st, ‘24 - dev-set-2 ground-truth revealed
  • June 15th, ‘24 - 40-60 (TBD) hours of blind evaluation set released (no ground-truth). Teams are allowed a single submission with up to two results per track for official ranking
  • June 28th, ‘24 - Challenge ends → teams submit their technical reports
  • July 1st - July 15th, ‘24 - Teams are invited to submit up to 10 results for exploration of additional systems and get the chance to be highlighted in the dedicated non-official leaderboards
  • September 6th, ‘24 - CHiME-8 Satellite Workshop → an exciting event at Interspeech! (Dates and details TBD)

Please note: The dates and the number of hours specified for each release are subject to slight adjustments.


[1] Initially, we announced a total of 315 meetings. However, we have revised this figure to 280 to reserve some meetings for potential future challenges.

