Data

The data consists of two datasets:

A meeting recordings dataset for benchmarking and training.
A simulated training dataset.

See below for License.

Download

Refer to the NOTSOFAR GitHub for download instructions.

Meetings recordings dataset

The following table summarizes our dataset with single-channel (SC) and multi-channel (MC) recordings, corresponding to the challenge tracks.

	Train set 1	Train set 2	Train set 3 (formerly Dev set 1)	Dev set (formerly Dev set 2)	Eval set (blind)
Number of meetings	37	40	36	33	80
Meeting duration	6 minutes (avg.)
Devices (SC / MC)	5 / 4	5 / 4	5 / ~3	~4 / ~4	2 / 2
Sessions (SC / MC)	185 / 148	200 / 160	177 / 106	117 / 130	160 / 160
Hours (SC / MC)	18 / 14	20 / 16	17 / 10	11 / 13	16 / 16
Total number of rooms	20 rooms				~10 rooms
Total number of participants	22 participants				~10 participants

Although multiple devices were used for recording, during inference the NOTSOFAR challenge restricts processing to just one device (session).
The Evaluation set is entirely disjoint from the Training and Development sets, with no overlap in participants or rooms.
Dev-set-1 and the Training sets share most of their participants. Users of Dev-set-1 should be mindful of overfitting to specific participants.
Dev-set-2 includes mostly new participants compared to the Training sets and Dev-set-1, but there are still some overlapping participants. Namely: ‘Jim’, ‘Walter’, ‘David’, ‘Jerry’. It is advised not to use those participants for training so that dev-set-2 remains a reliable estimate of performance.
For the dataset release schedule see Challenge Important Dates. Upon the release of Dev-set-2, we will unveil the ground truth for Dev-set-1. Dev-set-2 will become the official dev set, and Dev-set-1 will join the training set.

Role-play meeting topics

Meetings are role-played by the participants. Most meetings feature semi-professional topics, in which participants role-play as professionals discussing a work-related issue. For example, a cruise ship company planning an event, administrators planning a city park, or users complaining about IT problems. Some meetings feature non-work-related topics, such as favorite TV shows, debating whether to raise kids as vegetarians, or friends sharing recipes.

Sample recording and transcription

This is a sample recording from a single channel stream. The meeting participants are discussing ways to recruit more employees.


Transcription

start_time	end_time	speaker_id	text
9.71	11.3	Sophie	`<ST/> people to come work with us.`
11.41	15.09	Sophie	`So what can we change about our business about the way that we treat our uh <ST/>`
15.8	18.56	Sophie	`<ST/> um workers in order to get more people to come.`
18.75	22.1	Jim	`I think if we take better care of the workers you know like <ST/>`
21.23	21.92	Sophie	`How so?`
22.18	29.18	Jim	`<ST/> well for such you know like provide more things like food or free activities in the office, I think a lot more people would like to come.`
27.07	27.71	Sophie	`Mm-hmm.`
29.38	29.81	Sophie	`Yeah.`
29.97	32.22	Sophie	`Like if there was like maybe muffins in the morning.`
32.42	33.08	Jim	`Yeah.`
32.7	33.17	Bert	`Yes.`
32.99	35.86	Sophie	`<ST/> or free coffee or something, you know people would <ST/>`
34.54	35.31	Jim	`Definitely.`
35.06	35.42	Bert	`Yes.`
35.75	36.43	Bert	`Yeah. But how?`
36.59	37.65	Sophie	`<ST/> be more eager to come.`
37.64	38.16	Jim	`Yeah.`
38.21	42.4	Bert	`We need to get people to get in to the front door in the first place, so that's the problem as we need people.`
41.69	44.21	Sophie	`Well yeah, when you have like these thing they'll lure people in you know.`
42.31	42.76	Jim	`Yup.`
43.8	49.34	Jim	`Exactly. People hear about this from their friends that work here or we could have like you know um <ST/>`
44.42	45.06	Bert	`Mmm. Well.`
46.05	46.6	Bert	`OK.`
49.44	54.48	Jim	`<ST/> a page in one of the social networks showing a life at our company and how great that is.`

Experimental setup

The following images show the setup for recording the NOTSOFAR meetings dataset.

Experimental setup overview


Front of room devices	Tabletop devices

(a) Jabra PanaCast 50 (b) Poly Studio (c) Logitech MeetUp	(d) Yealink SmartVision 60 (e) Yealink Intelligent Speaker (f) Sennheiser TeamConnect Intelligent Speaker (g) EPOS Expand Capture 5 (h) Loudspeaker for time-alignment signal

Tabletop items

Laptops and personal items next to the microphones contribute to the authentic acoustic properties of the recordings.

Close talk recorders

Mobile voice recorder, head mounted microphone and throat microphone.

Sample meeting images

Directory structure

The audio data and the transcriptions follow this directory structure:

240130.1_train
└───MTG
    ├───MTG_30830
    │   │   devices.json
    │   │   gt_meeting_metadata.json (*)
    │   │   gt_transcription.json (*)
    │   │
    │   ├───close_talk (*)
    │   │       CT_21.wav
    │   │       CT_22.wav
    │   │       CT_23.wav
    │   │       CT_25.wav
    │   │
    │   ├───mc_plaza_0
    │   │       ch0.wav ... ch6.wav
    │   │
    │   ├───mc_rockfall_0
    │   │       ch0.wav ... ch6.wav
    │   │
    │   ├───mc_rockfall_1
    │   │       ch0.wav ... ch6.wav
    │   │
    │   ├───mc_rockfall_2
    │   │       ch0.wav ... ch6.wav
    │   │
    │   ├───sc_meetup_0
    │   │       ch0.wav
    │   │
    │   ├───sc_plaza_0
    │   │       ch0.wav
    │   │
    │   ├───sc_rockfall_0
    │   │       ch0.wav
    │   │
    │   ├───sc_rockfall_1
    │   │       ch0.wav
    │   │
    │   └───sc_rockfall_2
    │           ch0.wav
    │
    ├───MTG_30831
    :
    ├───MTG_30832
    :    

240130.1_dev
└───MTG
    ├───MTG_30830
    :    

(*) Ground truth and metadata are only published for the Training Set meetings initially.
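
The layout above can be walked with a few lines of Python. This is only a sketch that assumes the directory names shown in the tree (`MTG`, `mc_*`, `sc_*`); the helper name `list_sessions` is hypothetical and not part of any NOTSOFAR tooling.

```python
from pathlib import Path


def list_sessions(split_root):
    """Yield (meeting_id, session_name, is_multi_channel) for each device directory.

    Assumes the layout <split_root>/MTG/<meeting_id>/<sc_*|mc_*>/chN.wav.
    """
    mtg_root = Path(split_root) / "MTG"
    for meeting_dir in sorted(p for p in mtg_root.iterdir() if p.is_dir()):
        for session_dir in sorted(p for p in meeting_dir.iterdir() if p.is_dir()):
            name = session_dir.name
            if name.startswith("mc_"):
                yield meeting_dir.name, name, True
            elif name.startswith("sc_"):
                yield meeting_dir.name, name, False
            # close_talk (and anything else) is skipped: it is not a far-field session
```

For example, `list(list_sessions("240130.1_train"))` would return one tuple per meeting and device pair, which corresponds to the "session" unit used in the table above.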

`MTG_30830`

A directory containing the data for this meeting.

`devices.json`

A file that lists the audio devices for this meeting. For example:

[
    {
        "device_name":"CT_21",
        "is_mc":false,
        "is_close_talk":true,
        "channels_num":1,
        "wav_file_names":"close_talk/CT_21.wav"
    },
    {
        "device_name":"plaza_0",
        "is_mc":true,
        "is_close_talk":false,
        "channels_num":7,
        "wav_file_names":"mc_plaza_0/ch0.wav,mc_plaza_0/ch1.wav, ..."
    }
]

device_name is the abbreviated alias device name. See the Recording devices table for the commercial name of each device.
is_mc indicates multi-channel devices.
is_close_talk indicates the close-talk devices. Each device records a single speaker’s head-mounted microphone.
channels_num is the number of audio channels.
wav_file_names is a comma-delimited list of relative paths to this device’s audio files.
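
As an illustration of the fields above, the following sketch loads `devices.json` and splits `wav_file_names` into individual paths. The function names `load_devices` and `far_field_devices` and the added `wav_paths` key are hypothetical; only the JSON field names come from the example.

```python
import json
from pathlib import Path


def load_devices(meeting_dir):
    """Return the device entries with wav paths resolved against meeting_dir."""
    meeting_dir = Path(meeting_dir)
    with open(meeting_dir / "devices.json", encoding="utf-8") as f:
        devices = json.load(f)
    for dev in devices:
        # wav_file_names is a comma-delimited string of relative paths
        rel_paths = [p.strip() for p in dev["wav_file_names"].split(",") if p.strip()]
        dev["wav_paths"] = [meeting_dir / p for p in rel_paths]  # hypothetical extra key
    return devices


def far_field_devices(devices, multi_channel):
    """Select non-close-talk devices of one kind (single- or multi-channel)."""
    return [d for d in devices
            if not d["is_close_talk"] and d["is_mc"] == multi_channel]
```

Because some device recordings may be missing from a meeting, reading the device list from this file is likely safer than assuming a fixed set of directories.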

`gt_meeting_metadata.json` (* Training set only)

A file that contains metadata for this meeting. For example:

{
    "meeting_id": "MTG_30830",
    "MtgType": "mtg",
    "MeetingDurationSec": 362.082625,
    "Room": "ROOM_10002",
    "NumParticipants": 4,
    "ParticipantAliases": [
        "Peter",
        "Sophie",
        "Olivia",
        "Jim"
    ],
    "Topic": "Should AI be used in schools?",
    "Hashtags": "#NaturalMeeting",
    "ParticipantAliasToCtDevice": {
        "Peter": "CT_21",
        "Sophie": "CT_22",
        "Olivia": "CT_23",
        "Jim": "CT_25"
    }
}

Room is the meeting room ID. See the Meeting rooms table for more information about each room.
ParticipantAliases are the meeting participant IDs. Participants are identified by first-name aliases that are not their real names. The participants used these aliases to address each other during meetings. Each participant is identified by the same alias across all their meetings. The participant aliases are globally unique across all splits.
Topic is a manually-labeled topic of the meeting.
Hashtags is a list of attributes for the meeting. See Metadata for details on each Hashtag.
ParticipantAliasToCtDevice maps participants to their close-talk audio devices, i.e. their personal head-mounted microphones.
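
To show how the metadata and device files relate, the sketch below joins `ParticipantAliasToCtDevice` with the close-talk entries of `devices.json`. The function name is hypothetical, and the inputs are assumed to be the already-parsed JSON objects from the two examples.

```python
def close_talk_wav_by_participant(metadata, devices):
    """Map each participant alias to the wav path of their close-talk device.

    Returns None for a participant whose close-talk device is not listed.
    """
    wav_by_device = {
        d["device_name"]: d["wav_file_names"]
        for d in devices
        if d["is_close_talk"]
    }
    return {
        alias: wav_by_device.get(ct_device)
        for alias, ct_device in metadata["ParticipantAliasToCtDevice"].items()
    }
```

With the example files, "Peter" would map to `close_talk/CT_21.wav`.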

`gt_transcription.json` (* Training set only)

A file containing the ground truth (GT) transcriptions. For example:

[
    {
        "speaker_id":"Peter",
        "ct_wav_file_name":"close_talk/CT_21.wav",
        "start_time":15.42,
        "end_time":18.53,
        "text":"I think schools are about preparing children for <ST/>",
        "word_timing":[
            [
                "i",
                15.47,
                15.68
            ],
            [
                "think",
                15.68,
                16.12
            ],
            ...
        ]
    },
    ...
]
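
A minimal sketch of working with this structure, assuming each entry has the fields shown in the example and that `word_timing` items are `[word, start, end]` triples. Both function names are hypothetical.

```python
from collections import defaultdict


def speech_seconds_per_speaker(segments):
    """Sum segment durations (end_time - start_time) for each speaker_id."""
    totals = defaultdict(float)
    for seg in segments:
        totals[seg["speaker_id"]] += seg["end_time"] - seg["start_time"]
    return dict(totals)


def flatten_words(segments):
    """Return (speaker_id, word, start, end) tuples sorted by word start time."""
    words = [
        (seg["speaker_id"], word, start, end)
        for seg in segments
        for word, start, end in seg["word_timing"]
    ]
    return sorted(words, key=lambda item: item[2])
```

Here `segments` is the list obtained by parsing `gt_transcription.json` with `json.load`. Segment durations include any pauses inside the segment, so the per-speaker totals are an upper bound on actual voiced time.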

Tags in GT transcriptions

Tag	Definition
`<PName/>`	Personal name
`<BA/>`	Blank audio - an utterance with no word contents in the audio
`<FILL/>`	Filled pause or false start
`<FILLlaugh/>`	Laughter
`<ST/>`	Sentence truncation
`<UNKNOWN/>`	The audio is not intelligible
`<PAUSE/>`	There is a long pause in speech, often with a significant non-speech sound (like breathing)
`<ISSUE/>`	The audio is marked for future review
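
When comparing GT text against system output, these tags usually need to be removed first. The following is one possible sketch, not the challenge's official text normalization: it simply drops every self-closing tag of the form listed above and collapses whitespace. Whether some tags (for example `<UNKNOWN/>`) should instead exclude a segment from scoring is not specified here.

```python
import re

# Matches self-closing tags such as <ST/> or <FILLlaugh/>.
TAG_PATTERN = re.compile(r"<[A-Za-z]+/>")


def strip_tags(text):
    """Remove annotation tags and normalize runs of whitespace to single spaces."""
    without_tags = TAG_PATTERN.sub(" ", text)
    return " ".join(without_tags.split())
```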

`close_talk` (* Training set only)

A directory containing the close-talk recordings for each participant. These recordings are published for the Training set and initially withheld from the Development and Evaluation sets.

`ch0.wav`, `ch1.wav`

Audio files, one file per channel.

Recording devices

The recording devices are identified in the data files by an abbreviated alias. The following table lists the commercial name of each recording device.

Device alias	Commercial name
`plaza_0`	Yealink SmartVision 60
`rockfall_0`	EPOS Expand Capture 5
`rockfall_1`	Yealink Intelligent Speaker
`rockfall_2`	Sennheiser TeamConnect Intelligent Speaker
`meetup_0`	Logitech MeetUp
`studio_0`	Poly Studio
`panacast_0`	Jabra PanaCast 50
`CT_20`, `CT_21`, …	Shure WH20 + TASCAM DR-40X Close Talk

Missing device recordings

A few device recordings are missing from some meetings due to recording errors. See devices.json files for the list of available devices in each meeting.

The multi-channel recording of plaza_0 is available for all meetings in all datasets, so we suggest using this device as a common reference for cross-meeting analyses.

Metadata

Meetings are labeled by their acoustic scenarios using hashtags in the file gt_meeting_metadata.json. The following table describes these hashtags.

All hashtags have a dominant effect throughout the meeting (e.g., #Coughing indicates frequent coughing), except when they specify a particular participant. For example, #TalkNearWhiteboard=ParticipantAlias means this specific participant talked near the whiteboard. Note that when computing WER over the entire meeting, the effect of talking near the whiteboard can be underrepresented since it is averaged with speakers who were “not near the whiteboard.”

Hashtag	Description
`#NaturalMeeting`	A meeting with no special instructions.
`#WalkAndTalk=ParticipantAlias`	A participant is pacing around the room while speaking.
`#TalkNearWhiteboard=ParticipantAlias`	A participant is speaking near the whiteboard, sometimes facing the board and sometimes facing the audience.
`#DebateOverlaps`	Participants often speak in overlap, as in a heated debate or argument.
`#YesOkHmm`	Frequent short utterances such as “Yes”, “Hmm”, “OK”, “Uhh” and other fillers are spoken in overlap with the main speaker.
`#TransientNoise=`	Type and volume {low, high} of transient noise. Noises may come from the door, rummaging in bags, placing things on table, typing, etc.
`#WatchSlides`	The participants are watching slides. They sometimes turn their faces towards the TV screen while speaking, and speak towards each other.
`#StandNearPlaza`	Participants stand near the device `plaza_0`.
`#MoveInSeat`	The participants move their heads or swivel in their seats while speaking.
`#StandSit`	The participants are frequently toggling between standing up and sitting while talking.
`#LowTalkers=ParticipantAlias`	A participant speaks in a soft tone with low volume.
`#FarLowTalker=ParticipantAlias`	Someone speaks in a soft tone with low volume, and sits far from the device `plaza_0`.
`#Crowded`	The participants sit shoulder-to-shoulder along one side of the table. This simulates the scenario where the meeting room is at full capacity.
`#LeaveAndJoin=ParticipantAlias`	Someone leaves the meeting room and rejoins later, sometimes to a different seat.
`#TurnsNoOverlap`	No speech overlap. Speakers take turns.
`#Coughing`	Frequent coughing.
`#Laughing`	Frequent laughing.
`#ShortTurnsOverlap`	Short utterances, with overlaps in turns.
`#ReadText`	Participants read text in turns with no overlaps.
`#LongMeeting=`	Meeting duration in {minutes}.
`#Music`	Music playing in the background (without vocals).
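
The `Hashtags` field can be split into tag names and optional values with a small parser. This is a sketch under assumptions: the example metadata shows only a single hashtag, so the separator between multiple hashtags is not documented here, and the pattern below assumes values contain no whitespace, commas, or `#`. The function name is hypothetical.

```python
import re

# "#Name" optionally followed by "=Value"; the value stops at whitespace, a comma, or "#".
HASHTAG_PATTERN = re.compile(r"#(\w+)(?:=([^\s,#]+))?")


def parse_hashtags(hashtags):
    """Return a list of (name, value) pairs; value is None when the tag has none."""
    return [(m.group(1), m.group(2)) for m in HASHTAG_PATTERN.finditer(hashtags)]
```

A participant-specific tag such as `#WalkAndTalk=Jim` would yield `("WalkAndTalk", "Jim")`, which can then be matched against `ParticipantAliases`.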

Meeting participants

Meeting participants are native or near-native English speakers.

Each participant joined multiple meetings in the dataset. The Evaluation set participants did not take part in any Development set or Training set meetings, so they form a fully disjoint set of people. In contrast, the Development set and the Training set meetings share most of the same participants. Users of this dataset should be mindful of this and avoid creating models that overfit to specific participants.

Meeting rooms

The following table describes the Training set meeting rooms.

Room Id	Seating Capacity	Length (cm)	Width (cm)
ROOM_10001	10	520	490
ROOM_10002	11	600	410
ROOM_10006	16	710	600
ROOM_10007	13	640	360
ROOM_10009	10	520	440
ROOM_10012	25	740	590
ROOM_10020	11	590	370

Active speakers and speech overlap statistics

The plots below compare speech activity and overlap patterns between the meeting set splits: Development meetings, Batch 1 and Batch 2 of Training meetings, and Evaluation meetings.

The following plot shows the number of active speakers in 3-second windows. The plot indicates there is a similar distribution of speech activity patterns across the meeting set splits.

active speakers	dev	training1	training2	eval	dev_percentage	training1_percentage	training2_percentage	eval_percentage
0	304	265	2369	1199	1.15	0.98	2.76	1.42
1	6499	8278	30906	23352	24.65	30.77	36.04	27.65
2	6378	8133	25591	25156	24.19	30.23	29.84	29.79
3	5804	5660	16642	20024	22.01	21.04	19.4	23.71
4	4463	3205	7456	9580	16.93	11.91	8.69	11.34
5	2287	1087	2268	3957	8.67	4.04	2.64	4.69
6	576	257	441	996	2.18	0.96	0.51	1.18
7	55	21	91	166	0.21	0.08	0.11	0.2
8	0	0	1	15	0.0	0.0	0.0	0.02
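
A statistic of this kind can be approximated from the GT transcription segments. The sketch below is not the procedure used to produce the numbers above, which is not specified here. It assumes non-overlapping windows and counts a speaker as active in a window if any of their segments intersects it. The function name and parameters are hypothetical.

```python
from collections import Counter


def active_speaker_histogram(segments, meeting_duration_sec, window_sec=3.0):
    """Count windows by the number of distinct speakers active in each window."""
    counts = Counter()
    num_windows = int(meeting_duration_sec // window_sec)
    for i in range(num_windows):
        win_start = i * window_sec
        win_end = win_start + window_sec
        # A segment intersects the window if it starts before the window ends
        # and ends after the window starts.
        speakers = {
            seg["speaker_id"]
            for seg in segments
            if seg["start_time"] < win_end and seg["end_time"] > win_start
        }
        counts[len(speakers)] += 1
    return counts
```

Dividing each count by the total number of windows gives percentages comparable to the `*_percentage` columns. For instance, in the dev column 304 of 26366 windows have no active speaker, which is about 1.15%.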

The following plots compare the distribution of speech overlaps across the meeting set splits, with one panel each for Dev, Training1, Training2, and Eval.

Simulated training dataset

The simulated training dataset consists of about 1000 hours of audio simulated with the same microphone-array geometry as the multi-channel devices in the NOTSOFAR meeting dataset.

Real-room acoustic transfer functions (ATFs)

The dataset features ATFs recorded in actual conference rooms, in an acoustic setup arranged to closely replicate authentic meeting environments. We collected a total of 15,000 real ATFs, measured in various positions and rooms by multiple devices sharing the same geometry.

The ATFs were reconstructed from chirps emitted by a mouth simulator speaker. The histograms below show the distribution of mouth speaker locations relative to the microphone array.

ATFs histogram

Mixtures of up to three speakers

The simulated dataset provides speech mixtures along with their separated speech and noise components, which serve as supervision signals for training speech separation and enhancement models.

The plots below show examples of simulated audio mixtures and their isolated components. Each plot shows a 3 second segment of mixed speech of two or three speakers and noise. The X-axis units are 10-millisecond frames.

mix_0: The mixture signal.
activity: The ground truth (GT) speech activity scores for each speaker. Values:
- -1: Not speaking.
- 0: Borderline energy. Represents a transition region between speech and silence.
- 1: Speaking.
ee_spk0_0, ee_spk1_0: The direct path and early echoes component for each speaker.
rev_spk0_0, rev_spk1_0: The reverberation component for each speaker.
noise_0: The noise component of the segment.

The GT (ground truth) components sum up to the mixture:
mixture = gt_spk_direct_early_echoes + gt_spk_reverb + gt_noise
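
The additive relation can be checked sample by sample. The sketch below reads the two speaker terms as sums over all speakers in the segment, which is an assumption, and treats every signal as a plain list of samples from a single channel. The function name is hypothetical, and loading the actual files is left out.

```python
def max_reconstruction_error(mixture, speaker_direct_early, speaker_reverb, noise):
    """Largest absolute per-sample gap between the mixture and the sum of its components.

    speaker_direct_early and speaker_reverb each hold one list of samples per speaker.
    """
    error = 0.0
    for t, mix_sample in enumerate(mixture):
        total = noise[t]
        total += sum(spk[t] for spk in speaker_direct_early)
        total += sum(spk[t] for spk in speaker_reverb)
        error = max(error, abs(mix_sample - total))
    return error
```

A result close to zero (up to floating-point or quantization error) indicates that the components are consistent with the mixture.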

Simulated mixture example 1: Rapid speaker turns between two speakers

Simulated mixture example 2: Three speakers with speech overlaps

License

This public data is currently licensed for use exclusively in the NOTSOFAR challenge event. We appreciate your understanding that it is not yet available for academic or commercial use. However, we are actively working towards expanding its availability for these purposes. We anticipate a forthcoming announcement that will enable broader and more impactful use of this data. Stay tuned for updates. Thank you for your interest and patience.
