Data Collection
Topics and Speakers
- Conversation Topics: Everyday life & hobbies, work & school, hypotheticals, entertainment, news, and personal stories.
- Participants per Session: 2 to 8 speakers, divided into groups of 2–4 participants.
Recording Devices
- GoPro Max 360 (4K resolution)
- Smartphones (720p resolution)
- Lapel Microphones (connected via adapter to smartphones)
Layout and Setup
- Recording Environments: Data was collected across approximately 10 different rooms of varying sizes and types, including living rooms, meeting rooms, lecture halls, and other indoor spaces.
- Seating Arrangement: All speakers sit around a table; distances vary by table size (around 2 to 5 meters).
- Smartphone Placement: Each speaker has a smartphone in front (selfie mode) with a lapel mic clipped near the mouth.
- 360° Capture: A GoPro Max mounted at the center of the table captures all participants.
- Session Markers: Moderator signals start and end with a distinctive whistle.
Annotation
Signal Alignment
To synchronize recordings from multiple devices, we use the moderator’s whistle cue in a two-step process:
- Manual Annotation: Listen to each audio track and mark the start/end regions containing the whistle.
- Automatic Detection: Compute the spectral-flux onset strength envelope with librosa.onset.onset_strength, then identify the timestamp of the highest peak to pinpoint the exact whistle moment.
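The automatic step can be illustrated with a minimal sketch. It is not the pipeline used for the dataset: to stay runnable without audio files or librosa, it substitutes a simplified NumPy spectral-flux envelope for librosa.onset.onset_strength (which by default works on a mel spectrogram). The function names, the frame and hop sizes, and the offset helper are all assumptions made for illustration.

```python
import numpy as np


def spectral_flux_envelope(y, frame_length=1024, hop_length=512):
    """Half-wave-rectified spectral flux between consecutive STFT frames.

    Simplified stand-in for librosa.onset.onset_strength (an assumption).
    """
    if len(y) < frame_length + hop_length:
        raise ValueError("region is too short for at least two frames")
    n_frames = 1 + (len(y) - frame_length) // hop_length
    window = np.hanning(frame_length)
    frames = np.stack([
        y[i * hop_length: i * hop_length + frame_length] * window
        for i in range(n_frames)
    ])
    log_mag = np.log1p(np.abs(np.fft.rfft(frames, axis=1)))
    # Keep only increases: an onset raises the spectrum, a decay is ignored.
    return np.maximum(np.diff(log_mag, axis=0), 0.0).sum(axis=1)


def locate_whistle(y, sr, region_start, region_end,
                   frame_length=1024, hop_length=512):
    """Return the time (s) of the strongest onset inside a hand-marked region."""
    segment = y[int(region_start * sr): int(region_end * sr)]
    flux = spectral_flux_envelope(segment, frame_length, hop_length)
    peak = int(np.argmax(flux))
    # flux[peak] measures the jump from frame `peak` to frame `peak + 1`,
    # so the onset is reported at the start of frame `peak + 1`.
    return region_start + (peak + 1) * hop_length / sr


def offsets_to_reference(whistle_times, reference):
    """Per-device whistle time minus the reference device's whistle time."""
    ref = whistle_times[reference]
    return {name: t - ref for name, t in whistle_times.items()}
```

Subtracting a device's offset from its timestamps maps them onto the reference device's timeline. The resolution of the estimate is limited by the hop size (512 samples here), so a smaller hop could be used if tighter alignment is needed.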
Transcription Workflow
High-quality audio from smartphones and lapel mics is used for transcription.
- Automatic Transcript: Run each clip through the Whisper-large-v2 model.
- Post-Editing: Annotators use Label Studio to:
  - Adjust segment boundaries to isolate the target speaker’s speech.
  - Correct transcript text for accuracy.
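As a hedged illustration of how the automatic transcript might be handed to annotators, the sketch below converts Whisper-style segments (dicts with "start", "end", and "text") into a Label Studio task with pre-annotations. The source does not describe this conversion. The control names labels, transcription, and audio, and the label value "Speech", are hypothetical and would have to match the project's actual labeling configuration.

```python
import json


def whisper_to_label_studio(audio_url, segments, model_version="whisper-large-v2"):
    """Build one Label Studio task with a region and a transcript per segment.

    Assumes (hypothetically) a labeling config with a Labels control named
    "labels" and a per-region TextArea named "transcription", both attached
    to an Audio object named "audio".
    """
    results = []
    for i, seg in enumerate(segments):
        region_id = f"seg{i}"  # a shared id links the text to its region
        span = {"start": float(seg["start"]), "end": float(seg["end"])}
        results.append({
            "id": region_id,
            "from_name": "labels",
            "to_name": "audio",
            "type": "labels",
            "value": {**span, "labels": ["Speech"]},
        })
        results.append({
            "id": region_id,
            "from_name": "transcription",
            "to_name": "audio",
            "type": "textarea",
            "value": {**span, "text": [seg["text"].strip()]},
        })
    return {
        "data": {"audio": audio_url},
        "predictions": [{"model_version": model_version, "result": results}],
    }


if __name__ == "__main__":
    demo = [{"start": 0.0, "end": 1.2, "text": " hello there"}]
    print(json.dumps(whisper_to_label_studio("clip.wav", demo), indent=2))
```

Annotators would then adjust the region boundaries and edit the text in place rather than starting from an empty task.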
360° Video Processing
Frame Padding for Horizon Artifacts
- When a speaker straddles the image boundary, their face can be split across the left and right edges of the frame.
- We resolve this by padding each frame: the leftmost 20% of the frame is appended to the right side, creating a continuous panorama.
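The padding step can be sketched in a few lines of NumPy. This assumes each frame is an equirectangular image stored as a height × width × channels array, so that the left and right edges are adjacent in the scene; the function name and the pad_ratio parameter are illustrative, not taken from the source.

```python
import numpy as np


def pad_panorama(frame, pad_ratio=0.20):
    """Append the leftmost `pad_ratio` of the columns to the right edge.

    frame: array of shape (height, width, channels).
    """
    width = frame.shape[1]
    pad_w = int(round(width * pad_ratio))
    # Columns 0..pad_w-1 are repeated after the last column, so a face
    # cut by the seam appears whole at the right side of the result.
    return np.concatenate([frame, frame[:, :pad_w]], axis=1)
```

A frame of width W becomes a frame of width about 1.2 W, and any pixel at column x ≥ W in the padded frame corresponds to column x − W in the original.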
Face Recognition and Linking
- Face Detection: Run the padded 360° videos through a face-recognition pipeline to extract face crops.
- Manual Association: Link each face crop from the 360° feed to the corresponding smartphone video of the same speaker.
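One side effect of running detection on padded frames is that a face near the left edge also appears a second time in the padded strip. The source does not say how such duplicates are handled; the sketch below shows one possible rule, under the assumption that detections arrive as (x1, y1, x2, y2) pixel boxes in padded-frame coordinates. The function name and box format are hypothetical.

```python
def filter_padded_boxes(boxes, width):
    """Drop detections that are copies created by the right-hand padding.

    boxes: iterable of (x1, y1, x2, y2) in padded-frame pixels.
    width: width of the original, unpadded frame.
    Returns (x1, y1, x2, y2, crosses_seam) tuples for the boxes kept.
    """
    kept = []
    for x1, y1, x2, y2 in boxes:
        if x1 >= width:
            # Lies wholly inside the padded strip: the same face was
            # already detected near the left edge of the original frame.
            continue
        # A box that starts in the original frame and ends in the strip
        # is a face that straddles the seam; keep it and flag it.
        kept.append((x1, y1, x2, y2, x2 > width))
    return kept
```

For example, with a 1000-pixel-wide original frame, a box starting at x = 1050 is a copy of the box starting at x = 50 and is dropped, while a box spanning x = 980 to 1040 is kept and flagged as crossing the seam.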