Office conferencing is one of the most common, yet most challenging, scenarios faced by meeting-intelligence systems today. In particular, distant (far-field) speech diarization and transcription are key to enabling a range of downstream co-pilot services and have crucial business and research implications. Unfortunately, even though artificial intelligence, foundation models, and computational resources are all evolving at an accelerated pace, reliable and practical solutions are still lacking.
We at Microsoft believe data can push new solutions beyond existing boundaries, and we therefore provide the scientific community with a database unmatched in volume and quality: a development set and a blind test set comprising 40 hours of real, natural English conference meetings, held in various rooms by multiple participants who wore close-talk microphones to enable near-perfect human annotation. To serve both multi-channel and single-channel approaches, multiple devices were used to record the data in each configuration.
The challenge focuses on building deep-learning-based diarization and ASR systems for far-field acoustic setups and on testing their acoustic robustness across various realistic data verticals, including overlapping speech (double-talk) between two or more speakers, speakers talking while moving, speech interruptions, occlusions such as talking near a whiteboard, transients, and environmental noises.
Our team will also provide a 500-hour training set recorded in matched conditions, baseline models, training and inference code, and an evaluation framework for deep-dive analysis based on our metadata.
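As a concrete illustration of the kind of metric such an evaluation framework rests on, the sketch below computes plain word error rate (WER) via word-level Levenshtein distance. This is only an assumption-laden illustration: the challenge's actual metric is likely a speaker-attributed variant of WER combined with diarization scoring, and the function name here is our own.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference word count.

    Illustrative only; the official evaluation framework may normalize text
    and attribute words to speakers before scoring.
    """
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming table for word-level Levenshtein distance.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deleting all i reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # inserting all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution (or match)
            )
    return d[len(ref)][len(hyp)] / max(len(ref), 1)


print(wer("the meeting starts now", "the meeting started now"))  # 0.25: one substitution over four words
```

In far-field multi-talker conditions, errors from overlapping speech and diarization mistakes typically dominate this count, which is why the verticals listed above are scored separately.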