Call for Papers: Special Issue of Computer Speech & Language
Multi-Speaker, Multi-Microphone, and Multi-Modal Distant Speech Recognition
Automatic speech recognition (ASR) has progressed significantly in the single-speaker scenario, owing to extensive training data, sophisticated deep learning architectures, and abundant computing resources. Building on this success, the research community is now tackling real-world multi-speaker speech recognition, where the number and nature of the sound sources are unknown and change over time. In this scenario, refining core multi-speaker speech processing technologies such as speech separation, speaker diarization, and robust speech recognition is essential, and the effective integration of these components becomes increasingly important. In addition, emerging approaches, such as end-to-end neural networks, speech foundation models, and advanced training methods (e.g., semi-supervised, self-supervised, and unsupervised training), combined with multi-microphone and multi-modal information (such as video and accelerometer data), offer promising avenues for addressing these challenges. This special issue gathers recent advances in multi-speaker, multi-microphone, and multi-modal speech processing toward establishing real-world conversational speech recognition.
Guest editors:
- Shinji Watanabe (CMU, Lead)
- Michael Mandel (Meta)
- Marc Delcroix (NTT)
- Leibny Paola Garcia Perera (JHU)
- Katerina Zmolikova (Meta)
- Samuele Cornell (CMU)
Special issue information:
Relevant research topics include (but are not limited to):
- Speaker identification and diarization
- Speaker localization and beamforming
- Single- or multi-microphone enhancement and source separation
- Robust features and feature transforms
- Robust acoustic and language modeling for distant or multi-talker ASR
- Traditional or end-to-end robust speech recognition
- Training schemes: data simulation and augmentation, semi-supervised, self-supervised, and unsupervised training for distant or multi-talker speech processing
- Pre-training and fine-tuning of speech and audio foundation models and their application to distant and multi-talker speech processing
- Robust speaker and language recognition
- Robust paralinguistics
- Cross-environment or cross-dataset performance analysis
- Environmental background noise modeling
- Multimodal speech processing
- Systems, resources, and tools for distant speech recognition
In addition to traditional research papers, the special issue also welcomes descriptions of successful conversational speech recognition systems where the contribution lies more in the implementation than in the techniques themselves, as well as successful applications of conversational speech recognition systems. For example, the recently concluded seventh and eighth CHiME challenges serve as a focus for discussion in this special issue. These challenges addressed conversational speech separation, speech recognition, and speaker diarization in everyday home environments from multi-microphone and multi-modal input. The seventh and eighth CHiME challenges comprised multiple tasks: 1) distant automatic speech recognition with multiple devices in diverse scenarios, 2) unsupervised domain adaptation for conversational speech enhancement, 3) distant diarization and ASR in natural conferencing environments, and 4) ASR for multimodal conversations in smart glasses. Papers reporting evaluation results on the CHiME-7/8 datasets or on other datasets dealing with real-world conversational speech recognition are equally welcome.
Tentative Dates:
- Submission Open Date: August 19, 2024
- Manuscript Submission Deadline: December 2, 2024
- Editorial Acceptance Deadline: September 1, 2025
Submissions
Contributed full papers must be submitted via the Computer Speech & Language online submission system (Editorial Manager®): https://www.editorialmanager.com/ycsla/default2.aspx. Please select the article type “VSI: Multi-DSR” when submitting the manuscript online.
Please refer to the Guide for Authors to prepare your manuscript: https://www.elsevier.com/journals/computer-speech-and-language/0885-2308/guide-for-authors
For further information, authors may contact the Guest Editors.
Keywords:
Speech recognition, speech enhancement/separation, speaker diarization, multi-speaker, multi-microphone, multi-modal, distant speech recognition, CHiME challenge