Despite tremendous progress in close-microphone automatic speech recognition (ASR) for broadcast news, telephone speech or meeting speech, robust distant-microphone ASR in real-world environments remains a very challenging problem. The ease with which humans address it belies its many difficulties, chief among which are the reverberation of the target speech source and the presence of highly dynamic background noise made up of multiple sound sources. Solutions to this problem require tighter collaboration between the Audio and Acoustic Signal Processing (AASP), Speech and Language Processing (SLP) and Machine Learning for Signal Processing (MLSP) communities.
One year ago, we organised the 2011 PASCAL CHiME Speech Separation and Recognition Challenge as an attempt to bridge the gap between these communities. The task consisted of recognising keywords within speech utterances binaurally mixed with real-world domestic background noise. The utterances were taken from the Grid corpus, a corpus of speaker-dependent read speech with a small vocabulary and fixed syntax. This choice was meant to foster novel AASP approaches and novel interfaces between AASP, SLP and MLSP, as opposed to pure SLP solutions to the more traditional problems of channel robustness, stationary noise, speech naturalness and large vocabulary assessed by the NIST Challenges. Thanks to the availability of a baseline ASR system and evaluation software, the challenge attracted 13 submissions, including some from AASP researchers without any prior experience in ASR. It culminated in a highly successful satellite workshop at Interspeech 2011 with more than 70 attendees.
The New Challenge
The feedback received from participants in this workshop pointed to a number of directions for the design of such future evaluations. The new IEEE AASP 'CHiME' Speech Separation and Recognition Challenge aims to build upon the success of the previous CHiME Challenge by increasing the difficulty of the task along two directions, namely vocabulary size and mixing condition. To avoid too large an increase in difficulty and to retain participation from groups without access to the necessarily large vocabulary corpus, vocabulary size and mixing condition will form two separate tracks.