Programme

A schedule overview is given below, followed by the detailed programme.

9:00 Welcome
9:10 Keynote 1: Steven J. Rennie (IBM T.J. Watson Research Center)
10:00 Break
10:20 Overview of the 2nd CHiME Challenge
10:50 Oral session 1: challenge papers
12:10 Lunch
13:40 Poster session
15:40 Break
16:00 Oral session 2: challenge-related paper
16:20 Keynote 2: Daniel P.W. Ellis (Columbia University)
17:10 Plenary discussion
17:50 Closing

Detailed Programme

Welcome

  • The 2nd International Workshop on Machine Listening in Multisource Environments
    Emmanuel Vincent (Inria, France), Jon Barker (University of Sheffield, UK), Shinji Watanabe, Jonathan Le Roux (MERL, USA), Francesco Nesta and Marco Matassoni (FBK-Irst, Italy)

Keynote 1

Session Chair: Jen-Tzung Chien (National Chiao Tung University, Taiwan)
  • Model-based speech separation and recognition: Yesterday, Today, and Tomorrow
    Steven J. Rennie (IBM T.J. Watson Research Center)
    Abstract: Recently, model-based approaches to multi-talker speech separation and recognition have demonstrated great success in highly constrained scenarios, and efficient algorithms for separating data with literally trillions of underlying states have been unveiled. In less constrained scenarios, deep neural networks (DNNs) trained on features inspired by human auditory processing have shown great capacity for directly learning masking functions from parallel data. Ideally, a robust speech separation/recognition system should continuously learn, adapt to, and exploit structure present in both target and peripheral signals and their interactions, make minimal assumptions about the data to be separated/recognized, not require parallel data streams, and have essentially unlimited information capacity. In this talk I'll briefly review the current state of robust speech separation/recognition technology: where we are, where we apparently need to go, and how we might get there. I'll then discuss in more detail recent work that I've been involved with that is aligned with these goals. Specifically, I will present some new results on efficiently learning the structure of models and efficiently optimizing a wide class of matrix-valued functions, recent work on Factorial Restricted Boltzmann Machines for robust ASR, and finally Direct-product DBNs, a new architecture that makes it feasible to learn DNNs with literally millions of neurons.

Overview of the 2nd CHiME Challenge

  • Datasets, tasks, baselines and results
    Emmanuel Vincent (Inria, France), Jon Barker (University of Sheffield, UK), Shinji Watanabe, Jonathan Le Roux (MERL, USA), Francesco Nesta and Marco Matassoni (FBK-Irst, Italy)

Oral session 1

Session Chair: Tomohiro Nakatani (NTT, Japan)

Poster session

Session Chair: Dorothea Kolossa (Ruhr-Universität Bochum, Germany)

Oral session 2

Session Chair: Kalle Palomäki (Aalto University, Finland)

Keynote 2

Session Chair: Richard Lyon (Google, USA)
  • Recognizing and Classifying Environmental Sounds
    Daniel P.W. Ellis (Columbia University)

    Abstract: Animal hearing exists to extract useful information from the environment, and for many animals, over much of the evolutionary history of hearing, this sound environment has consisted not of speech or music but of more generic acoustic information arising from collisions, motions, and other events in the external world. This aspect of sound analysis (getting information out of non-speech, non-music environmental sounds) is finally beginning to gain attention in research, since it holds promise as a tool for automatic search and retrieval of audio/video recordings, an increasingly urgent problem. I will discuss our recent work on using audio analysis to manage and search environmental sound archives (including personal audio lifelogs and consumer video collections), and illustrate it with some of the approaches that have worked more or less well, along with an effort to explain why.

    Bio: Dan Ellis is an Associate Professor of Electrical Engineering at Columbia University, where he leads the Laboratory for Recognition and Organization of Speech and Audio (LabROSA), concerned with extracting useful information from real-world sounds of all kinds. He received his bachelor's degree from Cambridge and his Ph.D. from the MIT Media Lab, and he was a postdoc at the International Computer Science Institute in Berkeley, where he remains an external fellow. He is the author of a number of widely used software tools, and he runs the AUDITORY email list of over 2000 researchers interested in the perception and cognition of sound.