Software
Download and references
The baseline software is distributed via the LDC as LDC2017S24.
To refer to this software in a publication, please cite:
- Jon Barker, Ricard Marxer, Emmanuel Vincent, and Shinji Watanabe
The third 'CHiME' Speech Separation and Recognition Challenge: Dataset, task and baselines
In Proc. IEEE 2015 Automatic Speech Recognition and Understanding Workshop (ASRU), pp. 504-511, 2015.
- Jon Barker, Ricard Marxer, Emmanuel Vincent, and Shinji Watanabe
The third 'CHiME' Speech Separation and Recognition Challenge: Analysis and outcomes
Computer Speech and Language, vol. 46, pp. 605-626, 2017.
Overview
We provide three software tools for acoustic simulation, speech enhancement, and ASR. The acoustic simulation and speech enhancement tools are not intended as state-of-the-art implementations but as baselines which participants are encouraged to improve or replace. They are distributed only via the LDC. The ASR tool, by contrast, provides state-of-the-art performance. It is also available as a recipe in the official Kaldi distribution. Note, however, that this recipe differs from the original one distributed by the LDC.
Acoustic simulation baseline
The acoustic simulation baseline is distributed in the following directory of the CHiME3 package:
In a first step, the signals are represented in the complex-valued short-time Fourier transform (STFT) domain using half-overlapping sine windows of 256 samples. The time frames are partitioned into variable-length blocks such that the amount of speech is similar in each block. The blocks are half-overlapping and windowed by a sine window. The STFT-domain impulse responses between the close-talking microphone (considered as clean speech) and the other microphones are estimated in the least-squares sense in each frequency bin and each block [1]. These estimated responses are used to compute the signal-to-noise ratio (SNR) and, in the case of simulated development or test data, to estimate the noise signal by subtracting the convolved close-talking microphone signal from each distant microphone signal.
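To make this first step concrete, here is a minimal Python sketch (not the distributed baseline code) of per-bin least-squares filter estimation and SNR computation over a single block spanning the whole utterance; the function name, the number of filter taps, and the use of NumPy/SciPy are assumptions.

import numpy as np
from scipy.signal import stft

def estimate_filter_and_snr(close_talk, distant, fs=16000, nperseg=256, taps=4):
    # STFT with half-overlapping sine ('cosine') windows of 256 samples.
    _, _, C = stft(close_talk, fs=fs, window='cosine', nperseg=nperseg, noverlap=nperseg // 2)
    _, _, X = stft(distant, fs=fs, window='cosine', nperseg=nperseg, noverlap=nperseg // 2)
    n_freq, n_frames = C.shape
    speech_image = np.zeros_like(X)
    for f in range(n_freq):
        # Convolution matrix: column k holds the close-talk STFT delayed by k frames.
        A = np.zeros((n_frames, taps), dtype=complex)
        for k in range(taps):
            A[k:, k] = C[f, :n_frames - k]
        h, *_ = np.linalg.lstsq(A, X[f], rcond=None)  # least-squares filter for this bin
        speech_image[f] = A @ h                       # convolved close-talk signal
    noise = X - speech_image                          # residual used as the noise estimate
    snr_db = 10 * np.log10(np.sum(np.abs(speech_image) ** 2) / np.sum(np.abs(noise) ** 2))
    return speech_image, noise, snr_db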
In a second step, the signals are represented in the complex-valued STFT domain using half-overlapping sine windows of 1024 samples. The spatial position of the speaker is tracked using SRP-PHAT (see below). The time-varying filter modeling the direct sound between the speaker and the microphones is then convolved with a clean speech signal and mixed with a noise signal. In the case of training data, the clean speech signal is taken from the ORG recordings and mixed with a separately recorded noise background; an equalization filter, estimated as the ratio of the average power spectrum of BTH data to the average power spectrum of ORG data, is applied. In the case of development and test data, the clean speech signal is taken from the BTH recordings and mixed with the original noisy recording from which the speech has been removed. In either case, the convolved speech signal is rescaled such that the SNR matches that of the original recording.
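As an illustration of the equalization and rescaling described above, the following Python sketch (not the distributed baseline code) computes a per-bin equalization gain from average power spectra and rescales the convolved speech so that its SNR matches a target value; all names are assumptions.

import numpy as np

def equalization_gain(P_bth_avg, P_org_avg, floor=1e-12):
    # Magnitude gain per frequency bin: sqrt of the ratio of the average power spectrum
    # of BTH data to that of ORG data.
    return np.sqrt(P_bth_avg / np.maximum(P_org_avg, floor))

def rescale_to_target_snr(speech_stft, noise_stft, target_snr_db):
    # Scale the convolved speech STFT so that 10*log10(P_speech / P_noise) = target_snr_db.
    p_speech = np.mean(np.abs(speech_stft) ** 2)
    p_noise = np.mean(np.abs(noise_stft) ** 2)
    current_snr_db = 10 * np.log10(p_speech / p_noise)
    gain = 10 ** ((target_snr_db - current_snr_db) / 20)  # amplitude gain
    return gain * speech_stft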
This baseline does not address the simulation of microphone mismatches, microphone failures, early echoes, reverberation, etc. You are encouraged to address these limitations in order to get the most out of the simulated data.
[1] Emmanuel Vincent, Remi Gribonval, Mark Plumbley, Oracle estimators for the benchmarking of source separation algorithms, Signal Processing, 87(8):1933-1950, 2007.
Speech enhancement baseline
The speech enhancement baseline is distributed in the following directory of the CHiME3 package:
The signals are represented in the complex-valued STFT domain using half-overlapping sine windows of 1024 samples. In a first step, the spatial position of the target speaker in each time frame is encoded by a nonlinear SRP-PHAT pseudo-spectrum [2], which was found to perform best among a variety of source localization techniques [3]. The peaks of the SRP-PHAT pseudo-spectrum are then tracked over time using the Viterbi algorithm. The transition probabilities between successive speaker positions are inversely related to their distance and to the distance to the center of the microphone array.
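For illustration, the Python sketch below computes a plain SRP-PHAT pseudo-spectrum for a single STFT frame over a grid of candidate source positions; the nonlinearity of [2] and the Viterbi tracking are omitted, and the argument names and shapes are assumptions rather than the distributed baseline's interface.

import numpy as np

def srp_phat_frame(X_frame, delays, freqs):
    # X_frame: (n_mics, n_freq) complex STFT of one frame.
    # delays:  (n_candidates, n_mics) propagation delays in seconds to each microphone.
    # freqs:   (n_freq,) frequency of each bin in Hz.
    # Returns: (n_candidates,) SRP-PHAT pseudo-spectrum.
    n_mics = X_frame.shape[0]
    scores = np.zeros(delays.shape[0])
    for i in range(n_mics):
        for j in range(i + 1, n_mics):
            cross = X_frame[i] * np.conj(X_frame[j])
            phat = cross / np.maximum(np.abs(cross), 1e-12)        # phase transform weighting
            tdoa = delays[:, i] - delays[:, j]                     # per-candidate TDOA for this pair
            steering = np.exp(2j * np.pi * np.outer(tdoa, freqs))  # (n_candidates, n_freq)
            scores += np.real(steering @ phat)                     # sum of GCC-PHAT over bins
    return scores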
The multichannel covariance matrix of noise is estimated from 400 ms to 800 ms of context immediately before the test utterance. The speech signal is then estimated by time-varying minimum variance distortionless response (MVDR) beamforming with diagonal loading [4], taking possible microphone failures into account. The full 5 s of allowed context are not used here, since they often contain some unannotated speech, which would result in cancellation of the target. The relevant noise context is therefore often much shorter.
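As a minimal illustration of this step, here is a Python sketch of MVDR beamforming with diagonal loading, using a noise covariance matrix estimated from a preceding noise-only context. It is not the distributed baseline code: the steering vectors are assumed given and fixed over the utterance (rather than time-varying), microphone-failure handling is omitted, and all names are assumptions.

import numpy as np

def mvdr_diagonal_loading(X, noise_context, steering, loading=1e-3):
    # X:             (n_mics, n_freq, n_frames) complex STFT of the noisy utterance.
    # noise_context: (n_mics, n_freq, n_ctx) complex STFT of the noise-only context
    #                (e.g., 400-800 ms immediately before the utterance).
    # steering:      (n_mics, n_freq) complex steering vectors toward the speaker position.
    # Returns:       (n_freq, n_frames) complex STFT of the beamformed speech estimate.
    n_mics, n_freq, n_frames = X.shape
    out = np.zeros((n_freq, n_frames), dtype=complex)
    for f in range(n_freq):
        N = noise_context[:, f, :]
        R = N @ N.conj().T / N.shape[1]                                 # noise covariance for this bin
        R = R + loading * np.trace(R).real / n_mics * np.eye(n_mics)    # diagonal loading
        d = steering[:, f]
        w = np.linalg.solve(R, d)                                       # R^-1 d
        w = w / (d.conj() @ w)                                          # enforce w^H d = 1 (distortionless)
        out[f] = w.conj() @ X[:, f, :]                                  # apply beamformer to all frames
    return out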
This baseline does not address the automatic detection of the relevant noise context, the modeling of early echoes and reverberation, spatial and spectral post-filtering, etc. You are encouraged to address these limitations in order to get the most out of enhancement.
[2] Benedikt Loesch and Bin Yang, Adaptive segmentation and separation of determined convolutive mixtures under dynamic conditions, in Proceedings of the 9th International Conference on Latent Variable Analysis and Signal Separation (LVA/ICA), pp. 41-48, 2010.
[3] Charles Blandin, Alexey Ozerov, and Emmanuel Vincent, Multi-source TDOA estimation in reverberant audio using angular spectra and clustering, Signal Processing, 92(8):1950-1960, 2012.
[4] Xavier Mestre and Miguel Á. Lagunas, On diagonal loading for minimum variance beamformers, in Proceedings of the IEEE International Symposium on Signal Processing and Information Technology (ISSPIT), pp. 459-462, 2003.
ASR baseline
The ASR baseline is distributed in the following directory of the CHiME3 package:
NOTE 1: In the future, these two versions will differ, since the version in the Kaldi repository can be changed by anyone. The package version will be used to score the baseline, while the Kaldi version will provide up-to-date, state-of-the-art results.
NOTE 2: The results of the ASR baseline can differ from run to run and from machine to machine due to random initialisation and to machine-specific issues. The difference can be up to a few tenths of a percent absolute for small WERs and up to several percent absolute for large WERs.
The main script (run.sh) includes
- local/run_init.sh: initial script for data preprocessing, language and lexicon model building, and ASR baseline construction using clean and multi condition data
- local/run_gmm.sh: training and evaluation on enhanced speech data using Gaussian Mixture Models (GMMs)
- local/run_dnn.sh: training and evaluation on enhanced speech data using Deep Neural Networks (DNNs)
The GMM baseline includes standard triphone-based acoustic models with various feature transformations, including linear discriminant analysis (LDA), maximum likelihood linear transformation (MLLT), and feature-space maximum likelihood linear regression (fMLLR) with speaker adaptive training (SAT). The effectiveness of these feature transformation techniques for distant-talking speech recognition was shown in [5]. This baseline is designed to evaluate the ASR performance of the enhanced data quickly. Therefore, advanced processing with a heavy computational cost (e.g., discriminative training) is not included.
The DNN baseline provides state-of-the-art ASR performance. It is based on the Kaldi recipe for Track 2 of the 2nd CHiME Challenge [6]. The DNN is trained using the standard procedure (pre-training using restricted Boltzmann machines, cross-entropy training, and sequence-discriminative training). This baseline requires substantial computational resources (GPUs for DNN training and many CPUs for lattice generation).
[5] Yuuki Tachioka, Shinji Watanabe, Jonathan Le Roux, and John R. Hershey, Discriminative methods for noise robust speech recognition: A CHiME Challenge Benchmark, in Proceedings of the 2nd International Workshop on Machine Listening in Multisource Environments (CHiME), pp. 19-24, 2013.
[6] Chao Weng, Dong Yu, Shinji Watanabe, and Biing-Hwang (Fred) Juang, Recurrent deep neural networks for robust speech recognition, in Proceedings of the 39th International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp. 5569-5572, 2014.
Quick instructions for the ASR baseline (package version)
- Download and compile Kaldi (we tested our script with SVN revision 4710):
svn co -r 4710 https://svn.code.sf.net/p/kaldi/code/trunk kaldi-trunk-r4710
cd kaldi-trunk-r4710/tools
make
cd ../src
./configure
make depend
make
- Move to the CHiME3 ASR baseline directory:
cd <your CHiME3 directory>/CHiME3/tools/ASR
- Specify the Kaldi root in CHiME3/tools/ASR/path.sh, e.g.:
export KALDI_ROOT=<your kaldi trunk>
- Execute run.sh:
./run.sh
We suggest using the following command to save the main log file:
nohup ./run.sh > run.log
- If you have your own enhanced speech data for the training and test sets, you can evaluate the performance of the GMM and DNN systems with the following commands (you do not have to execute local/run_init.sh again):
local/run_gmm.sh <enhancement method> <enhanced speech directory>
local/run_dnn.sh <enhancement method> <enhanced speech directory>
- You can find the resulting word error rates (WERs) in the following files, where enhan=<enhancement method>:
GMM clean training: exp/tri3b_tr05_orig_clean/best_wer_$enhan.result
GMM multi training: exp/tri3b_tr05_multi_$enhan/best_wer_$enhan.result
DNN multi training: exp/tri4a_dnn_tr05_multi_${enhan}_smbr_i1lats/best_wer_${enhan}.result
Quick instructions for the ASR baseline (Kaldi version)
- Download the latest Kaldi and compile it as above:
svn co https://svn.code.sf.net/p/kaldi/code/trunk kaldi-trunk
- Specify the Kaldi root in path.sh, e.g.:
export KALDI_ROOT=`pwd`/../../..
- Execute run.sh:
./run.sh
We suggest using the following command to save the main log file:
nohup ./run.sh > run.log
- If you have your own enhanced speech data for the training and test sets, you can evaluate the performance of the GMM and DNN systems with the following commands (you do not have to execute local/run_init.sh again):
local/run_gmm.sh <enhancement method> <enhanced speech directory>
local/run_dnn.sh <enhancement method> <enhanced speech directory>
- You can find the resulting WERs in the following files, where enhan=<enhancement method>:
GMM clean training: exp/tri3b_tr05_orig_clean/best_wer_$enhan.result
GMM multi training: exp/tri3b_tr05_multi_$enhan/best_wer_$enhan.result
DNN multi training: exp/tri4a_dnn_tr05_multi_${enhan}_smbr_i1lats/best_wer_${enhan}.result