Software

We provide three software baselines, for acoustic simulation, speech enhancement, and ASR (used to obtain the baseline challenge results), as well as an additional speech enhancement tool (which was not used to obtain the baseline challenge results, but which participants may improve and eventually substitute for the baseline). With the exception of the ASR baseline, these baselines and tools are not intended as state-of-the-art implementations but as starting points for improvement.

Acoustic simulation baseline

The acoustic simulation baseline is distributed in the following directory of the CHiME4 package:

CHiME4/tools/simulation

It can be used to estimate time-varying impulse responses and to subtract or add speech to the noisy recordings.

In a first step, the signals are represented in the complex-valued short-time Fourier transform (STFT) domain using half-overlapping sine windows of 256 samples. The time frames are partitioned into variable-length blocks such that the amount of speech is similar in each block. The blocks are half-overlapping and windowed by a sine window. The STFT-domain impulse responses between the close-talking microphone (considered as clean speech) and the other microphones are estimated in the least-squares sense in each frequency bin and each block [1]. This is used to estimate the signal-to-noise ratio (SNR) and, in the case of simulated development or test data, to estimate the noise signal by subtracting the convolved close-talking microphone signal.
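For illustration, the per-bin least-squares estimation can be sketched as follows (a minimal Python/NumPy sketch, not the actual baseline code; the array layout and the filter length n_taps are assumptions):

    import numpy as np

    def estimate_block_filter(C, X, n_taps=4):
        """Least-squares STFT-domain impulse response for one block.
        C: (n_freq, n_frames) complex STFT of the close-talking (clean) channel
        X: (n_freq, n_frames) complex STFT of one distant channel, same block
        n_taps: filter length in frames (an assumed value, not the baseline's)
        Returns H with shape (n_freq, n_taps)."""
        n_freq, n_frames = C.shape
        H = np.zeros((n_freq, n_taps), dtype=complex)
        for f in range(n_freq):
            # convolution matrix of delayed clean frames: A[t, k] = C[f, t - k]
            A = np.zeros((n_frames, n_taps), dtype=complex)
            for k in range(n_taps):
                A[k:, k] = C[f, :n_frames - k]
            # least-squares fit of X[f, :] as a filtered version of C[f, :]
            H[f], *_ = np.linalg.lstsq(A, X[f], rcond=None)
        return H

Convolving the clean STFT with the estimated filter and subtracting the result from the noisy STFT then yields a noise estimate, from which the block-wise SNR follows.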

In a second step, the signals are represented in the complex-valued STFT domain using half-overlapping sine windows of 1024 samples. The spatial position of the speaker is tracked using SRP-PHAT (see below). The time-varying filter modeling the direct sound between the speaker and the microphones is then convolved with a clean speech signal and mixed with a noise signal. In the case of training data, the clean speech signal is taken from the ORG recordings and it is mixed with a separately recorded noise background. An equalization filter is applied, estimated as the ratio of the average power spectrum of the BTH data to the average power spectrum of the ORG data. In the case of development and test data, the clean speech signal is taken from the BTH recordings and it is mixed with the original noisy recording from which speech has been taken out. In either case, the convolved speech signal is rescaled such that the SNR matches that of the original recording.
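The equalization and SNR-matching steps amount to simple power ratios. The following Python/NumPy sketch illustrates the idea (it is not the baseline implementation, and the function and variable names are illustrative):

    import numpy as np

    def equalization_gain(bth_stft, org_stft, eps=1e-12):
        """Per-frequency equalization gain for ORG-based training speech,
        estimated as the ratio of the average BTH and ORG power spectra.
        Inputs are (n_freq, n_frames) complex STFTs."""
        bth_power = np.mean(np.abs(bth_stft) ** 2, axis=1)
        org_power = np.mean(np.abs(org_stft) ** 2, axis=1)
        return np.sqrt(bth_power / (org_power + eps))  # amplitude gain per bin

    def rescale_to_snr(speech, noise, target_snr_db):
        """Scale the convolved speech so that the speech-to-noise power ratio
        matches target_snr_db (the SNR measured on the original recording)."""
        speech_pow = np.mean(np.abs(speech) ** 2)
        noise_pow = np.mean(np.abs(noise) ** 2)
        current_snr_db = 10 * np.log10(speech_pow / noise_pow)
        gain = 10 ** ((target_snr_db - current_snr_db) / 20)  # amplitude gain
        return gain * speech  # the simulated mixture is then speech + noise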

This baseline does not reproduce all properties of live recordings. For instance, it does not handle microphone mismatches, microphone failures, early echoes, reverberation, or the Lombard effect. This is known to yield overly optimistic enhancement performance for direction-of-arrival-based adaptive beamformers such as MVDR. You are encouraged to address these limitations in order to get the most out of the simulated data.

[1] Emmanuel Vincent, Rémi Gribonval, and Mark Plumbley, Oracle estimators for the benchmarking of source separation algorithms, Signal Processing, 87(8):1933-1950, 2007.

Enhancement and ASR baseline

The enhancement and ASR baseline for each track is distributed in the following directory of the CHiME4 package:
CHiME4/tools/ASR_1ch_track
CHiME4/tools/ASR_2ch_track
CHiME4/tools/ASR_6ch_track
The enhancement and ASR baseline for each track is also included in the Kaldi GitHub repository as
kaldi/egs/chime4/s5_1ch
kaldi/egs/chime4/s5_2ch
kaldi/egs/chime4/s5_6ch 
The CHiME4 package also includes the following acoustic and language models, which are produced by the above baseline scripts:
CHiME4/tools/ASR_models
├── data
│   ├── lang
│   ├── lang_test_5gkn_5k       # 5-gram KN language model
│   ├── lang_test_rnnlm_5k_h300 # RNN language model
│   ├── lang_test_tgpr_5k        # 3-gram KN language model
│   └── local
├── exp
│   ├── tri3b_tr05_multi_noisy                  # GMM acoustic model
│   ├── tri4a_dnn_tr05_multi_noisy              # DNN acoustic model
│   └── tri4a_dnn_tr05_multi_noisy_smbr_i1lats  # DNN-sMBR acoustic model
└── README

NOTE 1: In the future, these two versions (CHiME4 package and Kaldi GitHub) will diverge, since the version in the Kaldi GitHub repository can be modified by anyone. The package version will be used to score the baseline, while the Kaldi version will provide up-to-date, state-of-the-art results.

NOTE 2: The results of the ASR baseline can differ between runs and between machines due to random initialization and to machine-specific issues. The difference can be up to a few tenths of a percent absolute for small WERs and up to several percent absolute for large WERs.

NOTE 3: All baseline scripts (CHiME4/tools/ASR_{1,2,6}ch_track/run.sh) currently produce the same acoustic models (trained on noisy multi-condition data from channel 5) and the same language models by default. Of course, participants can provide separate acoustic (and language) models for each track.

The main script (run.sh) was developed based on the system described in [2] and provides a strong baseline. The script includes:

  1. Initial script:
    • local/run_init.sh: initial script for data preprocessing and for building the 3-gram language model and the lexicon.
  2. Enhancement script:
    • local/run_beamform_2ch_track.sh: weighted delay-and-sum beamforming based on the BeamformIt toolkit [3] for the 2ch track setup.
    • local/run_beamform_6ch_track.sh: weighted delay-and-sum beamforming based on the BeamformIt toolkit [3] for the 6ch track setup. The default setup excludes channel 2 from the beamformer input.
  3. GMM baseline script:
    • local/run_gmm.sh: training and recognition script using a Gaussian mixture model (GMM). The GMM is trained on noisy multi-condition data (channel 5). Recognition is performed on the enhanced speech data for the development and evaluation sets. The GMM baseline includes standard triphone-based acoustic models with various feature transformations, including linear discriminant analysis (LDA), maximum likelihood linear transformation (MLLT), and feature-space maximum likelihood linear regression (fMLLR) with speaker adaptive training (SAT). The effectiveness of these feature transformation techniques for distant-talking speech recognition was shown in [4]. This baseline is designed to evaluate the ASR performance of the enhanced data quickly; therefore, advanced processing with heavy computational cost (e.g., discriminative training) is not included.
    • local/run_gmm_recog.sh: subset of the GMM baseline script, which performs recognition of the enhanced development and evaluation data given the GMM model. It does not include the training part (which is included in local/run_gmm.sh).
  4. DNN baseline script:
    • local/run_dnn.sh: training and recognition script using a deep neural network (DNN). The DNN is trained on noisy multi-condition data (channel 5). Recognition is performed on the enhanced speech data for the development and evaluation sets. The DNN baseline provides state-of-the-art ASR performance. It is based on the Kaldi recipes for Track 2 of the 2nd CHiME Challenge and for the 3rd CHiME Challenge [5]. The DNN is trained using the standard procedure (pre-training using restricted Boltzmann machines, cross-entropy training, and sequence-discriminative training [6]). This baseline requires substantial computational resources (GPUs for the DNN training and many CPUs for lattice generation).
    • local/run_dnn_recog.sh: subset of the DNN baseline script, which performs recognition of the enhanced development and evaluation data given the DNN model. It does not include the training part (which is included in local/run_dnn.sh).
  5. Rescoring script:
    • local/run_lmrescore.sh: training and recognition script using a 5-gram KN language model and a recurrent neural network language model (RNNLM) [7]. The training procedure takes a long time (a few days). Recognition is performed by rescoring the lattices. If the language models are already set up appropriately, you can skip the training part by passing an option (local/run_lmrescore.sh --stage 3).
NOTE 4: The above local scripts are located in the CHiME4/tools/ASR_1ch_track/local (or kaldi/egs/chime4/s5_1ch/local) directory, and the other two tracks link to this directory via a symbolic link (i.e., ASR_{2,6}ch_track/local -> ASR_1ch_track/local).
[2] Takaaki Hori, Zhuo Chen, Hakan Erdogan, John R. Hershey, Jonathan Le Roux, Vikramjit Mitra, and Shinji Watanabe, The MERL/SRI system for the 3rd CHiME challenge using beamforming, robust feature extraction, and advanced speech recognition, in Proceedings of the IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), pp. 475-481, 2015.

[3] Xavier Anguera, Chuck Wooters, and Javier Hernando, Acoustic beamforming for speaker diarization of meetings, IEEE Transactions on Audio, Speech, and Language Processing, 15(7):2011-2023, 2007.

[4] Yuuki Tachioka, Shinji Watanabe, Jonathan Le Roux, and John R. Hershey, Discriminative methods for noise robust speech recognition: A CHiME Challenge Benchmark, in Proceedings of the 2nd International Workshop on Machine Listening in Multisource Environments (CHiME), pp. 19-24, 2013.

[5] Chao Weng, Dong Yu, Shinji Watanabe, and Biing-Hwang (Fred) Juang, Recurrent deep neural networks for robust speech recognition, in Proceedings of the 39th International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp. 5569-5572, 2014.

[6] Karel Veselý, Arnab Ghoshal, Lukáš Burget, and Daniel Povey, Sequence-discriminative training of deep neural networks, in Proceedings of INTERSPEECH, pp. 2345-2349, 2013.

[7] Tomáš Mikolov, Martin Karafiát, Lukáš Burget, Jan "Honza" Černocký and Sanjeev Khudanpur, Recurrent neural network based language model, in Proceedings of INTERSPEECH, pp. 1045-1048, 2010.

Flatstart and recognition mode

The default CHiME4 baseline scripts for the 1ch, 2ch, and 6ch tracks work in recognition mode, given the acoustic models and the language models provided in the official CHiME4 package (CHiME4/tools/ASR_models). These models are used without modification: no retraining or adaptation is performed:
cd CHiME4/tools/ASR_1ch_track
./run.sh
All baseline scripts also have a flatstart mode, which retrains the acoustic models and language models from scratch. You can modify the training scripts (local/run_init.sh, local/run_dnn.sh, and local/run_lmrescore.sh) to build your own models:
cd CHiME4/tools/ASR_1ch_track
./run.sh --flatstart true

Quick instructions for the ASR baseline (package version)

  1. download Kaldi from http://kaldi-asr.org/ and install the required tools (we tested our scripts with commit 9e8ff73648917836d0870c8f6fdd2ff4bdde384f; if your Kaldi build fails, please try this version). In addition to the basic Kaldi tools, you need to install BeamformIt, IRSTLM, SRILM, and Mikolov's RNNLM. For SRILM, you first need to get the source (srilm.tgz) from http://www.speech.sri.com/projects/srilm/download.html
    git clone https://github.com/kaldi-asr/kaldi.git
    cd kaldi
    git checkout 9e8ff73648917836d0870c8f6fdd2ff4bdde384f
    cd tools
    make -j                                      # the -j option parallelizes compilation
    ./extras/install_beamformit.sh               # BeamformIt
    ./extras/install_irstlm.sh                   # IRSTLM
    ./extras/install_mikolov_rnnlm.sh rnnlm-0.3e # Mikolov's RNNLM version 0.3e
    ./extras/install_srilm.sh                    # get the source from http://www.speech.sri.com/projects/srilm/download.html first
  2. compile Kaldi
    cd ../src
    ./configure
    make depend -j
    make -j
  3. move to the CHiME4 ASR baseline directory. The following example is for the 1ch track, but the procedure is the same for the two other tracks (i.e., CHiME4/tools/ASR_2ch_track and CHiME4/tools/ASR_6ch_track).
    cd <your CHiME4 directory>/CHiME4/tools/ASR_1ch_track
  4. specify the Kaldi root in CHiME4/tools/ASR_1ch_track/path.sh, e.g.,
    export KALDI_ROOT=<your Kaldi>
  5. make sure that the following paths in run.sh are correctly specified (the default values should be fine, but please double-check):
    chime4_data=`pwd`/../..
    modeldir=$chime4_data/tools/ASR_models
  6. execute run.sh.
    ./run.sh
    we suggest using the following command to save the main log file
    nohup ./run.sh > run.log
  7. if you want to retrain models from scratch, you can also use the flatstart mode.
    ./run.sh --flatstart true
  8. if you have your own enhanced speech data for test, you can evaluate the performance of GMM and DNN systems without retraining by
    local/run_gmm_recog.sh <enhancement method> <enhanced speech directory>
    local/run_dnn_recog.sh <enhancement method>
    local/run_lmrescore.sh --stage 3 <your CHiME4 directory>/CHiME4 <enhancement method>
    (you don't have to execute local/run_init.sh twice).
  9. you can find the resulting word error rates (WERs) in the following files:
    enhan=<enhancement method>
    exp/tri3b_tr05_multi_noisy/best_wer_${enhan}.result                                           # GMM
    exp/tri4a_dnn_tr05_multi_noisy_smbr_i1lats/best_wer_${enhan}.result                           # DNN sMBR
    exp/tri4a_dnn_tr05_multi_noisy_smbr_lmrescore/best_wer_${enhan}_5gkn_5k.result                # 5-gram rescoring
    exp/tri4a_dnn_tr05_multi_noisy_smbr_lmrescore/best_wer_${enhan}_rnnlm_5k_h300_w0.5_n100.result # RNNLM
    You can also compare your results with the baseline results in the RESULTS file.

Quick instructions for the ASR baseline (Kaldi GitHub repository version)

  1. download Kaldi from http://kaldi-asr.org/ and install the required tools. In addition to the basic Kaldi tools, you need to install BeamformIt, IRSTLM, SRILM, and Mikolov's RNNLM. For SRILM, you first need to get the source (srilm.tgz) from http://www.speech.sri.com/projects/srilm/download.html
    git clone https://github.com/kaldi-asr/kaldi.git
    cd kaldi/tools
    make -j                                      # the -j option parallelizes compilation
    ./extras/install_beamformit.sh               # BeamformIt
    ./extras/install_irstlm.sh                   # IRSTLM
    ./extras/install_mikolov_rnnlm.sh rnnlm-0.3e # Mikolov's RNNLM version 0.3e
    ./extras/install_srilm.sh                    # get the source from http://www.speech.sri.com/projects/srilm/download.html first
  2. compile Kaldi
    cd ../src
    ./configure
    make depend -j
    make -j
  3. move to the CHiME4 ASR baseline in the Kaldi egs directory. The following example is for the 1ch track, but the procedure is the same for the two other tracks (i.e., kaldi/egs/chime4/s5_2ch and kaldi/egs/chime4/s5_6ch).
    cd <your Kaldi directory>/egs/chime4/s5_1ch
  4. specify model and CHiME4 root paths in run.sh:
    modeldir=<your CHiME4 directory>/tools/ASR_models
    chime4_data=<your CHiME4 directory>
  5. execute run.sh.
    ./run.sh
    we suggest using the following command to save the main log file
    nohup ./run.sh > run.log
  6. if you want to train models from scratch, you can also use the flatstart mode.
    ./run.sh --flatstart true
  7. if you have your own enhanced speech data for test, you can evaluate the performance of GMM and DNN systems without retraining by
    local/run_gmm_recog.sh <enhancement method> <enhanced speech directory>
    local/run_dnn_recog.sh <enhancement method>
    local/run_lmrescore.sh --stage 3 <your CHiME4 directory>/CHiME4 <enhancement method>
    (you don't have to execute local/run_init.sh twice).
  8. you can find the resulting word error rates (WERs) in the following files:
    enhan=<enhancement method>
    exp/tri3b_tr05_multi_noisy/best_wer_${enhan}.result                                           # GMM
    exp/tri4a_dnn_tr05_multi_noisy_smbr_i1lats/best_wer_${enhan}.result                           # DNN sMBR
    exp/tri4a_dnn_tr05_multi_noisy_smbr_lmrescore/best_wer_${enhan}_5gkn_5k.result                # 5-gram rescoring
    exp/tri4a_dnn_tr05_multi_noisy_smbr_lmrescore/best_wer_${enhan}_rnnlm_5k_h300_w0.5_n100.result # RNNLM
    You can also compare your results with the baseline results in the RESULTS file.

Additional speech enhancement tool

We provide an additional speech enhancement tool in the following directory of the CHiME4 package:

CHiME4/tools/enhancement

This tool aims to transform the multichannel noisy input signal into a single-channel enhanced output signal suitable for ASR by means of MVDR beamforming. MVDR is known to work poorly on real data because it does not handle microphone mismatches, microphone failures, early echoes, and reverberation. This code is not intended to be run as such (BeamformIt provides much better results) but to provide a set of Matlab tools from which more advanced beamforming or source separation techniques can be developed.

The signals are represented in the complex-valued STFT domain using half-overlapping sine windows of 1024 samples. In a first step, the spatial position of the target speaker in each time frame is encoded by a nonlinear SRP-PHAT pseudo-spectrum [8], which was found to perform best among a variety of source localization techniques [9]. The peaks of the SRP-PHAT pseudo-spectrum are then tracked over time using the Viterbi algorithm. The transition probabilities between successive speaker positions are inversely related to their distance and to the distance to the center of the microphone array.
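As an illustration of the underlying idea, the following Python/NumPy sketch computes a plain SRP-PHAT pseudo-spectrum for one frame over a grid of candidate positions (the tool itself is written in Matlab and uses the nonlinear variant of [8]; the array layout and names below are assumptions):

    import numpy as np

    def srp_phat_frame(X, taus, freqs):
        """Plain SRP-PHAT pseudo-spectrum for one STFT frame.
        X:     (n_mics, n_freq) complex STFT coefficients of the frame
        taus:  (n_pos, n_mics) propagation delays (in seconds) from each
               candidate speaker position to each microphone
        freqs: (n_freq,) frequencies of the STFT bins in Hz
        Returns a (n_pos,) pseudo-spectrum whose peak indicates the most
        likely speaker position for this frame."""
        n_mics, _ = X.shape
        spec = np.zeros(taus.shape[0])
        for i in range(n_mics):
            for j in range(i + 1, n_mics):
                cross = X[i] * np.conj(X[j])
                cross /= np.abs(cross) + 1e-12             # PHAT weighting
                tdoa = taus[:, i] - taus[:, j]             # hypothesized pairwise TDOAs
                steering = np.exp(2j * np.pi * np.outer(tdoa, freqs))
                spec += np.real(steering @ cross)          # coherence at each candidate position
        return spec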

The multichannel covariance matrix of noise is estimated from 400 ms to 800 ms of context immediately before the test utterance. The speech signal is then estimated by time-varying minimum variance distortionless response (MVDR) beamforming with diagonal loading [10], taking possible microphone failures into account. The full 5 s of allowed context are not used here, since they often contain some unannotated speech, which would result in cancellation of the target. The relevant noise context is therefore often much shorter.
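For reference, MVDR with diagonal loading for a single frequency bin can be sketched as follows in Python/NumPy (the actual tool is written in Matlab, uses a time-varying steering vector derived from the tracked speaker position, and discards failed microphones; the names and the loading value below are illustrative):

    import numpy as np

    def mvdr_weights(noise_cov, steering, loading=1e-3):
        """MVDR beamformer weights for one frequency bin, with diagonal loading.
        noise_cov: (n_mics, n_mics) noise covariance estimated from the context
                   immediately preceding the utterance
        steering:  (n_mics,) steering vector toward the speaker (direct path)
        loading:   diagonal loading factor relative to the average noise power"""
        n_mics = noise_cov.shape[0]
        # diagonal loading regularizes the matrix inversion [10]
        reg = loading * np.real(np.trace(noise_cov)) / n_mics * np.eye(n_mics)
        inv_cov = np.linalg.inv(noise_cov + reg)
        w = inv_cov @ steering / (steering.conj() @ inv_cov @ steering)
        return w

    # the enhanced STFT coefficient in each frame is then  w.conj() @ x,
    # where x holds the noisy STFT coefficients of the n_mics channels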

This baseline does not address the automatic detection of the relevant noise context, the modeling of early echoes and reverberation, or spatial and spectral post-filtering, among other issues. You are encouraged to address these limitations in order to get the most out of enhancement.

[8] Benedikt Loesch and Bin Yang, Adaptive segmentation and separation of determined convolutive mixtures under dynamic conditions, in Proceedings of the 9th International Conference on Latent Variable Analysis and Signal Separation (LVA/ICA), pp. 41-48, 2010.

[9] Charles Blandin, Alexey Ozerov, and Emmanuel Vincent, Multi-source TDOA estimation in reverberant audio using angular spectra and clustering, Signal Processing, 92(8):1950-1960, 2012.

[10] Xavier Mestre and Miguel A. Lagunas, On diagonal loading for minimum variance beamformers, in Proceedings of the IEEE International Symposium on Signal Processing and Information Technology (ISSPIT), pp. 459-462, 2003.

All software is available on the download page.