Software

We provide three software baselines covering array synchronization, speech enhancement, and speech recognition. All of them are integrated as a Kaldi CHiME-6 recipe.

Overview

The main script (run.sh) executes array synchronization, data preparation, data augmentation, feature extraction, GMM training, data cleaning, and chain model training. After training, run.sh finally calls the inference script (local/decode.sh), which performs speech enhancement and recognition given the trained model. You can also execute local/decode.sh independently with your own ASR models or with pre-trained models downloaded from here.
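
For orientation, the two entry points can be invoked as sketched below (a minimal illustration; the Usage section covers prerequisites and options in detail):

    # Full pipeline: array synchronization, data preparation, training, and decoding
    ./run.sh

    # Inference only (speech enhancement + recognition), reusing a trained model in exp/
    ./local/decode.sh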

  1. Array synchronization to generate the new CHiME-6 audio data (stage 0)
    • This stage first downloads the array synchronization tool and generates the synchronized audio files across arrays and the corresponding JSON files. Note that this requires sox v14.4.2, which is installed via miniconda in ./local/check_tools.sh. Details of the array synchronization can be found in the Array synchronization section below.
  2. Data, dictionary, and language model preparation (stages 1 to 3)
    • Prepare Kaldi format data directories, lexicon, and language models
    • Language model: maximum entropy based 3-gram
       data/srilm/best_3gram.gz -> 3gram.me.gz
      
    • Vocabulary size: 127,712
       $ wc -l data/lang/words.txt
       127712 data/lang/words.txt
      
  3. Data augmentation (stages 4 to 7)
    • In these stages, we augment and fix the training data. Point source noises are extracted from the CHiME-6 corpus. For training, we use 400k utterances from the array microphones, their augmented versions, and all of the worn-microphone utterances.
    • We did not include enhanced speech data in the training data, to keep the system simple.
  4. Feature extraction (stage 8)
    • We make 13-dim MFCC features for GMM-HMM systems.
  5. GMM training (stages 9 to 13)
    • Stages 9 to 13 train monophone and triphone models. These models are used for cleaning the training data and for generating lattices for chain model training.
  6. Data cleaning (stage 14)
    • This stage performs data cleanup for training data by using the GMM model.
  7. Chain model training (stage 15)
    • We use a factorized time delay neural network (TDNN-F) adapted from the SWBD recipe 7q.
    • You can also download a pretrained chain ASR model using wget http://kaldi-asr.org/models/12/0012_asr_v1.tar.gz. Once it is downloaded, extract it with tar -xvzf 0012_asr_v1.tar.gz and copy the contents of its exp/ directory to your exp/ directory (see the sketch after this list).
  8. Decoding (stage 16)
    • This stage performs speech enhancement and recognition for the test set. It calls local/decode.sh, which includes speech enhancement, decoding, and scoring.
    • [1] Barker, J., Watanabe, S., Vincent, E., & Trmal, J. (2018). The Fifth ‘CHiME’ Speech Separation and Recognition Challenge: Dataset, Task and Baselines. Proc. Interspeech 2018, 1561-1565.
    • [2] Manohar, V., Chen, S. J., Wang, Z., Fujita, Y., Watanabe, S., & Khudanpur, S. (2019, May). Acoustic Modeling for Overlapping Speech Recognition: JHU CHiME-5 Challenge System. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 6665-6669).
    • [3] Povey, D., Cheng, G., Wang, Y., Li, K., Xu, H., Yarmohammadi, M., & Khudanpur, S. (2018, September). Semi-Orthogonal Low-Rank Matrix Factorization for Deep Neural Networks. In Interspeech (pp. 3743-3747).
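
    Collecting the pretrained-model commands from step 7 into one place, a minimal sketch (the top-level directory name inside the archive is an assumption; adjust it to whatever tar actually extracts):

       wget http://kaldi-asr.org/models/12/0012_asr_v1.tar.gz
       tar -xvzf 0012_asr_v1.tar.gz
       # Copy the extracted exp/ contents into the recipe's exp/ directory
       # (the directory name 0012_asr_v1 is assumed here).
       cp -r 0012_asr_v1/exp/* exp/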

Array synchronization

The new array synchronisation baseline is available on GitHub. The array synchronisation compensates for two separate issues: audio frame-dropping (which affects the Kinect devices only) and clock-drift (which affects all devices). It operates in two stages:

  1. Frame-dropping is compensated by inserting zeros into the signals where samples have been dropped. These locations have been detected by comparing the Kinect audio with an uncorrupted stereo audio signal recovered from the video avi files that were recorded (but not made publicly available). The frame-drop locations have been precomputed and stored in the file chime6_audio_edits.json, which is then used to drive the synchronisation software.

  2. Clock-drift is computed by comparing each device’s signal to the session’s ‘reference’ binaural recordings (the binaural mic of the speaker with the lowest ID number). Specifically, cross-correlation is used to estimate delays between the device and the reference at regular intervals throughout the recording session (performed using estimate_alignment.py from the CHiME-5 synchronization baseline). A relative speed-up or slow-down can then be approximated using a linear fit through these estimates. The signal is then synchronised to the reference using a sox command to adjust the speed of the signal appropriately (see the sketch at the end of this section). This adjustment is typically very subtle, i.e., less than 100 ms over the 2 1/2 hour recording session. Note that the approach failed for devices S01_U02 and S01_U05, which appear to have temporarily changed speed during the recording session and therefore required a piece-wise linear fit. The adjustments for clock-drift compensation have been precomputed, and the parameters that drive the sox commands are stored in chime6_audio_edits.json.

Note that after frame-drop and clock-drift compensation, the wav files generated for each device will have slightly different durations. For each session, the device signals can be safely truncated to the duration of the shortest signal across devices, but this step is not performed by the synchronisation tool.

Finally, the CHiME-5 transcript json files are processed to fit the new alignment. In the new version, utterances will have the same start and end time on every device.
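
For illustration, the clock-drift correction and the optional truncation described above boil down to plain sox calls. The sketch below uses hypothetical file names, a hypothetical speed factor, and a hypothetical target duration; the actual parameters are read from chime6_audio_edits.json by the synchronisation tool:

    # Hypothetical clock-drift correction: adjust the playback speed by the
    # factor obtained from the linear fit (the value shown here is made up).
    sox S0X_UYY.CH1.wav S0X_UYY.CH1_sync.wav speed 1.00001

    # Optional truncation to the duration (in seconds) of the shortest device
    # signal in the session; this step is NOT performed by the tool itself.
    sox S0X_UYY.CH1_sync.wav S0X_UYY.CH1_cut.wav trim 0 9000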

Speech enhancement

We provide two speech enhancement implementations.

  1. WPE-based dereverberation and Guided Source Separation (GSS) applied to multiple arrays.
    • This speech enhancement front-end consists of WPE dereverberation, a spatial mixture model that uses time annotations (GSS), beamforming, and masking.
    • multiarray=outer_array_mics means that only the first and last microphones of each array are used.
    • This is the default speech enhancement for the CHiME-6 track 1 recipe.
  2. WPE-based dereverberation and a weighted delay-and-sum beamformer (BeamformIt) applied to the reference array.
    • This is an optional speech enhancement processing used in the CHiME-5 recipe.
    • To switch from GSS to BeamformIt, specify one of the following options:
       ./run.sh --stage 16 --enhancement beamformit
      

      or

       ./local/decode.sh --enhancement beamformit
      
    • [4] Nakatani, T., Yoshioka, T., Kinoshita, K., Miyoshi, M., & Juang, B. H. (2010). Speech dereverberation based on variance-normalized delayed linear prediction. IEEE Transactions on Audio, Speech, and Language Processing, 18(7), 1717-1731.
    • [5] Anguera, X., Wooters, C., & Hernando, J. (2007). Acoustic beamforming for speaker diarization of meetings. IEEE Transactions on Audio, Speech, and Language Processing, 15(7), 2011-2022.
    • [6] Drude, L., Heymann, J., Boeddeker, C., & Haeb-Umbach, R. (2018, October). NARA-WPE: A Python package for weighted prediction error dereverberation in Numpy and Tensorflow for online and offline processing. In Speech Communication; 13th ITG-Symposium (pp. 1-5).
    • [7] Boeddeker, C., Heitkaemper, J., Schmalenstroeer, J., Drude, L., Heymann, J., & Haeb-Umbach, R. (2018, September). Front-end processing for the CHiME-5 dinner party scenario. In CHiME5 Workshop, Hyderabad, India.

Decoding and scoring

  1. We perform two-stage decoding, which refines the i-vector extraction based on the first-pass decoding result to achieve robust decoding for noisy speech.

  2. We provide the scoring script for the submission, local/score_for_submit.sh, which covers both the development and evaluation sets (an example invocation is sketched after this list).

  3. The language model weight and insertion penalty are optimized based on the development set.

                      Dev. WER (%)   Eval. WER (%)
      Kaldi baseline         51.76           51.29
    • [8] Peddinti, V., Chen, G., Manohar, V., Ko, T., Povey, D., & Khudanpur, S. (2015, December). JHU ASpIRE system: Robust LVCSR with TDNNs, iVector adaptation and RNN-LMs. In 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU) (pp. 539-546).
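
    For example, the enhancement, two-stage decoding, and scoring described above can be rerun from the decoding stage as sketched below (a minimal sketch; running local/score_for_submit.sh without arguments assumes the script's default decode directories, so check its header for the available options):

       # Re-run enhancement, two-stage decoding, and scoring with the default
       # GSS-based front-end.
       ./run.sh --stage 16

       # Submission-style scoring for the development and evaluation sets
       # (invoked here without arguments, relying on the script's defaults).
       ./local/score_for_submit.sh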

Usage (especially for Kaldi beginners)

  1. Download Kaldi and compile the Kaldi tools, then install BeamformIt for beamforming, Phonetisaurus for constructing a lexicon using grapheme-to-phoneme conversion, SRILM for language model construction, miniconda for several Python packages, and Nara WPE for dereverberation. For SRILM, you need to download the source (srilm.tgz) first.
     git clone https://github.com/kaldi-asr/kaldi.git
     cd kaldi/tools
     make -j                             # the "-j" option parallelizes compilation
     ./extras/install_beamformit.sh      # BeamformIt
     ./extras/install_srilm.sh           # Get source from http://www.speech.sri.com/projects/srilm/download.html first
     ./extras/install_phonetisaurus.sh   # G2P
     ./extras/install_miniconda.sh       # Miniconda for several Python packages including Nara WPE, audio synchronization, and GSS
     ./extras/install_wpe.sh             # Nara WPE
    
  2. Compile Kaldi (-j 10 runs 10 parallel jobs; adjust this number to your environment).
     cd ../src
     ./configure
     make depend -j 10
     make -j 10
    
  3. Move to the CHiME-6 track 1 ASR baseline in the Kaldi egs/ directory.
     cd ../egs/chime6/s5_track1
    
  4. Specify the model and CHiME-5 root paths in run.sh. You can also specify the CHiME-6 root directory, which is generated in the array synchronization stage (stage 0).
     chime5_corpus=<your CHiME-5 path>
     chime6_corpus=<desired CHiME-6 path>
    
  5. Execute run.sh.
     ./run.sh
    

    We suggest using the following command to save the main log file:

     nohup ./run.sh > run.log
    

    If your experiments fail or you want to resume them at some stage, you can use the following command (this example reruns the GMM experiments from stage 9):

     ./run.sh --stage 9
    
  6. If you have your own enhanced speech data for the test set, you can decode and score it by using local/decode.sh.

  7. You can find the resulting word error rates (WERs) in the following files:
     exp/chain_train_worn_simu_u400k_cleaned_rvb/tdnn1b_sp/decode_dev_gss_multiarray_2stage/wer_*
     exp/chain_train_worn_simu_u400k_cleaned_rvb/tdnn1b_sp/decode_eval_gss_multiarray_2stage/wer_*
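
    To pick out the best WER from these files, the standard Kaldi helper utils/best_wer.sh can be used, for example (a sketch using the decode directories above):

     # Report the best WER over language-model weights and insertion penalties
     for d in exp/chain_train_worn_simu_u400k_cleaned_rvb/tdnn1b_sp/decode_{dev,eval}_gss_multiarray_2stage; do
       grep WER $d/wer_* | utils/best_wer.sh
     done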
    

Note

  • During scoring, we filter the tags ([noise], [inaudible], [laughs], and [redacted]) and normalize the filler variants to hmm; see local/wer_output_filter for the exact sed rules. The final scoring script will be released when the test data is released.
  • The WER can differ between runs and between machines due to random initialization and machine-specific issues. The difference can be up to several percent absolute.