The baseline system is provided on GitHub. Please refer to the README therein for information on how to install and run the system.
The baseline system roughly follows the scheme in (Lin et al., 2023). It comprises:
- Fixed NLCMV beamformer (Feng et al., 2023), which uses 13 beams: 12 directions uniformly spaced around the wearer plus 1 direction toward the wearer's mouth. The beamformer coefficients are derived from acoustic transfer functions (ATFs) recorded in anechoic rooms with the Aria glasses. We release both the beamforming coefficients and the original ATFs.
- Extraction of log-mel features from each of the 13 beams (a sketch of this front-end is given after this list)
- ASR model processing the multi-channel features and producing serialized output training (SOT; Kanda et al., 2022) transcriptions
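
For illustration, the sketch below shows one way such a fixed front-end could be implemented: applying per-frequency beamforming weights to a multi-microphone STFT and converting each beam to log-mel features. The array shapes, default parameters, and the use of librosa for the mel filterbank are assumptions made for this sketch; it is not the baseline implementation.

```python
import numpy as np
import librosa

def apply_fixed_beamformer(mics_stft: np.ndarray, weights: np.ndarray) -> np.ndarray:
    """Apply fixed per-frequency beamforming weights to a multi-microphone STFT.

    mics_stft: complex array of shape (n_mics, n_freq, n_frames)
    weights:   complex array of shape (n_beams, n_mics, n_freq)
    returns:   complex array of shape (n_beams, n_freq, n_frames)
    """
    # For each beam b and frequency f: y_b[f, t] = sum_m conj(w[b, m, f]) * x_m[f, t]
    return np.einsum("bmf,mft->bft", weights.conj(), mics_stft)

def beams_to_logmel(beams_stft: np.ndarray, sr: int = 16000,
                    n_fft: int = 512, n_mels: int = 80) -> np.ndarray:
    """Convert each beam's STFT to log-mel features, shape (n_beams, n_mels, n_frames)."""
    mel_fb = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)  # (n_mels, n_fft // 2 + 1)
    power = np.abs(beams_stft) ** 2
    return np.log(np.einsum("kf,bft->bkt", mel_fb, power) + 1e-10)
```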
The ASR model is based on a publicly available pre-trained streaming model, the FastConformer Hybrid Transducer-CTC model. By default, this is a single-speaker, single-channel model. We modify it by prepending the beamformer, extending its input to multiple channels, extending the tokenizer with the speaker tokens »0 and »1 (for SELF and OTHER, respectively), and fine-tuning it to produce SOT transcriptions. The fine-tuning is done on the training subset of the MMCSG dataset.
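
As an illustration of the SOT target format, the sketch below shows one possible serialization of reference transcriptions with the speaker tokens »0 and »1: segments ordered by start time, each prefixed by its speaker token, in the spirit of Kanda et al. (2022). The data structure and function names are hypothetical, and the exact serialization used by the baseline (e.g., word-level ordering) may differ.

```python
from dataclasses import dataclass
from typing import List

SELF_TOKEN, OTHER_TOKEN = "»0", "»1"  # speaker tokens added to the tokenizer

@dataclass
class Segment:
    start: float        # segment start time in seconds
    speaker: str        # "SELF" (wearer) or "OTHER" (conversational partner)
    words: List[str]

def serialize_sot(segments: List[Segment]) -> str:
    """Order segments by start time and prefix each with its speaker token."""
    token_for = {"SELF": SELF_TOKEN, "OTHER": OTHER_TOKEN}
    parts: List[str] = []
    for seg in sorted(segments, key=lambda s: s.start):
        parts.append(token_for[seg.speaker])
        parts.extend(seg.words)
    return " ".join(parts)

# Example:
# serialize_sot([Segment(0.4, "SELF", ["hello"]),
#                Segment(1.2, "OTHER", ["hi", "there"])])
# -> "»0 hello »1 hi there"
```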
The results achieved by the baseline system on the dev subset of MMCSG with several different latency settings are summarized in the following table: