Instructions

In order to reach a broad audience we have tried to avoid setting rules that might artificially disadvantage one research community over another. However, to allow systems to be broadly comparable, there are some guidelines that we expect participants to follow.

Tracks

The challenge features two tracks:

  1. single-array: only one reference array can be used to recognise a given evaluation utterance,
  2. multiple-array: all arrays can be used.

The reference array depends on the session and on the location (kitchen, dining, living): it is fixed for a given location within each session and was chosen such that it is generally in the same room as the speakers.

For each track, we will produce two separate rankings:

  A. systems based on conventional acoustic modeling and official language modeling: the outputs of the acoustic model must remain frame-level tied phonetic (senone) targets, and the lexicon and language model must not be changed compared to the conventional ASR baseline,
  B. all other systems, including systems based on the end-to-end ASR baseline or systems whose lexicon and/or language model have been modified.
In other words, ranking A focuses on acoustic robustness only, while ranking B addresses all aspects of the task.

Which information can I use?

The following annotations can be used for training, development, and evaluation:

  • the session and location labels,
  • the start and end times of all utterances,
  • the corresponding speaker labels.
Note that the start and end times may not be fully accurate, since they have been manually annotated on the binaural microphones corresponding to each speaker and automatically transferred to each Kinect by means of simple cross-correlation between the signals.
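To illustrate this kind of automatic transfer, the sketch below estimates the time offset between a binaural signal and a Kinect channel by cross-correlation and shifts an utterance start time accordingly. It is only an illustration, not the script used to produce the annotations; the file names, the example start time, and the use of a single channel are assumptions.

    # Sketch: estimate the offset between a binaural and a Kinect channel by
    # cross-correlation, then shift an utterance boundary accordingly.
    # File names and the example start time are hypothetical.
    import numpy as np
    import soundfile as sf
    from scipy.signal import correlate

    binaural, fs = sf.read("S02_P05_binaural.wav")  # hypothetical file names
    kinect, _ = sf.read("S02_U02_ch1.wav")          # assumed same sampling rate
    if binaural.ndim > 1:
        binaural = binaural[:, 0]                   # keep one channel
    if kinect.ndim > 1:
        kinect = kinect[:, 0]

    # In practice one would correlate a short excerpt around each utterance
    # rather than the full-length recordings.
    n = min(len(binaural), len(kinect))
    xcorr = correlate(kinect[:n], binaural[:n], mode="full", method="fft")
    lag = int(np.argmax(xcorr)) - (n - 1)           # offset in samples

    # Transfer a start time annotated on the binaural signal to the Kinect
    start_binaural = 123.45                         # seconds (example value)
    start_kinect = start_binaural + lag / fs
    print(f"offset: {lag / fs:.3f} s, start on Kinect: {start_kinect:.2f} s")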

For evaluation, you are allowed to use for a given utterance the full-length recording from the reference Kinect (for the single-array track) or from all Kinects (for the multiple-array track) for that session. In other words, you are not limited to the past context or to the immediate context surrounding each utterance.

For training and development, you are also allowed to use:

  • the full-length recordings of all Kinects (even for the single-array track),
  • the full-length recordings of all binaural microphones,
  • the maps of the recording environments showing the positions of the Kinects.
Note that the dimensions in the maps are not fully accurate and heights are not provided.

Which information shall I not use?

Manual modification of the data or the annotations (e.g., manual refinement of the utterance start and end times) is forbidden.

All parameters should be tuned on the training set or the development set. Modifications of the development set are allowed, provided that its size remains unchanged and the modifications do not risk inadvertently biasing the development set toward the particular speakers or acoustic conditions of the evaluation set. For instance, enhancing the signals, applying “unbiased” transformations (e.g., fMLLR), or automatically refining the utterance start and end times is allowed. Augmenting the development set by generating simulated data, applying biased signal transformations (e.g., systematically increasing intensity/pitch), or selecting a subset of the development set is forbidden. In case of doubt, please ask us ahead of the submission deadline.

Can I use different data or annotations?

You are entirely free in the development of your system.

In particular, you can modify the provided training, development, and evaluation data by:

  • automatically realigning the signals with respect to each other,
  • processing the signals by means of speech enhancement or “unbiased” signal transformations (e.g., fMLLR),
  • automatically refining the utterance start and end times (e.g., by automatic speech activity detection),
and you can also modify the provided training data by:
  • processing the signals by means of other transformations,
  • generating simulated data based on the provided binaural or Kinect signals and on artificially generated impulse responses and noises (see the sketch below),

provided that these modifications are fully automatic (no manual reannotation) and they rely on the provided signals only (no external speech, impulse response, or noise data). The results obtained using those modifications will be taken into account in the final WER ranking of all systems.
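As an illustration of the last point, the sketch below simulates a distant-microphone training signal from a binaural recording by convolving it with an artificially generated impulse response and adding artificially generated noise. The exponential-decay impulse response model, the target SNR, and the file names are illustrative assumptions only, not a prescribed recipe.

    # Sketch: simulate a distant-microphone training signal from a binaural
    # recording using an artificial impulse response and artificial noise.
    # The RIR model, the SNR, and the file names are illustrative assumptions.
    import numpy as np
    import soundfile as sf
    from scipy.signal import fftconvolve

    clean, fs = sf.read("S03_P09_binaural.wav")   # hypothetical file name
    if clean.ndim > 1:
        clean = clean[:, 0]                       # keep one channel

    # Artificial room impulse response: exponentially decaying white noise
    rir_len = int(0.4 * fs)                       # ~400 ms reverberation tail
    rng = np.random.default_rng(0)
    rir = rng.standard_normal(rir_len) * np.exp(-np.arange(rir_len) / (0.1 * fs))
    rir /= np.max(np.abs(rir))

    reverberant = fftconvolve(clean, rir, mode="full")[: len(clean)]

    # Add artificially generated white noise at a target SNR of 10 dB
    snr_db = 10.0
    sig_power = np.mean(reverberant ** 2)
    noise = rng.standard_normal(len(reverberant))
    noise *= np.sqrt(sig_power / (10 ** (snr_db / 10) * np.mean(noise ** 2)))
    simulated = reverberant + noise

    sf.write("S03_P09_simulated.wav", simulated / np.max(np.abs(simulated)), fs)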

You may even use external speech, impulse response, or noise data taken from publicly available or in-house datasets. However, you should still report the results of your system using only the official challenge data, so that enough information is available to understand where the performance gains obtained by your system come from. The results obtained using external data will not be taken into account in the final WER ranking of all systems. This rule is motivated by the fact that we found no significant benefit when attempting to use external data in the challenge preparation phase.

Can I use a different recogniser or overall system?

Again, you are entirely free in the development of your system.

In particular, you can:

  • include a single-channel or multichannel enhancement front-end (see the sketch at the end of this section),
  • use other acoustic features,
  • modify the acoustic model architecture or the training criterion,
  • modify the lexicon and the language model,
  • use any rescoring technique.

The results obtained using those modifications will be taken into account in the final WER ranking of all systems. Note that, depending on the chosen baseline (conventional or end-to-end) and the modifications made, your system will be ranked within either category A or B. If the outputs of the acoustic model remain frame-level tied phonetic targets, the lexicon and language model are unchanged compared to the conventional ASR baseline, and rescoring techniques (if any) are based on this lexicon and this language model (e.g., MBR or DLM based on acoustic features only), then it will be ranked within category A. Otherwise, e.g., if you used end-to-end ASR, you modified the lexicon or the language model, or you used rescoring techniques that implicitly modify the language model (e.g., DLM based on linguistic features), it will be ranked within category B. In case of doubt, please ask us ahead of the submission deadline.
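As an illustration of a multichannel enhancement front-end, the sketch below implements simple delay-and-sum beamforming over the four channels of one Kinect array, with delays estimated by cross-correlation. The file name and the circular-shift alignment are illustrative assumptions; this is only a sketch, not the baseline front end.

    # Sketch: delay-and-sum beamforming over the four channels of one Kinect.
    # Delays are estimated by cross-correlation against the first channel.
    # The file name is hypothetical, and np.roll (circular shift) is a
    # simplification; in practice delays would be estimated and compensated
    # per utterance or per segment.
    import numpy as np
    import soundfile as sf
    from scipy.signal import correlate

    x, fs = sf.read("S02_U06.wav")   # hypothetical 4-channel array recording
    x = x.T                          # shape: (channels, samples)
    n_ch, n = x.shape

    # Estimate the delay of each channel relative to channel 0
    delays = []
    for c in range(n_ch):
        xc = correlate(x[c], x[0], mode="full", method="fft")
        delays.append(int(np.argmax(xc)) - (n - 1))

    # Align the channels by compensating the estimated delays and average them
    aligned = np.stack([np.roll(x[c], -d) for c, d in enumerate(delays)])
    enhanced = aligned.mean(axis=0)

    sf.write("S02_U06_das.wav", enhanced, fs)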

Which results should I report?

For every tested system, you should report 2 WERs (%), namely:

  • the WER on the development set
  • the WER on the evaluation set.
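For reference, the WER is the standard edit-distance-based measure: the number of word substitutions, deletions, and insertions divided by the number of reference words. The sketch below is a generic illustration of this computation, not the official Kaldi scoring script.

    # Sketch: compute WER (%) as (substitutions + deletions + insertions)
    # divided by the number of reference words, via dynamic programming.
    # Generic illustration only; official scoring uses the Kaldi tools.
    def wer(reference: str, hypothesis: str) -> float:
        ref, hyp = reference.split(), hypothesis.split()
        # d[i][j] = minimum edit distance between ref[:i] and hyp[:j]
        d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            d[i][0] = i
        for j in range(len(hyp) + 1):
            d[0][j] = j
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                cost = 0 if ref[i - 1] == hyp[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,         # deletion
                              d[i][j - 1] + 1,         # insertion
                              d[i - 1][j - 1] + cost)  # substitution / match
        return 100.0 * d[len(ref)][len(hyp)] / len(ref)

    print(wer("it was nice to see you", "it was nice see you too"))  # 33.33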

For instance, here are the WERs (%) achieved by the two ASR baselines (the WERs on the evaluation set will be available in June/July). These results were obtained for one run on one machine. If you run the baseline yourself, you will probably obtain different results due to random initialisation and machine-specific issues.

Baseline                      Development set   Evaluation set
conventional (GMM)            91.7              –
conventional (LF-MMI TDNN)    81.3              –
end-to-end                    94.7              –

Note that straightforward end-to-end ASR generally does not perform well with this amount of training data; we include this result to share the current status of the technique. We have also confirmed that the end-to-end and Kaldi GMM systems perform comparably when tested on the worn binaural microphones, as shown below. As a reference, we also list the WERs obtained on the binaural microphones; these are not considered challenge results, but they are useful for investigating the challenge problem in more detail. For example, the performance gap between the array and binaural microphone results for the LF-MMI TDNN system shows that the main difficulty of this challenge comes from the distance between the speakers and the microphones, in addition to the very spontaneous and overlapped speech, which is present in both the array and binaural conditions.

Baseline                      Development set
conventional (GMM)            72.8
conventional (LF-MMI TDNN)    47.9
end-to-end                    67.2

The experimental comparison of all systems should provide enough information to understand where the performance gains obtained by your best system come from. For instance, if your system can be split into a front end, an acoustic model, and a language model, and your front end differs from the baseline, please report the results obtained by replacing your front end with the baseline front end. Similarly, if your acoustic model or your language model differs from the baseline, please report the results obtained by replacing it with the corresponding baseline component. More generally, if your system is made of multiple blocks, we encourage you to separately evaluate and report the influence of each block on performance.

Ultimately, only the results of the best system on the evaluation set will be taken into account in the final WER ranking of all systems. The best system is taken to be the one that performs best on the development set.

For that system, you should report 12 WERs: one for each development/evaluation session and each location (two development sessions and two evaluation sessions, times three locations).

Finally, you will be asked to provide the recognised transcriptions for the development and evaluation data and the corresponding lattices in Kaldi format (without changing the utterance IDs) for that system.