Instructions

In order to reach a broad audience we have tried to avoid setting rules that might artificially disadvantage one research community over another. However, to keep the task as close to an application scenario as possible, and to allow systems to be broadly comparable, there are some guidelines that we expect participants to follow.

Which information can I use?

You are allowed to use the fact that the four classes of acoustic environments (BUS, CAF, PED, STR) are shared across datasets.

You are also allowed to use the environment and speaker labels in the training data, and the speaker labels in the development and test data.

You are encouraged to use the embedded training and development data and the corresponding noise-only recordings in any way that may help, e.g., to learn models of the acoustic environments and use them to recognize the test environment and/or to enhance the signal. The embedded test data may also be used, but only within the immediate acoustic context of each test utterance, that is, the 5 s preceding the utterance. Note that these 5 s may also contain speech, which is not always annotated.
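The 5 s context limit can be sketched as a small helper that returns the sample range of the embedded recording a system may look at before a test utterance. This is only an illustration: the function name `context_window`, its parameters and the 16 kHz sampling rate are assumptions made for the example, not part of the challenge tools.

```python
def context_window(utt_start_s, sample_rate=16000, max_context_s=5.0):
    """Return (first_sample, utt_start_sample) of the allowed context.

    utt_start_s   -- utterance start time in seconds within the embedded
                     recording (hypothetical input for this sketch)
    sample_rate   -- sampling rate in Hz (assumed to be 16 kHz here)
    max_context_s -- maximum context allowed before the utterance
    """
    utt_start = int(round(utt_start_s * sample_rate))
    context_len = int(round(max_context_s * sample_rate))
    # Never read before the beginning of the recording.
    first = max(0, utt_start - context_len)
    return first, utt_start


# An utterance starting 12.3 s into the recording may use the 5 s before it;
# an utterance starting at 2.0 s only has 2 s of context available.
late_start = context_window(12.3)
early_start = context_window(2.0)
```

Samples before `first_sample` would fall outside the allowed context and should not be used to transcribe that utterance.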

Which information shall I not use?

The systems should not exploit the following information in order to transcribe a given test utterance:

the environment label,
more than 5 seconds of context,
microphones other than those provided in the track-specific test data.

Automatic identification of the environment of the test utterance and the immediate acoustic context is allowed, though. The rationale is that a commercial ASR system to be deployed on a tablet should work in any environment just after the tablet has been switched on.

Similarly, manual refinement of the speech start and end times and manual annotation of the unannotated speech data are not allowed, but automatic refinement and automatic detection of the speech data in the 5 s context are allowed.

All parameters should be tuned on the training set or the development set. The system should not use different tuning parameters for different noisy environments or different data types (real or simulated). For example, the baseline script tunes the system with a single language model weight, which is optimized on the average WER over all recognition results in the development set, including all noisy environments and data types.
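The single-weight rule can be illustrated as follows. This is a minimal sketch and not the baseline script itself: the function names, the data layout and the error counts are hypothetical, and pooling errors over all development results is one reading of "average WER over all recognition results".

```python
def pooled_wer(results):
    """WER (%) pooled over a list of (word_errors, reference_words) pairs."""
    errors = sum(e for e, _ in results)
    words = sum(n for _, n in results)
    return 100.0 * errors / words


def pick_lm_weight(dev_results):
    """Pick ONE language model weight for all conditions.

    dev_results maps each candidate weight to a dict
    {(data_type, environment): (word_errors, reference_words)}.
    """
    return min(dev_results,
               key=lambda w: pooled_wer(list(dev_results[w].values())))


# Hypothetical development-set counts for three candidate weights.
example = {
    9:  {("real", "BUS"): (120, 1000), ("simu", "BUS"): (110, 1000)},
    11: {("real", "BUS"): (100, 1000), ("simu", "BUS"): (105, 1000)},
    13: {("real", "BUS"): (115, 1000), ("simu", "BUS"): (100, 1000)},
}
best_weight = pick_lm_weight(example)
```

In this toy example, weight 13 would be the best choice for the simulated data alone, but a weight selected separately per data type or per environment is exactly what the rule forbids; the single weight that minimizes the pooled WER is used everywhere.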

Which results should I report?

For every tested system, you should report 4 WERs (%), namely:

the WER on the real development set
the WER on the simulated development set
the WER on the real test set
the WER on the simulated test set

For instance, here are the WERs (%) achieved by the baseline GMM and DNN models (the WERs on test data will be available later). All these results are obtained by training on noisy multicondition data (channel 5) and testing on data enhanced by BeamformIt. They were obtained for one run on one machine. If you run the baseline yourself, you will probably obtain slightly different results due to random initialisation and to machine-specific issues.

Track	Model	Development set (real)	Development set (simulated)	Evaluation set (real)	Evaluation set (simulated)
1ch	GMM	22.16	24.48	37.54	33.30
1ch	DNN+sMBR	14.67	15.67	27.68	24.13
1ch	DNN+RNNLM	11.57	12.98	23.70	20.84
2ch	GMM	16.22	19.15	29.03	27.57
2ch	DNN+sMBR	10.90	12.36	20.44	19.04
2ch	DNN+RNNLM	8.23	9.50	16.58	15.33
6ch	GMM	13.03	14.30	21.83	21.30
6ch	DNN+sMBR	8.14	9.07	15.00	14.23
6ch	DNN+RNNLM	5.76	6.77	11.51	10.90

Such results will make it possible to assess whether simulated data are a reliable way of predicting ASR performance on real data, for development and/or for test. This currently appears to be approximately true. You are encouraged to improve the simulation baseline so that results on simulated data predict results on real data even more closely.

Ultimately, only the results of the best system on the real test set will be taken into account in the final WER ranking of all systems. The best system is taken to be the one that performs best on the real development set.
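The selection rule can be sketched with the 6ch baseline figures reported on this page. The dictionary layout and the function name are assumptions made for this illustration only.

```python
def select_ranked_system(systems):
    """Return (name, real test WER) of the system used for ranking.

    The system is chosen by its WER on the real development set;
    the real test WER is only read off afterwards, never used to choose.
    """
    best = min(systems, key=lambda name: systems[name]["dev_real"])
    return best, systems[best]["test_real"]


# 6ch baseline WERs (%) on the real development and real test sets.
baseline_6ch = {
    "GMM":       {"dev_real": 13.03, "test_real": 21.83},
    "DNN+sMBR":  {"dev_real": 8.14,  "test_real": 15.00},
    "DNN+RNNLM": {"dev_real": 5.76,  "test_real": 11.51},
}
ranked = select_ranked_system(baseline_6ch)
```

Here the DNN+RNNLM system has the lowest WER on the real development set, so its real test WER is the one that would enter the ranking.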

For that system, you should report 16 WERs (one for each of the four development/test sets and each of the four environments). You should also provide the recognized transcriptions for all the sets, with time alignment information when applicable (if the format of the transcriptions is not standard, it must be described).

For instance, here are the WERs achieved by the baseline DNN+RNNLM system.

Track	Environment	Development set (real)	Development set (simulated)	Evaluation set (real)	Evaluation set (simulated)
1ch	BUS	15.13	11.90	35.93	16.49
1ch	CAF	11.81	15.90	24.60	23.91
1ch	PED	7.42	9.94	19.94	20.25
1ch	STR	11.90	14.19	14.36	22.71
2ch	BUS	10.90	8.19	25.37	10.66
2ch	CAF	7.96	12.15	15.97	18.21
2ch	PED	5.22	7.12	13.53	15.61
2ch	STR	8.82	10.55	11.45	16.85
6ch	BUS	7.39	6.02	16.86	7.68
6ch	CAF	5.77	8.10	10.18	11.54
6ch	PED	3.72	5.49	9.83	10.31
6ch	STR	6.18	7.48	9.19	14.06
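Per-environment WERs like those above can be obtained by scoring each utterance with a word-level edit distance and pooling the counts per (set, environment) pair. The sketch below is a generic illustration and not the scoring tool used by the baseline; the function names, the set labels and the toy transcripts are assumptions.

```python
from collections import defaultdict


def word_errors(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]


def wer_by_condition(utterances):
    """Pooled WER (%) per (dataset, environment).

    utterances: iterable of (dataset, environment, ref_text, hyp_text).
    """
    errors = defaultdict(int)
    words = defaultdict(int)
    for dataset, env, ref, hyp in utterances:
        ref_tokens, hyp_tokens = ref.split(), hyp.split()
        errors[(dataset, env)] += word_errors(ref_tokens, hyp_tokens)
        words[(dataset, env)] += len(ref_tokens)
    return {key: 100.0 * errors[key] / words[key] for key in words}


# Toy transcripts; the set labels are illustrative only.
toy = [
    ("dev_real", "BUS", "the cat sat", "the cat sat"),
    ("dev_real", "BUS", "on the mat", "on a mat"),
    ("test_simu", "STR", "hello world", "hello"),
]
toy_wers = wer_by_condition(toy)
```

With four sets (real and simulated, development and test) and four environments, the returned dictionary would hold the 16 values to report.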

Can I use different features, a different recogniser or more data?

You are entirely free in the development of your system, from the front end to the back end and beyond, and you may even use extra data, including clean data, additional noisy data created by running the provided simulation baseline (or an improved version thereof), or any other data.

However, you should provide enough information, results and comparisons, such that one can understand where the performance gains obtained by your system come from. For example, if your system is made of multiple blocks, we encourage you to separately evaluate and report the influence of each block on performance.

Specifically:

if you use extra training data, please also report the results of your system using the official training set, which consists of 1,600 real utterances and 7,138 simulated utterances; each utterance of the official training set can be considered in as many versions as needed (clean, noisy, enhanced...); you are even allowed to modify the acoustic simulation baseline provided that you mix each speech signal with the same noise signal as in the original simulated set (i.e., only the impulse responses can change, not the noise instance).
similarly, if you use extra development data, please also report the results of your system using the official development set, which consists of 410 real utterances and 410 simulated utterances for each environment;
if you use a language model other than those provided in the baseline, please also report the results of your system with one of the provided baseline language models (3-gram, 5-gram, or RNNLM);
any language model rescoring technique (e.g., MBR, DLM) can be used and reported as an official result as long as the technique is trained using official training data only, i.e., data in CHiME4/data/WSJ0/wsj0/doc/lng_modl/lm_train/;
in the case when your system can be split into a front end and a back end and your front end differs from the baseline, please also report the results obtained by combining the baseline front end (BeamformIt) with your back end;
similarly, in the case when your back end differs from the baseline, please also report the results obtained by combining your front end with one of the baseline back ends (GMM, DNN+sMBR, DNN+5-gram, or DNN+RNNLM).

The interface between the front end and the back end is taken to be at either the signal level or the feature level, depending on whether your front end operates in the signal domain or the feature domain.

Only the results obtained using the official training and development sets (including possible modifications of the acoustic simulation baseline as specified above) and one of the baseline language models will be taken into account in the final WER ranking of all systems.