
  • The system must be the same across all three scenarios, with the same hyperparameters: automatic or manual domain identification is not allowed.
  • Inference has to be performed by treating each session independently.
  • Only the allowed external datasets, along with the core datasets, can be used.
    • Automatic data augmentation is permitted.
  • Only pretrained models specified on this page can be used, see Pre-Trained Models.
    • Pre-trained large language models are allowed only on the unconstrained LM track.
    • We always permit the use of all official and external data for language model training.

You CANNOT use in inference (BOTH evaluation AND development sets):

  • Any form of ground-truth annotation (even if these are already available from previous challenges for some datasets).
    • You cannot re-arrange or mix training and development data.
    • A leaderboard is available for scoring on the development set; see the Submissions page.
  • Close-talk microphones.
  • Prior information on the exact number of speakers, whatever the scenario.
    • ✅ however you can assume that the maximum number of total speakers never exceeds 8 in a single session.
  • ⚠️ Normalization techniques that accumulate statistics over multiple evaluation sessions (or over the entire dev/eval sets !).
    • Kaldi (and also ESPnet) does this BY DEFAULT in many recipes, be careful !
      • e.g. in the TS-VAD script, CMVN should be computed online (per session) rather than dataset-wide.
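The per-session constraint above can be sketched as follows. This is a minimal illustration, not part of any toolkit; `per_session_cmvn` is a hypothetical helper name:

```python
import numpy as np

def per_session_cmvn(feats: np.ndarray) -> np.ndarray:
    """Mean/variance-normalize features using statistics from ONE session only.

    feats: (num_frames, feat_dim) array, e.g. log-mel features of a single
    session. Accumulating the mean/std over the whole dev/eval set instead
    of per session would violate the rule above.
    """
    mean = feats.mean(axis=0, keepdims=True)
    std = feats.std(axis=0, keepdims=True) + 1e-8  # avoid division by zero
    return (feats - mean) / std
```

The key point is that the statistics are recomputed from scratch for every session, so no information flows across sessions or across the dev/eval sets.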

Propose Your External Data and Pre-Trained Models !
(Deadline 20th March AoE)

In addition to current external datasets and pre-trained models, we encourage participants to propose new ones.
The proposal period will remain open until the 20th March AoE (anywhere on earth).
When a proposal is accepted, we will update the lists here and notify participants (via the Slack Workspace and [Google Group][google-group]) about the newly allowed dataset/model.

Please reach out to us via the [Google Group][google-group] or in the Slack Workspace if you want to propose additional external datasets or pre-trained models.

What Models/Datasets Can I Propose ?

When evaluating a proposal for a new external model or external dataset, we will take the following into consideration:

  • How easily it can be used by other participants (e.g. massive datasets such as LibriVox will not be accepted).
  • Pre-trained models: is there a risk that this challenge's evaluation data has been used in the model's training or validation ?
  • What is the scientific usefulness/motivation for adding such an additional dataset/model ?

❓ Rules FAQs


  • You are allowed to use all information available in the training partitions of core datasets.
    • You can freely use all metadata, such as the reference device, oracle diarization, speaker IDs, etc.
      • Of course, this does not apply to evaluation.
  • Participants can also use data from the external datasets.
    • There is no limitation on how these external datasets are used. You can also combine them with the core datasets to create synthetic datasets.

Data Augmentation

You can use any data-augmentation technique without restriction.
This includes automatic methods such as room impulse response generation (e.g. with Pyroomacoustics), as well as more sophisticated approaches such as deep-learning-based generative methods, as long as these are trained only on the allowed data.
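As an illustration of RIR-based augmentation, here is a minimal numpy sketch. The exponentially decaying synthetic RIR below is a toy stand-in for one generated with an image-source simulator such as Pyroomacoustics:

```python
import numpy as np

def augment_with_rir(clean: np.ndarray, rir: np.ndarray) -> np.ndarray:
    """Convolve a clean utterance with a room impulse response,
    trimmed back to the original length."""
    return np.convolve(clean, rir)[: len(clean)]

# Toy synthetic RIR: exponentially decaying white noise. A real setup
# would simulate it from room geometry (e.g. Pyroomacoustics' ShoeBox).
rng = np.random.default_rng(0)
rir = rng.standard_normal(800) * np.exp(-np.arange(800) / 120.0)
rir /= np.abs(rir).max()

clean = rng.standard_normal(16000)  # 1 s of audio at 16 kHz
reverberant = augment_with_rir(clean, rir)
```

Because the RIR generation is fully automatic, this kind of augmentation is permitted under the rules above.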

What does a unique system across all scenarios mean ?

The system must be the same across all scenarios and sessions before inference is performed.
E.g. you cannot have a different system for each scenario or for different array configurations.
This also implies that the hyperparameters must be the same across all scenarios.

⚠️ The system may adapt itself (e.g. through self-training via pseudo-labels) on each session independently, thus modifying its parameters in each inference run.
However, this adaptation must restart from scratch for each subsequent session.
The original system (prior to adaptation) must be the same, with the same hyperparameters, across all sessions and scenarios.
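The allowed adaptation scheme can be sketched as follows. `DummySystem` and its `adapt`/`transcribe` methods are hypothetical placeholders, not an actual challenge API:

```python
import copy

class DummySystem:
    """Hypothetical stand-in for a full speech recognition system."""
    def __init__(self):
        self.adapt_steps = 0
    def adapt(self, session):
        self.adapt_steps += 1  # e.g. one round of pseudo-label self-training
    def transcribe(self, session):
        return (session, self.adapt_steps)

def run_inference(base_system, sessions):
    """Per-session self-adaptation that respects the uniqueness rule.

    base_system: the single system (same weights and hyperparameters for
    all scenarios). Each session adapts a *fresh copy* of it, so no
    information leaks from one session to the next.
    """
    results = []
    for session in sessions:
        system = copy.deepcopy(base_system)  # restart from the original system
        system.adapt(session)
        results.append(system.transcribe(session))
    return results
```

Carrying the adapted `system` over to the next loop iteration, instead of copying `base_system` again, would violate the rule.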

Recording Setup Information

You cannot use prior information about the array type, the number of arrays, or, more broadly, the recording setup.

⛔ These include:

  • selecting a subset of microphones for one scenario and a DIFFERENT subset for another scenario.
  • using any a-priori information about the array topology, including the fact that some microphones belong to the same array.

These practices are equivalent to domain identification; in this challenge we focus on systems that are as general as possible, with as few assumptions as possible about the recording setup.

✅ Automatic channel selection is encouraged.
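One simple form of automatic channel selection is to rank channels by a signal-derived statistic such as energy (envelope variance is another common choice). The sketch below is an illustrative heuristic under that assumption, not a prescribed method:

```python
import numpy as np

def select_channels(multichannel: np.ndarray, k: int = 2) -> np.ndarray:
    """Keep the k channels with the highest average signal energy.

    multichannel: (num_channels, num_samples). The ranking is computed
    from the signals themselves, so it uses no prior knowledge of the
    array topology (allowed). Hand-picking channel indices per scenario
    would instead count as domain identification (forbidden).
    """
    energy = (multichannel ** 2).mean(axis=1)
    best = np.argsort(energy)[::-1][:k]     # indices of the k strongest channels
    return multichannel[np.sort(best)]      # keep original channel order
```

The same selection rule, with the same `k`, must then be applied unchanged to every session and scenario.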

What about Language Modeling (LM) ?

In BOTH challenge tracks (unconstrained and constrained LMs, see main page):

  • you can use all core and external data to train whatever LM you like.

Only in the unconstrained LM track:

  • you can additionally use the allowed pre-trained large language models (see Pre-Trained Models).

Is it required to open-source the final system ?

We do not require participants to open-source their systems.
However, it is highly encouraged and welcomed.

🤖 Allowed Pre-Trained Models

We allow open-source models for which the training and validation data is clearly defined and for which we are sure there is no overlap with the Task evaluation data. If you are unsure or have questions, please reach out to us!
This list includes:

Pre-trained ASR Models

Self-supervised Learning Representation (SSLR)

Speaker ID Embeddings/Diarization Models