Please read this page and Data page for detailed rules regarding this task and the associated data. To summarize:
- External data specified on this page and in the Data page may be used.
- Pretrained models specified on this page may be used.
- Other datasets and models can be proposed until March 7th (see Additional Models).
- Close-talk microphones cannot be used during inference.
- Automatic or manual domain identification is not allowed. Note that selecting one subset of microphones for one scenario (e.g. only outer-array mics on CHiME-6) and a different one for another scenario counts as domain identification. You also cannot use any a-priori information about the array topology, including the fact that some microphones belong to the same array, as this too would implicitly distinguish the domains.
- Automatic device/microphone selection is allowed.
- Self-supervised adaptation on evaluation data is allowed, but you need to perform it independently for each session.
- We do not permit the use of pre-trained language models such as BERT.
- We do permit the use of all official data and external data for language model training.
- Organizers can participate but are not eligible for prizes and will not appear in the official ranking.
Evaluation data comes from previously published datasets (as in the past CHiME-6 Challenge). We therefore rely on your goodwill and honesty to keep this Task fair among all participants.
We do not require participants to open-source their system; however, doing so is highly encouraged.
We require each participant to submit a 2-to-6-page system description paper (same deadline as the system submission). See the Submission Section for more info.
Note that Organizers will use this mandatory system description paper to assess the correctness of each submission.
Organizers can participate in this Challenge, but they are not eligible for prizes and will not appear in the official ranking; their submission will be regarded as a baseline system v2.
In addition to the allowed external datasets and pre-trained models listed on this page below, we encourage participants to propose new ones.
The proposal period will be open until March 7th, AoE (Anywhere on Earth).
When a proposal is accepted, we will update the lists here and notify participants (via the Slack Workspace and Google Group) about the newly allowed material/model.
Proposals will be evaluated based on:
- How easily it can be used by other participants (e.g. massive datasets such as LibriVox will not be accepted, and neither will overly large pre-trained models, as not all participants can leverage vast computational resources).
- Pre-trained model: which data was used for training and validation? We cannot accept models such as Whisper, as we do not know whether their training and validation data overlap with this Task's evaluation data.
- What is the scientific usefulness/motivation for adding such an additional dataset/model?
You are allowed to use all information as available in the training and development partitions of CHiME-6, DiPCo and Mixer6 Speech datasets. These are the official Task 1 datasets and are described in detail in the Data page. This includes all metadata such as reference device, oracle diarization, speaker ID etc.
Participants are free to re-arrange training and development partitions to suit their needs.
The training and evaluation partitions are different from the ones in the previous CHiME-6 Challenge! Please use the new partitions when training, to be sure you are not training on evaluation data.
Also, the Mixer 6 Speech annotation in this Task differs from the one available from the LDC Mixer 6 page.
So please refer to Data page and follow the instructions there on how to obtain and generate this Task data.
In addition to these official datasets, participants can use data from the external datasets listed here.
There is no limitation on how these external datasets are used; they can also be combined with the official datasets, e.g. to create synthetic datasets.
Participants can use data-augmentation techniques without restrictions. This includes automatic methods, e.g. room impulse response generation (with tools such as Pyroomacoustics),
as well as more sophisticated approaches such as deep-learning-based generative methods.
Obviously, these latter should use only the allowed datasets (external and/or official).
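As a concrete illustration of reverberation-based augmentation, here is a minimal numpy-only sketch. The exponentially decaying noise "RIR" is a toy stand-in for a proper simulator such as Pyroomacoustics; the function names and parameters are our own, not part of any Task tooling.

```python
import numpy as np

def synthetic_rir(fs=16000, rt60=0.4, length_s=0.5, seed=0):
    """Crude synthetic room impulse response: exponentially decaying noise.
    A toy stand-in for real simulators (e.g. Pyroomacoustics); rt60 sets the decay."""
    rng = np.random.default_rng(seed)
    n = int(fs * length_s)
    t = np.arange(n) / fs
    decay = np.exp(-6.9 * t / rt60)   # ~60 dB amplitude decay after rt60 seconds
    rir = rng.standard_normal(n) * decay
    rir[0] = 1.0                      # direct path
    return rir / np.max(np.abs(rir))

def reverberate(dry, rir):
    """Convolve a dry (e.g. close-talk) signal with an RIR to simulate far-field."""
    wet = np.convolve(dry, rir)[: len(dry)]
    return wet / (np.max(np.abs(wet)) + 1e-9)

# Example: augment one second of a dummy "clean" waveform.
dry = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
wet = reverberate(dry, synthetic_rir())
```

In a real pipeline the dry signal would come from one of the allowed datasets (e.g. CHiME-6 close-talk recordings) and the RIRs from an allowed RIR corpus or a non-data-driven simulator.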
We do not impose any restriction on the size of the datasets used to train the system components, but we ask participants to carefully describe the data and the data-augmentation techniques used in their technical report.
See Submission page for more details.
What information the participants can use in inference depends on the track.
As explained in Task 1 page, in this Task there is a main-track (joint ASR and diarization) and a sub-track (ASR-only with oracle diarization).
In all tracks, using data from the close-talk microphones is forbidden during inference. See the Data page for more information on the close-talk microphones and data generation.
Note that our data generation scripts avoid generating data from these microphones for the evaluation set, so you should not worry much about this.
Of course, using data from the CHiME-6, DiPCo and Mixer 6 evaluation partitions for supervised training/tuning purposes is strictly forbidden.
However, self-supervised adaptation is allowed as long as it is applied on each session in an independent manner.
This means that, if you choose to do so, you need to perform adaptation independently on each session of each scenario (as the system should be unique and cannot use domain identification).
Per-session adaptation reflects a real-world application where you need to transcribe each meeting as soon as it is finished and cannot wait to batch-process all of them.
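The per-session constraint can be sketched as follows. This is a hypothetical illustration with a stand-in "model" and a placeholder update rule; the point is purely structural: every session starts from the same base model, adapted states are never shared across sessions, and no session or domain label is consulted.

```python
import copy

base_model = {"bias": 0.0}  # stand-in for real model parameters

def self_supervised_adapt(model, session_audio):
    # Placeholder update; a real system might use pseudo-labeling,
    # entropy minimization, feature-normalization statistics, etc.
    model["bias"] += sum(session_audio) / len(session_audio)
    return model

sessions = {"S01": [0.1, 0.2], "S02": [-0.3, 0.1]}  # dummy per-session data
adapted = {}
for sid, audio in sessions.items():
    # Fresh copy of the base model per session: adaptation is independent
    # and never carries state from one session to another.
    adapted[sid] = self_supervised_adapt(copy.deepcopy(base_model), audio)
```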
As explained in the Task 1 page, participants have to perform segmentation and transcription of each session. Participants are free to use all or a subset of the available far-field microphones, but note that in the evaluation scenarios the microphones could be placed differently and the setting could also change (e.g. a different room).
Automatic methods for device selection are allowed.
Automatic or manual domain identification is not allowed.
The submitted system must be one and the same across all evaluation sessions and the three different evaluation portions (CHiME-6, DiPCo and Mixer 6 Speech).
Also, prior knowledge of the total number of speakers in each session cannot be used. This also means that a different subset of microphones cannot be used for each scenario: you should use the same strategy for all scenarios, e.g. the best 50% of microphones as selected via an automatic channel-selection method. You cannot rely on prior knowledge of the array topology (e.g. that DiPCo has circular arrays while CHiME-6 has linear ones), nor use a-priori information about the number of microphones or the fact that some microphones belong to the same array.
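A scenario-agnostic selection rule of the kind allowed here can be sketched as follows. The energy-based score is just an illustrative choice (real systems often use envelope-variance or similar measures); what matters is that the same automatic rule is applied everywhere, with no topology or scenario knowledge.

```python
import numpy as np

def select_channels(multichannel, keep_frac=0.5):
    """Keep the top `keep_frac` fraction of channels ranked by a simple
    per-channel score (here: signal energy). Purely automatic: no
    array-topology or scenario knowledge, so the same rule applies
    to every session of every scenario."""
    scores = np.sum(multichannel ** 2, axis=1)   # energy per channel
    k = max(1, int(round(len(scores) * keep_frac)))
    best = np.argsort(scores)[::-1][:k]          # indices of top-k channels
    return np.sort(best)

# Dummy 6-channel recording with very different per-channel levels.
rng = np.random.default_rng(0)
gains = np.array([[1.0], [0.1], [2.0], [0.5], [1.5], [0.2]])
x = rng.standard_normal((6, 16000)) * gains
picked = select_channels(x)   # keeps the 3 most energetic channels
```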
Participant systems must be able to estimate the number of speakers automatically and cannot rely on the fact that data comes from CHiME-6, DiPCo or Mixer 6 Speech. It is, however, possible to assume that the number of speakers in each session never exceeds 4.
Participants are provided with oracle diarization at inference. This sub-track thus reduces to transcribing each speaker's utterances.
Again, automatic or manual domain identification and the use of on-person close-talk microphones are also forbidden here.
Please note that, contrary to CHiME-6, for this track too we will not allow/provide a reference device. As in the main track, participants will have to resort to automatic device/microphone selection or to methods that fuse information across several devices. Note that participants are free to re-segment the data for inference and submit a JSON with a different segmentation than the provided oracle one for this track.
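For illustration, building such a re-segmented JSON could look like the sketch below. The field names here are assumed for the example only; the authoritative submission schema is the one defined on the Data page.

```python
import json

# Illustrative segment list; field names are assumed for this sketch,
# check the Data page for the exact submission schema.
segments = [
    {"session_id": "S01", "speaker": "spk1",
     "start_time": "10.21", "end_time": "12.87",
     "words": "hello everyone"},
    {"session_id": "S01", "speaker": "spk2",
     "start_time": "13.02", "end_time": "14.55",
     "words": "hi there"},
]
payload = json.dumps(segments, indent=2)
```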
Contrary to the previous CHiME-6 Challenge, participants are free to use whatever LM technique they prefer, and both official and external data to train such a system.
Suitable text-based external open-source datasets can also be proposed; see the proposal process above. However, we do not allow the use of large-scale pre-trained LM models (e.g. BERT).
This is because our main focus is on acoustic robustness, even if the integration between ASR and large-scale LMs is an interesting direction.
We allow to use existing open-source datasets such as:
- RWCP Sound Scene Database
- REVERB Challenge RIRs.
- Aachen AIR dataset.
- BUT Reverb database.
- SLR28 RIR and Noise Database (contains Aachen AIR, MUSAN noise, RWCP sound scene database and REVERB challenge RIRs, plus simulated ones).
- VoxCeleb 1&2
- WSJ0-2mix, WHAM, WHAMR, WSJ
Datasets derived from the ones listed here (including artificial reverberation created with non-data-driven methods, such as the image method) can also be used by participants. You can use whatever source you prefer (among the ones listed here, of course) to obtain this data (e.g. SLR28 instead of REVERB directly).
Importantly this definition includes the synthetic LibriCHiME-5 dataset used in CHiME-7 Task 2.
You are also allowed to use e.g. WHAMR RIRs and noises and combine them with other datasets included here or with the official data e.g. CHiME-6 close-talk microphones.
Regarding the use of methods for creating artificial reverberation, see above.
Some of the aforementioned external datasets may have overlapping data (e.g. SLR28 database containing noises from MUSAN).
Again, see Submission page for details on how participants can propose additional external datasets.
Here, the main motivation for including the LibriSpeech, FSD50k, SINS and WSJ0-2mix (plus WHAM etc.) datasets is to allow participants to also use supervised speech separation and enhancement methods, as this was not explored in the past CHiME-6 Challenge.
MUSAN and VoxCeleb 1&2 were already allowed in CHiME-6 Challenge, for respectively, data augmentation and speaker embedding extraction training.
We allow open-source models for which the training and validation data is clearly defined and we are sure it does not overlap with the Task evaluation data. If you are unsure or have questions, please reach out to us!
This list includes:
- Wav2vec 2.0:
- Fairseq:
- wav2vec2-large lv60 + speaker verification
- Other models on Huggingface using the same weights as the Fairseq ones.
- X-vector extractor
- Pyannote Segmentation
- Pyannote Diarization (Pyannote Segmentation+ECAPA-TDNN from SpeechBrain)
- NeMo toolkit ASR pre-trained models:
- NeMo toolkit speaker ID embeddings models:
- NeMo toolkit VAD models:
- NeMo toolkit diarization models:
If you want to use your own (e.g. in-house) speaker embedding extraction model, then it should be trained only on the Task's official datasets (CHiME-6, DiPCo and Mixer 6) plus the allowed external datasets listed above.
If other data is used, you can consider open-sourcing it and proposing it as an additional open-source pretrained model that every participant can use.