Rules
The following rules apply to all systems participating in the CHiME-9 Task 2 (ECHI) challenge:
- Training data: You may use the training subset of the ECHI dataset and any external datasets explicitly listed in the Data and pre-trained models section. If you believe a public dataset is missing, you may propose its inclusion before the deadline specified below.
- Development data: The development subset of ECHI may be used for evaluation purposes only. It must not be used for training or automatic tuning.
- Pre-trained models: Only the models listed in the Data and pre-trained models section may be used. If a useful model is missing, you may propose it for inclusion before the deadline.
- Streaming requirement: Systems must be streaming in nature — that is, they should process inputs sequentially over time with a maximum algorithmic latency of 20 ms. Systems must not access or rely on global information from a recording (e.g., global normalization, non-streaming speaker identification or diarization).
- Latency documentation: The report accompanying system submissions must include a clearly labeled section titled “Latency”, detailing any lookahead, chunk-based processing, and other latency-relevant characteristics. It must also include an explicit estimate of the average algorithmic and emission latency (an example latency calculation is sketched after the note below). Only systems meeting the 20 ms algorithmic latency constraint will be ranked.
- Evaluation integrity: Each recording in the evaluation set must be processed independently. The system must not be fine-tuned on the evaluation data or use global information across recordings. Within a session, systems must treat the Aria and the hearing aid devices independently.
Note: We are also interested in systems that do not fully meet the 20 ms latency constraint; such systems and the corresponding papers are welcome as contributions to the CHiME workshop. Likewise, if your system violates any of the above rules (e.g., it uses private data), you may still submit it, but it will not be included in the official rankings or considered for the subjective listening test stage.
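To illustrate the latency bookkeeping expected in the “Latency” section, the sketch below shows one common way of accounting for the algorithmic latency of a chunk-based system. The window, hop, and lookahead values are hypothetical examples, not prescribed settings.

```python
# Minimal sketch (assumed accounting, not official tooling): for an STFT/chunk-based
# system, the algorithmic latency is commonly taken as the analysis window length
# plus any lookahead frames the model waits for before emitting output.
def algorithmic_latency_ms(window_ms: float, hop_ms: float, lookahead_frames: int) -> float:
    """Algorithmic latency in milliseconds for a chunk-based system."""
    return window_ms + lookahead_frames * hop_ms

# Hypothetical example: a 16 ms window with a 4 ms hop and no lookahead
# gives 16 ms (within the 20 ms limit); two lookahead frames give 24 ms (too high).
print(algorithmic_latency_ms(16.0, 4.0, 0))  # 16.0
print(algorithmic_latency_ms(16.0, 4.0, 2))  # 24.0
```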
Evaluation
Each system should output a single-channel rendition of each of the three conversational partners speaking to the device wearer (using either the hearing aid [HA] or Aria glasses). (For file naming conventions and format details see Submission; for how to evaluate your signals locally see Baseline and Evaluation Framework).
These three output streams will be evaluated using:
- Objective measures: To assess the intelligibility and quality of each stream.
- Listening tests: To assess listener preference when hearing the combined (summed) signal of all three streams.
For the development data, participants will be provided with:
- Time-stamped segmentations marking the start and end of each speech segment per speaker.
- Corresponding reference audio for each segment (see Reference Signals).
For the final evaluation dataset, the noisy signals will be provided but the segmentation and reference audio will be withheld. Entrants will submit their enhanced signals and metrics will be computed by the organisers.
Final system rankings will be based solely on listening test results. Objective metrics are well known to correlate only partly with listening tests, and this is particularly to be expected for the CHiME9-ECHI recordings given how the reference signals are constructed (see Reference Signals). Participants should therefore treat the provided objective metrics and the resulting scores as indicative only. Objective metrics will be reported for all submissions to help guide development but will not affect rankings.
Objective Measures
Objective evaluation will include both reference-free metrics (also known as independent or non-intrusive metrics) and reference-based metrics (also known as dependent or intrusive metrics). All metrics are computed using the VERSA toolkit. Metrics are computed per speech segment, per speaker, and averaged across all sessions in the evaluation set; a segment-length weighted average is provided. The evaluation code will provide overall scores, as well as per-session and per-participant statistics for each metric.
The metrics will be computed and reported separately using two types of signal:
- individual - using segments from the separate speaker streams.
- summed - the three individual speaker streams are summed (and similarly for the references).
We expect the summed metrics to better reflect the quality of the signals used in the listening test, which will be constructed by summing the individual speaker streams to reconstitute the conversation. For example, a system that separates the conversation from the background but misallocates conversation partners between the participant streams may do poorly on the individual metrics yet produce perfectly good signals when summed.
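As a concrete illustration, the sketch below shows how a segment-length weighted average over per-segment scores can be formed, and how three enhanced speaker streams are summed for the ‘summed’ evaluation. It is not the official evaluation code; the function names and values are illustrative placeholders.

```python
import numpy as np

def weighted_average(scores, segment_lengths):
    """Average per-segment metric scores, weighting each segment by its length."""
    scores = np.asarray(scores, dtype=float)
    lengths = np.asarray(segment_lengths, dtype=float)
    return float(np.sum(scores * lengths) / np.sum(lengths))

def sum_streams(stream_a, stream_b, stream_c):
    """Reconstitute the conversation by summing the three single-channel speaker
    streams (assumed time-aligned, equal length, same sample rate)."""
    return stream_a + stream_b + stream_c

# Hypothetical per-segment scores and segment lengths in samples:
print(weighted_average([0.8, 0.6, 0.9], [16000, 48000, 32000]))  # length-weighted mean
```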
INDEPENDENT METRICS
The following independent metrics will be reported.
- Deep Noise Suppression MOS Score of P.835 (DNSMOS) (1)
- Deep Noise Suppression MOS Score of P.808 (DNSMOS) (2)
- Non-intrusive Speech Quality and Naturalness Assessment (NISQA) (3)
- UTokyo-SaruLab System for VoiceMOS Challenge 2022 (UTMOS) (4)
- PESQ in TorchAudio-Squim (6)
- STOI in TorchAudio-Squim (7)
- SI-SDR in TorchAudio-Squim (8)
Numbers in parentheses refer to the table row on the VERSA GitHub page here.
DEPENDENT METRICS
The following dependent metrics will be reported.
- Mel Cepstral Distortion (MCD) (1)
- Signal-to-interference Ratio (SIR) (4)
- Signal-to-artifact Ratio (SAR) (5)
- Signal-to-distortion Ratio (SDR) (6)
- Convolutional scale-invariant signal-to-distortion ratio (CI-SDR) (7)
- Scale-invariant signal-to-noise ratio (SI-SNR) (8)
- Perceptual Evaluation of Speech Quality (PESQ) (9)
- Short-Time Objective Intelligibility (STOI) (10)
- Frequency-Weighted SEGmental SNR (FWSEGSNR) (20)
- Weighted Spectral Slope (WSS) (21)
- Cepstrum Distance Objective Speech Quality Measure (CD) (22)
- Composite Objective Speech Quality (23)
- Coherence and speech intelligibility index (CSII) (24)
- Normalized-covariance measure (NCM) (25)
Numbers in parentheses refer to the table row on the VERSA GitHub page here.
NON-MATCH METRICS
In addition, the following metrics listed by VERSA as ‘non match metrics’ will be used:
- MOS in TorchAudio-Squim (2)
- Log Likelihood Ratio (LLR) (11)
Numbers in parentheses refer to the table row on the VERSA GitHub page here.
Non-match metrics are those where the reference signal is not necessarily a matched counterpart of the signal being assessed. They are useful in situations such as ECHI where the precise noise-free ground truth is hard to obtain.
Note that independent metrics can be misleading: they may correlate only weakly with the final listening-test assessment, for example by being positively impacted by competing or backchannel speech, or by non-speech sounds such as laughter. Dependent metrics may likewise correlate only weakly because of how the reference signals are constructed. We therefore advise challenge participants to listen to the signals enhanced by their systems rather than relying blindly on metric scores alone.
The metrics above are computed using the VERSA metric config file shown below:
- name: nisqa
- name: pseudo_mos
  predictor_types: ["utmos", "dnsmos"]
- name: pysepm
- name: pesq
- name: signal_metric
- name: stoi
- name: squim_no_ref
- name: squim_ref
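For reference, the sketch below shows one way to drive VERSA with this configuration. The config file name and directory paths are our own placeholders, and the scorer entry point and flag names are assumptions based on the VERSA README; check the toolkit documentation and the provided evaluation framework for the exact invocation.

```python
# Hedged sketch: run the VERSA scorer on directories of enhanced ("pred") and
# reference ("gt") segments using the config listed above, saved as a YAML file.
# The entry point and flag names are assumed from the VERSA README and may differ
# between versions; prefer the challenge's own evaluation scripts where provided.
import subprocess

subprocess.run(
    [
        "python", "versa/bin/scorer.py",
        "--score_config", "echi_metrics.yaml",  # hypothetical file holding the config above
        "--pred", "enhanced_segments/",         # hypothetical directory of system outputs
        "--gt", "reference_segments/",          # hypothetical directory of reference signals
        "--output_file", "scores.txt",
    ],
    check=True,
)
```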
⚠️ Latency rule: Participants may apply a fixed delay (up to 20 ms) across all submitted output streams to match reference timing. This single delay must be consistent across all recordings and clearly reported.
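A minimal sketch of applying such a fixed delay, assuming NumPy arrays and using 20 ms purely as an example (the variable names are illustrative, and any trimming to the expected output length depends on the submission format):

```python
import numpy as np

def apply_fixed_delay(stream: np.ndarray, delay_ms: float, fs: int) -> np.ndarray:
    """Delay a single-channel output stream by prepending zeros; the same delay_ms
    must be used for every stream and every recording, and reported."""
    n_delay = int(round(delay_ms * fs / 1000.0))
    return np.concatenate([np.zeros(n_delay, dtype=stream.dtype), stream])

# Example: delay all three submitted streams by the same 20 ms at 16 kHz.
fs = 16000
streams = [np.random.randn(fs).astype(np.float32) for _ in range(3)]  # placeholder audio
delayed = [apply_fixed_delay(s, 20.0, fs) for s in streams]
```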
Listening Tests
Subjective evaluation will be conducted using MUSHRA-style listening tests, where human listeners assess segments formed by summing the three processed speaker streams into a single channel.
If too many submissions are received to conduct a full listening test for all systems, a pre-listening test with a reduced number of signals will be conducted to short-list the most promising systems.
Listeners will be asked to assess:
- target signal quality
- background noise quality
- overall signal quality
- intelligibility (either subjectively self-reported or assessed semi-formally, since formal intelligibility assessment, e.g. via matrix tests, is not possible given the nature of the CHiME9-ECHI dataset)
(More information on the full methodology will be provided later.)
External data and pre-trained models
Besides the ECHI dataset published with this challenge, participants are allowed to use the public datasets and pre-trained models listed below. If you want to propose an additional dataset or pre-trained model for these lists, contact us by August 18th, 2025. If you want to use a private dataset or model, you may still submit your system to the challenge, but we will not include it in the final rankings.
Participants may use these publicly available datasets for building the systems:
- AMI
- LibriSpeech
- TEDLIUM
- MUSAN
- RWCP Sound Scene Database
- REVERB Challenge RIRs
- Aachen AIR dataset
- BUT Reverb database
- SLR28 RIR and Noise Database (contains Aachen AIR, MUSAN noise, RWCP sound scene database and REVERB challenge RIRs, plus simulated ones)
- VoxCeleb 1&2
- FSD50k
- WSJ0-2mix, WHAM, WHAMR, WSJ
- SINS
- LibriCSS acoustic transfer functions (ATF)
- NOTSOFAR1 simulated CSS dataset
- Ego4D
- Project Aria Datasets
- DNS challenge noises
In addition, the following pre-trained models may be used:
- Wav2vec:
- Wav2vec 2.0:
- Fairseq:
- All models including Wav2Vec 2.0 Large (LV-60 + CV + SWBD + FSH) and the multi-lingual XLSR-53 56k
- Torchaudio:
- Huggingface:
- facebook/wav2vec2-base-960h
- facebook/wav2vec2-large-960h
- facebook/wav2vec2-large-960h-lv60-self
- facebook/wav2vec2-base
- facebook/wav2vec2-large-lv60
- facebook/wav2vec2-large-xlsr-53
- wav2vec2-large lv60 + speaker verification
- Other models on Huggingface using the same weights as the Fairseq ones.
- S3PRL
- Fairseq:
- HuBERT
- WavLM
- Tacotron2
- ECAPA-TDNN
- X-vector extractor
- Pyannote Segmentation
- Pyannote Diarization (Pyannote Segmentation+ECAPA-TDNN from SpeechBrain)
- NeMo toolkit ASR pre-trained models:
- NeMo toolkit speaker ID embeddings models:
- NeMo toolkit VAD models:
- NeMo toolkit diarization models:
- Whisper
- OWSM: Open Whisper-style Speech Model
- Icefall Zipformer
- RWKV Transducer