Rules
The following rules apply to all systems participating in the CHiME-9 Task 2 (ECHI) challenge:
- Training data: You may use the training subset of the ECHI dataset and any external datasets explicitly listed in the Data and pre-trained models section. If you believe a public dataset is missing, you may propose its inclusion before the deadline specified below.
- Development data: The development subset of ECHI may be used for evaluation purposes only. It must not be used for training or automatic tuning.
- Pre-trained models: Only the models listed in the Data and pre-trained models section may be used. If a useful model is missing, you may propose it for inclusion before the deadline.
- Streaming requirement: Systems must be streaming in nature — that is, they should process inputs sequentially over time with a maximum algorithmic latency of 20 ms. Systems must not access or rely on global information from a recording (e.g., global normalization, non-streaming speaker identification or diarization).
- Latency documentation: The report accompanying system submissions must include a clearly labeled section titled “Latency”, detailing any lookahead, chunk-based processing, and other latency-relevant characteristics. It must also include an explicit estimate of the average algorithmic and emission latency (an example latency calculation is sketched after the note below). Only systems meeting the 20 ms algorithmic latency constraint will be ranked.
- Evaluation integrity: Each recording in the evaluation set must be processed independently. The system must not be fine-tuned on the evaluation data or use global information across recordings. Within a session, systems must treat the Aria and the hearing aid devices independently.
Note: We are also interested in systems that do not fully meet the 20 ms latency constraint; such systems and the corresponding papers are welcome as contributions to the CHiME workshop. Likewise, if your system violates any of the above rules (e.g., it uses private data), you may still submit it, but it will not be included in the official rankings or considered for the subjective listening test stage.
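To illustrate the latency bookkeeping expected in the “Latency” section, the sketch below shows one common way of accounting for the algorithmic latency of a chunk-based system. The window, hop, and lookahead values are hypothetical examples, not prescribed settings.

```python
# Minimal sketch (assumed accounting, not official tooling): for an STFT/chunk-based
# system, the algorithmic latency is commonly taken as the analysis window length
# plus any lookahead frames the model waits for before emitting output.
def algorithmic_latency_ms(window_ms: float, hop_ms: float, lookahead_frames: int) -> float:
    """Algorithmic latency in milliseconds for a chunk-based system."""
    return window_ms + lookahead_frames * hop_ms

# Hypothetical example: a 16 ms window with a 4 ms hop and no lookahead
# gives 16 ms (within the 20 ms limit); two lookahead frames give 24 ms (too high).
print(algorithmic_latency_ms(16.0, 4.0, 0))  # 16.0
print(algorithmic_latency_ms(16.0, 4.0, 2))  # 24.0
```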
Evaluation
Each system should output a single-channel rendition of each of the three conversational partners speaking to the device wearer (using either the hearing aid [HA] or Aria glasses). (For file naming conventions and format details see Submission; for how to evaluate your signals locally see Baseline and Evaluation Framework).
These three output streams will be evaluated using:
- Objective measures: To assess the intelligibility and quality of each stream.
- Listening tests: To assess listener preference when hearing the combined (summed) signal of all three streams.
For the development data, participants will be provided with:
- Time-stamped segmentations marking the start and end of each speech segment per speaker.
- Corresponding reference audio for each segment (see Reference Signals).
For the final evaluation dataset, the noisy signals will be provided but the segmentation and reference audio will be withheld. Entrants will submit their enhanced signals and metrics will be computed by the organisers.
Final system rankings will be based solely on listening test results. Objective metrics are well known to correlate only partly with listening tests, and this is particularly to be expected for the CHiME9-ECHI recordings given how the reference signals are constructed (see Reference Signals). Participants should therefore treat the provided objective metrics and the resulting scores as indicative only. Objective metrics will be reported for all submissions to help guide development but will not affect rankings.
Objective Measures
Objective evaluation will include both reference-free metrics (also known as independent or non-intrusive metrics) and reference-based metrics (also known as dependent or intrusive metrics). All metrics are computed using the VERSA toolkit. Metrics are computed per speech segment, per speaker, and averaged across all sessions in the evaluation set; a segment-length weighted average is provided. The evaluation code will provide overall scores, as well as per-session and per-participant statistics for each metric.
The metrics will be computed and reported separately using two types of signal:
- individual - using segments from the separate speaker streams.
- summed - the three individual speaker streams are summed (and similarly for the references).
We expect the summed metrics to better reflect the quality of the signals used in the listening test, which will be constructed by summing the individual speaker streams to reconstitute the conversation. For example, a system that separates the conversation from the background but misallocates conversation partners between the participant streams may do poorly on the individual metrics yet produce perfectly good signals when summed.
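As a concrete illustration, the sketch below shows how a segment-length weighted average over per-segment scores can be formed, and how three enhanced speaker streams are summed for the ‘summed’ evaluation. It is not the official evaluation code; the function names and values are illustrative placeholders.

```python
import numpy as np

def weighted_average(scores, segment_lengths):
    """Average per-segment metric scores, weighting each segment by its length."""
    scores = np.asarray(scores, dtype=float)
    lengths = np.asarray(segment_lengths, dtype=float)
    return float(np.sum(scores * lengths) / np.sum(lengths))

def sum_streams(stream_a, stream_b, stream_c):
    """Reconstitute the conversation by summing the three single-channel speaker
    streams (assumed time-aligned, equal length, same sample rate)."""
    return stream_a + stream_b + stream_c

# Hypothetical per-segment scores and segment lengths in samples:
print(weighted_average([0.8, 0.6, 0.9], [16000, 48000, 32000]))  # length-weighted mean
```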
INDEPENDENT METRICS
The following independent metrics will be reported.
- Deep Noise Suppression MOS Score of P.835 (DNSMOS) (1)
- Deep Noise Suppression MOS Score of P.808 (DNSMOS) (2)
- Non-intrusive Speech Quality and Naturalness Assessment (NISQA) (3)
- UTokyo-SaruLab System for VoiceMOS Challenge 2022 (UTMOS) (4)
- PESQ in TorchAudio-Squim (6)
- STOI in TorchAudio-Squim (7)
- SI-SDR in TorchAudio-Squim (8)
Numbers in parentheses refer to the table row on the VERSA GitHub page here.
DEPENDENT METRICS
The following dependent metrics will be reported.
- Mel Cepstral Distortion (MCD) (1)
- Signal-to-interference Ratio (SIR) (4)
- Signal-to-artifact Ratio (SAR) (5)
- Signal-to-distortion Ratio (SDR) (6)
- Convolutional scale-invariant signal-to-distortion ratio (CI-SDR) (7)
- Scale-invariant signal-to-noise ratio (SI-SNR) (8)
- Perceptual Evaluation of Speech Quality (PESQ) (9)
- Short-Time Objective Intelligibility (STOI) (10)
- Frequency-Weighted SEGmental SNR (FWSEGSNR) (20)
- Weighted Spectral Slope (WSS) (21)
- Cepstrum Distance Objective Speech Quality Measure (CD) (22)
- Composite Objective Speech Quality (23)
- Coherence and speech intelligibility index (CSII) (24)
- Normalized-covariance measure (NCM) (25)
Numbers in parentheses refer to the table row on the VERSA GitHub page here.
NON-MATCH METRICS
In addition, the following metrics listed by VERSA as ‘non match metrics’ will be used:
- MOS in TorchAudio-Squim (2)
- Log Likelihood Ratio (LLR) (11)
Numbers in parentheses refer to the table row on the VERSA GitHub page here.
Non-match metrics are those where the reference signal is not necessarily a matched counterpart of the signal being assessed. They are useful in situations such as ECHI where the precise noise-free ground truth is hard to obtain.
Note that independent metrics can be misleading: they may correlate only weakly with the final listening-test assessment, for example by being positively impacted by competing or backchannel speech, or by non-speech sounds such as laughter. Dependent metrics may likewise correlate only weakly because of how the reference signals are constructed. We therefore advise challenge participants to listen to the signals enhanced by their systems rather than relying blindly on metric scores alone.
The metrics above are computed using the VERSA metric config file shown below:
- name: nisqa
- name: pseudo_mos
  predictor_types: ["utmos", "dnsmos"]
- name: pysepm
- name: pesq
- name: signal_metric
- name: stoi
- name: squim_no_ref
- name: squim_ref
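For reference, the sketch below shows one way to drive VERSA with this configuration. The config file name and directory paths are our own placeholders, and the scorer entry point and flag names are assumptions based on the VERSA README; check the toolkit documentation and the provided evaluation framework for the exact invocation.

```python
# Hedged sketch: run the VERSA scorer on directories of enhanced ("pred") and
# reference ("gt") segments using the config listed above, saved as a YAML file.
# The entry point and flag names are assumed from the VERSA README and may differ
# between versions; prefer the challenge's own evaluation scripts where provided.
import subprocess

subprocess.run(
    [
        "python", "versa/bin/scorer.py",
        "--score_config", "echi_metrics.yaml",  # hypothetical file holding the config above
        "--pred", "enhanced_segments/",         # hypothetical directory of system outputs
        "--gt", "reference_segments/",          # hypothetical directory of reference signals
        "--output_file", "scores.txt",
    ],
    check=True,
)
```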
⚠️ Latency rule: Participants may apply a fixed delay (up to 20 ms) across all submitted output streams to match reference timing. This single delay must be consistent across all recordings and clearly reported.
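A minimal sketch of applying such a fixed delay, assuming NumPy arrays and using 20 ms purely as an example (the variable names are illustrative, and any trimming to the expected output length depends on the submission format):

```python
import numpy as np

def apply_fixed_delay(stream: np.ndarray, delay_ms: float, fs: int) -> np.ndarray:
    """Delay a single-channel output stream by prepending zeros; the same delay_ms
    must be used for every stream and every recording, and reported."""
    n_delay = int(round(delay_ms * fs / 1000.0))
    return np.concatenate([np.zeros(n_delay, dtype=stream.dtype), stream])

# Example: delay all three submitted streams by the same 20 ms at 16 kHz.
fs = 16000
streams = [np.random.randn(fs).astype(np.float32) for _ in range(3)]  # placeholder audio
delayed = [apply_fixed_delay(s, 20.0, fs) for s in streams]
```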
Listening Tests
Subjective evaluation will be conducted using MUSHRA-style listening tests, where human listeners assess segments formed by summing the three processed speaker streams into a single channel.
If too many submissions are received to conduct a full listening test for all systems, a pre-listening test with a reduced number of signals will be conducted to short-list the most promising systems.
Listeners will be asked to assess:
- target signal quality
- background noise quality
- overall signal quality
- intelligibility (either subjectively self-reported or assessed semi-formally, since formal intelligibility assessment, e.g. via matrix tests, is not possible given the nature of the CHiME9-ECHI dataset)
(More information on the full methodology will be provided later.)
External data and pre-trained models
Besides the ECHI dataset published with this challenge, participants are allowed to use the public datasets and pre-trained models listed below. If you want to propose an additional dataset or pre-trained model for these lists, contact us by August 18th, 2025. If you want to use a private dataset or model, you may still submit your system to the challenge, but we will not include it in the final rankings.
Participants may use these publicly available datasets for building the systems:
- AMI
- LibriSpeech
- TEDLIUM
- MUSAN
- RWCP Sound Scene Database
- REVERB Challenge RIRs
- Aachen AIR dataset
- BUT Reverb database
- SLR28 RIR and Noise Database (contains Aachen AIR, MUSAN noise, RWCP sound scene database and REVERB challenge RIRs, plus simulated ones)
- VoxCeleb 1&2
- FSD50k
- WSJ0-2mix, WHAM, WHAMR, WSJ
- SINS
- LibriCSS acoustic transfer functions (ATF)
- NOTSOFAR1 simulated CSS dataset
- Ego4D
- Project Aria Datasets
- DNS challenge noises
In addition, the following pre-trained models may be used:
- Wav2vec:
- Wav2vec 2.0:
- Fairseq:
- All models including Wav2Vec 2.0 Large (LV-60 + CV + SWBD + FSH) and the multi-lingual XLSR-53 56k
- Torchaudio:
- Huggingface:
- facebook/wav2vec2-base-960h
- facebook/wav2vec2-large-960h
- facebook/wav2vec2-large-960h-lv60-self
- facebook/wav2vec2-base
- facebook/wav2vec2-large-lv60
- facebook/wav2vec2-large-xlsr-53
- wav2vec2-large lv60 + speaker verification
- Other models on Huggingface using the same weights as the Fairseq ones.
- S3PRL
- Fairseq:
- HuBERT
- WavLM
- Tacotron2
- ECAPA-TDNN
- X-vector extractor
- Pyannote Segmentation
- Pyannote Diarization (Pyannote Segmentation+ECAPA-TDNN from SpeechBrain)
- NeMo toolkit ASR pre-trained models:
- NeMo toolkit speaker ID embeddings models:
- NeMo toolkit VAD models:
- NeMo toolkit diarization models:
- Whisper
- OWSM: Open Whisper-style Speech Model
- Icefall Zipformer
- RWKV Transducer