Results of the first evaluation stage (objective metrics)

Tables of results

  • The tables of results contain the mean performance in terms of
    • SI-SDR on the complete evaluation set of the reverberant LibriCHiME-5 dataset;
    • OVRL, BAK and SIG scores as computed by DNS-MOS on the eval/1 subset of the CHiME-5 dataset.
  • You can click on the name of a system to access the corresponding paper. For the moment, the provided papers are extended abstracts (except for the baseline). The full papers (2 to 6 pages) describing the systems will be published after the CHiME-2023 workshop.

  • OOD teacher, RemixIT, and RemixIT-VAD are part of the baseline.
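For reference, SI-SDR can be computed from an estimated signal and a reference signal as in the following sketch (NumPy; variable names are illustrative, and the official challenge evaluation code may differ in detail):

```python
import numpy as np

def si_sdr(estimate, reference):
    """Scale-invariant signal-to-distortion ratio (SI-SDR) in dB."""
    # Project the estimate onto the reference to find the optimal scaling.
    alpha = np.dot(estimate, reference) / np.dot(reference, reference)
    target = alpha * reference          # scaled reference ("target" component)
    distortion = estimate - target      # residual not explained by the target
    return 10.0 * np.log10(np.sum(target**2) / np.sum(distortion**2))
```

By construction the metric is invariant to rescaling the estimate, which is why it is preferred over plain SDR for systems whose output gain is arbitrary.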

Ranking based on the mean SI-SDR score

Reverberant LibriCHiME-5 (eval set): SI-SDR; CHiME-5 (eval/1 subset): OVRL, BAK, SIG.

SI-SDR ranking  System name (abbreviation)                      SI-SDR (dB)  OVRL  BAK   SIG
1               NWPU and ByteAudio (N&B)                        13.0         3.07  3.93  3.39
2               Sogang ISDS1 (ISDS1)                            12.4         2.90  3.60  3.39
2               Sogang ISDS2 (ISDS2)                            12.4         2.88  3.70  3.32
4               RemixIT-VAD                                     10.1         2.84  3.62  3.28
5               RemixIT                                          9.4         2.82  3.64  3.26
6               Conformer Metric GAN +/+                         7.8         3.40  3.97  3.76
6               OOD teacher                                      7.8         2.88  3.59  3.33
8               Input                                            6.6         2.84  2.92  3.48
9               Conformer Metric GAN +/+ Fine Tuned (CMGAN-FT)   4.7         3.55  3.93  3.92

Ranking based on the mean OVRL score

CHiME-5 (eval/1 subset): OVRL, BAK, SIG; Reverberant LibriCHiME-5 (eval set): SI-SDR.

OVRL ranking  System name (abbreviation)                      OVRL  BAK   SIG   SI-SDR (dB)
1             Conformer Metric GAN +/+ Fine Tuned (CMGAN-FT)  3.55  3.93  3.92   4.7
2             Conformer Metric GAN +/+                        3.40  3.97  3.76   7.8
3             NWPU and ByteAudio (N&B)                        3.07  3.93  3.39  13.0
4             Sogang ISDS1 (ISDS1)                            2.90  3.60  3.39  12.4
5             Sogang ISDS2 (ISDS2)                            2.88  3.70  3.32  12.4
5             OOD teacher                                     2.88  3.59  3.33   7.8
7             RemixIT-VAD                                     2.84  3.62  3.28  10.1
7             Input                                           2.84  2.92  3.48   6.6
9             RemixIT                                         2.82  3.64  3.26   9.4
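The tied ranks in the tables above (e.g. two systems sharing rank 6, followed by rank 8) follow standard competition ranking. A minimal sketch with a hypothetical helper:

```python
def competition_rank(scores):
    """Assign '1224'-style competition ranks: tied scores share the
    smallest rank, and the rank after a tie skips accordingly."""
    ordered = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    ranks = {}
    for i, (name, score) in enumerate(ordered):
        if i > 0 and score == ordered[i - 1][1]:
            ranks[name] = ranks[ordered[i - 1][0]]  # share the tied rank
        else:
            ranks[name] = i + 1
    return ranks

# e.g. the top of the SI-SDR ranking:
competition_rank({"N&B": 13.0, "ISDS1": 12.4, "ISDS2": 12.4, "RemixIT-VAD": 10.1})
# → {"N&B": 1, "ISDS1": 2, "ISDS2": 2, "RemixIT-VAD": 4}
```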


The figures below show the performance of the different systems with boxplots and violin plots. In each figure, the systems are ranked based on their mean performance, which is indicated by a black dot.

[Figures: box plots and violin plots of the SI-SDR, OVRL, BAK, and SIG scores per system]

Systems selected for the second evaluation stage

The systems selected for the listening test are CMGAN-FT, N&B, ISDS1, and RemixIT-VAD.

The selection procedure was the following:

  • We selected the top 3 entries in terms of OVRL score on CHiME-5. In case of multiple entries from the same team, we only kept the best one. This led to S1 = {CMGAN-FT, N&B, ISDS1}.
  • We selected the top 3 entries in terms of SI-SDR on reverberant LibriCHiME-5. In case of multiple entries from the same team, we only kept the best one. This led to S2 = {N&B, ISDS1, RemixIT-VAD}.
  • The union of S1 and S2 gave the above-listed systems selected for the listening test.
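The procedure above amounts to a per-metric top-3 with team deduplication, followed by a set union. A minimal sketch (the scores are copied from the stage-1 tables; the team labels and the "CMGAN" key are illustrative groupings inferred from the tables and notes above, and the noisy input is excluded since it is not a submitted system):

```python
# Mean scores from the first evaluation stage.
ovrl = {"CMGAN-FT": 3.55, "CMGAN": 3.40, "N&B": 3.07, "ISDS1": 2.90,
        "ISDS2": 2.88, "OOD teacher": 2.88, "RemixIT-VAD": 2.84, "RemixIT": 2.82}
sisdr = {"N&B": 13.0, "ISDS1": 12.4, "ISDS2": 12.4, "RemixIT-VAD": 10.1,
         "RemixIT": 9.4, "CMGAN": 7.8, "OOD teacher": 7.8, "CMGAN-FT": 4.7}
team = {"CMGAN-FT": "CMGAN team", "CMGAN": "CMGAN team", "N&B": "N&B",
        "ISDS1": "Sogang", "ISDS2": "Sogang", "OOD teacher": "baseline",
        "RemixIT-VAD": "baseline", "RemixIT": "baseline"}

def top3_per_team(scores):
    """Top 3 systems by score, keeping only the best entry per team."""
    seen, selected = set(), set()
    for name in sorted(scores, key=scores.get, reverse=True):
        if team[name] not in seen:
            seen.add(team[name])
            selected.add(name)
        if len(selected) == 3:
            break
    return selected

S1, S2 = top3_per_team(ovrl), top3_per_team(sisdr)
selected_systems = S1 | S2   # union of the two top-3 sets
```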

Results of the second evaluation stage (listening test)

Brief description of the listening test

The listening test followed ITU-T Recommendation P.835. It was conducted in person by Hend ElGhazaly and Jon Barker at the University of Sheffield (UK) between July 17 and August 9, 2023.

Participants were seated in a listening booth and listened over headphones to short speech samples (4–5 seconds). Each trial consisted of three presentations of the same sample, used to collect three different subjective ratings. Depending on the presentation, participants were instructed to:

  • focus on the speech signal and rate how natural it sounded (SIG rating scale);
  • focus on the background noise and rate how noticeable or intrusive it was (BAK rating scale);
  • attend to both the speech and the background noise and rate the overall quality of the sample (OVRL rating scale), quality being defined from the perspective of everyday speech communication.

The order of the presentations was counterbalanced across participants. Ratings were reported on 5-point Likert scales. For more information, you can read the instructions that were provided to the listeners.

32 subjects participated in the listening test. They were divided into 4 panels of 8 listeners. Each panel was associated with a distinct set of 32 audio samples taken from the eval/listening_test subset of the segmented CHiME-5 dataset, for a total of 4 × 32 = 128 audio samples over the entire listening test. Each audio sample was presented in 5 experimental conditions: the 4 systems that performed best in the first evaluation stage, plus the unprocessed noisy input.

A complete listening experiment for one subject consisted of 160 trials, where one trial corresponds to a pair of one audio sample and one experimental condition (32 audio samples × 5 experimental conditions = 160 trials). The 160 trials were split into 4 listening sessions separated by short rest periods. For each triplet (audio sample, experimental condition, rating scale), a mean opinion score (MOS) was computed from 8 votes.
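The experimental design above can be sanity-checked with a few lines of arithmetic (variable names are illustrative):

```python
panels = 4
listeners_per_panel = 8
samples_per_panel = 32
conditions = 5  # 4 selected systems + the noisy input

subjects = panels * listeners_per_panel              # 32 listeners in total
total_samples = panels * samples_per_panel           # 128 audio samples
trials_per_subject = samples_per_panel * conditions  # 160 trials per listener
votes_per_mos = listeners_per_panel                  # 8 votes behind each MOS
mos_per_condition_and_scale = total_samples          # 128 MOS per system/scale
```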

Before the aforementioned listening sessions, participants completed a practice session of 48 trials to familiarize themselves with the task. These trials consisted of the reference conditions described in Table 1 of Naderi and Cutler (2021), i.e., synthetic mixtures of speech and noise designed to equalize the subjective range of quality ratings across listeners.

The software for the listening test was based on jsPsych and was developed by Matthieu Fraticelli.

Naderi, B., & Cutler, R. (2021). Subjective evaluation of noise suppression algorithms in crowdsourcing. INTERSPEECH.

Tables of results

  • Results in the tables below are computed from 128 MOS values for each system and rating scale; each MOS value is computed from 8 votes.
  • The column “95% CI” contains the 95% confidence interval for the mean.
  • The rankings are based on the mean results.
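A sketch of how such summary statistics can be derived from raw votes, assuming a normal approximation for the 95% confidence interval of the mean (the organizers' exact procedure may differ):

```python
import numpy as np

def mos_summary(votes):
    """votes: array of shape (n_samples, n_listeners) with 1-5 Likert ratings.
    Returns (mean, 95% CI half-width, median) of the per-sample MOS."""
    mos = votes.mean(axis=1)  # one MOS per audio sample
    mean = mos.mean()
    ci95 = 1.96 * mos.std(ddof=1) / np.sqrt(len(mos))  # normal approximation
    return mean, ci95, float(np.median(mos))
```

With 128 MOS values of 8 votes each, as in the tables above, this yields the Mean, 95% CI, and Median columns.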

BAK MOS results

Ranking  System name (abbreviation)                      Mean  95% CI  Median
1        NWPU and ByteAudio (N&B)                        4.30  0.01    4.38
2        Sogang ISDS1 (ISDS1)                            3.08  0.01    3.00
3        RemixIT-VAD                                     2.97  0.01    2.88
4        Conformer Metric GAN +/+ Fine Tuned (CMGAN-FT)  2.75  0.01    2.63
5        Input                                           2.20  0.01    2.19

SIG MOS results

Ranking  System name (abbreviation)                      Mean  95% CI  Median
1        Input                                           3.97  0.01    4.00
2        Sogang ISDS1 (ISDS1)                            3.43  0.01    3.56
3        NWPU and ByteAudio (N&B)                        3.41  0.01    3.63
4        RemixIT-VAD                                     3.02  0.02    3.25
5        Conformer Metric GAN +/+ Fine Tuned (CMGAN-FT)  2.63  0.01    2.63

OVRL MOS results

Ranking  System name (abbreviation)                      Mean  95% CI  Median
1        NWPU and ByteAudio (N&B)                        3.11  0.01    3.25
2        Sogang ISDS1 (ISDS1)                            2.75  0.01    2.75
3        Input                                           2.68  0.01    2.75
4        RemixIT-VAD                                     2.45  0.01    2.50
5        Conformer Metric GAN +/+ Fine Tuned (CMGAN-FT)  2.14  0.01    2.13


  • In the figures below, black dots and numbers written above the box/violin plots correspond to the mean results.
  • In each figure, the systems are ranked according to their mean results.

[Figures: box plots and violin plots of the BAK, SIG, and OVRL MOS results per system]

Audio examples

The audio examples below were taken randomly from the samples used for the listening test.

Sample 1
Sample 2
Sample 3
Sample 4
Sample 5
Sample 6
Sample 7
Sample 8
Sample 9
Sample 10