Results
Overview
A total of 12 systems were submitted by 7 teams. Not all teams entered both the Aria and hearing aid (HA) tracks, and some teams submitted more than one system to the same track.
| Metric | Overall | HA | Aria |
|---|---|---|---|
| Number of teams | 7 | 6 | 6 |
| Number of systems | 12 | 11 | 9 |
The results are presented in two sections. The first reports the objective evaluation results, which were used to help guide the selection of systems for the listening tests. The second reports the listening test results, which form the main focus of the CHiME-9 Task 2 evaluation.
Objective Evaluation Results
The objective evaluation results were computed using the same tools provided to participants, available on GitHub. We report the composite measures CSIG (signal distortion), CBAK (background intrusiveness), and COVL (overall quality), together with PESQ (Perceptual Evaluation of Speech Quality) and STOI (Short-Time Objective Intelligibility). Results are presented separately for the Aria and HA systems, since these were evaluated in separate listening tests. The tables are interactive, allowing users to sort by different metrics and view the corresponding charts.
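For reference, the sketch below shows one way these metrics can be computed in Python using common open-source packages (pesq, pystoi, and pysepm for the composite measures). It is a minimal illustration only: the argument order of pysepm.composite and the simple length-trimming alignment are assumptions, and the official scoring tools on GitHub should be treated as the definitive implementation.

```python
# Minimal sketch of the objective metrics using common Python packages.
# The official challenge tools may differ in resampling, alignment,
# and segment handling.
import soundfile as sf
from pesq import pesq          # ITU-T P.862 PESQ
from pystoi import stoi        # short-time objective intelligibility
import pysepm                  # composite measures CSIG / CBAK / COVL (assumed API)

def objective_scores(ref_path, enh_path, fs_target=16000):
    ref, fs_ref = sf.read(ref_path)
    enh, fs_enh = sf.read(enh_path)
    assert fs_ref == fs_enh == fs_target, "resample signals to a common rate first"

    # Trim to the shorter signal so reference and enhanced signals align.
    n = min(len(ref), len(enh))
    ref, enh = ref[:n], enh[:n]

    # pysepm.composite is assumed to return (CSIG, CBAK, COVL).
    csig, cbak, covl = pysepm.composite(ref, enh, fs_target)
    return {
        "CSIG": csig,
        "CBAK": cbak,
        "COVL": covl,
        "PESQ": pesq(fs_target, ref, enh, "wb"),        # wide-band mode at 16 kHz
        "STOI": stoi(ref, enh, fs_target, extended=False),
    }
```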
More detailed results and analysis will be made available by the time of the CHiME-9 workshop.
Aria Systems
The objective evaluation results for the Aria systems are presented below.
| Team Name | System | CSIG ↑ | CBAK ↑ | COVL ↑ | PESQ ↑ | STOI ↑ |
|---|---|---|---|---|---|---|
* The WasedaNTT system is excluded from the official ranking, since it used training data that was not on the permitted list. Results are included here for reference.
Hearing Aid Systems
The objective evaluation results for the hearing aid systems are presented below.
| Team Name | System | CSIG ↑ | CBAK ↑ | COVL ↑ | PESQ ↑ | STOI ↑ |
|---|---|---|---|---|---|---|
Results of the Listening Test Evaluation
A total of eight systems were selected for evaluation in each track (Aria and Hearing Aid). These comprised the challenge baseline, six systems chosen from the submitted entries, and an “oracle” system formed by summing the close-talk microphone signals. The oracle system was included to indicate what might be achievable under ideal speaker extraction and to help contextualise the performance of the submitted systems. The six submitted systems were selected to include one system from each team. Where a team submitted multiple entries, the chosen system reflected the team’s preference, informed by the objective evaluation results.
Brief Description of the Listening Test
Two listening tests were conducted to measure intelligibility and quality separately for the processed audio samples. A brief summary of the essential methodological details is given below; a fuller description will be available by the time of the CHiME-9 workshop.
Intelligibility: A panel of 32 native English speakers was recruited. Each participant completed 16 stimulus blocks, with each block corresponding to a different target talker in the conversation. At the start of each block, participants heard the target talker reading the Rainbow Passage. They then listened to a series of conversation segments containing an utterance from that talker, each preceded by 5 seconds of conversation context presented with its transcript. Their task was to transcribe the words spoken by the target talker after the context period. Performance was scored as the percentage of words correctly transcribed, computed against a reference transcript. Within each talker block, systems were randomised, while ensuring that no listener heard the same utterance twice.
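As an illustration of this scoring, the sketch below computes the percentage of reference words correctly reported from a word-level alignment. The text normalisation and alignment rules shown here are assumptions; the exact scoring rules used in the challenge may differ.

```python
# Hypothetical sketch of per-utterance intelligibility scoring: percentage of
# reference words correctly reported, computed from a word-level alignment.
import re
from difflib import SequenceMatcher

def normalise(text: str) -> list[str]:
    # Lower-case and strip punctuation before splitting into words (assumed rule).
    return re.sub(r"[^a-z' ]", " ", text.lower()).split()

def percent_words_correct(reference: str, response: str) -> float:
    ref = normalise(reference)
    hyp = normalise(response)
    if not ref:
        return 0.0
    # Count reference words that fall in matching blocks of the alignment.
    matcher = SequenceMatcher(a=ref, b=hyp, autojunk=False)
    hits = sum(block.size for block in matcher.get_matching_blocks())
    return 100.0 * hits / len(ref)

print(percent_words_correct("the quick brown fox", "the brown fox"))  # 75.0
```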
Quality: The quality test was based on ITU-T P.835. Participants sat in a listening booth and listened over headphones to short speech samples (4–5 seconds). Each trial contained three presentations of the same sample, each eliciting a different subjective rating. In one presentation, participants focused on the foreground speech and rated its naturalness (SIG). In another, they focused on the background noise and rated how noticeable or intrusive it was (BAK). In the third, they considered both speech and background noise and rated overall quality (OVRL), defined in terms of everyday speech communication. Presentation order was counterbalanced across participants. Ratings were made on a continuous 1–5 slider scale with Likert-style verbal anchors. A total of 32 native English speakers took part, split into two groups of 16: one rated the Aria systems and the other the HA systems.
Ranking score: The final score used for ranking is a combination of the intelligibility score and the OVRL score. The OVRL ratings are mapped onto the range 0 to 100, and then averaged with the intelligibility score.
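The mapping from the 1-5 OVRL scale to the 0-100 range is not spelled out here; the sketch below assumes a simple linear rescaling, which should be treated as an assumption rather than the official formula.

```python
# Sketch of the final ranking score: OVRL (1-5 scale) is mapped to 0-100 and
# averaged with the intelligibility score (percentage of words correct).
def ranking_score(intelligibility_pct: float, ovrl: float) -> float:
    ovrl_pct = (ovrl - 1.0) / 4.0 * 100.0   # assumed linear rescaling of 1-5 to 0-100
    return 0.5 * (intelligibility_pct + ovrl_pct)

print(ranking_score(80.0, 3.5))  # 71.25
```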
Aria Systems
The results for the Aria systems are presented below. These include the intelligibility scores, the subjective quality ratings, and the derived overall score used for the final ranking. Systems are ranked by their final score, but the table can be sorted by any metric by clicking on the column headers.
| Team Name | System | SIG ↑ | BAK ↑ | OVRL ↑ | Intell. ↑ | Score ↑ |
|---|---|---|---|---|---|---|
* The WasedaNTT system is excluded from the official ranking, since it used training data that was not on the permitted list. Results are included here for reference.
Hearing Aid Systems
The results for the hearing aid systems are presented below. These include the intelligibility scores, the subjective quality ratings, and the derived overall score used for the final ranking. Systems are ranked by their final score, but the table can be sorted by any metric by clicking on the column headers.
| Team Name | System | SIG ↑ | BAK ↑ | OVRL ↑ | Intell. ↑ | Score ↑ |
|---|---|---|---|---|---|---|