Submission

Registration

Registration for the NOTSOFAR-1 Challenge is currently available through this link. Please register only once per team.

The purpose of registration is to collect basic participant details (name, email, etc.) to enable us to share targeted announcements and updates about the NOTSOFAR challenge, including notifications about new data releases and important milestones. Your information will be deleted after the challenge concludes, unless you give us explicit permission to retain it. While registration is not mandatory and is not limited by any deadline, we highly encourage you to register to stay informed and easily track changes related to the challenge. You are welcome to participate and submit results for the NOTSOFAR challenge even without registering.

Submission Format

We align with CHiME’s SegLST (Segment-wise Long-form Speech Transcription) style.

A submission consists of a JSON hypothesis file, tcp_wer_hyp.json, containing a list of dictionaries (one per utterance), each with the following attributes:

{
"session_id": "multichannel/MTG_30860_plaza_0",
"start_time": 28.890,
"end_time": 30.719,
"words": "we were the only car that can fly",
"speaker": "Spk2"
}
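
For reference, a minimal Python sketch that assembles such a list and writes tcp_wer_hyp.json (the segment values are the illustrative ones from above):

import json

# One dictionary per predicted utterance, following the SegLST attributes above.
segments = [
    {
        "session_id": "multichannel/MTG_30860_plaza_0",
        "start_time": 28.890,
        "end_time": 30.719,
        "words": "we were the only car that can fly",
        "speaker": "Spk2",
    },
    # ... one entry for every other predicted utterance ...
]

with open("tcp_wer_hyp.json", "w") as f:
    json.dump(segments, f, indent=2)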

Submissions will be evaluated using the calc_wer function found in scoring.py from the baseline repository. You can use it to compute your own WER on datasets where ground truth is available.
Two metrics will be calculated: the official tcpWER for ranking, and the supplementary tcORC WER (see Challenge Tracks).
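
As a rough sketch of what the scoring does: tcpWER comes from the meeteval package, which the baseline's scoring builds on. The exact calc_wer signature lives in scoring.py, and the collar value below is an assumption to verify against the baseline:

import meeteval

# Hedged sketch: compute tcpWER directly with meeteval on SegLST files.
# The official evaluation goes through calc_wer in scoring.py; treat the
# collar here as an assumption and check the baseline for the actual value.
ref = meeteval.io.load("ref.json")          # ground-truth segments
hyp = meeteval.io.load("tcp_wer_hyp.json")  # your hypothesis
print(meeteval.wer.tcpwer(reference=ref, hypothesis=hyp, collar=5))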

Optional: tcORC WER expects a stream ID to be provided as the “speaker” attribute. Participants may provide an additional tc_orc_wer_hyp.json file with their own stream definition (for instance, our baseline treats CSS channels as streams; see the write_hypothesis_jsons function). If this file is not provided, tcp_wer_hyp.json is used for both metrics, thereby treating the predicted speaker IDs as the stream IDs.
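
To illustrate, a hypothetical sketch of deriving tc_orc_wer_hyp.json from per-segment CSS channel indices; the css_channel attribute is an assumption standing in for whatever your pipeline tracks, and the baseline's actual logic lives in write_hypothesis_jsons:

import json

with open("tcp_wer_hyp.json") as f:
    segments = json.load(f)

for seg in segments:
    # Replace the speaker label with a stream ID, e.g., the CSS output
    # channel the segment came from. "css_channel" is a hypothetical
    # attribute that your own pipeline would need to track.
    seg["speaker"] = f"stream{seg.pop('css_channel', 0)}"

with open("tc_orc_wer_hyp.json", "w") as f:
    json.dump(segments, f, indent=2)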

Leaderboard - Dev Set Evaluation

Navigate to the “Submission” tab on the Hugging Face Leaderboard, select a track (NOTSOFAR-MC or NOTSOFAR-SC), and submit a zip containing either both tcp_wer_hyp.json and tc_orc_wer_hyp.json, or just tcp_wer_hyp.json. Ensure your hypothesis files cover exactly the sessions in the dev set.
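
Before uploading, it can help to verify session coverage and package the zip programmatically; a small sketch, assuming you have the expected dev-set session IDs at hand (e.g., from the dataset metadata):

import json
import zipfile

def covered_sessions(path):
    # Collect the session_ids present in a SegLST hypothesis file.
    with open(path) as f:
        return {seg["session_id"] for seg in json.load(f)}

def make_submission_zip(expected_sessions, files=("tcp_wer_hyp.json",),
                        out="submission.zip"):
    # Fail early if any dev-set session is missing or unexpected.
    mismatch = covered_sessions(files[0]) ^ set(expected_sessions)
    assert not mismatch, f"session mismatch: {sorted(mismatch)}"
    with zipfile.ZipFile(out, "w") as zf:
        for name in files:
            zf.write(name)

# Include tc_orc_wer_hyp.json as well if you provide your own streams:
# make_submission_zip(expected, files=("tcp_wer_hyp.json", "tc_orc_wer_hyp.json"))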

How to Submit Results for Final Evaluation

Participants are required to submit the following for each track:

  • Predictions of up to 4 systems through the submission form (not the leaderboard). You will be asked to submit results for both the evaluation and development sets (dev-set-2; see the Data section). These submissions will be considered for the final ranking as well as for the practical and efficient system jury award: everyone has a chance to win the jury award, regardless of their final ranking.
  • A technical description paper of up to 4 pages, plus 1 for references. If you participate in both tracks, you can submit a single paper covering both.
  • We also kindly ask (though it is not mandatory) that you include in the final submission, for each system, this YAML file describing training and inference resources (please ignore the fields not related to NOTSOFAR: macro, chime6, dipco, mixer6). We recommend including it, as it will be used for jury award consideration.

Results will be released after feedback on the technical description papers has been sent out.
All participants are invited to present their work at the CHiME-8 Interspeech Satellite Workshop:

  • The jury award 🥇 for the most practical and efficient system will be announced at the workshop.
  • We strongly encourage in-person participation, but can make exceptions for teams that are unable to attend due to unforeseen circumstances. Please contact the organizers if you cannot attend in person.
  • Details about the presentation format (poster/oral) will be shared later.
  • After the workshop, participants are invited to submit a full-length paper (up to 6 pages, including references), which will appear in the workshop proceedings and will be indexed.

🥷 If you want to submit a final system but keep the submission (and the required technical description!) anonymous, please reach out to the organizers via Slack or email.

Final Submission Form

Please use this Google Form to submit your predictions. Contact the organizers if you encounter any issues.

We expect each submission for each track to have the following structure, including a separate YAML file for each system.
Note that this example shows only 2 systems, but you can submit up to 4. Additionally, the development set refers to dev-set-2 (dev-set-1 is considered part of the training data).

.
├── sys1
│   ├── dev
│   │   └── tcp_wer_hyp.json
│   ├── eval
│   │   └── tcp_wer_hyp.json
│   └── info.yaml
└── sys2
    ├── dev
    │   └── tcp_wer_hyp.json
    ├── eval
    │   └── tcp_wer_hyp.json
    └── info.yaml

That is, one sub-folder per system, each containing the JSON prediction files and the system information YAML file.
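
Before filling in the form, you can sanity-check the layout with a few lines of Python; a sketch assuming the sys*/dev|eval/info.yaml naming from the tree above:

from pathlib import Path

def check_submission(root="."):
    # Verify each system folder has dev/eval hypotheses and an info.yaml.
    for sys_dir in sorted(Path(root).glob("sys*")):
        for split in ("dev", "eval"):
            hyp = sys_dir / split / "tcp_wer_hyp.json"
            assert hyp.is_file(), f"missing {hyp}"
        info = sys_dir / "info.yaml"
        assert info.is_file(), f"missing {info}"
        print(f"{sys_dir.name}: OK")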

📑 Technical Description Paper

Participants are required to submit a short technical description paper (up to 4 pages, plus 1 for references) for each track they participated in. A single paper covering both tracks is also permitted.
Submissions should be made using:
This technical description paper will be used:

  1. To evaluate eligibility for the jury award for the “Most Practical and Efficient System”.
  2. To assess correctness of submission and compliance with rules.

Teams should provide relevant information about their submitted systems, including:

  • Details to reproduce the systems.
  • External datasets, data augmentation strategies and external models used.
  • Inference and training runtime and hardware specifications.

Please make sure that the system names used in the technical description match the ones specified in the corresponding YAML file’s system_tag entry.

🏆 Jury Award for "Most Practical and Efficient System"

This year we introduced a special jury award/mention to encourage the development of innovative and practical systems, moving away from performance-squeezing, brute-force approaches such as ensembling or iterative inference-time pseudo-labeling and retraining.
We highly recommend submitting your system even if you think its performance is not great, as it still stands a chance of winning this jury award if it is practical and scientifically interesting.

Teams are allowed to submit up to four systems per track for final evaluation, e.g., four systems for single-channel and four for multi-channel.
The final team ranking is based on the system that performs best on the evaluation set.
For the jury award, all submitted systems are considered, regardless of their evaluation-set performance.

Since up to four submissions can be made, we encourage teams to explore both “performance-oriented” approaches and systems or single components (e.g., separation) that are more efficient, scientifically interesting, and still effective. For example, a team could submit two ensemble-based systems alongside two more practical non-ensemble systems.

To determine the jury award, a pool of experts will review the submitted materials, including the technical description papers and the per-system YAML files. The jury award will be based on the following criteria, in order of importance:

  1. Practicality and efficiency
  2. Novelty
  3. Final results

For reference, we list below some past challenge submissions that we think would have qualified for such a jury award.
These works proposed effective and novel components with lasting impact on the field.

  • Boeddeker, C., Heitkaemper, J., Schmalenstroeer, J., Drude, L., Heymann, J., & Haeb-Umbach, R. (2018). Front-end processing for the CHiME-5 dinner party scenario. In Proc. CHiME-5 Workshop, Hyderabad, India.
  • Medennikov, I., Korenevsky, M., Prisyach, T., Khokhlov, Y., Korenevskaya, M., Sorokin, I., … & Romanenko, A. (2020). Target-speaker voice activity detection: A novel approach for multi-speaker diarization in a dinner party scenario. In Proc. Interspeech 2020.
  • Erdogan, H., Hayashi, T., Hershey, J. R., Hori, T., Hori, C., Hsu, W. N., … & Watanabe, S. (2016). Multi-channel speech recognition: LSTMs all the way through. In Proc. CHiME-4 Workshop.
  • Heymann, J., Drude, L., Chinaev, A., & Haeb-Umbach, R. (2015). BLSTM supported GEV beamformer front-end for the 3rd CHiME challenge. In Proc. IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), pp. 444-451.