Programme

The 2023 CHiME Workshop will take place on August 25, 2023, at the Meta offices on Serpentine Avenue in the Ballsbridge neighbourhood of Dublin.

The programme will consist of:

8:00 Breakfast
8:30-8:45 Welcome Message from the Steering Committee
8:45-9:30 First Keynote
9:30-10:00 Task overview presentations
10:00-10:15 Break
10:15-11:15 Top systems presentations
11:15-12:00 Second Keynote
12:00-14:00 Posters & Lunch* (each poster is presented during one half of the session)
14:00-15:00 CHiME-2024 challenge pitching sessions & Discussion of future directions
15:00-15:15 Closing

*Presenters who need to leave early, please let us know.

Keynote 1

  • Wei-Ning Hsu, Research Scientist, Meta Foundational AI Research
    Multimodal and Large-Scale Generative Models for Enhancement


    Abstract: What is the goal of speech enhancement, and what is the definition of a perfectly enhanced speech sample? Conventional speech enhancement usually concerns additive noise and treats the source speech as the single oracle that an enhancement model should reconstruct exactly. Performance is often measured by signal-level metrics such as SDR and PESQ. This paradigm has two main issues. First, why should we treat the source speech as an oracle? Those references might also contain some noise and might not have been recorded with the best-quality microphone. Should a model be penalized for generating enhanced speech that “sounds better” than the reference speech? Second, even when the reference speech is of superior quality, there can still be multiple samples that sound identical to the reference to human listeners yet are very different from it in waveform space (e.g., time shift, phase shift). Should a model be penalized if it generates one of those samples that sounds just as good as the reference?
    In this talk, I will present two recent studies on generative modeling with applications to the “generalized speech enhancement” problem. The goal of generalized speech enhancement is to ensure that the desired factors, such as content and voice, are preserved or enhanced, rather than reconstructing the source speech exactly. The first study is ReVISE, which leverages AV-HuBERT and HiFi-GAN to build a universal model for lip-to-speech synthesis, audio-visual speech inpainting, enhancement, and separation. By using a pre-trained model, ReVISE operates effectively even in very challenging (egocentric, low-resolution, low-SNR) and low-resource (2 hr) regimes. The second study is Voicebox, a DALL-E- and LLM-like generative speech model that can perform in-context learning and generalizes to monolingual and cross-lingual style transfer, speech editing, and unconditional diverse speech sampling. In particular, we demonstrate one of its applications to transient noise removal through in-context infilling.
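
    For readers less familiar with the metric issue raised in the abstract, the following is a minimal sketch (assuming NumPy; the shift size, sampling rate, and white-noise stand-in for speech are illustrative and not taken from the talk) of how a scale-invariant SDR score collapses for a copy of the reference that is merely delayed by a fraction of a millisecond:

        import numpy as np

        def si_sdr(reference, estimate, eps=1e-12):
            """Scale-invariant SDR in dB: higher means closer to the reference waveform."""
            alpha = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
            target = alpha * reference          # projection of the estimate onto the reference
            noise = estimate - target           # everything the metric counts as "error"
            return 10 * np.log10((np.sum(target**2) + eps) / (np.sum(noise**2) + eps))

        fs = 16000
        rng = np.random.default_rng(0)
        reference = rng.standard_normal(fs)     # 1 s stand-in signal at 16 kHz (not real speech)
        shifted = np.roll(reference, 8)         # the same signal, delayed by 0.5 ms

        print(si_sdr(reference, reference))     # very high: identical waveforms
        print(si_sdr(reference, shifted))       # very low, although the two sound the same

    The exact numbers depend on the signal, but the qualitative point from the abstract stands: purely waveform-level scores can heavily penalize outputs a listener would judge identical to the reference.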

    Bio: Wei-Ning is a research scientist at Meta Foundational AI Research (FAIR). His research focuses on generative modeling and self-supervised learning for speech. He is a core contributor to Voicebox, HuBERT, data2vec, Textless NLP/GSLM, wav2vec-U, AV-HuBERT, ReVISE, and Textless Speech-to-Speech Translation systems. Prior to joining Facebook, Wei-Ning received his Ph.D. and S.M. degrees in Electrical Engineering and Computer Science from the Massachusetts Institute of Technology in 2020 and 2018, under the supervision of Dr. James Glass. He received his B.S. degree in Electrical Engineering from National Taiwan University in 2014, under the supervision of Prof. Lin-shan Lee and Prof. Hsuan-Tien Lin.

Keynote 2

  • Naomi Harte, Professor in Speech Technology, Trinity College Dublin
    Understanding Speech in Everyday Environments – Is Multimodality the Answer?


    Abstract: This talk will consider the multimodal nature of speech and speech technology. Human speech communication is extremely rich. We use many elements to communicate, from words to gestures and eye gaze, and we seamlessly interpret these many cues in our conversations. In noisy situations, humans appear to dynamically change their use of different modalities in response to their environment. Is exploiting multimodality, then, the solution to developing speech processing algorithms that are robust in everyday environments? In this talk, I’ll look at how visual and linguistic information can be integrated into deep learning frameworks for audio-visual speech recognition and turn-taking prediction. I’ll also look at how the availability of suitable datasets, with adequate labelling, can help or hinder development in this domain.

    Bio: Naomi is Professor in Speech Technology in the School of Engineering at Trinity College Dublin. She is Co-PI and a founding member of the ADAPT SFI Centre in Ireland. In ADAPT, she has led a major Research Theme centred on Multimodal Interaction, involving researchers from universities across Ireland, and was instrumental in developing the future vision for the Centre for 2021-2026. She is also a lead academic in the Sigmedia Research Group in the School of Engineering. Prior to starting her lectureship at TCD in 2008, Naomi worked in high-tech start-ups in the field of DSP systems development, including her own company. She also previously worked at McMaster University in Canada. She was a Visiting Professor at ICSI in 2015 and became a Fellow of TCD in 2017. She earned a Google Faculty Award in 2018 and was shortlisted for the AI Ireland Awards in 2019. She currently serves on the Editorial Board of Computer Speech and Language, and is General Chair of INTERSPEECH 2023.

Task overview presentations

Time | Title | Authors
9:30-9:45 | The CHiME-7 DASR Challenge: Distant Meeting Transcription with Multiple Devices in Diverse Scenarios | Samuele Cornell (Università Politecnica delle Marche); Matthew S Wiesner (Johns Hopkins University); Shinji Watanabe (Carnegie Mellon University); Desh Raj (Johns Hopkins University); Xuankai Chang (Carnegie Mellon University); Paola Garcia (Johns Hopkins University); Yoshiki Masuyama (Tokyo Metropolitan University); Zhong-Qiu Wang (Carnegie Mellon University); Stefano Squartini (Università Politecnica delle Marche); Sanjeev Khudanpur (Johns Hopkins University)
9:45-10:00 | The CHiME-7 UDASE task: Unsupervised domain adaptation for conversational speech enhancement | Simon Leglaive (CentraleSupélec, IETR); Léonie Borne (Pulse Audition); Efthymios Tzinis (University of Illinois at Urbana-Champaign); Mostafa Sadeghi (INRIA); Matthieu Fraticelli (Ecole Normale Supérieure, PSL University, CNRS); Scott Wisdom (Google); Manuel Pariente (Pulse Audition); Daniel Pressnitzer (Ecole Normale Supérieure, PSL University, CNRS); John Hershey (Google)

Top systems presentations

Time | Title | Authors | Track
10:15-10:30 | The USTC-NERCSLIP Systems for CHiME-7 Challenge | Ruoyu Wang (University of Science and Technology of China); Maokui He (University of Science and Technology of China); Jun Du (University of Science and Technology of China); Hengshun Zhou (University of Science and Technology of China); Shutong Niu (University of Science and Technology of China); Hang Chen (USTC); Yanyan Yue (University of Science and Technology of China); Gaobin Yang (University of Science and Technology of China); Shilong Wu (University of Science and Technology of China); Lei Sun (iFlytek); Yanhui Tu (iFlytek); Haitao Tang (iFlytek); Shuangqing Qian (iFlytek); Tian Gao (iFlytek Research); Mengzhi Wang (iFlytek Research); Genshun Wan (iFlytek); Jia Pan (iFlytek Research); Jianqing Gao (iFlytek Research); Chin-Hui Lee (Georgia Institute of Technology) | Challenge Task 1 DASR
10:30-10:45 | The NWPU-ByteAudio System for CHiME-7 Task 2 UDASE Challenge | Zihan Zhang (Northwestern Polytechnical University); Runduo Han (Northwestern Polytechnical University); Ziqian Wang (Northwestern Polytechnical University); Xianjun Xia (RTC Lab, ByteDance); Yijian Xiao (Bytedance); Lei Xie (NWPU) | Challenge Task 2 UDASE
10:45-11:00 | The IACAS-Thinkit System for CHiME-7 Challenge | Lingxuan Ye (Key Laboratory of Speech Acoustics and Content Understanding, Institute of Acoustics); Haitian Lu (Key Laboratory of Speech Acoustics and Content Understanding, Institute of Acoustics); Gaofeng Cheng (Key Laboratory of Speech Acoustics and Content Understanding, Institute of Acoustics); Yifan Chen (Key Laboratory of Speech Acoustics and Content Understanding, Institute of Acoustics); Zengqiang Shang (The Institute of Acoustics of the Chinese Academy of Sciences); Xuyuan Li (Key Laboratory of Speech Acoustics and Content Understanding, Institute of Acoustics) | Challenge Task 1 DASR
11:00-11:15 | MeetEval: A Toolkit for Computation of Word Error Rates for Meeting Transcription Systems | Thilo von Neumann (Paderborn University); Christoph B Boeddeker (Paderborn University); Marc Delcroix (NTT); Reinhold Haeb-Umbach (University of Paderborn) | Non-challenge papers

Posters

Time | Poster ID | Title | Authors | Track
12:00-13:00 | 1 | The University of Sheffield CHiME-7 UDASE Challenge Speech Enhancement System | George L Close (University of Sheffield); William Ravenscroft (The University of Sheffield); Thomas Hain (University of Sheffield); Stefan Goetze (University of Sheffield) | Challenge Task 2 UDASE
13:00-14:00 | 2 | The SGU Systems for the CHiME-7 UDASE Challenge | Jaehoo Jang (Sogang University); Myoung-Wan Koo (Sogang University) | Challenge Task 2 UDASE
12:00-13:00 | 3 | NTT Multi-Speaker ASR System for the DASR Task of CHiME-7 Challenge | Naoyuki Kamo (NTT); Naohiro Tawara (NTT); Kohei Matsuura (NTT); Takanori Ashihara (NTT Corp.); Takafumi Moriya (NTT); Atsunori Ogawa (NTT Corporation); Hiroshi Sato (NTT Corporation); Tsubasa Ochiai (NTT); Atsushi Ando (NTT Corporation); Rintaro Ikeshita (NTT); Takatomo Kano (NTT Corporation); Marc Delcroix (NTT); Tomohiro Nakatani (NTT Communication Science Laboratories, NTT Corporation); Taichi Asami (NTT); Shoko Araki (NTT Corporation) | Challenge Task 1 DASR
13:00-14:00 | 4 | Multi-stage diarization refinement for the CHiME-7 DASR scenario | Christoph B Boeddeker (Paderborn University); Tobias Cord-Landwehr (Paderborn University); Thilo von Neumann (Paderborn University); Reinhold Haeb-Umbach (University of Paderborn) | Challenge Task 1 DASR
12:00-13:00 | 5 | The CHiME-7 Challenge: System Description and Performance of NeMo Team’s DASR System | Tae Jin Park (NVIDIA); He Huang (NVIDIA); Ante Jukić (NVIDIA); Kunal Dhawan (NVIDIA); Krishna C Puvvada (NVIDIA); Nithin Rao Koluguri (NVIDIA); Nikolay Karpov (NVIDIA); Aleksandr Laptev (NVIDIA, ITMO University); Jagadeesh Balam (NVIDIA); Boris Ginsburg (NVIDIA) | Challenge Task 1 DASR
13:00-14:00 | 6 | The NPU System for DASR Task of CHiME-7 Challenge | Bingshen Mu (Northwestern Polytechnical University); Pengcheng Guo (Northwestern Polytechnical University); He Wang (NWPU); Yangze Li (Northwestern Polytechnical University); Yang Li (Space AI, Li Auto); Pan Zhou (Space AI, Li Auto); Wei Chen (Space AI, Li Auto); Lei Xie (NWPU) | Challenge Task 1 DASR
12:00-13:00 | 7 | BUT CHiME-7 system description | Martin Karafiat (BUT speech@fit); Karel Veselý (Brno University of Technology); Igor Szoke (Brno University of Technology); Ladislav Mosner (Brno University of Technology); Karel Benes (Brno University of Technology); Marcin Witkowski (Akademia Górniczo-Hutnicza im. Stanisława Staszica w Krakowie); Ricardo German Barchi (Departamento de Computación, FCEyN, Universidad de Buenos Aires (UBA)); Leonardo D Pepino (Universidad de Buenos Aires) | Challenge Task 1 DASR
13:00-14:00 | 8 | The University of Cambridge System for the CHiME-7 DASR Task | Keqi Deng (University of Cambridge); Xianrui Zheng (University of Cambridge); Phil Woodland (Machine Intelligence Laboratory, Cambridge University Department of Engineering) | Challenge Task 1 DASR
12:00-13:00 | 9 | Do We Hyperarticulate on Zoom? | Sam O’Connor Russell (Trinity College Dublin); Ayushi Pandey (Trinity College Dublin); Naomi Harte (Trinity College Dublin) | Non-challenge papers
13:00-14:00 | 10 | Property-Aware Multi-Speaker Data Simulation: A Probabilistic Modelling Technique for Synthetic Data Generation | Tae Jin Park (NVIDIA); He Huang (NVIDIA); Coleman Hooper (NVIDIA); Nithin Rao Koluguri (NVIDIA); Kunal Dhawan (NVIDIA); Ante Jukić (NVIDIA); Jagadeesh Balam (NVIDIA); Boris Ginsburg (NVIDIA) | Non-challenge papers
12:00-13:00 | 11 | STCON System for the CHiME-7 Challenge | Tatiana Prisyach (STCON LLC.); Yuri Khokhlov (STCON LLC.); Maxim Korenevsky (Speech Technology Center); Anton Mitrofanov (STCON LLC.); Tatiana Timofeeva (STCON LLC.); Ilia Odegov (STCON LLC.); Rauf Nasretdinov (STC); Iurii Lezhenin (STCON LLC.); Dmitriy Miroshnichenko (STCON LLC.); Arsenii Karelin (STCON LLC.); Mariya Mitrofanova (STCON LLC.); Roman Svechnikov (STCON LLC.); Sergey Novoselov (ITMO University); Aleksei Romanenko (STCON LLC.) | Challenge Task 1 DASR