The workshop will be a full-day event from 9:00 to 18:10, followed by an on-site evening social. The programme is single-track and consists of a mix of oral and poster sessions, discussion sessions and two invited keynote talks.

Programme Overview

The preliminary programme is below. The start and end times are fixed, but there may be minor changes to the session timings.

07:45 Bus boarding at HICC (prior reservation required; the buses will carry the Microsoft placard and will depart at 08:00, or earlier if they are full)
08:30 Arrival at Microsoft, security clearance (bring a photo ID) and badge collection
09:10 Overview of the 5th CHiME Challenge
09:40 Oral session 1
10:40 Break
11:00 Keynote 1: Florian Metze (Carnegie Mellon University)
12:00 Discussion
12:20 Lunch
13:30 Speech research at Microsoft
13:45 Oral session 2
14:45 Poster session
16:30 Break
16:45 Keynote 2: John HL Hansen (University of Texas, Dallas)
18:10 Social event (food and drinks)
19:00 First bus boarding (prior reservation required; will depart from Microsoft at 19:15, or earlier if it is full, and arrive at HICC around 19:45)
19:45 Second bus boarding (prior reservation required; will depart from Microsoft at 20:00, or earlier if it is full, and arrive at HICC around 20:30)


9:40 Welcome [View Slides]
Jon Barker (University of Sheffield), Shinji Watanabe (Johns Hopkins University), Emmanuel Vincent (Inria)
10:00 Overview of the 5th CHiME Challenge [View Slides]
Jon Barker (University of Sheffield), Shinji Watanabe (Johns Hopkins University), Emmanuel Vincent (Inria), Jan Trmal (Johns Hopkins University)

Oral Session 1

Session chair: Mike Seltzer, Facebook

9:40 The STC System for the CHiME 2018 Challenge [Paper][View Slides]
Ivan Medennikov1,2, Ivan Sorokin1, Aleksei Romanenko1,2, Dmitry Popov1, Yuri Khokhlov1, Tatiana Prisyach3, Nikolay Malkovskiy3, Vladimir Bataev1, Sergei Astapov2, Maxim Korenevsky1 and Alexander Zatvornitskiy1,2,3 (1STC-innovations Ltd, St. Petersburg, Russia; 2ITMO University, St. Petersburg, Russia; 3Speech Technology Center Ltd, St. Petersburg, Russia)
10:00 The Hitachi/JHU CHiME-5 system: Advances in speech recognition for everyday home environments using multiple microphone arrays [Paper][View Slides]
Naoyuki Kanda, Rintaro Ikeshita, Shota Horiguchi, Yusuke Fujita, Kenji Nagamatsu (Hitachi, Ltd), Xiaofei Wang, Vimal Manohar, Nelson Enrique Yalta Soplin, Matthew Maciejewski, Szu-Jui Chen, Aswin Shanmugam Subramanian, Ruizhi Li, Zhiqi Wang, Jason Naradowsky, L. Paola Garcia-Perera and Gregory Sell (Johns Hopkins University)
10:20 The USTC-iFlytek systems for CHiME-5 Challenge [Paper][View Slides]
Jun Du, Tian Gao, Lei Sun (University of Science and Technology of China, Hefei), Feng Ma, Yi Fang, Di-Yuan Liu, Qiang Zhang, Xiang Zhang, Hai-Kun Wang, Jia Pan, Jian-Qing Gao (iFlytek Research, iFlytek Co., Ltd., Hefei), Chin-Hui Lee (Georgia Institute of Technology, Atlanta) and Jing-Dong Chen (Northwestern Polytechnical University, Shaanxi)

Keynote 1: Florian Metze, Carnegie Mellon University

Session chair: Arun Narayanan, Google

Open-domain audiovisual speech recognition and video summarization [View Slides]


Video understanding is one of the hardest challenges in AI. If a machine can look at videos and “understand” the events that are being shown, then machines could learn by themselves, perhaps even without supervision, simply by “watching” broadcast TV, Facebook, YouTube, or similar sites. Making progress towards this goal requires contributions from experts in diverse fields, including computer vision, automatic speech recognition, machine translation, natural language processing, multimodal information processing, and multimedia. I will report the outcomes of the JSALT 2018 Workshop on this topic, including advances in multitask learning for joint audiovisual captioning, summarization, and translation, as well as auxiliary tasks such as text-only translation, language modeling, story segmentation, and classification. I will demonstrate a few results on the “How-to” dataset of instructional videos harvested from the web by my team at Carnegie Mellon University and discuss remaining challenges and possible other datasets for this research.

Oral Session 2

Session chair: Yusuke Fujita, Hitachi

13:45 The NWPU System for CHiME-5 Challenge [Paper][View Slides]
Zhiwei Zhao, Jian Wu and Lei Xie (School of Computer Science, Northwestern Polytechnical University)
14:05 Channel selection from DNN posterior probability for speech recognition with distributed microphone arrays in everyday environments [Paper][View Slides]
Feifei Xiong1, Jisi Zhang1, Bernd Meyer2, Heidi Christensen1 and Jon Barker1 (1Speech and Hearing Research Group, University of Sheffield, UK; 2Medical Physics and Cluster of Excellence Hearing4All, University of Oldenburg, Germany)
14:25 Scaling speech enhancement in unseen environments with noise embeddings [Paper][View Slides]
Gil Keren1, Jing Han1 and Björn Schuller1,2 (1ZD.B Chair of Embedded Intelligence for Health Care and Wellbeing, University of Augsburg, Germany; 2GLAM – Group on Language, Audio & Music, Imperial College London, UK)

Poster Session

Session chair: Jun Du, University of Science and Technology of China

  • The ZTSpeech system for CHiME-5 Challenge: A far-field speech recognition system with front-end and robust back-end [Paper][View Poster]
    Chenxing Li1, Shuang Xu1, Tieqiang Wang1,2 and Bo Xu1 (1Institute of Automation, Chinese Academy of Sciences, Beijing, P.R. China; 2University of Chinese Academy of Sciences, Beijing, P.R. China)
  • The SHNU system for the CHiME-5 Challenge [Paper]
    Yanhua Long and Renke He (Laboratory of Natural Human-Computer Interaction, Shanghai Normal University, Shanghai)
  • Front-end processing for the CHiME-5 dinner party scenario [Paper]
    Christoph Boeddeker, Jens Heitkaemper, Joerg Schmalenstroeer, Lukas Drude, Jahn Heymann and Reinhold Haeb-Umbach (Paderborn University, Department of Communications Engineering, Paderborn, Germany)
  • DA-IICT/IIITV system for the 5th CHiME 2018 Challenge [Paper][View Poster]
    Ankur Patil1, Siva Krishna Maddala2, Mehak Piplani2, Aditya Sai Pulikonda2, Hardik Sailor1 and Hemant Patil1 (1Dhirubhai Ambani Institute of Information and Communication Technology, Gujarat, India; 2Indian Institute of Information Technology, Gujarat, India)
  • The Toshiba entry to the CHiME 2018 Challenge [Paper][View Poster]
    Rama Doddipatla1, Takehiko Kagoshima2, Cong-Thanh Do1, Petko Petkov1, Catalin-Tudor Zorila1, Uihyun Kim2, Daichi Hayakawa2, Hiroshi Fujimura2 and Yannis Stylianou1 (1Toshiba Cambridge Research Laboratory, Cambridge, United Kingdom; 2Toshiba Corporation Corporate R&D Center, Kawasaki, Japan)
  • The RWTH/UPB system combination for the CHiME 2018 Workshop [Paper]
    Markus Kitza1, Wilfried Michel1, Christoph Boeddeker2, Jens Heitkaemper2, Tobias Menne1, Ralf Schlüter1, Hermann Ney1, Joerg Schmalenstroeer2, Lukas Drude2, Jahn Heymann2 and Reinhold Haeb-Umbach2 (1RWTH Aachen University; 2Paderborn University)
  • The NDSC transcription system for the 2018 CHiME-5 Challenge [Paper]
    Dan Qu, Cheng-Ran Liu and Xu-Kiu Yang (National Digital Switching System Engineering and Technological R&D Center, Zhengzhou)
  • Multiple beamformers with ROVER for the CHiME-5 Challenge [Paper] [View Poster]
    Sining Sun1, Yangyang Shi1, Ching-Feng Yeh2, Suliang Bu3, Mei-Yuh Hwang2 and Lei Xie1 (1School of Computer Science, Northwestern Polytechnical University, Xi’an, China; 2Mobvoi AI Lab, Seattle, USA; 3Dept. of Electrical Engineering and Computer Science, University of Missouri-Columbia, USA)
  • NMF based front-end processing in multi-channel distant speech recognition [Paper][View Poster]
    Nikhil Mohanan, Premanand Nayak, Rajbabu Velmurugan, Preeti Rao (Indian Institute of Technology Bombay), Sonal Joshi, Ashish Panda, Meet Soni, Rupayan Chakraborty and Sunilkumar Kopparapu (Tata Consultancy Services, India)
  • CHiME 2018 Workshop: Enhancing beamformed audio using time delay neural network denoising autoencoder [Paper][View Poster]
    Sonal Joshi, Ashish Panda, Meet Soni, Rupayan Chakraborty, Sunilkumar Kopparapu (TCS Innovation Labs, Mumbai), Nikhil Mohanan, Premanand Nayak, Rajbabu Velmurugan and Preeti Rao (Indian Institute of Technology Bombay, Powai)
  • Situation informed end-to-end ASR for noisy environments [Paper][View Poster]
    Siddharth Dalmia, Suyoun Kim and Florian Metze (Carnegie Mellon University, Pittsburgh)
  • Robust network structures for acoustic model on CHiME5 Challenge dataset [Paper][View Poster]
    Alim Misbullah (da Vinci Innovation Lab, ASUSTek Computer Inc., Taiwan)
  • Channel-selection for distant-speech recognition on CHiME-5 dataset [Paper][View Poster]
    Hannes Unterholzner, Lukas Pfeifenberger, Franz Pernkopf (Graz University of Technology, Austria), Marco Matassoni, Alessio Brutti and Daniele Falavigna (Fondazione Bruno Kessler, Center for Information and Communication Technology, Trento, Italy)
  • Acoustic features fusion using attentive multi-channel deep architecture [Paper]
    Gaurav Bhatt, Akshita Gupta, Aditya Arora and Balasubramanian Raman (Indian Institute of Technology Roorkee)
  • The AnTech system for CHiME-5 Challenge [Paper]
    Tao Wang, Xiufeng Li and Lin Wang (AnTech, China)
  • A novel speech enhancement method based on multiple-microphone arrays [Paper]
    Bo Fu, Yijia Wang, Dan Zou and Wenbo Yang (Lenovo Research)
  • LEAP submission to CHiME-5 Challenge [Paper][View Poster]
    Sriram Ganapathy and Purvi Agrawal (LEAP, Dept. of Electrical Eng., Indian Institute of Science, Bengaluru)

Keynote 2: John HL Hansen, University of Texas at Dallas

Session chair: Ralf Schlüter, RWTH Aachen University

Robust speaker diarization and recognition in naturalistic data streams: Challenges for multi-speaker tasks & learning spaces


Speech technology is advancing beyond general speech recognition for voice command and telephone applications. Today, the emergence of many voice-enabled speech systems has created the need for more effective distant speech capture and automatic speech and speaker recognition. The ability to employ speech and language technology to assess human-to-human interactions is opening up new research paradigms which can have a profound impact on assessing human interaction, including personal communication traits, and contribute to improving the quality of life and educational experience of individuals. In this talk, we will explore recent research trends in automatic audio diarization and speaker recognition for audio streams which include multiple tracks, speakers, and environments with distant speech capture. Specifically, we will consider (i) the Prof-Life-Log corpus, (ii) education-based child and student Peer-Led Team Learning, and (iii) Apollo-11 massive multi-track audio processing (19,000 hours of data). These domains will be discussed in the context of the CHiME workshops in terms of algorithmic advancements, as well as directions for continued research.