Programme

The workshop will be a full-day event from 9:00 to 18:10, followed by an on-site evening social. The programme is single-track and consists of a mix of oral and poster sessions, discussion sessions and two invited keynote talks.

Programme Overview

The preliminary programme is below. The start and end times are fixed, but there may be minor changes to the session timings.

07:45	Bus boarding at HICC (prior reservation required, the buses will carry the Microsoft placard and will depart at 08:00 or earlier if they are full)
08:30	Arrival at Microsoft, security clearance (bring a photo ID) and badge collection
	-
09:00	Welcome
09:10	Overview of the 5th CHiME Challenge
09:40	Oral session 1
10:40	Break
11:00	Keynote 1: Florian Metze (Carnegie Mellon University)
12:00	Discussion
12:20	Lunch
13:30	Speech research at Microsoft
13:45	Oral session 2
14:45	Poster session
16:30	Break
16:45	Keynote 2: John HL Hansen (University of Texas at Dallas)
17:45	Discussion
18:00	Closing
18:10	Social event (food and drinks)
	-
19:00	First bus boarding (prior reservation required, will depart from Microsoft at 19:15 or earlier if it is full, and arrive at HICC around 19:45)
19:45	Second bus boarding (prior reservation required, will depart from Microsoft at 20:00 or earlier if it is full, and arrive at HICC around 20:30)

Introduction

9:00	Welcome [View Slides] Jon Barker (University of Sheffield), Shinji Watanabe (Johns Hopkins University), Emmanuel Vincent (Inria)
9:10	Overview of the 5th CHiME Challenge [View Slides] Jon Barker (University of Sheffield), Shinji Watanabe (Johns Hopkins University), Emmanuel Vincent (Inria), Jan Trmal (Johns Hopkins University)

Oral Session 1

Session chair: Mike Seltzer, Facebook

9:40	The STC System for the CHiME 2018 Challenge [Paper][View Slides] Ivan Medennikov¹,², Ivan Sorokin¹, Aleksei Romanenko¹,², Dmitry Popov¹, Yuri Khokhlov¹, Tatiana Prisyach³, Nikolay Malkovskiy³, Vladimir Bataev¹, Sergei Astapov², Maxim Korenevsky¹ and Alexander Zatvornitskiy¹,²,³ (¹STC-innovations Ltd, St. Petersburg, Russia; ²ITMO University, St. Petersburg, Russia; ³Speech Technology Center Ltd, St. Petersburg, Russia)
10:00	The Hitachi/JHU CHiME-5 system: Advances in speech recognition for everyday home environments using multiple microphone arrays [Paper][View Slides] Naoyuki Kanda, Rintaro Ikeshita, Shota Horiguchi, Yusuke Fujita, Kenji Nagamatsu (Hitachi, Ltd), Xiaofei Wang, Vimal Manohar, Nelson Enrique Yalta Soplin, Matthew Maciejewski, Szu-Jui Chen, Aswin Shanmugam Subramanian, Ruizhi Li, Zhiqi Wang, Jason Naradowsky, L. Paola Garcia-Perera and Gregory Sell (Johns Hopkins University)
10:20	The USTC-iFlytek systems for CHiME-5 Challenge [Paper][View Slides] Jun Du, Tian Gao, Lei Sun (University of Science and Technology of China, Hefei), Feng Ma, Yi Fang, Di-Yuan Liu, Qiang Zhang, Xiang Zhang, Hai-Kun Wang, Jia Pan, Jian-Qing Gao (iFlytek Research, iFlytek Co., Ltd., Hefei), Chin-Hui Lee (Georgia Institute of Technology, Atlanta) and Jing-Dong Chen (Northwestern Polytechnical University, Shaanxi)

Keynote 1: Florian Metze, Carnegie Mellon University

Session chair: Arun Narayanan, Google

Open-domain audiovisual speech recognition and video summarization [View Slides]

Abstract

Video understanding is one of the hardest challenges in AI. If a machine can look at videos and “understand” the events that are being shown, then machines could learn by themselves, perhaps even without supervision, simply by “watching” broadcast TV, Facebook, YouTube, or similar sites. Making progress towards this goal requires contributions from experts in diverse fields, including computer vision, automatic speech recognition, machine translation, natural language processing, multimodal information processing, and multimedia. I will report the outcomes of the JSALT 2018 Workshop on this topic, including advances in multitask learning for joint audiovisual captioning, summarization, and translation, as well as auxiliary tasks such as text-only translation, language modeling, story segmentation, and classification. I will demonstrate a few results on the “How-to” dataset of instructional videos harvested from the web by my team at Carnegie Mellon University and discuss remaining challenges and possible other datasets for this research.

Oral Session 2

Session chair: Yusuke Fujita, Hitachi

13:45	The NWPU System for CHiME-5 Challenge [Paper][View Slides] Zhiwei Zhao, Jian Wu and Lei Xie (School of Computer Science, Northwestern Polytechnical University)
14:05	Channel selection from DNN posterior probability for speech recognition with distributed microphone arrays in everyday environments [Paper][View Slides] Feifei Xiong¹, Jisi Zhang¹, Bernd Meyer², Heidi Christensen¹ and Jon Barker¹ (¹Speech and Hearing Research Group, University of Sheffield, UK; ²Medical Physics and Cluster of Excellence Hearing4All, University of Oldenburg, Germany)
14:25	Scaling speech enhancement in unseen environments with noise embeddings [Paper][View Slides] Gil Keren¹, Jing Han¹ and Björn Schuller¹,² (¹ZD.B Chair of Embedded Intelligence for Health Care and Wellbeing, University of Augsburg, Germany; ²GLAM – Group on Language, Audio & Music, Imperial College London, UK)

Poster Session

Session chair: Jun Du, University of Science and Technology of China

The ZTSpeech system for CHiME-5 Challenge: A far-field speech recognition system with front-end and robust back-end [Paper][View Poster]
Chenxing Li¹, Shuang Xu¹, Tieqiang Wang¹,² and Bo Xu¹ (¹Institute of Automation, Chinese Academy of Sciences, Beijing, P.R. China; ²University of Chinese Academy of Sciences, Beijing, P.R. China)
The SHNU system for the CHiME-5 Challenge [Paper]
Yanhua Long and Renke He (Laboratory of Natural Human-Computer Interaction, Shanghai Normal University, Shanghai)
Front-end processing for the CHiME-5 dinner party scenario [Paper]
Christoph Boeddeker, Jens Heitkaemper, Joerg Schmalenstroeer, Lukas Drude, Jahn Heymann and Reinhold Haeb-Umbach (Paderborn University, Department of Communications Engineering, Paderborn, Germany)
DA-IICT/IIITV system for the 5th CHiME 2018 Challenge [Paper][View Poster]
Ankur Patil¹, Siva Krishna Maddala², Mehak Piplani², Aditya Sai Pulikonda², Hardik Sailor¹ and Hemant Patil¹ (¹Dhirubhai Ambani Institute of Information and Communication Technology, Gujarat, India; ²Indian Institute of Information Technology, Gujarat, India)
The Toshiba entry to the CHiME 2018 Challenge [Paper][View Poster]
Rama Doddipatla¹, Takehiko Kagoshima², Cong-Thanh Do¹, Petko Petkov¹, Catalin-Tudor Zorila¹, Uihyun Kim², Daichi Hayakawa², Hiroshi Fujimura² and Yannis Stylianou¹ (¹Toshiba Cambridge Research Laboratory, Cambridge, United Kingdom; ²Toshiba Corporation Corporate R&D Center, Kawasaki, Japan)
The RWTH/UPB system combination for the CHiME 2018 Workshop [Paper]
Markus Kitza¹, Wilfried Michel¹, Christoph Boeddeker², Jens Heitkaemper², Tobias Menne¹, Ralf Schlüter¹, Hermann Ney¹, Joerg Schmalenstroeer², Lukas Drude², Jahn Heymann² and Reinhold Haeb-Umbach² (¹RWTH Aachen University; ²Paderborn University)
The NDSC transcription system for the 2018 CHiME-5 Challenge [Paper]
Dan Qu, Cheng-Ran Liu and Xu-Kiu Yang (National Digital Switching System Engineering and Technological R&D Center, Zhengzhou)
Multiple beamformers with ROVER for the CHiME-5 Challenge [Paper][View Poster]
Sining Sun¹, Yangyang Shi¹, Ching-Feng Yeh², Suliang Bu³, Mei-Yuh Hwang² and Lei Xie¹ (¹School of Computer Science, Northwestern Polytechnical University, Xi’an, China; ²Mobvoi AI Lab, Seattle, USA; ³Dept. of Electrical Engineering and Computer Science, University of Missouri-Columbia, USA)
NMF based front-end processing in multi-channel distant speech recognition [Paper][View Poster]
Nikhil Mohanan, Premanand Nayak, Rajbabu Velmurugan, Preeti Rao (Indian Institute of Technology Bombay), Sonal Joshi, Ashish Panda, Meet Soni, Rupayan Chakraborty and Sunilkumar Kopparapu (Tata Consultancy Services, India)
CHiME 2018 Workshop: Enhancing beamformed audio using time delay neural network denoising autoencoder [Paper][View Poster]
Sonal Joshi, Ashish Panda, Meet Soni, Rupayan Chakraborty, Sunilkumar Kopparapu (TCS Innovation Labs, Mumbai), Nikhil Mohanan, Premanand Nayak, Rajbabu Velmurugan and Preeti Rao (Indian Institute of Technology Bombay, Powai)
Situation informed end-to-end ASR for noisy environments [Paper][View Poster]
Siddharth Dalmia, Suyoun Kim and Florian Metze (Carnegie Mellon University, Pittsburgh)
Robust network structures for acoustic model on CHiME5 Challenge dataset [Paper][View Poster]
Alim Misbullah (da Vinci Innovation Lab, ASUSTek Computer Inc., Taiwan)
Channel-selection for distant-speech recognition on CHiME-5 dataset [Paper][View Poster]
Hannes Unterholzner, Lukas Pfeifenberger, Franz Pernkopf (Graz University of Technology, Austria), Marco Matassoni, Alessio Brutti and Daniele Falavigna (Fondazione Bruno Kessler, Center for Information and Communication Technology, Trento, Italy)
Acoustic features fusion using attentive multi-channel deep architecture [Paper]
Gaurav Bhatt, Akshita Gupta, Aditya Arora and Balasubramanian Raman (Indian Institute of Technology Roorkee)
The AnTech system for CHiME-5 Challenge [Paper]
Tao Wang, Xiufeng Li and Lin Wang (AnTech, China)
A novel speech enhancement method based on multiple-microphone arrays [Paper]
Bo Fu, Yijia Wang, Dan Zou and Wenbo Yang (Lenovo Research)
LEAP submission to CHiME-5 Challenge [Paper][View Poster]
Sriram Ganapathy and Purvi Agrawal (LEAP, Dept. of Electrical Eng., Indian Institute of Science, Bengaluru)

Keynote 2: John HL Hansen, University of Texas at Dallas

Session chair: Ralf Schlüter, RWTH Aachen University

Robust speaker diarization and recognition in naturalistic data streams: Challenges for multi-speaker tasks & learning spaces

Abstract

Speech technology is advancing beyond general speech recognition for voice command and telephone applications. Today, the emergence of many voice-enabled speech systems has created the need for more effective distant speech capture and automatic speech and speaker recognition. The ability to employ speech and language technology to assess human-to-human interactions is opening up new research paradigms which can have a profound impact on assessing human interaction, including personal communication traits, and contribute to improving the quality of life and educational experience of individuals. In this talk, we will explore recent research trends on automatic audio diarization and speaker recognition for audio streams which include multiple tracks, speakers, and environments with distant speech capture. Specifically, we will consider (i) the Prof-Life-Log corpus, (ii) education-based child and student Peer-Led Team Learning, and (iii) Apollo-11 massive multi-track audio processing (19,000 hours of data). These domains will be discussed in the context of the CHiME workshops in terms of algorithmic advancements, as well as directions for continued research.