2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

My name is Zhao Ren and I am a PhD student working in the ZD.B chair of Embedded Intelligence for Health Care and Wellbeing at the University of Augsburg in Germany. Since 2018, I have been funded by a research fellowship on the TAPAS project. My research is focused on applying machine learning methods to detect pathological conditions through emotion recognition.

With support from TAPAS, I attended the 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) in Brighton, UK, in May. ICASSP is the largest technical conference focused on signal processing and its applications. This year, a total of 1725 papers were accepted, with an acceptance rate of 49%.

My first-author paper, ‘Attention-based Atrous Convolutional Neural Networks: Visualisation and Understanding Perspectives of Acoustic Scenes’, was accepted as an oral presentation. In the paper, I propose attention-based atrous convolutional neural networks (CNNs) to better visualise and understand deep neural networks. The paper employs an attention mechanism to visualise the contribution of each time-frequency bin to the classification. In speech pathology detection, spectrogram images are extracted as input to CNNs, but it is difficult to recognise the pathological state at the time-frequency level. Hence, attention-based atrous CNNs have the potential to be applied to pathological speech detection in future work. You can read the paper in the online proceedings of the meeting here.
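To make the idea of per-bin contributions concrete, here is a minimal NumPy sketch of attention pooling over time-frequency bins. It is not the implementation from the paper: the function name `attention_pool`, the array shapes, and the choice of a softmax over all bins of each class are assumptions made purely for illustration, and the random arrays stand in for the outputs of a trained network.

```python
import numpy as np


def attention_pool(class_maps, attention_logits):
    """Pool per-bin class evidence with attention weights (illustrative only).

    Both inputs have shape (n_classes, n_freq, n_time); in a real model they
    would be produced by convolutional layers applied to a spectrogram.
    """
    n_classes = class_maps.shape[0]

    # Softmax over all time-frequency bins, separately for each class.
    flat = attention_logits.reshape(n_classes, -1)
    flat = flat - flat.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(flat)
    weights /= weights.sum(axis=1, keepdims=True)
    weights = weights.reshape(class_maps.shape)

    # Each bin's contribution to a class score; this is the map one would plot.
    contributions = weights * class_maps

    # The class score is the sum of the contributions of all bins.
    scores = contributions.sum(axis=(1, 2))
    return scores, contributions


# Hypothetical stand-in data: 3 classes, 4 frequency bins, 5 time frames.
rng = np.random.default_rng(0)
class_maps = rng.normal(size=(3, 4, 5))
attention_logits = rng.normal(size=(3, 4, 5))

scores, contributions = attention_pool(class_maps, attention_logits)
predicted = int(np.argmax(scores))
```

Because each class score is simply the sum of its contribution map, plotting `contributions[predicted]` over the spectrogram shows which time-frequency regions drove the decision in this toy setting.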

In addition, a paper I co-authored, ‘Implicit Fusion by Joint Audiovisual Training for Emotion Recognition in Mono Modality’, was also accepted as an oral presentation. This paper proposes an approach to implicitly fuse the representations learnt from audiovisual emotional data. Compared to fusion methods from previous studies, such as early fusion, late fusion, model-level fusion, and multi-task learning, implicit fusion can take advantage of other modalities to train models for mono-modal scenarios. The paper is available in the online proceedings of the meeting here.

There were several sessions related to speech pathology and emotion that I found interesting, such as ‘Analysis of Voice, Speech and Language Disorders’ and ‘Architectures for Emotion and Sentiment Analysis’. Many state-of-the-art machine learning architectures were proposed for pathological and emotional speech processing, which are relevant to the research we are doing in TAPAS. In one paper, ‘Context-aware Deep Learning for Multi-modal Depression Detection’, the authors proposed a novel approach using transformer and 1D CNN models for the text and audio modalities to recognise depression. Their proposed method improves on the results of state-of-the-art methods. Multi-modal pathological speech detection can improve on the performance of single-modality approaches. Therefore, multi-modal data processing is one of the trends in the area of pathological speech detection.

Overall, the meeting was a rewarding experience, and I would like to thank TAPAS for providing me with the opportunity to attend.