End-to-end listening agent for audiovisual emotional and naturalistic interactions

Kevin El Haddad
Yara Rizk
Louise Heron
Nadine Hajj
Yong Zhao
Jaebok Kim
Trung Ngô Trọng
Minha Lee
Marwan Doumit
Payton Lin
Yelin Kim
Hüseyin Çakmak

Abstract

In this work, we established the foundations of a framework for building an end-to-end naturalistic expressive listening agent. The project was split into modules for the recognition of the user's paralinguistic and nonverbal expressions, the prediction of the agent's reactions, the synthesis of the agent's expressions, and the recording of a database of nonverbal conversational expressions. First, a multimodal multitask deep learning-based emotion classification system was built, along with a rule-based visual expression detection system. Then, several sequence prediction systems for nonverbal expressions were implemented and compared. An audiovisual concatenation-based synthesis system was also implemented. Finally, a naturalistic dyadic emotional conversation database was collected. We report here the work carried out for each of these modules and our planned future improvements.
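
As a concrete illustration of the multitask recognition module mentioned above, the sketch below shows one common way such a system can be structured: a shared sequence encoder over acoustic feature frames feeding two task heads, one for categorical emotion classification and one for an auxiliary attribute. This is a minimal, hypothetical example in PyTorch; the framework, layer sizes, class count, and the choice of arousal regression as the auxiliary task are all assumptions, not details taken from the paper.

```python
# Minimal multitask emotion recognition sketch (assumed architecture,
# not the paper's exact model): a shared GRU encoder with two heads.
import torch
import torch.nn as nn

class MultiTaskEmotionNet(nn.Module):
    def __init__(self, feat_dim=40, hidden=128, n_emotions=4):
        super().__init__()
        self.encoder = nn.GRU(feat_dim, hidden, batch_first=True)
        self.emotion_head = nn.Linear(hidden, n_emotions)  # categorical emotions
        self.arousal_head = nn.Linear(hidden, 1)           # auxiliary regression task

    def forward(self, x):
        # x: (batch, time, feat_dim) acoustic feature frames, e.g. filterbanks
        _, h = self.encoder(x)        # final hidden state: (1, batch, hidden)
        h = h.squeeze(0)
        return self.emotion_head(h), self.arousal_head(h)

model = MultiTaskEmotionNet()
frames = torch.randn(2, 100, 40)      # two utterances of 100 frames each
emo_logits, arousal = model(frames)
print(emo_logits.shape, arousal.shape)  # torch.Size([2, 4]) torch.Size([2, 1])
```

In a multitask setup of this kind, the losses of the two heads are typically combined with a weighted sum, so the shared encoder learns representations useful for both tasks.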

Keywords: Listening agent, Smile, Laughter, Head movement, Eyebrow movement, Speech emotion recognition, Nonverbal expression detection, Sequence-to-sequence prediction systems, Nonverbal expression synthesis, Emotion database, Dyadic conversation
