End-to-end listening agent for audiovisual emotional and naturalistic interactions

Kevin El Haddad
Yara Rizk
Louise Heron
Nadine Hajj
Yong Zhao
Jaebok Kim
Trung Ngô Trọng
Minha Lee
Marwan Doumit
Payton Lin
Yelin Kim
Hüseyin Çakmak

Abstract

In this work, we established the foundations of a framework for building an end-to-end naturalistic expressive listening agent. The project was split into modules for the recognition of the user's paralinguistic and nonverbal expressions, the prediction of the agent's reactions, the synthesis of the agent's expressions, and the recording of a database of nonverbal conversational expressions. First, a multimodal multitask deep learning-based emotion classification system was built, along with a rule-based visual expression detection system. Then, several sequence prediction systems for nonverbal expressions were implemented and compared. An audiovisual concatenation-based synthesis system was also implemented. Finally, a naturalistic dyadic emotional conversation database was collected. We report here the work carried out for each of these modules and our planned future improvements.
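
As a concrete illustration of the multitask recognition module mentioned above, the sketch below shows one common way such a system can be structured: a shared sequence encoder over acoustic feature frames feeding two task heads, one for categorical emotion classification and one for an auxiliary attribute. This is a minimal, hypothetical example in PyTorch; the framework, layer sizes, class count, and the choice of arousal regression as the auxiliary task are all assumptions, not details taken from the paper.

```python
# Minimal multitask emotion recognition sketch (assumed architecture,
# not the paper's exact model): a shared GRU encoder with two heads.
import torch
import torch.nn as nn

class MultiTaskEmotionNet(nn.Module):
    def __init__(self, feat_dim=40, hidden=128, n_emotions=4):
        super().__init__()
        self.encoder = nn.GRU(feat_dim, hidden, batch_first=True)
        self.emotion_head = nn.Linear(hidden, n_emotions)  # categorical emotions
        self.arousal_head = nn.Linear(hidden, 1)           # auxiliary regression task

    def forward(self, x):
        # x: (batch, time, feat_dim) acoustic feature frames, e.g. filterbanks
        _, h = self.encoder(x)        # final hidden state: (1, batch, hidden)
        h = h.squeeze(0)
        return self.emotion_head(h), self.arousal_head(h)

model = MultiTaskEmotionNet()
frames = torch.randn(2, 100, 40)      # two utterances of 100 frames each
emo_logits, arousal = model(frames)
print(emo_logits.shape, arousal.shape)  # torch.Size([2, 4]) torch.Size([2, 1])
```

In a multitask setup of this kind, the losses of the two heads are typically combined with a weighted sum, so the shared encoder learns representations useful for both tasks.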

Keywords: Listening agent, Smile, Laughter, Head movement, Eyebrow movement, Speech emotion recognition, Nonverbal expression detection, Sequence-to-sequence prediction systems, Nonverbal expression synthesis, Emotion database, Dyadic conversation
