End-to-end listening agent for audiovisual emotional and naturalistic interactions


ABSTRACT
In this work, we established the foundations of a framework with the goal to build an end-to-end naturalistic expressive listening agent. The project was split into modules for recognition of the user's paralinguistic and nonverbal expressions, prediction of the agent's reactions, synthesis of the agent's expressions and data recordings of nonverbal conversation expressions. First, a multimodal multitask deep learning-based emotion classification system was built along with a rule-based visual expression detection system. Then several sequence prediction systems for nonverbal expressions were implemented and compared. Also, an audiovisual concatenation-based synthesis system was implemented. Finally, a naturalistic, dyadic emotional conversation database was collected. We report here the work made for each of these modules and our planned future improvements.

| INTRODUCTION
This project is part of the eNTERFACE'17 Workshop. eNTERFACE is a multidisciplinary workshop focusing on multimodal interfaces. Every year, it gathers researchers from around the world to work on different projects for a month. The goal of this project is to build a listening agent that reacts to a user mainly through nonverbal expressions. Our ultimate goal is a virtual agent that recognizes and takes into account various nonverbal expressions and reacts to the user by generating naturalistic feedback. Here, we consider a context of dyadic interactions. Since we focus on nonverbal expressions, the subject of the discussion is not relevant: the goal of the agent is to react to nonverbal expressions with nonverbal expressions. Ideally, the system would run in real time. One of the main challenges of this project is to tease apart the effect of verbal and semantic content in speech. We therefore focus on the speaker's nonverbal and paralinguistic behaviors to predict and generate the agent's nonverbal behavior in real time. Figure 1 shows the overall workflow of our project, which is a basic pipeline of a human-agent interaction system. Our agent will be built on recognition, prediction and synthesis modules. The recognition module detects the relevant expressions, from which the prediction module decides what the agent's reaction should be. This reaction is then generated by the synthesis module, whose output is rendered on a human-like avatar (Figure 1). In parallel to the development of these modules, we collected a naturalistic emotional dyadic conversation database, as explained later in this paper.
The project we propose is inspired by previous work such as (El Haddad, Cakmak, Gilmartin, Dupont, & Dutoit, 2016). In that work, the authors developed some of the modules described here: an audiovisual (AV) concatenative synthesis system and a prediction system, both built to create a listening agent. The prediction system is a Conditional Random Field (CRF) that takes as input a sequence of labels from a speaker and predicts the most suitable sequence of expressions for the agent. The synthesis system generates the AV smiles and laughs predicted by the CRF.
A first step towards a fully functional listening agent is to build a recognition module that would feed a prediction system such as the one mentioned above. For this we utilize both audio and visual signals for emotion recognition using machine learning techniques, such as temporal (Kim & Provost, 2016) or deep learning models (Kim, Provost, & Lee, 2013), (Kim, et al., 2017a, 2017b). We also compare several sequence prediction systems other than the CRF to predict the agent's expressions of attentiveness when reacting to a speaker/user. In (El Haddad, Cakmak, Gilmartin, Dupont, & Dutoit, 2016), only smiles, laughs and their intensity levels were considered; here we consider more expressions, as detailed in what follows.
In this project, we also work on improving the synthesis system mentioned above by making it more generic, so that it becomes easier to use and able to generate a more varied set of expressions.
The expressions considered in this project are: laughs and smiles and their intensity dimensions, head movements (nodding, shaking and tilting) and eyebrow movements (raise and frown), for they frequently occur in dyadic interactions. These expressions are a part of all previously mentioned modules. Depending on the module concerned, the audio, video and motion capture signals will be considered.
In what follows, we detail the work done in this project for each of the previously mentioned modules.

| RECOGNITION
In this section we describe the detection of nonverbal and paralinguistic events occurring during an interaction with a user. This was split into two main tasks:
1. Multimodal emotion recognition (MER), which predicts arousal and valence values for incoming sentences.
2. Expression detection, which detects a list of conversational nonverbal expressions.
Since our agent should work in noisy environments and with "in the wild" data, we chose, for this module, to work with the RECOLA and SEWA databases which meet our requirements.

MULTIMODAL EMOTION RECOGNITION
The state-of-the-art techniques for MER are based on deep learning (Trigeorgis, et al., 2016). A multimodal system was built with a late fusion approach. On one side, the spectrograms of the audio cue were used to train a system similar to the one described in (Kim, et al., 2017b). As shown in Figure 2, a convolutional neural network (CNN) was used to extract descriptors from the data. The extracted features are then fed to a long short-term memory network (LSTM), which is expected to learn dynamic features since the data is time-dependent. On the other side, we used the pre-trained VGG-16 network to extract features from the visual cue, which consisted of the subject's face images cropped using the OpenFace tool (Baltrušaitis, Robinson, & Morency, 2016). Finally, both networks were connected to a Fully Connected (FC) neural network, and the arousal and valence tasks were trained simultaneously in a multi-task learning fashion (Kim, et al., 2017a).
Due to the differences between the data in the two datasets, the system was trained and tested on each database separately. Both contain continuous annotations of valence and arousal. Considering the relatively limited amount of data available in each database, the valence and arousal values were discretized similarly to (Tadas Baltrušaitis, 2016), turning the regression problem into a binary classification problem.
By the end of the workshop we evaluated our model using the RECOLA dataset. We conducted 5-fold speaker-independent cross-validation and obtained an unweighted accuracy of approximately 60% for arousal and 50% for valence classification.
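As an illustration of the discretization and of the evaluation metric, here is a minimal sketch. It assumes that "unweighted accuracy" is meant in its usual sense of the mean of per-class recalls, and that continuous labels are split at a threshold of 0.0; the threshold and the function names `binarize` and `unweighted_accuracy` are assumptions made for this example, not details given in the text.

```python
from typing import List, Sequence


def binarize(values: Sequence[float], threshold: float = 0.0) -> List[int]:
    """Turn continuous arousal/valence values into binary classes.

    The threshold is an assumption; the text does not state the cut point.
    """
    return [1 if v > threshold else 0 for v in values]


def unweighted_accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Mean of per-class recalls, so every class counts equally."""
    recalls = []
    for c in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == c]
        correct = sum(1 for i in idx if y_pred[i] == c)
        recalls.append(correct / len(idx))
    return sum(recalls) / len(recalls)


# A classifier that always predicts the majority class scores 0.5
# on this metric, even though its plain accuracy is 0.75.
y_true = [1, 1, 1, 0]
y_pred = [1, 1, 1, 1]
score = unweighted_accuracy(y_true, y_pred)
```

With two classes, a value of 0.5 on this metric corresponds to chance level, which is useful to keep in mind when reading binary classification results.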

EXPRESSION DETECTION
For this task, we chose rule-based approaches instead of deep learning based ones due to both the limited amount of data available and the considerable variance of nonverbal conversation expressions within and across speakers. Of the four expressions mentioned in the introduction, only three were considered and extracted from the video recordings of the human subject: head movement, eyebrow movement and smiling. Although some preliminary work on laughter was undertaken, results concerning it will be reported in future work. The workflow to extract these expressions is shown in Figure 3. The video processing is done mainly with OpenFace, which extracts action units (AUs) and the coordinates of points of interest (POIs), or landmarks, in the face.

Smile Detection
Smile detection is based on the values of AU6 (cheek raiser) and AU12 (lip corner puller). If at least one of these AUs is detected in a frame, a smile is detected and its intensity is based on the intensity of the detected AU. Figure 4-a) plots the intensity of the smiles detected in a video recording with over 8000 frames. Figure 4-b) shows one of the frames from the video where the smile of the subject is detected.
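The smile rule can be sketched as follows. This is only an illustration: the function name `detect_smile` is hypothetical, an AU is treated as detected when its intensity is above zero, and taking the larger of the two intensities as the smile intensity is an assumption, since the text does not say how the two AUs are combined.

```python
from typing import Tuple


def detect_smile(au6: float, au12: float) -> Tuple[bool, float]:
    """Return (smile_detected, intensity) for a single frame.

    au6, au12: AU intensities for the frame, 0 meaning "not detected".
    """
    active = [v for v in (au6, au12) if v > 0.0]
    if not active:
        return False, 0.0
    # Assumption: the strongest detected AU gives the smile intensity.
    return True, max(active)
```

For example, `detect_smile(0.0, 2.5)` reports a smile of intensity 2.5, while a frame with both AUs at zero reports no smile.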

Eyebrow Movement Detection
Eyebrow movement detection is based on the values of AU1 (inner brow raiser), AU2 (outer brow raiser) and AU4 (brow lowerer). This rule-based approach determines whether the subject has the eyebrows raised or lowered by comparing the intensities of these three AUs. Specifically, if AU1 or AU2 has a greater intensity than AU4, the subject is determined to have raised eyebrows. Otherwise, the subject is considered to have lowered eyebrows. If all three AUs have an intensity of zero, the subject's eyebrows are in a neutral position (neither raised nor lowered).
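A minimal sketch of this comparison rule is given below. The function name and the string labels are chosen for illustration only.

```python
def classify_eyebrows(au1: float, au2: float, au4: float) -> str:
    """Classify a frame as 'neutral', 'raised' or 'lowered' from AU intensities."""
    # All three AUs at zero: eyebrows are neither raised nor lowered.
    if au1 == 0 and au2 == 0 and au4 == 0:
        return "neutral"
    # A brow raiser (inner or outer) stronger than the brow lowerer.
    if au1 > au4 or au2 > au4:
        return "raised"
    return "lowered"
```

Note that, following the rule as stated, ties between a raiser and the lowerer fall into the "lowered" case.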

Head Movement Detection
Head movement detection distinguishes between three possible states: head nodding, head shaking or neither. A rule-based approach is adopted that tracks the changes of two POIs located between the eyes to identify the type of head movement. Head shaking is characterized by significant oscillation of the x-coordinate of the POIs with minimal changes in the y-coordinate, while head nodding is characterized by y-axis oscillations with minimal changes in the x direction. If both coordinates exhibit large changes, we consider that the whole head was displaced due to the subject moving in the video; the subjects were not required to stay still. Table 1 summarizes the adopted algorithm to detect head movements.
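The following sketch illustrates the idea on a short window of frames. It is not the algorithm of Table 1: movement is measured here as the sum of absolute frame-to-frame changes of a single tracked coordinate, and the threshold value, window length and function names are assumptions made for the example.

```python
from typing import Sequence


def path_length(values: Sequence[float]) -> float:
    """Sum of absolute frame-to-frame changes over the window."""
    return sum(abs(b - a) for a, b in zip(values, values[1:]))


def classify_head_movement(xs: Sequence[float], ys: Sequence[float],
                           threshold: float = 5.0) -> str:
    """Classify a window of POI coordinates as 'shake', 'nod' or 'none'.

    xs, ys: x and y coordinates of the tracked POI over the window.
    threshold: hypothetical movement threshold (in pixels).
    """
    x_large = path_length(xs) > threshold
    y_large = path_length(ys) > threshold
    if x_large and not y_large:
        return "shake"   # horizontal oscillation only
    if y_large and not x_large:
        return "nod"     # vertical oscillation only
    # Both large (whole head displaced) or both small: no gesture.
    return "none"


# Ten frames of horizontal oscillation with a still vertical coordinate.
oscillating = [0, 2, 0, 2, 0, 2, 0, 2, 0, 2]
still = [0] * 10
```

With these toy windows, `classify_head_movement(oscillating, still)` gives "shake", swapping the arguments gives "nod", and passing the oscillating window for both axes gives "none".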

Laughter Detection
Laughter detection systems can be found in the current literature, such as (Neuberger & Beke, 2013), but they are usually not effective in noisy environments when relying on the audio and video modalities only, especially for low-intensity laughs. Future work will focus on building a robust laughter detection system by merging several databases and using deep learning methods that can handle the variance of laughs.

FUTURE WORK
Concerning the smiles, eyebrow movements and head movements, although the rule-based and AU-based approaches presented are rather simplistic and are not optimal for all the expressions considered here, they were the most efficient we could implement given the time and resources available during this workshop. In the future we plan on developing more robust machine learning based detection systems for these expressions.
Concerning the MER system, we intend to merge the SEWA and RECOLA databases to increase the amount of data available to train our systems and improve our results. We also intend to leverage other datasets such as IEMOCAP (Busso, et al., 2008) and IMPROV (Busso, et al., 2017) to take advantage of the potential of deep learning.

| PREDICTION
The goal of this module is to predict the listener's expressions given those of the speaker. The problem is tackled as a one-step-ahead prediction task where the system generates the appropriate responses on a frame-by-frame basis. For that, the data is split into two subsets: one corresponding to the listener and the other to the speaker. The former is treated as the output, and the latter as the set of predictors for the module.

DYADIC CONVERSATION DATA
Data of nonverbal expressions of a listener reacting to a speaker were needed. We chose the Cardiff Conversation Database (CCDB) because it contains dyadic conversations during which the interlocutors converse in a naturalistic way. Although it already contained annotations of some nonverbal expressions, these were not well suited for our task. We therefore re-annotated smiles and laughs in three different intensity levels each, as well as head movements (nodding, shaking, tilting) and eyebrow movements (raise and frown for left, right or both eyes). The annotations were made using the ELAN annotation software. For this project, only 10 conversations (20 videos) were used to train the systems.
Table 1 (head movement detection algorithm; only fragments are recoverable): compute movement in horizontal direction over 10 frames; compute movement in vertical direction over 10 frames; y_movement = max(sum(abs(y(t) - y(t-1)))); step 3: nodding.

PREDICTION SYSTEMS
To predict the responses of the agent, a set of algorithms was tested:
• Linear regression
• Naïve Bayes classification
• Decision tree
• Fuzzy inference system
• Recurrent neural networks
We will ultimately evaluate the generated expressions subjectively, since our goal is to obtain adequate reactions, and not to copy a listener's reactions from a dataset. For this, we will synthesize the predicted sequence of expressions via the synthesis module and evaluate the synthesized expressions in the context of their generation (as reactions to a speaker/user). For the current study, these models were evaluated based on three performance measures: the accuracy, training error and testing error (mean squared error). Table 2 summarizes our results. Below is a detailed description of each of the tested algorithms with the corresponding assumptions and/or preprocessing.

Linear regression
In this model, the dependent variables (the listener's expressions) are explained as a linear combination of the regressors (the speaker's expressions). A multivariate regression framework is employed to predict each of the four expressions as a function of those of the speaker; for example, the agent's head movement is a weighted linear combination of the speaker's head movement, eyebrow movement, smile and laughter. The problem reduces to learning, for each predictor/output combination, the weights that minimize the mean squared error on the labeled training data. Because the regressor matrix is categorical, a conversion to a numerical format (one-to-one mapping) is required. The output is converted back to a categorical representation using a threshold-based rule.
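The three steps described above (numerical mapping, least-squares fit, thresholding) can be sketched for a single output as follows. Everything specific here is an assumption made for illustration: the "absent"/"present" labels, the toy data in which the listener mirrors the speaker's first expression, the 0.5 threshold, and the use of plain gradient descent to minimize the mean squared error.

```python
from itertools import product

# Hypothetical one-to-one mapping from categorical labels to numbers.
LABEL_TO_NUM = {"absent": 0.0, "present": 1.0}


def fit_linear(X, y, lr=0.1, epochs=2000):
    """Learn weights and bias minimizing the mean squared error."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = b + sum(wj * xj for wj, xj in zip(w, xi)) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict_label(x, w, b, threshold=0.5):
    """Threshold-based conversion of the numeric output to a category."""
    score = b + sum(wj * xj for wj, xj in zip(w, x))
    return "present" if score >= threshold else "absent"


# Toy data: 4 speaker expressions per frame, all 16 combinations.
rows = list(product(["absent", "present"], repeat=4))
X = [[LABEL_TO_NUM[v] for v in row] for row in rows]
# Toy target: the listener mirrors the speaker's first expression.
y = [LABEL_TO_NUM[row[0]] for row in rows]
w, b = fit_linear(X, y)
```

On this toy data the learned weight of the first predictor approaches 1 and the others approach 0, so `predict_label([1.0, 0.0, 0.0, 0.0], w, b)` returns "present". In the full multivariate setting, one such set of weights would be learned per listener expression.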

Naïve Bayes Classification
The Naïve Bayes classifier is a probabilistic technique that computes conditional probabilities using Bayes' theorem under the assumption of naïve (i.e. independent) features. The strength of this method lies in its simplicity: it requires only a linear number of parameters in terms of predictor/output pairs. In our model, the parameters are learned using a maximum likelihood algorithm. A probabilistic formulation can be used directly with categorical features, and hence no numerical conversion is needed for this model. Figure 6 shows a sample of predicted expressions over a period of 10,000 frames. As can be seen in Figure 6, the independence assumption is highly erroneous in our case (it is fair to assume that a person's simultaneous smiling and nodding are correlated).
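A minimal sketch of a categorical Naïve Bayes classifier with maximum likelihood (relative frequency) estimates is shown below. The feature values, labels and toy data are invented for the example, and no smoothing is applied, so unseen feature values receive zero probability.

```python
from collections import Counter, defaultdict


def fit_nb(X, y):
    """Count class frequencies and per-class feature value frequencies."""
    class_counts = Counter(y)
    feat_counts = defaultdict(Counter)  # (class, feature index) -> value counts
    for xi, yi in zip(X, y):
        for j, v in enumerate(xi):
            feat_counts[(yi, j)][v] += 1
    return class_counts, feat_counts


def predict_nb(model, x):
    """Return the class maximizing P(class) * prod_j P(x_j | class)."""
    class_counts, feat_counts = model
    total = sum(class_counts.values())
    best, best_p = None, -1.0
    for c, n_c in class_counts.items():
        p = n_c / total
        for j, v in enumerate(x):
            p *= feat_counts[(c, j)][v] / n_c  # independence assumption
        if p > best_p:
            best, best_p = c, p
    return best


# Toy data: speaker (smile state, head state) -> listener smile state.
X = [("smile", "nod"), ("smile", "still"), ("smile", "nod"),
     ("none", "still"), ("none", "still"), ("none", "nod")]
y = ["smile", "smile", "smile", "none", "none", "none"]
model = fit_nb(X, y)
```

Here `predict_nb(model, ("smile", "still"))` returns "smile" and `predict_nb(model, ("none", "nod"))` returns "none". Each feature contributes its own factor to the product, which is exactly where correlations between simultaneous expressions are ignored.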

Decision Tree Learning
A decision tree is a prediction technique that models tests on feature values as branches and outputs as leaves. While considered to be a non-robust method, it has the advantage of simple learning: at each split, the attribute that maximizes the information gain is selected, and attributes with no discriminative ability are subsequently eliminated.
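The information gain criterion can be illustrated with a short sketch. The feature values and labels below are toy data; the functions compute the standard entropy-based gain and are not taken from the implementation used in this work.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(feature_values, labels):
    """Entropy reduction obtained by splitting the labels on a feature."""
    n = len(labels)
    groups = {}
    for v, label in zip(feature_values, labels):
        groups.setdefault(v, []).append(label)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


listener = ["nod", "nod", "none", "none"]
speaker_smile = ["smile", "smile", "none", "none"]   # perfectly predictive
speaker_brow = ["raise", "none", "raise", "none"]    # uninformative
gain_smile = information_gain(speaker_smile, listener)
gain_brow = information_gain(speaker_brow, listener)
```

In this toy case the smile feature has a gain of 1 bit and would be chosen for the split, while the eyebrow feature has a gain of 0 and would be discarded as non-discriminative.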

Fuzzy Inference System
A fuzzy inference system models a prediction problem by mapping attributes to fuzzy sets. Each instance is hence characterized by its degree of membership in a particular set rather than the crisp 0 or 1 membership of classical sets. Outputs or decisions are then obtained by defuzzification (i.e. mapping from a fuzzy representation to a crisp one) of the result of the inference rules, which describe the input-output mapping using a set of logic rules. In our case, a speaker's expressions are described as belonging to 4 fuzzy sets (the 4 expressions), with the membership functions modeled as Gaussians. A similar mapping is performed for the output.
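To make the Gaussian-shaped degree of belonging and the defuzzification step concrete, here is a minimal sketch. The set names, centers and widths are hypothetical, and the weighted average of set centers used for defuzzification is one simple choice among several; the text does not specify which method was used.

```python
import math


def gaussian_membership(x: float, center: float, sigma: float) -> float:
    """Degree (between 0 and 1) to which x belongs to a Gaussian fuzzy set."""
    return math.exp(-((x - center) ** 2) / (2 * sigma ** 2))


def defuzzify(memberships: dict, centers: dict) -> float:
    """Map fuzzy memberships to one crisp value (weighted average of centers)."""
    total = sum(memberships.values())
    if total == 0:
        return 0.0
    return sum(memberships[k] * centers[k] for k in memberships) / total


# Hypothetical fuzzy sets for a smile intensity between 0 and 2.
centers = {"low": 0.0, "high": 2.0}
x = 1.0
memberships = {k: gaussian_membership(x, c, sigma=0.5) for k, c in centers.items()}
crisp = defuzzify(memberships, centers)
```

An input halfway between the two centers belongs equally to both sets, so the crisp value falls at the midpoint.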

Recurrent Neural Networks
Recurrent nets are desirable for their ability to robustly model time series prediction problems such as the one tackled in this module. These models include recurrent connections (at the input layer, hidden layers, output layer or any combination of these) that serve as a time varying real-valued activation function. Hence, this allows the network to exhibit time dependency properties. These models suffer from a lengthy and computationally expensive training, rendering their implementation on limited computational resources challenging. In our module, we employ a fully recurrent neural network with two hidden layers each composed of 100 neurons and trained using gradient descent.

FUTURE WORK
A next step would be to evaluate subjectively all the models tested here, in a similar way as in (El Haddad, Cakmak, Gilmartin, Dupont, & Dutoit, 2016). This means that each system will be used to predict the reactions of the agent to a speaker, and these expressions will be synthesized using the synthesis module. The synthesized output will undergo subjective tests to evaluate the relevance of the expressions chosen by the system with respect to the speaker to which the agent is reacting.

| SYNTHESIS
To generate audiovisual nonverbal expressions, we relied on the concatenation system described in (El Haddad, Cakmak, Gilmartin, Dupont, & Dutoit, 2016). It was improved here by implementing a Python-based animation of facial expressions, a search algorithm for queried expressions and a facial normalization technique that allows different types of data to be used, while keeping the same interpolation technique as in our previous work.

SYSTEM OVERVIEW
This system relies on a dataset of AV expressions from which the best suited expressions are picked based on the requirements of a query and concatenated together to form a full sequence. The parameters controlled from the query are currently:
• The type of the expression (laughter, smiles, head nodding, head tilting, head shaking, raise or frown eyebrow left or right)
• The intensity of the expression
• The duration of the expression
The expressions in the AV dataset were manually annotated according to these parameters.

AUDIOVISUAL CONCATENATION SYSTEM
When two facial expressions are concatenated, the ending frame of the first expression and the starting frame of the one that succeeds it are likely to present discontinuities, even for the same expression. To achieve a smooth transition, we apply the linear interpolation approach used in (El Haddad, Cakmak, Gilmartin, Dupont, & Dutoit, 2016) to the extracted face landmarks. Let A and B be two sequences of facial expressions, where B should be concatenated to the end of A. We interpolate between the last frame a of A and the first frame b of B, yielding the interpolated frame c defined as

$$c = w_1 a + w_2 b,$$

where $w_1$ and $w_2$ denote the interpolation weights and $w_1 + w_2 = 1$. For a smooth transition, we create more frames by interpolating between a and c, and between c and b, until the transition looks natural and smooth.
Concerning the audio cue, only laughter is expressed audibly and is therefore concatenated to silence. No smoothing interpolation is needed, but concatenation and truncation are used to control the length of the silence signal.
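The landmark interpolation can be sketched as below. Frames are represented as lists of (x, y) landmark coordinates, and the transition is generated with evenly spaced interpolation weights, which is an assumption made for simplicity rather than the exact frame-insertion procedure; the function names are illustrative.

```python
def interpolate_frames(a, b, w1):
    """Weighted sum of two landmark frames, with weights w1 and 1 - w1."""
    w2 = 1.0 - w1
    return [(w1 * ax + w2 * bx, w1 * ay + w2 * by)
            for (ax, ay), (bx, by) in zip(a, b)]


def transition(a, b, n_frames):
    """Create n_frames intermediate frames going from a towards b."""
    frames = []
    for k in range(1, n_frames + 1):
        w1 = 1.0 - k / (n_frames + 1)  # weight of a decreases along the way
        frames.append(interpolate_frames(a, b, w1))
    return frames


# One landmark per frame, for readability.
last_of_A = [(0.0, 0.0)]
first_of_B = [(4.0, 8.0)]
bridge = transition(last_of_A, first_of_B, 3)
```

With three intermediate frames, the single landmark moves through (1, 2), (2, 4) and (3, 6) on its way from the last frame of the first sequence to the first frame of the second.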

RENDERING
The animation is composed of visual and audio cues, which should ultimately be rendered on the avatar shown in Figure 7 (Çakmak, El Haddad, & Pulisci, 2016). The visual cue is controlled by facial landmarks extracted from the facial expressions in the dataset. The OpenFace tool was used to automatically extract these facial landmarks. Animating the landmarks directly serves as a first and easy visualization of the generated expressions. Since the landmarks defined to control the avatar are not the same as the ones extracted by OpenFace, a mapping will ultimately be needed to control the avatar with the OpenFace landmarks.

EXPRESSION QUERY AND DATASET
The dataset contains segmented expressions, with the facial landmarks and the audio (either laughter or silence) stored separately. The expressions are stored in separate files, the names of which contain the expression's parameters (type, intensity and duration). The goal is to receive from the prediction module a query consisting of a sequence of expressions, along with the duration and intensity required for each expression, and to use this query to pick the best suited expressions from the dataset.
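A possible way to select a segment for one queried expression is sketched below. The filename pattern `type_intensity_duration.ext`, the `Segment` class and the matching criterion (same type, then closest intensity, then closest duration) are all hypothetical; the actual naming scheme and search algorithm are not specified here.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Segment:
    filename: str
    kind: str         # expression type, e.g. "smile"
    intensity: int
    duration: float   # in seconds


def parse_name(filename: str) -> Segment:
    """Parse a hypothetical 'type_intensity_duration.ext' filename."""
    stem = filename.rsplit(".", 1)[0]
    kind, intensity, duration = stem.split("_")
    return Segment(filename, kind, int(intensity), float(duration))


def best_match(segments: List[Segment], kind: str,
               intensity: int, duration: float) -> Optional[Segment]:
    """Pick the segment of the right type closest in intensity, then duration."""
    candidates = [s for s in segments if s.kind == kind]
    if not candidates:
        return None
    return min(candidates,
               key=lambda s: (abs(s.intensity - intensity),
                              abs(s.duration - duration)))


names = ["smile_1_2.0.csv", "smile_3_1.0.csv", "laugh_3_1.0.csv", "smile_3_2.5.csv"]
segments = [parse_name(n) for n in names]
chosen = best_match(segments, "smile", 3, 2.0)
```

For a query asking for a smile of intensity 3 lasting 2.0 seconds, this sketch selects `smile_3_2.5.csv`; the remaining duration mismatch would then be handled by the concatenation and truncation steps.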
To be able to use expressions coming from different subjects, and therefore maximize the number of expressions contained in the dataset, we use a landmark normalisation approach. Instead of using the raw landmark coordinate values, we use the movement of the landmark coordinates from one frame to the next (obtained by subtraction). These differences are applied to a reference frame containing the initial landmarks of a given face.
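This normalisation can be sketched as follows, with frames represented as lists of (x, y) landmark coordinates. The function names are illustrative, and the sketch assumes that the source and reference faces use the same landmark layout.

```python
def frame_deltas(frames):
    """Per-landmark displacement between each pair of consecutive frames."""
    return [[(x1 - x0, y1 - y0) for (x0, y0), (x1, y1) in zip(prev, cur)]
            for prev, cur in zip(frames, frames[1:])]


def retarget(deltas, reference):
    """Apply the displacements cumulatively, starting from a reference frame."""
    out = [list(reference)]
    for d in deltas:
        last = out[-1]
        out.append([(x + dx, y + dy) for (x, y), (dx, dy) in zip(last, d)])
    return out


# One landmark per frame, for readability.
source = [[(0.0, 0.0)], [(1.0, 2.0)], [(3.0, 2.0)]]
reference = [(10.0, 10.0)]
deltas = frame_deltas(source)
retargeted = retarget(deltas, reference)
```

The retargeted sequence starts at the reference landmark (10, 10) and reproduces the same motion, passing through (11, 12) and (13, 12).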
The audio cue normalisation is still ongoing work. Ultimately, voice conversion techniques should be used to transform all the different voices in the dataset into one target voice.
The dataset of expressions for the synthesis module currently contains only a few examples recorded for the purpose of testing the system.

FUTURE WORK
In the near future we intend to build an extensive dataset of annotated expressions for the synthesis module.
Several voice conversion techniques will be compared to normalize the audio cue of the dataset. Techniques range from simple signal processing methods that bring the fundamental frequency and spectral values to certain predefined values, to more recent deep learning based techniques such as autoencoders.
Finally, we will work on controlling more parameters such as the social functionality of the expressions, obtaining intermediate intensity levels through interpolation and the possibility of combining several expressions, such as smiling while nodding for instance.
In future work, parametric and deep learning based synthesis systems will also be considered.

| DATA COLLECTION
The eNTERFACE workshop gave us the opportunity to collect our own dyadic conversation database. For this, interlocutors took turns as speakers and listeners, the latter asking questions to the former about memories of emotional states. Questions were on negative (i.e. guilt and shame) and positive (i.e. pride and compassion) emotions. These emotions are considered to be moral emotions (Haidt, 2003).

Procedure
Participants were first asked to read the informed consent form, which clearly stated that their participation was voluntary and unpaid. The consent form stated that moral emotions would be discussed by the participants. At the start of the experiment, participants were told that they would be randomly assigned to the role of speaker or listener, and would then switch roles. The speaker answers questions about moral emotions, whilst the listener listens to the speaker's answers and asks follow-up questions as necessary. The instructions were purposely vague to ensure that the dyadic interaction was as natural as possible.
The order in which participants discussed the emotions was randomised: listeners chose one of the two moral emotion options (positive, negative) from question prompts on a table. Each interaction started with the listener asking a question that varied in the emotion category: "When was the last time you experienced gratitude/compassion/guilt/shame? Can you describe the event and your feelings?" The speaker responded to each question. The interaction lasted until the interlocutors both indicated to experimenters that the conversation was finished.

Experimental Setup: Video/Audio Acquisition
Video and audio were recorded in a soundproof room at the Catholic University of Porto, Portugal. Two Canon cameras, an EOS 550D and an EOS 6D, were used to record the interactions: Camera A (beside Speaker 1) recorded Listener 1/Speaker 2, and Camera B (beside Speaker 2) recorded Speaker 1/Listener 2. The camera angle and distance were tailored to each participant, ensuring that the head to torso area was captured. The distance between speaker and listener remained constant. Two Rode Podcaster USB microphones on pop shield shock mounts recorded the speaker and listener audio. Laptops were attached to the microphones for the audio recordings and were also used for the pre- and post-experimental questionnaires (see Figures 8 and 9 for diagrams of the experimental setup). Two experimenters were in the room and started and stopped the video and audio recordings.

CONTENT
This database was designed to mimic real-world conversations as much as possible, by being unscripted and containing a mixture of nationalities and of familiar/unfamiliar individuals.

Database Demographics
As summarized in Figure 10, the database contains 42 participants (21 pairs), 32 males and 10 females. Participants were largely students and professors (age range: 20-48). There were 14 male-male pairs, 3 female-female pairs and 4 male-female pairs, of which 11 pairs knew each other beforehand. The database comprises 14 nationalities.

Questionnaire Data
Demographic information (e.g., age, gender) was collected. Further, to obtain a richer dataset, mood and personality measurements were also collected.
Participants completed pre- and post-experiment mood evaluations by filling in the Positive and Negative Affect Schedule (Watson, 1988). After the experiment, they also completed additional stress and personality measures: the Stress Response Scale (Suzuki, 1998) and the Big Five Inventory (John, 1999).

File Contents of the Database
Our database contains audio, video, Excel and ELAN files.
1. Audio files (WAV). These contain the audio content of speakers and listeners during the interaction.
2. Video files (MPEG-4). These contain both visual and audio content. An example of the footage captured using this setup is in Figure 11.
3. Excel files of demographic/mood/personality data. These files contain all pre- and post-recording data, including pre- and post-experiment mood scores and post-experiment personality data.
4. ELAN files. These are ongoing and contain annotations of the video files for nonverbal cues, including smiles, nods and laughs.

Ongoing Annotations
Annotations of nonverbal expressions were initiated during the last week of eNTERFACE. These annotations concern:
• Smiles: 3 different intensity levels

| CONCLUSION
We successfully built the foundations of an end-to-end expressive listening agent system. Based on these foundations we hope to provide, in the near future, a set of tools for each of the components described above that should be useful to anyone interested in affective agents and willing to contribute to our work.