Multi-level Attention network using text, audio and video for Depression Prediction
Anupama Ray, Siddharth Kumar, Rutvik Reddy, Prerana Mukherjee, Ritu Garg
Anupama Ray
IBM Research, [email protected]
Siddharth Kumar
IIIT Sricity, [email protected]
Rutvik Reddy
IIIT Sricity, [email protected]
Prerana Mukherjee
IIIT Sricity, [email protected]
Ritu Garg
Intel [email protected]
ABSTRACT
Depression is a leading cause of mental-health illness worldwide. Major depressive disorder (MDD) is a common mental health disorder that affects people both psychologically and physically, and can in the worst case lead to loss of life. Due to the lack of diagnostic tests and the subjectivity involved in detecting depression, there is growing interest in using behavioural cues to automate depression diagnosis and stage prediction. The absence of labelled behavioural datasets for such problems and the huge variation possible in behaviour make the problem more challenging. This paper presents a novel multi-level attention based network for multi-modal depression prediction that fuses features from the audio, video and text modalities while learning the intra- and inter-modality relevance. The multi-level attention reinforces overall learning by selecting the most influential features within each modality for the decision making. We perform exhaustive experimentation to create different regression models for the audio, video and text modalities. Several fusion models with different configurations are constructed to understand the impact of each feature and modality. We outperform the current baseline by 17.52% in terms of root mean squared error.
CCS CONCEPTS
• Computing methodologies → Machine learning; Neural networks.

KEYWORDS
attention networks; long short term memory; depression prediction; multimodal learning
ACM Reference Format:
Anupama Ray, Siddharth Kumar, Rutvik Reddy, Prerana Mukherjee, and Ritu Garg. 2019. Multi-level Attention network using text, audio and video for Depression Prediction. In Proceedings of the 9th International Workshop on Audio/Visual Emotion Challenge (AVEC '19), October 21, 2019, Nice, France. ACM, New York, NY, USA, 8 pages. https://doi.org/10.1145/3347320.3357697
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
AVEC '19, October 21, 2019, Nice, France
© 2019 Association for Computing Machinery.
ACM ISBN 978-1-4503-6913-8/19/10...$15.00
https://doi.org/10.1145/3347320.3357697
1 INTRODUCTION
Depression is one of the most common mental health disorders; according to the WHO, 300 million people around the world have depression. It is a leading cause of mental disability, has tremendous psychological and pharmacological effects, and can in the worst case lead to suicide. A big barrier to effective treatment of MDD and its care is inaccurate assessment due to the subjectivity involved in the assessment procedure. Most assessment procedures rely on questionnaires such as the Patient Health Questionnaire depression scale (PHQ), the Hamilton Depression Rating Scale (HDRS), or the Beck Depression Inventory (BDI). All of these questionnaires used in screening involve the patient's response, which is often not very reliable due to the subjective issues of an individual. The symptoms of MDD are covert: some individuals complain a lot in general even without having mild depression, whereas most severely depressed patients do not speak much in the screening test. Thus, it is very challenging to diagnose early depression, and often people are misdiagnosed and prescribed antidepressants. Unlike physical ailments, there are no straightforward diagnostic tests for depression, and clinicians have to routinely screen individuals to determine whether they have clinical or chronic depression. Studies have shown that around 70% of sufferers from MDD have consulted a medical practitioner [6]. Most practitioners follow the gold-standard Patient Health Questionnaire [24], which has questions to check for symptoms such as fatigue, sleep struggles, appetite issues etc. Diagnosis is based on the judgement of the practitioner (which could be biased by past education or past experience). Often there are false positives or false negatives with the PHQ screening, which lead to misjudgement in diagnosis.

The huge need for depression detection and the challenges involved motivated the affective computing research community to use behavioural cues to learn to predict depression, post-traumatic stress disorder, and related mental disorders [40]. Behavioural cues such as facial expressions and prosodic features from speech have proven to be excellent features for depression prediction [9, 34].

In this paper, we present a novel framework that invokes attention mechanisms at several layers to identify and extract important features from different modalities to predict the level of depression. The network uses several low-level and mid-level features from both the audio and video modalities, as well as sentence embeddings on the speech-to-text output of the participants. We show that attention at different levels gives us the ratio of importance of each feature and modality, leading to better results. We perform several experiments on each feature from the different modalities and combine several modalities. Our best performing network is the all-feature fusion network, which outperforms the baseline by 17.52%. The individual feature-based attention network outperforms the baseline by 20.5%, and the attention based text model outperforms the state-of-art by 8.95%, the state-of-art network being an attention based text transcription network as well [32].

The key contributions of this work are as follows:
• Attention based fusion network: We present a novel feature fusion framework that utilizes several layers of attention to understand the importance of intra-modality and inter-modality features for the prediction of depression levels.
• The proposed approach outperforms the baseline fusion network in terms of root mean squared error.
• An improved attention based network trained on all three modalities outperforms the baseline by 17.52%.

The remainder of the paper is organized as follows: Section 2 presents the state-of-art methods for depression classification. We present a brief overview of the proposed multi-level attention network in Section 3, followed by a brief description of the dataset used. In Section 4, the detailed methodology for each model built on individual features or fusion is described. Section 5 explains the results of all the models and presents all ablation studies, followed by discussions and future work in Section 6.
2 RELATED WORK
In this section, we briefly review the various works done in the context of distress analysis using multimodal inputs such as text, speech and facial emotions, as well as multimodal sentiment analysis.
Speech, more specifically non-verbal paralinguistic cues, has gained significant popularity in distress prediction and similar tasks for two main reasons. First, clinicians use speech traits such as diminished prosody, reduced or monotonous verbal production, and energy in speech as important markers in the diagnosis of distress. Second, speech is an easy signal to record (non-invasive and non-intrusive), which makes it the best candidate for automation tasks [10]. Cummins et al. [10] provide an exhaustive review of depression and suicide risk assessment using speech analysis. They investigate the usage of vocal biomarkers to associate clinical scores with signs of depression. In [38], the authors perform distress assessment on speech signals to infer the emotional information expressed while speaking. This amounts to quantifying various expressions such as anger, valence, arousal, dominance etc. In [27], the authors provide a comparative study of the effect of noise and reverberation on depression prediction using mel-frequency cepstral coefficients (MFCCs) and damped oscillator cepstral coefficients (DOCCs). Cummins et al. [11] investigate changes in the spectral and energy densities of speech signals for depression prediction. They analyze the acoustic variability in terms of weighted variance, trajectory of speech signals and volume to measure depression. The cross-cultural and cross-linguistic characteristics of depressed speech using vocal biomarkers are studied in [1]. In [43], the authors study the neurocognitive changes influencing dialogue delivery and semantics. Semantic features are encoded using a sparse lexical embedding space, and context is drawn from the subject's past clinical history.
Although the inherent relationship between verbal content and mental illness level is more prominent, visual features also play a pivotal role in establishing the deep association between depression and facial emotions. It has been observed that patients suffering from depression often have distorted facial expressions, e.g., eyebrow twitching, dull smiles, frowning, aggressive looks, restricted lip movements, reduced eye blinks etc. With the quantum of proliferating video data and the availability of high-end built-in cameras in wearables and the surveillance sector, analyzing facial emotions and sentiments is a growing trend in the vision community. In [31], the authors utilize convolutional multiple kernel learning approaches for emotion recognition and sentiment analysis in videos. Dalili et al. conducted a thorough meta-analysis of the association between facial emotion recognition and depression [12]. In [29], the authors rely on temporal LSTM based techniques for capturing contextual information from videos in a sentiment analysis task. Valstar et al. introduced the Facial Expression Recognition and Analysis challenge (FERA 2017) dataset [41] to estimate head pose movements and identify the action units against them; it requires quantifying facial expressions in such challenging scenarios. Ebrahimi et al. introduced the Emotion Recognition in the Wild (EmotiW) Challenge dataset [14] and utilize a hybrid convolutional neural network-recurrent neural network (CNN-RNN) framework for facial expression analysis. These datasets have been crucial in advancing the state-of-art in research on facial expression recognition and distress prediction. In [4], the authors present OpenFace, an open source interactive tool to estimate facial behaviour. It is a widely popular tool, and in this paper we use only the features extracted from OpenFace as the video low-level features. OpenFace gives us features for face landmark regions, head pose estimation and eye gaze estimation and converts them into reliable facial action units. In [26], the authors present a meta-analysis of facial gestures to identify schizophrenia based event triggers. In [20], the authors provide a meta-analysis of attention deficit hyperactivity disorder (ADHD) and the dysregulation of children's emotions, attempting to establish a coherent link between ADHD and emotional reactions.
Along with video and audio, the verbal content of what a person speaks is critically important for diagnosing depression and stress. With the surge of social media usage, there is a large inflow of textual data from social media, which has given researchers the opportunity to analyze distress from text. Such data could help in sentiment analysis and provide insights into sudden aberrations in the personality traits of a user as reflected in their posts. In [37], the authors leverage social media platforms to detect depression by harnessing social media data. They categorize tweets gathered from the Twitter API into depression and non-depression data. They extract various feature groups correlated to six depression levels and further utilize multimodal depressive dictionary learning for online behaviour prediction of the Twitter user base. In [7], the authors inspect tweet content ranging from common themes to trivial mentions of depression in order to classify them into the relevant category of distress disorder levels. In [22], the authors present a sentiment analysis framework using social media data and mine patterns based on emotion theory concepts and natural language processing techniques. Ansari et al. propose a Markovian model to detect depression utilizing content ratings provided by human subjects [3]: users are presented with a series of contents and asked to rate them, and the depression level is associated with the reactions recorded and the tendency to skip content. In [23], the authors examine the onset of triggered events for mental illness, specifically stress and depression, based on social media data encompassing different ethnic backgrounds. Tong et al. [39] utilize a novel classifier, inverse boosting pruning trees, to mine online social behaviour, which enables depression detection at early stages. In [21], the authors adopt clustering techniques to quantify anxiety and depression indices on questionnaire textual data, and further investigate the correlation among anxiety, depression and social data. For most social media data, the text analysis is done on short text, and such classifiers do not work well in a conversational setting such as the counselling/screening sessions considered here.
In [28], the authors provide a comprehensive review of fusion techniques for depression detection. They also propose a computational linguistics based fusion approach for multimodal depression detection. In [25], the authors analyse depression levels from the clinical interview corpus DAIC based on context-aware feature generation techniques and end-to-end trainable deep neural networks. They further infuse data augmentation techniques based on topic modelling into transformer networks. Zhang et al. released a multimodal spontaneous emotion corpus for human behaviour analysis [44]. Facial emotions are captured by 3D dynamic ranging, high resolution video capture and infrared imaging sensors. Apart from facial context, blood pressure, respiration and pulse rates are monitored to gauge the emotional state of a person. Using the data released in the AVEC challenge [33], audio, video and physiological parameters are investigated for key findings on the emotional state of the subjects. In [30], the authors fuse audio, visual and textual cues for harvesting sentiments in multimedia content. They utilize feature-level and decision-level fusion techniques to perform affective computing. In [2], the authors utilize paralinguistic cues, head pose and eye gaze fixations for multimodal depression detection; with the help of statistical tests on the selected features, the inference engine classifies subjects into depressed and healthy categories.

When combining multiple modalities, it is important to understand the contribution of each modality to the task prediction, and attention networks can be used to study this relative importance [42]. In this paper, we use attention within each modality to understand the relative importance of the low-level or deep features within that modality. We also use attention layers while fusing the three modalities and learn the attention weights to find the ratios of importance of each modality. The paper by Qureshi et al. [32] is the closest to our work in terms of using a subset of the dataset we use and applying attention at one layer. By using multiple layers of attention at several levels, we obtain much better results than them, and the network is computationally less expensive due to the attention operations, thus minimizing the test time of the framework.
3 PROPOSED MULTI-LEVEL ATTENTION NETWORK
A block diagram of the proposed multi-level attention network is shown in Figure 1. The attention layer over each modality teaches the network to attend to the most important features within that modality to create the context feature for the modality. The context features of each modality are passed through two layers of feedforward networks, and the outputs of these three feedforward networks are fused in another stacked BLSTM. The outputs of the three feedforward networks contain the most important features per modality and are fused to form another concatenation vector with an attention layer on top of it. The output of this attention layer is multiplied by the output of the stacked Bi-LSTM and passed through the regressor. The loss of the regressor is back-propagated to train the weights learned at each level of the network, ensuring end-to-end training.

[Figure 1: Block diagram of the proposed multi-layer attention network on multi-modality input features.]
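To make the mechanism concrete, the following is a minimal sketch (in PyTorch, our choice; the paper publishes no code, and all module and variable names here are ours) of one per-modality encoder: a BLSTM over the feature sequence followed by an additive attention layer that pools the timesteps into a single context vector.

```python
import torch
import torch.nn as nn

class AdditiveAttention(nn.Module):
    """Scores each timestep of a sequence and returns a weighted context vector."""
    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(dim, dim), nn.Tanh(), nn.Linear(dim, 1))

    def forward(self, seq):               # seq: (batch, timesteps, dim)
        weights = torch.softmax(self.score(seq), dim=1)   # (batch, timesteps, 1)
        context = (weights * seq).sum(dim=1)              # (batch, dim)
        return context, weights

class ModalityEncoder(nn.Module):
    """BLSTM over one modality's features, followed by attention pooling."""
    def __init__(self, feat_dim: int, hidden: int = 200):
        super().__init__()
        self.blstm = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.attn = AdditiveAttention(2 * hidden)

    def forward(self, x):                 # x: (batch, timesteps, feat_dim)
        out, _ = self.blstm(x)
        return self.attn(out)             # context vector: (batch, 2 * hidden)
```

The same attention module can be reused at the fusion level, with the per-modality context vectors playing the role of the sequence elements.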
3.1 Dataset
The Extended Distress Analysis Interview Corpus (E-DAIC) [18] is an extension of the DAIC Wizard-of-Oz (DAIC-WOZ) dataset. It contains audio-video recordings of clinical interviews for psychological distress conditions such as anxiety, depression and post-traumatic stress disorder. DAIC-WOZ [13] was collected as part of a different effort which aimed to create a bot that interviews people and identifies verbal and non-verbal indicators of mental illness [19]; it uses an animated virtual bot instead of a clinician, with the bot controlled by a human from another room. The Audio/Visual Emotion Challenge (AVEC 2019) [17] presents E-DAIC, in which all interviews are conducted by the AI based bot rather than a human. The data has been carefully partitioned into train, development (dev) and test sets while preserving the overall gender diversity. There are 275 subjects in total in the E-DAIC dataset, of which 163 are used for train and 56 each for dev and test; the test labels are not available as per the challenge. Thus the results shown are mostly on the dev partition only.
4 METHODOLOGY
In this section, we describe the models created for each modality, along with the various models created for the fusion of different features from different modalities.
4.1 Text Model
We use the speech-to-text output for the participants provided in the data by [17]. Since several participants used colloquial English words, we modified the utterances by replacing such words with the original full word, since they would otherwise all become out-of-vocabulary words when training a neural network for language modeling or other predictions. We used the pretrained Universal Sentence Encoder [8] to get sentence embeddings. To obtain tensors of constant size, we zero pad shorter sentences and keep a constant number of timesteps of 400. The length of each sentence embedding vector is 512, making the final array dimension (400, 512). We used 2 layers of stacked Bidirectional Long Short Term Memory (BLSTM) network architecture with sentence embeddings as input and PHQ scores as output to train a regression model on the speech transcriptions. Each BLSTM layer has 200 hidden units, wherein the output of each hidden unit of the forward layer of the first BLSTM layer is connected to the input of the forward hidden unit of the second layer. The same connections are built for each hidden unit in the backward layers as well to create the stacking. The two layers of BLSTM give an output of (batchsize, 400) at each timestep, and this is sent as input to a feedforward layer for regression. We kept the number of nodes in the feedforward layers as (500, 100, 60, 1) and used rectified linear units as the activation function.
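A minimal sketch of how this text regressor could be assembled, assuming the (400, 512) Universal Sentence Encoder input described above (pooling the BLSTM output at the last timestep is our assumption; the paper does not specify the exact reduction, and all names are ours):

```python
import torch
import torch.nn as nn

class TextRegressor(nn.Module):
    """2-layer stacked BLSTM over sentence embeddings + feedforward PHQ regressor."""
    def __init__(self, embed_dim: int = 512, hidden: int = 200):
        super().__init__()
        self.blstm = nn.LSTM(embed_dim, hidden, num_layers=2,
                             batch_first=True, bidirectional=True)
        # Feedforward head with (500, 100, 60, 1) nodes and ReLU activations.
        self.head = nn.Sequential(
            nn.Linear(2 * hidden, 500), nn.ReLU(),
            nn.Linear(500, 100), nn.ReLU(),
            nn.Linear(100, 60), nn.ReLU(),
            nn.Linear(60, 1),
        )

    def forward(self, x):          # x: (batch, 400, 512) padded USE embeddings
        out, _ = self.blstm(x)     # out: (batch, 400, 400), i.e. 2*200 features
        return self.head(out[:, -1, :])   # regress from the last timestep

model = TextRegressor()
phq = model(torch.zeros(10, 400, 512))    # batch of 10 -> (10, 1) PHQ scores
```

Note that the 400-dimensional per-timestep output of the bidirectional stack matches the (batchsize, 400) shape mentioned in the text.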
4.2 Audio Models
For the audio modality, we created models using different audio features (low-level features as well as their functionals). As functionals, the arithmetic mean and coefficient of variation are applied to the low-level features; this serves as a knowledge abstraction on top of the low-level features [36]. The vocal timbre is encoded by low-level descriptor features such as Mel-Frequency Cepstral Coefficients (MFCC) [16], and studies [5, 16] show that the lower order MFCCs are more important for affect/emotion prediction and paralinguistic voice analysis tasks. The extended Geneva Minimalistic Acoustic Parameter Set (eGeMAPS) contains 88 features, which include GeMAPS as well as spectral features and their functionals. The GeMAPS feature set consists of frequency related features (pitch, jitter, formants), energy related features (shimmer, loudness, harmonic-to-noise ratio), spectral parameters (alpha ratio, Hammarberg ratio, spectral slope 0–500 Hz and 500–1500 Hz, formant 1, 2, 3 relative energies), the harmonic differences H1–H2 and H1–A3 and their functionals, and six temporal features related to speech rate [15]. Apart from the low-level features mentioned above, a high dimensional deep representation of the audio sample is extracted by passing the audio through a Deep Spectrum and a VGG network. This feature is referred to as the deep densenet feature in the rest of the paper.

For the audio features, only the spans where the participant has spoken were considered in our experimentation. Each of these features was available as part of the challenge data, and they have different sampling rates. The functional audio and deep densenet features are sampled at 1 Hz, whereas the Bag-of-AudioWords (BoAW) [35] is sampled at 10 Hz and the low-level audio descriptors are sampled at 100 Hz. The lengths of the low-level MFCC and low-level eGeMAPS features are 39 and 23 respectively, with 140500 timesteps for both. For the functionals, however, the lengths are 78 and 88 respectively, with 1300 and 1410 timesteps. The BoAW-MFCC and BoAW-eGeMAPS features are of length 100 with 14050 timesteps each. The deep densenet features are 1920-dimensional with 1415 timesteps.

For the individual audio modality, we trained another stacked BLSTM network with two layers, each having 200 hidden units. We take the last layer output and pass it to a multi-layer perceptron with (500, 100, 60, 1) nodes per layer in progression and rectified linear units as the activation function.
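For intuition, the functionals step can be sketched as below: the arithmetic mean and coefficient of variation are as described in the text, while the window and hop sizes are our assumptions, chosen purely for illustration.

```python
import numpy as np

def functionals(lld: np.ndarray, win: int = 400, hop: int = 100) -> np.ndarray:
    """Summarize frame-level low-level descriptors (frames, feat_dim) into
    windowed functionals: per-window arithmetic mean and coefficient of
    variation (std / mean). This doubles the feature length, matching the
    text (39-dim MFCC LLDs -> 78-dim functionals)."""
    rows = []
    for start in range(0, len(lld) - win + 1, hop):
        w = lld[start:start + win]
        mean = w.mean(axis=0)
        cov = w.std(axis=0) / (np.abs(mean) + 1e-8)   # avoid division by zero
        rows.append(np.concatenate([mean, cov]))
    return np.stack(rows)        # (num_windows, 2 * feat_dim)

feats = functionals(np.random.randn(140500, 39))   # MFCC LLDs sampled at 100 Hz
```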
4.3 Video Models
For the video features available in the challenge dataset, we experimented with both the low-level features and the functionals of the low-level video descriptors provided. We observed similar performance when comparing a low-level descriptor with its functional. Since deep LSTM networks can learn similar properties from the data (like functionals and more abstract information), we chose to use the low-level descriptors, as they carry more information than their mean and standard deviation. The low-level descriptor features for pose, gaze and facial action units (FAU) are each sampled at 10 Hz. The lengths of these features are 6, 8 and 35 respectively, all with 15000 timesteps. The Bag-of-VisualWords (BoVW) is also provided in the challenge data and has a length of 100 with 15000 timesteps. We use these features to train an individual model per feature, all having a single layer of 200 BLSTM hidden units, followed by maxpooling, and then learn a regressor. We experimented with various combinations such as the sum of all outputs, the mean of outputs, and maxpooling as three alternatives; maxpooling worked best, so we utilize maxpooling over the LSTM outputs.
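A sketch of one such per-feature video model, assuming global max pooling over the BLSTM timestep outputs as described above (class and variable names are ours):

```python
import torch
import torch.nn as nn

class VideoFeatureRegressor(nn.Module):
    """Single-layer BLSTM over one visual descriptor (e.g. gaze, pose, FAUs),
    followed by global max pooling over time and a regression head."""
    def __init__(self, feat_dim: int, hidden: int = 200):
        super().__init__()
        self.blstm = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, 1)

    def forward(self, x):                  # x: (batch, timesteps, feat_dim)
        seq, _ = self.blstm(x)             # (batch, timesteps, 400)
        pooled, _ = seq.max(dim=1)         # global max pool over timesteps
        return self.out(pooled)            # predicted PHQ score

gaze_model = VideoFeatureRegressor(feat_dim=8)   # gaze LLDs are 8-dimensional
```

Max pooling keeps, per dimension, the strongest activation across the whole interview, which is consistent with the observation above that it outperformed sum and mean pooling.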
4.4 Fusion Models
Standard procedures of early fusion are computationally expensive and can lead to overfitting when trained using neural networks. Thus, late fusion and hybrid fusion models became more prevalent. We propose a multi-layer attention based network that learns the importance of each feature and weighs them accordingly, leading to better early fusion and prediction. Such an attention network gives us insights into which features in a modality are more influential in learning. It also gives an understanding of the ratio of contribution of each modality towards the prediction.

Towards the fusion, we performed several experiments within each modality and across modalities. First, we fuse the low-level descriptors of the video modality. We take the gaze, pose and facial action unit features, pass them through a single layer of 200 BLSTM cells, and apply attention over them. The output of the attention layer is passed through another BLSTM layer with 200 cells. We take the global max pool of this LSTM output and pass it through a feedforward network with 128 hidden units. We call this fusion model videoLLD-fused in Table 2. Second, we combine the low-level video features with the BoVWs and use a similar network of a 200 hidden unit BLSTM layer followed by attention and another BLSTM, which is then passed through a feedforward layer for regression. We call this fusion model video-BoVW fused in Table 2.

The third fusion model is created using the attention vector output from the video modality and the output of the text modality. These two outputs are combined and passed through a stacked BLSTM and an attention layer prior to the video regressor. This fused model is referred to as Video-Text fused in Table 2.

The fourth fusion model uses the audio and text modalities together. Again we took the output of the attention layer at each modality and built a hybrid fused network, passing them through two routes. In the first route, the outputs of attention are concatenated and passed through attention followed by a feedforward layer, and the regression loss is propagated. In the other route, the outputs of both attention layers are passed through a stacked BLSTM of 2 layers, each with 200 cells. An attention layer is applied on top of the stacked BLSTM layers, and this output is fed to a feedforward network of 128 hidden units. This network is called Audio-Text fused in Table 2 and, naturally, performs better than the standalone audio model due to the use of text features, which led to the best results.

Our fifth fusion model uses the video and text modalities together, and here we again use an attention layer at each sub-modality of the video inputs and then combine them with the text modality using another attention network over video and text. Quite surprisingly, the results of this fusion are very similar to the audio and text modality fusion, and the learning curve also ended up being quite similar. This network, called Video-Text fused in Table 2, has one route that runs through a Bi-LSTM network of 200 units for each sub-modality; for each timestep, these are fused together using an attention layer over all the incoming video features (gaze, pose, AUs and BoVW), which goes through another Bi-LSTM with 200 units to extract the contextual information from within the fused features.

Our sixth and final fused model uses all the modalities together. We use the attention based visual modalities to obtain a 128 unit vector, the attention based audio modalities to obtain a 128 unit vector, and we extract the information from the transcript modality to derive a 128 unit vector. We then use another attention layer over these 3 modalities (video, audio and text) to fuse them together and regress for the PHQ-8 score. There were several challenges in integrating this fused model. We hypothesise that the error function contains several local minima, which make it more difficult to reach the global minimum. On testing the model with individual modalities, we observed that both the video feature model and the audio feature model have a much steeper descent than the ASR model; on fusion, the model often got stuck in the minima of the video and audio features, which are both quite close. To mitigate this and "nudge" the model towards a minimum that follows the path of the minimum reached by the ASR transcripts, we multiply the final outputs of the attention layer element-wise with a variable vector initialised with values in the reciprocal ratios of the RMSE loss of each individual modality, in order to prioritize the text modality initially (see the sketch below). This led to a stable decline of train and validation loss, more stable than the individual modality losses, and the final attention scores are indicative of the contributions of each individual modality. Upon convergence, the attention ratios were [ ] for video, audio and text respectively.
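A minimal sketch of this nudging trick, under our assumptions: the exact tensor shapes are not specified at this granularity in the text, the per-modality RMSE values are taken from the standalone dev results in Table 1 purely for illustration, and all names are ours.

```python
import torch
import torch.nn as nn

# Dev RMSEs of standalone modality models (video, audio, text), used here
# only as illustrative values -- the text modality has the lowest error.
rmse = torch.tensor([5.70, 5.11, 4.37])

# Learnable scaling vector initialised to the normalised reciprocal ratios,
# so the text branch is prioritized early in training.
init = (1.0 / rmse) / (1.0 / rmse).sum()
scale = nn.Parameter(init.clone())

def fuse(attended):            # attended: (batch, 3) modality attention outputs
    return attended * scale    # element-wise nudge towards the text modality
```

Because the vector is a trainable parameter, the initial bias towards text can be unlearned if the optimizer finds a better weighting.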
Table 1: Regression of PHQ score in terms of RMSE for each feature within each modality

Partition           | Audio: Funct MFCC | Funct eGeMAPS | BoAW-M | BoAW-e | DS-DNet | Video: Pose-LLD | Gaze-LLD | FAU-LLD | BoVW | Text
Dev-proposed        | 5.11 | 5.52 | 5.66 | 5.50 | 5.65 | 5.85 | 6.13 | 5.96 | 5.70 | 4.37
Dev-baseline [17]   | 7.28 | 7.78 | 6.32 | 6.43 | 8.09 | -    | -    | 7.02 | 5.99 | -
Qureshi et al. [32] | -    | -    | -    | -    | -    | 6.45 | 6.57 | 6.53 | -    | 4.80
Table 2: Different fusion networks in terms of RMSE

Partition           | videoLLD-fused | video-BoVW fused | Video-Text fused | Audio-Text fused | All-feature-fusion
Dev-proposed        | 5.55 | 5.38 | 4.64 | 4.37 | 4.28
Dev-baseline [17]   | -    | -    | -    | -    | 5.03
Qureshi et al. [32] | -    | -    | 5.11 | 4.64 | 4.14
5 RESULTS AND DISCUSSION
This section presents the results of all regression models and their ablation studies in detail. The results of models trained on individual features from each modality are shown in Table 1. We show results on four different types of fusion networks in Table 2. Since the labels of the test data are not available as per the challenge, we show most results on the validation (dev) partition. The only results on the test partition are from the text-based model, with which we made a submission and got all scores from the challenge. The paper [32] uses a subset of the E-DAIC data used by us which does not include the test partition, and the dev partition could also be slightly different, so we cannot directly compare the results, but it is the closest comparison on a similar dataset.
5.1 Results on the Text Modality
The attention based BLSTM network trained on text transcriptions achieved the best results in comparison to the other modalities on the test set of both E-DAIC (the challenge data) and DAIC-WOZ (the current dev partition). This is in coherence with the observation of clinicians that the verbal content is a significant marker and has explicit features which could influence the decision of depression stage classification. We achieved a root mean squared error (RMSE) of 4.37 on the development partition of the challenge dataset (as shown in Table 1). We submitted the output of this model to the challenge, so we have the detailed correlation coefficients on the test partition only for the text modality and not for the other modalities. On the test set, the text-based model achieves a mean absolute error (MAE) of 4.02 and an RMSE of 4.73, with a concordance correlation coefficient (CCC) of 0.67, which is the main metric in the challenge. The Pearson's correlation coefficient (PCC) of this model is 0.676, the coefficient of determination (r2) is 0.457, and the Spearman's correlation coefficient (SCC) is 0.651, as per the results from the challenge on the test set. Overall, this network outperforms the state-of-art model [32] by 8.95%. The model converged at 15 epochs with a validation loss of 4.37 and a batch size of 10, which was set empirically. The average test time for a single text transcription to be predicted is 0.09 secs.

5.2 Results on the Audio Modality
The loss in terms of RMSE on the audio features of the development split of the dataset given in the challenge is shown in Table 1 as
Dev-proposed. Dev-baseline shows the results provided in the baseline paper of the challenge [17]. In comparison with the baseline model, each of our individual networks performed better in terms of RMSE. For the audio MFCC feature based model, we outperform the baseline by 29.80%, whereas for eGeMAPS we are better by 29.04%. For BoAW-MFCC we outperform the baseline by 10.44%, and in case of BoAW-eGe we achieve a 14.46% improvement over the baseline. Each individual audio feature model runs for 15 epochs with a batch size of 10. The average time required for one sample using functional MFCC is 0.23 secs, using functional eGeMAPS 0.14 secs, using BoAW-MFCC 0.45 secs, using BoAW-eGe 0.45 secs, and using DS-DNet features 0.13 secs. For the audio models, we also tried a convolutional neural network architecture for fusing the MFCC, eGeMAPS and DS-DNet features, but observed that the performance with Bi-LSTMs was slightly better than convolutional networks due to their sequential learning capability, which suits such features.
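For reference, the quoted improvements follow from the entries of Table 1 as a relative RMSE reduction; for example, for the functional MFCC model:

```latex
\mathrm{improvement} = \frac{\mathrm{RMSE}_{\mathrm{baseline}} - \mathrm{RMSE}_{\mathrm{ours}}}{\mathrm{RMSE}_{\mathrm{baseline}}}
= \frac{7.28 - 5.11}{7.28} \approx 29.80\%
```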
5.3 Results on the Video Modality
The results on the video features are better than both the baseline and the state-of-art, but still worse than the results obtained on the text and speech modalities. Among the visual features, the Bag-of-VisualWords performed best, outperforming the baseline by 4.8%. Compared with [32], we outperform them by 9.3% using pose features, by 6.6% using gaze features, and by 8.7% using facial action units.
5.4 Results of the Fusion Models
The RMSE of each fusion model on the development partition is shown in Table 2 as Dev-proposed. Dev-baseline shows the results on the same split from the baseline paper. The third row, which shows the results of the state-of-art paper, is not on the same set but on the test dataset of DAIC-WOZ. Assuming that the entire test partition of DAIC-WOZ is now part of the validation/development partition of the challenge dataset, we present the comparison with them. The model using all features fused with multiple levels of attention led to the best results, outperforming the baseline by 17.52%. In comparison to [32], our Audio-Text and Video-Text fusion networks are better by 5.8% and 9.19% respectively, but our all-feature fusion network is slightly worse. This is not conclusive, as the dataset used in that paper is slightly different. The attention mechanism automatically weighs each feature in each modality and allows the network to attend to the most important features for the regression decision. The network thus learns the relationship between the features and the PHQ-8 scores.
6 CONCLUSIONS AND FUTURE WORK
This paper proposes a multi-level attention based early fusion network which fuses the audio, video and text modalities to predict the severity of depression. For this task, we observed that the attention network gave the highest weight to the text modality and almost equal weightage to the audio and video modalities. Giving a higher weight to the text modality is in coherence with clinicians, as the content of speech is critical to diagnosing depression levels. Audio and video are equally important sources of information and can be critical for the prediction of severity. Our intuition for the lower importance of video data is the limited set of features we could use from the video modality (eye gaze, facial action units and head pose). A clinician in a face-to-face interview can observe a person's body posture (self-touches, trembling etc.) or record electrophysiological signals, thus helping diagnose better.

The use of multi-level attention led us to obtain significantly better results in all individual and fusion models compared to both the baseline and the state-of-art. Using attention over each feature and each modality had a two-fold advantage overall. Firstly, it gives us a deep and better understanding of the importance of each feature within a modality towards depression prediction. Secondly, attention simplified the network's overall computational complexity and reduced the training and test time. Experimental results show that the model with all-feature fusion using multi-level attention outperformed the baseline by 17.52%. The model built only on the text modality was also significantly better in comparison to [32], achieving a CCC score of 0.67 on the test set in comparison to 0.1 for the baseline.

As future work, the authors want to remove the inductive bias from classes of data with more training samples and use few-shot learning techniques to learn models with less data, as it is highly challenging to get more data, or balanced data across classes, in such domains. We are also trying to delve deeper into which features have a positive or negative influence on deciding between mild and severe depression. That would lead to more explainability in these models, which is of great importance to a clinician in understanding their output.
REFERENCES
[1] Sharifa Alghowinem, Roland Goecke, Julien Epps, Michael Wagner, and Jeffrey F Cohn. 2016. Cross-Cultural Depression Recognition from Vocal Biomarkers. In INTERSPEECH. 1943–1947.
[2] Sharifa Alghowinem, Roland Goecke, Michael Wagner, Julien Epps, Matthew Hyett, Gordon Parker, and Michael Breakspear. 2016. Multimodal depression detection: fusion analysis of paralinguistic, head pose and eye gaze behaviors. IEEE Transactions on Affective Computing 9, 4 (2016), 478–490.
[3] Haroon Ansari, Aditya Vijayvergia, and Krishan Kumar. 2018. DCR-HMM: Depression detection based on Content Rating using Hidden Markov Model. IEEE, 1–6.
[4] Tadas Baltrušaitis, Peter Robinson, and Louis-Philippe Morency. 2016. OpenFace: an open source facial behavior analysis toolkit. In IEEE Winter Conference on Applications of Computer Vision (WACV). IEEE, 1–10.
[5] Daniel Bone, Chi-Chun Lee, and Shrikanth Narayanan. 2014. Robust unsupervised arousal rating: A rule-based framework with knowledge-inspired vocal features. IEEE Transactions on Affective Computing 5, 2 (2014), 201–213.
[6] M. Carey, K. Jones, and G. Meadows. 2014. Accuracy of general practitioner unassisted detection of depression. Aust N Z J Psychiatry (2014).
[7] Computers in Human Behavior 54 (2016), 351–357.
[8] Daniel Cer, Yinfei Yang, Sheng-yi Kong, Nan Hua, Nicole Limtiaco, Rhomni St. John, Noah Constant, Mario Guajardo-Cespedes, Steve Yuan, Chris Tar, Yun-Hsuan Sung, Brian Strope, and Ray Kurzweil. 2018. Universal Sentence Encoder. CoRR abs/1803.11175 (2018). http://arxiv.org/abs/1803.11175
[9] J. F. Cohn, T. S. Kruez, I. Matthews, Y. Yang, M. H. Nguyen, M. T. Padilla, F. Zhou, and F. De la Torre. 2009. Detecting depression from facial actions and vocal prosody. In International Conference on Affective Computing and Intelligent Interaction (ACII). IEEE.
[10] Nicholas Cummins, Stefan Scherer, Jarek Krajewski, Sebastian Schnieder, Julien Epps, and Thomas F Quatieri. 2015. A review of depression and suicide risk assessment using speech analysis. Speech Communication 71 (2015), 10–49.
[11] Nicholas Cummins, Vidhyasaharan Sethu, Julien Epps, Sebastian Schnieder, and Jarek Krajewski. 2015. Analysis of acoustic space variability in speech affected by depression. Speech Communication 75 (2015), 27–49.
[12] MN Dalili, IS Penton-Voak, CJ Harmer, and MR Munafò. 2015. Meta-analysis of emotion recognition deficits in major depressive disorder. Psychological Medicine 45, 6 (2015), 1135–1144.
[13] David DeVault, Ron Artstein, Grace Benn, Teresa Dey, Ed Fast, Alesia Gainer, Kallirroi Georgila, Jon Gratch, Arno Hartholt, Margaux Lhommet, Gale Lucas, Stacy Marsella, Fabrizio Morbini, Angela Nazarian, Stefan Scherer, Giota Stratou, Apar Suri, David Traum, Rachel Wood, Yuyu Xu, Albert Rizzo, and Louis-Philippe Morency. 2014. SimSensei Kiosk: A Virtual Human Interviewer for Healthcare Decision Support. In Proceedings of the 2014 International Conference on Autonomous Agents and Multi-agent Systems (AAMAS '14). International Foundation for Autonomous Agents and Multiagent Systems.
[14] Samira Ebrahimi Kahou, Vincent Michalski, Kishore Konda, Roland Memisevic, and Christopher Pal. 2015. Recurrent neural networks for emotion recognition in video. In Proceedings of the 2015 ACM on International Conference on Multimodal Interaction. ACM, 467–474.
[15] Florian Eyben, Klaus Scherer, Björn Schuller, Johan Sundberg, Elisabeth André, Carlos Busso, Laurence Devillers, Julien Epps, Petri Laukka, Shrikanth Narayanan, and Khiet Phuong Truong. 2016. The Geneva Minimalistic Acoustic Parameter Set (GeMAPS) for Voice Research and Affective Computing. IEEE Transactions on Affective Computing (2016).
[16] Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH (2013), 2044–2048.
[17] Fabien Ringeval, Björn Schuller, Michel Valstar, Nicholas Cummins, Roddy Cowie, Leili Tavabi, Maximilian Schmitt, Sina Alisamir, Shahin Amiriparian, Eva-Maria Messner, Siyang Song, Shuo Lui, Ziping Zhao, Adria Mallol-Ragolta, Zhao Ren, and Maja Pantic. 2019. AVEC 2019 Workshop and Challenge: State-of-Mind, Depression with AI, and Cross-Cultural Affect Recognition. In Proceedings of the 9th International Workshop on Audio/Visual Emotion Challenge, AVEC '19, co-located with the 27th ACM International Conference on Multimedia, MM 2019, Fabien Ringeval, Björn Schuller, Michel Valstar, Nicholas Cummins, Roddy Cowie, and Maja Pantic (Eds.). ACM, Nice, France.
[18] Jonathan Gratch, Ron Artstein, Gale Lucas, Giota Stratou, Stefan Scherer, Angela Nazarian, Rachel Wood, Jill Boberg, David DeVault, Stacy Marsella, David Traum, Albert Rizzo, and L. P. Morency. 2014. The Distress Analysis Interview Corpus of human and computer interviews.
[19] Jonathan Gratch, Ron Artstein, Gale M Lucas, Giota Stratou, Stefan Scherer, Angela Nazarian, Rachel Wood, Jill Boberg, David DeVault, Stacy Marsella, et al. 2014. The distress analysis interview corpus of human and computer interviews. In LREC. 3123–3128.
[20] Paulo A Graziano and Alexis Garcia. 2016. Attention-deficit hyperactivity disorder and children's emotion dysregulation: A meta-analysis. Clinical Psychology Review 46 (2016), 106–123.
[21] Fei Hao, Guangyao Pang, Yulei Wu, Zhongling Pi, Lirong Xia, and Geyong Min. 2019. Providing Appropriate Social Support to Prevention of Depression for Highly Anxious Sufferers. IEEE Transactions on Computational Social Systems (2019).
[22] Anees Ul Hassan, Jamil Hussain, Musarrat Hussain, Muhammad Sadiq, and Sungyoung Lee. 2017. Sentiment analysis of social networking sites (SNS) data using machine learning approach for the measurement of depression. IEEE, 138–140.
[23] Jia Jia. 2018. Mental Health Computing via Harvesting Social Media Data. In IJCAI. 5677–5681.
[24] Kurt Kroenke, Tara Strine, Robert L Spitzer, Janet Williams, Joyce T Berry, and Ali Mokdad. 2008. The PHQ-8 as a Measure of Current Depression in the General Population. Journal of Affective Disorders 114 (2008), 163–173.
[25] Genevieve Lam, Huang Dongyan, and Weisi Lin. 2019. Context-aware Deep Learning for Multi-modal Depression Detection. In ICASSP 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 3946–3950.
[26] Amanda McCleery, Junghee Lee, Aditi Joshi, Jonathan K Wynn, Gerhard S Hellemann, and Michael F Green. 2015. Meta-analysis of face processing event-related potentials in schizophrenia. Biological Psychiatry 77, 2 (2015), 116–126.
[27] Vikramjit Mitra, Andreas Tsiartas, and Elizabeth Shriberg. 2016. Noise and reverberation effects on depression detection from speech. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 5795–5799.
[28] Michelle Morales, Stefan Scherer, and Rivka Levitan. 2018. A linguistically-informed fusion approach for multimodal depression detection. In Proceedings of the Fifth Workshop on Computational Linguistics and Clinical Psychology: From Keyboard to Clinic. 13–24.
[29] Soujanya Poria, Erik Cambria, Devamanyu Hazarika, Navonil Majumder, Amir Zadeh, and Louis-Philippe Morency. 2017. Context-dependent sentiment analysis in user-generated videos. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 873–883.
[30] Soujanya Poria, Erik Cambria, Newton Howard, Guang-Bin Huang, and Amir Hussain. 2016. Fusing audio, visual and textual clues for sentiment analysis from multimodal content. Neurocomputing 174 (2016), 50–59.
[31] Soujanya Poria, Iti Chaturvedi, Erik Cambria, and Amir Hussain. 2016. Convolutional MKL based multimodal emotion recognition and sentiment analysis. In IEEE International Conference on Data Mining (ICDM). IEEE, 439–448.
[32] Syed Arbaaz Qureshi, Mohammed Hasanuzzaman, Sriparna Saha, and Gaël Dias. 2019. The Verbal and Non Verbal Signals of Depression: Combining Acoustics, Text and Visuals for Estimating Depression Level. arXiv preprint arXiv:1904.07656 (2019).
[33] Fabien Ringeval, Björn Schuller, Michel Valstar, Shashank Jaiswal, Erik Marchi, Denis Lalanne, Roddy Cowie, and Maja Pantic. 2015. AV+EC 2015: The First Affect Recognition Challenge Bridging Across Audio, Video, and Physiological Data. In Proceedings of the 5th International Workshop on Audio/Visual Emotion Challenge (AVEC '15).
[34] Stefan Scherer, Giota Stratou, Gale Lucas, Marwa Mahmoud, Jill Boberg, Jonathan Gratch, Albert (Skip) Rizzo, and Louis-Philippe Morency. 2014. Automatic audiovisual behavior descriptors for psychological disorder analysis. Image and Vision Computing 32 (Oct. 2014).
[35] Maximilian Schmitt and Björn W. Schuller. 2016. openXBOW: Introducing the Passau Open-Source Crossmodal Bag-of-Words Toolkit. CoRR abs/1605.06778 (2016). http://arxiv.org/abs/1605.06778
[36] Björn W. Schuller, Anton Batliner, Dino Seppi, Stefan Steidl, Thurid Vogt, Johannes Wagner, Laurence Devillers, Laurence Vidrascu, Noam Amir, Loïc Kessous, and Vered Aharonson. 2007. The relevance of feature type for the automatic classification of emotional user states: low level descriptors and functionals. In INTERSPEECH.
[37] Guangyao Shen, Jia Jia, Liqiang Nie, Fuli Feng, Cunjun Zhang, Tianrui Hu, Tat-Seng Chua, and Wenwu Zhu. 2017. Depression Detection via Harvesting Social Media: A Multimodal Dictionary Learning Solution. In IJCAI. 3838–3844.
[38] Brian Stasak, Julien Epps, Nicholas Cummins, and Roland Goecke. 2016. An Investigation of Emotional Speech in Depression Classification. In INTERSPEECH. 485–489.
[39] Lei Tong, Qianni Zhang, Abdul Sadka, Ling Li, Huiyu Zhou, et al. 2019. Inverse boosting pruning trees for depression detection on Twitter. arXiv preprint arXiv:1906.00398 (2019).
[40] Michel F. Valstar, Jonathan Gratch, Björn W. Schuller, Fabien Ringeval, Denis Lalanne, Mercedes Torres, Stefan Scherer, Giota Stratou, Roddy Cowie, and Maja Pantic. 2016. AVEC 2016: Depression, Mood, and Emotion Recognition Workshop and Challenge. CoRR abs/1605.01600 (2016). http://arxiv.org/abs/1605.01600
[41] Michel F Valstar, Enrique Sánchez-Lozano, Jeffrey F Cohn, László A Jeni, Jeffrey M Girard, Zheng Zhang, Lijun Yin, and Maja Pantic. 2017. FERA 2017: addressing head pose in the third facial expression recognition and analysis challenge. In IEEE International Conference on Automatic Face & Gesture Recognition (FG 2017). IEEE, 839–847.
[42] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is All You Need. In Proceedings of the 31st International Conference on Neural Information Processing Systems (NIPS '17).
[43] James R Williamson, Elizabeth Godoy, Miriam Cha, Adrianne Schwarzentruber, Pooya Khorrami, Youngjune Gwon, Hsiang-Tsung Kung, Charlie Dagli, and Thomas F Quatieri. 2016. Detecting depression using vocal, facial and semantic communication cues. In Proceedings of the 6th International Workshop on Audio/Visual Emotion Challenge. ACM, 11–18.
[44] Zheng Zhang, Jeff M Girard, Yue Wu, Xing Zhang, Peng Liu, Umur Ciftci, Shaun Canavan, Michael Reale, Andy Horowitz, Huiyuan Yang, et al. 2016. Multimodal spontaneous emotion corpus for human behavior analysis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).