Topic Modeling Based Multi-modal Depression Detection
Yuan Gong
University of Notre Dame
[email protected]

Christian Poellabauer
University of Notre Dame
[email protected]
ABSTRACT
Major depressive disorder is a common mental disorder that affects almost 7% of the adult U.S. population. The 2017 Audio/Visual Emotion Challenge (AVEC) asks participants to build a model to predict depression levels based on the audio, video, and text of an interview ranging between 7 and 33 minutes. Since averaging features over the entire interview loses most temporal information, discovering, capturing, and preserving useful temporal details in such a long interview is a significant challenge. Therefore, we propose a novel topic modeling based approach to perform context-aware analysis of the recording. Our experiments show that the proposed approach outperforms context-unaware methods and the challenge baselines for all metrics.
KEYWORDS
Topic modeling; depression detection; multi-modal; emotion recognition; natural language processing
1 INTRODUCTION

Major depressive disorder (MDD), commonly called depression, is one of the most common mood disorders and is characterized by a persistent low mood. The study in [6] showed that women have a lifetime risk of 10-20% and men a lifetime risk of 5-12% of developing MDD. Early and accurate detection of MDD ensures that appropriate treatment and intervention options can be considered. Therefore, there is a strong need for a simple method to detect depression. In the 2017 Audio/Visual Emotion Challenge (AVEC) [17], the depression sub-challenge task requires participants to predict the depression level (i.e., the PHQ-8 score [10]) using audio, video, and text analysis. The database used in this challenge is the distress analysis interview corpus (DAIC-WOZ) [7], [5], which includes data from 189 subjects. For each subject, the database includes audio/video features as well as the transcript of an interview ranging between 7 and 33 minutes, which is conducted by an animated virtual interviewer called Ellie, controlled by a human interviewer in another room.
[Figure 1: two example interviews (Subject 1 and Subject 2). Ellie's questions (e.g., "Do you feel down?", "What are you?", "Do you travel?") segment each interview into topics; the corresponding audio frames, video frames, and transcript of each segment are then placed into the matching topic slot (Topic 1, Topic 2, Topic 3) of the per-subject feature vector.]

Figure 1: Illustration of the proposed topic modeling based multi-modal feature vector building scheme.
A big difference between the depression detection task and a traditional emotion detection task is the decision unit. Since human emotion can change rather quickly, traditional emotion detection typically requires second-level prediction. Therefore, popular emotion recognition databases usually provide labels for short-term recordings, e.g., the IEMOCAP database [3] provides labels for each utterance, while the SEWA database provides labels for each segment of 100 ms. In contrast, depression is expressed through a persistently low mood, which is very different from short-term sadness. The study in [18] shows that the median duration of depression is three months; consequently, prediction of the depression level of an individual should be based on much longer observation periods. This difference between depression and emotion detection leads to two main challenges:

(1) Large decision unit. In the DAIC-WOZ database, each data sample is the audio and video recording of an interview of a specific subject, where the interview ranges from 7 to 33 minutes. Only one decision needs to be made for the entire interview. The length of the decision unit is much longer than for typical emotion recognition tasks, e.g., the 2017 AVEC emotion sub-challenge requires making decisions for each 100 ms segment. While a large data volume is typically beneficial for accuracy, processing large amounts of data can be challenging. When analyzing very long audio/video data, applying statistical functions (e.g., max, min, mean, quartiles) to short-term features over the entire interview will lead to the loss of detailed temporal information such as short-term sighs in despair, laughing, or anger. However, these short-term details within the interview can be useful when determining the depression level of the subject, especially when analyzed together with contextual information (e.g., sighing in despair when being asked about sleep quality, laughing when talking about a journey, and anger when remembering unhappy experiences). Therefore, it is important to map the whole interview to a feature vector such that short-term details and context are maintained.

(2) Limited number of samples. Since each subject has one sample (the entire audio/video recording), the number of samples is significantly lower than when each recording consists of many small samples (e.g., each utterance being a sample). In the 2017 AVEC depression sub-challenge, the number of samples in the training set is only 107. In addition, the database is unevenly distributed, i.e., the number of depression samples in the training set is only 30. With such a small sample size, the number of features should also be small to avoid the problems of dimensionality and overfitting. However, the dimensions of the audio and video features are very large; therefore, generating and selecting an appropriate number of discriminative features is essential.

In order to overcome these two challenges, we propose a topic modeling based multi-modal feature vector building scheme, as shown in Figure 1, to provide the basis for context-aware analysis. The interview is first segmented according to topics. Then, audio, video, and semantic features are generated for each topic segment separately and placed into the topic's slot in the feature vector. After the features for all topics have been placed, a two-step feature selection algorithm is executed to shrink the feature vector and keep only the most discriminative features. The proposed algorithm is inspired by the observation that all interviews contain not a fixed, but a limited range of topics. Further, we assume that each question by Ellie triggers a response on a new topic, which makes "topic tracking" feasible. We expect the following advantages from the proposed scheme:
(1) Logically organized short-term details based on context. When retaining the short-term details of the interview, we need to do so in a fashion that keeps the feature vector space relatively small and logically organized. Extracting details according to utterances is not context-organizable and leads to a dimension explosion, since each interview contains hundreds of utterances. For example, both subject 1 and subject 2 may smile at utterance 10, but their smiles might convey different information since their 10th utterances occur in different contexts. In contrast, the proposed scheme tracks the topic and places the features of each utterance, no matter where it occurs in the interview, into the slot of the topic it belongs to in the feature vector. In addition, one topic can cover multiple utterances, which makes the feature dimension much smaller.

(2) More flexible and precise discovery of useful features. In traditional feature building schemes, a feature can only be kept or discarded as a whole. However, it is common that a feature is only useful in some specific contexts and useless in others. Further, the same feature in different contexts might convey different information and should be regarded as separate features. For example, smiling in the context of discussing family can be more discriminative than smiling in the context of greeting someone, because the latter might only be due to etiquette. Therefore, we would like to keep a feature only in the contexts where it is useful. The proposed feature building scheme provides any combination of feature and context, such as smiling (family) and smiling (greeting). Thus, the feature selection algorithm can perform more flexible and accurate filtering. In summary, the proposed scheme allows us to perform a finer analysis of the subject's reaction to a specific topic, such as a lower voice when discussing family, irritation when discussing an unhappy situation, and the expressions used when describing recent emotions. We believe that this finer-grained analysis can improve the performance of depression detection.
1.1 Related Work

In the 2016 AVEC depression classification sub-challenge [19], a few of the proposed techniques adopted text analysis for model building. In [14] and [13], the text is analyzed at the subject level, and audio/video features are extracted separately and then fused with the semantic features, i.e., topic modeling is not used in these approaches. In [21], the authors conduct question/answer extraction (which is similar to topic extraction) before text analysis. However, the question/answer extraction is only applied to the text analysis; audio and video analysis is still conducted separately. In [22], the authors also conduct topic extraction, but merely use the semantic features of very few topics (3 topics for women and 4 topics for men) to build a simple decision tree. This approach achieved the best performance in the 2016 AVEC, which demonstrates the effectiveness of a simple model and key topic analysis. However, its performance on the test set is much worse than on the development set; its limited ability to generalize is probably due to the very small number of features. Further, audio and video features are not used in that work.

On the other hand, topic modeling, a technique for discovering topics in documents, has been widely adopted in applications such as text mining [9] and recommendation systems [20]. It has recently also been used for depression and neuroticism assessment [16]. In [16], the authors demonstrate that taking automatically derived topics into account improves prediction performance. However, audio and video analysis are again not involved in this work.

In summary, the works in [21], [22], and [16] use topic modeling, but only for text analysis. We further extend the application of topic modeling by using it for context-aware audio and video analysis. To the best of our knowledge, the proposed work is the first effort to combine topic modeling with multi-modal text, audio, and video analysis.
2 PROPOSED METHOD

2.1 Topic Modeling

Topic modeling typically requires a sophisticated algorithm such as latent Dirichlet allocation (LDA) [2] or network regularization [12]. However, for transcriptions of clinical interviews (such as those provided by the DAIC-WOZ database), topic modeling can be done much more simply, for multiple reasons. First, in the interview, only Ellie determines the topic by asking a question; the subject does not proactively initiate topics. Second, the number of topics in clinical interviews is limited. Third, Ellie is an animated interviewer controlled by human commands and therefore has a relatively fixed way of starting a topic. We observed that when starting a specific topic, Ellie chooses one sentence from a library that typically consists of only 1-3 fixed sentences per topic.

Based on these observations, we perform simple topic modeling on the interview transcripts. First, we build a preliminary sentence dictionary by traversing all of Ellie's speech and recording all non-redundant sentences. Then, we manually clean the preliminary dictionary, discarding sentences that do not start new topics (e.g., "that's good"). After that, we cluster the dictionary, grouping together the sentences that start the same topic. This is done in two steps: first, very similar sentences (with up to 3 characters difference) are clustered automatically; second, further manual clustering and checks are performed. Then, we review each sentence cluster, link each cluster to the corresponding topic, and put it into the topic dictionary. The topic dictionary is thus formatted as [topic name, corresponding Ellie sentences]. The complete list of 83 extracted topics is shown in Table 1.

Note that only a few topics are discussed in most interviews; e.g., only 14 topics cover over 80% of the interviews. In other words, topics are sparsely distributed across the interviews. The histogram of the topic cover rate is shown in Figure 2.
Figure 2: Histogram of topic cover rate.
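A minimal sketch of this dictionary-based topic matching is given below. `TOPIC_DICT`, `char_difference`, and `match_topic` are hypothetical names, not the authors' code; the two dictionary entries shown are taken from Table 1, and the tolerance of up to 3 differing characters mirrors the automatic clustering step described above.

```python
from difflib import SequenceMatcher

# Hypothetical topic dictionary: {topic name: corresponding Ellie sentences}.
# Two illustrative entries from Table 1; the full dictionary has 83 topics.
TOPIC_DICT = {
    "easy sleep": ["how easy is it for you to get a good night sleep"],
    "feeling lately": ["how have you been feeling lately"],
}

def char_difference(a: str, b: str) -> int:
    """Rough count of differing characters between two sentences."""
    matched = sum(block.size for block in
                  SequenceMatcher(None, a, b).get_matching_blocks())
    return max(len(a), len(b)) - matched

def match_topic(ellie_sentence: str, max_diff: int = 3):
    """Return the topic started by an Ellie sentence, or None.

    Sentences within max_diff characters of a dictionary entry are
    treated as the same topic-starting sentence.
    """
    sentence = ellie_sentence.lower().strip()
    for topic, variants in TOPIC_DICT.items():
        for variant in variants:
            if char_difference(sentence, variant) <= max_diff:
                return topic
    return None
```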
2.2 Feature Extraction

In this work, we use audio, video, and semantic features to build a multi-modal model. The audio and video features are provided by the 2017 AVEC organizers, while the semantic features are extracted by ourselves. All features are computed in a topic-wise fashion.
Audio features. We use the audio features extracted by the COVAREP toolkit [4] and formant features. The COVAREP toolkit generates a 74-dimensional feature vector that includes common features such as fundamental frequency and peak slope. The formant features contain the first 5 formants, i.e., the vocal tract resonance frequencies. Both COVAREP and formant features are extracted every 10 ms. For each topic, we further apply three statistical functions (mean, max, and min) to each feature over time to reduce the dimension. That is, for each topic, (74 + 5) × 3 = 237 audio features are used.
Video features. We use the action unit (AU) features extracted by the OpenFace toolkit [1], which include the information of 20 key AUs. For each topic, we further apply the three statistical functions (mean, max, and min) over time to each feature to reduce the dimension. Thus, for each topic, 20 × 3 = 60 video features are used.
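The per-topic pooling reduces a variable-length segment to a fixed-size vector for both modalities. A minimal NumPy sketch follows, assuming the frame matrix layouts described above; `pool_topic_features` is a hypothetical name:

```python
import numpy as np

def pool_topic_features(frames: np.ndarray) -> np.ndarray:
    """Pool frame-level features over one topic segment.

    For audio, frames has shape (n_frames, 79): 74 COVAREP features
    plus 5 formants per 10 ms frame, giving (74 + 5) * 3 = 237 values.
    For video, shape (n_frames, 20) of AU features gives 20 * 3 = 60.
    """
    return np.concatenate([frames.mean(axis=0),   # per-feature mean
                           frames.max(axis=0),    # per-feature max
                           frames.min(axis=0)])   # per-feature min
```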
Semantic features. For each of the 83 topics, we use the Linguistic Inquiry and Word Count (LIWC) [15] software to count the frequency of word presence in the subject's speech on that topic across 93 categories such as anger, negative emotion, and positive emotion. That is, the LIWC software takes the speech of a subject and generates a 93-dimensional feature vector. Further, inspired by [22], which demonstrates that some key topics such as sleep quality (topic index: 78) and PTSD diagnosis history (topic index: 82) have a high correlation with the depression level, we extract additional semantic features for the 8 topics (topic indices: 76-83, marked with an asterisk in Table 1) that we believe might be most discriminative. We use a dictionary based method to classify each key topic into 2 or 3 categories according to its content. For example, for the topic easy sleep (topic index: 78), the speech of each subject is classified into three categories: easy (when phrases such as 'no problem' are present), fair (when words such as 'it depends' are present), and hard (when words such as 'difficult' are present). The dictionary is built manually for each key topic.
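A sketch of the dictionary-based key-topic classification follows. The category names and the 'no problem' / 'it depends' / 'difficult' cues come from the example above; the remaining keywords and the helper name `classify_key_topic` are illustrative assumptions, not the manually built dictionaries themselves.

```python
# Illustrative keyword dictionary for one key topic ("easy sleep",
# topic index 78); the real dictionaries were built manually per topic.
SLEEP_CATEGORIES = {
    "easy": ["no problem", "easily"],
    "fair": ["it depends", "sometimes"],
    "hard": ["difficult", "hard"],
}

def classify_key_topic(answer: str, categories: dict) -> str:
    """Assign a subject's answer to a category by keyword presence.

    The first category whose keywords appear in the answer wins;
    answers matching no keyword fall through to "unknown".
    """
    answer = answer.lower()
    for category, phrases in categories.items():
        if any(phrase in answer for phrase in phrases):
            return category
    return "unknown"
```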
[Figure 3: feature vector layout: Gender | Topic Presence | Key Topic | Topic 1 slot (Video, Audio, LIWC features) | ... | Topic 83 slot (Video, Audio, LIWC features)]

Figure 3: Illustration of the structure of the feature vector.
2.3 Feature Vector Building

In order to conduct context-aware analysis, the feature vector needs to record the features of each topic separately. Therefore, each topic has a separate slot in the feature vector. We first find the topics discussed in each interview: for each interview, Ellie's sentences are traversed, and when a sentence is found in the topic dictionary, the corresponding topic and the subject's speech, together with its timestamps, are recorded. The subject's speech is used to generate semantic features, while the timestamps are used to synchronize the audio and video features. Then, all features are placed into the slots of the corresponding topics in the feature vector. As described in Section 2.2, each topic contains 237 audio features, 60 video features, and 93 LIWC features, and there are 83 topics in total, which leads to a 83 × (237 + 60 + 93) = 32,370 dimensional feature vector. Further, we add the presence of each topic to the feature vector, because each interview covers only a few topics and the topic presence might be correlated with the subject's status. Finally, gender is also attached to the feature vector, similar to the work in [22] and [14], where the authors report that gender information can greatly improve classification performance. Figure 3 illustrates the structure of the feature vector and Table 2 shows the dimension of each feature category. Due to the sparsity of topics, the feature vector is also sparse, i.e., the features of topics that are not discussed in an interview are missing. However, the slots of all topics are preserved in our approach: the slot of a topic that is not discussed in the interview is padded with -1.

Table 1: The list of topics extracted from the DAIC-WOZ (potential key topics are marked with an asterisk).

Ind. | Topic Abbr. | Sample Ellie Question
1 | more | can you tell me about that
2 | why | why
3 | last happy time | tell me about the last time you felt really happy
4 | origin | where are you from originally
5 | argue | when was the last time you argued with someone and what was it about
6 | advice ago | what advice would you give to yourself ten or twenty years ago
7 | control temper | how are you at controlling your temper
8 | things like la | what are some things you really like about l a
9 | proud | what are you most proud of in your life
10 | positive influence | who's someone that's been a positive influence in your life
11 | best friend describe | how would your best friend describe you
12 | things dont like la | what are some things you don't really like about l a
13 | major | what did you study at school
14 | regret | is there anything you regret
15 | dream job | what's your dream job
16 | enjoy travel | what do you enjoy about traveling
17 | how hard | how hard is that
18 | do sleep not well | what are you like when you don't sleep well
19 | experiences | what's one of your most memorable experiences
20 | hardest decision | tell me about the hardest decision you've ever had to make
21 | fun relax | what are some things you like to do for fun
22 | handle differently | tell me about a situation that you wish you had handled differently
23 | what decide | what made you decide to do that
24 | still work | are you still doing that
25 | erase memory | tell me about an event or something that you wish you could erase from your memory
26 | why move la | why did you move to l a
27 | change self | what are some things you wish you could change about yourself
28 | best quality | what would you say are some of your best qualities
29 | often back | how often do you go back to your home town
30 | how long diagnose | how long ago were you diagnosed
31 | guilty | what's something you feel guilty about
32 | when move la | when did you move to l a
33 | easy used la | how easy was it for you to get used to living in l a
34 | seek help | what got you to seek help
35 | when last happy | when was the last time you felt really happy
36 | cope | how do you cope with them
37 | compare la | how does it compare to l a
38 | hard parent | what's the hardest thing about being a parent
39 | still therapy | do you still go to therapy now
40 | travel a lot | do you travel a lot
41 | ever served military | have you ever served in the military
42 | when last time | when was the last time that happened
43 | best parent | what's the best thing about being a parent
44 | are you okay | are you okay with this
45 | mad | what are some things that make you really mad
46 | they triggered | are they triggered by something
47 | easy parent | do you find it easy to be a parent
48 | happy did that | are you happy you did that
49 | therapist affect | how has seeing a therapist affected you
50 | job | what are you
51 | symptoms | what were your symptoms
52 | ideal weekend | tell me how you spend your ideal weekend
53 | avoid | could you have done anything to avoid it
54 | do annoyed | what do you do when you are annoyed
55 | got in trouble | has that gotten you in trouble
56 | your kid | tell me about your kids
57 | someone made bad | tell me about a time when someone made you feel really badly about yourself
58 | different parent | what are some ways that you're different as a parent than your parents
59 | today kid | what do you think of today's kids
60 | down | do you feel down
61 | how know them | how do you know them
62 | feel often | do you feel that way often
63 | problem before | did you think you had a problem before you found out
64 | living situation | how do you like your living situation
65 | why stop | why did you stop
66 | how do you do | how are you doing today
67 | roommate | do you have roommates
68 | hard on yourself | do you think that maybe you're being a little hard on yourself
69 | like living with | what's it like for you living with them
70 | disturb thought | do you have disturbing thoughts
71 | where live | where do you live
72 | after military | what did you do after the military
73 | combat | did you ever see combat
74 | talk later | why don't we talk about that later
75 | military change | how did serving in the military change you
76* | change behavior | have you noticed any changes in your behavior or thoughts lately
77* | depression | have you been diagnosed with depression
78* | easy sleep | how easy is it for you to get a good night sleep
79* | family close | how close are you to your family
80* | feeling lately | how have you been feeling lately
81* | shy outgoing | do you consider yourself an introvert
82* | ptsd | have you ever been diagnosed with p t s d
83* | therapy useful | do you feel like therapy is useful

Table 2: Dimension of each feature category.

Feature Name | Dimension
Gender | 1
Topic Presence | 83
Key Topic | 8
LIWC | 7719
Formant | 1245
COVAREP | 18426
AUs | 4980
Sum | 32462
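To make the layout concrete, a minimal sketch of the vector assembly is given below; `build_feature_vector`, its arguments, and the topic indexing are hypothetical names, not the authors' code. It reproduces the dimensions of Table 2: 1 + 83 + 8 + 83 × 390 = 32,462.

```python
import numpy as np

FEATURES_PER_TOPIC = 237 + 60 + 93   # audio + video + LIWC = 390
N_TOPICS = 83

def build_feature_vector(topic_features: dict, gender: int,
                         key_topic_feats: np.ndarray) -> np.ndarray:
    """Assemble one subject's vector; undiscussed topics padded with -1.

    topic_features maps topic index (0-82) to a 390-dim array for the
    topics that actually occurred in the interview; key_topic_feats is
    the 8-dim key topic block.
    """
    slots = -np.ones((N_TOPICS, FEATURES_PER_TOPIC))   # -1 padding
    presence = np.zeros(N_TOPICS)
    for idx, feats in topic_features.items():
        slots[idx] = feats
        presence[idx] = 1.0
    # gender (1) + topic presence (83) + key topic (8) + slots (32,370)
    return np.concatenate([[gender], presence, key_topic_feats,
                           slots.ravel()])
```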
2.4 Feature Selection

In Section 2.3, a 32,462-dimensional feature vector is built, which maintains the audio, video, and text information of each topic. However, only a small number of features are actually useful, and the number of features must be small enough to avoid potential overfitting. Therefore, feature selection is an essential step of the proposed scheme.

We conduct feature selection in two steps. First, we conduct a quick model-independent feature selection on all features. The algorithm we use in this step is correlation-based feature subset selection (CFS) [8], which evaluates the value of a subset of features by considering the individual predictive ability of each feature along with the degree of redundancy between them. After this step, a subset of features is selected. Then, we conduct a fine model-dependent feature selection to find the optimal number of features. In this step, we first rank the features according to their F-value with respect to the corresponding label. We then run the regression algorithm with varying numbers of top-ranked features and finally select the best feature set.

This two-step feature selection algorithm is designed based on the following consideration. In our feature generation scheme, more features are correlated with each other than in a context-unaware feature generation scheme, because features belonging to the same topic are likely to be highly correlated. Thus, if we only conduct feature selection according to individual feature scores, we might obtain a set of features with high scores that are also closely correlated with each other. In other words, many features would be redundant and provide little extra information. To avoid that, we first conduct CFS to select a feature subset in which features have a high correlation with the label but low correlation with each other. Since CFS is a model-independent approach that cannot tell us the overfitting risk for our specific model and dataset, we further conduct a model-based selection on our dataset to find the appropriate number of features for our task.
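The second, model-dependent step can be sketched with scikit-learn as follows, assuming the CFS step has already reduced `X_subset` to its candidate columns; `select_feature_count` is a hypothetical helper, the CFS step itself (for which scikit-learn has no built-in) is omitted, and the inner evaluation is a simplified stand-in for the CV protocol described in the experiments.

```python
import numpy as np
from sklearn.feature_selection import f_regression
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import cross_val_score

def select_feature_count(X_subset, y, max_k=46):
    """Step 2: rank CFS-selected features by F-value, then sweep k.

    X_subset holds only the columns kept by the model-independent CFS
    step; returns the ranking and the k with the lowest CV RMSE.
    """
    f_vals, _ = f_regression(X_subset, y)
    order = np.argsort(f_vals)[::-1]          # highest F-value first
    best_k, best_rmse = 1, np.inf
    for k in range(1, max_k + 1):
        model = SGDRegressor(max_iter=1000, tol=1e-3)
        scores = cross_val_score(model, X_subset[:, order[:k]], y,
                                 scoring="neg_root_mean_squared_error",
                                 cv=10)
        rmse = -scores.mean()
        if rmse < best_rmse:
            best_k, best_rmse = k, rmse
    return order, best_k
```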
Class balancing. It has been widely reported that imbalanced classes greatly affect the performance of machine learning algorithms [11]. Unfortunately, most healthcare related databases, including DAIC-WOZ, are imbalanced. In the training set of the DAIC-WOZ database, only 30 of 107 subjects are depressed, which means that there are many more subjects with low PHQ-8 scores than with high PHQ-8 scores. Therefore, before running the machine learning algorithm, we perform random oversampling, simply duplicating samples so that the number of samples for each PHQ-8 score is roughly the same.
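A minimal sketch of this duplication-based oversampling, assuming NumPy arrays; `random_oversample` is a hypothetical helper name:

```python
import numpy as np

def random_oversample(X, y, seed=0):
    """Duplicate samples so every PHQ-8 score is roughly equally frequent."""
    y = np.asarray(y)
    scores, counts = np.unique(y, return_counts=True)
    target = counts.max()                       # match the largest group
    rng = np.random.default_rng(seed)
    idx = []
    for score, count in zip(scores, counts):
        members = np.where(y == score)[0]
        idx.extend(members)                               # originals
        idx.extend(rng.choice(members, target - count))   # duplicates
    idx = np.array(idx)
    return X[idx], y[idx]
```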
Regression models. In this work, we perform a grid search over the following regression models: random forest regression (number of trees: 1, 10, 20, 30, 40, 50, 100, and 200), stochastic gradient descent (SGD) regression, and support vector regression (SVR) (kernel: linear, polynomial, and radial basis function (RBF)).
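The stated grid can be written down directly with scikit-learn; `model_grid` is a hypothetical helper, and the SGD/SVR hyper-parameters beyond those named above are library defaults:

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import SGDRegressor
from sklearn.svm import SVR

def model_grid():
    """Build the candidate regressors named in the grid search."""
    models = {}
    for n in [1, 10, 20, 30, 40, 50, 100, 200]:
        models[f"RF({n})"] = RandomForestRegressor(n_estimators=n)
    models["SGD"] = SGDRegressor(max_iter=1000, tol=1e-3)
    for kernel in ["linear", "poly", "rbf"]:
        models[f"SVR({kernel})"] = SVR(kernel=kernel)
    return models
```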
3 EXPERIMENTS

In the 2017 AVEC challenge, only the training and development sets of the DAIC-WOZ database are available. However, performing both optimization and testing on the development set would lead to significant overfitting on the development set. Therefore, we adopt the following test strategies for our experiments:

(1) 10-fold cross validation (CV): In this test strategy, the training set and development set are concatenated and then divided into 10 folds in a stratified manner (one plausible fold construction is sketched after this list). Each time, one fold is used for testing and the other 9 folds are used for training. Note that the random oversampling and the model-dependent feature selection are conducted after the data splitting, and only on the training data. Since it is not meaningful to conduct CFS feature selection in a cross-validation manner, the model-independent feature selection is conducted on the entire training and development set, which leads to an over-optimistic estimate of the test result but does not affect the other hyper-parameter selections. Thus, we believe this is the fairest way of testing. All optimizations, including model selection, hyper-parameter tuning, and feature selection, are performed according to the results of CV.

(2) Test on the development set (Dev): In this strategy, we train the model using the official training set and test on the official development set. In order to avoid reporting over-optimistic results on the development set, we do not conduct any optimization against the development set. Instead, we find the best model, hyper-parameters, and feature numbers in the CV test and use them to build the model on the training set.

(3) Test on the testing set (Test): In this strategy, we train the model using the official training and development sets and test on the official test set, since we want to use all available data for training to increase the model's robustness. Again, all parameters used in building the model are selected in the CV test.
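As referenced in strategy (1), one plausible fold construction is sketched below. The paper does not state the stratification variable, so stratifying on the binarized depression label (PHQ-8 >= 10) is an assumption here, as is the helper name `stratified_cv_splits`.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

def stratified_cv_splits(X, phq8, n_folds=10, seed=0):
    """10-fold splits of the pooled train+dev data.

    Stratifies on the binary depression label so each fold keeps the
    class ratio; oversampling and model-dependent selection would then
    run per fold, on the training portion only.
    """
    depressed = (np.asarray(phq8) >= 10).astype(int)
    skf = StratifiedKFold(n_splits=n_folds, shuffle=True,
                          random_state=seed)
    return list(skf.split(X, depressed))
```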
In this work, we report four metrics for each test strategy mentioned above:

(1) Root mean square error (RMSE) is the challenge target; therefore, all optimizations, including model selection and feature selection, are performed according to this metric.

(2) Mean absolute error (MAE) is another metric reported by the official baseline [17]; we use it together with RMSE to analyze the difference between ground truth and prediction.

(3) Pearson correlation coefficient (CC) is an important metric for evaluating regression performance, reflecting the linear correlation between ground truth and prediction.

(4) F1-score measures the performance of binary depression classification, i.e., a subject is considered depressed when the PHQ-8 score is greater than or equal to 10 and non-depressed otherwise.
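All four metrics can be computed from the PHQ-8 predictions with scikit-learn and NumPy; a sketch follows, where `report_metrics` is a hypothetical name and the threshold of 10 is the one given above:

```python
import numpy as np
from sklearn.metrics import (mean_squared_error, mean_absolute_error,
                             f1_score)

def report_metrics(y_true, y_pred):
    """Compute RMSE, MAE, Pearson CC, and binary-depression F1."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    mae = mean_absolute_error(y_true, y_pred)
    cc = np.corrcoef(y_true, y_pred)[0, 1]     # Pearson correlation
    f1 = f1_score(y_true >= 10, y_pred >= 10)  # depressed iff PHQ-8 >= 10
    return {"RMSE": rmse, "MAE": mae, "CC": cc, "F1": f1}
```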
We compare the proposed method with the following baseline methods:

(1) Basic baseline, where the model constantly predicts the mean PHQ-8 score of the training set. This is a very basic baseline that any workable regression algorithm should outperform.

(2) Challenge baseline, the official baseline provided in [17]. It uses a random forest regressor (number of trees = 10) on the audio and video features extracted by the COVAREP toolkit [4] and the OpenFace toolkit [1]. Regression is performed on a frame-wise basis, and temporal fusion is performed by averaging the outputs over the entire interview. Fusion of the audio and video modalities is performed by averaging the unimodal regression outputs. In [17], the authors present results for audio-unimodal, video-unimodal, and audio/video multimodal variants of this baseline, of which the video-unimodal variant performs best. Therefore, we use the video-unimodal results for comparison.

(3) Context-unaware baseline. Since the proposed method and the official challenge baseline differ in features, regression model, and class balancing, it is hard to judge which factor causes any performance gap. Therefore, we use this baseline to check the effectiveness of context-aware analysis. It is exactly the same as the proposed method (i.e., the same audio, video, and LIWC features are extracted, the same feature selection algorithms are used, and the regression model is selected from the same grid), except that topic modeling is not used: features are extracted and averaged over the entire interview (instead of topic-wise), and topic related features (topic presence and key topic features) are not included.

(4) Proposed method, as described in Section 2.
4 RESULTS

Through a grid search in the CV test, we selected the best regression model (the SGD regressor) and the best number of features (46). We then used these settings in the Dev and Test experiments; the results are shown in Table 3. We observe that the proposed method achieves the best performance for all metrics and test strategies. Further, the proposed method performs significantly better than the context-unaware baseline, which demonstrates the effectiveness of context-aware analysis. In addition, we observe that the performance of the proposed method on the test set is worse than on the development set and in cross validation. This is because the model-independent CFS feature selection is conducted not in a cross-validation manner but on the entire training and development set, since it is meaningless to conduct CFS in a cross-validation manner. Nevertheless, the performance of the proposed method on the test data remains much better than the challenge baseline.

Table 3: Results of the depression regression experiment ("n/a" indicates a result that is not available).

Method | RMSE CV | RMSE Dev | RMSE Test | MAE CV | MAE Dev | MAE Test | CC CV | CC Dev | CC Test | F1 CV | F1 Dev | F1 Test
Basic baseline | 5.84 | 6.57 | n/a | 4.81 | 5.50 | n/a | -0.35 | 0.00 | n/a | 0.00 | 0.00 | n/a
Challenge baseline | n/a | 7.13 | 6.97 | n/a | 5.88 | 6.12 | n/a | n/a | n/a | n/a | n/a | n/a
Context-unaware baseline | 5.55 | 5.02 | n/a | 4.56 | 4.42 | n/a | 0.45 | 0.69 | n/a | 0.58 | 0.67 | n/a
Proposed method | … | … | 4.99 | … | … | … | … | … | … | … | … | …

Note: Due to the limited number of test attempts allowed in the 2017 AVEC, we are not able to provide results on the test set for the baseline approaches. The challenge baseline paper [17] does not include test results for CC and F1-score, nor results tested in a CV manner.
Figure 4: Distribution of topics corresponding to the selected features (count in parentheses). Left: proposed feature selection algorithm. Right: baseline feature selection algorithm.
It is very interesting to see which features are actually selected and useful for depression detection. In our feature building scheme, each feature corresponds to one topic and one feature category. As shown in Figure 4 (left), from the perspective of the topics involved, we observe that 31 of the 83 topics are involved; the most frequent topics among the selected features are topic 30: how long diagnose, 31: guilty, 34: seek help, and 77: depression. Further, we observe that our approach uses a variety of topics that, from a human's perspective, seem not closely related to depression, such as topic 6: advice ago and 16: enjoy travel. In addition, to check the effectiveness of the proposed two-step feature selection algorithm, we compare it with a baseline feature selection algorithm that consists only of step 2 of the proposed method, i.e., it considers only the score of each feature individually. As shown in Figure 4 (right), the feature vector selected by the baseline feature selection algorithm includes only three topics: 30: how long diagnose, 34: seek help, and 39: still therapy. We conduct an experiment using this feature vector and find that its result on the test set (RMSE: 5.60) is much worse than that of the proposed approach (RMSE: 4.99). This demonstrates that the proposed two-step feature selection algorithm is able to discover independent features and improve the result. While it is possible that topics 30, 34, and 39 are the topics most closely related to depression, taking more topics into consideration can lead to a more precise prediction. We believe this is also an advantage of the proposed method over a clinician's analysis, because it is very hard for a clinician to observe and model such a large number of factors in an interview.

From the perspective of the feature categories involved, we observe that the selected feature set involves LIWC features, key topic semantic features, COVAREP audio features, and AU video features. The two key topics involved are 78: easy sleep and 80: feeling lately. The gender feature, topic presence features, and formant features are not involved. A complete pie chart of the distribution of feature categories corresponding to the selected feature set is shown in Figure 5.

[Figure 5: pie chart of the 46 selected features: COVAREP (33), AUs (9), Key Topic (2), LIWC (2)]

Figure 5: Distribution of feature categories corresponding to the selected features (count in parentheses).
[Figure 6: RMSE vs. number of features for the RF(40), SGD, and SVR(RBF) regressors]

Figure 6: The relationship between RMSE and the number of features for different regression models.
[Figure 7: CC vs. number of features for the RF(40), SGD, and SVR(RBF) regressors]

Figure 7: The relationship between CC and the number of features for different regression models.

Two important hyper-parameters in the proposed method are the number of features and the regression model. Thus, we performed a grid search, in a cross-validation manner, over the following regression models: random forest regression (RF) (number of trees: 1, 10, 20, 30, 40, 50, 100, 200), SGD regression, and support vector regression (SVR) (kernel: linear, polynomial, and RBF), and over the feature numbers 1-46 (46 being the total number of features in the subset selected by the first-round CFS feature selection). The relationship between regression performance and these hyper-parameters is shown in Figures 6 and 7; for clarity, we plot only the top 3 regression models.

We observe that when the number of features is small, the random forest regressor (tree number = 40), the SGD regressor, and the SVR (RBF) regressor perform similarly. However, as the number of features increases, the SGD and SVR models continue to improve while the random forest model stops improving much earlier. The SGD and SVR regressors have similar performance, with the SGD regressor achieving a slightly lower RMSE than SVR. Though the lowest RMSE is achieved with 41 features, we believe this is more likely a fluctuation in the CV test, and we therefore choose 46 features, preferring to use more features to build a more discriminative model. The experiment shows that using 46 features (RMSE: 4.99) yields better performance on the test set than using 41 features (RMSE: 5.22).
5 CONCLUSIONS

Major depressive disorder is a widespread mental disorder, and accurate detection is essential for targeted intervention and treatment. In this challenge, participants are asked to build a model predicting depression levels based on the audio, video, and text of an interview ranging between 7 and 33 minutes. Since averaging features over the entire interview loses most temporal details, discovering, capturing, and preserving important temporal details in such long interviews is a significant challenge. Therefore, we propose a novel topic modeling based approach to perform context-aware analysis. Our experiments show that the proposed approach performs significantly better than the context-unaware method and the challenge baseline for all metrics. In addition, by analyzing the features selected by the machine learning algorithm, we found that our approach is able to discover a variety of temporal features that have an underlying relationship with depression and to build a model on them, a task that is difficult for humans to perform.
REFERENCES

[1] Tadas Baltrušaitis, Peter Robinson, and Louis-Philippe Morency. 2016. OpenFace: an open source facial behavior analysis toolkit. In Applications of Computer Vision (WACV), 2016 IEEE Winter Conference on. IEEE, 1–10.
[2] David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent Dirichlet allocation. Journal of Machine Learning Research 3, Jan (2003), 993–1022.
[3] Carlos Busso, Murtaza Bulut, Chi-Chun Lee, Abe Kazemzadeh, Emily Mower, Samuel Kim, Jeannette N. Chang, Sungbok Lee, and Shrikanth S. Narayanan. 2008. IEMOCAP: Interactive emotional dyadic motion capture database. Language Resources and Evaluation 42, 4 (2008), 335.
[4] Gilles Degottex, John Kane, Thomas Drugman, Tuomo Raitio, and Stefan Scherer. 2014. COVAREP: A collaborative voice analysis repository for speech technologies. In Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on. IEEE, 960–964.
[5] David DeVault, Ron Artstein, Grace Benn, Teresa Dey, Ed Fast, Alesia Gainer, Kallirroi Georgila, Jon Gratch, Arno Hartholt, Margaux Lhommet, et al. 2014. SimSensei Kiosk: A virtual human interviewer for healthcare decision support. In Proceedings of the 2014 International Conference on Autonomous Agents and Multi-agent Systems. International Foundation for Autonomous Agents and Multiagent Systems, 1061–1068.
[6] Maurizio Fava and Kenneth S. Kendler. 2000. Major depressive disorder. Neuron 28, 2 (2000), 335–341.
[7] Jonathan Gratch, Ron Artstein, Gale M. Lucas, Giota Stratou, Stefan Scherer, Angela Nazarian, Rachel Wood, Jill Boberg, David DeVault, Stacy Marsella, et al. 2014. The Distress Analysis Interview Corpus of human and computer interviews. In LREC. 3123–3128.
[8] M. A. Hall. 1998. Correlation-based feature subset selection for machine learning. Ph.D. thesis, University of Waikato.
[9] Liangjie Hong and Brian D. Davison. 2010. Empirical study of topic modeling in Twitter. In Proceedings of the First Workshop on Social Media Analytics. ACM, 80–88.
[10] Kurt Kroenke, Tara W. Strine, Robert L. Spitzer, Janet B. W. Williams, Joyce T. Berry, and Ali H. Mokdad. 2009. The PHQ-8 as a measure of current depression in the general population. Journal of Affective Disorders 114, 1–3 (2009), 163–173.
[11] Xu-Ying Liu, Jianxin Wu, and Zhi-Hua Zhou. 2009. Exploratory undersampling for class-imbalance learning. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 39, 2 (2009), 539–550.
[12] Qiaozhu Mei, Deng Cai, Duo Zhang, and ChengXiang Zhai. 2008. Topic modeling with network regularization. In Proceedings of the 17th International Conference on World Wide Web. ACM, 101–110.
[13] Md Nasir, Arindam Jati, Prashanth Gurunath Shivakumar, Sandeep Nallan Chakravarthula, and Panayiotis Georgiou. 2016. Multimodal and multiresolution depression detection from speech and facial landmark features. In Proceedings of the 6th International Workshop on Audio/Visual Emotion Challenge. ACM, 43–50.
[14] Anastasia Pampouchidou, Olympia Simantiraki, Amir Fazlollahi, Matthew Pediaditis, Dimitris Manousos, Alexandros Roniotis, Georgios Giannakakis, Fabrice Meriaudeau, Panagiotis Simos, Kostas Marias, et al. 2016. Depression assessment by fusing high and low level features from audio, video, and text. In Proceedings of the 6th International Workshop on Audio/Visual Emotion Challenge. ACM, 27–34.
[15] James W. Pennebaker, Ryan L. Boyd, Kayla Jordan, and Kate Blackburn. 2015. The development and psychometric properties of LIWC2015. Technical Report.
[16] Philip Resnik, Anderson Garron, and Rebecca Resnik. 2013. Using topic modeling to improve prediction of neuroticism and depression. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 1348–1353.
[17] Fabien Ringeval, Björn Schuller, Michel Valstar, Jonathan Gratch, Roddy Cowie, Stefan Scherer, Sharon Mozgai, Nicholas Cummins, Maximilian Schmitt, and Maja Pantic. 2017. AVEC 2017: Real-life depression, and affect recognition workshop and challenge. In Proceedings of the 7th International Workshop on Audio/Visual Emotion Challenge. ACM, 1–8.
[18] Jan Spijker, Ron De Graaf, Rob V. Bijl, Aartjan T. F. Beekman, Johan Ormel, and Willem A. Nolen. 2002. Duration of major depressive episodes in the general population: results from The Netherlands Mental Health Survey and Incidence Study (NEMESIS). The British Journal of Psychiatry 181 (2002).
[19] Michel Valstar, Jonathan Gratch, Björn Schuller, Fabien Ringeval, Denis Lalanne, Mercedes Torres Torres, Stefan Scherer, Giota Stratou, Roddy Cowie, and Maja Pantic. 2016. AVEC 2016: Depression, mood, and emotion recognition workshop and challenge. In Proceedings of the 6th International Workshop on Audio/Visual Emotion Challenge. ACM, 3–10.
[20] Chong Wang and David M. Blei. 2011. Collaborative topic modeling for recommending scientific articles. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 448–456.
[21] James R. Williamson, Elizabeth Godoy, Miriam Cha, Adrianne Schwarzentruber, Pooya Khorrami, Youngjune Gwon, Hsiang-Tsung Kung, Charlie Dagli, and Thomas F. Quatieri. 2016. Detecting depression using vocal, facial and semantic communication cues. In Proceedings of the 6th International Workshop on Audio/Visual Emotion Challenge. ACM, 11–18.
[22] Le Yang, Dongmei Jiang, Lang He, Ercheng Pei, Meshia Cédric Oveneke, and Hichem Sahli. 2016. Decision tree based depression classification from audio video and language information. In Proceedings of the 6th International Workshop on Audio/Visual Emotion Challenge. ACM.