A Multitask Deep Learning Approach for User Depression Detection on Sina Weibo
Yiding Wang, Zhenyi Wang, Chenghao Li, Yilin Zhang, Haizhou Wang
Yiding Wang
College of Cybersecurity, Sichuan University, Chengdu, China 610207
[email protected]

Zhenyi Wang
College of Cybersecurity, Sichuan University, Chengdu, China 610207
[email protected]

Chenghao Li
College of Cybersecurity, Sichuan University, Chengdu, China 610207

Yilin Zhang
College of Cybersecurity, Sichuan University, Chengdu, China 610207

Haizhou Wang ∗
College of Cybersecurity, Sichuan University, Chengdu, China 610207
[email protected]
August 31, 2020

Abstract
In recent years, due to the mental burden of depression, the number of people who endanger their lives has been increasing rapidly. The online social network (OSN) provides researchers with another perspective for detecting individuals suffering from depression. However, existing studies of depression detection based on machine learning still leave relatively low classification performance, suggesting that there is significant potential for improvement in their feature engineering. In this paper, we manually build a large dataset on Sina Weibo (a leading OSN with the largest number of active users in the Chinese community), namely the Weibo User Depression Detection Dataset (WU3D). It includes more than 20,000 normal users and more than 10,000 depressed users, both of which are manually labeled and rechecked by professionals. By analyzing the user's text, social behavior, and posted pictures, ten statistical features are concluded and proposed. In the meantime, text-based word features are extracted using the popular pretrained model XLNet. Moreover, a novel deep neural network classification model, i.e. FusionNet (FN), is proposed and simultaneously trained with the above-extracted features, which are treated as multiple classification tasks. The experimental results show that FusionNet achieves the highest F1-Score of 0.9772 on the test dataset. Compared to existing studies, our proposed method has better classification performance and robustness for unbalanced training samples. Our work also provides a new way to detect depression on other OSN platforms.
Keywords
Depression detection · Online social network · Feature engineering · Deep neural network · Multitask learning
With the rapid development of online social networks (OSNs) such as Twitter and Facebook, people are more frequently using the OSN to express opinions and emotions. It provides researchers with a novel and effective way to detect the mood, communication, activity, and social behavior patterns of individuals [1].

∗ Corresponding author: H. Wang ([email protected])

In the past decade, researchers
in various fields have conducted quantitative analyses of different illnesses and mental disorders based on the OSN platform [2–8]. Sina Weibo (hereinafter referred to as "Weibo") is the most popular OSN in the Chinese community [9]. A statistic shows the number of Weibo's monthly active users reached more than 480 million in the second quarter of 2019.

Major depressive disorder, referred to as depression, is a common mental disease. According to a survey of the World Health Organization (WHO), more than 300 million people worldwide suffer from depression. Depression can cause great psychological pain, even suicidal tendencies. Moreover, evidence from a health action plan of WHO shows that people suffering from depression are much more likely to end their life prematurely than the general population. Despite the current availability of psychotherapy, medical therapy, and other modalities for the treatment of depression, 76%-85% of patients in low- and middle-income countries remain untreated. This phenomenon arises not only from the lack of medical resources but also from the inability to make an accurate assessment in the early stage of depression, which leaves a large number of people with depression unable to get timely diagnosis and treatment [10].

Pictures, text, videos, and other information posted on the OSN can reflect feelings of worthlessness, guilt, helplessness, and self-hatred, which can help researchers to specifically analyze and characterize depressed individuals [1, 10–12]. However, there are some insurmountable problems in online depression detection using traditional analyzing methods. They often focus on analyzing the characteristics of users with depression rather than constructing predictive models. Therefore, it is difficult to give timely prediction results for new depressed users. Moreover, they are incapable of dealing with a large amount of instant interactive user data.

With the rapid development of artificial intelligence technologies, machine learning approaches have made great contributions to the detection of depression [13–17]. An automated depression detection model based on machine learning usually needs to analyze various information such as tweets, pictures, videos, and social activity data of users. Then, it gives the classification results of the predicted objects, most of which are presented as a binary result of normal or depressive. If an individual is predicted to have a potential depressive tendency, further resources and assistance can be provided, including later medical and psychological diagnoses. Such heuristic learning approaches are quite effective for helping in the early detection of depression [18] since they are capable of handling a large amount of instant interactive user data. However, current approaches to online depression detection still face many unresolved challenges.

Firstly, many current studies are not user-oriented modeling [19–21]. Those works usually aim to analyze and model the language style of the user. Through sentiment analysis and feature engineering of the tweet text, a classification model is developed to detect whether a specific tweet has a depressive tendency. These works analyzed fine-grained features and achieved pretty good results. However, such results cannot be directly applied to user-level depression detection, or they may lead to incorrect predictions.

Second, in several existing studies [1, 19, 22–24], the size of the dataset used for modeling is insufficient, with only a few hundred to a few thousand data samples being used. Because of the difficulty of accurately obtaining and labeling depressed samples, researchers usually choose to construct small datasets or directly cite datasets from other works. As a consequence, the trained model fails to reach good generalization performance and is thus hard-pressed to accurately predict depressed users on the OSN.

Moreover, not enough studies of user depression detection have been proposed on Weibo compared to Twitter and Facebook. To the best of our knowledge, there is no published large Weibo user depression detection dataset available currently.

Finally, many of the existing proposed models still do not reach a high level of classification performance, i.e. an F1-Score of 90% and above. Thus, these models need to be further improved to achieve better performance.
Given the above problems and challenges, we hereby summarize the contributions of our work as below:

• We build and publish a large-scale labeled dataset - the Weibo User Depression Detection Dataset (WU3D). WU3D includes more than 10,000 depressed users and more than 20,000 normal users, each of which contains enriched information fields, including tweets, the posting time, posted pictures, the user gender, etc. This dataset is labeled and further reviewed by professionals.

• We summarize ten features of depressed users, four of which are proposed for the first time. Different from some existing work that directly uses the information fields as features, we made statistical analyses of all the proposed features. These features show significant distribution differences between depressed and normal users in our experiments.

• We construct a Deep Neural Network (DNN) classification model, i.e. FusionNet. It implements a multitask learning strategy to process text-based word vectors and statistical features simultaneously. Experimental results show that it achieves both the highest classification performance and the best robustness to unbalanced training samples.

The subsequent sections of this paper are organized as follows. In Section II, related work and achievements in the field of depression detection on OSNs are introduced and analyzed. The proposed framework is elaborated in Section III. Furthermore, Section IV gives the significance evaluation of statistical features and the performance comparison experiments of several classification models (including our proposed FusionNet). At the end of the paper, Section V summarizes our work and discusses directions for future work.
The current methods for online depression detection mainly include two directions: (i) manually extracting features and building Traditional Machine Learning (TML) models for classification; (ii) using Deep Learning (DL) approaches to automatically extract features and constructing deep neural network models as classifiers. Among them, some of the research that uses DL also introduces TML methods to further improve model performance. The research of each approach will be introduced below respectively.
Mining depressed users based on TML mostly uses features, i.e. numeric vectors that have been manually analyzed and extracted from users to represent the predicted object (a user, a tweet, a posted picture, etc.) [18].

Choudhury et al. [1] presented a pioneering work in this field of research. They explored potential user behavior to perform user-oriented depression detection. By measuring behavioral attributes of Twitter users relating to social engagement, emotion, language, and linguistic styles, they discovered useful signals for characterizing depression. Although their trained classifiers did not achieve high classification performance, as a pioneering work in this field, they provided a detailed feature engineering analysis process and a clear modeling approach.

Wang et al. [25] undertook further research using data from Twitter and Weibo. Compared with the work of [1] that made a more comprehensive feature analysis, this study implemented a sentiment analysis approach and proposed man-made rules by utilizing vocabulary to measure depressive tendencies of tweets. Their work indicated that text-based features play a crucial role in online depression detection.

Deshpande et al. [20] proposed a representation learning method based on natural language processing (NLP) to model the text information on Twitter. Different from the previously mentioned work [1, 25], they used the Bag of Words (BOW) algorithm to represent the tweet text as a sparse vector, allowing the classifier to automatically learn latent features. Their trained Naive Bayes (NB) classifier reached an F1-Score of 0.8329, while the Support Vector Machine (SVM) classifier only reached an F1-Score of 0.7973.

After that, Shen et al. [10] proposed an advanced detecting approach that can be used to detect depressed users in a timely manner. They constructed a well-labeled depression detection dataset on Twitter, which has been widely used by subsequent researchers. In the meantime, they extracted six depression-related feature groups covering the text, social behavior, and posted pictures. Their proposed multimodal depressive dictionary learning (MDL) approach can effectively learn the latent and sparse representation of user features. Experiments showed their proposed MDL model achieved an F1-Score of 0.85, indicating that the dictionary learning strategy and the ensemble of multimodal features are quite effective.

In recent years, more TML-based work has begun to emerge [21, 23, 24]. In particular, Mustafa et al. [23] implemented the Term Frequency-Inverse Document Frequency (TF-IDF) algorithm to weight the words in tweets. Their trained classifier based on a one-dimensional convolutional neural network (CNN-1D) achieved an F1-Score of 0.89. Their work is the first to introduce a neural network model for detecting depressed users on the OSN.

https://github.com/aidenwang9867/Weibo-User-Drpession-Detection-Dataset
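As a rough illustration of the TF-IDF weighting used in [23] (the exact preprocessing and parameters of that work are not reproduced here), the sketch below computes term frequency-inverse document frequency weights for a toy set of tweets using only the standard library:

```python
import math
from collections import Counter

def tf_idf(docs):
    """Compute a simple TF-IDF weight for every (document, word) pair.

    tf  = count of the word in the document / document length
    idf = log(N / number of documents containing the word)
    """
    n = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    df = Counter(w for toks in tokenized for w in set(toks))
    weights = []
    for toks in tokenized:
        tf = Counter(toks)
        weights.append({w: (c / len(toks)) * math.log(n / df[w])
                        for w, c in tf.items()})
    return weights

# Toy corpus: a word appearing in every tweet (e.g. "i") gets idf = 0,
# while rarer, more discriminative words keep a positive weight.
tweets = ["i feel hopeless and tired",
          "i love sunny days",
          "i feel nothing matters"]
w = tf_idf(tweets)
```

Words shared by all tweets are zeroed out, so the weighting naturally emphasizes the depression-indicative vocabulary that [23] exploits.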
Modeling approaches based on DL mainly jointly consider user social behaviors and multimedia information such as text, pictures, videos, etc. Among them, the modeling of the text information is the main research direction. Researchers have adopted NLP approaches to embed text into a high-dimensional continuous vector to automatically mine word features. Some work has also fused manually extracted features into DNN classifiers as part of the input, or integrated traditional classifiers with DNN classifiers to improve performance. These multimodal and ensemble approaches have proven to be an effective way to accomplish various tasks in social network analysis, including depression detection [26].

Several DNN classifiers that have achieved significant performance in NLP classification tasks were selected and evaluated by Orabi et al. [27]. They used a pretrained Word2Vec [28] model to embed the text of tweets. Their experimental results showed that the CNN-1D with a max-pooling structure reported the highest performance. Compared to other recurrent structures, including the recurrent neural network (RNN) and the Long Short-Term Memory (LSTM) neural network [29, 30], CNN-based models performed better in the task of depression detection.

Then, Sadeque et al. [31] proposed a latency-weighted F1 metric and applied it in a novel sequential classifier based on Gated Recurrent Units (GRU). They treated all the text of tweets as documents and input them to the classifier asynchronously, a strategy named "post-by-post". It allows the model to decide the depressive tendency of a user after each tweet is scanned. Thus, it somewhat avoids the time consumption of scanning too many tweets for a certain and obvious depressed user (e.g., a user with 200 tweets recording their anti-depressant experience). This approach can scan and detect depressive tendencies of tweets more efficiently.

Later, based on the prior work [10], Shen et al. [11] discovered that the current research on a specific OSN may be unsuitable and not universal for depression detection on other platforms. Thus, they proposed a cross-domain DNN model with a Feature Adaptive Transformation & Combination (DNN-FATC) strategy that can consider features of several aspects comprehensively and transfer the relevant information across heterogeneous domains.

Recently, more studies based on DL have been widely proposed. Gui et al. [12] further discussed the change of classification accuracy of the model under different proportions of depressed users and pointed out that the highest accuracy can be achieved when the proportion of normal and depressed user samples is close to balance. Moreover, they implemented a reinforcement learning (RL) approach to further improve the performance of the model. Lin et al. [32] used a popular pretrained model, i.e. BERT [33], to embed word vectors. Its hidden layer output was extracted to fuse both text and image features to further accomplish the downstream classification task.
To detect depressed users on the OSN more effectively, we propose a novel framework, as shown in Fig. 1. This framework mainly consists of three parts.

i. User data collection and labeling. This module contains two independent crawler systems (UserID-Crawler and UserInfo-Crawler), which are used to collect user samples on Weibo. Then, it is responsible for filtering and labeling the collected data to construct the Weibo User Depression Detection Dataset (WU3D).

ii. Feature extracting. This module is in charge of extracting the user's text information, including nicknames, profiles, and tweet text, and concatenating them into a long text sequence. Then, the sequence is input to the XLNet [34] pretrained model to obtain embedded word features. In the meantime, this module extracts statistical features of user text, social behavior, and posted pictures. Finally, these features are jointly input into the classification model.

iii. Model training and predicting. This module implements a depression detection model based on DNN, namely FusionNet, which receives feature input from the Feature Extracting module. The proposed FusionNet can be trained in a multitask learning mode, in which word vectors and statistical features can be used jointly to optimize the classifier in each training step.

The following parts of this section will elaborate on the theoretical construction and implementation methods of these modules, respectively.
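The text preparation in the Feature Extracting module — concatenating the nickname, profile, and tweet text into the long sequence S_ξ and truncating it to the maximum length ∆ — can be sketched as below. The function name, the space separator, and character-level truncation with ∆ = 512 are our own assumptions for illustration; in practice the effective limit is imposed by the XLNet tokenizer:

```python
def build_text_sequence(nickname, profile, tweets, delta=512):
    """Concatenate a user's text fields into one long sequence S_xi
    and truncate it to the maximum length delta.

    delta = 512 is an assumed value; the real limit depends on the
    XLNet configuration used for embedding.
    """
    parts = [nickname, profile] + list(tweets)
    s_xi = " ".join(p.strip() for p in parts if p)
    return s_xi[:delta]

seq = build_text_sequence(
    nickname="user123",
    profile="just another day",
    tweets=["why is everything so hard", "cannot sleep again"],
)
```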
Figure 1: The Framework of the Proposed Method
A user ID can be used to uniquely identify a user. With a user ID, the crawler can access the user's home page and collect information from it. First of all, UserID-Crawler is constructed to collect the user IDs of depressed candidates. The API officially provided by Weibo is used to obtain as accurate information as possible. Our strategies for collecting user IDs of depressed candidates include:

(i) Collecting data from the Weibo Super Topic of "抑郁症" ("Depression" in English). The Super Topic is a social group on Sina Weibo that gathers users with common interests. It has been proved that individuals who share the same background are more likely to trust each other, and thus will gather to form aggregations [35]. According to our investigation and analysis, there are a large number of active depressed users posting under the topic of "Depression". Collecting data in this way can greatly improve the efficiency of gathering depressed user samples. Therefore, UserID-Crawler collects depressed candidates under this topic and forms a list of their user IDs.

(ii) Collecting data through the function of "微博搜索" ("Weibo Search" in English) provided by Weibo. We use high-frequency words including "抑郁症" ("Depression" in English), "自杀" ("Suicide" in English), "痛苦" ("Pain" in English) and the late-night time period (from 0:00 a.m. to 6:00 a.m.) as two main search conditions to crawl user IDs for collecting more depressed candidates.

Through the above two crawling strategies, we have collected sufficient user IDs of depressed candidates. Then, with the user ID list, UserInfo-Crawler is implemented to collect detailed user information from each user's personal homepage. The specific information fields collected by UserInfo-Crawler are shown in Fig. 2.

We divide the information for each user sample into two domains: the user domain and the tweet domain. The user domain contains the user's gender, birthday, profile (a short text of the user's self-description), the number of followers, the number of followings, and the list of tweets. Each tweet in the tweet domain contains the tweet text, the posting time, posted pictures, the number of likes, the number of forwards, the number of comments, and an identifier that indicates whether the tweet is original or not.

https://open.weibo.com/wiki/API
https://s.weibo.com/
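The late-night search condition above can be expressed as a small helper. The 0:00 a.m.-6:00 a.m. window follows the paper; treating it as the half-open interval [0:00, 6:00) and the function name itself are our own illustrative choices:

```python
from datetime import datetime

# Late-night window used as a search condition; we treat the paper's
# "0:00 a.m. to 6:00 a.m." as the half-open interval [0:00, 6:00).
LATE_NIGHT_START, LATE_NIGHT_END = 0, 6

def is_late_night(posting_time: datetime) -> bool:
    """Return True if a posting time falls in the late-night period."""
    return LATE_NIGHT_START <= posting_time.hour < LATE_NIGHT_END

assert is_late_night(datetime(2020, 3, 1, 3, 15))      # 3:15 a.m.
assert not is_late_night(datetime(2020, 3, 1, 14, 0))  # 2:00 p.m.
```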
Figure 2: The Data Structure of Candidates and WU3D (per user)

For normal candidates, we use UserID-Crawler to collect them under four Super Topics, including "日常" ("Daily" in English), "正能量" ("Positive Energy" in English), "榜姐每日话题" ("Daily Topic" in English), and "互动" ("Interaction" in English), to form a list of normal candidate IDs. Then, the more detailed user information is collected through UserInfo-Crawler to form the same data fields and structure as the depressed candidates. Based on the previous steps, we have collected 125,479 depressed candidates and 65,913 normal candidates.
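The two-domain record of Fig. 2 can be sketched as a nested structure. The field names and values below are illustrative only, not the exact keys used in the released WU3D files:

```python
# Illustrative per-user record mirroring the user domain / tweet domain
# split described above (field names are our own, not the dataset's).
user_record = {
    "nickname": "example_user",
    "gender": "female",
    "birthday": "1998-05-17",
    "profile": "a short self-description",
    "followers": 120,
    "followings": 80,
    "tweets": [  # up to 100 tweets per user
        {
            "text": "Why is depression so painful ...",
            "posting_time": "2020-03-01 02:14:00",
            "pictures": ["pic_001.jpg"],
            "likes": 3,
            "forwards": 0,
            "comments": 1,
            "is_original": True,
        },
    ],
}
```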
Automated scripts are implemented to filter out non-personal accounts, including marketing accounts, official accounts, and social bots, by identifying the user's "account type" field. The automatically filtered normal candidates are labeled as normal users directly, without further manual labeling. For depressed candidates, we invite professional data labelers to complete the labeling process. To ensure that the results are highly reliable, the labeled data has been reviewed twice by psychologists and psychiatrists. The principles of the data labeling can be described as follows:

i. Depressed candidates with a self-reported history of depression, a confirmed diagnosis, currently taking antidepressants, or recording antidepressant experiences in multiple tweets will be labeled as depressed users.

ii. If a candidate's tweets have repeatedly contained content describing psychological suffering, mental anguish, and strong suicide intention, the user will be identified as depressed.

iii. If the posted pictures of a candidate repeatedly involve or show bloodshed and self-harming content and the tweet text includes keywords such as "抑郁" ("Depression" in English) and "自残" ("Self-harming" in English), the candidate will be identified as a depressed one.

iv. Candidates who partially meet the above conditions but have too much unrelated content, such as forwarding lottery prizes, receiving red envelopes, or advertising information, will be directly discarded.

Therefore, the target dataset, i.e. WU3D, is constructed. It contains both labeled normal and depressed users. The specific information of the candidates and WU3D is given in Table 1. We counted the normal sample, the depressed sample, and the total for each of the two types. In particular, we give the detailed numbers of users, their posted tweets, and their posted pictures.

Table 1: Dataset statistics

Dataset      Category    User      Tweet       Picture
Candidates   Depressed   125,479   5,478,806   2,354,701
             Normal      65,913    4,927,904   3,631,537
             Total       191,392   10,406,710  5,986,238
WU3D         Depressed   10,325    408,797     160,481
             Normal      22,245    1,783,113   1,087,556
             Total       32,570    2,191,910   1,248,037

All of the candidates were collected from March 2020 to May 2020. A total of over 200,000 user samples were collected, including 125,479 depressed candidates and 65,913 normal candidates. After strict data filtering and labeling, the number of depressed users in WU3D reached 10,325, with a retention rate of 8.23%; the number of normal users reached 20,338, with a retention rate of 29.34%. The total user data retention rate was 15.50%.
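The automated account-type filtering described above can be sketched as follows. The account-type values ("marketing", "official", "bot", "personal") are hypothetical stand-ins, since the exact values of Weibo's account type field are not given in the paper:

```python
# Hypothetical account-type values for non-personal accounts; the real
# values of Weibo's "account type" field may differ.
NON_PERSONAL_TYPES = {"marketing", "official", "bot"}

def keep_personal_accounts(candidates):
    """Filter out marketing accounts, official accounts, and social bots."""
    return [c for c in candidates
            if c.get("account_type") not in NON_PERSONAL_TYPES]

candidates = [
    {"nickname": "alice", "account_type": "personal"},
    {"nickname": "shop_promo", "account_type": "marketing"},
    {"nickname": "news_bot", "account_type": "bot"},
]
kept = keep_personal_accounts(candidates)  # only "alice" survives
```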
Several previous studies have defined features that are quite effective for detecting depressed users, such as the proportion of late-night tweets, the proportion of original tweets, and the mean value of hue and saturation. Based on their work, we first perform feature engineering of user features in three aspects: the user text, social behavior, and posted pictures. We then summarize ten user-level features, including four that are newly proposed and two that are modified. These features are extracted using statistical approaches, including the scale, the mean value, the standard deviation, etc. In Table 2, the symbol definitions that appear in this section and subsequent sections are given.

Table 2: Variable and Function Symbol Definitions

Symbol               Description
P                    The posted tweet set of a user, including original and repost tweets.
t_p                  The posting time of a tweet.
l_e                  The emotional label of a tweet.
n_d                  The number of depression-related words in a tweet.
T                    A set of all user text information, including the nickname, the profile, and the tweet text.
π                    The posted picture set of a user.
µ = (h_µ, s_µ, v_µ)  The dominant color of a picture, a ternary containing the hue, saturation, and brightness of the HSV color space. One picture has one dominant color.
X̄                    The mean sample value of an attribute.
S_ξ                  The concatenated user long text sequence.
∆                    The max length of the long text sequence S_ξ.
C                    The function that calculates the number of elements in a set.
L                    The loss function of a neural network.
Θ                    The parameter set of a neural network.
y                    The true label of a user data sample.
f̂                    The objective function of a neural network. It inputs a user's feature vector and outputs the predicted label.
J                    The joint optimization function of a neural network.

Descriptions of these features are shown in Table 3. The features are divided into three groups, including text-based features, social behavior-based features, and picture-based features. Here, we give specific descriptions and formulas to calculate each feature.
Table 3: Manually extracted user features

Group                Feature name                               Symbol    Source
Text: Ψ              Proportion of negative emotional tweets    ψ_NP      First proposed in our work
                     Frequency of depression-related words      ψ_FDW     [1, 6, 8, 10, 11, 24, 25], modified in our work
Social behavior: Φ   Proportion of original tweets              φ_POP     [6, 11, 25]
                     Proportion of late-night posting           φ_PLNP    [6, 10, 11, 24, 25], modified in our work
                     Posting frequency (per week)               φ_PF      First proposed in our work
                     Standard deviation of posting time         φ_SDPT    First proposed in our work
Picture: Γ           Frequency of picture posting               γ_FPP     First proposed in our work
                     Proportion of cold color-styled pictures   γ_PCP     [11]
                     Standard deviation of hue                  γ_SDH     [6, 10, 11], modified in our work
                     Standard deviation of saturation           γ_SDS     [6, 10, 11], modified in our work

Proportion of negative emotional tweets. In previous works for Twitter [1, 31], by considering the number of tweets with negative emotions, researchers have achieved good results in distinguishing depressed users from normal ones. Rather than directly using the "number", we use a "proportion" calculation to normalize the feature. Although
depressive tendencies do not fully equate to the expression of negative emotions, when the proportion of tweets with negative emotions reaches a certain level, it can reflect that the user's mental state is depressed and painful, and thus can reveal a tendency of depression. We use the Text Sentiment Analysis API of the Baidu Smart Cloud Platform to label all the original tweets. The API returns three emotional labels: 0 for negative, 1 for neutral, and 2 for positive. We retain the negative emotions of label 0 and summarize labels 1 and 2 as a category of non-negative emotions. For all the original tweets under each user, we give the definition of ψ_NP in equation (1), in which C(P_o) is the total number of original tweets and C(l_e) is the total number of original tweets with negative emotions:

ψ_NP = (1 / C(P_o)) × C(l_e),  ψ_NP ∈ [0, 1]    (1)

Frequency of depression-related words. Researchers have focused on the lexical and semantic analysis of the tweet text and quantified these features by self-constructing or quoting depression-related semantic lists [1, 6, 8, 10, 11, 24, 25]. The results of the existing studies indicate that features based on high-frequency depression keywords can significantly improve the classification performance. We use "frequency" to describe how frequently depression-related words appear in a user's tweets, reflecting potential depressive tendencies. In our previous investigation and analysis on Weibo, we summarized a list of high-frequency words for depression. Here, it is used to calculate the frequency of depressive words in users' original tweets. The number of occurrences of depression-related words in each tweet, n_d, is counted by matching the keyword list. Then, ψ_FDW is calculated by:

ψ_FDW = (1 / C(P_o)) × Σ_{i=1}^{C(P_o)} n_{d_i},  ψ_FDW ∈ [0, ∞)    (2)
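Equations (1) and (2) can be computed directly from a user's original tweets. The sketch below assumes that sentiment labels have already been obtained (e.g. from the Baidu API) and that a depression keyword list is available; both toy inputs stand in for the real resources used in the paper:

```python
def psi_np(labels):
    """Proportion of negative emotional tweets (equation (1)).
    `labels` holds one sentiment label per original tweet:
    0 = negative, 1 = neutral, 2 = positive."""
    return sum(1 for l in labels if l == 0) / len(labels)

def psi_fdw(tweets, keywords):
    """Frequency of depression-related words (equation (2)):
    average number of keyword occurrences per original tweet."""
    total = sum(sum(t.count(k) for k in keywords) for t in tweets)
    return total / len(tweets)

# Toy data standing in for the API labels and the keyword list.
labels = [0, 0, 1, 2]                                   # 2 of 4 are negative
tweets = ["so tired of pain", "pain again", "nice day", "ok"]
assert psi_np(labels) == 0.5
assert psi_fdw(tweets, ["pain", "tired"]) == 0.75       # 3 hits / 4 tweets
```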
Proportion of original tweets. Several related works have proved that depressed users are more likely to post a large number of original tweets to express their negative psychological state, with relatively few repost tweets [6, 11, 25]. Therefore, we use the proportion of original tweets to distinguish between depressed users and normal users. Here we use C(P) to calculate the total number of tweets, including original tweets and repost tweets. Then, φ_POP is defined by:

φ_POP = (1 / C(P)) × C(P_o),  φ_POP ∈ [0, 1]    (3)

Proportion of late-night posting. Late night is a time when depressive symptoms attack more frequently, and thus depressed users tend to be more likely to post tweets in this period [6, 10, 11, 24, 25]. Moreover, the late-night period is the time when normal users sleep and rest. They rarely use social tools during this time and therefore send very few tweets. We use the proposed feature "Tweet Time" in Ref. [11] and make minor modifications. The time range of 0:00-6:00 is adopted as the late-night period. Moreover, all the tweets of a user are used in the calculation, including original and repost ones. Then, φ_PLNP is given by equation (4), in which C(t_p) is used to calculate the total number of tweets posted in the late-night time period from 0:00 a.m. to 6:00 a.m.:

φ_PLNP = (1 / C(P)) × C(t_p),  φ_PLNP ∈ [0, 1]    (4)

Posting frequency (per week). A previous study for Twitter [25] found that there is also a difference in posting frequency between normal and depressed users. Depressed users tend to post large numbers of tweets when they are suffering from depression and heavily rely on social media to express their painful feelings. Moreover, "week" is a moderate time size and has stronger periodicity than "month". We take the earliest posting time and the latest posting time as an interval, count the total number of tweets C(P_int) during this interval, and then divide it by 7 to get the weekly frequency value.
Thus, φ_PF can be represented by equation (5):

φ_PF = (1 / 7) × C(P_int)    (5)

Standard deviation of posting time. The posting time of depressed users tends to be concentrated in the late night, while the distribution of the posting time of normal users is more discrete within a day [6, 10, 11, 25]. Hence, we use the standard deviation to describe this phenomenon, in order to reflect the aggregation trend of users' posting time.

http://ai.baidu.com/tech/nlp/sentiment_classify
The smaller the value of this feature, the more likely the user is to post in a specific time period. Here, we consider all the original and repost tweets. The mean value of posting time, X̄_SDPT, is calculated by:

X̄_SDPT = (1 / C(P)) × Σ_{i=1}^{C(P)} t_{P_i}    (6)

Then, φ_SDPT can be defined as:

φ_SDPT = sqrt( (1 / C(P)) × Σ_{i=1}^{C(P)} (t_{P_i} − X̄_SDPT)² )    (7)

Frequency of picture posting. In existing works for Twitter and Weibo [6, 11], "tweet with pictures" is categorized into "tweet type" to measure how often users post pictures in their tweets, and has achieved good performance. Based on our prior research on Weibo, we also found that depressed users were more likely to use a lot of text to express their feelings and mental states, and thus post fewer pictures than normal users. Therefore, we propose this feature to reflect users' habit of posting pictures. C(π) represents the total number of posted pictures. Then, we calculate γ_FPP by:

γ_FPP = (1 / C(P_o)) × C(π)    (8)

Proportion of cold color-styled pictures. Studies for Twitter [6, 11] and Weibo [6, 10, 11] have shown that, compared to normal users, depressed users tend to post pictures with relatively colder colors. Therefore, we extract three hue- and saturation-related features to distinguish depressed users from normal users.

However, the warmth and coolness of a picture is a relative concept, and the human eye will give different conclusions when contrasting different colors. Lin et al. [6] proposed a range definition of cold colors by analyzing hue rings, which is used as our definition of the cold color range over h_µ (with a lower bound of 30 degrees).

For the three color-related features, we compute them using values from the Hue, Saturation, Value (HSV) color model. Similarly to the Red-Green-Blue (RGB) color space, HSV is a color space that represents the intuitive properties of colors, which is composed of hue, saturation, and lightness. Among the three attributes, "hue" refers to the category of colored light.
Different wavelengths of light give different colors and hues. Hue is measured as an angle in the range of 0 to 360 degrees: moving counterclockwise from red, the red hue is defined as 0 degrees, green as 120 degrees, and blue as 240 degrees. Saturation indicates how close a color is to the pure spectral color and usually takes a value from 0 to 1; the larger the value, the more saturated the color. After converting the RGB value of each pixel to the HSV color space, we calculate the dominant color ternary µ = (h_µ, s_µ, v_µ). The algorithm for extracting the dominant color is given in Algorithm 1.

Algorithm 1:
Dominant Color Extraction
Input: τ, all the pixels of a picture, represented in the HSV color space
Output: the dominant color pixel µ of the picture
Initialize: threshold        /* the striking pixel threshold */
Initialize: array SPArr      /* to store the striking pixels */
τ̄ ← the average of τ
for every pixel in τ do
    h_τ ← pixel[0]           // pixel = (h_τ, s_τ, v_τ)
    h_µ ← τ̄[0]               // τ̄ = (h_µ, s_µ, v_µ)
    if |h_τ − h_µ| > threshold then
        SPArr ← SPArr + pixel
end
µ ← the average of SPArr
return µ
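Algorithm 1 can be sketched in Python as follows. This is a simplified, hypothetical implementation: the fallback for pictures with no striking pixels is our own assumption, since the paper does not specify that edge case.

```python
def dominant_color(pixels, threshold=30):
    """Algorithm 1: extract the dominant color of a picture.

    pixels: list of (h, s, v) tuples in the HSV color space.
    threshold: the striking pixel threshold (the paper settles on 30).
    """
    n = len(pixels)
    # Average color of the whole picture.
    avg = tuple(sum(p[i] for p in pixels) / n for i in range(3))
    h_mu = avg[0]
    # Striking pixels: hue differs from the mean hue by more than threshold.
    sp = [p for p in pixels if abs(p[0] - h_mu) > threshold]
    if not sp:        # assumption: fall back to the picture average
        return avg
    return tuple(sum(p[i] for p in sp) / len(sp) for i in range(3))
```

For a picture dominated by one hue with a single contrasting region, the returned ternary is the average of the contrasting (striking) pixels.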
The dominant color is the most attractive and visually dominant color in a picture. Thus, we introduce the striking pixel (SP) to represent these colors. The SP plays an important role in the intuitive perception of the entire picture and is usually identified by the absolute difference between a specific pixel's hue and the average hue of the entire picture. The algorithm first takes as input a picture with all pixels represented in the HSV color space. It initializes a manually assigned threshold and an array SPArr to store striking pixels. Then, the algorithm calculates the average color (h_µ, s_µ, v_µ) of the picture and iterates through each pixel, comparing the absolute difference between its hue value and h_µ. If the difference is greater than the threshold, the currently iterated pixel is marked as an SP. Finally, by averaging the SP array, the dominant color ternary µ is obtained. Several rounds of tests have been run to choose the best value of the threshold (here set to 30).

We count the total number of posted pictures with h_µ ∈ (30, …) and s_µ < … as C(π_cold). Then, γ_PCP is calculated by:

$$\gamma_{PCP} = \frac{1}{C(\pi)} \times C(\pi_{cold}), \qquad \gamma_{PCP} \in [0, 1] \qquad (9)$$

Standard deviation of hue and Standard deviation of saturation. These two features reflect the fluctuation of the user's picture colors. Previous works used the mean values of hue and saturation as picture features and achieved good results on Twitter [6, 11] and Weibo [6, 10, 11]. In our research, we found that the hue of depressed users' pictures is more concentrated in colder ranges and the saturation is relatively low. On the contrary, the hue and saturation distributions of normal users are more dispersed and even. We take the hue value h_µ and the saturation value s_µ of each picture's dominant color and calculate their mean values X_SDH and X_SDS by:

$$X_{SDH} = \frac{1}{C(\pi)} \sum_{i=1}^{C(\pi)} h_{\mu i} \qquad (10)$$

$$X_{SDS} = \frac{1}{C(\pi)} \sum_{i=1}^{C(\pi)} s_{\mu i} \qquad (11)$$

Then, γ_SDH and γ_SDS can be defined using the following equations:

$$\gamma_{SDH} = \sqrt{\frac{1}{C(\pi)} \sum_{i=1}^{C(\pi)} \left(h_{\mu i} - X_{SDH}\right)^2} \qquad (12)$$

$$\gamma_{SDS} = \sqrt{\frac{1}{C(\pi)} \sum_{i=1}^{C(\pi)} \left(s_{\mu i} - X_{SDS}\right)^2} \qquad (13)$$

Algorithm 2 gives the approach to construct the user text sequence. Considering that the user nickname and profile can also reflect the user's current emotional state, they are also concatenated to the tweet text.
Algorithm 2: User Long Text Sequence Construction
Input: T, a collection of the user's nickname, profile, and all tweets' text
Output: the concatenated long text sequence S_ξ
Initialize: an empty string S_ξ
Initialize: the max length of the text sequence ∆
for text in T do
    if the length of S_ξ > ∆ then
        break
    if text belongs to an original post then
        S_ξ ← S_ξ + text
    else if text belongs to a repost then
        if text = "Repost" then
            continue         /* ignore the default repost reason */
        else
            S_ξ ← S_ξ + text
    else                     // user nickname or profile
        S_ξ ← S_ξ + text
end
return S_ξ

The algorithm first takes all the text information of the user (defined as T) as input and constructs the concatenated user long text sequence S_ξ by traversing T. After entering the loop, the algorithm first determines whether the current length of the concatenated text sequence is greater than the maximum length ∆; the algorithm ends if this condition is satisfied. Then, it concatenates the user's original tweet text in chronological order from the latest to the earliest. Moreover, when a user reposts a tweet, Weibo asks the user to fill in a reason for the repost. If the user does not fill in a reason, the text "转发微博" ("Repost" in English) is automatically added by default. This default repost reason is not retained in the text sequence S_ξ, since it does not express any opinions or feelings.

To effectively vectorize the text sequence constructed above and apply this feature to the classification algorithm, the characteristics of this long text sequence are further discussed. First, the sequence is strongly contextually linked. This link exists not only within a single tweet but also among the contexts of multiple tweets. For example, a user may post multiple tweets at different times about depression diagnoses, depression onset, medication treatment, and inner distress. Integrating these pieces of information is usually the key to judging whether a user is depressed. Secondly, under real circumstances, not all tweets describe depression-related content, even for truly depressed users.
That is, capturing text semantics such as "the user states that he has been diagnosed with depression" and "the user expresses a strong inclination toward suicide" is critical for detecting depression using the long text sequence S_ξ.

Considering these aspects, several state-of-the-art word embedding algorithms are discussed here. The Transformer [36] is a model that replaces the recurrent neural network with the attention mechanism. It calculates the weight of each unit in a long sequence to effectively capture important semantic information. Moreover, BERT [33] is a recently proposed bidirectional encoder. However, due to the limitations of the "Position Embedding" structure in BERT (including its derivative models ALBERT and RoBERTa), the maximum sequence length ∆ for a single pass is restricted to 512 units. Furthermore, the truncation or batch processing of long text sequences used in Ref. [32] significantly increases the time complexity of processing, which makes it unsuitable for timely depression detection. Therefore, the ideal word embedding model must be able to process long text efficiently and accurately.

A novel language model, XLNet [34], was then proposed by Yang et al. Since it combines the characteristics of auto-regressive and auto-encoding language models, XLNet resolves the problem that BERT ignores the relationship between masked positions, and it can process longer text sequences. In this paper, XLNet-Chinese-base (https://github.com/ymcui/Chinese-XLNet)
is used as the upstream word embedding model, and a multitask-based DNN classifier, FusionNet, is implemented to handle the downstream tasks.
Figure 3: The Structure of our Proposed FusionNet (FN)

Multitask learning is an integrated learning strategy that synchronizes model training by letting multiple tasks share common network structures and weights. Based on multitask learning, we construct a DNN classifier with Bi-GRU with attention as its main structure. As shown in Fig. 3, the classification task on the word vectors obtained from the upstream embedding model XLNet (Task-1) and the classification task on the manually extracted statistical features (Task-2) are treated as two classification tasks for detecting depressed users. Loss functions L_1 and L_2 with different weights ω_1 and ω_2 are manually defined to simultaneously train and optimize the network.

Firstly, the user text sequence S_ξ is embedded by XLNet, and the output of the last hidden layer is connected to a layer normalization (LN) [37] layer. Then, the LN layer is connected to the Bi-GRU layer with attention to capture the key information and reduce the dimensionality of the word vector.

For Task-1, this one-dimensional word feature is passed through a Fully Connected (FC) layer, a Dropout layer, and a Softmax layer to directly output classification results. We set an auxiliary loss function L_1 for network optimization in Task-1 to help accelerate its convergence. For Task-2, the word feature is concatenated with the manually extracted statistical feature input. The statistical feature groups [Ψ, Φ, Γ] are regularized by a Batch Normalization (BN) layer [38]. Moreover, the fused feature vector is passed through multiple FC layers with activation functions, activating the hidden layers' outputs to further improve the fitting capability of the network. Finally, the network is connected to the Softmax layer, and the classification result is given by the main output. The main loss function L_2 is used to optimize the whole FusionNet network.

We define the weight parameter set of the Task-1 network as Θ_aux and its objective function as f̂_1. The global weight parameter set of the whole network is represented by Θ_g, with objective function f̂_2, so that Θ_aux ⊆ Θ_g. Adopting the multitask learning strategy, the joint optimization function J can be described as:

$$J_1 = \sum_{i=1}^{C(U)} L_1\left(y_i, \hat{f}_1(S_{\xi i}), \Theta_{aux}\right), \qquad J_2 = \sum_{i=1}^{C(U)} L_2\left(y_i, \hat{f}_2(\Psi_i, \Phi_i, \Gamma_i), \Theta_g\right) \qquad (14)$$

$$J(\Theta_{aux}, \Theta_g) = \omega_1 \times J_1 + \omega_2 \times J_2 \qquad (15)$$

In equation (14), y_i represents the true label (normal or depressed) of a specific user sample i; f̂_1 and f̂_2 both output the predicted label of user sample i. In equation (15), ω_1 and ω_2 are the manually assigned weights of the loss functions L_1 and L_2.

The manually extracted features and several undetermined parameters will be evaluated in the following section. Since the original user data obtained by the crawler contains irrelevant information, to minimize experimental bias and improve the efficiency of model training, we removed all non-text content from the tweets.
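As a toy numeric sketch of the joint optimization in Eqs. (14)-(15), two per-sample categorical cross-entropy losses are summed into sub-objectives and combined with manually assigned weights. The clipping constant and the example weights below are our assumptions, not the paper's trained settings.

```python
import math

def cross_entropy(y_true, p_pred):
    """Categorical cross-entropy for one sample (one-hot y_true)."""
    return -sum(t * math.log(max(p, 1e-12))   # clip to avoid log(0)
                for t, p in zip(y_true, p_pred))

def joint_loss(aux_losses, main_losses, w1, w2):
    """Eq. (15): J = w1 * J1 + w2 * J2, where J1 and J2 are the
    per-sample losses of Task-1 and Task-2 summed over all users (Eq. (14))."""
    return w1 * sum(aux_losses) + w2 * sum(main_losses)
```

In practice, frameworks such as TensorFlow implement this by attaching one loss per output head and a pair of loss weights, so both heads are optimized in a single backward pass over the shared layers.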
In this part, WU3D is divided into four subsets: D_1, D_2, D_3, and D_4. All of the subsets are sampled using a fixed random seed without overlap. Among them, D_1 is used for DNN model training and for the 10-fold cross-validation of the TML classifiers. Furthermore, D_2 is used as a fixed dataset for validation in each round of neural network training. Finally, we test the models' performance on D_3 and report the evaluation metrics. As a supplementary dataset, D_4 contains only 325 depressed users and 12,245 normal users, and it is used only in the last experiment on unbalanced training samples. Statistics of the sliced datasets are given in Table 4:

Table 4: Dataset Slicing Statistics

Dataset | Depressed: User / Tweet / Picture | Normal: User / Tweet / Picture
WU3D    | 10325 / 408797 / 160481           | 22245 / 1783113 / 1087556
D_1     | … / … / …                         | … / … / …
D_2     | … / … / …                         | … / … / …
D_3     | … / … / …                         | … / … / …
D_4     | 325 / 13426 / 5183                | 12245 / 993566 / 641077
The experimental metrics used in this section are mainly supervised machine learning metrics. True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN) are commonly used to describe the number of classes predicted by models in classification tasks. Here, TP is the number of depressed users correctly predicted, TN is the number of normal users correctly predicted, FP is the number of normal users incorrectly predicted as depressed, and FN is the number of depressed users incorrectly predicted as normal. Based on these four counts, we can further define the advanced metrics by:

$$Accuracy = \frac{TP + TN}{TP + TN + FP + FN} \qquad (16)$$

$$Precision = \frac{TP}{TP + FP} \qquad (17)$$

$$Recall = \frac{TP}{TP + FN} \qquad (18)$$

$$F1\text{-}Score = \frac{2 \times Precision \times Recall}{Precision + Recall} \qquad (19)$$

Moreover, the Receiver Operating Characteristic (ROC) curve, which plots the True Positive Rate (TPR) against the False Positive Rate (FPR), can be used to visually reflect classifier performance. Furthermore, in the experiment on the statistical features, we introduce the Mann-Whitney U test and the cumulative distribution function (CDF) curve as evaluation metrics.
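For concreteness, Eqs. (16)-(19) can be computed from the four confusion counts as follows (a straightforward sketch; the function name is ours):

```python
def classification_metrics(tp, tn, fp, fn):
    """Eqs. (16)-(19), with depressed users as the positive class."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1
```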
We complete all the experiments in this section on a workstation with an Intel Xeon Silver 4212 CPU, an NVIDIA TITAN RTX GPU with 24 GB of graphics memory, and 32 GB of RAM. The programming-related settings used in the experiments are Python v3.7.5, Anaconda v4.8.3, TensorFlow v2.1.0, and Scikit-learn v0.23.1. (ii) Baseline statistical feature classifiers.
For the statistical feature classification task, we select several popular TML models from existing studies to demonstrate the effectiveness of the ten concluded features. • LR:
Logistic Regression is a commonly used linear model [19] and has good classification performance. • NB:
A Naive Bayesian classifier is a simple probabilistic classifier based on Bayes' theorem. Its implementation is relatively simple, and it is often used in related works [6, 7, 10, 20, 21, 25]. • SVM:
Support Vector Machine classifiers apply the principle of structural risk minimization to the field of classification and are the most used classifiers in previous studies [1, 6, 8, 20, 21, 23, 24, 27, 31]. In our work, we evaluate different kernels of the SVM, including the linear, polynomial, and radial basis function (RBF) kernels. • RF:
Random Forest is an algorithm that integrates multiple decision trees through ensemble learning and is also widely used in related works [6, 7, 23, 24]. The basic unit of RF is the decision tree. • AB:
Adaptive Boosting is an ensemble learning algorithm that combines multiple simple classifiers [19]. • GBDT:
Gradient Boosting Decision Tree is a classification model that uses an integrated additive model to continuously reduce the training residuals. GBDT is one of the TML algorithms with excellent generalization ability; however, to the best of our knowledge, no existing work has used GBDT as a classification model for this task. • BP:
The Back Propagation (BP) network is extracted from the main output part of our proposed FusionNetwith the same parameter settings. To be specific, BP is composed of “FC+Dropout+FC+Softmax”. (iii) Baseline word vector classification networks.
For the word vector classifiers, we use several popular neural network structures as the main structures, appended with FC layers and a Softmax layer to output the classification label. • CNN-1D:
One-dimensional convolutional neural networks are widely used in natural language processing and have achieved good performance in the task of depression detection [6, 19, 22, 23, 27]. • Bi-LSTM:
The bidirectional LSTM network splices two directional LSTMs together and is more capable of handling time series data [27, 31, 32]. • Bi-GRU:
Similarly, the bidirectional GRU splices two directional GRU networks together and is similar to Bi-LSTM in its ability to handle time series data [12, 31]. • TCN:
The temporal convolutional network is a new algorithm for processing time series that reduces the serial processing complexity of RNNs [39]. • Attention:
The attention mechanism was proposed by Vaswani et al. [36] and can quickly filter out high-value information from large amounts of information. Attention is popular in many fields such as machine translation and speech recognition. • Bi-GRU with attention:
It is extracted from our proposed FusionNet with the same parameter settings. • Bi-GRU with GAP:
Global Average Pooling (GAP) [40] is used to replace the attention layer to reduce the dimensionality of the Bi-GRU output, so as to compare the performance of these two similar structures.

For the baseline statistical feature classifiers and word vector classification networks, we ran a series of pre-experiments on each classifier and selected the structures and parameters with the best classification performance. Each classifier is referred to by its main structure's symbol (e.g., Bi-GRU-based classifiers are referred to as Bi-GRU for short). API default parameters are used for both TML and DL classifiers where not specifically described here. Moreover, separate experiments for the neural network structures of BP and Bi-GRU with attention are set up to further demonstrate the superiority of FusionNet, which uses multitask learning to merge these two DNN structures.

The loss functions and callback settings for neural network training are given in Table 5:

Table 5: Neural Network Training Setup
Item | Setup
Batch size | 32
Epoch | 80
Early Stopping | monitor='val_acc', patience=10
Check Point | monitor='val_acc', mode='max'
FN loss functions L_1, L_2 | Categorical Crossentropy
FN optimizer (for L_1) | NAdam (init lr = 3e-4)
FN optimizer (for L_2) | NAdam (init lr = 1e-3)
FN [ω_1, ω_2] | [0.…, 0.…]

In this part, we perform a non-parametric Mann-Whitney U test on each manually extracted feature. The result is shown in Table 6. Since the Mann-Whitney U-value and the Wilcoxon W-value can be transformed into each other, only the U-values are reported in the table.

Table 6: Mann-Whitney U Test Results
Symbol | Mann-Whitney U | Significance | Decision
ψ_NP   | … | p < 0.001 | Reject H_0
ψ_FDW  | … | p < 0.001 | Reject H_0
φ_POP  | … | p < 0.001 | Reject H_0
φ_PLNP | … | p < 0.001 | Reject H_0
φ_PF   | … | p < 0.001 | Reject H_0
φ_SDPT | … | p < 0.001 | Reject H_0
γ_FPP  | … | p < 0.001 | Reject H_0
γ_PCP  | … | p < 0.001 | Reject H_0
γ_SDH  | … | p < 0.001 | Reject H_0
γ_SDS  | … | p < 0.001 | Reject H_0

In the experiment, the default null hypothesis H_0 is set to "the distribution of this feature is the same for normal users and depressed ones". At the 95% confidence level, p < 0.05 rejects the null hypothesis, i.e., admits a significant difference between normal and depressed users. The results show that the p-value of each feature is less than 0.001 in the two-sided Mann-Whitney test. Thus, all the features pass the test and show significant differences in distribution between the two types of users.

We also evaluate the significance of the features by comparing the feature distribution curves of normal and depressed users. The CDF curve for each feature is plotted in Fig. 4. Since there are no quantitative parameters for the CDF curve to describe the results, we only evaluate the degree of coincidence between the two types of user curves. Among the features, ψ_NP has the highest distinction between the two types of users and the most significant difference in distribution. This further demonstrates that text-based features play an important role in identifying depressed users on social networks. For our proposed features shown in Fig. 4(a), (e), (f), and (g), the curves of the two types of users show obvious separation and different trends, indicating significant differences in these features between the two types of users.
[Figure 4 panels, each plotting the cumulative proportion for normal vs. depressed users: (a) ψ_NP, proportion of negative emotional tweets; (b) ψ_FDW, frequency of depression-related words; (c) φ_POP, proportion of original tweets; (d) φ_PLNP, proportion of late night posting; (e) φ_PF, posting frequency (per week); (f) φ_SDPT, standard deviation of posting time; (g) γ_FPP, frequency of picture posting; (h) γ_PCP, proportion of cold color-styled pictures; (i) γ_SDH, standard deviation of hue; (j) γ_SDS, standard deviation of saturation]
Figure 4: The CDF Curves of Ten Statistical Features
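For reference, the cumulative-proportion curves in Fig. 4 correspond to empirical CDFs of each feature over the two user groups; a minimal sketch (the function name is ours):

```python
def empirical_cdf(values):
    """Return sorted (value, cumulative proportion) pairs for one
    feature sampled over one user group."""
    xs = sorted(values)
    n = len(xs)
    return [(x, (i + 1) / n) for i, x in enumerate(xs)]
```

Plotting the pairs for the normal and depressed groups on the same axes reproduces the kind of curve comparison used above: the less the two step curves coincide, the more discriminative the feature.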
In this part, the baseline statistical feature classifiers are used to evaluate the contribution of different feature groups. We perform experiments on different combinations of features to determine their contribution to the classification task. The result is shown in Fig. 5.

The experimental results demonstrate that the classification performance of the feature groups in each classifier is consistent with the results of our statistical tests, in which the text-based feature group Ψ contributes the most. The BP classifier already achieves a high F1-Score of 0.9431 when only text-based features are used. The contribution of the picture-based feature group Γ is relatively poor, with a highest F1-Score of only 0.7514 using the GBDT classifier. However, under different combinations of features, the performance improves to different degrees for each group and each classifier. In particular, with the combination of all the feature groups (Ψ + Φ + Γ), the GBDT classifier reports the highest F1-Score of 0.9465. GBDT and BP both achieved the highest performance metrics in several rounds, benefiting from gradient-descent-based optimization methods.

Therefore, it is concluded that all three feature groups can positively improve the performance of classification tasks, with the text-based features contributing the most.
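The feature-group combinations evaluated in this experiment can be enumerated as follows (the ASCII group names are stand-ins for the paper's symbols Ψ, Φ, Γ; the per-combination training and scoring are omitted):

```python
from itertools import combinations

def feature_group_combinations(groups=("Psi", "Phi", "Gamma")):
    """All non-empty combinations of the three feature groups,
    as compared in Fig. 5 (7 combinations in total)."""
    combos = []
    for r in range(1, len(groups) + 1):
        combos.extend(combinations(groups, r))
    return combos
```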
The user text information consists of the user nickname, profile, and tweet text, which are concatenated into a long text sequence using Algorithm 2. However, due to the uncertainty of the total number of tweets and the number of words in each tweet, it is necessary to explore the most effective embedding length for the text sequence. In the meantime, since the pretrained Chinese XLNet has 12 network layers, a hidden size of 768, and 117M parameters in total, it takes considerable time to run the network and extract word vectors. Therefore, in this section, we run multiple experiments with several different values of the text sequence length. We record the
[Figure 5 panels, each plotting Accuracy, F1-Score, Precision, Recall, and AUC per feature-group combination: (a) LR, (b) NB, (c) SVM-linear, (d) SVM-poly, (e) SVM-rbf, (f) RF, (g) AB, (h) GBDT, (i) BP]
Figure 5: The Contribution of Different Feature Group Combinations

time consumption of extracting word vectors using XLNet and the F1-Score of each classifier to explore the appropriate text sequence length, which should have a reasonable embedding time and a relatively high F1-Score.

According to our statistics, a user's tweet text is generally longer than 32 Chinese characters, and a short text length may result in premature truncation of the tweet text. Thus, our experiment starts with sequence length ∆ = 64 and increases its value gradually. Finally, we select six text sequence lengths for the experiments, with ∆ ∈ {64, 128, 256, 512, 1024, 2048}. Then, we record the time consumption of embedding word vectors using XLNet and the classification F1-Score under the different text sequence lengths.

We take the embedding time at ∆ = 64 as a base and normalize the remaining groups to obtain relative values. Figure 6(a) shows the relative time consumption of XLNet at each value of the text sequence length ∆. The result demonstrates that, as the text sequence length increases, the curve shows a nonlinear trend and the slope keeps growing. Specifically, the relative time at ∆ = 512 is 11.81, at ∆ = 1024 it is 25.31, and at ∆ = 2048 it is 77.52; the time consumption increases by 114.31% and 206.28%, respectively. Therefore, when ∆ increases by a binary exponential power, the embedding time increases at a significantly faster rate.
Figure 6: The Selection of Text Sequence Length ((a) ∆ vs. relative time consumption; (b) ∆ vs. F1-Score of each word vector classifier)

Moreover, the baseline word vector classification networks are used to reflect the classification performance under different values of ∆. Figure 6(b) shows the F1-Score of each classifier under different ∆ values. When ∆ increases from 64 to 1024, the F1-Score of each classifier increases significantly; when ∆ increases from 1024 to 2048, the F1-Score barely increases. Although the performance of a single attention structure is relatively poor, adding it to other structures further improves classification performance. Thus, Bi-GRU with attention best captures the key information in long text sequences among the tested word vector classifiers. It achieves the highest classification performance and is therefore used in our proposed FusionNet.

Combined with the previous experiment on ∆ and the word vector extraction time, it is concluded that ∆ = 1024 reaches the performance bottleneck of this classification task while keeping the word embedding time acceptable. Therefore, in subsequent experiments involving word vectors, we use ∆ = 1024 as the default text sequence embedding length.

To compare other classifiers with FusionNet, in this section the output of Bi-GRU with attention is extracted as the word feature, which is concatenated with the statistical features. Then, this integrated feature vector is input into the baseline statistical feature classifiers to accomplish the classification task of depressed and normal users. Thus, multimodel classifiers are constructed from both the word vector classification network and the baseline statistical feature classifiers. Table 7 gives detailed metrics of all the target classifiers, while Fig. 7 visualizes these metrics.

Table 7: Classification Performance of Target Classifiers
Classifier    | Accuracy | F1-Score | Precision | Recall
LR            | 0.9660   | 0.9655   | 0.9813    | 0.9502
NB            | 0.9555   | 0.9544   | 0.9779    | 0.9321
SVM-linear    | 0.9623   | 0.9616   | 0.9796    | 0.9442
SVM-poly      | 0.9562   | 0.9560   | 0.9604    | 0.9517
SVM-rbf       | 0.9600   | 0.9593   | 0.9773    | 0.9419
RF            | 0.9604   | 0.9597   | 0.9766    | 0.9434
AB            | 0.9649   | 0.9643   | 0.9820    | 0.9472
GBDT          | 0.9653   | 0.9647   | 0.9805    | 0.9494
FN (Proposed) | 0.9775   | 0.9772   | 0.9908    | 0.9639
In the performance experiment of the target classifiers, each classifier reaches an F1-Score above 0.95. In particular, our proposed FusionNet achieves the highest F1-Score of 0.9772, and it also obtains the highest value under all the other metrics. Compared to the second-highest classifier, LR, FusionNet improves the F1-Score by 1.21%. Using transfer learning, the word feature is extracted as an input to the baseline statistical feature classifiers. However, transferring
Figure 7: Performance of Target Classifiers ((a) Accuracy, (b) F1-Score, (c) Precision, (d) Recall)

features through different classifiers may lead to a loss of information. Since multitask learning enables different tasks to share the same network structure and weights, it significantly reduces the information loss caused by transfer learning and thus performs better.

Furthermore, the ROC curves are given in Fig. 8. We obtain the classification probability values output by the models and plot the ROC curves by sampling the FPR and TPR at multiple points. The plot shows that all the curves are close to the upper left corner, which indicates that all of the classifiers perform excellently. The curve of FusionNet is closest to the upper left corner, achieving the best classification performance.
In the previous experiments, we used depressed and normal users in equal proportion (50%) in datasets D_1, D_2, and D_3. However, in the real OSN environment, depressed users make up only a minority of the whole user community. Due to the difficulty of collecting depressed user data, it is hard to guarantee a balanced data proportion during classifier training and optimization. Therefore, by changing the proportion of depressed user samples (denoted as ρ), we analyze the F1-Score fluctuations of the target classifiers to evaluate their robustness to unbalanced training samples. Each classifier is treated as a group, and for each group we test nine values of ρ from 0.1 to 0.9, with an interval of 0.1.
[Figure 8 legend (TPR vs. FPR): FN (AUC = 0.9775), LR (AUC = 0.9660), GBDT (AUC = 0.9653), AB (AUC = 0.9649), SVM-linear (AUC = 0.9623), RF (AUC = 0.9604), SVM-rbf (AUC = 0.9600), SVM-poly (AUC = 0.9562), NB (AUC = 0.9555)]
Figure 8: ROC Curves of the Target Classifiers

Here, we introduce a new metric, the Intra-group F1-Score Variance (IFV), to calculate the variance of the F1-Score within each group. First, for each group, the mean F1-Score is calculated and represented by X_IF. The number of ρ values taken in each group is denoted as T. The IFV metric is then defined as:

$$IFV = \sqrt{\frac{1}{T} \sum_{i=1}^{T} \left(F_i - X_{IF}\right)^2} \qquad (20)$$

Table 8: IFV of Target Classifiers

Classifier | Intra-group F1-Score Variance (IFV)
LR         | 3.30e-5
NB         | 1.82e-5
SVM-linear | 4.60e-5
SVM-poly   | 5.89e-5
SVM-rbf    | 2.59e-5
RF         | 2.80e-5
AB         | 1.80e-5
GBDT       | 2.92e-5
FN (Proposed) | 1.01e-5
Fig. 9(a) shows the F1-Score of each classifier under different ρ values. The experimental results show that when the proportions of depressed and normal user samples are close to balanced, the classifiers tend to achieve higher F1-Scores.

Table 8 and Fig. 9(b) show the IFV metric of the target classifiers. Although LR achieved a high F1-Score in the target classifier experiment, it has relatively low robustness to unbalanced data due to the relatively poor decision ability of a single classifier. With the kernel learning strategy, the two types of SVM classifiers fit the data better, but their F1-Scores also fluctuate noticeably when the training samples are unbalanced. Ensemble classifiers, including RF and GBDT, obtain better robustness in the experiment. The IFV metric of our proposed FusionNet reaches the minimum value among the target classifiers, indicating that FusionNet has the best robustness, i.e., the most stable classification performance.

In addition to the advantages of the multitask learning strategy mentioned in the previous part, we believe that the adaptive learning rate of NAdam also helps FusionNet find the global optimal solution more quickly, even when the two classes of training data are not balanced.
Figure 9: Unbalanced Training Samples. (a) F1-Score under Different ρ Values; (b) IFV of the Target Classifiers.
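The unbalanced-training experiment can be sketched as follows, under the assumption that ρ denotes the ratio of depressed to normal samples in the training set; `make_unbalanced_split` and the toy data are hypothetical:

```python
import random

def make_unbalanced_split(depressed, normal, rho, seed=0):
    """Subsample the depressed class so the training set contains roughly
    rho depressed samples per normal sample (illustrative sketch)."""
    rng = random.Random(seed)
    k = min(len(depressed), int(rho * len(normal)))
    return rng.sample(depressed, k) + list(normal)

# Toy data: 100 depressed and 300 normal user samples
depressed = [("depressed", i) for i in range(100)]
normal = [("normal", i) for i in range(300)]
train = make_unbalanced_split(depressed, normal, rho=0.2)
# -> 60 depressed samples plus 300 normal samples
```

Sweeping ρ over a grid and recording the F1-Score at each setting yields the per-classifier groups from which the IFV values in Table 8 are computed.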
Conclusion

In this work, we proposed a multitask learning-based approach to detect depressed users on Sina Weibo.

First, through data collection, script filtering, and manual labeling, we built and published a large Weibo user depression detection dataset, WU3D. It contains over 30,000 user samples, each with rich information fields, and should be sufficient for subsequent researchers to conduct further research.

Secondly, we summarized and manually extracted ten statistical features covering text, social behavior, and picture-based information. The experimental results showed that all of them exhibit varying degrees of distribution difference between normal and depressed users and can therefore contribute positively to classification. Our experiments also demonstrated that the feature engineering of text information is the most vital part of depression detection on OSNs.

Furthermore, we evaluated the pretrained model XLNet as the embedding model for the downstream classification task. The results showed that, when an appropriate embedding length is selected, XLNet offers excellent performance and efficiency in handling long text sequences.

Finally, we implemented a multitask learning DNN classifier, FusionNet, to simultaneously handle the word-vector classification task and the statistical-feature classification task. Benefiting from the strategic advantages of multitask learning, FusionNet reduces the loss of feature information caused by transfer learning. Compared with the models commonly used in existing work, FusionNet achieved a significant performance improvement, with an F1-Score of 0.9772, and showed the best classification robustness when the training samples are unbalanced. It thus proves to be an ideal model for handling multiple classification tasks at the same time.

For future work, two directions will be explored. (i) The size of the dataset will be further expanded; larger datasets will be constructed for training and evaluating classifiers to achieve better generalization performance. (ii) The characteristics and behavior patterns of depressed users will be further analyzed, and we will propose more effective feature solutions for user-level depression detection on OSNs.