A Multitask Deep Learning Approach for User Depression Detection on Sina Weibo
Yiding Wang, Zhenyi Wang, Chenghao Li, Yilin Zhang, Haizhou Wang
Yiding Wang
College of Cybersecurity, Sichuan University, Chengdu, China 610207
[email protected]

Zhenyi Wang
College of Cybersecurity, Sichuan University, Chengdu, China 610207
[email protected]

Chenghao Li
College of Cybersecurity, Sichuan University, Chengdu, China 610207

Yilin Zhang
College of Cybersecurity, Sichuan University, Chengdu, China 610207

Haizhou Wang ∗
College of Cybersecurity, Sichuan University, Chengdu, China 610207
[email protected]
August 31, 2020

Abstract
In recent years, due to the mental burden of depression, the number of people who endanger their lives has been increasing rapidly. The online social network (OSN) provides researchers with another perspective for detecting individuals suffering from depression. However, existing studies of depression detection based on machine learning still leave relatively low classification performance, suggesting that there is significant potential for improvement in their feature engineering. In this paper, we manually build a large dataset on Sina Weibo (a leading OSN with the largest number of active users in the Chinese community), namely the Weibo User Depression Detection Dataset (WU3D). It includes more than 20,000 normal users and more than 10,000 depressed users, both of which are manually labeled and rechecked by professionals. By analyzing the user's text, social behavior, and posted pictures, ten statistical features are concluded and proposed. In the meantime, text-based word features are extracted using the popular pretrained model XLNet. Moreover, a novel deep neural network classification model, i.e. FusionNet (FN), is proposed and simultaneously trained with the above-extracted features, which are treated as multiple classification tasks. The experimental results show that FusionNet achieves the highest F1-Score of 0.9772 on the test dataset. Compared to existing studies, our proposed method has better classification performance and robustness for unbalanced training samples. Our work also provides a new way to detect depression on other OSN platforms.
Keywords
Depression detection · Online social network · Feature engineering · Deep neural network · Multitask learning
With the rapid development of online social networks (OSNs) such as Twitter and Facebook, people are more frequently using the OSN to express opinions and emotions. It provides researchers with a novel and effective way to detect the mood, communication, activity, and social behavior patterns of individuals [1].

∗ Corresponding author: H. Wang ([email protected])

In the past decade, researchers
in various fields have conducted quantitative analyses of different illnesses and mental disorders based on the OSN platform [2–8]. Sina Weibo (hereinafter referred to as "Weibo") is the most popular OSN in the Chinese community [9]. A statistic shows the number of Weibo's monthly active users reached more than 480 million in the second quarter of 2019.

Major depressive disorder, referred to as depression, is a common mental disease. According to a survey of the World Health Organization (WHO), more than 300 million people worldwide suffer from depression. Depression can cause great psychological pain, even suicidal tendencies. Moreover, evidence from a health action plan of WHO shows that people suffering from depression are much more likely to end their life prematurely than the general population. Despite the current availability of psychotherapy, medical therapy, and other modalities for the treatment of depression, 76%-85% of patients in low- and middle-income countries remain untreated. This phenomenon arises not only from the lack of medical resources but also from the inability to make an accurate assessment in the early stage of depression, which leaves a large number of people with depression unable to get timely diagnosis and treatment [10].

Pictures, text, videos, and other information posted on the OSN can reflect feelings of worthlessness, guilt, helplessness, and self-hatred, which can help researchers to specifically analyze and characterize depressed individuals [1, 10–12]. However, there are some insurmountable problems in online depression detection using traditional analyzing methods. They often focus on analyzing the characteristics of users with depression rather than constructing predictive models. Therefore, it is difficult to give timely prediction results for new depressed users. Moreover, they are incapable of dealing with a large amount of instant interactive user data.

With the rapid development of artificial intelligence technologies, machine learning approaches have made great contributions to the detection of depression [13–17]. An automated depression detection model based on machine learning usually needs to analyze various information such as tweets, pictures, videos, and social activity data of users. Then, it gives the classification results of the predicted objects, most of which are presented as a binary result of normal or depressive. If an individual is predicted to have a potential depressive tendency, further resources and assistance can be provided, including later medical and psychological diagnoses. Such heuristic learning approaches are quite effective for helping in the early detection of depression [18] since they are capable of handling a large amount of instant interactive user data. However, current approaches to online depression detection still face many unresolved challenges.

Firstly, many current studies are not user-oriented modeling [19–21]. Those works usually aim to analyze and model the language style of the user. Through sentiment analysis and feature engineering of the tweet text, a classification model is developed to detect whether a specific tweet has a depressive tendency. These works analyzed fine-grained features and achieved pretty good results. However, such results cannot be directly applied to user-level depression detection, or they may lead to incorrect predictions.

Second, in several existing studies [1, 19, 22–24], the size of the dataset used for modeling is insufficient, with only a few hundred to a few thousand data samples being used. Because of the difficulty of accurately obtaining and labeling depressed samples, researchers usually choose to construct small datasets or directly cite datasets from other works. As a consequence, the trained model fails to reach good generalization performance and is thus hard-pressed to accurately predict depressed users on the OSN.

Moreover, not enough studies of user depression detection have been proposed on Weibo compared to Twitter and Facebook. To the best of our knowledge, there is no published large Weibo user depression detection dataset available currently.

Finally, many of the existing proposed models still do not reach a high level of classification performance, i.e. an F1-Score of 90% and above. Thus, these models need to be further improved to achieve better performance.
Given the above problems and challenges, we hereby summarize the contributions of our work as below:

• We build and publish a large-scale labeled dataset - the Weibo User Depression Detection Dataset (WU3D). WU3D includes more than 10,000 depressed users and more than 20,000 normal users, each of which contains enriched information fields, including tweets, the posting time, posted pictures, the user gender, etc. This dataset is labeled and further reviewed by professionals.

• We summarize ten features of depressed users, four of which are proposed for the first time. Different from some existing work that directly uses the information fields as features, we made statistical analyses of all the proposed features. These features show significant distribution differences between depressed and normal users in our experiments.

• We construct a Deep Neural Network (DNN) classification model, i.e. FusionNet. It implements a multitask learning strategy to process text-based word vectors and statistical features simultaneously. Experimental results show that it achieves both the highest classification performance and the best robustness to unbalanced training samples.

The subsequent sections of this paper are organized as follows. In Section II, related work and achievements in the field of depression detection on OSNs are introduced and analyzed. The proposed framework is elaborated in Section III. Furthermore, Section IV gives the significance evaluation of statistical features and the performance comparison experiments of several classification models (including our proposed FusionNet). At the end of the paper, Section V summarizes our work and discusses directions for future work.
The current methods for online depression detection mainly include two directions: (i) manually extracting features and building Traditional Machine Learning (TML) models for classification; (ii) using Deep Learning (DL) approaches to automatically extract features and constructing deep neural network models as classifiers. Among them, some of the research that uses DL also introduces TML methods to further improve model performance. The research of each approach will be introduced below respectively.
Mining depressed users based on TML mostly uses features, i.e. numeric vectors that have been manually analyzed and extracted from users to represent the predicted object (a user, a tweet, a posted picture, etc.) [18].

Choudhury et al. [1] presented a pioneering work in this field of research. They explored potential user behavior to perform user-oriented depression detection. By measuring behavioral attributes of Twitter users relating to social engagement, emotion, language, and linguistic styles, they discovered useful signals for characterizing depression. Although their trained classifiers did not achieve high classification performance, as a pioneering work in this field, they provided a detailed feature engineering analysis process and a clear modeling approach.

Wang et al. [25] undertook further research using data from Twitter and Weibo. Compared with the work of [1] that made a more comprehensive feature analysis, this study implemented a sentiment analysis approach and proposed man-made rules by utilizing vocabulary to measure depressive tendencies of tweets. Their work indicated that text-based features play a crucial role in online depression detection.

Deshpande et al. [20] proposed a representation learning method based on natural language processing (NLP) to model the text information on Twitter. Different from the previously mentioned work [1, 25], they used the Bag of Words (BOW) algorithm to represent the tweet text as a sparse vector, allowing the classifier to automatically learn latent features. Their trained Naive Bayes (NB) classifier reached an F1-Score of 0.8329, while the Support Vector Machine (SVM) classifier only reached an F1-Score of 0.7973.

After that, Shen et al. [10] proposed an advanced detecting approach that can be used to detect depressed users in a timely manner. They constructed a well-labeled depression detection dataset on Twitter, which has been widely used by subsequent researchers. In the meantime, they extracted six depression-related feature groups covering the text, social behavior, and posted pictures. Their proposed multimodal depressive dictionary learning (MDL) approach can effectively learn the latent and sparse representation of user features. Experiments showed their proposed MDL model achieved an F1-Score of 0.85, indicating that the dictionary learning strategy and the ensemble of multimodal features are quite effective.

In recent years, more TML-based work has begun to emerge [21, 23, 24]. In particular, Mustafa et al. [23] implemented the Term Frequency-Inverse Document Frequency (TF-IDF) algorithm to weight the words in tweets. Their trained classifier based on a one-dimensional convolutional neural network (CNN-1D) achieved an F1-Score of 0.89. Their work is the first to introduce a neural network model for detecting depressed users on the OSN.

https://github.com/aidenwang9867/Weibo-User-Drpession-Detection-Dataset
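As a rough illustration of the TF-IDF weighting used in [23] (the exact preprocessing and parameters of that work are not reproduced here), the sketch below computes term frequency-inverse document frequency weights for a toy set of tweets using only the standard library:

```python
import math
from collections import Counter

def tf_idf(docs):
    """Compute a simple TF-IDF weight for every (document, word) pair.

    tf  = count of the word in the document / document length
    idf = log(N / number of documents containing the word)
    """
    n = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    df = Counter(w for toks in tokenized for w in set(toks))
    weights = []
    for toks in tokenized:
        tf = Counter(toks)
        weights.append({w: (c / len(toks)) * math.log(n / df[w])
                        for w, c in tf.items()})
    return weights

# Toy corpus: a word appearing in every tweet (e.g. "i") gets idf = 0,
# while rarer, more discriminative words keep a positive weight.
tweets = ["i feel hopeless and tired",
          "i love sunny days",
          "i feel nothing matters"]
w = tf_idf(tweets)
```

Words shared by all tweets are zeroed out, so the weighting naturally emphasizes the depression-indicative vocabulary that [23] exploits.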
Modeling approaches based on DL mainly jointly consider user social behaviors and multimedia information such as text, pictures, videos, etc. Among them, the modeling of the text information is the main research direction. Researchers have adopted NLP approaches to embed text into a high-dimensional continuous vector to automatically mine word features. Some work has also fused manually extracted features into DNN classifiers as part of the input, or integrated traditional classifiers with DNN classifiers to improve performance. These multimodal and ensemble approaches have proven to be an effective way to accomplish various tasks in social network analysis, including depression detection [26].

Several DNN classifiers that have achieved significant performance in NLP classification tasks were selected and evaluated by Orabi et al. [27]. They used a pretrained Word2Vec [28] model to embed the text of tweets. Their experimental results showed that the CNN-1D with a max-pooling structure reported the highest performance. Compared to other recurrent structures, including the recurrent neural network (RNN) and the Long Short-Term Memory (LSTM) neural network [29, 30], CNN-based models performed better in the task of depression detection.

Then, Sadeque et al. [31] proposed a latency-weighted F1 metric and applied it in a novel sequential classifier based on Gated Recurrent Units (GRU). They treated all the text of tweets as documents and input them to the classifier asynchronously, a strategy named "post-by-post". It allows the model to decide the depressive tendency of a user after each tweet is scanned. Thus, it somewhat avoids the time consumption of scanning too many tweets for a certain and obvious depressed user (e.g., a user with 200 tweets recording their anti-depressant experience). This approach can scan and detect depressive tendencies of tweets more efficiently.

Later, based on the prior work [10], Shen et al. [11] discovered that the current research on a specific OSN may be unsuitable and not universal for depression detection on other platforms. Thus, they proposed a cross-domain DNN model with a Feature Adaptive Transformation & Combination (DNN-FATC) strategy that can consider features of several aspects comprehensively and transfer the relevant information across heterogeneous domains.

Recently, more studies based on DL have been widely proposed. Gui et al. [12] further discussed the change of classification accuracy of the model under different proportions of depressed users and pointed out that the highest accuracy can be achieved when the proportion of normal and depressed user samples is close to balance. Moreover, they implemented a reinforcement learning (RL) approach to further improve the performance of the model. Lin et al. [32] used a popular pretrained model, i.e. BERT [33], to embed word vectors. Its hidden layer output was extracted to fuse both text and image features to further accomplish the downstream classification task.
To detect depressed users on the OSN more effectively, we propose a novel framework, as shown in Fig. 1. This framework mainly consists of three parts.

i. User data collection and labeling. This module contains two independent crawler systems (UserID-Crawler and UserInfo-Crawler), which are used to collect user samples on Weibo. Then, it is responsible for filtering and labeling the collected data to construct the Weibo User Depression Detection Dataset (WU3D).

ii. Feature extracting. This module is in charge of extracting the user's text information, including nicknames, profiles, and tweet text, and concatenating them into a long text sequence. Then, the sequence is input to the XLNet [34] pretrained model to obtain embedded word features. In the meantime, this module extracts statistical features of user text, social behavior, and posted pictures. Finally, these features are jointly input into the classification model.

iii. Model training and predicting. This module implements a depression detection model based on DNN, namely FusionNet, which receives feature input from the Feature Extracting module. The proposed FusionNet can be trained in a multitask learning mode, in which word vectors and statistical features can be used jointly to optimize the classifier in each training step.

The following parts of this section will elaborate on the theoretical construction and implementation methods of these modules, respectively.
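The text preparation in the Feature Extracting module — concatenating the nickname, profile, and tweet text into the long sequence S_ξ and truncating it to the maximum length ∆ — can be sketched as below. The function name, the space separator, and character-level truncation with ∆ = 512 are our own assumptions for illustration; in practice the effective limit is imposed by the XLNet tokenizer:

```python
def build_text_sequence(nickname, profile, tweets, delta=512):
    """Concatenate a user's text fields into one long sequence S_xi
    and truncate it to the maximum length delta.

    delta = 512 is an assumed value; the real limit depends on the
    XLNet configuration used for embedding.
    """
    parts = [nickname, profile] + list(tweets)
    s_xi = " ".join(p.strip() for p in parts if p)
    return s_xi[:delta]

seq = build_text_sequence(
    nickname="user123",
    profile="just another day",
    tweets=["why is everything so hard", "cannot sleep again"],
)
```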
Figure 1: The Framework of the Proposed Method
A user ID can be used to uniquely identify a user. With a user ID, the crawler can access the user's home page and collect information from it. First of all, UserID-Crawler is constructed to collect the user IDs of depressed candidates. The API officially provided by Weibo is used to obtain as accurate information as possible. Our strategies for collecting user IDs of depressed candidates include:

(i) Collecting data from the Weibo Super Topic of "抑郁症" ("Depression" in English). The Super Topic is a social group on Sina Weibo that gathers users with common interests. It has been proved that individuals who share the same background are more likely to trust each other, and thus will gather to form aggregations [35]. According to our investigation and analysis, there are a large number of active depressed users posting under the topic of "Depression". Collecting data in this way can greatly improve the efficiency of gathering depressed user samples. Therefore, UserID-Crawler collects depressed candidates under this topic and forms a list of their user IDs.

(ii) Collecting data through the function of "微博搜索" ("Weibo Search" in English) provided by Weibo. We use high-frequency words including "抑郁症" ("Depression" in English), "自杀" ("Suicide" in English), "痛苦" ("Pain" in English) and the late-night time period (from 0:00 a.m. to 6:00 a.m.) as two main search conditions to crawl user IDs for collecting more depressed candidates.

Through the above two crawling strategies, we have collected sufficient user IDs of depressed candidates. Then, with the user ID list, UserInfo-Crawler is implemented to collect detailed user information from each user's personal homepage. The specific information fields collected by UserInfo-Crawler are shown in Fig. 2.

We divide the information for each user sample into two domains: the user domain and the tweet domain. The user domain contains the user's gender, birthday, profile (a short text of the user's self-description), the number of followers, the number of followings, and the list of tweets. Each tweet in the tweet domain contains the tweet text, the posting time, posted pictures, the number of likes, the number of forwards, the number of comments, and an identifier that indicates whether the tweet is original or not.

https://open.weibo.com/wiki/API
https://s.weibo.com/
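The late-night search condition above can be expressed as a small helper. The 0:00 a.m.-6:00 a.m. window follows the paper; treating it as the half-open interval [0:00, 6:00) and the function name itself are our own illustrative choices:

```python
from datetime import datetime

# Late-night window used as a search condition; we treat the paper's
# "0:00 a.m. to 6:00 a.m." as the half-open interval [0:00, 6:00).
LATE_NIGHT_START, LATE_NIGHT_END = 0, 6

def is_late_night(posting_time: datetime) -> bool:
    """Return True if a posting time falls in the late-night period."""
    return LATE_NIGHT_START <= posting_time.hour < LATE_NIGHT_END

assert is_late_night(datetime(2020, 3, 1, 3, 15))      # 3:15 a.m.
assert not is_late_night(datetime(2020, 3, 1, 14, 0))  # 2:00 p.m.
```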
Figure 2: The Data Structure of Candidates and WU3D (per user)

For normal candidates, we use UserID-Crawler to collect them under four Super Topics, including "日常" ("Daily" in English), "正能量" ("Positive Energy" in English), "榜姐每日话题" ("Daily Topic" in English), and "互动" ("Interaction" in English), to form a list of normal candidate IDs. Then, the more detailed user information is collected through UserInfo-Crawler to form the same data fields and structure as the depressed candidates. Based on the previous steps, we have collected 125,479 depressed candidates and 65,913 normal candidates.
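The two-domain record of Fig. 2 can be sketched as a nested structure. The field names and values below are illustrative only, not the exact keys used in the released WU3D files:

```python
# Illustrative per-user record mirroring the user domain / tweet domain
# split described above (field names are our own, not the dataset's).
user_record = {
    "nickname": "example_user",
    "gender": "female",
    "birthday": "1998-05-17",
    "profile": "a short self-description",
    "followers": 120,
    "followings": 80,
    "tweets": [  # up to 100 tweets per user
        {
            "text": "Why is depression so painful ...",
            "posting_time": "2020-03-01 02:14:00",
            "pictures": ["pic_001.jpg"],
            "likes": 3,
            "forwards": 0,
            "comments": 1,
            "is_original": True,
        },
    ],
}
```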
Automated scripts are implemented to filter out non-personal accounts, including marketing accounts, official accounts, and social bots, by identifying the user's "account type" field. The automatically filtered normal candidates are labeled as normal users directly, without further manual labeling. For depressed candidates, we invite professional data labelers to complete the labeling process. To ensure that the results are highly reliable, the labeled data has been reviewed twice by psychologists and psychiatrists. The principles of the data labeling can be described as follows:

i. Depressed candidates with a self-reported history of depression, a confirmed diagnosis, currently taking antidepressants, or recording antidepressant experiences in multiple tweets will be labeled as depressed users.

ii. If a candidate's tweets have repeatedly contained content describing psychological suffering, mental anguish, and strong suicide intention, the user will be identified as depressed.

iii. If the posted pictures of a candidate repeatedly involve or show bloodshed and self-harming content and the tweet text includes keywords such as "抑郁" ("Depression" in English) and "自残" ("Self-harming" in English), the candidate will be identified as a depressed one.

iv. Candidates who partially meet the above conditions but have too much unrelated content, such as forwarding lottery prizes, receiving red envelopes, or advertising information, will be directly discarded.

Therefore, the target dataset, i.e. WU3D, is constructed. It contains both labeled normal and depressed users. The specific information of the candidates and WU3D is given in Table 1. We counted the normal sample, the depressed sample, and the total for each of the two types. In particular, we give the detailed numbers of users, their posted tweets, and their posted pictures.

Table 1: Dataset statistics

Dataset      Category    User      Tweet       Picture
Candidates   Depressed   125,479   5,478,806   2,354,701
             Normal      65,913    4,927,904   3,631,537
             Total       191,392   10,406,710  5,986,238
WU3D         Depressed   10,325    408,797     160,481
             Normal      22,245    1,783,113   1,087,556
             Total       32,570    2,191,910   1,248,037

All of the candidates were collected from March 2020 to May 2020. A total of over 200,000 user samples were collected, including 125,479 depressed candidates and 65,913 normal candidates. After strict data filtering and labeling, the number of depressed users in WU3D reached 10,325, with a retention rate of 8.23%; the number of normal users reached 20,338, with a retention rate of 29.34%. The total user data retention rate was 15.50%.
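The automated account-type filtering described above can be sketched as follows. The account-type values ("marketing", "official", "bot", "personal") are hypothetical stand-ins, since the exact values of Weibo's account type field are not given in the paper:

```python
# Hypothetical account-type values for non-personal accounts; the real
# values of Weibo's "account type" field may differ.
NON_PERSONAL_TYPES = {"marketing", "official", "bot"}

def keep_personal_accounts(candidates):
    """Filter out marketing accounts, official accounts, and social bots."""
    return [c for c in candidates
            if c.get("account_type") not in NON_PERSONAL_TYPES]

candidates = [
    {"nickname": "alice", "account_type": "personal"},
    {"nickname": "shop_promo", "account_type": "marketing"},
    {"nickname": "news_bot", "account_type": "bot"},
]
kept = keep_personal_accounts(candidates)  # only "alice" survives
```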
Several previous studies have defined features that are quite effective for detecting depressed users, such as the proportion of late-night tweets, the proportion of original tweets, and the mean value of hue and saturation. Based on their work, we first perform feature engineering of user features in three aspects: the user text, social behavior, and posted pictures. We then summarize ten user-level features, including four that are newly proposed and two that are modified. These features are extracted using statistical approaches, including the scale, the mean value, the standard deviation, etc. In Table 2, the symbol definitions that appear in this section and subsequent sections are given.

Table 2: Variable and Function Symbol Definitions

Symbol               Description
P                    The posted tweet set of a user, including original and repost tweets.
t_p                  The posting time of a tweet.
l_e                  The emotional label of a tweet.
n_d                  The number of depression-related words in a tweet.
T                    A set of all user text information, including the nickname, the profile, and the tweet text.
π                    The posted picture set of a user.
µ = (h_µ, s_µ, v_µ)  The dominant color of a picture, a ternary containing the hue, saturation, and brightness of the HSV color space. One picture has one dominant color.
X̄                    The mean sample value of an attribute.
S_ξ                  The concatenated user long text sequence.
∆                    The max length of the long text sequence S_ξ.
C                    The function that calculates the number of elements in a set.
L                    The loss function of a neural network.
Θ                    The parameter set of a neural network.
y                    The true label of a user data sample.
f̂                    The objective function of a neural network. It inputs a user's feature vector and outputs the predicted label.
J                    The joint optimization function of a neural network.

Descriptions of these features are shown in Table 3. The features are divided into three groups, including text-based features, social behavior-based features, and picture-based features. Here, we give specific descriptions and formulas to calculate each feature.
Table 3: Manually extracted user features

Group                Feature name                               Symbol    Source
Text: Ψ              Proportion of negative emotional tweets    ψ_NP      First proposed in our work
                     Frequency of depression-related words      ψ_FDW     [1, 6, 8, 10, 11, 24, 25], modified in our work
Social behavior: Φ   Proportion of original tweets              φ_POP     [6, 11, 25]
                     Proportion of late-night posting           φ_PLNP    [6, 10, 11, 24, 25], modified in our work
                     Posting frequency (per week)               φ_PF      First proposed in our work
                     Standard deviation of posting time         φ_SDPT    First proposed in our work
Picture: Γ           Frequency of picture posting               γ_FPP     First proposed in our work
                     Proportion of cold color-styled pictures   γ_PCP     [11]
                     Standard deviation of hue                  γ_SDH     [6, 10, 11], modified in our work
                     Standard deviation of saturation           γ_SDS     [6, 10, 11], modified in our work

Proportion of negative emotional tweets. In previous works for Twitter [1, 31], by considering the number of tweets with negative emotions, researchers have achieved good results in distinguishing depressed users from normal ones. Rather than directly using the "number", we use a "proportion" calculation to normalize the feature. Although
depressive tendencies do not fully equate to the expression of negative emotions, when the proportion of tweets with negative emotions reaches a certain level, it can reflect that the user's mental state is depressed and painful, and thus can reveal a tendency of depression. We use the Text Sentiment Analysis API of the Baidu Smart Cloud Platform to label all the original tweets. The API returns three emotional labels: 0 for negative, 1 for neutral, and 2 for positive. We retain the negative emotions of label 0 and summarize labels 1 and 2 as a category of non-negative emotions. For all the original tweets under each user, we give the definition of ψ_NP in equation (1), in which C(P_o) is the total number of original tweets and C(l_e) is the total number of original tweets with negative emotions:

ψ_NP = (1 / C(P_o)) × C(l_e),  ψ_NP ∈ [0, 1]    (1)

Frequency of depression-related words. Researchers have focused on the lexical and semantic analysis of the tweet text and quantified these features by self-constructing or quoting depression-related semantic lists [1, 6, 8, 10, 11, 24, 25]. The results of the existing studies indicate that features based on high-frequency depression keywords can significantly improve the classification performance. We use "frequency" to describe how frequently depression-related words appear in a user's tweets, reflecting potential depressive tendencies. In our previous investigation and analysis on Weibo, we summarized a list of high-frequency words for depression. Here, it is used to calculate the frequency of depressive words in users' original tweets. The number of occurrences of depression-related words in each tweet, n_d, is counted by matching the keyword list. Then, ψ_FDW is calculated by:

ψ_FDW = (1 / C(P_o)) × Σ_{i=1}^{C(P_o)} n_{d_i},  ψ_FDW ∈ [0, ∞)    (2)
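Equations (1) and (2) can be computed directly from a user's original tweets. The sketch below assumes that sentiment labels have already been obtained (e.g. from the Baidu API) and that a depression keyword list is available; both toy inputs stand in for the real resources used in the paper:

```python
def psi_np(labels):
    """Proportion of negative emotional tweets (equation (1)).
    `labels` holds one sentiment label per original tweet:
    0 = negative, 1 = neutral, 2 = positive."""
    return sum(1 for l in labels if l == 0) / len(labels)

def psi_fdw(tweets, keywords):
    """Frequency of depression-related words (equation (2)):
    average number of keyword occurrences per original tweet."""
    total = sum(sum(t.count(k) for k in keywords) for t in tweets)
    return total / len(tweets)

# Toy data standing in for the API labels and the keyword list.
labels = [0, 0, 1, 2]                                   # 2 of 4 are negative
tweets = ["so tired of pain", "pain again", "nice day", "ok"]
assert psi_np(labels) == 0.5
assert psi_fdw(tweets, ["pain", "tired"]) == 0.75       # 3 hits / 4 tweets
```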
Proportion of original tweets. Several related works have proved that depressed users are more likely to post a large number of original tweets to express their negative psychological state, with relatively few repost tweets [6, 11, 25]. Therefore, we use the proportion of original tweets to distinguish between depressed users and normal users. Here we use C(P) to calculate the total number of tweets, including original tweets and repost tweets. Then, φ_POP is defined by:

φ_POP = (1 / C(P)) × C(P_o),  φ_POP ∈ [0, 1]    (3)

Proportion of late-night posting. Late night is a time when depressive symptoms attack more frequently, and thus depressed users tend to be more likely to post tweets in this period [6, 10, 11, 24, 25]. Moreover, the late-night period is the time when normal users sleep and rest. They rarely use social tools during this time and therefore send very few tweets. We use the proposed feature "Tweet Time" in Ref. [11] and make minor modifications. The time range of 0:00-6:00 is adopted as the late-night period. Moreover, all the tweets of a user are used in the calculation, including original and repost ones. Then, φ_PLNP is given by equation (4), in which C(t_p) is used to calculate the total number of tweets posted in the late-night time period from 0:00 a.m. to 6:00 a.m.:

φ_PLNP = (1 / C(P)) × C(t_p),  φ_PLNP ∈ [0, 1]    (4)

Posting frequency (per week). A previous study for Twitter [25] found that there is also a difference in posting frequency between normal and depressed users. Depressed users tend to post large numbers of tweets when they are suffering from depression and heavily rely on social media to express their painful feelings. Moreover, "week" is a moderate time size and has stronger periodicity than "month". We take the earliest posting time and the latest posting time as an interval, count the total number of tweets C(P_int) during this interval, and then divide it by 7 to get the weekly frequency value.
Thus, φ_PF can be represented by equation (5):

φ_PF = (1 / 7) × C(P_int)    (5)

Standard deviation of posting time. The posting time of depressed users tends to be concentrated in the late night, while the distribution of the posting time of normal users is more discrete within a day [6, 10, 11, 25]. Hence, we use the standard deviation to describe this phenomenon, in order to reflect the aggregation trend of users' posting time.

http://ai.baidu.com/tech/nlp/sentiment_classify
The smaller the value of this feature, the more likely the user is to post in a specific time period. Here, we consider all the original and repost tweets. The mean value of posting time, X̄_SDPT, is calculated by:

X̄_SDPT = (1 / C(P)) × Σ_{i=1}^{C(P)} t_{P_i}    (6)

Then, φ_SDPT can be defined as:

φ_SDPT = sqrt( (1 / C(P)) × Σ_{i=1}^{C(P)} (t_{P_i} − X̄_SDPT)² )    (7)

Frequency of picture posting. In existing works for Twitter and Weibo [6, 11], "tweet with pictures" is categorized into "tweet type" to measure how often users post pictures in their tweets, and has achieved good performance. Based on our prior research on Weibo, we also found that depressed users were more likely to use a lot of text to express their feelings and mental states, and thus post fewer pictures than normal users. Therefore, we propose this feature to reflect users' habit of posting pictures. C(π) represents the total number of posted pictures. Then, we calculate γ_FPP by:

γ_FPP = (1 / C(P_o)) × C(π)    (8)

Proportion of cold color-styled pictures. Studies for Twitter [6, 11] and Weibo [6, 10, 11] have shown that, compared to normal users, depressed users tend to post pictures with relatively colder colors. Therefore, we extract three hue- and saturation-related features to distinguish depressed users from normal users.

However, the warmth and coolness of a picture is a relative concept, and the human eye will give different conclusions when contrasting different colors. Lin et al. [6] proposed a range definition of cold colors by analyzing hue rings, which is used as our definition of the cold color range over h_µ (with a lower bound of 30 degrees).

For the three color-related features, we compute them using values from the Hue, Saturation, Value (HSV) color model. Similarly to the Red-Green-Blue (RGB) color space, HSV is a color space that represents the intuitive properties of colors, which is composed of hue, saturation, and lightness. Among the three attributes, "hue" refers to the category of colored light.
Different wavelengths of light give different colors and hues. Hue is measured as an angle in the range of 0 to 360 degrees: moving counterclockwise from red, the red hue is defined as 0 degrees, green as 120 degrees, and blue as 240 degrees. Saturation indicates how close a color is to the pure spectral color and usually takes a value from 0 to 1; the larger the value, the more saturated the color. After converting the RGB value of each pixel to the HSV color space, we calculate the dominant color ternary µ = (h_µ, s_µ, v_µ). The algorithm for extracting the dominant color is given in Algorithm 1.

Algorithm 1:
Dominant Color Extraction
Input: τ, all the pixels of a picture, represented in the HSV color space
Output: the dominant color pixel µ of the picture
Initialize: threshold        /* the striking pixel threshold */
Initialize: array SPArr      /* to store the striking pixels */
τ̄ ← the average of τ
for every pixel in τ do
    h_τ ← pixel[0]           // pixel = (h_τ, s_τ, v_τ)
    h_µ ← τ̄[0]               // τ̄ = (h_µ, s_µ, v_µ)
    if |h_τ − h_µ| > threshold then
        SPArr ← SPArr + pixel
end
µ ← the average of SPArr
return µ
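Algorithm 1 can be sketched in Python as follows. This is a simplified, hypothetical implementation: the fallback for pictures with no striking pixels is our own assumption, since the paper does not specify that edge case.

```python
def dominant_color(pixels, threshold=30):
    """Algorithm 1: extract the dominant color of a picture.

    pixels: list of (h, s, v) tuples in the HSV color space.
    threshold: the striking pixel threshold (the paper settles on 30).
    """
    n = len(pixels)
    # Average color of the whole picture.
    avg = tuple(sum(p[i] for p in pixels) / n for i in range(3))
    h_mu = avg[0]
    # Striking pixels: hue differs from the mean hue by more than threshold.
    sp = [p for p in pixels if abs(p[0] - h_mu) > threshold]
    if not sp:        # assumption: fall back to the picture average
        return avg
    return tuple(sum(p[i] for p in sp) / len(sp) for i in range(3))
```

For a picture dominated by one hue with a single contrasting region, the returned ternary is the average of the contrasting (striking) pixels.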
The dominant color is the most attractive and visually dominant color in a picture. Thus, we introduce the striking pixel (SP) to represent these colors. The SP plays an important role in the intuitive perception of the entire picture and is usually identified by the absolute difference between a specific pixel's hue and the average hue of the entire picture. The algorithm first takes as input a picture with all pixels represented in the HSV color space. It initializes a manually assigned threshold and an array SPArr to store striking pixels. Then, the algorithm calculates the average color (h_µ, s_µ, v_µ) of the picture and iterates through each pixel, comparing the absolute difference between its hue value and h_µ. If the difference is greater than the threshold, the currently iterated pixel is marked as an SP. Finally, by averaging the SP array, the dominant color ternary µ is obtained. Several rounds of tests have been run to choose the best value of the threshold (here set to 30).

We count the total number of posted pictures with h_µ ∈ (30, …) and s_µ < … as C(π_cold). Then, γ_PCP is calculated by:

$$\gamma_{PCP} = \frac{1}{C(\pi)} \times C(\pi_{cold}), \qquad \gamma_{PCP} \in [0, 1] \qquad (9)$$

Standard deviation of hue and Standard deviation of saturation. These two features reflect the fluctuation of the user's picture colors. Previous works used the mean values of hue and saturation as picture features and achieved good results on Twitter [6, 11] and Weibo [6, 10, 11]. In our research, we found that the hue of depressed users' pictures is more concentrated in colder ranges and the saturation is relatively low. On the contrary, the hue and saturation distributions of normal users are more dispersed and even. We take the hue value h_µ and the saturation value s_µ of each picture's dominant color and calculate their mean values X_SDH and X_SDS by:

$$X_{SDH} = \frac{1}{C(\pi)} \sum_{i=1}^{C(\pi)} h_{\mu i} \qquad (10)$$

$$X_{SDS} = \frac{1}{C(\pi)} \sum_{i=1}^{C(\pi)} s_{\mu i} \qquad (11)$$

Then, γ_SDH and γ_SDS can be defined using the following equations:

$$\gamma_{SDH} = \sqrt{\frac{1}{C(\pi)} \sum_{i=1}^{C(\pi)} \left(h_{\mu i} - X_{SDH}\right)^2} \qquad (12)$$

$$\gamma_{SDS} = \sqrt{\frac{1}{C(\pi)} \sum_{i=1}^{C(\pi)} \left(s_{\mu i} - X_{SDS}\right)^2} \qquad (13)$$

Algorithm 2 gives the approach to construct the user text sequence. Considering that the user nickname and profile can also reflect the user's current emotional state, they are also concatenated to the tweet text.
Algorithm 2: User Long Text Sequence Construction
Input: T, a collection of the user's nickname, profile, and all tweets' text
Output: the concatenated long text sequence S_ξ
Initialize: an empty string S_ξ
Initialize: the max length of the text sequence ∆
for text in T do
    if the length of S_ξ > ∆ then
        break
    if text belongs to an original post then
        S_ξ ← S_ξ + text
    else if text belongs to a repost then
        if text = "Repost" then
            continue         /* ignore the default repost reason */
        else
            S_ξ ← S_ξ + text
    else                     // user nickname or profile
        S_ξ ← S_ξ + text
end
return S_ξ

The algorithm first takes all the text information of the user (defined as T) as input and constructs the concatenated user long text sequence S_ξ by traversing T. After entering the loop, the algorithm first determines whether the current length of the concatenated text sequence is greater than the maximum length ∆; the algorithm ends if this condition is satisfied. Then, it concatenates the user's original tweet text in chronological order from the latest to the earliest. Moreover, when a user reposts a tweet, Weibo asks the user to fill in a reason for the repost. If the user does not fill in a reason, the text "转发微博" ("Repost" in English) is automatically added by default. This default repost reason is not retained in the text sequence S_ξ, since it does not express any opinions or feelings.

To effectively vectorize the text sequence constructed above and apply this feature to the classification algorithm, the characteristics of this long text sequence are further discussed. First, the sequence is strongly contextually linked. This link exists not only within a single tweet but also among the contexts of multiple tweets. For example, a user may post multiple tweets at different times about depression diagnoses, depression onset, medication treatment, and inner distress. Integrating these pieces of information is usually the key to judging whether a user is depressed. Secondly, under real circumstances, not all tweets describe depression-related content, even for truly depressed users.
That is, capturing text semantics such as "the user states that he has been diagnosed with depression" and "the user expresses a strong inclination toward suicide" is critical for detecting depression using the long text sequence S_ξ.

Considering these aspects, several state-of-the-art word embedding algorithms are discussed here. The Transformer [36] is a model that replaces the recurrent neural network with the attention mechanism. It calculates the weight of each unit in a long sequence to effectively capture important semantic information. Moreover, BERT [33] is a recently proposed bidirectional encoder. However, due to the limitations of the "Position Embedding" structure in BERT (including its derivative models ALBERT and RoBERTa), the maximum sequence length ∆ for a single pass is restricted to 512 units. Furthermore, the truncation or batch processing of long text sequences used in Ref. [32] significantly increases the time complexity of processing, which makes it unsuitable for timely depression detection. Therefore, the ideal word embedding model must be able to process long text efficiently and accurately.

A novel language model, XLNet [34], was then proposed by Yang et al. Since it combines the characteristics of auto-regressive and auto-encoding language models, XLNet resolves the problem that BERT ignores the relationship between masked positions, and it can process longer text sequences. In this paper, XLNet-Chinese-base (https://github.com/ymcui/Chinese-XLNet)
is used as the upstream word embedding model, and a multitask-based DNN classifier, FusionNet, is implemented to handle the downstream tasks.
Figure 3: The Structure of our Proposed FusionNet (FN)

Multitask learning is an integrated learning strategy that synchronizes model training by letting multiple tasks share common network structures and weights. Based on multitask learning, we construct a DNN classifier with Bi-GRU with attention as its main structure. As shown in Fig. 3, the classification task on the word vectors obtained from the upstream embedding model XLNet (Task-1) and the classification task on the manually extracted statistical features (Task-2) are treated as two classification tasks for detecting depressed users. Loss functions L_1 and L_2 with different weights ω_1 and ω_2 are manually defined to simultaneously train and optimize the network.

Firstly, the user text sequence S_ξ is embedded by XLNet, and the output of the last hidden layer is connected to a layer normalization (LN) [37] layer. Then, the LN layer is connected to the Bi-GRU layer with attention to capture the key information and reduce the dimensionality of the word vector.

For Task-1, this one-dimensional word feature is passed through a Fully Connected (FC) layer, a Dropout layer, and a Softmax layer to directly output classification results. We set an auxiliary loss function L_1 for network optimization in Task-1 to help accelerate its convergence. For Task-2, the word feature is concatenated with the manually extracted statistical feature input. The statistical feature groups [Ψ, Φ, Γ] are regularized by a Batch Normalization (BN) layer [38]. Moreover, the fused feature vector is passed through multiple FC layers with activation functions, activating the hidden layers' outputs to further improve the fitting capability of the network. Finally, the network is connected to the Softmax layer, and the classification result is given by the main output. The main loss function L_2 is used to optimize the whole FusionNet network.

We define the weight parameter set of the Task-1 network as Θ_aux and its objective function as f̂_1. The global weight parameter set of the whole network is represented by Θ_g, with objective function f̂_2, so that Θ_aux ⊆ Θ_g. Adopting the multitask learning strategy, the joint optimization function J can be described as:

$$J_1 = \sum_{i=1}^{C(U)} L_1\left(y_i, \hat{f}_1(S_{\xi i}), \Theta_{aux}\right), \qquad J_2 = \sum_{i=1}^{C(U)} L_2\left(y_i, \hat{f}_2(\Psi_i, \Phi_i, \Gamma_i), \Theta_g\right) \qquad (14)$$

$$J(\Theta_{aux}, \Theta_g) = \omega_1 \times J_1 + \omega_2 \times J_2 \qquad (15)$$

In equation (14), y_i represents the true label (normal or depressed) of a specific user sample i; f̂_1 and f̂_2 both output the predicted label of user sample i. In equation (15), ω_1 and ω_2 are the manually assigned weights of the loss functions L_1 and L_2.

The manually extracted features and several undetermined parameters will be evaluated in the following section. Since the original user data obtained by the crawler contains irrelevant information, to minimize experimental bias and improve the efficiency of model training, we removed all non-text content from the tweets.
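As a toy numeric sketch of the joint optimization in Eqs. (14)-(15), two per-sample categorical cross-entropy losses are summed into sub-objectives and combined with manually assigned weights. The clipping constant and the example weights below are our assumptions, not the paper's trained settings.

```python
import math

def cross_entropy(y_true, p_pred):
    """Categorical cross-entropy for one sample (one-hot y_true)."""
    return -sum(t * math.log(max(p, 1e-12))   # clip to avoid log(0)
                for t, p in zip(y_true, p_pred))

def joint_loss(aux_losses, main_losses, w1, w2):
    """Eq. (15): J = w1 * J1 + w2 * J2, where J1 and J2 are the
    per-sample losses of Task-1 and Task-2 summed over all users (Eq. (14))."""
    return w1 * sum(aux_losses) + w2 * sum(main_losses)
```

In practice, frameworks such as TensorFlow implement this by attaching one loss per output head and a pair of loss weights, so both heads are optimized in a single backward pass over the shared layers.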
In this part, WU3D is divided into four subsets: D_1, D_2, D_3, and D_4. All of the subsets are sampled using a fixed random seed without overlap. Among them, D_1 is used for DNN model training and for the 10-fold cross-validation of the TML classifiers. Furthermore, D_2 is used as a fixed dataset for validation in each round of neural network training. Finally, we test the models' performance on D_3 and report the evaluation metrics. As a supplementary dataset, D_4 contains only 325 depressed users and 12,245 normal users, and it is used only in the last experiment on unbalanced training samples. Statistics of the sliced datasets are given in Table 4:

Table 4: Dataset Slicing Statistics

Dataset | Depressed: User / Tweet / Picture | Normal: User / Tweet / Picture
WU3D    | 10325 / 408797 / 160481           | 22245 / 1783113 / 1087556
D_1     | … / … / …                         | … / … / …
D_2     | … / … / …                         | … / … / …
D_3     | … / … / …                         | … / … / …
D_4     | 325 / 13426 / 5183                | 12245 / 993566 / 641077
The experimental metrics used in this section are mainly supervised machine learning metrics. True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN) are commonly used to describe the number of classes predicted by models in classification tasks. Here, TP is the number of depressed users correctly predicted, TN is the number of normal users correctly predicted, FP is the number of normal users incorrectly predicted as depressed, and FN is the number of depressed users incorrectly predicted as normal. Based on these four counts, we can further define the advanced metrics by:

$$Accuracy = \frac{TP + TN}{TP + TN + FP + FN} \qquad (16)$$

$$Precision = \frac{TP}{TP + FP} \qquad (17)$$

$$Recall = \frac{TP}{TP + FN} \qquad (18)$$

$$F1\text{-}Score = \frac{2 \times Precision \times Recall}{Precision + Recall} \qquad (19)$$

Moreover, the Receiver Operating Characteristic (ROC) curve, which plots the True Positive Rate (TPR) against the False Positive Rate (FPR), can be used to visually reflect classifier performance. Furthermore, in the experiment on the statistical features, we introduce the Mann-Whitney U test and the cumulative distribution function (CDF) curve as evaluation metrics.
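For concreteness, Eqs. (16)-(19) can be computed from the four confusion counts as follows (a straightforward sketch; the function name is ours):

```python
def classification_metrics(tp, tn, fp, fn):
    """Eqs. (16)-(19), with depressed users as the positive class."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1
```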
We complete all the experiments in this section on a workstation with an Intel Xeon Silver 4212 CPU, an NVIDIA TITAN RTX GPU with 24 GB of graphics memory, and 32 GB of RAM. The programming-related settings used in the experiments are Python v3.7.5, Anaconda v4.8.3, TensorFlow v2.1.0, and Scikit-learn v0.23.1. (ii) Baseline statistical feature classifiers.
For the statistical feature classification task, we select several popular TML models from existing studies to demonstrate the effectiveness of the ten concluded features. • LR:
Logistic Regression is a commonly used linear model [19] and has good classification performance. • NB:
A Naive Bayesian classifier is a simple probabilistic classifier based on Bayes' theorem. Its implementation is relatively simple, and it is often used in related works [6, 7, 10, 20, 21, 25]. • SVM:
Support Vector Machine classifiers apply the principle of structural risk minimization to the field of classification and are the most used classifiers in previous studies [1, 6, 8, 20, 21, 23, 24, 27, 31]. In our work, we evaluate different kernels of the SVM, including the linear, polynomial, and radial basis function (RBF) kernels. • RF:
Random Forest is an algorithm that integrates multiple decision trees through ensemble learning and is also widely used in related works [6, 7, 23, 24]. The basic unit of RF is the decision tree. • AB:
Adaptive Boosting is an ensemble learning algorithm that combines multiple simple classifiers [19]. • GBDT:
Gradient Boosting Decision Tree is a classification model that uses an integrated additive model to continuously reduce the training residuals. GBDT is one of the TML algorithms with excellent generalization ability; however, to the best of our knowledge, no existing work has used GBDT as a classification model for this task. • BP:
The Back Propagation (BP) network is extracted from the main output part of our proposed FusionNetwith the same parameter settings. To be specific, BP is composed of “FC+Dropout+FC+Softmax”. (iii) Baseline word vector classification networks.
For the word vector classifiers, we use several popular neural network structures as the main structures, appended with FC layers and a Softmax layer to output the classification label. • CNN-1D:
One-dimensional convolutional neural networks are widely used in natural language processing and have achieved good performance in the task of depression detection [6, 19, 22, 23, 27]. • Bi-LSTM:
The bidirectional LSTM network splices two directional LSTMs together and is more capable of handling time series data [27, 31, 32]. • Bi-GRU:
Similarly, the bidirectional GRU splices two directional GRU networks together and is similar to Bi-LSTM in its ability to handle time series data [12, 31]. • TCN:
The temporal convolutional network is a new algorithm for processing time series that reduces the serial processing complexity of RNNs [39]. • Attention:
The attention mechanism was proposed by Vaswani et al. [36] and can quickly filter out high-value information from large amounts of information. Attention is popular in many fields such as machine translation and speech recognition. • Bi-GRU with attention:
It is extracted from our proposed FusionNet with the same parameter settings. • Bi-GRU with GAP:
Global Average Pooling (GAP) [40] is used to replace the attention layer to reduce the dimensionality of the Bi-GRU output, so as to compare the performance of these two similar structures.

For the baseline statistical feature classifiers and word vector classification networks, we ran a series of pre-experiments on each classifier and selected the structures and parameters with the best classification performance. Each classifier is referred to by its main structure's symbol (e.g., Bi-GRU-based classifiers are referred to as Bi-GRU for short). API default parameters are used for both TML and DL classifiers where not specifically described here. Moreover, separate experiments for the neural network structures of BP and Bi-GRU with attention are set up to further demonstrate the superiority of FusionNet, which uses multitask learning to merge these two DNN structures.

The loss functions and callback settings for neural network training are given in Table 5:

Table 5: Neural Network Training Setup
Item | Setup
Batch size | 32
Epoch | 80
Early Stopping | monitor='val_acc', patience=10
Check Point | monitor='val_acc', mode='max'
FN loss functions L_1, L_2 | Categorical Crossentropy
FN optimizer (for L_1) | NAdam (init lr = 3e-4)
FN optimizer (for L_2) | NAdam (init lr = 1e-3)
FN [ω_1, ω_2] | [0.…, 0.…]

In this part, we perform a non-parametric Mann-Whitney U test on each manually extracted feature. The result is shown in Table 6. Since the Mann-Whitney U-value and the Wilcoxon W-value can be transformed into each other, only the U-values are reported in the table.

Table 6: Mann-Whitney U Test Results
Symbol | Mann-Whitney U | Significance | Decision
ψ_NP   | … | p < 0.001 | Reject H_0
ψ_FDW  | … | p < 0.001 | Reject H_0
φ_POP  | … | p < 0.001 | Reject H_0
φ_PLNP | … | p < 0.001 | Reject H_0
φ_PF   | … | p < 0.001 | Reject H_0
φ_SDPT | … | p < 0.001 | Reject H_0
γ_FPP  | … | p < 0.001 | Reject H_0
γ_PCP  | … | p < 0.001 | Reject H_0
γ_SDH  | … | p < 0.001 | Reject H_0
γ_SDS  | … | p < 0.001 | Reject H_0

In the experiment, the default null hypothesis H_0 is set to "the distribution of this feature is the same for normal users and depressed ones". At the 95% confidence level, p < 0.05 rejects the null hypothesis, i.e., admits a significant difference between normal and depressed users. The results show that the p-value of each feature is less than 0.001 in the two-sided Mann-Whitney test. Thus, all the features pass the test and show significant differences in distribution between the two types of users.

We also evaluate the significance of the features by comparing the feature distribution curves of normal and depressed users. The CDF curve for each feature is plotted in Fig. 4. Since there are no quantitative parameters for the CDF curve to describe the results, we only evaluate the degree of coincidence between the two types of user curves. Among the features, ψ_NP has the highest distinction between the two types of users and the most significant difference in distribution. This further demonstrates that text-based features play an important role in identifying depressed users on social networks. For our proposed features shown in Fig. 4(a), (e), (f), and (g), the curves of the two types of users show obvious separation and different trends, indicating significant differences in these features between the two types of users.
[Figure 4 panels, each plotting the cumulative proportion for normal vs. depressed users: (a) ψ_NP, proportion of negative emotional tweets; (b) ψ_FDW, frequency of depression-related words; (c) φ_POP, proportion of original tweets; (d) φ_PLNP, proportion of late night posting; (e) φ_PF, posting frequency (per week); (f) φ_SDPT, standard deviation of posting time; (g) γ_FPP, frequency of picture posting; (h) γ_PCP, proportion of cold color-styled pictures; (i) γ_SDH, standard deviation of hue; (j) γ_SDS, standard deviation of saturation]
Figure 4: The CDF Curves of Ten Statistical Features
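For reference, the cumulative-proportion curves in Fig. 4 correspond to empirical CDFs of each feature over the two user groups; a minimal sketch (the function name is ours):

```python
def empirical_cdf(values):
    """Return sorted (value, cumulative proportion) pairs for one
    feature sampled over one user group."""
    xs = sorted(values)
    n = len(xs)
    return [(x, (i + 1) / n) for i, x in enumerate(xs)]
```

Plotting the pairs for the normal and depressed groups on the same axes reproduces the kind of curve comparison used above: the less the two step curves coincide, the more discriminative the feature.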
In this part, the baseline statistical feature classifiers are used to evaluate the contribution of different feature groups. We perform experiments on different combinations of features to determine their contribution to the classification task. The result is shown in Fig. 5.

The experimental results demonstrate that the classification performance of the feature groups in each classifier is consistent with the results of our statistical tests, in which the text-based feature group Ψ contributes the most. The BP classifier already achieves a high F1-Score of 0.9431 when only text-based features are used. The contribution of the picture-based feature group Γ is relatively poor, with a highest F1-Score of only 0.7514 using the GBDT classifier. However, under different combinations of features, the performance improves to different degrees for each group and each classifier. In particular, with the combination of all the feature groups (Ψ + Φ + Γ), the GBDT classifier reports the highest F1-Score of 0.9465. GBDT and BP both achieved the highest performance metrics in several rounds, benefiting from gradient-descent-based optimization methods.

Therefore, it is concluded that all three feature groups can positively improve the performance of classification tasks, with the text-based features contributing the most.
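The feature-group combinations evaluated in this experiment can be enumerated as follows (the ASCII group names are stand-ins for the paper's symbols Ψ, Φ, Γ; the per-combination training and scoring are omitted):

```python
from itertools import combinations

def feature_group_combinations(groups=("Psi", "Phi", "Gamma")):
    """All non-empty combinations of the three feature groups,
    as compared in Fig. 5 (7 combinations in total)."""
    combos = []
    for r in range(1, len(groups) + 1):
        combos.extend(combinations(groups, r))
    return combos
```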
The user text information consists of the user nickname, profile, and tweet text, which are concatenated into a long text sequence using Algorithm 2. However, due to the uncertainty of the total number of tweets and the number of words in each tweet, it is necessary to explore the most effective embedding length for the text sequence. In the meantime, since the pretrained Chinese XLNet has 12 network layers, a hidden size of 768, and 117M parameters in total, it takes considerable time to run the network and extract word vectors. Therefore, in this section, we run multiple experiments with several different values of the text sequence length. We record the
[Figure 5 panels, each plotting Accuracy, F1-Score, Precision, Recall, and AUC per feature-group combination: (a) LR, (b) NB, (c) SVM-linear, (d) SVM-poly, (e) SVM-rbf, (f) RF, (g) AB, (h) GBDT, (i) BP]
Figure 5: The Contribution of Different Feature Group Combinations

time consumption of extracting word vectors using XLNet and the F1-Score of each classifier to explore the appropriate text sequence length, which should have a reasonable embedding time and a relatively high F1-Score.

According to our statistics, a user's tweet text is generally longer than 32 Chinese characters, and a short text length may result in premature truncation of the tweet text. Thus, our experiment starts with sequence length ∆ = 64 and increases its value gradually. Finally, we select six text sequence lengths for the experiments, with ∆ ∈ {64, 128, 256, 512, 1024, 2048}. Then, we record the time consumption of embedding word vectors using XLNet and the classification F1-Score under the different text sequence lengths.

We take the embedding time at ∆ = 64 as a base and normalize the remaining groups to obtain relative values. Figure 6(a) shows the relative time consumption of XLNet at each value of the text sequence length ∆. The result demonstrates that, as the text sequence length increases, the curve shows a nonlinear trend and the slope keeps growing. Specifically, the relative time at ∆ = 512 is 11.81, at ∆ = 1024 it is 25.31, and at ∆ = 2048 it is 77.52; the time consumption increases by 114.31% and 206.28%, respectively. Therefore, when ∆ increases by a binary exponential power, the embedding time increases at a significantly faster rate.
Figure 6: The Selection of Text Sequence Length ((a) ∆ vs. relative time consumption; (b) ∆ vs. F1-Score of each word vector classifier)

Moreover, the baseline word vector classification networks are used to reflect the classification performance under different values of ∆. Figure 6(b) shows the F1-Score of each classifier under different ∆ values. When ∆ increases from 64 to 1024, the F1-Score of each classifier increases significantly; when ∆ increases from 1024 to 2048, the F1-Score barely increases. Although the performance of a single attention structure is relatively poor, adding it to other structures further improves classification performance. Thus, Bi-GRU with attention best captures the key information in long text sequences among the tested word vector classifiers. It achieves the highest classification performance and is therefore used in our proposed FusionNet.

Combined with the previous experiment on ∆ and the word vector extraction time, it is concluded that ∆ = 1024 reaches the performance bottleneck of this classification task while keeping the word embedding time acceptable. Therefore, in subsequent experiments involving word vectors, we use ∆ = 1024 as the default text sequence embedding length.

To compare other classifiers with FusionNet, in this section the output of Bi-GRU with attention is extracted as the word feature, which is concatenated with the statistical features. Then, this integrated feature vector is input into the baseline statistical feature classifiers to accomplish the classification task of depressed and normal users. Thus, multimodel classifiers are constructed from both the word vector classification network and the baseline statistical feature classifiers. Table 7 gives detailed metrics of all the target classifiers, while Fig. 7 visualizes these metrics.

Table 7: Classification Performance of Target Classifiers
Classifier    | Accuracy | F1-Score | Precision | Recall
LR            | 0.9660   | 0.9655   | 0.9813    | 0.9502
NB            | 0.9555   | 0.9544   | 0.9779    | 0.9321
SVM-linear    | 0.9623   | 0.9616   | 0.9796    | 0.9442
SVM-poly      | 0.9562   | 0.9560   | 0.9604    | 0.9517
SVM-rbf       | 0.9600   | 0.9593   | 0.9773    | 0.9419
RF            | 0.9604   | 0.9597   | 0.9766    | 0.9434
AB            | 0.9649   | 0.9643   | 0.9820    | 0.9472
GBDT          | 0.9653   | 0.9647   | 0.9805    | 0.9494
FN (Proposed) | 0.9775   | 0.9772   | 0.9908    | 0.9639
In the performance experiment of the target classifiers, each classifier reaches an F1-Score above 0.95. In particular, our proposed FusionNet achieves the highest F1-Score of 0.9772, and it also obtains the highest value under all the other metrics. Compared to the second-highest classifier, LR, FusionNet improves the F1-Score by 1.21%. Using transfer learning, the word feature is extracted as an input to the baseline statistical feature classifiers. However, transferring
Figure 7: Performance of Target Classifiers ((a) Accuracy, (b) F1-Score, (c) Precision, (d) Recall)

features through different classifiers may lead to a loss of information. Since multitask learning enables different tasks to share the same network structure and weights, it significantly reduces the information loss caused by transfer learning and thus performs better.

Furthermore, the ROC curves are given in Fig. 8. We obtain the classification probability values output by the models and plot the ROC curves by sampling the FPR and TPR at multiple points. The plot shows that all the curves are close to the upper left corner, which indicates that all of the classifiers perform excellently. The curve of FusionNet is closest to the upper left corner, achieving the best classification performance.
In the previous experiments, we used depressed and normal users in equal proportion (50%) in datasets D_1, D_2, and D_3. However, in the real OSN environment, depressed users make up only a minority of the whole user community. Due to the difficulty of collecting depressed user data, it is hard to guarantee a balanced data proportion during classifier training and optimization. Therefore, by changing the proportion of depressed user samples (denoted as ρ), we analyze the F1-Score fluctuations of the target classifiers to evaluate their robustness to unbalanced training samples. Each classifier is treated as a group, and for each group we test nine values of ρ from 0.1 to 0.9, with an interval of 0.1.
[Figure 8 legend (TPR vs. FPR): FN (AUC = 0.9775), LR (AUC = 0.9660), GBDT (AUC = 0.9653), AB (AUC = 0.9649), SVM-linear (AUC = 0.9623), RF (AUC = 0.9604), SVM-rbf (AUC = 0.9600), SVM-poly (AUC = 0.9562), NB (AUC = 0.9555)]
Figure 8: ROC Curves of the Target Classifiers

Here, we introduce a new metric, the Intra-group F1-Score Variance (IFV), to calculate the variance of the F1-Score within each group. First, for each group, the mean F1-Score is calculated and represented by X_IF. The number of ρ values taken in each group is denoted as T. The IFV metric is then defined as:

$$IFV = \sqrt{\frac{1}{T} \sum_{i=1}^{T} \left(F_i - X_{IF}\right)^2} \qquad (20)$$

Table 8: IFV of Target Classifiers

Classifier | Intra-group F1-Score Variance (IFV)
LR         | 3.30e-5
NB         | 1.82e-5
SVM-linear | 4.60e-5
SVM-poly   | 5.89e-5
SVM-rbf    | 2.59e-5
RF         | 2.80e-5
AB         | 1.80e-5
GBDT       | 2.92e-5
FN (Proposed) | 1.01e-5
Fig. 9(a) shows the F1-Score of each classifier under different ρ values. The experimental results show that when the proportions of depressed and normal user samples are close to balanced, the classifiers tend to achieve higher F1-Scores.

Table 8 and Fig. 9(b) show the IFV metric of the target classifiers. Although LR achieved a high F1-Score in the target classifier experiment, it has relatively low robustness to unbalanced data due to the relatively poor decision ability of a single classifier. With the kernel learning strategy, the two types of SVM classifiers fit the data better, but their F1-Scores also fluctuate noticeably when the training samples are unbalanced. Ensemble classifiers, including RF and GBDT, obtain better robustness in the experiment. The IFV metric of our proposed FusionNet reaches the minimum value among the target classifiers, indicating that FusionNet has the best robustness, i.e., the most stable classification performance.

In addition to the advantages of the multitask learning strategy mentioned in the previous part, we believe that the adaptive learning rate of NAdam also helps FusionNet find the global optimal solution more quickly, even when the two classes of training data are not balanced.
Figure 9: Unbalanced Training Samples. (a) F1-Score under Different ρ Values; (b) IFV of the Target Classifiers.
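The unbalanced-training experiment can be sketched as follows, under the assumption that ρ denotes the ratio of depressed to normal samples in the training set; `make_unbalanced_split` and the toy data are hypothetical:

```python
import random

def make_unbalanced_split(depressed, normal, rho, seed=0):
    """Subsample the depressed class so the training set contains roughly
    rho depressed samples per normal sample (illustrative sketch)."""
    rng = random.Random(seed)
    k = min(len(depressed), int(rho * len(normal)))
    return rng.sample(depressed, k) + list(normal)

# Toy data: 100 depressed and 300 normal user samples
depressed = [("depressed", i) for i in range(100)]
normal = [("normal", i) for i in range(300)]
train = make_unbalanced_split(depressed, normal, rho=0.2)
# -> 60 depressed samples plus 300 normal samples
```

Sweeping ρ over a grid and recording the F1-Score at each setting yields the per-classifier groups from which the IFV values in Table 8 are computed.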
Conclusion

In this work, we proposed a multitask learning-based approach to detect depressed users on Sina Weibo.

First, through data collection, script filtering, and manual labeling, we built and published a large Weibo user depression detection dataset, WU3D. It contains over 30,000 user samples, each with rich information fields, and should be sufficient for subsequent researchers to conduct further research.

Secondly, we summarized and manually extracted ten statistical features covering text, social behavior, and picture-based information. The experimental results showed that all of them exhibit varying degrees of distribution difference between normal and depressed users and can therefore contribute positively to classification. Our experiments also demonstrated that the feature engineering of text information is the most vital part of depression detection on OSNs.

Furthermore, we evaluated the pretrained model XLNet as the embedding model for the downstream classification task. The results showed that, when an appropriate embedding length is selected, XLNet offers excellent performance and efficiency in handling long text sequences.

Finally, we implemented a multitask learning DNN classifier, FusionNet, to simultaneously handle the word-vector classification task and the statistical-feature classification task. Benefiting from the strategic advantages of multitask learning, FusionNet reduces the loss of feature information caused by transfer learning. Compared with the models commonly used in existing work, FusionNet achieved a significant performance improvement, with an F1-Score of 0.9772, and showed the best classification robustness when the training samples are unbalanced. It thus proves to be an ideal model for handling multiple classification tasks at the same time.

For future work, two directions will be explored. (i) The size of the dataset will be further expanded; larger datasets will be constructed for training and evaluating classifiers to achieve better generalization performance. (ii) The characteristics and behavior patterns of depressed users will be further analyzed, and we will propose more effective feature solutions for user-level depression detection on OSNs.