Domain-Guided Task Decomposition with Self-Training for Detecting Personal Events in Social Media
Payam Karisani
Emory University

Joyce C. Ho
Emory University

Eugene Agichtein
Emory University
Abstract
Mining social media content for tasks such as detecting personal experiences or events suffers from lexical sparsity, insufficient training data, and inventive lexicons. To reduce the burden of creating extensive labeled data and improve classification performance, we propose to perform these tasks in two steps: (1) decomposing the task into domain-specific sub-tasks by identifying key concepts, thus utilizing human domain understanding; and (2) combining the results of learners for each key concept using co-training to reduce the requirements for labeled training data. We empirically show the effectiveness and generality of our approach, Co-Decomp, using three representative social media mining tasks, namely Personal Health Mention detection, Crisis Report detection, and Adverse Drug Reaction monitoring. The experiments show that our model is able to outperform state-of-the-art text classification models, including those using the recently introduced BERT model, when small amounts of training data are available.
CCS Concepts
• Information systems → Search results deduplication; Social networks; Document filtering; Information extraction; Clustering and classification; Nearest-neighbor search.

Keywords
classification, semi-supervised learning, social media analysis, event detection
1 Introduction

Social networks, such as Twitter and Facebook, have become inseparable parts of societies. A broad spectrum of topics is shared and discussed in these networks every day, and this has turned them into a suitable means for online public monitoring. The applications include, but are not limited to, consumer opinion mining [18], stock market prediction [7], sarcasm detection [12], and user reputation management [3]. These cases signify that social networks, e.g., Twitter, moved beyond their initial purpose, simple personal messaging, years ago. Personal Event Detection is one example of online public monitoring. For instance, in the case of Personal Health Mention detection [30], the aim is to mine and track any individual health event. Scalability, real-time surveillance, and rapid response to potential outbreaks are the main advantages of this task when it is used inside a public health monitoring system. Another example is Crisis Report detection [16] through social media, which aims to mine user postings and alert humanitarian institutions and agencies during natural disasters.

Even though social networks are a valuable source of information, mining user postings comes with several challenges. For instance, the tasks usually suffer from the lack of enough training data [21]. Even in the cases where there are enough resources to construct a training set, the class distributions might be highly imbalanced [1, 33]. Thus, having machine learning models that perform well in this data-scarce environment is of great value.

In classification tasks a common practice is to first extract a set of features, either manually or through representation learning, and then train a classifier over the resulting feature vectors. While training a single classifier over the entire content is a standard practice, an end-to-end classifier may require a substantial amount of annotated data. Instead, for a subset of tasks, we can use domain knowledge to decompose the problem into a set of sub-tasks, and use a separate learner to tackle each one individually. This can lead to the development of models which are equipped with domain understanding and require less training data. For instance, if the task is cancer surveillance on Twitter, in the tweet "I Just went to my Oncology appointment at the Hospital!!! Praying that it's not cancer", we might be able to infer the class label from the contextual information of either the word "I" or the word "cancer". Therefore, we can solve each classification problem individually and aggregate the results.

We propose Co-Decomp, a semi-supervised model that can classify short text for problems with a set of sub-tasks. While our model can potentially be applied to any problem that is centered around a group of concepts or entities, we focus on three personal event detection tasks, because they usually suffer from the lack of training data and imbalanced class distributions, as mentioned earlier. Namely, we focus on Personal Health Mention detection [21], Crisis Report detection [16], and Adverse Drug Reaction monitoring [33], and show that Co-Decomp can outperform state-of-the-art classifiers in semi-supervised settings. In summary, our contributions are:

• We propose Key Concept Sets to decompose a particular category of text classification problems, referred to as decomposable problems, into a set of sub-tasks.
• We introduce a co-training model to effectively utilize the problem decomposition, and reduce the need for training data.
• We show that a category of personal event detection tasks falls into the class of decomposable problems. We carry out comprehensive experiments on four datasets, and show that our model reduces the need for training data, and can outperform state-of-the-art classifiers in the low data regime.

Together, these contributions significantly advance the state of the art in personal event detection and related tasks. Next, we review the related work to place our contributions in context.
2 Related Work

Our model falls into the category of divide-and-conquer algorithms, and this family of algorithms has been employed in text classification before. For example, a pipeline of filtering steps has been applied to documents in order to filter out the confidently negative ones [1]. The main difference between our model and the pipelining approach is that we initially decompose the task into a set of sub-tasks that can be complementary, whereas in the case of pipelining, the final classifier still needs to tackle the same initial task. Additionally, our decomposition reduces the need for training data such that the task can be solved in semi-supervised settings. Our model is also deeply connected to the information extraction [26], relation classification [41], and semantic role labeling [35] tasks in natural language processing. In addition to being agnostic towards the number of entities and their relation type, which are pivotal in the mentioned tasks, our proposal is mainly a new perspective on tackling text classification problems in semi-supervised settings. Thus, in contrast to these tasks, we are not concerned with entity extraction or relation classification; our focus is on how to decompose the classification problem such that the resulting pieces are good representations.

Another related topic, which has inspired our work, is the Annotator Rationale technique introduced in [40]. The authors use manual annotations within documents to derive new training examples. To take into account the possible biases in the synthesized examples, they also adjust the classification model accordingly. Similar to their approach, our model also relies on the annotations within each document. The manual annotation of the sentences within each document raises efficiency concerns about the cost of preparing the training data. However, they carry out a set of extensive experiments and show that the effort of labeling the sentences within each document is not significant. Specifically, they show that when the classification task is predetermined but the set of candidate sentences and words is open and unknown, human annotators can rapidly scan the text and highlight the important sections. In our model, this issue is even less concerning, because once the set of Key Concept Sets is defined, they can be automatically discovered, highlighted, and made ready to annotate. The main difference between Co-Decomp and Annotator Rationale is that our model relies on domain-guided problem decompositions to derive new training examples. Consequently, Co-Decomp is able to divide the initial problem into potentially smaller tasks, and tackle each one individually.

In the context of personal health event detection, the closest work to ours is the WESPAD model introduced in [21], which we include as a baseline. The underlying assumption of WESPAD is that there is enough data to extract good lexical features. Even though this model works well in supervised settings, in Section 6 we will show that it performs poorly in semi-supervised settings. Finally, in contrast to general semi-supervised learning models such as transductive [19], graph-based [42], generative [29], or hybrid models [5], our model is a novel method to incorporate domain knowledge into the learning process. Therefore, our solution can still be implemented in any of the machine learning frameworks which can regulate the interaction between multiple learners, e.g., [6, 14, 32].
In summary, our work advances the state of the art by identifying problem decomposition in text classification tasks, proposing an effective co-training model to utilize the technique, and showing the superiority of the model in semi-supervised settings across multiple tasks.

Figure 1: Illustration of the Co-Decomp method for detecting personal health mentions (cancer), where the task is decomposed into detecting positive human mentions (Class C1) and actual health event (cancer) mentions (Class C2). In the training phase, classifiers for C1 and C2 are trained over the labeled instances of C1 and C2. To label the unseen examples in the test phase, the predictions of the classifiers for C1 and C2 are aggregated.
3 Task Decomposition with Key Concept Sets

We begin this section by presenting an example and explaining the intuition behind Co-Decomp. Consider the task of cancer surveillance on Twitter. The common practice is to extract a set of feature vectors from user postings, manually or automatically, and train a classifier over the extracted vectors. However, this approach has some drawbacks. First, the classifier needs to learn a mapping function from the linguistic patterns that appear in tweets to the class labels. Even if the patterns are not semantically and directly related to the task, the classifier still needs to learn to discard them. Second, no domain understanding is used to tackle the problem. With sufficient training data, classifiers can ultimately discover the right feature set and detect the correct mapping function. But this is not the case in semi-supervised settings with insufficient labels. To address these issues, our proposal is to decompose the task into a set of complementary sub-tasks, and tackle each one individually.

For instance, in the case of cancer surveillance, as shown in Figure 1, the original task can be decomposed into (1) detecting positive mentions of humans (marked by "Task 1" in Figure 1) and (2) detecting positive mentions of the word cancer (marked by "Task 2" in Figure 1). A tweet may contain multiple human mentions and cancer mentions, as shown in the case of the tweet "id: 1" in Figure 1. The mentions that refer to the human with the reported cancer are labeled positive, while the remaining mentions are labeled as negative. Two separate classifiers are trained over the mentions of humans and the mentions of cancer, respectively. The two classifiers are then aggregated in a co-training framework to produce a robust model. In the following subsections, we define Key Concept Sets and decomposable problems. Then, we describe our model, Co-Decomp, which utilizes the problem decomposition in a co-training framework.
3.1 Key Concept Sets

In this section, we introduce Key Concept Sets, which allow us to decompose a problem into a set of sub-tasks. Let π be the distribution over document and class pairs, π : (d, c) ∈ D × {0, 1, ...}, and let V be the vocabulary set. Also let f : (w, d, i) ↦ ℝⁿ be a vector-valued function which captures the contextual information of the i-th occurrence of term w in document d, and maps it into an n-dimensional space of real values. Given a threshold γ, we define K to be a Key Concept Set if:

(1) K ⊆ V;
(2) ∀ w, v ∈ K : ‖f(w, :, :) − f(v, :, :)‖ ≤ γ;
(3) there exists a distribution φ over context vectors and class pairs, φ : (f, c) ∈ f × {0, 1, ...}, such that ∀ d ∈ D, ∃ w ∈ K, ∃ (w, d, i) : (d, c_k) ∼ π ⇔ (f(w, d, i), c_k) ∼ φ.

Thus, a Key Concept Set is a subset of the vocabulary set (condition 1) in which the members are contextually similar, governed by γ (condition 2); and if we train a classifier on the context vectors of its members, there is at least one term in every document whose label is the same as the document label (condition 3). We call a classification problem decomposable if there exists at least one Key Concept Set in the vocabulary set.

Key Concept Sets simplify the classification inference, since the classification over the documents can be replaced with the classification over the key-concept-set terms in the documents. More specifically, the advantages are: First, the dimension of the context function f is usually much smaller than the size of the vocabulary set V, thus feature selection becomes easier. Second, since intuitively there are limited ways of using a word in context, there is less variance in distribution φ in comparison to distribution π, which can virtually model the entire language. Third, as we will discuss in the next section, we can rely on our domain understanding to identify Key Concept Sets, and therefore equip the model with knowledge that it would otherwise need to learn through more training data. This helps the model generalize better with a smaller number of training examples.

3.2 Identifying Key Concept Sets

To identify Key Concept Sets we rely on human knowledge. Our model is proposed for tasks which are tailored for specific entities or concepts. Therefore, we assume that once the problem statement is defined, the identification of the subject entities will be straightforward. To demonstrate that this assumption holds in some real-world scenarios, in Section 4 we present three tasks that follow this motif. Namely, we discuss the Personal Health Mention detection [21], Crisis Report detection [16], and Adverse Drug Reaction monitoring [33] tasks. We show that, even though there is a large body of work behind each one, they can be viewed as decomposable problems and addressed similarly. This is striking, since to the best of our knowledge no connection has been made between these three tasks so far. We conjecture that there may be an even larger set of tasks that have the same attributes and can potentially be decomposable; one particularly interesting case, which we may explore in the future, is the product review task in social media.
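To make the definition in Section 3.1 concrete, the following minimal Python sketch checks the contextual-similarity condition (2) for a candidate set of terms. The embed function, the whitespace tokenization, and the pairwise distance test are our illustrative assumptions, not part of the original formulation.

import numpy as np

def context_vectors(embed, corpus, terms):
    # Collect the context vector of every occurrence of the candidate
    # terms; embed(doc, term, i) stands in for the function f(w, d, i)
    # of Section 3.1 (e.g., a contextual token embedding).
    vectors = []
    for doc in corpus:
        tokens = doc.lower().split()
        for i, tok in enumerate(tokens):
            if tok in terms:
                vectors.append(embed(doc, tok, i))
    return np.stack(vectors)

def satisfies_similarity(vectors, gamma):
    # Condition (2): all pairs of member context vectors must lie
    # within distance gamma of each other.
    return all(np.linalg.norm(a - b) <= gamma
               for a in vectors for b in vectors)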
A short note on the role of human knowledge in our model. Our model is not a human-in-the-loop algorithm. Once the training stage begins, no human supervision is required. In regular learning, the learner mines the entire feature space to detect the conclusive subset of features. To do so, the model requires enough training data. We are in fact eliminating this step, and reducing document-level classification to word-level classification. In other words, we rely on human knowledge to relocate one of the data exploration steps from the learning stage to the design stage. Thus, the learning procedure still occurs, however, in a smaller feature space with less variation. The idea of relying on human knowledge is not novel. For instance, the distant supervision model [26] assumes the user has enough domain expertise to introduce a large noisy dataset. The co-training model [6] assumes the user has enough information about the task to introduce two subsets of features. And the data programming model [32] assumes the user has enough knowledge to provide the learner with a set of heuristics. Interestingly, all of these models are proposed for the low data regime.
3.3 The Co-Decomp Model

The contextual similarity between the members of a Key Concept Set, introduced in the previous section, ensures that sets which can potentially capture different aspects of documents are not combined. (The similarity condition, introduced by γ, does not by itself guarantee orthogonality of the features. However, if two subsets of vocabularies are contextually different, and their context vectors are indicators of the document class, then we assume they can capture different aspects of the document.) Being able to capture multiple views of the same problem, even loosely, has been shown to be effective in models such as co-training [6, 28]. Thus, we propose to use co-training to utilize the problem decomposition. (We consider binary classification problems; however, our model can also generalize to multi-label classification problems.) Algorithm 1 illustrates the training procedure of Co-Decomp. Since there can be multiple occurrences of the members of a Key Concept Set in a document, the problem is viewed as a multiple instance learning problem [9], where each document is called an example, and each set member occurrence in the document is called an instance. The procedure is iterative, and in every iteration the set of labeled instances of every example is used to train a classifier. Then the classifiers are used to label the instances of the unlabeled data, and the examples are labeled according to the multiple instance learning selection metric, e.g., based on their most confident positive instance. Finally, the most confident positive and negative examples of each Key Concept Set are added to the pool of labeled training data.

Algorithm 2 illustrates the test procedure. The array of classifiers trained in Algorithm 1 is used to label the unseen examples. To label every example, each classifier is used to calculate the probability of the example being positive, and then a simple criterion similar to the one proposed in [6] is used to label the example. In a more complicated scenario, each classifier could have a prior reliability score; however, for simplicity we opted for the model proposed in [6].
Algorithm 1: Training Procedure of Co-Decomp

procedure TRAIN
  Given: L: set of labeled examples; U: set of unlabeled examples; J: number of key concept sets; K: number of iterations
  Return: C[1..J]: array of classifiers trained on instances of each key concept set in L and U
  for i ← 1 to K do
    for j ← 1 to J do
      Train C_j on instances of key concept set j in L
      Use C_j and the multiple instance learning metric to label the examples in U
      Store the most confident positive and negative examples in EP_j and EN_j
    for j ← 1 to J do
      Delete EP_j and EN_j from U and add them to L
  Return C[1..J]

Algorithm 2: Test Procedure of Co-Decomp

procedure TEST
  Given: J: number of key concept sets; C[1..J]: array of classifiers; Test: test set
  Return: labeled test set
  for exmpl in Test do
    for j ← 1 to J do
      Use C_j and the multiple instance learning metric to find the probability of exmpl being positive
      Store the corresponding probability in P_j
    if ∏_{j=1}^{J} P_j ≥ ∏_{j=1}^{J} (1 − P_j) then
      exmpl is positive
    else
      exmpl is negative
  Return Test
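A minimal Python sketch of Algorithms 1 and 2 follows. It assumes scikit-learn-style classifiers with predict_proba, a labeled pool containing both classes, and the most-confident-positive-instance rule as the multiple instance learning metric; the paper itself used Mallet's logistic regression (Section 4.4), so this is an illustrative reimplementation rather than the authors' code.

import numpy as np
from sklearn.linear_model import LogisticRegression

def example_prob(clf, instances):
    # MIL selection metric: score an example by the probability of its
    # most confident positive instance.
    return float(clf.predict_proba(instances)[:, 1].max())

def train_co_decomp(labeled, unlabeled, num_sets, iterations, pool_size=4):
    # labeled: list of (views, label); views[j] is an (m_j, n) array of
    # instance vectors for key concept set j. unlabeled: list of views.
    clfs = [LogisticRegression(max_iter=1000) for _ in range(num_sets)]
    for _ in range(iterations):
        for j in range(num_sets):
            X = np.vstack([v[j] for v, _ in labeled])
            y = np.concatenate([np.full(len(v[j]), lbl) for v, lbl in labeled])
            clfs[j].fit(X, y)
        if not unlabeled:
            break
        # Score the unlabeled pool and promote the most confident
        # positive and negative examples into the labeled set.
        order = sorted(range(len(unlabeled)),
                       key=lambda k: max(example_prob(clfs[j], unlabeled[k][j])
                                         for j in range(num_sets)))
        promoted = set(order[:pool_size] + order[-pool_size:])
        for k in sorted(promoted, reverse=True):
            views = unlabeled.pop(k)
            score = max(example_prob(clfs[j], views[j]) for j in range(num_sets))
            labeled.append((views, int(score >= 0.5)))
    return clfs

def predict_co_decomp(clfs, views):
    # Decision rule of Algorithm 2: positive iff the product of the
    # per-set probabilities exceeds the product of their complements.
    probs = [example_prob(c, v) for c, v in zip(clfs, views)]
    return int(np.prod(probs) >= np.prod([1.0 - p for p in probs]))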
A short note on the orthogonality of Key Concept Sets. Multi-view learning techniques [39] are effective even in the presence of correlated views. Particularly in the case of the co-training algorithm, numerous studies have shown that the initial assumption of orthogonality between the views was over-strong. For instance, Balcan, Blum, and Yang [4] propose a theoretical framework and argue that if the classifiers in each view are sufficiently strong PAC-learners, then the initial constraint on the views can be substantially relaxed. In the application domain, Nigam and Ghani [28] show that by randomly splitting lexical features, one can construct two separate views for the co-training algorithm. Jones et al. [20] propose the Co-EM algorithm for information extraction. Their two feature sets are noun phrases and their surrounding contexts. They show that even though these two feature sets are highly correlated, they can still be effective in a co-training model.

In the next section, we use Co-Decomp to propose a solution to a set of personal event detection tasks in social media.
4 Personal Event Detection Tasks

In this section, we show that Co-Decomp is applicable to three important real-world scenarios: Personal Health Mention detection (PHM), Crisis Report detection (CR), and Adverse Drug Reaction monitoring (ADR). We show that these three tasks are decomposable problems and have a unified solution.
4.1 Personal Health Mention Detection

Personal Health Mention detection (PHM) is described in [21], and concerns "identifying postings in social data, which not only contain a specific disease, but also mention a person who is affected". To employ Co-Decomp, we regard the two entities that are present in the problem statement as the Key Concept Sets: (1) the set of all human mentions; and (2) the disease keyword mentioned in the task. We argue that both of the sets loosely follow the conditions described in Section 3.1. Intuitively, all the human mentions have similar contextual vectors (condition 2); and by construction, there is at least one human mention that determines the label of the user posting (condition 3). The same reasoning applies to the second Key Concept Set; there must be at least one occurrence of the disease keyword which determines the label of the user posting (condition 3).

After identifying the Key Concept Sets, the next step is to prepare the training set. We implemented a tool to automatically extract the human mentions and highlight them for manual annotation, similar to the Annotator Rationale method [40]. Since user postings are short, we assumed all the disease mentions in the positive user postings were positive instances of the second Key Concept Set. All the human mentions and disease mentions of the negative user postings were assumed to be negative instances. Thus, the extraction and annotation of the disease mentions, the extraction of the human mentions, and also the annotation of the negative human mentions are all fully automatic. Only the annotation of the positive human mentions is manual: after a tweet is labeled positive, the user is asked to highlight the affected human mention.

We followed Algorithm 1 for training the classifiers, and augmented the labeled data with unlabeled data. To add positive instances of Key Concept Sets to the labeled data, we selected the most confidently labeled instance and its most probable counterpart in the other Key Concept Set; we effectively stored the set of instances as labeled data. For example, assume the classifier trained over disease mentions confidently labeled the word "cancer" positive in the tweet "a friend of me is diagnosed with cancer". Then we added this instance to the set of labeled data, and also used the classifier trained over the human mentions to label the mentions of humans in the tweet, i.e., "friend" and "me", and selected the most confident one and added it to the labeled data. To add negative instances of Key Concept Sets to the labeled data, we selected the examples in which all of the instances were confidently labeled negative, and added them to the labeled data. To test our model, we followed Algorithm 2.
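As an illustration of the labeling scheme just described, here is a minimal sketch that derives instance labels from a labeled tweet; the tokenization, the data layout, and the function name are hypothetical simplifications, not the authors' tooling.

def derive_instances(tokens, tweet_label, disease_terms, human_indices,
                     positive_human_index=None):
    # Return (token_index, key_concept_set, label) triples. Disease
    # mentions inherit the tweet label. Human mentions in negative tweets
    # are negative; in positive tweets, only the manually highlighted
    # affected human mention is positive, and the rest are negative.
    instances = []
    for i, tok in enumerate(tokens):
        if tok.lower() in disease_terms:
            instances.append((i, "disease", tweet_label))
    for i in human_indices:
        if tweet_label == 1 and i == positive_human_index:
            instances.append((i, "human", 1))
        else:
            instances.append((i, "human", 0))
    return instances

# Example: "i heard my cousin is diagnosed with cancer", labeled positive,
# with "cousin" highlighted as the affected human mention.
tokens = "i heard my cousin is diagnosed with cancer".split()
print(derive_instances(tokens, 1, {"cancer"}, human_indices=[0, 3],
                       positive_human_index=3))
# -> [(7, 'disease', 1), (0, 'human', 0), (3, 'human', 1)]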
4.2 Crisis Report Detection

Crisis Report detection (CR) as defined in [17] concerns "detecting reports of casualties and/or injured people due to the crisis, or reports and/or questions about missing or found people". (There are also other variations of this task, e.g., displacing or evacuating people, during different incidents [2].) We regard the two entities mentioned in the problem statement as the Key Concept Sets: (1) the set of all human mentions; and (2) the crisis keyword mentioned in the task. In this study, we focus on the reports which were posted during an earthquake. To prepare the training set and evaluate our model, we followed the same procedure that we used for the PHM problem.

4.3 Adverse Drug Reaction Monitoring

Adverse Drug Reaction monitoring (ADR) is defined in [11], and is meant for "detecting personal injuries resulting from medical drug use". We regard the two entities mentioned in the problem statement as the Key Concept Sets: (1) the set of all human mentions; and (2) the set of all drug mentions. To prepare the training set and evaluate our model, we re-implemented all the decisions that we made for the PHM problem.
4.4 Implementation Details

In this section we provide a detailed explanation of the modules and components used in Co-Decomp to address the tasks mentioned earlier. Specifically, we discuss the context function described in Section 3.1, the classifiers described in Section 3.3, the extraction of the Key Concept Sets mentioned in Sections 4.1, 4.2, and 4.3, and finally learning the representations of the Key Concept Sets.
Context Function. We used contextual embeddings as the context function described in Section 3.1. We used the BERT model [8], even though other models such as ELMo [31] could also be used. We used the base variant, and pre-trained it on Twitter data; see below for the details about pre-training.
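As a rough illustration of such a context function, the sketch below extracts a per-occurrence token vector with the Hugging Face transformers library; the paper does not name this library, and the model checkpoint and word-piece averaging are our assumptions.

import torch
from transformers import BertModel, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()

def context_vector(text, char_start, char_end):
    # Return a context vector f(w, d, i) for the term occupying
    # [char_start, char_end) in text, by averaging the last-layer
    # hidden states of its word pieces.
    enc = tokenizer(text, return_tensors="pt", return_offsets_mapping=True)
    offsets = enc.pop("offset_mapping")[0]
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]
    piece_ids = [k for k, (s, e) in enumerate(offsets.tolist())
                 if s < char_end and e > char_start and e > s]
    return hidden[piece_ids].mean(dim=0)

vec = context_vector("friend of mine has cancer", 19, 25)  # "cancer"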
Used Classifiers. We used a logistic regression classifier as the learner mentioned in Section 3.3. (We made this decision based on implementation considerations.) Thus, after fine-tuning the embeddings using the training data, we used the contextual features to train the logistic regression classifiers. The Mallet implementation of logistic regression [24] was used in this step.

Key Concept Set Extraction. To detect human mentions we used a weak rule-based classifier. The accurate detection of human mentions is out of our research scope; here, we aim to show that even a weak human mention detector can contribute to the performance. The rules for human mention detection were as follows: Using the Stanford Named Entity Recognition (NER) tagger [10] we labeled all of the "PERSON" tags. Using the Stanford Part of Speech (POS) tagger [36] we labeled all of the personal pronoun tags except for the word "it". We also labeled all of the Twitter mentions, indicated by the sign "@". Finally, we used a dictionary of 240 words manually collected from the Web to cover the remaining cases. Since not all of the human mentions are explicitly referred to in user postings, we also used a simple noisy rule-based human mention synthesizer: If a sentence started with a past tense verb we inserted the word "i" at the beginning. If a sentence started with an adjective we inserted "i am" at the beginning. If a sentence started with a past participle verb we inserted "i have" at the beginning. If a sentence started with a present continuous verb we inserted "i am" at the beginning. And finally, if a sentence started with "is", we replaced it with "i am". We empirically developed these rules, and as mentioned earlier, to achieve a better performance they can be replaced with more sophisticated models.

The model relies on the positive mentions of humans in the positive tweets, described in Section 4.1. One of the authors of the article supplied the annotations. The rules for the annotation were as follows: The explicit mentions of the humans which are associated with the event (either disease, disaster, or drug injury) should be annotated. If an explicit mention does not exist, the implicit mentions which are associated with the event should be annotated.

To extract the disease Key Concept Set mentioned in Section 4.1, we conducted a keyword search for the disease name in the task description. For instance, if the task is about Parkinson's disease surveillance, the disease Key Concept Set contains the word {Parkinson's}. To extract the crisis Key Concept Set mentioned in Section 4.2, we also performed a keyword search for the incident in the task description. As mentioned earlier, in this study we focused on an earthquake incident. Thus, the crisis Key Concept Set contains the keywords {earthquake, quake}. To extract the drug Key Concept Set described in Section 4.3, we used the list of drug names published in [33], and conducted a keyword search for the drug names in the list.
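The following sketch approximates the rule-based detector with a pronoun list, an @-mention pattern, and a tiny dictionary; the actual system used the Stanford NER and POS taggers and a 240-word dictionary, so this is a simplified stand-in, and the word lists here are placeholders.

import re

# Simplified stand-ins for the paper's resources.
PRONOUNS = {"i", "you", "he", "she", "we", "they", "me", "him", "her",
            "us", "them", "myself", "yourself", "himself", "herself"}
HUMAN_DICT = {"friend", "cousin", "mom", "dad", "brother", "sister",
              "neighbor", "coworker", "grandma", "grandpa"}

def detect_human_mentions(tokens):
    # Return the indices of tokens judged to be human mentions.
    hits = []
    for i, tok in enumerate(tokens):
        low = tok.lower()
        if low in PRONOUNS or low in HUMAN_DICT or re.fullmatch(r"@\w+", tok):
            hits.append(i)
    return hits

tokens = "@john said my cousin is diagnosed with cancer".split()
print(detect_human_mentions(tokens))  # -> [0, 3]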
Learning Key Concept Set Representations. Since the human mentions are lexically different, although we expect them to be contextually similar, we replaced all of them with a mask token HUM_TOK and learned its representation. To do so, we collected a set of 7,598,545 random tweets through the Twitter API in October 2018, replaced all the human mentions with this token, and pre-trained the base variant of the BERT model for 10 epochs, with the default hyperparameters mentioned in [8]. The word vectors used in the personal health mention detection and crisis report detection tasks are the output of this model. To unify the representations of the drug mentions, we used the list of drug names published in [33] to collect a set of 28,710 tweets containing the drug names (we used the Twitter streaming API for four weeks and collected about 300K tweets; however, we found that the majority of them were duplicates), replaced the names with DRUG_TOK, and further pre-trained the above-mentioned model for 10 epochs. The word vectors used in the adverse drug reaction monitoring task are the output of this model.
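The masking step can be sketched as below, reusing the hypothetical detect_human_mentions function from the previous sketch; the drug lexicon is a placeholder for the list published in [33], and the two masking passes are combined here for brevity.

DRUG_NAMES = {"ibuprofen", "seroquel", "metformin"}  # placeholder lexicon

def mask_tweet(text, human_detector, drug_names=DRUG_NAMES):
    # Replace human mentions with HUM_TOK and drug names with DRUG_TOK,
    # producing the kind of pre-training corpus described above.
    tokens = text.split()
    human_idx = set(human_detector(tokens))
    masked = []
    for i, tok in enumerate(tokens):
        if i in human_idx:
            masked.append("HUM_TOK")
        elif tok.lower() in drug_names:
            masked.append("DRUG_TOK")
        else:
            masked.append(tok)
    return " ".join(masked)

print(mask_tweet("my cousin takes seroquel", detect_human_mentions))
# -> "my HUM_TOK takes DRUG_TOK"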
5 Experimental Setup

In this section we first describe the datasets that we used in the experiments, then review the baselines that we implemented, and finally discuss the training procedure.

5.1 Datasets

For the personal health mention detection task we used two datasets. First, the dataset introduced in [23], which we call the FLU dataset. (We used the infection vs. awareness version of the FLU dataset; for detailed information about the datasets, please refer to the cited articles.) At the time of downloading this dataset, there were still 2,837 tweets available to crawl, of which 49% are negative (awareness tweets) and 51% are positive (reporting actual cases of flu). Second, the dataset introduced in [21], which we call the PHM dataset. At the time of downloading this dataset, there were 7,192 tweets available to crawl. This dataset covers 6 diseases: Alzheimer's, heart attack, Parkinson's, cancer, depression, and stroke. All of these sub-datasets are highly imbalanced; positive examples span between 11% and 40% of the cases. For the crisis report detection task, we used the earthquake-related dataset introduced in [17], which we call the CRISIS dataset. (Reference [17] also introduces a few more datasets. We used the California earthquake version, and split by the injured and missing vs. other categories.) This dataset contains a set of 2,013 tweets which were posted during the California earthquake in 2014. Only 11% of the tweets in this dataset are positive cases of injured or missing people. For the adverse drug reaction monitoring task, we used the dataset introduced in [33], which we call the ADR dataset. At the time of crawling the dataset, there were 4,355 tweets available. This dataset is also highly imbalanced; only 10% of the tweets are positive cases of drug injuries. Table 1 summarizes the 4 datasets and their target prediction tasks.

Table 1: Summary of the FLU [23], PHM [21], CRISIS [17], and ADR [33] datasets and their associated prediction tasks. The third and fourth columns report the size of the dataset and the percentage of positive tweets, respectively.

Name         Target              Size   Pos. (%)
FLU [23]     Positive flu cases  2837   51
PHM [21]     Alzheimer's         1256   18
PHM [21]     Heart attack        1219   13
PHM [21]     Parkinson's         1040   11
PHM [21]     Cancer              1242   21
PHM [21]     Depression          1213   40
PHM [21]     Stroke              1222   14
CRISIS [17]  Injured or missing  2013   11
ADR [33]     Drug injuries       4355   10

5.2 Baselines

To compare the performance of our method, we implemented the following methods and classifiers. Model hyperparameters were tuned based on the training folds and datasets, and in most cases their optimal values were dependent on the training data.

NB. A Naive Bayes classifier trained over unigrams and bigrams, as it has been shown to perform well with small training sets [27].

EM. We implemented the Expectation Maximization algorithm proposed by [29], which is known to work well in semi-supervised settings. We experimented with the set of {10, 20, 50, 100} for the number of unlabeled documents.

FastText. We trained the shallow neural network classifier introduced in [13], which can update word embeddings during training. We experimented with {0.05, 0.1, 0.25, 0.5} for the learning rate, and {2, 4} for the window size.
WESPAD. We trained the PHM model introduced in [21], which is specifically designed for Personal Health Mention detection. We experimented with {3, 4, 5} for the number of clusters, and {0.05, 0.15, 0.3} for the threshold values.
BERT-BASE. We included the model introduced in [8], which is named BERT and uses a multi-layer transformer encoder followed by one layer of a fully connected neural network for binary classification problems. In the experiments we observed that the large variant shows poor performance when the training data is small, thus we report the results of the base variant, BERT-BASE, which has fewer layers. We followed the parameter settings suggested in [8], but empirically observed that if we set the number of epochs for fine-tuning to 15, the model is more stable and performs better.

BERT-TW. Since we experimented with Twitter data, we also pre-trained BERT in order to adjust the language model. Thus, we used the set of 7 million tweets described in Section 4.4 to further pre-train BERT-BASE for 10 epochs, without replacing human mentions. The hyperparameters were set to what is suggested in [8], and by the time the pre-training was done, the performance of the internal language modeling tasks for sample tweets was similar to the performance of BERT-BASE for sample Wikipedia pages.

BERT-DR. We also used the set of drug-related tweets mentioned in Section 4.4, without replacing the drug mentions, to further pre-train BERT-TW to be used in the ADR task. We used the same parameter setting as BERT-TW.

Co-BE-LE. In order to boost the BERT model with bootstrapping, we also included a co-training model with two learners: one Naive Bayes classifier trained over unigrams and bigrams, and one logistic regression classifier trained over the BERT-TW or BERT-DR representation of the tweets, depending on the task. We experimented with {13, 25, 50} as the number of iterations in the co-training model.

Co-Decomp. Our method described in Section 4. We empirically set the number of iterations in the co-training model to 25, based on the training and development folds in the FLU dataset, and did not do any further tuning beyond what we did for BERT-TW. We report all the results with this setting unless stated otherwise.
5.3 Training and Evaluation Procedure

We used standard 10-fold cross validation to train, validate, and test all of the models. To evaluate the models in semi-supervised settings, we did not use the entire training and validation data, but randomly sampled a few examples and used the rest of the examples as unlabeled data. In the next section, we report the results when we have 100 training examples; however, we also show that our model still performs well when the number of available training examples increases. To split the datasets into the folds, we used stratified sampling to preserve the original class distribution in the datasets. We also kept the folds and samples identical across the experiments to ensure that all of the models use exactly the same training and test data. Since there is natural randomness in neural network initialization and regularization techniques, we carried out all of the experiments 5 times, and averaged the performance results. Because the datasets are highly imbalanced, following the argument in [25], we used the F1 measure in the positive class to tune the models. In the next section we report F1, Precision, and Recall in the positive class, averaged over the test folds.
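A minimal sketch of this evaluation protocol with scikit-learn follows; the exact sampling of the labeled subset and the semi-supervised fit signature are our assumptions rather than the paper's implementation.

import numpy as np
from sklearn.model_selection import StratifiedKFold, train_test_split
from sklearn.metrics import f1_score

def evaluate(model_fn, X, y, n_labeled=100, repeats=5, seed=0):
    # 10-fold CV; within each training fold, keep only n_labeled
    # stratified examples as labeled data and treat the rest as
    # unlabeled. Reports positive-class F1 averaged over folds/repeats.
    scores = []
    skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
    for train_idx, test_idx in skf.split(X, y):
        for r in range(repeats):
            lab_idx, unlab_idx = train_test_split(
                train_idx, train_size=n_labeled, stratify=y[train_idx],
                random_state=seed + r)
            model = model_fn()
            # Hypothetical semi-supervised fit(X_labeled, y, X_unlabeled).
            model.fit(X[lab_idx], y[lab_idx], X[unlab_idx])
            scores.append(f1_score(y[test_idx], model.predict(X[test_idx]),
                                   pos_label=1))
    return float(np.mean(scores))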
6 Results

In this section, we first report the performance results on the FLU, PHM, CRISIS, and ADR datasets, and then analyze our model through a series of experiments.
6.1 Main Results

Table 2 summarizes the F1, precision, and recall of the models on the FLU and PHM datasets; the results on the PHM dataset are averaged over the topics. Table 3 summarizes the results on the CRISIS dataset, and Table 4 reports the results on the ADR dataset. We also report the performance of the models on the PHM dataset across all the topics in Table 5. The experiments show that Co-Decomp outperforms state-of-the-art classifiers across the majority of the tasks. We can see that the improvements on the imbalanced datasets (PHM and ADR) are more noticeable than the improvements on the balanced dataset (FLU). We can also see that the semi-supervised learning model Co-BE-LE performs relatively well, although it has a low precision. In contrast, our model maintains a high precision. We attribute this advantage to the easier tasks that Co-Decomp is tackling, i.e., selecting the most confident unlabeled instances via the context representations rather than via the document representations. Finally, the results suggest that crisis report detection is an easier problem than adverse drug reaction monitoring, because even though both CRISIS and ADR have about 10% positive examples, the performance of the models on the ADR dataset is much lower. We will discuss this dataset in more detail in the next section.

Table 2: F1, Precision, and Recall of the models (NB, EM, FastText, WESPAD, BERT-BASE, BERT-TW, Co-BE-LE, and Co-Decomp) on the FLU and PHM datasets.

Table 3: F1, Precision, and Recall of the models on the CRISIS dataset.

Table 4: F1, Precision, and Recall of the models (with BERT-DR in place of BERT-TW) on the ADR dataset.

Table 5: F1, Precision, and Recall of the models on the PHM dataset, broken down by topic (Alzheimer's, heart attack, Parkinson's, cancer, depression, and stroke).
6.2 Analysis

To better understand the impact of each component in our model, we report the results of an ablation study in Table 6. Since the PHM dataset was the most diverse dataset (it comprises 6 sub-topics), we carried out the experiment on this dataset. The results show that the weak human mention classifier clearly contributes to the performance when it is combined with the disease mention classifier. A further improvement is achieved when co-training iterations are performed. However, the improvement after 50 iterations comes at the cost of a dramatic deterioration in precision, which might not be desirable.

Table 6: Improvement analysis on the PHM dataset: the performance of the human mention classifier (Human-cl), the disease mention classifier (Disease-cl), their combination in the co-training framework without adding unlabeled data (Combined), and when unlabeled data is added per co-training iteration (13, 25, 50, and 75 iterations; 4 unlabeled documents are added in every iteration).

In Section 6.1, we observed that the performance of the models on the ADR dataset was very low. To investigate the performance of the models as a function of the training set size, in Figure 2 we report the performance of Co-Decomp in comparison to the state-of-the-art BERT-DR classifier at different training set size cut-offs on this dataset. (The ADR task has been extensively explored in supervised settings [34, 37, 38]; however, studies on semi-supervised ADR are limited [15].) The results show that even in supervised settings our model is on par with strong classifiers; for this dataset and with manual feature engineering, an F1 of 0.538 is reported in [33].

Figure 2: F1 at different training set size cut-offs for the BERT-DR and Co-Decomp models on the ADR dataset. There are 3,919 examples in the training folds of the ADR dataset, excluding the test folds in 10-fold cross validation.

Finally, in real-world situations, practitioners who try to tackle a classification problem often have a small training set for the task and a larger, more diverse training set in similar domains. We tried to evaluate our model in such a scenario. Thus, we assumed the FLU dataset was the small training set available for influenza surveillance in social media, and the PHM dataset was the bigger, more diverse dataset available for similar domains. In Table 7, we report the results of domain adaptation on the FLU dataset when we use the PHM dataset as the out-of-domain training data. We randomly sampled 500 positive and 500 negative examples from the PHM dataset and fine-tuned the models; we then further fine-tuned them using the training folds of the FLU dataset, and finally used them to label the FLU test folds. We used this approach to prevent the catastrophic forgetting phenomenon in neural networks [22] (a sketch of this two-stage procedure follows at the end of this section). The results signify that even with a moderately large balanced training set, a supervised model cannot outperform Co-Decomp.

Table 7: Domain adaptation results on the FLU dataset (BERT-TW vs. Co-Decomp). 1,000 training examples, 500 positives and 500 negatives, were randomly sampled from the PHM dataset as the out-of-domain data.

6.3 Discussion

In this study we defined problem decomposition, and showed that it has at least three important real-world applications in social media. Our model is defined for tasks that are centered around a set of entities or concepts. Co-Decomp can also be regarded as an approach to incorporate domain knowledge into machine learning models. In Section 3.1, we presented three arguments that explain why our model is effective: (1) the vector representation of words is smaller than the vector representation of documents, thus classification is easier over the words; (2) there are limited ways of using a word in context; and (3) the model is equipped with domain knowledge. The last argument is based on the fact that we use domain understanding to impose a new inductive bias on the learner, by removing less important word features and targeting the pivotal entities in the task.
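As referenced above, the two-stage fine-tuning for domain adaptation can be sketched as follows; the trainer interface and the epoch counts are generic stand-ins rather than reported settings.

def domain_adapt(model, out_domain, in_domain_train, in_domain_test,
                 out_epochs=3, in_epochs=3):
    # Fine-tune first on the out-of-domain sample (e.g., 500 positive and
    # 500 negative PHM tweets), then on the in-domain training folds, so
    # that the final in-domain pass counteracts catastrophic forgetting
    # [22] before labeling the in-domain test folds.
    model.fit(out_domain.X, out_domain.y, epochs=out_epochs)
    model.fit(in_domain_train.X, in_domain_train.y, epochs=in_epochs)
    return model.predict(in_domain_test.X)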
7 Conclusions

We proposed a novel semi-supervised model for classification tasks that are centered around specific entities or concepts. Our model is based on (1) decomposing the problem into a set of sub-tasks, and (2) combining the results in a co-training framework. By leveraging domain knowledge to decompose problems, and employing a co-training framework to reinforce the underlying classifiers, our model Co-Decomp is able to generalize well and outperform state-of-the-art classifiers in semi-supervised settings. We showed that our model is applicable to at least three important personal event detection problems, namely Personal Health Mention detection, Crisis Report detection, and Adverse Drug Reaction monitoring. We also carried out extensive experiments and reported the performance of the model in various settings. The results indicate that Co-Decomp is able to consistently and significantly outperform state-of-the-art classifiers in the three mentioned tasks.
Our current research introduces three potential future work directions. First, investigating other tasks which may be decomposable. As we discussed in Section 3.2, tasks that are centered around entities and concepts are potential targets. For instance, our model can be applied to the customer satisfaction task, where the mentions of the human and the product can serve as candidate Key Concept Sets. The next two future directions concern the theoretical aspects of our method. One direction is to investigate the extent to which the choice of Key Concept Sets can impact the model performance. This will help us to understand whether our model can be applied to tasks where the domain understanding is incomplete. Even though our experiments with a weak human mention detector showed promising results, we believe further investigation is required to understand if noisy Key Concept Sets can still be beneficial. The last future direction is to investigate ways of automatically discovering Key Concept Sets.
Acknowledgments
This work was funded by Emory University; also partially by NIH grant LM013014-02, NSF award IIS-
References

[1] Mohammad Akbari, Xia Hu, Liqiang Nie, and Tat-Seng Chua. 2016. From Tweets to Wellness: Wellness Event Detection from Twitter Streams. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, February 12-17, 2016, Phoenix, Arizona, USA.
[2] Twelfth International AAAI Conference on Web and Social Media.
[3] Abolfazl AleAhmad, Payam Karisani, Maseud Rahgozar, and Farhad Oroumchian. 2016. OLFinder: Finding opinion leaders in online social networks. J. Information Science 42, 5 (2016), 659–674.
[4] Maria-Florina Balcan, Avrim Blum, and Ke Yang. 2004. Co-training and Expansion: Towards Bridging Theory and Practice. In Proceedings of the 17th International Conference on Neural Information Processing Systems (NIPS'04). MIT Press, Cambridge, MA, USA, 89–96.
[5] David Berthelot, Nicholas Carlini, Ian Goodfellow, Nicolas Papernot, Avital Oliver, and Colin Raffel. 2019. MixMatch: A Holistic Approach to Semi-Supervised Learning. arXiv preprint arXiv:1905.02249 (2019).
[6] Avrim Blum and Tom M. Mitchell. 1998. Combining Labeled and Unlabeled Data with Co-Training. In Proceedings of the Eleventh Annual Conference on Computational Learning Theory, COLT 1998, Madison, Wisconsin, USA, July 24-26, 1998.
[7] Johan Bollen, Huina Mao, and Xiaojun Zeng. 2011. Twitter mood predicts the stock market. J. Comput. Science 2, 1 (2011), 1–8.
[8] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. CoRR abs/1810.04805 (2018). arXiv:1810.04805
[9] Thomas G. Dietterich, Richard H. Lathrop, and Tomás Lozano-Pérez. 1997. Solving the Multiple Instance Problem with Axis-Parallel Rectangles. Artificial Intelligence 89, 1-2 (1997), 31–71.
[10] Jenny Rose Finkel, Trond Grenager, and Christopher D. Manning. 2005. Incorporating Non-local Information into Information Extraction Systems by Gibbs Sampling. In ACL 2005, 43rd Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference, 25-30 June 2005, University of Michigan, USA. 363–370.
[11] Rachel Ginn, Pranoti Pimpalkhute, Azadeh Nikfarjam, Apurv Patki, Karen O'Connor, Abeed Sarker, Karen Smith, and Graciela Gonzalez. 2014. Mining Twitter for adverse drug reaction mentions: a corpus and classification benchmark. In Proceedings of the fourth workshop on building and evaluating resources for health and biomedical text processing. 1–8.
[12] Roberto I. González-Ibáñez, Smaranda Muresan, and Nina Wacholder. 2011. Identifying Sarcasm in Twitter: A Closer Look. In The 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Proceedings of the Conference, 19-24 June, 2011, Portland, Oregon, USA - Short Papers. 581–586.
[13] Edouard Grave, Tomas Mikolov, Armand Joulin, and Piotr Bojanowski. 2017. Bag of Tricks for Efficient Text Classification. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2017, Valencia, Spain, April 3-7, 2017, Volume 2: Short Papers. 427–431.
[14] Melody Y. Guan, Varun Gulshan, Andrew M. Dai, and Geoffrey E. Hinton. 2018. Who Said What: Modeling Individual Labelers Improves Classification. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18), the 30th innovative Applications of Artificial Intelligence (IAAI-18), and the 8th AAAI Symposium on Educational Advances in Artificial Intelligence (EAAI-18), New Orleans, Louisiana, USA, February 2-7, 2018. 3109–3118.
[15] Shashank Gupta, Manish Gupta, Vasudeva Varma, Sachin Pawar, Nitin Ramrakhiyani, and Girish Keshav Palshikar. 2018. Co-training for extraction of adverse drug reaction mentions from tweets. In European Conference on Information Retrieval. Springer, 556–562.
[16] Muhammad Imran, Carlos Castillo, Fernando Diaz, and Sarah Vieweg. 2015. Processing Social Media Messages in Mass Emergency: A Survey. ACM Comput. Surv. 47, 4, Article 67 (June 2015), 38 pages.
[17] Muhammad Imran, Prasenjit Mitra, and Carlos Castillo. 2016. Twitter as a Lifeline: Human-annotated Twitter Corpora for NLP of Crisis-related Messages. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016) (23-28). European Language Resources Association (ELRA), Paris, France.
[18] Bernard J. Jansen, Mimi Zhang, Kate Sobel, and Abdur Chowdury. 2009. Twitter power: Tweets as electronic word of mouth. JASIST 60, 11 (2009), 2169–2188.
[19] Thorsten Joachims. 1999. Transductive Inference for Text Classification using Support Vector Machines. In Proceedings of the Sixteenth International Conference on Machine Learning (ICML 1999), Bled, Slovenia, June 27-30, 1999. 200–209.
[20] Rosie Jones, Rayid Ghani, Tom Mitchell, and Ellen Riloff. 2003. Active learning for information extraction with multiple view feature sets. Proc. of Adaptive Text Extraction and Mining, EMCL/PKDD-03, Cavtat-Dubrovnik, Croatia (2003), 26–34.
[21] Payam Karisani and Eugene Agichtein. 2018. Did You Really Just Have a Heart Attack?: Towards Robust Detection of Personal Health Mentions in Social Media. In Proceedings of the 2018 World Wide Web Conference on World Wide Web, WWW 2018, Lyon, France, April 23-27, 2018. 137–146.
[22] James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A. Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, Demis Hassabis, Claudia Clopath, Dharshan Kumaran, and Raia Hadsell. 2017. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences.
[23] Alex Lamb, Michael J. Paul, and Mark Dredze. 2013. Separating Fact from Fear: Tracking Flu Infections on Twitter. In Human Language Technologies: Conference of the North American Chapter of the Association of Computational Linguistics, Proceedings, June 9-14, 2013, Westin Peachtree Plaza Hotel, Atlanta, Georgia, USA. 789–795.
[24] Andrew Kachites McCallum. 2002. Mallet: A machine learning for language toolkit. (2002).
[25] Richard Mccreadie, Cody Buntain, and Ian Soboroff. 2019. TREC Incident Streams: Finding Actionable Information on Social Media. In Proceedings of the 16th International Conference on Information Systems for Crisis Response and Management (ISCRAM), 2019.
[26] Mike Mintz, Steven Bills, Rion Snow, and Daniel Jurafsky. 2009. Distant supervision for relation extraction without labeled data. In ACL 2009, Proceedings of the 47th Annual Meeting of the Association for Computational Linguistics and the 4th International Joint Conference on Natural Language Processing of the AFNLP, 2-7 August 2009, Singapore. 1003–1011.
[27] Andrew Y. Ng and Michael I. Jordan. 2001. On Discriminative vs. Generative Classifiers: A comparison of logistic regression and naive Bayes. In Advances in Neural Information Processing Systems 14 [Neural Information Processing Systems: Natural and Synthetic, NIPS 2001, December 3-8, 2001, Vancouver, British Columbia, Canada]. 841–848.
[28] Kamal Nigam and Rayid Ghani. 2000. Analyzing the Effectiveness and Applicability of Co-training. In Proceedings of the 2000 ACM CIKM International Conference on Information and Knowledge Management, McLean, VA, USA, November 6-11, 2000. 86–93.
[29] Kamal Nigam, Andrew McCallum, Sebastian Thrun, and Tom M. Mitchell. 2000. Text Classification from Labeled and Unlabeled Documents using EM. Machine Learning 39, 2/3 (2000), 103–134.
[30] Michael J. Paul and Mark Dredze. 2017. Social Monitoring for Public Health. Morgan & Claypool Publishers.
[31] Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep Contextualized Word Representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2018, New Orleans, Louisiana, USA, June 1-6, 2018, Volume 1 (Long Papers). 2227–2237.
[32] Alexander J. Ratner, Christopher De Sa, Sen Wu, Daniel Selsam, and Christopher Ré. 2016. Data Programming: Creating Large Training Sets, Quickly. In Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain. 3567–3575.
[33] Abeed Sarker and Graciela Gonzalez. 2015. Portable automatic text classification for adverse drug reaction detection via multi-corpus training. Journal of Biomedical Informatics 53 (2015), 196–207.
[34] Gabriel Stanovsky, Daniel Gruhl, and Pablo Mendes. 2017. Recognizing mentions of adverse drug reaction in social media using knowledge-infused recurrent models. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers. 142–151.
[35] Emma Strubell, Patrick Verga, Daniel Andor, David Weiss, and Andrew McCallum. 2018. Linguistically-Informed Self-Attention for Semantic Role Labeling. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 - November 4, 2018. 5027–5038.
[36] Kristina Toutanova, Dan Klein, Christopher D. Manning, and Yoram Singer. 2003. Feature-Rich Part-of-Speech Tagging with a Cyclic Dependency Network. In Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics, HLT-NAACL 2003, Edmonton, Canada, May 27 - June 1, 2003.
[37] Elena Tutubalina and Sergey Nikolenko. 2017. Combination of deep recurrent neural networks and conditional random fields for extracting adverse drug reactions from user reviews. Journal of Healthcare Engineering.
[38] Proceedings of the Fourth Social Media Mining for Health Applications (#SMM4H) Workshop.
[39] Chang Xu, Dacheng Tao, and Chao Xu. 2013. A Survey on Multi-view Learning. arXiv preprint arXiv:1304.5634 (2013).
[40] Omar Zaidan, Jason Eisner, and Christine D. Piatko. 2007. Using "Annotator Rationales" to Improve Machine Learning for Text Categorization. In Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics, Proceedings, April 22-27, 2007, Rochester, New York, USA.
[41] Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016, August 7-12, 2016, Berlin, Germany, Volume 2: Short Papers.
[42] Xiaojin Zhu, Zoubin Ghahramani, and John D. Lafferty. 2003. Semi-Supervised Learning Using Gaussian Fields and Harmonic Functions. In Proceedings of the Twentieth International Conference on Machine Learning (ICML 2003), Washington, DC, USA. 912–919.