Multi-task transfer learning for finding actionable information from crisis-related messages on social media
Congcong Wang
School of Computer Science, University College Dublin, Dublin, Ireland
[email protected]
David Lillis
School of Computer Science, University College Dublin, Dublin, Ireland
[email protected]
Abstract
The Incident Streams (IS) track is a research challenge aimed at finding important information in social media posts during crises, for emergency response purposes. More specifically, given a stream of crisis-related tweets, the IS challenge asks a participating system to 1) classify the types of users' concerns or needs expressed in each tweet, known as the information type (IT) classification task, and 2) estimate how critical each tweet is with regard to emergency response, known as the priority level prediction task. In this paper, we describe our multi-task transfer learning approach for this challenge. Our approach leverages state-of-the-art transformer models, including both encoder-based models such as BERT and the sequence-to-sequence based T5, for joint transfer learning on the two tasks. Based on this approach, we submitted several runs to the track. The returned evaluation results show that our runs substantially outperform other participating runs in both IT classification and priority level prediction.
1 Introduction

Social media platforms such as Twitter have made it possible for users to report on an ongoing event in their vicinity in a timely manner (Fraustino et al., 2012). This has motivated researchers to explore the potential of social media platforms for finding actionable information in this user-generated content during a crisis event (Caragea et al., 2011; Imran et al., 2015; McCreadie et al., 2019). Finding this type of information is especially important for emergency response agencies, enabling them to take immediate action to help those who are posting for help, which is known as situational awareness (Vieweg, 2012; Vieweg et al., 2010). This naturally raises the question: how can the process of finding actionable information be automated effectively, given that the messages posted on social media during a crisis are usually noisy and numerous?

The Incident Streams (IS) track (McCreadie et al., 2019, 2020) was proposed by the Text REtrieval Conference (TREC) as a research challenge for this purpose. Since it was introduced in 2018, the IS track has conducted two major tasks concerning crisis short message processing. Given a stream of tweets from crisis events, the foremost task asks a participating system to classify the information types (ITs) of each tweet. The ITs are a pre-defined set of classes covering what a user is likely to post during a crisis. An IT can be something important, such as requesting search and rescue, call for moving people, or reporting goods available, as well as something less important, such as reporting weather or location, or expressing sentiment. In addition to the IT classification task, the IS track also asks participating systems to estimate the priority level of each tweet, indicating how important the tweet is for taking immediate emergency response actions.
The IS track pre-defines four priority levels: critical, high, medium and low, ordered from highest to lowest priority.

The IS track was run once in 2018 and twice in each subsequent year, and so had accumulated five editions as of 2020. For each edition, an annotated collection of tweets from previous editions is provided as training data for the community, and unseen (non-annotated) tweets are released as the test tweets for official evaluation. The two most recent editions, conducted in 2020, are named 2020A and 2020B respectively. Slightly differently from previous editions, these two editions introduce a reduced set of ITs as well as a set of test tweets related to the COVID-19 pandemic, resulting in the three tasks described below. (The IS track pre-defines 6 important ITs known as "actionable" ITs, with the remaining 19 considered "non-actionable". For details, see McCreadie et al. (2019).)

• Task 1: This task remains the same as in the editions before 2020; it uses all 25 ITs for classification and the four priority levels for estimation.

• Task 2: Unlike Task 1, this task only asks participating systems to classify one or more of 12 IT classes: 11 ITs that are closely related to emergency response, with the remainder grouped as "Other-Any".

• Task 3: Unlike Tasks 1 and 2, which relate to general crises such as earthquakes, explosions or hurricanes, this task focuses on the COVID-19 domain. It provides a stream of COVID-related tweets from different locations, for IT classification using only a subset of 9 ITs suitable for COVID-19, and for priority estimation using the same four priority levels as in Tasks 1 and 2.

In this paper, we describe our system's approach to the three tasks of the IS track, from our participation in both 2020A and 2020B. We submitted different runs for different tasks, but all were based on the multi-task transfer learning approach utilised in our system. Given the recent success of transformers (Wolf et al., 2020) in transfer learning for various language tasks such as sentence classification and question answering, we leverage them in the IS challenge. We explored transformer encoder-based models such as BERT (Devlin et al., 2019) and a sequence-to-sequence model, T5 (Raffel et al., 2020), for their potential in this challenge. In doing so, we fine-tune them in a multi-task learning fashion (i.e., jointly fine-tuning for IT classification and priority estimation). With this approach, we submitted five runs to the IS track. The evaluation results show that our runs substantially outperform other participating runs in both IT classification and priority level prediction.
2 Related Work

To improve emergency response, the community has seen many works exploring computational techniques for knowledge acquisition from crisis messages on social media. (For full details of the track's 2020 tasks, refer to http://dcs.gla.ac.uk/~richardm/TREC_IS/2020/participate.html.) Caragea et al. (2011) applied traditional machine learning algorithms, including LDA and SVM, to find important information such as people trapped or food shortages from the 2010 Haiti Earthquake. As neural network (NN) approaches have gained popularity in recent years, many deep learning approaches have been applied to this domain. For example, Nguyen et al. (2017) applied a convolutional neural network (CNN) to classify informative tweets from general disasters such as Typhoon Hagupit, etc., whereas Alam et al. (2018) leveraged a CNN with adversarial training to identify whether a tweet is relevant to a certain crisis event.

In recent years, since the attention-based transformer model was introduced (Vaswani et al.), several variations have been proposed, such as BERT (Devlin et al., 2019), ELECTRA (Clark et al., 2020) and T5 (Raffel et al., 2020), collectively known as the transformers (Wolf et al., 2020), achieving state-of-the-art performance in many language tasks with transfer learning. Commonly, the transformers are first pre-trained on a large general text corpus and then fine-tuned on specific downstream language tasks such as text classification. Given the strong transfer capability of transformers, they have also been widely studied for crisis message processing. Liu et al. (2020) fine-tuned BERT for crisis identification and detection tasks, and Wang and Lillis (2020b) applied T5 to extract useful information, such as who tested positive/negative or cannot get a test, from COVID-related tweets by treating it as a question-answering task.
Our approach in the IS track is similar to this line of work, in that it applies transformers with transfer learning for finding actionable information in the tasks proposed by the IS track. However, our approach differs in the way it fine-tunes the transformers via multi-task learning, aiming to make use of model weights shared between different tasks.

Since the IS track has been run for several years, the participating systems have proposed various techniques specifically for this track. Such approaches can broadly be summarised in three categories. First, traditional machine learning algorithms have been used with careful pre-processing steps and handcrafted input features. For example, Wang et al. applied models including Naïve Bayes, SVM, Random Forest, and an ensemble of these models. To train these models, they used hand-crafted features such as the length and sentiment polarity of a tweet and the number of followers of the user, combined with context-free GloVe and FastText embeddings as well as context-aware BERT embeddings as the input features. The second category uses deep learning approaches that pre-date the widespread adoption of transformers. For instance, Miyazaki et al. (2019) proposed a method using label embedding with a BiLSTM model in this track, while Wang and Lillis (2020a) applied a BiLSTM network along with pre-trained ELMo embeddings and trainable embeddings as the input features for crisis tweet categorisation. The last category encompasses transformer-based fine-tuning approaches. One example is Zahera et al. (2019), who fine-tuned BERT for the multi-label IT classification task using the training tweets after preprocessing.
3 Methodology

Our approach is based on multi-task transfer learning through fine-tuning both transformer encoder-based models such as BERT and sequence-to-sequence transformers such as T5. The following details the two types of models used, which we name the encoders scenario and the sequence-to-sequence scenario respectively. Each type of model was used for both the IT classification task and the priority prediction task.
Encoders scenario: This scenario simply adds two linear projection layers on top of a transformer encoder such as BERT. Our architecture is agnostic as to the specific transformer encoder used. One projection layer transforms the encoder's pooled output (namely, the [CLS] output vector of BERT) into a vector representing the IT classes. The IT representation is then passed to the sigmoid function, which calculates the probability for every IT class. The other projection layer transforms the encoder's output into a vector representing the four priority levels. Similarly, it is then passed to the sigmoid function, which calculates a score whose value indicates the priority level as follows:

    (0.75, 1.0] → Critical
    (0.5, 0.75] → High
    (0.25, 0.5] → Medium
    [0, 0.25]   → Low        (1)
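The interval mapping of Equation 1 can be expressed as a small helper; this is a sketch (the function name is ours), with the thresholds taken from the equation above and consistent with the run1 level-to-score mapping (Critical: 1.0, High: 0.75, Medium: 0.5, Low: 0.25).

```python
def score_to_priority(score: float) -> str:
    """Map a sigmoid importance score in [0, 1] to one of the four
    TREC-IS priority levels, following Equation 1."""
    if score > 0.75:
        return "Critical"
    elif score > 0.5:
        return "High"
    elif score > 0.25:
        return "Medium"
    else:
        return "Low"
```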
In order to achieve the joint learning of both tasks, the encoder model is fine-tuned with a loss function that linearly combines the binary cross-entropy loss between the IT probability distribution and the ground truths (a multi-label classification problem) with the mean squared error between the importance scores and the priority ground truths (a regression problem).
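The combined loss can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation; in particular, the mixing weight `alpha` is an assumption, since the paper only states that the two losses are combined linearly.

```python
import numpy as np

def joint_loss(it_logits, it_labels, score_pred, score_true, alpha=1.0):
    """Linear combination of the two task losses: binary cross-entropy
    over the multi-label IT predictions plus mean squared error over
    the priority importance score."""
    # Multi-label BCE: sigmoid over logits, then elementwise cross-entropy.
    probs = 1.0 / (1.0 + np.exp(-np.asarray(it_logits, dtype=float)))
    labels = np.asarray(it_labels, dtype=float)
    eps = 1e-12
    bce = -np.mean(labels * np.log(probs + eps)
                   + (1 - labels) * np.log(1 - probs + eps))
    # Regression MSE between predicted and gold importance scores.
    mse = np.mean((np.asarray(score_pred) - np.asarray(score_true)) ** 2)
    return bce + alpha * mse
```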
Sequence-to-sequence scenario (seq2seq): This scenario is mostly motivated by work that applies T5 to COVID-related event extraction by treating it as a multiple-choice question answering task (Wang and Lillis, 2020b). We adapt it to the IS track for multi-task transfer learning using seq2seq transformers such as T5. Basically, a seq2seq model takes a sequence of text as input, known as the source sequence, and outputs a target sequence conditioned on the source sequence. Under this mechanism, the template used to construct the source and target sequences in both tasks of the IS track is as follows.
Source : context: T question: IQ/PQ choices: IC/PC
Target: I/P

• T refers to the raw tweet text, without any pre-processing except being lower-cased.

• IQ/PQ refers to the task-specific ad-hoc question texts for IT classification and priority estimation, which are "what type of information does the tweet convey relating to a crisis?" and "what level of urgency is likely expressed in this tweet relating to a crisis?" respectively.

• IC/PC denotes the flattened texts concatenating all ITs and all priority levels respectively. For example, IC is something like "call for donations, call to move people, ...", which varies between the different IT classification tasks. PC is simply "critical, high, medium, low".

• I/P indicates the generated predictions for the ITs and the priority level, which are direct textual predictions drawn from IC/PC respectively.

Using this template, each tweet in the training set is converted to an IT-specific source-target pair and a priority-specific source-target pair. In order to achieve the joint learning of both tasks, the sequence-to-sequence model is fine-tuned on batches of training sequences that contain both the IT pairs and the priority pairs.

run    scenario   task        target submission type   training data
run1   Encoders   Task 1 & 2  one-hot                  prior to 2020B, excluding COVID
run2   Encoders   Task 1 & 2  probability              prior to 2020B, excluding COVID
run3   seq2seq    Task 1 & 2  one-hot                  prior to 2020B, excluding COVID
run4   seq2seq    Task 3      one-hot                  prior to 2020B, including COVID
run5   seq2seq    Task 1 & 2  one-hot                  prior to 2020B, including COVID
Table 1: Summary of our submitted runs for TREC-IS 2020B. Runs 1, 2, 3 and 5, submitted to Task 1, were also submitted to Task 2 for evaluation.
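The source/target template described in Section 3 can be sketched as a small helper that builds the two training pairs for one tweet. The question strings come from the template above; the function and variable names are illustrative, not from the paper.

```python
def build_pairs(tweet: str, it_choices: str, gold_its: str, gold_priority: str):
    """Construct the IT-specific and priority-specific (source, target)
    pairs for the seq2seq scenario from one annotated tweet."""
    tweet = tweet.lower()  # tweets are lower-cased but otherwise unprocessed
    it_source = (f"context: {tweet} question: what type of information "
                 f"does the tweet convey relating to a crisis? "
                 f"choices: {it_choices}")
    pr_source = (f"context: {tweet} question: what level of urgency is "
                 f"likely expressed in this tweet relating to a crisis? "
                 f"choices: critical, high, medium, low")
    # One IT pair and one priority pair per tweet; both appear in
    # the fine-tuning batches to achieve joint learning.
    return [(it_source, gold_its), (pr_source, gold_priority)]
```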
4 Runs

This section describes the details of our system's runs submitted to the latest, 2020B, edition of the IS track. Since our system was developed based on our previous experience in this track, the method described in Section 3 also covers our approach to the 2020A edition (specifically, the encoders scenario). Our baseline run (run1) for 2020B is an ensemble run under the encoders scenario from 2020A, which we consider a strong baseline. In 2020B, we submitted a total of five runs to Tasks 1, 2 and 3 as mentioned in Section 1; they are summarised in Table 1 and described as follows.

• run1: This is a baseline with techniques initially developed in 2019A. In 2020A, we proposed the encoders scenario, achieving strong performance compared to other participating techniques. To further strengthen this baseline, we used a simple ensemble approach combining the predictions made by the individual fine-tuned models under the encoders scenario. (The individual models used in this run were fine-tuned bert-base-uncased, electra-base-discriminator, albert-base-v2 and distilbert-base-uncased, all available in the transformers library (Wolf et al., 2020).) The ensemble simply takes the final IT prediction for each tweet to be the union of the individual IT predictions, and the final priority level to be the highest of the individual priority predictions. Per the 2020B guidelines, both the IT and priority predictions are expected to be numeric, rather than categorical as required prior to 2020B. Hence, we transform the final IT predictions into one-hot encodings and map the priority level prediction to its importance score by: Critical: 1.0, High: 0.75, Medium: 0.5, Low: 0.25.

• run2: Similar to run1, the difference is that
for run2, the final IT predictions are the highest probability values among the predictions of the individual models. The final priority predictions are simply the highest of the individual models' output scores, without applying the conversion defined in Equation 1.

• run3: For this run, the seq2seq scenario is used for multi-task transfer learning. We follow the T5 base architecture initialised with t5-base weights and fine-tune it on the training tweets prior to 2020B (excluding the COVID-related tweets from the 2020A edition). Since the seq2seq model outputs generated texts as the predictions for both priority and ITs, we convert the IT predictions to one-hot encodings and the priority level to its importance score before submission.

• run4: With a similar setup to run3, run4 was submitted for Task 3, and thus its training data comprises the training tweets prior to 2020B including the COVID-related tweets from 2020A.

• run5: With a similar setup to run3, run5 was submitted for Tasks 1 & 2, and it uses all previous training tweets, including the COVID tweets, for fine-tuning the T5 model.

As described, our runs mainly focus on fine-tuning several transformer encoder models and a t5-base sequence-to-sequence model in a multi-task learning manner. For the fine-tuning of t5-base, we follow the same hyper-parameter configuration as used in Wang and Lillis (2020b). For fine-tuning each of the transformer encoder models, we use the same set of hyper-parameters, configured with reference to a similar work in this domain (Liu et al., 2020). For training, we sample around 10% of the training data as the validation set.

Run          nDCG@100  Info-Type F1   Info-Type  Info-Type  Priority F1   Priority
                       [Actionable]   F1 [All]   Accuracy   [Actionable]  F1 [All]
BJUT-run     0.4346    0.0266         0.0581     0.8321     0.1744        0.0905
njit.s1.aug  0.4480    0.2634         0.3103
Table 2: Evaluation results of participating runs at TREC-IS 2020B Task 1. The highest value in each column is shown in bold.
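The ensemble rule described for run1 (union of the individual IT predictions, highest individual priority, then conversion to the numeric submission format) can be sketched as follows. The function and variable names are illustrative, not from the paper.

```python
# Importance scores used to map a categorical priority level to the
# numeric submission format, as described for run1.
IMPORTANCE = {"Critical": 1.0, "High": 0.75, "Medium": 0.5, "Low": 0.25}

def ensemble(per_model_its, per_model_priorities, all_its):
    """run1's ensemble: the final IT prediction is the union of the
    individual models' IT predictions (emitted as a one-hot vector over
    `all_its`), and the final priority is the highest level any model
    predicted, mapped to its importance score."""
    union = set().union(*per_model_its)
    one_hot = [1 if it in union else 0 for it in all_its]
    # "Highest" priority = the level with the largest importance score.
    top = max(per_model_priorities, key=IMPORTANCE.get)
    return one_hot, IMPORTANCE[top]
```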
Run              nDCG@100  Info-Type F1 [All]  Info-Type Accuracy  Priority F1 [All]
Task-1 Systems
BJUT-run         0.4350    0.0472              0.7977              0.1337
njit.s1.aug      0.4487    0.3480              0.8846              0.1838
njit.s2.cmmd.t1  0.4467    0.2494              0.8612              0.1838
njit.s3.img.t1   0.4215    0.2494              0.8612              0.1708
njit.s4.cml.t1   0.4176    0.1278              0.8360              0.1162
ufmg-sars-test   0.3630    0.0127              0.8419              0.1480
ucd-run1 (ours)  0.5020
Task-2 Systems
njit.s1.aug.t2     0.4478    0.2548              0.8656              0.1838
njit.s2.cmmd.t2    0.4478    0.2548              0.8656              0.1838
njit.s3.img.t2     0.4213    0.2548              0.8656              0.1708
njit.s4.cml.t2     0.4189    0.1713              0.8327              0.1162
ufmg-sars-test-t2  0.3637    0.0127              0.8419              0.1480
Table 3: Evaluation results of participating runs at TREC-IS 2020B Task 2. The Task-1 systems are the runs from Task 1 re-evaluated under Task 2, while the Task-2 systems are the runs submitted specifically to Task 2.

Then, we fine-tune each model with a batch size of 32, a learning rate of 5e-5 and a linear warm-up ratio of 0.1, using the Adam optimizer (Kingma and Ba, 2015). We set the maximum input length to 256, since we found that few examples exceed this length. The training examples in our experiments are not pre-processed, but used as raw text.
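For reference, the encoder fine-tuning settings described above can be collected into a single configuration; the dictionary form and key names are ours, while the values are those stated in the text.

```python
# Fine-tuning configuration for the transformer encoder models,
# as described in the text (key names are illustrative).
ENCODER_FINETUNE_CONFIG = {
    "batch_size": 32,
    "learning_rate": 5e-5,
    "warmup_ratio": 0.1,       # linear warm-up
    "optimizer": "Adam",
    "max_input_length": 256,   # few examples exceed this length
    "validation_split": 0.1,   # ~10% of the training data
    "preprocessing": None,     # raw tweet text, no pre-processing
}
```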
5 Results

Having submitted the five runs described in Table 1 to the track, they were officially evaluated, and the results are reported in Tables 2, 3 and 4, which show the performance of participating runs in Tasks 1, 2 and 3 respectively. The columns are the official metrics used to evaluate different aspects of a run's performance, described briefly as follows.
• Information type classification: There are two types of information type (IT) F1. The "Actionable IT" F1 reflects a run's performance in classifying actionable ITs. (The actionable ITs are Request-GoodsService, Request-SearchAndRescue, Report-NewSubEvent, Report-ServiceAvailable, CallToAction-MovePeople, and Report-EmergingThreats.) The "All IT" F1 measures a run's performance across all information types (25 in Task 1, 12 in Task 2 and 9 in Task 3). The IT accuracy is the overall accuracy in IT classification.

[Figure 1 appears here: (a) F1 scores by information type; (b) priority label prediction F1 per information type.]
Figure 1: Performance visualisation by information types of ucd-run1 in Task 1.
Run            nDCG@100  Info-Type F1   Info-Type  Info-Type  Priority F1   Priority
                         [Actionable]   F1 [All]   Accuracy   [Actionable]  F1 [All]
njit.s1.aug.t3 0.4322
Table 4: Evaluation results of participating runs at TREC-IS 2020B Task 3.

• Prioritisation: Similarly, the Actionable priority F1 measures a run's performance in priority level prediction for only those tweets labeled with actionable ITs, while the All priority F1 measures performance over all test tweets. Moreover, nDCG@100 measures a run's average performance in ranking the top 100 test tweets per event by priority.

As seen in Table 2, in Task 1 our runs substantially outperform the other participating runs in both IT classification and prioritisation. (The exception is accuracy, where only a small difference is observed across the participating runs; in the remaining metrics our results are substantially higher than the other participating runs.) In particular, our runs are effective in classifying actionable ITs. For example, our run1 and run3 achieve the top actionable IT F1 score of 0.3215 and the best actionable priority F1 of 0.2803 respectively. This is further evidenced by the runs' performance in Task 2, as shown in Table 3. All the runs perform well overall in IT classification and prioritisation in Task 2 (the condensed set of 12 ITs more closely related to emergency response).

In Tasks 1 and 2, run1 and run2 perform similarly across the metrics, since both are based on the encoders scenario and differ only in the final submission type. It is interesting that run5 performs similarly to run3 across the metrics except for being better in nDCG@100: 0.5252 versus 0.5038. The two runs are both based on the seq2seq scenario and differ only in their training data. This indicates that adding the COVID data (a similar domain) to the general crisis data for training can be helpful for priority-centric ranking performance. Comparing the four runs, we find that no single run dominates the others across all the metrics.
This indicates that the multi-task transfer learning approach, using either the transformer encoder or the seq2seq model as the base, is likely to yield similar performance.

To further examine our runs' performance at the level of individual ITs, we report the per-IT IT F1 and priority F1 scores of run1 in Task 1, as presented in Figure 1. Figure 1a shows that the run performs well in categorising some actionable ITs, such as "CallToAction-MovePeople" and "Report-EmergingThreats", while not being the best in actionable ITs such as "Request-GoodsService", as compared to the non-actionable ITs. However, looking at the priority F1 per IT in Figure 1b, we find that the run performs relatively better in priority level prediction for actionable ITs than for non-actionable ITs; here "CallToAction-MovePeople", "Request-GoodsService" and "Report-ServiceAvailable" are the top 3 ITs for which the run achieves the highest priority F1.

Apart from the four runs for Tasks 1 and 2, we submitted run4 to Task 3, with the results reported in Table 4. We see that the run is competitive with the other participating runs, particularly in prioritisation. Unlike our other four runs in Tasks 1 and 2, this run achieves 0.1425 in actionable IT F1, next to the best of 0.1629. Since Task 3 is COVID-related and newly introduced, we expect our run to improve in future iterations of this track as more data accumulates.
6 Conclusion

This paper describes University College Dublin's (UCD) participation in the 2020 TREC-IS track. The IS track was run twice in 2020: namely 2020A and 2020B. Based on our experience from previous editions, we describe our multi-task transfer learning approach using pre-trained encoder-based and sequence-to-sequence transformers. With these approaches, we submitted five runs to the track's 2020B edition: four for Tasks 1 and 2, and one for Task 3. The results show that our runs for Tasks 1 and 2 substantially outperform the other participating runs in both information type classification and priority level prediction. In addition, our runs are effective in finding some actionable information types in Tasks 1 and 2, and the run for Task 3 performs competitively with the other participating runs. Regarding future work, we expect to explore the incorporation of knowledge graphs to enhance the model's identification of crisis-related tweets.
References
Firoj Alam, Shafiq Joty, and Muhammad Imran. 2018. Domain adaptation with adversarial training and graph embeddings. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1077–1087.

Cornelia Caragea, Nathan J McNeese, Anuj R Jaiswal, Greg Traylor, Hyun-Woo Kim, Prasenjit Mitra, Dinghao Wu, Andrea H Tapia, C Lee Giles, Bernard J Jansen, et al. 2011. Classifying text messages for the Haiti earthquake. In Proceedings of the 8th International Conference on Information Systems for Crisis Response and Management (ISCRAM 2011). Citeseer.

Kevin Clark, Minh-Thang Luong, Quoc V. Le, and Christopher D. Manning. 2020. ELECTRA: Pre-training text encoders as discriminators rather than generators. In International Conference on Learning Representations.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Julia Daisy Fraustino, Brooke Liu, and Yan Jin. 2012. Social media use during disasters: A review of the knowledge base and gaps. National Consortium for the Study of Terrorism and Responses to Terrorism.

Muhammad Imran, Carlos Castillo, Fernando Diaz, and Sarah Vieweg. 2015. Processing social media messages in mass emergency: A survey. ACM Computing Surveys (CSUR), 47(4):67.

Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In Proceedings of the 3rd International Conference for Learning Representations, San Diego.

Junhua Liu, Trisha Singhal, Lucienne Blessing, Kristin L Wood, and Kwan Hui Lim. 2020. CrisisBERT: Robust transformer for crisis classification and contextual crisis embedding. arXiv preprint arXiv:2005.06627.

Richard McCreadie, Cody Buntain, and Ian Soboroff. 2019. TREC incident streams: Finding actionable information on social media. Proceedings of the International ISCRAM Conference, 2019-May(May):691–705.

Richard McCreadie, Cody Buntain, and Ian Soboroff. 2020. Incident Streams 2019: Actionable insights and how to find them. In Proceedings of the 17th International Conference on Information Systems for Crisis Response and Management (ISCRAM 2020).

Taro Miyazaki, Kiminobu Makino, Yuka Takei, Hiroki Okamoto, and Jun Goto. 2019. Label embedding using hierarchical structure of labels for Twitter classification. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 6318–6323.

Dat Tien Nguyen, Kamela Ali Al Mannai, Shafiq Joty, Hassan Sajjad, Muhammad Imran, and Prasenjit Mitra. 2017. Robust classification of crisis-related data on social networks using convolutional neural networks. In Eleventh International AAAI Conference on Web and Social Media.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems.

Sarah Vieweg, Amanda L Hughes, Kate Starbird, and Leysia Palen. 2010. Microblogging during two natural hazards events: What Twitter may contribute to situational awareness. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 1079–1088. ACM.

Sarah Elizabeth Vieweg. 2012. Situational awareness in mass emergency: A behavioral and linguistic analysis of microblogged communications. Ph.D. thesis, University of Colorado at Boulder.

Congcong Wang and David Lillis. 2020a. Classification for crisis-related tweets leveraging word embeddings and data augmentation. In Proceedings of the Twenty-Eighth Text REtrieval Conference (TREC 2019), Gaithersburg, MD.

Congcong Wang and David Lillis. 2020b. UCD-CS at W-NUT 2020 Shared Task-3: A text to text approach for COVID-19 event extraction on social media. In Proceedings of the Sixth Workshop on Noisy User-Generated Text (W-NUT 2020), pages 514–521. Association for Computational Linguistics.

Junpei Zhou, Xinyu Wang, Po-yao Huang, and Alexander Hauptmann. CMU-Informedia at TREC 2019 Incident Streams Track. In Proceedings of the Twenty-Eighth Text REtrieval Conference (TREC 2019), Gaithersburg, MD.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Remi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander Rush. 2020. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45, Online. Association for Computational Linguistics.

Hamada M Zahera, Ibrahim A Elgendy, Rricha Jalota, and Mohamed Ahmed Sherif. 2019. Fine-tuned BERT model for multi-label tweets classification. In