CrisisBench: Benchmarking Crisis-related Social Media Datasets for Humanitarian Information Processing
Firoj Alam Hassan Sajjad Muhammad Imran Ferda Ofli
Qatar Computing Research Institute
Hamad Bin Khalifa University
Doha, Qatar
{fialam,hsajjad,mimran,fofli}@hbku.edu.qa

Abstract
Time-critical analysis of social media streams is important for humanitarian organizations to plan rapid response during disasters. The crisis informatics research community has developed several techniques and systems to process and classify big crisis-related data posted on social media. However, due to the dispersed nature of the datasets used in the literature, it is not possible to compare the results and measure the progress made towards better models for crisis informatics. In this work, we attempt to bridge this gap by standardizing various existing crisis-related datasets. We consolidate labels of eight annotated data sources and provide 166.1k and 141.5k tweets for informativeness and humanitarian classification tasks, respectively. The consolidation results in a larger dataset that affords the ability to train more sophisticated models. To that end, we provide baseline results using CNN and BERT models. We make the dataset available at https://crisisnlp.qcri.org/crisis_datasets_benchmarks.html.

1 Introduction

At the onset of a disaster event, information pertinent to situational awareness such as reports of injured, trapped, or deceased people, urgent needs of victims, and infrastructure damage reports is most needed by formal humanitarian organizations to plan and launch relief operations. Acquiring such information in real-time is ideal to understand the situation as it unfolds. However, it is challenging as traditional methods such as field assessments and surveys are time-consuming.
Microblogging platforms such as Twitter have been widely used to disseminate situational and actionable information by the affected population. Although social media sources are useful in this time-critical setting, it is, however, challenging to parse and extract actionable information from big crisis data available on social media (Castillo, 2016).

The past couple of years have witnessed a surge in research works that focus on analyzing the usefulness of social media data and developing computational models to extract actionable information. Among others, proposed computational techniques include information classification, information extraction, and summarization (Imran et al., 2015; Rudra et al., 2018). Most of these studies use one of the publicly available datasets, reported in (Olteanu et al., 2014; Imran et al., 2016; Alam et al., 2018), either proposing a new model or reporting higher performance of an existing model. Typical classification tasks in the community include (i) informativeness (i.e., informative vs. not-informative reports), (ii) humanitarian information type classification (e.g., affected individual reports, infrastructure damage reports), and (iii) event type classification (e.g., flood, earthquake, fire).

Despite the recent focus of the crisis informatics research community on developing novel and more robust computational algorithms and techniques to process social media data, very limited effort has been invested in developing standard datasets and benchmarks for others to compare their results, models, and techniques. In this paper, we develop a standard social media dataset for disaster response to facilitate comparison between different modeling approaches and to encourage the community to streamline their efforts towards a common goal. We can create such a standard benchmark dataset thanks to the publicly available datasets.
The consolidated data is also larger in size and has a better class distribution compared to the individual datasets, which are two important data features for building better models.

We consolidate eight annotated datasets, namely, CrisisLex (Olteanu et al., 2014; Olteanu et al., 2015), CrisisNLP (Imran et al., 2016), SWDM2013 (Imran et al., 2013a), ISCRAM13 (Imran et al., 2013b), Disaster Response Data (DRD), Disasters on Social Media (DSM), CrisisMMD (Alam et al., 2018), and data collected by the AIDR system (Imran et al., 2014). One of the challenges while consolidating the datasets is the inconsistent class labels across the datasets. One of the earlier efforts in defining the class labels and terminologies is the work of Temnikova et al. (2015). The CrisisLex, CrisisNLP, and CrisisMMD datasets used similar definitions to those discussed in (Temnikova et al., 2015). Across several studies, a commonality exists at the semantic level of the class labels used in different datasets. In this study, we map the class labels across datasets using their semantic meaning, a step performed manually by domain experts.

Another challenge while consolidating different social media datasets is to tackle the duplicate content that is present within or across datasets. There are three types of duplicates: (i) tweet-id based duplicates (i.e., the same tweet appears in different datasets), (ii) content-based duplicates (i.e., tweets with different ids have the same content), which usually happens when users copy-paste tweets, and (iii) near-duplicate content (i.e., tweets with similar content), which happens due to retweets or partial copies of tweets from other users. We use cosine similarity between tweets to filter out the various types of duplicates. The contributions of this work are as follows.

• We consolidate all publicly available disaster-related datasets by manually mapping semantically similar class labels.
We filter exact- and near-duplicate tweets to clean the data and avoid any experimental biases.

• We provide benchmark results using state-of-the-art learning algorithms such as Convolutional Neural Networks (CNN) and pre-trained BERT models (Devlin et al., 2018) for two classification tasks, i.e., informativeness (binary) and humanitarian type (multi-class) classification. The benchmarking encourages the community towards comparable and reproducible research.

• For the research community, we aim to release the dataset in multiple forms: (i) a consolidated class-label-mapped version, (ii) an exact- and near-duplicate filtered version obtained from the previous version, and (iii) a subset of the filtered data used for the classification experiments in this study. Our released dataset also includes a language tag, which enables the use of multilingual information in classification and is a promising future research direction.

The rest of the paper is organized as follows. Section 2 provides a brief overview of the existing work. Section 3 describes our data consolidation procedures, Section 4 describes the experiments, and Section 5 presents and discusses the results. Finally, Section 6 concludes the paper.
2 Related Work

Dataset Consolidation: In crisis informatics research on social media, there has been an effort to develop datasets for the research community. An extensive literature review can be found in (Imran et al., 2015). Although there are several publicly available datasets that are used by researchers, their results are not exactly comparable due to the differences in class labels and train/dev/test splits. Alam et al. (2019) and Kersten et al. (2019) have previously worked in this direction to consolidate social media disaster response data. However, both of these studies have limitations: Alam et al. (2019) did not consider the issue of duplicate and near-duplicate content when combining different datasets, while Kersten et al. (2019) focused only on informativeness classification (note that in this study, informativeness classification is also referred to as related vs. not-related). A fair comparison of the classification experiments is also difficult with these two studies as their train/dev/test splits are not public. We address such limitations in this study, i.e., we consolidate the datasets, eliminate duplicates, and release train/dev/test splits publicly with benchmark results.

In terms of defining class labels (i.e., ontologies) for crisis informatics, most of the earlier efforts are discussed in Imran et al. (2015) and Temnikova et al. (2015). Various recent studies (Olteanu et al., 2014; Imran et al., 2016; Alam et al., 2018) use similar definitions.
Source      Total     Mapping               Filtering
                      Inform.   Human.     Inform.   Human.
CrisisLex    88,015    84,407    84,407     69,699    69,699
CrisisNLP    52,656    51,271    50,824     40,401    40,074
SWDM13        1,543     1,344       802        857       699
ISCRAM13      3,617     3,196     1,702      2,521     1,506
DRD          26,235    21,519     7,505     20,896     7,419
DSM          10,876    10,800         0      8,835         0
CrisisMMD    16,058    16,058    16,058     16,020    16,020
AIDR          7,411     7,396     6,580      6,869     6,116
Total       206,411   195,991   167,878    166,098   141,533
Table 1: Different datasets and their sizes before and after the label mapping and filtering steps.

Different from them, Strassel et al. (2017) define categories based on need types (e.g., evacuation, food supply) and issue types (e.g., civil unrest). In this study, we use the class labels that are highly important for humanitarian aid in the disaster response task, which also have a commonality across the publicly available resources.
Classification Algorithms:
Although a majority of studies in the crisis informatics literature employ traditional machine learning algorithms for automatic event detection, event type classification, and fine-grained humanitarian information type classification, several recent works explore deep learning algorithms in disaster-related tweet classification tasks. The studies of Nguyen et al. (2017) and Neppalli et al. (2018) perform comparative experiments between different classical and deep learning algorithms including Support Vector Machines (SVM), Logistic Regression (LR), Random Forests (RF), Recurrent Neural Networks (RNN), and Convolutional Neural Networks (CNN). Their experimental results suggest that CNN outperforms the other algorithms, though in another study, Burel and Alani (2018) report that SVM and CNN can provide very competitive results in some cases. CNNs have also been explored in an event type-specific filtering model (Kersten et al., 2019) and in few-shot learning (Kruspe et al., 2019). Very recently, different types of embedding representations have been proposed in the literature, such as Embeddings from Language Models (ELMo) (Peters et al., 2018), Bidirectional Encoder Representations from Transformers (BERT) (Devlin et al., 2018), and XLNet (Yang et al., 2019), for different NLP tasks. For disaster-related classification, Jain et al. (2019) investigate these embedding representations and achieve similar results.
3 Dataset Consolidation

We consolidate some of the most prominent, publicly available social media datasets that were labeled for different disaster response classification tasks. In doing so, we deal with two major challenges: (i) discrepancies in the class labels used across different datasets, and (ii) exact- and near-duplicate content that exists within as well as across different datasets.

In this study, we focus on eight datasets that have annotations and can be mapped consistently for two tasks: informativeness classification and humanitarian information type categorization. These datasets include CrisisLex (CrisisLexT6 (Olteanu et al., 2014), CrisisLexT26 (Olteanu et al., 2015)), CrisisNLP (Imran et al., 2016), SWDM2013 (Imran et al., 2013a), ISCRAM13 (Imran et al., 2013b), Disaster Response Data (DRD), Disasters on Social Media (DSM, https://data.world/crowdflower/disasters-on-social-media), CrisisMMD (Alam et al., 2018), and data collected by the AIDR system (Imran et al., 2014); note that the AIDR system data has been annotated by domain experts and is available upon request. The second column of Table 1 summarizes the original sizes of the datasets. From the table, we observe that CrisisLex and CrisisNLP are the largest and second-largest datasets, respectively, and both are currently widely used in the literature. SWDM2013 is the smallest set and one of the earliest datasets in the crisis informatics community. Below we elaborate on the details of the data consolidation process for these datasets.

3.1 Class Label Mapping
To combine these datasets, we create a set of common class labels by manually mapping class labels that come from different datasets but have the same or similar semantic meanings. For example, the label “building damaged,” originally used in the AIDR system, is mapped to “infrastructure and utilities damage” in our final dataset. Some of the class labels in these datasets are not annotated for humanitarian aid purposes; therefore, we have not included them in the consolidated dataset. For example, we do not select tweets labeled as “animal management” or “not labeled” that appear in CrisisNLP and CrisisLexT26. This causes a drop in the number of tweets for both the informativeness and humanitarian tasks, as can be seen in Table 1 (Mapping columns). The large drop in the CrisisLex dataset for the informativeness task is due to the 3,103 unlabeled tweets (i.e., labeled as “not labeled”). The other significant drop in the number of training examples for the informativeness task is in the DRD dataset. This is because many tweets were annotated with multiple labels, which we have not included in our consolidated dataset.

Many tweets in these datasets were labeled for informativeness only. For example, the DSM dataset is only labeled for informativeness, and a large portion of the DRD dataset is labeled for informativeness only. Therefore, we were not able to map them for the humanitarian task. More details of this mapping for different datasets are reported in the supplementary material.

3.2 Filtering Duplicates

To develop a machine learning model, it is important to design non-overlapping train/dev/test splits. A common practice is to randomly split the dataset into train/dev/test sets. This approach does not work well with social media data as it generally contains duplicates and near-duplicates. Such duplicate content, if present in both train and test sets, often leads to overestimated test results during classification.
Filtering the exact- and near-duplicate content is therefore one of the major steps we have taken while consolidating the datasets. We first tokenize the text before applying any filtering. For tokenization, we use a modified version of the Tweet NLP tokenizer (O'Connor et al., 2010; https://github.com/brendano/ark-tweet-nlp). Our modifications include lowercasing the text and removing URLs, punctuation, and user mentions from the text. We then filter out tweets having only one token. Next, we apply exact string matching to remove exact duplicates. An example of an exact-duplicate tweet is: “RT Reuters: BREAKING NEWS: 6.3 magnitude earthquake strikes northwest of Bologna, Italy: USGS”, which appears three times with an exact match in the CrisisLex26 (Olteanu et al., 2014) dataset collected during the 2012 Northern Italy earthquakes.

Then, we use a similarity-based approach to remove the near-duplicates. To do this, we first convert the tweets into vectors of uni- and bi-grams with their frequency-based representations. We then use cosine similarity to compute a similarity score between two tweets and flag them as duplicates if their similarity score is greater than the threshold value of 0.75. In the similarity-based approach, threshold selection is an important aspect: choosing a lower value would remove many distant tweets, while choosing a higher value would leave several near-duplicate tweets in the dataset. To determine a plausible threshold value, we manually looked into the tweets in different threshold bins (i.e., 0.70 to 1.0 with a 0.05 interval), selected from the consolidated informativeness dataset, as shown in Figure 1. By investigating the distribution, we concluded that a threshold value of 0.75 is a reasonable choice. From the figure we can clearly see that choosing a lower threshold (e.g., below 0.70) would remove a larger number of tweets. Note that the rest of the tweets have similarity lower than what we have reported in the figure.
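The tokenize-and-compare step described above can be sketched as follows. This is an illustration only, not the authors' code: the regex-based `tokenize` merely approximates the modified Tweet NLP tokenizer, and the example tweets are taken from Table 2.

```python
import re
import math
from collections import Counter

def tokenize(text):
    """Approximate the modified tokenizer: lowercase, strip URLs,
    user mentions, and punctuation, then split on whitespace."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)   # remove URLs
    text = re.sub(r"@\w+", " ", text)           # remove user mentions
    text = re.sub(r"[^\w\s]", " ", text)        # remove punctuation
    return text.split()

def ngram_vector(tokens):
    """Frequency vector of unigrams and bigrams."""
    bigrams = [" ".join(p) for p in zip(tokens, tokens[1:])]
    return Counter(tokens + bigrams)

def cosine(v1, v2):
    dot = sum(v1[k] * v2[k] for k in v1 if k in v2)
    n1 = math.sqrt(sum(c * c for c in v1.values()))
    n2 = math.sqrt(sum(c * c for c in v2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def is_near_duplicate(t1, t2, threshold=0.75):
    return cosine(ngram_vector(tokenize(t1)), ngram_vector(tokenize(t2))) > threshold

# A retweet pair that differs only by URL/mention noise is flagged.
a = "RT @AJEnglish: Flood worsens in eastern Australia http://t.co/kuGSMCiH"
b = "Flood worsens in eastern Australia http://t.co/YfokqBmG"
print(is_near_duplicate(a, b))  # → True
```

Because URLs and mentions are stripped before vectorization, retweet-style near-duplicates score well above the 0.75 threshold even though their raw strings differ.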
In Table 2, we provide a few examples for the sake of clarity.

Figure 1: Number of near-duplicates in different similarity bins (0.70 to 1.00, 0.05 intervals) obtained from the consolidated informativeness tweets after label mapping. Tweets in lower-similarity bins (below 0.70) are not reported here.

To understand which events and which datasets have more exact- and near-duplicates, we analyzed the duplicate counts. In Figure 2, we report such counts for both exact- and near-duplicates in the informativeness tweets, giving the total number of duplicates (i.e., exact and near) for each dataset with the percentage of reduction in parentheses. The figure shows that CrisisLex and CrisisNLP have comparatively more duplicates; however, this is because those two are comparatively larger datasets. For each of these datasets, we also wanted to see which event's duplicates appear most. In CrisisLex, the majority of the exact duplicates appear in the “Queensland floods (2013)” event, with 2,270 exact duplicates. The second largest is the “West Texas explosion (2013)” event, with 1,301 duplicates. Compared to CrisisLex, the number of exact duplicates is low in CrisisNLP, and the majority of such duplicates appear in the “Philippines Typhoon Hagupit (2014)” event, with 1,084 tweets. For the humanitarian tweets, we observed characteristics similar to Figure 2.

Figure 2: Exact- and near-duplicates in informativeness tweets. The number on top of each bar represents the total number, and the number in parentheses represents the percentage of consolidated exact and near duplicates from the respective dataset.
As indicated in Table 1, there is a drop after filtering, e.g., ∼25% for the informativeness task and ∼20% for the humanitarian task. It is important to note that failing to eradicate duplicates from the consolidated dataset would potentially lead to misleading performance results in the classification experiments. Note that the event names we refer to here are the events during which the data was collected by the respective dataset authors. We provide such information as a part of the supplementary material.
Pair 1 (Sim. 0.807, Dup. ✗):
  “Australia lurches from fire to flood http://t.co/C6x8Uxnk” → australia lurches from fire to flood url
  “Australia lurches from fire to flood” → australia lurches from fire to flood

Pair 2 (Sim. 0.787, Dup. ✗):
  “Halo tetangga. Sabar ya. RT @AJEnglish: Flood worsens in eastern Australia http://t.co/YfokqBmG” → halo tetangga sabar ya rt flood worsens in eastern australia url
  “RT @AJEnglish: Flood worsens in eastern Australia http://t.co/kuGSMCiH” → rt flood worsens in eastern australia url

Pair 3 (Sim. 0.778, Dup. ✗):
  “"@guardian: Queensland counts flood cost as New South Wales braces for river peaks http://t.co/MpQskYt1". Brisbane friends moved to refuge.” → queensland counts flood cost as new south wales braces for river peaks url brisbane friends moved to refuge
  “Queensland counts flood cost as New South Wales braces for river peaks http://t.co/qb5UuYf9” → queensland counts flood cost as new south wales braces for river peaks url

Pair 4 (Sim. 0.732, Dup. ✓):
  “Obama to attend memorial service for victims of Texas explosion: The president will meet with victims of the d... http://t.co/VgGdVATn1b” → obama to attend memorial service for victims of texas explosion the president will meet with victims of the d url
  “Obama to attend memorial service for victims of Texas explosion http://t.co/f6JXfzd7QZ” → obama to attend memorial service for victims of texas explosion url

Pair 5 (Sim. 0.705, Dup. ✓):
  “RT @RobertTaylors: Shooting Reported at Los Angeles International Airport: There are reports of a shooting incident Friday mornin... http:/...” → rt shooting reported at los angeles international airport there are reports of a shooting incident friday mornin http ...
  “RT @BuzzFeed: There Are Reports Of A Shooting At Los Angeles International Airport http://t.co/9TgunRXajQ” → rt there are reports of a shooting at los angeles international airport url

Pair 6 (Sim. 0.709, Dup. ✓):
  “"@BuzzFeed: Watch Hurricane Sandy roll in from the top of the @nytimes building http://t.co/dl2g3sAH"” → watch hurricane sandy roll in from the top of the building url
  “Hurricane Sandy view from the top of the NYTimes building http://t.co/pLiXlaHI” → hurricane sandy view from the top of the nytimes building url

Table 2: Examples of near-duplicate tweet pairs with their tokenized forms and similarity scores, selected from the informativeness tweets. Sim. refers to the similarity value; Dup. refers to whether we consider the pair duplicates and filtered them. The symbol ✗ indicates a duplicate, which we dropped, and the symbol ✓ indicates a non-duplicate, which we have included in our dataset.

3.3 Adding Language Tags
While combining the datasets, we realized that some of them contain tweets in languages other than English (e.g., Spanish and Italian). In addition, many tweets have code-switched (i.e., multilingual) content. For example, one tweet contains both English and Spanish: “It's …”. Twitter tagged this tweet as English, whereas the Google language detector service tagged it as Spanish. After we realized this multilingual issue in the datasets, we decided to provide a language tag for each tweet if the language tag was not available with the respective dataset. For example, the tweets annotated by volunteers in the CrisisNLP dataset have language tags provided by Twitter, whereas no language tag is provided with the CrisisLex dataset. For these tweets, we used the language detection API of Google Cloud Services (https://cloud.google.com/translate/docs/advanced/detecting-language-v3; as it is a paid service, we did not use it for tweets whose language tags were already available). We provide the language tag and the confidence score obtained from the language detection API. Hence, the consolidated dataset includes a language tag for all tweets. In Figure 3, we report the distribution of languages with more than 20 tweets in the datasets. Among the different languages of the informativeness tweets, English is by far the most frequent, accounting for 94.46% of 156,899 tweets, as shown in Figure 3. Note that most of the non-English tweets appear in the CrisisLex dataset.

Figure 3: Distribution of the top nineteen languages (≥ 20 tweets) in the consolidated informativeness tweets.

3.4 Class Label Distributions

The distribution of class labels is an important factor for developing a classification model. In Tables 3 and 4, we report the individual datasets along with the class label distributions for the informativeness and humanitarian tasks, respectively. It is clear that there is an imbalance in class distributions in different datasets and that some class labels are not present. For example, the distribution of the “not informative” class is very low in the SWDM13 and ISCRAM13 datasets.
For the humanitarian task, some class labels are not present in different datasets. Only 17 tweets with the label “terrorism related” are present in CrisisNLP. Similarly, the class “disease related” only appears in CrisisNLP. The scarcity of these class labels poses a great challenge for designing a classification model using the individual datasets. Even after combining the datasets, the imbalance in class distribution seems to persist (last column in Table 4). For example, the distribution of “not humanitarian” is relatively higher (37.40%) than other class labels, which might have to be under-sampled for training the classification model. In Table 4, we highlight some class labels that we dropped in the rest of the classification experiments conducted in this study; however, tweets with those class labels will be available in the released datasets. The reason for not including them in the experiments is that we aim to develop classifiers for the disaster response tasks only.
Class            CrisisLex  CrisisNLP  SWDM13  ISCRAM13  DRD     DSM    CrisisMMD  AIDR   Total
Informative      42,140     23,694     716     2,443     14,849  3,461  11,488     2,968  101,759
Not informative  27,559     16,707     141     78        6,047   5,374  4,532      3,901  64,339
Total            69,699     40,401     857     2,521     20,896  8,835  16,020     6,869  166,098
Table 3: Data distribution of informativeness across different sources.
Class                                CrisisLex  CrisisNLP  SWDM13  ISCRAM13  DRD    CrisisMMD  AIDR  Total
Affected individual                  3,740      -          -       -         -      471        -     4,211
Caution and advice                   1,774      1,137      117     412       -      -          161   3,601
Disease related*                     -          1,478      -       -         -      -          -     1,478
Displaced and evacuations            -          495        -       -         -      -          50    545
Donation and volunteering            1,932      2,882      27      189       10     3,286      24    8,350
Infrastructure and utilities damage  1,353      1,721      -       -         877    1,262      283   5,496
Injured or dead people               -          2,151      139     125       -      486        267   3,168
Missing and found people             -          443        -       43        -      40         46    572
Not humanitarian                     27,559     16,708     142     81        -      4,538      3,911 52,939
Other relevant information*          29,562     8,188      -       -         -      5,937      939   44,626
Personal update*                     -          116        274     656       -      -          -     1,046
Physical landslide*                  -          538        -       -         -      -          -     538
Requests or needs                    -          215        -       -         6,532  -          257   7,004
Response efforts                     -          1,114      -       -         -      -          -     1,114
Sympathy and support                 3,779      2,872      -       -         -      -          178   6,829
Terrorism related*                   -          16         -       -         -      -          -     16
Total                                69,699     40,074     699     1,506     7,419  16,020     6,116 141,533

Table 4: Data distribution of humanitarian categories across different datasets. Classes marked with * were dropped from the classification experiments conducted in this study.
4 Experiments

Although our consolidated dataset contains multilingual tweets, we use only the English tweets in our experiments. We split the data into train, dev, and test sets with proportions of 70%, 10%, and 20%, respectively, as reported in Table 5. As mentioned earlier, we did not select the tweets with the highlighted class labels in Table 4 for the classification experiments. Therefore, in the rest of the paper, we report the class label distributions and results on the selected class labels with English tweets only.
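A class-stratified 70/10/20 split of this kind can be sketched as follows. This is an illustrative, stdlib-only version under our own naming (`stratified_split`); for comparability, the released splits should be used rather than regenerating them.

```python
import random
from collections import defaultdict

def stratified_split(examples, train=0.7, dev=0.1, seed=42):
    """Split (text, label) pairs 70/10/20 per class so that each split
    approximately preserves the overall label distribution."""
    by_label = defaultdict(list)
    for ex in examples:
        by_label[ex[1]].append(ex)
    rng = random.Random(seed)
    splits = {"train": [], "dev": [], "test": []}
    for label, items in by_label.items():
        rng.shuffle(items)
        n_train = int(len(items) * train)
        n_dev = int(len(items) * dev)
        splits["train"] += items[:n_train]
        splits["dev"] += items[n_train:n_train + n_dev]
        splits["test"] += items[n_train + n_dev:]   # remaining ~20%
    return splits

# Toy imbalanced data: two informativeness labels, 2:1 ratio.
data = [(f"tweet {i}", "informative" if i % 3 else "not informative")
        for i in range(1000)]
s = stratified_split(data)
print(len(s["train"]), len(s["dev"]), len(s["test"]))
```

Splitting per label rather than over the whole pool keeps minority classes represented in dev and test even when, as in Table 5, some classes have only a few hundred examples.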
Individual vs. Consolidated Datasets: The motivation of these experiments is to investigate whether the consolidated dataset helps in improving the classification performance. For the individual-dataset classification experiments, we selected CrisisLex and CrisisNLP as they are reasonably large in size and have a reasonable number of class labels, i.e., six and eleven class labels, respectively. Note that these are subsets of the consolidated dataset reported in Table 5. We selected them from the train, dev, and test splits of the consolidated dataset to be consistent across the different classification experiments. To understand the effectiveness of the smaller datasets, we run experiments by training the model using the smaller datasets and evaluating on the consolidated test set.
CNN vs. BERT using Consolidated Dataset:
The recent development of pre-trained models such as BERT has shown success in different downstream NLP tasks. In this study, we want to compare the performance of the widely used CNN model with the BERT model. We use the consolidated dataset for these experiments.
Informativeness                      Train    Dev     Test    Total
Informative                          65,826   9,594   18,626  94,046
Not informative                      43,970   6,414   12,469  62,853
Total                                109,796  16,008  31,095  156,899

Humanitarian                         Train    Dev     Test    Total
Affected individual                  2,454    367     693     3,514
Caution and advice                   2,101    309     583     2,993
Displaced and evacuations            359      53      99      511
Donation and volunteering            5,184    763     1,453   7,400
Infrastructure and utilities damage  3,541    511     1,004   5,056
Injured or dead people               1,945    271     561     2,777
Missing and found people             373      55      103     531
Not humanitarian                     36,109   5,270   10,256  51,635
Requests or needs                    4,840    705     1,372   6,917
Response efforts                     780      113     221     1,114
Sympathy and support                 3,549    540     1,020   5,109
Total                                61,235   8,957   17,365  87,557

Table 5: Data splits and their distributions for the consolidated English tweets dataset.
Event-aware Training
The availability of annotated data for a disaster event is usually scarce. One of the advantages of our compiled data is that it has identical classes across several disaster events. This enables us to combine the annotated data from all previous disasters for the classification. Though this increases the size of the training data substantially, the classifier may show sub-optimal performance due to the inclusion of heterogeneous data (i.e., a variety of disaster types occurring in different parts of the world).

Sennrich et al. (2016) proposed a tag-based strategy where they add a tag to machine translation training data to force a specific type of translation. The method has later been adopted for domain adaptation and multilingual machine translation (Chu et al., 2017). Motivated by it, we propose an event-aware training mechanism. Let D = {d_1, d_2, ..., d_m} be a set of m disaster event types, where a disaster event type d_i can be, for example, earthquake, flood, fire, or hurricane. For a disaster event type d_i, let T_i = {t_1, t_2, ..., t_n} be the annotated tweets. We append the disaster event type as a token to each annotated tweet. More concretely, say tweet t_i consists of k words {w_1, w_2, ..., w_k}. We append the disaster event type tag d_i to the tweet so that t_i becomes {d_i, w_1, w_2, ..., w_k}. We repeat this step for all disaster event types present in our dataset. We then concatenate the modified data of all disasters and use it for the classification. Different from concatenating the original data, we essentially preserve the domain information present in the data while making use of all of the data for the classification.

The event-aware training requires knowledge of the disaster event type at test time. If we do not provide a disaster event type, the classification performance will be suboptimal due to a mismatch between train and test. In order to apply the model to an unknown disaster event type, we modify the training procedure.
Instead of appending the disaster event type to all tweets of a disaster, we randomly append the disaster event type UNK to 5% of the tweets of every disaster. Note that UNK is now distributed across all disaster event types and is a good representation of an unknown event.
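The event-aware tagging with the 5% UNK substitution can be sketched as follows. This is a simplified illustration: the function name, the example tweets, and the seed are ours; the procedure of prepending the event-type tag (or UNK for a random 5%) follows the description above.

```python
import random

def add_event_tags(tweets_by_event, unk_rate=0.05, seed=1):
    """Prepend the disaster event type (e.g. 'flood') as a pseudo-token
    to each tweet; a small fraction gets the generic tag 'UNK' instead,
    so the model also learns a representation for unseen event types."""
    rng = random.Random(seed)
    tagged = []
    for event_type, tweets in tweets_by_event.items():
        for tweet in tweets:
            tag = "UNK" if rng.random() < unk_rate else event_type
            tagged.append(tag + " " + tweet)
    return tagged

data = {
    "flood": ["queensland counts flood cost"],
    "earthquake": ["6.3 magnitude earthquake strikes northwest of bologna"],
}
for t in add_event_tags(data):
    print(t)
```

At test time, a known event type is prepended in the same way; for an unseen event, the UNK tag is used instead, matching the distribution the classifier saw during training.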
In this section, we describe the details of our classification models. For the experiments, we use the CNN and pre-trained BERT models.
Classification using CNN
The current state-of-the-art disaster classification model is based on the CNN architecture. We use a similar architecture to that proposed by Nguyen et al. (2017).
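The core of such a text CNN is a set of filters slid over the word-embedding sequence, each followed by max pooling. A toy forward pass for a single filter can be written as below; this is a sketch of the mechanism only, not the actual model, which uses hundreds of learned filters over learned embeddings and several window sizes.

```python
def conv1d_max(embeddings, filt):
    """Slide one filter (window_size x dim) over a sequence of token
    embeddings and global-max-pool the resulting feature map."""
    w = len(filt)          # filter window size (number of tokens covered)
    dim = len(filt[0])     # embedding dimension
    scores = []
    for i in range(len(embeddings) - w + 1):
        # Dot product of the filter with one window of embeddings.
        s = sum(embeddings[i + j][k] * filt[j][k]
                for j in range(w) for k in range(dim))
        scores.append(s)
    return max(scores)     # global max pooling over positions

# Toy 4-token sentence with 2-dim embeddings and one window-2 filter.
sent = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
filt = [[1.0, 0.0], [0.0, 1.0]]
print(conv1d_max(sent, filt))  # → 2.0
```

In the full model, one such pooled value per filter is computed for each window size, and the concatenated values are fed to a dense layer with softmax over the class labels.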
Classification using BERT
Pre-trained models have achieved state-of-the-art performance on natural language processing tasks and have been adopted as feature extractors for solving downstream tasks such as question answering and sentiment analysis. Though the pre-trained models are mainly trained on non-Twitter text, we hypothesize that their rich contextualized embeddings will be beneficial for the disaster domain. In this work, we choose the pre-trained BERT model (Devlin et al., 2018) for the classification task. We follow the standard fine-tuning procedure with a task-specific layer added on top of the BERT architecture.

Train               Test          Acc    P      R      F1
Informativeness
CrisisLex (2C)      Consolidated  0.801  0.807  0.800  0.803
CrisisNLP (2C)      Consolidated  0.725  0.768  0.730  0.727
Consolidated (2C)   Consolidated  –      –      –      0.866
Humanitarian
CrisisLex (6C)      Consolidated  0.694  0.601  0.690  0.633
CrisisNLP (10C)     Consolidated  0.666  0.582  0.670  0.613
Consolidated (11C)  Consolidated  –      –      –      0.829

Table 6: Classification results for the individual and consolidated train sets using the CNN model. 2C, 6C, 10C, and 11C refer to two, six, ten, and eleven class labels, respectively.
Model Settings
We train the CNN models using the Adam optimizer (Kingma and Ba, 2014). The maximum number of epochs is set to 1,000. We set an early stopping criterion based on the accuracy of the development set, with a patience of 200 epochs. We use 300 filters, with window sizes and pooling lengths of 2, 3, and 4. For BERT, we use the BERT-base model (Devlin et al., 2018) with the Transformers toolkit (Wolf et al., 2019). The model consists of 12 layers plus an additional task-specific layer. We fine-tune the model using the default settings for three epochs, as prescribed by Devlin et al. (2018).
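The patience-based early-stopping criterion can be expressed as a small helper. This is an illustrative sketch, not the authors' training code; the class and method names are hypothetical:

```python
class EarlyStopping:
    """Stop training when dev-set accuracy has not improved
    for `patience` consecutive epochs."""

    def __init__(self, patience=200):
        self.patience = patience
        self.best = float("-inf")
        self.bad_epochs = 0

    def should_stop(self, dev_acc):
        if dev_acc > self.best:   # improvement: remember it and reset the counter
            self.best = dev_acc
            self.bad_epochs = 0
        else:                     # no improvement this epoch
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience
```

With patience=200 and a 1,000-epoch cap, training ends at whichever limit is reached first.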
Preprocessing
Prior to the classification experiments, we preprocess tweets to remove symbols, emoticons, invisible and non-ASCII characters, punctuation (replaced with whitespace), numbers, URLs, and hashtag signs. We also remove stop words.
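A rough approximation of this pipeline in Python follows; the regexes and the (truncated) stop-word list are our own illustrative choices, not the exact ones used in the paper:

```python
import re

STOP_WORDS = {"a", "an", "the", "is", "are", "to", "of", "and", "in", "for"}

def preprocess_tweet(text):
    """Strip URLs, non-ASCII characters (covers most emoticons and
    symbols), numbers, and hashtag signs; map punctuation to
    whitespace; lowercase and drop stop words."""
    text = re.sub(r"https?://\S+", " ", text)        # URLs
    text = text.encode("ascii", "ignore").decode()   # non-ASCII / emoji
    text = text.replace("#", " ")                    # hashtag sign, keep the word
    text = re.sub(r"\d+", " ", text)                 # numbers
    text = re.sub(r"[^A-Za-z\s]", " ", text)         # punctuation -> whitespace
    tokens = [t for t in text.lower().split() if t not in STOP_WORDS]
    return " ".join(tokens)
```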
Evaluation Settings
To measure the performance of each classifier, we use weighted average precision (P), recall (R), and F1-measure (F1). The rationale behind choosing the weighted metric is that it takes the class imbalance problem into account.
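For reference, the weighted averages can be computed as follows. This is a self-contained sketch; in practice a library such as scikit-learn with average='weighted' gives the same numbers:

```python
from collections import Counter

def weighted_prf(y_true, y_pred):
    """Weighted-average precision/recall/F1: per-class scores are
    averaged with weights proportional to each class's support."""
    classes = set(y_true)
    support = Counter(y_true)
    n = len(y_true)
    wp = wr = wf = 0.0
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        pred_c = sum(1 for p in y_pred if p == c)
        prec = tp / pred_c if pred_c else 0.0
        rec = tp / support[c]
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        w = support[c] / n  # weight by class support
        wp += w * prec
        wr += w * rec
        wf += w * f1
    return wp, wr, wf
```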
In Table 6, we report the classification results for individual vs. consolidated datasets for both informativeness and humanitarian tasks using the CNN model. As mentioned earlier, we selected CrisisLex and CrisisNLP to conduct experiments on the individual datasets. Between CrisisLex and CrisisNLP, performance is higher with the CrisisLex dataset for both informativeness and humanitarian tasks. This might be due to the CrisisLex dataset being larger than the CrisisNLP dataset. The model trained using the consolidated dataset achieves an F1 of 0.866 for informativeness and 0.829 for humanitarian, which is better than the models trained using individual datasets. In the humanitarian task, the different datasets in Table 6 have different numbers of class labels. We report results only for those classes that the model is able to classify. For example, the model trained using the CrisisLex data can classify tweets into one of six class labels (see Table 4 for excluded labels, shown highlighted). The experiments with smaller datasets for both informativeness and humanitarian tasks show the importance of training a classifier on a larger dataset. Note that the humanitarian task is a multi-class classification problem, which makes it much more difficult than the binary informativeness classification.
Table 7 compares the performance of CNN and BERT on the consolidated datasets. For the informativeness task, there is no difference in performance between CNN and BERT. However, on the humanitarian task, the BERT model outperforms the CNN model by an absolute margin of 3.5 points in F1.

         Informativeness              Humanitarian
Model    Acc    P      R      F1     Acc    P      R      F1
CNN      0.867  0.866  0.870
BERT     0.866  0.866  0.866

Table 7: Classification results of the consolidated dataset using CNN and BERT.
                                      CNN                    BERT
Class                                 P      R      F1      P      R      F1
Affected individual                   0.760  0.720  0.740   0.771  0.835  0.802
Caution and advice                    0.630  0.630  0.630   0.684  0.720  0.702
Displaced and evacuations             0.490  0.180  0.260   0.523  0.586  0.552
Donation and volunteering             0.700  0.790  0.740   0.746  0.825  0.783
Infrastructure and utilities damage   0.650  0.660  0.660   0.727  0.704  0.716
Injured or dead people                0.760  0.780  0.770   0.822  0.866  0.844
Missing and found people              0.470  0.170  0.240   0.512  0.427  0.466
Not humanitarian                      0.900  0.930  0.920   0.933  0.926  0.929
Requests or needs                     0.850  0.840  0.850   0.916  0.907  0.912
Response efforts                      0.330  0.070  0.120   0.424  0.226  0.295
Sympathy and support                  0.760  0.640  0.690   0.770  0.732  0.751

Table 8: Class-wise classification results of the consolidated dataset using CNN and BERT.

The BERT model also performs consistently better in both precision and recall. In Table 8, we report the class-wise performance of both CNN and BERT models for the humanitarian task. BERT performs better than or on par with CNN across all classes. More importantly, BERT performs substantially better than CNN on the minority classes, as highlighted in the table.

We further investigate the classification results of the CNN model for the minority class labels. We observe that the class "response efforts" is mostly confused with "donation and volunteering" and "not humanitarian". For example, the following tweet with the "response efforts" label, "I am supporting Rebuild Sankhu @crowdfunderuk", is classified as "donation and volunteering". We observe similar phenomena for other minority class labels. The class "displaced and evacuations" is confused with "donation and volunteering" and "caution and advice". It is interesting that the class "missing and found people" is confused with "donation and volunteering" and "not humanitarian". The following "missing and found people" tweet, "RT @Fahdhusain: 11 kids recovered alive from under earthquake rubble in Awaran. Shukar Allah!!", is classified as "donation and volunteering".
In Table 9, we report the results of event-aware training using both CNN and BERT. Event-aware training improves the classification performance by 1.3 points (F1) using CNN for the humanitarian task compared to the results without event information (see Table 7). However, no improvement is observed for the informativeness task. Training with event information enables the system to use the data of all disasters while preserving the disaster-specific distribution.

Event-aware training is also useful at the onset of a new disaster event. Based on the type of a new disaster, one may use the appropriate tag to optimize classification performance. The event-aware training can be extended to use more than one tag. For example, in addition to preserving the event information, one can also append a tag for the disaster region. In this way, one can optimize the model for more fine-grained domain information.

The event-aware training with BERT does not provide better results on either task, which requires further investigation; we leave it as a future study.
Social media data is noisy, and it often poses a challenge for labeling and training classifiers. While investigating the publicly available datasets, we realized that it is important to follow a number of steps before preparing and labeling any social media dataset, not just datasets for crisis computing.

         Informativeness              Humanitarian
Model    Acc    P      R      F1     Acc    P      R      F1
CNN      0.868  0.868  0.870
BERT     0.860  0.861  0.860
Table 9: Classification results of the event-aware experiments using the consolidated dataset.

These steps include (i) tokenization to help in the subsequent phases, (ii) removal of exact and near-duplicates, (iii) checking existing data, where the same tweet might already be annotated for the same task, and then (iv) labeling. For designing the classifier, we recommend checking the overlap between training and test splits to avoid any misleading performance results.

The classification performance that we report can be considered as benchmark results, against which any future study can compare. The current state of the art for the informativeness and humanitarian tasks can be found in (Burel et al., 2017; Alam et al., 2019). The F-measures for the informativeness and humanitarian tasks are reported as 0.838 and 0.613, respectively, on the CrisisLexT26 dataset in (Burel et al., 2017), whereas in (Alam et al., 2019), the reported F-measures for the informativeness and humanitarian tasks are 0.93 and 0.78, respectively. It is important to emphasize that the results reported in this study are reliable, as they are obtained on a dataset that has been cleansed of duplicate content, which might otherwise have led to misleading performance results.

The competitive performance of BERT encourages us to try deeper models such as BERT-large (Devlin et al., 2018) and Google T5 (Raffel et al., 2019). Another interesting angle is to use pre-trained multilingual models to classify tweets in different languages; a future research direction is to use multilingual models for the zero-shot classification of tweets. For the BERT-based model, it is also important to invest effort in different regularization methods to obtain better results, which we foresee as a future study.
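The near-duplicate removal step can be approximated with a simple token-overlap filter. This is an illustrative sketch under our own assumptions (Jaccard similarity with a 0.8 threshold); the actual cleansing procedure and threshold may differ:

```python
def jaccard(a, b):
    """Jaccard similarity between two token sets."""
    return len(a & b) / len(a | b) if a | b else 1.0

def remove_near_duplicates(tweets, threshold=0.8):
    """Greedy near-duplicate filtering: keep a tweet only if its token
    set is not too similar to any previously kept tweet. O(n^2), which
    is fine at dataset-preparation scale; a real pipeline might use
    MinHash or similar to scale further."""
    kept, kept_sets = [], []
    for text in tweets:
        tokens = set(text.lower().split())
        if all(jaccard(tokens, s) < threshold for s in kept_sets):
            kept.append(text)
            kept_sets.append(tokens)
    return kept
```

The same check, run across the train/test boundary, also catches the train-test overlap mentioned above.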
The event-aware experiments show that event information helps improve classification performance, which could also be a future research avenue. As we aim to release different versions of the dataset with benchmark results for the research community, we believe it will help the community to develop better models and compare results. Our consolidated dataset also includes a language tag, which will help in conducting multilingual experiments in future research. The resulting consolidated dataset covers a time span from 2010 to 2017, which can be used to study temporal aspects of crisis scenarios.
The information available on social media has been widely used by humanitarian organizations at times of disaster. Many techniques and systems have been developed to process social media data. However, the research community lacks a standard dataset and benchmarks to compare the performance of their systems. We tried to bridge this gap by consolidating existing datasets and providing benchmarks based on state-of-the-art CNN and BERT models.
References
Firoj Alam, Ferda Ofli, and Muhammad Imran. 2018. CrisisMMD: Multimodal Twitter datasets from natural disasters. In Proc. of the 12th ICWSM, pages 465–473. AAAI Press.

Firoj Alam, Muhammad Imran, and Ferda Ofli. 2019. CrisisDPS: Crisis data processing services. In Proc. of the 16th ISCRAM.

Grégoire Burel and Harith Alani. 2018. Crisis Event Extraction Service (CREES): Automatic detection and classification of crisis-related content on social media. In Proc. of the 15th ISCRAM.

Grégoire Burel, Hassan Saif, Miriam Fernandez, and Harith Alani. 2017. On semantics and deep learning for event detection in crisis situations. In Workshop on Semantic Deep Learning (SemDeep), at ESWC 2017.

Carlos Castillo. 2016. Big Crisis Data. Cambridge University Press.

Chenhui Chu, Raj Dabre, and Sadao Kurohashi. 2017. An empirical comparison of domain adaptation methods for neural machine translation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Muhammad Imran, Shady Elbassuoni, Carlos Castillo, Fernando Diaz, and Patrick Meier. 2013a. Practical extraction of disaster-relevant information from social media. In Proc. of the 22nd WWW, pages 1021–1024. ACM.

Muhammad Imran, Shady Mamoon Elbassuoni, Carlos Castillo, Fernando Diaz, and Patrick Meier. 2013b. Extracting information nuggets from disaster-related messages in social media. In Proc. of the 12th ISCRAM.

Muhammad Imran, Carlos Castillo, Ji Lucas, Patrick Meier, and Sarah Vieweg. 2014. AIDR: Artificial intelligence for disaster response. In Proc. of the ACM Conference on WWW, pages 159–162. ACM.

Muhammad Imran, Carlos Castillo, Fernando Diaz, and Sarah Vieweg. 2015. Processing social media messages in mass emergency: A survey. ACM Computing Surveys, 47(4):67.

Muhammad Imran, Prasenjit Mitra, and Carlos Castillo. 2016. Twitter as a lifeline: Human-annotated Twitter corpora for NLP of crisis-related messages. In Proc. of LREC 2016, Paris, France. ELRA.

Pallavi Jain, Robert Ross, and Bianca Schoen-Phelan. 2019. Estimating distributed representation performance in disaster-related social media classification. IEEE.

Jens Kersten, Anna Kruspe, Matti Wiegmann, and Friederike Klan. 2019. Robust filtering of crisis-related tweets. In ISCRAM.

Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Anna Kruspe, Jens Kersten, and Friederike Klan. 2019. Detecting event-related tweets by example using few-shot models. In Proc. of the 16th ISCRAM.

Venkata Kishore Neppalli, Cornelia Caragea, and Doina Caragea. 2018. Deep neural networks versus naïve Bayes classifiers for identifying informative tweets during disasters. In Proc. of the 15th ISCRAM.

Dat Tien Nguyen, Kamla Al-Mannai, Shafiq R. Joty, Hassan Sajjad, Muhammad Imran, and Prasenjit Mitra. 2017. Robust classification of crisis-related data on social networks using convolutional neural networks. In Proc. of the 11th ICWSM, pages 632–635. AAAI Press.

Brendan O'Connor, Michel Krieger, and David Ahn. 2010. TweetMotif: Exploratory search and topic summarization for Twitter. In Fourth International AAAI Conference on Weblogs and Social Media.

Alexandra Olteanu, Carlos Castillo, Fernando Diaz, and Sarah Vieweg. 2014. CrisisLex: A lexicon for collecting and filtering microblogged communications in crises. In Proc. of the 8th ICWSM. AAAI Press.

Alexandra Olteanu, Sarah Vieweg, and Carlos Castillo. 2015. What to expect when the unexpected happens: Social media communications across crises. In Proc. of the 18th ACM Conference on Computer Supported Cooperative Work & Social Computing, pages 994–1009. ACM.

Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. arXiv preprint arXiv:1802.05365.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2019. Exploring the limits of transfer learning with a unified text-to-text transformer.

Koustav Rudra, Pawan Goyal, Niloy Ganguly, Prasenjit Mitra, and Muhammad Imran. 2018. Identifying sub-events and summarizing disaster-related information from microblogs. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, pages 265–274. ACM.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Controlling politeness in neural machine translation via side constraints. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 35–40, San Diego, California. Association for Computational Linguistics.

Stephanie M. Strassel, Ann Bies, and Jennifer Tracey. 2017. Situational awareness for low resource languages: The LORELEI situation frame annotation task. In SMERP@ECIR, pages 32–41.

Irina P. Temnikova, Carlos Castillo, and Sarah Vieweg. 2015. EMTerms 1.0: A terminological resource for crisis tweets. In ISCRAM.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and Jamie Brew. 2019. HuggingFace's Transformers: State-of-the-art natural language processing. ArXiv, abs/1910.03771.

Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. arXiv preprint arXiv:1906.08237.

CrisisLex is one of the largest publicly available datasets and consists of two subsets, CrisisLexT26 and CrisisLexT6 (Olteanu et al., 2014). CrisisLexT26 comprises data from 26 different crisis events that took place in 2012 and 2013, with annotations for informative vs. not-informative as well as humanitarian categories (six classes) classification tasks, among others. CrisisLexT6, on the other hand, contains data from six crisis events that occurred between October 2012 and July 2013, with annotations for a related vs. not-related classification task.
CrisisNLP is another large-scale dataset collected during 19 different disaster events that happened between 2013 and 2015, annotated according to different schemes, including classes from humanitarian disaster response and some classes related to health emergencies (Imran et al., 2016).
SWDM2013 dataset consists of data from two events. The Joplin collection contains tweets from the tornado that struck Joplin, Missouri on May 22, 2011. The Sandy collection contains tweets collected during Hurricane Sandy, which hit the Northeastern US on Oct 29, 2012 (Imran et al., 2013a).
ISCRAM2013 dataset consists of tweets from two different events that occurred in 2011 (Joplin 2011) and 2012 (Sandy 2012). The Joplin 2011 data consists of 4,400 labeled tweets collected during the tornado that struck Joplin, Missouri (USA) on May 22, 2011, whereas the Sandy 2012 data consists of 2,000 labeled tweets collected during Hurricane Sandy, which hit the Northeastern US on Oct 29, 2012.
DRD consists of tweets collected during various crisis events that took place in 2010 and 2012. This dataset is annotated using 36 classes that include informativeness as well as humanitarian categories.
DSM dataset comprises 10K tweets collected and annotated with labels related vs. not-related to disasters.

CrisisMMD is a multimodal dataset consisting of tweets and associated images collected during seven disaster events that happened in 2017 (Alam et al., 2018). The annotations available and relevant to this study include two classification tasks: informative vs. not-informative and humanitarian categories (eight classes).

AIDR is the labeled dataset obtained from the AIDR system (Imran et al., 2014), which has been annotated by domain experts for different events and is made available upon request. We only retained labeled data that are relevant to this study.
In Table 10, we report the events associated with the respective datasets, i.e., ISCRAM2013, SWDM2013, CrisisLex, and CrisisNLP. The time period is from 2011 to 2015, which is a good representative of temporal aspects. In Table 11, we report the class label mapping for the ISCRAM2013, SWDM2013, CrisisLex, and CrisisNLP datasets. The first column of the table shows the mapped class for both informativeness and humanitarian tasks. Note that all humanitarian class labels are also mapped to informative, and not-humanitarian labels are mapped to not-informative, in the data preparation step. In Table 12, we report the class label mapping for the informativeness and humanitarian tasks for the DRD dataset. The DSM dataset (https://data.world/crowdflower/disasters-on-social-media) only contains tweets labeled as relevant vs. not-relevant, which we mapped for the informativeness task as shown in Table 13. The CrisisMMD dataset has been annotated for both informativeness and humanitarian tasks; therefore, only very minor label mapping was needed, as shown in Table 14. The AIDR data has been labeled by domain experts using the AIDR system during different events. The label names we mapped for the informativeness and humanitarian tasks are shown in Table 15.

Dataset               Year       Event name
ISCRAM2013            2011       Joplin
SWDM2013              2012       Sandy
CrisisLexT6           2012       US Sandy Hurricane
CrisisLexT6           2013       Alberta Floods
CrisisLexT6           2013       Boston Bombings
CrisisLexT6           2013       Oklahoma Tornado
CrisisLexT6           2013       Queensland Floods
CrisisLexT6           2013       West Texas Explosion
CrisisLexT26          2012       Costa-Rica Earthquake
CrisisLexT26          2012       Italy Earthquakes
CrisisLexT26          2012       Philippines Floods
CrisisLexT26          2012       Philippines Typhoon Pablo
CrisisLexT26          2012       Venezuela Refinery Explosion
CrisisLexT26          2013       Alberta Floods
CrisisLexT26          2013       Australia Bushfire
CrisisLexT26          2013       Bangladesh Savar building collapse
CrisisLexT26          2013       Bohol Earthquake
CrisisLexT26          2013       Boston Bombings
CrisisLexT26          2013       Brazil Nightclub Fire
CrisisLexT26          2013       Canada Lac Megantic Train Crash
CrisisLexT26          2013       Colorado Floods
CrisisLexT26          2013       Glasgow Helicopter Crash
CrisisLexT26          2013       Italy Sardinia Floods
CrisisLexT26          2013       LA Airport Shootings
CrisisLexT26          2013       Manila Floods
CrisisLexT26          2013       NY Train Crash
CrisisLexT26          2013       Philippines Typhoon Yolanda
CrisisLexT26          2013       Queensland Floods
CrisisLexT26          2013       Singapore haze
CrisisLexT26          2013       West Texas explosion
CrisisLexT26          2012       Guatemala Earthquake
CrisisLexT26          2012       Colorado Wildfires
CrisisNLP-CF          2013       Pakistan Earthquake
CrisisNLP-CF          2014       California Earthquake
CrisisNLP-CF          2014       Chile Earthquake
CrisisNLP-CF          2014       India Floods
CrisisNLP-CF          2014       Mexico Hurricane Odile
CrisisNLP-CF          2014       Middle-East Respiratory Syndrome
CrisisNLP-CF          2014       Pakistan Floods
CrisisNLP-CF          2014       Philippines Typhoon Hagupit
CrisisNLP-CF          2014       Worldwide Ebola
CrisisNLP-CF          2015       Nepal Earthquake
CrisisNLP-CF          2015       Vanuatu Cyclone Pam
CrisisNLP-volunteers  2014-2015  Worldwide Landslides
CrisisNLP-volunteers  2014       California Earthquake
CrisisNLP-volunteers  2014       Chile Earthquake
CrisisNLP-volunteers  2014       Iceland Volcano
CrisisNLP-volunteers  2014       Malaysia Airline MH370
CrisisNLP-volunteers  2014       Mexico Hurricane Odile
CrisisNLP-volunteers  2014       Middle-East Respiratory Syndrome
CrisisNLP-volunteers  2014       Philippines Typhoon Hagupit
CrisisNLP-volunteers  2015       Nepal Earthquake
CrisisNLP-volunteers  2015       Vanuatu Cyclone Pam
Table 10: Events in the CrisisLex, CrisisNLP, ISCRAM2013, and SWDM2013 datasets.

Mapped class | Original class | Source | Annotation description
Affected individual | Affected individuals | CrisisLexT26 | Deaths, injuries, missing, found, or displaced people, and/or personal updates.
✗ | Animal management | CrisisNLP-volunteers | Pets and animals, living, missing, displaced, or injured/dead.
Caution and advice | Caution and advice | CrisisLexT26 | If a message conveys/reports information about some warning or a piece of advice about a possible hazard of an incident.
Disease related | Disease signs or symptoms | CrisisNLP-CF | Reports of symptoms such as fever, cough, diarrhea, and shortness of breath, or questions related to these symptoms.
Disease related | Disease transmission | CrisisNLP-CF | Reports of disease transmission or questions related to disease transmission.
Disease related | Disease treatment | CrisisNLP-CF | Questions or suggestions regarding the treatments of the disease.
Disease related | Disease prevention | CrisisNLP-CF | Questions or suggestions related to the prevention of disease or mention of a new prevention strategy.
Disease related | Disease affected people | CrisisNLP-CF | Reports of affected people due to the disease.
Displaced and evacuations | Displaced people | CrisisNLP-volunteers | People who have relocated due to the crisis, even for a short time (includes evacuations).
Displaced and evacuations | Displaced people and evacuations | CrisisNLP-CF | People who have relocated due to the crisis, even for a short time (includes evacuations).
Donation and volunteering | Donation needs or offers or volunteering services | CrisisNLP-CF | Reports of urgent needs or donations of shelter and/or supplies such as food, water, clothing, money, medical supplies or blood; and volunteering services.
Donation and volunteering | Donations and volunteering | CrisisLexT26 | Needs, requests, or offers of money, blood, shelter, supplies, and/or services by volunteers or professionals.
Donation and volunteering | Donations of money | CrisisNLP-volunteers | Donations of money.
Donation and volunteering | Donations of money goods or services | SWDM2013/ISCRAM2013 | If a message speaks about money raised, donation offers, goods/services offered or asked by the victims of an incident.
Donation and volunteering | Donations of supplies and or volunteer work | CrisisNLP-volunteers | Donations of supplies and/or volunteer work.
Donation and volunteering | Money | CrisisNLP-volunteers | Money requested, donated, or spent.
Donation and volunteering | Shelter and supplies | CrisisNLP-volunteers | Needs or donations of shelter and/or supplies such as food, water, clothing, medical supplies or blood.
Donation and volunteering | Volunteer or professional services | CrisisNLP-volunteers | Services needed or offered by volunteers or professionals.
Informative | Informative | CrisisNLP-CF | 2014 Iceland Volcano en, 2014 Malaysia Airline MH370 en.
Informative | Informative direct | SWDM2013/ISCRAM2013 | If the message is of interest to other people beyond the author's immediate circle, and seems to be written by a person who is a direct eyewitness of what is taking place.
Informative | Informative direct or indirect | SWDM2013/ISCRAM2013 | If the message is of interest to other people beyond the author's immediate circle, but there is not enough information to tell if it is a direct report or a repetition of something from another source.
Informative | Informative indirect | SWDM2013/ISCRAM2013 | If the message is of interest to other people beyond the author's immediate circle, and seems to be seen/heard by the person on the radio, TV, newspaper, or other source. The message must specify the source.
Informative | Related and informative | CrisisLexT26 | Related to the crisis and informative: if it contains useful information that helps understand the crisis situation.
Infrastructure and utilities damage | Infrastructure damage | CrisisNLP-volunteers | Houses, buildings, roads damaged, or utilities such as water, electricity, interrupted.
Infrastructure and utilities damage | Infrastructure and utilities | CrisisNLP-volunteers | Buildings or roads damaged or operational; utilities/services interrupted or restored.
Infrastructure and utilities damage | Infrastructure | CrisisNLP-volunteers | Infrastructure.
Infrastructure and utilities damage | Infrastructure and utilities damage | CrisisNLP-CF | Reports of damaged buildings, roads, bridges, or utilities/services interrupted or restored.
Injured or dead people | Injured or dead people | CrisisNLP-CF | Reports of casualties and/or injured people due to the crisis.
Injured or dead people | Injured and dead | CrisisNLP-volunteers | Injured and dead.
Injured or dead people | Deaths reports | CrisisNLP-CF | Injured and dead.
Injured or dead people | Casualties and damage | SWDM2013/ISCRAM2013 | If a message reports the information about casualties or damage done by an incident.
Missing and found people | Missing trapped or found people | CrisisNLP-volunteers | Missing, trapped, or found people; questions and/or reports about missing or found people.
Missing and found people | People missing or found | CrisisNLP-volunteers | People missing or found.
Missing and found people | People missing found or seen | CrisisNLP-volunteers | If a message reports about a missing or found person affected by an incident, or a celebrity visit seen at ground zero.
Not humanitarian | Not applicable | CrisisLexT26 | Not applicable.
Not humanitarian | Not related to crisis | CrisisNLP-volunteers | Not related to this crisis.
Not humanitarian | Not informative | CrisisNLP-volunteers, CrisisLexT26 | 1. Refers to the crisis, but does not contain useful information that helps you understand the situation; 2. Not related to the typhoon, or not relevant for emergency/humanitarian response; 3. Related to the crisis, but not informative: if it refers to the crisis, but does not contain useful information that helps understand the situation.
✗ | Not labeled | CrisisLexT26 | Not labeled.
Not humanitarian | Not related or irrelevant | CrisisNLP-CF, CrisisNLP-volunteers | 1. Not related or irrelevant; 2. Unrelated to the situation or irrelevant.
Not humanitarian | Not related to the crisis | CrisisNLP-volunteers | Not related to crisis.
Not humanitarian | Not relevant | CrisisLexT26 | Not relevant.
Not humanitarian | Off-topic | CrisisLexT6 | Off-topic.
Not humanitarian | Other | CrisisNLP-volunteers | If the message is not in English, or if it cannot be classified.
Not humanitarian | Not related | CrisisLexT26 | Not related.
Not humanitarian | Not physical landslide | CrisisNLP-volunteers | The item does not refer to a physical landslide.
Not humanitarian | Terrorism not related | CrisisNLP-volunteers | If the tweet is not about terrorism related to the flight MH370.
Other relevant information | Other relevant information | CrisisNLP-volunteers | 1. Other useful information that helps understand the situation; 2. Informative for emergency/humanitarian response, but in none of the above categories, including weather/evacuations/etc.
Other relevant information | Other relevant | CrisisNLP-volunteers | 1. Other useful information that helps understand the situation; 2. Informative for emergency/humanitarian response, but in none of the above categories, including weather/evacuations/etc.
Other relevant information | Other useful information | CrisisLexT26 | 1. Other useful information not covered by any of the following categories: affected individuals, infrastructure and utilities, donations and volunteering, caution and advice, sympathy and emotional support.
Other relevant information | Related but not informative | CrisisLexT26 | Related to the crisis, but not informative: if it refers to the crisis, but does not contain useful information that helps understand the situation.
Other relevant information | Relevant | CrisisLexT26; CrisisNLP | Relevant.
Personal update | Personal | CrisisNLP-volunteers | If the tweet conveys some sort of personal opinion, which is not of interest to a general audience.
Personal update | Personal only | CrisisNLP-volunteers | 1. Personal and only useful to a small circle of family/friends of the author; 2. If a message is only of interest to its author and her immediate circle of family/friends and does not convey any useful information to other people who do not know the author.
Personal update | Personal updates | CrisisNLP-volunteers | Status updates about individuals or loved ones.
Physical landslide | Physical landslide | CrisisNLP-volunteers | The item is related to a physical landslide.
Requests or needs | Needs of those affected | CrisisNLP-volunteers | Needs of those affected.
Requests or needs | Requests for help needs | CrisisNLP-volunteers | Something (e.g., food, water, shelter) or someone (e.g., volunteers, doctors) is needed.
Requests or needs | Urgent needs | CrisisNLP-volunteers | Something (e.g., food, water, shelter) or someone (e.g., volunteers, doctors) is needed.
Response efforts | Humanitarian aid provided | CrisisNLP-volunteers | Affected populations receiving food, water, shelter, medication, etc. from humanitarian/emergency response organizations.
Response efforts | Response efforts | CrisisNLP-volunteers | All info about responders. Affected populations receiving food, water, shelter, medication, etc. from humanitarian/emergency response organizations.
Sympathy and support | Sympathy and emotional support | CrisisNLP-volunteers | Sympathy and emotional support.
Sympathy and support | Sympathy and support | CrisisLexT26 | Thoughts, prayers, gratitude, sadness, etc.
Sympathy and support | Personal updates sympathy support | CrisisNLP-volunteers | Personal updates, sympathy, support.
Sympathy and support | Praying | CrisisNLP-volunteers | If the author of the tweet prays for flight MH370 passengers.
Terrorism related information | Terrorism related | CrisisNLP-volunteers | If the tweet reports a possible terrorism act involved.
Table 11: Class label mapping and grouping for the CrisisLex, CrisisNLP, ISCRAM2013, and SWDM2013 datasets.

Original class | Informative | Humanitarian
Related | Informative | ✗
Aid related | Informative | Requests or needs
Request | Informative | Requests or needs
Offer | Informative | Donation and volunteering
Medical help | Informative | Requests or needs
Medical products | Informative | Requests or needs
Search and rescue | Informative | Displaced and evacuations
Security | ✗ | ✗
Military | ✗ | ✗
Water | Informative | Requests or needs
Food | Informative | Requests or needs
Shelter | Informative | Requests or needs
Clothing | Informative | Requests or needs
Money | Informative | Requests or needs
Missing people | Informative | Missing and found people
Refugees | Informative | Requests or needs
Death | Informative | Injured or dead people
Other aid | Informative | Requests or needs
Infrastructure related | Informative | Infrastructure and utilities damage
Transport | Informative | Infrastructure and utilities damage
Buildings | Informative | Infrastructure and utilities damage
Electricity | Informative | Infrastructure and utilities damage
Hospitals | Informative | Infrastructure and utilities damage
Shops | Informative | Infrastructure and utilities damage
Aid centers | Informative | Infrastructure and utilities damage
Other infrastructure | Informative | Infrastructure and utilities damage
Table 12: Class label mapping for Disaster Response Data (DRD).
Original class | Mapped class
Relevant | Informative
Not relevant | Not informative
Table 13: Class label mapping for Disasters on Social Media (DSM) dataset.
Original class | Informative | Humanitarian
Affected individuals | Informative | Affected individual
Infrastructure and utility damage | Informative | Infrastructure and utilities damage
Injured or dead people | Informative | Injured or dead people
Missing or found people | Informative | Missing and found people
Not relevant or cant judge | Not informative | Not humanitarian
Other relevant information | Informative | Other relevant information
Rescue volunteering or donation effort | Informative | Donation and volunteering
Vehicle damage | Informative | Infrastructure and utilities damage
Table 14: Class label mapping for CrisisMMD.

Original class | Informative | Humanitarian
Blocked roads | Informative | Infrastructure and utilities damage
Blood or other medical supplies needed | Informative | Requests or needs
Building damaged | Informative | Infrastructure and utilities damage
Camp shelter | Informative | Requests or needs
Casualties and damage | Informative | Infrastructure and utilities damage
Caution and advice | Informative | Caution and advice
Clothing needed | Informative | Requests or needs
Damage | Informative | Infrastructure and utilities damage
Displaced people | Informative | Displaced and evacuations
Donations | Informative | Donation and volunteering
Food and or water needed | Informative | Requests or needs
Food water | Informative | Requests or needs
Humanitarian aid provided | Informative | Response efforts
Informative | Informative | Informative
Infrastructure and utilities | Informative | Infrastructure and utilities damage
Infrastructure damage | Informative | Infrastructure and utilities damage
Injured dead | Informative | Injured or dead people
Injured or dead people | Informative | Injured or dead people
Loss of electricity | Informative | Infrastructure and utilities damage
Loss of internet | Informative | Infrastructure and utilities damage
Missing trapped or found people | Informative | Missing and found people
Money | Informative | Requests or needs
Money needed | Informative | Requests or needs
Needs and requests for help | Informative | Requests or needs
Non emergency but relevant | Informative | ✗