A multi-modal approach towards mining social media data during natural disasters -- a case study of Hurricane Irma
Somya D. Mohanty, Brown Biggers, Saed Sayedahmed, Nastaran Pourebrahim, Evan B. Goldstein, Rick Bunch, Guangqing Chi, Fereidoon Sadri, Tom P. McCoy, Arthur Cosby
a Department of Computer Science, University of North Carolina - Greensboro
b Department of Geography, Environment, and Sustainability, University of North Carolina - Greensboro
c Department of Agricultural Economics, Sociology, and Education, Population Research Institute, and Social Science Research Institute, The Pennsylvania State University
d Department of Family & Community Nursing, University of North Carolina - Greensboro
e Social Science Research Center, Mississippi State University

167 Petty Building, Greensboro, NC 27402-6170
Abstract
Streaming social media provides a real-time glimpse of extreme weather impacts. However, the volume of streaming data makes mining information a challenge for emergency managers, policy makers, and disciplinary scientists. Here we explore the effectiveness of data-driven approaches to mine and filter information from streaming social media data from Hurricane Irma's landfall in Florida, USA. We use 54,383 Twitter messages (out of 784K geolocated messages) from 16,598 users from Sept. 10-12, 2017 to develop 4 independent models to filter data for relevance: 1) a geospatial model based on forcing conditions at the place and time of each tweet, 2) an image classification model for tweets that include images, 3) a user model to predict the reliability of the tweeter, and 4) a text model to determine if the text is related to Hurricane Irma. All four models are independently tested, and can be combined to quickly filter and visualize tweets based on user-defined thresholds for each submodel. We envision that this type of filtering and visualization routine can be useful as a base model for data capture from noisy sources such as Twitter. The data can then be subsequently used by policy makers, environmental managers, emergency managers, and domain scientists interested in finding tweets with specific attributes to use during different stages of the disaster (e.g., preparedness, response, and recovery), or for detailed research.

Keywords: data mining; social media; natural disaster; machine learning

∗ Corresponding author
Email address: [email protected] (Dr. Somya D. Mohanty)

Preprint submitted to International Journal of Disaster Risk Reduction, January 5, 2021
1. Introduction
Climate change is expected to drive increases in the intensity of tropical cyclones [1] and increase the occurrence of 'blue sky' flooding [2]. Despite these hazards, coastal populations [3] and investments in the coastal built environment [4] are likely to grow. Understanding the impact of extreme storms and climate change on coastal communities requires pervasive environmental sensing. Beyond the collection of environmental data streams such as river gages, wave buoys, and tidal stations, internet-connected devices such as mobile phones allow for the creation of real-time crowd-sourced information during extreme events. A key area of research is understanding how to use streaming social media information during extreme events — to detect disasters, provide situational awareness, understand the range of impacts, and guide disaster relief and rescue efforts [e.g. 5, 6, 7, 8, 9, 10, 11].

Twitter – with approximately 600 million tweet posts every day [12] and programmatic access to messages – has become one of the most popular social media platforms, and a common data source for research on extreme events [e.g. 13, 14, 15]. In addition to text, a subset of messages shared across Twitter contain images captured by its users (20-25% of messages contain images/videos [16]). A key hurdle for studying these aspects of extreme events with Twitter is that the data are both large and considerably noisier than curated sources such as dedicated streams of information (e.g., dedicated environmental sensors). Posts on Twitter during disasters might also be irrelevant, or provide mis- or dis-information [e.g., 17, 18], highlighting the importance of filtering and subsetting social media data when used during disaster events. Therefore a key step in all work with Twitter data is to filter and subset the data stream.

Previous work has addressed filtering and subsetting Twitter data during hazards and other extreme events.
Techniques have included relying on specific hashtags [e.g., 19], semantic filtering [20], keyword-based filtering [18], as well as natural language processing (NLP) and text classification using machine learning algorithms [18, 21]. Classifiers such as support vector machines and Naive Bayes classifiers have been used to differentiate between real-world event messages and non-event messages [22], and to extract valuable "information nuggets" from Twitter text messages [23]. The tweets' length and textual features can also be used to filter emergency-related tweets [23]. Tweets have been scored against classes of event-specific words (term-classes) to aid in filtering [18]. Previous work has filtered and subset tweets using expert-defined features of the tweet [24]. Images have also been used to subset tweets based on the presence/absence of visible damage [25]. Filtering can also be understood through the extensive work on determining the relevance of tweets for a given event — see recent work and reviews [26, 27, 28]. In the context of this paper, we view filtering as any generic process that subsets tweets, even beyond a binary class division of relevance.

A few studies have identified the significance of adding spatial features and external sources for a better assessment of tweets' relevance for disaster events. For example, [29] enriched their model with geographic data to identify relevant information. Previous work has used spatio-temporal data to determine tweet relevance [21], or linked geolocated tweets to other environmental data streams [e.g., 30, 31].

As observed from prior work, capturing situational awareness information from social media data involves a hierarchical filtering approach [32]. Specifically, researchers and interested stakeholders filter down the data from the noisy social media data stream to fit their specific use cases (such as type of image - destruction, damage, flooding; type of text - damage, donation, resource request/offer; spread of information; etc.).
A key component in such an approach is the quality of the baseline data capture. To this end, our study proposes a novel approach to quality-gating the data capture from social media data streams using developed threshold measures. This baseline filtering methodology can be used to find relevant tweets and to refine the data capture routines. Specifically, the goal of our study is to explore a multi-modal filtering approach which can be used to provide situational awareness from social media data during disaster events. We develop an initial prototype using tweets from Florida, USA during Hurricane Irma. The filtering routine allows users to adjust four separate models to filter Twitter messages: a geospatial model, an image model, a user model, and a text model. All four models are tested separately, and can be operated independently or in tandem. This is a design feature, as we envision the sorting and filtering thresholds will be different for different users, for different events, and for different locations. We work through each model and discuss the combined model in the following sections.
2. Methodology
Hurricane Irma (Figure 1) was the first category 5 hurricane in the 2017 Atlantic hurricane season [33]. Hurricane Irma formed on August 31st, 2017, impacting many islands of the Caribbean, and finally dissipating over the continental United States [33]. Here we focus on the Twitter record of Irma specifically in Florida, USA. Irma made landfall in the Florida Keys on 09/10/2017 as a category 4 hurricane and dissipated shortly after 09/13/2017. 134 fatalities were recorded as a result of the hurricane, with an estimated loss of $

Figure 1: The path of Hurricane Irma in September 2017 (orange line), the extent of tropical storm force winds (pink outline), and the location for all 784K geolocated tweets used as the basis for this study (black dots).

We used the Twitter Application Programming Interface (API) to collect tweets located in the geospatial bounding box that captured the state boundary of Florida. Tweets were recorded for the period of 09/01/2017 to 10/10/2017, resulting in the collection of 784K tweets from 96K users during the time period. Our work is focused on the 72 hours (09/10/2017 - 09/12/2017) when Hurricane Irma was near or over Florida. Therefore we subset the data and use 54,383 tweets from 16,598 users during this 72hr window. Figure 1 highlights the locations of the Twitter messages, along with the path of Hurricane Irma and the extent of tropical storm force winds.

Each tweet from the Twitter API has 31 distinct metadata attributes [36] that can conceptually be grouped into three categories: 1) Spatio-temporal (time of creation and geolocation [latitude, longitude]), 2) Tweet content (tweet text, weblinks, hashtags, and images), and 3) Tweet source (account age, friends count, followers count, statuses count, and if verified). Geolocated tweets can have one of two types of location data — Places or Coordinates.
Coordinates are exact locations with latitude and longitude attributes, while Places are locations within a Bounding Box or a Polygon designating a geospatial area in which the tweet is recorded [37]. For tweets with Places attributes, we transform the area representation to a single point by selecting the centroid of the Polygon as the location represented by the tweet. Within our study, 42.58% (23,157) of the tweets had Coordinate locations and 57.42% (31,226) had Place locations.
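Collapsing a Place bounding box to a single representative point can be sketched as follows (a minimal illustration, not the study's code; the `place_centroid` name and the GeoJSON-style [longitude, latitude] corner order are assumptions):

```python
def place_centroid(bounding_box):
    """Collapse a Twitter Place bounding box (list of [lon, lat] corners)
    to a single representative point, as done for Place-type tweets."""
    lons = [pt[0] for pt in bounding_box]
    lats = [pt[1] for pt in bounding_box]
    return (sum(lons) / len(lons), sum(lats) / len(lats))
```

For a rectangular bounding box, the mean of the corners coincides with the geometric centroid; for a general polygon, an area-weighted centroid would be more precise.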
We collected meteorological sensor data, wind speed (in mph) and precipitation (in inches), for each county in Florida for the 72 hours (09/10/2017 - 09/12/2017). The hourly wind speeds were collected from the NOAA National Centers for Environmental Information (NCEI). Hourly precipitation values were obtained from the United States Geological Survey's Geo Data Portal (USGS GDP) of the United States Stage IV Quantitative Precipitation Archive. Precipitation values from the closest weather station were used due to difficulty in obtaining reliable data for all weather stations. In addition to meteorological forcing, we collected data consisting of the location of the hurricane's eye, the category of the hurricane, pressure, and wind speed (NOAA National Hurricane Center). These data were discretized into hourly windows for the 72 hours.
We aligned the 72hrs of Twitter data and the corresponding 72hrs of meteorological forcing data. Wind and precipitation values at the geolocation of a tweet were calculated using Inverse Distance Weighting (IDW). IDW is an interpolation method that calculates a value at a certain point using values of other known points:

$W_p = \frac{\sum_{i=1}^{n} W_i / D_i^k}{\sum_{i=1}^{n} 1 / D_i^k}$ (1)

where $W_p$ is the wind speed to be interpolated at point $p$, $W_i$ is the wind speed at point $i$, $D_i$ is the distance between points $p$ and $i$, and $k$ is a power parameter that reduces the effect of distant points as it increases. IDW has been widely used to interpolate climatic data [38]. The method has demonstrated accurate results when compared to other interpolation methods, especially in regions characterized by strong local variations [39]. IDW assumes that any measurement taken at a fixed location (e.g., a weather station) has local influence on the surrounding area, and that this influence decreases with increasing distance. Within our study we chose IDW as our interpolation method because meteorological factors in a hurricane are highly influenced by local variations.

Furthermore, each tweet was also annotated with the corresponding temporal hurricane conditions data. Specifically, for each hourly time window, a tweet was associated with its distance from the eye of the hurricane and the hurricane's conditions (i.e., pressure, max wind speed) during that window.

Our goal is to develop a single model for tweet relevance based on four submodels — 1) the relevance of the tweet based on geospatial attributes (i.e., the tweet's location relative to the forcing conditions of the hurricane), 2) the relevance of tweet images (when media is included in the tweet), 3) a score for the reliability of the user (i.e., network attributes to predict if a user is 'verified' by Twitter), and 4) the relevance of the tweet text.
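The IDW calculation can be sketched directly from the formula above (an illustrative implementation; the `idw` name, the planar Euclidean distance, and the `(location, value)` station format are assumptions):

```python
import math

def idw(point, stations, k=2):
    """Inverse Distance Weighting: estimate a value at `point` from
    (location, value) pairs in `stations`, weighting each observed
    value by 1 / distance**k."""
    num = den = 0.0
    for (x, y), w in stations:
        d = math.hypot(point[0] - x, point[1] - y)
        if d == 0:
            return w  # tweet exactly at a station: use the observed value
        num += w / d ** k
        den += 1 / d ** k
    return num / den
```

Larger `k` concentrates influence on nearby stations, matching the paper's rationale that hurricane conditions are dominated by local variations.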
The methods used to construct each of these models, and the results of the models (submodels and the combined model), are discussed in Section 3.
Our goal in designing a geospatial relevance model was to search for thresholds in forcing conditions where tweets were likely to be related to Hurricane Irma, as opposed to background social media discussions. Specifically, we posit that messages which are in close geospatial and temporal proximity to the disaster event will have more relevant situational awareness information than those which are not. Furthermore, such an approach can be used in real time during the occurrence of an event, where meteorological data can provide key information about the disaster's impact at different locations.

There are many meteorological conditions that can be used as proxies for extreme disaster conditions. We focus here on searching for modeling functions relating wind speed ($w$), precipitation ($p$), and distance from the hurricane eye ($d$). We acknowledge that other factors could be used in addition to these three attributes. For example, rainfall during a given interval could be quantified in several ways, such as mean rainfall rate, max rainfall rate, or total rainfall in a given interval. Similarly, wind metrics could include mean wind speed, max wind speed, metrics based on wind gusts, etc. For locations near the coast, metrics could include tide elevations or storm surge elevations, and locations near streams could include stage and discharge data. Ultimately we chose wind speed, precipitation, and distance from the hurricane eye because these factors are available everywhere (vs. metrics that are only applicable along streams and rivers) and because they are commonly available and collected by even basic meteorological stations.

Nine different functions combining the geospatial attributes — 1) $\frac{wind \cdot rain}{distance}$, 2) $\frac{rain}{distance}$, 3) $\frac{wind}{distance}$, 4) $\frac{wind \cdot rain}{\sqrt{distance}}$, 5) $\frac{rain}{\sqrt{distance}}$, 6) $\frac{wind}{\sqrt{distance}}$, 7) $\frac{wind \cdot rain}{\sqrt[3]{distance}}$, 8) $\frac{rain}{\sqrt[3]{distance}}$, and 9) $\frac{wind}{\sqrt[3]{distance}}$ — were compared to identify the model best suited to creating a relevance score for the tweets.
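The nine candidate functions can be computed together, as in this sketch (hypothetical helper; the cube-root denominators in the last three functions follow our reading of the formulas and should be treated as an assumption):

```python
import math

def geo_scores(wind, rain, distance):
    """Evaluate the nine candidate relevance functions: each pairs a
    numerator (wind*rain, rain, or wind) with a denominator that is a
    power of distance (d, sqrt(d), or — assumed here — cbrt(d))."""
    nums = [wind * rain, rain, wind]
    dens = [distance, math.sqrt(distance), distance ** (1 / 3)]
    return [n / d for d in dens for n in nums]
```

Each output rises with more severe forcing (higher wind, more rain) and falls with distance from the eye, matching the heuristic described in the text.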
In each of the models, wind speed and precipitation acted as numerators (individually or combined), whereas distance was used as a denominator — this was a heuristic choice, as tweets are likely more relevant if forcing conditions are more severe (higher wind, more precipitation, closer distance to the hurricane).

Approximately 19,000 tweets from the Irma dataset were hand labeled by human coders as "Irma related" or "non-Irma related" based on the tweet content. The performance of each geospatial function was evaluated by comparing the ratio of Irma-related tweets to the total number of tweets during each time window. The ratios obtained from each formula were normalized using three different approaches - min-max scaling, log transformation, and Box-Cox transformation. Ranking of Shapiro-Wilk (SW) test statistics was used to assess normality. In addition, multiple observed statistics of mean, standard deviation, and percentage of values within 1, 2, and 3 standard deviations from the mean were calculated to evaluate normality. The goal of this normalization procedure was to establish a comparative scoring range for each of the models. The scores enable development of a combined overall model for filtering tweets relevant to the hurricane (as described in Section 2.4). Apart from the ratio, we also evaluated the F1-score for the model, $F_1 = 2 \cdot \frac{precision \cdot recall}{precision + recall}$, where $precision = \frac{TP}{TP + FP}$, $recall = \frac{TP}{TP + FN}$, and $TP$, $FP$, $TN$, and $FN$ represent the number of true positives, false positives, true negatives, and false negatives, respectively.
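These definitions can be computed directly from confusion-matrix counts (a minimal sketch; the function name is ours):

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F1 from confusion-matrix counts,
    following the standard definitions used in the evaluation."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```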
Supervised machine learning models were used to develop automated classification of the images in the Twitter dataset. The goal of this model is twofold: 1) to develop a binary classifier capable of distinguishing hurricane-related images from non-related ones, and 2) to then develop a multi-label annotator capable of classifying the hurricane-related images into one or more of three incident categories — 1) Flood, 2) Wind, and 3) Destruction.

A key hurdle in the approach was the lack of available labeled training data for supervised classification. We developed a web platform for image labeling and annotation by human coders. The platform took unlabeled images and displayed them in a browser for human coders to annotate. Within the browser, the coder was asked the question: Does this image have any of the following — 1) Flooding, 2) Windy, and 3) Destruction?
An image is considered "Flooding" if there is water accumulation in an area of the image. An image is considered "Windy" if there are visual elements in the picture which show tree branches moving in a direction, objects that are flying, or heavy rain visible in the image. An image is considered to have "Destruction" if there is damage to property, vehicles, roads, or permanent structures. An image can be in one or more of the previous classes. If an image has one of the codified classes, it is labelled as Irma related, and if it does not have any of them the image is labelled as not related.

For the dataset, approximately 7,000 images were labeled by 3 human coders/raters, with the data divided equally between them. Following codification, the resulting dataset had the following distribution — Related: 817 / Not-Related: 6,081 images; and Wind: 120 images; Flooding: 266 images; Destruction: 571 images. We also evaluated the inter-rater reliability using Light's Kappa [40] for 100 sampled images (with a balanced distribution of related and not-related classes) that were labeled by all three coders. Agreement between all three coders for related versus not-related was 0.77, and across tags: Flooding - 0.88; Windy - 0.27; and Destruction - 0.78. This shows significant agreement among the coders on the labeling [41], other than the Windy tag (poor/chance agreement).

This annotated dataset was used to train deep learning models based on convolutional neural network (CNN) architectures. Convolutional networks have been widely used in large-scale image and video recognition [42]. A CNN architecture consists of an input layer, an output layer, and several hidden layers in between. A hidden layer can be a convolutional layer, a pooling layer (e.g., max, min, or average), a fully connected layer, a normalization layer, or a concatenation layer. Within our approach, we evaluated three modern CNN architectures — 1) VGGNet, 2) ResNet, and 3) Inception-v3 — and compared the performance of each model to its counterparts.

In VGGNet [42], the image is passed through a stack of CNN layers, where filters with a very small receptive field are used. Spatial pooling is done by five max-pooling layers, which follow some of the convolutional layers. The limitation of VGGNet is its large number of parameters, which makes it challenging to handle. Residual Neural Network (ResNet) [43] was developed with fewer filters and in turn has lower complexity than VGGNet. While the baseline architecture of ResNet was mainly inspired by VGGNet, a shortcut connection was added to each pair of filters in the model. In comparison, Inception-v3 [44] uses convolutional and pooling layers which are concatenated to create a time/memory efficient architecture. After the last concatenation, a dropout layer is used to drop some features to prevent overfitting before proceeding with the final result. The architecture is quite versatile: it can be used with both low-resolution and high-resolution images, and can distinguish any size of pattern, from a small detail to the whole image. This makes it useful in our application, as the quality and type of image can vary widely due to the disparate smart devices used by the Twitter population.
The pre-trained Inception-v3 is trained on the ImageNet [45] dataset, which consists of hundreds of thousands of images in over one thousand categories. The weights of this model are used as a starting point for training and are fine-tuned using our sample images. The approach takes advantage of transfer learning [46], where the classifier initially learns features of physical objects in a wide variety of scenarios and is then trained on specific observations within our data. This enables a more accurate and generalizable model.

Data augmentation methods were used to expand the number of training samples and therefore improve model accuracy. For example, additional training images were generated by rotating and scaling the original images. This was done to balance the number of Irma-related images against the unrelated ones. The resulting dataset consisted of approximately 6,000 images in each class for the binary classifier, and approximately 2,000 images in each class for the annotator model. The models were trained and tested using a 70-30 split on the dataset. For each model, performance scores (precision, recall, and F1) were recorded. Probability scores for each tweet image were then recorded for every class, which were further normalized using a log-transform and re-scaled using min-max scaling to be used in the overall model.
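Augmentation of the kind described can be illustrated with plain NumPy; the paper rotated and scaled images, and this simplified sketch substitutes 90-degree rotations and mirror flips (the `augment` helper is hypothetical, not the paper's pipeline):

```python
import numpy as np

def augment(image):
    """Generate rotated and mirrored variants of one training image,
    a simple way to expand the minority (Irma-related) class."""
    variants = [np.rot90(image, k) for k in range(4)]   # 0, 90, 180, 270 deg
    variants += [np.fliplr(v) for v in variants]        # mirror each rotation
    return variants
```

Eight variants per source image; in practice, continuous rotations and rescaling (as in the paper) produce far more.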
It is essential to quantify the authenticity of user accounts which have posted messages and images during a disaster event. For this purpose, our goal was to develop a scoring model which can provide continuous probabilistic measures of account authenticity.

Manually annotating user reliability in a large dataset such as Twitter is not practical. As we did not have a labeled dataset, our starting point was to consider the user "Verified" attribute within the tweets. The "Verified" attribute is assigned to user accounts which Twitter defines to be of public interest [47]. Within our dataset we had 94,445 non-verified users and 1,692 verified users. Since Twitter's methodology for finding verified accounts is not public, we aim to develop a proxy automated model. The aim here is to create a model which can help identify users who are also likely to be accounts of public interest and authentic, but remain unverified. This can be used in conjunction with the Twitter "Verified" accounts to provide a comprehensive source of authentic accounts during a disaster event. Specifically, our approach allows adjusting the authenticity thresholds based on the continuous probabilistic scores of the model, which enables collecting information from accounts which have not yet been verified by Twitter but have similar properties to a "Verified" account.

The automated model was developed based on supervised machine learning. Specifically, machine learning models [48] were developed for binary classification to predict the user "Verified" label (true/false) based on features of the tweet content (weblinks, hashtags) and its creator (account age, friends count, followers count, statuses count).
Random Forest (RF) [49], Gradient Boosted (GB) [50], and Logistic Regression (LR) [51] classifiers were used to train and test the model.

RF is an ensemble model which consists of multiple decision trees trained on the data, which vote to determine the label class of an observation based on its features. A decision tree has a set of rules which, when evaluated on an input, return a prediction of a class or a value. RF also returns the ratio of votes for each class it is trained on. Gradient Boosting (GB) is also an ensemble model which builds decision trees leveraging gradient descent to minimize information loss. Similar to RF, GB uses a weighted majority vote of all of the decision trees for classification. In comparison, Logistic Regression (LR) tries to find the best linear model to describe the relationship between the independent variables and a binary outcome for classification.

The output of each of the trained binary models is a classifier capable of predicting whether a user can be verified or not. The performance of the resulting model was evaluated using 10-fold cross validation [52], with a 70-30 train-test split used in each fold. Furthermore, grid search [53] was used on the best performing model for hyper-parameter optimization. Grid search takes in a set of values for each hyperparameter (e.g., number of trees in a forest, max depth of a tree, sample splits, max number of leaf nodes, etc.) and a number of folds, and conducts a search using each possible combination of hyperparameters by evaluating them on a scoring metric such as F1-score. The final output of this model is a min-maxed, log-transformed value of the probability scores. This was done to reduce the skewness in the score distribution, as needed for the overall model (described later).
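The final score transformation can be sketched as follows (illustrative; the `minmax_log` name is an assumption):

```python
import math

def minmax_log(probs):
    """Log-transform classifier probability scores, then min-max
    scale to [0, 1] to reduce skew before combining submodels."""
    logs = [math.log(p) for p in probs]
    lo, hi = min(logs), max(logs)
    return [(v - lo) / (hi - lo) for v in logs]
```

The log step spreads out the cluster of scores near zero that skewed classifiers typically produce; the min-max step puts every submodel on the same comparative range.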
The goal of the text model was to delineate tweets with Irma-related text from those addressing other topics. While a generic search term such as "Hurricane Irma" can provide a starting point, prior research [54, 55, 56] in the domain has shown that content organically develops to other words. An automated system trained on a large corpus to recognize context may improve the results, but this suffers from two significant pitfalls. First, training a learner on large bodies of text is costly from the perspective of computational overhead [57]. Second, the dynamic nature of discussions during a disaster, especially in a format as compact as Twitter, can alter the most likely interpretation of a word's meaning, resulting in false positives in the captured tweets [58].

To address these issues we developed a dynamic word embedding model which utilizes online learning to update its learned context. Specifically, we use a neural-network-based word embedding architecture - Word2Vec [59, 60] - which captures the semantic and syntactic relationships between the words present in the tweet corpora. In the Word2Vec module, each word is evaluated based upon its placement among other words within a tweet. This target word, combined with its neighboring words before and after its occurrence in a given tweet, is then given to a neural network whose hidden layer weights correspond to the vector representing the target word. Once the vectors for each word are generated, the vectors can be compared based upon their cosine similarity. As two words get closer in similarity, the vectors representing those words become closer within the vector space; the angle between the vectors gets smaller; and the cosine of this angle gets closer to, but does not exceed, 1.
As a result, the similarity in context between a word and its neighbors in vector space can be compared numerically by looking at the cosine of the internal angle formed by two word vectors [61].

Within our approach, tweets were parsed and grouped into 24-hour segments, with primary testing done on the time period immediately before and after the initial landfall. Prior to training the model, tweets were first cleaned to eliminate punctuation, numbers, and extraneous/stop words. Each tweet was temporally isolated and parsed into token words to create input vectors for training and testing of the Word2Vec module. Four different formulas were employed to score a tweet based upon its component word vectors:

1) Cosine Similarity of Tweet Vector Sum (CSTVS): $1 - \frac{\alpha \cdot \sum_{i=1}^{k} \tau_i}{\|\alpha\| \, \|\sum_{i=1}^{k} \tau_i\|}$

2) Dot Product of Search Term Vector and Tweet Vector Sum (DP): $\|\alpha\| \times \|\sum_{i=1}^{n} \tau_i\| \times \cos\theta$

3) Mean Cosine Similarity (MCS): $\frac{1}{n} \sum_{i=1}^{n} \cos(\theta_{\tau_i \alpha})$

4) Sum of Cosine Similarity over Square Root of Token Count (SCSSC): $\frac{1}{\sqrt{n}} \sum_{i=1}^{n} \cos(\theta_{\tau_i \alpha})$

CSTVS is a programmatic implementation of the cosine distance formula [62] that allows an efficient calculation of cosine distance; cosine similarity can be calculated by subtracting this value from 1. DP treats the sum of the vectors in a tweet as a vector itself ($\sum_{i=1}^{k} \tau_i$); calculating the dot product of this interpreted vector and the vector for the search term ($\alpha$) returns a value that is proportional to the cosine similarity. MCS is the mean cosine similarity of the search term to all terms in a tweet, where $n$ is the number of terms in the tweet. SCSSC is similar in function to MCS, but reduces the impact of a shorter tweet by dividing by the square root of the count of tokens in the tweet ($n$).
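Given precomputed word vectors, the four scores can be sketched for a single tweet (pure-Python illustration; `alpha` is the search-term vector, `tokens` the tweet's word vectors, and the helper names are ours):

```python
import math

def _norm(v):
    return math.sqrt(sum(x * x for x in v))

def _dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def tweet_scores(alpha, tokens):
    """Score one tweet against a search-term vector using the four
    formulas: CSTVS (cosine distance to the tweet vector sum), DP
    (dot product with the sum), MCS (mean per-token cosine), and
    SCSSC (cosine sum damped by sqrt of token count)."""
    s = [sum(col) for col in zip(*tokens)]  # tweet vector sum
    cstvs = 1 - _dot(alpha, s) / (_norm(alpha) * _norm(s))  # distance form
    dp = _dot(alpha, s)  # equals ||alpha|| * ||sum|| * cos(theta)
    cosines = [_dot(alpha, t) / (_norm(alpha) * _norm(t)) for t in tokens]
    mcs = sum(cosines) / len(tokens)
    scssc = sum(cosines) / math.sqrt(len(tokens))
    return cstvs, dp, mcs, scssc
```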
All formulas return a scalar score for a tweet - search term similarity match.

In order to evaluate the model, the codified data set of 19,000 tweets was used. The codification was done by a single human coder, and a sampled set of tweets (100 with balanced distribution) was verified by two additional coders to assess the inter-rater reliability. Tweets were labelled as Irma related if they matched any of the following criteria — 1) Explicitly contains references to "Irma" or "Hurricane"; 2) Contains current meteorological data, such as wind speed, rainfall levels, etc.; 3) Refers to weather events such as storm, flood(ing) and rising water, rainfall, tornado, etc.; 4) Describes the aftermath of extreme weather: trees down, power out, damage to buildings or construction, etc.; 5) Contains references to emotional states exacerbated by the weather: worrying about shelter, concerns for safety, pleas for help, etc.; 6) Lists availability or absence of necessities: shelter, water, food, power, etc. A message was labelled not related if it met any of the following criteria — 1) Mentioning a location absent any of the above content; 2) Containing an attached picture that may be Irma related, but no additional text; 3) Expressing emotions about the state of an event, but with an ambiguous connection to weather, i.e. a sporting event canceled, but no explanation as to why; or expressing emotions about a person's condition, but with an ambiguous connection to weather, e.g. "I hope @abc123 gets better soon!".
The resulting dataset had 8,296 tweets related to Irma and 10,792 tweets not related. The inter-rater reliability of the codified messages using the Light's Kappa metric was 0.69, suggesting significant agreement between coders [40, 41].

This dataset was then used to evaluate the aforementioned formulas at different score thresholds by analyzing the ratio of tweets correctly classified by the model. Hyper-parameters of the Word2Vec model were also tuned using the labeled tweets. The parameters selected for testing were: context word window sizes from 1 to 10 words on either side of the target word; hidden layer dimensionality in 50D increments from 50D to 500D; minimum word occurrence from 0 to 9; and negative sampling from 0 to 9 words. The cross product of the values contained in these ranges was used as the testing set of tuples for the training operations. For each set of parameters, the NN was trained through varying epochs, and the resultant word embeddings were used in conjunction with the four scalar formulas to calculate scores for each tweet. The scores for each iteration were min-max scaled for the time delta, and the AU-ROC was calculated based upon the thresholds of the scores in relation to the human-coded tweets.
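The AU-ROC computation can be approximated with a rank-based AUC (a minimal stand-in for a library routine such as scikit-learn's `roc_auc_score`; the function name is ours):

```python
def auc_score(scores, labels):
    """Rank-based AUC: the probability that a randomly chosen positive
    (human-coded Irma-related) tweet outscores a randomly chosen
    negative one, with ties counted as half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

A score of 1.0 means the formula perfectly separates Irma-related from unrelated tweets at some threshold; 0.5 is no better than chance.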
Figure 2: Overall information flow model. The Metadata Extraction stage develops variablesfrom the raw Twitter data, Filtering stage utilizes the developed 1) Geospatial, 2) User, 3)Image, and 4) Text analysis modules to score tweets, and the Visualization stage is used toobserve the at location posted image along with Google Street View
Following the creation of the individual models, we combined the results of each into a single overall model (Figure 2), which consists of three distinct stages: 1) Metadata extraction, 2) Filtering, and 3) Visualization of filtered tweets. The input to the first stage is a tweet as a data-point. The metadata extraction stage mines the relevant attributes (image, geolocation, user, text) needed for the individual models of 1) Geospatial, 2) User, 3) Image, and 4) Text analysis.

The results of the individual models are then combined in the second stage of filtering, where the normalized scores (decision scores ranging from 0-100) for each model are combined at different thresholds to filter the Twitter messages relevant to Hurricane Irma. Any tweet without an image is assigned an imgScore = 0; this allows users to restrict the view to messages containing images by setting the threshold to imgScore > 0. The flexibility of the approach lies in its ability to select different thresholds for the respective models. This allows for a more generalizable model in which a user can choose a different set of thresholds for disparate disaster events. A logical AND operation is used to obtain messages which pass the thresholds of all of the individual models. Specifically, a data-point passes the filtering stage only if each of its individual model scores is greater than or equal to the corresponding threshold.

The filtered data are then stored in a database (Scored Tweets), where each data-point can be viewed on a visualization platform. The visualization platform extracts the location information from each data-point (Geolocation), which is cross-referenced with the Google Maps API to provide three attributes: 1) Google Street View [63, 64], 2) the physical address, and 3) a 2D top-down map view of the location. These attributes (Street View - Map), along with the Tweet Data (text of the tweet, date-time, user, image, etc.) and Score & Annotation information (P(Related/Not-Related) and P(Tag), where P is the probability and Tag ∈ {Flooding, Windy, Destruction}), are then displayed on a web viewer. This presents an easy-to-use interface to view and visualize the messages for situational awareness.
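The logical-AND filtering step can be sketched in a few lines; the score and threshold names below follow the figure labels, but the dictionary-based API is illustrative, not the authors' implementation:

```python
# Sketch of the filtering stage: a tweet passes only if every sub-model score
# meets its user-defined threshold (logical AND). Field names follow Figure 2;
# the data structures are illustrative stand-ins.

def passes_filter(scores, thresholds):
    """scores/thresholds: dicts keyed by sub-model name, values in 0-100."""
    return all(scores[m] >= thresholds[m] for m in thresholds)

tweet_scores = {"geoScore": 72, "userScore": 90, "imgScore": 0, "textScore": 41}
thresholds = {"geoScore": 50, "userScore": 85, "imgScore": 0, "textScore": 30}

print(passes_filter(tweet_scores, thresholds))  # True: every score >= threshold
```

Setting `imgScore`'s threshold to any value above 0 would reject this tweet, reproducing the image-only filtering behavior described above.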
Figure 3: Precipitation (inches) and wind speed (knots) in relation to the distance (miles) from Hurricane Irma's eye.
3. Results
Preliminary exploration of the sensor readings for precipitation and wind speed relative to the distance from the eye of the hurricane is shown in Figure 3. Precipitation, measuring 5-20 inches, decreases exponentially with distance from the eye of the hurricane. Median wind speeds peak around 300 miles from the eye.

Nine different geospatial models were developed and compared on their performance in filtering Irma related tweets. Specifically, for each model we calculated the ratio of Irma related tweets (number of Irma related tweets / total number of tweets) at different thresholds between 0 and 1 (all values were min-max normalized). Irma related tweets were identified by the codification of 19,000 messages by human coders (annotation criteria described in Section 2.3.4). Figure 4 compares the cumulative distribution function (CDF) plots of the nine functions, with the subplots comparing a) Min-Max normalization, b) log (log()), and c) Box-Cox (γ()) transformation scores.

As observed, the results of the log and Box-Cox transformations show a wider distribution of the ratio across the different thresholds in comparison to the Min-Max normalized values. The results are also confirmed by the
Shapiro-Wilk test (Table 1), where the log and Box-Cox transformed models have higher scores, suggesting a more normal distribution of the results than the non-transformed ones. Based on the test, the top five functions identified were γ(wind/√distance), γ(wind∗rain/distance), log(wind∗rain/distance), γ(wind∗rain/√distance), and log(wind∗rain/√distance). All five models were in very close proximity in the scores observed in the test.

Figure 4: Cumulative distribution function (CDF) plots of the Min-Max normalized, log, and Box-Cox transformed geospatial scores for the nine models. The common legend of all three subfigures is shown in subfigure (c).

Table 1: Shapiro-Wilk statistic values for all nine models under Min-Max ((X − min(X))/(max(X) − min(X))), log(), and Box-Cox (γ()) normalization; top 5 values highlighted.

Additional analysis was conducted to observe the statistical properties of the top five models. Figure 5 shows the CDF and F1-scores for each of these functions, and Table 2 shows their general statistical properties. Out of the five, log(wind∗rain/√distance) was chosen as the final model function, its mean being the closest to 0.5.

Figure 5: Cumulative distribution function (CDF) and F1-scores for the top five geospatial models.

Table 2: General data statistics for the top 5 models. log(wind∗rain/√distance) was selected as the normalization model for geospatial analysis as its distribution mean was closest to 0.5.

Model | Shapiro-Wilk | Standard Deviation (σ) | Mean (μ) | % of data within μ ± σ
γ(wind/√distance) | 0.99 | 0.12 | 0.28 |
γ(wind∗rain/distance) | 0.99 | 0.14 | 0.38 |
log(wind∗rain/distance) | 0.99 | 0.14 | 0.39 |
γ(wind∗rain/√distance) | 0.98 | 0.16 | 0.43 |
log(wind∗rain/√distance) | 0.98 | 0.16 | 0.46 |

The performance of the various image classifiers is shown in Table 3. In the first stage of classification, which uses a binary classifier to distinguish hurricane related from non-hurricane related images, the Tuned Inception V3 architecture performed the best, with an overall F1-score of 0.962. Figure 6 shows the comparative AU-ROC curves for the different models. Between the classes, the Tuned Inception V3 model also performed well, with an F1-score of 0.959 for class 1 (hurricane related) and 0.965 for class 0 (non-hurricane related) images.

Figure 6: Area Under the Receiver Operating Characteristic (AU-ROC) curves for the VGGNet (AU-ROC = 0.96), ResNet (0.96), Inception-V3 (0.96), and Tuned Inception-V3 (0.99) models for binary classification of images (hurricane related versus non-hurricane related).

The hurricane related images were then fed through a second round of classification trained on multi-label annotation of 1) flood, 2) wind, and 3) destruction. Table 3 also compares the results of this analysis, where the Tuned Inception V3 architecture outperformed the other models with an average F1-score of 0.896.

Table 3: Performance comparison of deep-learning models (Inception-V3, VGGNet, ResNet, and Tuned Inception-V3) for binary classification and multi-label annotation.

Model | Binary Classifier (Precision / Recall / F1-Score) | Multi-Label Annotator (Precision / Recall / F1-Score)
VGGNet | 0.88 / 0.87 / 0.88 | 0.70 / 0.60 /
| 0.88 / 0.89 / 0.89 | 0.68 / 0.61 /
| 0.89 / 0.88 / 0.88 | 0.75 / 0.72 /
Tuned Inception-V3 | 0.96 / 0.95 / 0.95 | 0.90 / 0.92 /

Within the classes, the F1-scores were well distributed, with class 1 (flood) at 0.821, class 2 (wind) at 0.888, and class 3 (destruction) at 0.941. Figure 7 shows the AU-ROC curves for the different annotations performed on the images by the Tuned Inception V3 architecture.

Figure 7: AU-ROC curves for the Tuned Inception V3 model for multi-label annotation of images: 1) 'Flood', 2) 'Wind', and 3) 'Destruction' (AU-ROC = 0.96, 0.98, and 0.97, respectively).

Analyzing the cutoff thresholds of the probability scores for the Tuned Inception V3 model shows a distribution with a mean of 0.63, a median of 0.75, and a standard deviation of 35.07. These values show a wide distribution of probability scores, which provides a wider range of cutoff thresholds for filtering the images.
Figure 8: AU-ROC curves for the Random Forest (AU-ROC = 0.96), Gradient Boosting (0.97), and Logistic Regression (0.94) classifiers in predicting verified users.

The F1-scores of the Random Forest (RF), Gradient Boosting (GB), and Logistic Regression (LR) models trained to predict user verification were 0.97, 0.92, and 0.88, respectively. Figure 8 shows the comparative AU-ROC scores of the different models, where the RF classifier outperforms the rest. The best performing RF model was developed using a grid search, in which multiple model parameters (number of estimators, depth, leaf splits, etc.) were evaluated. The precision, recall, and AU-ROC of the resulting model were 0.96, 0.98, and 0.99, respectively.

The classifier was balanced in its prediction accuracy between verified (class 1) and non-verified (class 0) users (Figure 10). The output probability values of the binary model were further min-max scaled to a threshold score between 0 and 100. The resulting distribution had a mean of 50.56, a median of 66.26, and a standard deviation of 39.69.
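A grid search over a Random Forest can be sketched as follows with scikit-learn; the synthetic feature matrix and the specific parameter grid are stand-ins, not the study's data or exact search space:

```python
# Illustrative sketch (not the authors' exact pipeline): grid-searching a
# Random Forest on user-account features to predict Twitter's "verified" flag.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Stand-in for the real feature matrix (account age, status count, followers,
# friends, URL/hashtag/image presence, retweets, geolocation, message
# frequency), with a heavy class imbalance like the ~1:100 verified ratio.
X, y = make_classification(n_samples=1000, n_features=10,
                           weights=[0.99, 0.01], random_state=0)

param_grid = {"n_estimators": [50, 100],
              "max_depth": [None, 10],
              "min_samples_leaf": [1, 5]}
search = GridSearchCV(RandomForestClassifier(class_weight="balanced",
                                             random_state=0),
                      param_grid, scoring="roc_auc", cv=3)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

Scoring on AU-ROC rather than accuracy matters here because, under a 1:100 imbalance, a classifier that always predicts "non-verified" would already be 99% accurate.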
The results of the text analysis module were based on the binary categorization of the tweets codified as 'irma related' (class 0) or 'non irma related' (class 1). Evaluation of the four different formulas resulted in F1-scores of 0.6553 for MCS, 0.7824 for DP, 0.7049 for CSTVS, and 0.7347 for SCSSC. We observe that the dot product between the search term vector and the tweet vector sum (DP) gives the best result. Figure 9 shows the AU-ROC curves comparing the performance of the different formulas.
Figure 9: AU-ROC curves for the text formulas: 1) Cosine Similarity of Tweet Vector Sum (CSTVS, AU-ROC = 0.7987), 2) Dot Product of Search Term Vector and Tweet Vector Sum (DP, AU-ROC = 0.8842), 3) Mean Cosine Similarity (MCS, AU-ROC = 0.7326), and 4) Sum of Cosine Similarity over Square Root of Token Count (SCSSC, AU-ROC = 0.8327).
Each model was further evaluated to identify the best set of parameters. Within this analysis we found the DP formula was still the best performing, with a word window size of 1, a hidden layer dimensionality of 150, a minimum word count of 5, a negative sampling value of 1, and the Word2Vec model trained through 25 epochs. The resulting distribution had a mean of 24.73, a median of 21.64, and a standard deviation of 14.05.
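The DP score itself is just the dot product of the seed-term vector with the sum of a tweet's token vectors. A minimal sketch, using random 150-dimensional vectors as stand-ins for the trained Word2Vec embeddings (the vocabulary and helper names are illustrative):

```python
# Sketch of the dot-product (DP) relevance score used by the text module:
# score(tweet) = v(seed) . sum(v(token) for token in tweet).
# Embeddings are random stand-ins for the 150D Word2Vec vectors in the study.
import numpy as np

rng = np.random.default_rng(0)
vocab = ["irma", "storm", "wind", "shelter", "lunch", "today"]
embeddings = {w: rng.normal(size=150) for w in vocab}  # hypothetical vectors

def dp_score(tokens, seed="irma"):
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    if not vecs:
        return 0.0  # tweets with no in-vocabulary tokens get the minimum score
    return float(np.dot(embeddings[seed], np.sum(vecs, axis=0)))

print(round(dp_score(["irma", "storm", "wind"]), 3))
```

Higher scores indicate tweets whose summed token vector aligns with the seed term; min-max scaling these raw scores to 0-100 yields the textScore used in filtering.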
4. Discussion
We address each individual model separately before discussing the final combined model and outlining limitations and future directions for this work.
The geospatial models developed in the study provide a measure of relevance for a tweet by including the forcing sensor data (wind speed, precipitation, and distance from the eye of the hurricane). The best performing function, log(wind∗rain/√distance), combines these values into a single normalized score which can be used as a geographic/sensor relevancy weight for any tweet. More specifically, the function helps us identify Twitter messages at locations which are in close proximity to the hurricane forcing and have observed increased amounts of precipitation and wind speed. As seen in the results, the chosen log(wind∗rain/√distance) was the closest to a normally distributed function. This allows for greater granularity in threshold cutoff points in comparison to the other functions, leading to fine-grained control over filtering based on the geographic relevance of the tweets. The statistical properties of the function also enable analysis of confidence intervals, which can be used to ascertain the reliability of a message within the context of sensor data. In other words, tweets with anomalous sensor readings can be easily identified, leading to more reliable mining of messages related to the disaster event.

We envision that filtering tweets using their geospatial information relative to storm position and environmental factors can help isolate tweets from heavily impacted locations. Examining locations close to the storm, with high wind gusts or heavy precipitation, allows users to quickly focus on locations that might be expected to show the most severe impacts from storm events.

Comparing the performance of the CNN architectures (Inception V3, VGG, ResNet, and Tuned Inception V3) for binary classification (hurricane and non-hurricane related), we observe that the Tuned Inception V3 model (F1-score 0.95) has almost a 6-7% accuracy gain over the others.
In comparison to the VGG and ResNet architectures, the Tuned Inception V3 has a larger number of parameters which can be trained to capture the nuances between the images. While the base Inception V3 classifier contains the same number of parameters, re-tuning the weights on our training sample of images improved its accuracy considerably for the binary classification. This can be attributed to the pre-training and transfer learning of the model: it already had prior weights based on the classification of physical objects, and our image data tuned it further for disambiguating physical and non-physical scenes.

We do observe a slight performance decrease (F1-score 0.91) for the architecture trained on the multi-label annotation of the images. This can be attributed to the limited number of training samples that were available to the classifier. The complexity of the images in the samples further degrades the performance; for example, images of lakes and sea water are not much different from images of flooding.

Prior research in the area of automating image analysis (using machine learning) from social media has primarily focused on quantifying the level of damage in disaster situations [65, 66, 67]. Our approach uses a dual stage model, where the first stage is responsible for increasing the quality of images by filtering out the non-relevant / non-physical images. The output is then fed into the second stage for categorization into different groups based on situational conditions (flooding, wind, and destruction). While prior studies have looked at disambiguating "fake/altered" images [68, 69], they are based on analyzing the content of the tweet along with user reliability measures for training machine learning models. Within our approach we only utilize the image features for training our models.
The output image scores are based on normalized probability values, which can be used for threshold cutoffs: setting a high threshold retains only the most clearly hurricane related images. The second stage then annotates the images, allowing further filtering based on the needs of the domain. Filtering images permits users to quickly focus on a small subset of visual information that is presumed to be most valuable for storm impact assessment, compared to needing to scroll through many images to find useful information.
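The dual-stage image pipeline described above can be sketched as follows; the two classifier callables are placeholders standing in for the Tuned Inception V3 models, and all field names are illustrative:

```python
# Sketch of the dual-stage image pipeline: stage 1 keeps only images whose
# hurricane-related probability passes a threshold; stage 2 annotates the
# survivors with multi-label tags. Classifiers are stand-ins for the two
# Tuned Inception V3 models; data structures are illustrative.

def stage1_prob(image):
    """Stand-in binary classifier: P(hurricane related)."""
    return image["p_hurricane"]

def stage2_tags(image, cutoff=0.5):
    """Stand-in multi-label annotator: tags whose probability >= cutoff."""
    return [t for t, p in image["tag_probs"].items() if p >= cutoff]

def annotate(images, threshold=0.85):
    kept = [im for im in images if stage1_prob(im) >= threshold]
    return {im["id"]: stage2_tags(im) for im in kept}

images = [
    {"id": "t1", "p_hurricane": 0.97,
     "tag_probs": {"flood": 0.9, "wind": 0.3, "destruction": 0.7}},
    {"id": "t2", "p_hurricane": 0.10, "tag_probs": {}},
]
print(annotate(images))  # only t1 survives stage 1
```

Raising the stage-1 threshold tightens the pool of images that ever reach the annotator, which is the mechanism by which a high imgScore cutoff surfaces only the most clearly storm-related pictures.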
Prior studies [70, 71, 72, 73] focused on identifying incorrect/fake/altered information in social media have established the source of information (the social media user) as a key component. A large proportion of these studies [74, 75, 76, 77] have been based on developing machine learning approaches for the detection of "bots" or fake user accounts [78] on social media. For example, [79] identify the credibility of the user as an important element in mining good quality situational awareness information from social media. Within our approach, we leverage this prior work by using the user features of account age, status count, number of followers, number of friends, existence of URL links, number of hashtags, existence of images, retweets, geolocation, and message frequency to train our machine learning models.

Comparing the results between the parametric (Logistic Regression) and the non-parametric ensemble models (Random Forest and Gradient Boosting), we observe that the ensemble models outperform by a margin of 4-7%. The developed Random Forest model has very high accuracy (F1-score 0.97, AU-ROC 0.99) in disambiguating between verified and non-verified users. While the ratio of verified to non-verified users was imbalanced (approximately 1:100) in our data, the developed RF model is able to accurately distinguish between the classes, as shown by the confusion matrix (Figure 10). Further analyzing the RF model, we calculated the average decrease in Gini impurity for each feature to obtain feature importances (Figure 10).

Figure 10: Performance of the user Random Forest classifier: (a) confusion matrix of non-verified (class 0) versus verified (class 1) user prediction; (b) relative importance (Gini feature importance score, using information gain) of the features (followers count, statuses count, friends count, and account age) in predicting verified users.
With the dot-product based model performing best in the F1-score analysis, we applied the model to an hourly aggregated corpus within our data. Specifically, when the corpus was confined to the tweets from a single hour, the vector representations of word embeddings were only influenced by the contexts derived from that hour. Words would have a unique vectorization specific to that hour, and relationships between words were dependent on the context interpretations within that time. The cosine similarity of two terms could be calculated for this duration, and the words with the highest-scoring cosine similarity to a term would indicate an observed relationship confined to the timeframe. In short, two words could be similar in one hour and completely different the next, depending on the content of the tweets at the time.

Table 4 shows the output of the DP model for the hourly aggregated tweets. Prior to landfall (time 13:00), we observe mentions of "storm", "wind", "eye", "ese" (East-South-East), "e" (East), etc., having prominence in the top 20 words identified by the DP model as semantically similar to the search term "irma". There is consistency in the thematic representation, with these words occurring across the 6 hours prior to the hurricane. During the window of Hurricane Irma's landfall, terms such as "shelter" rise to prominence (Table 4).
Table 4: Hourly aggregates of the top 20 terms semantically related to "Irma" for six hours prior to and after landfall (columns for 7:00 through 19:00, with Irma's landfall at 13:00). The words have been stemmed to their roots; prominent terms include "ese", "storm", "wind", "eye", "tampa", "shelter", "safe", and "open".
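The hour-by-hour neighbor lists in Table 4 come from re-deriving embeddings on each hourly corpus. As a rough, self-contained stand-in for retraining Word2Vec per hour, the sketch below builds simple co-occurrence vectors from one hour's tweets and ranks terms by cosine similarity to the seed word (toy corpus and function names are ours, not the study's code or data):

```python
# Sketch of hourly semantic neighbors: co-occurrence vectors built from a
# single hour's tweets (a crude stand-in for per-hour Word2Vec training),
# ranked by cosine similarity to the seed term "irma".
import numpy as np
from collections import defaultdict

def hourly_neighbors(tweets, seed="irma", top=3):
    vocab = sorted({w for t in tweets for w in t})
    idx = {w: i for i, w in enumerate(vocab)}
    co = defaultdict(lambda: np.zeros(len(vocab)))
    for t in tweets:            # count within-tweet co-occurrences
        for w in t:
            for c in t:
                if c != w:
                    co[w][idx[c]] += 1.0

    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

    sims = {w: cos(co[w], co[seed]) for w in vocab if w != seed}
    return sorted(sims, key=sims.get, reverse=True)[:top]

hour_13 = [["irma", "landfall", "shelter"], ["shelter", "open", "irma"]]
print(hourly_neighbors(hour_13))
```

Because each hour gets its own vectors, a term like "ese" can rank highly before landfall and vanish afterwards, exactly the drift the hourly analysis is designed to expose.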
The results show that the word embedding based dot product model is capable of identifying the tweets most relevant to the search/seed term. This is highlighted by the example of the term "ese", which, taken by itself, might reference an informal Spanish colloquialism for "man". When interpreted within the hourly-divided corpora of this dataset, it takes on a different semantic interpretation. For the tweets occurring within each of the four hours immediately preceding landfall, "ese" is among the top twenty terms most related to "irma", and it does not appear in the hourly lists following landfall. Looking at the terms related to "ese", it can be determined that it refers to the abbreviation for East-South-East, likely referencing the direction from which the hurricane approached. After landfall, this term was no longer as relevant, and therefore less likely to appear as a related term.
Figure 11: CDF of Overall Model and percentage of tweets passing different model thresholds.
In the overall model, the number of possible threshold combinations is large (100^4), as each of the four models can take a value ranging from 0-100. A cumulative distribution function (CDF) plot was used to analyze the percentage of data-points passing the thresholds set for each model. Figure 11 shows the comparative analysis of each model, where the curves are inversely proportional to the thresholds, indicating a decrease in the percentage of tweets passing at higher thresholds. Within this analysis, low thresholds are representative of more reliable sources and related content, resulting in a low percentage of overall tweets passing through the filter. Similarly, at a threshold of 100, all tweets pass the filter, providing complete access to all data.

The CDF plot also highlights the comparative performance of the various models: the text based filter includes a higher percentage of tweets at lower thresholds, while the user verification filter includes most users only at higher thresholds. Image classification behaves similarly to user verification (including most tweets only at higher thresholds), whereas filtering on the geospatial score behaves more linearly. We observe high quality results (low false positives) at a likelihood of occurrence of 0.6 by setting initial thresholds to 30 for text, 50 for geospatial, and 85 for both image and user scores. These recommended thresholds for Hurricane Irma provide a baseline for comparison with different events, and for hurricanes in different locations.
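The curves in Figure 11 are empirical CDFs of the sub-model scores, i.e., the fraction of tweets at or below each threshold. A minimal sketch, with random stand-in score distributions in place of the real sub-model outputs:

```python
# Sketch: empirical CDF of each sub-model's 0-100 scores, i.e., the fraction
# of tweets at or below each threshold. The score arrays are random stand-ins
# for the real sub-model outputs plotted in Figure 11.
import numpy as np

rng = np.random.default_rng(0)
scores = {"textScore": rng.uniform(0, 100, 1000),
          "geoScore": rng.uniform(0, 100, 1000)}

def empirical_cdf(values, thresholds):
    values = np.sort(values)
    return [float(np.searchsorted(values, t, side="right")) / len(values)
            for t in thresholds]

for name, s in scores.items():
    curve = [round(f, 2) for f in empirical_cdf(s, range(0, 101, 20))]
    print(name, curve)
```

Comparing these curves across sub-models is what motivates per-model thresholds: a steeply rising curve (like the text score) passes many tweets at low thresholds, while a curve that rises late (like the user score) needs a high threshold before most tweets pass.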
Our current work explores the utilization of the multiple modalities present in social media data to filter hazard event related information. We acknowledge certain limitations of this approach. Our approach is to cumulatively evaluate the operation of all sub-models in capturing the messages. As a result, we focused in this work on tweets that have all attributes: geolocation, text, and image (note that all tweets have user attributes). However, a smaller overall model with specific combinations of the sub-models can be used in certain conditions. For example, researchers who are interested only in messages with text can use an overall model that excludes the image sub-model and subsequently not filter based on a threshold for images.

Furthermore, our models are evaluated using the data from a single event (Hurricane Irma) and a single location (Florida, USA). As part of our future effort we plan to extend this framework to other hurricane events (and locations), such as Maria, Harvey, and Florence, along with applying the approach to other disaster scenarios, such as fires, earthquakes, and floods, to aid in understanding the filtering step and thresholds in other contexts. Each event will likely have different specifications on the quality of data that needs to be extracted, for which we need to cross evaluate the approach against various events to provide threshold recommendations for different disaster categories.

Our approach can operate as a primary filtering mechanism ahead of additional analysis to extract information during a disaster event. Additional models which help with the categorization of messages, such as disaster damage quantification, requests for help, resource offerings, and organizing efforts, can be implemented to extract higher level information from the data.
5. Conclusion
In this study, a multimodal filtering approach was developed and evaluated to extract and subset geocoded images posted on Twitter within the context of Hurricane Irma. Our prototype model consisted of four sub-models: geospatial, image classification, user credibility, and text analysis. Each sub-model returned a score in the range of 0-100 and allowed for user-defined filtering based on bespoke thresholds. Together, the four models aim to filter information on reliability, information consistency, and the overall usefulness of a message. This single combined model shows potential for application in disaster and emergency contexts, allowing users to quickly search and filter for relevant geolocated tweets.
6. Acknowledgments
The authors would like to thank UNCG undergraduate students KaitlynJessee and Elaina Kauzlarich for classifying the images associated with thetweets. This study was supported in part by the National Science Foundation(Awards
7. Data and codes availability statement
The data and the codes used in the research are available on Figshare: https://figshare.com/s/235146fc2d6de33654f3 [80]
8. References

[1] T. Knutson, S. J. Camargo, J. C. L. Chan, K. Emanuel, C.-H. Ho, J. Kossin, M. Mohapatra, M. Satoh, M. Sugi, K. Walsh, L. Wu, Tropical Cyclones and Climate Change Assessment: Part II. Projected Response to Anthropogenic Warming, Bulletin of the American Meteorological Society (Aug. 2019). doi:10.1175/BAMS-D-18-0194.1.
[2] H. R. Moftakhari, A. AghaKouchak, B. F. Sanders, D. L. Feldman, W. Sweet, R. A. Matthew, A. Luke, Increased nuisance flooding along the coasts of the United States due to sea level rise: Past and future, Geophysical Research Letters 42 (22) (2015) 9846–9852. doi:10.1002/2015GL066072.
[3] doi:10.1371/journal.pone.0118571.
[4] E. D. Lazarus, P. W. Limber, E. B. Goldstein, R. Dodd, S. B. Armstrong, Building back bigger in hurricane strike zones, Nature Sustainability 1 (12) (2018) 759–762. doi:10.1038/s41893-018-0185-y.
[5] B. De Longueville, R. Smith, G. Luraschi, "OMG, from here, I can see the flames!": a use case of mining location based social networks to acquire spatio-temporal data on forest fires, 2009, pp. 73–80. doi:10.1145/1629890.1629907.
[6] T. Sakaki, M. Okazaki, Y. Matsuo, Earthquake shakes Twitter users: real-time event detection by social sensors, in: Proceedings of the 19th International Conference on World Wide Web, WWW '10, Association for Computing Machinery, Raleigh, North Carolina, USA, 2010, pp. 851–860. doi:10.1145/1772690.1772777.
[7] Y. Tyshchuk, C. Hui, M. Grabowski, W. A. Wallace, Social Media and Warning Response Impacts in Extreme Events: Results from a Naturally Occurring Experiment, in: 2012 45th Hawaii International Conference on System Sciences, 2012, pp. 818–827. doi:10.1109/HICSS.2012.536.
[8] S. E. Middleton, L. Middleton, S. Modafferi, Real-Time Crisis Mapping of Natural Disasters Using Social Media, IEEE Intelligent Systems 29 (2) (2014) 9–17. doi:10.1109/MIS.2013.126.
[9] S. Muralidharan, L. Rasmussen, D. Patterson, J.-H. Shin, Hope for Haiti: An analysis of Facebook and Twitter usage during the earthquake relief efforts, Public Relations Review 37 (2) (2011) 175–177. doi:10.1016/j.pubrev.2011.01.010.
[10] M. Imran, F. Alam, U. Qazi, S. Peterson, F. Ofli, Rapid damage assessment using social media images by combining human and machine intelligence, arXiv preprint arXiv:2004.06675 (2020).
[11] M. T. Niles, B. F. Emery, A. J. Reagan, P. S. Dodds, C. M. Danforth, Social media usage patterns during natural hazards, PLoS ONE 14 (2) (2019) e0210484.
[12] Internet Live Stats - Internet Usage & Social Media Statistics (2014).
[13] J. A. de Bruijn, H. de Moel, B. Jongman, M. C. de Ruiter, J. Wagemaker, J. C. J. H. Aerts, A global database of historic and real-time flood events based on social media, Scientific Data 6 (1) (2019) 1–12. doi:10.1038/s41597-019-0326-9.
[14] Y. Kryvasheyeu, H. Chen, N. Obradovich, E. Moro, P. Van Hentenryck, J. Fowler, M. Cebrian, Rapid assessment of disaster damage using social media activity, Science Advances 2 (3) (2016) e1500779.
[15] N. Pourebrahim, S. Sultana, J. Edwards, A. Gochanour, S. Mohanty, Understanding communication dynamics on Twitter during natural disasters: A case study of Hurricane Sandy, International Journal of Disaster Risk Reduction 37 (2019) 101176. doi:10.1016/j.ijdrr.2019.101176.
[17] A. Gupta, H. Lamba, P. Kumaraguru, A. Joshi, Faking Sandy: characterizing and identifying fake images on Twitter during Hurricane Sandy, in: Proceedings of the 22nd International Conference on World Wide Web, 2013, pp. 729–736.
[18] F. Laylavi, A. Rajabifard, M. Kalantari, Event relatedness assessment of Twitter messages for emergency response, Information Processing & Management 53 (1) (2017) 266–280. doi:10.1016/j.ipm.2016.09.002.
[19] N. Murzintcev, C. Cheng, Disaster hashtags in social media, ISPRS International Journal of Geo-Information 6 (7) (2017) 204.
[20] F. Abel, C. Hauff, G.-J. Houben, R. Stronkman, K. Tao, Semantics + filtering + search = twitcident. Exploring information in social web streams, in: Proceedings of the 23rd ACM Conference on Hypertext and Social Media, HT '12, ACM Press, Milwaukee, Wisconsin, USA, 2012, p. 285. doi:10.1145/2309996.2310043.
[21] X. Liu, B. Kar, C. Zhang, D. M. Cochran, Assessing relevance of tweets for risk communication, International Journal of Digital Earth 12 (7) (2019) 781–801. doi:10.1080/17538947.2018.1480670.
[22] H. Becker, M. Naaman, L. Gravano, Beyond Trending Topics: Real-World Event Identification on Twitter, in: ICWSM, 2011. doi:10.7916/D81V5NVX.
[23] M. Imran, S. Elbassuoni, C. Castillo, F. Diaz, P. Meier, Extracting Information Nuggets from Disaster-Related Messages in Social Media (2013).
[24] K. Zahra, M. Imran, F. O. Ostermann, Automatic identification of eyewitness messages on Twitter during disasters, Information Processing & Management 57 (1) (2020) 102107.
[25] A. Ilyas, Microfilters: Harnessing Twitter for disaster management, in: IEEE Global Humanitarian Technology Conference (GHTC 2014), IEEE, 2014, pp. 417–424.
[26] M.-A. Kaufhold, M. Bayer, C. Reuter, Rapid relevance classification of social media posts in disasters and emergencies: A system and evaluation featuring active, incremental and online learning, Information Processing & Management 57 (1) (2020) 102132.
[27] M. A. Sit, C. Koylu, I. Demir, Identifying disaster-related tweets and their semantic, spatial and temporal context using deep learning, natural language processing and spatial analysis: a case study of Hurricane Irma, International Journal of Digital Earth 12 (11) (2019) 1205–1229.
[28] M. Imran, F. Ofli, D. Caragea, A. Torralba, Using AI and social media multimodal content for disaster response and management: Opportunities, challenges, and future directions (2020).
[29] L. Spinsanti, F. Ostermann, Automated geographic context analysis for volunteered information, Applied Geography 43 (2013) 36–44. doi:10.1016/j.apgeog.2013.05.005.
[30] J. P. De Albuquerque, B. Herfort, A. Brenning, A. Zipf, A geographic approach for combining social media and authoritative data towards identifying useful information for disaster management, International Journal of Geographical Information Science 29 (4) (2015) 667–689.
[31] J. A. de Bruijn, H. de Moel, A. H. Weerts, M. C. de Ruiter, E. Basar, D. Eilander, J. C. Aerts, Improving the classification of flood tweets with contextual hydrological information in a multimodal neural network, Computers & Geosciences (2020) 104485.
[32] P. M. Landwehr, K. M. Carley, Social media in disaster relief, in: Data Mining and Knowledge Discovery for Big Data, Springer, 2014, pp. 225–257.
[33] J. P. Cangialosi, A. S. Latto, R. Berg, Hurricane Irma, Tech. Rep. AL112017, National Oceanic and Atmospheric Administration, U.S. Department of Commerce (Jun. 2018).
[34] L. Nguyen, Z. Yang, J. Li, G. Cao, F. Jin, Forecasting People's Needs in Hurricane Events from Social Network, IEEE Transactions on Big Data (2019) 1–1. arXiv:1811.04577. doi:10.1109/TBDATA.2019.2941887.
[35] U.S. National Hurricane Center, Costliest U.S. Tropical Cyclones Tables Update, Tech. Rep., National Oceanic and Atmospheric Administration (Jan. 2018).
[36] Introduction to Tweet JSON (2019). URL https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/intro-to-tweet-json
[37] Geo objects (2019). URL https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/geo-objects
[38] M. Tomczak, Spatial interpolation and its uncertainty using automated anisotropic inverse distance weighting (IDW)-cross-validation/jackknife approach, Journal of Geographic Information and Decision Analysis 2 (2) (1998) 18–30.
[39] X. Yang, X. Xie, D. L. Liu, F. Ji, L. Wang, Spatial interpolation of daily rainfall data for local climate impact assessment over greater Sydney region, Advances in Meteorology 2015 (2015).
[40] R. J. Light, Measures of response agreement for qualitative data: some generalizations and alternatives, Psychological Bulletin 76 (5) (1971) 365.
[41] M. L. McHugh, Interrater reliability: the kappa statistic, Biochemia Medica 22 (3) (2012) 276–282.
[42] K. Simonyan, A. Zisserman, Very Deep Convolutional Networks for Large-Scale Image Recognition, arXiv:1409.1556 (Sep. 2014).
[43] K. He, X. Zhang, S. Ren, J. Sun, Deep Residual Learning for Image Recognition, arXiv:1512.03385 (Dec. 2015).
[44] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, Z. Wojna, Rethinking the Inception Architecture for Computer Vision, arXiv:1512.00567 (Dec. 2015).
[45] A. Krizhevsky, I. Sutskever, G. E. Hinton, ImageNet Classification with Deep Convolutional Neural Networks, in: F. Pereira, C. J. C. Burges, L. Bottou, K. Q. Weinberger (Eds.), Advances in Neural Information Processing Systems 25, Curran Associates, Inc., 2012, pp. 1097–1105.
[46] M. Oquab, L. Bottou, I. Laptev, J. Sivic, Learning and Transferring Mid-level Image Representations Using Convolutional Neural Networks, in: 2014 IEEE Conference on Computer Vision and Pattern Recognition, IEEE, Columbus, OH, USA, 2014, pp. 1717–1724.
doi:10.1109/CVPR.2014.222 .URL http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=6909618 [47] Twitter, About verified accounts, https://help.twitter.com/en/managing-your-account/about-twitter-verified-accounts (2020(accessed June 28, 2020)).[48] S. Kotsiantis, Supervised Machine Learning: A Review of ClassificationTechniques, Informatica (Ljubljana) 31 (Oct. 2007).[49] A. Liaw, M. Wiener, Classification and Regression by randomForest, Rnews 2 (3) (2002) 18–22.[50] J. Ye, J.-H. Chow, J. Chen, Z. Zheng, Stochastic gradient boosted dis-tributed decision trees, in: Proceeding of the 18th ACM conference onInformation and knowledge management - CIKM ’09, ACM Press, HongKong, China, 2009, p. 2061. doi:10.1145/1645953.1646301 .URL http://portal.acm.org/citation.cfm?doid=1645953.1646301 [51] D. G. Kleinbaum, M. Klein, E. R. Pryor, Logistic regression: a self-learningtext, 3rd Edition, Statistics in the health sciences, Springer, New York,2010.[52] P. Domingos, A few useful things to know about machine learning, Commu-nications of the ACM 55 (10) (2012) 78. doi:10.1145/2347736.2347755 .URL http://dl.acm.org/citation.cfm?doid=2347736.2347755 [53] J. Bergstra, Y. Bengio, Random Search for Hyper-Parameter Optimization,Journal of Machine Learning Research 13 (Feb) (2012) 281–305.URL [54] D. Yarowsky, Unsupervised Word Sense Disambiguation Rivaling Super-vised Methods, in: 33rd Annual Meeting of the Association for Computa-tional Linguistics, Association for Computational Linguistics, Cambridge,42assachusetts, USA, 1995, pp. 189–196. doi:10.3115/981658.981684 .URL [55] A. D. Marco, R. Navigli, Clustering and Diversifying Web Search Resultswith Graph-Based Word Sense Induction, Computational Linguistics 39 (3)(2013) 709–754. doi:10.1162/COLI_a_00148 .URL [56] S. Arora, Y. Li, Y. Liang, T. Ma, A. Risteski, Linear Algebraic Struc-ture of Word Senses, with Applications to Polysemy, Transactions ofthe Association for Computational Linguistics 6 (2018) 483–495. 
doi:10.1162/tacl_a_00034 .URL [57] M. Imran, P. Mitra, C. Castillo, Twitter as a Lifeline: Human-annotatedTwitter Corpora for NLP of Crisis-related Messages, arXiv:1605.05894[cs]ArXiv: 1605.05894 (May 2016).URL http://arxiv.org/abs/1605.05894 [58] C. De Boom, S. Van Canneyt, B. Dhoedt, Semantics-driven event clusteringin Twitter feeds, in: Proceedings of the 5th Workshop on Making Sense ofMicroposts, Vol. 1395, CEUR, 2015, pp. 2–9.URL http://hdl.handle.net/1854/LU-6887623 [59] T. Mikolov, K. Chen, G. Corrado, J. Dean, Efficient Estimation of WordRepresentations in Vector Space, arXiv:1301.3781 [cs]ArXiv: 1301.3781(Sep. 2013).URL http://arxiv.org/abs/1301.3781 [60] Y. Goldberg, O. Levy, word2vec Explained: deriving Mikolov etal.’s negative-sampling word-embedding method, arXiv:1402.3722 [cs,stat]ArXiv: 1402.3722 (Feb. 2014).URL http://arxiv.org/abs/1402.3722 doi:10.1109/ASONAM.2012.14 .[62] G. Salton, E. A. Fox, H. Wu, Extended Boolean information retrieval,Communications of the ACM (Nov. 1983).URL https://dl.acm.org/doi/abs/10.1145/182.358466 [63] D. Anguelov, C. Dulong, D. Filip, C. Frueh, S. Lafon, R. Lyon, A. Ogale,L. Vincent, J. Weaver, Google street view: Capturing the world at streetlevel, Computer 43 (6) (2010) 32–38.[64] A. G. Rundle, M. D. Bader, C. A. Richards, K. M. Neckerman, J. O. Teitler,Using google street view to audit neighborhood environments, Americanjournal of preventive medicine 40 (1) (2011) 94–100.[65] R. Lagerstrom, Y. Arzhaeva, P. Szul, O. Obst, R. Power, B. Robinson,T. Bednarz, Image Classification to Support Emergency Situation Aware-ness, Frontiers in Robotics and AI 3 (2016). doi:10.3389/frobt.2016.00054 .URL [66] D. T. Nguyen, F. Ofli, M. Imran, P. Mitra, Damage Assessment from So-cial Media Imagery Data During Disasters, in: Proceedings of the 2017IEEE/ACM International Conference on Advances in Social NetworksAnalysis and Mining 2017 - ASONAM ’17, ACM Press, Sydney, Australia,2017, pp. 569–576. 
doi:10.1145/3110025.3110109 .URL http://dl.acm.org/citation.cfm?doid=3110025.3110109 [67] X. Li, H. Zhang, D. Caragea, M. Imran, Localizing and Quantifying Dam-age in Social Media Images, arXiv:1806.07378 [cs]ArXiv: 1806.07378 (Jun.2018).URL http://arxiv.org/abs/1806.07378 doi:10.1145/2487788.2488033 .URL http://dl.acm.org/citation.cfm?doid=2487788.2488033 [69] F. Marra, D. Gragnaniello, D. Cozzolino, L. Verdoliva, Detection of GAN-Generated Fake Images over Social Networks, in: 2018 IEEE Conferenceon Multimedia Information Processing and Retrieval (MIPR), 2018, pp.384–389, iSSN: null. doi:10.1109/MIPR.2018.00084 .[70] C. Buntain, J. Golbeck, Automatically Identifying Fake News in Pop-ular Twitter Threads, 2017 IEEE International Conference on SmartCloud (SmartCloud) (2017) 208–215ArXiv: 1705.01613. doi:10.1109/SmartCloud.2017.40 .URL http://arxiv.org/abs/1705.01613 [71] X. Zhou, R. Zafarani, Fake News: A Survey of Research, Detection Meth-ods, and Opportunities (Dec. 2018).URL https://arxiv.org/abs/1812.00315v1 [72] F. Masood, G. Ammad, A. Almogren, A. Abbas, H. A. Khattak, I. Ud Din,M. Guizani, M. Zuair, Spammer Detection and Fake User Identificationon Social Networks, IEEE Access 7 (2019) 68140–68152. doi:10.1109/ACCESS.2019.2918196 .[73] M. Del Vicario, W. Quattrociocchi, A. Scala, F. Zollo, Polarizationand Fake News: Early Warning of Potential Misinformation Targets,arXiv:1802.01400 [cs]ArXiv: 1802.01400 (Feb. 2018).URL http://arxiv.org/abs/1802.01400 [74] A. H. Wang, Detecting Spam Bots in Online Social Networking Sites:A Machine Learning Approach, in: D. Hutchison, T. Kanade, J. Kit-tler, J. M. Kleinberg, F. Mattern, J. C. Mitchell, M. Naor, O. Nierstrasz,45. Pandu Rangan, B. Steffen, M. Sudan, D. Terzopoulos, D. Tygar, M. Y.Vardi, G. Weikum, S. Foresti, S. Jajodia (Eds.), Data and ApplicationsSecurity and Privacy XXIV, Vol. 6166, Springer Berlin Heidelberg, Berlin,Heidelberg, 2010, pp. 335–342. 
doi:10.1007/978-3-642-13739-6_25 .URL http://link.springer.com/10.1007/978-3-642-13739-6_25 [75] V. S. Subrahmanian, A. Azaria, S. Durst, V. Kagan, A. Galstyan, K. Ler-man, L. Zhu, E. Ferrara, A. Flammini, F. Menczer, A. Stevens, A. Dekht-yar, S. Gao, T. Hogg, F. Kooti, Y. Liu, O. Varol, P. Shiralkar, V. Vydis-waran, Q. Mei, T. Hwang, The DARPA Twitter Bot Challenge, Computer49 (6) (2016) 38–46, arXiv: 1601.05140. doi:10.1109/MC.2016.183 .URL http://arxiv.org/abs/1601.05140 [76] P. Efthimion, S. Payne, N. Proferes, Supervised Machine Learning BotDetection Techniques to Identify Social Twitter Bots, SMU Data ScienceReview 1 (2) (Jul. 2018).URL https://scholar.smu.edu/datasciencereview/vol1/iss2/5 [77] S. R. Sahoo, B. B. Gupta, Hybrid approach for detection of maliciousprofiles in twitter, Computers & Electrical Engineering 76 (2019) 65–81. doi:10.1016/j.compeleceng.2019.03.003 .URL [78] E. Ferrara, O. Varol, C. Davis, F. Menczer, A. Flammini, The rise of socialbots, Communications of the ACM (Jun. 2016).URL https://dl.acm.org/doi/abs/10.1145/2818717 [79] A. Karami, V. Shah, R. Vaezi, A. Bansal, Twitter Speaks: A Case ofNational Disaster Situational Awareness, arXiv:1903.02706 [cs, stat]ArXiv:1903.02706 (Mar. 2019).URL http://arxiv.org/abs/1903.02706 [80] S. Mohanty, B. Biggers, S. Sayedahmed, E. Goldstein, R. Bunch,G. Chi, F. Sadri, T. McCoy, A. Cosby, Geolocated tweets from florida,46sa during hurricane irma (2017) with relevance scores (Jan 2021). doi:10.6084/m9.figshare.11900325 .URL https://figshare.com/articles/dataset/Geolocated_Tweets_from_Florida_USA_during_Hurricane_Irma_2017_with_Relevance_Scores/11900325/1https://figshare.com/articles/dataset/Geolocated_Tweets_from_Florida_USA_during_Hurricane_Irma_2017_with_Relevance_Scores/11900325/1