A Novel Approach for Detection and Ranking of Trendy and Emerging Cyber Threat Events in Twitter Streams
Avishek Bose, Vahid Behzadan, Carlos Aguirre, William H. Hsu
[email protected], [email protected], [email protected], [email protected]
Department of Computer Science, Kansas State University, Manhattan, Kansas, 66506, USA
Abstract—We present a new machine learning and text information extraction approach to the detection of cyber threat events in Twitter that are novel (previously non-extant) and developing (marked by significance with respect to similarity with a previously detected event). While some existing approaches to event detection measure novelty and trendiness, typically as independent criteria and occasionally as a holistic measure, this work focuses on detecting both novel and developing events using an unsupervised machine learning approach. Furthermore, our proposed approach enables the ranking of cyber threat events based on an importance score by extracting the tweet terms that are characterized as named entities, keywords, or both. We also impute influence to users in order to assign a weighted score to noun phrases in proportion to user influence and the corresponding event scores for named entities and keywords. To evaluate the performance of our proposed approach, we measure the efficiency and detection error rate for events over a specified time interval, relative to human annotator ground truth.
Index Terms—novelty detection, emerging topics, event detection, named entity recognition, threat intelligence, user influence, tweet analysis
I. INTRODUCTION
This paper presents a new methodology for recognizing potential cyber threats using passive filtering and ranking in social text streams, particularly Twitter streams.
Passive monitoring here refers to collecting intelligence on, and solutions to, different cyber threats from different platforms using only text corpora and lists of named entities or keywords (e.g., gazetteers) rather than direct background knowledge of threats. Twitter is examined as a high-bandwidth platform where actors from both sides of cyberdefense, such as attackers and security professionals, post cybersecurity-related messages [25]. The overall goal of this work is to analyze these messages collectively to attain actionable insights and collect intelligence on emergent cyber threat events. Detecting events from social media includes a) novel event detection, including first stories or tweets about previously non-extant topics; and b) developing event detection (especially for bursty topics, but also for non-bursty topics for which volume and aggregate importance build up gradually). In this work, we treat the novelty of events and their developing nature (emergence) as orthogonal properties. This allows novel events that have not yet attained trending status or viral propagation to be tracked, while still incorporating traditional trend detection methods. Recent research includes some work on detecting both novel and developing events in Twitter streams (e.g., [2], [15]), especially where emergence is defined as trending. However, only a few studies have further focused on detecting cyber threat events in Twitter streams. Furthermore, we propose an approach to the ranking of events with regard to their significance. While such a ranking generally depends on both the application domain and user objectives, the relative importance to a general community of interest, such as the cybersecurity community, can be imputed based on pervasiveness, spread rate, and novelty.
In this study, we also rank the two types of events based on the order of their corresponding importance scores, to show how important a particular event is compared to proximate events within a user-specified time range of a reference tweet. In contrast with large document corpora, analyzing short documents such as tweets presents some specific semantic challenges to extracting terms, relationships, patterns, and actionable insights in general. For example, terms mentioned in a short tweet lack context, and there is less co-occurrence data in the entire corpus on which to base expressible relations between named entities or terms. Our system takes as input a user-specified maximum interval of detection for related cyber threat events, within an original tweet that is deemed relevant. The full text of this tweet, or the quoted part of a retweet, is captured. Social network parameters such as indegree (number of followers) are calculated and normalized by range. The text bodies of tweets are vectorized using term frequency-inverse document frequency (TFIDF) [23], the resulting TFIDF vectors are clustered using the
DBSCAN [22] density-based clustering algorithm, noise points are discarded, and the concatenated text contents of each cluster are ranked using the TextRank algorithm [20] to obtain representative keywords and named entities that represent potential events. We then identify different scenarios: a) novel and developing story; b) novel story only; c) developing story only; d) not an event, based on heuristics that are described in Section IV. Additionally, we also calculate an importance score for each event based on the heuristics presented in Section IV. Finally, we tag each event according to its descriptive features and provide a rank based on its importance score.
Key novel contributions of this work are as follows:
1) We detect both trendy and novel types of events related to cybersecurity from Twitter streams.
2) We provide a method for the ranking of potential cyber threat events according to their importance score based on keywords, as well as their named entity confidence and user influence scores.
3) The proposed method can be tuned to capture important cybersecurity events based on user-specified parameters.
II. RELATED APPROACHES
This section briefly summarizes key methodologies for cyber threat detection from text corpora, particularly social media.
Dabiri et al. [4] analyzed traffic-related tweets for detecting traffic events by applying deep learning models, including convolutional and recurrent neural networks incorporating a word2vec-based word embedding layer to represent terms. This approach performs well but is domain-dependent and highly costly in terms of manual annotation for high-throughput sources of training data such as Twitter. In contrast,
TwitInfo [3] incorporates a new streaming algorithm that automatically discovers peaks of event-related tweets and labels them from the tweet texts. This approach, however, focuses only on the burstiness of tweets and ignores both user influence and novelty with respect to developing events.
Rupinder et al. [9] also proposed a framework based on deep learning for extracting cyber threat and security-related insights from Twitter, categorizing three types of threats (examples of which are Distributed Denial of Service (DDoS) attacks, data breaches, and account hijacking). From text documents, events are extracted using a) target domain generation; b) dynamically-typed query expansion; and c) event extraction. This approach employs both syntactic and semantic analysis using dependency tree graphs and convolutional kernels, but is highly computationally intensive due to the cost of autoencoder training.
Sceller et al. [13] use unsupervised learning to detect and categorize cybersecurity events by analyzing cybersecurity-related Twitter posts based on a set of seed keywords specified for each level of a taxonomy. This algorithm is prone to false negatives because it may not detect potential cyber threat events as events in the first place.
Ranade et al. [5] propose a method for processing threat-related tweets using the Security Vulnerability Concept Extractor (SVCE), which generates tags about cybersecurity threats or vulnerabilities such as the means of an attack, the consequences of an attack, and the affected software, hardware, and vendors. This approach does not generalize to user communities, as it is personalized for individual users' system profiles.
Edouard [6] proposes a framework that utilizes Named Entity Recognition (NER) and ontology reasoning (using DBPedia), along with classification learning approaches such as Naive Bayes, SVM, and a deep neural network (Long Short Term Memory / Recurrent Neural Network, aka LSTM-RNN), for category tag imputation. The graph algorithm PageRank is used to rank candidate items for information retrieval.
The approach of Lee et al. [11] focuses on community communication and influence to detect cyber threats by grouping highly contributing Twitter users and scoring them as an expert community whose information can be explored and then efficiently exploited. This framework incorporates four components: a) an interface to the Twitter social media platform; b) a flexible machine learning system interface for document categorization; c) a mixture-of-experts weighting and extraction scheme; and d) a new topic detector. This framework is highly dependent on expertise and data quality.
A method by Sapienza et al. [12] considers various web data sources to generate indications of warnings for detecting upcoming potential cyber threats. While potentially extensible to named entities discoverable by set expansion, this approach is focused on detecting "novel words" and does not yet incorporate a full contextualized topic model, feature weighting model, or method of user influence.
Finally, the work of Alan et al. [16] is based on a supervised learning approach to train an extractor for extracting new categories of cybersecurity events by seeding a small number of positive event examples over a significant amount of unlabeled data. As with previous approaches, it does not yet incorporate full NER nor allow for entity set expansion.
III. BACKGROUND
This section presents a brief review of the key technologies that are adopted in our proposed framework for threat event detection in tweets.
A. Named Entity Recognition (NER)
In general, NER is an information extraction task aimed at locating and classifying the names of specific entities such as persons, organizations, and locations, based on analysis of text units such as n-grams and noun phrases. Generic entities such as numerical quantities are sometimes also included. In our analysis, NER is used to discover the names of entities in reported cyber threats. Key objectives of using machine learning to improve NER are: a) set expansion, to broaden the set of cyber threats based on synonymy and other relationships that can be inferred by text pattern analysis; b) feature weighting for relevance or salience; c) relationships that are discoverable from data; d) confidence scoring.
B. TextRank
The TextRank algorithm [20] is an extended version of Google's PageRank [21] algorithm: it determines keywords by building a word graph from a given text document, rather than ranking webpages as PageRank does. The TextRank score measures the importance of a word in a given text in the same way that the PageRank score works for webpages. The importance associated with a vertex is determined based on the votes cast for it and the scores of the vertices casting those votes.
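The vertex score driving TextRank is the PageRank recurrence applied to the word graph; as given by Mihalcea and Tarau [20], the (unweighted) score of a vertex V_i with damping factor d (typically 0.85) is:

```latex
S(V_i) = (1 - d) + d \sum_{V_j \in In(V_i)} \frac{S(V_j)}{|Out(V_j)|}
```

The sum runs over the vertices V_j that link to V_i, so a word inherits importance from the words it co-occurs with, in proportion to their own scores.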
C. TFIDF
TFIDF [23] is an information retrieval method used for various purposes such as word co-occurrence-based document vectorization, word ranking, and document similarity calculation. In information retrieval, TF (term frequency) refers to the frequency of a particular word in a document, while IDF (inverse document frequency) refers to the inverse of the word's document frequency over the whole corpus of documents.
D. DBSCAN
DBSCAN [22] is a density-based clustering approach that works by requiring a minimum number of data points (MinPts) inside a specified-radius neighborhood (Eps) of a data point to form a density-reachable cluster; this process continues until no points on the frontier are density-reachable, then restarts with a new initial point.
IV. PROPOSED METHOD
Analyzing Twitter text for valuable insights has always been challenging because of its unstructured writing style and the short length of tweets. In this section, we describe the steps of our approach in the subsections below.
A. Tweet Collection and Early Annotation
In this analysis we used Twitter data collected over four days in September 2018, along with a small portion of data collected in August 2018, stored in a MongoDB database. As our main focus is to extract insights on cybersecurity-related events and rank their scores, we crawled Twitter data using the Twitter API based on a set of security-related keywords. Without applying security-related keywords, the crawled Twitter data would be generalized to all domains, and thus the result would be biased toward detecting general kinds of events. This dataset was manually annotated for relevance to cyber security by four annotators for our earlier work [25]. Although we used security-related keywords for crawling the Twitter data, many of those tweets are irrelevant or promotional. That is why the annotation plays a crucial role here; the resulting dataset of this process is available at [18]. In this study, we initially have 21,368 tweets; working with the annotated data, we found that 11,111 tweets are related to cyber security. We apply our algorithm to the cyber-security-related tweets and to the whole tweet dataset individually. We take the full text of each tweet if the tweet is not quoted or retweeted from another user; if a tweet is retweeted or quoted, we take the full text of the original retweeted or quoted tweet. Then, we let a user give a numeric input specifying the number of time intervals over which tweets occur. This process divides the whole time period of tweet occurrence into equal time chunks based on the number taken from the user. Thus, for each time interval, a number of tweets are aggregated into a chunk based on their corresponding occurrence time.
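The interval-chunking step described above can be sketched as follows (a minimal illustration; the function and variable names are our own, not those of the paper's implementation):

```python
from datetime import datetime

def chunk_tweets_by_interval(tweets, num_intervals):
    """Split tweets, given as (timestamp, text) pairs, into equal time chunks.

    The whole time span of tweet occurrence is divided into num_intervals
    equal-width windows, and each tweet is assigned to its window.
    """
    times = [ts for ts, _ in tweets]
    start, end = min(times), max(times)
    width = (end - start) / num_intervals
    chunks = [[] for _ in range(num_intervals)]
    for ts, text in tweets:
        # Clamp the last timestamp into the final chunk.
        idx = min(int((ts - start) / width), num_intervals - 1)
        chunks[idx].append(text)
    return chunks
```

The user-supplied num_intervals plays the role of the numeric input described above.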
B. Tweet Pre-processing and Cleaning
As we stated earlier, tweet text is very unstructured: it contains many misspelled words, and sometimes the text is not a complete sentence. That is why we apply a tool named SymSpell [24] to correct the misspelled words. Then we keep only the alphanumeric characters of the tweet text and remove all punctuation characters. Next, we remove all stopwords from the text, because they occur so frequently over the whole data set that they may reduce analysis performance. After that, we consider cleaning the tweet texts both with and without stemming. Finally, we remove all tokens whose length is only one character.
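The filtering steps above (minus the SymSpell spelling-correction pass, which is omitted here) can be sketched as below; the tiny stopword list is purely illustrative, standing in for a full stopword lexicon:

```python
import re

# Illustrative stopword list only; a real run would use a full lexicon
# (and SymSpell spelling correction before this stage).
STOPWORDS = {"the", "a", "an", "is", "are", "to", "of", "in", "and", "for"}

def clean_tweet(text, stemmer=None):
    """Keep alphanumeric tokens, drop stopwords and length-1 tokens."""
    text = text.lower()
    # Non-alphanumeric characters (punctuation, emoji, etc.) become spaces.
    text = re.sub(r"[^a-z0-9]+", " ", text)
    tokens = [tok for tok in text.split()
              if tok not in STOPWORDS and len(tok) > 1]
    if stemmer is not None:  # optional stemming pass
        tokens = [stemmer(tok) for tok in tokens]
    return " ".join(tokens)
```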
C. Influential Twitter user Impact
Influential users' tweets are valuable for detecting important cyber-security-related events. That is why we keep a record of each Twitter user and their number of followers, corresponding to their posted cyber-security-related tweets. As the number of followers is directly proportional to the influence of a user, the words such users use in cyber-security-related tweets are also important. So, for each time interval, we normalize the follower counts of all users using Min-Max normalization, which maps each value to between 0 and 1. We then assign the normalized follower count of each user to each of the noun phrases they used in tweets. Here, we used the Python nltk library to extract noun phrases from tweets. If similar words are used by several tweet users, we keep the highest normalized value of a user for each word used in a tweet. Now, for each time interval, across all tweets, each noun phrase has a corresponding value that represents its weight inherited from its user. This value will also be used to calculate the event score.
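The weighting scheme above can be sketched as follows; the record format (one `(follower_count, noun_phrases)` pair per tweet in an interval) and the function name are our own assumptions for illustration:

```python
def noun_phrase_weights(records):
    """records: list of (follower_count, noun_phrases) pairs, one per tweet.

    Returns a dict mapping each noun phrase to the highest Min-Max-normalized
    follower value among the users who used it in the interval.
    """
    counts = [followers for followers, _ in records]
    lo, hi = min(counts), max(counts)
    span = (hi - lo) or 1  # guard against all-equal follower counts
    weights = {}
    for followers, phrases in records:
        norm = (followers - lo) / span  # Min-Max normalization to [0, 1]
        for phrase in phrases:
            # Keep the highest normalized value seen for each phrase.
            weights[phrase] = max(weights.get(phrase, 0.0), norm)
    return weights
```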
D. Determining Algorithm Design Architecture
In this study, we apply a very popular word vectorization method in the NLP domain, tfIdf [23], which is based on word co-occurrence in documents, to make a word vector for each tweet in the data set. Carrying out the process mentioned above for all tweets in a time interval generates a tfIdf matrix. We found that this method gives better performance than the word-semantic-relation-based approaches discussed later in Section VI. Then, we apply the DBSCAN [22] density-based clustering algorithm to the aforementioned tfIdf matrix to find clusters of similar-meaning tweets. These clusters can represent potential events. We did not apply the K-means clustering algorithm because we did not want to limit the number of events found in our analysis for each time interval. For this analysis, we ignore the noise points generated by DBSCAN, because from our observations over the data set we found that the noise points have very little impact on finding cyber-security-related events.
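With scikit-learn, the tfIdf-plus-DBSCAN step can be sketched as below, using the parameter values reported in Section V-A; the wrapper function itself is our own illustration, not the paper's code:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import DBSCAN

def cluster_tweets(texts):
    """Vectorize tweet texts with TFIDF and group them with DBSCAN.

    Parameter values follow those reported in Section V-A.
    Returns {cluster_label: [texts]}, with DBSCAN noise points discarded.
    """
    vec = TfidfVectorizer(max_df=0.90, max_features=200000,
                          min_df=0.01, ngram_range=(1, 1))
    X = vec.fit_transform(texts)
    labels = DBSCAN(eps=1, min_samples=3).fit_predict(X)
    clusters = {}
    for text, label in zip(texts, labels):
        if label != -1:  # -1 marks DBSCAN noise points
            clusters.setdefault(label, []).append(text)
    return clusters
```

Because TFIDF rows are unit-normalized, eps = 1 in Euclidean distance corresponds to a cosine similarity above 0.5, so only fairly similar tweets end up density-reachable.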
E. Event Detection Heuristics and Scoring
Firstly, we aggregate all tweet texts in a cluster into a single text. Then we run named entity and keyword identification on the aggregated text by applying the TextRazor [19] online Named Entity Recognition API and the TextRank [20] algorithm from the Gensim library, respectively. TextRazor provides a confidence score for each named entity, and TextRank from Gensim [26] provides a score for each keyword based on the word graph mentioned earlier in Section IV. We apply two different sets of rules to determine the type of each event and its corresponding score, which will be used to rank the cyber-security-related events. To turn our idea into an implementation, we produce the four different token sets listed below.
1) commonSet - the set of words common to both the named entity and keyword sets. Additionally, we also take some higher-scored named entities and keywords from the two sets namedEntitySet and keywordSet, mentioned below, and add those tokens to the commonSet.
2) keywordSet - the set of words that only appear in the keyword set and are not shared with the named entity set.
3) namedEntitySet - the set of words that only appear in the named entity set and are not shared with the keyword set.
4) unionSet - the set of all named entities and keywords.
Figure 1 depicts a graphical illustration of the four token sets mentioned above. Here commonSet, keywordSet, namedEntitySet and unionSet are represented by k ∪ (K ∩ N) ∪ n, K - N, N - K and K ∪ (K ∩ N) ∪ N respectively, where k and n are the sets of highly scored keywords and named entities, with k ∈ K and n ∈ N.
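The four token sets can be built with plain set operations, following the definitions in Figure 1; the `top_score` cutoff for "highly scored" tokens (the sets k and n) is an illustrative assumption, since the paper does not state the exact cutoff:

```python
def build_token_sets(keyword_scores, entity_scores, top_score=0.9):
    """keyword_scores / entity_scores: dicts mapping token -> score.

    Returns (commonSet, keywordSet, namedEntitySet, unionSet) per Figure 1:
    commonSet = k U (K n N) U n, keywordSet = K - N, namedEntitySet = N - K,
    unionSet = K U N, where k and n are the highly scored subsets.
    """
    K, N = set(keyword_scores), set(entity_scores)
    k = {t for t, s in keyword_scores.items() if s >= top_score}
    n = {t for t, s in entity_scores.items() if s >= top_score}
    common_set = (K & N) | k | n
    keyword_set = K - N
    named_entity_set = N - K
    union_set = K | N
    return common_set, keyword_set, named_entity_set, union_set
```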
1) Determining Event Novelty:
We store all the tokens of the sets namedEntitySet and commonSet in another token set, named noveltyCheckerSet, across all clusters generated for all time intervals. We store these tokens so that we can check the similarity of the token set of each subsequently generated cluster against the stored noveltyCheckerSet, and thereby determine whether the newly generated cluster has some novelty, based on a cosine similarity threshold value determined empirically.
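The paper does not spell out how the token sets are vectorized for the cosine comparison; a minimal sketch, assuming binary (set-membership) vectors, is:

```python
import math

def cosine_similarity(tokens_a, tokens_b):
    """Cosine similarity between two token sets, treated as binary vectors."""
    a, b = set(tokens_a), set(tokens_b)
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))

def is_novel(cluster_tokens, novelty_checker_set, cosine_thresh=0.5):
    # Below the threshold means the cluster is dissimilar to all previously
    # stored event tokens, i.e. potentially novel.
    return cosine_similarity(cluster_tokens, novelty_checker_set) < cosine_thresh
```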
2) Determining Trendiness:
If the similarity score reaches the defined threshold cosineThresh, we determine that the working cluster is just trendy, except for the very first cluster, which has no previous set against which to compare similarity. We also take a user-defined threshold on the number of tweets, tweetThresh, to determine whether an event is trendy. Thus, if the number of tweets does not reach the value specified by the user, it is an unnoticeable event. However, even if the tweets of a cluster satisfy the cosine similarity threshold cosineThresh as well as the tweet-count threshold tweetThresh, the cluster may still not represent a noticeable event, because it may be spamming of a banal topic. So we apply a further heuristic: only if the size of the commonSet is greater than 0.20 times the size of the namedEntitySet is the cluster counted as trendy. We check this because a big cluster of tweet texts will contain many named entities, implying a variety of topics, whereas a single cluster should be biased toward a single topic.
3) Determining Novelty:
Now, if the cosine similarity between the stored token set noveltyCheckerSet and the working cluster's token set is less than the threshold value cosineThresh, the working cluster could be a potential novel event. However, we set a threshold of a minimum of three tweets in the cluster for it to be considered an event. If the number of tweets is greater than or equal to the user-defined trendiness threshold tweetThresh, the cluster is labeled "Novel and Trendy"; if the number of tweets in the cluster is less than that threshold, it is labeled "Just Novel". That is how we determine the event type of a generated cluster of tweets.
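The trendiness and novelty rules of the last two subsections combine into one decision function, sketched below; the threshold defaults are illustrative placeholders, not the paper's exact values:

```python
def classify_cluster(similarity, n_tweets, common_set, named_entity_set,
                     cosine_thresh=0.5, tweet_thresh=5, min_novel_tweets=3):
    """Return the event type of a cluster under the heuristics above."""
    if similarity >= cosine_thresh:
        # Similar to previously stored events: at best trendy.
        if n_tweets < tweet_thresh:
            return "Not an Event"
        # Guard against broad multi-topic clusters (spam of banal topics).
        if len(common_set) > 0.20 * len(named_entity_set):
            return "Just Trendy"
        return "Not an Event"
    # Dissimilar to stored events: potentially novel.
    if n_tweets < min_novel_tweets:
        return "Not an Event"
    return "Novel and Trendy" if n_tweets >= tweet_thresh else "Just Novel"
```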
4) Event Score Calculation Process:
As mentioned earlier, the ranking of cyber-threat-related events is also an important task, so we calculate a score for each event once it has been assigned an event type based on the empirically defined heuristics above. We calculate scores for each of the defined event types individually by applying different heuristics. As the confidence score of a named entity from TextRazor and the score of a keyword from the TextRank algorithm are on different scales, we apply the sigmoid function to normalize the scores of named entities and keywords, respectively. Since every token is stored in a dictionary for each generated cluster, we update the score of each token that is considered both a named entity and a keyword by adding its two scores after normalizing each with the sigmoid function.
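The sigmoid-normalization-and-merge step can be sketched as follows (function names are ours):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def merge_scores(entity_scores, keyword_scores):
    """Normalize each score source with the sigmoid, then sum the two
    normalized scores for tokens that are both named entity and keyword."""
    scores = {}
    for tok, s in entity_scores.items():
        scores[tok] = sigmoid(s)
    for tok, s in keyword_scores.items():
        scores[tok] = scores.get(tok, 0.0) + sigmoid(s)
    return scores
```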
5) Score Calculation for Trendy Event:
Now, for a “Just Trendy” event, we first calculate the entity score of the event by adding the scores of all tokens included in the commonSet and then multiplying the sum by the total number of tweets that make the event trendy. Secondly, we add the values of the noun phrases corresponding to the event's tweets, where each noun phrase's value is inherited from its influential user's follower count. This sum is then added to the initially calculated entity score of the aggregated tweet texts to obtain the total score. We calculate the entity score this way because it assigns a higher score to a trendy event either when tweets on the same topic appear many times in a cluster or when the number of tokens in the commonSet is high. This heuristic assumes that even if an event topic does not appear as many times in its tweets as other frequently appearing event topics, the number of tokens common to both the named entity and keyword sets marks those tokens as important, so the heuristic gives them importance.
6) Score Calculation for Trendy and Novel Event:
Then, for a “Novel and Trendy” event, we add the values of all tokens in the set generated by the union of keywordSet and commonSet. Afterwards, we multiply the sum by the total number of tweets that represent the event to calculate the entity score. We then again add the values of the noun phrases corresponding to the event's tweets, inherited from the influential users' follower counts. This sum is added to the entity score, as discussed in the previous paragraph, to obtain the total score. We calculate the entity score this way because the event has already proven to be novel, and we need to consider whether it has tokens common to both the named entity and keyword sets in order to identify the main topic discussed in the event. Additionally, we also need to consider the keywords appearing in the event, to check which other topics are mentioned in the tweet texts of the novel event. This means a novel and trendy event will get a much higher score compared to other events.
7) Score Calculation for Just Novel Event:
Now, for the “First Story” event, we keep a set of tokens generated by subtracting keywordSet from unionSet and then taking the union with the commonSet. This resulting set stores all the named entities along with the keywords that have very high scores. Then, we add the values of all tokens in the resulting set and multiply the sum by the user-defined trendiness threshold tweetThresh to obtain the entity score. We then repeat the procedure of adding the values of the noun phrases to the entity score of the working event. We calculate the entity score this way because, if the event is just novel, it will not appear in many tweets and could therefore lose its importance. So, to give importance to this kind of event, we multiply the total score of the resulting token set by the value of tweetThresh; thus, it can become at least as important as any trendy event. We choose the aforementioned resulting set because we need to assess the novelty of an event based on the confidence scores of the named entities and some high-ranked keywords. There is no point in considering the whole keyword set in this case, because apart from the tokens common with the named entity set, it merely reflects other trendy topics being discussed.
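The three scoring rules can be sketched in one function; the argument layout (a token-score dictionary, the tuple of four token sets, and the noun-phrase weight dictionary) is our own packaging of the quantities described above:

```python
def event_score(event_type, token_scores, token_sets, n_tweets,
                phrase_weights, tweet_thresh):
    """Total event score = entity score (per the heuristics above)
    plus the influence-weighted noun-phrase values.

    token_sets = (common_set, keyword_set, named_entity_set, union_set).
    """
    common_set, keyword_set, _, union_set = token_sets
    if event_type == "Just Trendy":
        base = sum(token_scores.get(t, 0.0) for t in common_set)
        entity_score = base * n_tweets
    elif event_type == "Novel and Trendy":
        base = sum(token_scores.get(t, 0.0) for t in keyword_set | common_set)
        entity_score = base * n_tweets
    else:  # "Just Novel" / first story
        tokens = (union_set - keyword_set) | common_set
        base = sum(token_scores.get(t, 0.0) for t in tokens)
        # Multiply by tweetThresh so a first story can score as high as a
        # trendy event despite its low tweet count.
        entity_score = base * tweet_thresh
    return entity_score + sum(phrase_weights.values())
```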
8) Ranking Scores of Events:
Our proposed approach repeats the above condition checks for each cluster to determine whether it is an event, and performs the score calculation for each cluster that is considered an event, for each time interval. Finally, we rank the events by ordering their total scores within each time interval. The flow chart of our proposed approach is depicted in Figure 2.
F. Annotation Approach
To evaluate the performance of our proposed approach, we compared the results of our proposed method with a manually annotated list of events. A subset of 301 tweets, collected in sequence in the window from 2018-08-30 23:00:08 CST to 2018-09-02 10:50:19 CST, was manually annotated according to i. impact, ii. tweet count, and iii. worldwide effect, to decide whether to consider it an event. We also consider three categories: i. first story and novel, ii. trending or developing, and iii. novel and trending. For validation, we check the ratio of correctly detected events in that window to the total number of relevant events, and the ratio of correctly detected events to the total number of detected events.
V. EXPERIMENTAL RESULTS
A. Simulation
For our analysis, we use scikit-learn [27] to calculate the tfIdf matrix [23], to apply the DBSCAN [22] algorithm, and to compute cosine similarity. The parameter assignments for making the tfIdf matrix are max_df = 0.90, max_features = 200000, min_df = 0.01, and ngram_range = (1,1). The parameter assignments for the DBSCAN algorithm are eps = 1 and min_samples = 3. We use a cosine similarity threshold of 0.5 for the trendiness similarity check. Table I shows
the results of five time intervals, namely from 2018-08-30 23:00:08 to 2018-09-02 10:50:19.200000, from 2018-09-02 10:50:19.200000 to 2018-09-04 22:40:30.400000, from 2018-09-04 22:40:30.400000 to 2018-09-07 10:30:41.600000, from 2018-09-07 10:30:41.600000 to 2018-09-09 22:20:52.800000, and from 2018-09-09 22:20:52.800000 to 2018-09-12 10:11:04, denoted Intervals 1, 2, 3, 4, and 5 respectively. We keep only the detected True Positive events in Table I.

Fig. 1. Graphical representation of commonSet, keywordSet and namedEntitySet

TABLE I
Summary result of five time intervals; NT: Number of Tweets; JT: Just Trendy; TN: Trendy and Novel; FS: First Story; TE: Total Number of Events
Interval  NT    JT  TN  FS  TE
1         145   0   1   14  15
2         314   0   0   50  50
3         812   1   7   37  45
4         1239  0   9   18  27
5         297   4   0   5   11

From Table I we can see that for the first interval (2018-08-30 23:00:08 to 2018-09-02 10:50:19.200000) we have 145 tweets and 15 events in total. Of those 15 events, we found no “Trendy Event”, 1 “Novel and Trendy Event”, and 14 “Novel Events”; the remaining intervals read similarly. In Table II we show the extracted keywords for each event of the first time interval; these keywords are used to detect cyber-security-related events for that interval. Due to paper space limitations, we show only the plot of the second time interval, in Fig. 3. This figure depicts the detected events on the x axis, the number of tweets on the left-hand y axis, and the event score on the right-hand y axis. The red vertical bar represents the number of tweets and the blue vertical bar represents the event score for each detected event.

B. Annotation-Based Validation
1) Design Selection Approach:
There are a few decisions we had to make to formulate the design architecture of this algorithm. Firstly, we considered analyzing the semantic relations between the words of each tweet text to get insights into cyber security events. That is why we previously applied doc2Vec [28], an extended application of word2vec [30], to find similar-meaning tweets and thereby detect events from the data sets, but we could not get any satisfactory result, because we found that shallow-neural-network text tools like doc2Vec [28], which work on word vector embeddings, do not perform well for short and noisy text data sets. Embedding methods did not work properly on short texts because tokens in a short text have only a thin contextual relation to each other, and this relation gets worse due to misspelled and incomprehensible tokens.

Fig. 2. Flowchart of the proposed approach
Fig. 3. Event plot of the second time interval

A sample result of doc2Vec with hyperparameter values vector_size = 300, min_count = 2, and epochs = 45 is shown in Table IV; it exhibits the most similar and second most similar tweets to a particular tweet, none of which have any noticeable similarity to that tweet document, whereas a document must at least show similarity to itself. The similarity score of each tweet to the particular tweet is given inside the parentheses in the first column. We then considered a different approach, applying LDA [29] to find topics that may represent events. Since tweet text is very short, a tweet almost never represents more than one event; that is why we decided to apply LDA [29] on the aggregated tweet texts of the corresponding time intervals. This approach also failed to show the expected results because of the incoherent nature of tweet text. Due to space limitations, we could not present these results in this work.
Thus, we decided to apply the very popular tfIdf vectorization, because we found that the words in a tweet text have few semantic relations to each other, and word co-occurrence is the better option in this domain. Consequently, we found better performance when comparing the results with the previously applied approaches.
TABLE II
Summary result of time interval 1 ('2018-08-30 23:00:08', '2018-09-02 10:50:19.200000')
Event Number  Keywords
0   'security', 'android (operating system)', 'android', 'wi-fi', 'privacy'
1   'microsoft', 'disclosed', 'twitter', 'windows', 'microsoft windows', 'hacker discloses'
2   'website', 'catalonia', 'spain', 'banking', 'bank', 'inf'
3   'based', 'huff', 'buffer overflow'
4   'vulnerability (computing)', 'security', 'repository', 'critical vulnerability', 'apache', 'inf'
5   'vulnerability', 'resource consumption', 'prior', 'resource', 'rsa', 'bleach'
6   'task', 'windows', 'patch', 'scheduler'
7   'security', 'android (operating system)', 'android', 'data', 'privacy', 'tracking'
8   'cracking ransom', 'coin', 'free', 'ransom', 'cybersex', 'net'
9   'vulnerability (computing)', 'patch', 'spyware', 'phishing', 'inf sec cube security', 'patched', 'malware'
10  'security', 'website'
11  'plus', 'pump'
12  'version', 'web server', 'debugger', 'skype', 'update', 'denial service', 'exploit (computer security)'
13  'cisco systems', 'service', 'cisco'
14  'security', 'photo', 'service'

TABLE III
Summary result of time interval 1 ('2018-08-30 23:00:08', '2018-09-02 10:50:19.200000'); EN: Event Number; EL: Event Link; TC: Tweets Count; NESR: Normalized Event Score Rank; ET: Event Type; AER: Annotator Event Ranking; DBR: Difference Between Rankings; FS: First Story (novel); FST: First Story and Trendy (developing)
EN  TC  Event Score  NESR  ET   EL      AER  DBR
0   5   167.6084     5     FS   link1   4    1
1   7   211.033      3     FS   Link2   2    1
2   21  190.5950     4     FS   Link3   3    1
3   3   55.3226      10    FS   NA      13   3
4   12  110.6048     9     FS   Link4   7    2
5   4   130.4169     8     FS   Link5   8    0
6   3   17.2225      14    FS   Link6   10   4
7   6   145.7938     7     FS   Link7   5    2
8   8   154.3115     6     FS   Link8   6    0
9   5   389.7082     2     FS   Link9   9    7
10  4   40.0639      13    FS   NA      14   1
11  7   46.4706      12    FS   Link10  12   0
12  51  391.3391     1     FST  Link11  1    0
13  5   52.8906      11    FS   Link12  11   0
14  4   16.9345      15    FS   NA      15   0

TABLE IV
Sample result of doc2Vec
Terms                                      Texts
Document                                   guides on fixing sql injections vulnerabilities sql injection technique exploits security vulnerability occurring database layer application the vulnerability present user input either
Most Similar (0.8027611970901489)          free vps server ddos protected hosting
Second Most Similar (0.6457577347755432)   cvnway just ddos server
Median (0.21316465735435486)               minibb bbfuncsearchphp table sql injection
Least (-0.4030833840370178)                ransomware weapon used cyber attacks elixir ng news source trust
2) Validation of the Approach:
Table V shows the performance of our proposed approach according to the evaluation methodology described in Section IV-F. The annotators annotated 301 tweets and found a total of 20 events and 6 tweet clusters that are not events. Our algorithm, on the other hand, found a total of 16 events, of which 15 are real events (true positives) included in the 20 ground-truth events, while one is a false positive. Thus, the true positive, false positive, false negative, and true negative rates are 75%, 16.67%, 25%, and 83.33%, respectively, yielding a high precision of 93.75%. Notably, the Twitter streaming API delivers only a small fraction of tweets, approximately 1% of the total posted, as addressed in the web article [31]. Twitter data alone is therefore insufficient to detect all ongoing cyber security events, which is why we limit ourselves to estimating recall by tracking cyber security events published online. In Table III, we present the fifteen true positive events, along with their tweet counts and corresponding scores. We order the events by their scores and match them against the annotators' annotations. The NESR and AER columns of Table III present our approach's event ranking and the annotators' ranking, respectively; comparing these two columns, we can see that the annotators' judgments closely match our approach in both event detection and event ranking. The validity of the detection approach can be verified by following the links in the EL column to the corresponding reports published in authoritative blogs and newspapers. The ET column indicates the type of each event detected by our algorithm. The sum squared error (SSE) between our approach's ranking and the annotators' ranking, computed from the differences in the DBR column, is 86.

TABLE V
Confusion matrix of the algorithm's generated result
Total Population   | Ground Truth positive  | Ground Truth negative
Derived positive   | True Positive = 75%    | False Positive = 16.67%
Derived negative   | False Negative = 25%   | True Negative = 83.33%
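The figures above can be reproduced directly from the counts reported in the text (20 ground-truth events, 6 non-event clusters, 16 detections of which 15 are real) and the ranking columns of Table III; the short sketch below recomputes them:

```python
# Recompute the Table V rates, the precision, and the ranking SSE
# from the counts reported in the text.
tp, fp = 15, 1            # detected events: true vs. false positives
fn = 20 - tp              # ground-truth events the algorithm missed
tn = 6 - fp               # non-event clusters correctly rejected

tp_rate = tp / (tp + fn)          # 0.75   -> 75%
fp_rate = fp / (fp + tn)          # 0.1667 -> 16.67%
precision = tp / (tp + fp)        # 0.9375 -> 93.75%

# Sum squared error between our ranking (NESR) and the annotators'
# ranking (AER) for the fifteen true-positive events of Table III.
nesr = [5, 3, 4, 10, 9, 8, 14, 7, 6, 2, 13, 12, 1, 11, 15]
aer  = [4, 2, 3, 13, 7, 8, 10, 5, 6, 9, 14, 12, 1, 11, 15]
sse = sum((a - b) ** 2 for a, b in zip(nesr, aer))

print(round(tp_rate, 4), round(fp_rate, 4), round(precision, 4), sse)
# 0.75 0.1667 0.9375 86
```

Note that the large single contribution of event 9 (difference of 7, hence 49) dominates the SSE of 86.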
VI. CONCLUSION
We presented a novel machine learning and text information extraction method for the detection of cyber threat events from tweets. We considered two types of such events: those that are novel, and those that are further developments of previously detected events. Furthermore, we proposed an approach for the ranking of cyber threat events based on an importance score computed from the named entities and keywords in the text of tweets. We also impute influence to users in order to assign a weighted score to noun phrases in proportion to user influence and the corresponding event scores for named entities and keywords. To evaluate the performance of our proposals, we measured the efficiency and detection error rate for events over a specified time interval, relative to human annotator ground truth, and demonstrated the feasibility of applying our method to detecting cyber threat events from tweets. Future directions of this research include the extension of our current method to the detection and ranking of sub-events within each cyber threat event. Moreover, the heuristics applied in this work are presented as proofs of concept, leaving room for further enhancement and customization per user requirements. As a further avenue of future work, the methodology used for measuring user influence can be extended via means such as meta-network modeling and link extraction of the dynamic social network of users active in the cybersecurity domain.

REFERENCES
[1] Q. Li, A. Nourbakhsh, S. Shah, and X. Liu, "Real-Time Novel Event Detection from Social Media," San Diego, CA, 2017, pp. 1129-1139. doi: 10.1109/ICDE.2017.157.
[2] Mario Cataldi, Luigi Di Caro, and Claudio Schifanella, "Emerging topic detection on Twitter based on temporal and social terms evaluation," in Proceedings of the Tenth International Workshop on Multimedia Data Mining (MDMKDD '10), ACM, New York, NY, USA, Article 4, 10 pages, 2010. DOI: https://doi.org/10.1145/1814245.1814249.
[3] Adam Marcus, Michael S. Bernstein, Osama Badar, David R. Karger, Samuel Madden, and Robert C. Miller, "Twitinfo: aggregating and visualizing microblogs for event exploration," in Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI '11), ACM, New York, NY, USA, pp. 227-236, 2011. DOI: https://doi.org/10.1145/1978942.1978975.
[4] Sina Dabiri and Kevin Heaslip, "Developing a Twitter-based traffic event detection model using deep learning architectures," Expert Systems with Applications, vol. 118, pp. 425-439, 2019, ISSN 0957-4174. https://doi.org/10.1016/j.eswa.2018.10.017.
[5] P. Ranade, S. Mittal, A. Joshi, and K. Joshi, "Using Deep Neural Networks to Translate Multi-lingual Threat Intelligence," Miami, FL, 2018, pp. 238-243. doi: 10.1109/ISI.2018.8587374.
[6] A. Edouard, "Event detection and analysis on short text messages," Université Côte d'Azur, 2017.
[7] W. Li and Y. Huang, "New Event Detection Based on LDA and Correlation of Subject Terms," Wuhan, 2011, pp. 1-4. doi: 10.1109/ITAP.2011.6006301.
[8] Jey Han Lau, Nigel Collier, and Timothy Baldwin, "On-line trend analysis with topic models," Proceedings of COLING 2012, pp. 1519-1534, 2012.
[9] Rupinder Paul Khandpur, Taoran Ji, Steve Jan, Gang Wang, Chang-Tien Lu, and Naren Ramakrishnan, "Crowdsourcing Cybersecurity: Cyber Attack Detection using Social Media," in Proceedings of the 2017 ACM Conference on Information and Knowledge Management (CIKM '17), ACM, New York, NY, USA, pp. 1049-1057, 2017. DOI: https://doi.org/10.1145/3132847.3132866.
[10] Dominik Wurzer, Victor Lavrenko, and Miles Osborne, "Twitter-scale new event detection via k-term hashing," Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp. 2584-2589, 2015.
[11] K.-C. Lee, C.-H. Hsieh, L.-J. Wei, C.-H. Mao, J.-H. Dai, and Y.-T. Kuang, "Sec-Buzzer: cyber security emerging topic mining with open threat intelligence retrieval and timeline event annotation," Soft Computing, vol. 21, no. 11, pp. 2883-2896, 2017.
[12] Anna Sapienza, Sindhu Kiranmai Ernala, Alessandro Bessi, Kristina Lerman, and Emilio Ferrara, "DISCOVER: Mining online chatter for emerging cyber threats," Companion of The Web Conference 2018, pp. 983-990, International World Wide Web Conferences Steering Committee, 2018.
[13] Quentin Le Sceller, ElMouatez Billah Karbab, Mourad Debbabi, and Farkhund Iqbal, "SONAR: Automatic Detection of Cyber Security Events over the Twitter Stream," Proceedings of the 12th International Conference on Availability, Reliability and Security (ARES '17), ACM, New York, NY, USA, Article 23, 11 pages, 2017. DOI: https://doi.org/10.1145/3098954.3098992.
[14] Xiaojing Liao, Kan Yuan, XiaoFeng Wang, Zhou Li, Luyi Xing, and Raheem Beyah, "Acing the IOC Game: Toward Automatic Discovery and Analysis of Open-Source Cyber Threat Intelligence," Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security (CCS '16), ACM, New York, NY, USA, pp. 755-766, 2016. DOI: https://doi.org/10.1145/2976749.2978315.
[15] Georgiana Ifrim, Bichen Shi, and Igor Brigadir, "Event Detection in Twitter using Aggressive Filtering and Hierarchical Tweet Clustering," in SNOW-DC@WWW, pp. 33-40, 2014.
[16] Alan Ritter, Evan Wright, William Casey, and Tom Mitchell, "Weakly Supervised Extraction of Computer Security Events from Twitter," in Proceedings of the 24th International Conference on World Wide Web (WWW '15), International World Wide Web Conferences Steering Committee, Republic and Canton of Geneva, Switzerland, pp. 896-905, 2015. DOI: https://doi.org/10.1145/2736277.2741083.
[17] Eunice Picareta Branco, "Cyberthreat discovery in open source intelligence using deep learning techniques," PhD dissertation.
[20] Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, 2004.
[21] Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd, "The PageRank citation ranking: Bringing order to the web," Stanford InfoLab, 1999.
[22] Martin Ester, Hans-Peter Kriegel, Jörg Sander, and Xiaowei Xu, "A density-based algorithm for discovering clusters in large spatial databases with noise," KDD, vol. 96, no. 34, pp. 226-231, 1996.
[23] H. Wu, R. Luk, K. Wong, and K. Kwok, "Interpreting TF-IDF term weights as making relevance decisions," ACM Transactions on Information Systems, 26 (3), 2008.
[24] Wolf Garbe <[email protected]>, "SymSpell 6.4," https://github.com/wolfgarbe/symspell.
[25] Vahid Behzadan, Carlos Aguirre, Avishek Bose, and William Hsu, "Corpus and Deep Learning Classifier for Collection of Cyber Threat Indicators in Twitter Stream," pp. 5002-5007, IEEE, 2018.
[26] Radim Řehůřek and Petr Sojka, "Software Framework for Topic Modelling with Large Corpora," Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, pp. 45-50, May 22, 2010. DOI: http://is.muni.cz/publication/884893/en.
[27] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay, "Scikit-learn: Machine Learning in Python," Journal of Machine Learning Research, vol. 12, pp. 2825-2830, 2011.
[28] Quoc Le and Tomas Mikolov, "Distributed representations of sentences and documents," in International Conference on Machine Learning, pp. 1188-1196, 2014.
[29] David M. Blei, Andrew Y. Ng, and Michael I. Jordan, "Latent Dirichlet Allocation," Journal of Machine Learning Research, 3 (4-5), pp. 993-1022, January 2003.
[30] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean, "Efficient estimation of word representations in vector space," CoRR, abs/1301.3781, 2013.
[31] Andy Piper, "Potential adjustments to Streaming API sample volumes," https://twittercommunity.com/t/potential-adjustments-to-streaming-api-sample-volumes/31628.