OntoEnricher: A Deep Learning Approach for Ontology Enrichment from Unstructured Text

Abstract

Information security in the cyber world is a major cause for concern, with a significant increase in the number of attack surfaces. Existing information on vulnerabilities, attacks, controls, and advisories available on the web provides an opportunity to represent knowledge and perform security analytics to mitigate some of the concerns. Representing security knowledge in the form of an ontology facilitates anomaly detection, threat intelligence, reasoning and relevance attribution of attacks, and more. This necessitates dynamic and automated enrichment of information security ontologies. However, existing ontology enrichment algorithms based on natural language processing and ML models have issues with the contextual extraction of concepts in words, phrases and sentences. This motivates the need for sequential Deep Learning architectures that traverse dependency paths in text and extract embedded vulnerabilities, threats, controls, products and other security-related concepts and instances from learned path representations. In the proposed approach, Bidirectional LSTMs trained on a large DBpedia dataset and a 2.8 GB Wikipedia corpus, along with the Universal Sentence Encoder, were deployed to enrich an ISO 27001 [8] based information security ontology. The approach yielded a test accuracy of over 80% when tested with knocked-out concepts from the ontology and web page instances to validate its robustness.


Lalit Mohan Sanagavarapu, Vivek Iyer, and Y Raghu Reddy

Software Engineering Research Centre, IIIT Hyderabad, India


Keywords:

Ontology Enrichment · Information Security · Bidirectional LSTM · Universal Sentence Encoder

In recent times, there has been an exponential increase in the number of content providers and content consumers on the internet, driven by improved digital literacy, affordable devices, better networks, etc. Further, the number of internet-connected devices per person is expected to increase even more with the adoption of emerging technologies such as the Internet of Things and 5G. This change in users and usage is leading to an increase in data breaches (https://digitalguardian.com/blog/history-data-breaches). In many cases, the impact is realized long after the attack. Typically, organizations invest in security tools and infrastructure that are based on rules, statistical models and machine learning (ML) techniques to identify and mitigate the risks arising from the threats. In addition, organizations purchase threat intelligence feeds to continuously monitor IT infrastructure for anomaly detection. Threat intelligence feeds from service providers carry an expensive subscription fee and, to a large extent, contain threat intelligence that is already available in public forums. Public forums such as blogs, discussion forums, government sites, and social media channels including Twitter contain unstructured threat intelligence on vulnerabilities, attacks, and controls. Tech-savvy internet users interested in information security access public forums to search and browse security products, their configurations, reviews, vulnerabilities and other related content for awareness and to protect IT assets.

In recent years, organizations’ information security infrastructure has used the Structured Threat Information eXpression (STIX) / Trusted Automated Exchange of Intelligence Information (TAXII) knowledge representation from OASIS to represent observable objects and their properties in the cyber domain. However, automated processing of unstructured text to generate the STIX format is a formidable challenge [31].
Interestingly, there are transformations available to convert the XML-based STIX format to the ontological ‘OWL’ or ‘RDF’ formats, which in part has influenced OASIS to adopt ontologies for representation. Evidently, research on using unstructured security-related content to enrich ontologies is gaining ground to mitigate risks related to zero-day attacks, malware characterization, digital forensics, and incident response and management [1, 11, 31]. Security ontologies are used to analyze vulnerabilities and model attacks [8, 11, 14, 35]. The concepts, relationships and instances of security ontologies are used to validate the level of defence-in-depth that protects IT assets and to map security product features to controls, which leads to assurance of the security infrastructure. The constraints and properties of ontologies allow root cause analysis of attacks. Additionally, given that security-related data is in structured, semi-structured or unstructured forms, unifying them with ontologies aids situational awareness and readiness to defend against an attack [31].

Traditionally, domain experts constructed and maintained ontologies. Given the extent of effort and cost involved, access to domain content and the ability to process text with advanced natural language processing (NLP) techniques and ML models on powerful IT infrastructure open up research opportunities to construct and manage ontologies. Information security ontologies can be constructed or enriched from unstructured text available on public forums, vulnerability databases such as the National Vulnerability Database (NVD, https://nvd.nist.gov) and other information security processing systems [2, 26]. Also, the standards and guidelines from ISO/IEC [12], NIST in the US, ENISA in the European Union, the Cloud Security Alliance (CSA) and others for protecting the confidentiality, integrity and availability of IT assets contain embedded concepts.
The ISO 27001:2015 [8] based security ontologies that encompass most of these guidelines are being extensively explored for protection, auditing and compliance checking. Enriching an ISO 27001 based ontology provides wider acceptance, easier management and interoperability.

In this work, we propose to enrich a widely accepted information security ontology instead of constructing a new ontology from text. This avoids the inclusion of trivial concepts and relations. Successful enrichment also enables wider acceptance and usage by domain experts. However, the available literature on ontology enrichment from text is based on approaches utilizing word similarity and supervised ML models [11, 27]. These ontology enrichment approaches, albeit useful for extracting word-level concepts, are limited with respect to (a) extraction of longer concepts embedded in compound words and phrases, (b) factoring in context while identifying relevant concepts, and (c) extracting and classifying instances [13]. In the proposed approach (OntoEnricher), we implemented a supervised sequential deep learning model that (a) factors in context from the grammatical and linguistic information encoded in the dependency paths of a sentence, and (b) utilizes sequential neural networks, such as Bidirectional Long Short-Term Memory (LSTM) [29], to traverse dependency paths and learn relevant path representations that constitute relations. In addition, we utilized the pre-trained transformer-based architecture of the Universal Sentence Encoder (USE) [4] to handle distributional representations of compound words, phrases, and instances.

Information security ontology enrichment datasets are not readily available. As a result, we used a semi-automatic approach to create a training dataset of 97,425 related terms (hypernyms, hyponyms and instances) extracted from DBpedia for all the concepts in the information security ontology.
To learn the syntactic and semantic dependency structure in sentences, a 2.8 GB training corpus on information security was extracted from Wikipedia for all the terms in the ontology and the DBpedia dataset. The dataset, curated by three Chief Information Security Officers, and the corpus were used to train OntoEnricher.

OntoEnricher implements a bidirectional LSTM for learning relevant linguistic information and path sequences in dependency trees to enrich concepts, relations, and instances in the ontology. We tested OntoEnricher with 10% of the training dataset, with terms knocked out from the ontology, and with unstructured text from web pages, and achieved an average accuracy of 80%, which is better than the current state-of-the-art approaches. The code and documentation of the entire ontology enrichment pipeline are publicly available on GitHub [30] for reuse and extension. The remainder of the paper is organized as follows: Section 2 discusses the state of the art in ontology enrichment. Section 3 details OntoEnricher, our approach to ontology enrichment from text. The approach is evaluated in Section 4. The conclusion and scope of future work are presented in Section 5.


This section discusses related work on the enrichment of ontologies from unstructured text as well as approaches to create and maintain information security ontologies. Work on the enrichment of knowledge graphs (KG) from unstructured text is also discussed, as knowledge graphs represent knowledge and share similarities with ontologies.

Researchers have worked on knowledge acquisition from text to construct ontologies for the past couple of decades [3, 17]. The last decade witnessed significant progress in the field of information extraction from the web with projects such as DBpedia, Freebase and others. The work of Mitchell et al. [19], known as ‘NELL’, is a never-ending system that learns from the web and bootstraps knowledge graphs on a continuous basis. Tools such as ReVerb [7] and OLLIE [28] are open information extraction systems that extract a triple from a sentence using syntactic and lexical patterns. Although these approaches extract triples from unstructured text using shallow and fast models, they do not handle ambiguity during entity mapping and do not learn features as expressive as those of deep, multi-layer models.

ML models based on probabilistic methods, neural networks and others were also explored for ontology enrichment from text [17, 22, 23]. In 2017, Wang et al. [33] conducted a survey on knowledge graph completion, entity classification and resolution, and relation extraction. The study classified embedding techniques into translational distance models and semantic matching models. The study also stated that additional information in the form of entity types, textual descriptions, relation paths and logical rules strengthens the research. Deep learning models such as CNN [5], LSTM [16, 20] and their variants were used to construct knowledge graphs from text, as they carry memory cells and forget gates to build context and reduce noise. Vedula et al. [32] proposed an approach to bootstrap newer ontologies from related domains.

Some of the recent approaches are based on Word2Vec [34] and its variants such as Phrase2Vec or Doc2Vec, which use distributional similarities to identify concepts to enrich an ontology. However, these approaches underperform in the extraction of concepts embedded in words, phrases and sentences due to their inability to adequately characterize context. Compared to Word2Vec and its variants, the Universal Sentence Encoder (USE) [4] is promising for identifying concepts in long phrases, as it encodes text into high-dimensional vectors for semantic similarity. Lately, researchers [9] have been exploring USE to produce sentence embeddings and deduce semantic closeness in queries. Although transformer-based models such as BERT and XLNet [6, 18] are of interest to ontology enrichment researchers, training them for a domain is effort-intensive.

The literature on enriching security ontologies from text drew attention with OASIS’s STIX/TAXII standardization and open source threat intelligence. Most of the current work on security ontologies from text (construction or enrichment) is based on string, substring, prefix and postfix matching of terms, Word2Vec and other basic ML models [21, 23, 31]. For ontologies as well, deep learning approaches based on recurrent neural networks are trending because of their ability to build context over multiple words [10, 14]. Gasmi et al. [10] used LSTM for the population of security ontologies. However, the details of corpus creation, phrase handling and the robustness of the approach were not elaborated, and only 40 entities were used in the model. The literature revealed that security ontologies based on ISO 27001 [8] and MITRE Corporation’s cyber security effort [31] were the most referenced.

The proposed ontology enrichment approach enriches a seed ontology with concepts, relations and instances extracted from unstructured text. As shown in Figure 1, the ontology enrichment approach consists of four stages: (i) Dataset Creation: creation of a training dataset by extracting and curating related terms from DBpedia for all concepts in the ontology; (ii) Corpus Creation: creation of a domain-specific training corpus by parsing a Wikipedia dump using various filtering measures; (iii) Training: training OntoEnricher for relation classification of term pairs using the training dataset and corpus; and (iv) Testing: testing the model by enriching the ontology from domain-specific webpages.

Fig. 1. Approach used for Ontology Enrichment

The information security seed ontology is based on ISO 27001 [8]. The ISO 27001:2015 standard [12] has 114 controls across 14 groups. These groups contain controls related to ‘Human Resources’, ‘Asset Management’, ‘Access Control’, ‘Cryptography’, ‘Physical and environmental’, ‘Operations’, ‘Communications’, ‘System development and acquisition’, ‘Supplier relations’, ‘Information security incident management’, ‘Compliance’, ‘Security Policies’ and ‘Security organisation’. These groups and controls are represented as 408 concepts in the security ontology to protect assets from vulnerabilities and threats.

These 408 concepts extracted from the security ontology are used to query related terms, namely hypernyms and hyponyms, from DBpedia. DBpedia contains over 5 million entities and allows querying of semantic relationships, concepts and properties encoded in the form of RDF triples. Typically, the RDF triples (subject-verb-object) in an ontology contain a ‘verb’ relationship between concepts. The verbs are typically domain-specific and unavailable in general-purpose knowledge graphs like DBpedia. Hypernyms and hyponyms, which denote an ‘is-a’ relationship between concepts, are easily available in DBpedia and widely used in ontologies, making these relations an ideal choice to demonstrate our approach as a proof of concept. The SPARQL queries to extract hypernyms and hyponyms from DBpedia for each of the concepts in the security ontology are given below:

SELECT * WHERE { ?hypernyms}
SELECT * WHERE {?hypernyms }
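The two queries place the ontology concept on either side of a hypernym relation. As a rough illustration only, the sketch below generates one query of each kind for a given concept. The resource prefix `http://dbpedia.org/resource/`, the predicate `http://purl.org/linguistics/gold/hypernym`, the variable name `?hyponyms` and the helper names are assumptions made for this example and are not confirmed by the paper.

```python
# Assumed for illustration, not taken from the paper:
# the DBpedia resource prefix and the hypernym predicate.
RESOURCE_PREFIX = "http://dbpedia.org/resource/"
HYPERNYM_PREDICATE = "http://purl.org/linguistics/gold/hypernym"


def concept_uri(concept: str) -> str:
    """Map an ontology concept label to a (hypothetical) DBpedia resource URI."""
    return RESOURCE_PREFIX + concept.strip().replace(" ", "_")


def hypernym_query(concept: str) -> str:
    """Concept as subject: returns the terms the concept 'is a' kind of."""
    return (
        f"SELECT * WHERE {{ <{concept_uri(concept)}> "
        f"<{HYPERNYM_PREDICATE}> ?hypernyms }}"
    )


def hyponym_query(concept: str) -> str:
    """Concept as object: returns the terms that are a kind of the concept."""
    return (
        f"SELECT * WHERE {{ ?hyponyms "
        f"<{HYPERNYM_PREDICATE}> <{concept_uri(concept)}> }}"
    )


if __name__ == "__main__":
    print(hypernym_query("Access control"))
    print(hyponym_query("Access control"))
```

Running one such pair of queries per ontology concept would yield the raw related terms that are later curated.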

The terms returned by these queries are converted to triples of the form (a, b, label), where a denotes the ontology concept, b denotes the DBpedia term and label denotes the DBpedia relation between a and b. This leads to a dataset of 97,425 triples. These triples are then curated by three domain experts and the authors to mark unrelated terms as ‘none’. This includes pairs that are not related to the domain and pairs that are not related to each other, as neither case is needed for ontology enrichment. In addition, since DBpedia often categorizes ontological instances under ‘hyponyms’, some pairs are separately labelled as ‘instances’ if b is an instance of a and ‘concept’ if b denotes the concept of which a is an instance. The terms classified include names of experts, organizations, products and tools, attacks, vulnerabilities, malware, viruses and many others. Finally, since the number of ‘none’ pairs (89,820) was significantly higher than the number of ‘non-none’ pairs (7,605), we sorted the ‘none’ pairs in order of increasing similarity and kept only the first 5% of them, which was experimentally determined to yield better results. Table 1 shows the final composition of our dataset after the described stages of extraction, curation and filtering.

Table 1. Composition of the Dataset

Relationship   Count
Hypernymy      2,939
Hyponymy       794
Instances      2,685
Concepts       1,187
None           4,490
Total          12,096
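The balancing step can be sketched as follows. This is a minimal illustration, not the authors' code: it reads the 5% rule as keeping the least-similar 5% of ‘none’ pairs (which is consistent with the None count in Table 1), and it takes the similarity scores as a precomputed mapping because the text does not say how they are obtained. The function and parameter names are hypothetical.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]  # (ontology concept a, DBpedia term b, label)


def downsample_none(
    triples: List[Triple],
    similarity: Dict[Tuple[str, str], float],
    keep_fraction: float = 0.05,
) -> List[Triple]:
    """Keep all labelled pairs plus the least-similar share of 'none' pairs.

    `similarity` maps (a, b) to a precomputed score; the scoring method
    itself is an assumption left open in this sketch.
    """
    related = [t for t in triples if t[2] != "none"]
    unrelated = [t for t in triples if t[2] == "none"]
    # Sort 'none' pairs by increasing similarity and keep the first slice.
    unrelated.sort(key=lambda t: similarity[(t[0], t[1])])
    n_keep = int(len(unrelated) * keep_fraction)
    return related + unrelated[:n_keep]


if __name__ == "__main__":
    data = [
        ("firewall", "software", "hypernymy"),
        ("firewall", "river", "none"),
        ("firewall", "router", "none"),
    ]
    scores = {("firewall", "river"): 0.02, ("firewall", "router"): 0.61}
    print(downsample_none(data, scores, keep_fraction=0.5))
```

Keeping the least-similar ‘none’ pairs gives the classifier clear negative examples while reducing the class imbalance.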

Once the training dataset is created, a training corpus that provides linguistic information for each of the terms in the dataset is extracted. We use Wikipedia as it is moderated and structured, which suits model training. Since DBpedia is derived from Wikipedia, an unambiguous article exists for every extracted dataset term. As a first step, we extract all the Wikipedia articles corresponding to the terms in the dataset and add them to the corpus. In addition, we extract other articles related to the information security domain. This is done by computing the Doc2Vec [15] similarity of each article with the Wikipedia article on ‘Information Security’ (https://en.wikipedia.org/wiki/Information_security) and then retaining articles with a similarity score higher than a threshold (0.27 after manual validation). This threshold, in turn, is chosen to optimize classification accuracy on a small validation corpus extracted from the dump. The two-step filtering described above yielded an information security training corpus of total size 2.6 GB.
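The second filtering step can be illustrated with a short sketch. It assumes that document vectors have already been computed (in the paper, by Doc2Vec) and that similarity is measured with cosine similarity, which the text does not state explicitly. The toy two-dimensional vectors and the function names are hypothetical; only the 0.27 threshold comes from the text.

```python
import math
from typing import Dict, List, Sequence

THRESHOLD = 0.27  # threshold value reported in the text


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def filter_articles(
    article_vecs: Dict[str, Sequence[float]],
    reference_vec: Sequence[float],
    threshold: float = THRESHOLD,
) -> List[str]:
    """Return titles whose vector is closer to the reference than the threshold."""
    return [
        title
        for title, vec in article_vecs.items()
        if cosine(vec, reference_vec) > threshold
    ]


if __name__ == "__main__":
    # Toy vectors standing in for Doc2Vec document embeddings.
    reference = [1.0, 0.0]  # stands in for the 'Information Security' article
    articles = {"Firewall": [0.9, 0.1], "Pizza": [0.1, 0.9]}
    print(filter_articles(articles, reference))
```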

The training dataset and corpus are parsed to generate the dependency paths that connect each pair of terms in the training dataset. Here, ‘dependency paths’ refers to the multiset of all paths that connect a pair of terms in the training corpus. These paths are encoded as a sequence of nodes, where each node is a 4-tuple of the form (word, POS_tag, dep_tag, dir). POS_tag and dep_tag denote the part-of-speech (POS) and dependency tags of the word respectively, while dir denotes the direction of the edge connecting it to the next node in that dependency path. The term pairs, along with the extracted dependency paths between them, are passed to OntoEnricher for training.

The first layer in the proposed model is the embedding layer. The distributional embeddings for the terms (words) are obtained using a pre-trained state-of-the-art Universal Sentence Encoder (USE) [4] transformer model. This model is preferred over other vocabulary-based distributional models, such as those belonging to the Word2Vec family, as it returns distributional embeddings not just for single words but also for compound words, phrases and sentences. In addition, USE is pre-trained on Wikipedia, among other corpora, making it well suited to our task. Apart from the pre-trained word embeddings, the embeddings for the POS tags, dependency tags and direction tags are obtained from trainable embedding layers. The node embeddings, constructed by concatenating the word, POS, dependency and direction tag embeddings, are arranged in a sequence to obtain the path embeddings. A dropout layer is applied after each of the embeddings. The path embeddings for each path connecting the term pair are then input to a bidirectional, two-layer LSTM, which trains on a sequence of linguistically and semantically encoded nodes and learns the type of sequences that characterize a particular kind of relation. The bidirectional LSTM gives the network both backward and forward information about the path embeddings at every time step, while the two layers enable capturing more complex relations among dependency paths.

The output of the last hidden state of the LSTM is taken as the path representation. Since a pair of terms may have multiple paths between them, a weighted sum of these path representations is taken, using the path counts as weights, to yield a final context vector. This context vector, which encodes syntactic and linguistic information, is passed through a dropout layer and then concatenated with the distributional embeddings of both terms in order to encode semantic information as well. The concatenated vector is then passed through two Feedforward Neural Networks, with a ReLU layer in between, to yield the final class probability vector. The class with maximum probability is output as the predicted relation between the term pair.
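The aggregation of path representations into a context vector, and its concatenation with the term embeddings, can be sketched in plain Python. This is an illustrative sketch under stated assumptions: the path representations below are placeholder vectors rather than real LSTM outputs, the weighted sum is left unnormalized because the text does not say whether the counts are normalized, and all function names are hypothetical.

```python
from typing import List, Sequence


def context_vector(
    path_reprs: List[Sequence[float]], path_counts: List[int]
) -> List[float]:
    """Weighted sum of path representations, using path counts as weights."""
    if not path_reprs or len(path_reprs) != len(path_counts):
        raise ValueError("need one count per path representation")
    ctx = [0.0] * len(path_reprs[0])
    for rep, count in zip(path_reprs, path_counts):
        for i, value in enumerate(rep):
            ctx[i] += count * value
    return ctx


def classifier_input(
    ctx: Sequence[float], emb_a: Sequence[float], emb_b: Sequence[float]
) -> List[float]:
    """Concatenate the context vector with both term embeddings."""
    return list(ctx) + list(emb_a) + list(emb_b)


if __name__ == "__main__":
    # Placeholder path representations (stand-ins for LSTM last hidden states).
    paths = [[1.0, 0.0], [0.0, 2.0]]
    counts = [3, 1]  # the first path occurs three times in the corpus
    ctx = context_vector(paths, counts)
    print(classifier_input(ctx, [0.5], [0.25]))
```

Weighting by count lets frequently observed paths dominate the context vector, so a single unusual sentence has little influence on the predicted relation.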

The procedure to extract concepts and instances from (web page) text during the testing stage is detailed here. To avoid using every unstructured (web page) text to enrich an ontology, a lightweight evaluation technique [25] that checks for sufficiency of new security terms is deployed. After a text passes the sufficiency evaluation, co-reference resolution is applied as a pre-processing stage and noun chunks are then extracted from the web page. A Cartesian (nC2) product of the extracted noun chunks is then taken to construct potential term pairs. However, since passing the full Cartesian product to OntoEnricher is computationally expensive and also leads to error propagation, we apply a two-stage filtering that checks (a) if the noun chunks are ‘sufficiently’ related to Information Security and (b) if they are ‘sufficiently’ related to each other. Both conditions are checked by comparing distributional similarity, computed using USE, against experimentally determined threshold values. The sufficiently similar term pairs are then input to the pre-trained model to classify the relationship. The pairs classified as ‘None’ are discarded and the rest are converted to RDF triples for security ontology enrichment.
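A minimal sketch of the candidate-pair construction and two-stage filtering is given below. It is not the authors' implementation: the embeddings are toy vectors standing in for USE outputs, cosine similarity is assumed as the similarity measure, the thresholds are made-up values, and the function names are hypothetical. For simplicity the domain check is applied to individual chunks before pairing.

```python
import math
from itertools import combinations
from typing import Dict, List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def candidate_pairs(
    chunks: List[str],
    embed: Dict[str, Sequence[float]],
    domain_vec: Sequence[float],
    domain_threshold: float,
    pair_threshold: float,
) -> List[Tuple[str, str]]:
    """Build noun-chunk pairs that pass both similarity checks."""
    # Stage (a): keep chunks sufficiently related to the domain.
    relevant = [
        c for c in chunks if cosine(embed[c], domain_vec) >= domain_threshold
    ]
    # Stage (b): keep unordered pairs whose members are related to each other.
    return [
        (a, b)
        for a, b in combinations(relevant, 2)
        if cosine(embed[a], embed[b]) >= pair_threshold
    ]


if __name__ == "__main__":
    # Toy vectors standing in for USE embeddings; thresholds are illustrative.
    vectors = {"firewall": [1.0, 0.1], "malware": [0.9, 0.2], "pizza": [0.0, 1.0]}
    print(candidate_pairs(list(vectors), vectors, [1.0, 0.0], 0.5, 0.5))
```

Filtering before classification keeps the number of pairs passed to the model well below the full set of pairwise combinations.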

Fig. 2. Example illustrating Ontology Enrichment approach

Figure 2 illustrates the ontology enrichment approach with an example. Consider ‘Real-time adaptive security’ (R-TAS), which is a concept present in the security ontology. The corresponding article in DBpedia, ‘Real-time adaptive security’, has ‘model’ as its hypernym entry, which is returned using a SPARQL query. The security corpus extracted from the Wikipedia dump using the Doc2Vec filter contains multiple paired mentions of these terms, out of which one article contains two mentions. The corpus sentences containing these mentions are passed to the spaCy dependency parser (https://spacy.io/), which extracts all dependency paths connecting every extracted term pair. These dependency paths, which contain encoded linguistic information, are passed to a serialization layer that converts the dependency graph into a series of nodes to form the input to OntoEnricher.

The serialization layer reduces the word in every node in the dependency path to its lemmatized root word to enable meaningful training and generalization. Thus, ‘Real-time adaptive security’ is reduced to ‘security’ and ‘is’ is reduced to ‘be’. It also converts every node to a feature vector. ‘Real-time adaptive security’ is converted to a feature vector that uses ‘security’ as the word, ‘PROPN’ as the POS tag, ‘nsubj’ as the dependency tag and ‘+’ as the direction of the edge connecting it to the lowest common root node between the term pair. Similarly, the next word ‘be’, the root form of ‘is’, is a verb and does not have any direction (‘∼’). The last word of this path, ‘model’, has ‘NOUN’ as the POS tag, ‘attr’ as the dependency tag and ‘+’ as the direction of the arrow going away from ‘model’ to ‘is’.

The same approach is followed for the second dependency path and the nodes are sequenced similarly. These two paths are then passed to the embedding layer, which calculates (i) the USE embedding for the word, (ii) the POS tag embedding, (iii) the dependency tag embedding and (iv) the direction embedding. The last three embeddings are trainable, while the word embeddings are pre-trained using USE. These are concatenated together to yield a node embedding. All the paths (node sequences) that connect the term pair are passed as input to the bidirectional two-layer LSTM. In this example, both the paths connecting ‘Real-Time Adaptive Security’ and ‘model’ are input to the LSTM, after which the last hidden state is taken as the path-wise contextual output. A weighted sum of these path outputs is then calculated, using the frequency of occurrence as weights, to yield the final context vector, which carries the encoded linguistic information of the paths that connect ‘R-TAS’ and ‘model’. This context vector is concatenated with the distributional embeddings of ‘Real-Time Adaptive Security’ and ‘model’. Reducing words to their root form during the serialization stage makes it possible to construct a contextualized representation. The characterized paths, especially the most frequent ones, constitute a specific relation, while the distributional word and phrase embeddings provide semantic relevance and specificity at a conceptual level. This concatenated vector, which carries both semantic and linguistic information, is passed to two Feedforward Neural Networks with a ReLU layer in between, yielding a final class probability vector as output. The model is trained so that this class probability vector identifies the relationship between ‘model’ and ‘Real-Time Adaptive Security’ as hypernymy.

We primarily experimented with two ontologies, namely the ISO 27001 based Information Security ontology and the Stanford Pizza ontology. While the former is the focus of the paper and our use case in terms of analyzing threats and attack surfaces, we experiment with the latter as well to demonstrate the generalizability of our approach. While the security corpus is 2.8 GB in size, the pizza corpus is significantly smaller at only 95 MB.
This can be attributed to the fact that the pizza ontology represents a very narrow domain and thus has few relevant Wikipedia articles, while the security ontology contains much broader, systems-level concepts and information about assets, controls, etc. that return a variety of related articles. We implemented OntoEnricher using the deep learning library PyTorch, with ‘0’ as the random seed for consistency in results. We also used various other Python libraries: Pronto (https://pypi.org/project/pronto/) to extract ontology terms, Wikiextractor (https://github.com/attardi/wikiextractor) to extract articles from the Wikipedia dump, spaCy for dependency graph extraction, and TensorFlow Hub to load the Universal Sentence Encoder. We evaluated the performance of OntoEnricher on three diverse test datasets:

1. DBpedia test dataset: This is created by randomly extracting 10% of the training dataset extracted from DBpedia. It mostly consists of small to medium length words.
2. ‘Knocked-out’ test dataset: This is created by knocking out concepts and relations from the ontology. This evaluates the ability of OntoEnricher to identify multi-word or phrase-level concepts, as is common in the Security Ontology, and highly domain-specific, non-English terms, as in the Pizza Ontology.
3. Instance dataset: This is created by extracting text from security-domain related webpages. The Top 10 vulnerability related web pages from OWASP and product pages on ‘firewall’ are extracted to test the model. The ability to identify concepts and instances from web pages confirms that the approach can use text from public forums and other unstructured data sources to provide threat intelligence feeds. This evaluation was done without factoring in the sufficiency requirement [25] for new terms in the text, in order to evaluate the identification of ontology terms by OntoEnricher.

Grid search was used to experiment with and arrive at optimal values of various hyperparameters [30]. These include hidden dimensions (120, 180, […]

[…] OntoEnricher on security and pizza ontologies are shown in Table 2. We achieved competent and comparable scores on security ontology enrichment on all three datasets. The results on the 10% DBpedia test dataset were the best, while the results on knocked-out concepts and on information security related web pages are not far behind. This confirms that performance does not dip in the extraction of phrases, multi-word concepts and instances, which is a key component missing from previous ontology enrichment approaches [10, 14, 23, 31]. As the input and output formats of existing approaches are different, we only performed a qualitative comparison. Additionally, in OntoEnricher, the number of terms and the size of the corpus used for training and testing are much larger. The difference between the Precision and Recall values is small, which indicates that the predicted terms are not skewed towards the domain and establishes the robustness of the proposed ontology enrichment approach. Interestingly, the pizza enrichment results are better than the security enrichment results, presumably because the domain is narrow, as mentioned earlier, and its concepts are more easily identifiable as a consequence.

Table 2. Ontology Enrichment Results

Metrics     Information Security             Pizza
            DBPedia   Knocked Out   Web      DBPedia   Knocked Out   Web
Terms
Accuracy
Precision
Recall
F1-Score
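For reference, the Precision, Recall and F1-Score rows can be computed from gold and predicted relation labels as in the sketch below. The averaging scheme is an assumption: the sketch macro-averages over classes and takes F1 as the harmonic mean of the averaged precision and recall, whereas the paper does not state which variant it reports. The labels in the example are made up.

```python
from typing import List, Tuple


def macro_prf(y_true: List[str], y_pred: List[str]) -> Tuple[float, float, float]:
    """Macro-averaged precision and recall, and the F1 of those two averages."""
    labels = sorted(set(y_true) | set(y_pred))
    precisions, recalls = [], []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precisions.append(tp / (tp + fp) if tp + fp else 0.0)
        recalls.append(tp / (tp + fn) if tp + fn else 0.0)
    precision = sum(precisions) / len(labels)
    recall = sum(recalls) / len(labels)
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    gold = ["hypernym", "hypernym", "none", "none"]
    predicted = ["hypernym", "none", "none", "none"]
    print(macro_prf(gold, predicted))
```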

Most of the existing ontology evaluation metrics [24] are extensions of the Precision and Recall metrics from information retrieval. Hence, we measure the precision score for k documents (shown in Table 3) to measure the consistency in the enrichment from webpages. The scores indicate that the proposed approach continues to identify concepts as the number of domain documents grows. Figure 3 shows the relationship accuracy for each of the classes. All relationships are classified with similar accuracy, with hypernymy classification being relatively higher.

Table 3. Precision scores for 20 Random Web Pages in Information Security

Web pages P@5 P@10 P@15 P@20

Score 0.89 0.80 0.82 0.84
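The text does not spell out how P@k is computed. One possible reading, sketched below, pools the extractions from the first k web pages and reports the share that are correct. The function name and the per-page counts in the example are made up for illustration and are unrelated to the scores in Table 3.

```python
from typing import List, Tuple


def precision_over_pages(pages: List[Tuple[int, int]], k: int) -> float:
    """Pooled precision over the first k pages.

    Each entry of `pages` is (correct, extracted): the number of correct
    extractions and the total number of extractions for one web page.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    subset = pages[:k]
    correct = sum(c for c, _ in subset)
    extracted = sum(e for _, e in subset)
    return correct / extracted if extracted else 0.0


if __name__ == "__main__":
    # Made-up counts for three pages.
    counts = [(9, 10), (7, 10), (8, 10)]
    for k in (1, 2, 3):
        print(k, precision_over_pages(counts, k))
```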

Fig. 3. Accuracy on Class Identification

The security ontology enrichment approach is comprehensive: it handles new terms and changing domain content, including concepts, relations and instances. The use of a well-accepted ISO 27001 based security ontology, exhaustive data sources such as DBpedia and Wikipedia, Universal Sentence Encoder for distributional embeddings and Bidirectional LSTM for sequential learning makes the approach robust and extensible to other domains. In the implemented enrichment approach, concepts in the seed ontology can be single or multiple words, which is an improvement over the state of the art. The approach also incorporates instances from unstructured text (web pages), so that organizations or individuals have the flexibility to reason over security ontologies for mitigation strategies, vulnerability assessment, attack graph detection and many other use cases. The enriched security ontology may also be used by search engines to display relevant results and top trends in vulnerabilities, threats, attacks and controls.

We trained the model with 408 Information Security ontology terms, 97,425 DBpedia terms and 2.8 GB of Wikipedia articles. The model was tested with 20 random information security related web pages extracted from the internet, with an accuracy of 80% and an F1-score of 78%. The approach was also trained and tested on the ‘Pizza’ domain for generality; the accuracy and F1-score of the model in enriching the Pizza ontology are 88% and 85%, respectively. While we achieved state-of-the-art results implementing a robust approach, we plan to carry out the following activities as future work:

– Optimize the effort required to create the DBpedia dataset, such as filtering irrelevant terms.
– Test the approach with other security ontologies and extend the training corpus beyond Wikipedia.
– Compare our results with other knowledge graph and ontology enrichment approaches after curation of the input and output formats of the dataset and corpus.
– We understand the need for evaluation of an enriched ontology by domain experts. However, manual intervention for evaluation is effort-intensive and brings in various other dependencies. As part of future work, we propose to implement a syntactic and semantic evaluation with easily configurable rules and AI models to reduce the manual effort.
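The accuracy and F1-score figures reported above can be read against the standard definitions of the two metrics. The following is a minimal sketch of how such scores are typically computed from gold and predicted labels. The function name `accuracy_and_f1`, the label names and the toy data are assumptions made for illustration; they are not taken from the paper or its repository, and the text does not state which averaging scheme was used for F1, so the sketch computes F1 for a single label treated as the positive class.

```python
def accuracy_and_f1(gold, predicted, positive_label):
    """Return (accuracy, F1) for parallel lists of gold and predicted labels.

    Accuracy is the fraction of exact label matches over all items.
    F1 is computed for `positive_label` only (one-vs-rest); this choice
    is an assumption, since the averaging scheme is not specified.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")

    tp = fp = fn = correct = 0
    for g, p in zip(gold, predicted):
        if g == p:
            correct += 1
        if p == positive_label and g == positive_label:
            tp += 1  # predicted positive, truly positive
        elif p == positive_label:
            fp += 1  # predicted positive, truly something else
        elif g == positive_label:
            fn += 1  # missed a true positive

    accuracy = correct / len(gold) if gold else 0.0
    denominator = 2 * tp + fp + fn
    f1 = 2 * tp / denominator if denominator else 0.0
    return accuracy, f1


# Toy labels invented for illustration only (not the paper's data).
gold = ["concept", "concept", "none", "instance", "concept", "none"]
predicted = ["concept", "none", "none", "instance", "concept", "concept"]

acc, f1 = accuracy_and_f1(gold, predicted, positive_label="concept")
print(f"accuracy={acc:.2f}, F1={f1:.2f}")
```

Reporting both numbers is useful because accuracy alone can look high when most candidate terms belong to the majority class, whereas F1 penalizes missed and spurious concepts for the class of interest.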

References

1. Al-Aswadi, F.N., Chan, H.Y., Gan, K.H.: Automatic Ontology Construction from Text: a Review from Shallow to Deep Learning Trend. Artificial Intelligence Review pp. 1–28 (2019)
2. AlienVault: Open Threat Intelligence (Jan 2021), https://otx.alienvault.com/
3. Buitelaar, P., Cimiano, P., Magnini, B.: Ontology Learning from Text: Methods, Evaluation and Applications, vol. 123. IOS Press (2005)
4. Cer, D., Yang, Y., Kong, S.y., Hua, N., Limtiaco, N., John, R.S., Constant, N., Guajardo-Cespedes, M., Yuan, S., Tar, C., et al.: Universal Sentence Encoder. arXiv preprint arXiv:1803.11175 (2018)
5. Dettmers, T., Minervini, P., Stenetorp, P., Riedel, S.: Convolutional 2D Knowledge Graph Embeddings. In: 32nd AAAI Conference on Artificial Intelligence (2018)
6. Ezen-Can, A.: A Comparison of LSTM and BERT for Small Corpus. arXiv preprint arXiv:2009.05451 (2020)
7. Fader, A., Soderland, S., Etzioni, O.: Identifying Relations for Open Information Extraction. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing. pp. 1535–1545. ACL (2011)
8. Fenz, S., Ekelhart, A.: Formalizing Information Security Knowledge. In: Proceedings of the 4th International Symposium on Information, Computer, and Communications Security. ACM (2009)
9. Ganesan, B., Dasgupta, R., Parekh, A., Patel, H., Reinwald, B.: A Neural Architecture for Person Ontology Population. arXiv preprint arXiv:2001.08013 (2020)
10. Gasmi, H., Laval, J., Bouras, A.: Cold-start Cybersecurity Ontology Population using Information Extraction with LSTM. In: International Conference on Cyber Security for Emerging Technologies. pp. 1–6. IEEE (2019)
11. Iannacone, M., Bohn, S., Nakamura, G., Gerth, J., Huffer, K., Bridges, R., Ferragut, E., Goodall, J.: Developing an Ontology for Cyber Security Knowledge Graphs. In: Proceedings of the 10th Annual Cyber and Information Security Research Conference. pp. 1–4 (2015)
12. ISO/IEC 27001: Information Security Management (Jan 2021),
13. Iyer, V., Mohan, L., Reddy, Y.R., Bhatia, M.: A Survey on Ontology Enrichment from Text. Proceedings of the 16th International Conference on Natural Language Processing (2019)
14. Jia, Y., Qi, Y., Shang, H., Jiang, R., Li, A.: A Practical Approach to Constructing a Knowledge Graph for Cybersecurity. Engineering (1), 53–60 (2018)
15. Lau, J.H., Baldwin, T.: An Empirical Evaluation of Doc2Vec with Practical Insights into Document Embedding Generation. arXiv preprint arXiv:1607.05368 (2016)
16. Li, D., Huang, L., Ji, H., Han, J.: Biomedical Event Extraction based on Knowledge-driven Tree-LSTM. In: NAACL-HLT 2019: Annual Conference of the North American Chapter of the Association for Computational Linguistics. pp. 1421–1430 (2019)
17. Liu, K., Hogan, W.R., Crowley, R.S.: Natural Language Processing Methods and Systems for Biomedical Ontology Learning. Journal of Biomedical Informatics (1), 163–179 (2011)
18. Liu, Q., Kusner, M.J., Blunsom, P.: A Survey on Contextual Embeddings. arXiv preprint arXiv:2003.07278 (2020)
19. Mitchell, T., Cohen, W., Hruschka, E., Talukdar, P., Yang, B., Betteridge, J., Carlson, A., Dalvi, B., Gardner, M., Kisiel, B., et al.: Never-Ending Learning. Communications of the ACM (5), 103–115 (2018)
20. Nie, B., Sun, S.: Knowledge Graph Embedding via Reasoning over Entities, Relations, and Text. Future Generation Computer Systems , 426–433 (2019)
21. Obrst, L., Chase, P., Markeloff, R.: Developing an Ontology of the Cyber Security Domain. In: STIDS. pp. 49–56 (2012)
22. Petasis, G., Karkaletsis, V., Paliouras, G., Krithara, A., Zavitsanos, E.: Ontology Population and Enrichment: State of the Art. In: Knowledge-driven Multimedia Information Extraction and Ontology Evolution
23. Pingle, A., Piplai, A., Mittal, S., Joshi, A., Holt, J., Zak, R.: RelExt: Relation Extraction using Deep Learning Approaches for Cybersecurity Knowledge Graph Improvement. In: Proceedings of the 2019 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining. pp. 879–886 (2019)
24. Sabou, M., Wroe, C., Goble, C., Mishne, G.: Learning Domain Ontologies for Web Service Descriptions: An Experiment in Bioinformatics. In: Proceedings of the 14th International Conference on World Wide Web. pp. 190–198 (2005)
25. Sanagavarapu, L., Gollapudi, S., Chimalakonda, S., Reddy, Y., Choppella, V.: A Lightweight Approach for Evaluating Sufficiency of Ontologies. In: SEKE (2017)
26. Sanagavarapu, L.M., Mathur, N., Agrawal, S., Reddy, Y.R.: SIREN-Security Information Retrieval and Extraction eNgine. In: European Conference on Information Retrieval. pp. 811–814. Springer (2018)
27. Sayan, C., Hariri, S., Ball, G.L.: Semantic Knowledge Architecture for Cyber Security. In: Proceedings of the International Conference on Security and Management (SAM). pp. 69–76. The Steering Committee of The World Congress in Computer Science, Computer Engineering, and Applied Computing (2019)
28. Schmitz, M., Bart, R., Soderland, S., Etzioni, O., et al.: Open Language Learning for Information Extraction. In: Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning. pp. 523–534. Association for Computational Linguistics (2012)
29. Schuster, M., Paliwal, K.K.: Bidirectional Recurrent Neural Networks. IEEE Transactions on Signal Processing (11), 2673–2681 (1997)
30. SIREN: Ontology Enrichment (Jan 2021), https://github.com/SIREN-DST/Ontology-Enrichment
31. Syed, Z., Padia, A., Finin, T., Mathews, L., Joshi, A.: UCO: A Unified Cybersecurity Ontology. In: Workshops at the 30th AAAI Conference on Artificial Intelligence (2016)
32. Vedula, N., Maneriker, P., Parthasarathy, S.: BOLT-K: Bootstrapping Ontology Learning via Transfer of Knowledge. In: The World Wide Web Conference. pp. 1897–1908 (2019)
33. Wang, Q., Mao, Z., Wang, B., Guo, L.: Knowledge Graph Embedding: A Survey of Approaches and Applications. IEEE Transactions on Knowledge and Data Engineering 29