A Suite of Metrics for Calculating the Most Significant Security Relevant Software Flaw Types
Peter Mell
National Institute of Standards and Technology
Gaithersburg, MD
[email protected]
Assane Gueye
University Alioune Diop, Bambey, Senegal
Prometheus Computing, LLC
[email protected]
Abstract — The Common Weakness Enumeration (CWE) is a prominent list of software weakness types. This list is used by vulnerability databases to describe the underlying security flaws within analyzed vulnerabilities. This linkage opens the possibility of using the analysis of software vulnerabilities to identify the most significant weaknesses that enable those vulnerabilities. We accomplish this through creating mashup views combining CWE weakness taxonomies with vulnerability analysis data. The resulting graphs have CWEs as nodes, edges derived from multiple CWE taxonomies, and nodes adorned with vulnerability analysis information (propagated from children to parents). Using these graphs, we develop a suite of metrics to identify the most significant weakness types (using the perspectives of frequency, impact, exploitability, and overall severity).
Index Terms — Metrics, Software flaws, Vulnerabilities
I. INTRODUCTION
The Common Weakness Enumeration (CWE) [1] [2] is a prominent list of software weakness types. It is maintained by the MITRE corporation, funded by the United States (U.S.) government, and developed with the participation of 55 organizations. The CWE list contains 808 weaknesses organized by multiple views. Views are 'hierarchical representations' of CWEs (i.e., taxonomies) serving different communities with different perspectives on the data.

A specific view, 1003, was created to support the labelling of publicly disclosed software vulnerabilities ('potential weaknesses within sources that handle public, third-party vulnerability information'). It contains 123 software flaws. The National Vulnerability Database (NVD) [3] and other vulnerability databases and security tools use view 1003 to describe the underlying security flaws within analyzed vulnerabilities. 98.8 % of the 12 760 fully analyzed vulnerabilities published by NVD in 2019 were able to be mapped to view 1003, demonstrating its applicability and coverage.

This linkage of vulnerability analysis to the view 1003 CWEs opens the possibility of using the NVD analysis of software vulnerabilities to identify the most significant weaknesses that enable those vulnerabilities. In this work, we accomplish this through creating mashup views combining the following resources:

• the multiple primary taxonomies of the CWE (views 1003, 1000, 699, and 1008),
• the Common Vulnerabilities and Exposures (CVE) [4] enumeration of publicly disclosed software vulnerabilities,
• the NVD mapping of CVEs to view 1003 CWEs, and
• the NVD measurements of each CVE using the Common Vulnerability Scoring System (CVSS) [5] [6]. This calculates the exploitability, impact, and overall severity of each CVE outside of any particular deployment context.

The result of creating mashups of this data is a set of graphs that have CWEs as nodes. The edges between the nodes are extracted from the parent-child relationships within the multiple CWE taxonomies, and the nodes are labelled with CVE and CVSS information (propagated backwards along the edges). We apply to these graphs a suite of simple metrics that we developed to identify the most 'significant' weakness types. We evaluate significance from multiple perspectives using metrics focused on the following areas: frequency, impact, and exploitability. In doing this, we evaluate the CWEs in two distinct groups to take into account the varying levels of abstraction of the CWEs.

We create most significant weakness lists for each metric for the CVE vulnerabilities published in 2019 (not provided due to space limitations). We then analyze the differences between these six lists (3 metrics * 2 sets of CWE types) using two algorithms for comparing differences between ordered lists (Kendall's Tau and the Spearman's footrule variant [7]). We find that different weaknesses tend to emerge as the most significant depending upon the perspective, the metric used, and the CWE type. Note that we use simple low level metrics for our perspectives. This is because there is no ground truth for aggregating those metrics; equations in security that aggregate simple metrics are often practically useful but less scientifically defensible.

Finally, we note that the CWE already has an official metric to identify the 'most dangerous' CWEs. It aggregates both frequency and severity, with severity itself being an aggregate metric combining exploitability and impact.
We identify weaknesses in this official metric that lead to the undercounting of certain CWEs.

We recommend that software developers and creators of software bug finding tools use our approach to prioritize finding and eliminating these most significant weaknesses to reduce the number and severity of security related flaws in software.

II. BACKGROUND
As mashup research, our approach combines multiple resources. These are briefly described and referenced here.
A. Common Weakness Enumeration
Our research is primarily focused on the Common Weakness Enumeration (CWE) [8], a 'community-developed list of common software security weaknesses'. 'It serves as a common language, a measuring stick for software security tools, and as a baseline for weakness identification, mitigation, and prevention efforts' [9]. The 808 software weaknesses within the enumeration are referred to as CWEs, where each is named CWE-X with X being some integer. Each CWE is characterized as either a class, base, variant, or compound. Classes are the highest level of abstraction, followed by bases, and then by variants. Compounds are relatively rare and are combinations of multiple bases and/or variants. In our work we evaluate classes separately from bases, variants, and compounds, given that the classes have a much higher level of abstraction.

Besides the CWE weaknesses, there are also 295 categories and 38 views. Confusingly, these are also considered CWEs; for simplification we use the name CWE to refer only to the weakness CWEs. The categories are used to organize the CWEs within select views (this is not used in our research). The views are hierarchical organizations of a subset of CWEs according to some perspective (essentially a taxonomy). The three primary taxonomies are the 'Research Concepts' (view 1000), 'Development Concepts' (view 699), and 'Architectural Concepts' (view 1008). This last view, 1008, was not useful to our work and is not used because it doesn't provide a hierarchy of CWEs but instead uses the categories to group CWEs. The view 1003 designed for vulnerability databases, mentioned previously, is called 'CWE Weaknesses for Simplified Mapping of Published Vulnerabilities View' and is the core data structure upon which our work builds.
B. Common Vulnerabilities and Exposures
The set of software vulnerabilities used for this research comes from the Common Vulnerabilities and Exposures (CVE) program, maintained by the MITRE corporation. 'CVE is a list of entries, each containing an identification number, a description, and at least one public reference, for publicly known cybersecurity vulnerabilities' [4] [10].
C. Common Vulnerability Scoring System
The Common Vulnerability Scoring System 'provides a way to capture the principal characteristics of a vulnerability and produce a numerical score reflecting its severity' [11]. It is maintained by the Forum of Incident Response and Security Teams (FIRST). CVSS provides equations for calculating a vulnerability's base score (inherent risk outside of any particular environment), temporal score (changing risk over time), and environmental score (risk within a particular environment). We use the base score, which is composed of two sub-scores that calculate the exploitability and impact of a vulnerability. The detailed specification for CVSS version 3.1 is available at [5].
D. National Vulnerability Database
The National Vulnerability Database (NVD) is 'the U.S. government repository of standards based vulnerability management data' [3]. It is maintained by the U.S. National Institute of Standards and Technology. We use its scoring of CVEs with CVSS scores and its mapping of the CVEs to view 1003 CWEs.

III. FOUNDATIONAL DATA STRUCTURES
This section describes how we generate the foundational data structure used by our metrics to calculate the most significant security relevant software flaw types. We generate a directed acyclic graph (DAG) of CWEs that we will use to propagate CWE analysis data between the CWEs.
A. View 1003 Graph
We begin with the set of CWEs in CWE view 1003 since that is the set that was adopted by the NVD (and is the set identified by MITRE as most applicable to CVE vulnerabilities). We then form a graph of the view 1003 nodes through extracting the 'ChildOf' relationships in the CWE view 1003 Extensible Markup Language (XML) file. Other kinds of relationships are provided in the XML file, but we don't use them because none of them definitively indicate the parent-child relationship needed to construct edges in our graph (for example, 'CanPrecede'). The result is a rooted tree with the root being CWE 1003, the nodes at distance one from the root being classes, and the nodes at distance two being bases, variants, and compounds. (A perfect tree structure is uncommon in weakness/vulnerability taxonomies; this encouraged us to explore possible missing relationships.) We remove the root as we are only interested in the classes, bases, variants, and compounds. The resulting DAG has 123 nodes and 87 edges, shown in Figure 1. On the left side are the 36 class nodes in blue. The majority of class nodes have edges to bases, variants, or compounds, but five do not. On the right side, the largest grouping of nodes in a single column, in purple, represents the 82 bases. Moved slightly to the right, in green, are the 3 variants. Moved even farther to the right, in orange, are the 2 compounds.

Fig. 1. CWE View 1003 (123 nodes, 87 edges)
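To make the construction concrete, the following Python sketch parses the 'ChildOf' relations and builds the graph with networkx. This is a minimal illustration, not the exact tooling used in this work; the file name cwec_v4.0.xml, the schema namespace, and the element/attribute names are assumptions that should be verified against the actual CWE XML download.

import xml.etree.ElementTree as ET
import networkx as nx

# Assumed CWE schema namespace; check the namespace declared in the file.
CWE_NS = '{http://cwe.mitre.org/cwe-6}'

def load_childof_edges(xml_path, view_id):
    """Return the weakness ids and the (child, parent) pairs taken from
    'ChildOf' relations that are scoped to the given view."""
    root = ET.parse(xml_path).getroot()
    nodes, edges = set(), set()
    for weakness in root.iter(CWE_NS + 'Weakness'):
        child = int(weakness.get('ID'))
        nodes.add(child)
        for rel in weakness.iter(CWE_NS + 'Related_Weakness'):
            if rel.get('Nature') == 'ChildOf' and rel.get('View_ID') == view_id:
                edges.add((child, int(rel.get('CWE_ID'))))
    return nodes, edges

# Build the view 1003 graph; every edge points from child to parent.
nodes, edges = load_childof_edges('cwec_v4.0.xml', '1003')  # hypothetical file name
dag = nx.DiGraph()
dag.add_nodes_from(nodes)
dag.add_edges_from(edges)
if dag.has_node(1003):
    dag.remove_node(1003)  # drop the view root, per Section III-A
assert nx.is_directed_acyclic_graph(dag)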
B. Direct Edge Augmentation

We next augment our view 1003 DAG with edges extracted from the 'ChildOf' relationships specified within other CWE view XML files. For this we use both the CWE research and development concepts views (essentially alternate taxonomies). We can do this because for our metrics we aren't focused on a particular type of child-parent relationship; we just want to know that a child-parent relationship definitively exists between some pair of CWEs in the view 1003 set. This analysis adds 19 edges, shown in green in Figure 2. Note that we move three of the class nodes slightly left of the main column of class nodes to enhance visibility because they now have edges to other classes.

Fig. 2. CWE View 1003 Nodes with Direct Edges from Non-1003 Views (123 nodes, 19 edges)
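Continuing the sketch above, the direct augmentation just reuses the same loader on the research (1000) and development (699) concepts views, keeping only edges whose endpoints both lie in the view 1003 node set:

# Direct edge augmentation: 'ChildOf' edges from views 1000 and 699
# whose endpoints are both view 1003 CWEs.
v1003_nodes = set(dag.nodes)
for view in ('1000', '699'):
    _, extra = load_childof_edges('cwec_v4.0.xml', view)
    dag.add_edges_from((c, p) for c, p in extra
                       if c in v1003_nodes and p in v1003_nodes)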
C. Indirect Edge Augmentation
Lastly, we create a new DAG (to be used temporarily for this section's analysis) by unifying the set of nodes in views 1003, 1000, and 699 and then adding edges based on the 'ChildOf' relationships specified in the three XML view files. This produces a DAG with 834 CWEs and 1046 edges. Then, for each pair of nodes within view 1003, we determine if a
path exists connecting them that uses at least one node not in view 1003. Each such discovered path can be used to add an edge to our foundational data structure DAG. These 29 'indirect' edges (which really represent paths using nodes not shown) can be seen in blue in Figure 3.

Fig. 3. CWE View 1003 Nodes with Edges Representing Paths from Non-1003 Views (123 nodes, 29 edges)
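A sketch of this search, continuing the code above: build the union DAG, then for every node w outside view 1003, connect each view 1003 node that can reach w to each view 1003 node reachable from w. The ancestors/descendants calls follow the child-to-parent edge direction used throughout.

# Union DAG over views 1003, 1000, and 699 (834 CWEs, 1046 edges).
union = nx.DiGraph()
for view in ('1003', '1000', '699'):
    ns, es = load_childof_edges('cwec_v4.0.xml', view)
    union.add_nodes_from(ns)
    union.add_edges_from(es)

# Indirect edges: a path u -> ... -> w -> ... -> v with w outside view 1003.
for w in set(union.nodes) - v1003_nodes:
    below = nx.ancestors(union, w) & v1003_nodes    # 1003 nodes that reach w
    above = nx.descendants(union, w) & v1003_nodes  # 1003 nodes w reaches
    for u in below:
        for v in above:
            dag.add_edge(u, v)  # u is a transitive child of v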
D. Composite Directed Acyclic Graph
We now put together our DAG representing the 1003 view with the direct edge augmentation from Section III-B and the indirect edge augmentation from Section III-C. The resulting graph is shown in Figure 4. It has 123 nodes and 135 edges.

Fig. 4. Composite Graph of View 1003 with Direct Edges and Edges Representing Paths from Non-1003 Views (123 nodes, 135 edges)
E. Node Adornment
The next step is to adorn the DAG with vulnerability analysis data from the NVD. We take each CVE in NVD that has one or more CWE mappings, and we label each relevant CWE node in the DAG with a vector containing the CVE name, the publish date, and the CVSS attribute information. Figure 5 shows this adornment for the CVEs published in 2019. Note that the size of each node now represents the number of vulnerability vectors mapped to that node.

Fig. 5. View 1003 Nodes Adorned with NVD Data (no propagation)
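A sketch of the adornment step, assuming the legacy NVD JSON 1.1 yearly feed (hypothetical file name nvdcve-1.1-2019.json); the field paths reflect our reading of that feed's schema and should be double-checked:

import json

# One list of vulnerability vectors per CWE node.
vectors = {n: [] for n in dag.nodes}

with open('nvdcve-1.1-2019.json') as f:  # hypothetical file name
    items = json.load(f)['CVE_Items']

for item in items:
    m = item.get('impact', {}).get('baseMetricV3')
    if m is None:
        continue  # skip CVEs without CVSS v3 analysis
    vec = {'cve': item['cve']['CVE_data_meta']['ID'],
           'published': item['publishedDate'],
           'exploitability': m['exploitabilityScore'],
           'impact': m['impactScore'],
           'base': m['cvssV3']['baseScore']}
    for ptype in item['cve']['problemtype']['problemtype_data']:
        for desc in ptype['description']:
            value = desc['value']  # e.g., 'CWE-79'; may be 'NVD-CWE-noinfo'
            if value.startswith('CWE-') and int(value[4:]) in vectors:
                vectors[int(value[4:])].append(vec)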
F. Data Propagation
The edges within the DAG represent opportunities for propagating vector data between CWEs. Parent CWEs receive the vectors of their children (with any duplicates being removed). This is because if a vector applies to a CWE then it by definition applies to its more general parent. Also, we discovered that NVD analysts only label a CVE with its most specific CWE. They do not label a CVE with a class if they can determine the applicable base, variant, or compound within that class. Figure 6 shows the DAG adorned with the 2019 vulnerability vectors propagated from children to parents. Note, by comparing Figures 5 and 6, how without the propagation some classes get undercounted (especially those whose children include many popular bases).

Fig. 6. View 1003 Nodes Adorned with NVD Data (with propagation)
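Because every edge points from child to parent, processing nodes in topological order guarantees each child's vector list is complete before its parents consume it. A minimal sketch, continuing the code above and de-duplicating by CVE id:

# Propagate vectors from children to parents, removing duplicates.
for node in nx.topological_sort(dag):
    merged = {v['cve']: v for v in vectors[node]}
    for child in dag.predecessors(node):  # predecessors are the children
        merged.update((v['cve'], v) for v in vectors[child])
    vectors[node] = list(merged.values())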
IV. METRICS FOR CALCULATING SIGNIFICANCE
The DAG in Figure 6 is what we use as the foundational data structure to calculate three simple metrics: normalized frequency, mean exploitability, and mean impact. These three metrics are defined below.

We start with a metric to count the number of CVEs mapped to each CWE. Let I designate the set of all CWEs and let J be the set of all CVEs. For CWE i ∈ I, let N_i be the number of CVEs mapped to i. We can write it as:

N_i = \sum_{j \in J} e_{ij},    (1)

where

e_{ij} = \begin{cases} 1, & \text{if CVE } j \text{ is mapped to CWE } i, \\ 0, & \text{otherwise.} \end{cases}    (2)

A. Metric for normalized frequency (F_i)

F_i = \frac{N_i - \min_{i' \in I}(N_{i'})}{\max_{i' \in I}(N_{i'}) - \min_{i' \in I}(N_{i'})}.    (3)

B. Metric for mean exploitability (Q_i)

Let q_j be the CVSS exploitability score for CVE j. We can write the average of the q_j over all CVEs mapped to CWE i as:

Q_i = \frac{\sum_{j \in J} q_j e_{ij}}{N_i}.    (4)

C. Metric for mean impact (R_i)

Let r_j be the CVSS impact score for CVE j. We can write the average of the r_j over all CVEs mapped to CWE i as:

R_i = \frac{\sum_{j \in J} r_j e_{ij}}{N_i}.    (5)
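Continuing the earlier sketches, the three metrics follow directly from equations (1)-(5) once the propagated vectors are in hand (scoring only CWEs with at least one mapped CVE is our assumption):

# N_i: number of (propagated) CVEs mapped to each CWE.
counts = {i: len(vecs) for i, vecs in vectors.items() if vecs}
n_min, n_max = min(counts.values()), max(counts.values())

# F_i, Q_i, R_i per equations (3), (4), and (5).
freq = {i: (n - n_min) / (n_max - n_min) for i, n in counts.items()}
expl = {i: sum(v['exploitability'] for v in vectors[i]) / n
        for i, n in counts.items()}
impact = {i: sum(v['impact'] for v in vectors[i]) / n
          for i, n in counts.items()}

# Example use (hypothetical 'classes' set of class-level CWE ids, which
# would come from the 'Abstraction' attribute in the CWE XML):
# top10 = sorted((i for i in freq if i in classes),
#                key=freq.get, reverse=True)[:10]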
V. WEAKNESSES IN THE OFFICIAL CWE EQUATION
In September 2019, the official CWE website provided a metric for measuring the 'CWE Top 25 Most Dangerous Software Errors' [12]. It is an aggregate metric combining the normalized frequency of CVEs mapped to CWEs with the CVSS severity calculated for each mapped CVE. This metric, like ours, combines CWE and NVD data, evaluating only the CWEs within the 1003 view. Differing from ours, it uses the raw NVD mappings (it doesn't perform any data propagation) and evaluates all CWE types together (i.e., classes, bases, variants, and compounds). The metric is described in [12]; we summarize it below (leveraging two of our equations from Section IV).
A. Official CWE Metric
We first need to define the mean CVSS score for some CWE i as S_i. Let s_j be the CVSS base score for CVE j. We can write the average of the s_j over all CVEs mapped to CWE i as:

S_i = \frac{\sum_{j \in J} s_j e_{ij}}{N_i}.    (6)

Now we define the official 'most dangerous' CWE score as D_i for some CWE i. Let F_i refer to equation (3) and let c_j be the CVSS score for the j-th CVE:

D_i = F_i \cdot \frac{S_i - \min_{j \in J}(c_j)}{\max_{j \in J}(c_j) - \min_{j \in J}(c_j)} \cdot 100.    (7)
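The official score can be reproduced on either the raw or the propagated data; a sketch under the same assumptions as the earlier code:

# S_i: mean CVSS base score per CWE; c_min/c_max: extreme base scores
# observed across all mapped CVEs.
sev = {i: sum(v['base'] for v in vectors[i]) / n for i, n in counts.items()}
all_scores = [v['base'] for vecs in vectors.values() for v in vecs]
c_min, c_max = min(all_scores), max(all_scores)

# D_i per equation (7): normalized frequency times normalized mean severity.
danger = {i: freq[i] * (sev[i] - c_min) / (c_max - c_min) * 100
          for i in counts}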
B. Weakness 1: Undercounting Parent CWEs

Almost all CWE classes in view 1003 have children, as do a few bases. All CWEs that are parents then get undercounted because CVEs that apply to them are often assigned to their children. The official CWE metric does not propagate CVE assignments from children to parents. Also, the NVD analysts assign only the most specific CWE to a CVE; they do not include the parents of marked CWEs. This artificially decreases the importance of CWEs that have children when using the official metric. Using the official CWE metric on the 2019 data, each CWE gets assigned a mean of 87.99 CVEs. Using the propagation proposed in this paper to avoid undercounting parents, each CWE gets assigned a mean of 294.71 CVEs.
C. Weakness 2: Class Bias
The inclusion of all CWE types in the official metric (i.e., classes, bases, variants, and compounds) causes some classes to be unfairly promoted to being within the top lists. This is because classes are at a much higher level of abstraction and thus more CVEs will apply to them. Using the official metric, we find that there are 8 classes on the 2019 top 25 list (32 %) while there are 36 classes out of 123 CWEs in the 1003 view (29 %). While a bias is not particularly apparent here, the bias is muted because the classes are undercounted due to weakness 1 (above). To correctly isolate and measure weakness 2, we remove weakness 1 by using the official metric while performing data propagation from children to parents. Our regenerated 2019 official top 25 list then contains 16 classes (64 %); here the classes are vastly over represented since only 29 % of the CWEs in view 1003 are classes.

VI. ANALYSIS
Our approach of propagating CVE data over the CWE taxonomies fills in data missing from the official CWE metric approach. This addresses weakness 1, described in Section V-B. Our approach of creating separate top lists for the two levels of CWE abstraction addresses weakness 2, described in Section V-C. Thus, we argue that our approach improves over the original. The question, though, is whether these improvements make any difference in the results.

Using our DAG and the three metrics, we calculated the most significant CWEs for 2019 at the two levels of abstraction. We now evaluate these results to verify that propagating analysis data over our DAG substantially changes the generated most significant software flaw lists. We also verify that our multiple metrics produce substantially different lists. To compare different rankings, we measure their distance using two related metrics: the Kendall's Tau and the Spearman's Footrule [7]. For two rankings l_1 and l_2, the Kendall's Tau K(l_1, l_2) counts the number of pairs of elements whose relative order in l_1 is reversed in l_2. The Spearman's Footrule F(l_1, l_2) sums, over all elements, the distance between an element's position in l_1 and its position in l_2. It has been proven that \forall l_1, l_2: K(l_1, l_2) \le F(l_1, l_2) \le 2K(l_1, l_2) [7].

Both approaches require that the rankings be of the same length and contain the same elements. Thus, when comparing rankings we use the full rankings of all CWEs observed in the data as opposed to comparing top X lists where X is some integer (using some X to limit list length usually results in lists that contain at least one distinct CWE). The number of observed class CWEs in our data was 36 and the number of non-class CWEs was 87. We performed an empirical study to determine K() and F() for random lists of these sizes using 100 000 trials. The results are shown in Table I.
TABLE I
DISTANCE BETWEEN RANDOM RANKINGS

            K()     F()
Length 36    315     432
Length 87   1870    2522
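Both distance measures, and the random baseline reported in Table I, are straightforward to compute; a minimal sketch, where rankings are assumed to be same-length sequences over the same elements:

import itertools
import random

def kendall_tau(l1, l2):
    """Count element pairs whose relative order differs between l1 and l2."""
    pos = {x: k for k, x in enumerate(l2)}
    return sum(1 for a, b in itertools.combinations(l1, 2) if pos[a] > pos[b])

def spearman_footrule(l1, l2):
    """Sum each element's positional displacement between l1 and l2."""
    pos = {x: k for k, x in enumerate(l2)}
    return sum(abs(k - pos[x]) for k, x in enumerate(l1))

def random_baseline(n, trials=100_000):
    """Mean K() and F() between a fixed ranking and random permutations."""
    base = list(range(n))
    k_total = f_total = 0
    for _ in range(trials):
        shuffled = random.sample(base, n)
        k_total += kendall_tau(base, shuffled)
        f_total += spearman_footrule(base, shuffled)
    return k_total / trials, f_total / trials

# random_baseline(36) approaches (315, 432); random_baseline(87), (1870, 2522).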
We first verify that propagating data over our DAG substantially changes the rankings. For each of our metrics, we calculate the full ranking using all available CWEs and then compare that against a ranking created using the same metric but without propagating data over our DAG. The results are shown in Table II. Overall, they show that the lists do change significantly when using the DAG to propagate data. Note that the distances are less for the non-class CWEs, which is remarkable because those lists are more than twice as long as the class CWE lists (longer lists in general produce greater distances due to more elements possibly being out of place). However, this can be explained by noting that there are only 4 edges between the non-class CWEs, which diminishes the effect of propagating data from children to parents. There are, on the other hand, 135 edges over which data can be propagated to the class CWEs.

We next verify that our multiple metrics produce substantially different lists. If this were not the case, then that would argue towards producing just a single list as opposed to multiple lists with different perspectives. Table III shows the results for the class CWE lists and Table IV shows the results for the non-class CWE lists. Overall, all the lists appear different. The mean exploitability lists have the most distinction from the other two. Comparing the mean exploitability and normalized frequency lists for the class CWEs, we get lists that are even more different than random (see Table I). Comparing the mean exploitability and the mean impact lists, they are almost as different as the random lists.

TABLE II
DISTANCE BETWEEN TOP LISTS CREATED USING RAW CWES VERSUS PROPAGATING DATA OVER THE CONSTRUCTED DAG

                        Class CWEs             Non-Class CWEs
                        (list length of 36)    (list length of 87)
                        K()      F()           K()      F()
Normalized Frequency    213      324           115      228
Mean Exploitability     188      264            81      154
Mean Impact             175      256            82      164

TABLE III
DISTANCE BETWEEN TOP CLASS CWE LISTS CREATED USING DIFFERENT METRICS (WITH PROPAGATING DATA ON THE DAG)

                        Normalized Frequency    Mean Exploitability    Mean Impact
                        Kendall Tau  Spearman   Kendall Tau  Spearman  Kendall Tau  Spearman
Normalized Frequency    0            0          357          488       213          290
Mean Exploitability                             0            0         312          420
Mean Impact                                                            0            0

TABLE IV
DISTANCE BETWEEN TOP NON-CLASS LISTS CREATED USING DIFFERENT METRICS (WITH PROPAGATING DATA ON THE DAG)

                        Normalized Frequency    Mean Exploitability    Mean Impact
                        Kendall Tau  Spearman   Kendall Tau  Spearman  Kendall Tau  Spearman
Normalized Frequency    0            0          1281         1716      976          1350
Mean Exploitability                             0            0         1527         1984
Mean Impact                                                            0            0

VII. RELATED WORK
The NVD also provides CWE rating data. This is in the form of two visualizations: one that shows the relative frequency of the observed CWEs per year and another that shows the actual frequency change for the most frequent CWEs [13]. This data is incorrect because it miscounts the frequencies of the CWEs that have children in the CWE taxonomies (because NVD only labels a CVE with its most specific CWE and this information is not propagated to its parents). Also, NVD doesn't distinguish between classes and bases/variants/compounds. The classes are larger categories, biasing them to be very frequent and crowding out the bases, which are less frequent simply because they are more specific.

While indirectly related, in [14] a study is done on predicting the next software flaw instances based on NVD data. Also, text clustering has been done on NVD data to automatically identify software flaw types [15]; this could lead to automatically generated taxonomies to which this work could apply.

VIII. CONCLUSION
The multiple CWE views can be evaluated as hierarchical taxonomies that reveal parent-child relationships between pairs of CWEs. The different perspectives of the views do not invalidate unifying them because, in scoring a CWE as to its significance, we want to know all applicable CVEs regardless of the particular method used to organize the CWEs hierarchically. View 1003 provides an obvious base taxonomy from which to start as it was designed to cover the CWEs most used by CVEs. However, its perfect tree structure indicates likely missing relationships. We find those relationships through evaluating the primary three CWE taxonomies (one of which we have to discard because it uses non-CWEs for its higher level classifications). We first find direct missing edges and then find indirect edges (those that represent paths traversing non-1003 view CWEs).

The NVD is an ideal data source to analyze the CWEs because it both maps the CVEs to CWEs and also provides CVSS scores for each CVE. We adorned our unified DAG with the CVE information from NVD and propagated that information from children to parents. We then evaluated the DAG with three simple metrics, focusing separately on classes and bases/variants/compounds due to their very different levels of abstraction. We then generated top lists that provide the most significant CWEs relative to a particular perspective and abstraction. We analyzed those lists and discovered significant differences between them. This argues for the usefulness of and need for multiple top lists with different perspectives.

It is our hope that software developers and creators of software bug finding tools will use our approach to help prioritize finding and eliminating CWEs in their code. We hope in turn that this will help reduce the number and severity of security related flaws in software.
ACKNOWLEDGEMENT
This work was partially accomplished under the National Institute of Standards and Technology Cooperative Agreement No. 70NANB19H063 with Prometheus Computing, LLC.
REFERENCES

[1] Y. Wu, I. Bojanova, and Y. Yesha, "They know your weaknesses–do you?: Reintroducing common weakness enumeration," CrossTalk, vol. 45, 2015.
[2] R. Martin, S. Barnum, and S. Christey, "Being explicit about security weaknesses," Blackhat DC.
[3] "National vulnerability database," accessed: 2019-12-10. [Online]. Available: https://nvd.nist.gov
[4] "Common vulnerabilities and exposures," accessed: 2019-12-10. [Online]. Available: https://cve.mitre.org
[5] "Common vulnerability scoring system version 3.1: Specification document," accessed: 2019-12-10. [Online]. Available: https://www.first.org/cvss/specification-document
[6] P. Mell, K. Scarfone, and S. Romanosky, "Common vulnerability scoring system," IEEE Security & Privacy, vol. 4, no. 6, pp. 85–89, 2006.
[7] "Generalized distances between rankings," 2010, accessed: 2019-12-10. [Online]. Available: https://tinyurl.com/theory-stanford-edu-sergei
[8] R. A. Martin and S. Barnum, "Common weakness enumeration (CWE) status update," ACM SIGAda Ada Letters, vol. 28, no. 1, pp. 88–91, 2008.
[9] "Common weakness enumeration," 2019, accessed: 2019-12-10. [Online]. Available: https://cwe.mitre.org
[10] D. W. Baker, S. M. Christey, W. H. Hill, and D. E. Mann, "The development of a common enumeration of vulnerabilities and exposures," in Recent Advances in Intrusion Detection, 1999.
[11] "Common vulnerability scoring system," accessed: 2019-12-10. [Online]. Available: https://www.first.org/cvss
[12] "2019 CWE top 25 most dangerous software errors," 2019, accessed: 2019-12-10. [Online]. Available: https://cwe.mitre.org/top25
[13] "NVD visualizations," accessed: 2019-12-10. [Online]. Available: https://nvd.nist.gov/general/visualizations
[14] S. Zhang, D. Caragea, and X. Ou, "An empirical study on using the national vulnerability database to predict software vulnerabilities," in International Conference on Database and Expert Systems Applications. Springer, 2011, pp. 217–231.
[15] S. Huang, H. Tang, M. Zhang, and J. Tian, "Text clustering on national vulnerability database," in 2010 Second International Conference on Computer Engineering and Applications, 2010.