Biased RSA private keys: Origin attribution of GCD-factorable keys
Adam Janovsky, Matus Nemec, Petr Svenda, Peter Sekan, Vashek Matyas
Masaryk University, Czech Republic · Invasys, Czech Republic · Linköping University, Sweden
Contact: [email protected]
Abstract.
In 2016, Švenda et al. (USENIX Security 2016, The Million-Key Question) reported that the implementation choices in cryptographic libraries allow for qualified guessing about the origin of public RSA keys. We extend the technique to two new scenarios where not only public but also private keys are available for origin attribution: analysis of the source of GCD-factorable keys in IPv4-wide TLS scans, and forensic investigation of an unknown source. We learn several representatives of the bias from the private keys to train a model on more than 150 million keys collected from 70 cryptographic libraries, hardware security modules and cryptographic smartcards. Our model not only doubles the number of distinguishable groups of libraries (compared to public keys from Švenda et al.) but also improves more than twice in accuracy w.r.t. random guessing when a single key is classified. For a forensic scenario where at least 10 keys from the same source are available, the correct origin library is identified with an average accuracy of 89%, compared to the 4% accuracy of a random guess. The technique was also used to identify libraries producing GCD-factorable TLS keys, showing that only three groups are the probable suspects.
Keywords:
Cryptographic library · RSA factorization · Measurement · RSA key classification · Statistical model.
The ability to attribute a cryptographic key to the library it was generated with is a valuable asset, providing direct insight into cryptographic practices. The slight bias found specifically in the primes of RSA private keys generated by the OpenSSL library [14] allowed researchers to track down the devices responsible for keys found in TLS IPv4-wide scans that were in fact factorable by a distributed GCD algorithm. Further work [23] made the method generic and showed that many other libraries produce biased keys, allowing for origin attribution. As a result, both separate keys and large datasets could be analyzed for their origin libraries. The first-ever explicit measurement of cryptographic library popularity was introduced in [18], showing the increasing dominance of the OpenSSL library on the market. Furthermore, very uncommon characteristics of the library used by Infineon smartcards allowed for their entirely accurate classification. Importantly, this led to a discovery that the library is, in fact, producing practically factorable keys [19]. Consequently, more than 20 million eID certificates with vulnerable keys were revoked in Europe alone. The same method allowed the identification of keys originating from unexpected sources in Estonian eIDs. Eventually, the unexpected keys were shown to be injected from outside instead of being generated on-chip as mandated by the institutional policy [20].

While properties of RSA primes were analyzed to understand the bias detected in public keys, no previous work addressed the origin attribution problem with the knowledge of private keys. The reason may sound understandable: while public keys are readily available in most usage domains, private keys shall be kept secret, and are therefore unavailable for such scrutiny. Yet there are at least two important scenarios for their analysis: 1) tracking sources of GCD-factorable keys from large TLS scans, and 2) forensic identification of black-box devices with the capability to export private keys (e.g., an unknown smartcard, a remote key generation service, or an in-house investigation of cryptographic services). The mentioned case of unexpected keys in Estonian eIDs [20] is a practical example of a forensic scenario, but with the use of public keys only.

* Full details, datasets and paper supplementary material can be found at https://crocs.fi.muni.cz/papers/privrsa_esorics20
The analysis based on private keys can spot even a smaller deviance from the expected origin, as the bias is observed closer to the place of its inception. This work aims to fill this gap in knowledge by a careful examination of both scenarios.

We first provide solid coverage of RSA key sources used in the wild by expanding upon the dataset first released in [23]. During our work, we more than doubled the number of keys in the dataset, gathered from over 70 distinct cryptographic software libraries, smartcards, and hardware security modules (HSMs). Benefiting from 158.8 million keys, we study the bias affecting the primes p and q. We transform known biased features of public keys to their private key analogues and evaluate how they cluster sources of RSA keys into groups. We use the features in multiple variants of the Bayes classifier that are trained on 157 million keys. Subsequently, we evaluate the performance of our classifiers on a further 1.8 million keys isolated from the whole dataset. By doing so, we establish the reliability results for the forensic use case, when keys from a black-box system are under scrutiny. On average, when looking at just a single key, our best model is able to correctly classify 47% of cases when all libraries are considered and 64.6% of keys when the specific sub-domain of smartcards is considered. These results allow for much more precise classification compared to the scenario when only public keys are available.

Finally, we use the best-performing classification method to analyze the dataset of GCD-factorable RSA keys from the IPv4-wide TLS scans collected by Rapid7 [21].

The main contributions of this paper are:
– A systematic mapping of biased features of RSA keys evaluated on a more exhaustive set of cryptographic libraries, described in Section 2.
The dataset (made publicly available for other researchers) leads to 26 total groups of libraries distinguishable based on the features extracted from the value of RSA private key(s).
– Detailed evaluation of the dataset on Bayes classifiers in Section 3, with an average accuracy above 47% when only a single key is available, and almost 90% when ten keys are available.
– An analysis of the narrower domains of cryptographic smartcards and libraries used for TLS, resulting in an even higher accuracy, as shown in Section 4.
– Practical analysis of real-world sources of GCD-factorable RSA keys from public TLS servers obtained from internet-wide scans in Section 5.
The paper roadmap has been partly outlined above; Section 7 then surveys related work and Section 8 concludes the paper.
Various design and implementation decisions in the algorithms for generating RSA keys influence the distributions of the produced keys. A specific type of bias was used to identify OpenSSL as the origin of a group of private keys [17]. Systematic studies of a wide range of libraries [23,18] described more reasons for biases in RSA keys in a surprising number of libraries. In the majority of cases, the bias was not strong enough to help factor the keys more efficiently. Previous research [23] identified multiple sources of bias that our observations from a large dataset of private RSA keys confirm:
1. Performance optimizations, e.g., the most significant bits of primes set to a fixed value to obtain RSA moduli of a defined length.
2. Type of primes: probable, strong, and provable primes:
– For probable primes, whether candidate values for primes are chosen randomly or a single starting value is incremented until a prime is found.
– When generating candidates for probable primes, small factors are avoided in the value of p − 1.
– Blum integers are sometimes used for RSA moduli – both RSA primes are congruent to 3 modulo 4.
– For strong primes, the size of the auxiliary prime factors of p − 1 and p + 1 is biased.
– For provable primes, the recursive algorithm can create new primes of double to triple the binary length of a given prime; usually one version of the algorithm is chosen.
3. Ordering of primes: are the RSA primes in the private key ordered by size?
4. Proprietary algorithms, e.g., the well-documented case of the Infineon fast prime key generation algorithm [19].
5. Bias in the output of a PRNG : often observable only from a large numberof keys from the same source;
6. Natural properties of primes that do not depend on the implementation.
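Several of the biases above are mechanical properties that can be checked directly from a private key. As a toy illustration, the Blum-integer condition from item 2 can be sketched as follows (the function name and toy primes are our own, not from the paper's tooling):

```python
def is_blum_modulus(p: int, q: int) -> bool:
    """An RSA modulus n = p*q is a Blum integer iff both primes are
    congruent to 3 modulo 4."""
    return p % 4 == 3 and q % 4 == 3

# Toy primes for illustration: 7 and 11 are both 3 mod 4, so n = 77 is Blum.
print(is_blum_modulus(7, 11))   # True
print(is_blum_modulus(5, 7))    # False: 5 is 1 mod 4
```

A library that always emits Blum integers is immediately separable from one that never does, which is why this single bit is already a useful feature.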
We collected, analyzed, and published the largest dataset of RSA keys with a known origin, covering 70 libraries (43 open-source libraries, 5 black-box libraries, 3 HSMs, 19 smartcards). We both expanded the datasets from previous work [23,18] and generated new keys from additional libraries for the sake of this study. We processed the keys into a unified format and made them publicly available. Where possible, we analyzed the source code of the cryptographic library to identify the basic properties of key generation according to the list above.

We are primarily interested in 2048-bit keys, which is the most commonly used key length for RSA. As in previous studies [23,18], we also generate shorter keys (512 and 1024 bits) to speed up the process, while verifying that the chosen biased features are not influenced by the key size. This makes the keys of different sizes interchangeable for the sake of our study. We assume that repeatedly running the key generation locally approximates the distributed behaviour of many instances of the same library. This model is supported by the measurements taken in [18], where distributions of keys collected from the Internet exhibited the same biases as locally generated keys.
We extended the features used in previous work on public keys to their equivalent properties of private keys:
Feature ‘5p and 5q’: Instead of the most significant bits of the modulus, we use the five most significant bits of the primes p and q. The modulus is defined by the primes, and the primes naturally provide more information. We chose 5 bits based on a frequency analysis of the high bits. Further bits are typically not biased, and reducing the size of this feature prevents an exponential growth of the feature space.
Feature ‘blum’: We replaced the feature of the second least significant bit of the modulus by the detection of Blum integers. Blum integers can be directly identified using the two prime factors. When only the modulus is available, we can rule out the usage of Blum integers, but not confirm it.
Feature ‘mod’: Previous work used the result of the modulus modulo 3. It was known that primes can be biased modulo small primes (due to avoiding small factors of p − 1 and q − 1), and this bias is visible from the modulus only for the factor 3, when the modulus equals 2 modulo 3 [23]. It is not possible to rule out higher factors from just a single modulus. With access to the primes, we can directly check for this bias for all factors. We detected four categories of such bias, each avoiding all small odd prime factors up to a threshold. We use these categories directly by looking at small odd divisors of p − 1 and q − 1.

Feature ‘roca’: We use a specific fingerprint of factorable Infineon keys published in [19].
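A minimal sketch of extracting these private-key features might look as follows. The helper names, the fixed list of small odd primes, and the category encoding are illustrative assumptions; the ROCA fingerprint is omitted, as it requires the dedicated test from [19]:

```python
def top_bits(x: int, k: int = 5) -> int:
    """The k most significant bits of x (features '5p' and '5q')."""
    return x >> (x.bit_length() - k)

def small_factor_category(p: int) -> int:
    """Sketch of the 'mod' feature: the largest small odd prime t such that
    p - 1 avoids all odd primes up to t (returns 1 if 3 divides p - 1).
    The fixed prime list is an illustrative choice, not the paper's."""
    category = 1
    for t in (3, 5, 7, 11, 13, 17):
        if (p - 1) % t == 0:
            break
        category = t
    return category

def extract_features(p: int, q: int) -> dict:
    return {
        "5p": top_bits(p),
        "5q": top_bits(q),
        "blum": p % 4 == 3 and q % 4 == 3,   # feature 'blum'
        "mod_p": small_factor_category(p),
        "mod_q": small_factor_category(q),
    }
```

For example, `extract_features(1009, 1013)` reports the top bits of both primes, that the pair is not a Blum pair (1009 ≡ 1 mod 4), and that 1012 avoids the odd primes 3, 5 and 7 but not 11.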
Fig. 1. Dendrogram for all groups. How the keys from various libraries differ can be depicted by a dendrogram. It tells us, w.r.t. our feature set, how far from each other the probability distributions of the sources are. We can then hierarchically cluster the sources into groups that produce similar keys. The blue line at 0.085 highlights the threshold for differentiating between two sources/groups. This threshold yields 26 groups using our feature set.
Since it is impossible to distinguish sources that produce identically distributed keys, we introduce a process of clustering to merge similar sources into groups. We cluster two sources together if they appear to be using identical algorithms based on the observation of the key distributions. We measure the difference in the distributions using the Manhattan distance.¹ The absolute values of the distances depend on the actual distributions of the features. Large distances correlate with significant differences in the implementations. Note that very small observed distances may be only the result of noise in the distributions instead of a real difference, e.g., due to a smaller number of keys available.

We attempt to place the clustering threshold as low as possible, maximizing the number of meaningful groups. If we are not able to explain why two clusters are separated based on the study of the algorithms and the distributions of the features, the threshold needs to be moved higher to join these clusters. We worked with distributions that assume all features correlated (as in [23]).

The resulting classification groups and the dendrogram are shown in Figure 1. We placed the threshold value at 0.085. By moving it above 0.154, we would lose the ability to distinguish groups 11 and 12. It would be possible to further split group 14, as there is a slight difference in the prime selection intervals used by Crypto++ and Microsoft [23]. However, the difference manifests less than the level of noise in other sources and would require the threshold to be put at 0.052, which would create several false groups. We use the same clustering throughout the paper, although the value of the threshold would change when the features change. Note that different versions of the same library may fall into different groups, mostly because of algorithm changes between these versions. This, for instance, is the case of Bouncy Castle 1.53 and 1.54.

¹ We experimented with Euclidean distance and fractional norms. While Euclidean distance is a proper metric, our experiments showed that it is more sensitive to the noise in the data, creating separable groups out of sources that share the same key generation algorithms. On the other hand, fractional norms did not highlight differences between sources that provably differ in the key generation process.
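The clustering step can be sketched as follows: each source is summarized by a distribution of feature values, and sources within the Manhattan-distance threshold are greedily merged. The toy distributions and library names below are invented; only the 0.085 threshold comes from the text:

```python
def manhattan(d1: dict, d2: dict) -> float:
    """L1 distance between two feature-value distributions."""
    keys = set(d1) | set(d2)
    return sum(abs(d1.get(k, 0.0) - d2.get(k, 0.0)) for k in keys)

def cluster(sources: dict, threshold: float = 0.085) -> list:
    """Greedy single-linkage grouping of sources (name -> distribution)."""
    groups = []
    for name, dist in sources.items():
        for g in groups:
            if any(manhattan(dist, sources[m]) <= threshold for m in g):
                g.append(name)   # close enough to an existing group member
                break
        else:
            groups.append([name])  # no group within threshold -> new group
    return groups

# Toy sources: libA and libB differ only by sampling noise, libC is distinct.
toy = {
    "libA": {"blum": 1.00, "mod3": 0.50},
    "libB": {"blum": 0.96, "mod3": 0.52},
    "libC": {"blum": 0.00, "mod3": 0.67},
}
```

Here `manhattan(toy["libA"], toy["libB"])` is 0.06, below the 0.085 threshold, so the two merge into one group, while libC remains alone. A full implementation would use proper hierarchical linkage over all pairs, as the dendrogram in Figure 1 suggests.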
How accurately we can classify the keys depends on several factors, most notably: the libraries included in the training set, the number of keys available for classification, the features extracted from the classified keys, and the classification model. In this section, we focus on the last factor.
As generating RSA keys is internally a stochastic process, we choose the family of probabilistic models to address the source attribution problem. Since there is no strong motivation for complex machine learning models, we utilize simple classifiers. More sophisticated classifiers could be built based on our findings when the goal is to reach higher accuracy or to more finely discriminate sources within a group. The rest of this subsection describes the chosen models.
Naive Bayes classifier. The first investigated model is a naive Bayes classifier, called naive because it assumes that the underlying features are conditionally independent. Using this model, we apply the maximum-likelihood decision rule and predict the label as ŷ = argmax_y P(X = x | y). Thanks to the naive assumption, we may decompose this computation into ŷ = argmax_y ∏_{i=1}^{n} P(x_i | y) for the feature vector x = (x_1, ..., x_n).

Bayes classifier.
We continue to develop the approach originally used in [23] that used the Bayes classifier without the naive assumption. Several reasons motivate this. First, it allows us to evaluate how much the naive Bayes model suffers from the violated independence assumption (on this specific dataset). Secondly, it gives us access to more precise probability estimates that are needed to classify real-world GCD-factorable keys. Additionally, we can directly compare the classification accuracy of private keys with the case of the public keys from [23]. However, one of the main drawbacks of the Bayes classifier is that it requires exponentially more data with a growing number of features. Therefore, when striving for high accuracy achievable by further feature engineering, one should consider the naive Bayes instead.
Naive Bayes classifier with cross-features. The third investigated option is the naive Bayes classifier where we merged selected features that are known to be correlated into a single feature. In particular, we merged the features of the most significant bits (of p, q) into a single cross-feature. Subsequently, the naive Bayes approach is used. This enables us to evaluate whether merging clearly interdependent features into one will affect the performance of the naive Bayes classifier w.r.t. this specific dataset.
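The naive Bayes decision rule above can be sketched in a few lines, computing argmax_y Σ_i log P(x_i | y) in log-space for numerical stability. The groups, feature names, and probabilities below are invented placeholders, not estimates from the paper's dataset:

```python
import math

# Toy conditional probabilities P(feature value | group); a real model would
# estimate these tables from the training keys of each group.
cond_probs = {
    "openssl-like": {"blum": {True: 0.01, False: 0.99},
                     "mod3": {0: 0.01, 1: 0.99}},
    "blum-library": {"blum": {True: 0.99, False: 0.01},
                     "mod3": {0: 0.50, 1: 0.50}},
}

def classify(key_features: dict) -> str:
    """Naive Bayes: y_hat = argmax_y sum_i log P(x_i | y)."""
    def loglik(group):
        table = cond_probs[group]
        # Clamp probabilities to avoid log(0) for unseen feature values.
        return sum(math.log(max(table[f].get(v, 0.0), 1e-12))
                   for f, v in key_features.items())
    return max(cond_probs, key=loglik)
```

A key whose primes form a Blum pair is pulled strongly toward the hypothetical "blum-library" group, while a non-Blum key lands in "openssl-like"; the Bayes classifier without the naive assumption would instead index one joint table by the whole feature tuple.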
Our training dataset contains 157 million keys and the test set contains 1.8 million keys. We derived the test set by discarding 10 thousand keys of each source from the complete dataset before clustering. This assures that each group has a test set of at least 10 thousand keys. Accordingly, since the groups differ in the number of sources involved, the resulting test dataset is imbalanced. For this reason, we employ the metrics of precision and recall when possible. However, we represent the model performance by the accuracy measure in the tables and in more complex classification scenarios. For group X, the precision can be understood as the fraction of correctly classified keys from group X divided by the number of keys that were marked as group X by our classifier. Similarly, the recall is the fraction of correctly classified keys from group X divided by the total number of keys from group X [11]. We also evaluate the performance of the models under the assumption that the user has a batch of several keys from the same source at hand. This scenario can arise, e.g., when a security audit is run in an organization and all keys are being tested. Furthermore, to react to some often misclassified groups, we additionally provide the answer "this key originates from group X or group Y" to the user (and we evaluate the confidence of these answers).

Comparison of the models.
The overall comparison of all three models can be seen in Table 1. If the precision for some group is undefined, i.e., no key is allegedly originating from this group, we say that the precision is 0. We evaluate the naive Bayes classifier on the same features that were used for the Bayes classifier to measure how much classification performance is lost by introducing the feature independence assumption. A typical example of interdependent features is that the most significant bits of the primes p and q are intentionally correlated to preserve the expected length of the resulting modulus n. Pleasantly, the observed precision (recall) decrease is only 2.3% (about 1%), and the naive Bayes classifier with cross-features achieves a comparable precision with a recall of about 47%.

    Model                          Avg. precision   Avg. recall
    Bayes classifier               43.2%            47.6%
    Naive Bayes                    40.9%            ≈46%
    Naive Bayes (cross-features)   –                ≈47%

Table 1. Performance comparison of the different models on the dataset with all libraries. Note that the precision of a random guess classifier is 3.8% when considering 26 groups.

Section 2 outlined the process of choosing a threshold value that determines the critical distance for distinguishing between distinct groups. Inevitably, the same threshold value directly influences the number of groups after the clustering task. As such, the threshold introduces a trade-off between the model performance and the number of discriminated groups. The smaller the difference between group distributions is, the more similar they are, and the model performance is lower as more misclassification errors occur. The objective of this section is to examine the classification scenario when some prior knowledge is available to the analyst, limiting the origin of the keys to only a subset of all libraries or increasing the likelihood of some. Since Section 3 showed that the Bayes classifier provides the best performance, this section considers only this model.

Prior knowledge can be introduced into the classification process in multiple ways, e.g., by using a prior probability vector that considers some groups more prevalent. We also note that the measurement method of [18] can be used to obtain such prior information, but a relatively large dataset of private keys is required that may not be available. Our work, therefore, considers a different setting where some sources of the keys are ruled out before the classifier is constructed. Such a scenario arises, e.g., when the analyst knows that the scrutinized keys were generated in an unknown cryptographic smartcard.
In such a case, HSMs and other sources of keys can be omitted from the model altogether, which will arguably increase the performance of the classification process. Another example is leaving out libraries that were released after the classified data sample was collected.

We present the classification performance results for three scenarios with a limited number of sources: 1) cryptographic smartcards (Section 4.1), 2) sources likely to be used in the TLS domain (Section 4.2), and 3) a specific case of GCD-factorable keys from the TLS domain, where only one out of two primes can be used for classification (see Section 4.3 for more details). The comparison of models for these scenarios can be seen in Table 2.

    Dataset                   Avg. precision   Avg. recall   Random guess (baseline)
    All libraries             43.2%            47.6%         3.8%
    Smartcards domain         61.9%            64.6%         8.3%
    TLS domain                45.5%            42.2%         7.7%
    Single-prime TLS domain   28.8%            36.2%         11.1%

Table 2. Bayes classifier performance on the complete dataset and three analyzed partitionings of the dataset: the complete dataset with all libraries (All libraries), smartcards only (Smartcards domain), libraries and HSMs expected to be used for TLS (TLS domain), and a specific subset of the TLS domain where only a single prime is available due to the nature of the results obtained by the GCD factorization method (Single-prime TLS domain). A comparison with the random guess as a baseline is provided (here, accuracy equals precision and recall).

To compute these models, we first discard the sources that cannot be the origin of the examined keys according to the prior knowledge of the domain (e.g., smartcards are not expected in TLS). Next, we re-compute the clustering task to obtain fewer groups than on the dataset with all libraries. Finally, we compute the classification tables for the reduced domain and evaluate the performance.
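The domain restriction (and, optionally, a prior probability vector) can be sketched as follows; the group names, likelihoods, and prior values are illustrative assumptions, not figures from the paper:

```python
import math

# Toy likelihoods P(observed features | group) for a key under scrutiny.
likelihood = {"OpenSSL": 0.020, "mbedTLS": 0.012, "Infineon-card": 0.030}

# Toy prior knowledge, e.g., derived from library popularity measurements.
prior = {"OpenSSL": 0.70, "mbedTLS": 0.25, "Infineon-card": 0.05}

def classify(domain: list, use_prior: bool = True) -> str:
    """MAP decision restricted to the allowed `domain` subset of groups."""
    def score(g):
        s = math.log(likelihood[g])
        if use_prior:
            s += math.log(prior[g])   # posterior ∝ prior × likelihood
        return s
    return max(domain, key=score)

# A TLS analyst rules out smartcards before classification:
tls_domain = ["OpenSSL", "mbedTLS"]
```

With no prior and no restriction the highest raw likelihood ("Infineon-card") wins, whereas restricting to the TLS domain and weighting by the prior attributes the same key to OpenSSL, illustrating how prior knowledge reshapes the verdict.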
Fig. 2. Dendrogram for the smartcard domain. The clustering of smartcard sources yields 12 separate groups:
Group 1 (Infineon JTOP 80K)
Group 2 (G&D SmartCafe 4.x, G&D SmartCafe 6.0)
Group 3 (G&D SmartCafe 3.2)
Group 4 (NXP J2D081, NXP J2E145G (fingerprint 131))
Group 5 (NXP J2D081, NXP J2E145G (fingerprint 251))
Group 6 (Feitian JavaCOS A22, Feitian JavaCOS A40, Oberthur Cosmo 64)
Group 7 (Taisys SIMoME VAULT)
Group 8 (Gemalto GXP E64)
Group 9 (Athena IDProtect, Gemalto GCX4 72K)
Group 10 (Oberthur Cosmo Dual 72K)
Group 11 (NXP J2A080, NXP J2A081, NXP J3A081, NXP JCOP 41 V2.2.1)
Group 12 (G&D StarSign)
The clustering task in the smartcard domain yields 12 recognizable groups for 19 different smartcard models, as shown in Figure 2. The training set for this limited domain contains 20.6 million keys, whereas the test set contains 340 thousand keys. On average, 61.9% precision and 64.6% recall is achieved. Moreover, 8 out of 12 groups achieve >50% precision. Additionally, the classifier exhibits 100% recall on 3 specific groups: a) Infineon smartcards (before 2017, with the ROCA vulnerability [19]), b) G&D SmartCafe 4.x and 6.0, and c) the newer G&D SmartCafe 7.0. Figure 3 shows the so-called confusion matrix, where each row corresponds to the percentage of keys in an actual group while each column represents the percentage of keys in a predicted group.
Fig. 3. The confusion matrix for the classifier of a single private key generated in the smartcard domain (rows: true group; columns: group predicted by our model). A given row corresponds to a vector of observed relative frequencies with which keys generated by a specific group are classified as the individual predicted groups. For example, groups 1 and 2 have no misclassifications (high accuracy), while keys of group 3 are in 33% of cases misclassified as keys from group 2. On average, we achieve 64.6% accuracy. The darker a cell is, the higher the number it contains; this holds for all figures in this paper.
As expected, the results represent an improvement when compared to the dataset with all libraries. When one has ten keys from the same card at hand, the expected recall is over 90% on 10 out of 12 groups. The full table of results can be found in the project repository.

Interestingly, 512- and 1024-bit keys generated by the same NXP J2E145G card (similarly also for NXP J2D081) fall into different groups.² The main difference is in the modular fingerprint (the avoidance of small factors in p − 1 and q − 1), observed only for larger keys. Such behaviour was not observed for other libraries, but it highlights the necessity of collecting different key lengths in the training dataset when one analyzes black-box proprietary devices or closed-source software libraries.

To summarize, the classification of private keys generated by smartcards is very accurate due to the significant differences resulting from the proprietary, embedded implementations of the different vendors. The observed differences likely result from the requirement of a smaller footprint in low-resource devices.

² This is an exception to the observation that the selected features behave independently of key length. Otherwise, keys of different lengths can be used interchangeably.
For the TLS domain, we excluded all the libraries and devices unlikely to be used to generate keys for TLS servers. All smartcards are excluded, together with highly outdated or purpose-specific libraries like PGP SDK 4. All hardware security modules (HSMs) are present, as they may be used as TLS accelerators or high-security key storage. In summary, we started with 17 separate cryptographic libraries and HSMs, inspected in a total of 134 versions. The clustering resulted in 13 recognizable groups, as shown in Figure 4.

The domain training set contains 121.8 million keys and the test set contains 1.3 million keys. On average, the classifier achieves 45.5% precision and 42.2% recall, with several of the 13 groups achieving >50% precision. OpenSSL (by far the most popular library used by servers for TLS [18]) has 100% recall, making the classification of OpenSSL keys very reliable. Complete results can be found in the project repository.

To summarize, we correctly classify more keys in the more specific TLS domain than with the full-dataset classifier. Additionally, the user can be more confident about the decisions of the TLS-specific classifier.
Group 1 (OpenSSL)
Group 2 (OpenSSL (8-bit fingerprint))
Group 3 (Sage Blum, Sage Provable)
Group 4 (Mocana)
Group 5 (mbedTLS)
Group 6 (Bouncy Castle 1.53, SunRsaSign)
Group 7 (Bouncy Castle 1.54, Mocana, Thales)
Group 8 (SafeNet, cryptlib)
Group 9 (Libgcrypt, Libgcrypt FIPS)
Group 10 (Botan, LibTomCrypt, Nettle 3.2, Nettle 3.3, OpenSSL FIPS, WolfSSL)
Group 11 (Crypto++, Microsoft)
Group 12 (Utimaco)
Group 13 (Nettle 2.0, Sage Default)

Fig. 4. Dendrogram for the TLS domain. The clustering of the sources from the TLS domain yields 13 separate groups.

The rest of this section is motivated by a setting where one wants to analyze a batch of correlated keys. Specifically, we assume a case of k ≥ 2 private keys (p_1, q_1), ..., (p_k, q_k) generated by the same source, where p_1 = p_2 = ... = p_k. This scenario emerges in Section 5 and cannot be addressed by the previously considered classifiers. If applied, the results would be drastically skewed, since the classifier would consider each of the p_i separately, putting half of the weight on the shared prime. For that reason, we train a classifier that works on single primes rather than on complete private keys. Instead of feeding the classifier with a batch of k private keys, we supply it with a batch of k + 1 unique primes from those keys. The selected features were modified accordingly: we extract the 5 most significant bits of the unique prime and its second least significant bit, and compute the ROCA and modular fingerprints for the single prime. We trained the classifier on the learning set limited to the TLS domain, as in Section 4.2.

On average, we achieve 28.8% precision and 36.2% recall when classifying a single prime. Table 3 shows the accuracy results in more detail. It should, however, be stressed that this classifier is meant to be used on batches of many keys at once. When considering a batch of k ≥ 10 primes, the accuracy is more than 77%. The decrease in accuracy compared to Section 4.2 can be explained by the loss of information from the second prime. The features mod and blum are much less reliable when using only one prime. Since we can compute the most significant bits from only a single prime at a time, we also lose the information about the ordering of primes (the ‘5p’ and ‘5q’ features are correlated). These facts result in only nine separate groups of libraries being distinguishable. The following groups from the TLS domain are no longer mutually distinguishable: 5 and 13, 7 and 11, 8 and 9 and 10.

Table 3. Classification accuracy for single-prime features evaluated on the TLS domain, per group and for increasing numbers of primes in a batch.
The presented methodology has several limitations:
Classification of an unseen source.
Not all existing sources of RSA keys are present in our dataset for the clustering analysis and classification. This means that attempting to classify a key from a source not considered in our study will bring unpredictable results. The new source may either populate some existing group or have a unique implementation, thus creating a new group. In both cases, the behaviour of the classifier is unpredictable.
Granularity of the classifier.
There are multiple libraries in a single group. The user is therefore not shown the exact source of the key, but the whole group instead. This limitation has two main reasons: 1) Some sources share the same implementation and thus cannot be told apart. 2) The list of utilized features is narrow. There are infinitely many possible features in principle, and some may hide valuable information that could further improve the model performance. Nevertheless, the proposed methodology allows for an automatic evaluation of features using the naive Bayes method, which shall be considered in future work.
Human factor.
The clustering task in our study requires human knowledge. To be specific, the value of the threshold that splits the libraries into groups (for a particular feature) is established only semi-automatically. We manually confirmed the threshold when we could explain the difference between the libraries, or moved it otherwise. In summary, this complicates a fully automatic evaluation of a large number of potential features. Once solved, the relative importance of the individual features could be measured.
Previous research [16,14,13,2] demonstrated that a non-trivial fraction of the RSA keys used on publicly reachable TLS servers is generated insecurely and is practically factorable. This is because the affected network devices were found to independently generate RSA keys that share a single prime or both primes. While an efficient factorization algorithm for general RSA moduli is unknown, when two keys accidentally share one prime, efficient factorization is possible using the Euclidean algorithm to find their GCD. (Note that keys sharing both primes are not susceptible to this attack, but reveal their private keys to all other owners of the same RSA key pair.) Still, the current number of public keys obtained from crawling TLS servers is too high to allow for the investigation of all possible pairs. However, the distributed GCD algorithm [15] allows analyzing hundreds of millions of keys efficiently. Its performance was sufficient to analyze all keys collected from IPv4-wide TLS scans [21,5] and resulted in almost 1% of factorable keys in the scans collected at the beginning of the year 2016.

After the detection of GCD-factorable keys, the question of their origin naturally followed. Previous research addressed it using two principal approaches: 1) an analysis of the information extractable from the certificates of GCD-factorable keys, and 2) matching specific properties of the factored primes with primes generated by a suspected library, OpenSSL. The first approach allowed the detection of a range of network routers that seeded their PRNG shortly after boot without enough entropy, which caused them to occasionally generate a prime shared with another device.
These routers contained a customized version of the OpenSSL library, which was confirmed with the second approach, since the OpenSSL code intentionally avoids small factors of p − 1: keys lacking this fingerprint therefore do not originate from the OpenSSL library.

Two assumptions must be met to employ the classifier studied in Section 4.3. First, we assume that when a batch of GCD-factored keys shares a prime, they were all generated by sources from a single classification group. This conjecture is suggested in [13,14] and supported by the fact that when distinct libraries differ in their prime generation algorithm, they will produce different primes even when initialized from the same seed. On the other hand, when they share the same generation algorithm, they inevitably fall into the same classification group. Second, we assume that if the malformed keys share only a single prime, the PRNG was reseeded with enough entropy before the second prime got generated. This is suggested by the failure model studied for OpenSSL in [14] and implies that the second prime is generated as it normally would be.

Leveraging these conjectures, the rest of this section tracks the libraries responsible for GCD-factorable keys while not relying on the information in the certificates. First, we describe the dataset gathering process, as well as the factorization of the RSA public keys. Later, the successfully factored keys are analyzed, followed by a discussion of the findings.

The input dataset with public RSA keys (both secure and vulnerable ones) was obtained from the Rapid7 archive. All scans between October 2013 and July 2019 (mostly at one- or two-week intervals) were downloaded and processed, resulting in slightly over 170 million certificates. Only public RSA keys were extracted, and duplicates removed, resulting in 112 million unique moduli. On this dataset, the fastgcd [15] tool based on [3] was used to factorize the moduli into private keys. A detailed methodology of this procedure is discussed in Appendix B.
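The OpenSSL-style fingerprint mentioned above amounts to a divisibility check on p − 1. A minimal sketch follows; note that the bound is illustrative and does not match OpenSSL's actual compile-time prime table:

```python
def odd_primes_up_to(n):
    """Sieve of Eratosthenes, returning odd primes <= n."""
    sieve = bytearray([1]) * (n + 1)
    sieve[0] = sieve[1] = 0
    for i in range(2, int(n ** 0.5) + 1):
        if sieve[i]:
            sieve[i * i::i] = bytearray(len(sieve[i * i::i]))
    return [i for i in range(3, n + 1, 2) if sieve[i]]

def lacks_small_factors(p, bound=100):
    """True if p - 1 has no odd prime factor <= bound.

    OpenSSL rejects prime candidates failing this test, so primes carrying
    the fingerprint always pass, whereas primes from other libraries fail
    it with noticeable probability. The bound here is illustrative only."""
    return all((p - 1) % q != 0 for q in odd_primes_up_to(bound))

print(lacks_small_factors(227))  # True: 226 = 2 * 113, no small odd factor
print(lacks_small_factors(7))    # False: 6 is divisible by 3
```

Observing even one prime that fails this test rules out an OpenSSL-like generator for the whole batch, which is the logic used to attribute the router keys.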
If the precision and recall of our classifier were 100%, one could process the factored keys one by one, establish their origin library, and thus detect all sources of insecure keys. But since the classification accuracy of the single-prime TLS classifier with a single key is only 36%, we apply three adjustments: 1) batch the GCD-factorable keys sharing the same prime (believed to be produced by the same library); 2) analyze only the batches with at least 10 keys (therefore with high expected accuracy); 3) limit the set of the libraries considered for classification only to the single-prime TLS domain. Since the keys from the OpenSSL library were already extensively analyzed by [13], we use the mod feature to reliably mark and exclude them from further analysis. By doing so, we concentrate primarily on the non-OpenSSL keys that were not yet attributed.

The exact process for the classification of factored keys in batches is as follows:
1. Factorize public keys from a target dataset (e.g., Rapid7) using the fastgcd tool.
2. Form batches of factored keys that share a prime and assume that they originate from the same classification group.
3. Select only the batches with at least k keys (e.g., 10).
4. Separate batches of keys that all carry the OpenSSL fingerprint. As a control experiment, they should classify only to a group with the OpenSSL library.
5. Separate batches without the OpenSSL fingerprint. This cluster contains yet unidentified libraries.
6. Classify the non-OpenSSL cluster using the single-prime TLS classifier.
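Steps 2 and 3 of the process above can be sketched as a simple grouping pass. The input format (a mapping from modulus to its recovered shared prime) is hypothetical; in practice it would be parsed from the fastgcd output:

```python
from collections import defaultdict

def batch_factored_keys(factored, k=10):
    """Group GCD-factored keys by their shared prime (step 2) and keep
    only batches with at least k keys (step 3).

    `factored` maps modulus -> shared prime recovered by batch GCD; this
    input format is an illustrative assumption, not the fastgcd format."""
    batches = defaultdict(list)
    for modulus, prime in factored.items():
        batches[prime].append(modulus)
    return {p: mods for p, mods in batches.items() if len(mods) >= k}

# Toy example: three moduli share the prime 3, one modulus shares nothing often enough.
factored = {15: 3, 21: 3, 33: 3, 35: 5}
print(batch_factored_keys(factored, k=2))  # -> {3: [15, 21, 33]}
```

Steps 4 and 5 would then partition the surviving batches by whether every prime in the batch passes the OpenSSL small-factor test, before handing the non-OpenSSL cluster to the single-prime TLS classifier (step 6).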
Table 4. Keys that share a prime factor belong to the same batch. Classification of most batches resulted in OpenSSL as the likely source. The rest of the batches were likely generated by libraries in the combined group 8|10.

Group(s)                                    Number of batches
8|10 (various libraries, see Figure 4)      278
3; 4; 6; 12; 5|13; 7|11                     0 (improbable)

In total, we analyzed more than 82 thousand primes divided into 2511 batches. While each batch has at least 10 keys in it, the median batch size is 15. Among the batches, 88.8% exhibit the OpenSSL fingerprint. This number confirms the previous finding by [13] that also captured the OpenSSL-specific fingerprint in a similar fraction of keys. (Note that without using the single-prime model, the results would be biased, as the shared prime would be considered multiple times in the classification process.) We attribute three other batches as coming from OpenSSL with the 8-bit fingerprint, i.e., an OpenSSL library compiled to test and avoid only small divisors of p − 1. Since no batches were classified to the groups 3; 4; 6; 12; 5|13, or 7|11, it is very improbable that any GCD-factorable keys originate from the respective sources in these libraries.
The fingerprinting of devices based on their physical characteristics, exposed interfaces, behaviour in non-standard or undefined situations, errors returned, and a wide range of other side-channels is a well-researched area. Experience shows that finding a case of non-standard behaviour is usually possible, while making a group of devices indistinguishable is very difficult due to an almost infinite number of observable characteristics, resulting in an arms race between device manufacturers and fingerprinting observers.

Having a device fingerprinted helps to better understand a complex ecosystem, e.g., by quantifying the presence of interception middle-boxes on the internet [9], the types of clients connected, or the versions of operating systems used. Differences may help point out subverted supply chains or counterfeit products.

When applied to the study of cryptographic keys and cryptographic libraries, researchers devised a range of techniques to analyze the fraction of encrypted connections, the prevalence of particular cryptographic algorithms, and the chosen key lengths or cipher suites [8,10,2,1,12,4,24]. Information about a particular key is frequently obtained from the metadata of its certificate.

Periodical network scans allow assessing the impact of security flaws in practice. The population of OpenSSL servers with the Heartbleed vulnerability was measured and monitored by [7], and real attempts to exploit the bug were surveyed. If the necessary information is coincidentally collected and archived, even a backward introspection of a vulnerability in time might be possible.

The simple test for the ROCA vulnerability in public RSA keys allowed measuring the fraction of citizens of Estonia who held an electronic ID supported by a vulnerable smartcard, by inspecting the public repository of eID certificates [19].
The fingerprinting of keys from smartcards was used to detect that private keys were generated outside of the card and injected later into the eIDs, despite the requirement to have all keys generated on-card [20].

The attribution of a public RSA key to its origin library was analyzed by [23]. Measurements on large datasets were presented in [18], leading to an accurate estimation of the fraction of cryptographic libraries used in large datasets like IPv4-wide TLS. While both [23] and [18] analyze public keys, private keys can also be obtained under certain conditions of a faulty random number generator [16,6,13,14,22]. The origin of weak factorable keys needs to be identified in order to notify the maintainers of the code to fix the underlying issues. For that purpose, a combination of key properties and values from certificates was used.
We provide what we believe is the first wide examination of the properties of RSA keys with the goal of attributing a private key to its origin library. The attribution is applicable in multiple scenarios, e.g., to the analysis of GCD-factorable keys in the TLS domain. We investigated the properties of keys as generated by 70 cryptographic libraries, identified biased features in the primes produced, and compared three models based on Bayes classifiers for the private key attribution. The information available in private keys significantly increases the classification performance compared to the result achieved on public keys [23]. Our work enables distinguishing 26 groups of sources (compared to 13 on public keys) while increasing the accuracy more than twice w.r.t. random guessing. When 100 keys are available for the classification, the correct result is almost always provided.

Acknowledgements
The authors would like to thank the anonymous reviewers for their helpful comments. P. Svenda and V. Matyas were supported by the Czech Science Foundation project GA20-03426S. Some of the tools used and other people involved were supported by the CyberSec4Europe Competence Network. Computational resources were supplied by the project e-INFRA LM2018140.
References
1. Albrecht, M.R., Degabriele, J.P., Hansen, T.B., Paterson, K.G.: A surfeit of SSH cipher suites. In: Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security. pp. 1480–1491. ACM (2016)
2. Barbulescu, M., Stratulat, A., Traista-Popescu, V., Simion, E.: RSA weak public keys available on the Internet. In: International Conference for Information Technology and Communications. pp. 92–102. Springer-Verlag (2016)
3. Bernstein, D.J.: How to find smooth parts of integers (2004), [cit. 2020-07-13]. Available from http://cr.yp.to/papers.html
A Detailed discussion of classifier results
Some groups are accurately classified and rarely misclassified even with a single key available: namely, group 1 (Infineon prior to 2017, distinct because of the ROCA fingerprint), group 2 (Giesecke & Devrient SmartCafe 4.x and 6.0), group 24 (standard OpenSSL without the FIPS module enabled), and group 26 (Giesecke & Devrient SmartCafe 7.0) are all classified with more than 96% recall. Groups 1, 2, and 26 also rarely appear as false positives for the origin library. The keys from group 25 (OpenSSL avoiding only 8-bit small factors in p − 1) are frequently misclassified as group 24, which still identifies the origin library correctly and only misidentifies the OpenSSL compile-time configuration.

In contrast, keys from groups 7, 10, 11, 14, 15, and 17 are almost always misclassified (less than 8% recall, some even less than 1%). However, as discussed in the next section, if some additional information is available and can be considered, this misclassification can be largely remediated. Keys from group 7 (Libgcrypt) are mostly misclassified as group 6 (PGP SDK 4). Group 15 (a large group with multiple frequently used libraries) is mostly misclassified as group 12 (the Taisys card).

Table 5. The average classification accuracy of the best performing Bayes classifier, reported as Top 1, Top 2, and Top 3 match. In the i-th column we consider a classifier successful if the true source of the key is among the i best guesses of our model. Similarly, for each of the 3 columns, we evaluate the success rate for different numbers of keys available from the same source.

B Obtaining dataset of GCD-factorable keys
The fastgcd [15] tool based on [3] was used to perform the search for the GCD-factorable keys. Only valid RSA keys were considered. (The factorization occasionally finds small prime factors, likely because the public key in the certificate was damaged, e.g., by a bit flip.) Running the fastgcd tool on a high number of keys (around 112 million for the Rapid7 dataset) requires an extensive amount of RAM. Running the tool on a machine with 500 GB of RAM resulted in only a few factored keys, all sharing just tiny factors, while the tool did not produce any errors or warnings. The same computation on a subset of 10 million keys revealed a substantial number of large factors. Likely, the fastgcd