Biased RSA private keys: Origin attribution of GCD-factorable keys
Adam Janovsky, Matus Nemec, Petr Svenda, Peter Sekan, Vashek Matyas
Masaryk University, Czech Republic · Invasys, Czech Republic · Linköping University, Sweden
Contact: [email protected]
Abstract.
In 2016, Švenda et al. (USENIX Security 2016, The Million-Key Question) reported that the implementation choices in cryptographic libraries allow for qualified guessing about the origin of public RSA keys. We extend the technique to two new scenarios where not only public but also private keys are available for origin attribution: analysis of the source of GCD-factorable keys in IPv4-wide TLS scans, and forensic investigation of an unknown source. We learn several representatives of the bias from the private keys to train a model on more than 150 million keys collected from 70 cryptographic libraries, hardware security modules and cryptographic smartcards. Our model not only doubles the number of distinguishable groups of libraries (compared to public keys from Švenda et al.) but also improves more than twice in accuracy w.r.t. random guessing when a single key is classified. For a forensic scenario where at least 10 keys from the same source are available, the correct origin library is identified with an average accuracy of 89%, compared to the 4% accuracy of a random guess. The technique was also used to identify libraries producing GCD-factorable TLS keys, showing that only three groups are the probable suspects.
Keywords:
Cryptographic library · RSA factorization · Measurement · RSA key classification · Statistical model.
The ability to attribute a cryptographic key to the library it was generated with is a valuable asset, providing direct insight into cryptographic practices. The slight bias found specifically in the primes of RSA private keys generated by the OpenSSL library [14] allowed researchers to track down the devices responsible for keys found in TLS IPv4-wide scans that were in fact factorable by a distributed GCD algorithm. Further work [23] made the method generic and showed that many other libraries produce biased keys, allowing for origin attribution. As a result, both separate keys and large datasets could be analyzed for their origin libraries. The first-ever explicit measurement of cryptographic library popularity was introduced in [18], showing the increasing dominance of the OpenSSL library on the market. Furthermore, very uncommon characteristics of the library used by Infineon smartcards allowed for their entirely accurate classification. Importantly, this led to a discovery that the library is, in fact, producing practically factorable keys [19]. Consequently, more than 20 million eID certificates with vulnerable keys were revoked in Europe alone. The same method allowed the identification of keys originating from unexpected sources in Estonian eIDs. Eventually, the unexpected keys were shown to be injected from outside instead of being generated on-chip as mandated by the institutional policy [20].

While properties of RSA primes were analyzed to understand the bias detected in public keys, no previous work addressed the origin attribution problem with the knowledge of private keys. The reason may sound understandable: while public keys are readily available in most usage domains, private keys shall be kept secret, and are therefore unavailable for such scrutiny. Yet there are at least two important scenarios for their analysis: 1) tracking sources of GCD-factorable keys from large TLS scans, and 2) forensic identification of black-box devices with the capability to export private keys (e.g., an unknown smartcard, a remote key generation service, or an in-house investigation of cryptographic services). The mentioned case of unexpected keys in Estonian eIDs [20] is a practical example of a forensic scenario, but with the use of public keys only.

* Full details, datasets and paper supplementary material can be found at https://crocs.fi.muni.cz/papers/privrsa_esorics20
The analysis based on private keys can spot even a smaller deviance from the expected origin, as the bias is observed closer to the place of its inception. This work aims to fill this gap in knowledge by a careful examination of both scenarios.

We first provide solid coverage of RSA key sources used in the wild by expanding upon the dataset first released in [23]. During our work, we more than doubled the number of keys in the dataset, gathered from over 70 distinct cryptographic software libraries, smartcards, and hardware security modules (HSMs). Benefiting from 158.8 million keys, we study the bias affecting the primes p and q. We transform known biased features of public keys to their private key analogues and evaluate how they cluster sources of RSA keys into groups. We use the features in multiple variants of the Bayes classifier that are trained on 157 million keys. Subsequently, we evaluate the performance of our classifiers on a further 1.8 million keys isolated from the whole dataset. By doing so, we establish the reliability results for the forensic use case, when keys from a black-box system are under scrutiny. On average, when looking at just a single key, our best model is able to correctly classify 47% of cases when all libraries are considered and 64.6% of keys when the specific sub-domain of smartcards is considered. These results allow for much more precise classification compared to the scenario when only public keys are available.

Finally, we use the best-performing classification method to analyze the dataset of GCD-factorable RSA keys from the IPv4-wide TLS scans collected by Rapid7 [21].

The main contributions of this paper are:
– A systematic mapping of biased features of RSA keys evaluated on a more exhaustive set of cryptographic libraries, described in Section 2.
The dataset (made publicly available for other researchers) leads to 26 total groups of libraries distinguishable based on the features extracted from the value of RSA private key(s).
– Detailed evaluation of the dataset on Bayes classifiers in Section 3, with an average accuracy above 47% when only a single key is available, and almost 90% when ten keys are available.
– An analysis of the narrower domains of cryptographic smartcards and libraries used for TLS, resulting in an even higher accuracy, as shown in Section 4.
– Practical analysis of real-world sources of GCD-factorable RSA keys from public TLS servers obtained from internet-wide scans in Section 5.
The paper roadmap has been partly outlined above; Section 7 then surveys related work and Section 8 concludes the paper.
Various design and implementation decisions in the algorithms for generating RSA keys influence the distributions of the produced keys. A specific type of bias was used to identify OpenSSL as the origin of a group of private keys [17]. Systematic studies of a wide range of libraries [23,18] described more reasons for biases in RSA keys in a surprising number of libraries. In the majority of cases, the bias was not strong enough to help factor the keys more efficiently. Previous research [23] identified multiple sources of bias that our observations from a large dataset of private RSA keys confirm:
1. Performance optimizations, e.g., the most significant bits of primes set to a fixed value to obtain RSA moduli of a defined length.
2. Type of primes: probable, strong, and provable primes:
– For probable primes, whether candidate values for primes are chosen randomly or a single starting value is incremented until a prime is found.
– When generating candidates for probable primes, small factors are avoided in the value of p − 1.
– Blum integers are sometimes used for RSA moduli – both RSA primes are congruent to 3 modulo 4.
– For strong primes, the size of the auxiliary prime factors of p − 1 and p + 1 is biased.
– For provable primes, the recursive algorithm can create new primes of double to triple the binary length of a given prime; usually one version of the algorithm is chosen.
3. Ordering of primes: are the RSA primes in the private key ordered by size?
4. Proprietary algorithms, e.g., the well-documented case of the Infineon fast prime key generation algorithm [19].
5. Bias in the output of a PRNG : often observable only from a large numberof keys from the same source;
6. Natural properties of primes that do not depend on the implementation.
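Several of the biases above are mechanical properties that can be checked directly from a private key. As a toy illustration, the Blum-integer condition from item 2 can be sketched as follows (the function name and toy primes are our own, not from the paper's tooling):

```python
def is_blum_modulus(p: int, q: int) -> bool:
    """An RSA modulus n = p*q is a Blum integer iff both primes are
    congruent to 3 modulo 4."""
    return p % 4 == 3 and q % 4 == 3

# Toy primes for illustration: 7 and 11 are both 3 mod 4, so n = 77 is Blum.
print(is_blum_modulus(7, 11))   # True
print(is_blum_modulus(5, 7))    # False: 5 is 1 mod 4
```

A library that always emits Blum integers is immediately separable from one that never does, which is why this single bit is already a useful feature.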
We collected, analyzed, and published the largest dataset of RSA keys with a known origin, covering 70 libraries (43 open-source libraries, 5 black-box libraries, 3 HSMs, 19 smartcards). We both expanded the datasets from previous work [23,18] and generated new keys from additional libraries for the sake of this study. We processed the keys into a unified format and made them publicly available. Where possible, we analyzed the source code of the cryptographic library to identify the basic properties of key generation according to the list above.

We are primarily interested in 2048-bit keys, which is the most commonly used key length for RSA. As in previous studies [23,18], we also generate shorter keys (512 and 1024 bits) to speed up the process, while verifying that the chosen biased features are not influenced by the key size. This makes the keys of different sizes interchangeable for the sake of our study. We assume that repeatedly running the key generation locally approximates the distributed behaviour of many instances of the same library. This model is supported by the measurements taken in [18], where distributions of keys collected from the Internet exhibited the same biases as locally generated keys.
We extended the features used in previous work on public keys to their equivalent properties of private keys:
Feature ‘5p and 5q’: Instead of the most significant bits of the modulus, we use the five most significant bits of the primes p and q. The modulus is defined by the primes, and the primes naturally provide more information. We chose 5 bits based on a frequency analysis of the high bits. Further bits are typically not biased, and reducing the size of this feature prevents an exponential growth of the feature space.
Feature ‘blum’: We replaced the feature of the second least significant bit of the modulus by the detection of Blum integers. Blum integers can be directly identified using the two prime factors. When only the modulus is available, we can rule out the usage of Blum integers, but not confirm it.
Feature ‘mod’: Previous work used the result of the modulus modulo 3. It was known that primes can be biased modulo small primes (due to avoiding small factors of p − 1 and q − 1), and this bias is visible from the modulus only for the factor 3, when the modulus equals 2 modulo 3 [23]. It is not possible to rule out higher factors from just a single modulus. With access to the primes, we can directly check for this bias for all factors. We detected four categories of such bias, each avoiding all small odd prime factors up to a threshold. We use these categories directly by looking at small odd divisors of p − 1 and q − 1.

Feature ‘roca’: We use a specific fingerprint of factorable Infineon keys published in [19].
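A minimal sketch of extracting these private-key features might look as follows. The helper names, the fixed list of small odd primes, and the category encoding are illustrative assumptions; the ROCA fingerprint is omitted, as it requires the dedicated test from [19]:

```python
def top_bits(x: int, k: int = 5) -> int:
    """The k most significant bits of x (features '5p' and '5q')."""
    return x >> (x.bit_length() - k)

def small_factor_category(p: int) -> int:
    """Sketch of the 'mod' feature: the largest small odd prime t such that
    p - 1 avoids all odd primes up to t (returns 1 if 3 divides p - 1).
    The fixed prime list is an illustrative choice, not the paper's."""
    category = 1
    for t in (3, 5, 7, 11, 13, 17):
        if (p - 1) % t == 0:
            break
        category = t
    return category

def extract_features(p: int, q: int) -> dict:
    return {
        "5p": top_bits(p),
        "5q": top_bits(q),
        "blum": p % 4 == 3 and q % 4 == 3,   # feature 'blum'
        "mod_p": small_factor_category(p),
        "mod_q": small_factor_category(q),
    }
```

For example, `extract_features(1009, 1013)` reports the top bits of both primes, that the pair is not a Blum pair (1009 ≡ 1 mod 4), and that 1012 avoids the odd primes 3, 5 and 7 but not 11.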
Fig. 1. Dendrogram for all groups. How the keys from various libraries differ can be depicted by a dendrogram. It tells us, w.r.t. our feature set, how far from each other the probability distributions of the sources are. We can then hierarchically cluster the sources into groups that produce similar keys. The blue line at 0.085 highlights the threshold for differentiating between two sources/groups. This threshold yields 26 groups using our feature set.
Since it is impossible to distinguish sources that produce identically distributed keys, we introduce a process of clustering to merge similar sources into groups. We cluster two sources together if they appear to be using identical algorithms based on the observation of the key distributions. We measure the difference in the distributions using the Manhattan distance.¹ The absolute values of the distances depend on the actual distributions of the features. Large distances correlate with significant differences in the implementations. Note that very small observed distances may be only the result of noise in the distributions instead of a real difference, e.g., due to a smaller number of keys available.

We attempt to place the clustering threshold as low as possible, maximizing the number of meaningful groups. If we are not able to explain why two clusters are separated based on the study of the algorithms and the distributions of the features, the threshold needs to be moved higher to join these clusters. We worked with distributions that assume all features correlated (as in [23]).

The resulting classification groups and the dendrogram are shown in Figure 1. We placed the threshold value at 0.085. By moving it above 0.154, we would lose the ability to distinguish groups 11 and 12. It would be possible to further split group 14, as there is a slight difference in the prime selection intervals used by Crypto++ and Microsoft [23]. However, the difference manifests less than the level of noise in other sources and would require the threshold to be put at 0.052, which would create several false groups. We use the same clustering throughout the paper, although the value of the threshold would change when the features change. Note that different versions of the same library may fall into different groups, mostly because of algorithm changes between these versions. This, for instance, is the case of Bouncy Castle 1.53 and 1.54.

¹ We experimented with Euclidean distance and fractional norms. While Euclidean distance is a proper metric, our experiments showed that it is more sensitive to the noise in the data, creating separable groups out of sources that share the same key generation algorithms. On the other hand, fractional norms did not highlight differences between sources that provably differ in the key generation process.
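The clustering step can be sketched as follows: each source is summarized by a distribution of feature values, and sources within the Manhattan-distance threshold are greedily merged. The toy distributions and library names below are invented; only the 0.085 threshold comes from the text:

```python
def manhattan(d1: dict, d2: dict) -> float:
    """L1 distance between two feature-value distributions."""
    keys = set(d1) | set(d2)
    return sum(abs(d1.get(k, 0.0) - d2.get(k, 0.0)) for k in keys)

def cluster(sources: dict, threshold: float = 0.085) -> list:
    """Greedy single-linkage grouping of sources (name -> distribution)."""
    groups = []
    for name, dist in sources.items():
        for g in groups:
            if any(manhattan(dist, sources[m]) <= threshold for m in g):
                g.append(name)   # close enough to an existing group member
                break
        else:
            groups.append([name])  # no group within threshold -> new group
    return groups

# Toy sources: libA and libB differ only by sampling noise, libC is distinct.
toy = {
    "libA": {"blum": 1.00, "mod3": 0.50},
    "libB": {"blum": 0.96, "mod3": 0.52},
    "libC": {"blum": 0.00, "mod3": 0.67},
}
```

Here `manhattan(toy["libA"], toy["libB"])` is 0.06, below the 0.085 threshold, so the two merge into one group, while libC remains alone. A full implementation would use proper hierarchical linkage over all pairs, as the dendrogram in Figure 1 suggests.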
How accurately we can classify the keys depends on several factors, most notably: the libraries included in the training set, the number of keys available for classification, the features extracted from the classified keys, and the classification model. In this section, we focus on the last factor.
As generating RSA keys is internally a stochastic process, we choose the family of probabilistic models to address the source attribution problem. Since there is no strong motivation for complex machine learning models, we utilize simple classifiers. More sophisticated classifiers could be built based on our findings when the goal is to reach higher accuracy or to more finely discriminate sources within a group. The rest of this subsection describes the chosen models.
Naive Bayes classifier. The first investigated model is a naive Bayes classifier, called naive because it assumes that the underlying features are conditionally independent. Using this model, we apply the maximum-likelihood decision rule and predict the label as ŷ = argmax_y P(X = x | y). Thanks to the naive assumption, we may decompose this computation into ŷ = argmax_y ∏_{i=1}^{n} P(x_i | y) for the feature vector x = (x_1, ..., x_n).

Bayes classifier.
We continue to develop the approach originally used in [23] that used the Bayes classifier without the naive assumption. Several reasons motivate this. First, it allows us to evaluate how much the naive Bayes model suffers from the violated independence assumption (on this specific dataset). Secondly, it gives us access to more precise probability estimates that are needed to classify real-world GCD-factorable keys. Additionally, we can directly compare the classification accuracy of private keys with the case of the public keys from [23]. However, one of the main drawbacks of the Bayes classifier is that it requires exponentially more data with a growing number of features. Therefore, when striving for high accuracy achievable by further feature engineering, one should consider the naive Bayes instead.
Naive Bayes classifier with cross-features. The third investigated option is the naive Bayes classifier where we merged selected features that are known to be correlated into a single feature. In particular, we merged the features of the most significant bits (of p, q) into a single cross-feature. Subsequently, the naive Bayes approach is used. This enables us to evaluate whether merging clearly interdependent features into one will affect the performance of the naive Bayes classifier w.r.t. this specific dataset.
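The naive Bayes decision rule above can be sketched in a few lines, computing argmax_y Σ_i log P(x_i | y) in log-space for numerical stability. The groups, feature names, and probabilities below are invented placeholders, not estimates from the paper's dataset:

```python
import math

# Toy conditional probabilities P(feature value | group); a real model would
# estimate these tables from the training keys of each group.
cond_probs = {
    "openssl-like": {"blum": {True: 0.01, False: 0.99},
                     "mod3": {0: 0.01, 1: 0.99}},
    "blum-library": {"blum": {True: 0.99, False: 0.01},
                     "mod3": {0: 0.50, 1: 0.50}},
}

def classify(key_features: dict) -> str:
    """Naive Bayes: y_hat = argmax_y sum_i log P(x_i | y)."""
    def loglik(group):
        table = cond_probs[group]
        # Clamp probabilities to avoid log(0) for unseen feature values.
        return sum(math.log(max(table[f].get(v, 0.0), 1e-12))
                   for f, v in key_features.items())
    return max(cond_probs, key=loglik)
```

A key whose primes form a Blum pair is pulled strongly toward the hypothetical "blum-library" group, while a non-Blum key lands in "openssl-like"; the Bayes classifier without the naive assumption would instead index one joint table by the whole feature tuple.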
Our training dataset contains 157 million keys and the test set contains 1.8 million keys. We derived the test set by discarding 10 thousand keys of each source from the complete dataset before clustering. This assures that each group has a test set of at least 10 thousand keys. Accordingly, since the groups differ in the number of sources involved, the resulting test dataset is imbalanced. For this reason, we employ the metrics of precision and recall when possible. However, we represent the model performance by the accuracy measure in the tables and in more complex classification scenarios. For group X, the precision can be understood as the fraction of correctly classified keys from group X divided by the number of keys that were marked as group X by our classifier. Similarly, the recall is the fraction of correctly classified keys from group X divided by the total number of keys from group X [11]. We also evaluate the performance of the models under the assumption that the user has a batch of several keys from the same source at hand. This scenario can arise, e.g., when a security audit is run in an organization and all keys are being tested. Furthermore, to react to some often misclassified groups, we additionally provide the answer "this key originates from group X or group Y" to the user (and we evaluate the confidence of these answers).

Comparison of the models.
The overall comparison of all three models can be seen in Table 1. If the precision for some group is undefined, i.e., no key is allegedly originating from this group, we say that the precision is 0. We evaluate the naive Bayes classifier on the same features that were used for the Bayes classifier to measure how much classification performance is lost by introducing the feature independence assumption. A typical example of interdependent features is that the most significant bits of the primes p and q are intentionally correlated to preserve the expected length of the resulting modulus n. Pleasantly, the observed precision (recall) decrease is only 2.3% (about 1%), and the naive Bayes classifier with cross-features achieves a comparable precision with a recall of about 47%.

    Model                          Avg. precision   Avg. recall
    Bayes classifier               43.2%            47.6%
    Naive Bayes                    40.9%            ≈46%
    Naive Bayes (cross-features)   –                ≈47%

Table 1. Performance comparison of the different models on the dataset with all libraries. Note that the precision of a random guess classifier is 3.8% when considering 26 groups.

Section 2 outlined the process of choosing a threshold value that determines the critical distance for distinguishing between distinct groups. Inevitably, the same threshold value directly influences the number of groups after the clustering task. As such, the threshold introduces a trade-off between the model performance and the number of discriminated groups. The smaller the difference between group distributions is, the more similar they are, and the model performance is lower as more misclassification errors occur. The objective of this section is to examine the classification scenario when some prior knowledge is available to the analyst, limiting the origin of the keys to only a subset of all libraries or increasing the likelihood of some. Since Section 3 showed that the Bayes classifier provides the best performance, this section considers only this model.

Prior knowledge can be introduced into the classification process in multiple ways, e.g., by using a prior probability vector that considers some groups more prevalent. We also note that the measurement method of [18] can be used to obtain such prior information, but a relatively large dataset of private keys is required that may not be available. Our work, therefore, considers a different setting where some sources of the keys are ruled out before the classifier is constructed. Such a scenario arises, e.g., when the analyst knows that the scrutinized keys were generated in an unknown cryptographic smartcard.
In such a case, HSMs and other sources of keys can be omitted from the model altogether, which will arguably increase the performance of the classification process. Another example is leaving out libraries that were released after the classified data sample was collected.

We present the classification performance results for three scenarios with a limited number of sources: 1) cryptographic smartcards (Section 4.1), 2) sources likely to be used in the TLS domain (Section 4.2), and 3) a specific case of GCD-factorable keys from the TLS domain, where only one out of two primes can be used for classification (see Section 4.3 for more details). The comparison of models for these scenarios can be seen in Table 2.

    Dataset                   Avg. precision   Avg. recall   Random guess (baseline)
    All libraries             43.2%            47.6%         3.8%
    Smartcards domain         61.9%            64.6%         8.3%
    TLS domain                45.5%            42.2%         7.7%
    Single-prime TLS domain   28.8%            36.2%         11.1%

Table 2. Bayes classifier performance on the complete dataset and three analyzed partitionings of the dataset: the complete dataset with all libraries (All libraries), smartcards only (Smartcards domain), libraries and HSMs expected to be used for TLS (TLS domain), and a specific subset of the TLS domain where only a single prime is available due to the nature of the results obtained by the GCD factorization method (Single-prime TLS domain). A comparison with the random guess as a baseline is provided (here, accuracy equals precision and recall).

To compute these models, we first discard the sources that cannot be the origin of the examined keys according to the prior knowledge of the domain (e.g., smartcards are not expected in TLS). Next, we re-compute the clustering task to obtain fewer groups than on the dataset with all libraries. Finally, we compute the classification tables for the reduced domain and evaluate the performance.
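The domain restriction (and, optionally, a prior probability vector) can be sketched as follows; the group names, likelihoods, and prior values are illustrative assumptions, not figures from the paper:

```python
import math

# Toy likelihoods P(observed features | group) for a key under scrutiny.
likelihood = {"OpenSSL": 0.020, "mbedTLS": 0.012, "Infineon-card": 0.030}

# Toy prior knowledge, e.g., derived from library popularity measurements.
prior = {"OpenSSL": 0.70, "mbedTLS": 0.25, "Infineon-card": 0.05}

def classify(domain: list, use_prior: bool = True) -> str:
    """MAP decision restricted to the allowed `domain` subset of groups."""
    def score(g):
        s = math.log(likelihood[g])
        if use_prior:
            s += math.log(prior[g])   # posterior ∝ prior × likelihood
        return s
    return max(domain, key=score)

# A TLS analyst rules out smartcards before classification:
tls_domain = ["OpenSSL", "mbedTLS"]
```

With no prior and no restriction the highest raw likelihood ("Infineon-card") wins, whereas restricting to the TLS domain and weighting by the prior attributes the same key to OpenSSL, illustrating how prior knowledge reshapes the verdict.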
Fig. 2. Dendrogram for the smartcard domain. The clustering of smartcard sources yields 12 separate groups:
Group 1 (Infineon JTOP 80K)
Group 2 (G&D SmartCafe 4.x, G&D SmartCafe 6.0)
Group 3 (G&D SmartCafe 3.2)
Group 4 (NXP J2D081, NXP J2E145G (fingerprint 131))
Group 5 (NXP J2D081, NXP J2E145G (fingerprint 251))
Group 6 (Feitian JavaCOS A22, Feitian JavaCOS A40, Oberthur Cosmo 64)
Group 7 (Taisys SIMoME VAULT)
Group 8 (Gemalto GXP E64)
Group 9 (Athena IDProtect, Gemalto GCX4 72K)
Group 10 (Oberthur Cosmo Dual 72K)
Group 11 (NXP J2A080, NXP J2A081, NXP J3A081, NXP JCOP 41 V2.2.1)
Group 12 (G&D StarSign)
The clustering task in the smartcard domain yields 12 recognizable groups for 19 different smartcard models, as shown in Figure 2. The training set for this limited domain contains 20.6 million keys, whereas the test set contains 340 thousand keys. On average, 61.9% precision and 64.6% recall is achieved. Moreover, 8 out of 12 groups achieve >50% precision. Additionally, the classifier exhibits 100% recall on 3 specific groups: a) Infineon smartcards (before 2017, with the ROCA vulnerability [19]), b) G&D SmartCafe 4.x and 6.0, and c) the newer G&D SmartCafe 7.0. Figure 3 shows the so-called confusion matrix, where each row corresponds to the percentage of keys in an actual group while each column represents the percentage of keys in a predicted group.
Fig. 3. The confusion matrix for the classifier of a single private key generated in the smartcard domain (rows: true group; columns: group predicted by our model). A given row corresponds to a vector of observed relative frequencies with which keys generated by a specific group are classified as the individual predicted groups. For example, groups 1 and 2 have no misclassifications (high accuracy), while keys of group 3 are in 33% of cases misclassified as keys from group 2. On average, we achieve 64.6% accuracy. The darker a cell is, the higher the number it contains; this holds for all figures in this paper.
As expected, the results represent an improvement when compared to the dataset with all libraries. When one has ten keys from the same card at hand, the expected recall is over 90% on 10 out of 12 groups. The full table of results can be found in the project repository.

Interestingly, 512- and 1024-bit keys generated by the same NXP J2E145G card (similarly also for NXP J2D081) fall into different groups.² The main difference is in the modular fingerprint (the avoidance of small factors in p − 1 and q − 1), observed only for larger keys. Such behaviour was not observed for other libraries, but it highlights the necessity of collecting different key lengths in the training dataset when one analyzes black-box proprietary devices or closed-source software libraries.

To summarize, the classification of private keys generated by smartcards is very accurate due to the significant differences resulting from the proprietary, embedded implementations of the different vendors. The observed differences likely result from the requirement of a smaller footprint in low-resource devices.

² This is an exception to the observation that the selected features behave independently of key length. Otherwise, keys of different lengths can be used interchangeably.
For the TLS domain, we excluded all the libraries and devices unlikely to be used to generate keys for TLS servers. All smartcards are excluded, together with highly outdated or purpose-specific libraries like PGP SDK 4. All hardware security modules (HSMs) are present, as they may be used as TLS accelerators or high-security key storage. In summary, we started with 17 separate cryptographic libraries and HSMs, inspected in a total of 134 versions. The clustering resulted in 13 recognizable groups, as shown in Figure 4.

The domain training set contains 121.8 million keys and the test set contains 1.3 million keys. On average, the classifier achieves 45.5% precision and 42.2% recall, with several of the 13 groups achieving >50% precision. OpenSSL (by far the most popular library used by servers for TLS [18]) has 100% recall, making the classification of OpenSSL keys very reliable. Complete results can be found in the project repository.

To summarize, we correctly classify more keys in the more specific TLS domain than with the full-dataset classifier. Additionally, the user can be more confident about the decisions of the TLS-specific classifier.
Group 1 (OpenSSL)
Group 2 (OpenSSL (8-bit fingerprint))
Group 3 (Sage Blum, Sage Provable)
Group 4 (Mocana)
Group 5 (mbedTLS)
Group 6 (Bouncy Castle 1.53, SunRsaSign)
Group 7 (Bouncy Castle 1.54, Mocana, Thales)
Group 8 (SafeNet, cryptlib)
Group 9 (Libgcrypt, Libgcrypt FIPS)
Group 10 (Botan, LibTomCrypt, Nettle 3.2, Nettle 3.3, OpenSSL FIPS, WolfSSL)
Group 11 (Crypto++, Microsoft)
Group 12 (Utimaco)
Group 13 (Nettle 2.0, Sage Default)

Fig. 4. Dendrogram for the TLS domain. The clustering of the sources from the TLS domain yields 13 separate groups.

The rest of this section is motivated by a setting where one wants to analyze a batch of correlated keys. Specifically, we assume a case of k ≥ 2 private keys (p_1, q_1), ..., (p_k, q_k) generated by the same source, where p_1 = p_2 = ... = p_k. This scenario emerges in Section 5 and cannot be addressed by the previously considered classifiers. If applied, the results would be drastically skewed, since the classifier would consider each of the p_i separately, putting half of the weight on the shared prime. For that reason, we train a classifier that works on single primes rather than on complete private keys. Instead of feeding the classifier with a batch of k private keys, we supply it with a batch of k + 1 unique primes from those keys. The selected features were modified accordingly: we extract the 5 most significant bits of the unique prime and its second least significant bit, and compute the ROCA and modular fingerprints for the single prime. We trained the classifier on the learning set limited to the TLS domain, as in Section 4.2.

On average, we achieve 28.8% precision and 36.2% recall when classifying a single prime. Table 3 shows the accuracy results in more detail. It should, however, be stressed that this classifier is meant to be used on batches of many keys at once. When considering a batch of k ≥ 10 primes, the accuracy is more than 77%. The decrease in accuracy compared to Section 4.2 can be explained by the loss of information from the second prime. The features mod and blum are much less reliable when using only one prime. Since we can compute the most significant bits from only a single prime at a time, we also lose the information about the ordering of primes (the ‘5p’ and ‘5q’ features are correlated). These facts result in only nine separate groups of libraries being distinguishable. The following groups from the TLS domain are no longer mutually distinguishable: 5 and 13, 7 and 11, 8 and 9 and 10.

Table 3. Classification accuracy for single-prime features evaluated on the TLS domain, per group and for increasing numbers of primes in a batch.
The presented methodology has several limitations:
Classification of an unseen source.
Not all existing sources of RSA keys are present in our dataset for the clustering analysis and classification. This means that attempting to classify a key from a source not considered in our study will bring unpredictable results. The new source may either populate some existing group or have a unique implementation, thus creating a new group. In both cases, the behaviour of the classifier is unpredictable.
Granularity of the classifier.
There are multiple libraries in a single group. The user is therefore not shown the exact source of the key, but the whole group instead. This limitation has two main reasons: 1) Some sources share the same implementation and thus cannot be told apart. 2) The list of utilized features is narrow. There are infinitely many possible features in principle, and some may hide valuable information that could further improve the model performance. Nevertheless, the proposed methodology allows for an automatic evaluation of features using the naive Bayes method, which shall be considered in future work.
Human factor.
The clustering task in our study requires human knowledge. To be specific, the value of the threshold that splits the libraries into groups (for a particular feature) is established only semi-automatically. We manually confirmed the threshold when we could explain the difference between the libraries, or moved it otherwise. In summary, this complicates a fully automatic evaluation of a large number of potential features. Once solved, the relative importance of the individual features could be measured.
Previous research [16,14,13,2] demonstrated that a non-trivial fraction of the RSA keys used on publicly reachable TLS servers is generated insecurely and is practically factorable. This is because the affected network devices were found to independently generate RSA keys that share a single prime or both primes. While an efficient factorization algorithm for general RSA moduli is unknown, when two keys accidentally share one prime, efficient factorization is possible using the Euclidean algorithm to find their GCD. (Note that keys sharing both primes are not susceptible to this attack, but reveal their private keys to all other owners of the same RSA key pair.) Still, the current number of public keys obtained from crawling TLS servers is too high to allow for the investigation of all possible pairs. However, the distributed GCD algorithm [15] allows analyzing hundreds of millions of keys efficiently. Its performance was sufficient to analyze all keys collected from IPv4-wide TLS scans [21,5] and resulted in almost 1% of factorable keys in the scans collected at the beginning of the year 2016.

After the detection of GCD-factorable keys, the question of their origin naturally followed. Previous research addressed it using two principal approaches: 1) an analysis of the information extractable from the certificates of GCD-factorable keys, and 2) matching specific properties of the factored primes with primes generated by a suspected library, OpenSSL. The first approach allowed the detection of a range of network routers that seeded their PRNG shortly after boot without enough entropy, which caused them to occasionally generate a prime shared with another device.
These routers contained a customized version of the OpenSSL library, which was confirmed with the second approach, since the OpenSSL code intentionally avoids small factors of p − 1: keys lacking this fingerprint therefore do not originate from the OpenSSL library.

Two assumptions must be met to employ the classifier studied in Section 4.3. First, we assume that when a batch of GCD-factored keys shares a prime, they were all generated by sources from a single classification group. This conjecture is suggested in [13,14] and supported by the fact that when distinct libraries differ in their prime generation algorithm, they will produce different primes even when initialized from the same seed. On the other hand, when they share the same generation algorithm, they inevitably fall into the same classification group. Second, we assume that if the malformed keys share only a single prime, the PRNG was reseeded with enough entropy before the second prime got generated. This is suggested by the failure model studied for OpenSSL in [14] and implies that the second prime is generated as it normally would be.

Leveraging these conjectures, the rest of this section tracks the libraries responsible for GCD-factorable keys while not relying on the information in the certificates. First, we describe the dataset gathering process, as well as the factorization of the RSA public keys. Later, the successfully factored keys are analyzed, followed by a discussion of the findings.

The input dataset with public RSA keys (both secure and vulnerable ones) was obtained from the Rapid7 archive. All scans between October 2013 and July 2019 (mostly at one- or two-week intervals) were downloaded and processed, resulting in slightly over 170 million certificates. Only public RSA keys were extracted, and duplicates removed, resulting in 112 million unique moduli. On this dataset, the fastgcd [15] tool based on [3] was used to factorize the moduli into private keys. A detailed methodology of this procedure is discussed in Appendix B.
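The OpenSSL-style fingerprint mentioned above amounts to a divisibility check on p − 1. A minimal sketch follows; note that the bound is illustrative and does not match OpenSSL's actual compile-time prime table:

```python
def odd_primes_up_to(n):
    """Sieve of Eratosthenes, returning odd primes <= n."""
    sieve = bytearray([1]) * (n + 1)
    sieve[0] = sieve[1] = 0
    for i in range(2, int(n ** 0.5) + 1):
        if sieve[i]:
            sieve[i * i::i] = bytearray(len(sieve[i * i::i]))
    return [i for i in range(3, n + 1, 2) if sieve[i]]

def lacks_small_factors(p, bound=100):
    """True if p - 1 has no odd prime factor <= bound.

    OpenSSL rejects prime candidates failing this test, so primes carrying
    the fingerprint always pass, whereas primes from other libraries fail
    it with noticeable probability. The bound here is illustrative only."""
    return all((p - 1) % q != 0 for q in odd_primes_up_to(bound))

print(lacks_small_factors(227))  # True: 226 = 2 * 113, no small odd factor
print(lacks_small_factors(7))    # False: 6 is divisible by 3
```

Observing even one prime that fails this test rules out an OpenSSL-like generator for the whole batch, which is the logic used to attribute the router keys.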
If the precision and recall of our classifier were 100%, one could process the factored keys one by one, establish their origin library, and thus detect all sources of insecure keys. But since the classification accuracy of the single-prime TLS classifier with a single key is only 36%, we apply three adjustments: 1) batch the GCD-factorable keys sharing the same prime (believed to be produced by the same library); 2) analyze only the batches with at least 10 keys (therefore with high expected accuracy); 3) limit the set of the libraries considered for classification only to the single-prime TLS domain. Since the keys from the OpenSSL library were already extensively analyzed by [13], we use the mod feature to reliably mark and exclude them from further analysis. By doing so, we concentrate primarily on the non-OpenSSL keys that were not yet attributed.

The exact process for the classification of factored keys in batches is as follows:
1. Factorize public keys from a target dataset (e.g., Rapid7) using the fastgcd tool.
2. Form batches of factored keys that share a prime and assume that they originate from the same classification group.
3. Select only the batches with at least k keys (e.g., 10).
4. Separate batches of keys that all carry the OpenSSL fingerprint. As a control experiment, they should classify only to a group with the OpenSSL library.
5. Separate batches without the OpenSSL fingerprint. This cluster contains yet unidentified libraries.
6. Classify the non-OpenSSL cluster using the single-prime TLS classifier.
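Steps 2 and 3 of the process above can be sketched as a simple grouping pass. The input format (a mapping from modulus to its recovered shared prime) is hypothetical; in practice it would be parsed from the fastgcd output:

```python
from collections import defaultdict

def batch_factored_keys(factored, k=10):
    """Group GCD-factored keys by their shared prime (step 2) and keep
    only batches with at least k keys (step 3).

    `factored` maps modulus -> shared prime recovered by batch GCD; this
    input format is an illustrative assumption, not the fastgcd format."""
    batches = defaultdict(list)
    for modulus, prime in factored.items():
        batches[prime].append(modulus)
    return {p: mods for p, mods in batches.items() if len(mods) >= k}

# Toy example: three moduli share the prime 3, one modulus shares nothing often enough.
factored = {15: 3, 21: 3, 33: 3, 35: 5}
print(batch_factored_keys(factored, k=2))  # -> {3: [15, 21, 33]}
```

Steps 4 and 5 would then partition the surviving batches by whether every prime in the batch passes the OpenSSL small-factor test, before handing the non-OpenSSL cluster to the single-prime TLS classifier (step 6).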
Table 4. Keys that share a prime factor belong to the same batch. Classification of most batches resulted in OpenSSL as the likely source. The rest of the batches were likely generated by libraries in the combined group 8|10.

Group(s)                                    Number of batches
8|10 (various libraries, see Figure 4)      278
3; 4; 6; 12; 5|13; 7|11                     0 (improbable)

In total, we analyzed more than 82 thousand primes divided into 2511 batches. While each batch has at least 10 keys in it, the median batch size is 15. Among the batches, 88.8% exhibit the OpenSSL fingerprint. This number confirms the previous finding by [13] that also captured the OpenSSL-specific fingerprint in a similar fraction of keys. (Note that without using the single-prime model, the results would be biased, as the shared prime would be considered multiple times in the classification process.) We attribute three other batches as coming from OpenSSL with the 8-bit fingerprint, i.e., an OpenSSL library compiled to test and avoid only small divisors of p − 1. Since no batches were classified to the groups 3; 4; 6; 12; 5|13, or 7|11, it is very improbable that any GCD-factorable keys originate from the respective sources in these libraries.
The fingerprinting of devices based on their physical characteristics, exposed interfaces, behaviour in non-standard or undefined situations, errors returned, and a wide range of other side-channels is a well-researched area. Experience shows that finding a case of non-standard behaviour is usually possible, while making a group of devices indistinguishable is very difficult due to an almost infinite number of observable characteristics, resulting in an arms race between device manufacturers and fingerprinting observers.

Having a device fingerprinted helps to better understand a complex ecosystem, e.g., by quantifying the presence of interception middle-boxes on the internet [9], the types of clients connected, or the versions of operating systems used. Differences may help point out subverted supply chains or counterfeit products.

When applied to the study of cryptographic keys and cryptographic libraries, researchers devised a range of techniques to analyze the fraction of encrypted connections, the prevalence of particular cryptographic algorithms, and the chosen key lengths or cipher suites [8,10,2,1,12,4,24]. Information about a particular key is frequently obtained from the metadata of its certificate.

Periodical network scans allow assessing the impact of security flaws in practice. The population of OpenSSL servers with the Heartbleed vulnerability was measured and monitored by [7], and real attempts to exploit the bug were surveyed. If the necessary information is coincidentally collected and archived, even a backward introspection of a vulnerability in time might be possible.

The simple test for the ROCA vulnerability in public RSA keys allowed measuring the fraction of citizens of Estonia who held an electronic ID supported by a vulnerable smartcard, by inspecting the public repository of eID certificates [19].
The fingerprinting of keys from smartcards was used to detect that private keys were generated outside of the card and injected later into the eIDs, despite the requirement to have all keys generated on-card [20].

The attribution of a public RSA key to its origin library was analyzed by [23]. Measurements on large datasets were presented in [18], leading to an accurate estimation of the fraction of cryptographic libraries used in large datasets like IPv4-wide TLS. While both [23] and [18] analyze public keys, private keys can also be obtained under certain conditions of a faulty random number generator [16,6,13,14,22]. The origin of weak factorable keys needs to be identified in order to notify the maintainers of the code to fix the underlying issues. For that purpose, a combination of key properties and values from certificates was used.
We provide what we believe is the first wide examination of the properties of RSA keys with the goal of attributing a private key to its origin library. The attribution is applicable in multiple scenarios, e.g., to the analysis of GCD-factorable keys in the TLS domain. We investigated the properties of keys as generated by 70 cryptographic libraries, identified biased features in the primes produced, and compared three models based on Bayes classifiers for the private key attribution. The information available in private keys significantly increases the classification performance compared to the result achieved on public keys [23]. Our work enables distinguishing 26 groups of sources (compared to 13 on public keys) while increasing the accuracy more than twice w.r.t. random guessing. When 100 keys are available for the classification, the correct result is almost always provided.

Acknowledgements
The authors would like to thank the anonymous reviewers for their helpful comments. P. Svenda and V. Matyas were supported by the Czech Science Foundation project GA20-03426S. Some of the tools used and other people involved were supported by the CyberSec4Europe Competence Network. Computational resources were supplied by the project e-INFRA LM2018140.
References
1. Albrecht, M.R., Degabriele, J.P., Hansen, T.B., Paterson, K.G.: A surfeit of SSH cipher suites. In: Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security. pp. 1480–1491. ACM (2016)
2. Barbulescu, M., Stratulat, A., Traista-Popescu, V., Simion, E.: RSA weak public keys available on the Internet. In: International Conference for Information Technology and Communications. pp. 92–102. Springer-Verlag (2016)
3. Bernstein, D.J.: How to find smooth parts of integers (2004), [cit. 2020-07-13]. Available from http://cr.yp.to/papers.html
A Detailed discussion of classifier results
Some groups are accurately classified and rarely misclassified even with a single key available: namely, group 1 (Infineon prior to 2017, distinct because of the ROCA fingerprint), group 2 (Giesecke & Devrient SmartCafe 4.x and 6.0), group 24 (standard OpenSSL without the FIPS module enabled), and group 26 (Giesecke & Devrient SmartCafe 7.0) are all classified with more than 96% recall. Groups 1, 2, and 26 also rarely appear as false positives for the origin library. The keys from group 25 (OpenSSL avoiding only 8-bit small factors in p − 1) are frequently misclassified as group 24, which still identifies the origin library correctly and only misidentifies the OpenSSL compile-time configuration.

In contrast, keys from groups 7, 10, 11, 14, 15, and 17 are almost always misclassified (less than 8% recall, some even less than 1%). However, as discussed in the next section, if some additional information is available and can be considered, this misclassification can be largely remediated. Keys from group 7 (Libgcrypt) are mostly misclassified as group 6 (PGP SDK 4). Group 15 (a large group with multiple frequently used libraries) is mostly misclassified as group 12 (the Taisys card).

Table 5. The average classification accuracy of the best performing Bayes classifier, reported as Top 1, Top 2, and Top 3 match. In the i-th column we consider a classifier successful if the true source of the key is among the i best guesses of our model. Similarly, for each of the 3 columns, we evaluate the success rate for different numbers of keys available from the same source.

B Obtaining dataset of GCD-factorable keys
The fastgcd [15] tool based on [3] was used to perform the search for the GCD-factorable keys. Only valid RSA keys were considered. (The factorization occasionally finds small prime factors, likely because the public key in the certificate was damaged, e.g., by a bit flip.) Running the fastgcd tool on a high number of keys (around 112 million for the Rapid7 dataset) requires an extensive amount of RAM. Running the tool on a machine with 500 GB of RAM resulted in only a few factored keys, all sharing just tiny factors, while the tool did not produce any errors or warnings. The same computation on a subset of 10 million keys revealed a substantial number of large factors. Likely, the fastgcd