Characterizing Entities in the Bitcoin Blockchain

Abstract

Bitcoin has created a new exchange paradigm within which financial transactions can be trusted without an intermediary. This premise of a free decentralized transactional network however requires, in its current implementation, unrestricted access to the ledger for peer-based transaction verification. A number of studies have shown that, in this pseudonymous context, identities can be leaked based on transaction features or off-network information. In this work, we analyze the information revealed by the pattern of transactions in the neighborhood of a given entity transaction. By definition, these features which pertain to an extended network are not directly controllable by the entity, but might enable leakage of information about transacting entities. We define a number of new features relevant to entity characterization on the Bitcoin Blockchain and study their efficacy in practice. We show that even a weak attacker with shallow data mining knowledge is able to leverage these features to characterize the entity properties.



Marc Jourdan

IBM Research, Singapore

10 Marina Boulevard 18983

Sebastien Blandin

IBM Research, Singapore

10 Marina Boulevard 18983

Laura Wynter

IBM Research, Singapore

10 Marina Boulevard 18983

Pralhad Deshpande

IBM Research, Singapore

10 Marina Boulevard 18983


Index Terms: Bitcoin, Privacy, Pattern classification, Bipartite graph.

I. INTRODUCTION

Bitcoin [18] stands out as the first global decentralized currency, and has seen spectacular growth recently, as illustrated by the exponential shape of the value of a transaction fee over the year, see Figure 1. The underlying Bitcoin data structure, the Blockchain, has been perceived as a catalyst for the emergence of a broad range of decentralized applications, from crypto-currency exchanges to decentralized autonomous organizations (DAO) and tokens; see [1] for a comprehensive review. However, one of the most compelling applications remains the promise of a global decentralized currency, supporting large portions of the global economy.

Figure 1. Transaction fee in USD over the year.

A. Emergence of a decentralized global currency

As presented in [13], after only a few years, the bitcoin network has emerged to reflect a complex global payment system with new forms of players representing the traditional actors of the established financial infrastructure. Global transactions between Bitcoin exchanges, illustrated in Figure 2, have become drastically more complex in recent years. It is important for regulatory purposes as well as for the development of new crypto-currency paradigms to understand Bitcoin anonymity properties. Indeed, a healthy economy requires application of the rule of law on financial transactions [6], which usually entails traceability and transparency on identities. On the other hand, in order to attract users, privacy considerations associated with crypto-currency have to be adequate and competitive with privacy properties of transactions using fiat money.

We say that a transaction medium is private if no information on the transacting entity is revealed by the transaction graph. In the Bitcoin context, each individual transaction is publicly associated with a pseudonymous identity. Hence the strongest possible guarantee is that the set of transactions associated with a given pseudonymous identity does not reveal information about the hidden transacting entity.

The main contribution of this work is the illustration that patterns of transactions involving the transacting entity, but also patterns of neighboring transactions (on the transaction graph), are characteristic of the entity, in that these patterns can be turned into a fingerprint of transacting entity classes. We further give evidence that using these neighboring transaction patterns allows reaching state-of-the-art entity classification results.

B. Related work

Because the Bitcoin transaction graph is completely public, it is clear that Bitcoin flows can be traced, although this is limited by merging transactions [24]. Recent analysis of ransomware attacks and associated bitcoin transactions can be found in [9], [16], [21].

Furthermore, joining Bitcoin transactions with off-network activities can contribute to linking identities of certain Bitcoin transactions [23]; for instance, the authors of [10] link Bitcoin users with Tor hidden services.

A number of studies are focused on the problem of address clustering, which consists of identifying the address set associated with a transacting entity. These studies typically rely on heuristics motivated by properties of the Bitcoin Blockchain protocol, such as the requirement of a unique signature for transaction inputs, see for instance [4], [14] and references therein for more details. The authors of [8] highlight that these methods crucially depend on address re-use behavior.

Figure 2. Bitcoin exchanges transactions: for the month of March (top) and March (bottom). The width of the edge is proportional to the total value of transactions during the month for the associated exchange pair (the size of the nodes is arbitrary).

Another attack motivated by the Bitcoin protocol involves the inference of peer-to-peer communication structure [5] and information leakage from the message dissemination pattern. A different type of attack is presented in [20], where the authors show that statistical analysis of bloom filters could help to identify the set of addresses owned by Bitcoin wallets. It has also been shown that patterns of newly minted bitcoins could play a role in revealing information of certain Bitcoin users [15].

Thematically closer to our work, several studies have considered the extent to which data mining methods can be applied to the entity characterization problem, with for instance the use of transaction-specific features in [26], able to achieve accuracy for classifying entities into several types. In [22], the authors introduce the notion of transaction motifs with application to the detection of bitcoin exchanges, and achieve greater than accuracy.

C. Contributions of this work

Motivated by the success of [22], we consider the more general problem of classifying entities into multiple classes, based on extended transaction neighborhood properties. The study of transaction graph neighborhood structure is promising for at least two reasons.

First, the use of graph network features has been shown to be efficient for graph learning problems [27]. The promise of such methods is illustrated by the spread of graph databases and related applications [25]. The authors of [19], for example, are able to successfully re-identify nodes from a noisy graph structure provided as part of a Kaggle contest.

Second, if neighboring network transactions are confirmed to be informative of entity identities, they constitute a fundamental limit to Bitcoin privacy guarantees. Indeed, while individuals have control of their personal address usage and transaction patterns, they do not have control of the behavior of the entities they are transacting with. While services such as CoinJoin serve to limit information leakage of Bitcoin flows via the merging of individual transactions [17], there is at this stage no service providing full neighborhood obfuscation.

The main contributions of this work are the following:
• definition of novel features for entity classification from a graph neighborhood perspective,
• analysis of the performance of various classification methods,
• discussion of the implications of these entity characterization results in the context of Bitcoin anonymity.

The structure of the paper is as follows. In Section II we define the graph model we propose for the Bitcoin Blockchain. In Section III, we present our classification models as well as details of the graph neighborhood features developed. Finally, we provide in Section IV numerical results that demonstrate the effectiveness of entity characterization using actual Bitcoin Blockchain data. Section V provides concluding remarks.

II. BLOCKCHAIN GRAPH MODEL

The Bitcoin Blockchain is a succession of blocks $\mathcal{B}$. Each block $b \in \mathcal{B}$ contains a set of transactions $\mathcal{T}(b) = \{t_i\}_i \subset \mathcal{T}$. We first present a bipartite address-transaction graph model, and then explain how we derive a discrete-time entity-transaction graph model with features relevant to the entity characterization program.

A. Address-transaction graph

We model the Bitcoin Blockchain as a directed weighted bipartite graph $H = (\mathcal{A}, \mathcal{T}, \mathcal{L})$, where $a \in \mathcal{A}$ represents an address, $t \in \mathcal{T}$ represents a transaction between addresses, and $l \in \mathcal{L}$ represents an edge between an address and a transaction. We partition the edge set into edges incoming to a transaction and edges outgoing from a transaction, as follows: $l \in \mathcal{L} = \mathcal{I} \cup \mathcal{O}$, where $\mathcal{I}$ stands for input edges, i.e. incoming edges of a vertex $t$, and $\mathcal{O}$ for output edges, i.e. outgoing edges of a vertex $t$. Multiple edges can exist between a pair $(a, t) \in \mathcal{A} \times \mathcal{T}$.

For each transaction $t \in \mathcal{T}$, we define $\mathcal{I}(t) \subset \mathcal{I}$ and $\mathcal{O}(t) \subset \mathcal{O}$ as the corresponding edges between the transaction and the associated addresses. For each $i \in \mathcal{I}$ (resp. $o \in \mathcal{O}$), we uniquely define the associated transaction $t(i)$ (resp. $t(o)$) and address $a(i)$ (resp. $a(o)$); these are well defined because the graph is bipartite.

We ignore the address and transaction fields that are specific to the protocol or not relevant to our study, see for instance [2] for related graph models for permissioned blockchains, or [3] for a more fundamental analysis of a transaction graph model and inherited protocol properties. In this work, we consider the following vertex and edge properties:
• edge: inputs, $\mathcal{I}$ (resp. outputs, $\mathcal{O}$), are represented by their amount in BTC, $v(i)$ for $i \in \mathcal{I}$ ($v(o)$ for $o \in \mathcal{O}$), sent (resp. received) by an address, $a(i)$ (resp. $a(o)$), for a given transaction, $t(i)$ (resp. $t(o)$),
• transaction vertex: a transaction contained in a block $b(t)$ has a fee, $f(t)$, and a time $\tau(t)$, corresponding to the time when the block it belongs to is validated, $\tau(t) = \tau(b(t))$,
• address vertex: an address has a creation date and a balance.

In the next section we explain how we derive the entity-transaction graph from the address-transaction graph.

B. Entity-transaction graph

In the Bitcoin Blockchain a user may employ several addresses. We therefore introduce the concept of an "entity", where an entity $e$ is fully characterized by a set of addresses $\mathcal{A}(e) = \{a^{(e)}_i\}_i$, which can be interpreted as a logical user. We discuss subsequently the various methods for defining these sets, such as the common spending heuristic.

The entity-transaction graph is a directed weighted bipartite graph $G = (\mathcal{E}, \mathcal{T}_e, \mathcal{L}_e)$ of entities and associated transactions. Let $e \in \mathcal{E}$ denote an entity and $t \in \mathcal{T}_e$ a transaction between entities. Entities and transactions are connected by edges $l \in \mathcal{L}_e = \mathcal{I}_e \cup \mathcal{O}_e$, where $\mathcal{I}_e$ stands for input edges and $\mathcal{O}_e$ represents output edges.

As with addresses, for each transaction $t \in \mathcal{T}_e$ we define $\mathcal{I}_e(t) \subset \mathcal{I}_e$ and $\mathcal{O}_e(t) \subset \mathcal{O}_e$ as the corresponding edges between the transactions and entities. For each $i \in \mathcal{I}_e$ (resp. $o \in \mathcal{O}_e$), as for the address-transaction graph, we uniquely define the associated transaction $t(i)$ (resp. $t(o)$) and entity $e(i)$ (resp. $e(o)$). Inputs, $\mathcal{I}_e$ (resp. outputs, $\mathcal{O}_e$), represent the amount in BTC, $v(i)$ for $i \in \mathcal{I}_e$ ($v(o)$ for $o \in \mathcal{O}_e$), sent (resp. received) by an entity, $e(i)$ (resp. $e(o)$), for a given transaction. We define the set of addresses, $\mathcal{A}(\mathcal{I}_e(t))$ (resp. $\mathcal{A}(\mathcal{O}_e(t))$), associated with a set of inputs (resp. outputs). The mechanism to build the entity-transaction graph from the address-transaction graph is illustrated in Figure 3.

Figure 3. Entity graph (right), obtained by aggregation of the address graph (left).
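As an illustration of the aggregation shown in Figure 3, the following Python sketch collapses address-level edges into entity-level edges by summing BTC values and counting how many address edges were merged. The tuple layout, the function name `aggregate_to_entity_graph`, and the treatment of unclustered addresses as singleton entities are assumptions made for this example; they are not part of the model itself.

```python
from collections import defaultdict


def aggregate_to_entity_graph(address_edges, address_to_entity):
    """Collapse address-level edges into entity-level edges.

    address_edges: iterable of (kind, address, tx_id, value_btc),
        with kind in {"in", "out"} (hypothetical layout).
    address_to_entity: dict mapping an address to its entity id.
    Returns {(kind, entity, tx_id): {"value": ..., "n_address_edges": ...}}.
    """
    entity_edges = defaultdict(lambda: {"value": 0.0, "n_address_edges": 0})
    for kind, address, tx_id, value in address_edges:
        # Assumption: an address with no known cluster is its own entity.
        entity = address_to_entity.get(address, address)
        edge = entity_edges[(kind, entity, tx_id)]
        edge["value"] += value          # continuous property: summed
        edge["n_address_edges"] += 1    # count of merged address edges
    return dict(entity_edges)


edges = [
    ("in", "a1", "t1", 0.5),
    ("in", "a2", "t1", 1.5),
    ("out", "a3", "t1", 1.9),
]
clusters = {"a1": "e1", "a2": "e1", "a3": "e2"}
entity_graph = aggregate_to_entity_graph(edges, clusters)
```

In this toy input, the two input edges from a1 and a2 become a single input edge from entity e1 to transaction t1.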

The vertices and edges of the entity-transaction graph inherit the properties of the address-transaction graph by standard summation of continuous properties and aggregation of graph topology. Additionally, we consider that vertices and edges of the entity-transaction graph inherit some statistics of the address-transaction graph that are lost by summation (e.g. an entity graph edge inherits the count of address graph edges which it subsumes).

We further define a set of categories $c \in \mathcal{C}$, corresponding to entity class labels. We shall consider in the remainder of this work the entity classification problem, namely identification of the class label $c \in \mathcal{C}$ associated with an entity $e \in \mathcal{E}$, based on the address-transaction graph.

C. Discrete time graph

We apply temporal aggregation to the directed weighted bipartite graph of entities $G = (\mathcal{E}, \mathcal{T}_e, \mathcal{L}_e)$ to obtain a discrete-time data structure with commensurate properties.

Definition 1 (Discrete-time operator). A discrete-time aggregation operator is an operator
$$\Delta^{(\tau_1,\tau_2)} : G \mapsto G^{(\tau_1,\tau_2)}$$
where the time-aggregated entities and transactions $\mathcal{E}^{(\tau_1,\tau_2)}$, $\mathcal{T}_e^{(\tau_1,\tau_2)}$ are derived from $\mathcal{E}$, $\mathcal{T}_e$, $\mathcal{L}_e$ and required to satisfy the following constraints:
• Entity activity: only entities transacting during the time period are considered.
$$\mathcal{E}^{(\tau_1,\tau_2)} = \{ e \in \mathcal{E} : \exists t \in \mathcal{T}_e, \ \tau(t) \in [\tau_1, \tau_2] \wedge ((e, t) \in \mathcal{I}_e(t) \vee (t, e) \in \mathcal{O}_e(t)) \}$$
• Transaction: the edges, $(e_i, e_j) \in \mathcal{T}_e^{(\tau_1,\tau_2)}$, represent the aggregation of transactions between two entities within the time window, $\{ t \in \mathcal{T}_e : \tau(t) \in [\tau_1, \tau_2] \wedge e(\mathcal{I}_e(t)) = e_i \wedge \exists o \in \mathcal{O}_e(t), \ e(o) = e_j \}$.

In the following, we work with the discrete-time graph $G^{(\tau_1,\tau_2)} = (\mathcal{E}^{(\tau_1,\tau_2)}, \mathcal{T}_e^{(\tau_1,\tau_2)})$ obtained by applying the discrete-time operator, with typical discrete time intervals of a day, a week, or a month.

D. Motifs

The notion of motif in the Bitcoin Blockchain was introduced in [22] as a useful concept in entity classification studies, specifically in the context of Exchange detection. The authors define the so-called 2-motif. In this work, we consider more generally the case of the N-motif and present a few relevant special cases.

Definition 2 (1-motif). A 1-motif is a path of length 2 on the entity-transaction graph, $(e_1, t, e_2) \in \mathcal{E} \times \mathcal{T}_e \times \mathcal{E}$, in $G$.
$$\mathrm{motif}_1(t) = \{ (e_i, t, e_j) \}_{i,j}$$
If $e_1 = e_2$ we call it a Loop 1-motif, otherwise it is called a Distinct 1-motif.

The 1-motif is a direct transaction between entities. More generally, a direct N-motif is a path of length N in the directed weighted bipartite graph starting and ending with an entity.

Definition 3 (Direct N-motif). An N-motif is a path of length N, $(e_1, t_1, \dots, t_N, e_{N+1}) \in \mathcal{E} \times \mathcal{T}_e \times \cdots \times \mathcal{T}_e \times \mathcal{E}$, in $G$ starting and ending with an entity. A motif $\mathrm{motif}_N(t_1, \dots, t_N) = \{ (e_{i_1}, t_1, \dots, t_N, e_{i_{N+1}}) \}$ is required to satisfy the following constraint:
• Direct: at least one output from each transaction is an input to the next transaction.
$$\forall k \in \{1, \dots, N-1\}, \ \exists (o, i) \in \mathcal{O}(t_k) \times \mathcal{I}(t_{k+1}), \ a(o) = a(i)$$
If $e_1 = e_{N+1}$ we call it a Direct Loop, otherwise Direct Distinct. From this condition we deduce immediately that the transactions are ordered in time: $\tau(t_1) < \cdots < \tau(t_N)$.

Considering only Direct N-motifs avoids redundancy of exploration and focuses on fast flow of value. In this work, we do not consider non-Direct paths, which would correspond to more complex transfer-and-hold patterns.

We illustrate the case of a 3-motif in Figure 4.

Figure 4. A 3-motif, consisting of a path on the bipartite entity-transaction graph.

Statistics of 1-, 2-, and 3-motifs over the dataset considered are presented in Table I, with the following entity categories: Exchange, Gambling, Mining, Service, Darknet, as per labelling from Wallet Explorer. An indication of the power of transaction graph neighborhood structure is that certain motifs dominate within certain entity categories, e.g. direct transactions with distinct entities are more characteristic of Exchanges than of other entity categories.

Motif attributes such as BTC volume and number of addresses are similarly inherited from the address-transaction network, by summation and aggregation.

III. ENTITY CLASSIFICATION

In this section we present the methods used for inferring the category associated with each entity, using our graph neighborhood features on the Bitcoin transaction graph. We first recall the heuristics used for associating distinct addresses to a single entity.

A. Common spending heuristic for address clustering

The common spending heuristic consists of assigning all addresses that are inputs to the same transaction to the same entity. Formally,

Hypothesis 1. $\forall t \in \mathcal{T}, \ \exists! e \in \mathcal{E}, \ \forall i \in \mathcal{I}(t), \ a(i) \in \mathcal{A}(e)$

The common spending heuristic is equivalent to the assumption that having access to multiple private keys at a given point in time is a defining property of an entity. Indeed, in order to submit a transaction $T$ to the Blockchain protocol, the transaction must be signed, which implies using the private key of each address in the input set of the transaction, see Figure 5.

Figure 5. Common Spending Heuristic: all addresses input to a transaction are associated with the same entity.
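A minimal sketch of how the common spending heuristic could be applied in practice, assuming transactions are given simply as lists of input addresses. It uses a union-find structure, a technique chosen here for illustration rather than one named in the text, so that addresses sharing any transaction end up under a common representative. The function name `cluster_addresses` is hypothetical.

```python
def cluster_addresses(transactions):
    """Group addresses that co-occur as inputs of a transaction.

    transactions: iterable of lists of input addresses (assumed layout).
    Returns a dict mapping each address to its cluster representative.
    """
    parent = {}

    def find(a):
        parent.setdefault(a, a)
        while parent[a] != a:
            parent[a] = parent[parent[a]]  # path halving
            a = parent[a]
        return a

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    for inputs in transactions:
        if not inputs:
            continue  # e.g. coinbase transactions have no input address
        first = inputs[0]
        find(first)  # register single-input transactions too
        for other in inputs[1:]:
            union(first, other)
    return {a: find(a) for a in parent}


tx_inputs = [["a1", "a2"], ["a2", "a3"], ["a4"]]
clusters = cluster_addresses(tx_inputs)
```

In the toy input, a1, a2 and a3 are merged because a2 appears as an input in two transactions, while a4 remains a separate entity.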

The notion of transitive closure allows extending the set of addresses associated with a given entity.

Hypothesis 2. $\exists (t_1, t_2) \in \mathcal{T}^2$ s.t. $\{a_1, a_2\} = \mathcal{A}(\mathcal{I}(t_1))$, $\{a_2, a_3\} = \mathcal{A}(\mathcal{I}(t_2)) \implies \exists! e, \ \{a_1, a_2, a_3\} \subset \mathcal{A}(e)$

Under Hypotheses 1 and 2, each transaction only has one input entity: $\forall t \in \mathcal{T}_e, \ |\mathcal{I}_e(t)| = 1$. In the following section we present the types of features used for entity classification.

B. Features

We wish to demonstrate the effectiveness of graph neighborhood features for entity characterization. As such we propose the following five feature classes:
• Address features involving only address properties,
• Entity features that can be computed from the set of addresses associated with a given entity,
• Temporal features related to the evolution of specific transaction properties,
• Centrality features encoding the value of classical centrality measures [7],
• Motif features corresponding to transaction paths involving the entity of interest.

The remainder of this section provides more details on each set of features.

Address-specific features include attributes such as the total BTC received, the total BTC balance, the number of input/output transactions, the number of predecessor/successor addresses, unique and otherwise, the number of predecessor addresses that are also successors, and the number of sibling addresses in output.

Analogous features are defined at the entity level, as well as the number and proportion of coinbase transactions.

1-motif features include the value of incoming/outgoing transactions in BTC and USD, the number of incoming/outgoing addresses, the number of incoming/outgoing transactions per day, their total value in BTC and USD, and their total fee.

2-motif and 3-motif features are analogous but also include particularities of this graph structure such as, for 2-motifs, the number of inputs (resp. outputs) in the first (resp. second) transaction of an incoming/outgoing/loop motif, the total value of the inputs (resp. outputs) of the first (resp. second) transaction of an incoming/outgoing/loop motif in BTC and USD, as well as the number of incoming/outgoing/loop motifs, the number of addresses involved as center of an incoming/outgoing/loop motif, the value transferred in the middle and the fees of the transactions in BTC and USD, and the number of predecessors/successors for an incoming/outgoing motif. 2-motif features are illustrated in Figure 6.

Table I. MOTIF DISTRIBUTION: 1-motif Distinct is indicative of Exchanges (64.6% of such motifs correspond to Exchanges). On the other hand, Direct Loop motifs are indicative of Services (see the Service column).

Type     Sub-type         Quantity     Exchange  Gambling  Mining  Service  Darknet
1-motif  Loop             16.085.493   26.3%     25.7%     4.9%    39.6%    3.5%
1-motif  Distinct         5.390.310    64.6%     1.1%      2.7%    30.3%    1.2%
2-motif  Direct Loop      10.196.844   21.1%     28.1%     0.1%    48.5%    2.1%
2-motif  Direct Distinct  20.469.285   46.9%     6.3%      3.7%    38.7%    4.4%
3-motif  Direct Loop      30.914.975   24.6%     11.5%     0.1%    63.0%    0.9%
3-motif  Direct Distinct  85.822.858   54.0%     4.4%      3.6%    34.1%    3.9%

Figure 6. 2-motif features (rectangular white boxes) annotated over a 2-motif, including both edge features and vertex features over transaction paths. Annotated quantities: nb inputs, nb outputs, in val, out val, nb address, fee, mid val.
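To make the 2-motif features more concrete, the sketch below enumerates direct 2-motifs on a toy set of transactions and derives a few counts per entity. The transaction layout, the field names, the explicit time-ordering check, and the reading of "incoming", "outgoing" and "loop" (entity at the end, at the start, or at both ends of the path) are assumptions for illustration only; the enumeration is a naive quadratic scan, not an efficient implementation.

```python
# Hypothetical layout: each transaction has a time, a sending entity,
# input addresses, and outputs given as (address, owning entity, value in BTC).
TXS = {
    "t1": {"time": 1, "sender": "e1", "inputs": ["a1"],
           "outputs": [("b1", "e2", 1.0)]},
    "t2": {"time": 2, "sender": "e2", "inputs": ["b1"],
           "outputs": [("a2", "e1", 0.9)]},
    "t3": {"time": 3, "sender": "e2", "inputs": ["b1"],
           "outputs": [("c1", "e3", 0.4)]},
}


def direct_2_motifs(txs):
    """Yield (e_start, t1, e_mid, t2, e_end, mid_value) for each direct 2-motif.

    "Direct" means an output address of t1 is an input address of t2.
    """
    for id1, t1 in txs.items():
        for addr, e_mid, value in t1["outputs"]:
            for id2, t2 in txs.items():
                # Assumption: require t2 to be strictly later than t1.
                if id2 == id1 or t2["time"] <= t1["time"]:
                    continue
                if addr in t2["inputs"]:
                    for _, e_end, _ in t2["outputs"]:
                        yield (t1["sender"], id1, e_mid, id2, e_end, value)


def motif_features(txs, entity):
    """Count loop / outgoing / incoming 2-motifs and sum the middle value."""
    feats = {"nb_loop": 0, "nb_outgoing": 0, "nb_incoming": 0, "mid_val_sum": 0.0}
    for e_start, _, _, _, e_end, mid_value in direct_2_motifs(txs):
        if e_start == entity and e_end == entity:
            feats["nb_loop"] += 1
        elif e_start == entity:
            feats["nb_outgoing"] += 1
        elif e_end == entity:
            feats["nb_incoming"] += 1
        else:
            continue  # motif does not start or end at this entity
        feats["mid_val_sum"] += mid_value
    return feats


features_e1 = motif_features(TXS, "e1")
features_e3 = motif_features(TXS, "e3")
```

Here entity e1 is the start of two motifs passing through e2: one returns to e1 (a loop) and one ends at e3 (distinct).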

Centrality-based features include measures of betweenness centrality, closeness centrality, degree centrality, in-degree centrality, out-degree centrality, PageRank, and load centrality. These features are computed on the discrete-time aggregation of the entity graph over a week, a month and a year.

Temporal features include the number of weeks, months, and years of activity, the number of entities traded with per week, month, and year, the number of receiving and sending days, the activity period duration, and the active day ratio.

In total, we make use of 10 address features, 8 entity features, 16 temporal features, 42 centrality features, 44 1-motif features, 81 2-motif features, and 114 3-motif features.
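As a rough illustration of the temporal features, the sketch below derives a handful of them from the days on which an entity sent or received funds. The feature names and the exact definitions (for example, the active day ratio as active days divided by the length of the activity period in days) are assumptions; the text does not spell them out.

```python
from datetime import date


def temporal_features(sending_days, receiving_days):
    """Compute simple activity features from lists of datetime.date values."""
    active = sorted(set(sending_days) | set(receiving_days))
    # Length of the activity period in days, both end days included (assumption).
    period = (active[-1] - active[0]).days + 1
    return {
        "nb_sending_days": len(set(sending_days)),
        "nb_receiving_days": len(set(receiving_days)),
        "activity_period_days": period,
        "active_day_ratio": len(active) / period,
        # ISO (year, week) pairs and (year, month) pairs with some activity.
        "nb_active_weeks": len({tuple(d.isocalendar())[:2] for d in active}),
        "nb_active_months": len({(d.year, d.month) for d in active}),
    }


features = temporal_features(
    sending_days=[date(2018, 1, 1), date(2018, 1, 3)],
    receiving_days=[date(2018, 1, 3), date(2018, 1, 10)],
)
```

For this toy entity, three distinct days of activity are spread over a ten-day period.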

Remark. For each continuous feature, such as the volume of Bitcoin transactions, we calculate the mean and the standard deviation in order to allow the classifier to discriminate up to second order statistics. Transaction values are considered both in terms of BTC and USD, in order to support the analysis over the multiple years over which the BTC value in USD changed.
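The remark can be illustrated with a small sketch that computes the mean and standard deviation of transaction values in both BTC and USD. The input layout (value in BTC paired with the exchange rate at transaction time), the function name, and the use of the population standard deviation are assumptions made for this example.

```python
from statistics import mean, pstdev


def value_statistics(transactions):
    """transactions: list of (value_btc, usd_per_btc_at_tx_time) pairs."""
    btc = [value for value, _ in transactions]
    usd = [value * rate for value, rate in transactions]
    return {
        "mean_btc": mean(btc), "std_btc": pstdev(btc),
        "mean_usd": mean(usd), "std_usd": pstdev(usd),
    }


# Two transactions of 1 BTC each, made at very different exchange rates.
stats = value_statistics([(1.0, 1000.0), (1.0, 9000.0)])
```

The two transactions are identical in BTC terms but differ widely in USD terms, which is the situation the dual BTC/USD features are meant to capture.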

C. Classification method

Given our interest in understanding the behavioral features characterizing certain categories of Bitcoin users, we propose to use a decision tree method, which provides feature relevance statistics via bootstrapping. A decision tree $h_m(x)$ partitions the feature space into $J_m$ disjoint regions $R_{1m}, \dots, R_{J_m m}$ and produces a constant value in each region. The output of $h_m(x)$ for input $x$ can be written as the sum:
$$h_m(x) = \sum_{j=1}^{J_m} b_{jm} \mathbf{1}_{R_{jm}}(x)$$
where $b_{jm}$ is the model response for input features from region $R_{jm}$. We consider an ensemble of trees, which we learn sequentially by minimizing the weighted sum
$$F_m(x) = F_{m-1}(x) + \gamma_m h_m(x), \qquad \gamma_m = \arg\min_{\gamma} \sum_{i=1}^{n} L(y_i, F_{m-1}(x_i) + \gamma h_m(x_i)),$$
according to a gradient boosting procedure, where $L(\cdot)$ is the loss function.

In order to identify the ensemble tree hyper-parameters most suited to our problem, we approximate the performance of the tree ensemble for given parameter values using Gaussian Processes. We learn a surrogate $\hat{f}(\cdot)$ of the loss function $f(\cdot)$ based on previous evaluations $\{(\theta_1, f(\theta_1)), \dots, (\theta_k, f(\theta_k))\}$, and identify the parameter minimizing this surrogate function. The calibration procedure can be summarized as follows:

Algorithm 1. For $t \in \{1, \dots, T\}$:
• Given observations $\{(\theta_i, f(\theta_i)), \ \forall i \in \{1, \dots, t\}\}$, we build a surrogate $\hat{f}_t$ using Gaussian processes. Each value $f(\theta_i)$ is the loss function value obtained after having trained a decision tree with the hyper-parameter $\theta_i$.
• Given the surrogate function $\hat{f}_t$, we identify the parameter $\theta_{t+1}$ providing a good compromise between minimizing the surrogate $\hat{f}_t$ and exploring the parameter space.
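The boosting update $F_m = F_{m-1} + \gamma_m h_m$ can be sketched in a few lines of plain Python. This is a simplified illustration, not the LightGBM procedure used in the experiments: it assumes a single numeric feature, a squared loss (for which the line search over $\gamma$ has a closed form), and one-split trees with $J_m = 2$ regions. All function names are hypothetical.

```python
def fit_stump(xs, residuals):
    """Fit a one-split regression tree (two regions) by least squares."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        b_left, b_right = sum(left) / len(left), sum(right) / len(right)
        err = (sum((r - b_left) ** 2 for r in left)
               + sum((r - b_right) ** 2 for r in right))
        if best is None or err < best[0]:
            best = (err, threshold, b_left, b_right)
    if best is None:  # all x identical: fall back to a constant tree
        constant = sum(residuals) / len(residuals)
        return lambda x: constant
    _, threshold, b_left, b_right = best
    return lambda x: b_left if x <= threshold else b_right


def boost(xs, ys, n_rounds=20):
    """Gradient boosting for the squared loss L(y, F) = (y - F)^2 / 2."""
    trees, gammas = [], []
    preds = [0.0] * len(xs)  # F_0 = 0
    for _ in range(n_rounds):
        # For the squared loss, the negative gradient is the residual.
        residuals = [y - p for y, p in zip(ys, preds)]
        h = fit_stump(xs, residuals)
        hx = [h(x) for x in xs]
        denom = sum(v * v for v in hx)
        if denom == 0:
            break  # nothing left to fit
        # Closed-form line search: gamma minimizing sum (r_i - gamma * h_i)^2.
        gamma = sum(r * v for r, v in zip(residuals, hx)) / denom
        preds = [p + gamma * v for p, v in zip(preds, hx)]
        trees.append(h)
        gammas.append(gamma)
    return lambda x: sum(g * h(x) for g, h in zip(gammas, trees))


xs = [1, 2, 3, 4, 5, 6]
ys = [1.0, 1.0, 1.0, 3.0, 3.0, 3.0]
model = boost(xs, ys)
```

On this step-shaped toy target, a single stump already fits the data, so later rounds have nothing left to correct.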
The evaluation of the surrogate function for a parameter value consists in running the gradient boosting routine described above.

We benchmark the result of our decision tree algorithm against a logistic regression algorithm, defined below:
$$\forall x_i \in \mathcal{E}, \quad \hat{f}_{entity}(x_i) = y_i = \arg\max_{c \in \mathcal{C}} f_c(x_i)$$
where
$$f_c(X) = P(Y = c \mid X) = h((X\Phi)^T \beta_c), \qquad h(t) = \frac{e^t}{1 + e^t}.$$

IV. NUMERICAL RESULTS

In this section we present our experimental results, as well as implications of these results for Bitcoin transaction anonymity.

A. Blockchain dataset

We consider the set of blocks of height less than or equal to 514.971, corresponding to blocks created before March 24th 2018, 15:19:02, which contains about . . addresses. Address labels are obtained from WalletExplorer.

We apply the common spending heuristic and transitive closure operation described in Section III-A to the labeled dataset obtained from WalletExplorer, and extend it slightly. We interact with the Blockchain via the BlockSci toolbox v.0.4.5 released on March 16th 2018 [11], on a 64 GB machine. The final labeled dataset used in numerical experiments consists of 30.331.700 addresses, associated with $|\mathcal{E}_{known}| = 272$ entities representing entity categories in the following proportions:
• Exchange: 108 entities, 7.892.587 addresses.
• Service: 68 entities, 17.606.608 addresses.
• Gambling: 65 entities, 2.775.810 addresses.
• Mining Pool: 19 entities, 78.488 addresses.
• DarkNet Marketplace: 12 entities, 1.978.207 addresses.

While the set of labeled addresses is of significant size, it is important to observe the entity category class imbalance, with the dominant Service class representing more than of the dataset, and the smallest Mining Pool category representing less than . of the dataset.

B. Model calibration procedure

The decision tree model described in Section III-C is deployed in Python via the LightGBM implementation [12]. The Gaussian Process-based optimization procedure for hyper-parameter optimization is implemented using the Python skopt library (https://scikit-optimize.github.io/) with initial parameter values obtained from a coarse random search.

We use a typical / training / test partition of our dataset. The learning rate hyper-parameter of the decision tree model is optimized over the interval [0. , . after having done a random search over [0, ; the resulting value is . . To train LightGBM, we use an early stopping procedure which stops the training if the log loss does not decrease over ten consecutive iterations. The procedure stops after iterations with a loss of . .

The inverse of the L regularization parameter of the logistic regression model is optimized over the interval [0. , after having done a random search over [0, ; the value obtained is . .

The Gaussian Processes procedure is used with 50 iterations, which is a reasonable compromise between a small computation time (less than one hour for LightGBM in our hardware setting) and a good exploration of the interval. For fairness we use the same criterion for the logistic regression model. Along the same lines, we only optimize one hyper-parameter for LightGBM, namely the learning rate.

C. Feature importance

We analyze the performance of the classification model first from the perspective of an unsophisticated attacker incrementally adding features to a generic model based on their ease of access. We then model a sophisticated attacker with extensive modeling knowledge, collecting the full set of features, calibrating model hyper-parameters, and identifying the minimal set of relevant features required for a successful attack.

1) Weak attacker:

Consider an attacker who collects features in the following order, from simplest to most complex:
• Address features requiring access only to the address set,
• Entity features requiring access to the address set and address clustering heuristics,
• Temporal features,
• Centrality features requiring crawling the Blockchain transaction network for connectivity information,
• […] accuracy, i.e. a factor of two improvement as compared to a random guess over the classes.

Second, it is of significance that while the model choice does not significantly impact the classification performance, using more sophisticated features provides drastic improvement. Indeed, the use of 3-motif features, encompassing the behavior of the 3-hop graph neighborhood of a given identity, contributes more than relative improvement to the accuracy score (from 0.683 to 0.890 accuracy for the decision tree model, see Table II).

Table II. INCREMENTAL GROUPING OF FEATURES AND ASSOCIATED PERFORMANCE METRICS.

Features    |Features|   Alg.  Accuracy  F1     Precision
Address     10           LR    0.415     0.303  0.351
Entity      18 (+8)      LR    0.476     0.369  0.445
1-motif     62 (+44)     LR    0.524     0.471  0.474
Temporal    78 (+16)     LR    0.512     0.493  0.498
Centrality  120 (+42)    LR    0.561     0.545  0.551
2-motif     201 (+81)    LR    0.585     0.574  0.573
3-motif     315 (+114)   LR    0.841     0.835  0.857
Address     10           LGBM  0.5       0.487  0.492
Entity      18 (+8)      LGBM  0.476     0.429  0.415
1-motif     62 (+44)     LGBM  0.622     0.597  0.613
Temporal    78 (+16)     LGBM  0.659     0.649  0.654
Centrality  120 (+42)    LGBM  0.610     0.597  0.603
2-motif     201 (+81)    LGBM  0.683     0.654  0.667
3-motif     315 (+114)   LGBM  0.890     0.886  0.897
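As a worked check of the incremental gains, the short sketch below recomputes the relative accuracy change contributed by each added feature group, using the LightGBM accuracy values copied from Table II. The function name `relative_gains` is hypothetical, and the calculation is only an arithmetic illustration, not part of the original analysis.

```python
# Accuracy values copied from Table II (LGBM rows), in the order features are added.
LGBM_ACCURACY = [
    ("Address", 0.5),
    ("Entity", 0.476),
    ("1-motif", 0.622),
    ("Temporal", 0.659),
    ("Centrality", 0.610),
    ("2-motif", 0.683),
    ("3-motif", 0.890),
]


def relative_gains(rows):
    """Relative accuracy change when each feature group is added to the previous set."""
    return {
        name: (acc - prev_acc) / prev_acc
        for (_, prev_acc), (name, acc) in zip(rows, rows[1:])
    }


gains = relative_gains(LGBM_ACCURACY)
for name, gain in gains.items():
    print(f"{name:<10} {gain:+.1%}")
```

With these values, adding the 3-motif features to the 2-motif feature set corresponds to a relative accuracy gain of roughly 30% for LightGBM, while some intermediate steps (Entity, Centrality) slightly reduce accuracy.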

2) Strong attacker:

We now consider a strong attacker collecting the full set of features, and then selecting a small set of highly significant features. We model this process by applying the hyper-parameter calibration procedure described earlier. We then obtain the ranked feature set from the decision tree model. Table III provides the top ten features of the decision tree model.

Table III. RANKED FEATURES BY IMPORTANCE ACCORDING TO THE DECISION TREE MODEL.

Rank  Type     Name
1     3-motif  unique entity 3 successor
2     Entity   prop coinbase
3     2-motif  loop 2 std nb inputs
4     2-motif  loop 2 mean nb inputs
5     3-motif  outgoing 3 mean nb outputs
6     3-motif  outgoing 3 std nb outputs
7     3-motif  outgoing 3 std fee 1 btc
8     3-motif  outgoing 3 mean nb inputs
9     1-motif  incoming mean fee usd
10    3-motif  loop 3 mean fee 2 btc

The results indicate that the 3-hop graph neighborhood features dominate. Note, however, that there are 114 such 3-hop graph neighborhood features out of 315 features in total. It can be seen from Table III that outgoing features are more relevant than incoming features. This supports a “causal” interpretation that features specific to an entity can be well detected from its downstream transaction trace.

We now examine the minimal set of features required to obtain state-of-the-art accuracy. We present in Figure 7 the F1 and accuracy improvement over the complete feature relevance ranking from the decision tree algorithm for both the logistic regression and the decision tree model.

It is clear from the figure that even with a simple logistic regression model, the most relevant features are sufficient to obtain high classification accuracy. Using a more sophisticated decision tree model allows reducing the number of features to , after which improvement plateaus.

Figure 7. Feature Selection using LightGBM

D. Overall model performance

Finally, we consider the sophisticated attacker using hyper-parameter optimization. Table IV presents the Accuracy and F1 score for both classification methods.

Table IV. OPTIMIZING F1 WITH FEATURES, ITERATIONS OF THE HYPER-PARAMETER OPTIMIZATION PROCEDURE, GLOBAL RESULTS.

Algorithm            Accuracy  F1    Precision
Logistic Regression  0.85      0.85  0.87
LightGBM             0.92      0.91  0.92

Table V provides the class-specific overall performance results.

The results dominate state-of-the-art results from the literature, which may be due to the use of novel advanced graph neighborhood features, as evidenced from Table II, along with the hyper-parameter optimization. At the class level, good classification accuracy (above ) is achieved over the set of exchanges, gambling services and general services as well as the darknet category. Mining pool behavior is less well captured, as evidenced by the low accuracy, indicating that there is no consistent transaction pattern identified for this class.

Table V. OPTIMIZING F1 WITH FEATURES, ITERATIONS OF THE HYPER-PARAMETER OPTIMIZATION PROCEDURE, CLASS-LEVEL RESULTS.

Category  Algorithm  Accuracy  F1    Precision
Exchange  LR         0.91      0.91  0.91
Gambling  LR         0.9       0.82  0.75
Mining    LR         0.5       0.67  1.0
Service   LR         0.85      0.87  0.89
Darknet   LR         0.75      0.75  0.75
Exchange  LGBM       0.94      0.92  0.91
Gambling  LGBM       0.95      0.97  1.0
Mining    LGBM       0.5       0.67  1.0
Service   LGBM       0.95      0.88  0.83
Darknet   LGBM       1.0       1.0   1.0

V. CONCLUSION

We formulate the problem of analyzing Bitcoin Blockchain transaction graph anonymity properties as a classification problem over a set of categories of Bitcoin users. Our results indicate that it is feasible for a weak attacker to characterize entities using a set of new graph neighborhood features that we propose, and that it is feasible for a strong attacker to do the same with as little as of the most relevant features.

This suggests a number of interesting avenues for further work. Since it is possible, as we have shown, to accurately classify transacting entities on the Bitcoin Blockchain, the question arises as to whether it is possible to develop an effective generative model of the transaction network. If so, it would enable a wealth of studies into the effect of changes in network protocols or regulatory frameworks on the evolution of the Bitcoin economy.