GraphSense: A General-Purpose Cryptoasset Analytics Platform
Bernhard Haslhofer, Rainer Stütz, Matteo Romiti, and Ross King
AIT Austrian Institute of Technology, Vienna, Austria
Version 0.4.5
There is currently an increasing demand for cryptoasset analysis tools among cryptoasset service providers, the financial industry in general, as well as across academic fields. At the moment, one can choose between commercial services or low-level open-source tools providing programmatic access. In this paper, we present the design and implementation of another option: the GraphSense Cryptoasset Analytics Platform, which can be used for interactive investigations of monetary flows and, more importantly, for executing advanced analytics tasks using a standard data science tool stack. By providing a growing set of open-source components, GraphSense could ultimately become an instrument for scientific investigations in academia and a possible response to emerging compliance and regulation challenges for businesses and organizations dealing with cryptoassets.
Keywords: cryptoassets, analytics, blockchain
In recent years, we have observed a rapidly increasing demand for cryptoasset analysis tools in industry and academia: businesses dealing with cryptoassets analyze transactions to fulfill compliance guidelines and regulations (cf. [13,1]); law enforcement needs these techniques to track and trace illicit money flows (e.g., [7,8]); designers of distributed ledger technology analyze deployed systems to make informed system design decisions [14]; business analysts and investors analyze transactional data to understand markets; and, last but not least, scientists from a wide range of academic disciplines use cryptoasset analytics tools to find answers to their research questions.
At the moment, analysts can choose from two main options. On the one hand, they can use commercial service offerings and analyze cryptoasset addresses and transactions via provided user interfaces and APIs. Neglecting the relatively high service costs, this has the advantage of a low entry barrier and availability of so-called attribution tags, which associate cryptoasset addresses with real-world actors such as exchanges. Alternatively, one can use free, open-source blockchain analytics tools like BlockSci [5], which provides programmatic access to the full blockchain data and a highly efficient in-memory transaction graph representation.
[arXiv preprint, cs.CR]
In this paper, we present a third option: the GraphSense Cryptoasset Analytics Platform, which is designed as an extensible and scalable analytics platform for running customized analytics tasks on data gathered from multiple blockchains and other contextually relevant sources, such as exchange rate services. Similar to commercial offerings, GraphSense also provides a dashboard for basic, interactive investigations, which lowers the entry barrier for non-expert users. Similar to BlockSci, it provides the flexibility to perform analytics tasks on pre-computed graph abstractions. However, in contrast to BlockSci, GraphSense provides access to the so-called address and entity graphs, which reflect the main structural elements of cryptoasset ecosystems: actors, who interact with each other and are linked together through cryptoasset transfers (cf. [10]). Furthermore, GraphSense introduces the notion of TagPacks, which support collaborative collection and provenance-aware curation of attribution tags, which are valuable data points in most analytics tasks.
Our vision was for GraphSense to become a general-purpose cryptoasset analytics platform that supports analysts in conducting microscopic, transaction-level investigations as well as more extensive macroscopic investigations on structural and dynamic aspects of cryptoasset ecosystems. Technically, GraphSense contributes reusable building blocks that can easily be integrated into an ETL or cryptoasset analytics pipeline. By being published under an open-source license, which permits reuse for commercial and non-commercial purposes, GraphSense has already attracted interest and contributions from third parties and could ultimately become a core technology for cryptoasset analytics research in academia and industry.
In the following, in Section 2, we first provide some background information on graph abstractions required for cryptoasset analytics and then present our rationale for designing GraphSense. We then present the technical design and the architecture of the GraphSense platform in Section 3, before we provide further details on TagPacks in Section 4. Finally, in Section 5, we provide some insight into known challenges and future development directions.
This paper currently describes version 0.4.5 of the GraphSense Cryptoasset Analytics Platform. It will be updated based on users' feedback and new features included in future releases.
In our terminology, we denote an asset as something that has some value for someone. Building on this, we denote a cryptoasset as a virtual asset that utilizes cryptography and some form of ledger technology, which may be distributed or not, for recording and sharing value transfers. As depicted in Figure 1, we can roughly divide the spectrum of cryptoassets into native cryptocurrencies like Bitcoin, and tokens as they are deployed on account-model ledgers like Ethereum.
GraphSense: https://github.graphsense.info
[Figure 1: diagram dividing Cryptoassets into Native Cryptocurrencies (Privacy Focused, Transparent) and Tokens (Fungible Tokens, Non-Fungible Tokens). Example assets shown include Monero, Litecoin, Bitcoin, DAI, Tether, Polymath, Golem, ENS, and Cryptokitties.]
Fig. 1. The spectrum of cryptoassets. Technically, one can distinguish between Native Cryptocurrencies and Tokens deployed on platforms like Ethereum. From a usage perspective, one can distinguish between Payment Tokens, Security Tokens, and Utility Tokens.
A cryptoasset ecosystem represents a community of actors, who interact as a system and are linked together through cryptoasset transfers. The goal of cryptoasset analytics is to develop and apply quantitative methods to understand the technical and socio-economic aspects of cryptoasset ecosystems. This has been our underlying motivation for building GraphSense.
The required algorithmic building blocks for enabling cryptoasset analytics depend on the conceptual design of a distributed ledger. Ledgers that follow Bitcoin's unspent transaction outputs (UTXO) model, which includes Bitcoin derivatives like Litecoin and Zcash, allow a single transaction to have multiple inputs and outputs. We can compute various types of graph abstractions from the underlying blockchain [10], such as the transaction graph, which is a directed temporal graph connecting transactions by their inputs and outputs. Further, we can compute the address graph, a bi-directed cyclic graph in which a node represents an address and an edge represents the set of transactions two addresses were involved in as input or output. Addresses can further be linked using various address clustering heuristics, most importantly the so-called multiple-input or co-spent heuristic, which groups addresses that are likely controlled by the same real-world actor based on common use and reuse in transactions [6]. After applying the clustering algorithm, one can build another graph abstraction: the so-called entity graph, which is a bi-directed, cyclic graph in which a node represents the set of addresses that are likely controlled by the same real-world actor (e.g., an exchange) and an edge represents the aggregate set of transactions between two address sets (entities).
Other ledgers like Ethereum, NEO, or EOS follow a different conceptual model called the account model. In that model, a single transaction has exactly one source and one destination account address. While it is still possible to compute the transaction and address graph, existing heuristics based on multiple inputs or outputs cannot be used. However, recent work has shown that address clustering is also possible for Ethereum's account model based on heuristics derived from the analysis of usage patterns surrounding deposit addresses, airdrops, or token transfer authorization [15].
2.2 Design Rationale
The current system design of GraphSense reflects a number of observations and requirements that we have registered over the past several years.
Data sovereignty
We observe that programmatic access to the full data, which includes blockchain data as well as attribution tags, is essential for efficient and effective analytics going beyond following individual transactions. When analyzing entire markets, as we did for Ransomware [7] or Sextortion [8], one must extract relevant data points related to thousands of addresses in a single analytics task. Predefined interfaces, whether graphical dashboards or programmatic REST APIs, always limit the scope of cryptocurrency analyses in some way. Therefore, GraphSense follows a full data sovereignty strategy and provides programmatic access to the full underlying data, and thereby supports advanced usage scenarios.
Pre-computed Graph Abstractions
Nowadays, cryptoasset transactions can be inspected with a wide variety of publicly available blockchain explorers such as blockchain.com or etherscan.io. Next to providing details on individual blocks and transactions, they also support navigation along the transaction graph, which means users can navigate from a certain transaction output address to the next transaction that uses that address as input and vice versa. While this is certainly useful and important, we observed that many analytics tasks focus on the investigation of monetary flows between cryptoasset addresses or, more importantly, between the real-world actors (e.g., exchanges) that control these addresses, hence the cryptoasset entities. Therefore, GraphSense provides higher-level graph abstractions, namely the address and entity graph, for various cryptoassets. Since computing these graph abstractions is computationally expensive, GraphSense pre-computes them and makes them available for subsequent analytics tasks.
Collaborative Address Tagging
We are aware that attribution tags, which associate cryptoasset addresses and entities with some real-world actor, are essential for conducting effective analyses. They can, for instance, be collected manually by interacting with certain services and assigning human-readable labels to addresses controlled by these services. Since this is a costly and resource-intensive task, we propose TagPacks, a simple file structure for organizing and exchanging attribution tags. TagPacks can be collected and collaboratively maintained using Git (https://git-scm.com), which is a free version control system that provides the technical means for recording data-provenance information. This aspect is important because documented evidence of the origin of data is increasingly emphasized or required in both academia and legal proceedings (cf. [2]).
Scalability & Extensibility
We note that the volume of transactional data in blockchains is growing and new ledgers are appearing. In the long run, in-memory graph representations might face scalability issues when working with higher-level graph abstractions computed over several ledgers. We also follow the reasoning behind BlockSci and point out that most of the relevant data come from append-only data structures, which makes the ACID properties of general-purpose databases unnecessary. Therefore, we decided to build GraphSense on top of a standard data science technology stack, which uses Apache Cassandra as a NoSQL storage engine and Apache Spark as an analytics engine. Both technologies can increase capacities by connecting additional hardware and can therefore respond to growing data volumes. We also take into account that cryptoasset analysis is a highly dynamic field in which new ledgers are constantly being added or existing ones are changing. Therefore, we designed GraphSense as a modular and extensible analytics pipeline consisting of multiple, standalone building blocks, which can be updated, extended, and possibly replaced as needed.
Transparency & Open Source
GraphSense leverages other open-source efforts like BlockSci and uses them in an integrated analytics pipeline. In return, all GraphSense components are published as open-source software on GitHub under an MIT license (https://opensource.org/licenses/MIT) and can be re-used for commercial and non-commercial purposes. In this manner, GraphSense also fulfills the requirement of algorithmic transparency, which is another important condition for safeguarding the evidential value of cryptoasset investigations.
GraphSense is designed as a modular and extensible analytics pipeline consisting of multiple, standalone building blocks, which are connected and orchestrated via Docker and Docker Compose. As depicted in Figure 2, the overall pipeline can be divided into several parts: the relevant data sources that provide the raw data points needed for further analyses; several data aggregation components that retrieve data from different sources and ingest them into GraphSense's NoSQL storage back-end; a data transformation job that computes statistical properties, clusters addresses, and computes the required address and entity graph abstractions; and finally interfaces that provide programmatic access to the underlying data as well as a Dashboard that supports users in analyzing individual nodes and edges in these graphs.
[Figure 2: pipeline diagram. Data Sources (UTXO-Model Ledgers, TagPacks, Exchange Rates) feed Data Aggregation (BlockSci, TagPack Tool, Exchange Rate Crawler), followed by Data Transformation and Storage (Raw Data, Transformed Data), which is exposed through the REST Interface to the Dashboard and CLI.]
Fig. 2. GraphSense Architecture
GraphSense uses data from the following sources:
– UTXO-Model Ledgers: At the moment, Bitcoin, Bitcoin Cash, Zcash, and Litecoin are supported.
– TagPacks: GraphSense integrates collaboratively collected attribution tags in the form of TagPacks. Further details will be provided in Section 4.
– Exchange Rates: GraphSense utilizes cryptoasset exchange rates from public services such as CoinDesk, CoinMarketCap (https://coinmarketcap.com), and the European Central Bank (ECB).
Raw data are aggregated from the above sources using several data-source-specific connectors and extractors. Table 1 shows the number of blocks, transactions, addresses, and tags that are aggregated at the time of this writing.
Table 1. Summary of supported cryptocurrency ledgers. [Table body not recoverable from the source; surviving column headers: Currency, Date.]
For UTXO model ledgers, GraphSense currently relies on BlockSci, which provides an efficient parser for large chains like Bitcoin as well as REST-API connectors for other, smaller ledgers. At the moment, GraphSense also relies on BlockSci's mapping from transaction and address hashes to integer IDs, which significantly lowers memory consumption and storage space.
Bitcoin exchange rates are gathered from CoinDesk's public API (Bitcoin Price Index API), where historical exchange rates are provided for different fiat currencies in JSON format at a dedicated API endpoint. Daily historical exchange rates for the remaining cryptocurrencies are retrieved in U.S. Dollars from the CoinMarketCap API. For conversion to other fiat currencies we use foreign exchange rates provided by the European Central Bank (ECB).
TagPacks can be aggregated and validated using the GraphSense TagPack Management Tool. Since TagPacks use terms from external taxonomies, that tool can also be used for aggregating, validating, and ingesting taxonomy concepts and definitions.
All aggregated raw data are ingested into the NoSQL storage back-end in a dedicated raw keyspace.
The next step in the GraphSense data analytics pipeline is a transformation job that computes statistical summaries on central blockchain entities (blocks, transactions, addresses). For UTXO ledgers, it also computes clusters of addresses that are likely controlled by the same real-world entity, which could, for instance, be an exchange. The entire transformation job is implemented in Apache Spark (https://spark.apache.org) and runs in parallel over raw data items, which are stored in Apache Cassandra (https://cassandra.apache.org) and distributed over a cluster of connected machines.
Statistical properties
The properties computed for blocks and transactions are trivial and roughly correspond to those that can also be found in public blockchain explorers (e.g., total transaction inputs and outputs). For addresses and entities, however, we compute semantically richer statistics such as the total volume of currency units received by an address, while taking into account historical exchange rates for each transaction.
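As a rough illustration of such a statistic, the following sketch sums the deposits of one address in coins, U.S. Dollars, and Euros, valuing each deposit with the rate of the day on which it occurred. The function name `total_received`, the data layout, and all rate values are assumptions made for this example; they are not taken from the GraphSense code base. The Euro conversion assumes reference rates quoted as USD per 1 EUR.

```python
from datetime import date

# Hypothetical daily rates; the numbers are made up for illustration only.
usd_per_coin = {date(2020, 1, 1): 7200.0, date(2020, 1, 2): 7000.0}
usd_per_eur = {date(2020, 1, 1): 1.12, date(2020, 1, 2): 1.12}


def total_received(deposits):
    """Sum deposits in coins, USD, and EUR using each deposit day's rates."""
    coins = usd = eur = 0.0
    for day, amount in deposits:
        value_usd = amount * usd_per_coin[day]   # historical USD value
        coins += amount
        usd += value_usd
        eur += value_usd / usd_per_eur[day]      # convert via USD-per-EUR rate
    return {"coins": coins, "usd": usd, "eur": eur}


# Two deposits on different days are valued at different historical rates.
deposits = [(date(2020, 1, 1), 0.5), (date(2020, 1, 2), 1.0)]
summary = total_received(deposits)
```

Valuing each transfer at its own day's rate, rather than multiplying the coin total by the current rate, is what makes the fiat totals reflect the value at the time of the transaction.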
Footnote URLs for the data sources named above: https://api.coindesk.com/v1/bpi/historical/close.json (CoinDesk historical rates), https://web-api.coinmarketcap.com/v1/cryptocurrency/ohlcv/ (CoinMarketCap API), https://interpol-innovation-centre.github.io/DW-VA-Taxonomy (taxonomies).
Address graph
A cryptoasset address $a_i$ represents a node in the address graph and carries a set of key-value pairs $P_{a_i}$ providing statistical summaries for individual addresses: number of i) deposits, ii) withdrawals, iii) depositing addresses, iv) withdrawing addresses, v) coins received, vi) coins spent, and vii) balance, as well as viii) activity period based on the ix) first transaction and the x) last transaction. The set of transactions $T_{a_i,a_j}$ from address $a_i$ to $a_j$ represents the edge between the two nodes and is also labeled with key-value pairs $P_{a_i,a_j}$: i) estimated transferred value, ii) number of transactions, and iii) list of transactions. Here we point out that an exact computation of the value transfer between two addresses in UTXO ledgers is not possible, because a single UTXO transaction has multiple inputs and outputs. Therefore, it is not possible to associate a value from one input address with an output address (see [4]).
Since each address node carries several computed statistical properties and the edges are also labeled with properties and values, we represent the address graph following the property graph model [11]. A property graph is essentially a bi-directed multi-graph with labeled nodes and edges, where edges have their own identity.
Address clustering
In UTXO ledgers, a user can create and control an arbitrary number of addresses at virtually no cost. Linking and clustering these addresses into a single set, which represents the real-world entity that likely controls these addresses, is an essential task in cryptoasset analytics. GraphSense currently implements the co-spent heuristic [6], which is also known as the multiple-input heuristic and assumes that inputs spent in the same transaction are controlled by the same user, who must possess the corresponding private keys for signing these inputs. While this method has proved very effective in practice [3], a known, possible source of false positives are CoinJoins, which can be identified and filtered before applying the heuristic (see [5]). Other clustering heuristics rely on the identification of change addresses in the transaction outputs. Since this depends on the technical nature of the client executing the transactions, GraphSense refrains from implementing any change heuristics.
From a technical perspective, clustering is implemented as a union-find algorithm that selects all address IDs from all non-multi-signature transactions with more than one input, ships them to a central master node where the disjoint-set data structures are computed, and ships them back to all nodes in the cluster for assigning unique cluster or entity IDs to each address.
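The core of the multiple-input heuristic can be sketched with a small union-find structure. This is a single-process illustration under simplifying assumptions, not the distributed GraphSense implementation: the function name `cluster_addresses` and the input layout (one list of input addresses per transaction) are hypothetical, and CoinJoin filtering and the exclusion of multi-signature transactions are assumed to have happened beforehand.

```python
def cluster_addresses(transactions):
    """Group addresses that appear together as inputs of one transaction."""
    parent = {}

    def find(addr):
        # Return the representative of addr's set, registering new addresses.
        parent.setdefault(addr, addr)
        while parent[addr] != addr:
            parent[addr] = parent[parent[addr]]  # path halving
            addr = parent[addr]
        return addr

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    for inputs in transactions:
        for addr in inputs:
            find(addr)                   # single-input addresses stay alone
        for other in inputs[1:]:
            union(inputs[0], other)      # co-spent inputs share one entity

    clusters = {}
    for addr in parent:
        clusters.setdefault(find(addr), set()).add(addr)
    return sorted(sorted(members) for members in clusters.values())


# a1 and a2 are co-spent, as are a2 and a3, so all three end up together.
entities = cluster_addresses([["a1", "a2"], ["a2", "a3"], ["a4"]])
```

Because set membership is transitive, two addresses that never appear in the same transaction can still end up in one entity when a third address links them.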
Entity graph
By combining the previously described address graph with the entities (disjoint address sets) computed by address clustering, we can now build the entity graph. In the entity graph, a node represents an entity $e_x$, which reflects some real-world actor (e.g., an exchange) controlling a set of addresses, while an edge represents the aggregated set of transactions $T_{e_x,e_y}$ that occurred between two entities $e_x$, $e_y$.
In general, the entity graph carries the same properties as the address graph, but on an aggregated level. Hence, a node $e_x$ carries the following key-value pairs $P_{e_x}$: number of i) deposits, ii) withdrawals, iii) depositing entities, iv) withdrawing entities, v) coins received, vi) coins spent, vii) balance, as well as viii) activity period based on the ix) first transaction and the x) last transaction and, additionally, xi) the number of addresses and xii) a tag coherence score, a metric that uses the string similarity between tags related to the entity addresses to describe the entity consistency and composition. Analogously, an edge has the following aggregated key-value pairs $P_{e_x,e_y}$: i) estimated transferred value, ii) number of transactions, and iii) list of transactions. Since storing the entire list of transactions between two entities might be expensive, we disregard transaction lists with more than 100 entries.
Figure 3 illustrates both the address and entity property graphs. Two of the four addresses are clustered into one entity, while the other two entities consist of one address each. Table 2 shows the dimensionality (number of nodes and edges) of both graphs, and one can clearly see that the entity graph reduces the dimensionality: the number of nodes is approximately halved, and the number of edges is reduced by a factor of 3.5 to 5.
[Figure 3: three entity nodes and four address nodes, each annotated with its property set, connected by edges labeled with edge property sets.]
Fig. 3. Conceptual Address and Entity Graph Model. The Entity Graph is a higher-level abstraction and an aggregated model of the Address Graph, both at node and edge level. In this example, an entity-level edge property set is an aggregation of the property sets of the underlying address-level edges.
Table 2. Summary of computed graph representations. [Table body not recoverable from the source; surviving headers: Currency, with column groups Address Graph and Entity Graph.]
Graph Storage
In GraphSense, the address and entity graphs are stored as node and edge lists in a distributed NoSQL database. Since NoSQL databases typically do not support efficient lookups on non-partition keys, GraphSense stores each edge list twice: once to support retrieval of an edge by source node ID and once to support the reverse direction. While we consider the additional required disk space a non-issue, the challenge clearly lies in partitioning the edge list across machines so that partition sizes follow a roughly uniform distribution and the load stays balanced throughout the cluster.
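The idea of storing each edge list twice can be sketched in memory with two dictionaries, one keyed by source node and one keyed by destination node. This is only an illustration of the access pattern: the function name `build_edge_indexes`, the property names, and the row layout are assumptions for this example and do not reflect the actual GraphSense keyspace schema. The cap of 100 entries follows the rule stated above for transaction lists.

```python
from collections import defaultdict

MAX_TX_LIST = 100  # longer transaction lists are not stored


def build_edge_indexes(transfers):
    """Aggregate (tx_id, src, dst) rows into edges indexed in both directions."""
    edges = defaultdict(list)
    for tx_id, src, dst in transfers:
        edges[(src, dst)].append(tx_id)

    outgoing = defaultdict(dict)  # lookup by source node
    incoming = defaultdict(dict)  # lookup by destination node
    for (src, dst), tx_ids in edges.items():
        props = {
            "no_transactions": len(tx_ids),
            "tx_list": tx_ids if len(tx_ids) <= MAX_TX_LIST else None,
        }
        outgoing[src][dst] = props
        incoming[dst][src] = props
    return outgoing, incoming


transfers = [("t1", "a1", "a2"), ("t2", "a1", "a2"), ("t3", "a3", "a2")]
outgoing, incoming = build_edge_indexes(transfers)
```

With both indexes in place, "who did a1 pay?" and "who paid a2?" are each a single keyed lookup, at the cost of holding every edge twice.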
A wide range of analyses, which go beyond an inspection of individual transactions, can only be achieved through programmatic access to the entire underlying dataset. Although this requires additional effort initially, one gains reproducibility and repeatability at minimal additional cost. Therefore, GraphSense offers two options for programmatic access: a REST-API and the possibility to run customized Apache Spark jobs over the entire dataset.
The REST-API follows the OpenAPI specification (https://swagger.io/specification), which defines a standard, language-agnostic interface to RESTful APIs that can be used by code generation tools to generate servers and clients in various programming languages. GraphSense implements a REST server stub in Python Flask and currently provides client libraries in Python and R.
The second, more powerful option is to implement a customized Apache Spark job and run it over the entire dataset. This of course requires direct access to the cluster running GraphSense and a full understanding of the NoSQL model used for storing the data.
In order to provide a low entry barrier for non-expert users, GraphSense also provides a visual Dashboard, as shown in Figure 4. It supports the inspection of blocks, transactions, addresses, and entities as well as navigation along the nodes and edges of the address and entity graph. In this manner, users can trace monetary flows and construct relevant sub-graphs reflecting the result of their investigations. The dashboard also provides means for automatically searching for certain types of nodes, such as entities representing exchanges, within certain boundary conditions (e.g., maximum node degrees). Users can also annotate nodes, export graphs, import additional tags, and download audit logs of their interactions.
Technically, the Dashboard is implemented as a pure JavaScript REST-API client, which is bundled using webpack (https://webpack.js.org). It is also important to emphasize that the GraphSense Dashboard is read-only, which means that no user interactions or data entered by the user are sent back to the GraphSense server.
Fig. 4. Screenshot of the GraphSense Dashboard
TagPacks
Attribution tags are any form of context information that can be attributed to an address, transaction, or cluster, such as the name of an exchange hosting the associated wallet or some other personally identifiable information (PII) of the account holder. The strength of the attribution approach lies in combining address clusters with attribution tags: a tag attributed to a single address controlled by some cryptoasset service, which typically forms a large address cluster, can easily de-anonymize hundreds of thousands of addresses.
In our previous work [2], we have already highlighted the important role of attribution tags in modern cryptoasset analytics and identified key legal requirements for the forensic processing of data. We pointed out that the provenance of attribution tags is a critical foundation for assessing their quality and authenticity, as well as for enabling trust and allowing reproducibility. If used for law enforcement purposes, the provenance even becomes a legal requirement.
In the following, we describe how we collect and organize attribution tags in so-called TagPacks, how we share and collaboratively manage them using Git, and how we intend to establish attribution tag interoperability among tools by using an agreed-upon taxonomy.
A TagPack defines a structure for collecting and packaging attribution tags with additional shared provenance metadata (e.g., title, creator, etc.). TagPacks are represented as YAML files, which can easily be created by hand or exported automatically from other systems. Listing 1.1 shows a minimal TagPack, which attributes addresses from different ledgers (BTC, BCH, ZEC) to the "Internet Archive", which is a non-profit organization in the US. It also records the creator of this TagPack, its last modification date (lastmod), the type of entity controlling these addresses (category), as well as the information source.
    title: GraphSense Demo TagPack
    creator: GraphSense Team
    description: A collection of tags commonly used for demonstrating GraphSense features
    label: Internet Archive
    category: organization
    lastmod: 2019-06-12
    source: https://archive.org/donate/cryptocurrency
    tags:
        - address: 1Archive1n2C579dMsAu3iC6tWzuQJz8dN
          currency: BTC
        - address: 1K1rgZ1dz9w7dsR1HGS1drmzfUHMtqx1Tc
          currency: BCH
        - address: t1ZmpK4QFcvyQZ3ghTgSboBW8b4HgiZHQF9
          currency: ZEC
Listing 1.1. Minimal TagPack Example
A TagPack consists of a header and a body section. The header lists several mandatory and optional metadata fields, and the body provides the list of tags. The range of possible properties for header and body entries is defined in a TagPack Schema. In the above example, the properties title and creator are part of the TagPack header, while the list of tags represents the body.
To avoid repeating property values for all tags in a TagPack, body fields can also be abstracted and added to the header, where they are inherited by all body elements. In the example above, the label Internet Archive is a body-level tag property, which has been added to the header to avoid repetition. It is also possible to override abstracted fields in the body. This could be relevant if someone creates a TagPack comprising several tags and then adds additional tags later on, which then have different property values.
GraphSense also provides a dedicated TagPack Management Tool (https://github.com/graphsense/graphsense-tagpack-tool), which allows validation of TagPacks against the TagPack schema and referenced taxonomies before they are ingested into the NoSQL storage back-end and processed as part of the transformation step.
Instead of defining and building a data provenance model and management system from scratch, GraphSense adopts Git for storing and publishing attribution tags. Git has its origin in distributed software development and has, over the last decade, become the de-facto standard for publishing and tracking changes in source code files. It automatically creates hashes over each file and allows users, if required, to digitally sign their contents after each commit. Git is increasingly used for sharing smaller and even large datasets (Git LFS, https://git-lfs.github.com).
The use of common terminologies is essential for data sharing and establishing interoperability across tools. Therefore, the TagPack schema defines two properties that take concepts from agreed-upon taxonomies as values:
– category: defines the type of real-world entity that is in control of a given address. Possible concepts (e.g., Exchange, Marketplace) are defined in the INTERPOL Darkweb and Cryptoassets Entity Taxonomy (https://github.com/INTERPOL-Innovation-Centre/DW-VA-Taxonomy).
– abuse: if an address was involved in some abusive behavior, this property's value defines the type of abuse and can take values from the INTERPOL Darkweb and Cryptoassets Abuse Taxonomy (https://github.com/INTERPOL-Innovation-Centre/DW-VA-Taxonomy).
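The header inheritance and the taxonomy check described in this section can be sketched together as follows. This is a minimal illustration, not the TagPack Management Tool: the function name `resolve_tags`, the set of header-only fields, and the small set of category concepts are assumptions for this example (the authoritative lists are given by the TagPack schema and the referenced taxonomies), and the TagPack is written as a plain dictionary rather than parsed from YAML.

```python
# Assumed header-only fields; every other header field is treated as an
# abstracted body field that tags inherit.
HEADER_ONLY = {"title", "creator", "description", "tags"}

# Hypothetical subset of entity concepts, for illustration only.
ENTITY_CONCEPTS = {"organization", "exchange", "marketplace"}


def resolve_tags(tagpack):
    """Return one flat dict per tag, applying header-level defaults."""
    inherited = {k: v for k, v in tagpack.items() if k not in HEADER_ONLY}
    resolved = []
    for tag in tagpack["tags"]:
        merged = {**inherited, **tag}  # tag-level values override the header
        category = merged.get("category")
        if category is not None and category not in ENTITY_CONCEPTS:
            raise ValueError(f"unknown category: {category!r}")
        resolved.append(merged)
    return resolved


tagpack = {
    "title": "GraphSense Demo TagPack",
    "creator": "GraphSense Team",
    "label": "Internet Archive",
    "category": "organization",
    "tags": [
        {"address": "1Archive1n2C579dMsAu3iC6tWzuQJz8dN", "currency": "BTC"},
        # The overriding label below is invented to show a body-level override.
        {"address": "t1ZmpK4QFcvyQZ3ghTgSboBW8b4HgiZHQF9", "currency": "ZEC",
         "label": "Internet Archive (Zcash)"},
    ],
}
tags = resolve_tags(tagpack)
```

The first tag inherits the header label unchanged, while the second keeps its own label, which mirrors the override rule for abstracted fields.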
13n the example TagPack provided in Listing 1.1, for instance, the real worldactor controlling these addresses is categorized as organization , which directlymaps to a concept that is uniquely identified via a URI as part of the IN-TERPOL Darknet Entity Taxonomy. If all cryptoasset analytics tools categorizeattribution tags according to these taxonomies and also use the provided defi-nitions, then attribution data can be harmonized across tools and the first steptowards better interoperability can be achieved. Given current developments in the cryptoasset field, we strongly believe thatthere will a need for a deeper quantitative understanding of both individual andaggregate transaction flows, and of the technical and socio-economic aspects ofincreasingly complex cryptocurrency ecosystems. Networks are natural abstrac-tions for such systems as they provide the basis for task-specific measurement andsimulation methods. With GraphSense we provide the required computationalinfrastructure and pre-computed network abstractions for implementing suchmethods. With its modular, horizontally scalable system architecture, Graph-Sense also provides the flexibility to quickly react to upcoming, yet unforeseendevelopments and methodological challenges in this field.
Limitations
GraphSense is a steadily evolving system and also faces some yetunresolved limitations. First, the price of horizontal scalability is that Graph-Sense runs on a distributed hardware infrastructure. The operation of such aninfrastructure requires a specific skill-set, which is hard to find and also requiresrelatively large initial investment costs. However, we argue that hosting Graph-Sense externally (e.g., in commercial cloud infrastructure) might become evenmore costly with increasing data volumes and involve yet unforeseen technical,organizational, and financial dependencies.Lack of real-time updates is another inherent limitation of the overall sys-tem architecture, which has been designed for data analytics workflows. Thebottleneck lies in updating the address clusters and in re-computing the graphabstractions, which can, depending on the dimensions of the hardware cluster,take several hours. However, we argue that real-time investigations are hardlyever needed, because most analytics tasks, for instance, the forensic analysis ofa ransomware attack, are conducted in retrospect.The third limitation we are facing at the moment is the lack of incentive forcollecting and sharing attribution tags. The industry has the means for collectingbut not the incentive for sharing for competitive reasons; in academia, there isan incentive for sharing, for scientific reproducibility for example, but typicallyfew resources for collecting.Finally, we also would like to point out that the overall analytics pipelinecould still be optimized. Address clusters, for instance, are currently computed https://interpol-innovation-centre.github.io/DW-VA-Taxonomy/taxonomies/entities Outlook
GraphSense follows an agile release plan with major and minor releases. With the next upcoming minor release (0.4.6), it should be possible to deploy GraphSense on a single server and retrieve pre-computed dumps from a periodically updated data repository. The next major release (0.5.X) will support account model ledgers, starting with Ethereum. Depending on the adoption of off-chain payment channels, a future major release (0.6.X) might also support analysis of payment channel transactions across ledgers (cf. [12]).

We also envision GraphSense becoming a key technology in a research subfield, which we call CryptoFinance. The goal is to systemically assess emerging technologies and paradigms like Decentralized Finance (DeFi), to learn more about opportunities and risks associated with these developments, and to ultimately come up with measures that help us in quantifying systemic risks in cryptoasset ecosystems. Efficient and effective computations over graph and network abstractions will certainly play a central role in this effort.

GraphSense is a response to the increasing need for a general-purpose cryptoasset analytics platform. In this paper, we discussed the network abstractions relevant in this field, elaborated on our design rationale, and described the architecture and current technical building blocks of GraphSense. We will certainly continue developing GraphSense as part of our research activities in the field of cryptoasset analytics and expect that GraphSense will soon support analytics across UTXO and account-model ledgers and, if relevant, also off-chain payment channels. We also expect, and to some extent already see, that future GraphSense development will become more of a community effort, which is driven by academia and stakeholders in the financial industry and the emerging CryptoFinance field.
References
1. FATF: Guidance for a risk-based approach to virtual assets and virtual asset service providers. Tech. rep., FATF (2019)
2. Fröwis, M., Gottschalk, T., Haslhofer, B., Rückert, C., Pesch, P.: Safeguarding the evidential value of forensic cryptocurrency investigations. Forensic Science International: Digital Investigation (2020). https://doi.org/10.1016/j.fsidi.2019.200902
3. Harrigan, M., Fretter, C.: The unreasonable effectiveness of address clustering. In: 2016 Intl IEEE Conferences on Ubiquitous Intelligence Computing, Advanced and Trusted Computing, Scalable Computing and Communications, Cloud and Big Data Computing, Internet of People, and Smart World Congress (UIC/ATC/ScalCom/CBDCom/IoP/SmartWorld). pp. 368–373 (2016). https://doi.org/10.1109/UIC-ATC-ScalCom-CBDCom-IoP-SmartWorld.2016.0071
4. Haslhofer, B., Karl, R., Filtz, E.: O bitcoin where art thou? Insight into large-scale transaction graphs. In: Joint Proceedings of the Posters and Demos Track of the 12th International Conference on Semantic Systems (SEMANTiCS 2016). Leipzig, Germany (2016), http://ceur-ws.org/Vol-1695/paper20.pdf
5. Kalodner, H., Möser, M., Lee, K., Goldfeder, S., Plattner, M., Chator, A., Narayanan, A.: BlockSci: Design and applications of a blockchain analysis platform. In: 29th USENIX Security Symposium (USENIX Security 20). pp. 2721–2738. USENIX Association (Aug 2020)
6. Meiklejohn, S., Pomarole, M., Jordan, G., Levchenko, K., McCoy, D., Voelker, G.M., Savage, S.: A fistful of bitcoins: Characterizing payments among men with no names. In: Proceedings of the 2013 Conference on Internet Measurement Conference. pp. 127–140. IMC ’13, Association for Computing Machinery, New York, NY, USA (2013). https://doi.org/10.1145/2504730.2504747
7. Paquet-Clouston, M., Haslhofer, B., Dupont, B.: Ransomware payments in the bitcoin ecosystem. Journal of Cybersecurity (1) (05 2019). https://doi.org/10.1093/cybsec/tyz003
8. Paquet-Clouston, M., Romiti, M., Haslhofer, B., Chavat, T.: Spams meet cryptocurrencies: Sextortion in the bitcoin ecosystem. In: ACM conference on Advances in Financial Technologies (AFT’19). Zurich, Switzerland (2019), https://arxiv.org/abs/1908.01051
9. Park, H.M., Park, N., Myaeng, S.H., Kang, U.: PACC: Large scale connected component computation on Hadoop and Spark. PLOS ONE (3), 1–25 (03 2020). https://doi.org/10.1371/journal.pone.0229936
10. Reid, F., Harrigan, M.: An Analysis of Anonymity in the Bitcoin System, pp. 197–223. Springer, New York, NY (2013). https://doi.org/10.1007/978-1-4614-4139-7_10
11. Rodriguez, M.A., Neubauer, P.: Constructions from dots and lines. Bulletin of the American Society for Information Science and Technology (6), 35–41 (2010)
12. Romiti, M., Victor, F., Moreno-Sanchez, P., Nordholt, P.S., Haslhofer, B., Maffei, M.: Cross-layer deanonymization methods in the lightning protocol. In: Financial Cryptography and Data Security (FC 2021) (2021), https://arxiv.org/abs/2007.00764
13. Sackheim, M.S., Howell, N.A.: The Virtual Currency Regulation Review. Law Business Research Ltd, London (2020)
14. Stütz, R., Gaži, P., Haslhofer, B., Illum, J.: Stake shift in major cryptocurrencies: An empirical study. In: Bonneau, J., Heninger, N. (eds.) Financial Cryptography and Data Security. Lecture Notes in Computer Science, vol. 12059, pp. 97–113. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-51280-4_7, https://arxiv.org/abs/2001.04187
15. Victor, F.: Address Clustering Heuristics for Ethereum. In: Bonneau, J., Heninger, N. (eds.) Financial Cryptography and Data Security. Lecture Notes in Computer Science, vol. 12059, pp. 617–633. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-51280-4_33