Scuola Politecnica e delle Scienze di Base
Master's Degree Programme in Computer Engineering
Master's Thesis in Distributed Systems
A Visual Interactive Realtime EXplorer for Bitcoin
Academic Year 2013/14
Ch.mo Prof. Stefano Russo
Dott. Marco Benedetti
Chapter 1: Bitcoin: the protocol and the currency
Monetary aspects and incentive
Bitcoin origins
Bitcoin ecosystem
Chapter 2: State of the art in bitcoin forensic analysis
Related literature
The flexcoin case
The biggest unsolved case: Mt.Gox eruption
Chapter 3: Considerations about the blockchain
Chapter 4: Requirements specification
Chapter 5: High-level architecture
Chapter 6: The Identity Reasoner
Address clustering
Data structure
Invariants of the identity reasoner database
Considerations about address graph size
Implementation details and experimental results
Chapter 7: The Query Engine
Balance queries
Flow queries
Implementation details
Chapter 8: The graphical user interface
Implementation details
Chapter 9: Experimental results
Future directions
Bitcoin is a peer-to-peer, decentralised virtual currency born a few years ago and currently used for trading various services and goods all over the world. The currency is not legally recognised by most governments and is controlled by no central entity. Despite these shortcomings (or perhaps thanks to them), the market capitalisation of Bitcoin (BC) is already in the order of billions of dollars.
The identity of Bitcoin users is hidden behind pseudonyms, but the ledger of financial transactions is globally visible. Many academic studies deal with the problem of linking pseudonyms to real-world identities and inferring knowledge from the graph of transactions (whose size is in the order of tens of millions and rising quickly). At present, the graph of transactions is explored manually, by ad-hoc scripting, or by using software developed for other goals, such as visualisers for (graph) databases. Looking for and making sense of interesting or novel transaction patterns can be quite challenging.
This work aims to produce a modular, scalable, adaptable software toolkit meant to assist a human expert in analysing and making sense of a network of bitcoin transactions.
The software will be called “VIREX-BC”, as in “Visual Interactive Realtime EXplorer” for BC:
visual, in that all the information will be presented and explored through sophisticated imagery and infographics generated on the fly depending on the search context;
interactive, in that input from users will be accepted at any moment to direct and refine the exploration, and
realtime, in that fresh transactions will be included and analysed on the fly as they are timestamped in the network.
This work is inspired by “bitIodine”, an open-source tool for extracting intelligence from the Bitcoin network, developed by Michele Spagnuolo for his master's thesis at Politecnico di Milano and published at Financial Cryptography 2014.
Chapter 1: Bitcoin: the protocol and the currency
Bitcoin is a decentralised global digital currency, based on open-source software, that implements a peer-to-peer network which agrees on a logical order of transactions thanks to a distributed algorithm. It first appeared in January 2009, with the first release of the Bitcoin client. In this chapter we describe the working principles of Bitcoin (addresses, transactions and blocks) and the ecosystem of services born around this striking technology.
Before accepting payments in bitcoin, we have to create a bitcoin address. An address, in Bitcoin, is a string of letters and numbers that can be thought of as an International Bank Account Number (IBAN) code: it is public, it has a reduced risk of transcription error and it is needed to receive money. An example of an address is 1PeppeMEUXx6XgjubBnEQKtay2xpefnCZT; currently this address has a balance of 0.2 bitcoin. Generating a bitcoin address has no cost and can be done using the Bitcoin client. The privacy model introduced by Bitcoin (see picture on the left) has public transactions, so addresses also work as pseudonyms that hide people's real identities. A person can have multiple addresses, therefore one of the principal activities in bitcoin forensic analysis is linking addresses controlled by the same user. The chapter “The Identity Reasoner” gives a definition of address control and of a cluster of addresses.
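The reduced risk of transcription error comes from a checksum embedded in the address string. The following minimal sketch assumes the classic Base58Check layout (one version byte, a 20-byte payload and a 4-byte checksum taken from a double SHA-256), which is not spelled out in this text; the function names are hypothetical and chosen only for illustration.

```python
import hashlib

# Base58 alphabet: digits and letters without the look-alikes 0, O, I and l.
ALPHABET = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"


def checksum(data: bytes) -> bytes:
    """First 4 bytes of SHA-256 applied twice to the data."""
    return hashlib.sha256(hashlib.sha256(data).digest()).digest()[:4]


def base58check_encode(version: bytes, payload: bytes) -> str:
    """Encode version + payload + checksum as a Base58 string."""
    raw = version + payload
    raw += checksum(raw)
    n = int.from_bytes(raw, "big")
    digits = ""
    while n > 0:
        n, rem = divmod(n, 58)
        digits = ALPHABET[rem] + digits
    # Each leading zero byte is written as a leading '1'.
    pad = len(raw) - len(raw.lstrip(b"\x00"))
    return "1" * pad + digits


def is_valid_address(addr: str) -> bool:
    """Check alphabet, length (25 bytes once decoded) and checksum."""
    n = 0
    for ch in addr:
        idx = ALPHABET.find(ch)
        if idx < 0:
            return False
        n = n * 58 + idx
    pad = len(addr) - len(addr.lstrip("1"))
    raw = b"\x00" * pad + n.to_bytes((n.bit_length() + 7) // 8, "big")
    return len(raw) == 25 and raw[-4:] == checksum(raw[:-4])


# Illustrative address built from an all-zero 20-byte payload (not a real user).
example = base58check_encode(b"\x00", bytes(20))
print(example, is_valid_address(example))
```

A single mistyped character changes the decoded bytes, so the checksum no longer matches and the address is rejected before any money is sent.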
Bitcoin transactions are public transfers of funds among addresses: in particular, they can transfer bitcoins from zero or more bitcoin addresses to one or more bitcoin addresses. The following picture shows four transactions, in a scenario in which Alice and Bob are two bitcoin users. For each of the four transactions, input and output addresses are represented by colours: the transactions on the far right have no input addresses and one output address, the middle transaction has one input address (green) and two output addresses (red and cyan) and, finally, the fourth transaction has two input addresses and two output addresses.
Moreover, each input of a transaction has a reference (marked with a broken line) to the output of a previous transaction. These references are needed to prove that an address owns the amount of bitcoins it wants to transfer and, together with the transactions, they form a directed acyclic graph. Ron and Shamir analyse this graph in detail.
The transactions on the far right have neither input addresses nor references to previous transactions: they are the so-called mining transactions. Mining transactions are the way bitcoins are injected into the network, and their amount is a reward for bitcoin miners, the nodes that contribute to writing the Bitcoin public ledger (see the Blocks section for details).
The middle transaction has a very common structure: it has one input address (the green one, which belongs to Bob) and two outputs (Bob's red address and Alice's cyan address).
In this case, Bob is spending his mined 50 bitcoins to pay 30 bitcoins to Alice's cyan address, which already has a balance of 50 bitcoins. Since the Bitcoin protocol forces users to spend the whole output of a transaction, Bob needs to spend the whole content of his green address (50 bitcoins), even if he just wants to transfer 30 bitcoins to Alice.
In order to collect the change, Bob creates a new red address, called the change address, and sends 19 bitcoins to it. As the reader has probably noticed, the sum of the inputs is not equal to the sum of the outputs: there is a difference of 1 bitcoin.
This difference (which in practice is lower than 1 bitcoin) is called the transaction fee, and is necessary to let the network accept and timestamp the transaction.
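The arithmetic of Bob's payment can be written down in a few lines. This is only an illustrative sketch: the `Transaction` class and the address labels are hypothetical names introduced here, and amounts are expressed in whole bitcoins for readability.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class Transaction:
    """Toy transaction: address label -> amount in bitcoin, for each side."""
    inputs: dict
    outputs: dict

    def fee(self) -> Decimal:
        # The fee is implicit: whatever the inputs carry beyond the outputs.
        total_in = sum(self.inputs.values(), Decimal(0))
        total_out = sum(self.outputs.values(), Decimal(0))
        return total_in - total_out


# Bob spends his whole 50 BTC output: 30 to Alice, 19 back to himself as change.
bob_pays_alice = Transaction(
    inputs={"bob_green": Decimal(50)},
    outputs={"alice_cyan": Decimal(30), "bob_red_change": Decimal(19)},
)
print(bob_pays_alice.fee())  # the remaining bitcoin is the fee
```

Note that the fee never appears as an explicit output: it is simply the part of the inputs that no output claims.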
The last transaction is multi-input. Let us suppose that Alice wants to buy a service for 60 bitcoins. She must prove that she owns at least 60 bitcoins, so she inserts two references to the previous transactions that deposited a total of 80 bitcoins in her cyan address. In order to collect the change, Alice generates an orange address that will be used in the future. The inputs of a multi-input transaction need not come from the same address. As we will see, multi-input transactions are the primary source of evidence for linking addresses to the same controller.
Technically speaking, a transaction is a piece of data that is broadcast to the Bitcoin network. As shown in the picture at left, it is signed with the private key of the payer, it embeds the public key of the payee, and the reference to the previous transaction is obtained through hashing. All transactions are public, unencrypted and permanently recorded in the blockchain since its origin.
Now that we have defined a data structure for transactions, we need a way to logically order them, so that the payee knows that the previous owners of an output did not sign any earlier transaction, thereby avoiding “double spending” of a transaction output. A common solution to this problem is to introduce a trusted central authority, or mint, that checks every transaction for double spending.
Bitcoin proposes a completely distributed solution, in which all transactions are publicly announced and the network agrees on a single history of the order in which they were received.
Using the following picture as reference, we will now explain the distributed algorithm that each node performs in order to reach this goal.
Step1: new transactions are broadcast to all nodes
Step2: each node collects new transactions into a block
Blocks are represented as green boxes: they contain transactions and are stored in an ordered list, called the blockchain, which defines the logical order of transactions in the Bitcoin network. Each block contains a hash of the previous one, proving that the data of the previous block must have existed (in order to be hashed) at the time the next block was added to the blockchain. In the example shown in the figure, we can state that Tx1 and Tx2 must have existed when Tx6 and Tx7 were added to the blockchain. Transactions in the same block have to be considered concurrent, in the sense that it is not possible to logically order them. As we will see in the following sections, all nodes agree on a single blockchain, and a new block is generated about once every 10 minutes.
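The hash link between consecutive blocks can be illustrated with a toy chain. This is a simplified sketch, not the real block format: blocks are plain dictionaries hashed through their JSON form with a single SHA-256, and the helper names are hypothetical.

```python
import hashlib
import json


def block_hash(block: dict) -> str:
    """Hash of a toy block, computed over its canonical JSON serialisation."""
    payload = json.dumps(block, sort_keys=True).encode()
    return hashlib.sha256(payload).hexdigest()


def append_block(chain: list, transactions: list) -> None:
    """Append a block that commits to the hash of the current last block."""
    prev = block_hash(chain[-1]) if chain else "0" * 64
    chain.append({"prev_hash": prev, "transactions": list(transactions)})


def chain_is_consistent(chain: list) -> bool:
    """Every block must store the hash of the block right before it."""
    return all(
        chain[i]["prev_hash"] == block_hash(chain[i - 1])
        for i in range(1, len(chain))
    )


chain = []
append_block(chain, ["Tx1", "Tx2"])
append_block(chain, ["Tx6", "Tx7"])
print(chain_is_consistent(chain))      # the untouched chain is consistent

chain[0]["transactions"].append("TxForged")
print(chain_is_consistent(chain))      # editing an old block breaks the link
```

Because the second block stores the hash of the first, any later change to Tx1 or Tx2 is detectable: this is the sense in which they "must have existed" when Tx6 and Tx7 were added.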
Step3: each node works on finding a difficult proof-of-work for its block
Once all new transactions are collected into a block, a node tries to add this block to the current blockchain and to persuade the other nodes to agree that the next link of the blockchain is the block it forged, but this is a very difficult task. Nodes in the Bitcoin network accept only blocks that carry a proof-of-work: a piece of data that is difficult to find but immediate to verify. In particular, the proof-of-work for a block is a nonce value that gives the block a hash beginning with a fixed number of zero bits. Once the CPU effort has been expended to satisfy the proof-of-work, the block cannot be changed without redoing the work. As later blocks are chained after it, the work to change the block would include redoing all the blocks after it.
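The asymmetry between finding and verifying a proof-of-work can be sketched as a brute-force nonce search. This is a toy model under stated assumptions: the block is reduced to an arbitrary byte string, the nonce is appended as 8 bytes, and the difficulty is a small number of leading zero bits so that the search ends quickly; the function names are hypothetical.

```python
import hashlib


def meets_target(header: bytes, nonce: int, zero_bits: int) -> bool:
    """Verification: one double SHA-256 and a comparison."""
    data = header + nonce.to_bytes(8, "big")
    digest = hashlib.sha256(hashlib.sha256(data).digest()).digest()
    # The hash, read as a 256-bit integer, must start with `zero_bits` zeros.
    return int.from_bytes(digest, "big") >> (256 - zero_bits) == 0


def find_nonce(header: bytes, zero_bits: int) -> int:
    """Search: try nonces one by one until the hash meets the target."""
    nonce = 0
    while not meets_target(header, nonce, zero_bits):
        nonce += 1
    return nonce


header = b"prev_hash|Tx6|Tx7"          # stand-in for real block contents
nonce = find_nonce(header, zero_bits=12)
print(nonce, meets_target(header, nonce, 12))
```

Each additional zero bit doubles the expected number of attempts in `find_nonce`, while `meets_target` always costs a single hash computation.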
The majority decision is represented by the longest chain, which has the greatest proof-of-work effort invested in it. This algorithm works if a majority of the CPU power is controlled by honest nodes, where a node is said to be honest when it agrees to work on the longest chain it knows. Further considerations about proof-of-work integrity can be found in the literature.
Step4: When a node finds a proof-of-work, it broadcasts the block to all nodes.
Step5: Nodes accept the block only if all transactions in it are valid and not already spent.
Step6: Nodes express their acceptance of the block by working on creating the next block in the chain, using the hash of the accepted block as the previous hash.
Monetary aspects and incentive
Bitcoin is designed as a system where no central monetary authority is involved. New money is created and introduced into the system through the process of validating transactions (i.e. finding valid blocks): by convention, the first transaction in a block is a special transaction that starts a new coin owned by the creator of the block. This, apart from providing a way to initially distribute coins into circulation, adds an incentive for nodes to support the network. The steady addition of a constant amount of new coins is analogous to gold miners expending resources to add gold to circulation. In our case, it is CPU time and electricity that are expended.
The supply of money evolves based on an agreement between the users performing the mining activity. Currently, the scheme is designed to supply money at a predictable pace: the number of bitcoins generated per block halves every 4 years, so that the total number of bitcoins in circulation approaches 21 million around 2040 (see graph). This solution has several negative macroeconomic implications, such as price instability and a deflationary economy.
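The 21 million ceiling is the sum of a geometric series of block rewards. The sketch below assumes the parameters of the reference client, which are not stated in this text: an initial reward of 50 bitcoins, a halving every 210,000 blocks (roughly 4 years at one block every 10 minutes) and integer arithmetic in satoshis (1 bitcoin = 100,000,000 satoshis).

```python
BLOCKS_PER_HALVING = 210_000          # assumed halving interval, in blocks
SATOSHIS_PER_BITCOIN = 100_000_000
INITIAL_REWARD = 50 * SATOSHIS_PER_BITCOIN


def block_reward(height: int) -> int:
    """Reward, in satoshis, of the mining transaction at a given block height."""
    halvings = height // BLOCKS_PER_HALVING
    return INITIAL_REWARD >> halvings   # integer halving, rounds down


def total_supply() -> int:
    """Sum of all block rewards until the reward rounds down to zero."""
    total, reward = 0, INITIAL_REWARD
    while reward > 0:
        total += BLOCKS_PER_HALVING * reward
        reward >>= 1
    return total


print(block_reward(0) / SATOSHIS_PER_BITCOIN)       # first blocks
print(block_reward(300_493) / SATOSHIS_PER_BITCOIN)  # block height cited in Chapter 3
print(total_supply() / SATOSHIS_PER_BITCOIN)         # just below 21 million
```

Because each halving rounds down to a whole number of satoshis, the total stays slightly below 21 million bitcoins rather than reaching it exactly.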
When the money supply has reached the plateau, the incentive will come from transaction fees. If the output value of a transaction is less than its input value, the difference is a fee that is added to the incentive value of the block containing the transaction.
Both incentives may help encourage nodes to stay honest. If a greedy attacker were able to assemble more CPU power than all the honest nodes, he would have to choose between using it to defraud people by stealing back his payments, or using it to generate new coins. He ought to find it more profitable to play by the rules, which favour him with more new coins than everyone else combined, than to undermine the system and the validity of his own wealth. The integrity of proof of work has also been assessed in a scenario in which Bitcoin is used as a primary currency for the online transfers currently carried out by credit cards.
The theoretical roots of Bitcoin can be found in the Austrian school of economics and its criticism of the current fiat money system and of the interventions undertaken by governments and other agencies, which, in its view, result in exacerbated business cycles and massive inflation.
In 1998, cryptography advocate Wei Dai suggested a system in which the currency would be both regulated and created through crowdsourced cryptography.
In 2008, a person (or a group of people) under the pseudonym of Satoshi Nakamoto distributed a paper named “Bitcoin: A Peer-to-Peer Electronic Cash System” and then released open-source software named Bitcoin, which was a first attempt to give shape to this idea. The first bitcoin transaction is dated 3 January 2009. In the first years of its life, Bitcoin was used in small communities of early adopters: everyone could install the open-source software on their personal computer and participate in the network, also by minting new bitcoins. In 2010 Bitcoin was used by an individual to trade a real good for the first time, but the true explosion of its popularity can be dated to mid-2012.
Since then, a wide variety of service providers began to accept bitcoin as a means of payment, and an ecosystem of support services, such as wallet services and exchanges, was born. Some of these third parties are noteworthy.
Wallet services allow bitcoin users to transact with others without installing the Bitcoin client. These services manage a bitcoin address on behalf of its owner, who can then send and receive bitcoins in a home-banking fashion.
greenaddress.it and blockchain.info offer wallet services.
Bitcoin currency exchanges allow users to trade bitcoins for other currencies, earning a commission on each trade. They usually also operate as a wallet service, storing amounts of money on behalf of their customers, and allow deposits and withdrawals in different currencies.
The Silk Road was a famous online shop in the deep web that could only be accessed via Tor. The site allowed people to buy a variety of items, but became famous as a market for drugs and other illicit items.
On 2 October 2013 the FBI shut down the Silk Road, and its creator was arrested on charges of alleged murder-for-hire and narcotics trafficking.
Mining pools are distributed services aimed at transaction validation, in which clients jointly contribute to the validation of transactions and then split the resulting reward according to the processing power that each participant contributed. Pooled mining effectively reduces the granularity of the block generation reward, spreading it out more smoothly over time. Deepbit was an example of a mining pool.
The Bitcoin network is a convenient infrastructure for gambling. Its protocol allows online gambling services to prove that results were calculated fairly without trusting any external party. Hundreds of gambling sites exploit the Bitcoin network, including dice games, casinos, lotteries, slot machines and poker rooms.
Satoshidice and just-dice are two of the most famous dice games.
Nowadays many vendors accept bitcoin as a means of payment, including restaurants and shops, even though bitcoins are rarely used to purchase physical goods or services, in particular because of price instability.
coinmap.org shows vendors accepting bitcoin spread all around the world.
Bitcoin is not the only virtual currency: many other currencies, called alt-coins, have been created since 2009 by modifying the Bitcoin core source code, such as litecoin, namecoin and dogecoin. Moreover, many currencies, usually called meta-coins, are built upon the Bitcoin infrastructure, each adding a particular service (e.g. zerocoin adds strong anonymity to Bitcoin).
In general, the Bitcoin protocol and its infrastructure (the blockchain), currently used mainly to transfer coins among people, can be used to send public, potentially anonymous, timestamped, timeless and certified messages; it therefore has a wide variety of applications and can replace many forms of intermediation. It is not certain that Bitcoin will succeed as a currency, but its technology certainly deserves the attention of entrepreneurs and regulators.
Chapter 2: State of the art in bitcoin forensic analysis
Due to Bitcoin's claimed anonymity, forensic analysis of its network has been a well-studied topic in the literature since 2009.
In 2011 Reid and Harrigan  first linked addresses belonging to the same entity and showed some implications for anonymity.
In 2012, the privacy implications of Bitcoin were analysed and evaluated for a scenario in which it is used as a primary currency to support the daily transactions of individuals in a university setting. Through a simulator that faithfully mimics the use of Bitcoin within a university, the authors show that the profiles of almost 40% of the users can be, to a large extent, recovered.
In 2013, researchers at the University of California collected information on the web and tried to group bitcoin addresses based on evidence of shared authority.
In 2013, Michele Spagnuolo released the open-source software Bitiodine, together with his thesis at the University of Illinois. His work was later published at Financial Cryptography under the title “Bitiodine: Extracting Intelligence From The Bitcoin Network”.
Bitiodine is able to cluster addresses and to classify them using a dataset partially obtained automatically, through scrapers for the major web sources of bitcoin addresses. Bitiodine has been the main source of inspiration for this work: with the help of its creator, its source code has been studied and analysed in depth. Virex tries to preserve the strengths of Bitiodine and to add some improvements, such as an architecture for realtime tracking of transactions and a graphical user interface.
The flexcoin case
Flexcoin, a bitcoin bank, was forced to close because of the theft of 896 bitcoins on 3 March 2014.
The company posted on its website the following statement:
The attacker logged into the flexcoin front end from IP address 188.8.131.52 under a newly created username and deposited to address 1DSD3B3uS2wGZjZA-
The coins were then left to sit until they had reached 6 confirmations.
The attacker then successfully exploited a flaw in the code which allows transfers between flexcoin users. By sending thousands of simultaneous requests, the attacker was able to "move" coins from one user account to another until the sending account was overdrawn, before balances were updated.
This was then repeated through multiple accounts, snowballing the amount, until the attacker withdrew the coins (1NDkevapt4SWYFEmquCDBSf7DLMTNVggdu, and 1QFcC5JitGwpFKqRDd9QNH3eGN56dCNgy6)
The information provided is enough to visualise the flows between flexcoin and its attacker, and also to draw some conclusions about where the stolen coins ended up.
The biggest unsolved case: Mt.Gox eruption
Mt. Gox, also called "Mount Gox" or "MTGOX", was one of the most widely used bitcoin currency exchanges: it was launched in July 2010 and by 2013 was handling 70% of all bitcoin trades. The exchange was closed in February 2014. Mark Karpelès, Mt. Gox CEO, filed for bankruptcy and announced that around 850,000 bitcoins belonging to customers and the company were missing and likely stolen. Although 200,000 bitcoins have since been found, the reasons for the disappearance (theft, fraud, mismanagement, or a combination of these) are unclear as of March 2014. The timeline of the events that led to the Mt. Gox shutdown is the following.
On 7 February 2014, Mt. Gox halted all bitcoin withdrawals. The company said it was pausing withdrawal requests “to obtain a clear technical view of the currency processes”.
On 10 February 2014, the company issued a press release stating that the issue was due to transaction malleability, a known bug that affected many Bitcoin clients, including the official one. For technical details about transaction malleability, see Decker and Wattenhofer.
On 24 February 2014, Mt. Gox suspended all trading, and hours later its website went offline, returning a blank page.
On 28 February 2014, Mt. Gox filed for bankruptcy protection in Tokyo, reporting that the company had lost almost 750,000 of its customers' bitcoins and around 100,000 of its own bitcoins.
On 20 March 2014, Mt. Gox reported on its website that it had found 200,000 bitcoins in an old-format cold wallet. That brings the total number of lost bitcoins down from 850,000 to 650,000.
Chapter 3: Considerations about the blockchain
In this chapter the reader will find some considerations about the blockchain size and consequent scalability of virex. First, the state of the current blockchain will be analysed and some assumptions about the structure of transactions will be made. Then, the trend of the total number of transactions will be considered in order to make an attempt to predict the size of the blockchain in the future.
The following table reports some measurements obtained from the blockchain at the time of writing (Tue, 13 May 2014 06:42:37 GMT).

Number of blocks                 300,493
Number of transactions        38,649,948   (~39M)
Number of distinct addresses  35,741,676   (~36M)
Number of outputs             99,854,793   (~100M)
Number of inputs              88,905,461   (~89M)

Each transaction can have an arbitrary number of inputs and outputs, and can generate an arbitrary number of new addresses, but some considerations can be made about the distributions of the number of inputs and outputs per transaction. The following charts show that these distributions peak at 1 input and 2 outputs, respectively.

[Figure: distributions of the number of inputs and of the number of outputs per transaction]
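As a quick check on the measurements above, the per-transaction averages can be derived directly from the raw counts. The sketch below copies the counts from the table; the only added assumption is that each block contains exactly one mining transaction, so that the number of mining transactions equals the number of blocks.

```python
# Counts measured on the blockchain on 13 May 2014 (from the table above).
counts = {
    "blocks": 300_493,
    "transactions": 38_649_948,
    "addresses": 35_741_676,
    "outputs": 99_854_793,
    "inputs": 88_905_461,
}


def per_transaction(quantity: int) -> float:
    """Average of a quantity over all transactions."""
    return quantity / counts["transactions"]


addresses_per_tx = per_transaction(counts["addresses"])
inputs_per_tx = per_transaction(counts["inputs"])
outputs_per_tx = per_transaction(counts["outputs"])
# Variant that counts each mining transaction as having one input
# (assumption: one mining transaction per block).
inputs_per_tx_with_mining = per_transaction(counts["inputs"] + counts["blocks"])

for name, value in [
    ("addresses per transaction", addresses_per_tx),
    ("inputs per transaction", inputs_per_tx),
    ("inputs per transaction, mining counted", inputs_per_tx_with_mining),
    ("outputs per transaction", outputs_per_tx),
]:
    print(f"{name}: {value:.2f}")
```

On average, then, each transaction introduces slightly less than one new address and carries between two and three inputs and outputs.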
Number of inputs   Probability   Number of outputs   Probability
0                  0.01          0                   0
1                  0.62          1                   0.07
2                  0.21          2                   0.85
3                  0.05          3                   0.04
4                  0.05          4                   0.01
5                  0.02          5                   0
…                  …             …                   …

A transaction structure with one input and two outputs is the most common: the input address is used to collect money, the first output address is controlled by the payee and the second is used to collect the change. Sometimes one address is not sufficient to collect a large amount of money to transfer, and more input addresses are needed. We can assume that these distributions are approximately time-invariant, because an increase in the size of a transaction (in terms of number of inputs and outputs) leads to more expensive transaction fees. The expected values for the above distributions are summarised in the following table.

Current number of transactions                                   ~39M
Expected number of addresses per transaction, E[na/nt]           0.92
Expected number of inputs per transaction, E[nin/nt]             2.30
Expected number of inputs per transaction, E2[nin/nt]            2.31
Expected number of outputs per transaction, E[nout/nt]           2.58

NB: The value “expected number of inputs per transaction E2” considers mining transactions as transactions with one input.
Now let us focus on the total number of transactions and on its derivative, the number of transactions per day (see diagrams below). Both grew quite slowly from 2009 to mid-2012, but right afterwards they started to grow faster, with a change in the trend of the number of transactions per day in mid-2012. The steeper slope is consistent with the diffusion of Bitcoin beyond the very early adopters. We will probably experience other trend changes in the future, but it is possible to state that, when the growth slows down significantly, the number of transactions per day will level off at an approximately constant value.
In order to reveal the current trend behind the growth of the total number of transactions after mid-2012, and to predict its value in the near future, a simple linear regression is estimated between the number of days elapsed since 1 June 2012 (called the reference date in the following) and the number of transactions per day, resulting in a fitted line with an intercept of 29,884 transactions/day and a slope of 53.75 transactions/day per day.
Afterwards, to estimate the total number of transactions, it is necessary to integrate this quantity over time, starting from an initial value of 3,590,000 transactions at the reference date.
The results are shown in the following table.
These results are obtained through a rough calculation, but they are useful to assess the feasibility of the project: as explained in the chapters “The Identity Reasoner” and “The Query Engine”, the database size depends linearly on the number of transactions.
Since the number of transactions per day currently grows linearly (so the total grows quadratically) while, according to Moore's law, memory size grows exponentially, doubling every one to three years depending on the estimate, it should be possible to keep the whole database in memory. However, this analysis is quite “optimistic”: if Bitcoin becomes commonly accepted as a means of payment, transactions will grow at a much higher pace before reaching the saturation plateau.
Blockchain estimated size

Date (Jun 01)   Number of transactions per day   Total number of transactions
2012            29,884                           3,590,000
2013            49,502.75                        18,078,082
2014            69,121.5                         39,727,008
2015            88,740.25                        68,536,777
2016            108,359                          104,507,390
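The table above can be reproduced by integrating the fitted line in closed form. Writing d for the number of days elapsed since the reference date, the rate is r(d) = 29,884 + 53.75 d and the total is N(d) = 3,590,000 + 29,884 d + 53.75 d²/2. The sketch below assumes, as the table appears to do, years of exactly 365 days; the function names are hypothetical.

```python
INTERCEPT = 29_884.0        # transactions/day at the reference date (1 June 2012)
SLOPE = 53.75               # growth of the daily rate, transactions/day per day
INITIAL_TOTAL = 3_590_000   # total transactions at the reference date


def tx_per_day(days: int) -> float:
    """Fitted daily transaction rate, `days` after the reference date."""
    return INTERCEPT + SLOPE * days


def total_tx(days: int) -> float:
    """Integral of the fitted rate plus the initial total."""
    return INITIAL_TOTAL + INTERCEPT * days + SLOPE * days ** 2 / 2


for year in range(2012, 2017):
    days = 365 * (year - 2012)   # assumption: leap days are ignored
    print(year, tx_per_day(days), round(total_tx(days)))
```

The quadratic term is what makes the total grow faster than the daily rate: between 2012 and 2016 the rate grows by a factor of about 3.6, the total by a factor of about 29.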
Chapter 4: Requirements specification
The requirements of VIREX can be summarised as a set of questions that the system tries to answer. They are all intended to be “queries”, in the sense that they do not modify the state of the internal systems; for this reason, the virex interface is often referred to as the “virex query language”.
A first classification of virex operations separates questions about balances from questions about flows. The first category contains questions about the amount of bitcoins controlled by an address, an entity or a cluster; the second contains questions about bitcoin transfers among addresses, but also about mined bitcoins.
A second (orthogonal) classification separates questions about addresses from questions about clusters of addresses. The first class considers flows among single bitcoin addresses or bitcoin entities, without applying any clustering algorithm, and all information is extracted from the public ledger of the Bitcoin blockchain. The second class of operations answers by applying clustering to bitcoin entities, with the aid of the clustering heuristics and algorithms described in the literature (see the chapter named “The Identity Reasoner”).
Tables in this section specify all interface methods. Implementation details are in the chapter named “The Query Engine”.
In the un-clustered version of a balance query, when a controller is specified, virex returns the sum of the amounts of bitcoin deposited in the addresses controlled by the selected entity.
F.R.1 BALANCE
Natural language questions:
  What's the balance of the address 1dice8EMZ… at 2014 May 26 14:07:44 UTC?
  What's the balance of the addresses controlled by Satoshi at 2014 May 26 14:07:44 UTC?

          Name       Type    Description
Inputs    entity     String  The address, or the supposed controller, whose balance we are interested in
          timestamp  Number  The Unix timestamp of the date and time
Outputs   balance    Number  The balance, in satoshis, of the specified entity at the requested date and time
F.R.2 BALANCE CLUSTERED
Natural language questions:
  What's the balance of the cluster to which address 1dice8EMZ… belongs, at 26 May 2014 14:07:44 UTC?
  What's the balance of the cluster controlled by Giuseppe, at 26 May 2014 14:07:44 UTC?

          Name       Type    Description
Inputs    entity     String  A representative address, or the supposed controller, of the cluster whose balance we are interested in
          timestamp  Number  The Unix timestamp of the date and time
Outputs   balance    Number  The balance, in satoshis, of the specified cluster at the requested date and time
F.R.3 FLOW
Natural language questions:
  What's the flow between address 1dice8EMZ… and address 1NDpZ2wyFe... in the period of time that goes from 15 Jan 2014 00:00 UTC to 26 May 2014 14:07:44 UTC?
  What's the flow between the addresses controlled by Satoshi and address 1NDpZ2wyFe... in the period of time that goes from 15 Jan 2014 00:00 UTC to 26 May 2014 14:07:44 UTC?

          Name       Type    Description
Inputs    entity     String  The address, or supposed controller, of the payer of the flow we are interested in
          entity     String  The address, or supposed controller, of the payee of the flow we are interested in
          timestamp  Number  The Unix timestamp of the initial date and time
          timestamp  Number  The Unix timestamp of the final date and time
Outputs   flow       Number  The flow between the addresses in the specified period
F.R.4 FLOW CLUSTERED
Natural language question:
  What's the flow between the cluster to which address 1dice8EMZ… belongs and the cluster controlled by Giuseppe, in the period of time that goes from 15 Jan 2014 00:00 UTC to 26 May 2014 14:07:44 UTC?

          Name       Type    Description
Inputs    entity     String  A representative address, or the supposed controller, of the payer cluster
          entity     String  A representative address, or the supposed controller, of the payee cluster
          timestamp  Number  The Unix timestamp of the initial date and time
          timestamp  Number  The Unix timestamp of the final date and time
Outputs   flow       Number  The flow between the clusters in the specified period
It is important to notice that virex has been designed to answer many more questions, such as:
• How many addresses did 1dice8EMZ… pay in the period of time going from 5 Jan 2014 00:00 UTC to 26 May 2014 14:07:44 UTC?
• Who controls the clusters that 1dice8EMZ… paid in the period of time going from 5 Jan 2014 00:00 UTC to 26 May 2014 14:07:44 UTC?
These questions enable a deeper analysis of bitcoin flows, but they are not formalised here, and no implementation is available yet.
Let us now define what “realtime” means in the VIREX acronym. We say that virex is realtime in the sense that all questions expressed in the virex query language shall receive answers updated to the latest confirmed transactions. After a transaction is broadcast to the Bitcoin network, it may be included in a block; when that happens, it is said that one confirmation has occurred for the transaction. With each subsequent block added to the blockchain, the number of confirmations increases by one.
To protect against double spending, a transaction should not be considered confirmed until a certain number of blocks have been added. Just like the classic Bitcoin client, we consider a transaction confirmed when at least 6 blocks confirm it.
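The confirmation rule just described reduces to a comparison of block heights. The following is a minimal sketch with hypothetical function names; it assumes that the height of the block containing the transaction and the height of the current chain tip are both known.

```python
from typing import Optional

REQUIRED_CONFIRMATIONS = 6   # threshold adopted in this work


def confirmations(tx_block_height: Optional[int], chain_tip_height: int) -> int:
    """Number of blocks confirming a transaction (0 if not yet in a block)."""
    if tx_block_height is None:
        return 0
    # The block containing the transaction counts as the first confirmation.
    return chain_tip_height - tx_block_height + 1


def is_confirmed(tx_block_height: Optional[int], chain_tip_height: int) -> bool:
    """True when the transaction has reached the required confirmations."""
    return confirmations(tx_block_height, chain_tip_height) >= REQUIRED_CONFIRMATIONS


print(confirmations(300_488, 300_493), is_confirmed(300_488, 300_493))
```

With one block every 10 minutes on average, answers are therefore updated to transactions that are about an hour old.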
F.R.5 MINED BITCOINS (UN-CLUSTERED OR CLUSTERED)
Natural language question:
  What's the amount of bitcoin mined by address 1dice8EMZ… (or by the cluster to which the address 1dice8EMZ… belongs) in the period of time that goes from 15 Jan 2014 00:00 UTC to 26 May 2014 14:07:44 UTC?
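To make the un-clustered requirements concrete, the sketch below answers balance, flow and mined-bitcoin questions over a toy in-memory ledger. It is only an illustration of the interface, not the Query Engine described later: the `Transfer` and `ToyQueryEngine` names are hypothetical, transactions are flattened into payer-to-payee transfers (a simplification), fees are modelled as a transfer to a hypothetical miner address, and controllers and clusters are left out.

```python
from dataclasses import dataclass
from typing import Optional

COIN = 100_000_000  # satoshis per bitcoin


@dataclass(frozen=True)
class Transfer:
    payer: Optional[str]   # None marks mined bitcoins (no input address)
    payee: str
    amount: int            # satoshis
    timestamp: int         # Unix timestamp


class ToyQueryEngine:
    def __init__(self, transfers):
        self.transfers = list(transfers)

    def balance(self, entity: str, timestamp: int) -> int:
        """F.R.1: satoshis held by an address at the given time."""
        past = [t for t in self.transfers if t.timestamp <= timestamp]
        received = sum(t.amount for t in past if t.payee == entity)
        spent = sum(t.amount for t in past if t.payer == entity)
        return received - spent

    def flow(self, payer: str, payee: str, start: int, end: int) -> int:
        """F.R.3: satoshis moved from payer to payee within [start, end]."""
        return sum(
            t.amount for t in self.transfers
            if t.payer == payer and t.payee == payee and start <= t.timestamp <= end
        )

    def mined(self, entity: str, start: int, end: int) -> int:
        """F.R.5: satoshis mined by an address within [start, end]."""
        return sum(
            t.amount for t in self.transfers
            if t.payer is None and t.payee == entity and start <= t.timestamp <= end
        )


# Bob's story from Chapter 1, with made-up timestamps.
engine = ToyQueryEngine([
    Transfer(None, "bob_green", 50 * COIN, 1),
    Transfer("bob_green", "alice_cyan", 30 * COIN, 2),
    Transfer("bob_green", "bob_red", 19 * COIN, 2),
    Transfer("bob_green", "miner", 1 * COIN, 2),   # the transaction fee
])
print(engine.balance("bob_green", 1), engine.flow("bob_green", "alice_cyan", 0, 10))
```

All three methods are read-only, which matches the definition of the interface as a query language; the linear scans used here would of course not scale to the transaction counts of Chapter 3.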
Chapter 5: High-level architecture
The high-level architecture of the VIREX system is shown in the following figure. Backend components are enclosed in white boxes, data flows are represented by lines, and the direction of each arrow identifies the component that takes the initiative (push/pull).
At the origin of the data there is the Bitcoin network which, block by block, timestamps transactions and injects them into an extended client that is responsible for the realtime tracking of transactions (Realtime Tracker).
The Transaction Manager is responsible for analysing new transactions, extracting from them the essential information needed for address clustering, and updating the information controlled by the Query Engine. In particular, it takes into account the flows and balances generated by the analysed transactions.
The Query Engine is the core of the system and is the component responsible for answering the questions described in the requirements. It must be extremely fast and scalable, in order to support the requests coming from the user interface (graphical or not).
The Identity Reasoner tries to link addresses together, using information gathered from the blockchain itself and from the web. It clusters together all the addresses likely to be controlled by the same entity.
Currently, not all the described components have a real implementation. In particular, only prototypes of the Identity Reasoner, the Query Engine and the Web user interface have been implemented. Moreover, all these components have to be orchestrated and synchronised to maintain a consistent state of the bitcoin transaction graph, but this problem is not addressed in this work.
Chapter 6: The Identity Reasoner
The VIREX Identity Reasoner is the component responsible for clustering addresses and associating them with entities of the real world (a person, a service, a forum user), with the aid of address clustering and data collection.
In particular, it needs to:
• Track clusters of addresses in realtime, while transactions are timestamped in the block chain.
• Merge clusters that belong to the same entity according to heuristics and user knowledge.
• Collect and store information about addresses.
Address clustering in Bitcoin is the activity that seeks to identify groups of addresses that are probably controlled by the same entity. It’s possible to reach this goal to some extent, thanks to two well-known heuristics able to link addresses from the structure of transactions in which they are involved.
Before presenting the heuristics, it’s important to define the meaning of address control, following the definition found in the literature. In short, the controller of an address is the entity expected to be responsible for forming transactions on behalf of that address. Knowledge of the private key is a necessary condition for address control, but not a sufficient one. Consider, for example, buying physical bitcoins from a vendor such as Casascius. Both the creator and the buyer of the physical bitcoin know the private key but, according to the previous definition, the controller is the buyer. Moreover, it’s important to emphasise that this definition of address control is quite different from account ownership. For example, a wallet service or an exchange service is the controller of all the addresses it generates (often used by customers for deposits and withdrawals), but the funds in these addresses are owned by a wide variety of distinct users.
The first linking heuristic is often referred to as the “heuristic of multi-input transactions” and was already identified by the creators of bitcoin: it’s described in the privacy section of the original bitcoin paper. Briefly, under the hypothesis that users don’t share their private keys, if two addresses are used as inputs to the same transaction, then they are controlled by the same entity. More formal definitions of this heuristic can be found in the literature.
The second linking heuristic is often called “shadow address guessing” and aims at guessing, for each transaction, the address used for change. According to this heuristic, the address used for change is controlled by the same entity that controls the input addresses.
As Satoshi Nakamoto suggests in his paper, a new key pair should be used for each transaction to keep them from being linked to a common owner and, in fact, the current bitcoin implementation generates a new address for collecting change for each transaction. Many techniques to identify this address are described in the literature, but in this work the most stringent one will be used, i.e. the following variant:
If there are two output addresses (one payee and one change address, which is true for the vast majority of transactions), and one of the two has never appeared before in the block chain, while the other has, then we can safely assume that the one that never appeared before is the shadow address generated by the client to collect change back.
This version, although effective, has proven significantly less safe than the multi-input transaction heuristic. A very high rate of false positives has been reported, ending up with a giant super-cluster containing the public keys of Mt.Gox, Instawallet, BitPay, and Silk Road.
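The shadow address variant quoted above can be sketched as follows. This is a minimal illustration, not the thesis implementation: the function name is hypothetical, and the set of addresses that already appeared in the block chain is assumed to be maintained elsewhere.

```python
def guess_shadow_address(output_addresses, seen_addresses):
    """Guess the change (shadow) address of a transaction.

    output_addresses: list of output addresses of the transaction.
    seen_addresses: set of addresses that appeared in earlier transactions.
    Returns the guessed shadow address, or None when the rule does not apply.
    """
    # The rule only covers transactions with exactly two outputs.
    if len(output_addresses) != 2:
        return None
    first, second = output_addresses
    first_is_new = first not in seen_addresses
    second_is_new = second not in seen_addresses
    # Exactly one output must be new; that one is assumed to collect change.
    if first_is_new and not second_is_new:
        return first
    if second_is_new and not first_is_new:
        return second
    return None
```

When both outputs are new, or both are already known, the sketch refuses to guess, which is what makes this variant the stringent one.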
Moreover, it’s possible to establish that two addresses are controlled by the same user thanks to data collection, i.e. by labelling addresses as being controlled by some known real-world entity. Data collection can be performed by transacting with real actors in the bitcoin ecosystem (e.g. playing with just-dice, depositing to and withdrawing from an exchange), but more and more frequently the primary source of this data is the big and unstructured world of the internet. A very large dataset has been collected and described in the literature: its services include mining pools, wallets, exchanges, vendors and many others, while BitIodine includes scrapers for just-dice, bitcointalk, bitcoin-otc and many other sites.
In addition, many users publicly claim their own addresses on the web, and many of these are collected at blockchain.info/tags.
The core data structure of the identity reasoner is a graph in which nodes represent addresses and relationships represent links stating that two addresses are controlled by the same entity.
Each node has the following properties:
• An address, a string representing the bitcoin address of the node
• A controller, a string identifying the controller of the bitcoin address
• A cluster id, a number identifying the cluster to which the address belongs
There are three types of relationships:
• HEURISTIC1, directed, to identify a link between two addresses caused by a multi input transaction.
• HEURISTIC2, directed, to identify a link between two addresses caused by change address guessing.
• SAME_CONTROLLER, undirected, to identify a link between two addresses caused by knowledge of shared control between the two addresses.
Each relationship has a description property, giving information about its origin (e.g. for H1 and H2 relationships, the description is an identifier of the transaction that caused the linking).
Given the identity reasoner data structure, it’s possible to identify and track clusters of addresses using well-known graph algorithms.
It is straightforward to compute connected components of a graph in linear time (in terms of the numbers of the vertices and edges of the graph) using either breadth-first search or depth first search.
There are also efficient algorithms to dynamically track connected components of a graph as vertices and edges are added.
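One classic way to track connected components while edges are only added is a disjoint-set (union-find) structure. The sketch below is an illustration of that technique, not the data structure used in VIREX; the class and method names are hypothetical.

```python
class ClusterTracker:
    """Incremental connected components with a disjoint-set forest."""

    def __init__(self):
        self._parent = {}  # address -> parent address in the forest
        self._size = {}    # root address -> number of addresses in its tree

    def add_address(self, address):
        """Register an address as a singleton cluster if it is new."""
        if address not in self._parent:
            self._parent[address] = address
            self._size[address] = 1

    def find(self, address):
        """Return the representative address of the cluster."""
        self.add_address(address)
        root = address
        while self._parent[root] != root:
            root = self._parent[root]
        # Path compression: point every visited node directly at the root.
        while self._parent[address] != root:
            next_address = self._parent[address]
            self._parent[address] = root
            address = next_address
        return root

    def link(self, address1, address2):
        """Record that two addresses belong together; merge their clusters."""
        root1, root2 = self.find(address1), self.find(address2)
        if root1 == root2:
            return root1
        # Union by size: attach the smaller tree under the larger one.
        if self._size[root1] < self._size[root2]:
            root1, root2 = root2, root1
        self._parent[root2] = root1
        self._size[root1] += self._size[root2]
        return root1

    def same_cluster(self, address1, address2):
        return self.find(address1) == self.find(address2)
```

This structure handles edge insertions only. When an edge has to be removed, the affected components must be recomputed by traversal.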
Invariants of the identity reasoner database
Some invariants are defined to keep the data structure consistent with the knowledge extracted from the blockchain and from the web.
INVARIANT0: There are no two nodes in the graph with the same address.
INVARIANT1: Two addresses are in the same connected component if and only if they have the same cluster identifier.
INVARIANT2: Given a transaction $T$ with $M$ ordered input addresses $in(0), in(1), \dots, in(M-1)$ and any number of output addresses, the identity reasoner contains the following path, with edges of type HEURISTIC1: $in(0) \to in(1) \to \dots \to in(M-1)$
INVARIANT3: Given a transaction $T$ with an input address in position 0 (first position), say $in(0)$, and a shadow address $s$, the identity reasoner contains the following edge of type HEURISTIC2: $in(0) \to s$
INVARIANT4: Two addresses have the same controller property if and only if they are linked by a SAME_CONTROLLER relationship.
The data structure should be updated each time one of the following events happens:
• A new transaction is confirmed in the blockchain. In this case, the identity reasoner should add new nodes, corresponding to new addresses that appeared in the network, and new edges, corresponding to the heuristics that have been evaluated.
• A new controller for an address is discovered. In this case, the identity reasoner should merge clusters that are controlled by the same entity, or separate addresses that are no longer controlled by the same entity.
In addition to setters and getters for node properties, the VIREX identity reasoner graph is assumed to provide a series of primitive operations which don’t guarantee the identity reasoner invariants by themselves, but are useful to define the more complex “transactional” operations described in the following paragraphs.
1. create_node(address): if a node with the specified address doesn’t exist in the network, create the node.
2. create_relationship(address1, address2, type): if a relationship between address1 and address2, with the specified type, doesn’t exist, create the relationship.
3. delete_relationship(address1, address2, type)
4. traverse_address(address): starting from the node with the specified address, and using a breadth-first or depth-first algorithm, identify all nodes in the same connected component as the starting node, marking them with the same cluster id.
5. merge_clusters(address1, address2): merge the clusters of the selected addresses, without the need to re-traverse a portion of the graph.
To bootstrap the identity reasoner graph, it’s necessary to read the whole blockchain and import into the graph all identified nodes (addresses) and the edges produced by heuristics 1 and 2.
Then we need to traverse all nodes of the graph in order to identify connected components for the first time.
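The initial traversal can be sketched with a breadth-first search that labels every connected component once. This is an in-memory illustration under stated assumptions: the graph is given as a dictionary mapping each address to its neighbours (edge direction and type ignored), and the function name is hypothetical.

```python
from collections import deque


def label_connected_components(adjacency):
    """Assign a cluster id to every address of an undirected graph.

    adjacency: dict mapping each address to an iterable of neighbours;
    every address must appear as a key, even if it has no neighbours.
    Returns a dict address -> cluster id (ids start from 0).
    """
    cluster_of = {}
    next_cluster_id = 0
    for start in adjacency:
        if start in cluster_of:
            continue  # already reached from a previous starting node
        cluster_of[start] = next_cluster_id
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for neighbour in adjacency[node]:
                if neighbour not in cluster_of:
                    cluster_of[neighbour] = next_cluster_id
                    queue.append(neighbour)
        next_cluster_id += 1
    return cluster_of
```

Each node and each edge is visited a constant number of times, so the cost is linear in the size of the graph.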
Adding a bitcoin transaction
When a new transaction is timestamped into the blockchain, it’s necessary to update the VIREX identity reasoner data structure with all new addresses and new heuristics.
From the point of view of the identity reasoner, a transaction can be considered as a set of addresses and a set of heuristics. In the following example a new transaction involves “Address6” and “Address4” and a new heuristic of type 1.
When a new transaction is added to the identity reasoner, there is never a need to delete edges, hence there is no need to re-traverse portions of the graph.
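As an illustration of why adding a transaction never requires a re-traversal, the following sketch combines the primitives create_node, create_relationship and merge_clusters in an in-memory graph. It is an assumption-laden toy model, not the Neo4j implementation: the class name, the add_transaction method and its parameters are hypothetical.

```python
import itertools


class IdentityGraph:
    """Toy in-memory version of the identity reasoner graph."""

    def __init__(self):
        self.cluster_of = {}  # address -> cluster id
        self.members = {}     # cluster id -> set of addresses
        self.edges = set()    # (address1, address2, relationship type)
        self._ids = itertools.count(1)

    def create_node(self, address):
        # A new address starts in its own singleton cluster.
        if address not in self.cluster_of:
            cluster_id = next(self._ids)
            self.cluster_of[address] = cluster_id
            self.members[cluster_id] = {address}

    def create_relationship(self, address1, address2, rel_type):
        self.edges.add((address1, address2, rel_type))

    def merge_clusters(self, address1, address2):
        keep = self.cluster_of[address1]
        drop = self.cluster_of[address2]
        if keep == drop:
            return
        # Relabel the smaller cluster; no graph traversal is needed.
        if len(self.members[keep]) < len(self.members[drop]):
            keep, drop = drop, keep
        for address in self.members.pop(drop):
            self.cluster_of[address] = keep
            self.members[keep].add(address)

    def add_transaction(self, input_addresses, output_addresses,
                        shadow_address=None):
        for address in list(input_addresses) + list(output_addresses):
            self.create_node(address)
        # Heuristic 1: chain consecutive inputs with HEURISTIC1 edges.
        for previous, current in zip(input_addresses, input_addresses[1:]):
            self.create_relationship(previous, current, "HEURISTIC1")
            self.merge_clusters(previous, current)
        # Heuristic 2: link the first input to the guessed shadow address.
        if shadow_address is not None and input_addresses:
            self.create_node(shadow_address)
            self.create_relationship(input_addresses[0], shadow_address,
                                     "HEURISTIC2")
            self.merge_clusters(input_addresses[0], shadow_address)
```

Because edges are only added, clusters can only grow or merge, so relabelling the smaller cluster is enough to preserve the cluster id invariant.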
Given an address A, if we are going to set its controller property to C, it’s important to guarantee invariants for each possible state of the network.
We summarise this state using three binary variables, as shown in the following table. Six out of eight possible states are consistent with the invariants and are therefore noteworthy.
The address has a different controller | The address is linked to another one with a SAME_CONTROLLER relationship | Another address with controller C exists in the network | State
FALSE | FALSE | FALSE | 1
FALSE | FALSE | TRUE | 2
FALSE | TRUE | FALSE | Inconsistent
FALSE | TRUE | TRUE | Inconsistent
TRUE | FALSE | FALSE | 3
TRUE | FALSE | TRUE | 4
TRUE | TRUE | FALSE | 5
TRUE | TRUE | TRUE | 6
Each of the consistent states will be analyzed in detail.
Setting controller 1/6
In the first case, you just need to set the controller property of the given address to C.
Setting controller 2/6
In this second case, after setting the controller property of “Address6” to “Alice” and adding an edge between “Address6” and “Address5”, clusters 2 and 3 need to be merged. The primitive operations to execute are the following:
2. create_relationship(“Address5”, “Address6”, SAME_CONTROLLER)
3. merge_clusters(“Address5”, “Address6”)
Setting controller 3/6
In this example you need to change the controller of “Address5” from “Alice” to “Chris”.
The node “Address5” is not connected to other nodes with a SAME_CONTROLLER relationship, and no node with controller “Chris” exists in the network, so you just need to change the controller property.
Setting controller 4/6
In this example you need to change the controller of “Address5” from “Alice” to “Bob”.
The node “Address5” is not connected to other nodes with a SAME_CONTROLLER relationship, but a node with controller “Bob” already exists in the network. The primitive operations to execute are the same as in case 2.
Setting controller 5/6
In this example, you need to change the controller of “Address1” from “Bob” to “Chris”. The SAME_CONTROLLER relationship between “Address1” and “Address5” has to be dropped, and the connected components involving these addresses need to be identified again. The primitive operations to execute are the following:
2. delete_relationship(“Address1”, “Address5”, SAME_CONTROLLER)
3. traverse_address(“Address1”)
4. traverse_address(“Address5”)
Setting controller 6/6
In the last example you need to change the controller of “Address6” from “Alice” to “Bob”.
The primitive operations to execute are the following:
2. delete_relationship(“Address6”, “Address5”, SAME_CONTROLLER)
3. traverse_address(“Address6”)
5. create_relationship(“Address1”, “Address6”, SAME_CONTROLLER)
6. merge_clusters(“Address1”, “Address6”)
Considerations about address graph size
The number of nodes is proportional to the number of addresses in the blockchain. Denoting with $E[n_a/n_t]$ the expected number of addresses per transaction, we have that the number of nodes is
$n_{nodes} = n_t \cdot E[n_a/n_t]$
Considering heuristic 1, we have an edge for each pair of consecutive input addresses in a transaction, so the number of relationships of type HEURISTIC1 can be expressed as
$n_{H1} = n_t \cdot (E[n_{in}/n_t] - 1)$
where $n_t$ is the number of transactions in the blockchain and $E[n_{in}/n_t]$ is the expected number of inputs per transaction.
Considering heuristic 2, we have at most a single shadow address per transaction, so an upper bound to the number of relationships of type HEURISTIC2 can be expressed as
$n_{H2} \le n_t$
It’s important to note that both the number of nodes and the number of edges are linear in the number of transactions.
Implementation details and experimental results
The identity reasoner has been implemented using Neo4j, a popular graph database. A graph database uses graph structures, such as nodes, edges, and properties, to represent and store data, and is a powerful tool for graph-like queries, for example traversing the graph or computing the shortest path between two nodes.
The resulting database size is about 12 GB, and the number of identified clusters for each heuristic, in the current blockchain (~ 35.7M addresses) is reported in the following table.
It is evident that the implemented heuristic 2 is quite unsafe, since it ends up in a giant supercluster of about 40% of addresses; for this reason, it won’t be taken into account in the discussion that follows. A refined implementation is described in the literature and should be implemented in the near future.
[Figure: clusters by size (1 addr to 5 addr) for H1, H2 and H1+H2; axis from 0 to 20,000,000]
Heuristic | Expected number of edges | Number of edges | Number of identified clusters | Maximum cluster size (addresses) | Average cluster size (addresses)
H1 | ~ 50 M | ~ 50 M | ~ 16 M | ~ 1 M | ~ 2.18
H2 | ~ 39 M | ~ 13 M | ~ 23 M | ~ 3 M | ~ 1.54
H1+H2 | ~ 89 M | ~ 63 M | ~ 8.5 M | ~ 13 M | ~ 4.22
It’s now time to add some prior knowledge to the identity reasoner, and to link addresses to their supposed controllers.
A first dataset is taken from the BitIodine software and is composed of about 70,000 addresses potentially belonging to the authors of CryptoLocker, a well-known ransomware that locks computers running MS Windows by encrypting important files with an RSA public key, and then offers to decrypt the data if a payment in bitcoin is made. These addresses have been obtained by searching on Google for extracts of the text of the money request displayed by the malware, and by reading a Reddit thread in which victims and researchers post addresses.
When adding this dataset to the identity reasoner, we end up with a giant supercluster of about 13M addresses that contains addresses controlled by both Mt.Gox and CryptoLocker. This result has two potential implications, which are not mutually exclusive. The first is that there is some false information in the dataset, i.e. some addresses have been reported as controlled by CryptoLocker but are actually controlled by, for example, Mt.Gox, and have nothing to do with the malware. The second is that there could be a connection between CryptoLocker and Mt.Gox: CryptoLocker may have been a Mt.Gox customer, so that some coins stored in addresses controlled by Mt.Gox are owned by CryptoLocker. This scenario highlights the central role that exchanges play in the bitcoin ecosystem, since nowadays goods and services are mostly paid for with fiat currencies.
A second dataset is obtained from another Reddit thread, started just after Mt.Gox filed for bankruptcy with the aim of finding confirmation of the story told by its CEO and of figuring out the financial situation of the exchange. Many of these addresses belong to the second biggest cluster (about 500k addresses), whose representative is 1LNWw6yCxkUmkhArb2Nf2MPw6vG7u5WG7q, while some of them belong to very small clusters of 1 to 4 addresses, which are likely to belong to Mt.Gox.
Third, we were able to identify a Bitstamp hot wallet thanks to the knowledge of one of its addresses, 18xgnWy7HmrPnUsD6NJCc29nu4QL21vaYD.
In the following picture we show, in linear scale, the size, in number of addresses, of the biggest four clusters and of hot wallets for known entities.
As shown in the diagram, we can state that the second cluster is a Mt.Gox hot wallet, but the controllers of the other big clusters are unknown.
If we plot the portion of the identity reasoner graph relative to the big clusters and report some clustering statistics, we are able to identify some false-positive HEURISTIC1 edges.
Dealing with address clustering and identity reasoning is certainly one of the most fascinating challenges of bitcoin forensic analysis. We tried to describe a model, based on a graph data structure, that can evolve incrementally with the bitcoin network and with an increasing knowledge of address-controller associations.
Unfortunately this model is highly susceptible to corruption: if the raw information (e.g. collected on the web) turns out to be affected by errors, it will quickly cause clusters (especially the biggest ones, belonging to influential actors of the network) to be tied together, hence distorting results. A model for adding edges to the graph is necessary, and it should take into account the size of the clusters that are going to be merged and the amount and quality of the information collected.
Bitcoin service providers are neither strong nor decentralised (yet), so users have a strong interest in forensic analysis, as confirmed by the discussions about the Mt.Gox and CryptoLocker cases. For this reason it would be very useful if they could play an active role in this activity, by reporting the information they own about the subjects they want to monitor, in a crowd-sourced fashion.
Chapter 7: The Query Engine
The bitcoin query engine is the component responsible for answering the users’ requests defined in the requirements chapter. It’s designed to be extremely fast and scalable.
Let’s recall example queries about balances such as:
• What was the balance of “1dice…” at Mon, 26 May 2014 14:07:44 UTC?
• What was the balance of the cluster containing “1dice…” at Mon, 26 May 2014 14:07:44 UTC?
Given an address $a$ and an instant of time $t$, the balance is a non-negative value and can be obtained from bitcoin transactions with the following formula:
$balance(a, t) = \sum_{T : ts(T) \le t} \left( \sum_{j=0}^{N_t-1} v(out(j)) \, [out(j) = a] - \sum_{i=0}^{M_t-1} v(in(i)) \, [in(i) = a] \right)$
where $T$ is a transaction with $N_t$ outputs ($N_t > 0$) and $M_t$ inputs ($M_t$ may be 0 for mining transactions) whose timestamp $ts(T)$ is not greater than $t$, $v(\cdot)$ is the amount carried by an input or output of $T$, and $[\cdot]$ is 1 when the condition holds and 0 otherwise. In other words, evaluating the balance of an address means summing over the boundary (unspent outputs) of the transaction graph at time $t$: if an output is already spent at time $t$, the corresponding positive addend is cancelled by a negative one. Given this definition of address balance, the balance of a cluster is simply the sum of the balances of all addresses in that cluster.
To efficiently compute the balance of addresses and clusters, a data structure called balance element is defined. Balance elements are built starting from transactions: given a transaction T, timestamped at time t, with Mt input addresses in(i) and Nt output addresses out(i), a balance element is defined for each input address and for each output address as follows:
Using balance elements, the balance of an address can be evaluated aggregating all amounts of interesting balance elements.
Let’s consider, for instance, the transaction identified by 58545bb4cdbd0272df60efa969e1f9604944c507d5634926c6bfd113a9712c2d, which has two inputs and two outputs and has been timestamped in the block with height 300934 and timestamp 2014-05-15 23:41:01.
• tx id (optional): identifies the transaction that generated the balance element. It can be either the hash of the transaction or a progressive identifier.
• address: identifies the address of the input/output
• cluster id: identifies the cluster to which the address belongs
• timestamp: the time at which the transaction was timestamped into the blockchain
• amount: amount of satoshis transferred from / to the specified address. The amount is positive for output addresses and negative for input addresses.
The selected transaction has two inputs (1Ai and 16U) and two outputs (1Et and 1Q9), hence it produces 4 balance elements, with negative variations for the balances of 1Ai and 16U and positive variations for the balances of 1Et and 1Q9. In particular, the following balance elements are inserted into the query engine database.
Since each transaction produces a balance element for each input and a balance element for each output, the total number of balance elements can be estimated, starting from the number of transactions, using the following formula:
$n_{BE} = n_t \cdot (E[n_{in}/n_t] + E[n_{out}/n_t])$
Considering that the expected numbers of inputs and outputs of a transaction are constant in time, it’s possible to assume that the number of balance elements is linear in the number of transactions.
Element | tx_id | address | cluster_id (invented) | timestamp | amount
BE1 | 5854… | 1Ai… | 5 | 1400197261 | -0.01
BE2 | 5854… | 16U… | 5 | 1400197261 | -0.01
BE3 | 5854… | 1Et… | 5 | 1400197261 | 0.008
BE4 | 5854… | 1Q9… | 3 | 1400197261 | 0.012
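The balance elements of the example can be reproduced with a short sketch. It is an illustration only: the names BalanceElement, balance_elements, address_balance and cluster_balance are hypothetical, amounts are expressed in satoshis (0.01 BTC = 1,000,000 satoshis), and the cluster ids are the invented ones from the table.

```python
from collections import namedtuple

BalanceElement = namedtuple(
    "BalanceElement", "tx_id address cluster_id timestamp amount")


def balance_elements(tx_id, timestamp, inputs, outputs, cluster_of):
    """Build one balance element per input and per output of a transaction.

    inputs, outputs: lists of (address, amount in satoshis).
    cluster_of: dict address -> cluster id.
    """
    elements = []
    for address, amount in inputs:
        # Inputs decrease the balance of the paying address.
        elements.append(BalanceElement(
            tx_id, address, cluster_of.get(address), timestamp, -amount))
    for address, amount in outputs:
        # Outputs increase the balance of the receiving address.
        elements.append(BalanceElement(
            tx_id, address, cluster_of.get(address), timestamp, amount))
    return elements


def address_balance(elements, address, t):
    """Sum the amounts of the elements of an address up to time t."""
    return sum(e.amount for e in elements
               if e.address == address and e.timestamp <= t)


def cluster_balance(elements, cluster_id, t):
    """Sum the amounts of the elements of a cluster up to time t."""
    return sum(e.amount for e in elements
               if e.cluster_id == cluster_id and e.timestamp <= t)
```

When only this single transaction is loaded, the aggregates are net variations rather than full balances, which is why the paying addresses show negative values. A real query engine would aggregate over all balance elements of the block chain.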
Flow queries aim to answer questions like the following:
• What was the flow between “address1” and “address2” between May 26, 2013 and May 01, 2014?
• What was the flow between the cluster controlled by “Alice” and the cluster controlled by “Bob” between May 26, 2013 and May 01, 2014?
The issue of flow queries deserves more attention. In fact, because of multi-input transactions, defining the flow between addresses is not trivial. For example, let’s consider a transaction T with two inputs and two outputs.
How does this transaction contribute to the flow between in(0) and out(0)? It may be 5, but also 1, or 0. In short, the flow between addresses is not well defined in the case of multi-input transactions. However, assuming that bitcoin users do not share their private keys (which makes the first heuristic deterministic), it’s possible to give a definition of the flow between addresses that is consistent with the flow between clusters.
The flow between two addresses (a1 and a2) is the sum of the output values deposited to a2 in transactions having a1 among their inputs, each divided by the number of input addresses of the corresponding transaction.
Considering the previous transaction, the flow between in(0) and out(0) is equal to 2.5, just like the flow between in(1) and out(0). Since in(0) and in(1) are in the same cluster (first heuristic), it’s possible to obtain the flow between clusters by summing over the flows between addresses, obtaining a total flow between the cluster in(0)-in(1) and out(0) of 5.
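The following formula defines the flow between addresses and clusters:
$flow(a_1, a_2) = \sum_{T : a_1 \in in(T)} \frac{1}{M_t} \sum_{j : out(j) = a_2} v(out(j))$
$flow(C_1, C_2) = \sum_{a_1 \in C_1} \sum_{a_2 \in C_2} flow(a_1, a_2)$
where $in(T)$ is the set of input addresses of transaction $T$, $M_t$ is their number, $v(out(j))$ is the value of output $j$ of $T$, and $C_1$, $C_2$ are clusters of addresses.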
Another issue with flows is mining. It would be interesting to link together balances and flows with the following formula:
This equation, though simple, is not so obvious in bitcoin unless we extend the definition of flow to transactions with no inputs (mining transactions). In particular, the set of addresses was extended with a special address, called the mine address.
So, if you ask the VIREX query engine for the flow between the mine address and another address, you are asking for the amount of bitcoins mined by that address.
To efficiently compute flows between addresses and clusters, another simple data structure, called flow element, is defined. Flow elements are also built starting from transactions: given a transaction T, timestamped at time t, with Mt>0 input addresses (the mine address is considered as an input address if the transaction has no input addresses) and Nt output addresses out(0)…out(Nt-1), a flow element is defined for each pair (in(i), out(j)) as follows:
Using flow elements the flow between two addresses can be evaluated aggregating all amounts of interesting flow elements.
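The total number of flow elements can be estimated starting from the number of transactions. For each transaction, a flow element is generated for each pair of input and output addresses, as described by the following formula:
$n_{FE} = \sum_{T} M_t \cdot N_t$
where the sum runs over all transactions and $M_t$ counts the mine address as one input for mining transactions.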
Just like balance elements, it’s possible to conclude that the number of flow elements is linear with the number of transactions.
• tx id (optional): identifies the transaction that generated the flow element
• payer: identifies the payer address in(i), or the mine address if the transaction has no inputs
• payer_cid: identifies the cluster of the payer address
• payee: identifies the payee address out(j)
• payee_cid: identifies the cluster of the payee address
• timestamp: the time at which the transaction was timestamped into the blockchain
• flow: amount of satoshis transferred from the payer to the payee address, evaluated using the definition of flow between addresses
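The construction of flow elements and the two flow queries can be sketched as follows. This is a minimal illustration under stated assumptions: all names are hypothetical, the string used for the mine address is a placeholder, and in the test values only the amount 5 deposited to out(0) comes from the example in the text, while the second output value is invented.

```python
from collections import namedtuple

# Hypothetical placeholder for the special mine address.
MINE_ADDRESS = "MINE"

FlowElement = namedtuple(
    "FlowElement", "tx_id payer payer_cid payee payee_cid timestamp flow")


def flow_elements(tx_id, timestamp, input_addresses, outputs, cluster_of):
    """Build one flow element per (input, output) pair of a transaction.

    outputs: list of (address, value). A transaction without inputs
    (mining) is treated as paid by the mine address.
    """
    payers = list(input_addresses) or [MINE_ADDRESS]
    elements = []
    for payer in payers:
        for payee, value in outputs:
            # Each output value is split evenly among the input addresses.
            elements.append(FlowElement(
                tx_id, payer, cluster_of.get(payer),
                payee, cluster_of.get(payee),
                timestamp, value / len(payers)))
    return elements


def address_flow(elements, payer, payee, start, end):
    """Flow from payer to payee for timestamps in [start, end]."""
    return sum(e.flow for e in elements
               if e.payer == payer and e.payee == payee
               and start <= e.timestamp <= end)


def cluster_flow(elements, payer_cid, payee_cid, start, end):
    """Flow between two clusters for timestamps in [start, end]."""
    return sum(e.flow for e in elements
               if e.payer_cid == payer_cid and e.payee_cid == payee_cid
               and start <= e.timestamp <= end)
```

Splitting each output evenly among the inputs makes the cluster-level flow equal to the full output value whenever all inputs belong to the same cluster, which is the consistency property described above.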