3.2.1
Cryptocurrencies page
The second dataset this thesis relies on is cryptocurrencies’ Wikipedia pages data. It was collected through the Wikipedia API [152] and includes the daily number of page views and the page edit history of the 38 cryptocur- rencies with a page on Wikipedia. Table 3.2 shows the cryptocurrencies with Wikipedia pages along with some market characteristics. Using Wikipedia
API, we can choose which page views to be considered. For example, the parameterall-accessfilter by access method such as desktop, mobile-app or mobile-web. On the other, the parameteragentfilter by agent type, for exam- ple user or bot. For the page views we use the API call:https://wikimedia.
org/api/rest_v1/metrics/pageviews/per-article/en.wikipedia.org/all-access/ user/wiki_page/daily/start_date/end_date, wherewiki pageis the page
name andstart dateandend dateare the requested dates.
Page-view data range from July 1st 2015 until January 23rd 2019, since earlier data are not accessible through the API. On the other hand, the full editing history is accessible through the API, and includes the content of each edit, the editor, the time of creation and possible comments meant to highlight the nature and the reason of the edit. To retrieve the edit history, we use the API call: https://en.wikipedia.org/w/api.php?action=query&format=json& prop=revisions&rvprop=timestamp%7Cuser%7Ccomment%7Ccontent&&titles= wiki_page. Since edits retrieved from Wikipedia contain XML and HTML tags, we cleaned each edit by removing all those tags and keeping only the text of the edit.
Automated tools known as “bots” often carry out repetitive tasks to main- tain pages. Wikipedia requires bots to have separate accounts and names which include the word “BOT”, in order to make their edits identifiable. We excluded all edits from bots from our analysis.
We classified edits into two categories, namely edits with new content and maintenance edits. Maintenance edits aim to keep consensual page content by restoring a more accurate old version (reverts) and fighting malicious edits (vandalism). We identified reverts by selecting edits comments con- taining the word “rv” or “revert” [153], and by employing an MD5 hashing scheme. MD5 (message digest) is a hashing scheme which digests a variable- length sequence and outputs a fixed-length (128 bits) sequence [154]. MD5 is commonly used to identify identical files. We created an MD5 hash for all edits, and we identified edits sharing the same hash with a previous edit as reverts. Reverts which were made specifically to fight vandalism were identified by selecting edits labelled in their associated comment as fighting “vandalism” [153]. We considered edits which are not classified as vandalism
nor reverts as new content.
We also collected data on the activity of the most active editors in other Wikipedia pages. To retrieve this data, we used Xtool [155], a web tool providing general statistics on the editors and their most edited pages. The tool does not provide an API, but through calling the URL of each editor, we accessed the page of the editor and scraped the data.
TABLE3.2: Cryptocurrencies with a page in Wikipedia.The table is generated using data collected on January 23rd, 2019. The table shows, for the cryptocurrencies considered, their name, wikipedia page identifier, the date of their first appear- ance on the market, page creation date, rank (based on market capitalisation), and whether they can be marginally traded or not (see Appendix C.1 for information on exchanges that support margin trading). Currencies are ordered alphabetically
Name Wikipedia page identi- fier Trading start date Wikipedia page creation date Rank Margin trad- ing Auroracoin Auroracoin 2014−02−27 2014−03−16 578 No Bitcoin Bitcoin 2013−04−28 2009−03−08 1 Yes Bitcoin
Cash
Bitcoin Cash 2017−07−23 2017−07−28 6 Yes
Bitcoin Pri- vate
Bitcoin Private 2018−03−10 2018−01−18 121 No
Bitconnect Bitconnect 2017−01−20 2017−06−28 Delisted No Bitcoin
Gold
Bitcoin Gold 2017−10−23 2017−10−15 28 Yes
Cardano Cardano (platform) 2017−10−01 2018−01−10 12 Yes Dash Dash (cryptocurrency) 2014−02−14 2014−06−01 15 Yes Decred Decred 2016−02−10 2017−10−22 32 No Dogecoin Dogecoin 2013−12−15 2013−12−14 24 Yes EOS EOS.IO 2017−07−01 2017−11−30 5 Yes Ethereum Ethereum 2015−08−07 2014−01−27 2 Yes
Table 3.2 –Continued from previous page
Name Wikipedia page link Trading start date Wikipedia page creation date Rank Margin trad- ing Ethereum Classic
Ethereum Classic 2016−07−24 2016−07−25 18 Yes
Filecoin Filecoin 2017−12−13 2017−08−11 1744 No Gridcoin Gridcoin 2015−02−28 2016−08−30 1179 No Litecoin Litecoin 2013−04−28 2012−10−20 4 Yes MazaCoin MazaCoin 2014−02−27 2014−02−28 Delisted No Monero Monero (cryptocurrency) 2014−05−21 2015−03−19 13 Yes Namecoin Namecoin 2013−04−28 2012−06−27 242 No NEM NEM (cryptocurrency) 2015−04−01 2014−12−11 19 No NEO NEO (cryptocurrency) 2016−09−09 2017−12−27 16 Yes NuBits NuBits 2014−09−24 2015−11−03 900 No
Nxt Nxt 2013−12−04 2014−03−09 124 No
OmiseGO OmiseGO 2017−07−14 2017−09−14 30 Yes Peercoin Peercoin 2013−04−28 2013−04−10 188 No Petro Petro (cryptocurrency) 2014−04−15 2017−12−03 1208 No PotCoin PotCoin 2014−02−10 2014−08−06 413 No Primecoin Primecoin 2013−07−11 2013−07−29 438 No Ripple Ripple (payment protocol) 2013−08−04 2005−08−06 3 Yes Stellar Stellar (payment network) 2014−08−05 2014−09−04 9 Yes Tether Tether (cryptocurrency) 2015−02−25 2017−12−05 7 Yes Tezos Tezos 2017−10−02 2018−06−28 23 No Titcoin Titcoin 2014−08−26 2014−09−13 1602 No Vechain Ven (currency) 2018−08−03 2012−12−04 34 No
Table 3.2 –Continued from previous page
Name Wikipedia page link Trading start date Wikipedia page creation date Rank Margin trad- ing
Verge Verge (cryptocurrency) 2014−10−25 2018−01−24 49 No Vertcoin Vertcoin 2014−01−20 2014−03−12 175 No Waves plat-
form
Waves platform 2016−06−02 2017−04−27 21 No
Zcash Zcash 2016−10−29 2016−10−03 20 Yes
We also extracted data of drugs’ Wikipedia pages for the prediction of drug sales on dark markets in Chapter 7. The next section will detail the process of data extraction.
3.2.2
Drug pages
We collect Wikipedia page views data through the Wikipedia API (similar to cryptocurrencies Wikipedia pages), which runs from July 2015 onward [156]. The raw data is available at daily frequency, but we aggregate to monthly frequency to match the sales data (discussed in Section 3.3.3).
We further split the data by language, and use that as a proxy for the country of the viewer. This is likely a reasonable assumption for some languages. For example, viewers of the Polish language page are probably located in Poland. However there may be a measurement error, particularly for the English language page which we assume to cover several countries. We lim- ited our analysis to the languages of the top 9 countries trading on the dark market. In order to retrieve data using API for several languages we use the
following callhttps://wikimedia.org/api/rest_v1/metrics/pageviews/
per-article/lang.wikipedia.org/all-access/user/wiki_page/daily/start_ date/end_date, wherelangis the language desired to be retrieved,wiki page is the drug page name andstart date andend date are the requested dates. Table 3.3 shows the list of drugs Wikipedia pages retrieved.
TABLE 3.3: Drug Wikipedia pagesThe table shows the list of drugs Wikipedia pages included in our analysis. For each drug the table shows the street name and the Wikipedia page
identifier.
Drug street name Wikipedia page
MDMA MDMA
Cannabis Cannabis (drug)
Amphetamine Amphetamine
Diazepam Diazepam
LSD Lysergic acid diethylamide
2CB 2C-B Ketamine Ketamine Methamphetamine Methamphetamine Alprazolam Alprazolam DMT N,N-Dimethyltryptamine Cocaine Cocaine Heroin Heroin Fentanyl Fentanyl
Figure 3.1 describes the Wikipedia data over the sample period. Figure 3.1A shows that total monthly views across the period of study are relatively stable over time. Figure 3.1B shows the distribution of views across languages, of which the English pages are unsurprisingly by far the most popular. Figure 3.1C shows the distribution of views across drugs, which is more evenly spread.