
(1)

Big data in official statistics

Insights about world heritage from the analysis of Wikipedia use

Fernando Reis, European Commission - Eurostat

International Symposium on the Measurement of Digital Cultural Products Montreal, 9-11 May 2016

(2)

Eurostat

Defining big data in 1 minute

• Data deluge

• Exhaust data + sensors

• High detail, massive size (large p, large n)

• Data-driven analytical applications

• Statistical modelling, machine learning

• Visualisation

• Data-driven economy

• Official statistics does not have a nearly statistical

(3)

Scheveningen Memorandum on Big Data

Examine the potential of Big Data sources for official statistics

Official Statistics Big Data strategy as part of wider government strategy

Address privacy and data protection

Collaboration at European and global level

Address need for skills

Partnerships between different stakeholders (government, academics, private sector)

Developments in Methodology, quality assessment and IT

Adopt action plan and roadmap for the European Statistical

(4)

Big data strategy

Start with concrete pilots

3 time-frames:

Short-term

Medium-term

Long-term

(5)

Big Data Action Plan and Roadmap @ a glance

• Policy

• Quality

• Skills

• Experience sharing

• Legislation

• IT Infrastructures

• Methods

• Ethics / Communication

• Big data sources

• Governance

(6)

Action (example)

Pilot projects, carried out by the Member States (ESSnet)

• 2015 – 2019 (European Statistical System network)

• Exploring different big data sources (but also IT architecture, partnerships), developing generic guidelines and frameworks

• Establish partnerships with data providers and research and international organisations

• Cooperation with UN on Methodological Framework

Challenges

▫ cooperation, sharing of know-how

▫ development of a sound methodology ("from design-based to model-design-based approach")

▫ exploration & tentative implementation

▫ looking for partners

(7)

Eurostat big data pilots

• Contracts

Feasibility study on the use of mobile phone data for tourism statistics

Internet as a data source for information society statistics

Accreditation of big data sources

• Internal projects

• Wikipedia use

• Mobile phone for urban statistics

• Web evidence for nowcasting

(8)

Wikipedia as a big data source

Insights about world heritage from the analysis of Wikipedia use

(9)

World Heritage Sites

Convention Concerning the Protection of the World Cultural and Natural Heritage

List of World Heritage Sites maintained by

(10)

Data sources: UNESCO

1. List of World Heritage Sites from UNESCO

Public source

(11)

Data sources: Wikipedia

2. Wikipedia

Public source

Digital traces left by people

Widely used

• In 2013, 44% of individuals aged 16 to 74 living in the EU consulted wikis (e.g. Wikipedia) to obtain knowledge

• Among individuals aged 16 to 24, the share was 69%

• Source: Community Survey on ICT Usage by Individuals

(12)

Data sources: Wikipedia

2.1 Content (text and links)

Selection of articles related to World Heritage sites

2.2 Page views

Wikistats: hourly number of page views for all articles of all wiki projects of the Wikimedia Foundation

(13)

Wikipedia page views raw data

Example: zu.z Ulimi 8 AE1,LN2O1Q1,AX1,FB2,

[wiki code][article title][monthly total][hourly counts]

Monthly files

From Jan 2012 to Oct 2015
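The line layout above can be decoded with a few lines of Python. This is a minimal sketch, not the code used in the project. It assumes that each comma-separated chunk starts with a letter for the day of the month (A = day 1), followed by pairs of an hour letter (A = hour 0) and a count. That reading is inferred from the example line, whose decoded counts add up to the stated monthly total of 8. The function name `parse_pageview_line` is hypothetical.

```python
from typing import Dict, Tuple

HourKey = Tuple[int, int]  # (day of month, hour of day)


def parse_pageview_line(line: str) -> Tuple[str, str, int, Dict[HourKey, int]]:
    """Parse '[wiki code] [article title] [monthly total] [hourly counts]'.

    Assumed encoding (inferred from the example, not from official docs):
    each chunk is a day letter followed by hour-letter/count pairs.
    """
    project, title, total, encoded = line.rstrip("\n").split(" ", 3)
    counts: Dict[HourKey, int] = {}
    for chunk in encoded.split(","):
        if not chunk:
            continue  # the example line ends with a trailing comma
        day = ord(chunk[0]) - ord("A") + 1
        i = 1
        while i < len(chunk):
            hour = ord(chunk[i]) - ord("A")
            j = i + 1
            while j < len(chunk) and chunk[j].isdigit():
                j += 1
            key = (day, hour)
            counts[key] = counts.get(key, 0) + int(chunk[i + 1:j])
            i = j
    return project, title, int(total), counts


project, title, total, counts = parse_pageview_line(
    "zu.z Ulimi 8 AE1,LN2O1Q1,AX1,FB2,"
)
print(project, title, total, sum(counts.values()))
```

Checking that the decoded counts sum to the monthly total is a cheap sanity test to run on every line before further processing.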

(14)

Data processing

Sandbox computer cluster

4 nodes, each:

• 2 x Intel Xeon E5-2650 v3 10 cores

• 128 GB RAM

• 4 x 4 TB disk

FDR Infiniband (56 Gbit)

3 stages:

• Pre-processing

• Extraction

• Analytics

(15)

Pre-processing

Scripts in Unix shell and Pig

Filtering of raw data to needed project and language

Change of format:

en.z Banc_d'Arguin_National_Park [0, 0, 0, 5, 6, 16, 5, 20, 25, 21, 48, 29, 43, 40, 46, 0, 30, 55, 36, 39, 28, 28, 204, 218]
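The filtering step can be illustrated as follows. The slides did this with Unix shell and Pig; the sketch below uses Python instead, purely for illustration, and covers only the filtering to the needed projects, not the change of format. It assumes that the first space-separated field of each raw line is the wiki code (as in `en.z`), and the function name `filter_projects` is hypothetical.

```python
from typing import Iterable, Iterator, Set


def filter_projects(lines: Iterable[str], wanted: Set[str]) -> Iterator[str]:
    """Yield only the raw lines whose wiki code is in `wanted`.

    Works lazily, so a large monthly file can be streamed line by line.
    """
    for line in lines:
        wiki_code = line.split(" ", 1)[0]
        if wiki_code in wanted:
            yield line


# Toy input lines, made up for illustration only.
sample = [
    "en.z Some_article 1 AA1",
    "zu.z Ulimi 8 AE1,LN2O1Q1,AX1,FB2,",
    "en.b Some_book 2 AA2",
]
kept = list(filter_projects(sample, {"en.z", "zu.z"}))
print(kept)
```

In practice `lines` would be an open file handle on one of the monthly files, and `wanted` the wiki codes of the languages under study.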

(16)

Extraction

Map-reduce jobs

Scripts in Unix shell and Python

Filtering to list of articles supplied

Time aggregation from hourly to daily, weekly and monthly
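The two operations of this stage, filtering to a supplied list of articles and aggregating over time, can be sketched in a single Python process. This is a stand-in for the map-reduce jobs, not a reproduction of them. It assumes the hourly data have already been decoded into (title, timestamp, views) records, and it uses ISO weeks for the weekly totals, which is a choice made here rather than something stated in the slides. The name `aggregate_views` is hypothetical.

```python
from collections import Counter
from datetime import datetime
from typing import Dict, Iterable, Set, Tuple

Record = Tuple[str, datetime, int]  # (article title, hour timestamp, page views)


def aggregate_views(records: Iterable[Record], wanted: Set[str]) -> Dict[str, Counter]:
    """Keep only the wanted articles and sum views per day, ISO week and month."""
    daily: Counter = Counter()
    weekly: Counter = Counter()
    monthly: Counter = Counter()
    for title, ts, views in records:
        if title not in wanted:
            continue  # filtering to the list of articles supplied
        iso_year, iso_week = ts.isocalendar()[:2]
        daily[(title, ts.date())] += views
        weekly[(title, iso_year, iso_week)] += views
        monthly[(title, ts.year, ts.month)] += views
    return {"daily": daily, "weekly": weekly, "monthly": monthly}


# Toy records, made up for illustration only.
records = [
    ("Ulimi", datetime(2015, 10, 4, 23), 5),
    ("Ulimi", datetime(2015, 10, 5, 0), 7),
    ("Other", datetime(2015, 10, 5, 1), 100),
]
result = aggregate_views(records, {"Ulimi"})
print(result["monthly"])
```

Keying each total by (title, period) mirrors the key-value shape a reduce step would emit, so the same logic could be moved into a distributed job.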

(17)

Data analysis

R and RStudio

Querying APIs (CatScan, Wikipedia Miner, Wikimedia)

Web scraping of Wikipedia for selection of articles (geo-coordinates, categorisation, information boxes, article redirects, article links)

(18)

Number of page views of related Wikipedia articles per country of location of the WHS

Reference:

Jan.2012 – Oct.2015 31 languages

(19)

Average number of page views according to the date of inscription

Reference:

Jan.2012 – Oct.2015 31 languages

(20)

Top 20 World Heritage Sites in number of page views of related Wikipedia articles

Reference:

Jan.2012 – Oct.2015 31 languages

(21)

Distribution of page views of articles related to World Heritage Sites by language of Wikipedia

[Figure legend: en, es, de, fr, ru, it, pl, pt, nl, tr]

Reference:

Jan.2012 – Oct.2015 31 languages

(22)

Top 5 WHS in number of page views of related Wikipedia articles by language

Reference:

Jan.2012 – Oct.2015 31 languages

English Spanish

(23)

Page views of Wikipedia articles related to World Heritage Sites

[Figure: time series of page views; vertical axis from 0M to 45M, horizontal axis from Mar 2012 to Jul 2015]

Reference:

Jan.2012 – Oct.2015 31 languages

(24)

Page views of Wikipedia articles related to World Heritage Sites (English Wikipedia)

(25)
(26)

Distribution of WHS by number of page views (NOT log)

The percentage of page views going to the top 20 WHS is 32%
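A concentration figure of this kind is the sum of the k largest totals divided by the grand total. The sketch below shows that calculation on made-up numbers. It does not reproduce the 32% result, and the function name `top_k_share` is hypothetical.

```python
from typing import Mapping


def top_k_share(views_by_site: Mapping[str, int], k: int = 20) -> float:
    """Share of all page views that goes to the k most viewed sites."""
    totals = sorted(views_by_site.values(), reverse=True)
    grand_total = sum(totals)
    if grand_total == 0:
        return 0.0  # avoid dividing by zero on empty or all-zero input
    return sum(totals[:k]) / grand_total


# Toy totals, made up for illustration only.
toy = {"site_a": 50, "site_b": 30, "site_c": 15, "site_d": 5}
print(top_k_share(toy, k=2))
```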

(27)

Eurostat

Thank you for your attention

Fernando Reis

Eurostat Task Force on Big Data

https://github.com/reisfe/

https://twitter.com/reisfe/

https://linkedin.com/in/reisfe/

fernando.reis@ec.europa.eu
