Big Data – a big issue for Official Statistics?

ASC Conference – 26 September 2014

Pete Brodie

Session objectives

• Big Data and Official Statistics

• The ONS Big Data Project aims

• Wider engagement and communication

• Pilots: Infrastructure, Innovation Labs, Smart meters, Mobile Phones, Prices, Twitter

• Emerging Findings

• Next Steps

Big Data and Official Statistics

• Replace existing outputs

• Produce an entirely new output

• Complement other sources:

• Filling in gaps

• Auxiliary variables for statistical models

• Improve operational processes

• Quality assurance

ONS Big Data Project

• A one-year project which aims to:

• investigate the potential for big data in official statistics while understanding the challenges

• establish an ONS policy and longer-term strategy which incorporates ONS’s position within Government and internationally in this field

• recommend next steps to support the strategy going forward

Wider engagement and communication

• International:

• UNECE / ESS

• Cross-government:

• HMG Data Science Community of Interest Group and Cross Profession Working Group

• GSS data strategy

• GDS/DfT/DECC/BoE/DSTL

• Big data for statistics vs other types of analysis

Wider engagement and communication

• Academia:

• ESRC/RSS

• University of Southampton/Cardiff/Huddersfield

• Private sector:

• Mobile network operators

• Mysupermarket/Billion Prices

• Google

• Privacy groups

• B2011 Privacy Advisory Group/GDS Privacy and Ethical Committee

Pilots – Infrastructure

• Huge and continuously growing data streams, requiring new data architectures and software

• Feasibility and efficiency of processing, typically requiring parallel computing on a large scale

• New skills will be required, bringing together statistical and technological expertise

Pilots – Innovation Labs

• A new facility to allow research with datasets and tools without compromising ONS security

• INDEPENDENT of ONS main systems

• NOT SECURE – so the only data that can go on there is PUBLIC data

• A “private cloud” – individual machines are pooled together to provide an integrated environment, accessed via web browser

Pilots – Smart meters

Investigating the potential of smart meter electricity data (high frequency – 30 mins) to identify household occupancy levels and, potentially, household structure

• England and Ireland both conducted pilot rollouts in 2009-2010 – data now available for research

• Southampton University commissioned by Beyond 2011 to conduct preliminary research

Pilots – Smart meters

[Figure: Irish smart meter pilot study. Single meter, total daily electricity consumption (kWh) by day, July 2009 to December 2010. Annotations: Christmas 2009; Christmas 2010; consecutive days with low consumption, possibly a week away?]
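The chart annotation about consecutive low-consumption days suggests a simple rule that can be sketched in code. The following is a minimal illustration only, not the method used in the pilot: the function name, the threshold, the minimum run length and the readings are all assumptions made up for the example.

```python
from typing import List, Tuple


def low_consumption_runs(
    daily_kwh: List[float],
    threshold_kwh: float,
    min_days: int,
) -> List[Tuple[int, int]]:
    """Return inclusive (start, end) day indices for every run of at least
    `min_days` consecutive days with consumption below `threshold_kwh`."""
    runs = []
    start = None  # index where the current low run began, if any
    for i, kwh in enumerate(daily_kwh):
        if kwh < threshold_kwh:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_days:
                runs.append((start, i - 1))
            start = None
    # Close a run that extends to the final day.
    if start is not None and len(daily_kwh) - start >= min_days:
        runs.append((start, len(daily_kwh) - 1))
    return runs


# Invented readings for one meter (kWh per day); not data from the Irish pilot.
readings = [42.0, 39.5, 41.2, 6.1, 5.8, 6.0, 5.9, 6.3, 6.2, 5.7, 40.8, 43.1]
print(low_consumption_runs(readings, threshold_kwh=10.0, min_days=5))
```

In practice the threshold would probably need to be set per household, for example relative to that meter's own typical baseline, since a fixed value would not suit every property.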

Pilots – Mobile Phones

Investigating using mobile phone data to model population flows, e.g. travel to work statistics

• GDS

• Discussions with mobile phone providers (Telefonica, Vodafone, EE) to provide aggregate data on origin-destination flows

• Ethical/commercial issues
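As a rough illustration of what aggregate origin-destination data could support, the sketch below turns a table of flows into a net inflow per area. The area codes, counts and the `net_inflow` helper are hypothetical; no format has been agreed with any provider.

```python
from collections import defaultdict
from typing import Dict, Tuple

# Hypothetical aggregate counts: (origin_area, destination_area) -> trips.
# Area names and numbers are invented for illustration.
od_flows: Dict[Tuple[str, str], int] = {
    ("A", "B"): 120,
    ("A", "C"): 30,
    ("B", "A"): 45,
    ("C", "B"): 80,
    ("B", "B"): 200,  # trips that start and end in the same area
}


def net_inflow(flows: Dict[Tuple[str, str], int]) -> Dict[str, int]:
    """Net inflow per area: trips arriving from elsewhere minus trips leaving."""
    net: Dict[str, int] = defaultdict(int)
    for (origin, destination), trips in flows.items():
        if origin == destination:
            continue  # within-area trips do not change the balance
        net[origin] -= trips
        net[destination] += trips
    return dict(net)


print(net_inflow(od_flows))
```

Because only aggregated counts are used, no individual handset appears in the calculation, which is one reason aggregate data is the form under discussion.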

Pilots – Prices

Scraping price data from the internet for use within price statistics

• Potential for richer, more frequent and cheaper data collection

• Focus on grocery prices from three online supermarkets

• Prototype scrapers collecting a selection of CPI/RPI item categories (daily collection)

• We are purchasing data from MySupermarket.com (linked data, longer time series) for research purposes

Webscraping

Rendered webpage:

HTML code:

...

</div><div class="productLists" id="endFacets-1"><ul class="cf products line"><li id="p-254942348-3" class=" first"><div class="desc"><h3 class="inBasketInfoContainer"><a id="h-254942348" href="/groceries/Product/Details/?id=254942348" class="si_pl_254942348-title"><span class="image"><img

src="http://img.tesco.com/Groceries/pi/121\5010044000121\IDShot_90x90.jpg" alt="" /><!----></span>Warburtons Toastie Sliced

White Bread 800G</a></h3><p class="limitedLife"><a href="http://www.tesco.com/groceries/zones/default.aspx?name=quality-and-freshness">Delivering the freshest food to your door- Find out more &gt;</a></p><div class="descContent"><!----><div class="promo"><a href="/groceries/SpecialOffers/SpecialOfferDetail/Default.aspx?promoId=A31234788" title="All products available for this offer" id="flyout-254942348-promo-A31234788--pos" class="promoFlyout"><span class="promoImgBox"><img src="/Groceries/UIAssets/I/Sites/Retail/Superstore/Online/Product/pos/2for.png" class="promoFlyout promo" alt="Special Offer" id="flyout-254942348-promo-A31234788--posimg" /></span><em>Any 2 for £2.00</em></a><span> valid from 21/1/2014 until 10/2/2014</span></div><div class="tools"><div class="moreInfo"><a href="/groceries/Product/Details/?id=254942348" class="midiFlyout" id="flyout-254942348-midi-0-"><img class="midiFlyout hd"

src="http://ui.tescoassets.com/groceries/UIAssets/I/../Compressed/I_635209615845382232/Sites/Retail/Superstore/Online/Product/infoBlue.gif" alt="" title="View product information" id="flyout-254942348-midi-1-" /></a></div><!----><div

class="links"><ul><li><a

href="http://www.tesco.com/groceries/product/browse/default.aspx?notepad=white%20sliced%20loaf%20800g&amp;N=4294793217" class="shelfFlyout active plaintooltip" id="s-tt-254942348" title="Premium White Bread"> Rest of <span class="hide">Premium White Bread <!----></span>shelf </a></li></ul></div></div></div></div><div class="quantity"><div class="content addToBasket"><p
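To show how fields might be pulled out of markup like the excerpt above, here is a minimal sketch using Python's standard `html.parser`. It is not the ONS prototype scraper. The `fragment` string is a shortened, tidied version of the excerpt, and the rules used (a product is an `<li>` whose id starts with `p-`, the name sits in an `<a>` whose class ends in `-title`, the offer sits in an `<em>`) are assumptions read off that one excerpt.

```python
from html.parser import HTMLParser
from typing import Dict, List


class ProductParser(HTMLParser):
    """Collect the product name and offer text from each product <li>."""

    def __init__(self) -> None:
        super().__init__()
        self.products: List[Dict[str, str]] = []
        self._field = None  # field that incoming text should be added to

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        classes = (attr.get("class") or "").split()
        if tag == "li" and (attr.get("id") or "").startswith("p-"):
            # A new product entry begins.
            self.products.append({"id": attr["id"], "name": "", "offer": ""})
        elif not self.products:
            return  # ignore markup before the first product
        elif tag == "a" and any(c.endswith("-title") for c in classes):
            self._field = "name"
        elif tag == "em":
            self._field = "offer"

    def handle_endtag(self, tag):
        if tag in ("a", "em"):
            self._field = None

    def handle_data(self, data):
        if self._field and self.products:
            self.products[-1][self._field] += data.strip()


def parse_products(html: str) -> List[Dict[str, str]]:
    parser = ProductParser()
    parser.feed(html)
    parser.close()
    return parser.products


# Shortened version of the excerpt; the image path is a placeholder.
fragment = (
    '<li id="p-254942348-3" class=" first"><div class="desc">'
    '<h3 class="inBasketInfoContainer">'
    '<a id="h-254942348" href="/groceries/Product/Details/?id=254942348" '
    'class="si_pl_254942348-title">'
    '<span class="image"><img src="placeholder.jpg" alt="" /></span>'
    'Warburtons Toastie Sliced White Bread 800G</a></h3>'
    '<div class="promo"><a href="#"><em>Any 2 for £2.00</em></a>'
    '<span> valid from 21/1/2014 until 10/2/2014</span></div>'
    '</div></li>'
)
print(parse_products(fragment))
```

The excerpt is cut off before the price element, so the sketch stops at the name and offer text. A scraper tied this closely to class names would need maintenance whenever the retailer changes its page layout.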

The Billion Prices Project @ MIT

[Figure: Daily Online Price Index (United States), with an annotation marking the date Lehman Brothers filed for bankruptcy (15 Sept 2008)]
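For readers unfamiliar with how daily scraped prices can become an index, the sketch below computes an unweighted geometric mean of price relatives against a base day (a Jevons-type index). This is an illustrative assumption, not the method of the Billion Prices Project or of ONS, and the items and prices are invented.

```python
import math
from typing import Dict

# Hypothetical scraped prices in pounds: day -> {item: price}.
prices_by_day: Dict[str, Dict[str, float]] = {
    "2014-01-01": {"bread": 1.00, "milk": 0.50, "eggs": 2.00},
    "2014-01-02": {"bread": 1.10, "milk": 0.50, "eggs": 2.00},
    "2014-01-03": {"bread": 1.21, "milk": 0.50},  # eggs not scraped that day
}


def jevons_index(
    prices: Dict[str, Dict[str, float]], base_day: str
) -> Dict[str, float]:
    """Index (base day = 100) from the geometric mean of price relatives,
    using only items priced on both the base day and the day in question."""
    base = prices[base_day]
    index: Dict[str, float] = {}
    for day, day_prices in sorted(prices.items()):
        matched = [item for item in day_prices if item in base]
        if not matched:
            continue  # nothing comparable was collected on this day
        log_relatives = [math.log(day_prices[i] / base[i]) for i in matched]
        index[day] = 100.0 * math.exp(sum(log_relatives) / len(log_relatives))
    return index


for day, value in jevons_index(prices_by_day, "2014-01-01").items():
    print(day, round(value, 2))
```

Matching items to the base day is a simplification: items that go missing (here, eggs on the third day) simply drop out of that day's calculation, which is one source of the bias a production method would have to address.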

Pilots – Twitter

Using geo-located tweets from Twitter to provide insights on internal migration:

• Harvesting geo-located tweets from the Twitter Streaming API

• Development of clustering methods to group tweets by user and identify ‘significant’ locations

• Using AddressBase to classify clusters (e.g. as residential or commercial)

Lots of activity in different places, but where does this Twitter user live?

Density-Based Spatial Clustering of Applications with Noise (DBSCAN)

2 parameters:

• Distance

• Minimum number of points
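A compact, pure-Python sketch of DBSCAN may help make the parameters concrete. In the standard algorithm they are a neighbourhood distance (`eps` below) and a minimum number of points needed to form a dense region (`min_pts` below). The coordinates and parameter values here are invented, and this is an illustration of the general algorithm rather than the project's implementation.

```python
import math
from typing import List, Optional, Sequence, Tuple

NOISE = -1


def dbscan(
    points: Sequence[Tuple[float, float]], eps: float, min_pts: int
) -> List[Optional[int]]:
    """Label each point with a cluster number (0, 1, ...) or NOISE."""

    def neighbours(i: int) -> List[int]:
        # All points within `eps` of point i, including i itself.
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    labels: List[Optional[int]] = [None] * len(points)  # None = not yet visited
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins, but does not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reach = neighbours(j)
            if len(reach) >= min_pts:
                queue.extend(reach)  # j is a core point, so keep expanding
    return labels


# Invented (easting, northing) pairs in metres for one hypothetical user.
tweets = [
    (0, 0), (10, 0), (0, 10), (10, 10),   # a tight group of four
    (500, 500), (505, 500), (500, 505),   # a second group of three
    (2000, 2000),                         # an isolated tweet
]
print(dbscan(tweets, eps=20, min_pts=3))
```

This version compares every pair of points, which is fine for one user's tweets but would need a spatial index to scale to a full Twitter harvest.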

[Figure: clustered tweet locations for one Twitter user. Legend: Raw Data, Cluster Centroid, Noise. Annotation: "Most likely lives here"]

Cluster_id   Northing   Easting   Count
60033_1      105431     530702    28
60022_2      104041     530894    4
60033_6      182546     532010    13
60033_13     104956     531017    3
60033_15     179830     533395    3
60033_21     165947     532851    3
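One simple way to go from cluster counts to a candidate home location is to take the cluster with the most tweets. The sketch below applies that rule to the rows of the cluster table above. It assumes that tweet count alone is a reasonable signal, which is an assumption for illustration and not a stated result of the pilot. The `Cluster` type and `likely_home` function are hypothetical names.

```python
from typing import List, NamedTuple


class Cluster(NamedTuple):
    cluster_id: str
    northing: int
    easting: int
    count: int


# Rows copied from the cluster table above.
clusters: List[Cluster] = [
    Cluster("60033_1", 105431, 530702, 28),
    Cluster("60022_2", 104041, 530894, 4),
    Cluster("60033_6", 182546, 532010, 13),
    Cluster("60033_13", 104956, 531017, 3),
    Cluster("60033_15", 179830, 533395, 3),
    Cluster("60033_21", 165947, 532851, 3),
]


def likely_home(candidates: List[Cluster]) -> Cluster:
    """Pick the cluster with the highest tweet count."""
    return max(candidates, key=lambda c: c.count)


home = likely_home(clusters)
print(home.cluster_id, home.northing, home.easting, home.count)
```

A count-only rule could just as easily pick out a workplace, which is why classifying clusters as residential or commercial is a natural complement to it.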

Emerging findings: Big Data in ONS

• Prices Pilot

• Web scrapers

Emerging findings: Big Data in ONS

• Smart meter Pilot

• Potential to identify vacant/unoccupied properties

• Intelligence used in the field/within address register

Emerging findings: Big Data in ONS

• Big Data Technologies

• Innovation Labs

Emerging findings: Big Data in ONS

• Statistical

• Precision may not be an issue but bias is

• Traditional statistical methods must not be forgotten in the hype

• Crucial role for ONS

Emerging findings: Big Data in ONS

• Ethical

• Engagement with privacy groups

• ONS Ethical Committee

Emerging findings: Big Data in ONS

• Commercial

• Different commercial models need to be considered

• Need to develop procurement framework for engagement – issues around brand

Emerging findings: Big Data in ONS

• Capability

• Team sport, cross disciplinary

• Need for senior focus – Chief Data Scientist

• Time to develop staff, encourage innovation

• Links with academia – support research, attract graduates, placement students

Emerging findings: Big Data in ONS

• Starting to demonstrate tangible benefits and provide evidence that challenges can be overcome

• But more long-term work is needed to build on these initial findings

• Aligned with ONS and GSS strategy, Government initiatives, UNECE/Eurostat work programmes, academic investment

Next steps – aims

• Support ONS business as usual, projects and programmes with Big Data (BD)

• Build up a pool of expertise

• Develop best practice, standards, guidance and training; understand the ‘data scientist’ role

• Continue to develop partnerships across the GSS

• Continue to develop partnerships with academics and the private sector

• Contribute to international and cross-Government BD initiatives
