BIG DATA AND OFFICIAL STATISTICS. Filomena Maggino, Monica Pratesi

(1)

BIG DATA AND

OFFICIAL STATISTICS

(2)

What about

risks, needs, and challenges

of big data in the context of

measuring wellbeing?

(3)

«Data are widely available, what is scarce is the ability to

extract wisdom from them»

(Hal Varian, Google chief economist)

http://www.economist.com/node/15557443

(4)

risk

need

challenge

(5)

risk

losing the way

(6)

risk

losing the way

BIG

the more we have, the better it is

(7)

risk

losing the way

BIG

the more we have, the better it is

meaningful mass of

information

(8)

risk

losing the way

“big” should represent an

opportunity for transversal reading

(in a nutshell, this is the idea behind

the multipurpose project at ISTAT)

(9)

system

need

(10)

system

need

Exploiting all data sources in order to

build a consistent picture of the

community’s wellbeing …

(11)

system

need

… through a transversal and horizontal approach,

creating a large and heterogeneous body of information from

which to generate an overall view

(12)

challenge

heterogeneity

(13)

challenge

heterogeneity

BIG

heterogeneity of its

components

(14)

challenge

heterogeneity

not [only] integration of different sources

but [also] …

(15)

challenge

heterogeneity

… building and re-building paths of transversal meaning

(16)

The definition of new indicators of countries’ progress

and wellbeing has created new data needs.

(17)

BIG DATA

(18)

Instruments to

manage big

data

(19)

In order to avoid

indigestible mixtures

…

(20)

… a consistent

conceptual

framework is needed

(21)

conceptual framework + big data + analytic instruments = measuring a country’s wellbeing

(22)

In this perspective, we need to take into account the

conceptual dimensions

describing

a country’s progress and communities’ wellbeing

(23)

1. Wellbeing

• quality of life:

o living conditions

o subjective wellbeing

• quality of society: social cohesion (participation, trust, social relations, identity)

2. Equity

• distribution of wellbeing: inequalities, regional disparities

• social exclusion

3. Sustainability

Relationship between the previous levels, the environment and the future

(24)

The conceptual dimensions

need to be observed and analyzed at micro level

(individual / household) (*)

(*) see Stiglitz J. E., A. Sen & J.-P. Fitoussi eds. (2009) Report by the Commission on the Measurement of Economic Performance and Social Progress, Paris. http://www.stiglitz-sen-fitoussi.fr/en/index.htm

(25)

Our aim

is to introduce

BIG DATA

and their potential informative load

into the dimension of

social indicators

in the field of official statistics

(26)

Our challenge

is to construct

complex indicators

able to

(i) monitor communities’ wellbeing

(ii) support the definition of better policies

by introducing new descriptions captured by big data.

(27)

Our challenge is to construct

complex indicators

by meeting

the required characteristics …

(28)

Identifying indicators

An indicator should be able to:

(I) METHODOLOGICAL SOUNDNESS

• define and describe

• observe unequivocally and stably

• record with as low a degree of distortion as possible

(II) INTEGRITY

• adhere to the principle of objectivity

(III) SERVICEABILITY

• reflect adequately the conceptual model

• meet current and potential users’ needs

• be observed through realistic efforts and costs

• reflect the length of time between its availability and the event or phenomenon it describes

• be analyzed in order to record differences and disparities

(IV) ACCESSIBILITY

• be spread

(29)

In other words, our goal is to extract

consistent knowledge, new insights and

meaningful pictures of our societies’

progress and wellbeing

from

(30)

Introduction to Small Area Estimation

Population of interest (or target population): population for

which the survey is designed

direct estimators should be reliable for the target population

Domains: sub-populations of the population of interest; they may or may not be planned in the survey design

• Geographic areas (e.g. Regions, Provinces, Municipalities, Health Service Areas)

• Socio-demographic groups (e.g. Sex, Age, Race within a large geographic area)

• Other sub-populations (e.g. the set of firms belonging to an industry subdivision)

we don’t know the reliability of direct estimators for the

(31)

Introduction to Small Area Estimation

Often direct estimators are not reliable for some domains of

interest

In these cases we have two choices:

• oversampling in those domains

• applying statistical techniques that allow for reliable estimates in those domains

Small Domain or Small Area: a geographical area or domain where direct estimators do not reach a minimum level of precision

Small Area Estimator (SAE): an estimator created to obtain
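
As a rough illustration of how a small area might be identified in practice, the Python sketch below computes a direct estimate of the mean for each domain and flags the domains whose coefficient of variation (CV) exceeds a threshold. It rests on simplifying assumptions that are not in the slides: simple random sampling within each domain, no finite population correction, and a hypothetical 20% CV cut-off. The function name `direct_estimates` and the sample values are illustrative only.

```python
from collections import defaultdict
from math import sqrt
from statistics import mean, variance


def direct_estimates(records, cv_threshold=0.20):
    """Direct estimate of the mean per domain, with a precision flag.

    records: iterable of (domain, value) pairs from a survey sample.
    cv_threshold: hypothetical precision requirement (assumption).
    """
    by_domain = defaultdict(list)
    for domain, value in records:
        by_domain[domain].append(value)

    results = {}
    for domain, values in by_domain.items():
        n = len(values)
        estimate = mean(values)
        # Standard error of the mean under simple random sampling,
        # ignoring the finite population correction (assumption).
        se = sqrt(variance(values) / n) if n > 1 else float("inf")
        cv = se / abs(estimate) if estimate != 0 else float("inf")
        results[domain] = {
            "n": n,
            "mean": estimate,
            "cv": cv,
            # A domain is treated as a "small area" when the direct
            # estimator misses the required precision.
            "small_area": cv > cv_threshold,
        }
    return results


# Made-up sample: domain A is well covered, domain B has two units.
sample = [("A", v) for v in (10, 11, 9, 10, 10, 11, 9, 10)]
sample += [("B", 4), ("B", 12)]
for domain, result in direct_estimates(sample).items():
    print(domain, result)
```

In a real survey the variance would come from the sampling design (weights, strata, clusters) rather than from the simple formula used here.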

(32)

Small Area Estimation and Big Data

Our aim is to use the huge volume of data generated by

human activities (the big data) to make accurate inferences at the small area level

We identified three possible approaches:

1. Use big data as covariates in small area models

2. Use survey data to remove self-selection bias from

estimates obtained using big data

3. Use big data to validate small area estimates

(33)

Use Big Data as Covariates in Small Area Models

Big data often provide unit level data

The outcome variable has to be linked to auxiliary variables in order to use unit level data in a small area model

Due to technical challenges and legal restrictions, it is not feasible at this stage to have unit level big data that can be linked with administrative archives, census or survey data

Big data can be aggregated at area level and then used in an area level model
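
One widely used area level model is the Fay-Herriot model, in which the direct estimate for each area is regressed on area level covariates and then shrunk towards the regression prediction. The slides do not say which area level model the authors have in mind, so the Python sketch below is only an illustration under stated assumptions: a single hypothetical covariate `x` (for example, a big data indicator averaged over each area), known sampling variances `psi`, and the Prasad-Rao moment estimator for the between-area variance. All names are illustrative.

```python
def fay_herriot(y, x, psi):
    """EBLUP sketch for y_i = b0 + b1*x_i + u_i + e_i.

    y:   direct survey estimates, one per area
    x:   area level covariate (assumed: aggregated big data indicator)
    psi: known sampling variances of the direct estimates (all > 0)
    """
    m = len(y)

    def wls(w):
        # Weighted least squares for an intercept and one slope.
        sw = sum(w)
        xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
        ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw
        sxx = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
        sxy = sum(wi * (xi - xbar) * (yi - ybar)
                  for wi, xi, yi in zip(w, x, y))
        b1 = sxy / sxx
        return ybar - b1 * xbar, b1

    # Step 1: OLS fit and Prasad-Rao moment estimator of Var(u_i).
    b0, b1 = wls([1.0] * m)
    xbar = sum(x) / m
    sxx = sum((xi - xbar) ** 2 for xi in x)
    rss = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    leverage = [1 / m + (xi - xbar) ** 2 / sxx for xi in x]
    correction = sum(p * (1 - h) for p, h in zip(psi, leverage))
    sigma2_u = max(0.0, (rss - correction) / (m - 2))

    # Step 2: refit with weights 1 / (sigma2_u + psi_i).
    b0, b1 = wls([1 / (sigma2_u + p) for p in psi])

    # Step 3: shrink each direct estimate towards the regression line;
    # gamma_i is close to 1 when the direct estimate is precise.
    gamma = [sigma2_u / (sigma2_u + p) for p in psi]
    eblup = [g * yi + (1 - g) * (b0 + b1 * xi)
             for g, yi, xi in zip(gamma, y, x)]
    return eblup, gamma, (b0, b1), sigma2_u


# Made-up inputs: six areas, the last three with noisier direct estimates.
eblup, gamma, beta, sigma2_u = fay_herriot(
    y=[5, 1, 12, 6, 18, 10],
    x=[1, 2, 3, 4, 5, 6],
    psi=[0.5, 0.5, 0.5, 4, 4, 4],
)
print(eblup, gamma, beta, sigma2_u)
```

Areas with a large sampling variance receive a smaller `gamma` and therefore borrow more strength from the covariate, which is the role the aggregated big data would play in this setting.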

(34)

Use Survey Data to Remove Self-Selection Bias from Estimates Obtained Using Big Data

An option is to use big data directly to measure poverty

and social exclusion

It is realistic to think that the big data are not representative

of the whole population of interest (self-selection problem)

Using a high-quality survey we can check the differences in the distribution of common variables between big data and survey data

If there are no common variables, we can use known correlated data to check the differences in the distributions

Given these differences, we can compute weights that reduce the bias due to the self-selection of the big data
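
The slides do not specify how such weights would be computed. One simple possibility is post-stratification on a variable observed in both sources: each big data unit is weighted by the ratio between its group's share in the survey and its share in the big data. The Python sketch below illustrates this under that assumption; the grouping variable, the shares and the function names are hypothetical.

```python
from collections import Counter


def poststratification_weights(big_data_groups, survey_shares):
    """Weight per group = survey share / big data share.

    big_data_groups: group label of each big data unit
    survey_shares:   {group: share estimated from the quality survey}
    Groups absent from the big data cannot be corrected this way.
    """
    counts = Counter(big_data_groups)
    n = len(big_data_groups)
    return {g: survey_shares[g] / (counts[g] / n) for g in counts}


def weighted_mean(values, groups, weights):
    """Mean of the big data values after reweighting each unit."""
    num = sum(weights[g] * v for v, g in zip(values, groups))
    den = sum(weights[g] for g in groups)
    return num / den


# Made-up example: young people are over-represented in the big data.
groups = ["young"] * 8 + ["old"] * 2
values = [1] * 8 + [0] * 2          # indicator observed in the big data
weights = poststratification_weights(groups, {"young": 0.5, "old": 0.5})
print(sum(values) / len(values))     # unweighted mean
print(weighted_mean(values, groups, weights))
```

This only removes the part of the self-selection bias that is explained by the grouping variable; selection that depends on unobserved characteristics is left untouched.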

(35)

Use Big Data to Validate Small Area Estimates

Poverty and deprivation measures obtained from big data

can be compared with similar measures obtained from official survey data

If big data estimates and survey data estimates agree,

then there is double-checked evidence of the level of poverty and deprivation
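
The slides do not define when two estimates agree. As one illustrative criterion, the Python sketch below treats a big data estimate as consistent with the survey when it falls inside the approximate 95% confidence interval of the survey estimate. The criterion, the function name `compare_estimates` and the figures are assumptions made for the example.

```python
def compare_estimates(big_data, survey, z=1.96):
    """Compare area estimates from the two sources.

    big_data: {area: estimate}
    survey:   {area: (estimate, standard_error)}
    z:        normal quantile for the interval (assumed 95% level)
    Only areas present in both sources are compared.
    """
    report = {}
    for area in sorted(big_data.keys() & survey.keys()):
        estimate, se = survey[area]
        difference = big_data[area] - estimate
        report[area] = {
            "difference": difference,
            # Agreement: big data estimate within z standard errors
            # of the survey estimate.
            "agrees": abs(difference) <= z * se,
        }
    return report


# Made-up poverty rates for two areas.
survey = {"A": (0.20, 0.02), "B": (0.35, 0.03)}
big_data = {"A": 0.22, "B": 0.50}
for area, result in compare_estimates(big_data, survey).items():
    print(area, result)
```

A fuller check would also account for the uncertainty of the big data estimate itself, which this sketch ignores.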