BIG DATA AND
OFFICIAL STATISTICS
What about
risks, needs, and challenges
of big data in the context of
measuring wellbeing?
«Data are widely available, what is scarce is the ability to
extract wisdom from them»
(Hal Varian, Google chief economist)
http://www.economist.com/node/15557443
risk
need
challenge
risk
losing the way
BIG
the more we have, the better it is
meaningful mass of
information
“big” should represent an
opportunity for transversal reading
(this is, in a nutshell, the idea behind
the multipurpose project at ISTAT)
system
need
Exploiting all data sources in order to
describe a consistent picture of the
community’s wellbeing
through a transversal and horizontal approach,
creating a large and heterogeneous body of data from
which to generate an overall view
challenge
heterogeneity
BIG
heterogeneity of its
components
not [only] integration of different sources
but [also] building and re-building paths of transversal meaning
The definition of new indicators of countries’ progress
and wellbeing has created new data needs.
BIG DATA
Instruments to
manage big
data
In order to avoid
indigestible mixtures,
a consistent
conceptual
framework is needed
conceptual framework + big data + analytic instruments = measuring a country’s wellbeing
In this perspective, we need to take into account the
conceptual dimensions
describing
a country’s progress and communities’ wellbeing
1. Wellbeing
• quality of life:
o living conditions
o subjective wellbeing
• quality of society: social cohesion (participation, trust, social relations, identity)
2. Equity
• distribution of wellbeing: inequalities, regional disparities
• social exclusion
3. Sustainability
Relationship between the previous levels, the environment and the future
The conceptual dimensions
need to be observed and analyzed at micro level
(individual / household) (*)
(*) see Stiglitz J. E., A. Sen & J.-P. Fitoussi eds. (2009) Report by the Commission on the Measurement of Economic Performance and Social Progress, Paris. http://www.stiglitz-sen-fitoussi.fr/en/index.htm
Our aim
is to introduce
BIG DATA
and their potential informative load
into the dimension of
social indicators
in the field of official statistics
Our challenge
is to construct
complex indicators
able to
(i) monitor communities’ wellbeing
(ii) support the definition of better policies
by introducing new descriptions captured by big data.
Our challenge is to construct
complex indicators
by meeting
the required characteristics:
An indicator
should be able
to:
(I) METHODOLOGICAL SOUNDNESS
• define and describe
• observe unequivocally and stably
• record by a degree of distortion as low as possible
(II) INTEGRITY
• adhere to the principle of objectivity
(III) SERVICEABILITY
• reflect adequately the conceptual model
• meet current and potential users’ needs
• be observed through realistic efforts and costs
• reflect the length of time between its availability and the event or phenomenon it describes
• be analyzed in order to record differences and disparities
(IV) ACCESSIBILITY
• be disseminated
Identifying indicators
In other words, our goal is to extract
consistent knowledge, new insights and
meaningful pictures of our societies’
progress and wellbeing
from
Introduction to Small Area Estimation
• Population of interest (or target population): the population for which the survey is designed
o direct estimators should be reliable for the target population
• Domains: sub-populations of the population of interest; they may or may not be planned in the survey design
o Geographic areas (e.g. Regions, Provinces, Municipalities, Health Service Areas)
o Socio-demographic groups (e.g. Sex, Age, Race within a large geographic area)
o Other sub-populations (e.g. the set of firms belonging to an industry subdivision)
we don’t know the reliability of direct estimators for the
Introduction to Small Area Estimation
• Often direct estimators are not reliable for some domains of interest
• In these cases we have two choices:
o oversampling in those domains
o applying statistical techniques that allow for reliable estimates in those domains
• Small Domain or Small Area: geographical area or domain where direct estimators do not reach a minimum level of precision
• Small Area Estimator (SAE): an estimator created to obtain
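The definition of a small area above can be illustrated with a minimal sketch. It assumes simple random sampling within each domain and uses the coefficient of variation (CV) of the direct estimate as the precision measure; the function name `direct_estimates`, the 20% threshold and the sample values are hypothetical choices, not taken from the slides.

```python
import math
from statistics import mean, variance

def direct_estimates(samples, max_cv=0.2):
    """Direct (sample mean) estimate per domain, with a small-area flag.

    samples: dict mapping domain name -> list of sampled values.
    max_cv:  hypothetical precision threshold on the coefficient of variation.
    """
    out = {}
    for domain, values in samples.items():
        n = len(values)
        est = mean(values)
        # Standard error of the mean under simple random sampling (assumption).
        se = math.sqrt(variance(values) / n) if n > 1 else float("inf")
        cv = se / abs(est) if est != 0 else float("inf")
        # A domain is treated as a "small area" when the CV exceeds the threshold.
        out[domain] = {"n": n, "estimate": est, "cv": cv, "small_area": cv > max_cv}
    return out

# Hypothetical data: a well-sampled region and a barely sampled municipality.
samples = {
    "Region A": [10.0, 12.0, 11.0, 9.0, 10.5, 11.5, 10.0, 12.5, 9.5, 11.0],
    "Municipality B": [4.0, 16.0],
}
result = direct_estimates(samples)
for domain, row in result.items():
    print(domain, row)
```

In practice the variance would come from the actual survey design (weights, stratification, clustering) rather than from the simple formula used here.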
Small Area Estimation and Big Data
Our aim is to use the huge source of data coming from
human activities - the big data - to make accurate inference at a small area level
We identified three possible approaches:
1. Use big data as covariates in small area models
2. Use survey data to remove self-selection bias from estimates obtained using big data
3. Use big data to validate small area estimates
Use Big Data as Covariates in Small Area
Models
• Big data often provide unit level data
• The outcome variable has to be linked to auxiliary variables in order to use unit level data in a small area model
• Due to technical challenges and legal restrictions, it is unfeasible at this stage to have unit level big data that can be linked with administrative archives, census or survey data
• Big data can be aggregated at area level and then used in an area level model
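The last point can be sketched as follows. The slides do not say which area level model is meant; this sketch assumes the Fay-Herriot model with a single big data covariate, known sampling variances `psi` (all strictly positive) and the Prasad-Rao moment estimator for the area-effect variance. All names and data are hypothetical.

```python
from collections import defaultdict

def area_means(records):
    """Aggregate unit level big data, given as (area, value) pairs, to area means."""
    sums, counts = defaultdict(float), defaultdict(int)
    for area, value in records:
        sums[area] += value
        counts[area] += 1
    return {a: sums[a] / counts[a] for a in sums}

def wls_line(x, y, w):
    """Weighted least squares fit of y = a + b * x; returns (a, b)."""
    sw = sum(w)
    xm = sum(wi * xi for wi, xi in zip(w, x)) / sw
    ym = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - xm) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - xm) * (yi - ym) for wi, xi, yi in zip(w, x, y))
    b = sxy / sxx
    return ym - b * xm, b

def fay_herriot(y, x, psi):
    """Assumed model: y_i = a + b * x_i + u_i + e_i with Var(e_i) = psi_i known.

    Returns (estimates, sigma2_u, (a, b)).
    """
    m = len(y)
    # Step 1: OLS residuals and leverages for the moment estimator of Var(u_i).
    a0, b0 = wls_line(x, y, [1.0] * m)
    xm = sum(x) / m
    sxx = sum((xi - xm) ** 2 for xi in x)
    lev = [1.0 / m + (xi - xm) ** 2 / sxx for xi in x]
    rss = sum((yi - a0 - b0 * xi) ** 2 for xi, yi in zip(x, y))
    correction = sum(p * (1.0 - h) for p, h in zip(psi, lev))
    sigma2_u = max(0.0, (rss - correction) / (m - 2))
    # Step 2: re-fit the regression with weights 1 / (sigma2_u + psi_i).
    a, b = wls_line(x, y, [1.0 / (sigma2_u + p) for p in psi])
    # Step 3: shrink each direct estimate towards its regression prediction.
    estimates = []
    for yi, xi, p in zip(y, x, psi):
        gamma = sigma2_u / (sigma2_u + p)
        estimates.append(gamma * yi + (1.0 - gamma) * (a + b * xi))
    return estimates, sigma2_u, (a, b)

# Hypothetical unit level big data (e.g. an activity index) by area.
records = [("A", 0.2), ("A", 0.4), ("B", 0.5), ("B", 0.7),
           ("C", 0.9), ("C", 1.1), ("D", 1.2), ("D", 1.4)]
covariate = area_means(records)
areas = sorted(covariate)
direct = {"A": 0.10, "B": 0.18, "C": 0.22, "D": 0.35}      # survey direct estimates
psi = {"A": 0.0004, "B": 0.0025, "C": 0.0009, "D": 0.0036}  # their sampling variances
est, sigma2_u, coef = fay_herriot([direct[a] for a in areas],
                                  [covariate[a] for a in areas],
                                  [psi[a] for a in areas])
print(dict(zip(areas, est)), sigma2_u, coef)
```

Areas with a large sampling variance are pulled more strongly towards the prediction based on the big data covariate, while precisely estimated areas stay close to their direct estimate.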
Use Survey Data to Remove Self-Selection
Bias from Estimates Obtained Using Big Data
• An option is to use big data directly to measure poverty and social exclusion
• It is realistic to think that the big data are not representative of the whole population of interest (self-selection problem)
• Using a quality survey we can check the differences in the distribution of common variables between big data and survey data
• If there are no common variables, we can use known correlated data to check the differences in the distributions
• Given these differences, we can compute weights that reduce the bias due to the self-selection of the big data
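One simple way to compute such weights is post-stratification on a common variable, sketched below. The slides do not specify the weighting method, so this choice, the function names and the toy data are assumptions.

```python
from collections import Counter

def poststratification_weights(big_data_groups, survey_shares):
    """Weight per group = population share (from the survey) / share in the big data.

    big_data_groups: list with the group (e.g. age class) of each big data unit.
    survey_shares:   dict group -> population share estimated from the survey.
    """
    counts = Counter(big_data_groups)
    n = len(big_data_groups)
    return {g: survey_shares[g] / (counts[g] / n) for g in counts}

def weighted_mean(values, groups, weights):
    """Mean of `values`, each unit weighted by the weight of its group."""
    num = sum(weights[g] * v for v, g in zip(values, groups))
    den = sum(weights[g] for g in groups)
    return num / den

# Hypothetical example: young people are over-represented in the big data
# (80% of units) compared with the survey-based population share (50%).
groups = ["young"] * 8 + ["old"] * 2
values = [1.0] * 8 + [0.0] * 2          # indicator observed in the big data
survey_shares = {"young": 0.5, "old": 0.5}

weights = poststratification_weights(groups, survey_shares)
raw = sum(values) / len(values)
adjusted = weighted_mean(values, groups, weights)
print(raw, adjusted)
```

Weights of this kind only correct for differences in the variables used to build them; groups that are entirely absent from the big data cannot be recovered by weighting.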
Use Big Data to Validate Small Area
Estimates
• Poverty and deprivation measures obtained from big data can be compared with similar measures obtained from official survey data
• If big data estimates and survey data estimates agree, there is double-checked evidence of the level of poverty and deprivation
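A minimal sketch of such a comparison, assuming both sources provide one estimate per area: it reports the Pearson correlation and the mean absolute difference over the areas covered by both. The slides do not define how agreement is measured, so these two measures, the function name `agreement` and the figures are hypothetical.

```python
import math

def agreement(big_data_est, survey_est):
    """Compare two sets of area estimates on the areas they have in common."""
    areas = sorted(set(big_data_est) & set(survey_est))
    b = [big_data_est[a] for a in areas]
    s = [survey_est[a] for a in areas]
    n = len(areas)
    mb, ms = sum(b) / n, sum(s) / n
    cov = sum((x - mb) * (y - ms) for x, y in zip(b, s))
    var_b = sum((x - mb) ** 2 for x in b)
    var_s = sum((y - ms) ** 2 for y in s)
    return {
        "areas": areas,
        "correlation": cov / math.sqrt(var_b * var_s),
        "mean_abs_diff": sum(abs(x - y) for x, y in zip(b, s)) / n,
    }

# Hypothetical poverty rates by area from the two sources.
big_data_est = {"A": 0.10, "B": 0.20, "C": 0.30}
survey_est = {"A": 0.12, "B": 0.22, "C": 0.32, "D": 0.50}
report = agreement(big_data_est, survey_est)
print(report)
```

A high correlation together with a non-zero mean difference would suggest that the two sources rank the areas similarly but differ in level.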