IPUMS-International:
Harmonizing Big Data for Smart Research
Patricia Kelly Hall
University of Minnesota
Presented at the Microdata Computation Centre (MiCoCe) Workshop
April 29, 2014, Nuremberg, Germany
IPUMS: Big Data for Smart Research
IPUMS-International Overview
Hazards of “Big Data” research
IPUMS harmonization principles / process
Data & tools for smart research
IF TIME: User statistics for Europe
IPUMS-International Overview
international.ipums.org
Source data for IPUMS-International are generously provided by participating National Statistical Offices
IPUMSI Microdata Availability
Hazards of using Big Data
Big data
= many different types of data
= different population coverage
= different data producers
= different places and times
= different variable names
= different levels of documentation
Big data born of rapid change in technology
Technology ≠ substitute for understanding
Overcoming the Hazards of using Big Data
Harmonization of data is essential
- universe, variable meaning, codes
Harmonization of microdata AND metadata are both necessary
Tools to facilitate access to metadata
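To make "harmonization of codes" concrete, here is a minimal sketch of recoding sample-specific values into one shared scheme. The sample names, source codes and harmonized codes are all hypothetical and invented for illustration. They are not taken from IPUMS documentation.

```python
# Hypothetical harmonized scheme for a marital-status variable (illustration only).
HARMONIZED_LABELS = {1: "single", 2: "married", 3: "widowed", 9: "unknown"}

# Hypothetical translation tables: one per source sample.
# The two samples record the same concepts with different source codes.
TRANSLATION = {
    "country_a_2001": {"S": 1, "M": 2, "W": 3},
    "country_b_2002": {1: 2, 2: 1, 3: 3},
}

def harmonize(sample, source_code):
    """Map a sample-specific code to the shared code; unmapped values become 9."""
    return TRANSLATION[sample].get(source_code, 9)

records = [
    ("country_a_2001", "M"),
    ("country_b_2002", 1),
    ("country_b_2002", 7),  # code with no entry in the translation table
]
harmonized = [harmonize(sample, code) for sample, code in records]
labels = [HARMONIZED_LABELS[h] for h in harmonized]
print(labels)
```

Keeping the translation table as data rather than logic means the same table can also serve as metadata, since it documents how each source code was treated.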
IPUMS harmonization principles / process
[Workflow diagram: IPUMS processing of DATA and METADATA]
DATA: Data files; Reformat data; Donation; Draw sample; Confidentiality A; Create source variables; Confidentiality B; Verify data; Harmonize codes; Variable programming; Family pointers; GIS boundary files
METADATA: Data dictionary; Enumeration forms; Enum. instructions; Census/sample design; IPUMS data dictionary; Images to editable files; Translate docs to English; Tag enumeration text; Document source variables; Variable descriptions; Sample documentation
IPUMS follows international standards on microdata confidentiality
EUROSTAT statistical confidentiality standards (Thorogood, 1999): basic framework for IPUMS-International protocols
ECE cites IPUMS as “best practice”
“Entrusting census microdata and metadata for timely integration and dissemination via the IPUMS-EurAsia and IECM initiatives, 2010-2014,” ECE/CES/GE.41/2009/23
Dennis Trewin, ISI Special Task Force on Statistical Confidentiality & Microdata Access, describes IPUMS-International as “...a best practice for a data repository of international statistical data.”
• Suppress low-level geographic identifiers (usually < 20,000 persons).
• Swap a small percentage of cases between geographic areas.
• For recent censuses: recode cells representing very small numbers of persons in the population.
• Suppress categories or entire variables as requested by the NSO.
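The first two measures above can be sketched as follows. This is only an illustration under stated assumptions, not the IPUMS procedure: the record layout, the `geo` field, the function names and the rotation-based swap are all hypothetical.

```python
import random

THRESHOLD = 20_000  # persons; the slide gives "usually < 20,000"

def suppress_small_units(records, unit_population, threshold=THRESHOLD):
    """Return copies of records with the geographic code blanked for small units."""
    out = []
    for rec in records:
        rec = dict(rec)  # copy so the input is left untouched
        if unit_population[rec["geo"]] < threshold:
            rec["geo"] = None
        out.append(rec)
    return out

def swap_geography(records, fraction, seed=0):
    """Rotate geographic codes among a randomly chosen fraction of records."""
    rng = random.Random(seed)
    out = [dict(r) for r in records]
    k = int(len(out) * fraction)
    if k < 2:
        return out  # nothing to swap with
    chosen = rng.sample(range(len(out)), k)
    codes = [out[i]["geo"] for i in chosen]
    rotated = codes[1:] + codes[:1]
    for i, code in zip(chosen, rotated):
        out[i]["geo"] = code
    return out

# Invented example data.
unit_population = {"A": 150_000, "B": 8_500}
records = [{"id": i, "geo": "A" if i % 2 else "B"} for i in range(10)]

swapped = swap_geography(records, fraction=0.2)
masked = suppress_small_units(records, unit_population)
```

Swapping leaves the overall count of records per area unchanged in this sketch, because codes are only exchanged among the selected cases.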
Data & tools for smart research
IPUMS value-added features facilitate research
Available free on-line, easy to use, saves time
Harmonized variables with consistent coding
Time-series potential for most countries
Consistent geographic regions
Constructed family interrelationship variables
Comprehensive interactive online documentation
Pooled data from customizable extract system
SPSS, Stata or SAS syntax files
Great user support
IPUMS Customizable Data Extract System
[Diagram: variables such as Home Ownership, Relation to Head, Age, Marital Status and Occupation feed into a data extract]
Pooled Data Extracts
1. Select samples: Argentina 2001 (3.6 million), Chile 2002 (1.5 million), Cuba 2002 (1.1 million)
2. Select variables: Water supply, Sex, Education
3. Submit extract
Extract Engine output (sample, water, sex, education): 1 dataset, 3 censuses, 4 variables, 6.2 million records, harmonized codes
Selected United Nations MDG indicators using IPUMS-I data
Goal / Indicators / Target
Goal 1. Eradicate extreme poverty and hunger
  Indicator (non-official): Youth unemployment rate, aged 15-24, each sex and total
  Target 1.B. Achieve full and productive employment and decent work for all, including women and young people.
Goal 2. Achieve universal primary education
  Indicator 2.1: Net enrollment ratio in primary education
  Indicator 2.2: Proportion of pupils starting grade 1 who reach grade 5
  Indicator 2.3: Literacy rate of 15-24 year-olds
  Target 2. Ensure that, by 2015, children everywhere, boys and girls alike, will be able to complete a full course of primary schooling
Goal 3. Promote gender equality and empower women
  Indicator 3.1A: Ratio of girls to boys in primary, secondary, and tertiary education
  Indicator 3.1B: Ratio of literate women to men, 15-24 years old
  Indicator 3.2: Share of women in wage employment in the non-agricultural sector
  Target 3. Eliminate gender disparity in primary and secondary education preferably by 2005, and in all levels of education no later than 2015
Goal 5. Improve maternal health
  Indicator 5.4: Adolescent birth rate
  Target 5B. Achieve, by 2015, universal access to reproductive health.
Goal 7. Ensure environmental sustainability
  Indicator (non-official): Proportion of the population using solid fuels
  Target 7A. Integrate the principles of sustainable development into country policies and programs and reverse the loss of environmental resources
  Indicator 7.8: Proportion of population with sustainable access to an improved water source, urban and rural
  Indicator 7.9: Proportion of population with access to improved sanitation facility, urban and rural
  Target 7C. Halve, by 2015, the proportion of people without sustainable access to safe drinking water and basic sanitation
  Indicator (non-official): Proportion of households with access to secure tenure
  Target 7D. By 2020, to have achieved a significant improvement in the lives of at least 100 million slum dwellers
Goal 8. Develop a global partnership for development
  Indicator 8.14/8.15: Telephone lines and cellular subscribers per 100 population
  Indicator 8.16: Internet users and personal computers per 100 population
  Target 8F. In cooperation with the private sector, make available the benefits of new technologies, especially information and communications
Goal 2: Achieve universal primary education
2.3 Literacy rate of 15-24 year-olds: “...percentage of the population 15–24 years old who can both read and write with understanding a short simple statement on everyday life.” (United Nations, 2003)
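IPUMS-I operationalization:
IPUMS-I Integrated variables used: AGE & LIT
Formula:
\[
\text{Literacy rate (ages 15-24)} = \frac{\text{Literate persons ages 15-24 } (LIT = 2 \text{ and } AGE \ge 15 \;\&\; AGE \le 24)}{\text{Persons ages 15-24 } (AGE \ge 15 \;\&\; AGE \le 24)}
\]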
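The indicator can be computed from person records along these lines. This is a minimal sketch: the `AGE` and `LIT` names and the `LIT = 2` code follow the slide, while the `PERWT` person weight, the record layout and the example values are assumptions added for illustration.

```python
def youth_literacy_rate(persons):
    """Weighted percentage of persons aged 15-24 with LIT == 2.

    Each person is a dict with AGE, LIT and PERWT (assumed person weight).
    Returns None when no one falls in the 15-24 age range.
    """
    youth = [p for p in persons if 15 <= p["AGE"] <= 24]
    total = sum(p["PERWT"] for p in youth)
    if total == 0:
        return None
    literate = sum(p["PERWT"] for p in youth if p["LIT"] == 2)
    return 100.0 * literate / total

# Invented example records.
sample = [
    {"AGE": 16, "LIT": 2, "PERWT": 10.0},
    {"AGE": 20, "LIT": 1, "PERWT": 10.0},
    {"AGE": 24, "LIT": 2, "PERWT": 20.0},
    {"AGE": 30, "LIT": 2, "PERWT": 10.0},  # outside the age range, ignored
]
rate = youth_literacy_rate(sample)
print(rate)
```

As in the slide's formula, the denominator here is everyone aged 15-24. Whether records with unknown literacy should be excluded is a separate analytical choice that this sketch does not make.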
Developing Countries: Literacy rates (ages 15-24) in IPUMS samples
Source: Cuesta and Lovatón (2013)
Data Source: Original census data provided by national statistical offices of partner countries; data harmonized and distributed by IPUMS-International.
[Figure panels: Census 1991, Census 2000, Census 2010]
Source: Cuesta and Lovatón (2013)
Data Source: Original census data provided by IBGE, national statistical office of Brazil; data harmonized and distributed by IPUMS-International.
IPUMS User Statistics:
Registered IPUMS users by country of data requested
Rank  In the World (n=5,299)   In Europe (n=1,673)
1     Mexico 1,481             France 740
2     Brazil 1,269             Spain 702
3     United States 1,258      Greece 561
4     Colombia 944             United Kingdom 554
5     Argentina 831            Portugal 540
6     Chile 746                Austria 515
7     France 740               Romania 495
8     South Africa 737         Hungary 443
9     China 720                Italy 417
10    Kenya 704                Netherlands 391
11    Spain 702                Switzerland 347
12    Canada 701               Belarus 322
IPUMS User Statistics:
Registered IPUMS users by user country of residence
Rank  In the World (n=5,299)   In Europe (n=2,290)
1     United States 3,256      Spain 402
2     Spain 402                France 371
3     France 371               United Kingdom 264
4     United Kingdom 264       Germany 230
5     Germany 230              Italy 215
6     Italy 215                Switzerland 149
7     Brazil 159               Austria 147
8     Switzerland 149          Netherlands 133
9     Austria 147              Belgium 59
10    Netherlands 133          Romania 47
11    Canada 116               Hungary 46
12    China 113                Denmark 32
IPUMS User Statistics:
Top-ranked Institutions by number of users
In the World
1 University of Minnesota (201)
2 United Nations
3 Univ. Autonoma de Barcelona
4 The World Bank
5 University of Chicago
6 Harvard University
7 Panteion Univ. of Athens
8 University of Michigan
9 University of Washington
10 Arizona State University
11 Columbia University
12 Stanford University
In Europe
1 Univ. Auton. de Barcelona (164)
2 Panteion Univ. of Athens
3 INED - France
4 Vienna Institute of Demography
5 Paris School of Economics
6 Universite de Strasbourg
7 City University London
8 MPID - Germany
9 University of Oxford
10 University of Vienna
11 Bocconi University
11 University of Tubingen
13 University of Essex
13 University of Groningen
15 University of Geneva (28)
IPUMS User Statistics:
No. of extracts of European census samples by year of extract
Year   Europe   France   Ireland   Turkey
2003   84       84
2004   59       59
2005   101      101
2006   174      91
2007   774      123
2008   1,419    183
2009   1,794    197
2010   2,365    269
2011   4,169    562      186
2012   4,932    523      332       179
2013   4,339    367      278       298
Total  20,210   2,559    796       477
IPUMS User Statistics:
Number of extract requests that include European samples by country
In Europe (n=20,210)
1 France 2,717
2 Spain 2,305
3 Greece 1,925
4 Romania 1,680
5 Portugal 1,673
6 Austria 1,622
7 United Kingdom 1,559
8 Hungary 1,311
9 Italy 1,183
10 Switzerland 1,101
11 Netherlands 1,097
12 Belarus 1,032
13 Ireland 918
14 Slovenia 813
15 Germany 809
16 Turkey 654
Data Tabulator
Fast online analysis of sample data
Pooled samples
(across time and/or country)
International.ipums.org
Click on Analyze Data Online
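As a rough offline analogue of a pooled tabulation, the sketch below cross-tabulates weighted counts across several samples at once. It does not describe how the online tabulator is implemented. The sample names are borrowed from the extract example earlier in the deck, while the record values, the `weight` field and the function name are invented for illustration.

```python
from collections import defaultdict

def pooled_tabulation(records, row_var, col_var, weight_var="weight"):
    """Sum weights for every (row value, column value) pair across all records."""
    table = defaultdict(float)
    for rec in records:
        table[(rec[row_var], rec[col_var])] += rec[weight_var]
    return dict(table)

# Invented records pooled from three samples; sex uses one shared coding.
pooled = [
    {"sample": "Argentina 2001", "sex": 1, "weight": 10.0},
    {"sample": "Argentina 2001", "sex": 2, "weight": 10.0},
    {"sample": "Chile 2002", "sex": 2, "weight": 20.0},
    {"sample": "Chile 2002", "sex": 2, "weight": 20.0},
    {"sample": "Cuba 2002", "sex": 1, "weight": 10.0},
]
table = pooled_tabulation(pooled, "sample", "sex")
for key in sorted(table):
    print(key, table[key])
```

A single pass over all samples is only meaningful because the variable is coded the same way in each sample.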
Local Value-Added Web Portals
IECM
Barcelona, Spain
http://www.iecm-project.org/
African Integrated Census Microdata
Addis Ababa, Ethiopia
http://ecastats.uneca.org/aicmd/
Economic Research Forum
Cairo, Egypt
Under construction at
http://www.erf.org.eg
Geography Enhancements / GIS Boundary Files
Better geographic documentation
Pooling adjacent units (20,000 person threshold)
Creating integrated geo-statistical units
Collecting boundaries
- Digital and paper maps
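Pooling adjacent units up to the 20,000-person threshold can be illustrated with a simple greedy pass. This is a sketch under a strong simplifying assumption, namely that units are supplied in an order where neighbours in the list are geographically adjacent. Real pooling would rely on the boundary files, and the unit names and populations here are invented.

```python
def pool_units(units, threshold=20_000):
    """Greedily merge consecutive units until each group reaches the threshold.

    units: list of (name, population), ordered so list neighbours are adjacent
    (an assumption of this sketch). Returns a list of (names, population).
    """
    groups = []
    names, population = [], 0
    for name, pop in units:
        names.append(name)
        population += pop
        if population >= threshold:
            groups.append((names, population))
            names, population = [], 0
    if names:  # leftover units still below the threshold
        if groups:
            last_names, last_pop = groups.pop()
            groups.append((last_names + names, last_pop + population))
        else:
            groups.append((names, population))
    return groups

# Invented example units.
units = [("U1", 12_000), ("U2", 9_000), ("U3", 25_000), ("U4", 4_000)]
pooled_units = pool_units(units)
print(pooled_units)
```

Leftover units are folded into the preceding group so that no pooled unit falls below the threshold, except when the whole input does.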