USE OF GEOSPATIAL
AND WEB DATA FOR
OECD STATISTICS
CCSA SPECIAL SESSION ON SHOWCASING BIG DATA
1 OCTOBER 2015
Paul Schreyer
• OECD:
– Facilitator of discussion on new data sources for NSOs
– OECD’s own use of new data sources
• From Big Data to Smart Data
– Not every new data source is Big
Business value analysis: why are we working on this?
• More granularity or coverage of existing data (e.g. spatial disaggregation)
• New output (e.g. measuring trust, inequalities)
• Greater timeliness – nowcasting
• Increased impact – analysis supporting OECD mission, possibility to link areas
• Increased responsiveness – capacity to address new topics quickly, respond to what-if questions
Business process analysis: Necessary capabilities
– Capacity to identify, evaluate and access new data sources
– Command of methodology
– Proven quality and metadata frameworks
– Suitable IT infrastructures
– Established legal and ethical frameworks
– Skills and training capacity
Techniques: web crawling and web scraping; content analysis; mobility studies; sensor and geospatial data
Examples of OECD projects:
* Online real estate prices (OECD GOV)
* Measuring trade restrictiveness by scraping and analysing trade laws (OECD TAD)
* African Economic Outlook (AEO): Civil tensions and political governance indicators (OECD DEV)
* Big Data Measures of Human Well-Being – Evidence from US Google Index (OECD STD)
* Measure transport reliability from geolocalisation logs (ITF)
* Air quality and land cover data (OECD GOV)
* Enriching the metropolitan database using geo-spatial data (OECD GOV)
* PIAAC log file data (OECD EDU)
EXAMPLE 1 ENVIRONMENTAL
INDICATORS
Using geospatial data
(satellite data)
– Where air pollution is above recommended levels
– Where improvements in air quality have happened
– Linking air pollution to health
Average population exposure to air
pollution (PM2.5)
Source: Raster (satellite observations)
Ground-based stations
Advantages:
• Direct measures
• Offer regular levels of air pollution over time
• More pollutants are available
Disadvantages:
• Low coverage in developing countries
• Uneven coverage within and across countries
• PM2.5 concentration rarely monitored
• Site selection, measurement techniques, and reporting methods differ across regions and countries
Satellite observations
Advantages:
• Global coverage
• Consistent method to compute air pollution in cities, regions and countries
• Consistent time-series data, spanning more than a decade
Disadvantages:
• Modelled data
• Satellite observations are less precise for bright surfaces (snow or desert)
• Current data are on a multi-year average; evaluation of short-term events is often unavailable
Satellite observations
• Raster: van Donkelaar et al. (2014)
• Resolution: ~10 km²
1. The satellite-based values of air pollution are multiplied by the population living in the area (using a 1 km² resolution grid).
2. The exposure to air pollution in a region is given by the sum of the population-weighted values of PM2.5 in the 1 km² grid cells falling within the boundaries of the region.
3. Finally, dividing this aggregated value by the total population in the region gives the average exposure to PM2.5 concentration in the region.
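The three steps above amount to a population-weighted average of PM2.5 over the grid cells of each region. The following is a minimal sketch of that calculation, assuming each 1 km² cell has already been assigned a region, a population count and a PM2.5 value; the function name, the input layout and the sample figures are illustrative assumptions, not OECD data or code.

```python
from collections import defaultdict

def average_exposure(cells):
    """Population-weighted mean PM2.5 per region.

    cells: iterable of (region, population, pm25) tuples, one per grid cell
    (hypothetical layout chosen for this sketch).
    """
    weighted = defaultdict(float)    # steps 1-2: sum of population * PM2.5
    population = defaultdict(float)  # total population per region
    for region, pop, pm25 in cells:
        weighted[region] += pop * pm25
        population[region] += pop
    # step 3: divide the aggregated value by the region's total population
    return {r: weighted[r] / population[r] for r in weighted if population[r] > 0}

# Made-up cells for two hypothetical regions "A" and "B"
cells = [
    ("A", 1000, 20.0),
    ("A", 3000, 10.0),
    ("B", 500, 8.0),
    ("B", 0, 30.0),   # an unpopulated cell does not affect exposure
]
exposure = average_exposure(cells)
print(exposure)
```

Because cells are weighted by population, a heavily polluted but uninhabited cell (the last one in region "B") leaves the regional exposure unchanged.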
• 68% of the urban population in OECD countries (376 million people) are exposed to pollution above the WHO’s recommended levels.
• OECD estimates show wide variation in PM2.5 exposure levels across cities within countries, the largest in Mexico, Italy, Japan and Korea
Levels and trends in OECD cities
[Figure: PM2.5 exposure by country, showing metropolitan minimum, country average and metropolitan maximum. Axis: country (no. of cities), from Mexico (33) to Ireland (1).]
Other example: raster sources used for land cover
• Europe: Corine land cover; resolution 25 metres; years 2000-06; classification of urban land: 44 land cover classes
• USA: National land cover dataset (NLCD); resolution 30 metres; years 2001-06; classification of urban land: 21 land cover classes
• Japan: Japan National Land Service Information data; resolution 100 metres; years 1997-2006; classification of urban land: 11 land cover classes
• World: MODIS 500 Map of Global Urban Extent; resolution 500 metres; year 2008; classification of urban land: 17 land cover classes
…feeds into the OECD Regional Well-Being
Database
Links: Regional Well-Being database Regional Well-Being web tool
EXAMPLE 2 TRADE POLICY
ANALYSIS
Using qualitative data from
government websites
Basic idea
Traditionally:
• Policy questionnaires to countries
• ‘Manual’ screening of government websites
New:
• Machine-based monitoring of government websites
• Automatic check for changes to, or additions of, rules and regulations
Test case: qualitative information for the OECD’s trade
restrictiveness information and index
Text comparison - Initial discovery
Run a text comparison between the original document and the new updated document
Detect and flag specific paragraphs changed or updated inside long documents
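A paragraph-level comparison of this kind can be sketched with Python's standard `difflib` module. This is only an illustration under stated assumptions: paragraphs are taken to be separated by blank lines, the function name is hypothetical, and the two sample texts are invented rather than taken from any real regulation.

```python
import difflib

def changed_paragraphs(old_text, new_text):
    """Return (tag, old_paragraphs, new_paragraphs) for every changed span."""
    # Assumption: paragraphs are separated by blank lines
    old = [p.strip() for p in old_text.split("\n\n") if p.strip()]
    new = [p.strip() for p in new_text.split("\n\n") if p.strip()]
    matcher = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
    flags = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":  # keep only replaced, deleted or inserted paragraphs
            flags.append((tag, old[i1:i2], new[j1:j2]))
    return flags

# Invented example documents
old_doc = ("Art. 1 Foreign equity is capped at 49%.\n\n"
           "Art. 2 Licences are issued by the ministry.")
new_doc = ("Art. 1 Foreign equity is capped at 25%.\n\n"
           "Art. 2 Licences are issued by the ministry.\n\n"
           "Art. 3 Board members must be residents.")

flags = changed_paragraphs(old_doc, new_doc)
for tag, before, after in flags:
    print(tag, before, after)
```

Comparing whole paragraphs rather than characters keeps the output short for long documents: only the amended article and the newly added one are flagged for review.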
Text comparison - Advanced discovery
Changes in rules and regulations can also happen through new pages
Use ‘big data’ techniques to compare in-house structured information to the universe of laws and regulations in a given country.
Work on text definitions similar to the original ones to help identify potentially relevant documents.
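One simple way to surface potentially relevant documents is to score every candidate text against an in-house definition and rank the results. The sketch below uses cosine similarity on word counts as a stand-in; the slides do not say which similarity measure was used, and the definition text, document names and contents are invented for illustration.

```python
import math
import re
from collections import Counter

def tokens(text):
    # Lower-case word tokens; \w also matches accented letters
    return re.findall(r"\w+", text.lower())

def cosine(a, b):
    """Cosine similarity between the word-count vectors of two texts."""
    va, vb = Counter(tokens(a)), Counter(tokens(b))
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0

def rank(definition, documents):
    """Sort (score, name) pairs from most to least similar."""
    scored = [(cosine(definition, text), name) for name, text in documents.items()]
    return sorted(scored, reverse=True)

# Hypothetical in-house definition and candidate documents
definition = "foreign equity limit in telecommunications services"
documents = {
    "telecom_act": "limit on foreign equity in telecommunications operators",
    "fishing_rules": "seasonal quotas for coastal fishing vessels",
}
ranked = rank(definition, documents)
print(ranked)
```

A production setup would likely add stop-word removal, stemming and language-specific dictionaries; plain word counts are used here only to keep the example self-contained.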
Web-crawling: scripts to systematically scan
governmental websites where regulations can be found (federal, provincial, regional, etc.).
Web-scraping: scripts to extract the relevant
information in documents, possibly based on articles and paragraphs (text analysis).
Document conversion: most laws and regulations are
in PDF, but some may be in other formats; all need to be converted to text documents before text analysis can be run.
Text comparison: tools and dictionaries to compare
the text of updated documents with the original text and to calculate similarity coefficients with other documents, in a variety of languages, with the option to also use the proximity of similar words.
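The web-crawling step listed above can be sketched as a breadth-first walk over links that stays on one government host. To keep the example runnable offline, pages are served from an in-memory dictionary through a `fetch` callable; the URLs, page contents and function names are hypothetical, and a real crawler would also need to fetch over HTTP, respect robots.txt and rate limits, and handle errors.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class LinkParser(HTMLParser):
    """Collect the href values of all <a> tags on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(start_url, fetch, max_pages=100):
    """Breadth-first crawl restricted to the host of start_url.

    fetch(url) returns the page HTML, or None if the page is unavailable.
    """
    host = urlparse(start_url).netloc
    queue, seen, visited = deque([start_url]), {start_url}, []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue  # skip pages that could not be retrieved
        visited.append(url)
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            link = urljoin(url, href)  # resolve relative links
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append(link)
    return visited

# Hypothetical two-page site standing in for a government website
PAGES = {
    "https://gov.example/laws/":
        '<a href="act-1.html">Act 1</a> <a href="https://other.example/">x</a>',
    "https://gov.example/laws/act-1.html":
        '<a href="/laws/">back</a> <a href="act-2.pdf">Act 2</a>',
}
visited = crawl("https://gov.example/laws/", PAGES.get)
print(visited)
```

Passing `fetch` as a parameter separates link discovery from retrieval, so the same loop could later be pointed at a real HTTP client, with the scraping and document-conversion steps applied to each page it returns.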
Promising results on French legal texts (Legifrance)