Using Statistical data
Background
Focus is on visualization, but that is useless without data… and data is useless without an easy way to load it.
Background
[Figure labels: Data Providers, Loaded, Indicators Selected]
Background
• Data loading demo – Start off on a bright note
– Download PC-Axis from SCB
– Load directly into Statistics eXplorer or Mdim eXplorer
– http://www.scb.se/Pages/ListWide____259087.aspx
– http://www.ssd.scb.se/databaser/makro/visavar.asp?yp=duwird&xu=c5587001&lang=1&langdb=1&Fromwhere=S&omradekod=BE&huvudtabell=BefolkningNy&innehall=Folkmangd&prodid=BE0101&deltabell=K2&fromSok=&preskat=O
Background
• To make our tool useful, it needs to:
– Support the most common formats
– Combine data from different sources
– Load data in an intuitive way
• It should be easy to understand WHY data is loaded in a specific way
Formats
• Generic Formats
– Excel
– txt
– CSV
• Statistics Formats
– PC-Axis
– SDMX
Generic Formats
• Users are guided to use our structure
• Simpler to have special additions like categorical data and groupings
• Proper error management and feedback goes a long way
– Make sure the user knows what is wrong
• Limits the user to supported structures
• Their export format either needs specific support OR they need to edit their files
• Problematic to keep track of
Excel: Categorical
Example (figure labels: Categorical, Numerical)
Treemap (figure label: Numerical)
Color Map (figure labels: Categorical, Numerical)
Statistics Formats
• Strictly structured
• Has identifiable properties that can be used by our tools
– Dimensions
– Values
– Time
Statistics Formats
• Exported data can be used directly in tools that support the format
• No need for editing or changing databases as long as they support proper export mechanisms
• Potentially much simpler to update and manage
Common issues - Notation
• Contents
– Spatial
• Countries, Regions…
• Extra important if the tool uses a map
• Identified in different ways depending on the publisher, language and data set.
– region, country, geo, cou, location etc.
• Usage of codes and/or names differs as well
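A minimal sketch of how a loader might guess the spatial dimension from its name before falling back to a prompt. The alias list starts from the names mentioned above; the plural forms and the function name `find_spatial_dimension` are assumptions made for illustration, not part of eXplorer.

```python
# Names commonly used for the spatial dimension. The plural forms are an
# assumption added for this sketch; extend the set per publisher and language.
SPATIAL_ALIASES = {
    "region", "regions", "country", "countries", "geo", "cou", "location",
}


def find_spatial_dimension(dimension_names):
    """Return the first dimension whose name matches a known alias, else None."""
    for name in dimension_names:
        if name.strip().lower() in SPATIAL_ALIASES:
            return name
    return None


# A recognised name is picked up regardless of case.
print(find_spatial_dimension(["Time", "GEO", "sex"]))  # GEO

# A Finnish file ("alue" = region, "vuosi" = year) is not recognised,
# so the tool would have to ask the user.
print(find_spatial_dimension(["vuosi", "alue"]))  # None
```

A name-based guess like this only reduces how often the user is asked; it cannot replace the prompt, since any publisher or language can introduce a name that is not in the list.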
Common issues - Notation
• Contents
– Spatial
• Need to prompt the user to identify the spatial dimension
PC-Axis prompt in Statistics eXplorer, reading a Finnish-language PC-Axis file
SDMX Load interface in Statistics eXplorer, Loading fields for both files, along with
Common issues - Notation
• Contents
– Spatial
• Problems do exist for other formats as well, but there are fewer options
Prompt when reading an Excel file with data on both sheets and columns, where they couldn’t be correctly identified.
Common issues - Notation
• Contents
– Time
• 2012-05-31
• 05-31-2012
• Q2-2012
• 2012-Q2
• January, February
• Etc.
Our tools currently don’t care; they only assume the time labels can be sorted alphabetically.
Plans to use proper date standards exist, but there are many localization issues.
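To illustrate why alphabetical sorting is a fragile assumption, here is a small sketch that maps the notations listed above to a sortable (year, month, day) key. The helper name `time_sort_key` is hypothetical, the month-first reading of `05-31-2012` is an assumption, and only English month names are handled, which is exactly the localization problem.

```python
import re

MONTHS = ["january", "february", "march", "april", "may", "june", "july",
          "august", "september", "october", "november", "december"]


def time_sort_key(label):
    """Map a time label to a (year, month, day) tuple for chronological sorting."""
    text = label.strip()
    m = re.fullmatch(r"(\d{4})-(\d{2})-(\d{2})", text)   # 2012-05-31
    if m:
        return (int(m[1]), int(m[2]), int(m[3]))
    m = re.fullmatch(r"(\d{2})-(\d{2})-(\d{4})", text)   # 05-31-2012, month first (assumed)
    if m:
        return (int(m[3]), int(m[1]), int(m[2]))
    m = re.fullmatch(r"(\d{4})-Q([1-4])", text)          # 2012-Q2
    if m:
        return (int(m[1]), 3 * int(m[2]) - 2, 1)
    m = re.fullmatch(r"Q([1-4])-(\d{4})", text)          # Q2-2012
    if m:
        return (int(m[2]), 3 * int(m[1]) - 2, 1)
    if text.lower() in MONTHS:                           # month name without a year
        return (0, MONTHS.index(text.lower()) + 1, 1)
    raise ValueError("unrecognised time label: " + label)


labels = ["Q2-2012", "Q1-2013", "Q4-2011"]
print(sorted(labels))                      # ['Q1-2013', 'Q2-2012', 'Q4-2011']
print(sorted(labels, key=time_sort_key))   # ['Q4-2011', 'Q2-2012', 'Q1-2013']
```

Labels such as `2012-05-31` or `2012-Q2` happen to sort correctly as plain text, which is why the alphabetical assumption works for many files; `Q2-2012` and month names do not.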
Common issues - Notation
• Contents
– Dimensions
• Any number of value dimensions
– Gender: Men, Women
– Population: Age 0-14, Age 15-64, Age 65+
– Title and Description fields
Common issues – Notation - Example
• How the structure of PC-Axis is used in eXplorer:
– TITLE: Title of the file
– CONTENTS: Contents of the file
– STUB: dimensions
– HEADING: dimensions
– VALUES: Contains the content of dimensions
– DESCRIPTION: Description of the file
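The keywords above can be read with very little code. The sketch below is a simplified reader written for illustration only, not the parser used in eXplorer: it assumes the usual PC-Axis statement form `KEYWORD="a","b";` or `KEYWORD("dimension")="a","b";`, and it does not cope with semicolons inside quoted values. The sample text and the name `parse_px_metadata` are made up for the example.

```python
import re

# One statement: KEYWORD="a","b";  or  KEYWORD("dimension")="a","b";
STATEMENT = re.compile(r'([A-Z-]+)(?:\("([^"]*)"\))?=(.*?);', re.S)


def parse_px_metadata(text):
    """Return {keyword or (keyword, dimension): [values]} for a PC-Axis header."""
    meta = {}
    for keyword, dimension, raw in STATEMENT.findall(text):
        values = re.findall(r'"([^"]*)"', raw)  # every quoted value in the statement
        key = (keyword, dimension) if dimension else keyword
        meta[key] = values
    return meta


SAMPLE = '''
TITLE="Population numbers by gender";
CONTENTS="Population";
STUB="regions";
HEADING="gender","time";
VALUES("gender")="Men","Women";
VALUES("time")="2000","2001","2002";
VALUES("regions")="Norrköping","Linköping";
'''

meta = parse_px_metadata(SAMPLE)
# Dimensions come from STUB and HEADING; their members from VALUES(...).
for dimension in meta["STUB"] + meta["HEADING"]:
    print(dimension, meta[("VALUES", dimension)])
```

Because STUB and HEADING name the dimensions and each VALUES entry is keyed by one of those names, a mismatch between the two (for example a different spelling) is one of the first things a loader can detect and report.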
Common issues – Notation - Example
• Example
– TITLE: “Population numbers by gender”
– CONTENTS: “Population”
– STUB: “regions”
– HEADING: “gender”, “time”
– VALUES(“gender”) = “Men”, “Women”
– VALUES(“time”) = “2000”, “2001”, “2002”…
– VALUES(“regions”) = “Norrköping”, “Linköping”…
Name of the indicators would be:
Common issues - Notation- Example
• Example from SCB
– TITLE: “Statistics focused on sick leave numbers by region, time and value”
– CONTENTS: “Statistics focused on sick leave”
– STUB: “regions”, “variables”
– HEADING: “time”, “indicators”
– VALUES(“variables”) = “Total”, “Men”, “Women”
– VALUES(“indicators”) = “Sick leave, days”, “Percentage who contributes to sick leave, per cent”
Name of the indicators would be:
Common issues - Notation- Example
• Leaves work for the user, who must make sure their file has a structure that fits what we do.
• Being more flexible in the tool could help, but would make reading data more complex.
Common issues
• Usage of special characters
– ( )
– ;
– “ ”
– ‘ ’
• All cases have to be correctly identified
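As a small illustration of what "correctly identified" means in practice, the sketch below splits a value list on commas while ignoring commas, semicolons and parentheses that sit inside quotes. It is a hypothetical helper (`split_quoted_values`) that handles straight double quotes only; typographic or single quotes would need the same treatment.

```python
def split_quoted_values(raw, separator=","):
    """Split raw on separator, ignoring separators inside double quotes."""
    values, current, in_quotes = [], [], False
    for ch in raw:
        if ch == '"':
            in_quotes = not in_quotes           # enter or leave a quoted value
        elif ch == separator and not in_quotes:
            values.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
    if in_quotes:
        # Tell the user what is wrong instead of loading a broken value.
        raise ValueError("unbalanced double quote in: " + raw)
    values.append("".join(current).strip())
    return values


print(split_quoted_values('"Sick leave, days","Total (men; women)"'))
# ['Sick leave, days', 'Total (men; women)']
```

A naive `raw.split(",")` would cut "Sick leave, days" into two values, which is the kind of silent error that is hard for a user to trace back to the file.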
SDMX
• Our tools can read:
– SDMX-ML: XML-based format
– It needs two files:
• DSD: Data structure definition
• Data
– Location/regional dimension has to be identified
• We use an open source project, flex-cb, previously developed by ECB.
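The two-file idea can be sketched as follows. The XML below is a heavily simplified, namespace-free stand-in for SDMX-ML written for this example; real files use XML namespaces and a much richer structure, the values are made up, and nothing here reflects how flex-cb works. The function name `load_observations` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Simplified stand-in for a DSD: it only lists the dimension ids.
DSD_XML = """
<DataStructure id="DEMO">
  <Dimension id="REF_AREA"/>
  <Dimension id="SEX"/>
  <TimeDimension id="TIME_PERIOD"/>
</DataStructure>
"""

# Simplified stand-in for the data file; the numbers are illustrative only.
DATA_XML = """
<DataSet>
  <Series REF_AREA="SE" SEX="M">
    <Obs TIME_PERIOD="2010" OBS_VALUE="4.1"/>
    <Obs TIME_PERIOD="2011" OBS_VALUE="4.3"/>
  </Series>
  <Series REF_AREA="FI" SEX="M">
    <Obs TIME_PERIOD="2010" OBS_VALUE="3.9"/>
  </Series>
</DataSet>
"""


def load_observations(dsd_xml, data_xml, location_dimension):
    """Use the dimension ids from the DSD to read the series keys in the data."""
    dimensions = [d.get("id") for d in ET.fromstring(dsd_xml).iter("Dimension")]
    if location_dimension not in dimensions:
        raise ValueError("location dimension not in DSD: " + location_dimension)
    rows = []
    for series in ET.fromstring(data_xml).iter("Series"):
        key = {d: series.get(d) for d in dimensions}
        for obs in series.iter("Obs"):
            rows.append({**key,
                         "time": obs.get("TIME_PERIOD"),
                         "value": float(obs.get("OBS_VALUE"))})
    return rows


for row in load_observations(DSD_XML, DATA_XML, location_dimension="REF_AREA"):
    print(row)
```

The data file alone does not say which attributes are dimensions, which is why the DSD is required, and neither file says which dimension is the spatial one, which is why it has to be identified separately.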
SDMX
• OECD: DotStat integration
– eXplorer component viewer: Single view app.
– Integrated into the database
– Allows direct viewing of data in our graphs
User selects data → Query URL → OECD web service → SDMX data
SDMX
• Testing with SCB and Eurostat
– Evaluating usage of SDMX
• For regular users?
• What kind of files are suitable?
– Usually very large files, for database communication
– Finding bugs
• No two SDMX implementations seem to be the same
• Both in our reader and the export functionality
SDMX
• Often completely irrelevant to the normal user
• Extremely powerful for technical users
Web services
• Best way of acquiring data for normal users
• Format is irrelevant, black-box approach
Web services
• Standards?
• World DataBank uses its own API and data format
Wrapping up
• Most common format is Excel
– Statisticians don’t want a black box format
– Harder to detect errors in files
• PC-Axis used by a certain group of people
– They are usually experienced with PC-Axis editing.
• SDMX is only used by technical experts
– Used for data export and web services
– Quite heavily promoted
• From our point of view it’s hard to know the focus of it
Wrapping up
• Need more structure?
– Not at all! A flexible system will always be better
• Guidelines are important
– Usage of codes and structures
• Know your audience
– Make sure they have options on data structure, and that it is clear how to reach it.