SDMX technical standards
Data validation and other major enhancements
[Diagram: Statistical Institutions, each with self-regulated internal processes, linked by SDMX exchanges]
Statistical Data and Metadata eXchange
GSBPM phases originally involved
Current line of tendency:
Support to the whole process
Some outcome from SDMX TWG
Work Package «SDMX and other standards»
• To support the statistical processes, a language for
validations and calculations is needed
• On their own, some institutions (e.g. Bank of Italy,
Eurostat, UNESCO …) have adopted a language for this purpose and use it for internal processing and for exchanging validation and calculation rules with their reporting entities and correspondents
• For the same purpose, many other institutions are
willing to adopt a similar language
• Unless a standard language is introduced, such
languages would proliferate
• The SDMX community, and also the DDI and GSIM
ones, are interested in introducing and sharing a
The SDMX situation
Structural Validation – supported
– Assurance that the structure of the data observations matches the
Data Structure Definition, in terms of:
• Concepts used as Dimensions, Measures, Attributes
• their admissible values (Codelists and values' Constraints)
Validation of the Information Content – not yet supported
– Assurance that data give correct information about the real world, for example:
• Completeness / Integrity
• Accuracy / Plausibility
• Coherence
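A structural check of this kind can be sketched in a few lines. This is only an illustration: the dictionary-based Data Structure Definition, the concept names and the codes are hypothetical, and the sketch treats every concept as mandatory, which a real DSD need not do.

```python
# Hypothetical, simplified Data Structure Definition (DSD): each concept maps
# to its set of admissible codes, or to None when any value is accepted.
DSD = {
    "FREQ": {"A", "Q", "M"},         # dimension with a codelist
    "REF_AREA": {"IT", "FR", "DE"},  # dimension with a codelist
    "TIME_PERIOD": None,             # dimension, unconstrained in this sketch
    "OBS_VALUE": None,               # measure
    "OBS_STATUS": {"A", "E", "P"},   # attribute with a codelist
}


def structural_errors(observation, dsd):
    """Return a list of structural problems found in one observation."""
    errors = []
    # Assumption of this sketch: every concept of the DSD must be present.
    for concept in dsd:
        if concept not in observation:
            errors.append(f"missing concept: {concept}")
    # Every concept used must belong to the DSD and carry an admissible value.
    for concept, value in observation.items():
        if concept not in dsd:
            errors.append(f"unknown concept: {concept}")
            continue
        allowed = dsd[concept]
        if allowed is not None and value not in allowed:
            errors.append(f"inadmissible value for {concept}: {value!r}")
    return errors


good = {"FREQ": "A", "REF_AREA": "IT", "TIME_PERIOD": "2012",
        "OBS_VALUE": 10.5, "OBS_STATUS": "A"}
bad = {"FREQ": "W", "REF_AREA": "IT", "TIME_PERIOD": "2012",
       "OBS_VALUE": 10.5, "UNIT": "EUR"}

print(structural_errors(good, DSD))
print(structural_errors(bad, DSD))
```

The check says nothing about whether the values are plausible or coherent; that is the validation of the information content.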
Example of validation of the
information content: stocks vs. flows
Aggregate “A”: stock at reference period: t
Aggregate “B”: stock at reference period: (t-1)
Aggregate “C”: flows between (t-1) and t
Validation rule:
If A-B = C “ok” else “error”
The validation of the information contents is a kind of calculation
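The stock/flow rule is indeed a small calculation followed by a comparison. A minimal sketch, in which the function name and the `tolerance` parameter are assumptions added for illustration and not part of the rule itself:

```python
def check_stock_flow(stock_t, stock_prev, flow, tolerance=0.0):
    """Apply the rule 'if A - B = C then ok else error'.

    stock_t    -- aggregate A, stock at reference period t
    stock_prev -- aggregate B, stock at reference period t-1
    flow       -- aggregate C, flows between t-1 and t
    tolerance  -- hypothetical parameter, not in the rule, to absorb rounding
    """
    # The imbalance is the calculation; the status is the validation outcome.
    imbalance = (stock_t - stock_prev) - flow
    status = "ok" if abs(imbalance) <= tolerance else "error"
    return status, imbalance


print(check_stock_flow(120, 100, 20))
print(check_stock_flow(120, 100, 15))
```

Returning the imbalance along with the status keeps the calculated value available, for example to grade the severity of the error.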
The current initiative
A Work Package for introducing a standard
Validation and Transformation Language (VTL)
was launched in 2012 by the SDMX Secretariat
• The DDI community expressed interest in
developing and adopting the VTL
• Analogous interest was expressed by many
contributors to the GSIM standard
• A working group is in place, composed of members from the SDMX TWG and SWG and from the DDI and GSIM communities
Development priorities
For most institutions
validation is the priority
First VTL development:
• Support to the validation rules,
• Support to basic calculation capabilities
(as needed for the validation)
At a later stage:
• Improve the VTL to support more complex algorithms
First VTL development:
main goals
• Define and preserve validation rules
(document and preserve the validation know-how)
• Exchange and share validation rules
(with reporting institutions & other correspondents)
• Apply validation rules in the collection and
production processes (aiming at an industrialized
processing of statistical data)
First VTL development:
Implementation Plan
Main requirements: June 2013
Use cases of Validation:
– From single institutions: August 2013
– Use cases finalization: October 2013
Basic VTL features: January 2014
Operators (syntax, semantics): April 2014
Comments and finalization: July 2014
SDMX implementation: October 2014
How to implement a language
usable in different standards?
The problem
– A language manipulates the model artefacts to
produce other model artefacts (property of closure)
– A language for SDMX wouldn't fit DDI & GSIM, and
vice versa (artefacts are different)
The approach
– Build the VTL on an 'agnostic' information model,
made of the basic artefacts common to SDMX, DDI and GSIM (i.e. dimensional structures)
Main VTL requirements
• User orientation
• Integrated Approach
• IT implementation independence
• Active Role for processing
User Orientation
The VTL should be:
• declarative, so that users without IT skills are able to define calculations and validations autonomously (without the intermediation of IT experts)
• user friendly (users should define & understand expressions as much as possible intuitively)
• oriented to statistics, which is the user skill (the language should operate on statistical artefacts by means of
Integrated approach
The VTL should be:
• independent of the statistical domain of the data to be processed
• suitable for the various types of data of a
statistical environment (e.g. dimensional data, survey data, register data, micro and macro, quantitative and qualitative, …)
• independent of the phases of the statistical process
IT implementation independence
The VTL should:
• allow many different IT implementations
(for example in different organizations / institutions) and not be bound to a specific IT environment
• permit the use of heterogeneous IT tools in an
integrated IT solution (for example, combined use of tools like SQL, R, XML …)
• make users unaware of the IT solution as much as possible
• minimize impacts on users when the IT solution changes (for example following the adoption of
Active Role for Processing
The VTL should:
• be able to drive the validation & calculation software, and therefore be convertible into the languages of the IT tools used for validation and calculation (e.g. SQL, R, XML …)
• be described through a formal grammar (for example in Backus-Naur form), so that it can be easily parsed and processed
• generate results unambiguously interpretable by software and by statisticians (the results should in turn be artefacts of the information model)
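As an illustration of the "active role", the sketch below renders a declarative rule as SQL, one of the target tools named above, and runs it. The rule representation, the table and the column names are all hypothetical and are not VTL syntax. Because the rule text is pasted into the query, this only suits trusted rule definitions.

```python
import sqlite3

# Hypothetical declarative rule: it states WHAT to check, not how to run it.
rule = {
    "name": "stock_flow_coherence",
    "dataset": "balances",
    "condition": "stock_t - stock_prev = flow",
}


def rule_to_sql(rule):
    """Render the rule as a SQL query returning the rows that violate it."""
    return (
        f"SELECT ref_area FROM {rule['dataset']} "
        f"WHERE NOT ({rule['condition']}) ORDER BY ref_area"
    )


# Made-up data loaded into an in-memory database for the demonstration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE balances (ref_area TEXT, stock_t REAL, stock_prev REAL, flow REAL)"
)
conn.executemany(
    "INSERT INTO balances VALUES (?, ?, ?, ?)",
    [("IT", 120, 100, 20), ("FR", 80, 70, 15), ("DE", 50, 50, 0)],
)

violations = [row[0] for row in conn.execute(rule_to_sql(rule))]
print(violations)
```

The same rule could be rendered for another tool by adding a second rendering function, leaving the rule definition untouched.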
Extensible and Customizable
The VTL should allow:
• the incremental introduction of the operators
according to the evolution of the business needs (e.g. the operators for the validation first and the operators for the compilation and estimation at a later stage)
• the adoption of operators derived from other
languages (e.g. “SQL like” operators, time series
processing operators …)
• the possible customization for specific needs (e.g. if
some institutions need to extend the language for their own purposes)
VTL Governance
The VTL is intended to be:
• a standard language under a common governance, not controlled by any private party (such as an IT company)
• subject to appropriate governance rules aimed at ensuring its proper evolution (to be defined)
• able to evolve more dynamically than the SDMX versions (without affecting the information model)
• coordinated with possible extensions made by some institutions through proper rules (to be defined)
Some Functional Requirements (draft)
The VTL should allow:
• Operations on dimensions, mono- and multi-measure data, data attributes
• Aggregation according to hierarchical links
• Proper behaviour for missing data
• Historicity: possibility of handling the changes of the
artefacts and of the algorithms with reference to the time
• Persistency control: possibility of defining the persistency of the intermediate results
• Expressions chaining: possibility of having expressions as input operands of other expressions
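Three of these requirements (aggregation along hierarchical links, behaviour for missing data, expression chaining) can be sketched together. The hierarchy, the codes and the policy of propagating a missing value to the parent are assumptions chosen for illustration; the draft requirements do not prescribe them.

```python
# Hypothetical hierarchy: each parent code lists its child codes.
HIERARCHY = {"EU": ["IT", "FR", "DE"]}


def aggregate(data, hierarchy):
    """Sum children into their parent; a missing child makes the parent missing."""
    result = dict(data)
    for parent, children in hierarchy.items():
        values = [data.get(child) for child in children]
        # One possible policy for missing data: propagate None upwards.
        result[parent] = None if any(v is None for v in values) else sum(values)
    return result


def check_total(data, parent, expected):
    """Validation applied to the result of the aggregation."""
    total = data[parent]
    if total is None:
        return "not evaluable"
    return "ok" if total == expected else "error"


complete = {"IT": 10, "FR": 20, "DE": 30}
partial = {"IT": 10, "FR": None, "DE": 30}

# Expression chaining: the output of aggregate() is the input of check_total().
print(check_total(aggregate(complete, HIERARCHY), "EU", 60))
print(check_total(aggregate(partial, HIERARCHY), "EU", 60))
```

A different missing-data policy (for example, treating missing values as zero) would change only the marked line in `aggregate`.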
Some requirements about the operators (1)
Data retrieval and storage (e.g. get, put)
Projection (e.g. drop, keep …)
Filter (e.g. =, <, <=, >, >=, <>, like, between …)
Aggregation (e.g. sum, avg, min, max, first, last …)
Other manipulators of the data structure (e.g. rename …)
Join, Union, Partition
Algebraic and string manipulation (e.g. +, -, *, /)
Some requirements about the operators (2)
Logical (e.g. and, or, not …)
Validation, e.g.:
– Check of a generic condition
– Existence and referential integrity checks
– Completeness check
– Calculation of the imbalance
– Calculation of the error severity level
Conditional execution (e.g. case)
Currency conversion
Basic building block:
the Transformation
e.g. calculation of the Einstein equation E = MC²
Expression: E = M*(C**2)
Result: E
Operands: M, C, 2
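A Transformation can be pictured as a named result, a list of operands and an expression. The class below is a hypothetical model for illustration only. Unlike the slide, it writes the constant 2 inside the expression instead of passing it as a separate operand.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Transformation:
    """Hypothetical model: a result name, operand names and an expression."""
    result: str
    operands: Sequence[str]
    expression: Callable[..., float]

    def apply(self, values):
        """Evaluate the expression on the named operands, return {result: value}."""
        args = [values[name] for name in self.operands]
        return {self.result: self.expression(*args)}


# E = M * (C ** 2); the constant 2 is a literal inside the expression here.
energy = Transformation("E", ("M", "C"), lambda m, c: m * (c ** 2))

print(energy.apply({"M": 2.0, "C": 3.0}))
```

Because the result is returned under its own name, it can serve as an operand of a further Transformation.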
The transformations graph
[Diagram: a graph of nodes labelled C… and T…, grouped into Collection activity n.1, Collection activity n.2, Collection activity n.3, Analysis & research models, Statistical products and Publications]
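When transformations feed one another, they form a directed graph that has to be evaluated in dependency order. A minimal sketch using the standard-library `graphlib` module (Python 3.9+). The node names only echo the labels of the diagram; their meaning, the functions and the data are made up.

```python
from graphlib import TopologicalSorter

# Hypothetical transformations: result name -> (input names, function of inputs).
transformations = {
    "C3": (("C1", "C2"), lambda c1, c2: c1 + c2),
    "C4": (("C3",), lambda c3: c3 * 2),
    "C5": (("C3", "C4"), lambda c3, c4: c4 - c3),
}

collected = {"C1": 10, "C2": 5}  # made-up collected data

# Map each result to the nodes it depends on, then walk the graph in an
# order that guarantees inputs are computed before they are used.
graph = {result: set(inputs) for result, (inputs, _) in transformations.items()}
values = dict(collected)
for node in TopologicalSorter(graph).static_order():
    if node in transformations:
        inputs, func = transformations[node]
        values[node] = func(*(values[name] for name in inputs))

print(values)
```

`TopologicalSorter` raises `CycleError` if the transformations depend on each other circularly, which is a useful sanity check on the graph itself.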
Thank you for the attention