SDMX technical standards Data validation and other major enhancements

(1)


(2)

Statistical Data and Metadata eXchange

[Figure: Statistical Institutions with internal processes (self regulated), linked by SDMX exchanges]

(3)

GSBPM phases originally involved

(4)

Current trend:

Support for the whole process

(5)

Some outcomes from the SDMX TWG

Work Package «SDMX and other standards»

• To support the statistical processes, a language for validations and calculations is needed

• On their own initiative, some institutions (e.g. Bank of Italy, Eurostat, Unesco …) have adopted a language for this purpose and use it for internal processing and for exchanging validation and calculation rules with their reporting entities and correspondents

• Many other institutions are willing to adopt a similar language for the same purpose

• Unless a standard language is introduced, such languages will proliferate

• The SDMX community, as well as the DDI and GSIM communities, are interested in introducing and sharing a

(6)

The SDMX situation

Structural Validation – supported

– Assurance that the structure of the data observations matches the Data Structure Definition, in terms of:

• Concepts used as Dimensions, Measures, Attributes

• their admissible values (Codelists and Constraints on values)

Validation of the Information Content – not yet supported

– Assurance that data give correct information about the real world, for example:

• Completeness / Integrity

• Accuracy / Plausibility

• Coherence
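A minimal sketch of what the structural check amounts to, assuming a heavily simplified Data Structure Definition held as a Python dictionary. The concept names, code sets and the `structural_errors` function are illustrative assumptions, not part of any SDMX API:

```python
# Hypothetical, simplified Data Structure Definition: each concept maps to
# its set of admissible codes (None means "any value is accepted").
DSD = {
    "FREQ": {"A", "Q", "M"},
    "REF_AREA": {"IT", "FR", "DE"},
    "OBS_VALUE": None,
}


def structural_errors(observation: dict, dsd: dict = DSD) -> list[str]:
    """Return the structural problems found in one observation."""
    errors = []
    # Every concept required by the structure must be present.
    for concept in dsd:
        if concept not in observation:
            errors.append(f"missing concept: {concept}")
    # Every reported concept must be known and carry an admissible value.
    for concept, value in observation.items():
        if concept not in dsd:
            errors.append(f"unknown concept: {concept}")
        elif dsd[concept] is not None and value not in dsd[concept]:
            errors.append(f"inadmissible value for {concept}: {value}")
    return errors
```

An observation can pass this check and still be wrong about the real world, which is why validation of the information content is a separate concern.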

(7)

Example of validation of the information content: stocks vs. flows

Aggregate “A”: stock at reference period t

Aggregate “B”: stock at reference period (t-1)

Aggregate “C”: flows between (t-1) and t

Validation rule:

If A - B = C then “ok” else “error”

The validation of the information content is a kind of calculation
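The stock/flow rule can be sketched in a few lines of Python. The function name and the tolerance parameter are assumptions added for illustration (floating-point data rarely match exactly); the rule itself is the one stated on the slide:

```python
import math


def check_stock_flow(a: float, b: float, c: float, tol: float = 1e-9) -> str:
    """Apply the rule: if A - B = C then "ok" else "error".

    a: stock at reference period t
    b: stock at reference period t-1
    c: flows between t-1 and t
    tol: hypothetical absolute tolerance for the comparison
    """
    return "ok" if math.isclose(a - b, c, rel_tol=0.0, abs_tol=tol) else "error"
```

For example, `check_stock_flow(120.0, 100.0, 20.0)` gives `"ok"`, while a reported flow of 25.0 for the same stocks gives `"error"`.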

(8)

The current initiative

A Work Package for introducing a standard Validation and Transformation Language (VTL) was launched in 2012 by the SDMX Secretariat

• The DDI community expressed interest in developing and adopting the VTL

• Analogous interest was expressed by many contributors to the GSIM standard

• A working group is in place, composed of members from the SDMX TWG and SWG and from the DDI and GSIM communities

(9)

Development priorities

For most institutions validation is the priority

First VTL development:

• Support for the validation rules

• Support for basic calculation capabilities (as needed for the validation)

At a later stage:

• Improve the VTL to support more complex algorithms

(10)

Main goals

• Define and preserve validation rules (document and preserve the validation know-how)

• Exchange and share validation rules (with reporting institutions & other correspondents)

• Apply validation rules in the collection and production processes (aiming at an industrialized processing of statistical data)

(11)

Implementation Plan

• Main requirements: June 2013

• Use cases of Validation:

– From single institutions: August 2013

– Use cases finalization: October 2013

• Basic VTL features: January 2014

• Operators (syntax, semantic): April 2014

• Comments and finalization: July 2014

• SDMX implementation: October 2014

(12)

How to implement a language usable in different standards?

The problem

– A language manipulates the model artefacts to produce other model artefacts (property of closure)

– A language for SDMX wouldn't fit DDI & GSIM, and vice versa (artefacts are different)

The approach

– Build the VTL on an 'agnostic' information model, made of the basic artefacts common to SDMX, DDI and GSIM (i.e. dimensional structures)
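To make the closure property concrete, here is a small sketch assuming a generic dimensional structure. The `Dataset` class and the `scale` operator are hypothetical names chosen for illustration and are not taken from SDMX, DDI or GSIM. The point is only that an operator takes a model artefact and returns an artefact of the same kind, so operators can be composed:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dataset:
    """A generic dimensional structure (hypothetical, for illustration)."""
    identifiers: tuple[str, ...]   # dimension-like components
    measures: tuple[str, ...]      # measure-like components
    rows: tuple[dict, ...]         # one dict per data point


def scale(ds: Dataset, measure: str, factor: float) -> Dataset:
    """Multiply one measure by a factor; the result is again a Dataset."""
    new_rows = tuple({**row, measure: row[measure] * factor} for row in ds.rows)
    return Dataset(ds.identifiers, ds.measures, new_rows)
```

Because `scale` returns a `Dataset`, its output can be fed straight into another operator, e.g. `scale(scale(ds, "OBS_VALUE", 10), "OBS_VALUE", 0.5)`.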

(13)

Main VTL requirements

• User orientation

• Integrated Approach

• IT implementation independence

• Active Role for processing

(14)

User Orientation

The VTL should be:

• declarative, so that users without IT skills can define calculations and validations autonomously (without the intermediation of IT experts)

• user friendly (users should be able to define and understand expressions as intuitively as possible)

• oriented to statistics, which is the users' area of expertise (the language should operate on statistical artefacts by means of

(15)

Integrated approach

The VTL should be:

• independent of the statistical domain of the data to be processed

• suitable for the various types of data in a statistical environment (e.g. dimensional data, survey data, register data, micro and macro, quantitative and qualitative, …)

• independent of the phases of the statistical process

(16)

IT implementation independence

The VTL should:

• allow many different IT implementations (for example in different organizations / institutions) and not be bound to a specific IT environment

• permit the use of heterogeneous IT tools in an integrated IT solution (for example, combined use of tools like SQL, R, XML …)

• keep users as unaware as possible of the underlying IT solution

• minimize impacts on users when the IT solution changes (for example following the adoption of

(17)

Active Role for Processing

The VTL should:

• be able to drive the validation & calculation software, and therefore be convertible into the languages of the IT tools used for validation and calculation (e.g. SQL, R, XML …)

• be described through a formal grammar (for example in Backus-Naur form), so that it can be easily parsed and processed

• generate results unambiguously interpretable by software and by statisticians (the results should in turn be artefacts of the information model)
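As an illustration of these requirements, the sketch below parses a rule written against a toy grammar and converts it into an SQL statement. The grammar, the function names and the generated SQL shape are all assumptions made for this example; they are not the VTL grammar or any official translation:

```python
import re

# Toy grammar (an assumption for illustration, NOT the VTL grammar):
#   <rule> ::= <expr> "=" <expr>
#   <expr> ::= <term> { ("+" | "-") <term> }
#   <term> ::= IDENTIFIER | NUMBER
TOKEN = re.compile(r"\s*([A-Za-z_]\w*|\d+(?:\.\d+)?|[+\-=])")


def tokenize(text: str) -> list[str]:
    """Split a rule into identifiers, numbers and operator symbols."""
    tokens, pos = [], 0
    text = text.rstrip()
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None:
            raise SyntaxError(f"unexpected character at position {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def parse_term(tokens: list[str], pos: int) -> tuple[str, int]:
    """Parse <term>: a single identifier or number."""
    if pos >= len(tokens) or tokens[pos] in ("+", "-", "="):
        raise SyntaxError("identifier or number expected")
    return tokens[pos], pos + 1


def parse_expr(tokens: list[str], pos: int) -> tuple[str, int]:
    """Parse <expr> starting at pos; return (SQL text, next position)."""
    sql, pos = parse_term(tokens, pos)
    while pos < len(tokens) and tokens[pos] in ("+", "-"):
        operator = tokens[pos]
        right, pos = parse_term(tokens, pos + 1)
        sql = f"({sql} {operator} {right})"
    return sql, pos


def rule_to_sql(rule: str, table: str) -> str:
    """Convert a rule such as 'A - B = C' into an SQL query on `table`."""
    tokens = tokenize(rule)
    left, pos = parse_expr(tokens, 0)
    if pos >= len(tokens) or tokens[pos] != "=":
        raise SyntaxError('"=" expected')
    right, pos = parse_expr(tokens, pos + 1)
    if pos != len(tokens):
        raise SyntaxError("unexpected trailing input")
    return (f"SELECT *, CASE WHEN {left} = {right} "
            f"THEN 'ok' ELSE 'error' END AS outcome FROM {table}")
```

The same parsed rule could equally be rendered for another tool (R, for instance) by swapping the final formatting step, which is the sense in which the language stays independent of the IT implementation.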

(18)

Extensible and Customizable

The VTL should allow:

• the incremental introduction of the operators according to the evolution of the business needs (e.g. the operators for the validation first and the operators for the compilation and estimation at a later stage)

• the adoption of operators derived from other languages (e.g. “SQL like” operators, time series processing operators …)

• the possible customization for specific needs (e.g. if some institutions need to extend the language for their own purposes)

(19)

VTL Governance

The VTL is intended to be:

• a standard language under a common governance, not controlled by any private party (such as an IT company)

• subject to appropriate governance rules aimed at ensuring its proper evolution (to be defined)

• able to evolve more dynamically than the SDMX versions (without affecting the information model)

• coordinated with possible extensions made by some institutions through proper rules (to be defined)

(20)

Some Functional Requirements (draft)

The VTL should allow:

• Operations on dimensions, mono and multi-measure data, data attributes

• Aggregation according to hierarchical links

• Proper behaviour for missing data

• Historicity: possibility of handling changes to the artefacts and to the algorithms over time

• Persistency control: possibility of defining the persistency of the intermediate results

• Expression chaining: possibility of using expressions as input operands of other expressions
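Three of these requirements (aggregation along hierarchical links, behaviour for missing data, expression chaining) can be sketched together. The hierarchy, the function names and the policy "a missing operand makes the result missing" are assumptions chosen for the example; the draft requirements do not prescribe a particular policy:

```python
from typing import Optional

# Hypothetical hierarchy: each parent code lists its child codes.
HIERARCHY = {"EU_SAMPLE": ["IT", "FR", "DE"]}


def aggregate(values: dict[str, Optional[float]], parent: str,
              hierarchy: dict[str, list[str]] = HIERARCHY) -> Optional[float]:
    """Sum the children of `parent`; if any child is missing, so is the total."""
    children = [values.get(code) for code in hierarchy[parent]]
    if any(value is None for value in children):
        return None
    return sum(children)


def ratio(numerator: Optional[float],
          denominator: Optional[float]) -> Optional[float]:
    """Divide, propagating missing values and avoiding division by zero."""
    if numerator is None or denominator is None or denominator == 0:
        return None
    return numerator / denominator
```

Chaining then means that one expression is used as the operand of another, for example `ratio(data["IT"], aggregate(data, "EU_SAMPLE"))` to obtain a country's share of the aggregate.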

(21)

Some requirements about the operators (1)

Data retrieval and storage (e.g. get, put)

Projection (e.g. drop, keep …)

Filter (e.g. =, <, <=, >, >=, <>, like, between …)

Aggregation (e.g. sum, avg, min, max, first, last …)

Other manipulators of the data structure (e.g. rename …)

Join, Union, Partition

Algebraic and string manipulation (e.g. +, -, *, /)
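A rough sketch of three of these operator families (projection, filter, aggregation) applied to data points held as Python dictionaries. The function names `keep`, `filter_rows` and `sum_by` and the sample component names are assumptions for illustration only and do not reflect any agreed VTL syntax:

```python
from collections import defaultdict


def keep(rows: list[dict], components: list[str]) -> list[dict]:
    """Projection: retain only the listed components of every row."""
    return [{name: row[name] for name in components} for row in rows]


def filter_rows(rows: list[dict], predicate) -> list[dict]:
    """Filter: retain only the rows for which the predicate holds."""
    return [row for row in rows if predicate(row)]


def sum_by(rows: list[dict], group_by: list[str], measure: str) -> dict:
    """Aggregation: total of `measure` for each combination of `group_by`."""
    totals = defaultdict(float)
    for row in rows:
        key = tuple(row[name] for name in group_by)
        totals[key] += row[measure]
    return dict(totals)
```

Because each function takes and returns plain collections of rows, they can be combined, e.g. `sum_by(filter_rows(rows, lambda r: r["TIME"] >= 2013), ["REF_AREA"], "OBS_VALUE")`.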

(22)

Some requirements about the operators (2)

Logical (e.g. and, or, not …)

Validation, e.g.:

– Check of a generic condition

– Existence and referential integrity checks

– Completeness check

– Calculation of the imbalance

– Calculation of the error severity level

Conditional execution (e.g. case)

Currency conversion
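For the imbalance and severity operators, one possible reading is sketched below: the imbalance is the difference between a reported total and the sum of its components, and the severity level depends on the size of that difference. The function names and the threshold values are hypothetical; the slide does not define them:

```python
def imbalance(reported_total: float, components: list[float]) -> float:
    """Difference between a reported total and the sum of its components."""
    return reported_total - sum(components)


def severity(imbalance_value: float,
             warning_threshold: float = 1.0,
             error_threshold: float = 10.0) -> str:
    """Classify an imbalance by its absolute size (thresholds are assumed)."""
    size = abs(imbalance_value)
    if size >= error_threshold:
        return "error"
    if size >= warning_threshold:
        return "warning"
    return "ok"
```

Keeping the two steps separate lets an institution share the imbalance calculation while applying its own thresholds.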

(23)

Basic building block: the Transformation

e.g. calculation of the Einstein equation E = MC²

Expression: E = M*(C**2)

Operands: M, C, 2

Result: E
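The building block can be sketched as a small Python class holding the result name, the operand names and the expression. The `Transformation` class and its `apply` method are illustrative assumptions; here the constant 2 is written as a literal inside the expression rather than passed as a named operand:

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Transformation:
    """One expression that computes a named result from named operands."""
    result: str
    operands: tuple[str, ...]
    expression: Callable[..., float]

    def apply(self, env: dict[str, float]) -> dict[str, float]:
        """Evaluate the expression and return env extended with the result."""
        value = self.expression(*(env[name] for name in self.operands))
        return {**env, self.result: value}


# E = M*(C**2), with M and C as operands and E as the result.
energy = Transformation("E", ("M", "C"), lambda m, c: m * (c ** 2))
```

Calling `energy.apply({"M": 2.0, "C": 3.0})` returns the input values together with `E` equal to 18.0.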

(24)

The transformations graph

[Figure: graph in which transformations (T) link the data collections (C) coming from collection activities n.1, n.2 and n.3 to analysis & research models, statistical products and publications]
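A graph of transformations can be executed by ordering the transformations so that each one runs after the data it depends on. The sketch below uses the standard-library `graphlib.TopologicalSorter` (Python 3.9+). The node names and functions are invented for the example and do not reproduce the actual links in the figure:

```python
from graphlib import TopologicalSorter

# Hypothetical graph: each result maps to (operand names, function).
transformations = {
    "C11": (("C1", "C2"), lambda a, b: a + b),
    "C12": (("C3",), lambda a: a * 2),
    "C21": (("C11", "C12"), lambda a, b: a - b),
}


def run(collected: dict, graph: dict) -> dict:
    """Evaluate every transformation after the data it depends on."""
    dependencies = {result: set(operands)
                    for result, (operands, _) in graph.items()}
    data = dict(collected)
    for name in TopologicalSorter(dependencies).static_order():
        if name in graph:  # collected inputs have no transformation to run
            operands, function = graph[name]
            data[name] = function(*(data[operand] for operand in operands))
    return data
```

With collected inputs `{"C1": 1, "C2": 2, "C3": 4}`, the run computes `C11` and `C12` before `C21`, whatever order the transformations were declared in.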

(25)


Thank you for your attention