SDMX technical standards Data validation and other major enhancements

(1)

SDMX technical standards

Data validation and other major enhancements

(2)

[Diagram: Statistical Institutions, each with self-regulated internal processes, connected through SDMX exchanges]

SDMX: Statistical Data and Metadata eXchange

(3)

GSBPM phases originally involved

(4)

Current trend:

Support for the whole process

(5)

Some outcomes from the SDMX TWG

Work Package «SDMX and other standards»

• To support the statistical processes, a language for validations and calculations is needed

• Some institutions (e.g. Bank of Italy, Eurostat, Unesco …) have adopted such a language on their own initiative and use it for internal processing and for exchanging validation and calculation rules with their reporting entities and correspondents

• Many other institutions are willing to adopt a similar language for the same purpose

• Unless a standard language is introduced, languages of this kind will proliferate

• The SDMX community, as well as the DDI and GSIM communities, are interested in introducing and sharing a

(6)

The SDMX situation

Structural Validation – supported

– Assurance that the structure of the data observations matches the Data Structure Definition, in terms of:

• Concepts used as Dimensions, Measures, Attributes

• their admissible values (Codelists and values' Constraints)

Validation of the Information Content – not yet supported

– Assurance that data give correct information about the real world, for example:

• Completeness / Integrity

• Accuracy / Plausibility

• Coherence

(7)

Example of validation of the information content: stocks vs. flows

Aggregate "A": stock at reference period t

Aggregate "B": stock at reference period (t-1)

Aggregate "C": flows between (t-1) and t

Validation rule:

If A - B = C then "ok" else "error"

The validation of the information content is a kind of calculation
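The stock/flow rule above can be sketched in a few lines of Python. This is only an illustration of the rule as stated on the slide, not VTL syntax; the function name and the `tolerance` parameter (useful for rounded figures) are assumptions introduced here.

```python
def check_stock_flow(stock_t, stock_prev, flow, tolerance=0.0):
    """Apply the rule 'if A - B = C then ok else error'.

    stock_t    -- aggregate A: stock at reference period t
    stock_prev -- aggregate B: stock at reference period t-1
    flow       -- aggregate C: flows between t-1 and t
    tolerance  -- hypothetical allowance for rounding differences
    """
    difference = (stock_t - stock_prev) - flow
    return "ok" if abs(difference) <= tolerance else "error"


print(check_stock_flow(120.0, 100.0, 20.0))  # stocks and flows agree
print(check_stock_flow(120.0, 100.0, 25.0))  # flows overstate the change
```

The check is itself a calculation (a subtraction followed by a comparison), which is why a validation language also needs basic calculation capabilities.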

(8)

The current initiative

A Work Package for introducing a standard Validation and Transformation Language (VTL) was launched in 2012 by the SDMX Secretariat

• The DDI community expressed interest in developing and adopting the VTL

• Analogous interest was expressed by many contributors to the GSIM standard

• A working group is in place, composed of members from the SDMX TWG and SWG and from the DDI and GSIM communities

(9)

Development priorities

For most institutions validation is the priority

First VTL development:

• Support for the validation rules

• Support for basic calculation capabilities (as needed for the validation)

At a later stage:

• Improve the VTL to support more complex algorithms

(10)

First VTL development:

main goals

Define and preserve validation rules

(document and preserve the validation know-how)

Exchange and share validation rules

(with reporting institutions & other correspondents)

Apply validation rules

in the collection and production processes (aiming at an industrialized processing of statistical data)

(11)

First VTL development:

Implementation Plan

Main requirements: June 2013

Use cases of Validation:

– From single institutions: August 2013

– Use cases finalization: October 2013

Basic VTL features: January 2014

Operators (syntax, semantic): April 2014

Comments and finalization: July 2014

SDMX implementation: October 2014

(12)

How to implement a language usable in different standards?

The problem

– A language manipulates the model artefacts to produce other model artefacts (property of closure)

– A language for SDMX wouldn't fit DDI & GSIM, and vice versa (artefacts are different)

The approach

– Build the VTL on an 'agnostic' information model, made of the basic artefacts common to SDMX, DDI and GSIM (i.e. dimensional structures)
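A minimal sketch of what a standard-neutral dimensional structure and the closure property could look like. The class and field names (`DataStructure`, `DataSet`, `identifiers`, `measures`, `attributes`) and the sample values are assumptions made for illustration; they are not taken from the VTL, SDMX, DDI or GSIM specifications.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class DataStructure:
    """Hypothetical dimensional structure: names only, no standard-specific detail."""
    identifiers: tuple      # dimensions that identify a data point
    measures: tuple         # observed values
    attributes: tuple = ()  # qualifying information


@dataclass
class DataSet:
    structure: DataStructure
    # maps a tuple of identifier values to a dict of measure values
    points: dict = field(default_factory=dict)


def scale(dataset, measure, factor):
    """Example operator: takes a DataSet and returns a DataSet (closure)."""
    new_points = {
        key: {**values, measure: values[measure] * factor}
        for key, values in dataset.points.items()
    }
    return DataSet(dataset.structure, new_points)


structure = DataStructure(identifiers=("ref_area", "time"), measures=("obs_value",))
stocks = DataSet(structure, {("IT", "2013"): {"obs_value": 10.0}})
stocks_in_thousands = scale(stocks, "obs_value", 1000.0)
```

Because `scale` returns the same kind of artefact it receives, its output can feed any other operator, which is the point of the closure property.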

(13)

Main VTL requirements

• User orientation

• Integrated Approach

• IT implementation independence

• Active Role for processing

(14)

User Orientation

The VTL should be:

declarative, so that users without IT skills can define calculations and validations autonomously (without the intermediation of IT experts)

user friendly (users should define and understand expressions as intuitively as possible)

oriented to statistics, which is the users' area of expertise (the language should operate on statistical artefacts by means of

(15)

Integrated approach

The VTL should be:

independent of the statistical domain of the data to be processed

suitable for the various typologies of data of a statistical environment (e.g. dimensional data, survey data, register data, micro and macro, quantitative and qualitative, …)

independent of the phases of the statistical process

(16)

IT implementation independence

The VTL should:

allow many different IT implementations (for example in different organizations / institutions) and not be bound to a specific IT environment

permit the use of heterogeneous IT tools in an integrated IT solution (for example, combined use of tools like SQL, R, XML …)

make users unaware of the IT solution as much as possible

minimize impacts on users when the IT solution changes (for example following the adoption of

(17)

Active Role for Processing

The VTL should:

be able to drive the validation & calculation software, and therefore be convertible into the languages of the IT tools used for validation and calculation (e.g. SQL, R, XML …)

be described through a formal grammar, to be easily parsed and processed (for example in Backus-Naur form)

generate results unambiguously interpretable by software and by statisticians (the results should be artefacts of the information model in their turn)
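As an illustration of a rule driving the processing software, the sketch below turns a declarative equality rule into SQL and runs it with Python's built-in `sqlite3` module. The table name `balance`, its columns and the `rule_to_sql` helper are hypothetical; a real translator would work from the parsed grammar of the language rather than from plain strings.

```python
import sqlite3


def rule_to_sql(table, left, right):
    """Translate the rule 'left = right' into a query returning the violating rows.

    The expressions are assumed to come from a trusted rule repository,
    so they are interpolated directly into the SQL text.
    """
    return (
        f"SELECT ref_area FROM {table} "
        f"WHERE NOT ({left} = {right}) ORDER BY ref_area"
    )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE balance (ref_area TEXT, a REAL, b REAL, c REAL)")
conn.executemany(
    "INSERT INTO balance VALUES (?, ?, ?, ?)",
    [("IT", 120, 100, 20), ("FR", 80, 70, 15)],
)

sql = rule_to_sql("balance", "a - b", "c")
violations = [row[0] for row in conn.execute(sql)]
print(violations)
```

The same rule text could be translated for another tool (for example R) without the user rewriting it, which is what keeps users unaware of the underlying IT solution.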

(18)

Extensible and Customizable

The VTL should allow:

• the incremental introduction of the operators according to the evolution of the business needs (e.g. the operators for the validation first and the operators for the compilation and estimation at a later stage)

• the adoption of operators derived from other languages (e.g. "SQL like" operators, time series processing operators …)

• the possible customization for specific needs (e.g. if some institutions need to extend the language for their own purposes)

(19)

VTL Governance

The VTL is intended to be:

a standard language under a common governance, not controlled by any private party (such as an IT company)

subject to appropriate governance rules aimed to ensure its proper evolution (to be defined)

able to evolve more dynamically than the SDMX versions (without affecting the information model)

coordinated with possible extensions made by some institutions through proper rules (to be defined)

(20)

Some Functional Requirements (draft)

The VTL should allow:

Operations on dimensions, mono and multi-measure data, data attributes

Aggregation according to hierarchical links

Proper behaviour for missing data

Historicity: possibility of handling the changes of the artefacts and of the algorithms over time

Persistency control: possibility of defining the persistency of the intermediate results

Expressions chaining: possibility of having expressions as input operands of other expressions
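Two of these requirements, aggregation along hierarchical links and behaviour for missing data, can be sketched together. The hierarchy, the values and the `skip_missing` policy switch are assumptions for illustration; the slides only ask for "proper behaviour" and do not prescribe a policy.

```python
# Hypothetical hierarchy: a parent code and the child codes it aggregates.
hierarchy = {"TOTAL": ["IT", "FR", "DE"]}

# None marks a missing observation.
values = {"IT": 10.0, "FR": 5.0, "DE": None}


def aggregate(parent, hierarchy, values, skip_missing=False):
    """Sum the children of `parent`.

    By default a missing child makes the aggregate missing as well;
    with skip_missing=True the missing children are ignored instead.
    """
    children = [values.get(code) for code in hierarchy[parent]]
    if any(value is None for value in children):
        if not skip_missing:
            return None
        children = [value for value in children if value is not None]
    return sum(children)


strict_total = aggregate("TOTAL", hierarchy, values)
partial_total = aggregate("TOTAL", hierarchy, values, skip_missing=True)
```

Making the policy explicit matters because a silent partial sum would look like a valid total to a downstream validation rule.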

(21)

Some requirements about the operators (1)

Data retrieval and storage (e.g. get, put)

Projection (e.g. drop, keep …)

Filter (e.g. =, <, <=, >, >=, <>, like, between …)

Aggregation (e.g. sum, avg, min, max, first, last …)

Other manipulators of the data structure (e.g. rename …)

Join, Union, Partition

Algebraic and string manipulation (e.g. +, -, *, /)
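To give a feel for how such operators combine, here is a sketch of filter, keep, rename and sum implemented over plain lists of dictionaries. The function names mirror the operator names listed above, but their signatures and the sample data are assumptions, not VTL definitions.

```python
from collections import defaultdict

rows = [
    {"ref_area": "IT", "time": "2013", "obs_value": 10.0},
    {"ref_area": "IT", "time": "2014", "obs_value": 12.0},
    {"ref_area": "FR", "time": "2014", "obs_value": 7.0},
]


def filter_rows(rows, predicate):
    """Filter: keep the rows for which the predicate holds."""
    return [row for row in rows if predicate(row)]


def keep(rows, *columns):
    """Projection: retain only the named columns."""
    return [{name: row[name] for name in columns} for row in rows]


def rename(rows, old, new):
    """Structure manipulation: rename one column."""
    return [{(new if k == old else k): v for k, v in row.items()} for row in rows]


def sum_by(rows, key, measure):
    """Aggregation: total of `measure` for each value of `key`."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[key]] += row[measure]
    return dict(totals)


# Each operator returns rows again, so calls can be chained.
recent = filter_rows(rows, lambda row: row["time"] == "2014")
slim = rename(keep(recent, "ref_area", "obs_value"), "obs_value", "value")
totals = sum_by(slim, "ref_area", "value")
```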

(22)

Some requirements about the operators (2)

Logical (e.g. and, or, not …)

Validation, e.g.:

– Check of a generic condition

– Existence and referential integrity checks

– Completeness check

– Calculation of the imbalance

– Calculation of the error severity level

Conditional execution (e.g. case)

Currency conversion
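Some of the validation operators can be sketched as small functions: the imbalance of a stock/flow rule, a severity level derived from it, and a referential integrity check against a codelist. The names, the single `warning_limit` threshold and the three severity labels are assumptions chosen for the example.

```python
def imbalance(stock_t, stock_prev, flow):
    """Signed amount by which the rule A - B = C fails (0 when it holds)."""
    return (stock_t - stock_prev) - flow


def severity(imbalance_value, warning_limit=1.0):
    """Classify an imbalance; the threshold is a hypothetical parameter."""
    size = abs(imbalance_value)
    if size == 0:
        return "none"
    return "warning" if size <= warning_limit else "error"


def missing_codes(reported, codelist):
    """Referential integrity: reported codes that are absent from the codelist."""
    return sorted(set(reported) - set(codelist))


gap = imbalance(120.0, 100.0, 25.0)
level = severity(gap)
unknown = missing_codes(["IT", "XX", "FR"], ["IT", "FR", "DE"])
```

Returning the imbalance and its severity as data, rather than only a pass/fail flag, lets the result be stored and processed like any other artefact.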

(23)

Basic building block: the Transformation

e.g. calculation of the Einstein equation E = MC²

Expression: E = M*(C**2)

Operands: M, C, 2

Result: E
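The Transformation building block can be sketched as a small data structure that names its result, its operands and its expression. The `Transformation` class and its `apply` method are illustrative assumptions; the constant operand 2 is written directly into the expression here.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Transformation:
    result: str                       # name of the artefact produced
    operands: tuple                   # names of the input artefacts
    expression: Callable[..., float]  # how the result is calculated

    def apply(self, environment):
        """Evaluate the expression and return the environment plus the result."""
        arguments = [environment[name] for name in self.operands]
        return {**environment, self.result: self.expression(*arguments)}


# E = M*(C**2), with arbitrary sample values for M and C.
energy = Transformation("E", ("M", "C"), lambda m, c: m * (c ** 2))
environment = energy.apply({"M": 2.0, "C": 3.0})
```

Because the result is added to the same environment that holds the operands, it can serve as an operand of a later transformation.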

(24)

The transformations graph

[Figure: graph in which transformations (T1, T2, …) connect data nodes (C1, C2, …) across Collection activities n.1, n.2 and n.3, Analysis & research models, Statistical products and Publications]
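A graph of transformations implies an execution order: a transformation can run only after the data it reads have been produced. The sketch below derives such an order with `graphlib.TopologicalSorter` from the Python standard library (Python 3.9 or later). The node names follow the C/T labelling of the figure, but the wiring between them is invented for the example, since the original edges are not recoverable from the slide text.

```python
from graphlib import TopologicalSorter

# Hypothetical wiring: each node maps to the set of nodes it depends on.
dependencies = {
    "T1": {"C1", "C2"},     # transformation T1 reads collected data C1 and C2
    "C11": {"T1"},          # C11 is produced by T1
    "T12": {"C11", "C3"},   # T12 reads the derived C11 and the collected C3
    "C60": {"T12"},         # C60 is produced by T12
}

execution_order = list(TopologicalSorter(dependencies).static_order())
print(execution_order)
```

`TopologicalSorter` raises `CycleError` if the graph contains a cycle, which is a useful check in its own right: a cyclic set of transformations could not be calculated.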

(25)

SDMX technical standards

Data validation and other major enhancements

Thank you for your attention
