DATA GOVERNANCE METHODS. Model Driven Data Engineering added value

(1)

DATA GOVERNANCE METHODS

Model Driven Data Engineering added value

Vincent Ciselet (Ingénieur), Jean Henrard (Ph. D.), Jean-Marc Hick (Ph. D.), Frumence Mayala (Ingénieur), Christophe Mouchet (Ingénieur), Dominique Orban (Ingénieur), Didier Roland (Ph.D.)

(2)

DATA GOVERNANCE METHODS

MANAGEMENT SUMMARY

Although it is taught in all technical colleges and training institutes and is recommended as "best practice" by many experts, data modelling is not widely used in business organizations. There are doubtless many reasons for this but two reasons appear obvious: first the arcane language that allows only specialists to understand and use effective data modelling and second the lack of a concrete product at the end of the modelling process; that is, a product that can be used by those investing in the data management project.

However, in recent years this has had to change as a result of two converging trends. Firstly, the need for organizations to develop methods and tools to maintain control of their applications (with their increasing complexity) while, at the same time, the number of applications to be controlled increased rapidly.

Secondly, the increase in the number of efficient data modelling software solutions in which application source code is no longer the central element of software, but is rather an element derived from the data modelling process which nowadays includes data and business rules as part of the input.

REVER‟s products are an important part of the second element described above. For over 5 years REVER have industrialized and brought to market products based on leading-edge research and development carried out over the last 25 years in the Database Engineering Laboratory (LIBD) at the University of Namur in Belgium.

If data modelling seems to be somewhat anecdotal compared to the overall business challenge one must keep in mind both that the data are at the heart of computer applications and that it is easy to underestimate the vital role of data in the functioning of organizations.

Moreover, the very significant costs of generating the data and the inherent value of the data for an organisation mean that organizations must move beyond data management as an arcane function of the IS department into central challenge of "data governance". Just like corporate governance, data governance must define the rules for exercising the activities of data management, to ensure compliance with these rules and their implementation and to ensure their development and evaluate their effectiveness.

It is from this perspective that this document has been drafted. It is aimed at managers, practitioners and all those whose function within an organisation requires them to be aware of how data are managed and how data use is governed correctly and effectively within an organisation. That is, this document is an introduction to the ideas of corporate data governance from a technical standpoint.

It shows that by considering all the company data as an "ecosystem" and offering such innovative features such as cogeneration, co-evolution and the comparison of ecosystems using a Model Driven Data Engineering (MDDE) approach contributes to the goals of data governance. In particular, it illustrates that in addition to maintaining permanent coherence of the data ecosystem, the MDDE approach makes the link between:

 policy requirements of governance raised by the "business" as follows:

o define information systems (IS) (creation of databases) o evaluate the existing IF (data quality, database quality, risks, ...)

o to change the SI (evolutionary maintenance, migration of databases, ...)

o use and reuse of existing data (migration / data integration, exchange, extraction, ...)

 appropriate methods chosen by the department

 operational tools needed to stakeholders for the implementation of technical projects

The undoubted successes enjoyed by project after project through the use of solutions (methods supported by tools) presented in this document clearly demonstrate their relevance and effectiveness. These solutions have been adopted by many large organizations and integrators and are used in a wide variety of projects in different technical and organizational environments. These solutions offer:

 high-quality professional results

 a drastic reduction in technical risks associated with data projects through the verification of all components of the ecosystem

 a very substantial reduction in the time required to execute projects resulting from significant automation of processing

 maintenance of a permanent link between business and IS that permits the straightforward application development alongside sustainable investment.

(3)

Page | 3

1 CONCEPTS ... 5

1.1 MODEL-DRIVEN ENGINEERING ... 5

1.2 ECOSYSTEM OF DATA ... 5

1.3 MODEL-DRIVEN DATA ENGINEERING (MDE)... 6

1.3.1 THE APPROACH ... 6

1.3.2 THE LEVELS OF MODELLING ... 6

1.3.3 MODELLING TOOLS ... 6

2 DATA GOVERNANCE ... 8

2.1 DATA IN THE ORGANIZATION ... 8

2.2 GOVERNANCE AND DATA ENGINEERING ... 8

2.3 BUSINESS REQUIREMENTS ... 11

2.3.1 DEFINE ... 11

2.3.2 ASSESS ... 14

2.3.3 DEVELOP ... 20

2.3.4 REUSE ... 24

3 THE MDDE PERSPECTIVES ... 29

3.1 IMPROVEMENT OF HUMAN-MACHINE INTERFACE... 29

3.2 COUPLING WITH OTHER MODELLING SYSTEMS ... 30

3.2.1 APPLICATION ACCEPTANCE ... 31

APPENDIX1:GLOSSARY ... 33

APPENDIX2:EXAMPLEOFECOSYSTEM ... 35

ANNEXE3:BIBLIOGRAPHY ... 38

(4)

THEMATIC INDEX

This document provides an overview of the contribution of MDDE. The description however risks losing readers who would like to obtain information about a specific project. The aim of this index is to respond to this concern by providing links between the headings of a project, or parts of the document concerned and the technical solutions described in the document entitled “data governance: technical solutions”.

PROJECT TYPE

REFERENCE

Label Number

Technical Solution (http://www.rever.eu/white- papers/technicalsolutionsMD

DE-EN.pdf)

Data Archiving

understand 2.3.2.1 2.1

export 2.3.4.1 4.1

data cleansing

Erreur ! Source du renvoi introuvab le.

4.1.1

Understanding Applications understand 2.3.2.1 2.1

Development of new applications develop 2.3.1.1 1.2

Evaluation of IS systems understand 2.3.2.1 2.1

programmes risks DB 2.3.2.2.3 2.2.3

Data Exchange export 2.3.4.1 4.1

Extraction of data sets export 2.3.4.1 4.1

data sets 2.3.4.2 4.1.2

Fusion of databases (meta DB creating repeating structures and data of several DB)

modernise 2.3.3.1 3.2

Integration of data (injection into an existing DB or a package of data from one or several DB)

import 2.3.4.4 2.1

Evolutionary Maintenance

update 2.3.3.1 3.1

co-evolution 2.3.3.2 3.3

Migration of DBs (between DBMSs) understand 2.3.2.1 2.1

modernise 2.3.3.1 3.2

Migration of data (injection into an existing DB or a package of data from one or several DB)

export 2.3.4.1 4.1

import 2.3.4.4 4.2

Quality of databases understand 2.3.2.1 2.1

DB quality 2.3.2.2.2 2.2.2

Quality of data understand 2.3.2.1 2.1

data quality 2.3.2.2.1 2.2.1

Rewrite of application

develop 2.3.1.1 1.2

import 2.3.4.4 4.2

Reverse Documentation understand 2.3.2.1 2.1

coupling 3.2 5.2

Reverse Engineering understand 2.3.2.1 2.1

coupling 3.2 5.2

(5)

Page | 5

(6)

1 Concepts

1.1 Model-Driven Engineering

Modelling has always been an important part of science and technology. The essential features of modelling include:

 a description and understanding of the system under review;

 communication between those working on a specific problem;

 provision of an abstraction to aid reasoning independent of technical constraints;

 prediction in the sense of identifying a priori the impacts of changes or developments;

 simulation of scenarios expected in real life use of the system being modelled.

Model Driven Engineering (MDE) is part of this approach and can be defined as a form of generative engineering, that is an approach in which all or part of a computer application is generated from the model.

This approach corresponds to a paradigm in which the source code is no longer regarded as the centrepiece of software, but as an element derived from modelling. This approach is important in software and hardware architecture-led models using standards such as the MDA (Model-Driven Architecture) specification proposed by the OMG (Object Management Group)

Such architectures fit naturally into a models -based development process ensuring at each level of modelling, that the models obtained and reused have the required quality. This approach puts the model in the centre of analysts and designers‟ thoughts.

If the name sounds new, the process, it is not: modelling activities are the daily bread of the developers and have always been. However, in most cases, the models and solutions remain implicit, or at least informal, and are applied manually. The proposed MDE approach simply formalizes and mechanizes the processes that experienced engineers follow by hand. In other words, the MDE is simply an application to software design of the "classical" approach of the engineer who created a mechanical object following design documents.

To be useful and effective, the MDE approach requires that the models and processes be made explicit and precise enough to be interpreted or processed by machines. In this context, the process can then be seen as a set of partially ordered transformations of models, each of the transformations taking a model input and producing a model output, until all the required runtime artefacts have been created. Thus, when you need to create a new solution, be it a simple evolution of an existing solution or a new variant, one can simply "replay" the majority of the process by changing just a few details here and there in the model.

The model-driven data engineering (MDDE) described later in this document is simply an MDE approach applied to the "ecosystem" of data.

1.2 Ecosystem of data

Persistent data in computer systems are stored in databases. This definition is generic and does not favour any type of management system used for data storage: this system may be composed of "flat" files, XML files, databases management systems (DBMS) of different types such as hierarchical, network, relational, or any combination of these

Whatever the system used, the data (to be more precise: "the values") are stored in a structure defined in order to be processed by programs. In addition, data are comparable to the pieces of a jigsaw puzzle such that each piece of the puzzle or piece of data is isolated from the others but corresponds to specific rules such as its height, flexibility and so on. Each piece must adjust to its neighbours in its shape and colours.

Taken together, they are links between them that ensure the consistency of the information - for example in a health care database, patients receiving prenatal care would be female. Such rules are called the data rules.

Finally, the data stored are accessed by programs so that they can be manipulated to achieve a required result. If it is common to consider the structures, values and "rules" as part of "ecosystem" of a database, it is less so to include access to the data found in the programs. For an MDDE process to be complete, data access has to be included for the simple reason that the rules are not only located in the storage system data but split in a non-homogeneous manner in storage systems and programs.

Appendix 2 describes an example of a shopping cart management application data ecosystem.

(7)

Page | 7

1.3 Model-driven data engineering (MDE) 1.3.1 The approach

The "heart" of an MDDE approach is the use of models to produce tools used in the technical environments of the projects. This approach is already widely used by many products in the market for the creation of a database structures. It is however still too often limited to this type of action. With a view to wider use, we should be asking two basic questions:

 How to design the modelling approach so that it can be used by all types of stakeholders (analysts, developers, database experts)?

 What are the necessary and sufficient functions to include in modelling tools in order to generalize the approach of MDDE to operations other than but still including the creation of databases structures?

1.3.2 The levels of modelling

To answer the first concern, the generally adopted principle is to divide the model into levels, with each level of modelling corresponding to specific views. For databases, it is conventional to consider three levels of modelling:

 the "conceptual" (or semantic) level describes the point of view of the business stakeholder‟s information system. By nature, this level is independent of all technologies and is a representation of the information system necessary for business activities

 the "logical" level describes the information from the point of view of the developers of the application system. This model is dependent on the category of technology used for data storage: the logical model derived from a conceptual model will be different depending on whether the data will be stored, for example, in a relational DBMS or using file-based XML;

 the "physical" level describes the database from the point of view of the database expert. This model is dependent on the technical environment in which the database will be implemented: the physical model of a database may be different depending on whether it is, for example, an implementation of a relational system in ORACLE or DB2.

1.3.3 Modelling tools

In the context of an MDDE approach to modelling tools should be able to move from one level of modelling to another using transformation functions that:

 guarantee the maintenance of the same semantics

 are reversible, meaning that for each of the functions there exists, in the mathematical sense, a reciprocal function

The principle of symmetrically reversible transformation functions is illustrated in figure 1. Transformations must, starting from a conceptual model, be able to create logical model for a relational DBMS, for a network DBMS or for an XML DBMS and, conversely, starting from a logical model allow the creation of a conceptual model.

In addition to transformation functions, modelling tools should allow models to "evolve" to take into account changes in the real world that they represent. The development of the models is carried out either by

(8)

transformation functions or by changes to the models (addition, modification or deletion of a model element) or even by a combination of transformations and model changes.

To the extent that databases are considered as an ecosystem that includes data access, it is essential that data access be represented in modelling tools in the same way as the data models. In particular, access to data models must be related to the data models, to allow the tools using the models to have a complete view of the information system (IS).

Finally, modelling tools should provide opportunities to establish connections between models of different levels (conceptual-logical-physical) and of course comparison functions to allow the identification of differences in the models.

Beyond „pure‟ modelling, and to ensure the utility of the MDDE approach it is useful to associate with the modelling tool:

 "generators", according to the objectives pursued, to produce the necessary source code. Thus, for example, it is not very difficult to write an SQL generator to produce a query verifying whether or not the data contained in a database meet the rules defined in the model. It goes without saying that many other possibilities are conceivable.

 "analyzers" which allow the reconstruction, from existing code, of all or part of the description of the ecosystem including the rules of management.

(to learn more about the MDDE see http://www.rever.eu/white-papers/IntroductionMDDEmethods- EN.pdf ).

(9)

Page | 9

2 Data governance

2.1 Data in the Organization

It is increasingly common to hear that "data are a strategic issue for organizations."

This slogan expresses 2 truths. The first is that organizations cannot risk losing their data (without data the organization stops). Secondly, data are a valuable asset as they cannot be replaced if lost. It is equally true that:

 data exists in the organization independently of computer programs

 data have a different lifetime to programmes that manipulate them

 the cost of the data is estimated to be from 4 to 6 times greater than programs

 data errors generate hidden costs that may amount to hundreds of thousands of Euros each year depending on the organization

In this sense, the data are a real heritage which is transmitted from generation to generation within the organization. It is therefore that the ESA (European accounting system) recommendations advocate that all databases of a lifetime of more than one year are recorded in the "assets" side of the balance sheets of the Member States of the European Union. This trend is confirmed by recommendations of the OECD who would also require that databases are recorded as business assets.

This approach invites several questions: what criteria should be used to assess the value of a database?

How can we measure appreciation or depreciation? How to insure against the risk to data? Most importantly, how can we do this in an objective manner? All of this falls under the heading of Data Governance.

2.2 Governance and data engineering

As any form of governance, data governance requires three levels of intervention; strategic, tactical and operational.

The strategic level defines the vision of the organization for the management of its data, indicates the rules to be applied, and assesses its effectiveness and relevance. This governance framework allows users to specify their requirements:

 definition of information systems

 assessment of existing systems

 evolution of the systems put in place

 reuse of available data

(10)

The tactical level defines the data solutions (methods and tools) applicable to the organization‟s data ecosystems. These solutions must enable technical stakeholders to respond to the requirements imposed at the strategic level and more particularly:

 develop information systems

 understand and measure the existing systems

 modify and modernize the systems put in place

 export and import data

The operational level translates the application to meet the specific data ecosystems adopted at the tactical level.

In this context and for the implementation of the projects, the MDDE approach is the link between strategic requirements, tactical methods, and operational tools:

 models define and formalize the business requirements

 transformations provide the translation of the functional requirements specified by the business

 generators produce the tools needed to carry out the transformations

The governance of data described here does not take into account the budgetary, human and organizational aspects are required for a complete governance policy. A discussion of the fundamental aspects for the success of a complete data governance policy within organizations is out of the scope of this document.

The rest of this document describes different methods used for each of the business requirements. These methodological descriptions are illustrated by examples from real projects carried out using the "DB-MAIN"

modelling platform complemented by technical solutions integrating parsers and generators. Technical solutions are described in more detail in http://www.rever.eu/white-papers/technicalsolutionsMDDE- EN.pdf

(11)

Page | 11

The examples used in this document are from different projects for different clients having, of course, very diverse technology environments. This approach, at first glance perhaps somewhat chaotic, clearly shows the generic nature of the MDDE approach.

In addition, to assist understanding, the vocabulary used is that of the entities-association paradigm and the representations used in the figures are mainly expressed in this paradigm.

Annex 1 provides the correlation in vocabulary between the different modelling paradigms.

You are reminded that the modelling platform DB-MAIN was developed by the University of Namur and is available to download free of charge from http://www.db-main.eu. A bibliography summarizing the latest research is provided in Annex 3. A complete bibliography is available at:

http://info.fundp.ac.be/~dbm/mediawiki/index.php/LIBD:Publications

(12)

2.3 Business Requirements 2.3.1 Define

New user requirements are met by the development of short programs, by the development of a new application or by the overwriting of an existing application to meet the needs of the business. In these circumstances, the objective of the MDDE approach is to make available to developers, regardless of their function - systems analyst, programmer, database administrator - tools that allow them to accelerate the creation of new data ecosystems.

2.3.1.1 Develop

The purpose of tools and methods is to allow developers to start from definitions to co-generate an ecosystem of data that is creation of the structures of the database and the associated data access methods.

2.3.1.1.1 Method

The process of "cogeneration" is shown in figure 6:

(1) it is appropriate first to define the schema conceptual data. This scheme is a business definition that IS can implement independent of any technology

(2) from the conceptual schema, an automatic process of transformation will produce: a technical schema and business views taken from the conceptual schema (3) from the technical schema, a product code generator creates DB scripts (4) from the business views code generators produce:

(a) the source code for a middleware layer: Business Access Modules (BAM). This layer contains all the methods for data access management (b) the technical documentation for the middleware layer to so that the developers can use the BAM

(c) the application source code for the version of the DB. This application relies on the generated middleware layer.

The methodology of cogeneration is supported by appropriate tools presents several major interests:

 It is exclusively based on a conceptual schema offering descriptions of functionalities:

o "types of entity" (e.g.: the type of entity "persons" is divided into two types of entity "clients"

and "employees", each of these types of entity with its own characteristics);

o the "attributes" (e.g.: "name", "address" is composed of "Street", "nr",..)

o "association types" that bind the types of entity between them (e.g.: "employee" may work for one or several "agencies" and vice versa in an "agency" there may be several "employees").These definitions are translated by the transformer of schema in technical terms;

 isolates the process of treatment data management processes providing an architecture "agile" to advance the different "layers" technical with a measure of independence;

(13)

Page | 13

 methodology does not impose any technical architecture. Only the BAM generator is dependent on the choice of architecture to generate source code. The latter, must conform to architecture selected and defined strategic directions (centralized, decentralized, MDM, application);

 is an application approach, business views are not confined to a single database. The BAM could access several databases, possibly implemented in different DBMSs;

 the BAM are generated from business views . Cogeneration tools generate a first business view embodying the definition of the conceptual schema. Starting from this prima facie business it is possible to define other business views each giving a different view of the DB: the BAM offers access to the DB by following the logic of the business views from which they are derived;

 the generation of source code can be in different programming language (JAVA, C, COBOL,...);

 It is automated and therefore immediate;

 It is also applicable to existing databases. In this case:

o The need to rebuild the technical schema by reverse-engineering (a specific tool is also provided).

o from this technical scheme, automata reconstruct a conceptual schema, generate business views, the BAM, their documentation, and the application allowing the editing of the DB program.

(14)

The use of the methodology is shown here using the BIRT database that accompanies the Eclipse development platform. The conceptual schema (BIRT/conceptual figure 7 left) contains five types of entity that is: (red) agencies, (dark brown) employees, customers (blue), (light brown) payments, (green) people who are either employees or customers.

The following outputs are generated from the conceptual schema:

a "technical" relational schema (figure BIRT/SQL) (7 on the right) which is used for the implementation of the DBMS,

a business view

From this first business view, generated automatically, a second business view (BIRT-finance /

"profession" figure 8) set was: the type of entity "persons" has been deleted; its attributes were aggregated to the types of entity "clients" and "employees". Furthermore, entity type "payments" has been integrated into the type of entity "clients". Finally some attributes of "clients" have been deleted and/or renamed. The MAMs generated from the business view "Finance" present data as if "clients" was a single table composed of the different attributes described in the business view (figure 9).

Documentation of the BAM for developers is also generated as well as an editing program for the database based on the BAM. The editor allows developers to browse the database following the association types defined by the business view, to view data, to create data, and to modify data as well as familiarizing themselves with the entity types, association types and methods that implement them, in addition to the generated documentation.

finance view/Business-1

(15)

Page | 15 2.3.2 Assess

To be able to evaluate an application and to measure the quality of the data and estimate the risk of adapting an application to new needs are essential functions for a good governance of data.

The requirement for good governance is often in conflict with the reality of day-to-day use of applications. Indeed, the programs making up an application accumulate over time, forming a cluster whose complexity is reinforced by technological developments, and evolutionary and corrective maintenance. This increase in complexity over time is accompanied in parallel with a loss of "knowledge" of the application due to incomplete documentation, outdated data and experts in the system called to other duties or tasks or who leave the company.

This complexity of programs makes it difficult to see the data which are at the heart of the applications.

In this context, the primary objective of the MDDE is to understand the application and use that knowledge to gain a sufficiently detailed knowledge to assess the quality of the data, the quality of the databases, the level of risk, etc.

2.3.2.1 Understand

The reverse-engineering methodology explained below is designed to reconstruct a definition of the ecosystem of the IS regardless of the diversity of the storage structures and data management processes that it comprises. This definition is a reconstruction of the different levels of model. It goes without saying that if a full or partial model is already available, it is not necessary to perform all the steps in the process described below. The granularity (precision) of the models depends on the performed steps and selecting the appropriate level of precision is a function of the development team.

2.3.2.1.1 Method

To achieve the defined objective, the method proposed is to analyze the elements into two categories:

 the technical elements: scripts for creating the DB, the source codes for all of the application processes (DB-procedures, triggers, programs, JCL, scripts,...) and finally the data itself

 non-technical elements such as the existing documentation, the implicit knowledge of system developers and end-users.

The analysis of the technical elements takes place in several successive stages (five are shown below) that improve and validate the accuracy and quality of the model as they proceed.

 the first step is to rebuild the physical model of the DB by simple analysis of creation scripts and/or direct query of the DB (for most relational DBMSs)

 the second step is to complete the previous model by elements declared explicitly in programs and not reported in the DB. Thus, for example, it is common to find in the DB columns of hundreds of characters or descriptions that are set in the programs. This step is required when it comes to applications running with flat files

 the third stage is to produce the logical model of the application. The latter is built mainly by enriching the results of the previous step by the discovery of management rules found in the programmes

 the fourth step is to validate the results obtained by the analysis of the data. Non-compliance of the data values to the data rules defined in the model requires us to think about the origin of the gap. Is it due to an incorrect rule, an erroneous rule or an incomplete rule? In addition, this step allows the enrichment of the model using analyzed values: unused columns, default values and so on

 the last step is to abstract the technical results to produce a "conceptual model" independent of the technology. This rough conceptual model can be completed by the knowledge taken from documentation and available expertise. This contribution then allows the analyst to obtain a conceptual schema whose semantics are enriched and tends to best express the perception of the IS from the point of view of users

The above process is more than 90% automated. The remaining manual tasks are the validation of the results of each of the steps and the enrichment of the rough conceptual model using documentation and human expertise. It should also be noted that the methodology proposed is completely generic and is suitable for all types of DBMS, languages and operating systems.

(16)

2.3.2.1.2 Examples

It is obviously not possible to describe exhaustively all the results of the reverse engineering process just described. It is however, important to highlight two types of result:

 the accuracy of the models

 dependencies

2.3.2.1.2.1 The accuracy of the models

This is the first objective. The screenshots below illustrate each of the steps in the process of reverse- engineering showing the evolution of the accuracy of the model. The sample is a subset of a database of IDS2 (environment BULL GCOS8) before migrating to an IBM, Z/OS and DB2 environment. To give a sense of scale the full database includes 255 record types and the surrounding application about 1.5 million lines of COBOL code.

The analysis of the creation of the base code reveals entity types (type of record IDS2) and types of association reported. The notion "owner- member”, characteristic of network databases, indicates the direction of the type of association (e.g.

for an occurrence of IDENT1, it is possible to have several INSTITUTE, and conversely an occurrence of Institute may depend on that one and only IDENT1).

The analysis of source codes of programs allows us to complete the physical schema notably by including the addition of the breakdown of the data structures as they are used by programs.

A more detailed analysis of the programs allows us to complete the schema by complementary types of association (in blue in the example) used by programs. These types of association between entity types are data rules which are managed exclusively by the programs. It should be noted in the example that two "clusters" of entity types that appeared to be independent of the other are in fact associated with the programs.

(17)

Page | 17

To ensure the accuracy of the model reconstructed by analysis of the source code, an automatic process verifies the data using the rules defined by the data model. In this way, the rules check to ensure compliance with the formats specified for types of association or the constraints on dependencies between attributes.

To produce the conceptual model, the following actions were carried out:

 analysis of the data showed an equivalence of keys between IDENT1 and I2DENT, so they have been merged

 a record which was the implementation of a N - N relationship has become a type of association after removing redundant attributes

 Institute, the added constraint highlighted the fact that the IN_MAT attribute was redundant with the type of association IDIN, and therefore this attribute has been removed

 two of the added constraints have been transformed into types of association

 the decomposition of the dates in years, months, days have been removed, and the decomposition of bank accounts and the decomposition of notes in rows

It should be noted that to obtain a readable schema we should also rename the entity types, attributes, and types of association in order to have a significant vocabulary. This work has been done the project, but for reasons of confidentiality the result is not presented here.

(18)

2.3.2.1.2.2 Identification of Dependencies

In addition to the reconstruction of models, reverse-engineering also allows for the identification of dependencies between components of the data application. Three types of dependencies are identified: data- data dependencies, data-program dependencies and program-program dependencies. Whatever the type of dependency, they are modelled as a diagram with nodes representing the components and arcs representing the links between components. Knowledge of these dependencies drastically reduces the technical risks to the projects by identifying all of the elements affected by a change in any component.

Data-Data Dependencies

The model indicates the types of association (dependencies, in red) between data which must be respected (at risk is the consistency of the IS). Thus, for example (figure 15), a postal code change should be reflected in the types of entity address-branch, address- headquarters-social, people-non-entities. This example was taken from an application modernization project using the COBOL language and DBMS: IDS2.

Data-Program Dependencies

All accesses to the DB are identified and are represented as a diagram in which:

 nodes are the types of entity and modules programming

 arcs indicate the type of use of the types of entity (read, write, update) carried out by the modules .The example (figure 16), shows the list of programs that use the entity type "address-branch": any modification of the latter may have an impact on the modules that are linked. Once again, this example was taken from an application modernization project using the COBOL language and DBMS: IDS2.

Program-Program Dependencies

In the same manner as above, all calls between modules are identified and are represented in a diagram in which:

 nodes are modules

 arcs are the links between the different modules This diagram is in fact a description of the architecture of the data application. It allows us to follow the propagation path of any change in a module. This example was taken from a redocumentation project using the C language and the DB2 DBMS.

Application mapping

An application map is obtained by combining two types of previous diagram and the objective is to obtain a mapping from the application or one of its parts.

This example was taken from a redocumentation project using NATSTAR and the ORACLE DBMS.

(19)

Page | 19 Documentation:

Documentation is obtained by the simple export of the contents of the DB-MAIN repository.

The items stored in the DB-MAIN repository are exported in XML format and can be output in the required format by an automated process. Available formats include HTML, WIKI, DOCBOOK, HLP and so on.

2.3.2.2 Measure

A detailed knowledge of an application allows us to assess the quality of the elements of the application.

According to the needs the Assessments can range from the simple extent of one or more components in the implementation to a complete solution for their management. For example, detailed knowledge of the application data model allows the assessment of its functionality compared to user requirements.

It goes without saying that methods should be adapted to match the type of components you wish to measure or even improve. Some examples of possible measures are provided below. It should be noted however, that where the MDDE is limited to access of data, it cannot qualify for an assessment of the quality of the programs: this falls within the remit of other domains of computer science.

2.3.2.2.1 Data quality

During reverse engineering, the processes used in step four, automatically generate tools to assess the compliance of data with the rules described in the data model. This validation is conducted on each value contained in the DB against all the rules that each value must respect. All nonconformities are identified and reported, indicating for each of them, which rule is not respected, which values fail the validation and which programs are affected by these data issues.

From this starting point it is possible to build a platform measuring data quality. This is illustrated in figure 20.

The principle is to build a repository that contains the set of management rules. In addition to the rules from the model, the repository can integrate other rules (eg compliance to regulations, internal rules) provided by stakeholders. From this repository, automated processes generate queries to compare the data to the rules.

This process can identify:

 erroneous data likely to be the subject of corrections

 incomplete or inaccurate management rules that must be more clearly specified

(20)

Knowledge of data-program dependencies reinforces the approach of error prevention in particular by improving the controls carried out in the programs before data are added to the database or processed in any way.

In the same vein, it is possible to add additional modules that permit:

 The archiving of results of checks which, in turn, allows a view of improvements in the data quality over time and an assessment of the ratio of cost of quality improvement efforts against results

 The assessment of the impact of erroneous data in business terms based on , for example, the absence of a postal code in an address that prevents delivery of an invoice which has a financial impact of x % of the amount of the invoice

Clearly, measures carried out by the systems described here do not claim to resolve all questions about data quality. Indeed, the data contained in computer databases must be as close as possible to the reality of the field they describe: in particular they must be accurate, reliable and up-to-date.

Whatever they are, technical control systems do not govern data;, this remains a corporate responsibility:

at the most we can expect technical systems to check the "consistency" of the data. In other words, it is not possible for a technical system to verify that Madam X has 3 children, but it is possible to highlight when one database reports that she has 3 children while another indicates that she has one.

2.3.2.2.2 Quality of databases

Beyond the quality of the data, the MDDE approach allows an improvement in the quality of the database. The criteria to take into account can be varied, for example, number of columns in each table, number of identifiers, redundant attributes, management rules defined in the DBMS, use of data by different programs and so on. It goes without saying

that these assessments complement the information provided by the DBMS in terms of use, performance and so on.

Evaluation of the quality of the database is useful for example:

 to evaluate the complexity of the development of applications: degree of dependency of entity types, duplication of information, component dependencies and so on

 to complete assessments of the program quality and to provide a measure of the quality of an application

Further work in this area is currently underway: this includes the automatic detection of complex constructions likely to be sources of error.

2.3.2.2.3 Program-Database Risk

Knowledge of the model and programs-data dependencies allows us to classify programs by DB risk. The principle adopted is to calculate a weight for each of the programs which is the sum of the weight assigned to each program/data access weighted by the type of action (read, write) and the role of each of the entity types in the model.

Programs are then added to the x-axis of a graph in order of increasing weight. The y-axis may vary depending on the objectives (utilization level of programs, frequency of maintenance and so on.

This tool is often used in tests. The programs with the greatest weight and used more frequently are those that represent the greatest risk: they need to be tested as a priority and probably more thoroughly than programs deemed to be low risk.

(21)

Page | 21 2.3.3 Develop

As detailed as they are, information systems are not and never will be more than a model of the real world. They are designed to be as close as possible to the real world as is required by the business.

However, the real world never ceases to change over time. The development of an information system is, therefore, an ongoing requirement in data governance.

From the point of view of business analysts, the developments of an IS can be classified into three categories:

 Either it must respond to functional changes related to changes in organizational or regulatory requirements, which require changes in the existing application database

 Or it‟s a question of technical changes, in particular a change of DBMS while preserving the existing application. This is a modernization of the application

 Or, finally, the existing application is completely replaced. In this case, it‟s necessary to export the data from an existing application and to import it into another database. This type of development is treated in the next chapter.

2.3.3.1 Update and/or modernize

If from a functional and technical point of view the two types of change taken into account in this chapter do not have the same objectives, it should be clear, from the methodological point of view, that the methods used in both cases are similar.

2.3.3.1.1 Method

The processes run are similar and are summarized in figure 23. The differences in the methodology are highlighted in italics. The method works in five main phases:

(1) UNDERSTAND. This is one of strong characteristics of the proposed method: there must be sufficient knowledge of the source ecosystem. We can use the image of a GPS which cannot calculate a path without knowing where the starting point is. This knowledge is essential to the success of the project.

(2) SET. This phase consists of defining in the model, developments required to achieve the desired final result in terms of change of structures and rules. Actually this is equivalent to defining the conceptual model of the target database.

In the case of modernization, the conceptual model of the target database may be same or different from that of the source database.

(3) CONTROL. At this stage, the source and target database conditions are known, it is appropriate to identify the obstacles which could prevent the movement of data from the source to the target. These barriers, which in fact are the technical risks to the project, come from three separate origins that must be subject to validation by a priori controls:

 control of "compatibility": this is a question of identifying the inconsistencies between the source and target databases, ie any change

(22)

that cannot be achieved by a simple transfer of values. The types of inconsistencies depend, of course, on the type of project, the essential point being to be able to, for a specific project, to detect them. As an example, and accepting that the list below is not exhaustive:

o in the case of the development of databases, the incompatibilities come essentially from the incompatibility of data with the new rules defined for the target database: for example if a column has become a foreign key it is necessary to verify that data values are consistent with this rule

o in the case of DBMS changes, incompatibilities often occur as a result of functionality supported in the source DBMS that is not supported by the target DBMS. For example, in most of the legacy DBMSs (IMS, IDS2, IDMS and so on) the order of presentation of rows of a table is defined at the time of writing, in a relational DBMS this order is set at the time of reading. This order may be important for application programs so it is appropriate in this case to define the appropriate sort keys that will allow the proper functioning of the programs

 Validating data quality: whatever rules changes apply, it is appropriate to determine that the source data are in accordance with the rules of the target IS. This control is done by checking the conformity of the data to the target model.

 Checking the propagation of changes: knowledge of the three levels of dependencies (data-data, data- programs, programs-programs) to identify for each of the changes of an element of the ecosystem the impact on other elements

(4) EXECUTE. The activities deployed in this phase are:

 the automatic generation of the necessary tools for the creation of the target ecosystem. For example, o structures of the target database

o in the case of modernisation, access to the target database o data migration programs including

 the export scripts that incorporate changes in value and/or format

 the importing scripts for the target database

 running of export programs that produce flat files in the format of the target database tables and that can be directly imported to the database using standard DBMS utilities and scripts

 in the case of modernization, adaptation of the application programs so that they access the target database in the correct way

It should be noted that adaptations of application programs so that they take into account the functional changes are outside the scope of the automatic generation system and they therefore remain a task to be completed by other stakeholders.

(5) VALIDATE. This phase is designed to ensure that the migration of data from the source to the target IS does not disrupt the consistency of the ecosystem. The approach used is to deduce source and target models and of their correspondences, a common model from which it is possible to generate programs that validate the migration from two angles:

 exhaustiveness: programs generated for each of the source and target environments create a count of the number of physical occurrences in the databases as well as a functional control (checksum) of each of the attributes. A report is published highlighting possible differences between the counts

 consistency: generated programs extract data from each of the databases. The extracts are compared and a report is published highlighting possible differences of values and occurrences.

It should be noted that this method works regardless of DBMS types and the structures of source and target databases and permits the validation of a data migration regardless of any application.

2.3.3.1.2 Examples

Several phases of the process have already been illustrated in the previous pages.

The following example discusses the validation phase for a data migration and in particular the maintenance of data consistency. To that end, the example uses migration of data from a source database to a target database in which the data structures were different. Figure 24 shows a portion of the resulting common model following comparison of the

(23)

Page | 23

source and target databases. From this model, a code generator produces:

 the source code of a data extraction program for the source database

 the source code of a data extraction program for the target database

Execution of data extraction programs in the source and target environments provide two files in XML format that must contain the same data. A XML file comparison is used to verify this assertion. As a first step, the comparator highlights differences in the compared files.

Secondly, the comparator highlights the differences in values between identical instances of the compared files.

(24)

2.3.3.2 Co-development

Co-development is just an update of the above methodology: the only difference is the addition to the execution phase of the BAM generator for the new definition of the database.

2.3.3.2.1 Method

This is illustrated in detail in figure 27: this starts with an existing conceptual model and includes six successive stages:

 Step 1 Define the new conceptual schema from the existing conceptual schema. Once the new conceptual schema is defined, a schema processor produces a new technical schema and a new business view

 Step 2 Ensure that the data contained in the database are consistent with the constraints and rules defined for the new version of the database. A report shows the non-compliant data. This leads either to the correction of the data or a change in the constraints applied to data

 Step 3 Depending on requirements this requires either a change script for the existing database or a script for creating a new database. The choice is often linked to the importance of the changes required .

 Step 4 Adapt or migrate the data contained in the existing database to the new database. This step also generates validation processes that ensure completeness of the migration and the maintenance of data consistency

 Step 5 Generate Business Access Modules (BAM) for the new version of the database. This step also provides a list of the existing BAM that are impacted by developments

 Step 6 Provide developers with the list of programs and/or application modules affected by the changes and that it is appropriate to modify

(25)

Page | 25 2.3.4 Reuse

The primary purpose of data is to create information. Data should therefore be able to be used in other contexts than those in which they are managed: it should be possible to export them i.e. extract them for the purposes of exchange, archive, or more simply to produce samples, possibly anonymized, for testing or training purposes.

If the data must be able to be exported, they must also be capable of being imported and without losing their coherence or their meaning. We must be able to transform them, aggregate them, disaggregate them and integrate them into other IS structures where the rules are known in advance.

2.3.4.1 Export

The need to export all or part of the application data is very common: Exchange of data with partners (customers, suppliers and so on), archiving of inactive data, providing business stakeholders with subsets of the data for the purposes of checks, testing, training, etc.

The objective is to allow a consistent extraction of a subset of the database that meets business logic and demands.

2.3.4.1.1 Methods

To achieve the objective, the following approach was used:

 Step 1: define the functional subsets desired in the conceptual model. These subsets define the list of entity types and attributes to retrieve

 Step 2: define criteria that reduce the volume of data extracted in functional subsets described in the previous step

 Step 3: define for each of the functional subsets, target models to produce: these target models allow the definition of formats for the results of the extraction, for example flat files, database, XML file and so on.

 Step 4: generate the code for selection programs: the objective of these programs is to discover the key of the roots entity types of functional subsets

 Step 5: generate the code for programs whose purpose is to extract the data from the production databases for all occurrences of selected key

(26)

2.3.4.1.2 Example

Functional reduction starts from a business view corresponding to functional needs. The definition of the business view allows us to deduce the subset of tables necessary and sufficient for the creation of data sets.

Extraction of data from one type of business entity may require several queries on technical tables: these queries are handled by the business access module corresponding to the business view.

Reducing the volume of data is achieved using the attributes defined in the business view. In the example, the objective is to retain those people with an outstanding debt at least cancelled as a result of bankruptcy. A first filter shows that the number of persons with at least an "outstanding debt" is 21,824 on a total of 156,418 debts. A second filter on the

"cancelled debts" reduces the number to 4,510. Finally a filter on the value of the attribute reason with an entity type with a value of Waiver reduces the final number to 159.

For these 159 people the data are extracted in XML format, following the logic defined in the business view. The DTD for the XML document is determined by the XML model of the reduced database.

2.3.4.2 Data Testing

For the creation of test data, the extraction tools are complemented by:

 the use of program-data dependencies to automatically determine the functional subset necessary and sufficient for the programs to test

 Some tools that allow:

o the generation of data from predefined rules. This allows:

 generation of rows in tables

 generation of values in the columns of each row

o the anonymization of data for purposes of confidentiality. This process must maintain data consistency while identifiers are amended

o the comparison of data values before and after testing, providing effective assistance to the count;

 tools to assess the technical test coverage. The goal here is to determine the extent of coverage by the test data set: If the data set only tested 30% of a program, it is likely that a new set of data would have to be provided

2.3.4.3 Cleansing

In the same way than for tests sets, the extraction tools can be enhanced by other tools that delete the extracted date from the database.

(27)

Page | 27 2.3.4.4 Import

If the data must be exported, this must be done without losing their coherence or their meaning. We must be able to transform them, aggregate them, disaggregate them and integrate them into other IS structures where the rules are known in advance.

2.3.4.4.1 Method

The methodology used to import data is close to that used for the development of a database such as that shown in figure 31. We find here the same five keywords:

(1) UNDERSTAND. This is to ensure that all the constituent elements of source and target ecosystems are identified and known with a degree of accuracy, sufficient to achieve the objectives of the project.

(2) DEFINE. In the import process the target model is already defined. Therefore the definitions expected in this phase are the source-target mappings. It goes without saying that these correspondences are firstly semantic and then of a technical nature.

(3) CHECK. Once the correspondences are defined, it is appropriate:

o to identify the inconsistencies that come from:

 differences in the management rules (including those managed by the programs),

 differences in

definitions of data structures; (see the examples later in the document)

o to validate the transformation rules which will allow the resolution of detected incompatibilities. The diversity of situations and transformation rules does not allow the automation of this phase. The rules for specific transformations must be the subject of appropriate developments. Beforehand, a prototype of the processing rules is made to ensure their validity prior to actual development

 Each of the transformation rules is tested on all the targeted data

o at the end of the prototyping of each transformation rule, it is necessary to check the quality of the data processed according to the target model to ensure their compliance with the rules of the target system. This check allows the validation of the transformation rules

(4) RUN. It is necessary to develop, based on the completed prototypes, specific modules that ensure the complex transformations required. Then, a source code generator produces unloading programs by incorporating the developed modules and standard transformations such as change of type or format changes. Loading scripts for the target database are also generated. Finally, execution of programs and the generated scripts perform the migration.

(5) VALIDATE. This phase ensures that the migration of data from the source to the target IS has maintained the coherence of the existing source data ecosystem: in other words it must ensure that Mr X‟s orders have not been assigned to Miss Y.

(28)

The approach used is to deduce from source and target models and correspondences that have been established, a "common" (often XML format) model from which it is possible to generate programs that allow the validation of the migration:

 validation of completeness: the programs generated for each of the source and target environments perform a count of the number of physical occurrences in the databases and functional check of each of the attributes. A report in HTML format publishes the results of these counts highlighting potential differences;

 validation of coherence: the generated programs extract from each of the databases the data in XML format. The resulting files are compared and a report of the possible differences of values and occurrence is published.

2.3.4.4.2 Examples

This section demonstrates one of the most interesting innovations of the MDDE: comparison of models.

It allows the identification of inconsistencies between the source and target ISs, incompatibilities that must be resolved by the transformation rules. An example of a migration of data is shown in figure 32. In this example, it will be assumed that stakeholders have identified that there is a correspondence between the types of entity contracts, clauses and insured risk of the source and target databases. As such this migration is not possible where there is an incompatibility between types of entity management rule: in the target database, the insured risk is dependent on (additional) clauses in the contract while this is not the case in the source database. Therefore, if, in the source database, there is a contract with several clauses and several claims, the question is which clause should be linked to each claim? Similarly, what if there is a contract in the source database relating to some insured risk but without an appropriate clause relating to the insured risk?

As shown in figure 33, comparison of the models allows the identification of inconsistencies in the types of association.

Comparison of the models can also identify the inconsistencies between the types of entity attributes:

format, length, mandatory attribute…

(29)

Page | 29

The resolution of incompatibility problems lies within the jurisdiction of the project stakeholders.

However, it should be noted that generally it comes to decisions that are within the competence of business officers and not technicians as is clearly shown in the two examples above.

(30)

3 The MDDE perspectives

As we see it, the MDDE approach provides a set of solutions covering a very large part of the needs of tactical and operational data governance. In addition, the industrialization of these solutions allows one to be freed from purely technological considerations.

The use of the MDDE approach leads, in effect, to a shortening of the distance between IT and business. What more could we hope for than to have a formalization (as a model) that clearly defines the business requirements and then to use this formalization to generate the technical implementation?

As attractive as it is, the MDDE approach remains one that requires a good knowledge of databases and models. Even if it has left the research laboratory, knowledgeable, well-trained implementation specialists are required.

Consequently, the objective of development using the MDDE approach is clear: make the approach usable on a day-to-day basis in organizations. This objective will be achieved by:

 the continual improvement and industrialization of defined solutions

 the improvement of user interfaces

 the association with modelling methodologies covering other areas of the organization

3.1 Improvement of Human-Machine Interface

An MDDE approach demands that we make use of all the potential and the richness of models: it is not conceivable to consider reducing the knowledge and experience required for the modelling approach on the pretext that it is too complex. Nevertheless, using a musical metaphor, it is possible to listen to music without knowing the language in which it is written. The objective is to make models and their formal language accessible to all stakeholders (users, business analysts, developers, DBAs etc). One approach would be to allow translations of the formal language of modelling into a common language written as text. Ideally, these translators would be able to work in both directions: from the model to text and vice versa, the text into the model.

(31)

Page | 31

The approach model towards text is illustrated by figure 36: this translation is obtained by means of supplementary linguistic analysis tools.

If the demonstration above is enticing in itself, its optimum usage is probably situated in its integration with interactive interfaces allowing a dialog between creators and users. Advanced prototypes have already been made and are now being industrialized. The direction text towards models has been equally explored.

Prototypes that also integrate linguistic analysis tools effectively allow the production of a model while leaving a natural language text giving basic data specifications. If the technical development of such tools is not problematic in itself, it seems that today the market does not demand such a process. The development of such tools is therefore more a question of commercial timeliness than of technical difficulty.

3.2 Coupling with other modelling systems

The modelling described so far is one that allows us to represent the data processing technical system: it is in its nature limited to this universe. In the universe of organizations, computing is only a support tool for business activities that require more or less advanced models. The goals of these models are varied, with multiple approaches and are the object of intense debate within specialized communities. The question to be raised here is pragmatic: is it possible to couple the models used by MDDE to other types of model? In trying to answer this question, the approach was applied to an application at the Belgian Ministry of the Finances.

The application manages the debts of businesses and works in a DB2 and Java® environment. The modelling of the computer application was carried out using DB-MAIN and reverse engineering tools. The upstream modelling of the computer application was realized using the Requirements Engineering tool OBJECTIVER from the RESPECT IT Corporation. To summarize, and without wanting to go into details (to know more about OBJECTIVER see http://www.objectiver.com/index.php?id=4&L=1), OBJECTIVER has an approach centred on the business objectives and allows their specification using models:

 the business needs (requirements models) allows the generation of the requirements document

 the business activities (models operations) implemented in reply to the requirements Figure 37 illustrates the process followed in this project:

 on one hand the use of OBJECTIVER to produce both the requirements and operations models

 on the other hand the use of DB_MAIN to produce both the processing and functional models

A manual comparison of the operational and functional models then allowed the verification of the similarity of the models (meaning that the application met the specifications)

(32)

Pragmatically, the extraction of information coming from the two tools has permitted the construction of a table showing the correspondences between the requirements statement and proposed processes. Figure 38 is an extract from such a table.

The results open up new perspectives for:

 ongoing mastery of applications and, in particular, (re)documentation of existing applications

 change management

 acceptance testing of applications

The last point is currently the aim of a development project and is detailed in the following section.

3.2.1 Application acceptance

The goal of the project is to provide a methodological framework and the technical tools allowing organizations to carry out the acceptance of computer applications under three sub-headings:

 ensure that the application meets the business requirements

 reduce the time and costs required to carry out the tests that guarantee the correct functioning of the application

 identify the technical risks in order to allow: