• No results found

Ontology assisted Data Warehouse System

N/A
N/A
Protected

Academic year: 2021

Share "Ontology assisted Data Warehouse System"

Copied!
6
0
0

Loading.... (view fulltext now)

Full text

(1)

Ontology assisted Data Warehouse System

Vedran Vrbanić

Koios savjetovanje d.o.o, Zagreb, vedran.vrbanic@koios.hr

Data model is a conceptual representation of data structures that are required to build the corresponding database. It defines records about entities and it is the basis for database design. Ontology is comparable to database schema but the ontology represents a much richer information model than a normal database schema, providing metadata schemas and offering a controlled vocabulary of concepts. It can be viewed as an extension of the logical database design. In this paper, an ontology-based approach is proposed with the goal of enhancing the data warehouse (DWH) system with semantic information. This attempt aims to make developers and users aware of the semantics stored in the data warehouse by incorporating functional layer ontology. The task of the model mentioned above is to improve the database system by providing query processing not only at the syntactical, but also at the semantic level.

Keywords: data warehousing, semantics, ontology, RDF, banking

I. INTRODUCTION

Database management systems face challenging problems in dealing with a huge amount of data and the variety of their formats. Current database systems do not only need additional support for manipulating of data but also for understanding their meaning. Semantic knowledge in its various forms has become a necessary tool for enhancing the usefulness and flexibility of data management, especially in integrating data from multiple sources and in optimizing queries [2].

Relational databases are based on the entity relationship (ER) model. It consists of various modelling elements (entities, relationships and domains) that can be used to describe and represent issues of interest to an enterprise. Entities represent things, persons, events and other information-related issues within a subject area or across areas and relationships represent the associations between them. Domains represent logical data types. Trinkunas and Vasilecas [3] propose an approach for transformation of the ER conceptual model to ontology, using graph formalism and a formally defined transformation. A similar, but more customized approach is used here, since this research is not aimed to automate the generation of ontologies from ER models but to examine whether the incorporation of ontology into a DWH system could improve its usefulness.

II. MOTIVATION AND PROBLEM STATEMENT

The success of a DWH system is dependent on the problem of generating transformations for data sources towards DWH structure. These transformations involve a reconciliation of semantics for user requirements and data source schemas. An ambiguous definition of user

requirements occurs when the users are unable to define their requirements precisely and clearly.

Moreover, various meanings of data (i.e. attributes, tables) make it difficult for integrating the user requirements to data sources. Thus, reconciliation of the appropriate semantics of user terms and data sources is important in generating the transformations accordingly.

Researchers Approaches

Kimball (1996) Process-driven

Inmon (2002), Winter and Strauch (2004)

Supply-driven / Data-driven Winter and Strauch (2004) Demand-driven /

Requirement-driven Niedrite et al. (2007),

Giorgini et al. (2008)

Goal-driven Mazon et al. (2007) Model-driven Romero and Abello (2007),

Skoutas and Simitsis (2007)

Ontology-driven Fig. 1. The DWH requirements analysis approaches [4] DWH is intended to be a vivid and accurate source of enterprise level decision making information. But in a number of cases the situation is not so bright. A large number of data sources, inconsistent values in source systems and requirements for complex data transformations result in top-heavy systems with huge maintenance and development costs. Information contained inside those systems is hidden among the multitude of data and its value is often understood only by a few top users (term top user implies most skilled employees in the enterprise, who have the greatest knowledge in specific business and IT areas).

Data model maintenance in large systems requires strong and well-structured software support. The developments in the field of ontologies are promising, though many professionals doubt their value and these ontology-based systems require higher technical performances [14].

Within the scope of this research, I shall try to reduce the greatest barriers that lead to uneconomical and poor DWH systems, by incorporating an ontology approach on the functional level of an existing large DWH system. The system in question is a financial data warehouse in one of the largest banks in Croatia. We must note that the system is developed and maintained by a group of experienced IT professionals quite well, but it is not ontology driven so what I am trying to do is to determine whether the implementation of ontologies into the system’s functional layer can improve its usefulness.

An ER model is intended to store data and the third normal form allows storing of almost any kind of related data. However, an ER model is often characterised by a complex schema of tables, integrity constraints, foreign

(2)

keys etc. Transformation of the third normal form to a denormalized star or snowflake schema results in additional columns, flags or incorporation of normalized entities into the data warehouse. Every data source has its custom characteristics and that entails the growth of complexity of the DWH enterprise schema which should contain uniformed data. E.g., the introduction of a new non-standard financial product can affect at least one transaction system which is a data source for the DWH. It will be handled by adding new tables, keys and relationships to its third normal form schemata. If the new product is to be forwarded to the DWH, adding of new columns is almost inevitable. That means that every existing mapping that updates the product dimension will have to consider new attributes and it will have to be altered. As the data are combined and transformed, there should be some declarative support to follow that process, but the business rules and logic are mostly expressed as PL/SQL code, which is hard to maintain. As changes in business processes and requirements occur, maintaining of existing modules of the DWH system and identifying the dependencies between existing and new modules in the system become more and more tiresome and repetitive. The net result is that the valuable information and knowledge are blurred and not easy to find, so that decision making and business adaptation to new circumstances are slower and less effective than they should be.

III. INTENDED APPROACH

Problems stated above mainly arise from the fact that the semantic representation of knowledge is not satisfactory addressed either by ER model, or star and snowflake schemata. What we seek for is a fine-grained, extendable and self-explanatory module that would help us with business knowledge management. In this paper, an ontological approach will be experimented via case study which compares the standard star schema ETL modelling approach to the ontology driven approach. First, an introduction to methodologies used in “standard” financial DWH system design and development will be explained. Later, an explanation of alternative ontological approach will be provided.

The goal of a financial data warehouse is to output a number of financial reports for decision making, planning and consequently for increase in profits. Bank’s portfolio consists of a large number of different financial products – cash, deposits, current accounts, credit cards, loans, mortgages, guarantees, securities, equities, derivatives etc. All these products are sold to customers. Every customer can have a number of these products by signing contract with the Bank for every product that s/he holds. A number of transactions are made for every contract during its lifetime. These transactions are stored into analytical tables that hold information about the customer, contract, currency, account and organization unit. Contract balances are calculated from transactions.

A simplified star schema of the DWH enterprise model based on the text above is presented in the following figure.

Fig. 2. Simplified star schema of the enterprise model based on the fact table that contains daily balances

There are a large number of financial reports that are generated in different data marts, using DWH enterprise model as the basis. A number of financial statements regarding business and financial risk are presented in the list below.

BUSINESS RISK

 Asset quality – portfolio analysis, impaired / problem loans, reserve adequacy, securities and derivatives portfolios

 Sensitivity to market risk – valuation techniques for investments and derivatives – fair value through income statement, available for sale, held to maturity

FINANCIAL RISK

 Liquidity and funding – funding sources, on and off balance sheet treatment, deposits, commercial papers, repos, inter-bank lines, bonds and stocks

 Capital adequacy – types of capital, earnings, asset valuation, risk weighted assets - Basel II approach

Although every report is specific, in most cases, the methodology to build a data mart is similar. A simplified approach is presented in the following list:

1. Define contracts or counterparties (all retail counterparties, all counterparties with exposure higher than 2 million €, all contracts from treasury…)

2. Define accounts (principals, interests, fees, …) whose balances are of interest.

3. Crossword balances, contracts and accounts to get analytical data from the enterprise model.

4. Aggregate (or don’t) analytical data, calculate exposures to the clients/groups, load additional attributes or dimensions of interest (e.g. use time dimension to calculate payment schedule for future incomes)

5. Release the report

One may think, while reading the steps specified above that the build up of a data warehouse were a simple process. Theoretically, it is. But in practice, huge number of legacy data sources, large data volumes and large number of users inevitably entail a number of difficulties

(3)

arising from low data quality and lack of logical constraints in data sources. We believe that these difficulties would be easier to solve if there were a possibility to make DWH developers and users more aware of the semantics related to the underlying data and PLSQL code.

Let us consider an example of hierarchy consisting of following financial products:

 Limit – presents an internal Bank's decision which determines the amount of money that the customer can borrow. The limit does not present a commitment for the Bank but the Bank will probably not lend more money to the customer than it is defined by the limit.

 Agreement is a financial instrument that defines the amount of money that a customer can borrow by taking one or more loans. Agreement is signed by the Bank and the customer and it is obligatory to the Bank. A limit can contain one or more agreements.

 Loans present final nodes in the hierarchy. A loan can be issued from a limit, agreement or directly. Database table containing the hierarchy would contain contract codes (contract type is limit, agreement or loan) and relation types:

CONTRACT_1 RELATION_TYPE CONTRACT_2

Limit 1 Limit Loan 1

Limit 1 Limit Agreement 1

Agreement 1 Agreement Loan 2 Agreement 1 Agreement Loan 3

Fig. 3. Hierarchy stored in the database table

The logic for the calculation of exposures of listed products is:

 If there are no products issued from a limit, the original limit amount presents an exposure.

 If there are products issued from a limit, exposure is calculated by summing exposures of the issued products.

 If there are no products issued from an agreement, original agreement amount presents the exposure.

 If there are products issued from an agreement, exposure is calculated by summing exposures of issued products.

Any PL/SQL procedure that calculates exposures at different hierarchy level must contain the logic to fetch, group, sum and sort data from the table.

Now, let us assume that there is a requirement to add a new product to the hierarchy. It will be a product called frame. The frame is similar to the agreement, but it is not obligatory to the Bank – it is issued from the limit or it can be the root product in the hierarchy and loans can be issued directly from the frame. A new type of relation between contract_1 and contract_2, e.g. called Frame,is added to the hierarchy table and every PL/SQL procedure that uses hierarchy table must be altered in order to fetch this relation type and to incorporate it correctly into the existing algorithm for the calculation of exposures. If a business user wants to know how did s/he get a certain number in a report, s/he will have to read the documentation (which is

often outdated) or, more likely, to contact a PL/SQL developer to read the code for him or her.

The purpose of incorporating an ontology functional layer on the top of PL/SQL code is to reduce the semantics contained in code (and visible only to developers) by increasing the amount of knowledge contained in functional ontology layer which is readable to business users and developers. So, what can be a more elegant way to store this hierarchy and the semantic meaning of different relation types?

IV. FUNCTIONAL LAYER ONTOLOGY

The standard approach in data warehouse design leads to mapping of functional specifications, i.e. business knowledge, into the PL/SQL code and that is not satisfactory. We aim to displace the functional layer into ontology which is self-explanatory by its nature, in order to make business users and developers more aware of the data semantics.

Ontology is a formal, explicit specification of a shared conceptualization. It comprises of classes of entities, relationships between entities and the axioms, which apply to entities of that domain. It is made up of the following [6]:

 Classes and instances: Describe sets, collections, concepts, types of objects or kinds of entities.

 Properties: They establish relationships between concepts of ontology.

 Rules: They model logical sentences that are always true. Rules provide high expressiveness and they make for more complex reasoning with the ontology

Conceptually, ontology is appropriate for incorporation into the functional layer since classes, properties, rules and hierarchies make a solid foundation for description of the domain of interest.

However, we have not yet mentioned the problems regarding impact of the ontology incorporation on performance of ETL procedures. As financial data warehouses contain huge amounts of data, every ETL batch must be fine tuned. DWH developers must optimize the code in a number of ways regarding the schedule of tables to be joined, table partitioning, join methods (hash join, merge join, nested loops join), use of indices [7] etc. It is not trivial to embed some kind of a third party tool into ETL process to fetch the ontology data since it would be very hard (maybe not possible) to keep ETL performances at a satisfactory level. That is the reason why to use the Oracle (DWH is deployed at Oracle database 11g) semantic technology stack to store and fetch ontology data.

(4)

Fig. 4. Semantic capabilities in Oracle Database [8]

As shown in Figure 4, Oracle database is able to store semantic data and ontologies (RDF/OWL models), as well as traditional relational data.

Resource Description Framework (RDF) is a set of W3C specifications used to store information in the form of statements that consist of three elements – subject, predicate and object [12]. An important thing that RDF defines is the predicate called rdf:type which is used to define types of entities.

RDF does not include support for definition of a vocabulary – RDFS (RDF Schema) is the extension of RDF that makes it possible. Basic RDFS constructs are:

 rdfs:Class – enables definition of classes

 rdfs:subClassOf – incorporates class hierarchy

 rdfs:domain, rdfs:range – used for definition of attributes' domain

Open web language (OWL) is also stored in triples but it provides much richer vocabulary for the description of classes, their relationships and properties. OWL can indicate following:

 domain restrictions that can be applied to specific classes

 classes can be defined as mutually exclusive

 cardinality restrictions as classes' properties

 transitive, unique or inverse predicates

The database table presented in Figure 3 stores a hierarchy. Figure 5 presents the same hierarchy, but this time, ontology is used to describe it. Relation hasParent defines parent-child relationship between entities (e.g. Loan3 is approved from Agreement1). Ontology graph also presents hierarchy of classes, as well as classes’ properties. A more detailed review of ontology graph gives significantly clearer insight into the business domain, which would be hard to achieve by only inspecting the hierarchy table in Figure 3, without reviewing the documentation. Based on the graph, it can be concluded that limit, agreement and loan are similar financial products, which share some common attributes (originalAmount, originalCurrency) since they were all derived from the same class (FinancialProduct), but also have some product specific attributes (signDate, usedAmount).

(5)

V. CASE STUDY

A case study would be conducted by selecting a data mart within DWH system based on the following criteria:

 A large number of dimensional tables from various sources are used within ETL batches that fill the data mart.

 Algorithms used for loading data into the data mart are not trivial, but encompass a large number of business rules within PL/SQL code.

 Dimensional and fact tables that are used contain a large number of rows so, it is necessary to carefully optimize ETL batches.

The ontology that describes a business domain would be created first. This would be a terminology box or T-Box ontology, meaning that it comprises a taxonomy, as well as a set of concepts and their definitions [9, 10, 11]. The ontology would describe data in dimensional tables, as well as business rules used as a basis for the data mart loading algorithm, with intention to minimize the presence of business rules and business knowledge within the PL/SQL code.

Secondly, the new version of the PL/SQL algorithm would be developed. The algorithm would merge the ontology listed above with the data in dimensional tables (which in this case store instances that T-Box ontology describes) and with the data in fact tables that contain measures (e.g. transactions and balances). Comparison of performances of the stated algorithm that uses the ontology and the basic (pure PL/SQL) algorithm would follow.

The next step would be to store the data from dimensional tables directly into the ontology. That means that the ontology would encompass not only a set of concepts, but also a set of instances of these concepts and a set of assertions between them. The term assertion box or A-Box is used to describe a part of knowledge base which consists of instances, i.e. individuals [9, 10, 11]. Therefore, the ontology comprises of T-Box (classes) and A-Box (instances).

At this moment, performances between loading the data into dimensional tables and loading the data into the ontology should be compared.

PL/SQL algorithm must be altered once again – this time, it would select the dimensional data directly from the ontology instead of selecting the data from dimensional tables. Now, performances of this algorithm (which queries ontology) would be compared with performances of the basic (pure PL/SQL) algorithm, as well as with performances of the algorithm that combines T-Box ontology and data stored in dimensional tables.

VI. CONCLUSION

As previously stated, ontologies are self-explanatory by their nature. They provide metadata schemas and offer a controlled vocabulary of concepts [1], and as such, present a solid foundation for the description of business domains and business rules. The question remains, whether ontologies are worth introducing into a DWH system, i.e. whether ontologies’ benefits (reusing expert knowledge

from different domains; enriching specific metadata by completing definitions and annotating their semantics; empowering data integration and analysis; automating reasoning on metadata; helping to understand the meaning of the notions that are actually modelled [13]), justify additional work surrounding their design, as loading, maintaining and combining them with PL/SQL code. Also, it is questionable whether the merging of ontologies and PL/SQL code would slow down ETL batches which load the data into DWH system.

By conducting the case study, presented in the previous section, I will try to answer those questions.

REFERENCES

[1] Dr. Waralak V. Siricharoen, "Ontology Modeling and Object Modeling in Software Engineering", in

International Journal of Software Engineering and its Applications, Vol. 3. No. 1., January, 2011

[2] Chokri Ben Necib, Johann Christoph Freytag, "Ontology Based Query Processing in Database Management Systems", in Proceedings of CoopIS/DOA/ODBASE, pp.839~857, 2003.

[3] Justas Trinkunas, Olegas Vasilecas, "Building Ontologies from Relational Databases Using Reverse Engineering Methods", in Proceedings of CompSysTech, 2007

[4] Azman Taa, Mohd Syazwan Abdullah, Norita Md. Norwawi, "RAMEPs: a goal-ontology approach to analyse the requirements for data warehouse systems", in WSEAS Transactions on Information Science and Applications, Vol. 7 Issue 2, pp. 295-309, February 2010

[5] Dung Nguyen Xuan, Ladjel Bellatreche, Guy Pierra, "Ontology Evolution and Source Autonomy in Ontology-based Data Warehouses", 2006 [Online]. Available: http://www.lisi.ensma.fr/ftp/pub/documents/papers/2006/20 06-RNTI-Xuan.pdf [Accessed: May. 4, 2012]

[6] A. Martin, M. Manjula, Dr. Prasanna Venkatesan, "A Business Intelligence Model to Predict Bankruptcy using Financial Domain Ontology with Association Rule Mining Algorithm", in International Journal of Computer Science Issues, Vol. 8, Issue 3, No. 2, May 2011

[7] K. Andrić, V. Vrbanić, D. Kalpić, "BI in practice - developing credit risk reporting module", 32th International Convention Proceedings, Conference: MIPRO 2009, ISBN 978-953-233-046-5

[8] Oracle Database Semantic Technologies Developer's Guide 11g Release 1 (11.1), Part Number B28397-05 [Online]. Available: http://docs.oracle.com/cd/B28359_01/ appdev.111/b28397/sdo_rdf_concepts.htm#CIHECGII [Accessed: May. 10, 2012]

[9] Vedran Podobnik, “Višeagentski sustav za pružanje telekomunikacijskih usluga zasnovan na profilima korisnika”, Faculty of Electrical Engineering and Computing, University of Zagreb, PhD thesis, Zagreb, 2010 [10] J. Wu, I. Ilyas, G. Weddell, "A study of ontology-based query expansion", Technical report CS-2011-04, University of Waterloo, 2011

[11] Serge Abiteboul, Ioana Manolescu, Philippe Rigaux, Marie-Christine Rousset, Pierre Senellart, "Ontologies - Querying Data through Ontologies", 2011 [Online].

(6)

Available: http://webdam.inria.fr/Jorge/files/slquery-onto.pdf [Accessed: Jun. 21, 2012]

[12] Hugo Y. K. Lam, Luis Marenco, Tim Clark, Yong Gao, June Kinoshita, Gordon Shepherd, Perry Miller, Elizabeth Wu, Gwen Wong, Nian Liu, Chiquito Crasto, Thomas Morse, Susie Stephens, Kei-hoi Cheung, "Semantic Web Meets e-Neuroscience: An RDF Use Case", In Proceedings of the ASWC International Workshop on Semantic e-Science, pp. 295-309, 2006

[13] Jesús Pardillo, Jose-Norberto Mazón, "Using ontologies for the design of data warehouses", in International Journal of Database Management Systems (IJDMS), Vol.3, No.2, May 2011

[14] Fons Wijnhoven, Edwin van den Belt, Eddy Verbruggen, Paul van der Vet, "Internal Data Market Services: An Ontology-Based Architecture and Its Evaluation" in Informing Science Journal, Volume 6, 2003

Figure

Fig. 2. Simplified star schema of the enterprise model based on  the fact table that contains daily balances
Fig. 5. Ontology-based representation of business rules of the hierarchy presented in Fig

References

Related documents

I utilized the video that accompanies Chapter One as well as targeted email communication to recruit a group of seven SAPros who self-identified as experiencing social justice

After designing spatial-aware appropriate data partitioning and query optimizers for distributed scalable storage and batch processing, we have realized that those

Key Words: Analytic training, auditory learning, aural rehabilitation, cognitive changes, computer assisted therapy, interactive communication strategies, listener confidence,

participation list that delineates the week during the term that you will be on call. Multiple students may be on call during any particular week. If you cannot attend or will not

Und, wenn man den App-Begriff weiter fasst und sagt, dass muss jetzt nicht unbedingt eine native App sein, sondern es kann auch eine Web-App sein, dann kann das durch aus sein,

Failing to disclose your medical history, and that of your dependants, accurately on this application for entitles Medshield to terminate your membership and that of any

The harmonization of the banking and other financial services legislation as component of the EU’ Single Market, the creation of the European Economic and

Often, Spain sought to close these loopholes after each new treaty with Britain, However, constant shifting in Spanish governance and diplomacy often left an unclear foreign