Design of a Multi Dimensional Database for the Archimed DataWarehouse

Full text

(1)

Design of a Multi Dimensional Database

for the Archimed DataWarehouse

Claudine Bréant, Gérald Thurler, François Borst, Antoine Geissbuhler Service of Medical Informatics

University Hospital of Geneva, Geneva, Switzerland

Abstract

The Archimed data warehouse project started in 1993 at the Geneva University Hospital. It has progressively integrated seven data marts (or domains of activity) archiving medical data such as Admission/Discharge/Transfer (ADT) data, laboratory results, radiology exams, diagnoses, and procedure codes. The objective of the Archimed data warehouse is to facilitate the access to an integrated and coherent view of patient medical in order to support analytical activities such as medical statistics, clinical studies, retrieval of similar cases and data mining processes. This paper discusses three principal design aspects relative to the conception of the database of the data warehouse: 1) the granularity of the database, which refers to the level of detail or summarization of data, 2) the database model and architecture, describing how data will be presented to end users and how new data is integrated, 3) the life cycle of the database, in order to ensure long term scalability of the environment. Both, the organization of patient medical data using a standardized elementary fact representation and the use of the multi dimensional model have proved to be powerful design tools to integrate data coming from the multiple heterogeneous database systems part of the transactional Hospital Information System (HIS). Concurrently, the building of the data warehouse in an incremental way has helped to control the evolution of the data content. These three design aspects bring clarity and performance regarding data access. They also provide long term scalability to the system and resilience to further changes that may occur in source systems feeding the data warehouse.

Keywords:

Data warehouse; Medical Informatics; Database

1. Introduction

A data warehouse can be very simply defined as a copy of enterprise transaction data specifically structured, integrated, and organized for query, data analysis, and decision support applications [1,2]. In that regard, the architecture, the life cycle, and the end users of a data warehouse are profoundly different from those of transactional systems [1,2]. Data models and server technology that speeds up transactional processing may not be appropriate for query and reporting processing. It is therefore recommended to develop a data warehouse separately from the transactional environment. Its main requirement is to provide an intuitive, easy and performant access to enterprise wide data so that queries and reports can be quickly produced on a regular basis.

(2)

Radiology Laboratory

ADT

Procedures Diagnoses

Hospital Information System (HIS) components (on line transactional

systems) ARCHIMED data warehouse

(analytical and decision processes) ADT Laboratory Radiology Diagnoses Procedures patient episode of care (analysis) medical service (activity) data marts

Figure 1 – The Hospital Information System components and the Archimed data warehouse

A hospital data warehouse organized around patient medical data involves integrating a wide variety of health data, including patient records, medical images, and genetic information, for the purpose of researching and improving the diagnosis and treatment of

diseases. The objective is to use information technology to help achieve personalized health care by leveraging all the information and knowledge that exist about treated patients and, for instance, by studying how new patients compare with other patients with similar characteristics [3,4].

The development of the Archimed medical data warehouse started at the Geneva University Hospital in 1993. It has progressively integrated seven data marts (or domains

of activity) archiving Admission/Discharge/Transfer (ADT) data, laboratory results,

radiology exams, physiotherapy exams, clinical data relative to childbirth, diagnoses, and procedures codes. In the architecture of Archimed a data mart is simply a subset of the database for a specific domain of activity within the patient treatment workflow. As shown in figure 1, HIS operational systems are organized around applications managing for example ADT data or laboratory results. Archimed proposes a different organization of the data, which is dedicated to analytical activities such as medical statistics, clinical studies, retrieval of similar cases and data mining processes. As shown in figure 1, the Archimed data warehouse is organized around subject areas such as patient medical data, episodes of care, and medical specialties. End users can also access to the database through web-based applications [5].

This paper discusses three principal design aspects relative to the conception of the database of the Archimed data warehouse, towards the archiving and organization of patient medical data. The three following topics are described and discussed in the following sections:

1) the granularity of the database, which refers to the level of detail or summarization of data in the data warehouse,

2) the database model and architecture, describing how data will be presented to end users and how new data is integrated

3) the life cycle of the database, in order to ensure long term scalability of the environment.

The many other aspects of the Archimed system such as data source acquisition, end user applications, and meta data, can be found in [6,7].

2. The Archimed database records patient atomic data (or elementary facts)

The granularity of the data warehouse database refers to the level of detail or summarization of data in the data warehouse. The storage of atomic data (low level of

(3)

granularity) versus storage of aggregated data (high level of granularity) is a major design issue since that it will affects the entire architecture of the data warehouse environment and more specifically the volume of the database and the type of query that can be answered. It is generally recognized that expressing the data at a low level of granularity will help to achieve a robust data warehouse, as it becomes resilient to changes and can easily accommodate new user needs. Moreover, when the fact tables are granular, they serve as the natural destination for operational data that may be extracted frequently from the operational systems.

The transactional hospital information system (HIS) creates and updates records describing most events occurring during patient in-patient or out-patient care. Events of interest for the Archimed data warehouse include encounters, observations, treatments, diagnosis and procedures. Relevant and validated facts such as encounter dates, laboratory results, radiology exams, diagnoses and procedure codes are transmitted to Archimed. The database has been designed to record patient medical data with that same level of detail under a format called standardized elementary fact. This offers many advantages over the storage of aggregated data. Indeed, whereas aggregated data already embeds the user queries, atomic data can be reshaped and presented in any format needed. Future unknown requirements or unusual queries can be accommodated, achieving flexibility of the system and reusability of the data.

Each elementary fact in the Archimed database is represented by its value (generally a numeric or a code), along with several additional properties describing the context of this value such as :

• patient/episode identification - patient identification number, - episode identification number, • medical structures

- service (medical responsibility), - medical unit (patient location),

• basic patient description to facilitate statistical queries - patient age,

- patient sex,

• link to dictionaries and nomenclatures which are specific to each domain of activity

- property name (laboratory result name, radiology exam, diagnosis or procedure description)

Corresponding dictionaries are necessary for a correct interpretation of the fact at the time the value was produced. In that regard, historic versions of these dictionaries are managed if necessary.

• link to the time axis

- date of the event that produced the value - date of archiving of the fact in the database.

Furthermore, elementary facts in Archimed are standardized the following way. Facts coming from different domains of activity are expressed using a common template, including a value and the set of corresponding properties. Basic properties are mandatory, additional ones are optional and their number may vary according to the domain of activity as shown in figure 2. The Archimed standardized elementary facts can therefore easily accommodate a new domain of activity.

The storage of aggregated data can be useful to speed up queries and optimize the database usage. They can be calculated from a group of selected facts from which can be applied aggregated functions (sum, average, count, etc.). The Archimed database doesn’t currently

(4)

store aggregated data; this area is however under consideration for future developments. Patient lengths of stay, indicators describing counts of diagnoses, or surgical procedures, and re admissions rates are some of the aggregates of interest.

The next section describes the database model used to structure these elementary facts.

3. Patient medical data is organized using a multi dimensional approach

The Archimed database relies on a multi dimensional modeling approach. The dimensional model is a logical database design method particularly well suited to data warehouse databases [2]. In particular, it has the great advantage to allow data coming from various heterogeneous and independent sources, such as Archimed elementary patient facts, to be presented in an intuitive and standardized fashion. Contrary to the usual Entity/Relational data model used by transactional systems, it limits the number of tables to fact and

dimension tables.

For instance, the multi dimensional model in Archimed for the Laboratory data mart includes:

• one fact table, recording the elementary facts produced by the operational system, namely the patient laboratory result values.

• a set of usually smaller dimension tables, each linked to the fact table through a primary key, and describing the context of interpretation of elementary facts. The dimension tables in the Laboratory data mart include tables describing patients, medical services, medical units, laboratory exams, laboratory ranges values, and units.

Figure 4 shows the general star schema data model used for the representation of an Archimed data mart. The fact tables all contain zero or more facts that represent values (measurements) taken at each combination of the dimension key components. The fact table is also often represented by a data cube where each axis corresponds to one dimension, as shown in figure 3 [1,2].

As the Archimed data warehouse is composed of several data marts, the whole data model results in a set of inter connected star schemas including a collection of fact tables

patient da te/ti me m e di ca l se rv ic e diagnostic code

Figure 3 – Three dimensional data cube describing diagnostic codes along three dimensions : patient, medical service, and time

Instanciated Procedure fact Instanciated Laboratory fact

patient: 345353

episode: 1000011

medService: INTERNAL MED

medUnit: 7-AL

patientAge: 46

patientSex: F

property name: glucose

value: 6.8 dateOfFact: 2004-02-23 dateOfArchive: 2004-02-24 unit: mml/L material: plasma range: 4.2-6.0 patient: 3345453 episode: 1190002 medService: OBGYN medUnit: 2-AL patientAge: 34 patientSex: F

property name: C-section

value: elected dateOfFact: 2004-05-21 dateOfArchive: 2004-05-22 patient: 957688 episode: 1000474 medService: pediatric medUnit: 1-AL patientAge: 7 patientSex: M

property name: out-patient

value: regular entry

dateOfFact: 2004-01-05

dateOfArchive: 2004-01-06

chief complaint: accident

Instanciated ADT fact

Figure 2 - Three instanciated Archimed facts according to the standardized elementary fact template

additional optional properties

(5)

describing patient medical data (encounters, laboratory results, diagnoses and procedures,

radiology exams and so on) and a set of dimension tables where some of them are shared by many facts tables. Indeed, the patient table or the medical services table record data meaning exactly the same thing in each data mart.

Dimension tables shared by several fact tables are called conformed dimensions [2]. Conformed dimensions constitute a very important and powerful aspect of the multi dimensional model. Indeed, the definition of conformed dimensions establishes the links between the data marts, providing an integrated view of the elementary facts. It builds the foundation for a simple and performant querying of the patient medical data as if it were initially part of the same database.

For example, the queries ‘retrieve all medical data of patient of age 30-40 during its last

hospitalization in the internal medicine department’ or ‘retrieve all patients of the pediatric department having low hemoglobin and diagnosis of diabetes’ require many steps of

laborious treatment when issued at the various hospital transactional database systems. It is however very simple to solve with the Archimed database model.

Establishing, enforcing, and maintaining, the dimension tables of the data warehouse Archimed are therefore important tasks respectively during the initial planning of the database architecture, during the integration process of a new data mart, and for the maintenance of the database overall coherence.

4. Implementation and life cycle of the Archimed database

The Archimed database was first developed using the Ingres database system before being recently migrated to the Oracle relational database system. In brief, the physical implementation of the database has required the definition of a physical data model, proper indexation of the tables, and definition of rules to consistently name tables and attributes.

Unlike classical transactional systems, a data warehouse is in perpetual evolution and cannot remain static. Indeed, its evolution must follow the changes occurring in the organization which it serves. User needs appear or change. New data sources become accessible and must be integrated in the warehouse.

The process of adding a new data mart to the architecture complies with the following steps. First, a detailed analysis of the source data provided by the transactional systems is carried out. Elementary facts are identified, setting the granularity of the fact table. Then, conformed dimensions are carefully highlighted; they define the links with the other data marts. Other dimensions are also identified which complement the definition of the

fact table fact table fact table fact table conformed dimensions (shared by at least 2 fact tables) dimensions dimensions

Figure 4 - Four Archimed data marts inter connected through conformed dimension tables patient

episode

med. services med. unit

(6)

elementary facts. A mapping between source data items and the data warehouse table attributes is established. Then, the physical table and index definitions can be derived and implemented.

The Archimed data warehouse has been built incrementally, in order to break the implementation task to manageable proportions. Each data mart has been successively implemented and connected to the overall architecture.

Moreover, the Archimed database was designed to anticipate changes and the evolution of existing data sources. As shown above, the data model accommodates the need for new descriptors and new dimensions without to have to change the database schema.

5. Conclusion

The planning, realization, maintenance and evolution of a large data warehouse database system can be overwhelming. Source data is complex, volumes are large, and portions of data will be dirty, erroneous, or hard to understand.

In the case of a data warehouse for medical data, we have discussed the three main design aspects of the Archimed database.

The organization of elementary facts (atomic) using the dimensional model has proved to be a powerful tool in order to integrate data coming from multiple heterogeneous database systems developed independently throughout the institution. It brings clarity and performance regarding data access. It also provides scalability to the system and resilience to further changes that may occur in source systems feeding the data warehouse.

Moreover, the building of the data warehouse in an incremental way has helped to control the evolution of the data content

These design options constitute the main rules and act as a road map to follow when a new medical set of data (or data mart) is integrated to the system. Sticking to these decisions has enabled to ensure the coherence of the integrated data, long term scalability, and to greatly simplify the access to data produced by tens of independent hospital wide operational databases.

These rules have been established progressively and result from more that 15 years of prior experience in providing information to hospital administrators and medical staff including departmental statistics, patient similar case retrievals, and clinical studies.

6. References

[1] Inmon WH, Building the Data Warehouse, Wiley; 3rd edition, 2002.

[2] Kimball R, Ross M. The Data Warehouse Lifecycle Toolkit, Wiley; 2nd edition, 2002.

[3] Kerkri R, Quantin C, Yetongnon K, Dusserre L. Les entrepôts de données:application au suivi

épidémiologique. Informatique et Santé, Springer-Verlag, France, Paris 1998(10):21-29.

[4] Ledbetter CS, Morgan MW, Toward best practice: leveraging the electronic patient record as a clinical data warehouse J Healthc Inf Manag. 2001 Summer;15(2):119-31.

[5] Lehner B, Thurler G, Bréant C, Tahintzi P, Borst F. Retrieval of Similar Cases using the ARCHIMED Navigator. MIE, 2003.

[6] Thurler G, Bréant C, Lehner B, Bunge M, Samii K, Hochstrasser D, Nendaz M, Gaspoz JM, Tahintzi P, Borst F. Toward a Systemic Approach to Disease. Complexus, 2003;1:117-122.

[7] Thurler G, Borst F, Bréant C, Campi D, Jenc j, Lehner B, Maricot P, and Scherrer JR. ARCHIMED: A network of Integrated Information Systems. Method Inform Med 2000; 39: 36-43.

Address for correspondence

Claudine Bréant HUG – Service d’Informatique Médicale 24, rue Micheli-du-Crêts 1211 Genève 14, Switzerland, claudine.breant@sim.hcuge.ch

Figure

Updating...

References

Updating...

Related subjects :