Abstract. Introduction. Normalization INTERFACING NORMALIZED RELATIONAL DATABASE STRUCTURES WITH

(1)

INTERFACING NORMALIZED RELATIONAL DATABASE STRUCTURES WITH SAS@SOFTWARE

James R. Johnson, Glaxo Inc. Roger D. Cornejo, Glaxo Inc.

Abstract

The use of relational database (RDBMS) technology and different levels of normalization (1st, 2nd, 3rd, 4th normal data structures) is proliferating throughout the data processing industry. RDBMS systems are valued for their ability to maintain the integrity of data, reduce unnecessary data redundancy, and provide maximum flexibility in retrieval. At the same time SAS@ software is established as the general fourth generation

language (4GL) tool for data analysis and

reporting. Clearly, the use of relational databases and SAS® software should be included in the ideal tool kit for systems development. Use of the different and distinct forms of normalized data structures, within an RDBMS, impacts the creation and use of SAS data sets. Well designed ROBMS applications typically result in normalized tables that remove repeating data groups, duplicated data, and establishes appropriate key to data associations within a table. Many of the powerful features in the SAS system utilize data sets that are not normalized, thus presenting several interfacing problems. This dichotomy in each products usage need not force a compromise in ~ach products strengths. In this paper we will discuss four different normalization forms used in relational databases, as well as presenting methods of interfacing with three normalized data structures using SAS® software.

Introduction

Throughout the 1980's the implementation of relational theory in the design and development of database management systems has been evident with the introduction of such products as DEC's RdB®, IBM's DB2®, and Oracle Corporation's ORACLE RDBMS®. Each of these database engines utilize the basic relational principle that tables of information can be defined which obey constraints, called relations, as outlined by Date (1981). Relational database technology offers application developers and end-users with a tool that can improve the usefulness and quality of data by organizing the information into varying degrees of normalized tables which will better meet the business needs of the application being developed.

One of the problems that has persisted throughout the development of relational technology is the lack of procedural data analysis and reporting tools for application developers and end-users to interface with a relati.onal database. SQL is the industry standard data manipulation language

(DML) for Selecting, Inserting, Updating, and Deleting rows and columns of data from a relational database. SQL i s also a data definition language (DDL) for creating and deleting database objects (table1s, views, indexes, synonyms, etc.). Additionally, SQL is a data control language (DCL) where user access to' the database and its tables and views are granted

(Grants). SQL is not a procedural language since i t does not offer programming constructs such as logical sub setting with IF-THEN-ELSE or CASE statements, or looping with DO-END, DO-WHILE, or DO-UNTIL programming constructs. SAS® Software, in contrast, does have a procedural language component in the DATA step that supports logical and looping programming constructs. The library of SAS® Software procedures is internationally accepted as one of the most comprehensive tool sets of pre-packaged analysis routines available for application developers and end-users. SAS/ACCESS® software specifically offers an interface to relational databases, thus providing the application developer and end-user with a tool for taking advantage of the strengths of relational database technology and the procedural language components of the SAS System to solve various business needs.

It is the intent of this paper to discuss the varying degrees of a normalized database, through a real world example, and how the use of FROC SQL, the DATA step and SAS System procedures call effectively be used to access and analyze information stored in these data structures.

Normalization

Generally speaking, normalization is a formal

process of organizing data into table~

(relations) of logically related information that satisfy conditions defined for the various normal forms (Kemm 1989). The normalized database environment is a means by which an organization can:

Provide a base of all data elements relevant to the business requirements.

Provide a processing environment that makes these data elements easily available to all appropriate users, both current and future. Ensure data integrity in a l l of the data elements.

Provide a stable, reproducable, highly flexible, and standardized data architecture to meet the clearly defined business needs. Optimize the performance of the database.

(2)

In poorly designed r e l a t i o n a l database applications (non-normalized) the developer and end-user will often find that some or all of the following anomalies may take place.

un-related data elements may be placed into tables together.

data elements may appear as repeated groups of information (e.g. un-indexed arrays of data elements)

data element values may be stored as null or blank for selected types of rows in the table.

referential integrity problems may exist that can result in a corrupt database.

I t is the goal of the normalization process to reduce and in most cases eliminate these features of poorly designed applications. Normalization techniques were created to provide database designers with a methodology for detecting and preventing these problems. Normalizing an application will remove un-controlled redundancy of data elements, reduce the amount of structural changes to table designs, and favor the transaction process for inserting, updating, and deleting rows in a table.

Many developers have arrived at INF, 2NF, 3NF database designs just by applying a common senSE understanding of the data elements which are to be stored in their applications. Nevertheless, there are distinct levels of normalization and rules for how each level can be applied to meet the business needs of the database being designed. Table 1. lists the various states of normalization to be discussed in this paper, as well as a simple rule for its existence.

Table 1. states of Normalization in a Relational databas<:I. *

NOODalization Form 1NF 2NF 3NF 'NF

Rule for Existence

No repeating groups of data elements. All non-key columns dependent upon the entire key.

All non-key columns dependent on the enti.re key and have no other dependencies. No Don-key column is dependent on another non-Don-key column.

No row i.n a table has more than one multi-valued d.ependency.

Other normali~ation states have been defined. but are not important to tbe discussion in this paper.

Non-Normalized Data Structure

A sample, single table, database that does not apply the rules of nonnalization defined in Table 1 is presented in Table 2. The sample data set for this table, as a line listing, is presented in Appendix 1. This example will be used to show

how the rules of normalization can be applied as well as demonstrate access to the data via PROC SQL and/or the DATA step. Access to this type of table structure is as simple as creating a descriptor file and view member of a catalog using SAS/ACCESS@ Software. The SAS/ACCESS to Oracle procedure was utilized for this paper. The rules and presentation could, however, be generalized for use with access to the Rdb and DB2 relational data base environments.

Table 2. Non-NoUlalized Data Table •.

Non-Normalized Data Table for Drug Survey Example. All information is contained in one data table.

*PATIENT *VISIT VISIT_DATE SEX AGE AGK_ONITS MARTIAL_STATUS TEST_l TEST_2 TEST 3 TEST_4 TEST_S BP_SYS BP_DYS lIEART RATE

* ..

primary key Null? Type NOT NULL NUMSER NOT NULL NUMBER

NOT NULL DATE CRAR(6) NUMBER CHAR(6) CHAR{~) CHAR(7) NUMBER NUMBER NUMBER NUMBER NUMBER CBAR(1)

NOT NULL NUMBER CBAR(7)

NUMBER

NUMBER NUMBER

NUMBER

This non-normalized table structure looks similar to the "file" structures historically utilized to store and retrieve information on a particular application. In the example, a l l repeating groups, infODmation regarding individual visits, and information about the investigator are physically stored together in a single row of the table. Also note that redundant information on data elements collected once per patient are proliferating throughout the table. Access to t h i s type of non-normalized table can be accomplished with the following simple PROC SQL, DATA step, or procedural examples. Note that in each case the SAS statements are reading directly from an ORACLE database table to complete the task requested. The FROC SQL example will automatically read and display the information requested in a single step, with no intermediate data set being fODmed.

Note that the naming conventions in the programming statements are different than the naming of columns in the respective tables. SAS/ACCESS@ Interface to Oracle software was used to create descriptor files to the databases~

Naming conventions in the access procedures must follow traditional 8 character naming conventions used historically in SAS software.

(3)

or

libname oracledb to [ ) to; proc Itql,

Itelect '" from oracledb. alldata; where patient '" 10;

quit:

proc print dat ... -oracledb alldata; Where patient .. 10;

The DATA step can optionally be utilized to read directly from an ORACLE database table or view to create a traditional SAS data set.

data test;

set oracledh. alldata; i f patient .. 10;

Clearly, the non-normalized data structure is not desirable for the problems that are associated with update and maintenance or the table (including the addition of a new test), excess and redundant storage of information (see appendix 1), an,d the poor performance that would be seen when sorting and querying the example. This sample data -set, along with almost every type of data storage and retrieval application can be analyzed and stored within a relational database management system using normalization techniques that follow in the discussion.

First Normal Form (1 NF) Data Structure

First no~l form (INF) provides for no repeating groups of information. This means that no array

~tructure~ (e.g. Al, A2, A3, . . . ) .::Ihould be utilized in the table design. It also means that all repeating groups of information are moved into their own separate table (along with the primary key columns) and values for Al, A2, A3 are stored as rows rather than columns. The primary reason for this form is that in most cases i t is difficult to predict the number of repeating elements that may be utilized, and adding another row to a table is trivial, while adding another column is not an elementary task. Additionally, if we need to perform operations on the values for AI, A2, A3, ___ in the non-normalized table the operation needs to be duplicated for each repeat group rather than coded once (e.g. Al*10, A2*lO, A3*lO, ••. , versus A*lO)

On occasion INF can be violated when the number of repeating elements i s fixed either by definition or business rule (e.g. fixed monthly columns such as JAN, FEB, MAR, . • . ) . De-normalization may also. at times be utilized for performance reasons to create valid INF violations. However, very clear, defined business or performance rules must be predefined to control the v i o l a t i o n of INF database architecture.

The normalization process will generate tables that generally favor update processing over the query process. Sometimes this is not a desirable situation for the application. For example, in tables utilized for query only, that are built and maintained by a loading u t i l i t y , lNF representation may be sufficient to support the application. It is often desirable in query-based applications to de-normalize the table for performance reasons (e.g. increase the access speed to the information). This can result in an increase in the redundancy of information in the table, albeit planned redundancy.

Table 3 presents a representation of the sample database where the repeating groups of test scores are moved into another Lable, thus creating a two table database environment. Notice that the TEST_SCORES table assigns the test number to an individual column while the cOLresponding test result is placed in the results column. The keys patient, visit, and test number are utilized to uniquely identify each record in the TEST_SCORES table.

Table 3. First Normal Form (lNF)

INF Data Tables for Drug Survey Example. Repeated groups of test scores are partitioned out into a test scores table. All other information resides in the patient table.

Table Name: PATIENTS

*PA'1IEN'1 *VISI'1 VISIT_DATE .. x RACE MAR'1IAL _ S'1ATUS STUDY DRUG BP_SYS BP_DYS HEAR'1_AA'1E * '" primary key *PATIEN'1 *VIS:tT *TES'1_NUM TEST_VALUE

* '"

primary key Null?

NO'1 NULL NUMBP.R NOT NULL HUMBER NOT NULL DATE

CHAR (6) NUMBER CHAR(6) CHAR(l) CRAR(1) cHAR(l) NO'1 NULL NUMBER

CHAR (1) NUMBER NUMBER NUMBER NUMBER Null? Type

NO'1 NULL NUMBER NO'1 NULL NUMBER

NUMBEl< NUMBER

Access to the collective set of information in the INF patient and t e s t scores tables is completed via a simple join of the tables using the SQL join construct in the WHERE clause. The SORT procedure and the DATA step may also be utilized to collectively evaluate the join of the two tables in a similar fashion. However, using traditional SAS programming constructs will create intermediate data sets in either a

(4)

permanent or work library, dependending upon how they are specified. Clearly, the use of the SQL procedure to join and display the information in a single step provides a more robust single step processing environment than the multi-step procedures and DATA step environment.

For purposes of discussion the ORDER BY clause is a similar statement to the SORT procedure and the WHERE clause completes the same function as a match MERGE/BY statement syntax in the DATA step. The WHERE clause can be extended to generate similar results as a sub-setting IF statement in the DATA step.

Access to the 1NF database tables via PROC SQL is as follows:

libname oracledb .. 11 " ;

proc sql;

select a .patient,b. visit, a. invnum, a. sex, a.race, b.testnum, b.result

from oracladb. fnpat a, oracledb fnscore b where a.patient '"' b.patient

and a. visit = b. visit order by patient, v i s i t ; quit;

Access via traditional SAS programming constructs would be as follows, assuming that the data are not stored in sorted order.

proc sort data=oracledb fnpat out"'"pat;

by patient visit; run;

proc sort data"'oracledh fnscore out=score;

by patient visit; run;

data scores; merge pat score; by patient visit;

keep patient visit invnum sex race testnum result;

run;

proc print data""scores; run;

Second Normal Form (2NF) Data Structure

Second normal form (2NF) provides that all of the columns in a table are dependent upon the entire key of the table. Any column of information in a 2NF designed table should require all elements that comprise the tables primary key to uniquely identify i t . If the column can be uniquely identified by only part of the key then that column, and the relevant keys, should be separated out into another table. More formally, a table is in 2NF i f i t is in INF and there are no columns which are dependent on only part of the primary key.

The information regarding repeated measurements of blood pressure and heart rate are v i s i t dependent, and as such should be partitioned out into an individual measurement table to meet the

rules of 2NF. A measurements table that uniquely identifies the visit# and associated measurements collected at a particular visit, is created to eliminate the problems of redundant data seen in Appendix 1. All data elements that are collected once and are visit independent are placed in the patients table.

Table 4. Second Normal Form (2NF)

2NF data tables for Drug Survey Example. Information not associated with a visit is partitioned out into a patient demog table, while information that is visit dependent is placed in a patient measure table. The test scores table is maintained from 1MP. Table Name,PATIENT_DeMOG Name *PATIENT SRX AGE AGE_UNITS RACE MARTIAL_STATUS STUDY _D:RUG rNV_NUM INV_NAMe INV paONE_EXT * '" primary key

Table Name, PATIENT_MEASURE Name *PATIENT *VISIT VISIT_DATE BP SYS BP_DYS HEART_RATE * '" primary key Null? Type

NOT NULL NUMBE:R

CaAR(6) NUMB . . CtlAl\(6) CHAR(l) CHAA(7) CHAF(l)

NOT NULL NUMBER CtlAl\(7) NUMBE:R

Null? Type NOT NULL NUMaE:R NOT NULL NUMBER NOT NULL DATE

NUMBER

Figure 1. presents a pictorial view of how the sample database is implemented in 2NF form. The key to this database design is in the fact that a one-to-many relationship exists between the

patient~demog tables and the patient~measure and test scores tables. It is the implementation of these one-to-rnany relationships that serve as the foundation to good database design.

Access to the information in each of the tables is obtained using a join, similar to the INF query, except that additional WHERE clause criteria is required to link the patient measure data with patient demography data via the patient key. The SQL statements below present the logical join required to read selected information from each of ~he tables in a 2NF implementation of the database. As with the 1NF join the example below will complete the join and display the results in a single step.

lihnatne oracledb "[]"; proc sql;

select a .patient, a. sex, a. race, a. studydg, a invname, b. testnum,b .rasult,

c.visit, c. visitdt, c .bysys, c.bpdys from oracledb. snpat a, oracledb. snscore b,

(5)

where a.pati.ent - b.patient and a.patient c.patient

and b. patient and b.v:!.sit

c.patient c.visit

order by a.patient, b.visit, b.testnum quit;

To accompllsh the same task, as in the above SQL statements, the following basic DATA step and procedure logic could be utilized to complete the join of these three tables. This traditional implementation of SAS programming steps assumes that the data are not stored in sorted order in the tables.

libname oracledb "[]"; proc sort data=oracledb snpat

out=pat; by patient;

proc sort data_oracledb.snscore out=seore;

by patient visit testoum; run;

proc 8<Jrt data",oracledb snrslt out~rslts;

by patient visit; data JllUltipat;

merge score (in=in1) rslts (in~in2) pat (in"'in3); by patient visit; i f in1 and in2 and in3; run;

proe print data=sntable;

Access to a 2NF database architecture need not be as cumbersome as in the. above example, i t can be as simple as reading from the traditional SAS data set structures utilized historically (e.g. the single SAS data set) by creating a view (stored SQL statement). The ability to create a vlew of the joined information and access the view directly, with either the DATA step or a procedure, is a new feature of SAS® Software

that will allow for greater speed of retrieval and much more robust access to data in a normalized database. Now applications can utilize the procedural language programming constructs that have been time tested over many SAS® Software applications with views directly into a database as wel~ as the traditiona~ OAO data set. The following example creates a view of the infoDffiation and permanently stores this view in a catalog. Note that this example creates a view of three tables of information in a manner that is consistent with looking at the data in lNF. Once this type of view of the database is made available to the end-user then a l l of the components of the SAS System become available to an application.

lihnama oracledb "[]"; proe sql;

create view oracledb. sntable as

select a.patient, a. sex, a. race, a. etudydg, a invname, b. teetnum,b. result,

c.visit, c .v:is:itdt, '" .by"ys, c. bpdy"

from oracledb . .,npat a, oracl ... db cns",ora b, oracledb. snrs.lt c

where a.patient = b.patient and a.patient e.patient and b patient

-

c.patient and b.visit c.visit

,

quit;

Once a view of the database is created and stored in the catalog then the complete power of the SAS System can be ut . .:i;lized to access, analyzer and report on information stored in the database as i f i t were a SAS data set. For example, simple statistical procedures can be executed directly against the database using SAS procedures. The fOllowing example reads directly from the database view defined as SNTABLE, created above, to complete a means and frequency distribution computation.

proc means n mean data=oracledb. sntable; by patient visit;

var result; run;

proc freq data"'oracledb.sntable; where visit'" 1;

tab.le atudydg*sex*race; run;

Advantages of Views into Databases

Creating and utilizing views offers several advantages when working with r e l a t i o n a l databases. These advantages are not limited to 2NF fODffi architectures and should provide benefit across all forms of database design. Advantages include:

savings on disk space as views are virtual usage o'f the machine. Views are nothing more than stored SQL statements (virtual tables); no disk space is used to store the data that is "project/joined" by the view.

data will always be current in that the database is accessed directly each time a query'is made.

data can be shared among more users without the creation of additional of data sets. many different views or combinations of views can be created to grant access to all or selected information in the database.

Third Normal Form (3NF) Oata Stnocture

Third Normal Form (3NF) eliminates all columns of information in a table that are uniquely identified by data items in the table other than key data columns. All non-key columns are dependent upon the entire key (2NF) and have n~

other field dependencies. This latter type of dependency is referred to as a transitive data dependency (Date 1981). Columns that have a transitive data dependency are removed, along with their respective-identifiers r into a separate table. Most database designers aim to

(6)

create a 3NF database environment. The Boyce Codd Normal Form (BCNF) is an extension of 3NF. BCNF

diffe~s f~om 3NF in that i t applies to candidate keys as well (Date 1981).

In the 2NF implementation of the sample database a transitive dependency exists in the patient

demog~aphics table. The information associated with an investigator (inv num, inv_name, inv_phone) is not totally dependent upon the components of the prima~y keys in the 2NF patients table. In fact, the investigato~s

information is independent of the patients information, but a patient cannot be without an

investigator. Therefore, the information

regarding an investigator is partitioned out into i t s own table thus eliminating the transitive dependency. The identifying key in the investigators table (inv_num) is left as a column in the patient data table to allow for the identification of the investigators information associated with that patients information. This is referred to as a foreign key.

The 3NF database design can implemented as shown in table 5. Table 5. shows that an investigators table is created and that the patients demographics table is altered to allow for the investigator number to be a foreign key. The transitive dependencies of investigator name and phone number are eliminated from the 2NF implementation. All other lNF and 2NF table designs are retained.

Tabl .. 4. Third Normal Form (3NF)

3NF data tables for Drug Survey Example. The information from the inve3tigator is partitioned out into a separate tab1e. The test Bcot"es table is maintained from INF, and the measur .. table is retained from 2NF.

Table Name: INVESTIGATORS

"'INV_NUM INV_NAME INV_PHONE_EXT '" = primary key Table Name:PATIENT_DEMOG Null? Type

NOT NOLL NUMBER CHAR (7) NUMBER

Name Null? Type

*PATIENT SEX AGE AGE UNITS RACE MARTIAL_STATUS STUDY DRUG '" '" primary key

NOT NULL NUMBER

CHAR(6) NUMBER CHAR(6) CHAR(l) CHAR (7) CHAR(l) NOT NULL NUMBER

Figure 2. presents a pictorial view of how the sample database is implemented in 3NF form. The keys to this database deSign are in the fact that

one investigator is associated with many patients, and a one-to-many relationship exists between the patient~ t a b l e and t h e patient_measure and test scores tables. The projection from 2NF to 3NF offers greater f l e x i b i l i t y in database design in that information is discretely grouped into logical tables of infor.mation that are properly related.

Acces~ to the 3NF databa~e is the same as with the 2NF database design. Additional WHERE clause c r i t e r i a are required to address the investigators t a b l e . The example below demonstrates the join of four tables to create a view into the sample database in 3NF form.

1ibname oracledb .. (]"; proc sql;

create view <:>racledb.tntable

select a. pat i6nt, a. sex, a. race, a studydg b. teatnum, b.result,

c. visit, c.visitdt, c.bysys, c .bpdys,

d. invname

from oracledb.tnpat a, orael .. db.tnscore b, oracledb.tnr.s1t c, oracledb.tninv d where a.patient = b.patient

and a.patient '" c.patient and b.patient .. c.patient and b.visit c.visit and a.invnum 'Z d.invnum ;

quit;

Access to a 3NF implemtation of the database using traditional SAS programming (DATA and ~ROC

steps) constructs would be similar to the 2NF database implementation. Additional sorting and merging would be required to match merge the investigator with the patient. The traditional SAS programming steps to match merge these tables' of information is not being presented because as a database is fully normalized the benefits of using views becomes obvious with the saving of machine and programming resources.

Fourth Normal Form (4NF)

Fourth normal form (4NF) provides for the decomposition of a relation into two or more projections (e.g. subset tables) if the relation has multi-valued dependencies that are not functional dependencies (Date 1981, Watterson, 1989). Simply stated, i f a table has more than one column (or sets of columns) which are independently dependent upon the key then these column(s) and key should be partitioned out into a separate set of tables. It is not the intent of this paper to discuss the architecture, access, or retrieval from this form of normalized database in the examples presented. This definition is presented to demonstrate that the relational model can be extended beyond the traditional 3NF forms that is normally used by database designers.

(7)

Transposing a Normalized Data Structure

One of the strengths of a normalized relational database outlined thus far is the elimination of repeating groups (e.g. arrayed structures). This aspect of normalization is carried from the very first step in applying normalization techniques to database design. Figure 1 clearly shows that the TEST_SCORES table provides for three keys to fully describe the individual rows in the table (patient, visit, and test_num), thus eliminating

the array constructions. Often programmers want

to present or analyze information structured with repeating groups utilizing the strengths of array processing. An application can s t i l l take advantage of these array processing features by creating a discrete view that transposes the information. In the sample database a view of the TEST_SCORES table is created that transforms the

individual test numbers into columns. For example the following syntax creates a permanent view into the TEST_SCORES table that is the join of

the table on i t s e l f once for each of the test

numbers (1 through 5) .

proe aql;

create View oracledb.tranpo!!e a$

quit;

!!elect tl.patient, tl.vi!lit, tl. result ' test_l' , t2.reault 'teat_2', t3.reault 'teat_3', t4.r .. sult ' t .. st_4', t5.result ' t .. st_5' from anacore tl, /!Inscore t2, t3, snscor" t4, snscore t5

where tl.patlent "" t2.patient

and tl.patlent

.

t3.patient

and tl.patient

-

t4.patient

and tl.patient = t5.pati .. nt

and t1.visit - t2.visit

and t1.visit t3. visit

and tl.vi .. i t

-

t4.vi .. i t

and tl.visit

-

tS.visit

and t l . t .. stnum ..,

,

and t2. testnum - 2 and t3. t .. atnum = 3

and t4. to!lstnum .. 4 and t5.tO!lstnum =

Access to this type of view can utilize the ARRAY statement syntax supported in the DATA step or with the VAR statement in procedures as demonstrated in the following example.

data _ nl1ll_:

aO!lt oraeledb. tranpose;

array teats{5} test_1 - test_5; do j=l to 5;

i f test{j} = . then put "Missing Test Scor ....

patient'" +1 visit- ; end;

run;

proe print data-oracledb. tranpose; var patient visit test_2 test_4; run;

proe mean" data=oracledb .tranpoae; var test_1 teat_3 test_5; run;

Summary

In this paper a discussion has been presented on the use of different degrees of normalization using a relational database and the interface with SAS@ Software's procedural language components. The use of a relational database can improve the quality and accessibility of information requirements in many different applications. By applying normalization techniques, application developers and end-users alike can take advantage of the str~ngths of storing data in the various normal forms (lNF,

2NF~ 3NF, or 4NF). These strengths include, but are not limited to, the elimination of redundant

data~ broader access, and improved utilization of

disk space. with the i n t r o d u c t i o n of SAS/ACCESS@ software the application developer and end-user now have the ability to fully integrate the strengths of SAS programming constructs (DATA step, Macro facility, and s t a t i s t i c a l procedures) with a relational

database. The combination of these two

application development software packages provides a tool set for systems developed in the 1990's.

Acknowledgements

The authors wish to thank Ms. Nancy Wheeler, and Mr. Robin Pasley for their contributions and

review to this paper.

References

Date C.J. 1981. An IntroductLon to Database Systems. Addison-Wesley Publishing. Reading, MA. 574 pages.

Kemm T. 1989. A Practical Guide to Normalization. DBMS December 1989 pp:46-52.

Watterson K. 1989. From chaos to order. Data Based Advisor, Feb. 1989 7(2) :33-37.

Contact Information

James R. Johnson Roger D. Cornejo Glaxo, Inc.

MIS - Scientific Computing Five Moore Drive

Research Triangle Park, North Carolina 27709 Phone: (919) 248-7341

FAX:J: (919) 248-2571

SA.'l, SAS Software, and SAS/ACCESS Software are registered trademark~

ot SAS Institute Inc., Cary, North Carolina USA.

DB2 i~ a registerad trademark of IBM Corporation.

Rdb Is a r~i ... tered trademark of Digital Equipment Corporation.

ORACLE ROBMS 1" a registe"",d trademark of Oracle Corporation, B<:>lmont, California USA.

(8)

Figur,e 1. Physical Implementation of a 2NF form database for the example. PAllENT.J)BMOG Primory Key .paII ... , PrirnooyK.,..

~-....

~ • ... ~~"m

Figure 2. Physical implementation of 3NF database tables for sample application.

INVSSTIOATORS -inv __ .... PATII!NT..MEASIJlt£ TIIST..s<:Oll1!S PrlmKyX.y. ~-. ~.~

~--APPENDIX 1. Sample Data for Normalization Examples.

Sample Drug/Response SUrvey Information Age Martial Test Scores ---Patient Visit Visit Date Selt Age Units Race status

10 10 10 10 10 20 20 20 20 20 30 30 30 30 30 40 40 40 40 40 50 50 50 50 50 1 2 4 5 1 4 5 1 2 01-JAN-89 MALE 05-JAN-89 MALE 01-FEB-89 MALE 15-FES-89 MALE 01-MAR-89 MALE 15 YEARS C 15 YEARS C 15 YEARS C 15 YEARS C 15 YEARS C 15-JAN-89 FEMALR 25 01-FEB-89 FEMALE 25 15-FEB-89 FEMALE 25 15-MAR-89 FEMALE 25 15-APR-89 FEMALE 25 YEARS B YEARS B 'tEARS 8 YEARS B YEARS B 26-JAN-89 MALE 10-FE8-89 MALE 25-FEB-89 MALE 05-MAR-89 MALE 26-MAR-89 MALE 35 YEARS C 35 YEARS C 35 YEARS c 35 YEARS C 35 YEARS C 01-FEB-89 FEMALE 45 YEARS B 10-FE8-89 FEKALE 45 YEARS 8 26-FEB-89 FEMALE 45 YEARS B 25-MAR-89 FEMALE 45 YEARS 8 16-APR-89 FEMALE 45 YEARS B 15-FEB-89 MALE 55 'tlUIRS C

01-MAR-89 MALE 55 YEARS C

07-MAR-89 MALE 55 YEARS C 15-MAR-89 MALE 55 YEARS C 26-HAR-89 MALE 55 'tlUIRS C

"""=

SINGLE SINGLE SINGLE snmLl!! MARRIED

"""""'D

MAAR"'D MARRIED MAAR"'D SINGLE SINGLE SINGLE SINGLE SINGLE 20 2. 26 21 20

..

78 77 79

'7

58 57 59 61 66 MARRIED 33 MARRIED 53 MARRIED 63 MA!\RIED 31 MARRIED 32 SINGLE SINGLE SINGLE SINGLE SINGLE

"

61 27 35 2'

"

40 50 50 42 41 65 63 65 69 70 55 55 45 56 55 60 60 60 61 63 100 45 99 44 97 43 100

..

90 84

..

67 65 67 67 " 5 61 10 65 12 66 n 71 19

"

99 98 97 96 99 30 35 37 39 42 os 99 100 100 99 100 25 19 27 26 25 30 100 100 100 100 100 91 99 100 16 so 83 B5 91 B5 Study Inv. Drug NUIII. A A A A A B B B B B B B B B B A A A A A B B B B B 01 01 01 01 01 05 05 05 05 05 01 01 01 01 01 10 10 10 10 10 05 05 05 05 05 Inv. N _ Smith Smith Smith Smith smith Phone Ext. 1223 1223 1223 1223 1223 DeMarco 8901 DeMarco 8901 DeMarco 8901 DeMarco 8901 DeMarcQ 8901 Smith Smith smith Smith smith 1223 1223 1223 7223 1223 Yourdon 5512 Yourdon 5512 Y ourdon 5512 Yourdon 5512 Yourdon 5512 DeMarco 8901 DeMarco 8901 DeMarco 8901 DeMarco 8901 DeMarco 8901 BP SyS/Dys 120/12 122/76 134/79 126/12 135/79 129/81 127/11 125/11 121/72 128/80 135/68 135/69 ./. ./. 134/77 161/80 165/72 166/82 165/88 178/90 192/99 188/86 115/82 166/19 126/12 Seart . . c_ 75

..

"

99 97 66

"

67

"

75 80 .2 80

"

82 85 84 84 75 74 72 74 65