• No results found

Modeli Relacionar I Te Dhenave ,database (baze te dhenash)

N/A
N/A
Protected

Academic year: 2021

Share "Modeli Relacionar I Te Dhenave ,database (baze te dhenash)"

Copied!
31
0
0

Loading.... (view fulltext now)

Full text

(1)

Modeli relacionar I te dhenave

Zemra e modelit relacionare te te dhenave eshte relacioni, porse ndikim te madh kane edhe celesta, relatat, lidhjet, varesite funksionale dhe transitive apo edhe anomalite pergjate ndryshimeve.

Relacioni

Relacioni eshte elementi themelore ne nje model relacionare te te dhenave.

Figura – Relacioni ne modelin relacionare te te dhenave

Nje relaciona duhet ti nenshtrohet rregullave te meposhtme: Relacioni (file, tabele) eshte nje tabele dy-dimensionale:

1. Atributi (ose fusha ose e dhena njesi) eshte nje kolone ne nje tabele. 2. Cdo kolone ne nje tabele ka nje emer unik ne ate tabele.

3. cdo kolone eshte homogjene prandaj dhe hyrjet ne cdo kolone jane te te njejtit tip( psh emir, modha, numri punes per cdo punonjes etj).

4. Cdo kolone ka nje domain, bashkesia e vlerave te mundshme qe mund te shfaqen ne ate kolone.

5. Nje record eshte nje rreshti me te dhena ne nje tabele. 6. Renditja e rreshtave dhe e kolonave nuk ka rendesi.

7. Vlerat e nje rreshti kane lidhje me dicka ose me nje pjese te te dickaje. 8. Grupet perserites nuk jane te lejueshme.

9. Rreshtat e dyfishte nuk jane te lejuar.

10. Qelizat duhet te permbajne vetem nje te dhene, porse mund te kete gjatesi te ndryshme. Vlera te vetme kane kuptmin qe:

o Nuk mund te permbajne te dhena te tilla si 'A1,B2,C3'.

o Nuk mund te permbajne vlera te kombinuara si: 'ABC-XYZ' ku 'ABC' ka nje kuptim

dhe 'XYZ' ka nje kuptim tjeter.

Nje relacion mund te perfaqesohet ne shenim te tille: R(A,B,C, ...) ku:

• R = eshte emri I relacionit.

• (A,B,C, ...) = jane atributet brenda ne relacion.

• A = atributi I cili formon celesin primare

Celesat

(2)

2. nje celes I perbere permban me shume sesa nje atribut.

3. nje celes kanditat eshte nje atribut ose nje bashkesi me attribute qe perfaqeson ne menyre unike nje rresht net abele. Nje celes candidate duhet te zoteroj cilesite e meposhtme:

o identifikim unike – per cdo rresht vlera e celesit duhet te identifikoje ne menyre te

vetme ate rresht.

o pateprice - nqs ndonje atribute qe ben pjese ne formimin e celesit do te largohet prej

celesit ath nuk do te prishet aftesia e identifikimit unik te rreshatve.

4. celesi primare eshte nje celes candidate I cili eshte perzgjedhur si identifikues unik kryesore. cdo relacion duhet qe te permbaje nje celes primare. Celsi primare zakonisht eshte celesi I cili eshte perzgjedhur per identifikim te nje rreshti ne rastin e implementimit fizike te bazes se te dhenave. Psh ISBN perzgjidhet si celes primare ne vend te titullit te nje libri.

5. nje superceles eshte nje bashkesi me attribute qe identifikon ne menyre te vetme nje rresht. Nje superceles ndryshon nga nje celes kanditat sepse supercelesi nuk e ka te nevojshme te plotesoje kushtin e teprise.

6. nje celes I jashtem eshte nje atribut ose nje bashkesi me attribute qe shfaqet si nje celes primare e nje relacion dhe qe shfaqe si nje atribut jo primare ne nje relate tjeter. Nga ana tjeter nje celes I jashtem mund te jete vetem nje pjese e nje celesi primare.:

o nje relacion shume-me-shume mund te implementohet vetem me ane te nje prerjeje

midis dy tabelave me lidhje nje-me-shume. Dhe celsi primare I tabeles se trete”qendrore” eshte nje celes I jashtem per dy tabelat prind dhe eshte nje celes kompozite I ndertuar prej dy celesave primare te dy tabelave prind.

o Nje lidhje nje-me-shume kerkon qe tabela femije te kete nga nje celes te jashtem qe

tregojne ne secilin prind dhe qe po ata celesa te jashtem te perbejne celesin primare te te gjithe tabeles.

7. nje celes semantic ose natyral eshte nje celes I cili ka nje kuptim te natyrshme per perdoruesin e te dhenave.. psh ne tabelen net e cilen kemi ruajtur shtetet e botes do te

ndeshemi me vleren “AL” per rastin kur po kerkohet vlera per shtetin “Albania” ath vlera AL ka kuptim per perdoruesin e kesaj tabele.

8. nje celes teknike, zevendesues apo artificial eshte nje celes vlerat e te cilit nuk kane nje kuptim te qarte per perdoruesin e te dhenave. Keta lloj celesash perdoren ne vend te celesave semantike ne rastet :

o kur vlera nje celesi semantike mund te ndrysohet prej perdoruesit ose mund te

dyfishohet. Per shembull ne nje tabele PERSON nuk eshte e rekomandueshme qe te perdret fusha Person_Name si nje celes sepse ka shume raste kur kemi persona te ndryshem me te njejtin mbiemer ose per rastin e vajzave mbiemri ndryshon pasi martohen.

o Kur asnje prej atributeve nuk mund te siguroje unicitetitn. Ne keto raste perdoret

artifica e shtimit te nje kolone apo fushe te re e cila do te ruaje nje numer ku ky I fundit do te gjenerohet prej sistemit; psh nje numer sekuence si ata qe perdoren ne rastet e ORDER_ID apo INVOICE_ID. Vlera '12345' nuk ka asnje kuptim per perdoruesin porse identifikon ne menyre te vetme gjithe mallrat qe jane porositur ne faturen qe identifikohet prej ketij numri.

9. nje celes funksional percakton te dhenat e tjera net e njejtin rresht prandaj dhe quhet percaktues ose determinant.

10. duhet te keni parasysh qe fjala “celes” ne pjesen me te madhe te DBMS implementohet si nje indeksim I cili nuk ju lejon duplikata te te dhenave.

Relacionet-Marredheniet

Nje tabele mund te lidhet me nje tabele tjeter duke vendosur nje marredhenie midis tyre. Relatat mund te vendosen/krijohen ne strukturen e dtabazes per te lehtesuar veprimet e lidhjeve(join) gjate kohes se ekzekutimit te kerkesave nga DB.

(3)

1. nje relate vendoset midis dy tabelave duke krijuar nje marredhenie nje-me-shume ose

prind-femije ku nje vlere nga tabela prind pao “nje” mund te krijoje lidhje me shume vlera nga

tabele femije apo “shume”. Per te arritur kete qellim tabela femije duhet te permbaje fusha qe te cojne prapa tek celesi primare tek tabela prind. Keto fusha tek tabela femije njihen ndryshe si celesa te jashtem dhe tabela prind njihet prej tabeles femije si tabela e jashtme.

2. eshte e mundur qe nje record ne tabelen prind te ekzistoje pa krijuar lidhje me ndonej record nga tabela femije. Porse nuk eshte e mundur qe nje record nga tabela femije te ekzistoje pa patur vendosur me pare nje lidhej mendonejrekord ne tabelen prind.

3. nje record ne tabelen femije I cili nuk ka nje record korrespondues ne tabelen prind njihet ndryshe si record “jetim”.

4. eshte e mundur qe nje tabele te krioje lidhje me vetveten. Per kete eshte e rendesishem qe te nje celes I jasten I vete tabeles te shenjoje tek celesi primare I po asaj tabele. Por duhet bere kujdes qe te dy celesta te mos jene e njejta fushe sepse perndryshe nje vlere do te referoje tek vetja e saj.

5. nje tabele mund te bej pjese ne disa relata; njehere mund te luaje rolin e tabeles prind dhe ne raste te tjera te luaj rolin e tabeles femije.

6. ne disa databaza me ane te rregullave te integritetit ose te kufizimeve te celesave te jashtem behen kontrolle te rasteve kur nje record femije nuk lejohet te krijohet rekorde kur nuk ekziston lidhja me nje celes te jashtem tek tabela prind ose kur rekordi ne tablen prind ka ndryshuar vlere ose eshte fshire.

Lidhjet relacionale

Operatori join perdoret per te kombinuar te dhena nga dy apo me shume relacione apo tabela ne menyre qe te plotesoje kerkesat nga nje query-I e caktuar. Dy relacione mund te vendosin lidhje kur ato ndajne te pakten nje atribut te perbashket. Lidhja implementohet duke konsideruar nje rresht ne nje instance te cdo relacioni. Nje rresht ne nje relacion R1 eshte lidhur me nje rresht ne relacionin R2 kur vlera e atributit te perbashket eshte e njejte net e dyja tabelat. Lidhja e dy relacioneve njihet ndryshe si lidhje binare.

Lidhja e dy relacioneve (tabelave) krijon nje relate te trete. Shenimi 'R1 x R2' tregon se lidhjen e relacionit R1 me R2. Per me shume shpjegim le te ndjekim shembullin me poshte:

Relacioni R1 A B C 1 5 3 2 4 5 8 3 5 9 3 3 1 6 5 5 4 3 2 7 5 Relacioni R2 B D E 4 7 4

(4)

6 2 3

5 7 8

7 2 3

3 2 2

Vini re se instancat e relacionit R1 dhe R2 permbajne te njejtat te dhena ne fushen e perbashket B. ne rastin e normalizimit nuk behet gje tjeter vecse relacioni R(A,B,C,D,E) ndahet ne relacione eprberes me te vegjel sic jane R1 dhe R2. Ne kete kontekst te dhenat e fushes B do te jene te njejta net e dyja relacionet. Instance e relacionit R1 dhe R2 jane projeksione te instances se relacionit R(A,B,C,D,E) ne atributet (A,B,C) dhe (B,D,E) ne menyre respective. Nje projeksion nuk do te elemonoje vlerat e te dhenave-rreshtat e dyfishte do te elemonohen por kjo ska kuptimin qe do te fshihen vlera nga cdo atribut.

Lidhja e relacioneve R1 dhe R2 behet e mundur sepse B eshte nje atribut I perbashket dhe rezultati do te jete ai I meposhtmi: Relacioni R1 x R2 A B C D E 1 5 3 7 8 2 4 5 7 4 8 3 5 2 2 9 3 3 2 2 1 6 5 2 3 5 4 3 7 4 2 7 5 2 3

Rreshti (2 4 5 7 4) eshte formuar nga bashkimi I rreshtit (2 4 5) nga relacioni R1 me rreshtin (4 7 4) ne relacionin R2. Bashkimi I rreshtave eshte bere per pasoje sepse te dy keta rreshta perbajne vleren te njejte 4 per fushen e perbashket B. Rreshti (2 4 5) nuk eshte bashkuar me rreshtin (6 2 3) sepse vlerat e fushes B per kete rast jane 4 dhe 6, pra jane jot e njejta.

Ne rastin qe sapo pershkruam lidhja midis relacioneve R1 dh eR2 ishte vedosur bazuar ne fushen e vetme B qe ishte fusa e perbashket midis ketyre dy tabelave. Porse ka raste kur relacioni vendoset edhe per rastet kur dy relacione kane me shume sesa nje fushe te perbashket. Per shembull marrim dy relacionet e meposhtme R1 dhe R2 qe ne kete rast kane dy fusha te perbashketa B dhe C:

Figurat me poshte tregojne relacionet perpara bashkimit te tyre:

Relacioni R1

A B C

6 1 4

(5)

5 1 2 2 7 1 Relacioni R2 B C D 1 4 9 1 4 2 1 2 1 7 1 2 7 1 3 Pas bashkimit: Relacioni R1 x R2 A B C D 6 1 4 9 6 1 4 2 8 1 4 9 8 1 4 2 5 1 2 1 2 7 1 2 2 7 1 3

Rreshti (6 1 4 9) eshte formuar nga rreshti (6 1 4 ) I relacionit R1 dhe nga rreshti (1 4 9 ) I relacionit R2. lidhje meqenese u gjenden vlerat te njejta per atributet e perbashket B dhe C; qe jane 1 dhe 4. rreshti (6 1 4 ) I relacionit R1 nuk u lidh me rreshtin (1 2 1 ) te relacionit R2 sepse atributet e perbashket permbanin vlerat ( 1 4 ) per R1 dhe (1 2) per R2, pra vlera jo te njejta.

Operatori join na lejon qe te rekrijojme nje relacin I cili u dekompozua gjate procesit te normalizimit. Lidhja e dy rreshtave, te cfaredoshem, mund te coj ne krijimin e nje rreshti I cili nuk ishte pjese e relacionit original. Keta jane rreshta invalid ose jot e vlefshem qe kijone gjate join.

Lossless Joins

Nje bashkesi me relacione kenaq cilesine e lidhjes pa humbe kur gjate bashkimit nuk krijohen rreshta invalide( ose rreshta te rinj). Termi lidhje pa humbje mund te jete dicka qe krijon konfuzion. Nje lidhje qe ka cilesine e lossless do te coje ne krijimin e rreshtave te rinj jot e vlefshem. Ndersa nje lidhje lossless nuk I permban keta rreshta invalid.

(6)

Japim nje shembull sqarues per informacionin jo korrekt qe do te rijohet nga nje lidhje jo e vlefshme. Struktura e meposhtme ju ndihmon:

R(student, course, instructor, hour, room, grade)

Supozojme se krijohet vetem nje seksion per lenden ne fjale ath do te vendosen varesite funksinale si me poshte:

1. (HOUR, ROOM) COURSE 2. (COURSE, STUDENT) GRADE 3. (INSTRUCTOR, HOUR) ROOM 4. (COURSE) INSTRUCTOR 5. (HOUR, STUDENT) ROOM

Tabela me poshte do ten a ndihmoje akoma me shume ne perfytyrimin e situates se krijuar:

STUDENT COURSE INSTRUCTOR HOUR ROOM GRADE

Mino Mat Guxholli 8:00 107 A

Koci AP Hoxha 8:00 209 B

Tyrku DB Xheka 8:00 312 C

Cupi A2 Kablaqi 9:00 212 A

Relacionet e meposhtme te cilet jane ne formen e 3 normale gjenerohen sipas varesive te meposhtme:

• R1(STUDENT, HOUR, COURSE)

• R2(STUDENT, COURSE, GRADE)

• R3(COURSE, INSTRUCTOR)

• R4(INSTRUCTOR, HOUR, ROOM)

Veini re se varesite (HOUR, ROOM) COURSE dhe (HOUR, STUDENT) ROOM nuk jane perfaqesuar ne menyre ekspilicite gjate dekompozimit te meparshem. Qellimi eshte qe te arrihet ne relacione te normalizuara te cilat duke vepruar mbi operatore join do te arrihet te pergjigjet pyetjeve qe ndryshme te kryera prej perdoruesit fundore. Relacionet e mesiperme jane vetem njera mundesi e relacioneve qe jane zgjidhje e ketij skenari.

Bashkesite e relacioneve te mesiperme mund te popullohen sipas figurave te meposhtme: R1

STUDENT HOUR COURSE

Mino 8:00 Mat

Koci 8:00 AP

Tyrku 8:00 DB

Cupi 9:00 A2

R2

(7)

Mino Mat A Koci AP B Tyrku DB C Cupi A2 A R3 COURSE INSTRUCTOR Mat Guxholli AP Hoxha DB Xheka R4

INSTRUCTOR HOUR ROOM

Guxholli 8:00 107

Hoxha 8:00 209

Xheka 9:00 312

Nqs nga ana e secretaries na kerkohet numri I klasave dhe ora se kur jane te zena ath me ane te relacioneve R1 dhe R4 arrihet te marrim relacionin ne fjale duke I bashkuar at ate dy nepermjet atributit te perbashket HOUR dhe rezultati eshte ai qe na paraqitet ne figuren me poshte:

R1 x R4

STUDENT COURSE INSTRUCTOR HOUR ROOM

Smith Math 1 Jenkins 8:00 100

Smith Math 1 Goldman 8:00 200

Jones English Jenkins 8:00 100 Jones English Goldman 8:00 200 Brown English Jenkins 8:00 100 Brown English Goldman 8:00 200 Green Algebra Jenkins 9:00 400

This join creates the following invalid information (denoted by the coloured rows):

• Smith, Jones, and Brown take the same class at the same time from two different instructors in two different rooms.

• Jenkins (the Maths teacher) teaches English.

• Goldman (the English teacher) teaches Maths.

• Both instructors teach different courses at the same time.

(8)

R3 x R4

COURSE INSTRUCTOR HOUR ROOM

Math 1 Jenkins 8:00 100 Math 1 Jenkins 9:00 400 English Goldman 8:00 200 Algebra Jenkins 8:00 100 Algebra Jenkins 9:00 400

This join creates the following invalid information:

• Jenkins teaches Math 1 and Algebra simultaneously at both 8:00 and 9:00.

A correct sequence is to join R1 and R3 (using COURSE) and then join the resulting relation with R4 (using both INSTRUCTOR and HOUR). The result would be:

R1 x R3

STUDENT COURSE INSTRUCTOR HOUR

Smith Math 1 Jenkins 8:00 Jones English Goldman 8:00 Brown English Goldman 8:00 Green Algebra Jenkins 9:00

(R1 x R3) x R4

STUDENT COURSE INSTRUCTOR HOUR ROOM

Smith Math 1 Jenkins 8:00 100

Jones English Goldman 8:00 200 Brown English Goldman 8:00 200 Green Algebra Jenkins 9:00 400

Extracting the COURSE and ROOM attributes (and eliminating the duplicate row produced for the English course) would yield the desired result:

COURSE ROOM

Math 1 100 English 200 Algebra 400

The correct result is obtained since the sequence (R1 x r3) x R4 satisfies the lossless (gainless?) join property.

(9)

A relational database is in 4th normal form when the lossless join property can be used to answer unanticipated queries. However, the choice of joins must be evaluated carefully. Many different sequences of joins will recreate an instance of a relation. Some sequences are more desirable since they result in the creation of less invalid data during the join operation.

Suppose that a relation is decomposed using functional dependencies and multi-valued dependencies. Then at least one sequence of joins on the resulting relations exists that recreates the original instance with no invalid data created during any of the join operations.

For example, suppose that a list of grades by room number is desired. This question, which was probably not anticipated during database design, can be answered without creating invalid data by either of the following two join sequences:

R1 x R3 (R1 x R3) x R2 ((R1 x R3) x R2) x R4 or R1 x R3 (R1 x R3) x R4 ((R1 x R3) x R4) x R2

The required information is contained with relations R2 and R4, but these relations cannot be joined directly. In this case the solution requires joining all 4 relations.

The database may require a 'lossless join' relation, which is constructed to assure that any ad hoc inquiry' can be answered with relational operators. This relation may contain attributes that are not logically related to each other. This occurs because the relation must serve as a bridge between the other relations in the database. For example, the lossless join relation will contain all attributes that appear only on the left side of a functional dependency. Other attributes may also be required, however, in developing the lossless join relation.

Consider relational schema R(A, B, C, D), A B and C D. Relations Rl(A, B) and R2(C, D) are in 4th normal form. A third relation R3(A, C), however, is required to satisfy the lossless join property. This relation can be used to join attributes B and D. This is accomplished by joining relations R1 and R3 and then joining the result to relation R2. No invalid data is created during these joins. The relation R3(A, C) is the lossless join relation for this database design.

A relation is usually developed by combining attributes about a particular subject or entity. The lossless join relation, however, is developed to represent a relationship among various relations. The lossless join relation may be difficult to populate initially and difficult to maintain - a result of including attributes that are not logically associated with each other.

The attributes within a lossless join relation often contain multi-valued dependencies. Consideration of 4th normal form is important in this situation. The lossless join relation can sometimes be

decomposed into smaller relations by eliminating the multi-valued dependencies. These smaller relations are easier to populate and maintain.

(10)

The terms determinant and dependent can be described as follows:

1. The expression X Y means 'if I know the value of X, then I can obtain the value of Y' (in a table or somewhere).

2. In the expression X Y, X is the determinant and Y is the dependent attribute. 3. The value X determines the value of Y.

4. The value Y depends on the value of X.

Functional Dependencies (FD)

A functional dependency can be described as follows:

1. An attribute is functionally dependent if its value is determined by another attribute. 2. That is, if we know the value of one (or several) data items, then we can find the value of

another (or several).

3. Functional dependencies are expressed as X Y, where X is the determinant and Y is the functionally dependent attribute.

4. If A (B,C) then A B and A C.

5. If (A,B) C, then it is not necessarily true that A C and B C. 6. If A B and B A, then A and B are in a 1-1 relationship. 7. If A B then for A there can only ever be one value for B.

Transitive Dependencies (TD)

A transitive dependency can be described as follows:

1. An attribute is transitively dependent if its value is determined by another attribute which is

not a key.

2. If X Y and X is not a key then this is a transitive dependency. 3. A transitive dependency exists when A B C but NOT A C.

Multi-Valued Dependencies (MVD)

A multi-valued dependency can be described as follows:

1. A table involves a multi-valued dependency if it may contain multiple values for an entity. 2. A multi-valued dependency may arise as a result of enforcing 1st normal form.

3. X Y, ie X multi-determines Y, when for each value of X we can have more than one value of Y.

4. If A B and A C then we have a single attribute A which multi-determines two other independent attributes, B and C.

5. If A (B,C) then we have an attribute A which multi-determines a set of associated attributes, B and C.

Join Dependencies (JD)

A join dependency can be described as follows:

1. If a table can be decomposed into three or more smaller tables, it must be capable of being joined again on common keys to form the original table.

(11)

A major objective of data normalisation is to avoid modification anomalies. These come in two flavours:

1. An insertion anomaly is a failure to place information about a new database entry into all the places in the database where information about that new entry needs to be stored. In a

properly normalized database, information about a new entry needs to be inserted into only one place in the database. In an inadequately normalized database, information about a new entry may need to be inserted into more than one place, and, human fallibility being what it is, some of the needed additional insertions may be missed.

2. A deletion anomaly is a failure to remove information about an existing database entry when it is time to remove that entry. In a properly normalized database, information about an old, to-be-gotten-rid-of entry needs to be deleted from only one place in the database. In an inadequately normalized database, information about that old entry may need to be deleted from more than one place, and, human fallibility being what it is, some of the needed additional deletions may be missed.

An update of a database involves modifications that may be additions, deletions, or both. Thus 'update anomalies' can be either of the kinds of anomalies discussed above.

All three kinds of anomalies are highly undesirable, since their occurrence constitutes corruption of the database. Properly normalised databases are much less susceptible to corruption than are

unnormalised databases.

Types of Relational Join

A JOIN is a method of creating a result set that combines rows from two or more tables (relations). When comparing the contents of two tables the following conditions may occur:

• Every row in one relation has a match in the other relation.

• Relation R1 contains rows that have no match in relation R2.

• Relation R2 contains rows that have no match in relation R1.

INNER joins contain only matches. OUTER joins may contain mismatches as well.

Inner Join

This is sometimes known s a simple join. It returns all rows from both tables where there is a match. If there are rows in R1 which do not have matches in R2, those rows will not be listed. There are two possible ways of specifying this type of join:

SELECT * FROM R1, R2 WHERE R1.r1_field = R2.r2_field; SELECT * FROM R1 INNER JOIN R2 ON R1.field = R2.r2_field

If the fields to be matched have the same names in both tables then the ON condition, as in: ON R1.fieldname = R2.fieldname

ON (R1.field1 = R2.field1 AND R1.field2 = R2.field2)

can be replaced by the shorter USING condition, as in: USING fieldname

(12)

Natural Join

A natural join is based on all columns in the two tables that have the same name. It is semantically equivalent to an INNER JOIN or a LEFT JOIN with a USING clause that names all columns that exist

in both tables.

SELECT * FROM R1 NATURAL JOIN R2

The alternative is a keyed join which includes an ON or USING condition.

Left [Outer] Join

Returns all the rows from R1 even if there are no matches in R2. If there are no matches in R2 then the R2 values will be shown as null.

SELECT * FROM R1 LEFT [OUTER] JOIN R2 ON R1.field = R2.field

Right [Outer] Join

Returns all the rows from R2 even if there are no matches in R1. If there are no matches in R1 then the R1 values will be shown as null.

SELECT * FROM R1 RIGHT [OUTER] JOIN R2 ON R1.field = R2.field

Full [Outer] Join

Returns all the rows from both tables even if there are no matches in one of the tables. If there are no matches in one of the tables then its values will be shown as null.

SELECT * FROM R1 FULL [OUTER] JOIN R2 ON R1.field = R2.field

Self Join

This joins a table to itself. This table appears twice in the FROM clause and is followed by table aliases that qualify column names in the join condition.

SELECT a.field1, b.field2 FROM R1 a, R1 b WHERE a.field = b.field

Cross Join

This type of join is rarely used as it does not have a join condition, so every row of R1 is joined to every row of R2. For example, if both tables contain 100 rows the result will be 10,000 rows. This is sometimes known as a cartesian product and can be specified in either one of the following ways:

SELECT * FROM R1 CROSS JOIN R2 SELECT * FROM R1, R2

Entity-Relationship Diagram (ERD)

An entity-relationship diagram (ERD) is a data modeling technique that creates a graphical

representation of the entities, and the relationships between entities, within an information system. Any ER diagram has an equivalent relational table, and any relational table has an equivalent ER

(13)

diagram. ER diagramming is an invaluable aid to engineers in the design, optimization, and debugging of database programs.

• The entity is a person, object, place or event for which data is collected. It is equivalent to a database table. An entity can be defined by means of its properties, called attributes. For example, the CUSTOMER entity may have attributes for such things as name, address and telephone number.

• The relationship is the interaction between the entities. It can be described using a verb such as:

o A customer places an order. o A sales rep serves a customer. o A order contains a product. o A warehouse stores a product.

In an entity-relationship diagram entities are rendered as rectangles, and relationships are portrayed as lines connecting the rectangles. One way of indicating which is the 'one' or 'parent' and which is the 'many' or 'child' in the relationship is to use an arrowhead, as in figure 4.

Figure 4 - One-to-Many relationship using arrowhead notation

This can produce an ERD as shown in figure 5: Figure 5 - ERD with arrowhead notation

Another method is to replace the arrowhead with a crowsfoot, as shown in figure 6: Figure 6 - One-to-Many relationship using crowsfoot notation

The relating line can be enhanced to indicate cardinality which defines the relationship between the entities in terms of numbers. An entity may be optional (zero or more) or it may be mandatory (one or more).

A single bar indicates one.

A double bar indicates one and only one.

(14)

A crowsfoot or arrowhead indicates many.

As well as using lines and circles the cardinality can be expressed using numbers, as in:

• One-to-One expressed as 1:1

• Zero-to-Many expressed as 0:M

• One-to-Many expressed as 1:M

• Many-to-Many expressed as N:M This can produce an ERD as shown in figure 7:

Figure 7 - ERD with crowsfoot notation and cardinality

In plain language the relationships can be expressed as follows:

• 1 instance of a SALES REP serves 1 to many CUSTOMERS

• 1 instance of a CUSTOMER places 1 to many ORDERS

• 1 instance of an ORDER lists 1 to many PRODUCTS

• 1 instance of a WAREHOUSE stores 0 to many PRODUCTS

In order to determine if a particular design is correct here is a simple test that I use: 1. Take the written rules and construct a diagram.

2. Take the diagram and try to reconstruct the written rules.

If the output from step (2) is not the same as the input to step (1) then something is wrong. If the model allows a situation to exist which is not allowed in the real world then this could lead to serious problems. The model must be an accurate representation of the real world in order to be effective. If any ambiguities are allowed to creep in they could have disastrous consequences.

We have now completed the logical data model, but before we can construct the physical database there are several steps that must take place:

• Assign attributes (properties or values) to all the entities. After all, a table without any columns will be of little use to anyone.

• Refine the model using a process known as 'normalisation'. This ensures that each attribute is in the right place. During this process it may be necessary to create new tables and new relationships.

(15)

Data Normalisation

Relational database theory, and the principles of normalisation, were first constructed by people with a strong mathematical background. They wrote about databases using terminology which was not easily understood outside those mathematical circles. Below is an attempt to provide understandable explanations.

Data normalisation is a set of rules and techniques concerned with:

• Identifying relationships among attributes.

• Combining attributes to form relations.

• Combining relations to form a database.

It follows a set of rules worked out by E F Codd in 1970. A normalised relational database provides several benefits:

• Elimination of redundant data storage.

• Close modeling of real world entities, processes, and their relationships.

• Structuring of data so that the model is flexible.

Because the principles of normalisation were first written using the same terminology as was used to define the relational data model this led some people to think that normalisation is difficult. Nothing could be more untrue. The principles of normalisation are simple, common sense ideas that are easy to apply.

Although there are numerous steps in the normalisation process - 1NF, 2NF, 3NF, BCNF, 4NF, 5NF and DKNF - a lot of database designers often find it unnecessary to go beyond 3rd Normal Form. This does not mean that those higher forms are unimportant, just that the circumstances for which they were designed often do not exist within a particular database. However, all database designers should be aware of all the forms of normalisation so that they may be in a better position to detect when a particular rule of normalisation is broken and then decide if it is necessary to take appropriate action.

The guidelines for developing relations in 3rd Normal Form can be summarised as follows: 1. Define the attributes.

2. Group logically related attributes into relations. 3. Identify candidate keys for each relation. 4. Select a primary key for each relation. 5. Identify and remove repeating groups.

6. Combine relations with identical keys (1st normal form). 7. Identify all functional dependencies.

8. Decompose relations such that each nonkey attribute is dependent on all the attributes in the key.

9. Combine relations with identical primary keys (2nd normal form). 10. Identify all transitive dependencies.

o Check relations for dependencies of one nonkey attribute with another nonkey

attribute.

o Check for dependencies within each primary key (i.e. dependencies of one attribute in

the key on other attributes within the key).

(16)

12. Combine relations with identical primary keys (3rd normal form) if there are no transitive dependencies.

1st Normal Form

A table is in first normal form if all the key attributes have been defined and it contains no repeating groups.

Taking the ORDER entity in figure 7 as an example we could end up with a set of attributes like this: ORDER

order_id customer_id product1 product2 product3

123 456 abc1 def1 ghi1

456 789 abc2

This structure creates the following problems:

• Order 123 has no room for more than 3 products.

• Order 456 has wasted space for product2 and product3.

In order to create a table that is in first normal form we must extract the repeating groups and place them in a separate table, which I shall call ORDER_LINE.

ORDER

order_id customer_id

123 456

456 789

I have removed 'product1', 'product2' and 'product3', so there are no repeating groups. ORDER_LINE order_id product 123 abc1 123 def1 123 ghi1 456 abc2

Each row contains one product for one order, so this allows an order to contain any number of products.

This results in a new version of the ERD, as shown in figure 8: Figure 8 - ERD with ORDER and ORDER_LINE

(17)

The new relationships can be expressed as follows:

• 1 instance of an ORDER has 1 to many ORDER LINES

• 1 instance of a PRODUCT has 0 to many ORDER LINES

2nd Normal Form

A table is in second normal form (2NF) if and only if it is in 1NF and every nonkey attribute is fully functionally dependent on the whole of the primary key (i.e. there are no partial dependencies).

1. Anomalies can occur when attributes are dependent on only part of a multi-attribute (composite) key.

2. A relation is in second normal form when all non-key attributes are dependent on the whole key. That is, no attribute is dependent on only a part of the key.

3. Any relation having a key with a single attribute is in second normal form. Take the following table structure as an example:

order(order_id, cust, cust_address, cust_contact, order_date, order_total)

Here we should realise that cust_address and cust_contact are functionally dependent on cust

but not on order_date, therefore they are not dependent on the whole key. To make this table 2NF

these attributes must be removed and placed somewhere else.

3rd Normal Form

A table is in third normal form (3NF) if and only if it is in 2NF and every nonkey attribute is nontransitively dependent on the primary key (i.e. there are no transitive dependencies).

1. Anomalies can occur when a relation contains one or more transitive dependencies. 2. A relation is in 3NF when it is in 2NF and has no transitive dependencies.

3. A relation is in 3NF when 'All non-key attributes are dependent on the key, the whole key and nothing but the key'.

(18)

Take the following table structure as an example:

order(order_id, cust, cust_address, cust_contact, order_date, order_total)

Here we should realise that cust_address and cust_contact are functionally dependent on cust

which is not a key. To make this table 3NF these attributes must be removed and placed somewhere else.

You must also note the use of calculated or derived fields. Take the example where a table contains PRICE, QUANTITY and EXTENDED_PRICE where EXTENDED_PRICE is calculated as

QUANTITY multiplied by PRICE. As one of these values can be calculated from the other two then it need not be held in the database table. Do not assume that it is safe to drop any one of the three fields as a difference in the number of decimal places between the various fields could lead to different results due to rounding errors. For example, take the following fields:

• AMOUNT - a monetary value in home currency, to 2 decimal places.

• EXCH_RATE - exchange rate, to 9 decimal places.

• CURRENCY_AMOUNT - amount expressed in foreign currency, calculated as AMOUNT multiplied by EXCH_RATE.

If you were to drop EXCH_RATE could it be calculated back to its original 9 decimal places? Reaching 3NF is is adequate for most practical needs, but there may be circumstances which would benefit from further normalisation.

Boyce-Codd Normal Form

A table is in Boyce-Codd normal form (BCNF) if and only if it is in 3NF and every determinant is a candidate key.

1. Anomalies can occur in relations in 3NF if there is a composite key in which part of that key has a determinant which is not itself a candidate key.

2. This can be expressed as R(A,B,C), C A where:

o The relation contains attributes A, B and C. o A and B form a candidate key.

o C is the determinant for A (A is functionally dependent on C). o C is not part of any key.

3. Anomalies can also occur where a relation contains several candidate keys where:

o The keys contain more than one attribute (they are composite keys). o An attribute is common to more than one key.

Take the following table structure as an example:

schedule(campus, course, class, time, room/bldg)

Take the following sample data:

campus course class time room/bldg

East English 101 1 8:00-9:00 212 AYE East English 101 2 10:00-11:00 305 RFK

(19)

West English 101 3 8:00-9:00 102 PPR

Note that no two buildings on any of the university campuses have the same name, thus

ROOM/BLDG CAMPUS. As the determinant is not a candidate key this table is NOT in Boyce-Codd normal form.

This table should be decomposed into the following relations:

R1(course, class, room/bldg, time)

R2(room/bldg, campus)

As another example take the following structure:

enrol(student#, s_name, course#, c_name, date_enrolled)

This table has the following candidate keys:

• (student#, course#)

• (student#, c_name)

• (s_name, course#) - this assumes that s_name is a unique identifier

• (s_name, c_name) - this assumes that c_name is a unique identifier The relation is in 3NF but not in BCNF because of the following dependencies:

• student# s_name

• course# c_name

4th Normal Form

A table is in fourth normal form (4NF) if and only if it is in BCNF and contains no more than one multi-valued dependency.

1. Anomalies can occur in relations in BCNF if there is more than one multi-valued dependency. 2. If A B and A C but B and C are unrelated, ie A (B,C) is false, then we have more

than one multi-valued dependency.

3. A relation is in 4NF when it is in BCNF and has no more than one multi-valued dependency.

Take the following table structure as an example:

info(employee#, skills, hobbies)

Take the following sample data:

employee# skills hobbies

1 Programming Golf 1 Programming Bowling

(20)

1 Analysis Bowling

2 Analysis Golf

2 Analysis Gardening

2 Management Golf 2 Management Gardening

This table is difficult to maintain since adding a new hobby requires multiple new rows corresponding to each skill. This problem is created by the pair of multi-valued dependencies EMPLOYEE# SKILLS and EMPLOYEE# HOBBIES. A much better alternative would be to decompose INFO into two relations:

skills(employee#, skill)

hobbies(employee#, hobby)

5th (Projection-Join) Normal Form

A table is in fifth normal form (5NF) or Projection-Join Normal Form (PJNF) if it is in 4NF and it cannot have a lossless decomposition into any number of smaller tables.

Another way of expressing this is:

... and each join dependency is a consequence of the candidate keys. Yet another way of expressing this is:

... and there are no pairwise cyclical dependencies in the primary key comprised of three or more attributes.

• Anomalies can occur in relations in 4NF if the primary key has three or more fields.

• 5NF is based on the concept of join dependence - if a relation cannot be decomposed any further then it is in 5NF.

• Pairwise cyclical dependency means that:

o You always need to know two values (pairwise). o For any one you must know the other two (cyclical).

Take the following table structure as an example:

buying(buyer, vendor, item)

This is used to track buyers, what they buy, and from whom they buy. Take the following sample data:

buyer vendor item

Sally Liz Claiborne Blouses Mary Liz Claiborne Blouses

(21)

Sally Jordach Jeans Mary Jordach Jeans Sally Jordach Sneakers

The question is, what do you do if Claiborne starts to sell Jeans? How many records must you create to record this fact?

The problem is there are pairwise cyclical dependencies in the primary key. That is, in order to determine the item you must know the buyer and vendor, and to determine the vendor you must know the buyer and the item, and finally to know the buyer you must know the vendor and the item. The solution is to break this one table into three tables; Buyer-Vendor, Buyer-Item, and Vendor-Item.

6th (Domain-Key) Normal Form

A table is in sixth normal form (6NF) or Domain-Key normal form (DKNF) if it is in 5NF and if all constraints and dependencies that should hold on the relation can be enforced simply by enforcing the domain constraints and the key constraints specified on the relation.

Another way of expressing this is:

... if every constraint on the table is a logical consequence of the definition of keys and domains. 1. An domain constraint (better called an attribute constraint) is simply a constraint to the effect

a given attribute A of R takes its values from some given domain D.

2. A key constraint is simply a constraint to the effect that a given set A, B, ..., C of R constitutes a key for R.

This standard was proposed by Ron Fagin in 1981, but interestingly enough he made no note of multi-valued dependencies, join dependencies, or functional dependencies in his paper and did not demonstrate how to achieve DKNF. However, he did manage to demonstrate that DKNF is often impossible to achieve.

If relation R is in DKNF, then it is sufficient to enforce the domain and key constraints for R, and all constraints on R will be enforced automatically. Enforcing those domain and key constraints is, of course, very simple (most DBMS products do it already). To be specific, enforcing domain

constraints just means checking that attribute values are always values from the applicable domain (i.e., values of the right type); enforcing key constraints just means checking that key values are unique.

Unfortunately lots of relations are not in DKNF in the first place. For example, suppose there's a constraint on R to the effect that R must contain at least ten tuples. Then that constraint is certainly not a consequence of the domain and key constraints that apply to R, and so R is not in DKNF. The sad fact is, not all relations can be reduced to DKNF; nor do we know the answer to the question "Exactly when can a relation be so reduced?"

(22)

Denormalisation is the process of modifying a perfectly normalised database design for performance reasons. Denormalisation is a natural and necessary part of database design, but must follow proper normalisation. Here are a few words from C J Date on denormalisation:

The general idea of normalization...is that the database designer should aim for relations in the "ultimate" normal form (5NF). However, this recommendation should not be construed as law. Sometimes there are good reasons for flouting the principles of normalization.... The only hard requirement is that relations be in at least first normal form. Indeed, this is as good a place as any to make the point that database design can be an extremely complex task.... Normalization theory is a useful aid in the process, but it is not a panacea; anyone designing a database is certainly advised to be familiar with the basic techniques of normalization...but we do not mean to suggest that the design should necessarily be based on normalization principles alone.

C.J. Date

An Introduction to Database Systems Pages 528-529

In the 1970s and 1980s when computer hardware was bulky, expensive and slow it was often considered necessary to denormalise the data in order to achieve acceptable performance, but this performance boost often came with a cost (refer to Modification Anomalies). By comparison, computer hardware in the 21st century is extremely compact, extremely cheap and extremely fast. When this is coupled with the enhanced performance from today's DBMS engines the performance from a normalised database is often acceptable, therefore there is less need for any denormalisation. However, under certain conditions denormalisation can be perfectly acceptable. Take the following table as an example:

Company City State Zip

Acme Widgets New York NY 10169 ABC Corporation Miami FL 33196 XYZ Inc Columbia MD 21046

This table is NOT in 3rd normal form because the city and state are dependent upon the ZIP code. To place this table in 3NF, two separate tables would be created - one containing the company name and ZIP code and the other containing city, state, ZIP code pairings.

This may seem overly complex for daily applications and indeed it may be. Database designers should always keep in mind the tradeoffs between higher level normal forms and the resource issues that complexity creates.

Deliberate denormalisation is commonplace when you're optimizing performance. If you

continuously draw data from a related table, it may make sense to duplicate the data redundantly. Denormalisation always makes your system potentially less efficient and flexible, so denormalise as needed, but not frivolously.

There are techniques for improving performance that involve storing redundant or calculated data. Some of these techniques break the rules of normalisation, others do not. Sometimes real world requirements justify breaking the rules. Intelligently and consciously breaking the rules of

normalisation for performance purposes is an accepted practice, and should only be done when the benefits of the change justify breaking the rule.

(23)

Compound Fields

A compound field is a field whose value is the combination of two or more fields in the same record. The cost of using compound fields is the space they occupy and the code needed to maintain them. (Compound fields typically violate 2NF or 3NF.)

For example, if your database has a table with addresses including city and state, you can create a compound field (call it City_State) that is made up of the concatenation of the city and state fields. Sorts and queries on City_State are much faster than the same sort or query using the two source fields - sometimes even 40 times faster.

The downside of compound fields for the developer is that you have to write code to make sure that the City_State field is updated whenever either the city or the state field value changes. This is not difficult to do, but it is important that there are no 'leaks', or situations where the source data changes and, through some oversight, the compound field value is not updated.

Summary Fields

A summary field is a field in a one table record whose value is based on data in related-many table records. Summary fields eliminate repetitive and time-consuming cross-table calculations and make calculated results directly available for end-user queries, sorts, and reports without new

programming. One-table fields that summarise values in multiple related records are a powerful optimization tool. Imagine tracking invoices without maintaining the invoice total! Summary fields like this do not violate the rules of normalisation. Normalisation is often misconceived as forbidding the storage of calculated values, leading people to avoid appropriate summary fields.

There are two costs to consider when contemplating using a summary field: the coding time required to maintain accurate data and the space required to store the summary field.

Some typical summary fields which you may encounter in an accounting system are:

• For an INVOICE the invoice amount is the total of the amounts on all INVOICE_LINE records for that invoice.

• For an ACCOUNT the account balance will be the sum total of the amounts on all INVOICE and PAYMENT records for that account.

Summary Tables

A summary table is a table whose records summarise large amounts of related data or the results of a series of calculations. The entire table is maintained to optimise reporting, querying, and generating cross-table selections. Summary tables contain derived data from multiple records and do not necessarily violate the rules of normalisation. People often overlook summary tables based on the misconception that derived data is necessarily denormalised.

In order for a summary table to be useful it needs to be accurate. This means you need to update summary records whenever source records change. This task can be taken care of in the program code, or in a database trigger (preferred), or in a batch process. You must also make sure to update summary records if you change source data in your code. Keeping the data valid requires extra work and introduces the possibility of coding errors, so you should factor this cost in when deciding if you are going to use this technique.

(24)

As mentioned in the guidelines for developing relations in 3rd normal form all relations which share the same primary key are supposed to be combined into the same table. However, there are

circumstances where is is perfectly valid to ignore this rule. Take the following example which I encountered in 1984:

• A finance company gives loans to customers, and a record is kept of each customer's repayments.

• If a customer does not meet a scheduled repayment then his account goes into arrears and special action needs to be taken.

• Of the total customer base about 5% are in arrears at any one time.

This means that with 100,000 customers there will be roughly 5,000 in arrears. If the arrears data is held on the same record as the basic customer data (both sets of data have customer_id as the primary key) then it requires searching through all 100,000 records to locate those which are in arrears. This is not very efficient. One method tried was to create an index on account_status which identified whether the account was in arrears or not, but the improvement (due to the speed of the hardware and the limitations of the database engine) was minimal.

A solution in these circumstances is to extract all the attributes which deal with arrears and put them in a separate table. Thus if there are 5,000 customers in arrears you can reference a table which contains only 5,000 records. As the arrears data is subordinate to the customer data the arrears table must be the 'child' in the relationship with the customer 'parent'. It would be possible to give the arrears table a different primary key as well as the foreign key to the customer table, but this would allow the customer arrears relationship to be one-to-many instead of one-to-one. To enforce this constraint the foreign key and the primary key should be exactly the same.

This situation can be expressed using the following structure:

R (K, A, B, C, X, Y, Z) where:

1. Attribute K is the primary key. 2. Attributes (A B C) exist all the time.

3. Attributes (X Y Z) exist some of the time (but always as a group under the same circumstances).

4. Attributes (X Y Z) require special processing.

After denormalising the result is two separate relations, as follows:

• R1 (K, A, B, C)

• R2 (K, X, Y, Z) where K is also the foreign key to R1

Personal Guidelines

Even if you obey all the preceding rules it is still possible to produce a database design that causes problems during development. I have come across many different implementation tips and

techniques over the years, and some that have worked in one database system have been successfully carried forward into a new database system. Some tips, on the other hand, may only be applicable to a particular database system.

(25)

Database Names

1. Database names should be short and meaningful, such as 'products', 'purchasing' and 'sales'.

o Short, but not too short, as in 'prod' or 'purch'.

o Meaningful but not verbose, as in 'the database used to store product details'.

2. Do not waste time using a prefix such as 'db' to identify database names. The SQL syntax analyser has the intelligence to work that out for itself - so should you.

3. If your DBMS allows a mixture of upper and lowercase names, and it is case sensitive, it is better to stick to a standard naming convention such as:

o All uppercase.

o All lowercase (my preference).

o Leading uppercase, remainder lowercase.

Inconsistencies may lead to confusion, confusion may lead to mistakes, mistakes can lead to disasters.

4. If a database name contains more than one word, such as in 'sales orders' and 'purchase orders', decide how to deal with it:

o Separate the words with a single space, as in 'sales orders' (note that some DBMSs do

not allow embedded spaces, while most languages will require such names to be enclosed in quotes).

o Separate the words with an underscore, as in 'sales_orders' (my preference). o Separate the words with a hyphen, as in 'sales-orders'.

o Use camel caps, as in 'SalesOrders'.

Again, be consistent.

5. Rather than putting all the tables into a single database it may be better to create separate databases for each logically related set of tables. This may help with security, archiving, replication, etc.

Table Names

1. Table names should be short and meaningful, such as 'part', 'customer' and 'invoice'.

o Short, but not too short. o Meaningful, but not verbose.

2. Do not waste time using a prefix such as 'tbl' to identify table names. The SQL syntax analyser has the intelligence to work that out for itself - so should you.

3. Table names should be in the singular (e.g. 'customer' not 'customers'). The fact that a table may contain multiple entries is irrelevant - any multiplicity can be derived from the existence of one-to-many relationships.

4. If your DBMS allows a mixture of upper and lowercase names, and it is case sensitive, It is better to stick to a standard naming convention such as:

o All uppercase.

o All lowercase. (my preference)

o Leading uppercase, remainder lowercase.

Inconsistencies may lead to confusion, confusion may lead to mistakes, mistakes can lead to disasters.

5. If a table name contains more than one word, such as in 'sales order' and 'purchase order', decide how to deal with it:

(26)

o Separate the words with a single space, as in 'sales order' (note that some DBMSs do

not allow embedded spaces, while most languages will require such names to be enclosed in quotes).

o Separate the words with an underscore, as in 'sales_order' (my preference). o Separate the words with a hyphen, as in 'sales-order'.

o Use camel caps, as in 'SalesOrder'.

Again, be consistent.

6. Be careful if the same table name is used in more than one database - it may lead to confusion.

Field Names

1. Field names should be short and meaningful, such as 'part_name' and 'customer_name'.

o Short, but not too short, such as in 'ptnam'.

o Meaningful, but not verbose, such as 'the name of the part'.

2. Do not waste time using a prefix such as 'col' or 'fld' to identify column/field names. The SQL syntax analyser has the intelligence to work that out for itself - so should you.

3. If your DBMS allows a mixture of upper and lowercase names, and it is case sensitive, it is better to stick to a standard naming convention such as:

o All uppercase.

o All lowercase. (my preference)

o Leading uppercase, remainder lowercase.

Inconsistencies may lead to confusion, confusion may lead to mistakes, mistakes can lead to disasters.

4. If a field name contains more than one word, such as in 'part name' and customer name', decide how to deal with it:

o Separate the words with a single space, as in 'part name' (note that some DBMSs do

not allow embedded spaces, while most languages will require such names to be enclosed in quotes).

o Separate the words with an underscore, as in 'part_name' (my preference). o Separate the words with a hyphen, as in 'part-name'.

o Use camel caps, as in 'PartName'.

Again, be consistent.

5. Common words in field names may be abbreviated, but be consistent.

o Do not allow a mixture of abbreviations, such as 'no', 'num' and 'nbr' for 'number'. o Publish a list of standard abbreviations and enforce it.

6. Although field names must be unique within a table, it is possible to use the same name on multiple tables even if they are unrelated, or they do not share the same set of possible values. It is recommended that this practice should be avoided as common names could lead to confusion after a join operation. In this situation the only way reference both fields is to give one of them an alias name, so it would be better to give one of them a different name to begin with. For example, tables named 'customer' and 'invoice' each require a field to hold a status value, so these should be given separate names such as 'acc_status' and 'inv_status' instead of the generic 'status'.

(27)

1. It is recommended that the primary key of an entity should be constructed from the table name with a suffix of '_ID'. This makes it easy to identify the primary key in a long list of field names.

2. Avoid using generic names for all primary keys. It may seem a clever idea to use the name 'ID' for every primary key field, but this causes problems:

o It causes the same name to appear on multiple tables with totally different contexts.

The string ID='ABC123' is extremely vague as it gives no idea of the entity being

referenced. Is it an invoice id, customer id, or what?

o It also causes a problem with foreign keys.

3. There is no rule that says a primary key must consist of a single attribute - both simple and composite keys are allowed - so don't waste time creating artificial keys.

4. Avoid the unnecessary use of technical keys. If a table already contains a satisfactory unique identifier, whether composite or simple, there is no need to create another one. Although the use of a technical key can be justified in certain circumstances, it takes intelligence to know when those circumstances are right. The indiscriminate use of technical keys shows a distinct lack of intelligence. For further views on this subject please refer to Technical Keys - Their Uses and Abuses.

Foreign Keys

1. It is recommended that where a foreign key is required the same name as that of the associated key on the foreign table be used. It is a requirement of a relational join that two relations can only be joined when they share at least one common attribute, and this should be taken to mean the attribute name(s) as well as the value(s). Thus where the 'customer' and 'invoice' tables are joined in a parent-child relationship the following will result:

o The primary key of 'customer' will be 'customer_id'. o The primary key of 'invoice' will be 'invoice_id'.

o The foreign key which joins 'invoice' to 'customer' will be 'customer_id'.

2. For MySQL users this means that the shortened version of the join condition may be used:

o Short: A LEFT JOIN B USING (a,b,c)

o Long: A LEFT JOIN B ON (A.a=B.a AND A.b=B.b AND A.c=B.c)

3. The only exception to this naming recommendation should be where a table contains more than one foreign key to the same parent table, in which case the names must be changed to avoid duplicates. In this situation I would simply add a meaningful suffix to each name to identify the usage, such as:

o To signify movement I would use 'location_id_from' and 'location_id_to'. o To signify positions in a hierarchy I would use 'node_id_snr' and 'node_id_jnr'. o To signify replacement I would use 'part_id_old' and 'part_id_new'.

Generating Unique ids

Where a technical primary key is used a mechanism is required that will generate new and unique values. Such keys are usually numeric, so there are several methods available:

1. Some database engines will maintain a set of sequence numbers for you which can be referenced using code such as :

2. SELECT <seq_name>.NEXTVAL FROM DUAL

Using such a sequence is a two-step procedure:

o Access the sequence to obtain a value.

(28)

I have used this method, but a disadvantage that I have found is that the DBMS has no knowledge of what primary key is linked to which sequence, so it is possible to insert a record with a key not obtained from the sequence and thus cause the two to become

unsynchronised. The next time the sequence is used it could therefore generate a value which already exists as a key and therefore cause an INSERT error.

3. Some database engines will allow you to specify a numeric field as 'auto-increment', and on an INSERT they will automatically generate the next available number (provided that no value is provided for that field in the first place). This is better than the previous method because:

o The sequence is tied directly to a particular database table and is not a separate object,

thus it is impossible to become unsynchronised.

o It is not necessary to access the sequence then use the returned value on an INSERT

statement - just leave the field empty and the DBMS will fill in the value automatically.

4. While the previous methods have their merits, they both have a common failing in that they are not-standard extensions to the SQL standard, therefore they are not available in all SQL-compliant database engines. This becomes an important factor if it is ever decided to switch to another database engine. A truly portable method which uses a standard technique and can therefore be used in any SQL-compliant database is to use an SQL statement similar to the following to obtain a unique key for a table:

5. SELECT max(table_id) FROM <tablename> 6. table_id = table_id+1

Some people seem to think that this method is inefficient as it requires a full table search, but they are missing the fact that table_id is a primary key, therefore the values are held within

an index. The SELECT max(...) statement will automatically be optimised to go straight to

the last value in the index, therefore the result is obtained with almost no overhead. This would not be the case if I used SELECT count(...) as this would have to physically count

the number of entries. Another reason for not using SELECT count(...) is that if records

were to be deleted then record count would be out of step with the highest current value.

Comments

Some people disagree with my ideas, but usually because they have limited experience and only know what they have been taught. What I have stated here is the result of decades of experience using various database systems with various languages. This is what I have learned, and goes beyond what I have been taught. There are valid reasons for some of the preferences I have stated in this document, and it may prove beneficial to state these in more detail.

The choice between upper and lower case

When I first started programming in the 1970s all coding was input via punched cards, not a VDU (that's a Visual Display Unit to the uninitiated), and there was no such thing as lowercase as the computer used a 6-bit character instead of an 8-bit byte and did not have enough room to deal with both lower and uppercase characters. CONSEQUENTLY EVERYTHING HAD TO BE IN UPPER CASE. When I progressed to a system where both cases were possible neither the operating system nor the programming language cared which was used - they were both case-insensitive. By common consent all the programmers preferred to use lowercase for everything. The use of uppercase was considered TO BE THE EQUIVALENT OF SHOUTING and was discouraged, except where something important needed to stand out.

(29)

Until the last few years all the operating systems, database systems, programming languages and text editors have been case-insensitive. The UNIX operating system and its derivatives are case-sensitive (for God's sake WHY??). The PHP programming language is case-sensitive in certain areas.

I do not like systems which are case-sensitive for the following reasons:

• I have been working for 30 years with systems which have been case-insensitive and I see no justification in making the switch.

• Case does not make a difference in any spoken language, so why should it make a difference in any computer language?

• When I am merrily hammering away at the keyboard I do not like all those pauses where I have to reach for the shift key. It tends to interrupt my train of thought, and I do not like to be interrupted with trivialities.

• To my knowledge there is no database system which is case-sensitive, so when I am writing code to access a database I do not like to be told which case to use.

• With the growing trend of being able to speak to a computer instead of using a keyboard, how frustrating will it become if you have to specify that particular words and letters are in upper or lower case?

That is why my preference is for all database, table and field names to be in lowercase as it works the same for both case-sensitive and case-insensitive systems, so I don't get suddenly caught out when the software decides to get picky.

The use of unique field names

Some people think that my habit of including the table name inside a field name (as

in CUSTOMER.CUSTOMER_ID) introduces a level of redundancy and is

therefore wrong. I consider this view to be too narrow as it does not cater for

all the different circumstances I have encountered over the years.

Field names should identify the data that they hold.

If I see several tables which all contain field names such as ID and DESCRIPTION it makes me want to reach for the rubber gloves, disinfectant and scrubbing brush. A field named ID simply says that it contains an identity, but the identity of what? A field named DESCRIPTION simply says that it contains a description, but the description of what?

One of the first database systems which I used did not allow field definitions to be included within the table definitions inside the schema. Instead all the fields were defined in one area, and the table definitions simply listed the fields which they contained. This meant that a field was defined with one set of attributes (type and size) and those attributes could not be changed at the table level. Thus ID could not be C10 in one table and C30 in another. The only time we had fields with the same name existing on more than one table was where there was a logical relationship between records which had the same values in those fields.

One of the short-lived new-fangled languages that I used with this database system was built on the assumption that fields with the same name that existed on more than one table implied a relationship between those tables. If you tried to perform a join between two tables this software would look for field names which existed on both tables and automatically perform a natural join using those fields. Thus if you had a CUSTOMER table with a primary key of ID and an INVOICE table with a primary key of ID and a foreign key of CUSTOMER_ID this software would perform the join using

(30)

CUSTOMER.ID to INVOICE.ID and not INVOICE.CUSTOMER_ID. The only way we could get this software to work was to change the ID fields to CUSTOMER_ID and INVOICE_ID.

An additional problem we found was that even non-key fields with the same name would be included. One database designer decided to have a field called STATUS defined on both the

INVOICE and INVOICE_LINE tables. This meant that when trying to show which INVOICE_LINE entries existed for a particular INVOICE only those with exactly the same value in the STATUS field would actually appear. This problem could only be solved by renaming the two fields to

INV_STATUS and INV_LINE_STATUS. This change proved beneficial later on when the range of possible status values between the two objects became different, and as the list was identified by its field name it would have made the operation of linking a list of values to a particular field more difficult if there were two possible lists with the same name.

In some applications we built automatic routines which linked help text and field labels to field names, but these could not work correctly if the same name was shared by different objects.

Primary keys and foreign keys should, wherever possible, be exactly the same.

The related fields do not have to be the same as it is still possible to perform a join, as shown in the following example:

SELECT field1, field2, field3 FROM first_table

LEFT [OUTER] JOIN second_table

ON (first_table.keyfield = second_table.foreign_keyfield)

However, if the fields have the same name then it is possible to replace the ON expression with a

shorter USING expression, as in the following example: SELECT field1, field2, field3

FROM first_table

LEFT [OUTER] JOIN second_table USING (field1)

This feature is available in popular databases such as MySQL, PostgreSQL and Oracle, so it just goes to show that using identical field names is a recognised practice that has its benefits.

Not only does the use of identical names have an advantage when performing joins in an SQL query, it also has advantages when simulating joins in your software. By this I mean where the reading of the two tables is performed in separate operations. It is possible to perform this using standard code with the following logic:

• Operation (1) perform the following after each database row has been read:

o Identify the field(s) which constitute the primary key for the first table. o Extract the values for those fields from the current row.

o Construct a string in the format field1='value1' [field2='value2']. o Pass this string to the next operation.

• Operation (2) performs the following:

o Use the string passed down from the previous operation as the WHERE clause in a SELECT statement.

o Execute the query on the second table.

(31)

It is possible to perform these functions using standard code that never has to be customised for any particular database table. I should know as I have done it in two completely different languages. The only time that manual intervention (i.e. extra code) is required is where the field names are not exactly the same, which forces operation (2) to convert primary_key_field='value' to

foreign_key_field='value' before it can execute the query. Experienced programmers should

instantly recognise that the need for extra code incurs its own overhead:

• The time taken to actually write this extra code.

• The time taken to test that the right code has been put in the right place.

• The time taken to amend this code should there be any database changes in the future.

The only occasion where fields with the same name are not possible is when a table contains multiple versions of that field. This is where I would add a suffix to give some extra meaning. For example:

• In a table which records movements or ranges I would have <table>_ID_FROM and <table>_ID_TO.

• In a table which records a senior-to-junior hierarchy I would have <table>_ID_SNR and <table>_ID_JNR.

My view of field names can be summed up as follows: Fields with the same context should have the same name.

• Fields with different context should have different names.

• Key fields, whether primary or foreign, should be in the format <table>_id.

References

Related documents

The results related with the effect of the presence of cleaning products in goat milk on the microbial inhibitor test response (Chapter 4) showed that the presence in goat

Listen to the call of your body and do not say no to this

Summary of attributable mineral resources exclusive of mineral reserves Moz Pt 2012 2013 2014 2015 Impala 38.7 40.7 28.4 27.9 RBR JV 3.2 3.5 1.5 1.5 Marula 6.2 6.3 6.3 6.7 Afplats

Gender issues have received considerable attention in educational research but research regarding the experiences of female principals with school discipline in

Our framework can accommodate these two outcomes: when trade costs are high, firms spread evenly between the two locations to supply local demand at low cost, leading to

The manufacturer shall performance test 100 percent of production fire pumps, recording flow, total head, speed and power consumed at a minimum of eight points spanning from shut

The primary care nurse practitioner is in a prime position to initiate advance care planning (ACP) interventions with patients and their loved ones to help ensure they receive

The Department of Art, Culture and Youth (DACY) along with the Building Construction Department (BCD) of the Government of the State of Bihar is planning to