SQL Server Table Design - Best Practices

(1)

CwJ Consulting Ltd

SQL Server Table Design - Best Practices

Author: Andy Hogg Date: 20^th February 2015 Version: 1.11

(2)

Contents

1. Introduction ... 3

What is a table? ... 3

Why is good table design important? ... 3

What happens when a table isn’t designed well? ... 3

2. Clusters and Heaps ... 4

Definition of Cluster and Heap ... 4

More on clustered indices ... 4

Using Identity ... 6

Primary Keys, and Clustered Indices ... 6

3. Dates and Times ... 7

How not to store a Date \ Time ... 7

Validation ... 7

Culture ... 8

Date and Time functionality ... 9

Choosing the best Date \ Time data type to use ... 10

Using the Time data type ... 11

Advice on recording durations... 12

4. Strings ... 12

Why a number should be a string. ... 12

How to decide ... 12

5. Choosing the correct Integer ... 13

The different Integer types ... 13

Examples of poor choice Integers ... 13

6. Approximate Data Types ... 14

7. Deprecated Data Types ... 14

What does “deprecated” mean? ... 14

What are the deprecated data types? ... 14

8. Table and Column Naming ... 15

Spaces ... 15

Prefixing \ Suffixing Tables and Views with “tbl” and “vw” ... 15

Reserved Keywords ... 15

Using P.e.r.i.o.d.s ... 16

9. Constraints ... 16

Enforcing business logic ... 16

Enforcing data cleanliness ... 17

Avoiding NULLs ... 17

Improving performance ... 18

10. One Page Summary for Better Tables ... 19

(3)

1. Introduction

What is a table?

A table is a collection of related data held in a structured format within a database. It consists of columns and rows.

EmployeeNumber FirstName LastName

127 Fred Smith

254 Joachim Löw

874 Steve Jones

423 Ralf Little

Relational Database engines (such as SQL Server) store, modify and retrieve data in this format.

Why is good table design important?

Table design has a huge effect on every aspect of system performance.

For example the amount of storage used, the amount of memory used, the amount of processing power required, the likelihood of deadlocks, the integrity of the data etc.

What happens when a table isn’t designed well?

Badly designed tables often perform well.

At first…

However as time goes on, and the amount of data and level of concurrency (number of users) increases, problems begin to manifest themselves.

Badly designed tables simply do not scale well. So systems which appear to work fine when first commissioned, will perform badly at an enterprise level with hundreds of users and billions of rows of data.

(4)

2. Clusters and Heaps

Definition of Cluster and Heap

Tables can store rows in either an ordered format, or an unordered format (known as a heap).

 Ordered storage stores the rows in the order of the values in a particular column or composite of columns.

 Heaps store rows in the order that they are inserted (although rows may be moved around later as data is modified).

Ordering the table is implemented by applying a clustered index to the table.

There are very few occasions when it is best to leave a table as a heap instead of applying a clustered index. These edge cases are few and far between.

A good rule to follow is to always create a clustered index on a table.

More on clustered indices

Since a clustered index defines the order in which rows are stored, there can only ever be one clustered index on a table.

This is because it’s only possible to order something in one way at a time.

For example, it’s not possible to take a deck of cards and order it by value low to high and also to order it by value high to low, at the same time.

You should give careful consideration when selecting which column or columns to use for the clustered index. The ideal cluster key has the following four properties:-

1. Static

Base the cluster key on a column or composite of columns which contain non- volatile values (values that will not change).

Remember that the rows are ordered by the value of the cluster key, so if the value of the cluster key changes, the row is then out of order and must be moved.

Avoid using volatile values in a cluster key.

2. Unique

It is possible to create a clustered index on a column or composite of columns which contain non-unique values, but it is highly undesirable to do this.

In the case on a non-unique cluster key, SQL Server needs to add a hidden 4 byte

“uniquifier” column. This has a substantial impact on performance.

Avoid non-unique clustered indices.

(5)

3. Narrow

The ideal data type for a column being used as a cluster key is an INT. This is because at 4 bytes it is narrow, and narrowness is a hugely desirable property in cluster keys.

It’s possible to find clustered indices defined on very wide data types e.g.

CHAR(250) or even a composite of several very wide data types.

This is a bad design decision and will affect performance.

Avoid using wide keys.

4. Increasing

The ideal cluster key will use continually increasing values.

It’s very easy to store something in an order when it’s already given in that order.

For example, consider writing down the letters A,B,C,D,E,F,G,H in alphabetical order.

Now consider writing down the letters J,W,Y,D,G,E,T,F in alphabetical order.

Which would be the easiest and quickest task to perform?

Use a continually incrementing value for the cluster key.

(6)

Using Identity

SQL Server provides the facility to automatically generate an ever increasing integer value within a table’s column. This is often good to use as a cluster key.

When using the IDENTITY property, two parameters are specified.

The first (the seed) specifies the number at which the automatic numbering should commence.

The second (the increment) specifies how many gaps between each automatic number should be left.

For example:-

 IDENTITY(1,1) will produce a series like this 1,2,3,4,5,6,7…..

 IDENTITY(100,1) will produce a series like this 100,101,102,103,104,105…..

 IDENTITY(1,2) will produce a series like this 1,3,5,7,9…..

A very common design mistake that developers make when using IDENTITY, is to define this as an INT starting from 1 and incrementing by 1 – so specifying IDENTITY(1,1).

The range of values for an INT is -2,147,483,648 to 2,147,483,647.

By starting the identity at 1, the range of usable values (and therefore the maximum number of rows in the table) is halved.

Instead, start the IDENTITY at the lowest possible value i.e. negative.

For an INT that would mean a seed of -2,147,483,648 to maximise the number of possible values.

By specifying IDENTITY(-2147483648,1) the series would then run:- -2147483648, -2147483647, -2147483646, -2147483645,-2147483644……

Primary Keys, and Clustered Indices

A primary key is a single field or combination of fields that uniquely defines a row. None of the fields that are part of the primary key can be nullable.

A table can have only one primary key.

By default, if a primary key is defined on a table in SQL Server, and no clustered index already exists on that table, then a clustered index will be automatically created with the same column definition as the primary key.

Bear this in mind if you declare a primary key on a table.

As defaults go, this isn’t awful behaviour – however like most defaults, one size never fits all cases.

Be aware that the primary key and the cluster key don’t have to use the same column or composite columns.

Separating the definition of the primary key from the cluster key can sometimes be the correct design decision.

(7)

3. Dates and Times

How not to store a Date \ Time

A mistake that many developers often make when designing tables is to store date \ time data in a string e.g. a character data type such as CHAR, NCHAR, VARCHAR, NVARCHAR. We will discuss why this is a design anti-pattern in the following sections.

Validation

By not using one of the date \ time data types, no validation of the data takes place. For example it’s possible to store a date such as the 30^th February, or a time such as 14:61:62 Compare this:-

To this:-

(8)

Culture

No regional culture information is stored when a date is stored as a string.

For example given a date of 03-04-2014, an English person would read this as the 3^rd of April, whereas an American would read this as the 4^th March.

By storing this in a data type designed for the purpose, the day, month and year is tracked and we know what the date actually represents. We can also easily display it in formats for other nationalities:-

(9)

Date and Time functionality

The T-SQL language has many built-in functions to make life much easier when querying and manipulating dates and times. However these functions cannot be used if the date or time is stored in a string format.

An example of querying using built-in T-SQL functions:-

(10)

Choosing the best Date \ Time data type to use

SQL Server offers 6 different data types for storing date and time data.

https://msdn.microsoft.com/en-gb/library/ms186724.aspx#DateandTimeDataTypes Many people, know only of the DATETIME data type, and so use this automatically.

Advice:-

 If storing dates, and the time component is not important or even recorded, then use the DATE data type instead. DATE will use 3 bytes, whereas DATETIME will use 8 bytes.

Here's an example of a table using a DATETIME data type when a DATE would have clearly been a better choice:-

Storing ’00:00:00.000’ for every row is wasteful. Here it costs (6 x 5) = 30 bytes per row.

 If storing times and the date component is not important or even recorded, then use the TIME data type instead and specify the precision that you need.

TIME will use between 3-5 bytes (depending on precision), whereas DATETIME will use 8 bytes.

(11)

 If storing both the date and time is a requirement, then consider using the SMALLDATETIME data type.

SMALLDATETIME can store dates between January 1, 1900, - June 6, 2079.

It stores time to a granularity of 1 second.

Unless dates outside of this range are required, or fractional second precision is needed, then SMALLDATETIME is a better choice than DATETIME.

SMALLDATETIME uses 4 bytes whereas DATETIME uses 8 bytes.

Using the Time data type

The TIME data type should only ever be used to record a point in time.

It should never be used to record a duration of passing of time.

For example a very bad use of the TIME data type would be to record “hours worked”.

In the example below, it’s difficult to try to calculate how many hours were worked in the week.

(12)

Advice on recording durations

When faced with a requirement such as the one above, there are 2 different design patterns which are better solutions:-

 Design the table with a column for “StartTime” and a column for “EndTime”. Use the TIME or SMALLDATETIME data types for these.

(Using TIME would assume that people never work past midnight, so SMALLDATETIME might be a better choice).

Then use the date and time functions within T-SQL to calculate the duration between them.

 Design the table with a column named “HoursWorked” and a column named

“MinutesWorked”. Use the TINYINT data type for both.

4. Strings

Why a number should be a string.

A common mistake often made with table design is to store a number in an integer format when it might be better be served as a string. A telephone number is a good example of this.

By storing a telephone number as an integer, it’s not possible to append leading zeroes to the number (needed for international dialling) or to add helpful telephone number punctuation like + or - or () or <space>. For example +0044 (0) 7123 987 654

Expressed as an integer that would be 4407123987654.

How to decide

When deciding whether to store a number in some kind of numeric format, or as a string – ask the question:-

“Will there be a need to perform any mathematical calculations on this value?”

If the answer to this question is “no”, then store the value as a string.

For example, it’s unlikely that you would need to find the sum of all your friends’ phone numbers. It’s also unlikely that you would need to calculate the average phone number of all your contacts.

(13)

5. Choosing the correct Integer

The different Integer types

It’s very common indeed to see the INT data type used exclusively in tables whenever a whole number needs to be stored.

Many people are unaware that SQL Server supports 4 different INT data types, which have the following properties:-

Data Type Range Storage

BIGINT -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807 8 Bytes

INT -2,147,483,648 to 2,147,483,647 4 Bytes

SMALLINT -32,768 to 32,767 2 Bytes

TINYINT 0 to 255 1 Byte

Avoid immediately using an INT data type automatically without considering its smaller cousins.

Examples of poor choice Integers

Here are a couple of real world examples where more thought might have led to a better choice of data type:-

 A column which is an INT data type (so taking up 4 bytes of storage) yet only holding two possible values (33 or 36).

This could have been implemented as a TINYINT data type (1 byte) saving 3 bytes per row.

 A column which is an INT data type (so taking up 4 bytes of storage) yet only holding four possible values (-1,0,1,2).

This could have been implemented as a SMALLINT data type (2 bytes) saving 2 bytes per row.

(14)

6. Approximate Data Types

T-SQL provides two data types FLOAT and REAL which can be used to store extremely small or extremely large numbers.

However these two data types store an approximation of the value, as opposed to the exact value.

The approximate nature of these 2 data types therefore means that their use is best avoided in any financial system.

Instead, use the DECIMAL or NUMERIC data types. They are functionally the same, however the important point is that they store an exact value and not an approximation of one.

Here’s an example of a table which is using approximate data types to store financial data:-

Unless these are incredibly small, or immensely huge numbers, the designer of this table might have been better advised to use the DECIMAL \ NUMERIC data types for these.

7. Deprecated Data Types

What does “deprecated” mean?

As newer versions of the SQL Server product are released over time, Microsoft removes outdated features.

In order to give SQL Server users time to provision for these features being removed, Microsoft first mark these features as “deprecated”.

A deprecated feature is one that is still available to use in the product, but which is scheduled to be removed from a later version of the product in the future.

For this reason, it’s important not to use deprecated data types in any new development work, since these data types have a finite life span.

What are the deprecated data types?

Currently, the following data types are deprecated:-

 TEXT

 NTEXT

 IMAGE

(15)

8. Table and Column Naming

Spaces

Avoid using spaces in table or column names.

Table or column names with spaces in them must always be surrounded by delimited identifiers. For example compare these:-

Prefixing \ Suffixing Tables and Views with “tbl” and “vw”

Although many people make a case for typeful names, in the case of tables and views, experience suggests that this is a bad idea.

Consider a single table named tbl_Person which contains details of both permanent staff and contractors that is queried for reporting purposes.

At some stage in the future, it’s decided to change this by splitting tbl_Person into two separate tables, one for contractors (tbl_Contractor) and one for permanent staff (tbl_PermanentStaff).

To make this change without modifying the existing reporting queries, a view is created to replace the original Person table. The view simply unions the contents of the two new tables.

The end result is a view named tbl_Person.

The prefix is now misleading because it’s no longer a table, it’s a view.

Reserved Keywords

The T-SQL language has a certain vocabulary of reserved keywords. These reserved keywords shouldn’t be used as the name of a table or column.

It is possible to use these reserved keywords as a name by using delimited identifiers, however this practice is strongly discouraged.

A full list of reserved keywords may be found here.

https://msdn.microsoft.com/en-us/library/ms189822.aspx

(16)

Using P.e.r.i.o.d.s

SQL Server uses the period to distinguish the levels of an object hierarchy.

For example - Server.Database.Schema.Object Or often just – Schema.Object

Avoid using periods in any object or column name in SQL Server.

So for example, don’t name a table Gas.Operational.Cost.Estimate

It is much safer to express this as either:-

GasOperationalCostEstimate or Gas_Operational_Cost_Estimate

9. Constraints

Constraints are a way of restricting which values are allowed to be stored in a column.

They perform many useful functions.

Enforcing business logic

Constraints can prevent data being entered which contravenes business logic.

For example if an error severity level should only ever be -1,0,1, or 2; then a CHECK constraint could be set to enforce this:-

Or if an order in an order table should always have an assignment to a customer in a customer table, then a foreign key constraint could be used to enforce this.

(17)

Enforcing data cleanliness

Constraints can be used to help with keeping data clean. For example in a Gender column we could set a CHECK constraint to restrict values to “Male”, “Female” or “Unknown”.

This would prevent values being entered such as “M”, “F”, “Man”, “Woman”, “Hombre” etc.

Avoiding NULLs

In a database, a NULL means “the value is not known or not applicable”.

NULL does not mean zero, nor does it mean “” (an empty string).

In some cases, this may be reasonable. For example, a column that stores mobile phone numbers might allow NULLs, because not everyone has a mobile phone.

However sometimes it’s important to ensure that any record being added or modified has a compulsory piece of information specified.

When a column value shouldn’t be allowed to be omitted, use NOT NULL in the column definition.

(18)

Improving performance

Constraints can help the SQL Server query optimiser to do a better job.

Consider a bag of marbles of various colours.

We know that there are 20 marbles in the bag, and that 12 of them are red.

How many of the marbles are black?

In order to answer this question we must empty out the bag and individually count the black marbles (because there might be marbles of other colours in the bag too).

However, if a constraint is added “The bag can only contain red or black marbles”, then it’s immediately obvious how many black marbles are in the bag, without the need to empty all the marbles out of the bag and count them.

The SQL Server query optimiser can take such logical short-cuts if constraints are defined on columns.

(19)

10. One Page Summary for Better Tables

 Design tables properly from the outset to avoid future problems.

 Define a clustered index on tables.

 Always use a cluster key that is static, unique, narrow, and incrementing.

 If you have a composite cluster key which uses a combination of more than three columns, you should probably rethink your table design.

 Consider creating an ID column for the cluster key, and use the IDENTITY property As a mechanism to populate it.

 If using the IDENTITY property, seed it with the lowest value possible.

 Remember that a primary key and a cluster key don’t have to use the same columns.

 Don’t store dates and times as strings.

 Consider various different data types for suitability, rather than just automatically using INT or DATETIME.

 Don’t use the TIME data type to record a duration of time.

 Don’t automatically store a number as an INT (e.g. a phone number is better stored as text).

 Don’t use FLOAT and REAL data types to represent financial data.

 Don’t create any new tables which use deprecated data types.

 Don’t use spaces or periods in table or column names.

 Don’t use reserved keywords in table or column names.

 Don’t prefix \ suffix tables or views with “tbl” or “vw”.

 Use constraints whenever applicable.

 If a column shouldn’t contain NULL values then define it as NOT NULL.