Duplication Problem. Duplicating Columns

(1)

The amount of data required to perform our work today can be staggering. A fundamental question is, “How should data be stored?” Many people will use

spreadsheets for data storage. Creating a spreadsheet is simple and fast, which is way many people use them. However, they present a serious data problem that can only be solved by a database.

The fundamental difference between spreadsheets and databases is the former is meant for data analysis and the latter for data storage. To illustrate the problems of storing data in a spreadsheet, we’ll take a common scenario of keeping track of event attendance.

Duplication Problem

One of the primary benefits of a database is that it reduces duplication. Duplicate data is generally bad and should be avoided. When data is duplicated, there is an increased chance of introducing errors into the database.

Spreadsheets are not set up to reduce duplication. As such, data is often duplicated in a spreadsheet in three ways:

1. Duplicating columns. 2. Duplicating rows.

3. _{Duplicating information in a cell.}

Duplicating Columns

First Last Golf ’12 Golf 2013 Golf14

Kim Jones x x x

Bob Miaygi x x

Pat Smith x

Here we have a duplication of the events. The last three columns show the events though even in this simple example there are data anomalies. The event names are not consistent. They should either be Golf ’12, Golf ’13, Golf ’14 or Golf 2012, Golf 2013, Golf 2014 or Golf12, Golf13, Golf14. Is this a major problem? Not really. But it does

(2)

Duplicating Rows

First Last Event

Kim Jones Golf ’12

Kimberly Jones Golf ’13

Bob Miaygi Golf ’12

Robert Miaygi Golf ’14

Pat Smith Golf ’14

In this example, duplicating rows means that the person is listed more than once. Since the person is listed multiple times, it may be hard to determine if it is the same person. For example, are Kim Jones and Kimberly Jones the same person? There is no way to tell from the list.

Duplicating Information in a Cell

First Last Event

Golf ’13 Golf ’14

Bob Miaygi Golf ’12

Golf ’14

Pat Smith Golf ’14

Duplicating data in a cell is never a good idea. Of the three duplication scenarios, this is the hardest to query. How many events exists? How many people when to each

event? Duplicating cell data almost inevitably means manually counting the data.

Adding Complexity

Even if one of the above methods was used to keep track of event attendance, it is limited in its complexity. For example, perhaps the people who come have different roles. Most are guests/attendees. Some may be volunteers and a couple of others may be organizers. Perhaps there are various discounts available. How would that get noted?

(3)

Dirty Data

Dirty data refers to the type and consistency of data stored. In the first example of duplicating columns an attendee is marked with an “x”. But the cell could just as easily contain the amount paid, or a “Yes/No”.

First Last Golf ’12 Golf 2013 Golf14

Kim Jones $10 Not Paid Yes

Bob Miaygi Paid Paid

Pat Smith $10

It may seem unlikely that our small spreadsheet would suffer from dirty data. But it doesn’t take much for data to become dirty, especially as the list grows.

Query Conundrum

One of the primary reasons for storing data is to query it. That is, ask the data

questions. As mentioned above in the duplicating column section, it would be laborious to count how many events a person attended. Counting people is a fairly low-level query. That is, it is not very complex. An organization may really want know the following:

How many attendees were first time attendees? How much money did the event make?

How much money came from returning attendees? How much money came from first time attendees?

(4)

A Better Alternative

In a relational database management system (RDBMS), data can be stored and related to other data. This makes viewing and querying the data relatively easy. Moreover, the same data can often be viewed from multiple perspectives. In our example, a people layout could display all the events a person attended while an events layout can list all the attendees. Finding first time attendees, total number of attendees, total income, etc. is quick and easy to obtain.

(5)

When to Use a Spreadsheet

Although spreadsheets are not optimal for data storage, they are very useful tools for data analysis. In our example, let’s say we want to keep track of the number of people who attended an event, how much the event cost, how much revenue the event

created, and how much revenue per person the event created. As previously

mentioned, these numbers are easy to obtain in a relational database. However, a user may want to use the matrix structure of a spreadsheet to analyze the data.

Golf ’12 Golf ’13 Golf ’14

Attendees Cost Gross Net

Net per person

35 42 45

$80 $80 $82

$350 $420 $450

$270 $340 $368

$7.71 $8.10 $8.18

Questions

1. How many spreadsheets are in your computer, file server, or shared drive? 2. Of those spreadsheets, how many are for data analysis and how many for data

storage?

3. How long does it take to find (query) the data for important information? 4. Can you quickly produce a report by simply clicking a button?

New Leaf Data, LLC can help you and your organization store data efficiently. Contact New Leaf Data, LLC today to find out how.