What Is Refactoring? - Agile Analytics - A Value-Driven Appr. to BI and Data Warehousing

Without rewriting the book on database refactoring, perhaps it would be best to explain the concept of refactoring with an example. Suppose our FlixBuster data warehouse has been in production for some time and the data model includes a conformed product dimension that is keyed on product ID and is used by several fact tables. However, in the current develop- ment cycle we are integrating a new syndicated data source that contains rich information about product purchases but is at the product subcategory grain rather than the more detailed individual product grain. Because we will be building a new fact table that uses the grain of product subcategory, we decide to split the product dimension into two separate dimensions, Dim_ Product and Dim_ProductSubcategory. The new dimension can be used as a star schema for detailed product facts or linked directly to the more coarse- grained subcategory facts.

Table 6.1 Database Refactoring Categories Database Refactoring

Category Description Example(s)

Structural A change to the definition of one or

more tables or views

Moving a column from one table to another, or splitting a multipurpose column into separate columns

Data quality A change that improves the quality

of information

Making a column non-nullable or applying a common format to a column

Referential integrity A change that ensures that a

referenced row exists within another table and/or that a row that is no longer needed is removed appropriately

Adding a trigger to enable a cascad- ing delete between two entities, code that was formerly implemented outside the database

Architectural A change that improves the overall

manner in which external programs interact with a database

Replacing an existing Java database operation in a code library with a stored procedure in the database to make it available to non-Java applications

Method A change to a stored procedure, ETL

object, stored function, or trigger that improves its quality

Renaming a stored procedure to make it easier to understand

Non-refactoring transformation

A change to a database schema that changes its semantics

Adding a new column to an existing table

ptg6843605

Because the data warehouse has been in production for some time, and

Dim_Product contains historical data, we’ll need to be disciplined in how we split this dimension table to ensure that all existing BI applications continue working properly. That means that we’ll need to introduce the new dimension table but establish a transition period during which the original table continues to include subcategory information. This transition period will allow us to carefully refactor the BI applications to use the new subcategory dimension wherever appropriate. We’ll use the Split Table refactoring for this task (Ambler and Sadalage 2006).

During the transition period we will have some intentional data duplica- tion. To avoid possible inconsistencies between these dimensions we will outfit each table with a trigger to synchronize data across tables in the event of an insertion, deletion, or update. We must create these triggers so that cycles between the two do not occur.

While we don’t want the transition period to linger too long, we also don’t want to deprecate the old schema too early and risk breaking any BI applications. After careful consideration we’ve determined that February 1, 2013, is an appropriate date by which to complete the schema transition. To imple- ment this we’ll schedule the necessary ALTER TABLE statements to run on that date. This step in the refactoring automates the necessary housekeeping that we might otherwise forget to do. Figure 6.2 depicts the schema before, during, and after our transition period.

Dim_Product Dim_Product 1..* 1 1..* 1 Dim_ProductSubcategory Dim_ProductSubcategory Dim_Product -productID <<PK>> -productCategory Original Schema Transition Period Resulting Schema -productSubcategory -productName -productFeature -productBasePrice -productID <<PK>> -productID <<PK>>

TIME -subcategoryID <<PK>>-subcategoryName

-subcategoryCategory -subcategoryID <<PK>> -subcategoryName -subcategoryCategory SynchronizeWithProduct -productCategory -productSubcategory -productName -productName -productFeature -productFeature -productBasePrice -productBasePrice SynchronizeWithSubcategory

{ event = update | delete | insert, drop date = February 1, 2013 }

ptg6843605

REFACTORING 161

The code for this refactoring looks like this:

CREATE TABLE Dim_ProductSubcategory (

subcategoryID VARCHAR(15)NOT NULL, subcategoryName VARCHAR(20)NOT NULL, subcategoryCategory VARCHAR(15)NOT NULL, CONSTRAINT PKSubcategoryID

PRIMARY KEY (subcategoryID) );

-Trigger to keep all split tables in sync

CREATE OR REPLACE TRIGGER SynchronizeWithSubcategory BEFORE INSERT OR UPDATE

ON Dim_Product

REFERENCING OLD AS OLD NEW AS NEW FOR EACH ROW

DECLARE BEGIN IF updating THEN FindOrCreateSubcategory; END IF; IF inserting THEN CreateSubcategory; END IF; END; /

CREATE OR REPLACE TRIGGER SynchronizeWithProduct BEFORE UPDATE OF productSubcategory

ON Dim_ProductSubcategory

REFERENCING OLD AS OLD NEW AS NEW FOR EACH ROW

DECLARE BEGIN IF updating THEN FindAndUpdateAllProductsForSubcategory END IF; END; / -On February 1, 2013

ALTER TABLE Dim_Product DROP COLUMN productCategory; ALTER TABLE Dim_Product DROP COLUMN productSubcategory; DROP TRIGGER SynchronizeWithSubcategory

DROP TRIGGER SynchronizeWithProduct

But we aren’t going to write all that code at one time. Before we even start this refactoring, we’ll review the existing test suite to be sure the original schema is well covered by automated tests. We’ll take the time to add any new tests we wish we had, and we’ll ensure that all the tests are passing.

ptg6843605

Then, before we begin writing the code, we’ll write new tests to test the code we’re about to write.

For example, we’ll start by writing some structural tests to verify that the

Dim_ProductSubcategory table exists and contains the expected columns and constraints. These tests will fail until we write the CREATE TABLE query correctly. As we proceed in tiny steps, writing tests along the way, we’ll also continue to rerun the old tests to be sure we haven’t broken anything. By the time we’re finished, we will have a new table, new triggers, and a suite of new tests to validate our work.

Structural database refactorings typically follow a pattern that includes a transition period between the old schema and the new one. These buffers coupled with automated tests are essential ingredients of refactoring safety. Additionally, all of the database code and related artifacts should be man- aged in a version control system, making it easy to roll back to an earlier version if things go horribly wrong. We’ll discuss that in detail in Chapter 8, “Version Control for Data Warehousing.”

Many other refactoring situations and corresponding database refactorings are presented in Refactoring Databases (Ambler and Sadalage 2006), and future books may introduce even more. However, this example provides a glimpse of what database refactoring looks like and should convey the idea that refactoring is more than simply restructuring or improving database elements.

In document Agile Analytics - A Value-Driven Appr. to BI and Data Warehousing - K. Collier (Pearson, 2012) BBS (Page 196-199)