Eliminating Required Waste and
Non-Value Added Processing
in Data Warehousing
a Six Sigma LEAN Perspective
John G. McManus
Bank of America
John G. McManus – Bank of America
• Senior Vice President – Global Wealth and Investment Management Technology
• Responsible for: data warehouse/datamart, sales reporting, workflow, document management and incentive compensation applications
• Six Sigma Green Belt certified
• Formerly director of data warehouse and business intelligence systems at AT&T Broadband
• Served as officer in the US Army and Army Reserve for 22 years
• Currently supporting transition integration of the Merrill Lynch and Bank of America brokerage businesses and technology
Overview of the Presentation
• A quick look at Bank of America
• Six Sigma and LEAN fundamentals
• Six Sigma & LEAN perspective on data warehousing lifecycle processes
• Summary
Bank of America – Annual Report – 2008 (does not include
Merrill Lynch)
• Annual revenue of $72B
• Annual net income $4B
• 6,100 Retail Banking Outlets
• 18,700 ATMs
• 59 million consumer and small business banking relationships
• 29 million active on-line banking users
• Bank of America serves more than million consumer and small business relationships
• Global Wealth and Investment Management - more than 2 million individual and institutional clients worldwide
Six Sigma at Bank of America
• Focuses on the reduction/elimination of waste in our processes
• Focuses on the reduction/elimination of defects
• Reduces variation using statistical tools
• Provides ongoing process control and continuous improvement
• Start with the customer perspective
“Begin with the End in Mind” Stephen Covey
Sigma Level - Metric
Measures Defects per Million Opportunities
Sigma Level Defects per Million
Opportunities 2 308,567.0 3 66,807.0 4 6,210.0 5 233.0 6 3.4
LEAN
• Six Sigma focuses on getting variation out of our processes
• LEAN focuses on waste elimination, leveraging common tools and techniques from Six Sigma
• LEAN Tools include:
– End to End System Map
– Physical Process Map
– Time Value Map
– Spaghetti Chart
– Product Process Flow Analysis
Seven Types of Waste
• Defects
– Incomplete or corrupt source system feeds
– Low chip yield causes Xbox shortage for Christmas 2005
• Overproduction
– Excessive storage space, redundant data
• Transportation
– Moving data multiple times though the information pipeline
• Waiting
– Processes that don’t kick off until all source files are in hand
• Inventory
– Wasted storage space, excessive history
• Motion
– Putting the bun on the burger only to remove it to put on the pickles
• Processing (too much)
Defining Value
• Value Added Activity – An activity that CHANGES the size, shape, fit, form or function of material or information (for the 1st time) to meet customer requirements.
– Transformation of dimensional data to load to a fact table
• Non-Value Added Activity – All other activities that take time or resources or does not satisfy customer requirements.
– Partitioning of data
– Indexing
– Creation of aggregate tables
The Activity of the Product
The only four things a product can be doing
• Storage (White Space)
• Transportation
• Inspection
• Processing
– Value added
– Non-value added
The Activity of the Equipment
Value Add and Non-Value Add activity
• Run – Automated processing – Manual processing • Load • Unload • Setup • Replenish • Complete • Idle • Maintain
Typical Data Warehouse Development Tasks - Categorized
Procure and install multi-terabyte database/storage system
Develop logical data model - VAP
• Develop physical data models – Required Waste
• Calculate storage requirements for all objects – Required Waste
• Design database partitions / strategy – Required Waste
• Design indexes – Required Waste
• Design aggregates and materialized views – Required Waste
Create DDL - VAP (for data tables – all else is req’d waste)
• Create physical data tables including partitions – Required Waste
• Create indexes – Required Waste
• Create aggregates and materialized views – Required Waste
Create users, roles, entitlements - VAP
Actaul Results - DDL Migration
• DDL from traditional RDBMS migrated to data warehouse appliance within 1 day.
• Process removed 95+% of content from existing DDL files including tablespace, partition, and index declarations.
BEFORE – 1821 lines of SQL
CREATE TABLE CMVFACT.FCT_MKV_CUST_BAL (
PER_CCYYMM NUMBER(6) NOT NULL, SK_CIF NUMBER(15) NOT NULL, SK_PR_L5 NUMBER(8) NOT NULL, BAL_CD NUMBER(5) NOT NULL, BUS_SEG_L3_ID NUMBER(4) NOT NULL, SEG2_CD NUMBER(2) NOT NULL, ISO_CRNCY_CD CHAR(3) NOT NULL, CO_CST_CTR_ID CHAR(10) NOT NULL, CAL_MON_DAY_NR NUMBER(2) NOT NULL, CAL_MON_BUS_DAY_NR NUMBER(2) NOT NULL, BAL_AM NUMBER(15,2) NOT NULL, BAL_ORIG_CRNCY_AM NUMBER(18,2) NOT NULL, SRCE_SYS_CD NUMBER(4) NULL ) TABLESPACE CMVFACT NOLOGGING PCTFREE 0 PCTUSED 90 INITRANS 1 MAXTRANS 255 STORAGE(BUFFER_POOL DEFAULT) NOPARALLEL NOCACHE PARTITION BY RANGE(PER_CCYYMM)
(PARTITION P200205 VALUES LESS THAN (200206) STORAGE(FREELISTS 1
FREELIST GROUPS 1),
After – 17 lines of SQL
CREATE TABLE FCT_MKV_CUST_BAL (
PER_CCYYMM integer NOT NULL, SK_CIF bigint NOT NULL, SK_PR_L5 integer NOT NULL, BAL_CD integer NOT NULL, BUS_SEG_L3_ID integer NOT NULL, SEG2_CD integer NOT NULL,
ISO_CRNCY_CD CHAR(3) NOT NULL, CO_CST_CTR_ID CHAR(10) NOT NULL, CAL_MON_DAY_NR integer NOT NULL, CAL_MON_BUS_DAY_NR integer NOT NULL, BAL_AM numeric(15,2) NOT NULL,
BAL_ORIG_CRNCY_AM numeric(18,2) NOT NULL, SRCE_SYS_CD integer NULL
)
Typical Data Warehouse Load Tasks
• Drop indexes
• Drop constraints
• Drop aggregates
• Sort data before loading
• Transform and load data
– Look at why a job that ran in 2 minutes yesterday now runs in 2
hours
– Add storage to accommodate data growth
• Regenerate indexes
– Add storage to accommodate data growth
• Regenerate constraints
• Recreate aggregates
• Regenerate materialized views
Time Value Map Inputs – Data Warehouse Load Tasks
(time in minutes)
Process Step Description Store Trans Inspect NV Process VA Process 001 Migrate Data from Source Systems 120
002 Reconcile Data Back to Source 20
003 Preprocess, sort and prep data 30
004 Drop Indexes 5
005 Drop Constraints 1
006 Drop Aggregates 2
007 Drop Materialized Views 2
008 Load Data 30
009 Create Constraints 180
010 Create Indexes 90
011 Create Materialized Views 60
012 Create Aggregates 120
Time Value Map
• Categorize each process step according to the “Activity of the Product”
• Identify value added vs. non-value added processing
• “White space” is bad and needs to be eliminated
Actual Results – 4TB Production Timeline - Monthly
• Production Timeline with traditional RDBMS / hardware
• Production Timeline using data warehouse appliance
Load Stage Dimension/FTP Fact Build Aggregation Index/Swap Certification 21st Calendar Days 25th
CMV Timeline (assume batch begin 17th calendar day 5:00 PM PST
17th 18th 19th 20th 22nd 23rd 24th 26th Load Stage Dimension/FTP Fact Build 26th 22nd 23rd 24th 21st 25th
CMV Timeline (assume batch begin 17th calendar day 5:00 PM PST
17th 18th 19th 20th
8 DAYS removed from monthly batch process or 96 days of increased data
Information Sprawl – Tracking the Activity of the Product
(in LEAN Terms – E2E Process or Spaghetti Chart)
Various Six Sigma Methods for Determining Root Cause
• Failure Modes and Effects Analysis (FMEA)
– Focuses priority on the most critical areas
• Cause and Effect Diagram (Fishbone) - feeds FMEA
• Shows influences on a process – potential failures
• Load failed – why?
– Ran out of tablespace
– Partition not defined for new month
– Incomplete data file
– Data anomalies
– Bad ETL logic
• The 5 Why’s
The 5 Why’s – Get to the Root Cause for Failures
• “Hey Boss, we missed our SLAs for the users again today”
• Why? – The job that always runs in 2 minutes ran for 2 hours
• Why? – We think the database chose a different execution path
• Why? – The Statistics might have been out of date
• Why do we need statistics? – To enable the database to make
intelligent choices about which indexes to use and how to join tables
• Why do we need indexes? – Because traditional RDBMs have significant design flaws
Why do we do it?
• Creation of indexes
– Queries and loads would perform abysmally without them
• Creation of aggregates
– Users want data summarized in many instances
– Queries and reports would perform abysmally without them
• Creation of data marts
– Denormalize the data for reporting efficiencies
– Need to offload impact of reporting from data warehouse
– Specialized transformations
• DDL to support required waste
Today’s Challenges
• Larger data volumes and increased data retention
• Increased reliance on data by the business
• Smaller and smaller maintenance/load windows
• Increased SLAs, support for international availability windows
• Highly sophisticated users with a thirst for data
• Increased risk, compliance and regulatory oversight of data
What if we Could Eliminate the Need for Indexes?
• Eliminate storage needs for indexes, usually 2x – 3x data needs
• Eliminate time consuming index regeneration – usually longer than data load times
• Eliminates on-going tuning/maintenance of indexes – less administrative DBA support needed
• Deliver data faster – get it in the hands of decision makers quicker
• Enable user self-empowerment – unlock the data
What if we Could Eliminate the Need for Aggregates?
• Eliminate storage needs for aggregation tables and associated indexes
• Eliminate time consuming development and testing of aggregation routines – could be man months
• Eliminates time consuming daily/weekly aggregate regeneration
• Data is more current – no need to limit users to weekly or monthly views only [Artificial Constraint]
• Data is delivered faster allowing allow earlier access by the users
• Increase reliability – less places to fail – many aggregation routines have flawed logic – the users just don’t know it
What If We Could Eliminate The Need For Datamarts?
• Eliminate total cost of ownership for datamart hardware and software licenses
• Eliminate development, maintenance and production costs of marts
• Single source of the truth
Summary
• On average, data warehousing is >90% required waste or non-value added processing
• The cost of this waste translates to:
– Unacceptably long time to market
– Unnecessary hardware and software license costs
– Terabytes of wasted storage
– Elongated development data load cycles
– Longer periods of data unavailability
– Stale data
– Poorly performing loads and queries
– Excessive administrative costs
Henry Ford said:
“Time waste differs from material in that there can be no salvage. The easiest of all wastes, and the hardest to correct, is the waste
of time, because the waste of time does not litter the floor like wasted material.”
We need to eliminate the waste in data warehousing
– Reclaim the time we currently waste