This chapter begins our second case study with the emphasis on working through the entire data integration life cycle tasks and deliverables. Subsequent chapters cover the phases of the data integration life cycle and provide case studies for each phase. This case study is based on inte-grating three order management systems for the Wheeler Automotive Company into an enterprise data warehouse and product line profitability data mart.
For the analysis case study, we focus on developing project scope, source systems analysis, and data mapping deliverables.
Case Study Overview
The Wheeler Automotive Company is a fictional midsized auto parts supplier to the automotive industry and has been fairly successful since the company’s inception back in the mid-1960s.
Due to the recent recession, there has been increased focus on cost and profitability at a level of detail that is not currently available in its current plant-level reporting, as shown in Figure 6.1.
ptg
For Wheeler to perform the types of analysis needed to answer these profitability questions, it needs an environment where the disparate order information is consolidated, conformed by subject areas, aggregated by time, and displayed at a transaction level that provides management information about what product lines are selling and showing a profit.
Envisioned Wheeler Data Warehouse Environment
To meet the profitability reporting requirements as well as other future analytic and reporting needed, the Wheeler Information Technology Department has planned to define, design, and build an enterprise data warehouse and product line profitability data mart, as shown in Figure 6.2.
To date, the data warehousing team has completed a logical and physical data model for the data warehouse and product line data mart, as shown in Figure 6.3.
Current Wheeler Reporting Environment
Domestic Order Management System
Asian Order Management System
European Order Management System
Quarterly Profit Reports
Quarterly Profit Reports Quarterly Profit
Reports
Figure 6.1 Case study 2: Wheeler source systems
ptg
Case Study Overview 119
Customer Order Report Month Ending 02/27/2010
CustomerAuto Manufacturer 1Location 1
Order 1001Date 2/2/2010
Line # Item Number Description Cost Price Quantity Total Gross Profit 30% Overhead Net Profit 1 1101 Steel Wheels, Type 1 $100 $125 1,000 $125,000 $25,000 $37,500 -$12,500 2 1201 Wheel Bearing, Type 1 $10 $30 5,000 $150,000 $100,000 $45,000$55,000 3 1301 Rubber Joints, Type 1 $7 $12 10,000 $120,000 $50,000 $36,000$14,000
Product Line Profitability Reporting Envisioned Wheeler Automotive Data
Warehouse Environment
Enterprise Data Warehouse Domestic Order
Management System
European Order Management System
Asian Order Management System
Planned Data Integration
Hub
Product Line Profitability
Data Mart
Figure 6.2 Envisioned Wheeler data warehouse environment
Products
PK: Product Identifier
Data Warehouse Layer
Customers
PK: Customer Number
Addresses
PK: Customer Identifier, Address Number
Customers
PK: Customer Number
Data Mart Layer
Order Lines
PK: Order Number.
Order Line Number
Order
PK: Order Number
Order
PK: Order Number
Time
PK: Time Id
Order Lines
PK: Order Number.
Order Line Number
Products
PK: Product Identifier
Figure 6.3 Wheeler data warehouse and data mart data models
ptg The Wheeler data warehousing team has also produced a sample report layout portrayed in
Figure 6.4 for the product line profitability reporting that includes the known aggregations and calculations.
Customer Order Report Month Ending 02/27/2010
CustomerAuto Manufacturer 1 Location 1
Order 1001Date 2/2/2010
Line # Item Number Description Cost Price Quantity Total Gross Profit 30% Overhead Net Profit
1 1101 Steel Wheels, Type 1 $100 $125 1,000 $125,000 $25,000 $37,500 -$12,500
2 1201 Wheel Bearing, Type 1 $10 $30 5,000 $150,000 $100,000 $45,000 $55,000
3 1301 Rubber Joints, Type 1 $7 $12 10,000 $120,000 $50,000 $36,000 $14,000
Figure 6.4 Wheeler sample report layout
Aggregations in a Data Warehouse Environment
To meet all the requirements of this case study, we need to deal with aggregations, and where they occur for this effort. Although this text is primarily focused on data integration, it is important to take a moment to discuss a general data warehousing best practice. The “when” and “where” of data aggregation and calculation can be performed in all the layers of a data warehouse. In what layer the aggregation or calculation is performed should be evaluated based on potential perfor-mance and static nature of the aggregation or calculation, for example, Pre-Query or On-Query.
Figure 6.5 illustrates the possible data warehouse layers where an aggregation or calcula-tion transform could occur.
Data Warehouse
Architectural Problem:
Where to Calculate?
$ 33,000 W1 15,000 W2 + 20,000 W3
$68,000 Monthly
Option 1: In the Data Integration Processes
Data Integration Analytics
Analytics Server
ledoMataDlacigoL
Irem
otsuC :tcoejrP sisylanAnoitcaretn
traM
ataD
:peyTleod M
A th I BMCiVttl D01/10t A A ngr arA emgnr arneemgnr ar enarr mit Pntm eayy PTytnemyaPyramirt PepyTntm eyaPyramirPtneem IpedcrD s
C hannell IneanhC ddeoCelnnahCemaNelnnahC
epyTlennahC e Iyp l Tne anhC Tyelnnhadpe
C
I gg FniknaBtenretngalFsr meTailecpSgalFgniknae BnohpeleT
i
C ust om er I ssylann Aoitcaretn d I r oiePtnemr esu
eaM d ( F K ) a Ierci Aph r aog G er aogeG IenoZemiT dda TerAciphe Iypdt oiangiseDytinseDaerAcihpagreo n I
G dItnemgeSt eaRtnemyolpmenU d
B ankr upt cy R at e S egm ent Idt Inf noial tr aogt eR aS egmIendIerutaa NerAciph G eci CnopheleT e IodddeoCaerAcihpagroeG d doiret PnemeuraseM
d Ir oiePtnm er e suea
M noialtput eD ad
P oalupoPe Iqiuemin Tt oi U d In S our ce S yst emne IpyTdr oiePntm eersuea emursaM ed Ir oiePrandelat C nearP arY eradnelaCrer taur Qd daelnaCnor Mt h D ay O f nearP danelC araeYalcsiFrer taul QcasiFhtnol Mcad IoirePalcsit Fdhnor M
F sia Palgy FadiloHnyapmoCgaly FaDssniesuBy IaDssenisuBtas al L gFhntM onnoipitcrseD e IypTdr oiePtem nersuaeM d ( F K )
epyTdoiret PnemeruaseM Ieyp d ToirePetm nerusea M dpeyTdoirePtnemeruaseM
ntm e
genarrAtucdorPr yamirP t InemengarrActduorPyramirP drcsDntm ege narrAtucdorPr yamirP
S easonS eason Id S eason
Data Warehouse Layer
algci L DataModelo cPr tejo :CustomerInteractnoiA siylsna ModelType:DataWarehouse AthIBM C V i 10Dt01/tlt01/06
ArrrangeentPrimaryPaymentType To Customer ArrrangementPrimaryPaymentType Id(FK) haCType Id(FK)lne anC Code
C Type Idelnnha
C Type
CustomerCustomerId angementPrimaryPaymentType Id ArrrCustomerPerformance Status Id PrimaryProductArrrangementId Efffecti veCustomerDate End CustomerDatenProv si goi FalAi rrr lTav ale gFAiii r e bnl luC Pass galFni Internet gnkaB gFal laiecpS Terms galFenohelpeT gniknaalB gF
n
CustomerInteract sioi tH ory oi Id(FK) MeasurementPer d CustomerId(FK)elnnha
C Id(FK)
Interact noi Type Id(FK)
CustomerPerformance Status CustomerPerformance Status Id CustomerPerformance Status Dscr CustomerTo CustomerPerformance Status CustomerId(FK) CustomerPerformance Status Id(FK) End Date StartDate
ciGeogr pha Areaci Geogr pha AreaId T eZone
miIdci
Geogr pha AreaType Id Geogr cipha Area tisenD y tangiseD yUnem molp entRateSegmentIdnoi Id BankruptcyRateSegmentId Inf tl naoi RateSegmentId Geogr ciphaAreaNatureIdcinohelpeT Code Id Geogr cihpa AreaCode
ci
Goegr pha AreaTo Customer hGeogr cia AreaId(FK) pCustomerId(FK) End Date StartDate
nInteract oi Type nInteract oi Type Id Interact noi Type Descr tpi noi
oi
MeasurementPer dd MeasurementPer oi Idaltpu P nooi DatealtpuoP noi miT euenqi U IdInSource Systemd MeasurementPer oi Type IdeDate Efffecti vEnd Dated MeasurementPer oi Name Parent arndaelrdaelnaC Per doi Id CYearrandael CQuarterrdaeln
C Monthaa
Week Of rndaelC Year Week Of rdnelaCQuarteraa Week OfrndaelC Month Day Of rdanelaC Year Day Of C randaelterQuar Day Of rndaelC aMonth Parent lcaF Per sioi IddalsicFYearaliscFQuarterlacsiF Month Week aOf lsicFYear Week OfalcasiF Quarter Week Of lsicF Month Day aOf lsicF Year Day aOf lsicFQuartera Day OflsicF Month Day OfWeek Season Id(FK) NumberOfDayss NumberOf seniusB Days NumberOfCr tide InterestDays NumberOf tibeD InterestDaysbcilP H yudilagoFala Company yodilH algalFssenius
B Day gFs
LastuB pi n essniDay InMonth gFal Descr toid MeasuremnetPer oi Type Id(FK)
d
MeasurementPer oi Typed MeasuremnetPer oi Type Id
d
MeasurementPer oi Type
PrimaryProductArrrangement
Option 2: In the Database
Option 3: In the Analytics Engine
Option 4: On Query
Figure 6.5 Architectural options on where to perform a transform
ptg
Case Study Overview 121
Option 1: In the data integration layer—Where the aggregation is performed in the transformation layer of a data integration process. This option is preferred for large volumes of static data that needs to be aggregated.
Advantages:
• Faster performance on query, no in-memory calculations. By having the data pre-cal-culated, the report query simply needs to return a value and the processing load is placed on the data integration environment, rather than on the data warehouse data-base or analytics engine. In this scenario, there is no query wait time for calculations to perform.
Disadvantages:
• Inflexibility in recalculation is required. In business intelligence environments where recalculations are required (for example, what-if scenarios), precalculated query results will not meet the business requirements.
Option 2: In the data warehouse database layer—Where the aggregation or calculation is performed as a stored procedure in the data warehouse or data mart based upon a trigger from a query (e.g., the ON QUERY SQL function). This option provides a little more flexibility than in the data integration layer and pushes the processing requirements on the database server rather than on the analytics engine.
Advantages:
• Faster performance on query, no in-memory calculations. In this scenario, the only wait time is for the database engine to perform the aggregation or calculation.
Disadvantages:
• Inflexible for recalculations when recalculating the query is required; the stored pro-cedure will need to re-execute, causing query wait time.
• Poor metadata management and loss of metadata on the transformations. Store pro-cedures are notoriously poorly documented and their metadata is typically not man-aged in a metadata tool unlike data integration packages.
Option 3: In the analytics layer—Most business intelligence software packages, such as MicroStrategy, Cognos®, and Business Objects, have the ability to perform query calculations and aggregations within their core engine. In this scenario, the BI engine performs the query to the data warehouse/mart database for the raw information, and then performs the calculation/aggregation in the BI server engine, thereby serving the results to the query requester.
ptg Advantages:
• Faster performance on query, no in-memory calculations.
• Simplifies the data integration processes into more straight loads and allows the data warehouse to be simply common, and conformed raw data “pure” from a business rule transformation perspective. It moves the reporting aggregation and calculation transformations to the analytic layer.
Disadvantages:
• Inflexible when recalculations are required. Although similar to the issues of inflexi-bility in the data integration and data warehouse database layers, by having the aggre-gations/calculations in the BI engine, the query results are closer (on the network) to where the results need to be delivered, providing some level of faster performance.
• Requires recalculation, which can affect overall BI server performance. When the BI server engine is processing large resultsets for aggregations and calculations, other queries and requests will be placed in a wait state.
Option 4: During the database query—Where the aggregation or calculation is per-formed in memory of the analytics server or even the requestor’s PC or Internet device. In this scenario, the speed of the aggregation or calculation is dependent on the SQL request to the data-base for the raw data, the network’s speed and throughput of serving the raw results to the requestor’s machine, and the time it takes on that machine to aggregate or calculate the resultset.
Advantages:
• Creates dynamic aggregations and calculations on the fly. This is the most flexible approach. This approach is most often observed in budgeting and forecasting ana-lytic applications.
Disadvantages:
• Dynamic calculations are not scalable. This approach impacts the requestor’s machine and can be constrained by a much smaller PC or Internet devices CPU mem-ory than in server environments.
The best practice is to aggregate or calculate as far back as possible into the data warehouse layers and store the result in the data warehouse or data mart, thereby pushing the workload on the data integration server and managing the metadata in the data integration processes. However, there are exceptions to each rule. For each potential aggregation or calculation, an architectural review is needed for each of the business rules in the user requirements and logical data integra-tion models. In addiintegra-tion, other documentaintegra-tion is required to determine the types of transforms, and where the transformation would best occur.
For the Wheeler Automotive case study, the aggregations in the report will be performed as transformations in the data integration processes and stored in the product line profitability data mart.
ptg