Data Integration Analysis Case Study - Data Integration Blueprint and Modeling

This chapter begins our second case study with the emphasis on working through the entire data integration life cycle tasks and deliverables. Subsequent chapters cover the phases of the data integration life cycle and provide case studies for each phase. This case study is based on inte-grating three order management systems for the Wheeler Automotive Company into an enterprise data warehouse and product line proﬁtability data mart.

For the analysis case study, we focus on developing project scope, source systems analysis, and data mapping deliverables.

Case Study Overview

The Wheeler Automotive Company is a ﬁctional midsized auto parts supplier to the automotive industry and has been fairly successful since the company’s inception back in the mid-1960s.

Due to the recent recession, there has been increased focus on cost and proﬁtability at a level of detail that is not currently available in its current plant-level reporting, as shown in Figure 6.1.

ptg

For Wheeler to perform the types of analysis needed to answer these proﬁtability questions, it needs an environment where the disparate order information is consolidated, conformed by subject areas, aggregated by time, and displayed at a transaction level that provides management information about what product lines are selling and showing a proﬁt.

Envisioned Wheeler Data Warehouse Environment

To meet the profitability reporting requirements as well as other future analytic and reporting needed, the Wheeler Information Technology Department has planned to define, design, and build an enterprise data warehouse and product line profitability data mart, as shown in Figure 6.2.

To date, the data warehousing team has completed a logical and physical data model for the data warehouse and product line data mart, as shown in Figure 6.3.

Current Wheeler Reporting Environment

Domestic Order Management System

Asian Order Management System

European Order Management System

Quarterly Profit Reports

Quarterly Profit Reports Quarterly Profit

Reports

Figure 6.1 Case study 2: Wheeler source systems

ptg

Case Study Overview 119

Customer Order Report Month Ending 02/27/2010

CustomerAuto Manufacturer 1Location 1

Order 1001Date 2/2/2010

Line # Item Number Description Cost Price Quantity Total Gross Profit 30% Overhead Net Profit 1 1101 Steel Wheels, Type 1 $100 $125 1,000 $125,000 $25,000 $37,500 -$12,500 2 1201 Wheel Bearing, Type 1 $10 $30 5,000 $150,000 $100,000 $45,000$55,000 3 1301 Rubber Joints, Type 1 $7 $12 10,000 $120,000 $50,000 $36,000$14,000

Product Line Profitability Reporting Envisioned Wheeler Automotive Data

Warehouse Environment

Enterprise Data Warehouse Domestic Order

Management System

European Order Management System

Asian Order Management System

Planned Data Integration

Hub

Product Line Profitability

Data Mart

Figure 6.2 Envisioned Wheeler data warehouse environment

Products

PK: Product Identifier

Data Warehouse Layer

Customers

PK: Customer Number

Addresses

PK: Customer Identifier, Address Number

Customers

PK: Customer Number

Data Mart Layer

Order Lines

PK: Order Number.

Order Line Number

Order

PK: Order Number

Order

PK: Order Number

Time

PK: Time Id

Order Lines

PK: Order Number.

Order Line Number

Products

PK: Product Identifier

Figure 6.3 Wheeler data warehouse and data mart data models

ptg The Wheeler data warehousing team has also produced a sample report layout portrayed in

Figure 6.4 for the product line proﬁtability reporting that includes the known aggregations and calculations.

Customer Order Report Month Ending 02/27/2010

CustomerAuto Manufacturer 1 Location 1

Order 1001Date 2/2/2010

Line # Item Number Description Cost Price Quantity Total Gross Profit 30% Overhead Net Profit

1 1101 Steel Wheels, Type 1 $100 $125 1,000 $125,000 $25,000 $37,500 -$12,500

2 1201 Wheel Bearing, Type 1 $10 $30 5,000 $150,000 $100,000 $45,000 $55,000

3 1301 Rubber Joints, Type 1 $7 $12 10,000 $120,000 $50,000 $36,000 $14,000

Figure 6.4 Wheeler sample report layout

Aggregations in a Data Warehouse Environment

To meet all the requirements of this case study, we need to deal with aggregations, and where they occur for this effort. Although this text is primarily focused on data integration, it is important to take a moment to discuss a general data warehousing best practice. The “when” and “where” of data aggregation and calculation can be performed in all the layers of a data warehouse. In what layer the aggregation or calculation is performed should be evaluated based on potential perfor-mance and static nature of the aggregation or calculation, for example, Pre-Query or On-Query.

Figure 6.5 illustrates the possible data warehouse layers where an aggregation or calcula-tion transform could occur.

Data Warehouse

Architectural Problem:

Where to Calculate?

$ 33,000 W1 15,000 W2 + 20,000 W3

$68,000 Monthly

Option 1: In the Data Integration Processes

Data Integration Analytics

Analytics Server

ledoMataDlacigoL

Irem

otsuC :tcoejrP sisylanAnoitcaretn

traM

ataD

:peyTleod M

A th I BMCiVttl D01/10t A A ngr arA emgnr arneemgnr ar enarr mit Pntm eayy PTytnemyaPyramirt PepyTntm eyaPyramirPtneem IpedcrD s

C hannell IneanhC ddeoCelnnahCemaNelnnahC

epyTlennahC e Iyp l Tne anhC Tyelnnhadpe

I gg FniknaBtenretngalFsr meTailecpSgalFgniknae BnohpeleT

C ust om er I ssylann Aoitcaretn d I r oiePtnemr esu

eaM d ( F K ) a Ierci Aph r aog G er aogeG IenoZemiT dda TerAciphe Iypdt oiangiseDytinseDaerAcihpagreo n I

G dItnemgeSt eaRtnemyolpmenU d

B ankr upt cy R at e S egm ent Idt Inf noial tr aogt eR aS egmIendIerutaa NerAciph G eci CnopheleT e IodddeoCaerAcihpagroeG d doiret PnemeuraseM

d Ir oiePtnm er e suea

M noialtput eD ad

P oalupoPe Iqiuemin Tt oi U d In S our ce S yst emne IpyTdr oiePntm eersuea emursaM ed Ir oiePrandelat C nearP arY eradnelaCrer taur Qd daelnaCnor Mt h D ay O f nearP danelC araeYalcsiFrer taul QcasiFhtnol Mcad IoirePalcsit Fdhnor M

F sia Palgy FadiloHnyapmoCgaly FaDssniesuBy IaDssenisuBtas al L gFhntM onnoipitcrseD e IypTdr oiePtem nersuaeM d ( F K )

epyTdoiret PnemeruaseM Ieyp d ToirePetm nerusea M dpeyTdoirePtnemeruaseM

ntm e

genarrAtucdorPr yamirP t InemengarrActduorPyramirP drcsDntm ege narrAtucdorPr yamirP

S easonS eason Id S eason

Data Warehouse Layer

algci L DataModelo cPr tejo :CustomerInteractnoiA siylsna ModelType:DataWarehouse AthIBM C V i 10Dt01/tlt01/06

ArrrangeentPrimaryPaymentType To Customer ArrrangementPrimaryPaymentType Id(FK) haCType Id(FK)lne anC Code

C Type Idelnnha

C Type

CustomerCustomerId angementPrimaryPaymentType Id ArrrCustomerPerformance Status Id PrimaryProductArrrangementId Efffecti veCustomerDate End CustomerDatenProv si goi FalAi rrr lTav ale gFAiii r e bnl luC Pass galFni Internet gnkaB gFal laiecpS Terms galFenohelpeT gniknaalB gF

CustomerInteract sioi tH ory oi Id(FK) MeasurementPer d CustomerId(FK)elnnha

C Id(FK)

Interact noi Type Id(FK)

CustomerPerformance Status CustomerPerformance Status Id CustomerPerformance Status Dscr CustomerTo CustomerPerformance Status CustomerId(FK) CustomerPerformance Status Id(FK) End Date StartDate

ciGeogr pha Areaci Geogr pha AreaId T eZone

miIdci

Geogr pha AreaType Id Geogr cipha Area tisenD y tangiseD yUnem molp entRateSegmentIdnoi Id BankruptcyRateSegmentId Inf tl naoi RateSegmentId Geogr ciphaAreaNatureIdcinohelpeT Code Id Geogr cihpa AreaCode

Goegr pha AreaTo Customer hGeogr cia AreaId(FK) pCustomerId(FK) End Date StartDate

nInteract oi Type nInteract oi Type Id Interact noi Type Descr tpi noi

MeasurementPer dd MeasurementPer oi Idaltpu P nooi DatealtpuoP noi miT euenqi U IdInSource Systemd MeasurementPer oi Type IdeDate Efffecti vEnd Dated MeasurementPer oi Name Parent arndaelrdaelnaC Per doi Id CYearrandael CQuarterrdaeln

C Monthaa

Week Of rndaelC Year Week Of rdnelaCQuarteraa Week OfrndaelC Month Day Of rdanelaC Year Day Of C randaelterQuar Day Of rndaelC aMonth Parent lcaF Per sioi IddalsicFYearaliscFQuarterlacsiF Month Week aOf lsicFYear Week OfalcasiF Quarter Week Of lsicF Month Day aOf lsicF Year Day aOf lsicFQuartera Day OflsicF Month Day OfWeek Season Id(FK) NumberOfDayss NumberOf seniusB Days NumberOfCr tide InterestDays NumberOf tibeD InterestDaysbcilP H yudilagoFala Company yodilH algalFssenius

B Day gFs

LastuB pi n essniDay InMonth gFal Descr toid MeasuremnetPer oi Type Id(FK)

MeasurementPer oi Typed MeasuremnetPer oi Type Id

MeasurementPer oi Type

PrimaryProductArrrangement

Option 2: In the Database

Option 3: In the Analytics Engine

Option 4: On Query

Figure 6.5 Architectural options on where to perform a transform

ptg

Case Study Overview 121

Option 1: In the data integration layer—Where the aggregation is performed in the transformation layer of a data integration process. This option is preferred for large volumes of static data that needs to be aggregated.

Advantages:

• Faster performance on query, no in-memory calculations. By having the data pre-cal-culated, the report query simply needs to return a value and the processing load is placed on the data integration environment, rather than on the data warehouse data-base or analytics engine. In this scenario, there is no query wait time for calculations to perform.

Disadvantages:

• Inﬂexibility in recalculation is required. In business intelligence environments where recalculations are required (for example, what-if scenarios), precalculated query results will not meet the business requirements.

Option 2: In the data warehouse database layer—Where the aggregation or calculation is performed as a stored procedure in the data warehouse or data mart based upon a trigger from a query (e.g., the ON QUERY SQL function). This option provides a little more ﬂexibility than in the data integration layer and pushes the processing requirements on the database server rather than on the analytics engine.

Advantages:

• Faster performance on query, no in-memory calculations. In this scenario, the only wait time is for the database engine to perform the aggregation or calculation.

Disadvantages:

• Inﬂexible for recalculations when recalculating the query is required; the stored pro-cedure will need to re-execute, causing query wait time.

• Poor metadata management and loss of metadata on the transformations. Store pro-cedures are notoriously poorly documented and their metadata is typically not man-aged in a metadata tool unlike data integration packages.

Option 3: In the analytics layer—Most business intelligence software packages, such as MicroStrategy, Cognos^®, and Business Objects, have the ability to perform query calculations and aggregations within their core engine. In this scenario, the BI engine performs the query to the data warehouse/mart database for the raw information, and then performs the calculation/aggregation in the BI server engine, thereby serving the results to the query requester.

ptg Advantages:

• Faster performance on query, no in-memory calculations.

• Simpliﬁes the data integration processes into more straight loads and allows the data warehouse to be simply common, and conformed raw data “pure” from a business rule transformation perspective. It moves the reporting aggregation and calculation transformations to the analytic layer.

Disadvantages:

• Inﬂexible when recalculations are required. Although similar to the issues of inﬂexi-bility in the data integration and data warehouse database layers, by having the aggre-gations/calculations in the BI engine, the query results are closer (on the network) to where the results need to be delivered, providing some level of faster performance.

• Requires recalculation, which can affect overall BI server performance. When the BI server engine is processing large resultsets for aggregations and calculations, other queries and requests will be placed in a wait state.

Option 4: During the database query—Where the aggregation or calculation is per-formed in memory of the analytics server or even the requestor’s PC or Internet device. In this scenario, the speed of the aggregation or calculation is dependent on the SQL request to the data-base for the raw data, the network’s speed and throughput of serving the raw results to the requestor’s machine, and the time it takes on that machine to aggregate or calculate the resultset.

Advantages:

• Creates dynamic aggregations and calculations on the ﬂy. This is the most ﬂexible approach. This approach is most often observed in budgeting and forecasting ana-lytic applications.

Disadvantages:

• Dynamic calculations are not scalable. This approach impacts the requestor’s machine and can be constrained by a much smaller PC or Internet devices CPU mem-ory than in server environments.

The best practice is to aggregate or calculate as far back as possible into the data warehouse layers and store the result in the data warehouse or data mart, thereby pushing the workload on the data integration server and managing the metadata in the data integration processes. However, there are exceptions to each rule. For each potential aggregation or calculation, an architectural review is needed for each of the business rules in the user requirements and logical data integra-tion models. In addiintegra-tion, other documentaintegra-tion is required to determine the types of transforms, and where the transformation would best occur.

For the Wheeler Automotive case study, the aggregations in the report will be performed as transformations in the data integration processes and stored in the product line proﬁtability data mart.

ptg

In document Data Integration Blueprint and Modeling (Page 142-148)