Data Warehouse design

(1)

Data Warehouse design

Design of Enterprise Systems University of Pavia

21/11/2013

(2)

DATA PRESENTATION

Data Warehouse design

(3)

BI Reporting Success Factors

BI platform success factors include:

– Performance – User interface

– Presentation of the data architecture

– Alignment with the data model

– Ability to answer questions

– Mobility

– Flexibility

(4)

Information Needs

 Information is data interpreted within a context.

 Information questions yield to one or more answers that will help the enterprise.

 For example:

– Profitability: What was the recent margin between revenues and expenses?

– Trends: Did the business unit sell more or less product this quarter as compared to last quarter?

– Ratios: What is the ROI of the data warehouse?

 So, BI customers need information supporting:

– Proactive processes – Reactive processes – Predefined processes – Analytic processes

(5)

Proactive Processes

 The enterprise needs to know when a problem is approaching with the maximum possible lead-time.

 Once an approaching problem has been observed, the lead-time allows the enterprise to align its resources to prepare the best possible response to the approaching problem.

 For proactive processes, a BI Reporting application should relieve customers of the need to remember to query enterprise data.

 The risk is that the customer will be too busy or just forget to run the query at the exact moment a problem emerges.

(6)

Reactive Processes

 The enterprise needs to assess its recent past in the context of long-term and seasonal trends.

 The information from these assessments helps the enterprise know whether short-term tactics and long-term strategies are currently working as intended or should they be modified in the near future.

 Business processes such as these are reactive because they allow the enterprise to react to recent events.

 BI Reporting customers need the toolsets necessary to review and analyze recent events in the context of long-term and seasonal trends.

(7)

Predefined Processes

 Some business processes are well defined, repeated, and stable.

 Predefined business processes could include such queries as:

– How many units did we sell?

– How much cash came in, and out, in the past week?

– What is the net present value of investments held by each customer?

 In predefined processes, everything is known, except the answer. The time frame, query, and audience are all known.

 A predefined process has very few, if any, variables that require the help or participation of a member of the enterprise.

(8)

Analytic Processes

 Sometimes, the question that must be answered is, “What question should I ask?”

 This search is the analytic process, searching for a correlation between events, for an association between factors within and around the enterprise.

 Business analysts need a toolset that will enable them to search for the questions that will lead to the answers.

 What exactly are “analytics”?

– Analytics are a subset of business intelligence

– Analytics are the extensive use of data, statistical and quantitative analysis, explanatory and predictive models, and fact-based management to drive decisions and actions.

– The analytics may be input for human decisions or may drive fully automated decisions.

(9)

BI maturity model

(10)

Architecture

 The architecture of BI Reporting tools includes one or more servers between the data warehouse and customers.

1. These servers have a roadmap of the data warehouse.

2. Customers tell a BI Reporting application the required information through its user interface

3. The BI Reporting application submits the SQL to the data warehouse.

4. When the result returns from the data warehouse, the BI Reporting application returns it to the customer.

 The companies that develop and own BI Reporting tools negotiate partnerships with the companies that develop and own RDBMS platforms.

– The partnership means that the owners of the RDBMS platform have shared their proprietary information, including application programming interfaces (APIs) and other interfaces, which allow a BI Reporting tool to connect with the most possible features and efficiency.

– The least efficient connectivity is through ODBC. A BI Reporting tool will use ODBC when no other connectivity is available.

(11)

Presentation Methods

 BI Reporting tools interact with data warehouse customers using several methods.

– Reports – Dashboards

– On-Line Analytical Processing (OLAP) – Data mining and Decision Support

Systems (DSSs)

 Each method has its own

advantages and disadvantages.

 None of these methods addresses all the data warehouse customers’

needs and skills.

 For that reason, most BI Reporting tools combine these methods.

(12)

Reports

 Reports are basically SQL statements with a label.

 The BI Reporting tool has a library of predefined reports.

 Data warehouse customers need to be able to find the exact permutation of Fact and Dimension data in a report.

 The SQL in all the reports can be

optimized for maximum query efficiency.

 The BI Reporting team can test and validate each report, verifying it does indeed return the data that it promises to return.

 Also, the BI Reporting team can own and catalog all the BI reports, thus avoiding redundant reports.

(13)

Dashboards

 Dashboards require the BI Reporting tool translate the list of data elements

required by the customer into a SQL statement.

 Then, the BI Reporting tool submits that SQL to the data warehouse and returns the result set back to the data warehouse customer.

 Dashboards usually use drop-down lists, menus, and user input boxes to indicate the list of data elements and WHERE clauses required by the data warehouse customer.

 To achieve the translation of data elements and WHERE clauses, a BI Reporting tool must have its own roadmap of the data warehouse.

(14)

Dashboards

 That roadmap of the data

warehouse must be maintained and synchronized with the data warehouse

– If the data warehouse changes, the BI Reporting roadmap changes.

 Dashboards provide flexibility and ad hoc reporting that does not exist with predefined reports.

 The price for that flexibility is the roadmap of the data warehouse, which includes:

– the cost of a BI Reporting server – the cost of a middleware server,

likely

– development and maintenance

(15)

On-Line Analytical Processing

 OLAP applications pre-calculate and store the answers (i.e., result sets) to permutations of

Dimensions.

 The pre-calculated result sets are stored in a multidimensional

structure, which is referred to as a Cube.

 The multidimensional cube is able to navigate directly to the cell that holds the result set associated

with the permutation of

Dimensions indicated by the customer.

 As a result, the answer set comes

(16)

On-Line Analytical Processing

 The final, and best, feature of an OLAP application is the user interface.

 An OLAP application uses a GUI interface.

 The customer is able to point- and-click on a cell that is a reference to a permutation of Dimensions.

 The result set returns

immediately because the result set has been pre-calculated

and stored, allowing the

customer to ask questions (via

point-and-click) and receive

(17)

Multidimensional OLAP

 MOLAP stores all the result sets of all the permutations of Dimension in an OLAP cube.

 MOLAP requires significant storage capacity.

 The creation of all the result sets in a MOLAP cube

requires significant CPU cycles, I/Os, and memory capacity.

 MOLAP provides the fastest

(18)

Relational OLAP

 ROLAP stores no result sets.

 Rather, ROLAP identifies the data within an associated data warehouse by which it can calculate at runtime all result sets.

 When a customer indicates an

intersection of Dimensions, the ROLAP cube translates that information into a SQL statement, which is submitted to a data warehouse.

 The result set comes back as a data value that is reflected in the OLAP GUI.

 A ROLAP cube requires the least storage capacity on the OLAP server; however, ROLAP transfers consumption of CPU cycles and I/Os over to the data

warehouse.

 ROLAP provides the slowest performance and the maximum number of

Dimensions for the customer.

(19)

Hybrid OLAP

HOLAP is a combination of MOLAP and ROLAP.

By pre-calculating and storing most, but not all, of the result sets within an OLAP cube, a HOLAP cube achieves a compromise between capacity, performance, and permutations of

Dimensions available to

(20)

OLAP operations

Roll up

– A roll-up involves

summarizing the data along a dimension.

– The summarization rule might be

computing totals along a hierarchy or applying a set of formulas such as "profit = sales -

expenses“

(21)

OLAP operations

Drill down

– Drill Down/Up allows the user to navigate among levels of data ranging from the most

summarized (up) to the most detailed (down) – The analyst moves from

the summary category

"Outdoor-products" to

see the sales figures for

the individual products,

e.g. outdoor table,

(22)

OLAP operations

Slicing

– Slice is the act of picking a rectangular subset of a cube by choosing a

single value for one of its dimensions, creating a new cube with one fewer dimension.

– The sales of all regions and all product

categories of the

company in the year

2004 are “filtered" out of

the data cube.

(23)

OLAP operations

Dicing

– The dice operation produces a sub-cube by allowing the analyst to pick specific values of multiple dimensions – The new cube shows

the sales figures of a limited number of

product categories, the time and region

dimensions cover the

(24)

OLAP operations

Pivoting

– Pivot allows an analyst to rotate the cube in space to see its various faces.

– For example, cities could be arranged vertically

and products horizontally while viewing data for a particular quarter.

– Pivoting could replace products with time

periods to see data

across time for a single

product.

(25)

Data Mining

 Data Mining is a search for patterns and associations within data that are not immediately obvious or may be hidden altogether.

 As a pattern emerges, it may lead to a question that will lead to another pattern that may open up a new line of inquiry and discovery.

 The inquiry and discovery in Data Mining follows one of two paths:

– Exploratory Analysis: This is the search for a hypothesis, a business rule that can predict future events and conditions.

– Confirmatory Analysis: This is the test of a hypothesis. A business rule has been found that requires validation and verification

 An enterprise wants to be able to predict an event or condition, i.e., what function and factors in f (x, y, z) = A?

– In the best case scenario, factors x, y, and z are within the power of the enterprise to manipulate. In that case, the enterprise can cause result A to

(26)

Data Mining Tools

 Generally available Data Mining tools handle all the statistical and time-series functions as well as the confidence measurements.

 These Data Mining tools are powerful software packages that enhance and accelerate the Data Mining process.

 They include the statistical algorithms and functions that are at the center of Data Mining.

 Data Mining tools, because of all their statistical power, require the data be brought to them in specific formats.

 Data Preparation is usually two or three times the work of Data Mining.

(27)

Data Cleansing

 Data Mining tool needs a clean set of data, without any “noise” data that might cause confusion or distraction.

 Some of the Data Cleansing methods are:

– Missing Values: Identify missing values in the data. Fill them in with a

reasonable value. This mitigates the risk that an empty spot in the data that does not normally occur may lead the Data Mining tool to believe that empty spot always occurs.

– Outliers: Identify unreasonable data values. In the data warehouse, these outliers are retained. But, in the data presented to a Data Mining tool, these values are modified to a more reasonable value. This mitigates the risk that an outlier in the data that does not normally occur may lead the Data Mining tool to believe that outlier always occurs.

– Sample Bias: Preferably, feed a Data Mining tool with a universe (a whole and complete set) of data, not just a sample. A sample of data should only be used when the delivery of a universe of data is physically and logistically

(28)

Data Inspection

 A Data Mining tool understands two kinds of variables: Independent Variables and Dependent Variables.

 In the cause–effect concept of the world wherein every effect is preceded by one or more causes, Independent Variables are the cause and a Dependent Variable is the effect.

 In Data Inspection, a BI analyst reviews the meaning, content, and inconsistencies within each Variable.

 The methods applied in Source System Analysis can also be applied to Data Inspection:

– Data profile – Histogram

– Business Rule validation

(29)

Exploratory and Confirmatory Analyses

 The hypothesis is that

Independent Variables have some sort of connection to the

Dependent Variable.

 Exploratory Analysis is a search for an explanation as to how (not necessarily why) some subset of these Independent Variables

relates to, or associates with, the Dependent Variable.

 The relation, or association,

derived from Exploratory Analysis is an algorithm. For example:

– Growth in sales is inversely

proportional to changes in price.

– Increases in manufacturing

throughput are directly proportional

 Confirmatory Analysis begins with the hypothesis.

 In Confirmatory Analysis, the BI analyst tries to predict the

Dependent Variable by using the Independent Variables and the hypothesized algorithm.

 The variance between the

predicted value and the actual value is a measurement of the confidence in the hypothesized algorithm.

(30)

Summary

Reactive

processes Proactive

processes Predefined

processes Analytic processes

Analysis Long term and seasonal

trends

Near future / approaching

potential events

Repeated events

Unknown / Hypothesis exploration /

Hypothesis confirmation

Access mode

Dashboards and OLAP

Alerts (e.g.

emails, notifications, etc.) and DSS

Reports Data mining and DSS

(31)

ASSIGNMENT

Data Warehouse design

(32)

Deliverable

 Cube mapping by Pentaho Schema Workbench

 OLAP implementation by Saiku (C-tools)

 Dashboard implementation by CDE – At least one dashboard for each KPI

Deadline: December 17th