Data Warehouse design
Design of Enterprise Systems University of Pavia
21/11/2013
DATA PRESENTATION
Data Warehouse design
BI Reporting Success Factors
BI platform success factors include:
– Performance – User interface
– Presentation of the data architecture
– Alignment with the data model
– Ability to answer questions
– Mobility
– Flexibility
Information Needs
Information is data interpreted within a context.
Information questions yield to one or more answers that will help the enterprise.
For example:
– Profitability: What was the recent margin between revenues and expenses?
– Trends: Did the business unit sell more or less product this quarter as compared to last quarter?
– Ratios: What is the ROI of the data warehouse?
So, BI customers need information supporting:
– Proactive processes – Reactive processes – Predefined processes – Analytic processes
Proactive Processes
The enterprise needs to know when a problem is approaching with the maximum possible lead-time.
Once an approaching problem has been observed, the lead-time allows the enterprise to align its resources to prepare the best possible response to the approaching problem.
For proactive processes, a BI Reporting application should relieve customers of the need to remember to query enterprise data.
The risk is that the customer will be too busy or just forget to run the query at the exact moment a problem emerges.
Reactive Processes
The enterprise needs to assess its recent past in the context of long-term and seasonal trends.
The information from these assessments helps the enterprise know whether short-term tactics and long-term strategies are currently working as intended or should they be modified in the near future.
Business processes such as these are reactive because they allow the enterprise to react to recent events.
BI Reporting customers need the toolsets necessary to review and analyze recent events in the context of long-term and seasonal trends.
Predefined Processes
Some business processes are well defined, repeated, and stable.
Predefined business processes could include such queries as:
– How many units did we sell?
– How much cash came in, and out, in the past week?
– What is the net present value of investments held by each customer?
In predefined processes, everything is known, except the answer. The time frame, query, and audience are all known.
A predefined process has very few, if any, variables that require the help or participation of a member of the enterprise.
Analytic Processes
Sometimes, the question that must be answered is, “What question should I ask?”
This search is the analytic process, searching for a correlation between events, for an association between factors within and around the enterprise.
Business analysts need a toolset that will enable them to search for the questions that will lead to the answers.
What exactly are “analytics”?
– Analytics are a subset of business intelligence
– Analytics are the extensive use of data, statistical and quantitative analysis, explanatory and predictive models, and fact-based management to drive decisions and actions.
– The analytics may be input for human decisions or may drive fully automated decisions.
BI maturity model
Architecture
The architecture of BI Reporting tools includes one or more servers between the data warehouse and customers.
1. These servers have a roadmap of the data warehouse.
2. Customers tell a BI Reporting application the required information through its user interface
3. The BI Reporting application submits the SQL to the data warehouse.
4. When the result returns from the data warehouse, the BI Reporting application returns it to the customer.
The companies that develop and own BI Reporting tools negotiate partnerships with the companies that develop and own RDBMS platforms.
– The partnership means that the owners of the RDBMS platform have shared their proprietary information, including application programming interfaces (APIs) and other interfaces, which allow a BI Reporting tool to connect with the most possible features and efficiency.
– The least efficient connectivity is through ODBC. A BI Reporting tool will use ODBC when no other connectivity is available.
Presentation Methods
BI Reporting tools interact with data warehouse customers using several methods.
– Reports – Dashboards
– On-Line Analytical Processing (OLAP) – Data mining and Decision Support
Systems (DSSs)
Each method has its own
advantages and disadvantages.
None of these methods addresses all the data warehouse customers’
needs and skills.
For that reason, most BI Reporting tools combine these methods.
Reports
Reports are basically SQL statements with a label.
The BI Reporting tool has a library of predefined reports.
Data warehouse customers need to be able to find the exact permutation of Fact and Dimension data in a report.
The SQL in all the reports can be
optimized for maximum query efficiency.
The BI Reporting team can test and validate each report, verifying it does indeed return the data that it promises to return.
Also, the BI Reporting team can own and catalog all the BI reports, thus avoiding redundant reports.
Dashboards
Dashboards require the BI Reporting tool translate the list of data elements
required by the customer into a SQL statement.
Then, the BI Reporting tool submits that SQL to the data warehouse and returns the result set back to the data warehouse customer.
Dashboards usually use drop-down lists, menus, and user input boxes to indicate the list of data elements and WHERE clauses required by the data warehouse customer.
To achieve the translation of data elements and WHERE clauses, a BI Reporting tool must have its own roadmap of the data warehouse.
Dashboards
That roadmap of the data
warehouse must be maintained and synchronized with the data warehouse
– If the data warehouse changes, the BI Reporting roadmap changes.
Dashboards provide flexibility and ad hoc reporting that does not exist with predefined reports.
The price for that flexibility is the roadmap of the data warehouse, which includes:
– the cost of a BI Reporting server – the cost of a middleware server,
likely
– development and maintenance
On-Line Analytical Processing
OLAP applications pre-calculate and store the answers (i.e., result sets) to permutations of
Dimensions.
The pre-calculated result sets are stored in a multidimensional
structure, which is referred to as a Cube.
The multidimensional cube is able to navigate directly to the cell that holds the result set associated
with the permutation of
Dimensions indicated by the customer.
As a result, the answer set comes
On-Line Analytical Processing
The final, and best, feature of an OLAP application is the user interface.
An OLAP application uses a GUI interface.
The customer is able to point- and-click on a cell that is a reference to a permutation of Dimensions.
The result set returns
immediately because the result set has been pre-calculated
and stored, allowing the
customer to ask questions (via
point-and-click) and receive
Multidimensional OLAP
MOLAP stores all the result sets of all the permutations of Dimension in an OLAP cube.
MOLAP requires significant storage capacity.
The creation of all the result sets in a MOLAP cube
requires significant CPU cycles, I/Os, and memory capacity.
MOLAP provides the fastest
Relational OLAP
ROLAP stores no result sets.
Rather, ROLAP identifies the data within an associated data warehouse by which it can calculate at runtime all result sets.
When a customer indicates an
intersection of Dimensions, the ROLAP cube translates that information into a SQL statement, which is submitted to a data warehouse.
The result set comes back as a data value that is reflected in the OLAP GUI.
A ROLAP cube requires the least storage capacity on the OLAP server; however, ROLAP transfers consumption of CPU cycles and I/Os over to the data
warehouse.
ROLAP provides the slowest performance and the maximum number of
Dimensions for the customer.
Hybrid OLAP
HOLAP is a combination of MOLAP and ROLAP.
By pre-calculating and storing most, but not all, of the result sets within an OLAP cube, a HOLAP cube achieves a compromise between capacity, performance, and permutations of
Dimensions available to
OLAP operations
Roll up
– A roll-up involves
summarizing the data along a dimension.
– The summarization rule might be
computing totals along a hierarchy or applying a set of formulas such as "profit = sales -
expenses“
OLAP operations
Drill down
– Drill Down/Up allows the user to navigate among levels of data ranging from the most
summarized (up) to the most detailed (down) – The analyst moves from
the summary category
"Outdoor-products" to
see the sales figures for
the individual products,
e.g. outdoor table,
OLAP operations
Slicing
– Slice is the act of picking a rectangular subset of a cube by choosing a
single value for one of its dimensions, creating a new cube with one fewer dimension.
– The sales of all regions and all product
categories of the
company in the year
2004 are “filtered" out of
the data cube.
OLAP operations
Dicing
– The dice operation produces a sub-cube by allowing the analyst to pick specific values of multiple dimensions – The new cube shows
the sales figures of a limited number of
product categories, the time and region
dimensions cover the
OLAP operations
Pivoting
– Pivot allows an analyst to rotate the cube in space to see its various faces.
– For example, cities could be arranged vertically
and products horizontally while viewing data for a particular quarter.
– Pivoting could replace products with time
periods to see data
across time for a single
product.
Data Mining
Data Mining is a search for patterns and associations within data that are not immediately obvious or may be hidden altogether.
As a pattern emerges, it may lead to a question that will lead to another pattern that may open up a new line of inquiry and discovery.
The inquiry and discovery in Data Mining follows one of two paths:
– Exploratory Analysis: This is the search for a hypothesis, a business rule that can predict future events and conditions.
– Confirmatory Analysis: This is the test of a hypothesis. A business rule has been found that requires validation and verification
An enterprise wants to be able to predict an event or condition, i.e., what function and factors in f (x, y, z) = A?
– In the best case scenario, factors x, y, and z are within the power of the enterprise to manipulate. In that case, the enterprise can cause result A to
Data Mining Tools
Generally available Data Mining tools handle all the statistical and time-series functions as well as the confidence measurements.
These Data Mining tools are powerful software packages that enhance and accelerate the Data Mining process.
They include the statistical algorithms and functions that are at the center of Data Mining.
Data Mining tools, because of all their statistical power, require the data be brought to them in specific formats.
Data Preparation is usually two or three times the work of Data Mining.
Data Cleansing
Data Mining tool needs a clean set of data, without any “noise” data that might cause confusion or distraction.
Some of the Data Cleansing methods are:
– Missing Values: Identify missing values in the data. Fill them in with a
reasonable value. This mitigates the risk that an empty spot in the data that does not normally occur may lead the Data Mining tool to believe that empty spot always occurs.
– Outliers: Identify unreasonable data values. In the data warehouse, these outliers are retained. But, in the data presented to a Data Mining tool, these values are modified to a more reasonable value. This mitigates the risk that an outlier in the data that does not normally occur may lead the Data Mining tool to believe that outlier always occurs.
– Sample Bias: Preferably, feed a Data Mining tool with a universe (a whole and complete set) of data, not just a sample. A sample of data should only be used when the delivery of a universe of data is physically and logistically
Data Inspection
A Data Mining tool understands two kinds of variables: Independent Variables and Dependent Variables.
In the cause–effect concept of the world wherein every effect is preceded by one or more causes, Independent Variables are the cause and a Dependent Variable is the effect.
In Data Inspection, a BI analyst reviews the meaning, content, and inconsistencies within each Variable.
The methods applied in Source System Analysis can also be applied to Data Inspection:
– Data profile – Histogram
– Business Rule validation
Exploratory and Confirmatory Analyses
The hypothesis is that
Independent Variables have some sort of connection to the
Dependent Variable.
Exploratory Analysis is a search for an explanation as to how (not necessarily why) some subset of these Independent Variables
relates to, or associates with, the Dependent Variable.
The relation, or association,
derived from Exploratory Analysis is an algorithm. For example:
– Growth in sales is inversely
proportional to changes in price.
– Increases in manufacturing
throughput are directly proportional
Confirmatory Analysis begins with the hypothesis.
In Confirmatory Analysis, the BI analyst tries to predict the
Dependent Variable by using the Independent Variables and the hypothesized algorithm.
The variance between the
predicted value and the actual value is a measurement of the confidence in the hypothesized algorithm.
Summary
Reactive
processes Proactive
processes Predefined
processes Analytic processes
Analysis Long term and seasonal
trends
Near future / approaching
potential events
Repeated events
Unknown / Hypothesis exploration /
Hypothesis confirmation
Access mode
Dashboards and OLAP
Alerts (e.g.
emails, notifications, etc.) and DSS
Reports Data mining and DSS
ASSIGNMENT
Data Warehouse design
Deliverable
Cube mapping by Pentaho Schema Workbench
OLAP implementation by Saiku (C-tools)
Dashboard implementation by CDE – At least one dashboard for each KPI
Deadline: December 17th