Data Warehouse

(1)

Data

warehouse

(2)

Index

Data warehouse

page 3

Strengths and weaknesses of implementing a

Data warehouse

page 5

Data Warehouse Architecture

page 7

Data Warehouse Schema

page 9

Data mining

page 10

Data mining used for

page 11

Data Mining Techniques

(3)

Data warehouse

A data warehouse in its simplest perception is no more than a collection of the key pieces of information used to manage and direct the business for the most

profitable outcome. This could be anything from deciding the level of stock on supermarket shelf, through focused targeting of customers for loyalty schemes and direct mail promotions to strategic decisions making on major market segments and company profitability.

In other word, a data warehouse is the data

(meta/fact/dimension/aggregation) and the process managers (load/warehouse/query) that make

information available, enabling people to make informed decisions.

In computing, a data warehouse or enterprise data

warehouse is a database used for reporting and data

analysis. It is a central repository of data, which is created by integrating data from multiple disparate sources. Data warehouses store current as well as historical data and are used for creating trending reports for senior management reporting such as annual and quarterly comparisons.

The data stored in the warehouse are uploaded from the operational systems (such as marketing, sales etc.,

(4)

shown in the figure to the right). The data may pass through an operational data store for additional

operations before they are used in the DW for reporting.

The typical ETL-based data warehouse uses staging, integration, and access layers to house its key

functions. The staging layer or staging database stores raw data extracted from each of the disparate source data systems. The integration layer integrates the disparate data sets by transforming the data from the staging layer often storing this transformed data in an operational data store (ODS) database. The integrated data are then moved to yet another database, often called the data warehouse database, where the data is arranged into hierarchical groups often called

dimensions and into facts and aggregate facts. The combination of facts and dimensions is sometimes called a star schema. The access layer helps users retrieve data.

A data warehouse constructed from an integrated data source systems does not require ETL, staging

databases, or operational data store databases. The integrated data source systems may be considered to be a part of a distributed operational data store layer. Data federation methods or data virtualization methods may be used to access the distributed integrated

source data systems to consolidate and aggregate data directly into the data warehouse database tables.

Unlike the ETL-based data warehouse, the integrated source data systems and the data warehouse are all integrated since there is no transformation of

dimensional or reference data. This integrated data warehouse architecture supports the drill down from the aggregate data of the data warehouse to the

(5)

transactional data of the integrated source data systems.

Data warehouses can be subdivided into data marts. Data marts store subsets of data from a warehouse.

Strengths and weaknesses of

implementing a Data warehouse

Strengths

A Data Warehouse Delivers Enhanced Business Intelligence

By providing data from various sources, managers and executives will no longer need to make business

decisions based on limited data or their gut. In addition, “data warehouses and related BI can be applied

directly to business processes including marketing segmentation, inventory management, financial management, and sales.”

A Data Warehouse Saves Time

Since business users can quickly access critical data from a number of sources—all in one place—they can rapidly make informed decisions on key initiatives.

(6)

They won’t waste precious time retrieving data from multiple sources.

Not only that but the business execs can query the data themselves with little or no support from IT—saving

more time and more money. That means the business users won’t have to wait until IT gets around to

generating the reports, and those hardworking folks in IT can do what they do best—keep the business

running.

A Data Warehouse Enhances Data Quality and Consistency

A data warehouse implementation includes the

conversion of data from numerous source systems into a common format. Since each data from the

various departments is standardized, each department will produce results that are in line with all the other departments. So you can have more confidence in the accuracy of your data. And accurate data is the basis for strong business decisions.

A Data Warehouse Provides Historical Intelligence

A data warehouse stores large amounts of historical data so you can analyze different time periods and trends in order to make future predictions. Such data typically cannot be stored in a transactional database or used to generate reports from a transactional

system.

A Data Warehouse Generates a High ROI

Finally, the piece de resistance—return on investment. Companies that have implemented data warehouses and complementary BI systems have generated more

(7)

revenue and saved more money than companies that haven’t invested in BI systems and data warehouses.

Weaknesses



Data warehouses are not the optimal environment for unstructured data.



Because data must be extracted, transformed and loaded into the warehouse, there is an element of latency in data warehouse data.



Over their life, data warehouses can have high costs.



Data warehouses can get outdated relatively quickly. There is a cost of delivering suboptimal information to the organization.

Data Warehouse Architecture

Different data warehousing systems have different structures. Some may have an ODS (operational data

(8)

store), while some may have multiple data marts. Some may have a small number of data sources, while some may have dozens of data sources. In view of this, it is far more reasonable to present the different layers of a data warehouse architecture rather than discussing the specifics of any one system.

In general, all data warehouse systems have the following layers:

• Data Source Layer

• Data Extraction Layer

• Staging Area

• ETL Layer

• Data Storage Layer

• Data Logic Layer

• Data Presentation Layer

• Metadata Layer

• System Operations Layer

Data Source Layer

This represents the different data sources that feed data into the data warehouse. The data source can be of any format -- plain text file, relational database,

other types of database, Excel file, etc., can all act as a data source.

Many different types of data can be a data source:

• Operations -- such as sales data, HR data, product data, inventory data, marketing data, systems data.

(9)

• Internal market research data.

• Third-party data, such as census data, demographics data, or survey data.

All these data sources together form the Data Source Layer.

Data Extraction Layer

Data gets pulled from the data source into the data warehouse system. There is likely some minimal data cleansing, but there is unlikely any major data

transformation.

Staging Area

This is where data sits prior to being scrubbed and

transformed into a data warehouse / data mart. Having one common area makes it easier for subsequent data processing / integration.

ETL Layer

This is where data gains its "intelligence", as logic is applied to transform the data from a transactional nature to an analytical nature. This layer is also where data cleansing happens. The ETL design phase is often the most time-consuming phase in a data warehousing project, and an ETL tool is often used in this layer.

Data Storage Layer

This is where the transformed and cleansed data sit. Based on scope and functionality, 3 types of entities can be found here: data warehouse, data mart, and operational data store (ODS). In any given system, you may have just one of the three, two of the three, or all three types.

(10)

Data Logic Layer

This is where business rules are stored. Business rules stored here do not affect the underlying data

transformation rules, but do affect what the report looks like.

Data Presentation Layer

This refers to the information that reaches the users. This can be in a form of a tabular / graphical report in a browser, an emailed report that gets automatically

generated and sent every day, or an alert that warns users of exceptions, among others. Usually an OLAP tool and/or a reporting tool is used in this layer.

Metadata Layer

This is where information about the data stored in the data warehouse system is stored. A logical data model would be an example of something that's in the

metadata layer. A metadata tool is often to used to manage metadata.

System Operations Layer

This layer includes information on how the data

warehouse system operates, such as ETL job status, system performance, and user access history.

Data Warehouse Schema

(11)

schema. It is called a star schema because the diagram resembles a star, with points radiating from a center. The center of the star consists of one or more fact tables and the points of the star are the dimension tables

Snowflake schema

The snowflake schema is an extension of the star schema, where each point of the star explodes into more points. In a star schema, each dimension is

represented by a single dimensional table, whereas in a snowflake schema, that dimensional table is normalized into multiple lookup tables, each representing a level in the dimensional hierarchy.

Data mining

Data mining is a process of statistical analysis. Analysts use technical tools to query and sort through terabytes of data looking for patterns. Usually, the analyst will develop a hypothesis, such as customers who buy product X usually buy product Y within six months. Running a query on the relevant data to prove or disprove this theory is data mining. Businesses then

(12)

use this information to make better business decisions based on how they understand their customers and suppliers' behaviors.

Data mining used for

Collection

All companies with a large customer base have a number of customers who do not pay their dues on time. Collecting these payments from the debtors requires a great deal of resources, and a large

proportion of this work is wasted on customers that are difficult or impossible to recover. By applying predictive data mining to historical customer debt data, the

collection procedure can be optimized by identifying the debtors most likely to pay and finding the most effective contact methods or legal actions for each debtor.

Scientific applications

In pharmaceutical companies, chemistry is one of the most resource intensive areas within the research and development (R&D) activities. The whole purpose of the pharmaceutical company’s R&D is to produce new

chemical entities (NCE) that will make it all the way through clinical trials to the market as new drugs. The search for new chemical compounds is in essence a

trial-and-error process. The job of the R&D chemist is to synthesise (produce) new compounds for testing in the

(13)

laboratory.

One such chemist may spend up to three weeks for making just one such compound.

Despite the expensive production of new compounds, pharmaceutical companies test very many compounds in their R&D activities and stores all the results in large databases (there is also a fairly large industry selling chemical compounds of great diversity). By applying predictive data mining to such historical laboratory test data, it is possible to reasonably predict the outcome of the laboratory tests without having to synthesise the compounds. This means that the chemist can find out the most likely properties of the alternative compounds and choose to work on the most promising ones, before spending the next three weeks in the laboratory,

thereby increasing the quality of the resulting compounds.

Recommendation systems

All companies want to sell more to existing customers. This is often the most effective way of increasing the profitability. For companies that sell many different products to large customer bases and that keep

records of sales transactions for their customers, it is possible to apply predictive data mining to identify sales opportunities as products likely to appeal to a particular customer who has not yet bought them. This type of application is also commonly referred to as

cross-selling, and some of the most notable examples of companies using it is Flipkart.com where you will get relevant books recommended.

(14)

Other applications

Other applications of predictive data mining include fraud detection (e.g. within credit card transactions, taxation, telephony, and insurance industry) and risk management (e.g. for determining insurance policy rates or managing credit applications).

Data Mining Techniques

Association

Association is one of the best known data mining

technique. In association, a pattern is discovered based on a relationship of a particular item on other items in the same transaction. For example, the association technique is used in market basket analysis to identify what products that customers frequently purchase together. Based on this data businesses can have corresponding marketing campaign to sell more products to make more profit.

Classification

Classification is a classic data mining technique based on machine learning. Basically classification is used to

(15)

classify each item in a set of data into one of

predefined set of classes or groups. Classification

method makes use of mathematical techniques such as decision trees, linear programming, neural network and statistics. In classification, we make the software that can learn how to classify the data items into groups. For example, we can apply classification in application that “given all past records of employees who left the company, predict which current employees are

probably to leave in the future.” In this case, we divide the employee’s records into two groups that are

“leave” and “stay”. And then we can ask our data mining software to classify the employees into each group.

Clustering

Clustering is a data mining technique that makes

meaningful or useful cluster of objects that have similar characteristic using automatic technique. Different

from classification, clustering technique also defines the classes and put objects in them, while in

classification objects are assigned into predefined classes. To make the concept clearer, we can take library as an example. In a library, books have a wide range of topics available. The challenge is how to keep those books in a way that readers can take several books in a specific topic without hassle. By using clustering technique, we can keep books that have

some kind of similarities in one cluster or one shelf and label it with a meaningful name. If readers want to grab books in a topic, he or she would only go to that shelf instead of looking the whole in the whole library.

Prediction

The prediction as it name implied is one of a data

(16)

independent variables and relationship between dependent and independent variables. For instance, prediction analysis technique can be used in sale to predict profit for the future if we consider sale is an independent variable, profit could be a dependent variable. Then based on the historical sale and profit data, we can draw a fitted regression curve that is used for profit prediction.

Sequential Patterns

Sequential patterns analysis in one of data mining technique that seeks to discover similar patterns in data transaction over a business period. The uncover patterns are used for further business analysis to recognize relationships among data.