• A Data Warehouse collects and manages data from varied sources to provide meaningful business insights.
• It is a collection of data kept separate from the operational systems that supports the decision making of the company. In a Data Warehouse, data is stored from a historical perspective.
• The data in the warehouse is extracted from multiple functional units. It is checked, cleansed, and then integrated into the Data Warehouse system. A Data Warehouse runs on a fast computer system with large storage capacity and can answer complex queries relating to the data.
A data warehouse has the following characteristics:
•
Subject-Oriented
•
Integrated
•
Time-variant
•
Non-volatile
Subject-Oriented
• A data warehouse never focuses on the ongoing operations. Instead, it emphasizes the modeling and analysis of data for decision making. It also provides a simple and concise view of a specific subject by excluding data that does not help support the decision process. These subjects can be sales, marketing, distribution, etc.
Integrated
• In a Data Warehouse, integration means establishing a common unit of measure for all similar data from dissimilar databases. The data also needs to be stored in the Data Warehouse in a common and universally acceptable manner.
In the example below, there are three different applications labeled A, B, and C. The information stored in these applications includes Gender, Date, and Balance. However, each application stores its data in a different way.
• In Application A, the gender field stores logical values like M or F.
• In Application B, the gender field is a numerical value.
• In Application C, the gender field is stored as a character value.
• The same is the case with Date and Balance.
However, after the transformation and cleaning process, all this data is stored in a common format in the Data Warehouse.
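The integration step above can be sketched in Python. This is a minimal illustration, not a real ETL tool: the application encodings for B and C (numeric 1/0 and the strings "male"/"female") are assumptions for the example.

```python
# Sketch: unifying gender codes from three hypothetical source applications
# into one common format, as described above. Encodings for B and C are assumed.

def standardize_gender(value):
    """Map each application's gender encoding to a common 'M'/'F' value."""
    mapping = {
        "M": "M", "F": "F",          # Application A: logical values M / F
        1: "M", 0: "F",              # Application B: numeric codes (assumed 1/0)
        "male": "M", "female": "F",  # Application C: character values (assumed)
    }
    return mapping[value]

records = [
    {"source": "A", "gender": "F"},
    {"source": "B", "gender": 1},
    {"source": "C", "gender": "male"},
]

cleaned = [standardize_gender(r["gender"]) for r in records]
print(cleaned)  # ['F', 'M', 'M']
```

The same pattern (a per-source mapping applied during load) handles the Date and Balance fields.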
Time-Variant
• The time horizon for a data warehouse is quite extensive compared with operational systems. The data collected in a data warehouse is identified with a particular period and offers information from a historical point of view. It contains an element of time, explicitly or implicitly.
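As a small sketch of the explicit time element, each warehouse record below carries a snapshot date, which makes historical queries over a particular period straightforward (the column names and figures are illustrative, not from the text):

```python
from datetime import date

# Each warehouse row carries an explicit time element (snapshot_date),
# so the data can be queried from a historical point of view.
sales = [
    {"snapshot_date": date(2023, 1, 31), "region": "East", "revenue": 1200},
    {"snapshot_date": date(2023, 2, 28), "region": "East", "revenue": 1500},
    {"snapshot_date": date(2023, 2, 28), "region": "West", "revenue": 900},
]

# Historical query: total revenue for February 2023.
feb_total = sum(
    r["revenue"]
    for r in sales
    if r["snapshot_date"].year == 2023 and r["snapshot_date"].month == 2
)
print(feb_total)  # 2400
```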
Non-volatile
• A Data Warehouse is also non-volatile, meaning that previous data is not erased when new data is entered.
• A data mart is a simple form of a Data Warehouse. It is focused on a single subject.
• A Data Mart draws data from only a few sources. These sources may be a central Data Warehouse, internal operational systems, or external data sources.
Differences between Data Warehouse and Data Mart
• Definition – Data Warehouse: a large repository of data collected from different organizations or departments within a corporation. Data Mart: a subtype of a Data Warehouse, designed to meet the needs of a certain user group.
• Usage – Data Warehouse: helps to take strategic decisions. Data Mart: helps to take tactical decisions for the business.
• Objective – Data Warehouse: to provide an integrated environment and a coherent picture of the business at a point in time. Data Mart: mostly used in a business division at the department level.
• Designing – Data Warehouse: the designing process is quite difficult; it may or may not use a dimensional model, but it can feed dimensional models. Data Mart: the designing process is easy; it is built around a dimensional model using a star schema.
• Data Handling – Data Warehouse: covers a large area of the corporation, which is why it takes a long time to process. Data Mart: easy to use, design, and implement, as it handles only small amounts of data.
• Focus – Data Warehouse: broadly focused on all the departments; it can even represent the entire company. Data Mart: subject-oriented and used at the department level.
• Data type – the data stored inside the Data Warehouse is always detailed when compared with a data mart.
• An operational data store (ODS) is a type of database that collects data from multiple sources for processing, after which it sends the data to operational systems and data warehouses.
• It provides a central interface or platform for all operational data used by enterprise systems and applications.
• An ODS is a database designed to integrate data from multiple sources for additional operations on the data, for reporting, controls, and operational decision support.
Activities like delete, update, and insert, which are performed in an operational application environment, are omitted in the Data Warehouse environment. Only two types of data operations are performed in Data Warehousing:
• Data loading
• Data access
Operational Application vs. Data Warehouse
• Data integrity – Operational Application: complex programs must be coded to make sure that data upgrade processes maintain high integrity of the final product. Data Warehouse: this kind of issue does not arise because data updates are not performed.
• Normalization – Operational Application: data is placed in a normalized form to ensure minimal redundancy. Data Warehouse: data is not stored in normalized form.
• Technology – Operational Application: the technology needed to support transactions, data recovery, rollback, and deadlock resolution is quite complex.
There are two prominent architecture styles practiced today to build a data warehouse: the Inmon architecture and the Kimball architecture.
The Inmon Approach
• Step 1 – The Inmon approach to building a data warehouse begins with the corporate data model. This model identifies the key subject areas and, most importantly, the key entities the business operates with and cares about, such as customer, product, and vendor. From this model, a detailed logical model is created for each major entity.
• For example, a logical model will be built for Customer with all the details related to that entity.
• There could be ten different entities under Customer. All the details, including business keys, attributes, dependencies, participation, and relationships, will be captured in the detailed logical model. The key point here is that the entity structure is built in normalized form.
• Data redundancy is avoided as much as possible.
• This leads to clear identification of business concepts and avoids data update anomalies.
• Step 2 – The next step is building the physical model.
• The physical implementation of the data warehouse is also normalized. This is what Inmon calls a ‘data warehouse.’
• This normalized model makes loading the data less complex, but using this structure for querying is hard as it involves many tables and joins.
• So, Inmon suggests building data marts specific to departments.
• The data marts will be designed specifically for Finance, Sales, etc.
• Any data that comes into the data warehouse is integrated, and the data warehouse is the only source of data for the different data marts. This ensures that the integrity and consistency of data is kept intact across the organization.
• The data warehouse truly serves as the single source of truth for the enterprise, as it is the only source for the data marts and all the data in the data warehouse is integrated.
• Data update anomalies are avoided because of very low redundancy. This makes the ETL process easier and less prone to failure.
• The business processes can be understood easily, as the logical model represents the detailed business entities.
• Very flexible – As the business requirements or source data change, it is easy to update the data warehouse, since each piece of data lives in only one place.
• Can handle varied reporting needs across the enterprise.
Inmon Model Advantages and Disadvantages
Here are some of the disadvantages of Inmon method:
• The model and implementation can become complex over time as it involves more tables and joins.
• It needs resources who are experts both in data modeling and in the business itself. These types of resources can be hard to find and are often expensive.
• The initial set-up and delivery will take more time, and management needs to be aware of this.
• More ETL work is needed as the data marts are built from the data warehouse.
The Kimball Approach
• The Kimball approach to building the data warehouse starts with identifying the key business processes and the key business questions that the data warehouse needs to answer.
• ETL software is used to bring data from all the different sources and load into a staging area.
• From here, data is loaded into a dimensional model.
“Here comes the key difference: the model proposed by Kimball for data warehousing, the dimensional model, is not normalized.”
• The fundamental concept of dimensional modeling is the star schema. In the star schema, there is typically a fact table surrounded by many dimensions.
• The fact table has all the measures that are relevant to the subject area, and it also has the foreign keys from the different dimensions that surround the fact.
• The dimensions are completely de-normalized so that the user can drill up and drill down without joining to another table.
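The star schema described above can be sketched with an in-memory SQLite database. All table and column names here are illustrative: one fact table holding the measures and foreign keys, surrounded by two de-normalized dimensions.

```python
import sqlite3

# Minimal star-schema sketch: fact_sales holds the measures (units_sold,
# revenue) plus foreign keys into the dim_date and dim_product dimensions.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, year INTEGER, month TEXT);
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE fact_sales (
    date_key INTEGER REFERENCES dim_date,
    product_key INTEGER REFERENCES dim_product,
    units_sold INTEGER,
    revenue REAL
);
INSERT INTO dim_date VALUES (20230101, 2023, 'Jan'), (20230201, 2023, 'Feb');
INSERT INTO dim_product VALUES (1, 'Widget', 'Hardware'), (2, 'Gadget', 'Hardware');
INSERT INTO fact_sales VALUES (20230101, 1, 10, 100.0), (20230201, 2, 5, 250.0);
""")

# Drill down to revenue by month and product: each dimension is one hop
# away from the fact table, so the query needs only simple joins.
rows = con.execute("""
    SELECT d.month, p.name, SUM(f.revenue)
    FROM fact_sales f
    JOIN dim_date d ON f.date_key = d.date_key
    JOIN dim_product p ON f.product_key = p.product_key
    GROUP BY d.month, p.name
""").fetchall()
print(rows)
```

Because the dimensions are de-normalized, attributes like year, month, and category sit directly on the dimension rows, so no further joins are needed to slice the facts.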
Here are some of the advantages of the Kimball method:
• Quick to set up and build, and the first phase of the data warehousing project will be delivered quickly.
• The star schema can be easily understood by the business users and is easy to use for reporting. Most BI tools work well with star schema.
• It occupies less space in the database, which makes management of the system fairly easy.
• The performance of the star schema model is very good.
• A small team of developers and architects is enough to keep the data warehouse performing effectively (Breslin, 2004).
• Works really well for department-wise metrics and KPI tracking, as the data marts are geared towards department-wise or business process-wise reporting.
• Drill-across, where a BI tool goes across multiple star schemas to generate a report, can be successfully accomplished using conformed dimensions.
Here are some of the disadvantages of the Kimball method:
• The essence of ‘one source of truth’ is lost, as data is not fully integrated before serving reporting needs.
• Redundant data can cause data update anomalies over time.
• Adding columns to the fact table can cause performance issues. This is because the fact tables are designed to be very deep. If new columns are to be added, the size of the fact table becomes much larger and will not perform well. This makes the dimensional model hard to change as the business requirements change.
• Cannot handle all the enterprise reporting needs because the model is oriented towards business processes rather than the enterprise as a whole.
• Integration of legacy data into the data warehouse can be a complex process.
• Reporting Requirements – If the reporting requirements are strategic and enterprise-wide and integrated
reporting is needed, then Inmon works best. If the reporting requirements are business process/team oriented, then Kimball works best.
• Project Urgency – If the organization has enough time to wait for the first delivery of the data warehouse (say
4 to 9 months), then Inmon approach can be followed. If there is very little time for the data warehouse to be up and running (say, 2 to 3 months) then the Kimball approach is best.
• Future Staffing Plan – If the company can afford to have a large sized team of specialists to maintain the data
warehouse, then the Inmon method can be pursued. If the future plan for the team is to be thin, then Kimball is more suited.
• Frequency of Changes – If the reporting requirements are expected to change more rapidly and the source
systems are known to be volatile, then the Inmon approach works better, as it is more flexible. If the requirements and source systems are relatively stable, the Kimball method can be used.
• Organization Culture – If the sponsors of the data warehouse and the managers of the firm understand the
value proposition of the data warehouse and are willing to accept long-lasting value from the data warehouse investment, the Inmon approach is better. If the sponsors do not care about the concepts but want a solution to get better at reporting, then the Kimball approach is enough.
• ETL is defined as a process that extracts data from different RDBMS source systems, then transforms the data (applying calculations, concatenations, etc.), and finally loads the data into the Data Warehouse system. The full form of ETL is Extract, Transform, and Load.
ETL is a 3-step process
Step 1) Extraction
• In this step, data is extracted from the source system into the staging area.
Three Data Extraction methods:
• Full Extraction
• Partial Extraction – without update notification
• Partial Extraction – with update notification
Irrespective of the method used, extraction should not affect the performance and response time of the source systems. These source systems are live production databases; any slowdown or locking could affect the company's bottom line.
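The three extraction methods can be sketched as follows. Everything here is an illustrative assumption: the `last_modified` column (used when the source gives no change feed) and the `change_log` of modified ids (used when it does) are not from any real system.

```python
# Sketch of the three extraction styles against a hypothetical source table.
source_rows = [
    {"id": 1, "name": "Alice", "last_modified": "2023-05-01"},
    {"id": 2, "name": "Bob",   "last_modified": "2023-05-10"},
    {"id": 3, "name": "Carol", "last_modified": "2023-05-12"},
]

def full_extraction(rows):
    # Full Extraction: pull every row on every run.
    return list(rows)

def partial_extraction_without_notification(rows, last_run_date):
    # The source offers no change feed, so we filter on a timestamp
    # column ourselves to find rows changed since the last run.
    return [r for r in rows if r["last_modified"] > last_run_date]

def partial_extraction_with_notification(rows, change_log):
    # The source tells us which ids changed; pull only those rows.
    changed = set(change_log)
    return [r for r in rows if r["id"] in changed]

print(len(full_extraction(source_rows)))                                   # 3
print(partial_extraction_without_notification(source_rows, "2023-05-05"))  # Bob, Carol
print(partial_extraction_with_notification(source_rows, [2]))              # Bob only
```

The partial methods touch far fewer rows per run, which matters because the sources are live production databases.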
• Some validations are done during Extraction:
• Reconcile records with the source data
• Make sure that no spam/unwanted data is loaded
• Check data types
• Remove all types of duplicate/fragmented data
• Check whether all the keys are in place
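A few of these validations (key check, type check, duplicate removal) can be sketched in one pass over the extracted records. The field names `id` and `name` are assumptions for the example:

```python
# Sketch of extraction-time validations: key presence, data type,
# and duplicate removal (field names are illustrative).
def validate(records, required_keys=("id", "name")):
    seen_ids = set()
    clean = []
    for rec in records:
        if not all(k in rec for k in required_keys):  # keys in place?
            continue
        if not isinstance(rec["id"], int):            # data type check
            continue
        if rec["id"] in seen_ids:                     # remove duplicates
            continue
        seen_ids.add(rec["id"])
        clean.append(rec)
    return clean

raw = [
    {"id": 1, "name": "Alice"},
    {"id": 1, "name": "Alice"},   # duplicate record
    {"id": "2", "name": "Bob"},   # wrong data type for id
    {"name": "NoKey"},            # missing key
]
print(validate(raw))  # [{'id': 1, 'name': 'Alice'}]
```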
Step 2) Transformation
• Data extracted from the source server is raw and not usable in its original form. Therefore, it needs to be cleansed, mapped, and transformed.
Example –
• In the transformation step, you can perform customized operations on the data. For instance, the user may want a sum of sales revenue that is not in the database, or the first name and last name in a table may be stored in different columns; they can be concatenated before loading.
1. Different spellings of the same person, like Jon, John, etc.
2. Multiple ways to denote a company name, like Google, Google Inc.
3. Use of different names, like Cleaveland, Cleveland.
4. Different account numbers may be generated by various applications for the same customer.
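Fixes for issues like these can be sketched as a small transformation function. The lookup tables and column names below are illustrative assumptions, not from a real system:

```python
# Sketch of transformation-step fixes: name concatenation, company-name
# standardization, and spelling correction (all mappings are illustrative).
COMPANY_ALIASES = {"google inc.": "Google", "google": "Google"}
CITY_FIXES = {"Cleaveland": "Cleveland"}

def transform(row):
    out = dict(row)
    # Concatenate first and last name into a single column.
    out["full_name"] = f"{row['first_name']} {row['last_name']}"
    # Standardize multiple spellings of a company name.
    out["company"] = COMPANY_ALIASES.get(row["company"].lower(), row["company"])
    # Correct known misspellings.
    out["city"] = CITY_FIXES.get(row["city"], row["city"])
    return out

row = {"first_name": "John", "last_name": "Smith",
       "company": "Google Inc.", "city": "Cleaveland"}
result = transform(row)
print(result["full_name"], result["company"], result["city"])
# John Smith Google Cleveland
```

In a real pipeline, such mappings would come from reference data or fuzzy matching rather than hard-coded dictionaries.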