• Generally, data mining (sometimes called data or knowledge discovery) is the process of analyzing data from different perspectives and summarizing it into useful information: information that can be used to increase revenue, cut costs, or both.
• Data mining software is one of a number of analytical tools for analyzing data. It allows users to analyze data from
many different dimensions or angles, categorize it, and summarize the relationships identified. Technically, data mining is the process of finding correlations or patterns among dozens of fields in large relational databases.
Continuous Innovation
• Although data mining is a relatively new term, the technology is not. Companies have used powerful computers to sift through volumes of supermarket
scanner data and analyze market research reports for years. However, continuous innovations in computer processing power, disk storage, and statistical
software are dramatically increasing the accuracy of analysis while driving down the cost.
What can data mining do?
• Data mining is primarily used today by companies with a strong consumer focus: retail, financial, communication, and marketing organizations. It enables these companies to determine relationships among "internal" factors such as price, product positioning, or staff skills, and "external" factors such as economic indicators, competition, and customer demographics. It also enables them to determine the impact of these factors on sales, customer satisfaction, and corporate profits. Finally, it enables them to "drill down" into summary information to view detailed transactional data.
• With data mining, a retailer could use point-of-sale records of customer purchases to send targeted
promotions based on an individual's purchase
history. By mining demographic data from comment or warranty cards, the retailer could develop
products and promotions to appeal to specific customer segments.
• For example, Blockbuster Entertainment mines its video rental history database to recommend rentals to individual customers. American Express can
suggest products to its cardholders based on analysis of their monthly expenditures.
• WalMart is pioneering massive data mining to transform its supplier relationships. WalMart
captures point-of-sale transactions from over 2,900 stores in 6 countries and continuously transmits this data to its massive 7.5 terabyte Teradata data
warehouse. WalMart allows more than 3,500 suppliers to access data on their products and
perform data analyses. These suppliers use this data to identify customer buying patterns at the store
display level. They use this information to manage local store inventory and identify new
merchandising opportunities. In 1995, WalMart computers processed over 1 million complex data queries.
• The National Basketball Association (NBA) is exploring a data mining application that can be used in conjunction with image recordings of basketball games. The Advanced Scout software analyzes the movements of players to help coaches orchestrate plays and strategies. For example, an analysis of the play-by-play sheet of the game played between the New York Knicks and the Cleveland Cavaliers on January 6, 1995, reveals that when Mark Price played the guard position, John Williams attempted four jump shots and made each one! Advanced Scout not only finds this pattern, but also explains that it is interesting because it differs considerably from the average shooting percentage of 49.30% for the Cavaliers during that game.
• By using the NBA universal clock, a coach can automatically bring up the video clips showing each of the jump shots attempted by Williams with Price on the floor, without needing to comb through hours of video footage. Those clips show a very successful pick-and-roll play in which Price draws the Knicks' defense and then finds Williams for an open jump shot.
How does data mining work?
• While large-scale information technology has been evolving separate transaction and analytical
systems, data mining provides the link between the two. Data mining software analyzes relationships and patterns in stored transaction data based on
open-ended user queries. Several types of analytical software are available: statistical, machine learning, and neural networks. Generally, any of four types of relationships are sought:
• Classes: Stored data is used to locate data in predetermined groups. For example, a restaurant chain could mine
customer purchase data to determine when customers visit and what they typically order. This information could be used to increase traffic by having daily specials.
• Clusters: Data items are grouped according to logical
relationships or consumer preferences. For example, data can be mined to identify market segments or consumer affinities.
• Associations: Data can be mined to identify associations between items. The well-known beer-diaper example, in which the two products are found to be purchased together, is a case of associative mining.
• Sequential patterns: Data is mined to anticipate behavior patterns and trends. For example, an outdoor equipment retailer could predict the likelihood of a backpack being
purchased based on a consumer's purchase of sleeping bags and hiking shoes.
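The association case can be made concrete with a small calculation. The sketch below is illustrative only: the baskets are invented, and the function names `pair_counts` and `rule_stats` are hypothetical helpers, not part of any particular data mining product. It measures how often two items appear together (support) and how often the second item appears given the first (confidence).

```python
from collections import Counter
from itertools import combinations

# Hypothetical market baskets; the items and counts are invented for illustration.
baskets = [
    {"beer", "diapers", "chips"},
    {"beer", "diapers"},
    {"diapers", "milk"},
    {"beer", "chips"},
    {"beer", "diapers", "milk"},
]


def pair_counts(transactions):
    """Count how many transactions contain each unordered pair of items."""
    counts = Counter()
    for basket in transactions:
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return counts


def rule_stats(transactions, antecedent, consequent):
    """Return (support, confidence) for the rule antecedent -> consequent."""
    total = len(transactions)
    both = sum(1 for b in transactions if antecedent in b and consequent in b)
    with_antecedent = sum(1 for b in transactions if antecedent in b)
    support = both / total
    confidence = both / with_antecedent if with_antecedent else 0.0
    return support, confidence


support, confidence = rule_stats(baskets, "diapers", "beer")
print(f"diapers -> beer: support={support:.2f}, confidence={confidence:.2f}")
print(pair_counts(baskets).most_common(3))
```

A rule is usually considered interesting only when both numbers are high enough; the thresholds are a judgment call and are not specified here.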
• Data mining consists of five major elements:
• Extract, transform, and load transaction data onto the data warehouse system.
• Store and manage the data in a multidimensional database system.
• Provide data access to business analysts and information technology professionals.
• Analyze the data by application software.
• Present the data in a useful format, such as a graph or table.
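The five elements above can be sketched end to end on a toy scale. This is a minimal sketch under stated assumptions: an in-memory SQLite database stands in for the data warehouse (it is relational, not multidimensional), the sales records are invented, and the function names are hypothetical.

```python
import sqlite3

# Hypothetical raw point-of-sale records: (store, product, amount as text).
raw_sales = [
    (" North", "Backpack", "59.90"),
    ("north", "Sleeping Bag", "80.10"),
    ("South ", "backpack", "60.00"),
]


def load_warehouse(records):
    """Extract, transform, and load: clean the raw records and store them."""
    conn = sqlite3.connect(":memory:")  # stand-in for a warehouse system
    conn.execute("CREATE TABLE sales (store TEXT, product TEXT, amount REAL)")
    cleaned = [(s.strip().lower(), p.strip().lower(), float(a)) for s, p, a in records]
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", cleaned)
    conn.commit()
    return conn


def revenue_by_product(conn):
    """Access and analyze: total revenue per product."""
    rows = conn.execute(
        "SELECT product, SUM(amount) FROM sales GROUP BY product ORDER BY product"
    )
    return [(product, round(total, 2)) for product, total in rows]


def render_table(rows):
    """Present: format the result as a plain-text table."""
    lines = [f"{'product':<14}{'revenue':>10}"]
    lines += [f"{product:<14}{total:>10.2f}" for product, total in rows]
    return "\n".join(lines)


warehouse = load_warehouse(raw_sales)
print(render_table(revenue_by_product(warehouse)))
```

In a real deployment each step would typically be a separate system; they are collapsed into one script here only to show how the elements hand data to one another.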
7. OLAP
• OLAP is an acronym for On Line Analytical
Processing. It is an approach to quickly provide the answer to analytical queries that are dimensional in nature.
• It is part of the broader category of business intelligence, which also includes extract, transform, load (ETL), relational reporting, and data mining.
• The typical applications of OLAP are in business reporting for sales, marketing, management
reporting, business performance management (BPM), budgeting and forecasting, financial reporting and similar areas.
• The term OLAP was created as a slight modification of the traditional database term OLTP (On Line
Transaction Processing).
• Databases configured for OLAP employ a
multidimensional data model, allowing for complex analytical and ad-hoc queries with a rapid execution time.
• Nigel Pendse has suggested that an alternative and perhaps more descriptive term to describe the concept of OLAP is Fast Analysis of Shared Multidimensional Information (FASMI). OLAP databases borrow aspects of navigational databases and hierarchical databases that are speedier than their relational counterparts.
Functionality
• OLAP takes a snapshot of a set of source data and restructures it into an OLAP cube. Queries can then be run against the cube. It has been claimed that for complex queries OLAP can produce an answer in around 0.1% of the time taken by the same query on OLTP relational data.
• The cube is created from a star schema or snowflake schema of tables. At the centre is the fact table, which lists the core facts that make up the query. Numerous dimension tables are linked to the fact table. These tables indicate how the aggregations of relational data can be analyzed. The number of possible aggregations is determined by every possible manner in which the original data can be hierarchically linked.
• For example, a set of customers can be grouped by city, by district, or by country; so with 50 cities, 8 districts, and 2 countries there are three hierarchical levels with 60 members (50 + 8 + 2).
• These customers can be considered in relation to products; if there are 250 products with 20 categories, 3 families, and 3 departments, then there are 276 product members (250 + 20 + 3 + 3).
• With just these two dimensions there are 16,560 (276 * 60) possible aggregations. As the amount of data considered increases, the number of aggregations can quickly total tens of millions or more.
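The arithmetic in this example can be checked with a few lines of code. The member counts below are taken from the example above; the dictionary layout and the function names `members` and `possible_aggregations` are assumptions made for illustration.

```python
from math import prod

# Member counts per hierarchy level, taken from the customer/product example.
dimensions = {
    "customer": {"city": 50, "district": 8, "country": 2},
    "product": {"product": 250, "category": 20, "family": 3, "department": 3},
}


def members(levels):
    """Total members in one dimension: the sum over its hierarchy levels."""
    return sum(levels.values())


def possible_aggregations(dims):
    """Possible aggregations: the product of the member totals of all dimensions."""
    return prod(members(levels) for levels in dims.values())


print(members(dimensions["customer"]))    # 60
print(members(dimensions["product"]))     # 276
print(possible_aggregations(dimensions))  # 16560
```

Because the totals multiply, every additional entry in `dimensions` scales the result by that dimension's member count, which is why the number grows so quickly.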
• The calculated aggregations and the base data combined make up an OLAP cube, which can potentially contain the answer to every query that can be answered from the data (as in Gray, Bosworth, Layman, and Pirahesh, 1997). Because the number of aggregations to be calculated is potentially very large, often only a predetermined number are fully calculated while the remainder are solved on demand.
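The idea of calculating some aggregations ahead of time and solving the rest on demand can be sketched as follows. This is a simplified illustration, not how any specific OLAP product works: the fact rows are invented, the class name `LazyCube` is hypothetical, and hierarchies within a dimension are ignored so that each grouping is simply a subset of the dimensions.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical fact rows: two dimension values plus one numeric measure.
DIMENSIONS = ("country", "category")
facts = [
    {"country": "US", "category": "tents", "sales": 100},
    {"country": "US", "category": "boots", "sales": 40},
    {"country": "CA", "category": "tents", "sales": 70},
]


def aggregate(rows, group_by):
    """Sum the measure for each combination of values of the group_by dimensions."""
    totals = defaultdict(int)
    for row in rows:
        key = tuple(row[d] for d in group_by)
        totals[key] += row["sales"]
    return dict(totals)


class LazyCube:
    """Precompute chosen groupings; compute and cache the others on first request."""

    def __init__(self, rows, dimensions, precompute):
        self.rows = rows
        self.dimensions = dimensions
        self.cache = {tuple(g): aggregate(rows, tuple(g)) for g in precompute}

    def all_groupings(self):
        """Every subset of the dimensions, from the grand total to full detail."""
        return [
            combo
            for size in range(len(self.dimensions) + 1)
            for combo in combinations(self.dimensions, size)
        ]

    def query(self, group_by):
        group_by = tuple(group_by)
        if group_by not in self.cache:  # solved on demand, then kept
            self.cache[group_by] = aggregate(self.rows, group_by)
        return self.cache[group_by]


cube = LazyCube(facts, DIMENSIONS, precompute=[("country",)])
print(cube.query(("country",)))   # served from the precomputed cache
print(cube.query(("category",)))  # computed on first request
print(cube.query(()))             # grand total
```

Which groupings to precompute is a trade-off between storage and query latency; the choice of `("country",)` here is arbitrary.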