Data Mining (DM)
Ms Ansif Arooj, University of Education, S & T, Township Campus Lahore
Lecture 2: Data Mining and its Applications
2
Why Data Mining?
âš« The Explosive Growth of Data: from terabytes to petabytes
âš« Data collection and data availability
âš« Automated data collection tools, database systems,
Web, computerized society
âš« Major sources of abundant data
⚫ Business: Web, e-commerce, transactions, stocks, …
âš« Science: Remote sensing, bioinformatics, scientific
simulation, …
âš« Society and everyone: news, digital cameras, YouTube
âš« We are drowning in data, but starving for knowledge!
⚫ “Necessity is the mother of invention”—Data
3
What Is Data Mining?
âš« Data mining (knowledge discovery from data)
âš« Extraction of interesting (non-trivial, implicit, previously
unknown and potentially useful) patterns or knowledge from huge amount of data
âš« Data mining: a misnomer?
âš« Alternative names
âš« Knowledge discovery (mining) in databases (KDD),
knowledge extraction, data/pattern analysis, data
archeology, data dredging, information harvesting, business intelligence, etc.
4
Knowledge Discovery (KDD) Process
âš« This is a view from typical database systems and data warehousing
communities
âš« Data mining plays an essential role in the knowledge discovery process
Data Cleaning Data Integration Databases Data Warehouse Task-relevant Data Selection Data Mining Pattern Evaluation
© Prentice Hall 6
© Prentice Hall 7
Data Warehousing
âš«
“
Subject-oriented, integrated, time-variant, nonvolatile” William Inmon⚫ Operational Data: Data used in day to day needs of company.
âš« Informational Data: Supports other functions such as planning and forecasting.
âš« Data mining tools often access data warehouses rather than operational data.
© Prentice Hall 8
Operational vs. Informational
Operational Data Data Warehouse
Application OLTP OLAP
Use Precise Queries Ad Hoc Temporal Snapshot Historical Modification Dynamic Static Orientation Application Business Data Operational Values Integrated
Size Gigabits Terabits
Level Detailed Summarized
Access Often Less Often
Response Few Seconds Minutes
© Prentice Hall 9
OLAP
âš« Online Analytic Processing (OLAP): provides more complex queries than OLTP.
âš« OnLine Transaction Processing (OLTP): traditional database/transaction processing.
âš« Dimensional data; cube view
âš« Visualization of operations:
âš« Slice: examine sub-cube.
âš« Dice: rotate cube to look at another dimension.
âš« Roll Up/Drill Down
© Prentice Hall 10
OLAP Operations
Single Cell Multiple Cells Slice Dice
Roll Up
11
Data Mining in Business Intelligence
Increasing potential to support
business decisions End User
Business Analyst Data Analyst DBA Decision Making Data Presentation Visualization Techniques Data Mining Information Discovery Data Exploration
Statistical Summary, Querying, and Reporting
Data Preprocessing/Integration, Data Warehouses Data Sources
12
KDD Process: A Typical View from ML and Statistics
Input Data Data
Mining
Data Pre-Processing
Post-Process ing
âš« This is a view from typical machine learning and statistics communities
Data integration Normalization Feature selection Dimension reduction
Pattern discovery
Association & correlation Classification Clustering Outlier analysis … … … … Pattern evaluation Pattern selection Pattern interpretation Pattern visualization
13
Multi-Dimensional View of Data Mining
âš« Data to be mined
âš« Database data (extended-relational, object-oriented, heterogeneous, legacy), data warehouse, transactional data, stream, spatiotemporal, time-series, sequence, text and web, multi-media, graphs & social and information networks
âš« Knowledge to be mined (or: Data mining functions)
âš« Characterization, discrimination, association, classification, clustering, trend/deviation, outlier analysis, etc.
âš« Descriptive (mining tasks characterize properties of the data in a target data set.) vs. predictive data mining (mining tasks perform induction on the current data in order to make predictions).
âš« Multiple/integrated functions and mining at multiple levels
âš« Techniques utilized
âš« Data-intensive, data warehouse (OLAP), machine learning, statistics, pattern recognition, visualization, high-performance, etc.
FUNCTION OF DATA MINING
Application of Data Mining
âš«
Spatial Data Analysis
âš«
Information Retrieval
⚫Pattern Recognition
⚫Image Analysis
⚫Signal Processing
⚫Computer Graphics
⚫Web Technology
⚫Business
⚫Bioinformatics
17
Data Mining Function:
(1) Generalization
âš« Information integration and data warehouse construction
âš« Data cleaning, transformation, integration, and
multidimensional data model
âš« Data cube technology
âš« Scalable methods for computing (i.e., materializing)
multidimensional aggregates
âš« OLAP (online analytical processing)
âš« Multidimensional concept description:
âš« Characterization and discrimination
âš« Generalize, summarize,
18
Data Mining Function: (2)
Association and
Correlation Analysis
âš« Frequent patterns (or frequent item sets)
What items are frequently purchased together in your shopping mall?
âš« Association, correlation vs. causality
âš« A typical association rule
Butter, Bread Milk [20%, 100%] [support, confidence) RentType(X, "game") AND Age(X, "13-19") -> Buys(X, "pop") [s=2% ,c=55%]
Are strongly associated items also strongly correlated?
âš« How to mine such patterns and rules efficiently in large datasets?
âš« How to use such patterns for classification, clustering, and other applications?
Support(x->y)=P(XUY) Confidence(x->y)=XUY/X
Small Data Mining Example
âš« X Y
âš« Butter, Bread Milk
âš« [20%, 100%] (support, confidence)
âš«
Support
âš« the proportion of transactions in the data set which
contain the itemset.
âš« 1/5
âš«
Confidence
CLASS ACTIVITY
Rule 1) Coke, burger Diapers
Rule 2) Coke, burger, Potatoes bread
Rule 3) Coke, burger, potatoes onion, bread Rule 4) burger, potatoes, onion coke
Transaction ID
Coke Burger Potatoes Onion Diapers Bread
1 1 1 1 1 2 1 1 1 3 1 1 1 1 4 1 1 1 1 1 5 1 6 1 1 1 7 1 1 1 1 8 1 1 1 1 1 9 1 1 10 1 1 1 1 1 1
24
Data Mining Function:
(3) Classification
âš« Classification and label predictionâš« Construct models (functions) based on some training examples
âš« Also named as supervised classification
âš« Describe and distinguish classes or concepts for future prediction âš« E.g., classify countries based on (climate), or classify cars based
on (gas mileage)
âš« Typical methods
âš« Decision trees, naĂŻve Bayesian classification, support vector
machines, neural networks, rule-based classification, pattern-based classification, logistic regression, …
âš« Typical applications:
⚫ Credit card fraud detection, direct marketing, classifying stars, diseases, web-pages, …
27
Data Mining Function:
(4) Cluster Analysis
âš« Unsupervised learning (i.e., Class label is unknown)
âš« Group data to form new categories (i.e., clusters), e.g.,
cluster houses to find distribution patterns
âš« Principle: Maximizing intra-class similarity & minimizing interclass similarity
29
Data Mining Function:
(5) Outlier Analysis
âš« Outlier analysis
âš« Outlier: A data object that does not comply with the
general behavior of the data
⚫ Noise or exception? ― One person’s garbage could be
another person’s treasure
âš« Methods: by product of clustering or regression analysis,
…
Data Mining Function:
(6) Prediction
⚫The major idea is to use a large number of past
values to consider probable future values.
âš«
Forecasting and predicting the unavailable data
values or a class label for some data.
31
Evaluation of Knowledge
âš« Are all mined knowledge interesting?
⚫ One can mine tremendous amount of “patterns”
âš« Some may fit only certain dimension space (time,
location, …)
⚫ Some may not be representative, may be transient, …
⚫ Evaluation of mined knowledge → directly mine only
interesting knowledge? ⚫ Descriptive vs. predictive ⚫ Coverage ⚫ Typicality vs. novelty ⚫ Accuracy ⚫ Timeliness ⚫ …
32
Data Mining: Confluence of Multiple Disciplines
Data Mining Machine Learning Statistics Applicati ons Algorithm Pattern Recogniti on High-Perform ance Computing Visualizat ion Database Technolo gy
33
Summary
âš« Data mining: Discovering interesting patterns and knowledge from massive amount of data
âš« A natural evolution of science and information technology, in great demand, with wide applications
âš« A KDD process includes data cleaning, data integration, data selection,
transformation, data mining, pattern evaluation, and knowledge presentation
âš« Mining can be performed in a variety of data
âš« Data mining functionalities: characterization, discrimination, association, classification, clustering, trend and outlier analysis, etc.
âš« Data mining technologies and applications
Class Activity
âš«
Discuss whether or not each of the following activities
is a data mining task.
âš«
A) Dividing the customers of a company according to
their gender.
âš«
No. This is a simple database query.
âš«
B) Dividing the customers of a company according to
their profitability.
âš«
No. This is an accounting calculation, followed by the
application of a threshold. However, predicting the
profitability of a new customer would be data mining.
Data Mining yes/no?
âš« (c) Computing the total sales of a company.
âš« No. Again, this is simple accounting.
âš« (d) Sorting a student database based on student identification numbers.
âš« No. Again, this is a simple database query.
âš« (e) Predicting the outcomes of tossing a (fair) pair of dice.
âš« No. Since the die is fair, this is a probability calculation. If
the die were not fair, and we needed to estimate the
probabilities of each outcome from the data, then this is more like the problems considered by data mining.
However, in this specific case, solutions to this problem were developed by mathematicians a long time ago, and thus, we wouldn’t consider it to be data mining.
Data Mining yes/no?
âš«
(f)Predicting the future stock price of a company using
historical records.
âš«
Yes. We would attempt to create a model that can
predict the continuous value of the stock price. This is
an example of the area of data mining known as
predictive modelling. We could use regression for this
modelling, although researchers in many fields have
developed a wide variety of techniques for predicting
time series.
Data Mining yes/no?
âš«
(g) Monitoring the heart rate of a patient for
abnormalities.
âš«
Yes. We would build a model of the normal behavior
of heart rate and raise an alarm when an unusual heart
behavior occurred. This would involve the area of data
mining known as anomaly detection. This could also
be considered as a classification problem if we had
Data Mining yes/no?
âš«
(h) Monitoring seismic waves for earthquake activities.
âš«
Yes. In this case, we would build a model of different
types of seismic wave behavior associated with
earthquake activities and raise an alarm when one of
these different types of seismic activity was observed.
This is an example of the area of data mining known as
classification.
âš«
Extracting the frequencies of a sound wave.
Data Mining and Data Privacy
âš« For each of the following data sets, explain whether or not data privacy is an important issue.
⚫ (a) Census data collected from 1900–1950.
âš« No
âš« (b) IP addresses and visit times of Web users who visit your Website.
âš« Yes
âš« (c) Images from Earth-orbiting satellites.
âš« No
âš« (d) Names and addresses of people from the telephone book.
âš« No
âš« (e) Names and email addresses collected from the Web.
40
Recommended Reference Books
âš« E. Alpaydin. Introduction to Machine Learning, 2nd ed., MIT Press, 2011
âš« S. Chakrabarti. Mining the Web: Statistical Analysis of Hypertex and Semi-Structured Data. Morgan Kaufmann, 2002
âš« R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, 2ed., Wiley-Interscience, 2000
âš« T. Dasu and T. Johnson. Exploratory Data Mining and Data Cleaning. John Wiley & Sons, 2003
âš« U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy. Advances in Knowledge Discovery and Data Mining.
AAAI/MIT Press, 1996
âš« U. Fayyad, G. Grinstein, and A. Wierse, Information Visualization in Data Mining and Knowledge Discovery, Morgan
Kaufmann, 2001
âš« J. Han, M. Kamber, and J. Pei, Data Mining: Concepts and Techniques. Morgan Kaufmann, 3rd ed. , 2011
âš« T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd
ed., Springer, 2009
âš« B. Liu, Web Data Mining, Springer 2006
âš« T. M. Mitchell, Machine Learning, McGraw Hill, 1997
âš« Y. Sun and J. Han, Mining Heterogeneous Information Networks, Morgan & Claypool, 2012
âš« P.-N. Tan, M. Steinbach and V. Kumar, Introduction to Data Mining, Wiley, 2005
âš« S. M. Weiss and N. Indurkhya, Predictive Data Mining, Morgan Kaufmann, 1998
âš« I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations,