CSD479
Data Mining
Instructor: Dr. Zeeshan Gillani, Room # 37, Faculty Block
Lectures:
Office Hrs: Mon 10:00 – 11:30; Thur 11:30 – 13:00 hrs (or by appointment)
Prerequisite: Knowledge of Statistics and Database/Data Warehousing is helpful
This course aims to provide the students with the key
concepts of applications, techniques, and methodologies of
Data Mining with the primary focus on the classification and
clustering algorithms.
Mining Methodology, Overview of Data Warehousing,
Overview of OLAP, Applications of Data Mining, Data
cleaning and preparation, Concept Description, Association
Rule Mining, Classification, Classification by Back
Propagation, Prediction, Decision Trees, Bayesian
Classification, Classification Accuracy, Regression for
Classification and Prediction, Distributions, Cluster
Analysis.
Text Book:
1. Han, J. and Kamber, M. (2011) Data Mining: Concepts and Techniques, 3rd Edition, Morgan Kaufmann.
Reference Books:
1. Provost, F. and Fawcett, T. (2013) Data Science for Business: What you need to know about data mining and data-analytic thinking, 1st Edition, O'Reilly Media.
2. Witten, I. H., Frank, E. and Hall, M. A. (2011) Data Mining: Practical Machine Learning Tools and Techniques, 3rd Edition, Morgan Kaufmann.
Instruments:
There will be 4 assignments, 4 quizzes,
Weights:
Assignments 10%
Quizzes 15%
S-I 10%
S-II 15%
Final Exam 50%
Lect.# Topics/Contents
1 Introduction to Data Mining. Data Mining on different kinds of Databases. Data mining functionalities.
2 Data objects and Attribute Types. Some basic Statistical Descriptions of Data: Mean, Median, Mode, S.D., Variance etc. Data Similarity and Dissimilarity.
3 Non-Euclidean Distances for Nominal, Ordinal and Mixed-Type attributes.
4 Data Preprocessing techniques; Data cleaning; Data integration.
5 Data Integration problems, removing data redundancy using Chi-square and correlation analysis.
6 Data Reduction; Dimensionality Reduction, Numerosity Reduction, Data Compression, PCA.
7 Examples of PCA; Data Normalization.
8 Mining Frequent Patterns, Market basket analysis, frequent itemsets, frequent pattern mining, mining association rules from frequent itemsets.
9 Finding frequent itemsets using candidate generation, generating association rules from frequent itemsets. Brute force algorithm and the Apriori Algorithm.
10 Finding interestingness, strong rules are not necessarily interesting, from association analysis to correlation analysis.
11 Sessional-I
12 Introduction to Classification, Classification by Decision Tree, Decision tree induction, attribute selection measures.
13 Entropy and Gini measures for tree induction, tree pruning.
14 Tree pruning, pre and post pruning, scalability.
15 Model Evaluation methods. Introduction to Weka.
16 Conditional Probability and Bayes Theorem.
17 Introduction to Naive Bayes Classifier with examples.
18 Rule-based Classification: Using IF-THEN Rules for Classification, Rule Extraction from a Decision Tree.
19 Rule induction using a sequential covering algorithm. Methods of Rule evaluation.
20 Introduction to Artificial Neural Network. A Multilayer feed-forward neural network, backpropagation.
21 Example of ANN. Revision.
22 Sessional-II
23 Discussion on S-II. Introduction to clustering, K-Means clustering.
24 Examples of k-means, k-modes, selecting best k.
25 Clustering: K-Medoids with examples.
26 Clustering: Introduction to CLARA and CLARANS.
27 Introduction to Hierarchical Clustering. Agglomerative Clustering using Single Link.
28 Agglomerative Clustering using Complete Link, Average Link and MST. Divisive Algorithms.
29 Introduction to BIRCH. Clustering Features. CF Tree.
30 Major tasks of clustering evaluation, Extrinsic and intrinsic evaluation methods. Revision.
31 Terminal Exam
Introduction to Data Mining
(Chapter #1 of text book)
Motivation: “Necessity is the
Mother of Invention”
Data Explosion Problem
1. Automated data collection tools (e.g. web, sensor networks) and mature database technology lead to tremendous amounts of data stored in databases, data warehouses and other information repositories.
2. Enterprises are currently facing a data explosion problem.
3. Every minute, YouTube users upload 48 hours of video, Facebook users share 684,478 pieces of content, Instagram users share 3,600 new photos, and Tumblr sees 27,778 new posts published.
A full 90% of the world's data was generated over the last two years (Date: May 22, 2013, Source: SINTEF).
Electronic Information an Important Asset for Business
Decisions
1. With the growth of electronic information, enterprises began to realize that the accumulated information can be an important asset in their business decisions.
2. There is potential business intelligence hidden in the large volume of data.
3. This intelligence can be the secret weapon on which the success of a business may depend.
1. It is not a simple matter to discover business intelligence from a mountain of accumulated data.
2. What is required are techniques that allow the enterprise to extract the most valuable information.
3. The field of data mining provides such techniques.
4. These techniques can find novel (unknown) patterns that may assist an enterprise in understanding the business better and in forecasting.
What Is Data Mining?
Data mining (knowledge discovery in databases):
Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) information or patterns from data in large databases
Alternative names:
Data mining: a misnomer?
Knowledge discovery (mining) in databases (KDD),
Data Mining (Example)
Random Guessing vs. Potential Knowledge
Suppose we have to forecast the probability of rain in Islamabad city for any particular day.
Without any prior knowledge, the probability of rain would be taken as 50% (a pure random guess).
If we had a lot of weather data, we could use Data Mining to extract potential rules that forecast the chance of rain better than random guessing.
Example: The Rule
if [Temperature = ‘hot’ and Humidity = ‘high’] then there is a 66.6% chance of rain.
Temperature  Humidity  Windy  Rain
hot          high      false  No
hot          high      true   Yes
hot          high      false  Yes
mild         high      false  No
cool         normal    false  No
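The 66.6% figure in the rule can be checked directly from the five records on the slide. The following is a minimal sketch; the function name `rule_confidence` and the tuple layout are illustrative choices, not part of the course material.

```python
# Weather records from the slide: (temperature, humidity, windy, rain)
records = [
    ("hot", "high", False, "No"),
    ("hot", "high", True, "Yes"),
    ("hot", "high", False, "Yes"),
    ("mild", "high", False, "No"),
    ("cool", "normal", False, "No"),
]


def rule_confidence(rows, temperature, humidity):
    """Fraction of rows matching the rule's condition in which it rained."""
    matching = [r for r in rows if r[0] == temperature and r[1] == humidity]
    if not matching:
        return None  # the rule never fires on this data
    rainy = sum(1 for r in matching if r[3] == "Yes")
    return rainy / len(matching)


confidence = rule_confidence(records, "hot", "high")
print(confidence)  # 2 of the 3 matching days were rainy
```

Three records satisfy the condition and two of them are rainy, which is where the roughly two-thirds chance comes from.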
Examples: What is (not) Data Mining?
What is not Data Mining?
– Look up phone number in phone directory
– Query a Web search engine for information about “Amazon”
What is Data Mining?
– Certain names are more prevalent in certain US locations (O’Brien, O’Rourke, O’Reilly… in Boston area)
– Group together similar documents returned by search engine according to their context (e.g. Amazon
Data Mining: A KDD Process
Data mining: the core of the knowledge discovery process.
[Figure: KDD process. Databases → Data Cleaning and Data Integration → Data Warehouse → Task-relevant Data → Data Mining]
The Data Mining Process
• Step 0: Determine Business Objective/Learning the application domain
- e.g. Forecasting the probability of rain
- Must have relevant prior knowledge and goals of application.
• Step 1: Creating a Target Data set/Prepare Data
- Data Selection
- Data Cleaning; Noisy and Missing values handling (may take 60% of the effort!).
- Data Transformation (Normalization/Discretization).
- Attribute/Feature Selection.
• Step 2: Choosing the Function of Data Mining
- Classification, Clustering, Regression, Association Rules
• Step 3: Choosing The Mining Algorithm
- Selection of correct algorithm depending upon the quality of data.
- Selection of correct algorithm depending upon the density of data.
• Step 4: Data Mining
- Search for patterns of interest: a typical data mining algorithm can mine millions of patterns.
• Step 5: Visualization/Knowledge Representation
- Visualization/Representation of interesting patterns, etc. and then
Data Mining and Business Intelligence
[Figure: pyramid with increasing potential to support business decisions toward the top. Layers, top to bottom: Making Decisions; Data Presentation (Visualization Techniques); Data Mining (Information Discovery); Data Exploration (OLAP, MDA; Statistical Analysis, Querying and Reporting); Data Warehouses / Data Marts; Data Sources. Roles shown alongside: End User, Business Analyst, Data Analyst, DBA.]
Data Mining: On What Kind of
Data?
1. Relational databases
2. Data warehouses
3. Transactional databases
4. Advanced DB and information repositories
- Time-series data and temporal data
- Text databases
- Multimedia databases
- Data Stream (Sensor Networks Data)
Data Mining: Confluence of Multiple
Disciplines
[Figure: Data Mining at the center, drawing on Database Technology, Statistics, Information Science, Machine Learning, and Other Disciplines]
Data Mining vs SQL, EIS, and OLAP
• SQL. SQL is a query language, difficult for business people to use.
• EIS = Executive Information Systems. EIS systems provide graphical interfaces that give executives a pre-programmed (and therefore limited) selection of reports, automatically generating the necessary SQL for each.
• OLAP allows views along multiple dimensions, and drill-down, therefore giving access to a vast array of analyses. However, it requires manual navigation through scores of reports, requiring the user to notice interesting patterns themselves.
An Example of OLAP Analysis and its
Limits
[Chart: Walking Sticks Sales by City. Karachi 50, Lahore 10, Islamabad 400]
[Chart: Walking Sticks Sales in Islamabad by Age. Less than 20: 10; 20 to 60: 30; Older than 60: 360]
[Chart: Age Distribution by City (Karachi, Lahore, Islamabad) for the groups Younger than 20, 20 to 60, Older than 60]
• What is driving sales of walking sticks?
• Step 1: View some OLAP graphs, e.g. walking stick sales by city.
• Step 2: Noticing that Islamabad has high sales, you decide to investigate further.
• (Before OLAP, you would have had to write a very complex SQL query instead of simply clicking to drill down.)
• It seems that old people are responsible for most walking stick sales. You confirm this by viewing a chart of age distributions by city.
• But imagine if you had to do this manual investigation for all of the 10,000 products in your range!
Here, OLAP gives way to Data Mining.
Data Mining vs Expert Systems
• Expert Systems = Rule-Driven Deduction
Top-down: From known rules (expertise) and data to decisions. (To be dealt with in Part 2 of this course)
• Data Mining = Data-Driven Induction
Bottom-up: From data about past decisions to discovered rules (general rules induced from the data).
[Figure: an Expert System takes Rules and Data as input; Data Mining takes Data (including past decisions) as input and produces Rules.]
Difference b/w Machine Learning
and Data Mining
Machine Learning techniques were designed to deal with limited amounts of data from artificial intelligence applications, whereas Data Mining techniques deal with large amounts of data stored in databases.
Data Mining (Knowledge Discovery in Databases)
Extraction of interesting
(non-trivial, implicit, previously unknown
and potentially useful)
information or patterns from data in large
databases.
What is not Data Mining?
(Deductive) query processing.
Data Mining Functionalities (1)
Data Preprocessing
Handling Missing and Noisy Data (Data Cleaning).
Techniques we will cover
• Missing values Imputation using Mean, Median and Mode.
• Missing values Imputation using K-Nearest Neighbor.
• Missing values Imputation using Association Rules Mining.
• Missing values Imputation using Fault-Tolerant Patterns.
• Data Binning for Noisy Data.
TID  Refund  Country    Taxable Income  Cheat
1    Yes     USA        125K            No
2            UK         100K            No
3    No      Australia  70K             No
4                       120K            No
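The simplest technique in the list, imputation by mean, median or mode, can be sketched as follows. This is only an illustration: the two columns below are hypothetical values chosen for the example (they are not the slide's table), `None` is assumed to mark a missing entry, and `impute` is a made-up helper name.

```python
from statistics import mean, median, mode


def impute(values, strategy):
    """Replace None entries with the mean, median, or mode of the known entries."""
    known = [v for v in values if v is not None]
    fill = {"mean": mean, "median": median, "mode": mode}[strategy](known)
    return [fill if v is None else v for v in values]


# Hypothetical columns for illustration; None marks a missing value.
income_k = [125, 100, None, 120, 70]
refund = ["Yes", None, "No", "No", None]

income_mean = impute(income_k, "mean")      # numeric attribute, mean of known values
income_median = impute(income_k, "median")  # numeric attribute, median of known values
refund_mode = impute(refund, "mode")        # categorical attribute, most frequent value

print(income_mean)
print(income_median)
print(refund_mode)
```

Mean and median only make sense for numeric attributes; for categorical attributes such as Refund or Country, the mode is the usual choice.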
Data Mining Functionalities (1)
Data Preprocessing
Data Transformation (Discretization and Normalization).
With the help of data transformation, rules become more general and compact.
General and compact rules increase the accuracy of classification.
Child = (0 to 20), Young = (21 to 47), Old = (48 to 120)
Age                15     18     40     33     55   48   12     23
Age (discretized)  Child  Child  Young  Young  Old  Old  Child  Young
1. If attribute 1 = value1 & attribute 2 = value2 and Age = 08 then Buy_Computer = No.
2. If attribute 1 = value1 & attribute 2 = value2 and Age = 09 then Buy_Computer = No.
3. If attribute 1 = value1 & attribute 2 = value2 and Age = 10 then Buy_Computer = No.
1. If attribute 1 = value1 & attribute 2 = value2 and Age = Child then Buy_Computer = No.
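The discretization of the Age attribute can be reproduced with a few lines of code. The sketch below uses the interval boundaries given on the slide and assumes integer ages; the min-max rescaling is added only as one common example of normalization, since the slide does not say which normalization method is meant. Both function names are illustrative.

```python
def discretize_age(age):
    """Map an integer age to the slide's intervals: Child, Young, or Old."""
    if 0 <= age <= 20:
        return "Child"
    if 21 <= age <= 47:
        return "Young"
    if 48 <= age <= 120:
        return "Old"
    raise ValueError(f"age out of range: {age}")


def min_max_normalize(values):
    """Rescale values linearly to [0, 1] (assumed normalization method)."""
    low, high = min(values), max(values)
    return [(v - low) / (high - low) for v in values]


ages = [15, 18, 40, 33, 55, 48, 12, 23]
labels = [discretize_age(a) for a in ages]
scaled = min_max_normalize(ages)

print(labels)
print([round(s, 2) for s in scaled])
```

After discretization, ages 8, 9 and 10 all map to Child, which is why the three separate rules collapse into a single, more general rule.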
Data Mining Functionalities (1)
Data Preprocessing
Attribute Selection/Feature Selection
• Selection of those attributes which are more relevant to data mining task.
• Advantage1: Decrease the processing time of mining task. • Advantage2: Generalize the rules.
Example
• If our mining goal is to find which countries have more cheating, and at which taxable income levels,
• then obviously the Date attribute will not be an important factor in our mining task.
Date        Refund  Country    Taxable Income  Cheat
11/02/2002  Yes     USA        125K            No
13/02/2002  Yes     UK         100K            No
16/02/2002  No      Australia  120K            Yes
21/03/2002  No      Australia  120K            Yes
26/02/2002
Data Mining Functionalities (1)
Data Preprocessing
We will cover two Attribute/Feature Selection Techniques
• Principal Component Analysis
• Wrapper Based
Data Mining Functionalities (2)
Association Rule Mining
In the Association Rule Mining framework, we have to find all the rules in a transactional/relational dataset whose support (frequency) is at least some minimum support (min_sup) threshold (provided by the user).
For example with min_sup = 50%.
Itemset               Support
{Butter}              4
{Bread}               3
{Egg}                 2
{Bread, Butter}       3
{Bread, Butter, Egg}  2
Transaction ID  Items Bought
2000            Bread, Butter, Egg
1000            Bread, Butter, Egg
4000            Bread, Butter, Tea
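Support counting can be sketched with a brute-force enumeration of all candidate itemsets. Note one assumption: the slide lists only three transactions, while a support of 4 for {Butter} implies at least a fourth. The transaction with ID 5000 below is hypothetical and was chosen only so that the counts agree with the support table. The function name `frequent_itemsets` is likewise illustrative, and this is plain enumeration, not Apriori.

```python
from itertools import combinations

transactions = {
    2000: {"Bread", "Butter", "Egg"},
    1000: {"Bread", "Butter", "Egg"},
    4000: {"Bread", "Butter", "Tea"},
    5000: {"Butter"},  # ASSUMED: not on the slide, added to match the support table
}


def frequent_itemsets(db, min_sup):
    """Return {itemset: count} for every itemset with support >= min_sup.

    Brute force: every combination of items is tested against every
    transaction, so this only scales to very small item universes.
    """
    n = len(db)
    items = sorted(set().union(*db.values()))
    result = {}
    for size in range(1, len(items) + 1):
        for candidate in combinations(items, size):
            count = sum(1 for t in db.values() if set(candidate) <= t)
            if count / n >= min_sup:
                result[candidate] = count
    return result


frequent = frequent_itemsets(transactions, min_sup=0.5)
for itemset, count in frequent.items():
    print(set(itemset), count)
```

Besides the itemsets shown in the slide's table, this enumeration also reports {Bread, Egg} and {Butter, Egg}, which meet the same threshold. Algorithms such as Apriori avoid testing every combination by pruning candidates that contain an infrequent subset.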
Data Mining Functionalities (2)
Association Rule Mining
Topics we will cover
Frequent Itemset Mining Algorithms (Apriori, FP-Growth,
Bit-vector ).
Fault-Tolerant/Approximate Frequent Itemset Mining.
N-Most Interesting Frequent Itemset Mining.
Closed and Maximal Frequent Itemset Mining.
Incremental Frequent Itemset Mining
Sequential Patterns.
Projects
• Mining Fault-Tolerant Using Pattern-Growth.
Data Mining Functionalities (2)
Classification and Prediction
Finding models (functions) that describe and distinguish classes or concepts for future prediction.
Example: Classify cities as rainy or not rainy based on the Temperature, Humidity and Windy attributes.
The previous business decisions (class labels) must be known (Supervised Learning).
City Temperature Humidity Windy Rain
Lahore hot low false No
Islamabad hot high true Yes
Islamabad hot high false Yes
Multan mild low false No
Karachi cool normal false No
Rawalpindi hot high true Yes
City Temperature Humidity Windy Rain
Muree hot high false ?
Sibi mild low true ?
Rule
• If Temperature = Hot & Humidity = High then
Rain = Yes.
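The rule can be checked against the labeled cities and then applied to the two unlabeled ones. This is a minimal sketch: the slide only states when Rain = Yes, so predicting "No" in every other case is an assumption, and `predict_rain` is an illustrative name.

```python
# Labeled training data from the slide: (city, temperature, humidity, windy, rain)
training = [
    ("Lahore", "hot", "low", False, "No"),
    ("Islamabad", "hot", "high", True, "Yes"),
    ("Islamabad", "hot", "high", False, "Yes"),
    ("Multan", "mild", "low", False, "No"),
    ("Karachi", "cool", "normal", False, "No"),
    ("Rawalpindi", "hot", "high", True, "Yes"),
]

# Unlabeled records from the slide: (city, temperature, humidity, windy)
unlabeled = [
    ("Muree", "hot", "high", False),
    ("Sibi", "mild", "low", True),
]


def predict_rain(temperature, humidity):
    """Slide's rule; the default answer "No" is an assumption."""
    if temperature == "hot" and humidity == "high":
        return "Yes"
    return "No"


# How often the rule agrees with the known labels
correct = sum(predict_rain(t, h) == rain for _, t, h, _, rain in training)
accuracy = correct / len(training)

# Predictions for the records whose class is unknown
predictions = {city: predict_rain(t, h) for city, t, h, _ in unlabeled}

print(accuracy)
print(predictions)
```

Perfect agreement on six training records says little about new data; model evaluation on separate test data is covered later in the course.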
Cluster Analysis
Group data to form new classes based on unlabeled data.
Business decisions are unknown (also called Unsupervised Learning).
Example: Group cities based on the Temperature, Humidity and Windy attributes, without knowing which ones are rainy.
City Temperature Humidity Windy Rain
Lahore hot low false ?
Islamabad hot high true ?
Islamabad hot high false ?
Multan mild low false ?
Karachi cool normal false ?
Rawalpindi hot high true ?
3 clusters
Outlier Analysis
Outlier: A data object that does not comply with the general behavior of the data.
It can be considered as noise or an exception, but it is quite useful in fraud detection and rare events analysis.
City Temperature Humidity Windy Rain
Lahore hot low false ?
Islamabad hot high true ?
Islamabad hot high false ?
Multan mild low false ?
Karachi cool normal false ?
Rawalpindi hot high true ?
2 outliers
Are All the “Discovered” Patterns
Interesting?
A data mining system/query may generate thousands of patterns; not all of them are interesting.
Suggested approach: Query-based, Constraint mining
Interestingness Measures:
A pattern is interesting if it is easily understood by humans, valid on new or test data with some degree of certainty, potentially useful, novel, or validates some hypothesis that a user seeks to confirm.
Can We Find All and Only Interesting
Patterns?
Find all the interesting patterns: Completeness
Can a data mining system find all the interesting patterns?
Remember that many problems in Data Mining are NP-hard, so finding a globally best solution is usually computationally infeasible.
Search for only interesting patterns: Optimization
Can a data mining system find only the interesting patterns?
Approaches
• First generate all the patterns and then filter out the uninteresting ones.
Reading Assignment
Book Chapter
Chapter 1 of “Jiawei Han and Micheline Kamber” book
Data Mining --- Where?
Some Nice Resources
ACM Special Interest Group on Knowledge Discovery and Data
Mining (SIGKDD)
http://www.acm.org/sigs/sigkdd/.
Knowledge Discovery Nuggets
www.kdnuggets.com.
IEEE Transactions on Knowledge and Data Engineering – http://www.computer.org/tkde/.
IEEE Transactions on Pattern Analysis and Machine Intelligence –
http://www.computer.org/tpami/.
Data Mining and Knowledge Discovery - Publisher: Springer
Science+Business Media B.V., Formerly Kluwer Academic
Text and Reference Material
The course will be mainly based on research literature; the following texts may however be consulted:
1. Jiawei Han and Micheline Kamber. “Data Mining: Concepts and Techniques”, 3rd Ed.
2. Provost, F. and Fawcett, T. (2013) Data Science for Business: What you need to know about data mining and data-analytic thinking, 1st Edition, O'Reilly Media.
3. Witten, I. H., Frank, E. and Hall, M. A. (2011) Data Mining: Practical Machine Learning Tools and Techniques, 3rd Edition, Morgan Kaufmann.
4. David Hand, Heikki Mannila and Padhraic Smyth. “Principles of Data Mining”. Pub. Prentice Hall of India, 2004.