Lecture 1-Introduction to Data Mining - M

(1)

CSD479

Data Mining

(2)

• Instructor: Dr. Zeeshan Gillani
  [email protected]
  Room # 37, Faculty Block
• Lectures:
• Office Hrs: Mon 10:00 – 11:30; Thur 11:30 – 13:00 hrs (or by appointment)
• Prerequisite: Knowledge of Statistics and Database/Data Warehousing is helpful

(3)

This course aims to provide students with the key concepts, applications, techniques, and methodologies of Data Mining, with a primary focus on classification and clustering algorithms.

(4)

Mining Methodology, Overview of Data Warehousing, Overview of OLAP, Applications of Data Mining, Data cleaning and preparation, Concept Description, Association Rule Mining, Classification, Classification by Back Propagation, Prediction, Decision Trees, Bayesian Classification, Classification Accuracy, Regression for Classification and Prediction, Distributions, Cluster Analysis.

(5)

Text Book:
1. Han, J. and Kamber, M. (2011) Data Mining: Concepts and Techniques, 3rd Edition, Morgan Kaufmann.

Reference Books:
1. Provost, F. and Fawcett, T. (2013) Data Science for Business: What you need to know about data mining and data-analytic thinking, 1st Edition, O'Reilly Media.
2. Witten, I. H., Frank, E. and Hall, M. A. (2011) Data Mining: Practical Machine Learning Tools and Techniques, 3rd Edition, Morgan Kaufmann.

Instruments:

There will be 4 assignments, 4 quizzes,

Weights:
Assignments   10%
Quizzes       15%
S-I           10%
S-II          15%
Final Exam    50%

(6)

Lect.# Topics/Contents
1  Introduction to Data Mining. Data Mining on different kinds of Databases. Data mining functionalities.
2  Data objects and Attribute Types. Some basic Statistical Descriptions of Data: Mean, Median, Mode, S.D., Variance etc. Data Similarity and Dissimilarity.
3  Non-Euclidean Distances for Nominal, Ordinal and Mixed Types attributes.
4  Data Preprocessing techniques; Data cleaning; Data integration.
5  Data Integration problems, removing data redundancy using Chi-square and correlation analysis.
6  Data Reduction; Dimensionality Reduction, Numerosity Reduction, Data Compression, PCA.
7  Examples of PCA; Data Normalization.
8  Mining Frequent Patterns, Market basket analysis, frequent itemsets, frequent pattern mining. Mining association rules from frequent itemsets.
9  Finding Frequent itemsets using candidate generation, generating association rules from frequent itemsets. Brute force algorithm and the Apriori Algorithm.
10 Finding interestingness, strong rules are not necessarily interesting, from association analysis to correlation analysis.
11 Sessional - I

(7)

Lect.# Topics/Contents
12 Introduction to Classification, Classification by Decision Tree, Decision tree induction, attribute selection measures
13 Entropy and Gini measures for tree induction, tree pruning.
14 Tree pruning, pre and post pruning, scalability
15 Model Evaluation methods. Introduction to Weka
16 Conditional Probability and Bayes Theorem
17 Introduction to Naive Bayes Classifier with examples
18 Rule-based Classification: Using IF-THEN Rules for Classification, Rule Extraction from a Decision Tree.
19 Rule induction using a sequential covering algorithm. Methods of Rule evaluation.
20 Introduction to Artificial Neural Network. A Multilayer feed-forward neural network, backpropagation.
21 Example of ANN. Revision
22 Sessional-II

(8)

23 Discussion on S-II. Introduction to clustering, K-Mean clustering
24 Examples of k-means, k-modes, selecting best k.
25 Clustering: K-Medoids with examples
26 Clustering: Introduction to CLARA and CLARANS
27 Introduction to Hierarchical Clustering. Agglomerative Clustering using Single Link.
28 Agglomerative Clustering using Complete Link, Average Link and MST. Divisive Algorithms.
29 Introduction to BIRCH. Clustering Features. CF Tree
30 Major tasks of clustering evaluation, Extrinsic and intrinsic evaluation methods. Revision
31 Terminal Exam

(9)

Introduction to Data Mining

(Chapter #1 of text book)

(10)

Motivation: “Necessity is the Mother of Invention”



Data Explosion Problem

1. Automated data collection tools (e.g. web, sensor networks) and mature database technology lead to tremendous amounts of data stored in databases, data warehouses and other information repositories.
2. Enterprises currently face a data explosion problem.
3. Every minute, YouTube users upload 48 hours of video, Facebook users share 684,478 pieces of content, Instagram users share 3,600 new photos, and Tumblr sees 27,778 new posts published.



A full 90% of the world's data was generated over the last two years (Date: May 22, 2013; Source: SINTEF)

(11)



Electronic Information an Important Asset for Business

Decisions

1. With the growth of electronic information, enterprises began to realize that the accumulated information can be an important asset in their business decisions.

2. There is a potential business intelligence hidden in the large volume of data.

3. This intelligence can be the secret weapon on which the success of a business may depend.


(12)

1. It is not a simple matter to discover business intelligence from a mountain of accumulated data.
2. What is required are techniques that allow the enterprise to extract the most valuable information.
3. The field of Data Mining provides such techniques.
4. These techniques can find novel (previously unknown) patterns that may assist an enterprise in understanding the business better and in forecasting.

(13)

What Is Data Mining?



Data mining (knowledge discovery in databases):



Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) information or patterns from data in large databases



Alternative names :



Data mining: a misnomer?



Knowledge discovery(mining) in databases (KDD),

(14)

Data Mining (Example)



Random Guessing vs. Potential Knowledge

 Suppose we have to Forecast the Probability of Rain in Islamabad city

for any particular day.

 Without any Prior Knowledge the probability of rain would be 50%

(pure random guess).

 If we had a lot of weather data, then we can extract potential

rules using Data Mining which can then forecast the chance of rain better than random guessing.



Example: The Rule
if [Temperature = ‘hot’ and Humidity = ‘high’] then there is a 66.6% chance of rain.

Temperature  Humidity  Windy  Rain
hot          high      false  No
hot          high      true   Yes
hot          high      false  Yes
mild         high      false  No
cool         normal    false  No
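The 66.6% figure can be checked directly from the five rows of the table. The following is a minimal sketch in Python; the function name `rule_confidence` and the tuple layout are illustrative choices, not part of the lecture material.

```python
# Rows copied from the weather table: (temperature, humidity, windy, rain)
WEATHER = [
    ("hot", "high", False, "No"),
    ("hot", "high", True, "Yes"),
    ("hot", "high", False, "Yes"),
    ("mild", "high", False, "No"),
    ("cool", "normal", False, "No"),
]


def rule_confidence(rows, temperature, humidity, outcome="Yes"):
    """Fraction of rows matching the rule's condition that also show the outcome."""
    covered = [r for r in rows if r[0] == temperature and r[1] == humidity]
    if not covered:
        return None  # the rule covers no rows, so no estimate is possible
    hits = sum(1 for r in covered if r[3] == outcome)
    return hits / len(covered)


# Share of rainy rows overall, before conditioning on any attribute.
overall_rain_rate = sum(1 for r in WEATHER if r[3] == "Yes") / len(WEATHER)
confidence = rule_confidence(WEATHER, "hot", "high")

print(f"Overall rain rate in the table: {overall_rain_rate:.1%}")
print(f"Rain rate when hot and humid:   {confidence:.1%}")
```

Three rows satisfy the condition and two of them are rainy, which is where the two-in-three estimate comes from.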

(15)

Examples: What is (not) Data

Mining?



What is not Data

Mining?

–

Look up phone

number in phone

returned by search engine according

to their context (e.g. Amazon

(16)

Data Mining: A KDD Process



Data mining: the core of knowledge discovery process.
[Figure: KDD process. Labels: Databases, Data Cleaning, Data Integration, Data Warehouse, Task-relevant Data, Data Mining]

(17)

The Data Mining Process


• Step 0: Determine Business Objective/Learning the application domain
  - e.g. Forecasting the probability of rain
  - Must have relevant prior knowledge and goals of application.
• Step 1: Creating a Target Data set/Prepare Data
  - Data Selection
  - Data Cleaning; Noisy and Missing values handling (may take 60% of the effort!).
  - Data Transformation (Normalization/Discretization).
  - Attribute/Feature Selection.
• Step 2: Choosing the Function of Data Mining
  - Classification, Clustering, Regression, Association Rules
• Step 3: Choosing the Mining Algorithm
  - Selection of correct algorithm depending upon the quality of data.
  - Selection of correct algorithm depending upon the density of data.
• Step 4: Data Mining
  - Search for patterns of interest: a typical data mining algorithm can mine millions of patterns.
• Step 5: Visualization/Knowledge Representation
  - Visualization/Representation of interesting patterns, etc. and then

(18)

Data Mining and Business Intelligence

[Figure: business intelligence pyramid; the potential to support business decisions increases toward the top]
Layers (top to bottom):
• Making Decisions
• Data Presentation: Visualization Techniques
• Data Mining: Information Discovery
• Data Exploration: OLAP, MDA; Statistical Analysis, Querying and Reporting
• Data Warehouses / Data Marts
• Data Sources
Roles (top to bottom): End User, Business Analyst, Data Analyst, DBA

(19)

Data Mining: On What Kind of

Data?

1. Relational databases
2. Data warehouses
3. Transactional databases
4. Advanced DB and information repositories
   • Time-series data and temporal data
   • Text databases
   • Multimedia databases
   • Data Stream (Sensor Networks Data)

(20)

Data Mining: Confluence of Multiple

Disciplines

[Figure: Data Mining in the center, surrounded by: Database Technology; Statistics; Information Science; Machine; Other Disciplines]

(21)

Data Mining vs SQL, EIS, and OLAP


• SQL. SQL is a query language, difficult for business people to use.
• EIS = Executive Information Systems. EIS systems provide graphical interfaces that give executives a pre-programmed (and therefore limited) selection of reports, automatically generating the necessary SQL for each.
• OLAP allows views along multiple dimensions, and drill-down, therefore giving access to a vast array of analyses. However, it requires manual navigation through scores of reports, and users must notice interesting patterns themselves.

(22)

An Example of OLAP Analysis and its

Limits

[Chart: Walking Sticks Sales by City]
Karachi 50, Lahore 10, Islamabad 400

[Chart: Walking Sticks Sales in Islamabad by Age]
Less than 20: 10, 20 to 60: 30, Older than 60: 360

[Chart: Age Distribution by City (Karachi, Lahore, Islamabad); series: Younger than 20, 20 to 60, Older than 60]

• What is driving sales of walking sticks?
• Step 1: View some OLAP graphs, e.g. walking stick sales by city.
• Step 2: Noticing that Islamabad has high sales, you decide to investigate further.
• (Before OLAP, you would have had to write a very complex SQL query instead of simply clicking to drill down.)
• It seems that old people are responsible for most walking stick sales. You confirm this by viewing a chart of age distributions by city.
• But imagine if you had to do this manual investigation for all of the 10,000 products in your range!
Here, OLAP gives way to Data Mining.
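To make the hand-over from OLAP to Data Mining concrete, the manual drill-down can be written as a small routine that a program could repeat for every product. This is only a sketch in Python: the figures are the ones from the walking-stick charts, while the helper names `dominant_segment` and `drill_down` are hypothetical.

```python
# Figures taken from the walking-stick charts.
SALES_BY_CITY = {"Karachi": 50, "Lahore": 10, "Islamabad": 400}
SALES_BY_AGE = {
    "Islamabad": {"Less than 20": 10, "20 to 60": 30, "Older than 60": 360},
}


def dominant_segment(totals):
    """Return the label with the largest total and its share of the overall sum."""
    label = max(totals, key=totals.get)
    return label, totals[label] / sum(totals.values())


def drill_down(by_city, by_age):
    """Pick the top city, then the top age band inside that city."""
    city, city_share = dominant_segment(by_city)
    age, age_share = dominant_segment(by_age[city])
    return city, city_share, age, age_share


city, city_share, age, age_share = drill_down(SALES_BY_CITY, SALES_BY_AGE)
print(f"{city} accounts for {city_share:.0%} of sales")
print(f"Within {city}, the '{age}' band accounts for {age_share:.0%}")
```

An analyst does these two steps by clicking through charts; looping the same routine over thousands of products is the kind of automated search that Data Mining takes over.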

(23)

Data Mining vs Expert Systems


• Expert Systems = Rule-Driven Deduction
  Top-down: From known rules (expertise) and data to decisions. (To be dealt with in Part 2 of this course)
• Data Mining = Data-Driven Induction
  Bottom-up: From data about past decisions to discovered rules (general rules induced from the data).
[Figure: two boxes. Expert System: labelled Rules and Data. Data Mining: labelled Rules and Data (including past decisions).]

(24)

Difference b/w Machine Learning

and Data Mining



Machine Learning techniques are designed to deal with a limited amount of artificial intelligence data, whereas Data Mining techniques deal with large amounts of database data.



Data Mining (Knowledge Discovery in Databases)



Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) information or patterns from data in large databases.



What is not Data Mining?



(Deductive) query processing.

(25)

Data Mining Functionalities (1)



Data Preprocessing



Handling Missing and Noisy Data (Data Cleaning).



Techniques we will cover:
• Missing values Imputation using Mean, Median and Mode.
• Missing values Imputation using K-Nearest Neighbor.
• Missing values Imputation using Association Rules Mining.
• Missing values Imputation using Fault-Tolerant Patterns.
• Data Binning for Noisy Data.

TID  Refund     Country    Taxable Income  Cheat
1    Yes        USA        125K            No
2    (missing)  UK         100K            No
3    No         Australia  70K             No
4    (missing)  (missing)  120K            No
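The first technique in the list, imputation by mean, median or mode, can be sketched with Python's standard `statistics` module. The two columns below are hypothetical illustration data (not the table above), and `impute` is an illustrative helper name.

```python
from statistics import mean, median, mode


def impute(values, strategy):
    """Replace None entries with the mean, median or mode of the observed values."""
    fill_functions = {"mean": mean, "median": median, "mode": mode}
    observed = [v for v in values if v is not None]
    fill = fill_functions[strategy](observed)
    return [fill if v is None else v for v in values]


# ASSUMPTION: hypothetical columns, chosen only to illustrate the three strategies.
incomes_k = [125, 100, None, 120, 100]   # numeric attribute with one gap
refunds = ["Yes", None, "No", "Yes"]     # nominal attribute with one gap

print(impute(incomes_k, "mean"))
print(impute(incomes_k, "median"))
print(impute(incomes_k, "mode"))
print(impute(refunds, "mode"))
```

Mean and median only make sense for numeric attributes; for a nominal attribute such as Refund, the mode is the only one of the three that applies.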

(26)

Data Mining Functionalities (1)



Data Preprocessing
• Data Transformation (Discretization and Normalization).
• With the help of data transformation, rules become more general and compact.
• General and compact rules increase the accuracy of classification.

Age (original):     15     18     40     33     55   48   12     23
Age (discretized):  Child  Child  Young  Young  Old  Old  Child  Young
Child = (0 to 20), Young = (21 to 47), Old = (48 to 120)

1. If attribute 1 = value1 & attribute 2 = value2 and Age = 08 then Buy_Computer = No.

2. If attribute 1 = value1 & attribute 2 = value2 and Age = 09 then Buy_Computer = No.

3. If attribute 1 = value1 & attribute 2 = value2 and Age = 10 then Buy_Computer = No.

1. If attribute 1 = value1 & attribute 2 = value2 and Age = Child then
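The age discretization on this slide can be reproduced in a few lines. The sketch below uses the slide's three intervals and eight ages. The slide does not say which normalization is meant, so the min-max formula (x - min) / (max - min) is an assumption, shown as one common choice.

```python
# Interval boundaries as given on the slide: (label, lowest age, highest age).
BINS = [("Child", 0, 20), ("Young", 21, 47), ("Old", 48, 120)]
AGES = [15, 18, 40, 33, 55, 48, 12, 23]


def discretize(age, bins=BINS):
    """Map a numeric age to the label of the interval that contains it."""
    for label, low, high in bins:
        if low <= age <= high:
            return label
    raise ValueError(f"age {age} falls outside every interval")


def min_max_normalize(values):
    """Rescale values linearly to [0, 1]. ASSUMPTION: min-max is one common choice."""
    low, high = min(values), max(values)
    return [(v - low) / (high - low) for v in values]


labels = [discretize(a) for a in AGES]
scaled = min_max_normalize(AGES)
print(labels)
print([round(s, 2) for s in scaled])
```

After discretization, the three separate rules for Age = 08, 09 and 10 collapse into a single rule on Age = Child, which is what makes the rule set more general and compact.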

(27)

Data Mining Functionalities (1)



Data Preprocessing
• Attribute Selection/Feature Selection
  - Selection of those attributes which are more relevant to the data mining task.
  - Advantage 1: Decreases the processing time of the mining task.
  - Advantage 2: Generalizes the rules.
• Example
  - If our mining goal is to find which countries have more Cheat cases at which Taxable Income levels,
  - then obviously the Date attribute will not be an important factor in our mining task.

Date        Refund  Country    Taxable Income  Cheat
11/02/2002  Yes     USA        125K            No
13/02/2002  Yes     UK         100K            No
16/02/2002  No      Australia  120K            Yes
21/03/2002  No      Australia  120K            Yes
26/02/2002

(28)

Data Mining Functionalities (1)



Data Preprocessing



We will cover two Attribute/Feature Selection Techniques
• Principal Component Analysis
• Wrapper Based

(29)



Association Rule Mining



In the Association Rule Mining framework, we have to find all the rules in a transactional/relational dataset whose support (frequency) is greater than some minimum support (min_sup) threshold (provided by the user).
• For example, with min_sup = 50%:

Itemset               Support
{Butter}              4
{Bread}               3
{Egg}                 2
{Bread, Butter}       3
{Bread, Butter, Egg}  2

Transaction ID  Items Bought
2000            Bread, Butter, Egg
1000            Bread, Butter, Egg
4000            Bread, Butter, Tea
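Support counting can be sketched with a brute-force pass over every candidate itemset. Only three transactions are listed above, while {Butter} has support 4, so the fourth transaction in the code (ID 5000) is a hypothetical row added purely so that the counts match the support table. The threshold is applied as "at least min_sup", which is also an assumption, made because the table lists itemsets with support 2.

```python
from itertools import combinations

TRANSACTIONS = {
    2000: {"Bread", "Butter", "Egg"},
    1000: {"Bread", "Butter", "Egg"},
    4000: {"Bread", "Butter", "Tea"},
    5000: {"Butter", "Milk"},  # ASSUMPTION: hypothetical row, not in the source
}
MIN_SUP = 0.5  # 50% of the transactions


def support_count(itemset, transactions):
    """Number of transactions that contain every item of the itemset."""
    wanted = set(itemset)
    return sum(1 for items in transactions.values() if wanted <= items)


def frequent_itemsets(transactions, min_sup):
    """Brute force: test every non-empty combination of the items seen."""
    threshold = min_sup * len(transactions)
    items = sorted(set().union(*transactions.values()))
    result = {}
    for size in range(1, len(items) + 1):
        for candidate in combinations(items, size):
            count = support_count(candidate, transactions)
            if count >= threshold:  # ASSUMPTION: "at least", see lead-in
                result[frozenset(candidate)] = count
    return result


frequent = frequent_itemsets(TRANSACTIONS, MIN_SUP)
for itemset, count in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

The brute-force loop examines every subset of the item universe, which grows exponentially with the number of items; avoiding that blow-up is the motivation for algorithms such as Apriori.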

(30)

Data Mining Functionalities (2)



Association Rule Mining



Topics we will cover



Frequent Itemset Mining Algorithms (Apriori, FP-Growth, Bit-vector).



Fault-Tolerant/Approximate Frequent Itemset Mining.



N-Most Interesting Frequent Itemset Mining.



Closed and Maximal Frequent Itemset Mining.



Incremental Frequent Itemset Mining



Sequential Patterns.



Projects

• Mining Fault-Tolerant Using Pattern-Growth.

(31)



Classification and Prediction
• Finding models (functions) that describe and distinguish classes or concepts for future prediction.
• Example: Classify rainy/un-rainy cities based on the Temperature, Humidity and Windy attributes.
• The previous business decisions (class labels) must be known (Supervised Learning).

City        Temperature  Humidity  Windy  Rain
Lahore      hot          low       false  No
Islamabad   hot          high      true   Yes
Islamabad   hot          high      false  Yes
Multan      mild         low       false  No
Karachi     cool         normal    false  No
Rawalpindi  hot          high      true   Yes

City   Temperature  Humidity  Windy  Rain
Muree  hot          high      false  ?
Sibi   mild         low       true   ?

Rule
• If Temperature = Hot & Humidity = High then Rain = Yes.
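The rule can be checked against the labelled cities and then applied to the two unlabelled ones. In this Python sketch the rule itself comes from the slide; the fallback for rows the rule does not cover (the majority class among uncovered training rows) is an assumption added so that every city receives a prediction.

```python
from collections import Counter

# (city, temperature, humidity, windy, rain) copied from the labelled table.
TRAINING = [
    ("Lahore", "hot", "low", False, "No"),
    ("Islamabad", "hot", "high", True, "Yes"),
    ("Islamabad", "hot", "high", False, "Yes"),
    ("Multan", "mild", "low", False, "No"),
    ("Karachi", "cool", "normal", False, "No"),
    ("Rawalpindi", "hot", "high", True, "Yes"),
]
UNLABELED = [("Muree", "hot", "high", False), ("Sibi", "mild", "low", True)]


def rule_fires(temperature, humidity):
    """The slide's rule condition: Temperature = Hot and Humidity = High."""
    return temperature == "hot" and humidity == "high"


covered = [row for row in TRAINING if rule_fires(row[1], row[2])]
uncovered = [row for row in TRAINING if not rule_fires(row[1], row[2])]
rule_accuracy = sum(1 for row in covered if row[4] == "Yes") / len(covered)

# ASSUMPTION: rows the rule does not cover get the majority class of uncovered rows.
default_label = Counter(row[4] for row in uncovered).most_common(1)[0][0]


def predict(temperature, humidity):
    return "Yes" if rule_fires(temperature, humidity) else default_label


predictions = {city: predict(temp, hum) for city, temp, hum, _ in UNLABELED}
print(f"Rule covers {len(covered)} training rows, accuracy {rule_accuracy:.0%}")
print(predictions)
```

The labelled rows are what make this supervised: without the Rain column there would be nothing to measure the rule against.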

(32)



Cluster Analysis
• Group data to form new classes from data without class labels.
• Business decisions are unknown (also called Unsupervised Learning).
• Example: Group cities into rainy/un-rainy based on the Temperature, Humidity and Windy attributes.

City        Temperature  Humidity  Windy  Rain
Lahore      hot          low       false  ?
Islamabad   hot          high      true   ?
Islamabad   hot          high      false  ?
Multan      mild         low       false  ?
Karachi     cool         normal    false  ?
Rawalpindi  hot          high      true   ?

3 clusters
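As a rough illustration of grouping without labels, the rows can be clustered by how many attribute values they share. The sketch below uses a simple leader-style pass with Hamming distance, which is an assumption chosen for brevity and not one of the algorithms the course schedules (k-means, k-modes, k-medoids). The `max_distance` threshold of 1 is likewise an arbitrary choice, and the groups found need not be the ones drawn on the slide.

```python
# (city, (temperature, humidity, windy)) copied from the unlabelled table.
ROWS = [
    ("Lahore", ("hot", "low", False)),
    ("Islamabad", ("hot", "high", True)),
    ("Islamabad", ("hot", "high", False)),
    ("Multan", ("mild", "low", False)),
    ("Karachi", ("cool", "normal", False)),
    ("Rawalpindi", ("hot", "high", True)),
]


def hamming(a, b):
    """Number of attributes on which two rows disagree."""
    return sum(x != y for x, y in zip(a, b))


def leader_clusters(rows, max_distance=1):
    """Assign each row to the first cluster whose leader is close enough,
    otherwise start a new cluster with that row as its leader."""
    clusters = []  # list of (leader attributes, [row indices])
    for index, (_, attrs) in enumerate(rows):
        for leader, members in clusters:
            if hamming(attrs, leader) <= max_distance:
                members.append(index)
                break
        else:
            clusters.append((attrs, [index]))
    return [members for _, members in clusters]


clusters = leader_clusters(ROWS)
for members in clusters:
    print([ROWS[i][0] for i in members])
```

The result depends on the row order and on the threshold, which is one reason more careful methods such as k-modes iterate until the assignment stabilizes.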

(33)



Outlier Analysis
• Outlier: A data object that does not comply with the general behavior of the data.
• It can be considered noise or an exception, but it is quite useful in fraud detection and rare-event analysis.

City        Temperature  Humidity  Windy  Rain
Lahore      hot          low       false  ?
Islamabad   hot          high      true   ?
Islamabad   hot          high      false  ?
Multan      mild         low       false  ?
Karachi     cool         normal    false  ?
Rawalpindi  hot          high      true   ?

2 outliers
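One simple way to make "does not comply with the general behavior" measurable is to score each row by its total dissimilarity to all the other rows. The Python sketch below does this with Hamming distance. Both the scoring scheme and the choice to report the top two rows are assumptions made for illustration, and the rows it flags are not necessarily the two marked on the slide.

```python
# (city, (temperature, humidity, windy)) copied from the table.
ROWS = [
    ("Lahore", ("hot", "low", False)),
    ("Islamabad", ("hot", "high", True)),
    ("Islamabad", ("hot", "high", False)),
    ("Multan", ("mild", "low", False)),
    ("Karachi", ("cool", "normal", False)),
    ("Rawalpindi", ("hot", "high", True)),
]


def hamming(a, b):
    """Number of attributes on which two rows disagree."""
    return sum(x != y for x, y in zip(a, b))


def outlier_scores(rows):
    """Sum of distances from each row to every other row (higher = more unusual)."""
    return [
        sum(hamming(attrs, other) for j, (_, other) in enumerate(rows) if j != i)
        for i, (_, attrs) in enumerate(rows)
    ]


scores = outlier_scores(ROWS)
ranked = sorted(range(len(ROWS)), key=lambda i: scores[i], reverse=True)
top_two = [ROWS[i][0] for i in ranked[:2]]
print(list(zip((name for name, _ in ROWS), scores)))
print("Most unusual rows:", top_two)
```

On this table the hot and humid cities resemble each other closely, so the rows with rare temperature and humidity values collect the highest scores.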

(34)

Are All the “Discovered” Patterns

Interesting?



A data mining system/query may generate thousands of patterns, not all of them are interesting.
• Suggested approach: Query-based, Constraint mining
• Interestingness Measures: A pattern is interesting if it is easily understood by humans, valid on new or test data with some degree of certainty, potentially useful, novel, or validates some hypothesis that a user seeks to

(35)

Can We Find All and Only Interesting

Patterns?



Find all the interesting patterns: Completeness



Can a data mining system find all the interesting patterns?



Remember most of the problems in Data Mining are NP-Complete.



There is no global best solution for any single problem.



Search for only interesting patterns: Optimization



Can a data mining system find only the interesting patterns?



Approaches

• First generate all the patterns and then filter out the uninteresting ones.

(36)

Reading Assignment



Book Chapter



Chapter 1 of “Jiawei Han and Micheline Kamber” book

(37)

Data Mining --- Where?



Some Nice Resources



ACM Special Interest Group on Knowledge Discovery and Data Mining (SIGKDD) http://www.acm.org/sigs/sigkdd/.
• Knowledge Discovery Nuggets www.kdnuggets.com.
• IEEE Transactions on Knowledge and Data Engineering – http://www.computer.org/tkde/.
• IEEE Transactions on Pattern Analysis and Machine Intelligence – http://www.computer.org/tpami/.
• Data Mining and Knowledge Discovery - Publisher: Springer Science+Business Media B.V., Formerly Kluwer Academic

(38)

Text and Reference Material



The course will be mainly based on research literature; the following texts may however be consulted:
1. Jiawei Han and Micheline Kamber. “Data Mining: Concepts and Techniques”, 3rd Ed.
2. Provost, F. and Fawcett, T. (2013) Data Science for Business: What you need to know about data mining and data-analytic thinking, 1st Edition, O'Reilly Media.
3. Witten, I. H., Frank, E. and Hall, M. A. (2011) Data Mining: Practical Machine Learning Tools and Techniques, 3rd Edition, Morgan Kaufmann.
4. David Hand, Heikki Mannila and Padhraic Smyth. “Principles of Data Mining”. Pub. Prentice Hall of India, 2004.