Data Mining (DM)

(1)

Data Mining (DM)

Ms Ansif Arooj, University of Education, S & T, Township Campus Lahore

Lecture 2: Data Mining and its Applications

(2)

2

Why Data Mining?

⚫ The Explosive Growth of Data: from terabytes to petabytes

⚫ Data collection and data availability

⚫ Automated data collection tools, database systems,

Web, computerized society

⚫ Major sources of abundant data

⚫ Business: Web, e-commerce, transactions, stocks, …

⚫ Science: Remote sensing, bioinformatics, scientific

simulation, …

⚫ Society and everyone: news, digital cameras, YouTube

⚫ We are drowning in data, but starving for knowledge!

⚫ “Necessity is the mother of invention”—Data

(3)

3

What Is Data Mining?

⚫ Data mining (knowledge discovery from data)

⚫ Extraction of interesting (non-trivial, implicit, previously

unknown and potentially useful) patterns or knowledge from huge amount of data

⚫ Data mining: a misnomer?

⚫ Alternative names

⚫ Knowledge discovery (mining) in databases (KDD),

knowledge extraction, data/pattern analysis, data

archeology, data dredging, information harvesting, business intelligence, etc.

(4)

4

Knowledge Discovery (KDD) Process

⚫ This is a view from typical database systems and data warehousing

communities

⚫ Data mining plays an essential role in the knowledge discovery process

Data Cleaning Data Integration Databases Data Warehouse Task-relevant Data Selection Data Mining Pattern Evaluation

(5)

(6)

(7)

Data Warehousing

⚫

“

Subject-oriented, integrated, time-variant, nonvolatile” William Inmon

⚫ Operational Data: Data used in day to day needs of company.

⚫ Informational Data: Supports other functions such as planning and forecasting.

⚫ Data mining tools often access data warehouses rather than operational data.

(8)

Operational vs. Informational

Operational Data Data Warehouse

Application OLTP OLAP

Use _Precise _Queries _{Ad Hoc} Temporal Snapshot Historical Modification Dynamic Static Orientation Application Business Data Operational Values Integrated

Size Gigabits Terabits

Level Detailed Summarized

Access Often Less Often

Response Few Seconds Minutes

(9)

OLAP

⚫ Online Analytic Processing (OLAP): provides more complex queries than OLTP.

⚫ OnLine Transaction Processing (OLTP): traditional database/transaction processing.

⚫ Dimensional data; cube view

⚫ Visualization of operations:

⚫ Slice: examine sub-cube.

⚫ Dice: rotate cube to look at another dimension.

⚫ Roll Up/Drill Down

(10)

OLAP Operations

Single Cell Multiple Cells Slice Dice

Roll Up

(11)

11

Data Mining in Business Intelligence

Increasing potential to support

business decisions End User

Business Analyst Data Analyst DBA Decision Making Data Presentation Visualization Techniques Data Mining Information Discovery Data Exploration

Statistical Summary, Querying, and Reporting

Data Preprocessing/Integration, Data Warehouses Data Sources

(12)

12

KDD Process: A Typical View from ML and Statistics

Input Data Data

Mining

Data Pre-Processing

Post-Process ing

⚫ This is a view from typical machine learning and statistics communities

Data integration Normalization Feature selection Dimension reduction

Pattern discovery

Association & correlation Classification Clustering Outlier analysis … … … … Pattern evaluation Pattern selection Pattern interpretation Pattern visualization

(13)

13

Multi-Dimensional View of Data Mining

⚫ Data to be mined

⚫ Database data (extended-relational, object-oriented, heterogeneous, legacy), data warehouse, transactional data, stream, spatiotemporal, time-series, sequence, text and web, multi-media, graphs & social and information networks

⚫ Knowledge to be mined (or: Data mining functions)

⚫ Characterization, discrimination, association, classification, clustering, trend/deviation, outlier analysis, etc.

⚫ Descriptive (mining tasks characterize properties of the data in a target data set.) vs. predictive data mining (mining tasks perform induction on the current data in order to make predictions).

⚫ Multiple/integrated functions and mining at multiple levels

⚫ Techniques utilized

⚫ Data-intensive, data warehouse (OLAP), machine learning, statistics, pattern recognition, visualization, high-performance, etc.

(14)

FUNCTION OF DATA MINING

(15)

Application of Data Mining

⚫

Spatial Data Analysis

⚫

Information Retrieval

⚫

Pattern Recognition

⚫

Image Analysis

⚫

Signal Processing

⚫

Computer Graphics

⚫

Web Technology

⚫

Business

⚫

Bioinformatics

(16)

(17)

17

Data Mining Function:

(1) Generalization

⚫ Information integration and data warehouse construction

⚫ Data cleaning, transformation, integration, and

multidimensional data model

⚫ Data cube technology

⚫ Scalable methods for computing (i.e., materializing)

multidimensional aggregates

⚫ OLAP (online analytical processing)

⚫ Multidimensional concept description:

⚫ Characterization and discrimination

⚫ Generalize, summarize,

(18)

18

Data Mining Function: (2)

Association and

Correlation Analysis

⚫ Frequent patterns (or frequent item sets)

What items are frequently purchased together in your shopping mall?

⚫ Association, correlation vs. causality

⚫ A typical association rule

Butter, Bread Milk [20%, 100%] [support, confidence) RentType(X, "game") AND Age(X, "13-19") -> Buys(X, "pop") [s=2% ,c=55%]

Are strongly associated items also strongly correlated?

⚫ How to mine such patterns and rules efficiently in large datasets?

⚫ How to use such patterns for classification, clustering, and other applications?

(19)

Support(x->y)=P(XUY) Confidence(x->y)=XUY/X

(20)

(21)

Small Data Mining Example

⚫ X Y

⚫ Butter, Bread Milk

⚫ [20%, 100%] (support, confidence)

⚫

Support

⚫ the proportion of transactions in the data set which

contain the itemset.

⚫ 1/5

⚫

Confidence

(22)

CLASS ACTIVITY

Rule 1) Coke, burger Diapers

Rule 2) Coke, burger, Potatoes bread

Rule 3) Coke, burger, potatoes onion, bread Rule 4) burger, potatoes, onion coke

Transaction ID

Coke Burger Potatoes Onion Diapers Bread

1 1 1 1 1 2 1 1 1 3 1 1 1 1 4 1 1 1 1 1 5 1 6 1 1 1 7 1 1 1 1 8 1 1 1 1 1 9 1 1 10 1 1 1 1 1 1

(23)

(24)

24

Data Mining Function:

(3) Classification

⚫ Classification and label prediction

⚫ Construct models (functions) based on some training examples

⚫ Also named as supervised classification

⚫ Describe and distinguish classes or concepts for future prediction ⚫ E.g., classify countries based on (climate), or classify cars based

on (gas mileage)

⚫ Typical methods

⚫ Decision trees, naïve Bayesian classification, support vector

machines, neural networks, rule-based classification, pattern-based classification, logistic regression, …

⚫ Typical applications:

⚫ Credit card fraud detection, direct marketing, classifying stars, diseases, web-pages, …

(25)

(26)

(27)

27

Data Mining Function:

(4) Cluster Analysis

⚫ Unsupervised learning (i.e., Class label is unknown)

⚫ Group data to form new categories (i.e., clusters), e.g.,

cluster houses to find distribution patterns

⚫ Principle: Maximizing intra-class similarity & minimizing interclass similarity

(28)

(29)

29

Data Mining Function:

(5) Outlier Analysis

⚫ Outlier analysis

⚫ Outlier: A data object that does not comply with the

general behavior of the data

⚫ Noise or exception? ― One person’s garbage could be

another person’s treasure

⚫ Methods: by product of clustering or regression analysis,

…

(30)

Data Mining Function:

(6) Prediction

⚫

The major idea is to use a large number of past

values to consider probable future values.

⚫

Forecasting and predicting the unavailable data

values or a class label for some data.

(31)

31

Evaluation of Knowledge

⚫ Are all mined knowledge interesting?

⚫ One can mine tremendous amount of “patterns”

⚫ Some may fit only certain dimension space (time,

location, …)

⚫ Some may not be representative, may be transient, …

⚫ Evaluation of mined knowledge → directly mine only

interesting knowledge? ⚫ Descriptive vs. predictive ⚫ Coverage ⚫ Typicality vs. novelty ⚫ Accuracy ⚫ Timeliness ⚫ …

(32)

32

Data Mining: Confluence of Multiple Disciplines

Data Mining Machine Learning Statistics Applicati ons Algorithm Pattern Recogniti on High-Perform ance Computing Visualizat ion Database Technolo gy

(33)

33

Summary

⚫ Data mining: Discovering interesting patterns and knowledge from massive amount of data

⚫ A natural evolution of science and information technology, in great demand, with wide applications

⚫ A KDD process includes data cleaning, data integration, data selection,

transformation, data mining, pattern evaluation, and knowledge presentation

⚫ Mining can be performed in a variety of data

⚫ Data mining functionalities: characterization, discrimination, association, classification, clustering, trend and outlier analysis, etc.

⚫ Data mining technologies and applications

(34)

Class Activity

⚫

Discuss whether or not each of the following activities

is a data mining task.

⚫

A) Dividing the customers of a company according to

their gender.

⚫

No. This is a simple database query.

⚫

B) Dividing the customers of a company according to

their profitability.

⚫

No. This is an accounting calculation, followed by the

application of a threshold. However, predicting the

profitability of a new customer would be data mining.

(35)

Data Mining yes/no?

⚫ (c) Computing the total sales of a company.

⚫ No. Again, this is simple accounting.

⚫ (d) Sorting a student database based on student identification numbers.

⚫ No. Again, this is a simple database query.

⚫ (e) Predicting the outcomes of tossing a (fair) pair of dice.

⚫ No. Since the die is fair, this is a probability calculation. If

the die were not fair, and we needed to estimate the

probabilities of each outcome from the data, then this is more like the problems considered by data mining.

However, in this specific case, solutions to this problem were developed by mathematicians a long time ago, and thus, we wouldn’t consider it to be data mining.

(36)

Data Mining yes/no?

⚫

(f)Predicting the future stock price of a company using

historical records.

⚫

Yes. We would attempt to create a model that can

predict the continuous value of the stock price. This is

an example of the area of data mining known as

predictive modelling. We could use regression for this

modelling, although researchers in many fields have

developed a wide variety of techniques for predicting

time series.

(37)

Data Mining yes/no?

⚫

(g) Monitoring the heart rate of a patient for

abnormalities.

⚫

Yes. We would build a model of the normal behavior

of heart rate and raise an alarm when an unusual heart

behavior occurred. This would involve the area of data

mining known as anomaly detection. This could also

be considered as a classification problem if we had

(38)

Data Mining yes/no?

⚫

(h) Monitoring seismic waves for earthquake activities.

⚫

Yes. In this case, we would build a model of different

types of seismic wave behavior associated with

earthquake activities and raise an alarm when one of

these different types of seismic activity was observed.

This is an example of the area of data mining known as

classification.

⚫

Extracting the frequencies of a sound wave.

(39)

Data Mining and Data Privacy

⚫ For each of the following data sets, explain whether or not data privacy is an important issue.

⚫ (a) Census data collected from 1900–1950.

⚫ No

⚫ (b) IP addresses and visit times of Web users who visit your Website.

⚫ Yes

⚫ (c) Images from Earth-orbiting satellites.

⚫ No

⚫ (d) Names and addresses of people from the telephone book.

⚫ No

⚫ (e) Names and email addresses collected from the Web.

(40)

40

Recommended Reference Books

⚫ E. Alpaydin. Introduction to Machine Learning, 2nd ed., MIT Press, 2011

⚫ S. Chakrabarti. Mining the Web: Statistical Analysis of Hypertex and Semi-Structured Data. Morgan Kaufmann, 2002

⚫ R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, 2ed., Wiley-Interscience, 2000

⚫ T. Dasu and T. Johnson. Exploratory Data Mining and Data Cleaning. John Wiley & Sons, 2003

⚫ U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy. Advances in Knowledge Discovery and Data Mining.

AAAI/MIT Press, 1996

⚫ U. Fayyad, G. Grinstein, and A. Wierse, Information Visualization in Data Mining and Knowledge Discovery, Morgan

Kaufmann, 2001

⚫ J. Han, M. Kamber, and J. Pei, Data Mining: Concepts and Techniques. Morgan Kaufmann, 3rd_{ed. , 2011}

⚫ T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd

ed., Springer, 2009

⚫ B. Liu, Web Data Mining, Springer 2006

⚫ T. M. Mitchell, Machine Learning, McGraw Hill, 1997

⚫ Y. Sun and J. Han, Mining Heterogeneous Information Networks, Morgan & Claypool, 2012

⚫ P.-N. Tan, M. Steinbach and V. Kumar, Introduction to Data Mining, Wiley, 2005

⚫ S. M. Weiss and N. Indurkhya, Predictive Data Mining, Morgan Kaufmann, 1998

⚫ I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations,