• No results found

Data Mining (DM)

N/A
N/A
Protected

Academic year: 2020

Share "Data Mining (DM)"

Copied!
40
0
0

Loading.... (view fulltext now)

Full text

(1)

Data Mining (DM)

Ms Ansif Arooj, University of Education, S & T, Township Campus Lahore

Lecture 2: Data Mining and its Applications

(2)

2

Why Data Mining?

âš« The Explosive Growth of Data: from terabytes to petabytes

âš« Data collection and data availability

âš« Automated data collection tools, database systems,

Web, computerized society

âš« Major sources of abundant data

⚫ Business: Web, e-commerce, transactions, stocks, …

âš« Science: Remote sensing, bioinformatics, scientific

simulation, …

âš« Society and everyone: news, digital cameras, YouTube

âš« We are drowning in data, but starving for knowledge!

⚫ “Necessity is the mother of invention”—Data

(3)

3

What Is Data Mining?

âš« Data mining (knowledge discovery from data)

âš« Extraction of interesting (non-trivial, implicit, previously

unknown and potentially useful) patterns or knowledge from huge amount of data

âš« Data mining: a misnomer?

âš« Alternative names

âš« Knowledge discovery (mining) in databases (KDD),

knowledge extraction, data/pattern analysis, data

archeology, data dredging, information harvesting, business intelligence, etc.

(4)

4

Knowledge Discovery (KDD) Process

âš« This is a view from typical database systems and data warehousing

communities

âš« Data mining plays an essential role in the knowledge discovery process

Data Cleaning Data Integration Databases Data Warehouse Task-relevant Data Selection Data Mining Pattern Evaluation

(5)
(6)

© Prentice Hall 6

(7)

© Prentice Hall 7

Data Warehousing

âš«

“

Subject-oriented, integrated, time-variant, nonvolatile” William Inmon

âš« Operational Data: Data used in day to day needs of company.

âš« Informational Data: Supports other functions such as planning and forecasting.

âš« Data mining tools often access data warehouses rather than operational data.

(8)

© Prentice Hall 8

Operational vs. Informational

Operational Data Data Warehouse

Application OLTP OLAP

Use Precise Queries Ad Hoc Temporal Snapshot Historical Modification Dynamic Static Orientation Application Business Data Operational Values Integrated

Size Gigabits Terabits

Level Detailed Summarized

Access Often Less Often

Response Few Seconds Minutes

(9)

© Prentice Hall 9

OLAP

âš« Online Analytic Processing (OLAP): provides more complex queries than OLTP.

âš« OnLine Transaction Processing (OLTP): traditional database/transaction processing.

âš« Dimensional data; cube view

âš« Visualization of operations:

âš« Slice: examine sub-cube.

âš« Dice: rotate cube to look at another dimension.

âš« Roll Up/Drill Down

(10)

© Prentice Hall 10

OLAP Operations

Single Cell Multiple Cells Slice Dice

Roll Up

(11)

11

Data Mining in Business Intelligence

Increasing potential to support

business decisions End User

Business Analyst Data Analyst DBA Decision Making Data Presentation Visualization Techniques Data Mining Information Discovery Data Exploration

Statistical Summary, Querying, and Reporting

Data Preprocessing/Integration, Data Warehouses Data Sources

(12)

12

KDD Process: A Typical View from ML and Statistics

Input Data Data

Mining

Data Pre-Processing

Post-Process ing

âš« This is a view from typical machine learning and statistics communities

Data integration Normalization Feature selection Dimension reduction

Pattern discovery

Association & correlation Classification Clustering Outlier analysis … … … … Pattern evaluation Pattern selection Pattern interpretation Pattern visualization

(13)

13

Multi-Dimensional View of Data Mining

âš« Data to be mined

âš« Database data (extended-relational, object-oriented, heterogeneous, legacy), data warehouse, transactional data, stream, spatiotemporal, time-series, sequence, text and web, multi-media, graphs & social and information networks

âš« Knowledge to be mined (or: Data mining functions)

âš« Characterization, discrimination, association, classification, clustering, trend/deviation, outlier analysis, etc.

âš« Descriptive (mining tasks characterize properties of the data in a target data set.) vs. predictive data mining (mining tasks perform induction on the current data in order to make predictions).

âš« Multiple/integrated functions and mining at multiple levels

âš« Techniques utilized

âš« Data-intensive, data warehouse (OLAP), machine learning, statistics, pattern recognition, visualization, high-performance, etc.

(14)

FUNCTION OF DATA MINING

(15)

Application of Data Mining

âš«

Spatial Data Analysis

âš«

Information Retrieval

âš«

Pattern Recognition

âš«

Image Analysis

âš«

Signal Processing

âš«

Computer Graphics

âš«

Web Technology

âš«

Business

âš«

Bioinformatics

(16)
(17)

17

Data Mining Function:

(1) Generalization

âš« Information integration and data warehouse construction

âš« Data cleaning, transformation, integration, and

multidimensional data model

âš« Data cube technology

âš« Scalable methods for computing (i.e., materializing)

multidimensional aggregates

âš« OLAP (online analytical processing)

âš« Multidimensional concept description:

âš« Characterization and discrimination

âš« Generalize, summarize,

(18)

18

Data Mining Function: (2)

Association and

Correlation Analysis

âš« Frequent patterns (or frequent item sets)

What items are frequently purchased together in your shopping mall?

âš« Association, correlation vs. causality

âš« A typical association rule

Butter, Bread Milk [20%, 100%] [support, confidence) RentType(X, "game") AND Age(X, "13-19") -> Buys(X, "pop") [s=2% ,c=55%]

Are strongly associated items also strongly correlated?

âš« How to mine such patterns and rules efficiently in large datasets?

âš« How to use such patterns for classification, clustering, and other applications?

(19)

Support(x->y)=P(XUY) Confidence(x->y)=XUY/X

(20)
(21)

Small Data Mining Example

âš« X Y

âš« Butter, Bread Milk

âš« [20%, 100%] (support, confidence)

âš«

Support

âš« the proportion of transactions in the data set which

contain the itemset.

âš« 1/5

âš«

Confidence

(22)

CLASS ACTIVITY

Rule 1) Coke, burger Diapers

Rule 2) Coke, burger, Potatoes bread

Rule 3) Coke, burger, potatoes onion, bread Rule 4) burger, potatoes, onion coke

Transaction ID

Coke Burger Potatoes Onion Diapers Bread

1 1 1 1 1 2 1 1 1 3 1 1 1 1 4 1 1 1 1 1 5 1 6 1 1 1 7 1 1 1 1 8 1 1 1 1 1 9 1 1 10 1 1 1 1 1 1

(23)
(24)

24

Data Mining Function:

(3) Classification

âš« Classification and label prediction

âš« Construct models (functions) based on some training examples

âš« Also named as supervised classification

âš« Describe and distinguish classes or concepts for future prediction âš« E.g., classify countries based on (climate), or classify cars based

on (gas mileage)

âš« Typical methods

âš« Decision trees, naĂŻve Bayesian classification, support vector

machines, neural networks, rule-based classification, pattern-based classification, logistic regression, …

âš« Typical applications:

⚫ Credit card fraud detection, direct marketing, classifying stars, diseases, web-pages, …

(25)
(26)
(27)

27

Data Mining Function:

(4) Cluster Analysis

âš« Unsupervised learning (i.e., Class label is unknown)

âš« Group data to form new categories (i.e., clusters), e.g.,

cluster houses to find distribution patterns

âš« Principle: Maximizing intra-class similarity & minimizing interclass similarity

(28)
(29)

29

Data Mining Function:

(5) Outlier Analysis

âš« Outlier analysis

âš« Outlier: A data object that does not comply with the

general behavior of the data

⚫ Noise or exception? ― One person’s garbage could be

another person’s treasure

âš« Methods: by product of clustering or regression analysis,

…

(30)

Data Mining Function:

(6) Prediction

âš«

The major idea is to use a large number of past

values to consider probable future values.

âš«

Forecasting and predicting the unavailable data

values or a class label for some data.

(31)

31

Evaluation of Knowledge

âš« Are all mined knowledge interesting?

⚫ One can mine tremendous amount of “patterns”

âš« Some may fit only certain dimension space (time,

location, …)

⚫ Some may not be representative, may be transient, …

⚫ Evaluation of mined knowledge → directly mine only

interesting knowledge? ⚫ Descriptive vs. predictive ⚫ Coverage ⚫ Typicality vs. novelty ⚫ Accuracy ⚫ Timeliness ⚫ …

(32)

32

Data Mining: Confluence of Multiple Disciplines

Data Mining Machine Learning Statistics Applicati ons Algorithm Pattern Recogniti on High-Perform ance Computing Visualizat ion Database Technolo gy

(33)

33

Summary

âš« Data mining: Discovering interesting patterns and knowledge from massive amount of data

âš« A natural evolution of science and information technology, in great demand, with wide applications

âš« A KDD process includes data cleaning, data integration, data selection,

transformation, data mining, pattern evaluation, and knowledge presentation

âš« Mining can be performed in a variety of data

âš« Data mining functionalities: characterization, discrimination, association, classification, clustering, trend and outlier analysis, etc.

âš« Data mining technologies and applications

(34)

Class Activity

âš«

Discuss whether or not each of the following activities

is a data mining task.

âš«

A) Dividing the customers of a company according to

their gender.

âš«

No. This is a simple database query.

âš«

B) Dividing the customers of a company according to

their profitability.

âš«

No. This is an accounting calculation, followed by the

application of a threshold. However, predicting the

profitability of a new customer would be data mining.

(35)

Data Mining yes/no?

âš« (c) Computing the total sales of a company.

âš« No. Again, this is simple accounting.

âš« (d) Sorting a student database based on student identification numbers.

âš« No. Again, this is a simple database query.

âš« (e) Predicting the outcomes of tossing a (fair) pair of dice.

âš« No. Since the die is fair, this is a probability calculation. If

the die were not fair, and we needed to estimate the

probabilities of each outcome from the data, then this is more like the problems considered by data mining.

However, in this specific case, solutions to this problem were developed by mathematicians a long time ago, and thus, we wouldn’t consider it to be data mining.

(36)

Data Mining yes/no?

âš«

(f)Predicting the future stock price of a company using

historical records.

âš«

Yes. We would attempt to create a model that can

predict the continuous value of the stock price. This is

an example of the area of data mining known as

predictive modelling. We could use regression for this

modelling, although researchers in many fields have

developed a wide variety of techniques for predicting

time series.

(37)

Data Mining yes/no?

âš«

(g) Monitoring the heart rate of a patient for

abnormalities.

âš«

Yes. We would build a model of the normal behavior

of heart rate and raise an alarm when an unusual heart

behavior occurred. This would involve the area of data

mining known as anomaly detection. This could also

be considered as a classification problem if we had

(38)

Data Mining yes/no?

âš«

(h) Monitoring seismic waves for earthquake activities.

âš«

Yes. In this case, we would build a model of different

types of seismic wave behavior associated with

earthquake activities and raise an alarm when one of

these different types of seismic activity was observed.

This is an example of the area of data mining known as

classification.

âš«

Extracting the frequencies of a sound wave.

(39)

Data Mining and Data Privacy

âš« For each of the following data sets, explain whether or not data privacy is an important issue.

⚫ (a) Census data collected from 1900–1950.

âš« No

âš« (b) IP addresses and visit times of Web users who visit your Website.

âš« Yes

âš« (c) Images from Earth-orbiting satellites.

âš« No

âš« (d) Names and addresses of people from the telephone book.

âš« No

âš« (e) Names and email addresses collected from the Web.

(40)

40

Recommended Reference Books

âš« E. Alpaydin. Introduction to Machine Learning, 2nd ed., MIT Press, 2011

âš« S. Chakrabarti. Mining the Web: Statistical Analysis of Hypertex and Semi-Structured Data. Morgan Kaufmann, 2002

âš« R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, 2ed., Wiley-Interscience, 2000

âš« T. Dasu and T. Johnson. Exploratory Data Mining and Data Cleaning. John Wiley & Sons, 2003

âš« U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy. Advances in Knowledge Discovery and Data Mining.

AAAI/MIT Press, 1996

âš« U. Fayyad, G. Grinstein, and A. Wierse, Information Visualization in Data Mining and Knowledge Discovery, Morgan

Kaufmann, 2001

âš« J. Han, M. Kamber, and J. Pei, Data Mining: Concepts and Techniques. Morgan Kaufmann, 3rd ed. , 2011

âš« T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd

ed., Springer, 2009

âš« B. Liu, Web Data Mining, Springer 2006

âš« T. M. Mitchell, Machine Learning, McGraw Hill, 1997

âš« Y. Sun and J. Han, Mining Heterogeneous Information Networks, Morgan & Claypool, 2012

âš« P.-N. Tan, M. Steinbach and V. Kumar, Introduction to Data Mining, Wiley, 2005

âš« S. M. Weiss and N. Indurkhya, Predictive Data Mining, Morgan Kaufmann, 1998

âš« I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations,

References

Related documents

1960s CPU idle time Automate transition between jobs Time Sharing Systems 1970s Good response time Time-slice, round- robin scheduling Multiprocessing Systems 1980s

Also these figures are not consistent with data I have compiled from ABS sources in recent years for the housing tenure of Indigenous households by remoteness geography4. Table 2

As described above, our benchmark model uses the three variables that growth theory suggests should have approximately the same permanent components: Real output per hour (variable

The American Recovery and Reinvestment Act of 2009 (ARRA, P.L. 111-5) includes a temporary provision that allowed non-itemizing homeowners to claim an additional standard deduction

COMPASS consortium partners involved in the development of the self- assessment included academics with expertise in corporate sustain- ability, organizational learning,

The results also correlate with ones in the multi-criteria evaluations: even this picture is not covering all partners in the QS ranking, but it is clear that the position of

Within this motivating context, the aims of this work were to (i) assess the genetic diversity in a large collection of durum wheat accessions (over 250) released since the early

Players can create characters and participate in any adventure allowed as a part of the D&D Adventurers League.. As they adventure, players track their characters’