
Datamining


Academic year: 2020

Chapter 28

Definitions of Data Mining

• The discovery of new information in terms of patterns or rules from vast amounts of data.

• The process of finding interesting structure in data.

• The process of employing one or more computer

Data Warehousing

• The data warehouse is a historical database designed for decision support.

• Data mining can be applied to the data in a warehouse to help with certain types of decisions.

• Proper construction of a data warehouse is

Knowledge Discovery in Databases (KDD)

• Data mining is actually one step of a larger process known as knowledge discovery in databases (KDD).

• The KDD process model comprises six phases

  • Data selection
  • Data cleansing
  • Enrichment
  • Data transformation or encoding
  • Data mining

Goals of Data Mining and Knowledge Discovery (PICO)

Prediction:

• Determine how certain attributes will behave in the future.

Identification:

• Identify the existence of an item, event, or activity.

Classification:

Types of Discovered Knowledge

• Association Rules

• Classification

Classification

Classification is the process of learning a model that is able to describe different classes of data.

• Learning is supervised as the classes to be learned are predetermined.

• Learning is accomplished by using a training set of pre-classified data.

• The model produced is usually in the form of a

K – Nearest Neighbor Algorithm

Distance between two points (x_1, y_1) and (x_2, y_2):

$$d = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}$$
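A k-nearest-neighbor classifier can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the training points, the labels "low" and "high", the query point, and the choice k = 3 are all hypothetical and do not come from the slides. It uses Euclidean distance and a simple majority vote.

```python
import math
from collections import Counter

def euclidean(p, q):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def knn_classify(training, query, k=3):
    """Return the majority label among the k training points nearest to query.

    training is a list of (point, label) pairs. If the vote is tied,
    the result is whichever label Counter.most_common lists first.
    """
    nearest = sorted(training, key=lambda pl: euclidean(pl[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical pre-classified training set (not taken from the slides).
training = [((1, 1), "low"), ((2, 1), "low"), ((4, 3), "high"), ((5, 4), "high")]

print(knn_classify(training, (1.5, 1.2), k=3))  # low
```

Sorting the whole training set is the simplest approach; for large data sets a spatial index would normally replace it.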

Association Rules

• Association rules are frequently used to generate rules from market-basket data.

• A market basket corresponds to the sets of items a consumer purchases during one visit to a supermarket.

• An association rule is of the form X => Y

X is called the antecedent (left-hand side or LHS) and Y is called the consequent (right-hand side or RHS) of the rule.

Example of rule

{onions, potatoes} => {beef}

• This rule, found in the sales data of a supermarket, would indicate that if a customer buys onions and potatoes together, he or she is likely to also buy beef.

• Such information can be used as the basis for

Mathematical formulation of Rule

• Let I = {i1, i2, ..., in} be a set of n binary attributes called items.

• Let D = {t1, t2, ..., tm} be a set of transactions called the database.

• Each transaction in D has a unique transaction ID and contains a subset of the items in I.

• A rule is defined as an implication of the form X => Y where X ∩ Y = ∅ and X, Y ⊆ I.

Itemset

A collection of one or more items

• Example: {Milk, Butter, Bread}

Association rule: Support

Every association rule has a support and a confidence.

“The support is the percentage of transactions that demonstrate the rule.”

Example: Database with transactions ( customer_# : item_a1, item_a2, … )

1: 1, 3, 5.
2: 1, 8, 14, 17, 12.
3: 4, 6, 8, 12, 9, 104.
4: 2, 1, 8.
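As a minimal sketch, support can be computed directly from the four transactions in this example. The function name `support` and the itemsets queried, {8, 12} and {1}, are chosen here purely for illustration.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

# The four transactions listed in the example above.
transactions = [
    {1, 3, 5},
    {1, 8, 14, 17, 12},
    {4, 6, 8, 12, 9, 104},
    {2, 1, 8},
]

print(support(transactions, {8, 12}))  # 0.5
print(support(transactions, {1}))      # 0.75
```

Items 8 and 12 appear together in transactions 2 and 3, so the itemset {8, 12} has a support of 2/4 = 50%.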


Association rule: Support

An itemset is called frequent if its support is equal to or greater than an agreed-upon minimal value, the support threshold.

Add to previous example: if threshold 50%

Association Rules: Confidence

Every association rule has a support and a confidence.

An association rule is of the form: X => Y

X => Y: if someone buys X, he or she also buys Y

The confidence is the conditional probability that, given X present in a transaction, Y will also be present.

Confidence measure, by definition:

conf( X => Y ) = supp( X ∪ Y ) / supp( X )

Association Rules: Confidence

We should only consider rules derived from itemsets with high support, and that also have high confidence.

“A rule with low confidence is not meaningful.”

Association Rules: Confidence

Example: Database with transactions ( customer_# : item_a1, item_a2, … )

1: 3, 5, 8.
2: 2, 6, 8.
3: 1, 4, 7, 10.
4: 3, 8, 10.
5: 2, 5, 8.

Conf ( {5} => {8} ) ?

supp({5}) = 5 , supp({8}) = 7 , supp({5,8}) = 4,

then conf( {5} => {8} ) = 4/5 = 0.8 or 80%
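The arithmetic in this example can be reproduced with a short sketch. It takes the support counts quoted above as given, supp({5}) = 5 and supp({5,8}) = 4, rather than recounting them from the transaction list. The function name and the zero check are illustrative assumptions.

```python
def confidence(supp_xy, supp_x):
    """conf(X => Y): support of X and Y together divided by support of X."""
    if supp_x == 0:
        raise ValueError("antecedent never occurs, so confidence is undefined")
    return supp_xy / supp_x

# Support counts quoted in the example: supp({5}) = 5, supp({5,8}) = 4.
conf = confidence(supp_xy=4, supp_x=5)
print(f"{conf:.0%}")  # 80%
```

Note that supp({8}) = 7 is not needed: confidence depends only on the antecedent and on the combined itemset.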


Association Rules: Confidence

1. Generate all itemsets that have a support that exceeds the threshold. These sets of items are called large (or frequent) itemsets. Note that large here means large support.

2. For each large itemset, all the rules that have a

Generating Association Rules: Apriori Algorithm

• Apriori principle:

  • If an itemset is frequent, then all of its subsets must also be frequent.

Generating Association Rules: Illustrating Apriori Principle

Items (1-itemsets)

Item      Count
Bread     4
Coke      2
Milk      4
Beer      3
Diaper    4
Eggs      1

Pairs (2-itemsets): no need to generate candidates involving Coke or Eggs

Itemset            Count
{Bread, Milk}      3
{Bread, Beer}      2
{Bread, Diaper}    3
{Milk, Beer}       2
{Milk, Diaper}     3
{Beer, Diaper}     3

Triplets (3-itemsets): no need to generate candidates involving {Bread, Beer} and {Milk, Beer}

Generating Association Rules: Apriori Algorithm

• N = number of attributes

• Attribute_list = all attributes

• For i = 1 to N

  • Frequent_item_set = generate itemsets of i attributes using Attribute_list

  • Frequent_item_set = remove all infrequent itemsets

  • Rule = Rule ∪ rules generated using Frequent_item_set

  • Attribute_list = only attributes contained in Frequent_item_set
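The loop above can be sketched in Python as follows. Several things here are assumptions rather than part of the slides. The five transactions are invented so that they reproduce the item and pair counts in the Apriori illustration. The minimum support count of 3 is inferred from the pruning shown there. The 0.8 confidence threshold and the names `apriori` and `rules` are illustrative.

```python
from itertools import combinations

def apriori(transactions, min_count):
    """Return {itemset: count} for every itemset with count >= min_count."""
    transactions = [frozenset(t) for t in transactions]
    items = sorted({item for t in transactions for item in t})
    candidates = [frozenset([item]) for item in items]
    frequent = {}
    k = 1
    while candidates:
        # Count each candidate and keep those that meet the threshold.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_count}
        frequent.update(level)
        # Join frequent k-itemsets into (k+1)-candidates, then prune any
        # candidate with an infrequent k-subset (the Apriori principle).
        k += 1
        joined = {a | b for a in level for b in level if len(a | b) == k}
        candidates = [
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        ]
    return frequent

def rules(frequent, min_conf):
    """Yield (X, Y, confidence) for rules X => Y with confidence >= min_conf."""
    for itemset, count in frequent.items():
        for r in range(1, len(itemset)):
            for lhs in combinations(itemset, r):
                lhs = frozenset(lhs)
                conf = count / frequent[lhs]  # every subset is frequent too
                if conf >= min_conf:
                    yield lhs, itemset - lhs, conf

# Assumed transactions, chosen to match the counts in the illustration.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

frequent = apriori(transactions, min_count=3)
for lhs, rhs, conf in rules(frequent, min_conf=0.8):
    print(set(lhs), "=>", set(rhs), conf)  # {'Beer'} => {'Diaper'} 1.0
```

With these assumed transactions the only candidate triplet that survives pruning is {Bread, Milk, Diaper}, and it occurs in just two transactions, so no 3-itemset is frequent.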


Clustering

• Unsupervised learning or clustering builds models from data without predefined classes.

• The goal is to place records into groups where the records in a group are highly similar to each other and dissimilar to records in other groups.

• The k-Means algorithm is a simple yet effective

Example

• Problem

Suppose we have 4 types of medicine and each has two attributes (pH and weight index). Our goal is to group these objects into K = 2 groups of medicine.

Medicine    Weight    pH-Index
A           1         1
B           2         1
C           4         3
D           5         4

Example

• Step 1: Use initial seed points for partitioning

c1 = A, c2 = B

Euclidean distance:

$$d(D, c_1) = \sqrt{(5-1)^2 + (4-1)^2} = 5$$

$$d(D, c_2) = \sqrt{(5-2)^2 + (4-1)^2} = 4.24$$

Assign each object to the cluster with the nearest seed point.

Example

• Step 2: Compute new centroids of the current partition

Knowing the members of each cluster, now we compute the new centroid of each group based on these new memberships.

Example

• Step 2: Renew membership based on new centroids

Compute the distance of all objects to the new centroids

Example

• Step 3: Repeat the first two steps until convergence

Knowing the members of each cluster, now we compute the new centroid of each group based on these new memberships.

Example

• Step 3: Repeat the first two steps until convergence

Compute the distance of all objects to the new centroids
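The whole worked example can be replayed with a short k-means sketch. The four medicines and the seed points c1 = A, c2 = B come from the example. The function name `kmeans`, the iteration cap, and the handling of empty clusters are assumptions made for this illustration. `math.dist` requires Python 3.8 or later.

```python
import math

def kmeans(points, centroids, max_iter=100):
    """Plain k-means: returns (assignments, centroids) once memberships settle."""
    centroids = list(centroids)
    assignments = None
    for _ in range(max_iter):
        # Assignment step: index of the nearest centroid for each point.
        new_assignments = [
            min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # memberships unchanged, so the centroids are stable too
        assignments = new_assignments
        # Update step: each centroid becomes the mean of its members.
        for j in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:  # assumption: an empty cluster keeps its old centroid
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return assignments, centroids

# (weight, pH-index) for each medicine, as in the example table.
medicines = {"A": (1, 1), "B": (2, 1), "C": (4, 3), "D": (5, 4)}

# Seed points from Step 1: c1 = A, c2 = B.
assignments, centroids = kmeans(list(medicines.values()), [(1, 1), (2, 1)])
print(dict(zip(medicines, assignments)))  # {'A': 0, 'B': 0, 'C': 1, 'D': 1}
print(centroids)                          # [(1.5, 1.0), (4.5, 3.5)]
```

In this run, the first pass puts only A in the first cluster. The second pass moves B over to join A, and the third pass leaves the memberships unchanged, so the loop stops.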
