
ACM SIGKDD Workshop on Intelligence and Security Informatics Held in conjunction with KDD-2010


(1)

Fuzzy Association Rule Mining for Community Crime Pattern Discovery

Anna L. Buczak, Christopher M. Gifford

ACM SIGKDD Workshop on Intelligence and Security Informatics

Held in conjunction with KDD-2010 July 25, 2010

(2)

Outline


Objective

Fuzzy Association Rule Mining (FARM) methodology

Rule interestingness

Rule pruning

Data set

Results:

All state rules

Post-pruning of all state rules

Regional rules

(3)

Crime Pattern Discovery


Currently:

Manual inspection of crime data by analysts.

Limited due to the amount of data that can be processed in an acceptable time frame.

Complex relationships between various crime attributes can be overlooked.

Increased interest by state and city law enforcement to discover patterns in crime data sets.

Goal: automatic discovery of community crime patterns.

Extract rules governing crime patterns.

(4)

Fuzzy Association Rule Mining (FARM)


Goal: find novel relationships in the data

For numerical and categorical attributes, crisp rules are unsatisfactory and fuzzy rules provide significantly higher quality results.

Fuzzy logic assigns a degree of membership between 0 and 1 (e.g., 0.4) to each element of a set.

[Figure: Fuzzy Membership Functions* for Variable Customer-age]

Fuzzy association rules are of the form:

(X is A) -> (Y is B)

where X, Y are attributes and A, B are fuzzy sets which characterize X and Y respectively.

Example fuzzy association rule for a banking application:

(Customer-age is young) and (Account-balance is small) -> (Loan-balance is moderate)

Fuzzy rules are well suited to problems with

*Au, W-H and Chan, KCC, “Mining fuzzy association rules in a bank-account database”, IEEE Transactions on Fuzzy Systems, 11:238-248, April 2003.
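A minimal sketch of how a degree of membership could be computed for a variable such as Customer-age. The trapezoidal shape, the set labels, and every breakpoint below are illustrative assumptions; the slides do not give the actual membership functions.

```python
def trapezoid(x, a, b, c, d):
    """Membership rises on [a, b], equals 1 on [b, c], and falls on [c, d]."""
    if b <= x <= c:
        return 1.0
    if a < x < b:
        return (x - a) / (b - a)
    if c < x < d:
        return (d - x) / (d - c)
    return 0.0


# Hypothetical fuzzy sets for age in years (not taken from the slides).
AGE_SETS = {
    "young": (0, 0, 25, 35),
    "middle-aged": (25, 35, 50, 60),
    "old": (50, 60, 120, 120),
}


def fuzzify_age(age):
    """Return the degree of membership of `age` in each fuzzy set."""
    return {label: trapezoid(age, *points) for label, points in AGE_SETS.items()}


memberships = fuzzify_age(29)
print(memberships)
```

With these assumed breakpoints, a 29-year-old belongs partly to "young" and partly to "middle-aged", which is the kind of gradual transition a crisp age bracket cannot express.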

(5)

Rule Interestingness


Rules of interest usually have high support and high confidence.

In certain domains, rules of interest do not have high support (i.e., they describe rare events).

This holds for crime, equipment failure, and rare disease applications.

When interested in rare events, even rules with low support need to be generated.

Result: a very large number of rules.

(6)

Rule Pruning (1)


Pruning based on Support, Confidence, Lift

Consequent-constraint rule pruning*

An item constraint is used that requires rule consequents to satisfy a given constraint. Requires prior knowledge of which consequents should be interesting.

Antecedent-constraint rule pruning

Remove rules that are subsets of other rules (their antecedents add conditions to a more general rule, so they cover a subset of its instances) and have similar confidence:

R1: (A1 & A2) -> C1, conf = 0.7

R2: (A1 & A2 & A3) -> C1, conf = 0.7

R3: (A1 & A2 & A4) -> C1, conf = 0.88

R2 is a subset of R1 with the same confidence, so it should be removed.

R3 is a subset of R1 with a different confidence, so it should stay.

*R.J. Bayardo, R. Agrawal, and D. Gunopulos, “Constraint-Based Rule Mining in Large, Dense Databases,”
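The antecedent-constraint step can be sketched as follows, using the three example rules R1 to R3. The rule representation, the function name, and the tolerance `tol` that defines "similar confidence" are assumptions made for illustration; the slides do not state how similarity was measured.

```python
def prune_specializations(rules, tol=0.01):
    """Drop a rule when a more general rule with the same consequent
    has a confidence within `tol` (an assumed similarity threshold).

    rules: list of (antecedent frozenset, consequent, confidence).
    """
    kept = []
    for ante, cons, conf in rules:
        redundant = any(
            other_cons == cons
            and other_ante < ante  # other antecedent is strictly more general
            and abs(other_conf - conf) <= tol
            for other_ante, other_cons, other_conf in rules
        )
        if not redundant:
            kept.append((ante, cons, conf))
    return kept


rules = [
    (frozenset({"A1", "A2"}), "C1", 0.70),        # R1
    (frozenset({"A1", "A2", "A3"}), "C1", 0.70),  # R2
    (frozenset({"A1", "A2", "A4"}), "C1", 0.88),  # R3
]
kept = prune_specializations(rules)
for ante, cons, conf in kept:
    print(sorted(ante), "->", cons, conf)
```

R2 is dropped because R1 already reaches the same confidence with fewer conditions, while R3 is kept because the extra condition A4 changes the confidence.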

(7)

Rule Pruning (2)


Defined Relative Fuzzy Support (RFS) measure:

[Equation: definition of RFS]

Allows reduction of the support threshold for consequents that have low frequency and increase of the support threshold for consequents that have high frequency.

The reduction or increase of support is significant because of the square in the denominator.

RFS is well suited for applications in which the user knows the consequents of interest. This is the case in the crime application, as the user is most interested in Violent Crimes, Murders, Robberies and Assaults being High.
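The defining equation for RFS did not survive in this text. Purely as an assumption, the sketch below uses rule support divided by the squared support of the consequent. That form is a guess motivated by the "square in the denominator" remark and by rough agreement with the rounded metrics reported for the rule Kids Born to Never Married (High) -> Violent Crimes (High) (conf=0.58, lift=13, sup=0.004, rel sup=2.1); it should not be read as the authors' definition.

```python
def consequent_support_from(conf, lift):
    """Recover sup(Y) from lift = conf / sup(Y)."""
    return conf / lift


def relative_support(rule_support, consequent_support):
    """ASSUMED form of relative support: sup(X -> Y) / sup(Y)^2.
    This formula is inferred, not quoted from the slides."""
    return rule_support / consequent_support ** 2


sup_y = consequent_support_from(conf=0.58, lift=13)
rfs = relative_support(rule_support=0.004, consequent_support=sup_y)
print(round(sup_y, 4), round(rfs, 2))
```

Under this assumption the result is about 2.0, close to the reported 2.1 given that the inputs are rounded. A rare consequent (small sup(Y)) inflates the measure, which matches the stated effect of lowering the effective support threshold for low-frequency consequents.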

(8)

Crime Association Rule Mining


Communities and Crime Data Set* (UCI Machine Learning Repository):

Total of 128 variables

Census data (1990)

Crime data (1995)

Law enforcement data (1990)

For many communities these attributes are missing (e.g., police officers per 100k population, police requests per officer, officers assigned to drug units, police operating budget)

Data from 2215 communities

(9)

Antecedents and Consequents


Antecedents:
Mean People Per Household
Race: African American (%)
Race: Caucasian (%)
Race: Hispanic (%)
Race: Asian (%)
Age: 12-21 (%)
Age: 12-29 (%)
Age: 16-24 (%)
Age: 65+ (%)
Unemployed (%)
Employed (%)
Divorced (%)
Houses with Salary Income (%)
Houses with Retirement Income (%)
Houses with Social Security Income (%)
Houses with Public Assistance Income (%)
Per Capita Income
Median Household Income
People in Dense Housing (%)
People in Urban Area (%)
People Speaking No English (%)
People Speaking English Only (%)
Median Gross Rent
People in Owner Occupied Households (%)
Education: Less than 9th Grade (%)
Education: No High School Diploma (%)
Education: Bachelor's or Higher (%)
Occupied Housing Units Without Phone (%)
People in Homeless Shelters
Homeless People Counted in Street
Houses with Kids Living with Two Parents (%)
Kids Born to Never Married (%)
Foreign Born (%)
Population Density (Persons Per Square Mile)
People Commute Using Public Transit (%)
People Under Poverty Level (%)

Consequents:
Violent Crimes
Murders
Robberies
Assaults

40 variables

122 membership functions

(10)

Example Variables and Membership Functions

[Figure: Houses with Public Assistance Income (%) Membership Functions]

(11)

Post-Pruning of All State Rules (1)

Confidence = 60%

Support = 0.135%

Total of 13,657 rules generated

Post-pruning using RFS = 1.0

Rules left: 657

95.2% reduction in the number of rules

[Figure: Number of rules]

Average support of rules with membership functions Low and No increased the most.

[Figure: Average support]

Large portion of rules of no interest was automatically

(12)

Post-Pruning of All State Rules (2)

Rules with consequents Murders (High) and Robberies (High) have the highest lift, exceeding several times the average lift of the other consequents.

Average lift of rules with consequent Violent Crimes (High) remaining after pruning increased more than three times.

Average lift of rules with membership functions No, Low, and Medium remaining after pruning is unchanged.

[Figure: Average lift]

Rules with consequents Murders (High) and Robberies (High) have the highest relative support. Pruning does not increase the average relative support of this class of rules.

Average relative support of rules with membership functions No, Low, and Medium remaining after pruning increased.

[Figure: Average relative support]

(13)

All State Rules Producing Highest Value of a Metric

People Speaking No English (Low) & People in Dense Housing (Low) -> Robberies (Low), conf=85.0, lift=1.0, rel sup=1.1, sup=75.3

Kids Born to Never Married (Low) & People in Dense Housing (Low) -> Robberies (Low), conf=88.0, lift=1.1, rel sup=1.1, sup=73.9

People in Urban Area (High) & Kids Born to Never Married (High) -> Robberies (High), conf=63.0, lift=34.7, rel sup=11.9, sup=0.4

Race Caucasian (Minority) & Kids Born to Never Married (High) -> Robberies (High), conf=61.0, lift=33.3, rel sup=10.9, sup=0.4

Kids Born to Never Married (High) & People Commute Using Public Transit (High) -> Robberies (High), conf=96.0, lift=52.9, rel sup=4.5, sup=0.1

Race African American (Minority) & People Speaking No English (Low) -> Robberies (Low), conf=91.0, lift=1.1, rel sup=1.0, sup=65.9

Kids Born to Never Married (High) & People Commute Using Public Transit (High) -> Robberies (High), conf=96.0, lift=52.9, rel sup=4.5, sup=0.1

Houses with Kids Living with Two Parents (Low) & People Commute Using Public Transit (High) -> Robberies (High), conf=86.0, lift=47.4, rel sup=5.6, sup=0.2

Prominent variables: Kids Born to Never Married & Houses with Kids Living with Two Parents are present in 6 out of 8 rules

(14)

Surprising All State Rules Identified by Subject Matter Expert

Employed (High) & Kids Born to Never Married (High) -> Violent Crimes (High), conf=0.67, lift=15.2, rel sup=1.0, sup=0.002
Employed (High) -> Violent Crimes (Low), conf=0.67, lift=1.2, rel sup=1.0, sup=0.297
Kids Born to Never Married (High) -> Violent Crimes (High), conf=0.58, lift=13, rel sup=2.1, sup=0.004

People Under Poverty Level (Low) & Kids Born to Never Married (High) -> Violent Crimes (High), conf=0.65, lift=14.6, rel sup=1.1, sup=0.002
People Under Poverty Level (Low) -> Violent Crimes (Low), conf=0.67, lift=1.2, rel sup=1.4, sup=0.416
Kids Born to Never Married (High) -> Violent Crimes (High), conf=0.58, lift=13, rel sup=2.1, sup=0.004

Age: 16-24 (Low) & Kids Born to Never Married (High) -> Murders (High), conf=0.6, lift=13.5, rel sup=2.1, sup=0.004
Age: 16-24 (Low) -> Murders (Low), conf=0.6, lift=1.1, rel sup=1.3, sup=0.365
Kids Born to Never Married (High) -> Murders (High), conf=0.52, lift=20.5, rel sup=5.9, sup=0.004

Houses with Salary Income (High) & Kids Born to Never Married (High) -> Violent Crimes (High), conf=0.64, lift=14.5, rel sup=1.4, sup=0.003
Houses with Income (High) -> Violent Crimes (Low), conf=0.66, lift=1.2, rel sup=0.9, sup=0.257
Kids Born to Never Married (High) -> Violent Crimes (High),

(15)

Regional Rules

Rules were generated separately for 5 US regions*: MW, NE, SE, SW, and W.

[Figure: Rules per Region]

3188 rules were consistent through all regions:

Rule with highest avg Support: People in Dense Housing (Low) -> Robberies (Low)

Rule with highest avg Confidence: Houses with Retirement Income (High) & People in Homeless Shelters (None) -> Robberies (Low)

Rule with highest avg Lift: Race African American (Middle) & Houses with Public Assistance Income (Medium) -> Assaults (Medium)

(16)

Conclusions

Fuzzy association rule mining has proven useful for this crime application.

First experimental study of applying fuzzy association rule mining to a crime data set.

Both frequent and rare rules are of interest.

New Fuzzy Relative Support metric defined for rule post-pruning:

Achieves a 95.2% reduction in the final number of rules.

Rules discovered represent patterns of interest to law enforcement officials.

Subject Matter Expert recommendation: “Law enforcement personnel and analysts should further analyze the identified set of surprising rules and the corresponding underlying data in an attempt to better understand crime patterns and develop more effective approaches to combat crime.”

(17)

Questions?

Contact info: Dr. Anna L. Buczak

[email protected]

(18)
(19)

Advantages of FARM

Fuzzy association rules work well with numerical and categorical data.

Fuzzy rules are easy to understand by a human.

FARM does not make any assumptions about the rules that are to be extracted, removing a bias that humans might have.

Use of fuzzy techniques makes fuzzy association rules mining resilient to noise and missing values.

Fuzzy rules are proven to provide superior performance to crisp rules in many applications (e.g., fuzzy temperature controller).

(20)

Rule Support and Confidence

Rule: X -> Y

[Figure: sets X and Y within D, with intersection X∩Y]

Support (coverage) is the number of instances the rule predicts correctly, expressed as a proportion of all items in the data set.

Support = number of instances that contain both X and Y divided by number of all transactions in database (D).

$$\mathrm{Support}(X \rightarrow Y) = \frac{\#(X \cap Y)}{\#D}$$

Confidence (accuracy) is the number of instances that the rule predicts correctly, expressed as a proportion of all instances to which it applies.

Confidence = number of transactions that contain both X and Y divided by number of transactions that contain X.

Confidence can be treated as conditional probability of a transaction containing X also containing Y (P(Y|X)).

$$\mathrm{Confidence}(X \rightarrow Y) = \frac{\#(X \cap Y)}{\#X}$$
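The two definitions can be checked on a small crisp example. The transactions and item names below are invented for illustration, and the sketch counts crisp set membership rather than fuzzy membership degrees.

```python
def support(transactions, x, y):
    """#(X and Y) / #D: share of all transactions containing both X and Y."""
    both = sum(1 for t in transactions if x <= t and y <= t)
    return both / len(transactions)


def confidence(transactions, x, y):
    """#(X and Y) / #X: share of transactions containing X that also contain Y."""
    with_x = [t for t in transactions if x <= t]
    if not with_x:
        return 0.0
    return sum(1 for t in with_x if y <= t) / len(with_x)


# Hypothetical toy data: each transaction is a set of items.
transactions = [
    {"dense_housing_low", "robberies_low"},
    {"dense_housing_low", "robberies_low"},
    {"dense_housing_low", "robberies_high"},
    {"dense_housing_high", "robberies_high"},
    {"dense_housing_high", "robberies_low"},
]
x = {"dense_housing_low"}
y = {"robberies_low"}

sup = support(transactions, x, y)
conf = confidence(transactions, x, y)
print(sup, round(conf, 3))
```

Here two of five transactions contain both X and Y (support 0.4), and two of the three transactions containing X also contain Y (confidence about 0.667).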

(21)

Rule Lift

Rule: X -> Y

[Figure: sets X and Y within D, with intersection X∩Y]

Lift measures the deviation from independence of X and Y.

Lift is the ratio of the fraction of instances in which X and Y appear together to the product of the fraction in which X appears and the fraction in which Y appears.

$$\mathrm{Lift}(X \rightarrow Y) = \frac{\mathrm{Conf}(X \rightarrow Y)}{\mathrm{Expected\_Conf}(X \rightarrow Y)}$$

$$\mathrm{Expected\_Conf}(X \rightarrow Y) = \mathrm{Sup}(Y) = \frac{\#Y}{\#D}$$

$$\mathrm{Lift}(X \rightarrow Y) = \frac{\#(X \cap Y) / \#X}{\#Y / \#D} = \frac{\#(X \cap Y) \cdot \#D}{\#X \cdot \#Y}$$

Lift values larger than 1.0 indicate that transactions containing the antecedent (X) tend to contain the consequent (Y) more often than transactions that do not contain the antecedent (X).

The higher the lift, the more likely that the existence of X and Y together is not just a random occurrence but because of a relationship between them.
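A short sketch of the lift computation from raw counts. The counts are hypothetical and the function name is chosen for illustration only.

```python
def lift(n_xy, n_x, n_y, n_d):
    """Lift of X -> Y from counts:
    n_xy = #(X and Y), n_x = #X, n_y = #Y, n_d = #D."""
    confidence = n_xy / n_x
    expected_confidence = n_y / n_d  # Sup(Y): confidence expected under independence
    return confidence / expected_confidence


# Hypothetical counts: 5 transactions, X in 3, Y in 3, both in 2.
value = lift(n_xy=2, n_x=3, n_y=3, n_d=5)
print(round(value, 3))
```

The same number follows from the closed form #(X∩Y)·#D / (#X·#Y) = 10/9, slightly above 1.0, so X and Y co-occur a little more often than independence would predict.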

(22)

US Regions

Community data were grouped into five regions:

Northeast: CT, DE, ME, MD, MA, NH, NJ, NY, PA, RI, and VT. This subset covers 632 communities.

Southeast: AL, AR, FL, GA, KY, LA, MS, NC, SC, TN, VA, and WV. This subset covers 420 communities.

Midwest: IL, IN, IA, KS, MI, MN, MO, NE (no data), ND, OH, SD, and WI. It covers a total of 513 communities.

Southwest: AZ, NM, OK, and TX. This subset covers 228 communities.

West: CA, CO, ID, MT (no data), NV, OR, UT, WA, and WY.
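One way to apply this grouping before mining rules per region is a state-to-region lookup. The state lists are copied from the slide; the function name, the record layout, and the sample community names are assumptions for illustration.

```python
# Region keys follow the abbreviations used in the slides (MW, NE, SE, SW, W).
REGIONS = {
    "NE": ["CT", "DE", "ME", "MD", "MA", "NH", "NJ", "NY", "PA", "RI", "VT"],
    "SE": ["AL", "AR", "FL", "GA", "KY", "LA", "MS", "NC", "SC", "TN", "VA", "WV"],
    "MW": ["IL", "IN", "IA", "KS", "MI", "MN", "MO", "NE", "ND", "OH", "SD", "WI"],
    "SW": ["AZ", "NM", "OK", "TX"],
    "W": ["CA", "CO", "ID", "MT", "NV", "OR", "UT", "WA", "WY"],
}

# Note: the key "NE" below is the state Nebraska, mapped to the Midwest.
STATE_TO_REGION = {
    state: region for region, states in REGIONS.items() for state in states
}


def split_by_region(communities):
    """Group (community name, state code) pairs by region.
    Communities in states outside the five lists are skipped."""
    groups = {region: [] for region in REGIONS}
    for name, state in communities:
        region = STATE_TO_REGION.get(state)
        if region is not None:
            groups[region].append(name)
    return groups


# Hypothetical records, not taken from the data set.
sample = [("Community A", "TX"), ("Community B", "NJ"), ("Community C", "NE")]
groups = split_by_region(sample)
print(groups)
```

Each regional subset can then be passed through the same rule-mining procedure as the all-state data.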

(23)

Experimental Setup

All attributes with a large number of missing values were removed.

Odds ratios between each remaining attribute and Violent Crimes, Murders, Robberies, and Assaults were computed. Attributes exhibiting small odds ratios were removed.

Similar attributes were omitted (e.g., from the attributes Divorced (%), Male Divorced (%), and Female Divorced (%), only Divorced (%) was kept).
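The slide does not say how the odds ratios were computed for continuous attributes. As an assumption, the sketch below treats both the attribute and the crime variable as already binarized (for example, High versus not High) and computes the odds ratio from the resulting 2x2 table. The data and the zero-cell correction are illustrative choices, not the authors' procedure.

```python
def odds_ratio(exposed, outcome):
    """Odds ratio from two equal-length sequences of booleans.

    a = exposed and outcome,     b = exposed and no outcome,
    c = not exposed and outcome, d = not exposed and no outcome.
    """
    pairs = list(zip(exposed, outcome))
    a = sum(1 for e, o in pairs if e and o)
    b = sum(1 for e, o in pairs if e and not o)
    c = sum(1 for e, o in pairs if not e and o)
    d = sum(1 for e, o in pairs if not e and not o)
    if 0 in (a, b, c, d):
        # Haldane-Anscombe correction: add 0.5 to every cell to avoid
        # division by zero (an assumed choice for this sketch).
        a, b, c, d = a + 0.5, b + 0.5, c + 0.5, d + 0.5
    return (a * d) / (b * c)


# Hypothetical binarized values for eight communities.
attribute_high = [True, True, True, False, False, False, False, True]
crime_high = [True, True, False, False, False, True, False, True]

ratio = odds_ratio(attribute_high, crime_high)
print(ratio)
```

An odds ratio near 1 suggests the attribute carries little information about the crime variable, which is the kind of attribute the setup describes removing.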

(24)

Examples of Membership Functions

[Figures: membership functions for Race: Hispanic (%), Homeless People in Shelters Per 100K Population, and Violent Crimes Per 100K Population]

(25)

Examples of Unsurprising All State Rules

Frequent rules:

Houses with Kids Living with Two Parents (High) & People Speaking No English (Low) -> Murders (No), conf=0.61, lift=1.3, sup=0.221

Houses with Public Assistance Income (Low) & Houses with Kids Living with Two Parents (High) -> Murders (No), conf=0.6, lift=1.3, sup=0.219

Rare rules:

People in Homeless Shelters (High) & People Commute Using Public Transit (High) -> Robberies (High), conf=0.76, lift=42.1, sup=0.002

Houses with Public Assistance Income (High) & Kids Born to Never Married (High) -> Robberies (High), conf=0.7, lift=38.7, sup=0.002

Houses with Public Assistance Income (High) & Kids Born to Never Married (High) -> Murders (High), conf=0.74, lift=28.9, sup=0.002
