Fuzzy Association Rule Mining for Community Crime Pattern Discovery
Anna L. Buczak, Christopher M. Gifford
ACM SIGKDD Workshop on Intelligence and Security Informatics
Held in conjunction with KDD-2010 July 25, 2010
Outline
Objective
Fuzzy Association Rule Mining (FARM) methodology
Rule interestingness
Rule pruning
Data set
Results:
All state rules
Post-pruning of all state rules
Regional rules
Crime Pattern Discovery
Currently:
Manual inspection of crime data by analysts.
Limited due to the amount of data that can be
processed in an acceptable time frame.
Complex relationships between various crime
attributes can be overlooked.
Increased interest by state and city law enforcement to
discover patterns in crime data sets.
Goal: automatic discovery of community crime patterns.
Extract rules governing crime patterns.
Fuzzy Association Rule Mining (FARM)
Goal: find novel relationships in the data
For numerical and categorical attributes, crisp rules are unsatisfactory and fuzzy rules provide significantly higher quality results.
Fuzzy logic assigns a degree of membership between 0 and 1 (e.g., 0.4) to each element of a set.
[Figure: Fuzzy Membership Functions* for Variable Customer-age]
Fuzzy association rules are of the form:
(X is A) → (Y is B)
where X, Y are attributes and A, B are fuzzy sets which characterize X and Y respectively.
Example fuzzy association rule for a banking application:
(Customer-age is young) and (Account-balance is small) → (Loan-balance is moderate)
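To make the rule form concrete, the following minimal sketch computes the degree to which a single customer record matches the banking rule. The membership breakpoints, the sample record, and the use of min as the fuzzy "and" are assumptions chosen for illustration; they are not the membership functions from Au and Chan.

```python
def falling(x, full, zero):
    """Shoulder function: 1 up to `full`, then linear decrease to 0 at `zero`."""
    if x <= full:
        return 1.0
    if x >= zero:
        return 0.0
    return (zero - x) / (zero - full)


def triangle(x, left, peak, right):
    """Triangular function: 0 outside (left, right), 1 at `peak`."""
    if x <= left or x >= right:
        return 0.0
    if x <= peak:
        return (x - left) / (peak - left)
    return (right - x) / (right - peak)


# Hypothetical customer record (values invented for the example).
record = {"age": 31, "account_balance": 1500, "loan_balance": 12000}

# Hypothetical fuzzy sets: breakpoints are assumptions, not from the source.
young = falling(record["age"], full=25, zero=35)
small = falling(record["account_balance"], full=1000, zero=3000)
moderate = triangle(record["loan_balance"], left=5000, peak=10000, right=20000)

# Assumption: min is used as the fuzzy "and" (other t-norms, e.g. product, are possible).
antecedent_degree = min(young, small)
rule_degree = min(antecedent_degree, moderate)

print(young, small, moderate, rule_degree)
```

With these assumed breakpoints a 31-year-old is "young" to degree 0.4, so the record supports the rule only partially rather than being counted as fully in or fully out, as a crisp age cut-off would force.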
Fuzzy rules are well suited to problems with
*Au, W-H and Chan, KCC, “Mining fuzzy association
rules in a bank-account database”, IEEE Transactions on Fuzzy Systems, 11:238-248, April 2003.
Rule Interestingness
Rules of interest usually have high support and high
confidence.
In certain domains rules of interest don’t have a high
support (i.e. they describe rare events).
Holds for crime, equipment failure and rare disease
applications.
When interested in rare events, even rules with low
support need to be generated.
Result: very large number of rules.
Rule Pruning (1)
Pruning based on Support, Confidence, Lift
Consequent-constraint rule pruning*
An item constraint is used that requires rule consequents to satisfy a given constraint. Requires prior knowledge of which consequents should be interesting.
Antecedent-constraint rule pruning
Remove rules that are subsets of other rules and have similar confidence:
R1: (A1 & A2) -> C1, conf = 0.7
R2: (A1 & A2 & A3) -> C1, conf = 0.7
R3: (A1 & A2 & A4) -> C1, conf = 0.88
R2 is a subset of R1 with the same confidence - it should be removed.
R3 is a subset of R1 with a different confidence - it should stay.
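The R1/R2/R3 example can be sketched in a few lines. Here a rule counts as a "subset" of another when its antecedent contains all of the other rule's items (so it matches a subset of the records) and both share the same consequent. The tolerance `tol` that decides what "similar confidence" means is an assumption; the slides do not give a value.

```python
def prune_specializations(rules, tol=0.01):
    """Keep a rule unless a more general rule with the same consequent
    has a confidence within `tol` of it.

    Each rule is (name, antecedent_items, consequent, confidence).
    `tol` is a hypothetical threshold for "similar confidence".
    """
    kept = []
    for name, ante, cons, conf in rules:
        redundant = any(
            other_cons == cons
            and other_ante < ante               # proper subset of antecedent items
            and abs(other_conf - conf) <= tol   # confidence is not improved
            for _, other_ante, other_cons, other_conf in rules
        )
        if not redundant:
            kept.append(name)
    return kept


rules = [
    ("R1", frozenset({"A1", "A2"}), "C1", 0.70),
    ("R2", frozenset({"A1", "A2", "A3"}), "C1", 0.70),
    ("R3", frozenset({"A1", "A2", "A4"}), "C1", 0.88),
]

kept = prune_specializations(rules)
print(kept)
```

R2 adds the condition A3 without changing the confidence, so it carries no extra information and is dropped; R3 adds A4 and raises the confidence from 0.7 to 0.88, so it is kept.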
*R.J. Bayardo, R. Agrawal, and D. Gunopulos, “Constraint-Based Rule Mining in Large, Dense Databases,”
Rule Pruning (2)
Defined Relative Fuzzy Support (RFS) measure:
[Equation: RFS definition]
Allows reduction of the support threshold for consequents that have low frequency and increase of the support threshold for consequents that have high frequency.
The reduction or increase of support is significant because of the square in the denominator.
RFS is well suited for applications in which the user knows the consequents of interest. This is the case in the crime application, as the user is most interested in Violent Crimes, Murders, Robberies and Assaults being High.
Crime Association Rule Mining
Communities and Crime Data Set* (UCI Machine Learning Repository):
Total of 128 variables
Census data (1990)
Crime data (1995)
Law enforcement data (1990)
For many communities these attributes are missing (e.g., police officers per 100k population, police requests per officer, officers assigned to drug units, police operating budget)
Data from 2215 communities
Antecedents and Consequents
Antecedents:
Mean People Per Household
Race: African American (%)
Race: Caucasian (%)
Race: Hispanic (%)
Race: Asian (%)
Age: 12-21 (%)
Age: 12-29 (%)
Age: 16-24 (%)
Age: 65+ (%)
Unemployed (%)
Employed (%)
Divorced (%)
Houses with Salary Income (%)
Houses with Retirement Income (%)
Houses with Social Security Income (%)
Houses with Public Assistance Income (%)
Per Capita Income
Median Household Income
People in Dense Housing (%)
People in Urban Area (%)
People Speaking No English (%)
People Speaking English Only (%)
Median Gross Rent
People in Owner Occupied Households (%)
Education: Less than 9th Grade (%)
Education: No High School Diploma (%)
Education: Bachelor's or Higher (%)
Occupied Housing Units Without Phone (%)
People in Homeless Shelters
Homeless People Counted in Street
Houses with Kids Living with Two Parents (%)
Kids Born to Never Married (%)
Foreign Born (%)
Population Density (Persons Per Square Mile)
People Commute Using Public Transit (%)
People Under Poverty Level (%)
Consequents:
Violent Crimes
Murders
Robberies
Assaults
40 variables
122 membership functions
Example Variables and Membership Functions
[Figure: Houses with Public Assistance Income (%) Membership Functions]
Post-Pruning of All State Rules (1)
Confidence = 60%
Support = 0.135%
Total of 13,657 rules generated
Post-pruning using RFS = 1.0
Rules left: 657
95.2% reduction in the number of rules
[Figure: Number of rules]
Average support of rules with membership functions Low and No increased the most.
[Figure: Average support]
Large portion of rules of no
interest was automatically
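The reported reduction follows directly from the two rule counts on this slide (13,657 rules before post-pruning, 657 after); the worked arithmetic is:

```latex
\[
\text{reduction} \;=\; \frac{13{,}657 - 657}{13{,}657}
\;=\; \frac{13{,}000}{13{,}657}
\;\approx\; 0.952 \;=\; 95.2\%
\]
```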
Post-Pruning of All State Rules (2)
Rules with consequents Murders (High) and Robberies (High) have the highest lift, exceeding several times the average lift of the other consequents.
Average lift of rules with consequent Violent Crimes (High) remaining after pruning increased more than three times.
Average lift of rules with membership functions No, Low, and Medium remaining after pruning is unchanged.
[Figure: Average lift]
Rules with consequents Murders (High) and Robberies (High) have the highest relative support. Pruning does not increase the average relative support of this class of rules.
Average relative support of rules with membership functions No, Low, and Medium remaining after pruning increased.
[Figure: Average relative support]
All State Rules Producing Highest Value of a Metric
People Speaking No English (Low) & People in Dense Housing (Low) → Robberies (Low), conf=85.0, lift=1.0, rel sup=1.1, sup=75.3
Kids Born to Never Married (Low) & People in Dense Housing (Low) → Robberies (Low), conf=88.0, lift=1.1, rel sup=1.1, sup=73.9
People in Urban Area (High) & Kids Born to Never Married (High) → Robberies (High), conf=63.0, lift=34.7, rel sup=11.9, sup=0.4
Race Caucasian (Minority) & Kids Born to Never Married (High) → Robberies (High), conf=61.0, lift=33.3, rel sup=10.9, sup=0.4
Kids Born to Never Married (High) & People Commute Using Public Transit (High) → Robberies (High), conf=96.0, lift=52.9, rel sup=4.5, sup=0.1
Race African American (Minority) & People Speaking No English (Low) → Robberies (Low), conf=91.0, lift=1.1, rel sup=1.0, sup=65.9
Kids Born to Never Married (High) & People Commute Using Public Transit (High) → Robberies (High), conf=96.0, lift=52.9, rel sup=4.5, sup=0.1
Houses with Kids Living with Two Parents (Low) & People Commute Using Public Transit (High) → Robberies (High), conf=86.0, lift=47.4, rel sup=5.6, sup=0.2
Prominent variables: Kids Born to Never Married & Houses with Kids Living with Two Parents are present in 6 out of 8 rules
Surprising All State Rules Identified by Subject Matter Expert
Employed (High) & Kids Born to Never Married (High) → Violent Crimes (High), conf=0.67, lift=15.2, rel sup=1.0, sup=0.002
Employed (High) → Violent Crimes (Low), conf=0.67, lift=1.2, rel sup=1.0, sup=0.297
Kids Born to Never Married (High) → Violent Crimes (High), conf=0.58, lift=13, rel sup=2.1, sup=0.004
People Under Poverty Level (Low) & Kids Born to Never Married (High) → Violent Crimes (High), conf=0.65, lift=14.6, rel sup=1.1, sup=0.002
People Under Poverty Level (Low) → Violent Crimes (Low), conf=0.67, lift=1.2, rel sup=1.4, sup=0.416
Kids Born to Never Married (High) → Violent Crimes (High), conf=0.58, lift=13, rel sup=2.1, sup=0.004
Age: 16-24 (Low) & Kids Born to Never Married (High) → Murders (High), conf=0.6, lift=13.5, rel sup=2.1, sup=0.004
Age: 16-24 (Low) → Murders (Low), conf=0.6, lift=1.1, rel sup=1.3, sup=0.365
Kids Born to Never Married (High) → Murders (High), conf=0.52, lift=20.5, rel sup=5.9, sup=0.004
Houses with Salary Income (High) & Kids Born to Never Married (High) → Violent Crimes (High), conf=0.64, lift=14.5, rel sup=1.4, sup=0.003
Houses with Salary Income (High) → Violent Crimes (Low), conf=0.66, lift=1.2, rel sup=0.9, sup=0.257
Kids Born to Never Married (High) → Violent Crimes (High), conf=0.58, lift=13, rel sup=2.1, sup=0.004
Regional Rules
Rules were generated separately for 5 US regions*: MW, NE, SE, SW, and W.
3188 rules were consistent through all regions:
Rule with highest avg Support:
People in Dense Housing (Low) → Robberies (Low)
Rule with highest avg Confidence:
Houses with Retirement Income (High) & People in Homeless Shelters (None) → Robberies (Low)
Rule with highest avg Lift:
Race African American (Middle) & Houses with Public Assistance Income (Medium) → Assaults (Medium)
[Figure: Rules per Region]
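One plausible way to find rules that are consistent through all regions is to intersect the per-region rule sets and then average a metric over the regions. The sketch below does this for confidence; the rule names and confidence values are invented placeholders, and the slides do not state how consistency was actually determined.

```python
from statistics import mean

# Hypothetical per-region results: rule name -> confidence (values invented).
regional_rules = {
    "MW": {"rule_a": 0.90, "rule_b": 0.70, "rule_c": 0.60},
    "NE": {"rule_a": 0.80, "rule_b": 0.75},
    "SE": {"rule_a": 0.85, "rule_b": 0.90, "rule_d": 0.60},
    "SW": {"rule_a": 0.70, "rule_b": 0.80},
    "W":  {"rule_a": 0.75, "rule_b": 0.80, "rule_c": 0.70},
}

# A rule is treated as consistent if it was generated in every region.
consistent = set.intersection(*(set(rules) for rules in regional_rules.values()))

# Average the metric over regions for each consistent rule.
avg_conf = {
    rule: mean(regional_rules[region][rule] for region in regional_rules)
    for rule in consistent
}

best_rule = max(avg_conf, key=avg_conf.get)
print(sorted(consistent), best_rule, round(avg_conf[best_rule], 3))
```

The same pattern applies to average support or average lift by storing that metric in place of confidence.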
Conclusions
Fuzzy association rule mining has proven useful for this crime
application.
First experimental study of applying fuzzy association rule
mining to a crime data set.
Both frequent and rare rules are of interest.
New Relative Fuzzy Support (RFS) metric defined for rule post-pruning:
Achieves a 95.2% reduction in the final number of rules.
Rules discovered represent patterns of interest to law
enforcement officials.
Subject Matter Expert recommendation: “Law enforcement
personnel and analysts should further analyze the identified
set of surprising rules and the corresponding underlying data
in an attempt to better understand crime patterns and develop
more effective approaches to combat crime.”
Questions?
Contact info: Dr. Anna L. Buczak
Advantages of FARM
Fuzzy association rules work well with numerical and categorical data.
Fuzzy rules are easy for a human to understand.
FARM does not make any assumptions about the rules that are to be extracted, removing a bias that humans might have.
Use of fuzzy techniques makes fuzzy association rule mining resilient to noise and missing values.
Fuzzy rules have been shown to outperform crisp rules in many applications (e.g., fuzzy temperature controller).
Rule Support and Confidence
Rule: X → Y
Support (coverage) is the number of instances the rule predicts correctly, expressed as a proportion of all instances in the data set.
Support = number of transactions that contain both X and Y divided by the number of all transactions in the database (D).
$$\mathrm{Support}(X \rightarrow Y) = \frac{\#(X \cap Y)}{\#D}$$
Confidence (accuracy) is the number of instances that the rule predicts correctly, expressed as a proportion of all instances to which it applies.
Confidence = number of transactions that contain both X and Y divided by the number of transactions that contain X.
$$\mathrm{Confidence}(X \rightarrow Y) = \frac{\#(X \cap Y)}{\#X}$$
Confidence can be treated as the conditional probability of a transaction containing X also containing Y (P(Y|X)).
[Figure: Venn diagram of X, Y, and X∩Y within database D]
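The two definitions can be checked on a toy example. The sketch below uses crisp (non-fuzzy) transactions and counts, exactly as in the formulas on this slide; the five transactions are invented for illustration.

```python
# Toy database D: each transaction is a set of items (invented data).
transactions = [
    {"X", "Y"},
    {"X", "Y"},
    {"X"},
    {"Y"},
    {"Z"},
]


def count(items, data):
    """Number of transactions that contain every item in `items`."""
    return sum(1 for t in data if items <= t)


def support(x, y, data):
    """#(X ∩ Y) / #D: share of all transactions containing both X and Y."""
    return count(x | y, data) / len(data)


def confidence(x, y, data):
    """#(X ∩ Y) / #X: share of transactions with X that also contain Y."""
    return count(x | y, data) / count(x, data)


sup = support({"X"}, {"Y"}, transactions)
conf = confidence({"X"}, {"Y"}, transactions)
print(sup, conf)
```

Two of the five transactions contain both X and Y, and three contain X, so the rule X → Y has support 2/5 and confidence 2/3.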
Rule Lift
Rule: X → Y
Lift - measures the deviation from independence of X and Y.
Lift - ratio of the fraction of instances in which X and Y appear together to the product of the fraction in which X appears and the fraction in which Y appears.
$$\mathrm{Lift}(X \rightarrow Y) = \frac{\mathrm{Conf}(X \rightarrow Y)}{\mathrm{Expected\_Conf}(X \rightarrow Y)}$$
$$\mathrm{Expected\_Conf}(X \rightarrow Y) = \mathrm{Sup}(Y) = \frac{\#Y}{\#D}$$
$$\mathrm{Lift}(X \rightarrow Y) = \frac{\#(X \cap Y) / \#X}{\#Y / \#D} = \frac{\#(X \cap Y) \cdot \#D}{\#X \cdot \#Y}$$
Lift - values larger than 1.0 indicate that transactions containing the antecedent (X) tend to contain the consequent (Y) more often than transactions that do not contain the antecedent (X).
The higher the lift, the more likely that the existence of X and Y together is not just a random occurrence but because of a relationship between them.
[Figure: Venn diagram of X, Y, and X∩Y within database D]
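A minimal sketch of the lift computation from raw counts follows. The counts are invented; the function simply divides the rule's confidence by the expected confidence, Sup(Y), as defined on this slide.

```python
def lift(n_xy, n_x, n_y, n_d):
    """Lift of X -> Y from counts.

    n_xy: transactions containing both X and Y
    n_x:  transactions containing X
    n_y:  transactions containing Y
    n_d:  all transactions in the database D
    """
    conf = n_xy / n_x            # Conf(X -> Y)
    expected_conf = n_y / n_d    # Sup(Y), the confidence expected under independence
    return conf / expected_conf


# Invented counts: Y occurs in 4% of all transactions but in 40% of those with X.
related = lift(n_xy=20, n_x=50, n_y=40, n_d=1000)

# Invented counts where Y is exactly as frequent with X as overall.
independent = lift(n_xy=2, n_x=50, n_y=40, n_d=1000)

print(related, independent)
```

The relation also works in reverse: a rule reported with conf=0.67 and lift=15.2 implies an expected confidence of about 0.67 / 15.2 ≈ 0.044, i.e. the consequent holds for roughly 4% of all instances.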
US Regions
Community data were grouped into five regions:
Northeast: CT, DE, ME, MD, MA, NH, NJ, NY, PA, RI, and VT.
This subset covers 632 communities.
Southeast: AL, AR, FL, GA, KY, LA, MS, NC, SC, TN, VA, and WV. This subset covers 420 communities.
Midwest: IL, IN, IA, KS, MI, MN, MO, NE (no data), ND, OH, SD, and WI. This subset covers 513 communities.
Southwest: AZ, NM, OK, and TX. This subset covers 228
communities.
West: CA, CO, ID, MT (no data), NV, OR, UT, WA, and WY.
Experimental Setup
All attributes with a large number of missing values were removed.
Odds ratios between each remaining attribute and Violent Crimes, Murders, Robberies, and Assaults were computed. Attributes exhibiting small odds ratios were removed.
Similar attributes were omitted (e.g., from the attributes Divorced (%), Male Divorced (%), and Female Divorced (%), only Divorced (%) was kept).
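As a rough illustration of the odds-ratio screening step, the sketch below computes an odds ratio from a 2x2 table after each attribute and crime variable has been reduced to a yes/no flag. The binarization, the toy flags, and the handling of empty cells are all assumptions; the slides do not describe how the odds ratios were computed for these continuous variables.

```python
def odds_ratio(attribute_high, crime_high):
    """Odds ratio of crime_high given attribute_high, from paired boolean flags.

    Assumption: both variables were binarized beforehand (e.g., above or
    below some threshold); the source does not specify this step.
    """
    pairs = list(zip(attribute_high, crime_high))
    a = sum(1 for e, o in pairs if e and o)          # attribute high, crime high
    b = sum(1 for e, o in pairs if e and not o)      # attribute high, crime not high
    c = sum(1 for e, o in pairs if not e and o)      # attribute not high, crime high
    d = sum(1 for e, o in pairs if not e and not o)  # neither
    if b * c == 0:
        return float("inf")  # undefined ratio; treated here as unbounded
    return (a * d) / (b * c)


# Invented flags for eight communities.
attribute_high = [True, True, True, True, False, False, False, False]
crime_high = [True, True, True, False, True, False, False, False]

ratio = odds_ratio(attribute_high, crime_high)
print(ratio)
```

An odds ratio near 1 suggests the attribute says little about the crime variable, which is the kind of attribute the setup above describes removing.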
Examples of Membership Functions
[Figures: membership functions for Race: Hispanic (%), Homeless People in Shelters Per 100K Population, and Violent Crimes Per 100K Population]
Examples of Unsurprising All State Rules
Frequent rules:
Houses with Kids Living with Two Parents (High) & People Speaking No English (Low) → Murders (No), conf=0.61, lift=1.3, sup=0.221
Houses with Public Assistance Income (Low) & Houses with Kids Living with Two Parents (High) → Murders (No), conf=0.6, lift=1.3, sup=0.219
Rare rules:
People in Homeless Shelters (High) & People Commute Using Public Transit (High) → Robberies (High), conf=0.76, lift=42.1, sup=0.002
Houses with Public Assistance Income (High) & Kids Born to Never Married (High) → Robberies (High), conf=0.7, lift=38.7, sup=0.002
Houses with Public Assistance Income (High) & Kids Born to Never Married (High) → Murders (High), conf=0.74, lift=28.9, sup=0.002