Alvaro Cárdenas Fujitsu Laboratories
Dagstuhl Perspectives Workshop on
Machine Learning Methods for Computer Security September 2012
Evaluating Classifiers in
Traditional ML Evaluation is not Applicable in Adversarial Settings
In traditional ML practice, algorithms are trained and
evaluated under assumptions that do not hold in a security environment
You should not depend on attack examples
If a classifier is widely adopted, the “attack class” will change its behavior
True positive rate must be evaluated with care!
Training data might be poisoned by an attacker
Large class imbalance between normal and attack
Previous Work on Classifier Evaluation
Nelson, Joseph, Tygar, Rubinstein, et al.
AsiaCCS 2006, AISec 2011, LEET 2008, IMC 2009
Taxonomy introduction, refinement, and applications
Kloft, Laskov
AISec 2009, AISTATS 2010
Analytical attacks for robust evaluation
Biggio, Fumera, Giacinto, Roli, et al.
MCS 2009, 2011, IEEE SMC 2011
Empirical attacks for robust evaluation
Focus: generating attacks against the classifier
Missing: New metrics and worst undetected
Talk Outline
Case example of Classifier Evaluation for
Electricity Theft Detection
Evaluating Classifiers Analytically
Metric to account for imbalanced datasets
Key Points of Electricity Theft Use-Case
You cannot evaluate algorithms on detection rate, but
rather on their effectiveness against the worst undetected attacks
Ignore true positive rate metric
• No ROC, PR curves, AUC, Accuracy, F-score, etc.
Instead, assume we always get attacked and
measure the cost of the worst undetected attacks
Asymptotic behavior of data poisoning attacks
More Info:
Mashima, Cárdenas, RAID 2012
Cárdenas et al., AsiaCCS 2011 – worst undetected
Smart Grid Goals
Efficiency
Optimal use of assets: load shaping instead of load following
Green: integrate renewable generation
Reliability
Real-time, fine-grained state of the grid used to anticipate faults and provide better control
Customer Choice
Transparency: Fine-grained energy usage, prices, proportion of green generation, etc.
Smart appliances automated based on consumer preferences
Advanced Metering Infrastructure (AMI)
Replacing old mechanical electricity meters
with new digital meters
Enables frequent, periodic 2-way
communication between utilities and homes
[Figure: Smart Meter, Gateway (GW), Data Collection, Metering Server]
Motivation for ML in AMI for Security
Push back in prices
Billions of low-cost embedded devices
Can’t have fancy tamper protection
Security is hard to see
But, Situational Awareness is Fun to see
Understand the health of the system
Identify anomalies
AMI Gives Us More Data on Electricity Consumption
Construct models of “normal” consumption
Focus on Electricity Theft
Source: Investment and Financial Flows to Address Climate Change, United Nations.
Annex Parties: Developed Nations (Europe, NA, Japan). Non-Annex Parties: The Rest.
Attacks will happen: Devices are deployed for 20~30 years
Anomaly Detection Architecture
[Figure: Houses, Meters, Collector, Substation, Router, Fiber-optic network, Router, Private Cloud]
Smart Meters send consumption data frequently (e.g., every 15 minutes) to the utility
[Figure: Consumer 1 ... Consumer n, Electricity Usage, Meter Data Repository, Storage, Data Analytics, Anomaly Detection]
Case Study: Detection of Electricity Theft
Hardware:
Balance Meters
Tamper Evident Seals
Software:
Algorithms Trained without Attack Data
Unsupervised Learning
Unlabeled data Y1, . . . , Yn → Outlier Detection Algorithm → Outliers
Problems:
Easier to attack
More false positives
Example: LOF
Hypothesis Testing
We have prior knowledge of an attack invariant
We know attackers want to lower energy consumption
Include this information for the “bad” class
Examples: ARMA-GLR, CUSUM, EWMA
H0: P0
ARMA GLR Detector
We need to detect attack signals that are not only different
from the historical ARMA model, but signals that lower the reported electricity consumption
Given a sequence of observations,
Calculate the likelihood of H0 (normal) and H1 (attack)
Model H0 as an Autoregressive Moving Average (ARMA) model
In ARMA models, a change in the mean can be modeled as:
$$Y_{k+1} = \sum_{i=1}^{p} A_i Y_{k-i} + \sum_{j=0}^{q} B_j \left(V_{k-j} + \gamma\right)$$
where under $H_0$: $\gamma = 0$, and under $H_1$: $\gamma = -\delta$, $\delta > 0$.
$H_0$: $Y_1, \ldots, Y_n \sim P_0$
Selecting the Attack Probability Model
We do not know the magnitude of the attack
We cannot compute the likelihood of H1 until we find a way
to deal with this uncertainty
Idea: Use the Generalized Likelihood Ratio (GLR)
test:
Among the class of “attack” distributions, find the one that best matches the observation
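As a rough sketch of the GLR idea, assume the residuals (observed reading minus the model's one-step prediction) are i.i.d. Gaussian with known standard deviation `sigma`, and that an attack shifts their mean down by an unknown amount delta > 0. These simplifications, and the names `glr_statistic` and `glr_alarm`, are illustrative assumptions and not the detector evaluated in the talk. Maximizing the likelihood over delta gives the estimate max(0, -mean), and the log-likelihood ratio reduces to n * delta_hat^2 / (2 * sigma^2).

```python
from statistics import fmean


def glr_statistic(residuals, sigma):
    """One-sided GLR statistic for a negative mean shift in Gaussian residuals.

    H0: residuals ~ N(0, sigma^2); H1: residuals ~ N(-delta, sigma^2), delta > 0.
    The unknown delta is replaced by its maximum-likelihood estimate.
    """
    n = len(residuals)
    mean = fmean(residuals)
    # Best-matching "attack" distribution: shift size constrained to delta >= 0.
    delta_hat = max(0.0, -mean)
    # Log-likelihood ratio evaluated at delta_hat.
    return n * delta_hat ** 2 / (2.0 * sigma ** 2)


def glr_alarm(residuals, sigma, threshold):
    """Raise an alarm when the GLR statistic exceeds a chosen threshold."""
    return glr_statistic(residuals, sigma) > threshold


if __name__ == "__main__":
    print(glr_statistic([1.0, 2.0, 3.0], sigma=1.0))   # readings above prediction
    print(glr_statistic([-2.0, -2.0], sigma=1.0))      # readings below prediction
```

Because the shift is constrained to be a decrease, readings that are higher than predicted never raise the statistic, which encodes the prior knowledge that attackers want to lower reported consumption.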
Evaluation
Most machine learning algorithms assume a pool of negative examples and a pool of positive examples
to evaluate the tradeoff between false alarms and detection rate:
Problem: We Do Not Have Positive Examples
Because meters were just deployed, we do not
Our Proposal:
Find the worst possible undetected attack for
each classifier, and then find the cost (kWh Lost) of these attacks
Adversary Model
Real Consumption $Y_1, \ldots, Y_n$ → Fake Meter Readings $\hat{Y}_1, \ldots, \hat{Y}_n$ → Utility
Goal of attacker: Minimize energy bill:
$$\min_{\hat{Y}_1, \ldots, \hat{Y}_n} \sum_{i=1}^{n} \hat{Y}_i$$
Goal of attacker: Not being detected by classifier “C”:
$$C(\hat{Y}_1, \ldots, \hat{Y}_n) = \text{normal}$$
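The adversary model above can be made concrete for a very simple detector. The sketch below assumes a hypothetical average detector that raises an alarm when the mean reading falls below a threshold `tau`; the threshold form and all function names are assumptions for illustration. Under that assumption, the readings with the smallest total that still pass as normal are those pinned at `tau`.

```python
def average_detector(readings, tau):
    """Hypothetical classifier C: 'attack' if the mean reading is below tau."""
    mean = sum(readings) / len(readings)
    return "attack" if mean < tau else "normal"


def worst_undetected_attack(n, tau):
    """Fake readings minimizing the bill subject to C(readings) == 'normal'.

    For the average detector, any sequence with mean exactly tau is optimal;
    a constant sequence is the simplest one.
    """
    return [tau] * n


def energy_loss(real, fake):
    """Energy consumed but not billed (same unit as the readings)."""
    return sum(real) - sum(fake)


if __name__ == "__main__":
    real = [100.0, 120.0, 80.0, 100.0]
    fake = worst_undetected_attack(len(real), tau=60.0)
    print(average_detector(fake, tau=60.0), energy_loss(real, fake))
```

The cost of this attack, rather than a detection rate, is the quantity compared across classifiers.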
Real vs. Attack Signals
[Figure: electricity usage vs. time slot of day, for the real signal and the attack signal]
New Tradeoff Curve: No Detection Rates
[Figure: average loss per attack [Wh] vs. false positive rate for ARMA-GLR, Average, CUSUM, EWMA, and LOF]
Y-axis: Cost of Undetected Attacks (can be extended to other fields)
X-axis: False Positive Rate
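A tradeoff curve of this kind could be traced by sweeping a detector's threshold. The sketch below is a toy version under stated assumptions: a hypothetical average detector (alarm when the daily mean is below `tau`), a made-up list of honest daily traces, and a worst undetected attack that pins every reading at `tau`. None of the numbers come from the talk.

```python
def false_positive_rate(honest_days, tau):
    """Fraction of honest daily traces that the average detector flags."""
    flagged = sum(1 for day in honest_days if sum(day) / len(day) < tau)
    return flagged / len(honest_days)


def worst_case_loss(real_day, tau):
    """Loss when the attacker reports tau in every slot instead of real usage."""
    return sum(real_day) - tau * len(real_day)


def tradeoff_curve(honest_days, victim_day, thresholds):
    """One (false positive rate, cost of worst undetected attack) pair per threshold."""
    return [
        (false_positive_rate(honest_days, tau), worst_case_loss(victim_day, tau))
        for tau in thresholds
    ]


if __name__ == "__main__":
    honest = [[10, 10], [20, 20], [30, 30], [40, 40]]   # hypothetical traces
    for fpr, loss in tradeoff_curve(honest, [40, 40], [5, 15, 25]):
        print(fpr, loss)
```

Raising the threshold shrinks the room left for an undetected attack but flags more honest customers, which is the tradeoff the curve displays.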
Asymptotic Effects of Poisoning Attacks
[Figure: consumption (0.1 Wh) vs. time (hours)]
Online Learning
Concept Drift
Electricity consumption is a non-stationary distribution
We have to “retrain” models
Attacker can use undetected
attacks to poison training data
“Valid” Electricity Consumption
Undetected Attacks
Time
Re-train Classifier to Account for Concept Drift
Detecting Poisoning Attacks
Identify concept drift trends that help an attacker:
lower electricity consumption over time.
Countermeasure: linear regression of trend
Slope of regression was not a good discriminant
Determination coefficients worked!
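The regression countermeasure can be sketched with an ordinary least-squares fit of consumption against time, reporting both the slope and the determination coefficient (R²). One plausible reading of the result above, offered here as an assumption, is that a gradual poisoning attack produces a trend that a straight line explains well (high R²), while honest fluctuations do not. The function name and the sample series are illustrative only.

```python
def slope_and_r2(series):
    """Least-squares line through (0, y0), (1, y1), ...; returns (slope, R^2).

    Requires at least two points.
    """
    n = len(series)
    mean_x = (n - 1) / 2
    mean_y = sum(series) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(series))
    syy = sum((y - mean_y) ** 2 for y in series)
    slope = sxy / sxx
    # R^2 of a simple linear regression equals the squared correlation.
    r2 = 0.0 if syy == 0 else sxy ** 2 / (sxx * syy)
    return slope, r2


if __name__ == "__main__":
    print(slope_and_r2([10, 8, 6, 4, 2]))   # steadily decreasing series
    print(slope_and_r2([5, 9, 4, 8, 5]))    # fluctuating series
```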
[Figure: determination coefficient and slope of regression line, original vs. attack, for honest users and attackers]
Talk Outline
Case example of Classifier Evaluation for
Electricity Theft Detection
Evaluating Classifiers Analytically
Game Theory and Adversarial Classification
Oakland 2006, AAAI 2006, NIPS 2008, Infocom
2007, ToN 2009
Metric to account for imbalanced datasets
Adversarial Classification is a Game
Traditional ML
Nature makes first move:
• Statistics of classes
Classifier makes second move:
• Optimal classifier for the given statistical properties
Adversarial ML
Classifier makes first move:
• Optimal classifier for normal class and guessed attacks
Attacker makes second move:
• Modify attack after seeing classifier
A lot of previous work on game theory
How Can We Ensure That Classifier
Performance in Deployment is the Same (or Better) as Metric During Design?
The minimax strategy gives the safety level of the game for
player 1:
The smallest worst-case Φ (error) an attacker can inflict on the system
Step 1: Maximize Φ(D,A) over A (attack parameter) to obtain the worst-case attack A*
Step 2: Minimize Φ(D,A*) over a set of possible classifiers D
Provable bound of worst performance… D is
Example: Game Theory in ROC curves
Goal: Select Classifier that Minimizes Probability
of Error
Attacker has control of prior
Attacker Has Second Move
If you select Classifier 1 (h1),
Attacker selects p1 with Pr[Error |h1] = 0.3
If you select Classifier 2 (h2),
Attacker selects p2 with Pr[Error |h2] = 0.4
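The two-classifier example can be reproduced with a small minimax computation. For a classifier with false alarm rate PF and detection rate PD, the probability of error under prior p is p(1 - PD) + (1 - p)PF, which is linear in p, so an attacker who controls the prior pushes p to 0 or 1. The ROC points below are hypothetical values chosen so that the worst cases come out to 0.3 and 0.4; they are not taken from the talk's figure.

```python
def worst_case_error(pf, pd):
    """Max over prior p in [0, 1] of p*(1 - pd) + (1 - p)*pf.

    The expression is linear in p, so the maximum sits at an endpoint.
    """
    return max(pf, 1.0 - pd)


def minimax_classifier(roc_points):
    """Pick the classifier whose worst-case error is smallest.

    roc_points maps a classifier name to its (pf, pd) operating point.
    Returns (name, worst-case error).
    """
    worst = {name: worst_case_error(pf, pd) for name, (pf, pd) in roc_points.items()}
    best = min(worst, key=worst.get)
    return best, worst[best]


if __name__ == "__main__":
    # Hypothetical operating points for h1 and h2.
    roc = {"h1": (0.1, 0.7), "h2": (0.4, 0.9)}
    print(minimax_classifier(roc))
```

With these assumed points the defender selects h1, since the attacker can force at most an error of about 0.3 against it, compared with 0.4 against h2.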
Obtaining Full ROC Curve
Known to Neyman-Pearson, 1933
Intermission: “Real” ROC Curve
Barreno, Cárdenas, Tygar, NIPS 2007/08
Theorem: In general, optimal ROC has rules where n is the number of classifiers
In General, Easier to Reverse Optimization Order
Example: MAC-Layer Misbehavior
[Figure: backoff distributions: Expected, ASUS, Centrino, Digicom, Linksys, Dlink, Dlink]
Bianchi et al. Infocom 2007
Access Point
Lemma: The probability that the adversary
accesses the channel is:
Let
Then the attack pmfs p1 must belong to the
following set:
Adversary Model
Analytical form for optimal attack
The optimal p1 is
Attackers have not figured out the optimal attack
[Figure: backoff distributions: Expected, ASUS, Centrino, Digicom, Linksys, Dlink, Dlink, Optimal-Attack]
SPRT outperforms previous solutions
Expected time for a false positive Expected time to
Polygraph Test
A national security organization with 10000 employees has one traitor (intruder)
Assume Moe is tested with a 99% accurate polygraph test:
- PD = Pr[ A=1 | I=1 ] = 0.99
- 1-PF = Pr[ A=0 | I=0 ] = 0.99, so PF = Pr[ A=1 | I=0 ] = 0.01
Moe tests positive.
What is the probability that Moe is a traitor?
A) 0.99 B) 0.01 C) 0.50 D) ?
The base-rate fallacy problem
If Pr[ I=1 | A=1 ] = 0.01 sounds
counterintuitive, you are exhibiting the
base-rate fallacy syndrome
The fallacy: ignoring base-rates in your probability assessments
Small base-rates: crying wolf phenomenon
“It is easy to lie with statistics, but it is easier to lie without them.” (Dan Geer)
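The 0.01 posterior in the polygraph example follows from Bayes' rule. The worked step below uses the numbers stated above, with the base rate written as p = Pr[ I=1 ] = 1/10000 (the symbol p for the base rate is a notational choice here).

```latex
\Pr[I=1 \mid A=1]
  = \frac{P_D \, p}{P_D \, p + P_F \, (1 - p)}
  = \frac{0.99 \times 0.0001}{0.99 \times 0.0001 + 0.01 \times 0.9999}
  = \frac{0.000099}{0.010098}
  \approx 0.0098
```

Of roughly 101 employees expected to test positive, only about one is the traitor, so the answer is close to 0.01 even though the test is 99% accurate.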
Alternative Metric to Evaluate IDS?
PF can be misinterpreted
Count number of alarms? Good but heuristic
Replace ROC with graph with posterior
probability
Problems:
• How do you maximize the posterior probability?
• Uncertain p (probability of attack)
Precision-Recall curves
Are always computed empirically with base-rate
Evaluating the Performance of Intrusion Detection Systems
Metric                                     Field
ROC                                        Signal Processing
Cost sensitive eval. (Bayes risk)          Decision Theory/Operations Research
Intrusion Detection Capability (CID)       Information Theory
Bayesian Detection Rate, Pr[I|A] = PPV     Statistics
Distinguishability
New Metric: IDOC (B-ROC)
Tradeoff between PD = Pr[ A=1 | I=1 ] vs. Pr[ I=1 | A=1 ] for different base-rates
[Figure: PD vs. Pr[ I=1 | A=1 ], one curve per base-rate]
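A curve of this kind can be derived from an ordinary ROC curve by converting each (PF, PD) operating point into the posterior Pr[ I=1 | A=1 ] for each candidate base-rate. The sketch below is a minimal illustration; the function names and the sample ROC points are assumptions, not the construction from the original work.

```python
def posterior_given_alarm(pd, pf, base_rate):
    """Pr[I=1 | A=1] by Bayes' rule for one operating point."""
    true_alarms = pd * base_rate
    false_alarms = pf * (1.0 - base_rate)
    return true_alarms / (true_alarms + false_alarms)


def tradeoff_by_base_rate(roc_points, base_rates):
    """For each base-rate, list (PD, Pr[I=1 | A=1]) pairs along the ROC curve.

    roc_points is a list of (pf, pd) pairs; points that never raise an alarm
    (pf == 0 and pd == 0) are skipped because the posterior is undefined there.
    """
    return {
        p: [
            (pd, posterior_given_alarm(pd, pf, p))
            for pf, pd in roc_points
            if pf > 0 or pd > 0
        ]
        for p in base_rates
    }


if __name__ == "__main__":
    roc = [(0.0, 0.0), (0.001, 0.5), (0.01, 0.99)]   # hypothetical ROC points
    for p, curve in tradeoff_by_base_rate(roc, [1e-4, 1e-2]).items():
        print(p, curve)
```

Plotting one curve per base-rate shows how the same detector can look strong at a base-rate of 1% and nearly useless at 0.01%.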
CSA: Big Data Working Group
CSA Big Data Working Group Site
https://cloudsecurityalliance.org/research/big-data/
CSA, Big Data LinkedIn
http://www.linkedin.com/groups?home=&gid=4458215&trk=anet_ug_hm
Basecamp Project Collaboration Site Request Form
https://cloudsecurityalliance.org/research/basecamp/
5 Research Directions
Big data analytics for security intelligence
Privacy preserving/enhancing technologies
Big data-scale crypto
Big data cloud infrastructure and attack
surface reduction
The Road to Better Situational Awareness
Big Data Security/Analytics
Variety of Data, Security Intelligence
SIEM
Alarm Correlation
Intrusion Detection Systems
What is new in Big Data?
Traditional Systems
More rigid, predefined
schemas
Data gets deleted
Complex analyst queries
take long to complete
Others?
Big Data Promise
Structured and unstructured
data treated seamlessly
Keep data for historical
correlation (e.g., 10 years)
Faster query response times
Others?
Hadoop is the de facto open standard for big data at rest
Stream processing?
Participation welcome!
How Do We Achieve These Objectives?
Academic audience:
How do I convince my peers that the evaluation of a classifier:
1. Is technically sound
2. Follows a scientific method that considers the actions of an attacker
Industry audience:
How do I convince customers/industry partners that our
technology is valuable?
Business case for investing in new technology