Blacklisting and Blocking Sources of Malicious Traffic
Athina Markopoulou
University of California, Irvine
Joint work with Fabio Soldo, Anh Le @ UC Irvine
Outline
Motivation
Malicious Internet Traffic: Attack and Defense
Two Defense Mechanisms
– Proactive: Predictive Blacklisting
– Reactive: Source-Based Filtering
Conclusion
Malicious Traffic on the Internet
Compromising systems
– scanning, worms, website attacks, phishing, social engineering attacks, ...
Launching attacks
– spam, click-fraud, Denial-of-Service attacks, ...
Botnets
The solution requires many components
Monitoring and detection of malicious activity
– in the network and/or at hosts
– signature-based, behavioral analysis
Mitigation
– at the hosts: remove malicious code
– in the network: block, rate-limit, scrub malicious traffic
Defense at the edge of the network
[Figure: Network 1 to Network 4 connected through routers; the edge of each network runs logging, an IDS, and a firewall.]
Dshield Dataset
6 months of IDS and firewall logs from Dshield.org (May-Oct 2008)
~600 contributing networks, 60M+ source IPs, 400M+ logs
[Figure: contributing networks submit their logs to Dshield.org.]
Log fields: Time, Victim ID (contributor), Src IP, Dst IP, Src Port, Dst Port, Protocol, Flags
Outline
Background
Malicious Internet Traffic: Attack and Defense
Two Defense Mechanisms
– Proactive: Predictive Blacklisting
– Reactive: Source-Based Filtering
Conclusion
Predictive Blacklisting
Problem definition:
– Given past logs of malicious activity collected at various locations
– Predict sources likely to send malicious traffic to each victim network in the future.
Blacklist:
– list of “worst” (e.g. top-100) attack sources
Prediction vs Detection
Data analysis
Superposition of several behaviors
[Figure: number of alerts per day.]
A multi-level prediction model
Different predictors capture different patterns in the dataset:
– Model temporal dynamics
– Model spatial correlation between victims/attackers
Combine different predictors
Formulate as a Recommendation Systems problem
Recommender systems: example
Netflix: you rate movies and you get suggestions
Formulating Predictive Blacklisting as a Recommendation System (CF)
[Figure: rating matrix R, with one axis labeled Attackers/Users and the other Victims/Items; filled entries are known ratings and "?" marks entries to be predicted.]
Predictor I: (attacker, victim) pair
Temporal dynamics
Predictor I: (a, v) time series, r^{TS}_{a,v}(t)
Data analysis: repeated attacks within short time periods
Prediction:
– Use EWMA model to capture this temporal trend
– Accounts for the short memory of attack sources
– Computationally efficient
– Includes as special case t=1
[Equation lost in extraction. Its annotations: predicted activity; past activity at time t’ ≤ t.]
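The exact EWMA formula did not survive on the slide, so the following is only a minimal sketch of a standard exponentially weighted moving average over the past activity of one (attacker, victim) pair. The function name `ewma_forecast`, the smoothing parameter `alpha`, and the sample counts are assumptions made for illustration.

```python
from typing import Sequence


def ewma_forecast(history: Sequence[float], alpha: float = 0.5) -> float:
    """Forecast the next value from past activity, oldest first.

    Each past observation r(t') gets weight alpha * (1 - alpha) ** (t - t'),
    so recent activity dominates (the "short memory" of attack sources).
    """
    t = len(history)
    return sum(alpha * (1 - alpha) ** (t - 1 - i) * r for i, r in enumerate(history))


# Made-up daily attack counts for two (attacker, victim) pairs.
recent_burst = [0, 0, 0, 4, 5]
old_burst = [4, 5, 0, 0, 0]

print(ewma_forecast(recent_burst))  # recent activity yields a high score
print(ewma_forecast(old_burst))     # the same volume, seen earlier, decays
```

With `alpha = 1.0` the forecast reduces to the most recent observation; smaller values give older activity more influence.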
Predictor II: similar victims
Data analysis: victims share common attackers (spatial correlation)
– [Katti et al, IMC 2005], [Zhang et al, Usenix Security 2008]
[Figure: victims and their common attackers.]
Predictor II: similar victims
defining similarity
• Similarity of victims u,v captures:
– the number of common attackers
– and when they are attacked
Our approach:
[Figure: binary victim-by-attacker matrix for victims v1 to v4 and attackers a1 to a4, highlighting common attackers.]
Predictor II: similar victims
k-nearest neighbors (kNN), r^{KNN}_{a,v}(t)
Traditional kNN: “trust” your peers
– Identify k most similar victims (“neighbors”)
– Predict your rating based on theirs
New challenges due to time-varying ratings
Our approach:
[Equation lost in extraction. Its annotations: predicted activity; sum over the neighborhood of v; time series forecast given past logs; similarity between time-varying vectors.]
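Since the kNN equation itself is lost, the sketch below shows only the generic idea of a similarity-weighted average over the k most similar victims. The similarity here is plain cosine similarity over flattened (attacker, day) indicator vectors, chosen so that it reflects both which attackers are shared and when they were active; this is an assumption and not necessarily the similarity used in the talk. All names and data are hypothetical.

```python
import math
from typing import Dict, List


def cosine(x: List[float], y: List[float]) -> float:
    """Cosine similarity between two equally long vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny) if nx and ny else 0.0


def knn_predict(target: str, profiles: Dict[str, List[float]],
                forecasts: Dict[str, float], k: int = 2) -> float:
    """Similarity-weighted average of the neighbors' forecasts."""
    neighbors = sorted(
        ((cosine(profiles[target], p), u) for u, p in profiles.items() if u != target),
        reverse=True,
    )[:k]
    total = sum(sim for sim, _ in neighbors)
    if total == 0:
        return 0.0
    return sum(sim * forecasts[u] for sim, u in neighbors) / total


# Each profile is [a1 on day 1, a1 on day 2, a2 on day 1, a2 on day 2].
profiles = {
    "v1": [1, 1, 0, 0],
    "v2": [1, 1, 0, 0],
    "v3": [1, 0, 1, 0],
    "v4": [0, 0, 1, 1],
}
# Hypothetical time series forecasts of one attacker's activity at each neighbor.
forecasts = {"v2": 4.0, "v3": 2.0, "v4": 8.0}

print(knn_predict("v1", profiles, forecasts, k=2))
```

Victim v4 shares no activity with v1, so its high forecast does not influence the prediction for v1.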
Predictor III: Attackers-Victims
Co-clustering
Data analysis:
– group of attackers consistently target the same group of victims
– this behavior often persists over time
We used the Cross-Association (CA) method to automatically identify such groups
Predictor III: Attackers-Victims
Prediction, r^{EWMA-CA}_{a,v}(t)
Intuition:
– pairs (a,v) in dense clusters are more likely to occur
– use the density of the cluster as the predictor
[Equation lost in extraction.]
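As a rough illustration of the density intuition, the sketch below takes a co-clustering as given (for example the output of any co-clustering method) and scores a pair (a, v) by the density of the block it falls in. It does not implement Cross-Association and omits the EWMA smoothing over time; cluster labels and log data are made up.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

Block = Tuple[str, str]


def cluster_densities(pairs: Iterable[Tuple[str, str]],
                      attacker_cluster: Dict[str, str],
                      victim_cluster: Dict[str, str]) -> Dict[Block, float]:
    """Density of each (attacker cluster, victim cluster) block:
    observed (a, v) pairs divided by the number of possible pairs."""
    ones = Counter((attacker_cluster[a], victim_cluster[v]) for a, v in set(pairs))
    a_sizes = Counter(attacker_cluster.values())
    v_sizes = Counter(victim_cluster.values())
    return {
        (ca, cv): ones[(ca, cv)] / (a_sizes[ca] * v_sizes[cv])
        for ca in a_sizes
        for cv in v_sizes
    }


def density_score(a: str, v: str, densities: Dict[Block, float],
                  attacker_cluster: Dict[str, str],
                  victim_cluster: Dict[str, str]) -> float:
    """Score a pair by the density of the block it belongs to."""
    return densities[(attacker_cluster[a], victim_cluster[v])]


attacker_cluster = {"a1": "A", "a2": "A", "a3": "B"}
victim_cluster = {"v1": "X", "v2": "X", "v3": "Y"}
observed = [("a1", "v1"), ("a1", "v2"), ("a2", "v1"), ("a3", "v3")]

densities = cluster_densities(observed, attacker_cluster, victim_cluster)
# (a2, v2) was never observed, but it sits in a dense block.
print(density_score("a2", "v2", densities, attacker_cluster, victim_cluster))
```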
A multi-level prediction model
Summary
Different predictors capture different patterns:
– Temporal trends
• EWMA TS of (attacker, victim)
– Neighborhood models:
• KNN: Similarity of victims
• EWMA CA: Interaction of attackers-victims
Combining different predictors
– Weighted Average
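A minimal sketch of combining predictors by weighted average and cutting the result to a blacklist of fixed length. The weights, the predictor names, and the example scores are hypothetical, and the sketch assumes the per-predictor scores are already on comparable scales.

```python
from typing import Dict, List


def combine(scores_by_predictor: Dict[str, Dict[str, float]],
            weights: Dict[str, float]) -> Dict[str, float]:
    """Weighted average of per-predictor scores for every candidate source.
    A source that a predictor did not score counts as 0 for that predictor."""
    total = sum(weights.values())
    candidates = set().union(*(s.keys() for s in scores_by_predictor.values()))
    return {
        src: sum(weights[name] * s.get(src, 0.0)
                 for name, s in scores_by_predictor.items()) / total
        for src in candidates
    }


def top_k(scores: Dict[str, float], k: int) -> List[str]:
    """Blacklist: the k highest-scoring sources (ties broken by address)."""
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [src for src, _ in ranked[:k]]


# Scores for one victim from three predictors (made-up values).
scores = {
    "ts": {"192.0.2.1": 3.5, "192.0.2.2": 0.4},
    "knn": {"192.0.2.2": 3.0, "192.0.2.3": 1.0},
    "ca": {"192.0.2.1": 0.75},
}
weights = {"ts": 0.5, "knn": 0.3, "ca": 0.2}

combined = combine(scores, weights)
print(top_k(combined, 2))
```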
Performance Analysis
Baseline Blacklisting Techniques
• Local Worst Offender List (LWOL)
– Most prolific local attackers
– Reactive but not proactive
• Global Worst Offender List (GWOL)
– Most prolific global attackers
– Might contain irrelevant attackers
– Non-prolific attackers are elusive to GWOL
• Collaborative Blacklisting (HPB)
– [J. Zhang, P. Porras, J. Ullrich, “Highly Predictive Blacklisting”, USENIX Security 2008]
– Also implemented and offered as a service (HPB) by Dshield.org
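The two simplest baselines can be sketched in a few lines. The log format, a list of (victim, source) records, and the sample data are assumptions; HPB is not sketched here.

```python
from collections import Counter
from typing import List, Tuple

Log = Tuple[str, str]  # (victim network, source IP)


def lwol(logs: List[Log], victim: str, k: int) -> List[str]:
    """Local Worst Offender List: the k sources seen most often by this victim."""
    counts = Counter(src for v, src in logs if v == victim)
    return [src for src, _ in counts.most_common(k)]


def gwol(logs: List[Log], k: int) -> List[str]:
    """Global Worst Offender List: the k sources seen most often by anyone."""
    counts = Counter(src for _, src in logs)
    return [src for src, _ in counts.most_common(k)]


logs = [
    ("v1", "a"), ("v1", "a"), ("v1", "b"),
    ("v2", "c"), ("v2", "c"), ("v2", "c"),
    ("v3", "c"), ("v3", "a"),
]

print(lwol(logs, "v1", 1))  # only what v1 has already seen
print(gwol(logs, 1))        # the global top source never contacted v1
```

In this toy data the global top offender never attacked v1, which is the "irrelevant attackers" weakness noted above, while LWOL can only list sources v1 has already seen.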
Performance Analysis
60 days of Dshield logs, 5 days training, 1 day testing, BL length=1000
The combined method
– significantly improves the hit count (up to 70%, 57% on avg)
– exhibits less variation over time
[Figure: total hit count for the combined method, HPB, and GWOL.]
Predicting Attacks
what is the best we can do?
[Figure: per-contributor attack logs on training day t1 and test day t2; in the example shown, LocalUB(vi)=3 and GlobalUB=5.]
Local Upper Bound: # IPs in training & test window of a particular contributor
Global Upper Bound: # IPs in training window of any contributor
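One possible reading of the hit count and the two upper bounds, expressed with set operations. The interpretation (a hit is a blacklisted source that shows up at the victim in the test window, and the bounds count test-window sources that were visible somewhere in training) and the toy data are assumptions.

```python
from typing import Dict, Iterable, Set


def hit_count(blacklist: Iterable[str], test_sources: Set[str]) -> int:
    """Blacklisted sources that actually show up in the test window."""
    return len(set(blacklist) & test_sources)


def local_upper_bound(train: Dict[str, Set[str]], test: Dict[str, Set[str]],
                      victim: str) -> int:
    """Test-window sources the victim itself already saw during training."""
    return len(train[victim] & test[victim])


def global_upper_bound(train: Dict[str, Set[str]], test: Dict[str, Set[str]],
                       victim: str) -> int:
    """Test-window sources that any contributor saw during training."""
    seen_anywhere = set().union(*train.values())
    return len(seen_anywhere & test[victim])


train = {"v1": {"a", "b", "c"}, "v2": {"d", "e"}}
test = {"v1": {"a", "b", "d", "x"}, "v2": {"e"}}

print(local_upper_bound(train, test, "v1"))       # a and b
print(global_upper_bound(train, test, "v1"))      # a, b and d
print(hit_count(["a", "d", "z"], test["v1"]))     # a and d
```

Source "x" was never seen in training by anyone, so no blacklist built from these logs could have predicted it.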
Predicting Attacks
room for improvement
Collaboration helps!
Large gap from prior methods
Performance Analysis
Robustness achieved by diverse methods
robustness to random errors
E.g. an attacker may send traffic to a single victim (detected by temporal behavior) or to several victims (detected by spatial behavior), or may limit the attack activity
Predictive Blacklisting as a Recommendation System
Summary
Contributions
– Combined predictors that capture different patterns in the data
– Significant improvement with simple techniques
• still room for further improvement
– New formulation as a recommender system (collaborative filtering) problem
• paves the way to powerful techniques:
• e.g., capture global structure (latent factors), joint spatio-temporal models
References
– F. Soldo, A. Le, A. Markopoulou, "Predictive Blacklisting as an Implicit Recommendation System", IEEE INFOCOM 2010 and on arXiv.org
How to use a list of malicious sources?
• A policy decision:
– E.g. scrub, give lower priority, block, monitor, do nothing …
• One option is to block (filter) malicious sources
– when: during flooding attacks by million-node botnets
– where: at firewalls or at the routers
Outline
Background
Malicious Internet Traffic: Attack and Defense
Two Defense Mechanisms
– Proactive: Predictive Blacklisting
– Reactive: Optimal Source-Based Filtering
Conclusion
Filtering at the routers
• Access Control Lists (ACLs)
– Match a packet header against rules, e.g. source and destination IP addresses
– Source-based filter: ACL that denies access to a source IP/prefix
• Filters implemented in TCAM
– Can keep up with high speeds
– Limited resource
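A software sketch of a source-based filter list, using Python's standard `ipaddress` module. It only mimics the matching semantics and the limited number of entries; it says nothing about how a TCAM performs the lookup. The class name and the `max_filters` parameter are hypothetical.

```python
import ipaddress
from typing import List


class SourceFilterACL:
    """A deny-list of source prefixes with a hard limit on its size."""

    def __init__(self, denied_prefixes: List[str], max_filters: int) -> None:
        if len(denied_prefixes) > max_filters:
            raise ValueError("more filters requested than the budget allows")
        self.denied = [ipaddress.ip_network(p) for p in denied_prefixes]

    def permits(self, src_ip: str) -> bool:
        """True if no deny rule matches the packet's source address."""
        addr = ipaddress.ip_address(src_ip)
        return not any(addr in net for net in self.denied)


acl = SourceFilterACL(["10.0.0.2/31", "192.0.2.7/32"], max_filters=4)
print(acl.permits("10.0.0.3"))   # inside 10.0.0.2/31, so denied
print(acl.permits("10.0.0.4"))   # outside every denied prefix
```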
Filter Selection at a Single Router
tradeoff: number of filters vs. collateral damage
[Figure: attackers and legitimate users behind an ISP edge router; the two filter choices shown are a single attack source A.B.C.D or a whole prefix A.B.C.*]
Optimal Source-Based Filtering
Design a family of filter selection algorithms that:
• take as input:
– a blacklist of malicious (bad) sources
– a whitelist of legitimate (good) sources
– a constraint on the number of filters Fmax
– a constraint on the access bandwidth C
– the operator’s policy
• optimally select which source IP prefixes to filter
– so as to optimize the operator’s objective
– subject to the constraints
[Figure: the IP address space from 0 to 2^32-1, showing a single address A.B.C.D and a prefix A.B.C.*]
Optimal Source-Based Filtering
A General Framework
[l,r]: range in the IP space
p/l: prefix p of length l
Fmax: number of filters (<<N)
x_[l,r]: whether we block range [l,r] or not
Wi: weight assigned to source IP address i
Optimal Source-Based Filtering
Expressing Operator’s Policy
• Assignment of weights Wi is the operator’s knob:
– indicates volume of traffic sent, or importance assigned by the operator
– Wi>0 (good source i), Wi<0 (bad source i), Wi=0 (indifferent)
• Objective function
[Equation lost in extraction. Its terms are the cost of good sources in range [l,r] and the cost of bad sources in range [l,r].]
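Because the objective function did not survive extraction, the sketch below shows only one plausible reading of the sign convention: blocking a range costs the weight of the good sources in it and gains the magnitude of the weight of the bad sources in it, so the net effect is the sum of Wi over the range. The function name and the weights are made up.

```python
from typing import Dict, Tuple


def range_cost(weights: Dict[int, float], l: int, r: int) -> Tuple[float, float, float]:
    """Return (good cost, bad benefit, net) of blocking addresses l..r inclusive.

    weights maps an integer address to W_i: positive for good sources,
    negative for bad sources. net = good - bad = sum of W_i in the range.
    """
    inside = [w for ip, w in weights.items() if l <= ip <= r]
    good = sum(w for w in inside if w > 0)
    bad = sum(-w for w in inside if w < 0)
    return good, bad, good - bad


# Addresses are shown as small integers for readability (made-up weights).
weights = {2: -5.0, 3: -2.0, 4: 3.0, 7: -1.0}

print(range_cost(weights, 2, 3))  # a tight range around two bad sources
print(range_cost(weights, 0, 7))  # a wider range also catches a good source
```

Under this reading, the wider range blocks one more bad source but pays for the good source it covers, so its net value is worse than that of the tight range.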
Filter Selection Algorithms
Problem Overview
• RANGE-based: filter IP or range [l,r]
[Soldo, El Defrawy, Markopoulou, Van De Merwe, Krishnamurthy: ITA’09]
– FILTER-ALL-RANGE
– FILTER-SOME-RANGE
– FILTER-ALL-DYNAMIC-RANGE
• PREFIX-based: filter IP source or prefix
[Soldo, Markopoulou, Argyraki: INFOCOM’09, arXiv.org]
– FILTER-ALL: block all malicious sources
– FILTER-SOME: block some malicious sources
– FILTER-ALL-DYNAMIC: BL varies over time
– FLOODING: bandwidth constraint at access router
Filter Selection Algorithms
Algorithms Overview
• RANGE-based: filter IP or range [l,r]
[Soldo, El Defrawy, Markopoulou, Van De Merwe, Krishnamurthy: ITA’09]
– FILTER-ALL-RANGE
– FILTER-SOME-RANGE
– FILTER-ALL-DYNAMIC-RANGE
• PREFIX-based: filter IP source or prefix
[Soldo, Markopoulou, Argyraki: INFOCOM’09, arXiv.org]
– FILTER-ALL: O(N)
– FILTER-SOME: O(N)
– FILTER-ALL-DYNAMIC: O(N)
– FLOODING: NP-hard, pseudo-polynomial alg. O(C^2 N) + heuristic
– DISTRIBUTED-FLOODING: distributed solution
Longest Common Prefix Tree of a BL
• LCP-Tree(BL): binary tree, leaves are addresses in BL, intermediate nodes are their longest common prefixes
• It can be found from the full binary tree of IP prefixes
• E.g. for BL={10.0.0.2, 10.0.0.3, 10.0.0.7}, the LCP-Tree(BL) is:
– root 10.0.0.0/29 (3 bad, 5 good addresses), with children 10.0.0.2/31 and 10.0.0.7/32
– 10.0.0.2/31 (0 good, 2 bad addresses), with children 10.0.0.2/32 and 10.0.0.3/32
• Finding a set of filters:
– no need to look for all possible sets of prefixes
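The LCP-Tree of the example blacklist can be built with a short recursive sketch: the node for a set of addresses is the longest common prefix of its smallest and largest member, and the set is then split on the first bit where members differ. The `Node` class and function names are hypothetical; this is one straightforward construction, not necessarily the one used by the authors.

```python
import ipaddress
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    prefix: ipaddress.IPv4Network
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def common_prefix(lo: int, hi: int) -> ipaddress.IPv4Network:
    """Longest prefix shared by two IPv4 addresses given as integers."""
    host_bits = (lo ^ hi).bit_length()           # bits below the first difference
    base = (lo >> host_bits) << host_bits        # clear the differing bits
    return ipaddress.IPv4Network((base, 32 - host_bits))


def lcp_tree(addrs: List[int]) -> Node:
    """Build the LCP-Tree of sorted, distinct IPv4 addresses (as integers)."""
    node = Node(common_prefix(addrs[0], addrs[-1]))
    if len(addrs) > 1:
        bit = 1 << (31 - node.prefix.prefixlen)  # first bit after the prefix
        node.left = lcp_tree([a for a in addrs if not a & bit])
        node.right = lcp_tree([a for a in addrs if a & bit])
    return node


def build(blacklist: List[str]) -> Node:
    return lcp_tree(sorted({int(ipaddress.IPv4Address(a)) for a in blacklist}))


root = build(["10.0.0.2", "10.0.0.3", "10.0.0.7"])
print(root.prefix, root.left.prefix, root.right.prefix)
```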
Filter-All-Prefix
Problem Statement
• Given: a blacklist BL, weight wi (for each good IP i), Fmax filters
• choose: prefixes p/l (x_{p/l})
Filter-All-Prefix
Dynamic Programming Algorithm
Cost of optimal allocation of F filters within a prefix p
[Figure: prefix p with left subtree sL and right subtree sR; F-n ≥ 1 filters within the left subtree, n ≥ 1 filters within the right subtree.]
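The recursion described above can be sketched as a memoized function over sets of blacklisted addresses: either one filter blocks the whole longest common prefix, or the F filters are split as F-n for the left subtree and n for the right subtree. The sketch assumes every non-blacklisted address is good with weight 1 (as in the "3 bad, 5 good" example), and it makes no attempt to reach the O(N) running time quoted for FILTER-ALL.

```python
from functools import lru_cache
from typing import Iterable, Tuple

INF = float("inf")


def filter_all_cost(blacklist: Iterable[int], f_max: int) -> float:
    """Fewest good addresses blocked when all blacklisted addresses
    (integers) must be covered by at most f_max prefix filters."""

    @lru_cache(maxsize=None)
    def best(addrs: Tuple[int, ...], f: int) -> float:
        if f < 1:
            return INF
        if len(addrs) == 1:
            return 0  # a /32 filter blocks only the bad address
        host_bits = (addrs[0] ^ addrs[-1]).bit_length()
        block_whole = (1 << host_bits) - len(addrs)  # good addresses in the LCP
        bit = 1 << (host_bits - 1)                   # first bit where members differ
        left = tuple(a for a in addrs if not a & bit)
        right = tuple(a for a in addrs if a & bit)
        split = min((best(left, f - n) + best(right, n) for n in range(1, f)),
                    default=INF)
        return min(block_whole, split)

    return best(tuple(sorted(set(blacklist))), f_max)


# Low-order bits of 10.0.0.2, 10.0.0.3, 10.0.0.7 (the shared high bits cancel out).
bl = [2, 3, 7]
for f in (1, 2, 3):
    print(f, filter_all_cost(bl, f))
```

With one filter the only option is the /29, which blocks 5 good addresses; with two filters the /31 and the /32 cover all bad addresses with no collateral damage.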
Filter-All-Prefix
DP Algorithm: Example
[Figure: example with Fmax = 4, N = 10.]
Filter-Some-Prefix
[Figure: example with Fmax = 4, N = 10.]
Filter-All-Prefix-Dynamic
Time-varying case
[Figure: example with Fmax = 4, N = 10. Need to be (re)computed: O(Fmax log(N)).]
FLOODING
Problem Statement
• Given: a blacklist BL, a whitelist WL, a weight of address = traffic volume generated, a constraint on the link capacity C, and Fmax filters
• choose: source IP prefixes, x_{p/l}
• so as to: minimize the collateral damage
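To make the constraints concrete, here is a brute-force illustration on a toy instance. It is not the dynamic programming algorithm from the talk: it assumes a fixed set of disjoint candidate prefixes, reads the capacity constraint as "traffic left after filtering must fit in C", and measures collateral damage as the good traffic blocked. All names and volumes are made up.

```python
from itertools import combinations
from typing import Dict, Optional, Tuple

# prefix name -> (good traffic volume, bad traffic volume)
Candidates = Dict[str, Tuple[float, float]]


def flooding_brute_force(candidates: Candidates, capacity: float,
                         f_max: int) -> Optional[Tuple[float, Tuple[str, ...]]]:
    """Try every set of at most f_max prefixes; return (blocked good traffic,
    chosen prefixes) for the cheapest feasible set, or None if none is feasible."""
    total = sum(g + b for g, b in candidates.values())
    best = None
    for size in range(f_max + 1):
        for chosen in combinations(sorted(candidates), size):
            blocked_good = sum(candidates[p][0] for p in chosen)
            blocked_all = sum(sum(candidates[p]) for p in chosen)
            if total - blocked_all <= capacity:      # remaining traffic fits
                if best is None or blocked_good < best[0]:
                    best = (blocked_good, chosen)
    return best


candidates = {"A": (1, 40), "B": (3, 50), "C": (0, 8), "D": (5, 2)}

print(flooding_brute_force(candidates, capacity=10, f_max=2))  # too few filters
print(flooding_brute_force(candidates, capacity=10, f_max=3))
```

With two filters no choice brings the traffic under the capacity; with three, blocking A, B and C is feasible and blocks less good traffic than any alternative.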
FLOODING
DP Algorithm
• FLOODING is NP-hard
– reduction from knapsack with cardinality constraint (1.5K)
• An optimal pseudo-polynomial dynamic programming algorithm solves the problem in O((C Fmax)^2 N)
– similar to the previous DP but solve 2-dimensional KP
– the LCP-Tree includes both good and bad addresses
Distributed Flooding
filters at several routers
• Deploy filters at several routers
– increase total filter budget
• Each router (u) has its own:
– view of good/bad traffic
– capacity in incoming link
– filter budget
• Filtering at several routers:
– not only which prefix to block
– but also on which router
• Solution:
– can be solved in a distributed way
– outperforms independent decisions
[Figure: attackers sending traffic through several routers toward the victim.]
Evaluation using Dshield data
FLOODING vs. rate limiting
• Attack sources, from the point of view of a single victim in Dshield
• Good sources: [Kohler et al. TON’06, Barford et al. PAM’06]
• Before attack: good traffic was C/10 < C
• During attack: bad traffic is 10C
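A back-of-the-envelope check of why rate limiting hurts in this setting. It assumes rate limiting drops all traffic uniformly until the link fits its capacity; the slide does not say how rate limiting was modeled, so this is only an illustration.

```python
def surviving_fraction(good: float, bad: float, capacity: float) -> float:
    """Fraction of every flow that gets through if the link drops
    traffic uniformly down to its capacity."""
    offered = good + bad
    return min(1.0, capacity / offered)


C = 1.0  # work in units of the link capacity

before = surviving_fraction(good=C / 10, bad=0.0, capacity=C)
during = surviving_fraction(good=C / 10, bad=10 * C, capacity=C)

print(before)  # no congestion before the attack
print(during)  # roughly a tenth of the good traffic survives
```

Under this assumption about 90% of the legitimate traffic is lost during the attack, even though it would fit in a tenth of the link on its own; this is the gap that selective filtering of bad sources tries to close.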
Intuition why optimization helps
compared to non-optimized filtering
• Malicious sources are clustered in the IP address space
• Malicious sources are not co-located with legitimate sources
Evaluation using Dshield data (2)
FILTER-ALL-PREFIX vs. generic clustering algorithms
• Malicious addresses:
– attacking 2 specific victim networks (most and least clustered) in Dshield dataset
• Good addresses generated:
Evaluation using Dshield data (3)
DISTRIBUTED-FLOODING: the value of coordination
[Figure: plot with an axis labeled D/N.]
Optimal Source-Based Filtering
Summary
• Framework for optimal filter selection
– defined various filtering problems
– designed efficient algorithms to solve them
• Leads to significant improvements on real datasets
– Compared to non-optimized filter selection, to generic clustering, or to uncoordinated routers
Outline
Background
Malicious Internet Traffic: Attack and Defenses
Two Defense Mechanisms
– Proactive: Blacklisting as a Recommendation System
– Reactive: Filtering as an Optimization Problem
Conclusion
Conclusion
Parts of a larger system that collects and analyzes data from multiple sensors and takes appropriate action