Why filter content?
• Information overload
• Specific personal interests
• General signal-to-noise ratio
. . . affected by unwanted content, usually commercial or advertisement-based. . .
• Information overload – too much content, or too many content items to handle, even though none of it is specifically unwanted, e.g. message digests of mailing lists, Kernel Traffic and friends.
• Specific personal interest – lots of content, much of it about things we are not interested in, although that content may interest other people. E.g. on a developers’ mailing list, we may be interested in bugfixes and problems with a Linux version, but couldn’t care less about VMS.
• General signal-to-noise ratio – SNR: the ratio of content in which the userbase is interested to content in which it is uninterested.
Why is “spam” such a problem?
• Not just email – Usenet / Netnews also suffers.
• Email: significant increase, 48% in last 12 months
Half of all emails monitored by MessageLabs in May 2003 were spam. The June figure appears lower so far, but the month is not yet over.
Email filtration options
• Killfiles / Blacklists – simplistic header-based filter
– Spammers regularly spoof headers, so not much help.
• Precise hash matches (e.g. Vipul’s Razor)
– Spammers regularly insert “hashbusters” into their content.
But collaborative filtering is not without merit. . .
• Regexp-based content matching and server blacklisting
(e.g. SpamAssassin)
– Very effective, but suffers due to static heuristic rules.
• Killfiles are still used because users understand them REALLY well, despite their lack of effectiveness. Still useful for deliberately blocking posts from contributors who rub you up the wrong way, but useless for spam.
• Concept of matching content: right direction but not foolproof. Spamware agents have the ability to insert “hashbusters”, which are deliberately designed to throw off trivial hash-collision detection methods. Note the benefits of collaborative filtering though; when it works, it’s good.
• Discuss SA rules – body and header matching, e.g. Nigerian spam, mail-client spoofing.
• Effectiveness is excellent but errors remain possible. Quite a lot of “confirmation” emails (e.g. Easyjet, Ryanair) get misclassified because they match the heuristics. Equally, if spam comes along which doesn’t match the static rules, it’s not detected.
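The hashbuster problem noted above can be illustrated with a minimal sketch. It assumes SHA-256 as a stand-in for whatever exact hash a precise-match system uses; the message text and the appended token are invented for illustration.

```python
import hashlib


def exact_digest(body: str) -> str:
    """Return a hex digest that changes completely if any byte changes."""
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


spam = "Cheap loans approved instantly. Reply now to claim your offer."
# Hypothetical hashbuster: a random-looking token appended per recipient.
busted = spam + "\n\nqzx-48213-lmv"

# Digests of content already reported as spam.
known_spam_digests = {exact_digest(spam)}

print(exact_digest(spam) in known_spam_digests)    # True: identical copy is caught
print(exact_digest(busted) in known_spam_digests)  # False: hashbusted copy slips through
```

A single appended token is enough to defeat an exact match, which is why a precise hash alone is fragile.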
The dynamic solution
• Static rules can make an “educated guess” as to what
the user thinks may be spam. . .
• . . . but the only way to find out precisely is to have the
user tell us.
• The user’s wishes are unlikely to be codifiable as a set of
static rules – we must find a different way.
Project Objectives
• Implementation of a content filter for mail and news,
controlled and influenced by the individual user.
• Content filtration by statistical classification and
distribution of content hashes
• Investigation of statistical classification as applied to
System Architecture
[Figure: system architecture diagram. Labelled components: Mail Handler, News Handler, Spam Handler, Collab Handler, Content Mgmt, Clients, Management Interface, and a Core containing the Bayesian Classifier and Collaborative Filter. Labelled flows: Incoming Mail, Incoming News, Filtered Mail, Filtered News, Collab Messages.]
Statistical filtering
• Analyze a set of examples which the user tells us are
either spam or non-spam.
• Calculate the spam probability of each word in the examples based on how often it appears in spam and in non-spam content.
• e.g. “Click” appears in 939 out of 2,355 spam examples and 113 out of 4,787 non-spam examples.
\[ p_{\text{spam}} = \frac{\frac{939}{2355}}{\frac{113}{4787} + \frac{939}{2355}} = 0.9441 \]
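The per-word calculation above can be reproduced with a short sketch. The function name and the neutral 0.5 fallback for words seen in neither set are assumptions made for illustration, not details taken from the project.

```python
def word_spam_probability(spam_hits, spam_total, ham_hits, ham_total):
    """Relative frequency of a word in spam versus non-spam examples."""
    spam_rate = spam_hits / spam_total
    ham_rate = ham_hits / ham_total
    if spam_rate + ham_rate == 0:
        return 0.5  # assumption: a word seen in neither set is treated as neutral
    return spam_rate / (spam_rate + ham_rate)


# The "Click" example from the slide.
p_click = word_spam_probability(939, 2355, 113, 4787)
print(round(p_click, 4))  # 0.9441
```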
The Naïve Bayesian Classifier
• To test a content item, look up every word of the new content in the probability table we created.
• Find the most extreme n probabilities (those closest to 0
or 1)
• Use the word probabilities as likelihood indicators for the
new content being spam.
\[ P_{\text{spam}} = \frac{\prod_{k=1}^{n} p_k}{\prod_{k=1}^{n} p_k + \prod_{k=1}^{n} (1 - p_k)} \]
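A minimal sketch of the combination step, assuming the word probabilities have already been looked up. The function names and the clamping of probabilities to [0.01, 0.99] (to avoid a zero denominator) are illustrative assumptions rather than details of the project.

```python
from math import prod


def most_extreme(probs, n):
    """Pick the n probabilities furthest from the neutral value 0.5."""
    return sorted(probs, key=lambda p: abs(p - 0.5), reverse=True)[:n]


def combined_spam_probability(probs):
    """Combine word probabilities with the naive Bayes formula above."""
    # Assumption: clamp so that a 0 and a 1 together cannot zero both products.
    clamped = [min(max(p, 0.01), 0.99) for p in probs]
    spam = prod(clamped)
    ham = prod(1 - p for p in clamped)
    return spam / (spam + ham)


word_probs = [0.9441, 0.99, 0.2, 0.5, 0.6]  # hypothetical lookups for one message
selected = most_extreme(word_probs, 3)      # [0.99, 0.9441, 0.2]
score = combined_spam_probability(selected)
print(selected, score > 0.9)
```

Neutral words (near 0.5) barely move the result, which is why only the most extreme n are used.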
Collaborative filtration
• Users generally belong to some form of community
• The same spam content may reach more than one
member of the community
• Time delay in mail handling works to our advantage
• Can we share knowledge within communities to
Better content matching
• Current hash-detection systems fail too readily
• Need a function such that:
– If content a and b are substantively similar. . .
– . . . their hash values α and β are arithmetically similar.
• A fuzzy hash – a hash whose values can be compared quantitatively with one another.
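One way to obtain the property described above is sketched here. It is a stand-in technique (hashed character shingles compared by Jaccard similarity), not the project's actual fuzzy hash; the function names, the shingle width, and the sample messages are all assumptions.

```python
import hashlib
import re


def shingle_signature(text, width=5):
    """Signature: short digests of every overlapping run of `width` characters."""
    normalized = re.sub(r"\s+", " ", text.lower()).strip()
    if len(normalized) <= width:
        shingles = {normalized}
    else:
        shingles = {normalized[i:i + width]
                    for i in range(len(normalized) - width + 1)}
    return frozenset(hashlib.sha256(s.encode("utf-8")).hexdigest()[:8]
                     for s in shingles)


def similarity(sig_a, sig_b):
    """Jaccard similarity of two signatures, between 0.0 and 1.0."""
    union = sig_a | sig_b
    return len(sig_a & sig_b) / len(union) if union else 1.0


original = "Buy cheap pills now! Limited offer, click here to order today."
busted = original + " xq7zk2"  # hypothetical hashbuster suffix
unrelated = "Minutes of the kernel developers' meeting are attached for review."

sim_busted = similarity(shingle_signature(original), shingle_signature(busted))
sim_unrelated = similarity(shingle_signature(original), shingle_signature(unrelated))
print(sim_busted > 0.7, sim_unrelated < 0.3)
```

Because only the shingles touching the inserted token change, a small edit produces a small change in the signature, unlike an exact digest.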
Using fuzzy hashing in collaboration
• Alice receives an email, which is detected as spam.
• Alice’s mail filter hashes the content, notes the hash,
and sends it on to any interested collaborators.
• Bob’s mail filter receives a collaborative message
regarding the new spam. It notes the hash.
• Bob then receives an email. The email is hashed, and
compared with those it knows about.
• Bob’s mail filter discovers the new mail is a 98% match
with the spam Alice’s filter told it about.
• Bob has set his hash match threshold to 70%, so the
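The Alice and Bob exchange can be sketched as follows. This is an illustration under stated assumptions: the `CollabFilter` class, its method names, and the crude word-set signature are hypothetical, and transport and trust between collaborators are left out entirely.

```python
import hashlib


def signature(text):
    """Crude stand-in for a fuzzy hash: short digests of the message's words."""
    return frozenset(hashlib.sha256(w.encode("utf-8")).hexdigest()[:8]
                     for w in text.lower().split())


def match(sig_a, sig_b):
    """Fraction of shared entries between two signatures (Jaccard)."""
    union = sig_a | sig_b
    return len(sig_a & sig_b) / len(union) if union else 1.0


class CollabFilter:
    """Hypothetical per-user filter that shares spam signatures with peers."""

    def __init__(self, threshold=0.70):
        self.threshold = threshold  # user-chosen hash match threshold
        self.known_spam = []        # signatures noted locally or from peers
        self.peers = []             # interested collaborators

    def report_spam(self, text):
        """Note the hash of detected spam and send it on to collaborators."""
        sig = signature(text)
        self.known_spam.append(sig)
        for peer in self.peers:
            peer.receive_report(sig)

    def receive_report(self, sig):
        """Note a hash received in a collaborative message."""
        self.known_spam.append(sig)

    def is_known_spam(self, text):
        """Compare new mail against every known signature."""
        sig = signature(text)
        return any(match(sig, known) >= self.threshold
                   for known in self.known_spam)


alice, bob = CollabFilter(), CollabFilter(threshold=0.70)
alice.peers.append(bob)
alice.report_spam("Earn money fast from home with this one simple trick click now")

near_copy = "Earn money fast from home with this one simple trick click now zx81"
print(bob.is_known_spam(near_copy))                          # True
print(bob.is_known_spam("Meeting moved to Thursday at ten"))  # False
```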
Implementation Challenges
• Homogenization of content from various protocols –
abstract message format
• PGP integration for trustworthy collaboration
• News protocol implementation
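The abstract message format mentioned above might look something like the sketch below. The `AbstractMessage` fields and the `from_rfc822` helper are hypothetical; the sketch relies only on the fact that mail and news articles share a header-and-body layout that Python's standard `email` parser can read.

```python
from dataclasses import dataclass
from email import message_from_string


@dataclass
class AbstractMessage:
    """Protocol-independent view of a content item (hypothetical fields)."""
    protocol: str
    sender: str
    subject: str
    body: str


def from_rfc822(raw, protocol):
    """Build an AbstractMessage from header-and-body text (mail or news)."""
    msg = message_from_string(raw)
    if msg.is_multipart():
        # Keep only the plain-text parts; attachments are ignored in this sketch.
        parts = [p.get_payload() for p in msg.walk()
                 if p.get_content_type() == "text/plain"]
        body = "\n".join(parts)
    else:
        body = msg.get_payload()
    return AbstractMessage(protocol, msg.get("From", ""),
                           msg.get("Subject", ""), body)


mail = from_rfc822("From: alice@example.org\nSubject: Hello\n\nMail body\n", "mail")
news = from_rfc822("From: bob@example.org\nNewsgroups: comp.os.linux\n"
                   "Subject: Patch\n\nNews body\n", "news")
print(mail.subject, "|", news.subject)
```

Once both protocols reduce to the same structure, the classifier and the hash matcher only need to handle one input type.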
Results
• Like-for-like testing:
– My filter: 75% accuracy with no false positives
– SpamAssassin: 90% accuracy with no false positives
• Hard to test collaborative filtering
• Reasonable performance but not really comparable with
Demonstration
Further Work
• Optimization of configuration variables
– Token thresholds, number of tokens used in testing.
• Optimization of fuzzy hash matching algorithm
– Slow due to attempted rolling window matches
• Addition of other protocols
– Web-based bulletin boards?
• User interface extensions
– Provide a “usable” mail/news client
• SpamAssassin for news, meta-filtration
– Infrastructure could apply SpamAssassin to news, refactor to allow multiple content testing methods.
Summary
• A content filter which functions acceptably
• Bayesian filtering and fuzzy hash matching are useful
• Sole use of these technologies may not be sufficient
• Combining filters likely to be the best solution