Multi-Protocol Content Filtering

Matthew Johnson <[email protected]>

MEng Individual Project

Why filter content?

• Information overload

• Specific personal interests

• General signal-to-noise ratio

. . . affected by unwanted content, usually commercial or advertisement-based. . .

• Information overload – too much content, or too many content items to handle, but nothing we specifically don’t want to know about (message digests of mailing lists, kerneltraffic and friends).

• Specific personal interest – lots of content, but much of it is about things we are not interested in, even though it may be interesting to other people. E.g. on a developers’ mailing list, we may be interested in bugfixes and problems with a Linux version, but couldn’t care less about VMS.

• General signal-to-noise ratio – SNR: the ratio of content in which the userbase is interested to content in which it is uninterested.

Why is “spam” such a problem?

• Not just email – Usenet / Netnews also suffers.

• Email: significant increase, 48% in last 12 months

Half of all emails monitored by MessageLabs in May 2003 were spam. The figure for June appears lower so far, but the month is not yet over.

Email filtration options

• Killfiles / Blacklists - simplistic header-based filter

– Spammers regularly spoof headers – not much help.

• Precise hash matches (e.g. Vipul’s Razor)

– Spammers regularly insert “hashbusters” into their content.

But collaborative filtering is not without merit. . .

• Regexp-based content matching and server blacklisting (e.g. SpamAssassin)

– Very effective, but suffers due to static heuristic rules.

• Killfiles are still used because users understand them really well, despite their lack of effectiveness. They are still useful for deliberately blocking posts from contributors who rub you up the wrong way, but useless against spam.

• Matching content is a step in the right direction, but not foolproof. Spamware agents have the ability to insert “hashbusters”, which are deliberately designed to throw off trivial hash-collision detection methods. Note the benefits of collaborative filtering though; when it works, it’s good.

• Discuss SA rules – body and header matching, e.g. Nigerian spam, mail-client spoofing.

• Effectiveness is excellent but errors remain possible. Quite a lot of “confirmation” emails (e.g. Easyjet, Ryanair) get misclassified because they match the heuristics. Equally, if spam comes along which doesn’t match the static rules, it’s not detected.
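The hashbuster weakness noted above is easy to see in a small sketch. This is an illustration only: SHA-256 stands in for whatever signature a precise-match system actually uses, and the message text and the random-looking suffix are made up for the example.

```python
import hashlib


def exact_hash(text):
    """Precise content hash: any change to the text changes the digest."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


# Hypothetical spam message and the same message with a "hashbuster" appended.
original = "Buy cheap watches now! Visit our site today."
busted = original + " xqzjv81"

# The two messages are nearly identical to a human reader,
# but their precise hashes no longer match.
print(exact_hash(original) == exact_hash(busted))  # False
```

A single appended token is enough to defeat the exact match, which is why a quantitatively comparable hash is worth pursuing.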

The dynamic solution

• Static rules can make an “educated guess” as to what the user thinks may be spam. . .

• . . . but the only way to find out precisely is to have the user tell us.

• The user’s wishes are unlikely to be codifiable as a set of static rules – we must find a different way.

Project Objectives

• Implementation of a content filter for mail and news, controlled and influenced by the individual user.

• Content filtration by statistical classification and distribution of content hashes

• Investigation of statistical classification as applied to

System Architecture

[Figure: system architecture diagram. Inputs: Incoming Mail, Incoming News. Components: Mail Handler, News Handler, Spam Handler, Collab Handler, Management Interface, Content Mgmt Clients, Core (Bayesian Classifier, Collaborative Filter). Outputs: Filtered Mail, Filtered News, Collab Messages.]

Statistical filtering

• Analyze a set of examples which the user tells us are either spam or non-spam.

• Calculate the prior probability of each word in the examples based on how often it appears in spam content.

• e.g. “Click” appears in 939 out of 2,355 spam examples and 113 out of 4,787 non-spam examples.

$$p_{\text{spam}} = \frac{\frac{939}{2355}}{\frac{113}{4787} + \frac{939}{2355}} = 0.9441$$
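The calculation on this slide can be reproduced with a few lines of Python. This is a minimal sketch; the function and parameter names are illustrative and not taken from the project, and it assumes the word occurs in at least one example.

```python
def word_spam_probability(spam_hits, spam_total, ham_hits, ham_total):
    """Spam probability for one word, from how often it appears in each class."""
    spam_rate = spam_hits / spam_total  # fraction of spam examples containing the word
    ham_rate = ham_hits / ham_total     # fraction of non-spam examples containing the word
    return spam_rate / (ham_rate + spam_rate)


# Figures from the slide: "Click" in 939 of 2,355 spam and 113 of 4,787 non-spam examples.
p_click = word_spam_probability(939, 2355, 113, 4787)
print(round(p_click, 4))  # 0.9441
```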

The Naïve Bayesian Classifier

• To test a content item, look up the probability of every word in the new content in the table we created.

• Find the n most extreme probabilities (those closest to 0 or 1)

• Use the word probabilities as likelihood indicators for the new content being spam.

$$P_{\text{spam}} = \frac{\prod_{k=1}^{n} p_k}{\prod_{k=1}^{n} p_k + \prod_{k=1}^{n} (1 - p_k)}$$
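The combination step can be sketched as follows. The default of n = 15 and the clamping of probabilities to [0.01, 0.99] are assumptions made for this example (clamping avoids a zero denominator when one word has probability 0 and another has probability 1); the slides do not state the values the project uses.

```python
from math import prod


def combined_spam_probability(word_probs, n=15):
    """Combine the n most extreme word probabilities into one spam score."""
    # Assumption: clamp so that no single word forces the product to exactly 0.
    clamped = [min(0.99, max(0.01, p)) for p in word_probs]
    # "Most extreme" means furthest from the neutral value 0.5.
    extreme = sorted(clamped, key=lambda p: abs(p - 0.5), reverse=True)[:n]
    spam = prod(extreme)
    ham = prod(1 - p for p in extreme)
    return spam / (spam + ham)


# Hypothetical word probabilities for one message; only the 3 most extreme are used.
score = combined_spam_probability([0.9441, 0.99, 0.2, 0.5, 0.6], n=3)
print(score)
```

Words with probabilities near 0.5 carry little evidence either way, which is why only the most extreme values are kept.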

Collaborative filtration

• Users generally in some form of community

• The same spam content may reach more than one member of the community

• Time delay in mail handling works to our advantage

• Can we share knowledge within communities to

Better content matching

• Current hash-detection systems fail too readily

• Need function such that:

– If content a and b are substantively similar. . .

– . . . values α and β are arithmetically similar.

• A fuzzy hash – hash where two hashes are quantitatively comparable.
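One way to get a quantitatively comparable signature is sketched below. This is not the project's algorithm, which the slides do not describe; it is a stand-in that hashes every overlapping window of characters and compares two signatures by the fraction of window hashes they share (Jaccard similarity). The names `fuzzy_signature` and `similarity`, the window size, and the sample messages are all hypothetical.

```python
import zlib


def fuzzy_signature(text, window=5):
    """Set of CRC32 values, one per overlapping character window."""
    normalized = " ".join(text.lower().split())  # ignore case and spacing changes
    count = max(1, len(normalized) - window + 1)
    return frozenset(
        zlib.crc32(normalized[i:i + window].encode("utf-8")) for i in range(count)
    )


def similarity(a, b):
    """Fraction of window hashes shared by two signatures, between 0.0 and 1.0."""
    return len(a & b) / len(a | b)


spam = fuzzy_signature("Buy cheap watches now! Visit our site today.")
variant = fuzzy_signature("Buy cheap watches now! Visit our site today. xqzjv81")
unrelated = fuzzy_signature("Minutes of the developers' meeting are attached.")

print(similarity(spam, variant))    # high: only the appended suffix differs
print(similarity(spam, unrelated))  # low: almost no windows in common
```

Unlike a precise hash, a small edit only changes the windows it touches, so similar content yields similar values.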

Using fuzzy hashing in collaboration

• Alice receives an email, which is detected as spam.

• Alice’s mail filter hashes the content, notes the hash, and sends it on to any interested collaborators.

• Bob’s mail filter receives a collaborative message regarding the new spam. It notes the hash.

• Bob then receives an email. The email is hashed, and compared with those it knows about.

• Bob’s mail filter discovers the new mail is a 98% match with the spam Alice told us about.

• Bob has set his hash match threshold to 70%, so the
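The exchange between Alice and Bob might be sketched like this. Everything here is an assumption made for illustration: the `MailFilter` class, its method names, the window-hash signature used in place of the project's fuzzy hash, and the rule that a message is flagged when its best match meets or exceeds the receiver's threshold. Only the 70% threshold comes from the slide.

```python
import zlib


def signature(text, window=5):
    """Stand-in fuzzy hash: CRC32 of every overlapping character window."""
    normalized = " ".join(text.lower().split())
    count = max(1, len(normalized) - window + 1)
    return frozenset(
        zlib.crc32(normalized[i:i + window].encode("utf-8")) for i in range(count)
    )


def match(a, b):
    """Similarity of two signatures as a value between 0.0 and 1.0."""
    return len(a & b) / len(a | b)


class MailFilter:
    def __init__(self, threshold):
        self.threshold = threshold  # minimum match needed to treat mail as known spam
        self.known_spam = []        # signatures noted locally or received from peers
        self.collaborators = []     # other filters interested in our reports

    def report_spam(self, text):
        """Note the hash of detected spam and send it on to collaborators."""
        sig = signature(text)
        self.known_spam.append(sig)
        for peer in self.collaborators:
            peer.receive_collab(sig)

    def receive_collab(self, sig):
        """Note a hash received in a collaborative message."""
        self.known_spam.append(sig)

    def is_known_spam(self, text):
        """Hash incoming mail and compare it with every hash we know about."""
        sig = signature(text)
        best = max((match(sig, known) for known in self.known_spam), default=0.0)
        return best >= self.threshold


alice = MailFilter(threshold=0.70)
bob = MailFilter(threshold=0.70)
alice.collaborators.append(bob)

spam_text = "Congratulations, you have won a free holiday. Click here to claim your prize today."
alice.report_spam(spam_text)

bob_flags_variant = bob.is_known_spam(spam_text + " ref 48213")
bob_flags_patch = bob.is_known_spam("Patch for the scheduler bug is attached, please review.")
print(bob_flags_variant, bob_flags_patch)
```

Only signatures travel between the two filters, so Bob never needs to see the content of Alice's mail. The sketch omits the PGP signing that trustworthy collaboration would need.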

Implementation Challenges

• Homogenization of content from various protocols – abstract message format

• PGP integration for trustworthy collaboration

• News protocol implementation

Results

• Like-for-like testing:

– My filter: 75% accuracy with no false positives

– SpamAssassin: 90% accuracy with no false positives

• Hard to test collaborative filtering

• Reasonable performance but not really comparable with

Demonstration

Further Work

• Optimization of configuration variables

– Token thresholds, number of tokens used in testing.

• Optimization of fuzzy hash matching algorithm

– Slow due to attempted rolling window matches

• Addition of other protocols

– Web-based bulletin boards?

• User interface extensions

– Provide a “usable” mail/news client

• SpamAssassin for news, meta-filtration

– Infrastructure could apply SpamAssassin to news, refactor to allow multiple content testing methods.

Summary

• A content filter which functions acceptably

• Bayesian filtering and fuzzy hash matching are useful

• Sole use of these technologies may not be sufficient

• Combining filters likely to be the best solution
