The Open Source Stack: One approach to spam filtering

(1)


Chris St. Pierre

Unix Systems Administrator Nebraska Wesleyan University

(2)

Administrivia

(3)

Administrivia

(4)

Terminology

● Spam isn't an abbreviation or acronym.

● UCE (Unsolicited Commercial Email) and UBE (...Bulk...)

● “Spam” is more than spam: phishing, 419 scams, lottery scams, pump and dump, viruses, etc.

● Things to avoid:

– False positives (FPs): legit email marked spam

– False negatives (FNs): spam marked legit

(5)

Goals

● Make your users happy

● Users with control are happier than users without control

(6)

The Stack Approach

● There's no magic bullet that will kill all spam

● Zeno's Paradox

● Every tool we use will get rid of a little more spam

(7)

Other Approaches

● Pay someone a lot of money

● Pure Whitelisting

● C & R

● Pray

(8)

Disclaimer

● This is just one approach to spam filtering. There are many other approaches that may be just as effective.

● Your anti-spam solution must be tailored to fit your environment, not mine.

● If something I recommend doesn't work for

(9)

The Stack

1. Honeypots

2. RBLs

3. Greylisting

4. HELO (and other) restrictions

5. Tarpitting

6. ClamAV

7. SpamAssassin

8. End-user tools

(10)

Order is important

● If you can discard or reject messages before accepting them, this saves you valuable resources
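The ordering idea can be sketched as a chain of checks that stops at the first rejection, so the expensive content scanners only ever see mail that survived the cheap connection-time tests. This is a minimal illustration, not any MTA's actual API; `run_stack` and the check functions are hypothetical names.

```python
def run_stack(message, checks):
    """Apply checks in order and stop at the first one that rejects.

    `checks` is a list of (name, function) pairs, cheapest first. Each
    function returns a rejection reason (a string) or None to pass the
    message along to the next layer.
    """
    for name, check in checks:
        reason = check(message)
        if reason is not None:
            return "reject", name, reason
    return "accept", None, None
```

Putting RBL and HELO checks ahead of ClamAV and SpamAssassin in `checks` means the costly scanners are never invoked for mail the cheap checks already turned away.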

(11)

Basics

● NEVER bounce spam or viruses

– Don't be a jerk and cause backscatter!

– Reject with a 5xx error code

– Discarding is also bad, but sometimes we do it anyway

● NEVER forward to off-site addresses before filtering

(12)

1. Honeypots

● Create a fake address and publicize it; ban anyone who sends to it

● Remarkably ineffective

(13)

Aside: secondary MXes

(14)

2. RBLs

● Realtime Black List (or DNSBL: DNS Black List)

● Someone else has done all the work for you. Yay!

● Run a caching nameserver

● When blocking based on RBL, you must avoid FPs

● http://www.usenix.org/publications/login/2006-12/pdfs/josephsen.pdf
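An RBL check is an ordinary DNS query: reverse the octets of the connecting IPv4 address, append the list's zone, and see whether an A record comes back. The sketch below assumes IPv4 only; `dnsbl_query_name` and `is_listed` are illustrative names, and the resolver is passed in so the logic can be tried without touching the network.

```python
import ipaddress
import socket


def dnsbl_query_name(ip: str, zone: str) -> str:
    """Build the DNS name to look up: reversed IPv4 octets + list zone."""
    addr = ipaddress.IPv4Address(ip)  # raises ValueError on bad input
    reversed_octets = ".".join(reversed(str(addr).split(".")))
    return f"{reversed_octets}.{zone}"


def is_listed(ip: str, zone: str, resolve=socket.gethostbyname) -> bool:
    """Return True if the list publishes an A record for this address.

    `resolve` is any callable that raises OSError when the name does not
    exist. The default performs a real DNS lookup, ideally answered by a
    local caching nameserver.
    """
    try:
        resolve(dnsbl_query_name(ip, zone))
        return True
    except OSError:
        return False
```

Every incoming connection triggers at least one such lookup per list, which is why the caching nameserver matters. A production check would also inspect the returned 127.0.0.x value rather than treating any answer as a listing.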

(15)

Live RBL Revue!

● Only a few are worth considering:

– zen.spamhaus.org is excellent. Includes SBL, XBL, and PBL. Costs some cash for non-personal use; cbl.abuseat.org is free, and is one of their sources

– SpamCop got a bad reputation early on, but they're doing a great job now (bl.spamcop.net)

– The Passive Spam Block List (psbl.surriel.com) works much better than you might suspect

● Nothing else I've found or heard of is worth

(16)

3. Greylisting

● Overview:

– Greylisting identifies each message with a unique triplet: sender, recipient, originating server.

– The first time it sees a given triplet, it gives a 4xx (tempfail) code

– Legitimate servers will retry, at which point the triplet will be recognized and accepted

– Spammers don't waste resources on retries

(17)

● Greylist on the /24 netblock of the originating server

● Retry time doesn't matter, because spammers don't retry. (5 minutes is sort of the standard.)

● Auto-whitelist and auto-blacklist
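The triplet logic, including the /24 grouping and the five-minute delay, fits in a few lines. This is a minimal in-memory sketch assuming IPv4 clients; the `Greylist` class is hypothetical, and it leaves out the auto-whitelist, auto-blacklist, and shared database that a real greylisting server provides.

```python
import ipaddress
import time

RETRY_DELAY = 5 * 60  # seconds; the "standard" five minutes


class Greylist:
    """In-memory greylist keyed on (sender, recipient, /24 of client IP)."""

    def __init__(self, delay=RETRY_DELAY, clock=time.time):
        self.delay = delay
        self.clock = clock      # injectable so tests need not sleep
        self.first_seen = {}    # triplet -> timestamp of first attempt

    @staticmethod
    def triplet(sender, recipient, client_ip):
        # Collapse the client address to its /24 so that a retry from a
        # neighbouring host in the same server farm still matches.
        net = ipaddress.ip_network(f"{client_ip}/24", strict=False)
        return (sender.lower(), recipient.lower(), str(net))

    def check(self, sender, recipient, client_ip):
        """Return an SMTP-style reply: 450 to defer, 250 to accept."""
        key = self.triplet(sender, recipient, client_ip)
        now = self.clock()
        first = self.first_seen.setdefault(key, now)
        if now - first < self.delay:
            return "450 4.7.1 Greylisted, please try again later"
        return "250 2.0.0 OK"
```

Keying on the /24 rather than the exact address matters for large senders whose retries come from a different machine in the same pool.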

(18)

3. Greylisting, continued

● Find a greylisting server with a sizable preconfigured whitelist

● If you have >1 MX, look for a greylisting server that supports a shared database

– Policyd is wonderful, but is Postfix-only

– SQLGrey is quite nice and works with both Postfix and Exim

– RelayDelay is the closest I've found for Sendmail

(19)

4. HELO (and other) restrictions

● Lots of fun stuff!

● Site-specific whitelists/blacklists

● Reject non-FQDN HELOs and HELOs with bad syntax

● Reject mail to unknown recipients!

● Reject HELOs that resolve to bogons

– http://www.cymru.com/Documents/bogon-bn-agg.txt
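Two of these checks are easy to illustrate: whether the HELO argument is syntactically a fully qualified name, and whether an address it resolves to sits in a bogon range. The sketch below is an approximation under stated assumptions: `SAMPLE_BOGONS` is a small illustrative subset and not the full list linked above, the function names are hypothetical, and address literals such as `[192.0.2.1]` are simply rejected.

```python
import ipaddress
import re

# One DNS label: letters, digits, hyphens; no leading or trailing hyphen.
_LABEL = re.compile(r"^(?!-)[A-Za-z0-9-]{1,63}(?<!-)$")

# Illustrative subset only; a real deployment would load the full bogon list.
SAMPLE_BOGONS = [
    ipaddress.ip_network(n)
    for n in ("0.0.0.0/8", "10.0.0.0/8", "127.0.0.0/8",
              "172.16.0.0/12", "192.168.0.0/16")
]


def helo_is_fqdn(helo: str) -> bool:
    """True if the HELO argument looks like a fully qualified domain name."""
    name = helo.rstrip(".")
    labels = name.split(".")
    if len(name) > 253 or len(labels) < 2:
        return False
    if labels[-1].isdigit():
        return False  # a bare IP address is not a hostname
    return all(_LABEL.match(label) for label in labels)


def is_bogon(ip: str, bogons=SAMPLE_BOGONS) -> bool:
    """True if an address the HELO name resolved to falls in a bogon range."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in bogons)
```

In practice the MTA's own restriction settings do this work; the code only shows what those settings are testing for.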

(20)

4. HELO restrictions, continued

● HELO Randomization Protection (HRP)

● Reject mail when the HELO name has no MX or A record?

● Well-configured HELO restrictions can drop

(21)

5 (or 0). Tarpitting

● Make a connection very slow (or just pause)

● Spammers are impatient

● Claims of 80% block rates

● Two ways to implement:

– Pre-MTA wrapper

– Within the MTA (e.g., milter)

● Most connections are dropped after about a

(22)

5 (or 0). Tarpitting, continued

● Two years ago, this presentation had this line:

– “Tarpitting is fairly new, so software is rare as of this writing”

● Tarpitting never really caught on, so it's still fairly rare.

● Implementations:

– GreetPause (sendmail)

– OpenBSD SpamD
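The "just pause" variant can be sketched as a delay before the SMTP banner: a well-behaved client waits for the 220 greeting, while an impatient one starts talking (or hangs up) during the pause. This is a hedged illustration of the idea only, not how GreetPause or spamd is implemented; the function names, the 5-second default, and `mx.example.org` are assumptions.

```python
import select
import socket


def spoke_before_banner(conn: socket.socket, pause: float) -> bool:
    """Wait up to `pause` seconds before greeting; report whether the
    client sent anything (or hung up) during that time.

    SMTP clients are supposed to wait for the server's 220 banner, so a
    readable socket during the pause marks an impatient sender.
    """
    readable, _, _ = select.select([conn], [], [], pause)
    return bool(readable)


def greet(conn: socket.socket, pause: float = 5.0,
          hostname: str = "mx.example.org") -> bool:
    """Send a rejection or the normal banner; return True if greeted.

    The caller remains responsible for closing the connection.
    """
    if spoke_before_banner(conn, pause):
        conn.sendall(b"554 5.7.1 Protocol violation: spoke before banner\r\n")
        return False
    conn.sendall(f"220 {hostname} ESMTP\r\n".encode("ascii"))
    return True
```

A wrapper in front of the MTA would call `greet` on each accepted connection and hand the socket over only when it returns True.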

(23)

Changeup!

● Up to here, we've been talking about discarding messages

● After this, we'll assume you've already accepted the message

(24)

Aside: What about filtering integrators?

● Amavis, MailScanner, etc.

● Generally, not worth it

● Not a lot of supplementary functionality of consequence – but that's changing

● They remove you one step from your component configuration, and whether or not they make the integration any easier is up for debate

(25)

Aside: What about filtering integrators?

● Cost: additional complexity of setup and maintenance; one more thing to break

● Benefit: Some (often minor) features

(26)

6. ClamAV

● ClamSMTPD is a great integrator

● Not just antivirus; anti-phishing par excellence

● In addition to the standard rules, use http://www.sanesecurity.com/clamav

– Exclude the “SpamDomain” rulesets

● Keep it updated and ClamAV will Just Work

(27)

7. SpamAssassin

● This could be a class of its own. We'll cover:

a) Basics

b) Bayes

c) Checksumming systems (Razor2, DCC, Pyzor)

d) URIBL

e) SARE rulesets

f) Plugins

g) Miscellaneous score adjustments

h) Alternatives?

(28)

a. Basics

● SpamAssassin does not filter spam

● SA scores mail with a bunch of tests. Each test can add or subtract a few points to the score. If the mail has over a certain number of points, it gets marked as spam – not filtered.

● The default required_hits value is 5, which tends to work well
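The scoring model reduces to a sum and a comparison. The sketch below uses made-up rule names and scores purely for illustration; `score_message` is a hypothetical helper, not part of SpamAssassin.

```python
REQUIRED_SCORE = 5.0  # the default threshold mentioned above

# Hypothetical rule names and scores, for illustration only.
EXAMPLE_SCORES = {
    "LOOKS_LIKE_STOCK_TIP": 2.5,
    "URL_ON_BLACKLIST": 3.0,
    "SENDER_IN_WHITELIST": -4.0,
}


def score_message(hits, scores=EXAMPLE_SCORES, required=REQUIRED_SCORE):
    """Sum the scores of the rules that hit and compare to the threshold.

    Returns (total, is_spam). The message is only *tagged*; deciding
    what to do with tagged mail is left to the delivery agent or user.
    """
    total = sum(scores.get(rule, 0.0) for rule in hits)
    return total, total >= required
```

Because negative-scoring rules subtract from the total, a single whitelist hit can pull an otherwise spammy message back under the threshold.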

(29)

b. Bayes

● You can keep your Bayesian database in either flat files, or in a “real” DB

● Use a real database if you have >1 MX

● Let your users report FPs and FNs, and train Bayes on it

● Use bayes_auto_learn to ensure a

(30)

b. Bayes, continued

● Train train train!

● DO NOT train Bayes on public corpora

● DO NOT train Bayes on your outgoing mail

● The SA Bayes engine isn't the greatest

● One solution (?): crm114 plugin

(31)

c. Checksumming systems

● Razor2, DCC, Pyzor

● They're all free now

● Razor2 rawks hard

● DCC gives lots of FPs, because it just measures bulkiness, not spamminess

● Both Razor2 and Pyzor have very low FP

(32)

d. URIBL

● Checks the URLs in an email against a blacklist

● This is wonderful

● Crank these scores

● If none of your top ten rules are URIBL_*,
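What a URIBL test does can be approximated as: pull the hostnames out of every URL in the body, then look each one up against a list of known-bad domains. In this sketch a local Python set stands in for the blacklist (real URIBLs are queried over DNS, much like RBLs), only http(s) URLs are considered, and both function names are hypothetical.

```python
import re
from urllib.parse import urlsplit

_URL = re.compile(r"https?://[^\s<>\"']+", re.IGNORECASE)


def hosts_in_body(body: str) -> set:
    """Collect the lower-cased hostnames of every http(s) URL in a body."""
    hosts = set()
    for url in _URL.findall(body):
        host = urlsplit(url).hostname
        if host:
            hosts.add(host)
    return hosts


def blacklisted_hosts(body: str, blacklist: set) -> set:
    """Return hosts that match a blacklisted domain or one of its subdomains."""
    return {
        host for host in hosts_in_body(body)
        if any(host == bad or host.endswith("." + bad) for bad in blacklist)
    }
```

Matching on the URL's domain rather than the message text is what makes this hard to dodge: the spammer can reword the pitch endlessly but still has to point at a site.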

(33)

e. Third-Party Rulesets

● Additional rules that block lots of stock scams, image spam, etc.

● SpamAssassin Rule Emporium (SARE)

– Howto: http://daryl.dostech.ca/sa-update/sare/sare-sa-update-howto.txt

– http://www.rulesemporium.com/rules.htm

– Most rulesets have 2-4 options, increasing in aggressiveness

● KAM

(34)

e. Third-Party Rulesets, cont'd

● Extra rules from SpamAssassin

– http://wiki.apache.org/spamassassin/CustomRulesets

– See especially the “Sought” ruleset

– Sets for other languages

(35)

f. Plugins

● There are lots out there, but four major ones you need to know:

● Botnet: tries to identify mail from botnets

– Lots of FPs, not a lot of real positives

– http://people.ucsc.edu/~jrudd/spamassassin/

● PDFInfo: ImageInfo for PDF attachment spam

(36)

f. Plugins, continued

● ImageInfo: looks for broken or suspicious image attachments

– Together with the SARE rules, is very good at stopping image spam

– Doesn't use OCR or other processor-intensive tests

– Consider it a necessity

– Included in SA 3.2+

(37)

f. Plugins, continued

● Custom plugins are beyond the scope of this tutorial

– Try to write rules instead of plugins

– Check out http://wiki.apache.org/spamassassin/DumpTextPlugin for a good sample plugin and a nice place to start

(38)

g. Miscellaneous score adjustments

● Tweak and frob scores to suit your environment

● Track:

– Which rules are hitting frequently and what they're hitting on (ham or spam)

(39)

g. Miscellaneous score adjustments

● Many rules are disabled (score = 0). Enable all tests initially to see if any of the disabled rules hit reliably:

egrep 'score.*\s0$' \
    /usr/share/spamassassin/50_scores.cf | \
    awk '{print $1, $2, "0.1"}' > all-rules.cf

(40)

h. Alternatives?

● Dspam, Bogofilter, others

● Dspam and Bogofilter violate the stack model; they only use Bayes

● SA uses Bayes, plus other plugins and

(41)

8. End-user tools

● Clients must, at a minimum, be able to report FPs and FNs

– Learn (with Bayes) and automatically whitelist/blacklist per-user based on what they report

● Let your clients configure their own filtering levels

● Forget quarantining

(42)

8. End-user tools

● Let your clients configure their own whitelists and blacklists

– Ideally, whitelisting a sender should get them past RBLs, tarpitting, greylisting, etc., for the recipient(s) who whitelisted them

● Really really difficult

– Also ideally, generate whitelists from address books

– Whitelisting can be dangerous, since it relies on addresses, not Received: headers
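The per-recipient part of this is the key detail: a whitelist entry should only help the user who created it. A minimal sketch, assuming whitelists are held in memory and keyed on bare addresses; the `RecipientWhitelists` class is hypothetical, and because it matches on the envelope or header address alone it inherits exactly the forgery risk noted above.

```python
from collections import defaultdict
from email.utils import parseaddr


class RecipientWhitelists:
    """Per-recipient sender whitelists, e.g. seeded from address books."""

    def __init__(self):
        self._allowed = defaultdict(set)  # recipient -> set of senders

    @staticmethod
    def _normalize(address: str) -> str:
        # parseaddr handles both "Name <a@b>" and bare "a@b" forms.
        return parseaddr(address)[1].lower()

    def add(self, recipient: str, sender: str) -> None:
        self._allowed[self._normalize(recipient)].add(self._normalize(sender))

    def allows(self, recipient: str, sender: str) -> bool:
        """True only for the recipient who whitelisted this sender."""
        key = self._normalize(recipient)
        # .get avoids creating empty entries for unknown recipients.
        return self._normalize(sender) in self._allowed.get(key, set())
```

The hard part is not the lookup but making every earlier layer (RBLs, greylisting, tarpitting) consult it, since those run before the recipient is even known in some cases.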

(43)

9. Statistics

● You need statistics for four reasons:

1. Everyone likes pretty pictures

2. Track the effectiveness of your filters

3. Plan for and justify growth

(44)

9. What kind of statistics?

● Both graphs/charts and hard numbers

● General mail statistics are a prerequisite

● What is your ratio of ham to spam?

● How much spam are you delivering to

mailboxes?

● How many viruses are you getting?

● How much is filtered out by

(45)

9. What kind of statistics?

● What are your spam scores? (Min/max/avg) Are there any trends?

● How long does it take to scan a message? What is your average time-to-delivery?

● What SA rules are hitting the most? (On ham? On spam?) Which are the “best” or “most reliable” rules?
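Rule-hit and score statistics can be pulled straight from the X-Spam-Status header that SpamAssassin adds to each message. The sketch below assumes the header value has already been unfolded onto one line and follows the usual `Yes, score=... required=... tests=A,B` layout; `summarize` is an illustrative name, and values that do not match are skipped.

```python
import re
from collections import Counter
from statistics import mean

_STATUS = re.compile(
    r"^(?P<verdict>Yes|No), score=(?P<score>-?\d+(?:\.\d+)?) "
    r".*?tests=(?P<tests>\S*)"
)


def summarize(status_headers):
    """Tally rule hits on ham and spam and summarize the scores.

    Each item is the unfolded value of one X-Spam-Status header.
    Returns (hits, stats), where stats is (min, max, avg) or None.
    """
    hits = {"ham": Counter(), "spam": Counter()}
    scores = []
    for value in status_headers:
        m = _STATUS.match(value.strip())
        if not m:
            continue  # skip anything that is not in the expected layout
        kind = "spam" if m.group("verdict") == "Yes" else "ham"
        scores.append(float(m.group("score")))
        tests = m.group("tests")
        # Assumption: "none" marks a message on which no rule hit.
        if tests and tests != "none":
            hits[kind].update(tests.split(","))
    if not scores:
        return hits, None
    return hits, (min(scores), max(scores), mean(scores))
```

`hits["spam"].most_common(10)` then gives the top ten spam rules, and comparing a rule's ham and spam counts shows how reliable it is.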