The Open Source Stack: One approach to spam filtering
Chris St. Pierre
Unix Systems Administrator Nebraska Wesleyan University
Administrivia
Terminology
● Spam isn't an abbreviation or acronym.
● UCE (Unsolicited Commercial Email) and
UBE (...Bulk...)
● “Spam” is more than spam: phishing, 419
scams, lottery scams, pump and dump, viruses, etc.
● Things to avoid:
– False positives (FPs): legit email marked spam
– False negatives (FNs): spam marked legit
Goals
● Make your users happy
● Users with control are happier than users
without control
The Stack Approach
● There's no magic bullet that will kill all spam
● Zeno's Paradox
● Every tool we use will get rid of a little more
spam
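A quick worked example of the Zeno effect. The catch rates below are made-up round numbers, not measurements from any real filter; the point is only that each layer removes a fraction of what the previous layer let through, so the total shrinks but never reaches zero.

```python
def remaining_after_stack(spam_count, catch_rates):
    """Return how much spam survives after each layer, in order."""
    survivors = []
    left = float(spam_count)
    for rate in catch_rates:
        # Each layer only sees what the previous layers let through.
        left *= (1.0 - rate)
        survivors.append(left)
    return survivors


# Hypothetical stack: four layers that each catch half of what they see.
survivors = remaining_after_stack(1000, [0.5, 0.5, 0.5, 0.5])
print(survivors)  # [500.0, 250.0, 125.0, 62.5]
```

Four mediocre layers together stop about 94% of the spam in this toy case, which is the argument for stacking rather than hunting for one perfect tool.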
Other Approaches
● Pay someone a lot of money
● Pure Whitelisting
● C & R
● Pray
Disclaimer
● This is just one approach to spam filtering.
There are many other approaches that may be just as effective.
● Your anti-spam solution must be tailored to
fit your environment, not mine.
● If something I recommend doesn't work for
The Stack
1. Honeypots
2. RBLs
3. Greylisting
4. HELO (and other) restrictions
5. Tarpitting
6. ClamAV
7. SpamAssassin
8. End-user tools
Order is important
● If you can discard or reject messages before
accepting them, this saves you valuable resources
Basics
● NEVER bounce spam or viruses
– Don't be a jerk and cause backscatter!
– Reject with a 5xx error code
– Discarding is also bad, but sometimes we do it anyway
● NEVER forward to off-site addresses before
filtering
1. Honeypots
● Create a fake address and publicize it; ban
anyone who sends to it
● Remarkably ineffective
Aside: secondary MXes
2. RBLs
● Realtime Blackhole List (or DNSBL: DNS-based
Blackhole List)
● Someone else has done all the work for you.
Yay!
● Run a caching nameserver
● When blocking based on RBL, you must
avoid FPs
● http://www.usenix.org/publications/login/2006-12/pdfs/josephsen.pdf
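How an RBL lookup works, as a minimal sketch: the client's IPv4 octets are reversed, the list's zone is appended, and the resulting name is looked up in DNS. Any answer means "listed"; no such name means "not listed". The function names and the stub resolver here are illustrative assumptions; your MTA does this for you.

```python
import ipaddress


def dnsbl_query_name(ip, zone):
    """Build the DNS name a DNSBL client looks up for an IPv4 address."""
    addr = ipaddress.IPv4Address(ip)  # raises ValueError on bad input
    reversed_octets = ".".join(reversed(str(addr).split(".")))
    return f"{reversed_octets}.{zone}"


def is_listed(ip, zone, resolve):
    """Return True if the zone has an entry for this IP.

    `resolve` is any callable that returns an address for a name and
    raises OSError when the name does not exist.
    """
    try:
        resolve(dnsbl_query_name(ip, zone))
        return True
    except OSError:
        return False


def stub_resolver(name):
    """Stand-in for real DNS so the example runs offline (hypothetical zone)."""
    if name == "2.0.0.127.rbl.example.test":
        return "127.0.0.2"
    raise OSError("no such name")


print(dnsbl_query_name("192.0.2.99", "zen.spamhaus.org"))
print(is_listed("127.0.0.2", "rbl.example.test", stub_resolver))
```

Against a real list you could pass `socket.gethostbyname` as `resolve`, since its lookup failures are `OSError` subclasses. Every connection triggers lookups like this, which is why a local caching nameserver matters.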
Live RBL Revue!
● Only a few are worth considering:
– zen.spamhaus.org is excellent. Includes SBL, XBL, and PBL. Costs some cash for non-personal use; cbl.abuseat.org is free, and is one of their sources
– SpamCop got a bad reputation early on, but they're doing a great job now (bl.spamcop.net)
– The Passive Spam Block List (psbl.surriel.com) works much better than you might suspect
● Nothing else I've found or heard of is worth
3. Greylisting
● Overview:
– Greylisting identifies each message with a unique triplet: sender, recipient, originating server.
– The first time it sees a given triplet, it gives a 4xx (tempfail) code
– Legitimate servers will retry, at which point the triplet will be recognized and accepted
– Spammers don't waste resources on retries
● Greylist on the /24 netblock of the
originating server
● Retry time doesn't matter, because
spammers don't retry. (5 minutes is sort of the standard.)
● Auto-whitelist and auto-blacklist
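The greylisting logic above fits in a few lines. This is a sketch only, with an in-memory dict standing in for the database and hypothetical class and method names; real greylisting servers add expiry, auto-whitelisting, and shared storage.

```python
import ipaddress


class Greylist:
    """Toy greylister keyed on (sender, recipient, /24 of the client)."""

    def __init__(self, retry_delay=300):
        self.retry_delay = retry_delay  # seconds; 5 minutes is the usual default
        self.first_seen = {}            # triplet -> time it was first seen

    def _triplet(self, sender, recipient, client_ip):
        # Key on the /24 so retries from a neighbouring server in a farm match.
        net = ipaddress.ip_network(f"{client_ip}/24", strict=False)
        return (sender.lower(), recipient.lower(), str(net))

    def check(self, sender, recipient, client_ip, now):
        """Return the SMTP reply for this delivery attempt at time `now`."""
        key = self._triplet(sender, recipient, client_ip)
        first = self.first_seen.get(key)
        if first is None:
            self.first_seen[key] = now
            return "450 greylisted, try again later"
        if now - first < self.retry_delay:
            return "450 greylisted, try again later"
        return "250 ok"


g = Greylist()
print(g.check("a@example.org", "b@example.net", "192.0.2.17", now=0))
print(g.check("a@example.org", "b@example.net", "192.0.2.200", now=600))
```

The second attempt comes from a different host in the same /24 and is still accepted, which is the reason for greylisting on the netblock rather than the single address.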
3. Greylisting, continued
● Find a greylisting server with a sizable
preconfigured whitelist
● If you have >1 MX, look for a greylisting
server that supports a shared database
– Policyd is wonderful, but is Postfix-only
– SQLGrey is quite nice and works with both Postfix and Exim
– RelayDelay is the closest I've found for Sendmail
4. HELO (and other) restrictions
● Lots of fun stuff!
● Site-specific whitelists/blacklists
● Reject non-FQDN HELOs and HELOs with
bad syntax
● Reject mail to unknown recipients!
● Reject HELOs that resolve to bogons
– http://www.cymru.com/Documents/bogon-bn-agg.txt
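A rough sketch of the "non-FQDN or bad syntax" test on the HELO name. Your MTA already has this built in; the function name, the verdict strings, and the decision to wave address literals through are assumptions made for illustration.

```python
import re

# One DNS label: letters, digits, hyphens; no leading or trailing hyphen.
_LABEL = re.compile(r"(?!-)[A-Za-z0-9-]{1,63}(?<!-)")


def helo_verdict(helo):
    """Classify a HELO/EHLO argument as ok or reject, with a reason."""
    if helo.startswith("[") and helo.endswith("]"):
        return "ok"  # address literal such as [192.0.2.1]; not judged here
    labels = helo.rstrip(".").split(".")
    if len(labels) < 2:
        return "reject: non-FQDN HELO"
    if not all(_LABEL.fullmatch(label) for label in labels):
        return "reject: bad HELO syntax"
    if labels[-1].isdigit():
        return "reject: bad HELO syntax"  # bare IP address without brackets
    return "ok"


for name in ["mail.example.org", "localhost", "mail_1.example.org", "192.0.2.1"]:
    print(name, "->", helo_verdict(name))
```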
4. HELO restrictions, continued
● HELO Randomization Protection (HRP)
● Reject mail when the HELO name has no
MX or A record?
● Well-configured HELO restrictions can drop
5 (or 0). Tarpitting
● Make a connection very slow (or just pause)
● Spammers are impatient
● Claims of 80% block rates
● Two ways to implement:
– Pre-MTA wrapper
– Within the MTA (e.g., milter)
● Most connections are dropped after about a
5 (or 0). Tarpitting, continued
● Two years ago, this presentation had this
line:
– “Tarpitting is fairly new, so software is rare as of this writing”
● Tarpitting never really caught on, so it's still
fairly rare.
● Implementations:
– GreetPause (sendmail)
– OpenBSD SpamD
Changeup!
● Up to here, we've been talking about
discarding messages
● After this, we'll assume you've already
accepted the message
Aside: What about filtering
integrators?
● Amavis, MailScanner, etc.
● Generally, not worth it
● Not a lot of supplementary functionality of
consequence – but that's changing
● They remove you one step from your
component configuration, and whether or not they make the integration any easier is up for debate
Aside: What about filtering
integrators?
● Cost: additional complexity of setup and
maintenance; one more thing to break
● Benefit: Some (often minor) features
6. ClamAV
● ClamSMTPD is a great integrator
● Not just antivirus; anti-phishing par
excellence
● In addition to the standard rules, use
http://www.sanesecurity.com/clamav
– Exclude the “SpamDomain” rulesets
● Keep it updated and ClamAV will Just Work
7. SpamAssassin
● This could be a class of its own. We'll
cover:
a) Basics
b) Bayes
c) Checksumming systems (Razor2, DCC, Pyzor)
d) URIBL
e) SARE rulesets
f) Plugins
g) Miscellaneous score adjustments
h) Alternatives?
a. Basics
● SpamAssassin does not filter spam
● SA scores mail with a bunch of tests. Each
test can add or subtract a few points to the score. If the mail has over a certain number of points, it gets marked as spam – not
filtered.
● The default required_score value (called
required_hits before SA 3.0) is 5, which tends to work well
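The scoring model in miniature. The rule names and scores are invented for the example and this is not SpamAssassin's code; it only shows that the result is a total and a spam/not-spam mark, and that what happens to the message afterwards is someone else's job.

```python
REQUIRED = 5.0  # same role as SpamAssassin's threshold setting

# Hypothetical rules: positive scores look spammy, negative scores look legit.
RULE_SCORES = {
    "EXAMPLE_URI_LISTED": 3.0,
    "EXAMPLE_BAD_SUBJECT": 2.5,
    "EXAMPLE_KNOWN_SENDER": -1.0,
}


def score_message(hits, rule_scores=RULE_SCORES, required=REQUIRED):
    """Sum the scores of the rules that hit and mark the message."""
    total = sum(rule_scores[name] for name in hits)
    return total, total >= required


print(score_message(["EXAMPLE_URI_LISTED", "EXAMPLE_BAD_SUBJECT"]))
print(score_message(["EXAMPLE_URI_LISTED", "EXAMPLE_BAD_SUBJECT",
                     "EXAMPLE_KNOWN_SENDER"]))
```

The second message hits the same two spammy rules but drops below the threshold because of one negative-scoring rule, which is how whitelisting-style rules rescue legitimate mail.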
b. Bayes
● You can keep your Bayesian database in
either flat files, or in a “real” DB
● Use a real database if you have >1 MX
● Let your users report FPs and FNs, and
train Bayes on it
● Use bayes_auto_learn to ensure a
b. Bayes, continued
● Train train train!
● DO NOT train Bayes on public corpora
● DO NOT train Bayes on your outgoing mail
● The SA Bayes engine isn't the greatest
● One solution (?): crm114 plugin
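Why training on your own users' reports matters, shown with a deliberately tiny naive Bayes classifier. This is a textbook sketch with equal priors and add-one smoothing, not the algorithm SpamAssassin's Bayes engine actually uses; the class and method names are hypothetical.

```python
import math
from collections import Counter


class TinyBayes:
    """Toy per-token naive Bayes trained on reported spam and ham."""

    def __init__(self):
        self.tokens = {"spam": Counter(), "ham": Counter()}
        self.messages = {"spam": 0, "ham": 0}

    def train(self, text, label):
        """Learn from one reported message; label is 'spam' or 'ham'."""
        self.tokens[label].update(set(text.lower().split()))
        self.messages[label] += 1

    def spam_probability(self, text):
        """Combine per-token evidence into one probability of spam."""
        log_odds = 0.0
        for tok in set(text.lower().split()):
            # Add-one smoothing so unseen tokens count as neutral evidence.
            p_spam = (self.tokens["spam"][tok] + 1) / (self.messages["spam"] + 2)
            p_ham = (self.tokens["ham"][tok] + 1) / (self.messages["ham"] + 2)
            log_odds += math.log(p_spam / p_ham)
        return 1 / (1 + math.exp(-log_odds))


b = TinyBayes()
b.train("cheap pills online", "spam")      # a user-reported FN
b.train("meeting notes attached", "ham")   # a user-reported FP
print(b.spam_probability("cheap pills") > 0.5)
print(b.spam_probability("meeting notes") < 0.5)
```

The classifier only knows the words it was fed, so a public corpus or your own outgoing mail teaches it what someone else's mail looks like, not what your users receive.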
c. Checksumming systems
● Razor2, DCC, Pyzor
● They're all free now
● Razor2 rawks hard
● DCC gives lots of FPs, because it just
measures bulkiness, not spamminess
● Both Razor2 and Pyzor have very low FP
d. URIBL
● Checks the URLs in an email against a
blacklist
● This is wonderful
● Crank these scores
● If none of your top ten rules are URIBL_*,
e. Third-Party Rulesets
● Additional rules that block lots of stock
scams, image spam, etc.
● SpamAssassin Rule Emporium (SARE)
– Howto: http://daryl.dostech.ca/sa-update/sare/sare-sa-update-howto.txt
– http://www.rulesemporium.com/rules.htm
– Most rulesets have 2-4 options, increasing in aggressiveness
● KAM
e. Third-Party Rulesets, cont'd
● Extra rules from SpamAssassin
– http://wiki.apache.org/spamassassin/CustomRulesets
– See especially the “Sought” ruleset
– Sets for other languages
f. Plugins
● There are lots out there, but four major ones
you need to know:
● Botnet: tries to identify mail from botnets
– Lots of FPs, not a lot of real positives
– http://people.ucsc.edu/~jrudd/spamassassin/
● PDFInfo: ImageInfo for PDF attachment
spam
f. Plugins, continued
● ImageInfo: looks for broken or suspicious
image attachments
– Together with the SARE rules, is very good at stopping image spam
– Doesn't use OCR or other processor-intensive tests
– Consider it a necessity
– Included in SA 3.2+
f. Plugins, continued
● Custom plugins are beyond the scope of
this tutorial
– Try to write rules instead of plugins
– Check out http://wiki.apache.org/spamassassin/DumpTextPlugin for a good sample plugin and a nice place to start
g. Miscellaneous score
adjustments
● Tweak and frob scores to suit your
environment
● Track:
– Which rules are hitting frequently and what they're hitting on (ham or spam)
g. Miscellaneous score
adjustments
● Many rules are disabled (score = 0). Enable
all tests initially to see if any of the disabled
rules hit reliably:
egrep 'score.*\s0$' \
    /usr/share/spamassassin/50_scores.cf | \
    awk '{print $1, $2, "0.1"}' > all-rules.cf
h. Alternatives?
● Dspam, Bogofilter, others
● Dspam and Bogofilter violate the stack
model; they only use Bayes
● SA uses Bayes, plus other plugins and
8. End-user tools
● Clients must, at a minimum, be able to
report FPs and FNs
– Learn (with Bayes) and automatically whitelist or blacklist per-user based on what they report
● Let your clients configure their own filtering
levels
● Forget quarantining
8. End-user tools
● Let your clients configure their own
whitelists and blacklists
– Ideally, whitelisting a sender should get them past RBLs, tarpitting, greylisting, etc., for the recipient(s) who whitelisted them
● Really really difficult
– Also ideally, generate whitelists from address books
– Whitelisting can be dangerous, since it relies on addresses, not Received: headers
9. Statistics
● You need statistics for four reasons:
1. Everyone likes pretty pictures
2. Track the effectiveness of your filters
3. Plan for and justify growth
9. What kind of statistics?
● Both graphs/charts and hard numbers
● General mail statistics are a prerequisite
● What is your ratio of ham to spam?
● How much spam are you delivering to
mailboxes?
● How many viruses are you getting?
● How much is filtered out by
9. What kind of statistics?
● What are your spam scores? (Min/max/avg)
Are there any trends?
● How long does it take to scan a message?
What is your average time-to-delivery?
● What SA rules are hitting the most? (On
ham? On spam?) Which are the “best” or “most reliable” rules?
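One way to get at the score and rule-hit questions is to mine the X-Spam-Status headers SpamAssassin adds. A rough sketch, assuming each header value has already been unfolded onto a single line and follows the usual "Yes/No, score=... required=... tests=..." layout; the function name is hypothetical and the sample values are invented.

```python
import re
from collections import Counter

STATUS = re.compile(r"^(Yes|No), score=(-?[\d.]+) required=[\d.]+ tests=(\S*)")


def summarize(status_values):
    """Return (score stats, rule hits split into spam and ham)."""
    scores = []
    hits = {"spam": Counter(), "ham": Counter()}
    for value in status_values:
        m = STATUS.match(value)
        if not m:
            continue  # skip headers that do not match the assumed layout
        kind = "spam" if m.group(1) == "Yes" else "ham"
        scores.append(float(m.group(2)))
        tests = [t for t in m.group(3).split(",") if t and t != "none"]
        hits[kind].update(tests)
    stats = {}
    if scores:
        stats = {"min": min(scores), "max": max(scores),
                 "avg": sum(scores) / len(scores)}
    return stats, hits


sample = [
    "Yes, score=12.5 required=5.0 tests=BAYES_99,URIBL_BLACK autolearn=no",
    "No, score=-1.5 required=5.0 tests=BAYES_00 autolearn=ham",
    "Yes, score=7.0 required=5.0 tests=URIBL_BLACK autolearn=no",
]
stats, hits = summarize(sample)
print(stats)
print(hits["spam"].most_common(3))
```

A rule that shows up often under "spam" and almost never under "ham" is a candidate for a higher score; one that hits both is a candidate for a lower one.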