Effective Open-Source Spam Filtering
For Enterprise
Chris Lewis, Thomas Choi
October 2008
VB2008, Ottawa
Agenda
• Introduction
z Background
z Something New - Rationale
z The Open-Source Project
z Basic Requirements
z Components
z Integration
z Test/Performance
• Advanced Techniques
Introduction/Authors
Chris Lewis
Senior Security Analyst/Anti-Spam, Nortel
Senior Technical Advisor, MAAWG
Member, Canadian Federal Anti-Spam Task Force
Thomas Choi
Nortel
Ph.D. Student, Carleton University
Background
z Spam became a problem in 1994/1995
z Initially in Usenet
z Clearly would transition to Email
z Commenced Email Anti-Spam program in 1997
z Extremely customized Lyris Mailshield implementation
z VB2004 “Corporate Spam Fighting: 5 years of success and lessons Learned” by Chris Lewis and John Morris – don't forget those lessons!
Something New - Rationale
z Lyris Mailshield has stood us in good stead
z But, getting a little elderly, higher volumes, difficult to extend with newer techniques
z Review of many other vendor offerings:
z All missing one or more critical features
z Integrated poorly with existing infrastructure
z Not, or poorly, extensible/configurable
z Not as effective as current solution
Rationale ... Continued
z Needed open architecture/modular/easy extension
z Low capital/license cost (free obviously best!)
z Use standard components to minimize development costs
z Use existing basic low-medium size server class hardware
z Focus on 3rd party/popular filtering methodologies, simple ad-hoc filtering capabilities, plus our own “secret sauce”.
The Open Source Project
z Basic Requirements – Functional Specification
z Component Selection
z Integration
z Back end
z Testing
Basic Requirements - Filter
z Support multiple recipient domains
z Configurable per-domain handling
z Per-domain filter enable
z Configurable archiving/quarantine/disposition (pass, filter, trap)
z Output routing
z Full logging
z NEVER bounce or silent blackhole (except trap)
z Plugin architecture – each technique an
Basic Filter Requirements ... Continued
z Fault tolerant (eg: failover)
z Support 3rd party facilities, eg:
z DNSBL (IP blacklists)
z SURBL/URIBL (URI blacklists)
z “informational” lookups (eg: ASN)
z Content Scoring filter
z Anti-virus
z Arbitrary ad-hoc string filters anywhere/on anything
z Direct/real-time feedback to filtering
Basic “Not filter” Requirements
z Full end-user quarantine view/forward
z End-user (recipient) notification (if desired)
z Full logs in database/arbitrary queries
z (Almost) fully automated false positive handling (forward, filter tune, notification/explanation)
z Operational and Management metrics
z Postfacto analysis and automated filter tuning
Components, Filter, Open-Source
z Core SMTP listening engine/agent: Qpsmtpd (Hansen, Sergeant et al.). 100% Perl implementation (really!)
z Async (event driven) mode
z Very high performance – 20M+/day small servers
z Entirely flexible by plugin interface
z Actively supported & robust
z Has many sample plugins
z SpamAssassin (popular scoring addon filter). (Perl)
z ClamAV (*ix-based) anti-virus signature-based engine
z Nearly two dozen ad-hoc filtering plugins, few more than a dozen lines.
Components, Filter, Glue
z A spam filter is more than just a filter, needs:
z Start/stop/reboot/monitoring
z Log & quarantine handling and transfer
z Extended filtering heuristic processes (for things that take too long for real-time)
Components, Backend
z PostgreSQL database
z Apache (admin and user interface)
z Interface to corporate user databases (push to filters)
z Admin (research, false positive, configuration, deployment) interface CGIs
z User interfaces (configuration and quarantine)
z Quarantine management
z Rsync – log, quarantine, configuration, software
Integration
[Diagram. Labelled elements: Internet, DMZ, QPSMTPD, Plugins, SpamAssassin, ClamAV, SPAM, Non-Spam, Mail servers, PostgreSQL SPAM Database, Apache, Config, Users, Rejection Notices, False Positive Reports, DNSBL, 3rd Party BL, CORWAN]
Test/Performance
z Spamtrap operating 9 months
z Performance heavily depends on “early pruning”
z “Cheap” tests first
z Prune filtering subsequent to block decision
z “Expensive” (body scans, SpamAssassin, ClamAV) tests last
z Volumes: typical 7m/server (50-100/sec), mostly spamtrap
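The "early pruning" idea above can be sketched in a few lines. This is an illustrative Python sketch, not the Qpsmtpd plugin code: the `Message` fields, the three example tests, their cost numbers, and the blocklist entry are all assumptions made up for the example. Tests run in ascending cost order and the pipeline stops at the first block decision, so the expensive body scan never runs for mail that a cheap test already rejected.

```python
from dataclasses import dataclass


@dataclass
class Message:
    client_ip: str
    helo: str
    body: str


# Hypothetical local blocklist entry (documentation address range).
LOCAL_BLOCKLIST = {"192.0.2.10"}


def ip_blocklist(msg):
    # Cheap: a set lookup on the connecting IP.
    return "listed in local blocklist" if msg.client_ip in LOCAL_BLOCKLIST else None


def bare_helo(msg):
    # Cheap: a HELO name with no dot is not a fully qualified domain name.
    return "HELO is not a FQDN" if "." not in msg.helo else None


def body_scan(msg):
    # Stand-in for an expensive content test (body scan, scoring, anti-virus).
    return "body matched ad-hoc string" if "cheap pills" in msg.body.lower() else None


# (relative cost, name, test); the cost values are illustrative only.
TESTS = [
    (100, "body_scan", body_scan),
    (1, "ip_blocklist", ip_blocklist),
    (2, "bare_helo", bare_helo),
]


def run_pipeline(msg, tests):
    """Run tests cheapest first and stop at the first block decision.

    Returns (rejection reason or None, names of the tests that actually ran).
    """
    executed = []
    for _cost, name, test in sorted(tests, key=lambda t: t[0]):
        executed.append(name)
        reason = test(msg)
        if reason is not None:
            return reason, executed  # prune everything more expensive
    return None, executed
```

For a message from a listed IP, only `ip_blocklist` runs; a clean message passes through all three tests in cost order.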
Advanced Techniques
z State of Affairs
z Hide!
z Banner delays
z Bot fingerprinting
z DNSBLs (local and/or otherwise)
z DNSBL infrastructure
z Bounces & BATV
z Ones we've omitted and why
State of Affairs
z Underground economy (spam, phish, spyware, CC, mules) increasing
z Some LE believe larger than International Drug trade
z BOTS responsible for 80%+ of all spam.
z Most getting good at stopping BOTs (<1% deliverability)
z => BOTs shifting to reputation theft (relay through legit MTAs)
z State of Anti-Virus: disaster. (new BOT caught by AV 23% of the time by battery of 35 AV tools, only increases to 50% by 30 days)
z Inadequate AV => can’t find BOT, let alone remediate
Hide!
z Make it difficult for BOTs to email you.
z BOTs not full MTAs, high volume/throughput requirements.
z Primary MX – “refuse connections” (Google for “nolisting”)
z Tertiary MX – “always retry”
z Dumb bots try once (primary or tertiary), get refusal or retry, and give up. Real MTAs do right thing.
z As much as 50% of BOT spam simply vanishes.
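A small simulation makes the MX arrangement above concrete. This is a sketch under stated assumptions: the host names and preference values are hypothetical, and the working secondary MX between the refusing primary and the always-retry tertiary is implied by the scheme rather than stated on the slide. A standards-following MTA walks the MX list in preference order and reaches the working host; a bot that makes a single attempt at the primary or the tertiary never delivers.

```python
from enum import Enum


class Outcome(Enum):
    REFUSED = "connection refused"
    TEMPFAIL = "4xx, retry later"
    ACCEPTED = "accepted"


# Hypothetical MX layout: (preference, host, what a connecting client sees).
MX_HOSTS = [
    (10, "mx1.example.com", Outcome.REFUSED),   # primary: refuses connections
    (20, "mx2.example.com", Outcome.ACCEPTED),  # assumed real mail filter
    (30, "mx3.example.com", Outcome.TEMPFAIL),  # tertiary: always "retry"
]


def real_mta_delivers(mx_hosts):
    """A real MTA tries each MX in preference order until one accepts."""
    for _pref, _host, outcome in sorted(mx_hosts, key=lambda h: h[0]):
        if outcome is Outcome.ACCEPTED:
            return True
    return False  # a real MTA would queue the message and retry later


def dumb_bot_delivers(mx_hosts, target):
    """A bot that tries once, at the primary or the tertiary, then gives up."""
    ordered = sorted(mx_hosts, key=lambda h: h[0])
    _pref, _host, outcome = ordered[0] if target == "primary" else ordered[-1]
    return outcome is Outcome.ACCEPTED
```

With this layout, `real_mta_delivers(MX_HOSTS)` is true while both bot strategies fail.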
Banner Delays
z Most BOTs impatient, and won’t retry
z 20-40 second banner delays =>
z BOTs give up in disgust
z Some legit MTAs equally impatient, may need to
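The effect of a banner delay can be demonstrated with a toy server and two clients. This is an illustrative Python `asyncio` sketch, not the production filter: the banner text, the host name in it, and the patience values are assumptions, and the delay is scaled down from the 20-40 seconds on the slide to a fraction of a second so the demo finishes quickly.

```python
import asyncio

BANNER = b"220 mx.example.com ESMTP\r\n"  # hypothetical host name


async def handle(reader, writer, delay):
    # Hold the greeting back; an impatient client gives up before it arrives.
    await asyncio.sleep(delay)
    writer.write(BANNER)
    try:
        await writer.drain()
    except ConnectionError:
        pass  # the client already disconnected
    writer.close()


async def client_sees_banner(port, patience):
    """Connect and wait at most `patience` seconds for the 220 greeting."""
    reader, writer = await asyncio.open_connection("127.0.0.1", port)
    try:
        line = await asyncio.wait_for(reader.readline(), timeout=patience)
        return line.startswith(b"220")
    except asyncio.TimeoutError:
        return False
    finally:
        writer.close()


async def demo(delay=0.3):
    # Port 0 asks the OS for any free port.
    server = await asyncio.start_server(
        lambda r, w: handle(r, w, delay), "127.0.0.1", 0
    )
    port = server.sockets[0].getsockname()[1]
    try:
        bot = await client_sees_banner(port, patience=0.1)  # impatient bot
        mta = await client_sees_banner(port, patience=2.0)  # patient MTA
    finally:
        server.close()
    return bot, mta
```

Running `asyncio.run(demo())` returns a pair of flags: whether the impatient client saw the banner, and whether the patient one did.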
BOT Fingerprinting
z Most BOTs have fingerprints in the headers and SMTP protocol that can be caught by pattern matching.
z Some mutate, some don’t.
z Srizbi > 50% of all spam.
z Feed source IP of detections back into local
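Pattern-based fingerprinting with feedback of the source IP can be sketched as follows. The two patterns here are invented placeholders, not real bot signatures (those come from studying actual bot traffic), and the in-memory set standing in for the local blocklist is an assumption for the example.

```python
import re

# Hypothetical fingerprints: (name, field to inspect, pattern).
FINGERPRINTS = [
    ("bare-word-helo", "helo", re.compile(r"^[A-Za-z0-9-]+$")),
    ("examplebot-mailer", "x-mailer", re.compile(r"^ExampleBot/\d+")),
]


def match_fingerprint(helo, headers):
    """Return the name of the first matching fingerprint, or None."""
    fields = {key.lower(): value for key, value in headers.items()}
    fields["helo"] = helo  # treat the SMTP HELO argument like a header field
    for name, field, pattern in FINGERPRINTS:
        value = fields.get(field)
        if value is not None and pattern.search(value):
            return name
    return None


def record_detection(client_ip, helo, headers, local_blocklist):
    """On a fingerprint hit, feed the source IP into the local blocklist."""
    name = match_fingerprint(helo, headers)
    if name is not None:
        local_blocklist.add(client_ip)
    return name
```

A hit on any fingerprint adds the connecting IP to the set, so later connections from that IP can be rejected by a cheap lookup before any content test runs.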
DNSBL (DNS Blacklist)
z Hundreds of 3rd party DNSBLs (IP based, domain based, URIBL filtering etc)
z A handful are both reliable and effective.
z There are DNSBLs effective to 70-80%+ of all
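For readers unfamiliar with how an IP-based DNSBL is queried: the client reverses the IPv4 octets, appends the list's zone, and looks up an A record; an answer in 127.0.0.0/8 means "listed" and a failed lookup means "not listed". A minimal Python sketch follows. The zone name `dnsbl.example.org` is a placeholder, and the resolver is injectable so the logic can be exercised without network access.

```python
import ipaddress
import socket


def dnsbl_query_name(ip, zone):
    """Build the DNS name to look up: reversed IPv4 octets plus the zone."""
    octets = str(ipaddress.IPv4Address(ip)).split(".")  # validates the address
    return ".".join(reversed(octets)) + "." + zone


def is_listed(ip, zone, resolve=socket.gethostbyname):
    """True if the DNSBL answers with a 127.x.x.x address for this IP."""
    try:
        answer = resolve(dnsbl_query_name(ip, zone))
    except OSError:  # covers socket.gaierror for a name that does not resolve
        return False
    return answer.startswith("127.")
```

For example, `dnsbl_query_name("192.0.2.99", "dnsbl.example.org")` yields `99.2.0.192.dnsbl.example.org`.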
DNSBL Merge
z High volume receivers may impose undue loading on 3rd party DNSBL infrastructure.
z Occasional erratic delays (including DDOS on DNSBL)
z => Host them locally
z We use rbldnsd – very high performance DNS server designed for high-performance serving of DNSBL zones.
z We combine multiple 3rd party zones (plus ones we create ourselves) into a single zone.
z Each DNSBL source distinguishable by return code,
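The merge step can be pictured as building one lookup table from several source lists, with a distinct return code per source. The sketch below is an assumption-laden illustration in Python: the source names, return codes, and addresses are invented; "first source in the list wins" for an IP that appears in several lists is one possible policy, not necessarily the one used here; and it produces an in-memory mapping rather than an actual rbldnsd zone file.

```python
# Hypothetical sources in priority order: (name, return code, listed IPs).
SOURCES = [
    ("vendor-a", "127.0.0.2", {"192.0.2.1", "192.0.2.2"}),
    ("vendor-b", "127.0.0.3", {"192.0.2.2", "198.51.100.5"}),
    ("local-fingerprints", "127.0.0.4", {"203.0.113.9"}),
]


def merge_zones(sources):
    """Merge several lists into one mapping of IP -> return code.

    Assumed policy: when an IP is on several lists, the earliest source wins.
    """
    merged = {}
    for _name, code, ips in sources:
        for ip in ips:
            merged.setdefault(ip, code)
    return merged


def source_for(code, sources):
    """Map a return code back to the source that produced it (for logging)."""
    for name, src_code, _ips in sources:
        if src_code == code:
            return name
    return None
```

A single lookup against the merged table then answers both "is this IP listed?" and "by which source?", which is what makes the per-source return code useful in logs and rejection notices.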
Filtering/Bounces & BATV
z Accepting then bouncing email with forged from => bounce storms (aka backscatter/blowback) => evil
z Simple blackholing also evil
z Aim is inline reject, with remediation information.
z Support costs of receiving end of blowback often exceed spam
z BATV (Bounce Address Tag Validation) see http://mipassoc.org/batv/
z When sending email, encode bounce address (MAIL FROM)
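The encoding step can be illustrated with a keyed tag on the envelope sender. The Python sketch below is loosely modeled on the `prvs=` style of tagging described in the BATV draft, but the exact tag layout, the secret, the key number `0`, and the 7-day lifetime are assumptions for illustration, not a conforming implementation. Outbound mail carries the tagged address in MAIL FROM; an incoming bounce is accepted only if its recipient carries a valid, unexpired tag, so bounces for mail the site never sent can be rejected inline.

```python
import hashlib
import hmac
import re
import time

SECRET = b"example-secret"  # hypothetical signing key


def _tag(local, domain, day):
    """First 6 hex digits of an HMAC-SHA1 over the expiry day and address."""
    msg = f"{day:03d}{local}@{domain}".encode()
    return hmac.new(SECRET, msg, hashlib.sha1).hexdigest()[:6]


def sign(address, now=None, lifetime_days=7):
    """Return a tagged bounce address for use as MAIL FROM."""
    now = time.time() if now is None else now
    local, domain = address.rsplit("@", 1)
    expiry = (int(now // 86400) + lifetime_days) % 1000
    return f"prvs=0{expiry:03d}{_tag(local, domain, expiry)}={local}@{domain}"


def validate(rcpt, now=None, lifetime_days=7):
    """Return the original address if the tag is valid and unexpired, else None."""
    now = time.time() if now is None else now
    if not rcpt.startswith("prvs=") or rcpt.count("=") < 2 or "@" not in rcpt:
        return None
    tagval, addr = rcpt[5:].split("=", 1)
    m = re.fullmatch(r"[0-9]([0-9]{3})([0-9a-f]{6})", tagval)
    if m is None or "@" not in addr:
        return None
    local, domain = addr.rsplit("@", 1)
    expiry = int(m.group(1))
    if not hmac.compare_digest(m.group(2), _tag(local, domain, expiry)):
        return None  # forged or altered tag
    today = int(now // 86400) % 1000
    if (expiry - today) % 1000 > lifetime_days:
        return None  # tag has expired
    return addr
```

A message sent as `sign("alice@example.com")` produces bounces addressed to the tagged form, which `validate` maps back to the original address; an untagged or forged bounce recipient yields `None` and can be refused during the SMTP session.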
Omitted Techniques & Why
z Greylisting – (force retry of “new senders”).
z Increasing reports of BOTs doing retry.
z Doesn’t prevent spam-by-reputation-hijacking
z Bayesian – needs training, in many cases defeated
z Checksumming (Razor/DCC et al.) –
z Detects bulk, not spam per-se
z Problematic when outsourcing user-contact (eg: HR)
z Needs whitelisting