Effective Open-Source Spam Filtering

(1)

Effective Open-Source Spam Filtering

For Enterprise

Chris Lewis, Thomas Choi
October 2008, VB2008, Ottawa

(2)

Agenda

• Introduction
• Background
• Something New - Rationale
• The Open-Source Project
  • Basic Requirements
  • Components
  • Integration
  • Test/Performance
• Advanced Techniques

(3)

Introduction/Authors

Chris Lewis
Senior Security Analyst/Anti-Spam, Nortel
Senior Technical Advisor, MAAWG
Member, Canadian Federal Anti-Spam Task Force

Thomas Choi
Nortel
Ph.D. Student, Carleton University

(4)

Background

• Spam became a problem in 1994/1995
  • Initially in Usenet
  • Clearly would transition to email
• Commenced email anti-spam program in 1997
  • Extremely customized Lyris Mailshield implementation
• VB2004 “Corporate Spam Fighting: 5 years of success and lessons learned”, by Chris Lewis and John Morris – don't forget those lessons!

(5)

Something New - Rationale

• Lyris Mailshield has stood us in good stead
• But it is getting a little elderly: higher volumes, difficult to extend with newer techniques
• Review of many other vendor offerings:
  • All missing one or more critical features
  • Integrated poorly with existing infrastructure
  • Not, or poorly, extensible/configurable
  • Not as effective as current solution

(6)

Rationale ... Continued

• Needed open architecture/modular/easy extension
• Low capital/license cost (free obviously best!)
• Use standard components to minimize development costs
• Use existing basic low-medium size server class hardware
• Focus on 3rd party/popular filtering methodologies, simple ad-hoc filtering capabilities, plus our own “secret sauce”.

(7)

The Open Source Project

• Basic Requirements – Functional Specification
• Component Selection
• Integration
• Back end
• Testing

(8)

Basic Requirements - Filter

• Support multiple recipient domains
• Configurable per-domain handling
• Per-domain filter enable
• Configurable archiving/quarantine/disposition (pass, filter, trap)
• Output routing
• Full logging
• NEVER bounce or silent blackhole (except trap)
• Plugin architecture – each technique an
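The per-domain handling requirements above could be captured in a small lookup table. The following is a minimal Python sketch, not the Qpsmtpd configuration format: the domain names, the field names, the `disposition_for` helper and the choice to refuse unknown domains inline are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DomainPolicy:
    filter_enabled: bool  # per-domain filter enable
    disposition: str      # "pass", "filter" or "trap"
    route_to: str         # output routing: next-hop mail server


# Hypothetical policy table; real domains and hosts would come from configuration.
POLICIES = {
    "example.com": DomainPolicy(True, "filter", "mail1.example.com"),
    "legacy.example.net": DomainPolicy(False, "pass", "mail2.example.net"),
    "trap.example.org": DomainPolicy(True, "trap", ""),
}


def disposition_for(recipient: str) -> str:
    """Return how mail for `recipient` is handled."""
    domain = recipient.rsplit("@", 1)[-1].lower()
    policy = POLICIES.get(domain)
    if policy is None:
        # Assumption: unknown domains are refused inline, never bounced later.
        return "reject"
    if not policy.filter_enabled:
        return "pass"
    return policy.disposition
```

For example, `disposition_for("someone@trap.example.org")` returns `"trap"`, while a domain with filtering switched off always yields `"pass"`.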

(9)

Basic Filter Requirements ... Continued

• Fault tolerant (eg: failover)
• Support 3rd party facilities, eg:
  • DNSBL (IP blacklists)
  • SURBL/URIBL (URI blacklists)
  • “Informational” lookups (eg: ASN)
• Content scoring filter
• Anti-virus
• Arbitrary ad-hoc string filters anywhere/on anything
• Direct/real-time feedback to filtering

(10)

Basic “Not filter” Requirements

• Full end-user quarantine view/forward
• End-user (recipient) notification (if desired)
• Full logs in database/arbitrary queries
• (Almost) fully automated false positive handling (forward, filter tune, notification/explanation)
• Operational and management metrics
• Post-facto analysis and automated filter tuning

(11)

Components, Filter, Open-Source

• Core SMTP listening engine/agent: Qpsmtpd (Hansen, Sergeant et al.). 100% Perl implementation (really!)
  • Async (event driven) mode
  • Very high performance – 20M+/day small servers
  • Entirely flexible by plugin interface
  • Actively supported & robust
  • Has many sample plugins
• SpamAssassin (popular scoring addon filter). (Perl)
• ClamAV (*ix-based) anti-virus signature-based engine
• Nearly two dozen ad-hoc filtering plugins, few more than a dozen lines.

(12)

Components, Filter, Glue

• A spam filter is more than just a filter, needs:
  • Start/stop/reboot/monitoring
  • Log & quarantine handling and transfer
  • Extended filtering heuristic processes (for things that take too long for real-time)

(13)

Components, Backend

• PostgreSQL database
• Apache (admin and user interface)
• Interface to corporate user databases (push to filters)
• Admin (research, false positive, configuration, deployment) interface CGIs
• User interfaces (configuration and quarantine)
• Quarantine management
• Rsync – log, quarantine, configuration, software

(14)

Integration

[Figure: integration diagram. Labels: Internet, DMZ, QPSMTPD, Plugins, SpamAssassin, ClamAV, DNSBL, 3rd Party BL, Spam, Non-Spam, Mail servers, PostgreSQL Spam Database, Apache, Config, Users, Rejection Notices, False Positive Reports, CORWAN]

(15)

Test/Performance

• Spamtrap operating 9 months
• Performance heavily depends on “early pruning”
  • “Cheap” tests first
  • Prune filtering subsequent to block decision
  • “Expensive” (body scans, SpamAssassin, ClamAV) tests last
• Volumes: typical 7m/server (50-100/sec), mostly spamtrap
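The “early pruning” idea above can be illustrated with a short Python sketch: checks are sorted by relative cost, and the first block decision stops all further work. The check names, costs, test data and the `run_checks` helper are hypothetical and are not the Qpsmtpd plugin API.

```python
from typing import Callable, List, NamedTuple, Tuple


class Check(NamedTuple):
    name: str
    cost: int                       # relative cost; lower runs earlier
    blocks: Callable[[dict], bool]  # True means "block this message"


def run_checks(message: dict, checks: List[Check]) -> Tuple[str, List[str]]:
    """Run checks cheapest-first; return (blocking check or "", checks actually run)."""
    ran = []
    for check in sorted(checks, key=lambda c: c.cost):
        ran.append(check.name)
        if check.blocks(message):
            # Prune: once blocked, the remaining (costlier) tests never run.
            return check.name, ran
    return "", ran


# Hypothetical checks, deliberately listed out of cost order.
CHECKS = [
    Check("body_scan", 100, lambda m: "cheap pills" in m["body"].lower()),
    Check("dnsbl", 1, lambda m: m["ip"] in {"192.0.2.1"}),
    Check("helo_syntax", 0, lambda m: "." not in m["helo"]),
]
```

With a message from `192.0.2.1`, the DNSBL check blocks it and the body scan is never executed, which is where the savings come from at high volume.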

(16)

Advanced Techniques

• State of Affairs
• Hide!
• Banner delays
• Bot fingerprinting
• DNSBLs (local and/or otherwise)
• DNSBL infrastructure
• Bounces & BATV
• Ones we've omitted and why

(17)

State of Affairs

• Underground economy (spam, phish, spyware, CC, mules) increasing
  • Some LE believe larger than international drug trade
• BOTs responsible for 80%+ of all spam.
• Most getting good at stopping BOTs (<1% deliverability)
• => BOTs shifting to reputation theft (relay through legit MTAs)
• State of Anti-Virus: disaster. (new BOT caught by AV 23% of the time by battery of 35 AV tools, only increases to 50% by 30 days)
• Inadequate AV => can’t find BOT, let alone remediate

(18)

Hide!

• Make it difficult for BOTs to email you.
• BOTs not full MTAs, high volume/throughput requirements.
• Primary MX – “refuse connections” (Google for “nolisting”)
• Tertiary MX – “always retry”
• Dumb bots try once (primary or tertiary), get refusal or retry, and give up. Real MTAs do right thing.
• As much as 50% of BOT spam simply vanishes.
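A toy Python model may help show why this MX layout works. The host names, preference values and helper functions are hypothetical. The model only contrasts a sender that walks the whole MX list with one that makes a single attempt.

```python
# Hypothetical MX set: (preference, host, behaviour).
MX_RECORDS = [
    (10, "mx1.example.com", "refuse"),  # primary: never accepts connections
    (20, "mx2.example.com", "accept"),  # the real mail server
    (30, "mx3.example.com", "retry"),   # tertiary: always answers "try again later"
]


def real_mta_delivers(mx_records) -> bool:
    """A standards-following MTA tries each MX in preference order."""
    for _pref, _host, behaviour in sorted(mx_records):
        if behaviour == "accept":
            return True
    return False  # nothing accepted: a real MTA would queue and retry later


def dumb_bot_delivers(mx_records, pick) -> bool:
    """A one-shot bot contacts a single MX (chosen by `pick`) and never retries."""
    _pref, _host, behaviour = pick(sorted(mx_records))
    return behaviour == "accept"
```

A bot that only ever tries the lowest-preference or the highest-preference record, for example `dumb_bot_delivers(MX_RECORDS, lambda mx: mx[0])`, never reaches the real server, while `real_mta_delivers(MX_RECORDS)` succeeds on the second record.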


(19)

Banner Delays

• Most BOTs impatient, and won’t retry
• 20-40 second banner delays => BOTs give up in disgust
• Some legit MTAs equally impatient, may need to
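As an illustration of a banner delay, here is a minimal Python `asyncio` sketch. It is not the Qpsmtpd implementation: the banner host name, the port and the `greet_with_delay` helper are hypothetical. Dropping clients that send data before the banner is an added assumption; the slides only say that impatient BOTs give up.

```python
import asyncio

BANNER = b"220 mx.example.com ESMTP\r\n"  # hypothetical host name


async def greet_with_delay(reader, writer, delay: float) -> bool:
    """Hold the 220 banner for `delay` seconds.

    Returns True if the client waited and was greeted, False if it
    disconnected or sent data early (in which case the connection is closed).
    """
    try:
        # A patient client sends nothing yet, so this read should time out.
        await asyncio.wait_for(reader.read(1), timeout=delay)
    except asyncio.TimeoutError:
        writer.write(BANNER)
        await writer.drain()
        return True
    # Data or EOF arrived before the banner: the client gave up or talked early.
    writer.close()
    return False


async def serve(host: str = "127.0.0.1", port: int = 2525, delay: float = 30.0):
    """Listen and apply the delay; port and default delay are placeholders."""
    async def on_connect(reader, writer):
        if await greet_with_delay(reader, writer, delay):
            writer.close()  # a real server would continue the SMTP dialogue here

    server = await asyncio.start_server(on_connect, host, port)
    async with server:
        await server.serve_forever()
```

Because the wait is an awaited read rather than a blocking sleep, thousands of delayed connections can be held open cheaply, which fits the event-driven mode mentioned earlier.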

(20)

BOT Fingerprinting

• Most BOTs have fingerprints in the headers and SMTP protocol that can be caught by pattern matching.
• Some mutate, some don’t.
• Srizbi > 50% of all spam.
• Feed source IP of detections back into local
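Fingerprinting by pattern matching might look roughly like the Python sketch below. The two patterns are invented for illustration and do not describe any real BOT. The session field names and the in-memory `local_blocklist` set are also assumptions, since the slide does not say where detections are stored.

```python
import re
from typing import Optional

# Hypothetical fingerprints: (session field, pattern). Real ones come from
# analysing actual bot traffic and are not reproduced here.
FINGERPRINTS = {
    "example-bot-a": ("helo", re.compile(r"^[a-z]{8}$")),
    "example-bot-b": ("message_id", re.compile(r"^<\d{12}@localhost>$")),
}

# Assumption: detections are fed back into a local IP list.
local_blocklist = set()


def fingerprint(session: dict) -> Optional[str]:
    """Return the name of the first matching fingerprint, recording the source IP."""
    for name, (field, pattern) in FINGERPRINTS.items():
        if pattern.search(session.get(field, "")):
            local_blocklist.add(session["ip"])
            return name
    return None
```

A session such as `{"ip": "192.0.2.50", "helo": "qwertyui"}` matches the first pattern, and its IP is then available to block later connections cheaply.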

(21)

DNSBL (DNS Blacklist)

• Hundreds of 3rd party DNSBLs (IP based, domain based, URIBL filtering etc)
• A handful are both reliable and effective.
• There are DNSBLs effective to 70-80%+ of all
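For readers unfamiliar with the mechanism, an IP-based DNSBL is queried by reversing the octets of the connecting IPv4 address, appending the list's zone, and asking for an A record. A minimal Python sketch follows; the zone name `dnsbl.example.org` used in the usage note is a placeholder, and the helper names are hypothetical.

```python
import ipaddress
import socket
from typing import Optional


def dnsbl_query_name(ip: str, zone: str) -> str:
    """Build the DNS name to query, e.g. 192.0.2.99 -> 99.2.0.192.<zone>."""
    addr = ipaddress.IPv4Address(ip)  # raises ValueError on a malformed address
    reversed_octets = ".".join(reversed(str(addr).split(".")))
    return f"{reversed_octets}.{zone}"


def dnsbl_lookup(ip: str, zone: str) -> Optional[str]:
    """Return the listing's A record as a string, or None if there is no answer.

    Note that None also covers resolver failures, so it means
    "no listing seen", not "definitely clean".
    """
    try:
        return socket.gethostbyname(dnsbl_query_name(ip, zone))
    except socket.gaierror:
        return None
```

For example, `dnsbl_query_name("192.0.2.99", "dnsbl.example.org")` yields `99.2.0.192.dnsbl.example.org`. Listed addresses conventionally resolve to something in `127.0.0.0/8`.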

(22)

DNSBL Merge

• High volume receivers may impose undue loading on 3rd party DNSBL infrastructure.
• Occasional erratic delays (including DDOS on DNSBL)
• => Host them locally
• We use rbldnsd – very high performance DNS server designed for high-performance serving of DNSBL zones.
• We combine multiple 3rd party zones (plus ones we create ourselves) into a single zone.
• Each DNSBL source distinguishable by return code,
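The merge step could be sketched as follows in Python. This does not produce rbldnsd's zone file format. The return codes, the sample addresses and the rule that the first source listed wins when an IP appears in several lists are all assumptions; the slides do not say how overlaps are resolved.

```python
from typing import Dict, Iterable, List, Tuple


def merge_zones(sources: List[Tuple[str, Iterable[str]]]) -> Dict[str, str]:
    """Combine several IP lists into one mapping of IP -> return code.

    `sources` is an ordered list of (return_code, ips). When an IP appears
    in more than one source, the earliest source keeps it (an assumption).
    """
    merged: Dict[str, str] = {}
    for return_code, ips in sources:
        for ip in ips:
            merged.setdefault(ip, return_code)
    return merged


# Hypothetical sources, each with its own 127.0.0.x return code.
SOURCES = [
    ("127.0.0.2", ["192.0.2.1", "192.0.2.2"]),     # e.g. a 3rd party list
    ("127.0.0.3", ["192.0.2.2", "192.0.2.3"]),     # e.g. a second 3rd party list
    ("127.0.0.4", ["198.51.100.7"]),               # e.g. a locally generated list
]
```

A single lookup against the merged data then answers both "is this IP listed?" and "by which source?", so the filter needs one query instead of one per list.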

(23)

Filtering/Bounces & BATV

• Accepting then bouncing email with forged from => bounce storms (aka backscatter/blowback) => evil
• Simple blackholing also evil
• Aim is inline reject, with remediation information.
• Support costs of receiving end of blowback often exceed spam
• BATV (Bounce Address Tag Validation) see http://mipassoc.org/batv/
• When sending email, encode bounce address (MAIL FROM)
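To make the idea of encoding the bounce address concrete, here is a simplified Python sketch loosely modelled on the `prvs` tagging scheme from the BATV draft. Treat the exact tag layout, the secret, the lifetime and the function names as assumptions for illustration rather than a conforming implementation.

```python
import hashlib
import hmac

SECRET = b"change-me"  # hypothetical signing key


def _mac(stamp: str, address: str, secret: bytes) -> str:
    """First 6 hex digits of an HMAC-SHA1 over the stamp and the original address."""
    return hmac.new(secret, (stamp + address).encode(), hashlib.sha1).hexdigest()[:6]


def sign(address: str, now: float, key_id: int = 0, lifetime_days: int = 7,
         secret: bytes = SECRET) -> str:
    """Tag an outgoing MAIL FROM address with a key digit, expiry day and MAC."""
    local, domain = address.rsplit("@", 1)
    expiry = (int(now // 86400) + lifetime_days) % 1000
    stamp = f"{key_id % 10}{expiry:03d}"
    return f"prvs={stamp}{_mac(stamp, address, secret)}={local}@{domain}"


def verify(tagged: str, now: float, max_days: int = 7,
           secret: bytes = SECRET) -> bool:
    """Accept a bounce recipient only if its tag is genuine and not expired."""
    local, _, domain = tagged.rpartition("@")
    parts = local.split("=", 2)
    if len(parts) != 3 or parts[0] != "prvs" or len(parts[1]) != 10:
        return False
    stamp, mac = parts[1][:4], parts[1][4:]
    if not stamp.isdigit():
        return False
    expected = _mac(stamp, f"{parts[2]}@{domain}", secret)
    if not hmac.compare_digest(mac.encode(), expected.encode()):
        return False
    # Day numbers wrap modulo 1000; anything further away than max_days is stale.
    days_left = (int(stamp[1:]) - int(now // 86400)) % 1000
    return days_left <= max_days
```

A bounce addressed to an untagged or forged address fails `verify` and can be rejected inline, which is how this approach cuts off blowback from mail the site never sent.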

(24)

Omitted Techniques & Why

• Greylisting – (force retry of “new senders”).
  • Increasing reports of BOTs doing retry.
  • Doesn’t prevent spam-by-reputation-hijacking
• Bayesian – needs training, in many cases defeated
• Checksumming (Razor/DCC et al.) –
  • Detects bulk, not spam per se
  • Problematic when outsourcing user-contact (eg: HR)
  • Needs whitelisting