Spam.ppt

Academic year: 2021

(1)

Text Categorization

Moshe Koppel

Lecture 10: Spam Detection

(2)

Obligatory Scare Slide

• There's lots of spam

• The proportion of spam is growing – it will soon exceed 100% of all email sent

• It costs the world gazillions of dollars

• Spam is BAD

• (Actually, lately it looks like spam email has been mostly defeated.)

(3)

Kinds of spam

• Active spam – ads and scams

– email

– chatbots

– commentbots

• Passive spam – websites

– link farms for SEO

– adsense parking lots

(4)

Special Issues

Spam detection is basically a text cat problem, but there are some special issues:

• Collecting data – non-spam email is private

• Asymmetry – must never class good mail as spam

(5)

Collecting Data

• Standard collections

– SpamAssassin Corpus

– TREC corpora

• Use your own email

– Might not reflect world

• Gmail has user feedback

– LOTS of examples

– Haphazardly labeled

(6)

Problem of False Positives

• False positives more costly than false negatives

• Research must report recall-precision curves; key point is precision ~ 1
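One way to read "precision ~ 1" is to ask how much spam a filter catches when it is only allowed thresholds that flag almost no good mail. The sketch below is a minimal illustration under that reading; the function name `recall_at_precision` and the toy scores are assumptions for illustration, not part of the lecture.

```python
def recall_at_precision(scores, labels, min_precision=0.999):
    """Best spam recall over all score thresholds whose precision is >= min_precision.

    scores: higher means more spam-like.
    labels: 1 = spam, 0 = good mail.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    total_spam = sum(labels)
    if total_spam == 0:
        return 0.0
    best_recall = 0.0
    true_pos = false_pos = 0
    i = 0
    while i < len(ranked):
        # Messages with the same score fall on the same side of any threshold,
        # so consume ties together before measuring precision.
        j = i
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            if ranked[j][1] == 1:
                true_pos += 1
            else:
                false_pos += 1
            j += 1
        precision = true_pos / (true_pos + false_pos)
        if precision >= min_precision:
            best_recall = max(best_recall, true_pos / total_spam)
        i = j
    return best_recall


# Toy example (made-up scores): three spam messages, two good ones.
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.5]
toy_labels = [1, 1, 0, 1, 0]
print(recall_at_precision(toy_scores, toy_labels, min_precision=1.0))
```

With these toy numbers, demanding perfect precision leaves one spam message uncaught; relaxing the precision requirement recovers it at the price of a false positive.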

(7)

Adversarial Problem

• Spammers reverse engineer global filters; use nasty tricks to circumvent them

• This is what makes spam detection an interesting problem

(8)

Basic Spam

• Let's start with some garden variety spam

• This is easily detected by standard text cat

(9)

It cost you nothing (Yes! $0) to give Us a call, We will contact You back Absolutely No exams/Tests/classes/books/Interviews

No Pre-School qualification Needed! ---

Inside USA: 1-718-989-5XXX 0utside USA: +1-718-989-5XXX ---

Degree, Bacheelor, masteerMBA, PhDD available in the field of your choice that's Right, You can even become a doctor & receive all the benefits That omes With it!

Please Leave Below 3 INFO in voicemail: 1) your Name

2) your Country

3) your Phone No. (with Countrycode)

(10)

Most Honorable Sir,

I am Ehud Olmert, formerly the Prime Minister of Israel. I URGENTLY REQUIRE YOUR ASSISTANCE IN A MOST DISCRETE MATTER. As a result of certain events in my country, it has become necessary for me to transfer a considerable sum of cash to a foreign bank account. I turn to you as a MOST HONORABLE AND TRUSTED PERSON for your discrete assistance.

The total amount involved is THIRTY MILLION NEW ISRAELI SHEKELS only [30,000.000.00 NIS] and we wish to transfer this money into safe foreigners account abroad. I am only contacting you as a foreigner because this money cannot be approved to a local person here, but to a foreigner who has information about the account, which I shall give to you upon your positive response. I am revealing this to you with believe in God that you will never let me down in this business, you are the FIRST AND THE ONLY PERSON that I am contacting for this business, so please reply urgently so that I will inform you the next step to take urgently.

At the conclusion of this business, you will be given 40% of the total amount, 50% will be for us while 10% will be for the expenses both parties may incurred during this transaction. PLEASE, TREAT THIS PROPOSAL AS TOP SECRET.

(11)

Early Work

Sahami et al. '98

• Learner: Naïve Bayes

• Feature Set: Words, Phrases, Structural Features

• Feature Selection: top 500 infogain

• Evaluation Data: ~1700 Messages, ~88% Spam
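As a rough illustration of the learner named above, here is a minimal word-based Naive Bayes sketch with add-one smoothing. It is not a reconstruction of the Sahami et al. setup: it uses plain word counts only, skips phrases, structural features and information-gain selection, and the training messages in `TOY_DOCS` are made up.

```python
import math
from collections import Counter

# Made-up toy training set, for illustration only.
TOY_DOCS = [
    ("free money now", "spam"),
    ("only $ 5 free offer", "spam"),
    ("meeting tomorrow at noon", "ham"),
    ("lecture notes attached", "ham"),
]


def train_nb(docs):
    """Count words per class. docs is a list of (text, label) pairs."""
    word_counts = {"spam": Counter(), "ham": Counter()}
    doc_counts = Counter()
    for text, label in docs:
        doc_counts[label] += 1
        word_counts[label].update(text.lower().split())
    vocab = set(word_counts["spam"]) | set(word_counts["ham"])
    return word_counts, doc_counts, vocab


def spam_log_odds(model, text):
    """Log-odds that text is spam; positive means 'more likely spam'."""
    word_counts, doc_counts, vocab = model
    totals = {label: sum(counts.values()) for label, counts in word_counts.items()}
    score = math.log(doc_counts["spam"] / doc_counts["ham"])  # class prior
    for word in text.lower().split():
        if word not in vocab:
            continue  # ignore words never seen in training
        # Add-one (Laplace) smoothing keeps unseen class/word pairs from zeroing out.
        p_spam = (word_counts["spam"][word] + 1) / (totals["spam"] + len(vocab))
        p_ham = (word_counts["ham"][word] + 1) / (totals["ham"] + len(vocab))
        score += math.log(p_spam / p_ham)
    return score


model = train_nb(TOY_DOCS)
print(spam_log_odds(model, "free money now") > 0)    # spam-like message
print(spam_log_odds(model, "meeting tomorrow") > 0)  # ordinary message
```

Because of the asymmetry discussed earlier, a real filter would flag a message only when the log-odds clear a high threshold rather than zero.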

(12)

Early Work

Sahami et al. '98

Hand Crafted Features

– 35 Phrases

• 'Free Money'

• 'Only $'

• 'be over 21'

– 20 Domain Specific Features

• Domain type of sender (.edu, .com, etc)

• Sender name resolutions (internal mail)

• Has attachments

• Time received

(13)

Later Studies

• The early work was followed by the usual stream of extended feature sets and fancier learning methods (e.g. SVM)

• It is now common to use over 100,000 features

• Learning methods for huge data sets must be very efficient (online algorithms)

(14)

How to Beat an Adaptive Spam Filter

Graham-Cumming '04

• Use machine learning to discover words that beat an adaptive filter

– Take a message that is near the spam threshold

– Send it to the target filter 10,000 times, each time adding 5 random words

– Train an 'evil' filter to learn which messages beat the target filter

– Use the 'evil' filter to modify new spam messages

• Found single-word additions that get new spam past the filter

(15)

Other Tricks

• Fill messages with real text taken from books, sites, etc.

• Can even generate real-looking texts using Markovian language models
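To see why Markov-generated chaff looks locally plausible, consider a first-order (bigram) word model: every adjacent word pair it emits was observed in real text, so word-pair statistics alone cannot tell it from the source. The sketch below is a toy illustration; the function names and the sample sentence are assumptions, not taken from the lecture.

```python
import random
from collections import defaultdict


def build_chain(text):
    """Map each word to the list of words that follow it in the source text."""
    words = text.split()
    chain = defaultdict(list)
    for current, following in zip(words, words[1:]):
        chain[current].append(following)
    return chain


def generate(chain, start, length, rng):
    """Random walk over the chain: up to `length` words, beginning at `start`."""
    out = [start]
    while len(out) < length:
        followers = chain.get(out[-1])
        if not followers:
            break  # dead end: the last word never had a successor in the source
        out.append(rng.choice(followers))
    return " ".join(out)


source = "the cat sat on the mat and the cat ran"
chain = build_chain(source)
print(generate(chain, "the", 6, random.Random(0)))
```

This is one reason the compression and character n-gram methods discussed later look at more than word frequencies.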

(16)

The Hitchhiker Chaffer

• Content Chaff

– Random passages from the Hitchhiker's Guide

– Footers from valid mail

“This must be Thursday,” said Arthur to himself, sinking low over his beer, “I never could get the hang of Thursdays.”

Express yourself with MSN Messenger 6.0…

(17)

Hitchhiker Chaffer's Later Work

• There is nothing fancy about this spam

– “A spam filter will catch that in its sleep” – anonymous

(18)

Hitchhiker Chaffer's Later Work

• Hidden Text

• Content Chaff

• URL Spamming

Also included a number of unusual statements made by candidates during, 'On display? I eventually had to go down to the cellar to find them.'

http://join.msn.com/?Pag e=features/es

(19)

More Tricks

• Encoded Text

• Distorted Text

(20)

Secret Decoder Ring Dude

• Another spam that looks easy

(21)

Secret Decoder Ring Dude

• Character Encoding

• HTML word breaking

Pharmacy

(22)

Diploma Guy

• Word Obscuring

Dplmoia Pragorm

(23)

More of Diploma Guy

• Diploma Guy is good at what he does

(24)

One Pretty Good Text Cat Method

• Optimally compress spam training examples

• Optimally compress non-spam training examples

• Check which compression model better compresses the suspicious message
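The three steps above can be approximated with an off-the-shelf compressor. The sketch below swaps in zlib (DEFLATE) for the "optimal" compression the slide has in mind, which is an assumption made for brevity: it measures how many extra compressed bytes a message costs when appended to each training corpus and picks the class where it is cheaper. The corpora and function names are made up for illustration.

```python
import zlib

# Made-up toy corpora, for illustration only.
SPAM_CORPUS = (
    "Buy cheap pills online now. Free money, only $9.99! "
    "Get your diploma today, no exams needed. "
)
HAM_CORPUS = (
    "The meeting is moved to Thursday at noon. "
    "Please send the lecture notes before class. "
    "See you at the seminar tomorrow. "
)


def compressed_size(data: bytes) -> int:
    """Size in bytes of the zlib-compressed data at maximum compression level."""
    return len(zlib.compress(data, 9))


def extra_bytes(corpus: str, message: str) -> int:
    """Extra compressed bytes the message costs once the corpus is already known."""
    base = corpus.encode("utf-8")
    return compressed_size(base + message.encode("utf-8")) - compressed_size(base)


def classify(message: str, spam_corpus: str, ham_corpus: str) -> str:
    """Assign the class whose corpus makes the message cheaper to compress."""
    spam_cost = extra_bytes(spam_corpus, message)
    ham_cost = extra_bytes(ham_corpus, message)
    return "spam" if spam_cost < ham_cost else "ham"


print(classify("Get your diploma today, no exams needed.", SPAM_CORPUS, HAM_CORPUS))
print(classify("Please send the lecture notes before class.", SPAM_CORPUS, HAM_CORPUS))
```

zlib only looks back 32 KB, so on realistic corpora this sketch would ignore most of the training data; a compressor with an adaptive model over the whole corpus would be the natural replacement.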

(25)

Why This Works

• Works at level of character n-grams

• Should be applied to HTML source

• Captures weird encodings, word distortions

• Probably using character n-grams with SVM would also work well
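A small sketch of why character n-grams survive word distortions: a misspelled word shares most of its character trigrams with the original, even though the two never match as whole-word features. The helper names below are hypothetical, and the example word pair is taken from the diploma spam shown earlier.

```python
def char_ngrams(text, n=3):
    """Set of all character n-grams in text (lowercased)."""
    text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def jaccard(a, b):
    """Overlap between two sets: |intersection| / |union|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# As whole words these are simply different features...
print("masteer" == "master")
# ...but at the trigram level they share "mas", "ast" and "ste".
print(jaccard(char_ngrams("masteer"), char_ngrams("master")))
```

The same idea applies to raw HTML source, where the n-grams also pick up entity encodings and tags inserted to break words apart.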

(26)

But Spammers Aren't Sitting Around…

• Embed text in images (can vary non-text parts of image)

(27)
(28)

Text Cat isn't the only Trick

• Don't display images w/o user okay

• Blacklist IPs that spam comes from

– Can harm legitimate senders (zombies, etc.)

• Charge “postage” for email

– Cash

– Puzzles that waste CPU
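The "puzzles that waste CPU" idea can be sketched as a proof-of-work stamp in the spirit of Hashcash: the sender searches for a nonce whose hash has a required number of leading zero bits, which is expensive to find but takes one hash to check. This is a simplified illustration, not the real Hashcash stamp format; the function names, the use of SHA-256, and the `message:nonce` layout are all assumptions.

```python
import hashlib
from itertools import count


def leading_zero_bits(digest: bytes) -> int:
    """Number of zero bits at the start of the digest."""
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
        else:
            bits += 8 - byte.bit_length()
            break
    return bits


def stamp_digest(message: str, nonce: int) -> bytes:
    """Hash of the message combined with a candidate nonce."""
    return hashlib.sha256(f"{message}:{nonce}".encode("utf-8")).digest()


def mint(message: str, bits: int) -> int:
    """Sender's work: try nonces until the hash has `bits` leading zero bits.

    Expected cost is about 2**bits hash evaluations.
    """
    for nonce in count():
        if leading_zero_bits(stamp_digest(message, nonce)) >= bits:
            return nonce


def verify(message: str, nonce: int, bits: int) -> bool:
    """Recipient's check: a single hash evaluation."""
    return leading_zero_bits(stamp_digest(message, nonce)) >= bits


nonce = mint("to:alice@example.com", 12)
print(verify("to:alice@example.com", nonce, 12))
```

The cost is negligible for someone sending a handful of messages but adds up across millions, which is the point of charging "postage" in CPU time.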

(29)

[Figure: challenge-response diagram; labels: Sender, Recipient, Challenge, Response]

(30)

CAPTCHAS

• Identify distorted characters

• Supposed to be easy for humans, hard for computers

• Actually, nowadays computers better at it than humans

(31)
(32)

Slight Variation

• Fortunately, for now, humans are still better than computers at identifying character

(33)
(34)

Economics of CAPTCHAs

• CAPTCHAs taken from books Google is trying to OCR. We all work for them for free.

• Spammers use Mechanical Turk to solve CAPTCHAs. It's worth paying for.
