DATA MINING FOR BUSINESS ANALYTICS

Section 99: Spring 2015

Due: Saturday, April 18, 2015, &am

Homework #4

_______________

(put your name above)

Total grade: _______ out of ______ points

This assignment will give you hands-on experience in building text classification models, using the application of email spam filtering. You will use Weka to convert the textual data (emails) into feature vectors and build text mining models for automatic spam filtering. The email messages you will use were delivered to a particular server between 8 Apr 2007 and 6 Jul 2007. The target variable represents whether an email is either spam or ham (non-spam). Follow the directions and answer any questions. Your report should not be verbose, but should present your results clearly and professionally.

(Caveat: Using large sets of words, building the models, and evaluating them with cross-validation requires a lot of memory. The assignment asks you to save your data files before each Weka run. In the in-class lab at the beginning of the semester, you should have increased your heap size. If for some reason you get memory/heap errors anyway, try restarting and avoid having any other applications open, except where indicated. If you have problems for more than 5 or 10 minutes, contact the TA or me; don't spend time being frustrated with that.)

1. On Blackboard, you will find a file called spam_data_Text.arff, which contains all the emails in our dataset and is in a format ready to be input into Weka. If you were to open this file in WordPad or some other capable text editor, you would see that each instance is just a string of the email text (between single quotes), with the target variable at the end. Thus the arff file has two variables for each instance: the text string, and the target variable (called "class"), which takes on two values, spam and ham.
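To make the instance layout concrete, here is a minimal Python sketch of how one such line could be split into its text and its class. The sample line and the helper name `parse_instance` are made up for illustration, and the sketch assumes the text contains no embedded single quotes; a real ARFF file also has a header section and escaping rules that are ignored here.

```python
def parse_instance(line):
    """Split a line of the form 'email text',label into (text, label).

    Assumes the text is wrapped in single quotes, contains no embedded
    single quotes, and that the class label follows the last comma.
    """
    text_part, label = line.rsplit(",", 1)
    return text_part.strip().strip("'"), label.strip()


# Hypothetical instance, not taken from the course data set.
sample = "'win a free prize now',spam"
text, label = parse_instance(sample)
print(text, "->", label)
```

Splitting on the last comma matters because the email text itself may contain commas.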

To be able to build spam filtering predictive models, we need to convert the email texts into feature vectors and engineer the features. You can do this by following the steps below:

1) Load spam_data_Text.arff into Weka.

2) Convert the email text to word features. (In Filter: filters-->unsupervised-->attribute-->StringToWordVector) Change only two parameter settings as follows:

a 0n the @to#eni1erA parameter box, replace the default delimiters with the set 0 am

providing below !ou need to copy the whole thing below, including the () !ou can cut and paste the string of delimiters from this document (on blac#board)

(,:BCDEF;GHIJKL MNO6PQ=2RS4&U3$V'-W_X) b >et Du(eS)p$i(D to true

Click OK, and then don't forget to click "Apply" back on the main Preprocess page!

This step splits the string into words/terms, using the delimiters to "delimit" where words start and end. After clicking Apply, it should run for a little while and then you'll see a large set of words in the Attributes list.

3) Remove non-word (noise) attributes. If you got many weird words or mixtures of words and symbols, you must have the wrong delimiters. Correctly applying the delimiters provided should give you fewer than 20 noisy features at the bottom of the list. Manually remove those features (check the boxes in Attributes and click Remove).

4) Binarize the features (meaning, change them from numeric features to 1 for "the word is present in this email" or 0 for "the word is absent"). (In Filter: filters-->unsupervised-->attribute-->NumericToBinary) (Click Apply!) (If the last attribute, "zip", remains unbinarized, just remove it.)

5) Randomize the instances. (In Filter: filters-->unsupervised-->instance-->Randomize) (Click Apply!)

6) Save the data you just engineered as spam_data_Occurrence.arff by clicking the "save" button at the upper right.
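The preprocessing steps above can be sketched outside Weka as follows. This is a rough Python analogue, not Weka's implementation: the delimiter set, the two sample emails, and the function names are assumptions made for illustration, and the real delimiter string should be copied from the original handout.

```python
import random
import re

# Illustrative delimiter set only (an assumption); the assignment's own
# string should be copied from the original handout.
DELIMITERS = " \r\n\t.,;:'\"()?!"


def tokenize(text, delimiters=DELIMITERS):
    """Split text wherever a delimiter character occurs, dropping empties."""
    pattern = "[" + re.escape(delimiters) + "]+"
    return [tok for tok in re.split(pattern, text) if tok]


def to_binary_vector(tokens, vocabulary):
    """1 if the vocabulary word is present in the email, else 0."""
    present = set(tokens)
    return [1 if word in present else 0 for word in vocabulary]


# Made-up emails, not from the course data set.
emails = [("Free prize! Claim your prize now.", "spam"),
          ("Lunch at noon? See you there.", "ham")]
tokenized = [(tokenize(text), label) for text, label in emails]
vocabulary = sorted({tok for toks, _ in tokenized for tok in toks})
rows = [(to_binary_vector(toks, vocabulary), label) for toks, label in tokenized]

random.seed(0)        # fixed seed so the shuffle is repeatable
random.shuffle(rows)  # loosely analogous to the Randomize step
```

Each row is a presence/absence vector over the vocabulary plus its class label, which is the shape of data the occurrence-based model works with.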

Now you are just about ready to build text classification models. Go to the "Classify" tab. First select "class" as the target variable from the pull-down list in the rectangle under the "More options" button. Then choose as your classifier Naive Bayes (classifiers-->bayes-->NaiveBayes). If NaiveBayes is greyed out, make sure you selected the right variable as the target variable (in the box below "More options").

Task. Report the evaluation results of your model using 10-fold cross-validation. Consider it both as a classification model and as a model that ranks cases by the likelihood of class membership. Report your results below on this page. Feel free to add more pages if one page is not enough for your results.
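For reference, the idea behind 10-fold cross-validation can be sketched in a few lines of Python. This is a simplified illustration with a hypothetical helper, `k_fold_indices`; Weka's own fold assignment may differ (for example, it may stratify the folds by class).

```python
def k_fold_indices(n_instances, k=10):
    """Yield (train_indices, test_indices) for each of k folds.

    Instances are assigned to folds round-robin, so fold sizes differ by
    at most one. Assumes the data has already been shuffled.
    """
    for fold in range(k):
        test = [i for i in range(n_instances) if i % k == fold]
        train = [i for i in range(n_instances) if i % k != fold]
        yield train, test


folds = list(k_fold_indices(25, k=10))
# Every instance appears in exactly one test fold.
all_test = sorted(i for _, test in folds for i in test)
assert all_test == list(range(25))
```

Each instance is used for testing exactly once and for training in the other nine folds, which is why the earlier Randomize step matters.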

2. The spam filtering model you just built is based on word occurrence (presence or absence) in a document. Now let's use the frequency of each word in the document instead. Repeat step 1 above from the original data file. In step 2, in addition to changing the two parameters, set outputWordCounts to true. Then apply steps 3 and 5, skipping steps 4 and 6. Save your data as spam_data_Count.arff.

Now you can build a classifier using NaiveBayesMultinomial, a version of NaiveBayes that takes into account multiple occurrences of a word (classifiers-->bayes-->NaiveBayesMultinomial), to build your prediction model. Report the 10-fold cross-validation results and compare them with the occurrence-based results in the previous question.
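To see how word counts enter a multinomial model, here is a small textbook-style multinomial naive Bayes in Python with add-one (Laplace) smoothing. It is a sketch under stated assumptions, not necessarily what Weka's NaiveBayesMultinomial does internally; the training documents and function names are invented for illustration.

```python
import math
from collections import Counter


def train_multinomial_nb(docs):
    """docs: list of (token_list, label). Returns priors, word counts, vocab."""
    class_docs = Counter(label for _, label in docs)
    word_counts = {label: Counter() for label in class_docs}
    for tokens, label in docs:
        word_counts[label].update(tokens)  # repeated words are counted each time
    vocab = {tok for tokens, _ in docs for tok in tokens}
    priors = {c: n / len(docs) for c, n in class_docs.items()}
    return priors, word_counts, vocab


def log_score(tokens, label, priors, word_counts, vocab):
    """log P(label) + sum of log P(token | label), with add-one smoothing."""
    total = sum(word_counts[label].values())
    score = math.log(priors[label])
    for tok in tokens:
        if tok in vocab:  # ignore words never seen in training
            score += math.log((word_counts[label][tok] + 1) / (total + len(vocab)))
    return score


def predict(tokens, priors, word_counts, vocab):
    """Return the class with the highest log score."""
    return max(priors, key=lambda c: log_score(tokens, c, priors, word_counts, vocab))


# Made-up training documents, not from the course data set.
training = [
    (["free", "prize", "free", "offer"], "spam"),
    (["meeting", "notes", "attached"], "ham"),
    (["free", "lunch", "meeting"], "ham"),
]
model = train_multinomial_nb(training)
print(predict(["free", "free", "prize"], *model))
```

Because each occurrence of a token adds another log term, a word repeated in an email weighs more heavily here than in a presence/absence model.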

Report your results below on this page. Feel free to add more pages if one page is not enough for your results.

3. As you know, error costs for spam filtering are asymmetric. Mistakenly classifying a good email as spam (false positive) is a lot more costly than mistakenly classifying a spam as a good email (false negative). Let's assume each false positive costs $5 and each false negative costs a nickel. Calculate the total cost and expected cost (per email) based on the confusion matrix you obtained in question 1. Copy the confusion matrix and present the formulas you used to get the results. (If this seems foreign to you, go back and reread Chapter 7.) (Be careful with the dimensions of the confusion matrix: which are the "actuals" and which are the "predictions"?)
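As a sketch of the arithmetic, the following Python snippet computes a total and a per-email expected cost from the four cells of a confusion matrix, taking spam as the positive class, $5 per false positive, and $0.05 (a nickel) per false negative. The counts are placeholders invented for illustration and the function name is hypothetical; substitute the values from your own matrix.

```python
COST_FALSE_POSITIVE = 5.00  # ham wrongly classified as spam
COST_FALSE_NEGATIVE = 0.05  # spam wrongly classified as ham


def total_and_expected_cost(tp, fn, fp, tn):
    """Return (total cost, cost per email) with spam as the positive class."""
    n = tp + fn + fp + tn
    total = fp * COST_FALSE_POSITIVE + fn * COST_FALSE_NEGATIVE
    return total, total / n


# Placeholder counts for illustration only; substitute your own matrix.
total, per_email = total_and_expected_cost(tp=90, fn=10, fp=4, tn=96)
print(f"total = ${total:.2f}, expected = ${per_email:.4f} per email")
```

Correct classifications are assumed to cost nothing, so only the two off-diagonal cells contribute.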

4. The models you generated so far use the full set of features/words. But not all of them are necessarily good features. Let's investigate which features are the most discriminative: recall the discussion on selecting the most informative attributes from early in the semester (and Chapter 3). After loading the file spam_data_Occurrence.arff, click the Select Attributes tab. Select "class" as the target variable in the usual pull-down box. In the Attribute Evaluator section, choose InfoGainAttributeEval (this selects the attributes with the greatest Information Gain; see Ch. 3). A small window will pop up asking you to use the Ranker search method. Click Yes. Then click the "Ranker" box and set numToSelect to 50. In the Attribute Selection Mode section, select Use full training set. Start the attribute selection program and you will get a list of features. You can look through this list and see whether they seem to make sense as words that might separate spam from non-spam emails. Here in the Select Attributes tab, you can change the number of attributes and rerun without changing the underlying data.

Now, let's combine feature selection with modeling. Move back to the Preprocess tab. This time we will change the data to include only these "best" 50 attributes. Choose Filter-->Supervised-->attribute-->AttributeSelection. Click on the parameter box showing "AttributeSelection". Choose InfoGainAttributeEval as the evaluator. Choose Ranker as the search. Click on the "Ranker" parameter box. Set numToSelect to 50. Click OK, then OK again. (Make sure that "class" is the target variable, in the box next to Visualize All.) Then click Apply. The result should be that now only those top-50 words are left as Attributes, along with the target variable. Save this as spam_data_Occurrence50.arff.

Build a new Naive Bayes model using only these 50 attributes. (Don't forget to pick the right target variable.) Compare the results of evaluating this new model with the one you generated in question 1 using the full set of features. Analyze the results and report your findings.
