DATA MINING FOR BUSINESS ANALYTICS
Section 99: Spring 2015
Due: Saturday, April 1%, 2015, &a'
Homework #4
_______________
(put your name above)
Total grade: _______ out of ______ points
This assignment will give you hands-on experience in building text classification models, using the application of email spam filtering. You will use Weka to convert the textual data (emails) into feature vectors and build text mining models for automatic spam filtering. The email messages you will use were delivered to a particular server between $ Apr 200[?] and Jul 200[?]. The target variable represents whether an email is either spam or ham (non-spam). Follow the directions and answer any questions. Your report should not be verbose, but should present your results clearly and professionally.
Caveat: Using large sets of words, building the models, and evaluating them with cross-validation requires a lot of memory. The assignment asks you to save your data files before each Weka run. In the in-class lab at the beginning of the semester, you should have increased your heap size. If for some reason you get memory/heap errors anyway, try restarting and avoiding having any other applications open, except where indicated. If you have problems for more than 5 or 10 minutes, contact the TA or me; don't spend time being frustrated with that!
1. On the blackboard, you will find a file called spam_data_Text.arf, which contains all the emails in our dataset and is in a format ready to be input to Weka. If you were to open this file in WordPad or some other capable text editor, you would see that each instance is just a string of the email text (between single quotes), with the target variable at the end. Thus the arff file has two variables for each instance: the text string, and the target variable (called "class"), which takes on two values, spam and ham.
To be able to build spam filtering predictive models, we need to convert the email texts into feature vectors and engineer the features. You can do this by following the steps below:
1) Load spam_data_Text.arf into Weka.
2) Convert the email text to word features (in Filter: filters-->unsupervised-->attribute-->StringToWordVector). Change only two parameter settings as follows:
a. In the "tokenizer" parameter box, replace the default delimiters with the set I am providing below. You need to copy the whole thing below, including the (). You can cut and paste the string of delimiters from this document (on blackboard).
(,:BCDEF;GHIJKL MNO6PQ=2RS4&U3$V'-W_X)
b. Set "useStoplist" to true.
Click OK, and then don't forget to click "Apply" back on the main Preprocess page.
This step splits the string into words/terms, using the delimiters to "delimit" where words start and end. After you click Apply, it should run for a little while, and then you'll see a large set of words in the Attributes list.
3) Remove non-word (noise) attributes. If you get many weird words or mixtures of words and symbols, you must have the wrong delimiters. Correctly applying the delimiters provided should give you fewer than 20 noisy features at the bottom of the list. Manually remove those features (check the boxes in Attributes and click Remove).
4) Binarize the features (meaning, change them from numeric features to 1 for "the word is present in this email" or 0 for "the word is absent") (in Filter: filters-->unsupervised-->attribute-->NumericToBinary). Click Apply. If the last attribute, "zip", remains unbinarized, just remove it.
5) Randomize the instances (in Filter: filters-->unsupervised-->instance-->Randomize). Click Apply.
6) Save the data you just engineered as spam_data_Occurrence.arf by clicking the "save" button at the upper right.
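To make the preprocessing steps above concrete, here is a minimal Python sketch of the same idea outside Weka: split each email on delimiter characters, build 0/1 occurrence features, and shuffle the instances. It is an illustration only, not what Weka does internally. The delimiter set, the function names, and the two toy emails are all assumptions made up for this example; the delimiter string you must use in Weka is the one given in step 2.

```python
import random
import re

# Hypothetical delimiter set for illustration only. It is NOT the assignment's
# delimiter string, which must be copied from the handout.
DELIMITERS = " \t\r\n.,;:'\"()?!"


def tokenize(text, delimiters=DELIMITERS):
    """Split text wherever a delimiter character occurs; drop empty tokens."""
    pattern = "[" + re.escape(delimiters) + "]+"
    return [tok for tok in re.split(pattern, text.lower()) if tok]


def occurrence_vectors(emails):
    """Return (vocabulary, rows); rows[i][j] is 1 if word j occurs in email i."""
    tokenized = [set(tokenize(e)) for e in emails]
    vocabulary = sorted(set().union(*tokenized))
    rows = [[1 if w in toks else 0 for w in vocabulary] for toks in tokenized]
    return vocabulary, rows


# Made-up toy emails and labels, not taken from the assignment data.
emails = ["Win cash now!", "Meeting at noon, bring notes."]
labels = ["spam", "ham"]
vocabulary, rows = occurrence_vectors(emails)

# Shuffle instances and labels together so each row keeps its own label.
order = list(range(len(rows)))
random.Random(42).shuffle(order)
shuffled = [(rows[i], labels[i]) for i in order]
```

Each row is one email and each column is one word, which is the shape of the data you should see in the Attributes list after steps 2 to 4.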
Now you are just about ready to build text classification models. Go to the "Classify" tab. First select "class" as the target variable from the pull-down list in the rectangle under the "More options" button. Then choose Naive Bayes as your classifier (classifiers -> bayes -> NaiveBayes). If NaiveBayes is greyed out, make sure you selected the right variable as the target variable (in the box below "More options").
Task. Report the evaluation results of your model using 10-fold cross-validation. Consider it both as a classification model and as a model that ranks cases by the likelihood of class membership. Report your results below on this page. Feel free to add more pages if one page is not enough for your results.
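The two views of the model mentioned above can be summarized with two different numbers. As a rough sketch (the function names and all numbers are made up for illustration, and Weka reports its own versions of these measures), accuracy describes the model as a classifier, while the area under the ROC curve describes how well its scores rank spam above ham:

```python
def accuracy(tp, fp, fn, tn):
    """Fraction of all emails that were classified correctly."""
    return (tp + tn) / (tp + fp + fn + tn)


def auc(scores, labels, positive="spam"):
    """Probability that a randomly chosen positive case is scored above a
    randomly chosen negative case (ties count as one half)."""
    pos = [s for s, y in zip(scores, labels) if y == positive]
    neg = [s for s, y in zip(scores, labels) if y != positive]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up scores (estimated probability of spam) for four toy emails.
toy_scores = [0.9, 0.8, 0.4, 0.3]
toy_labels = ["spam", "ham", "spam", "ham"]
toy_auc = auc(toy_scores, toy_labels)

# Made-up confusion-matrix counts, not results from the assignment data.
toy_accuracy = accuracy(tp=40, fp=5, fn=10, tn=45)
```

A model can have a decent accuracy at one threshold and still rank cases poorly, or the other way around, which is why the task asks you to look at both.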
2. The spam filtering model you just built is based on word occurrence (presence or absence) in a document. Now let's use the frequency of each word in the document instead. Repeat the above step 1 from the original data file. In step 2, in addition to changing the two parameters, set outputWordCounts to true. Then apply steps 3 and 5, skipping steps 4 and 6. Save your data as spam_data_Count.arf.
Now you can build your prediction model using NaiveBayesMultinomial, a version of NaiveBayes that takes into account multiple occurrences of a word (classifiers -> bayes -> NaiveBayesMultinomial). Report the 10-fold cross-validation results and compare them with the occurrence-based results in the previous question.
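If it helps to see how word counts enter the model, here is a small textbook-style multinomial Naive Bayes in Python. It is a sketch under assumptions, not Weka's NaiveBayesMultinomial implementation: the function names, the Laplace smoothing constant alpha, and the toy documents are all made up for illustration.

```python
import math
from collections import Counter


def train_multinomial_nb(docs, labels, alpha=1.0):
    """docs is a list of token lists. Returns (log_priors, log_likelihoods)."""
    vocabulary = sorted({w for d in docs for w in d})
    class_counts = Counter(labels)
    word_counts = {c: Counter() for c in class_counts}
    for tokens, c in zip(docs, labels):
        word_counts[c].update(tokens)  # every repeat of a word adds to its count
    log_priors = {c: math.log(n / len(docs)) for c, n in class_counts.items()}
    log_likelihoods = {}
    for c in class_counts:
        total = sum(word_counts[c].values()) + alpha * len(vocabulary)
        log_likelihoods[c] = {
            w: math.log((word_counts[c][w] + alpha) / total) for w in vocabulary
        }
    return log_priors, log_likelihoods


def predict(tokens, log_priors, log_likelihoods):
    """Return the class with the highest log posterior score."""
    scores = {}
    for c in log_priors:
        score = log_priors[c]
        for w, n in Counter(tokens).items():
            if w in log_likelihoods[c]:  # ignore words unseen in training
                score += n * log_likelihoods[c][w]  # count n weights the word
        scores[c] = score
    return max(scores, key=scores.get)


# Made-up toy documents, already tokenized.
toy_docs = [["free", "free", "cash"], ["meeting", "notes"],
            ["free", "offer"], ["notes", "lunch"]]
toy_classes = ["spam", "ham", "spam", "ham"]
priors, likelihoods = train_multinomial_nb(toy_docs, toy_classes)
```

For example, predict(["free", "free", "offer"], priors, likelihoods) weights "free" twice because it appears twice, which is exactly the information the binarized data throws away.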
Report your results below on this page. Feel free to add more pages if one page is not enough for your results.
3. As you know, error costs for spam filtering are asymmetric. Mistakenly classifying a good email as spam (false positive) is a lot more costly than mistakenly classifying a spam as a good email (false negative). Let's assume each false positive costs $5 and each false negative costs a nickel. Calculate the total cost and the expected cost (per email) based on the confusion matrix you obtained in question 1. Copy the confusion matrix and present the formulas you used to get the results. (If this seems foreign to you, go back and reread Chapter [?]. Be careful with the dimensions of the confusion matrix: which are the "actuals" and which are the "predictions"?)
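As a sanity check on your formulas, the calculation can be sketched in a few lines of Python. The function name and the counts below are hypothetical, chosen only to show the arithmetic; they are not results from the assignment data. Spam is treated as the positive class, so a false positive is a good email predicted to be spam.

```python
def misclassification_cost(tp, fp, fn, tn, cost_fp=5.00, cost_fn=0.05):
    """Return (total cost, expected cost per email) for one confusion matrix.

    Correct decisions (tp, tn) are assumed to cost nothing.
    """
    total = fp * cost_fp + fn * cost_fn
    n_emails = tp + fp + fn + tn
    return total, total / n_emails


# Hypothetical counts, NOT taken from any real Weka output.
total_cost, cost_per_email = misclassification_cost(tp=900, fp=20, fn=100, tn=980)
```

With these made-up counts the total is 20 x $5 + 100 x $0.05 = $105, or about 5 cents per email over 2,000 emails. Note how the 20 false positives dominate the total even though there are five times as many false negatives.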
Report your results below on this page. Feel free to add more pages if one page is not enough for your results.
4. The models you generated so far use the full set of features/words. But not all of them are necessarily good features. Let's investigate which features are the most discriminative; recall the discussion on selecting the most informative attributes from early in the semester (and Chapter [?]). After loading the file spam_data_Occurrence.arf, click the Select Attributes tab. Select "class" as the target variable in the usual pull-down box. In the Attribute Evaluator section, choose InfoGainAttributeEval (this selects the attributes with the greatest Information Gain; see Ch. [?]). A small window will pop up asking you to use the Ranker search method. Click Yes. Then click the "Ranker" box and set numToSelect to 50. In the Attribute Selection Mode section, select Use full training set. Start the attribute selection program and you will get a list of features. You can look through this list and see whether they seem to make sense as words that might separate spam from non-spam emails. Here in the Select Attributes tab, you can change the number of attributes and rerun without changing the underlying data.
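As a reminder of what is being ranked, information gain is the entropy of the class minus the weighted entropy of the class within each value of the feature. The following Python sketch computes it for word-occurrence features and keeps the top k. The function names and the tiny dataset are assumptions made up for illustration; this is not Weka's code.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(feature_values, labels):
    """Entropy of the labels minus the weighted entropy after splitting
    the instances on the feature's values."""
    n = len(labels)
    remainder = 0.0
    for v in set(feature_values):
        subset = [y for x, y in zip(feature_values, labels) if x == v]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder


def top_k(features, labels, k):
    """Names of the k features with the highest information gain."""
    gains = {name: information_gain(col, labels) for name, col in features.items()}
    return sorted(gains, key=gains.get, reverse=True)[:k]


# Made-up toy data: one 0/1 column per word, one entry per email.
toy_targets = ["spam", "spam", "ham", "ham"]
toy_features = {"free": [1, 1, 0, 0], "the": [1, 0, 1, 0]}
best = top_k(toy_features, toy_targets, k=1)
```

Here "free" separates the two classes perfectly (gain of 1 bit), while "the" is equally common in both classes (gain of 0), so only "free" survives the selection.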
Now, let's combine feature selection with modeling. Move back to the Preprocess tab. This time we will change the data to only include these "best" 50 attributes. Choose Filter->Supervised->attribute->AttributeSelection. Click on the parameter box showing "AttributeSelection". Choose InfoGainAttributeEval as the evaluator. Choose Ranker as the search. Click on the "Ranker" parameter box. Set numToSelect to 50. Click OK, OK. Make sure that "class" is the target variable (in the box next to Visualize All). Then click Apply. The result should be that now only those top-50 words are left as Attributes, along with the target variable. Save this as spam_data_Occurrence50.arf.
Build a new Naive Bayes model using only these 50 attributes. Don't forget to pick the right target variable! Compare the results of evaluating this new model with the one you generated in question 1 using the full set of features. Analyze the results and report your findings.
Report your results below on this page. Feel free to add more pages if one page is not enough for your results.