
DATA MINING FOR BUSINESS ANALYTICS

Section 99: Spring 2015

Due: Saturday, April 18, 2015, [?]am

Homework #4

_______________

(put your name above)

Total grade: _______ out of ______ points

This assignment will give you hands-on experience in building text classification models, using the application of email spam filtering. You will use Weka to convert the textual data (emails) into feature vectors and build text mining models for automatic spam filtering. The email messages you will use were delivered to a particular server between [?] Apr 200[?] and [?] Jul 200[?]. The target variable represents whether an email is either spam or ham (non-spam). Follow the directions and answer any questions. Your report should not be verbose, but should present your results clearly and professionally.

Caveat: Using large sets of words, building the models and evaluating them with cross-validation requires a lot of memory. The assignment asks you to save your data files before each Weka run. In the in-class lab at the beginning of the semester, you should have increased your heapsize. If for some reason you get memory/heap errors anyway, try restarting and avoiding having any other applications open, except where indicated. (If you have problems for more than 5 or 10 minutes, contact the TA or me; don't spend time being frustrated with that.)

1. On the blackboard, you will find a file called spam_data_Text.arff, which contains all the emails in our dataset and is in a format ready-to-input by Weka. If you were to open this file in WordPad or some other capable text editor, you would see that each instance is just a string of the email text (between single quotes), with the target variable at the end. Thus the arff file has two variables for each instance: the text string, and the target variable (called "class"), which takes on two values, spam and ham.

To be able to build spam filtering predictive models, we need to convert the email texts into feature vectors and engineer the features. You can do this by following the steps below:

1) Load spam_data_Text.arff into Weka.

2) Convert the email text to word features (in Filter: filters-->unsupervised-->attribute-->StringToWordVector). Change only two parameter settings as follows:

a. In the "tokenizer" parameter box, replace the default delimiters with the set I am providing below. You need to copy the whole thing below, including the (). You can cut and paste the string of delimiters from this document (on blackboard).

(,:BCDEF;GHIJKL MNO6PQ=2RS4&U3$V'-W_X)

b. Set useStoplist to true.

Click OK and then don't forget to click "Apply" back on the main Preprocess page!

This step splits the string into words/terms, by using the delimiters to "delimit" where words start and end. After clicking Apply, it should run for a little while and then you'll see a large set of words in the Attributes list.

3) Remove non-word (noise) attributes. If you got many weird words or mixtures of words and symbols, you must have the wrong delimiters. Correctly applying the delimiters provided should give you fewer than 20 noisy features at the bottom of the list. Manually remove those features (check the boxes in Attributes and click Remove).

4) Binarize the features (meaning, change them from numeric features to 1 for "the word is present in this email" or 0 for "the word is absent") (in Filter: filters-->unsupervised-->attribute-->NumericToBinary) (click Apply!). (If the last attribute, "zip", remains unbinarized, just remove it.)

5) Randomize the instances (in Filter: filters-->unsupervised-->instance-->Randomize) (click Apply!).

6) Save the data you just engineered as spam_data_Occurrence.arff by clicking the "save" button at the upper right.
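
For intuition about what these preprocessing steps produce, here is a minimal Python sketch of the same idea outside Weka: split each email on delimiter characters, drop stop words, record word presence as 0/1, and shuffle the instances. The delimiter set, the stop list, the lowercasing, and the three sample emails are all illustrative assumptions; none of them is taken from Weka or from the assignment data.

```python
import random
import re

# Hypothetical delimiter set for illustration only; in Weka, use the
# delimiter string supplied with the assignment.
DELIMITERS = " \r\n\t.,;:'\"()?!-"
# Tiny hypothetical stop list; a real stop list is much longer.
STOP_WORDS = {"the", "a", "is", "to", "and"}


def tokenize(text):
    """Split text on any run of delimiter characters; drop empty tokens and stop words."""
    pattern = "[" + re.escape(DELIMITERS) + "]+"
    tokens = re.split(pattern, text.lower())  # lowercasing is a choice made for this sketch
    return [t for t in tokens if t and t not in STOP_WORDS]


def to_occurrence_vectors(emails):
    """Return (vocabulary, rows); rows[i][j] is 1 if word j occurs in email i, else 0."""
    token_sets = [set(tokenize(text)) for text in emails]
    vocabulary = sorted(set().union(*token_sets))
    rows = [[1 if word in tokens else 0 for word in vocabulary]
            for tokens in token_sets]
    return vocabulary, rows


# Made-up emails and labels, purely for demonstration.
emails = ["Win money now!", "Meeting is at noon.", "Win a free prize, now"]
labels = ["spam", "ham", "spam"]
vocabulary, rows = to_occurrence_vectors(emails)

# Shuffle instances and labels together so each row keeps its own label.
order = list(range(len(rows)))
random.Random(42).shuffle(order)
rows = [rows[i] for i in order]
labels = [labels[i] for i in order]
```

Each row is one email and each column one word, which is the shape of the data saved in step 6.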

Now you are just about ready to build text classification models. Go to the "Classify" tab. First select "class" as the target variable from the pull-down list in the rectangle under the "More options" button. Then choose as your classifier Naive Bayes (classifiers-->bayes-->NaiveBayes). If NaiveBayes is greyed out, make sure you selected the right variable as the target variable (in the box below "More options").

Task: Report the evaluation results of your model using 10-fold cross-validation. Consider it both as a classification model and as a model that ranks cases by the likelihood of class membership.

Report your results below on this page. Feel free to add more pages if one page is not enough for your results.
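
As a reminder of what 10-fold cross-validation does, the following Python sketch deals the instances into k folds, trains on k-1 of them, and scores the held-out fold, so that every instance is predicted exactly once. The helper names, the majority-class placeholder learner, and the toy data are assumptions made for illustration. The sketch does not stratify the folds, and it is not a description of Weka's internal procedure.

```python
import random
from collections import Counter


def k_fold_indices(n, k, seed=0):
    """Shuffle the indices 0..n-1 and deal them into k nearly equal folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(rows, labels, train, predict, k=10):
    """Return the accuracy over all held-out predictions."""
    correct = 0
    for held_out in k_fold_indices(len(rows), k):
        held = set(held_out)
        train_idx = [i for i in range(len(rows)) if i not in held]
        model = train([rows[i] for i in train_idx],
                      [labels[i] for i in train_idx])
        correct += sum(predict(model, rows[i]) == labels[i] for i in held_out)
    return correct / len(rows)


# Placeholder learner (an assumption, not Naive Bayes): always predicts
# the most common label seen in training.
def train_majority(rows, labels):
    return Counter(labels).most_common(1)[0][0]


def predict_majority(model, row):
    return model


# Toy data: 20 ham and 10 spam instances with a single dummy feature.
rows = [[0] for _ in range(30)]
labels = ["ham"] * 20 + ["spam"] * 10
accuracy = cross_validate(rows, labels, train_majority, predict_majority, k=10)
```

Any pair of train and predict functions with the same signatures can be dropped in place of the placeholder learner.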

2. The spam filtering model you just built is based on word occurrence (presence or absence) in a document. Now let's use the frequency of each word in the document instead. Repeat the above step 1 from the original data file. In step 2, in addition to changing the two parameters, set outputWordCounts to true. Then apply steps 3 and 5, skipping steps 4 and 6. Save your data as spam_data_Count.arff.

Now you can build your prediction model using NaiveBayesMultinomial (classifiers-->bayes-->NaiveBayesMultinomial), a version of NaiveBayes that takes into account multiple occurrences of a word. Report the 10-fold cross-validation results and compare with the occurrence-based results in the previous question.
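
To illustrate how word counts enter a multinomial Naive Bayes model, here is a minimal Python sketch. It uses add-one smoothing and ignores words never seen in training; both are assumptions of this sketch and are not claimed to match Weka's NaiveBayesMultinomial. The four tiny documents are invented.

```python
import math
from collections import Counter


def train_multinomial_nb(docs, labels):
    """docs: list of token lists. Returns {class: (log_prior, {word: log_likelihood})}."""
    vocabulary = {w for doc in docs for w in doc}
    class_docs = Counter(labels)
    word_counts = {c: Counter() for c in class_docs}
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc)  # counts every occurrence, not just presence
    model = {}
    for c in class_docs:
        total = sum(word_counts[c].values())
        log_prior = math.log(class_docs[c] / len(docs))
        # Add-one smoothing (an assumption) keeps unseen class/word pairs finite.
        log_like = {w: math.log((word_counts[c][w] + 1) / (total + len(vocabulary)))
                    for w in vocabulary}
        model[c] = (log_prior, log_like)
    return model


def predict(model, doc):
    """Return the class with the highest log-score; words unseen in training are skipped."""
    scores = {}
    for c, (log_prior, log_like) in model.items():
        scores[c] = log_prior + sum(log_like[w] for w in doc if w in log_like)
    return max(scores, key=scores.get)


# Invented training documents, already tokenized.
docs = [["free", "free", "prize"], ["meeting", "noon"],
        ["free", "money"], ["lunch", "meeting", "meeting"]]
labels = ["spam", "ham", "spam", "ham"]
model = train_multinomial_nb(docs, labels)
prediction = predict(model, ["free", "free", "money"])
```

Because each occurrence of a word adds its log-likelihood again, a word repeated in an email pulls the score further than it would under a presence/absence model.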

Report your results below on this page. Feel free to add more pages if one page is not enough for your results.

3. As you know, error costs for spam filtering are asymmetric. Mistakenly classifying a good email as a spam (false positive) is a lot more costly than mistakenly classifying a spam as a good email (false negative). Let's assume each false positive costs $5 and each false negative costs a nickel. Calculate the total cost and expected cost (per email) based on the confusion matrix you obtained in question 1. Copy the confusion matrix and present the formulas you used to get the results. (If this seems foreign to you, go back and reread Chapter [?].) (Be careful with the dimensions of the confusion matrix: which are the "actuals" and which are the "predictions"?)
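
The arithmetic can be sketched in Python as follows. The confusion-matrix counts are hypothetical placeholders, not results from the assignment data, and the sketch assumes that spam is the positive class and that the dictionary keys are (actual, predicted) pairs. Substitute your own counts and check which dimension is which.

```python
FP_COST = 5.00  # cost of a good email classified as spam (false positive)
FN_COST = 0.05  # cost of a spam classified as a good email (false negative)

# Hypothetical counts keyed by (actual, predicted); replace with your own.
confusion = {
    ("spam", "spam"): 900,   # true positives
    ("spam", "ham"): 100,    # false negatives
    ("ham", "spam"): 20,     # false positives
    ("ham", "ham"): 1980,    # true negatives
}


def total_and_expected_cost(confusion):
    """Total cost = FP * FP_COST + FN * FN_COST; expected cost = total / number of emails."""
    false_positives = confusion[("ham", "spam")]
    false_negatives = confusion[("spam", "ham")]
    total = false_positives * FP_COST + false_negatives * FN_COST
    n_emails = sum(confusion.values())
    return total, total / n_emails


total_cost, expected_cost = total_and_expected_cost(confusion)
```

Correct classifications are assumed to cost nothing, so only the two off-diagonal cells contribute to the total.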

Report your results below on this page. Feel free to add more pages if one page is not enough for your results.

U The models you generated so far use the full set of features2words \ut not all of them are necessarily good features <et5s investigate what features are the most discriminative`recall the discussion on selecting the most informative attributes from early in the semester (and .hapter ) %fter loading the le spam_data_ Occurrence.arf, clic# the Se$e Ari,ue( tab and in the

%ttribute valuator section >elect ;class; as the target variable in the usual pull-down box .hoose In/)GainAri,ueEa$(this selects the attributes with the greatest 0nformation ^ainB see .h ) % small window will be popped up to as# you to use the Zan#er search method .lic# !es %nd clic# the @Zan#erA boxB set nu'T)Se$e to 3' 0n the %ttribute >election [ode section, select u(e /u$$ raining (e Sar the attribute selection program and you will get a list of features !ou can loo# through this list and see whether they seem to ma#e sense as words that might separate spam from non-spam emails ere in the >elect %ttributes tab, you can change the number of attributes and rerun without changing the underlying data

Now, let's combine feature selection with modeling. Move back to the Preprocess tab. This time we will change the data to only include these "best" 50 attributes. Choose Filter->Supervised->attribute->AttributeSelection. Click on the parameter box showing "AttributeSelection". Choose InfoGainAttributeEval as the evaluator. Choose Ranker as the search. Click on the 'Ranker' parameter box. Set numToSelect to 50. Click Ok, Ok. Make sure that "class" is the target variable (in the box next to Visualize All). Then click Apply. The result should be that now only those top-50 words are left as Attributes, along with the target variable. Save this as spam_data_Occurrence50.arff.

Build a new Naive Bayes model only using these 50 attributes. (Don't forget to pick the right target variable.) Compare the results of evaluating this new model with the one you generated in question 1 using the full set of features. Analyze the results and report your findings.

Report your results below on this page. Feel free to add more pages if one page is not enough for your results.
