An Experimental Technique on Features Extraction for Product Feedback using Opinion Mining

(1)

International Journal of Innovative Technology and Exploring Engineering (IJITEE) ISSN: 2278-3075, Volume-8 Issue-8S3, June 2019



Abstract: Now a days with increased use of social media, user reviews on any product plays a significant role. The existing opinion mining techniques operate only on single review corpus; it doesn’t consider distribution of opinion features across different corpora. A new technique is proposed to extract opinion features from online reviews across two corpora, one a given review corpus and other is contrasting corpus. This discrepancy is evaluated using domain relevance. The first step is to find candidate opinion features in domain review corpus using syntacticT dependenceT rules.T ThisT evaluateT intrinsicT domainT relevanceT scoresT forT eachT candidateT onT domain-dependentT corporaT andT extrinsicT domainT relevanceT scoreT onT domain-independentT corpora.T AT candidateT featuresT whichT areT lessT genericT andT moreT domainT specificT areT theT finalT opinionT features.T ThisT intervalT thresholdingT isT calledT asT theT intrinsicT andT extrinsicT domainT relevance.

Index Terms: Corpus, Syntactic Rules, Domain Relevance, Candidate Feature, opinion mining, opinion feature

I. INTRODUCTION

NowT aT day’sT opinionT miningT isT aT challengingT taskT withT

increasedT useT ofT socialT media.T TheT saleT ofT anyT productT

andT itsT marketingT isT basedT onT theT customerT reviewsT

aboutT thatT product.T ThisT opinionT miningT isT doneT inT

differentT levels.T ForT example,T documentT levelT opinionT

miningT [14]T detectsT theT overallT subjectivityT expressedT onT productT inT aT reviewT document,T butT itT doesT notT associateT

opinionsT withT specificT componentT ofT theT Product.T “TheT

externalT isT veryT beautiful,T alsoT notT costly,T thoughT theT

batteryT isT poor;T IT stillT resolutelyT recommendT thisT

cell-phone.”T itT expressesT aT generalT positiveT opinionT onT theT cell-phone,T itT alsoT containsT contradictoryT opinionsT

associatedT withT differentT componentsT ofT theT cell-phone.T

TheT opinionT orientationsT forT theT “cell-phone”T itselfT andT

itsT “external”T areT positive,T butT theT opinionT polarityT forT theT componentT ofT “battery”T isT negative.T SuchT opinionsT mayT veryT wellT tipT theT balanceT inT purchaseT decisions.T SavvyT consumersT areT unsatisfiedT withT justT theT overallT opinionT ratingT ofT aT product.T TheyT wantT toT knowT whyT itT

Revised Manuscript Received on December 22, 2018.

Jawahar Gawade1_{, Latha Parthiban}2

Ph.D. Scholar: Computer Science Engineering Bharath University, Chennai, India1

Associate Professor, Department of Computer Science, Pondicherry University, CC, India2

receivesT theT ratingT [15].T InT opinionT mining,T anT opinionT

feature,T indicatesT aT productT orT aT componentT ofT aT productT

onT whichT customerT expressT theirT opinions.T InT currentT

paper,T thisT workT recommendT aT novelT methodT toT theT

detectionT ofT suchT featuresT fromT unstructuredT textualT

reviews.

AT novelT techniqueT isT proposedT toT detectT opinionT featuresT

byT exploitingT theirT distributionT discrepanciesT acrossT

differentT corpora.T ThisT workT proposedT andT evaluatedT theT domainT relevanceT (DR)T [16]T ofT anT opinionT featureT acrossT

twoT corpora.T TheT DRT criterionT evaluatesT howT wellT aT

termT isT statisticallyT associatedT withT aT corpus.T OurT methodT

isT summarizedT asT follows:T First,T severalT syntacticT

dependenceT rulesT areT appliedT toT produceT aT listT ofT

candidateT featuresT fromT theT givenT domainT reviewT corpus,T forT example,T cellT phone.

Next,T forT eachT acceptedT featureT candidate,T itsT domainT

relevanceT scoreT regardingT theT domain-specificT andT

domainT independentT corporaT isT majored,T whichT isT calledT asT theT intrinsic-domainT relevanceT (IDR)T score,T andT theT extrinsicT domainT relevanceT (EDR)T score,T respectively.T InT theT finalT step,T candidateT featuresT withT smallT IDRT scoresT andT highT EDRT scoresT areT pruned.T Thus,T callT thisT intervalT thresholdingT theT intrinsicT andT extrinsicT domainT relevanceT (IEDR)T technique.

The objectives of this work are:

The websites which allow users to their write reviews as a feedback may not place any constraint on the language they use to express their opinion [2]. Some slang words like “OMG”, “BTW” etc. may cause the system to take care while processing it. The problem also comes along with the spelling mistakes occurred in the review.

 The designing of proper syntactic rules [3].

 As the selection of Domain Independent corpus

decides performance, its selection is a critical task.

 The feature naming in the reviews also causes an

overhead in identifying features as users may name the features with different names such as users may use “Headset” or

“Earphones” for

expressing their

An Experimental Technique on Features

Extraction for Product Feedback using Opinion

Mining

(2)

sentiment for the same feature. Hence our objective is to assign proper feature naming [1].

 The selection of proper threshold values [4].

II. RELATED WORK

AT manyT approachesT haveT beenT proposedT toT extractT

featuresT inT opinionT mining.T SupervisedT learningT modelT

mayT workT wellT inT aT givenT domain,T butT theT modelT mustT beT retrainedT ifT itT isT appliedT toT otherT domainsT [2],T [3].T

UnsupervisedT naturalT languageT processingT (NLP)T

techniquesT [4],T [5],T [6]T detectT opinionT featuresT byT

definingT domain-independentT syntacticT rulesT thatT captureT

theT dependenceT rolesT andT localT contextT ofT theT featureT terms.T However,T rulesT doT notT workT wellT onT real-lifeT

reviews,T whichT lackT formalT structure.T TopicT modelingT

approachesT canT mineT coarse-grainedT andT genericT topicsT

orT components,T whichT areT actuallyT semanticT featureT

clustersT orT componentsT ofT theT specificT featuresT

commentedT onT explicitlyT inT reviewsT [7],T [8].T ExistingT extractT opinionT featuresT byT miningT statisticalT patternsT ofT

featureT termsT onlyT inT theT singleT reviewT corpus,T withoutT

consideringT theirT distributionalT characteristicsT inT anotherT differentT corpusT [10],T [11].T ThisT techniqueT isT toT findT theT distributionalT structureT ofT anT opinionT featureT inT aT givenT domain-dependentT reviewT corpus,T forT example,T cellphoneT

reviews,T isT differentT fromT thatT inT aT domain-independentT

corpus.T ForT instance,T theT opinionT featureT batteryT occursT

frequentlyT inT theT domainT ofT cellphoneT reviews,T butT notT

asT frequentlyT inT theT domain-irrelevantT CultureT articleT

collection.

InT W.T JinT andT H.H.T Ho,T AT NovelT LexicalizedT

HMM-BasedT LearningT FrameworkT forT WebT OpinionT

MiningT [17],T MerchantsT sellingT productsT onT theT WebT

regularlyT requestT theirT customersT toT shareT theirT opinionsT andT experiencesT onT productsT theyT haveT purchased.T InT thisT

research,T OpinionT expressionsT andT sentencesT areT alsoT

recognizedT andT opinionT orientationsT forT eachT recognizedT productT entityT areT classifiedT asT positiveT orT negativeT [2].

ThisT workT recommendsT aT novelT machineT learningT

frameworkT usingT lexicalizedT HMMs.T TheT approachT

naturallyT integratesT linguisticT features,T suchT asT

part-of-speechT andT surroundingT contextualT cluesT ofT wordsT intoT automaticT learning.

InT W.T JinT andT H.H.T Ho,T AT NovelT LexicalizedT

HMM-BasedT LearningT FrameworkT forT WebT OpinionT

MiningT [9],T ThisT workT modelT theT problemT asT anT

informationT extractionT task,T whichT willT beT basedT onT

ConditionalT RandomT FieldsT (CRF)T [3].AsT aT baselineT ThisT workT employT theT supervisedT algorithmT byT ZhuangT etT al.T

(2006),T whichT representsT theT state-of-the-artT onT theT

engagedT data.T Furthermore,T ThisT workT investigateT theT

performanceT ofT ourT CRF-basedT techniqueT andT theT

baselineT inT aT single-T andT cross-domainT opinionT targetT

extractionT setting.T CRF-basedT approachT improvesT theT

performanceT byT 0.077,T 0.126,T 0.071T andT 0.178T regardingT F-MeasureT inT theT single-domainT extractionT inT theT fourT

domains.T InT theT cross-domainT settingT ourT approachT

improvesT theT performanceT byT 0.409,T 0.242,T 0.294T andT

0.343T regardingT F-MeasureT overT theT baseline.

ThereT twoT mainT twoT subtasksT ofT opinionT mining:T topicT

extractionT andT sentimentT classificationT inT G.T Qiu,T C.T Wang,T J.T Bu,T K.T Liu,T andT C.T Chen,T IncorporateT theT

SyntacticT KnowledgeT inT OpinionT MiningT inT

User-GeneratedT Content.T ThisT workT proposeT techniquesT

toT theseT twoT topicsT respectivelyT forT ChineseT basedT onT theT considerationT ofT syntacticT knowledge.T ThisT workT takeT theT blogT data,T whichT isT aT typicalT applicationT ofT UGC,T asT theT evaluatingT dataT inT ourT experimentsT andT theT resultsT showT thatT ourT techniquesT toT theT twoT tasksT areT promising.T TheT G.T Qiu,T B.T Liu,T J.T Bu,T andT C.T Chen,T OpinionT WordT

ExpansionT andT TargetT ExtractionT throughT DoubleT

Propagation,T OpinionT targetsT (targets,T forT short)T areT

ProductsT andT theirT componentsT onT whichT opinionsT haveT beenT expressed.T ToT performT theT tasks,T ThisT workT foundT thatT thereT areT someT syntacticT relationsT thatT linkT opinionT

wordsT andT targets.T TheseT relationsT canT beT recognizedT

usingT aT dependencyT parserT andT thenT utilizedT toT expandT

theT initialT opinionT lexiconT andT toT extractT targetsT [6].T ThisT

proposedT methodT isT basedT onT bootstrapping.T ThisT workT

callT itT doubleT propagationT asT itT propagatesT informationT betweenT opinionT wordsT andT targets.T AT keyT advantageT ofT theT proposedT techniqueT isT thatT itT onlyT needsT anT initialT

opinionT lexiconT toT initiateT theT bootstrappingT process.T

Thus,T theT methodT isT semi-supervisedT dueT toT theT useT ofT

opinionT wordT seeds.

InT G.T Qiu,T B.T Liu,T J.T Bu,T andT C.T Chen,T OpinionT WordT

ExpansionT andT TargetT ExtractionT throughT DoubleT

Propagation,T ThisT workT doT notT summarizeT theT reviewsT byT

selectingT aT subsetT orT rewriteT someT ofT theT originalT

sentencesT fromT theT reviewsT toT captureT theT mainT pointsT asT inT theT classicT textT summarizationT [10].T OurT taskT isT performedT inT threeT steps:T (1)T miningT productT featuresT

thatT haveT beenT commentedT onT byT customers;T (2)T

identifyingT opinionT sentencesT inT eachT reviewT andT

decidingT whetherT eachT opinionT sentenceT isT positiveT orT negative;T (3)T summarizingT theT results.

III. PROPOSEDMETHODOLOGYAND DISCUSSION

Important Terminologies

 LoadingT ReviewT Corpuses:T FirstT ofT allT applicationT

haveT toT loadT theT ReviewT corpusesT includingT

DomainT dependentT corporaT andT DomainT

independentT corpora.T TheT statementT usuallyT

providedT withT XMLT

orT TEXTT version.T

(3)

aT XMLT orT aT TEXTT parserT toT readT theT statementT

serially.T TheseT statementsT shouldT beT loadedT intoT arraysT ofT stringT dataT type.

 Part-Of-SpeechT TaggingT (POST):T AT

Part-Of-SpeechT TaggerT isT aT pieceT ofT softwareT thatT

readsT textT inT someT languageT andT assignsT partsT ofT

speechT toT eachT wordT (andT otherT token),T suchT asT noun,T nounT phrase,T verb,T adjective,T etc.,T althoughT

generallyT computationalT applicationsT useT moreT

fine-grainedT POST tagsT likeT ’noun-plural’.T ThisT

workT useT openT sourceT StanfordT NLPT parserT forT

POST.T TheT parserT isT instantiatedT withT EnglishT

Model.

 CandidateT FeatureT Identification:T ForT theT opinionsT

reviews,T thisT workT willT identifyT theT candidateT

featuresT byT extractingT theT frequentT noun,T nounT

phraseT andT adjectivesT toT designT syntacticT rules.

 IntrinsicT DomainT Relevance:T TheT IntrinsicT DomainT

RelevanceT scoreT areT majoredT forT eachT extractedT

candidateT featureT onT domainT specificT corpora.T ItT

representsT howT muchT theT candidateT featureT isT

domainT specificT withT givenT domain.

 ExtrinsicT DomainT Relevance:T TheT ExtrinsicT

DomainT RelevanceT scoreT areT majoredT forT eachT

extractedT candidateT featureT onT domainT independentT

corpora.T ItT representsT howT muchT theT candidateT

featureT isT genericT withT givenT otherT externalT

domains.

 IntrinsicT ExtrinsicT DomainT Relevance:T TheT

IntrinsicT ExtrinsicT DomainT RelevanceT isT aT

techniqueT toT identifyT andT extractT opinionT features.T AT candidateT featuresT withT EDRT scoresT lessT thanT aT

thresholdT andT IDRT scoresT greaterT thanT anotherT

thresholdT areT conformedT asT opinionT features.

 OpinionT Features:T TheT opinionT miningT refersT toT

collectT differentT typesT ofT opinionsT ofT peopleT overT theT sameT issueT fromT web.T MostlyT theT reviewsT areT theT bestT wayT toT expressT user’sT opinionT onT web.

Figure 1: System Architecture

IV. MATHEMATICAL MODEL OF PROPOSED SYSTEM

is a set of review document.

Each document ‘ ’ in the review corpus may be either domain dependent or domain independent reviews.

So, let Dd = set of domain dependent review

And Dind = set of domain independent review.

Each document ‘di’ where diDd or di Dind , contains no of words. Let W is the set of words , represented as

Where means document

and means from document word.

Let be the set of some parts of speech i.e.

where , , ,

, .

Let be the statement of mentioned above

For finding out opinion features we have to separate candidates features from each statement .

So, is the set of candidate features

Any phrase is candidate feature by following

mathematical equation,

(eq. 1)

FromT aboveT selectedT candidateT featuresT weT haveT toT findT finalT opinionT featuresT inT bothT domainT dependentT andT domainT independentT reviews.T

ForT thisT weT haveT toT findT eachT termT T .

WhichT isT theT wordT inT document,T belongsT toT

whichT T orT T withT theT helpT ofT termT

frequencyT andT inverseT documentT frequency.

=termT frequencyT i.e.T howT manyT timeT termT tiT occursT inT givenT document.

T =T InverseT DocumentT FrequencyT i.e.T howT significantlyT aT termT T isT mentionedT

acrossT allT documents.

TF-IDFT meansT theT termT weightT .T

i.e.T [1]T representedT asT T and can be found as

(eq. 2)

from this we can find similarity between terms using

(4)

(eq.3)

According to value of we can group as domain dependent

and domain independent.

Dataset

TheT mobileT reviewT corpusT containsT 110T real-lifeT textualT reviewsT collectedT fromT aT socialT sitesT likeT flipT cart,T Amazon.T TheT hotelT reviewT corpusT containsT 111T reviewsT

crawledT fromT aT socialT sites.T TheT SummaryT ofT theT fourT

domainT reviewT corporaT areT shownT inT TableT 7.1.T ThisT

workT randomlyT selectedT 5T reviewT corpuses.T TwoT personsT

manuallyT markedT opinionT feature(s)T expressedT inT everyT

reviewT sentenceT inT eachT ofT theT mobileT category.T AT

markedT opinionT featureT isT consideredT validT ifT andT onlyT ifT bothT annotatorsT highlightT it.T IfT onlyT oneT ofT theT annotatorsT markT anT opinionT feature,T thenT aT thirdT personT hasT aT finalT decisionT onT whetherT toT keepT orT rejectT it.T AT totalT ofT 18T

opinionT featuresT wereT obtainedT fromT theT mobileT reviewT

files.T UsingT theT sameT method,T thisT workT annotatedT 19T

opinionT featuresT fromT randomlyT selectedT hotelT reviewT

files.T TheT precisionT andT recallT areT measuredT byT equationT 1T andT 2.T TheT valueT forT precisionT isT 0.93T andT 0.90T forT mobileT andT hotelT reviews,T respectively.

ThisT workT hasT alsoT collectedT 4T domain-independentT

(generic)T corporaT fromT aboveT website,T eachT corpusT

containingT 8T documents.T TheT collectedT corporaT coverT

domainT irrelevantT heterogeneousT topicsT containingT

Vehicle,T Laptop,T andT soT on.T SummaryT statisticsT ofT theT 4T domainT independentT corporaT areT shownT inT TableT 2.T AllT

documentsT fromT theT domainT reviewT corporaT asT wellT asT

theT domain-independentT corporaT wereT parsedT usingT theT

languageT technologyT platformT (LTP),T aT ChineseT naturalT

languageT analyser.

V. EXPERIMENTALRESULTS

IdentificationT ofT candidateT FeaturesT ViaT IntrinsicT /T

ExtrinsicT DomainT relevanceT TheT partT ofT speechT taggingT

isT appliedT onT theT collectedT reviewsT andT aT setT ofT syntacticT rulesT areT appliedT toT identifyT candidateT featuresT fromT theT reviewT corpuses.T TheseT syntacticT rulesT identifyT candidateT featuresT properly.T ItT isT representedT inT fig.3.

Figure 3: Candidate Features Via IDR / EDR

Assign Sentiments to Features :A set of sentiments are assigned to opinion features as positive, negative or neutral.

[image:4.595.113.213.52.86.2]

This is represented in fig.4 for each of the domain.

Figure 4: Assign Sentiments

[image:4.595.304.548.53.188.2]

Result comparison

Fig. 5 represents comparison of LDA and IEDR approach and it proves that IEDR is more effective than LDA

[image:4.595.310.548.308.500.2]

(5)

VI. CONCLUSION

ThisT SystemT presentsT novelT inter-corpusT statisticsT

techniqueT toT opinionT featureT extractionT basedT onT theT IEDRT feature-filteringT technique,T whichT utilizesT theT disparitiesT inT distributionalT characteristicsT ofT featuresT

acrossT twoT corpora,T oneT domain-specificT andT oneT

domain-independent.T IEDRT identifiesT candidateT

featuresT thatT areT specificT toT theT givenT reviewT domainT andT yetT notT overlyT genericT (domainT independent).T InT

addition,T sinceT aT goodT qualityT domain-independentT

corpusT isT quiteT importantT forT theT proposedT approach,T thisT workT hasT evaluatedT theT influenceT ofT corpusT sizeT

andT topicT selectionT onT featureT extractionT

performance.T ThisT workT foundT thatT usingT aT

domain-independentT corpusT ofT aT similarT sizeT asT butT

topicallyT differentT fromT theT givenT reviewT domainT

willT yieldT goodT opinionT featureT extractionT results.

REFERENCES

[1] ZhenT Hai,T KuiyuT Chang,T Jung-JaeT Kim,T andT ChristopherT C.T Yang,“IdentifyingT FeaturesT inT OpinionT MiningT viaT IntrinsicT andT ExtrinsicT DomainT Relevance”,T IEEET TRANSACTIONST ONT KNOWLEDGET ANDT DATAT ENGINEERING,T VOL.T 26,T NO.T 3,T MARCHT 2014.

[2] W.T JinT andT H.H.T Ho,“AT NovelT LexicalizedT HMM-BasedT LearningT FrameworkT forT WebT OpinionT Mining”,T Proc.T 26thT Ann.T IntlT Conf.T MachineT Learning,T pp.T 465-472,T 2009. [3] N.T JakobT andT I.T Gurevych,“ExtractingT OpinionT TargetsT inT aT

SingleandT Cross-T DomainT SettingT withT ConditionalT RandomT Fields”,Proc.T Conf.T EmpiricalT MethodsT inT NaturalT LanguageT Processing,T pp.T 1035-1045,T 2010.

[4] G.T Qiu,T C.Wang,T J.T Bu,T K.T Liu,T andT C.T Chen,“IncorporateT theT SyntacticT KnowledgeT inT OpinionT MiningT inT User-GeneratedT Content”,T Proc.T WWWT 2008T WorkshopT NLPT ChallengesT inT theT InformationT ExplosionT Era,T 2008.

[5] G.T Qiu,T B.T Liu,T J.T Bu,T andT C.T Chen,“OpinionWordT ExpansionT andT TargetT ExtractionT throughT DoubleT Propagation”,ComputationalT Linguistics,T vol.T 37,T pp.T 9-27,T 2011.

[6] D.M.T Blei,T A.Y.T Ng,T andT M.I.T Jordan,“LatentT DirichletT Allocation”,J.T MachineT LearningT Research,T vol.T 3,T pp.T 993-1022,T Mar.T 2003.

[7] I.T TitovT andT R.T McDonald,“ModelingT OnlineT ReviewsT withT Multi-T GrainT TopicT Models”,Proc.T 17thT IntlT Conf.T WorldT WideT Web,T pp.T 111-120,T 2008.

[8] M.T HuT andT B.T Liu,“MiningT andT SummarizingT CustomerT Reviews”,Proc.T 10thT ACMT SIGKDDT IntlT Conf.T KnowledgeT DiscoveryT andT DataT Mining,T pp.T 168-177,T 2004.

[9] A.T PopescuT andT O.T Etzioni,“ExtractingT ProductT FeaturesT andT OpinionsT fromT Reviews”,T Proc.T HumanT LanguageT TechnologyT Conf.T andT Conf.T EmpiricalT MethodsT inT NaturalT LanguageT Processing,T pp.T 339-T 346,T 2005.

[10] V.T HatzivassiloglouT andT J.M.T Wiebe,T “EffectsT ofT AdjectiveT OrientationT andT GradabilityT onT SentenceT Subjectivity”,Proc.T 18thT Conf.T ComputationalT Linguistics,T pp.299-305,T 2000. [11] B.T Pang,T L.T Lee,T andT S.T Vaithyanathan,“ThumbsT up?:T

SentimentT ClassificationT UsingT MachineT LearningT Techniques”Proc.T Conf.T EmpiricalT MethodsT inT NaturalT LanguageT Processing,T pp.T 79-86,T 2002.

[12] B.T PangT andT L.T Lee,T “AT SentimentalT Education:T SentimentT AnalysisT UsingT SubjectivityT SummarizationT BasedT onT MinimumT Cuts,T Proc.T 42ndT Ann.T MeetingT onT Assoc.T forT ComputationalT Linguistics,T 2004.

[13] R.T Mcdonald,T K.T Hannan,T T.T Neylon,T M.T Wells,T andT J.T Reynar,“StructuredT ModelsT forT Fine-to-CoarseT SentimentT Analysis”,Proc.T 45thT Ann.T MeetingT ofT theT Assoc.T ofT ComputationalT Linguistics,T pp.T 432-T 439,T 2007.

[14] L.T Qu,T G.T Ifrim,T andT G.Weikum,T “TheT Bag-of-OpinionsT MethodT forT ReviewT RatingT PredictionT fromT SparseT TextT

Patterns”,Proc.T 23rdT IntlT Conf.T ComputationalT Linguistics,pp.T 913-921,T 2010.

[15] D.T Bollegala,T D.Weir,T andT J.T Carroll,T “Cross-DomainT SentimentT ClassificationT UsingT aT SentimentT SensitiveT Thesaurus”,IEEET Trans.T KnowledgeT andT DataT Eng.,T vol.25,T no.T 8,T pp.T 1719-1731,T Aug.T 2013.

[16] P.D.T Turney,T “ThumbsT UpT orT ThumbsT Down?:T SemanticT OrientationT AppliedT toT UnsupervisedT ClassificationT ofT Reviews”,Proc.T 40thT Ann.T MeetingT onT Assoc.T forT ComputationalT Linguistics,T pp.T 417-T 424,T 2002.