Boosting implicit discourse relation recognition with connective-based word embeddings


Contents lists available at ScienceDirect

Neurocomputing

journal homepage: www.elsevier.com/locate/neucom

Changxing Wu a,∗, Jinsong Su b, Yidong Chen c,d, Xiaodong Shi c,d

a Virtual Reality and Interactive Techniques Institute, East China Jiaotong University, Nanchang 330013, China
b Xiamen University, Xiamen 361005, China
c Fujian Key Lab of the Brain-like Intelligent Systems, Xiamen University, Xiamen 361005, China
d Department of Cognitive Science, School of Information Science and Technology, Xiamen University, Xiamen 361005, China

Article info

Article history:
Received 4 September 2018
Revised 12 June 2019
Accepted 27 August 2019
Available online 30 August 2019
Communicated by Dr. Y. Chang

Keywords:
Connective-based word embeddings
Implicit discourse relation recognition
Connective classification
Neural network

Abstract

Implicit discourse relation recognition is the performance bottleneck of discourse structure analysis. To alleviate the shortage of training data, previous methods usually use explicit discourse data, which are naturally labeled by connectives, as additional training data. However, it is often difficult for them to integrate large amounts of explicit discourse data because of the noise problem. In this paper, we propose a simple and effective method to leverage massive explicit discourse data. Specifically, we learn connective-based word embeddings (CBWE) by performing connective classification on explicit discourse data. The learned CBWE is capable of capturing discourse relationships between words, and can be used as pre-trained word embeddings for implicit discourse relation recognition. On both the English PDTB and Chinese CDTB data sets, using CBWE achieves significant improvements over baselines with general word embeddings, and better performance than baselines integrating explicit discourse data. By combining CBWE with a strong baseline, we achieve the state-of-the-art performance.

1. Introduction

Recognizing discourse relations (e.g., Comparison) between two text spans is a crucial subtask of discourse structure analysis. These relations can benefit many downstream natural language processing tasks, including question answering and machine translation. A discourse relation instance is usually defined as a discourse connective (e.g., but, and) taking two arguments (e.g., clauses or sentences). Example (a) is an explicit discourse instance signaled by the connective but, while Example (b) is an implicit discourse instance with the Comparison relation, in which the connective is absent. For explicit discourse relation recognition, using only connectives as features achieves more than 93% accuracy [28]. However, due to the absence of connectives, implicit discourse relation recognition must infer discourse relations from the two arguments alone, and is still challenging. Earlier researchers usually developed surface features and used supervised learning methods to perform the task [6,16,20,27,31]. Among these features, word pairs occurring in argument pairs are considered important, since they can partially capture discourse relationships between two arguments. For example, word pairs with an antonymic relation, like (crude, advance) in Example (b), may signal a Comparison relation, while synonym word pairs like (good, great) may indicate a Conjunction relation. However, previous classifiers based on these features do not work well because of the data sparsity problem.

∗ Corresponding author.
E-mail address: [email protected] (C. Wu).

(a) [The computers were crude by today's standards,]Arg1
    but, [Apple II was a major advance from Apple I.]Arg2

(b) [We have seen a big advance of the project.]Arg1
    [The others are still very crude.]Arg2

To address this problem, some researchers attempt to take advantage of unlabeled data, especially explicit discourse data, to enrich the training data. For example, explicit instances signaled by the connective but can potentially be used as additional training data for the Comparison relation in implicit discourse relation recognition. They remove connectives from explicit discourse instances and automatically label them by mapping connectives to the corresponding discourse relations (e.g., but – Comparison). However, according to Sporleder and Lascarides [33], directly using these data as additional training data degrades performance because of two drawbacks: (1) The meaning shift problem. Consider the explicit instance 'I am eager to go home for the vacation. Nonetheless, I will book a flight to Beijing.': one would infer the Contingency relation rather than the Comparison relation if nonetheless is dropped. (2) The domain problem. Explicit and implicit discourse data differ in both word distributions and relation distributions. For example, in the PDTB data set, the four top-level discourse relations are distributed as follows: explicit instances (18.9% Temporal, 28.8% Comparison, 18.7% Contingency and 33.6% Expansion) and implicit instances (5.7% Temporal, 16.9% Comparison, 24.9% Contingency and 52.5% Expansion). In other words, implicit and explicit discourse data can be considered as data from different domains. Accordingly, for implicit discourse relation recognition, explicit discourse instances can potentially be used as additional labeled data, but with some noise.

https://doi.org/10.1016/j.neucom.2019.08.081

Recent researchers seek to leverage explicit discourse data via domain adaptation [5], data selection [32] or multi-task learning [12,13,19,38]. While showing better results, they all directly use explicit data to train classifiers. As a result, only a small amount of explicit data is used because of the noise problem. Intuitively, incorporating massive explicit discourse data would further improve the performance. Recently, some researchers have used word embeddings instead of words as input features, and designed various neural networks to capture discourse relationships between arguments [8,9,11,14,18,30,42]. While achieving promising results, they are all based on general word embeddings, which ignore discourse information (e.g., good, great, and bad are often mapped into close vectors). In general, using task-specific word embeddings would further boost the performance.

Based on the above analysis, we propose to learn connective-based word embeddings (CBWE) from massive explicit data for implicit discourse relation recognition. Explicit data can be considered to be automatically labeled by connectives. While they contain some noise and cannot be directly used as training data for implicit discourse relation recognition, they are effective enough to provide weakly supervised signals for training the connective-based word embeddings. Our method is inspired by the observation that synonym (antonym) word pairs tend to appear around the discourse connective and (but). Other connectives can also provide some discourse clues. We expect to mine these discourse clues from explicit data and encode them into distributed representations of words. These representations can be used as features for implicit discourse relation recognition and other discourse-related tasks, and can potentially boost their performance. Compared with previous work, our method provides two benefits: (1) Discourse-relevant word pair information is encoded into the connective-based word embeddings, and such information is helpful for implicit discourse relation recognition. As shown in the explicit Example (a), the word pair (crude, advance) is related to the connective but. Our method can encode the semantic relation signaled by but into the embeddings of the words crude and advance, which is clearly helpful for identifying the Comparison relation of the implicit Example (b). (2) Our method is a two-stage method that first learns CBWE and then uses it as features. In this way, our method leverages massive explicit data indirectly and can thus reduce the influence of noise. In contrast, both data selection and multi-task methods use explicit data to train their recognition models directly, which makes them more susceptible to noise and makes it harder for them to incorporate massive explicit data.

Specifically, we use two simple and effective neural networks to learn CBWE by performing connective classification on massive explicit data: an average model capturing discourse relationships between words implicitly, and an interaction model capturing these relationships explicitly. We apply CBWE as pre-trained word embeddings for a neural implicit discourse relation recognition model. On both the English PDTB [29] and Chinese CDTB [15] data sets, using CBWE yields significantly better performance than using general word embeddings, and than recent methods incorporating explicit discourse data. The interaction model shows better results than the average model. More importantly, the learned CBWE can be easily transferred to strong implicit discourse relation models like [3,9], to boost their performance further.

Fig. 1. The average model for learning CBWE.

The major contributions of this paper include: (1) We introduce a simple and effective method to leverage explicit discourse data. (2) The learned CBWE¹ can be easily combined with other techniques, to potentially boost the performance of implicit discourse relation recognition further. (3) We achieve the state-of-the-art performance, to the best of our knowledge. The rest of this paper is organized as follows. We detail the models for learning CBWE in Section 2 and the implicit discourse relation recognition model used in Section 3. We conduct experiments to validate the effectiveness of CBWE in Section 4. Finally, we review the related work in Section 5 and draw conclusions in Section 6.

2. Connective-based word embeddings

We induce CBWE based on explicit discourse data by performing connective classification. The connective classification task predicts which connective is suitable for combining two given arguments. In this section, we first introduce two simple and effective neural network models for learning CBWE, and then the way of collecting explicit discourse data for training.

2.1. The average model

We adapt the model in [38] to learn CBWE by performing connective classification, and call it the average model. As illustrated in Fig. 1, it uses an average layer to represent two arguments, and then a multi-layer perceptron (MLP) for classification. The average model is simple enough to enable us to train on massive data efficiently.

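Formally, let (Arg1, Arg2, conn) denote an explicit discourse instance, where Arg1 and Arg2 are arguments and conn is the connective. Let $x_i \in \mathbb{R}^d$ and $y_j \in \mathbb{R}^d$ denote the embeddings of the $i$th word of Arg1 and the $j$th word of Arg2, and let $m$ and $n$ denote the lengths of Arg1 and Arg2, respectively. All word embeddings can be denoted as a matrix $L \in \mathbb{R}^{v \times d}$, where $v$ is the size of the vocabulary and $d$ the dimension of word embeddings. The average model first represents an argument as the average of its word embeddings, $\bar{x}$ for Arg1 and $\bar{y}$ for Arg2. It then concatenates $\bar{x}$ and $\bar{y}$ as $h_0$, a single representation of both Arg1 and Arg2:

$$\bar{x} = \frac{1}{m}\sum_{i=1}^{m} x_i, \quad \bar{y} = \frac{1}{n}\sum_{j=1}^{n} y_j, \quad h_0 = [\bar{x}, \bar{y}] \tag{1}$$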

Finally, $h_0 \in \mathbb{R}^{2d}$ is fed into an MLP for classification. Specifically, $l$ nonlinear hidden layers are stacked to obtain a more abstract representation $h_l$, followed by a softmax layer that gives the probabilities of the different classes $o$:

$$h_i = f(W_i h_{i-1} + b_i),\ i \in [1, l], \qquad o = \mathrm{softmax}(W_c h_l + b_c) \tag{2}$$

where $f$ is the nonlinear activation function for all hidden layers, and $W_i$, $b_i$, $W_c$, $b_c$ are parameters. We combine the cross-entropy error and the regularization error as the objective function:

$$J(\theta) = -\sum_{k=1}^{q} g_k \times \log(o_k) + \frac{\lambda}{2}\|\theta\|^2, \tag{3}$$

where $g$ is the ground-truth label vector of a training instance, $q$ the number of classes, $\lambda$ the regularization coefficient and $\theta = (L, W_i, b_i, W_c, b_c)$ the set of parameters. Note that $b_i$, $b_c$ and $L$ are not included in the regularization term. During training, $L$ is first randomly initialized and then tuned to minimize the objective function. The finally obtained $L$ is our CBWE.

¹ Our learned CBWE is publicly available at here, and we will make the source code available after review.

Fig. 2. The interaction model for learning CBWE.
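The average model of Section 2.1 (Eqs. (1)–(3)) can be sketched in a few lines of NumPy. This is only an illustrative forward pass under stated assumptions, not the authors' implementation: the function names, the toy sizes and the random initialization are hypothetical, ReLU is taken from Table 3, and the regularizer is applied to the weight matrices only, which is one reading of the note that $b_i$, $b_c$ and $L$ are left out of the regularization term.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def softmax(z):
    z = z - z.max()                      # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def average_model_forward(L, arg1_ids, arg2_ids, hidden, W_c, b_c):
    """Eq. (1) and Eq. (2): average, concatenate, MLP, softmax."""
    x_bar = L[arg1_ids].mean(axis=0)     # average of Arg1 word embeddings
    y_bar = L[arg2_ids].mean(axis=0)     # average of Arg2 word embeddings
    h = np.concatenate([x_bar, y_bar])   # h_0 in R^{2d}
    for W_i, b_i in hidden:              # l nonlinear hidden layers
        h = relu(W_i @ h + b_i)
    return softmax(W_c @ h + b_c)        # distribution over connectives

def objective(o, gold, weights, lam):
    """Eq. (3) for a one-hot label: cross-entropy plus (lambda/2)*||W||^2."""
    cross_entropy = -np.log(o[gold])
    reg = 0.5 * lam * sum((W ** 2).sum() for W in weights)  # weights only (assumption)
    return cross_entropy + reg

# Toy sizes (hypothetical): vocabulary 50, d = 8, one hidden layer, 4 connectives.
rng = np.random.default_rng(0)
v, d, hsize, q = 50, 8, 6, 4
L = rng.normal(scale=0.1, size=(v, d))
hidden = [(rng.normal(scale=0.1, size=(hsize, 2 * d)), np.zeros(hsize))]
W_c, b_c = rng.normal(scale=0.1, size=(q, hsize)), np.zeros(q)

o = average_model_forward(L, [1, 2, 3], [4, 5], hidden, W_c, b_c)
loss = objective(o, gold=2, weights=[hidden[0][0], W_c], lam=1e-4)
```

In actual training, the gradient of this loss would also flow into the rows of `L` selected by the two arguments, which is how the embeddings become CBWE.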

2.2. The interaction model

Recently, neural network models which incorporate word pair information directly have achieved superior performance on natural language inference [24] and implicit discourse relation recognition [14]. Therefore, we propose to use a simplified version of these models to learn CBWE, and call it the interaction model. As illustrated in Fig. 2, the interaction model first uses an interaction layer to capture the cross-argument word pair information, then an aggregate layer to represent arguments, and finally an MLP layer for classification. The interaction model captures discourse relationships between words explicitly, while the average model does this implicitly.

Formally, we first calculate the word interaction score matrix $E \in \mathbb{R}^{m \times n}$, computing $e_{ij}$ for each possible pair of the $i$th word ($x_i$) in Arg1 and the $j$th word ($y_j$) in Arg2, and normalize the scores as follows:

$$e_{ij} = x_i y_j^{T}, \quad \alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{n}\exp(e_{ik})}, \quad \beta_{ji} = \frac{\exp(e_{ij})}{\sum_{k=1}^{m}\exp(e_{kj})}. \tag{4}$$

A high interaction score $e_{ij}$ means the corresponding word pair is well correlated with a particular discourse relation.

$\alpha_{i1}, \ldots, \alpha_{in}$ indicate the normalized interaction scores between the $i$th word in Arg1 and each word in Arg2. Similarly, $\beta_{j1}, \ldots, \beta_{jm}$ indicate the normalized interaction scores between the $j$th word in Arg2 and each word in Arg1. As described in Eq. (5), based on these normalized scores, we augment the representation of the $i$th word in Arg1 as $[x_i, \tilde{x}_i]$, where $\tilde{x}_i$ can be considered as its related parts in Arg2. Similarly, the $j$th word in Arg2 is represented as $[y_j, \tilde{y}_j]$. Finally, an argument is represented as the average of the augmented representations of its words, $\bar{x}$ for Arg1 and $\bar{y}$ for Arg2, and the concatenation $[\bar{x}, \bar{y}]$ is fed into a multi-layer perceptron for classification.

$$\tilde{x}_i = \sum_{j=1}^{n}\alpha_{ij}\, y_j, \quad \bar{x} = \frac{1}{m}\sum_{i=1}^{m}[x_i, \tilde{x}_i], \quad \tilde{y}_j = \sum_{i=1}^{m}\beta_{ji}\, x_i, \quad \bar{y} = \frac{1}{n}\sum_{j=1}^{n}[y_j, \tilde{y}_j]. \tag{5}$$
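The interaction and aggregate layers of Eqs. (4) and (5) reduce to two softmax normalizations of the same score matrix, one along each axis. The NumPy sketch below illustrates this under assumptions: the function name and the toy inputs are hypothetical, and the MLP on top is omitted.

```python
import numpy as np

def softmax(z, axis):
    z = z - z.max(axis=axis, keepdims=True)   # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def interaction_representation(X, Y):
    """X: (m, d) embeddings of Arg1 words, Y: (n, d) embeddings of Arg2 words."""
    E = X @ Y.T                    # e_ij = x_i y_j^T, shape (m, n)
    alpha = softmax(E, axis=1)     # alpha[i, j]: normalized over the n words of Arg2
    beta = softmax(E, axis=0).T    # beta[j, i]: normalized over the m words of Arg1
    X_tilde = alpha @ Y            # related parts of Arg2 for each Arg1 word, (m, d)
    Y_tilde = beta @ X             # related parts of Arg1 for each Arg2 word, (n, d)
    x_bar = np.concatenate([X, X_tilde], axis=1).mean(axis=0)  # average of [x_i, x~_i]
    y_bar = np.concatenate([Y, Y_tilde], axis=1).mean(axis=0)  # average of [y_j, y~_j]
    return np.concatenate([x_bar, y_bar]), alpha, beta

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 4))        # toy Arg1: m = 3 words, d = 4
Y = rng.normal(size=(5, 4))        # toy Arg2: n = 5 words
rep, alpha, beta = interaction_representation(X, Y)
```

Note that the layer itself introduces no parameters beyond the embeddings, which matches the stated motivation of pushing as much information as possible into the word vectors.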

Essentially, the task of connective classification is similar to implicit discourse relation recognition, just with different output labels. Therefore, any effective neural network model for implicit relation recognition can be easily adapted for connective classification. We choose simple models instead of complicated models for learning CBWE for two reasons: (1) training simple models on massive explicit data is time-efficient, and (2) simple models usually contain fewer other parameters, so that as much information as possible is encoded into the word embeddings. For example, both the average model and the interaction model have only two sets of parameters: word embeddings and parameters of the MLP. In our experiments, if we compute $e_{ij} = F(x_i) F(y_j)^{T}$ as in [24], where $F$ is a feed-forward neural network, or $e_{ij} = x_i A y_j^{T} + B[x_i, y_j] + c_{ij}$ as in [14], where $A$, $B$, $c_{ij}$ are parameters to model and encode word pair semantics, the resulting CBWE is not as good as that learned by the average model or the interaction model. The reason is that some useful information is encoded into the parameters of $F$ or $A$, $B$, $c_{ij}$.

Word order information is not used in either the average model or the interaction model, which keeps our models very simple and allows them to be trained on large amounts of explicit discourse data efficiently. Intuitively, considering word order information can boost the performance of connective classification. For example, we can enhance the representation of a word by concatenating its word embedding and position embedding as in [10], or use recurrent neural networks (e.g., LSTM) instead of the average operator in Eq. (1) or Eq. (5). We conduct experiments (results not listed) to verify the above two methods and find that: (1) Neither method results in better CBWE. In particular, when LSTM is used, the resulting CBWE is not as good as that learned by the interaction model. The reason is that some useful information is encoded into the parameters of the LSTM. (2) Transferring the learned position embeddings is not helpful. The reason is that word order information can be learned from any sentences, not just from explicit discourse data. (3) Using LSTM is very time-consuming because it cannot be parallelized. Our primary purpose in this paper is to learn better CBWE. The learned CBWE can be easily transferred not only to the IDRR model (Section 3), but also to future models for this task or other discourse-related tasks, to potentially boost their performance. Based on the above analysis, we do not consider order information in our models for CBWE.

Compared with the attention mechanisms in sequence-to-sequence models [2] and multi-head self-attention models [35], our interaction model is bidirectional and usually used in scenarios with two sentences. Compared with the bi-attention models used in [14,24], our interaction model is a simplified version. Specifically, we adapt commonly used bi-attention models to capture word-pair information explicitly and encode this information into word embeddings. We simplify them by retaining only the two necessary sets of parameters, word embeddings and parameters of the MLP layer. Therefore, as much information as possible is encoded into the word embeddings, as expected. Experimental results in Table 6 also show that our interaction model is more suitable for learning CBWE than the commonly used bi-attention models. Accordingly, to some extent, our interaction model is proposed for a new application: learning connective-based word embeddings (CBWE) from massive explicit discourse data.


2.3. Collecting explicit discourse data

Collecting explicit discourse data includes two steps: (1) Distinguish whether an occurrence of a connective reflects a discourse relation. For example, the connective and can either function as a discourse connective joining two Conjunction arguments, or just link two nouns in a phrase. (2) Identify the positions of the two arguments. According to Prasad et al. [29], Arg2 is defined as the argument following a connective, whereas Arg1 can be located within the same sentence as the connective or in some previous or following sentence. Lin et al. [17] show that the accuracy of distinguishing English connectives is more than 97%, while that of identifying arguments is below 80%. Therefore, we use the existing toolkit² to find English discourse connectives, and collect only explicit instances matching patterns like [Arg1 conn Arg2], where the two arguments are in the same sentence, to decrease noise.
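The same-sentence pattern [Arg1 conn Arg2] can be illustrated with a deliberately naive sketch. Everything here is an assumption made for illustration: the three-item connective list, the regular expression and the function name are hypothetical, and the sketch skips the first step entirely (deciding whether a connective occurrence is a discourse usage), which the paper delegates to the PDTB-style parser toolkit.

```python
import re

# Hypothetical subset of connectives; the paper uses 96 PDTB connectives.
CONNECTIVES = ["but", "because", "however"]

# Arg1, an optional comma, the connective, then Arg2 up to the sentence end.
PATTERN = re.compile(
    r"^(?P<arg1>.+?),?\s+(?P<conn>" + "|".join(CONNECTIVES) + r")\s+(?P<arg2>.+?)[.!?]?$",
    re.IGNORECASE,
)

def extract_instance(sentence):
    """Return (Arg1, Arg2, conn) for a single sentence, or None if no pattern matches."""
    match = PATTERN.match(sentence.strip())
    if match is None:
        return None
    return (
        match.group("arg1").strip(),
        match.group("arg2").strip(),
        match.group("conn").lower(),
    )

instance = extract_instance(
    "The computers were crude by today's standards, but Apple II was a major advance from Apple I."
)
```

A surface pattern like this would also fire on non-discourse uses of a connective, which is exactly why the connective disambiguation step matters in the real pipeline.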

The same-sentence restriction seems to have two potentially detrimental effects on the collected corpus. First, it would skew the distribution of discourse relations towards those that are typically expressed within the same sentence. Second, it would exclude explicit discourse instances consisting of two separate sentences, which are more similar to implicit discourse instances. To explore this problem, we compare two collected English explicit discourse corpora: (1) one collected by our method, and (2) the other collected from the results of the PDTB-style parser [17], where all identified explicit instances are included. We find that: (1) there is no obvious difference in the distribution of connectives between the two corpora, and (2) connective-based word embeddings trained on the two corpora achieve similar performance on implicit discourse relation recognition. Therefore, our way of collecting explicit data is feasible when using a very large corpus. It can also be easily generalized to other languages: one just needs to train a classifier to find discourse connectives following [17].

3. Model for implicit discourse relation recognition

To evaluate the effectiveness of our learned CBWE, we use it as the pre-trained word embeddings for a popular implicit discourse relation recognition model (IDRR model, hereafter), to see if it improves the performance. In fact, our CBWE can be easily used with any neural implicit discourse relation recognition model. In the following, we briefly introduce the IDRR model used³ [24] to keep this paper self-contained. Recall that the interaction model for CBWE (Section 2.2) is essentially a simplified version of the IDRR model. They both use an interaction layer to capture the cross-argument word pair information, then an aggregate layer to represent arguments, and finally an MLP layer for classification. The only differences between them are that the IDRR model uses Eq. (6) instead of Eq. (4) for word interaction scores, and Eq. (7) instead of Eq. (5) for argument representations. Specifically, F in Eq. (6) and G in Eq. (7) are multi-layer feed-forward neural networks, added for learning more abstract feature representations.

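$$e_{ij} = F(x_i) F(y_j)^{T}, \quad \alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{n}\exp(e_{ik})}, \quad \beta_{ji} = \frac{\exp(e_{ij})}{\sum_{k=1}^{m}\exp(e_{kj})} \tag{6}$$

$$\tilde{x}_i = \sum_{j=1}^{n}\alpha_{ij}\, y_j, \quad \bar{x} = \frac{1}{m}\sum_{i=1}^{m} G([x_i, \tilde{x}_i]), \quad \tilde{y}_j = \sum_{i=1}^{m}\beta_{ji}\, x_i, \quad \bar{y} = \frac{1}{n}\sum_{j=1}^{n} G([y_j, \tilde{y}_j]) \tag{7}$$

² https://github.com/linziheng/pdtb-parser.

³ Though the model described in [24] is used for natural language inference, we find it is also effective for implicit discourse relation recognition, and it achieves slightly better results than the model in [14] in our experiments.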

Table 1
Statistics of data sets on the PDTB.

Relation   Training   Validation   Test
Temp            582           48     55
Comp           1855          189    145
Cont           3235          281    273
Expa           6673          638    538

Table 2
Top 10 most frequent connectives in the collected explicit discourse data.

Connective   Frequency    Connective   Frequency
and          1,040,207    when           224,116
but            770,705    after          209,224
also           665,039    if             202,497
while          238,364    however        155,811
as             227,702    because        150,589

4. Experiments

In this section, we evaluate the effectiveness of our proposed method on both the English PDTB and Chinese CDTB data sets. We focus on testing whether our method is more helpful than previous methods which use explicit discourse data as additional training data, and whether our learned CBWE can be combined with other techniques to boost the performance further.

4.1. Data and settings

Implicit discourse relation recognition is usually considered as a multi-way classification task. Following [19], we perform a 4-way classification on the four top-level relations in the PDTB, including Temporal (Temp), Comparison (Comp), Contingency (Cont) and Expansion (Expa). We adopt the standard settings and split the PDTB corpus into the training set (Sections 2–20), validation set (Sections 0–1) and test set (Sections 21–22). Table 1 lists the statistics of these data sets.

We collect explicit data from the Xin and Ltw parts of the English Gigaword Corpus (3rd edition), and get about 4.92M explicit instances. There are 100 discourse connectives in the PDTB; we ignore four parallel connectives (e.g., if...then) for simplicity. Due to space limitations, we only list the top 10 most frequent English connectives in the collected corpus in Table 2. We randomly sample 20,000 instances as the validation set, 20,000 instances as the test set, and use the others as the training set for CBWE. After discarding words occurring fewer than 5 times, the size of the vocabulary is 185,048. For connective classification with the interaction model, we obtain an accuracy of about 58.3% on the test set when all 96 connectives are considered, and about 58.9%, 60.8%, 62.8%, 69.9% with the top 60, 30, 20, 10 most frequent connectives, respectively. Accuracies of the average model are about 4–5% lower than those of the interaction model. These results indicate that: (1) our simple models for connective classification are effective, and (2) the interaction model is more powerful than the average model because it captures discourse relationships between words explicitly.
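The data preparation described above (random validation and test samples, the remainder for training, and a frequency cut-off on the vocabulary) might look like the following sketch. The function names, the fixed seed and the toy instances are hypothetical; only the 20,000/20,000 sample sizes, the threshold of 5 and the 4.92M total come from the text.

```python
import random
from collections import Counter

def build_vocab(instances, min_count=5):
    """Keep words that occur at least min_count times across both arguments."""
    counts = Counter(w for arg1, arg2, _ in instances for w in arg1 + arg2)
    return {w for w, c in counts.items() if c >= min_count}

def split(instances, n_valid, n_test, seed=0):
    """Randomly carve out validation and test sets; the rest is training data."""
    shuffled = instances[:]
    random.Random(seed).shuffle(shuffled)
    valid = shuffled[:n_valid]
    test = shuffled[n_valid:n_valid + n_test]
    train = shuffled[n_valid + n_test:]
    return train, valid, test

# Toy corpus of (Arg1 tokens, Arg2 tokens, connective) triples.
toy = [(["a", "b"], ["c"], "but")] * 5 + [(["rare"], ["a"], "and")]
vocab = build_vocab(toy, min_count=5)
train, valid, test = split(toy, n_valid=1, n_test=2)

# With the sizes reported in the text: 4.92M total minus two 20,000 samples.
n_train = 4_920_000 - 20_000 - 20_000
```

The resulting training size of 4,880,000 agrees with the figure of about 4.88M training instances quoted later in the paper.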

Hyper-parameters for CBWE and IDRR are selected based on their corresponding validation sets, and listed in Table 3. The same hyper-parameters are used for the average model and the interaction model for CBWE. d means the dimension of word embeddings, bsize the batch size of training data, lr the learning rate, dropout the dropout rate [34] in hidden layers, λ the regularization coefficient in Eq. (3), and update the parameter update strategy. The nonlinear function f is used in Eq. (2) and in the hidden layers of F in Eq. (6) and G in Eq. (7). The learning rate for CBWE is decayed by a factor of 0.8 per epoch. In addition, hsizes_MLP, hsizes_F and hsizes_G are the sizes of hidden layers in the MLP, F in Eq. (6) and G in Eq. (7), respectively. Note that [200, 50] means that two hidden layers with the sizes of 200 and 50 are used, and parameters in all hidden layers are initialized with the default xavier_initializer function in Tensorflow [1]. We also find that AdaDelta is more stable than other parameter update strategies for IDRR, and that a relatively large learning rate can effectively speed up the training process of CBWE. The learned CBWE is used as the pre-trained word embeddings for IDRR, and fixed during training. Validation sets are used to early stop the training process. Different from the conference version [39] of this paper, no surface features are used here, for a fair comparison with other work.

Table 3
Hyper-parameters for training CBWE and IDRR.

Hyper-parameter   CBWE      IDRR
d                 300       300
bsize             64        32
lr                1.0       0.1
dropout           –         0.2
λ                 0.0001    0.0001
update            SGD       AdaDelta
f                 ReLU      ReLU
hsizes_MLP        [200]     [200, 50]
hsizes_F          –         [100, 100]
hsizes_G          –         [100, 100]

Due to the small and uneven test data set, we use both the Accuracy and Macro-averaged F1 (Macro-F1) to evaluate the whole system. We run our method 10 times with different random seeds (and therefore different initial parameters), and report the results of the run that is closest to the average results.

4.2. Comparison with general word embeddings

We first compare the learned connective-based word embeddings (CBWE) with two publicly available word embeddings⁴:

• GloVe⁵: trained on 840B words from the internet (Common Crawl) using the count-based model in [25], with a vocabulary of 2.2M and a dimensionality of 300.

• word2vec⁶: trained on 100B words from Google News using the CBOW model in [22], with a vocabulary of 3M and a dimensionality of 300.

• CBWE_avg: our connective-based word embeddings learned with the average model.

• CBWE_int: our connective-based word embeddings learned with the interaction model.

Results in Table 4 show that IDRR using CBWE gains significant improvements (one-tailed t-test with p < 0.05) over using GloVe or word2vec, on both Accuracy and Macro-F1. More importantly, using CBWE achieves substantial improvements across all relations on the F1 score, which indicates that our proposed method helps not only minority relations (Temp, Comp), but also major relations (Cont, Expa). In addition, IDRR using CBWE_int achieves better overall performance than using CBWE_avg, which suggests that modeling interaction between words explicitly is really helpful. Overall, our CBWE can effectively incorporate discourse information in explicit discourse data, and thus benefits implicit discourse relation recognition.

⁴ The reasons for using these word embeddings are: (1) They are both trained on massive data. (2) It will be convenient for other people to reproduce our experiments. (3) Using GloVe or word2vec word embeddings trained on the same corpus as CBWE achieves worse performance than the public embeddings.

⁵ http://nlp.stanford.edu/data/glove.840B.300d.zip.

⁶ https://code.google.com/archive/p/word2vec/GoogleNews-vectors-negative300.bin.gz.

Table 4
Results of using different word embeddings. We also list the Precision (P), Recall (R) and F1 score for each relation.

IDRR             +GloVe   +word2vec   +CBWE_avg   +CBWE_int
Temp   P          42.11       51.72       50.00       44.90
       R          29.09       27.27       34.55       40.00
       F1         34.41       35.71       40.86       42.31
Comp   P          39.29       40.54       40.00       47.44
       R          22.76       20.69       24.83       25.52
       F1         28.82       27.40       30.64       33.18
Cont   P          49.00       46.69       49.64       49.26
       R          45.05       49.08       49.82       48.72
       F1         46.95       47.86       49.73       48.99
Expa   P          62.23       62.48       64.70       64.82
       R          73.79       72.12       73.23       73.98
       F1         67.52       66.95       68.70       69.10
Accuracy          56.28       56.08       57.86       58.36
Macro-F1          44.42       44.48       47.48       48.39

In the first row of Table 4, there is a drop from 50.00% to 44.90% in the Precision of Temp (+CBWE_avg vs. +CBWE_int). A possible reason is that the number of Temp test instances is very small, only 55 in the test set (as listed in Table 1). In this case, the class-wise Precision and Recall scores on Temp are relatively sensitive. Note that the same phenomenon can also be found in [19], where the Precision on Temp drops from 60.00% to 42.42% when more auxiliary tasks are used.⁷ It is worth noting that, for the other three relations, the performance is more stable. In addition, the test set is very uneven; for example, the Expa instances account for 53.2% of total instances. Therefore, like most previous work, both the Accuracy and Macro-F1 on the whole test set are used to evaluate our method.
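As a small worked check of how the two class-level quantities in Table 4 relate, the sketch below recomputes the Temp F1 of +CBWE_int from its Precision and Recall, and the Macro-F1 as the unweighted mean of the four per-relation F1 scores. The helper names are hypothetical; the numbers are copied from the +CBWE_int column of Table 4.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

def macro_f1(per_class_f1):
    """Unweighted mean of class-wise F1, so small classes count as much as large ones."""
    return sum(per_class_f1) / len(per_class_f1)

# +CBWE_int column of Table 4 (percentages).
cbwe_int_f1 = {"Temp": 42.31, "Comp": 33.18, "Cont": 48.99, "Expa": 69.10}

temp_f1 = f1(44.90, 40.00)                       # Temp: P = 44.90, R = 40.00
overall = macro_f1(list(cbwe_int_f1.values()))   # compare with the reported 48.39
```

Both values agree with the table up to rounding. Because the mean is unweighted, the 55 Temp instances weigh as much as the 538 Expa instances, which is why Macro-F1 is reported alongside Accuracy on this uneven test set.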

4.3. Comparison with recent methods

In this section, we compare our method with recent methods which also use explicit discourse data to boost the performance:

• [32]: a data selection method which directly enlarges the training data with the chosen explicit discourse data.

• [7]: a count-based method to learn connective-based word representations from explicit discourse data, which are then used as features in a logistic regression model.

• [19]: a multi-task neural network model to incorporate several discourse-related data, including explicit discourse data and the RST-DT corpus [37].

• [38]: a bilingually-constrained method to synthesize additional training data and a multi-task neural network to incorporate these synthetic data.

• [12]: an attention-based mechanism to learn representations through interaction between arguments, and a multi-task neural network to leverage knowledge from explicit discourse data.

• [9]: a paragraph-level neural network to model inter-dependencies between discourse units, and a CRF layer to predict a sequence of explicit and implicit relations in a paragraph. Both the labeled implicit and explicit instances in the PDTB are used.

Results in Table 5 show the superiority of our proposed method, with the highest Accuracy and a comparable Macro-F1 among these methods. The main reason for these improvements is that our method can effectively utilize massive explicit discourse data, up to about 4.88M instances. Both the data selection method [32] and multi-task methods [12,19,38] directly use explicit data to estimate parameters of implicit discourse relation classifiers. As a result, it is hard for them to incorporate massive explicit data because of the noise problem. For example, only 20,000 and 40,000 explicit discourse instances are used in [32] and [19], respectively. While Braud and Denis [7] use massive explicit discourse data, their method is limited by the fact that the maximum dimension of word representations is restricted by the number of connectives, for example 96 in their work. By comparison, we learn CBWE by predicting connectives conditioned on arguments, which has no such dimension limitation and yields better performance. Overall, our method can conveniently and effectively leverage massive explicit discourse data, and thus is more powerful than recent baselines.

Table 5
Comparison with recent methods.

Method            Accuracy   Macro-F1
[32]                 57.10      40.50
[7]                  52.81      42.27
[19]                 57.27      44.98
[38]                 58.06      45.19
[12]                 57.39      47.80
[9]                  57.44      48.82
IDRR+CBWE_avg        57.86      47.48
IDRR+CBWE_int        58.36      48.39

Table 6
The transfer of CBWE. CBWE_IDRR means learning the CBWE with the IDRR model. ∗ means that we run their code and report results. +CBWE∗ means using the learned CBWE∗ instead of general word embeddings.

Method                                      Accuracy   Macro-F1
IDRR+CBWE_int                                  58.36      48.39
IDRR+CBWE_IDRR                                 57.17      46.63
IDRR+CBWE_IDRR+Feature layer transfer          58.46      48.00
[9]                                            57.44      48.82
[3]∗                                           60.14      50.69
[9]+CBWE_int                                   58.85      49.21
[3]+CBWE_int                                   60.93      51.32

Dai and Huang [9] performs slightly better on Macro-F1. In addition to using explicit discourse data, it also boosts the performance by using both argument-level and paragraph-level context to encode arguments, casting the relation recognition task as a sequence labeling task, and augmenting input word representations with part-of-speech tags and named entity tags. In comparison, our IDRR model is relatively weak. More importantly, as the next section shows, our learned CBWE can be easily used to boost the performance of Dai and Huang [9].

4.4. Transfer of CBWE

From the perspective of transfer learning [23], our method only transfers word embeddings between two related tasks. Recall that the task of connective classification (for learning CBWE) is similar to implicit discourse relation recognition, just with different output labels. If two similar models (differing only in their MLP layers) are used separately for the two tasks, we can transfer not only word embeddings but also parameters of feature layers. We conduct some experiments to explore this problem. Specifically, we construct two models by using different MLP layers in the IDRR model (Section 3): one model for learning CBWE, the other for relation recognition. The learned CBWE is referred to as CBWE_IDRR. In the upper part of Table 6, IDRR+CBWE_IDRR means transferring only the CBWE_IDRR for relation recognition, and IDRR+CBWE_IDRR+Feature layer transfer means transferring both the CBWE_IDRR and the parameters in feature layers (the interaction and aggregate layers). From these results, we can find that: (1) IDRR+CBWE_IDRR gets a significantly lower performance than our IDRR+CBWE_int (Line 2 vs. Line 1), and (2) IDRR+CBWE_IDRR+Feature layer transfer achieves comparable performance with ours (Line 3 vs. Line 1). These results indicate that the learned CBWE_IDRR is not as good as our CBWE_int and that some useful information is encoded into the parameters of feature layers. In addition, it is hard to transfer parameters of feature layers to other neural network models. Therefore, the simple interaction model is more suitable for learning CBWE than relatively complicated models, in terms of efficiency and effectiveness. In this case, as much information as possible is encoded into CBWE, which can also be easily transferred to other implicit discourse relation recognition models.

In the bottom part of Table 6, using our learned CBWE_int instead of the pre-trained word2vec embeddings, both Dai and Huang [9]+CBWE_int and Bai and Zhao [3]+CBWE_int achieve better performance (Line 6 vs. Line 4 and Line 7 vs. Line 5). In addition, Bai and Zhao [3]+CBWE_int obtains the state-of-the-art performance, to the best of our knowledge. Note that [9] is a strong baseline that models both the argument-level and paragraph-level context,8 and Bai and Zhao [3] achieves the SOTA performance via a deeper neural model augmented by different grained text representations.9 We use the source code provided by the authors. These results indicate that our CBWE can be easily combined with other advanced techniques to boost the performance further, for example, the powerful contextualized word embedding ELMo [26] used in [3]. Overall, we recommend using CBWE instead of general word2vec or GloVe word embeddings for implicit discourse relation recognition.

4.5. Effect of noise

The main advantage of our method is that it can leverage massive explicit discourse data, while previous methods are usually troubled by noise. In this section, we conduct experiments to show to what extent the noise in explicit data affects these methods. Specifically, we compare our method with the following two methods:

• IDRR+word2vec+Direct: directly extending the training data with explicit discourse data. We first map connectives to their corresponding discourse relations. To alleviate the noise problem, we discard explicit instances with ambiguous connectives (e.g., while), and randomly sample a subset of the explicit data with the same relation distribution as the implicit data.

• IDRR+word2vec+MT: leveraging explicit discourse data in a multi-task framework. Following Liu et al. [19], a connective classification task is defined on explicit discourse data and used as the auxiliary task to boost the relation recognition task. The two tasks share the same input and feature layers, and use separate MLP layers for classification. A relatively small learning rate is used for connective classification when training the two tasks simultaneously, to cope with noise.
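The data construction behind the Direct baseline (map connectives to relations, discard ambiguous connectives, sample according to the implicit relation distribution) can be sketched as below. The function name, the triple format of an instance and the toy sentences are assumptions made for illustration. The connective-to-relation map covers only the unambiguous connectives among the ten most frequent ones listed in Section 4.8.

```python
import random
from collections import defaultdict

# Unambiguous connectives and their usual relations; ambiguous connectives
# such as "while" are left out so that their instances are discarded.
CONNECTIVE_TO_RELATION = {
    "and": "Expansion",
    "also": "Expansion",
    "but": "Comparison",
    "however": "Comparison",
    "because": "Contingency",
    "if": "Contingency",
    "after": "Temporal",
}

def build_direct_data(explicit, implicit_dist, size, seed=0):
    """Turn explicit instances into pseudo-implicit training data.

    explicit: list of (arg1, connective, arg2) triples (hypothetical format).
    implicit_dist: relation -> proportion in the implicit training set.
    size: target number of sampled instances.
    """
    rng = random.Random(seed)
    by_relation = defaultdict(list)
    for arg1, conn, arg2 in explicit:
        relation = CONNECTIVE_TO_RELATION.get(conn.lower())
        if relation is None:        # ambiguous or unknown connective
            continue
        # The connective is removed; only the two arguments are kept.
        by_relation[relation].append((arg1, arg2, relation))
    sampled = []
    for relation, proportion in implicit_dist.items():
        pool = by_relation.get(relation, [])
        k = min(len(pool), round(size * proportion))
        sampled.extend(rng.sample(pool, k))
    rng.shuffle(sampled)
    return sampled

# Toy usage with invented sentences and an invented implicit distribution.
explicit = [
    ("It rained", "but", "we went out"),
    ("It rained", "while", "we slept"),
    ("He left", "because", "he was tired"),
    ("She sang", "and", "he danced"),
    ("She sang", "also", "he danced"),
]
implicit_dist = {"Expansion": 0.5, "Comparison": 0.25, "Contingency": 0.25}
data = build_direct_data(explicit, implicit_dist, size=4)
```

Even after this filtering, every sampled instance carries a label inferred from a connective rather than an annotated one, which is the source of the noise examined in this section.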

As illustrated in Fig. 3, we conduct experiments with 10, 100, 500, 1000 and 4880 thousand explicit instances, respectively. Note that 10 thousand instances are about the same amount as the labeled implicit data, 100 thousand instances are usually used in previous multi-task methods, and the others can be considered as massive data. We can find that: (1) Directly using explicit data as additional training data is harmful, with significant drops in both Accuracy and Macro-F1. The more explicit instances are used, the more the performance is affected by noise. This observation is consistent with the finding in [33]. (2) The multi-task method achieves improvements when 10 or 100 thousand explicit instances are used, but degrades the performance when more explicit instances

8 https://github.com/ZeyuDai/paragraph-level_implicit_discourse_relation_classification.


Fig. 3. The effect of noise. Direct means using explicit discourse data directly. MT means incorporating explicit discourse data in multi-task learning. CBWE_int means using our learned word embeddings instead of word2vec.

Table 7
Top 15 closest words of not and good in both word2vec and CBWE.

| not: word2vec | not: CBWE_avg | not: CBWE_int | good: word2vec | good: CBWE_avg | good: CBWE_int |
|---|---|---|---|---|---|
| do | no | n't | great | great | happy |
| did | n't | no | bad | lot | interesting |
| anymore | never | neither | terrific | very | positive |
| necessarily | nothing | nothing | decent | better | pleased |
| anything | neither | never | nice | success | great |
| anyway | none | none | excellent | well | helpful |
| does | difficult | nowhere | fantastic | happy | definitely |
| never | nor | unaware | better | certainly | glad |
| want | refused | unable | solid | respect | deserve |
| neither | impossible | nobody | lousy | fine | deserves |
| if | limited | unknown | wonderful | import | better |
| know | declined | refused | terrible | positive | fine |
| anybody | nobody | seldom | Good | help | lot |
| yet | little | hardly | tough | useful | reasonable |
| either | denied | impossible | best | welcome | ok |

are used. To some extent, the multi-task method also uses explicit instances directly, because it updates the parameters of the IDRR model according to the loss on explicit data. (3) Our proposed method gets better performance when massive explicit data are used, and is almost unaffected by noise. The reason is that our method uses these additional data indirectly, learning CBWE first and then using it as the input for implicit discourse relation recognition. These results show that, when large amounts of explicit discourse data are used, our method can effectively control the noise problem.

4.6. Quality of CBWE

To give an intuition of what information is encoded into the learned CBWE, we list in Table 7 the top 15 closest words of not and good, according to cosine similarity. We can find that, in CBWE, words similar to not have, to some extent, negative meanings. And since refused and declined are similar to not, a classifier may easily identify the implicit instance [A network spokesman would not comment. ABC Sports officials declined to be interviewed.] as the Expansion.Conjunction relation. For good in CBWE, the similar words no longer include words like bad and terrific. Furthermore, the similarity score between good and great in CBWE_int is 0.48 while the score between good and bad is just 0.30, which may make it easier for a classifier to distinguish the word pair (good, great) from (good, bad), and thus is helpful for predicting the Expansion.Conjunction relation. This qualitative analysis demonstrates the ability of our CBWE to capture discourse relationships between words.
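The nearest-neighbour lookup behind Table 7 can be sketched in a few lines. This is a minimal illustration that assumes the embeddings are available as a dictionary from words to vectors. The two-dimensional toy vectors are invented and are not taken from the learned CBWE.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def closest_words(query, embeddings, k=15):
    """Return the k words most similar to `query`, with their scores."""
    q = embeddings[query]
    scored = [(w, cosine(q, vec)) for w, vec in embeddings.items() if w != query]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Invented toy vectors, for illustration only.
toy = {
    "good": [1.0, 0.0],
    "great": [0.9, 0.1],
    "bad": [0.0, 1.0],
    "not": [-1.0, 0.2],
}
neighbours = closest_words("good", toy, k=2)
```

With real embeddings, the same function would be called with k=15 on the word2vec, CBWE_avg and CBWE_int vocabularies in turn.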

4.7. Case study

Two examples shown in Figs. 4 and 5 give us some evidence that the learned CBWE is superior to the word2vec word embeddings when used for implicit discourse relation recognition. These figures show the attention scores (e_ij in Eq. (6)) calculated by the IDRR model. Word pairs assigned high attention scores are highlighted: the higher the score, the darker the color. From these figures, we can take a close look at which word pairs are important when making a prediction. Specifically, we show the interaction matrices of the IDRR+word2vec model and the IDRR+CBWE_int model on two test instances, to demonstrate how they behave differently.

We can find that: (1) For the Expansion instance in Fig. 4, the IDRR+CBWE_int model succeeds in detecting cross-argument word pairs that indicate the corresponding relation, e.g., injuries-collapsed, while the IDRR+word2vec model focuses on word pairs like injuries-lines. (2) For the Comparison instance in Fig. 5, the IDRR+word2vec model gives the wrong prediction Expansion. The reason is that it pays more attention to word pairs like options-stock. On the other hand, the IDRR+CBWE_int model makes the correct prediction by paying more attention to the word pairs stopped-remained and stopped-open. (3) After examining all test instances, we notice that IDRR+CBWE_int usually focuses on fewer words than IDRR+word2vec, and general words like and, were, in are given little attention. All these suggest that CBWE_int captures information different from that in word2vec: it captures word pair information (from explicit discourse data) that is relevant to the discourse relation recognition task. With the help of our learned CBWE_int, the IDRR model can focus on relation-relevant word pairs, and thus boost the recognition performance.

4.8. Number of connectives

We conduct experiments to investigate the impact of the connectives used in training CBWE on the performance of IDRR.


Fig. 4. An Expansion instance in the test set.

Fig. 5. A Comparison instance in the test set.

Fig. 6. Impact of connectives used in training CBWE_int.

Specifically, we use explicit discourse instances with the top 10, 20, 30, 60 most frequent connectives, or all connectives, to learn CBWE_int, accounting for 78.9%, 91.9%, 95.8%, 99.4% or 100% of the total instances, respectively. The top 10 most frequent connectives are: and, but, also, while, as, when, after, if, however and because. According to connectives and their related relations in the PDTB, in most cases, and and also indicate the Expansion relation, if and because the Contingency relation, after the Temporal relation, and but and however the Comparison relation. The connectives as, when and while are ambiguous. Overall, these connectives cover all four top-level relations defined in the PDTB. As illustrated in Fig. 6, with only the top 10 connectives, the learned CBWE_int achieves better performance than the word2vec word embeddings (red dotted lines). We observe significant improvements when using the top 20 connectives, almost the best performance with the top 30 connectives, and no further substantial improvements with more connectives. These results indicate


Table 8
Statistics of datasets on the CDTB.

| Relation | Training | Validation | Test |
|---|---|---|---|
| Tran | 33 | 1 | 5 |
| Caus | 682 | 88 | 95 |
| Expl | 1143 | 147 | 126 |
| Coor | 2300 | 529 | 347 |

Table 9
Results on the CDTB. ∗ means that we run their code on the CDTB. +CBWE∗ means using the learned CBWE∗ instead of general word embeddings.

| Model | Accuracy | Macro-F1 |
|---|---|---|
| IDRR+GloVe | 70.30 | 58.04 |
| IDRR+word2vec | 69.44 | 57.12 |
| IDRR+CBWE_avg | 73.42 | 63.16 |
| IDRR+CBWE_int | 73.59 | 64.56 |
| [38] | 74.30 | 62.57 |
| [9]∗ | 73.77 | 64.24 |
| [3]∗ | 74.82 | 65.95 |
| [9]+CBWE_int | 74.12 | 64.75 |
| [3]+CBWE_int | 75.70 | 66.27 |

that we can use only the top n most frequent connectives to collect explicit discourse data for CBWE, which is very convenient for most languages.
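The coverage figures quoted in this section (for example, 78.9% of instances for the top 10 connectives) come from a simple frequency count. The sketch below shows one way such a count and the corresponding filtering could be done. The triple format of an instance and both function names are assumptions, and the toy data are invented.

```python
from collections import Counter

def top_n_coverage(instances, n):
    """Return the n most frequent connectives and the share of instances they cover.

    instances: list of (arg1, connective, arg2) triples (hypothetical format).
    """
    counts = Counter(conn for _, conn, _ in instances)
    top = [conn for conn, _ in counts.most_common(n)]
    total = sum(counts.values())
    coverage = sum(counts[c] for c in top) / total if total else 0.0
    return top, coverage

def keep_top_n(instances, n):
    """Keep only the instances whose connective is among the n most frequent."""
    top, _ = top_n_coverage(instances, n)
    allowed = set(top)
    return [inst for inst in instances if inst[1] in allowed]

# Toy data: 5 "and", 3 "but" and 2 "while" instances.
toy = ([("a", "and", "b")] * 5
       + [("a", "but", "b")] * 3
       + [("a", "while", "b")] * 2)
top2, share = top_n_coverage(toy, 2)
subset = keep_top_n(toy, 2)
```

For a new language, the same count gives a quick estimate of how many connectives need to be listed by hand before the collected explicit data cover most instances.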

4.9. Results on the CDTB

Four top-level relations are defined in the CDTB (a Chinese discourse corpus), including Transition (Tran), Causality (Caus), Explanation (Expl) and Coordination (Coor). We use instances in the first 50 documents as the test set, the second 50 documents as the validation set and the remaining 400 documents as the training set. Table 8 lists the statistics of these datasets. We conduct a 3-way classification because there are only 39 instances for Tran. About 6.5M explicit discourse instances collected from the Chinese Gigaword Corpus (3rd edition), with 88 connectives, are used for CBWE. Note that connectives occurring fewer than 10,000 times are discarded. We find that the hyper-parameters selected on the PDTB (see Table 3) also work well on the CDTB, except that the learning rate is set to 0.08 and the batch size to 16 for training IDRR. For the model in [3], we use the pre-trained Chinese ELMo embeddings10 and ignore the sub-word information in the input layer.
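The document-level split and the decision to drop Tran can be checked with a short sketch. The per-relation counts are copied from Table 8. The document identifiers, the function names and the `min_total` threshold are assumptions introduced only for this illustration.

```python
# Instance counts per relation copied from Table 8: (training, validation, test).
CDTB_COUNTS = {
    "Tran": (33, 1, 5),
    "Caus": (682, 88, 95),
    "Expl": (1143, 147, 126),
    "Coor": (2300, 529, 347),
}

def split_documents(doc_ids, n_test=50, n_valid=50):
    """First n_test documents for test, next n_valid for validation, rest for training."""
    test = doc_ids[:n_test]
    valid = doc_ids[n_test:n_test + n_valid]
    train = doc_ids[n_test + n_valid:]
    return train, valid, test

def frequent_relations(counts, min_total):
    """Keep relations with at least min_total instances over all three splits."""
    return sorted(rel for rel, c in counts.items() if sum(c) >= min_total)

doc_ids = list(range(500))               # hypothetical document identifiers
train, valid, test = split_documents(doc_ids)
tran_total = sum(CDTB_COUNTS["Tran"])    # 33 + 1 + 5
kept = frequent_relations(CDTB_COUNTS, min_total=100)  # threshold is an assumption
```

With these counts, Tran has 39 instances in total and only one in the validation set, so the three remaining relations form the 3-way task.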

Results in Table 9 show that the performance of our method on the CDTB has a similar trend to that on the PDTB. Specifically, IDRR+CBWE achieves significant improvements over IDRR+GloVe or IDRR+word2vec (the upper part of Table 9), and using CBWE_int for strong baselines achieves substantial improvements (the bottom part of Table 9). Bai and Zhao [3]+CBWE_int obtains the state-of-the-art performance on the CDTB, to the best of our knowledge. These results indicate that our proposed method is also effective for Chinese implicit discourse relation recognition, and the learned CBWE can be easily combined with other techniques to boost the performance further.

5. Related work

Implicit discourse relation recognition has attracted more attention since the release of the PDTB [29], the first large discourse corpus distinguishing implicit instances from explicit ones. Most previous research focuses on designing surface features manually, including lexical and polarity features [27], word pairs and parse information [16], entity features [20], word cluster pairs [31], and so on.

10 https://github.com/HIT-SCIR/ELMoForManyLangs.

Recently, researchers have resorted to neural networks to learn distributed features automatically, for example, a shallow convolutional network [42], entity-augmented recursive networks [11], a convolutional neural network with dynamic pooling [19], gated relevance networks [8], repeated reading neural networks with multi-level attention [18], a simple word interaction model [14], and a deeper enhanced model with different grained text representations [3]. However, due to the limited training data, methods based on surface features (high dimensions) or distributed features (complicated models with many parameters) usually face the data sparsity problem.

Therefore, the second line of research tries to take advantage of unlabeled data, especially explicit discourse data (weakly labeled by connectives), to enrich the training data. Marcu and Echihabi [21] are the first to propose using explicit discourse instances as additional training data by removing connectives and mapping them to the corresponding relations. However, Sporleder and Lascarides [33] suggest that using these artificial implicit data indiscriminately degrades the performance, because of the domain problem and the meaning shift problem. Subsequently, to use explicit data effectively, some researchers use multi-task learning methods [12,13,19,38]. Specifically, they leverage auxiliary tasks (e.g., connective classification on explicit data) to promote the performance of the main task (implicit discourse relation recognition), by sharing common information between them. Other researchers use data selection methods [32,36,40,41]. They select explicit instances (similar to implicit ones) according to some criteria, and use them to enlarge the training corpus directly. Both the multi-task learning and data selection methods show promising results. However, they use explicit data to train classifiers directly, which makes it hard for them to incorporate massive explicit data because of the noise problem. Different from the above work, we learn connective-based word embeddings from explicit data, and use them as input features. Our method leverages massive explicit data indirectly, and thus can reduce the influence of noise.

Some aspects of this work are similar to [4,7]. Based on massive explicit instances, they first build a word-connective (or word pair-connective) co-occurrence frequency matrix, and then weight these raw frequencies to obtain word (word pair) representations. In this way, they represent words (word pairs) in the space of connectives to directly encode their discourse function. The major limitation of their approach is that the dimension of the word representations must be less than or equal to the number of connectives. By comparison, we learn word embeddings by predicting connectives conditioned on arguments, which has no such dimension limitation. Essentially, they use count-based methods to learn word representations, while we adopt a prediction-based method and achieve better performance.
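To make the dimension limitation of count-based representations concrete, the following sketch builds a word-connective co-occurrence table and turns each row into a vector. It is only an illustration: relative frequency is used here as a simple stand-in for the weighting schemes of [4,7], and the instance format, function names and toy sentences are assumptions.

```python
from collections import Counter, defaultdict

def word_connective_counts(instances):
    """Count how often each argument word co-occurs with each connective.

    instances: list of (arg1, connective, arg2) triples (hypothetical format).
    """
    counts = defaultdict(Counter)
    for arg1, conn, arg2 in instances:
        for word in arg1.split() + arg2.split():
            counts[word][conn] += 1
    return counts

def count_based_vectors(counts, connectives):
    """Represent each word by its relative frequency over the connectives.

    Relative frequency is a simple stand-in weighting, not the one of [4,7].
    """
    vectors = {}
    for word, row in counts.items():
        total = sum(row.values())
        vectors[word] = [row[c] / total for c in connectives]
    return vectors

# Invented toy instances.
toy = [
    ("not good", "but", "very cheap"),
    ("not bad", "and", "very cheap"),
    ("not open", "but", "still busy"),
]
connectives = ["and", "but"]
vectors = count_based_vectors(word_connective_counts(toy), connectives)
```

Every vector has exactly one component per connective, so its dimension cannot exceed the number of connectives. A prediction-based model is free to choose the embedding size independently of the label set.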

6. Conclusion

In this paper, we propose to learn connective-based word embeddings from massive explicit data for implicit discourse relation recognition. Experiments on both the PDTB and CDTB data sets show that using our learned word embeddings as features can significantly boost the performance. We also show that our method can use massive explicit data more effectively than previous work. Since most neural network models for implicit discourse relation recognition and discourse-related tasks use pre-trained word embeddings as inputs, we hope our learned word embeddings will benefit them.

In the future, we would like to explore how to learn task-specific sentence representations based on the abundance of weakly-labeled explicit discourse data. We are also interested in verifying the effectiveness of our resulting word embeddings on tasks like sentiment classification, since they seem useful even beyond discourse-related tasks.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

We would like to thank all the reviewers for their constructive and helpful suggestions on this paper. This work is supported by the Natural Science Foundation of China (Grant No. 61866012), the Natural Science Foundation of Jiangxi Province (Grant No. 20181BAB202012), and the Science and Technology Research Project of the Education Department of Jiangxi Province (Grant No. GJJ180329).

References

[1] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, M. Kudlur, J. Levenberg, R. Monga, S. Moore, D.G. Murray, B. Steiner, P. Tucker, V. Vasudevan, P. Warden, M. Wicke, Y. Yu, X. Zheng, TensorFlow: a system for large-scale machine learning, in: Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation, OSDI'16, Berkeley, CA, USA, 2016, pp. 265–283.

[2] D. Bahdanau, K. Cho, Y. Bengio, Neural machine translation by jointly learning to align and translate, in: Proceedings of International Conference on Learning Representations, ICLR 2015, 2015, pp. 1–11.

[3] H. Bai, H. Zhao, Deep enhanced representation for implicit discourse relation recognition, in: Proceedings of International Conference on Computational Linguistics, COLING, Association for Computational Linguistics, Santa Fe, New Mexico, USA, 2018, pp. 571–583.

[4] O. Biran, K. McKeown, Aggregated word pair features for implicit discourse relation disambiguation, in: Proceedings of Association for Computational Linguistics, ACL, Sofia, Bulgaria, 2013, pp. 69–73.

[5] C. Braud, P. Denis, Combining natural and artificial examples to improve implicit discourse relation identification, in: Proceedings of International Conference on Computational Linguistics, COLING, Dublin, Ireland, 2014, pp. 1694–1705.

[6] C. Braud, P. Denis, Comparing word representations for implicit discourse relation classification, in: Proceedings of Empirical Methods in Natural Language Processing, EMNLP, Lisbon, Portugal, 2015, pp. 2201–2211.

[7] C. Braud, P. Denis, Learning connective-based word representations for implicit discourse relation identification, in: Proceedings of Empirical Methods in Natural Language Processing, EMNLP, Austin, Texas, 2016, pp. 203–213.

[8] J. Chen, Q. Zhang, P. Liu, X. Qiu, X. Huang, Implicit discourse relation detection via a deep architecture with gated relevance network, in: Proceedings of Association for Computational Linguistics, ACL, Berlin, Germany, 2016, pp. 1726–1735.

[9] Z. Dai, R. Huang, Improving implicit discourse relation classification by modeling inter-dependencies of discourse units in a paragraph, in: Proceedings of North American Chapter of the Association for Computational Linguistics, NAACL, Melbourne, Australia, 2018, pp. 141–151.

[10] J. Gehring, M. Auli, D. Grangier, D. Yarats, Y.N. Dauphin, Convolutional sequence to sequence learning, in: Proceedings of International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 2017, pp. 1243–1252. JMLR.org.

[11] Y. Ji, J. Eisenstein, One vector is not enough: entity-augmented distributed semantics for discourse relations, Trans. Assoc. Comput. Linguist. 3 (2015) 329–344.

[12] M. Lan, J. Wang, Y. Wu, Z.-Y. Niu, H. Wang, Multi-task attention-based neural networks for implicit discourse relationship representation and identification, in: Proceedings of Empirical Methods in Natural Language Processing, EMNLP, Copenhagen, Denmark, 2017, pp. 1310–1319.

[13] M. Lan, Y. Xu, Z. Niu, Leveraging synthetic discourse data via multi-task learning for implicit discourse relation recognition, in: Proceedings of Association for Computational Linguistics, ACL, Sofia, Bulgaria, 2013, pp. 476–485.

[14] W. Lei, X. Wang, M. Liu, I. Ilievski, X. He, M.Y. Kan, SWIM: a simple word interaction model for implicit discourse relation recognition, in: Proceedings of International Joint Conferences on Artificial Intelligence, IJCAI, Melbourne, Australia, 2017, pp. 4026–4032.

[15] Y. Li, W. Feng, J. Sun, F. Kong, G. Zhou, Building Chinese discourse corpus with connective-driven dependency tree structure, in: Proceedings of Empirical Methods in Natural Language Processing, EMNLP, Doha, Qatar, 2014, pp. 2105–2114.

[16] Z. Lin, M.-Y. Kan, H.T. Ng, Recognizing implicit discourse relations in the Penn Discourse Treebank, in: Proceedings of Empirical Methods in Natural Language Processing, EMNLP, PA, USA, 2009, pp. 343–351.

[17] Z. Lin, H.T. Ng, M.-Y. Kan, A PDTB-styled end-to-end discourse parser, Nat. Lang. Eng. 20 (02) (2014) 151–184.

[18] Y. Liu, S. Li, Recognizing implicit discourse relations via repeated reading: neural networks with multi-level attention, in: Proceedings of Empirical Methods in Natural Language Processing, EMNLP, Austin, Texas, 2016, pp. 1224–1233.

[19] Y. Liu, S. Li, X. Zhang, Z. Sui, Implicit discourse relation classification via multi-task neural networks, in: Proceedings of Conference on Artificial Intelligence, AAAI, Arizona, USA, 2016, pp. 2750–2756.

[20] A. Louis, A. Joshi, R. Prasad, A. Nenkova, Using entity features to classify implicit discourse relations, in: Proceedings of Special Interest Group on Discourse and Dialogue, SIGDIAL, PA, USA, 2010, pp. 59–62.

[21] D. Marcu, A. Echihabi, An unsupervised approach to recognizing discourse relations, in: Proceedings of Association for Computational Linguistics, ACL, PA, USA, 2002, pp. 368–375.

[22] T. Mikolov, K. Chen, G. Corrado, J. Dean, Efficient estimation of word representations in vector space, in: Proceedings of Workshop at ICLR, 2013, pp. 1–12.

[23] S.J. Pan, Q. Yang, A survey on transfer learning, IEEE Trans. Knowl. Data Eng. 22 (10) (2010) 1345–1359.

[24] A. Parikh, O. Täckström, D. Das, J. Uszkoreit, A decomposable attention model for natural language inference, in: Proceedings of Empirical Methods in Natural Language Processing, EMNLP 2016, Association for Computational Linguistics, Austin, Texas, 2016, pp. 2249–2255.

[25] J. Pennington, R. Socher, C. Manning, GloVe: global vectors for word representation, in: Proceedings of Empirical Methods in Natural Language Processing, EMNLP, Doha, Qatar, 2014, pp. 1532–1543.

[26] M. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, L. Zettlemoyer, Deep contextualized word representations, in: Proceedings of North American Chapter of the Association for Computational Linguistics, NAACL 2018, Association for Computational Linguistics, New Orleans, Louisiana, 2018, pp. 2227–2237.

[27] E. Pitler, A. Louis, A. Nenkova, Automatic sense prediction for implicit discourse relations in text, in: Proceedings of ACL-IJCNLP, PA, USA, 2009, pp. 683–691.

[28] E. Pitler, A. Nenkova, Using syntax to disambiguate explicit discourse connectives in text, in: Proceedings of ACL-IJCNLP, PA, USA, 2009, pp. 13–16.

[29] R. Prasad, N. Dinesh, A. Lee, E. Miltsakaki, L. Robaldo, A. Joshi, B. Webber, The Penn Discourse TreeBank 2.0, in: Proceedings of International Conference on Language Resources and Evaluation, LREC, 24, 2008, pp. 2961–2968.

[30] L. Qin, Z. Zhang, H. Zhao, A stacking gated neural architecture for implicit discourse relation classification, in: Proceedings of Empirical Methods in Natural Language Processing, EMNLP, Austin, Texas, 2016, pp. 2263–2270.

[31] A. Rutherford, N. Xue, Discovering implicit discourse relations through Brown cluster pair representation and coreference patterns, in: Proceedings of European Chapter of the Association for Computational Linguistics, EACL, Gothenburg, Sweden, 2014, pp. 645–654.

[32] A. Rutherford, N. Xue, Improving the inference of implicit discourse relations via classifying explicit discourse connectives, in: Proceedings of North American Chapter of the Association for Computational Linguistics, NAACL, Denver, Colorado, 2015, pp. 799–808.

[33] C. Sporleder, A. Lascarides, Using automatically labelled examples to classify rhetorical relations: an assessment, Nat. Lang. Eng. 14 (3) (2008) 369–416.

[34] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, R. Salakhutdinov, Dropout: a simple way to prevent neural networks from overfitting, J. Mach. Learn. Res. 15 (1) (2014) 1929–1958.

[35] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A.N. Gomez, L. Kaiser, I. Polosukhin, Attention is all you need, in: Proceedings of Advances in Neural Information Processing Systems, NIPS, 2017, pp. 6000–6010.

[36] X. Wang, S. Li, J. Li, W. Li, Implicit discourse relation recognition by selecting typical training examples, in: Proceedings of International Conference on Computational Linguistics, COLING, Mumbai, India, 2012, pp. 2757–2772.

[37] M. William, S. Thompson, Rhetorical structure theory: towards a functional theory of text organization, Text 8 (3) (1988) 243–281.

[38] C. Wu, X. Shi, Y. Chen, Y. Huang, J. Su, Leveraging bilingually-constrained synthetic data via multi-task neural networks for implicit discourse relation recognition, Neurocomputing 243 (2017) 69–79.

[39] C. Wu, X. Shi, Y. Chen, J. Su, B. Wang, Improving implicit discourse relation recognition with discourse-specific word embeddings, in: Proceedings of Association for Computational Linguistics, ACL, 2, Vancouver, Canada, 2017, pp. 269–274.

[40] C. Wu, X. Shi, J. Su, Y. Chen, Y. Huang, Co-training for implicit discourse relation recognition based on manual and distributed features, Neural Process. Lett. 46 (1) (2017) 233–250.

[41] Y. Xu, Y. Hong, H. Ruan, J. Yao, M. Zhang, G. Zhou, Using active learning to expand training data for implicit discourse relation recognition, in: Proceedings of Empirical Methods in Natural Language Processing, EMNLP, Association for Computational Linguistics, Brussels, Belgium, 2018, pp. 725–731.

[42] B. Zhang, J. Su, D. Xiong, Y. Lu, H. Duan, J. Yao, Shallow convolutional neural network for implicit discourse relation recognition, in: Proceedings of Empirical Methods in Natural Language Processing, EMNLP, Lisbon, Portugal, 2015, pp. 2230–2235.


Changxing Wu received his Ph.D. degree from the School of Information Science and Technology, Xiamen University, Xiamen, China, in 2017. He is now working at East China Jiaotong University. His research interests include natural language processing and deep learning.

Jinsong Su received his Ph.D. degree in computer science from the Chinese Academy of Sciences, Beijing, China, in 2011. He is currently an associate professor at Xiamen University. His research interests include natural language processing and deep learning.

Yidong Chen received his Ph.D. degree in mathematics from Xiamen University, Xiamen, China, in 2008. He is now an associate professor in the Cognitive Science Department of Xiamen University. His research interests include machine translation and semantic analysis.

Xiaodong Shi received his Ph.D. degree in computer software from the National University of Defense Technology, Changsha, China, in 1994. He is now a professor in the Cognitive Science Department of Xiamen University. His research interests include natural language processing and artificial intelligence.
