Contents lists available at ScienceDirect
Neurocomputing
journal homepage: www.elsevier.com/locate/neucom
Boosting
implicit
discourse
relation
recognition
with
connective-based
word
emb
e
ddings
Changxing Wu
a,∗, Jinsong Su
b, Yidong Chen
c,d, Xiaodong Shi
c,d aVirtual Reality and Interactive Techniques Institute, East China Jiaotong University, Nanchang 330013, China bXiamen University, Xiamen 361005, ChinacFujian Key Lab of the Brain-like Intelligent Systems, Xiamen University, Xiamen 361005, China
dDepartment of Cognitive Science, School of Information Science and Technology, Xiamen University, Xiamen 361005, China
a
r
t
i
c
l
e
i
n
f
o
Article history:
Received4September2018 Revised12June2019 Accepted27August2019 Availableonline30August2019 CommunicatedbyDr.Y.Chang
Keywords:
Connective-basedwordembeddings Implicitdiscourserelationrecognition Connectiveclassification
Neuralnetwork
a
b
s
t
r
a
c
t
Implicitdiscourserelationrecognitionistheperformancebottleneckofdiscoursestructureanalysis.To alleviatethe shortageoftrainingdata,previousmethods usuallyuseexplicitdiscoursedata,whichare naturallylabeledby connectives,as additionaltraining data.However, itis oftendifficultfor themto integratelargeamountsofexplicitdiscoursedatabecauseofthenoiseproblem.Inthispaper,we pro-poseasimple and effective methodtoleverage massive explicitdiscourse data.Specifically,we learn connective-basedwordembeddings(CBWE)byperformingconnectiveclassificationonexplicitdiscourse data.ThelearnedCBWEiscapableofcapturingdiscourserelationshipsbetweenwords,andcanbeused aspre-trainedwordembeddingsfor implicitdiscourserelationrecognition. OnboththeEnglish PDTB andChineseCDTBdatasets,usingCBWEachievessignificantimprovementsoverbaselineswithgeneral wordembeddings,andbetterperformancethanbaselinesintegratingexplicitdiscoursedata.By combin-ingCBWEwithastrongbaseline,weachievethestate-of-the-artperformance.
© 2019ElsevierB.V.Allrightsreserved.
1. Introduction
Recognizingdiscourserelations(e.g.,Comparison)betweentwo text spans is a crucial subtask of discourse structure analysis. These relations can benefit many downstream natural language processing tasks, including questionanswering, machine transla-tion and so on. A discourse relation instance is usually defined as a discourse connective (e.g., but, and) taking two arguments (e.g.,clause,sentence).Example(a)isanexplicitdiscourseinstance signaled by the connective but, while Example(b) is an implicit discourse instancewith theComparisonrelation, andthe connec-tiveisabsent.Forexplicitdiscourserelationrecognition,usingonly connectives asfeatures achievesmorethan 93% inaccuracy [28]. However,duetotheabsenceofconnectives,implicitdiscourse re-lationrecognitionneeds toinferencediscourse relationsbasedon twoarguments,andisstill challenging.Earlierresearchersusually develop surface features and use supervised learning method to performthetask[6,16,20,27,31].Amongthesefeatures,wordpairs occurringinargumentpairsareconsidered asimportantfeatures, since they can partially catch discourse relationships between two arguments.Forexample,wordpairswithantonymicrelation,
∗ Correspondingauthor.
E-mail address: [email protected](C.Wu).
like (crude, advance) in Example (b), may mean a Comparison relation,whilesynonymwordpairslike(good,great)mayindicate aConjunctionrelation.However,previousclassifiersbasedonthese featuresdonotworkwellbecauseofthedatasparsityproblem.
(a) [Thecomputerswerecrudebytoday’sstandards,]Arg1
but,[AppleIIwasamajoradvancefromAppleI.]Arg2
(b) [Wehaveseenabigadvanceoftheproject.]Arg1
[Theothersarestillverycrude.]Arg2
Toaddress thisproblem, someresearchersattempttotake ad-vantage of unlabeled data, especially explicit discourse data, to enrich the trainingdata. For example, explicitinstances signaled bytheconnective butcanbe potentiallyused asadditional train-ing data for the Comparison relation in implicit discourse rela-tionrecognition.Theyremoveconnectivesfromexplicitdiscourse instances, and automatically labeled them by mapping connec-tives into corresponding discourse relations (e.g., but – Compar-ison). However, according to Sporleder and Lascarides [33], di-rectlyusing thesedataasadditionaltrainingdatawoulddegrade the performance due to the following two drawbacks: (1) The meaningshiftproblem.Consideringtheexplicitinstance:‘Iam ea-ger to go home for the vacation. Nonetheless, I will book a flight to Beijing.’, one would infer the Contingency relation rather than theComparisonrelationifnonetheless isdropped. (2)The domain https://doi.org/10.1016/j.neucom.2019.08.081
problem.Therearedifferentworddistributionsanddifferent rela-tiondistributionsbetweenexplicitandimplicitdiscoursedata.For example,in the PDTB dataset, the four top-level discourse rela-tions include: explicit instances (18.9% Temporal, 28.8% Compari-son,18.7%Contingencyand33.6%Expansion)andimplicitinstances (5.7%Temporal,16.9%Comparison,24.9%Contingencyand52.5% Ex-pansion). In other words,Implicit andexplicit discourse datacan beconsideredasdatafromdifferentdomains.Accordingly,for im-plicit discourse relation recognition, explicit discourse instances canbepotentially usedasadditionallabeleddata,butwithsome noise.
Recent researchers seektoleverage explicitdiscourse datavia domainadaptation [5], data selection [32] or multi-task learning [12,13,19,38]. While showing better results, they all directly use explicit data to train classifiers. As a result, a small amount of explicit data is just used because of the noise problem. Intu-itively,incorporatingmassiveexplicitdiscoursedatawouldfurther improve the performance. Recently, some researchers use word embeddingsinsteadofwordsasinputfeatures,anddesignvarious neuralnetworkstocapturediscourserelationshipsbetween argu-ments[8,9,11,14,18,30,42].While achievingpromisingresults,they areallbasedongeneralwordembeddingswhichignorediscourse information(e.g.,good,great,andbadareoftenmappedintoclose vectors). In general, using task-specific word embeddings would furtherboosttheperformance.
Based on the above analysis, we propose to learn connective-based word embeddings (CBWE) from massive explicit data for implicitdiscourserelationrecognition.Explicitdatacanbe consid-eredtobeautomaticallylabeledbyconnectives.Whiletheycannot be directly used as training data for implicit discourse relation recognitionandcontain some noise, theyare effective enoughto provide weakly supervised signals to train the connective-based word embeddings. Our method is inspired by the observation that synonym (antonym) word pairs tend to appear around the discourseconnectiveand(but).Otherconnectivescanalsoprovide some discourse clues. We expect to mine these discourse clues from explicit data, and encode them into distributed represen-tations of words. These representations can used as features for implicitdiscourse relationrecognitionandother discourse-related tasks, and boost their performance potentially. Compared with previous work, our method provides two benefits: (1) Discourse relevantword pair information is encodedinto connective-based word embeddings, such information is helpful for implicit dis-courserelationrecognition.As shownin theexplicitExample(a), wordpair(crude,advance) isrelatedwiththeconnectivebut.Our method can encode the semantic relation signaled by but into theembeddingsofwords crudeandadvance,whichare obviously helpful for distinguish the Comparison relation of the implicit Example(b). (2) Our method is a two-stage method which first learnsCBWEandthenusesitasfeatures.Inthisway,ourmethod leveragesmassiveexplicitdataindirectlyandthuscanreducethe influence of noise. On the other hand, both data selection and multi-task methods use explicit data to train their recognition modelsdirectly,whichmakesthemmoresusceptibletonoiseand hardertoincorporatemassiveexplicitdata.
Specifically, we use two simple andeffective neural networks to learn CBWE by performing connective classification on mas-siveexplicitdata: an averagemodel capturingdiscourse relation-shipsbetweenwords implicitly,andan interaction model captur-ing these relationships explicitly. We apply CBWE as pre-trained wordembeddingsfor aneural implicit discourse relations recog-nition model.On both the English PDTB [29] andChinese CDTB [15]data sets,usingCBWEyields significantlybetter performance thanusing generalword embeddings,andrecent methods incor-poratingexplicitdiscoursedata.Theinteractionmodelshows bet-terresultsthantheaveragemodel.Moreimportantly,thelearned
Fig.1. Theaveragemodelforlearning CBWE .
CBWEcanbeeasilytransferredtostrongimplicitdiscourserelation modelslike[3,9],toboosttheirperformancefurther.
The major contributions of this paper include: (1) We intro-duce asimpleandeffectivemethodtoleverage explicitdiscourse data. (2) The learned CBWE1 can be easily combined withother techniques, to potentially boost the performance of implicit dis-course relation recognition further. (3) We achieve the state-of-the-artperformance,tothebestofourknowledge.Thecontentsof thispaperareorganizedasfollows.Wedetailmodelsforlearning CBWEinSection 2andtheusedimplicitdiscourserelation recog-nitionmodelinSection3.Weconductexperimentstovalidatethe effectivenessofCBWEin Section4.Finally, wereview therelated workinSection5anddrawconclusionsinSection6.
2. Connective-basedwordembeddings
WeinduceCBWEbasedonexplicitdiscourse databy perform-ingconnectiveclassification.Theconnectiveclassificationtask pre-dicts whichconnective is suitable forcombiningtwo given argu-ments.Inthissection, wefirstintroducetwosimpleandeffective neural network models for learning CBWE, and then the way of collectingexplicitdiscoursedatafortraining.
2.1. Theaveragemodel
Weadaptthemodelin[38]tolearnCBWEbyperforming con-nectiveclassification,andcall itthe averagemodel.As illustrated inFig.1,itusesanaveragelayertorepresenttwoarguments,and then a multi-layer perception (MLP) for classification. The aver-agemodelissimpleenoughtoenableustotrainonmassivedata efficiently.
Formally,let(Arg1,Arg2,conn)denotesanexplicitdiscourse
in-stance, where Arg1 andArg2 are arguments, connis the
connec-tive. Letxi∈Rd andyj∈Rd denote theembeddingsof ithwordof Arg1andjthwordofArg2,mandndenotethelengthsofArg1 and
Arg2,respectively.Allword’sembeddingscanbedenotedasa
ma-trixL∈Rv×d,wherevisthesizeofvocabularyanddthedimension ofwordembeddings.Theaveragemodelfirst representsan argu-mentastheaverageofwords,xforArg1 andyforArg2.Andthen
itconcatenatesxandy ash0,asinglerepresentationofbothArg1
andArg2: x= 1 m m i=1 xi, y= 1 n n j=1 yj, h0=[x,y] (1)
Finally,h0∈R2disfedintoaMLPforclassification.Specially,l
non-linearhiddenlayersarestackedtogetamoreabstractive represen-tationhl,andthenasoftmaxlayerto getprobabilities ofdifferent 1Ourlearned CBWE ispubliclyavailableathere,andwewillmakethesource codeavailableafterreview.
Fig.2. TheinteractionmodelforlearningCBWE .
classeso:
hi= f
(
Wihi−1+bi)
,i∈[1,l] o=so ftmax(
Wchl+bc)
(2) where fisthe nonlinearactivation function forall hiddenlayers, Wi,bi,Wc,bc areparameters.Wecombinethecross-entropyerror andregularizationerrorastheobjectivefunction:
J
(
θ
)
=−q k=1
gk×log
(
ok)
+λ
2θ
2, (3)whereg isthe ground-truthlabelvector foran traininginstance, q the numberof classes,
λ
the regularization coefficient andθ
=(
L,Wi,bi,Wc,bc)
the setofparameters. Notethat bi,bc andLare not includedinθ
.Duringtraining, Lisfirst randomlyinitialized, andthentunedtominimizetheobjectivefunction.Thefinally ob-tainedLisourCBWE.2.2. Theinteractionmodel
Recently, neural network models whichincorporate wordpair information directlyachieve superior performance onnature lan-guage inference [24] and implicit discourse relation recognition [14]. Therefore, we propose to use a simplified version of these models tolearn CBWE,andcall ittheinteraction model.As illus-tratedinFig.2,theinteractionmodelfirstusesaninteractionlayer tocapturethecross-argumentwordpairinformation,thenan ag-gregate layerto representarguments,and finally aMLP layer for classification. The interaction model captures discourse relation-shipsbetweenwordsexplicitly,whiletheaveragemodeldoesthis implicitly.
Formally, we first calculate the word interaction score matrix E∈Rm×n,computinge
ijforeachpossibleithword(xi)inArg1 and
jthword(yj)inArg2,andnormalizethemasfollows: ei j=xiyTj,
α
i j= exp(
ei j)
n k=1exp(
eik)
,β
ji= exp(
ei j)
m k=1exp(
ek j)
. (4) A high interaction score eij means the corresponding word pair is well correlatedwitha particulardiscourse relation.α
i1,...,α
in indicate the normalized interaction scores betweenthe ithword inArg1 andeachwordinArg2.Similarly,β
j1,...,β
jmindicatethe normalized interaction scores betweenthe jth word inArg2 andeach word in Arg1. As described in Eq.(5), based on these
nor-malizedscores,weaugmenttherepresentationofithwordinArg1
as[xi,xi],wherexi canbe consideredasits relatedpartsinArg2.
Similarly,thejthwordinArg2 isrepresentedas[yj,yj].Finally,an
argumentisrepresentedastheaverageofaugmented representa-tions ofwords in it, xfor Arg1 and y forArg2. And the
concate-nation[x,y] isfed intoamulti-layerperception forclassification. xi=n j=1
α
i jyj, x=m1 m i=1 [xi,xi], yj=m i=1β
jixi, y= 1n n j=1 [yj,yj]. (5)Essentially, the task of connective classification is similar to implicitdiscourse relation recognition,just with differentoutput labels.Therefore, anyeffectiveneural network model forimplicit relation recognition can be easily adapted for connective classi-fication. The reasons why we choose simple models instead of complicatedmodelsforlearningCBWEare two-folds:(1)training simple models on massive explicit datais time-efficient, and (2) simple models usually contain fewer other parameters, which makesasmuchaspossibleinformationbeencodedintotheword embeddings.Forexample,both the averagemodeland the inter-actionmodelonlyhavetwosetsofparameters:wordembeddings and parameters of the MLP. In our experiments, if we compute ei j=F
(
xi)
F(
yj)
T asthat in[24],whereFisafeed-forward neural network,orei j=xiAyjT+B[xi,yj]+ci j asthatin[14],whereA, B, cij are parameters tomodelandencode wordpairsemantics, the resultedCBWEisnotasgoodasthatlearnedbytheaveragemodel orthe interaction model.The reason behindis that some useful informationisencodedintotheparametersofForA,B,cij.Wordorderinformationisnotusedinboththeaveragemodel andtheinteractionmodel,whichmakesourmodelsarevery sim-pleandcanbetrainedonlargeamountsofexplicitdiscoursedata efficiently.Intuitively,consideringthewordorderinformationcan boost the performance of connective classification. For example, wecanenhancetherepresentationofawordbyconcatenatingits word embedding and position embedding as [10], or use recur-rentneural networks(e.g.,LSTM)insteadof theaverage operator inEq.(1) orEq.(5).We conduct experiments(results not listed) toverifytheabovetwo methodsandfindthat:(1)Bothmethods donot result inbetter CBWE.Especially when LSTM isused, the resultingCBWE is not asgood asthat learned by the interaction method. The reason is that some useful information is encoded intotheparametersofLSTM.(2)Transferringthelearnedposition embeddingsisnothelpful.The reasoningbehindisthat word or-derinformation canbe learnedfromanysentences,not just lim-ited to the explicit discourse data. (3) Using LSTM is very time-consumingbecauseitcannotbeparallelized.Ourprimarypurpose in thispaper is to learn better CBWE. The learned CBWEcan be easilytransferredtonotonlytheIDRRmodel(Section 3),butalso futuremodelsforthistaskordiscourse-relatedtasks,topotentially boosttheir performance. Based on theabove analysis, we donot considertheorderinformationinourmodelsforCBWE.
Compared with the attention mechanisms in sequence-to-sequence models [2] and multi-head self-attention models [35], ourinteractionmodelisbidirectionalandusuallyusedinscenarios withtwo sentences.Comparedwiththebi-attentionmodels used in [14,24], our interaction model is a simplified version. Specifi-cally,weadaptcommonlyusedbi-attentionmodelstocatch word-pairinformationexplicitlyandencodetheseinformationintoword embeddings.We simplifythem by retainingonlytwo sets of pa-rametersthat arenecessary, wordembeddingsandparameters of theMLP layer. Therefore, asmuch aspossible information is en-coded into the word embeddings as expected. Experimental re-sultsinTable6alsoshowthatourinteractionmodelismore suit-ableforlearningCBWEthanthecommonlyusedbi-attention mod-els.Accordingly,tosomeextent,ourinteractionmodelisproposed foranewapplication,learningconnective-basedwordembeddings (CBWE)frommassiveexplicitdiscoursedata.
2.3.Collectingexplicitdiscoursedata
Collectingexplicitdiscoursedataincludestwosteps:(1) distin-guishwhetheraconnective occurringreflectsadiscourserelation. Forexample,theconnectiveandcaneitherfunctionasadiscourse connectiveto join two Conjunctionarguments, orbe justusedto linktwonounsinaphrase.(2)identifythepositionsoftwo argu-ments.AccordingtoPrasadetal.[29],Arg2 isdefinedasthe
argu-mentfollowinga connective,however,Arg1 canbelocatedwithin
thesamesentenceastheconnective,insome previousor follow-ingsentence.Linetal.[17]showthat theaccuracyof distinguish-ingEnglish connectivesismorethan97%,whileidentifying argu-ments isbelowthan 80%. Therefore,we use the existingtoolkit2 tofind Englishdiscourse connectives,and justcollectexplicit in-stancesusingpatternslike[Arg1 connArg2],wheretwoarguments
areinthesamesentence,todecreasenoise.
Therestrictionofthesamesentenceseemstohavetwo poten-tially detrimental effects on the collected corpus. First, it would skewthedistributionofdiscourserelationstowardsthosethatare typically expressed within the same sentence. Second, it would actually exclude explicit discourse instances consisting of two separate sentences, which are more similar to implicit discourse instances.Toexplore thisproblem,we comparetwocollected En-glishexplicitdiscourse corpora:(1) onecollectedby ourmethod, and(2)theothercollectedfromtheresultsofthepdtb-styleparser [17],where all identified explicitinstances are included. We find that:(1)thereisnoobviousdifferenceinthedistributionof con-nectivesbetweenthetwocorpora,and(2)connectivebasedword embeddingstrained on two corpora achieve similar performance on implicit discourse relation recognition. Therefore, our way of collectingexplicitdataisfeasiblewhenusingaverylarge corpus. Itcanalsobeeasily generalizedtootherlanguages,onejustneed totrainaclassifiertofinddiscourseconnectivesfollowing[17].
3. Modelforimplicitdiscourserelationrecognition
ToevaluatetheeffectivenessofourlearnedCBWE,weuseitas thepre-trainedwordembeddingsfora popularimplicitdiscourse relationrecognitionmodel(IDRRmodel,hereafter),toseeifit im-provesthe performance.In fact,ourCBWEcan be easilyused for anyneuralimplicitdiscourserelationrecognitionmodel.Inthe fol-lowing,webrieflyintroducetheusedIDRRmodel3[24]forthis pa-pertobeselfcontained.Letusrecallthattheinteractionmodelfor CBWE(Section 2.2) isessentially a simplified version ofthe IDRR model. They both use an interaction layer to capture the cross-argumentwordpair information,then an aggregate layer to rep-resent arguments, and finally a MLP layer for classification. The onlydifferencesbetweenthemaretheIDRRmodelusesEq.(6) in-steadofEq.(4)forwordinteractionscores,andEq.(7)insteadof Eq.(5)forargumentrepresentations.Specifically,FinEq.(6)andG inEq.(7)aremulti-layerfeed-forwardneuralnetworksandadded forlearningmoreabstractfeaturerepresentations.
ei j=F
(
xi)
F(
yj)
T,α
i j= exp(
ei j)
n k=1exp(
eik)
,β
ji= exp(
ei j)
m k=1exp(
ek j)
(6) xi=n j=1α
i j yj, x=m1 m i=1 G(
[xi,xi])
yj=m i=1β
jixi, y=1n n j=1 G(
[yj,yj])
(7) 2https://github.com/linziheng/pdtb-parser.3Thoughthemodeldescribedin[24]isusedfornaturelanguageinference,we find it isalso effective for implicitdiscourse relation recognition,and achieves slightlybetterresultsthanthemodelin[14]inourexperiments.
Table1
StatisticsofdatasetsonthePDTB.
Relation Training Validation Test
Temp 582 48 55
Comp 1855 189 145
Cont 3235 281 273
Expa 6673 638 538
Table2
Top10mostfrequentconnectivesinthecollectedexplicitdiscoursedata. Connective Frequency Connective Frequency
and 1,040,207 when 224,116 but 770,705 after 209,224 also 665,039 if 202,497 while 238,364 however 155,811 as 227,702 because 150,589 4. Experiments
In thissection, we evaluate theeffectiveness of ourproposed methodonboththeEnglishPDTBandChineseCDTBdatasets.We focusontestingwhetherourmethodismorehelpfulthanprevious methods which use explicit discourse data as additionaltraining data,andwhetherourlearnedCBWEcanbecombinedwithother techniquestoboosttheperformancefurther.
4.1. Dataandsettings
Implicitdiscourse relationrecognitionis usually consideredas amulti-wayclassificationtask.Following[19],weperforma4-way classificationonthefourtop-levelrelationsinthePDTB,including Temporal(Temp),Comparison(Comp),Contingency(Cont)and Expan-sion(Expa).WeadoptthestandardsettingsandsplitthePDTB cor-pusinto the trainingset (Sections 2–20),validation set (Sections 0–1) and test set (Sections 21–22). Table 1 lists the statistics of thesedatasets.
Wecollect explicitdatafromthe XinandLtwpartsofthe En-glishGigawordCorpus(3rdedition),andgetabout4.92Mexplicit instances.Thereare100discourseconnectivesinthePDTB,we ig-norefourparallelconnectives(e.g.,if...then)forsimplicity.Dueto thespacelimitation,weonlylistthetop10mostfrequentEnglish connectivesinthecollectedcorpusinTable2.Werandomly sam-ple20,000instancesasthevalidationset,20,000instancesasthe testsetandtheothersasthetrainingsetforCBWE.After discard-ing wordsoccurringlessthan5times,the sizeofthe vocabulary is185,048.Forconnectiveclassificationwiththeinteractionmodel, weobtainan accuracyofabout58.3%onthetestsetwhenall96 connectives are considered, andabout 58.9%,60.8%, 62.8%, 69.9% withthetop60,30,20,10mostfrequentconnectives,respectively. Accuraciesoftheaveragemodelareabout4–5%lower thanthose oftheinteractionmodel.Theseresultsindicatethat:(1)oursimple modelsforconnectiveclassificationareeffective,and(2)the inter-actionmodelismorepowerfulthantheaveragemodelby captur-ingdiscourserelationshipsbetweenwordsexplicitly.
Hyper-parameters for CBWE and IDRR are selected based on theircorrespondingvalidationsets,andlistedinTable3.Thesame hyper-parameters are used for the average model and the inter-action modelforCBWE. dmeans thedimension of word embed-dings, bsize the batch size of training data, lr the learning rate, dropout thedropout rate[34] in hiddenlayers,
λ
the regulariza-tion coefficient in Eq.(3), update the parameter update strategy. The nonlinear function fis used inEq.(2), hiddenlayers of Fin Eq.(6)andGinEq.(7).ThelearningrateforCBWEisdecayedby afactorof0.8perepoch.Inaddition,hsizesMLP,hsizesFandhsizesG are the sizes of hiddenlayers in the MLP, Fin Eq. (6) andG inTable3
Hyper-parametersfortraining CBWE and IDRR .
Hyper-parameter CBWE IDRR
d 300 300 bsize 64 32 lr 1.0 0.1 dropout – 0.2 λ 0.0001 0.0001 update SGD AdaDelta f ReLU ReLU hsizesMLP [200] [200,50] hsizesF – [100,100] hsizesG – [100,100]
Eq. (7), respectively. Note that [200, 50] means that two hidden layers with thesizes of200 and50 are used,andparameters in all hiddenlayers are initialized with thedefaultxavier_initializer functioninTensorflow[1].Wealsofindthat,AdaDeltaismore sta-blethanotherparameterupdatestrategiesforIDRR,andarelative largelearningratecaneffectivelyspeedupthetrainingprocessof CBWE.ThelearnedCBWEisusedasthepre-trainedword embed-dingsforIDRR,andfixed duringtraining. Validationsetsareused to early stop the trainingprocess. Different from the conference version [39] ofthis paper,no anysurface features are used here forafaircomparisonwithotherwork.
Due to the small and uneven test data set, we use both the Accuracy and Macro-averagedF1 (Macro-F1) to evaluate the whole
system.Werunourmethod10timeswithdifferentrandomseeds (thereforedifferentinitialparameters),andreporttheresults(ofa run)whichareclosesttotheaverageresults.
4.2. Comparisonwithgeneralwordembeddings
We first compare the learned connective-based word embed-dings(CBWE)withtwopubliclyavailablewordembeddings4:
• GloVe5: trainedon840B wordsfrominternet (common crawl) usingthecountbasedmodelin[25],withavocabularyof2.2M andadimensionalityof300.
• word2vec6:trainedon100BwordsfromGoogleNews usingthe CBOW modelin [22],with a vocabulary of 3M anda dimen-sionalityof300.
• CBWEavg:ourconnective-basedwordembeddingslearnedwith theaveragemodel.
• CBWEint: ourconnective-based wordembeddingslearnedwith theinteractionmodel.
ResultsinTable4showthat IDRRusingCBWEgainssignificant improvements(one-tailed t-testwithp<0.05)overusingGloVe or word2vec, on both Accuracy and Macro-F1. More importantly,
us-ing CBWE achieves substantial improvements across all relations on the F1 score, which indicates that our proposed method can
not onlyhelp minorityrelations (Temp,Comp), butalsomajor re-lations (Cont, Expa).In addition,IDRRusingCBWEint achieves bet-ter performance overusing CBWEavg, whichsuggeststhat model-ing interaction betweenwordsexplicitlyis reallyhelpful.Overall, ourCBWEcaneffectivelyincorporatediscourseinformationin ex-plicit discoursedata,andthus benefits implicitdiscourserelation recognition.
4Thereasonsforusingthesewordembeddingsare:(1)Theyarebothtrainedon massivedata.(2)Itwillbeconvenientforotherpeopletoreproduceour experi-ments.(3)Using GloVe or word2vec wordembeddingstrainedonthesamecorpus as CBWE achievesworseperformancethanthepublicembeddings.
5http://nlp.stanford.edu/data/glove.840B.300d.zip.
6https://code.google.com/archive/p/word2vec/GoogleNews-vectors-negative300. bin.gz.
Table4
Resultsofusingdifferentwordembeddings.WealsolistthePrecision(P ),Recall(R ) and F 1scoreforeachrelation.
IDRR +GloVe +word2vec +CBWE avg +CBWE int
Temp P 42.11 51.72 50.00 44.90 R 29.09 27.27 34.55 40.00 F 1 34.41 35.71 40.86 42.31 Comp P 39.29 40.54 40.00 47.44 R 22.76 20.69 24.83 25.52 F 1 28.82 27.40 30.64 33.18 Cont P 49.00 46.69 49.64 49.26 R 45.05 49.08 49.82 48.72 F 1 46.95 47.86 49.73 48.99 Expa P 62.23 62.48 64.70 64.82 R 73.79 72.12 73.23 73.98 F 1 67.52 66.95 68.70 69.10 Accuracy 56.28 56.08 57.86 58.36 Macro -F 1 44.42 44.48 47.48 48.39
InthefirstlineofTable4,thereisadropfrom50.00%to44.90% onPrecisionofTemp(+CBWEavgvs.+CBWEint).Thepossiblereason isthatthenumberoftestinstances(Temp)isverysmall,only55in thetestset(aslistedinTable1).Inthiscase,thePrecisionand Re-callscoresonTemp(class-wise) arerelativelysensitive. Notethat, thesamephenomenon can alsobe foundin [19],where the Pre-cisiononTempdropsfrom60.00% to42.42%whenmoreauxiliary tasksare used.7 It isworthy tonote that, fortheother three re-lations,theperformanceismorestable.Inaddition,thetestsetis veryuneven,forexample,theExpainstancesaccountfor53.2%of totalinstances.Therefore,likemostpreviouswork,boththe Accu-racyandMacro-F1 on thewholetest setareusedto evaluateour
method.
4.3.Comparisonwithrecentmethods
Inthis section,we compare ourmethod withrecentmethods whichalsouseexplicitdiscoursedatatoboosttheperformance:
• [32]:adataselectionmethodwhichdirectlyenlargesthe train-ingdatawiththechosenexplicitdiscoursedata.
• [7]:acount-basedmethodtolearnconnective-basedword rep-resentationsfromexplicitdiscourse data,whicharethen used asfeaturesinalogisticregressionmodel.
• [19]:amulti-task neuralnetworkmodeltoincorporateseveral discourse-relateddata,includingexplicitdiscoursedataandthe RST-DTcorpus[37].
• [38]:abilingually-constrainedmethodtosynthesizeadditional training data and a multi-task neural network to incorporate thesesyntheticdata.
• [12]: an attention-based mechanism to learn representations throughinteraction betweenarguments,anda multi-task neu-ralnetworktoleverageknowledgefromexplicitdiscoursedata. • [9]: a paragraph-level neural network to model inter-dependencies between discourse units, and a CRF layer to predict asequence ofexplicitandimplicitrelations ina para-graph. Both the labeled implicitand explicit instances in the PDTBareused.
ResultsinTable5showthesuperiorityofourproposedmethod, withthehighestAccuracyandacomparableMacro-F1amongthese
methods. The main reason for these improvements is that our methodcan effectively utilizemassiveexplicit discourse data,up toabout4.88Minstances.Boththedataselectionmethod[32]and multi-taskmethods[12,19,38]directlyuseexplicitdatatoestimate parametersofimplicitdiscourse relationclassifiers.Asa result, it
Table5
Comparisonwithrecentmethods.
Method Accuracy Macro -F 1
[32] 57.10 40.50 [7] 52.81 42.27 [19] 57.27 44.98 [38] 58.06 45.19 [12] 57.39 47.80 [9] 57.44 48.82 IDRR+CBWE avg 57.86 47.48 IDRR+CBWE int 58.36 48.39 Table6
Thetransferof CBWE. CBWE IDRRmeanslearningthe CBWE withthe IDRR model.∗ meansthatweruntheircodeandreportresults.+CBWE ∗ meansusingthelearned
CBWE ∗insteadofgeneralwordembeddings.
Method Accuracy Macro -F 1
IDRR+CBWE int 58.36 48.39
IDRR+CBWE IDRR 57.17 46.63
IDRR+CBWE IDRR+Feature layer transfer 58.46 48.00
[9] 57.44 48.82
[3]∗ 60.14 50.69
[9]+CBWE int 58.85 49.21
[3]+CBWE int 60.93 51.32
ishard forthem to incorporatemassive explicit data because of thenoise problem. For example,only 20,000 and40,000 explicit discourseinstances are used in[32] and[19], respectively.While BraudandDenis[7]usesmassiveexplicitdiscoursedata,itis lim-ited bythe fact that the maximumdimension ofword represen-tationsisrestrictedby thenumberofconnectives,forexample96 intheir work. Bycomparison, we learnCBWE by predicting con-nectives conditioning on arguments, which has no such dimen-sionlimitationandyieldsbetterperformance.Overall,ourmethod canconvenientlyandeffectivelyleveragemassiveexplicitdiscourse data,andthusismorepowerfulthanrecentbaselines.
Dai andHuang[9]performs slightlybetteron Macro-F1.In
ad-ditionto using explicit discourse data, it also boosts the perfor-mancebyusing bothargument-level andparagraph-levelcontext toencodeargument,castingtherelationrecognitiontaskasa se-quencelabeling task,andaugmenting inputwordrepresentations with Part-of-Speech tags and named entity tags. In comparison, ourIDRRmodelis relativelyweak. Moreimportantly, asthenext sectionshows, ourlearnedCBWEcan beeasily used toboost the performanceofDaiandHuang[9].
4.4.TransferofCBWE
Fromtheperspectiveoftransferlearning[23],ourmethodonly transferswordembeddings betweentwo relatedtasks. Letus re-callthatthetaskofconnectiveclassification(forlearningCBWE)is similarto implicitdiscourse relationrecognition,justwith differ-entoutput labels. Iftwo similar models (just withdifferent MLP layers)areseparately usedforthetwo tasks,wecan transfernot onlywordembeddings butalsoparameters of featurelayers. We conductsomeexperimentstoexplorethisproblem.Specifically,we constructtwo models by using different MLP layers in the IDRR model(Section3).OnemodelforlearningCBWE,theotherfor re-lationrecognition. The learned CBWEis referred as CBWEIDRR. In theupperpartofTable6,IDRR+CBWEIDRR meanstransferring only theCBWEIDRRforrelationrecognition,IDRR+CBWEIDRR+Featurelayer transfermeans transferring both the CBWEIDRR andparameters in feature layers (the interaction and aggregate layers). From these results, we can find that: (1) IDRR+CBWEIDRR gets a significantly lowerperformancethanourIDRR+CBWEint (Line2vs.Line1),and (2)IDRR+CBWEIDRR+Featurelayertransferachievescomparable
per-formancewithours (Line3vs.Line1).Theseresultsindicatethat thelearnedCBWEIDRR isnotasgoodasourCBWEintandsome use-fulinformationisencodedintotheparametersoffeaturelayers.In addition,itishardtotransferparametersoffeaturelayerstoother neuralnetworkmodels.Therefore,thesimpleinteractionmodelis moresuitableforlearningCBWEthanrelativelycomplicated mod-els,interms ofefficiencyandeffectiveness.In thiscase, asmuch aspossibleinformation isencodedintoCBWE,which canalso be easily transferred to other implicit discourse relation recognition models.
In the bottom part of Table 6, using our learned CBWEint in-steadofthepre-trainedword2vecembeddings,bothDaiandHuang [9]+CBWEint andBai and Zhao [3]+CBWEint achieve better perfor-mance(Line6vs.Line4andLine7vs.Line5).Inaddition,Baiand Zhao[3]+CBWEint obtains the state-of-the-artperformance,to the bestofourknowledge.Notethat[9]isastrongbaselineby mod-eling both the argument-level and paragraph-level context,8 and BaiandZhao[3]achievestheSOTAperformanceviaadeeper neu-ral model augmented by different grained text representations.9 We use the source codes provided by the authors. These results indicate that our CBWE can be easily combined with other ad-vance techniques to boost the performance further, for exam-ple,thepowerfulcontextualizedwordembeddingELMo[26]used in [3]. Overall, we recommend to use CBWE instead of general word2vecorGloVewordembeddingsforimplicitdiscourserelation recognition.
4.5. Effectofnoise
Themainadvantageofourmethodisthatitcanleverage mas-sive explicit discourse data, while previous methods are usually troubledbynoise.Inthissection,weconductexperimentstoshow to what extent the noise in explicit data affects these methods. Specifically,wecompareourmethodwiththefollowingtwo meth-ods:
• IDRR+word2vec+Direct:directlyextendingthetrainingdatawith explicitdiscoursedata.Wefirstmapconnectivesto correspond-ingdiscourserelations.Inordertoalleviatethenoiseproblem, we discardexplicitinstances withambiguous connectives(eg. while),andrandomlysampleasubsetofexplicitdata,withthe samedistributionofimplicitdata.
• IDRR+word2vec+MT: leveraging explicit discourse data in a multi-task framework. Following Liu et al. [19], a connective classification task is defined on explicit discourse data, and usedastheauxiliarytasktoboosttherelationrecognitiontask. Thetwotaskssharethesameinputandfeaturelayers,anduse separate MLPlayers forclassification.A relatively small learn-ingrateisusedforconnective classificationwhentrainingthe twotasksimultaneously,toconflictwithnoise.
As illustrated inFig. 3,we conduct experimentswith10, 100, 500, 1000 and 4880 thousands explicit instances, respectively. Note that 10 thousandsinstances are aboutthe sameamount of labeledimplicitdata,100thousandsinstancesare usuallyusedin previousmulti-taskmethods,andtheotherscanbeconsideredas massivedata.Wecanfindthat: (1)Directly usingexplicitdataas additionaltrainingdataisharmful, withsignificantdrops inboth Accuracy andMacro-F1. The moreexplicit instances are used,the
moretheperformanceisaffectedbynoise.Theobservationis con-sistentwiththefindingin[33].(2)Themulti-taskmethodachieves improvements when 10 or 100 thousands explicit instances are used,butdegradestheperformancewhenmoreexplicitinstances
8https://github.com/ZeyuDai/paragraph-level_implicit_discourse_relation_ classification.
Fig.3. Theeffectofnoise. Direct meansusingexplicitdiscoursedatadirectly. MT meansincorporatingexplicitdiscoursedatainmulti-tasklearning. CBWE intmeansusing ourlearnedwordembeddingsinsteadofword2vec .
Table7
Top15closestwordsofnot and good inbothword2vec and CBWE .
not good
word2vec CBWE avg CBWE int word2vec CBWE avg CBWE int
do no n’t great great happy
did n’t no bad lot interesting
anymore never neither terrific very positive necessarily nothing nothing decent better pleased anything neither never nice success great anyway none none excellent well helpful does difficult nowhere fantastic happy definitely never nor unaware better certainly glad want refused unable solid respect deserve neither impossible nobody lousy fine deserves if limited unknown wonderful import better know declined refused terrible positive fine anybody nobody seldom Good help lot yet little hardly tough useful reasonable either denied impossible best welcome ok
areused.Insomeextent,themulti-taskmethodalsousesexplicit instances directly, because it updates parameters of the IDRR model according to the loss on explicit data. (3) Our proposed method gets better performance when massive explicit data are used,andisalmostnotaffectedbynoise.Thereasonbehindisthat our method uses theseadditional data indirectly, learningCBWE first andthenusingit astheinput forimplicitdiscourserelation recognition.These resultsshow that, whenlarge amounts of dis-coursedataareused,ourmethodscaneffectivelycontrolthenoise problems.
4.6. QualityofCBWE
To give an intuition of what information is encoded into the learnedCBWE, welist inTable 7the top 15closest words ofnot andgood,accordingto thecosine similarity.We can findthat,in CBWE, wordssimilar to not to some extent havenegative mean-ings. And since refused, declined are similar to not, a classifier mayeasily identify implicitinstance[A networkspokesman would notcomment. ABCSports officialsdeclinedto be interviewed.]asthe Expansion.Conjunctionrelation.ForgoodinCBWE,thesimilarwords nolongerincludewordslikebadandterrific.Furthermore,the sim-ilar score between good and great in CBWEint is 0.48 while the
scorebetweengoodandbadisjust0.30,whichmaymakea classi-fiereasiertodistinguishwordpairs(good,great)from(good,bad), andthus ishelpful for predicting the Expansion.Conjunction rela-tion.ThisqualitativeanalysisdemonstratestheabilityofourCBWE tocapturediscourserelationshipsbetweenwords.
4.7.Casestudy
Twoexamples shownin Figs. 4 and5 give us some evidence thatthelearnedCBWEissuperiorthantheword2vecword embed-dingswhenusedforimplicitdiscourserelationrecognition.These figuresshow theattentionscores(eij inEq.(6))calculatedby the IDRR model. Wordpairs assigned withhigh attention scores are highlighted, the higherthe score and the darkerthe color. From thesefigures,we cantake adeeplookintowhichwordpairs are importantwhenmakingprediction.Specifically,weshowthe inter-actionmatricesoftheIDRR+word2vecmodelandtheIDRR+CBWEint model on two test instances, to demonstrate how they behave differently.
We can find that: (1) For the Expansion instance in Fig. 4, the IDRR+CBWEint model succeeds in detecting cross-argument wordpairsthatindicatethecorresponding relations,e.g., injuries-collapsed. While theIDRR+word2vec model focuseson wordpairs like injuries-lines. (2) For the Comparison instance in Fig. 5, the IDRR+word2vec model gives the wrongprediction Expansion. The reasonisthatitfocusesmoreattentiononwordpairslike options-stock.Ontheother hand,theIDRR+CBWEint modelmakesthe cor-rect prediction by giving more attention on word pairs stopped-remained, stopped-open. (3) After examining all test instances, we notice that IDRR+CBWEint usually focuses on lesswords than IDRR+word2vec,andgeneralwordslike and,were, inaregiven lit-tleattention. All thesesuggest that,the CBWEint can catch differ-entinformationfromthose inthe word2vec.Itcatches wordpair information(from explicit discourse data) that is relevant to the discourse relation recognition task. With the help of our leaned CBWEint,theIDRRmodelcanreallyfocusonrelation-relevantword pairs,andthusboosttherecognitionperformance.
4.8.Numberofconnectives
We conduct experiments to investigate the impact of con-nectives used in training CBWE on the performance of IDRR.
Fig.4. AnExpansion instanceinthetestset.
Fig.5. A Comparison instanceinthetestset.
Fig.6. Impactofconnectivesusedintraining CBWE int.
Specifically, we use explicit discourse instances with top 10, 20, 30,60mostfrequentorall connectivestolearnCBWEint, account-ing for 78.9%, 91.9%, 95.8%, 99.4% or 100% of total instances, re-spectively.Thetop10mostfrequentconnectivesare:and,but,also, while,as,when,after,if,howeverandbecause.Accordingto connec-tivesandtheirrelatedrelationsinthePDTB,inmostcases,andand alsoindicatetheExpansionrelation,ifandbecausetheContingency relation,aftertheTemporalrelation,butandhoweverthe
Compari-sonrelation.Connectives as,whenandwhileareambiguous. Over-all,theseconnectiveshavecoveredallfourtop-levelrelations de-fined in the PDTB. As illustrated in Fig. 6, withonly the top 10 connectives,thelearnedCBWEintachievesbetterperformancethan theword2vecwordembeddings(reddottedlines).Weobserve sig-nificantimprovementswhenusingtop 20connectives,almostthe bestperformancewithtop30connectives,andnofurther substan-tial improvements with more connectives.These results indicate
Table8
StatisticsofdatasetsontheCDTB.
Relation Training Validation Test
Tran 33 1 5
Caus 682 88 95
Expl 1143 147 126
Coor 2300 529 347
Table9
ResultsontheCDTB.∗meansthatweruntheircodesontheCDTB.+CBWE ∗ means usingthelearned CBWE ∗ insteadofgeneralwordembeddings.
Method Accuracy Macro -F 1
IDRR +GloVe 70.30 58.04 IDRR +word2vec 69.44 57.12 IDRR+CBWE avg 73.42 63.16 IDRR+CBWE int 73.59 64.56 [38] 74.30 62.57 [9]∗ 73.77 64.24 [3]∗ 74.82 65.95 [9]+CBWE int 74.12 64.75 [3]+CBWE int 75.70 66.27
thatwecanuseonlytopnmostfrequentconnectivestocollect ex-plicitdiscourse dataforCBWE,whichis veryconvenientformost languages.
4.9. ResultsontheCDTB
Four top-level relations are defined in the CDTB (an Chinese discourse corpus), includingTransition(Tran), Causality(Caus), Ex-planation (Expl) andCoordination (Coor). We useinstances inthe first 50 documents as the test set, second 50 documents as the validation set and remaining 400 documents as the trainingset. Table8lists thestatisticsofthesedatasets.Weconduct a3-way classificationbecauseofonly39instancesforTran.About6.5M ex-plicitdiscourseinstancescollectedfromtheChineseGigaword Cor-pus (3rd edition), with 88 connectives,are used for CBWE. Note that connectives occurring lessthan 10,000 times are discarded. Wefindthathyper-parametersselectedonthePDTB(seeTable3) also work well onthe CDTB, exceptthat the learningrateis set to0.08andbatchsizeto16fortrainingIDRR.Forthemodein[3], weusethepre-trainedChineseELMoembeddings10andignorethe sub-wordinformationintheinputlayer.
ResultsinTable9showthattheperformanceofourmethodon the CDTBhas thesimilar trend asthat on thePDTB. Specifically, IDRR+CBWEachievessignificant improvementsoverIDRR+GloVeor IDRR+word2vec (theupperpartofTable 9),andusingCBWEint for strong baselines achieves substantial improvements (the bottom partofTable9).BaiandZhao[3]+CBWEintobtainsthe state-of-the-artperformanceontheCDTB,tothebestofourknowledge.These resultsindicatethat ourproposedmethodisalsoeffectiveonthe Chinese implicit discourse relation recognition, and the learned CBWEcan beeasily combinedwithothertechniquesto boostthe performancefurther.
5. Relatedwork
Implicit discourse relation recognition attracts more attention sincethereleaseofPDTB[29],thefirstlargediscoursecorpus dis-tinguishingimplicitinstancesfromexplicitones.Mostprevious re-search focuses on designing surface features manually, including lexical andpolarity features [27],word pairs and parse informa-tion [16],entityfeatures [20], wordcluster pairs [31], andso on.
10https://github.com/HIT-SCIR/ELMoForManyLangs.
Recently,researchersresorttoneuralnetworkstolearndistributed features automatically, forexample, a shallowconvolutional net-work [42], entity-augmented recursive networks [11], a convolu-tionalneuralnetwork withdynamicpooling[19],gatedrelevance networks [8], repeated reading neural networks with multi-level attention[18],asimplewordinteractionmodel[14],andadeeper enhanced model with different grained text representations [3]. However,duetothelimitedtrainingdata,methodsbasedon sur-face features (high dimensions) or distributed features (compli-catedmodelswithmanyparameters)usuallyfacethedatasparsity problem.
Therefore,the second lineof research triesto take advantage of unlabeled data, especially explicit discourse data (weakly la-beled by connectives), to enrich the training data. For the first time, Marcu andEchihabi [21] propose to use explicit discourse instancesasadditionaltrainingdatabyremovingconnectivesand mappingthemtocorrespondingrelations.However,Sporlederand Lascarides [33] suggest that using these artificial implicit data indiscriminatelydegradestheperformance,becauseofthedomain problemand meaning shift problem. Subsequently, to effectively useexplicitdata,some researchersusemulti-task learning meth-ods [12,13,19,38]. Specifically, they leverage auxiliary tasks (e.g., connective classification on explicit data) to promote the perfor-mance of main task (implicit discourse relation recognition), by sharing common information between them. Some researchers use data selection methods [32,36,40,41]. They select explicit instances (similar to implicit ones) according to some criteria, and use them to enlarge the training corpus directly. Both the multi-task learning and date selection methods show promising results.However,theyuseexplicitdatatotrainclassifiersdirectly, which makes them hard to incorporate massive explicit data becauseofthenoise problem.Differentfromtheabove work,we learn connective-based word embeddingsfrom explicitdata, and use them as inputting features. Our method leverages massive explicit data indirectly, and thus can reduce the influence of noise.
Some aspectsof thisworkare similar to[4,7].Based on mas-siveexplicitinstances,theyfirstbuildaword-connective(orword pair-connective)co-occurrencefrequencymatrix,andthenweight theserawfrequenciesasword(wordpair)representations.Inthis way,theyrepresentwords(wordpairs)inthespaceofconnectives todirectlyencodetheirdiscoursefunction.Themajorlimitationof theirapproachisthatthedimensionofwordrepresentationsmust be lessthan orequal to the number of connectives.By compar-ison,we learnwordembeddingsbypredictingconnectives condi-tioningonarguments,whichhasnosuchdimensionlimitation. Es-sentially,theyusecount-basedmethodstolearnword representa-tions,whileweadoptaprediction-basedmethodandachieve bet-terperformance.
6. Conclusion
Inthispaper,weproposeto learnconnective-basedword em-beddingsfrommassiveexplicitdataforimplicitdiscourserelation recognition. Experiments on both the PDTB and CDTB data sets showthatusingourlearnedwordembeddingsasfeaturescan sig-nificantlyboost theperformance. We alsoshow that our method canusemassiveexplicitdatamoreeffectivelythanpreviouswork. Sincemostofneuralnetworkmodelsforimplicitdiscourserelation recognitionanddiscourse-relatedtasksusepre-trainedword em-beddingsasinputs,wehopeourlearnedwordembeddingswould benefitthem.
In the future, we would like to explore how to learn task-specific sentencerepresentations based onabundance of weakly-labeledexplicitdiscoursedata.Weare alsointerestedinverifying
theeffectiveness ofour resultingword embeddingson taskslike sentimentclassification, sincethey seem usefulevenbeyond dis-courserelatedtasks.
DeclarationofCompetingInterest
Theauthorsdeclarethattheyhavenoknowncompeting finan-cialinterestsorpersonalrelationshipsthatcouldhaveappearedto influencetheworkreportedinthispaper.
Acknowledgments
We would like to thank all the reviewers for their construc-tiveandhelpfulsuggestionsonthispaper.Thisworkissupported bytheNaturalScienceFoundationofChina(Grant No.61866012), the Natural Science Foundation of Jiangxi Province (Grant No. 20181BAB202012),andScienceandTechnologyResearchProjectof EducationDepartmentofJiangxiProvince(GrantNo.GJJ180329).
References
[1] M.Abadi,P.Barham,J.Chen,Z.Chen,A.Davis,J.Dean,M.Devin,S.Ghemawat, G.Irving,M.Isard, M.Kudlur,J. Levenberg,R. Monga,S. Moore, D.G. Mur-ray,B.Steiner,P.Tucker,V.Vasudevan,P.Warden,M.Wicke,Y.Yu,X.Zheng, TensorFlow:asystemforlarge-scalemachinelearning,in:Proceedingsofthe 12thUSENIXConference onOperatingSystemsDesignandImplementation, OSDI’16,Berkeley,CA,USA,2016,pp.265–283.
[2] D.Bahdanau,K.Cho,Y.Bengio,Neuralmachinetranslationbyjointlylearning toalignandtranslate,in:ProceedingofInternationalConferenceonLearning Representations,ICLR2015,2015,pp.1–11.
[3] H. Bai, H. Zhao, Deepenhancedrepresentation for implicit discourse rela-tionrecognition,in:ProceedingsofInternationalConferenceonComputational Linguistics,COLING,AssociationforComputationalLinguistics,SantaFe,New Mexico,USA,2018,pp.571–583.
[4] O.Biran,K.McKeown,Aggregatedwordpairfeaturesforimplicitdiscourse re-lationdisambiguation,in:ProceedingsofAssociationforComputational Lin-guistics,ACL,Sofia,Bulgaria,2013,pp.69–73.
[5] C. Braud, P. Denis, Combining natural and artificial examples to improve implicit discourse relation identification, in: Proceedings of International Conference on Computational Linguistics, COLING, Dublin, Ireland, 2014, pp.1694–1705.
[6] C.Braud,P.Denis,Comparingwordrepresentationsforimplicitdiscourse re-lationclassification,in:ProceedingsofEmpiricalMethodsinNaturalLanguage Processing,EMNLP,Lisbon,Portugal,2015,pp.2201–2211.
[7] C.Braud,P.Denis,Learningconnective-basedwordrepresentationsforimplicit discourserelationidentification,in:ProceedingsofEmpiricalMethodsin Nat-uralLanguageProcessing,EMNLP,Austin,Texas,2016,pp.203–213.
[8] J. Chen, Q.Zhang,P. Liu, X.Qiu,X. Huang, Implicit discourserelation de-tectionviaadeep architecture with gatedrelevance network,in: Proceed-ingsofAssociationforComputationalLinguistics,ACL,Berlin,Germany,2016, pp.1726–1735.
[9] Z.Dai,R.Huang,Improvingimplicitdiscourserelationclassificationby mod-elinginter-dependenciesofdiscourseunits inaparagraph, in:Proceedings ofNorthAmericanChapter oftheAssociationforComputationalLinguistics, NAACL,Melbourne,Australia,2018,pp.141–151.
[10] J.Gehring,M.Auli,D.Grangier,D.Yarats,Y.N.Dauphin,Convolutionalsequence tosequencelearning,in:ProceedingsofInternationalConferenceonMachine Learning,ICML2017,Sydney,NSW,Australia,2017,pp.1243–1252.JMLR.org. [11]Y. Ji,J. Eisenstein,Onevector is notenough: entity-augmenteddistributed
semantics for discourse relations,Trans. Assoc. Comput. Linguist. 3(2015) 329–344.
[12] M.Lan,J.Wang,Y.Wu,Z.-Y.Niu,H.Wang,Multi-taskattention-basedneural networksforimplicitdiscourserelationshiprepresentationandidentification, in:ProceedingsofEmpiricalMethodsinNaturalLanguageProcessing,EMNLP, Copenhagen,Denmark,2017,pp.1310–1319.
[13] M. Lan, Y. Xu, Z. Niu, Leveraging synthetic discourse data via multi-task learning for implicit discourserelation recognition,in: Proceedings of As-sociationfor ComputationalLinguistics, ACL,Sofia,Bulgaria,2013, pp.476– 485.
[14] W.Lei,X.Wang,M.Liu,I.Ilievski,X.He,M.Y.Kan,SWIM:asimpleword in-teractionmodelforimplicitdiscourserelationrecognition,in:Proceedingsof InternationalJointConferencesonArtificalIntelligence,IJCAI,Melbourne, Aus-tralia,2017,pp.4026–4032.
[15] Y. Li, W.Feng,J. Sun,F.Kong,G. Zhou, BuildingChinesediscoursecorpus with connective-driven dependencytree structure, in: Proceedings of Em-piricalMethodsinNaturalLanguage Processing,EMNLP,Doha, Qatar,2014, pp.2105–2114.
[16]Z.Lin,M.-Y.Kan,H.T.Ng,Recognizingimplicitdiscourserelationsinthepenn discoursetreebank,in:ProceedingsofEmpiricalMethodsinNaturalLanguage Processing,EMNLP,PA,USA,2009,pp.343–351.
[17]Z.Lin,H.T.Ng,M.-Y.Kan,APDTB-styledend-to-enddiscourseparser,Nat.Lang. Eng.20(02)(2014)151–184.
[18]Y.Liu,S.Li,Recognizingimplicitdiscourserelationsviarepeatedreading: neu-ralnetworkswithmulti-levelattention,in:ProceedingsofEmpiricalMethods inNaturalLanguageProcessing,EMNLP,Austin,Texas,2016,pp.1224–1233. [19]Y.Liu,S.Li,X.Zhang,Z.Sui,Implicitdiscourserelationclassificationvia
mul-ti-taskneuralnetworks,in:ProceedingsofConferenceonArtificialIntelligenc, AAAI,Arizona,USA,2016,pp.2750–2756.
[20]A.Louis,A.Joshi,R.Prasad,A.Nenkova,Usingentityfeaturestoclassify im-plicit discourserelations,in: ProceedingsofSpecial InterestGroupon Dis-courseandDialogue,SIGDIAL,PA,USA,2010,pp.59–62.
[21]D.Marcu,A.Echihabi,Anunsupervisedapproachtorecognizingdiscourse re-lations,in:ProceedingsofAssociationforComputationalLinguistics,ACL,PA, USA,2002,pp.368–375.
[22]T.Mikolov,K.Chen,G.Corrado,J.Dean,Efficientestimationofword represen-tationsinvectorspace,in:ProceedingsofWorkshopatICLR,2013,pp.1–12. [23]S.J.Pan,Q.Yang,Asurveyontransferlearning,IEEETrans.Knowl.DataEng.
22(10)(2010)1345–1359.
[24]A.Parikh,O.Täckström,D.Das,J.Uszkoreit,Adecomposableattentionmodel fornaturallanguageinference,in:ProceedingsofEmpiricalMethodsinNatural LanguageProcessing,EMNLP2016,AssociationforComputationalLinguistics, Austin,Texas,2016,pp.2249–2255.
[25]J.Pennington,R.Socher,C.Manning,Glove:globalvectorsforword represen-tation,in:ProceedingsofEmpiricalMethodsinNaturalLanguageProcessing, EMNLP,Doha,Qatar,2014,pp.1532–1543.
[26]M. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, L. Zettle-moyer, Deepcontextualizedwordrepresentations, in:ProceedingsofNorth American Chapter ofthe Association for Computational Linguistics, NAACL 2018,AssociationforComputationalLinguistics,NewOrleans,Louisiana,2018, pp.2227–2237.
[27]E.Pitler,A.Louis,A.Nenkova, Automaticsenseprediction for implicit dis-course relations in text, in: Proceedings of ACL-IJCNLP, PA, USA, 2009, pp.683–691.
[28]E.Pitler,A.Nenkova,Usingsyntaxtodisambiguateexplicitdiscourse connec-tivesintext,in:ProceedingsofACL-IJCNLP,PA,USA,2009,pp.13–16. [29]R.Prasad,N.Dinesh,A.Lee,E.Miltsakaki,L.Robaldo,A.Joshi,B.Webber,The
penndiscourseTreeBank2.0,in:ProceedingsofInternationalConferenceon LanguageResourcesandEvaluation,LREC,24,2008,pp.2961–2968. [30]L.Qin,Z.Zhang,H.Zhao,Astackinggatedneuralarchitectureforimplicit
dis-courserelationclassification,in:ProceedingsofEmpiricalMethodsinNatural LanguageProcessing,EMNLP,Austin,Texas,2016,pp.2263–2270.
[31] A.Rutherford,N.Xue,Discoveringimplicitdiscourserelationsthroughbrown clusterpairrepresentationandcoreferencepatterns,in:Proceedingsof Euro-peanChapteroftheAssociationforComputationalLinguistics,EACL, Gothen-burg,Sweden,2014,pp.645–654.
[32]A.Rutherford,N.Xue,Improvingtheinferenceofimplicitdiscourserelations viaclassifyingexplicitdiscourseconnectives,in:ProceedingsofNorth Ameri-canChapteroftheAssociationforComputationalLinguistics,NAACL,Denver, Colorado,2015,pp.799–808.
[33]C.Sporleder,A.Lascarides,Usingautomaticallylabelledexamplestoclassify rhetoricalrelations:anassessment,Nat.Lang.Eng.14(3)(2008)369–416. [34]N.Srivastava,G.Hinton,A.Krizhevsky,I.Sutskever,R.Salakhutdinov, Dropout:
asimplewaytopreventneuralnetworksfromoverfitting,J.Mach.Learn.Res. 15(1)(2014)1929–1958.
[35]A.Vaswani,N.Shazeer,N.Parmar,J.Uszkoreit,L.Jones,A.N.Gomez,L.Kaiser, I.Polosukhin,Attentionisallyouneed,in:ProceedingofAdvancesinNeural InformationProcessingSystems,NIPS,2017,pp.6000–6010.
[36]X.Wang,S.Li,J.Li,W.Li,Implicitdiscourserelationrecognitionby select-ingtypicaltrainingexamples.,in:ProceedingsofInternationalConferenceon ComputationalLinguistics,COLING,Mumbai,India,2012,pp.2757–2772. [37]M. William,S. Thompson, Rhetoricalstructuretheory: towardsafunctional
theoryoftextorganization,Text8(3)(1988)243–281.
[38]C. Wu, X.Shi, Y. Chen, Y. Huang, J. Su, Leveragingbilingually-constrained syntheticdataviamulti-taskneuralnetworksforimplicitdiscourserelation recognition,Neurocomputing243(2017)69–79.
[39]C. Wu, X.Shi, Y. Chen, J. Su, B. Wang, Improvingimplicit discourse rela-tionrecognitionwithdiscourse-specificwordembeddings,in:Proceedingsof Association for Computational Linguistics, ACL,2, Vancouver,Canada, 2017, pp.269–274.
[40]C.Wu,X.Shi,J.Su,Y.Chen,Y.Huang,Co-trainingforimplicitdiscourse re-lationrecognitionbasedonmanualanddistributedfeatures,NeuralProcess. Lett.46(1)(2017)233–250.
[41]Y.Xu,Y.Hong,H.Ruan,J.Yao,M.Zhang,G.Zhou,Usingactive learningto ex-pandtrainingdataforimplicitdiscourserelationrecognition,in:Proceedings ofEmpiricalMethodsinNaturalLanguageProcessing,EMNLP,Associationfor ComputationalLinguistics,Brussels,Belgium,2018,pp.725–731.
[42]B.Zhang,J.Su,D.Xiong,Y.Lu,H.Duan,J.Yao,Shallowconvolutionalneural networkforimplicitdiscourserelationrecognition,in:Proceedingsof Empir-icalMethodsinNaturalLanguageProcessing,EMNLP,Lisbon,Portugal,2015, pp.2230–2235.
ChangxingWureceivedhisPh.D.degreeinSchoolof In-formationScienceandTechnology,XiamenUniversity, Xi-amen,China,in2017.Heisnowworking inEastChina JiaotongUniversity.Hisresearchinterestsincludenatural languageprocessinganddeeplearning.
JinsongSureceivedhisPh.D.degreeincomputerscience fromChineseAcademyofSciences,Beijing,China,in2011. HeiscurrentlyanassociateprofessoratXiamen Univer-sity.Hisresearchinterestsincludenaturallanguage pro-cessinganddeeplearning.
YidongChen receivedhisPh.D.degreeinmathematics from XiamenUniversity,Xiamen,China, in2008.Heis nowanassociateprofessorintheCognitiveScience De-partmentofXiamenUniversity.Hisresearchinterests in-cludemachinetranslationandsemanticanalysis.
Xiaodong Shi received his Ph.D. degree in computer software from National University ofDefense Technol-ogy,Changsha,China,in1994.Heisnowaprofessorin theCognitive ScienceDepartment ofXiamenUniversity. Hisresearchinterestsincludenaturallanguageprocessing andartificialintelligence.