Paul Clough and Robert Gaizauskas and Scott S.L. Piao and Yorick Wilks
Department of Computer Science
University of SheÆeld
Regent Court,211 PortobelloStreet,
SheÆeld, England, S14DP
[email protected]. ac.u kg
Abstract
In this paper we present results from
theMETER (MEasuringTExtReuse)
project whose aim is to explore issues
pertainingtotextreuseandderivation,
especiallyinthecontextofnewspapers
usingnewswire sources. Although the
reuse of text by journalists has been
studiedinlinguistics,wearenotaware
ofanyinvestigationusingexisting
com-putational methodsfor thisparticular
task. We investigate the classication
ofnewspaperarticlesaccordingtotheir
degree of dependence upon,or
deriva-tion from, a newswire source using a
simple3-levelschemedesignedby
jour-nalists. Three approaches to
measur-ing text similarity are considered:
n-gram overlap, Greedy String Tiling,
and sentence alignment. Measured
againstamanuallyannotatedcorpusof
sourceand derived newstext, weshow
that a combined classier with
fea-tures automatically selected performs
best overall for the ternary
classica-tion achieving an average F
1
-measure
score of 0.664 across all three
cate-gories.
1 Introduction
Atopicofconsiderabletheoreticalandpractical
interestisthatoftextreuse: thereuseofexisting
writtensourcesinthecreationofanewtext. Of
course,reusinglanguageisasoldastheretelling
of stories,butcurrenttechnologiesforcreating,
copyinganddisseminatingelectronictext,make
it easier thanever before to take some or all of
any number of existing text sources and reuse
them verbatimorwith varying degreesof
mod-ication.
One form of unacceptable text reuse,
plagia-rism, has received considerable attention and
software for automatic plagiarism detection is
nowavailable(see, e.g. (Clough, 2000)fora
re-cent review). But in this paper we present a
benign and acceptable form of text reuse that
is encountered virtually every day: the reuseof
news agency text (called copy) in the
produc-tion of daily newspapers. The question is not
just whether agency copy has been reused, but
towhat extentandsubjecttowhat
transforma-tions. Usingexistingapproachesfrom
computa-tional text analysis, we investigate theirability
toclassifynewspapersarticlesintocategories
in-dicatingtheir dependencyonagency copy.
2 Journalistic reuse of a newswire
The process of gathering, editing and
publish-ing newspaper stories is a complex and
spe-cialisedtaskoftenoperatingwithinspecic
pub-lishing constraints such as: 1) short deadlines;
2) prescriptive writingpractice (see, e.g. Evans
(1972)); 3) limitsofphysicalsize; 4) readability
and audience comprehension, e.g. a tabloid's
vocabularylimitations; 5) journalistic bias, e.g.
political and 6) a newspaper's house style.
Of-ten newsworkers, such as the reporter and
edi-tor,willrelyuponnewsagencycopyasthebasis
of anewsstory orto verify factsand assessthe
appearing on the newswire. Because of the
na-ture of journalistic text reuse, dierences will
arisebetween reusednewsagency copyand the
original text. For example consider the
follow-ing:
Original (news agency) A drink-driver who
ran into the QueenMother's oÆcial
Daim-lerwasned $700andbanned fromdriving
for two years.
Rewrite (tabloid) A DRUNK driver who
ploughed intothe QueenMother's limowas
ned $700 and banned for two years
yes-terday.
This simple example illustrates the types of
rewrite that can occur even in a very short
sentence. The rewrite makes use of slang and
exaggeration to capture its readers' attention
(e.g. DRUNK, limo, ploughed). Deletion (e.g.
from driving) hasalso beenused and the
addi-tion of yesterday indicates when the event
oc-curred. Many of the transformations we
ob-served between moving from news agency copy
to the newspaper version have also been
re-ported by the summarisation community (see,
e.g., McKeown and Jing(1999)).
Giventhevalueoftheinformationnews
agen-cies supply, the ease with which text can be
reused and commercial pressures, it would be
benecial to beable to identifythose news
sto-riesappearinginthenewspapersthathaverelied
uponagencycopyintheirproduction. Potential
uses include: 1) monitoring take-up of agency
copy; 2) identifyingthemostreusedstories ;3)
determiningcustomerdependencyupon agency
copyand4)newmethodsforchargingcustomers
based upon the amount of copy reused. Given
the large volume of news agency copy output
each day, it would be infeasible to identify and
quantifyreusemanually;thereforeanautomatic
method isrequired.
3 A conceptual framework
To begin to get a handle on measuring text
reuse, we have developed adocument-level
clas-agency copy, and a lexical-level classication
scheme, indicating the level at which
individ-ual word sequences within a newspaper story
are derived from agency copy. This framework
rests upon the intuitions of trained journalists
to judge text reuse, and not on an explicit
lex-ical/syntactic denition of reuse (which would
presupposewhatwearesettingouttodiscover).
At the document level, newspaper stories
are assigned to one of three possible categories
coarsely reecting the amount of text reused
from the news agency and the dependency of
the newspaper story upon news agency copy
for the provision of \facts". The categories
in-dicate whether a trained journalist can
iden-tify text rewritten from the news agency in
a candidate derived newspaper article. They
are: 1) wholly-derived (WD): all text in
the newspaper article is rewritten only from
newsagencycopy;2)partially-derived(PD):
some text isderived from thenews agency, but
other sources have also beenused;and 3)
non-derived (ND):newsagency hasnotbeenused
asthesourceofthearticle;althoughwordsmay
stillco-occurbetweenthenewspaperarticleand
news agency copy on the same topic, the
jour-nalistiscondentthenewsagency hasnotbeen
used.
Atthelexicalorwordsequence level,
individ-ualwordsandphraseswithinanewspaperstory
areclassiedasto whether theyare usedto
ex-press the same information as words in news
agency copy (i.e. paraphrases) and or used to
express information not found in agency copy.
Once again,three categories areused,based on
thejudgementofatrainedjournalist: 1)
verba-tim: text appearing word-for-word to express
the same information; 2) rewrite: text
para-phrasedtocreateadierentsurfaceappearance,
but express the same informationand 3) new:
text usedto express informationnotappearing
inagency copy(can includeverbatim/rewritten
text, butbeingused ina dierentcontext).
3.1 The METER corpus
the news agency source and nine British daily
newspapers 1
who subscribe to thePA as
candi-date reusers. The METER corpus (Gaizauskas
et al., 2001) is a collection of 1716 texts (over
500,000 words) carefully selected from a 12
month period from the areas of law and court
reporting (769 stories) and showbusiness (175
stories). 772ofthese textsarePA copyand 944
fromtheninenewspapers. Thesetextscover265
dierentstoriesfromJuly1999toJune2000and
allnewspaperstorieshavebeenmanually
classi-ed at the document-level. They include 300
wholly-derived, 438 partially-derived and 206
non-derived(i.e. 77% arethought to have used
PA in some way). In addition, 355 have been
classiedaccording to thelexical-levelscheme.
4 Approaches to measuring text
similarity
Many problems in computational text
analy-sis involve the measurement of similarity. For
example, the retrieval of documents to full a
userinformationneed,clusteringdocuments
ac-cordingtosome criterion,multi-document
sum-marisation, aligning sentences from one
lan-guagewiththoseinanother,detectingexactand
nearduplicates ofdocuments,plagiarism
detec-tion,routingdocumentsaccordingto theirstyle
and identifying authorship attribution.
Meth-ods typically vary depending upon the
match-ing method, e.g. exact or partial, the degree
towhichnaturallanguageprocessingtechniques
are used and the type of problem, e.g.
search-ing, clustering, aligning etc. We have not had
time to investigate all of these techniques, nor
is there space here to review them. We have
concentratedon justthree: ngramoverlap
mea-sures, GreedyStringTiling,andsentence
align-ment. The rst was investigated because it
of-fersperhapsthesimplestapproach to the
prob-lem. Thesecondwasinvestigatedbecauseithas
beensuccessfullyusedinplagiarismdetection,a
problemwhichatleastsuperciallyisquiteclose
1
Thenewspapersincludevepopularpapers(e.g. The
Sun,TheDailyMail,DailyStar,DailyMirror)andfour
qualitypapers(e.g. DailyTelegraph,TheGuardian,The
nally,alignment (treating thederived text as a
\translation" of the rst) seemed an intriguing
idea,andcontrasts,certainlywiththengram
ap-proach,byfocusingmoreonlocal,asopposedto
globalmeasuresof similarity.
4.1 Ngram Overlap
Aninitial,straightforwardapproachtoassessing
the reuse between two texts is to measure the
number of shared word ngrams. This method
underlies many of the approaches used in copy
detectionincludingtheapproachtakenbyLyon
etal. (2001).
They measure similarity using the
set-theoretic measures of containment and
resem-blanceofsharedtrigramstoseparatetexts
writ-tenindependentlyandthosewithsuÆcient
sim-ilarityto indicatesome form ofcopying.
We treat each document as a set of
overlap-pingn-wordsequences(initiallyconsideringonly
n-word types) and compute a similarity score
from this. Given two sets of ngrams, we use
the set-theoretic containment score to measure
similaritybetweenthedocumentsforngramsof
length 1 to 10 words. For a source text A and
a possiblyderived text B representedbysets of
ngrams S
n
(A) and S
n
(B) respectively,the
pro-portionofngramsinBalsoinA,thengram
con-tainmentC
n
(A;B), isgiven by:
C
n
(A;B)= jS
n
(A)\S
n (B)j
jS
n (B)j
(1)
Informallycontainmentmeasuresthenumber
of matches betweenthe elements of ngram sets
S
n
(A) and S
n
(B), scaled by the sizeof S
n (B).
In other words we measure the proportion of
unique n-grams in B that are found in A. The
score ranges from 0 to 1, indicating noneto all
newspapercopy sharedwith PArespectively.
We alsocomparetextsbycountingonlythose
ngrams with low frequency, in particular those
occurringonce. For1-grams,thisisthesameas
comparing the hapax legomena which has been
shown to discriminate plagiarised texts from
those written independently even when lexical
nd that repetition in PA copy drastically
re-duces the number of shared hapax legomena
thereby inhibiting classication of derived and
non-derived texts. Therefore we compute the
containmentofhapaxlegomena(hapax
contain-ment)bycomparingwordsoccurringonceinthe
newspaper,i.e. those1-grams inS
1
(B) that
oc-cur once with all 1-grams in PA copy, S
1 (A).
This containment score represents the number
ofnewspaperhapax legomenaalsoappearingat
leastonce inPA copy.
4.2 Greedy String-Tiling
Greedy String-Tiling (GST) is a substring
matchingalgorithm whichcomputes the degree
of similarity between two strings, for
exam-ple software code, free text or biological
subse-quences (Wise,1996). Comparedwithprevious
algorithmsforcomputingstring similarity,such
astheLongestCommonSubsequence or
Levenshteindistance,GSTisabletodealwith
transposition of tokens (in earlier approaches
transpositionisseenasanumberofsingle
inser-tions/deletionsratherthanasingleblockmove).
The GST algorithm performs a1:1 matching
oftokensbetweentwostringssothatasmuchof
onetokenstreamiscoveredwithmaximallength
substrings from the other (called tiles). In our
problem,weconsiderhowmuchnewspapertext
can be maximally covered by words from PA
copy. A minimummatch length(MML)can be
used to avoid spurious matches (e.g. of 1 or
2 tokens) and the resulting similarity between
the strings can be expressed as a quantitative
similaritymatch oraqualitativelistofcommon
substrings. Figure1showstheresultofGSTfor
theexampleinSection 2.
Figure1: ExampleGST results(MML=3)
2
Asstories unfold,PAreleasecopywithnew,aswell
Given PA copy A, a newspaper text B and a
setofmaximalmatches,tiles,ofagivenlength
betweenA and B, thesimilarity,gstsim(A,B),
isexpressed as:
gstsim(A;B)= P
i2til es length
i
jB j
(2)
4.3 Sentence alignment
Inthepastdecade,variousalignmentalgorithms
have been suggested for aligning multilingual
parallel corpora (Wu, 2000). These algorithms
have been used to map translation equivalents
across dierent languages. In thisspecic case,
we investigate whether alignment can map
de-rived texts (or parts of them) to their source
texts. PA copy may be subject to various
changes during text reuse, e.g. a single
sen-tence may derive from parts of several source
sentences. Therefore,strongcorrelationsof
sen-tence length between the derived and source
sentences cannot be guaranteed. As a result,
sentence-length based statistical alignment
al-gorithms(Brownetal.,1991;Galeand Church,
1993) arenot appropriatefor thiscase. Onthe
other hand, cognate-based algorithms (Simard
et al., 1992; Melamed, 1999) are more eÆcient
for coping with change of text format.
There-fore, a cognate-based approach is adopted for
theMETER task. Here cognates aredenedas
pairsof termsthatareidentical, sharethesame
stems,oraresubstitutableinthegiven context.
The algorithm consists of two principal
com-ponents: a comparison strategy and a scoring
function. In brief,the comparison worksas
fol-lows(moredetailsmaybefoundinPiao (2001)).
Foreach sentence inthecandidate derivedtext
DT the sentences in the candidate source text
STarecomparedinordertondthebestmatch.
A DT sentence is allowed to match up to three
possibly non-consecutive ST sentences. The
candidate pair with the highest score (see
be-low) above a threshold is accepted as a true
alignment. If no such candidate is found, the
DT sentence is assumed to be independent of
theST. Based onindividualDTsentence
tionof alignedsentencesinthenewspapertext.
Note that not only may multiple sentences in
the STbe aligned witha single sentence in the
DT,butalso multiplesentencesinthe DTmay
bealigned withone sentence intheST.
Given a candidate derived sentence DS and
a proposed (set of) source sentence(s) SS, the
scoring function works as follows. Three basic
measures are computed for each pair of
candi-date DS and SS: SNG is the sum of lengths
of themaximum lengthnon-overlapping shared
n-grams with n 2; SWD is the number of
matchedwords sharingstems not inan n-gram
guring in SNG; and SUB is the number of
substitutable terms (mainly synonyms) not
g-uringinSNGorSWD. LetL
1
bethelengthof
thecandidateDSandL
2
thelengthofcandidate
SS.Then,threescoresPD,PS(Dicescore)and
PVS are calculatedasfollows:
PSD =
SWD+SNG+SUB
L
1
PS =
2(SWD+SNG+SUB)
L
1 +L
2
PSNG =
SNG
SWD+SNG+SUB
These three scores reectdierent aspects of
relationsbetween thecandidateDSand SS:
1. PSD: The proportion of the DS which is
sharedmaterial.
2. PS: The proportion of shared terms in DS
andSS.ThismeasureprefersSS'swhichnot
onlycontainmanytermsintheDS,butalso
donot containmanyadditionalterms.
3. PSNG: The proportion of matching
n-grams amongst the shared terms. This
measure captures the intuition that
sen-tencessharingnotonlywords,butword
se-quencesare morelikelyto be related.
These three scores are weighted and combined
together to provide an alignment metric WS
(weighted score), whichis calculatedasfollows:
WS=Æ PSD+Æ PS+Æ PSNG
1 2 3
ablesÆ
i
(i=1;2;3)havebeendetermined
empir-ically and are currently set to: Æ
1
= 0:85;Æ
2 =
0:05;Æ
3 =0:1.
5 Reuse Classiers
Toevaluatethepreviousapproachesfor
measur-ingtext reuseatthedocument-level,wecastthe
probleminto one ofa supervisedlearningtask.
5.1 Experimental Setup
Weusedsimilarityscoresasattributesfora
ma-chine learningalgorithm andusedtheWeka 3.2
software(Witten and Frank, 2000). Because of
the smallnumberof examples, we used tenfold
cross-validationrepeated10times(i.e. 10runs)
and combined thiswith stratication to ensure
approximately the same proportion of samples
from each class were used in each fold of the
cross-validation. All 769 newspapertexts from
thecourtsdomainwereusedforevaluationand
randomly permuted to generate 10 sets. For
each newspaper text, we compared PA source
texts from the same story to create results in
theform: newspaper;class;score. Theseresults
wereorderedaccordingto eachsettocreate the
same 10 datasets foreachapproach thereby
en-ablingcomparison.
Using this data we rst trained ve
single-featureNaiveBayesclassiersto dotheternary
classicationtask. The featureineach casewas
avariantofone ofthethree similaritymeasures
described in Section 4, computed between the
twotextsinthetrainingset. Thetarget
classi-cationvaluewasthereuseclassicationcategory
from the corpus. A Naive Bayes classier was
used because of its success in previous
classi-cation tasks, however we are aware of its naive
assumptions that attributes are assumed
inde-pendentand data to be normallydistributed.
We evaluated results using the F
1
-measure
(harmonic mean of precision and recall given
equal weighting). For each run, we calculated
the average F
1
score across the classes. The
overall average F
1
-measure scores were
com-puted from the 10 runs for each class (a single
accuracy measure would suÆce but the Weka
the standard deviation of F
1
scores was
com-puted for each class and F
1
scores between all
approaches were tested for statistical
signi-canceusing1-wayanalysis ofvarianceat a99%
condence-level. Statistical dierences between
results were identied using Bonferroni
analy-sis 3
.
Afterexaminingtheresultsofthesesingle
fea-ture classiers, we also trained a \combined"
classier using a correlation-based lter
ap-proach(HallandSmith,1999)toselectthe
com-binationoffeaturesgivingthehighest
classica-tion score ( correlation-basedlteringevaluates
all possible combinations of features). Feature
selection wascarried foreach fold during
cross-validationandfeatures usedinall 10folds were
chosen as candidates. Those which occurred in
at least5 of the10 runs formedthe nal
selec-tion.
We also tried splittingthe training datainto
variousbinarypartitions(e.g. WD/PDvs. ND)
and trainingbinary classiers, usingfeature
se-lection, to see how well binary classication
couldbeperformed. Eskinand Bogosian (1998)
have observed that using cascaded binary
clas-siers, each of which splits the data well, may
work better on n-ary classication problems
than a single n-way classier. We then
com-putedhowwellsuchacascadedclassiershould
performusingthebestbinaryclassier results.
5.2 Results
Table 1 shows the results of the single ternary
classiers. The baseline F
1
measure is based
uponthepriorprobabilityofadocumentfalling
into one of theclasses. The guresin
parenthe-sisarethestandarddeviationsfortheF
1 scores
across the ten evaluation runs. The nal row
showstheresultsforcombiningfeaturesselected
usingthecorrelation-based lter.
Table 2 shows the result of training binary
classiers using feature selection to select the
most discriminatingfeatures forvarious binary
splitsof thetrainingdata.
Forbothternaryandbinaryclassiersfeature
selection produced better results than usingall
3
Baseline WD 0.340(0.000)
PD 0.444(0.000)
ND 0.216(0.000)
total 0.333(0.000)
3-gram WD 0.631(0.004)
containment PD 0.624(0.004)
ND 0.549(0.005)
total 0.601(0.003)
GSTSim WD 0.669(0.004)
MML=3 PD 0.633(0.003)
ND 0.556(0.004)
total 0.620(0.002)
GSTSim WD 0.681(0.003)
MML=1 PD 0.634(0.003)
ND 0.559(0.008)
total 0.625(0.004)
1-gram WD 0.718(0.003)
containment PD 0.643(0.003)
ND 0.551(0.006)
total 0.638(0.003)
Alignment WD 0.774(0.003)
PD 0.624(0.005)
ND 0.537(0.007)
total 0.645(0.004)
hapax WD 0.736(0.003)
containment PD 0.654(0.003)
ND 0.549(0.010)
total 0.646(0.004)
hapaxcont. WD 0.756(0.002)
1-gramcont. PD 0.599(0.006)
alignment ND 0.629(0.008)
(\combined") total 0.664(0.004)
Table1: A summary ofclassication results
possiblefeatures, with theone exceptionof the
binaryclassication betweenPDand ND.
5.3 Discussion
From Table 1, we ndthat all classier results
are signicantly higher than the baseline (at
p < 0:01) and all dierences are signicant
ex-ceptbetweenhapaxcontainmentandalignment.
The highest F-measure for the 3-class problem
is 0.664 for the \combined" classier, which is
signicantly greater than 0.651 obtained
with-out. We notice that highest WD classication
is with alignment at 0.774, highest PD
classi-cation is 0.654 with hapax containment and
highestNDclassicationis0.629withcombined
features. Usinghapaxcontainmentgiveshigher
results than 1-gram containment alone and in
fact provides results as good as or better than
themore complexsentence alignmentand GST
approaches.
Previous research by (Lyon et al., 2001) and
(Wise, 1996) had shown derived texts could be
Attributes Category AvgF1
Correlation- alignment WD 0.942(0.008)
based ND 0.909(0.011)
lter total 0.926(0.010)
alignment PD/ND 0.870(0.003)
WD 0.770(0.003)
total 0.820(0.002)
alignment WD 0.778(0.003)
PD 0.812(0.002)
total 0.789(0.002)
hapaxcont. WD/PD 0.882(0.002)
alignment ND 0.649(0.007)
1-gramcont. total 0.763(0.002)
1-gram PD 0.802(0.002)
GSTmml3 ND 0.638(0.007)
GSTmml1 total 0.720(0.004)
alignment
GSTmml1 WD/ND 0.672(0.002)
alignment PD 0.662(0.003)
total 0.668(0.003)
Table2: BinaryClassierswithfeatureselection
However, our results run counter to this
be-cause the highest classication scores are
ob-tained with 1-grams and an MML of 1, i.e. as
n or MML length increases, the F
1
scores
de-crease. We believe thisresultsfrom two factors
which are characteristic of reuse in journalism.
First,sinceevenNDtextsarethematically
sim-ilar(same events beingdescribed)there ishigh
likelihood of coincidental overlap of ngrams of
length3ormore(e.g. quotedspeech). Secondly,
when journalists rewrite it is rare for them not
to varythe source.
Fortheintendedapplication{helpingthePA
tomonitortext reuse{thecostofdierent
mis-classicationsisnotequal. Iftheclassiermakes
amistake,itisbetterthatWDandNDtextsare
mis-classiedasPD,and PDasWD.Given the
dierence in distribution of documents across
classeswherePDcontainsthemostdocuments,
the classier will be biased towards this class
anyway as required. Table 3 shows the
confu-sion matrixforthecombinedternary classier.
WD PD ND
WD 203 55 4
PD 79 192 70
ND 3 53 109
Table3: Confusionmatrixforcombinedternary
classier
Although the overall F
1
-measure score is low
(0.664), mis-classication of both WD as ND
classication of PD as both WD and ND,
re-ectingthediÆcultyofseparatingthisclass.
From Table 2,we ndalignmentis aselected
feature for each binary partition of the data.
Thehighestbinary classicationisachieved
be-tween theWD and ND classesusing alignment
only, and the highest three scores show WD is
the easiest class to separate from the others.
The PD class is the hardest to isolate,
reect-ingthemis-classicationsseen inTable3.
Topredict howwella cascadedbinary
classi-erwillperformwecanreasonasfollows. From
the preceding discussion we see that WD can
be separated most accurately; hence we choose
WDversusPD/NDastherstbinaryclassier.
Thisforcesthesecond classiertobePDversus
ND.From theresultsinTable2and the
follow-ing equation to compute the F
1
measure for a
two-stage binaryclassier
WD+(PD=ND)(
PD+ND
2 )
2
weobtainanoverallF
1
measureforternary
clas-sication of 0.703, which is signicantly higher
thanthebestsingle stage ternary classier.
6 Conclusions
In thispaperwe have investigatedtext reusein
thecontext ofthereuseofnewsagencycopy,an
area of theoretical and practical interest. We
present a conceptual framework in which we
measurereuseand basedonwhichtheMETER
corpushasbeenconstructed. Wehavepresented
the resultsof using similarityscores, computed
usingn-gramcontainment,GreedyStringTiling
and an alignment algorithm, as attributes for
a supervised learning algorithm faced with the
task of learning how to classify newspaper
sto-ries as to whether they are wholly, partiallyor
non-derived from a news agency source. We
show that the best single feature ternary
clas-sieruseseitheralignmentorsimplehapax
con-tainment measures and that a cascaded binary
classier using a combination of features can
outperformthis.
formations and the high amount of verbatim
text overlapping as a result of thematic
simi-larityand \expected"similaritydueto, e.g.,
di-rect/indirect quotes. Given the relative
close-ness of results obtained by all approaches we
haveconsidered, wespeculatethatany
compar-ison method based upon lexical similarity will
probably not improve classication results by
much. Perhaps improved performance at this
taskmaypossibleby usingmore advanced
nat-ural languageprocessing techniques,e.g. better
modeling of the lexical variation and syntactic
transformationthatgoesoninjournalisticreuse.
Nevertheless the results we have obtained are
strongenoughinsomecases(e.g. whollyderived
textscan beidentiedwith>80%accuracy) to
beginto beexploited.
In summarymeasuringtext reuseisan
excit-ing new area that will have a number of
appli-cations, in particular, butnot limited to,
mon-itoring and controlling the copy produced by a
newswire.
7 Future work
WeareadaptingtheGSTalgorithmtodealwith
simplerewrites(e.g. synonymsubstitution)and
to observe the eects of rewriting upon nding
longestcommonsubstrings. We arealso
experi-mentingusingthemoredetailedMETERcorpus
lexical-levelannotationsto investigate howwell
the GST and ngrams approaches can identify
reuseat thislevel.
A prototype browser-baseddemo of both the
GST algorithm and alignment program,
allow-ing users to test arbitrary text pairs for
simi-larity,is now available 4
and willcontinue to be
enhanced.
Acknowledgements
The authors would like to acknowledge the
UK Engineering and Physical Sciences
Re-search CouncilforfundingtheMETER project
(GR/M34041). Thanks alsotoMarkHepplefor
helpfulcommentson earlierdrafts.
4
P.F.Brown, J.C. Lai, andR.L. Mercer. 1991. Aligning
sentences in parallel corpora. In Proceedings of the
29thAnnualMeetingoftheAssoc.forComputational
Linguistics,pages169{176,Berkeley,CA,USA.
PClough. 2000. Plagiarisminnaturalandprogramming
languages: Anoverviewofcurrenttoolsand
technolo-gies. Technical ReportCS-00-05, Dept.ofComputer
Science,UniversityofSheÆeld,UK.
E.EskinandM. Bogosian. 1998. Classifyingtext
docu-mentsusingmodularcategoriesandlinguistically
mo-tivatedindicators. InAAAI-98WorkshoponLearning
forTextClassication.
H.Evans. 1972. EssentialEnglish forJournalists,
Edi-torsandWriters. Pimlico,London.
S.Finlay. 1999. Copycatch. Master's thesis, Dept. of
English. UniversityofBirmingham.
R. Gaizauskas, J. Foster, Y. Wilks, J. Arundel,
P. Clough, and S. Piao. 2001. The meter corpus:
Acorpusforanalysingjournalistictextreuse. In
Pro-ceedings of the Corpus Linguistics 2001 Conference,
pages214|223.
W.A.GaleandK.W.Church.1993. Aprogramfor
align-ingsentencesinbilingualcorpus. Computational
Lin-guistics,19:75{102.
M.A.Halland L.A.Smith. 1999. Featureselection for
machine learning: Comparingacorrelation-based
l-ter approach to the wrapper. In Proceedings of the
Florida Articial Intelligence Symposium
(FLAIRS-99),pages235{239.
C.Lyon,J.Malcolm,andB.Dickerson. 2001. Detecting
shortpassagesofsimilartextinlargedocument
collec-tions. InConferenceonEmpiricalMethodsinNatural
LanguageProcessing(EMNLP2001),pages118{125.
K.McKeownandH. Jing. 1999. Thedecompositionof
human-written summary sentences. InSIGIR 1999,
pages129{136.
I.DanMelamed. 1999. Bitextmapsandalignment via
patternrecognition. ComputationalLinguistics,pages
107{130.
Scott S.L. Piao. 2001. Detecting and measuring text
reuse via aligning texts. Research Memorandum
CS-01-15, Dept. of Computer Science, University of
SheÆeld.
M. Simard, G. Foster, and P. Isabelle. 1992. Using
cognates to align sentences in bilingual corpora. In
Proceedings of the 4thInt. Conf. on Theoretical and
Methodological Issues in Machine Translation, pages
67{81,Montreal,Canada.
M.Wise. 1996. Yap3: Improveddetectionofsimilarities
incomputerprogramsandothertexts. InProceedings
ofSIGCSE'96,pages130{134,Philadelphia,USA.
I.H. Wittenand E. Frank. 2000. Datamining -
practi-cal machine learning tools and techniques with Java
implementations. MorganKaufmann.
D.Wu. 2000. Alignment. InR.Dale andH.Moisland
H.Somers (eds.), A Handbook ofNatural Language