Measuring Text Reuse

(1)

Paul Clough and Robert Gaizauskas and Scott S.L. Piao and Yorick Wilks

Department of Computer Science

University of SheÆeld

Regent Court,211 PortobelloStreet,

SheÆeld, England, S14DP

[email protected]. ac.u kg

Abstract

In this paper we present results from

theMETER (MEasuringTExtReuse)

project whose aim is to explore issues

pertainingtotextreuseandderivation,

especiallyinthecontextofnewspapers

usingnewswire sources. Although the

reuse of text by journalists has been

studiedinlinguistics,wearenotaware

ofanyinvestigationusingexisting

com-putational methodsfor thisparticular

task. We investigate the classication

ofnewspaperarticlesaccordingtotheir

degree of dependence upon,or

deriva-tion from, a newswire source using a

simple3-levelschemedesignedby

jour-nalists. Three approaches to

measur-ing text similarity are considered:

n-gram overlap, Greedy String Tiling,

and sentence alignment. Measured

againstamanuallyannotatedcorpusof

sourceand derived newstext, weshow

that a combined classier with

fea-tures automatically selected performs

best overall for the ternary

classica-tion achieving an average F

1

-measure

score of 0.664 across all three

cate-gories.

1 Introduction

Atopicofconsiderabletheoreticalandpractical

interestisthatoftextreuse: thereuseofexisting

writtensourcesinthecreationofanewtext. Of

course,reusinglanguageisasoldastheretelling

of stories,butcurrenttechnologiesforcreating,

copyinganddisseminatingelectronictext,make

it easier thanever before to take some or all of

any number of existing text sources and reuse

them verbatimorwith varying degreesof

mod-ication.

One form of unacceptable text reuse,

plagia-rism, has received considerable attention and

software for automatic plagiarism detection is

nowavailable(see, e.g. (Clough, 2000)fora

re-cent review). But in this paper we present a

benign and acceptable form of text reuse that

is encountered virtually every day: the reuseof

news agency text (called copy) in the

produc-tion of daily newspapers. The question is not

just whether agency copy has been reused, but

towhat extentandsubjecttowhat

transforma-tions. Usingexistingapproachesfrom

computa-tional text analysis, we investigate theirability

toclassifynewspapersarticlesintocategories

in-dicatingtheir dependencyonagency copy.

2 Journalistic reuse of a newswire

The process of gathering, editing and

publish-ing newspaper stories is a complex and

spe-cialisedtaskoftenoperatingwithinspecic

pub-lishing constraints such as: 1) short deadlines;

2) prescriptive writingpractice (see, e.g. Evans

(1972)); 3) limitsofphysicalsize; 4) readability

and audience comprehension, e.g. a tabloid's

vocabularylimitations; 5) journalistic bias, e.g.

political and 6) a newspaper's house style.

Of-ten newsworkers, such as the reporter and

edi-tor,willrelyuponnewsagencycopyasthebasis

of anewsstory orto verify factsand assessthe

(2)

appearing on the newswire. Because of the

na-ture of journalistic text reuse, dierences will

arisebetween reusednewsagency copyand the

original text. For example consider the

follow-ing:

Original (news agency) A drink-driver who

ran into the QueenMother's oÆcial

Daim-lerwasned $700andbanned fromdriving

for two years.

Rewrite (tabloid) A DRUNK driver who

ploughed intothe QueenMother's limowas

ned $700 and banned for two years

yes-terday.

This simple example illustrates the types of

rewrite that can occur even in a very short

sentence. The rewrite makes use of slang and

exaggeration to capture its readers' attention

(e.g. DRUNK, limo, ploughed). Deletion (e.g.

from driving) hasalso beenused and the

addi-tion of yesterday indicates when the event

oc-curred. Many of the transformations we

ob-served between moving from news agency copy

to the newspaper version have also been

re-ported by the summarisation community (see,

e.g., McKeown and Jing(1999)).

Giventhevalueoftheinformationnews

agen-cies supply, the ease with which text can be

reused and commercial pressures, it would be

benecial to beable to identifythose news

sto-riesappearinginthenewspapersthathaverelied

uponagencycopyintheirproduction. Potential

uses include: 1) monitoring take-up of agency

copy; 2) identifyingthemostreusedstories ;3)

determiningcustomerdependencyupon agency

copyand4)newmethodsforchargingcustomers

based upon the amount of copy reused. Given

the large volume of news agency copy output

each day, it would be infeasible to identify and

quantifyreusemanually;thereforeanautomatic

method isrequired.

3 A conceptual framework

To begin to get a handle on measuring text

reuse, we have developed adocument-level

clas-agency copy, and a lexical-level classication

scheme, indicating the level at which

individ-ual word sequences within a newspaper story

are derived from agency copy. This framework

rests upon the intuitions of trained journalists

to judge text reuse, and not on an explicit

lex-ical/syntactic denition of reuse (which would

presupposewhatwearesettingouttodiscover).

At the document level, newspaper stories

are assigned to one of three possible categories

coarsely reecting the amount of text reused

from the news agency and the dependency of

the newspaper story upon news agency copy

for the provision of \facts". The categories

in-dicate whether a trained journalist can

iden-tify text rewritten from the news agency in

a candidate derived newspaper article. They

are: 1) wholly-derived (WD): all text in

the newspaper article is rewritten only from

newsagencycopy;2)partially-derived(PD):

some text isderived from thenews agency, but

other sources have also beenused;and 3)

non-derived (ND):newsagency hasnotbeenused

asthesourceofthearticle;althoughwordsmay

stillco-occurbetweenthenewspaperarticleand

news agency copy on the same topic, the

jour-nalistiscondentthenewsagency hasnotbeen

used.

Atthelexicalorwordsequence level,

individ-ualwordsandphraseswithinanewspaperstory

areclassiedasto whether theyare usedto

ex-press the same information as words in news

agency copy (i.e. paraphrases) and or used to

express information not found in agency copy.

Once again,three categories areused,based on

thejudgementofatrainedjournalist: 1)

verba-tim: text appearing word-for-word to express

the same information; 2) rewrite: text

para-phrasedtocreateadierentsurfaceappearance,

but express the same informationand 3) new:

text usedto express informationnotappearing

inagency copy(can includeverbatim/rewritten

text, butbeingused ina dierentcontext).

3.1 The METER corpus

(3)

the news agency source and nine British daily

newspapers 1

who subscribe to thePA as

candi-date reusers. The METER corpus (Gaizauskas

et al., 2001) is a collection of 1716 texts (over

500,000 words) carefully selected from a 12

month period from the areas of law and court

reporting (769 stories) and showbusiness (175

stories). 772ofthese textsarePA copyand 944

fromtheninenewspapers. Thesetextscover265

dierentstoriesfromJuly1999toJune2000and

allnewspaperstorieshavebeenmanually

classi-ed at the document-level. They include 300

wholly-derived, 438 partially-derived and 206

non-derived(i.e. 77% arethought to have used

PA in some way). In addition, 355 have been

classiedaccording to thelexical-levelscheme.

4 Approaches to measuring text

similarity

Many problems in computational text

analy-sis involve the measurement of similarity. For

example, the retrieval of documents to full a

userinformationneed,clusteringdocuments

ac-cordingtosome criterion,multi-document

sum-marisation, aligning sentences from one

lan-guagewiththoseinanother,detectingexactand

nearduplicates ofdocuments,plagiarism

detec-tion,routingdocumentsaccordingto theirstyle

and identifying authorship attribution.

Meth-ods typically vary depending upon the

match-ing method, e.g. exact or partial, the degree

towhichnaturallanguageprocessingtechniques

are used and the type of problem, e.g.

search-ing, clustering, aligning etc. We have not had

time to investigate all of these techniques, nor

is there space here to review them. We have

concentratedon justthree: ngramoverlap

mea-sures, GreedyStringTiling,andsentence

align-ment. The rst was investigated because it

of-fersperhapsthesimplestapproach to the

prob-lem. Thesecondwasinvestigatedbecauseithas

beensuccessfullyusedinplagiarismdetection,a

problemwhichatleastsuperciallyisquiteclose

1

Thenewspapersincludevepopularpapers(e.g. The

Sun,TheDailyMail,DailyStar,DailyMirror)andfour

qualitypapers(e.g. DailyTelegraph,TheGuardian,The

nally,alignment (treating thederived text as a

\translation" of the rst) seemed an intriguing

idea,andcontrasts,certainlywiththengram

ap-proach,byfocusingmoreonlocal,asopposedto

globalmeasuresof similarity.

4.1 Ngram Overlap

Aninitial,straightforwardapproachtoassessing

the reuse between two texts is to measure the

number of shared word ngrams. This method

underlies many of the approaches used in copy

detectionincludingtheapproachtakenbyLyon

etal. (2001).

They measure similarity using the

set-theoretic measures of containment and

resem-blanceofsharedtrigramstoseparatetexts

writ-tenindependentlyandthosewithsuÆcient

sim-ilarityto indicatesome form ofcopying.

We treat each document as a set of

overlap-pingn-wordsequences(initiallyconsideringonly

n-word types) and compute a similarity score

from this. Given two sets of ngrams, we use

the set-theoretic containment score to measure

similaritybetweenthedocumentsforngramsof

length 1 to 10 words. For a source text A and

a possiblyderived text B representedbysets of

ngrams S

n

(A) and S

n

(B) respectively,the

pro-portionofngramsinBalsoinA,thengram

con-tainmentC

n

(A;B), isgiven by:

C

n

(A;B)= jS

n

(A)\S

n (B)j

jS

n (B)j

(1)

Informallycontainmentmeasuresthenumber

of matches betweenthe elements of ngram sets

S

n

(A) and S

n

(B), scaled by the sizeof S

n (B).

In other words we measure the proportion of

unique n-grams in B that are found in A. The

score ranges from 0 to 1, indicating noneto all

newspapercopy sharedwith PArespectively.

We alsocomparetextsbycountingonlythose

ngrams with low frequency, in particular those

occurringonce. For1-grams,thisisthesameas

comparing the hapax legomena which has been

shown to discriminate plagiarised texts from

those written independently even when lexical

(4)

nd that repetition in PA copy drastically

re-duces the number of shared hapax legomena

thereby inhibiting classication of derived and

non-derived texts. Therefore we compute the

containmentofhapaxlegomena(hapax

contain-ment)bycomparingwordsoccurringonceinthe

newspaper,i.e. those1-grams inS

1

(B) that

oc-cur once with all 1-grams in PA copy, S

1 (A).

This containment score represents the number

ofnewspaperhapax legomenaalsoappearingat

leastonce inPA copy.

4.2 Greedy String-Tiling

Greedy String-Tiling (GST) is a substring

matchingalgorithm whichcomputes the degree

of similarity between two strings, for

exam-ple software code, free text or biological

subse-quences (Wise,1996). Comparedwithprevious

algorithmsforcomputingstring similarity,such

astheLongestCommonSubsequence or

Levenshteindistance,GSTisabletodealwith

transposition of tokens (in earlier approaches

transpositionisseenasanumberofsingle

inser-tions/deletionsratherthanasingleblockmove).

The GST algorithm performs a1:1 matching

oftokensbetweentwostringssothatasmuchof

onetokenstreamiscoveredwithmaximallength

substrings from the other (called tiles). In our

problem,weconsiderhowmuchnewspapertext

can be maximally covered by words from PA

copy. A minimummatch length(MML)can be

used to avoid spurious matches (e.g. of 1 or

2 tokens) and the resulting similarity between

the strings can be expressed as a quantitative

similaritymatch oraqualitativelistofcommon

substrings. Figure1showstheresultofGSTfor

theexampleinSection 2.

Figure1: ExampleGST results(MML=3)

2

Asstories unfold,PAreleasecopywithnew,aswell

Given PA copy A, a newspaper text B and a

setofmaximalmatches,tiles,ofagivenlength

betweenA and B, thesimilarity,gstsim(A,B),

isexpressed as:

gstsim(A;B)= P

i2til es length

i

jB j

(2)

4.3 Sentence alignment

Inthepastdecade,variousalignmentalgorithms

have been suggested for aligning multilingual

parallel corpora (Wu, 2000). These algorithms

have been used to map translation equivalents

across dierent languages. In thisspecic case,

we investigate whether alignment can map

de-rived texts (or parts of them) to their source

texts. PA copy may be subject to various

changes during text reuse, e.g. a single

sen-tence may derive from parts of several source

sentences. Therefore,strongcorrelationsof

sen-tence length between the derived and source

sentences cannot be guaranteed. As a result,

sentence-length based statistical alignment

al-gorithms(Brownetal.,1991;Galeand Church,

1993) arenot appropriatefor thiscase. Onthe

other hand, cognate-based algorithms (Simard

et al., 1992; Melamed, 1999) are more eÆcient

for coping with change of text format.

There-fore, a cognate-based approach is adopted for

theMETER task. Here cognates aredenedas

pairsof termsthatareidentical, sharethesame

stems,oraresubstitutableinthegiven context.

The algorithm consists of two principal

com-ponents: a comparison strategy and a scoring

function. In brief,the comparison worksas

fol-lows(moredetailsmaybefoundinPiao (2001)).

Foreach sentence inthecandidate derivedtext

DT the sentences in the candidate source text

STarecomparedinordertondthebestmatch.

A DT sentence is allowed to match up to three

possibly non-consecutive ST sentences. The

candidate pair with the highest score (see

be-low) above a threshold is accepted as a true

alignment. If no such candidate is found, the

DT sentence is assumed to be independent of

theST. Based onindividualDTsentence

(5)

tionof alignedsentencesinthenewspapertext.

Note that not only may multiple sentences in

the STbe aligned witha single sentence in the

DT,butalso multiplesentencesinthe DTmay

bealigned withone sentence intheST.

Given a candidate derived sentence DS and

a proposed (set of) source sentence(s) SS, the

scoring function works as follows. Three basic

measures are computed for each pair of

candi-date DS and SS: SNG is the sum of lengths

of themaximum lengthnon-overlapping shared

n-grams with n 2; SWD is the number of

matchedwords sharingstems not inan n-gram

guring in SNG; and SUB is the number of

substitutable terms (mainly synonyms) not

g-uringinSNGorSWD. LetL

1

bethelengthof

thecandidateDSandL

2

thelengthofcandidate

SS.Then,threescoresPD,PS(Dicescore)and

PVS are calculatedasfollows:

PSD =

SWD+SNG+SUB

L

1

PS =

2(SWD+SNG+SUB)

L

1 +L

2

PSNG =

SNG

SWD+SNG+SUB

These three scores reectdierent aspects of

relationsbetween thecandidateDSand SS:

1. PSD: The proportion of the DS which is

sharedmaterial.

2. PS: The proportion of shared terms in DS

andSS.ThismeasureprefersSS'swhichnot

onlycontainmanytermsintheDS,butalso

donot containmanyadditionalterms.

3. PSNG: The proportion of matching

n-grams amongst the shared terms. This

measure captures the intuition that

sen-tencessharingnotonlywords,butword

se-quencesare morelikelyto be related.

These three scores are weighted and combined

together to provide an alignment metric WS

(weighted score), whichis calculatedasfollows:

WS=Æ PSD+Æ PS+Æ PSNG

1 2 3

ablesÆ

i

(i=1;2;3)havebeendetermined

empir-ically and are currently set to: Æ

1

= 0:85;Æ

2 =

0:05;Æ

3 =0:1.

5 Reuse Classiers

Toevaluatethepreviousapproachesfor

measur-ingtext reuseatthedocument-level,wecastthe

probleminto one ofa supervisedlearningtask.

5.1 Experimental Setup

Weusedsimilarityscoresasattributesfora

ma-chine learningalgorithm andusedtheWeka 3.2

software(Witten and Frank, 2000). Because of

the smallnumberof examples, we used tenfold

cross-validationrepeated10times(i.e. 10runs)

and combined thiswith stratication to ensure

approximately the same proportion of samples

from each class were used in each fold of the

cross-validation. All 769 newspapertexts from

thecourtsdomainwereusedforevaluationand

randomly permuted to generate 10 sets. For

each newspaper text, we compared PA source

texts from the same story to create results in

theform: newspaper;class;score. Theseresults

wereorderedaccordingto eachsettocreate the

same 10 datasets foreachapproach thereby

en-ablingcomparison.

Using this data we rst trained ve

single-featureNaiveBayesclassiersto dotheternary

classicationtask. The featureineach casewas

avariantofone ofthethree similaritymeasures

described in Section 4, computed between the

twotextsinthetrainingset. Thetarget

classi-cationvaluewasthereuseclassicationcategory

from the corpus. A Naive Bayes classier was

used because of its success in previous

classi-cation tasks, however we are aware of its naive

assumptions that attributes are assumed

inde-pendentand data to be normallydistributed.

We evaluated results using the F

1

-measure

(harmonic mean of precision and recall given

equal weighting). For each run, we calculated

the average F

1

score across the classes. The

overall average F

1

-measure scores were

com-puted from the 10 runs for each class (a single

accuracy measure would suÆce but the Weka

(6)

the standard deviation of F

1

scores was

com-puted for each class and F

1

scores between all

approaches were tested for statistical

signi-canceusing1-wayanalysis ofvarianceat a99%

condence-level. Statistical dierences between

results were identied using Bonferroni

analy-sis 3

.

Afterexaminingtheresultsofthesesingle

fea-ture classiers, we also trained a \combined"

classier using a correlation-based lter

ap-proach(HallandSmith,1999)toselectthe

com-binationoffeaturesgivingthehighest

classica-tion score ( correlation-basedlteringevaluates

all possible combinations of features). Feature

selection wascarried foreach fold during

cross-validationandfeatures usedinall 10folds were

chosen as candidates. Those which occurred in

at least5 of the10 runs formedthe nal

selec-tion.

We also tried splittingthe training datainto

variousbinarypartitions(e.g. WD/PDvs. ND)

and trainingbinary classiers, usingfeature

se-lection, to see how well binary classication

couldbeperformed. Eskinand Bogosian (1998)

have observed that using cascaded binary

clas-siers, each of which splits the data well, may

work better on n-ary classication problems

than a single n-way classier. We then

com-putedhowwellsuchacascadedclassiershould

performusingthebestbinaryclassier results.

5.2 Results

Table 1 shows the results of the single ternary

classiers. The baseline F

1

measure is based

uponthepriorprobabilityofadocumentfalling

into one of theclasses. The guresin

parenthe-sisarethestandarddeviationsfortheF

1 scores

across the ten evaluation runs. The nal row

showstheresultsforcombiningfeaturesselected

usingthecorrelation-based lter.

Table 2 shows the result of training binary

classiers using feature selection to select the

most discriminatingfeatures forvarious binary

splitsof thetrainingdata.

Forbothternaryandbinaryclassiersfeature

selection produced better results than usingall

3

Baseline WD 0.340(0.000)

PD 0.444(0.000)

ND 0.216(0.000)

total 0.333(0.000)

3-gram WD 0.631(0.004)

containment PD 0.624(0.004)

ND 0.549(0.005)

total 0.601(0.003)

GSTSim WD 0.669(0.004)

MML=3 PD 0.633(0.003)

ND 0.556(0.004)

total 0.620(0.002)

GSTSim WD 0.681(0.003)

MML=1 PD 0.634(0.003)

ND 0.559(0.008)

total 0.625(0.004)

1-gram WD 0.718(0.003)

ND 0.551(0.006)

total 0.638(0.003)

Alignment WD 0.774(0.003)

PD 0.624(0.005)

ND 0.537(0.007)

total 0.645(0.004)

hapax WD 0.736(0.003)

ND 0.549(0.010)

total 0.646(0.004)

hapaxcont. WD 0.756(0.002)

1-gramcont. PD 0.599(0.006)

alignment ND 0.629(0.008)

(\combined") total 0.664(0.004)

Table1: A summary ofclassication results

possiblefeatures, with theone exceptionof the

binaryclassication betweenPDand ND.

5.3 Discussion

From Table 1, we ndthat all classier results

are signicantly higher than the baseline (at

p < 0:01) and all dierences are signicant

ex-ceptbetweenhapaxcontainmentandalignment.

The highest F-measure for the 3-class problem

is 0.664 for the \combined" classier, which is

signicantly greater than 0.651 obtained

with-out. We notice that highest WD classication

is with alignment at 0.774, highest PD

classi-cation is 0.654 with hapax containment and

highestNDclassicationis0.629withcombined

features. Usinghapaxcontainmentgiveshigher

results than 1-gram containment alone and in

fact provides results as good as or better than

themore complexsentence alignmentand GST

approaches.

Previous research by (Lyon et al., 2001) and

(Wise, 1996) had shown derived texts could be

(7)

Attributes Category AvgF1

Correlation- alignment WD 0.942(0.008)

based ND 0.909(0.011)

lter total 0.926(0.010)

alignment PD/ND 0.870(0.003)

WD 0.770(0.003)

total 0.820(0.002)

alignment WD 0.778(0.003)

PD 0.812(0.002)

total 0.789(0.002)

hapaxcont. WD/PD 0.882(0.002)

alignment ND 0.649(0.007)

1-gramcont. total 0.763(0.002)

1-gram PD 0.802(0.002)

GSTmml3 ND 0.638(0.007)

GSTmml1 total 0.720(0.004)

alignment

GSTmml1 WD/ND 0.672(0.002)

alignment PD 0.662(0.003)

total 0.668(0.003)

Table2: BinaryClassierswithfeatureselection

However, our results run counter to this

be-cause the highest classication scores are

ob-tained with 1-grams and an MML of 1, i.e. as

n or MML length increases, the F

1

scores

de-crease. We believe thisresultsfrom two factors

which are characteristic of reuse in journalism.

First,sinceevenNDtextsarethematically

sim-ilar(same events beingdescribed)there ishigh

likelihood of coincidental overlap of ngrams of

length3ormore(e.g. quotedspeech). Secondly,

when journalists rewrite it is rare for them not

to varythe source.

Fortheintendedapplication{helpingthePA

tomonitortext reuse{thecostofdierent

mis-classicationsisnotequal. Iftheclassiermakes

amistake,itisbetterthatWDandNDtextsare

mis-classiedasPD,and PDasWD.Given the

dierence in distribution of documents across

classeswherePDcontainsthemostdocuments,

the classier will be biased towards this class

anyway as required. Table 3 shows the

confu-sion matrixforthecombinedternary classier.

WD PD ND

WD 203 55 4

PD 79 192 70

ND 3 53 109

Table3: Confusionmatrixforcombinedternary

classier

Although the overall F

1

-measure score is low

(0.664), mis-classication of both WD as ND

classication of PD as both WD and ND,

re-ectingthediÆcultyofseparatingthisclass.

From Table 2,we ndalignmentis aselected

feature for each binary partition of the data.

Thehighestbinary classicationisachieved

be-tween theWD and ND classesusing alignment

only, and the highest three scores show WD is

the easiest class to separate from the others.

The PD class is the hardest to isolate,

reect-ingthemis-classicationsseen inTable3.

Topredict howwella cascadedbinary

classi-erwillperformwecanreasonasfollows. From

the preceding discussion we see that WD can

be separated most accurately; hence we choose

WDversusPD/NDastherstbinaryclassier.

Thisforcesthesecond classiertobePDversus

ND.From theresultsinTable2and the

follow-ing equation to compute the F

1

measure for a

two-stage binaryclassier

WD+(PD=ND)(

PD+ND

2 )

2

weobtainanoverallF

1

measureforternary

clas-sication of 0.703, which is signicantly higher

thanthebestsingle stage ternary classier.

6 Conclusions

In thispaperwe have investigatedtext reusein

thecontext ofthereuseofnewsagencycopy,an

area of theoretical and practical interest. We

present a conceptual framework in which we

measurereuseand basedonwhichtheMETER

corpushasbeenconstructed. Wehavepresented

the resultsof using similarityscores, computed

usingn-gramcontainment,GreedyStringTiling

and an alignment algorithm, as attributes for

a supervised learning algorithm faced with the

task of learning how to classify newspaper

sto-ries as to whether they are wholly, partiallyor

non-derived from a news agency source. We

show that the best single feature ternary

clas-sieruseseitheralignmentorsimplehapax

con-tainment measures and that a cascaded binary

classier using a combination of features can

outperformthis.

(8)

formations and the high amount of verbatim

text overlapping as a result of thematic

simi-larityand \expected"similaritydueto, e.g.,

di-rect/indirect quotes. Given the relative

close-ness of results obtained by all approaches we

haveconsidered, wespeculatethatany

compar-ison method based upon lexical similarity will

probably not improve classication results by

much. Perhaps improved performance at this

taskmaypossibleby usingmore advanced

nat-ural languageprocessing techniques,e.g. better

modeling of the lexical variation and syntactic

transformationthatgoesoninjournalisticreuse.

Nevertheless the results we have obtained are

strongenoughinsomecases(e.g. whollyderived

textscan beidentiedwith>80%accuracy) to

beginto beexploited.

In summarymeasuringtext reuseisan

excit-ing new area that will have a number of

appli-cations, in particular, butnot limited to,

mon-itoring and controlling the copy produced by a

newswire.

7 Future work

WeareadaptingtheGSTalgorithmtodealwith

simplerewrites(e.g. synonymsubstitution)and

to observe the eects of rewriting upon nding

longestcommonsubstrings. We arealso

experi-mentingusingthemoredetailedMETERcorpus

lexical-levelannotationsto investigate howwell

the GST and ngrams approaches can identify

reuseat thislevel.

A prototype browser-baseddemo of both the

GST algorithm and alignment program,

allow-ing users to test arbitrary text pairs for

simi-larity,is now available 4

and willcontinue to be

enhanced.

Acknowledgements

The authors would like to acknowledge the

UK Engineering and Physical Sciences

Re-search CouncilforfundingtheMETER project

(GR/M34041). Thanks alsotoMarkHepplefor

helpfulcommentson earlierdrafts.

4

P.F.Brown, J.C. Lai, andR.L. Mercer. 1991. Aligning

sentences in parallel corpora. In Proceedings of the

29thAnnualMeetingoftheAssoc.forComputational

Linguistics,pages169{176,Berkeley,CA,USA.

PClough. 2000. Plagiarisminnaturalandprogramming

languages: Anoverviewofcurrenttoolsand

technolo-gies. Technical ReportCS-00-05, Dept.ofComputer

Science,UniversityofSheÆeld,UK.

E.EskinandM. Bogosian. 1998. Classifyingtext

docu-mentsusingmodularcategoriesandlinguistically

mo-tivatedindicators. InAAAI-98WorkshoponLearning

forTextClassication.

H.Evans. 1972. EssentialEnglish forJournalists,

Edi-torsandWriters. Pimlico,London.

S.Finlay. 1999. Copycatch. Master's thesis, Dept. of

English. UniversityofBirmingham.

R. Gaizauskas, J. Foster, Y. Wilks, J. Arundel,

P. Clough, and S. Piao. 2001. The meter corpus:

Acorpusforanalysingjournalistictextreuse. In

Pro-ceedings of the Corpus Linguistics 2001 Conference,

pages214|223.

W.A.GaleandK.W.Church.1993. Aprogramfor

align-ingsentencesinbilingualcorpus. Computational

Lin-guistics,19:75{102.

M.A.Halland L.A.Smith. 1999. Featureselection for

machine learning: Comparingacorrelation-based

l-ter approach to the wrapper. In Proceedings of the

Florida Articial Intelligence Symposium

(FLAIRS-99),pages235{239.

C.Lyon,J.Malcolm,andB.Dickerson. 2001. Detecting

shortpassagesofsimilartextinlargedocument

collec-tions. InConferenceonEmpiricalMethodsinNatural

LanguageProcessing(EMNLP2001),pages118{125.

K.McKeownandH. Jing. 1999. Thedecompositionof

human-written summary sentences. InSIGIR 1999,

pages129{136.

I.DanMelamed. 1999. Bitextmapsandalignment via

patternrecognition. ComputationalLinguistics,pages

107{130.

Scott S.L. Piao. 2001. Detecting and measuring text

reuse via aligning texts. Research Memorandum

CS-01-15, Dept. of Computer Science, University of

SheÆeld.

M. Simard, G. Foster, and P. Isabelle. 1992. Using

cognates to align sentences in bilingual corpora. In

Proceedings of the 4thInt. Conf. on Theoretical and

Methodological Issues in Machine Translation, pages

67{81,Montreal,Canada.

M.Wise. 1996. Yap3: Improveddetectionofsimilarities

incomputerprogramsandothertexts. InProceedings

ofSIGCSE'96,pages130{134,Philadelphia,USA.

I.H. Wittenand E. Frank. 2000. Datamining -

practi-cal machine learning tools and techniques with Java

implementations. MorganKaufmann.

D.Wu. 2000. Alignment. InR.Dale andH.Moisland

H.Somers (eds.), A Handbook ofNatural Language