Determination of writing styles to detect similarities in digital documents


Submitted in: March 2013 Accepted in: April 2013 Published in: January 2014

Recommended citation

Ril, Y., Toll, Y.C. & Fonseca, E. (2014). Determination of writing styles to detect similarities in digital docu


Abstract

Anything involving human intellect is at risk of being plagiarised. This includes scientific and literary works such as articles, theses, audiovisual works, plans, projects and computer programs. However, this article pays special attention to the existence of this phenomenon in written works in general, and in digital documents in natural or programming languages in particular. The objective of the research is to develop and apply a mathematical model that allows the writing style used in the drafting of texts to be determined. The results obtained from the application of the procedure are intended to serve as the basis for reducing the number of documents that need to be compared in order to analyse and detect similarities in them. The procedure was experimentally applied to a set of articles classified by topic and author, where the writing styles used to draft them differed.

Keywords

writing style, digital documents, plagiarism, procedure

Determinación de estilos de escritura para la detección de similitudes entre documentos digitales

Resumen

Todo lo inherente al intelecto humano es susceptible de actos de plagio: obras científicas y literarias tales como artículos, tesis, obras audiovisuales, planos y proyectos, códigos fuentes de programas, entre otros. Sin embargo, el presente trabajo dedica especial atención a la existencia de este fenómeno en obras escritas, en concreto documentos digitales provenientes de lenguajes naturales o de programación, y centra su objetivo en el desarrollo y aplicación de un modelo matemático que permite determinar el estilo de escritura empleado en la redacción de los textos. Los resultados que se esperan obtener a partir de la aplicación del procedimiento servirán de base para la reducción en el número de documentos que se deben comparar en el análisis y detección de similitudes entre estos documentos. De forma experimental se aplica el procedimiento a un grupo de artículos clasificados por temáticas y autores y que difieren entre ellos en el estilo de escritura utilizado para su redacción.

Palabras clave

estilo de escritura, documentos digitales, plagio, procedimiento

1. Introduction

The advantages offered by information and communication technologies (ICTs) are irrefutable and borne out by numerous examples. However, along with the positives come the negatives, of which plagiarism is undoubtedly one of the most illustrative examples. The Real Academia Española, the institution responsible for regulating the Spanish language, defines the Spanish term ‘plagiar’ (Real Academia Española, 2001), ‘to plagiarise’ in English, in a simple, categorical manner: “Copiar en lo sustancial obras ajenas, dándolas como propias” (to copy the substance of other people's works, passing them off as one's own). A definition in English of the term is “to appropriate


Anything involving human intellect is at risk of being plagiarised. This includes scientific and literary works such as articles, theses, films, sheet music, audiovisual works, plans, projects, computer programs and websites. However, this article pays special attention to the existence of this phenomenon in written works in general, and in digital documents in natural or programming languages in particular.

Using elements from other works is a common practice. The way it is done determines the intentionality of respecting, or otherwise, the authorship of the sources. Among the various methods used are the following:

- Correct: the provenance of the referenced textual material is accurately shown, adding information about the author of the work cited.
- Paraphrasing: based on an interpretation, certain ideas from other documents are expressed in an author’s own words.
- Multiple sources: a new document is created by the textual or modified use of fragments of text from other documents.
- Mosaic: segments of an original document are disarranged so as to be textually used in a new document.
- Odd modifications: segments of an original document are used while words or phrases are inserted to modify them.
- Copying and pasting: the document is copied in part or in full without any modifications. This is relatively easy to detect by doing a comparison to determine whether or not there are any differences between the documents.

Technological advances mean that systems are now available to detect plagiarism in documents; however, human supervision of the process is still essential. An accusation of plagiarism is serious and should not be made lightly. The use of automated tools reduces the number of documents that require human intervention by differentiating between them, sometimes using heuristics, as is the case for determining that the original of several similar websites is the one with the highest ranking. For comparisons between works done by students, the one by the student with the highest performance could be considered the original. Unfortunately, these heuristics are not always valid, and it is not even possible to be sure that they are so in the majority of instances. As is often the case, the proposed analysis depends on a subjective assessment. Thus, from this point forward, the term ‘determination of similarities’ will be used instead of ‘plagiarism’.

Many articles have discussed topics relating to the determination of similarity in documents. Of particular note among these are:

“Detecting similar documents using salient terms” (Cooper et al., 2002)

The comparison between two documents is done by pre-processing both and searching for tokens¹ defined in advance, such as proper nouns, acronyms, locations, abbreviations, etc. An Information Quotient (IQ) (Cooper et al., 2002) is assigned to each term, which is equivalent to the amount of information that its appearance in the text provides. The score for the similarity in two texts reflects proportionality with the number of terms appearing in one document but not in the other.

1. A token, also called a ‘lexical component’, is a string of characters that has a coherent meaning in a particular natural or

“Tool support for plagiarism detection in text documents” (Gruner & Naven, 2005)

A method is shown for the analysis of writing styles on the basis of statistical behaviour. It seeks to determine how language is used by the author and to compare it with other documents. The search for styles can be performed on paragraphs or on fragments of text.

“Check: a document plagiarism detection system” (Si et al., 1997)

It determines similarity on the basis of how often the terms appear. A weightings vector is generated for each compared document, and the cosine between these vectors is used as a parameter in a function that returns an estimate of similarity. The procedure is repeated for each section of the documents until the existence or otherwise of copying between them is determined.
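The cosine comparison described above can be illustrated with a minimal Python sketch. It assumes that the weightings are raw term counts and that terms are runs of word characters; both choices, and the function names, are illustrative assumptions and are not taken from the Check system itself.

```python
import math
import re
from collections import Counter


def term_frequencies(text):
    """Count how often each lowercase term occurs in the text."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine_similarity(text_a, text_b):
    """Cosine of the angle between the term-count vectors of two texts."""
    freq_a = term_frequencies(text_a)
    freq_b = term_frequencies(text_b)
    # Only terms present in both texts contribute to the dot product.
    dot = sum(freq_a[t] * freq_b[t] for t in freq_a.keys() & freq_b.keys())
    norm_a = math.sqrt(sum(c * c for c in freq_a.values()))
    norm_b = math.sqrt(sum(c * c for c in freq_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text has no direction to compare
    return dot / (norm_a * norm_b)
```

A value close to 1 indicates nearly identical term distributions, while 0 indicates that the two texts share no terms.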

“Sim: a utility for detecting similarity in computer programs” (Gitchell & Tran, 1999)

The article is about detecting similarity in programs. The programs are separated into tokens in order to then calculate the alignment score between two token streams. Each gap or mismatch in the alignment is assigned a weighting. The similarity is computed using the expression:

$$s = \frac{2 \cdot \mathrm{score}(p_1, p_2)}{\mathrm{score}(p_1, p_1) + \mathrm{score}(p_2, p_2)} \qquad \text{[I]}$$

Where score(p1, p2) is the alignment score between program 1 and program 2.

“Plagiarism in natural and programming languages: an overview of current tools and technologies” (Clough, 2000)

It summarises a set of tools available for detecting similarities in texts in natural or programming languages. It concisely describes the distinctive elements of some algorithms used by those tools to perform comparisons, as is the case for determining the Longest Common Subsequence.

The objective of the research is to develop and apply a mathematical model that allows the writing style used in the drafting of texts to be determined.

Research method and theory used

In order to conduct the research, various scientific methods were used. For example, a literature review was conducted to look for sources of information to theoretically underpin the study. Analysis and synthesis methods were generally used throughout the research process, and particularly for specifying the theoretical fundamentals relating to the detection of similarities in digital documents, and to the analysis and interpretation of the results obtained from the application of the tool to detect plagiarism in digital documents. In addition, descriptive statistics were used to process the data obtained from the application of the tools to detect similarities in digital documents, and inferential statistics were used to make decisions about whether or not to reject the data obtained from the process for detecting similarities in digital documents.

2. Development

In the case of systems for detecting similarities in documents, the biggest challenge is the enormous volume of data that has to be processed; that is why a pre-classification of the whole sample or the original documents available for the comparison is required.

The criteria taken into account for the classification often include document type, language, category, subcategory, keywords, authors and dates. Our proposal is to incorporate writing style as a comparative criterion when it comes to determining whether or not a document is original. Thus, the number of documents to be compared in the search for similarities would be reduced to those that have a direct relationship, in terms of classification, with the compared document, and which in turn have a similar writing style.

2.1. Procedure for writing style extraction

Any text can be represented by a statistical model that identifies its intrinsic characteristics, which relate to the author’s writing style. Among these we would mention:

Stop words: this refers to the use of articles, adverbs, prepositions and conjunctions. The frequency with which these words are used denotes the author’s writing style. Consequently, stop word mean is defined as:

$$P_p = \frac{C_{pp}}{T_p} \times 100\,\% \qquad \text{[II]}$$

Where Pp is stop word mean, Cpp is the number of stop words used and Tp is the total number of words in the text.

Level of difficulty: determines the level of education that someone needs to have in order to understand the text. There are several indices available to calculate this level. Gunning (Wikipedia, 2011), Dale and Chall (1948) and Flesch-Kincaid (DuBay, 2004) are some of them, although the latter is the most commonly documented and cited.

The expression to determine the Flesch-Kincaid index is as follows:

[III]

Where λ is the mean number of one-syllable words per 100 words, and β is the mean sentence length measured by the number of words.
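As an illustration of expression [II], the stop word mean can be computed with a few lines of Python. This is a minimal sketch: the stop-word list is a small English sample chosen for the example (the article does not publish the list it used), and the tokenisation rule is likewise an assumption.

```python
import re

# Illustrative stop-word sample (assumption, not the authors' list).
STOP_WORDS = frozenset({
    "a", "an", "the", "and", "or", "but", "of", "in", "on", "to",
    "with", "very", "not",
})


def tokenize(text):
    """Split text into lowercase words made of letters only."""
    return re.findall(r"[^\W\d_]+", text.lower())


def stop_word_mean(text, stop_words=STOP_WORDS):
    """Pp = Cpp / Tp * 100: percentage of the words that are stop words."""
    words = tokenize(text)              # Tp = len(words)
    if not words:
        return 0.0
    stop_count = sum(1 for w in words if w in stop_words)   # Cpp
    return 100.0 * stop_count / len(words)
```

For the sentence "The cat sat on the mat", three of the six words ("the", "on", "the") are in the sample list, so the function returns 50.0.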

Richness of vocabulary: proposed by Honoré (1979), this tries to determine the richness of the author’s vocabulary on the basis of the total unrepeated words used in the text. The following expression is used for that purpose:

$$R = \frac{100 \log(M)}{M^2} \qquad \text{[IV]}$$

Where M is the number of different words in the text.

Depending on the type of document being analysed, the calculation of R has more or less validity. Certain specialist articles and computer programs are good examples of this, as their very nature requires the constant repetition of words.

As a consequence of the above, an approach proposed by Yule (1944) is introduced as a calculation alternative, defining:

$$K = 10^4 \, \frac{\sum_{i=1} i^2 V_i - M}{M^2} \qquad \text{[V]}$$

Where Vi is the number of words that appear i times in the text. M has the same meaning as in the previous calculation.

Mean sentence length: is a reliable measure of grammatical knowledge that the author uses in the composition of sentences.

$$L_{po} = \frac{\sum_{i=1}^{N_o} l_{oi}}{N_o} = \frac{T_p}{N_o} \qquad \text{[VI]}$$

Where loi is the length in words of the sentence occurring in position i, No is the total number of sentences and Tp is the total number of words in the text.

Mean word length: this term is directly connected with the richness of the author’s vocabulary and measures his or her ability to use complex words².

$$L_{pp} = \frac{T_c}{T_p} \qquad \text{[VII]}$$

Where Lpp is mean word length, Tc is the total characters used (excluding spaces), and Tp is the total number of words used.

2. Complex words are considered to be those formed by three or more syllables that do not represent proper nouns, prefixes, suffixes or compound words.
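The following Python sketch shows one way the Yule measure, the mean sentence length and the mean word length could be computed from raw text. It follows the definitions given in the text, with M taken as the number of different words; Yule's measure is often stated with the total number of word tokens in place of M, so that choice should be treated as an assumption. The sentence splitter and tokeniser are deliberately naive, and all function names are hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase words made of letters only (illustrative rule)."""
    return re.findall(r"[^\W\d_]+", text.lower())


def split_sentences(text):
    """Naive splitter (assumption): a sentence ends at '.', '!' or '?'."""
    return [s for s in re.split(r"[.!?]+", text) if s.strip()]


def yule_k(words):
    """K = 10^4 * (sum_i i^2 * V_i - M) / M^2, with M = different words."""
    freq = Counter(words)                 # word -> number of occurrences
    v = Counter(freq.values())            # i -> V_i (words seen i times)
    m = len(freq)
    if m == 0:
        return 0.0
    return 1e4 * (sum(i * i * v_i for i, v_i in v.items()) - m) / (m * m)


def mean_sentence_length(text):
    """Total words Tp divided by number of sentences No."""
    sentences = split_sentences(text)
    if not sentences:
        return 0.0
    total_words = sum(len(tokenize(s)) for s in sentences)
    return total_words / len(sentences)


def mean_word_length(text):
    """Tc / Tp; here Tc counts only the letters inside words (assumption)."""
    words = tokenize(text)
    if not words:
        return 0.0
    return sum(len(w) for w in words) / len(words)
```

For the text "The cat sat. The dog ran fast." there are seven words in two sentences, so the mean sentence length is 3.5 and the mean word length is 22/7.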

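The final definition is that of the writing style vector (E), whose components are described above:

$$E = (\text{stop word mean},\ \text{level of difficulty},\ \text{richness of vocabulary},\ \text{mean sentence length},\ \text{mean word length}) \qquad \text{[VIII]}$$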

While it is true to say the determination of an author’s writing style implies some degree of uncertainty, knowing what that degree is enables a suitable margin of deviation to be established when it comes to determining who the creator of a document is. Among the main parameters that have an influence on uncertainty are the topic covered, the document length (in terms of the number of paragraphs, sentences or words used), the author’s experience and the document end user.

Having a statistical estimate of the writing style is important to ensure that there is a comparative criterion for selecting and classifying documents, and for determining their authorship. One of the applications is the determination of similarities in documents, where processing time needs to be optimised. The materials to be compared are stored on vast databases, so processing everything is not an option, nor is processing only those documents belonging to one classification category. Other arguments are required to differentiate between and considerably reduce the number of documents to be processed, like a writing style identifier, for example.
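One way to use the style vector as a filter is to keep only the stored documents whose vector lies close to that of the document under analysis. The sketch below is an assumption about how this could be done, since the article does not specify a distance measure: it uses a Euclidean distance in which each component is divided by a scale factor, so that large-valued components such as richness of vocabulary do not dominate. The stored vectors are the per-author means reported in Table 1; the query vector, the scales and the threshold are invented for the example.

```python
import math


def style_distance(e1, e2, scales):
    """Euclidean distance between style vectors after per-component scaling."""
    return math.sqrt(sum(((a - b) / s) ** 2 for a, b, s in zip(e1, e2, scales)))


def select_candidates(query_vector, corpus, scales, threshold):
    """Ids of stored vectors within `threshold` of the query vector."""
    return [
        doc_id
        for doc_id, vector in corpus.items()
        if style_distance(query_vector, vector, scales) <= threshold
    ]


# Component order: stop word mean, level of difficulty, richness of
# vocabulary, mean sentence length, mean word length (values from Table 1).
corpus = {
    "author_1": (37.49, 21.89, 2482.43, 20.46, 5.314),
    "author_2": (35.88, 11.40, 2357.63, 30.39, 4.742),
    "author_3": (30.52, 26.88, 1011.01, 12.41, 5.162),
    "author_4": (24.21, 28.96, 1121.34, 10.57, 5.356),
    "author_5": (33.34, 26.70, 665.76, 12.54, 4.399),
}
scales = (5.0, 5.0, 500.0, 5.0, 0.5)        # hypothetical tolerances
query = (37.0, 22.0, 2450.0, 20.0, 5.3)     # hypothetical new document

print(select_candidates(query, corpus, scales, threshold=1.0))
```

With these hypothetical settings only the first author's vector falls within the threshold, so a full comparison would be run against that author's documents alone.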

3. Research analysis and results

In order to apply the proposal for determining writing styles, 37 documents by 5 authors in 4 thematic areas were selected: education, history, medicine and children’s stories. Two of the authors belong to the thematic area of medicine. Chart 1 shows the document distribution by thematic area.

Chart 1. Document distribution by thematic area (pie chart; legend: education, history, medicine, stories; segment labels: 14%, 27%, 32%, 27%)

In order to automatically determine the writing style vector, a tool was developed (Figure 1) to enable the extraction of a set of statistics from the texts analysed as the step prior to calculating the vector.

Figure 1. Tool for calculating the style vector

The characteristics extracted from the documents are listed below (to aid understanding of the content of Figure 1 for non-Spanish speakers, the English translation of each characteristic is given in brackets):

- Promedio de palabras de paro (Stop word mean)
- Nivel de dificultad (Level of difficulty)
- Variedad de vocabulario (Richness of vocabulary)
- Longitud promedio de oraciones (Mean sentence length, measured in words)
- Longitud promedio de palabras (Mean word length, measured in characters)
- Número de oraciones (Number of sentences)
- Número de palabras (Number of words)
- Número de palabras monosílabas (Number of one-syllable words)
- Número de palabras de paro (Number of stop words)
- Total de caracteres (Total characters)


In order to determine trends and other statistics, a Microsoft Office 2007 spreadsheet was used. Table 1 shows the mean values obtained for the style vector components for each author. As we can see, the parameter that makes the biggest difference is richness of vocabulary. Authors 1 and 2, who covered education and history topics respectively, stand out in this case. In particular, author 2 has the lowest level of difficulty owing to the use of easily understandable narrative documents.

For each author, it is essential to determine how the writing style parameters vary, that is to say, whether or not a particular author is able to maintain his or her writing style across a set of documents on the same topic. An analysis was performed on each vector parameter to calculate its mean deviation, as shown in Table 2. As can be anticipated from the information shown in the previous table, it is precisely the richness of vocabulary parameter that varies the most, and the mean word length parameter that varies the least, so little that it was necessary to increase the number of decimal places to three.
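The per-parameter analysis can be sketched in Python as follows. The sketch assumes that "mean deviation" means the mean absolute deviation around the arithmetic mean, which is how spreadsheet software usually defines it; the article does not state the formula, and the function names are hypothetical.

```python
from statistics import fmean


def mean_deviation(values):
    """Mean absolute deviation of the values around their arithmetic mean."""
    values = list(values)
    if not values:
        raise ValueError("at least one value is required")
    centre = fmean(values)
    return fmean([abs(v - centre) for v in values])


def deviations_by_parameter(vectors):
    """Mean deviation of each component across one author's style vectors."""
    # zip(*vectors) regroups the data so that each tuple holds one component
    # (for example, every stop word mean) across all of the documents.
    return [mean_deviation(column) for column in zip(*vectors)]
```

Applied to the style vectors of all the documents by one author, `deviations_by_parameter` yields one value per vector component, which corresponds to one row of a table such as Table 2.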

Chart 2 shows the distribution of deviations through a graphic representation of areas. As we can see, author 2 has the most consistent writing style because the area representing it (red) is more uniform. The opposite is the case for authors 4 and 5, who have several peaks and troughs in their respective deviation distribution areas.

Despite the fact that authors 3 and 4 both belong to the thematic area of medicine, there is a considerable difference in their writing styles.

Table 1. Style vector means by author

| Author | Stop word mean | Level of difficulty | Richness of vocabulary | Mean sentence length | Mean word length |
|---|---|---|---|---|---|
| Author 1 | 37.49 | 21.89 | 2482.43 | 20.46 | 5.314 |
| Author 2 | 35.88 | 11.40 | 2357.63 | 30.39 | 4.742 |
| Author 3 | 30.52 | 26.88 | 1011.01 | 12.41 | 5.162 |
| Author 4 | 24.21 | 28.96 | 1121.34 | 10.57 | 5.356 |
| Author 5 | 33.34 | 26.70 | 665.76 | 12.54 | 4.399 |

Table 2. Mean deviations by vector parameter

| Author | Stop word mean | Level of difficulty | Richness of vocabulary | Mean sentence length | Mean word length |
|---|---|---|---|---|---|
| Author 1 | 0.65 | 0.82 | 107.03 | 1.26 | 0.079 |
| Author 2 | 0.53 | 0.74 | 82.57 | 1.10 | 0.075 |
| Author 3 | 1.96 | 1.05 | 69.78 | 1.05 | 0.066 |
| Author 4 | 1.25 | 1.53 | 229.08 | 0.99 | 0.092 |


The proposed method for determining writing styles can be used in a scenario where it is necessary to describe documents whose authorship has been validated. Some descriptors are commonly used, such as author, title, keywords, category, subcategory and document type, yet having a descriptor like writing style will enable a reduction in the number of documents that need to be compared in the search for possible plagiarism. In this respect, when the aim is to analyse the presence of plagiarism in a new document, it will only need to be compared with a sub-set made up of those with a similar writing style vector.

Although it is possible to start with the idea that every author has a unique writing style, it is nevertheless important to consider that the main threats to the validity of this method reside in the selection of documents to determine an author’s writing style. They should be documents whose authorship has been authenticated, while bearing in mind that more than one author may have participated in the drafting of a document. When the document is not about one of the topics customarily covered by the author, it is acceptable for the deviation indices to vary within a reasonable range. Thus, it is advisable to use documents on topics that the author usually covers. Finally, there are other elements that are equally as important, such as the learning process and the changing reality surrounding the individual, which gradually give rise to variations in every author’s writing style. In this respect, the determination of writing styles will be more reliable when, from the viewpoint of writing style, the individual is more mature and experienced.

Chart 2. Deviation distribution areas (area chart; labels: stop word mean, level of difficulty, richness of vocabulary, mean sentence length)

4. Discussion and conclusions

The proposed procedure focuses on the determination of writing styles and enables the academic community to observe similarities in documents. However, it does not constitute a definitive solution to the serious problem of academic plagiarism. Mindful, systematic training in values such as responsibility and honesty is required.

Why insist on academic honesty? If academic honesty is non-existent, then an impression of knowledge will be given that does not match the reality of what is genuinely in the cognitive structure. Dishonest behaviour in the academic sphere is a means of deceit and, above all, of self-deceit. It erodes the core of the educational aim of our teaching activity. Owing to its ethical nature, we might assume that the concept of honesty is implicit, but this assumption is not absolutely certain. Acts of plagiarism may be committed deliberately, accidentally or unknowingly.

What can be done to strengthen these values among our students? When we show professional coherence and commitment, creating a high-level pedagogical and intellectual atmosphere, we manage to eliminate or reduce academic dishonesty and its negative effects. In higher education, there are clearly major opportunities to learn and, at the same time, major opportunities to defraud. In some ways, we trust that most of the students are aware of their obligations to learn, influenced by the need to get a degree. Our education function cannot be limited to the use of tools to oversee plagiarism; this does not guarantee that the students will not act fraudulently. We must insist on the fact that ethical behaviour is based on free acceptance of the rules of conduct that cannot be imposed by force of authority.

From a very young age, human beings gradually enrich their vocabulary, gain more experience and training, and develop their writing styles as they learn. That is why the creation of a unique writing style is perceived as a slow process. In future studies, the intention is to explore (a) the procedures in order to obtain the development curves for writing styles, and (b) the characterisation of these styles on the basis of the parameters describing them, all of which have formed a fundamental part of this study.

The appropriation of third-party works as one’s own will continue and evolve at the same pace as technology, so being ready to counteract this phenomenon is of vital importance. Most of the studies reviewed for this research will form the basis for the development of future works on the detection of similarities in documents, and the proposal presented here will serve as the starting point for determining which documents to compare. The extraction of the style vector marks the difference between authors, whether or not they cover the same topic. By applying the proposed mathematical model to a considerable set of documents, it was found that trends really do exist when it comes to drafting, and that such trends put a stamp of authenticity onto a document.

References

Clough, P. (2000). Plagiarism in natural and programming languages: an overview of current tools and technologies. Research Memoranda: CS-00-05, Department of Computer Science, University of

Cooper, J. W., Coden, A. R., & Brown, E. W. (2002). Detecting similar documents using salient terms. In Proceedings of the 11th international conference on Information and Knowledge Management. New York, NY: ACM. Retrieved from http://www.labsoftware.com/flahdo/Papers/CIKMDuplicates.pdf

Dale, E., & Chall, J. S. (1948). A formula for predicting readability. Educational Research Bulletin, 27(1), 11-20. Retrieved from http://www.ecy.wa.gov/quality/plaintalk/resources/classics.pdf

DuBay, W. H. (2004). The principles of readability. Costa Mesa, CA: Impact Information. Retrieved from http://files.eric.ed.gov/fulltext/ED490073.pdf

Gitchell, D., & Tran, N. (1999). Sim: a utility for detecting similarity in computer programs. In The proceedings of the 30th SIGCSE technical symposium on Computer Science Education. New York, NY: ACM. Retrieved from http://www.eng.uwi.tt/depts/elec/staff/feisal/ee302/sim-gitchell.pdf

Gruner, S., & Naven, S. (2005). Tool support for plagiarism detection in text documents. In Proceedings of the 2005 ACM symposium on Applied Computing. New York, NY: ACM. Retrieved from http://dl.acm.org/citation.cfm?id=1066677.1066854. doi: http://dx.doi.org/10.1145/1066677.1066854

Honoré, A. (1979). Some simple measures of richness of vocabulary. Association for Literary and Linguistic Computing Bulletin, 7(2).

Plagiarise (n.d.). In The Collins English Dictionary. Retrieved from http://www.collinsdictionary.com/dictionary/english/plagiarise

Real Academia Española (Ed.) (2001). Diccionario de la Real Academia Española. Madrid, Spain: Real Academia Española.

Si, A., Leong, H. V., & Lau, R. W. H. (1997). Check: a document plagiarism detection system. In Proceedings of the 1997 ACM symposium on Applied Computing. New York, NY: ACM. Retrieved from http://www.cs.cityu.edu.hk/~rynson/papers/sac97.pdf. doi: http://dx.doi.org/10.1145/331697.335176

Wikipedia (2011). Gunning fog index. Wikipedia. Online: Wikipedia.org. Retrieved from http://en.wikipedia.org/wiki/Gunning_fog_index

Yule, G. U. (1944). The statistical study of literary vocabulary. Journal of the Royal Statistical Society, 107(2), 129-131. Retrieved from http://www.jstor.org/discover/10.2307/2981280?uid=3737824&