A Robust System for Local Reuse Detection of Arabic Text on the Web


Scholarworks@UAEU

Theses

Electronic Theses and Dissertations

12-2016

A Robust System for Local Reuse Detection of Arabic Text on the Web

Leena Mahmoud Ahmed Lulu


This Dissertation is brought to you for free and open access by the Electronic Theses and Dissertations at Scholarworks@UAEU. It has been accepted for inclusion in Theses by an authorized administrator of Scholarworks@UAEU. For more information, please [email protected].

Recommended Citation

Ahmed Lulu, Leena Mahmoud, "A Robust System for Local Reuse Detection of Arabic Text on the Web" (2016). Theses. 646. https://scholarworks.uaeu.ac.ae/all_theses/646


United Arab Emirates University

College of Information Technology

A ROBUST SYSTEM FOR LOCAL REUSE DETECTION OF ARABIC TEXT ON THE WEB

Leena Mahmoud Ahmed Lulu

This dissertation is submitted in partial fulfilment of the requirements for the degree of Doctor of Philosophy

Under the Supervision of Professor Boumediene Belkhouche


Declaration of Original Work

I, Leena Mahmoud Ahmed Lulu, the undersigned, a graduate student at the United Arab Emirates University (UAEU), and the author of this dissertation entitled "A Robust System for Local Reuse Detection of Arabic Text on the Web", hereby, solemnly declare that this dissertation is my own original research work that has been done and prepared by me under the supervision of Professor Boumediene Belkhouche, in the College of Information Technology at UAEU. This work has not previously been presented or published, or formed the basis for the award of any academic degree, diploma or a similar title at this or any other university. Any materials borrowed from other sources (whether published or unpublished) and relied upon or included in my dissertation have been properly cited and acknowledged in accordance with appropriate academic conventions. I further declare that there is no potential conflict of interest with respect to the research, data collection, authorship, presentation and/or publication of this dissertation.


Copyright © 2016 Leena Mahmoud Ahmed Lulu
All Rights Reserved


Advisory Committee

1) Advisor: Boumediene Belkhouche
Title: Professor
Department of Computer Science and Software Engineering
College of Information Technology

2) Co-advisor: Saad Harous
Title: Associate Professor
Department of Computer Science and Software Engineering
College of Information Technology

3) Member: Liren Zhang
Title: Professor
Department of Computer and Network Engineering
College of Information Technology


Approval of the Doctorate Dissertation

This Doctorate Dissertation is approved by the following Examining Committee Members:

1) Advisor (Committee Chair): Boumediene Belkhouche
Title: Professor
Department of Computer Science and Software Engineering
College of Information Technology
Signature
Date

2) Member: Salah Bouktif
Title: Associate Professor
Department of Computer Science and Software Engineering
Signature
Date

3) Member: Abderrahmane Lakas
Title: Associate Professor
Department of Computer and Network Engineering
Signature
Date

4) Member (External Examiner): Nizar Habash
Title: Associate Professor
Department of Computer Science
Abu Dhabi
Signature
Date


This Doctorate Dissertation is accepted by:

Dean of the College of Information Technology: Professor Omar El-Gayar
Signature
Date

Dean of the College of Graduate Studies: Professor Nagi T. Wakim
Signature


Abstract

We developed techniques for finding local text reuse on the Web with an emphasis on the Arabic language. That is, our objective is to develop text reuse detection methods that can detect alternative versions of the same information and focus on exploring the feasibility of employing text reuse detection methods on the Web. The results of this research can be thought of as rich tools to information analysts for corporate and intelligence applications. Such tools will become essential parts in validating and assessing information coming from uncertain origins. These tools will prove useful for detecting reuse in scientific literature too. It is also the time for ordinary Web users to become Fact Inspectors by providing a tool that allows people to quickly check the validity and originality of statements and their sources, so they will be given the opportunity to perform their own assessment of information quality.

Local text reuse detection can be divided into two major subtasks: the first subtask is the retrieval of candidate documents that are likely to be the original sources of a given document in a collection of documents, and the second is performing an extensive pairwise comparison between the given document and each of the possible sources of text reuse that have been retrieved. For this purpose, we develop a new technique to address the challenging problem of candidate document retrieval from the Web. Given an input document d, the problem of local text reuse detection is to detect, from a given document collection, all the possible reused passages between d and the other documents. Comparing the passages of document d with the passages of every other document in the collection is obviously infeasible, especially with large collections such as the Web. Therefore, selecting a subset of the documents that potentially contains reused text with d becomes a major step in the detection problem. In the setting of the Web, the search for such candidate source documents is usually performed through limited query interfaces. We developed a new efficient approach of query formulation to retrieve Arabic-based candidate source documents from the Web. The candidate documents are then fed to a local text reuse detection system for detailed similarity evaluation with d. We consider the candidate source document retrieval problem as an essential step in the detection of text reuse.

Several techniques have been previously proposed for detecting text reuse; however, these techniques have been designed for relatively small and homogeneous collections. Furthermore, we are not aware of any actual previous work on Arabic text reuse detection on the Web. This is due to the complexity of the Arabic language as well as the heterogeneity of the information contained on the Web and its large scale, which make the task of text reuse detection on the Web much more difficult than in relatively small and homogeneous collections. We evaluated the work using a collection of documents especially constructed and downloaded from the Web for the evaluation of Web document retrieval in particular and the detailed text reuse detection in general. Our work to a certain degree is exploratory rather than definitive, in that this problem has not been investigated before for Arabic documents at the Web scale. However, our results show that the methods we described are applicable for Arabic-based reuse detection in practice. The experiments show that around 80% of the Web documents used in the reused cases were successfully retrieved. As for the detailed similarity analysis, the system achieved an overall score of 97.2% based on the precision and recall evaluation metrics.

Keywords: Information Retrieval, Text Mining, Local Text Reuse Detection, Candidate Document Retrieval, Fingerprinting, Web Queries Formulation, Query Expansion, Arabic Text

Title and Abstract (in Arabic)

Acknowledgements

I would like to express my special appreciation and thanks to my advisers Professor Boumediene Belkhouche and Dr. Saad Harous for their great support, patience, and encouragement throughout my PhD study. I thank them for continuing to believe in me even at times that I lost believing in myself. Their guidance, advice and support have been priceless. I would also like to thank the members of my advisory committee, Professor Liren Zhang and Dr. Yacine Atif, and the members of the examining committee, Dr. Abderrahmane Lakas, Dr. Salah Bouktif, and Dr. Nizar Habash. I am grateful to them for their feedback and advice.

I would like to thank my parents, my husband, my family, and all of those around me who supported me unconditionally throughout my studies. No words can express my gratitude and appreciation to my parents for all the unmatched love, patience, and sacrifices they have made for me. Your love and prayers were what sustained me thus far. I am forever indebted to my mother for her great help in taking care of my children and for the sleepless nights during the time of my writing of this thesis. This thesis would not have been completed without her support. I would like to thank my husband Ashraf and my children, Ghalya, Yahya, Yazan and the little sweet baby Malak. This work would not have been possible without Ashraf's love, patience, and advice. I am forever grateful to Ashraf for his support and encouragement. To my children I am very overwhelmed for the every-night prayers, and for the love, joy, and happiness that they bring, and will bring, to our lives. Finally, I would like to thank all my brothers Ahmed, Mohammed, Abdallah, and Ayman and my sisters Deena and Hayat and all their families for their spiritual support, encouragement and prayers. I would like to acknowledge the efforts of my brother Mohammed and thank him for always being there for me whenever needed. I offer my sincere gratitude to my beloved Aunt Hayat for her continuous encouragement and love. I extend my thanks to my mother-in-law and my sisters-in-law Hanan, Madleen, and Nedaa for their prayers and support. Special thanks to Mrs. Layla, Om Mohammed, the friend of the family, for her unforgettable help and love to me and my children and to Mrs. Majeda, Om Raafat, for her spiritual support and prayers.

... and for our passionate conversations we had during our study. I would like to express my gratitude to all my supporting friends, Maitha, Amna, Noha, Manal, and Shehtaj Sultana.

Finally, I would like to thank the United Arab Emirates University for offering me the scholarship that helped me pursue my PhD study.


Dedication

To

my dearest parents,

my beloved husband,

&

my adorable children


Table of Contents

Title
Declaration of Original Work
Copyright
Advisory Committee
Approval of the Doctorate Dissertation
Abstract
Title and Abstract (in Arabic)
Acknowledgements
Dedication
Table of Contents
List of Tables
List of Figures

Chapter 1: Introduction
  1.1 Motivation
  1.2 Text Reuse Detection
  1.3 Related Research
  1.4 Text Reuse and Plagiarism Detection
    1.4.1 Plagiarism Detection Types
    1.4.2 Plagiarism Detection Classes
  1.5 Text Reuse on the Web
  1.6 Text Reuse Detection for Arabic Text
  1.7 Contribution of this Thesis
  1.8 Objectives
    1.8.1 Major Objective
    1.8.2 General Objectives
  1.9 Dissertation Outline

Chapter 2: An Overview of Fingerprinting Techniques
  2.1 Fingerprint Generation
  2.2 Fingerprint Matching
  2.3 Fingerprinting Approaches
    2.3.1 Overlap Fingerprinting Methods
    2.3.2 Non-Overlap Fingerprinting Methods
    2.3.3 Other Approaches
  2.4 Evaluation of Fingerprinting Approaches
    2.4.1 Information Retrieval Evaluation
    2.4.2 Text Reuse and Plagiarism Detection Corpora
    2.4.3 Performance Evaluation of Fingerprinting Techniques

Chapter 3: System Architecture and Text Analysis for Reuse Detection
  3.1 Introduction
  3.2 Web-based Candidate Documents Retrieval
  3.3 Text Pre-processing and Representation
  3.4 Arabic Language Characteristics
  3.5 Text Preprocessing Techniques
    3.5.1 Tokenization
    3.5.2 Character Normalization and Diacritic Removal
    3.5.3 Stemming
    3.5.4 Stop-words Filtering
    3.5.5 Sentence Identification and Segmentation
    3.5.6 Punctuation Removal
  3.6 Text Representation Techniques
    3.6.1 Bag of Words Representation
    3.6.2 N-Gram Representation Model
    3.6.3 Hash Model
  3.7 Local Text Reuse Detection and Text Alignment
  3.8 Post-processing and Presentation
  3.9 Implementation Details
  3.10 Chapter Summary

Chapter 4: Candidate Document Retrieval from the Web
  4.1 Introduction
  4.2 Related Literature
  4.3 Our Proposed Document Retrieval Model
    4.3.1 Notation and Problem Definition
    4.3.2 Model Specification
  4.4 Experimental Results
    4.4.1 A Case Study
    4.4.2 Construction of Candidate Document Retrieval Test Cases
    4.4.3 Best Query Length
    4.4.4 Quality of the Formulated Queries
  4.5 Chapter Summary

Chapter 5: Fingerprinting-based Detection and Detailed Analysis of Text Reuse
  5.1 Introduction
  5.2 Fingerprinting Approaches
    5.2.1 Hailstorm Fingerprinting Method
    5.2.2 Winnowing Fingerprinting Method
  5.3 A Framework for Text Reuse Detection Using A Hybrid Approach
    5.3.1 Candidate Document Retrieval Model
    5.3.2 Pairwise Comparison Model (Text Alignment)
    5.3.3 Reused Text Presentation
    5.4.1 Performance Measures
    5.4.2 Experimental Setup
    5.4.3 Experimental Results and Analysis
    5.4.4 Comparison to Other Fingerprinting Approaches
  5.5 Statistical Analysis
    5.5.1 Experiment Definition
    5.5.2 Experiment Planning
    5.5.3 Experiment Operation
    5.5.4 Experiment Interpretation
  5.6 Chapter Summary

Chapter 6: Conclusion
  6.1 Dissertation Summary
    6.1.1 Dissertation Contributions
  6.2 Future Research

References


List of Tables

Table 2.1: Comparison of fingerprinting techniques based on their selection heuristics. Note that d refers to a document, c and C are a chunk and the set of chunks, respectively
Table 4.1: Characteristics of the collection for Web document retrieval
Table 5.1: Characteristics of the corpus used to evaluate the system [1]
Table 5.2: Micro and macro measures of the DETRA with different values of n for n-gram and sentence size=25
Table 5.3: Micro and macro measures of DETRA with different values of n for n-gram and sentence size=30
Table 5.4: Micro and macro measures of DETRA with different values of n for n-gram and sentence size=35
Table 5.5: Macro measures of the winnowing approach in terms of the F-measure and the overall TRDS compared with our method for the different values of n = 3...7 of n-gram, window size=10, and sentence size=25
Table 5.6: Macro measures of the winnowing approach in terms of the F-measure and the overall TRDS compared with our method for the different values of n = 3...7 of n-gram, window size=10, and sentence size=30
Table 5.7: The statistical measures computed for the four different models based on the number of reused source documents and the overall similarity

List of Figures

Figure 1.1: The general form of text reuse
Figure 1.2: The similarity spectrum as described in [2]
Figure 1.3: Transition relation in near-duplicate text detection does not apply to local text reuse detection
Figure 2.1: The evaluation of 5 fingerprinting algorithms in terms of precision and recall values as reported in [3]
Figure 2.2: The performance evaluation of 5 fingerprinting algorithms in terms of F-measure values using TREC newswire [4]
Figure 3.1: Framework for Web-based Arabic Text Reuse Detection
Figure 3.2: Different valid word reordering of the same sentence in Arabic
Figure 3.3: Two similar words in Arabic (shown in red) that share the same root. However, the two words have different meanings based on the context. The first means dispute, while the second means trees
Figure 3.4: Different representations of the same word (ilm - knowledge) based on the presence or absence of diacritics
Figure 3.5: A complete sentence formed in Arabic by binding clitics to the verb. Unbinding some of the clitics produces longer sentences as in the second example
Figure 3.6: Text preprocessing scheme applied before text reuse detection
Figure 3.7: The set of prefixes and suffixes in the Arabic language, which are stripped off by the stemming algorithms
Figure 3.8: The stopword list used in the stopword removal preprocessing operation
Figure 3.9: Word and character n-grams on Arabic text for different values of n
Figure 3.10: A sample of the visual output presenting a reused text case (enclosed in a box)
Figure 3.11: The graphical user interface of the proposed system
Figure 3.12: An example of text reuse detection of a 2-page report for a set of input documents. Page (1) displays a list of the input documents examined for text reuse instances. The second page (2) shows a detailed sample of reused text instance for input-document0055.txt and the corresponding source document from which it shares reused text
Figure 4.1: The proposed framework of a Web-based text reuse detection for Arabic documents (DETRA). Phase 1 in the DETRA is shaded in blue
Figure 4.2: The building blocks of candidate document retrieval from the Web
Figure 4.3: The part of DETRA's system interface for Web documents retrieval task
Figure 4.4: An example of a document with text copied from other Web documents. Each reused text is displayed in a bounding box. The different colors indicate different source documents. Note that the representative fingerprints of the document are displayed in red
Figure 4.5: The set of randomly generated queries for the sample input document in Figure 4.4. The queries are encoded into UTF-8 character format
Figure 4.6: The top ranked Web documents' URL addresses returned by each query
Figure 4.7: The XML output of the detection task identifying the Web URL addresses of the documents that share text with the input document with high similarity rate
Figure 4.8: The visual output of the detection task presenting part of the reused text which is displayed in red
Figure 4.9: Comparing the different query lengths. The best result obtained with w = 10 terms per query
Figure 4.10: The quality of document retrieval mechanism (precision)
Figure 5.1: Hailstorm fingerprinting applied on a sample text
Figure 5.2: Winnowing fingerprinting applied on a sample text
Figure 5.3: The main processes for detecting local text reuse
Figure 5.4: An example of a text with selected fingerprints based on Hailstorm method
Figure 5.5: The selected fingerprints by our modified Hailstorm method
Figure 5.6: The building blocks of source documents retrieval from a given local document collection
Figure 5.7: The meta information about the reused text cases detected as pairs between the input and the source document
Figure 5.8: A sample of the visual output presenting a reused text case (enclosed in a box) between the displayed document with a source document named: source_document00401
Figure 5.9: Macro text reuse detection score (TRDS) based on sentence length
Figure 5.10: The micro and macro evaluation measures of our DETRA for documents with sentence size=25
Figure 5.11: The overall text reuse detection score (TRDS) of our proposed DETRA for documents with sentence size=25
Figure 5.12: The micro and macro evaluation measures of DETRA for documents with sentence size=30
Figure 5.13: The overall text reuse detection score of DETRA for documents with sentence size=30
Figure 5.14: The micro and macro evaluation measures of DETRA for documents with sentence size=35
Figure 5.15: The overall text reuse detection score of DETRA for documents with sentence size=35
Figure 5.16: An example of selecting a common phrase as a fingerprint, which may cause false positive cases
Figure 5.17: Another example of selecting a common phrase as a fingerprint, which may affect precision/recall scores
Figure 5.18: Comparison between the winnowing method and our DETRA based on the F-measure for sentences sizes=25 and 30. DETRA outperforms the winnowing-based system
Figure 5.19: Comparison between the winnowing-based system and DETRA in terms of the overall text reuse detection scores (Macro TRDS) for sentences size 25 and 30. The proposed system DETRA outperforms
Figure 5.20: The exact fit model (model E), shown in black, and model D of DETRA's system with both stemming and character normalization applied on the input text, shown in blue. The overall accuracy of the system is 92%
Figure 5.21: The actual similarity percentages in model E compared with the similarity percentages computed in model C of the system, where character normalization is applied on the text and stemming is bypassed. The overall accuracy of this model is 87%
Figure 5.22: The exact model E, shown in black, compared with model B without the character normalization process. The overall accuracy of the model is 86%
Figure 5.23: The worst performance of the system in model A, with disregarding of both stemming and character normalization processes. The overall accuracy is 84%
Figure 5.24: The combined four different models of the system with the exact

Chapter 1: Introduction

1.1 Motivation

Text reuse stands for the act of using existing documents in creating new ones. It occurs in various forms based on the amount and method of reuse. Some documents are entirely copied and stored in many different locations. Some other documents are partially reused using one of the different forms of text reuse such as quotations, translations, rewording, summaries, or plagiarism. For example, people could quote text from others' emails in their replies. One could create various versions of a research paper, each of which is likely to have a significant amount of common text that makes them closely related to each other. Many authors may reuse considerable amounts of text from their published work in conference papers to prepare more detailed journal publications of their work. Events or topics may be presented in various ways for different readers. Such presentations may be a copy of one another with few modifications. Therefore, the history of a given topic of interest can be identified given a sufficiently large archive.

Basically, local text reuse occurs when only small parts of text in a document, e.g. sentences, facts, or passages, are copied and modified (see Figure 1.1). The scope of the term local includes a wide range of text transformations. For example, a typical text reuse from the Web, given the huge amount of text on the Web and the easy access to it, may happen by using Web search engines to look for a relevant source, then part of the text from that source is copied and possibly modified. More new statements or facts may be added to the original text, and some parts may be deleted, or partially rewritten, producing a similar document to the original source. Obviously, the level of similarity between such documents would vary substantially depending on the amount of editing performed, from minor edits to complete modification [2]. In this thesis, we are interested in identifying such reused text in Arabic documents.

In general, detecting similar statements, facts, or passages is hard using the existing Web search engines. Several techniques have been previously proposed for detecting text reuse; however, these techniques have been designed for relatively small and homogeneous collections. Our investigation of the state of the art reveals the lack of actual work on text reuse detection on the Web. This is due to the heterogeneity of the information contained on the Web and its large scale, which make the task of text reuse detection on the Web much more difficult than in relatively small and homogeneous collections. However, the setting of the Web provides much more interesting and challenging properties to text reuse detection such as source detection and tracking event evolution. That is, given a sentence or a passage of interest, the goal of source detection is to find the documents from which the discussed topic originated. The result of this research will have a significant influence on the development of tools that can be used to validate information that comes from various sources of differing reliability. Such a tool would be valuable in many applications in education, scientific research, social networks, and national security.

More objectives and contributions of the thesis are addressed in detail later in this chapter. In the next section, the basics of detecting text reuse are presented.

1.2 Text Reuse Detection

Before we start reviewing researchers' approaches for local text reuse detection algorithms, it is necessary for us to have an elementary understanding of text reuse detection and information retrieval.

Text reuse detection has recently received significant attention from many researchers in different fields such as text data mining and information retrieval. Detecting text reuse has many applications that stem from it. For instance, duplicate or near-duplicate detection, which mainly detects documents that are almost similar to the original document except for very few changes, plays an important role in Web search and information retrieval, where it is used to filter the search results list. Plagiarism detection is another interesting application that has attracted the attention of many researchers. It occurs when an author of a document reuses text from one or more documents without acknowledging that reuse.


Figure 1.1: The general form of text reuse (time axis running from past to future)

and edited. Therefore, given a document, the detection problem of local text reuse can be defined as identifying all reused statements within a given document [4]. This task can be accomplished by first looking for candidate documents that are likely to be the original sources of the given document in a collection of documents and then performing an extensive pairwise comparison between the given document and each of the possible sources of text reuse that have been retrieved. Basically, this means comparing the passages of the given document one by one to the passages in the candidate sources, searching for conspicuous similarities. Intuitively, confining the search space to a reasonable and carefully selected number of candidate documents rather than searching all the available documents in the collection would certainly minimize the complexity of the search as well as enhance the efficiency of the system. This step becomes crucial in the setting of the Web. The detection process of text reuse can then be divided into two main sub-processes:

1. Candidate source document retrieval,

2. Pairwise similarity comparison within documents.

The next section provides an overview of the related literature for the local text reuse detection problem.

1.3 Related Research

The degree to which passages of text are considered similar to each other (i.e., local text reuse detection) can be placed somewhere in the middle range of the similarity spectrum [2] (Figure 1.2). At one extreme of this spectrum is the highest degree of similarity. That is, two documents are considered identical. Much of the research that focused on duplicate detection and plagiarism is located at this end of the spectrum, identifying nearly identical documents [5, 6, 7, 8, 9, 10]. Topical similarity, which is the standard task of information retrieval, lies at the other end of the spectrum, which means that two documents are a match if they are topically related to the same information need or query. This traditional area of research has also been studied extensively. Little prior research, however, deals with the intermediate form of similarity on the similarity spectrum, which is where text reuse is located.

The objective of this research is to develop methods for detecting the reuse of facts and concepts at the passage or sentence level. This degree of semantic similarity is a stronger form of topical relevance, but does not impose the syntactic similarity constraints typical of copy detection systems [11].

Figure 1.2: The similarity spectrum as described in [2] (from Near Duplicates at one end, through Local Text Reuse, to Topical Similarity at the other end)

As a research field, text reuse has received much attention. Researchers from the University of Sheffield initiated the research with a project named METER (MEasuring TExt Reuse). The aim of this project was the automated detection of text reuse within the specific domain of journalism [12, 13]. A follow-up of this project was made by researchers from the University of Lancaster, who used the technologies developed in METER to analyze text reuse in journalism from the 17th century [14]. A third project has started at the University of Massachusetts Amherst to explore methods for text reuse on the Web [2]. It is worth noting that although plagiarism detection is considered a special case of the general field of text reuse detection, this particular application has been widely studied in the literature and many techniques have been proposed [3]. Surveys of text-based plagiarism detection can be found in [15, 16, 17], and surveys of code-based plagiarism detection techniques can be found in [18, 19].

Detecting exact copies is fairly simple and has been widely studied [10, 20, 21]. Detecting partial copies, i.e., looking for documents that are almost identical to the original document except for a few changes such as the insertion or paraphrasing of a few words (well known as near-duplicate or near-similarity detection), is complex and has been a major focus of researchers [22, 23, 24, 25]. It is even more complex when the reused text is modified or paraphrased, as is the case with local text reuse. Detecting this kind of local reuse can be used as a basis for new and powerful tools ranging from document management and information retrieval to new information-seeking behavior in Web search, such as the retrieval of related documents or detecting the origin of text segments [26, 27].

Generally, the algorithms used for near-duplicate document detection do not work well for local text reuse detection [4, 28]. This is due to the nature of local text reuse, which reuses only small parts of a document from other sources. More precisely, near-duplicate document detection algorithms assume a transitive relation between documents: document A should be a near-duplicate of a document C, given that document A is a near-duplicate of a document B, which is a near-duplicate of document C. Figure 1.3 shows how this assumption is violated in the case of local text reuse detection. Note that the symbol on the edges of the figure denotes an indicator function that evaluates to 1 if two documents have a text reuse relationship [4].

Figure 1.3: Transitive relation in near-duplicate text detection does not apply to local text reuse detection (nodes: Document A, Document B, Document C)
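The violated assumption can be made concrete with a small Python sketch. The three toy documents and the indicator `reuse`, which here simply checks for a shared sentence, are illustrative assumptions and not the definition used in [4].

```python
def sentences(text):
    """Split a toy document into a set of normalized sentences."""
    return {s.strip().lower() for s in text.split(".") if s.strip()}


def reuse(doc_x, doc_y):
    """Toy indicator: 1 if the two documents share at least one sentence, else 0."""
    return 1 if sentences(doc_x) & sentences(doc_y) else 0


# Hypothetical documents: B borrows one sentence from A and a different one from C.
doc_a = "The council approved the budget. Schools will open in May."
doc_b = "The council approved the budget. The bridge was closed for repairs."
doc_c = "The bridge was closed for repairs. Traffic was diverted north."

# A-B and B-C are reuse pairs, yet A-C is not: the relation is not transitive.
print(reuse(doc_a, doc_b), reuse(doc_b, doc_c), reuse(doc_a, doc_c))
```

Because B reuses a different part from each neighbour, clustering documents by chaining pairwise matches, as near-duplicate methods may do, would wrongly group A with C.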


each other have been proposed, most of them belong to one of the following three approaches as defined by [29]:

Substring Matching. In this approach, techniques such as Greedy String Tiling [30] and Local Alignment [31] are used to identify maximum matches in pairs of strings, which are then used as reuse indicators [32]. These strings are represented in suffix trees, and graph-based metrics are used to evaluate the amount of text reuse [33]. However, the computational complexity of these methods in terms of time and storage is very high. For example, the worst-case complexity of the standard Greedy String Tiling approach is proved to be $O(n^3)$, where $n$ is the length of the string [30]. On the other hand, the complexity of the Local Alignment algorithm, given two strings $A$ and $B$, is $O((|A| + |B|)^3)$.

Keyword Similarity. In this approach, representative topically related keywords are collected and weighted from a given document. These keywords are then compared to the keywords extracted from the other documents. If a match is found between the documents' keywords, the documents with potential similarity are split into smaller portions of text, where these portions are compared recursively. This approach assumes that similarity is found only in documents with similar topics [34]. Another similar approach compares text passages against their relative frequency of word occurrences [10, 11]. Two passages that share similar word frequencies relative to each other are considered similar. This approach, again, assumes documents with similar topics.
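As a rough illustration of the frequency-based comparison, the sketch below computes the relative word frequencies of two passages and compares them. The use of cosine similarity and the function names are assumptions made for this example; the cited works may weight and compare frequencies differently.

```python
import math
import re
from collections import Counter


def relative_frequencies(passage):
    """Map each word to its share of all word occurrences in the passage."""
    words = re.findall(r"\w+", passage.lower())
    counts = Counter(words)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()} if total else {}


def cosine_similarity(freq_a, freq_b):
    """Cosine of the angle between two sparse frequency vectors."""
    dot = sum(v * freq_b.get(w, 0.0) for w, v in freq_a.items())
    norm_a = math.sqrt(sum(v * v for v in freq_a.values()))
    norm_b = math.sqrt(sum(v * v for v in freq_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical passages: p2 rewords p1, p3 is about something else.
p1 = relative_frequencies("the minister opened the new school")
p2 = relative_frequencies("the new school was opened by the minister")
p3 = relative_frequencies("rain is expected over the weekend")

print(cosine_similarity(p1, p2) > cosine_similarity(p1, p3))
```

The reworded passage scores higher than the unrelated one, but two unrelated passages on the same topic would also score high, which is the limitation noted above.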

Fingerprint Analysis. The most popular and successful approach to near-similarity search and local text reuse detection is the detection of overlapping document fingerprints [20, 35]. Documents are divided into term sequences (chunks), which are then converted into a numeric form representing the document fingerprints. Then, whenever two documents share one or more fingerprints, this most probably indicates reused text between the two documents, while maintaining an acceptable runtime and storage behaviour [10, 21]. Chapter 2 overviews most of the available fingerprinting techniques for local text reuse detection. The next section describes the general aspects of plagiarism detection.

1.4 Text Reuse and Plagiarism Detection

Plagiarism is the act of unacknowledged reuse of someone else's work. Text plagiarism is an illegal reuse of text from one or more documents without being properly cited [36].

Although plagiarism detection is considered a special case of the general term text reuse, this particular application has been widely studied in the literature [15, 16, 17, 37].

1.4.1 Plagiarism Detection Types

The main objective of plagiarism detection tools is to detect and prevent the two main types of plagiarism, namely plagiarism in programming languages and plagiarism in natural language (also called textual plagiarism).

One of the earliest models for automated plagiarism detection reported in the literature focused on detecting plagiarism in students' source code [38]. Detecting plagiarism cases in programming languages is considered less complicated when compared to natural language. The detection involves tracking the main elements of a given program, such as line numbers, statements, and method calls, and is well researched in the literature [18, 19]. Examples of plagiarism detection tools for programming code are MOSS (Measure Of Software Similarity), YAP/YAP3 (Yet Another Plague), and SID (Software Integrity Diagnosis or Shared Information Distance).

The detection task becomes more complicated when dealing with documents in natural language, as the text may easily get modified or paraphrased. The most researched case is when a text in a document is a duplicate or a near-duplicate of another document. However, the complexity of the problem increases when paraphrasing takes place. The difficulty of the problem varies from simple word insertions, deletions, or substitutions to complete reformulation or even word reordering, as is the case for free-word-order languages. Natural language plagiarism detection can be classified into two main types: uni-lingual and multi-lingual plagiarism detection. Most researchers have addressed the uni-lingual plagiarism detection type, which deals with the automatic identification of textual plagiarism cases in documents with the same language, for instance, Arabic-Arabic. Multi-lingual plagiarism detection, on the other hand, addresses the automatic detection of plagiarism between documents with different languages, for example, English-Arabic.

1.4.2 Plagiarism Detection Classes

Textual plagiarism detection can also be divided into two major problem classes, namely external plagiarism detection and intrinsic plagiarism detection. In the case of external plagiarism, the process searches for candidate sources (from a large collection of documents) from which a suspicious document may have reused parts of its text [39]. On the other hand, intrinsic plagiarism detection focuses on the changes in the writing style of a given suspicious document by identifying parts of the document that differ significantly from the remaining text within the same document. Note that there is no reference collection used in this class of plagiarism detection [40].

Many commercial plagiarism detection applications have been developed in the last decade. Some of these detectors are based on fingerprinting techniques addressed in this research, while other products such as Academic Plagiarism Checker, PlagiarismDetect, Turnitin, and Grammarly are for commercial purposes and their underlying technical information is kept hidden.

Turnitin is considered one of the most popular plagiarism detection systems. It has been reported that it uses document collections as well as Web pages and that 98% of the UK universities use it for plagiarism detection purposes [41]. On the other hand, its main technology relies on finding string matches of size 8 to 10 words, as revealed by the analysis in [42]. In particular, the system performs well only for exact copies and is probably not as good in the case of paraphrased plagiarism.


1.5 Text Reuse on the Web

In this section, we describe the main issues and challenges that are related to local text reuse detection on the Web.

Several techniques have been proposed for detecting local text reuse on the Web. However, these techniques have been developed to work on Latin languages and on relatively small and topically related data sets that have been created through crawling specific Web pages. To the best of our knowledge, no prior work has been conducted on text reuse detection on the Web. Bendersky and Croft [2], however, have recently suggested a framework for detecting English text reuse on the Web. They focused on retrieving and weighting key concepts related to the statement of interest, and defining the time-line of the statement itself and of the Web pages related to it. However, the suggested text reuse detection system has not been implemented in practice on the Web.

Chiu et al. [28] presented a prototype text reuse searching interface based on the architecture proposed in [2]. They investigated the feasibility of a text reuse architecture for the Web. The system first creates an initial document set from the Web and then applies a k-gram based fingerprinting method on the retrieved set. The steps include querying for the documents to be used for the reuse detection, downloading them, and finally applying fingerprinting techniques to compute the similarity scores and define the time-line of reuse. Obviously, the performance of the whole system depends first on which documents are retrieved at the first step, and then on the text reuse detection method. Therefore, the focus should be on new techniques that will advance the current research.

A straightforward approach to find documents on similar topics is to extract keywords from the input document and to retrieve other documents also containing these keywords [43]. Our contribution to this problem is a strategy to formulate queries to a Web search engine in order to retrieve candidate source documents. Using the Web as a source requires techniques for automatic query generation and formulation to identify potential candidates, since the Web is the typical place of text reuse.
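The straightforward keyword-based approach can be sketched as follows. This is only an illustration of the baseline idea, not the query formulation strategy developed in this thesis; the stop-word list, the frequency-based ranking, and the function name `keyword_query` are assumptions, and no search engine is actually called.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; a real system would use a full list).
STOP_WORDS = {"the", "a", "an", "of", "in", "on", "and", "to", "is", "was", "for", "by"}


def keyword_query(document, num_terms=5):
    """Build a query string from the most frequent non-stop-word terms."""
    words = [w for w in re.findall(r"\w+", document.lower()) if w not in STOP_WORDS]
    top_terms = [term for term, _ in Counter(words).most_common(num_terms)]
    return " ".join(top_terms)


# Hypothetical input document.
doc = ("Solar power plant opens in the desert. "
       "The solar plant will power the city. Solar energy is clean.")
query = keyword_query(doc, num_terms=3)
print(query)
```

The resulting string would then be submitted to a search engine, and the returned pages would serve as candidate source documents for the pairwise comparison step.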


1.6 Text Reuse Detection for Arabic Text

The Arabic language is a morphologically rich language that is among the most used languages in the world and on the Web as well. However, little or no attention has been given to this language in the area of text reuse detection and even plagiarism detection. While the local text reuse detection problem has been mostly studied in the literature for Western languages, it is still one of the biggest challenges in the Arabic language, and the research has remained quite limited. While there are a few works related to plagiarism detection in Arabic on relatively small collections, we confidently state that there is no available previous work or system on text reuse detection for Arabic-based documents on the Web.

It is worth mentioning that the only works concerning text reuse, and plagiarism detection in particular, are those by Alzahrani et al. [44], Jadallah et al. [45], Menai [46], and Jaoua [47]. All of them focused on plagiarism detection of Arabic-based documents in relatively small collections. Besides, all of them are particularly tailored for the external approach of plagiarism detection, while the work done by Bensalem [48] focused on the intrinsic approach and is, hence, less relevant to our work.

1.7 Contributions of this Thesis

In this thesis, we intend to develop techniques for finding text reuse on the Web. That is, our objective is to develop text reuse detection methods that can detect alternative versions of the same information from the Web or using any Arabic-based document collection. The result of this research can be thought of as rich tools for information analysts to validate and assess information coming from uncertain sources. The system would prove useful for detecting reuse in scientific literature too. It is also time for ordinary Web users to become "Fact Inspectors" by providing a tool that allows people to quickly check the validity and originality of statements and their sources, so that they will be given the opportunity to perform their own assessment of information quality. The main contributions of this thesis are summarized as follows:


• We propose a new query formulation model for candidate source document retrieval at the Web scale. The queries are formulated by extracting a set of representative terms from a given document that form a signature of that document in order to query a Web search engine via a public search API. The set of queries is selected in such a way that ensures a global coverage of the document.

• We propose a new text reuse detection technique based on fingerprinting to detect local text reuse in the setting of the Web. Our technique is hybrid in the sense of applying fingerprinting selection methods together with Information Retrieval heuristics to produce better and more robust results and to accommodate the complexity of the Arabic language.

• We develop a more effective and interactive Web search tool that allows long queries to be used rather than short keywords, which acts as an important sub-task in the text reuse detection problem on the Web.

1.8 Objectives

1.8.1 Major Objectives

The major objectives of this dissertation are the following:

1. To design a system for detecting text reuse for Arabic documents on the Web.

2. To focus on developing efficient Web query formulation techniques to retrieve closely related documents from the Web based on the input document being tested.

3. To design models for detecting text reuse that includes paraphrasing, word deletion, and word reordering.

4. To analyze the impact of text pre-processing techniques such as stemming, lemmatization, and stop-word removal on the overall detection system.


6. To construct a corpus for the evaluation of both sub-systems, the candidate document retrieval from the Web and the pairwise comparison between pairs of similar documents.

7. To evaluate the effectiveness of the text reuse detection model and address the problems that arise in the Web environment.

1.8.2 General Objectives

The general objectives of this research are the following:

(1) To review the problem of detecting text reuse and its impact on the design of tools for retrieving information and its analysis.

(2) To review the state-of-the-art technology for text reuse detection, identifying the challenges behind the middle level of the similarity spectrum.

(3) To design models for detecting text reuse, capturing some text edits such as synonym substitution, word deletion, and word reordering.

(4) To review existing fingerprinting techniques for text reuse detection.

(5) To investigate and test the feasibility of text reuse detection at the Web scale in the setting of Web search and discover the advantages and drawbacks of using the Web as a source.

(6) To address the challenges that would arise from detecting reuse in Arabic documents.

1.9 Dissertation Outline

This dissertation is structured as follows:

In Chapter 2, we overview several local text reuse detection methods based on fingerprinting techniques. We first define the context of local text reuse and situate it within the general spectrum of information retrieval in order to pinpoint its particular applicability and challenges. After a brief description of the major text reuse detection approaches, we introduce the general principles of fingerprinting algorithms from an information retrieval perspective. Three classes of fingerprinting methods (overlap, non-overlap, and randomized) are surveyed. Specific algorithms, such as k-gram, winnowing, hailstorm, DCT, and hash-breaking, are described. A number of benchmark corpora for Latin documents that are used to evaluate the performance of text reuse detection systems are presented. Finally, the performance measures that are commonly used for evaluation are described and the characteristics of the mentioned algorithms are summarized based on data from the literature.

In Chapter 3, we present the system architecture and describe its phases. The Arabic language characteristics are also presented in this chapter. Text preprocessing and representation techniques used before applying the indexing and detection methods are also discussed. Finally, some implementation details are presented.

In Chapter 4, we define a new method of Web query formulation for the problem of candidate document retrieval from the Web. We evaluated the work using a collection of documents especially constructed for the evaluation of Web document retrieval. The experiments show that, on average, around 80% of the Web documents used in the reused cases were successfully retrieved.

In Chapter 5, we describe in detail the general framework to detect text reuse in Arabic text. A set of experiments were conducted to gain an insight into the effect of the unique features of the Arabic language on the main components of the general framework of local text reuse detection. Query expansion is integrated into the retrieval phase in order to deal with edited text. Queries are expanded with term deletion, word reordering, and synonym substitution. We have evaluated the performance of the detection component of our system using a relatively large corpus that has been recently created for the detection of text reuse and plagiarism in Arabic documents. The results demonstrate that the query expansion tool improves the overall performance of the document retrieval phase compared to queries without expansion.
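The three expansion operations named above can be illustrated with a small sketch. It is a simplified assumption of how such variants might be generated (one edit per variant, adjacent swaps only, and a hypothetical synonym table passed in by the caller); the procedure actually used in Chapter 5 may differ.

```python
def expand_query(terms, synonyms):
    """Return query variants produced by single-term deletion, adjacent swaps,
    and single-term synonym substitution (the original query comes first)."""
    terms = list(terms)
    variants = [terms]
    # Term deletion: drop one term at a time.
    for i in range(len(terms)):
        variants.append(terms[:i] + terms[i + 1:])
    # Word reordering: swap each pair of adjacent terms.
    for i in range(len(terms) - 1):
        swapped = list(terms)
        swapped[i], swapped[i + 1] = swapped[i + 1], swapped[i]
        variants.append(swapped)
    # Synonym substitution: replace one term with each listed synonym.
    for i, term in enumerate(terms):
        for alt in synonyms.get(term, []):
            variants.append(terms[:i] + [alt] + terms[i + 1:])
    # Deduplicate while preserving order; skip empty queries.
    seen, result = set(), []
    for variant in variants:
        query = " ".join(variant)
        if query and query not in seen:
            seen.add(query)
            result.append(query)
    return result


# Hypothetical query and synonym table.
expanded = expand_query(["new", "school", "opened"], {"school": ["academy"]})
print(expanded)
```

Each variant can be issued as a separate query, so that a source whose wording was edited in one of these ways still has a chance of being retrieved.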


Chapter 2: An Overview of Fingerprinting Techniques

This chapter provides an in-depth study of a wide range of fingerprinting methods. We first introduce the general principles for any fingerprinting algorithm from an information retrieval perspective and then survey the relevant algorithms from the literature. As mentioned before in Chapter 1, unlike near-duplicate document detection, our focus is on algorithms that detect local text reuse based on parts of documents. For example, a reused text is typically modified to conform with the content of the document. Therefore, technologies to detect such cases of text reuse cannot rely on the near-duplicate methods, in which the reused text is close to identical to its original, but rather have to account for modification of small parts of the documents. Such local text reuse detection methods are based on a technique called fingerprinting.

A fingerprint $h(d)$ of a document $d$ can be considered as a set of encoded substrings taken from $d$, which serve as a representation of the document $d$. These encoded substrings are usually formed using the n-gram method, which divides the document into a set of contiguous substrings of length n (the n-gram method is described in more detail in Section 2.3).

Each n-gram is then hashed and a subset of these hashes is selected to be the document's fingerprint. Note that $h$ denotes the hash function. Other types of fingerprinting methods are formed by reformulating $d$ as a whole to obtain a simplified representation of $d$ while maintaining as much information as possible of the content of the document. Such fingerprints of the document are used for the local text reuse detection. In general, a good fingerprinting technique should satisfy the following properties:

• Accuracy: the fingerprints generated by the fingerprinting technique must be accurate enough to represent the document.

• Efficiency: the generated fingerprints must be as small as possible.

Considering the above properties, we introduce several fingerprinting methods. However, most of these techniques share the following two operations: fingerprint generation and fingerprint matching.
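The two operations can be sketched in a few lines of Python. The sketch assumes word-level n-grams, an MD5-based hash, and a simple selection rule that keeps the hashes divisible by a parameter `p`; these choices and all function names are illustrative assumptions rather than the specific methods surveyed in this chapter.

```python
import hashlib


def ngrams(text, n=3):
    """Return the contiguous word n-grams of a text."""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def h(chunk):
    """Hash an n-gram to an integer (first 8 bytes of its MD5 digest)."""
    return int.from_bytes(hashlib.md5(chunk.encode("utf-8")).digest()[:8], "big")


def fingerprint(text, n=3, p=1):
    """Fingerprint generation: keep the hashes divisible by p (p=1 keeps all)."""
    return {value for value in map(h, ngrams(text, n)) if value % p == 0}


def shared_fingerprints(doc_a, doc_b, n=3, p=1):
    """Fingerprint matching: number of hash values the two fingerprints share."""
    return len(fingerprint(doc_a, n, p) & fingerprint(doc_b, n, p))


# Hypothetical documents: b reuses a short passage of a, c does not.
a = "the quick brown fox jumps over the lazy dog"
b = "yesterday the quick brown fox ran away"
c = "rain is expected over the weekend"

print(shared_fingerprints(a, b), shared_fingerprints(a, c))
```

Raising `p` shrinks the fingerprint, which serves the efficiency property, at the cost of possibly discarding the very n-grams that reveal a reused passage, which affects accuracy.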