Lexical Semantic in Arabic Plagiarism Detection Using Winnowing Algorithm (Word Level)


Vol. 28, No. 12, (2019), pp. 16-24


Zahraa Jasim Jabir1, Ahmed H. Aliwy2

1,2Department of Computer Science, Faculty of Computer Science and Mathematics, University of Kufa, Najaf, Iraq

Abstract

Plagiarism is an electronic crime that violates the rights of authors and publishers. Fingerprint algorithms are used in the indexing process. We extracted a fingerprint for every five words using a hash function. The winnowing algorithm was applied to words instead of letters to select the fingerprints and to reduce the size of the index.

The winnowing algorithm was applied with and without synonym replacement to compute the percentage of plagiarized grams in two ways: (i) the percentage of plagiarism in the suspicious document with respect to a file in the dataset and (ii) the percentage of plagiarism in a file in the dataset with respect to the suspicious file. Precision, recall, F-measure and error rate were then estimated. With synonym replacement, the first methodology gave precision, recall, F-measure and error rate of 0.84861, 1, 0.889206 and 0.034461 respectively, and the second methodology gave 0.852961, 1, 0.892954 and 0.030137 respectively.

Without synonym replacement, the first methodology gave 0.849207, 1, 0.889764 and 0.034452, and the second methodology gave 0.852554692, 1, 0.892920502 and 0.030153138, for precision, recall, F-measure and error rate respectively.

Keywords: Plagiarism detection, Arabic text, Winnowing, Fingerprinting

1. Introduction

Plagiarism means that someone uses the ideas, language or results of someone else without giving appropriate credit to that person, so it is considered a problem in scientific research. Plagiarism detection started in the 1970s as a very difficult problem, but with the development of technology it has become practical [1]. Detecting plagiarism in English is easier than in Arabic because of the complexity of the Arabic language at many language processing levels, such as morphology, syntax and semantics. Plagiarism detection can be viewed as an information retrieval task carried out by specialized IR systems. Researchers and companies have suggested many plagiarism detection approaches and systems, all of which fall into two types: external detection methods and internal detection methods.

In external methods, the suspicious document is compared with a set of original sources [2]. Content-based methods are among the most important external methods. On the other hand, internal methods are statistical methods that rely on analysis of the author's writing style and do not compare the suspicious text with a collection of original references [3]. The most important internal method is stylometry.

zahraaj.altalakany uokufa.com ahmedh.almajidy uokufa.com


2. Related Work

Several works on plagiarism detection and fingerprinting, some of them for the Arabic language, are presented below.

Schleimer et al. [4] introduced techniques for local fingerprinting, where they used algorithms such as winnowing and Karp-Rabin hashing, and concluded that the winnowing algorithm is efficient and offers a detection guarantee. They discussed several experiments and showed the effectiveness of the winnowing algorithm on real data.

Elbegbayan [5] presented a literature study of Winnowing, a fingerprinting algorithm for documents. Winnowing selects fingerprints from the hashes of k-grams, where a k-gram is a contiguous substring of length k. The study offers a document fingerprinting example to show the performance of the algorithm, and concludes that Winnowing is a document fingerprinting algorithm that is both efficient and guarantees that matches of a certain length are detected in documents.

Jadalla and Elnagar [6] used the fingerprinting method, with n-grams for chunking the text into words and the winnowing algorithm to reduce the index size. The fingerprints of each sentence are its n-grams, represented by hash codes, and the winnowing algorithm computes the fingerprints for each sentence. As a result, the search time was improved and the detection process became more accurate. The experimental results showed a recall of 94% and a precision of 99%.

Zaher et al. [7] presented the efficiency of the APDS system, a web-enabled system for detecting Arabic plagiarism that is integrated with e-learning systems to check students' assignments, papers and dissertations. The authors evaluated APDS in terms of accuracy and recall. The results showed an average accuracy of 82% and a recall of 92.5%.

3. Plagiarism Detection Approaches

The most common methods of detecting plagiarism are either external or intrinsic. In external methods, suspicious texts are compared with a set of classified documents known as reference documents. In intrinsic methods, the suspicious document is analyzed on the basis of the author's writing style.

Intrinsic Approaches Plagiarism Detection

Intrinsic methods analyze the text to be evaluated without comparing it to external documents. This approach aims to recognize changes in the unique writing style of an author as an indicator of potential plagiarism. It can be one of: grammar-semantics hybrid plagiarism detection methods, structure-based methods, stylometric-based methods and syntax-based methods [8].

External Approaches Plagiarism Detection

External methods of detecting plagiarism rely on comparing the suspicious text with a set of classified documents. Recent methods use semantic-based approaches (for example, synonyms) to cover modified content. External plagiarism detection comprises many methods, such as: Character-Based Methods, Cross-Lingual Plagiarism Detection, Semantic-Based Methods, Based Plagiarism Detection, Classification and Cluster-Based Methods, Citation-Based Methods and Content-Based Methods [9].

Preprocessing

Preprocessing is applied to the corpus and to the suspicious document. The proposed system has four preprocessing steps: normalization, tokenization, stop word removal and light stemming.


Normalization

Normalization, for the Arabic language, is the unification of Arabic letters together with the removal of diacritics and Tatweel. Some Arabic letters have multiple forms; for example, the letter "ا" has the forms (أ, آ, إ), which should be unified.
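The normalization step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `normalize` and the exact character ranges are assumptions, and only the unifications named in the text (alef forms, diacritics, Tatweel) are covered.

```python
import re

# Assumed character classes (not taken from the paper):
# Arabic diacritics (tashkeel) occupy U+064B..U+0652, Tatweel is U+0640.
DIACRITICS = re.compile(r"[\u064B-\u0652]")
TATWEEL = "\u0640"
# Alef with hamza above/below and alef with madda, unified to bare alef.
ALEF_VARIANTS = re.compile(r"[\u0623\u0625\u0622]")
BARE_ALEF = "\u0627"


def normalize(text: str) -> str:
    """Remove diacritics and Tatweel, then unify the alef forms."""
    text = DIACRITICS.sub("", text)
    text = text.replace(TATWEEL, "")
    return ALEF_VARIANTS.sub(BARE_ALEF, text)
```

Applying the same function to both the corpus and the suspicious document keeps the two sides comparable, which is the point of the step.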

Tokenization

Tokenization is the chunking of text into its smallest units, either letters or words. Word boundaries are based on white space and punctuation marks (".", " ", ",", ";", ...) as separators between words. In the proposed system this stage is important for plagiarism detection in Arabic texts because many word types, such as prepositions and pronouns, are merged with other words without spaces, which makes Arabic more difficult to deal with than English.

Stop Words Removal

Stop words are words that carry no meaning on their own. Deleting them does not affect the meaning of the sentence; they are used only to complete the sentence structure. They are the most frequent words in texts, such as pronouns, prepositions and particles [10].
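Tokenization and stop word removal can be sketched together as below. The stop word list is a tiny hypothetical sample chosen for illustration, not the list used in the paper, and the tokenizer only splits on white space and punctuation; it does not separate attached prepositions or pronouns.

```python
import re

# Hypothetical stop-word sample for illustration only.
STOP_WORDS = {"في", "من", "على", "عن", "هو", "هي"}

# Assumption: a token is a maximal run of Arabic letters (U+0621..U+064A);
# white space, digits and punctuation act as separators.
TOKEN = re.compile(r"[\u0621-\u064A]+")


def tokenize(text: str) -> list:
    """Split text into word tokens on white space and punctuation."""
    return TOKEN.findall(text)


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop every token that appears in the stop-word list."""
    return [t for t in tokens if t not in stop_words]
```

If normalization runs first, the stop word list should be stored in the same normalized form so that lookups still match.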

Light Stemming

Stemming is the process of removing affixes, namely (i) prefixes added to the beginning of the word, (ii) infixes added to the middle of the word, and (iii) suffixes added to the end of the word, to reduce words to their stems or roots, on the assumption that some words share the same stem [11].
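A light stemmer in the spirit of this step might look like the sketch below. The prefix and suffix lists, the minimum stem length and the name `light_stem` are all assumptions made for illustration; the paper does not list the affixes it removes. The sketch strips at most one prefix and one suffix and does not handle infixes.

```python
# Illustrative affix lists (assumed, not taken from the paper).
PREFIXES = ("وال", "بال", "كال", "فال", "ال", "لل", "و")
SUFFIXES = ("ها", "ان", "ات", "ون", "ين", "ية", "ه", "ة", "ي")


def light_stem(word: str, min_len: int = 3) -> str:
    """Strip one prefix and one suffix, longest first, keeping at
    least min_len letters in the remaining stem."""
    for prefix in sorted(PREFIXES, key=len, reverse=True):
        if word.startswith(prefix) and len(word) - len(prefix) >= min_len:
            word = word[len(prefix):]
            break
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_len:
            word = word[:-len(suffix)]
            break
    return word
```

The length guard is a common safeguard against stripping letters that belong to a short root.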

Synonym Replacement

Most plagiarized texts are modified to prevent plagiarism detection tools from detecting them. This is done in many ways, the most important of which are using synonyms and reordering phrases. Therefore, words that are similar in meaning should be unified, which makes the detection process more effective. This is achieved by converting similar words to their most common synonym, which in turn helps to detect advanced forms of disguised plagiarism, as shown in Table 1.

Table 1: Synonyms example
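The replacement of each word by its most common synonym can be sketched with a lookup table. The synonym groups below are hypothetical examples, and treating the first word of each group as the most common one is an assumption made only for this illustration.

```python
# Hypothetical synonym groups; the first entry of each group is
# treated as the canonical (most common) word.
SYNONYM_GROUPS = [
    ["بيت", "منزل", "دار"],
    ["سنة", "عام"],
]

# Map every word of a group to the group's canonical word.
CANONICAL = {word: group[0] for group in SYNONYM_GROUPS for word in group}


def replace_synonyms(tokens):
    """Replace each token by its canonical synonym; words without an
    entry are left unchanged."""
    return [CANONICAL.get(token, token) for token in tokens]
```

In practice the dictionary entries would need to be stored in the same normalized and stemmed form as the tokens they are matched against.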

N-gram

An n-gram is a contiguous sequence of n items from a given sample of text or speech. The items can be phonemes, syllables, letters, words or base pairs, according to the application. When the items are words, n-grams may also be called shingles. N-grams are used extensively in authorship analysis methods, where an n-gram can be defined as a series of contiguous words.
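For word-level items, overlapping n-grams can be produced with a sliding window. The sketch below assumes the input is a list of already preprocessed word stems; the name `word_ngrams` is hypothetical.

```python
def word_ngrams(stems, n=5):
    """Return all overlapping n-grams of a list of word stems.

    A document with m stems yields m - n + 1 grams, or none if m < n.
    """
    return [tuple(stems[i:i + n]) for i in range(len(stems) - n + 1)]
```

The length of the returned list is the total number of grams of the document, which can later serve as a denominator when a plagiarism ratio is computed.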

Most IR systems use a single word as the basis for indexing, but in some cases more than one word is used as the indexing unit, which is called an n-gram. To our knowledge, no plagiarism detection method has used n-grams for indexing and matching to obtain an exact percentage; n-grams are used only as a preprocessing step in fingerprint methods to give approximate candidates. Our methodology suggests using n-grams for indexing and extracting the percentage of plagiarized text without any further processing. In this paper, n-grams are used at the word level (light stems). The corpus (data set) is partitioned into n-grams once, after the preprocessing and synonym replacement steps are completed, and hence each n-gram is a sequence of n stems. These n-grams are stored in a suitable data structure for efficient access. Any suspicious text undergoes the same processing (n-gram). In addition, the total number of grams is calculated for each document, to be used in calculating the plagiarism ratio.

Fig. 1 shows the 5-grams of the selected file.

Fig. 1: 5-grams of the selected text file after preprocessing and synonym replacement

Fingerprinting

Fingerprinting is the most commonly used method in content-based and external plagiarism detection. It works by representing the main content of the document as a set of integers called a fingerprint. The fingerprint can be generated by many techniques, such as k-grams, a k-gram being a series of characters or words of length k. The advantage of fingerprinting is that it reduces the size of the texts being compared and hence increases the speed of comparison without losing any of the comparisons. To extract document fingerprints, the document should be chunked into smaller units. The chunking can be at the sentence, word or character level. In all cases the partitioning of the document is based on the coefficient n [12]. There are many fingerprint techniques, including non-overlapping methods (hash-breaking and DCT fingerprinting) and overlapping methods (n-gram and winnowing). The fingerprint generation process consists of four steps [6]:

• The hash function: it generates the hash values of the document substrings, as shown in equation 1.

• The granularity: the size of the chunk extracted from the document.

• The resolution: the number of hash values generated by the hash function.

• The strategy for selecting substrings from the document.
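The hash function step can be illustrated as below. The paper's own hash function is not reproduced in this text, so the sketch substitutes a truncated SHA-256 digest as a stand-in; the names `gram_hash` and `fingerprints` and the 64-bit width are assumptions.

```python
import hashlib


def gram_hash(gram, bits=64):
    """Map one word n-gram (a tuple of stems) to a stable integer.

    Assumption: a truncated SHA-256 digest stands in for the paper's
    hash function, which is not specified here.
    """
    data = " ".join(gram).encode("utf-8")
    digest = hashlib.sha256(data).digest()
    return int.from_bytes(digest[: bits // 8], "big")


def fingerprints(grams):
    """Hash every n-gram of a document, preserving order."""
    return [gram_hash(g) for g in grams]
```

A digest-based hash is used instead of Python's built-in `hash` because the built-in string hash changes between interpreter runs, which would make a stored index unusable.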

Winnowing

Winnowing is a strategy proposed by Schleimer et al. [4] that selects fingerprints from the hashes of k-grams. The upper bound on the algorithm's performance reflects a trade-off between the number of selected fingerprints and the shortest match that is guaranteed to be detected. After preprocessing, the document is partitioned into tokens using n-grams. A hash function is then applied to these tokens to produce a numerical representation (hash values). These values are partitioned in the same way as in the first step, using windows of fixed size w. According to a selection criterion, one hash value is selected from each window. The selected hash values represent the fingerprint of the document. The window size is determined by the minimum length of a match that must be detected (guarantee threshold t) and the length below which matches are neglected (noise threshold k), where w = t - k + 1 and t >= k [13].

k and t can be selected by the user. The number of winnowing fingerprints for a document D is then estimated by:

Number of winnowing fingerprints in D = 2 / (w + 1) * number of n-grams in D [14].
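The window selection can be sketched as follows. This is a minimal illustration under stated assumptions: the minimum hash of each window is selected, ties are broken by taking the rightmost minimum as in Schleimer et al. [4], and positions are recorded so that the same occurrence is not selected twice. The function names are hypothetical.

```python
def window_size(t: int, k: int) -> int:
    """Window size w = t - k + 1 for guarantee threshold t and noise threshold k."""
    if t < k:
        raise ValueError("t must be at least k")
    return t - k + 1


def winnow(hashes, w):
    """Return (hash, position) pairs selected by winnowing.

    For every window of w consecutive hashes the minimum value is chosen
    (rightmost occurrence on ties); a position is recorded only once.
    """
    selected = []
    last_pos = -1
    for start in range(len(hashes) - w + 1):
        window = hashes[start:start + w]
        smallest = min(window)
        # Index of the rightmost occurrence of the minimum in this window.
        pos = start + (w - 1 - window[::-1].index(smallest))
        if pos != last_pos:
            selected.append((smallest, pos))
            last_pos = pos
    return selected
```

For example, t = 8 and k = 5 give a window size of w = 4. Because neighbouring windows usually share their minimum, far fewer hashes are kept than were computed, which is where the reduction in index size comes from.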

Data Set

A data set is a large collection of text files that are stored and processed electronically.

Dr. Ahmed Aliwy collected the files from the Al-Sabah newspaper; the collection consists of 54300 Arabic text files classified into 26 categories. Abutiheen, Aliwy and Aljanabi [15] filtered these files and reduced them to five categories: Arts and Literature, Sport, Economy, Science and Technology, and Family and Community. These categories are fewer than the original classifications because some of the original categories were merged; for example, science and technology became one category, as did family and society.

4. Proposed Method

The winnowing algorithm was applied to 5-grams of words instead of characters. A hash function was used to generate fingerprints for the files (a numerical representation), and a window size was then determined in order to choose the smallest fingerprint in each window.

The following steps apply to both the suspicious documents and the dataset, and Fig. 2 illustrates the winnowing algorithm:

• Apply preprocessing to the documents. In the tokenization stage, sentences are chunked into words.

• After preprocessing is complete, synonyms are replaced using a synonym dictionary created from WordNet. The dictionary gives each word a list of synonyms, and each synonym is mapped back to the original word. A word that has no synonym is left unchanged. Synonyms are replaced in the corpus and in the suspicious document separately.

• Divide the documents into 5-grams of words instead of characters.

• Convert each n-gram into a number using the hash function, generating a list of fingerprints.

• Divide the fingerprint list into fixed-size windows, shifting the window by one element at a time, and select the lowest value in each window using the winnowing algorithm.

The fingerprints of suspicious documents are compared with the dataset to find the percentage of plagiarism.
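The comparison step can be sketched with an inverted index from fingerprint values to dataset documents. Everything below is an illustrative assumption rather than the authors' implementation: the function names are hypothetical, and the percentages use the number of distinct selected fingerprints as the denominator, whereas the text mentions the total number of grams per document, so the exact ratio used in the paper may differ.

```python
from collections import defaultdict


def build_index(dataset_fps):
    """Map each fingerprint value to the set of dataset documents containing it.

    dataset_fps: dict of document id -> list of fingerprint hashes.
    """
    index = defaultdict(set)
    for doc_id, fps in dataset_fps.items():
        for h in fps:
            index[h].add(doc_id)
    return index


def plagiarism_percentages(index, dataset_fps, suspicious_fps):
    """For each dataset document sharing fingerprints with the suspicious
    document, return (percent relative to the dataset document,
    percent relative to the suspicious document)."""
    suspicious = set(suspicious_fps)
    shared = defaultdict(int)
    for h in suspicious:
        for doc_id in index.get(h, ()):
            shared[doc_id] += 1
    return {
        doc_id: (
            100.0 * count / len(set(dataset_fps[doc_id])),
            100.0 * count / len(suspicious),
        )
        for doc_id, count in shared.items()
    }
```

Building the index once for the whole dataset means each suspicious document only needs one lookup per fingerprint, instead of a pairwise comparison with every file.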

Fig. 2: Winnowing algorithm with synonym replacement

5. Results

There are two methodologies for plagiarism detection, where Si is a suspicious document and dj is a document in the dataset:

i. The percentage of plagiarism in Si taken from dj, relative to dj. Because the full set of results is very large, precision, recall and F-measure are evaluated.

ii. The percentage of plagiarism in Si taken from dj, relative to Si. For the same reason as in point (i), precision, recall and F-measure are evaluated. The averages and the error rate are estimated according to the percentage of plagiarism, as shown in Table 2.

Fig. 3 and Fig. 4 show the evaluation metrics (precision, recall, F-measure and error rate) of the winnowing algorithm with synonym replacement for the first and second methodologies respectively.


Table 2: Average precision, recall, F-measure and error rate of the winnowing algorithm (average over all suspicious files)

| The used method | Methodology | Precision | Recall | F-measure | Error rate |
|---|---|---|---|---|---|
| Winnowing with synonyms | First methodology | 0.84861 | 1 | 0.889206 | 0.034461 |
| Winnowing with synonyms | Second methodology | 0.852961 | 1 | 0.892954 | 0.030137 |
| Winnowing without synonyms | First methodology | 0.849207 | 1 | 0.889764 | 0.034452 |
| Winnowing without synonyms | Second methodology | 0.852554692 | 1 | 0.892920502 | 0.030153138 |

Fig. 3: Evaluation metrics of Winnowing algorithm with synonym replacement for first methodology


Fig. 4: Evaluation metrics of Winnowing algorithm with synonym replacement for second methodology
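The paper does not spell out how the metrics are computed, so the sketch below uses the standard definitions of precision, recall and F-measure from counts of true positives (tp), false positives (fp) and false negatives (fn). The error rate is omitted because its definition is not given. The small example also illustrates that averaging the F-measure per file generally differs from taking the F-measure of the averaged precision and recall.

```python
def precision(tp: int, fp: int) -> float:
    """Fraction of reported matches that are correct."""
    return tp / (tp + fp) if tp + fp else 0.0


def recall(tp: int, fn: int) -> float:
    """Fraction of true matches that were reported."""
    return tp / (tp + fn) if tp + fn else 0.0


def f_measure(p: float, r: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if p + r else 0.0


# Hypothetical per-file (precision, recall) pairs, for illustration only.
per_file = [(0.5, 1.0), (1.0, 1.0)]
mean_f = sum(f_measure(p, r) for p, r in per_file) / len(per_file)
mean_p = sum(p for p, _ in per_file) / len(per_file)
mean_r = sum(r for _, r in per_file) / len(per_file)
f_of_means = f_measure(mean_p, mean_r)
```

With the averaged precision 0.84861 and recall 1 reported in Table 2, `f_measure` gives about 0.918, which is higher than the reported 0.889206. That would be consistent with the F-measure having been averaged per file, although the text does not state this.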

6. Conclusion

In this paper, the winnowing algorithm was applied to words instead of letters (5-word grams) for selecting fingerprints. A hash function was used to convert each 5-gram (five words, not five letters) into a number called a hash value in order to reduce the size of the data, and the winnowing algorithm was applied both with and without synonym replacement. The results showed that, in the second methodology, replacing synonyms gives slightly better results and a lower error rate than applying the winnowing algorithm without replacing synonyms, while in the first methodology the two variants are nearly identical (Table 2).

Many extensions can be done as future work. Some of them are summarized below:

• Trying to detect idea plagiarism for the Arabic language, because it is a challenging task.

• Dealing with semantics more deeply to produce intelligent plagiarism detection.

• Introducing an OCR system to assist in plagiarism detection, because some works insert text as an image.

• Designing a multilingual system, including the Arabic language, for plagiarism detection.

References

[1] Gheni, H. (2016). Plagiarism Detection Based on Syntax and Semantic Analysis. Master's thesis in Computer Sciences, College of Information Technology, University of Babylon.

[2] Stein, B., Koppel, M., & Stamatatos, E. (2007, December). Plagiarism Analysis, Authorship Identification, and Near-Duplicate Detection (PAN'07). In SIGIR Forum (Vol. 41, No. 2, pp. 68-71).

[3] Meyer zu Eissen, S., & Stein, B. (2006). Intrinsic Plagiarism Detection. In Proceedings of the 28th European Conference on IR Research, Springer, London, UK, Lecture Notes in Computer Science, vol. 3936, pp. 565–569. doi:10.1007/11735106_66.

[4] Schleimer, S., Wilkerson, D. S., & Aiken, A. (2003, June). Winnowing: local algorithms for document fingerprinting. In Proceedings of the 2003 ACM SIGMOD international conference on Management of data (pp. 76-85). ACM.


[5] Elbegbayan, N. (2005). Winnowing, a Document Fingerprinting Algorithm. TDDC03 Projects, Spring.

[6] Jadalla, A., & Elnagar, A. (2012, April). A fingerprinting-based plagiarism detection system for Arabic text-based documents. In 2012 8th International Conference on Computing Technology and Information Management (NCM and ICNIT) (Vol. 1, pp. 477-482). IEEE.

[7] Zaher, M., Shehab, A., Elhoseny, M., & Osman, L. (2017, September). A New Model for Detecting Similarity in Arabic Documents. In International Conference on Advanced Intelligent Systems and Informatics (pp. 488-499). Springer, Cham.

[8] Alzahrani, S. M., Salim, N., & Abraham, A. (2012). Understanding Plagiarism Linguistic Patterns, Textual Features, and Detection Methods. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 42(2), 133–149. doi:10.1109/tsmcc.2011.2134847.

[9] Eiselt, M.P.B.S.A. and Rosso, A.B.C.P. (2009). Overview of the 1st international competition on plagiarism detection. In 3rd PAN Workshop. Uncovering Plagiarism, Authorship and Social Software Misuse (p. 1).

[10] Alsallal, M. (2016). A Machine Learning Approach for Plagiarism Detection. The Degree of Doctor of Philosophy, Coventry University.

[11] Ali, A. M. E. T., Abdulla, H. M. D., & Snasel, V. (2011). Overview and Comparison of Plagiarism Detection Tools. In DATESO (pp. 161-172).

[12] Dreher, H. (2007). Automatic conceptual analysis for plagiarism detection. Journal of Issues in Informing Science and Information Technology, 4(2007), 601-614.

[13] Albalooshi, N., Mohamed, N., & Al-Jaroodi, J. (2011, December). The challenges of Arabic language use on the Internet. In 2011 International Conference for Internet Technology and Secured Transactions (pp. 378-382). IEEE.

[14] Idicula, S. M. (2015, December). Fingerprinting based detection system for identifying plagiarism in Malayalam text documents. In 2015 International Conference on Computing and Network Communications (CoCoNet) (pp. 553-558). IEEE.

[15] Abutiheen, Z. A., Aliwy, A. H., & Aljanabi, K. B. S. (2018). Arabic text classification using master-slave's technique. Journal of Physics: Conference Series, 1032, 012052. doi:10.1088/1742-6596/1032/1/012052.