
An Empirical Study on Author Affirmation

1Mousmi A. Chaurasia, 2Dr. Sushil Kumar

2Department of Electrical & Electronics, Bhilai Institute of Technology, Bhilai, India

1Member ACM, 1mchaurasia@acm.org, 2sk1_bit@rediffmail.com

Abstract -- Bootlegging and copyright intrusion are major problems in the academic and commercial sectors. Automatic information processing and retrieval have therefore become an urgent need. In this paper a novel approach to authorship affirmation is proposed, namely an initial character n-gram approach dealing with real-world (or unrestricted) text. With a small experiment, we attempt to affirm the author more accurately than previous research has shown. The results obtained show that the initial n-gram technique is very effective, reaching the 100% accuracy level in author identification.

Index Terms -- Author identification, Character n-gram, Dissimilarity measure, Natural Language Processing.

I. INTRODUCTION

Individuals have distinctive ways of speaking and writing, and there exists a long history of linguistic and stylistic investigation into authorship attribution. Various practical applications for authorship attribution have grown in areas such as intelligence (linking intercepted messages to each other and to known terrorists), criminal law (identifying writers of ransom notes and harassing letters), civil law (copyright and estate disputes), and computer security (tracking authors of computer virus source code). Stylometry attempts to capture an author's style using quantitative measurements of various features in the text, such as word length or vocabulary distributions. Many stylometric studies have measured word dependencies as a feature of an author's style, using language models that restrict what words a given word can depend upon. Furthermore, in the past few years, character n-grams have successfully been used for named-entity recognition [1], predicting authorship [2], web page genre identification [3], and sentence-level subjectivity classification [4].

Author affirmation, as the name implies, involves determining the author of a disputed work: it is the task of identifying the author of a given text. Anyone can take a copy of someone else's text and put it on the web with his or her own name on it. Author affirmation methods are important to determine who deserves recognition for the work. They empower governments and institutions to give credit where credit is due, be it for scholarly works or for terrorist manifestos.

Attribution of authors can be considered a typical classification problem, where a set of documents with known authorship is used for training and the aim is to automatically determine the corresponding author of an anonymous text [5]. Author affirmation is becoming an important application in web information management. One such example is the Federalist Papers, of which twelve are claimed to have been written by both Alexander Hamilton and James Madison [6]. For documents such as email, blogs and other online content, formatting and other structural features can also be profitably exploited for authorship attribution [7].

In a recent survey, Stamatatos [8] distinguishes the following types of stylometric features: lexical features (word frequencies, word n-grams, vocabulary richness, etc.), character features (character types, character n-grams), syntactic features (part-of-speech frequencies, types of phrases, etc.), semantic features (synonyms, semantic dependencies, etc.), and application-specific features (structural, content-specific, language-specific).

An N-gram is a contiguous sequence of N items in a given string of length m, where the items can be anything from characters to words. In computational linguistics, n-gram models are most commonly used for predicting words (word-level n-grams) or predicting characters (character-level n-grams) in various applications [9]. We are motivated by positive results in using character N-grams to build author profiles and in automated authorship attribution [10]. Several N-grams can be extracted from a given text; here, a space is indicated by "*". For example, the text

“pull and pull” gives the following N-grams:

Bi-grams: pu, ul, ll, l*, *a, an, nd, ... etc.
Tri-grams: pul, ull, ll*, l*a, *an, and, ... etc.
Quad-grams: pull, ull*, ll*a, l*an, ... etc.
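
As an illustration only (not the authors' code), this contiguous character n-gram extraction, with spaces replaced by "*" as in the example, can be sketched in a few lines of Python:

```python
# Minimal sketch of contiguous character n-gram extraction; spaces are
# replaced by "*" so that word boundaries appear inside the grams.
def char_ngrams(text, n):
    """Return the contiguous character n-grams of `text`."""
    s = text.replace(" ", "*")
    return [s[i:i + n] for i in range(len(s) - n + 1)]

print(char_ngrams("pull and pull", 2))  # ['pu', 'ul', 'll', 'l*', '*a', 'an', 'nd', ...]
print(char_ngrams("pull and pull", 3))  # ['pul', 'ull', 'll*', 'l*a', '*an', 'and', ...]
print(char_ngrams("pull and pull", 4))  # ['pull', 'ull*', 'll*a', 'l*an', ...]
```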

The n-grams have several advantages:

• Automatic capture of the roots of the most frequent words.

• Independence from the document language.


• Tolerance of spelling mistakes and deformations. For example, it is possible that the word "chapter" is written as "chaptr". A system based on whole words will have difficulty recognizing the word "chapter" since the word is badly spelled. On the other hand, our strategy based on initial n-grams is able to take into account n-grams like "ch", "cha", "chap", etc.

Typical text classification systems use a range of statistical and machine learning techniques based on regression models, K-Nearest Neighbor (KNN) [11], [12], Naive Bayes [13], [14], [15], Support Vector Machines (SVM) [16], [17], [18], n-grams [19], [20], [21], and so on. In particular, in this paper we propose a different approach that identifies the author using initial character n-grams, whereas prior research has performed identification on the total character n-grams.

II. PRIOR WORK RELATED TO CHARACTER N-GRAMS

Authorship attribution is the search for quantifiable features that are able to differentiate between authors; this line of research is called 'stylometry'. Character n-grams are used for text categorization [9], [11], [19], [21], authorship attribution [2], [10], [23], [26], genre identification [3], etc. A study on intrinsic plagiarism detection [22] was based on character n-gram profiles and a dissimilarity measure built on a style-change function over character tri-grams with a sliding window of length 1000. Stamatatos [22] also proposed a protocol that attempts to detect plagiarism-free documents and tried to reduce the effect of irrelevant style changes within a document. Moreover, one of the studies [23] was also based on interpolation of n-gram probabilities, showing that the probability measure gives better performance than an Author Profile Matching based technique. Note also that Raza et al. [23] applied one existing and one modified method based on n-gram probabilities to identify the author of an anonymous text.

Furthermore, another approach [24] was based on building a character-level n-gram model of an author's writing, in which, on one Greek dataset, an 18% accuracy improvement over deeper NLP techniques was obtained. Another study [10] shows that sub-word units like character n-grams (i.e., character sequences of length n) can be very effective for capturing the nuances of an author's style. The most frequent n-grams of a text provide crucial information about the author's stylistic choices on the lexical, syntactic, and structural levels. For example, the most frequent 3-grams of an English corpus indicate lexical preferences ('the', 'to', 'tha', 'con').

A modified approach [25] was to compare each n-gram with similar n-grams (either longer or shorter) and keep the dominant n-grams. To cope with a single-label multiclass classification problem, a suitable ensemble-based model [2] was applied to character n-gram representations of authors' style. Linear discriminant analysis is a classification algorithm with a stable trade-off between classification accuracy and training time cost.

As far as the prior review shows, research on initial, medial and final n-grams is rare compared to the total n-gram. Character n-gram profiles lend themselves to a novel approach using a dissimilarity measure. We propose initial character n-gram analysis to discriminate authors' writing styles. In this paper, we compare the accuracy of identifying the author for various types of character n-grams with different profile sizes.

III. OUR METHODOLOGY AND ALGORITHM

An N-gram is an N-character slice of a longer string. Although in the literature the term can refer to any co-occurring set of characters in a string, a key feature of this paper is that we do not use the term for contiguous slices of the whole string. Instead, our strategy is based on one starting slice of every term (the initial slice). Thus, for example, the sentence "Author Identification using Natural Language Processing." would be composed of the following initial n-grams.

TABLE I
POSSIBLE INITIAL CHARACTER N-GRAMS

Bi-grams:    Au    Id    us    Na    La    Pr
Tri-grams:   Aut   Ide   usi   Nat   Lan   Pro
Quad-grams:  Auth  Iden  usin  Natu  Lang  Proc

In general, a text of m terms will have m initial bi-grams, m initial tri-grams, m initial quad-grams, and so on. A possible strategy for character n-grams is to use all positional n-grams, including the initial n-grams; restricting attention to the initial n-grams, however, leads to a smaller feature space, i.e., easier identification with existing machine learning methods.
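
As an illustration only (not the authors' implementation), the initial-slice extraction of TABLE I can be sketched as follows:

```python
# Illustrative sketch of the initial-slice strategy: keep only the first n
# characters of every token (punctuation is assumed to be removed beforehand).
def initial_ngrams(text, n):
    """Return the leading n characters of every token of length >= n."""
    return [tok[:n] for tok in text.split() if len(tok) >= n]

sentence = "Author Identification using Natural Language Processing"
print(initial_ngrams(sentence, 2))  # ['Au', 'Id', 'us', 'Na', 'La', 'Pr']
print(initial_ngrams(sentence, 3))  # ['Aut', 'Ide', 'usi', 'Nat', 'Lan', 'Pro']
print(initial_ngrams(sentence, 4))  # ['Auth', 'Iden', 'usin', 'Natu', 'Lang', 'Proc']
```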

A. Rendering N-gram Frequency Profiles

The bubble-and-box diagram in Fig. I displays the working environment of our strategy. It is very simple to understand: the system merely reads incoming texts and counts the occurrences of all N-grams. However, pre-processing is required, which consists of the following steps:

• Eliminate numerals from the text.

• Remove all punctuation marks from the text.

• Fold the text to a single case (the procedure is case-insensitive).

Generating the character n-gram profile includes the following steps (a short code sketch follows the list):

• Split the text into separate tokens after the pre-processing steps.

• Run down each token and compute all possible character n-grams for N = 2, 3, and so on.

• Hash into a table so that each output N-gram has its own frequency count.

• Sort the N-grams by their frequencies, from most frequent to least frequent.
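
A minimal sketch of this profile-generation pipeline, assuming the pre-processing described above (numerals and punctuation removed, case folded); the function and variable names are ours, not the authors':

```python
import re
from collections import Counter

def preprocess(text):
    """Lower-case the text and replace numerals and punctuation with spaces."""
    return re.sub(r"[^a-z\s]", " ", text.lower())

def initial_ngram_profile(text, n, profile_length):
    """Return the `profile_length` most frequent initial character n-grams."""
    tokens = preprocess(text).split()
    grams = [tok[:n] for tok in tokens if len(tok) >= n]   # initial slice of each token
    counts = Counter(grams)                                # hash table of frequencies
    return dict(counts.most_common(profile_length))        # sorted, most to least frequent

# Example: the 50 most frequent initial tri-grams of a training text.
# author_profile = initial_ngram_profile(open("author_sample.txt").read(), 3, 50)
```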


Fig. I. RENDERING N-GRAMS PROFILES

Each document to be classified went through the text pre-processing phase, and then its N-gram profile was generated as described above. The N-gram profile of each text document (the document profile) was compared against the profiles of all documents in the training classes (the class profiles) using the dissimilarity algorithm [1] employed in this paper.

B. Algorithm and Equations

The algorithm used for computing dissimilarity is the one presented in [10]. We generate the bi- and tri-grams from the author's original sample text, called the Author's Profile. A similar profile is generated for the test data. Now, let f1(n) be the frequency of the nth bi- or tri-gram in the Author's Profile, and let f2(n) be the frequency of the nth bi- or tri-gram in the test data. We calculate the dissimilarity between the two, starting from the sum of squared differences:

\sum_{n \in \text{profiles}} \left( f_1(n) - f_2(n) \right)^2

In order to normalize these differences, we divide them by the average frequency of the given n-gram:

\text{Sum} = \sum_{n \in \text{profiles}} \left( \frac{f_1(n) - f_2(n)}{\frac{f_1(n) + f_2(n)}{2}} \right)^2 = \sum_{n \in \text{profiles}} \left( \frac{2\,(f_1(n) - f_2(n))}{f_1(n) + f_2(n)} \right)^2

Given two profiles, the algorithm returns a positive number, which is a measure of dissimilarity. For identical texts, and more generally for texts that have identical L most frequent n-grams, the dissimilarity is 0. Using this measure and a set of author profiles, we can easily assign a text to an author by generating the text's profile and assigning the text to the category for which the calculated dissimilarity is minimal. Moreover, in order to get better results, the test data should be separated into pieces of approximately the same size, whereas the training size can vary and is used to train the system.
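
Under the same assumptions, the dissimilarity measure and the minimal-dissimilarity attribution rule can be sketched as follows (an illustration of the formula from [10], not the authors' code):

```python
def dissimilarity(profile1, profile2):
    """Sum of squared frequency differences, each normalized by the average
    frequency of the n-gram across the two profiles."""
    total = 0.0
    for gram in set(profile1) | set(profile2):
        f1, f2 = profile1.get(gram, 0), profile2.get(gram, 0)
        avg = (f1 + f2) / 2.0   # never zero: the gram occurs in at least one profile
        total += ((f1 - f2) / avg) ** 2
    return total

def attribute(text_profile, author_profiles):
    """Assign the text to the author whose profile has minimal dissimilarity."""
    return min(author_profiles,
               key=lambda author: dissimilarity(text_profile, author_profiles[author]))
```

In practice the raw counts would typically be converted to relative frequencies before comparison, as in [10], so that training and test texts of different lengths remain comparable.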

We will see in our short experiment that we did find such a threshold, one that separates the profiles of different authors.

IV. EXPERIMENTAL RESULTS

In this section, we describe our data-set, giving the training size and test size for our problem domain. In the same manner, we explore the accuracy results, which vary with the type of n-gram.

In the experimental analysis, we capture author profiles of variable size for training, and for testing we capture the profiles of different authors along with profiles of the same authors. For our experimental purpose, we used the data-set of four authors given in TABLE II.

TABLE II
AUTHORS IN OUR DATA-SET

DATA SET   AUTHOR NAME       TRAIN SIZE (Words)   TEST SIZE (Words)
Z0         EVA GALE          22068                22068
Z1         PAYTON LEE        278303               275585
Z2         RUTH ANN NORDIN   235807               109254
Z3         ROSS BECKMANN     124047               124047

TABLE II gives the total number of words in the train set and the total number of words in the test set for the respective authors. The data includes three novels for Z0, six novels for Z1, five novels for Z2 and two novels for Z3. A profile is generated from each novel. Furthermore, every author has at least two novels, so the training-set profiles are generated from a combination of two novels from each author and the accuracy is measured on the attribution of the remaining novels. We experimented with our approach on this small data-set and compared the accuracy of verifying the author against the profiles of the different authors. In the following tables, we report results for the medial n-gram, final n-gram, total n-gram and the initial n-gram.

TABLE III
INITIAL BI-GRAM
Profile Length    50      100     200
Accuracy (%)      83.33   93.75   100

TABLE IV
INITIAL TRI-GRAM
Profile Length    100    200    500    700
Accuracy (%)      95.8   100    100    97.91

TABLE V
TOTAL BI-GRAM
Profile Length    100     200    430
Accuracy (%)      39.58   62.5   68.75

TABLE VI
TOTAL TRI-GRAM
Profile Length    100     200     400     700
Accuracy (%)      72.91   68.75   70.83   70.83

TABLE VII
MEDIAL BI-GRAM
Profile Length    50      100     200
Accuracy (%)      68.75   66.66   75

TABLE VIII
MEDIAL TRI-GRAM
Profile Length    500   700     1000
Accuracy (%)      75    66.66   68.75

TABLE IX
FINAL BI-GRAM
Profile Length    50      100   200
Accuracy (%)      66.66   75    72.91

TABLE X
FINAL TRI-GRAM
Profile Length    100     200     500    700
Accuracy (%)      68.75   64.58   62.5   64.58

Moreover, it is clear that the initial n-gram approach gives quite good results compared to the other types of n-gram, which gives a clear picture of how to verify an author's text against other authors' texts in order to avoid plagiarism, piracy, etc. The above results show the highest accuracy of 100% for the initial n-gram, whereas the others reach lower levels of accuracy. The method is very successful on this data-set.

V. CONCLUSION

In this paper, analyzing the efficiency of n-grams, we found that initial n-gram profiles give distinctly better results than the medial n-gram, final n-gram and total n-gram. Since we demonstrated our approach on a small data-set and obtained good performance, our future work will be to increase our corpus, with the hope of obtaining equally effective and accurate results. This would especially help to find a better threshold. The approach can also be extended to higher degrees of n-gram analysis.

REFERENCES

[1] D. Klein, J. Smarr, H. Nguyen, and C. Manning, "Named Entity Recognition with Character-Level Models," in Proc. of CoNLL-2003, 2003.

[2] E. Stamatatos, "Ensemble-Based Author Identification Using Character N-Grams," in Proc. of TIR'06, 2006.

[3] I. Kanaris and E. Stamatatos, "Webpage Genre Identification Using Variable-Length Character N-Grams," in Proc. of ICTAI 2007.

[4] S. Raaijmakers and W. Kraaij, "A Shallow Approach to Subjectivity Classification," in Proc. of ICWSM'08, 2008.

[5] R. María, C. Morales, L. Villaseñor, et al., "Authorship Attribution Using Word Sequences," CIARP, 2006.

[6] D. I. Holmes and R. S. Forsyth, "The Federalist Revisited: New Directions in Authorship Attribution," Literary and Linguistic Computing, 1995, 10(2): pp. 111-127.

[7] A. Abbasi and H. Chen, "Applying Authorship Analysis to Extremist-Group Web Forum Messages," IEEE Intelligent Systems, 2005.

[8] E. Stamatatos, "A Survey of Modern Authorship Attribution Methods," Journal of the American Society for Information Science and Technology, 2009, 60(3): pp. 538-556.

[10] V. Kešelj, F. Peng, N. Cercone, and C. Thomas, "N-Gram-Based Author Profiles for Authorship Attribution," in Proc. of PACLING'03, August 2003, pp. 255-264.

[11] J. Fürnkranz, "A Study Using N-Gram Features for Text Categorization," http://citeseer.ist.psu.edu/johannes98study.html, 1998.

[12] Y. Zhao and J. Zobel, "Effective Authorship Attribution Using Function Words," in Proc. of the 2nd AIRS Asian Information Retrieval Symposium, Springer, 2005, pp. 174-190.

[13] R. J. Mooney and L. Roy, "Content-Based Book Recommending Using Learning for Text Categorization," in Proc. of DL-00, 5th ACM Conference on Digital Libraries, 1999.

[14] M. Forsberg and K. Wilhelmsson, "Automatic Text Classification with Bayesian Learning," http://www.cs.chalmers.se/~markus/LangClass/LangClass.pdf.

[15] F. Peng, D. Schuurmans, and S. Wang, "Augmenting Naive Bayes Text Classifier with Statistical Language Models," Information Retrieval, 7(3-4), 2004, pp. 317-345.

[16] T. Joachims, "Text Categorization with Support Vector Machines: Learning with Many Relevant Features," in Proc. of ECML-98, 10th European Conference on Machine Learning, 1997.

[17] M. Koppel, J. Schler, and K. Zigdon, "Determining an Author's Native Language by Mining a Text for Errors," in Proc. of KDD '05, Chicago, IL, 2005.

[18] R. Zheng, J. Li, H. Chen, and Z. Huang, "A Framework for Authorship Identification of Online Messages: Writing-Style Features and Classification Techniques," Journal of the American Society for Information Science and Technology, vol. 57, no. 3, 2006, pp. 378-393.

[19] W. B. Cavnar and J. M. Trenkle, "N-Gram-Based Text Categorization," in Proc. of SDAIR-94, 3rd Annual Symposium on Document Analysis and Information Retrieval, 1994.

[20] J. P. R. Gustavsson, "Text Categorization Using Acquaintance," Diploma Project, Stockholm University, http://www.f.kth.se/~f92-jgu/Cuppsats/cup.html, 1996, unpublished.

[21] P. Náther, "N-gram Based Text Categorization," Institute of Informatics, Comenius University, 2005, unpublished.

[22] E. Stamatatos, "Intrinsic Plagiarism Detection Using Character n-gram Profiles," in Proc. of the 3rd Int. Workshop on Uncovering Plagiarism, Authorship, and Social Software Misuse, 2009.

[23] A. Raza, A. Athar, and S. Nadeem, "N-Gram Based Authorship Attribution in Urdu Poetry," in Proc. of the Conference on Language & Technology, 2009, pp. 88-93.

[24] F. Peng, D. Schuurmans, V. Keselj, and S. Wang, "Language Independent Authorship Attribution Using Character Level Language Models," in Proc. of the 10th Conference of the European Association for Computational Linguistics, 2003.

[25] J. Houvardas and E. Stamatatos, "N-gram Feature Selection for Authorship Identification," in J. Euzenat and J. Domingue (Eds.), Proc. of the 12th Int. Conf. on Artificial Intelligence: Methodology, Systems, Applications (AIMSA'06), LNCS 4183, 2006, pp. 77-86.
