Influence of Data Discretization on Efficiency of Bayesian Classifier for Authorship Attribution

(1)

Procedia Computer Science 35 ( 2014 ) 1112 – 1121

Peer-review under responsibility of KES International. doi: 10.1016/j.procs.2014.08.201

ScienceDirect

18

th

_{International Conference on Knowledge-Based and Intelligent}

Information & Engineering Systems - KES2014

Inﬂuence of data discretization on e

ﬃ

ciency of Bayesian classiﬁer

for authorship attribution

Grzegorz Baron

a

a_{Silesian University of Technology, Akademicka 16, Gliwice 44-100, Poland}

Abstract

Authorship attribution is one of the research areas in data mining domain and various methods can be employed for performing that task. The paper presents results of research on influence of data discretization on efficiency of Naive Bayes classifier. The analysis has been carried on datasets founded on texts of two male and two female authors using the WEKA data mining software framework. The binary classification was performed separately for both datasets for wide range of parameters of discretization process in order to investigate dependency between ways of discretization and quality of classification using Naive Bayes method. The numerical results of tests have been compared and discussed and some observations and conclusions formulated.

c

2014 The Authors. Published by Elsevier B.V. Peer-review under responsibility of KES International.

Keywords: Bayesian classiﬁer; Naive Bayes; stylometry; authorship attribution; text analysis; classiﬁcation; discretization; binarization;

1. Introduction

Bayesian classifier is a simple yet powerful tool for classifying wide range of data in different application domains. It is based on Bayes theorem. Because of its simplicity and the way of behaviour it is often called Naive Bayesian classifier, but it sometimes outperforms other more advanced approaches1. This method covers domains such as from medicine2,3_{, biology, genetics}4_{, bioinformatics}5_{through technical, mechanical}6_{applications, social research}7_, or broadly defined data mining8,9_{. There have been some attempts to improve the method. In}10_{the Hidden Naive} Bayes method is proposed, where with each attribute a hidden one is connected, which includes some features of other attributes. In some studies Naive Bayes method is joined with others like in11_{, where combination of decision tree} and Naive Bayes has been proposed.

In the area of text analysis Naive Bayesian classifiers are popularly used. In12_{authors developed a feature scaling} method using Naive Bayes classifier, whereas in13_{some modifications to the method have been proposed. Worth of} attention is the presented feature scoring function, which allows measuring the quality of features regarding

classiﬁ-∗_{Grzegorz Baron Tel.:}₊_{48-32-237-2009 ; fax:}₊_{48-32-237-2733.}

E-mail address:[email protected]

(2)

cation efficiency. The problem of preparing features set for Bayesian classifier is also touched in14_{. Authors proposed} feature weighting method and text normalization that gives results comparable with other classifiers. Generally speak-ing, proper preparation or preprocessing of characteristic features of texts is very important for improving the quality of Naive Bayes classifiers.

Authorship attribution can be considered as one of tasks of stylometry. Stylometry, as deﬁned in15_{, is ”the statistical} analysis of variations in literary style between one writer or genre and another”. It is useful for determining authors of anonymous texts, detecting plagiarism, etc. Diﬀerent techniques are applied to perform stylometric tasks. The main approaches are statistics-oriented or machine learning.

The prime important problem in textual analysis which stands before researchers is finding a methodology of selecting attributes or characteristic feature sets that must be author invariant16_{. There are di}_ff_{erent algorithms based} on employing statistical or linguistic methods (like vocabulary, syntactic, orthographic, structure and layout properties of texts) into attribute or features selection process17. Authors of18,19 proposed fully-automated approach to the identification of the authorship of unrestricted text that excludes any lexical measure. Style markers were used to perform the analysis of the text by a natural language processing tool using three stylometric levels, i.e., token-level, phrase-level, and analysis-level measures. The next step is attributional analysis where the classification of relevant features that have been previously extracted is performed. Different supervised and unsupervised methods can be taken into account. To the unsupervised methods belong principal component analysis, multidimensional scaling, cluster analysis. Among supervised methods simple descriptive statistics, linear discriminant analysis, distance-based methods, machine learning techniques (neural networks, decision trees, Naive Bayes classifiers), or support vector machines can be mentioned17,20_.

Interesting approach to authorship attribution problem is presented in21_{. Authors noticed that many current studies} suffer from the limitation of assuming a small closed set of candidate authors and essentially unlimited training texts for each. Authors stated and try to solve three problems: profiling, where there is no candidate set at all; the needle-in-a-haystack problem – there are thousands of candidates; verification when there is one suspect candidate.

In this paper an application of Naive Bayes classifier for authorship attribution is presented. During past research the positive influence of data discretization on quality of classification tasks has been noticed and generated motivation to analyse that problem deeply. The analysis has been carried on datasets founded on texts of four authors (two male and two female). Investigation into the influence of data discretization on efficiency of classification process has been conducted.

The paper is organized as follows. Sections 2 and 3 present the theoretical background and methods employed in the research. Section 4 introduces the experimental setup, presents datasets used and techniques employed. The results from experiments and their discussion are given in Section 5, whereas the last Section 6 contains conclusions.

2. Bayesian classiﬁer

Bayesian classifiers are very simple and popular classifiers used in different application domains. Very often they are applied as reference model for other classification research. One of the main assumptions constituting Bayesian approach to classification is the independence of attributes. It is obvious that in real-world problems it cannot be always true. But, paradoxically, performance of Bayesian classifier is surprisingly good. For that simple assumption mentioned above such classifiers are called Naive Bayes (sometimes simple Bayes, or stronger idiot’s Bayes). As-sumption of attribute’s independence leads to conclusion that learning process for each attribute can be conducted separately, which accelerates training process for large sets of attributes. Therefore document classification is an area where Naive Bayesian classifiers perform very well.

There are two fundamental approaches in input data preprocessing for text analysis. For the first one a set of binary attributes is created. Each element is related to one word and stores information if such word occurs in the analysed text. There is no information about a number of occurrences. Such approach, where features (assumed as independent) are binary variables, is called multivariate Naive Bayes. It is based on multivariate Bernoulli event model. The second approach specifies that information about word occurrences is extracted from the document and composed into features sets. Classifier founded on presented model is called multinomial Naive Bayes. For both models information about word positions and its order is omitted. As reported in22_{, multinomial Naive Bayes is} much more suited for large vocabulary sizes than multivariate Naive Bayes, especially when vocabulary size for both

(3)

classifiers is tailored for some specific task. Only sometimes, for some small vocabulary sets, multivariate Naive Bayes classifier delivers better results.

Bayesian classiﬁer is based on Bayes’ rule of conditional probability: p(cj|d)=

p(d|cj)p(cj)

p(d) , (1)

where:

• p(cj|d) – a’posteriori probability of instancedbeing in classcj,

• p(d|cj) – probability of generating instancedgiven classcj,

• p(cj) – a’priori probability of occurence of classcj,

• p(d) – probability of instancedoccurring.

For numerous attributes (instances) value ofp(d|cj) can be calculated as a product of probabilities for all

elemen-tary instancesdi, as presented by the following equation:

p(d|cj)= p(d1|cj)p(d2|cj). . .p(dm|cj). (2)

A result of classiﬁcation processNBC(d1, . . . ,dn) is determined using MAP (maximum a’posteriori) decision rule: NBC(d1, . . . ,dn)=argmax c p(C=c) n i=1 p(Di=di|C=c). (3)

A common assumption made for values of numeric attributes is that they are normally distributed, so the probability values can be calculated utilizing probability density function for Gaussian (normal) distribution, as follows:

p(D=d|C=c)= √1 2πσc

e−(d−2συcc)22 , (4)

whereσis standard deviation andυis the mean of the attribute given the class. It is simplest but not the only solution. For specific purposes other distributions can be more suitable23_{. Naive Bayes classifier, despite its simplicity, is} widely used in different classification tasks. It gives surprisingly good results in comparison with more novel and sophisticated methods. Summarizing, some important features of Naive Bayes classifier can be listed:

• it is a fast method during training and classiﬁcation process, • it is not sensitive to irrelevant features,

• independence of features is assumed – can be considered as disadvantage, • it is optimal if features are really independent,

• it has low storage requirements. 3. Discretization

Discretization is the process of partitioning values of continuous variables into categories. The goal of discretiza-tion is to ﬁnd a set of cut points to divide the range of data into some number of intervals. Discretizadiscretiza-tion methods can be fundamentally divided into two groups: supervised and unsupervised. Supervised methods utilize class informa-tion during discretizainforma-tion process whereas unsupervised do not. Supervised methods are considered as more eﬃcient and deliver better results24,25_{. Basic discretization process can be composed of four steps:}

1. sorting the continuous range of data to be discretized, 2. evaluating points for splitting (or intervals for merging),

3. applying splitting or merging process established on speciﬁed rules,

(4)

In presented research mainly unsupervised methods were taken into consideration, but supervised discretization using the information entropy minimization heuristic was also used26_{. Basic unsupervised methods are equal-width} and equal-frequency discretizations. The equal-width algorithm evaluates the minimum and maximum values of the discretized attribute and then divides the range into the previously deﬁned number of equal width discrete intervals. The equal frequency algorithm evaluates the minimum and maximum values of the discretized attribute, sorts all values in ascending order, and splits the range into a deﬁned number of intervals so that every interval contains the same number of values24_.

Additionally to the discretization process yet another mechanism was employed in presented work, namely bina-rization integrally included into discretization. It transforms each discretized attribute into a set of binary attributes. k–values attribute is transformed into (k−1) binary attributes. In fact it is a kind of coding of an ordered attribute using a set of binary attributes27_.

4. Experimental setup

The following subsections present the main conditions of performed experiments. Source texts are pointed out. There are also given descriptions of attributes, and properties of training and testing datasets. The last part presents description of techniques employed in research.

4.1. Experimental datasets

Stylometry, which is the application domain of presented work, employs a wide range of methods to textual analysis in order to extract and describe characteristic features of texts.These features (resulting from quantitative analysis) allow to find unique characteristics of writers to achieve different aims, such as proving authorship of texts, plagiarism detection, categorization of texts. Historically it was performed personally by researchers by studying and analysing texts. Thanks to development of the computers modern stylometry bases on statistical analysis or artificial intelligence. Authorship attribution is one of the tasks of stylometry, with the main aim focused on designation of the work author. As stated in21, different types of features or measures can be analysed to capture the lexical preferences:

• complexity measures, • function words,

• syntax and parts-of-speech, • functional lexical taxonomies, • content words,

• character n-grams.

There are a lot of possibilities for creating the characteristic features sets. Such features must be similar for all texts of given author (sometimes texts created during the deﬁned period of author’s life) and distinctive to other authors.

For the purposes of presented work linguistic descriptors, namely lexical attributes and syntactic features have been chosen. To determine lexical attributes, statistical measures as: total number of characters or words, average number of words or characters per sentence, frequency of usage of letters or words can be employed. The style of sentences building, usage of diﬀerent punctuation marks is described by syntactic features. List of lexical and syntactic elements for presented research has been drawn from28, where its good quality for speciﬁed texts was shown. The list consists of two subsets:

• lexical elements – but, and, not, in, with, on, at, of, this, as, that, what, from, by, for, to, if,

• syntactic elements – fullstop, a comma, a question mark, an exclamation mark, a semicolon, a colon, a bracket, a hyphen.

As can be noticed all lexical elements are function words whereas syntactic ones are punctuation marks. Attributes are frequencies of usage of these elements. Calculated attributes have been used to create three of attributes sets: ﬁrst with lexical attributes, second with syntactic ones, and third which is a sum of the previous two.

(5)

As the base for all experiments texts of four authors have been chosen28_{, two female writers: Edith Wharton and} Jane Austin, and two male: Henry James and Thomas Hardy. For each author several books have been studied thanks to the Project Gutenberg website. The source text ﬁles have been preprocessed to extract proper attribute information in order to create basic training and testing datasets. The whole process has been composed from the following steps:

1. splitting of each text into blocks of almost equal size,

2. calculating frequency of elements occurrence for each block (for given attributes set), 3. composing obtained results into.arﬀﬁles, which are suitable for WEKA system.

The assumption is that frequencies of usage of presented markers (function words and punctuation marks) are unique for all authors due to their personal preferences. Therefore authorship attribution based on that kind of data can be efficiently performed. The training and testing datasets were composed from different works of writers which means that data used for testing was not used for training. Such approach allows to get objective results. One of research assumption was that binary classification is performed, so only pairs of authors (in respect to their gender) are taken into consideration during experiments. Finally the separate training and testing datasets, with two balanced classes in each set with specified attributes, have been obtained.

4.2. Techniques employed

There were two main processes employed during experiments, namely discretization of data and classification using Naive Bayes classifier. Appropriate modules from WEKA data mining software have been used. Unsupervised discretization was performed by simple binning resulting in equal-width bins. Additionally the option of optimizing of bins number, using leave-one-out estimate of the entropy, has been used. Output format of discretization process was set to binary. The only parameter which has been modified for unsupervised discretization during experiments was the number of bins. Because of bins number optimization, the resultant number of bins could be less or equal to the value set for discretization process. Supervised discretization applied during experiments was based on Fayyad & Irani’s MDL method26.

Naive Bayesian classiﬁer in WEKA implementation can deal with unary attributes, missing values, numeric at-tributes, binary atat-tributes, nominal atat-tributes, empty nominal attributes which strongly facilitates the experimentation. It also supports binary and nominal classes. The normal distribution is used for numeric attributes.

An additional process was binarization of discretized data. For some experiments output data from the discretiza-tion module, even though it was binarized, was binarized once again. Formally such manipuladiscretiza-tion changes nominal attributes into numerical ones and should not inﬂuence the overall result of classiﬁcation. But experimental results, discussed in section 5, did not prove that assumption.

Processing of datasets during experiments required sequential execution of some algorithms. It has been achieved in few steps, as follows:

1. discretization of input data,

2. optional binarization of discretized data,

3. Naive Bayesian classiﬁcation (learning or testing process – depends on stage), 4. evaluation of results in testing stage.

In a single experiment the whole process was executed in two stages, firstly for training attributes set, secondly for testing attributes set to obtain information about efficiency of previously learned Naive Bayesian classifier. Experi-ments were executed several times for different combinations of attributes sets (lexical, syntactic, full), for male or female author pairs, and with optional binarization of discretized data.

5. Results and discussion

To obtain a reference point for the discussion, some preliminary experiments have been performed. The results are presented in Table 1. There were Naive Bayesian classiﬁcation processes executed for two groups of learning datasets

(6)

(for male and female authors) and for three attribute sets: full, with lexical attributes only, and with syntactic attributes only. Neither discretization nor binarization operations were performed in the data ﬂow.

The validation of classification accuracy was performed using testing datasets prepared from author’s texts, other than those used for creating training ones. The first observations show that the Naive Bayesian classification for female authors gave better results. Classification for lexical attributes set gives slightly better results than for full attributes sets. It is a fact worth of notice. The quality of classification of syntactic attributes datasets is poorer for male datasets. The disproportion between results for male and female datasets is unclear. One of the possible reasons are smaller differences in linguistic styles of both groups of authors.

Table 1. Results of NaiveBayes classiﬁcation for experiments performed using two training and testing sets (for male and female authors), with three kinds of attributes sets: full, lexical attributes only and, syntactic attributes only. Obtained values constitute the reference level for discussion about further results.

Attributes set Male authors Female authors

Full set 81.67 % 92.22 %

Lexical attributes 88.33 % 93.33 %

Syntactic attributes 55.00% 87.78 %

The main research work concerned analysis of influence of data discretization on quality of classification using Naive Bayesian classifier. Additionally, the binarization of discretized data have been performed and such results have also been analysed. As mentioned in section 4.2, the parameter varied during the experiments was number of bins in unsupervised discretization process. The first group of experiments was executed for relatively high values of that parameter, starting from 10 up to 100, with step 10. The results presented in Fig. 1 show that increasing number of bins reduces dramatically the efficiency of Naive Bayesian classifier. The binarization of data before classification did not affect homogeneously the results. The positive observation was that for small number of bins results were slightly better for male authors datasets and equal for female authors, comparing with the reference results from Table 1. Such conclusion indicated the direction of further experiments, and it has been determined that discretization process should be investigated deeper for numbers of bins values between 2 and 10, with step 1.

ϱϬ ϱϱ ϲϬ ϲϱ ϳϬ ϳϱ ϴϬ ϴϱ ϵϬ ϵϱ ϭϬϬ ϭϬ ϮϬ ϯϬ ϰϬ ϱϬ ϲϬ ϳϬ ϴϬ ϵϬ ϭϬϬ Žƌ ƌĞĐ ƚůǇ Đů ĂƐ Ɛŝ Ĩŝ ĞĚ ŝŶƐƚ ĂŶĐ ĞƐ ΀й ΁ EƵŵďĞƌŽĨďŝŶƐ ŵĂůĞͺĂůů ŵĂůĞͺĂůůͺďŝŶĂƌŝǌĞĚ ĨĞŵĂůĞͺĂůů ĨĞŵĂůĞͺĂůůͺďŝŶĂƌŝǌĞĚ

Fig. 1. Experimental results for full attribute set (lexical and syntactic attributes). Formale allandfemale allseries output data of discretization process were classified using Naive Bayesian classifier. Formale all binarizedandfemale all binarizedseries output data of discretization process were binarized before classification.

Figure 2 presents the results of Naive Bayesian classification for two learning sets (for male and female authors). For each set classification of nominal data from discretization module has been performed likewise the classification of binarized data. For male authors the best result is achieved for 4 bins with binarization of discretized data. The value of 93.33% correctly classified instances is the result about 11% better than for the first experiment (see Table 1). This was the interesting result mentioned in section 1, which motivated the presented work. But as can be seen in Fig. 2, it is rather an anomaly than the rule. Generally discretization increases an average performance of the whole classification process, but there is no evidence that using of binarization is beneficial. The average result is about 85%

(7)

for data, with and without binarization, and is better than the reference value. The worst result obtained for 7 bins with binarization is equal to the reference point.

ϴϬ ϴϱ ϵϬ ϵϱ ϭϬϬ Ϯ ϯ ϰ ϱ ϲ ϳ ϴ ϵ ϭϬ Ž ƌƌĞĐƚ ůǇ ĐůĂƐ ƐŝĨŝĞĚ ŝŶƐ ƚĂŶĐĞƐ ΀й ΁ EƵŵďĞƌŽĨďŝŶƐ ŵĂůĞͺĂůů ŵĂůĞͺĂůůͺďŝŶĂƌŝǌĞĚ ĨĞŵĂůĞͺĂůů ĨĞŵĂůĞͺĂůůͺďŝŶĂƌŝǌĞĚ

Fig. 2. Experimental results for full attribute set (lexical and syntactic attributes). Formale allandfemale allseries output data of discretization process were classified using Naive Bayesian classifier. Formale all binarizedandfemale all binarizedseries output data of discretization process were binarized before classification.

For female authors the best result of 93.33% is achieved for 6 and 8 bins and it is equal to the first one. Binarization of discretized data in that case has rather constant negative influence on the efficiency of classification. The discretiza-tion of learning and testing data for male authors increases the number of correctly classified instances while there is no such observation for female authors. In such situation it is not possible to state that discretization has positive influence on classification process. The results depend on the conditions, mainly on nature of classified data.

The next analysis has been performed also using training and testing datasets with lexical attributes only. Bound-aries and step for number of bins values have been assumed like for the previous experiments. Overall results presented in Figure 3 conﬁrm the preliminary observations that quality of classiﬁcation is better for lexical attributes datasets. For male authors results obtained for 4 bins with binarization is about 14% better. The average result is not so impres-sive because of poor results for 2 and 3 bins. There is no evidence that binarization improves the results even though the best result is achieved for binarized data.

ϳϱ ϴϬ ϴϱ ϵϬ ϵϱ ϭϬϬ Ϯ ϯ ϰ ϱ ϲ ϳ ϴ ϵ ϭϬ Ž ƌƌ ĞĐƚů Ǉ ĐůĂƐ ƐŝĨŝĞ Ě ŝŶƐƚ ĂŶĐĞƐ ΀й ΁ EƵŵďĞƌŽĨďŝŶƐ ŵĂůĞͺůŝƚ ŵĂůĞͺůŝƚͺďŝŶĂƌŝǌĞĚ ĨĞŵĂůĞͺůŝƚ ĨĞŵĂůĞͺůŝƚͺďŝŶĂƌŝǌĞĚ

Fig. 3. Experimental results for lexical attributes set. Formale litandfemale litseries output data of discretization process were classified using Naive Bayesian classifier. Formale lit binarizedandfemale lit binarizedseries output data of discretization process were binarized before classification.

(8)

For female authors the best results are in some cases slightly better than the reference value. There is no significant influence of discretization on the quality of classification. Also binarization process does not improve the overall performance. ϰϱ ϱϬ ϱϱ ϲϬ ϲϱ ϳϬ ϳϱ ϴϬ ϴϱ ϵϬ ϵϱ ϭϬϬ Ϯ ϯ ϰ ϱ ϲ ϳ ϴ ϵ ϭϬ Ž ƌƌ ĞĐ ƚů Ǉ ĐůĂ ƐƐ ŝĨ ŝĞĚ ŝŶ Ɛƚ ĂŶ ĐĞƐ ΀й ΁ EƵŵďĞƌ ŽĨďŝŶƐ ŵĂůĞͺƉƵŶĐƚ ŵĂůĞͺƉƵŶĐƚͺďŝŶĂƌŝǌĞĚ ĨĞŵĂůĞͺƉƵŶĐƚ ĨĞŵĂůĞͺƉƵŶĐƚͺďŝŶĂƌŝǌĞĚ

Fig. 4. Experimental results for syntactic attributes set. Formale punctandfemale punctseries output data of discretization process were classified using Naive Bayesian classifier. Formale punct binarizedandfemale punct binarizedseries output data of discretization process were binarized before classification.

The third series of experiments have been performed for datasets with syntactic attributes only. As Figure 4 presents, the results are generally bad, especially for male authors datasets. Percentage of correctly classiﬁed instances for female authors is at most equal to the reference point.

Some results of experiments with supervised discretization are presented in Table 2. In oposition to the previous experiments none of the parameters were changed. Therefore for each experiment the single value is delivered as a result. While comparing obtained results with previous ones and the reference values (presented in Table 1), it can be concluded that in some cases the applied supervised discretization gives the best results. Especially for female authors the value of 98.00% of correctly classified instances for full attributes set is a very good result. On the other hand, effects of classification for male authors are not so impressive. Generally they are below or equal to the average result obtained in the previously described experiments. Only for syntactic attributes classification quality is greater, although performance of the classifier for this case is still rather poor.

Table 2. Results of Naive Bayes classiﬁcation for experiments performed using two training and testing sets (for male and female authors), for full, lexical and syntactic attributes sets with Fayyad & Irani’s MDL discretization method applied.

Attributes set Male authors Male authors (binarized) Female authors Female authors (binarized)

Full set 80.00 % 81.67 % 98.00% 98.00%

Lexical attributes 81.67 % 85.00 % 86.00% 92.00%

Syntactic attributes 71.67% 71.67 % 96.00% 96.00%

Summarizing the presented experiments it must be expressed that there is no steady and observable dependence between the method of discretization, number of bins (being the parameter of discretization process), and quality of classification using Naive Bayesian classifier. But results obtained for male authors datasets are promising and prove that for some situations discretization of learning and testing data before classification process can increase the quality of the overall result. The selection of attributes also must be performed very carefully. Analysis of the presented results provides conclusion that the presence of some attributes (punctuation marks in that case) can decrease the quality of classification. Influence of individual attributes on the overall performance of a classifier can be the subject of some future research work.

(9)

6. Conclusions

In the paper the application of Naive Bayes classifier for authorship attribution is presented. In particular an analy-sis of influence of data discretization on efficiency of classification process has been executed. All experiments were performed using datasets founded on texts of two male and two female authors. All results were compared with refer-ence values obtained using Naive Bayesian classifier applied for given datasets without discretization. The validation of classification accuracy was performed using testing datasets. Main analysis was performed for small amounts of bins for discretization process because of the results of preliminary experiments, which show steep degradation of classification performance for number of bins greater than 10. Significant improvement has been observed for male author datasets given full attributes and lexical attributes datasets. Unfortunately, no similar effect was perceived for female authors datasets.

Therefore, it can be concluded that employing discretization of data should be taken into consideration, because improvement of classification quality may be achieved for specific tasks. Another important fact is that learning and testing datasets, especially attributes sets, should be deeply investigated. Unproperly chosen attributes can negatively influence the classification process. It will be the subject of further research. Binarization applied for data after discretization process did not prove its usefulness. Even though the best results have been obtained for binarized data (for supervised and unsupervised discretization), general observations of the results show rather minor negative influence for the whole process.

Acknowledgements

The research described in the paper was performed at the Silesian University of Technology, Gliwice, Poland, in the framework of the statutory project BK/266/RAu2/2014/502. All texts used in presented experiments have been downloaded from Project Gutenberg website (http://www.gutenberg.org/). All experiments were performed using WEKA data mining workbench (http://www.cs.waikato.ac.nz/ml/weka/), developed as an open source project by Machine Learning Group at the University of Waikato29_.

References

1. Webb, G.. Naive Bayes. In: Sammut, C., Webb, G., editors.Encyclopedia of Machine Learning. Springer US; 2010, p. 713–714. 2. Setsirichok, D., Piroonratana, T., Wongseree, W., Usavanarong, T., Paulkhaolarn, N., Kanjanakorn, C., et al. Classiﬁcation of complete

blood count and haemoglobin typing data by a C4.5 decision tree, a Naive Bayes classiﬁer and a multilayer perceptron for thalassaemia screening.Biomedical Signal Processing and Control2012;7(2):202–212.

3. Klement, W., Wilk, S., Michalowski, W., Farion, K.J., Osmond, M.H., Verter, V.. Predicting the need for CT imaging in children with minor head injury using an ensemble of Naive Bayes classiﬁers.Artiﬁcial Intelligence in Medicine2012;54(3):163–170.

4. Chandra, B., Gupta, M.. Robust approach for estimating probabilities in Naive-Bayes classiﬁer for gene expression data. Expert Systems with Applications2011;38(3):1293–1298.

5. Loganantharaj, R.. Extensions of Naive Bayes and their applications to bioinformatics. In: Mandoiu, I., Zelikovsky, A., editors. Bioinfor-matics Research and Applications; vol. 4463 ofLecture Notes in Computer Science. Springer Berlin Heidelberg; 2007, p. 282–292. 6. Muralidharan, V., Sugumaran, V.. A comparative study of Naive Bayes classiﬁer and Bayes net classiﬁer for fault diagnosis of monoblock

centrifugal pump using wavelet analysis.Applied Soft Computing2012;12(8):2023–2029.

7. Valle, M.A., Varas, S., Ruz, G.A.. Job performance prediction in a call center using a Naive Bayes classiﬁer. Expert Systems with Applications2012;39(11):9939–9945.

8. Koc, L., Mazzuchi, T.A., Sarkani, S.. A network intrusion detection system based on a hidden Naive Bayes multiclass classiﬁer. Expert Systems with Applications2012;39(18):13492–13500.

9. Mukherjee, S., Sharma, N.. Intrusion detection using Naive Bayes classiﬁer with feature reduction.Procedia Technology2012;4:119–128. 2nd International Conference on Computer, Communication, Control and Information Technology (C3IT-2012) on February 25 - 26, 2012. 10. Jiang, L., Zhang, H., Cai, Z.. A novel Bayes model: Hidden Naive Bayes.IEEE Transactions on Knowledge and Data Engineering2009;

21(10):1361–1371.

11. Farid, D.M., Zhang, L., Rahman, C.M., Hossain, M., Strachan, R.. Hybrid decision tree and Naive Bayes classiﬁers for multi-class classiﬁcation tasks.Expert Systems with Applications2014;41(4, Part 2):1937–1946.

12. Youn, E., Jeong, M.K.. Class dependent feature scaling method using Naive Bayes classiﬁer for text datamining. Pattern Recognition Letters2009;30(5):477–485.

13. Schneider, K.M.. Techniques for improving the performance of Naive Bayes for text classiﬁcation. In:In Proceedings of Sixth International Conference on Intelligent Text Processing and Computational Linguistics (CICLing). 2005, p. 682–693.

14. Kim, S.B., Han, K.S., Rim, H.C., Myaeng, S.H.. Some eﬀective techniques for Naive Bayes text classiﬁcation. IEEE Transactions on Knowledge and Data Engineering2006;18(11):1457–1466.

(10)

15. Oxford dictionaries: stylometry. 2014. Http://http://www.oxforddictionaries.com/deﬁnition/english/stylometry (access on 2014-03-30). 16. Sta´nczyk, U.. Establishing relevance of characteristic features for authorship attribution with ANN. In: Decker, H., Lhotska, L., Link,

S., Basl, J., Tjoa, A., editors.Database and Expert Systems Applications; vol. 8056 ofLecture Notes in Computer Science. Springer Berlin Heidelberg; 2013, p. 1–8.

17. Juola, P.. Authorship attribution.Foundations and Trends in Information Retrieval2008;1(3):233–334.

18. Stamatatos, E., Fakotakis, N., Kokkinakis, G.. Computer-based authorship attribution without lexical measures. In:Computers and the Humanities. Kluwer Academic Publishers; 2001, p. 193–214.

19. Stamatatos, E., Fakotakis, N., Kokkinakis, G.. Automatic authorship attribution. In:Proceedings of EACL. 1999, p. 158–164.

20. Sta´nczyk, U.. Rough set and artiﬁcial neural network approach to computational stylistics. In: Ramanna, S., Jain, L.C., Howlett, R.J., editors.Emerging Paradigms in Machine Learning; vol. 13 ofSmart Innovation, Systems and Technologies. Springer Berlin Heidelberg; 2013, p. 441–470.

21. Koppel, M., Schler, J., Argamon, S.. Computational methods in authorship attribution.Journal of the American Society for Information Science and Technology2009;60(1):9–26.

22. McCallum, A., Nigam, K.. A comparison of event models for Naive Bayes text classiﬁcation. In:AAAI-98 Workshop On Learning For Text Categorization. AAAI Press; 1998, p. 41–48.

23. John, G., Langley, P.. Estimating continuous distributions in Bayesian classiﬁers. In:Proceedings of the Eleventh Conference on Uncertainty in Artiﬁcial Intelligence. Morgan Kaufmann; 1995, p. 338–345.

24. Kotsiantis, S., Kanellopoulos, D.. Discretization techniques: A recent survey. International Transactions on Computer Science and Engineering2006;1(32):47–58.

25. Dougherty, J., Kohavi, R., Sahami, M.. Supervised and unsupervised discretization of continuous features. In: Machine Learning: Proceedings of the Twelfth International Conference. Morgan Kaufmann; 1995, p. 194–202.

26. Fayyad, U.M., Irani, K.B.. Multi-interval discretization of continuous-valued attributes for classiﬁcation learning. In:Proceedings of the Thirteenth International Joint Conference on Artiﬁcial Intelligence (IJCAI). 1993, p. 1022–1029.

27. Witten, I.H., Frank, E., Hall, M.A.. Data Mining: Practical Machine Learning Tools and Techniques. The Morgan Kaufmann Series in Data Management Systems. Boston: Morgan Kaufmann; third ed.; 2011.

28. Sta´nczyk, U.. Decision rule length as a basis for evaluation of attribute relevance.Journal of Intelligent and Fuzzy Systems2013;24(3):429– 445.

29. Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., Witten, I.H.. The weka data mining software: an update. SIGKDD Explorations2009;11(1):10–18.