Web page feature selection and classification using neural networks

(1)

Web page feature selection

and classiﬁcation using neural networks

Ali Selamat

^*

, Sigeru Omatu

Computer and System Sciences, Graduate School of Engineering, Osaka Prefecture University, Sakai, Osaka 599-8531, Japan

Received 20 June 2002; received in revised form 30 December 2002; accepted 18 March 2003

Abstract

Automatic categorization is the only viable method to deal with the scaling problem of the World Wide Web (WWW). In this paper, we propose a news web page classification method (WPCM). The WPCM uses a neural network with inputs obtained by both the principal components and class profile-based features. Each news web page is represented by the term-weighting scheme. As the number of unique words in the collection set is big, the principal component analysis (PCA) has been used to select the most relevant features for the classification. Then the final output of the PCA is combined with the feature vectors from the class-profile which contains the most regular words in each class. We have manually selected the most regular words that exist in each class and weighted them using an entropy weighting scheme. The fixed number of regular words from each class will be used as a feature vectors together with the reduced principal components from the PCA. These feature vectors are then used as the input to the neural networks for classification. The experimental evaluation demonstrates that the WPCM method provides acceptable classification accuracy with the sports news datasets.

Keywords: Text categorization; WWW; Proﬁle-based neural networks; Principal component analysis

*Corresponding author. Address: UTM, Faculty of Computer Science and Information Systems, UTM Skudai, Johor 81310, Malaysia. Tel.: +81-722-54-9278; fax: +81-722-57-1788.

E-mail address:[email protected](A. Selamat).

doi:10.1016/j.ins.2003.03.003

www.elsevier.com/locate/ins

(2)

1. Introduction

Neural networks have been widely applied by many researchers to classify the text documents with diﬀerent types of feature vectors. For example, Lee et al. [1] have proposed a neural network for document classiﬁcation based on a linguistic feature selection with a fuzzy learning technique. Liu and Zhang [3], Enhong et al. [6], and Lam et al. [2] have proposed a term frequency method to select the feature vectors for document categorization using neural networks.

Furthermore, Karras and Mertzios [4] have proposed a feature extraction method based on a word semantic and the categories of words co-occurrence analysis for document classiﬁcation using neural networks. Chang et al. [7]

have proposed another approach of document classiﬁcation using multiple- cause networks with latent variables as a feature selection method.

A feature selection using a hybrid case-based architecture has been proposed by Gentili et al. [8] for text categorization where two multi-layer perceptrons are integrated into a case-based reasoner. Rauber et al. [5] have proposed a diﬀerent neural network architecture for document categorization namely as a growing hierarchical self-organizing map (GSOM). However, the GSOM is concentrated on a huge number of documents in the datasets to be classiﬁed and the feature selection method used is similar to approach applied by Ko- honen et al. [9].

Furthermore, Wermeter [10] has used the document title as the vectors to be used for document categorization. Ruiz and Srinivasan [11], Weigend et al.

[12], and Calvo and Ceccatto [13] have used the X² measure to select the relevant features before classifying the text documents using the neural networks. Lam and Lee [14] have used the principal component analysis (PCA) method as a feature reduction technique to the input data to the neural networks. However, if some original terms are particularly good when dis- criminating a class category, the discrimination power may be lost in the new vector space after using a latent semantic indexing (LSI) as described by Se- bastini [15].

Moreover, the techniques based on dimensionality reduction, such as the LSI, have been shown to improve the quality of the information being retrieved by capturing the latent meaning of words present in the documents. As a result of the dimensionality reduction using the PCA or the LSI for data classification, the characteristic variables that describe smaller classes tend to be lost. Hence, the classification accuracy on the smaller classes can be de- graded in the reduced dimensional space [16,18]. Zelikovitz and Hirsh [17] have included the background knowledge to the LSI method in order to classify the text. However, if a small number of documents that are represented in a particular class in the datasets, the classification accuracy may be decreased.

This problem has not been taken into consideration in the previous approach [17].

(3)

Therefore, we propose a web page classification method (WPCM), which is based on the PCA and the class profile-based features (CPBF). The proposed method is different compared to the previous works based on the improvement of web page categorization accuracies using the PCA features selection approach with a combination of background knowledge. If a small number of web page documents use in a category class, the classification accuracy can also be increased. Each web page is represented by the term frequency-weighting scheme. As the dimensionality of a feature vector in the collection set is big, the PCA has been used to reduce it into a small number of principal components.

Then we combine the feature vectors generated from the PCA with the feature vectors from the class-profile which contains the most regular words in each class before feeding them to the neural networks for classification. We have manually selected the most regular words that exist in each class and weighted them using an entropy-weighting scheme [19]. The Yahoo sports [20] news web pages have been used for the classification purpose. The web page classification techniques such as a Bayesian, a term frequency-inverse documents frequency weighting scheme (TF-IDF), a PCA-neural network (PCA-NN), and the WPCM methods have been used as the benchmark tests for the classification accuracy. The experimental evaluation demonstrates that the proposed method provides an acceptable classification accuracy with the sports news datasets.

The organization of this paper is as follows: The news classiﬁcation using the WPCM is described in Section 2. The preprocessing of web pages is ex- plained in Section 3. The comparisons of web pages classiﬁcation using the Bayesian, TF-IDF, PCA-NN, and WPCM methods are discussed in Section 4.

In Section 5, the discussions of the classification results using different methods for the sports news web pages are stated. In Section 6, we will conclude the classification accuracy by using the WPCM compared with the other methods.

2. Web page classiﬁcation method

The news web pages have diﬀerent characteristics where the text length for each of them is variable. Also the structures of the pages are diﬀerent in the tags usage (i.e., XML, html, SGML tags, etc.). Furthermore, a huge number of distinct words exist in those pages including proper or misspelled noun words as there is no restriction on word usage in the news web pages discussed by Mase and Tsuji [21].

The high dimensionality of the news web pages dataset has made the process of classification difficult. This is because there are many categories of news in the web news pages such as sports, weathers, politics, economy, etc. In each category there are many different classes. For example, the classes that exist in

(4)

the business category are stock market, ﬁnancial investment, personal ﬁnance, etc. Our approach is based on the sports news category of web pages.

As we have mentioned in Section 1, the main purpose of this paper is to improve the classiﬁcation accuracies of the classes in the datasets that are represented with a small number of documents. Our approach is diﬀerent compared to the previous works [17]. It is based on the improvement of web page categorization accuracies using the PCA features selection approach with a combination of background knowledge where a small number of documents have been used in the document class.

In order to classify the news web pages, we propose the WPCM which uses the PCA and CPBF as the input to the neural networks. First, we have used the PCA algorithm [22,23] to reduce the original data vectors to a small number of relevant features. Then we combine these features to the CPBF before input- ting them to the neural networks for classiﬁcation as shown in Fig. 1.

Web page retrieval

Stemming and stopping processes

Term weighting scheme using TFidf method

Feature reduction and selection using PCA

Select the most regular words from each class

Calculate the weight of each of the regular words using

an entropy method

Combine two types of feature vectors

Input to the neural networks for classification

Classification results

d R

PCA CPBF

WPCM

d is a feature vectors from the PCA R is a feature vectors from the CPBF Note:

Fig. 1. The process of classifying a news web page using the WPCM method.

(5)

3. Preprocessing of web pages

The process of classifying a news web page using the WPCM method is shown in Fig. 1. It consists of web news retrieval process, stemming and stopping processes, feature reduction process using our proposed method, and web classiﬁcation process using error back-propagation neural networks. The retrieving process of sports news web pages has been done by our software agent during night-time [24,25]. Only the latest sports news web pages category will be retrieved from the Yahoo web server from the WWW. Then these web pages will be stored in the local news database. Stopping is a process of removing the most frequent word that exists in a web page document such as Ôto’, Ôand’, Ôit’, etc. Removing these words will save spaces for storing document contents and reduce time taken during the search process. Stemming is a process of extracting each word from a web page document by reducing it to a possible root word. For example, the words Ôcompares’, Ôcompared’, and Ôcomparing’ have similar meaning with the word Ôcompare’. We have used the Porter [26] stemming algorithm to select only Ôcompare’ to be used as a root word in a web page document. After the stemming and stopping processes of the terms in each document, we will represent them as the document-term frequency matrixðDocj TFjkÞ as shown in Table 1. Docjis referring to each web page document that exists in the news database where j¼ 1; . . . ; n. Term frequency TFjkis the number of how many times the distinct word wkoccurs in document Docjwhere k¼ 1; . . . ; m. The calculation of the terms weight xjk of each word wk is done by using a method that has been used by Salton [27]

which is given by

x_jk ¼ TFjk idfk; ð1Þ

where the document frequency df_k is the total number of documents in the database that contains the word w_k. The inverse document frequency idf_k¼ logð_dfⁿ

kÞ where n is the total number of documents in the database.

Table 1

The document-term frequency data matrix after the stemming and stopping processes

Docj TF1 TF2 . . . TFm

Doc1 2 4 . . . 5

Doc2 2 3 . . . 2

Doc3 2 3 . . . 2

Doc4 2 6 . . . 1

Doc5 4 3 . . . 3

...

Docn 1 3 . . . 7

(6)

3.1. Feature reduction using the PCA

Suppose that we have A, which is a matrix document-terms weight as below

A¼

x₁₁ x₁₂ x_1k x_1m x21 x22 x2k x2m

x31 x32 x3k . . . ...

... ...

... .. .

. . . x_n1 x_n2 x_nk x_nm 0

BB BB B@

1 CC CC CA

;

where xjk is the terms weight that exists in the collection of documents. The deﬁnitions of j, k, m, and n have been described in the previous paragraph.

There are a few steps to be followed in order to calculate the principal components of data matrix A. The mean of m variables in data matrix A will be calculated as follows

xx_k ¼1 n

Xⁿ

j¼1

x_jk: ð2Þ

After that the covariance matrix, S¼ fsjkg is calculated. The variance s²_kkis given by

s²_kk¼1 n

Xⁿ

j¼1

ðxjk xxkÞ²: ð3Þ

The covariance sik is given by

s_ik ¼1 n

Xⁿ

j¼1

ðxji xxiÞðxjk xxkÞ; ð4Þ

where i¼ 1; . . . ; m. Then we determine the eigenvalues and eigenvectors of the covariance matrix S which is a real symmetric positive matrix. An eigenvalue k and the eigenvector e can be found such that, Se¼ ke.

In order to ﬁnd the eigenvector e the characteristic equation jS kIj ¼ 0 must be solved. If S is an m m matrix of full rank, m eigenvalues ðk1;k2; . . . ;kmÞ can be found. By using

ðS kiIÞei¼ 0; ð5Þ

all corresponding eigenvectors can be found. The eigenvalues and corresponding eigenvectors will be sorted so that k1P k2P P km. Let a square matrix E be constructed from the eigenvector columns where E¼

½e1 e₂ e₃ e₄ em. Also let us denote K as

(7)

K¼

k1 0 0 0

0 k2 0 0 0

0 0 k3 0 0

... ...

... .. .

. . .

0 0 0 0 km

0 BB BB B@

1 CC CC CA :

In order to get the principal components of matrix S, we will perform eigenvalue decomposition which is given by

E^TSE¼ K: ð6Þ

Then we select the ﬁrst d 6 m eigenvectors where d is the desired value, e.g., 100, 200, 400, etc. The set of principal components is represented as Y1¼ e^T₁s, Y₂¼ e^T₂s; . . . ; Y_d¼ e^T_ds. An n d matrix M is represented as

M¼

f11 f12 f13 f1d

f₂₁ f₂₂ f₂₃ f_2d f31 f32 f33 . . . f₄₁ f₄₂ f₄₃ .. .

. . . f_n1 f_n2 f_n3 f_nd 0

BB BB B@

1 CC CC CA

;

where f_ijis a reduced feature vectors from the m m original data size to n d size.

3.2. Feature selection using the CPBF

For the feature selection using class profile-based approach, we have manually identified the most regular words that exist in each category and weighted them using the entropy weighting [19] scheme before adding them to the feature vectors that have been selected from the PCA. For example, the words that exist regularly in a boxing class are ÔLewis’, ÔTyson’, Ôheavyweight’, Ôfighter’, Ôknockout’, etc. Then a fixed number of regular words from each class will be used as a feature vector together with the reduced principal components from the PCA. These feature vectors are then used as the input to the neural networks for classification. The entropy weighting scheme on each term is calculated as L_jk Gk where L_jk is the local weighting of term k and G_k is the global weighting of term k. The L_jk and G_k are given by

L_jk¼ 1þ log TFjk ðTFjk>0Þ

0 ðTFjk¼ 0Þ

ð7Þ and

G_k ¼1þPn k¼1

TFjk Fk log^TF_F^jk

k

log n ; ð8Þ

(8)

where n is the number of document in a collection and TFjk is the term frequency of each word in Docjas mentioned previously. The Fkis a frequency of term k in the entire document collection. We have selected R¼ 50 words that have the highest entropy value to be added to the ﬁrst d components from the PCA to be an input to the neural networks for classiﬁcation.

3.3. Input data to the neural networks

After the preprocessing of news web pages, a vocabulary that contains all the unique words in the news database has been created. We have limited the number of unique words in the vocabulary to 1800 as the number of distinct words is big. Each of the word in the vocabulary represents one feature vector.

Each feature vector contains the document-terms weight. The high dimensionality of feature vectors to be an input to the neural networks is not practical due to poor scalability and performance. Therefore, the PCA has been used to reduce the original feature vectors m¼ 1800 into a small number of principal components. In our case, we have selected a few values of d (e.g., 100, 200, 300, 400, and 500, 600) together with R¼ 50 features selected from the CPBF approach because this parameter performs better for web news classiﬁcation compared to the other parameters to be an input to the neural networks. The loading factor graph for the accumulated proportion of eigenvalues are shown in Fig. 2.

The value of d ð¼ 600Þ is selected based on the value of accumulated proportion of 81.6% from the original feature vectors for document classiﬁcation (see Figs. 2 and 3). The classiﬁcation accuracies of the classes that are repre-

0 20 40 60 80 100 120

0 500 1000 1500 2000

Number of principal components

Accumulated loading factors(%)

Fig. 2. Accumulated loading factors of principal components generated by the PCA.

(9)

sented with a small number of documents in the datasets are less accurate by using the PCA-NN approach. Further explanation on the PCA-NN has been described in Section 4.1. However, if a small number of documents used in the training of neural networks using the features selected from the PCA and combined with the features selected from the CPBF in the WPCM approach, the classiﬁcation accuracies have been increased as described in Section 5.1.

Therefore, we propose an improvement of web page classiﬁcation accuracy where a small number of documents are represented in a document category class based on a combination of the PCA and CPBF in the WPCM approach.

The combinations of these features are shown in Fig. 5. We have found that the numbers of feature vectors selected from the PCA (d¼ 600) and the features selected from the CPBF (R¼ 50) are the best values to be used in the WPCM approach. The classification accuracies with different value of d are shown in Table 2. We have found that the value of (dþ R ¼ 650) is the best, which contributes 96% of classification accuracy using the proposed WPCM method as shown in Fig. 3.

3.4. Characterization of the neural networks

The architecture of neural networks that have been used for classification process is shown in Fig. 4. The number of input layers ðpÞ is based on the number of principal components d ð¼ 600Þ and R ð¼ 50Þ. The number of hidden layersðqÞ is 50. The trial and error approach has been used to find a suitable number of hidden layers that provides good classification accuracy based on the input data to the neural networks. The number of output layers ðrÞ is 12 based on the number of classes in the sports news category as shown in Table 3.

We have deﬁned t as the iteration number, g is a learning rate, a is a momentum rate, hqis a bias on hidden unit q, hris a bias on output unit r, dqis the

0 0.2 0.4 0.6 0.8 1 1.2

0 500 1000 1500 2000

No. of features

% Accuracy

Fig. 3. The classiﬁcation accuracy with a combination of the PCA and CPBF feature vectors.

(10)

generalized error through a layer q, and d_r is the generalized error between layers q and r. The input values to the neural network are represented by f1; f2; . . . ; f_dþR.

Adaptation of the weights between hiddenðqÞ and input ðpÞ layers is given by

W_qpðt þ 1Þ ¼ WqpðtÞ þ DWqpðt þ 1Þ; ð9Þ

f1...f600

Doc1 f601...f650

PCA (d=600) CPBF (R=50)

Doc2

Docn

f1...f600 f601...f650

Doc3 f1...f600 f601...f650

Doc4 f1...f600 f601...f650

f1...f600 f601...f650

feature vectors to be input to the neural networks Docj

Fig. 5. The features vectors selected from the PCA are augmented with the features selected from the CPBF before classifying the web pages documents using the neural networks.

Wqp W_rq

netr

netq

Or

Oq

p q r

Input layers Hidden layers Output layers

f1

f2

f3

Op

cs1

cs2

cs12

represents a class of web sports documents where = csa

represents input vectors from the PCA and CPBF.

. 12 ,..., 1 a fd+R

fd+R

Fig. 4. The feature vectors from the PCA and CPBF are fed to the neural networks for classiﬁ- cation.

(11)

where

DWqpðt þ 1Þ ¼ gdqO_pþ aDWqpðtÞ; ð10Þ

d_q¼ Oqð1 OqÞX

r

d_rW_rq: ð11Þ

Note that the ﬁrst transfer function at hidden layerðqÞ is given by net_q¼X

q

W_qpO_pþ hq; ð12Þ

Table 3

The number of documents that are stored in the news database

Class no. Class nameðcsaÞ Number of documents

1 Baseball 569

2 Boxing 210

3 Cycling 80

4 Football 550

5 Golf 456

6 Hockey 856

7 Motor-sports 405

8 Rugby 52

9 Skiing 233

10 Soccer 261

11 Swimming 169

12 Tennis 255

Total 4096

These are the classes that exist in sports category.

Table 2

The web pages classiﬁcation accuracies using a combination of the PCA and CPBF feature vectors in the WPCM approach

Number of features Accuracy (%)

PCA (d) CPBF (R) Total number

of features (dþ R)

100 50 150 10

200 50 250 9

300 50 350 14

400 50 450 33

500 50 550 81

600 50 650 96

700 50 750 96

800 50 850 96

900 50 950 96

1000 50 1050 96

1350 50 1400 97

1550 50 1600 97

1750 50 1800 97

(12)

Oq¼ f ðnetqÞ ¼ 1=ð1 þ e^net^qÞ: ð13Þ Adaptation of the weights between outputðrÞ and hidden ðqÞ layers is given by

W_rqðt þ 1Þ ¼ WrqðtÞ þ DWrqðt þ 1Þ; ð14Þ where

DWrqðt þ 1Þ ¼ gdrO_qþ aDWrqðtÞ; ð15Þ

dr¼ Orð1 OrÞðtr OrÞ: ð16Þ

Then the output function at the output layerðkÞ is given by net_r¼X

r

W_rqO_qþ hr; ð17Þ

O_r¼ f ðnetrÞ ¼ 1=ð1 þ e^net^rÞ: ð18Þ

The parameters of error back-propagation neural networks are set as in Table 4.

3.5. Neural networks training

In order to train the neural networks, a set of training documents and speciﬁcation of the predeﬁned classes to which the document belongs is re- quired. Each training example is an input–output pair Tj¼ ðDocj; csaÞ where Tj

is the training document, Docjis the reduced feature vector of the jth training document, and csa is the desired classiﬁcation vector corresponding to Docj. The component values of csaare determined based on the classiﬁed information provided in the database as shown in Table 3. In our case, a¼ 1; . . . ; 12.

During training, the connection weights of the neural networks are initial- ized with some random values. The training examples in the training set are then presented to the neural network classiﬁer in random order and the connection weights are adjusted according to the error back-propagation learning rule. This process is repeated until the mean squares error (MSE) falls below a predeﬁned tolerance level or the maximum number of iterations is achieved.

Table 4

The error back-propagation neural networks parameters that have been used in the experiments

No. Neural networks parameters Values

1 Learning rate (g) 0.005

2 Momentum rate (a) 0.001

3 Number of iteration (t) 1000

4 MSE 0.005

(13)

4. Experiments

We have used a web page dataset from the Yahoo sports news as shown in Table 3. The types of news in the database are baseball, boxing, cycling, football, golf, hockey, motor-sports, rugby, skiing, soccer, swimming, and tennis. The total of documents are 4096. For training, we have selected ran- domly 1000 documents from different classes. The rest of the documents are used as the test sets. We have used the same test as being done by Yang and Honavar [28], which includes the Bayesian and TF-IDF classifiers. Also we include the WPCM approach as a comparison to the methods that are used in order to examine the applicability of the classification system. The description of WPCM has been mentioned in Section 2. The PCA-NN, TF-IDF and Bayesian methods for the tests are described as below.

4.1. PCA-NN classiﬁer

The PCA-neural networks (PCA-NN) is a process of web page classification using a number of features selected from the PCA and fed into the neural networks for classification. In the experiments, we have selected 600 principal components. These principal components have been fed into the neural networks for classification. The neural networks parameters used in the experiments are similar to the WPCM parameters as shown in Table 4.

4.2. TF-IDF classiﬁer

TF-IDF classiﬁer is based on the relevant feedback algorithm by Rocchio using the vector space model [29]. The algorithm represents documents as vectors so that the documents with similar contents have similar vectors. Each component of a vector corresponds to a term in the document, typically a word. The weight of each component is calculated using the term frequency inverse document frequency weighting scheme (TF-IDF) which tries to reward words that occur many times in a few documents. To classify a new document Doc⁰, the cosines of the prototype vectors with corresponding document vectors are calculated for each class. Doc⁰ is assigned to the class whose document vector has the highest cosine. Further description on the TF-IDF measure is described by Joachim [30].

4.3. Bayesian classiﬁer

For the statistical classification, we have used a standard Bayesian classifier [30]. When using the Bayesian classifier, we have assumed that term’s occurrence is independent of other terms. We want to find a class cs that gives the highest conditional probability for given document Doc⁰. w^m_k ¼ fw1; w₂; . . . ; w_mg

(14)

is the set of words representing the textual content of the document Doc⁰and k is the term number where k¼ 1; . . . ; m. The classiﬁcation score is measured by

JðcsÞ ¼Y^m

k¼1

PðwkjcsÞP ðcsÞ; ð19Þ

where PðcsÞ is the prior probability of class cs, and P ðwkjcsÞ is the likelihood of a word wk in a class cs that is estimated on a labeled training document. A given web page document is then classified in a class that maximizes JðcsÞ. If values of JðcsÞ for all the classes in the sports category are less than a given threshold, then the document is considered unclassified. The Rainbow toolkit [31] has been used for the classification of the training and test news web pages using the Bayesian and TF-IDF methods.

5. Evaluation

The TF-IDF, Bayesian, PCA-NN, and WPCM classiﬁcation methods are evaluated using the standard information retrieval measures that are precision, recall, and F 1 [32]. They are deﬁned as follows

precision¼ a

aþ b; ð20Þ

recall¼ a

aþ c; ð21Þ

F1¼ 2

1

precisionþ_recall¹ ; ð22Þ

where the values of a, b, and c are deﬁned in Table 5. The relationship between the system classiﬁcation and the expert judgment is expressed using four values as shown in Table 6. The F 1 measure is a kind of average of precision and recall.

5.1. Simulation results

The classiﬁcation results using the PCA-NN and WPCM methods are shown in Tables 7 and 8. The average for precision, recall, and F 1 measures

Table 5

The deﬁnitions of the parameters a, b, and c which are used in Table 6

Value Meaning

a The system and the expert agree with the assigned category b The system disagrees with the assigned category but the expert did c The expert disagrees with the assigned category but the system did d The system and the expert disagree with the assigned category

(15)

using the PCA-NN approach are 87.79%, 92.04%, and 89.89%, respectively. In comparison with the WPCM approach, the precision, recall, and F 1 measures are 90.35%, 93.81%, and 91.65%, respectively. Furthermore, we have compared

Table 6

The decision matrix for calculating the classiﬁcation accuracies

Expert System

Yes No

Yes a b

No c d

Table 7

The classiﬁcation results using the PCA-NN method

Class no. Precision (%) Recall (%) F1 (%)

1 89.82 100.00 94.64

2 84.75 100.00 91.74

3 71.60 72.50 72.05

4 97.06 99.00 98.02

5 100.00 97.00 98.48

6 90.83 99.00 94.74

7 97.09 100.00 98.52

8 100.00 100.00 100.00

9 99.01 100.00 99.01

10 71.26 67.00 67.07

11 82.62 95.00 92.23

12 69.44 75.00 72.12

Average 87.79 92.04 89.89

Table 8

The classiﬁcation results using the WPCM method

Class no. Precision (%) Recall (%) F1 (%)

1 97.14 68.00 80.00

2 86.96 100.00 93.02

3 98.68 93.75 96.15

4 86.09 99.00 92.09

5 89.91 98.00 93.78

6 94.29 99.00 96.59

7 84.03 100.00 91.32

8 97.62 100.00 98.80

9 89.29 100.00 94.34

10 73.91 68.00 70.83

11 90.09 100.00 94.79

12 96.15 100.00 98.04

Average 90.35 93.81 91.65

(16)

the WPCM with the TF-IDF and Bayesian methods using the F 1 measure.

They are 78.06% and 81.07% for the TF-IDF and Bayesian methods as shown in Table 9. This indicates that if the feature vectors are selected carefully, the improvement of web sports news classification using the combination of the PCA and CPBF in the WPCM approach will increase the classification accuracy. The cycling and rugby classes (class numbers 3 and 8) contain a small number of documents in the dataset as shown in Table 3. The F 1 measure using the PCA-NN approach for the cycling and rugby classes are 73% and 100%, respectively, as shown in Table 7. But for the WPCM approach the classification accuracies for both classes are 98.80% and 96.15%, respectively, as shown in Table 7. These indicate that the WPCM approach has been able to classify the documents correctly although the number of documents representing the class is small. Also, we have overcome the limitation of the PCA- NN in supervised data where the characteristic variables that describe smaller classes tend to be lost as a result of the dimensionality reduction by using the WPCM approach. Furthermore, the classification accuracy on the small classes can be improved although they have been reduced into a small number of principal components, as the numbers of features from the CPBF are included during the classification process.

5.2. Discussions

The F 1 measures for the cycling, rugby, swimming, and boxing classes using the WPCM approach are better compared to the Bayesian, PCA-NN, and TF- IDF approaches. The numbers of documents belongings to each of these classes are less compared to the baseball, football, golf, hockey, motor-sports, skiing, soccer, and tennis classes as shown in Table 10. For example, there are

Table 9

The classiﬁcation results using the F 1 measure for the TF-IDF and Bayesian methods

Class no. TF-IDF (%) Bayesian (%)

1 83.04 83.04

2 88.89 90.48

3 100.00 100.00

4 79.39 78.79

5 86.86 81.75

6 68.03 86.89

7 56.46 59.49

8 50.00 75.00

9 91.30 89.86

10 88.46 80.77

11 64.00 56.00

12 80.26 90.79

Average 78.06 81.07

(17)

only 80, 52, 169, and 210 documents belongings to the cycling, rugby, swimming, and boxing classes, respectively. The F 1 measures of the cycling, rugby, swimming, and boxing classes using the PCA-NN approach are 72.05%, 100.00%, 92.23%, and 91.74%, respectively. For the TF-IDF approach, the F 1 measures of the cycling, rugby, swimming, and boxing classes are 100.00%, 50.00%, 64.00%, and 88.89%, respectively. The F 1 measures of the cycling, rugby, swimming, and boxing classes using the Bayesian approach are 100.00%, 75.00%, 56.00%, and 90.48%, respectively. However, the F 1 measures of the cycling, rugby, swimming, and boxing classes using the WPCM approach are 96.15%, 98.80%, 94.79%, and 93.02%, respectively. The average F 1 measures using the PCA-NN, TF-IDF, Bayesian, and WPCM approaches are 89.89%, 78.06%, 81.07%, and 91.65% respectively.

The CPBF feature selection process applied in the WPCM approach is based on the highest entropy of the keywords calculated by using the Eqs. (7) and (8).

The ﬁrst 50 keywords with the highest entropy values are selected and combined with the features taken from the PCA approach. The suitability of the keywords selection belonging to the particular classes has not been carefully considered in the WPCM approach while doing the experiments as described in Section 3.2. For example, the entropies for the keywords Ôfamous’, Ôﬁeld’, Ôskills’, and Ômanager’ from the class soccer are selected although these keywords also exist in the other classes such as the football and hockey classes.

However, identifying keywords alone are not enough. The associated weights related to each of the keywords are also important to improve the F 1 classification performance using the proposed approach as the same word may occur in different classes. This is the main reason why the classification accuracies of the baseball, football, golf, motor-sports, and skiing classes (94.64%, 98.02%, 98.48%, 98.52%, and 99.01%) are better when using the PCA-NN approach compared with the WPCM approach as shown in Table 7. Fur- thermore, the stemming and stopping processes are also degrading the classification performance using the WPCM approach compared with the PCA-NN approach. However, the average F 1 performance of WPCM approach is better than the other approach as shown in Tables 7–9.

Table 10

The F 1 measures for the classes that contain small number of documents to be classiﬁed using the WPCM, PCA-NN, TF-IDF, and Bayesian methods

Class no. Class nameðcsaÞ Number of documents

F1(%) WPCM (%)

PCA-NN (%)

TF-IDF (%)

Bayesian (%)

3 Cycling 80 96.15 72.05 100.00 100.00

8 Rugby 52 98.80 100.00 50.00 75.00

11 Swimming 169 94.79 92.23 64.00 56.00

2 Boxing 210 93.02 91.74 88.89 90.48

(18)

Also, the classiﬁcation accuracies of the baseball and soccer classes (83.00%

and 88.46%) are better when using the TF-IDF approach compared with the WPCM approach as shown in Table 9. Moreover, the classiﬁcation accuracies of the baseball and soccer classes (83.04% and 80.77%) are better when using the Bayesian approach compared with the WPCM approach (80.00% for the baseball class, 70.83% for the soccer class) as shown in Tables 8 and 9. This is because the contents of training documents belong to the football and hockey classes contain many sparse keywords. A better document selection approach needs to be used for selecting the candidate documents from each class in order to increase the F 1 classiﬁcation results by using the WPCM approach.

6. Conclusions

We have presented a new approach of web news classification using the WPCM. The feature selection for the WPCM is based on the combination of the PCA and CPBF methods. The comparison of classification accuracy using the Bayesian, TF-IDF, PCA-NN, and WPCM approaches have been presented in this paper. The WPCM approach has been applied to classify the Yahoo sports news web pages. The experimental evaluation with different classification algorithms demonstrates that this method provides acceptable classification accuracy with the sports news datasets. Also, we have overcome the limitation of the PCA-NN in supervised data where the characteristic variables that describe smaller classes tend to be lost as a result of the dimensionality reduction by using the WPCM approach. Furthermore, the classification accuracy on the small classes can be improved although they have been reduced into a small number of principal components. Although the classification accuracy using the WPCM approach is high in comparison with the other approaches, the time taken for training is relatively long compared with the other methods.

Acknowledgement

The authors wish to thank the reviewers for their helpful suggestions.

References

[1] H.M. Lee, C.M. Chen, C.W. Hwang, A neural network document classiﬁer with linguistic feature selection, in: Proceedings of 13th International Conference on Industrial and Engineering Applications of Artiﬁcial Intelligence and Expert Systems, New Orleans, Louisiana, USA, IEA/AIE 2000, 19–22 June 2000, pp. 555–560.

[2] W. Lam, M.E. Ruiz, P. Srinivasan, Automatic text categorization and its applications to text retrieval, IEEE Transactions on Knowledge Data Engineering 11 (6) (1999) 865–879.

(19)

[3] Z.Q. Liu, Y. Zhang, A competitive neural network approach to web-page categorization, International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems 9 (6) (2001) 731–741.

[4] D.A. Karras, B.G. Mertzios, A robust meaning extraction methodology using supervised neural networks, in: Proceedings of Australian Joint Conference on Artiﬁcial Intelligence, 2002, pp. 498–510.

[5] A. Rauber, D. Merkl, M. Dittenbach, The growing hierarchical self-organizing map:

exploratory analysis of high-dimensional data, IEEE Transactions on Neural Networks 13 (6) (2002) 1331–1341.

[6] C. Enhong, W. Shangfei, Z. Zhenya, W. Xufa, Document classiﬁcation with CC4 neural network, in: Proceedings of ICONIP 2001, Sanghai, China.

[7] J.H. Chang, J.W. Lee, Y. Kim, B.T. Zhang, Topic extraction from text documents using multiple-cause networks, in: Proceedings of 7th Pacific Rim International Conference on Artificial Intelligence, Lecture Notes in Artificial Intelligence, vol. 2417, 2002, pp. 434–443.

[8] G.L. Gentili, M. Marinilli, A. Micarelli, F. Sciarrone, Text categorization in an intelligent agent for ﬁltering information on the Web, International Journal of Pattern Recognition and Artiﬁcial Intelligence 15 (3) (2001) 527–549.

[9] T. Kohonen, S. Kaski, K. Lagus, J. Saloj€aavi, J. Honkela, V. Paatero, A. Saarela, Self organization of a massive text document collection, in: E. Oja, S. Kaski (Eds.), Kohonen Maps, Elsevier, Amsterdam, 1999, pp. 171–182.

[10] S. Wermeter, Neural network agents for learning semantic text classiﬁcation, Information Retrieval 3 (2) (2000) 87–103.

[11] E.M. Ruiz, P. Srinivasan, Hierarchical text categorization using neural networks, Information Retrieval 5 (1) (2002) 87–118.

[12] A.S. Weigend, E.D. Wiener, J.O. Pedersen, Exploiting hierarchy in text categorization, Information Retrieval 1 (3) (1999) 193–216.

[13] R.A. Calvo, H.A. Ceccatto, Intelligent document classiﬁcation, Intelligent Data Analysis 4 (5) (2000).

[14] S.Y. Lam, D.L. Lee, Feature reduction for neural network based text categorization, in:

Proceedings of the 6th International Conference on Database Systems for Advanced Applications, Hsinchu, Taiwan, April 1999.

[15] F. Sebastini, Machine learning in automated text categorization, ACM Computing Surveys 34 (1) (2002) 1–47.

[16] G. Karypis, E.H. Sam, Fast dimensionality reduction algorithm with applications to document retrieval and categorization, in: Proceedings of the 2000 ACM CIKM International Conference on Information and Knowledge Management, McLean, VA, USA, 2000, pp.

12–19.

[17] S. Zelikovitz, H. Hirsh, Improving short text classiﬁcation using unlabeled background knowledge, in: Proceedings of the Seventeenth International Conference on Machine Learning, Stanford University, USA, June 2000.

[18] A. Selamat, S. Omatu, H. Yanagimoto, Web news classiﬁcation using neural networks based on PCA, in: SICE, Osaka International Convention Center, Osaka, Japan, 5–7 August 2002, pp. 2388–2393.

[19] S.T. Dumais, Improving the retrieval of information from external sources, Behavior Research Methods, Instruments and Computers 23 (2) (1991) 229–236.

[20] Yahoo Web Pages, 2001, Available from <http://www.yahoo.com>.

[21] H. Mase, H. Tsuji, Experiments on automatic web page categorization for information retrieval system, IPSJ Journal 42 (2) (2001) 334–347.

[22] R.A. Calvo, M. Partridge, M. Jabri, A comparative study of principal components analysis techniques, in: Proceedings Ninth Australian Conference on Neural Networks, Brisbane, QLD, 1998, pp. 276–281.

(20)

[23] R.A. Johnson, W.D. Wichern, Applied Multivariate Statistical Analysis, ﬁfth ed., Prentice- Hall, USA, 2002.

[24] A. Selamat, S. Omatu, H. Yanagimoto, Information retrieval from the Internet using mobile agent search system, in: Proceedings of International Conference on Information Technology and Multimedia 2001, University Tenaga Nasional (UNITEN), Malaysia, August 2001, pp.

9–16.

[25] A. Selamat, S. Omatu, H. Yanagimoto, Web news retrieval agent using neural networks based on PCA, in: Proceedings of the 46th Annual Conference of the Institute of Systems, Control and Information Engineers (ISCIE), Kobe, Japan, May 2002, pp. 401–401.

[26] S. Jones, Karen, P. Willet, Readings in information retrieval, San Francisco, USA, 1997.

[27] Salton&McGill, Introduction to modern information retrieval, New York, McGraw-Hill, USA, 1983.

[28] J.P. Yang, V. Honavar, L. Miller, Mobile intelligent agents for document classiﬁcation and retrieval: a machine learning approach, in: Proceedings of the European Symposium on Cybernetics and Systems Research, Vienna, Austria, 1998, pp. 707–712.

[29] R.R. Korfhage, Information Storage and Retrieval, John Wiley and Sons Inc., USA, 1997.

[30] T. Joachims, Probabilistic analysis of the Rocchio algorithm with TFIDF for text categorization, in: Proceedings of International Conference on Machine Learning, Nashville, TN, USA, 1997, pp. 143–151.

[31] A.K. McCallum, Bow: a toolkit for statistical language modeling, text retrieval, classiﬁcation and clustering (http://www.cs.cmu.edu/mccallum/~bow), 1996.

[32] D.D. Lewis, Evaluating and optimizing autonomous text classiﬁcation systems, in: E.A. Fox, P. Ingwersen, R. Fidel (Ed.), SIGIR’95: Proceedings of the 18th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, New York, 1995, pp. 246–254.