Web page feature selection and classification using neural networks


Ali Selamat *, Sigeru Omatu

Computer and System Sciences, Graduate School of Engineering, Osaka Prefecture University, Sakai, Osaka 599-8531, Japan

Received 20 June 2002; received in revised form 30 December 2002; accepted 18 March 2003

Abstract

Automatic categorization is the only viable method to deal with the scaling problem of the World Wide Web (WWW). In this paper, we propose a news web page classification method (WPCM). The WPCM uses a neural network with inputs obtained from both the principal components and class profile-based features. Each news web page is represented by the term-weighting scheme. As the number of unique words in the collection set is large, principal component analysis (PCA) has been used to select the most relevant features for the classification. The final output of the PCA is then combined with the feature vectors from the class profile, which contains the most regular words in each class. We have manually selected the most regular words that exist in each class and weighted them using an entropy weighting scheme. A fixed number of regular words from each class is used as a feature vector together with the reduced principal components from the PCA. These feature vectors are then used as the input to the neural networks for classification. The experimental evaluation demonstrates that the WPCM method provides acceptable classification accuracy with the sports news datasets.

© 2003 Elsevier Inc. All rights reserved.

Keywords: Text categorization; WWW; Profile-based neural networks; Principal component analysis

*Corresponding author. Address: UTM, Faculty of Computer Science and Information Systems, UTM Skudai, Johor 81310, Malaysia. Tel.: +81-722-54-9278; fax: +81-722-57-1788.

E-mail address: [email protected] (A. Selamat).

0020-0255/$ - see front matter © 2003 Elsevier Inc. All rights reserved.

doi:10.1016/j.ins.2003.03.003

www.elsevier.com/locate/ins


1. Introduction

Neural networks have been widely applied by many researchers to classify the text documents with different types of feature vectors. For example, Lee et al. [1] have proposed a neural network for document classification based on a linguistic feature selection with a fuzzy learning technique. Liu and Zhang [3], Enhong et al. [6], and Lam et al. [2] have proposed a term frequency method to select the feature vectors for document categorization using neural networks.

Furthermore, Karras and Mertzios [4] have proposed a feature extraction method based on word semantics and the analysis of word category co-occurrence for document classification using neural networks. Chang et al. [7] have proposed another approach to document classification using multiple-cause networks with latent variables as a feature selection method.

A feature selection method using a hybrid case-based architecture has been proposed by Gentili et al. [8] for text categorization, where two multi-layer perceptrons are integrated into a case-based reasoner. Rauber et al. [5] have proposed a different neural network architecture for document categorization, namely a growing hierarchical self-organizing map (GSOM). However, the GSOM concentrates on datasets with a huge number of documents to be classified, and the feature selection method used is similar to the approach applied by Kohonen et al. [9].

Furthermore, Wermter [10] has used the document title as the vector for document categorization. Ruiz and Srinivasan [11], Weigend et al. [12], and Calvo and Ceccatto [13] have used the $\chi^2$ measure to select the relevant features before classifying the text documents using neural networks. Lam and Lee [14] have used the principal component analysis (PCA) method as a feature reduction technique for the input data to the neural networks. However, if some original terms are particularly good at discriminating a class category, the discrimination power may be lost in the new vector space after using latent semantic indexing (LSI), as described by Sebastiani [15].

Moreover, techniques based on dimensionality reduction, such as the LSI, have been shown to improve the quality of the information being retrieved by capturing the latent meaning of the words present in the documents. However, as a result of the dimensionality reduction using the PCA or the LSI for data classification, the characteristic variables that describe smaller classes tend to be lost. Hence, the classification accuracy on the smaller classes can be degraded in the reduced dimensional space [16,18]. Zelikovitz and Hirsh [17] have added background knowledge to the LSI method in order to classify text. However, if a particular class is represented by only a small number of documents in the datasets, the classification accuracy may decrease. This problem has not been taken into consideration in the previous approach [17].


Therefore, we propose a web page classification method (WPCM), which is based on the PCA and the class profile-based features (CPBF). The proposed method differs from previous work in that it improves web page categorization accuracy by using the PCA feature selection approach in combination with background knowledge. The classification accuracy can be increased even when a category class contains only a small number of web page documents. Each web page is represented by the term frequency-weighting scheme. As the dimensionality of a feature vector in the collection set is large, the PCA has been used to reduce it to a small number of principal components. We then combine the feature vectors generated by the PCA with the feature vectors from the class profile, which contains the most regular words in each class, before feeding them to the neural networks for classification. We have manually selected the most regular words that exist in each class and weighted them using an entropy-weighting scheme [19]. The Yahoo sports [20] news web pages have been used for the classification. Web page classification techniques, namely a Bayesian classifier, a term frequency-inverse document frequency weighting scheme (TF-IDF), a PCA-neural network (PCA-NN), and the WPCM, have been used as benchmark tests for the classification accuracy. The experimental evaluation demonstrates that the proposed method provides acceptable classification accuracy with the sports news datasets.

The organization of this paper is as follows. The news classification using the WPCM is described in Section 2. The preprocessing of web pages is explained in Section 3. The comparison of web page classification using the Bayesian, TF-IDF, PCA-NN, and WPCM methods is discussed in Section 4. In Section 5, the classification results using the different methods for the sports news web pages are discussed. In Section 6, we conclude by comparing the classification accuracy of the WPCM with that of the other methods.

2. Web page classification method

News web pages have different characteristics, and the text length of each page is variable. The structures of the pages also differ in their use of tags (e.g., XML, HTML, SGML tags, etc.). Furthermore, a huge number of distinct words exist in those pages, including proper nouns and misspelled words, as there is no restriction on word usage in news web pages, as discussed by Mase and Tsuji [21].

The high dimensionality of the news web pages dataset makes the process of classification difficult. This is because there are many categories of news in the web news pages, such as sports, weather, politics, economy, etc. In each category there are many different classes. For example, the classes that exist in the business category are stock market, financial investment, personal finance, etc. Our approach is based on the sports news category of web pages.

As we have mentioned in Section 1, the main purpose of this paper is to improve the classification accuracies of the classes in the datasets that are represented with a small number of documents. Our approach is different compared to the previous works [17]. It is based on the improvement of web page categorization accuracies using the PCA features selection approach with a combination of background knowledge where a small number of documents have been used in the document class.

In order to classify the news web pages, we propose the WPCM, which uses the PCA and CPBF as the input to the neural networks. First, we have used the PCA algorithm [22,23] to reduce the original data vectors to a small number of relevant features. Then we combine these features with the CPBF before inputting them to the neural networks for classification, as shown in Fig. 1.

Fig. 1. The process of classifying a news web page using the WPCM method. The flowchart steps are: web page retrieval; stemming and stopping processes; term weighting scheme using the TF-IDF method; feature reduction and selection using the PCA (giving d); selection of the most regular words from each class and calculation of the weight of each regular word using an entropy method (CPBF, giving R); combination of the two types of feature vectors (WPCM); input to the neural networks for classification; classification results. Note: d is a feature vector from the PCA and R is a feature vector from the CPBF.


3. Preprocessing of web pages

The process of classifying a news web page using the WPCM method is shown in Fig. 1. It consists of a web news retrieval process, stemming and stopping processes, a feature reduction process using our proposed method, and a web classification process using error back-propagation neural networks. The sports news web pages are retrieved by our software agent during night-time [24,25]. Only the latest web pages in the sports news category are retrieved from the Yahoo web server on the WWW. These web pages are then stored in the local news database.

Stopping is the process of removing the most frequent words that exist in a web page document, such as ‘to’, ‘and’, ‘it’, etc. Removing these words saves space for storing document contents and reduces the time taken during the search process. Stemming is the process of reducing each word extracted from a web page document to a possible root word. For example, the words ‘compares’, ‘compared’, and ‘comparing’ have a similar meaning to the word ‘compare’. We have used the Porter [26] stemming algorithm so that only ‘compare’ is used as the root word in a web page document.

After the stemming and stopping processes of the terms in each document, we represent them as the document-term frequency matrix $(Doc_j \times TF_{jk})$ as shown in Table 1. $Doc_j$ refers to each web page document in the news database, where $j = 1, \ldots, n$. The term frequency $TF_{jk}$ is the number of times the distinct word $w_k$ occurs in document $Doc_j$, where $k = 1, \ldots, m$. The weight $x_{jk}$ of each word $w_k$ is calculated using the method of Salton [27], which is given by

$$x_{jk} = TF_{jk} \times idf_k, \tag{1}$$

where the document frequency $df_k$ is the total number of documents in the database that contain the word $w_k$, and the inverse document frequency is $idf_k = \log(n / df_k)$, where $n$ is the total number of documents in the database.

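Table 1
The document-term frequency data matrix after the stemming and stopping processes

| $Doc_j$ | $TF_1$ | $TF_2$ | ... | $TF_m$ |
|---|---|---|---|---|
| $Doc_1$ | 2 | 4 | ... | 5 |
| $Doc_2$ | 2 | 3 | ... | 2 |
| $Doc_3$ | 2 | 3 | ... | 2 |
| $Doc_4$ | 2 | 6 | ... | 1 |
| $Doc_5$ | 4 | 3 | ... | 3 |
| ... | ... | ... | ... | ... |
| $Doc_n$ | 1 | 3 | ... | 7 |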
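The weighting of Eq. (1) can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function name `tfidf_weights`, the toy counts, and the use of the natural logarithm are assumptions, since the paper does not state the base of the logarithm.

```python
import math

def tfidf_weights(tf):
    """Return x[j][k] = TF[j][k] * log(n / df[k]) for a list-of-rows TF matrix."""
    n = len(tf)          # number of documents
    m = len(tf[0])       # number of distinct terms
    # df[k]: number of documents in which term k occurs at least once
    df = [sum(1 for row in tf if row[k] > 0) for k in range(m)]
    # idf[k] = log(n / df[k]); natural log is an assumption of this sketch
    idf = [math.log(n / d) if d > 0 else 0.0 for d in df]
    return [[row[k] * idf[k] for k in range(m)] for row in tf]

# Hypothetical toy counts: 3 documents, 3 terms
tf = [
    [2, 0, 1],
    [1, 3, 0],
    [4, 0, 0],
]
x = tfidf_weights(tf)
```

In this toy matrix the first term occurs in every document, so its inverse document frequency is log(3/3) = 0 and it contributes nothing to any document vector, which is the intended effect of the scheme.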


3.1. Feature reduction using the PCA

Suppose that we have $A$, the matrix of document-term weights, as below

$$A = \begin{pmatrix}
x_{11} & x_{12} & \cdots & x_{1k} & \cdots & x_{1m} \\
x_{21} & x_{22} & \cdots & x_{2k} & \cdots & x_{2m} \\
x_{31} & x_{32} & \cdots & x_{3k} & \cdots & x_{3m} \\
\vdots & \vdots & & \vdots & \ddots & \vdots \\
x_{n1} & x_{n2} & \cdots & x_{nk} & \cdots & x_{nm}
\end{pmatrix},$$

where $x_{jk}$ is the weight of a term that exists in the collection of documents. The definitions of $j$, $k$, $m$, and $n$ have been given in the previous paragraph. A few steps are followed in order to calculate the principal components of the data matrix $A$. The mean of each of the $m$ variables in the data matrix $A$ is calculated as follows

$$\bar{x}_k = \frac{1}{n} \sum_{j=1}^{n} x_{jk}. \tag{2}$$

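After that, the covariance matrix $S = \{s_{jk}\}$ is calculated. The variance $s_{kk}^2$ is given by

$$s_{kk}^2 = \frac{1}{n} \sum_{j=1}^{n} (x_{jk} - \bar{x}_k)^2. \tag{3}$$

The covariance $s_{ik}$ is given by

$$s_{ik} = \frac{1}{n} \sum_{j=1}^{n} (x_{ji} - \bar{x}_i)(x_{jk} - \bar{x}_k), \tag{4}$$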

where $i = 1, \ldots, m$. Then we determine the eigenvalues and eigenvectors of the covariance matrix $S$, which is a real, symmetric, positive semi-definite matrix. An eigenvalue $\lambda$ and an eigenvector $e$ can be found such that $Se = \lambda e$.

In order to find the eigenvector $e$, the characteristic equation $|S - \lambda I| = 0$ must be solved. If $S$ is an $m \times m$ matrix of full rank, $m$ eigenvalues $(\lambda_1, \lambda_2, \ldots, \lambda_m)$ can be found. By using

$$(S - \lambda_i I) e_i = 0, \tag{5}$$

all corresponding eigenvectors can be found. The eigenvalues and corresponding eigenvectors are sorted so that $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_m$. Let a square matrix $E$ be constructed from the eigenvector columns, where $E = [e_1 \; e_2 \; e_3 \; e_4 \; \cdots \; e_m]$. Also let us denote $\Lambda$ as

$$\Lambda = \begin{pmatrix}
\lambda_1 & 0 & 0 & \cdots & 0 \\
0 & \lambda_2 & 0 & \cdots & 0 \\
0 & 0 & \lambda_3 & \cdots & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
0 & 0 & 0 & \cdots & \lambda_m
\end{pmatrix}.$$

In order to get the principal components of matrix $S$, we perform the eigenvalue decomposition, which is given by

$$E^{T} S E = \Lambda. \tag{6}$$

Then we select the first $d \le m$ eigenvectors, where $d$ is the desired value, e.g., 100, 200, 400, etc. The set of principal components is represented as $Y_1 = e_1^{T} s$, $Y_2 = e_2^{T} s, \ldots, Y_d = e_d^{T} s$. An $n \times d$ matrix $M$ is represented as

$$M = \begin{pmatrix}
f_{11} & f_{12} & f_{13} & \cdots & f_{1d} \\
f_{21} & f_{22} & f_{23} & \cdots & f_{2d} \\
f_{31} & f_{32} & f_{33} & \cdots & f_{3d} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
f_{n1} & f_{n2} & f_{n3} & \cdots & f_{nd}
\end{pmatrix},$$

where $f_{ij}$ is an element of the reduced feature vectors, which take the original data from size $n \times m$ to size $n \times d$.
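The steps of Eqs. (2)-(6) can be illustrated with a small, dependency-free Python sketch. It is not the authors' code: the function names are hypothetical, power iteration with deflation is swapped in for a full eigenvalue decomposition (adequate only for tiny examples), and projecting the mean-centred rows onto the leading eigenvectors is an assumption about how the reduced matrix $M$ is formed.

```python
import math

def covariance_matrix(a):
    """Return (S, centred) for an n x m data matrix, using the 1/n factor of Eqs. (3)-(4)."""
    n, m = len(a), len(a[0])
    mean = [sum(row[k] for row in a) / n for k in range(m)]       # Eq. (2)
    centred = [[row[k] - mean[k] for k in range(m)] for row in a]
    s = [[sum(c[i] * c[k] for c in centred) / n for k in range(m)]
         for i in range(m)]
    return s, centred

def mat_vec(s, v):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(s_ik * v_k for s_ik, v_k in zip(row, v)) for row in s]

def leading_eigenpairs(s, d, iterations=500):
    """Approximate the d largest eigenpairs (lambda_i, e_i) of a symmetric matrix S.

    Assumption: power iteration with deflation stands in for a full
    eigenvalue decomposition; eigenvalues come out in decreasing order.
    """
    m = len(s)
    work = [row[:] for row in s]                      # copy, S itself is untouched
    pairs = []
    for idx in range(d):
        v = [1.0 / (1 + i + idx) for i in range(m)]   # deterministic start vector
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
        for _ in range(iterations):
            w = mat_vec(work, v)
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-15:                          # remaining matrix is numerically zero
                break
            v = [x / norm for x in w]
        lam = sum(x * y for x, y in zip(v, mat_vec(work, v)))   # Rayleigh quotient
        pairs.append((lam, v))
        for i in range(m):                            # deflate: S <- S - lambda e e^T
            for k in range(m):
                work[i][k] -= lam * v[i] * v[k]
    return pairs

def project(centred, pairs):
    """Build the n x d matrix M by projecting each centred row onto e_1..e_d."""
    return [[sum(c_k * e_k for c_k, e_k in zip(row, e)) for _, e in pairs]
            for row in centred]

# Hypothetical toy data: 4 documents, 2 term weights
a = [[2.0, 0.0], [0.0, 2.0], [3.0, 3.0], [-1.0, -1.0]]
s, centred = covariance_matrix(a)
pairs = leading_eigenpairs(s, d=2)
reduced = project(centred, pairs[:1])    # keep only the first principal component
```

For this toy matrix the covariance is [[2.5, 1.5], [1.5, 2.5]], whose eigenvalues are 4 and 1, so keeping one component retains 80% of the total variance. For the vocabulary sizes used in the paper, a library eigen-solver would be the practical choice.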

3.2. Feature selection using the CPBF

For the feature selection using the class profile-based approach, we have manually identified the most regular words that exist in each category and weighted them using the entropy weighting scheme [19] before adding them to the feature vectors that have been selected from the PCA. For example, the words that exist regularly in the boxing class are ‘Lewis’, ‘Tyson’, ‘heavyweight’, ‘fighter’, ‘knockout’, etc. A fixed number of regular words from each class is then used as a feature vector together with the reduced principal components from the PCA. These feature vectors are then used as the input to the neural networks for classification. The entropy weight of each term is calculated as $L_{jk} \times G_k$, where $L_{jk}$ is the local weighting of term $k$ and $G_k$ is the global weighting of term $k$. $L_{jk}$ and $G_k$ are given by

$$L_{jk} = \begin{cases} 1 + \log TF_{jk} & (TF_{jk} > 0) \\ 0 & (TF_{jk} = 0) \end{cases} \tag{7}$$

and

$$G_k = 1 + \frac{\sum_{j=1}^{n} \frac{TF_{jk}}{F_k} \log \frac{TF_{jk}}{F_k}}{\log n}, \tag{8}$$

where $n$ is the number of documents in the collection and $TF_{jk}$ is the term frequency of each word in $Doc_j$, as mentioned previously. $F_k$ is the frequency of term $k$ in the entire document collection. We have selected the $R = 50$ words that have the highest entropy values to be added to the first $d$ components from the PCA as the input to the neural networks for classification.
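The entropy weighting of Eqs. (7) and (8) can be sketched as follows. The function names, the toy counts, the natural logarithm, and the convention that a zero count contributes nothing to the sum are assumptions of this sketch; ranking terms by $G_k$ to pick the top $R$ is one plausible reading of "highest entropy value", not a confirmed detail of the paper.

```python
import math

def local_weight(tf_jk):
    """L_jk of Eq. (7): 1 + log TF_jk for a positive count, otherwise 0."""
    return 1.0 + math.log(tf_jk) if tf_jk > 0 else 0.0

def global_weight(tf_column):
    """G_k of Eq. (8) for one term, given its counts over all n documents."""
    n = len(tf_column)
    f_k = sum(tf_column)               # F_k: frequency of the term in the collection
    if n < 2 or f_k == 0:
        return 0.0                     # assumption: undefined cases get weight 0
    total = 0.0
    for tf_jk in tf_column:
        if tf_jk > 0:                  # 0 * log 0 is treated as 0
            p = tf_jk / f_k
            total += p * math.log(p)
    return 1.0 + total / math.log(n)

def top_terms(tf, r):
    """Indices of the r terms with the largest global weight (hypothetical selection rule)."""
    m = len(tf[0])
    g = [global_weight([row[k] for row in tf]) for k in range(m)]
    return sorted(range(m), key=lambda k: g[k], reverse=True)[:r]

# Hypothetical toy counts: 4 documents, 3 terms
tf = [
    [5, 2, 1],
    [0, 2, 0],
    [0, 2, 3],
    [0, 2, 0],
]
selected = top_terms(tf, r=2)
weight_00 = local_weight(tf[0][0]) * global_weight([row[0] for row in tf])
```

A term concentrated in a single document receives the maximum global weight of 1, while a term spread evenly over all documents receives 0, so the scheme favours words that are characteristic of a few documents.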

3.3. Input data to the neural networks

After the preprocessing of news web pages, a vocabulary that contains all the unique words in the news database has been created. We have limited the number of unique words in the vocabulary to 1800, as the number of distinct words is large. Each word in the vocabulary represents one feature vector. Each feature vector contains the document-term weights. Using high-dimensional feature vectors as input to the neural networks is not practical due to poor scalability and performance. Therefore, the PCA has been used to reduce the original feature vectors ($m = 1800$) into a small number of principal components. In our case, we have tested several values of $d$ (e.g., 100, 200, 300, 400, 500, and 600) together with $R = 50$ features selected from the CPBF approach, because this value of $R$ performs better for web news classification than the other values tested as input to the neural networks. The loading factor graph for the accumulated proportion of eigenvalues is shown in Fig. 2.

The value of $d = 600$ is selected based on an accumulated proportion of 81.6% of the original feature vectors for document classification (see Figs. 2 and 3). The classification accuracies of the classes that are represented with a small number of documents in the datasets are lower when using the PCA-NN approach. Further explanation of the PCA-NN is given in Section 4.1. However, if a small number of documents is used in the training of the neural networks with the features selected from the PCA combined with the features selected from the CPBF in the WPCM approach, the classification accuracies increase, as described in Section 5.1.

Fig. 2. Accumulated loading factors of principal components generated by the PCA (x-axis: number of principal components; y-axis: accumulated loading factors, %).

Therefore, we propose an improvement of web page classification accuracy for document category classes that are represented by a small number of documents, based on a combination of the PCA and CPBF in the WPCM approach.

The combinations of these features are shown in Fig. 5. We have found that the number of feature vectors selected from the PCA ($d = 600$) and the number of features selected from the CPBF ($R = 50$) are the best values to use in the WPCM approach. The classification accuracies with different values of $d$ are shown in Table 2. We have found that $d + R = 650$ is the best value, which gives 96% classification accuracy using the proposed WPCM method, as shown in Fig. 3.
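Two small operations described in this subsection, choosing $d$ from the accumulated proportion of eigenvalues and appending the $R$ CPBF features to the $d$ PCA features, can be sketched in Python. The function names and the toy eigenvalues are hypothetical; the paper reports only the resulting figures ($d = 600$ at an accumulated proportion of 81.6%, and $d + R = 650$ inputs).

```python
def components_for_proportion(eigenvalues, target):
    """Smallest d whose leading eigenvalues reach `target` of the eigenvalue total."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    if total <= 0:
        return 0
    running = 0.0
    for d, lam in enumerate(ordered, start=1):
        running += lam
        if running / total >= target:
            return d
    return len(ordered)

def combine_features(pca_features, cpbf_features):
    """Concatenate the d PCA features and the R CPBF features of one document."""
    return list(pca_features) + list(cpbf_features)

# Hypothetical eigenvalues: accumulated proportions are 0.4, 0.7, 0.9, 1.0
d = components_for_proportion([4.0, 3.0, 2.0, 1.0], target=0.8)

# Input vector of one document with the sizes used in the paper (d = 600, R = 50)
document_input = combine_features([0.0] * 600, [0.0] * 50)
```

With these toy eigenvalues three components are needed to pass the 80% mark, and the combined vector has 650 entries, matching the input size of the network described in Section 3.4.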

3.4. Characterization of the neural networks

The architecture of the neural networks used for the classification process is shown in Fig. 4. The number of units in the input layer $(p)$ is based on the number of principal components $d \; (= 600)$ and $R \; (= 50)$. The number of units in the hidden layer $(q)$ is 50. A trial and error approach has been used to find a suitable number of hidden units that provides good classification accuracy based on the input data to the neural networks. The number of units in the output layer $(r)$ is 12, based on the number of classes in the sports news category as shown in Table 3.

We define $t$ as the iteration number, $\eta$ as the learning rate, $\alpha$ as the momentum rate, $\theta_q$ as the bias on hidden unit $q$, $\theta_r$ as the bias on output unit $r$, $\delta_q$ as the generalized error through layer $q$, and $\delta_r$ as the generalized error between layers $q$ and $r$. The input values to the neural network are represented by $f_1, f_2, \ldots, f_{d+R}$.

Fig. 3. The classification accuracy with a combination of the PCA and CPBF feature vectors (x-axis: number of features; y-axis: % accuracy).

Fig. 4. The feature vectors from the PCA and CPBF are fed to the neural networks for classification. The network has input $(p)$, hidden $(q)$, and output $(r)$ layers connected by the weights $W_{qp}$ and $W_{rq}$; $f_1, \ldots, f_{d+R}$ represent the input vectors from the PCA and CPBF, and $cs_a$, where $a = 1, \ldots, 12$, represents a class of web sports documents.

Fig. 5. The feature vectors selected from the PCA are augmented with the features selected from the CPBF before classifying the web page documents using the neural networks. Each document $Doc_j$ is represented by $f_1 \ldots f_{600}$ from the PCA ($d = 600$) followed by $f_{601} \ldots f_{650}$ from the CPBF ($R = 50$).

Table 2
The web pages classification accuracies using a combination of the PCA and CPBF feature vectors in the WPCM approach

| PCA ($d$) | CPBF ($R$) | Total number of features ($d + R$) | Accuracy (%) |
|---|---|---|---|
| 100 | 50 | 150 | 10 |
| 200 | 50 | 250 | 9 |
| 300 | 50 | 350 | 14 |
| 400 | 50 | 450 | 33 |
| 500 | 50 | 550 | 81 |
| 600 | 50 | 650 | 96 |
| 700 | 50 | 750 | 96 |
| 800 | 50 | 850 | 96 |
| 900 | 50 | 950 | 96 |
| 1000 | 50 | 1050 | 96 |
| 1350 | 50 | 1400 | 97 |
| 1550 | 50 | 1600 | 97 |
| 1750 | 50 | 1800 | 97 |

Table 3
The number of documents that are stored in the news database

| Class no. | Class name ($cs_a$) | Number of documents |
|---|---|---|
| 1 | Baseball | 569 |
| 2 | Boxing | 210 |
| 3 | Cycling | 80 |
| 4 | Football | 550 |
| 5 | Golf | 456 |
| 6 | Hockey | 856 |
| 7 | Motor-sports | 405 |
| 8 | Rugby | 52 |
| 9 | Skiing | 233 |
| 10 | Soccer | 261 |
| 11 | Swimming | 169 |
| 12 | Tennis | 255 |
| Total | | 4096 |

These are the classes that exist in the sports category.

Adaptation of the weights between the hidden $(q)$ and input $(p)$ layers is given by

$$W_{qp}(t+1) = W_{qp}(t) + \Delta W_{qp}(t+1), \tag{9}$$

where

$$\Delta W_{qp}(t+1) = \eta \delta_q O_p + \alpha \Delta W_{qp}(t), \tag{10}$$

$$\delta_q = O_q (1 - O_q) \sum_{r} \delta_r W_{rq}. \tag{11}$$

Note that the first transfer function at the hidden layer $(q)$ is given by

$$net_q = \sum_{p} W_{qp} O_p + \theta_q, \tag{12}$$

$$O_q = f(net_q) = 1 / (1 + e^{-net_q}). \tag{13}$$

Adaptation of the weights between the output $(r)$ and hidden $(q)$ layers is given by

$$W_{rq}(t+1) = W_{rq}(t) + \Delta W_{rq}(t+1), \tag{14}$$

where

$$\Delta W_{rq}(t+1) = \eta \delta_r O_q + \alpha \Delta W_{rq}(t), \tag{15}$$

$$\delta_r = O_r (1 - O_r)(t_r - O_r). \tag{16}$$

Then the output function at the output layer $(r)$ is given by

$$net_r = \sum_{q} W_{rq} O_q + \theta_r, \tag{17}$$

$$O_r = f(net_r) = 1 / (1 + e^{-net_r}). \tag{18}$$

The parameters of the error back-propagation neural networks are set as in Table 4.
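One training step following Eqs. (9)-(18) can be sketched in plain Python. This is an illustrative sketch under stated assumptions, not the authors' implementation: the class name `BackpropNet`, the uniform random weight initialisation, the bias update rule (the paper gives no update equation for $\theta_q$ and $\theta_r$), and the per-example mean squared error that is returned are all assumptions. The default $\eta$ and $\alpha$ follow Table 4.

```python
import math
import random

def sigmoid(x):
    """Logistic transfer function of Eqs. (13) and (18)."""
    return 1.0 / (1.0 + math.exp(-x))

class BackpropNet:
    def __init__(self, p, q, r, eta=0.005, alpha=0.001, seed=0):
        rng = random.Random(seed)
        self.eta, self.alpha = eta, alpha
        # Assumption: small uniform random initial weights and zero biases
        self.w_qp = [[rng.uniform(-0.5, 0.5) for _ in range(p)] for _ in range(q)]
        self.w_rq = [[rng.uniform(-0.5, 0.5) for _ in range(q)] for _ in range(r)]
        self.theta_q = [0.0] * q
        self.theta_r = [0.0] * r
        self.dw_qp = [[0.0] * p for _ in range(q)]   # previous updates, for momentum
        self.dw_rq = [[0.0] * q for _ in range(r)]

    def forward(self, o_p):
        # Eqs. (12)-(13): hidden activations; Eqs. (17)-(18): output activations
        o_q = [sigmoid(sum(w * x for w, x in zip(row, o_p)) + th)
               for row, th in zip(self.w_qp, self.theta_q)]
        o_r = [sigmoid(sum(w * h for w, h in zip(row, o_q)) + th)
               for row, th in zip(self.w_rq, self.theta_r)]
        return o_q, o_r

    def train_step(self, o_p, target):
        o_q, o_r = self.forward(o_p)
        # Eq. (16): output-layer error terms
        delta_r = [o * (1.0 - o) * (t - o) for o, t in zip(o_r, target)]
        # Eq. (11): hidden-layer error terms, using the weights before they change
        delta_q = [o_q[j] * (1.0 - o_q[j]) *
                   sum(delta_r[i] * self.w_rq[i][j] for i in range(len(o_r)))
                   for j in range(len(o_q))]
        # Eqs. (14)-(15): hidden-to-output weights with momentum
        for i, row in enumerate(self.w_rq):
            for j in range(len(row)):
                self.dw_rq[i][j] = (self.eta * delta_r[i] * o_q[j]
                                    + self.alpha * self.dw_rq[i][j])
                row[j] += self.dw_rq[i][j]
            self.theta_r[i] += self.eta * delta_r[i]      # assumed bias update
        # Eqs. (9)-(10): input-to-hidden weights with momentum
        for j, row in enumerate(self.w_qp):
            for k in range(len(row)):
                self.dw_qp[j][k] = (self.eta * delta_q[j] * o_p[k]
                                    + self.alpha * self.dw_qp[j][k])
                row[k] += self.dw_qp[j][k]
            self.theta_q[j] += self.eta * delta_q[j]      # assumed bias update
        # Mean squared error of this example, measured before the update
        return sum((t - o) ** 2 for t, o in zip(target, o_r)) / len(o_r)

# Hypothetical toy run: 3 inputs, 4 hidden units, 2 classes, larger rates for a quick demo
net = BackpropNet(p=3, q=4, r=2, eta=0.5, alpha=0.1)
errors = [net.train_step([1.0, 0.0, 0.5], [1.0, 0.0]) for _ in range(200)]
```

The network in the paper would correspond to `BackpropNet(p=650, q=50, r=12)` with the default rates; the toy sizes above only keep the example quick to run.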

3.5. Neural networks training

In order to train the neural networks, a set of training documents and a specification of the predefined classes to which each document belongs are required. Each training example is an input–output pair $T_j = (Doc_j, cs_a)$, where $T_j$ is the $j$th training example, $Doc_j$ is the reduced feature vector of the $j$th training document, and $cs_a$ is the desired classification vector corresponding to $Doc_j$. The component values of $cs_a$ are determined based on the class information provided in the database as shown in Table 3. In our case, $a = 1, \ldots, 12$.

During training, the connection weights of the neural networks are initialized with random values. The training examples in the training set are then presented to the neural network classifier in random order, and the connection weights are adjusted according to the error back-propagation learning rule. This process is repeated until the mean squared error (MSE) falls below a predefined tolerance level or the maximum number of iterations is reached.

Table 4
The error back-propagation neural networks parameters that have been used in the experiments

| No. | Neural networks parameters | Values |
|---|---|---|
| 1 | Learning rate ($\eta$) | 0.005 |
| 2 | Momentum rate ($\alpha$) | 0.001 |
| 3 | Number of iterations ($t$) | 1000 |
| 4 | MSE | 0.005 |


4. Experiments

We have used a web page dataset from the Yahoo sports news as shown in Table 3. The types of news in the database are baseball, boxing, cycling, football, golf, hockey, motor-sports, rugby, skiing, soccer, swimming, and tennis. The total number of documents is 4096. For training, we have randomly selected 1000 documents from the different classes. The rest of the documents are used as the test set. We have used the same tests as Yang and Honavar [28], which include the Bayesian and TF-IDF classifiers. We also include the WPCM approach for comparison with these methods in order to examine the applicability of the classification system. The WPCM has been described in Section 2. The PCA-NN, TF-IDF, and Bayesian methods used in the tests are described below.
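The split described above can be sketched as follows. The function name `split_documents` and the fixed seed are hypothetical, and the sketch assumes a simple random draw over the whole collection; the paper does not say whether the 1000 training documents were drawn per class or from the pooled set.

```python
import random

def split_documents(doc_ids, n_train, seed=0):
    """Randomly pick n_train training documents; the remainder forms the test set."""
    rng = random.Random(seed)                 # fixed seed only for reproducibility
    train = set(rng.sample(doc_ids, n_train))
    test = [d for d in doc_ids if d not in train]
    return sorted(train), test

# Document identifiers 0..4095 stand in for the 4096 stored news pages
train_ids, test_ids = split_documents(list(range(4096)), n_train=1000)
```

With 4096 documents and 1000 used for training, 3096 documents remain for testing.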

4.1. PCA-NN classifier

The PCA-neural network (PCA-NN) approach classifies web pages using a number of features selected by the PCA and fed into the neural networks for classification. In the experiments, we have selected 600 principal components. These principal components have been fed into the neural networks for classification. The neural network parameters used in the experiments are the same as the WPCM parameters shown in Table 4.

4.2. TF-IDF classifier

The TF-IDF classifier is based on the relevance feedback algorithm by Rocchio using the vector space model [29]. The algorithm represents documents as vectors so that documents with similar contents have similar vectors. Each component of a vector corresponds to a term in the document, typically a word. The weight of each component is calculated using the term frequency-inverse document frequency weighting scheme (TF-IDF), which rewards words that occur many times in a few documents. To classify a new document $Doc'$, the cosines of the prototype vectors with the corresponding document vector are calculated for each class. $Doc'$ is assigned to the class whose prototype vector has the highest cosine. Further description of the TF-IDF measure is given by Joachims [30].
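A prototype-based cosine classifier of this kind can be sketched in Python. The experiments in the paper used the Rainbow toolkit, so this is only an illustration: the function names, the toy vectors, and the choice of the summed class vectors as the prototype (without Rocchio's negative-example term) are assumptions.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors (0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def build_prototypes(vectors, labels):
    """Sum the TF-IDF vectors of each class to form one prototype per class."""
    prototypes = {}
    for vec, label in zip(vectors, labels):
        acc = prototypes.setdefault(label, [0.0] * len(vec))
        for k, x in enumerate(vec):
            acc[k] += x
    return prototypes

def classify(doc_vector, prototypes):
    """Assign the document to the class whose prototype has the highest cosine."""
    return max(prototypes, key=lambda c: cosine(doc_vector, prototypes[c]))

# Hypothetical TF-IDF vectors over a 3-term vocabulary
training_vectors = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 1.0]]
training_labels = ["golf", "golf", "tennis"]
prototypes = build_prototypes(training_vectors, training_labels)
predicted = classify([0.8, 0.2, 0.0], prototypes)
```

Because the cosine ignores vector length, summing or averaging the class vectors gives the same decision.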

4.3. Bayesian classifier

For the statistical classification, we have used a standard Bayesian classifier [30]. When using the Bayesian classifier, we have assumed that each term's occurrence is independent of the other terms. We want to find the class $cs$ that gives the highest conditional probability for a given document $Doc'$. $w_k^m = \{w_1, w_2, \ldots, w_m\}$ is the set of words representing the textual content of the document $Doc'$, and $k$ is the term number, where $k = 1, \ldots, m$. The classification score is measured by

$$J(cs) = P(cs) \prod_{k=1}^{m} P(w_k \mid cs), \tag{19}$$

where $P(cs)$ is the prior probability of class $cs$, and $P(w_k \mid cs)$ is the likelihood of a word $w_k$ in a class $cs$, which is estimated from the labeled training documents. A given web page document is then classified into the class that maximizes $J(cs)$. If the values of $J(cs)$ for all the classes in the sports category are less than a given threshold, the document is considered unclassified. The Rainbow toolkit [31] has been used for the classification of the training and test news web pages using the Bayesian and TF-IDF methods.
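The score of Eq. (19) can be sketched with a small naive Bayes classifier in Python. It is not the Rainbow implementation used in the experiments: the function names and toy documents are hypothetical, add-one (Laplace) smoothing of the word likelihoods is an assumption, and the score is computed as $\log J(cs)$ to avoid numerical underflow, so the threshold in this sketch is on the log scale.

```python
import math
from collections import Counter, defaultdict

def train_bayes(docs, labels):
    """Count documents per class and word occurrences per class."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for words, label in zip(docs, labels):
        word_counts[label].update(words)
        vocabulary.update(words)
    return class_counts, word_counts, vocabulary

def log_score(words, cs, model):
    """log J(cs) = log P(cs) + sum_k log P(w_k | cs), with add-one smoothing (assumed)."""
    class_counts, word_counts, vocabulary = model
    score = math.log(class_counts[cs] / sum(class_counts.values()))
    denominator = sum(word_counts[cs].values()) + len(vocabulary)
    for w in words:
        if w in vocabulary:                     # words never seen in training are skipped
            score += math.log((word_counts[cs][w] + 1) / denominator)
    return score

def classify(words, model, threshold=None):
    """Return the class maximising the score, or None if it falls below the threshold."""
    scores = {cs: log_score(words, cs, model) for cs in model[0]}
    best = max(scores, key=scores.get)
    if threshold is not None and scores[best] < threshold:
        return None
    return best

# Hypothetical stemmed training documents
docs = [["tyson", "knockout"], ["lewis", "heavyweight", "knockout"],
        ["serve", "ace"], ["ace", "volley"]]
labels = ["boxing", "boxing", "tennis", "tennis"]
model = train_bayes(docs, labels)
predicted = classify(["knockout", "lewis"], model)
```

Passing a `threshold` reproduces the "unclassified" outcome described above: any document whose best log score lies below it is returned as `None`.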

5. Evaluation

The TF-IDF, Bayesian, PCA-NN, and WPCM classification methods are evaluated using the standard information retrieval measures, namely precision, recall, and F1 [32]. They are defined as follows

$$\text{precision} = \frac{a}{a + b}, \tag{20}$$

$$\text{recall} = \frac{a}{a + c}, \tag{21}$$

$$F_1 = \frac{2}{\frac{1}{\text{precision}} + \frac{1}{\text{recall}}}, \tag{22}$$

where the values of $a$, $b$, and $c$ are defined in Table 5. The relationship between the system classification and the expert judgment is expressed using four values as shown in Table 6. The F1 measure is the harmonic mean of precision and recall.
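Eqs. (20)-(22) translate directly into code. The sketch below is illustrative: the function names and the counts $a = 90$, $b = 10$, $c = 30$ are hypothetical, and returning 0 when a denominator is zero is an assumption, since the paper does not discuss that case.

```python
def precision(a, b):
    """Eq. (20): a / (a + b)."""
    return a / (a + b) if (a + b) else 0.0

def recall(a, c):
    """Eq. (21): a / (a + c)."""
    return a / (a + c) if (a + c) else 0.0

def f1_measure(p, r):
    """Eq. (22): harmonic mean of precision p and recall r."""
    return 2.0 / (1.0 / p + 1.0 / r) if p and r else 0.0

# Hypothetical counts for one class
p = precision(a=90, b=10)      # 0.9
r = recall(a=90, c=30)         # 0.75
f1 = f1_measure(p, r)
```

As a check against reported figures, a precision of 89.82% and a recall of 100.00% (class 1 of the PCA-NN results) give an F1 of 94.64%, which matches the value listed for that class.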

5.1. Simulation results

Table 5
The definitions of the parameters a, b, and c which are used in Table 6

| Value | Meaning |
|---|---|
| a | The system and the expert agree with the assigned category |
| b | The system disagrees with the assigned category but the expert did |
| c | The expert disagrees with the assigned category but the system did |
| d | The system and the expert disagree with the assigned category |

Table 6
The decision matrix for calculating the classification accuracies

| Expert | System: Yes | System: No |
|---|---|---|
| Yes | a | b |
| No | c | d |

The classification results using the PCA-NN and WPCM methods are shown in Tables 7 and 8. The average precision, recall, and F1 measures using the PCA-NN approach are 87.79%, 92.04%, and 89.89%, respectively. In comparison, the average precision, recall, and F1 measures using the WPCM approach are 90.35%, 93.81%, and 91.65%, respectively. Furthermore, we have compared the WPCM with the TF-IDF and Bayesian methods using the F1 measure. The average F1 measures are 78.06% and 81.07% for the TF-IDF and Bayesian methods, respectively, as shown in Table 9. This indicates that if the feature vectors are selected carefully, the combination of the PCA and CPBF in the WPCM approach increases the classification accuracy for web sports news.

The cycling and rugby classes (class numbers 3 and 8) contain a small number of documents in the dataset as shown in Table 3. The F1 measures using the PCA-NN approach for the cycling and rugby classes are 72.05% and 100.00%, respectively, as shown in Table 7. For the WPCM approach, the F1 measures for the two classes are 96.15% and 98.80%, respectively, as shown in Table 8. These results indicate that the WPCM approach has been able to classify the documents correctly although the number of documents representing the class is small. Also, by using the WPCM approach, we have overcome the limitation of the PCA-NN on supervised data, where the characteristic variables that describe smaller classes tend to be lost as a result of the dimensionality reduction. Furthermore, the classification accuracy on the small classes can be improved although they have been reduced to a small number of principal components, as the features from the CPBF are included during the classification process.

Table 7
The classification results using the PCA-NN method

| Class no. | Precision (%) | Recall (%) | F1 (%) |
|---|---|---|---|
| 1 | 89.82 | 100.00 | 94.64 |
| 2 | 84.75 | 100.00 | 91.74 |
| 3 | 71.60 | 72.50 | 72.05 |
| 4 | 97.06 | 99.00 | 98.02 |
| 5 | 100.00 | 97.00 | 98.48 |
| 6 | 90.83 | 99.00 | 94.74 |
| 7 | 97.09 | 100.00 | 98.52 |
| 8 | 100.00 | 100.00 | 100.00 |
| 9 | 99.01 | 100.00 | 99.01 |
| 10 | 71.26 | 67.00 | 67.07 |
| 11 | 82.62 | 95.00 | 92.23 |
| 12 | 69.44 | 75.00 | 72.12 |
| Average | 87.79 | 92.04 | 89.89 |

Table 8
The classification results using the WPCM method

| Class no. | Precision (%) | Recall (%) | F1 (%) |
|---|---|---|---|
| 1 | 97.14 | 68.00 | 80.00 |
| 2 | 86.96 | 100.00 | 93.02 |
| 3 | 98.68 | 93.75 | 96.15 |
| 4 | 86.09 | 99.00 | 92.09 |
| 5 | 89.91 | 98.00 | 93.78 |
| 6 | 94.29 | 99.00 | 96.59 |
| 7 | 84.03 | 100.00 | 91.32 |
| 8 | 97.62 | 100.00 | 98.80 |
| 9 | 89.29 | 100.00 | 94.34 |
| 10 | 73.91 | 68.00 | 70.83 |
| 11 | 90.09 | 100.00 | 94.79 |
| 12 | 96.15 | 100.00 | 98.04 |
| Average | 90.35 | 93.81 | 91.65 |
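The averages reported for the WPCM method are consistent with unweighted means over the 12 classes, that is, every class counts equally regardless of how many documents it contains. The short Python check below uses the per-class values listed for the WPCM method; the variable and function names are hypothetical.

```python
# Per-class WPCM results (%), classes 1..12, as listed in the results table
wpcm_precision = [97.14, 86.96, 98.68, 86.09, 89.91, 94.29,
                  84.03, 97.62, 89.29, 73.91, 90.09, 96.15]
wpcm_recall = [68.00, 100.00, 93.75, 99.00, 98.00, 99.00,
               100.00, 100.00, 100.00, 68.00, 100.00, 100.00]
wpcm_f1 = [80.00, 93.02, 96.15, 92.09, 93.78, 96.59,
           91.32, 98.80, 94.34, 70.83, 94.79, 98.04]

def macro_average(values):
    """Unweighted mean over classes."""
    return sum(values) / len(values)

avg_precision = round(macro_average(wpcm_precision), 2)
avg_recall = round(macro_average(wpcm_recall), 2)
avg_f1 = round(macro_average(wpcm_f1), 2)
```

Because the mean is unweighted, a small class such as rugby (52 documents) influences the average as much as hockey (856 documents), which is why gains on the small classes are visible in the overall figures.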

5.2. Discussions

Table 9
The classification results using the F1 measure for the TF-IDF and Bayesian methods

| Class no. | TF-IDF (%) | Bayesian (%) |
|---|---|---|
| 1 | 83.04 | 83.04 |
| 2 | 88.89 | 90.48 |
| 3 | 100.00 | 100.00 |
| 4 | 79.39 | 78.79 |
| 5 | 86.86 | 81.75 |
| 6 | 68.03 | 86.89 |
| 7 | 56.46 | 59.49 |
| 8 | 50.00 | 75.00 |
| 9 | 91.30 | 89.86 |
| 10 | 88.46 | 80.77 |
| 11 | 64.00 | 56.00 |
| 12 | 80.26 | 90.79 |
| Average | 78.06 | 81.07 |

The F1 measures for the cycling, rugby, swimming, and boxing classes using the WPCM approach compare favourably with those of the Bayesian, PCA-NN, and TF-IDF approaches. The number of documents belonging to each of these classes is smaller than for the baseball, football, golf, hockey, motor-sports, skiing, soccer, and tennis classes, as shown in Table 10. For example, there are only 80, 52, 169, and 210 documents belonging to the cycling, rugby, swimming, and boxing classes, respectively. The F1 measures of the cycling, rugby, swimming, and boxing classes using the PCA-NN approach are 72.05%, 100.00%, 92.23%, and 91.74%, respectively. For the TF-IDF approach, the F1 measures of the cycling, rugby, swimming, and boxing classes are 100.00%, 50.00%, 64.00%, and 88.89%, respectively. The F1 measures of the cycling, rugby, swimming, and boxing classes using the Bayesian approach are 100.00%, 75.00%, 56.00%, and 90.48%, respectively. However, the F1 measures of the cycling, rugby, swimming, and boxing classes using the WPCM approach are 96.15%, 98.80%, 94.79%, and 93.02%, respectively. The average F1 measures using the PCA-NN, TF-IDF, Bayesian, and WPCM approaches are 89.89%, 78.06%, 81.07%, and 91.65%, respectively.

The CPBF feature selection process applied in the WPCM approach is based on the highest entropy of the keywords, calculated using Eqs. (7) and (8). The first 50 keywords with the highest entropy values are selected and combined with the features taken from the PCA approach. The suitability of the selected keywords for their particular classes has not been carefully considered in the WPCM approach while doing the experiments, as described in Section 3.2. For example, the keywords ‘famous’, ‘field’, ‘skills’, and ‘manager’ from the soccer class are selected based on their entropies, although these keywords also exist in other classes such as the football and hockey classes.

However, identifying keywords alone is not enough. The weights associated with each of the keywords are also important for improving the F1 classification performance of the proposed approach, as the same word may occur in different classes. This is the main reason why the classification accuracies of the baseball, football, golf, motor-sports, and skiing classes (94.64%, 98.02%, 98.48%, 98.52%, and 99.01%) are better when using the PCA-NN approach than when using the WPCM approach, as shown in Table 7. Furthermore, the stemming and stopping processes also degrade the classification performance of the WPCM approach compared with the PCA-NN approach. However, the average F1 performance of the WPCM approach is better than that of the other approaches, as shown in Tables 7–9.

Table 10

The F1 measures for the classes that contain a small number of documents, classified using the WPCM, PCA-NN, TF-IDF, and Bayesian methods

| Class no. | Class name (csa) | Number of documents | WPCM F1 (%) | PCA-NN F1 (%) | TF-IDF F1 (%) | Bayesian F1 (%) |
|---|---|---|---|---|---|---|
| 3 | Cycling | 80 | 96.15 | 72.05 | 100.00 | 100.00 |
| 8 | Rugby | 52 | 98.80 | 100.00 | 50.00 | 75.00 |
| 11 | Swimming | 169 | 94.79 | 92.23 | 64.00 | 56.00 |
| 2 | Boxing | 210 | 93.02 | 91.74 | 88.89 | 90.48 |
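As a worked check on the small-class figures, the sketch below restates the standard F1 definition (the harmonic mean of precision and recall) and averages the Table 10 values per method. The function name `f1_score` and the dictionary layout are illustrative assumptions. The means computed here cover only the four classes in Table 10, so they need not match the overall averages reported for the full set of classes.

```python
def f1_score(precision, recall):
    """Standard F1: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# F1 values (%) from Table 10, in the order cycling, rugby, swimming, boxing.
table10 = {
    "WPCM":     [96.15, 98.80, 94.79, 93.02],
    "PCA-NN":   [72.05, 100.00, 92.23, 91.74],
    "TF-IDF":   [100.00, 50.00, 64.00, 88.89],
    "Bayesian": [100.00, 75.00, 56.00, 90.48],
}

# Unweighted mean over the four small classes for each method.
small_class_mean = {m: sum(v) / len(v) for m, v in table10.items()}
best_method = max(small_class_mean, key=small_class_mean.get)
```

On these four classes the WPCM approach has the highest unweighted mean, even though TF-IDF and Bayesian each reach 100.00% on the cycling class.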

Also, the classification accuracies of the baseball and soccer classes (83.00% and 88.46%) are better when using the TF-IDF approach than when using the WPCM approach, as shown in Table 9. Moreover, the classification accuracies of the baseball and soccer classes (83.04% and 80.77%) are better when using the Bayesian approach than when using the WPCM approach (80.00% for the baseball class, 70.83% for the soccer class), as shown in Tables 8 and 9. This is because the training documents belonging to the football and hockey classes contain many sparse keywords. A better document selection approach is needed for choosing the candidate documents from each class in order to improve the F1 classification results of the WPCM approach.

6. Conclusions

We have presented a new approach to web news classification using the WPCM. The feature selection for the WPCM is based on the combination of the PCA and CPBF methods. A comparison of the classification accuracy of the Bayesian, TF-IDF, PCA-NN, and WPCM approaches has been presented in this paper. The WPCM approach has been applied to classify the Yahoo sports news web pages. The experimental evaluation with different classification algorithms demonstrates that this method provides acceptable classification accuracy on the sports news datasets. Also, by using the WPCM approach, we have overcome the limitation of the PCA-NN on supervised data, where the characteristic variables that describe smaller classes tend to be lost as a result of the dimensionality reduction. Furthermore, the classification accuracy on the small classes can be improved even though they have been reduced to a small number of principal components. Although the classification accuracy of the WPCM approach is high in comparison with the other approaches, the time taken for training is relatively long compared with the other methods.
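The combination of PCA and class-profile features summarized above can be illustrated with a short sketch. The helper names `pca_reduce` and `combine_features`, the matrix sizes, and the random toy data are all assumptions made for illustration; PCA is computed here through a singular value decomposition of the centered data, which is one common way to obtain principal components and is not necessarily the procedure used in the experiments.

```python
import numpy as np

def pca_reduce(X, n_components):
    """Project the rows of X onto the first n_components principal axes."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)  # center each feature
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ vt[:n_components].T

def combine_features(X_terms, X_profile, n_components):
    """Concatenate reduced PCA features with class-profile features."""
    Z = pca_reduce(X_terms, n_components)
    return np.hstack([Z, np.asarray(X_profile, dtype=float)])

# Toy data: 6 documents, 20 term weights, 5 class-profile keyword weights.
rng = np.random.default_rng(0)
X_terms = rng.random((6, 20))
X_profile = rng.random((6, 5))
features = combine_features(X_terms, X_profile, n_components=3)
```

Because the class-profile columns are appended unchanged, keyword information that describes a small class remains available to the neural network even when the leading principal components do not capture it.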

Acknowledgement

The authors wish to thank the reviewers for their helpful suggestions.

References

[1] H.M. Lee, C.M. Chen, C.W. Hwang, A neural network document classifier with linguistic feature selection, in: Proceedings of 13th International Conference on Industrial and Engineering Applications of Artificial Intelligence and Expert Systems, New Orleans, Louisiana, USA, IEA/AIE 2000, 19–22 June 2000, pp. 555–560.

[2] W. Lam, M.E. Ruiz, P. Srinivasan, Automatic text categorization and its applications to text retrieval, IEEE Transactions on Knowledge Data Engineering 11 (6) (1999) 865–879.

(19)

[3] Z.Q. Liu, Y. Zhang, A competitive neural network approach to web-page categorization, International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems 9 (6) (2001) 731–741.

[4] D.A. Karras, B.G. Mertzios, A robust meaning extraction methodology using supervised neural networks, in: Proceedings of Australian Joint Conference on Artificial Intelligence, 2002, pp. 498–510.

[5] A. Rauber, D. Merkl, M. Dittenbach, The growing hierarchical self-organizing map:

exploratory analysis of high-dimensional data, IEEE Transactions on Neural Networks 13 (6) (2002) 1331–1341.

[6] C. Enhong, W. Shangfei, Z. Zhenya, W. Xufa, Document classification with CC4 neural network, in: Proceedings of ICONIP 2001, Sanghai, China.

[7] J.H. Chang, J.W. Lee, Y. Kim, B.T. Zhang, Topic extraction from text documents using multiple-cause networks, in: Proceedings of 7th Pacific Rim International Conference on Artificial Intelligence, Lecture Notes in Artificial Intelligence, vol. 2417, 2002, pp. 434–443.

[8] G.L. Gentili, M. Marinilli, A. Micarelli, F. Sciarrone, Text categorization in an intelligent agent for filtering information on the Web, International Journal of Pattern Recognition and Artificial Intelligence 15 (3) (2001) 527–549.

[9] T. Kohonen, S. Kaski, K. Lagus, J. Saloj€aavi, J. Honkela, V. Paatero, A. Saarela, Self organization of a massive text document collection, in: E. Oja, S. Kaski (Eds.), Kohonen Maps, Elsevier, Amsterdam, 1999, pp. 171–182.

[10] S. Wermeter, Neural network agents for learning semantic text classification, Information Retrieval 3 (2) (2000) 87–103.

[11] E.M. Ruiz, P. Srinivasan, Hierarchical text categorization using neural networks, Information Retrieval 5 (1) (2002) 87–118.

[12] A.S. Weigend, E.D. Wiener, J.O. Pedersen, Exploiting hierarchy in text categorization, Information Retrieval 1 (3) (1999) 193–216.

[13] R.A. Calvo, H.A. Ceccatto, Intelligent document classification, Intelligent Data Analysis 4 (5) (2000).

[14] S.Y. Lam, D.L. Lee, Feature reduction for neural network based text categorization, in:

Proceedings of the 6th International Conference on Database Systems for Advanced Applications, Hsinchu, Taiwan, April 1999.

[15] F. Sebastini, Machine learning in automated text categorization, ACM Computing Surveys 34 (1) (2002) 1–47.

[16] G. Karypis, E.H. Sam, Fast dimensionality reduction algorithm with applications to document retrieval and categorization, in: Proceedings of the 2000 ACM CIKM International Conference on Information and Knowledge Management, McLean, VA, USA, 2000, pp.

12–19.

[17] S. Zelikovitz, H. Hirsh, Improving short text classification using unlabeled background knowledge, in: Proceedings of the Seventeenth International Conference on Machine Learning, Stanford University, USA, June 2000.

[18] A. Selamat, S. Omatu, H. Yanagimoto, Web news classification using neural networks based on PCA, in: SICE, Osaka International Convention Center, Osaka, Japan, 5–7 August 2002, pp. 2388–2393.

[19] S.T. Dumais, Improving the retrieval of information from external sources, Behavior Research Methods, Instruments and Computers 23 (2) (1991) 229–236.

[20] Yahoo Web Pages, 2001, Available from <http://www.yahoo.com>.

[21] H. Mase, H. Tsuji, Experiments on automatic web page categorization for information retrieval system, IPSJ Journal 42 (2) (2001) 334–347.

[22] R.A. Calvo, M. Partridge, M. Jabri, A comparative study of principal components analysis techniques, in: Proceedings Ninth Australian Conference on Neural Networks, Brisbane, QLD, 1998, pp. 276–281.

(20)

[23] R.A. Johnson, W.D. Wichern, Applied Multivariate Statistical Analysis, fifth ed., Prentice- Hall, USA, 2002.

[24] A. Selamat, S. Omatu, H. Yanagimoto, Information retrieval from the Internet using mobile agent search system, in: Proceedings of International Conference on Information Technology and Multimedia 2001, University Tenaga Nasional (UNITEN), Malaysia, August 2001, pp.

9–16.

[25] A. Selamat, S. Omatu, H. Yanagimoto, Web news retrieval agent using neural networks based on PCA, in: Proceedings of the 46th Annual Conference of the Institute of Systems, Control and Information Engineers (ISCIE), Kobe, Japan, May 2002, pp. 401–401.

[26] S. Jones, Karen, P. Willet, Readings in information retrieval, San Francisco, USA, 1997.

[27] Salton&McGill, Introduction to modern information retrieval, New York, McGraw-Hill, USA, 1983.

[28] J.P. Yang, V. Honavar, L. Miller, Mobile intelligent agents for document classification and retrieval: a machine learning approach, in: Proceedings of the European Symposium on Cybernetics and Systems Research, Vienna, Austria, 1998, pp. 707–712.

[29] R.R. Korfhage, Information Storage and Retrieval, John Wiley and Sons Inc., USA, 1997.

[30] T. Joachims, Probabilistic analysis of the Rocchio algorithm with TFIDF for text categorization, in: Proceedings of International Conference on Machine Learning, Nashville, TN, USA, 1997, pp. 143–151.

[31] A.K. McCallum, Bow: a toolkit for statistical language modeling, text retrieval, classification and clustering (http://www.cs.cmu.edu/mccallum/~bow), 1996.

[32] D.D. Lewis, Evaluating and optimizing autonomous text classification systems, in: E.A. Fox, P. Ingwersen, R. Fidel (Ed.), SIGIR’95: Proceedings of the 18th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, New York, 1995, pp. 246–254.
