Web page feature selection
and classification using neural networks
Ali Selamat
*, Sigeru Omatu
Computer and System Sciences, Graduate School of Engineering, Osaka Prefecture University, Sakai, Osaka 599-8531, Japan
Received 20 June 2002; received in revised form 30 December 2002; accepted 18 March 2003
Abstract
Automatic categorization is the only viable method to deal with the scaling problem of the World Wide Web (WWW). In this paper, we propose a news web page classifi- cation method (WPCM). The WPCM uses a neural network with inputs obtained by both the principal components and class profile-based features. Each news web page is represented by the term-weighting scheme. As the number of unique words in the col- lection set is big, the principal component analysis (PCA) has been used to select the most relevant features for the classification. Then the final output of the PCA is com- bined with the feature vectors from the class-profile which contains the most regular words in each class. We have manually selected the most regular words that exist in each class and weighted them using an entropy weighting scheme. The fixed number of regular words from each class will be used as a feature vectors together with the reduced principal components from the PCA. These feature vectors are then used as the input to the neural networks for classification. The experimental evaluation demonstrates that the WPCM method provides acceptable classification accuracy with the sports news datasets.
Ó 2003 Elsevier Inc. All rights reserved.
Keywords: Text categorization; WWW; Profile-based neural networks; Principal component analysis
*Corresponding author. Address: UTM, Faculty of Computer Science and Information Systems, UTM Skudai, Johor 81310, Malaysia. Tel.: +81-722-54-9278; fax: +81-722-57-1788.
E-mail address:[email protected](A. Selamat).
0020-0255/$ - see front matter Ó 2003 Elsevier Inc. All rights reserved.
doi:10.1016/j.ins.2003.03.003
www.elsevier.com/locate/ins
1. Introduction
Neural networks have been widely applied by many researchers to classify the text documents with different types of feature vectors. For example, Lee et al. [1] have proposed a neural network for document classification based on a linguistic feature selection with a fuzzy learning technique. Liu and Zhang [3], Enhong et al. [6], and Lam et al. [2] have proposed a term frequency method to select the feature vectors for document categorization using neural networks.
Furthermore, Karras and Mertzios [4] have proposed a feature extraction method based on a word semantic and the categories of words co-occurrence analysis for document classification using neural networks. Chang et al. [7]
have proposed another approach of document classification using multiple- cause networks with latent variables as a feature selection method.
A feature selection using a hybrid case-based architecture has been proposed by Gentili et al. [8] for text categorization where two multi-layer perceptrons are integrated into a case-based reasoner. Rauber et al. [5] have proposed a different neural network architecture for document categorization namely as a growing hierarchical self-organizing map (GSOM). However, the GSOM is concentrated on a huge number of documents in the datasets to be classified and the feature selection method used is similar to approach applied by Ko- honen et al. [9].
Furthermore, Wermeter [10] has used the document title as the vectors to be used for document categorization. Ruiz and Srinivasan [11], Weigend et al.
[12], and Calvo and Ceccatto [13] have used the X2 measure to select the relevant features before classifying the text documents using the neural net- works. Lam and Lee [14] have used the principal component analysis (PCA) method as a feature reduction technique to the input data to the neural networks. However, if some original terms are particularly good when dis- criminating a class category, the discrimination power may be lost in the new vector space after using a latent semantic indexing (LSI) as described by Se- bastini [15].
Moreover, the techniques based on dimensionality reduction, such as the LSI, have been shown to improve the quality of the information being re- trieved by capturing the latent meaning of words present in the documents. As a result of the dimensionality reduction using the PCA or the LSI for data classification, the characteristic variables that describe smaller classes tend to be lost. Hence, the classification accuracy on the smaller classes can be de- graded in the reduced dimensional space [16,18]. Zelikovitz and Hirsh [17] have included the background knowledge to the LSI method in order to classify the text. However, if a small number of documents that are represented in a particular class in the datasets, the classification accuracy may be decreased.
This problem has not been taken into consideration in the previous ap- proach [17].
Therefore, we propose a web page classification method (WPCM), which is based on the PCA and the class profile-based features (CPBF). The proposed method is different compared to the previous works based on the improvement of web page categorization accuracies using the PCA features selection ap- proach with a combination of background knowledge. If a small number of web page documents use in a category class, the classification accuracy can also be increased. Each web page is represented by the term frequency-weighting scheme. As the dimensionality of a feature vector in the collection set is big, the PCA has been used to reduce it into a small number of principal components.
Then we combine the feature vectors generated from the PCA with the feature vectors from the class-profile which contains the most regular words in each class before feeding them to the neural networks for classification. We have manually selected the most regular words that exist in each class and weighted them using an entropy-weighting scheme [19]. The Yahoo sports [20] news web pages have been used for the classification purpose. The web page clas- sification techniques such as a Bayesian, a term frequency-inverse documents frequency weighting scheme (TF-IDF), a PCA-neural network (PCA-NN), and the WPCM methods have been used as the benchmark tests for the classifi- cation accuracy. The experimental evaluation demonstrates that the proposed method provides an acceptable classification accuracy with the sports news datasets.
The organization of this paper is as follows: The news classification using the WPCM is described in Section 2. The preprocessing of web pages is ex- plained in Section 3. The comparisons of web pages classification using the Bayesian, TF-IDF, PCA-NN, and WPCM methods are discussed in Section 4.
In Section 5, the discussions of the classification results using different meth- ods for the sports news web pages are stated. In Section 6, we will conclude the classification accuracy by using the WPCM compared with the other methods.
2. Web page classification method
The news web pages have different characteristics where the text length for each of them is variable. Also the structures of the pages are different in the tags usage (i.e., XML, html, SGML tags, etc.). Furthermore, a huge number of distinct words exist in those pages including proper or misspelled noun words as there is no restriction on word usage in the news web pages discussed by Mase and Tsuji [21].
The high dimensionality of the news web pages dataset has made the process of classification difficult. This is because there are many categories of news in the web news pages such as sports, weathers, politics, economy, etc. In each category there are many different classes. For example, the classes that exist in
the business category are stock market, financial investment, personal finance, etc. Our approach is based on the sports news category of web pages.
As we have mentioned in Section 1, the main purpose of this paper is to improve the classification accuracies of the classes in the datasets that are represented with a small number of documents. Our approach is different compared to the previous works [17]. It is based on the improvement of web page categorization accuracies using the PCA features selection approach with a combination of background knowledge where a small number of documents have been used in the document class.
In order to classify the news web pages, we propose the WPCM which uses the PCA and CPBF as the input to the neural networks. First, we have used the PCA algorithm [22,23] to reduce the original data vectors to a small number of relevant features. Then we combine these features to the CPBF before input- ting them to the neural networks for classification as shown in Fig. 1.
Web page retrieval
Stemming and stopping processes
Term weighting scheme using TFidf method
Feature reduction and selection using PCA
Select the most regular words from each class
Calculate the weight of each of the regular words using
an entropy method
Combine two types of feature vectors
Input to the neural networks for classification
Classification results
d R
PCA CPBF
WPCM
d is a feature vectors from the PCA R is a feature vectors from the CPBF Note:
Fig. 1. The process of classifying a news web page using the WPCM method.
3. Preprocessing of web pages
The process of classifying a news web page using the WPCM method is shown in Fig. 1. It consists of web news retrieval process, stemming and stopping processes, feature reduction process using our proposed method, and web classification process using error back-propagation neural networks. The retrieving process of sports news web pages has been done by our software agent during night-time [24,25]. Only the latest sports news web pages category will be retrieved from the Yahoo web server from the WWW. Then these web pages will be stored in the local news database. Stopping is a process of re- moving the most frequent word that exists in a web page document such as Ôto’, Ôand’, Ôit’, etc. Removing these words will save spaces for storing document contents and reduce time taken during the search process. Stemming is a process of extracting each word from a web page document by reducing it to a possible root word. For example, the words Ôcompares’, Ôcompared’, and Ôcomparing’ have similar meaning with the word Ôcompare’. We have used the Porter [26] stemming algorithm to select only Ôcompare’ to be used as a root word in a web page document. After the stemming and stopping processes of the terms in each document, we will represent them as the document-term frequency matrixðDocj TFjkÞ as shown in Table 1. Docjis referring to each web page document that exists in the news database where j¼ 1; . . . ; n. Term frequency TFjkis the number of how many times the distinct word wkoccurs in document Docjwhere k¼ 1; . . . ; m. The calculation of the terms weight xjk of each word wk is done by using a method that has been used by Salton [27]
which is given by
xjk ¼ TFjk idfk; ð1Þ
where the document frequency dfk is the total number of documents in the database that contains the word wk. The inverse document frequency idfk¼ logðdfn
kÞ where n is the total number of documents in the database.
Table 1
The document-term frequency data matrix after the stemming and stopping processes
Docj TF1 TF2 . . . TFm
Doc1 2 4 . . . 5
Doc2 2 3 . . . 2
Doc3 2 3 . . . 2
Doc4 2 6 . . . 1
Doc5 4 3 . . . 3
...
...
...
...
...
Docn 1 3 . . . 7
3.1. Feature reduction using the PCA
Suppose that we have A, which is a matrix document-terms weight as below
A¼
x11 x12 x1k x1m x21 x22 x2k x2m
x31 x32 x3k . . . ...
... ...
... .. .
. . . xn1 xn2 xnk xnm 0
BB BB B@
1 CC CC CA
;
where xjk is the terms weight that exists in the collection of documents. The definitions of j, k, m, and n have been described in the previous paragraph.
There are a few steps to be followed in order to calculate the principal com- ponents of data matrix A. The mean of m variables in data matrix A will be calculated as follows
xxk ¼1 n
Xn
j¼1
xjk: ð2Þ
After that the covariance matrix, S¼ fsjkg is calculated. The variance s2kkis given by
s2kk¼1 n
Xn
j¼1
ðxjk xxkÞ2: ð3Þ
The covariance sik is given by
sik ¼1 n
Xn
j¼1
ðxji xxiÞðxjk xxkÞ; ð4Þ
where i¼ 1; . . . ; m. Then we determine the eigenvalues and eigenvectors of the covariance matrix S which is a real symmetric positive matrix. An eigenvalue k and the eigenvector e can be found such that, Se¼ ke.
In order to find the eigenvector e the characteristic equation jS kIj ¼ 0 must be solved. If S is an m m matrix of full rank, m eigenvalues ðk1;k2; . . . ;kmÞ can be found. By using
ðS kiIÞei¼ 0; ð5Þ
all corresponding eigenvectors can be found. The eigenvalues and corre- sponding eigenvectors will be sorted so that k1P k2P P km. Let a square matrix E be constructed from the eigenvector columns where E¼
½e1 e2 e3 e4 em. Also let us denote K as
K¼
k1 0 0 0
0 k2 0 0 0
0 0 k3 0 0
... ...
... .. .
. . .
0 0 0 0 km
0 BB BB B@
1 CC CC CA :
In order to get the principal components of matrix S, we will perform ei- genvalue decomposition which is given by
ETSE¼ K: ð6Þ
Then we select the first d 6 m eigenvectors where d is the desired value, e.g., 100, 200, 400, etc. The set of principal components is represented as Y1¼ eT1s, Y2¼ eT2s; . . . ; Yd¼ eTds. An n d matrix M is represented as
M¼
f11 f12 f13 f1d
f21 f22 f23 f2d f31 f32 f33 . . . f41 f42 f43 .. .
. . . fn1 fn2 fn3 fnd 0
BB BB B@
1 CC CC CA
;
where fijis a reduced feature vectors from the m m original data size to n d size.
3.2. Feature selection using the CPBF
For the feature selection using class profile-based approach, we have man- ually identified the most regular words that exist in each category and weighted them using the entropy weighting [19] scheme before adding them to the fea- ture vectors that have been selected from the PCA. For example, the words that exist regularly in a boxing class are ÔLewis’, ÔTyson’, Ôheavyweight’, Ôfighter’, Ôknockout’, etc. Then a fixed number of regular words from each class will be used as a feature vector together with the reduced principal components from the PCA. These feature vectors are then used as the input to the neural networks for classification. The entropy weighting scheme on each term is calculated as Ljk Gk where Ljk is the local weighting of term k and Gk is the global weighting of term k. The Ljk and Gk are given by
Ljk¼ 1þ log TFjk ðTFjk>0Þ
0 ðTFjk¼ 0Þ
ð7Þ and
Gk ¼1þPn k¼1
TFjk Fk logTFFjk
k
log n ; ð8Þ
where n is the number of document in a collection and TFjk is the term fre- quency of each word in Docjas mentioned previously. The Fkis a frequency of term k in the entire document collection. We have selected R¼ 50 words that have the highest entropy value to be added to the first d components from the PCA to be an input to the neural networks for classification.
3.3. Input data to the neural networks
After the preprocessing of news web pages, a vocabulary that contains all the unique words in the news database has been created. We have limited the number of unique words in the vocabulary to 1800 as the number of distinct words is big. Each of the word in the vocabulary represents one feature vector.
Each feature vector contains the document-terms weight. The high dimensio- nality of feature vectors to be an input to the neural networks is not practical due to poor scalability and performance. Therefore, the PCA has been used to reduce the original feature vectors m¼ 1800 into a small number of principal components. In our case, we have selected a few values of d (e.g., 100, 200, 300, 400, and 500, 600) together with R¼ 50 features selected from the CPBF ap- proach because this parameter performs better for web news classification compared to the other parameters to be an input to the neural networks. The loading factor graph for the accumulated proportion of eigenvalues are shown in Fig. 2.
The value of d ð¼ 600Þ is selected based on the value of accumulated pro- portion of 81.6% from the original feature vectors for document classification (see Figs. 2 and 3). The classification accuracies of the classes that are repre-
0 20 40 60 80 100 120
0 500 1000 1500 2000
Number of principal components
Accumulated loading factors(%)
Fig. 2. Accumulated loading factors of principal components generated by the PCA.
sented with a small number of documents in the datasets are less accurate by using the PCA-NN approach. Further explanation on the PCA-NN has been described in Section 4.1. However, if a small number of documents used in the training of neural networks using the features selected from the PCA and combined with the features selected from the CPBF in the WPCM approach, the classification accuracies have been increased as described in Section 5.1.
Therefore, we propose an improvement of web page classification accuracy where a small number of documents are represented in a document category class based on a combination of the PCA and CPBF in the WPCM approach.
The combinations of these features are shown in Fig. 5. We have found that the numbers of feature vectors selected from the PCA (d¼ 600) and the features selected from the CPBF (R¼ 50) are the best values to be used in the WPCM approach. The classification accuracies with different value of d are shown in Table 2. We have found that the value of (dþ R ¼ 650) is the best, which contributes 96% of classification accuracy using the proposed WPCM method as shown in Fig. 3.
3.4. Characterization of the neural networks
The architecture of neural networks that have been used for classification process is shown in Fig. 4. The number of input layers ðpÞ is based on the number of principal components d ð¼ 600Þ and R ð¼ 50Þ. The number of hidden layersðqÞ is 50. The trial and error approach has been used to find a suitable number of hidden layers that provides good classification accuracy based on the input data to the neural networks. The number of output layers ðrÞ is 12 based on the number of classes in the sports news category as shown in Table 3.
We have defined t as the iteration number, g is a learning rate, a is a mo- mentum rate, hqis a bias on hidden unit q, hris a bias on output unit r, dqis the
0 0.2 0.4 0.6 0.8 1 1.2
0 500 1000 1500 2000
No. of features
% Accuracy
Fig. 3. The classification accuracy with a combination of the PCA and CPBF feature vectors.
generalized error through a layer q, and dr is the generalized error between layers q and r. The input values to the neural network are represented by f1; f2; . . . ; fdþR.
Adaptation of the weights between hiddenðqÞ and input ðpÞ layers is given by
Wqpðt þ 1Þ ¼ WqpðtÞ þ DWqpðt þ 1Þ; ð9Þ
f1...f600
Doc1 f601...f650
PCA (d=600) CPBF (R=50)
Doc2
Docn
f1...f600 f601...f650
Doc3 f1...f600 f601...f650
Doc4 f1...f600 f601...f650
f1...f600 f601...f650
feature vectors to be input to the neural networks Docj
Fig. 5. The features vectors selected from the PCA are augmented with the features selected from the CPBF before classifying the web pages documents using the neural networks.
Wqp Wrq
netr
netq
Or
Oq
p q r
Input layers Hidden layers Output layers
f1
f2
f3
Op
cs1
cs2
cs12
represents a class of web sports documents where = csa
represents input vectors from the PCA and CPBF.
. 12 ,..., 1 a fd+R
fd+R
Fig. 4. The feature vectors from the PCA and CPBF are fed to the neural networks for classifi- cation.
where
DWqpðt þ 1Þ ¼ gdqOpþ aDWqpðtÞ; ð10Þ
dq¼ Oqð1 OqÞX
r
drWrq: ð11Þ
Note that the first transfer function at hidden layerðqÞ is given by netq¼X
q
WqpOpþ hq; ð12Þ
Table 3
The number of documents that are stored in the news database
Class no. Class nameðcsaÞ Number of documents
1 Baseball 569
2 Boxing 210
3 Cycling 80
4 Football 550
5 Golf 456
6 Hockey 856
7 Motor-sports 405
8 Rugby 52
9 Skiing 233
10 Soccer 261
11 Swimming 169
12 Tennis 255
Total 4096
These are the classes that exist in sports category.
Table 2
The web pages classification accuracies using a combination of the PCA and CPBF feature vectors in the WPCM approach
Number of features Accuracy (%)
PCA (d) CPBF (R) Total number
of features (dþ R)
100 50 150 10
200 50 250 9
300 50 350 14
400 50 450 33
500 50 550 81
600 50 650 96
700 50 750 96
800 50 850 96
900 50 950 96
1000 50 1050 96
1350 50 1400 97
1550 50 1600 97
1750 50 1800 97
Oq¼ f ðnetqÞ ¼ 1=ð1 þ enetqÞ: ð13Þ Adaptation of the weights between outputðrÞ and hidden ðqÞ layers is given by
Wrqðt þ 1Þ ¼ WrqðtÞ þ DWrqðt þ 1Þ; ð14Þ where
DWrqðt þ 1Þ ¼ gdrOqþ aDWrqðtÞ; ð15Þ
dr¼ Orð1 OrÞðtr OrÞ: ð16Þ
Then the output function at the output layerðkÞ is given by netr¼X
r
WrqOqþ hr; ð17Þ
Or¼ f ðnetrÞ ¼ 1=ð1 þ enetrÞ: ð18Þ
The parameters of error back-propagation neural networks are set as in Table 4.
3.5. Neural networks training
In order to train the neural networks, a set of training documents and specification of the predefined classes to which the document belongs is re- quired. Each training example is an input–output pair Tj¼ ðDocj; csaÞ where Tj
is the training document, Docjis the reduced feature vector of the jth training document, and csa is the desired classification vector corresponding to Docj. The component values of csaare determined based on the classified informa- tion provided in the database as shown in Table 3. In our case, a¼ 1; . . . ; 12.
During training, the connection weights of the neural networks are initial- ized with some random values. The training examples in the training set are then presented to the neural network classifier in random order and the con- nection weights are adjusted according to the error back-propagation learning rule. This process is repeated until the mean squares error (MSE) falls below a predefined tolerance level or the maximum number of iterations is achieved.
Table 4
The error back-propagation neural networks parameters that have been used in the experiments
No. Neural networks parameters Values
1 Learning rate (g) 0.005
2 Momentum rate (a) 0.001
3 Number of iteration (t) 1000
4 MSE 0.005
4. Experiments
We have used a web page dataset from the Yahoo sports news as shown in Table 3. The types of news in the database are baseball, boxing, cycling, football, golf, hockey, motor-sports, rugby, skiing, soccer, swimming, and tennis. The total of documents are 4096. For training, we have selected ran- domly 1000 documents from different classes. The rest of the documents are used as the test sets. We have used the same test as being done by Yang and Honavar [28], which includes the Bayesian and TF-IDF classifiers. Also we include the WPCM approach as a comparison to the methods that are used in order to examine the applicability of the classification system. The description of WPCM has been mentioned in Section 2. The PCA-NN, TF-IDF and Bayesian methods for the tests are described as below.
4.1. PCA-NN classifier
The PCA-neural networks (PCA-NN) is a process of web page classification using a number of features selected from the PCA and fed into the neural networks for classification. In the experiments, we have selected 600 principal components. These principal components have been fed into the neural net- works for classification. The neural networks parameters used in the experi- ments are similar to the WPCM parameters as shown in Table 4.
4.2. TF-IDF classifier
TF-IDF classifier is based on the relevant feedback algorithm by Rocchio using the vector space model [29]. The algorithm represents documents as vectors so that the documents with similar contents have similar vectors. Each component of a vector corresponds to a term in the document, typically a word. The weight of each component is calculated using the term frequency inverse document frequency weighting scheme (TF-IDF) which tries to reward words that occur many times in a few documents. To classify a new document Doc0, the cosines of the prototype vectors with corresponding document vectors are calculated for each class. Doc0 is assigned to the class whose document vector has the highest cosine. Further description on the TF-IDF measure is described by Joachim [30].
4.3. Bayesian classifier
For the statistical classification, we have used a standard Bayesian classifier [30]. When using the Bayesian classifier, we have assumed that term’s occur- rence is independent of other terms. We want to find a class cs that gives the highest conditional probability for given document Doc0. wmk ¼ fw1; w2; . . . ; wmg
is the set of words representing the textual content of the document Doc0and k is the term number where k¼ 1; . . . ; m. The classification score is measured by
JðcsÞ ¼Ym
k¼1
PðwkjcsÞP ðcsÞ; ð19Þ
where PðcsÞ is the prior probability of class cs, and P ðwkjcsÞ is the likelihood of a word wk in a class cs that is estimated on a labeled training document. A given web page document is then classified in a class that maximizes JðcsÞ. If values of JðcsÞ for all the classes in the sports category are less than a given threshold, then the document is considered unclassified. The Rainbow toolkit [31] has been used for the classification of the training and test news web pages using the Bayesian and TF-IDF methods.
5. Evaluation
The TF-IDF, Bayesian, PCA-NN, and WPCM classification methods are evaluated using the standard information retrieval measures that are precision, recall, and F 1 [32]. They are defined as follows
precision¼ a
aþ b; ð20Þ
recall¼ a
aþ c; ð21Þ
F1¼ 2
1
precisionþrecall1 ; ð22Þ
where the values of a, b, and c are defined in Table 5. The relationship between the system classification and the expert judgment is expressed using four values as shown in Table 6. The F 1 measure is a kind of average of precision and recall.
5.1. Simulation results
The classification results using the PCA-NN and WPCM methods are shown in Tables 7 and 8. The average for precision, recall, and F 1 measures
Table 5
The definitions of the parameters a, b, and c which are used in Table 6
Value Meaning
a The system and the expert agree with the assigned category b The system disagrees with the assigned category but the expert did c The expert disagrees with the assigned category but the system did d The system and the expert disagree with the assigned category
using the PCA-NN approach are 87.79%, 92.04%, and 89.89%, respectively. In comparison with the WPCM approach, the precision, recall, and F 1 measures are 90.35%, 93.81%, and 91.65%, respectively. Furthermore, we have compared
Table 6
The decision matrix for calculating the classification accuracies
Expert System
Yes No
Yes a b
No c d
Table 7
The classification results using the PCA-NN method
Class no. Precision (%) Recall (%) F1 (%)
1 89.82 100.00 94.64
2 84.75 100.00 91.74
3 71.60 72.50 72.05
4 97.06 99.00 98.02
5 100.00 97.00 98.48
6 90.83 99.00 94.74
7 97.09 100.00 98.52
8 100.00 100.00 100.00
9 99.01 100.00 99.01
10 71.26 67.00 67.07
11 82.62 95.00 92.23
12 69.44 75.00 72.12
Average 87.79 92.04 89.89
Table 8
The classification results using the WPCM method
Class no. Precision (%) Recall (%) F1 (%)
1 97.14 68.00 80.00
2 86.96 100.00 93.02
3 98.68 93.75 96.15
4 86.09 99.00 92.09
5 89.91 98.00 93.78
6 94.29 99.00 96.59
7 84.03 100.00 91.32
8 97.62 100.00 98.80
9 89.29 100.00 94.34
10 73.91 68.00 70.83
11 90.09 100.00 94.79
12 96.15 100.00 98.04
Average 90.35 93.81 91.65
the WPCM with the TF-IDF and Bayesian methods using the F 1 measure.
They are 78.06% and 81.07% for the TF-IDF and Bayesian methods as shown in Table 9. This indicates that if the feature vectors are selected carefully, the improvement of web sports news classification using the combination of the PCA and CPBF in the WPCM approach will increase the classification accu- racy. The cycling and rugby classes (class numbers 3 and 8) contain a small number of documents in the dataset as shown in Table 3. The F 1 measure using the PCA-NN approach for the cycling and rugby classes are 73% and 100%, respectively, as shown in Table 7. But for the WPCM approach the classification accuracies for both classes are 98.80% and 96.15%, respectively, as shown in Table 7. These indicate that the WPCM approach has been able to classify the documents correctly although the number of documents repre- senting the class is small. Also, we have overcome the limitation of the PCA- NN in supervised data where the characteristic variables that describe smaller classes tend to be lost as a result of the dimensionality reduction by using the WPCM approach. Furthermore, the classification accuracy on the small classes can be improved although they have been reduced into a small number of principal components, as the numbers of features from the CPBF are included during the classification process.
5.2. Discussions
The F 1 measures for the cycling, rugby, swimming, and boxing classes using the WPCM approach are better compared to the Bayesian, PCA-NN, and TF- IDF approaches. The numbers of documents belongings to each of these classes are less compared to the baseball, football, golf, hockey, motor-sports, skiing, soccer, and tennis classes as shown in Table 10. For example, there are
Table 9
The classification results using the F 1 measure for the TF-IDF and Bayesian methods
Class no. TF-IDF (%) Bayesian (%)
1 83.04 83.04
2 88.89 90.48
3 100.00 100.00
4 79.39 78.79
5 86.86 81.75
6 68.03 86.89
7 56.46 59.49
8 50.00 75.00
9 91.30 89.86
10 88.46 80.77
11 64.00 56.00
12 80.26 90.79
Average 78.06 81.07
only 80, 52, 169, and 210 documents belongings to the cycling, rugby, swim- ming, and boxing classes, respectively. The F 1 measures of the cycling, rugby, swimming, and boxing classes using the PCA-NN approach are 72.05%, 100.00%, 92.23%, and 91.74%, respectively. For the TF-IDF approach, the F 1 measures of the cycling, rugby, swimming, and boxing classes are 100.00%, 50.00%, 64.00%, and 88.89%, respectively. The F 1 measures of the cycling, rugby, swimming, and boxing classes using the Bayesian approach are 100.00%, 75.00%, 56.00%, and 90.48%, respectively. However, the F 1 measures of the cycling, rugby, swimming, and boxing classes using the WPCM ap- proach are 96.15%, 98.80%, 94.79%, and 93.02%, respectively. The average F 1 measures using the PCA-NN, TF-IDF, Bayesian, and WPCM approaches are 89.89%, 78.06%, 81.07%, and 91.65% respectively.
The CPBF feature selection process applied in the WPCM approach is based on the highest entropy of the keywords calculated by using the Eqs. (7) and (8).
The first 50 keywords with the highest entropy values are selected and com- bined with the features taken from the PCA approach. The suitability of the keywords selection belonging to the particular classes has not been carefully considered in the WPCM approach while doing the experiments as described in Section 3.2. For example, the entropies for the keywords Ôfamous’, Ôfield’, Ôskills’, and Ômanager’ from the class soccer are selected although these key- words also exist in the other classes such as the football and hockey classes.
However, identifying keywords alone are not enough. The associated weights related to each of the keywords are also important to improve the F 1 classification performance using the proposed approach as the same word may occur in different classes. This is the main reason why the classification accu- racies of the baseball, football, golf, motor-sports, and skiing classes (94.64%, 98.02%, 98.48%, 98.52%, and 99.01%) are better when using the PCA-NN approach compared with the WPCM approach as shown in Table 7. Fur- thermore, the stemming and stopping processes are also degrading the classi- fication performance using the WPCM approach compared with the PCA-NN approach. However, the average F 1 performance of WPCM approach is better than the other approach as shown in Tables 7–9.
Table 10
The F 1 measures for the classes that contain small number of documents to be classified using the WPCM, PCA-NN, TF-IDF, and Bayesian methods
Class no. Class nameðcsaÞ Number of docu- ments
F1(%) WPCM (%)
PCA-NN (%)
TF-IDF (%)
Bayesian (%)
3 Cycling 80 96.15 72.05 100.00 100.00
8 Rugby 52 98.80 100.00 50.00 75.00
11 Swimming 169 94.79 92.23 64.00 56.00
2 Boxing 210 93.02 91.74 88.89 90.48
Also, the classification accuracies of the baseball and soccer classes (83.00%
and 88.46%) are better when using the TF-IDF approach compared with the WPCM approach as shown in Table 9. Moreover, the classification accuracies of the baseball and soccer classes (83.04% and 80.77%) are better when using the Bayesian approach compared with the WPCM approach (80.00% for the baseball class, 70.83% for the soccer class) as shown in Tables 8 and 9. This is because the contents of training documents belong to the football and hockey classes contain many sparse keywords. A better document selection approach needs to be used for selecting the candidate documents from each class in order to increase the F 1 classification results by using the WPCM approach.
6. Conclusions
We have presented a new approach of web news classification using the WPCM. The feature selection for the WPCM is based on the combination of the PCA and CPBF methods. The comparison of classification accuracy using the Bayesian, TF-IDF, PCA-NN, and WPCM approaches have been presented in this paper. The WPCM approach has been applied to classify the Yahoo sports news web pages. The experimental evaluation with different classification algorithms demonstrates that this method provides acceptable classification accuracy with the sports news datasets. Also, we have overcome the limitation of the PCA-NN in supervised data where the characteristic variables that de- scribe smaller classes tend to be lost as a result of the dimensionality reduction by using the WPCM approach. Furthermore, the classification accuracy on the small classes can be improved although they have been reduced into a small number of principal components. Although the classification accuracy using the WPCM approach is high in comparison with the other approaches, the time taken for training is relatively long compared with the other methods.
Acknowledgement
The authors wish to thank the reviewers for their helpful suggestions.
References
[1] H.M. Lee, C.M. Chen, C.W. Hwang, A neural network document classifier with linguistic feature selection, in: Proceedings of 13th International Conference on Industrial and Engineering Applications of Artificial Intelligence and Expert Systems, New Orleans, Louisiana, USA, IEA/AIE 2000, 19–22 June 2000, pp. 555–560.
[2] W. Lam, M.E. Ruiz, P. Srinivasan, Automatic text categorization and its applications to text retrieval, IEEE Transactions on Knowledge Data Engineering 11 (6) (1999) 865–879.
[3] Z.Q. Liu, Y. Zhang, A competitive neural network approach to web-page categorization, International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems 9 (6) (2001) 731–741.
[4] D.A. Karras, B.G. Mertzios, A robust meaning extraction methodology using supervised neural networks, in: Proceedings of Australian Joint Conference on Artificial Intelligence, 2002, pp. 498–510.
[5] A. Rauber, D. Merkl, M. Dittenbach, The growing hierarchical self-organizing map:
exploratory analysis of high-dimensional data, IEEE Transactions on Neural Networks 13 (6) (2002) 1331–1341.
[6] C. Enhong, W. Shangfei, Z. Zhenya, W. Xufa, Document classification with CC4 neural network, in: Proceedings of ICONIP 2001, Sanghai, China.
[7] J.H. Chang, J.W. Lee, Y. Kim, B.T. Zhang, Topic extraction from text documents using multiple-cause networks, in: Proceedings of 7th Pacific Rim International Conference on Artificial Intelligence, Lecture Notes in Artificial Intelligence, vol. 2417, 2002, pp. 434–443.
[8] G.L. Gentili, M. Marinilli, A. Micarelli, F. Sciarrone, Text categorization in an intelligent agent for filtering information on the Web, International Journal of Pattern Recognition and Artificial Intelligence 15 (3) (2001) 527–549.
[9] T. Kohonen, S. Kaski, K. Lagus, J. Saloj€aavi, J. Honkela, V. Paatero, A. Saarela, Self organization of a massive text document collection, in: E. Oja, S. Kaski (Eds.), Kohonen Maps, Elsevier, Amsterdam, 1999, pp. 171–182.
[10] S. Wermeter, Neural network agents for learning semantic text classification, Information Retrieval 3 (2) (2000) 87–103.
[11] E.M. Ruiz, P. Srinivasan, Hierarchical text categorization using neural networks, Information Retrieval 5 (1) (2002) 87–118.
[12] A.S. Weigend, E.D. Wiener, J.O. Pedersen, Exploiting hierarchy in text categorization, Information Retrieval 1 (3) (1999) 193–216.
[13] R.A. Calvo, H.A. Ceccatto, Intelligent document classification, Intelligent Data Analysis 4 (5) (2000).
[14] S.Y. Lam, D.L. Lee, Feature reduction for neural network based text categorization, in:
Proceedings of the 6th International Conference on Database Systems for Advanced Applications, Hsinchu, Taiwan, April 1999.
[15] F. Sebastini, Machine learning in automated text categorization, ACM Computing Surveys 34 (1) (2002) 1–47.
[16] G. Karypis, E.H. Sam, Fast dimensionality reduction algorithm with applications to document retrieval and categorization, in: Proceedings of the 2000 ACM CIKM International Conference on Information and Knowledge Management, McLean, VA, USA, 2000, pp.
12–19.
[17] S. Zelikovitz, H. Hirsh, Improving short text classification using unlabeled background knowledge, in: Proceedings of the Seventeenth International Conference on Machine Learning, Stanford University, USA, June 2000.
[18] A. Selamat, S. Omatu, H. Yanagimoto, Web news classification using neural networks based on PCA, in: SICE, Osaka International Convention Center, Osaka, Japan, 5–7 August 2002, pp. 2388–2393.
[19] S.T. Dumais, Improving the retrieval of information from external sources, Behavior Research Methods, Instruments and Computers 23 (2) (1991) 229–236.
[20] Yahoo Web Pages, 2001, Available from <http://www.yahoo.com>.
[21] H. Mase, H. Tsuji, Experiments on automatic web page categorization for information retrieval system, IPSJ Journal 42 (2) (2001) 334–347.
[22] R.A. Calvo, M. Partridge, M. Jabri, A comparative study of principal components analysis techniques, in: Proceedings Ninth Australian Conference on Neural Networks, Brisbane, QLD, 1998, pp. 276–281.
[23] R.A. Johnson, W.D. Wichern, Applied Multivariate Statistical Analysis, fifth ed., Prentice- Hall, USA, 2002.
[24] A. Selamat, S. Omatu, H. Yanagimoto, Information retrieval from the Internet using mobile agent search system, in: Proceedings of International Conference on Information Technology and Multimedia 2001, University Tenaga Nasional (UNITEN), Malaysia, August 2001, pp.
9–16.
[25] A. Selamat, S. Omatu, H. Yanagimoto, Web news retrieval agent using neural networks based on PCA, in: Proceedings of the 46th Annual Conference of the Institute of Systems, Control and Information Engineers (ISCIE), Kobe, Japan, May 2002, pp. 401–401.
[26] S. Jones, Karen, P. Willet, Readings in information retrieval, San Francisco, USA, 1997.
[27] Salton&McGill, Introduction to modern information retrieval, New York, McGraw-Hill, USA, 1983.
[28] J.P. Yang, V. Honavar, L. Miller, Mobile intelligent agents for document classification and retrieval: a machine learning approach, in: Proceedings of the European Symposium on Cybernetics and Systems Research, Vienna, Austria, 1998, pp. 707–712.
[29] R.R. Korfhage, Information Storage and Retrieval, John Wiley and Sons Inc., USA, 1997.
[30] T. Joachims, Probabilistic analysis of the Rocchio algorithm with TFIDF for text categorization, in: Proceedings of International Conference on Machine Learning, Nashville, TN, USA, 1997, pp. 143–151.
[31] A.K. McCallum, Bow: a toolkit for statistical language modeling, text retrieval, classification and clustering (http://www.cs.cmu.edu/mccallum/~bow), 1996.
[32] D.D. Lewis, Evaluating and optimizing autonomous text classification systems, in: E.A. Fox, P. Ingwersen, R. Fidel (Ed.), SIGIR’95: Proceedings of the 18th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, New York, 1995, pp. 246–254.