
A New Linear Feature Item Weighting Algorithm

20.3 Calculation of Web Page Correlation

The algorithm flow is as follows:

1. From the training documents, the predetermined topic vector is obtained manually. The topic vector is $s = (t_1, w_1; t_2, w_2; \ldots; t_n, w_n)$, where $t_i$ denotes a feature and $w_i$ is the weight of $t_i$.

2. The web page text is expressed in the form of the vector space model. The vector is $d = (t'_1, w'_1; t'_2, w'_2; \ldots; t'_n, w'_n)$, where $t'_i$ denotes a feature and $w'_i$ is the weight of $t'_i$.

3. The dimension of vector $d$ is determined according to the vector $s$: if $t'_i \in s$, its weight is kept unchanged; otherwise, the corresponding weight is set to 0.

4. The similarity between a given web page text vector and a given topic vector is calculated as:

$$\mathrm{sim}(s, d) = \frac{s \cdot d}{|s|\,|d|} \tag{20.8}$$

5. The similarity value is compared with a predetermined threshold value to determine whether the web page text is associated with the topic.
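The steps above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function names, the dictionary representation of the vectors, and the sample weights are assumptions made here, and the norm of the page vector is taken after step 3 has zeroed the features that do not occur in the topic vector, which is one reading of the procedure.

```python
import math

def project_onto_topic(page_vec, topic_vec):
    """Step 3: keep a page weight only if its feature is in the topic vector."""
    return {t: page_vec.get(t, 0.0) for t in topic_vec}

def cosine_similarity(topic_vec, page_vec):
    """Step 4 (Eq. 20.8): sim(s, d) = (s . d) / (|s| |d|)."""
    d = project_onto_topic(page_vec, topic_vec)
    dot = sum(w * d[t] for t, w in topic_vec.items())
    norm_s = math.sqrt(sum(w * w for w in topic_vec.values()))
    norm_d = math.sqrt(sum(w * w for w in d.values()))
    if norm_s == 0.0 or norm_d == 0.0:
        return 0.0  # no shared features, or an empty vector
    return dot / (norm_s * norm_d)

def is_relevant(topic_vec, page_vec, threshold):
    """Step 5: the page is topic-related if the similarity exceeds the threshold."""
    return cosine_similarity(topic_vec, page_vec) > threshold

# Hypothetical weights, for illustration only.
topic = {"economy": 0.6, "market": 0.8}
page = {"economy": 0.3, "market": 0.4, "football": 0.9}
print(cosine_similarity(topic, page))
print(is_relevant(topic, page, threshold=0.45))
```

In this toy example the feature "football" is dropped by the projection, so the page vector is compared with the topic vector only along the topic's own features.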

20.4 Experimental Results and Analysis

20.4.1 Experimental Data

According to the previous literature, the dimension of the topic feature vector of a web page should not be too high, and the number of feature items should not exceed 20. In this paper, the economics category of the Fudan University corpus is used as the training sample set, and 20 features are selected manually to form the topic vector. We also downloaded 1000 documents from Sina, of which 600 are related to economics and the remaining 400 are irrelevant.

20.4.2 Method of Evaluation

We use recall, precision, and the F-measure to compare the performance of the three methods under different threshold values [1]. Under a given threshold c, recall is defined as the proportion of correctly identified topic web pages among all topic-relevant web pages. Under the same threshold c, precision is defined as the proportion of correctly identified topic web pages among all pages whose topic similarity value is greater than c. The formulas are as follows:

$$\mathrm{Recall} = \frac{k_{\mathrm{correct}}}{n_{\mathrm{total}}} \tag{20.9}$$

$$\mathrm{Precision} = \frac{k_{\mathrm{correct}}}{m} \tag{20.10}$$

Here $k_{\mathrm{correct}}$ is the number of correctly identified topic web pages, $m$ is the number of web pages whose similarity value is greater than the threshold, and $n_{\mathrm{total}}$ is the number of all topic web pages.

Generally speaking, precision decreases as recall increases, and vice versa. It is therefore necessary to consider both together, which is what the F-measure does. The formula is as follows:

$$F\text{-measure} = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{20.11}$$

The F-measure ranges from 0 to 1. We can use the F-measure to compare the performance of different feature weighting methods under the same threshold value and then determine the optimal value of c.
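As a small worked illustration of Eqs. 20.9 to 20.11, the three measures can be computed from the page counts as sketched below. The function names and the sample counts are assumptions chosen for illustration; they are not taken from the experiment.

```python
def recall(k_correct, n_total):
    """Eq. 20.9: correctly identified topic pages / all topic pages."""
    return k_correct / n_total if n_total else 0.0

def precision(k_correct, m):
    """Eq. 20.10: correctly identified topic pages / pages above the threshold."""
    return k_correct / m if m else 0.0

def f_measure(p, r):
    """Eq. 20.11: harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if (p + r) else 0.0

# Hypothetical counts: 100 topic pages exist, 80 pages pass the
# threshold, and 50 of those 80 are really topic pages.
r = recall(k_correct=50, n_total=100)   # 0.5
p = precision(k_correct=50, m=80)       # 0.625
print(f"{f_measure(p, r):.3f}")         # prints 0.556
```

The guards against zero denominators are a design choice of this sketch, so that a threshold which selects no pages yields 0 rather than an error.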

20.4.3 Results and Analysis

In the weight calculation of candidate features, the balance adjustment factors a, b, and r should be set according to the form of the web page. If the form of a web page is fixed, the topic features generally appear in the title and anchor text, so the value of b should be larger. For the topic class "economics," the values of a, b, and r were tuned through repeated experiments to obtain the best topic features. The finally selected values are shown in Table 20.1.

Using the same similarity formula, we vary the method of calculating the feature item weights and compare the resulting recall, precision, and F-measure. We first compare the traditional TF-IDF method with the improved TF-IDF method for c = 0.35, 0.50, 0.65, and 0.80; the comparative results of these two methods are shown in Table 20.2.

As Table 20.2 shows, the improved TF-IDF method achieves some gains in web page filtering, but the effect is not dramatic, because both methods are purely statistical and do not consider the semi-structured nature of web documents. The comparative results of the traditional TF-IDF method and the new method are shown in Table 20.3.

Table 20.3 shows that the linear weighting method is more effective than the traditional TF-IDF method; at c = 0.50, for example, its F-measure is 0.731 compared with 0.555.

From Tables 20.2 and 20.3, we can see that all three methods reach their maximum F-measure at a threshold of around 0.50. To obtain the optimal threshold, we take 0.50 as the center and vary the threshold in steps of 0.05. The resulting changes in the F-measure are shown in Fig. 20.1.

When the threshold equals 0.45, the F-measure reaches its maximum, which means that the filtering effect is best at this point. In this experiment, the threshold is therefore set to 0.45, and the filtering effect of the three methods under this threshold is compared in Table 20.4.
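The threshold search described above (steps of 0.05 around 0.50, keeping the value with the highest F-measure) can be sketched as follows. The function names, the default search range of 0.35 to 0.65, and the toy similarity scores and labels are all assumptions for illustration; they are not the experimental data.

```python
def f_measure_at(threshold, similarities, relevant):
    """F-measure obtained when pages with similarity > threshold are selected."""
    selected = [i for i, s in enumerate(similarities) if s > threshold]
    m = len(selected)                                 # pages above the threshold
    k_correct = sum(1 for i in selected if relevant[i])
    n_total = sum(1 for flag in relevant if flag)     # all topic pages
    if m == 0 or n_total == 0 or k_correct == 0:
        return 0.0
    p = k_correct / m
    r = k_correct / n_total
    return 2 * p * r / (p + r)

def best_threshold(similarities, relevant, center=0.50, step=0.05, half_width=3):
    """Try center - half_width*step ... center + half_width*step; keep the best."""
    candidates = [round(center + i * step, 2)
                  for i in range(-half_width, half_width + 1)]
    # On ties, max() returns the first (lowest) candidate.
    return max(candidates, key=lambda c: f_measure_at(c, similarities, relevant))

# Toy data, made up for illustration: one similarity score and one
# relevance label per page.
sims = [0.90, 0.70, 0.52, 0.47, 0.42, 0.30]
labels = [True, True, True, True, False, False]
print(best_threshold(sims, labels))
```

With real data, `sims` would hold the values of Eq. 20.8 for each downloaded page and `labels` the manual relevance judgments.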

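As indicated in Table 20.4, at this threshold value the new linear feature extraction method clearly improves the recall and precision of topic filtering: its recall is 0.870, compared with 0.711 and 0.697 for the improved and traditional TF-IDF methods, and its precision is 0.601, compared with 0.479 and 0.468.

Table 20.1 Parameter list

| Balance adjustment factor | a | b | r |
| --- | --- | --- | --- |
| Value | 0.3 | 0.4 | 0.3 |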

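Table 20.2 Comparative results of the traditional TF-IDF method and the improved TF-IDF method

| Value of c | Improved TF-IDF: Recall | Improved TF-IDF: Precision | Improved TF-IDF: F | Traditional TF-IDF: Recall | Traditional TF-IDF: Precision | Traditional TF-IDF: F |
| --- | --- | --- | --- | --- | --- | --- |
| 0.35 | 0.851 | 0.440 | 0.580 | 0.832 | 0.412 | 0.556 |
| 0.50 | 0.659 | 0.523 | 0.583 | 0.628 | 0.498 | 0.557 |
| 0.65 | 0.476 | 0.660 | 0.553 | 0.416 | 0.613 | 0.495 |
| 0.80 | 0.280 | 0.780 | 0.412 | 0.263 | 0.756 | 0.391 |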

Table 20.3 Comparative results of the traditional TF-IDF method and the new method

| Value of c | New method: Recall | New method: Precision | New method: F | Traditional TF-IDF: Recall | Traditional TF-IDF: Precision | Traditional TF-IDF: F |
| --- | --- | --- | --- | --- | --- | --- |
| 0.35 | 0.966 | 0.561 | 0.728 | 0.832 | 0.412 | 0.556 |
| 0.50 | 0.793 | 0.679 | 0.731 | 0.628 | 0.498 | 0.555 |
| 0.65 | 0.530 | 0.721 | 0.610 | 0.416 | 0.613 | 0.495 |
| 0.80 | 0.361 | 0.826 | 0.502 | 0.263 | 0.756 | 0.391 |

Fig. 20.1 The changes of F-measure value (horizontal axis: threshold c, from 0.35 to 0.65; vertical axis: F-measure, from 0.4 to 0.9)

Conclusion

In this paper, the new method, which combines the improved TF-IDF with the location and the mutual information of feature items, achieves better results than the traditional TF-IDF. Experiments show that it is a feasible feature item weighting method.

References

1. Johnson J, Tsioutsiouliklis K, Giles CL. Evolving strategies for focused web crawling. In: Proceedings of the 20th International Conference on Machine Learning (ICML 2003). Menlo Park, CA: AAAI Press; 2003. p. 298–305.

2. Zheng GL, Ye FY, Lin GJ. Subject information acquisition method based on domain ontology. Comput Appl. 2008;28(12):3275–6. (In Chinese).

3. Salton G, Wong A, Yang CS. A vector space model for automatic indexing. Commun ACM. 1975;18(11):613–20.

4. Li Z, Yang S. Improvement of calculation web page feature weight based on vector space model. Comput Modern. 2010;178(6):137–9. (In Chinese).

5. Lin BX. Research and implementation of topic crawler based on domain ontology. Chengdu: Southwest Jiaotong University; 2010. (In Chinese).

Table 20.4 Comparison of the three methods at the threshold value of 0.45

| Method | Web pages processed | Actual relevant web pages | Pages meeting the threshold | Topic web pages collected | Recall | Precision |
| --- | --- | --- | --- | --- | --- | --- |
| New method | 1,000 | 600 | 869 | 523 | 0.870 | 0.601 |
| Improved TF-IDF | 1,000 | 600 | 890 | 425 | 0.711 | 0.479 |
| Traditional TF-IDF | 1,000 | 600 | 893 | 418 | 0.697 | 0.468 |

178 S. Tian et al.


In document W Eric Wong Proceedings of the 4th International Conference on Computer Engineering and Networks CENet2014 pdf (Page 180-184)