
User Interest Analysis in Web Filtering

A-Ning DU and Bin-Xing FANG

Abstract: Web filtering can help people find the most interesting and valuable information. However, current web filtering techniques cannot retrieve results that accurately represent the user interest. This paper investigates user interest in web filtering and analyzes the problems of current machine-learning-based web filters. According to the differences in user interest, the task of web filtering is divided into three levels: relativity-filter, similarity-filter and homology-filter. A Biased Support Vector Machine (BSVM) is used to make the filter adaptable to these differences in user interest. Experiments show that BSVM can greatly improve web filtering performance.

Index Terms: Web Filtering, User Interest, Biased Support Vector Machine.

I. INTRODUCTION

The growth of Internet resources brings up the problems of information overload and quality enhancement: people want to read the most interesting messages and avoid having to read low-quality or uninteresting ones. Web filtering is the activity of classifying a stream of incoming web pages dispatched asynchronously by an information producer to an information consumer [1]. It helps people find the most interesting and valuable information and saves Internet users from being drowned by the flood of incoming information.

In recent years, the machine learning (ML) paradigm [2] has become more popular than knowledge engineering by domain experts for solving this problem, because of its ability to learn automatically and to analyze relativity. However, these ML algorithms are insufficiently accurate and do not adapt well to the ever-changing user interest, that is, the appropriateness of a web document to the user. For example, distinguishing Pornography from SexEd may be difficult, and distinguishing Pornography from Erotica is even harder, since the border is extremely subjective.

This paper studies how to adjust web filtering results to better fit the user interest. Based on a careful study of the user interest, the web filtering result is divided into three scopes, relativity, similarity and homology, which help describe the user interest more accurately. To make the filtering result more precise, the inductive process is improved so that it achieves better precision and recall with respect to the user interest. The improved machine learning algorithm in this paper is based on the Support Vector Machine (SVM), because among the generic machine learning algorithms (Decision Tree, Rule Induction, Bayesian algorithm and SVM), SVM has been shown to be superior to the others and rests on the solid foundation of Statistical Learning Theory (SLT). The improved algorithm is called Biased Support Vector Machine (BSVM). It introduces a stimulant function that uses the training example distribution $n_+/n_-$ and a user-adaptable parameter $k$ to treat the different classes of pre-assigned pages in an imbalanced way, so as to adjust the filtering result to best fit the user interest.

The remainder of the paper is organized as follows: Section II introduces web filtering, analyzes the user interest and the corresponding differences in the filtering result, and discusses the failure of current machine learning approaches. Section III puts forward the model of the Biased Support Vector Machine and analyzes its efficiency in web filtering. Section IV closes the paper with our conclusions and future work.

A-Ning DU and Bin-Xing FANG are with the Research Center of Computer Network and Information Security Technology, Harbin Institute of Technology, People’s Republic of China.

II. WEB FILTERING AND USER INTEREST

A. Web Filtering

Web filtering is the task of assigning a boolean value to each web page vector $d_i \in D$, where $D$ is a domain of web pages. A value of TRUE assigned to $d_i$ indicates a decision that page $d_i$ is relative to the user interest, while FALSE indicates that it is not. More formally, the task is to approximate the unknown target function $\Psi : D \to \{TRUE, FALSE\}$ (which describes how web pages ought to be assigned) by means of a function $\Phi : D \to \{TRUE, FALSE\}$ called the filter. How to improve the precision and recall of the filter $\Phi$ is the core problem of web filtering.

The general process of web filtering includes five steps:

1) user interest acquiring: acquire many user-assigned web pages as the training set

2) web page pre-processing: translate the assigned pages into a set of compact representations of page content. Usually a page $d_i$ is represented as a vector of term weights $d_i = \{w_{1i}, w_{2i}, \cdots, w_{|F|i}\}$, where $F$ is the set of features that occur at least once in at least one document of $D$, and $0 < w_{ki} < 1$ represents how much feature $f_k$ contributes to the semantics of page $d_i$

3) dimensionality reduction: select the features of high contribution to reduce the size of the feature set $F$

4) construction of web filters: build a filter that describes the user interest automatically

5) prediction for unfiltered web pages: use the filter to predict whether an unmarked web page is relative or not
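Steps 2 and 3 can be illustrated with a minimal sketch. The paper does not specify a weighting scheme or a feature selection criterion, so the smoothed tf-idf weight, the cosine normalization, and the selection by document frequency used here are assumptions made for illustration only, as are the names `tokenize` and `build_vectors`. Note that the normalized weights fall in the closed interval [0, 1].

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer (a deliberately simple stand-in)."""
    return text.lower().split()


def build_vectors(pages, max_features=None):
    """Return (features, vectors) for a list of page texts.

    Each vector holds cosine-normalized tf-idf weights in [0, 1].
    """
    tokenized = [tokenize(p) for p in pages]
    # Document frequency of every term that occurs at least once in D.
    df = Counter(term for toks in tokenized for term in set(toks))
    # Assumed dimensionality reduction: keep the terms that occur in the
    # most pages, breaking ties alphabetically for determinism.
    features = sorted(df, key=lambda t: (-df[t], t))
    if max_features is not None:
        features = features[:max_features]
    n_pages = len(pages)
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        raw = [tf[f] * math.log(1 + n_pages / df[f]) for f in features]
        norm = math.sqrt(sum(w * w for w in raw)) or 1.0
        vectors.append([w / norm for w in raw])
    return features, vectors


pages = ["cheap pills online", "history of china", "cheap history books"]
features, vectors = build_vectors(pages, max_features=4)
```

The resulting vectors can then be passed to any of the filter construction methods discussed in Section II-C.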

Representation of web pages is the basic step of the process, while the degree of dimensionality reduction is the key influencing factor. What decides the effectiveness of a web filter is the generalization and description ability of the web filtering algorithm.

Current implementations of web filtering mainly use four techniques: URL blocking, keyword filtering, rating systems, and intelligent content analysis. URL blocking restricts or allows access by comparing the requested web page’s URL (and equivalent IP address) with URLs in a stored list. Its advantages are speed and efficiency, but it requires a URL list that is quite costly to generate and maintain. Keyword filtering blocks access to web sites on the basis of the occurrence of offensive words and phrases on those sites. However, many web sites that do not contain objectionable content will also be blocked. Rating systems let web publishers associate labels or metadata with web pages to limit certain web content to target audiences, but in general this approach cannot provide a reliable source of information. Intelligent content analysis systems automatically classify web content by means of ML algorithms. Such learning and adaptation programs can help give semantic meaning to context-dependent words, and they are therefore the dominant approach in web filtering.

Almost all existing filtering software uses URL blocking, while some products also provide rating and keyword options. The performance of a filtering system can be measured in terms of the blocking rate, which is the percentage of objectionable Web pages that are correctly blocked, and the overblocking rate, which is the percentage of legitimate pages that are blocked. The Netprotect project evaluated 50 commercially available filtering systems using 2,794 URLs with pornographic content and 1,655 URLs with normal content [3]. Their results, reproduced in Table I, show that the accuracy of existing systems is far from satisfactory.
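The two measures can be computed as in the following sketch. The function name `blocking_rates` and the input format, a list of (is_objectionable, was_blocked) pairs, are illustrative assumptions and are not taken from the Netprotect report.

```python
def blocking_rates(decisions):
    """Compute (blocking_rate, overblocking_rate).

    decisions: iterable of (is_objectionable, was_blocked) boolean pairs.
    Both classes must be present, otherwise a rate is undefined.
    """
    decisions = list(decisions)
    bad = [blocked for objectionable, blocked in decisions if objectionable]
    good = [blocked for objectionable, blocked in decisions if not objectionable]
    if not bad or not good:
        raise ValueError("need at least one page of each kind")
    blocking_rate = sum(bad) / len(bad)        # objectionable pages blocked
    overblocking_rate = sum(good) / len(good)  # legitimate pages blocked
    return blocking_rate, overblocking_rate


# Toy data: 4 objectionable pages (3 blocked), 5 legitimate pages (1 blocked).
sample = [(True, True)] * 3 + [(True, False)] + [(False, True)] + [(False, False)] * 4
rates = blocking_rates(sample)
```

A good filter pushes the first value toward 1 and the second toward 0 at the same time.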

TABLE I

NETPROTECT’S EVALUATION FOR WEB FILTERING TOOLS [3]

| Filtering Tool | Blocking Efficiency | Overblocking Rate |
|---|---|---|
| BizGuard | 55% | 10% |
| Cyber Patrol | 52% | 2% |
| CYBER sitter | 46% | 3% |
| CYBER Snoop | 65% | 23% |
| Internet Watcher 2000 | 30% | 0% |
| Net Nanny | 20% | 5% |
| Norton Internet Security | 45% | 6% |
| Optenet | 79% | 25% |
| SurfMonkey | 65% | 11% |
| X-Stop | 65% | 4% |

B. Analysis of User Interest

In practical web filtering applications, the set of web pages related to the user interest may be considerably large. However, what the user desires may be just several homologous pages. To show the differences in user interest, we first give some examples and analyze the true requirements of the user.

Example 1: Problems in Pornographic Page Filtering

Nowadays, the Internet has become an important source of information. However, it is also host to pornographic, violent and other content that is inappropriate for most viewers. Web filtering can be used to block access to pages that violate a defined policy. With keyword filtering, a page is considered undesirable if it contains a certain number of forbidden keywords. The problem is that the meanings of words depend on the context.

• Different Page Subjects: For example, sites about breast cancer research or sexual harassment, or even the home page of someone named Sexton, could be blocked as forbidden pages of the “Pornographic Class”.

• Different Writer’s Viewpoint: Articles on combating pornographic pages are harmless.

• Different Expression Orientation: Pornographic pages also fall into many sub-classes such as gambling, nudity, violence, drugs, alcohol and so on. For example, Itzin [4] classified pornography into three sets: the sexually explicit and violent; the sexually explicit and nonviolent, but subordinating and dehumanizing; and the sexually explicit, nonviolent, and non-subordinating, based upon mutuality. Research consistently shows that harmful effects are associated with the first two, but that the third is usually harmless.
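The context problem in Example 1 can be reproduced with a naive keyword filter. This is a minimal sketch: the keyword list `FORBIDDEN`, the substring matching, and the threshold are hypothetical choices for illustration and do not describe any particular product.

```python
# Hypothetical keyword list, for illustration only.
FORBIDDEN = ("sex", "porn")


def keyword_filter(text, threshold=1):
    """Return True (block) when the page contains at least `threshold`
    occurrences of forbidden keywords, matched as plain substrings."""
    lowered = text.lower()
    hits = sum(lowered.count(word) for word in FORBIDDEN)
    return hits >= threshold


blocked_homepage = keyword_filter("Home page of John Sexton")
blocked_article = keyword_filter("Articles on combating pornographic pages")
blocked_weather = keyword_filter("Weather forecast for Harbin")
```

Both the personal home page and the article against pornography are blocked, although neither belongs to the forbidden class. This is the overblocking that a content-based filter is meant to avoid.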

Example 2: Problems in Personal Information Filtering

Information filtering deals with the timely delivery of information that is relevant to the user. An information filtering system assists users by filtering the data stream: it selects the articles deemed interesting to the user and eliminates the rest. However, a filtering system might not be able to perfectly differentiate the articles that are actually relevant to the user from the ones that are not. The proportion of irrelevant articles delivered to the user should be as low as possible, and so should the proportion of relevant articles eliminated.

• Different Page Subjects: An information filtering agent assists the user with the task of finding interesting news articles, but the articles may lie in one particular domain or in many domains such as academics, entertainment, migration, sports, etc.

• Different Writer’s Viewpoint: The user task of finding interesting news articles may include only articles supporting the event, or all the articles about the event.

• Different Expression Orientation: For example, the user task of finding news articles about “disaster” may include articles about rescue, damage, etc., or only the articles about one aspect.

As shown in the examples above, the user may be interested in only portions of the web filtering result, depending on the differences in page subject, writer’s viewpoint and expression orientation. We can therefore divide web filtering tasks into three levels according to the user interest:

• relativity-filter: the filtering result contains all the web pages with the same key phrases or key sentences. These web pages express the same subject, but may not be consistent in viewpoint or orientation. Typical applications of relativity-filtering include erotic web page filtering and hot topic tracing, which aim to collect all the web pages related to the topic, whether they approve of it or not.

• similarity-filter: the filtering result contains all the web pages that share the same subject, viewpoint and orientation with the user. Typical applications of similarity-filtering include filtering of web pages on racism or separatism. Similarity-filtering is stricter than relativity-filtering, as not only key words or sentences but also orientation is taken into consideration.

• homology-filter: the filtering result contains all the web pages with quite a lot of identical sentences or paragraphs. The filtering results are almost the same as the user interest, usually because articles from an official or authoritative website are redistributed by other websites with little modification. An example of homology-filtering is counting which article is the most reprinted one on Bulletin Board Systems.

We can define all the filtering results acquired by ML algorithms as relative results ($R_1$), and the filtering results to which the ML algorithms assign TRUE with probability near 1 as homologous results ($R_k$). The results of similarity-filtering, $R_i$, then satisfy $R_k \subseteq R_i \subseteq R_1$. As illustrated on the left of Fig. 1, most filtering tasks can be described as applications of similarity-filtering with different degrees of similarity between the acquired web pages and the user interest.
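One way to picture the nesting $R_k \subseteq R_i \subseteq R_1$ is to threshold a filter's confidence score at three increasing levels. This is only a sketch of the set relationship: the threshold values and the function name `filtering_levels` are assumptions, and BSVM itself obtains $R_i$ by biasing the separating hyperplane rather than by post-hoc thresholding.

```python
def filtering_levels(scores, relative_t=0.5, similar_t=0.8, homologous_t=0.95):
    """Split scored pages into nested result sets (R1, Ri, Rk).

    scores: dict mapping page id -> estimated probability of TRUE.
    The three thresholds are illustrative assumptions.
    """
    if not relative_t <= similar_t <= homologous_t:
        raise ValueError("thresholds must be non-decreasing")
    r1 = {page for page, p in scores.items() if p >= relative_t}
    ri = {page for page, p in scores.items() if p >= similar_t}
    rk = {page for page, p in scores.items() if p >= homologous_t}
    return r1, ri, rk


scores = {"a": 0.99, "b": 0.85, "c": 0.60, "d": 0.20}
r1, ri, rk = filtering_levels(scores)
```

Because the thresholds are ordered, every page in `rk` is also in `ri`, and every page in `ri` is also in `r1`. Moving `similar_t` between the other two thresholds corresponds to choosing a different similarity degree for $R_i$.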

[Figure 1 labels: User interest (Uk); General Filtering Result (R1); User interest of high similarity (Rk); Adaptable filtering result (Ri); Internet (U)]

Fig. 1. Analysis and demonstration of filtering result estimation. The area outside the biggest circle is the filtering scope $U$, and the smallest circle is the user interest $U_k$. The biggest circle $R_1$ is the filtering result of general ML algorithms, based on content relativity; the smaller circle $R_k$ is the filtering result based on content homology. The middle circle $R_i$ is the biased filtering result, adapted to user demand, based on content similarity.

C. Current Machine Learning Approaches and Their Failure

Web filtering by ML techniques is widely discussed in the literature. A few major ML algorithms are often chosen to construct web filters because of their simplicity, flexibility and robustness:

Decision Trees are an ML approach to the automatic induction of filtering trees from training data [5], [6], [7]. A decision tree is a graph of nodes connected by arcs, with each internal node corresponding to a feature and each arc to a possible value of that feature. Decision trees are easily interpretable by humans and have low computational complexity, which makes them a simple and practical idea in the field of ML.

Rule Induction methods [8], [9] try to find a proper set of DNF (disjunctive normal form) rules for the filtering task such that the error rate on the training set is minimal. By means of local optimization techniques, rule induction methods dynamically evaluate rules and revise the covering rule set.

K-Nearest Neighbor (KNN) [10], [11] selects the k most similar documents from the training set and uses the categories of these documents to determine the category of the document being classified. Documents are represented by vectors of words, and the similarity between two documents is measured using the Euclidean distance or another function of these vectors.
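The KNN decision rule described above can be sketched in a few lines. This is a minimal illustration with Euclidean distance and majority voting. The names `euclidean` and `knn_predict` and the toy vectors are assumptions, and the cited systems [10], [11] may weight votes or measure similarity differently.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two equal-length term-weight vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train, query, k=3):
    """Label `query` by majority vote among its k nearest training pages.

    train: list of (vector, label) pairs.
    """
    nearest = sorted(train, key=lambda item: euclidean(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


train = [
    ([1.0, 0.0], True),
    ([0.9, 0.1], True),
    ([0.8, 0.2], True),
    ([0.0, 1.0], False),
    ([0.1, 0.9], False),
]
relevant = knn_predict(train, [1.0, 0.1], k=3)
irrelevant = knn_predict(train, [0.0, 1.0], k=3)
```

Note that the second query is outvoted 2 to 1 in favor of FALSE even though one of its three neighbors is a distant positive, which shows how the class distribution of the training set influences the result.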

Naïve Bayes has been applied to web page filtering in [10], [12], [13], [14], [15]. It uses the joint probabilities of words and categories to estimate the probabilities of categories given a document. Documents with a probability above a certain threshold are considered relevant.
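The Naïve Bayes estimate can be sketched as follows, assuming a multinomial event model with add-one (Laplace) smoothing. The cited works differ in their event models, so this is one common variant rather than the method of any specific reference, and the names `train_nb` and `prob_relevant` are hypothetical.

```python
import math
from collections import Counter


def train_nb(docs):
    """docs: list of (tokens, is_relevant). Both classes must be present."""
    counts = {True: Counter(), False: Counter()}
    n_docs = Counter()
    for tokens, label in docs:
        counts[label].update(tokens)
        n_docs[label] += 1
    vocab = set(counts[True]) | set(counts[False])
    return counts, n_docs, vocab


def prob_relevant(model, tokens):
    """Posterior probability that a document is relevant."""
    counts, n_docs, vocab = model
    total_docs = sum(n_docs.values())
    log_scores = {}
    for label in (True, False):
        total_terms = sum(counts[label].values())
        score = math.log(n_docs[label] / total_docs)  # class prior
        for t in tokens:
            if t in vocab:  # unseen words are ignored
                # Add-one smoothed word likelihood, features taken as independent.
                score += math.log((counts[label][t] + 1) / (total_terms + len(vocab)))
        log_scores[label] = score
    top = max(log_scores.values())
    exp = {label: math.exp(s - top) for label, s in log_scores.items()}
    return exp[True] / (exp[True] + exp[False])


docs = [
    (["cheap", "pills"], True),
    (["cheap", "offer"], True),
    (["meeting", "notes"], False),
    (["project", "meeting"], False),
]
model = train_nb(docs)
p_spam = prob_relevant(model, ["cheap", "pills"])
p_work = prob_relevant(model, ["meeting"])
```

A document is then treated as relevant when the returned probability exceeds a chosen threshold, for example 0.5.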

Lee et al. [2] applied Artificial Neural Networks to identify members of the forbidden class. Such networks learn patterns by modifying the weights among nodes based on learning examples.

Support Vector Machines (SVM) [16], [17], [18] are also a major statistical method. An SVM finds the surface that separates the positives from the negatives with the widest possible margin among all the surfaces in $|F|$-dimensional space. SVM copes well with large-scale training sets and does not require human or machine effort for parameter tuning.

As compared in [19], [20], [21], SVM achieved the best performance on different filtering corpora, with strong robustness and acceptable efficiency. In contrast, the Naïve Bayes assumption that ignores dependence between features reduces its ability to analyze web content; Artificial Neural Networks are computationally expensive; and the over-fitting of Decision Trees and Rule Induction when describing the user interest makes them unsatisfactory.

However, as shown above, web filters based on ML algorithms cannot achieve satisfactory results. This is because it is difficult to understand and express the true meaning of the user interest. Current ML algorithms acquire the user interest only by analyzing the arrangement of words and expressions in the training examples. They neglect much information hidden in the training set, such as the ratio of the number of positive examples to the number of negative examples, the maximum distribution radius of the positives, the maximum distribution radius of the negatives, and so on. In fact, such hidden information is quite valuable for expressing which portions of the web filtering result the user may be interested in. Therefore, this paper tries to give ML algorithms the ability to analyze this information. The improved ML algorithm is based on SVM because of its strong robustness and acceptable efficiency.

III. BIASED SUPPORT VECTOR MACHINE FOR WEB FILTERING

A. Biased Support Vector Machine Algorithm

To fit the user interest better, we must add an adjusting ability to the ML algorithms. The approach proposed in this paper therefore introduces a stimulant function that uses the training example distribution $n_+/n_-$ and a user-adaptable parameter $k$ to treat the different classes of pre-assigned pages in an imbalanced way, so as to best fit the user interest. The approach is called Biased Support Vector Machine, and a detailed description and analysis are given in [22].

In the classical SVM, a penalty function $F = C \cdot \sum \xi_i$ is introduced as an additional capacity control function, where the non-negative variable $\xi_i$ measures the misclassification error and the coefficient $C$ sets the degree of tolerance of misclassification errors. Consequently, the width of the margin decreases as $C$ increases.


BSVM introduces a stimulant function, $F = C \cdot [(k-1) \cdot n_- \sum_{y_i=1} \xi_i - n_+ \sum_{y_i=-1} \xi_i]/n$, as an extension of the penalty function. In BSVM, we describe positives as the examples with $y_i = +1$ and negatives as the examples with $y_i = -1$; thus we define $n_+ = |\{y_i = +1\}|$ and $n_- = |\{y_i = -1\}|$, and $n = n_+ + n_-$ is the total number of training examples, as the simplification in Equation 1 requires. The stimulant function uses both the training example distribution $n_+/n_-$ and a user-adaptable parameter $k$ to express the degree of user bias between the classes. Together with the effect of the penalty function, the bias is described in Equation 1. The width of the margin on the positive side decreases as $n_+/n_-$ or $k$ increases. Thus BSVM can find a proper separating hyperplane with filtering result $R_i$ between $R_1$ and $R_k$.

$$\text{bias} = \frac{C + C \cdot (k-1) \cdot n_-/n}{C - C \cdot n_+/n} = \frac{1 + (k-1) \cdot n_-/n}{1 - n_+/n} = \frac{n_+/n + k \cdot n_-/n}{n_-/n} = k + \frac{n_+}{n_-} \tag{1}$$

BSVM is formulated as follows. The generalized optimal separating hyperplane is determined by the vector $w$ that minimizes the functional

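$$\min_{w,b,\xi} \; \frac{1}{2}\|w\|^2 + C \sum_i \xi_i + C_1 \sum_{y_i=1} \xi_i - C_2 \sum_{y_i=-1} \xi_i, \quad \text{where } C_1 = C \cdot (k-1) \cdot n_-/n, \; C_2 = C \cdot n_+/n, \; k \ge 0 \tag{2}$$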

subject to the constraints:

$$y_i (w \cdot x_i - b) \ge 1 - \xi_i, \quad \text{where } \xi_i \ge 0, \; \forall i \tag{3}$$

Here $C_1$ and $C_2$ are the stimulant coefficients for classification errors, and $k \ge 0$ is an adaptable parameter. The solution to the optimization problem of Equation 2 under the constraints of Equation 3 is given by the saddle point of the Lagrangian:

$$L(w, b, \xi, \alpha, \beta) = \frac{1}{2}\|w\|^2 + (C + C_1) \sum_{y_i=1} \xi_i + (C - C_2) \sum_{y_i=-1} \xi_i - \sum_i \alpha_i \left( y_i [w \cdot x_i - b] - 1 + \xi_i \right) - \sum_i \beta_i \xi_i \tag{4}$$

where $\alpha$ and $\beta$ are the Lagrange multipliers. The Lagrangian has to be minimized with respect to $w$, $b$, $\xi$ and maximized with respect to $\alpha$, $\beta$. Hence the solution to the problem is given by:

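$$\min_{\alpha} Q(\alpha) = \frac{1}{2} \sum_{i,j=1}^{n} \alpha_i \alpha_j y_i y_j K(x_i, x_j) - \sum_{i=1}^{n} \alpha_i \tag{5}$$

with the constraints:

$$\sum_{i=1}^{n} y_i \alpha_i = 0 \tag{6}$$

and

$$\begin{cases} 0 \le \alpha_i \le C + C_1 & \text{if } y_i = 1 \\ 0 \le \alpha_i \le C - C_2 & \text{if } y_i = -1 \end{cases} \tag{7}$$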
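To make the effect of $k$ and $n_+/n_-$ concrete, the following sketch evaluates the upper bounds of Equation 7 and the bias of Equation 1 for given inputs. The function name `bsvm_bounds` is hypothetical, and the input values $C = 1$, $k = 5$, $n_+ = 400$, $n_- = 100$ are chosen only for illustration; they are not claimed to be the exact settings of the experiments below.

```python
def bsvm_bounds(C, k, n_pos, n_neg):
    """Return (upper_pos, upper_neg, bias) for the BSVM dual problem.

    upper_pos = C + C1 bounds alpha_i for positives (y_i = +1),
    upper_neg = C - C2 bounds alpha_i for negatives (y_i = -1),
    bias is their ratio, which Equation 1 simplifies to k + n_pos / n_neg.
    """
    if n_pos <= 0 or n_neg <= 0:
        raise ValueError("both classes need at least one example")
    n = n_pos + n_neg
    c1 = C * (k - 1) * n_neg / n   # stimulant coefficient C1
    c2 = C * n_pos / n             # stimulant coefficient C2
    upper_pos = C + c1
    upper_neg = C - c2
    return upper_pos, upper_neg, upper_pos / upper_neg


upper_pos, upper_neg, bias = bsvm_bounds(C=1.0, k=5, n_pos=400, n_neg=100)
```

With these inputs the bound for positives is 1.8 and the bound for negatives is 0.2, so an error on a positive example can be penalized nine times as heavily as an error on a negative one, in agreement with $k + n_+/n_- = 5 + 4$.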

B. Experiments and Analysis

In our experiment, the forbidden pages belong to the category of Adult content. We collected a total of 500 web pages by searching with the keyword “porn”. The corpus was reviewed and labeled by human editors, and it includes 100 non-pornographic web pages and 400 pornographic web pages. After taking 1/5 of each class as training examples, we measured the training accuracy of SVM and BSVM, as shown in Table II.

TABLE II

TRAINING ACCURACY OF SVM AND BIASED SVM (K=5)

| Algorithm | Web Page | Correct | Incorrect | Total |
|---|---|---|---|---|
| SVM | Pornographic | 378 (94.5%) | 22 (5.5%) | 400 |
| SVM | Non-pornographic | 69 (69.0%) | 31 (31.0%) | 100 |
| SVM | Total | 447 (89.4%) | 53 (10.6%) | 500 |
| BSVM | Pornographic | 396 (99.0%) | 4 (1.0%) | 400 |
| BSVM | Non-pornographic | 78 (78.0%) | 22 (22.0%) | 100 |
| BSVM | Total | 474 (94.8%) | 26 (5.2%) | 500 |
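The percentages in Table II follow directly from the counts, as the short check below shows. The counts are copied from the table, while the dictionary layout and the function name `accuracy_report` are illustrative.

```python
# (correct, total) counts per class, copied from Table II.
TABLE_II = {
    "SVM": {"pornographic": (378, 400), "non_pornographic": (69, 100)},
    "BSVM": {"pornographic": (396, 400), "non_pornographic": (78, 100)},
}


def accuracy_report(rows):
    """Per-class and overall accuracy from (correct, total) counts."""
    report = {name: correct / total for name, (correct, total) in rows.items()}
    all_correct = sum(correct for correct, _ in rows.values())
    all_total = sum(total for _, total in rows.values())
    report["total"] = all_correct / all_total
    return report


svm = accuracy_report(TABLE_II["SVM"])
bsvm = accuracy_report(TABLE_II["BSVM"])
```

The overall accuracy rises from 0.894 for SVM to 0.948 for BSVM, and the number of incorrectly classified pages drops from 53 to 26.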

To show the impact of the adaptable parameters on BSVM, we experiment on benchmark collections of Chinese web pages¹ prepared by FuDan University. The collections include 9804 training examples and 9833 evaluation documents, which consist of a set of Chinese newswire stories classified under 20 categories. In this paper, we experiment on a document set made of two related categories (history and politics) of the benchmark. The document set contains 2800 web pages in total (2000 pages about politics as positives, 800 pages about history as negatives, and 1/10 of each as training examples). We compute the filtering precision for positives under different values of $C$, and show the influence of $d = n_+/n_-$ and $k$ in Fig. 2. The results indicate that the filtering precision for positives increases as $n_+/n_-$ and $k$ increase.

Fig. 2. BSVM filtering efficiency for different $k$ and $n_+/n_-$. The left figure shows the influence of the parameter $d = n_+/n_-$ on the filtering precision for positives ($k = 1$). The right figure shows the influence of the parameter $k$ on the filtering precision for positives ($n_+/n_- = 1$).

IV. CONCLUSION AND FUTURE WORK

In this paper, we study the different scopes of the filtering result according to different filtering tasks and user interests. We find that the web filtering result can be divided into three sets, the relative page set ($R_1$), the similar page set ($R_i$) and the homologous page set ($R_k$), with the relationship $R_k \subseteq R_i \subseteq R_1$. To adjust the web filtering result to better fit the user interest, a Biased Support Vector Machine (BSVM) algorithm is introduced, which uses a stimulant function based on the training example distribution $n_+/n_-$ and a user-adaptable parameter $k$ to treat the different classes of pre-assigned pages in an imbalanced way. Experiments show that BSVM can greatly improve web filtering performance: in the experiment of Table II, it raised the overall accuracy from 89.4% for SVM to 94.8%. However, the problems of describing user bias and of self-adapting parameters are still open, and we leave them as future work.

¹ The benchmark and a detailed description (in Chinese) are available at http://www.nlp.org.cn/docs/doclist.php?cat_id=16&type=15.

REFERENCES

[1] N. J. Belkin and W. B. Croft, “Information Filtering and Information Retrieval: Two Sides of the Same Coin?” Communications of the ACM, vol. 35, no. 12, pp. 29–38, Dec. 1992.

[2] P. Y. Lee, S. C. Hui, and A. C. M. Fong, “Neural networks for web content filtering,” IEEE Intelligent Systems, vol. 17, no. 5, pp. 48–57, 2002.

[3] Netprotect Project, “Report on currently available COTS filtering tools,” Technical report, 2001.

[4] O. B. Longe and F. A. Longe, “The Nigerian web content: Combating pornographic using content filters,” Journal of Information Technology Impact, vol. 5, no. 2, pp. 59–64, 2005.

[5] J. R. Quinlan, “Discovering rules by induction from large collections of examples,” Expert Systems in the Micro-Electronic Age, pp. 168–201, 1979.

[6] J. R. Quinlan, “Induction of decision trees,” Machine Learning, vol. 1, no. 1, pp. 81–106, 1986.

[7] J. R. Quinlan, C4.5: programs for machine learning. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 1993.

[8] F. D. Chid Apte and S. Weiss, “Text mining with decision rules and decision trees,” in Proceedings of the Conference on Automated Learning and Discovery, CMU, June 1998.

[9] P. Clark and T. Niblett, “The CN2 induction algorithm,” Mach. Learn., vol. 3, no. 4, pp. 261–283, 1989.

[10] M. Iwayama and T. Tokunaga, “Cluster-based text categorization: a comparison of category search strategies,” in Proceedings of SIGIR-95, 18th ACM International Conference on Research and Development in Information Retrieval, E. A. Fox, P. Ingwersen, and R. Fidel, Eds. Seattle, US: ACM Press, New York, US, 1995, pp. 273–281.

[11] B. Masand, G. Linoff, and D. Waltz, “Classifying news stories using memory based reasoning,” in SIGIR ’92: Proceedings of the 15th annual international ACM SIGIR conference on Research and development in information retrieval. New York, NY, USA: ACM Press, 1992, pp. 59–65.

[12] S. Chakrabarti, B. E. Dom, and P. Indyk, “Enhanced hypertext categorization using hyperlinks,” in Proceedings of SIGMOD-98, ACM International Conference on Management of Data, L. M. Haas and A. Tiwary, Eds. Seattle, US: ACM Press, New York, US, 1998, pp. 307–318.

[13] K. M. A. Chai, H. L. Chieu, and H. T. Ng, “Bayesian online classifiers for text classification and filtering,” in SIGIR ’02: Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval. New York, NY, USA: ACM Press, 2002, pp. 97–104.

[14] A. McCallum and K. Nigam, “A comparison of event models for naive Bayes text classification,” in AAAI-98 Workshop on Learning for Text Categorization, 1998.

[15] A. McCallum, K. Nigam, J. Rennie, and K. Seymore, “A machine learning approach to building domain-specific search engines,” in The Sixteenth International Joint Conference on Artificial Intelligence (IJCAI-99), 1999.

[16] T. Joachims, “Text categorization with support vector machines: Learning with many relevant features,” in Proceedings of the European Conference on Machine Learning. Berlin, Germany: Springer, 1998, pp. 137–142.

[17] T. Joachims, N. Cristianini, and J. Shawe-Taylor, “Composite kernels for hypertext categorisation,” in Proceedings of ICML-01, 18th International Conference on Machine Learning, C. Brodley and A. Danyluk, Eds. Williams College, US: Morgan Kaufmann Publishers, San Francisco, US, 2001, pp. 250–257.

[18] V. Vapnik, Statistical Learning Theory. New York: John Wiley & Sons, 1998.

[19] A. Du and B. Fang, “Comparison of machine learning algorithms in Chinese web filtering,” in Proceedings of the Third International Conference on Machine Learning and Cybernetics. Shanghai, China: IEEE Press, 2004, pp. 2521–2526.

[20] F. Sebastiani, “Machine learning in automated text categorization,” ACM Comput. Surv., vol. 34, no. 1, pp. 1–47, 2002.

[21] Y. Yang and X. Liu, “A re-examination of text categorization methods,” in SIGIR ’99: Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval. New York, NY, USA: ACM Press, 1999, pp. 42–49.

[22] A. Du and B. Fang, “A biased support vector machine approach to web filtering,” in ICAPR ’05: Proceedings of the Third International Conference on Advances in Pattern Recognition, C. A. P. P. Sameer Singh, Maneesha Singh, Ed. Springer Verlag, Heidelberg, D-69121, Germany, 2005, pp. 363–370.