Vol 13, No 1 (2014)

(1)

IJCSBI.ORG

ISSN: 1694-2108 | Vol. 13, No. 1. MAY 2014 62

Progression of String Matching Practices in Web Mining – A Survey

Kaladevi A. C.

Associate Professor,

Department of Computer Science and Engineering, Sona College of Technology,

Salem, India

Nivetha S. M.

PG Scholar,

Department of Computer Science and Engineering, Sona College of Technology,

Salem, India

ABSTRACT

Approximate string matching is the technique of finding strings that match a pattern approximately. The problem can be divided into two sub-problems: finding approximate substring matches inside a given string, and finding dictionary strings that match the pattern approximately. The basic technique is dictionary-based entity extraction, which identifies predefined entities in a document; its recall is low. The next step toward improving recall is approximate entity extraction, which, for a given query, finds all substrings in a document that approximately match entities in a given dictionary. This produces redundant matches and lowers performance. To overcome this drawback, a technique called Approximate Membership Localization is used, which is solved via the P-Prune algorithm. This paper surveys the performance and accuracy of string matching techniques and presents the idea of using P-Prune in a blog-search framework.

Keywords

Blog, P-Prune, Approximate Membership Localization, Approximate Membership Extraction, RSS Feeds

1. INTRODUCTION

Data mining is the process of discovering interesting patterns in a large data set. It is the automatic or semi-automatic analysis of large amounts of data to extract previously unknown patterns such as groups of data records, unusual records, and dependencies.


The primary data mining techniques are classification, clustering, and association rule mining. Data mining is used in fields such as games, business, science and engineering, medical data mining, sensor data mining, pattern mining, spatial data mining, and the knowledge grid. Despite this wide range of applications, two critical factors constrain data mining:

• Database size: a large volume of data to be processed and maintained requires a more powerful system.

• Query complexity: a larger number of highly complex queries requires a system with greater processing power.

Web mining is the application of data mining techniques to extract knowledge from Web data, where the Web can be defined as a collection of interrelated files on one or more web servers. In this paper we analyze various techniques used for string matching in web mining. The string matching problem is that of finding all occurrences of a given pattern in a text, where both are sequences of characters from a finite set. If k, l, and m are strings, then k is said to be a prefix of kl, a suffix of mk, and a factor of lkm. Various algorithms exist to solve this problem; the three main search approaches are prefix searching, suffix searching, and factor searching. An extension is the multiple string matching technique, in which the length of the strings is taken into account. A simpler solution to it is to repeat the search for each pattern. The most recent algorithm for string matching considered here is the Potential Redundancy Prune (P-Prune) algorithm, applied within a web-search-based framework. Search in textual documents requires only a static dictionary. The major applications of string matching include intrusion detection, plagiarism detection, bioinformatics, digital forensics, and text mining research.
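The prefix, suffix, and factor relations and the basic matching problem can be illustrated with a short Python sketch. The function name find_all and the sample strings are illustrative choices that do not come from the surveyed work, and the naive scan shown is only the simplest baseline, not one of the prefix, suffix, or factor search algorithms.

```python
def find_all(pattern: str, text: str) -> list[int]:
    """Return every start index at which pattern occurs in text (naive scan)."""
    m = len(pattern)
    return [i for i in range(len(text) - m + 1) if text[i:i + m] == pattern]


# Illustrative strings (hypothetical values).
k, l, m = "ab", "cd", "ef"
assert (k + l).startswith(k)   # k is a prefix of kl
assert (m + k).endswith(k)     # k is a suffix of mk
assert k in (l + k + m)        # k is a factor of lkm

# Overlapping occurrences are reported as separate matches.
positions = find_all("ana", "bananas")
```

Repeating this call once per pattern gives the simple multiple-pattern solution mentioned above, at the cost of one full pass over the text for each pattern.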

Owing to their rapid growth, blogs have become a highly active part of the World Wide Web [1]. With only basic technical knowledge, anyone can create a blog using popular blogging services such as Blogger and WordPress. As with web search, string matching algorithms can be applied to a blog search framework by collecting Really Simple Syndication (RSS) feeds from various blogs. This requires a dynamic dictionary, since the opinions [2] about blogs, which are updated often, must be analyzed. We have presented a paper [3] on this research. The results revealed that search over blogs improves considerably when the P-Prune algorithm is applied to solve the Approximate Membership Localization (AML) problem.

2. EVOLUTION OF STRING MATCHING TECHNIQUES

2.1 Bloom Filter


dictionary with allowable fraction of errors. He described two new hash-coding methods related to the conventional hash-coding method. The two major computational factors considered are reject time and space. In conventional hash coding, errors are not permitted: the hash area is split into cells, a pseudorandom number generated from each message selects a cell, and the message is stored there. To test a message, the content of the corresponding cell is compared with it; a match indicates that the message is a member. The new hash-coding methods are slight variations of the conventional method.

In the first method, the hash area is organized as in the conventional method, but instead of the entire message, a code generated from the message is stored. The smaller the allowable fraction of errors, the larger the cell size must be. Codes are tested as in the conventional method. Because the codes are not unique, errors of commission (false positives) may arise. In the second method, the hash area is divided into individually addressable bits, and each message is hash coded into a number of distinct bit addresses.
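The second method can be sketched in Python as follows. The class name BloomFilter, the use of index-salted SHA-256 digests as the hash functions, and the default sizes are assumptions made for illustration; Bloom's paper [4] does not prescribe them. Unlike the description above, this sketch does not force the derived bit addresses to be distinct.

```python
import hashlib


class BloomFilter:
    """Minimal sketch of a bit-array membership filter (assumed parameters)."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 4) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for readability

    def _addresses(self, message: str) -> list[int]:
        # Derive num_hashes bit addresses by salting SHA-256 with an index.
        addresses = []
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{message}".encode()).digest()
            addresses.append(int.from_bytes(digest[:8], "big") % self.num_bits)
        return addresses

    def add(self, message: str) -> None:
        for addr in self._addresses(message):
            self.bits[addr] = 1

    def might_contain(self, message: str) -> bool:
        # True may be an error of commission; False is always correct.
        return all(self.bits[addr] for addr in self._addresses(message))


bf = BloomFilter()
bf.add("salem")
```

A stored message is always reported as a member, whereas a message that was never stored is rejected unless all of its bit addresses happen to have been set by other messages.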

2.2 Bloom Filter along with Password Security

U. Manber and S. Wu [5] extended the Bloom filter from exact matching to approximate matching, with application to password security. The primary focus of their paper is the prevention of password guessing. With a Bloom filter, even a small change in the input value changes the hash value, so the filter by itself is not suited to approximate queries. Instead, all possible variations of each dictionary string within a distance threshold d are generated and inserted into Bloom filters. This works well only for small values of d, because generating the variations becomes computationally expensive for large distance thresholds. The data structure can be used in applications that require fast approximate queries to large databases, and the authors discuss two such applications. The first is spell checking in large bibliographic files: a regular spell checker is used initially, and the words not found in the dictionary are checked for their distance. The second is filtering, in which the pattern-matching problem is reduced by splitting the patterns into smaller pieces.
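The idea of generating all variations of each dictionary string can be sketched for a distance threshold of 1. The function names, the lowercase alphabet, and the use of a plain Python set in place of a Bloom filter are assumptions made for illustration and are not details taken from [5].

```python
import string


def variations(word: str, alphabet: str = string.ascii_lowercase) -> set[str]:
    """All strings within edit distance 1 of word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = {a + b[1:] for a, b in splits if b}
    substitutes = {a + c + b[1:] for a, b in splits if b for c in alphabet}
    inserts = {a + c + b for a, b in splits for c in alphabet}
    return {word} | deletes | substitutes | inserts


def build_index(dictionary: list[str]) -> set[str]:
    """Store every variation; a Bloom filter could replace the set."""
    index: set[str] = set()
    for word in dictionary:
        index |= variations(word)
    return index


index = build_index(["password", "secret"])
# One exact lookup now answers an approximate query with threshold 1.
is_weak = "passwond" in index
```

The number of stored variations grows roughly in proportion to the word length times the alphabet size for a threshold of 1, and much faster for larger thresholds, which is the cost noted above.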

The two extensions to Bloom filters that permit fast approximate set membership are as follows:

• Reduction of an approximate query to several exact queries.

• Effective utilization of secondary memory by directing all hashes of the same element to the same page.

2.3 Multipattern Matching technique


technique which is building a tree over all patterns. This idea significantly reduces the number of comparisons between substrings and patterns. Despite this advantage, the drawback of the method is that it works only for exact matches and not for approximate matches, because the tree cannot be constructed over approximate matches. Approximate matching is modeled using a distance function. Four types of existing algorithms can be adapted for approximate string matching.

The first and oldest approach is based on dynamic programming. The second approach, which uses a function of the pattern, is well suited to short patterns. The third approach is bit-parallelism, which yields most of the successful results. The final approach is filtration, in which the areas that cannot contain a match are discarded and the remaining part is verified using some other algorithm; however, its efficiency degrades as the threshold value increases. The efficiency of approximate matching can be increased by adopting additional measures.
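As an illustration of the dynamic programming approach, the following Python sketch computes the edit distance between two strings with unit costs for insertion, deletion, and substitution. The unit costs and the function name are assumptions; the surveyed work does not fix a particular distance function.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, keeping only two rows of the DP table."""
    prev = list(range(len(b) + 1))          # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        curr = [i]                           # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,         # delete ca
                            curr[j - 1] + 1,     # insert cb
                            prev[j - 1] + cost)) # match or substitute
        prev = curr
    return prev[-1]


distance = edit_distance("kitten", "sitting")
```

A candidate string is accepted as an approximate match when this distance does not exceed the chosen threshold. The table has one cell per pair of positions, so the cost grows with the product of the two string lengths.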

2.4 Cumulative Gain

K. Jarvelin and J. Kekalainen [8] presented cumulative gain, which is the sum of the graded relevance values of the results in a search result list. They extended this concept to the evaluation of information retrieval techniques by their ability to retrieve highly relevant documents, taking graded relevance judgments into account. Cumulative gain is analyzed through three measures:

• Direct Cumulated Gain (CG): accumulation of the relevance scores of the retrieved documents along the ranked result list.

• Discounted Cumulated Gain (DCG): application of a discount factor to the relevance scores in order to devalue late-retrieved documents.

• Normalized (D)CG measure (nDCG): division of the (D)CG at each position by the maximum possible (D)CG up to that position, which is obtained by sorting the documents of the result list by relevance.
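The three measures can be computed as in the following Python sketch. The logarithmic discount with base 2 (ranks below the base are not discounted) follows the usual presentation of [8]; the function names and the sample gain values are illustrative assumptions, and the ideal ordering is taken from the same result list, which is a simplification.

```python
import math


def cumulated_gain(gains: list[float]) -> list[float]:
    out, total = [], 0.0
    for g in gains:
        total += g
        out.append(total)
    return out


def discounted_cumulated_gain(gains: list[float], base: int = 2) -> list[float]:
    out, total = [], 0.0
    for rank, g in enumerate(gains, start=1):
        # Gains at ranks below the base are added undiscounted.
        total += g if rank < base else g / math.log(rank, base)
        out.append(total)
    return out


def normalized_dcg(gains: list[float], base: int = 2) -> list[float]:
    ideal = discounted_cumulated_gain(sorted(gains, reverse=True), base)
    actual = discounted_cumulated_gain(gains, base)
    return [a / i if i > 0 else 0.0 for a, i in zip(actual, ideal)]


graded = [3, 2, 3, 0, 1, 2]   # hypothetical relevance grades, best rank first
cg = cumulated_gain(graded)
ndcg = normalized_dcg(graded)
```

An nDCG value of 1 at a position means that the ranking up to that position is as good as the ideal ordering of the same documents.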

The strengths of these measures are as follows:

• The relevance of a document and its rank are combined.

• Irrespective of the number of documents, cumulative gain can be given as a single measure.

• The measures are independent of outliers.

• The number of documents for which the nDCG values hold is given explicitly.


Despite these advantages, the measures have certain limitations:

• Redundancy of documents is not considered.

• Although relevance is multidimensional, the measures consider only a single dimension.

• Any measure based on static relevance judgments cannot handle dynamic changes.

2.5 Fast Similarity Search (FastSS)

B. Bocek et al. [9] introduced an algorithm called Fast Similarity Search (FastSS), which performs an exhaustive similarity search in a dictionary. It is based on the edit distance form of string similarity, that is, the minimum number of operations needed to transform one string into another. Approximate dictionary queries [10] are the basis for this search scheme. For a dictionary containing w words of average length l, with at most k spelling errors, FastSS uses a deletion dictionary of size O(w·l^k). At search time each query is mutated to generate a deletion neighborhood of size O(l^k). This makes the search faster, since insertions and replacements need not be enumerated. Both online and offline algorithms are considered: online algorithms search without preprocessing, whereas offline algorithms perform preprocessing and store the data in memory to speed up the search. FastSS is an exhaustive offline search technique. Various algorithms were applied to a random dictionary, and the results show that FastSS outperforms the others.
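The deletion dictionary and the deletion neighborhood of a query can be sketched in Python as follows. The function names and the sample words are assumptions made for illustration, and the sketch omits the final verification step: words that share a deletion variant with the query are only candidates and still need to be checked against the edit distance threshold.

```python
from collections import defaultdict
from itertools import combinations


def deletion_neighborhood(word: str, k: int) -> set[str]:
    """All strings obtained from word by deleting at most k characters."""
    result = set()
    n = len(word)
    for d in range(min(k, n) + 1):
        for removed in combinations(range(n), d):
            skip = set(removed)
            result.add("".join(ch for i, ch in enumerate(word) if i not in skip))
    return result


def build_deletion_dictionary(words: list[str], k: int) -> dict[str, set[str]]:
    """Offline step: map every deletion variant back to its source words."""
    index: dict[str, set[str]] = defaultdict(set)
    for w in words:
        for variant in deletion_neighborhood(w, k):
            index[variant].add(w)
    return index


def candidates(query: str, index: dict[str, set[str]], k: int) -> set[str]:
    """Online step: look up each deletion variant of the query exactly."""
    found: set[str] = set()
    for variant in deletion_neighborhood(query, k):
        found |= index.get(variant, set())
    return found


index = build_deletion_dictionary(["test", "text", "rest"], k=1)
hits = candidates("tet", index, k=1)
```

Only deletions are generated on both sides, so no alphabet is needed and each query reduces to a small number of exact lookups.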

2.6 Filtration - Verification


candidate members and the memory space required is comparable with LSH. Finally, they showed that this filter efficiently filters out a large number of the non-member substrings.

2.7 Similarity Functions defined by Exploiting Web Search Engines

S. Chaudhuri et al. [12] proposed a method that exploits web search engines to define new similarity functions. The entity matching task identifies pairs of entities, one from a reference entity table and the other from an external entity list, that match each other; that is, it checks whether a candidate string matches a member of the reference table. An example application is an offer matching system, which consolidates offers (e.g., listed prices for products) from multiple retailers. New document-based similarity measures are proposed to quantify similarity in the context of multiple documents. The challenge, however, is that it is hard to obtain a large number of documents containing a given string unless a large portion of the web is crawled and indexed, as search engines do. The authors define a class of synonyms in which each synonym for an entity e is an identifying set of tokens that, when mentioned contiguously (or within a small window), refers to e with high probability; such identifying token sets are called IDTokenSets. The approach is used to compute a string similarity score between the candidate and reference strings.

Further, they developed efficient techniques to support approximate matching in the context of certain similarity functions, and demonstrated the accuracy and efficiency of their technique in an extensive experimental evaluation. The drawback is that it does not match document words and the quality of the IDTokenSets is low.

2.8 Approximate Membership extraction (AME)

Jiaheng Lu et al. [13] applied the Jaccard similarity function within a k-signature scheme.

$$J(S_1, S_2) = \frac{wt(S_1 \cap S_2)}{wt(S_1 \cup S_2)} \qquad (1)$$

Eq. 1 gives the Jaccard similarity function, where S1 and S2 are any two strings whose weights are known and wt(·) denotes the total weight of a set of tokens.
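The weighted Jaccard similarity can be computed as in the following Python sketch. Whitespace tokenization, lowercasing, the default weight of 1.0 for unknown tokens, and the sample weights are assumptions made for illustration; [13] may tokenize and weight strings differently.

```python
def weight(tokens: set[str], wt: dict[str, float]) -> float:
    """Total weight of a token set; unknown tokens get weight 1.0 (assumed)."""
    return sum(wt.get(t, 1.0) for t in tokens)


def jaccard(s1: str, s2: str, wt: dict[str, float]) -> float:
    """Weight of the shared tokens divided by the weight of all tokens."""
    t1, t2 = set(s1.lower().split()), set(s2.lower().split())
    union = weight(t1 | t2, wt)
    return weight(t1 & t2, wt) / union if union else 0.0


# Hypothetical token weights, e.g. derived from inverse document frequency.
token_weights = {"sona": 3.0, "college": 1.0, "of": 0.5, "technology": 1.5}
similarity = jaccard("Sona College of Technology", "Sona College", token_weights)
```

In this example the shared tokens weigh 4.0 and all tokens together weigh 6.0, so the similarity is about 0.67 even though half of the tokens are missing, because the missing tokens carry little weight.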

For a string s and a threshold θ, the tokens are sorted in descending order and the subset of tokens forming the prefix signature set sig(s) is chosen such that

$$\tau(s) = wt(sig(s)) - (1 - \theta) \times wt(s) \geq 0 \qquad (2)$$
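The selection of the prefix signature can be sketched in Python. The sketch reads the condition of Eq. 2 as wt(sig(s)) − (1 − θ)·wt(s) ≥ 0 and assumes that tokens are sorted by descending weight and that the shortest qualifying prefix is taken; the function name and the sample weights are hypothetical, and the details of the k-signature scheme in [13] may differ.

```python
def prefix_signature(tokens: list[str], wt: dict[str, float], theta: float) -> list[str]:
    """Shortest prefix, in descending weight order, that satisfies Eq. 2."""
    ordered = sorted(tokens, key=lambda t: wt[t], reverse=True)
    target = (1 - theta) * sum(wt[t] for t in ordered)
    signature: list[str] = []
    accumulated = 0.0
    for token in ordered:
        if accumulated - target >= 0:   # condition of Eq. 2 already met
            break
        signature.append(token)
        accumulated += wt[token]
    return signature


# Hypothetical token weights for one reference string.
token_weights = {"sona": 3.0, "technology": 1.5, "college": 1.0, "of": 0.5}
strict = prefix_signature(list(token_weights), token_weights, theta=0.8)
loose = prefix_signature(list(token_weights), token_weights, theta=0.4)
```

A higher threshold leaves less weight to be covered, so the signature is shorter and fewer inverted lists need to be scanned for that string.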


Initially, a Signature-based Inverted List (SIL) is built by generating a signature for each reference string based on the k-signature scheme. The authors proposed two algorithms:

• EvScan: scans the inverted lists so as to avoid overlapping between strings, but spends time on unnecessary list scanning.

• EvIter: an optimized version of EvScan that reduces unnecessary list scanning.

The modified SIL answers queries with a dynamic similarity threshold. The experimental results show that these algorithms save cost in terms of scanning the inverted lists. The major drawback of AME is that it does not analyze the redundancy caused by overlapping strings. This problem is studied by Li et al. [14].

2.9 Approximate Membership Localization (AML)

Li et al. [14] proposed an efficient technique, Approximate Membership Localization (AML), solved using the Potential Redundancy Prune (P-Prune) algorithm. It is also a dictionary-based problem. AML overcomes the drawback of Approximate Membership Extraction (AME), whose redundancy lowers the efficiency and performance of real-world applications. The efficiency of AML is shown to be high because P-Prune prunes the overlapping strings before generating them. Experiments were carried out on both AME and AML within a web-based framework, and the results show that the precision and recall of this method are much higher than those of the other similarity metrics.

The dictionary considered there is static, and the approach has not been validated for real-time applications such as blogs, which require a dynamic dictionary.
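The difference between the outputs of AME and AML can be illustrated with a small Python sketch. It is not the P-Prune algorithm, which avoids generating overlapping candidates in the first place; it is a simple greedy filter, applied after the fact, that keeps the best-scoring non-overlapping matches. The Match type, the token positions, and the scores are hypothetical.

```python
from typing import NamedTuple


class Match(NamedTuple):
    start: int     # index of the first token of the matched substring
    end: int       # index one past the last token
    entity: str    # dictionary entity that the substring approximates
    score: float   # similarity between substring and entity


def localize(candidates: list[Match]) -> list[Match]:
    """Keep best-scoring matches that do not overlap (greedy heuristic)."""
    chosen: list[Match] = []
    for m in sorted(candidates, key=lambda c: (-c.score, c.start)):
        if all(m.end <= c.start or m.start >= c.end for c in chosen):
            chosen.append(m)
    return sorted(chosen, key=lambda c: c.start)


# AME-style output for the text "sona college of technology in salem".
ame_output = [
    Match(0, 3, "sona college of technology", 0.75),
    Match(0, 4, "sona college of technology", 1.0),
    Match(1, 4, "sona college of technology", 0.6),
    Match(5, 6, "salem", 1.0),
]
aml_like_output = localize(ame_output)
```

Three of the four candidates refer to the same mention and overlap one another, which is the redundancy attributed to AME above; only one match per mention survives the filter.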

2.10 Ranking of Opinions in Blogs

G. Mishne [15] analyzed the ranking of opinions in blogs. Three components are taken into account:

• Fact-oriented information retrieval.

• Dictionary-based opinion expression detection.

• Spam filtering.

2.11 Opinion Retrieval

W. Zhang et al. [16] also studied opinion retrieval, using three modules:

• Information retrieval

• Opinion classification

• Similarity ranking

First, the information is retrieved from blogs. Second, the retrieved documents are classified into opinionative and non-opinionative documents. Finally, the system ensures that the opinions are related to the query and ranks the documents accordingly.

The score of this system is about 13% higher than that of the system developed by Mishne.

2.12 Statistical approach to retrieve opinionated blog posts

B. He et al. [17] presented a statistical approach to retrieving opinionated blog posts. The system automatically generates a dictionary from the blogosphere without manual effort. The dictionary is derived by removing rare terms, since they cannot be generalized. A weight is then assigned to each term in the dictionary, and an opinion score is assigned to each document in the collection, using the top-weighted terms from the dictionary as a query. Finally, the opinion score is combined with the initial relevance score produced by the retrieval baseline.

This laid the foundation for our idea of combining viewers' opinions with the P-Prune algorithm.

2.13 P-Prune in Blogs

Based on these ideas, we have presented a paper [3] on using the P-Prune algorithm for blogs. The major difference between blog search and ordinary web search is that in blogs the documents can be scored based on the opinions retrieved from viewers.

By collecting the RSS feeds of blogs [18], we can retrieve the opinions of viewers. Once viewers subscribe to a website, they need not check it manually. Instead, the browser persistently monitors the site and informs the user of any updates, and the new data can be downloaded automatically by instructing the browser.
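Collecting text from an RSS feed can be sketched with the Python standard library. The sample feed, its URLs, and the choice of the title, link, and description fields are assumptions made for illustration; a real blog feed would be downloaded from the blog and may carry additional elements.

```python
import xml.etree.ElementTree as ET

# Hypothetical RSS 2.0 feed, inlined so that the sketch is self-contained.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example Blog</title>
    <item>
      <title>First post</title>
      <link>http://example.com/first</link>
      <description>Great phone, excellent battery.</description>
    </item>
    <item>
      <title>Second post</title>
      <link>http://example.com/second</link>
      <description>Poor camera, would not recommend.</description>
    </item>
  </channel>
</rss>"""


def parse_items(feed_xml: str) -> list[dict[str, str]]:
    """Extract the title, link, and description of every feed item."""
    root = ET.fromstring(feed_xml)
    fields = ("title", "link", "description")
    return [
        {tag: item.findtext(tag, default="") for tag in fields}
        for item in root.iter("item")
    ]


posts = parse_items(SAMPLE_FEED)
```

The description text of each item is what a dictionary-based matcher would scan, and re-parsing the feed whenever it changes is one way in which the dictionary and the scored documents could be kept up to date.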

3. CONCLUSION AND DIRECTIONS TO FUTURE WORK


faster, than simply adapting former AME methods. The precision and recall of the blog-based join with AML results largely outperform those with AME. Future work is to apply the P-Prune algorithm for AML to other scenarios, such as vlogs. Vlog entries often combine embedded video with supporting text, images, and other metadata, so similarity measures other than textual similarity will need to be considered. AML-based solutions are more apt than AME-based solutions for real-world applications, since the matches found by AML are much closer to the true matched pairs.

REFERENCES

[1] http://blogsearch.google.com

[2] B. Liu, M. Hu and J. Cheng, “Opinion Observer: Analyzing and Comparing Opinions,” Proceedings of the 14th WWW Conference, 2005.

[3] A. C. Kaladevi and S. M. Nivetha, “Efficient Approximate Membership Localization using P-Prune Algorithm in Blogs,” in International Conference on Computer Communication and Informatics, pp. 14, 2014.

[4] B. Bloom, “Space/Time Trade-Offs in Hash Coding with Allowable Errors,” Comm. ACM, vol. 13, no. 7, pp. 422-426, 1970.

[5] U. Manber and S. Wu, “An Algorithm for Approximate Membership Checking with Application to Password Security,” Information Processing Letters, vol. 50, no. 4, pp. 191-197, 1994.

[6] G. Navarro and M. Raffinot, Flexible Pattern Matching in Strings: Practical On-line Search Algorithms for Texts and Biological Sequences. Cambridge Univ. Press, 2002.

[7] A. Aho and M. Corasick, “Efficient String Matching: an Aid to Bibliographic Search,” Comm. ACM, vol. 18, no. 6, pp. 333-340, 1975.

[8] K. Jarvelin and J. Kekalainen, “Cumulated Gain-Based Evaluation of IR Techniques,” ACM Transactions on Information Systems, vol. 20, no. 4, pp. 422-446, 2002.

[9] B. Bocek, E. Hunt, and B. Stiller, “Fast Similarity Search in Large Dictionaries,” Technical Report ifi-2007.02, Dept. of Informatics University of Zurich, 2007.

[10] G. Brodal and L. Gasieniec, “Approximate Dictionary Queries,” Proceedings of the 7th Symp. Combinatorial Pattern Matching, vol. 1075, pp. 65-74, 1996.

[11] K. Chakrabarti, S. Chaudhuri, V. Ganti, and D. Xin, “An Efficient Filter for Approximate Membership Checking,” Proceedings of ACM SIGMOD International Conf. Management of Data, pp. 805-818, 2008.

[12] S. Chaudhuri, V. Ganti, and D. Xin, “Exploiting Web Search to Generate Synonyms for Entities,” Proceedings of the 18th International Conf. World Wide Web (WWW), pp. 151-160, 2009.

[13] J. Lu, J. Han, and X. Meng, “Efficient Algorithms for Approximate Member Extraction Using Signature-Based Inverted Lists,” Proc. 18th CIKM ACM Conf. Information and Knowledge Management, pp. 315-324, 2009.

[14] Z. Li, L. Sitbon, L. Wang, X. Du and X. Zhou, “AML: Efficient Approximate Membership Localization within a Web-Based Join Framework,” IEEE Transactions on Knowledge and Data Engineering, vol. 25, no. 2, Feb. 2013.

[15] G. Mishne, “Multiple Ranking Strategies for Opinion Retrieval in Blogs,” Proceedings of TREC Blog Track, 2006.


[17] B. He, C. Macdonald, J. He, I. Ounis, “An Effective Statistical Approach to Blog Post Opinion Retrieval,” Proceedings of the CIKM ACM International Conf. Information and Knowledge Management, pp. 1063-1072, 2008.

[18] J. Elsas, J. Arguello, J. Callan, J. Carbonell, “Retrieval and Feedback Models for Blog Feed Search,” SIGIR ACM Conf. Special Interest Group on Information Retrieval, pp. 347–354, 2008.

This paper may be cited as: