
AN EFFICIENT CHUNK BASED TECHNIQUE FOR DUPLICATE DETECTION AND RECORD LINKAGE

Ashwini Patil¹, Priyanka Rathod², Dhanashree Sawale³, Divyani Shah⁴, Balasaheb Tarle⁵

¹,²,³,⁴ Dept. of Computer Science, NDMVPS KBTCOE, India

⁵ Dept. of Computer Science (H.O.D.), NDMVPS KBTCOE, India

ABSTRACT

A similarity join is an important operation that arises in many applications, such as data integration, record linkage, and pattern recognition. Existing approaches extract grams from strings and consider as candidates only those strings that share a certain number of grams. We instead propose extracting non-overlapping substrings, or chunks, from strings, using a chunking scheme based on a tail-restricted chunk boundary dictionary (CBD). Our approach integrates existing similarity filters with several new filters unique to the chunk-based method, and a greedy algorithm automatically selects a good chunking scheme for a given dataset. Our results show that this method occupies less space and computes similarity values faster than existing methods.

Keywords: Approximate String Matching, Chunk Boundary Dictionary, Edit Distance, Edit Similarity Joins, Prefix Filtering, Q-Gram.

I. INTRODUCTION

One of the major problems accompanying the fast growth of raw data on the Internet, and the growing need to link data from different sources, is the existence of redundant data. There are many causes of near-duplicate data, such as typing errors, versioned and plagiarized documents, and multiple representations of the same physical objects [1][2]. In this paper we design an efficient algorithm that reduces the space requirements of edit similarity join, using a rank method to reduce the overlap of filtered data drawn from the dataset during the data integration process. A similarity join finds pairs of objects from two datasets such that each pair's similarity value is no less than a given threshold (a limit on the pairs of objects). Early research in this area focused on similarities defined in the Euclidean space [2][4]; current research has spread to various similarity (or distance) functions, including similarity joins with an edit distance constraint of no more than a constant. The major technical difficulty lies in the fact that the edit distance metric is complex and costly to compute. In order to scale to large datasets, all existing algorithms adopt the filter-and-verify paradigm: they first eliminate non-promising string pairs with relatively inexpensive computation and then perform verification by computing the edit distance. Many existing algorithms are also based on the principle of prefix filtering, which avoids computing similarity values for all possible records; these filters are linked into the implemented methods, reducing candidate sizes and hence improving performance.

Gram-based methods extract, for each string, all substrings of a certain length (q-grams) or substrings chosen according to a dictionary (VGRAMs). Consecutive grams therefore overlap and are highly redundant. This results in a large index and incurs high query processing costs when the index cannot be entirely accommodated in main memory; the total size of the q-grams can be six times that of the text strings themselves. Our observation is that the main redundancy of q-gram-based approaches lies in the overlapping grams. We instead divide each string into several disjoint substrings, or chunks, and only need to process and index these chunks, improving space usage.

Our chunk-based approach uses a robust chunking scheme, based on a tail-restricted chunk boundary dictionary (CBD), that has a salient property for each string: given an edit distance threshold of at most a constant, and applying prefix filtering and location-based mismatch filtering, the number of signatures generated by our algorithm is guaranteed to be fewer than or equal to that of the q-gram-based method for any q >= 2. Another feature of the chunk-based method is that it combines existing filters with several new filters unique to chunks, including rank, chunk number, and virtual CBD filtering. We also consider the problem of selecting a good chunking scheme for a given dataset.

II. LITERATURE SURVEY

Previous work on similarity joins with an edit distance constraint extracted overlapping grams from strings. A q-gram is a contiguous substring of length q of a given string; the positions of matching grams are then found. If two grams have the same content and their positions are within the edit distance, the method calculates similarity through length, count, position, and prefix filtering of the data from the dataset. The q-gram method produces very large index values, fixed-length chunk-based methods also produce huge indexes, and with the variable-length gram (VGRAM) dictionary-based method the chunks are not easy to predict, which complicates implementation.

Firstly, the issue is efficient algorithms for similarity joins with an edit distance constraint. The prevailing approach is based on fetching overlapping grams from strings and considering as candidates only strings that share a specific number of grams. Secondly, the next problem we face is efficient algorithms for similarity search with an edit distance constraint. We define a general signature-based framework and show that the minimum signature size is τ + 1. The efficiency of our method is demonstrated against existing methods through extensive experiments on real datasets.

Ed-Join: fetches q-grams and arranges them in decreasing order by id [10].

Winnowing: the q-gram with the minimum hashed value is fetched within each sliding window.

VChunkJoin: collects vchunks, selects a CBD, and arranges them by decreasing id.

VGram-Join: an algorithm based on variable-length grams for edit similarity selection queries [6].

Gram-based: fixed-length q-grams are used for edit similarity queries, because count filtering is very effective in pruning candidates [7].

Tree-based: this method has not been implemented in practice, but it builds a tree for the dataset and also supports edit similarity joins [8].

Enumeration-based: enumerates all strings obtained by edit operations.

Drawbacks of the existing system:

• The combination level increases.

• Processing time becomes very slow.

• Looping time increases.

• Large amounts of data cannot be handled.

• Users have to wait for a long time.


III. PROPOSED SYSTEM

Our proposed system performs the edit similarity join by extracting non-overlapping substrings, or chunks, from strings. A class of chunk-based methods built on the notion of a tail-restricted chunk boundary dictionary uses several existing filters and new, powerful filtering techniques to reduce the index size. Our proposed algorithm applies length, position, prefix, content-based, and location-based filtering, which helps reduce the amount of space required by computing the similarity only of data that satisfy the filters. We call the resulting algorithm VChunkJoin-NoLC. We introduce index, chunk number, and virtual CBD filtering to improve efficiency. The similarity value is calculated only for pairs whose edit distance is less than the threshold value.

IV. INITIAL OPERATION

To perform a similarity join on data strings, the datasets must first be gathered and parsed. The parsed data is then added to the database.

4.1 Dataset Gathering

A dataset (or data set) is a collection of data. Most commonly a dataset corresponds to the contents of a single database table, where every column of the table represents a particular variable and each row corresponds to a given member of the dataset in question.

4.1.1. Database A, B

Select two datasets from the database, or select two tables from the same database.


4.1.2. Select Table and CBD Selection

After selecting the tables from the datasets, perform CBD selection, i.e., build the chunk boundary dictionary from table 1 and table 2. For CBD selection, the column names of the two tables should be the same.

4.1.3. Index Assignment

After CBD selection is done, perform index assignment for the tables. Index assignment means a unique id is assigned to each row. This id is provided by the database itself, e.g., R_ID.

4.1.4. Apply Search Technique

There are three search techniques:

a) Fully matched

In this technique, columns of the tables are selected. After the columns are selected, the search is performed and the fully matched records are stored in a temporary variable.

b) Non-matches

The non-match technique is applied to the remaining records. After applying this technique, we obtain the non-matches, which are also stored in a temporary variable.

c) Possible matches

In possible matches, the following techniques are used:

Startswith: the start of one string is compared with the start position of the other word.

Endswith: the end of one string is compared with the end position of the other word.

Contains: the content of one chunk is compared with the content of the other chunk.

Example: comparing SHAUNASHIKIT with SHAUNASIKIT, the start, the contained middle, and the end of the two strings are checked (Fig. 4.2: example of possible matches).

q-grams: in this method the string is divided into a number of grams; a q-gram is a substring of the string with length q.

Example:

The possible q-grams (q = 2) of the string SHAUNASHIKIT are:

SH HA AU UN NA AS SH HI IK KI IT
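The sketch below illustrates the search techniques of this section in one place: exact (fully matched) lookup, the three possible-match checks, and q-gram extraction. It is a minimal illustration; none of the function names come from the paper.

```python
def q_grams(s: str, q: int = 2) -> list[str]:
    """All overlapping substrings of length q (the q-grams of s)."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]

def possible_match(a: str, b: str) -> bool:
    """Startswith / endswith / contains checks between two values."""
    return (a.startswith(b) or b.startswith(a) or
            a.endswith(b) or b.endswith(a) or
            a in b or b in a)

def classify(col_a: list[str], col_b: list[str]):
    """Split values of one column into fully matched, possible matches
    (startswith / endswith / contains) and non-matches against the other."""
    exact = set(col_b)
    full = [a for a in col_a if a in exact]
    rest = [a for a in col_a if a not in exact]
    possible = [a for a in rest if any(possible_match(a, b) for b in col_b)]
    non = [a for a in rest if a not in possible]
    return full, possible, non

print(q_grams("SHAUNASHIKIT"))
# ['SH', 'HA', 'AU', 'UN', 'NA', 'AS', 'SH', 'HI', 'IK', 'KI', 'IT']
```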

4.2 Data Cleaning

4.2.1 Fully Matches

Delete(): if two chunks are the same, the user is provided with a delete option to delete the chunk.

4.2.2 Possible matches

Link(): if two chunks are partially the same, the user is provided with a link option to link the chunks.

4.2.3 Non Matches

Update(): if no two chunks are the same, the user is provided with an update option to update the chunk.


4.3 Data Integration

After the data cleaning process is done, data integration takes place, integrating the two tables: it combines the two tables into one updated table, which is stored in the system.
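As a small illustration of this step (pandas and the column names here are assumptions for the sketch, not from the paper):

```python
import pandas as pd

# Two cleaned tables with the same schema, as required by the CBD step.
table1 = pd.DataFrame({"R_ID": [1, 2], "name": ["SHAUNASHIKIT", "ADOLESCENCE"]})
table2 = pd.DataFrame({"R_ID": [3, 4], "name": ["SHAUNASIKIT", "ADOLESCENCE"]})

# Combine the two tables into one updated table, dropping rows whose
# record fields are exact duplicates, then store the result.
updated = (pd.concat([table1, table2], ignore_index=True)
             .drop_duplicates(subset=["name"], keep="first"))
updated.to_csv("updated_table.csv", index=False)  # stored in the system
```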

V. DATA FILTERING

The next step is to filter out the data that are likely to be similar. Two records can be judged similar by different methods, the most used being prefix filtering. On analysis, however, the prefix filtering method produces overlapping data as its filtered result, and this overlapped data occupies a large amount of memory. To overcome this problem, we use an index-based filtering method to reduce the overlap of data.

5.1 Prefix Filtering

The prefix filtering method filters documents whose fields contain terms with a specified prefix. It is similar to a phrase query, except that it acts as a filter, and it can be placed within queries that accept a filter. This approach is applied to the two datasets taken earlier: the first dataset is treated as the prefix side and the second as the suffix side. For filtering, the length of the prefix string is compared with that of the suffix string; if the prefix length matches the suffix length, the pair is filtered out and placed separately. Consider two datasets P and Q on which prefix filtering is to be applied. The filtered elements are placed in the set F such that

F = {x | length(P(m)) = length(Q(n))}

where x is the filtered element, length(P(m)) is the length of string m of set P, and length(Q(n)) is the length of string n of set Q.

The prefix filtering method is fast, but it yields a large number of overlapping data, which occupy too much space in memory.
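A minimal sketch of this length-based rule, assuming the set definition above; the function and variable names are illustrative:

```python
def prefix_filter(P: list[str], Q: list[str]) -> list[tuple[str, str]]:
    """Pair strings from the two datasets whose lengths are equal,
    i.e. F = {x | length(P(m)) = length(Q(n))}; surviving pairs move
    on to verification, but many of them overlap (the drawback noted)."""
    return [(m, n) for m in P for n in Q if len(m) == len(n)]

P = ["SHAUNASHIKIT", "ADOLESCENCE"]
Q = ["SHAUNASIKIT", "ADOLESCENCE"]
print(prefix_filter(P, Q))
# [('ADOLESCENCE', 'SHAUNASIKIT'), ('ADOLESCENCE', 'ADOLESCENCE')]
```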

5.2 Index Based Filtering

The existing prefix filtering method occupied a large amount of memory, so a new approach was needed to reduce the space occupied by the filtered data. We therefore introduce an index-based method to reduce the overlapping data. In the index-based approach we consider only the overlapped data; the other filtered elements are kept as they are. We assign an index to the overlapped data in both datasets. Pairs of elements whose rank difference is at least the chosen threshold are taken for further processing, and the rest are ignored. Rank calculation involves certain combination operations to obtain the rank of the string. Formally,

F = {x | |rank(P(m)) − rank(Q(n))| ≥ τ}

where τ is the threshold value, rank(P(m)) is the rank of string m of dataset P, and rank(Q(n)) is the rank of string n of dataset Q. By this approach, most of the overlapping data are excluded from the next stage.
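A sketch of the rank rule above; since the combination operations behind rank are not spelled out here, the rank function below is a hypothetical stand-in:

```python
def rank(s: str) -> int:
    """Hypothetical stand-in rank; the actual rank involves combination
    operations over the string that are not specified in this section."""
    return sum(map(ord, s))

def index_filter(pairs: list[tuple[str, str]], tau: int) -> list[tuple[str, str]]:
    """Keep overlapped pairs with |rank(P(m)) - rank(Q(n))| >= tau;
    the remaining pairs are ignored, pruning overlapping data."""
    return [(m, n) for m, n in pairs if abs(rank(m) - rank(n)) >= tau]
```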

VI. SIMILARITY CALCULATION

The final step is to calculate the similarity between the filtered strings. The similarity calculation involves the following steps.

1. Edit distance calculation.

2. Chunk the strings.

3. Similarity calculation.


6.1 Edit Distance Calculation

In this module we calculate the edit distance between the matched text strings from the two datasets. The calculation is performed with dynamic programming, which yields the minimum number of operations (insert, delete, substitute, transform) required to turn one string into the other. If two strings have the same length and the same character at every position, the edit distance is zero; otherwise, the edit distance is the number of operations needed for the transformation. The various operations are listed below, followed by a sketch of the computation.

• Substitution

• Deletion

• Insertion

• Transformation
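The following is a minimal dynamic-programming sketch of this calculation, assuming the classic Levenshtein recurrence; it covers insertion, deletion, and substitution (the transformation operation above is omitted for brevity), and the function name is illustrative.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of insert/delete/substitute operations."""
    m, n = len(a), len(b)
    # dp[i][j]: operations needed to transform a[:i] into b[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i                  # delete all i characters of a
    for j in range(n + 1):
        dp[0][j] = j                  # insert all j characters of b
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # match / substitution
    return dp[m][n]

print(edit_distance("SHAUNASHIKIT", "SHAUNASIKIT"))  # 1 (one deletion)
```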

6.2 String Chunking

Chunks are the basic and necessary unit for computing the similarity value between two selected similar data items; they are the disjoint substrings of a given string. For our proposed dataset, TREC (a book-oriented dataset), CBD selection is done with the character set {a, e, g, l, s, v}. We split each string into chunks according to this set and take the number of chunks in the string as an input condition of the similarity calculation; it should be less than the constant threshold value.

Example: consider the string 'Adolescence'. According to the chunking rule above, the string is chunked as follows:

A | dol | e | s | cence

Thus the string Adolescence has been chunked into 5 parts.
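As a rough illustration, the sketch below splits a string immediately after every character from the CBD set {a, e, g, l, s, v}. This is a deliberate simplification of the tail-restricted CBD scheme, whose dictionary entries need not be single characters, so its output can split the tail more finely than the hand-chunked example above.

```python
CBD = set("aeglsv")  # illustrative single-character boundary set

def chunk(s: str) -> list[str]:
    """Split s into disjoint chunks, ending a chunk at each CBD character."""
    chunks, start = [], 0
    for i, ch in enumerate(s.lower()):
        if ch in CBD:                     # chunk tail must be a CBD character
            chunks.append(s[start:i + 1])
            start = i + 1
    if start < len(s):                    # trailing text with no CBD tail
        chunks.append(s[start:])
    return chunks

print(chunk("Adolescence"))  # ['A', 'dol', 'e', 's', 'ce', 'nce']
```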

6.3 Similarity Calculation

The similarity calculation is the final result of the algorithm. We compute the similarity value between the two datasets; it depends on the edit distance value. The calculation is performed only for data whose edit distance is less than the threshold and whose rank difference between the two sets is also less than the threshold. The similarity value is calculated by subtracting the edit distance from the length of the longer string and dividing by that same length.
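Combining this rule with the edit distance sketch from Section 6.1 gives, for example:

```python
def similarity(a: str, b: str) -> float:
    """(longer length - edit distance) / longer length, as described above;
    reuses the illustrative edit_distance function from Section 6.1."""
    longer = max(len(a), len(b))
    return (longer - edit_distance(a, b)) / longer

# longer length 12, edit distance 1  ->  (12 - 1) / 12 ≈ 0.917
print(similarity("SHAUNASHIKIT", "SHAUNASIKIT"))
```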

VII. CONCLUSION

We have introduced a novel approach to processing edit similarity joins efficiently, based on partitioning strings into non-overlapping chunks. We designed an efficient edit similarity join algorithm that incorporates all existing filters, occupies less space than existing filtering methods, and is more powerful than existing algorithms at calculating the similarity value between two sets of strings. The existing q-gram and VGRAM approaches occupy larger indexes and generate unnecessary candidate strings; our proposed VChunkJoin algorithm overcomes this when performing the similarity calculation. The CBD selection algorithm helps reduce the index size of the strings, and dynamic programming is used to compute the edit distance between the two datasets.


VIII. ACKNOWLEDGEMENT

We would like to express our special thanks to our teacher, Prof. B. S. Tarle, as well as our team members, who gave us the golden opportunity to work on this wonderful project on the topic of an efficient chunk-based technique for similarity joins, which also helped us do a great deal of research and learn many new things; we are really thankful to them. We also thank Savitribai Phule Pune University for consent to include copyrighted pictures as a part of our paper.

REFERENCES

[1] Arasu, A., Ganti, V., and Kaushik, R., "Efficient exact set-similarity joins," in VLDB, 2006.

[2] Behm, A., Ji, S., Li, C., and Lu, J., "Space-constrained gram-based indexing for efficient approximate string search," in ICDE, 2009.

[3] Elmagarmid, A. K., Ipeirotis, P. G., and Verykios, V. S., "Duplicate record detection: A survey," TKDE, vol. 19, no. 1, pp. 1–16, 2007.

[4] Li, C., Wang, B., and Yang, X., "VGRAM: Improving performance of approximate queries on string collections using variable-length grams," in VLDB, 2007.

[5] Manber, U., "Finding similar files in a large file system," in USENIX Winter, 1994, pp. 1–10.

[6] Ramasamy, K., Patel, J. M., Naughton, J. F., and Kaushik, R., "Set containment joins: The good, the bad and the ugly," in VLDB, 2000, pp. 351–362.

[7] Sarawagi, S. and Kirpal, A., "Efficient set joins on similarity predicates," in SIGMOD, 2004.

[8] Wang, J., Feng, J., and Li, G., "Trie-join: Efficient trie-based string similarity joins with edit-distance constraints," in VLDB, 2010.

[9] Xiao, C., Wang, W., Lin, X., and Yu, J. X., "Efficient similarity joins for near duplicate detection," in WWW, 2008.

[10] Xiao, C., Wang, W., and Lin, X., "Ed-Join: An efficient algorithm for similarity joins with edit distance constraints," PVLDB, vol. 1, no. 1, pp. 933–944, 2008.

[11] Yang, X., Wang, B., and Li, C., "Cost-based variable-length-gram selection for string collections to support approximate queries efficiently," in SIGMOD Conference, 2008, pp. 353–364.
