Rapid Detection of Duplicates by Comparing the Records

Full text


Technology (IJRASET)

Rapid Detection of Duplicates by Comparing

the Records

Shruthi. B1, Sushmitha. B2, Swaroopa. S. K3, Vandana. M4 1

Assistant Professor, 2,3,4Student, Department of Information Science, Atria Institute of Technology, Bengaluru, India

Abstract: Duplicate detection is the procedure to identify various identical real world entities. Duplicate detection methods needs larger datasets to be processed in shorter time: to maintain the excellence of a dataset become difficult. Progressive duplicate detection algorithms that increase the effectiveness of finding duplicates if the execution time is restricted: They maximize the increase of the overall process within the time available by reporting most results much earlier than traditional approaches. Complete experiments show that our progressive algorithms can twice the effectiveness over time of traditional duplicate detection and considerably improve upon related work.

Keywords: Duplicate detection, record linkage, effectiveness, pay-as-you-go and data cleaning


Data are most important for a company. But appropriate to data variations and partial data entry, fault such as redundant entries might be found, making data cleaning and in particular duplicate detection is necessary.

Progressive duplicate detection identifies most duplicates soon in detection process. Progressive approaches try to reduce the average time after which a duplicate is found rather than reducing overall time. Completes results are obtained on a progressive algorithm than on any traditional approach. Here we target on progressive algorithms, which report most matches before time, while possibly slightly raising their overall runtime. To achieve this, they need to calculate approximately the similarity of all comparison candidates in order to compare most promising record pairs first. With the pair selection techniques of the duplicate detection process, there exists a transaction between the time which is required to run a duplicate detection algorithm and the fullness of the results. Progressive techniques make helpful as they deliver more complete results in shorter time. Furthermore, they make it easier for the user to define this trade-off, because the detection time or result size can directly be specified instead of parameters whose influence on detection time and result size is hard to guess.

Duplicate detection is the procedure to find various representations of equivalent real world entities. Today, duplicate detection methods need to process ever larger datasets are processed by duplicate detection method in shorter time, it becomes difficult to maintain the quality of dataset. Algorithm of Progressive duplicate detection that considerably raises the effectiveness of identifying duplicates, if there is limitation in execution time. Within the time available they increase the gain of the on the whole process by reporting the majority results much earlier than usual approaches. Our progressive algorithms can double the efficiency over time of traditional duplicate detection and considerably get better upon associated work which is shown by all the experiments.


Entity resolution is the difficulty of recognizing which record in a database refers to the same entity .E.g.: two companies that merge may want to combine their customer’s records. In such case, the same customers can be characterized by multiple records, so these matching records must be identified and combined. An entity resolution is very costly due to very huge dataset and intensive record comparison. [1]

Duplicate records do not have a common key and or they contain flaws that make matching of duplicates as complicated task. Flaws are indicated as imperfect information, doesn’t have of standard formats, or any combinations of these factors. Similar field entities can be detected by similarity matrices which is an wide set of duplicate detection algorithm that can detect duplicate records in a database. [2]


Technology (IJRASET)

chief challenges for record linkage and deduplication.

In recent years, various indexing techniques have been established for record linkage and deduplication. They are aimed at reducing the number of record pairs to be compared in the matching process by removing obvious no matching pairs, while at the same instance maintaining high matching quality. Their difficulty is analysed, and their performance and scalability is assessed. [3] Complexities are meeting in dividing reliable system for searching and bring up to date large file of documents that must be recognized mainly on the basis of names and other particulars. Record linkage is the duty of locating records in the dataset that refers to the identical entity across diverse data sources. [4]

Data cleansing is used for cleaning list of names of likely customers in a marketing type application. Produces many data at low cost. The difficulty of merging various database of information about familiar entities is commonly meet in KDD (Knowledge Discovery in Database).The problem we study often is called merge/purge problem. We use sorted neighbourhood method to for solving merge/purge problem (duplicate elimination method).

Huge repositories of data usually have various duplicate entries about the similar entities that are hard to gather together without an “equational theory” that recognize comparable items by a complex, domain-dependent matching procedure. They have developed a system for this Data Cleansing and exhibit its use for cleansing list of names of probable customers in a direct marketing-type application. Results of data which are generated are shown to be accurate and efficient when processing the data many times using different keys for sorting on each consecutive pass.

This paper details improvements in system, and reports on the successful implementation for a real-world database that conclusively validates results previously achieved for statistically generated data. [5]

Adaptive techniques improve the effectiveness of duplicate detection, but in contrast to progressive technique they have to run for certain period of time and cannot maximize the efficiency for any given timeslot. Two adaptive methods of SNM, sliding window size is dynamically adjusted by these, to adaptively fit the duplicate distribution a key parameter is analysed in SNM during blocking phase. Complete adaptivity is not achieved is the drawback [6].

Duplicate detection is the procedures to find various records in a dataset that signify the very similar real-world entity. Exhaustive comparison is due to the enormous cost, typical algorithms select only promising record pairs are selected by the typical algorithm for comparing. Blocking is one of the methods and windowing is the other method. Using blocking method records are divided into disjoint subsets. In windowing methods, slide a window over the sorted records and records are compared only contained by the window. Sorted Blocks in many alternatives, is a new algorithm presented, in which both approaches are generalized. Extensive experiments are being conducted with different datasets to estimate sorted blocks. To discover the equivalent number of duplicates, the new algorithm needs fewer comparision. [7]

A. Disadvantage of Existing System

A user has only limited, maybe unknown time for data cleansing and wants to make best possible use of it. Then, simply start the algorithm and terminate it when needed. The result size will be maximized.

1) A user has little knowledge about the given data but still needs to configure the cleansing process.

2) A user needs to do the cleaning interactively to, for instance, find good sorting keys by trial and error. Then, run the progressive algorithm repeatedly; each run quickly reports possibly large results.

3) All presented hints produce static orders for the comparisons and miss the opportunity to dynamically adjust the comparison order at runtime based on intermediate results


Progressive algorithms are been focused here, Most of the early matches are been reported, overall runtime are possibly slightly increased. To attain this we compare most capable record pairs first, we need to approximate the parallel of all the comparision candidates.

Two novels are proposed by the algorithm Progressive duplicate detection they are (PSNM) progressive sorted neighbourhood method and (PB) progressive blocking PSNM achieves finest on small and approximately spotless datasets. PB achieves finest on huge and very unclean datasets. PSNM and PB improve the effectiveness of detecting duplicates on very huge datasets.

PSNM and PB intend two progressive duplicate detection algorithm and which representation different strength and do better than current approaches.


Technology (IJRASET)

A. Advantages of Proposed System

1) Enhanced early quality

2) Same ultimate quality

3) Our PSNM and PB algorithms vigorously adjust their behavior by automatically preferring optimal parameters, e.g., window sizes, block sizes, and sorting keys, representing their manual specification superfluous. In this way, we considerably ease the parameterization complexity for duplicate detection and give the growth of more user interactive applications.

Fig: System Architecture

Data set is selected from the data. Target data is the data which is been selected for detection duplicates. This target data is pre-processed. Pre-processing is the procedure of cleaning the unwanted data, missing values is been filled. Pre-processing helps to maintain the data quality. After pre-processing we obtain pre-processed data. Data separation is the procedure of separating the files according to the size of the file. It determines numbers and size of partitions. Separated files will undergo duplicate detection. Duplicate detection is the procedure of removing the duplicate replica of the repeated data. This helps to improve the storage utilization. We can clean the data by deleting, replacing or merging the duplicate reports by duplicate detection. Patterns will be obtained after deduplication. A pattern is the formation of graph of efficiency. We analyses the 1graph and the estimation of all this process is knowledge base.

B. Detailed Design

1) Dataset Collection: To collect and/or retrieve data about activities, results, context and other factors. It is vital to regard as the kind of information it want to congregate from your participants and the ways you will analyse that information. Single database table contents are corresponded by a data set, or a single statistical data matrix, where particular variable is represented by each and every column of the table. The data is accumulated in the Database after collecting the data.


Technology (IJRASET)

effortlessly and efficiently processed for the principle of the user.

3) Data Separation: After recompleting the pre-processing, the data separation to be performed. The blocking algorithms assign each record to a fixed group of similar records (the blocks) and then compare all pairs of records within these groups. Each block within the block comparison matrix represents the comparisons of all records in one block with all records in another block, the equidistant blocking; every block has the equal size.

4) Duplicate Detection: The administrator sets rules to detect the duplicates, when the user attempts to generate new records or update existing records; the system alerts the user about the possible duplicates. We can schedule a duplicate detection job to verify the duplicates for all records that match a certain criteria, to uphold the data quality. The data can be cleaned by removing, disabling, or combining the duplicates reported by a duplicate detection.

5) Quality Measures: The cost-benefit calculation is used to measure the quality of the system. It is difficult to meet a budget limitation especially for conventional duplicate detection process, because their runtime is hard to predict. In the amount of time given, it delivers a various duplicates as possible, progressive processes optimize the cost-benefit ratio. In manufacturing, a measure being free from defects, deficiencies and significant variations. By firm and consistent commitment to certain standards that achieve uniformity of a product in order to satisfy specific customer or user requirements.


This paper introduced two methods, one is progressive sorting neighbour method and the other is progressive blocking. With limited execution time the duplicate detection algorithm efficiency is increased; they vigorously change the status of comparison candidates based on intermediate results to execute promising comparisons first and less promising comparisons later. To determine the performance gain of our algorithms, we proposed a novel quality measure for progressiveness that integrates seamlessly with existing measures. Using this measure, experiments showed that our approaches outperform the traditional SNM by up to 100 percent and related work by up to 30 percent For the construction of a fully progressive duplicate detection workflow, we proposed a progressive sorting method, Magpie, a progressive multi-pass execution model, Attribute concurrency and an incremental transitive closure algorithm. The adaptations AC-PSNM and AC-PB use multiple sort keys concurrently to interleave their progressive iterations. Analysing intermediary results, both approaches vigorously rank the different sort keys at runtime, significantly easing the key selection difficulty. In future work, we want to combine our progressive approaches with scalable approaches for duplicate detection to deliver results ever faster. Two phases parallel SNM was initiated by Kolb et al, which executes traditional SNM on balanced, overlapping partitions. Here, we can instead use our PSNM to progressively find duplicates in parallel.


[1] S. E. Whang, D. Marmaros, and H. Garcia-Molina, “Pay-as-you-go entity resolution,” IEEE Trans. Knowl. Data Eng., vol. 25, no. 5, pp. 1111–1124, May 2012.

[2] A. K. Elmagarmid, P. G. Ipeirotis, and V. S. Verykios, “Duplicate record detection: A survey,” IEEE Trans. Knowl. Data Eng., vol. 19, no. 1, pp. 1–16, Jan.


[3] P.Chriten,”A survey of indexing techniques for scalable record linkage and deduplication,”IEEE Trans.Knows.Data Eng., vol.24, no.9, pp.1537-1555, Sep.20

[4] ]H. B. Newcombe and J. M. Kennedy, “Record linkage: Making maximum use of the discriminating power of identifying information,” Commun. ACM, vol. 5,

no. 11, pp. 563–566, 1962.

[5] M. A. Hernandez and S. J. Stolfo, “Real-world data is dirty: Data cleansing and the merge/purge problem,” Data Mining Knowl. Discovery, vol. 2, no. 1, pp.

9–37, 1998.

[6] S.Yan, D.Lee, M.-Y.Kan, and Giles, “Adaptive sorted neighbourhood methods for efficient record linkage,” in Proc. 7th ACM/IEEE Joint

Int.Conf.Digit.Libraries, 2007, pp.185-194.

[7] U.Draishbach and F.Naumann,”A generalization of blocking and windowing algorithms for duplicate detection,” in Proc.Int.Conf.data Knowl.Eng., 2011,






Related subjects :