
© 2018 IJSRCSEIT | Volume 3 | Issue 5 | ISSN : 2456-3307

Finding the duplicated data in cloud storage by using AdjDup Technique

Pranay Kumar Katta1, Yogendra Prasad P2

1M.Tech, Computer Science and Engineering, Sree Rama Engineering College, Tirupathi, Andhra Pradesh, India
2Assistant Professor, Department of CSE, Sree Rama Engineering College, Tirupathi, Andhra Pradesh, India

ABSTRACT

Cloud computing greatly facilitates data providers who want to outsource their data to the cloud without revealing their sensitive data to external parties, while allowing users with certain credentials to access the data. Data reduction has become increasingly important in storage systems because of the explosive growth of digital data, which has ushered in the big data era. One of the main challenges facing large-scale data reduction is how to maximally detect and eliminate redundancy at very low overheads. In this paper, we present DARE, a low-overhead Deduplication-Aware Resemblance detection and Elimination scheme that effectively exploits existing duplicate-adjacency information for highly efficient resemblance detection in deduplication-based backup/archiving storage systems. The main idea behind DARE is to employ a scheme, called Duplicate-Adjacency based resemblance Detection (DupAdj), that considers any two data chunks to be similar (i.e., candidates for delta compression) if their respective adjacent data chunks are duplicate in a deduplication system, and then to further enhance the resemblance detection efficiency with an improved super-feature approach. Our experimental results based on real-world and synthetic backup datasets show that DARE consumes only about 1/4 and 1/2, respectively, of the computation and indexing overheads required by traditional super-feature approaches, while detecting 2-10% more redundancy and achieving a higher throughput, by exploiting existing duplicate-adjacency information for resemblance detection and finding the "sweet spot" for the super-feature approach.

Keywords: Data Deduplication, Delta Compression, Storage System, Index Structure, Performance Evaluation

I. INTRODUCTION

Cloud computing greatly facilitates data providers who want to outsource their data to the cloud without revealing their sensitive data to external parties, and who would like users with certain credentials to be able to access the data. This requires the data to be stored in encrypted form with access-control policies such that no one except users with attributes (or credentials) of specific forms can decrypt the encrypted data. Meanwhile, the amount of digital data is growing explosively.


Data deduplication has therefore become an essential data reduction approach: it not only reduces storage space by eliminating duplicate data but also minimizes the transmission of redundant data in low-bandwidth network environments. In general, a chunk-level data deduplication scheme splits the data blocks of a data stream (e.g., backup files, databases, and virtual machine images) into multiple data chunks, each of which is uniquely identified and duplicate-detected by a secure SHA-1 or MD5 hash signature (also called a fingerprint). Storage systems then remove duplicate data chunks and store only one copy of each to achieve space savings. While data deduplication has been widely deployed in storage systems for space savings, fingerprint-based deduplication approaches have an inherent drawback: they often fail to detect similar chunks that are largely identical except for a few modified bytes, because the secure hash digests of two chunks are entirely different even if the chunks differ by only one byte. This becomes a serious challenge when applying data deduplication to storage datasets and workloads whose data is frequently modified, which demands an effective and efficient way to eliminate redundancy among frequently modified, and thus similar, data. Delta compression, an efficient approach to removing redundancy among similar data chunks, has gained increasing attention in storage systems.
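To make the chunk-level scheme concrete, the following minimal sketch (ours, not the authors' implementation) deduplicates a byte stream using fixed-size chunks and SHA-1 fingerprints; a real system would use content-defined chunking rather than fixed offsets.

```python
import hashlib

CHUNK_SIZE = 4096  # fixed-size chunking as a stand-in for content-defined chunking

def deduplicate(data: bytes, store: dict) -> list:
    """Split a data stream into chunks, fingerprint each with SHA-1,
    and store only the chunks whose fingerprints are unseen.
    Returns the stream as a list of fingerprints (its 'recipe')."""
    recipe = []
    for offset in range(0, len(data), CHUNK_SIZE):
        chunk = data[offset:offset + CHUNK_SIZE]
        fp = hashlib.sha1(chunk).hexdigest()  # the chunk's fingerprint
        if fp not in store:                   # unseen chunk: keep one copy
            store[fp] = chunk
        recipe.append(fp)                     # a duplicate costs only a reference
    return recipe

store = {}
recipe_v1 = deduplicate(b"A" * 8192 + b"B" * 4096, store)
recipe_v2 = deduplicate(b"A" * 8192 + b"C" * 4096, store)
print(len(store))  # 3 unique chunks stored, not 6
```

Note that flipping a single byte of any stored chunk would change its SHA-1 digest completely, so the near-identical chunk would be stored in full; this is exactly the blind spot that delta compression targets.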

II. RELATED WORK

With the rise of emerging applications such as social networks, the semantic web, sensor networks, and LBS (Location-Based Service) applications, the variety and volume of data to be processed continue to increase rapidly. Effective management and processing of large-scale data thus poses an interesting but critical challenge, and big data has recently attracted a great deal of attention from academia, industry, and government. "Extracting value from chaos" [2] introduces several big data processing techniques from the system and application perspectives. First, from the view of cloud data management and big data processing mechanisms, it presents the key issues of big data processing, including the definition of big data, big data management platforms, big data service models, distributed file systems, data storage, data virtualization platforms, and distributed applications. Following the MapReduce parallel-processing framework, it introduces some MapReduce optimization methods reported in the literature. Finally, it discusses the open issues and challenges, and explores future research directions for big data processing in cloud computing environments.

Data reduction recovers storage space that is increasingly consumed by digital data in the big data era; the main task is to reduce the stored data by detecting and eliminating duplicate data. One line of work uses binary conversion (BDC) to reduce resembling data and to detect and efficiently eliminate duplicate data. In "Key considerations as deduplication evolves into primary storage" [3], a binary conversion method is deployed to reduce the data in the storage space and deduplicate all the data. The converted binary form of the stored data is simpler and faster to deduplicate when similar data resemble each other; the detection rate can exceed that of existing duplicate-resemblance identification approaches, and the binary computation used for redundancy elimination enables greater data reduction.

III. PROPOSED SYSTEM


In this paper, we present DARE, a low-overhead deduplication-aware resemblance detection and elimination scheme that exploits existing duplicate-adjacency information for highly efficient resemblance detection in deduplication-based backup/archiving storage systems. The main idea behind DARE is to employ a scheme called Duplicate-Adjacency based resemblance Detection (DupAdj), which considers any two data chunks to be similar (i.e., candidates for delta compression) if their respective adjacent data chunks are duplicate in a deduplication system, and then to further enhance the resemblance detection efficiency with an improved super-feature approach. Our experimental results on real-world and synthetic backup datasets show that DARE consumes only about 1/4 and 1/2, respectively, of the computation and indexing overheads required by traditional super-feature approaches, while detecting 2-10% more redundancy and achieving a higher throughput.

IV. MODULES

There are three modules:

1. Deduplication Module

2. DupAdj Detection Module

3. Improved Super-Feature Module

Deduplication Module

DARE is designed to improve resemblance detection for additional data reduction in deduplication-based backup/archiving storage systems. The DARE architecture consists of three functional modules, namely, the Deduplication module, the DupAdj Detection module, and the improved Super-Feature module. In addition, DARE relies on several key data structures, namely, the Dedupe Hash Table, the SFeature Hash Table, the Locality Cache, the Container, the Segment, and the Chunk.
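Purely as an illustration, the modules and data structures named above might map onto types like the following; the field names and their exact roles are our assumptions, since the paper does not specify them here.

```python
from dataclasses import dataclass, field

@dataclass
class Chunk:
    fingerprint: str                                     # SHA-1 digest used by the Deduplication module
    super_features: list = field(default_factory=list)   # filled in by the Super-Feature module
    data: bytes = b""

@dataclass
class Segment:
    chunks: list = field(default_factory=list)           # consecutive chunks of one data stream

@dataclass
class Container:
    segments: list = field(default_factory=list)         # on-disk unit that preserves stream locality

dedupe_hash_table = {}    # fingerprint -> location of the stored chunk
sfeature_hash_table = {}  # super-feature -> candidate chunk to delta-compress against
locality_cache = {}       # recently read containers, exploited by the DupAdj module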

DupAdj Detection Module

As a salient feature of DARE, the DupAdj approach detects resemblance by exploiting the existing duplicate-adjacency information of a deduplication system. The main idea behind this approach is to consider chunk pairs immediately adjacent to any confirmed duplicate-chunk pair between two data streams as resembling pairs, and thus as candidates for delta compression.
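A minimal sketch of this idea under our own assumptions (streams are lists of chunk fingerprints, and the positions of confirmed duplicates are given): each confirmed duplicate pair promotes its immediate neighbours to delta-compression candidates.

```python
def dupadj_candidates(new_stream, old_stream, duplicate_pairs):
    """new_stream, old_stream: chunk fingerprints in stream order.
    duplicate_pairs: (i, j) positions with new_stream[i] == old_stream[j].
    Returns adjacent, non-duplicate index pairs to try delta compression on."""
    candidates = set()
    for i, j in duplicate_pairs:
        for step in (-1, 1):                    # look at both neighbours of the duplicate
            ni, nj = i + step, j + step
            if 0 <= ni < len(new_stream) and 0 <= nj < len(old_stream) \
                    and new_stream[ni] != old_stream[nj]:  # skip already-duplicate pairs
                candidates.add((ni, nj))        # resembling-pair candidate
    return sorted(candidates)

# Chunks 0 and 2 are duplicates; the differing middle pair becomes a candidate.
print(dupadj_candidates(["a", "x", "c"], ["a", "b", "c"], [(0, 0), (2, 2)]))  # [(1, 1)]
```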

Improved Super-Feature Module

Traditional super-feature approaches generate features using Rabin fingerprints and group these features into super-features to detect resemblance for data reduction. For example, feature i of a chunk of length N is uniquely generated from a randomly predefined value pair (m_i, a_i) and the chunk's N Rabin fingerprints (as used in Content-Defined Chunking).
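As a hedged sketch of the classic super-feature computation that such approaches build on (the constants and grouping factor below are illustrative choices, not the paper's): each feature is the maximum of a random linear transform (m_i * fp + a_i) mod 2^32 taken over all of a chunk's Rabin fingerprints, and groups of features are hashed into super-features, so two chunks sharing any super-feature are likely to resemble each other.

```python
import hashlib

MOD = 2 ** 32

def chunk_features(rabin_fps, ma_pairs):
    """rabin_fps: the chunk's Rabin fingerprints, one per sliding window.
    ma_pairs: system-wide random (m_i, a_i) constants, one pair per feature."""
    return [max((m * fp + a) % MOD for fp in rabin_fps) for m, a in ma_pairs]

def super_features(features, group_size=4):
    """Hash each group of `group_size` features into one super-feature;
    a match on any super-feature marks two chunks as resembling."""
    return [
        hashlib.sha1(repr(features[k:k + group_size]).encode()).hexdigest()
        for k in range(0, len(features), group_size)
    ]
```

Because a single modified byte shifts only a few Rabin fingerprints, most of the per-feature maxima, and hence most super-features, survive the change, which is why similar chunks still collide in the SFeature Hash Table.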

V. CONCLUSION

In this paper, we present DARE, a deduplication-aware, low-overhead resemblance detection and elimination scheme for data reduction in backup/archiving storage systems. DARE uses a novel approach, DupAdj, which exploits the duplicate-adjacency information of existing deduplication systems for efficient resemblance detection, and employs an improved super-feature approach to further detect resemblance when duplicate-adjacency information is lacking or limited. In addition, the data reduction achieved by DARE helps effectively enlarge the logical space of the restoration cache.

VI. REFERENCES

1. "The data deluge," http://econ.st/fzkuDq. 2. J Gantz and D. Reinsel, "Extracting value from

chaos," IDC review, pp. 1-12, 2011.

3. M A. L. DuBois and E. Sheppard, "Key considerations as deduplication evolves into primary storage," White Paper 223310, Mar 2011.

4. W J. Bolosky, S. Corbin, D. Goebel, and et al, "Single instance storage in windows 2000," in

the 4th USENIX Windows Systems

Symposium. Seattle,WA, USA: USENIX

Association, August 2000, pp. 13-24.

5. S Quinlan and S. Dorward, "Venti: a new approach to archival storage," in USENIX Conference on File and Storage Technologies (FAST-02). Monterey, CA, USA: USENIX Association, January 2002, pp. 89-101.

6. B Zhu, K. Li, and R. H. Patterson, "Avoiding the disk bottleneck in the data domain deduplication file system." in the 6th USENIX Conference on File and Storage Technologies (FAST-08), vol. 8. San Jose, CA, USA: USENIX Association, February 2008, pp. 1-14.

7. D T. Meyer and W. J. Bolosky, "A study of practical deduplication," ACM Transactions on Storage (TOS), vol. 7, no. 4, p. 14, 2012.

8. G Wallace, F. Douglis, H. Qian, and et al, "Characteristics of backup workloads in production systems," in the Tenth USENIX Conference on File and Storage Technologies (FAST-12). San Jose, CA: USENIX Association, February 2012, pp. 33-48.

9. A El-Shimi, R. Kalach, A. Kumar, and et al, "Primary data deduplication-large scale study and system design," in the 2012 conference on USENIX Annual Technical Conference. Boston, MA, USA: USENIX Association, June 2012, pp. 285-296.

10. L. L. You, K. T. Pollack, and D. D. Long, "Deep store: An archival storage system architecture," in the 21st International Conference on Data Engineering (ICDE-05). Tokyo, Japan: IEEE Computer Society Press, April 2005, pp. 804-815.

11. A. Muthitacharoen, B. Chen, and D. Mazieres, "A low-bandwidth network file system," in the ACM Symposium on Operating Systems Principles (SOSP-01). Banff, Canada: ACM Association, October 2001, pp. 1-14.

12. P. Shilane, M. Huang, G. Wallace, and et al, "WAN optimized replication of backup datasets using stream-informed delta compression," in the Tenth USENIX Conference on File and Storage Technologies (FAST-12). San Jose, CA, USA: USENIX Association, February 2012, pp. 49-64.

13. S. Al-Kiswany, D. Subhraveti, P. Sarkar, and M. Ripeanu, "Vmflock: virtual machine co-migration for the cloud," in the 20th international symposium on High Performance Distributed Computing, San Jose, CA, USA, June 2011, pp. 159-170.

14. X. Zhang, Z. Huo, J. Ma, and et al, "Exploiting data deduplication to accelerate live virtual

machine migration," in 2010 IEEE

International Conference on Cluster

Computing (CLUSTER). Heraklion, Crete, Greece: IEEE Computer Society Press, September 2010, pp. 88-96.

15. F. Douglis and A. Iyengar, "Application-specific delta-encoding via resemblance detection," in

USENIX Annual Technical Conference,

General Track. San Antonio, TX, USA: USENIX Association, June 2003, pp. 113-126.

(5)

17. P. Shilane, G. Wallace, M. Huang, and W. Hsu, "Delta compressed and deduplicated storage using stream-informed locality," in the 4th USENIX conference on Hot Topics in Storage and File Systems. Boston, MA, USA: USENIX Association, June 2012, pp. 201-214.

18. Q. Yang and J. Ren, "I-cash: Intelligently coupled array of ssd and hdd," in The 17th IEEE International Symposium on High Performance Computer Architecture (HPCA-11). San Antonio, TX, USA: IEEE Computer Society Press, February 2011, pp. 278-289. 19. G. Wu and X. He, "Delta-ftl: improving ssd

lifetime via exploiting content locality," in Proceedings of the 7th ACM European conference on Computer Systems (EuroSys). Bern, Switzerland: ACM, April 2012, pp. 253-266.

20. D. Gupta, S. Lee, M. Vrable, and et al, "Difference engine: harnessing memory redundancy in virtual machines," in the 5th Symposium on Operating Systems Design and Implementation. San Diego, CA, USA: USENIX Association, December 2008, pp. 309- 322.

21. B. Debnath, S. Sengupta, and J. Li,

"Chunkstash: speeding up inline storage deduplication using flash memory," in the 2010 USENIX conference on USENIX annual technical conference. Boston, MA, USA: USENIX Association, June 2010, pp. 1-14. 22. R. C. Burns and D. D. Long, "Efficient

distributed backup with delta compression," in the fifth workshop on I/O in parallel and distributed systems. San Jose, CA, USA: ACM Association, November 1997, pp. 27-36.

23. J. MacDonald, "File system support for delta compression." Masters thesis. Department of Electrical Engineering and Computer Science, University of California at Berkeley., 2000.

24. C. Dubnicki, L. Gryz, L. Heldt, M.

Kaczmarczyk, W. Kilian, P. Strzelczak, J.

Szczepkowski, C. Ungureanu, and M.

Welnicki, "Hydrastor: A scalable secondary

storage." in USENIX Conference on File and Storage Technologies (FAST-09). San Jose, CA, USA: USENIX Association, February 2009, pp. 197-210.

25. M. Lillibridge, K. Eshghi, D. Bhagwat, and et al, "Sparse indexing: Large scale, inline deduplication using sampling and locality." in the 7th USENIX Conference on File and Storage Technologies (FAST-09). San Jose, CA: USENIX Association, February 2009, pp. 111-123.

26. L. Aronovich, R. Asher, E. Bachmat, and et al, "The design of a similarity based deduplication system," in Proceedings of SYSTOR 2009: The Israeli Experimental Systems Conference. Haifa, Israel: ACM Association, May 2009, pp. 1-12.

27. F. Guo and P. Efstathopoulos, "Building a high-performance deduplication system," in the 2011 USENIX conference on USENIX Annual Technical Conference. Portland, OR, USA: USENIX Association, June 2011, pp. 271-284. 28. D. Bhagwat, K. Eshghi, D. D. Long, and et al,

"Extreme binning: Scalable, parallel deduplication for chunk-based file backup," in IEEE International Symposium on Modeling, Analysis & Simulation of Computer and Telecommunication Systems (MASCOTS-09). London, UK: IEEE Computer Society Press, September 2009, pp. 1-9.

29. W. Xia, H. Jiang, D. Feng, and Y. Hua, "Silo: a

similarity-locality based near-exact

deduplication scheme with low ram overhead and high throughput," in the 2011 USENIX conference on USENIX annual technical conference. Portland, OR, USA: USENIX Association, June 2011, pp. 285-298.

30. K. Eshghi and H. K. Tang, "A framework for

analyzing and improving content-based

(6)

31. E. Kruus, C. Ungureanu, and C. Dubnicki, "Bimodal content defined chunking for backup streams," in the 7th USENIX Conference on File and Storage Technologies. USENIX Association, 2010.

32. B. Roma nski, L. Heldt, W. Kilian, K. Lichota, and C. Dubnicki, "Anchor-driven subchunk

deduplication," in The 4th Annual

International Systems and Storage Conference (SYSTOR-11). Haifa, Israel: ACM Association, May 2011, pp. 1-13.

33. G. Lu, Y. Jin, and D. H. Du, "Frequency based chunking for data de-duplication," in 2010 IEEE International Symposium on Modeling, Analysis & Simulation of Computer and Telecommunication Systems (MASCOTS-10). Miami Beach, FL, USA: IEEE Computer Society Press, August 2010, pp. 287-296.

34. M. Rabin, Fingerprinting by random

polynomials. Center for Research in

Computing Techn., Aiken Computation

Laboratory, Univ., 1981. June 20.

35. N. Jain, M. Dahlin, and R. Tewari, "Taper: Tiered approach for eliminating redundancy in replica synchronization." in the USENIX Conference on File and Storage Technologies (FAST-05). San Francisco, CA, USA: USENIX Association, March 2005, pp. 281- 294.

36. M. Fu, D. Feng, Y. Hua, X. He, Z. Chen, W. Xia, Y. Zhang, and Y. Tan, "Design tradeoffs for data deduplication performance in backup workloads," in Proceedings of the 13th USENIX Conference on File and Storage Technologies (FAST-15). USENIX Association, 2015, pp. 331-344.

37. D. Meister and A. Brinkmann, "Multi-level comparison of data deduplication in a backup scenario," in Proceedings of SYSTOR 2009: The Israeli Experimental Systems Conference. Haifa, Israel: ACM Association, May 2009, pp. 13-24.

38. A. Broder, "Identifying and filtering near-duplicate documents," in Combinatorial

Pattern Matching. Montreal, Canada: Springer, June 2000, pp. 1-10.

39. "On the resemblance and containment of documents," in Compression and Complexity of Sequences (SEQUENCES-97). Washington, DC, USA: IEEE, June 1997, pp. 21-29.

40. U. Manber et al., "Finding similar files in a large file system," in Proceedings of the USENIX Winter 1994 Technical Conference. San Francisco, CA, USA: USENIX Association, January 1994, pp. 1-10.

41. V. Tarasov, A. Mudrankit, W. Buik, P. Shilane, G. Kuenning, and E. Zadok, "Generating realistic datasets for deduplication analysis," in the 2012 USENIX conference on Annual Technical Conference. Boston, MA, USA: USENIX Association, June 2012, pp. 261-272. 42. D. Meister, J. Kaiser, and A. Brinkmann, "Block
