A Survey on Security of Big Data on Cloud
Swati Batra1, Dr. A.K Sharma2
M. Tech Scholar, Dept. of Computer Science and Engineering, BSAITM, Faridabad, India1 Dean and P.G Research, Dept. of Computer Science and Engineering, BSAITM, Faridabad, India2
ABSTRACT: With the tremendous increase in data, 'big data' has come into the picture. In this paper, we discuss big data, the sources of big data and related aspects, cloud computing for big data, security issues for cloud computing, and the MapReduce and Hadoop environment. The main focus is on the security issues in cloud computing that are associated with big data and on how to achieve privacy for big data in the cloud.
KEYWORDS: Big data, HACE theorem, Hadoop, HDFS, MapReduce, cloud computing, cloud security, hybrid cloud, encryption.
I. INTRODUCTION
In the digital world, data pours in from many sources; for example, every picture taken with a mobile phone creates data. The amount of data is so enormous that it becomes challenging to store and manage with traditional data management tools and strategies. Data whose volume is beyond the storage and processing capability of traditional data management tools is what we call big data. Sources responsible for this growth include CCTV and spy cameras, airlines, sensors, social media, transactions, enterprise data, and public data. Big data is not only huge in volume but also heterogeneous in nature, which makes it difficult to analyse.
Because the data is so large, it cannot be stored and retrieved easily. Since a key value proposition of big data is access to data from multiple and diverse domains, security and privacy will play a very important role in big data research and technology. Cloud computing security, which includes computer security, network security, information security, and data privacy, is developing at a rapid pace. Cloud security plays a vital role in protecting data, applications, and the related infrastructure with the help of policies, technologies, controls, and big data tools.
Fig 1: Big data classification
II. RELATED WORK
1. Inukollu et al. introduce the three main terms that characterize big data:
a) Volume: Many factors contribute to increasing volume, such as streaming data and data collected from sensors.
b) Variety: Today data comes in all types of formats, such as emails, video, audio, and transactions.
c) Velocity: How fast the data is being produced and how fast it needs to be processed to meet demand.
The other two dimensions that need to be considered with respect to big data are variability and complexity [1]:
d) Variability: Along with velocity, data flows can be highly inconsistent, with periodic peaks.
e) Complexity: The complexity of the data also needs to be considered when the data comes from multiple sources.
Hadoop: Google introduced the MapReduce [2] framework for processing large amounts of data on commodity hardware. Apache's Hadoop Distributed File System (HDFS) is evolving as a leading software component for cloud computing, combined with integrated parts such as MapReduce.
1. A MapReduce [3] job first divides the data into individual chunks, which are processed by map tasks in parallel. The outputs of the maps, sorted by the framework, are then input to the reduce tasks.
Generally, both the input and the output of the job are stored in a file system. Scheduling, monitoring, and re-executing failed tasks are taken care of by the framework.
2. HDFS [4] is a file system that spans all the nodes in a Hadoop cluster for data storage. It links together the file systems on local nodes to make them into one large file system. HDFS improves reliability by replicating data across multiple nodes to overcome node failures.
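The MapReduce flow described in item 1 above (split into chunks, map, sort, reduce) can be illustrated with a small word-count sketch. This is a single-process illustration only, not Hadoop's API: the function names (`split_into_chunks`, `map_task`, `shuffle_and_sort`, `reduce_task`, `run_job`) and the chunk size are assumptions made for this example, and the map tasks run sequentially here, whereas a real framework would schedule them in parallel and re-execute failed tasks.

```python
from collections import defaultdict
from itertools import chain


def split_into_chunks(lines, chunk_size):
    """Divide the input into independent chunks, one per map task."""
    return [lines[i:i + chunk_size] for i in range(0, len(lines), chunk_size)]


def map_task(chunk):
    """Emit a (word, 1) pair for every word in the chunk."""
    pairs = []
    for line in chunk:
        for word in line.lower().split():
            pairs.append((word, 1))
    return pairs


def shuffle_and_sort(mapped_outputs):
    """Group values by key and sort by key, as the framework would do."""
    groups = defaultdict(list)
    for key, value in chain.from_iterable(mapped_outputs):
        groups[key].append(value)
    return sorted(groups.items())


def reduce_task(key, values):
    """Combine all values emitted for one key."""
    return key, sum(values)


def run_job(lines, chunk_size=2):
    chunks = split_into_chunks(lines, chunk_size)
    # Sequential here; a real cluster would run these map tasks in parallel.
    mapped = [map_task(chunk) for chunk in chunks]
    grouped = shuffle_and_sort(mapped)
    return dict(reduce_task(key, values) for key, values in grouped)


if __name__ == "__main__":
    data = ["big data on cloud", "cloud security", "big data security"]
    print(run_job(data))
```

Because each chunk is mapped independently, a failed map task could be re-run on its chunk alone without repeating the whole job, which is the property the framework relies on for fault tolerance.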
from anywhere by connecting to the cloud using the Internet. Some of the real-time applications which use cloud computing are Gmail, Google Calendar, Google Docs, and Dropbox.
The challenges of security in cloud computing environments can be categorized into network level, user authentication level, data level, and generic issues.
a. Network level: Challenges at the network level deal with network protocols and network security, such as distributed nodes, distributed data, and inter-node communication.
b. Authentication level: Challenges at the user authentication level deal with encryption/decryption techniques and authentication methods, such as administrative rights for nodes, authentication of applications and nodes, and logging.
c. Data level: Challenges at the data level deal with data integrity and availability, such as data protection and distributed data.
2. Bharti [8] introduced the structures of big data. Structured data are numbers and words that can be easily categorized and analyzed; they include things like sales figures, account balances, and transaction data.
Unstructured data include more complex information, such as customer reviews from commercial websites, photos and other multimedia, and comments on social networking sites. These data cannot easily be separated into categories or analyzed numerically.
HACE theorem: Big data starts with large-volume, Heterogeneous, Autonomous sources with distributed and decentralized control, and seeks to explore Complex and Evolving relationships among data.
3. Shirudkar et al. [9] describe the security and privacy challenges and the recommendations necessary to address them via a hybrid cloud.
The main focus is to maintain these parameters:
1. Confidentiality
2. Integrity
3. Availability
The security methods for big data are:
Type-based keyword search for security of big data: This method provides keyword search over encryption-protected data. Moreover, the encrypted big data can be managed by different types assigned by the data owner, and access rights can be given to others according to the user's willingness. Researchers also explore new search patterns for searchable encryption. The public key encryption with keyword search (PEKS) scheme was proposed to let the user retrieve files through keyword searching.
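To make the idea of type-based keyword search over protected data more concrete, the following is a minimal sketch under stated assumptions. It is not PEKS and not the scheme of [9]: it swaps in a simpler symmetric construction in which the data owner derives one key per data type from a master key and indexes documents under keyed hashes (HMAC-SHA256) of their keywords. All names (`derive_type_key`, `keyword_token`, `EncryptedIndex`, `build_demo_index`) and the sample documents are hypothetical, and the encryption of the document contents themselves is left out.

```python
import hashlib
import hmac


def derive_type_key(master_key: bytes, data_type: str) -> bytes:
    """Owner-side: derive a separate search key for each data type."""
    return hmac.new(master_key, b"type:" + data_type.encode(), hashlib.sha256).digest()


def keyword_token(type_key: bytes, keyword: str) -> str:
    """Turn a keyword into an opaque search token under a type key."""
    return hmac.new(type_key, keyword.lower().encode(), hashlib.sha256).hexdigest()


class EncryptedIndex:
    """Server-side index: it sees only tokens and document ids, not keywords."""

    def __init__(self):
        self._index = {}

    def add(self, token: str, doc_id: str) -> None:
        self._index.setdefault(token, set()).add(doc_id)

    def search(self, token: str) -> list:
        return sorted(self._index.get(token, set()))


def build_demo_index(master_key: bytes) -> EncryptedIndex:
    """Index a few hypothetical documents, each assigned a type by the owner."""
    docs = [
        ("doc1", "medical", ["diabetes", "insulin"]),
        ("doc2", "finance", ["invoice", "audit"]),
        ("doc3", "medical", ["insulin"]),
    ]
    index = EncryptedIndex()
    for doc_id, data_type, keywords in docs:
        type_key = derive_type_key(master_key, data_type)
        for keyword in keywords:
            index.add(keyword_token(type_key, keyword), doc_id)
    return index


if __name__ == "__main__":
    master = b"\x01" * 32  # fixed demo key; a real owner would use a random secret
    index = build_demo_index(master)
    medical_key = derive_type_key(master, "medical")  # key shared with a user
    print(index.search(keyword_token(medical_key, "insulin")))
```

In this sketch, granting a user access to one type amounts to handing over that type's derived key; tokens computed under a different type's key do not match any index entry, so the user cannot search other types.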
4. Talia [12] discussed the complexity and variety of data types and the processing power needed to perform analysis on large datasets. The author stated that cloud computing infrastructure can serve as an effective platform to address the data storage required to perform big data analysis. Cloud computing is correlated with a new pattern for the provision of computing infrastructure and a big data processing method for all types of resources available in the cloud through data analysis. Several cloud-based technologies have to cope with this new environment because dealing with big data for concurrent processing has become increasingly complicated.
Achieving Big Data Privacy via Hybrid Cloud
The original data come from the private cloud and are processed on servers within the private cloud. If there are no sensitive data, the original data may be sent to the public cloud directly. Otherwise, the original data are processed so that no sensitive data are leaked. After processing, most data are sent to the public cloud, and a small amount of sensitive data is kept in the private cloud. When a user queries the data, both the private cloud and the public cloud are contacted to provide the complete query result. We consider an untrusted public cloud that is curious and may intend to browse users’
Fig 2: Big data security via hybrid cloud
It provides various applications, for example malicious URL filtering as a big data application (feature extraction, lexical features).
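The hybrid-cloud flow described above (sensitive data stays in the private cloud, the rest goes to the public cloud, and a query contacts both) can be sketched as follows. This is an illustrative model only: the class `HybridStore`, the two in-memory dictionaries standing in for the clouds, and the choice of which fields count as sensitive are all assumptions made for this example.

```python
class HybridStore:
    """Toy model of a hybrid cloud: two dicts stand in for the two clouds."""

    def __init__(self, sensitive_fields):
        self.sensitive_fields = set(sensitive_fields)
        self.private_cloud = {}  # small amount of sensitive data
        self.public_cloud = {}   # bulk of the non-sensitive data

    def put(self, record_id, record):
        """Split a record so that no sensitive field reaches the public cloud."""
        sensitive = {k: v for k, v in record.items() if k in self.sensitive_fields}
        public = {k: v for k, v in record.items() if k not in self.sensitive_fields}
        if sensitive:
            self.private_cloud[record_id] = sensitive
        self.public_cloud[record_id] = public

    def get(self, record_id):
        """Answer a query by contacting both clouds and merging the parts."""
        result = dict(self.public_cloud.get(record_id, {}))
        result.update(self.private_cloud.get(record_id, {}))
        return result


if __name__ == "__main__":
    # Hypothetical choice of sensitive fields for the demo.
    store = HybridStore(sensitive_fields={"name", "card_number"})
    store.put("r1", {"name": "Alice", "card_number": "1234", "city": "Faridabad"})
    store.put("r2", {"city": "Delhi", "item": "book"})
    print(store.public_cloud["r1"])
    print(store.get("r1"))
```

A record with no sensitive fields, such as `r2` in the demo, is stored in the public cloud only, matching the case in which the original data may be sent to the public cloud directly.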
III. CLOUD COMPUTING AND BIG DATA
Cloud computing and big data are conjoined. Big data gives users the ability to use commodity computing to process distributed queries across multiple data sets and return result sets. Cloud computing provides the underlying engine through the use of Hadoop, a distributed data-processing platform. Large data sources are stored in a distributed, fault-tolerant database and processed through a programming model for large data sets with a parallel distributed algorithm in a cluster. The main purpose of data visualization is to present analytical results visually through different graphs for decision making. Big data utilizes distributed storage technology based on cloud computing rather than local storage attached to a computer or electronic device. Big data evaluation is driven by fast-growing cloud-based applications developed using virtualized technologies. Therefore, cloud computing not only provides facilities for the computation and processing of big data but also serves as a service model.
Fig: big data usage on cloud
IV. CHALLENGES
Although cloud computing has been broadly accepted by many organizations, research on big data in the cloud remains in its early stages. Several existing issues have not been fully addressed. Moreover, new challenges continue to emerge from applications by organizations.
1. Scalability: Scalability is the ability of the storage to handle increasing amounts of data in an appropriate manner. Scalable distributed data storage systems have been a critical part of cloud computing infrastructures [13]. The lack of cloud computing features to support RDBMSs associated with enterprise solutions has made RDBMSs less attractive for the deployment of large-scale applications in the cloud. This drawback has resulted in the popularity of NoSQL [14].
2. Availability: Availability refers to the resources of the system being accessible on demand by an authorized individual [98]. In a cloud environment, one of the main issues concerning cloud service providers is the availability of the data stored in the cloud. For example, one of the pressing demands on cloud service providers is to effectively serve the needs of the mobile user who requires single or multiple data within a short amount of time. Therefore, services must remain operational even in the case of a security breach [15].
3. Data integrity: A key aspect of big data security is integrity. Integrity means that data can be modified only by authorized parties or the data owner to prevent misuse. The proliferation of cloud-based applications provides users the opportunity to store and manage their data in cloud data centers. Such applications must ensure data integrity. However, one of the main challenges that must be addressed is to ensure the correctness of user data in the cloud. Given that users may not be physically able to access the data, the cloud should provide a mechanism for the user to check whether the data is maintained [16].
4. Data transformation: Transforming data into a form suitable for analysis is an obstacle in the adoption of big data [18]. Owing to the variety of data formats, big data can be transformed into an analysis
5. Data quality: With the emergence of big data, data originate from many different sources; not all of these sources are well known or verifiable. Poor data quality has become a serious problem for many cloud service providers because data are often collected from different sources. For example, huge amounts of data are generated from smartphones, where inconsistent data formats can be produced as a result of heterogeneous sources. The data quality problem is usually defined as “any difficulty encountered along one or more quality dimensions that render data completely or largely unfit for use” [17]. Therefore, obtaining high-quality data from vast collections of data sources is a challenge.
6. Heterogeneity: Data from multiple sources are generally of different types and representation forms and significantly interconnected; they have incompatible formats and are inconsistently represented [19].
7. Privacy: Privacy concerns continue to hamper users who outsource their private data into the cloud storage. This concern has become serious with the development of big data mining and analytics, which require personal information to produce relevant results, such as personalized and location-based services [20]. Information on individuals is exposed to scrutiny, a condition that gives rise to concerns on profiling, stealing, and loss of control [21].
8. Governance: Data governance embodies the exercise of control and authority over data-related rules of law, transparency, and accountabilities of individuals and information systems to achieve business objectives. The key issues of big data in cloud governance pertain to applications that consume massive amounts of data streamed from external sources. Therefore, a clear and acceptable data policy with regard to the type of data that need to be stored, how quickly an individual needs to access the data, and how to access the data must be defined.
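For the data integrity challenge (item 3 above), one simple way a user could check whether outsourced data is still maintained correctly is to keep a small keyed tag locally and recompute it when the data is retrieved. The sketch below assumes HMAC-SHA256 as the tag and uses hypothetical names (`make_tag`, `verify`); it requires downloading the data to verify it, which dedicated data integrity proof schemes such as those discussed in [16] are designed to avoid.

```python
import hashlib
import hmac
import secrets


def make_tag(key: bytes, data: bytes) -> bytes:
    """Compute a keyed integrity tag that the user keeps locally."""
    return hmac.new(key, data, hashlib.sha256).digest()


def verify(key: bytes, data: bytes, tag: bytes) -> bool:
    """Recompute the tag over the retrieved data and compare in constant time."""
    return hmac.compare_digest(make_tag(key, data), tag)


if __name__ == "__main__":
    key = secrets.token_bytes(32)       # secret known only to the data owner
    original = b"quarterly sales records"
    tag = make_tag(key, original)       # stored by the user, not by the cloud

    cloud_copy = original               # what the cloud returns later
    print(verify(key, cloud_copy, tag))

    tampered = b"quarterly sales recordz"
    print(verify(key, tampered, tag))
```

Because the tag is keyed, a cloud provider that modifies the data cannot produce a matching tag without the owner's key, which a plain unkeyed hash stored alongside the data would not guarantee.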
V. ISSUES IN CLOUD SECURITY OF BIG DATA
Cryptographic approaches to cloud security, i.e. functional encryption and homomorphic encryption on the trusted, untrusted, and isolated cloud, cannot by themselves solve the following security issues:
1. Secure Computations in Distributed Programming Frameworks
2. Security Best Practices for Non-Relational Data Stores
3. Secure Data Storage and Transaction Logs
4. End-Point Input Validation/Filtering
5. Real-Time Security/Compliance Monitoring
6. Scalable and Composable Privacy-Preserving Data Mining and Analytics
7. Cryptographically Enforced Access Control and Secure Communication
8. Granular Access Control
9. Granular Audits
Encrypting the whole cloud raises several issues. In further research we will try to make the cloud of big data more secure and authenticated, and to reduce the processing time for a query.
VI. CONCLUSION
In big data we talk about petabytes or zettabytes of data generated every day, which can be structured or unstructured. Hadoop is a framework employed for processing large amounts of data. HDFS is a special file system for storing big data, and MapReduce is a programming model used to process large datasets. Cloud computing is used for storing big data, but the cloud comes with explicit security challenges at the network level, authentication level, and data level. In this paper we reviewed approaches to the problem of cloud security, such as using a hybrid cloud, applying encryption in the cloud, and some others. We discussed the background of Hadoop technology and its core components, namely MapReduce and HDFS. We presented current MapReduce projects and related software. We also reviewed some of the challenges in big data processing. The review covered volume, scalability, availability, data integrity, data protection, data transformation, data quality/heterogeneity, privacy and legal/regulatory issues, data access, and governance. In further work we will try to make cloud storage more secure and efficient.
REFERENCES
1. A. Katal, M. Wazid, R.H. Goudar, Big data: Issues, challenges, tools and good practices, Noida, 2013, pp. 404-409, 8-10 Aug. 2013.
2. Y. Ren, W. Tang, A service integrity assurance framework for cloud computing based on MapReduce, Proceedings of IEEE CCIS2012, Hangzhou, 2012, pp. 240-244, Oct. 30 2012-Nov. 1 2012.
4. K. Chitharanjan, A. Kala Karun, A review on Hadoop - HDFS infrastructure extensions, Jeju Island, 2013, pp. 132-137, 11-12 Apr. 2013.
5. CSA Big Data Working Group, Top Ten Big Data Security and Privacy Challenges, Cloud Security Alliance.
6. V.N. Inukollu, S. Arsi, S.R. Ravuri, Security issues associated with big data in cloud computing, International Journal of Network Security & Its Applications (IJNSA), Vol. 6, No. 3, May 2014.
7. C. Ji, Y. Li, W. Qiu, U. Awada, K. Li, Big data processing in cloud computing environments, in: Proceedings of the 12th International Symposium on Pervasive Systems, Algorithms and Networks (ISPAN), IEEE, 2012, pp. 17-23.
8. B. Thakur, M. Mann, Data Mining for Big Data: A Review, Computer Science Department, LRIET, Solan (H.P.), India.
9. K. Shirudkar, D. Motwani, Big-Data Security, International Journal of Advanced Research in Computer Science and Software Engineering, available online at: www.ijarcsse.com.
10. I.A.T. Hashem, I. Yaqoob, N.B. Anuar, S. Mokhtar, A. Gani, S.U. Khan, The rise of “big data” on cloud computing: Review and open research issues.
11. D. Talia, Clouds for scalable big data analytics, Computer 46 (2013) 98-101.
12. P. Mell, T. Grance, The NIST definition of cloud computing (draft), NIST Spec. Publ. 800 (2011) 7.
13. R. Cattell, Scalable SQL and NoSQL data stores, ACM SIGMOD Record 39 (4), ACM, New York, NY, USA, 2011, 12-27.
14. D. Zissis, D. Lekkas, Addressing cloud computing security issues, Futur. Gener. Comput. Syst. 28 (2012) 583-592.
[99] M. Schroeck, R. Shockley, J. Smart, D. Romero-Morales, P. Tufano, Analytics: The real-world use of big data, IBM Global Business Services, 2012.
15. R. Sravan Kumar, A. Saxena, Data integrity proofs in cloud storage, in: Proceedings of the Third International Conference on Communication Systems and Networks (COMSNETS), 2011, pp. 1-4.
16. D.M. Strong, Y.W. Lee, R.Y. Wang, Data quality in context, Commun. ACM 40 (1997) 103-110.
17. R. Akerkar, Big Data Computing, CRC Press, 2013.
18. D. Che, M. Safran, Z. Peng, From big data to big data mining: challenges, issues, and opportunities, in: B. Hong, X. Meng, L. Chen, W. Winiwarter, W. Song (Eds.), Database Systems for Advanced Applications, Springer, Berlin Heidelberg, 2013, pp. 1-15.
19. O. Tene, J. Polonetsky, Privacy in the age of big data: a time for big decisions, Stanford Law Review Online 64 (2012) 63.
20. P. Malik, Governing big data: principles and practices, IBM J. Res. Dev. 57 (1) (2013) 1:1-1:13.