A major problem in Big Data is that we do not understand the underlying empirical micro-processes that lead to the emergence of typical network characteristics. Future developments can be predicted by algorithms fed with large volumes of data about past experience, but making predictions in a changing environment requires the theory of system dynamics. Big Data approaches that rely on computer simulation include complex-systems models and agent-based models. Agent-based models are used to predict the outcomes of social complexity in unknown future scenarios through computer simulations built on collections of mutually interdependent algorithms. Factor analysis and cluster analysis are two multivariate methods that probe the latent structure of data, and their behaviour on smaller data sets differs from their behaviour on Big Data. The extract-and-load stages of data processing in Big Data projects are challenging because conventional large-scale analysis tools are lacking. The bias problem cannot be solved by adding more data: sources such as Twitter and Google Translate do not represent the overall population, so results drawn from them can lead to wrong conclusions and may be dramatically skewed. The multiple comparisons problem is another issue magnified by Big Data: testing a large set of hypotheses is likely to produce many false results that appear, mistakenly, to be significant.
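A tiny simulation can illustrate the multiple comparisons problem described above. This is a minimal sketch: the number of tests, the sample size, and the t-threshold of 2.05 (which only approximates a 5% two-sided level for 30 observations) are illustrative assumptions, not values from the source.

```python
import random
import statistics

random.seed(42)

def false_positives(n_tests=1000, n=30, t_crit=2.05):
    # Each "test" compares a pure-noise sample mean against zero, so
    # every rejection is a false positive by construction.
    hits = 0
    for _ in range(n_tests):
        sample = [random.gauss(0, 1) for _ in range(n)]
        mean = statistics.mean(sample)
        se = statistics.stdev(sample) / n ** 0.5
        if abs(mean / se) > t_crit:  # roughly a 5% two-sided threshold
            hits += 1
    return hits

# Around 5% of purely random "hypotheses" look significant.
print(false_positives())
```

Even though no real effect exists in any of the 1000 tests, roughly fifty of them appear significant, which is exactly the trap that testing many hypotheses against a large data set sets up.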
BigData is a keyword that denotes structured, unstructured, or semi-structured data characterized by a combination of the 3Vs. The impact of using Hadoop technology on Big Data is transforming the way enormous data is generated, maintained, and analyzed. Nowadays, IT companies are adopting various scripting platforms on top of Hadoop to reduce the time needed to study vast data sets. Apache's Pig is an important component of the Hadoop system that reduces the coding and analysis time for Big Data. Big Data is a collection of complex and large data sets, including information that may be produced by multiple services; the main task is to combine data from multiple systems. The first part of this paper explains the concept of Big Data, the second part explains the Hadoop architecture and its MapReduce function, and the third part explains Apache's Pig execution environment. This paper is a brief study of Apache's Pig and HDFS.
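The MapReduce function mentioned above can be sketched in plain Python. This is an illustrative word-count example, not Hadoop or Pig code: the map step emits (key, value) pairs, a shuffle step groups them by key, and the reduce step aggregates each group, mirroring what the framework does across a cluster.

```python
from collections import defaultdict

def map_phase(line):
    # Emit (word, 1) for every word, like a Hadoop Mapper would.
    for word in line.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    # Group all values by key, like the framework's shuffle phase.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Sum each key's values, like a Hadoop Reducer would.
    return {key: sum(values) for key, values in groups.items()}

lines = ["big data needs Hadoop", "Pig runs on Hadoop"]
pairs = [p for line in lines for p in map_phase(line)]
print(reduce_phase(shuffle(pairs)))  # 'hadoop' is counted twice
```

In a real cluster the same three steps run in parallel over many machines; Pig generates such MapReduce jobs from a much shorter script.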
Azure HDInsight is a cloud service that offers the Hadoop technology ecosystem for Big Data solutions. It enables the provisioning of Hadoop in the cloud, along with Apache Spark, R Server, HBase, and Storm clusters, and includes implementations of Apache Spark, HBase, Storm, Pig, Hive, Sqoop, Oozie, Ambari, and more. Apache Spark and Storm support real-time in-memory processing, HBase is a columnar NoSQL transactional database, and Hive executes SQL queries. Various connectivity options enable solution architects to build hybrid architectures that keep data both on-premises and in the cloud. The storage capability in the cloud is phenomenal, providing the flexibility to hold both primary and secondary data in different data centers; customers access their data around the clock with very limited downtime thanks to highly available clusters in the cloud. This comprehensive set of Apache Big Data projects within the cloud promotes reduced infrastructure cost, easy integration with on-premises Hadoop clusters, and deployment on Windows or Linux for processing unstructured and semi-structured data.
ABSTRACT: Today, cyber threats are increasing because existing security systems are not capable of detecting them. Previously, attacks had the simple aim of attacking or destroying a system. The goal of recent hacking attacks, however, has shifted from leaking information and disrupting services to attacking large-scale systems such as critical infrastructure and state agencies. Existing defence technologies for detecting these attacks are based on pattern-matching methods, which are very limited: in the event of new and previously unknown attacks, the detection rate becomes very low and false negatives increase. To defend against these unknown attacks, we propose a new model based on Big Data analysis techniques that can extract information from a variety of sources to detect future attacks.
f) Finally, issues or frauds that are identified are added to the business use-case system, which is part of the hybrid framework. 2) Predictive Analytics for Big Data: Predictive analytics includes the use of text analytics and sentiment analysis on Big Data for fraud detection. Consider a scenario in which a person files a claim saying that his car caught fire, but the story he narrates indicates that he took most of the valuable items out prior to the incident; that might indicate the car was torched on purpose. Claim reports span multiple pages, leaving very little room for manual review to detect the scam easily. Big Data analytics helps sift through such unstructured data, which was not possible earlier, and helps detect fraud proactively. There has been an increase in the use of predictive-analytics technology, which is part of the Big Data analytics concept, to spot potentially fraudulent claims and speed the payment of legitimate ones. In the past, predictive analytics was used to analyze statistical information stored in structured databases, but it is now branching out into the Big Data realm. The potential fraud in the written report above is spotted using text analytics and sentiment analysis. Here is how the text-analytics technology works:
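One simple way text analytics can surface a suspicious narrative is a red-flag phrase scorer. The sketch below is a hypothetical illustration of the car-fire scenario above; the phrase list and weights are invented for the example, and real systems would use trained language models rather than a fixed dictionary.

```python
# Hypothetical red-flag phrases with illustrative weights: phrases an
# insurance analyst might treat as indicators of a staged incident.
RED_FLAGS = {
    "took out": 2, "removed": 2, "valuable items": 3,
    "before the fire": 3, "prior to": 1,
}

def fraud_score(narrative):
    # Sum the weights of every red-flag phrase found in the claim text.
    text = narrative.lower()
    return sum(w for phrase, w in RED_FLAGS.items() if phrase in text)

claim = ("My car caught fire last night. Luckily I had removed "
         "all the valuable items from it prior to the incident.")
print(fraud_score(claim))  # 2 + 3 + 1 = 6 -> route to manual review
```

A high score does not prove fraud; it merely prioritizes the multi-page report for closer inspection, which is exactly the proactive sifting the paragraph describes.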
An HDFS cluster has two types of node operating in a master-worker pattern: a NameNode (the master) and a number of DataNodes (workers). The NameNode manages the filesystem namespace: it maintains the filesystem tree and the metadata for all the files and directories in the tree, and it knows the DataNodes on which all the blocks of a given file are located. DataNodes are the workhorses of the filesystem. They store and retrieve blocks when they are told to (by clients or the NameNode), and they report back to the NameNode periodically with lists of the blocks they are storing. The NameNode also decides how data blocks are replicated. In a typical HDFS deployment, the block size is 64 MB and the replication factor is 3 (the second copy on the local rack and the third on a remote rack). Figure 4 shows the architecture of the HDFS distributed file system. Hadoop MapReduce applications use storage in a manner that differs from general-purpose computing. To read an HDFS file, client applications simply use a standard Java file input stream, as if the file were in the native filesystem; behind the scenes, however, this stream is manipulated to retrieve data from HDFS instead. First, the NameNode is contacted to request access permission. If granted, the NameNode translates the HDFS filename into a list of the HDFS block IDs comprising that file and a list of the DataNodes that store each block, and returns the lists to the client. Next, the client opens a connection to the "closest" DataNode (based on Hadoop rack awareness, but optimally the same node) and requests a specific block ID. That HDFS block is returned over the same connection, and the data is delivered to the application. To write data to HDFS, client applications see the HDFS file as a standard output stream. Internally, however, stream data is first fragmented into HDFS-sized blocks (64 MB) and then into smaller packets (64 kB) by the client thread.
Each packet is enqueued into a FIFO that can hold up to 5 MB of data, decoupling the application thread from storage-system latency during normal operation. A second thread dequeues packets from the FIFO, coordinates with the NameNode to assign HDFS block IDs and destinations, and transmits the blocks to the DataNodes (local or remote) for storage. A third thread manages acknowledgements from the DataNodes that the data has been committed to disk.
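The write path described above can be modelled in a few lines. This is a toy sketch, not HDFS client code: a producer (the application thread) fragments the stream into 64 kB packets and enqueues them into a bounded FIFO, while a separate sender thread drains the queue, standing in for transmission to the DataNodes; the acknowledgement thread is omitted for brevity.

```python
import queue
import threading

PACKET = 64 * 1024                                  # 64 kB packets
fifo = queue.Queue(maxsize=(5 * 1024 * 1024) // PACKET)  # ~5 MB FIFO
sent = []

def sender():
    # Second thread: dequeue packets and "transmit" them to DataNodes.
    while True:
        pkt = fifo.get()
        if pkt is None:          # sentinel: stream closed
            break
        sent.append(len(pkt))    # stand-in for the network send

t = threading.Thread(target=sender)
t.start()

# Application thread: fragment 200 kB of data into packets and enqueue
# them; the bounded queue is what decouples it from storage latency.
data = b"x" * (200 * 1024)
for off in range(0, len(data), PACKET):
    fifo.put(data[off:off + PACKET])
fifo.put(None)
t.join()

print(len(sent), sum(sent))  # 4 packets totalling 204800 bytes
```

Because the queue is bounded at roughly 5 MB, a slow storage system eventually back-pressures the application thread via a blocking `put`, matching the behaviour the text describes for abnormal operation.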
From all the benchmark results and their analysis, it can be confidently concluded that for Big Data analytics the traditional shared-storage model cannot be ruled out completely. Owing to architectural and design issues, a clustered file system may not scale at the same rate as a shared-nothing model does, yet for use cases where web-scale elasticity is not required, a clustered file system can do a respectable job even in the Big Data analytics domain. A clustered file system like SF-CFS can provide various additional advantages through its wealth of features. This decade has seen the success of virtualization, which introduced the recent trends of server consolidation and green-computing initiatives in enterprises.
Smart cities are a buzzword of the moment. This paper has tried to establish that while the political and economic drivers of smart cities tend towards technological supremacism, smart cities, at least in Europe, will still suffer as a project if they fail to get privacy right; and that at the moment this failure is very likely, since they combine three of the most difficult issues for modern privacy law to regulate: the IoT, Big Data, and Cloud-based infrastructure. Data protection law is still fit for purpose and in principle does not need modification, though the detail may need some fine honing to deal with threats such as the increasing marginalisation of informed consent, Big Data, the IoT, and the Cloud. The FTC's reaction is surprisingly similar: faced with the enumerated IoT issues above, and even without the cultural foundation of an omnibus privacy law founded in human rights to depend on, it still asserts that "protecting privacy and enabling innovation are not mutually exclusive and must consider principles of accountability and Privacy by Design". "Code" solutions may be more useful and should certainly be investigated to supplement the law. Four particular suggestions for further research are promoted here:
Manufacturing and production managers believe the greatest opportunities of Big Data for their function are detecting product defects and boosting quality, and improving supply planning. Better detection of defects in production processes is next on the list. A $2 billion industrial manufacturer said that analyzing sales trends to keep its manufacturing efficient was the main focus of its Big Data investments. Understanding the behaviour of repeat customers is critical to delivering in a timely and profitable manner, and most of its profitability analysis serves to make sure the company has good contracts in place. The company says its adoption of analytics has facilitated its shift to lean manufacturing and has helped it determine which products and processes should be scrapped. Managers see far less opportunity in using Big Data for mass customization, simulating new manufacturing processes, and increasing energy efficiency.
To optimize a MapReduce program, this intermediate phase is very important. As soon as Mapper output from the Map phase is available, the intermediate phase is invoked automatically. After the Map phase completes, all emitted intermediate (key, value) pairs are partitioned by a Partitioner on the Mapper side, if a Partitioner is present. The Partitioner's output is then sorted by the key attribute on the Mapper side, and the sorted output is stored in buffer memory on the Mapper node's TaskTracker. The Combiner is often the Reducer itself; the "compression" it performs is not Gzip or similar but a local run of the Reducer on the node that produces the map output. The data returned by the Combiner is then shuffled and sent to the Reducer nodes. To speed up transmission of the Mapper output to the Reducer slot at the TaskTracker, that output can be condensed with the Combiner function. By default, the Mapper output is stored in buffer memory, and if the output size exceeds a threshold, it spills to local disk. This output data is made available to Reducers over the Hypertext Transfer Protocol (HTTP).
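The partition, sort, and combine steps just described can be sketched in plain Python. This is an illustrative model, not Hadoop code: the partition count and the sum-combiner are assumptions for the example, and the combiner here plays the Reducer's role locally, as the text notes.

```python
from collections import defaultdict

N_REDUCERS = 2  # illustrative number of reduce slots

def partition(pairs):
    # Partitioner: route each (key, value) pair to a reducer by key hash.
    parts = defaultdict(list)
    for key, value in pairs:
        parts[hash(key) % N_REDUCERS].append((key, value))
    return parts

def sort_and_combine(part):
    # Sort by key on the mapper side, then run the combiner (a local
    # reduce that sums values) to shrink the data before the shuffle.
    part.sort(key=lambda kv: kv[0])
    combined = defaultdict(int)
    for key, value in part:
        combined[key] += value
    return sorted(combined.items())

mapper_output = [("a", 1), ("b", 1), ("a", 1), ("c", 1), ("b", 1)]
for pid, part in sorted(partition(mapper_output).items()):
    print(pid, sort_and_combine(part))
```

Note that the combiner shrinks five pairs down to three before anything crosses the network, which is precisely why it speeds up the transfer to the Reducer nodes.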
Educational organizations are an important part of our society and play a vital role in the growth and development of any nation. Data mining is an emerging technique with whose help one can efficiently learn from historical data and use that knowledge to predict the future behaviour of areas of concern. The growth of the current education system would surely be enhanced if data mining were adopted as a futuristic strategic-management tool. Data-mining tools can facilitate better resource utilization in terms of student performance, course development, and ultimately the development of a nation's educational standards. In this paper, student data from a community college database is taken, various classification approaches are applied, and a comparative analysis is performed. In this research work, Support Vector Machines (SVMs) are established as the best classifier, with maximum accuracy and minimum root mean square error (RMSE). The study also includes a comparative analysis of all SVM kernel types, in which the radial basis function kernel is identified as the best choice for SVMs. A decision-tree approach is also proposed, which may serve as an important basis for selecting students for any course program. The paper aims to build confidence in data-mining techniques so that present education and business systems may adopt them as a strategic-management tool.
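The radial basis function kernel that the study identifies as the best choice computes K(x, y) = exp(-gamma * ||x - y||^2). A minimal sketch follows; the gamma value and the student feature vectors (marks, attendance fraction) are illustrative assumptions, not data from the paper.

```python
import math

def rbf_kernel(x, y, gamma=0.5):
    # K(x, y) = exp(-gamma * squared Euclidean distance between x and y)
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

# Similar students (close feature vectors) get a kernel value near 1,
# dissimilar students a value near 0; an SVM with this kernel separates
# classes using these similarity scores.
print(round(rbf_kernel([85, 0.9], [84, 0.9]), 3))  # close pair -> 0.607
print(round(rbf_kernel([85, 0.9], [40, 0.3]), 6))  # distant pair -> 0.0
```

Because the kernel decays smoothly with distance, the resulting SVM can draw non-linear decision boundaries between student groups, which is one reason RBF kernels often outperform linear ones on such data.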
The smart grid, by taking control of the electricity used by consumers, will bring new dimensions to the entire electricity system. Smart meters send the consumption data of each individual meter at small intervals of time, which in the end generates Big Data. Processing and storing this amount of data is difficult with a traditional database system; accurate results require large amounts of data processed at fine granularity, so the Hadoop concept is used to overcome these Big Data problems. Smart meters help in planning for and meeting the energy requirements of different users. Data analytics on smart-meter data helps consumers view their electricity-consumption behaviour patterns. This helps them control their energy consumption and optimize it to reduce their electricity bill by shifting load from peak-tariff periods to off-peak intervals of time.
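The consumption analysis described above can be sketched with a handful of interval readings. Everything here is a hypothetical illustration: the readings, the peak-tariff window, and the two tariff rates are invented values, and a real deployment would aggregate millions of such readings in Hadoop rather than a Python list.

```python
# Hypothetical smart-meter readings: (hour_of_day, kWh) per interval.
readings = [(1, 0.5), (8, 1.25), (13, 2.5), (19, 3.0), (23, 0.75)]

PEAK_HOURS = range(9, 21)            # assumed peak-tariff window
PEAK_RATE, OFFPEAK_RATE = 8.0, 4.0   # assumed tariff per kWh

# Split consumption by tariff period and price it.
peak = sum(kwh for h, kwh in readings if h in PEAK_HOURS)
offpeak = sum(kwh for h, kwh in readings if h not in PEAK_HOURS)
bill = peak * PEAK_RATE + offpeak * OFFPEAK_RATE

print(peak, offpeak, bill)  # 5.5 2.5 54.0
```

Seeing that most of the bill comes from the 5.5 kWh consumed at peak rates is exactly the behaviour pattern that motivates shifting load into off-peak intervals.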
Data and analytics are at the heart of the digital revolution and are an imperative across all industries. To survive and thrive in the digital era, now is the time to drive data and analytics into the core of your business and scale outward to every employee, customer, supplier, and partner. Scaling the value of data and analytics requires a culture of data enablement that extends through every facet of your organization: a culture where data and analytics inform and drive business objectives, operational efficiencies, and innovation (Gartner Data & Analytics Summit 2018). Forrester has redefined the Big Data technology ecosystem it introduced in 2014. It investigated the current state of vendor innovation and identified a new set of the 22 most important technologies in the Big Data ecosystem; see Fig. 2 for more details. Forrester surveyed 63 vendors, interviewed 17 experts in the field, and leveraged its deep expertise in evaluating many Big Data technologies through the Forrester Wave™ process, together with detailed research among 65 current or potential customers and users of the technologies. Key findings from the data collected: 1) the current state of the technology; 2) the technology's potential impact on customers' businesses; 3) the time experts think the technology will need to reach the next stage of maturity; and 4) the technology's overall trajectory, from minimal success to significant success. See Fig. 3 for details.
Big Data is a term for datasets so big or multifaceted that traditional data-processing applications are inadequate to deal with them. It is not merely data; it has become a complete subject involving various tools, techniques, and frameworks. The challenge of extracting value from Big Data is parallel in many ways to the age-old problem of distilling business intelligence from transactional data. At the heart of this challenge is the process used to extract data from multiple sources, transform it to fit your analytical needs, and load it into a data warehouse for subsequent analysis, a process known as "Extract, Transform & Load" (ETL). The Hadoop Distributed File System (HDFS) is the storage component of Hadoop, which is used here to implement the disaster-management process. The MapReduce method required to implement Big Data analysis using HDFS is examined in this paper.
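The ETL process named above can be sketched end to end in a few lines. This is a minimal illustration under assumed inputs: the CSV sample, the kW-to-W transformation, and the use of an in-memory SQLite table as a stand-in for the warehouse are all invented for the example.

```python
import csv
import io
import sqlite3

# Extract: read rows from a CSV source (here, an in-memory string).
raw = "sensor,reading_kw\nA,1.5\nB,not_a_number\nC,2.0\n"

def extract(text):
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    # Transform: normalize units (kW -> W) and discard malformed rows.
    out = []
    for row in rows:
        try:
            out.append((row["sensor"], float(row["reading_kw"]) * 1000))
        except ValueError:
            continue
    return out

def load(rows):
    # Load: insert the cleaned rows into a warehouse table.
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE readings (sensor TEXT, watts REAL)")
    db.executemany("INSERT INTO readings VALUES (?, ?)", rows)
    return db.execute("SELECT COUNT(*), SUM(watts) FROM readings").fetchone()

print(load(transform(extract(raw))))  # (2, 3500.0): bad row dropped
```

In a Big Data setting the same three stages survive, but each runs as a distributed job (for example a MapReduce pass per stage) instead of in-process function calls.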
What is 'BigData'? When we combine several data sets consisting of a variety of data types, correlations, trends, and patterns, the result is usually a huge, complex, cluster-like formation, which we call 'BigData'. In Industry 4.0, so much information is produced and collected on a daily basis that its processing and analysis is beyond the capabilities of traditional tools. There is, however, a technology with which we can conduct such analysis, and that is BigData. BigData allows us to quickly and efficiently manage and use this constantly growing body of data. Big Data analytics helps analyze and separate out the key components of a business or organization; it is like breaking up the cluster of data and extracting the important or germane information from it, so as to make informed decisions and support the effective transfer of knowledge needed to carry out business objectives in the present scenario.
K. Shvachko presented that Hadoop is designed to run on a large collection of machines that share neither memory nor disks. This means that, unlike in an HPC cluster, each node serves a dual purpose: on the one hand it is a computing resource, on the other a storage unit. The advantages of this software are that it can handle petabyte-scale data sets simply and that it provides a framework for distributing processing over a cluster. Its major drawback is the difficulty of handling complex data structures and performing complex queries on them. Fortunately, other frameworks built on top of Hadoop, such as Cascading, exist.
Abstract: Big Data refers to organizational data assets that exceed the volume, velocity, and variety of data typically stored using traditional structured database technologies. This type of data has become an important resource from which organizations can gain valuable insight and make business decisions by applying predictive analysis. This paper provides a comprehensive view of the current status of Big Data development, starting from the definition and a description of Hadoop and MapReduce, the framework that standardizes the use of clusters of commodity machines to analyze Big Data. Organizations that are ready to embrace Big Data technology must anticipate significant adjustments to infrastructure and to the roles played by IT professionals and BI practitioners, which is discussed in the section on Big Data challenges. The landscape of Big Data development changes rapidly, which is directly related to Big Data trends; clearly, a major part of the trend results from attempts to deal with the challenges discussed earlier. Lastly, the paper covers the most recent job prospects related to Big Data, including descriptions of several job titles that comprise the Big Data workforce.
Mukherjee, A.; Datta, J.; Jorapur, R.; Singhvi, R.; Haloi, S.; Akram, W. (18-22 Dec. 2012), "Shared disk big data analytics with Apache Hadoop". This paper presents Big Data analytics, defined as the analysis of large amounts of data to obtain useful information and uncover hidden patterns. Big Data analytics relies on the MapReduce framework developed by Google, and Apache Hadoop is the open-source platform used to implement Google's MapReduce model. In this work, the performance of SF-CFS is compared with that of HDFS using SWIM with the Facebook job traces; SWIM contains workloads of thousands of jobs with complex data arrival and computation patterns.
Hadoop uses Google's MapReduce and Google File System technologies as its foundation. It is optimized to handle massive quantities of data, whether structured, unstructured, or semi-structured, using commodity hardware, that is, relatively inexpensive computers. Hadoop replicates its data across different computers, so that if one goes down, the data is processed on one of the replicated computers. Hadoop is used for Big Data and complements OnLine Transaction Processing and OnLine Analytical Processing. Yahoo is also the largest contributor to the Hadoop open-source project. Organizations can add analytic solutions to the mix to derive valuable information that combines structured legacy data with new unstructured data. Hadoop has two major components, the distributed file system component and the MapReduce component; the emphasis here is on the distributed file system, called the Hadoop Distributed File System (HDFS), and on an important feature of Hadoop called "rack awareness" or network-topology awareness.
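Rack awareness can be illustrated with a toy replica-placement routine. This sketch follows the placement described earlier in this document (first copy on the writer's node, second on the local rack, third on a remote rack); the cluster topology and node names are hypothetical, and real HDFS placement policy involves additional load and health checks.

```python
import random

# Hypothetical two-rack topology known to the NameNode.
CLUSTER = {
    "rack1": ["node1", "node2", "node3"],
    "rack2": ["node4", "node5", "node6"],
}

def place_replicas(writer_node, rng=random):
    # First replica: the node writing the block.
    local_rack = next(r for r, ns in CLUSTER.items() if writer_node in ns)
    # Second replica: a different node on the same (local) rack.
    second = rng.choice([n for n in CLUSTER[local_rack] if n != writer_node])
    # Third replica: any node on a remote rack, surviving a rack failure.
    remote_rack = rng.choice([r for r in CLUSTER if r != local_rack])
    third = rng.choice(CLUSTER[remote_rack])
    return [writer_node, second, third]

print(place_replicas("node2"))  # e.g. ['node2', 'node3', 'node5']
```

Keeping two copies on the local rack limits cross-rack write traffic, while the remote copy guarantees the block survives the loss of an entire rack.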
For example, in Figure 7 a problem occurs at 12:15:30 pm, the time the event reaches the system. Within a few seconds, at 12:15:32 pm, the analyzer responsible for analysing the incoming data detects the problem. The analyzer then informs the Troubleshooting Team (TS Team), which at 12:15:34 pm allocates troubleshooters to the problem according to their nearest positions, and the troubleshooters visit the affected area within a few minutes, by 12:20 pm. Thus the problem is detected within 5-6 seconds by using stream processing, which analyses data on a seconds-or-minutes basis. In a traditional system, whenever a problem occurs, the analysis team finds it only at the end of the day, because it uses batch processing.
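The latency gap between the two approaches can be made concrete with timestamp arithmetic. This is a toy comparison mirroring the Figure 7 walkthrough: the date and the assumed end-of-day batch run at 11:59 pm are illustrative values.

```python
from datetime import datetime, timedelta

# Stream processing: the analyzer sees each event as it arrives.
event_time = datetime(2024, 1, 1, 12, 15, 30)        # fault occurs
stream_detects = event_time + timedelta(seconds=2)   # analyzer detects it

# Batch processing: events are only inspected by an end-of-day job.
batch_detects = event_time.replace(hour=23, minute=59, second=0)

print((stream_detects - event_time).total_seconds())        # 2.0 seconds
print((batch_detects - event_time).total_seconds() / 3600)  # ~11.7 hours
```

A two-second detection versus a roughly twelve-hour one is the whole argument for stream processing in monitoring scenarios like this.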