Since this book is geared toward analysts, it is worth grounding the discussion in analytical examples: if the reader faces a problem similar to the one described previously, Hadoop may be of use. Hadoop is not a universal solution to all BigData problems; it is simply a good technique to use when large data needs to be divided into small chunks, distributed across servers, and processed in parallel. This saves both the time and the cost of performing analytics over a huge dataset. If we can design the Map and Reduce phases for a problem, it can be solved with MapReduce. Generally, Hadoop provides the computational power to process data that does not fit into a single machine's memory. (R users processing large data often encounter the error message: cannot allocate vector of size 2.5 GB.)
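The "divide into chunks" idea can be shown without Hadoop at all. Below is a minimal Python sketch (an illustration only, not part of any Hadoop or R toolchain) that computes the mean of a column too large to load at once by streaming the file in bounded chunks, so the whole vector never has to fit in memory:

```python
import os
import tempfile

def chunked_mean(path, hint=4096):
    """Mean of a one-number-per-line file, read in bounded chunks so the
    whole column never sits in memory (the 'cannot allocate vector'
    situation described above)."""
    total, count = 0.0, 0
    with open(path) as f:
        while True:
            lines = f.readlines(hint)  # read roughly `hint` bytes of lines
            if not lines:
                break
            for line in lines:
                total += float(line)
                count += 1
    return total / count

# demo: 10,000 values processed without ever holding them as one parsed vector
fd, path = tempfile.mkstemp()
os.close(fd)
with open(path, "w") as f:
    f.write("\n".join(str(i) for i in range(1, 10001)))
print(chunked_mean(path))  # 5000.5
os.remove(path)
```

The same streaming idea, applied per chunk on many machines with a final merge step, is what the Map and Reduce phases formalize.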
A huge amount of diverse data arrives from a variety of places at incredible speed, forming BigData. These data are precious, but only if we can manage them properly. Many challenges arise when we want to store, integrate and analyze BigData from scattered sources. BigData analytics tools should be powerful enough to deal with such data, and should support visualization, prediction and optimization in order to uncover hidden facts and improve decision making, which is helpful in almost every kind of business. To address these challenges, various solutions have been offered by industry experts. Today, cloud computing is one of the most reliable and cheapest solutions. NoSQL databases and distributed file systems are appropriate for storing and managing huge datasets. The Hadoop framework provides solutions for large-scale data storage, data management and data processing. R is the most popular programming framework for BigData analytics and statistics tasks.
Mukherjee, A.; Datta, J.; Jorapur, R.; Singhvi, R.; Haloi, S.; Akram, W. (18-22 Dec. 2012) “Shared disk big data analytics with Apache Hadoop” - This paper presents BigData analytics, defined as the analysis of large amounts of data to extract useful information and uncover hidden patterns. BigData analytics builds on the MapReduce framework developed by Google; Apache Hadoop is the open-source platform used to implement Google's MapReduce model. In this work, the performance of SF-CFS is compared with HDFS using SWIM with the Facebook job traces. SWIM contains workloads of thousands of jobs with complex data arrival and computation patterns.
Abstract -- Recent technology has led to large volumes of data from different areas (e.g. medical, aircraft, internet and banking transactions) over the last few years. BigData is the collection of information from these fields. BigData is characterized by high volume, high velocity and high variety: for example, data in GB or PB that comes in different forms (structured, unstructured and semi-structured) and requires fast or real-time processing. Such real-time data processing is not an easy task, because BigData is a large dataset of various kinds of data; the Hadoop system alone can handle the volume and variety of data, but for real-time analysis we must handle volume, variety and velocity together. To achieve high velocity of data we use two popular technologies: Apache Kafka, a message broker system, and Apache Storm, a stream processing engine.
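Neither Kafka nor Storm is needed to see the broker-plus-stream-processor shape they provide. The following toy in-process stand-in (an illustration assuming nothing beyond the Python standard library, not real Kafka or Storm code) uses a queue as the message broker and a consumer thread as the stream processing engine:

```python
from queue import Queue
from threading import Thread

def producer(q, events):
    """Kafka-like role: buffer events so producers and consumers are decoupled."""
    for e in events:
        q.put(e)
    q.put(None)  # sentinel marking the end of the stream

def stream_counter(q, counts):
    """Storm-like role: process each event as it arrives, keeping running state."""
    while True:
        e = q.get()
        if e is None:
            break
        counts[e] = counts.get(e, 0) + 1

q = Queue()
counts = {}
t1 = Thread(target=producer, args=(q, ["click", "view", "click"]))
t2 = Thread(target=stream_counter, args=(q, counts))
t1.start(); t2.start(); t1.join(); t2.join()
print(counts)  # {'click': 2, 'view': 1}
```

The real systems add durability, partitioning and distribution, but the velocity argument is the same: events are processed as they stream through, not batched to disk first.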
An HDFS cluster has two types of node operating in a master-worker pattern: a NameNode (the master) and a number of DataNodes (workers). The NameNode manages the filesystem namespace: it maintains the filesystem tree and the metadata for all the files and directories in the tree, and it also knows the DataNodes on which all the blocks for a given file are located. DataNodes are the workhorses of the filesystem. They store and retrieve blocks when they are told to (by clients or the NameNode), and they report back to the NameNode periodically with lists of the blocks they are storing. The NameNode also decides how data blocks are replicated. In a typical HDFS deployment, the block size is 64MB and the replication factor is 3 (the second copy on the local rack and the third on a remote rack). Figure 4 shows the architecture of the HDFS distributed file system. Hadoop MapReduce applications use storage in a manner different from general-purpose computing. To read an HDFS file, client applications simply use a standard Java file input stream, as if the file were in the native filesystem. Behind the scenes, however, this stream is manipulated to retrieve data from HDFS instead. First, the NameNode is contacted to request access permission. If granted, the NameNode translates the HDFS filename into a list of the HDFS block IDs comprising that file and a list of DataNodes that store each block, and returns the lists to the client. Next, the client opens a connection to the "closest" DataNode (based on Hadoop rack-awareness, optimally the same node) and requests a specific block ID. That HDFS block is returned over the same connection, and the data is delivered to the application. To write data to HDFS, client applications see the HDFS file as a standard output stream. Internally, however, stream data is first fragmented into HDFS-sized blocks (64MB) and then into smaller packets (64kB) by the client thread.
Each packet is enqueued into a FIFO that can hold up to 5MB of data, thus decoupling the application thread from storage system latency during normal operation. A second thread is responsible for dequeuing packets from the FIFO, coordinating with the Name Node to assign HDFS block IDs and destinations, and transmitting blocks to the Data Nodes (either local or remote) for storage. A third thread manages acknowledgements from the Data Nodes that data has been committed to disk.
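The client-side fragmentation described above (blocks first, then packets) can be sketched in a few lines. This is a hedged illustration of the splitting logic only, not HDFS client code; toy sizes stand in for the 64MB/64kB defaults so the example runs quickly:

```python
BLOCK_SIZE = 64 * 1024 * 1024   # HDFS default block size (64 MB)
PACKET_SIZE = 64 * 1024         # client packet size (64 kB)

def packetize(data, block_size=BLOCK_SIZE, packet_size=PACKET_SIZE):
    """Fragment a write stream into HDFS-sized blocks, then packets,
    mirroring the client-side write pipeline described above."""
    for b in range(0, len(data), block_size):
        block = data[b:b + block_size]
        packets = [block[p:p + packet_size]
                   for p in range(0, len(block), packet_size)]
        yield packets  # one list of packets per HDFS block

# toy sizes: a 300-byte "file" with 128-byte blocks and 50-byte packets
blocks = list(packetize(b"x" * 300, block_size=128, packet_size=50))
print([len(pkts) for pkts in blocks])  # packets per block: [3, 3, 1]
```

In the real client, each of these packets would then be enqueued into the bounded FIFO (the 5MB buffer above) for the second thread to ship to the DataNodes.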
MapR is a privately held company in California that contributes to Apache Hadoop projects such as HBase, Pig, Hive and ZooKeeper. Its products include the MapR FS file system, the MapR-DB NoSQL database and MapR Streams, and it develops technology for both commodity hardware and cloud computing services. In its standard, open-source edition, Apache Hadoop software comes with a number of restrictions; vendor distributions are aimed at overcoming the issues that users typically encounter in the standard editions. Under the free Apache license, all three distributions provide users with updates to the core Hadoop software. But when it comes to picking any one of them, one should look at the additional value it provides to customers in terms of improving the reliability of the system (detecting and fixing bugs, etc.), providing technical assistance and expanding functionality. All three top Hadoop distributions, Cloudera, MapR and Hortonworks, offer consulting, training and technical assistance. But unlike its two rivals, Hortonworks' distribution is claimed to be 100 percent open source. Cloudera incorporates an array of proprietary elements in its Enterprise 4.0 version, adding layers of administrative and management capabilities to the core Hadoop software. Going a step further, MapR replaces the HDFS component with its own proprietary file system, called MapRFS. MapRFS helps incorporate enterprise-grade features into Hadoop, enabling more efficient management of data, better reliability and, most importantly, ease of use. In other words, it is more production-ready than its two competitors. Up to its M3 edition MapR is free, but the free version lacks some of its proprietary features, namely JobTracker HA, NameNode HA, NFS-HA, Mirroring, Snapshot and a few more.
Hadoop represents another example of BigData innovation on the IT infrastructure. Apache Hadoop is an open source framework that allows companies to process vast amounts of information in a highly parallelized way. Hadoop represents a specific implementation of the MapReduce paradigm and was designed by Doug Cutting and Mike Cafarella in 2005 to use data with varying structures. It is an ideal technical framework for many BigData projects, which rely on large or unwieldy datasets with unconventional data structures. One of the main benefits of Hadoop is that it employs a distributed file system, meaning it can use a distributed cluster of servers and commodity hardware to process large amounts of data. Some of the most common examples of Hadoop implementations are in the social media space, where Hadoop can manage transactions, give textual updates, and develop social graphs among millions of users. Twitter and Facebook generate massive amounts of unstructured data and use Hadoop and its ecosystem of tools to manage this high volume. Hadoop and its ecosystem are covered in Chapter 10, “Advanced Analytics—Technology and Tools: MapReduce and Hadoop.”
Pig was initially developed at Yahoo! to let people using Apache Hadoop focus more on analyzing massive data sets and spend less time writing mapper and reducer programs. Like actual pigs, which eat almost anything, the Pig programming language is designed to handle any kind of data—hence the name! Pig is made up of two components: the first is the language itself, known as Pig Latin (yes, the people naming Hadoop projects do tend to have a sense of humor about their naming conventions), and the second is the runtime environment where Pig Latin programs are executed. Think of the relationship between a Java Virtual Machine (JVM) and a Java application. In this section, we'll simply refer to the whole entity as Pig.
Azure HDInsight is the cloud service for using the Hadoop technology ecosystem for BigData solutions. It enables the provisioning of Hadoop in the cloud, along with Apache Spark, R Server, HBase and Storm clusters. The service also includes implementations of Apache Spark, HBase, Storm, Pig, Hive, Sqoop, Oozie, Ambari and so on: Apache Spark and Storm support real-time in-memory processing, HBase is a columnar NoSQL transactional database, and Hive supports SQL query execution. Different connectivity options enable solution architects to build hybrid architectures with data both on-premises and in the cloud. The storage capability of the cloud is phenomenal, providing the flexibility to hold both primary and secondary data in different data centers. Customers can access their data around the clock, with very limited downtime, through highly available clusters in the cloud. This comprehensive set of Apache BigData projects in the cloud promotes reduced infrastructure cost, easy integration with on-premises Hadoop clusters, and deployment on Windows or Linux for processing unstructured and semi-structured data.
its business. Faster BigData analytics leads to faster decision making, which in turn leads to faster growth of an industry; in the present-day market, fast is what sells, and slow is no longer an option. As stated earlier, BigData related to Industry 4.0 consists of information such as manufacturing, production, scheduling, orders, demand and supply trends, customer services, etc. Processing such information to arrive at an efficient solution must be fast in order to beat the competition at every level. When the demand is for faster BigData analytics, Apache Spark meets the need. Spark's framework allows it to be a data processing engine that operates in-memory, on RAM; this means there is no more reading and writing of data back to the disks or nodes as happens in Hadoop. Currently, with the Hadoop framework underpinned by MapReduce, data has to be read from and written to disk between every map and reduce task, which makes processing slower given the huge amount of data involved. If Spark is used in the same scenario, the data and the processing largely stay in-memory, allowing it to iterate over the data faster. How fast? Benchmarks suggest that Apache Spark is up to 100 times faster than Hadoop in memory and 10 times faster on disk. For example, Databricks, the Spark pure-play company, used Spark to sort 100 terabytes of records in 23 minutes, beating Hadoop MapReduce's 72 minutes.
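The disk-round-trip difference can be caricatured in a few lines of plain Python. This is only a sketch of the idea, not Spark or Hadoop code: one pipeline materializes every intermediate result to a file the way chained MapReduce stages do, while the other keeps intermediates in memory the way Spark does.

```python
import json
import os
import tempfile

def disk_pipeline(data, stages):
    """MapReduce-style: write each stage's output to disk, read it back."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    for stage in stages:
        data = [stage(x) for x in data]
        with open(path, "w") as f:       # materialize between stages...
            json.dump(data, f)
        with open(path) as f:            # ...and re-read for the next stage
            data = json.load(f)
    os.remove(path)
    return data

def memory_pipeline(data, stages):
    """Spark-style: intermediate results stay in memory across stages."""
    for stage in stages:
        data = [stage(x) for x in data]
    return data

stages = [lambda x: x + 1, lambda x: x * 2]
assert disk_pipeline(list(range(10)), stages) == memory_pipeline(list(range(10)), stages)
```

Both pipelines compute the same answer; the in-memory one simply skips the per-stage serialization, which is exactly the overhead Spark's design removes for iterative workloads.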
Support vectors are simply the coordinates of individual observations. A Support Vector Machine finds the frontier (hyperplane/line) that best segregates the two classes. There are many linear classifiers (hyperplanes) that separate the data, but only one of them achieves maximum separation. The reason we need it is that an arbitrary separating hyperplane might end up closer to one class of data points than to the other, which we do not want; hence the concept of the maximum-margin classifier (hyperplane) emerges as the natural solution.
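In one dimension the idea reduces to a few lines. The following sketch (an illustration only, not a general SVM solver) finds the maximum-margin boundary for separable 1-D data; the closest opposing points are exactly the support vectors:

```python
def max_margin_1d(pos, neg):
    """For linearly separable 1-D data (pos to the right of neg), the
    maximum-margin 'hyperplane' is the midpoint between the two closest
    opposing points - those two points are the support vectors."""
    sv_pos = min(pos)
    sv_neg = max(neg)
    assert sv_neg < sv_pos, "classes must be separable with pos > neg"
    boundary = (sv_pos + sv_neg) / 2   # equidistant from both support vectors
    margin = (sv_pos - sv_neg) / 2     # half the gap between the classes
    return boundary, margin, (sv_neg, sv_pos)

boundary, margin, svs = max_margin_1d(pos=[4.0, 5.5, 7.0], neg=[0.5, 1.0, 2.0])
print(boundary, margin, svs)  # 3.0 1.0 (2.0, 4.0)
```

Note that only the support vectors (2.0 and 4.0) determine the boundary; moving any other point further away changes nothing, which is the defining property of the maximum-margin classifier.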
technologies for development (also known as ICT4D) suggests that BigData technology can make important contributions, but also presents unique challenges, to international development. Advancements in BigData analysis offer cost-effective opportunities to improve decision-making in critical development areas such as health care, employment, economic productivity, crime, security, and natural disaster and resource management.
f) Finally, issues or frauds that are identified are added to the business use case system, which is part of the hybrid framework. 2) Predictive Analytics for BigData: Predictive analytics includes the use of text analytics and sentiment analysis to examine BigData for fraud detection. Consider a scenario in which a person raises a claim saying that his car caught fire, but the story he narrates indicates that he took most of the valuable items out prior to the incident; that might indicate the car was torched on purpose. Claim reports span multiple pages, leaving very little room for conventional methods to detect the scam easily. BigData analytics helps in sifting through unstructured data, which was not possible earlier, and helps in proactively detecting fraud. There has been an increase in the use of predictive analytics technology, part of the BigData analytics concept, to spot potentially fraudulent claims and speed the payment of legitimate ones. In the past, predictive analytics was used to analyze statistical information stored in structured databases, but now it is branching out into the BigData realm. The potential fraud in the written report above is spotted using text analytics and sentiment analysis. Here's how the text analytics technology works:
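As a toy stand-in for the text analytics step (the lexicon and threshold here are invented for illustration and are not taken from any real fraud-detection system), a free-text claim report could be scored against suspicious phrases like this:

```python
# hypothetical lexicon of phrases that warrant a closer look
SUSPICIOUS_TERMS = {"removed", "took out", "prior to", "valuables"}

def flag_claim(report, terms=SUSPICIOUS_TERMS, threshold=2):
    """Score a free-text claim report by counting suspicious phrases;
    flag it for human review when the score reaches the threshold."""
    text = report.lower()
    score = sum(1 for t in terms if t in text)
    return score, score >= threshold

report = ("The car caught fire overnight. I had taken the stereo and "
          "other valuables out of the vehicle prior to the incident.")
print(flag_claim(report))  # (2, True) - flagged for review
```

Real systems go far beyond keyword matching (entity extraction, sentiment scoring, models trained on past claims), but the pipeline shape, unstructured text in, a reviewable risk score out, is the same.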
K. Shvachko presented that Hadoop is designed to run on a large collection of machines that share neither memory nor disks. This means that, unlike an HPC cluster, each node serves a dual purpose: on the one hand it is a computing resource, on the other it is a storage unit. The advantage of this software is that it can handle petabyte-scale data sets simply, and it provides a framework for distributing processing over a cluster. Its major drawback is the difficulty of handling complex data structures and performing complex queries on them. Fortunately, other frameworks built on top of Hadoop, such as Cascading, exist.
BigData analytics is now being implemented across various businesses in the banking sector, because the volume of data operated on by banks is growing at a tremendous rate, posing intriguing challenges for parallel and distributed computing platforms. These challenges range from building storage systems that can accommodate these large datasets, to collecting data from vastly distributed sources into storage systems, to running a diverse set of computations on the data; hence BigData analytics came as the solution. The technology is helping banks deliver better services to their customers, and our case study is the cross-selling activity for loan products at Bank XYZ, a large commercial bank in Indonesia. This study designs the application architecture of BigData analytics based on collected cases and interviews with Bank XYZ personnel. We also define business rules or models for analytics and test design effectiveness. The outcome we see is as follows: 1. By leveraging Cloudera Hadoop, Aster Analytics as big
MapReduce is a dataflow paradigm for such applications. Its simple, explicit dataflow programming model is favored over the traditional high-level database approaches. The MapReduce paradigm parallelizes work over huge data sets using clusters or grids. A MapReduce program comprises two functions: a Map() procedure and a Reduce() procedure.
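The two functions can be illustrated with the canonical word-count example. This is a single-process Python sketch of the paradigm, not framework code: an in-memory grouping step stands in for the shuffle phase that the framework performs between Map and Reduce.

```python
from collections import defaultdict
from itertools import chain

def map_fn(line):
    """Map(): emit an intermediate (word, 1) pair for every word in one line."""
    return [(w, 1) for w in line.split()]

def reduce_fn(word, counts):
    """Reduce(): combine all counts emitted for one word."""
    return word, sum(counts)

def mapreduce(lines):
    # shuffle phase: group intermediate pairs by key (done by the framework
    # across the cluster in real Hadoop)
    groups = defaultdict(list)
    for k, v in chain.from_iterable(map_fn(l) for l in lines):
        groups[k].append(v)
    return dict(reduce_fn(k, vs) for k, vs in sorted(groups.items()))

print(mapreduce(["big data", "big deal"]))  # {'big': 2, 'data': 1, 'deal': 1}
```

Because Map works line by line and Reduce works key by key, both phases can run on different machines over different chunks of the input, which is exactly what makes the paradigm parallelizable over clusters or grids.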
are being applied. Even though some technologies have been used and developed, difficulties remain in carrying out machine learning with BigData. To use and optimize large tables, a relational database management system should have the ability to load, monitor and back them up using built-in functions. Topological data analysis is used to study the fundamental structure of massive data sets. BigData analytics tends to avoid slower shared storage, preferring direct-attached storage in its various forms, from solid-state drives to high-capacity disks buried inside the parallel processing nodes. The two shared-storage architectures, storage area networks (SAN) and network-attached storage (NAS), are relatively slow, complex and expensive. These qualities are not consistent with BigData analytics systems, which thrive on system performance, low cost and commodity infrastructure. BigData analytics is also characterized by real-time or near-real-time information delivery, making it possible to avoid latency whenever and wherever needed.
a) Hadoop architecture: Hadoop is an open source Apache project started in 2005 by engineers at Yahoo, based on Google's earlier research papers. Hadoop then consisted of a distributed file system, called HDFS, and a data processing and execution model called MapReduce. The Apache Hadoop architecture consists of the Hadoop Common package, which provides file system and operating system (OS)-level abstractions, a MapReduce engine and the Hadoop Distributed File System (HDFS). To store a large file on HDFS, the input file is split into smaller data sets and sent over to different nodes (servers) for parallel processing, and the nodes hold the processed data. The framework used for the overall processing of the data is called MapReduce.
The second common objective of BigData technologies and solutions is time reduction. Macy's merchandise pricing optimization application provides a classic example of reducing the cycle time for complex and large-scale analytical calculations from hours or even days to minutes or seconds. The department store chain has been able to reduce the time to optimize pricing of its 73 million items for sale from over 27 hours to just over 1 hour. Described by some as "BigData analytics," this capability obviously makes it possible for Macy's to re-price items much more frequently to adapt to changing conditions in the retail marketplace. This BigData analytics application takes data out of a Hadoop cluster and puts it into other parallel computing and in-memory software architectures. Macy's also says it achieved 70% hardware cost reductions. Kerem Tomak, VP of Analytics at Macys.com, is using similar approaches to reduce the time it takes to prepare marketing offers for Macy's customers; he notes that the company can run many more models with this time savings.
Abstract — BigData is a term applied to data sets whose size is beyond the ability of conventional software technologies to capture, store, manage and process within a tolerable elapsed time. The popular assumption about BigData analytics is that it requires web-scale scalability: over many compute nodes with attached storage. In this paper, we debate the need for a massively scalable distributed computing platform for BigData analytics in conventional organizations. For organizations that do not need horizontal, web-scale scalability in their analytics platform, BigData analytics can be built on top of a traditional POSIX cluster file system using a shared-storage model. In this study, we compared a widely used clustered file system (SF-CFS) with the Hadoop Distributed File System (HDFS) using popular MapReduce workloads. In our experiments, VxCFS not only matched the performance of HDFS but also beat it in many cases. Thus, enterprises can fulfil their BigData analytics needs with a traditional, existing shared-storage model, without migrating to a different storage model in their data centers. This also brings other benefits such as stability and robustness, a rich set of features and compatibility with traditional analytics applications.