Big Data Analytics Using Apache Hadoop

Full text





Submitted in partial fulfilment of

the requirements for the award of Bachelor of Technology Degree in Computer Science and Engineering

of the University of Kerala

Submitted by


Roll No : 1

Seventh Semester

B.Tech Computer Science and Engineering










This is to certify that this seminar report entitled “BIG DATA ANALYTICS USING

APACHE HADOOP” is a bonafide record of the work done by Abin Baby, under our

guidance towards partial fulfilment of the requirements for the award of the Degree of

Bachelor of Technology in Computer Science and Engineering of the University of

Kerala during the year 2011-2015.

Dr. Abdul Nizar A Mrs. Sabitha S Mrs. Rani Koshi Professor Assoc. Professor Assoc. Professor Dept. of CSE Dept. of CSE Dept. of CSE




I would like to express my sincere gratitude and heartful indebtedness to my guide Dr. Abdul Nizar , Head of Department, Department of Computer Science and Engineering for her valuable guidance and encouragement in pursuing this seminar.

I am also very much thankful to, Mrs. Sabitha S, Associate Professor, Department of Computer Science and Engineering for their help and support.

I also extend my hearty gratitude to Seminar Co-ordinator, Mrs. Rani Koshi, Associate Professor, Department of CSE, College of Engineering Trivandrum for providing necessary facilities and their sincere co-operation. My sincere thanks is extended to all the teachers of the department of CSE and to all my friends for their help and support.

Above all, I thank God for the immense grace and blessings at all stages of the project. Abin Baby




The paradigm of processing huge datasets has been shifted from centralized architecture to distributed architecture. As the enterprises faced issues of gathering large chunks of data they found that the data cannot be processed using any of the existing centralized architecture solutions. Apart from time constraints, the enterprises faced issues of efficiency, performance and elevated infrastructure cost with the data processing in the centralized environment.

With the help of distributed architecture these large organizations were able to overcome the problems of extracting relevant information from a huge data dump. One of the best open source tools used in the market to harness the distributed architecture in order to solve the data processing problems is Apache Hadoop. Using Apache Hadoop’s various components such as data clusters, map-reduce algorithms and distributed processing, we will resolve various location-based complex data problems and provide the relevant information back into the system, thereby increasing the user experience.




1. Introduction


2. Big Data


2.1 Attributes of Big data


2.2 Big data Applications


3. Big Data Analytics


3.1 Challenges of Big Data Analytics


3.2 Objective of Big Data Analytics


4. Apache Hadoop


5. Hadoop Distributed File System


5.1 Architecture


6. Hadoop Map-Reduce


6.1 Architecture


6.2 Map Reduce Paradigm


6.3 Comparing RDBMS & Map Reduce


7. Apache Hadoop Ecosystem


8. Conclusion






Amount of data generated every day is expanding in drastic manner.Big data is a popular term used to describe the data which is in zetta bytes. Government , companies many organisations try to acquire and store data about their citizens and customers in order to know them better and predict the customer behaviour . social networking websites generate new data every second and handling such a data is one of the major challenges companies are facing. Data which is stored in data warehouses is causing disruption because it is in a raw format ,proper analysis and processing is to be done in order to produce usable information out of it. Big Data has to deal with large and complex datasets that can be structured , semi structured,or unstructured and will typically not fit into memory to be processed. They have to be processed in place, which means that computation has to be done where the data resides for processing.

Big data challenges include analysis, capture, curation, search, sharing, storage, transfer, visualization, and privacy violations. The trend to larger data sets is due to the additional information derivable from analysis of a single large set of related data, as compared to separate smaller sets with the same total amount of data, allowing correlations to be found to "spot business trends, prevent diseases, combat crime and so on.

Big Data usually includes datasets with sizes. It is not possible for such systems to process this amount of data within the time frame mandated by the business. Big Data volumes are a constantly moving target, as of 2012 ranging from a few dozen terabytes to many petabytes of data in a single dataset. Faced with this seemingly insurmountable challenge, entirely new platforms are called Big Data platforms.

New tools are being used to handle such a large amount of data in short time.ApacheHadoop is java based programming framework which is used for processing large data sets in distributed computer environment. Hadoop is used in



system where multiple nodes are present which can process terabytes of data.hadoop uses its own file system HDFS which facilitates fast transfer of data which can sustain node failure and avoid system failure as whole. Hadoop uses MapReduce algorithm which breaks down the big data into smaller chunks and performs the operations on it. Hadoop framework is used by many big companies like Google, yahoo, IBM for applications such as search engine, advertising and information gathering and processing.

Various technologies will come in hand-in-hand to accomplish this task such as Spring Hadoop Data Framework for the basic foundations and running of the Map-Reduce jobs, Apache Maven for distributed building of the code, REST Web services for the communication, and lastly Apache Hadoop for distributed processing of the huge dataset.





Every day, we create 2.5 quintillion bytes of data — so much that 90% of the data in the world today has been created in the last two years alone. This data comes from everywhere: sensors used to gather climate information, posts to social media sites, digital pictures and videos, purchase transaction records, and cell phone GPS signals to name a few. This data is big data.

The data lying in the servers of the company was just data until yesterday – sorted and filed. Suddenly, the slang Big Data got popular and now the data in a company is Big Data. The term covers each and every piece of data an organization has stored till now. It includes data stored in clouds and even the URLs that you have been bookmarked.A company might not have digitized all the data. They may not have structured all the data already. But then, all the digital, papers, structured and non-structured data with the company is now Big Data.

Big data is an all-encompassing term for any collection of data sets so large and

complex that it becomes difficult to process using traditional data processing applications. It refers to the large amounts, at least terabytes, of poly-structured data that flows continuously through and around organizations, including video, text, sensor logs, and transactional records. The business benefits of analyzing this data can be significant. According to a recent study by the MIT Sloan School of Management, organizations that use analytics are twice as likely to be top performers in their industry as those that don’t.

Big data burst upon the scene in the first decade of the 21st century, and the first organizations to embrace it were online and startup firms. In a nutshell, Big Data is your data. It's the information owned by your company, obtained and processed through new techniques to produce value in the best way possible.



Companies have sought for decades to make the best use of information to improve their business capabilities. However, it's the structure (or lack thereof) and size of Big Data that makes it so unique. Big Data is also special because it represents both significant information - which can open new doors – and the way this information is analyzed to help open those doors. The analysis goes hand-in-hand with the information, so in this sense "Big Data" represents a noun – "the data" - and a verb – "combing the data to find value." The days of keeping company data in Microsoft Office documents on carefully organized file shares are behind us, much like the bygone era of sailing across the ocean in tiny ships. That 50 gigabyte file share in 2002 looks quite tiny compared to a modern-day 50 terabyte marketing database containing customer preferences and habits.

The world's technological per-capita capacity to store information has roughly doubled every 40 months since the 1980s as of 2012, every day 2.5 exabytes (2.5×1018)

of data were created. The challenge for large enterprises is determining who should own big data initiatives that straddle the entire organization.

Some of the popular organizations that hold Big Data are as follows: • Facebook: It has 40 PB of data and captures 100 TB/day

• Yahoo!: It has 60 PB of data • Twitter: It captures 8 TB/day

• EBay: It has 40 PB of data and captures 50 TB/day

How much data is considered as Big Data differs from company to company.Though true that one company's Big Data is another's small, there is something common: doesn't fit in memory, nor disk, has rapid influx of data that needs to be processed and would benefit from distributed software stacks. For some companies, 10 TB of data would be considered Big Data and for others 1 PB would be Big Data. So only you can determine whether the data is really Big Data. It is sufficient to say that it would start in the low terabyte range.




As far back as 2001, industry analyst Doug Laney (currently with Gartner) articulated the now mainstream definition of big data as the three Vs of big data: volume, velocity and variety.

Volume. The quantity of data that is generated is very important in this context.It is the

size of the data which determines the value and potential of the data under

consideration and whether it can actually be considered as Big Data or not.The name ‘Big Data’ itself contains a term which is related to size and hence the

characteristic.Many factors contribute to the increase in data volume. Transaction-based data stored through the years. Unstructured data streaming in from social media. Increasing amounts of sensor and machine-to-machine data being collected. In the past, excessive data volume was a storage issue. But with decreasing storage costs, other issues emerge, including how to determine relevance within large data volumes and how to use analytics to create value from relevant data.

 It is estimated that 2.5 Quintillion data is generated every day.

 40 Zettabytes of data will be created by 2020,an increase of 30 times from 2005  6 billion people around the world are using mobile phones

Velocity. The term ‘velocity’ in the context refers to the speed of generation of data or

how fast the data is generated and processed to meet the demands and the challenges which lie ahead in the path of growth and development.Data is streaming in at

unprecedented speed and must be dealt with in a timely manner. RFID tags, sensors and smart metering are driving the need to deal with torrents of data in near-real time. Reacting quickly enough to deal with data velocity is a challenge for most organizations.

 20 Hours of video being uploaded every minute  2.9 million emails sent every second

 NewYork stock exchange captures 1TB of information in every trading session  Variety. The next aspect of Big Data is its variety.This means that the category to which



analysts.This helps the people, who are closely analyzing the data and are associated with it, to effectively use the data to their advantage and thus upholding the importance of the Big Data.Data today comes in all types of formats. Structured, numeric data in traditional databases. Information created from line-of-business applications. Unstructured text documents, email, video, audio, stock ticker data and financial transactions. Managing, merging and governing different varieties of data is something many organizations still grapple with.

 Global size of data in health care is about 150 exabytes as of 2011  30 Billion pieces of content are shared on facebook every month  4 billion hours of video are watched on youtube each month

Veracity. In addition to the increasing velocities and varieties of data, data flows can be

highly inconsistent with periodic peaks. Is something trending in social media? Daily, seasonal and event-triggered peak data loads can be challenging to manage. Even more so with unstructured data involved. This is a factor which can be a problem for those who analyse the data.

 Poor data quality causes US economy around $3.1 trillion a year

Complexity - Data management can become a very complex process,especially when

large volumes of data come from multiple sources.These data need to be

linked,connected and correlated in order to be able to grasp the information that is supposed to be conveyed by these data.This situation,is therefore,termed as the ‘complexity’ of Big Data.

Volatility : Big data volatility refers to how long is data valid and how long should it be

stored. In this world of real time data you need to determine at what point is data no longer relevant to the current analysis.




ADVERTISING : Big data analytics help companies like google and other advertising

companies to identify the behavior of a person and to target the ads accordingly.Big data analytics help in more personal and targeted ads.

ONLINE MARKETING : Big data analytics is used by online retailers like

amazon,ebay,flipkart etc to identify their potential customers giving them offers , for varying the price of products according to the trends etc

HEALTH CARE : The average amount of data per hospital will increase from 167TB to

665TB in 2015.With Big Data medical professionals can improve patient care and reduce cost by extracting relevant clinical information.

CUSTOMER SERVICE : Service representatives can use data to gain a more holistic

view of their customers , understanding their likes amd dislikes in real time

INSURANCE : An insurance or citizen service provider can apply advanced analytics on

data to detect fraud quickly


increasingly used to optimize business processes. Retailers are able to optimize their stock based on predictions generated from social media data, web search trends and weather forecasts. One particular business process that is seeing a lot of big data analytics is supply chain or delivery route optimization.

FINANCIAL TRADING : High-Frequency Trading (HFT) is an area where big data finds a

lot of use today. Here, big data algorithms are used to make trading decisions. Today, the majority of equity trading now takes place via data algorithms that increasingly take into account signals from social media networks and news websites to make, buy and sell decisions in split seconds.




many aspects of our cities and countries. For example, it allows cities to optimize traffic flows based on real time traffic information as well as social media and weather data. A number of cities are currently piloting big data analytics with the aim of turning themselves into Smart Cities, where the transport infrastructure and utility processes are all joined up.

IMPROVING SECURITY AND LAW ENFORCEMENT : Big data is applied heavily in

improving security and enabling law enforcement. I am sure you are aware of the revelations that the National Security Agency (NSA) in the U.S. uses big data analytics to foil terrorist plots (and maybe spy on us). Others use big data techniques to detect and prevent cyber attacks.


machines and devices become smarter and more autonomous. For example, big data tools are used to operate Google’s self-driving car. The Toyota Prius is fitted with cameras, GPS as well as powerful computers and sensors to safely drive on the road without the intervention of human beings. Big data tools are also used to optimize energy grids using data from smart meters.

IMPROVING SCIENCE AND RESEARCH : Science and research is currently being

transformed by the new possibilities big data brings. Take, for example, CERN, the Swiss nuclear physics lab with its Large Hadron Collider, the world’s largest and most powerful particle accelerator. Experiments to unlock the secrets of our universe – how it started and works - generate huge amounts of data. The CERN data center has 65,000 processors to analyze its 30 petabytes of data. However, it uses the computing powers of thousands of computers distributed across 150 data centers worldwide to analyze the data.

IMPROVING SPORTS PERFORMANCE : Most elite sports have now embraced big data

analytics. We have the IBM SlamTracker tool for tennis tournaments; we use video analytics that track the performance of every player in a football or baseball game, and sensor technology in sports equipment such as basket balls or golf clubs allows us to get feedback (via smart phones and cloud servers) on our game and how to improve it.





Big data is difficult to work with using most relational database management systems and desktop statistics and visualization packages, requiring instead "massively parallel software running on tens, hundreds, or even thousands of servers"

Rapidly ingesting, storing, and processing big data requires a cost effective infrastructure that can scale with the amount of data and the scope of analysis. Most organizations with traditional data platforms—typically relational database management systems (RDBMS) coupled to enterprise data warehouses (EDW) using ETL tools—find that their legacy infrastructure is either technically incapable or financially impractical for storing and analyzing big data. A traditional ETL process extracts data from multiple sources, then cleanses, formats, and loads it into a data warehouse for analysis. When the source data sets are large, fast, and unstructured, traditional ETL can become the bottleneck, because it is too complex to develop, too expensive to operate, and takes too long to execute.

By most accounts, 80 percent of the development effort in a big data project goes into data integration and only 20 percent goes toward data analysis. Furthermore, a



traditional EDW platform can cost upwards of USD 60K per terabyte. Analyzing one petabyte—the amount of data Google processes in 1 hour—would cost USD 60M. Clearly “more of the same” is not a big data strategy that any CIO can afford.So we require more efficient analytics for Big Data.

Big Analytics delivers competitive advantage in two ways compared to the traditional analytical model. First, Big Analytics describes the efficient use of a simple model applied to volumes of data that would be too large for the traditional analytical environment. Research suggests that a simple algorithm with a large volume of data is more accurate than a sophisticated algorithm with little data. The algorithm is not the competitive advantage; the ability to apply it to huge amounts of data—without compromising performance—generates the competitive edge.

Second, Big Analytics refers to the sophistication of the model itself. Increasingly, analysis algorithms are provided directly by database management system (DBMS) vendors. To pull away from the pack, companies must go well beyond what is provided and innovate by using newer, more sophisticated statistical analysis.

To analyze such a large volume of data, big data analytics is typically performed using specialized software tools and applications for predictive analytics, data mining, text mining, forecasting and data optimization. Collectively these processes are separate but highly integrated functions of high-performance analytics. Using big data tools and software enables an organization to process extremely large volumes of data that a business has collected to determine which data is relevant and can be analyzed to drive better business decisions in the future.




For most organizations, big data analysis is a challenge. Consider the sheer volume of data and the many different formats of the data (both structured and unstructured data) collected across the entire organization and the many different ways different types of data can be combined, contrasted and analyzed to find patterns and other useful information.

1.Meeting the need for speed : In today’s hypercompetitive business environment, companies not only have to find and analyze the relevant data they need, they must find it quickly. Visualization helps organizations perform analyses and make decisions much more rapidly, but the challenge is going through the sheer volumes of data and accessing the level of detail needed, all at a high speed. One possible solution is hardware. Some vendors are using increased memory and powerful parallel processing to crunch large volumes of data extremely quickly. Another method is putting data in-memory but using a grid computing approach, where many machines are used to solve a problem.

2.Understanding the data : It takes a lot of understanding to get data in the right shape so that you can use visualization as part of data analysis. For example, if the data comes from social media content, you need to know who the user is in a general sense – such as a customer using a particular set of products – and understand what it is you’re trying to visualize out of the data. One solution to this challenge is to have the proper domain expertise in place. Make sure the people analyzing the data have a deep understanding of where the data comes from, what audience will be consuming the data and how that audience will interpret the information.

3.Addressing data quality : Even if you can find and analyze data quickly and put it in the proper context for the audience that will be consuming the information, the value of data for decision-making purposes will be jeopardized if the data is not accurate or



timely. This is a challenge with any data analysis, but when considering the volumes of information involved in big data projects, it becomes even more pronounced.Again, data visualization will only prove to be a valuable tool if the data quality is assured. To address this issue, companies need to have a data governance or information management process in place to ensure the data is clean.

4.Displaying meaningful results : Plotting points on a graph for analysis becomes difficult when dealing with extremely large amounts of information or a variety of categories of information. For example, imagine you have 10 billion rows of retail SKU data that you’re trying to compare. The user trying to view 10 billion plots on the screen will have a hard time seeing so many data points. One way to resolve this is to cluster data into a higher-level view where smaller groups of data become visible. By grouping the data together, or “binning,” you can more effectively visualize the data.

5. Dealing with outliers : The graphical representations of data made possible by visualization can communicate trends and outliers much faster than tables containing numbers and text. Users can easily spot issues that need attention simply by glancing at a chart. Outliers typically represent about 1 to 5 percent of data, but when you’re working with massive amounts of data, viewing 1 to 5 percent of the data is rather difficult. How do you represent those points without getting into plotting issues? Possible solutions are to remove the outliers from the data (and therefore from the chart) or to create a separate chart for the outliers.

The availability of new in-memory technology and high-performance analytics that use data visualization is providing a better way to analyze data more quickly than ever. Visual analytics enables organizations to take raw data and present it in a meaningful way that generates the most value. Nevertheless, when used with big data, visualization is bound to lead to some challenges. If you’re prepared to deal with these hurdles, the opportunity for success with a data visualization strategy is much greater.




1. Cost Reduction from Big Data Technologies : Some organizations pursuing big data

believe strongly that MIPS and terabyte storage for structured data are now most cheaply delivered through big data technologies like Hadoop clusters. Organizations that were focused on cost reduction made the decision to adopt big data tools primarily within the IT organization on largely technical and economic criteria.

2. Time Reduction from Big Data : The second common objective of big data technologies and solutions is time reduction.Many companies make use of big data analytics to generate real-time decisions to save a lot of time. Another key objective involving time reduction is to be able to interact with the customer in real time, using analytics and data derived from the customer experience.

3. Developing New Big Data-Based Offerings : One of the most ambitious things an organization can do with big data is to employ it in developing new product and service offerings based on data. Many of the companies that employ this approach are online firms, which have an obvious need to employ data-based products and services.

4. Supporting Internal Business Decisions : The primary purpose behind traditional, “small data” analytics was to support internal business decisions. What offers should be presented to a customer? Which customers are most likely to stop being customers soon? How much inventory should be held in the warehouse? How should we price our products?

These types of decisions employ big data when there are new, less structured data sources that can be applied to the decision. For example, any data that can shed light on customer satisfaction is helpful, and much data from customer interactions is unstructured





Doug Cutting, and Mike Carafella helped create Apache Hadoop in 2005 out of necessity as data from the web exploded, and grew far beyond the ability of traditional systems to handle it. Hadoop was initially inspired by papers published by Google outlining its approach to handling an avalanche of data, and has since become the de facto standard for storing, processing and analyzing hundreds of terabytes, and even petabytes of data.

Apache Hadoop is an open source distributed software platform for storing and processing data. Written in Java, it runs on a cluster of industry-standard servers configured with direct-attached storage. Using Hadoop, you can store petabytes of data reliably on tens of thousands of servers while scaling performance cost-effectively by merely adding inexpensive nodes to the cluster.

Apache Hadoop is 100% open source, and pioneered a fundamentally new way of storing and processing data. Instead of relying on expensive, proprietary hardware and different systems to store and process data, Hadoop enables distributed parallel processing of huge amounts of data across inexpensive, industry-standard servers that both store and process the data, and can scale without limits. With Hadoop, no data is too big. And in today’s hyper-connected world where more and more data is being created every day, Hadoop’s breakthrough advantages mean that businesses and organizations can now find value in data that was recently considered useless.

Reveal Insight From All Types of Data,From All Types of Systems

Hadoop can handle all types of data from disparate systems: structured, unstructured, log files, pictures, audio files, communications records, email – just about anything you can think of, regardless of its native format. Even when different types of data have been stored in unrelated systems, one can dump it all into a Hadoop cluster with no prior need for a schema. In other words, we don’t need to know how we intend to query the



data before storing it; Hadoop lets one to decide later and over time can reveal questions that never even thought to ask.

By making all data useable, not just what’s in a databases, Hadoop lets one to see relationships that were hidden before and reveal answers that have always been just out of reach. One can start making more decisions based on hard data instead of hunches and look at complete data sets, not just samples.

Redefine the Economics of Data: Keep Everything, Forever, Online

In addition, Hadoop’s cost advantages over legacy systems redefine the economics of data. Legacy systems, while fine for certain workloads, simply were not engineered with the needs of Big Data in mind and are far too expensive to be used for general purpose with today's largest data sets.

One of the cost advantages of Hadoop is that because it relies in an internally redundant data structure and is deployed on industry standard servers rather than expensive specialized data storage systems, you can afford to store data not previously viable. And we all know that once data is on tape, it’s essentially the same as if it had been deleted - accessible only in extreme circumstances.

Hadoop consists of the Hadoop Common package, which provides filesystem and OS level abstractions, a MapReduce engine (either MapReduce/MR1 or YARN/MR2) and theHadoop Distributed File System (HDFS). The Hadoop Common package contains the necessary Java Archive (JAR) files and scripts needed to start Hadoop.

For effective scheduling of work, every Hadoop-compatible file system should provide location awareness: the name of the rack (more precisely, of the network switch) where a worker node is. Hadoop applications can use this information to run work on the node where the data is, and, failing that, on the same rack/switch, reducing backbone traffic. HDFS uses this method when replicating data to try to keep different copies of the data on different racks. The goal is to reduce the impact of a rack power outage or switch failure, so that even if these events occur, the data may still be readable.



HDFS : Self-healing, high-bandwidth, clustered storage.

MapReduce : Distributed, fault-tolerant resource management,coupled with scalable data processing.

YARN : YARN is the architectural center of Hadoop that allows multiple data processing

engines such as interactive SQL, real-time streaming, data science and batch processing to handle data stored in a single platform, unlocking an entirely new approach to analytics. It is the foundation of the new generation of Hadoop and is enabling organizations everywhere to realize a modern data architecture.

YARN is the prerequisite for Enterprise Hadoop, providing resource management and a central platform to deliver consistent operations, security, and data governance tools across Hadoop clusters.

YARN also extends the power of Hadoop to incumbent and new technologies found within the data center so that they to can take advantage of cost effective, linear-scale storage and processing. It provides ISVs and developers a consistent framework for writing data access applications that run IN Hadoop.





HDFS is Hadoop's own rack-aware filesystem, which is a UNIX-based data storage layer of Hadoop. HDFS is derived from concepts of Google filesystem. An important characteristic of Hadoop is the partitioning of data and computation across many (thousands of) hosts, and the execution of application computations in parallel, close to their data. On HDFS, data files are replicated as sequences of blocks in the cluster. A Hadoop cluster scales computation capacity, storage capacity, and I/O bandwidth by simply adding commodity servers. HDFS can be accessed from applications in many different ways. Natively, HDFS provides a Java API for applications to use.

The Hadoop clusters at Yahoo! span 40,000 servers and store 40 petabytes of application data, with the largest Hadoop cluster being 4,000 servers. Also, one hundred other organizations worldwide are known to use Hadoop.

HDFS was designed to be a scalable, fault-tolerant, distributed storage system that works closely with MapReduce. HDFS will “just work” under a variety of physical and systemic circumstances. By distributing storage and computation across many servers, the combined storage resource can grow with demand while remaining economical at every size.

HDFS supports parallel reading and writing and is optimized for streaming reading and writing.The bandwidth scales linearly with the number of nodes.HDFS provides a block redundancy factor which is normally 3 , ie every block will be replicated 3 times in various nodes .This helps to get higher fault tolerance.

These specific features ensure that the Hadoop clusters are highly functional and highly available:



Rack awareness allows consideration of a node’s physical location, when

allocating storage and scheduling tasks

Minimal data motion. MapReduce moves compute processes to the data on

HDFS and not the other way around. Processing tasks can occur on the physical node where the data resides. This significantly reduces the network I/O patterns and keeps most of the I/O on the local disk or within the same rack and provides very high aggregate read/write bandwidth.

Utilities diagnose the health of the files system and can rebalance the data on

different nodes

Rollback allows system operators to bring back the previous version of HDFS

after an upgrade, in case of human or system errors

Standby NameNode provides redundancy and supports high availability

Highly operable. Hadoop handles different types of cluster that might otherwise

require operator intervention. This design allows a single operator to maintain a cluster of 1000s of nodes.


HDFS stores large files by dividing them into blocks (usually 64 or 128 MB) and replicating the blocks on three or more servers.Data is organized in to files and directories.

HDFS has a master/slave architecture. An HDFS cluster consists of a single NameNode, a master server that manages the file system namespace and regulates access to files by clients.In addition, there are a number of DataNodes, usually one per node in the cluster, which manage storage attached to the nodes that they run on. HDFS exposes a file system namespace and allows user data to be stored in files. Internally, a file is split into one or more blocks and these blocks are stored in a set of DataNodes.

NameNode : They executes file system namespace operations like opening, closing, and

renaming files and directories. It also determines the mapping of blocks to DataNodes. The Namenode actively monitors the number of replicas of a block. When a replica of a block is lost due to a DataNode failure or disk failure, the NameNode creates another



replica of the block. The NameNode maintains the namespace tree and the mapping of blocks to DataNodes, holding the entire namespace image in RAM.

DataNodes : are responsible for serving read and write requests from the file system’s

clients. The DataNodes also perform block creation, deletion, and replication upon instruction from the NameNode.

The NameNode and DataNode are pieces of software designed to run on commodity machines. These machines typically run a GNU/Linux operating system (OS). HDFS is built using the Java language; any machine that supports Java can run the NameNode or the DataNode software. Usage of the highly portable Java language means that HDFS can be deployed on a wide range of machines. A typical deployment has a dedicated machine that runs only the NameNode software. Each of the other machines in the cluster runs one instance of the DataNode software. HDFS provide automatic block replication if any nodes fail. A failed disk or node need not to be repaired immediately unlike RAID systems.Typically repair is done periodically for a collection of failures which makes it more efficient.





Central to the scalability of Apache Hadoop is the distributed processing framework known as MapReduce. MapReduce helps programmers solve data-parallel problems for which the data set can be sub-divided into small parts and processed independently. MapReduce is an important advance because it allows ordinary developers, not just those skilled in high-performance computing, to use parallel programming constructs without worrying about the complex details of intra-cluster communication, task monitoring, and failure handling. MapReduce simplifies all that.

MapReduce is a programming model for processing large datasets distributed on a large cluster. MapReduce is the heart of Hadoop. Its programming paradigm allows performing massive data processing across thousands of servers configured with Hadoop clusters. This is derived from Google MapReduce. Hadoop MapReduce is a software framework for writing applications easily, which process large amounts of data (multiterabyte datasets) in parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.


MapReduce is also implemented over master-slave architectures. Classic MapReduce contains job submission, job initialization, task assignment, task execution, progress and status update, and job completion-related activities, which are mainly managed by the JobTracker node and executed by TaskTracker. Client application submits a job to the JobTracker. Then input is divided across the cluster. The JobTracker then calculates the number of map and reducer to be processed. It commands the



TaskTracker to start executing the job. Now, the TaskTracker copies the resources to a local machine and launches JVM to map and reduce program over the data. Along with this, the TaskTracker periodically sends update to the JobTracker, which can be considered as the heartbeat that helps to update JobID, job status, and usage of resources.

• JobTracker : This is the master node of the MapReduce system, which manages the jobs and resources in the cluster (TaskTrackers). The JobTracker tries to schedule each map as close to the actual data being processed on the TaskTracker, which is running on the same DataNode as the underlying block.

• TaskTracker : These are the slaves that are deployed on each machine. They are responsible for running the map and reducing tasks as instructed by the JobTracker.




Hadoop MapReduce is a software framework for writing applications easily, which process large amounts of data (multiterabyte datasets) in parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner. This MapReduce paradigm is divided into two phases, Map and Reduce that mainly deal with key and value pairs of data. The Map and Reduce task run sequentially in a cluster; the output of the Map phase becomes the input for the Reduce phase. These phases are explained as follows:

• Map phase: Once divided, datasets are assigned to the task tracker to perform the Map phase. The data functional operation will be performed over the data, emitting the mapped key and value pairs as the output of the Map phase.

• Reduce phase: The master node then collects the answers to all the

subproblems and combines them in some way to form the output; the answer to the problem it was originally trying to solve.



The five common steps of parallel computing are as follows:

1. Preparing the Map() input: This will take the input data row wise and emit key value pairs per rows, or we can explicitly change as per the requirement.

Map input: list (k1, v1) 2. Run the user-provided Map() code

Map output: list (k2, v2)

3. Shuffle the Map output to the Reduce processors. Also, shuffle the similar keys (grouping them) and input them to the same reducer.

4. Run the user-provided Reduce() code: This phase will run the custom reducer code designed by developer to run on shuffled data and emit key and value.

Reduce input: (k2, list(v2)) Reduce output: (k3, v3)

5. Produce the final output: Finally, the master node collects all reducer output and combines and writes them in a text file.




Traditional RDBMS Map Reduce Data Size Gigabyte(Terabytes) Pentabytes(Exabytes) Data Format Relational Tables Key / Value Pairs Access Interactive and Batch Batch

Updates Read and Write many times Write once,Read many Structure Static schema Dynamic schema

Integrity High(ACID) Low

Scaling Non Linear Linear

Processing Online Transactions Offline Batch Processing Querying Declarative Queries Functional Programming





Apache Hadoop ecosystem consist of many other implementations apart from hadoop hdfs and hadoop map reduce.They include :

1.Apache PIG (Scripting platform) : Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turns enables them to handle very large data sets. At the present time,Pig's infrastructure layer consists of a compiler that produces sequences of Map-Reduce programs. Pig's language layer currently consists of a textual language called Pig Latin, which is easy to use, optimized, and extensible.

2.Apache Hive : Hive is a data warehouse system for Hadoop that facilitates easy data

summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop compatible file systems. It provides a mechanism to project structure onto this data and query the data using a SQL-like language called HiveQL. Hive also allows traditional map/reduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express this logic in HiveQL.

3. Apache HBase : HBase (Hadoop DataBase) is a distributed, column oriented

database. HBase uses HDFS for the underlying storage. It supports both batch style computations using MapReduce and point queries (random reads).

The main components of HBase are as described below:

 HBase Master is responsible for negotiating load balancing across all Region Servers and maintain the state of the cluster. It is not part of the actual data storage or retrieval path.

 RegionServer is deployed on each machine and hosts data and processes I/O requests.



4Apache Mahout : Mahout is a scalable machine learning library that implements many

different approaches to machine learning. The project currently contains implementations of algorithms for classification, clustering, frequent item set mining, genetic programming and collaborative filtering. Mahout is scalable along three dimensions: It scales to reasonably large data sets by leveraging algorithm properties or implementing versions based on Apache Hadoop.

5. Apache HCatalog : HCatalog is a metadata abstraction layer for referencing data

without using the underlying filenames or formats. It insulates users and scripts from how and where the data is physically stored.WebHCat provides a REST-like web API for HCatalog and related Hadoop components. Application developers make HTTP requests to access the Hadoop MapReduce, Pig, Hive, and HCatalog DDL from within the applications. Data and code used by WebHCat is maintained in HDFS. HCatalog DDL commands are executed directly when requested. MapReduce, Pig, and Hive jobs are placed in queue by WebHCat and can be monitored for progress or stopped as required. Developers also specify a location in HDFS into which WebHCat should place Pig, Hive, and MapReduce results.

6. Apache ZooKeeper : ZooKeeper is a centralized service for maintaining

configuration information, naming, providing distributed synchronization, and providing group services which are very useful for a variety of distributed systems. HBase is not operational without ZooKeeper.

7. Apache Flume : Flume is a top level project at the Apache Software Foundation.

While it can function as a general purpose event queue manager, in the context of Hadoop it is most often used as a log aggregator, collecting log data from many diverse sources and moving them to a centralized data store.




Big data is considered as the next big thing in the world of information technology. The use of Big Data is becoming a crucial way for leading companies to outperform their peers. Big Data will help to create new growth opportunities and entirely new categories of companies, such as those that aggregate and analyse industry data. Sophisticated analytics of big data can substantially improve decision-making, minimise risks, and unearth valuable insights that would otherwise remain hidden.

Big data analytics is the process of examining big data to uncover hidden patterns, unknown correlations and other useful information that can be used to make better decisions. With big data analytics, data scientists and others can analyze huge volumes of data that conventional analytics and business intelligence solutions can't touch.

Apache Hadoop is a popular open source big data analytics tool. Using Hadoop, we can store petabytes of data reliably on tens of thousands of servers while scaling performance cost-effectively by merely adding inexpensive nodes to the cluster. Instead of relying on expensive, proprietary hardware and different systems to store and process data, Hadoop enables distributed parallel processing of huge amounts of data across inexpensive, industry-standard servers that both store and process the data, and can scale without limits.




[1] Big data analysis using Apache Hadoop Nandimath, J. ; Banerjee, E. ; Patil, A. ; Kakade, P. ; Vaidya, S. Information Reuse and Integration (IRI), 2013 IEEE 14th International Conference on

[2] Extending Map-Reduce for Efficient Predicate-BasedSampling

Grover, R. ; Carey, M.J. Data Engineering (ICDE), 2012 IEEE 28th International Conference on

[3] Hadoop: Addressing challenges of Big Data Singh, K. ; Kaur, R. Advance Computing Conference (IACC), 2014 IEEE International [4] Big Data analytics frameworks Chandarana, P. ; Vijayalakshmi, M.

Circuits, Systems, Communication and Information Technology Applications (CSCITA), 2014 International Conference on

[5] MapReduce: Simplified Data Processing on Large

Clusters.Available at pdf

[6] “Pro Hadoop- build scalable distributed applications in the cloud” by Jason Venner

[7] “Hadoop the definitive guide” by Tom White publisher: O’ REILLY



Related subjects :