FINE-GRAIN INCREMENTAL PROCESSING OF
MAPREDUCE AND MINING IN BIG DATA
ENVIRONMENT
S.SURESH KUMAR, Jay Shriram Group of Institutions
Tirupur
sureshmecse25@gmail.com
Mr.A.M.RAVISHANKKAR M.E.,
Assistant Professor, Jay Shriram Group of
Institutions, Tirupur
ravi662shankkar@gmail.com
Dr.S.RAJALAKSHMI Ph.D.,
Associate Professor, Jay Shriram Group of
Institutions, Tirupur,
mrajislm@gmail.com
ABSTRACT
As new data and updates continuously arrive, the results of data mining applications become stale and obsolete over time. Incremental processing is a promising approach to refreshing mining results: it utilizes previously saved states to avoid the expense of re-computation from scratch. We propose i2MapReduce, a novel incremental processing extension to MapReduce, the most widely used framework for mining big data. Compared with the state-of-the-art work on Incoop, i2MapReduce performs key-value pair level incremental processing rather than task-level re-computation, supports not only one-step computation but also more sophisticated iterative computation, which is widely used in data mining applications, and incorporates a set of novel techniques to reduce I/O overhead for accessing preserved fine-grain computation states. We evaluate i2MapReduce using a one-step algorithm and four iterative algorithms with diverse computation characteristics.
Keywords
Big Data, MapReduce, Incremental Processing, Data Mining.
1. INTRODUCTION
The MapReduce framework offers convenient, distributed processing of data by enabling a simple coding model that relieves the programmer of implementing complex logic or infrastructure for parallelization, data transfer, fault tolerance, and scheduling. An important property of the
workloads processed by MapReduce applications is that
they are often incremental by nature; i.e., MapReduce
jobs often run repeatedly with small changes in their
input. For example, search engines will periodically
crawl the Web and perform various computations on
this input, such as computing a Web index or the
PageRank metric, often with very small modifications.
This growing nature of data suggests that
performing large-scale computations incrementally can
improve efficiency dramatically. Broadly speaking, there are two methods to achieve such efficient incremental updates. The first method would be to devise systems that provide the programmer with facilities to save and reuse state across successive runs, so that only the computations affected by changes to the input need to be executed. This is precisely the strategy taken by major Internet companies, who developed systems like Percolator or CBP. However, this method requires adopting a new programming model and a new Application Programming Interface (API) that differs from the one used by MapReduce. These new APIs also require the coder to devise an incremental update algorithm, which adds algorithmic and software complexity. Research in the algorithms community on dynamic algorithms shows that such algorithms can be very complex even for problems that are relatively straightforward in the non-incremental case. The second
method would be to develop systems that can reuse the
results of prior computation transparently. This method
would shift the complexity of incremental processing
from the programmer to the processing system,
essentially keeping the spirit of high-level models such
as MapReduce. A few methods have taken this
approach, e.g., DryadInc, and Nectar, in the context of
the Dryad system by providing techniques for task-level
or LINQ expression-level memorization.
Fig. 1 Architecture diagram
The design of Incoop contains the following
new methods that we incorporated into the Hadoop
MapReduce framework, and which showed to be
instrumental in achieving efficient incremental
computations.
Incremental HDFS. Instead of relying on HDFS to store the input to MapReduce jobs, we devise a file
system called Inc-HDFS that provides mechanisms to
identify similarities in the input data of consecutive job
runs. The idea is to divide the input into chunks whose
boundaries depend on the file contents so that small
changes to the input do not shift all chunk boundaries. Inc-HDFS therefore partitions the input in a way that maximizes the opportunities for reusing results from previous computations, while preserving compatibility with HDFS by providing the same interface and semantics.
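A minimal sketch of such content-defined chunking follows; the rolling-hash window, mask, and resulting average chunk size are illustrative assumptions, not Inc-HDFS's actual parameters.

    import java.io.IOException;
    import java.io.InputStream;
    import java.util.ArrayList;
    import java.util.List;

    // Content-defined chunking sketch: a boundary is declared whenever the
    // low bits of a rolling hash over a small byte window match a pattern,
    // so an insertion early in the file only shifts boundaries locally.
    public class ContentChunker {
        private static final int WINDOW = 48;            // sliding window (assumed)
        private static final long MASK = (1 << 11) - 1;  // tunes average chunk size

        public static List<Long> chunkBoundaries(InputStream in) throws IOException {
            List<Long> boundaries = new ArrayList<>();
            int[] window = new int[WINDOW];
            long hash = 0, offset = 0;
            int b;
            while ((b = in.read()) != -1) {
                int idx = (int) (offset % WINDOW);
                hash = hash - window[idx] + (b & 0xff);  // slide: drop old byte, add new
                window[idx] = b & 0xff;
                offset++;
                if (offset >= WINDOW && (hash & MASK) == MASK) {
                    boundaries.add(offset);              // content-defined boundary
                }
            }
            if (offset > 0) boundaries.add(offset);      // final chunk ends at EOF
            return boundaries;
        }
    }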
Contraction phase. We introduce methods for controlling the granularity of tasks so that large tasks
can be divided into smaller subtasks that can be re-used
even when the large tasks cannot. This is particularly
challenging in Reduce tasks, whose granularity depends
solely on their input. Our solution is to introduce a new
Contraction phase that leverages Combiner functions,
normally used to reduce network traffic by anticipating
a small part of the processing done by Reducer tasks, to
control the granularity of the Reduce tasks.
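For concreteness, the kind of Combiner function the Contraction phase leverages is an ordinary Hadoop Reducer applied to map-side partial data; the word-count combiner below is a minimal sketch. It can safely be applied at any granularity because addition is associative, which is exactly what lets the Contraction phase split a large aggregation into small, independently memoizable sub-aggregations.

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    // A standard Hadoop Combiner: pre-aggregates map output locally.
    public class SumCombiner extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            result.set(sum);
            ctx.write(key, result); // partial sum; safe because addition is associative
        }
    }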
Memoization-aware scheduler. To increase the effectiveness of memoization, we propose an affinity-based scheduler that uses a work-stealing algorithm to reduce the amount of data movement across machines.
Our new scheduler strikes a balance between exploiting
the locality of previously computed results and
executing tasks on any available machine to prevent
straggling effects.
Use cases. We employ Incoop to demonstrate two important use cases of incremental processing:
incremental log processing, where we use Incoop to
build a framework to incrementally process logs as
more entries are added to them; and incremental query
processing, where we layer the Pig framework on top of
Incoop to enable relational query processing over incrementally changing data.
When we design and implement an
incremental computation framework, many factors
including algorithm accuracy, run time, and space
overhead need to be taken into consideration. This paper focuses on porting parallel algorithms based on the MapReduce model and on the compatibility of non-incremental and incremental processing. A parallel programming framework is
presented, which aims to be compatible with the
original MapReduce APIs so that programmers do not
need to rewrite the algorithms. It makes the following
contributions:
Incremental MapReduce framework. It supports incremental data input, incremental data processing,
intermediate state preservation, incremental map and
reduce functions. The input manager can dynamically
discover new inputs and then submit jobs to the master
node.
Dynamic resource allocation based on the state. The state provides an important reference for resource
request and allocation of the next execution. State
information includes prior processing results,
intermediate results, execution time, and the number of
reduce tasks. Input data, acting as the observation, will
change the current state into a new state after the job
finishes.
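As an illustration, the state consulted for resource allocation might be organized as in the sketch below; the field names and the reducer-sizing heuristic are assumptions for illustration, not IncMR's actual API.

    // Hypothetical record of per-job state consulted when sizing the next
    // incremental run; fields mirror the state items listed above.
    public class JobState {
        public String priorResultPath;        // output of the previous run
        public String intermediateStatePath;  // preserved intermediate results
        public long lastExecutionMillis;      // execution time of the previous run
        public int numReduceTasks;            // reducer count used last time

        // Simple assumed heuristic: scale reducers with the ratio of new to old input.
        public int suggestReducers(long newInputBytes, long oldInputBytes) {
            double ratio = (double) newInputBytes / Math.max(1, oldInputBytes);
            return Math.max(1, (int) Math.ceil(numReduceTasks * ratio));
        }
    }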
Friendly APIs for applications. The method submitJob() in class JobClient is overloaded. Users only need to
follow the method parameters to submit their jobs
without modifying their algorithms or application
programs. Furthermore, for continuous inputs, users can
get updated outputs in time.
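A sketch of how such a submission might look is given below. The overloaded form taking a state path is a hypothetical signature on our part (the paper only states that submitJob() is overloaded), so it is left as a comment and the stock Hadoop call is shown.

    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.RunningJob;

    // Submitting an incremental job without modifying the algorithm code.
    public class IncrementalSubmit {
        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(IncrementalSubmit.class);
            conf.setJobName("wordcount-incremental");
            // ... set mapper, reducer, and input/output paths as in plain MapReduce ...

            JobClient client = new JobClient(conf);
            // Hypothetical overloaded form (signature assumed):
            //   RunningJob job = client.submitJob(conf, "/state/wordcount/prev");
            RunningJob job = client.submitJob(conf); // stock API shown for runnability
            job.waitForCompletion();
        }
    }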
2. RELATED WORKS
The Page Rank algorithm computes ranking
scores of web pages based on the web graph structure
for supporting web search. The web graph structure is constantly evolving: web pages and hyperlinks are created, deleted, and updated. As the underlying web
graph evolves, the PageRank ranking results gradually
become stale, potentially lowering the quality of web
search. It is desirable to refresh the PageRank
computation regularly. Incremental processing is a
promising approach to refreshing mining results. Given
the size of the input big data, it is often very expensive
to rerun the entire computation from scratch.
Incremental processing exploits the fact that the input
data of two subsequent computations A and B are
similar. Only a very small fraction of the input data has
changed. The idea is to save states in computation A,
use A’s states in computation B, and perform
re-computation only for states that are affected by the
changed input data.
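For concreteness, the PageRank iteration being refreshed is the standard update, where $d$ is the damping factor and $N$ the number of pages:

    $$PR^{(k+1)}(v) = \frac{1-d}{N} + d \sum_{u \in In(v)} \frac{PR^{(k)}(u)}{|Out(u)|}$$

Under incremental processing, the saved $PR$ values act as computation A's states; in computation B, only vertices whose in-links have changed need their sums recomputed.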
A number of existing studies have followed
this principle and designed new programming models to
support incremental processing. The new programming
models are drastically different from MapReduce,
requiring programmers to completely re-implement
their algorithms.
Incoop extends MapReduce to support
incremental processing. It has two main limitations.
First, Incoop supports only task-level incremental
processing. It saves and reuses states at the granularity
of individual Map and Reduce tasks. Each task typically
processes a large number of key-value pairs. If Incoop
detects any data changes in the input of a task, it will
rerun the entire task. While this approach easily
leverages existing MapReduce features for state
savings, it may incur a large amount of redundant
computation if only a small fraction of kv-pairs have
changed in a task. Second, Incoop supports only
one-step computation, while important mining algorithms,
such as PageRank, require iterative computation.
Incoop would treat each iteration as a separate
MapReduce job. However, a small number of input data changes may gradually propagate to affect a large portion of intermediate states after a number of iterations, resulting in expensive global re-computation afterwards.
In Pregel-like graph systems, the computation is broken down into a sequence of supersteps. In each superstep, a Compute function is invoked on each vertex; it communicates with other vertices by sending and receiving messages and performs computation for the current vertex. This method can efficiently support a large number of iterative graph algorithms. CBP provides a group-wise processing operator, Translate, that takes state as an explicit input to support incremental analysis, but it adopts a new programming model that is very different from MapReduce. Several research studies support incremental processing by task-level re-computation, but they require users to manipulate the states on their own. In contrast, i2MapReduce exploits fine-grain kv-pair level re-computation, which is more advantageous.
For incremental processing of iterative applications, one line of work proposes a timely dataflow paradigm that allows stateful computation and arbitrarily nested iterations. To support incremental iterative computation with it, however, programmers have to completely rewrite their MapReduce programs. We instead extend the widely used MapReduce model for incremental iterative computation; previous MapReduce programs need only slight changes to run on i2MapReduce for incremental processing.
A. Limitations of MapReduce
The MapReduce programming model is popular due to its simplicity and high efficiency. In the MapReduce framework, when a job is submitted, the related input is partitioned into fixed-size pieces called blocks, which are located on different data nodes. A job creates multiple splits according to the number of blocks, and each split is processed independently as the input of a separate map task. A map task in a worker node runs the user-defined map function for each record in the split and writes its output to the local disk. When there are multiple reducers, each map task partitions its output, and each partition is consumed by one reduce task. Reduce tasks write their outputs to the distributed file system.
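The canonical word-count job illustrates this flow using the standard Hadoop API:

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // One map task per split runs map() on each record; map output is
    // partitioned by key, and each reduce task aggregates one partition.
    public class WordCount {
        public static class TokenMapper
                extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(LongWritable offset, Text line, Context ctx)
                    throws IOException, InterruptedException {
                StringTokenizer tok = new StringTokenizer(line.toString());
                while (tok.hasMoreTokens()) {
                    word.set(tok.nextToken());
                    ctx.write(word, ONE);          // emit (word, 1) to local disk
                }
            }
        }

        public static class SumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text word, Iterable<IntWritable> counts, Context ctx)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable c : counts) sum += c.get();
                ctx.write(word, new IntWritable(sum)); // written to the DFS
            }
        }
    }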
Despite its powerful automatic parallelization with
strong fault-tolerance, the original MapReduce exhibits
the following limitations.
Stateless. When map and reduce tasks finish, they write their outputs to a local disk or a distributed file system,
and then inform the scheduler. When a job completes, related intermediate outputs are deleted by a cleanup mechanism. When new data or input arrives, a new job needs to be created to process it. This is just like HTTP, a stateless protocol, which provides no means of storing a user's data between requests. To some extent, we can say that the MapReduce model is likewise stateless.
Stage independent. A MapReduce job can be divided into two stages: map stage and reduce stage. Each stage
will not interrupt the other’s execution. In the map
stage, each computing thread executes the map method
according to the input split allocated to it, and writes the
output to the local node. In the reduce stage, each
reduce thread fetches input from designated nodes,
executes the reduce method, and writes the output to the
specified file system. All map tasks and reduce tasks
execute their codes without disturbing each other.
Single-step. Map tasks and reduce tasks execute only once, in order, for a job. Map tasks may finish at different times, and reduce tasks start copying map outputs as soon as those maps complete successfully.
B. Extension of MapReduce
Because of the limitations listed above, many frameworks have been proposed to extend MapReduce toward incremental computation. They feature different input patterns, separate or coupled control approaches between state and dataflow, and compatible or newly added interfaces. Batch parallel processing refers to frameworks that provide highly efficient large-scale parallel computation over one batch input and produce one output, such as Google MapReduce, Hadoop, and Dryad/DryadLINQ.
Incremental algorithms, though more complicated than the original algorithms, can improve the runtime; their input includes newly added data and the latest running result. Continuous bulk
processing is another kind of solution to support
incremental processing by providing new primitives for
developers to design delicate dataflow oriented
applications. It takes the secondary results of the prior
executions as a part of explicit input. CBP of Yahoo
and Percolator of Google both provide such incremental
computation frameworks.
Incremental computation based on
MapReduce, making full use of MapReduce
programming model, supports incremental processing
by modifying the kernel implementation of map and
reduces stages. Because HDFS does not support
appends currently, some approaches also modify the
distributed file system to support incremental data
discovery and intermediate result storage.
IncMR is an improved method for large-scale
incremental data processing. The framework inherits the simplicity of the original MapReduce model. It
doesn’t modify HDFS and still uses the same APIs of
MapReduce. All algorithms or programs can complete
incremental data processing without any modification.
Compatibility. Original MapReduce interfaces are retained to avoid incompatibility with the existing
implementation of applications. New interfaces are
added by overloading methods. HDFS is still used
without modification.
Transparency. Users don’t need to know how their incremental data are processed. All state data including
their storage paths are transparent to users.
Reduced resource usage. Computing resource request and allocation are determined by the historical state
information and the size of newly added data. Dynamic scheduling decisions are useful for minimizing the overhead. Several important modules are added in
IncMR framework. The input manager is used to find
the newly added data automatically. According to the
execution delay configuration or input size threshold, it
determines whether a new data processing run should be started. The job scheduler, unlike a traditional job scheduler that determines the number of tasks to run only according to a configuration file defined in advance, takes the state into consideration when choosing nodes for reduce tasks. A state manager stores all needed information for incremental jobs and provides decision support for the job scheduler. An output manager is responsible for the storage and update of all outputs.
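A minimal sketch of the input manager's discovery loop is given below; the polling policy and names are illustrative assumptions, not IncMR's actual implementation.

    import java.io.IOException;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Polls an input directory, sums the size of files newer than the last
    // run, and signals a new incremental job when a size threshold is crossed.
    public class InputManager {
        private final long sizeThresholdBytes;
        private long lastRunTimestamp;

        public InputManager(long thresholdBytes) {
            this.sizeThresholdBytes = thresholdBytes;
        }

        public boolean shouldStartJob(FileSystem fs, Path inputDir) throws IOException {
            long newBytes = 0;
            for (FileStatus f : fs.listStatus(inputDir)) {
                if (f.getModificationTime() > lastRunTimestamp) {
                    newBytes += f.getLen();       // count only newly added data
                }
            }
            return newBytes >= sizeThresholdBytes;
        }

        public void markRunStarted() { lastRunTimestamp = System.currentTimeMillis(); }
    }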
C. Locality control and optimization
Job scheduler is responsible for creating map
tasks and reduce tasks. The main overhead of IncMR
lies in the storage and transmission of many
intermediate results. In the shuffle phase, the
framework fetches the relevant partition of the output of
all the mappers to each reduce task node via HTTP. The shuffle and sort phases occur simultaneously; while map outputs are being fetched, they are merged. Locality
control is always used for optimization of job and task
scheduling.
Naive locality control. When there is only one reduce task and it is always located on the same node, the reduce task can fetch cached state from the local node instead of fetching it from other nodes. Basically, it is simplest to control the data flow when there is only one reduce task.
Complex locality control. If the number of reduce tasks is more than one and is not changed, the job scheduler
will try to allocate reduce tasks to the nodes that have
performed reduce tasks recently because they have
cached related state. What’s more, the fetched prior
results are also sorted. This locality control will save
plenty of time for data transferring. If the number of
reduce tasks or the position of a reduce task is changed,
state fetching and data repartition operations can’t be
avoided.
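The placement policy described above can be sketched as follows; this is illustrative logic only, not IncMR's scheduler code.

    import java.util.List;
    import java.util.Map;

    // Prefer the node that ran reduce task i last time (its state is cached
    // and pre-sorted there); fall back to any free node, accepting the extra
    // state fetching and repartitioning cost.
    public class ReducePlacement {
        // previousHosts.get(i) = host that executed reduce task i in the last run
        public static String pickHost(int reduceId,
                                      Map<Integer, String> previousHosts,
                                      List<String> freeHosts) {
            String preferred = previousHosts.get(reduceId);
            if (preferred != null && freeHosts.contains(preferred)) {
                return preferred;                 // reuse cached, sorted state locally
            }
            return freeHosts.get(0);              // fallback: repartition/fetch state
        }
    }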
Iterative MapReduce. Iterative computing widely exists in data mining, text processing, graph processing,
and other data intensive and computing-intensive
applications. The MapReduce programming paradigm was initially designed for single-step execution. Its high efficiency and simplicity have attracted much enthusiasm for applying it to iterative environments and algorithms.
HaLoop, a runtime based on Hadoop, supports iterative
data analysis applications especially for large-scale
data. By caching the related invariant data and reducers’
local outputs for one job, it can execute recursively.
Twister is a lightweight MapReduce runtime. It uses
publish/subscribe messaging infrastructure for
communication and supports iterative task execution. In
order to provide solutions for applications such as data
mining or social network analysis with relational data,
iMapReduce is designed to support automatically
processing iterative tasks by reusing the prior task
processors and eliminating the shuffle load of the static
data. Additionally, iterative computing is an
indispensable part in incremental processing. We also
apply related iterative computing methods in our
IncMR framework.
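For reference, iteration on plain Hadoop is typically driven by a client-side loop such as the sketch below; runtimes like HaLoop, Twister, and iMapReduce optimize away the repeated scheduling and re-shuffling of static data that this baseline incurs.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    // Baseline iterative driver: chains jobs, feeding iteration i's output
    // to iteration i+1 as input.
    public class IterativeDriver {
        public static void run(Configuration conf, Path input, Path workDir,
                               int maxIterations) throws Exception {
            Path current = input;
            for (int i = 0; i < maxIterations; i++) {
                Job job = Job.getInstance(conf, "iteration-" + i);
                job.setJarByClass(IterativeDriver.class);
                // job.setMapperClass(...); job.setReducerClass(...); // app-specific
                FileInputFormat.addInputPath(job, current);
                Path next = new Path(workDir, "iter-" + i);
                FileOutputFormat.setOutputPath(job, next);
                if (!job.waitForCompletion(true)) {
                    throw new RuntimeException("iteration " + i + " failed");
                }
                current = next;  // this round's output becomes next round's input
            }
        }
    }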
Continuous MapReduce. Ad-hoc data processing is a critical paradigm for wide-scale data processing, especially for unstructured data. Prior work implements an ad-hoc data processing abstraction in a distributed stream processor based on the MapReduce programming model to support continuous inputs. MapReduce Online adopts a pipelining technique within a job and between jobs, supports single-job and multi-job online aggregation, and also provides database-style continuous queries over data streams. Other work combines the MapReduce programming model with the continuous query model characterized by Cut-Rewind to process dynamic stream data chunks. CMR, continuous
MapReduce, is an architecture for continuous and
large-scale data analysis. Continuous processing is a special
case of incremental processing. Although the input is
continuous, the processing is discrete. Time interval or
delay is an important factor to be considered. IncMR
presented in this paper supports continuous processing.
Applications based on MapReduce. When programming in the MapReduce framework, all we need to do is prepare the input data and implement the mapper, the reducer, and, optionally, the combiner and the
partitioner. The execution is handled transparently by
the framework in clusters ranging from a single node to
thousands of nodes, with data sets ranging from gigabytes to petabytes and different data structures including relational databases, text, graphs, video, and audio. Through a simple interface with two functions,
map and reduce, MapReduce model facilitates parallel
implementation of many real-world tasks such as data
processing for search engines and machine learning.
MapReduce is now popularly used in scientific
computing fields too. When new input arrives, it is a challenge for applications to continuously deal with it based on the original MapReduce. This paper addresses that challenge.
3. OUR WORK
We propose i2MapReduce, an extension to
MapReduce that supports fine-grain incremental
processing for both one step and iterative computation.
Compared to previous solutions, i2MapReduce
incorporates the following three novel features:
Fine-grain incremental processing using
MRBG-Store
Unlike Incoop, i2MapReduce supports kv-pair level fine-grain incremental processing in order to minimize the amount of re-computation as much as possible. We model the kv-pair level data flow and data dependence in a MapReduce computation as a bipartite graph, called MRBGraph. An MRBG-Store is designed to
preserve the fine-grain states in the MRBGraph and
support efficient queries to retrieve fine-grain states for
incremental processing.
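A toy view of the MRBGraph bookkeeping is sketched below; the in-memory map stands in for the on-disk MRBG-Store, and all names are illustrative.

    import java.util.HashMap;
    import java.util.Map;

    // Each edge records the intermediate kv-pair that Map instance K1 produced
    // for Reduce instance K2. On an incremental run, only edges whose K1
    // changed are recomputed; the rest are read back from the store.
    public class MRBGraph<K1, K2, V> {
        // (mapKey, reduceKey) -> preserved intermediate value
        private final Map<K1, Map<K2, V>> edges = new HashMap<>();

        public void preserve(K1 mapKey, K2 reduceKey, V value) {
            edges.computeIfAbsent(mapKey, k -> new HashMap<>()).put(reduceKey, value);
        }

        // Query the saved state of a changed map instance so its old
        // contribution can be retracted before the fresh one is merged in.
        public Map<K2, V> query(K1 mapKey) {
            return edges.getOrDefault(mapKey, new HashMap<>());
        }
    }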
General-purpose iterative computation
With a modest extension to the MapReduce API,
previous work proposed iMapReduce to efficiently
support iterative computation on the MapReduce
platform. It targets types of iterative computation where
there is a one-to-one/all-to-one correspondence from
Reduce output to Map input. In comparison, our current
method provides general-purpose support, including not
only one-to-one, but also one-to-many, many-to-one,
and many-to-many correspondences. We enhance the Map API to allow users to easily express loop-invariant structure data, and we propose a Project API function to express the correspondence from Reduce to Map. While
users need to slightly change their algorithms in order
to take full advantage of i2MapReduce, such
modification is modest compared to the effort to
re-implement algorithms on a completely different
programming paradigm.
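A hedged sketch of what such a Project function might look like follows; the actual i2MapReduce API may differ in names and types. Returning one key expresses a one-to-one correspondence, while returning several expresses one-to-many.

    import java.util.Collection;
    import java.util.Collections;
    import java.util.List;
    import java.util.Map;

    // Given a Reduce output key, enumerate the Map input keys of the next
    // iteration that depend on it.
    public interface Project<RK, MK> {
        Collection<MK> project(RK reduceOutputKey);
    }

    // Example for PageRank on a web graph: page u's new rank must reach every
    // page v that u links to, a one-to-many correspondence.
    class PageRankProject implements Project<Long, Long> {
        private final Map<Long, List<Long>> outLinks;
        PageRankProject(Map<Long, List<Long>> outLinks) { this.outLinks = outLinks; }

        @Override
        public Collection<Long> project(Long page) {
            return outLinks.getOrDefault(page, Collections.emptyList());
        }
    }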
Incremental processing for iterative computation.
Incremental iterative processing is
substantially more challenging than incremental
one-step processing because even a small number of updates
may propagate to affect a large portion of intermediate states after a number of iterations. To address this problem, we propose to reuse the converged state from the
previous computation and employ a change propagation
control (CPC) mechanism. We also enhance the
MRBG-Store to better support the access patterns in
incremental iterative processing. i2MapReduce is the
first MapReduce-based solution that efficiently supports
incremental iterative computation.
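The pruning rule at the heart of change propagation control can be sketched as below; the simple threshold test is an assumed rule for illustration, not necessarily the exact CPC criterion.

    // Propagate a reduce output to the next iteration only if it differs from
    // the preserved converged value by more than a threshold; tiny deltas are
    // pruned to stop changes from rippling through the whole computation.
    public class ChangePropagationFilter {
        private final double threshold;

        public ChangePropagationFilter(double threshold) { this.threshold = threshold; }

        // oldValue: converged result preserved from the previous computation
        public boolean shouldPropagate(double oldValue, double newValue) {
            return Math.abs(newValue - oldValue) > threshold;
        }
    }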
4. TECHNIQUES
The implemented techniques and example applications include:
i2MapReduce
Query Algorithm in MRBG-Store
PageRank in MapReduce
Kmeans in MapReduce
GIM-V (Generalized Iterated Matrix-Vector multiplication) in MapReduce
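As one example from this list, a single k-means iteration is commonly expressed in MapReduce as sketched below (a generic formulation, not necessarily i2MapReduce's exact implementation): map assigns each point to its nearest centroid, and reduce averages each cluster's points into a new centroid.

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class KMeansRound {
        public static class AssignMapper extends Mapper<Object, Text, IntWritable, Text> {
            private double[][] centroids;

            @Override
            protected void setup(Context ctx) {
                // Illustrative: centroids serialized in the job configuration as "x,y;x,y;..."
                String s = ctx.getConfiguration().get("kmeans.centroids", "0,0;1,1");
                String[] cs = s.split(";");
                centroids = new double[cs.length][];
                for (int i = 0; i < cs.length; i++) {
                    String[] dims = cs[i].split(",");
                    centroids[i] = new double[dims.length];
                    for (int j = 0; j < dims.length; j++)
                        centroids[i][j] = Double.parseDouble(dims[j]);
                }
            }

            @Override
            protected void map(Object key, Text value, Context ctx)
                    throws IOException, InterruptedException {
                String[] parts = value.toString().split(",");
                double[] p = new double[parts.length];
                for (int i = 0; i < parts.length; i++) p[i] = Double.parseDouble(parts[i]);
                int best = 0;
                double bestDist = Double.MAX_VALUE;
                for (int c = 0; c < centroids.length; c++) {
                    double d = 0;
                    for (int i = 0; i < p.length; i++) {
                        double diff = p[i] - centroids[c][i];
                        d += diff * diff;
                    }
                    if (d < bestDist) { bestDist = d; best = c; }
                }
                ctx.write(new IntWritable(best), value); // point joins nearest cluster
            }
        }

        public static class RecenterReducer
                extends Reducer<IntWritable, Text, IntWritable, Text> {
            @Override
            protected void reduce(IntWritable cluster, Iterable<Text> points, Context ctx)
                    throws IOException, InterruptedException {
                double[] sum = null;
                long n = 0;
                for (Text t : points) {
                    String[] parts = t.toString().split(",");
                    if (sum == null) sum = new double[parts.length];
                    for (int i = 0; i < parts.length; i++)
                        sum[i] += Double.parseDouble(parts[i]);
                    n++;
                }
                StringBuilder sb = new StringBuilder();
                for (int i = 0; i < sum.length; i++) {
                    if (i > 0) sb.append(',');
                    sb.append(sum[i] / n);               // new centroid coordinate
                }
                ctx.write(cluster, new Text(sb.toString()));
            }
        }
    }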
5. CONCLUSIONS
We have described i2MapReduce, a
MapReduce-based framework for incremental big data
processing. i2MapReduce combines a fine-grain
incremental engine, a general-purpose iterative model,
and a set of effective techniques for incremental
iterative computation. Real-machine experiments show
that i2MapReduce can significantly reduce the run time
for refreshing big data mining results compared to
re-computation on both plain and iterative MapReduce.
In future work, we plan to extend the evaluation with a one-step algorithm and four iterative algorithms with diverse computation characteristics; the experimental results to date already show significant performance improvements over re-computation on plain MapReduce.
REFERENCES
[1] J. Dean and S. Ghemawat, “Mapreduce:
Simplified data processing on large clusters,”
in Proc. 6th Symp. Oper. Syst. Des.
Implementation, 2004, p. 10.
[2] M. Zaharia, M. Chowdhury, T. Das, A. Dave,
J. Ma, M. McCauley, M. J. Franklin, S.
Shenker, and I. Stoica, “Resilient distributed
datasets: A fault-tolerant abstraction for
in-memory cluster computing,” in Proc. 9th
USENIX Conf. Netw. Syst. Des.
Implementation, 2012, p. 2.
[3] R. Power and J. Li, “Piccolo: Building fast,
distributed programs with partitioned tables,”
in Proc. 9th USENIX Conf. Oper. Syst. Des.
Implementation, 2010, pp. 1–14.
[4] G. Malewicz, M. H. Austern, A. J. Bik, J. C.
Dehnert, I. Horn, N. Leiser, and G.
Czajkowski, “Pregel: A system for large-scale
graph processing,” in Proc. ACM SIGMOD
Int. Conf. Manage. Data, 2010, pp. 135–146.
[5] S. R. Mihaylov, Z. G. Ives, and S. Guha, “Rex:
Recursive, deltabased data-centric
computation,” in Proc. VLDB Endowment,
2012, vol. 5, no. 11, pp. 1280–1291.
[6] Y. Low, D. Bickson, J. Gonzalez, C. Guestrin,
A. Kyrola, and J. M. Hellerstein, “Distributed
graphlab: A framework for machine learning
and data mining in the cloud,” in Proc. VLDB
Endowment, 2012, vol. 5, no. 8, pp. 716–727.
[7] S. Ewen, K. Tzoumas, M. Kaufmann, and V.
Markl, “Spinning fast iterative data flows,” in
Proc. VLDB Endowment, 2012, vol. 5, no. 11,
pp. 1268–1279.
[8] Y. Bu, B. Howe, M. Balazinska, and M. D.
Ernst, “Haloop: Efficient iterative data
processing on large clusters,” in Proc. VLDB
Endowment, 2010, vol. 3, no. 1–2, pp. 285–
296.
[9] J. Ekanayake, H. Li, B. Zhang, T. Gunarathne,
S.-H. Bae, J. Qiu, and G. Fox, “Twister: A
runtime for iterative mapreduce,” in Proc. 19th
ACM Symp. High Performance Distributed
Comput., 2010, pp. 810–818.
[10] Y. Zhang, Q. Gao, L. Gao, and C. Wang,
“imapreduce: A distributed computing
framework for iterative computation,” J. Grid Comput., 2012.