Hadoop Performance Monitoring Tools

Full text


Hadoop Performance Monitoring Tools

Kai Ren, Lianghong Xu, Zongwei Zhou

Abstract—Diagnosing performance problems in large-scale data-intensive programs is inherently difficult due to its scale and distributed nature. This paper presents a non-intrusive, effective, distributed monitoring tool to facilitate performance diagnosis on Hadoop platform, an open source implementation of the well-known MapReduce programming framework.

The tool explores data from Hadoop logs and aggregates per-job/per-node operating system level metrics on each cluster node. It provides multi-view visualization and view-zooming functionality to assist application developers in better reasoning about job execution progress and system behavior. Experimen-tal results show that, due to the extra levels of information it reveals as well as its effective visualization scheme, our tool is able to diagnose some kinds of problems that previous tools fail to.


Data intensive applications are gaining increasing pop-ularity and importance in various fields such as scientific computations and large scale Internet services. Several soft-ware programming frameworks have been proposed for such applications on clusers consisting of commodity machines, among which MapReduce[3] is the most well-known and generally understood. Hadoop [5] is an open-source Java implementation of MapReduce, which is currently widely used by many Internet services companies (e.g., Yahoo! and Facebook).

Monitoring Hadoop and diagnosing MapReduce perfor-mance problems, however, are inherently difficult due to its large scale and distributed nature. Debugging Hadoop programs by looking at its various logs is painful. Hadoop logs can get excessively large, which makes it impractical to handle them manually. Most current Hadoop monitoring tools such as Mochi [13] are merely log-based and lack certain important operating system level metrics, which may be a significant addition to the diagnosis process. Cluster monitoring tools such as Ganglia [7] expose per-node and cluster-wide OS metrics but provide no MapReduce-specific information.

We present a non-intrusive, effective, distributed Hadoop monitoring tool which targets at making Hadoop program debugging simpler. It extracts a number of useful MapRe-duce and File System (HDFS) level metrics from Hadoop logs and correlates them with operating system level metrics such as CPU and disk I/O utilization on a per-job/per-task basis. In order to effectively present these metrics to programmers, we propose a flexible multi-view visualization scheme, including cluster view, node view, job view and task

view, each of which represents a different level of granular-ity. We propose a view-zooming model to assist application developers in better reasoning about their MapReduce job execution progress and program behavior. As far as we know, our tool, along with its multi-view visulization and view-zooming model, is the first monitoring tool to cor-relate per-job/per-task level operating system metrics with MapReduce-specific metrics, and to support both online monitoring and offline analysis. We give some case studies to show how our tool can effectively diagnose performance problems in MapReduce programs, some of which cannot be easily diagnosed by other tools.

In the rest of this paper, Section II introduces the architecture of our monitoring tool, and Section III describes implementation details. We show some evaluation results in Section IV, and provides several examples to demonstrate the effectiveness of our tool in diagnosing MapReduce programs. Section VI compares our tool with related works, and Section VII concludes this paper.


Our non-intrusive monitoring tool explores data from Hadoop logs and extracts OS level information of job-related processes on each cluster node. We construct four different views of the collected metrics, namely cluster view, node view, job view, and task view, and propose a view-zooming model in order to assist system administrators and application developers to better reason about system behavior and job execution progess.

A. Metrics aggregation

In order to expose as much useful information as pos-sible, our monitoring tool collects a number of metrics in MapReduce, Hadoop File System (HDFS), and operating system levels. MapReduce and HDFS metrics are extracted from logs of Hadoop job tracker, task tracker and HDFS data node, while operating system metrics are gathered on each node in a cluster.

1) MapReduce and HDFS metrics: Hadoop provides var-ious logs detailing MapReduce job execution as well as the underlying data node activities. These logs get excessively large after a MapReduce job runs for a while. However, we observe that a large amount of information in the logs are redundant and overly trivial. As a result, there exists a good chance for us to filter only useful information from the massive logs. A close examination of all these logs enables us to extract a number of MapReduce and HDFS


level metrics that to our knowledge may be of interest to the potential users of our tool.

MapReduce level metrics are mainly obtained from job-tracker and taskjob-tracker logs. Jobjob-tracker log provides rich information of job execution. To be specific, it records job duration, lifetime span of each task and various job-related counters, including total launched maps and reduces, number of bytes read from and written to HDFS for each task, map input records, reduce shuffle bytes and so on. Tasktracker logs are distributed over every node and are sent to a central collector for analysis. They reveal detailed information of a task execution state such as the Java Virtual Machine ID associated with each task and the shuffle data flow including total shuffle bytes, source and destination node location. HDFS data node logs mainly describe data flow on the underlying file system level. A typical data node log entry contains operation type (read or write), source and destination network addresses of the two nodes involved, number of transfer bytes, and operation timestamp. We don’t look into master node log because it only contains metadata-related information, which we believe is of little interest to cluster administrators and application developers.

2) Operating system metrics: A distinguishing feature of our monitoring tool is the ability to aggregate per-node operating system level metrics within a cluster. For now, the exported metrics include CPU, memory, disk and network utilization on each node, which we believe can satisfy most need of our users for now.

B. Multi-view visulization

Our monitoring tool provides both coarse-grained and fine-grained views of MapReduce job execution. Specifi-cally, four levels of views-cluster view, job view, task view, and node view- are presented according to users’ need.

1) Cluster view: Cluster view constructs an overall per-ception of a cluster’s running state. It does so by visulizing resource utilization of multiple jobs on each node within the cluster. For example, cluster view shows that for every job executing within the temporal scope, a cluster view is able to present the change over time of the average CPU usage across the entire cluster. Cluster view offers useful information to system administrators and assists them to obtain a comprehensive perspective of the utilization of cluster resources. MapReduce program developers may also benefit from a cluster view, from which they can examine the concurrent running jobs during the execution of their own jobs. When there are a large number of concurrent jobs competing for the cluster resources, it would not be surprising if the performance of the job execution is worse than expectation.

2) Node view: Node view is similar to cluster view but it only exposes information related to a specific node. Because all the information are gathered on a single node, it is

reasonable to assume that node view reveals more find-grained data than cluster view, in which data is averaged across all the nodes and may hide some deviations among them. An example of node view is shown in Figure 8. The CPU percentage of Hadoop processes are aggregated against the host percentage during that period. From this figure, we can clearly see how the CPU resource is shared among the concurrent running processes on this node as well as how much portion a specific process takes as to the total aggregated percentage.

3) Job view: Job view provides coarse-grained informa-tion of the execuinforma-tion progress of a specific job. Job view fo-cuses on high-level job-centric information, including three main statistical distributions: task-based distribution, node-based distribution and time-node-based distribution. Figure 2 shows an exemplary task-based distribution of file read bytes for reducer tasks. Figure 3 gives an example of node-based distribution of shuffle bytes sent and received by each node. Time-based distribution is illustrated by Figure 4, which presents the duration of all the tasks involved in a job execution. Job view provides rich information about job execution status and is probably the most important view that application programmers would like to look into. It is easy to diagnose job execution abnormalies with the assistance of job view. For instance, it is not difficult to detect that a task runs for a much longer time than the other peer tasks or that a reducer task shuffles much more bytes than other reducers.

4) Task view: Sometimes job view is not sufficient to reveal all the underlying details about the execution state of a specific task. As a result, our tool presents task view, which provides fine-grained information about a specific task execution. It can be viewed as a part of job view, but with extra level of details in order to reveal all the information that to our knowledge application developers may care about

0 20 40 60 80 100 0 50 100 150 200 CPU perceptage (%) time job4011 job4010 tasktracker datanode host


a given task. Task view breaks task duration down into sereral periods, each representing a specific task execution state, as shown in Figure 5. To help users understand exactly which execution state causes the skewness, it also shows the average and standard deviation of execution time among all the peer tasks for each period. For instance, for the shuffle period of a reduce task, task view constructs the shuffle data flow including the source and destination node locations as well as the associated volume of transferred bytes, as shown in Figure 6.

C. View zooming

Above we have shown what the four views are and what they look like. We also proposes a view-zooming model to correlate these views together to assist application developers in reasoning about performance degeneration as illustrated in Figure 7. However, users are not necessarily restricted to this model and may leverage the multiple views provided by our tool according to their needs. We construct the four views in two dimensions: the physical dimension indicates whether a view’s scope is on a single node or aross the entire cluster, while the MR dimension represents the granularity of the views in terms of number of jobs and tasks.

For application developers, before launching any MapRe-duce jobs, they may want to have a look at the cluster view, which tells them how many jobs are concurrently running and how system resources are distributed among them, and then decide whether or not it is a good time to join the contention. After a job is finished, they may want to zoom in from the cluster view to the job view and focus on the execution progress of their own jobs. The three distributions presented by job view provides ample information for them to detect some, if any, performance problems. For example, from time-based or task-based distribution, they may find

0 2 4 6 8 10 12 14 16 18 0 100 200 300 400 500 600 HDFS Read Bytes (MB) Map Task ID

Figure 2. task-based distribution of shuffle bytes for reducer tasks

a suspect task consuming much more time to finish than its peer tasks. It is most likely that at this point applcation developers would like to zoom in to the task view of the morbid task to obtain the insights of the root cause of the problem. However, another possibility exists that application developers, by examining the node-distribution from the job view and detect some node exhibits abnormal behavior, would like to zoom in to node view to find out the detailed information on this node during the job execution.

In order to make the discussion concrete, we also in-vestigate how cluster administrators may benefit from these views. Because our monitoring tool correlates MapReduce metrics together with OS metrics, it can expose information as to the resource sharing situation of each job running in the cluster, which cluster administrators may use for service-oriented charging and can not be provided by traditional sys-tem monitoring tool such as Ganglia. In our view monitoring model, a typical path taken by cluster administrators may be zooming between cluster view and node view, which show the resource utilization on the entire cluster and a single node respectively.

III. IMPLEMENTATION A. Metric Aggregation

For the performance diagnosis of Hadoop map-reduce jobs, we gather and analyze OS-level peromance metrics for each map-reduce task, without requiring any modifications to the Hadoop system and the OS.

The basic idea to collect OS level metrics for each task is to infer them from its JVM process. Hadoop runs each map/reduce task in its own Java Virtual Machine to isolate them from other running tasks. In Linux, OS-level performance metrics for each process are made available as text files in the /proc/[pid]/ pseduo file system. It provides many aggregation counters for specific machine

0 5000 10000 15000 20000 25000 30000 35000 40000 45000 0 10 20 30 40 50 60 70 Shuffle bytes (MB) Machine ID send receive


0 200 400 600 800 1000 1200 0 10 20 30 40 50 60 70 80 90 100 Task ID Time (1000s) map reduce

Figure 4. time-based distribution of task durations

0 10000 20000 30000 40000 50000 60000 70000 80000

phase 1 phase 2 phase 3 phase 4 phase 5 total

Time (s)

reduce task no.28 average

Figure 5. task duration breakdown

resources. For examplewrite bytesin/proc/pid/io/gives the number of bytes that is sent to the storage layer in the lifetime of a particular process. If these aggreation counters are periocally collected, then the throughput of some resource like disk I/O can be approximated. This motivates us to collect metrics data from/proc/file system and correlate them with its corresponding map/reduce task. However, the difficulty lies in how to correlate process metrics to each map/reduce task. One map/reduce task can correspond to one or serveral processes in OS, because a task may spawn serveral subprocesses, which is a common case in stream processing jobs. To solve this problem, we create the process tree for each JVM process through the parent id provided in/procfile system, and aggregate the metrics of every subprocess to the counters of each map/reduce task.

Another difficulty is that Hadoop allows the reuse of Java Virtual Machine. With the reuse of JVM enabled,

0 10 20 30 40 50 60

Machine ID map task no.18

map-input map-output bytes

Figure 6. map task data flow

! ! "#$%&'(! )*'+! ,-.! )*'+! ! ! /-0'! )*'+! 12%3! )*'+! "#$%&'(!45$#&*6#'! 7-0'%8! 9*7:#'!7-0'! ;$#&*6#'!<-.%! 9*7:#'!<-.! 45$#&*6#'!&2%3%8! 9*7:#'!&2%3! =(-:(255'(%! "#$%&'(!205*7*%&(2&-(%! =>?%*@2#!0*5'7%*-7! ;A!0*5'7%*-7!

Figure 7. View-zooming model

multiple tasks of one job can run in the same JVM process sequentially in order to reduce the overhead of starting a new JVM. However, in the OS-level metrics, we can only infer the identification of the first task running on a JVM process. There is no additional information in OS level about the tasks that later reuse the JVM process. After checked the Hadoop log, we found that Hadoop logs the JVM id of each mapreduce task as well as the event of creating a new JVM. Thus, we can identify the JVM id of a particular JVM process from the Hadoop log. And according to the JVM id and the timestamp of creating and destorying a mapreduce task, we can infer which process this task run in.










RRD tool


RRD tool


File Access Network Collect Metrics Collect Metrics Collect Metrics

Figure 8. Online monitoring system architecture

B. Online Reporting

To achieve online reporting job-centric metrics, we adopt the source code from Ganglia. Ganglia is a distributed monitoring system that has great scalability and portability. It runs a daemon called ”gmond” on each cluster node that periodically collects metric data of the local machine and reports the data directly to a central node ”gmetad”. The central node stores metric data into RRD Database tools.

Gmond supports python module which can be used to add functionality to collect metrics. It is easy fo us to plug-in new metric collection module. However, the central node ”gmetad” has its limitation in organization of metrics data. Its interal data structure orgnazies the metrics by its hostname and the cluster it belongs to. In order to orga-nize the metrics by mapreduce job, we modify its internal data structure as well as its storage organization in RDD database.


The effectiveness of our monitoring tool relies on its low overhead which enables it to collect metrics consistenly without slowing down the cluster, as well as its extensive coverage of the events monitored, which prevents potentially interesting behaviors from flying under the raddar. Thus our evaluation consists of two parts: one is to evaluate the overhead of our monitoring tool and another one is to do case study to show the effectiveness of our monitoring tool to assist cluster administrator and hadoop programmer to trace interesting events in Hadoop system.

We perform our experiments on a 64-nodes testbed cluster. Each node has 2 quad-core Intel E5440, 16GB RAM, four 1TB SATA disks and a Qlogic 10GE NIC. Each node runs Debian GNU/Linux 5.0 with Linux kernel 2.6.32.

A. Benchmarks

We use the following Hadoop programs to evalute the overhead of our monitoring tools. We show that through our monitoring tools users can infer some root causes of performance problems in their programs. The two Hadoop programs are:

TeraSort: Terasort benchmark sorts 1010

records (ap-proximately 1TB), each including a 10-byte key. Terasort implemented using Hadoop distributes the record dataset to a number of mappers to generate key-value pairs; then reducers fetch these records from mappers and sort them by their keys. We run TeraSort in multiple times, and test the overhead caused by our monitoring tool.

Identity Change Detection: This is a data-mining job that analyzes telephone call graph and detect the idenfication changes of the owner of a particular telephone number. This program does analysis on a very large data sets, and consists of multiple mapreduce jobs. We use one of its sub-jobs as our case-study. One characteristic of this job is that its generates very large output in its map phase. Suppose the input size of this job is M key-value pairs, and the its immediate result emitted by mapper function is about M2

values. For the largest data set in our experiment, it writes nearly 3TB data into Hadoop filesystem.

B. Overhead

From the information provided by the /proc filesystem, we observe that local per-node overhead of our monitoring tool accounts for less than0.01%of the CPU resource. The peak virtual memory usage is about 55MB. We also test disk I/O throughput of the log-version of our monitoring tool. During the normal operation time of the cluster, the disk write throughput of our monitoring tool is about 0.5KB/s in average. The write rate will increase as the number of map-reduce jobs increases.

V. CASESTUDY A. Detecting disk-hog

In this case study, we inject a disk-hog, a C program that repeatly write large files into disks, to certain machines in the experiment cluster. We use this disk-hog to simulate real world disk I/O hotspot in cluster, and study how this I/O hotspot influences our MapReduce job. We run the identity change detection program describled in Section IV for experiment. The experimental setting is also identical to the one in Section IV.

We show that only extracting MapReduce level informa-tion from Hadoop log is not sufficient to detect our simpel disk-hog, but corelatting OS level metrics with MapReduce level information could help us quickly detect it. Figure 9 shows the running time distribution of map tasks in normal case and disk-hog case. In each case, there are map tasks running in different amout of time. Even if disk-hog can cause some map tasks running for a little bit longer time,


0 200 400 600 800 1000 1200 0 100 200 300 400 500 600 Duration Time (s) Map Task ID

(a) Normal Case

0 200 400 600 800 1000 1200 0 100 200 300 400 500 600 Duration Time (s) Map Task ID (b) Disk-hog Case

Figure 9. Comparing map task running time distribution of normal case and disk-hog case

it still can not change the entire running time distribution significantly. As we can see, the distribution under two cases still looks quite similar, and it is difficult to detect anomaly which is caused by disk-hog. Figure 10 shows the running time distribution of reduce tasks in normal case and disk-hog case. As we can see, all reduce tasks in disk-hog task are running slower than normal case, because each reduce task in our MR job have to wait for the shuffle bytes from the map task(s) running on the disk-hog machine(s). However, under disk-hog case, the distribution of reduce task running time are still quite similar with the normal case. The influence of disk-hog on reduce task running time distribution is not significant enough for anomaly detection.

0 200 400 600 800 1000 1200 1400 0 5 10 15 20 25 30 35 40 45 50 Duration Time (s) Reduce Task ID

(a) Normal Case

0 200 400 600 800 1000 1200 1400 0 5 10 15 20 25 30 35 40 45 50 Duration Time (s) Reduce Task ID (b) Disk-hog Case

Figure 10. Comparing reduce task running time distribution of normal case and disk-hog case

However, by corelatting OS level metrics with MapRe-duce level information, we can detect the disk-hog. There is one graph in our Job View (see Figure 11). This graph shows the comparison of the average job I/O write throught and machine I/O write throughput on each cluster machine. It is easy to see that on machine no. 10 and 23, the average machine I/O write throughput is much larger than job I/O write throughput, and there is no other MR job running on the cluster. It is quite possible that there is I/O write anomalies in these two machines. The programmer can then use zooming methodology to see the actual I/O write throughput variation on the suspected machines, and inform cluster administrator to check the I/O status on those


0 5000 10000 15000 20000 25000 30000 0 10 20 30 40 50 60 70 write bytes (KB/s) machine ID job4061 host

(a) Normal Case

0 10000 20000 30000 40000 50000 60000 70000 0 10 20 30 40 50 60 70 write bytes (KB/s) machine ID job4062 host (b) Disk-hog Case

Figure 11. Comparing MR job and machine I/O wirte bytes throughut in normal case and disk-hog case


B. Concurrent Job influence

This case study shows how zooming technology and fine-grained information help with debugging MapReduce pro-gram. When there are multiple jobs running on your cluster and you want to evaluate how these jobs influence each other. From system-wide Cluster View resources utilization, you may see two jobs consume almost the same CPU usage in an extended time period, and concludes that these two jobs influence each other. However, when we use zooming methodology to jump into Node View, we will get more fine-grained information, such as exact CPU usage temporal variation. From this fine-grained informaiton, we can get a

0 20 40 60 80 100 0 50 100 150 200 250 300 CPU perceptage (%) time aveCPU-4011:10.84% aveCPU-4010:18.02% job4011 job4010 0 20 40 60 80 100 0 50 100 150 200 250 300 CPU perceptage (%) time aveCPU-4022:17.78% aveCPU-4023:12.56% job4022 job4023

Figure 12. How two job influence each other in Node View

exact perception of how these concurrent job interleave with each other, and how significant they influence each other. As shown in Figure 12, job 4011 and 4010 is scheduled to run concurrently, and they influence each other in CPU usage, but job 4022 and 4023 is totally interleaved. Their average CPU usage in the time period shown is quie similar in these two cases.


Mochi [13], a visual log-analysis based tool, partially solves some of the mentioned problems. It parses the logs generated by Hadoop under debugging mode, using SALSA technique [12], infers the casual relations among recorded events, and then reports visualized metrics like per-task’s execution time, workload of each node. However, it still does not correlate per-job / per-task MR-specific behavior with OS-level metrics. And it does not provide any root-cause analysis for the perfomrance problems of MR programs.


Also, its analysis runs offline and thus cannot provide instant visualized monitoring information to users.

Ganesha [9] and Kahuna [14] combine black-box OS met-rics together with MapReduce state information to diagnose performance problems of MapReduce programs. However, they expose OS metrics on a per-node level, and our work can trace OS metrics into per-job or even per-task level, enabling more fine-grained performance problem diagnosis. Java debugging/profiling tools (jstack [11], hprof [10]) focus on helping debug local code-level errors rather than distributed problems over the cluster. Path-tracing tools (e.g., Xtrace [4], Pinpoint [1]), although report fined-grained data like RPC call graph, fail to provide insights at the higher level of abstraction (e.g. Maps and Reduces) that is more natural to application programmers. They tracks information by using instrumented middleware or libraries, and will introduce more overhead.

Ganglia [7], LiveRAC [8] and ENaVis [6] are cluster monitoring tools that collect per-node as well as system-wide data of several high-level variables (e.g., CPU utilization, I/O throughput, free disk space for each monitored nodei). However, they mainly focus on high-level variables and track only system wide totals. It is usually used to help flag misbehaviors (e.g., ”a node went down”). Our system correlates OS metrics together with high level MapReduce abstraction, and thus would be more powerful in debugging MapReduce performance problems. Artemis [2] is a plug-gable framework for distributed log collection, analysis and visualization. Our system collected and analyzed MapRe-duce level information from Hadoop logs and also OS level metrics on cluster machines. We can build our system as Artemis plugins for online monitoring.


We have built a non-intrusive, effective, distributed mon-itoring tool to facilitate the Hadoop program debugging process. The tool correlates rich MapReduce metrics ex-tracted from Hadoop logs and per-job/per-task level op-erating system metrics. We have proposed a multi-view visualization scheme to effectively present these metrics, as well as a view-zooming model to help programmers better reason about job execution progress and system behavior. The preliminary results are promising: our tool can success-fully diagnose several performance problems that previous monitoring tools cannot.

As for future work, we plan to instrument the Hadoop framework in the hope of exploring more information from Hadoop metrics APIs. Another potential direction is to achieve online automatic data analysis based on the aggre-gated metrics. Instead of storing temporary data on RRD database, we also would like to find an effective way to maintain long-term storage of the collected metrics.


[1] Mike Y. Chen, Emre Kiciman, Eugene Fratkin, Armando Fox, and Eric Brewer. Pinpoint: Problem determination in large, dynamic internet services. In DSN ’02: Proceedings of the 2002 International Conference on Dependable Systems and Networks, 2002.

[2] Gabriela F. Cretu-Ciocarlie, Mihai Budiu, and Moises Gold-szmidt. Hunting for problems with artemis. In USENIX workshop on Analysis of System Logs (WASL), 2008. [3] Jeffrey Dean and Sanjay Ghemawat. Mapreduce: Simplified

data processing on large clusters. In OSDI ’04, pages 137– 150, December 2004.

[4] Rodrigo Fonseca, George Porter, Randy H. Katz, Scott Shenker, and Ion Stoica. X-trace: A pervasive network tracing framework. In NSDI ’07, 2007.

[5] The Apache Software Foundation. Hadoop. http://hadoop. apache.org/core.

[6] Qi Liao, Andrew Blaich, Aaron Striegel, and Douglas Thain. Enavis: Enterprise network activities visualization. In Large Installation System Administration Conference (LISA), 2008. [7] Matthew L. Massie, Brent N. Chun, and David E. Culler. The ganglia distributed monitoring system: Design, implementa-tion and experience. Parallel Computing, 30:2004, 2003. [8] Peter McLachlan, Tamara Munzner, Eleftherios Koutsofios,

and Stephen North. Liverac: interactive visual exploration of system management time-series data. In CHI ’08: Proceeding of the twenty-sixth annual SIGCHI conference on Human factors in computing systems, pages 1483–1492, 2008. [9] Xinghao Pan, Jiaqi Tan, Soila Kavulya, Rajeev Gandhi, and

Priya Narasimhan. Ganesha: blackbox diagnosis of mapre-duce systems. SIGMETRICS Perform. Eval. Rev., 37(3):8–13, 2009.

[10] Oracle SUN. Hprof: A heap/cpu profiling tool in j2se 5.0. http://java.sun.com/developer/technicalArticles/ Programming/HPROF.html.

[11] Oracle SUN. jstack - stack trace for sun java real-time sys-tem. http://java.sun.com/j2se/1.5.0/docs/tooldocs/share/jstack. html.

[12] Jiaqi Tan, Xinghao Pan, Solia Kavulya, Rajeev Gandhi, and Priya Narasimhan. Salsa: Analysing logs as state machines. In USENIX Workshop on Analysis of System Logs (WASL), 2008.

[13] Jiaqi Tan, Xinghao Pan, Solia Kavulya, Rajeev Gandhi, and Priya Narasimhan. Mochi: Visual log-analysis based tools for debugging hadoop. In HotCloud ’09, San Diego, CA, Jun 2009.

[14] Jiaqi Tan, Xinghao Pan, Solia Kavulya, Rajeev Gandhi, and Priya Narasimhan. Kahuna: Problem diagnosis for mapreduce-based cloud computing environments. In IEEE/IFIP Network Operations and Management Symposium (NOMS), 2010.


Figure 1. An example of node view

Figure 1.

An example of node view p.2
Figure 3. node-based distribution of shuffle bytes for reducer tasks

Figure 3.

node-based distribution of shuffle bytes for reducer tasks p.3
Figure 2. task-based distribution of shuffle bytes for reducer tasks

Figure 2.

task-based distribution of shuffle bytes for reducer tasks p.3
Figure 4. time-based distribution of task durations

Figure 4.

time-based distribution of task durations p.4
Figure 8. Online monitoring system architecture

Figure 8.

Online monitoring system architecture p.5
Figure 9. Comparing map task running time distribution of normal case and disk-hog case

Figure 9.

Comparing map task running time distribution of normal case and disk-hog case p.6
Figure 11. Comparing MR job and machine I/O wirte bytes throughut in normal case and disk-hog case

Figure 11.

Comparing MR job and machine I/O wirte bytes throughut in normal case and disk-hog case p.7