D E C E M B E R 1 9 9 0
WRL
Technical Note TN-17
MTOOL: A Method
For Detecting
Memory Bottlenecks
The Western Research Laboratory (WRL) is a computer systems research group that
was founded by Digital Equipment Corporation in 1982. Our focus is computer science
research relevant to the design and application of high performance scientific computers.
We test our ideas by designing, building, and using real systems. The systems we build
are research prototypes; they are not intended to become products.
There is a second research laboratory located in Palo Alto, the Systems Research
Cen-ter (SRC). Other Digital research groups are located in Paris (PRL) and in Cambridge,
Massachusetts (CRL).
Our research is directed towards mainstream high-performance computer systems. Our
prototypes are intended to foreshadow the future computing environments used by many
Digital customers. The long-term goal of WRL is to aid and accelerate the development
of high-performance uni- and multi-processors. The research projects within WRL will
address various aspects of high-performance computing.
We believe that significant advances in computer systems do not come from any single
technological advance. Technologies, both hardware and software, do not all advance at
the same pace. System design is the art of composing systems which use each level of
technology in an appropriate balance. A major advance in overall system performance
will require reexamination of all aspects of the system.
We do work in the design, fabrication and packaging of hardware; language processing
and scaling issues in system software design; and the exploration of new applications
areas that are opening up with the advent of higher performance systems. Researchers at
WRL cooperate closely and move freely among the various levels of system design. This
allows us to explore a wide range of tradeoffs to meet system goals.
We publish the results of our work in a variety of journals, conferences, research
reports, and technical notes. This document is a technical note. We use this form for
rapid distribution of technical material. Usually this represents research in progress.
Research reports are normally accounts of completed research and may include material
from earlier technical notes.
Research reports and technical notes may be ordered from us. You may mail your
order to:
Technical Report Distribution
DEC Western Research Laboratory, UCO-4
100 Hamilton Avenue
Palo Alto, California 94301 USA
Reports and notes may also be ordered by electronic mail. Use one of the following
addresses:
Digital E-net:
DECWRL::WRL-TECHREPORTS
DARPA Internet:
[email protected]
CSnet:
[email protected]
UUCP:
decwrl!wrl-techreports
instruc-MTOOL: A Method For Detecting
Memory Bottlenecks
Aaron Goldberg
Digital Equipment Corporation Western Research Laboratory
John Hennessy
Stanford University
December, 1990
Abstract
This paper presents a new method for detecting regions of a
program where the memory hierarchy is performing poorly. By
observing where actual measured execution time differs from the
time predicted given a perfect memory system, we can isolate
memory bottlenecks. MTOOL, an implementation of the
ap-proach aimed at Fortran programs running on MIPS-chip based
workstations is described and results for some of the Perfect Club
and SPEC benchmarks are reported.
Many mo dern computer architectures including cache-based uniprocessors and most shared memory multiprocessors present the programmer with a (deceptive)uniformaccess model of memory. RISCarchitectures for example provide simple load and store instructions to represent memory op erations. Inpractice,however, the loadinstructionmayinv olv e accessingon-chipcache which is backed by a second level cache of static RAMwhich is back ed by mainmemory. The time dierence b etweena hit inthe rst level cache and amiss tomainmemory canb e one to twoorders of magnitude (see Table 1).
Programmers seekingto improv e performance sometimes ndit useful to optimize analgorithmwithresp ect toaparticular memoryhierarchy. Studies like [1], [2], [7] and [8] hav e reported sp eed-ups of 100% and more owing to improved cache p erformance when nested loops are reordered and matrix algorithms are blocked. To make use of such techniques, the user must knowwhere memory bottlenecks lie and when a transformation improv es p erformance.
Two techniques are typically employed to isolate memory bottlenecks. The least time consuming approach is to statically analyze a program, us -ing dependency analysis to identify howmany items will b e incache after a certain numb er of iterations of a loop [3, 6]. Suchtechniques are imp ortant b ecause they are p otentially fast enough to incorp orate into compilers to automatically manage transformations. Howev er, the static techni ques rely on the simple structure of b oththe loop and the memory systemto p erform their analyses. The approximate nature of analytic techniques and the i
n-Miss Penalty (cycles) Machine Primary Secondary Remote DEC3100 6 { { DEC5000 10 { { SGI (4 node) 14 40 { MIPS6280 2-4 50 { StanfordDASH 14 28 103-136
theminappropriate as a complete performance debugging solution.
At the opp osite end of the sp ectrum, there are trace driv en simulators whichsimulate the executionof ev erymemoryreference ina program. These simulations canmodel the entire memory hierarchy andproduce access time estimates for everyvariable reference inaprogram. The obvious drawbackof simulationis its cost;for large programs, simulationmay be quite exp ensive. Also, it is non-trivial to correctly model a complicated memory hierarchy, particularly whenmultiprogramming or multiprocessing is involv ed.
Our technique strik es a balance between the exp ense of simulation and the inaccuracy of static analysis. Our key observ ation is that if we assume memory access time is uni formthen, at least for simpler architectures, it is relatively cheapto correctly estimate the CPU executiontime of a program. By comparing this uniformaccess model estimate withactual observ edexe -cutiontime, we canisolate regions inaprogramwhere the memoryhierarchy p erforms p oorly.
The next sectiondiscusses the method for estimating executiontime as -suming constant memory access time. Section 3 describ es howto use the estimate to isolate memory bottlenecks. Section 4 presents a memory b ot -tleneck tool implementation, MTOOL, whichruns on the DECstation 3100 and5000. Section5provides examples of MTOOL's user interface. Section6 rep orts results whenthe tool is runonthe Perfect Clubb enchmarks andsci -enticbenchmarks inthe SPECset. Finally, section7gives some conclusions ab out the breadth of applicability of our technique and suggests directions for future research.
2 Estimating ExecutionTime
Consider a computer where all instruction schedulingis handledbysoftware (i.e., no hardware interlocks) andwhere eachinstruction (including memory access instructions) has a known, xedexecutiontime. For sucha computer, we candeterminethe executiontime of aprogramgiv eninstructionexecution counts using the formula,
divide the programintobasic blocks. Abasic block is a groupof instructions witha unique entry point such that when the entry instruction executes, all
other instructions inthe basic blockwill execute. It is possible to identify all basic blocks in most executable programs by examining branch instruction
destinations andindirect jumptables. After determining the basic blocks, we instrument theexecutable lebyprecedingeachblockwithcodetoincrement acounter. Runningtheinstrumentedprogramproduces atable of basic block counts. Adiscussionof methods for instrumentingcompiledcode is provided in the Appendix.
Using the counts and our knowledge of when hardware interlocks are triggered, we canestimate howlongeachbasic blockexecutes. Our estimates will hav e two shortcomings:
1. Memory access instructions do not execute inconstant time. 2. There may b e hardware interlocks across basic block boundaries.
The rst shortcoming is actually the feature on which our bottleneck detectiontechnique is based. We assume all memory accesses take the mini -mumpossible time (typicallythe time for a primarycachehit) andwhenour prediction disagrees with measuredexecution time we report a bottleneck.
The second weakness is not a problemfor many RISCarchitectures b e -cause there are few hardware interlocks and these interlocks rarely cross basic block b oundaries. Onthe MIPSprocessor where our exp eriments were
p erformed, the inter-block interlocks were negligible in real code. If such interlocks occur with appreciable frequency, they can be estimated by i n-strumenting to collect branch frequencies as well as basic block counts. The branch frequencies tell us howoften one basic block precedes another and we can improv e our estimate by including interlocks b etween adjacent basic blocks.
Inthis section, we dev elopa frameworkfor detecting b ottlenecks by measur -ing div ergence frompredicted b ehavior. We b egin by formalizing the notion of measuring actual time spent in a regionof code. Ameasurable object or m-object is a set of instructions in which we can identify all entry and exit p oints. The object is measurable because we can place start timer and stop timer calls at these entry and exit p oints to measure the time spent in the object. For example, we cantime aprocedure byplacing astart timer call at the topof the procedure andstop timer calls b efore everyreturn statement. Similarly, we can time a loop by placing a start timer ab ov e the top of the loop anda stoptimer belowthe b ottomof the loop.
We say anm-object is timableif we caninstrument the programto me a-sure the time sp ent within the object. Atimable object must satisfy two criteria:
1. The total time spent withintheobject substantiallyexceeds timer gr an-ularity.
2. The perturbation created by the timer is not signicant.
The rst asp ect of timability is rarely a problemas most systems provide at least a 1/60thof a second timer, andm-objects of interest typically execute for seconds, minutes, or ev en hours. The p erturbation issue is more di -cult. To av oid changing memory p erformance, we require that the number of memory op erations performed by the m-object substantially exceed the numb er p erformed in a clock timer call. In addition, to avoid appreciably slowing the programdown, we require that the time spent in the m-object substantially exceedthe time to make a clock call.
canb e accuratelypredictedbybasic-blockcountingtechniques are estimable. To identify estimable m-objects, we exploit informationab out the str uc-ture of a program's call graph. In a call graph, nodes represent procedures and there is a directed edge fromnode v to node w if procedure v calls w during the execution of the program. The call graph has a distinguished node, the root, whichis the procedure where execution b egins.
We say a node v dominates wif ev ery path through the graph fromthe root to wpasses through v. The useful asp ect of the call graph is that a node is estimable if it dominates all of its descendants. Intuitiv ely, if a node v dominates wthen all the work in wis done on b ehalf of v. Hence, if a
node dominates all of its children, then the estimated time for that node andits descendants is just the sum of eachof their estimatedtimes, ignoring procedure calls.
This observationserv es as a working denition of estimability. The de-nition implies that b oththe root of the call graph, whichcorresponds to the executionof the whole program, andthe leav es whichcorrespondto call-free procedures, are estimable. Giv en this operational denition of estimabil -ity, the primary issue in isolating memory b ottlenecks nowb ecomes one of granularity of detection.
We could estimate the execution time of the full programand compare this number against actual runtime, but this will onlydescribe themagnitude of the memoryeects, not localize them. Instead, our approachis tondaset of smaller, timable m-objects containingthe majorityof memory op erations, and then to select members of this set that are estimable. The next section outlines our algorithm.
4 The I mplementation
This sectiondescribes one implementationof the estimable, timable m-object approachtoisolating memory b ottlenecks, MTOOL. This sp ecic impleme n-tation is for Fortran programs running on MIPS-chip based workstations. Fortranwas chosen as a target language b ecause large Fortran programs of -tenhave memorybottleneck edregions andmuchof theresearchonalleviating memory bottlenecks has concentrated onscientic code.
to the user, satisfy the deni tionof m-object, andtypically runlong enough to meet the timability criteria. The rst step of the b ottleneck isolation process is to instrument the programof interest to collect basic block counts and to run the instrumented code on a representativ e input. The basic block counts are useful not only for estimating execution time; they also provide MTOOL with precise knowledge of where memory op erations are
concentrated.
MTOOLsorts the procedures by the number of memory op erations they
execute andselects those that containthe rst 95%of all memoryop erations. For each of these procedures, MTOOL makes a list of loops and tries to measure rst the indi vidual loops and then the whole procedure, subject to timability and estimability constraints. Timability constraints are strongly systemdependent. DEC's ULTRIXprovides a1/60thof asecondgranularity
clock, but a systemcall is requiredto readthe clock.
The overhead of the systemcall p erturbs execution undesirably. The cleanest solution wouldb e to modify the op erating systemto create a clock directly accessible by user processes, but a simpler, more widely appl icable option was to add an interval timer to create a clock inuser memory. Start andstopclock calls access the clock in user memory whichis up datedbythe interrupts of the interval timer.
Using the user memory clock, MTOOL's timer has granularityof 1/60th of a second and work per start/stop call of about 70 instructions. The re -sults rep orted in Section 6 seemto indi cate this granularity and overhead are acceptable. Also, we exp ect that interest in characterizing and impr ov-ing performance will drive architects andop erating systems programmers to provide b etter clocks in the future.
At this p oint, MTOOL has gathered a collection of timable loops and procedures. Thenext stepis todetermine whichof these objects is estimable. Conceptually, MTOOL uses a simple depth rst algorithmto label each procedure inthe call graph with two parameters:
Tot(v) =1+
X cal lsfr omvto w
Tot(w) 3
#of cal lsfr omv to w total #of cal l s to w
and TDES(w) =1 +number of descendants of w:
tionof the previous sectionthat a node is estimable whenall the work of its descendants is done on its b ehalf.
Thus, we have a test for estimability. Extending the test to loops simply involv es checking the analogous conditionthat:
X wcalledinl oop
TDES(w) +1 =
X wcal led inloop
Tot(w) 3
#of call s froml oopto w total #of call s to w
:
In most cases, the loops and procedures selected by MTOOL immediately
satisfy the estimability condition. The condition is met frequently because the objects selectedby MTOOLcontainthe majority of memory op erations
whichtypically means they inv olv e the most frequently executedp ortions of code whichare normally leaves or objects all of whose childrenare leav es.
When the test is not satised immediately, several options are available. One natural choice is to mov e upthe call graphas we knowthat ev entually we will encounter an estimable node, the root. This option is not ideal, howev er, b ecause it reduces the precisionwithwhichwe localize bottlenecks. Ab etter choice is to check whether we can in fact accurately estimate the time sp ent ina procedure call. Below, we describ e three cases where we can make anaccurate estimate.
Many scientic library routines are simple, loop-free leaf procedures that alwa ys followthe same execution path. Their estimatedexecution times are consequently constant. We can often identify such procedures by checking that they meet two conditions:
1. The procedure is a loop-free andcall-free.
2. The av erage time p er call as determinedby basic blockcounts is equal to the maximumor minimumpossible time per call.
2. Select loops andprocedures containing most memory operations. 3. Eliminate selectedobjects that fail tomeet timabilityconstraints. 4. Eliminate selected objects that fail to meet estimability
con-straints.
5. Instrument code to measure actual time sp ent in remaining se -lected objects.
6. Run instrumented code, correlate actual times with estimated times to isolate b ottlenecks, and rep ort bottlenecks to the user.
Figure 1: MTOOL's Bottleneck IsolationAlgorithm
Two other simple heuristics suce to eliminate many other problematic calls. Suppose object v calls procedure p. Then, we can normally assume average time p er call fromv to pis average time p er call to vwheneither:
1. The v ast majority (98%) of calls to pare made by v, or
2. (avg. time per call to p) * (#of calls fromv to p) time sp ent in v, so anyerror in the approximation is negligible.
Both of these heuristics can b e inaccurate under pathological conditions (whenthe varianceof the executiontime of pacross calls is large), soMTOOL issues a warning whenev er it inv okes them. They hav e not causedproblems withthe benchmarks measured inthis study.
The steps MTOOLuses to isolate memory b ottlenecks are summarized in Figure 1. Step6 is of course the most signicant to the user.
5 User I nterface
Measured Time (compensated for counters): user 132.4 sys 0.7 Overhead estimates: User (memory) System (I/0)
39.3 0.7 Proceed with memory bottleneck probe insertion (y/n)?
Figure 2: MTOOL's Initial Bottleneck Estimate
counts, runs andtimes the instrumentedcode, anddisplays a result like that shownin Figure 2. MTOOLgenerates the data in the gure by estimating the run-time of the programandcomparing it withthe measuredexecution time, appropriately adjustedfor the eects of the counters. The information prevents a a user fromproceeding when no signicant memory bottlenecks are present.
Assuming the user asks MTOOLto produce an instrumented program, MTOOLexecutes steps 2to5of Figure 1producingale of m-object descri p-tors andactual execution time measurements. This summary le is handed o to the front end.
The front end' s toplev el windowis showninFigure 3for the Perfect Club b enchmark TFS.f, an air owsimulation. Overheadis dened as
(actu al time 0estimatedtime)=estimatedtime:
The histograminthelower left of the windowvisuallysummarizes the datain the upper right. The line \Measured91% of memoryoperations"reports the
p ercentage of all executedmemory operations that are included inmeasured objects. All times and overheads refer to user time only; time spent in the op erating systemon behalf of a programis ignored(thoughit was rep orted at anearlier stage{see Figure 2).
ing the text. Bottlenecks are highlighted one at a time, with the ov erall overheadcontributionandthe extra cycles p er memory operationassociated with the bottleneck rep orted at the top of the text window. Pressing the INFObutton op ens a popup windowthat gives further information about the distribution of memory op erations when the measured object includes loops or procedure calls. Figure 6 shows a simple INFOwindowfor the top b ottleneck in dflux(). Pressing PREV or NEXTdisplays the previous or next b ottleneck; bottlenecks are sortedbymagnitude. The text windowmay b e scrolled by pressing the vertical arrows at the right side of the window, and multiple text windows may be op ensimultaneously.
Three aspects of the user interface deserv e comment froman implem
en-tor's p ersp ective. First, the interface is relativ elyp ortable because it is built usingInterviews [4], anobject orientedtoolkit that runs ontopof X. Second, the notion of clicking on a bar to obtain further information on the associ -ated b ottleneck is quite general. For example, it would b e easy to supp ort an optionwhere clicking ona procedure's estimated time bar wouldprovide information implicit in our time estimate like register usage, oating p oint stalls, MFLOPS, most often executed lines, etc. Athird note ab out the interface is that is uses standard line numb er information available in the object le to relate basic block level information back to source code lines. In this sense, the user interface is language-independent.
6 Results
%of Ov er- plained InAll Median/ Max/ InAll Program mops head Ov erhd m-obj. m-obj m-obj Code tfs 91% 40% 4% 252 8 53 2020 nas 97% 25% 0% 591 84 395 3976 sds 93% 6% 2% 167 12 120 7607 lws 93% 9% 0% 100 100 100 1237 lgs 98% 12% 0% 323 35 132 2327 fpppp 85% 50% 11% 940 274 666 2718 doduc 85% 29% 5% 2541 102 477 5334 spice2g6 95% 46% 2% 756 41 434 18411 dnasa7 97% 107% 1% 257 9 118 1105
Table 2: MTOOLEectiveness
containedinthe m-objects. Onhalf of the programs, the medianobject con-tains fewer than 50 lines, but for some programs lik e fpppp, we are p erhaps failing to localize bottlenecks adequately. This problemis discussed further in the Conclusions sectionb elow.
7 Conclusions
tioned m-objects containing these highlighted memory op erations, subject to MTOOL's checks for estimability andtimability.
MTOOLcould also b e enhancedbybroadening the class of programs for whichit can detect bottlenecks. MTOOLis nowrestricted to non-recursive programs that do not use procedure v ariables. The restriction on recursion derives primarilyfromthe fact that thesimple start/stopclocktimers will not work when an m-object can b e re-entered (and the clock re-started) b efore it is exited (and the clock is stopped). The restriction could b e removedby prohibiting the timing of recursive procedures or by using more complicated timers that keep track of the depth of the recursive call and only start and stopthe clock on the rst entry and last exit. Because MTOOLis language indep endent (except for the restriction on recursion) modifying the timers (andmodifying the checkfor estimability of section4 to account for loops in the call graph) wouldallowMTOOLto handle languages withrecursion.
The restriction onprocedure v ariables is also easy to remov e. MTOOL's denitionof estimabilityreqires awell-denedcall graphandthe use of proce -dure variables means that thefull structure of the call graphis not determined until run-time. By modifying the basic block counting instrumentation to record a dynamic call graph, this restrictioncould b e circumv ented.
Thenal andperhaps most excitingapplicationfor MTOOLtechnologyis p ortingit toasharedmemorymultiprocessor. Memoryb ottlenecks onshared memory machines can b e severe and detecting themis dicult. Analytic techniques cannot handle the complexity of parallel systems and simulating multipleinteractingprocessors is extremelycomplexandexpensiv e. MTOOL should provide a viable alternativ e to these approaches.
Acknowledgements: I wouldlik e to thank DavidWall of DECWRLwho
sp onsored part of this research during a summer internship and provided invaluable advice onthe ins and outs of patching code.
8 Appendix: I nstrumenting Code
solutionis to include anindirect jumptable whichmaps every address inthe original executable to its corresp onding address in the patched executable. Indi rect jumps are always made through this table. The drawback of Pixie is that it canintroduce non-negligible ov erhead.
Asecondapproachis topatchthe code at linktime, modifyingp c-relative jumps, text addresses in the data segment, and the relocation dictionary to correctly reect changes in the executable le. The adv antage of this approach is that all jump destination addresses are clearly identiable and no ov erheadis added. See [9] for a a complete descriptionof the technique.
MTOOLuses a thirdapproach. MTOOLpatches the executable directly, but it does not use anindirect jumptable. Instead, MTOOLpatternmatches to nd instructions which load addresses in the text segment. The loaded text address is modiedtorepresent the corresp ondi ngaddress inthe instr u-mentedcode. Thus, a JR r1 will succeedbecause the instructions to loadr1 with a value hav e b een correctly up dated. This techni que suers fromthe p otential drawback that the pattern matcher could fail in highly optimized code, but we hav e encounteredno problems to date.
References
[1] W. Abu-Sufah, D. Kuck, and D. Lawrie. On the p erformance enhance -ment of paging systems through programanalysis and transformations. IEEETransanctions onComputing, 30(5):341{56, 1981.
[2] K. Galliv an, W. Jalby, U. Meier, andA. Sameh. The impact of hierarchi -cal memorysystems onlinear algebraalgorithmdesign. Techni cal Report CSRD625, University of Illinois, 1987.
[3] D. Gannon, W. Jalby, and K. Gallivan. Strategies for cache and local memory management byglobal program transformation. Journal of Par -allel andDistributedComputing, 5(5):587{616, 1988.
[4] Mark A. Linton, JohnM. Vlissides, andPaul R. Calder. Comp osing user interfaces with interviews. IEEEComputer, 22(2):8{22, 1989.
onSupercomputer Applications. PhDthesis, Rice University, May 1989.
[7] E. Rothberg andA. Gupta. Ecient sparse matrix factorizationonhi gh-p erformance workstations { exploiting the memory hierarchy. To app ear ACMTransactions onMathematical Software.
[8] E. Rothberg and A. Gupta. Techni ques for improving the p erformance of sparse matrix factorizationonmultiprocessor workstations. To app ear Proceedings of Supercomputing'90, Nov. 1990.