WRL TN 17 pdf

(1)

D E C E M B E R 1 9 9 0

WRL

Technical Note TN-17

MTOOL: A Method

For Detecting

Memory Bottlenecks

(2)

The Western Research Laboratory (WRL) is a computer systems research group that

was founded by Digital Equipment Corporation in 1982. Our focus is computer science

research relevant to the design and application of high performance scientific computers.

We test our ideas by designing, building, and using real systems. The systems we build

are research prototypes; they are not intended to become products.

There is a second research laboratory located in Palo Alto, the Systems Research

Cen-ter (SRC). Other Digital research groups are located in Paris (PRL) and in Cambridge,

Massachusetts (CRL).

Our research is directed towards mainstream high-performance computer systems. Our

prototypes are intended to foreshadow the future computing environments used by many

Digital customers. The long-term goal of WRL is to aid and accelerate the development

of high-performance uni- and multi-processors. The research projects within WRL will

address various aspects of high-performance computing.

We believe that significant advances in computer systems do not come from any single

technological advance. Technologies, both hardware and software, do not all advance at

the same pace. System design is the art of composing systems which use each level of

technology in an appropriate balance. A major advance in overall system performance

will require reexamination of all aspects of the system.

We do work in the design, fabrication and packaging of hardware; language processing

and scaling issues in system software design; and the exploration of new applications

areas that are opening up with the advent of higher performance systems. Researchers at

WRL cooperate closely and move freely among the various levels of system design. This

allows us to explore a wide range of tradeoffs to meet system goals.

We publish the results of our work in a variety of journals, conferences, research

reports, and technical notes. This document is a technical note. We use this form for

rapid distribution of technical material. Usually this represents research in progress.

Research reports are normally accounts of completed research and may include material

from earlier technical notes.

Research reports and technical notes may be ordered from us. You may mail your

order to:

Technical Report Distribution

DEC Western Research Laboratory, UCO-4

100 Hamilton Avenue

Palo Alto, California 94301 USA

Reports and notes may also be ordered by electronic mail. Use one of the following

addresses:

Digital E-net:

DECWRL::WRL-TECHREPORTS

DARPA Internet:

CSnet:

UUCP:

decwrl!wrl-techreports

(3)

instruc-MTOOL: A Method For Detecting

Memory Bottlenecks

Aaron Goldberg

Digital Equipment Corporation Western Research Laboratory

John Hennessy

Stanford University

December, 1990

Abstract

This paper presents a new method for detecting regions of a

program where the memory hierarchy is performing poorly. By

observing where actual measured execution time differs from the

time predicted given a perfect memory system, we can isolate

memory bottlenecks. MTOOL, an implementation of the

ap-proach aimed at Fortran programs running on MIPS-chip based

workstations is described and results for some of the Perfect Club

and SPEC benchmarks are reported.

(4)

Many mo dern computer architectures including cache-based uniprocessors and most shared memory multiprocessors present the programmer with a (deceptive)uniformaccess model of memory. RISCarchitectures for example provide simple load and store instructions to represent memory op erations. Inpractice,however, the loadinstructionmayinv olv e accessingon-chipcache which is backed by a second level cache of static RAMwhich is back ed by mainmemory. The time dierence b etweena hit inthe rst level cache and amiss tomainmemory canb e one to twoorders of magnitude (see Table 1).

Programmers seekingto improv e performance sometimes ndit useful to optimize analgorithmwithresp ect toaparticular memoryhierarchy. Studies like [1], [2], [7] and [8] hav e reported sp eed-ups of 100% and more owing to improved cache p erformance when nested loops are reordered and matrix algorithms are blocked. To make use of such techniques, the user must knowwhere memory bottlenecks lie and when a transformation improv es p erformance.

Two techniques are typically employed to isolate memory bottlenecks. The least time consuming approach is to statically analyze a program, us -ing dependency analysis to identify howmany items will b e incache after a certain numb er of iterations of a loop [3, 6]. Suchtechniques are imp ortant b ecause they are p otentially fast enough to incorp orate into compilers to automatically manage transformations. Howev er, the static techni ques rely on the simple structure of b oththe loop and the memory systemto p erform their analyses. The approximate nature of analytic techniques and the i

n-Miss Penalty (cycles) Machine Primary Secondary Remote DEC3100 6 { { DEC5000 10 { { SGI (4 node) 14 40 { MIPS6280 2-4 50 { StanfordDASH 14 28 103-136

(5)

theminappropriate as a complete performance debugging solution.

At the opp osite end of the sp ectrum, there are trace driv en simulators whichsimulate the executionof ev erymemoryreference ina program. These simulations canmodel the entire memory hierarchy andproduce access time estimates for everyvariable reference inaprogram. The obvious drawbackof simulationis its cost;for large programs, simulationmay be quite exp ensive. Also, it is non-trivial to correctly model a complicated memory hierarchy, particularly whenmultiprogramming or multiprocessing is involv ed.

Our technique strik es a balance between the exp ense of simulation and the inaccuracy of static analysis. Our key observ ation is that if we assume memory access time is uni formthen, at least for simpler architectures, it is relatively cheapto correctly estimate the CPU executiontime of a program. By comparing this uniformaccess model estimate withactual observ edexe -cutiontime, we canisolate regions inaprogramwhere the memoryhierarchy p erforms p oorly.

The next sectiondiscusses the method for estimating executiontime as -suming constant memory access time. Section 3 describ es howto use the estimate to isolate memory bottlenecks. Section 4 presents a memory b ot -tleneck tool implementation, MTOOL, whichruns on the DECstation 3100 and5000. Section5provides examples of MTOOL's user interface. Section6 rep orts results whenthe tool is runonthe Perfect Clubb enchmarks andsci -enticbenchmarks inthe SPECset. Finally, section7gives some conclusions ab out the breadth of applicability of our technique and suggests directions for future research.

2 Estimating ExecutionTime

Consider a computer where all instruction schedulingis handledbysoftware (i.e., no hardware interlocks) andwhere eachinstruction (including memory access instructions) has a known, xedexecutiontime. For sucha computer, we candeterminethe executiontime of aprogramgiv eninstructionexecution counts using the formula,

(6)

divide the programintobasic blocks. Abasic block is a groupof instructions witha unique entry point such that when the entry instruction executes, all

other instructions inthe basic blockwill execute. It is possible to identify all basic blocks in most executable programs by examining branch instruction

destinations andindirect jumptables. After determining the basic blocks, we instrument theexecutable lebyprecedingeachblockwithcodetoincrement acounter. Runningtheinstrumentedprogramproduces atable of basic block counts. Adiscussionof methods for instrumentingcompiledcode is provided in the Appendix.

Using the counts and our knowledge of when hardware interlocks are triggered, we canestimate howlongeachbasic blockexecutes. Our estimates will hav e two shortcomings:

1. Memory access instructions do not execute inconstant time. 2. There may b e hardware interlocks across basic block boundaries.

The rst shortcoming is actually the feature on which our bottleneck detectiontechnique is based. We assume all memory accesses take the mini -mumpossible time (typicallythe time for a primarycachehit) andwhenour prediction disagrees with measuredexecution time we report a bottleneck.

The second weakness is not a problemfor many RISCarchitectures b e -cause there are few hardware interlocks and these interlocks rarely cross basic block b oundaries. Onthe MIPSprocessor where our exp eriments were

p erformed, the inter-block interlocks were negligible in real code. If such interlocks occur with appreciable frequency, they can be estimated by i n-strumenting to collect branch frequencies as well as basic block counts. The branch frequencies tell us howoften one basic block precedes another and we can improv e our estimate by including interlocks b etween adjacent basic blocks.

(7)

Inthis section, we dev elopa frameworkfor detecting b ottlenecks by measur -ing div ergence frompredicted b ehavior. We b egin by formalizing the notion of measuring actual time spent in a regionof code. Ameasurable object or m-object is a set of instructions in which we can identify all entry and exit p oints. The object is measurable because we can place start timer and stop timer calls at these entry and exit p oints to measure the time spent in the object. For example, we cantime aprocedure byplacing astart timer call at the topof the procedure andstop timer calls b efore everyreturn statement. Similarly, we can time a loop by placing a start timer ab ov e the top of the loop anda stoptimer belowthe b ottomof the loop.

We say anm-object is timableif we caninstrument the programto me a-sure the time sp ent within the object. Atimable object must satisfy two criteria:

1. The total time spent withintheobject substantiallyexceeds timer gr an-ularity.

2. The perturbation created by the timer is not signicant.

The rst asp ect of timability is rarely a problemas most systems provide at least a 1/60thof a second timer, andm-objects of interest typically execute for seconds, minutes, or ev en hours. The p erturbation issue is more di -cult. To av oid changing memory p erformance, we require that the number of memory op erations performed by the m-object substantially exceed the numb er p erformed in a clock timer call. In addition, to avoid appreciably slowing the programdown, we require that the time spent in the m-object substantially exceedthe time to make a clock call.

(8)

canb e accuratelypredictedbybasic-blockcountingtechniques are estimable. To identify estimable m-objects, we exploit informationab out the str uc-ture of a program's call graph. In a call graph, nodes represent procedures and there is a directed edge fromnode v to node w if procedure v calls w during the execution of the program. The call graph has a distinguished node, the root, whichis the procedure where execution b egins.

We say a node v dominates wif ev ery path through the graph fromthe root to wpasses through v. The useful asp ect of the call graph is that a node is estimable if it dominates all of its descendants. Intuitiv ely, if a node v dominates wthen all the work in wis done on b ehalf of v. Hence, if a

node dominates all of its children, then the estimated time for that node andits descendants is just the sum of eachof their estimatedtimes, ignoring procedure calls.

This observationserv es as a working denition of estimability. The de-nition implies that b oththe root of the call graph, whichcorresponds to the executionof the whole program, andthe leav es whichcorrespondto call-free procedures, are estimable. Giv en this operational denition of estimabil -ity, the primary issue in isolating memory b ottlenecks nowb ecomes one of granularity of detection.

We could estimate the execution time of the full programand compare this number against actual runtime, but this will onlydescribe themagnitude of the memoryeects, not localize them. Instead, our approachis tondaset of smaller, timable m-objects containingthe majorityof memory op erations, and then to select members of this set that are estimable. The next section outlines our algorithm.

4 The I mplementation

This sectiondescribes one implementationof the estimable, timable m-object approachtoisolating memory b ottlenecks, MTOOL. This sp ecic impleme n-tation is for Fortran programs running on MIPS-chip based workstations. Fortranwas chosen as a target language b ecause large Fortran programs of -tenhave memorybottleneck edregions andmuchof theresearchonalleviating memory bottlenecks has concentrated onscientic code.

(9)

to the user, satisfy the deni tionof m-object, andtypically runlong enough to meet the timability criteria. The rst step of the b ottleneck isolation process is to instrument the programof interest to collect basic block counts and to run the instrumented code on a representativ e input. The basic block counts are useful not only for estimating execution time; they also provide MTOOL with precise knowledge of where memory op erations are

concentrated.

MTOOLsorts the procedures by the number of memory op erations they

execute andselects those that containthe rst 95%of all memoryop erations. For each of these procedures, MTOOL makes a list of loops and tries to measure rst the indi vidual loops and then the whole procedure, subject to timability and estimability constraints. Timability constraints are strongly systemdependent. DEC's ULTRIXprovides a1/60thof asecondgranularity

clock, but a systemcall is requiredto readthe clock.

The overhead of the systemcall p erturbs execution undesirably. The cleanest solution wouldb e to modify the op erating systemto create a clock directly accessible by user processes, but a simpler, more widely appl icable option was to add an interval timer to create a clock inuser memory. Start andstopclock calls access the clock in user memory whichis up datedbythe interrupts of the interval timer.

Using the user memory clock, MTOOL's timer has granularityof 1/60th of a second and work per start/stop call of about 70 instructions. The re -sults rep orted in Section 6 seemto indi cate this granularity and overhead are acceptable. Also, we exp ect that interest in characterizing and impr ov-ing performance will drive architects andop erating systems programmers to provide b etter clocks in the future.

At this p oint, MTOOL has gathered a collection of timable loops and procedures. Thenext stepis todetermine whichof these objects is estimable. Conceptually, MTOOL uses a simple depth rst algorithmto label each procedure inthe call graph with two parameters:

Tot(v) =1+

X cal lsfr omvto w

Tot(w) 3

#of cal lsfr omv to w total #of cal l s to w

and TDES(w) =1 +number of descendants of w:

(10)

tionof the previous sectionthat a node is estimable whenall the work of its descendants is done on its b ehalf.

Thus, we have a test for estimability. Extending the test to loops simply involv es checking the analogous conditionthat:

X wcalledinl oop

TDES(w) +1 =

X wcal led inloop

Tot(w) 3

#of call s froml oopto w total #of call s to w

:

In most cases, the loops and procedures selected by MTOOL immediately

satisfy the estimability condition. The condition is met frequently because the objects selectedby MTOOLcontainthe majority of memory op erations

whichtypically means they inv olv e the most frequently executedp ortions of code whichare normally leaves or objects all of whose childrenare leav es.

When the test is not satised immediately, several options are available. One natural choice is to mov e upthe call graphas we knowthat ev entually we will encounter an estimable node, the root. This option is not ideal, howev er, b ecause it reduces the precisionwithwhichwe localize bottlenecks. Ab etter choice is to check whether we can in fact accurately estimate the time sp ent ina procedure call. Below, we describ e three cases where we can make anaccurate estimate.

Many scientic library routines are simple, loop-free leaf procedures that alwa ys followthe same execution path. Their estimatedexecution times are consequently constant. We can often identify such procedures by checking that they meet two conditions:

1. The procedure is a loop-free andcall-free.

2. The av erage time p er call as determinedby basic blockcounts is equal to the maximumor minimumpossible time per call.

(11)

2. Select loops andprocedures containing most memory operations. 3. Eliminate selectedobjects that fail tomeet timabilityconstraints. 4. Eliminate selected objects that fail to meet estimability

con-straints.

5. Instrument code to measure actual time sp ent in remaining se -lected objects.

6. Run instrumented code, correlate actual times with estimated times to isolate b ottlenecks, and rep ort bottlenecks to the user.

Figure 1: MTOOL's Bottleneck IsolationAlgorithm

Two other simple heuristics suce to eliminate many other problematic calls. Suppose object v calls procedure p. Then, we can normally assume average time p er call fromv to pis average time p er call to vwheneither:

1. The v ast majority (98%) of calls to pare made by v, or

2. (avg. time per call to p) * (#of calls fromv to p) time sp ent in v, so anyerror in the approximation is negligible.

Both of these heuristics can b e inaccurate under pathological conditions (whenthe varianceof the executiontime of pacross calls is large), soMTOOL issues a warning whenev er it inv okes them. They hav e not causedproblems withthe benchmarks measured inthis study.

The steps MTOOLuses to isolate memory b ottlenecks are summarized in Figure 1. Step6 is of course the most signicant to the user.

5 User I nterface

(12)

Measured Time (compensated for counters): user 132.4 sys 0.7 Overhead estimates: User (memory) System (I/0)

39.3 0.7 Proceed with memory bottleneck probe insertion (y/n)?

Figure 2: MTOOL's Initial Bottleneck Estimate

counts, runs andtimes the instrumentedcode, anddisplays a result like that shownin Figure 2. MTOOLgenerates the data in the gure by estimating the run-time of the programandcomparing it withthe measuredexecution time, appropriately adjustedfor the eects of the counters. The information prevents a a user fromproceeding when no signicant memory bottlenecks are present.

Assuming the user asks MTOOLto produce an instrumented program, MTOOLexecutes steps 2to5of Figure 1producingale of m-object descri p-tors andactual execution time measurements. This summary le is handed o to the front end.

The front end' s toplev el windowis showninFigure 3for the Perfect Club b enchmark TFS.f, an air owsimulation. Overheadis dened as

(actu al time 0estimatedtime)=estimatedtime:

The histograminthelower left of the windowvisuallysummarizes the datain the upper right. The line \Measured91% of memoryoperations"reports the

p ercentage of all executedmemory operations that are included inmeasured objects. All times and overheads refer to user time only; time spent in the op erating systemon behalf of a programis ignored(thoughit was rep orted at anearlier stage{see Figure 2).

(13)

(14)

ing the text. Bottlenecks are highlighted one at a time, with the ov erall overheadcontributionandthe extra cycles p er memory operationassociated with the bottleneck rep orted at the top of the text window. Pressing the INFObutton op ens a popup windowthat gives further information about the distribution of memory op erations when the measured object includes loops or procedure calls. Figure 6 shows a simple INFOwindowfor the top b ottleneck in dflux(). Pressing PREV or NEXTdisplays the previous or next b ottleneck; bottlenecks are sortedbymagnitude. The text windowmay b e scrolled by pressing the vertical arrows at the right side of the window, and multiple text windows may be op ensimultaneously.

Three aspects of the user interface deserv e comment froman implem

en-tor's p ersp ective. First, the interface is relativ elyp ortable because it is built usingInterviews [4], anobject orientedtoolkit that runs ontopof X. Second, the notion of clicking on a bar to obtain further information on the associ -ated b ottleneck is quite general. For example, it would b e easy to supp ort an optionwhere clicking ona procedure's estimated time bar wouldprovide information implicit in our time estimate like register usage, oating p oint stalls, MFLOPS, most often executed lines, etc. Athird note ab out the interface is that is uses standard line numb er information available in the object le to relate basic block level information back to source code lines. In this sense, the user interface is language-independent.

6 Results

(15)

%of Ov er- plained InAll Median/ Max/ InAll Program mops head Ov erhd m-obj. m-obj m-obj Code tfs 91% 40% 4% 252 8 53 2020 nas 97% 25% 0% 591 84 395 3976 sds 93% 6% 2% 167 12 120 7607 lws 93% 9% 0% 100 100 100 1237 lgs 98% 12% 0% 323 35 132 2327 fpppp 85% 50% 11% 940 274 666 2718 doduc 85% 29% 5% 2541 102 477 5334 spice2g6 95% 46% 2% 756 41 434 18411 dnasa7 97% 107% 1% 257 9 118 1105

Table 2: MTOOLEectiveness

containedinthe m-objects. Onhalf of the programs, the medianobject con-tains fewer than 50 lines, but for some programs lik e fpppp, we are p erhaps failing to localize bottlenecks adequately. This problemis discussed further in the Conclusions sectionb elow.

7 Conclusions

(16)

tioned m-objects containing these highlighted memory op erations, subject to MTOOL's checks for estimability andtimability.

MTOOLcould also b e enhancedbybroadening the class of programs for whichit can detect bottlenecks. MTOOLis nowrestricted to non-recursive programs that do not use procedure v ariables. The restriction on recursion derives primarilyfromthe fact that thesimple start/stopclocktimers will not work when an m-object can b e re-entered (and the clock re-started) b efore it is exited (and the clock is stopped). The restriction could b e removedby prohibiting the timing of recursive procedures or by using more complicated timers that keep track of the depth of the recursive call and only start and stopthe clock on the rst entry and last exit. Because MTOOLis language indep endent (except for the restriction on recursion) modifying the timers (andmodifying the checkfor estimability of section4 to account for loops in the call graph) wouldallowMTOOLto handle languages withrecursion.

The restriction onprocedure v ariables is also easy to remov e. MTOOL's denitionof estimabilityreqires awell-denedcall graphandthe use of proce -dure variables means that thefull structure of the call graphis not determined until run-time. By modifying the basic block counting instrumentation to record a dynamic call graph, this restrictioncould b e circumv ented.

Thenal andperhaps most excitingapplicationfor MTOOLtechnologyis p ortingit toasharedmemorymultiprocessor. Memoryb ottlenecks onshared memory machines can b e severe and detecting themis dicult. Analytic techniques cannot handle the complexity of parallel systems and simulating multipleinteractingprocessors is extremelycomplexandexpensiv e. MTOOL should provide a viable alternativ e to these approaches.

Acknowledgements: I wouldlik e to thank DavidWall of DECWRLwho

sp onsored part of this research during a summer internship and provided invaluable advice onthe ins and outs of patching code.

8 Appendix: I nstrumenting Code

(17)

solutionis to include anindirect jumptable whichmaps every address inthe original executable to its corresp onding address in the patched executable. Indi rect jumps are always made through this table. The drawback of Pixie is that it canintroduce non-negligible ov erhead.

Asecondapproachis topatchthe code at linktime, modifyingp c-relative jumps, text addresses in the data segment, and the relocation dictionary to correctly reect changes in the executable le. The adv antage of this approach is that all jump destination addresses are clearly identiable and no ov erheadis added. See [9] for a a complete descriptionof the technique.

MTOOLuses a thirdapproach. MTOOLpatches the executable directly, but it does not use anindirect jumptable. Instead, MTOOLpatternmatches to nd instructions which load addresses in the text segment. The loaded text address is modiedtorepresent the corresp ondi ngaddress inthe instr u-mentedcode. Thus, a JR r1 will succeedbecause the instructions to loadr1 with a value hav e b een correctly up dated. This techni que suers fromthe p otential drawback that the pattern matcher could fail in highly optimized code, but we hav e encounteredno problems to date.

References

[1] W. Abu-Sufah, D. Kuck, and D. Lawrie. On the p erformance enhance -ment of paging systems through programanalysis and transformations. IEEETransanctions onComputing, 30(5):341{56, 1981.

[2] K. Galliv an, W. Jalby, U. Meier, andA. Sameh. The impact of hierarchi -cal memorysystems onlinear algebraalgorithmdesign. Techni cal Report CSRD625, University of Illinois, 1987.

[3] D. Gannon, W. Jalby, and K. Gallivan. Strategies for cache and local memory management byglobal program transformation. Journal of Par -allel andDistributedComputing, 5(5):587{616, 1988.

[4] Mark A. Linton, JohnM. Vlissides, andPaul R. Calder. Comp osing user interfaces with interviews. IEEEComputer, 22(2):8{22, 1989.

(18)

onSupercomputer Applications. PhDthesis, Rice University, May 1989.

[7] E. Rothberg andA. Gupta. Ecient sparse matrix factorizationonhi gh-p erformance workstations { exploiting the memory hierarchy. To app ear ACMTransactions onMathematical Software.

[8] E. Rothberg and A. Gupta. Techni ques for improving the p erformance of sparse matrix factorizationonmultiprocessor workstations. To app ear Proceedings of Supercomputing'90, Nov. 1990.

(19)

(20)

WRL Research Reports

‘‘Titan System Manual.’’

‘‘MultiTitan: Four Architecture Papers.’’

Michael J. K. Nielsen.

Norman P. Jouppi, Jeremy Dion, David Boggs,

Mich-WRL Research Report 86/1, September 1986.

ael J. K. Nielsen.

WRL Research Report 87/8, April 1988.

‘‘Global Register Allocation at Link Time.’’

David W. Wall.

‘‘Fast Printed Circuit Board Routing.’’

WRL Research Report 86/3, October 1986.

Jeremy Dion.

WRL Research Report 88/1, March 1988.

‘‘Optimal Finned Heat Sinks.’’

William R. Hamburgen.

‘‘Compacting Garbage Collection with Ambiguous

WRL Research Report 86/4, October 1986.

Roots.’’

Joel F. Bartlett.

‘‘The Mahler Experience: Using an Intermediate

WRL Research Report 88/2, February 1988.

Language as the Machine Description.’’

David W. Wall and Michael L. Powell.

‘‘The Experimental Literature of The Internet: An

WRL Research Report 87/1, August 1987.

Annotated Bibliography.’’

Jeffrey C. Mogul.

‘‘The Packet Filter: An Efficient Mechanism for

WRL Research Report 88/3, August 1988.

User-level Network Code.’’

Jeffrey C. Mogul, Richard F. Rashid, Michael

‘‘Measured Capacity of an Ethernet: Myths and

J. Accetta.

Reality.’’

WRL Research Report 87/2, November 1987.

David R. Boggs, Jeffrey C. Mogul, Christopher

A. Kent.

‘‘Fragmentation Considered Harmful.’’

WRL Research Report 88/4, September 1988.

Christopher A. Kent, Jeffrey C. Mogul.

WRL Research Report 87/3, December 1987.

‘‘Visa Protocols for Controlling Inter-Organizational

Datagram Flow: Extended Description.’’

‘‘Cache Coherence in Distributed Systems.’’

Deborah Estrin, Jeffrey C. Mogul, Gene Tsudik,

Christopher A. Kent.

Kamaljit Anand.

WRL Research Report 87/4, December 1987.

WRL Research Report 88/5, December 1988.

‘‘Register Windows vs. Register Allocation.’’

‘‘SCHEME->C A Portable Scheme-to-C Compiler.’’

David W. Wall.

Joel F. Bartlett.

WRL Research Report 87/5, December 1987.

WRL Research Report 89/1, January 1989.

‘‘Editing Graphical Objects Using Procedural

‘‘Optimal Group Distribution in Carry-Skip

Representations.’’

Adders.’’

Paul J. Asente.

Silvio Turrini.

WRL Research Report 87/6, November 1987.

WRL Research Report 89/2, February 1989.

‘‘The USENET Cookbook: an Experiment in

‘‘Precise Robotic Paste Dot Dispensing.’’

Electronic Publication.’’

William R. Hamburgen.

(21)

‘‘Simple and Flexible Datagram Access Controls for

‘‘Link-Time Code Modification.’’

Unix-based Gateways.’’

David W. Wall.

Jeffrey C. Mogul.

WRL Research Report 89/17, September 1989.

WRL Research Report 89/4, March 1989.

‘‘Noise Issues in the ECL Circuit Family.’’

‘‘Spritely NFS: Implementation and Performance of

Jeffrey Y.F. Tang and J. Leon Yang.

Cache-Consistency Protocols.’’

WRL Research Report 90/1, January 1990.

V. Srinivasan and Jeffrey C. Mogul.

‘‘Efficient Generation of Test Patterns Using

WRL Research Report 89/5, May 1989.

Boolean Satisfiablilty.’’

‘‘Available Instruction-Level Parallelism for Super-

Tracy Larrabee.

scalar and Superpipelined Machines.’’

WRL Research Report 90/2, February 1990.

Norman P. Jouppi and David W. Wall.

‘‘Two Papers on Test Pattern Generation.’’

WRL Research Report 89/7, July 1989.

Tracy Larrabee.

‘‘A

Unified

Vector/Scalar

Floating-Point

WRL Research Report 90/3, March 1990.

Architecture.’’

‘‘Virtual Memory vs. The File System.’’

Norman P. Jouppi, Jonathan Bertoni, and David

Michael N. Nelson.

W. Wall.

WRL Research Report 90/4, March 1990.

WRL Research Report 89/8, July 1989.

‘‘Efficient Use of Workstations for Passive

Monitor-‘‘Architectural and Organizational Tradeoffs in the

ing of Local Area Networks.’’

Design of the MultiTitan CPU.’’

Jeffrey C. Mogul.

Norman P. Jouppi.

WRL Research Report 90/5, July 1990.

WRL Research Report 89/9, July 1989.

‘‘A One-Dimensional Thermal Model for the VAX

‘‘Integration and Packaging Plateaus of Processor

9000 Multi Chip Units.’’

Performance.’’

John S. Fitch.

Norman P. Jouppi.

WRL Research Report 90/6, July 1990.

WRL Research Report 89/10, July 1989.

‘‘1990 DECWRL/Livermore Magic Release.’’

‘‘A 20-MIPS Sustained 32-bit CMOS

Microproces-Robert N. Mayo, Michael H. Arnold, Walter S. Scott,

sor with High Ratio of Sustained to Peak

Don Stark, Gordon T. Hamachi.

Performance.’’

WRL Research Report 90/7, September 1990.

Norman P. Jouppi and Jeffrey Y. F. Tang.

WRL Research Report 89/11, July 1989.

‘‘The Distribution of Instruction-Level and Machine

Parallelism and Its Effect on Performance.’’

Norman P. Jouppi.

WRL Research Report 89/13, July 1989.

‘‘Long Address Traces from RISC Machines:

Generation and Analysis.’’

Anita Borg, R.E.Kessler, Georgia Lazana, and David

W. Wall.

(22)

WRL TN 17 pdf

D E C E M B E R 1 9 9 0

WRL

Technical Note TN-17

MTOOL: A Method

For Detecting

Memory Bottlenecks

The Western Research Laboratory (WRL) is a computer systems research group that

was founded by Digital Equipment Corporation in 1982. Our focus is computer science

research relevant to the design and application of high performance scientific computers.

We test our ideas by designing, building, and using real systems. The systems we build

are research prototypes; they are not intended to become products.

There is a second research laboratory located in Palo Alto, the Systems Research

Cen-ter (SRC). Other Digital research groups are located in Paris (PRL) and in Cambridge,

Massachusetts (CRL).

Our research is directed towards mainstream high-performance computer systems. Our

prototypes are intended to foreshadow the future computing environments used by many

Digital customers. The long-term goal of WRL is to aid and accelerate the development

of high-performance uni- and multi-processors. The research projects within WRL will

address various aspects of high-performance computing.

We believe that significant advances in computer systems do not come from any single

technological advance. Technologies, both hardware and software, do not all advance at

the same pace. System design is the art of composing systems which use each level of

technology in an appropriate balance. A major advance in overall system performance

will require reexamination of all aspects of the system.

We do work in the design, fabrication and packaging of hardware; language processing

and scaling issues in system software design; and the exploration of new applications

areas that are opening up with the advent of higher performance systems. Researchers at

WRL cooperate closely and move freely among the various levels of system design. This

allows us to explore a wide range of tradeoffs to meet system goals.

We publish the results of our work in a variety of journals, conferences, research

reports, and technical notes. This document is a technical note. We use this form for

rapid distribution of technical material. Usually this represents research in progress.

Research reports are normally accounts of completed research and may include material

from earlier technical notes.

Research reports and technical notes may be ordered from us. You may mail your

order to:

Technical Report Distribution

DEC Western Research Laboratory, UCO-4

100 Hamilton Avenue

Palo Alto, California 94301 USA

Reports and notes may also be ordered by electronic mail. Use one of the following

addresses:

Digital E-net:

DECWRL::WRL-TECHREPORTS

DARPA Internet:

[email protected]

CSnet:

[email protected]

UUCP:

decwrl!wrl-techreports

instruc-MTOOL: A Method For Detecting

Memory Bottlenecks

Aaron Goldberg

Digital Equipment Corporation Western Research Laboratory

John Hennessy

Stanford University

December, 1990

Abstract

This paper presents a new method for detecting regions of a

program where the memory hierarchy is performing poorly. By

observing where actual measured execution time differs from the

time predicted given a perfect memory system, we can isolate

memory bottlenecks. MTOOL, an implementation of the

ap-proach aimed at Fortran programs running on MIPS-chip based

workstations is described and results for some of the Perfect Club

and SPEC benchmarks are reported.

WRL Research Reports

‘‘Titan System Manual.’’

‘‘MultiTitan: Four Architecture Papers.’’

Michael J. K. Nielsen.

Norman P. Jouppi, Jeremy Dion, David Boggs,

Mich-WRL Research Report 86/1, September 1986.

ael J. K. Nielsen.

WRL Research Report 87/8, April 1988.

‘‘Global Register Allocation at Link Time.’’

David W. Wall.

‘‘Fast Printed Circuit Board Routing.’’

WRL Research Report 86/3, October 1986.

Jeremy Dion.