Solutions Manual

Computer Architecture: Fundamentals and Principles of Computer Design

Joseph D. Dumas II

University of Tennessee at Chattanooga

Department of Computer Science and Electrical Engineering

Copyright © 2006

1 Introduction to Computer Architecture

1. Explain in your own words the differences between computer systems architecture and implementation. How are these concepts distinct, yet interrelated? Give a historical example of how implementation technology has affected architectural design (or vice versa).

Architecture is the logical design of a computer system, from the top level on down to the subsystems and their components – a specification of how the parts of the system will fit together and how everything is supposed to function.

Implementation is the physical realization of an architecture – an actual, working hardware system on which software can be executed.

There are many examples of advances in implementation technology affecting computer architecture. An obvious example is the advent of magnetic core memory to replace more primitive storage technologies such as vacuum tubes, delay lines, magnetic drums, etc. The new memory technology had much greater storage capacity than was previously feasible. The availability of more main memory resulted in changes to machine language instruction formats, addressing modes, and other aspects of instruction set architecture.

2. Describe the technologies used to implement computers of the first, second, third, fourth, and fifth generations. What were the main new architectural features that were introduced or popularized with each generation of machines? What advances in software went along with each new generation of hardware?

First generation computers were unique machines built with very primitive implementation technologies such as electromagnetic relays and (later) vacuum tubes. The main new architectural concept was the von Neumann stored-program paradigm itself. (The early first generation machines were not programmable in the sense we understand that term today.) Software, for those machines where the concept was actually relevant, was developed in machine language.

Second-generation computers made use of the recently invented transistor as a basic switching element. The second generation also saw the advent of magnetic core memory as a popular storage technology. At least partly in response to these technological advances, new architectural features were developed including virtual memory, interrupts, and hardware representation of floating-point numbers. Advances in software development included the use of assembly language and the first high-level languages including Fortran, Algol, and COBOL. Batch processing systems and multiprogramming operating systems were also devised during this time period.

The third generation featured the first use of integrated circuits (with multiple transistors on the same piece of semiconductor material) in computers. Not only was this technology used to create smaller CPUs requiring less wiring between components, but semiconductor memory devices began to replace core memory as well. This led to the development of minicomputer architectures that were less expensive to implement and helped give rise to families of computer systems sharing a common instruction set architecture. Software advances included increased use of virtual memory, the development of more modern, structured programming languages, and the dawn of timesharing operating systems such as UNIX.

Fourth generation computers were the first machines to use VLSI integrated circuits including microprocessors (CPUs fabricated on a single IC). VLSI technology continued to improve during this period, eventually yielding microprocessors with over one million transistors and large-capacity semiconductor RAM and ROM devices. VLSI “chips” allowed the development of inexpensive but powerful microcomputers during the fourth generation. These systems gradually began to make use of virtual memory, cache memory, and other techniques previously reserved for mainframes and minicomputers; they provided direct support for high-level languages either in hardware (CISC) or by using optimizing compilers (RISC). Other software advances included new languages like BASIC, Pascal, and C, and the first widely used object-oriented language (C++). Office software including word processors and spreadsheet applications helped microcomputers gain a permanent foothold in small businesses and homes.

Fifth generation computers exhibited fewer architectural innovations than their predecessors, but advances in implementation technology (including pipelined and superscalar CPUs and larger, faster memory devices) yielded steady gains in performance. CPU clock frequencies increased from tens, to hundreds, and eventually to thousands of megahertz; today, CPUs operating at several gigahertz are common. “Standalone” systems became less common as most computers were connected to local area networks and/or the Internet. Object-oriented software development became the dominant programming paradigm, and network-friendly languages like Java became popular.

3. What characteristics do you think the next generation of computers (say, 5-10 years from now) will display?

The answer to this question will undoubtedly vary from student to student, but might include an increased reliance on networking (especially wireless networking), increased use of parallel processing, more hardware support for graphics, sound, and other multimedia functions, etc.

4. What was the main architectural difference between the two early computers ENIAC and EDVAC?

ENIAC was not a programmable machine. Connections had to be re-wired to do a different calculation. EDVAC was based on the von Neumann paradigm, where instructions were not hard-wired but rather resided in main memory along with the data. The program, and thus the system’s functionality, could be changed without any modification to the hardware. Thus, EDVAC (and all its successors based on the von Neumann architecture) were able to run “software” as we understand it today.

5. Why was the invention of solid state electronics (in particular, the transistor) so important in the history of computer architecture?

The invention of the transistor, and its subsequent use as a switching element in computers, enabled many of the architectural enhancements that came about during the second (and later) generations of computing. Earlier machines based on vacuum tubes were limited in capability because of the short lifetime of each individual tube. A machine built with too many (more than a few thousand) switching elements could not be reliable; it would frequently “go down” due to tube failures. Transistors, with their much longer life span, enabled the construction of computers with tens or hundreds of thousands of switching elements, which allowed more complex architectures to flourish.

6. Explain the origin of the term “core dump.”

The term “core dump” dates to the second and third generations of computing, when most large computers used magnetic core memory for main storage. Since core memory was nonvolatile (retained its contents in the absence of power), when a program crashed and the machine had to be taken down and restarted, the offending instruction(s) and their operands were still in memory and could be examined for diagnostic purposes. Some later machines with semiconductor main memory mimic this behavior by “dumping” an image of a program’s memory space to disk to aid in debugging in the event of a crash.

7. What technological advances allowed the development of minicomputers, and what was the significance of this class of machines? How is a microcomputer different from a minicomputer?

The main technological development that gave rise to minicomputers was the invention of the integrated circuit. (The shrinking sizes of secondary storage devices and advances in display technology such as CRT terminals also played a part.) The significance of these machines was largely due to their reduced cost as compared to traditional mainframe computers. Because they cost “only” a few thousand dollars instead of hundreds of thousands or millions, minicomputers were available to smaller businesses (and to small workgroups or individuals within larger organizations). This trend toward proliferation and decentralization of computing resources was continued by the microcomputers of the fourth generation.

The main difference between a microcomputer and a minicomputer is the microcomputer’s use of a microprocessor (or single-chip CPU) as the main processing element. Minicomputers had CPUs consisting of multiple ICs or even multiple circuit boards. The availability of microprocessors, coupled with the miniaturization and decreased cost of other system components, made computers smaller and cheaper and thus, for the first time, accessible to the average person.

8. How have the attributes of very high performance systems (a.k.a. supercomputers) changed over the third, fourth, and fifth generations of computing?

The third generation of computing saw the development of the first supercomputer-class machines, including the IBM “Stretch”, the CDC 6600 and 7600, the TI ASC, the ILLIAC IV and others. These machines were very diverse and did not share many architectural attributes.

During the fourth generation, vector machines including the Cray-1 and its successors (and competitors) became the dominant force in high-performance computing. By processing vectors (large one-dimensional arrays) of operands in highly pipelined fashion, these machines achieved impressive performance on scientific and engineering calculations (though they did not achieve comparable performance increases on more general applications). Massively parallel machines (with many, simple processing elements) also debuted during this period.

highly parallel scalar systems using large numbers of conventional microprocessors. Many of these systems are cluster systems built around a network of relatively inexpensive, “commodity” computers.

9. What is the most significant difference between computers of the last 10-15 years versus those of previous generations?

Fifth generation computers are smaller, cheaper, faster, and have more memory than their predecessors – but probably the single most significant difference between modern systems and those of the past is the pervasiveness of networking. Almost every general-purpose or high-performance system is connected to a local area network, or a wide area network such as the Internet, via some sort of wired or wireless network connection.

10. What is the principal performance limitation of a machine based on the von Neumann (Princeton) architecture? How does a Harvard architecture machine address this limitation?

The main performance limitation of a von Neumann machine is the “von Neumann bottleneck” – the single path between the CPU and main memory, over which instructions as well as data must be accessed. A Harvard architecture removes this bottleneck by having either separate main memories for instructions and data (with a dedicated connection to each), or (much more common nowadays) by having only one main memory, but separate cache memories (see Chapter 2) for instructions and data. The separate memories can be optimized for access patterns typical of each type of memory reference in order to maximize data and instruction bandwidth to the CPU.

11. Summarize in your own words the von Neumann machine cycle.

Fetch instruction, decode instruction, determine operand address(es), fetch operand(s), perform operation, store result … repeat for next instruction.
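The cycle summarized above can be illustrated with a toy stored-program machine. This is only a minimal sketch: the four-opcode accumulator instruction set (LOAD, ADD, STORE, HALT) and the `run` function are hypothetical and are not taken from the text. Instructions and data share one memory, as in the von Neumann paradigm.

```python
def run(memory):
    """Run a hypothetical accumulator machine until HALT; return the memory."""
    pc, acc = 0, 0
    while True:
        opcode, address = memory[pc]   # fetch instruction
        pc += 1                        # point at the next instruction
        if opcode == "HALT":           # decode instruction
            return memory
        if opcode == "LOAD":
            acc = memory[address]      # fetch operand
        elif opcode == "ADD":
            acc += memory[address]     # fetch operand, perform operation
        elif opcode == "STORE":
            memory[address] = acc      # store result
        else:
            raise ValueError(f"unknown opcode: {opcode}")

# Locations 0-3 hold instructions; locations 4-6 hold data.
program = [
    ("LOAD", 4), ("ADD", 5), ("STORE", 6), ("HALT", 0),
    2, 3, 0,
]
print(run(list(program))[6])  # prints 5
```

Changing the contents of memory changes what the machine does, with no change to the `run` loop that stands in for the hardware.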

systems? Explain.

Not necessarily. If anything, a more general architecture tends to be more complex, as its designers try to make it capable of doing a wide variety of things reasonably well. This increased complexity, as compared to a more specialized architecture, may lead to a higher probability of “bugs” in the implementation, all else being equal.

13. How does “ease of use” relate to “user friendliness”?

Not at all; at least, not directly. User friendliness refers to the end user’s positive experience with the operating system and applications that run under it. Ease of use is an attribute that describes how well the architecture facilitates the development of system software such as operating systems, compilers, linkers, etc. In other words, it is a measure of “systems programmer friendliness.” While there is no direct connection, an architecture that is not “easy to use” could possibly give rise to systems software with a higher probability of bugs, which may ultimately lead to a lower quality experience on the part of the end user.

14. The obvious benefit of maintaining upward and/or forward compatibility is the ability to continue to run “legacy” code. What are some of the disadvantages of compatibility?

Building in compatibility with previous machines makes the design of an architecture more complex. This may result in higher design and implementation costs, less architectural ease of use, and a higher probability of flaws in the implementation of the design.

15. Name at least two things (other than hardware purchase price, software licensing cost, maintenance, and support) that may be considered cost factors for a computer system.

Costs are not always monetary – at least, not directly. Other cost factors, depending on the nature of the system and where it is used, might include power consumption, heat dissipation, physical volume, mass, and losses incurred if a system fails due to reliability issues.

16. Give as many reasons as you can why PC compatible computers have a larger market share than Macs.

It is probably impossible to know all the reasons, but one of the biggest is that PCs have an “open”, rather than proprietary, architecture. Almost from the very beginning, compatible “clones” were available at competitive prices, holding down not only the initial cost of buying a computer, but also the prices for software and replacement parts. Success breeds success, and the larger market share meant that manufacturers who produced PC hardware were able to invest in research and development that produced better, faster, and more economical PC compatible machines.

17. One computer system has a 3.2 GHz processor, while another has only a 2.7 GHz processor. Is it possible that the second system might outperform the first? Explain.

It is entirely possible that this might be the case. CPU clock frequency is only one small aspect of system performance. Even with a lower clock frequency (fewer clock cycles occurring each second) the second system’s CPU might outperform the first because of architectural or implementation differences that result in it accomplishing more work per clock cycle. And even if the first system’s CPU is indeed more capable, differences in the memory and/or input/output systems might still give the advantage to the second system.

18. A computer system of interest has a CPU with a clock cycle time of 2.5 ns. Machine language instruction types for this system include: integer addition/subtraction/logic instructions which require 1 clock cycle to be executed; data transfer instructions which average 2 clock cycles to be executed; control transfer instructions which average 3 clock cycles to be executed; floating-point arithmetic instructions which average 5 clock cycles to be executed; and input/output instructions which average 2 clock cycles to be executed.

system. Determine its “peak MIPS” rating for use in your advertisements.

The fastest instructions take only one clock cycle to execute, so in order to calculate peak MIPS, assume that the whole program uses only these instructions. That means that the machine will execute one instruction every 2.5 ns. Thus we calculate:

Instruction execution rate = (1 instruction) / (2.5 * 10^-9 seconds) = 4 * 10^8 instructions/second = 400 * 10^6 instructions/second = 400 MIPS

b) Suppose you have acquired this system and want to estimate its performance when running a particular program. You analyze the compiled code for this program and determine that it consists of 40% data transfer instructions, 35% integer addition, subtraction, and logical instructions, 15% control transfer instructions, and 10% I/O instructions. What MIPS rating do you expect the system to achieve while running this program?

First, we need to determine the mean number of cycles per instruction using a weighted average based on the percentages of the different types of instructions:

CPI_avg = (0.40)(2 cycles) + (0.35)(1 cycle) + (0.15)(3 cycles) + (0.10)(2 cycles) = (0.80 + 0.35 + 0.45 + 0.20) = 1.80 cycles/instruction

We already determined in part (a) above that if instructions take a single cycle, then we can execute 400 * 10^6 of them per second. This is another way of saying that the CPU clock frequency is 400 MHz. Given this knowledge and the average cycle count per instruction just calculated, we obtain:

Instruction execution rate = (400 M cycles / second) * (1 instruction / 1.8 cycles) ≈ 222 M instructions/second = 222 MIPS
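The calculations in parts (a) and (b) can be checked with a short Python sketch. The cycle counts and the instruction mix come from the problem statement; the function and variable names are illustrative only.

```python
# Average cycles per instruction class, from the problem statement.
CYCLES = {"integer": 1, "data_transfer": 2, "control": 3, "float": 5, "io": 2}
CLOCK_PERIOD_NS = 2.5

def clock_mhz(period_ns):
    """Clock frequency in MHz for a cycle time given in nanoseconds."""
    return 1000.0 / period_ns

def average_cpi(mix):
    """Weighted mean cycles per instruction; mix maps class -> fraction."""
    return sum(fraction * CYCLES[kind] for kind, fraction in mix.items())

def mips(period_ns, cpi):
    """Millions of instructions per second at the given average CPI."""
    return clock_mhz(period_ns) / cpi

program_mix = {"data_transfer": 0.40, "integer": 0.35,
               "control": 0.15, "io": 0.10}

peak = mips(CLOCK_PERIOD_NS, 1)      # every instruction takes one cycle
cpi = average_cpi(program_mix)
print(f"peak: {peak:.0f} MIPS")
print(f"CPI: {cpi:.2f}, expected: {mips(CLOCK_PERIOD_NS, cpi):.0f} MIPS")
```

Changing `program_mix` shows how sensitive the achieved MIPS rating is to the proportion of multi-cycle instructions.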

c) Suppose you are considering purchasing this system to run a variety of programs using mostly floating-point arithmetic. Of the widely-used benchmark suites discussed in this chapter, which would be the best to use in comparing this system to others you are considering?

If general-purpose floating-point performance is of interest, it would be hard to go wrong by using the SPECfp floating-point CPU benchmark suite (or some subset of it, if specific types of applications to be run on the system are known). Other possibilities include the Whetstone benchmark or (if applications of interest perform vector computations) LINPACK or Livermore Loops. Conversely, you would definitely not want to compare the systems using any of the integer-only or non-CPU-intensive benchmarks such as Dhrystone, TPC, etc.

d) What does MFLOPS stand for? Estimate this system’s MFLOPS rating; justify your answer with reasoning and calculations.

MFLOPS stands for Millions of Floating-point Operations Per Second. Peak MFLOPS can be estimated in a similar manner to parts (a) and (b) above:

Peak floating-point execution rate = (400 M cycles / second) * (1 FLOP / 5 cycles) = 80 MFLOPS

A more realistic estimate of a sustainable floating-point execution rate would have to take into account the additional operations likely to be required along with each actual numeric computation. While this would vary from one program to another, a reasonable estimate might be that for each floating-point arithmetic operation, the program might also perform two data transfers (costing a total of four clock cycles) plus one control transfer (costing three clock cycles). This would mean that the CPU could only perform one floating-point computation every 12 clock cycles for a sustained execution rate of (400 M cycles / second) * (1 FLOP / 12 cycles) ≈ 33 MFLOPS. The student may come up with a variety of estimates based on different assumptions, but any realistic estimate would be significantly less than the 80 MFLOPS peak rate.
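The peak and sustained estimates above can be reproduced in a few lines of Python. The per-FLOP overhead (two data transfers plus one control transfer) is the assumption made in the answer, not a property of the machine; other assumptions give other sustained figures.

```python
CLOCK_MHZ = 400            # from part (a): 1 / 2.5 ns
FP_CYCLES = 5              # floating-point arithmetic instruction
DATA_TRANSFER_CYCLES = 2
CONTROL_CYCLES = 3

def mflops(clock_mhz, cycles_per_flop):
    """Millions of floating-point operations per second."""
    return clock_mhz / cycles_per_flop

peak = mflops(CLOCK_MHZ, FP_CYCLES)

# Assumed overhead per floating-point operation, as in the answer above:
# two data transfers and one control transfer.
cycles_per_flop = FP_CYCLES + 2 * DATA_TRANSFER_CYCLES + CONTROL_CYCLES
sustained = mflops(CLOCK_MHZ, cycles_per_flop)

print(f"peak: {peak:.0f} MFLOPS")
print(f"sustained ({cycles_per_flop} cycles/FLOP): {sustained:.0f} MFLOPS")
```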

19. Why does a hard disk that rotates at higher RPM generally outperform one that rotates at lower RPM? Under what circumstances might this not be the case?

write data on a rotating disk. These are the time required to step the read/write head in or out to the desired track, the rotational delay in getting to the start of the desired sector within that track, and then the time needed to actually read or write the sector in question. All else being equal, increasing disk RPM reduces the time it takes for the disk to make a revolution and so tends to reduce the second and third delay components, while it does nothing to address the first. If the higher-RPM drive had a longer track-to-track seek time, though, it might take just as long or even longer, overall, to access desired data as compared with a lower-RPM drive with shorter track-to-track access time.
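The trade-off can be made concrete with a rough model of the three delay components. This is a sketch under stated assumptions: the average rotational delay is taken as half a revolution, one sector is transferred per access, and the drive parameters (seek times and 500 sectors per track) are hypothetical values chosen for illustration.

```python
def access_time_ms(rpm, seek_ms, sectors_per_track):
    """Approximate time to access one sector, in milliseconds."""
    revolution_ms = 60_000.0 / rpm
    rotational_delay_ms = revolution_ms / 2           # assumed: half a turn on average
    transfer_ms = revolution_ms / sectors_per_track   # time for one sector to pass the head
    return seek_ms + rotational_delay_ms + transfer_ms

# Same seek time: the higher-RPM drive is faster.
print(access_time_ms(7200, 8.0, 500), access_time_ms(5400, 8.0, 500))

# Hypothetical 7200 RPM drive with a slower seek versus a 5400 RPM drive
# with a faster seek: the lower-RPM drive wins overall.
print(access_time_ms(7200, 12.0, 500), access_time_ms(5400, 8.0, 500))
```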

20. A memory system can read or write a 64-bit value every 2 ns. Express its bandwidth in MB/s.

Since one byte equals 8 bits, a 64-bit value is 8 bytes. So we can compute the memory bandwidth as:

BW = (8 bytes) / (2 * 10^-9 seconds) = 4 * 10^9 bytes/second = 4 GB/s or 4000 MB/s
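The same conversion can be written as a small Python helper. The function name is illustrative, and MB here means 10^6 bytes, matching the calculation above.

```python
def bandwidth_mb_per_s(bits_per_transfer, transfer_time_ns):
    """Bandwidth in MB/s (1 MB = 10^6 bytes) for one transfer per time slot."""
    bytes_per_transfer = bits_per_transfer / 8
    bytes_per_ns = bytes_per_transfer / transfer_time_ns
    return bytes_per_ns * 1000   # 1 byte/ns = 10^9 bytes/s = 1000 MB/s

print(bandwidth_mb_per_s(64, 2))  # prints 4000.0
```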

21. If a manufacturer’s brochure states that a given system can perform I/O operations at 500 MB/s, what questions would you like to ask the manufacturer’s representative regarding this claim?

One should probably ask under what conditions this data transfer rate can be achieved. If it is a “peak” transfer rate, it is probably unattainable under any typical circumstances. It would be very helpful to know the size of the blocks of data being transferred and the length of time for which this 500 MB/s rate was sustained. If this is a peak rate, odds are good that it is only valid for fairly large block transfers of optimum size, and for very short periods of time. This may or may not reflect the nature of the I/O demands of a customer’s application.

22. Fill in the blanks below with the most appropriate term or concept discussed in this chapter:

the conceptual or block-level design

Babbage’s Analytical Engine - this was the first design for a programmable digital computer, but a working model was never completed

Integrated circuits - this technological development was an important factor in moving from second generation to third generation computers

CDC 6600 - this system is widely considered to have been the first supercomputer

Altair - this early microcomputer kit was based on an 8-bit microprocessor; it introduced 10,000 hobbyists to (relatively) inexpensive personal computing

Microcontroller - this type of computer is embedded inside another electronic or mechanical device such as a cellular telephone, microwave oven, or automobile transmission

Harvard architecture - a type of computer system design in which the CPU uses separate memory buses for accessing instructions and data operands

Compatibility - an architectural attribute that expresses the support provided for previous or other architectures by the current machine

MFLOPS - a CPU performance index that measures the rate at which computations can be performed on real numbers rather than integers

Bandwidth - a measure of memory or I/O performance that tells how much data can be transferred to or from a device per unit of time

Benchmark - a program or set of programs that are used as standardized means of

2 Computer Memory Systems

1. Consider the various aspects of an ideal computer memory discussed in Section 2.1.1 and the characteristics of available memory devices discussed in Section 2.1.2. Fill in the columns of the table below with the following types of memory devices, in order from most desirable to least desirable: magnetic hard disk, semiconductor DRAM, CD-R, DVD-RW, semiconductor ROM, DVD-R, semiconductor flash memory, magnetic floppy disk, CD-RW, semiconductor static RAM, semiconductor EPROM.

Cost/bit (will obviously fluctuate somewhat depending on market conditions): CD-R, DVD-R, CD-RW, DVD-RW, magnetic hard disk, magnetic floppy disk, semiconductor DRAM, semiconductor ROM, semiconductor EPROM, semiconductor flash memory, semiconductor static RAM.

Speed (will vary somewhat depending on specific models of devices): semiconductor static RAM, semiconductor DRAM, semiconductor ROM, semiconductor EPROM, semiconductor flash memory, magnetic hard disk, DVD-R, DVD-RW, CD-R, CD-RW, magnetic floppy disk.

Information Density (again, this may vary by specific types of devices): Magnetic hard disk, DVD-R and DVD-RW, CD-R and CD-RW, semiconductor DRAM, semiconductor ROM, semiconductor EPROM, semiconductor flash memory, semiconductor static RAM, magnetic floppy disk.

Volatility: Optical media such as DVD-R, CD-R, DVD-RW, and CD-RW are all equally nonvolatile. The read-only variants cannot be erased and provide secure storage unless physically damaged. (The same is true of semiconductor ROM.) The read-write optical disks (and semiconductor EPROMs and flash memories) may be intentionally or accidentally erased, but otherwise retain their data indefinitely in the absence of physical damage. Magnetic hard and floppy disks are nonvolatile except in the presence of strong external magnetic fields. Semiconductor static RAM is volatile, requiring continuous application of electrical power to maintain stored data. Semiconductor DRAM is even more volatile since it requires not only electrical power, but also periodic data refresh in order to maintain its contents.

Writability (all memory is readable): Magnetic hard and floppy disks and semiconductor static RAM and DRAM can be written essentially indefinitely, and as quickly and easily as they can be read. DVD-RW, CD-RW, and semiconductor flash memory can be written many times, but not indefinitely, and the write operation is usually slower than the read operation. Semiconductor EPROMs can be written multiple times, but only in a special programmer, and only after a relatively long erase cycle under ultraviolet light. DVD-R and CD-R media can be written once and only once by the user. Semiconductor ROM is pre-loaded with its binary information at the factory and can never be written by the user.

Power Consumption: All types of optical and magnetic disks as well as semiconductor ROMs, EPROMs, and flash memories can store data without power being applied at all. Semiconductor RAMs require continuous application of power to retain data, with most types of SRAMs being more power-hungry than DRAMs. (Low-power CMOS static RAMs, however, are commonly used to maintain data for long periods of time with a battery backup.) While data are being read or written, all memories require power. Semiconductor DRAM requires relatively little power, while semiconductor ROMs, flash memories, and EPROMs tend to require more and SRAMs, more still. All rotating disk drives, magnetic and optical, require significant power in order to spin the media and move the read/write heads as well as to actually perform the read and write operations. The specifics vary considerably from device to device, but those that rotate the media at higher speeds tend to use slightly more power.

Durability: In general, the various types of semiconductor memories are more durable than disk memories because they have no moving parts. Only severe physical shock or static discharges are likely to harm them. (CMOS devices are particularly susceptible to damage from static electricity.) Optical media are also very durable; they are nearly impervious to most dangers except that of surface scratches. Magnetic media such as floppy and hard disks tend to be the least durable as they are subject to erasure by strong magnetic fields and also are subject to “head crashes” when physical shock causes the read-write head to impact the media surface.

Removability/Portability: Flash memory, floppy disks, and optical disks are eminently portable and can easily be carried from system to system to transfer data. A few magnetic hard drives are designed to be portable, but most are permanently installed in a given system and require some effort for removal. Semiconductor ROMs and EPROMs, if placed in sockets rather than being soldered directly to a circuit board, can be removed and transported along with their contents. Most semiconductor RAM devices lose their contents when system power is removed and, while they could be moved to another system, would not arrive containing any valid data.

2. Describe in your own words what a hierarchical memory system is and why it is used in the vast majority of modern computer systems.

A hierarchical memory system is one that is comprised of several types of memory devices with different characteristics, each occupying a “level” within the overall structure. The higher levels of the memory system (the ones closest to, or a part of, the CPU) offer faster access but, due to cost factors and limited physical space, have a smaller storage capacity. Thus, each level can typically hold only a portion of the data stored in the next lower level. As one moves down to the lower levels, speed and cost per bit generally decrease, but capacity increases. At the lowest levels, the devices offer a great deal of (usually nonvolatile) storage at relatively low cost, but are quite slow. For the overall system to perform well, the hierarchy must be managed by hardware and software such that the stored items that are used most frequently are located in the higher levels, while items that are used less frequently are relegated to the lower levels.

3. What is the fundamental, underlying reason why low-order main memory interleaving and/or cache memories are needed and used in virtually all high-performance computer systems?

The main underlying reason why speed-enhancing techniques such as low-order interleaving and cache continue to be needed and used in computer systems is that main memory technology has never been able to keep up with the speed of processor implementation technologies. The CPUs of each generation have always been faster than any devices (from the days of delay lines, magnetic drums, and core memory all the way up to today’s high-capacity DRAM ICs) that were feasible, from a cost standpoint, to be used as main memory. If anything, the CPU-memory speed gap has widened rather than narrowed over the years. Thus, the speed and size of a system’s cache may be even more critical to system performance than almost any other factor. (If you don’t believe this, examine the performance difference between an Intel Pentium 4 and an otherwise similar Celeron processor.)

4. A main memory system is designed using 15 ns RAM devices using a 4-way low-order interleave.

(a) What would be the effective time per main memory access under ideal conditions?

Under ideal conditions, four memory accesses would be in progress at any given time due to the low-order interleaving scheme. This means that the effective time per main memory access would be (15 / 4) = 3.75 ns.

(b) What would constitute “ideal conditions”? (In other words, under what circumstances could the access time you just calculated be achieved?)

The ideal condition for best performance of the memory system would be continuous access to sequentially numbered memory locations. Equivalently, any access pattern that consistently used all three of the other “leaves” before returning to the one just accessed would have the same benefit. Examples would include accessing every fifth numbered location, or every seventh, or any spacing that is relatively prime with 4 (the interleaving factor).

(c) What would constitute “worst-case conditions”? (In other words, under what circumstances would memory accesses be the slowest?) What would the access time be in this worst-case scenario? If ideal conditions exist 80% of the time and worst-case conditions occur 20% of the time, what would be the average time required per memory access?

The worst case would be a situation where every access went to the same device or group of devices. This would happen if the CPU needed to access every fourth numbered location (or every eighth, or any spacing that is an integer multiple of 4). In this case, access time would revert to that of an individual device (15 ns) and the interleaving would provide no performance benefit at all.

In the hypothetical situation described, we could take a weighted average to determine the effective access time for the memory system: (0.80)(3.75 ns) + (0.20)(15 ns) = (3 + 3) = 6 ns.
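The effect of access stride in parts (a) through (c) can be sketched in Python. This is a simplified model, not a description of any particular memory controller: it assumes a fixed stride, requests issued back-to-back, and a device that is busy for its full 15 ns after each access. The intermediate stride-2 case (only two leaves in use) is an extension that the problem itself does not discuss.

```python
from math import gcd

def banks_visited(stride, ways=4):
    """Distinct leaves that a fixed-stride access stream cycles through."""
    return ways // gcd(stride, ways)

def effective_access_ns(stride, device_ns=15.0, ways=4):
    """Steady-state time per access when each leaf is busy for device_ns."""
    return device_ns / banks_visited(stride, ways)

def average_access_ns(ideal_fraction, device_ns=15.0, ways=4):
    """Weighted average of the best case (stride 1) and worst case (stride = ways)."""
    best = effective_access_ns(1, device_ns, ways)
    worst = effective_access_ns(ways, device_ns, ways)
    return ideal_fraction * best + (1 - ideal_fraction) * worst

for stride in (1, 2, 4, 5, 7, 8):
    print(f"stride {stride}: {effective_access_ns(stride):.2f} ns per access")

print(f"80% ideal, 20% worst: {average_access_ns(0.80):.2f} ns")
```

Strides that are relatively prime with 4 reach the 3.75 ns best case, while multiples of 4 fall back to the 15 ns device time.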

(d) When ideal conditions exist, we would like the processor to be able to access memory every clock cycle with no “wait states” (that is, without any cycles wasted waiting for memory to respond). Given this requirement, what is the highest processor bus clock frequency that can be used with this memory system?

In part (a) above, we found the best-case memory access time to be 3.75 ns. Matching the CPU bus cycle time to this value and taking the reciprocal (since f = 1/T) we obtain:

f = 1/T = (1 cycle) / (3.75 * 10^-9 seconds) ≈ 2.67 * 10^8 cycles/second = 267 MHz.

(e) Other than increased hardware cost and complexity, are there any potential

such disadvantage and the circumstances under which it might be significant.

The main disadvantage that could come into play is due to the fact that under ideal conditions, all memory modules are busy all the time. This is good if only one device (usually the CPU) needs to access memory, but not good if other devices need to access memory as well (for example, to perform I/O). Essentially all the memory bandwidth is used up by the first device, leaving little or none for others.

Another possible disadvantage is lower memory system reliability due to decreased fault tolerance. In a high-order interleaved system, if one memory device were to fail, 3/4 of the memory space would still be usable. In the low-order interleaved case, if one of the four “leaves” fails, the entire main memory space is effectively lost.

5. Is it correct to refer to a typical semiconductor integrated circuit ROM as a “random access memory”? Why or why not? Name and describe two other logical organizations of computer memory that are not “random access.”

It is correct to refer to a semiconductor ROM as a “random access memory” in the strict sense of the definition – a “random access” memory is any memory device that has an access time independent of the specific location being accessed. (In other words, any randomly chosen location can be read or written in the same amount of time as any other location.) This is equally true of most semiconductor read-only memories as it is of semiconductor read/write memories (which are commonly known as “RAMs”). Because of the commonly-used terminology, it is probably better not to confuse the issue by referring to a ROM IC as a “RAM”, even though that is technically a correct statement.

Besides random access, the other two logical memory organizations that may be found in computer systems are sequential access (typical of tape and disk

6. Assume that a given system’s main memory has an access time of 6.0 ns, while its cache has an access time of 1.2 ns (five times as fast). What would the hit ratio need to be in order for the effective memory access time to be 1.5 ns (four times as fast as main memory)?

Since effective memory access time in such a system is based on a weighted average, we would need to solve the following equation:

t_a(effective) = t_a(cache) * (p_h) + t_a(main) * (1 - p_h)

for the particular values given in the problem, as shown: 1.5 ns = (1.2 ns)(p_h) + (6.0 ns)(1 - p_h)

Using basic algebra we solve to obtain p_h = 0.9375.
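The algebra can be sketched in Python by rearranging the weighted-average equation for the hit ratio. The function name and the argument checks are illustrative assumptions, not part of the original solution.

```python
def required_hit_ratio(t_cache, t_main, t_target):
    """Hit ratio needed so that the weighted-average access time equals t_target.

    Rearranged from t_target = t_cache * p_h + t_main * (1 - p_h).
    """
    if not (t_cache < t_main and t_cache <= t_target <= t_main):
        raise ValueError("target must lie between cache and main memory times")
    return (t_main - t_target) / (t_main - t_cache)

hit_ratio = required_hit_ratio(t_cache=1.2, t_main=6.0, t_target=1.5)
print(round(hit_ratio, 4))
```

Substituting the result back into the weighted average should reproduce the 1.5 ns target, which is a quick way to confirm the rearrangement.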

7. A particular program runs on a system with cache memory. The program makes a total of 250,000 memory references; 235,000 of these are to cached locations.

(a) What is the hit ratio in this case?

p_h = number of hits / (number of hits + number of misses) = 235,000 / 250,000 = 0.94

(b) If the cache can be accessed in 1.0 ns but the main memory requires 7.5 ns for an access to take place, what is the average time required by this program for a memory access assuming all accesses are reads?

t_a(effective) = t_a(cache) * (p_h) + t_a(main) * (1 - p_h) = (1.0 ns)(0.94) + (7.5 ns)(0.06) = (0.94 + 0.45) ns = 1.39 ns

(c) What would be the answer to part (b) if a write-through policy is used and 75% of memory accesses are reads?

If a write-through policy is used, then all writes require a main memory access and write hits do nothing to improve memory system performance. The average write access time is equal to the main memory access time, which is 7.5 ns. The average read access time is equal to 1.39 ns as calculated in (b) above. The overall average time per memory access is thus given by: (0.75)(1.39 ns) + (0.25)(7.5 ns) = (1.0425 + 1.875) ns = 2.9175 ns, or approximately 2.92 ns.
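A minimal Python sketch of this calculation follows, using the reference counts and access times from the problem. It assumes, as the answer states, that every write costs a full main memory access; the function names are illustrative.

```python
def read_time_ns(hit_ratio, t_cache, t_main):
    """Weighted-average read time for a given hit ratio."""
    return hit_ratio * t_cache + (1 - hit_ratio) * t_main

def write_through_average_ns(read_fraction, hit_ratio, t_cache, t_main):
    """Average access time when every write goes to main memory."""
    t_read = read_time_ns(hit_ratio, t_cache, t_main)
    t_write = t_main  # write hits give no speedup under write-through
    return read_fraction * t_read + (1 - read_fraction) * t_write

hit_ratio = 235_000 / 250_000
reads_only = read_time_ns(hit_ratio, t_cache=1.0, t_main=7.5)
with_writes = write_through_average_ns(0.75, hit_ratio, t_cache=1.0, t_main=7.5)
print(round(reads_only, 4), round(with_writes, 4))
```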

8. Is hit ratio a dynamic or static performance parameter in a typical computer memory system? Explain your answer.

Hit ratio is a dynamic parameter in any practical computer system. Even though the cache and main memory sizes, mapping strategy, replacement policy, etc. (which can all affect the hit ratio) are constant within a given system, the proportion of cache hits to misses will still vary from one program to another. It will also vary widely within a given run, based on such factors as the length of time the program has been running, the code structure (procedure calls, loops, etc.) and the properties of the specific data set being operated on by the program.

9. What are the advantages of a set-associative cache organization as opposed to a direct-mapped or fully associative mapping strategy?

A set-associative cache organization is a compromise between the direct-mapped and fully associative organizations that attempts to maximize the advantages of each while minimizing their respective disadvantages. Fully associative caches are expensive to build but offer a higher hit ratio than direct-mapped caches of the same size. Direct-mapped caches are cheaper and less complex to build but performance can suffer due to usage conflicts between lines with the same index. By limiting associativity to just a few parallel comparisons (two- and four-way set-associative caches are most common) the set-associative organization can achieve nearly the same hit ratio as a fully associative design at a cost not much greater than that of a direct-mapped cache.

10. A computer has 64 MB of byte-addressable main memory. It is proposed to design a 1 MB cache memory with a refill line (block) size of 64 bytes.

(a) Show how the memory address bits would be allocated for a direct-mapped cache organization.

Since 64M = 2^26, the total number of bits required to address the main memory space is 26. And since 64 = 2^6, it takes 6 bits to identify a particular byte within a line. The number of refill lines in the cache is 1M / 64 = 2^20 / 2^6 = 2^14 = 16K. Since there are 2^14 lines in the cache, 14 index bits are required. 26 total address bits – 6 “byte” bits – 14 “index” bits leaves 6 bits to be used for the tag. So the address bits would be partitioned as follows: Tag (6 bits) | Index (14 bits) | Byte (6 bits)

(b) Repeat part (a) for a four-way set-associative cache organization.

For the purposes of this problem, a four-way set-associative cache can be treated as four direct-mapped caches operating in parallel, each one-fourth the size of the cache described above. Each of these four smaller units would thus be 256 KB in size, containing 4K = 2^12 refill lines. Thus, 12 bits would need to be used for the index, and 26 – 6 – 12 = 8 bits would be used for the tag. The address bits would be partitioned as follows: Tag (8 bits) | Index (12 bits) | Byte (6 bits)

(c) Repeat part (a) for a fully associative cache organization.

In a fully associative cache organization, no index bits are required. Therefore the tags would be 26 – 6 = 20 bits long. Addresses would be partitioned as follows: Tag (20 bits) | Byte (6 bits)
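The three partitions in parts (a) through (c) follow one pattern, which the Python sketch below captures. It assumes all sizes are powers of two; the function name and the "ways" parameter (1 for direct-mapped, the total line count for fully associative) are illustrative choices.

```python
def field_widths(memory_bytes, cache_bytes, line_bytes, ways):
    """Return (tag_bits, index_bits, byte_bits); sizes must be powers of two."""
    address_bits = (memory_bytes - 1).bit_length()
    byte_bits = (line_bytes - 1).bit_length()
    lines = cache_bytes // line_bytes
    sets = lines // ways                  # one line per set when ways == 1
    index_bits = (sets - 1).bit_length()  # zero when there is a single set
    tag_bits = address_bits - index_bits - byte_bits
    return tag_bits, index_bits, byte_bits

MB = 2 ** 20
total_lines = (1 * MB) // 64
direct_mapped = field_widths(64 * MB, 1 * MB, 64, ways=1)
four_way = field_widths(64 * MB, 1 * MB, 64, ways=4)
fully_associative = field_widths(64 * MB, 1 * MB, 64, ways=total_lines)
print(direct_mapped, four_way, fully_associative)
```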

(d) Given the direct-mapped organization, and ignoring any extra bits that might be needed (valid bit, dirty bit, etc.), what would be the overall size (“depth” by “width”) of the memory used to implement the cache? What type of memory devices would be used to implement the cache (be as specific as possible)?

The overall size of the direct-mapped cache would be:

(16K lines) * (64 data bytes + 6 bit tag) = (16,384) * ((64 * 8) + 6) = (16,384 * 518) = 8,486,912 bits. This would be in the form of a fast 16K by 518 static RAM.
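The size calculation can be reproduced with a few lines of Python. As in the problem statement, the sketch ignores valid and dirty bits; the constant names are illustrative.

```python
LINES = 16 * 1024   # 16K refill lines in the direct-mapped cache
LINE_BYTES = 64     # data bytes per line
TAG_BITS = 6        # tag width for the direct-mapped organization

width_bits = LINE_BYTES * 8 + TAG_BITS   # data bits plus tag bits per line
total_bits = LINES * width_bits          # depth times width
print(width_bits, total_bits)
```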

(e) Which line(s) of the direct-mapped cache could main memory location 1E0027A (hexadecimal) map into? (Give the line number(s), which will be in the range of 0 to (n-1) if there are n lines in the cache.) Give the memory address (in hexadecimal) of another location that could not reside in cache at the same time as this one (if such a location exists).

To answer this question, we need to write the memory address in binary. 1E0027A hexadecimal equals 01111000000000001001111010 binary. We can break this down into a tag of 011110, an index of 00000000001001 and a byte offset within the line of 111010. In a direct-mapped cache, the binary index tells us the number of the only line that can contain the given memory location. So, this location can only reside in line 1001 (binary) = 9 decimal.

Any other memory location with the same index but a different tag could not reside in cache at the same time as this one. One example of such a location would be the one at address 2F0027A (hexadecimal).
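The decomposition can be verified with shifts and masks, as in the Python sketch below. The field widths come from part (a); the helper names are illustrative.

```python
BYTE_BITS = 6     # 64-byte refill line
INDEX_BITS = 14   # 16K lines in the direct-mapped cache

def split_address(address):
    """Split a 26-bit address into (tag, index, byte offset)."""
    offset = address & ((1 << BYTE_BITS) - 1)
    index = (address >> BYTE_BITS) & ((1 << INDEX_BITS) - 1)
    tag = address >> (BYTE_BITS + INDEX_BITS)
    return tag, index, offset

def conflict(address_a, address_b):
    """True if both addresses need the same line but carry different tags."""
    tag_a, index_a, _ = split_address(address_a)
    tag_b, index_b, _ = split_address(address_b)
    return index_a == index_b and tag_a != tag_b

tag, index, offset = split_address(0x1E0027A)
print(f"tag={tag:06b} index={index} offset={offset:06b}")
print(conflict(0x1E0027A, 0x2F0027A))
```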

11. Define and describe virtual memory. What are its purposes, and what are the advantages and disadvantages of virtual memory systems?

Virtual memory is a technique that separates the (virtual) addresses used by the software from the (physical) addresses used by the memory system hardware. Each virtual address referenced by a program goes through a process of translation (or mapping) that resolves it into the correct physical address in main memory, if such a mapping exists. If no mapping is defined, the desired information is loaded from secondary memory and an appropriate mapping is created. The translation process is overseen by the operating system, with much of the work done in hardware by a memory management unit (MMU) for speed reasons. It is usually done via a multi-level table lookup procedure, with the MMU internally caching frequently- or recently-used translations so that the costly (in terms of performance) table lookups can be avoided most of the time.

The principal advantage of virtual memory is that it frees the programmer from the burden of fitting his or her code into available memory, giving the illusion of a large memory space exclusively owned by the program (rather than the usually much more limited physical main memory space that is shared with other resident programs). The main disadvantage is the overhead of implementing the virtual memory scheme, which invariably results in some increase in average access time vs. a system using comparable technology with only physical memory. Table lookups take time, and even when a given translation is cached in the MMU’s Translation Lookaside Buffer, there is some propagation delay involved in address translation.

12. Name and describe the two principal approaches to implementing virtual memory systems. How are they similar and how do they differ? Can they be combined, and if so, how?

The two principal approaches to implementing virtual memory (VM) are demand-paged VM and demand-segmented VM (paging and segmentation, for short).

They are similar in that both map a virtual (or logical) address space to a physical address space using a table lookup process managed by an MMU and overseen by the computer’s operating system. They are different in that paging maps fixed-size regions of memory called pages, while segmentation maps variable-length segments. Page size is usually determined by hardware considerations such as disk sector size, while segment size is determined by the structure of the program’s code and data. A paged system can concatenate the offset within a page with the translated upper address bits, while a segmented system must translate a logical address into the complete physical starting address of a segment and then add the segment offset to that value.
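The difference between concatenating and adding can be sketched in Python. The 4 KB page size, the dictionary-based tables, and the function names are assumptions made for illustration only; a real MMU performs these steps in hardware.

```python
PAGE_BITS = 12  # assumed page size of 4 KB

def translate_paged(virtual_address, page_table):
    """Paging: replace the upper bits, then concatenate the unchanged offset."""
    page = virtual_address >> PAGE_BITS
    offset = virtual_address & ((1 << PAGE_BITS) - 1)
    frame = page_table[page]              # KeyError here models a page fault
    return (frame << PAGE_BITS) | offset

def translate_segmented(segment, offset, segment_table):
    """Segmentation: look up a full base address, then add the offset."""
    base, length = segment_table[segment]  # KeyError models a segment fault
    if offset >= length:
        raise ValueError("offset lies outside the segment")
    return base + offset

paged = translate_paged(0x2ABC, {2: 7})
segmented = translate_segmented(1, 0x10, {1: (0x12340, 0x100)})
print(hex(paged), hex(segmented))
```

The paged case needs no adder because page frames are aligned to the page size, whereas a segment may begin at any address and therefore requires a true addition plus a bounds check.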

It is possible to create a system that uses aspects of both approaches; specifically, one in which the variable-length segments are each composed of one or more fixed-size pages. This approach, known as segmentation with paging, trades off some of the disadvantages of each approach to try to take advantage of their strengths.

13. What is the purpose of having multiple levels of page or segment tables rather than a single table for looking up address translations? What are the disadvantages, if any, of this scheme?

The main purpose of having multiple-level page or segment tables is to replace one huge mapping table with a hierarchy of smaller ones. The advantage is that the tables are smaller (remember, they are stored in main memory, though some entries may be cached) and easier for the operating system to manage. The disadvantage is that “walking” the hierarchical sequence of tables takes longer than a single table lookup. Most systems have a TLB to cache recently-used address translations, though, so this time penalty is usually only incurred once when a given page or segment is first loaded into memory (or perhaps again later if the TLB fills up and a displaced entry has to be reloaded).
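A two-level table walk with a TLB in front of it might look like the following Python sketch. The field widths, the nested-dictionary tables, and the function name are assumptions chosen for illustration; they do not describe any particular MMU.

```python
OFFSET_BITS = 12   # assumed page size of 4 KB
LEVEL_BITS = 10    # assumed index width for each of the two table levels

def translate(virtual_address, root_table, tlb):
    """Return (physical address, number of table lookups performed)."""
    page = virtual_address >> OFFSET_BITS
    offset = virtual_address & ((1 << OFFSET_BITS) - 1)
    if page in tlb:                          # TLB hit: no table walk needed
        return (tlb[page] << OFFSET_BITS) | offset, 0
    top = page >> LEVEL_BITS                 # index into the first-level table
    low = page & ((1 << LEVEL_BITS) - 1)     # index into the second-level table
    second_level = root_table[top]           # first lookup
    frame = second_level[low]                # second lookup
    tlb[page] = frame                        # remember the translation
    return (frame << OFFSET_BITS) | offset, 2

tables = {1: {2: 0x55}}
tlb = {}
address = (1 << 22) | (2 << 12) | 0x34
first = translate(address, tables, tlb)    # walks both table levels
second = translate(address, tables, tlb)   # served from the TLB
print(first, second)
```

The second call returns the same physical address with zero table lookups, which is the effect the answer describes: the walk penalty is paid only when the translation is not already cached.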

14. A process running on a system with demand-paged virtual memory generates the following reference string (sequence of requested pages): 4, 3, 6, 1, 5, 1, 3, 6, 4, 2, 2, 3. The operating system allocates each process a maximum of four page frames at a time. What will be the number of page faults for this process under each of the following page replacement policies?

a) LRU 7 page faults

b) FIFO 8 page faults

c) LFU (with FIFO as tiebreaker) 7 page faults
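These counts can be reproduced with a small simulation, sketched below in Python. The LFU version assumes that a page's use count covers only its current stay in memory (the count restarts at 1 when the page is reloaded), which is the interpretation that yields 7 faults; the function names are illustrative.

```python
def fifo_faults(references, frames):
    resident = []                      # oldest loaded page first
    faults = 0
    for page in references:
        if page in resident:
            continue
        faults += 1
        if len(resident) == frames:
            resident.pop(0)            # evict the page loaded longest ago
        resident.append(page)
    return faults

def lru_faults(references, frames):
    resident = []                      # least recently used page first
    faults = 0
    for page in references:
        if page in resident:
            resident.remove(page)      # will be re-appended as most recent
        else:
            faults += 1
            if len(resident) == frames:
                resident.pop(0)        # evict the least recently used page
        resident.append(page)
    return faults

def lfu_faults(references, frames):
    resident = []                      # load order, used as the FIFO tiebreaker
    uses = {}                          # use counts for resident pages only
    faults = 0
    for page in references:
        if page in resident:
            uses[page] += 1
            continue
        faults += 1
        if len(resident) == frames:
            # min() returns the first minimum, i.e. the oldest page on a tie
            victim = min(resident, key=lambda p: uses[p])
            resident.remove(victim)
            del uses[victim]
        resident.append(page)
        uses[page] = 1
    return faults

REFERENCES = [4, 3, 6, 1, 5, 1, 3, 6, 4, 2, 2, 3]
print(lru_faults(REFERENCES, 4), fifo_faults(REFERENCES, 4), lfu_faults(REFERENCES, 4))
```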

15. In what ways are cache memory and virtual memory similar? In what ways are they different?

Cache memory and virtual memory are similar in several ways. Both involve the interaction between two levels of a hierarchical memory system – one larger and slower, the other smaller and faster. Both have the goal of performing close to the speed of the smaller, faster memory while taking advantage of the capacity of the larger, slower one; both depend on the principle of locality of reference to achieve this. Both operate on a demand basis and both perform a mapping of addresses generated by the CPU.

One significant difference is the size of the blocks of memory that are mapped and transferred between levels of the hierarchy. Cache lines tend to be significantly smaller than pages or segments in a virtual memory system. Because of the size of the mapped areas as well as the speed disparity between levels of the memory system, cache misses tend to be more frequent, but less costly in terms of performance, than page or segment faults in a VM system. Cache control is done entirely in hardware, while virtual memory management is accomplished via a combination of hardware (the MMU) and software (the operating system). Cache exists for the sole reason of making main memory appear faster than it really is; virtual memory has several purposes, one of which is to make main memory appear larger than it is, but also to support multiprogramming, relocation of code and data, and the protection of each program’s memory space from other programs.

16. In systems which make use of both virtual memory and cache, what are the advantages of a virtually addressed cache? Does a physically addressed cache have any advantages of its own, and if so, what are they? Describe a situation in which one of these approaches would have to be used because the other would not be feasible.

All else being equal, a virtually addressed cache is faster than a physically addressed cache because no address translation is required prior to checking the tags to see if a hit has occurred. The appropriate bits from the virtual address are matched against the (virtual) tags. In a physically addressed cache, the virtual-to-physical translation must be done before the tags can be matched. A physically addressed cache does have some advantages, though, including the ability to perform task switches without having to flush (invalidate) the contents of the cache. In a situation where the MMU is located on-chip with the CPU while a cache is located off-chip (for example a level-2 or level-3 cache on the motherboard) the address is already translated before it appears on the system bus and, therefore, that cache would have to be physically addressed.

17. Fill in the blanks below with the most appropriate term or concept discussed in this chapter:

Information density - a characteristic of a memory device that refers to the amount of information that can be stored in a given physical space or volume

Dynamic Random Access Memory (DRAM) - a semiconductor memory device made up of a large array of capacitors; its contents must be periodically refreshed in order to keep them from being lost

Magnetic RAM (MRAM) - a developing memory technology that operates on the principle of magnetoresistance; it may allow the development of “instant-on” computer systems

Erasable/Programmable Read-Only Memory (EPROM) - a type of semiconductor memory device, the contents of which cannot be overwritten during normal operation, but can be erased using ultraviolet light

Associative memory - this type of memory device is also known as a CAM

Argument register - a register in an associative memory that contains the item to be searched for

Locality of reference - the principle that allows hierarchical storage systems to function at close to the speed of the faster, smaller level(s)

Miss - this occurs when a needed instruction or operand is not found in cache and thus a main memory access is required

Refill line - the unit of information that is transferred between a cache and main memory

Tag - the portion of a memory address that determines whether a cache line contains the needed information

Fully associative mapping - the most flexible but most expensive cache organization, in which a block of information from main memory can reside anywhere in the cache

Write-back - a policy whereby writes to cached locations update main memory only

Valid bit - this is set or cleared to indicate whether a given cache line has been initialized with “good” information or contains “garbage” due to not yet being initialized

Memory Management Unit (MMU) - a hardware unit that handles the details of address translation in a system with virtual memory

Segment fault - this occurs when a program makes reference to a logical segment of memory that is not physically present in main memory

Translation Lookaside Buffer (TLB) - a type of cache used to hold virtual-to-physical address translation information

Dirty bit - this is set to indicate that the contents of a faster memory subsystem have been modified and need to be copied to the slower memory when they are displaced

Delayed page fault - this can occur during the execution of a string or vector instruction

3 Basics of the Central Processing Unit

1. Does an architecture that has fixed-length instructions necessarily have only one instruction format? If multiple formats are possible given a single instruction size in bits, explain how they could be implemented; if not, explain why this is not possible.

Not necessarily. It is possible to have multiple instruction formats, all of the same length. For example, SPARC has three machine language instruction formats, all 32 bits long. This is implemented by decoding a subset of the op code bits (in the SPARC example, the two leftmost bits) and using the decoded outputs to determine how to decode the remaining bits of the instruction.

2. The instruction set architecture for a simple computer must support access to 64 KB of byte-addressable memory space and eight 16-bit general-purpose CPU registers.

(a) If the computer has three-operand machine language instructions that operate on the contents of two different CPU registers to produce a result that is stored in a third register, how many bits are required in the instruction format for addressing registers?

Since there are 8 = 2^3 registers, three bits are needed to identify each register operand. In this case there are two source registers and one destination register, so it would take 3 * 3 = 9 bits in the instruction to address all the needed registers.

(b) If all instructions are to be 16 bits long, how many op codes are available for the three-operand, register operation instructions described above (neglecting, for the moment, any other types of instructions that might be required)?

16 bits total minus 9 bits for addressing registers leaves 7 bits to be used as the op code. Since 2^7 = 128, there are 128 distinct op codes available to specify such instructions.
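The bit budgeting in parts (a) and (b) can be written out as a short Python sketch. It assumes the register count is a power of two; the names are illustrative.

```python
INSTRUCTION_BITS = 16
REGISTERS = 8
OPERANDS = 3   # two source registers plus one destination register

bits_per_register = (REGISTERS - 1).bit_length()   # bits to select one register
register_bits = OPERANDS * bits_per_register       # bits for all register fields
opcode_bits = INSTRUCTION_BITS - register_bits     # whatever is left over
opcodes = 2 ** opcode_bits
print(bits_per_register, register_bits, opcode_bits, opcodes)
```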

(c) Now assume (given the same 16-bit instruction size limitation) that, besides the instructions described in (a), there are a number of additional two-operand instructions to be implemented, for which one operand must be in a CPU register while the second operand may reside in a main memory location or a register. If possible, detail a scheme that allows for at least 50 register-only instructions of the type described in (a) plus at least 10 of these two-operand instructions. (Show how you would lay out the bit fields for each of the machine language instruction formats.) If this is not possible, explain in detail why not and describe what would have to be done to make it possible to implement the required number and types of machine language instructions.

We can accomplish this design goal by adopting two instruction formats that could be distinguished by a single bit. Format 1 will have a specific bit (say, the leftmost bit) = 0 while format 2 will have a 1 in that bit position. Three-operand (register-only) instructions would use format 1. With one bit already used to identify the format, of the remaining 15 bits, 6 would constitute the op code (giving us 2^6 = 64 possible instructions of this type). The other 9 bits would be used to identify source register 1 (3 bits), source register 2 (3 bits), and the destination register (3 bits).

The two-operand instructions would use format 2. These instructions cannot use absolute addressing for memory operands because that would require 16 bits for the memory address alone, and there are only 16 total bits per instruction. However, register indirect addressing or indexed addressing could be used to locate memory operands. In this format, 3 of the remaining 15 bits would be needed to identify the operand that is definitely in a register. One additional bit would be required to tell whether the second operand was in a register or in a memory location. Then, another set of 3 bits would identify a second register that contains either the second operand or a pointer to it in memory. This leaves 8 bits, of which 4 or more would have to be used for op code bits since we need at least 10 instructions of this type. The remaining 4 bits could be used to provide additional op codes or as a small displacement for indexed addressing.
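One possible packing of the two formats is sketched below in Python. The answer fixes only the field widths and the leftmost format bit; the ordering of the remaining fields, the use of the spare 4 bits as a displacement, and the function names are assumptions made here for illustration.

```python
def check(value, bits, name):
    """Raise if value does not fit in an unsigned field of the given width."""
    if not 0 <= value < (1 << bits):
        raise ValueError(f"{name} does not fit in {bits} bits")

def encode_format1(opcode, source1, source2, destination):
    """Format 1: 0 | op code (6) | source 1 (3) | source 2 (3) | destination (3)."""
    check(opcode, 6, "opcode")
    for register in (source1, source2, destination):
        check(register, 3, "register")
    return (opcode << 9) | (source1 << 6) | (source2 << 3) | destination

def encode_format2(opcode, register, in_memory, second_register, displacement=0):
    """Format 2: 1 | op code (4) | displacement (4) | register (3) | mode (1) | register (3)."""
    check(opcode, 4, "opcode")
    check(displacement, 4, "displacement")
    check(register, 3, "register")
    check(second_register, 3, "register")
    mode = 1 if in_memory else 0   # 1 means the second register holds a pointer
    return ((1 << 15) | (opcode << 11) | (displacement << 7)
            | (register << 4) | (mode << 3) | second_register)

print(f"{encode_format1(1, 2, 3, 4):016b}")
print(f"{encode_format2(9, 5, True, 6, displacement=3):016b}")
```

Format 1 provides 64 op codes and format 2 provides 16, which covers the required 50 and 10 instructions respectively.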

3. What are the advantages and disadvantages of an instruction set architecture with variable-length instructions?

For an architecture with a sufficient degree of complexity, it is natural that some instructions may be expressible in fewer bits than others. (Some may have fewer options, operands, addressing modes, etc. while others have more functionality.) Having variable-length instructions means that the simpler instructions need take up no more space than absolutely necessary. (If all instructions are the same length, then even the simplest ones must be the same size, in bits, as the most complex.) Variable-length instructions can save significant amounts of code memory, but at the expense of requiring a more complex decoding scheme that can complicate the design of the control unit. Variable-length instructions also make it more difficult to pipeline the process of fetching, decoding, and executing instructions (see Chapter 4).

4. Name and describe the three most common general types (from the standpoint of functionality) of machine instructions found in executable programs for most computer architectures.

In most executable programs (one can always find isolated counter-examples) the bulk of the machine instructions are, usually in this order: data transfer instructions, computational (arithmetic, logic, shift, etc.) instructions, and control transfer instructions.

5. Given that we wish to specify the location of an operand in memory, how does indirect addressing differ from direct addressing? What are the advantages of indirect addressing, and in what circumstances is it clearly preferable to direct addressing? Are there any disadvantages of using indirect addressing? How is register indirect addressing different from memory indirect addressing, and what are the relative advantages and disadvantages of each?

explicitly as part of the machine language instruction (as opposed to immediate addressing, which embeds the operand itself in the instruction). Indirect addressing uses the machine language instruction to specify not the location of the operand, but the location of the location of the operand. (In other words, it tells where to find a pointer to the operand.) The advantage of indirect addressing is that if a given instruction is executed more than once (as in a program loop) the operand does not have to be in the same memory location every time. This is of particular use in processing tables, arrays, and other multi-element data structures. The only real disadvantages of indirect addressing vs. direct addressing are an increase in complexity and a decrease in processing speed due to the need to dereference the pointer.

Depending on the architecture, the pointer (which contains the operand address) specified by the instruction may reside in either a memory location (memory indirect addressing) or a CPU register (register indirect addressing). Memory indirect addressing allows a virtually unlimited number of pointers to be active at once, but requires an additional memory access – which complicates and slows the execution of the instruction, exacerbating the disadvantages mentioned above. To avoid this complexity, most modern architectures support only register indirect addressing, which limits the pointers to exist in the available CPU registers but allows instructions to execute more quickly.

6. Various computer architectures have featured machine instructions that allow the specification of three, two, one, or even zero operands. Explain the tradeoffs inherent to the choice of the number of operands per machine instruction. Pick a current or historical computer architecture, find out how many operands it typically specifies per instruction, and explain why you think its architects implemented the instructions the way they did.

The answer to this question will obviously depend on the architecture chosen. The main tradeoff is programmer (or compiler) convenience, which favors more operands per instruction, versus the desire to keep instructions smaller and more compact, which favors fewer operands per instruction.

7. Why have load-store architectures increased in popularity in recent years? (How do their advantages go well with modern architectural design and implementation technologies?) What are some of their less desirable tradeoffs vs. memory-register architectures, and why are these not as important as they once were?

Load/store architectures have become popular in large measure because the decoupling of memory access from computational operations on data keeps the control unit logic simpler and makes it easier to pipeline the execution of instructions (see Chapter 4). Simple functionality of instructions makes it easier to avoid microcode and use only hardwired control logic, which is generally faster and takes up less “real estate” on the IC. Not allowing memory operands also helps keep instructions shorter and can help avoid the need to have multiple instruction formats of different sizes.

Memory-register architectures, on the other hand, tend to require fewer machine language instructions to accomplish the same programming task, thus saving program memory. The compiler (or the assembly language programmer) has more flexibility and not as many registers need to be provided if operations on memory contents are allowed. Given the decrease in memory prices, the improvements in compiler technology, and the shrinking transistor sizes over the past 20 years or so, the advantages of memory-register architectures have been diminished and load/store architectures have found greater favor.

8. Discuss the two historically dominant architectural philosophies of CPU design:

a) Define the acronyms CISC and RISC and explain the fundamental differences between the two philosophies.

CISC stands for “Complex Instruction Set Computer” and RISC stands for “Reduced Instruction Set Computer.” The fundamental difference between these two philosophies of computer system design is the choice of whether to put the computational complexity required of the system in the hardware or in the software. CISC puts the complexity in the hardware. The idea of CISC was to support high-level language programming by making the machine directly execute high-level functions in hardware. This was usually accomplished by using microcode to implement those complex functions. Programs were expected to be optimized by coding in assembly language. RISC, on the other hand, puts the complexity in the software (mainly, the high-level language compilers). No effort was made to encourage assembly language programming; instead there is a reliance on optimization by the compiler. The RISC idea was to make the hardware as simple and fast as possible by eliminating microcode and explicitly encouraging pipelining of the hardware. Any task that cannot be quickly and conveniently done in hardware is left for the compiler to implement by combining simpler functions.

b) Name one commercial computer architecture that exemplifies the CISC architectural approach and one other that exemplifies RISC characteristics.

CISC examples include the DEC VAX, Motorola 680x0, Intel x86, etc. RISC examples include the IBM 801, Sun SPARC, MIPS Rx000, etc.

c) For each of the two architectures you named in (b) above, describe one distinguishing characteristic not present in the other architecture that clearly shows why one is considered a RISC and the other a CISC.

Answers will vary depending on the architectures chosen, but may include the use of hardwired vs. microprogrammed control, the number and complexity of machine language instructions and memory addressing modes, the use of fixed- vs. variable-length instructions, a memory-register vs. a load/store architecture, the number of registers provided and their functionality, etc.

d) Name and explain one significant advantage of RISC over CISC and one significant advantage of CISC over RISC.

Significant advantages of RISC include simpler, hardwired control logic that takes up less space (leaving room for more registers, on-chip cache and/or floating-point hardware, etc.) and allows higher CPU clock frequencies, the ability to execute instructions in fewer clock cycles, and ease of instruction pipelining. Significant advantages of CISC include a need for fewer machine language instructions per program (and thus a reduced appetite for code memory), excellent support for assembly language programming, and less demand for complexity in, and optimization by, the compilers.

9. Discuss the similarities and differences between the programmer-visible register sets of the 8086, 68000, MIPS, and SPARC architectures. In your opinion, which of these CPU register organizations has the most desirable qualities, and which is least desirable? Give reasons to explain your choices.

The 8086 has a small number of highly specialized registers. Some are for addresses, some for computations; many functions can only be carried out using a specific register or a limited subset of the registers. The 68000, another CISC processor, has a few more (16) working registers and divides them only into two general categories: data registers and address registers. Within each group, registers have identical functionality (except for address register 7 which acts as the stack pointer).

MIPS and SPARC, both RISC designs, have larger programmer-visible register sets (32 working registers) and do not distinguish between registers used for data vs. registers used for pointers to memory. For the most part, “all registers are created equal”, though in both architectures register 0 is hardwired to always contain the value 0. SPARC processors actually have a variable number (up to hundreds) of physical registers, and the hardware makes different subsets of 32 of them visible at different times. This “overlapping register window” scheme was devised to help optimize parameter passing across procedure calls. Students can be expected to have different preferences, but should point to specific advantages of a given architecture to back up their choices.
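The overlapping register window idea can be illustrated with a small sketch. This is a toy model, not a description of any particular SPARC implementation: the class name, the default of 8 windows, and the omission of window overflow/underflow handling are all assumptions made for illustration. It follows the usual SPARC numbering, in which r0-r7 are globals, r8-r15 are "outs", r16-r23 are "locals", and r24-r31 are "ins".

```python
NUM_GLOBALS = 8        # r0-r7 are shared by every window
REGS_PER_WINDOW = 16   # each window contributes 8 outs + 8 locals of its own


class WindowedRegisterFile:
    """Toy model of overlapping register windows (hypothetical; no overflow traps)."""

    def __init__(self, num_windows=8):
        self.num_windows = num_windows          # assumed value, varies by implementation
        self.globals = [0] * NUM_GLOBALS
        self.windowed = [0] * (num_windows * REGS_PER_WINDOW)
        self.cwp = 0                            # current window pointer

    def _index(self, r):
        # r8-r15 = outs, r16-r23 = locals, r24-r31 = ins.
        # The ins of window w share storage with the outs of window w + 1.
        return (self.cwp * REGS_PER_WINDOW + (r - NUM_GLOBALS)) % len(self.windowed)

    def read(self, r):
        if r == 0:
            return 0                            # register 0 always reads as zero
        if r < NUM_GLOBALS:
            return self.globals[r]
        return self.windowed[self._index(r)]

    def write(self, r, value):
        if r == 0:
            return                              # writes to register 0 are discarded
        if r < NUM_GLOBALS:
            self.globals[r] = value
        else:
            self.windowed[self._index(r)] = value

    def save(self):
        # Procedure call: move to the next window (the pointer decrements).
        self.cwp = (self.cwp - 1) % self.num_windows

    def restore(self):
        # Procedure return: move back to the caller's window.
        self.cwp = (self.cwp + 1) % self.num_windows


rf = WindowedRegisterFile()
rf.write(8, 42)     # caller places an argument in its first "out" register
rf.save()           # switch to the callee's window
arg = rf.read(24)   # callee finds the argument in its first "in" register
```

No data is copied when the window changes; the caller's r8 and the callee's r24 simply name the same physical register, which is why parameter passing is cheap under this scheme.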

10. A circuit is to be built to add two 10-bit numbers x and y plus a carry-in. (Bit 9 of each number is the MSB, while bit 0 is the LSB. c_0 is the carry-in to the LSB position.) The propagation delay of any individual AND or OR gate is 0.4 ns, and the carry and sum functions of each full adder are implemented in sum of products form.

(a) If the circuit is implemented as a ripple carry adder, how much time will it take to produce a result?

Each full adder takes (0.4 + 0.4) = 0.8 ns to produce a result (sum and carry outputs). Since the carry output of each adder is an input to the adder in the next more significant position, the operation of the circuit is sequential and it takes 10 * (0.8 ns) = 8.0 ns to compute the sum of two 10-bit numbers.
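The ripple carry calculation can be checked with a short sketch. The function names and the bit-level model below are illustrative assumptions; as in the answer above, only AND and OR gate delays are counted (inverter delay is ignored), so each full adder contributes two gate delays.

```python
GATE_DELAY_NS = 0.4  # delay of one AND or OR gate, from the problem statement


def full_adder(x, y, c_in):
    """One-bit full adder with sum and carry in sum-of-products (AND-OR) form."""
    nx, ny, nc = 1 - x, 1 - y, 1 - c_in
    s = (x & ny & nc) | (nx & y & nc) | (nx & ny & c_in) | (x & y & c_in)
    c_out = (x & y) | (x & c_in) | (y & c_in)
    return s, c_out


def ripple_carry_add(x, y, c0=0, width=10):
    """Add two width-bit numbers; return (sum, carry_out, worst-case delay in ns)."""
    carry = c0
    total = 0
    for i in range(width):
        # Each stage must wait for the carry from the stage below it.
        s, carry = full_adder((x >> i) & 1, (y >> i) & 1, carry)
        total |= s << i
    # Two gate levels (AND, then OR) per stage, with the stages in series.
    delay_ns = width * 2 * GATE_DELAY_NS
    return total, carry, delay_ns


total, carry_out, delay_ns = ripple_carry_add(1023, 1)
```

Adding 1023 and 1 is the worst case for this circuit, since the carry generated in bit 0 has to propagate through all ten stages before the final carry-out is valid.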

(b) Given that the carry generate and propagate functions for bit position i are given by g_i = x_i y_i and p_i = x_i + y_i, and that each required carry bit (c_1 ... c_{10}) is developed from the least significant carry-in c_0 and the appropriate g_i and p_i functions using AND-OR logic, how much time will a carry lookahead adder circuit take to produce a result? (Assume AND gates have a maximum fan-in of 8 and OR gates have a maximum fan-in of 12.)

In a carry lookahead adder, all the g_i and p_i functions are generated simultaneously by parallel AND and OR gates. This takes 0.4 ns (one gate delay time). Since c_{i+1} = g_i + p_i c_i, generating all the carries from c_0 with two-level AND-OR logic should take two more gate delay times or 0.8 ns. However, we have to consider the gate fan-in restrictions. Since OR gates can have a fan-in of 12 and we never need to OR that many terms (c_{10}, the widest case, is an OR of 11 product terms), that restriction does not matter; but the fan-in limitation on the AND gates means that an extra level of logic will be needed, since there are cases where we have to AND more than 8 terms (for example, the term p_9 p_8 ... p_0 c_0 in c_{10} has 11 inputs). Thus, 3 * (0.4 ns) = 1.2 ns is required for this AND-OR logic for a total of 4 * (0.4 ns) = 1.6 ns to generate all the carries. Once the carries