Conclusions - Multi Core Embedded Systems - Embedded Multi Core Systems

1.1 Introduction

There are many interesting “laws” in the folklore of Information Technology. One of them, attributed to Niklaus Wirth, states that software is slowing faster than hardware is accelerating, a testimonial to the irony of modern-day system design. The “slowing down” in Wirth’s law can refer to both run-time performance and software development time. Because of time-to-market pressure, software designers do not have the luxury of optimizing the code. Software development for modern systems often happens in parallel with the development of the hardware platform, using simulation models of the target hardware. There is increased pressure on software developers to reuse existing IP, which may come from multiple sources, in various degrees of softness. Compilers and software optimization tools either do not exist, have limited capabilities, or are not available during the crucial periods of system development. For these reasons, application software development is a slow and daunting task, and the lack of automated tools rarely permits the use of advanced features supported in hardware. It is quite common for software developers (e.g., of video games) to resort to manual assembly-language coding.

Embedded systems for applications such as video streaming require very high MIPS performance, of the order of several giga operations per second, which cannot be obtained from a single on-chip signal processor. As an example, consider broadcast quality video with a specification of 30 frames/second and 720 × 480 pixels per frame, requiring about 400,000 blocks to be processed per second. In telemedicine applications, where the requirement is for 60 frames/second and 1920 × 1152 pixels per frame, about 5 million blocks must be processed per second. Today’s wireless mobile Internet devices offer a host of applications, including High-Definition Video playback and recording, Internet browsing, CD-quality audio, and SLR-quality imaging.
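As a rough sanity check on the two video workloads quoted above, the following sketch compares their raw pixel rates. The text does not state the block size behind its block counts, so the sketch works in pixels only; the function name `pixel_rate` is illustrative and not from the original.

```python
def pixel_rate(width, height, fps):
    """Pixels per second for a video stream of the given frame size and rate."""
    return width * height * fps


# Figures taken from the text: broadcast-quality and telemedicine video.
broadcast = pixel_rate(720, 480, 30)
telemedicine = pixel_rate(1920, 1152, 60)

# Ratio of the two workloads in terms of raw pixels per second.
ratio = telemedicine / broadcast

print(f"broadcast:    {broadcast:,} pixels/s")
print(f"telemedicine: {telemedicine:,} pixels/s")
print(f"ratio:        {ratio:.1f}x")
```

The pixel-rate ratio of about 12.8 is in line with the ratio of the quoted block rates (5 million to 400,000, or 12.5), which suggests the two block counts were derived with the same block size.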
Some applications require multiple antennas, for services such as FM, GPS, Bluetooth, and WLAN. For example, if a user who is watching a streamed video presentation over a WLAN network on a mobile device is interrupted by an incoming call, it is desirable that the presentation is paused and the phone switches to the Bluetooth handset. The presentation should resume after the user disconnects the call [6]. The growth of data bandwidth in mobile networks, better video compression techniques, and better camera and display technology have resulted in significant interest in wireless video applications such as video telephony. Set-top boxes can provide access to digital TV and related interactive services, as well as serve as a gateway to the Internet and a hub for a home network [5]. For applications such as these, system architects resort to multiprocessor architectures to get the required performance. What has made this decision possible is the power granted by VLSI system-on-chip technology, which allows the logic of several instruction-set processors and several megabytes of memory to be integrated in the same package (Table 1.1). The requirements of an embedded solution are very different from those of general purpose systems and application-specific servers such as video servers [18]: compactness, low cost, low power, pin count, packaging, and short time-to-market are among the key considerations.

TABLE 1.1: Growth of VLSI Technology over Four Decades

                     1982      1992      2002      2012
Technology (µm)      3         0.8       0.1       0.02
Transistor count     50K       500K      180M      1B
MIPS                 5         40        5000      50000
RAM                  256B      2KB       3MB       20MB
Power (mW/MIPS)      250       12.5      0.1       0.001
Price/MIPS           $30.00    $0.38     $0.02     $0.003

Historically, multiprocessors were heralded into the scene of computer architecture as early as the 1970s, when Moore’s law was not yet in vogue and it was widely believed that uniprocessors could not provide the kind of performance that future applications would demand. In the 1980s, the notion that we were already very close to the physical limits of the frequency of operation became even more prevalent, and a number of commercial parallel processing machines were built. In a landmark 1991 paper, Stone and Cocke [28] argued that an operating frequency of 250 MHz could not be achieved because of the timing challenges posed by metal interconnections. This prediction, however, was proven false in the same decade, and uniprocessors that worked at speeds over 500 MHz became available. The relentless progress in the speed of uniprocessors made parallel processing a less attractive alternative, and companies that were making “supercomputers” closed down their operations. Distributed computing on a network of workstations was seen as the right approach to solving computationally difficult problems. We have come full circle, with multiprocessors making a comeback in embedded applications.

1.1.1 What Makes Multiprocessor Solutions Attractive?

1.1.1.1 Power Dissipation

The objectives of system design have changed over the past decade. While performance and cost were the primary considerations in system design until the 1980s, the proliferation of battery-operated mobile devices has shifted the focus to power dissipation and energy dissipation. Figure 1.1 shows the power/performance numbers for mobile devices over the past two decades and extrapolates them for the next few years. The prediction of the power/performance numbers with VLSI technology scaling was made by Gene Frantz and has remained mostly true; the deviation from the prediction occurred in the early part of this decade, when leakage power of CMOS circuits became significant in the nanometer technologies. Unless the power dissipation of handheld devices is kept in check, they will become too hot and demand elaborate cooling mechanisms. Packaging and the associated cost are also related to the peak power dissipation of a device. The distribution of power to the sub-systems gets complex as the average and peak power of a system become larger. In the past decade, we have also seen the concern for “green systems” growing, stemming from the concern about climate change, carbon emissions, and e-waste. Energy-efficient system design has therefore gained importance.

FIGURE 1.1: Power/performance over the years (vertical axis: mW/MMACs, logarithmic, from 1,000 down to 0.00001; horizontal axis: year, 1982 to 2010). The solid line shows the prediction by Gene Frantz (Gene’s law). The dotted line shows the actual value for digital signal processors over the years. The ‘star’ curve shows the power dissipation for mobile devices over the years.

Multi-core design is one of the most important solutions for managing the power and the energy efficiency of a system. Systems designed in the 1980s featured a single power supply and a single power domain, allowing the entire system to be powered on or off. As the complexity of systems has increased, we need an alternate method of powering a system, in which the system is divided into power domains and power switches are used to cut off the power supply to a sub-system that is not required to be active during system operation. In a modern electronic system, there are multiple modes of operation. For example, a user may use a mobile phone to read e-mail, take a picture or record video, listen to music, play a game, or make a phone call. Some sub-systems can be turned off during each of these modes of operation; e.g., when reading mail, the sub-system that is responsible for picture decompression need not be powered on until the user opens an e-mail which has a compressed picture attachment. Similarly, there may be many I/O interfaces in a system, such as USB, credit card, Ethernet, Firewire, etc., not all of which will be necessary in any one mode of operation. Turning off the clock for a sub-system is a way to cut down the dynamic power dissipation in the sub-system. Powering off a sub-system helps us cut down the static as well as the dynamic power that would otherwise be wasted.
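The idea of mode-dependent power domains can be sketched as a simple lookup. This is only an illustration: the domain names, the mode names, and the mapping between them are hypothetical and do not describe any particular device.

```python
# Hypothetical power domains of a handheld device (illustrative names only).
ALL_DOMAINS = {"cpu", "display", "audio", "camera", "image_codec", "modem", "usb"}

# Hypothetical mapping from operating mode to the domains it needs.
MODE_DOMAINS = {
    "read_email": {"cpu", "display", "modem"},
    "take_picture": {"cpu", "display", "camera", "image_codec"},
    "play_music": {"cpu", "audio"},
    "phone_call": {"cpu", "audio", "modem"},
}


def domains_to_power_off(mode, extra_needed=frozenset()):
    """Return the domains that can be power gated in the given mode.

    extra_needed lists domains woken on demand, e.g. the image codec
    when the user opens a compressed picture attachment.
    """
    needed = MODE_DOMAINS[mode] | set(extra_needed)
    return ALL_DOMAINS - needed


# While reading mail, the image codec stays off ...
print(sorted(domains_to_power_off("read_email")))
# ... until an attachment is opened and the codec is powered on.
print(sorted(domains_to_power_off("read_email", {"image_codec"})))
```

A real power manager would also account for wake-up latency and state retention when deciding whether gating a domain is worthwhile; the sketch ignores both.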

The traditional way to build high-performance VLSI systems has been to increase the clocking speed. In the late 1980s and the 1990s, we saw a relentless increase in the clock speed of personal computers. However, as the VLSI technology used to implement the systems moved from micrometer technology to nanometer technology, a number of challenges confronted semiconductor manufacturers. Managing the power and energy dissipation is the most daunting of these challenges. The dynamic power of a VLSI system grows linearly with the frequency of operation and quadratically with the operating voltage. Static power dissipation due to leakage currents in the transistors has different components, some of which increase linearly with the operating voltage and others as its cube. Reducing the voltage of operation can result in a significant reduction in power, but can also negatively impact the frequency of operation. The selection of operating voltage and frequency of operation must consider both power and performance.
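The dynamic-power scaling described above is commonly summarized by the standard first-order CMOS model shown here. The symbols α (switching activity factor) and C (switched capacitance) do not appear in the text and are introduced as assumptions; V and f are the operating voltage and frequency. The second expression is a worked instance: lowering the voltage to 80% at constant frequency leaves 64% of the dynamic power.

```latex
P_{\mathrm{dyn}} = \alpha \, C \, V^{2} f
\qquad\Longrightarrow\qquad
\frac{P_{\mathrm{dyn}}(0.8V,\, f)}{P_{\mathrm{dyn}}(V,\, f)} = 0.8^{2} = 0.64
```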

An electronic system is commonly implemented by integrating IP cores which operate at different voltages and frequencies. It is also common to use dynamic voltage and frequency scaling (DVFS) in order to manage the power dissipation while still meeting the performance constraints. Sub-systems that must provide higher performance can be operated at higher frequency and voltage, while the rest of the system can operate at lower frequency and voltage. An extreme form of frequency scaling is gated clocking, where the clock signal for a sub-system can be turned off. Similarly, an extreme form of voltage scaling is power gating, where the power supply to a sub-system can be turned off. The OMAP platform for mobile embedded products uses dynamic voltage and frequency scaling to reduce power consumption [10]. Texas Instruments uses its SmartReflex power management technology and a special 45 nanometer CMOS process for power reduction in the latest OMAP4 series of platforms. SmartReflex allows the device to adjust the voltage and frequency of operation of sub-blocks based on the activity, mode of operation, and temperature. The OMAP4 processors have two ARM Cortex-A9 processors on-chip and several peripherals (Figure 1.9), but only the core that is required for the target application is activated, to minimize power wastage.
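The following is a minimal sketch of how a DVFS policy might pick an operating point for one sub-system. It is not the SmartReflex algorithm or any vendor's method: the voltage/frequency pairs are invented for illustration, and power is estimated with the first-order relation that dynamic power is proportional to V²f.

```python
# Hypothetical operating points as (volts, MHz); not taken from any real device.
OPERATING_POINTS = [(0.8, 200), (1.0, 400), (1.2, 600)]


def relative_dynamic_power(voltage, freq_mhz):
    """First-order dynamic power, proportional to V^2 * f (arbitrary units)."""
    return voltage ** 2 * freq_mhz


def select_operating_point(required_mhz):
    """Return the lowest-power operating point that meets the demand.

    Returns None when no work is pending, meaning the sub-system can be
    clock gated. Raises ValueError if no operating point is fast enough.
    """
    if required_mhz <= 0:
        return None
    feasible = [p for p in OPERATING_POINTS if p[1] >= required_mhz]
    if not feasible:
        raise ValueError("demand exceeds the fastest operating point")
    return min(feasible, key=lambda p: relative_dynamic_power(*p))


for demand in (0, 150, 450):
    print(demand, "MHz ->", select_operating_point(demand))
```

A production policy would also weigh temperature, transition latency between operating points, and leakage, all of which this sketch leaves out.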

Consider a sub-system S that must provide a performance of T time units per operation. Since the switching speed of transistors depends directly on the voltage of operation, building a circuit that implements S may require us to operate the circuit at a higher voltage V, resulting in higher power dissipation. We may be able to use the parallelism in the functionality of the sub-system to break it down into two sub-systems S′ and S″. The circuits that implement S′ and S″ are roughly half in size and have a critical path that is half of T. As a result, they can be operated at about half the voltage V. This would result in a significant reduction in dynamic and static power dissipation.
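To put a number on the argument above, here is a small worked example under a first-order model. It assumes that dynamic power is C·V²·f with the activity factor folded into C, that each half-size circuit has half the switched capacitance, and that halving the voltage simply doubles the delay, so each half still meets the original time budget T. None of these modeling choices are stated in the text, and the function name is illustrative.

```python
def dynamic_power(capacitance, voltage, frequency):
    """First-order dynamic power model: P = C * V^2 * f."""
    return capacitance * voltage ** 2 * frequency


# Normalized baseline: one circuit S with capacitance C, voltage V, clock f = 1/T.
C, V, F = 1.0, 1.0, 1.0
single = dynamic_power(C, V, F)

# Two half-size circuits S' and S'', each with half the capacitance.
# Their critical path is T/2 at voltage V, so at V/2 (assumed to double
# the delay) they still complete within T and keep the same clock.
split = 2 * dynamic_power(C / 2, V / 2, F)

print(f"single circuit: {single:.3f}")
print(f"two halves:     {split:.3f}")
print(f"ratio:          {split / single:.2f}")
```

Under these assumptions the pair dissipates about a quarter of the original dynamic power. In real circuits delay rises faster than this as the supply approaches the threshold voltage, so the achievable voltage reduction, and hence the saving, is smaller.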

Multi-core system design has become attractive from the viewpoint of the performance-and-power tradeoff. The tradeoff is between building a “super processor” that operates at a high frequency (thereby guzzling power) and building smaller processors that operate at lower frequencies (thereby consuming less power) and yet together give a performance comparable to that of the super processor.

1.1.1.2 Hardware Implementation Issues

The definition of a system in system-on-a-chip has expanded to cover multiple processors, embedded DRAM, flash memory, application-specific hardware accelerators, and RF components. The cost of designing a multiprocessor system-on-chip, where the processors work at moderate speeds and the system throughput is multiplied by the number of processors, is smaller than that of designing a single processor which works at a much higher clock speed. This is due to the difficulties in handling the timing closure problem in an automated design flow. The delays due to the parasitic resistance, capacitance, and inductance of the interconnect make it difficult to predict the critical path delays accurately during logic design. Physical designers attempt to optimize the layout subject to the interconnect-related timing constraints posed by the logic design phase. Failure to meet these timing constraints results in costly iterations of logic and physical design. These problems have only been aggravated by the scaling down of technology, where tall and thin wires run close to one another, resulting in crosstalk. Voltage drop in the resistance of the power supply rails is another potential cause of timing and functional failures in deep submicron integrated circuits. When a number of signals in a CMOS circuit switch state, the current drawn from the power supply causes a drop in the supply voltage that reaches the cells. As a result, the delay of the individual cells will increase. This can potentially result in timing failure on critical paths, unless the power rail is properly designed. Typically, the gates in the center of the chip are most prone to IR drop-induced delays.

Although custom design may be used for some performance-critical portions of the chip, today it is quite common to employ automated logic synthesis flows to reduce the front-end design cycle time. The success of logic synthesis, both in terms of timing closure and optimization, depends critically on the constraints specified during logic synthesis. These constraints include timing constraints, area constraints, load constraints, and so on. Such constraints are easier to provide when a hierarchical approach is followed and smaller partitions are identified. The idea of using multiple processors as opposed to a single processor is more attractive in this scenario.

Another benefit that comes from a divide-and-conquer approach is the concurrency in the design flow. A design that can naturally be partitioned into sub-blocks such as processors, memory, application-specific processors, etc., is relatively easy to manage. Different design teams can concurrently address the design tasks associated with the individual sub-blocks of the design.

When a design has multiple instances of a common block such as a processor, the design team can gain significantly in terms of design cycle time. This is possible through the reuse of the following work: (a) insertion of scan chains and BIST circuitry, (b) physical design effort, (c) automatic test pattern generation effort, (d) simulation of test patterns.

In VLSI technologies beyond 90 nm, on-chip variability of process parameters, temperature, and voltage is another challenge that designers have to grapple with. The parameters that determine the performance of transistors and interconnects are known to vary significantly across the die, due to the vagaries of the manufacturing processes. In the past, these variances were known to exist in dies made on different wafers, lots, and foundries. However, due to the small dimension of the circuit components, on-die variation has assumed significance. The exact way in which a transistor or interconnect gets “printed” on the integrated circuit is no longer independent of the surrounding components. Thus, a NAND gate’s performance can vary, depending on the physical location of the gate and what logic is in its neighborhood. The temperature of the die varies widely, by as much as 50 degrees Celsius, across the chip. Similarly, due to the impedance drops in the power supply distribution network of the chip, the voltage that reaches the individual gates and flip-flops can vary across the chip.

There are several solutions to combat the problem of on-chip variability. One solution is to apply “optical proximity correction” which subtly transforms the layout geometries so that they print well. Optical proximity correction is a slow and expensive step and is best applied to small blocks. In this context, having regularity and repetitiveness in the system can be an advantage. Homogeneous multiprocessor systems offer this advantage. To alleviate the problem of temperature variability, it would be desirable to migrate computational tasks from hotter regions to cooler portions of the chip. Once again, homogeneous multiprocessors present a natural way of performing task migration. The problem of reducing the variation in power supplies across the power supply network can also be alleviated by building a hierarchical network from smaller, repeatable supply networks. Here again, the use of multiprocessors can be an advantage.
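Temperature-driven task migration on a homogeneous multiprocessor can be sketched in a few lines. This is a deliberately simple illustration, not a scheduler from the text: the function name, the data layout, and the policy of moving one task from the hottest to the coolest core are all assumptions.

```python
def migrate_from_hottest(temps, assignment):
    """Move one task from the hottest core to the coolest core.

    temps:      dict mapping core name -> temperature in degrees Celsius
    assignment: dict mapping core name -> list of task names (modified in place)

    Returns (task, source, destination), or None if there is nothing to move.
    Because the cores are identical, any task may run on any core, so no
    compatibility check is needed.
    """
    hottest = max(temps, key=temps.get)
    coolest = min(temps, key=temps.get)
    if hottest == coolest or not assignment[hottest]:
        return None
    task = assignment[hottest].pop()
    assignment[coolest].append(task)
    return task, hottest, coolest


temps = {"core0": 85.0, "core1": 50.0}
assignment = {"core0": ["decode", "filter"], "core1": []}
print(migrate_from_hottest(temps, assignment))
print(assignment)
```

A practical policy would add a temperature threshold and hysteresis to avoid moving tasks back and forth, and would weigh the cost of transferring task state between cores.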

Testing of integrated circuits for manufacturing defects is yet another challenge. Due to the growing complexity and size of integrated circuits, the amount of test data has grown sharply, increasing the cost of testing. Testing of integrated circuits is performed by using an external tester that applies pre-computed test patterns and compares the response of the integrated circuit with the expected results. Test generation software runs very slowly as the size of the circuit grows. A divide-and-conquer approach offers an effective solution to this problem [21]. Multi-core systems have a natural design hierarchy, which lends itself to the divide-and-conquer approach toward test generation, fault simulation, and test pattern validation. When a number of identical cores are present in the integrated circuit, it may be possible to reuse the patterns and reduce the effort in test generation. Similarly, there are interesting “built-in-self-test” approaches where mutual testing can be employed to test a chip. Thus, if we have two processor cores on the same chip, we can apply random patterns to both processor cores and compare their responses to the random tests; a difference in response will indicate an error.
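The mutual-testing idea can be illustrated with a short software model. In this sketch each “core” is just a Python function from an input pattern to a response; the 16-bit increment and the stuck-at-0 fault are invented stand-ins, and the function names are hypothetical.

```python
import random


def mutual_test(core_a, core_b, num_patterns=1000, width=16, seed=0):
    """Apply the same pseudo-random patterns to two cores and compare responses.

    Returns the first pattern on which the responses differ, or None if the
    two cores agree on every applied pattern.
    """
    rng = random.Random(seed)
    for _ in range(num_patterns):
        pattern = rng.getrandbits(width)
        if core_a(pattern) != core_b(pattern):
            return pattern
    return None


def good_core(x):
    """Stand-in for a fault-free core: a 16-bit increment."""
    return (x + 1) & 0xFFFF


def faulty_core(x):
    """The same function with output bit 3 stuck at 0."""
    return good_core(x) & ~0x0008 & 0xFFFF


print(mutual_test(good_core, good_core))           # cores agree
print(mutual_test(good_core, faulty_core) is None)  # a mismatch is found
```

Comparison alone cannot tell which of the two cores is faulty, and a defect that affects both cores identically would go undetected, so such a scheme is usually combined with other tests.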

As in the case of design-for-test and test generation, the natural hierarchy imposed by the use of multi-core systems can also pave the way for efficient solutions for other computationally intensive tasks in electronic design, such as design verification, logic synthesis, timing simulation, physical design, and static timing analysis.

1.1.1.3 Systemic Considerations

There are also software and system-design issues that make a multiprocessor solution attractive. There are numerous VLSI design challenges that a design team may find daunting when faced with the problem of designing a high-performance system-on-chip (SoC). These include verification, logic design, physical design, timing analysis, and timing closure.

The way to harness performance in the single-processor alternative is to use superscalar computing and very long instruction word (VLIW) processors. Compilers written for such processors have limited scope for extracting the parallelism in applications. To increase the compute power of a processor, architects make use of sophisticated features like out-of-order execution and speculative execution of instructions. These kinds of processors dynamically extract parallelism from the instruction sequence. However, the cost of extracting parallelism from a single thread is becoming prohibitive, making the single complex processor alternative unattractive. With many applications written in languages such as Java or C++ resorting to multithreading, a compiler has more visibility of MIMD-type parallelism (Multiple Instruction Stream, Multiple Data Stream) in the application.

Both homogeneous and heterogeneous multiprocessor architectures have been used in building embedded systems. Heterogeneous multiprocessing is used when some parts of the embedded software need the power of a digital signal processor and other parts need a micro-controller for the housekeeping activity. We shall consider several MPSoC case studies to illustrate the architectures used in modern-day embedded systems. In particular, we shall emphasize the following aspects of MPSoC designs: (a) processor

In document Multi Core Embedded Systems - Embedded Multi Core Systems - Georgios Kornaros (Page 31-61)