AMD Opteron Quad-Core

(1)

a brief overview

Daniele Magliozzi

(2)

Opteron Memory Architecture



native quad-core design

(four cores on a single die for more efficient data sharing)



enhanced cache structure



integrated memory controller



sustained multi-threaded application throughput that fits the needs of modern servers and workstations

(3)

3 levels of dedicated & shared cache

4 different caches accelerate instruction execution and data processing



L1 Instruction Cache:

64-Kbyte, 2-way set-associative, 64-byte line length, LRU replacement; used for instruction loads, instruction prefetching, instruction predecoding, and branch prediction.



L1 Data Cache:

64-Kbyte, 2-way set-associative, write-allocate and write-back with LRU replacement, divided into eight banks (16 bytes wide), with a prefetcher and a 3-cycle load-to-use latency.



L2 Cache:

contains only victim or copy-back blocks from L1.



L3 Cache:

dynamically shared, non-inclusive victim cache with blocks allocated on L2 victims/copy-backs. A hit in L3 can either leave the data there (for data accessed by multiple cores) or remove the data from L3 and place it solely in L1 (for data accessed by a single core).
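
The victim behaviour described above can be illustrated with a toy model. This is only a sketch under simplifying assumptions that are not taken from the slides: two levels instead of three, fully associative caches with LRU replacement, and a hypothetical class name `ExclusiveHierarchy`. It shows that the lower level is filled only by blocks evicted from L1, and that a hit in the lower level moves the block back to L1.

```python
from collections import OrderedDict


class ExclusiveHierarchy:
    """Toy two-level hierarchy: L1 plus an L2 that holds only L1 victims."""

    def __init__(self, l1_lines, l2_lines):
        self.l1 = OrderedDict()  # block address -> None, least recently used first
        self.l2 = OrderedDict()
        self.l1_lines = l1_lines
        self.l2_lines = l2_lines

    def _fill_l1(self, block):
        # Install the block in L1; the LRU victim, if any, moves down to L2.
        if len(self.l1) >= self.l1_lines:
            victim, _ = self.l1.popitem(last=False)
            if len(self.l2) >= self.l2_lines:
                self.l2.popitem(last=False)  # the L2 victim leaves this toy model
            self.l2[victim] = None
        self.l1[block] = None

    def access(self, block):
        """Return where the block was found: 'L1', 'L2' or 'memory'."""
        if block in self.l1:
            self.l1.move_to_end(block)  # mark as most recently used
            return "L1"
        if block in self.l2:
            del self.l2[block]  # victim cache: the block moves back to L1
            self._fill_l1(block)
            return "L2"
        self._fill_l1(block)  # a miss fills L1 directly, bypassing L2
        return "memory"


hierarchy = ExclusiveHierarchy(l1_lines=2, l2_lines=4)
trace = [hierarchy.access(block) for block in ["A", "B", "C", "A", "C"]]
```

In this trace, block A is evicted from L1 when C arrives, so the second access to A is served from the victim level rather than from memory. At no point does a block sit in both levels at once.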

(4)

DDR2 SDRAM with integrated memory controller



SDRAM:

stores data in memory cells and uses a clock signal to synchronize its operation with an external data bus.



DDR2 SDRAM:

(double data rate synchronous dynamic random access memory) transfers data on both the rising and the falling edge of the clock (a technique called "double pumping").



Improvement:

for the same internal memory clock, the external data bus runs at twice the clock rate of its predecessor (DDR), giving up to twice the bandwidth.



Memory Controller:

integrated on-die, manages the flow of data going to and from main memory, optimizing memory performance and bandwidth per CPU and reducing the latency inherent in front-side bus architectures.
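
The bandwidth claim for DDR2 can be checked with simple arithmetic. The sketch below assumes a 64-bit data bus and a 200 MHz internal memory clock; both values are illustrative assumptions, not figures from the slides, and the function name is hypothetical.

```python
def peak_bandwidth_mb_s(bus_clock_mhz, bus_width_bits=64, transfers_per_clock=2):
    """Peak transfer rate in MB/s (1 MB = 10**6 bytes).

    transfers_per_clock=2 models double pumping: one transfer on the
    rising edge and one on the falling edge of the bus clock.
    """
    mega_transfers_per_s = bus_clock_mhz * transfers_per_clock
    return mega_transfers_per_s * bus_width_bits // 8


internal_clock_mhz = 200  # assumed internal memory clock

# DDR: the external bus runs at the internal clock rate.
ddr_mb_s = peak_bandwidth_mb_s(internal_clock_mhz)

# DDR2: the external bus runs at twice the internal clock rate.
ddr2_mb_s = peak_bandwidth_mb_s(internal_clock_mhz * 2)
```

Under these assumptions the two results are 3200 MB/s and 6400 MB/s, so the doubling comes entirely from the faster external bus clock, not from a change in the memory cells themselves.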

(5)

Direct Connect Architecture

Front side bus eliminated, core directly connected to:



memory controller



I/O subsystem



other processors

by high-bandwidth HyperTransport links.

This improves overall system performance and efficiency by eliminating the traditional bottlenecks inherent in legacy front-side bus architectures.

(6)

HyperTransport Technology

high-speed, low-latency, point-to-point, unidirectional links between two devices, capable of extremely fast signaling (up to 800 MHz clock speed) and compatible with the PCI interface.



“Packetized” bus:

addresses, data, and commands are sent along the same wires, allowing narrower links that are easier to route.



HT System:

a processor with a HyperTransport port (called the HyperTransport host), the HyperTransport bus, and any I/O channels connected to it.



Differential signaling:

(employed by the links) uses two wires for each signal; the received value is the difference between the two signals sent. It does not suffer from the problems associated with the single-ended signaling of high-speed parallel buses (bouncing signals, interference, cross-talk).
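
A small numeric sketch can show why taking the difference of two wires rejects interference. The voltage swing, the noise values, and all function names below are illustrative assumptions; the model only covers noise that couples equally onto both wires.

```python
def drive_differential(bits, swing=0.5):
    """Encode each bit as a pair of opposite voltages (true wire, complement wire)."""
    return [(swing, -swing) if bit else (-swing, swing) for bit in bits]


def add_common_noise(pairs, noise):
    """Add the same noise voltage to both wires of each pair."""
    return [(p + n, m + n) for (p, m), n in zip(pairs, noise)]


def receive_differential(pairs):
    """Recover bits from the sign of the difference between the two wires."""
    return [1 if (p - m) > 0 else 0 for p, m in pairs]


def receive_single_ended(pairs, threshold=0.0):
    """Recover bits from the true wire alone, compared with a fixed threshold."""
    return [1 if p > threshold else 0 for p, _ in pairs]


bits = [1, 0, 1, 1, 0]
noise = [0.0, 0.8, -0.9, 0.2, 0.7]  # assumed interference, larger than the swing at times
noisy = add_common_noise(drive_differential(bits), noise)

differential_bits = receive_differential(noisy)
single_ended_bits = receive_single_ended(noisy)
```

Because the noise is added to both wires, it cancels in the subtraction and the differential receiver recovers the original bits, while the single-ended receiver misreads every bit where the noise exceeds the swing.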

(7)

HyperTransport Technology (Switch Topology)

The host communicates directly with the switch chip, which in turn manages multiple independent slaves, including tunnels, bridges, and end-device chips (a parallelized daisy chain).

Each port on the switch benefits from the full bandwidth of the HyperTransport technology I/O link because the switch directs the flow of electrical signals between the slave devices connected to it.

HyperTransport supports multiple connection topologies: daisy chain, switch, and star.

(8)

AMD Virtualization

To allow multiple operating systems to run on the same physical platform, a software layer (the hypervisor) decouples the operating system from the underlying hardware. It also acts as a translation layer for guest virtual addresses. The hypervisor can operate in two ways:



SW:

The hypervisor either modifies the guest source code so that the guest cooperates with it, or modifies guest code at run time to control the guest's privileged operations.



HW-assisted virtualization:

The hypervisor uses a set of processor extensions (e.g., AMD-V) to intercept and emulate guest privileged operations.

With AMD-V, the hypervisor specifies how the processor itself should handle privileged operations in the guest, without transferring control to the hypervisor. This makes switching between VMs more efficient, which helps performance, and effectively isolates VMs for secure operation.

(9)

Rapid Virtualization Indexing (RVI)



Paging enabled: the operating system defines a set of Page Tables, which the Page Walker (implemented in processor hardware) uses to translate "linear addresses" to physical addresses.



guest Page Table (gPT): under virtualization, the guest's page tables add another level of translation. The hypervisor can manage it in software (with a shadow Page Table) or in hardware:



nested Page Tables (nPT): set up by the hypervisor and used by the Page Walker, which handles the second level of translation in hardware. This reduces the overheads found in equivalent shadow paging implementations, and recent translations are stored in an internal translation look-aside buffer (TLB).
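
The two levels of translation can be sketched as two lookups followed by a TLB fill. This is a deliberately simplified model, not the hardware algorithm: each table is a single-level Python dict, pages are assumed to be 4 KB, and the class name `NestedWalker` and the page numbers are hypothetical. A real walker also has to translate the addresses of the guest page tables themselves through the nested tables, which this sketch omits.

```python
PAGE_SIZE = 4096  # assumed 4 KB pages


class NestedWalker:
    """Toy page walker with a guest table, a nested table and a TLB."""

    def __init__(self, guest_pt, nested_pt):
        self.guest_pt = guest_pt    # guest virtual page -> guest physical page (set by the guest OS)
        self.nested_pt = nested_pt  # guest physical page -> host physical page (set by the hypervisor)
        self.tlb = {}               # guest virtual page -> host physical page
        self.walks = 0              # number of full two-level walks performed

    def translate(self, guest_virtual):
        page, offset = divmod(guest_virtual, PAGE_SIZE)
        if page not in self.tlb:
            self.walks += 1
            guest_physical_page = self.guest_pt[page]             # first level: gPT
            self.tlb[page] = self.nested_pt[guest_physical_page]  # second level: nPT
        return self.tlb[page] * PAGE_SIZE + offset


walker = NestedWalker(guest_pt={0x10: 0x2}, nested_pt={0x2: 0x7F})
first = walker.translate(0x10ABC)   # misses the TLB, walks both tables
second = walker.translate(0x10010)  # same page, served from the TLB
```

The second access reuses the cached translation, so only one walk is counted; this is the role the TLB plays in hiding the cost of the extra translation level.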

(10)

Power Performances



Enhanced AMD PowerNow! with Independent Dynamic Core: allows processors and cores to operate at various voltages and frequencies.
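
To see why varying voltage and frequency matters for power, the sketch below uses the common first-order approximation that dynamic CMOS power scales with the square of the supply voltage times the clock frequency. This relation is not stated in the slides and ignores static (leakage) power; the function name and the scaling values are illustrative assumptions.

```python
def relative_dynamic_power(voltage_ratio, frequency_ratio):
    """Dynamic power relative to the nominal operating point.

    Uses the first-order approximation P ~ C * V^2 * f with constant
    capacitance C, so only the voltage and frequency ratios matter.
    """
    return voltage_ratio ** 2 * frequency_ratio


nominal = relative_dynamic_power(1.0, 1.0)
half_frequency = relative_dynamic_power(1.0, 0.5)
half_both = relative_dynamic_power(0.5, 0.5)
```

Under this approximation, halving only the frequency halves dynamic power, while also halving the voltage brings it down to one eighth, which is why scaling both together is attractive for lightly loaded cores.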


