Intel® Pentium® 4 Processor on 90nm Technology

Ronak Singhal

August 24, 2004

Hot Chips 16


Agenda

• Netburst® Microarchitecture Review

• Microarchitecture Features

• Hyper-Threading Technology

• SSE3

• Intel® Extended Memory 64 Technology

Vital Statistics

• 125 million transistors

• 112 mm² die size

• 90nm manufacturing process

• Introduced in Feb. 2004 @ >3GHz

Intel® Pentium® 4 Processor Block Diagram

[Figure: block diagram. Labeled blocks: System Bus (quad pumped, 6.4 GB/s, 64-bit wide); Bus Interface Unit; L2 Cache (1 MB 8-way), 256-bit wide; Instruction TLB; Dynamic BTB (4K entries); Instruction Decoder; Execution Trace Cache (12K µops); Trace Cache BTB (512 entries); Micro Instruction Sequencer; Allocator / Register Renamer; Memory µop Queue; Integer/Floating Point µop Queue; a row labeled Memory / Slow / Fast Int / Fast Int / FP Move / FP Gen; Integer Register File / Bypass Network; FP Register / Bypass; Ld/St Address unit (2xAGU); Slow ALU (complex instr.); two 2xALU (simple instr.); FP Move; FP MMX SSE SSE2 SSE3; L1 Data Cache (8 Kbyte 4-way).]

Key Characteristics

[Figure: the block diagram repeated on two slides, with callouts]

• L1 Data Cache 16Kbyte 8-way

• Trace Cache instead of conventional I-Cache

New Microarchitecture Features

• Larger Caches

• Deeper Buffers

• Faster Execution Units

• Algorithmic Enhancements

Cache Comparison

                        130nm                         90nm
1st level data cache    8KB, 4-ways, Write-through    16KB, 8-ways, Write-through
2nd level data cache    512KB, 8-ways, Write-back     1MB, 8-ways, Write-back
Trace Cache             12k uops                      12k uops

Larger Buffers

                                           130nm    90nm
ROB Size                                   126      126
Store Buffers                              24       32
Write Combining Buffers                    6        8
Load Buffers                               48       48
FP Schedulers                              10/12    14/16
Outstanding 1st level Data Cache Misses    4        8

Faster Execution Units

• Shifts

– Typical shifts now handled inside of fast execution core w/ single cycle latency

– Previously handled in complex integer unit with 6 cycle latency

• Integer Multiply

– Adds dedicated integer multiplier

Algorithmic Enhancements

• Branch Prediction

• Hardware Prefetching


Branch Prediction

• Continued improvement of existing algorithms

• Improved static prediction algorithm

– Displacement check

– Condition check

• Added indirect branch predictor

Branch Predictor Comparison

# of Branch Mispredicts Per 100 Instructions on SPECint*_base2000

Benchmark      130nm    90nm
164.gzip       1.03     1.01
175.vpr        1.32     1.21
176.gcc        0.85     0.70
181.mcf        1.35     1.22
186.crafty     0.72     0.68
197.parser     1.06     0.87
252.eon        0.44     0.39
253.perlbmk    0.62     0.28
254.gap        0.33     0.24
255.vortex     0.08     0.09
256.bzip2      1.19     1.12
300.twolf      1.32     1.23

* Other names and brands are the property of their respective owners
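
To make the comparison easier to read, the per-benchmark figures can be turned into relative reductions. The sketch below copies the 130nm and 90nm mispredict rates from the comparison above; the helper names `relative_reduction` and `summarize` are illustrative and not part of the presentation.

```python
# Branch mispredicts per 100 instructions as (130nm, 90nm), copied from the slide.
MISPREDICTS = {
    "164.gzip": (1.03, 1.01),
    "175.vpr": (1.32, 1.21),
    "176.gcc": (0.85, 0.70),
    "181.mcf": (1.35, 1.22),
    "186.crafty": (0.72, 0.68),
    "197.parser": (1.06, 0.87),
    "252.eon": (0.44, 0.39),
    "253.perlbmk": (0.62, 0.28),
    "254.gap": (0.33, 0.24),
    "255.vortex": (0.08, 0.09),
    "256.bzip2": (1.19, 1.12),
    "300.twolf": (1.32, 1.23),
}


def relative_reduction(old: float, new: float) -> float:
    """Fractional drop in mispredict rate going from old to new (negative = worse)."""
    return (old - new) / old


def summarize(table):
    """Map each benchmark name to its relative mispredict reduction."""
    return {name: relative_reduction(old, new) for name, (old, new) in table.items()}


if __name__ == "__main__":
    ranked = sorted(summarize(MISPREDICTS).items(), key=lambda kv: -kv[1])
    for name, reduction in ranked:
        print(f"{name:12s} {reduction:+.1%}")
```

On these numbers 253.perlbmk shows the largest relative drop, and 255.vortex is the one benchmark that gets slightly worse.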

Hardware Prefetching

• Primary mechanism to hide DRAM latency

• Processor predicts what data will be needed in the future and proactively fetches it from DRAM

• Exists on all Intel® Pentium® 4 Processor implementations

• 90nm version improves on what data to get and when to get it
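
The slides do not describe the prefetch algorithm itself. As a rough illustration of "predict what will be needed and fetch it early", here is a minimal sketch of a generic stride prefetcher. It is not Intel's design: the 64-byte line size, the two-line prefetch distance, and every name in the code are assumptions made for the example.

```python
LINE = 64  # assumed cache line size in bytes; not stated on the slide


class StridePrefetcher:
    """Toy stride detector: once two consecutive line strides match, fetch ahead."""

    def __init__(self, distance=2):
        self.distance = distance   # how many strides ahead to prefetch (hypothetical)
        self.last_line = None
        self.last_stride = None
        self.prefetched = set()    # lines that were requested ahead of demand

    def access(self, addr):
        """Record a demand access; return True if its line had been prefetched."""
        line = addr // LINE
        hit = line in self.prefetched
        if self.last_line is not None:
            stride = line - self.last_line
            if stride != 0 and stride == self.last_stride:
                for k in range(1, self.distance + 1):
                    self.prefetched.add(line + k * stride)
            self.last_stride = stride
        self.last_line = line
        return hit


def covered_fraction(addresses, distance=2):
    """Fraction of demand accesses whose line was prefetched beforehand."""
    prefetcher = StridePrefetcher(distance)
    hits = sum(prefetcher.access(a) for a in addresses)
    return hits / len(addresses)
```

A sequential walk over 100 lines is covered after the first three accesses, while an irregular access pattern gets no benefit, which is one reason the choice of what to fetch and when matters.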

Hardware Prefetching

Impact of HW prefetcher on most sensitive benchmarks in SPEC CPU2000

[Figure: bar chart, HWP Disabled vs. HWP Enabled, vertical axis 0.00 to 2.00. Benchmarks: 254.gap, 191.fma3d, 178.galgel, 187.facerec, 171.swim, 168.wupwise, 173.applu, 189.lucas, 181.mcf, 172.mgrid, 183.equake. Data labels: 1.16, 1.18, 1.21, 1.26, 1.29, 1.30, 1.32, 1.40, 1.45, 1.49, 1.97.]

Hyper-Threading Technology

• Makes a single processor look like two processors to software

• Takes advantage of underutilized resources when running a single thread through the processor

Hyper-Threading Technology requires a computer system with an Intel® Pentium® 4 processor supporting HT Technology and a Hyper-Threading Technology enabled chipset, BIOS and operating system. Performance will vary depending on the specific hardware and software you use. See http://www.intel.com/info/hyperthreading/ for more information including details on which processors support HT Technology.

Hyper-Threading Technology Improvements

• 1st level data cache

– Uses partial virtual address index

– Aliasing can occur due to stacks of two threads being offset by a fixed amount

– Use context identifier to differentiate between data from different threads. Better than thread identifier to allow data sharing between threads.

– Introduced on later steppings of 130nm version.

• Parallel Operations

– Allow page walks and split memory access handling in parallel

– Allow multiple page walks if one goes to DRAM

• Buffer sizes

– Motivated increase in # of outstanding 1st level cache misses
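
The first-level cache point can be illustrated with a small model. This is a sketch under stated assumptions, not the actual hardware: the 16-bit partial tag width, the `Thread` record, and the rule that threads sharing an address space share a context identifier are all hypothetical choices made for the example.

```python
from dataclasses import dataclass

PARTIAL_BITS = 16  # hypothetical number of virtual-address bits compared on lookup


def partial_tag(vaddr: int) -> int:
    """Low bits of the virtual address used for the early hit/miss decision."""
    return vaddr & ((1 << PARTIAL_BITS) - 1)


@dataclass(frozen=True)
class Thread:
    thread_id: int       # logical processor number
    address_space: int   # stand-in for the page-table base the thread runs under


def no_qualifier(thread: Thread) -> int:
    return 0


def thread_qualifier(thread: Thread) -> int:
    return thread.thread_id


def context_qualifier(thread: Thread) -> int:
    # Assumption: threads that share an address space share a context identifier.
    return thread.address_space


def hits(owner, owner_vaddr, requester, req_vaddr, qualifier) -> bool:
    """True if the requester's lookup matches the line the owner installed."""
    stored = (partial_tag(owner_vaddr), qualifier(owner))
    probe = (partial_tag(req_vaddr), qualifier(requester))
    return stored == probe


if __name__ == "__main__":
    t0 = Thread(0, address_space=100)
    t1_other = Thread(1, address_space=200)
    t1_same = Thread(1, address_space=100)
    stack0, stack1 = 0x7F000000, 0x7F000000 + 0x100000  # fixed 1 MiB offset
    print(hits(t0, stack0, t1_other, stack1, no_qualifier))       # aliasing
    print(hits(t0, stack0, t1_other, stack1, context_qualifier))  # separated
    print(hits(t0, 0x1234, t1_same, 0x1234, thread_qualifier))    # sharing lost
    print(hits(t0, 0x1234, t1_same, 0x1234, context_qualifier))   # sharing kept
```

In this model a thread identifier would also separate the two stacks, but it would make two threads of the same address space miss on data they genuinely share, which is the trade-off the slide points to.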


SSE3

• 13 new instructions

– x87 to integer conversion

– Graphics (Horizontal Add/Subtract)

– Complex arithmetic

– Video Encoding


Complex Arithmetic

• MOVDDUP, MOVSHDUP, MOVSLDUP

– Instructions to load and duplicate data implicitly

• ADDSUBPS, ADDSUBPD

– Perform a mix of addition and subtraction simultaneously

• 10-20% gain on 168.wupwise from these instructions (complex matrix multiply)
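
To show why duplicate and add/subtract operations suit complex arithmetic, the sketch below emulates the lane behavior of MOVSLDUP, MOVSHDUP and ADDSUBPS on four-element lists and uses them to multiply two pairs of complex numbers. It models only the data movement; the `swap_pairs` helper stands in for an ordinary shuffle, and the instruction sequence is one common formulation rather than anything taken from the slides.

```python
def movsldup(v):
    """Duplicate the even-indexed elements: [a0, a0, a2, a2]."""
    return [v[0], v[0], v[2], v[2]]


def movshdup(v):
    """Duplicate the odd-indexed elements: [a1, a1, a3, a3]."""
    return [v[1], v[1], v[3], v[3]]


def addsubps(a, b):
    """Subtract in even lanes, add in odd lanes."""
    return [a[0] - b[0], a[1] + b[1], a[2] - b[2], a[3] + b[3]]


def mulps(a, b):
    """Lane-wise multiply."""
    return [x * y for x, y in zip(a, b)]


def swap_pairs(v):
    """Swap real and imaginary parts within each pair (stands in for a shuffle)."""
    return [v[1], v[0], v[3], v[2]]


def complex_mul2(x, y):
    """Multiply two complex pairs stored as [re0, im0, re1, im1]."""
    t1 = mulps(movsldup(x), y)               # [xr*yr, xr*yi, ...]
    t2 = mulps(movshdup(x), swap_pairs(y))   # [xi*yi, xi*yr, ...]
    return addsubps(t1, t2)                  # [xr*yr - xi*yi, xr*yi + xi*yr, ...]
```

For example, `complex_mul2([1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0])` computes (1+2i)(5+6i) and (3+4i)(7+8i) in one pass, with the subtract in the real lanes and the add in the imaginary lanes handled by the single add/subtract step.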


Video Encoding

• Motion Estimation compares previous frame to current frame

– Loads from the previous frame are unaligned

– Leads to costly cache line split memory accesses

• LDDQU instruction loads 128 bits at an arbitrary alignment with no cache line split

• Speedups of > 10% on MPEG-4 encoders
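
A quick way to see how often motion estimation runs into line splits is to count them. The sketch below assumes a 64-byte cache line, which the slide does not state, and checks whether a 16-byte (128-bit) load at a given byte address straddles two lines. It does not model how LDDQU avoids the split; the function names are illustrative.

```python
LINE_BYTES = 64   # assumed cache line size; not given on the slide
LOAD_BYTES = 16   # one 128-bit load


def splits_line(addr, size=LOAD_BYTES, line=LINE_BYTES):
    """True if the bytes [addr, addr + size) fall in two different cache lines."""
    return addr // line != (addr + size - 1) // line


def split_count(base, width, size=LOAD_BYTES):
    """Count loads that split a line when a block slides over `width` byte offsets."""
    return sum(splits_line(base + dx, size) for dx in range(width))
```

Under these assumptions, sliding a 16-byte load across every byte offset of a 64-byte span splits a line in 15 of 64 positions, so a search that tries every candidate offset pays the split penalty on roughly a quarter of its loads.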


Thread Synchronization

• Used to indicate that a thread is spinning and waiting for work

• Allows processor to go into an optimized state

• MONITOR – Sets up address monitoring hardware

• MWAIT – Sets processor into optimized state. Will wake up when monitored address is written to
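
The arm-then-wait pattern can be mimicked in software. The following is only an analogy built on Python's `threading.Condition`, not a use of the MONITOR and MWAIT instructions themselves: the `MonitoredWord` class, its method names, and the timeout are all invented for illustration.

```python
import threading


class MonitoredWord:
    """Software stand-in for one monitored memory location."""

    def __init__(self, value=0):
        self._value = value
        self._cond = threading.Condition()

    def store(self, value):
        """A write to the monitored location wakes any waiter."""
        with self._cond:
            self._value = value
            self._cond.notify_all()

    def wait_for_change(self, old, timeout=5.0):
        """Arm, re-check, then sleep until the value differs from `old`.

        Checking the value under the lock before sleeping plays the role of
        arming the monitor first: a store that lands in between is not missed.
        """
        with self._cond:
            self._cond.wait_for(lambda: self._value != old, timeout=timeout)
            return self._value


def demo():
    """Start a waiter, write the word from the main thread, return what the waiter saw."""
    word = MonitoredWord(0)
    seen = []
    waiter = threading.Thread(target=lambda: seen.append(word.wait_for_change(0)))
    waiter.start()
    word.store(42)
    waiter.join()
    return seen[0]
```

The waiting thread consumes no cycles in a spin loop while it sleeps, which is the software counterpart of letting the processor drop into an optimized state until the monitored address is written.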

Intel® Extended Memory 64 Technology

• Additional capability in today’s Intel® Xeon™ processors on top of:

– Netburst® Microarchitecture

– Hyper-Threading Technology

– SSE3

• Provides 8 more integer and SSE registers

• Larger addressing capability

– 48 bits of virtual address on this implementation

– 36-40 bits of physical address on this implementation

• Full 64-bit support carefully engineered into the 90nm design

– Limited differences between 32-bit and 64-bit operations

– Similar optimizations for 32-bit and 64-bit code
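
The address widths above translate directly into addressable sizes. The short sketch below does that arithmetic; the helper names are illustrative, and the 32-bit entry is included only as a familiar baseline.

```python
UNITS = ["B", "KiB", "MiB", "GiB", "TiB", "PiB"]


def addressable_bytes(bits: int) -> int:
    """Number of distinct byte addresses for a given address width."""
    return 1 << bits


def human(n: int) -> str:
    """Format an exact power-of-two byte count with a binary unit."""
    unit = 0
    while n >= 1024 and n % 1024 == 0 and unit < len(UNITS) - 1:
        n //= 1024
        unit += 1
    return f"{n} {UNITS[unit]}"


if __name__ == "__main__":
    for label, bits in [("32-bit baseline", 32), ("physical (low)", 36),
                        ("physical (high)", 40), ("virtual", 48)]:
        print(f"{label:16s} {bits} bits -> {human(addressable_bytes(bits))}")
```

So 48 virtual address bits cover 256 TiB, and 36 to 40 physical bits cover 64 GiB to 1 TiB, compared with 4 GiB for a 32-bit address.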

32-bit vs. 64-bit Comparison

                     32-bit                        64-bit
ALU Latency          1 cycle                       1 cycle
ALU Throughput       4 operations/cycle            4 operations/cycle
Memory Throughput    1 load + 1 store per cycle    1 load + 1 store per cycle
Page walks           2 levels -- use PDE cache     4 levels -- use PDE cache
                     to reduce to 1 level in       to reduce to 1 level in
                     common case                   common case

Enable strong 64-bit performance without compromising 32-bit performance
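
The page walk depths in the comparison follow from how a virtual address is carved into table indices. The sketch below uses the standard x86 layouts for 4 KB pages (10+10+12 bits for 32-bit paging without PAE, 9+9+9+9+12 bits for the 48-bit 64-bit mode). The `walk_loads` helper is a deliberately simplified assumption about the PDE cache: a hit leaves only the final level to read.

```python
def split_64(vaddr: int) -> dict:
    """Four table indices plus page offset for a 48-bit virtual address."""
    return {
        "pml4": (vaddr >> 39) & 0x1FF,
        "pdpt": (vaddr >> 30) & 0x1FF,
        "pd": (vaddr >> 21) & 0x1FF,
        "pt": (vaddr >> 12) & 0x1FF,
        "offset": vaddr & 0xFFF,
    }


def split_32(vaddr: int) -> dict:
    """Two table indices plus page offset for a 32-bit virtual address (no PAE)."""
    return {
        "pd": (vaddr >> 22) & 0x3FF,
        "pt": (vaddr >> 12) & 0x3FF,
        "offset": vaddr & 0xFFF,
    }


def walk_loads(levels: int, pde_cached: bool) -> int:
    """Table reads per walk; simplified model where a PDE-cache hit skips to the last level."""
    return 1 if pde_cached else levels
```

With a PDE-cache hit both modes need a single table read, which is how the deeper 64-bit walk avoids costing more than the 32-bit one in the common case.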

Optimizing 64-bit code

• Rule #1: Follow 32-bit optimizations

• Rule #2: Compile with Pentium 4 specific optimizations enabled

• Few additional new rules:

– When data size is 32 bits, typically use 32-bit instructions

• Example: XOR EAX, EAX instead of XOR RAX, RAX

– But sign extend to full 64-bits instead of only 32 bits (even for 32-bit data size)
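
The XOR example works because, in 64-bit mode, writing a 32-bit register clears the upper half of the full 64-bit register, so the shorter 32-bit form (no REX prefix) still zeroes all of RAX. The sketch below models only those register semantics with Python integers; it says nothing about timing, and the function names are illustrative.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF


def write32(old_reg64: int, value: int) -> int:
    """Result of a 32-bit register write: upper 32 bits become zero, old contents are lost."""
    return value & MASK32


def xor_eax_eax(rax: int) -> int:
    """XOR EAX, EAX: 32-bit operation, result zero-extended into RAX."""
    low = rax & MASK32
    return write32(rax, low ^ low)


def xor_rax_rax(rax: int) -> int:
    """XOR RAX, RAX: full 64-bit operation."""
    return (rax ^ rax) & MASK64


def sign_extend_32_to_64(value32: int) -> int:
    """Replicate bit 31 into bits 63..32, as a 64-bit sign extension would."""
    value32 &= MASK32
    return (value32 | (MASK64 ^ MASK32)) if value32 & 0x80000000 else value32
```

Both XOR forms leave the register at zero, while a 32-bit value of 0xFFFFFFFF becomes 0x00000000FFFFFFFF under a plain 32-bit write but 0xFFFFFFFFFFFFFFFF when sign extended, which is why the choice of extension matters for signed 32-bit data used in 64-bit arithmetic.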

Performance Results

[Figure: bar chart, HT Technology Disabled vs. HT Technology Enabled, vertical axis 0.00 to 1.50. Applications: Adobe* Photoshop* CS, Windows Media* Encoder 9.0, Maxon Cinema* 4D, MainConcept* 1.4, MAGIX* mp3 maker 2004 diamond. Data labels: 1.12, 1.19, 1.20, 1.25, 1.28.]

Source: Intel. Configuration: Intel® Pentium® 4 processor with HT Technology 3.40E GHz – Intel® D875PBZ Desktop Board (AA-301); All Platforms – 1GB DDR400 CL3-3-3, ATI* Radeon* 9800 Pro AGP graphics, ATI* Catalyst* 3.5 Driver Suite: display driver 6.14.10.6360, Intel® Application Accelerator RAID Edition 3.5 with RAID ready, Intel® Chipset Software Installation Utility 5.01.1015, Seagate* Barracuda* 7200 Serial ATA 160GB Hard Drive - ST3160023AS, Intel C & Fortran compilers 8.0, DirectX* 9.0b, Windows* XP Build 2600 SP1, Intel® PRO/1000 MT Desktop Adapter. Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance.

Performance Results

[Figure: bar chart comparing Intel® Pentium® 4 Processor with HT Technology 3.40 GHz and Intel® Pentium® 4 Processor with HT Technology 3.40E GHz, vertical axis 0.00 to 1.50. Categories: SPECint*_base2000, SPECfp*_base2000. Data labels: 1.07, 1.14.]

Source: Intel. Configuration: Intel® Pentium® 4 processor with HT Technology 3.40 GHz – Intel® D875PBZ Desktop Board (AA-204); Intel® Pentium® 4 processor with HT Technology 3.40E GHz – Intel® D875PBZ Desktop Board (AA-301); All Platforms – 1GB DDR400 CL3-3-3, ATI* Radeon* 9800 Pro AGP graphics, ATI* Catalyst* 3.5 Driver Suite: display driver 6.14.10.6360, Intel® Application Accelerator RAID Edition 3.5 with RAID ready, Intel® Chipset Software Installation Utility 5.01.1015, Seagate* Barracuda* 7200 Serial ATA 160GB Hard Drive - ST3160023AS, Intel C & Fortran compilers 8.0, DirectX* 9.0b, Windows* XP Build 2600 SP1, Intel® PRO/1000 MT Desktop Adapter. Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance.
