Intel® Pentium® 4 Processor on 90nm Technology
Ronak Singhal
August 24, 2004
Hot Chips 16

Agenda
• Netburst® Microarchitecture Review
• Microarchitecture Features
• Hyper-Threading Technology
• SSE3
• Intel® Extended Memory 64 Technology

Vital Statistics
• 125 million transistors
• 112 mm² die size
• 90nm manufacturing process
• Introduced in Feb. 2004 @ >3GHz

Intel® Pentium® 4 Processor Block Diagram
[Figure: block diagram. Block labels:]
• System Bus: 64-bit wide, Quad Pumped, 6.4 GB/s
• Bus Interface Unit
• L2 Cache (1 MB 8-way)
• 256-bit wide data path
• Instruction TLB
• Dynamic BTB, 4K Entries
• Instruction Decoder
• Execution Trace Cache, 12K µops
• Trace Cache BTB, 512 Entries
• Micro Instruction Sequencer
• Allocator / Register Renamer
• Memory µop Queue
• Integer/Floating Point µop Queue
• Memory, Slow, Fast Int, Fast Int, FP Move, FP Gen
• Integer Register File / Bypass Network
• FP Register / Bypass
• Ld/St Address unit: 2xAGU
• Complex Instr.: Slow ALU
• Simple Instr.: 2xALU
• Simple Instr.: 2xALU
• FP Move
• FP: MMX, SSE, SSE2, SSE3
• L1 Data Cache 8Kbyte 4-way
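The 6.4 GB/s figure on the system bus follows from the other two labels on the diagram. A minimal sketch of that arithmetic is below; the 200 MHz base clock is an assumption, since the slide gives only the width, the "Quad Pumped" label, and the resulting bandwidth.

```python
# Peak system-bus bandwidth from the block-diagram labels.
BUS_WIDTH_BITS = 64          # "64-bit wide" (from the slide)
TRANSFERS_PER_CLOCK = 4      # "Quad Pumped" (from the slide)
BASE_CLOCK_HZ = 200_000_000  # ASSUMPTION: 200 MHz base clock, not stated on the slide


def peak_bandwidth_bytes_per_s(width_bits, transfers_per_clock, clock_hz):
    """Bytes moved per transfer times transfers per second."""
    return (width_bits // 8) * transfers_per_clock * clock_hz


bandwidth = peak_bandwidth_bytes_per_s(
    BUS_WIDTH_BITS, TRANSFERS_PER_CLOCK, BASE_CLOCK_HZ
)
print(f"{bandwidth / 1e9:.1f} GB/s")  # prints "6.4 GB/s"
```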

Key Characteristics
• L1 Data Cache 16Kbyte 8-way
• Trace Cache instead of conventional I-Cache

New Microarchitecture Features
• Larger Caches
• Deeper Buffers
• Faster Execution Units
• Algorithmic Enhancements

Cache Comparison
                        130nm                        90nm
1st level data cache    8KB, 4-way, Write-through    16KB, 8-way, Write-through
2nd level data cache    512KB, 8-way, Write-back     1MB, 8-way, Write-back
Trace Cache             12K µops                     12K µops
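One consequence of the cache comparison is that doubling both the capacity and the associativity of the 1st level data cache leaves the number of sets unchanged, while the 2nd level cache doubles its sets. A small sketch of that arithmetic follows; the 64-byte line size is an assumption, since the slides do not state line sizes.

```python
LINE_BYTES = 64  # ASSUMPTION: line size is not given on the slide


def num_sets(capacity_bytes, ways, line_bytes=LINE_BYTES):
    """Sets in a set-associative cache: capacity / (ways * line size)."""
    return capacity_bytes // (ways * line_bytes)


KB = 1024
l1_sets_130nm = num_sets(8 * KB, ways=4)
l1_sets_90nm = num_sets(16 * KB, ways=8)
l2_sets_130nm = num_sets(512 * KB, ways=8)
l2_sets_90nm = num_sets(1024 * KB, ways=8)

print("L1 sets:", l1_sets_130nm, "->", l1_sets_90nm)
print("L2 sets:", l2_sets_130nm, "->", l2_sets_90nm)
```

The set counts depend on the assumed line size, but the two ratios (unchanged for the 1st level, doubled for the 2nd level) do not.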

Larger Buffers
                                           130nm    90nm
ROB Size                                   126      126
Store Buffers                              24       32
Write Combining Buffers                    6        8
Load Buffers                               48       48
FP Schedulers                              10/12    14/16
Outstanding 1st level Data Cache Misses    4        8

Faster Execution Units
• Shifts
– Typical shifts now handled inside the fast execution core with single-cycle latency
– Previously handled in the complex integer unit with 6-cycle latency
• Integer Multiply
– Adds dedicated integer multiplier

Algorithmic Enhancements
• Branch Prediction
• Hardware Prefetching

Branch Prediction
• Continued improvement of existing algorithms
• Improved static prediction algorithm
– Displacement check
– Condition check
• Added indirect branch predictor
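The slide names the two static-prediction checks without giving their details. The sketch below shows one way such checks could combine when the dynamic predictor has no entry for a branch. It is an illustration only: the threshold, the set of conditions, and the way the checks are ordered are all hypothetical and are not taken from the slides.

```python
# HYPOTHETICAL values chosen only for illustration; the slides give neither.
MAX_LOOP_DISPLACEMENT = 1024                            # bytes
RARELY_TAKEN_CONDITIONS = frozenset({"overflow", "parity"})


def static_predict_taken(displacement, condition):
    """Static guess for a conditional branch with no dynamic history.

    displacement: target minus branch address; negative means backward.
    condition: name of the branch condition (illustrative strings).
    """
    if displacement >= 0:
        # Conventional baseline: forward branches are predicted not taken.
        return False
    if condition in RARELY_TAKEN_CONDITIONS:
        # Condition check: not a typical loop-closing condition.
        return False
    # Displacement check: only short backward branches are treated as loops.
    return -displacement <= MAX_LOOP_DISPLACEMENT


print(static_predict_taken(-32, "not_zero"))    # short backward branch
print(static_predict_taken(-4096, "not_zero"))  # long backward branch
```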

Branch Predictor Comparison
# of Branch Mispredicts Per 100 Instructions on SPECint*_base2000
               130nm    90nm
164.gzip       1.03     1.01
175.vpr        1.32     1.21
176.gcc        0.85     0.70
181.mcf        1.35     1.22
186.crafty     0.72     0.68
197.parser     1.06     0.87
252.eon        0.44     0.39
253.perlbmk    0.62     0.28
254.gap        0.33     0.24
255.vortex     0.08     0.09
256.bzip2      1.19     1.12
300.twolf      1.32     1.23
* Other names and brands are the property of their respective owners
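To read the comparison as relative change, the per-benchmark reduction in mispredicts can be computed directly from the slide's numbers. The sketch below does only that; the helper name is illustrative.

```python
# Branch mispredicts per 100 instructions, as (130nm, 90nm), from the slide.
MISPREDICTS_PER_100 = {
    "164.gzip": (1.03, 1.01),
    "175.vpr": (1.32, 1.21),
    "176.gcc": (0.85, 0.70),
    "181.mcf": (1.35, 1.22),
    "186.crafty": (0.72, 0.68),
    "197.parser": (1.06, 0.87),
    "252.eon": (0.44, 0.39),
    "253.perlbmk": (0.62, 0.28),
    "254.gap": (0.33, 0.24),
    "255.vortex": (0.08, 0.09),
    "256.bzip2": (1.19, 1.12),
    "300.twolf": (1.32, 1.23),
}


def reduction_percent(old, new):
    """Percentage drop in mispredicts going from old to new."""
    return 100.0 * (old - new) / old


reductions = {
    name: reduction_percent(old, new)
    for name, (old, new) in MISPREDICTS_PER_100.items()
}
best = max(reductions, key=reductions.get)
regressions = [name for name, pct in reductions.items() if pct < 0]

for name, pct in reductions.items():
    print(f"{name:12s} {pct:6.1f}%")
print("largest reduction:", best)
print("regressions:", regressions)
```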

Hardware Prefetching
• Primary mechanism to hide DRAM latency
• Processor predicts what data will be needed in the future and proactively fetches it from DRAM
• Exists on all Intel® Pentium® 4 Processor implementations
• 90nm version improves on what data to get and when to get it
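The slides do not describe the prediction algorithm. As a generic illustration of "predict what data will be needed and fetch it early", here is a textbook stride prefetcher. It is not Intel's design; the class name, line size, and lookahead depth are assumptions made for the sketch.

```python
class StridePrefetcher:
    """Textbook stride detector: after two accesses with the same line
    stride, request the next few lines along that stride."""

    def __init__(self, line_bytes=64, lookahead=2):  # both values assumed
        self.line_bytes = line_bytes
        self.lookahead = lookahead
        self.last_line = None
        self.last_stride = None

    def access(self, address):
        """Record a demand access; return addresses to prefetch."""
        line = address // self.line_bytes
        prefetches = []
        if self.last_line is not None:
            stride = line - self.last_line
            if stride != 0 and stride == self.last_stride:
                prefetches = [
                    (line + stride * k) * self.line_bytes
                    for k in range(1, self.lookahead + 1)
                ]
            self.last_stride = stride
        self.last_line = line
        return prefetches


prefetcher = StridePrefetcher()
trace = [prefetcher.access(addr) for addr in (0, 64, 128)]
print(trace)  # [[], [], [192, 256]]
```

The first two accesses only establish the stride; the third confirms it and triggers requests for the next two lines.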

Hardware Prefetching
Impact of HW prefetcher on most sensitive benchmarks in SPEC CPU2000
[Figure: bar chart, y-axis 0.00 to 2.00. Legend: HWP Disabled, HWP Enabled. Benchmarks: 254.gap, 191.fma3d, 178.galgel, 187.facerec, 171.swim, 168.wupwise, 173.applu, 189.lucas, 181.mcf, 172.mgrid, 183.equake. Bar values: 1.16, 1.18, 1.21, 1.26, 1.29, 1.30, 1.32, 1.40, 1.45, 1.49, 1.97.]

Hyper-Threading Technology†
• Makes a single processor look like two processors to software
• Takes advantage of underutilized resources when running a single thread through the processor
† Hyper-Threading Technology requires a computer system with an Intel® Pentium® 4 processor supporting HT Technology and a Hyper-Threading Technology enabled chipset, BIOS and operating system. Performance will vary depending on the specific hardware and software you use. See http://www.intel.com/info/hyperthreading/ for more information including details on which processors support HT Technology.

Hyper-Threading Technology Improvements
• 1st level data cache
– Uses partial virtual address index
– Aliasing can occur due to stacks of two threads being offset by a fixed amount
– Use context identifier to differentiate between data from different threads. Better than thread identifier because it allows data sharing between threads.
– Introduced on later steppings of 130nm version.
• Parallel Operations
– Allow page walks and split memory access handling in parallel
– Allow multiple page walks if one goes to DRAM
• Buffer sizes
– Motivated increase in # of outstanding 1st level cache misses
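The aliasing problem with a partial virtual address match can be illustrated with a toy model. Everything below is a sketch under stated assumptions: the number of compared bits is hypothetical, and the context identifier is modeled simply as "equal when the two accesses come from the same address space", which the slides do not spell out.

```python
PARTIAL_TAG_BITS = 20  # HYPOTHETICAL: how many low address bits are compared


def partial_tag(vaddr):
    """Low-order virtual address bits used for the match."""
    return vaddr & ((1 << PARTIAL_TAG_BITS) - 1)


def matches_partial(stored_vaddr, lookup_vaddr):
    """Match on the partial virtual address only."""
    return partial_tag(stored_vaddr) == partial_tag(lookup_vaddr)


def matches_with_context(stored, lookup):
    """Match on the partial address plus a context identifier.

    stored, lookup: (virtual address, context id) pairs. The context id
    is an assumed stand-in for "same address space".
    """
    (stored_vaddr, stored_ctx), (lookup_vaddr, lookup_ctx) = stored, lookup
    return stored_ctx == lookup_ctx and matches_partial(stored_vaddr, lookup_vaddr)


# Two stacks offset by a fixed amount that is a multiple of 2**PARTIAL_TAG_BITS.
stack_a = 0x7FF0_1000
stack_b = stack_a - (1 << PARTIAL_TAG_BITS)

print(matches_partial(stack_a, stack_b))                  # True: false match
print(matches_with_context((stack_a, 1), (stack_b, 2)))   # False: contexts differ
print(matches_with_context((stack_a, 1), (stack_a, 1)))   # True: shared data still hits
```

A per-thread identifier would also remove the false match, but it would reject the third case, where two threads in the same context touch the same data.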

SSE3
• 13 new instructions
– x87 to integer conversion
– Graphics (Horizontal Add/Subtract)
– Complex arithmetic
– Video Encoding

Complex Arithmetic
• MOVDDUP, MOVSHDUP, MOVSLDUP
– Instructions to load and duplicate data implicitly
• ADDSUBPS, ADDSUBPD
– Perform a mix of addition and subtraction simultaneously
• 10-20% gain on 168.wupwise from these instructions (complex matrix multiply)
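To show why duplicate-and-addsub operations suit complex arithmetic, the sketch below emulates the single-precision forms on 4-element Python lists and uses them to multiply two pairs of complex numbers stored as interleaved (real, imaginary) values. The helper functions mirror the instructions' element-wise behavior; `swap_pairs` stands in for an ordinary shuffle and is not one of the SSE3 instructions. This is an emulation for illustration, not a measurement of the hardware.

```python
def movsldup(v):
    """Duplicate the even-indexed elements: [v0, v0, v2, v2]."""
    return [v[0], v[0], v[2], v[2]]


def movshdup(v):
    """Duplicate the odd-indexed elements: [v1, v1, v3, v3]."""
    return [v[1], v[1], v[3], v[3]]


def swap_pairs(v):
    """Swap real and imaginary slots (stand-in for a shuffle)."""
    return [v[1], v[0], v[3], v[2]]


def mulps(a, b):
    """Element-wise multiply."""
    return [x * y for x, y in zip(a, b)]


def addsubps(a, b):
    """Subtract in even positions, add in odd positions."""
    return [a[0] - b[0], a[1] + b[1], a[2] - b[2], a[3] + b[3]]


def complex_mul_packed(a, b):
    """Multiply two packed complex pairs laid out as [re0, im0, re1, im1]."""
    by_real = mulps(a, movsldup(b))              # [ar*br, ai*br, ...]
    by_imag = mulps(swap_pairs(a), movshdup(b))  # [ai*bi, ar*bi, ...]
    return addsubps(by_real, by_imag)            # [ar*br - ai*bi, ai*br + ar*bi, ...]


a = [1.0, 2.0, 3.0, 4.0]  # (1+2j), (3+4j)
b = [5.0, 6.0, 7.0, 8.0]  # (5+6j), (7+8j)
product = complex_mul_packed(a, b)
print(product)  # [-7.0, 16.0, -11.0, 52.0]
```

The even/odd split in `addsubps` is what yields the subtraction in the real part and the addition in the imaginary part in a single step.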

Video Encoding
• Motion Estimation compares previous frame to current frame
– Loads from the previous frame are unaligned
– Leads to costly cache line split memory accesses
• LDDQU instruction loads 128 bits at an arbitrary alignment with no cache line split
• Speedups of > 10% on MPEG-4 encoders
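How often an unaligned 128-bit load crosses a cache line can be counted directly. The sketch below assumes a 64-byte line, which the slides do not state, and only identifies which start offsets split; it does not model how LDDQU avoids the split.

```python
LINE_BYTES = 64  # ASSUMPTION: cache line size is not given on the slide
LOAD_BYTES = 16  # a 128-bit load


def splits_cache_line(address, size=LOAD_BYTES, line=LINE_BYTES):
    """True when the first and last byte of the load fall in different lines."""
    return address // line != (address + size - 1) // line


split_offsets = [off for off in range(LINE_BYTES) if splits_cache_line(off)]
print(len(split_offsets), "of", LINE_BYTES, "start offsets split a line")
print("first splitting offset:", split_offsets[0])
```

Under these assumptions, a motion search that steps one byte at a time through the previous frame hits a split on 15 of every 64 consecutive start positions.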

Thread Synchronization
• Used to indicate that a thread is spinning and waiting for work
• Allows processor to go into an optimized state
• MONITOR – Sets up address monitoring hardware
• MWAIT – Sets processor into optimized state. Will wake up when monitored address is written to

Intel® Extended Memory 64 Technology
• Additional capability in today’s Intel® Xeon™ processors on top of:
– Netburst® Microarchitecture
– Hyper-Threading Technology
– SSE3
• Provides 8 more integer and SSE registers
• Larger addressing capability
– 48 bits of virtual address on this implementation
– 36-40 bits of physical address on this implementation
• Full 64-bit support carefully engineered into the 90nm design
– Limited differences between 32-bit and 64-bit operations
– Similar optimizations for 32-bit and 64-bit code

32-bit vs. 64-bit Comparison
                     32-bit                            64-bit
ALU Latency          1 cycle                           1 cycle
ALU Throughput       4 operations/cycle                4 operations/cycle
Memory Throughput    1 load + 1 store per cycle        1 load + 1 store per cycle
Page walks           2 levels: use PDE cache to        4 levels: use PDE cache to
                     reduce to 1 level in common       reduce to 1 level in common
                     case                              case
Enable strong 64-bit performance without compromising 32-bit performance
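The difference in page-walk depth comes from how a virtual address is divided into table indices. The sketch below shows the split for both modes, assuming 4 KB pages and, for the 32-bit case, the classic two-level layout with 10-bit indices; the slide states only the number of levels.

```python
def walk_indices_64(vaddr):
    """48-bit virtual address: four 9-bit table indices and a 12-bit offset."""
    offset = vaddr & 0xFFF
    indices = [(vaddr >> shift) & 0x1FF for shift in (39, 30, 21, 12)]
    return indices, offset


def walk_indices_32(vaddr):
    """32-bit virtual address: two 10-bit table indices and a 12-bit offset."""
    offset = vaddr & 0xFFF
    indices = [(vaddr >> shift) & 0x3FF for shift in (22, 12)]
    return indices, offset


addr64 = (1 << 39) | (2 << 30) | (3 << 21) | (4 << 12) | 5
addr32 = (1 << 22) | (2 << 12) | 3
print(walk_indices_64(addr64))  # ([1, 2, 3, 4], 5)
print(walk_indices_32(addr32))  # ([1, 2], 3)
```

Each index corresponds to one table lookup in a full walk, so a cache that supplies the result of the upper levels leaves a single lookup in the common case, as the table describes.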

Optimizing 64-bit code
• Rule #1: Follow 32-bit optimizations
• Rule #2: Compile with Pentium 4 specific optimizations enabled
• Few additional new rules:
– When data size is 32 bits, typically use 32-bit instructions
• Example: XOR EAX, EAX instead of XOR RAX, RAX
– But sign extend to full 64 bits instead of only 32 bits (even for 32-bit data size)
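The XOR example works because, in 64-bit mode, a write to a 32-bit register zeroes the upper 32 bits of the full register, and the 32-bit form needs no REX prefix byte. The sketch below models those register semantics and compares the two encodings; the helper names are illustrative, and the sign-extension function shows only the resulting value, not which instruction a compiler would pick.

```python
MASK32 = 0xFFFF_FFFF
MASK64 = 0xFFFF_FFFF_FFFF_FFFF


def xor32(reg, other):
    """32-bit XOR in 64-bit mode: the result is zero-extended to 64 bits."""
    return (reg ^ other) & MASK32


def sign_extend_32_to_64(value):
    """Sign-extend a 32-bit value to a full 64-bit register value."""
    value &= MASK32
    if value & 0x8000_0000:
        return value | (MASK64 ^ MASK32)
    return value


# Machine-code encodings of the two forms from the slide.
XOR_EAX_EAX = bytes([0x31, 0xC0])
XOR_RAX_RAX = bytes([0x48, 0x31, 0xC0])  # 0x48 is the REX.W prefix

rax = 0x1234_5678_9ABC_DEF0
print(hex(xor32(rax, rax)))                     # 0x0: full register cleared
print(len(XOR_EAX_EAX), len(XOR_RAX_RAX))       # 2 3
print(hex(sign_extend_32_to_64(0xFFFF_FFFE)))   # 0xfffffffffffffffe
```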

Performance Results
[Figure: bar chart, y-axis 0.00 to 1.50. Legend: HT Technology Disabled, HT Technology Enabled. Applications: Adobe* Photoshop* CS, Windows Media* Encoder 9.0, Maxon Cinema* 4D, MainConcept* 1.4, MAGIX* mp3 maker 2004 diamond. Bar values: 1.12, 1.19, 1.20, 1.25, 1.28.]
Source: Intel. Configuration: Intel® Pentium® 4 processor with HT Technology 3.40E GHz – Intel® D875PBZ Desktop Board (AA-301); All Platforms – 1GB DDR400 CL3-3-3, ATI* Radeon* 9800 Pro AGP graphics, ATI* Catalyst* 3.5 Driver Suite: display driver 6.14.10.6360, Intel® Application Accelerator RAID Edition 3.5 with RAID ready, Intel® Chipset Software Installation Utility 5.01.1015, Seagate* Barracuda* 7200 Serial ATA 160GB Hard Drive - ST3160023AS, Intel C & Fortran compilers 8.0, DirectX* 9.0b, Windows* XP Build 2600 SP1, Intel® PRO/1000 MT Desktop Adapter. Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance.

Performance Results
[Figure: bar chart, y-axis 0.00 to 1.50. Legend: Intel® Pentium® 4 Processor with HT Technology 3.40 GHz, Intel® Pentium® 4 Processor with HT Technology 3.40E GHz. Benchmarks: SPECint*_base2000, SPECfp*_base2000. Bar values: 1.07, 1.14.]
Source: Intel. Configuration: Intel® Pentium® 4 processor with HT Technology 3.40 GHz – Intel® D875PBZ Desktop Board (AA-204); Intel® Pentium® 4 processor with HT Technology 3.40E GHz – Intel® D875PBZ Desktop Board (AA-301); All Platforms – 1GB DDR400 CL3-3-3, ATI* Radeon* 9800 Pro AGP graphics, ATI* Catalyst* 3.5 Driver Suite: display driver 6.14.10.6360, Intel® Application Accelerator RAID Edition 3.5 with RAID ready, Intel® Chipset Software Installation Utility 5.01.1015, Seagate* Barracuda* 7200 Serial ATA 160GB Hard Drive - ST3160023AS, Intel C & Fortran compilers 8.0, DirectX* 9.0b, Windows* XP Build 2600 SP1, Intel® PRO/1000 MT Desktop Adapter. Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance.