ARM Cortex-A9 MPCore
Multicore Processor
Hierarchical Implementation with IC Compiler
DAC 2008
Philip Watson
Implementation Environment Program Manager, ARM Ltd
Background - Who Are We?
Processor Division, Cores Implementation, ARM-India. This team is actively involved in processor development benchmarking.
The team has been working alongside the development of the microarchitecture of the ARM® Cortex™-A9 processor since early development and test.
The outcome of this effort is to showcase
Power consumption
Performance
Area
The effort is focused on making the Cortex-A9 processor core a deployable embedded solution
Partnership Through the Design Chain
The CPU is at the heart of the system-on-chip
The RM ties all this together, piloting the route from RTL to Silicon
We partner with silicon foundries to provide diversity of SoC implementation and manufacturing choice
We work with major EDA companies to ensure our IP works seamlessly
SoCs require high performance fabric and quality physical IP
EDA tools provide the environment to exploit this IP
[Diagram labels: Processors, Reference Methodology, Fabric & Physical IP, EDA Tools, Mutual Customers]
Cortex-A9 MPCore™ Multicore Solutions
The relative performance and power range of an ARM processor enabled by its ARM Physical IP
[Figure: performance (MHz) of the Performance, Mainstream and Density Optimized platforms]
Performance Platform: 15% CPU performance boost
Density Optimized Platform: 15% lower power, higher density
Challenges with Cortex-A9 MPCore
Implementation run time with all EDA tools is a key challenge for design closure, particularly with scalable performance processor designs
Iteration time increases as the design size increases
Iteration time determines how quickly we can turn around floorplan changes, tailor optimizations, debug constraints and act on design feedback; this is key to converging results
[Chart: gate count and run time for Cortex-A9 MP 1x, 2x and 4x with Neon]
Challenges with Cortex-A9 MPCore
Implementation of 1 CPU vs 4 CPU Cortex-A9 with flat flow
Configuration | 1CPU, 1 Neon, 32K D$, 32K I$, 32 interrupts | 4CPU, 4 Neon, 32K D$, 32K I$, 32 interrupts
Process Technology | TSMC CLN65LP | TSMC CLN65LP
Standard Cell Library | 12Track – Nominal VT | 12Track – Nominal VT
Memory Library | Optimized fast cache instances | Optimized fast cache instances
The 4 CPU solution gives:
A significant increase in run time
Potentially some drop in performance (frequency)
…as compared to a 1 CPU implementation.
Hierarchical Implementation with IC Compiler
For faster TTR
Cortex-A9 cpu0: Placement (X Hrs), CTS (Y Hrs), Routing (Z Hrs)
Cortex-A9 cpu1: Placement (X Hrs), CTS (Y Hrs), Routing (Z Hrs)
Cortex-A9 cpu2: Placement (X Hrs), CTS (Y Hrs), Routing (Z Hrs)
Cortex-A9 cpu3: Placement (X Hrs), CTS (Y Hrs), Routing (Z Hrs)
Cortex-A9 MPCore, Cortex-A9 top only: Placement (A Hrs), CTS (B Hrs), Routing (C Hrs)
Total Run Time = X + Y + Z + C Hrs
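The total of X + Y + Z + C hours can be reproduced with a small scheduling model. This is a minimal sketch, not the flow itself. It assumes that the four CPU blocks run in parallel, that top-level placement and CTS overlap with the block runs, and that top-level routing waits for both. The slide does not state these dependencies, and the function names and hour values below are hypothetical placeholders rather than measured data.

```python
def hierarchical_run_time(block_stage_hours, top_stage_hours):
    """Wall-clock hours for a hierarchical run.

    block_stage_hours: list of (placement, cts, routing) tuples, one per block.
    top_stage_hours: (placement, cts, routing) for the top level, i.e. (A, B, C).

    Assumption (not stated in the slides): all blocks run in parallel,
    top-level placement and CTS run concurrently with the blocks, and
    top-level routing starts only when both have finished.
    """
    slowest_block = max(sum(stages) for stages in block_stage_hours)
    top_place, top_cts, top_route = top_stage_hours
    return max(slowest_block, top_place + top_cts) + top_route


def serial_run_time(block_stage_hours, top_stage_hours):
    """The same stages run one after another, for comparison."""
    return sum(sum(stages) for stages in block_stage_hours) + sum(top_stage_hours)


# Hypothetical placeholder hours, chosen only to exercise the model.
X, Y, Z = 4.0, 2.0, 3.0   # per-CPU placement, CTS, routing
A, B, C = 1.0, 1.0, 2.0   # top-level placement, CTS, routing

blocks = [(X, Y, Z)] * 4  # cpu0..cpu3 are identical
parallel_hours = hierarchical_run_time(blocks, (A, B, C))
serial_hours = serial_run_time(blocks, (A, B, C))
```

Under these assumptions the model returns X + Y + Z + C whenever A + B does not exceed X + Y + Z. If top-level placement and CTS took longer than a block run, they would set the critical path instead.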
Hierarchical Implementation with IC Compiler
Steps involved
Floorplanning
Create Physical Partition
Partition Aware Place
Power Network Synthesis
Power Network Analysis
In-Place Optimization
Clock Planning
Pin Assignment
Budgeting
SDC & ScanDef
Cortex-A9 MPCore Multicore Solutions
The relative performance and power range of an ARM processor enabled by its Artisan® physical IP
[Figure: performance (MHz) and power (mW) of the Performance, Mainstream and Density Optimized platforms, with the Cortex-A9 Hierarchical Flow (with IC Compiler) marked]
Performance Platform: 15% CPU performance boost
Density Optimized Platform: 15% lower power, higher density
Hierarchical Implementation with IC Compiler
Results
Implementation of 1 CPU Cortex-A9 flat vs 4 CPU Cortex-A9 hierarchical flow
Configuration | 4CPU, 4 Neon, 32K D$, 32K I$, 32 interrupts | 4CPU, 4 Neon, 32K D$, 32K I$, 32 interrupts
Process Technology | TSMC CLN65LP | TSMC CLN65LP
Standard Cell Library | 12Track – Nominal VT | 12Track – Nominal VT
Memory Library | Optimized fast cache instances | Optimized fast cache instances
Implementation flow | Flat | Hierarchical
The 4 CPU implemented with a hierarchical flow gives:
Comparable QoR in performance (frequency)
25% additional run time
…when compared to a 1CPU flat implementation
[Chart: gate count and hierarchical run time for A9 MP 1x, 2x and 4x with Neon]
Next Steps
Efficiently handling Multiply Instantiated Modules (MIM) for symmetric cores
Summary
Hierarchical flow delivers much faster iteration time with no loss of QoR
Simple and effective strategy to implement a multicore processor
Reduction in high memory cluster requirements
Lends itself very well to low power partitioning
Advanced low power management such as State Retention Power Gating
Leakage mitigation by power shutdown if the hardware is not being utilized
Easily deployable for the partner base (estimated by end of 2008)
In an ARM-Synopsys iRM (implementation Reference Methodology) with:
Floorplan
Tcl Scripts (Complete flow from RTL to GDSII)
Physical IP Libraries
ARM Documentation - Core Signoff Guide
…providing an out-of-box solution from ARM