ARM Cortex-A9 MPCore Multicore Processor Hierarchical Implementation with IC Compiler

(1)

ARM Cortex-A9 MPCore

Multicore Processor

Hierarchical Implementation with IC Compiler

DAC 2008

Philip Watson

Implementation Environment Program Manager, ARM Ltd

(2)

Background - Who Are We?

Processor Division, Cores Implementation, ARM-India. This team is actively involved in processor development benchmarking.

The team has worked alongside the development of the ARM® Cortex™-A9 processor microarchitecture since early development and test.

The outcome of this effort is to showcase

Power consumption

Performance

Area

The effort is focused on making the Cortex-A9 processor core a deployable embedded solution

(3)

Partnership Through the Design Chain

The CPU is at the heart of the system-on-chip

The RM ties all this together, piloting the route from RTL to silicon

We work with major EDA companies to ensure our IP works seamlessly

We partner with silicon foundries to provide diversity of SoC implementation and manufacturing choice

SoCs require high-performance fabric and quality physical IP

EDA tools provide the environment to exploit this IP

[Diagram labels: Processors, Fabric & Physical IP, EDA Tools, Reference Methodology, Mutual Customers]

(4)

Cortex-A9 MPCore™ Multicore Solutions

The relative performance and power range of an ARM processor enabled by its ARM Physical IP

[Figure: performance (MHz) for three physical IP platforms: Performance Platform, Mainstream Platform, and Density Optimized Platform. Annotations: "15% CPU performance boost!" and "15% lower power, higher density"]

(5)

Challenges with Cortex-A9 MPCore

Implementation run time with all EDA tools is a key challenge for design closure, particularly with scalable performance processor designs

Iteration time increases as the design size increases

Iteration time affects our ability to turn around floorplan changes, tailor optimizations, debug constraints, and act on design feedback

– this is key to converging results

[Figure: chart of Gate Count and Run time for A9 MP 1x with Neon, A9 MP 2x with Neon, and A9 MP 4x with Neon; vertical axis from 0.0 to 6.0]

(6)

Challenges with Cortex-A9 MPCore

Implementation of 1 CPU vs 4 CPU Cortex-A9 with flat flow

Configuration (1 CPU): 1 CPU, 1 Neon, 32K D$, 32K I$, 32 interrupts

Configuration (4 CPU): 4 CPU, 4 Neon, 32K D$, 32K I$, 32 interrupts

Process Technology (both): TSMC CLN65LP

Standard Cell Library (both): 12-track, nominal VT

Memory Library (both): Optimized fast cache instances

The 4 CPU solution gives:

A significant increase in run time

Potentially some drop in performance (frequency)

…as compared to a 1 CPU implementation.

(7)

Hierarchical Implementation with IC Compiler

For faster TTR

Cortex-A9 cpu0: Placement (X Hrs), CTS (Y Hrs), Routing (Z Hrs)

Cortex-A9 cpu1: Placement (X Hrs), CTS (Y Hrs), Routing (Z Hrs)

Cortex-A9 cpu2: Placement (X Hrs), CTS (Y Hrs), Routing (Z Hrs)

Cortex-A9 cpu3: Placement (X Hrs), CTS (Y Hrs), Routing (Z Hrs)

Cortex-A9 MPCore top only: Placement (A Hrs), CTS (B Hrs), Routing (C Hrs)

Total Run Time = X + Y + Z + C Hrs
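The run-time relation on this slide can be sketched in Python. This is a minimal illustration, not part of the original flow: the function names and all hour values are hypothetical. The first function is the slide's formula as written. The second is an assumed generalisation that the slide does not state: it supposes the four CPU blocks run in parallel, that top-level placement and CTS overlap with the block runs, and that only top-level routing waits for the blocks to finish.

```python
def hierarchical_total_hours(x, y, z, c):
    """Slide formula: Total Run Time = X + Y + Z + C (hours)."""
    return x + y + z + c


def modeled_total_hours(block_stages, top_pre_route, top_route):
    """Assumed model (not from the slide).

    block_stages:  one (placement, cts, routing) tuple of hours per CPU block;
                   blocks are assumed to run in parallel.
    top_pre_route: (placement, cts) hours for the top level, assumed to
                   overlap with the block runs.
    top_route:     top-level routing hours, assumed to start only after
                   both the blocks and the top-level pre-route work finish.
    """
    slowest_block = max(sum(stages) for stages in block_stages)
    return max(slowest_block, sum(top_pre_route)) + top_route


if __name__ == "__main__":
    # Hypothetical hours, chosen only to exercise the formulas.
    X, Y, Z = 4, 2, 3      # per-CPU placement, CTS, routing
    A, B, C = 3, 2, 2      # top-level placement, CTS, routing
    cpus = [(X, Y, Z)] * 4

    print(hierarchical_total_hours(X, Y, Z, C))
    print(modeled_total_hours(cpus, (A, B), C))
```

Under this assumed model the two functions agree whenever A + B does not exceed X + Y + Z, which is consistent with A and B being absent from the slide's total.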

(8)

Hierarchical Implementation with IC Compiler

Steps involved

Floorplanning

Create Physical Partition

Partition Aware Place

Power Network Synthesis

Power Network Analysis

In-Place Optimization

Clock Planning

Pin Assignment

Budgeting

[Separate box in the diagram: SDC & ScanDef]

(9)

Cortex-A9 MPCore Multicore Solutions

The relative performance and power range of an ARM processor enabled by its Artisan® physical IP

[Figure: Cortex-A9 Hierarchical Flow (with IC Compiler). Performance (MHz) against power (mW) for three platforms: Performance Platform, Mainstream Platform, and Density Optimized Platform. Annotations: "15% CPU performance boost!" and "15% lower power, higher density"]

(10)

Hierarchical Implementation with IC Compiler

Results

Implementation of 1 CPU Cortex-A9 flat vs 4 CPU Cortex-A9 hierarchical flow

Configuration (both columns): 4 CPU, 4 Neon, 32K D$, 32K I$, 32 interrupts

Process Technology (both): TSMC CLN65LP

Standard Cell Library (both): 12-track, nominal VT

Memory Library (both): Optimized fast cache instances

Implementation flow: Flat (first column), Hierarchical (second column)

[Figure: chart of Gate Count and Run time (hierarchical) for A9 MP 1x with Neon, A9 MP 2x with Neon, and A9 MP 4x with Neon; vertical axis from 0.0 to 4.5]

The 4 CPU design implemented with a hierarchical flow gives:

Comparable QoR in performance (frequency)

25% additional run time

…when compared to a 1 CPU flat implementation

(11)

Next Steps

Handling Multiple Instantiated Modules (MIM) efficiently for symmetric cores

(12)

Summary

Hierarchical flow delivers much faster iteration time with no loss of QoR

Simple and effective strategy to implement a multicore processor

Reduced requirement for high-memory cluster machines

Lends itself very well to low-power partitioning

Advanced low-power management such as State Retention Power Gating

Leakage mitigation by power shutdown if the hardware is not being utilized

Easily deployable for the partner base (estimated by end of 2008)

In an ARM-Synopsys iRM (implementation Reference Methodology) with:

Floorplan

Tcl Scripts (Complete flow from RTL to GDSII)

Physical IP Libraries

ARM Documentation - Core Signoff Guide

…providing an out-of-box solution from ARM