
Application Performance Analysis of the Cortex-A9 MPCore

(1)

Bryan Lawrence

This project in ARM is in part funded by ICT-eMuCo, a European project supported under the Seventh Framework Programme (7FP) for research and technological development

(2)

Agenda

Motivation

Experimentation platforms

Performance exploration of different application classes

Performance evaluation of multiple concurrent applications

Summary and conclusion

(3)

Phone ++ Upcoming Use Cases

Mobile Internet Browsing

Video conferencing

Gaming on the Go

Multi-player over 3G / 4G Network

(4)

Mobile Phone Applications

Compute Intensive

(5)

Tablet Applications

Compute Intensive

(6)

Achieving Scalable Performance

Clock frequency of processor not the only metric of performance

Scalable, energy-efficient performance is required across the range, from mobile devices (phones, tablets) to large enterprise computing

(7)

Hardware Platforms

Versatile Express

- ARM-NEC Cortex™-A9 processor test-chip, ~400MHz

- Cortex-A9 x 4

- 4x NEON™/FPU

- 32KB I&D individual L1 caches

- 512K L2 cache

- 1GB RAM (32b DDR2)

Early Partner Silicon

- Cortex-A9 x 2 @ 1GHz

(8)

Video Decode / Encode

Hardware encoder/decoders are common in consumer

Video/audio codecs standards evolve rapidly

Many codecs are used too infrequently to justify dedicated hardware

Consumer applications involve other video processing

- Different from encode / decode (e.g. video editing)

Simultaneous encode / decode required for video conferencing

(9)

FFmpeg used for decode

x264 library used with FFmpeg for video encode

CIF & VGA resolutions

- Commonly used in video conf.

Movie trailers used

- More compute intensive than typical video conf. streams

Compression factor of 100 - 200

(10)

H.264 Decode / Encode

Results for single core operation

- Normalized logarithmic scales used

- Encode is more compute intensive than decode (at least ~2-3 times)

- Writing out decoded streams to secondary storage media limited by media bandwidth

(11)

H.264 Decode / Encode

Concurrent video decode + encode

- Important use case for video conferencing

- Excellent scalability is observed across all 4 cores

- Encoding is at least 2-3 times more compute intensive than decode

- Ideally more resources should be dedicated to encode
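
The suggestion to dedicate more resources to encode can be illustrated with a small sketch. It splits a core budget between encode and decode in proportion to their relative compute cost. The function name `split_cores` and the weights are hypothetical and not taken from the slides; the 3:1 weight simply reflects the "2-3 times" figure quoted above.

```python
def split_cores(total_cores, encode_weight, decode_weight=1.0):
    """Split cores between encode and decode in proportion to compute cost.

    Both tasks always receive at least one core, so at least two cores
    are required. The weights are illustrative assumptions.
    """
    if total_cores < 2:
        raise ValueError("need at least two cores to run both tasks")
    if encode_weight <= 0 or decode_weight <= 0:
        raise ValueError("weights must be positive")
    share = encode_weight / (encode_weight + decode_weight)
    encode_cores = round(total_cores * share)
    # Clamp so that neither task is starved of a core.
    encode_cores = max(1, min(total_cores - 1, encode_cores))
    return encode_cores, total_cores - encode_cores


if __name__ == "__main__":
    # Encode assumed to be 3x as expensive as decode, on a quad-core part.
    print(split_cores(4, encode_weight=3.0))
```

On a real system the resulting counts would still have to be applied through thread counts or CPU affinity, which this sketch does not attempt.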

(12)

On2/Google VP8

libvpx library used for decoding VP8 (from WebM project)

libvpx is multi-threaded and actively exploits the parallelism available in the VP8 codec.

Comparative results obtained on Versatile Express and 1GHz dual core platforms

(13)

On2/Google VP8

Shows good scalability with the number of cores.

Scalability is relatively independent of the number of partitions in the video frame

Saturation is observed for no. of threads > no. of cores

Designers can query the platform for the no. of cores to determine the available parallelism
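
A minimal sketch of that query, assuming a Python host program: `os.cpu_count()` reports the number of logical CPUs, and the helper caps the worker-thread count there, since the slides report saturation once threads outnumber cores. The helper name `pick_thread_count` is hypothetical.

```python
import os


def pick_thread_count(requested, cores=None):
    """Cap a requested worker-thread count at the number of available cores."""
    if cores is None:
        # os.cpu_count() can return None when the count is undetermined.
        cores = os.cpu_count() or 1
    return max(1, min(requested, cores))


if __name__ == "__main__":
    # Ask for 8 decoder threads; the result never exceeds the core count.
    print(pick_thread_count(8))
```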

(14)

Compilation - ffmpeg

Code compilation has inherent parallelism in terms of modules

Most build systems allow this parallelism to be exploited

- E.g. make -j 4

Compilation of FFmpeg and Linux Kernel shown here

1GHz dual-core
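
As an illustration of the `make -j` usage above, the job count can be derived from the platform instead of being hard-coded. This is a sketch only: the helper name `make_command` is hypothetical, and matching jobs to cores is one common choice, not something the slides prescribe.

```python
import os


def make_command(jobs=None):
    """Build a `make` argument list with one job per available core."""
    if jobs is None:
        # Fall back to a single job if the core count is undetermined.
        jobs = os.cpu_count() or 1
    return ["make", "-j", str(jobs)]


if __name__ == "__main__":
    # Print the command; pass it to subprocess.run() inside a source tree
    # to actually start the build.
    print(" ".join(make_command()))
```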

(15)

Compilation – Linux Kernel

1GHz dual-core

Almost linear speed-up is observed with no. of cores for both cases

Effectively doubles (quadruples) the utilized memory bandwidth for 2 cores (4 cores)

(16)

Browsers

Browser benchmark using collection of web-pages similar to the mix found in common browsing

Speed-up of 1.54 times observed between single and dual core execution

The ‘webcore’ fraction of the pie grows for multicore execution

Charts: Normalized Performance; Execution time decomposition
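
The 1.54x browser figure can be put in context with Amdahl's law, S = 1 / ((1 - p) + p / n). The sketch below inverts the law to estimate the parallel fraction p implied by a measured speed-up. Applying Amdahl's law here is an assumption of this example, not an analysis from the slides, and the function names are hypothetical.

```python
def parallel_efficiency(speedup, cores):
    """Fraction of ideal linear scaling actually achieved."""
    return speedup / cores


def amdahl_parallel_fraction(speedup, cores):
    """Invert Amdahl's law to find the parallel fraction p."""
    if cores < 2 or speedup <= 0:
        raise ValueError("need cores >= 2 and a positive speed-up")
    return (1.0 - 1.0 / speedup) / (1.0 - 1.0 / cores)


def amdahl_speedup(parallel_fraction, cores):
    """Speed-up predicted by Amdahl's law for a given parallel fraction."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / cores)


if __name__ == "__main__":
    p = amdahl_parallel_fraction(1.54, 2)
    print(f"implied parallel fraction: {p:.2f}")
    print(f"parallel efficiency on 2 cores: {parallel_efficiency(1.54, 2):.2f}")
```

Under this model roughly 70% of the browser workload behaves as parallel work. Any extrapolation to more cores would be a model estimate, not a measurement.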

(17)

Multiple Concurrent Applications

Multitasking is becoming mainstream in mobile devices today

Common combinations include

- Browser + Audio playback

  - E.g. Internet Radio

- Browser + background download

Independent applications can benefit immensely from

(18)

Browser + Pandora Internet Radio

Speed up factor of 1.9

Super-linear speed-up is sometimes observed due to reduced cache pollution from conflicting applications

The speed-up can be traded for energy by slowing the cores down (depends on the fabrication process technology used)

Charts: Normalized Performance; Execution time decomposition
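
The energy trade-off can be sketched with the textbook CMOS dynamic-power relation, P proportional to C·V²·f. The assumptions are loud ones: throughput is taken to scale linearly with clock frequency, leakage power is ignored, and the 0.8 voltage ratio is a hypothetical value chosen for illustration. None of these numbers come from the slides apart from the 1.9x speed-up.

```python
def relative_dynamic_power(cores, speedup, voltage_ratio=1.0):
    """Dynamic power of slowed-down cores relative to one full-speed core.

    Each core runs at 1/speedup of the original frequency, so overall
    throughput matches the single-core baseline. Dynamic power per core
    is modelled as C * V^2 * f; leakage is ignored.
    """
    if cores < 1 or speedup <= 0 or voltage_ratio <= 0:
        raise ValueError("cores, speedup and voltage_ratio must be positive")
    frequency_ratio = 1.0 / speedup
    return cores * frequency_ratio * voltage_ratio ** 2


if __name__ == "__main__":
    # Two cores, 1.9x speed-up, supply voltage unchanged.
    print(f"{relative_dynamic_power(2, 1.9):.2f}")
    # Same, with a hypothetical 20% lower supply voltage.
    print(f"{relative_dynamic_power(2, 1.9, voltage_ratio=0.8):.2f}")
```

In this simple model the saving comes from the voltage reduction that a lower clock permits, which is one way to read the slide's remark that the benefit depends on the process technology.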

(19)

Browser + Internet File Download

Speed up factor of 1.64x

Common use case involves downloading an App from an application store or market-place while browsing the internet

Email synchronization in the background also forms

Chart: Normalized Performance

(20)

Cortex-A9 MP Benefits – Performance

Workload                1 Core   2 Core
Browser (single app)    1        1.54

(21)

Cortex-A9 MP Benefits – Richer Experience

Workload                1 Core   2 Core
Browser (single app)    1        1.54
Browser + Pandora       0.78     1.50
Browser + Download      0.73     1.20

(22)

Cortex-A9 MP Benefits – Richer Experience

Workload                1 Core   2 Core   Speed-up
Browser (single app)    1        1.54
Browser + Pandora       0.78     1.50     1.9x
Browser + Download      0.73     1.20     1.64x
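
The quoted speed-up factors follow directly from the normalized values on this slide: each is the 2-core figure divided by the 1-core figure for the same workload. A short sketch of that arithmetic, using only the numbers shown above (the dictionary layout and function name are illustrative):

```python
# Normalized performance per workload: (1 core, 2 cores), from the slide.
RESULTS = {
    "Browser (single app)": (1.00, 1.54),
    "Browser + Pandora": (0.78, 1.50),
    "Browser + Download": (0.73, 1.20),
}


def speedups(table):
    """Return the dual-core over single-core ratio for each workload."""
    return {name: two / one for name, (one, two) in table.items()}


if __name__ == "__main__":
    for name, ratio in speedups(RESULTS).items():
        print(f"{name}: {ratio:.2f}x")
```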

(23)

Summary and Conclusion

This presentation demonstrates the scalability of the ARM Cortex-A9 MPCore™ processor across various classes of applications, on currently available software

Better power/performance can be achieved using an efficient low-power ARM multicore processor than with a single processor at a much higher frequency.

Next-generation software will make more intensive use of threads, and scalability will improve further.

(24)

Thank You

Please visit www.arm.com for ARM related technical details
