Application Performance Analysis
of the Cortex-A9 MPCore
Bryan Lawrence
This project in ARM is in part funded by ICT-eMuCo, a European project supported under the Seventh Framework Programme (7FP) for research and technological development
Agenda
Motivation
Experimentation platforms
Performance exploration of different application classes
Performance evaluation of multiple concurrent applications
Summary and conclusion
Phone ++ Upcoming Use Cases
Mobile Internet Browsing
Video conferencing
Gaming on the Go
Multi-player over 3G / 4G Network
Mobile Phone Applications
Compute Intensive
Tablet Applications
Compute Intensive
Achieving Scalable Performance
Processor clock frequency is not the only metric of performance
Scalable, energy-efficient performance is required across devices, from phones and tablets to large enterprise computing
Hardware Platforms
Versatile Express: ARM-NEC Cortex™-A9 processor test-chip, ~400MHz
Cortex-A9 x 4
4x NEON™/FPU
32KB individual I&D L1 caches
512KB L2 cache
1GB RAM (32b DDR2)
Early Partner Silicon: Cortex-A9 x 2 @ 1GHz
Video Decode / Encode
Hardware encoder/decoders are common in consumer
Video/audio codec standards evolve rapidly
Many codecs are used too infrequently to justify dedicated h/w
Consumer applications involve other video processing, different from encode / decode (e.g. video editing)
Simultaneous encode / decode required for video conferencing
FFmpeg used for decode
x264 library used with FFmpeg for video encode
CIF & VGA resolutions: commonly used in video conferencing
Movie trailers used: order of computation higher than for video conf. streams
Compression factor of 100 - 200
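To give a feel for what a compression factor of 100 - 200 means at these resolutions, the following is a rough sketch of the arithmetic. The frame rate (30 fps) and the pixel format (8-bit YUV 4:2:0, i.e. 12 bits per pixel) are assumptions made for illustration only; the slides do not state them.

```python
# Standard frame sizes for the two resolutions named on the slide.
RESOLUTIONS = {"CIF": (352, 288), "VGA": (640, 480)}


def raw_bitrate_bps(width, height, fps=30, bits_per_pixel=12):
    """Uncompressed video bitrate in bits per second.

    fps=30 and bits_per_pixel=12 (8-bit YUV 4:2:0) are assumed values,
    not figures taken from the presentation.
    """
    return width * height * bits_per_pixel * fps


def compressed_bitrate_bps(raw_bps, factor):
    """Bitrate after applying a given compression factor."""
    return raw_bps / factor


if __name__ == "__main__":
    for name, (w, h) in RESOLUTIONS.items():
        raw = raw_bitrate_bps(w, h)
        low = compressed_bitrate_bps(raw, 200)
        high = compressed_bitrate_bps(raw, 100)
        print(f"{name}: raw {raw / 1e6:.1f} Mbit/s, "
              f"compressed {low / 1e3:.0f}-{high / 1e3:.0f} kbit/s")
```

Under these assumptions the compressed streams land in the range of a few hundred kbit/s to roughly 1 Mbit/s; different frame rates or pixel formats would shift the numbers proportionally.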
H.264 Decode / Encode
Results for single core operation
Normalized logarithmic scales used
Encode is more compute intensive than decode (at least ~2-3 times)
Writing out decoded streams to secondary storage media is limited by media bandwidth
H.264 Decode / Encode
Concurrent video decode + encode: important use case for video conferencing
Excellent scalability is observed up to all 4 cores
Encoding is at least 2-3 times more compute intensive than decode
Ideally, more resources should be dedicated to encode
On2/Google VP8
Libvpx library used for decoding VP8 (from WebM project)
Libvpx uses multi-threading and actively takes advantage of parallelizability available in the VP8 codec.
Comparative results obtained on Versatile Express and 1GHz dual core platforms
On2/Google VP8
Shows good scalability with the number of cores.
Scalability is relatively independent of the number of partitions in the video frame
Saturation is observed for no. of threads > no. of cores
Designers can query the platform for the no. of cores to determine the available parallelism
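As an illustration of querying the platform for its core count and avoiding the saturation seen when threads outnumber cores, here is a minimal sketch. It is not how libvpx sizes its own thread pool; `worker_count` and `decode_partition` are hypothetical names introduced for this example only.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def worker_count(requested=None):
    """Return a thread count capped at the number of available cores."""
    cores = os.cpu_count() or 1  # os.cpu_count() may return None
    if requested is None:
        return cores
    return max(1, min(requested, cores))


def decode_partition(index):
    # Hypothetical stand-in for real per-partition decode work.
    return index * index


if __name__ == "__main__":
    # Ask for 8 threads, but never use more than the platform has cores.
    n = worker_count(requested=8)
    with ThreadPoolExecutor(max_workers=n) as pool:
        results = list(pool.map(decode_partition, range(16)))
    print(f"{n} worker threads, {len(results)} partitions processed")
```

The sketch only shows how the pool is sized. In CPython, CPU-bound work of this kind would normally be done in native code or separate processes to run truly in parallel.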
Compilation – FFmpeg
Code compilation has inherent parallelism in terms of modules
Most build systems allow this parallelism to be exploited, e.g. make -j 4
Compilation of FFmpeg and Linux Kernel shown here
[Chart: 1GHz dual-core]
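The parallel build invocation mentioned for compilation can be driven from a script so that the job count follows the number of cores. This is a small sketch, not the procedure used for the measurements; `make_command`, `run_build` and the `source_dir` argument are hypothetical names for illustration.

```python
import os
import subprocess


def make_command(jobs=None):
    """Build a `make -j N` argument list; N defaults to the core count."""
    if jobs is None:
        jobs = os.cpu_count() or 1
    return ["make", "-j", str(jobs)]


def run_build(source_dir, jobs=None):
    # Runs make in source_dir (a hypothetical path to a configured tree)
    # and raises CalledProcessError if the build fails.
    subprocess.run(make_command(jobs), cwd=source_dir, check=True)


if __name__ == "__main__":
    print(" ".join(make_command(4)))  # prints: make -j 4
```

Varying `jobs` from 1 up to the core count and timing `run_build` is one way to reproduce a speed-up curve of the kind shown on these slides.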
Compilation – Linux Kernel
1GHz dual-core
Almost linear speed-up is observed with no. of cores for both cases
Effectively doubles (quadruples) the utilized memory bandwidth for 2 cores (4 cores)
Browsers
Browser benchmark using collection of web-pages similar to the mix found in common browsing
Speed-up of 1.54 times observed between single and dual core execution
The ‘webcore’ fraction of the pie grows for multicore execution
[Charts: Normalized Performance; Execution time decomposition]
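One way to read the 1.54x browser speed-up is as a parallel efficiency, or as the parallel fraction implied by Amdahl's law. The presentation does not apply Amdahl's law itself; using it here is an assumption, and it ignores effects such as cache behaviour and memory bandwidth.

```python
def parallel_efficiency(speedup, cores):
    """Speed-up achieved per core (1.0 would be perfectly linear scaling)."""
    return speedup / cores


def amdahl_parallel_fraction(speedup, cores):
    """Solve Amdahl's law S = 1 / ((1 - p) + p / n) for the fraction p.

    This is an idealized model assumed for illustration, not a figure
    reported in the presentation.
    """
    if cores < 2:
        raise ValueError("need at least 2 cores")
    return (1.0 - 1.0 / speedup) / (1.0 - 1.0 / cores)


if __name__ == "__main__":
    s, n = 1.54, 2  # browser benchmark: single core vs dual core
    print(f"efficiency: {parallel_efficiency(s, n):.2f}")
    print(f"implied parallel fraction: {amdahl_parallel_fraction(s, n):.2f}")
```

Under this simple model, a 1.54x speed-up on two cores corresponds to roughly 70% of the single-core execution time being parallelizable.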
Multiple Concurrent Applications
Multitasking is becoming mainstream in mobile devices today
Common combinations include:
Browser + audio playback (e.g. Internet radio)
Browser + background download
Independent applications can benefit immensely from
Browser + Pandora Internet Radio
Speed-up factor of 1.9x
Super-linear speed-up can sometimes be observed due to reduced cache pollution from conflicting applications
The speed-up can be traded for energy by slowing the cores down (depends on the fabrication process technology used)
[Charts: Normalized Performance; Execution time decomposition]
Browser + Internet File Download
Speed-up factor of 1.64x
Common use case involves downloading an app from an application store or marketplace while browsing the internet
Email synchronization in the background also forms
[Chart: Normalized Performance]
Cortex-A9 MP Benefits – Performance
Normalized performance       1 Core   2 Core
Browser (single app)         1        1.54
Cortex-A9 MP Benefits – Richer Experience
Normalized performance       1 Core   2 Core   Speed-up
Browser (single app)         1        1.54     1.54x
Browser + Pandora            0.78     1.50     1.9x
Browser + Download           0.73     1.20     1.64x
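The quoted speed-up factors follow directly from the normalized performance values for 1 and 2 cores. The short sketch below recomputes them; the dictionary simply restates the numbers from the slides, and the function name `speedup` is introduced here for illustration.

```python
# Normalized performance values as read from the slides: (1 core, 2 cores).
normalized = {
    "Browser (single app)": (1.00, 1.54),
    "Browser + Pandora": (0.78, 1.50),
    "Browser + Download": (0.73, 1.20),
}


def speedup(one_core, two_core):
    """Ratio of dual-core to single-core normalized performance."""
    return two_core / one_core


speedups = {name: speedup(*values) for name, values in normalized.items()}

if __name__ == "__main__":
    for name, s in speedups.items():
        print(f"{name}: {s:.2f}x")
```

The results, about 1.54x, 1.92x and 1.64x, agree with the factors of 1.54, 1.9 and 1.64x quoted in the presentation.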
Summary and Conclusion
This presentation demonstrates the scalability of the ARM Cortex-A9 MPCore™ processor across various classes of applications, on currently available software
Better power/performance can be achieved using an efficient low-power ARM multicore processor, as compared to a single processor at a much higher frequency.
Next-generation software will make more intensive use of threads, and scalability will improve further.
Thank You
Please visit www.arm.com for ARM-related technical details