How to realize high-performance compute with Multicore DSP

(2)

C667x Target Applications (Non-Telecom)

Emerging Others

Test and Automation

Mission Critical

Infrastructure Audio

HPC, Imaging and Medical

Video Infrastructure

Emerging Broadband

Innovations

(3)

TI Confidential – NDA Restrictions

RF and Communication Applications

Key Customer Careabouts

•Long Term Partnership

•Financial Stability

•Strong Roadmap and R&D

•Floating Point Performance

•Size, Weight, and Power (SWaP)

•I/O Bandwidth

•Longevity of supply (10+yrs)

Application

ISR (Intelligence/Surveillance/Reconnaissance)

o SIGINT/COMINT/Signal Generators

Military Communications.

o SDR(JTRS)-Manpack/LMR/Fixed

o Comm. Infra - VoIP/Video Gateways

Satellite/Avionics Communications

o Ground Receiver/Repeaters

o Weather Radar

FAA – Civil Aviation/Govt Comm.

Conventional PS – TETRA/APCO/E911

o Wireless Infrastructure

o Comm. Infra - VoIP/Video Gateways

Emerging Broadband (OFDM/LTE/WiMAX)

o Utilities/Transport/Smart Grid

Govt & Public Safety

Avionics

(4)

RF and Comm. Product Requirements

Needs Raw Performance in terms of MIPS/GHz/MMACS

Floating Point Capable ISA to achieve “precision” and high GFLOPS.

Large On Chip RAM
– Reduce accesses to slow external memory.

High Speed External Memory Interface

Large addressable memory

Efficient DMA architecture

Wireless specific accelerators and TCP/IP Offload

Support Multiple Waveforms

Common Platform for TDMA/CDMA/OFDMA

Multi-channel VoIP/Video capability

Support FEC and Modulation

TCP/IP Networking support

(5)


Reliability in Mission Critical Designs

Low Power Design

High BW Interface

RF Front End and Telecom ports

Connect Multiple DSPs on a board, e.g. in ATCA Card

High BW Backplane and Network Connectivity

Needs multiple high speed interfaces
– PCIe, Serial RapidIO
– OBSAI/CPRI Interface
– Gigabit Ethernet etc

Memory Error Correction & Checking (ECC)

Efficient Low Power DSPs

Support Extended Temp ranges from -40°C to 105°C and other temp ranges

Ease of Use

Imaging Product Requirements

Dev and Debug Tools

Multicore S/W Frameworks

Signal/Image Processing functions.

VoIP Library

Audio/Video Codecs

(6)

Introducing “Keystone Architecture” (C66x)

The Best Combination of Performance (GHz) and Power Consumption in the Industry

16 GFLOPS & 32 GMACS per Core @ 1GHz

Fixed and Floating-point Core @ 1.25 GHz

4x C64x+ MAC (32), 4x C67x Fl pt MAC (8)

16 FLOP/cy compared to 6 FLOP/cy

8 Core C6678 based on C66x core delivers 320 GMACS/160 GFLOPS @ 1.25GHz/Core (effectively a 10GHz DSP)

100% Code Compatible with all C64x (fixed) & C67x (floating) Devices

Similar Power Profiles as C64x Core

Supported by Code Composer Studio IDE
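The per-core and device-level peak figures on this slide are related by simple arithmetic. The sketch below is a minimal check of that relationship, assuming the quoted numbers are theoretical peaks (operations per cycle times clock rate times core count); the helper name `peak_rate_g` is illustrative and not part of any TI tool.

```python
# Peak-rate arithmetic for the C66x figures quoted on this slide.
# The per-cycle counts and clock rates come from the slide; peak_rate_g is a
# hypothetical helper used only for this illustration.

MACS_PER_CYCLE = 32    # fixed-point MACs per core per cycle
FLOPS_PER_CYCLE = 16   # floating-point operations per core per cycle


def peak_rate_g(ops_per_cycle, clock_ghz, cores=1):
    """Theoretical peak in giga-operations per second."""
    return ops_per_cycle * clock_ghz * cores


per_core_gmacs = peak_rate_g(MACS_PER_CYCLE, 1.0)
per_core_gflops = peak_rate_g(FLOPS_PER_CYCLE, 1.0)
c6678_gmacs = peak_rate_g(MACS_PER_CYCLE, 1.25, cores=8)
c6678_gflops = peak_rate_g(FLOPS_PER_CYCLE, 1.25, cores=8)
aggregate_clock_ghz = 8 * 1.25   # the "effectively a 10GHz DSP" figure

print(per_core_gmacs, per_core_gflops, c6678_gmacs, c6678_gflops, aggregate_clock_ghz)
```

Sustained rates in a real application depend on memory traffic and how well the kernels pipeline, so these values are upper bounds.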

[Figure: the C64x+ core (fixed point; lowest power, highest performance DSP core) and the C67x core (floating point; industry’s lowest power FP DSP core, high precision and wide dynamic range) combine into the next-generation C66x fixed- and floating-point DSP core, the basis of the new KeyStone multicore DSP architecture.]

(7)


Unmatched Performance

[Chart: BDTImark2000™ score for floating-point processors, axis 0 to 14000. Processors shown: TMS320C66xx, TMS320C67x, Renesas SH77xx (SH-4), Intel Pentium III, ADI TS202S/203S (TigerSHARC), ADI TS201S (TigerSHARC), ADI 213xx (SHARC), ADI 2126x (SHARC), ADI 2116x (SHARC).]

[Chart: BDTImark2000™ score for fixed-point processors, axis 0 to 25000. Processors shown: TMS320C66xx, TMS320C64x+, Freescale MSC815x (SC3850), Freescale MSC814x (SC3400), Freescale MSC81xx (SC140), ADI TS202S/203S (TigerSHARC), ADI TS201S (TigerSHARC), ADI BF5xx (Blackfin), NEC uPD77050.]

Algorithm                                               C67x @ 300MHz or C64x+ @ 1.2GHz   C66x @ 1.25GHz   Gain
Single Precision Floating Point FFT, 2048 pt, Radix 4   86.84 us                          14.00 us*        ~600%
Fixed Point FFT, 2048 pt, Radix 4                       8.23 us                           4.46 us*         ~200%
FIR Filter, 40 samples, 40 taps                         0.69 us                           0.34 us*         ~200%
Matrix Multiply 32 x 32                                 17.92 us                          6.16 us*         ~300%
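The Gain column appears to be the ratio of the two execution times, expressed as a percentage and rounded to the nearest hundred. That reading is an assumption, since the slide does not define the column; the short sketch below recomputes it from the timings in the table under that assumption.

```python
# Recompute the "Gain" column from the execution times in the table.
# Assumption: gain = baseline time / C66x time, shown as a percentage and
# rounded to the nearest 100%.

timings_us = {
    "SP floating-point FFT, 2048 pt, radix 4": (86.84, 14.00),
    "Fixed-point FFT, 2048 pt, radix 4": (8.23, 4.46),
    "FIR filter, 40 samples, 40 taps": (0.69, 0.34),
    "Matrix multiply 32 x 32": (17.92, 6.16),
}


def gain_percent(baseline_us, c66x_us):
    """Speedup of the C66x over the baseline, as a percentage."""
    return 100.0 * baseline_us / c66x_us


gains = {name: gain_percent(b, c) for name, (b, c) in timings_us.items()}
rounded = {name: int(round(g, -2)) for name, g in gains.items()}

for name in timings_us:
    print(f"{name}: {gains[name]:.0f}% (about {rounded[name]}%)")
```

Under this reading, the recomputed values round to the same figures the table quotes.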

(8)


The first network on chip infrastructure to unleash full multicore entitlement

[Block diagram: C66x and ARM processing cores, application accelerators, the Multicore Shared Memory Controller with shared memory, high-speed I/O, HyperLink 50 and system management (debug, clocking, power), connected by Multicore Navigator and the TeraNet 2 network on chip.]

TI Multicore KeyStone Architecture

• Highest Integration

– Cost & Power 

• Common Architecture

– Portable Software

• Scalable

–  Tailored Solutions

• Navigator

– Innovative Multi-core

• Floating Point

– Development Time 

• Tools & Debugging

– R&D Efficiency 

• Quality Software

(9)


Product Highlights: C6670 and C6678


C6678: Power Optimized Core

Next Generation C66x Core
- Up to 8 C66x Cores @ 1GHz - 1.25GHz
- Available Options: 1, 2, 4, and 8 Core Devices

Memory Architecture
- 4MB Local L2 (512KB per Core)
- 4MB Multicore Shared Memory

Power Optimized Core
- <10W at 1GHz, nominal temp

C6670: Performance Optimized Core

Next Generation C66x Core
- 4 C66x Cores @ 1GHz - 1.2GHz

Memory Architecture
- 4MB Local L2 (1MB per Core)
- 2MB Multicore Shared Memory

Communication Accelerators
- TCP3e (Turbo Encode) – Up to 550Mbps
- TCP3d (Turbo Decode) – Up to 600Mbps
- FFTC – 2048 FFT every 4.6µs
- VCP2 for voice channel decoding
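The memory figures quoted for the two devices can be cross-checked with a few lines of arithmetic. This is a minimal sketch that counts only local L2 and multicore shared memory, as the slide does; L1 is left out, and the dictionary layout and the helper name `on_chip_totals` are illustrative.

```python
# On-chip memory totals for the two devices, using only the slide's figures.
# L1 memory is not counted here because the slide does not quote it.

KB = 1024
MB = 1024 * KB

devices = {
    "C6678": {"cores": 8, "l2_per_core": 512 * KB, "shared": 4 * MB},
    "C6670": {"cores": 4, "l2_per_core": 1 * MB, "shared": 2 * MB},
}


def on_chip_totals(dev):
    """Return (total local L2, L2 plus shared memory) in bytes."""
    l2_total = dev["cores"] * dev["l2_per_core"]
    return l2_total, l2_total + dev["shared"]


totals_mb = {}
for name, dev in devices.items():
    l2_total, combined = on_chip_totals(dev)
    totals_mb[name] = (l2_total // MB, combined // MB)
    print(f"{name}: L2 {l2_total // MB} MB, L2 + shared {combined // MB} MB")
```

For the 8-core C6678 the L2 plus shared total comes to 8 MB; the 4-core C6670 comes to 6 MB.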

[Block diagram, C6678: 8 x C66x CorePac (C66x DSP with L1 and L2 each); memory subsystem with Multicore Shared Memory Controller (MSMC), 4MB shared memory and 64-bit DDR3; Multicore Navigator; TeraNet; HyperLink; peripherals & I/O: SRIO x4, PCIe x2, EMIF16, TSIP x2, I2C, SPI, UART; IP interfaces: GbE switch with 2 x SGMII; network coprocessors: Crypto, Packet Accelerator; system elements: EDMA, SysMon, power management, debug.]

[Block diagram, C6670: 4 x C66x CorePac (C66x DSP with L1 and L2 each); memory subsystem with MSMC, 2MB shared memory and 64-bit DDR3; Multicore Navigator; TeraNet; HyperLink; peripherals & I/O: SRIO x4, PCIe x2, AIF2 x6, I2C, SPI, UART, SGMII x2; communications coprocessors: 4x VCP2, 3x TCP3d, 2x RAC, 1x TAC, 3x FFTC, BCP; network coprocessors: Crypto, Packet Accelerator; system elements: EDMA, SysMon, power management, debug.]

(10)

Memory Architecture

• 0.5 MB of local memory per core
• 4 MB of shared memory
• Enhanced memory architecture through an enhanced Multicore Shared Memory Controller
• Bottleneck-free fast on- and off-chip memory access, including a DDR3-1333MHz (64-bit) interface
• L1/L2/L3 ECC
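For a sense of scale, the theoretical peak bandwidth of the external memory interface follows from the transfer rate and bus width. The sketch below assumes "DDR3-1333MHz" means 1333 mega-transfers per second on a 64-bit data bus; the function name is illustrative, and the result ignores refresh, bank conflicts and ECC overhead.

```python
# Theoretical peak bandwidth of a DDR3 interface.
# Assumption: the quoted rate is the data transfer rate in MT/s.


def ddr_peak_gbytes_per_s(transfers_mt_s, bus_bits):
    """Peak bandwidth in GB/s (10^9 bytes per second)."""
    bytes_per_transfer = bus_bits / 8
    return transfers_mt_s * 1e6 * bytes_per_transfer / 1e9


ddr3_1333 = ddr_peak_gbytes_per_s(1333, 64)
ddr3_1600 = ddr_peak_gbytes_per_s(1600, 64)

print(f"DDR3-1333, 64-bit: {ddr3_1333:.2f} GB/s")
print(f"DDR3-1600, 64-bit: {ddr3_1600:.2f} GB/s")
```

The second line covers the 1600MHz figure quoted elsewhere in the deck; both are upper bounds rather than sustained rates.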


Innovation & Integration via C6678 DSP Highlights

Peripherals and I/O Interfaces

High bandwidth peripherals that operate independently (NOT shared), allowing simultaneous data transfer to prevent bottlenecks, featuring:
• RapidIO v2.1 – 4 lanes @ 5Gbps with 1x, 2x and 4x support
• PCIe x2 – 2 lanes, running independently of RapidIO

Improved Debug

S/W Dev and Debug Support Leveraged by CCS

C66x Core

Next generation Fixed / Floating-Point DSP core with clock speeds ranging from 1GHz to 1.25GHz, in options of up to 8 cores

Network Coprocessor and Accelerators

A cost effective implementation to off-load the TCP/IP and secure networking functions from the DSP

Multicore Navigator

Data transfer engine architected to move data between system elements without CPU overhead, maximizing system efficiency

TeraNet

Switch fabric with 2 terabits per second of bandwidth, allowing maximum data transfer between system components to realize full system entitlement

HyperLink

Ultra high-speed (up to 50 Gbaud), low latency serial interface that connects to other DSPs and FPGAs in the system

(11)


Competitive Analysis

Value Prop against FPGA

•C66x Performance
– 320 GMACS/160 GFLOPS
– Baseband on a chip. Handles multiple waveforms supporting OFDM, CDMA, TDM
– L1/L2/L3 Processing capability
– Wireless Accelerators (VCP/TCP/FFT)
•Software Programmability
– Time To Market
•Smaller Package (more DSP/Board)
•Lower Power
– smaller battery, simpler cooling
•Low Cost - MIPS/$

Value Prop against other DSPs

•C66x Fixed & Floating Point
– Industry’s Fastest DSP at 10GHz
•On-Chip RAM up to 8MB
•DDR3
– 1600MHz, 64-bit, 8GB Address space
•Multiple Independent High Speed IO
– 4x sRIO v2.1, 2x PCIe Gen II, 2x SGMII, 2x TSIP
•High BW FPGA connectivity
– HyperLink @ 50Gbps
•1/2/4/8 Core Option (Pin Compatible)
•L1/L2/L3 Memory ECC – System Reliability
•Low Power per GFLOPS and GMACS
•Extended Temp support -40°C to 105°C
•CCS Tools + S/W Collateral
•3rd Party Network

(12)

TMDXEVM6678L EVM

Single-wide AMC form factor

Code Composer Studio™ IDE

*Design *Code and Build *Debug *Analyze *Tune

CCSv5 Allows designers of all experience levels to move quickly through application development (www.ti.com/ccstudio)

•Time Limited FREE Evaluation Versions available for download.

Includes C667x Simulator

EVM Kit includes

•BIOS 6.x
•BIOS-MCSDK / LINUX-MCSDK 2.0 (NDK, PDK, LIB etc)
•Sample Program and Out of box demo (OOB), e.g. I/O Benchmark, Imaging Processing Pipeline and High Performance DSP Utility Application (HUA)
•User Guide, Starter guide, Tech Ref Guide, App Notes etc

H/W Development Tools

• TMDXEVM6678L – EVM with XDS100 emulation - $399
• TMDXEVM6678LE – EVM with XDS560V2 emulation - $599
• TMDXEVM6678LXE – EVM with XDS560V2 emulation – Encryption Enabled - $599
• TMDSEMU560v2STM-UE - XDS560v2 System Trace Emulator with 128Mb System Trace buffer and Ethernet / USB support
• Optional PCIe adapter card to connect the C6678 EVM to a standard PCI header of a desktop.

(13)

TI’s Multicore Hardware Ecosystem

Custom

Chassis / System

Others

PCIExpress (with Gen 2)

Advanced Mezzanine (AMC)

ATCA

Standardized Boards

(14)

TI’s Multicore Software Ecosystem

Layer 1 UMTS

Layer 1 LTE

Layer 2+

Customer Application

TI Layer 1 Libraries

TI BIOS, Linux, OSE(ck)

Multicore Entitlement

TI’s Device Entitlement Libraries

IP Network Stack

(15)


DSP

Multicore Tools and Software (MC-SDK)

• Tools

– Codegen with OpenMP support
– Emulator/Debugger
– Simulator
– Profiler / DVT
– 3rd party tools

• Software

– BIOS/Linux SDK

• Multicore Demonstration

• 6.x DSP BIOS

– Platform Abstraction

– Basic Networking

– Inter core communication

• Application Specific Libraries

– Audio/Video CODECS

– VoIP Components

– WiMAX Toolkit, LTE Toolkit

– DSPLib

• Others

[Diagram: a host computer runs the Eclipse-based Code Composer Studio™ (editor/IDE, compiler and linker (Codegen), profiler, debugger, remote debug, SoC Analyzer) with third-party plug-ins (Polycore, ENEA Optima, 3L) and connects to the target board through XDS560 V2 / XDS560 Trace emulation. The target software stack, labelled "Full Silicon Entitlement" and "Multicore Entitlement", comprises an operating system with boot loader (BIOS or Linux), the Platform Development Kit, inter-core communication, NDK, speech, audio and video codecs, DSPLIB, IMGLIB, multicore demo applications for BIOS and Linux, and the customer application.]

(16)

Digital Signal Processing
• FFT
• Adaptive Filtering
• Filtering and convolution
• Others

• Available free from TI

KeyStone Multicore Software – Libraries & Codecs

MATLAB
Image processing
Math operations
Vision Analytics

Image Processing
• Edge Detection
• Boundary
• Morphology
• Others

• Available free from TI

Voice and Fax

Line Echo Cancellation

Voice Activity Detection

Others…

Available free from TI

Security/Cryptography

• AES, SHA1, 3DES

Voice: G.711, G.722, G.723, G.729, CDMA, AMR (NB/WB), EVRC-B, Others
Audio: MPEG1 Layer2, AAC LC/HE, AC3 2.0/5.1, Sample Rate Conversion
Video: H.263, H.264, MPEG2, MPEG4, VC1/WMV9 Decode, Others
Fax: T.38, Fax Modem

Libraries

Codecs

Vision Lib (object only)

• 50+ royalty-free kernels:
• Background modeling & subtraction
• Object feature extraction
• Tracking, recognition
• Low-level pixel processing

(17)

High-Performance and Multicore Processor

High Value

Easy to Use

Quick to Market

Low-Cost EVM

High-Performance at the Right Power & Price

Open & Affordable Tools

User Community

Drivers & Example Code

Product Collateral

Training

Enabler Software

Frameworks & Abstraction Libraries

Generic Application Libraries

Benchmarks & Functional Understanding

Quick-Start Hardware

(18)

Getting Started – More Information/Links

• Product Folders:

C66X Informational Wiki Page

All C6000 Multicore DSPs

TMS320C6670

TMS320C6678

• EVMs and Software Tools:

TMS320C6678 EVM

TMS320C6670 EVM

AMC to PCIe Adapter Card

Multicore Software Development Kit for BIOS & Linux

MCSDK Wiki

CCS v5 Wiki

C66x Linux Wiki

DSP Signal Processing Library (DSPLIB)

Image and Video Processing Library (IMGLIB)

– LTE/WiMAX Toolkit – Discuss with BDM

• Technical Support

TI E2E Community (Online Support)

(19)


Online Video Training

(20)

Mission Critical DSP Market

“What Customers Like about TI”

Undisputed #1 DSP and SoC supplier

Strong Growth for 8 years in a row, even in 2009

Higher R&D spending than DSP revenue of most competitors

KeyStone SoC Architecture secures future success

Rich Product Portfolio & Strong Roadmap

2 Families with multiple devices and growing

Nyquist(6670), Shannon(6678/4/2) 40nm -> 28nm

Tools/Software & Compilers 3rd Party Eco-System

Multiple Design Wins Pre-Announcement

Secure Supply – No DSP product discontinuation (end of life)

History of delivering on promises (Power, GHz, ..)

Field Experience - Completeness of system analysis, Architecture, Internal Switch, ….

Customer Support

Business Model - Long Term relationships with key customers

– Actively seek and incorporate customer feedback in roadmap devices.

[Chart: revenue, 2002 to 2009. Labels: TI SoC Architecture, Layer 1 PHY, Radio, IP Network, Software; Macro, Pico, Femto.]

(21)


Backup Slides

Product Details

(22)

C6678 (Shannon) “Lightning” Half-Length PCIe Card Feature Set

• TI TMS320C6678 (8-core) x 4
― C66x Core Frequency: 1.25GHz
― DDR3 Memory
― Data Frequency: 1600MHz
― Data Bus Width: 64-bit
― Serial RapidIO Gen-2 Interface
― PCIe Gen-2 Interface
― 10/100/1000Mbps Ethernet w/ SGMII
― Hyperlink50 Interface
• 1024 MB DDR3-1333 on board
• PLX PEX8624 PCIe Gen-2 Switch
• Serial RapidIO daisy-chain
• Ethernet daisy-chain
• Each DSP device is linked to PCIe switch by x2 lanes
• Dual DSPs linked by Hyperlink50
• Power: Max 54Watts
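Combining the card's device count with the per-device peak figures quoted earlier in the deck (320 GMACS / 160 GFLOPS for an 8-core C6678 at 1.25GHz) gives a rough card-level estimate. This is a back-of-envelope sketch, assuming all four DSPs run at their theoretical peak and the card draws its stated maximum power; the variable names are illustrative.

```python
# Card-level peak estimate for four C6678 devices on one PCIe card.
# Per-device peaks and the 54 W maximum come from the slides; treating them
# as simultaneous is an assumption made only for this estimate.

DEVICES_ON_CARD = 4
GMACS_PER_DEVICE = 320.0    # 8 cores @ 1.25 GHz
GFLOPS_PER_DEVICE = 160.0   # 8 cores @ 1.25 GHz
MAX_CARD_POWER_W = 54.0

card_gmacs = DEVICES_ON_CARD * GMACS_PER_DEVICE
card_gflops = DEVICES_ON_CARD * GFLOPS_PER_DEVICE
gflops_per_watt = card_gflops / MAX_CARD_POWER_W

print(f"Card peak: {card_gmacs:.0f} GMACS, {card_gflops:.0f} GFLOPS")
print(f"Peak efficiency at max power: {gflops_per_watt:.2f} GFLOPS/W")
```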

(23)


What is HyperLink?

“high-speed, low-latency, and low-pin-count communication interface”


Low pin count (24 pins)

Point to Point Connection

Interconnect: DSP-to-DSP, DSP-to-FPGA

SerDes for data transfer
- x1, x4 modes for Tx and Rx
- 12.5 GBaud/lane
- Effectively 8b9b encoding

LVCMOS sideband signals for flow control & power mgmt
- errors/events/timeouts

* Simple packet-based transfer protocol for memory-mapped access

* Read/Write to DSP/FPGA local memory
- discrete memory access of any byte aligned width up to 64 bits
- burst transfer modes

Write (Maximum Burst Size 256Bytes)

Write Request --->

Data Packet --->

Read (Maximum Burst Size 256Bytes)

Read Request --->

Read Response -

Interrupt Request <-->

Up to 64 Memory mapped Regions
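The lane count, baud rate and encoding listed on this slide determine the usable bit rate of a HyperLink connection. The following is a minimal sketch, assuming x4 mode and that the "effectively 8b9b" encoding carries 8 payload bits in every 9 line bits; packet headers and flow-control overhead are ignored, and the function name is illustrative.

```python
# Line rate versus payload rate for HyperLink in x4 mode.
# Lane count, per-lane baud rate and the 8b9b ratio come from the slide.

LANES = 4
GBAUD_PER_LANE = 12.5
PAYLOAD_BITS, LINE_BITS = 8, 9   # "effectively 8b9b encoding"


def hyperlink_rates(lanes, gbaud_per_lane):
    """Return (raw line rate, payload rate) in Gbit/s for one direction."""
    raw = lanes * gbaud_per_lane
    payload = raw * PAYLOAD_BITS / LINE_BITS
    return raw, payload


raw_gbps, payload_gbps = hyperlink_rates(LANES, GBAUD_PER_LANE)
print(f"Raw: {raw_gbps:.1f} Gbaud, payload: {payload_gbps:.1f} Gbit/s")
```

The raw figure matches the 50 Gbaud quoted for HyperLink elsewhere in the deck; the payload rate is lower because of the encoding.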

(24)

Universal Parallel Port (uPP)

What is it?

– Parallel bus, two independent channels (separate data buses)
– I/O speeds up to 75 MHz with 8-16 bit data width per channel
– 1 or 2 channel parallel interface operating in RX, TX or FD mode
– Supports Double data rate mode of operation (Bandwidth does not change/increase)

Application

– Each channel can interface cleanly with high-speed ADCs and/or DACs with up to 16-bit data width (per channel).
– Useful as low cost interface with FPGAs. Can run up to 120 MByte/s per channel in single channel or bi-directional mode (240 MByte/s for both channels in unidirectional mode)
– Can also be used to interface two C6655/57 devices or to connect C6655/57 with C674x or OMAP-L13x family of devices.

Other benefits

– Internal DMA – leaves CPU EDMA free
– Simple protocol with few control pins (configurable: 2-4 per channel)
– Multiple data packing formats for 9-15 bit data widths
– Interleave mode (single channel only)
– Simple interface: IO Queued by software

Throughput Estimates:

(25)
