• No results found

Texas Instruments TMS320C6678 (Shannon) DSP Training

N/A
N/A
Protected

Academic year: 2021

Share "Texas Instruments TMS320C6678 (Shannon) DSP Training"

Copied!
65
0
0

Loading.... (view fulltext now)

Full text

(1)

Texas Instruments

TMS320C6678 (Shannon)

DSP Training

Brighton Feng

(2)

Outline

C6678 DSP Overview

Multi-core DSP programming

Interconnection and resource sharing

(3)

4

Shannon Functional Diagram

• Multi-Core SoC

• Fixed/Floating C66x™ Core

– Eight cores @ 1.0 GHz, 0.5 MB Local L2 – 4.0 MB shared memory

– 256 GMAC, 128 GFLOP • Navigator

– Multicore eco system • Packet Infrastructure • Network Coprocessor

– IP Network solution for IP v4/6 – 1.5M packets per sec (1Gb Ethernet

wire-rate)

– IPsec, SRTP, Air Interface Encryption fully offloaded

• 3-port GigE Switch (Layer 2) • Low Power Consumption

– Adaptive Voltage Scaling (Smart ReflexTM) • Hyperlink 50 – 50G Expansion port – Transparent to Software • Multicore Debugging C6678 (Shannon) C66x core L2 Memory L1 D L1 P . . . 8 C66x Cores

Peripherals and I/O

sRIO

Flash PCIe TSIP

UART SPI, I2C

System Elements Power Mgt Debug EDMA SysMon Memory System DDR -3 64b Shared Memory Multicore Memory Controller Hyperlink50 TeraNet 2 Multic or e Na vig at or Enet Switch SGMI I SGMI I Packet CoProcessor Crypto/IPSec CoProcessor

(4)

100% backward object code compatible

Increased Fixed and floating

Point capability Improved support for

complex arithmetic and matrix computation

Enhanced DSP core

Native instructions for IEEE 754, SP&DP Advanced VLIW architecture 2x registers Enhanced floating-point add capabilities Advanced fixed-point instructions Four 16-bit or eight 8-bit MACs Two-level cache SPLOOP and

16-bit instructions for smaller code

size Flexible level one

memory architecture iDMA for rapid

data transfers between local memories C66x C64x+ C64x C67x C67x+

FLOATING-POINT VALUE FIXED-POINT VALUE

Pe rform anc e improv em ent

(5)

C66x core block diagram

C66x Core

Data Path 1 Data Path 2

A Register File A0 – A31 B Register File B0 –B31 Instruction Decode Instruction Dispatch Instruction Fetch Control Registers Interrupt Control In-Circuit Emulation D2 S2 L2 S1 L1 + + + + M1 D1 M2 x x x x SPLOOP Buffer x x x x x x x x x x x x x x x x x x x x x x x x x x x x 256 Bits 2x64 Bits + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +

(6)

Key Improvements of C66x

4x Multiply Accumulate improvement

Enhanced complex arithmetic and matrix operations

2x Arithmetic and Logical operations

improvement

Support the floating point arithmetic. Single

precision floating point operation capability

same as 32 bit fixed point operation capability

division and square root is supported by

(7)

C64x+

C66x Comparison

Operation Precision Operations

per cycle on C64x+ Operations per cycle on C66x Function Unit MAC Real 8 x 8 2 x 4 = 8 2 x 8 = 16 M1, M2 Real 16 x 16 2 x 2 = 4 2 x 8 = 16 M1, M2 Real 32 x 32 2 x 1 = 2 2 x 4 = 8 M1, M2 Complex (16,16) x (16,16) 2 x 1 = 2 2 x 4 = 8 M1, M2 Complex (32,32) x (32,32) N/A 2 x 1 = 2 M1, M2 Arithmetic Logical 8 bit 4 x 4 = 16 4 x 8 = 32 L1, L2, S1, S2 16 bit 4 x 2 = 8 4 x 4 = 16 L1, L2, S1, S2 32 bit 4 x 1 = 4 4 x 2 = 8 L1, L2, S1, S2 Memory Access 8 bit, 16 bit, 32 bit, 64 bit 2 x 1 = 2 2 x 1 = 2 D1, D2

(8)

Outline

C6678 DSP Overview

Multi-core DSP programming

Memory Architecture Overview

Shannon Memory Architecture

Improvement

Programming model

Interconnection and resource sharing

(9)

TCI6486 Memory Architecture

Core 0 Internal L2 RAM Core N Internal L2 RAM DMA SCR (Core speed)/3 128 bit Shared L2 Control (Core speed)/2 256 bit Shared L2 RAM External Memory Shared L2 ROM EDMA . . . S M S M M
(10)

Shannon Memory Architecture

Core 0 Internal L2 RAM Core N Internal L2 RAM DMA SCR (Core speed)/3 128 bit Shared Memory Control (Core speed)/2 256 bit Shared L2 RAM External Memory EDMA . . . S S M EDMA M
(11)

Outline

C6678 DSP Overview

Multi-core DSP programming

Memory Architecture Overview

Shannon Memory Architecture

Improvement

Programming model

Interconnection and resource sharing

(12)

Addition of XMC

Bring over existing EMC MDMA path

Fat pipe to external (and internal) shared memory

Bus width: 256 instead of 128 bits

Clock rate: CPUCLK/2 instead of CPUCLK/3 Optimize requests for MSMC / DDR3 memory

L2 line allocations and evictions are split into sub-lines of 64 bytes

Memory Protection and Address Extension (MPAX) support

16 segments of programmable size (powers of 2: 4KB to 4GB) Each segment maps a 32-bit address to a 36-bit address.

Each segment controls access: supervisor/user, R/W/X, (non-)secure Memory protection for shared internal MSMC memory and external DDR3

memory

Multi-stream Prefetch support

Program prefetch buffer up to 128 bytes Data prefetch buffer up to 8 x 128 bytes

Prefetch enabled/disabled on 16MB ranges defined in MAR Manual flush for coherence purposes

(13)

MAR Register Extension

• L2 memory controller extends the MAR registers by adding the “PFX” field, L2 memory controller uses this bit to convey XMC whether a given address range is prefetchable.

(14)

MSMC Block Diagram

RAM banks, 256-bits per bank CGEM Slave Port CGEM Slave Port System Slave Port for shared SRAM (SMS) System Slave Port for external memory (SES) MSMC System Master Port MSMC EMIF Master Port MSMC Datapath x N CGEM cores

Arbitration for Banks

256 256 256 256 256 Memory Protection and Extension Unit (MPAX) 256 256 VBUSM 256 events VBUSM 256 VBUSM 256 Memory Protection and Extension Unit (MPAX) MSMC Core EMIF – 64 bit DDR3 SCR SCR VBUSM 256

EDC for SRAM

One slave interface per C66x

Megamodule (256 bits @ CPUCLK/2) Uses a 36 bit address extended inside

a C66x Megamodule core

Two slave interfaces (256 bits @ CPUCLK/2) for access from system masters

SMS interface for accesses to MSMC SRAM space

SES interface for accesses to DDR3 space

Both interfaces support memory protection and address extension

One master interface (256-bits @ CPUCLK/2) for access to the DDR3 EMIF

One master interface (256 bits @ CPUCLK/2) for access to system slaves

(15)

MSMC Shared Memory

4 banks x 2 sub-banks, sub-bank are 256-bit

wide.

Reduces conflicts between C66x Megamodule cores and system masters

Features a dynamic fair-share bank arbitration for each transfer

Supports bandwidth management. Avoid

indefinite starvation for lower priority requests

due to higher priority requests

Features Not Supported

Cache coherency between L1/L2 caches in C66x Megamodule cores and MSMC memory

Cache coherency between XMC prefetch buffers and MSMC memory

C66x Megamodule to C66x Megamodule cache coherency via MSMC

(16)

MPAX Units

MPAX stands for “Memory Protection and

Address Extension”

There are N+2 MPAX units in a system with N

C66x Megamodules

N MPAX units for all requests from N C66x

Megamodules to internal shared memory, external shared memory or any system slave

1 MPAX unit for all requests from any system master to internal shared memory

1 MPAX unit for all requests from any system master to external shared memory

Each MPAX unit operates on a number of

segments of programmable size

Each segment maps a 32-bit address to a 36-bit address.

(17)

Number of Segments

Each C66x Megamodule has 16 segments which

control direct (load/store) requests to internal

shared memory, external shared memory and

any other system slave.

Any master identified by a privilege ID has

8 segments for requests to internal shared memory

8 segments for requests to external shared memory.

Some masters work on behalf of other masters.

They will inherit the privilege ID of their

commanding master. As such, each C66x

Megamodule also has

8 segments for indirect (DMA) requests to internal shared memory

8 segments for indirect (DMA) requests to external shared memory

(18)

Segment Definition

Each segment is defined by a base address and a size

The segment size can be set to any power of 2 from 4K to

4GB

The segment base address is constrained to power-of-2

boundary equal to size.

One would expect that each request should find one matching segment, however ...

a request may find two or more matching segments, in

which case segments with higher ID take priority over segments with lower ID. This allows

creating non-power of 2 segments

creating 3 segments with just 2 segment definitions ...

a request may find no matching segment, in which case an

error is reported in Memory protection fault reporting registers (XMPFAR, XMPFSR)

(19)

XMC Segment Registers

XMPAXH/XMPAXL[15-0]

(20)

MPAX Default Memory Map

Segment 1 Segment 0 Disabled Segment 2 Disabled Segment 3 Disabled Segment 4 Disabled Segment 5 Disabled Segment 6 Disabled Segment 7 Disabled Segment 8 Disabled Segment 9 Disabled Segment 10 Disabled Segment 11 Disabled Segment 12 Disabled Segment 13 Disabled Segment 14 Disabled Segment 15 CGEM Logical 32-bit Memory Map

Upper 60GB

System Physical 36-bit Memory Map

Lower 4GB 0000_0000 7FFF_FFFF 8000_0000 FFFF_FFFF (not remappable) 0BFF_FFFF 0C00_0000 0:FFFF_FFFF 0:8000_0000 0:7FFF_FFFF 0:0C00_0000 0:0BFF_FFFF 0:0000_0000 1:0000_0000 F:FFFF_FFFF 7:FFFF_FFFF 8:0000_0000

BADDR = 00000h; RADDR = 000000h; Size = 2GB BADDR = 80000h; RADDR = 800000h; Size = 2GB

8:8000_0000 8:7FFF_FFFF

XMC configures MPAX segments 0 and 1 so that C66x Megamodule can access system memory.

The power up configuration is that segment 1 remaps 8000_0000 – FFFF_FFFF in C66x

Megamodule’s address space to 8:0000_0000 – 8:7FFF_FFFF in the system address map.

This corresponds to the first 2GB of address space

(21)

MPAX MSMC Aliasing Example

BADDR = 0C000h; RADDR = 00C000h; Size = 2MB BADDR = 20000h; RADDR = 00C000h; Size = 2MB

CGEM 32-bit Memory Map 0000_0000 FFFF_FFFF (not remappable) 0BFF_FFFF 0Cxx_xxxx 0:0C1F_FFFF 0:0C00_0000 BADDR = 21000h; RADDR = 00C000h; Size = 2MB

20xx_xxxx 21xx_xxxx “Fast” MSMC RAM MSMC RAM Alias 1 MSMC RAM Alias 2 MSMC RAM (2MB)

Example shows 3 segments to map the MSMC RAM address space into C66x Megamodule’s address space as three distinct 2MB ranges. By programming the MARs accordingly, the three segments could have different semantics.

Accesses to MSMC RAM via this alias do not use the “fast RAM”

(22)

Copyright © 2010 Texas Instruments. All rights reserved.

MPAX Overlayed Segments Example

BADDR = 00000h; RADDR = 000000h; Size = 2GB BADDR = 80000h; RADDR = 080000h; Size = 2GB Segment 1

Segment 0

BADDR = C0007h; RADDR = 050042h; Size = 4K Segment 2 Disabled Segment 3 Disabled Segment 4 Disabled Segment 5 Disabled Segment 6 Disabled Segment 7 Disabled Segment 8 Disabled Segment 9 Disabled Segment 10 Disabled Segment 11 Disabled Segment 12 Disabled Segment 13 Disabled Segment 14 Disabled Segment 15

CGEM 32-bit Memory Map

Upper 60GB

System 36-bit Memory Map Lower 4GB 0000_0000 7FFF_FFFF 8000_0000 FFFF_FFFF (not remappable) 0BFF_FFFF 0C00_0000 0:FFFF_FFFF 0:8000_0000 0:7FFF_FFFF 0:0C00_0000 0:0BFF_FFFF 0:0000_0000 1:0000_0000 F:FFFF_FFFF 0:5004_2xxx 0:C000_7xxx C000_7xxx

segment 1 matches 8000_0000 through FFFF_FFFF,

and segment 2 matches C000_7000 through C000_7FFF.

Because segment 2 is higher priority than segment 1,

its settings take priority, effectively carving a 4K hole in segment 1’s 2GB address space.

Furthermore, it maps this 4K space to 0:5004_2000 -

0:5004_2FFF, which overlaps the mapping established by segment 0. This physical address range is now

(23)

Outline

C6678 DSP Overview

Multi-core DSP programming

Memory Architecture Overview

Shannon Memory Architecture

Improvement

Programming model

Interconnection and resource sharing

(24)

single program image

L2memory C6000 Core 0 L1 Prog L1 Data C6000 Core 1 L1 Prog L1 Data C6000 Core 2 L1 Prog L1 Data L2memory L2memory

App.out App.out App.out

code and read/write data Shared L2 or DDR memory App.out Shared code and Read only data Data 0 Data 1 Data 2

Data 0 Data 1 Data 2

Same image on each DSP core

Aliased addressing used for DSP core to access local L2 DNUM DSP core register for:

Global addressing when programming EDMA3, SRIO, …

(25)

Shannon MPAX enables easy single program image

M P A X M P A X code1 data2 data2 code2 data3 data3 MSMC RAM internal External memory code1 data2 code2 data3 MSMC RAM internal External memory

SoC address space CGEM address space (1)

code1 data2 code2 data3 MSMC RAM internal External memory

CGEM address space (n)

(26)

multiple program image

L2memory C6000 Core 0 L1 Prog L1 Data C6000 Core 1 L1 Prog L1 Data C6000 Core 2 L1 Prog L1 Data L2memory L2memory App0.out App1.out C6000 Core 0 L1 Prog L1 Data C6000 Core 1 L1 Prog L1 Data C6000 Core 2 L1 Prog L1 Data

App0.out App1.out App2.out

App2.out

Shared L2 or DDR memory

Data 0 Data 1 Data 2

Data 0 Data 1 Data 2

Each DSP core has its image

Static split of DDR2 per DSP core

(27)

43

Shannon Software

Flexible development

environment for the customer.

Customer can choose to develop

their application using all or any one of the software layers.

Will contain following software

layers

BIOS and Linux Operating System support

Chip Support Library

Platform Development Kit

Inter Core Communication

Optimized DSP functions library

Optimized Audio, Video and Speech codecs

Voice Gateway Demonstration Kit

Video Transcoding Demonstration Kit

Demonstration applications

C6678 Software

Operating System w/ Boot Loader BIOS

Full Silicon Entitlement Multi-core Entitlement

Linux

Chip Support Library Platform Development Kit

Inter Core Communication

Voice Gateway Demonstration Kit Video Transcoding Demonstration Kit Speech Codec DSPLIB Audio Codec Video Codec Demo App Customer Application

(28)

Data

Visualization

Shannon Debug

Best Multicore Debug and Visualization Debug enabled Multicore SoC

Debug visibility at core, across multicore and for SoC

45 C6678 (Shannon) C66x core L2 Mem or y L1 D L1 P . . . 8 C66x Cores

Peripherals and I/O

sRIO

Flash PCIe TSIP

UART SPI, I2C

System Elements Power Mgt Debug EDMA SysMon Memory System DDR -3 64b Shared Memory Multicore Memory Controller TeraNet 2 Multic or e Na vig at or Enet Switch SGMI I SGMI I Packet CoProcessor Crypto/IPSec CoProcessor E T B TRACE TRA CE Hyperlink50

(29)

Outline

C6678 DSP Overview

Multi-core DSP programming

Interconnection and resource sharing

Interconnection Architecture

Shannon Hardware queue

Inter-core communication

Shared Resource Management

(30)

Shannon Switch Fabric

MSMC_SS CPU/2 256b VBUSM SCR Shared L2 RAM CPU/3 128b VBUSM SCR S S SRIO M PCIe QM_SS M M 16ch DMA TC0 M M TC1 M M DDR3 S XMC 64ch DMA M TC2 M TC3 M TC4 M TC5 64ch DMA M TC6 M TC7 M TC8 M TC9 CPU/3 32b VBUSP SCR PA_SS M VUSR M VUSR S TSIP 0,1 M QM_SS PCIe S S EMIF16 S CONFIG M EDMA_0 EDMA_1,2 GEM S GEM M S GEM M S GEM M S GEM M S GEM M S GEM M S GEM M S M
(31)

Outline

C6678 DSP Overview

Multi-core DSP programming

Interconnection and resource sharing

Interconnection Architecture

Shannon Hardware queue

Inter-core communication

Shared Resource Management

(32)

Hardware Queue Architecture

packetized Data transfer architecture

designed to minimize DSP core

interaction while maximizing memory

and bus efficiency

the key communication platform for TI’s

future Infrastructure DSPs

Used by following peripherals in

Shannon:

Serial RapidIO, Packet Accelerator

Each module contains its own DMA to

transfer associated data with the ‘jobs’, No

CPU resources involved

(33)

Queue 1..x

Hardware Queue

Producer writes ‘jobs’ into a Queue. Consumer reads ‘jobs’ from the

Queue

Supports Multiple In – Multiple Out

Multiple Producers can write to the

same Queue

Used to share common Hardware

Multiple Consumers can read from

the same Queue

Used for Load Balancing

Abstracts the Consumer

Consumer can be a Hardware IP

(accelerator, peripheral) or a software (ie a CPU core)

Transparent for the Producer

  ‘Easy’ to upgrade to new

hardware. The ‘job gets done’.

  Minimize changes to Host

software, Easy maintenance

CPU1 CPU2 CPU3 Packet Acc. RapidIO .... Producer Queue Manager CPUx Acc 1 Acc 2 RapidIO Peri x … Queue Controller DMA Consumer

(34)
(35)

Hardware Queue Operation

Push to a queue

Host write pointer of new descriptor to a queue register.

Queue manager links (modify the link RAM) the new

descriptor to the tail (or header) of the queue.

Tail (or header) pointer points to the new descriptor.

Pop from a queue

Host read a descriptor pointer from a queue register.

Queue manager returns the descriptor pointed by the header

pointer

Header pointer points to the next descriptor.

Monitor queue

Queue manager generates events when queue changes: not

empty, entry count, exceed threshold, starvation…

Queue Diversion

Entire queue contents can be cleared or moved to another

(36)

Shannon Hardware queue architecture

DSP core DSP core

Queue Manage Subsystem DSP core Packet DMA

(SRIO) Packet DMA (PA) VBUS Accumulation Buffer Buffer Memory . . . Link RAM Descriptor RAM Queue Manager Q1 IF Q0 IF Qx IF Queue Events

Queue Event Queue Event

Packet DMA (Internal) APDSP APDSP Queue Interrupt Queue Interrupts

(37)

Queue Manager Subsystem

Support 8192 queues

HW queues are multi-core safe without mutual

exclusion, multiple senders can use a destination

queue without restrictions

Can Notify Packet DMA when transfer is pending

Can notify DSP core when packet is pending, can

copy descriptor pointers of transferred data to

destination core’s local memory to reduce access

latency

Internal Packet DMA

Transfer packet from one queue to another queue. Good for core to core data transfer.

(38)

Descriptor RAM

Data elements (buffers) to be passed on queues are first described to a

descriptor region manager built into the QM.

Although technically called descriptors, these memory elements can hold any

arbitrary data.

The size of the data

elements must be a power of 2, from 32 bytes to 8192

bytes in length.

20 configurable memory regions (for descriptor storage)

The number of elements in the region must be a power of 2, from 32 buffers to 4096 buffers in the region.

32 byte buffers

256 byte buffers

Memory Descriptor Region

Registers 16 0 0x1000 0x2000 0x1000 0 16 Region 0 Region 1 Region 19 32 16 4 256 0x2000 15 19

(39)

Linking RAM

Linking RAM contains 1 entry for each

descriptor . Linking RAM entry is effectively an extension of the descriptor

Linking RAM stores Forward data pointer that is critical for the PUSH / POP operations performed by the Queue Manager

Linkage between physical address of

descriptor and physical address of Linking RAM is performed inside the QM using

information provided in the Queue Management configuration registers

Linking RAM is typically placed in local memory for speed. This allows data

elements to be linked and unlinked in a

queue very quickly, even though the buffers themselves may be in external memory

There is no limit to the length of a single queue, only a limit on the total number of data elements in the system.

2 configurable Linking RAM regions

Queue Contents Linking RAM

0 17

Forward Pointer Table

- - - - x - - - - - - - - - - - 5 19 x Queue 0 Queue 1 17 5 19 18

(40)

Queue Data Flow Example, Transmit

Host Processor Queue Manager Rx Queue Rx Port

INIT : Host Allocates Rx Free Descriptors

and initializes queues Interrupt Generator

Free Descriptor Queue Tx Queue TX 2 Processor Queues a packet to a Tx Queue TX 3 Port transmits the buffer being pointed to by the descriptor TX 4 Port Posts Packet Descriptor to return Queue Tx Port TX 1 Processor fetches a descriptor to fill with the data to transmit

(41)

Queue Data Flow Example, Receive

Host Processor Queue Manager Rx Queue Rx Port

INIT : Host Allocates Rx Free Descriptors and initializes queues

RX 1 Port Fetches a Free Descriptor and transfers the data to the buffer pointed to by the descriptor RX 2 Port Posts Packet to Rx Queue RX 4

Interrupt according to pacing rules or poll

Interrupt Generator RX 3 Not Empty Level Status Free Descriptor Queue Tx Queue Tx Port

Optionally prefetches the descriptor to L2 prior to interrupting

(42)

Accumulator (A Programmable DSP)

Accumulator is used to help

DSP core efficiently POP descriptor pointers from queue.

Accumulator pop descriptor

pointer from queue and write to accumulation memory

(normally in DSP local memory).

Accumulator generates

interrupt to DSP core

according to interrupt pacing configuration.

Two Accumulator (PDSP)

One generate 32 interrupts, each for one queue.

The other generate 16

interrupts, each is combined event for 32 queues. Totally monitor 16x32 queues.

DSP core Accumulation Memory

(Descriptor Pointer Array)

Queue Manager Monitor Queue Changes APDSP Queue Events Queue Interrupts Descriptor RAM Timer for Interrupt Pacing

(43)

Hardware queue Performance Consideration

Push Operation

1~4 words write. Since it is post operation, normally, do not stall DSP core.

Pop Operation

1~4 words read. Stall DSP core about 80~100 cycles.

Accumulator (PDSP) can pop the descriptors to DSP local memory which will save DSP cycles dramatically.

Descriptor Access

Write/read full descriptor may consume many cycles.

For most applications, DSP core can initialize all descriptors during initialization, and only write/read few fields of the descriptor during run time.

(44)

Outline

C6678 DSP Overview

Multi-core DSP programming

Interconnection and resource sharing

Interconnection Architecture

Shannon Hardware queue

Inter-core communication

Shared Resource Management

(45)

Shared Data in the L2 SRAM of transmitter

If cache is enabled, Core Y needs invalidate cache before read Data Switch Fabric Center DDR2 SDRAM L2 RAM L2 Cache DSP Core X L1 Cache L2 RAM L2 Cache DSP Core Y L1 Cache

(46)

Shared Data in the L2 SRAM of receiver

Data Switch Fabric Center DDR2 SDRAM L2 RAM L2 Cache DSP Core X L1 Cache L2 RAM L2 Cache DSP Core Y L1 Cache

If cache is enabled, Core X needs write back cache after write

(47)

Shared Data in the shared memory

Data Switch Fabric Center Shared L2 or DDR L2 RAM L2 Cache DSP Core X L1 Cache L2 RAM L2 Cache DSP Core Y L1 Cache

If cache is enabled, Core X needs write back cache after write; core Y needs invalidate cache before read

(48)

Use IPC register for inter-core communication

Configuration Switch Fabric L2 RAM L2 Cache DSP Core X L1 Cache L2 RAM L2 Cache DSP Core Y L1 Cache IPC

Interrupt is generated for Core Y

(49)

Inter-core Data Block exchange with EDMA

Data Switch Fabric Center EDMA L2 RAM L2 Cache DSP Core X L1 Cache L2 RAM L2 Cache DSP Core Y L1 Cache Data Data

Interrupt is generated for Core Y

(50)

Inter-core data exchange through hardware queue

(Packet DMA copy)

Data Switch Fabric Center Packet DMA L2 RAM L2 Cache DSP Core X L1 Cache L2 RAM L2 Cache DSP Core Y L1 Cache Src Que Dst Que

Core X simply push data to Source Queue

Packet DMA transfer the data Dest Queue

Core Y simply pop data from Dest Queue

If Queue buffers are in L2 RAM, Software on both cores do not need maintenance the cache coherency.

(51)

Inter-core data exchange through hardware queue

(Zero Copy)

Core X push data to Shared Queue, Core Y pop data from Shared

Queue

Multi-core can access Shared Queue simultaneously without mutual

exclusion

Software need maintenance the cache coherency.

Data Switch Fabric Center Queue Manager L2 RAM L2 Cache DSP Core X L1 Cache L2 RAM L2 Cache DSP Core Y L1 Cache Shared Queue

(52)

Outline

C6678 DSP Overview

Multi-core DSP programming

Interconnection and resource sharing

Interconnection Architecture

Shannon Hardware queue

Inter-core communication

Shared Resource Management

(53)

Shared resources

Internal shared L2 and External Shared memory (DDR)

 Each core access shared memory independently. Arbitration

handled by switch fabric and end-point arbiters.

Shared on-chip Peripherals

 Configuration: Typically done at startup to set the operating

mode of a particular logic block (e.g. DDR settings). Should be done by a single core as part of the boot process.

 Use:

 Peripherals with Hardware queue, Each core access hardware

queue independently. Arbitration handled by queue manager.

 Ethernet, SRIO on Shannon…

 Multi-channel peripherals can be split amongst the cores for

concurrent, orthogonal control

 EDMA, TSIP, Timer…

 Single-channel peripherals can be controlled by a single master,

servicing the other cores if needed. Or mutual exclusively used by multi-masters through semaphore.

(54)

System-level prioritization for arbitration

A user-specified priority may be assigned to:

Any DSP core accesses

Any EDMA, sRIO, Ethernet, … on-chip transfers

Each of the master ports are assigned a priority (8

(55)

Hardware Semaphores on Shannon for atomic accesses

What function does the Semaphore module provide?

A method to control who accesses a shared resource

Provides accesses for shared resources in an atomic manner

Read-modify-write sequence is not broken

Features of the Semaphore module

Binary Semaphore

Contains 64 semaphores to be used within the system

Two methods of accessing a semaphore resource

Direct Access

A core directly accesses a semaphore resource. If free, the semaphore will be granted. If not, the semaphore is not granted

Useful if the system can afford to poll for the semaphore

Indirect access

A core indirectly accesses a semaphore resource by writing to it. Once it is free an interrupt will notify the DSP core that it is available.

(56)

Outline

C6678 DSP Overview

Multi-core DSP programming

Interconnection and resource sharing

(57)

Shannon RapidIO Gen 2 Features and Enhancements

4 lanes – options include 2x

Baud rates: 5 Gbaud per lane in addition to 1.25, 2.5, 3.125 Gbaud per lane

DeviceID Support

16 Local DeviceIDs (up

from 1)

8 Multicast IDs (up from 3)

24 Interrupt outputs (up from 8)

Messaging

Type 9 Packets Support (Data Streaming)

Type 11 Message –

classification improvements

DirectIO

8 Load/Store (DirectIO) Units (up from 4)

Shadow register sets for LSUs to simplify management and minimize overhead

Provide up to 1MB block transfers (up from 4KB)

Packet Forwarding with Reset

Isolation

(58)

89

RapidIO – Topology Examples

C6678 DSP C6678 DSP C6678 DSP C6678 DSP C6678 DSP C6678 DSP Mesh C6678 DSP C6678 DSP C6678 DSP C6678 DSP Chain SRIO Switch C6678 DSP C6678 DSP C6678 DSP C6678 DSP Switch C6678 DSP C6678 DSP C6678 DSP C6678 DSP C6678 DSP C6678 DSP C6678 DSP C6678 DSP Ring

(59)

Packet Accelerator Subsystem On Shannon

3 Port Ethernet Switch

Port 0: Internal hardware queue port

Port 1: SGMII 0 Port, 1Gbps

Port 2: SGMII 1 Port, 1Gbps

Packet Accelerator (PA)

L2, L3, and L4 packet processing

1.5M packets per sec

Security Accelerator (SA)

Encryption/Decryption IPSEC ESP IPSEC AH SRTP 3GPP 91

(60)

IEEE 1588 support

EMAC hardware supports classifying at the physical level

ingress and egress frames as timing synchronization

frames and the timestamp is recorded.

A software algorithm running on DSP core would then run

the algorithm to calculate the delay and adjust local time

accordingly.

Device A is the master device

Device B is the slave device

Message B is used to send the actual transmit time (tA) of Message A

Message D is used to send the actual receive time (rC) of Message C

wire time in one direction ((rC - tA)-(tC - rA))/2

(61)

TSIP Overview

1024 8-bit timeslots receive and transmit.

8 links of 128 timeslots at 8.192 Mbps.

4 links of 256 timeslots at 16.384 Mbps.

2 links of 512 timeslots at 32.768 Mbps.

Two clock and frame sync inputs.

Independent clocking – 1 receive clock and 1 transmit

clock.

Redundant/common clocking – 1 receive/transmit clock

(62)

Shannon PCIe Interface

Nyquist/Shannon incorporates PCIe interface with

the following characteristics:

Two SERDES lanes running at 5 GBaud/2.5GBaud

Gen2 compliant

Three different operational modes (default defined by pin inputs at power up; can be overwritten by software):

Root Complex (RC) End Point (EP)

Legacy End Point

Single Virtual Channel (VC)

Single Traffic Class (TC)

Maximum Payloads

Egress – 128 bytes Ingress – 256 bytes

Configurable BAR filtering, IO filtering and configuration

filtering

(63)

Remaining Peripherals & System Elements (1/2)

EMIF16

Supports NAND flash memory, up to 256MB Supports NOR flash up to 16MB

Supports asynchronous SRAM mode, up to 1MB Used for booting, logging, announcement, etc.

64-Bit Timers

Total of 16 64-bit timers

One 64-bit timer per core is dedicated to serve as a watchdog (or may be used as a general purpose timer)

Eight 64-bit timers are shared for general purpose timers

Each 64-bit timer can be configured as two individual 32-bit timers Timer Input/Output pins

Two timer Input pins

Two timer Output pins

Timer input pins can be used as GPI

(64)

Remaining Peripherals & System Elements (2/2)

UART Interface – Operates at up to 128,000 baud I2C Interface

Supports 400Kbps throughput Supports full 7-bit address field Supports EEPROM size of 4 Mbit

SPI Interface

Operates at up to 66MHz Supports two chip selects Support master mode

GPIO Interface

16 GPIO pins

Can be configured as interrupt pins

(65)

References

Related documents