Texas Instruments TMS320C6678 (Shannon) DSP Training

(1)

Texas Instruments

TMS320C6678 (Shannon)

DSP Training

Brighton Feng

(2)

Outline



C6678 DSP Overview



Multi-core DSP programming



Interconnection and resource sharing

(3)

4

Shannon Functional Diagram

• Multi-Core SoC

• Fixed/Floating C66x™ Core

– Eight cores @ 1.0 GHz, 0.5 MB Local L2 – 4.0 MB shared memory

– 256 GMAC, 128 GFLOP • Navigator

– Multicore eco system • Packet Infrastructure • Network Coprocessor

– IP Network solution for IP v4/6 – 1.5M packets per sec (1Gb Ethernet

wire-rate)

– IPsec, SRTP, Air Interface Encryption fully offloaded

• 3-port GigE Switch (Layer 2) • Low Power Consumption

– Adaptive Voltage Scaling (Smart ReflexTM₎ • Hyperlink 50 – 50G Expansion port – Transparent to Software • Multicore Debugging C6678 (Shannon) C66x core L2 Memory _{L1 D} L1 P . . . 8 C66x Cores

Peripherals and I/O

sRIO

Flash PCIe TSIP

UART SPI, I2C

System Elements Power Mgt Debug EDMA SysMon Memory System DDR -3 64b Shared Memory Multicore Memory Controller Hyperlink50 TeraNet 2 Multic or e Na vig at or Enet Switch SGMI I SGMI I Packet CoProcessor Crypto/IPSec CoProcessor

(4)

100% backward object code compatible

Increased Fixed and floating

Point capability Improved support for

complex arithmetic and matrix computation

Enhanced DSP core

Native instructions for IEEE 754, SP&DP Advanced VLIW architecture 2x registers Enhanced floating-point add capabilities Advanced fixed-point instructions Four 16-bit or eight 8-bit MACs Two-level cache SPLOOP and

16-bit instructions for smaller code

size Flexible level one

memory architecture iDMA for rapid

data transfers between local memories C66x C64x+ C64x C67x C67x+

FLOATING-POINT VALUE FIXED-POINT VALUE

Pe rform anc e improv em ent

(5)

C66x core block diagram

C66x Core

Data Path 1 Data Path 2

A Register File A0 – A31 B Register File B0 –B31 Instruction Decode Instruction Dispatch Instruction Fetch Control Registers Interrupt Control In-Circuit Emulation D2 S2 L2 S1 L1 + + + + M1 D1 M2 x x x x SPLOOP Buffer x x x x x x x x x x x x x x x x x x x x x x x x x x x x 256 Bits 2x64 Bits + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +

(6)

Key Improvements of C66x



4x Multiply Accumulate improvement

 Enhanced complex arithmetic and matrix operations



2x Arithmetic and Logical operations

improvement



Support the floating point arithmetic. Single

precision floating point operation capability

same as 32 bit fixed point operation capability



division and square root is supported by

(7)

C64x+



C66x Comparison

Operation Precision Operations

per cycle on C64x+ Operations per cycle on C66x Function Unit MAC Real 8 x 8 2 x 4 = 8 2 x 8 = 16 M1, M2 Real 16 x 16 2 x 2 = 4 2 x 8 = 16 M1, M2 Real 32 x 32 2 x 1 = 2 2 x 4 = 8 M1, M2 Complex (16,16) x (16,16) 2 x 1 = 2 2 x 4 = 8 M1, M2 Complex (32,32) x (32,32) N/A 2 x 1 = 2 M1, M2 Arithmetic Logical 8 bit 4 x 4 = 16 4 x 8 = 32 L1, L2, S1, S2 16 bit 4 x 2 = 8 4 x 4 = 16 L1, L2, S1, S2 32 bit 4 x 1 = 4 4 x 2 = 8 L1, L2, S1, S2 Memory Access 8 bit, 16 bit, 32 bit, 64 bit 2 x 1 = 2 2 x 1 = 2 D1, D2

(8)

Outline



C6678 DSP Overview





Memory Architecture Overview



Shannon Memory Architecture

Improvement



Programming model



(9)

TCI6486 Memory Architecture

Core 0 Internal L2 RAM Core N Internal L2 RAM DMA SCR (Core speed)/3 128 bit Shared L2 Control (Core speed)/2 256 bit Shared L2 RAM External Memory Shared L2 ROM EDMA . . . S M S M M

(10)

Core 0 Internal L2 RAM Core N Internal L2 RAM DMA SCR (Core speed)/3 128 bit Shared Memory Control (Core speed)/2 256 bit Shared L2 RAM External Memory EDMA . . . S S M EDMA M

(11)

Outline



C6678 DSP Overview





Improvement



Programming model



(12)

Addition of XMC

 Bring over existing EMC MDMA path

 Fat pipe to external (and internal) shared memory

 Bus width: 256 instead of 128 bits

 Clock rate: CPUCLK/2 instead of CPUCLK/3  Optimize requests for MSMC / DDR3 memory

 L2 line allocations and evictions are split into sub-lines of 64 bytes

 Memory Protection and Address Extension (MPAX) support

 16 segments of programmable size (powers of 2: 4KB to 4GB)  Each segment maps a 32-bit address to a 36-bit address.

 Each segment controls access: supervisor/user, R/W/X, (non-)secure  Memory protection for shared internal MSMC memory and external DDR3

memory

 Multi-stream Prefetch support

 Program prefetch buffer up to 128 bytes  Data prefetch buffer up to 8 x 128 bytes

 Prefetch enabled/disabled on 16MB ranges defined in MAR  Manual flush for coherence purposes

(13)

MAR Register Extension

• L2 memory controller extends the MAR registers by adding the “PFX” field, L2 memory controller uses this bit to convey XMC whether a given address range is prefetchable.

(14)

MSMC Block Diagram

RAM banks, 256-bits per bank CGEM Slave Port CGEM Slave Port System Slave Port for shared SRAM (SMS) System Slave Port for external memory (SES) MSMC System Master Port MSMC EMIF Master Port MSMC Datapath x N CGEM cores

Arbitration for Banks

256 256 256 256 256 Memory Protection and Extension Unit (MPAX) 256 256 VBUSM 256 events VBUSM 256 VBUSM 256 Memory Protection and Extension Unit (MPAX) MSMC Core EMIF – 64 bit DDR3 SCR SCR VBUSM 256

EDC for SRAM

 One slave interface per C66x

Megamodule (256 bits @ CPUCLK/2)  Uses a 36 bit address extended inside

a C66x Megamodule core

 Two slave interfaces (256 bits @ CPUCLK/2) for access from system masters

 SMS interface for accesses to MSMC SRAM space

 SES interface for accesses to DDR3 space

 Both interfaces support memory protection and address extension

 One master interface (256-bits @ CPUCLK/2) for access to the DDR3 EMIF

 One master interface (256 bits @ CPUCLK/2) for access to system slaves

(15)

MSMC Shared Memory



4 banks x 2 sub-banks, sub-bank are 256-bit

wide.

 Reduces conflicts between C66x Megamodule cores and system masters

 Features a dynamic fair-share bank arbitration for each transfer



Supports bandwidth management. Avoid

indefinite starvation for lower priority requests

due to higher priority requests



Features Not Supported

 Cache coherency between L1/L2 caches in C66x Megamodule cores and MSMC memory

 Cache coherency between XMC prefetch buffers and MSMC memory

 C66x Megamodule to C66x Megamodule cache coherency via MSMC

(16)

MPAX Units



MPAX stands for “Memory Protection and

Address Extension”



There are N+2 MPAX units in a system with N

C66x Megamodules

 N MPAX units for all requests from N C66x

Megamodules to internal shared memory, external shared memory or any system slave

 1 MPAX unit for all requests from any system master to internal shared memory

 1 MPAX unit for all requests from any system master to external shared memory



Each MPAX unit operates on a number of

segments of programmable size

 Each segment maps a 32-bit address to a 36-bit address.

(17)

Number of Segments



Each C66x Megamodule has 16 segments which

control direct (load/store) requests to internal

shared memory, external shared memory and

any other system slave.



Any master identified by a privilege ID has

 8 segments for requests to internal shared memory

 8 segments for requests to external shared memory.



Some masters work on behalf of other masters.

They will inherit the privilege ID of their

commanding master. As such, each C66x

Megamodule also has

 8 segments for indirect (DMA) requests to internal shared memory

 8 segments for indirect (DMA) requests to external shared memory

(18)

Segment Definition

 Each segment is defined by a base address and a size

 The segment size can be set to any power of 2 from 4K to

4GB

 The segment base address is constrained to power-of-2

boundary equal to size.

 One would expect that each request should find one matching segment, however ...

 a request may find two or more matching segments, in

which case segments with higher ID take priority over segments with lower ID. This allows

 creating non-power of 2 segments

 creating 3 segments with just 2 segment definitions  ...

 a request may find no matching segment, in which case an

error is reported in Memory protection fault reporting registers (XMPFAR, XMPFSR)

(19)

XMC Segment Registers

XMPAXH/XMPAXL[15-0]

(20)

MPAX Default Memory Map

Segment 1 Segment 0 Disabled Segment 2 Disabled Segment 3 Disabled Segment 4 Disabled Segment 5 Disabled Segment 6 Disabled Segment 7 Disabled Segment 8 Disabled Segment 9 Disabled Segment 10 Disabled Segment 11 Disabled Segment 12 Disabled Segment 13 Disabled Segment 14 Disabled Segment 15 CGEM Logical 32-bit Memory Map

Upper 60GB

System Physical 36-bit Memory Map

Lower 4GB 0000_0000 7FFF_FFFF 8000_0000 FFFF_FFFF (not remappable) 0BFF_FFFF 0C00_0000 0:FFFF_FFFF 0:8000_0000 0:7FFF_FFFF 0:0C00_0000 0:0BFF_FFFF 0:0000_0000 1:0000_0000 F:FFFF_FFFF 7:FFFF_FFFF 8:0000_0000

BADDR = 00000h; RADDR = 000000h; Size = 2GB BADDR = 80000h; RADDR = 800000h; Size = 2GB

8:8000_0000 8:7FFF_FFFF

 XMC configures MPAX segments 0 and 1 so that C66x Megamodule can access system memory.

 The power up configuration is that segment 1 remaps 8000_0000 – FFFF_FFFF in C66x

Megamodule’s address space to 8:0000_0000 – 8:7FFF_FFFF in the system address map.

 This corresponds to the first 2GB of address space

(21)

MPAX MSMC Aliasing Example

BADDR = 0C000h; RADDR = 00C000h; Size = 2MB BADDR = 20000h; RADDR = 00C000h; Size = 2MB

CGEM 32-bit Memory Map 0000_0000 FFFF_FFFF (not remappable) 0BFF_FFFF 0Cxx_xxxx 0:0C1F_FFFF 0:0C00_0000 BADDR = 21000h; RADDR = 00C000h; Size = 2MB

20xx_xxxx 21xx_xxxx “Fast” MSMC RAM MSMC RAM Alias 1 MSMC RAM Alias 2 MSMC RAM (2MB)

 Example shows 3 segments to map the MSMC RAM address space into C66x Megamodule’s address space as three distinct 2MB ranges. By programming the MARs accordingly, the three segments could have different semantics.

 Accesses to MSMC RAM via this alias do not use the “fast RAM”

(22)

MPAX Overlayed Segments Example

BADDR = 00000h; RADDR = 000000h; Size = 2GB BADDR = 80000h; RADDR = 080000h; Size = 2GB Segment 1

Segment 0

BADDR = C0007h; RADDR = 050042h; Size = 4K Segment 2 Disabled Segment 3 Disabled Segment 4 Disabled Segment 5 Disabled Segment 6 Disabled Segment 7 Disabled Segment 8 Disabled Segment 9 Disabled Segment 10 Disabled Segment 11 Disabled Segment 12 Disabled Segment 13 Disabled Segment 14 Disabled Segment 15

CGEM 32-bit Memory Map

Upper 60GB

System 36-bit Memory Map Lower 4GB 0000_0000 7FFF_FFFF 8000_0000 FFFF_FFFF (not remappable) 0BFF_FFFF 0C00_0000 0:FFFF_FFFF 0:8000_0000 0:7FFF_FFFF 0:0C00_0000 0:0BFF_FFFF 0:0000_0000 1:0000_0000 F:FFFF_FFFF 0:5004_2xxx 0:C000_7xxx C000_7xxx

 segment 1 matches 8000_0000 through FFFF_FFFF,

and segment 2 matches C000_7000 through C000_7FFF.

 Because segment 2 is higher priority than segment 1,

its settings take priority, effectively carving a 4K hole in segment 1’s 2GB address space.

 Furthermore, it maps this 4K space to 0:5004_2000 -

0:5004_2FFF, which overlaps the mapping established by segment 0. This physical address range is now

(23)

Outline



C6678 DSP Overview





Improvement



Programming model



(24)

single program image

L2memory C6000 Core 0 L1 Prog L1 Data C6000 Core 1 L1 Prog L1 Data C6000 Core 2 L1 Prog L1 Data L2memory L2memory

App.out App.out App.out

code and read/write data Shared L2 or DDR memory App.out Shared code and Read only data Data 0 Data 1 Data 2

Data 0 Data 1 Data 2

 Same image on each DSP core

 Aliased addressing used for DSP core to access local L2  DNUM DSP core register for:

 Global addressing when programming EDMA3, SRIO, …

(25)

Shannon MPAX enables easy single program image

M P A X M P A X code1 data2 data2 code2 data3 data3 MSMC RAM internal External memory code1 data2 code2 data3 MSMC RAM internal External memory

SoC address space CGEM address space (1)

code1 data2 code2 data3 MSMC RAM internal External memory

CGEM address space (n)

(26)

multiple program image

L2memory C6000 Core 0 L1 Prog L1 Data C6000 Core 1 L1 Prog L1 Data C6000 Core 2 L1 Prog L1 Data L2memory L2memory App0.out _App1.out C6000 Core 0 L1 Prog L1 Data C6000 Core 1 L1 Prog L1 Data C6000 Core 2 L1 Prog L1 Data

App0.out App1.out App2.out

App2.out

Shared L2 or DDR memory

Data 0 _{Data 1} _{Data 2}

Data 0 Data 1 Data 2

 Each DSP core has its image

 Static split of DDR2 per DSP core

(27)

43

Shannon Software

• Flexible development

environment for the customer.

• Customer can choose to develop

their application using all or any one of the software layers.

• Will contain following software

layers

– BIOS and Linux Operating System support

– Chip Support Library

– Platform Development Kit

– Inter Core Communication

– Optimized DSP functions library

– Optimized Audio, Video and Speech codecs

– Voice Gateway Demonstration Kit

– Video Transcoding Demonstration Kit

– Demonstration applications

C6678 Software

Operating System w/ Boot Loader BIOS

Full Silicon Entitlement Multi-core Entitlement

Linux

Chip Support Library Platform Development Kit

Inter Core Communication

Voice Gateway Demonstration Kit Video Transcoding Demonstration Kit Speech Codec DSPLIB Audio Codec Video Codec Demo App Customer Application

(28)

Data

Visualization

Shannon Debug

Best Multicore Debug and Visualization Debug enabled Multicore SoC

Debug visibility at core, across multicore and for SoC

45 C6678 (Shannon) C66x core L2 Mem or y L1 D L1 P . . . 8 C66x Cores

Peripherals and I/O

sRIO

Flash PCIe TSIP

UART SPI, I2C

System Elements Power Mgt Debug EDMA SysMon Memory System DDR -3 64b _Shared Memory Multicore Memory Controller TeraNet 2 Multic or e Na vig at or Enet Switch SGMI I SGMI I Packet CoProcessor Crypto/IPSec CoProcessor E T B TRACE TRA CE Hyperlink50

(29)

Outline



C6678 DSP Overview



Interconnection and resource sharing



Interconnection Architecture



Shannon Hardware queue



Inter-core communication



Shared Resource Management

(30)

Shannon Switch Fabric

MSMC_SS CPU/2 256b VBUSM SCR Shared L2 RAM CPU/3 128b VBUSM SCR S S SRIO M PCIe QM_SS M M 16ch DMA TC0 M M TC1 M M DDR3 S XMC 64ch DMA M TC2 M TC3 M TC4 M TC5 64ch DMA M TC6 M TC7 M TC8 M TC9 CPU/3 32b VBUSP SCR PA_SS M VUSR M VUSR S TSIP 0,1 M QM_SS PCIe S S EMIF16 S CONFIG M EDMA_0 EDMA_1,2 GEM S GEM M S GEM M S GEM M S GEM M S GEM M S GEM M S GEM M S M

(31)

Outline



C6678 DSP Overview





Shared Resource Management

(32)

Hardware Queue Architecture



packetized Data transfer architecture

designed to minimize DSP core

interaction while maximizing memory

and bus efficiency



the key communication platform for TI’s

future Infrastructure DSPs



Used by following peripherals in

Shannon:



Serial RapidIO, Packet Accelerator



Each module contains its own DMA to

transfer associated data with the ‘jobs’, No

CPU resources involved

(33)

Queue 1..x

Hardware Queue

 Producer writes ‘jobs’ into a Queue.  Consumer reads ‘jobs’ from the

Queue

 Supports Multiple In – Multiple Out

 Multiple Producers can write to the

same Queue

 Used to share common Hardware

 Multiple Consumers can read from

the same Queue

 Used for Load Balancing

 Abstracts the Consumer

 Consumer can be a Hardware IP

(accelerator, peripheral) or a software (ie a CPU core)

 Transparent for the Producer

  ‘Easy’ to upgrade to new

hardware. The ‘job gets done’.

  Minimize changes to Host

software, Easy maintenance

CPU1 CPU2 CPU3 Packet Acc. RapidIO .... Producer _Queue Manager CPUx Acc 1 Acc 2 RapidIO Peri x … Queue Controller DMA Consumer

(34)

(35)

Hardware Queue Operation

 Push to a queue

 Host write pointer of new descriptor to a queue register.

 Queue manager links (modify the link RAM) the new

descriptor to the tail (or header) of the queue.

 Tail (or header) pointer points to the new descriptor.

 Pop from a queue

 Host read a descriptor pointer from a queue register.

 Queue manager returns the descriptor pointed by the header

pointer

 Header pointer points to the next descriptor.

 Monitor queue

 Queue manager generates events when queue changes: not

empty, entry count, exceed threshold, starvation…

 Queue Diversion

 Entire queue contents can be cleared or moved to another

(36)

Shannon Hardware queue architecture

DSP core DSP core

Queue Manage Subsystem DSP core Packet DMA

(SRIO) Packet DMA (PA) VBUS Accumulation Buffer Buffer Memory . . . Link RAM Descriptor RAM Queue Manager Q1 IF Q0 IF Qx IF Queue Events

Queue Event Queue Event

Packet DMA (Internal) APDSP APDSP Queue Interrupt Queue Interrupts

(37)

Queue Manager Subsystem



Support 8192 queues



HW queues are multi-core safe without mutual

exclusion, multiple senders can use a destination

queue without restrictions



Can Notify Packet DMA when transfer is pending



Can notify DSP core when packet is pending, can

copy descriptor pointers of transferred data to

destination core’s local memory to reduce access

latency



Internal Packet DMA

 Transfer packet from one queue to another queue. Good for core to core data transfer.

(38)

Descriptor RAM

 Data elements (buffers) to be passed on queues are first described to a

descriptor region manager built into the QM.

 Although technically called descriptors, these memory elements can hold any

arbitrary data.

The size of the data

elements must be a power of 2, from 32 bytes to 8192

bytes in length.

 20 configurable memory regions (for descriptor storage)

 The number of elements in the region must be a power of 2, from 32 buffers to 4096 buffers in the region.

32 byte buffers

256 byte buffers

Memory Descriptor Region

Registers 16 0 0x1000 0x2000 0x1000 0 16 Region 0 Region 1 Region 19 … 32 16 4 256 0x2000 15 19

(39)

Linking RAM

 Linking RAM contains 1 entry for each

descriptor . Linking RAM entry is effectively an extension of the descriptor

 Linking RAM stores Forward data pointer that is critical for the PUSH / POP operations performed by the Queue Manager

 Linkage between physical address of

descriptor and physical address of Linking RAM is performed inside the QM using

information provided in the Queue Management configuration registers

 Linking RAM is typically placed in local memory for speed. This allows data

elements to be linked and unlinked in a

queue very quickly, even though the buffers themselves may be in external memory

 There is no limit to the length of a single queue, only a limit on the total number of data elements in the system.

2 configurable Linking RAM regions

Queue Contents Linking RAM

0 17

Forward Pointer Table

- - - - x - - - - - - - - - - - 5 19 x Queue 0 Queue 1 17 5 19 18

(40)

Queue Data Flow Example, Transmit

Host Processor Queue Manager Rx Queue Rx Port

INIT : Host Allocates Rx Free Descriptors

and initializes queues Interrupt Generator

Free Descriptor Queue Tx Queue TX 2 Processor Queues a packet to a Tx Queue TX 3 Port transmits the buffer being pointed to by the descriptor TX 4 Port Posts Packet Descriptor to return Queue Tx Port TX 1 Processor fetches a descriptor to fill with the data to transmit

(41)

Queue Data Flow Example, Receive

Host Processor Queue Manager Rx Queue Rx Port

INIT : Host Allocates Rx Free Descriptors and initializes queues

RX 1 Port Fetches a Free Descriptor and transfers the data to the buffer pointed to by the descriptor RX 2 Port Posts Packet to Rx Queue RX 4

Interrupt according to pacing rules or poll

Interrupt Generator RX 3 Not Empty Level Status Free Descriptor Queue Tx Queue Tx Port

Optionally prefetches the descriptor to L2 prior to interrupting

(42)

Accumulator (A Programmable DSP)

 Accumulator is used to help

DSP core efficiently POP descriptor pointers from queue.

 Accumulator pop descriptor

pointer from queue and write to accumulation memory

(normally in DSP local memory).

 Accumulator generates

interrupt to DSP core

according to interrupt pacing configuration.

 Two Accumulator (PDSP)

 One generate 32 interrupts, each for one queue.

 The other generate 16

interrupts, each is combined event for 32 queues. Totally monitor 16x32 queues.

DSP core Accumulation Memory

(Descriptor Pointer Array)

Queue Manager Monitor Queue Changes APDSP Queue Events Queue Interrupts Descriptor RAM Timer for Interrupt Pacing

(43)

Hardware queue Performance Consideration



Push Operation

 1~4 words write. Since it is post operation, normally, do not stall DSP core.



Pop Operation

 1~4 words read. Stall DSP core about 80~100 cycles.

 Accumulator (PDSP) can pop the descriptors to DSP local memory which will save DSP cycles dramatically.



Descriptor Access

 Write/read full descriptor may consume many cycles.

 For most applications, DSP core can initialize all descriptors during initialization, and only write/read few fields of the descriptor during run time.

(44)

Outline



C6678 DSP Overview



Multi-core DSP programming





(45)

Shared Data in the L2 SRAM of transmitter

 If cache is enabled, Core Y needs invalidate cache before read Data Switch Fabric Center DDR2 SDRAM L2 RAM L2 Cache DSP Core X L1 Cache L2 RAM L2 Cache DSP Core Y L1 Cache

(46)

Shared Data in the L2 SRAM of receiver

Data Switch Fabric Center DDR2 SDRAM L2 RAM L2 Cache DSP Core X L1 Cache L2 RAM L2 Cache DSP Core Y L1 Cache

 If cache is enabled, Core X needs write back cache after write

(47)

Shared Data in the shared memory

Data Switch Fabric Center Shared L2 or DDR L2 RAM L2 Cache DSP Core X L1 Cache L2 RAM L2 Cache DSP Core Y L1 Cache

 If cache is enabled, Core X needs write back cache after write; core Y needs invalidate cache before read

(48)

Use IPC register for inter-core communication

Configuration Switch Fabric L2 RAM L2 Cache DSP Core X L1 Cache L2 RAM L2 Cache DSP Core Y L1 Cache IPC

 Interrupt is generated for Core Y

(49)

Inter-core Data Block exchange with EDMA

Data Switch Fabric Center EDMA L2 RAM L2 Cache DSP Core X L1 Cache L2 RAM L2 Cache DSP Core Y L1 Cache Data Data

 Interrupt is generated for Core Y

(50)

Inter-core data exchange through hardware queue

(Packet DMA copy)

Data Switch Fabric Center Packet DMA L2 RAM L2 Cache DSP Core X L1 Cache L2 RAM L2 Cache DSP Core Y L1 Cache Src Que Dst Que

 Core X simply push data to Source Queue

 Packet DMA transfer the data Dest Queue

 Core Y simply pop data from Dest Queue

 If Queue buffers are in L2 RAM, Software on both cores do not need maintenance the cache coherency.

(51)

Inter-core data exchange through hardware queue

(Zero Copy)

 Core X push data to Shared Queue, Core Y pop data from Shared

Queue

 Multi-core can access Shared Queue simultaneously without mutual

exclusion

 Software need maintenance the cache coherency.

Data Switch Fabric Center Queue Manager L2 RAM L2 Cache DSP Core X L1 Cache L2 RAM L2 Cache DSP Core Y L1 Cache Shared Queue

(52)

Outline



C6678 DSP Overview





(53)

Shared resources



Internal shared L2 and External Shared memory (DDR)

 Each core access shared memory independently. Arbitration

handled by switch fabric and end-point arbiters.



Shared on-chip Peripherals

 Configuration: Typically done at startup to set the operating

mode of a particular logic block (e.g. DDR settings). Should be done by a single core as part of the boot process.

 Use:

 Peripherals with Hardware queue, Each core access hardware

queue independently. Arbitration handled by queue manager.

 Ethernet, SRIO on Shannon…

 Multi-channel peripherals can be split amongst the cores for

concurrent, orthogonal control

 EDMA, TSIP, Timer…

 Single-channel peripherals can be controlled by a single master,

servicing the other cores if needed. Or mutual exclusively used by multi-masters through semaphore.

(54)

System-level prioritization for arbitration



A user-specified priority may be assigned to:



Any DSP core accesses



Any EDMA, sRIO, Ethernet, … on-chip transfers



Each of the master ports are assigned a priority (8

(55)

Hardware Semaphores on Shannon for atomic accesses

 What function does the Semaphore module provide?

 A method to control who accesses a shared resource

 Provides accesses for shared resources in an atomic manner

 Read-modify-write sequence is not broken

 Features of the Semaphore module

 Binary Semaphore

 Contains 64 semaphores to be used within the system

 Two methods of accessing a semaphore resource

 Direct Access

 A core directly accesses a semaphore resource. If free, the semaphore will be granted. If not, the semaphore is not granted

 Useful if the system can afford to poll for the semaphore

 Indirect access

 A core indirectly accesses a semaphore resource by writing to it. Once it is free an interrupt will notify the DSP core that it is available.

(56)

Outline



C6678 DSP Overview



(57)

Shannon RapidIO Gen 2 Features and Enhancements

4 lanes – options include 2x

Baud rates: 5 Gbaud per lane in addition to 1.25, 2.5, 3.125 Gbaud per lane

DeviceID Support

 16 Local DeviceIDs (up

from 1)

 8 Multicast IDs (up from 3)

24 Interrupt outputs (up from 8)

 Messaging

 Type 9 Packets Support (Data Streaming)

 Type 11 Message –

classification improvements

 DirectIO

 8 Load/Store (DirectIO) Units (up from 4)

 Shadow register sets for LSUs to simplify management and minimize overhead

 Provide up to 1MB block transfers (up from 4KB)

 Packet Forwarding with Reset

Isolation

(58)

89

RapidIO – Topology Examples

C6678 DSP C6678 DSP C6678 DSP C6678 DSP C6678 DSP C6678 DSP Mesh C6678 DSP C6678 DSP C6678 DSP C6678 DSP Chain SRIO Switch C6678 DSP C6678 DSP C6678 DSP C6678 DSP Switch C6678 DSP C6678 DSP C6678 DSP C6678 DSP C6678 DSP C6678 DSP C6678 DSP C6678 DSP Ring

(59)

Packet Accelerator Subsystem On Shannon



3 Port Ethernet Switch

 Port 0: Internal hardware queue port

 Port 1: SGMII 0 Port, 1Gbps

 Port 2: SGMII 1 Port, 1Gbps



Packet Accelerator (PA)

 L2, L3, and L4 packet processing

 1.5M packets per sec



Security Accelerator (SA)

 Encryption/Decryption  IPSEC ESP  IPSEC AH  SRTP  3GPP 91

(60)

IEEE 1588 support



EMAC hardware supports classifying at the physical level

ingress and egress frames as timing synchronization

frames and the timestamp is recorded.



A software algorithm running on DSP core would then run

the algorithm to calculate the delay and adjust local time

accordingly.

Device A is the master device

Device B is the slave device

Message B is used to send the actual transmit time (tA) of Message A

Message D is used to send the actual receive time (rC) of Message C

wire time in one direction ((rC - tA)-(tC - rA))/2

(61)

TSIP Overview

 1024 8-bit timeslots receive and transmit.

 8 links of 128 timeslots at 8.192 Mbps.

 Two clock and frame sync inputs.

 Independent clocking – 1 receive clock and 1 transmit

clock.

 Redundant/common clocking – 1 receive/transmit clock

(62)

Shannon PCIe Interface



Nyquist/Shannon incorporates PCIe interface with

the following characteristics:

 Two SERDES lanes running at 5 GBaud/2.5GBaud

 Gen2 compliant

 Three different operational modes (default defined by pin inputs at power up; can be overwritten by software):

 Root Complex (RC)  End Point (EP)

 Legacy End Point

 Single Virtual Channel (VC)

 Single Traffic Class (TC)

 Maximum Payloads

 Egress – 128 bytes  Ingress – 256 bytes

 Configurable BAR filtering, IO filtering and configuration

filtering

(63)

Remaining Peripherals & System Elements (1/2)

EMIF16

 Supports NAND flash memory, up to 256MB  Supports NOR flash up to 16MB

 Supports asynchronous SRAM mode, up to 1MB  Used for booting, logging, announcement, etc.

64-Bit Timers

 Total of 16 64-bit timers

 One 64-bit timer per core is dedicated to serve as a watchdog (or may be used as a general purpose timer)

 Eight 64-bit timers are shared for general purpose timers

 Each 64-bit timer can be configured as two individual 32-bit timers  Timer Input/Output pins

 Two timer Input pins

 _{Two timer Output pins}

 Timer input pins can be used as GPI

(64)

Remaining Peripherals & System Elements (2/2)

UART Interface – Operates at up to 128,000 baud I2C Interface

 Supports 400Kbps throughput  Supports full 7-bit address field  Supports EEPROM size of 4 Mbit

SPI Interface

 Operates at up to 66MHz  Supports two chip selects  Support master mode

GPIO Interface

 16 GPIO pins

 Can be configured as interrupt pins

(65)