Texas Instruments
TMS320C6678 (Shannon)
DSP Training
Brighton Feng
Outline
C6678 DSP Overview
Multi-core DSP programming
Interconnection and resource sharing
4
Shannon Functional Diagram
• Multi-Core SoC
• Fixed/Floating C66x™ Core
– Eight cores @ 1.0 GHz, 0.5 MB Local L2 – 4.0 MB shared memory
– 256 GMAC, 128 GFLOP • Navigator
– Multicore eco system • Packet Infrastructure • Network Coprocessor
– IP Network solution for IP v4/6 – 1.5M packets per sec (1Gb Ethernet
wire-rate)
– IPsec, SRTP, Air Interface Encryption fully offloaded
• 3-port GigE Switch (Layer 2) • Low Power Consumption
– Adaptive Voltage Scaling (Smart ReflexTM) • Hyperlink 50 – 50G Expansion port – Transparent to Software • Multicore Debugging C6678 (Shannon) C66x core L2 Memory L1 D L1 P . . . 8 C66x Cores
Peripherals and I/O
sRIO
Flash PCIe TSIP
UART SPI, I2C
System Elements Power Mgt Debug EDMA SysMon Memory System DDR -3 64b Shared Memory Multicore Memory Controller Hyperlink50 TeraNet 2 Multic or e Na vig at or Enet Switch SGMI I SGMI I Packet CoProcessor Crypto/IPSec CoProcessor
100% backward object code compatible
Increased Fixed and floating
Point capability Improved support for
complex arithmetic and matrix computation
Enhanced DSP core
Native instructions for IEEE 754, SP&DP Advanced VLIW architecture 2x registers Enhanced floating-point add capabilities Advanced fixed-point instructions Four 16-bit or eight 8-bit MACs Two-level cache SPLOOP and16-bit instructions for smaller code
size Flexible level one
memory architecture iDMA for rapid
data transfers between local memories C66x C64x+ C64x C67x C67x+
FLOATING-POINT VALUE FIXED-POINT VALUE
Pe rform anc e improv em ent
C66x core block diagram
C66x Core
Data Path 1 Data Path 2
A Register File A0 – A31 B Register File B0 –B31 Instruction Decode Instruction Dispatch Instruction Fetch Control Registers Interrupt Control In-Circuit Emulation D2 S2 L2 S1 L1 + + + + M1 D1 M2 x x x x SPLOOP Buffer x x x x x x x x x x x x x x x x x x x x x x x x x x x x 256 Bits 2x64 Bits + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
Key Improvements of C66x
4x Multiply Accumulate improvement
Enhanced complex arithmetic and matrix operations
2x Arithmetic and Logical operations
improvement
Support the floating point arithmetic. Single
precision floating point operation capability
same as 32 bit fixed point operation capability
division and square root is supported by
C64x+
C66x Comparison
Operation Precision Operations
per cycle on C64x+ Operations per cycle on C66x Function Unit MAC Real 8 x 8 2 x 4 = 8 2 x 8 = 16 M1, M2 Real 16 x 16 2 x 2 = 4 2 x 8 = 16 M1, M2 Real 32 x 32 2 x 1 = 2 2 x 4 = 8 M1, M2 Complex (16,16) x (16,16) 2 x 1 = 2 2 x 4 = 8 M1, M2 Complex (32,32) x (32,32) N/A 2 x 1 = 2 M1, M2 Arithmetic Logical 8 bit 4 x 4 = 16 4 x 8 = 32 L1, L2, S1, S2 16 bit 4 x 2 = 8 4 x 4 = 16 L1, L2, S1, S2 32 bit 4 x 1 = 4 4 x 2 = 8 L1, L2, S1, S2 Memory Access 8 bit, 16 bit, 32 bit, 64 bit 2 x 1 = 2 2 x 1 = 2 D1, D2
Outline
C6678 DSP Overview
Multi-core DSP programming
Memory Architecture Overview
Shannon Memory Architecture
Improvement
Programming model
Interconnection and resource sharing
TCI6486 Memory Architecture
Core 0 Internal L2 RAM Core N Internal L2 RAM DMA SCR (Core speed)/3 128 bit Shared L2 Control (Core speed)/2 256 bit Shared L2 RAM External Memory Shared L2 ROM EDMA . . . S M S M MShannon Memory Architecture
Core 0 Internal L2 RAM Core N Internal L2 RAM DMA SCR (Core speed)/3 128 bit Shared Memory Control (Core speed)/2 256 bit Shared L2 RAM External Memory EDMA . . . S S M EDMA MOutline
C6678 DSP Overview
Multi-core DSP programming
Memory Architecture Overview
Shannon Memory Architecture
Improvement
Programming model
Interconnection and resource sharing
Addition of XMC
Bring over existing EMC MDMA path
Fat pipe to external (and internal) shared memory
Bus width: 256 instead of 128 bits
Clock rate: CPUCLK/2 instead of CPUCLK/3 Optimize requests for MSMC / DDR3 memory
L2 line allocations and evictions are split into sub-lines of 64 bytes
Memory Protection and Address Extension (MPAX) support
16 segments of programmable size (powers of 2: 4KB to 4GB) Each segment maps a 32-bit address to a 36-bit address.
Each segment controls access: supervisor/user, R/W/X, (non-)secure Memory protection for shared internal MSMC memory and external DDR3
memory
Multi-stream Prefetch support
Program prefetch buffer up to 128 bytes Data prefetch buffer up to 8 x 128 bytes
Prefetch enabled/disabled on 16MB ranges defined in MAR Manual flush for coherence purposes
MAR Register Extension
• L2 memory controller extends the MAR registers by adding the “PFX” field, L2 memory controller uses this bit to convey XMC whether a given address range is prefetchable.
MSMC Block Diagram
RAM banks, 256-bits per bank CGEM Slave Port CGEM Slave Port System Slave Port for shared SRAM (SMS) System Slave Port for external memory (SES) MSMC System Master Port MSMC EMIF Master Port MSMC Datapath x N CGEM coresArbitration for Banks
256 256 256 256 256 Memory Protection and Extension Unit (MPAX) 256 256 VBUSM 256 events VBUSM 256 VBUSM 256 Memory Protection and Extension Unit (MPAX) MSMC Core EMIF – 64 bit DDR3 SCR SCR VBUSM 256
EDC for SRAM
One slave interface per C66x
Megamodule (256 bits @ CPUCLK/2) Uses a 36 bit address extended inside
a C66x Megamodule core
Two slave interfaces (256 bits @ CPUCLK/2) for access from system masters
SMS interface for accesses to MSMC SRAM space
SES interface for accesses to DDR3 space
Both interfaces support memory protection and address extension
One master interface (256-bits @ CPUCLK/2) for access to the DDR3 EMIF
One master interface (256 bits @ CPUCLK/2) for access to system slaves
MSMC Shared Memory
4 banks x 2 sub-banks, sub-bank are 256-bit
wide.
Reduces conflicts between C66x Megamodule cores and system masters
Features a dynamic fair-share bank arbitration for each transfer
Supports bandwidth management. Avoid
indefinite starvation for lower priority requests
due to higher priority requests
Features Not Supported
Cache coherency between L1/L2 caches in C66x Megamodule cores and MSMC memory
Cache coherency between XMC prefetch buffers and MSMC memory
C66x Megamodule to C66x Megamodule cache coherency via MSMC
MPAX Units
MPAX stands for “Memory Protection and
Address Extension”
There are N+2 MPAX units in a system with N
C66x Megamodules
N MPAX units for all requests from N C66x
Megamodules to internal shared memory, external shared memory or any system slave
1 MPAX unit for all requests from any system master to internal shared memory
1 MPAX unit for all requests from any system master to external shared memory
Each MPAX unit operates on a number of
segments of programmable size
Each segment maps a 32-bit address to a 36-bit address.
Number of Segments
Each C66x Megamodule has 16 segments which
control direct (load/store) requests to internal
shared memory, external shared memory and
any other system slave.
Any master identified by a privilege ID has
8 segments for requests to internal shared memory
8 segments for requests to external shared memory.
Some masters work on behalf of other masters.
They will inherit the privilege ID of their
commanding master. As such, each C66x
Megamodule also has
8 segments for indirect (DMA) requests to internal shared memory
8 segments for indirect (DMA) requests to external shared memory
Segment Definition
Each segment is defined by a base address and a size
The segment size can be set to any power of 2 from 4K to
4GB
The segment base address is constrained to power-of-2
boundary equal to size.
One would expect that each request should find one matching segment, however ...
a request may find two or more matching segments, in
which case segments with higher ID take priority over segments with lower ID. This allows
creating non-power of 2 segments
creating 3 segments with just 2 segment definitions ...
a request may find no matching segment, in which case an
error is reported in Memory protection fault reporting registers (XMPFAR, XMPFSR)
XMC Segment Registers
XMPAXH/XMPAXL[15-0]
MPAX Default Memory Map
Segment 1 Segment 0 Disabled Segment 2 Disabled Segment 3 Disabled Segment 4 Disabled Segment 5 Disabled Segment 6 Disabled Segment 7 Disabled Segment 8 Disabled Segment 9 Disabled Segment 10 Disabled Segment 11 Disabled Segment 12 Disabled Segment 13 Disabled Segment 14 Disabled Segment 15 CGEM Logical 32-bit Memory MapUpper 60GB
System Physical 36-bit Memory Map
Lower 4GB 0000_0000 7FFF_FFFF 8000_0000 FFFF_FFFF (not remappable) 0BFF_FFFF 0C00_0000 0:FFFF_FFFF 0:8000_0000 0:7FFF_FFFF 0:0C00_0000 0:0BFF_FFFF 0:0000_0000 1:0000_0000 F:FFFF_FFFF 7:FFFF_FFFF 8:0000_0000
BADDR = 00000h; RADDR = 000000h; Size = 2GB BADDR = 80000h; RADDR = 800000h; Size = 2GB
8:8000_0000 8:7FFF_FFFF
XMC configures MPAX segments 0 and 1 so that C66x Megamodule can access system memory.
The power up configuration is that segment 1 remaps 8000_0000 – FFFF_FFFF in C66x
Megamodule’s address space to 8:0000_0000 – 8:7FFF_FFFF in the system address map.
This corresponds to the first 2GB of address space
MPAX MSMC Aliasing Example
BADDR = 0C000h; RADDR = 00C000h; Size = 2MB BADDR = 20000h; RADDR = 00C000h; Size = 2MB
CGEM 32-bit Memory Map 0000_0000 FFFF_FFFF (not remappable) 0BFF_FFFF 0Cxx_xxxx 0:0C1F_FFFF 0:0C00_0000 BADDR = 21000h; RADDR = 00C000h; Size = 2MB
20xx_xxxx 21xx_xxxx “Fast” MSMC RAM MSMC RAM Alias 1 MSMC RAM Alias 2 MSMC RAM (2MB)
Example shows 3 segments to map the MSMC RAM address space into C66x Megamodule’s address space as three distinct 2MB ranges. By programming the MARs accordingly, the three segments could have different semantics.
Accesses to MSMC RAM via this alias do not use the “fast RAM”
Copyright © 2010 Texas Instruments. All rights reserved.
MPAX Overlayed Segments Example
BADDR = 00000h; RADDR = 000000h; Size = 2GB BADDR = 80000h; RADDR = 080000h; Size = 2GB Segment 1
Segment 0
BADDR = C0007h; RADDR = 050042h; Size = 4K Segment 2 Disabled Segment 3 Disabled Segment 4 Disabled Segment 5 Disabled Segment 6 Disabled Segment 7 Disabled Segment 8 Disabled Segment 9 Disabled Segment 10 Disabled Segment 11 Disabled Segment 12 Disabled Segment 13 Disabled Segment 14 Disabled Segment 15
CGEM 32-bit Memory Map
Upper 60GB
System 36-bit Memory Map Lower 4GB 0000_0000 7FFF_FFFF 8000_0000 FFFF_FFFF (not remappable) 0BFF_FFFF 0C00_0000 0:FFFF_FFFF 0:8000_0000 0:7FFF_FFFF 0:0C00_0000 0:0BFF_FFFF 0:0000_0000 1:0000_0000 F:FFFF_FFFF 0:5004_2xxx 0:C000_7xxx C000_7xxx
segment 1 matches 8000_0000 through FFFF_FFFF,
and segment 2 matches C000_7000 through C000_7FFF.
Because segment 2 is higher priority than segment 1,
its settings take priority, effectively carving a 4K hole in segment 1’s 2GB address space.
Furthermore, it maps this 4K space to 0:5004_2000 -
0:5004_2FFF, which overlaps the mapping established by segment 0. This physical address range is now
Outline
C6678 DSP Overview
Multi-core DSP programming
Memory Architecture Overview
Shannon Memory Architecture
Improvement
Programming model
Interconnection and resource sharing
single program image
L2memory C6000 Core 0 L1 Prog L1 Data C6000 Core 1 L1 Prog L1 Data C6000 Core 2 L1 Prog L1 Data L2memory L2memoryApp.out App.out App.out
code and read/write data Shared L2 or DDR memory App.out Shared code and Read only data Data 0 Data 1 Data 2
Data 0 Data 1 Data 2
Same image on each DSP core
Aliased addressing used for DSP core to access local L2 DNUM DSP core register for:
Global addressing when programming EDMA3, SRIO, …
Shannon MPAX enables easy single program image
M P A X M P A X code1 data2 data2 code2 data3 data3 MSMC RAM internal External memory code1 data2 code2 data3 MSMC RAM internal External memorySoC address space CGEM address space (1)
code1 data2 code2 data3 MSMC RAM internal External memory
CGEM address space (n)
multiple program image
L2memory C6000 Core 0 L1 Prog L1 Data C6000 Core 1 L1 Prog L1 Data C6000 Core 2 L1 Prog L1 Data L2memory L2memory App0.out App1.out C6000 Core 0 L1 Prog L1 Data C6000 Core 1 L1 Prog L1 Data C6000 Core 2 L1 Prog L1 DataApp0.out App1.out App2.out
App2.out
Shared L2 or DDR memory
Data 0 Data 1 Data 2
Data 0 Data 1 Data 2
Each DSP core has its image
Static split of DDR2 per DSP core
43
Shannon Software
• Flexible development
environment for the customer.
• Customer can choose to develop
their application using all or any one of the software layers.
• Will contain following software
layers
– BIOS and Linux Operating System support
– Chip Support Library
– Platform Development Kit
– Inter Core Communication
– Optimized DSP functions library
– Optimized Audio, Video and Speech codecs
– Voice Gateway Demonstration Kit
– Video Transcoding Demonstration Kit
– Demonstration applications
C6678 Software
Operating System w/ Boot Loader BIOS
Full Silicon Entitlement Multi-core Entitlement
Linux
Chip Support Library Platform Development Kit
Inter Core Communication
Voice Gateway Demonstration Kit Video Transcoding Demonstration Kit Speech Codec DSPLIB Audio Codec Video Codec Demo App Customer Application
Data
Visualization
Shannon Debug
Best Multicore Debug and Visualization Debug enabled Multicore SoC
Debug visibility at core, across multicore and for SoC
45 C6678 (Shannon) C66x core L2 Mem or y L1 D L1 P . . . 8 C66x Cores
Peripherals and I/O
sRIO
Flash PCIe TSIP
UART SPI, I2C
System Elements Power Mgt Debug EDMA SysMon Memory System DDR -3 64b Shared Memory Multicore Memory Controller TeraNet 2 Multic or e Na vig at or Enet Switch SGMI I SGMI I Packet CoProcessor Crypto/IPSec CoProcessor E T B TRACE TRA CE Hyperlink50
Outline
C6678 DSP Overview
Multi-core DSP programming
Interconnection and resource sharing
Interconnection Architecture
Shannon Hardware queue
Inter-core communication
Shared Resource Management
Shannon Switch Fabric
MSMC_SS CPU/2 256b VBUSM SCR Shared L2 RAM CPU/3 128b VBUSM SCR S S SRIO M PCIe QM_SS M M 16ch DMA TC0 M M TC1 M M DDR3 S XMC 64ch DMA M TC2 M TC3 M TC4 M TC5 64ch DMA M TC6 M TC7 M TC8 M TC9 CPU/3 32b VBUSP SCR PA_SS M VUSR M VUSR S TSIP 0,1 M QM_SS PCIe S S EMIF16 S CONFIG M EDMA_0 EDMA_1,2 GEM S GEM M S GEM M S GEM M S GEM M S GEM M S GEM M S GEM M S MOutline
C6678 DSP Overview
Multi-core DSP programming
Interconnection and resource sharing
Interconnection Architecture
Shannon Hardware queue
Inter-core communication
Shared Resource Management
Hardware Queue Architecture
packetized Data transfer architecture
designed to minimize DSP core
interaction while maximizing memory
and bus efficiency
the key communication platform for TI’s
future Infrastructure DSPs
Used by following peripherals in
Shannon:
Serial RapidIO, Packet Accelerator
Each module contains its own DMA to
transfer associated data with the ‘jobs’, No
CPU resources involved
Queue 1..x
Hardware Queue
Producer writes ‘jobs’ into a Queue. Consumer reads ‘jobs’ from the
Queue
Supports Multiple In – Multiple Out
Multiple Producers can write to the
same Queue
Used to share common Hardware
Multiple Consumers can read from
the same Queue
Used for Load Balancing
Abstracts the Consumer
Consumer can be a Hardware IP
(accelerator, peripheral) or a software (ie a CPU core)
Transparent for the Producer
‘Easy’ to upgrade to new
hardware. The ‘job gets done’.
Minimize changes to Host
software, Easy maintenance
CPU1 CPU2 CPU3 Packet Acc. RapidIO .... Producer Queue Manager CPUx Acc 1 Acc 2 RapidIO Peri x … Queue Controller DMA Consumer
Hardware Queue Operation
Push to a queue
Host write pointer of new descriptor to a queue register.
Queue manager links (modify the link RAM) the new
descriptor to the tail (or header) of the queue.
Tail (or header) pointer points to the new descriptor.
Pop from a queue
Host read a descriptor pointer from a queue register.
Queue manager returns the descriptor pointed by the header
pointer
Header pointer points to the next descriptor.
Monitor queue
Queue manager generates events when queue changes: not
empty, entry count, exceed threshold, starvation…
Queue Diversion
Entire queue contents can be cleared or moved to another
Shannon Hardware queue architecture
DSP core DSP core
Queue Manage Subsystem DSP core Packet DMA
(SRIO) Packet DMA (PA) VBUS Accumulation Buffer Buffer Memory . . . Link RAM Descriptor RAM Queue Manager Q1 IF Q0 IF Qx IF Queue Events
Queue Event Queue Event
Packet DMA (Internal) APDSP APDSP Queue Interrupt Queue Interrupts
Queue Manager Subsystem
Support 8192 queues
HW queues are multi-core safe without mutual
exclusion, multiple senders can use a destination
queue without restrictions
Can Notify Packet DMA when transfer is pending
Can notify DSP core when packet is pending, can
copy descriptor pointers of transferred data to
destination core’s local memory to reduce access
latency
Internal Packet DMA
Transfer packet from one queue to another queue. Good for core to core data transfer.
Descriptor RAM
Data elements (buffers) to be passed on queues are first described to a
descriptor region manager built into the QM.
Although technically called descriptors, these memory elements can hold any
arbitrary data.
The size of the data
elements must be a power of 2, from 32 bytes to 8192
bytes in length.
20 configurable memory regions (for descriptor storage)
The number of elements in the region must be a power of 2, from 32 buffers to 4096 buffers in the region.
32 byte buffers
256 byte buffers
Memory Descriptor Region
Registers 16 0 0x1000 0x2000 0x1000 0 16 Region 0 Region 1 Region 19 … 32 16 4 256 0x2000 15 19
Linking RAM
Linking RAM contains 1 entry for each
descriptor . Linking RAM entry is effectively an extension of the descriptor
Linking RAM stores Forward data pointer that is critical for the PUSH / POP operations performed by the Queue Manager
Linkage between physical address of
descriptor and physical address of Linking RAM is performed inside the QM using
information provided in the Queue Management configuration registers
Linking RAM is typically placed in local memory for speed. This allows data
elements to be linked and unlinked in a
queue very quickly, even though the buffers themselves may be in external memory
There is no limit to the length of a single queue, only a limit on the total number of data elements in the system.
2 configurable Linking RAM regions
Queue Contents Linking RAM
0 17
Forward Pointer Table
- - - - x - - - - - - - - - - - 5 19 x Queue 0 Queue 1 17 5 19 18
Queue Data Flow Example, Transmit
Host Processor Queue Manager Rx Queue Rx PortINIT : Host Allocates Rx Free Descriptors
and initializes queues Interrupt Generator
Free Descriptor Queue Tx Queue TX 2 Processor Queues a packet to a Tx Queue TX 3 Port transmits the buffer being pointed to by the descriptor TX 4 Port Posts Packet Descriptor to return Queue Tx Port TX 1 Processor fetches a descriptor to fill with the data to transmit
Queue Data Flow Example, Receive
Host Processor Queue Manager Rx Queue Rx PortINIT : Host Allocates Rx Free Descriptors and initializes queues
RX 1 Port Fetches a Free Descriptor and transfers the data to the buffer pointed to by the descriptor RX 2 Port Posts Packet to Rx Queue RX 4
Interrupt according to pacing rules or poll
Interrupt Generator RX 3 Not Empty Level Status Free Descriptor Queue Tx Queue Tx Port
Optionally prefetches the descriptor to L2 prior to interrupting
Accumulator (A Programmable DSP)
Accumulator is used to help
DSP core efficiently POP descriptor pointers from queue.
Accumulator pop descriptor
pointer from queue and write to accumulation memory
(normally in DSP local memory).
Accumulator generates
interrupt to DSP core
according to interrupt pacing configuration.
Two Accumulator (PDSP)
One generate 32 interrupts, each for one queue.
The other generate 16
interrupts, each is combined event for 32 queues. Totally monitor 16x32 queues.
DSP core Accumulation Memory
(Descriptor Pointer Array)
Queue Manager Monitor Queue Changes APDSP Queue Events Queue Interrupts Descriptor RAM Timer for Interrupt Pacing
Hardware queue Performance Consideration
Push Operation
1~4 words write. Since it is post operation, normally, do not stall DSP core.
Pop Operation
1~4 words read. Stall DSP core about 80~100 cycles.
Accumulator (PDSP) can pop the descriptors to DSP local memory which will save DSP cycles dramatically.
Descriptor Access
Write/read full descriptor may consume many cycles.
For most applications, DSP core can initialize all descriptors during initialization, and only write/read few fields of the descriptor during run time.
Outline
C6678 DSP Overview
Multi-core DSP programming
Interconnection and resource sharing
Interconnection Architecture
Shannon Hardware queue
Inter-core communication
Shared Resource Management
Shared Data in the L2 SRAM of transmitter
If cache is enabled, Core Y needs invalidate cache before read Data Switch Fabric Center DDR2 SDRAM L2 RAM L2 Cache DSP Core X L1 Cache L2 RAM L2 Cache DSP Core Y L1 Cache
Shared Data in the L2 SRAM of receiver
Data Switch Fabric Center DDR2 SDRAM L2 RAM L2 Cache DSP Core X L1 Cache L2 RAM L2 Cache DSP Core Y L1 Cache If cache is enabled, Core X needs write back cache after write
Shared Data in the shared memory
Data Switch Fabric Center Shared L2 or DDR L2 RAM L2 Cache DSP Core X L1 Cache L2 RAM L2 Cache DSP Core Y L1 Cache If cache is enabled, Core X needs write back cache after write; core Y needs invalidate cache before read
Use IPC register for inter-core communication
Configuration Switch Fabric L2 RAM L2 Cache DSP Core X L1 Cache L2 RAM L2 Cache DSP Core Y L1 Cache IPC Interrupt is generated for Core Y
Inter-core Data Block exchange with EDMA
Data Switch Fabric Center EDMA L2 RAM L2 Cache DSP Core X L1 Cache L2 RAM L2 Cache DSP Core Y L1 Cache Data Data Interrupt is generated for Core Y
Inter-core data exchange through hardware queue
(Packet DMA copy)
Data Switch Fabric Center Packet DMA L2 RAM L2 Cache DSP Core X L1 Cache L2 RAM L2 Cache DSP Core Y L1 Cache Src Que Dst Que
Core X simply push data to Source Queue
Packet DMA transfer the data Dest Queue
Core Y simply pop data from Dest Queue
If Queue buffers are in L2 RAM, Software on both cores do not need maintenance the cache coherency.
Inter-core data exchange through hardware queue
(Zero Copy)
Core X push data to Shared Queue, Core Y pop data from Shared
Queue
Multi-core can access Shared Queue simultaneously without mutual
exclusion
Software need maintenance the cache coherency.
Data Switch Fabric Center Queue Manager L2 RAM L2 Cache DSP Core X L1 Cache L2 RAM L2 Cache DSP Core Y L1 Cache Shared Queue
Outline
C6678 DSP Overview
Multi-core DSP programming
Interconnection and resource sharing
Interconnection Architecture
Shannon Hardware queue
Inter-core communication
Shared Resource Management
Shared resources
Internal shared L2 and External Shared memory (DDR)
Each core access shared memory independently. Arbitration
handled by switch fabric and end-point arbiters.
Shared on-chip Peripherals
Configuration: Typically done at startup to set the operating
mode of a particular logic block (e.g. DDR settings). Should be done by a single core as part of the boot process.
Use:
Peripherals with Hardware queue, Each core access hardware
queue independently. Arbitration handled by queue manager.
Ethernet, SRIO on Shannon…
Multi-channel peripherals can be split amongst the cores for
concurrent, orthogonal control
EDMA, TSIP, Timer…
Single-channel peripherals can be controlled by a single master,
servicing the other cores if needed. Or mutual exclusively used by multi-masters through semaphore.
System-level prioritization for arbitration
A user-specified priority may be assigned to:
Any DSP core accesses
Any EDMA, sRIO, Ethernet, … on-chip transfers
Each of the master ports are assigned a priority (8
Hardware Semaphores on Shannon for atomic accesses
What function does the Semaphore module provide?
A method to control who accesses a shared resource
Provides accesses for shared resources in an atomic manner
Read-modify-write sequence is not broken
Features of the Semaphore module
Binary Semaphore
Contains 64 semaphores to be used within the system
Two methods of accessing a semaphore resource
Direct Access
A core directly accesses a semaphore resource. If free, the semaphore will be granted. If not, the semaphore is not granted
Useful if the system can afford to poll for the semaphore
Indirect access
A core indirectly accesses a semaphore resource by writing to it. Once it is free an interrupt will notify the DSP core that it is available.
Outline
C6678 DSP Overview
Multi-core DSP programming
Interconnection and resource sharing
Shannon RapidIO Gen 2 Features and Enhancements
4 lanes – options include 2x
Baud rates: 5 Gbaud per lane in addition to 1.25, 2.5, 3.125 Gbaud per lane
DeviceID Support
16 Local DeviceIDs (up
from 1)
8 Multicast IDs (up from 3)
24 Interrupt outputs (up from 8)
Messaging
Type 9 Packets Support (Data Streaming)
Type 11 Message –
classification improvements
DirectIO
8 Load/Store (DirectIO) Units (up from 4)
Shadow register sets for LSUs to simplify management and minimize overhead
Provide up to 1MB block transfers (up from 4KB)
Packet Forwarding with Reset
Isolation
89
RapidIO – Topology Examples
C6678 DSP C6678 DSP C6678 DSP C6678 DSP C6678 DSP C6678 DSP Mesh C6678 DSP C6678 DSP C6678 DSP C6678 DSP Chain SRIO Switch C6678 DSP C6678 DSP C6678 DSP C6678 DSP Switch C6678 DSP C6678 DSP C6678 DSP C6678 DSP C6678 DSP C6678 DSP C6678 DSP C6678 DSP Ring
Packet Accelerator Subsystem On Shannon
3 Port Ethernet Switch
Port 0: Internal hardware queue port
Port 1: SGMII 0 Port, 1Gbps
Port 2: SGMII 1 Port, 1Gbps
Packet Accelerator (PA)
L2, L3, and L4 packet processing
1.5M packets per sec
Security Accelerator (SA)
Encryption/Decryption IPSEC ESP IPSEC AH SRTP 3GPP 91
IEEE 1588 support
EMAC hardware supports classifying at the physical level
ingress and egress frames as timing synchronization
frames and the timestamp is recorded.
A software algorithm running on DSP core would then run
the algorithm to calculate the delay and adjust local time
accordingly.
Device A is the master device
Device B is the slave device
Message B is used to send the actual transmit time (tA) of Message A
Message D is used to send the actual receive time (rC) of Message C
wire time in one direction ((rC - tA)-(tC - rA))/2
TSIP Overview
1024 8-bit timeslots receive and transmit.
8 links of 128 timeslots at 8.192 Mbps.
4 links of 256 timeslots at 16.384 Mbps.
2 links of 512 timeslots at 32.768 Mbps.
Two clock and frame sync inputs.
Independent clocking – 1 receive clock and 1 transmit
clock.
Redundant/common clocking – 1 receive/transmit clock
Shannon PCIe Interface
Nyquist/Shannon incorporates PCIe interface with
the following characteristics:
Two SERDES lanes running at 5 GBaud/2.5GBaud
Gen2 compliant
Three different operational modes (default defined by pin inputs at power up; can be overwritten by software):
Root Complex (RC) End Point (EP)
Legacy End Point
Single Virtual Channel (VC)
Single Traffic Class (TC)
Maximum Payloads
Egress – 128 bytes Ingress – 256 bytes
Configurable BAR filtering, IO filtering and configuration
filtering
Remaining Peripherals & System Elements (1/2)
EMIF16
Supports NAND flash memory, up to 256MB Supports NOR flash up to 16MB
Supports asynchronous SRAM mode, up to 1MB Used for booting, logging, announcement, etc.
64-Bit Timers
Total of 16 64-bit timers
One 64-bit timer per core is dedicated to serve as a watchdog (or may be used as a general purpose timer)
Eight 64-bit timers are shared for general purpose timers
Each 64-bit timer can be configured as two individual 32-bit timers Timer Input/Output pins
Two timer Input pins
Two timer Output pins
Timer input pins can be used as GPI
Remaining Peripherals & System Elements (2/2)
UART Interface – Operates at up to 128,000 baud I2C Interface
Supports 400Kbps throughput Supports full 7-bit address field Supports EEPROM size of 4 Mbit
SPI Interface
Operates at up to 66MHz Supports two chip selects Support master mode
GPIO Interface
16 GPIO pins
Can be configured as interrupt pins