Going Linux on Massive Multicore

(1)

Embedded Linux Conference Europe 2013

Going Linux on Massive Multicore

Marta Rybczyńska

(2)

Agenda



_Architecture



_{Linux Port}



_Core



_Peripherals



_Debugging

(3)

Agenda



_Architecture



_{Linux Port}



_Core



_Peripherals



_Debugging

(4)

First MPPA

®

-256 Chips with TSMC 28nm CMOS

256 Processing Engine cores + 32 Resource Management cores



_{256 (+32) user-programmable,}

generic cores



_{Architecture and software}

scalability



_{High processing performance}



_{High energy efficiency}



_{Execution predictability}

(5)

The MPPA-256 Processor (1)



_{16 compute clusters}



_{4 Input/Output clusters with}

rich peripheral access



_{DDR, PCI Express, Ethernet,}

Flash, GPIO, SPI, I2C etc



_{Connected by a}

(6)

The MPPA-256 Processor (2)



_{Compute cluster includes:}



_{16+1 cores}



_{Shared memory}



_{Network-on-Chip Interfaces}



_{Debug unit (DSU)}



_{IO cluster includes:}



_{4 cores}



_{Shared memory}



_Peripherals

(7)

The MPPA-256 Processor Core ISA



_{Same on IO and compute}

cluster



_{5-issue Very Long Instruction}

Word (VLIW)



_{32/64-bit IEEE 754 floating point}

unit



_{DSP instructions}



_{Advanced bitwise instructions}



_{Hardware loops}



_MMU

(8)

Kalray Software Development Kit for MPPA-256 (1)



_{Standard Embedded C/C++ Programming}



_{GCC, GNU binutils, newlib, uClibc, ...}



_{Simulators, Profilers, Debuggers & System Trace}



_GDB



_{Cycle-based simulator}



_{Hardware System trace}



_{Operating systems & Device Drivers}



_{NodeOS on the compute clusters}



_{One thread per core}



_{RTEMS port on the IO clusters}

(9)

Kalray Software Development Kit for MPPA-256 (2)



_{Programming Models}



_{Dataflow Programming}



_{FPGA Style}



_{POSIX-level Programming}



_{DSP Style}



_{Streaming Programming}



_{GPU Style}

(10)

The MPPA POSIX Programming Model



_{POSIX-like process management}



_{Spawn 16 processes from the IO cluster}



_{Process execution on the 16 clusters start with main(argc,argv) and}

environment



_{On each cluster, support to up to 16 concurrent threads}



_{Inter-process communication (IPC)}



_{POSIX file descriptor operations on 'NoC Connectors'}



_{Extension to the PCIe interface with the 'PCIe Connectors'}



_{Rich communication and synchronization}



_{Multi-threading inside clusters}



_{Standard GCC/G++ OpenMP support}



_{POSIX threads interface}

(11)

The MPPA & Linux



_{High-performance SMP on the IO clusters}



_{Device Drivers}



_{SPI, I2C, GPIO (sensors, small peripherals)}



_{Full network stack}



_{Existing user libraries}

(12)

Agenda



_Architecture



_{Linux Port}



_Core



_Peripherals



_Debugging

(13)

History



_2.6.33-rc4



_2.6.33



_2.6.35



_3.2



_{SMP, first specific drivers}



_3.10

(14)

Features



_{The kernel}



_{Single core or SMP}



_{Device-tree, generic headers}



_{Supports userspace with FDPIC}



_{Initial tracing support}



_Drivers



_{Generic drivers tested}



_{Specific drivers}



_{PCI Express, console,...}



_{Userspace: buildroot-based with custom toolchain &}

(15)

Agenda



_Architecture



_{Linux Port}



_Core



_Peripherals



_Debugging

(16)

Optimization Case: Atomic Ops



_{Single core case: disable/enable interrupts}



_{Cheap on MPPA}



_{SMP case: compare-and-swap}



_{Huge performance/code size impact}



_{Last version uses implementation details of the caches and}

the write buffer



_{Return from experience}

(17)

Lessons Learned 1: Read the Specs Carefully (1)



_{SMP version started deadlocking}



_{Spinlock ticket value showed corruption}



_Debugging



_{Spinlock ordering problem?}



_{Careful platform code analysis}



_{Enabled spinlock debug}



_{Detailed simulator trace analysis}

(18)

Lessons Learned 2: Use the GCC



_{K1 core is a VLIW: multiple}

instructions (one bundle) per

cycle



_{High performance gain}



_{GCC handles it well}



_{Manual bundling OK for short code,}

hard for longer ones



_Result



_{Preferring built-ins over asm inlines}



_{Less assembly in the code}

(19)

Agenda



_Architecture



_{Linux Port}



_Core



_Peripherals



_Debugging

(20)

Interrupt Controllers



_{Two-level interrupt control}



_{One private to each core (timers, NoC)}



Shared between 4 cores (PCI Express,

Ethernet, other peripherals)



_{Basic operations on the core}

interrupt controller



_{Multiple devices on the same}

line



_{Configuration issue: configuring lanes}

for devices

Shared

interrupt

controller

Core 3

Core 2

Core 1

Core 0

Interrupt lines

from peripherals

Interrupt lines from

internal peripherals

4 Interrupt lines from

the shared controller

(21)

Memory Areas



_{Two areas}



_{Big DDR}



_{Small internal shared memory}



_{Different latency, optimization opportunity}



_{Shared memory is visible through PCI Express}



_Solution:



_{Allocation in the DDR by default}

(22)

The PCI Express Driver (1)



_{PCI Express is the main interface to the MPPA}



_{Host (Linux) driver is ~10KLOC}



_{Boot, synchronization, DMA, message passing,...}



_{Two interfaces per MPPA (visible as two PCI Express}

devices)



_{Host-side framework in Linux is mature}



_{Less standard on the device side}



_{Code reuse (protocols)}

(23)

The PCI Express Driver (2)



_{PCI Express allows memory-mapping on Host of}

memory zones of the device



_{Hardware resources}



_{Doorbell registers (shared, in BAR0)}



Shared memory accessible by the Host directly (BAR2)



_{Both DDR and shared memory accessible by DMA}



_{Cache effects}

(24)

Boot



_{Using custom bootloaders at the moment}



_{Boot of IO cluster by PCI Express}



_{Initiated by the Host}



_{First code in shared memory}



_{Then complete image in shared memory+DDR}



_{Boot of clusters from the IO}



_{Cluster executables in IO cluster memory}



_{Also transferred by PCI Express}

(25)

Network-on-Chip



_{The way to communicate}

between clusters



_{High performance interface}



_{Shared resources}



_{Used in different places}



_{Boot, drivers in the kernel space}



_{User code (IPC)}

(26)

The Ethernet Device Drivers



_{Up to 8x10Gbit/s}



_{MAC controller, PHY}



_{Uses NoC}



_{Distributed network stack}



_{Packet dispatch to different clusters}

(27)

Agenda



_Architecture



_{Linux Port}



_Core



_Peripherals



_Debugging

(28)

(29)

(30)

(31)

Agenda



_Architecture



_{Linux Port}



_Core



_Peripherals



_Debugging

(32)

Lessons Learned 3: What was Difficult



_{Documentation missing}



_{With great exceptions!}



_{Need to read other platform code}



_{No examples (yet) for our architecture}



_{Reaching limits of the existing software}



_{Not always a bug in the Linux port}



_{Trying to be mainline-compatible}



_{Initial port hard to split between developers}



_{Time-consuming, needed alternative for early testing}

(33)

Lessons Learned 4: What was Easy



_Debugging



_{Mixed mode depending on the testcase: simulator, gdb,}

FPGA



_{Simulator trace postprocessing}



_{Generic headers cover a good part of the code}

(34)

Future Plans



_{Complete driver support (Ethernet, NoC, others...)}



_{Optimized MMU}



_{Replace generic implementations with optimized ones}



_{Public release}

(35)

Marta Rybczynska

[email protected]

http://www.kalray.eu