A Versatile Network Processor based Electronics Module for the LHCb Data Acquisition System


LHCb Technical Note

Issue: 1.0

Revision: 0

Reference: LHCb DAQ 2001-132

Created: 10/23/2001 11:14 AM

Last modified: 10/23/2001 11:50 AM


Abstract

A versatile network processor based module for the LHCb Data Acquisition system is presented. The architecture of the chosen network processor and its integration on a single board are discussed. The performance for various applications within LHCb is then shown, with results obtained from both simulation and measurements. The applications of this module as a Readout Unit, a Front-end Multiplexer, a Main Event-Builder and an elementary switching element are described in detail.

Document Status Sheet

Table 1 Document Status Sheet

1. Document Title: A versatile network processor based electronics module for the LHCb Data Acquisition System

2. Document Reference Number: LHCb DAQ 2001-132

3. Issue 4. Revision 5. Date 6. Reason for change


4 A VERSATILE 8-PORT GIGABIT ETHERNET MODULE

4.1 MOTHERBOARD

4.2 PIGGY BACK BOARD

5 SOFTWARE

5.1 SOFTWARE DEVELOPMENT ENVIRONMENT

5.2 (SUB-) EVENT BUILDING

5.2.1 Ingress Event building

5.2.2 Egress Event building

6 MEASUREMENTS WITH THE REFERENCE DESIGN KIT

6.1 THE IBM NP4GS3 REFERENCE PLATFORM

6.2 TEST SET-UP

6.3 RESULTS

7 THE PATH TO A FULLY INTEGRATED MODULE


List of Figures

Figure 1 Main components of the NP4GS3 together with an indication of the standard data-flow paths. Data can be accessed and modified at the input/ingress and output/egress stages, leading to two different event-building algorithms, described below in section 5.2. One of the two DASL interfaces is always wrapped, so that each NP can send to itself. Also indicated are the various external memories.

Figure 2 Schema of the proposed network processor based module. The main components are a carrier and one or two piggyback boards. Each piggyback contains one NP4GS3 network processor and 4 Gigabit Ethernet ports. They are connected via the 4 Gigabit DASL interface. The complete module provides thus either 4 or 8 fully connected, full-duplex Gigabit Ethernet ports.

Figure 3 Motherboard for the NP4GS3 carriers. It provides power, clock and PCI to both NPs. Also shown are the 9 physical connectors (8 for the NPs, one for the CC-PC).

Figure 4 The piggyback or carrier board will house the NP4GS3 processor and its associated memory-chips.

Figure 5 The customised Interface for NPSIM. The top half shows incoming fragments with header bytes (green), data bytes (blue) and error-block bytes (red). The bottom half shows the output of the concatenated blocks.

Figure 6 Performance of Ingress event building from simulation as a function of the average incoming packet size. The straight line shows the fixed level 1 trigger rate of 1.1 MHz. The limitation on performance comes from the output link bandwidth only.

Figure 7 Performance of the Egress Event Building as a function of the average input fragment size. The green area shows the range of possible L1 trigger rates.

Figure 8 Main components of the NP4GS3 reference platform. The packet routing switchboard is not included in our set-up.

Figure 9 Test set-up for the NP4GS3 reference kit. The data are fed and read back using Tigon 2 based Gigabit Ethernet NICs. Download of NP software and monitoring of the NP is done via JTAG using the RISCWatch probe.


List of Tables

Table 1 Document Status Sheet

Table 2 Comparison of measurements with simulation results. The handling time per fragment is shown in microseconds. The measurement times are shown once raw as measured, and second corrected, with the round-trip and handling time in the NICs subtracted.


1 Introduction

Network processors are a relatively new development. The first one was introduced by C-Port (now Motorola) in 1999. Nowadays every major chip manufacturer, and many smaller ones, has one in its product line.

A network processor is a dedicated processor for network packet (=frame) handling. It provides fast memory and dedicated hardware support for frame analysis, address look-up, frame manipulation, check summing, frame classification, multi-casting and much more. All these operations are driven by software, which runs in the network processor (NP) core. These processors are usually multi-threaded in hardware; multiple threads are running at the same time with zero-overhead context switching. They were primarily designed as powerful and flexible front-ends for high-end network switches and switching routers. Because they are software driven they can easily be customised to various network protocols, requirements or new developments. They make it possible to create big and scalable switching frameworks, because they decentralise the address resolution and forwarding functions traditionally performed by a single, powerful control processor. Thus they enable switch manufacturers to construct large switches (up to 256 Gigabit ports and more), with dedicated software in a short time. Currently the “Gigabit” generation of network processors is on the market, while the next one will be able to handle 10 Gigabit speeds (either as 10-Gigabit Ethernet or OC-192). These processors will be available in the course of 2002.

More information on the history of network processors, general features and future prospects can be found in [1] (a paper by members of the IBM NP4GS3 design team). Lots of online information is collected at the Network Processor Central website [2].

We propose to use a specific network processor, the IBM NP4GS3, to implement a versatile DAQ module for LHCb. The NP4GS3 can be operated either together with a switching fabric or back-to-back with a second NP4GS3. In this note we summarise our experience so far and demonstrate how an NP-based module can fulfil many uses in the LHCb data acquisition system [3].


2 The IBM NP4GS3

The IBM NP4GS3 is a network processor comprising 8 dual processor units (DPPUs). Each of the two processors in a DPPU can run either of two threads. The context switch between threads is done in hardware without any delay: whenever a thread stalls because it is waiting for an operation to complete, the other thread runs, keeping the processor busy. Each DPPU has a shared set of coprocessors, which regulate the efficient access to external resources, such as port queues, memory, tree look-up, check-summing and policy. The chip also includes 4 media access controllers, to which Gigabit Ethernet Physical Layer Interfaces (PHYs) can be attached directly using either GMII or the 8/10 bit encoding. The processor has a fast, on-chip input buffer of 128 kB and an output buffer of 64 MB, made from DDR RAM chips. The memory is accessed via a 128-bit wide data-path. The processor also includes a PPC 405 core for control, monitoring and exception handling. This PPC can run an operating system, if desired; IBM provides Linux and VxWorks. Also attached are various memory interfaces for very fast and fast address look-up memory, in total up to 64 MB. For more details the datasheet can be consulted [4].

The data-flow through the NP4GS3 is shown in Figure 1 below. Data are received from the 4 Gigabit Ethernet ports, then they are stored in the ingress memory, can be accessed here and are then transferred to the Switch Interface Link (the DASL). From here they can either reach the output stage of the same processor blade or a twin processor connected to the first one via the DASL interface. They will arrive in any case in the output buffer or egress memory, where they can be accessed a second time, before they are finally put on to one of several (virtual) output queues for transfer over the network.

[Figure 1: block diagram of the NP4GS3 data flow. Labels: Ingress Event Building, Egress Event Building, DASL, Access to frame data.]

Figure 1 Main components of the NP4GS3 together with an indication of the standard data-flow paths. Data can be accessed and modified at the input/ingress and output/egress stages, leading to two different event-building algorithms, described below in section 5.2. One of the two DASL interfaces is always wrapped, so that each NP can send to itself. Also indicated are the various external memories.


3 Applications in the Data Acquisition System

For all applications described below, the basic building block is the (up-to) 8 port Gigabit Ethernet module described in section 4. A simplified drawing of this module is shown in Figure 2. The expected performances achievable for the various applications will be discussed in sections 5 and 6, where also the software aspects of the various functionalities will be described briefly.

[Figure 2: block diagram showing two NP4GS3 piggyback boards, connected via DASL, each with its Gigabit Ethernet ports.]

Figure 2 Schema of the proposed network processor based module. The main components are a carrier and one or two piggyback boards. Each piggyback contains one NP4GS3 network processor and 4 Gigabit Ethernet ports. They are connected via the 4 Gigabit DASL interface. The complete module provides thus either 4 or 8 fully connected, full-duplex Gigabit Ethernet ports.

3.1 Readout Unit

The first and most obvious application is as a Readout Unit. In the LHCb data acquisition system [3] the Readout Unit (RU) has to fulfil three basic tasks:


1. It has to merge sub-event fragments, coming either directly from the L1 front-end electronics or from the Front-end Multiplexers (see below), into new, identically structured event fragments. The sub-event merging has to be performed at the accept rate of the L1 trigger, i.e. 40 kHz, upgradeable to 100 kHz.

2. It has to interface the front-end electronics and its link technology (S-Link over Gigabit Ethernet) to the Readout Network (RN). The link technology there is Gigabit Ethernet, with a custom transport protocol. In a sense the RU acts as a “gateway”.

3. It has to assign destinations to the outgoing frames. These destinations will normally be selected in a round-robin fashion from a pre-loaded table.

The network processor operates as a fast sub-event merger, assigns destinations and functions as a front-end to the event building-switching network. The Readout Unit always has one (and only one) output to the Readout Network (RN). It can have up to 7 inputs. Referring to Figure 2, for this application one link would be connected to the Readout Switch and from one up to seven links would be connected to the Front-end Multiplexers or directly to the Level 1 front-end electronics. Obviously for up to three inputs one network processor is sufficient.

The NP based Readout Unit will respect flow-control messages (“X-On/X-Off”) from the network, to cope with local congestion, and it will itself have the possibility to throttle the trigger, when its buffers are about to overspill. Finally it will send completely assembled event fragments to the Readout Network, assigning destinations according to the general LHCb data acquisition scheme.
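As an illustration of the destination assignment, the following minimal Python sketch selects a destination from a pre-loaded table in round-robin fashion. It assumes that the event number is used as the round-robin index, so that all Readout Units independently choose the same destination for a given event. The table contents and the function name are hypothetical and are not taken from the LHCb implementation.

```python
# Hypothetical pre-loaded destination table (names are placeholders)
DESTINATIONS = ["sfc-00", "sfc-01", "sfc-02"]


def assign_destination(event_number, table):
    """Pick the destination for an event in round-robin order.

    Assumption: the event number serves as the round-robin index, so every
    Readout Unit derives the same destination without any communication.
    """
    if not table:
        raise ValueError("destination table must not be empty")
    return table[event_number % len(table)]


# Five consecutive events cycle through the three table entries
assigned = [assign_destination(n, DESTINATIONS) for n in range(5)]
print(assigned)
```

Indexing by event number rather than keeping a local counter is one possible design choice; it keeps the Readout Units stateless with respect to each other.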

3.2 Front-end Multiplexer

The Front-end Multiplexer (FEM) application is very similar to the Readout Unit. It is used to multiplex sub-event fragments to improve the usage of the single link bandwidth. Since the 8 ports shown in Figure 2 are fully connected, several combinations of multiplexing can be envisaged. Examples are seven to one, two times three to one etc. Since the network processor is the most expensive component, it will be most economical to minimise its instances. The sub-event merging performance requirements are the same as for the RU.

The output of a FEM always goes to the input of a RU (see section 3.1). The software problem is the same, except for the destination assignment, which is obviously not necessary on a single point-to-point connection.

3.3 Main Event builder

The main event builder’s task is to collect all the fragments belonging to a specific event, originating from the RUs. This will be some 100 fragments, which have to be assembled into one contiguous event and sent to the Sub-farm Controller (SFC). The fragments will arrive out of order from the network, because the LHCb data acquisition does not have (nor does it want or need) any synchronisation after the Level 1 derandomisers. They have to be re-arranged into the correct order, and the possibly empty data and error blocks have to be merged. The resulting event is likely to exceed the maximum transfer unit for an Ethernet frame (1500 or 9000 bytes), so the event has to be segmented into multiple frames. These will then be sent in the correct order to the SFC, where they can be retrieved in the most efficient manner.

Since the rates are low at this stage, one module could drive 4 event-building streams, i.e. it can feed 4 SFCs. In Figure 2 that would mean that the Readout Switch is connected to 4 ports and 4 SFCs are connected to the remaining four ports. In this way one network processor would drive two event-building streams. The requirements in terms of fragment merging performance are similar to the Readout Unit application. In fact it corresponds to two times a “one to one” multiplexing. Sub-event building software that meets the requirements of the RU application can thus easily do the main event building. This is explained in detail in section 5.2.2.
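The re-ordering and segmentation performed by the main event builder can be sketched as follows. This is a minimal Python illustration, assuming that each fragment carries a source identifier that defines its position in the event. The function name, the tuple layout and the tiny maximum payload used in the example are hypothetical.

```python
def build_event_frames(arrivals, max_payload):
    """Re-order fragments by source and cut the event into frame payloads.

    arrivals    -- list of (source_id, payload) tuples in arrival order
    max_payload -- maximum number of payload bytes per outgoing frame
    """
    if max_payload <= 0:
        raise ValueError("max_payload must be positive")
    # Restore the correct order: sort by source identifier
    ordered = sorted(arrivals, key=lambda item: item[0])
    event = b"".join(payload for _, payload in ordered)
    # Segment the contiguous event into frames that respect the MTU
    return [event[i:i + max_payload] for i in range(0, len(event), max_payload)]


# Fragments arrive out of order; a 4-byte limit keeps the example readable
arrivals = [(2, b"cc"), (0, b"aaaa"), (1, b"bbb")]
frames = build_event_frames(arrivals, max_payload=4)
print(frames)
```

In a real system the maximum payload would follow from the chosen Ethernet MTU (1500 or 9000 bytes) minus the transport header.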

3.4 Elementary Switching Module

The 8-port module (Figure 2) can also be used as the building block for the switching network itself. Performance-wise this is definitely no problem, because switching is the domain for which network processors were originally conceived. The question is rather whether such a switching network can be cost-effective on a price-per-port basis. Obviously one needs quite a lot of modules, because one has to provide the interconnections that serve the purpose of the backplane in a conventional monolithic switch. There exist, however, studies on how to reduce the number of modules by making intelligent use of the traffic patterns in a data acquisition system; see for example [7]. An additional advantage would be that such a module allows full control over the switching process and functionality. Flow control, traffic shaping and check summing could be implemented at will and customised for maximum performance in the DAQ.

Furthermore a switching network built out of these modules could do the final, main event building in its last stage as described in section 3.3.

3.5 Level 1 Readout Module

Although this is not a DAQ application, the great conceptual similarity between the Level 1 trigger network and the data acquisition makes it interesting to investigate whether a network processor based module could also be used in the demanding high-frequency, small-fragment regime of the Level 1 trigger.

Here very small sub-event fragments (usually some 30 Bytes) must be merged at an incoming rate of 1.1 MHz. The multiplexing factors would be small, because of the limitation by the output link-speed. To economise, one module like in Figure 2 could be used to implement two independent units.
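The limitation by the output link speed can be made concrete with a short calculation. The Python sketch below estimates the maximum frame rate on a 1 Gbit/s link as a function of the payload size, assuming standard Ethernet framing (14-byte header, 4-byte frame check sequence, 64-byte minimum frame, 8-byte preamble and 12-byte inter-frame gap). Any additional transport header of the custom protocol is not included, so the numbers are upper limits under these assumptions only.

```python
# Standard Ethernet framing constants, in bytes
ETH_HEADER = 14       # destination, source, type/length
FCS = 4               # frame check sequence
MIN_FRAME = 64        # minimum frame size including header and FCS
PREAMBLE_SFD = 8      # preamble plus start-of-frame delimiter
INTERFRAME_GAP = 12   # mandatory idle time between frames

LINK_BPS = 1_000_000_000  # Gigabit Ethernet


def max_frame_rate(payload_bytes, link_bps=LINK_BPS):
    """Upper limit on frames per second for a given payload size."""
    frame = max(ETH_HEADER + payload_bytes + FCS, MIN_FRAME)
    wire_bits = (frame + PREAMBLE_SFD + INTERFRAME_GAP) * 8
    return link_bps / wire_bits


# Minimum-size frames (46 bytes of payload) and a 100-byte payload
print(round(max_frame_rate(46)))
print(round(max_frame_rate(100)))
```

Since the frame rate on the single output link falls as the merged payload grows, only small multiplexing factors are compatible with an input rate of 1.1 MHz.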

It should be noted that the present architecture for the L1 readout network is based on SCI; using a Gigabit Ethernet based module would require an architectural change.


4 A versatile 8-port Gigabit Ethernet Module

The IBM NP4GS3 has a high-speed interface to connect to a switching engine, the Data Aligned Synchronous Link (DASL). In fact it has 2 such interfaces. When there is no switching engine to connect to, one of these interfaces can be used to connect to another NP4GS3, thus effectively creating an 8-port switch. The other DASL will usually be wrapped to itself to ensure full connectivity.

In addition to the DASL, the NP4GS3s (can) share the following resources: power and clock distribution and access to a PCI interface for configuration and monitoring. Each NP4GS3 requires its own memories and physical layer interfaces. The Media Access Controller (MAC) is already incorporated on-chip. Since the part of the module carrying the network processor and its memory is by far the more complex and deep (in terms of layers), it is very attractive to separate it out as a daughter/piggyback board, which carries everything belonging to one NP4GS3 alone (except the physical layer interfaces, PHYs for short) and feeds out the connections for PCI, DASL (to the other processor, if present) and PHYs.

The common, “simpler” functionality and the control processor (Credit Card PC) would be housed on a motherboard. The two boards will be described in more detail in the following:

4.1 Motherboard

The motherboard will provide all the common infrastructure needed for the operation of the NP4GS3s. This includes power generation, clock generation and the physical layer interfaces. It will also include a Credit-Card PC (CC-PC), the standard LHCb interface to the Experimental Control System (ECS). It provides the connectors for the two carrier cards with the Network Processors. These connectors carry the following lines and interfaces: DASL, PCI, JTAG, DMU. PCI is used by the CC-PC to configure and monitor the NPs and also to communicate with the embedded PowerPC. JTAG is needed for the boundary scan and for hardware debugging using RISCWatch [6]. The DASL is used to connect two NPs, as has been said already. The DMU (Data Mover Unit) interfaces connect the Media Access Controllers integrated on the NP4GS3 to the physical interfaces. These could be hot-pluggable, thus allowing more flexibility in configuring them either as 1000BaseT (CAT 5 copper) or 1000BaseSX (multi-mode fibre).

Except for some length requirements on the DMU and DASL lines and some necessary screening for the high frequency signals this motherboard will not be particularly complex. A simple layout is shown in Figure 3.


Figure 3 Motherboard for the NP4GS3 carriers. It provides power, clock and PCI to both NPs. Also shown are the 9 physical connectors (8 for the NPs, one for the CC-PC).

4.2 Piggy back board

This board will be comparatively complex, with ~ 12 layers and has rather stringent requirements on timing and distances. To have it on a small carrier board has therefore quite some advantages: the multi-layer board can be kept small, which eases production, and the flexibility to connect different physical layers is kept. A simplified block-diagram is shown in Figure 4.


[Figure 4: block diagram of the piggyback board. Labels: NP4GS3; DMUs A, B, C, D; DDR memories D0 (2x 8Mx16), D1, D2, D3 (8Mx16 each), D4 (2x 8Mx16, parity), D6 (2x 32Mx4, data and parity), DS0 and DS1 (2x 8Mx16, data); SRAM LU (2x 512k x18); SRAM SCH (512kx18); DRAM Data; DRAM Control; PCI; DASL A; Throttle.]

Figure 4 The piggyback or carrier board will house the NP4GS3 processor and its associated memory-chips.


5 Software

As has been pointed out in the introduction, it is the software driving the network processor core, which determines the functionality provided. The hardware will always be the same, differing only in the number of network processors on a board (1 or 2).

5.1 Software Development Environment

The software for the IBM NP4GS3 is written in Assembly Language [3], a simple RISC code augmented by flexible coprocessor and register handling. The programming model differs from that of traditional computers in that the memory is not simply addressable in a linear way. It consists of external buffers, managed by hardware and accessed by means of coprocessor instructions. Programs have to be written with special care in view of the 32 concurrently running threads. The development of code is greatly facilitated by a very sophisticated and powerful simulator, npscope, which allows single-stepping threads and inspecting and changing almost all registers.

Some aspects of the software, mostly relating to concurrency, are not easily probed with an interactive debugger. For these there is a different access to the simulator, npsim, which allows feeding a complete simulation model with data and monitoring the input and output data. This simulator interface has been customised and is shown in Figure 5. The simulator can also generate timing information, which in principle should be cycle precise. This permits checking not only the correctness of the code but also its run-time efficiency. The profiler npprof, provided by IBM as part of the development tool-kit, is a TCL script that evaluates the timing information generated by the simulator. This script is very slow and has been replaced by a custom Visual Basic script, whose results are then evaluated graphically using Microsoft Excel. This allows for a very fast development cycle, in which the effect of changes on run-time behaviour can be evaluated within 1-2 minutes. Results from these simulations and performance studies are shown in subsequent sections.


Figure 5 The customised Interface for NPSIM. The top half shows incoming fragments with header bytes (green), data bytes (blue) and error-block bytes (red). The bottom half shows the output of the concatenated blocks.

Run-time performance is of paramount importance in a high-speed networking environment. As has been said, the simulation model provides the possibility of cycle precise benchmarking of code. The main aim of our first measurements using real hardware was to verify the precision of the simulation results.

5.2 (Sub-) Event Building

Sub-event building can be done in either the ingress (input) or the egress (output) stage of the NP4GS3; a hybrid solution using 2 processors is also imaginable. The terms “ingress” and “egress” for the input and output stages, as well as “blade” for a single network processor comprising multiple network ports, come from the jargon of switch manufacturers.

The two stages differ in the way the memory is organized, as well as in its physical implementation. On the ingress side, there is a moderate amount (128 kB total) of 64 byte buffers in very fast on-chip memory. On the egress side there is a large amount (64 MB total) of slower DDR-RAM. Naturally two algorithms can be developed, one optimised for small frames at high rates (“Ingress Event building”) and one for large frames at somewhat lower rates (“Egress Event building”).

In the following the algorithms will be briefly described and then the results from simulation will be discussed (see footnote 2).

Footnote 2: The results presented in this chapter are obtained from simulation with 16 threads enabled. It is expected that the NP system does not scale perfectly, since there is competition for common resources, such as


5.2.1 Ingress Event building

Ingress event building uses the input buffer of the NP4GS3 to store and finally assemble the incoming data fragments. Since the ingress store is quite limited (128 kB), only small data fragments, and only small numbers of them, can be stored. The very high speed of the ingress memory, however, makes it ideal for event building at high data rates and high trigger rates. The main idea is to store the data fragments until all those belonging to the same event have arrived, and then to copy the payloads to form a new fragment to be sent towards the output ports. Since each incoming fragment contains a significant overhead of header information, it does not make sense to transfer all this transport information across the DASL bus and to deal with the very small payload data in terms of the complicated egress data structures (see below). Instead only about 50% of the incoming data is transferred to the egress side after the event-building process, including the new transport structure for the outgoing event fragment. It has been shown that this strategy allows for event-building performances far beyond the capabilities of the output port. The performance is shown in Figure 6.

[Figure 6 plot. x-axis: Average Input Fragment Payload [Bytes]; y-axis: Maximum Acceptable L0 Trigger Rate [MHz]; curves: Limit imposed by NP Performance, Limit imposed by single Gb Output Link, Max. Level-0 Trigger Rate.]

Figure 6 Performance of Ingress event building from simulation as a function of the average incoming packet size. The straight line shows the fixed level 1 trigger rate of 1.1 MHz. The limitation on performance comes from the output link bandwidth only.

The results shown in Figure 6 demonstrate that for ~30 Byte fragments a three-to-one multiplexing can be done safely at 1.1 MHz. It should be noted that, due to limitations in the current simulation model, only 16 of the 32 threads (see section 2) have been enabled. Because of contention between threads a gain of a factor of two cannot be assumed, but a further improvement of some 50% does not seem unrealistic. One would then be limited by the output-link speed of 1 Gigabit at every working point. This points in the direction of reducing the multiplexing factor in this scenario.
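The idea behind ingress event building, namely storing the fragments of an event, stripping their transport headers and copying only the payloads into a new fragment, can be sketched in Python. The real code is written in NP4GS3 assembly and works on hardware-managed 64-byte buffers; the header layouts and class names below are hypothetical and serve only to illustrate the logic.

```python
import struct

# Hypothetical incoming header: 32-bit event number, 16-bit source id
FRAGMENT_HEADER = struct.Struct(">IH")
# Hypothetical outgoing header: 32-bit event number only
OUTPUT_HEADER = struct.Struct(">I")


class IngressEventBuilder:
    """Collect the fragments of each event and merge their payloads."""

    def __init__(self, n_sources):
        self.n_sources = n_sources
        self.pending = {}  # event number -> {source id: payload}

    def add_fragment(self, fragment):
        """Store one fragment; return the merged fragment once complete."""
        event, source = FRAGMENT_HEADER.unpack_from(fragment)
        payload = fragment[FRAGMENT_HEADER.size:]  # drop the old header
        slots = self.pending.setdefault(event, {})
        slots[source] = payload
        if len(slots) < self.n_sources:
            return None  # still waiting for other sources
        del self.pending[event]
        merged = b"".join(slots[s] for s in sorted(slots))
        return OUTPUT_HEADER.pack(event) + merged


def make_fragment(event, source, payload):
    """Helper to build a test fragment with the hypothetical header."""
    return FRAGMENT_HEADER.pack(event, source) + payload


builder = IngressEventBuilder(n_sources=3)
results = [builder.add_fragment(make_fragment(7, src, data))
           for src, data in [(2, b"cc"), (0, b"aa"), (1, b"bb")]]
print(results)
```

In this example three fragments of 8 bytes each (24 bytes in total) are reduced to a single 10-byte output fragment, which illustrates why only part of the incoming data has to cross the DASL.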

(Footnote 2, continued) ‘global’ memory. However, with 32 threads an increase in performance can still be expected. For the trustworthiness of the simulation/profiling see section 6.3.


5.2.2 Egress Event building

Egress event building is performed at the output stage, or egress side, of the processor. The data come in from the DASL interface and are stored in the egress memory. The data structure in this memory resembles the format of the cells used on the switching fabric. A so-called twin-buffer consists of 2 times 58 bytes of payload, the remaining 6 bytes having originally been used by the cell header. Pointers in the place of the second original cell header link these buffers. An offset into the next buffer can also be given; however, all buffers except the last have to be filled from the bottom.
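As a small worked example of this buffer structure, the following Python sketch computes how many twin-buffers a frame occupies in the egress memory. It assumes that every twin-buffer is completely filled with 2 times 58 bytes of payload and ignores the offsets mentioned above; the function name is hypothetical.

```python
CELL_PAYLOAD = 58                      # payload bytes per buffer
TWIN_PAYLOAD = 2 * CELL_PAYLOAD        # payload bytes per twin-buffer


def twin_buffers_needed(frame_bytes):
    """Number of twin-buffers needed to hold a frame of the given size."""
    if frame_bytes < 0:
        raise ValueError("frame size must not be negative")
    # Integer ceiling division avoids floating-point rounding
    return (frame_bytes + TWIN_PAYLOAD - 1) // TWIN_PAYLOAD


for size in (64, 116, 117, 1500):
    print(size, twin_buffers_needed(size))
```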

Sub-event building proceeds by analysing the fragment headers and waiting until all fragments belonging to an event have been received or a time-out condition has occurred. In either case event building will start; in the latter case an error will be flagged in the error block. The frames are then connected by adapting the link pointers, moving as little data as possible. At the boundaries of the original frames it sometimes becomes necessary to actually copy some data in order to fill the buffers from the bottom. New frames are built until a pre-defined maximum transfer unit has been reached. The frame is then dispatched, and the procedure is iterated until all data have been used up.

Care has to be taken to avoid corruption of the static data, due to multiple threads wanting to access them at the same time. The NP4GS3 provides a powerful semaphore mechanism to handle these situations. The performance of egress event building, as measured by simulation, is shown in Figure 7.
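The control flow of egress event building, i.e. waiting for all fragments or a time-out and protecting the shared event table against concurrent access, can be sketched in Python as follows. Here threading.Lock stands in for the hardware semaphore of the NP4GS3, and fragments are simply concatenated rather than linked through buffer pointers. All class, method and field names are hypothetical, and times are passed in explicitly to keep the example deterministic.

```python
import threading


class EgressCollector:
    """Collect fragments per event; build on completion or on time-out."""

    def __init__(self, n_sources, timeout):
        self.n_sources = n_sources
        self.timeout = timeout
        self._lock = threading.Lock()  # stands in for the NP semaphore
        self._events = {}  # event -> (first arrival time, {source: payload})

    def add(self, event, source, payload, now):
        """Store a fragment; return the built event once it is complete."""
        with self._lock:
            _, frags = self._events.setdefault(event, (now, {}))
            frags[source] = payload
            if len(frags) == self.n_sources:
                del self._events[event]
                return self._assemble(event, frags, error=False)
        return None

    def expire(self, now):
        """Build all events whose time-out has elapsed, flagging an error."""
        built = []
        with self._lock:
            late = [e for e, (t, _) in self._events.items()
                    if now - t >= self.timeout]
            for event in late:
                _, frags = self._events.pop(event)
                built.append(self._assemble(event, frags, error=True))
        return built

    @staticmethod
    def _assemble(event, frags, error):
        data = b"".join(frags[s] for s in sorted(frags))
        return {"event": event, "error": error, "data": data}


collector = EgressCollector(n_sources=2, timeout=1.0)
first = collector.add(1, 0, b"a", now=0.0)
complete = collector.add(1, 1, b"b", now=0.1)
collector.add(2, 0, b"x", now=0.2)
timed_out = collector.expire(now=1.5)
print(complete, timed_out)
```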

[Figure 7 plot. x-axis: Average Fragment Size [Bytes]; y-axis: Maximum Allowed Fragment Rate [kHz]; curves: Limit imposed by NP Processing Power, Limit imposed by single Gb Output Link; shaded area: Range of possible L1 Trigger Rate Values.]

Figure 7 Performance of the Egress Event Building as a function of the average input fragment size. The green area shows the range of possible L1 trigger rates.

These results show clearly that for almost every fragment size the sub-event merging can be performed faster than the output link can be loaded (see footnote 3). In all cases the working point can be far above 100 kHz, which is the maximum Level 1 accept rate after an upgrade of the system. Destination assignment for the produced output frames has been done in a trivial way in this simulation; however, the time for a proper assignment based on a look-up is negligible.

Footnote 3: The same remark as in section 5.2.1 about the number of simulated threads holds here as well. Enabling all threads should boost the performance above wire-speed even for fragments smaller than 100 Bytes (64 Bytes being the lower limit in Ethernet).


6 Measurements with the Reference Design Kit

The results presented so far have been obtained using simulation. It is therefore interesting to assess how reliable the timing information of the simulation is. Since the latest version of the processor (revision 2.0) is not yet available on a reference platform, some parts of the sub-event building code cannot run un-modified on the reference design hardware. However, it has been possible to compare a representative set of algorithms on simulation and real hardware. The results agree well. They will be described in the following:

6.1 The IBM NP4GS3 Reference Platform

The IBM NP4GS3 Reference Kit [5] aims at providing users with an implementation that allows most of the functions of the network processor to be explored. The chassis with the important cards is shown in Figure 8. The configuration used for our measurements consisted of a chassis, a control processor (a PPC 705 based cPCI computer), two carrier boards with a NP4GS3 and 4 daughter cards, each with 2 Gigabit Ethernet LX optical ports. Furthermore we had a RISCWatch [6] probe attached to the JTAG interface of one of the blades, which allows accessing the network processor’s internal registers directly from the NPSCOPE debugger (via Ethernet).

Figure 8 Main components of the NP4GS3 reference platform. The packet routing switchboard is not included in our set-up.

6.2 Test set-up

The test set-up shown in Figure 9 consisted of 4 Netgear Gigabit Ethernet NICs running dedicated firmware, which made them operate as traffic sources and sinks simultaneously. The NICs are monitored and controlled remotely. Similarly, the network processor is accessed remotely via the RISCWatch. This configuration allowed testing 3 to 1 event building. Each NIC has an internal clock with an intrinsic resolution of approximately 1 µs, which has been used to measure the latencies imposed by the event building code. The generation of frames and the evaluation of the time differences are done in the NICs, the receiving NICs being the same as the senders (see footnote 4).

The data sources must be synchronised so as to keep the spread between the arrival times of different fragments small. To synchronise the NICs (the sources), a special frame is sent by one of the NICs to the network processor, which then multicasts it to all NICs. This triggers the collective sending of the packets within less than 0.5 microseconds.
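The timestamp-based measurement described in this section can be illustrated with a minimal Python sketch. It is not the NIC firmware: the record type `FrameStamp`, the function names and the assumption that the sending and receiving time-stamps of a frame are read from the same NIC clock are all hypothetical and serve only to show the arithmetic. The 1 µs tick and the 0.5 µs sending window are the values quoted in the text.

```python
from dataclasses import dataclass
from typing import Sequence

TICK_US = 1.0         # NIC clock resolution quoted in the text (about 1 microsecond)
SYNC_WINDOW_US = 0.5  # spread of the collective sending quoted in the text


@dataclass(frozen=True)
class FrameStamp:
    """Hypothetical pair of clock readings for one frame (same NIC clock assumed)."""
    sent_tick: int      # clock value when the frame was created
    received_tick: int  # clock value when the frame came back


def latency_us(stamp: FrameStamp) -> float:
    """Latency of one frame in microseconds, no finer than one clock tick."""
    return (stamp.received_tick - stamp.sent_tick) * TICK_US


def mean_latency_us(stamps: Sequence[FrameStamp]) -> float:
    """Average latency over a run of frames."""
    if not stamps:
        raise ValueError("at least one frame is required")
    return sum(latency_us(s) for s in stamps) / len(stamps)


def within_sync_window(send_offsets_us: Sequence[float],
                       window_us: float = SYNC_WINDOW_US) -> bool:
    """True if all sources started sending within the given window.

    The offsets are assumed to be measured relative to the arrival of the
    synchronisation frame on each NIC.
    """
    return max(send_offsets_us) - min(send_offsets_us) < window_us
```

Averaging over many frames, as in `mean_latency_us`, is one way to obtain handling times quoted to a tenth of a microsecond from a clock with a 1 µs tick; whether the original measurements were averaged in this way is an assumption.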

Figure 9 Test set-up for the NP4GS3 reference kit. The data are fed and read back using Tigon 2 based Gigabit Ethernet NICs. Download of NP software and monitoring of the NP is done via JTAG using the RISCWatch probe.

6.3 Results

The main aim of all measurements was to establish the accuracy of the simulation. The simulation is claimed to be cycle precise and takes into account the contention between threads. However, it does not accurately simulate all external resources with their associated latencies. It was therefore especially interesting to see to what extent simulation results can be trusted. One problem with these measurements is that the version of the NP on these boards is not the latest. It lacks the semaphore coprocessor, a unit specifically designed for efficient resource protection to avoid race conditions in a multi-threaded application. Since our code strongly depends on this facility, the test conditions had to be tuned somewhat, either by measuring with a single thread or by reducing the spread in the arrival time of the fragments through careful synchronisation. This does not change the run-time of the code; the simulations for both versions agree on that. It does, however, avoid the synchronisation problems that the semaphore coprocessor would otherwise prevent. We are confident that these measurements give a realistic impression of the performance of our sub-event building code.
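The role of the semaphore coprocessor can be illustrated by analogy with a small Python sketch. It is not NP4GS3 picocode and does not model the coprocessor interface: `threading.Lock` merely stands in for the hardware semaphore, and the class `SubEventBuilder`, the function `build_events` and the one-byte payloads are hypothetical. The point shown is that storing a fragment and checking whether the sub-event is complete must happen as one protected step when several threads handle fragments of the same event.

```python
import threading
from collections import defaultdict


class SubEventBuilder:
    """Toy sub-event builder; a lock stands in for the semaphore coprocessor."""

    def __init__(self, n_sources):
        self.n_sources = n_sources
        self._pending = defaultdict(dict)  # event number -> {source: payload}
        self._lock = threading.Lock()      # assumption: plays the semaphore's role
        self.completed = []                # list of (event number, merged payload)

    def add_fragment(self, event, source, payload):
        """Store one fragment; emit the merged sub-event once all have arrived."""
        with self._lock:  # store-and-check must not interleave between threads
            fragments = self._pending[event]
            fragments[source] = payload
            if len(fragments) == self.n_sources:
                merged = b"".join(fragments[s] for s in sorted(fragments))
                del self._pending[event]
                self.completed.append((event, merged))


def build_events(n_sources, n_events):
    """Feed fragments from one thread per source and return the merged events."""
    builder = SubEventBuilder(n_sources)

    def source_thread(source):
        for event in range(n_events):
            builder.add_fragment(event, source, bytes([ord("A") + source]))

    threads = [threading.Thread(target=source_thread, args=(s,))
               for s in range(n_sources)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(builder.completed)
```

Without such protection, two threads could both see an incomplete sub-event, or both see it complete, which is the kind of race that the single-thread runs and the tight synchronisation of the sources were used to sidestep on the early chip version.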

Footnote 4: This is possible because each NIC has two independent CPUs. One is used for creating frames, while the other is receiving the frames and reading the time-stamps.


Measurements have been done only in a “Level 1” like environment, which means short frames at high rates, since this is the more demanding and critical application.

First the round-trip time of a packet has been measured. This is necessary to subtract any overheads coming from the transport over the DASL, the cables and especially the creation and time stamping in the NICs, which are not included in the simulation. This time has been found to be 1.7 µs, the amount by which the raw and corrected values in Table 2 differ. It can be seen as an intrinsic resolution of the measurements. Since the whole system is pipelined (many threads working), time intervals smaller than this intrinsic time cannot be accurately measured. This is the reason for the apparently strange value 0.0 in the last row of Table 2.

Several scenarios have been tried, varying the number of active threads and active sources. The results are summarized in Table 2.

Table 2 Comparison of measurements with simulation results. The handling time per fragment is shown in microseconds. The measured times are shown raw, as measured, and in parentheses corrected, with the round-trip and handling time in the NICs subtracted.

Configuration           Measurement [µs/fragment]    Simulation [µs/fragment]

1 source, 1 thread      6.6 (4.9)                    4.9

4 sources, 1 thread     4.5 (2.8)                    3.2

1 source, 16 threads    1.7 (0.0)                    0.5

The results show clearly that the simulation is fairly accurate. Although the arrival times of fragments had to be tuned a little to cope with the limitations of the early chip version available for these tests, it should be noted that the full-featured sub-event merging code has been tested. As has been pointed out several times in this paper, only 16 threads could be tested, due to current limitations in the simulation model. Clearly the performance can only improve with more threads enabled, certainly not in a linear fashion, but an improvement of about 50% can realistically be expected.
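The correction applied in Table 2 can be reproduced with a short Python sketch. The numbers are taken from the table; the overhead of 1.7 µs is inferred as the difference between the raw and corrected values in every row, and the names `ROUND_TRIP_US`, `corrected` and `compare` are illustrative only.

```python
ROUND_TRIP_US = 1.7  # raw minus corrected value in every row of Table 2

# (configuration, raw measurement, simulation), all in microseconds per fragment
ROWS = [
    ("1 source, 1 thread", 6.6, 4.9),
    ("4 sources, 1 thread", 4.5, 3.2),
    ("1 source, 16 threads", 1.7, 0.5),
]


def corrected(raw_us, overhead_us=ROUND_TRIP_US):
    """Subtract the round-trip overhead; values below it cannot be resolved."""
    return round(max(raw_us - overhead_us, 0.0), 1)


def compare(rows):
    """Return (configuration, corrected measurement, simulation, difference)."""
    result = []
    for label, raw_us, sim_us in rows:
        meas_us = corrected(raw_us)
        result.append((label, meas_us, sim_us, round(meas_us - sim_us, 1)))
    return result
```

Under these assumptions the corrected measurement and the simulation coincide for one source and one thread, and differ by 0.4 µs and 0.5 µs in the other two configurations, both well below the 1.7 µs resolution of the measurement.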


7 The path to a fully integrated module

The reference kit as available from IBM is a fully functional Readout Unit, Switch Element and Event Builder. However, the functionality it provides needs to be integrated on a standard (i.e. 9U, VME width) board. The implementation of such a module has been outlined in section 4.

In addition to the design and construction of this module, a few other items have to be provided:

1. A test stand for frame generation and reception. This will be an extension of the existing set-up used for the measurements presented before.

2. Software to run on the embedded PowerPC 405 core. This will be Linux, as provided by IBM, together with a few LHCb specific extensions such as communication with the Credit-Card PC, and monitoring and exception handling for the sub-event building.

3. Software for the Credit-Card PC (again, most likely Linux) to communicate with the NP4GS3s and to monitor and control them via PCI.

All these items, together with optimisation of the code, represent an effort of approximately one man-year. The second main issue still to be addressed is the design and prototyping of the mezzanine cards and carrier boards. Final production should be finished by the end of 2003.


8 Summary and Conclusions

Network processors are an exciting new technology. They offer unprecedented power and flexibility in frame manipulation, traffic shaping and other network oriented processing. We propose a versatile network processor based module for the LHCb Data Acquisition system. This module consists of 8 fully connected Gigabit Ethernet ports. The frame processing is performed by up to two IBM NP4GS3 network processors. This module can serve all purposes required in the LHCb DAQ system: Front-end Multiplexer, Readout Unit, Main Event-Builder and even as a building block for the Readout Switch. It is the software driving the network processor that determines the functionality.

A powerful set of software utilities is available to facilitate rapid development along with extensive documentation and good technical support.

Simulation studies have shown that the performance is more than sufficient for all applications envisaged. We have confirmed the simulations by measurements done on the evaluation platform provided by IBM. The main open issue is the integration of the functionality of the module on a standard VME-like electronics board. We propose to use a carrier board together with one or two daughter boards, each equipped with a network processor. The design and implementation of these boards is currently under study.


References

[1] Bux et al., Technologies and Building Blocks for Fast Packet Forwarding, IEEE Comm. Mag., Jan. 2001

[2] Network Processor Central, [online] http://www.linleygroup.com/npu/

[3] B. Jost, The LHCb DAQ System - Presentation given at the DAQ 2000 Workshop in Lyon [online] http://lhcb-comp.web.cern.ch/lhcb-comp/daq/postscript/LHCb%20Paper.pdf

[4] IBM PowerNP NP4GS3 Datasheet, [online] http://www3.ibm.com/chips/techlib/techlib.nsf/techdocs/852569B20050FF7785256983006A3809

[5] IBM PowerNP NP4GS3 Reference Platform, [online] http://www3.ibm.com/chips/techlib/techlib.nsf/techdocs/546B9AC56334EA0F872569F9005F7DA5

[6] RISC Watch Debugger, [online] http://www-3.ibm.com/chips/techlib/techlib.nsf/products/RISCWatch_Debugger
