New Design and Layout Tips For Processing Multiple Tasks

(1)

Novel, Highly-Parallel Software for the

Online Storage System of the ATLAS Experiment at CERN:

Design and Performances

Tommaso Colomboa,b Wainer Vandellib

a_{Università degli Studi di Pavia} b_CERN

(2)

The ATLAS Trigger & DAQ System

Dete

ctor Re

adout

ROD ROD ROD FE FE FE Data Flow ~150 Readout System ~100 Event Builder 5 Data Logger Level 2 dsa ~5000 Processing Unit Event Filter ~5000 Processing Unit Level 1 Custom Hardware Calo/

Muon Other Other

Data Collection Network Event Filter Network Event rates design (2012 peak) 40 MHz (20 MHz) 75 kHz (~65 kHz) 3 kHz (~5.5 kHz) ~200 Hz (~800 Hz) Trigger DAQ CERN Permanent Storage Data rates design (2012 peak) ATLAS Event 1.5 MB/25 ns (1.6 MB/50ns) ~110 GB/s (~105 GB/s) ~4.5 GB/s (~9 GB/s) ~300 MB/s (~1100 MB/s) Level 1 Accept Regions of Interest L2 Accept ROI data EF Accept Full events 2.5 µs ~40 ms (~45 ms) ~4 s (~1 s)

(3)

The current Data Logging system: overview

Purpose

5 PCs receive data from the Event Filter system and write it to local disks. Each event is:

analyzed to determine the “tags” applied by the Event Filter trigger

processed (e.g. compressed)

written to appropriate file(s) according to the tags Details

The event tags are determined by the trigger algorithms based on the event content

To facilitate off-line data distribution, every event is written to multiple files, one per each of its tags

(4)

The current Data Logging system: overview

Purpose

5 PCs receive data from the Event Filter system and write it to local disks. Each event is:

analyzed to determine the “tags” applied by the Event Filter trigger

processed (e.g. compressed)

written to appropriate file(s) according to the tags

Details

The event tags are determined by the trigger algorithms based on the event content

To facilitate off-line data distribution, every event is written to multiple files, one per each of its tags

(5)

The current Data Logging system: limitations

The current Data Logger implementation is

essentiallysingle-threaded:

I multiple threads receive the events from the EF I _{a single thread does the processing and writing}

This design is very unlikely to scale:

I maximum processing throughput: ~ 500 MB/s

I comparable with I/O (network and disks) limits

Events Q. ev ev ev Processing and Writing Thread Network I/O Thread

get event

put event

It is a major blocker for the addition of new features requiring more CPU power than a single core can provide

I _{due to this}_{event-level data compression}_{currently needs to be performed} off-line, as an additional step

(6)

The current Data Logging system: limitations

get event

put event

(7)

The current Data Logging system: limitations

get event

put event

(8)

New design: general considerations

The data processing workload is embarrassingly parallel:

the incoming data are already divided in events Constraint

The raw data file format is strictly sequential

⇒ It is impossible to do concurrent writes to the same file

Necessary to calculate the overall file checksumbeforewriting to disk, aiding in the detection of write errors

Keeps the format complexity to a minimum

Multiple events can be written to different raw data files concurrently, butno more than one event can be written to each data file at once

(9)

New design: general considerations

the incoming data are already divided in events

Constraint

Necessary to calculate the overall file checksumbeforewriting to

disk, aiding in the detection of write errors Keeps the format complexity to a minimum

(10)

New design: general considerations

the incoming data are already divided in events

Constraint

Necessary to calculate the overall file checksumbeforewriting to

disk, aiding in the detection of write errors Keeps the format complexity to a minimum

(11)

New design: idea

Split the workload intasks

For each event:

I _{one task does the processing} I _{multiple tasks do the writing}

(for each tag, a task writes the event to the corresponding file).

Use a single thread poolto execute the tasks Schedule the tasks cleverly to avoid locking

At any given time:

I _{any number of processing tasks can run}

(12)

New design: idea

For each event:

Use a single thread poolto execute the tasks

Schedule the tasks cleverly to avoid locking

At any given time:

(13)

New design: idea

For each event:

Use a single thread poolto execute the tasks

Schedule the tasks cleverly to avoid locking

At any given time:

(14)

New design: finally, a diagram!

Execution Q. PT 11 PT 10 WT 2-jets PT 9 Events PT 14 PT 13 PT 12 schedule Thread WT 1-μ run Thread WT 1-eγ

Raw file manager

File jets Q. WT 6-jets WT 5-jets File eγ Q. WT 7-eγ WT 6-eγ WT 5-eγ notify completion File μ Q. WT 7-μ WT 6-μ

schedule only one

PT nn Processing task_{for event nn}

WT nn-aa Writing taskfor stream aa of event nn Thread PT 8 WT 8-μ WT 8-jets enqueue

(15)

New design: implementation with Threading Building Blocks

PT nn Processing task_{for event nn} WT nn-aa Writing taskfor stream aa

of event nn Thread PT 8 WT 8-μ WT 8-jets enqueue

The new design was implemented using (and inspired by) the open source C++ library Intel Threading Building Blocks

(16)

New design: implementation with Threading Building Blocks

PT nn Processing task_{for event nn}

WT nn-aa Writing taskfor stream aa of event nn Thread PT 8 WT 8-μ WT 8-jets enqueue Task execution queue

Thread pool Task based multi-threading

(17)

New design: implementation with Threading Building Blocks

PT nn Processing task_{for event nn} WT nn-aa Writing taskfor stream aa

of event nn Thread PT 8 WT 8-μ WT 8-jets enqueue

Thread-safe containers optimized for concurrency Concurrent queue

(18)

Performance evaluation: resource utilization

The new implementation was tested and compared with the old one

I _{in the current production system} I _{in a testbed with older hardware}

A single Data Logger machine was operated at saturation

The dataset consisted of actual event data, with 1 to 4 tags assigned to each event

I _{changing the number of tags per event} changes the number of files the Data Logger has to write each event to

I _{does not change the required network}

bandwidth

Testbed Data Logger PC

2x dual-core Xeon 5130 4 GB RAM 3x 3ware RAID5 array

2x GbE NIC

Production Data Logger PC

2x quad-core Xeon E5520 24 GB RAM 3x Adaptec RAID5 array

(19)

Performance evaluation: resource utilization

(20)

Performance evaluation: resource utilization

(21)

Performance evaluation: resource utilization

(22)

Performance evaluation: resource utilization

(23)

Performance evaluation: resource utilization

(24)

Performance evaluation: resource utilization

The “hard” limit on the throughput of a single Data Logger is given by the

network bandwidth: 2 Gb/s'250 MB/s

Old single-threaded implementation

Can operate at network saturation only for a single tag per event Above 2 tags per event, the load

generated by itssingle thread

exceeds what a single CPU core can take

The throughput decreases accordingly

(25)

Performance evaluation: resource utilization

The “hard” limit on the throughput of a single Data Logger is given by the

network bandwidth: 2 Gb/s'250 MB/s

New multi-threaded implementation

Thethroughputis almost

unaffected by the load

Its 4 threads spread the workload on the 4 CPU cores:

none of them uses more than 60% of a core

Leaves plenty of headroom for additional CPU intensive

(26)

Performance evaluation: scalability (in testbed)

On-line event compression (with zlib) radically changes the landscape

The time spent compressing events (~ 50 ms per MB)

dominates the rest of the processing (~ 2 ms per MB)

I _{throughput is much lower}

I _{all workloads saturate the CPU}

One can examine the throughput as a function of the number of CPU cores (threads) used

(27)

Performance evaluation: scalability (in testbed)

On-line event compression (with zlib) radically changes the landscape

The time spent compressing events (~ 50 ms per MB)

dominates the rest of the processing (~ 2 ms per MB)

I _{throughput is much lower}

I _{all workloads saturate the CPU}

One can examine the throughput as a function of the number of CPU cores (threads) used

(28)

(29)

(30)

Conclusions

A novel design for the ATLAS Data Logging application was implemented and thoroughly tested

The performance of the new software is very satisfactory:

I _{taps into the full power of modern CPUs}

I _{future-proofs the Data Logger}

I _{enables the addition of computationally-intensive features}

It will be one of the essential components of the “evolved” system currently being developed to meet the challenges of LHC data-taking in 2014 and beyond

(31)

Conclusions

The performance of the new software is very satisfactory: I _{taps into the full power of modern CPUs}

(32)

Conclusions

The performance of the new software is very satisfactory: I _{taps into the full power of modern CPUs}

(33)

(34)

Other design constraints

Operations are driven by the received event data

The Data Logger can only rely on the information it gathers by examining the received events

No assumptions about the data flow

Cannot assume that the rate of received events is somehow balanced across the spectrum of the possible tags

The flow of events with one tag can vary during a run and even stop completely

(35)

Other design constraints

Operations are driven by the received event data

The Data Logger can only rely on the information it gathers by examining the received events

No assumptions about the data flow

Cannot assume that the rate of received events is somehow balanced across the spectrum of the possible tags

The flow of events with one tag can vary during a run and even stop completely

(36)

Other possible designs: thread pool with locking

Raw File Manager

Stream 2 Stream 3

Stream 1

Lumiblock 1 Lumiblock 2Stream 1 Event Queue Processing Thread get event save event ask for access

(37)

Other possible designs: chain of responsibility

Event Queue Processing Thread Stream 1 Lumiblock 1 Stream 1 Lumiblock 2 Stream 2

Processing Thread Processing Thread

(38)

Other possible designs: one thread pool per file

Event Queue Processing Thread Stream 1 Lumiblock 2 Processing Thread Stream 1 Lumiblock 2 Processing Thread Stream 2

(39)