Typical Particle Physics Computing Model - A REST Model for High Throughput Scheduling in Computational Grids

Modern particle physics experiments have come to adopt similar strategies for organising and executing their simulations and data analysis. It is this common approach to mass data handling and processing which motivates the model described in later chapters. Furthermore, the field of grid computing largely grew out of physics computing requirements which were not met by existing cluster or supercomputer systems. For this reason, many of the existing grid computing strategies have been developed with physics computing requirements in mind. This section provides a summary of the computing model used by the CERN-based experiments.

Figure 3.2: Computing resource distribution for Tier 0, 1, and 2 sites committed to LCG in 2008. Of a total of 118 MSI2K, Tier 0 provides 13 MSI2K (11%), Tier 1 provides 56 MSI2K (47%), and Tier 2 provides 50 MSI2K (42%).

The CERN MONARC project[40] proposed a four tier system for distributing the processing load and data across a global network of physics sites[41]. The system is centred at CERN, called the Tier 0 site, where the data from the detector originates. Subsequent tiers consist of increasing numbers of sites, but the computing and storage capacity at each site is successively smaller, and the reliability of the site decreases. Tier 1 sites are typically large national computing centres with high reliability systems, dedicated support staff, high bandwidth networks, and large storage capacity. They are also termed “Regional Centres” and act as hubs for the Tier 2 to 4 sites under their umbrella. Tier 2 sites represent institution or university computing centres which are predominantly dedicated to physics computing.

Tier 3 sites are small clusters typically belonging to individual working groups, and Tier 4 represents individual computers, typically desktops or laptops belonging to experimental collaborators. The current LCG Memorandum of Understanding (MoU) details commitments of Tier 0, 1, and 2 sites and describes Service Level Agreements, expected operational environment, qualitative requirements, and quantitative characteristics for computing power, disk storage, tape storage, and network bandwidth committed by institutions involved with LCG[42]. Figures 3.2, 3.3, and 3.4 illustrate the distribution of some of these quantities between the Tier 0, 1, and 2 sites. There is a shortfall in all three quantities: the MoU commits 118 MSI2K of computing resources, 42 PB of disk storage, and 43 PB of tape storage, while the experiments’ computing models require 140 MSI2K of computing resources, 47 PB of disk storage, and 67 PB of tape storage. As the MoU is the most recent document, it is possible that this is because the latest official experiment computing models are out of date with the current computing requirements.

Figure 3.3: Disk storage distribution for Tier 0, 1, and 2 sites committed to LCG in 2008. Of a total of 42 PB, Tier 0 provides 1 PB (3%), Tier 1 provides 27 PB (65%), and Tier 2 provides 14 PB (32%).

Figure 3.4: Tape storage distribution for Tier 0, 1, and 2 sites committed to LCG in 2008. Of a total of 43 PB, Tier 0 provides 14 PB (32%), Tier 1 provides 27 PB (64%), and Tier 2 provides 2 PB (5%).
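The shortfall between the MoU commitments and the experiments' stated requirements can be tabulated with a short script. The following is only an illustrative sketch: the totals are those quoted in the text, while the dictionary keys and the `shortfall` helper are names assumed here for the example.

```python
# Totals quoted in the text (MoU commitments vs. experiment computing models).
# The key names are illustrative assumptions, not terms from the MoU.
committed = {"cpu_msi2k": 118, "disk_pb": 42, "tape_pb": 43}
required = {"cpu_msi2k": 140, "disk_pb": 47, "tape_pb": 67}


def shortfall(committed, required):
    """Return {resource: (absolute gap, gap as a fraction of the requirement)}."""
    result = {}
    for name, need in required.items():
        gap = need - committed[name]
        result[name] = (gap, gap / need)
    return result


if __name__ == "__main__":
    for name, (gap, fraction) in shortfall(committed, required).items():
        print(f"{name}: short by {gap} ({fraction:.1%} of the requirement)")
```

On these figures the gap is largest for tape storage, both in absolute terms and as a fraction of the requirement.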

Within particle physics experiments, grid computing is taken to mean the transmission, processing, and storage of data once it is on commodity components. Every experiment will contain custom hardware, electronics, sensors, and processing which is “close” to the detector and part of the real-time “on-line” system, which is not considered part of the grid computing infrastructure. Grid computing provides facilities for what is known within the particle physics domain as “off-line” computing.

There are three primary operating modes for particle physics grids: reconstruction, simulation, and analysis.

Reconstruction is very predictable, as it must be completed as the raw detector data is produced during the annual 7-month operational period of the detector. It uses stable software with parameters describing the detector configuration and calibration values. Reconstruction must take place in near-real time, meaning the data produced by the detector each day must be reconstructed into physics events within a day. The throughput of reconstruction must match the rate of data generation because the detector operates continually during the 7-month period. Anything less would lead to a permanent and growing backlog until either the operational period came to an end or the disk storage buffers were saturated, at which point it would become necessary to flush the data and later retrieve it from tape storage. Near-real-time reconstruction also allows for data quality monitoring, which feeds back into controlling the run conditions. Beyond the annual operational window, the detector data may be re-processed (that is, reconstructed a second or third time) based on improved software, calibration data, or detector models. This reconstruction is planned in advance and administered at the experiment level.
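The backlog argument can be made concrete with a toy model. The sketch below is an illustration only: the function name, the daily rates, and the buffer size are hypothetical values chosen for the example and are not taken from any experiment's computing model. It tracks the unreconstructed data day by day and reports the first day on which the backlog reaches the disk buffer capacity.

```python
def simulate_backlog(days, produced_per_day, reconstructed_per_day, buffer_capacity):
    """Toy model of reconstruction keeping up with detector output.

    All quantities share one arbitrary unit of data volume (hypothetical).
    Returns (history, saturated_on): the backlog at the end of each day, and
    the first day the backlog reaches buffer_capacity (None if it never does).
    """
    backlog = 0
    history = []
    saturated_on = None
    for day in range(1, days + 1):
        backlog += produced_per_day
        # Reconstruction cannot process more data than is waiting.
        backlog -= min(backlog, reconstructed_per_day)
        history.append(backlog)
        if saturated_on is None and backlog >= buffer_capacity:
            saturated_on = day
    return history, saturated_on


if __name__ == "__main__":
    # Roughly 7 months of running, with throughput 10% below the data rate.
    history, saturated_on = simulate_backlog(210, 10, 9, 50)
    print("final backlog:", history[-1], "buffer full on day:", saturated_on)
```

With these assumed numbers, even a 10% throughput deficit fills the buffer well before the operational period ends, whereas matching the production rate keeps the backlog at zero.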

Simulation can be done by individuals, small working groups, or directed by the experiment. The experiment-level simulations are well defined in advance and typically consist of thousands of days of computing, performing wide-ranging simulation of detector response. Simulations conducted by individuals or working groups are much more chaotic in terms of the data requirements, software requirements, and computing load.

Analysis is the most chaotic and consists of individual physicists or small working groups making cross-cutting selections of reconstructed data in a search for interesting physics. This will typically use locally customised software to analyse the data,
