TAG Data and Services - The quality-aware service selection problem: an adaptive evolutionary a

RAW Size    1.6 MB
ESD Size    0.6 MB
AOD Size    100 kB
TAG Size    1 kB

Table 2.2: Nominal Event Data Sizes

2.3 TAG Data and Services

In order to gain an understanding of the TAG system and its challenges, it is important to briefly discuss the content of TAGs, the main use cases and the services provided to enable those use cases.

2.3.1 TAG Content

In a TAG record, an event is described by approximately 300 variables. The TAG content was originally defined by a special task force in 2006 [6] and has since been under constant evaluation by the ATLAS physics and combined performance groups. The aim is to provide content that supports efficient event preselection and covers all use cases defined by the physics groups. The exact content is beyond the scope of this overview, but the classes of variables are helpful for understanding the TAG use cases. The variables can be classified as follows.

• Event quantities: event number, run number (a run/event number pair uniquely identifies an event), luminosity block and missing energy values.

• Data quality: bit-encoded words representing error codes per sub-detector.

• Trigger information: bit-encoded word specifying if an event passed a certain trigger or not.

• Number of objects and their properties: number of electrons, photons, muons, jets, etc. in the event, and basic object properties (e.g., momentum).

• Physics TAG attributes: words reserved for each physics and performance group, which each group is free to encode in a “true/false” manner to mark an event as interesting for a specific analysis.

• Collection information: references to the same event in upstream data products, namely AOD, ESD and RAW data. These references (called Globally Unique Identifiers, or GUIDs) allow for back-navigation to the correct AOD, ESD and RAW file to retrieve the event.

TAGs are write-once, read-many: once written, the content is never updated. If already-processed RAW data are reconstructed with a new software version and new calibration, a new version of all data products, including TAGs, is produced. This process is referred to as reprocessing.
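To make the variable classes listed above more concrete, the following is a minimal sketch of how a TAG record might be modelled in Python. The field names, the trigger bit layout and the `passed_trigger` helper are illustrative assumptions, not the actual ATLAS TAG schema (which holds approximately 300 variables). The frozen dataclass mirrors the write-once nature of TAGs.

```python
from dataclasses import dataclass, FrozenInstanceError


@dataclass(frozen=True)  # frozen: fields cannot be reassigned after creation
class TagRecord:
    # Event quantities (a run/event number pair uniquely identifies an event)
    run_number: int
    event_number: int
    lumi_block: int
    # Trigger information: one bit per trigger (hypothetical bit layout)
    trigger_word: int
    # Number of objects and a basic object property
    n_electrons: int
    loose_electron_pt1: float
    # Collection information: GUIDs of the upstream files holding this event
    aod_guid: str
    esd_guid: str
    raw_guid: str

    def passed_trigger(self, bit: int) -> bool:
        """Return True if the trigger assigned to position `bit` fired."""
        return bool((self.trigger_word >> bit) & 1)

    def key(self) -> tuple:
        """Return the (run, event) pair identifying the event."""
        return (self.run_number, self.event_number)


# Toy record; all values are made up for illustration.
tag = TagRecord(
    run_number=52280,
    event_number=1001,
    lumi_block=12,
    trigger_word=0b0101,  # bits 0 and 2 set
    n_electrons=1,
    loose_electron_pt1=25000.0,
    aod_guid="guid-aod-1",
    esd_guid="guid-esd-1",
    raw_guid="guid-raw-1",
)
```

Attempting to assign to a field of `tag` raises `FrozenInstanceError`, so a reprocessed event would be represented by a new record rather than an update of the existing one.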

2.3.2 TAG Use Cases

[Figure: the processing steps shown are recon RAWtoESD, recon ESDtoAOD, merge AODtoAOD, merge AODtoTAG, merge TAGtoTAG, upload to DB and posttag at DB; the data products shown are RAW, ESD, AOD and TAG.]

Figure 2.3: Data Reconstruction Chain, from RAW Data to TAGs [8].

In [32], the role of the TAG database (and more generally the TAG system) is described as “to support seamless discovery, identification, selection and retrieval of ATLAS event data held in the multipetabyte distributed ATLAS Event Store.” TAGs are not meant for physics analysis itself, but for preselecting interesting events for further analysis. This preselection can be based on any of the TAG variables. From the selected events, physicists can then navigate to the corresponding AOD, ESD or RAW data to perform their specific analysis. As this analysis process can be very resource-intensive, it is preferable to run it only on events that are known to satisfy basic query predicates. An example TAG query is provided in Equation 2.1.

    RunNumber ≥ 52280 AND RunNumber ≤ 52304 AND
    (LooseElectronPt1 ≥ 20000 OR LooseMuonPt1 ≥ 20000 OR TauJetPt1 ≥ 20000)    (2.1)
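As an illustration, the example query above can be written as a plain Python predicate and applied to a handful of toy records. This is only a sketch: the attribute names `RunNumber`, `LooseElectronPt1`, `LooseMuonPt1` and `TauJetPt1` are taken from the query, whereas `EventNumber`, the dictionary representation and all values are assumptions made for the example. In the real system the query is evaluated by the TAG database, not in client code.

```python
def select(tag: dict) -> bool:
    """Predicate equivalent to the example TAG query."""
    in_run_range = 52280 <= tag["RunNumber"] <= 52304
    has_high_pt_object = (
        tag["LooseElectronPt1"] >= 20000
        or tag["LooseMuonPt1"] >= 20000
        or tag["TauJetPt1"] >= 20000
    )
    return in_run_range and has_high_pt_object


# Toy records; values are made up for illustration.
tags = [
    {"RunNumber": 52280, "EventNumber": 1,
     "LooseElectronPt1": 25000, "LooseMuonPt1": 0, "TauJetPt1": 0},
    {"RunNumber": 52290, "EventNumber": 2,
     "LooseElectronPt1": 5000, "LooseMuonPt1": 8000, "TauJetPt1": 1000},
    {"RunNumber": 52310, "EventNumber": 3,
     "LooseElectronPt1": 30000, "LooseMuonPt1": 0, "TauJetPt1": 0},
    {"RunNumber": 52304, "EventNumber": 4,
     "LooseElectronPt1": 0, "LooseMuonPt1": 0, "TauJetPt1": 20000},
]

# Keep only the (run, event) pairs of the events passing the preselection.
selected = [(t["RunNumber"], t["EventNumber"]) for t in tags if select(t)]
```

Events 1 and 4 pass: event 2 lies in the run range but has no object above threshold, and event 3 has a high-momentum electron but lies outside the run range.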

According to [6], [30], [31] and recently arisen requirements, the main TAG use cases can be summarized as follows:

1. [...] events are recorded, the TAG database can be used to “check what data is there,” in order to get a general overview. The TAGs are thus used as an index of the recorded data.

2. Using a TAG selection as input to a physics analysis. For example, suppose that all the TAGs for a specific set of data (ESD, AOD) have been created. A physicist looks at the TAGs for this entire set of data and finds some interesting classes of events. The physicist then wants to take the result of this TAG-based preselection and use it as input to a job that reads the upstream AOD data. This can mean:

(a) to locally extract a TAG file from the database and use it as input to an analysis job, or

(b) to skim the data and directly get a RAW/ESD/AOD file containing only the events passing the provided TAG query.

3. Locate events based on a run and event number, and get the corresponding identifiers of the files containing these events. This functionality is referred to as event lookup and is integrated into central ATLAS Grid tools.
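The event lookup use case can be pictured as an index from (run number, event number) pairs to file identifiers. The sketch below is a toy in-memory version under stated assumptions: the record layout, the GUID strings and the function names `build_index` and `event_lookup` are hypothetical and do not reflect the interface of the ATLAS Grid tools.

```python
from collections import defaultdict


def build_index(tags):
    """Map (run, event) -> set of GUIDs of the files containing that event."""
    index = defaultdict(set)
    for t in tags:
        index[(t["run"], t["event"])].update(t["guids"])
    return index


def event_lookup(index, run, event):
    """Return the sorted GUIDs for (run, event), or [] if the event is unknown."""
    return sorted(index.get((run, event), ()))


# Toy TAG records; GUID strings are placeholders, not real identifiers.
tags = [
    {"run": 52280, "event": 1, "guids": ["aod-A", "esd-A", "raw-A"]},
    {"run": 52280, "event": 2, "guids": ["aod-A", "esd-B", "raw-B"]},
]
index = build_index(tags)
guids = event_lookup(index, 52280, 2)
missing = event_lookup(index, 99999, 1)
```

Using `index.get` rather than `index[...]` in the lookup avoids creating empty entries for events that are not in the index.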

These generic use cases are addressed by the services of the TAG system, as described in the following two subsections. More specific use cases continually arise in physics groups, as the analyses evolve.

2.3.3 TAG Databases

As soon as TAG files are available, their content is uploaded to the TAG databases by the Tier-0 management system, using tools provided by POOL [82]. The TAG databases are relational Oracle databases. There are several deployments at several tiers of ATLAS, and their federation can be considered a distributed database system, as the data can be queried transparently on any database using a metadata registry (described in Chapter 4, Section 4.4).

Per year of data taking, the overall volume of relational TAGs grows by approximately 15 terabytes. The operational parameters defined in Tables 2.1 and 2.2 alone yield a much lower volume (Equation 2.2).

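\[
\begin{aligned}
\text{Trigger rate} \times \text{seconds/day} \times \text{days/year} \times \text{TAG size}
  &= 200 \times 50000 \times 200 \times 1\,\text{kB} \\
  &= 2 \times 10^{9}\,\text{kB} \simeq 1.8\,\text{TB}
\end{aligned}
\tag{2.2}
\]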

However, this estimate does not account for the database indexes, which take up a considerable amount of space. Additionally, the above parameters are valid only for first-pass processing at Tier-0, whereas data are regularly reprocessed, resulting in several versions of TAGs. Finally, TAGs of simulated events are also uploaded to the database. These are referred to as Monte Carlo TAGs.
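The nominal estimate can be reproduced with a few lines of arithmetic. The sketch below assumes 50 000 seconds of data taking per day, which is the value consistent with the stated total of 2 × 10^9 kB, and further assumes that 1 kB means 1000 bytes and that the quoted 1.8 TB is expressed in binary units (2^40 bytes). These unit conventions are assumptions, since the text does not state them.

```python
# Nominal operational parameters (seconds_per_day is an assumption, see above).
trigger_rate_hz = 200      # events recorded per second
seconds_per_day = 50_000   # effective data-taking seconds per day
days_per_year = 200        # data-taking days per year
tag_size_kb = 1            # nominal TAG size per event, in kB

events_per_year = trigger_rate_hz * seconds_per_day * days_per_year
volume_kb = events_per_year * tag_size_kb

# Unit conversion, assuming 1 kB = 1000 bytes.
volume_bytes = volume_kb * 1000
volume_tb_decimal = volume_bytes / 10**12   # terabytes of 10^12 bytes
volume_tb_binary = volume_bytes / 2**40     # terabytes of 2^40 bytes

print(f"{events_per_year:.1e} events/year")
print(f"{volume_tb_decimal:.2f} TB (decimal), {volume_tb_binary:.2f} TB (binary)")
```

Under these assumptions the raw TAG payload is about 2 TB per year, roughly a factor of eight below the observed growth of approximately 15 terabytes; the gap is attributed to indexes, reprocessed versions and Monte Carlo TAGs.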

Deletion policies are in place, so that unneeded data are deleted regularly. However, at any given point in time during the experiment, each TAG database holds several terabytes of TAGs. Providing an Oracle service of this scale is a challenge and requires a significant technical management effort. Much effort has thus been put into the optimization of data storage and query management. As part of the posttag process depicted in the right corner of Figure 2.3, TAGs are indexed, horizontally and vertically partitioned, and compressed.

For details on the operational challenges of the multi-terabyte TAG databases the reader is referred to [84].

2.3.4 TAG Services

The TAG system is composed of a suite of tightly coupled services, allowing access to TAG data and supporting the use cases defined in Section 2.3.2. The main TAG services are listed and briefly described below, with emphasis on an overview of the functionality rather than on technical details.

TAG Database. Although it is the data source, the TAG database is also considered a service (see Section 2.3.3).

TASK Lookup. Service that allows looking up the TAG registry to locate a given set of data. TASK (TAG Application Service Knowledgebase) is described in detail in Chapter 4 (Section 4.4).

iELSSI. The interactive Event Level Selection Service Interface, also referred to as the TAG browser. It is a web interface, implemented using PHP and AJAX technologies, that allows browsing relational TAG data [107]. It requires a connection to one or more TAG databases and allows defining queries on all TAG attributes. The attributes are displayed in an ordered manner mimicking a typical selection process. Based on the entered selection criteria, it allows count queries and display of results, and can invoke the extract, skim and histogram services defined below. To properly display and process TAG information, it further uses the trigger decoding, extract XML builder and Lumiblock range builder services (see below). Figure 2.4(a) shows an example TAG query entered in iELSSI.

Extract. Service producing a TAG ROOT file containing the specified attributes of the TAGs for the selected events. The extract service uses POOL [82] utilities wrapped in Python. The output of extract can be used as input to a local or distributed analysis job, in the latter case using Grid job submission tools [30]. Extract can be invoked from iELSSI or directly from the command line, in both cases using an interface defined in XML. Figure 2.4(b) shows the integration of the extract service in iELSSI (panel “Retrieve without skimming”).

Skimming. Service producing an AOD or ESD ROOT file for the events passing the TAG-based preselection. The main use case of the skimming service is that of a physicist who wants to first find interesting events satisfying a specified query, and then run some analysis on those events, without needing to know where these events are located, either in terms of files or in terms of the sites hosting those files. Figure 2.4(b) shows the integration of the skimming service in iELSSI (panel “Retrieve with skimming”).

Event Lookup. The information contained in the TAGs allows determining in which physical RAW, ESD and AOD file a particular event is residing. As such it is the only event-level file reference. Event lookup is a service returning the GUIDs of the files containing specific events


In document The quality-aware service selection problem: an adaptive evolutionary approach (Page 34-38)