3.8 DC04 Performance Results
3.8.2 LCG Performance
The 123,000 jobs submitted to LCG provide a rich set of performance metrics, based on the logging of 46 details for every job, as shown in Table 3.1. The relatively ho-mogeneous nature of the LHCb jobs also facilitate consistent performance analysis.
The efficiency of the DIRAC system allowed the saturation of all available LCG computing resources, usually with 90-99% of all executing jobs at a grid site being LHCb DC04 jobs due to under-utilisation by other experiments and virtual organi-sations. Between days 20 and 60 of DC04 Figure 3.11 is representative of the state of the entire LCG.
Table 3.1: The 46 characteristics recorded for each job in DC04.
2 10 20 30 40 50 60 70 80 90
Time to submit job to LCG Resource Broker
2 10 20 30 40 50 60 70 80 90
Time to schedule job on LCG Resource Broker
Figure 3.16: Distribution of job submit and schedule time of LCG Resource Broker for 100k jobs during DC04
Job queue time for LCG jobs (<30m)
60 150 270 390 510 630 750 870 990 1080
Job queue time for LCG jobs (>30m)
Figure 3.17: Distribution of job queue time at site once scheduled.
Scheduling and Queuing
With the rich logging information collected by the DIRAC Services, the Agents, the jobs themselves, and the LCG interfacing software it is possible to reconstruct a good picture of the state of the grid during LHCb DC04. Figure 3.16 shows the performance of the LCG Resource Broker for processing job submissions and doing job matching. Figure 3.17 shows the distribution of job waiting times, which is the time between a job being queued and its arrival at a worker node to commence execution.
Jobs were submitted in batches, usually 500-2000 at a time. The 12 second average submission time shown in Figure 3.16 puts a daily limit of 7200 jobs sent to the Resource Broker. The 31 second scheduling time puts a daily limit of 3000 jobs scheduled by a Resource Broker. LCG addressed these limitations by deploying multiple RBs, however this complicated job management as jobs managed by one
3.8 DC04 Performance Results 59
RB were not visible from other RBs.
Job Results
Figure 3.11 shows the daily LCG submitted job results, which are described in more detail below.
Completed (48%): These jobs ran to completion on LCG.
Lost (2%): A small portion of all submitted jobs were lost without record of the result.
Cancelled (12%): Jobs were cancelled by an LHCb operator if they were not scheduled by LCG within 24-36 hours. This was to avoid the job being run with an expired proxy certificate.
C04PerformanceResults60
0 10 20 30 40 50 60 70 80 90 100
0 2000 4000 6000 8000
day
jobs
95664 DC04 Jobs Aborted by LCG, 16 July to 27 Oct 2004
79341 Retry Limit 13839 Could Not Plan 2074 Proxy Expired 396 No JDL 14 Condor Failure
0 10 20 30 40 50 60 70 80 90 100
0 20 40 60 80 100
day
% 82.94% Retry Limit
14.47% Could Not Plan 2.17% Proxy Expired 0.41% No JDL 0.01% Condor Failure
Figure 3.18: LCG distribution of causes of aborted job
3.8 DC04 Performance Results 61
Aborted (37%): There were five types of LCG-internal errors which resulted in jobs being aborted. The distribution of these is illustrated in Figure 3.18 and described below:
Retry Limit (83%): Jobs which failed to successfully transit from LCG to a Worker Node to begin execution would be resubmitted. Typically the cause of the fault was within LCG itself, therefore the fault recurred on subsequent retries until the limit was reached.
Could Not Plan (14%): The Resource Broker encountered an error resulting in an inability to schedule the job to any site.
Proxy Expired (2%): By the time the job commenced the proxy certificate used to submit it had expired leaving the executor with no authorization to proceed.
This low error rate conceals the fact that 12.5% of all submitted jobs were cancelled in order to avoid this situation. A job whose proxy expires in mid-execution will typically be unable to register or transfer any data or to report any results, therefore this is a critical error.
No JDL (0.41%): This error was concentrated on three days during DC04, and corresponded to a failure in the Resource Broker where it lost access to the original JDL.
Condor Failure (0.01%): This error only occurred 14 times during all of DC04 and was attributed to a failure of the LCG interaction with the Condor sched-uler, or a failure of the Condor scheduler itself.
Unfortunately, due to the limited logging information supplied by LCG, the 37%
of tasks which were aborted by LCG cannot easily be diagnosed, beyond the five failure categories listed above. In particular, the common “Retry Limit” failure, which occurred in 30% of all submitted jobs, cannot be diagnosed further. From working with the LCG Resource Broker developers it was reported that this would often occur when the Resource Broker contained either corrupted or out of date information regarding available computing resources, resulting in assignment of a task to a computing resource which was either unavailable or unable to accept the task when the Resource Broker attempted to route it there. Due to a quirk in the Resource Broker allocation strategy, when this routing failed it would restart the task allocation process from scratch, usually making the same decision and consequently resulting in the same failure, until the submission retry limit was reached. The other source of “Retry Limit” failures was a situation which would arise where the submitted job description (a JDL file) would be lost on the Resource Broker, leading to a failure at a later stage in the task management process.
Worker Node Characteristics
4883 AMD jobs
job time (h)
500 1000 1500 2000 2500 3000 3500
0
500 1000 1500 2000 2500 3000 3500
0
500 1000 1500 2000 2500 3000 3500
0
500 1000 1500 2000 2500 3000 3500
0 10 20 30 40
Figure 3.19: Job CPU time vs. CPU speed sorted by processor type. Only jobs which produced 500MB of output or less are shown. Illustrates variation in CPU processing time for different processor types and speeds.
0.60.7 0.9 1.1 1.3 1.5 1.7 1.9 2.1 2.3 2.5 2.7 2.9 3.1 3.3 3.5
Figure 3.20: CPU Speed of LCG nodes used by DC04 jobs.
DC04 provided an opportunity to observe the degree of heterogeneity of comput-ing systems within a grid environment. Figure 3.19 illustrates the execution time vs. processor speed for four different classes of processor (AMD, Pentium 3, Pen-tium 4, and Xeon). This covers all LCG jobs which completed successfully and is a picture of what the jobs “saw” rather than a description of the grid itself. Hyper Threading, or processor overloading (more than one active job on the processor) was difficult or impossible to identify at run-time, however this figure reveals these
3.8 DC04 Performance Results 63
256 512 1024 1536 2048 2560 3072 3584 4096 0
1647
mean: 1507.18 mode: 1024.00 plotted evts: 3273
memory (MB)
number of nodes
LCG Node Physical Memory Distribution
Figure 3.21: Physical memory of LCG nodes used by DC04 jobs.
dual 80%
LCG Processor Count Distribution
quad 18%
single 2%
Figure 3.22: CPU count of LCG nodes used by DC04 jobs.
conditions with bimodal job densities for a given processor speed. It also reveals the relative performance of the AMD processors is better than the Intel processors based on processor speed. In terms of an overview of the actual nodes within LCG, the DC04 jobs ran on 3272 unique nodes, and their processor speeds and physical memory distributions are shown in Figures 3.20 and 3.21 respectively. Figure 3.22 shows the distribution of processors per node.
50 100 200 300 400 500 600 700 800 900 1000 0
22806
mean: 407.19 mode: 450.00 plotted jobs: 57956
overflow >1000: 6 o/f mean: 1270.60
data transfer (MB)
number of jobs
LCG Data Transfer Distribution
Figure 3.23: Histogram of data generated by DC04 jobs.
5 10 20 30 40 50 60 70 80 90 100 0
6302
mean: 41.66 mode: 40.00 plotted jobs: 57893
overflow >100: 69 o/f mean: 261.09
bandwidth (Mb/sec)
number of jobs
LCG Data Transfer Bandwidth
Figure 3.24: Bandwidth of data transfer stage at end of each DC04 job. Typical transfer volume was 400 MB.
Data Transfer Bandwidth
As the DC04 jobs produced relatively large amounts of data and transferred these to remote sites on job completion, it was possible to observe the sustained network bandwidth. Figure 3.23 illustrates the distribution of generated data over almost 58 thousands jobs, while Figure 3.24 shows the sustained transfer bandwidth. Although there were problems with the data management mechanisms within DIRAC and provided by LCG (as discussed in Section 3.7.3), LHCb was encouraged to see sustained transfer rates of 40 Mb/s were achieved during DC04.