Case Study Performance - The Origin of Data: Enabling the Determination of Provenance in Multi

Having looked at the performance of the Provenance Store in a controlled environment, we now investigate the impact of p-assertion recording on the example application ACE. Recall from Chapter 5 that ACE is a combination of multiple distributed components running across multiple institutions. Our implementation of ACE was designed to run in a Grid environment. Figure 6.6 shows the deployment workflow for the implementation of the ACE application. The different organisations involved are identified by grey boxes with dotted lines. Sequences are obtained from an external provider through either a Web Service or FTP. These are then collated locally (i.e. on the bioinformatician’s computer) into several sample sequences. The Jobs Creator then generates a series of jobs to be submitted to a Grid and executed. The executables used by the jobs are pre-staged on the Grid (i.e. they are already available on the Iridis cluster).

[Figure: workflow diagram. Legend: provided by a Web Service; executed locally; executed on the Grid. Sequences feed Collate Sample, which yields a collated sample and a sample size, and the Jobs Creator generates the jobs. Each Job box contains the steps Encode by Groups, Compress, Compute Entropy and Calculate Efficiency, with data items labelled recoded sample, compressed size, shannon entropy, group coding and information efficiency value.]

Figure 6.6: ACE deployment workflow

Chapter 6 Evaluation 144

Jobs are submitted to Iridis via a Globus interface using Condor-G [76]. Globus provides an abstraction over a wide variety of schedulers and job submission interfaces. Therefore, the jobs created by the Jobs Creator can run on any Globus-enabled Grid that can run Java executables. This portability is important because it allows ACE to be run wherever computational resources can be procured. For example, ACE could be run on the United Kingdom’s National Grid Service (http://www.grid-support.ac.uk/) or the Open Science Grid (http://www.opensciencegrid.org/). In fact, the certificates used by our implementation to access Iridis using Globus are issued by the National Grid Service.

The collate sample portion of the ACE workflow is typically run once, and a number of jobs are then generated to process these samples with different groups. In the experimental setup for these performance measurements, one run of ACE consists of 80 jobs. Each job analysed 900 unique groups on 5 different 100K collated samples; thus, a job generates 4500 information efficiency values. A set of 900 groups is a 50K file. Process documentation that represents the provenance of each information efficiency value is stored across two provenance stores. One store is deployed on the Grid infrastructure; the other is deployed on the same network where the local portion of ACE executes. The provenance store hosted locally contains documentation of the generation of the 80 jobs, which is 5 MB in size on disk.
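The totals implied by this setup follow directly from the figures quoted above. A minimal sketch of the arithmetic, in which the constant and function names are illustrative and not taken from the ACE implementation:

```python
# Figures quoted in the experimental setup above.
JOBS_PER_RUN = 80
GROUPS_PER_JOB = 900
SAMPLES_PER_JOB = 5


def values_per_job(groups: int = GROUPS_PER_JOB, samples: int = SAMPLES_PER_JOB) -> int:
    """Each group is evaluated once against each collated sample."""
    return groups * samples


def values_per_run(jobs: int = JOBS_PER_RUN) -> int:
    """Information efficiency values produced by one run of ACE."""
    return jobs * values_per_job()


print(values_per_job())  # 4500
print(values_per_run())  # 360000
```

Assuming every job completes and produces its full set of values, one run therefore yields 80 × 4500 = 360,000 information efficiency values.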

The process documentation created by ACE is extremely detailed; the steps used to compute each information efficiency value are recorded. To prevent duplication of data and the creation of a larger than necessary provenance store, we make use of documentation styles to enable references to input data. Therefore, a collated sample is only documented once and not for every information efficiency value computation that takes it as input. Furthermore, intermediate data is not stored in the provenance store if it can be generated by a well-known and documented algorithm. For example, the output of the PPMZ compression algorithm is not stored because it can be regenerated accurately. After processing one run of ACE, the provenance store deployed on the Grid contains 14 GB of data. In Section 6.6, we discuss the trade-offs between process documentation detail and performance. For ACE, our choice of documentation detail is a good compromise as it allows all our use case questions to be effectively answered while still achieving acceptable performance. We now calculate the effect of p-assertion recording on ACE. To provide an average, the data used was collected from three runs of ACE where the same jobs were submitted for each run.

The most pertinent measure of application performance for the scientist is the duration of an application run measured in wall clock time. Therefore, we calculate the slowdown of ACE when recording p-assertions in terms of wall clock time. Because we run ACE in an uncontrolled environment, we now show that the average difference in duration between recording and non-recording jobs provides a reasonable approximation for the impact of p-assertion recording on application performance.


We start by noting that the time to collate samples is constant and is small when compared to the time necessary to run jobs. Discounting sample collation time, we can approximate the time to perform one run of the ACE application by summing the runtimes of all the application’s jobs and dividing that sum by the average number of jobs run in parallel. When jobs actually get scheduled and run is beyond our control and is representative of the load on Iridis rather than of our experiment.
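The approximation just described can be written as a short function. This is a sketch only: the function name and the example inputs (80 identical 22-minute jobs, 20 running in parallel on average) are hypothetical and are not measurements from the experiment.

```python
from typing import Sequence


def approximate_run_time(job_runtimes_s: Sequence[float], avg_parallel_jobs: float) -> float:
    """Approximate wall-clock time of one run, ignoring sample collation.

    Sums the runtimes of all jobs and divides by the average number of
    jobs that ran in parallel.
    """
    if avg_parallel_jobs <= 0:
        raise ValueError("average parallelism must be positive")
    return sum(job_runtimes_s) / avg_parallel_jobs


# Hypothetical inputs, chosen only to illustrate the formula.
runtimes = [22 * 60.0] * 80
print(approximate_run_time(runtimes, avg_parallel_jobs=20) / 60)  # 88.0 (minutes)
```

Because the divisor depends on cluster load, the same set of jobs can give very different run times from one run to the next, which is why the comparison is made at the level of individual job runtimes.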

This approximation is reasonable because all jobs run on a standard computational environment and there are no dependencies between jobs. Furthermore, we observe that in ACE, parameter variation in terms of groups has very little effect on job duration, as we have 95% confidence that a job will last 22 minutes ± 30 seconds. Figure 6.7 shows that job times follow a normal distribution and the majority of job times fall within the stated confidence interval. Thus, from Figure 6.7 and our reasoning, we conclude that the average job runtime is representative of overall application runtime. Consequently, a slowdown in job runtime is a good predictor of the slowdown in application runtime.

[Figure: histogram of job runtime frequency. x-axis: job runtimes (hours:minutes:seconds), from 0:20:30 to 0:24:00; y-axis: frequency, from 0 to 90.]

Figure 6.7: Frequency distribution of job times

We now consider the scenario in which jobs record p-assertions. In this scenario, the runtime is affected by two additional factors: the p-assertion creation time and the p-assertion record time. We note that the record time for p-assertions can be influenced by contention for the provenance store from other jobs, as shown in Figure 6.3. However, the influence of contention is negligible because, although the distribution of job parallelism is broad, ranging from 0 to 60 jobs in parallel (see Figure 6.8), a majority of p-assertion recording job times fall within two standard deviations of the average (i.e. within plus or minus one minute of the average job runtime). This clustering of job times is shown in Figure 6.9 and mirrors the result from Figure 6.7. Thus, contention is not a factor influencing p-assertion record time within ACE.

Figure 6.8: Distribution of job parallelism
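The clustering argument used to discount contention can be checked mechanically. The sketch below is one way to do so, assuming job runtimes are available as a list of seconds; the function name is hypothetical and the population standard deviation is an assumption, since the text does not say which estimator was used.

```python
from statistics import mean, pstdev
from typing import Sequence


def fraction_within_two_sd(runtimes_s: Sequence[float]) -> float:
    """Fraction of job runtimes lying within two standard deviations of the mean."""
    mu = mean(runtimes_s)
    sigma = pstdev(runtimes_s)  # population standard deviation (an assumption)
    within = [t for t in runtimes_s if abs(t - mu) <= 2 * sigma]
    return len(within) / len(runtimes_s)
```

A value close to 1 for the recording jobs, despite widely varying parallelism, is the kind of evidence the argument relies on: if contention mattered, runtimes would spread out as more jobs competed for the provenance store.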

Having discounted both contention as an influence on p-assertion record time and parameter variation as an influence on job runtime, we conclude that the difference between the runtime of ACE and the runtime of ACE with p-assertion recording is the time it takes to create and record p-assertions. Figure 6.10 shows the maximum, minimum and average job record times over all application runs and the difference between times with and without p-assertion recording. Taking the average, there is a 13% overhead on job runtime for p-assertion recording. From our previous reasoning, we conclude that there is a 13% overhead for recording on application runtime.
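The overhead figure is a relative difference of mean job runtimes. A minimal sketch of that calculation follows; the function name is hypothetical and the input values are illustrative, constructed to reproduce a 13% slowdown rather than taken from Figure 6.10.

```python
from statistics import mean
from typing import Sequence


def recording_overhead(recording_s: Sequence[float], baseline_s: Sequence[float]) -> float:
    """Relative slowdown of the mean job runtime when p-assertions are recorded."""
    base = mean(baseline_s)
    return (mean(recording_s) - base) / base


# Illustrative values only: a 22-minute baseline and jobs that take 13% longer.
baseline = [22 * 60.0] * 3
recording = [t * 1.13 for t in baseline]
print(f"{recording_overhead(recording, baseline):.0%}")  # 13%
```

If the 22-minute figure quoted earlier is taken as the non-recording baseline, a 13% overhead corresponds to roughly 2 minutes 52 seconds of additional runtime per job.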

We believe this value is acceptable in light of the functionality gained from having an accurate representation of the process by which the results of ACE are produced. We now analyse how the process documentation recorded can be used to satisfy the provenance questions posed in Chapter 5.

In document The Origin of Data: Enabling the Determination of Provenance in Multi institutional Scientific Systems through the Documentation of Processes (Page 155-158)