5 Validation Strategy of MaStA
6.6 Validation of the Workload Assumption
The requirement of this procedure is the justification of the hypothesis (the Workload Assumption of MaStA):
The cost of running the I/O stream generated by an application is approximately the same as running the I/O stream generated by the workload abstraction.
This procedure essentially validates that workload is correctly modelled. The strategy used to validate this hypothesis is illustrated in Figure 6.6. The workloads generated from OO1, OO1b, OO7 and MOB are characterised by a number of workload variables. These variables are used to drive a synthetic workload generator that produces workloads with similar numbers of data reads and writes, and similar locality properties, to the original applications. The I/O costs of executing the synthetic workloads (synthetic I/O costs) on each recovery mechanism are measured. These costs are compared with the total real costs of the original workloads recorded in Section 6.3.
The hypothesis is justified if, for each pair of recovery mechanisms where the variation in the total costs of executing a given workload is significant (>5%), the synthetic I/O costs can be used to select the mechanism that incurs the lower total cost. There are 103 such variations in the workloads executed.
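The acceptance criterion above can be sketched in code. This is a minimal illustration, not the procedure used in the thesis: the function names are hypothetical, the costs in the usage note are invented, and measuring the variation relative to the lower of the two total costs is an assumption, since the text does not state how the 5% figure is computed.

```python
from itertools import combinations


def significant_pairs(total_costs, threshold=0.05):
    """Return mechanism pairs whose total costs differ by more than `threshold`.

    Assumption: the variation is measured relative to the lower total cost.
    """
    pairs = []
    for a, b in combinations(sorted(total_costs), 2):
        low, high = sorted((total_costs[a], total_costs[b]))
        if low > 0 and (high - low) / low > threshold:
            pairs.append((a, b))
    return pairs


def count_correct(total_costs, synthetic_costs, threshold=0.05):
    """Count significant pairs for which the synthetic I/O costs rank the
    two mechanisms in the same order as the total real costs."""
    pairs = significant_pairs(total_costs, threshold)
    correct = sum(
        1
        for a, b in pairs
        if (total_costs[a] < total_costs[b])
        == (synthetic_costs[a] < synthetic_costs[b])
    )
    return correct, len(pairs)
```

For one workload, passing dictionaries keyed by mechanism name (for example `{"AISP": 100.0, "DataSafe": 103.0, "LSD": 140.0}`, values invented for illustration) returns the number of correct predictions and the number of significant pairs. Summing both over all workloads and platforms would give a tally of the same kind as the 103 comparisons reported here.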
[Figure 6.6: The Strategy Used to Validate the Workload Assumption. Diagram labels: workloads from OO1, OO1b, OO7 and MOB (database reads and writes); workload analyser; workload variables; synthetic workload generator; synthetic workloads (database reads and writes); AISP, DS, LSD; I/O costs of synthetic workloads; predictions; total costs of real workloads.]

6.6.1 Characterising Workload
The number of variables used to characterise workloads is kept to a minimum to ensure that the design and implementation of the synthetic workload generator are tractable. At the same time, the variables have sufficient expressive power to ensure that the synthetic I/O costs are accurate enough to predict the relative total costs of recovery mechanisms for a given workload. The variables used to characterise workloads are given in Table 6.2. The workload analyser makes use of the variable cache, and the knowledge that the recovery mechanisms employ LRU page replacement strategies, to calculate the values of read and readRecent.
In the definition of readFaultLoc, two logical database pages are considered near to one another if they are less than 1920 logical pages (15 MB) apart. This value is chosen to reflect the size of the disk partition used to measure clustered I/O (Appendix B).
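The following is a minimal sketch of how an analyser of this kind might derive the read-related variables from a trace of page reads by simulating an LRU cache. It is not the thesis's analyser: the function name and the trace format (a sequence of logical page numbers) are hypothetical, and write operations are ignored for simplicity. Only the 1920-page nearness threshold and the LRU policy come from the text.

```python
from collections import OrderedDict

NEAR_PAGES = 1920  # pages closer than this are "near" (value from the text)


def analyse_reads(page_trace, cache_bytes, page_bytes):
    """Simulate an LRU cache over a trace of logical page numbers and
    return counts for read, readRecent and readFaultLoc."""
    capacity = cache_bytes // page_bytes  # cache size in pages
    lru = OrderedDict()                   # keys ordered oldest to newest
    read = read_recent = read_fault_loc = 0
    last_fault = None                     # page number of the previous fault

    for page in page_trace:
        read += 1
        if page in lru:
            # Cache hit: no page fault; mark the page most recently used.
            lru.move_to_end(page)
            read_recent += 1
            continue
        # Page fault: check locality against the previously faulted page.
        if last_fault is not None and abs(page - last_fault) < NEAR_PAGES:
            read_fault_loc += 1
        last_fault = page
        lru[page] = True
        if len(lru) > capacity:
            lru.popitem(last=False)       # evict the least recently used page

    return {"read": read, "readRecent": read_recent,
            "readFaultLoc": read_fault_loc}
```

With a two-page cache, the trace `[0, 1, 0, 5000, 5001, 0]` yields six reads, one cache hit (the second access to page 0) and two near faults (page 1 after page 0, and page 5001 after page 5000).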
Note that the variables used here assume that transactions are executed serially, as is the case in the workloads used in this validation procedure. Applications exhibiting concurrent behaviour may be accommodated by adding transaction behaviour variables to the workload abstraction. These may be, for example, the average number of concurrently executing transactions and the average number of concurrent transactions that access and update the same page.
Workload Variable   Description
read                the number of read operations performed
readRecent          the number of reads that access data already in the cache (no page faults incurred)
readFaultLoc        the number of page faults in which the database page accessed is logically near the previously faulted page
update              the number of write operations performed
firstUpdate         the number of read operations performed before the first write operation
updateTrans         the sum of the number of updates performed by each transaction on pages already updated by the transaction
updateTemp          the number of pages updated by a transaction that have been updated by a previous transaction
commit              the number of commit operations
[...]               the size of the virtual database in bytes
cache               the size of the cache in bytes
page                page size in bytes

Table 6.2: Workload Variables Used to Characterise Workloads

6.6.2 Synthetic Workload Generator
The synthetic workload generator takes as input values for the variables in Table 6.2 and produces workloads consisting of database access, update and commit operations. The generator uses a probabilistic approach to determine whether each access generated is a read or a write, and to select the database page accessed by each operation.
• An operation is a read if the number of operations generated so far in the workload is less than firstUpdate. Thereafter, an operation is a read with probability read/(read + write), and a write otherwise.
• If a read operation is generated, the probability that the page accessed has been read recently, so that the operation does not cause a page fault, is readRecent/read. If a read operation does cause a page fault, the probability that the faulted page is near the previously faulted page is readFaultLoc/(read - readRecent).
• If a write operation is generated, the probability that the operation changes a page already updated by the current transaction is updateTrans/update. If so, a page already updated by the transaction is randomly selected. If not, the probability that the operation updates a page changed by a previous transaction is updateTemp/(update - updateTrans).
• A commit operation is performed every (read + update)/commit operations.
The standard library function random was used to produce the random values required by the synthetic workload generator.
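The rules above can be sketched as follows. This is an illustrative simplification rather than the generator built for the thesis: the function name, the dictionary interface and the category labels are hypothetical, and the sketch only labels each operation with its category instead of choosing concrete page numbers. It also takes the number of writes to be the variable update, and rounds the commit interval down to a whole number of operations.

```python
import random


def generate(v, seed=0):
    """Return a list of (operation, category) tuples drawn according to
    the probabilities described in the text. `v` maps variable names from
    Table 6.2 to their values."""
    rng = random.Random(seed)
    total = v["read"] + v["update"]
    faults = v["read"] - v["readRecent"]
    other_updates = v["update"] - v["updateTrans"]

    # Probabilities from the rules above; guards avoid division by zero.
    p_read = v["read"] / total if total else 0.0
    p_recent = v["readRecent"] / v["read"] if v["read"] else 0.0
    p_near = v["readFaultLoc"] / faults if faults else 0.0
    p_trans = v["updateTrans"] / v["update"] if v["update"] else 0.0
    p_temp = v["updateTemp"] / other_updates if other_updates else 0.0
    # Assumption: the commit interval is rounded down to an integer.
    commit_every = max(1, total // v["commit"]) if v["commit"] else None

    ops = []
    for i in range(total):
        if i < v["firstUpdate"] or rng.random() < p_read:
            if rng.random() < p_recent:
                category = "cached"            # no page fault
            elif rng.random() < p_near:
                category = "fault-near"        # near the previous fault
            else:
                category = "fault-far"
            ops.append(("read", category))
        else:
            if rng.random() < p_trans:
                category = "same-transaction"  # page already updated by it
            elif rng.random() < p_temp:
                category = "previous-transaction"
            else:
                category = "fresh"
            ops.append(("write", category))
        if commit_every and (i + 1) % commit_every == 0:
            ops.append(("commit", None))
    return ops
```

A full generator would additionally track the pages held in the cache and the pages updated by the current and previous transactions, so that each category can be mapped to an actual page. Fixing the seed makes a generated workload reproducible across recovery mechanisms.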
6.6.3 Results
The average I/O costs of the synthetic database workloads and the total costs of the original workloads executing on the three recovery mechanisms and the two platforms are given in Appendix C.4. The costs of each pair of mechanisms executing a given workload are analysed to determine whether the relative order of the synthetically produced I/O costs reflects the relative order of the total real costs. Analysis indicates that in 101 of the 103 comparisons of recovery mechanisms, the synthetic I/O costs could be used to predict which mechanism incurs the lower total cost.
The two inaccurate predictions occur when AISP and DataSafe executing update (OO1b) on the configuration of the Alpha are compared, and when the same mechanisms executing lookup2 (OO1b) on the Sun are compared. No satisfactory explanation can be found for these two results. In future work, such results may be corrected by incorporating more workload variables, for example, to develop a more accurate model of workload locality.
If the results of this procedure are analysed for only those pairs of recovery mechanisms where there is a >10% variation in total costs, then the synthetic I/O costs predict the relative order of the total real costs correctly in every comparison.