2 A WEB SERVICES APPROACH TO MASTER/WORKER PARALLEL

2.5 An Analytical Performance Model

As mentioned in the previous section, the performance of a traditional PDES execution can be characterized by metrics such as speedup relative to a sequential implementation. Because Aurora is a high-throughput computing system and may execute on non-dedicated hardware, speedup is not the most appropriate metric. A more appropriate metric is the efficiency of the simulation given the master/worker infrastructure, in which client machines contribute a certain amount of processor time to the resource pool to perform simulation computations. The efficiency a PDES application can achieve across this type of infrastructure is based on the fraction of time spent performing application computations, as opposed to time spent on overhead computation or wasted in idle processor cycles. Using the metrics presented in the previous section, we can create an analytical performance model for a monolithic metacomputing master/worker PDES system under web services.

The first major parameter that negatively affects the efficiency of a simulation is overhead. Overhead in the Aurora system arises primarily from the time required to transfer work units between the client and server machines; the amount of simulation computation performed by a client once it receives a work unit also greatly impacts efficiency. These factors, in turn, are affected by three principal parameters:

1. State vector size

2. Number of input messages (messages sent from server to client) and aggregate input message size

3. Number of output messages (messages sent from client to server) and aggregate output message size

These three characteristics directly affect the overhead of shipping the work unit from the server to the client and back. In the current implementation, the state, input messages, and output messages are encoded in Base64 and then sent in a SOAP message. The larger each of these parameters becomes, the more overhead is incurred for each work unit release and return. The number of input messages is included because of the processing time required on the server to construct the correct input buffer for the work unit being released to the client, excluding any messages that cannot be executed in the current processing cycle because they are too far into the simulated future. Processing time is also required on the client side to unpack the block of memory containing the input messages and queue the messages for the simulation application. The number of output messages is included because the client must construct a suitable packed block of memory for Base64 encoding. The server must also process the output buffer from the completed work unit and bin the messages to the appropriate destination LPs.
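The size cost of the encoding step can be illustrated with a short Python sketch. The byte layout used by `pack_messages` is a hypothetical stand-in and not Aurora's actual wire format; the only property relied on is that Base64 turns every 3 raw bytes into 4 encoded characters.

```python
import base64
import struct


def pack_messages(messages):
    """Pack (timestamp, payload) pairs into one contiguous byte block.

    Hypothetical layout (an assumption for illustration): a 4-byte message
    count, then for each message an 8-byte timestamp, a 4-byte payload
    length, and the payload bytes.
    """
    block = struct.pack("<I", len(messages))
    for timestamp, payload in messages:
        block += struct.pack("<dI", timestamp, len(payload)) + payload
    return block


def encoded_size(raw):
    """Return the number of bytes after Base64 encoding."""
    return len(base64.b64encode(raw))


# Illustrative work unit: a 3000-byte state vector and two input messages.
state = bytes(3000)
inputs = pack_messages([(1.5, b"abc"), (2.0, b"defgh")])

raw_total = len(state) + len(inputs)
wire_total = encoded_size(state) + encoded_size(inputs)
print(raw_total, wire_total)
```

The encoded total is roughly one third larger than the raw total, before any SOAP envelope is added, which is one reason larger state vectors and message buffers translate directly into more overhead per lease.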

Overhead can be calculated as:

$$T_o = \frac{S_s + S_i + S_o}{T_{rate}} + \frac{N_i}{B_{rate}} \tag{2.2}$$

where $S_s$ is the average combined size of the input and output state vectors, $S_i$ is the average input buffer size, $S_o$ is the average output buffer size, $T_{rate}$ is the estimated transfer rate for the selected communication mode, $N_i$ is the average number of input messages, and $B_{rate}$ is the estimated server message processing rate for binning output messages into the correct input buffer. This equation measures the overhead from message transmission time and message processing time.

We can now construct a model for approximating the time a simulation application runs under a conservatively synchronized Aurora system. Lookahead is a simulation characteristic that can affect the efficiency of a simulation tremendously. If the lookahead is too small, client concurrency is reduced: instead of doing useful work, the system spends more time shuffling state and message buffers between the client and the server. In a centralized conservative time management system, the lookahead plus either the current simulation time or the minimum of the input message times determines the LBTS value. For each work unit released to a client, the client may only execute from the current simulation time up to the LBTS value.
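The LBTS rule described above can be sketched in Python. The text does not say exactly when the current simulation time is used instead of the minimum input message time, so this sketch assumes the earliest pending input timestamp is used when one exists and the current time otherwise. The function names are hypothetical.

```python
def lbts(current_time, input_message_times, lookahead):
    """Return an LBTS bound for one work unit.

    Assumption: the base is the earliest pending input message time when
    there is one, otherwise the current simulation time.
    """
    if input_message_times:
        base = min(input_message_times)
    else:
        base = current_time
    return base + lookahead


def lease_window(current_time, input_message_times, lookahead):
    """Return the (start, end) simulation-time window a client may execute."""
    return (current_time, lbts(current_time, input_message_times, lookahead))


print(lease_window(10.0, [11.0, 13.5], 2.0))
print(lease_window(10.0, [11.0, 13.5], 0.5))
```

Comparing the two calls shows the effect discussed in the text: a smaller lookahead yields a narrower window, so each lease carries less simulation work for the same transfer cost.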

The partitioning of the simulation’s LPs into work units also plays a large role in the efficiency of an Aurora simulation. If the number of available work units is smaller than the number of available clients, clients will become idle waiting for work units to become available instead of doing useful work. A balance must be struck, however, when partitioning a model. If the work units contain too few LPs then the amount of work for that portion of the simulation may become trivial compared to the overhead of leasing the LP to a client. Conversely, if the work units are “too large” then the amount of state that must be transferred upon the completion of a work unit may be prohibitively large, and the number of available work units may be too small relative to the number of client machines.

With the addition of lookahead and work unit partitioning, we can build a model for total average application run time ($T_a$):

$$T_a = \rho \, \mu \, N_{wu} \tag{2.3}$$

where ρ is the average event processing rate (e.g., wall clock seconds per event), µ is the average leased event density (e.g., number of events per leased execution window), and $N_{wu}$ is the number of work units in the system. Equation (2.3) approximates the time in seconds a work unit spends performing the actual simulation computation.
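A one-line Python sketch makes the units of equation (2.3) explicit: seconds per event, times events per lease window, times the number of work units. The function name and the sample values are assumptions for illustration only.

```python
def application_time(sec_per_event, events_per_lease, num_work_units):
    """Compute T_a as the product rho * mu * N_wu, in seconds."""
    return sec_per_event * events_per_lease * num_work_units


# Hypothetical workload: 0.2 ms per event, 500 events per lease window,
# 16 work units.
t_a = application_time(0.0002, 500, 16)
print(f"T_a = {t_a:.3f} s")
```

Doubling the event density per lease doubles the computation term while leaving the per-lease transfer overhead unchanged, which is why denser leases improve efficiency.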

The final component of total run time is request or idle time. This is the time an Aurora client spends waiting in a loop for the server to respond with an available work unit. A conservative synchronization algorithm based on global LBTS values is assumed. This requires at least twice as many work units as clients so that no client waits for an available work unit, assuming all clients have the same processing speed. The reason is that the LBTS value does not increase until the last work unit for the current LBTS is successfully returned. Total request time can be formulated as follows:

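$$\alpha = \frac{N_{wu}}{C} \tag{2.4}$$

$$\lambda = \left( \frac{S}{L} \right) N_{wu} \tag{2.5}$$

$$T_r = \begin{cases} T_a \, \lambda \, (C - N_{wu}) & \text{if } \alpha < 1 \\ 0 & \text{otherwise} \end{cases} \tag{2.6}$$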

where α in equation (2.4) is the ratio of work units to clients (C). It provides a quantifiable indication of whether clients must block waiting for a work unit to become available. Next, the total number of lease windows, λ, for the entire simulation is calculated through equation (2.5), where S denotes the total simulation time and L is the lookahead. In non-uniform lookahead simulations, the λ value would be the sum of all per-work-unit lease windows over the entire system. Finally, the model for total average request time ($T_r$) is constructed in equation (2.6), where the average application work unit run time is multiplied by the total number of lease windows and by the difference between the size of the available client pool and the number of work units. Assuming no other server overheads, there should be near zero request time when there are enough work units (e.g., α > 1). One contributor to overhead that is not captured is the deferred wait time. Although equation (2.5) partially captures deferred wait due to work unit unavailability, there are additional non-deterministic factors that are not accounted for. Additionally, equation (2.6) holds only if the amount of computation performed per work unit is roughly equal.
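The request time model can be sketched in Python under one reading of equations (2.4) through (2.6) that follows the description above: α = N_wu / C, λ = (S / L) · N_wu, and T_r = T_a · λ · (C − N_wu) when α < 1 and zero otherwise. The function names and sample values are hypothetical, and uniform lookahead is assumed.

```python
def work_unit_ratio(num_work_units, num_clients):
    """alpha: ratio of work units to clients."""
    return num_work_units / num_clients


def lease_windows(sim_time, lookahead, num_work_units):
    """lambda: total lease windows, assuming uniform lookahead."""
    return (sim_time / lookahead) * num_work_units


def request_time(t_a, num_work_units, num_clients, sim_time, lookahead):
    """T_r: total average request (idle) time in seconds."""
    alpha = work_unit_ratio(num_work_units, num_clients)
    if alpha < 1:
        lam = lease_windows(sim_time, lookahead, num_work_units)
        return t_a * lam * (num_clients - num_work_units)
    # Enough work units for every client: the model predicts no blocking.
    return 0.0


# Hypothetical scenario: 8 clients, simulated for 100 time units with a
# lookahead of 10, comparing 4 work units against 16.
print(request_time(1.6, 4, 8, 100.0, 10.0))
print(request_time(1.6, 16, 8, 100.0, 10.0))
```

The first call has fewer work units than clients and produces a nonzero request time; the second has more work units than clients and produces zero, matching the behavior the text describes for α > 1.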

$$E = \frac{T_a}{T_a + T_o + T_r} \tag{2.7}$$

The efficiency rating E in equation (2.7) approximates how well the Aurora system uses the available client pool for actual simulation computation and progress, assuming all clients have the same processor speed. A higher efficiency rating indicates a properly partitioned simulation with a relatively good computation-to-communication ratio and computationally intense work unit leases. This efficiency rating can be measured directly from performance data as the percentage of processor time spent in application code, and is referred to simply as percentage application processor time in the performance results.
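The efficiency rating can be computed with a short Python function that takes the three time components as inputs, reading equation (2.7) as application time divided by the sum of application, overhead, and request time. The function name and the sample values are assumptions for illustration and are not taken from Aurora performance data.

```python
def efficiency(t_a, t_o, t_r):
    """Return E = T_a / (T_a + T_o + T_r), a value between 0 and 1."""
    total = t_a + t_o + t_r
    if total == 0:
        raise ValueError("total time must be positive")
    return t_a / total


# Hypothetical comparison: the same computation time with small overhead
# and no idle time, then with overhead and idle time equal to computation.
well_partitioned = efficiency(1.6, 0.011, 0.0)
poorly_partitioned = efficiency(1.6, 0.8, 0.8)
print(f"{well_partitioned:.3f} vs {poorly_partitioned:.3f}")
```

Multiplying the result by 100 gives a figure comparable to the percentage application processor time reported in the performance results.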