
3.1 State Saving

3.1.2 Sparse State Saving (SSS)

To overcome the high resource demand of CSS, various approaches, all falling under the name of Sparse State Saving (SSS), have been proposed. The basic idea is to take snapshots sparsely [95, 13], rather than before each event, as shown in Figure 3.2. The event period according to which a snapshot is taken can be either fixed, in the case of Periodic State Saving (PSS), or variable, in the case of Adaptive State Saving (ASS).

By relying on SSS, whenever a rollback operation must be performed, there are two possibilities:

1) There is a state snapshot associated with timestamp T_s = LVT_rollback. Then the rollback operation is carried out exactly as in CSS;

2) There is no state snapshot associated with timestamp T_s = LVT_rollback.

In the latter case, the state snapshot associated with the highest timestamp among those lower than LVT_rollback is restored. Then, as shown in Figure 3.2, some events must be reprocessed in order to realign the clock of the rolling-back LP to the value LVT_rollback, an operation known as coasting forward. It is important to note that, during the re-execution of these events, no messages should be delivered by the rolling-back LP. In fact, since the events have already been processed, the corresponding messages have already been sent. Moreover, since the timestamp of any reprocessed event e_repr is such that T_e_repr < LVT_rollback, the antimessages ē_repr have not been sent. Therefore, if messages were sent to other LPs during the coasting forward operation, those LPs would receive multiple copies of the same messages, producing an error in the simulation’s results. This re-execution without message sending is called silent execution. Although in this scenario it would be practically correct to send the antimessages for any event e_repr, from a logical point of view it is not, as they do not belong to a portion of the simulation trajectory which has been discovered to be inconsistent. Furthermore, doing so could generate biases in the execution of the simulation.
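The rollback procedure under SSS can be illustrated with a minimal sketch. The class `SparseLP`, its fields, and the toy state (a clock plus a running total) are hypothetical names chosen for illustration and do not come from any specific simulation kernel. The sketch restores the latest snapshot not newer than the rollback point and then coasts forward silently, so that no message is sent twice.

```python
from bisect import bisect_left
from copy import deepcopy


class SparseLP:
    """Toy logical process that takes a snapshot every `period` events."""

    def __init__(self, period):
        self.period = period
        self.state = {"lvt": 0.0, "total": 0}
        self.processed = []   # executed (timestamp, value) events, in timestamp order
        self.snapshots = [(0, deepcopy(self.state))]  # (events executed so far, state)
        self.outbox = []      # messages sent to other LPs

    def execute(self, ts, value, silent=False):
        done = len(self.processed)
        # Sparse snapshot: state *before* the event, once every `period` events.
        if not silent and done % self.period == 0 and self.snapshots[-1][0] != done:
            self.snapshots.append((done, deepcopy(self.state)))
        self.state["lvt"] = ts
        self.state["total"] += value
        if not silent:
            # Messages leave the LP only during normal (non-silent) execution.
            self.processed.append((ts, value))
            self.outbox.append((ts, value))

    def rollback(self, lvt_rollback):
        # Events with timestamp < lvt_rollback survive the rollback.
        keep = bisect_left([ts for ts, _ in self.processed], lvt_rollback)
        undone = self.processed[keep:]
        del self.processed[keep:]
        # Discard snapshots taken after the surviving prefix.
        while self.snapshots[-1][0] > keep:
            self.snapshots.pop()
        start, saved = self.snapshots[-1]
        self.state = deepcopy(saved)
        # Coasting forward: silent re-execution up to the rollback point.
        for ts, value in self.processed[start:]:
            self.execute(ts, value, silent=True)
        return undone   # the caller would send antimessages for these events


lp = SparseLP(period=3)
for t in range(1, 8):
    lp.execute(float(t), 1)
undone = lp.rollback(5.5)   # restores the snapshot taken after 3 events
```

In this run the snapshot taken after three events is restored and the events with timestamps 4.0 and 5.0 are re-executed silently, so `lp.outbox` keeps its seven original entries.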

Additionally, it is important that the re-execution of any event follows the same (original) execution trajectory. If, for example, the logic associated with an event relies on some probability distribution function (based on pseudo-random number generation), it is important that re-executing the same (logical) call to the random generator provides the event with exactly the same result. Otherwise, the logic associated with the event might produce a different result. This property, known as piece-wise determinism (PWD) [41], is necessary to correctly reconstruct the very same simulation state before executing the straggler event, and can be supported by the underlying simulation kernel by exposing its own version of a random library, which is aware of the rollback operation.
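One way to obtain a rollback-aware random library is to treat the generator's internal state as part of the LP state, so that it is saved and restored together with every snapshot. The sketch below assumes a hypothetical wrapper `LPRandom` around Python's standard `random.Random`; it is an illustration of the idea, not the interface of any particular kernel.

```python
import random


class LPRandom:
    """Per-LP random stream whose state can be checkpointed and restored."""

    def __init__(self, seed):
        self._rng = random.Random(seed)

    def draw(self):
        return self._rng.random()

    def snapshot(self):
        # The generator state is saved alongside the LP state.
        return self._rng.getstate()

    def restore(self, saved):
        # After a restore, the same logical calls yield the same values.
        self._rng.setstate(saved)


rng = LPRandom(seed=1234)
saved = rng.snapshot()
first_run = [rng.draw() for _ in range(3)]
rng.restore(saved)                      # rollback to the checkpoint
replay = [rng.draw() for _ in range(3)]
```

After the restore, `replay` contains the same values as `first_run`, which is what coasting forward needs in order to follow the original trajectory.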

PSS has the undeniable advantage that memory consumption due to state saving is reduced. Yet, the rollback overhead is increased, due to the cost of the coasting forward operation. The efficiency of the approach depends on the checkpointing period χ. If χ is too small, snapshots are taken almost as often as in CSS and memory is used inefficiently. On the other hand, if χ is too large, the coasting forward phase of each rollback grows longer, and we should expect a performance decrease.

Various ASS techniques have been proposed, which try to fine-tune the value of χ depending on the actual execution dynamics of the simulation model. The approach described in [121] selects the best checkpointing interval by relying on an analytic model based on LP execution time. By assuming that the execution of events is non-preemptive¹, and that rollback lengths² are independent of each other, the optimal checkpointing interval is:

$$\chi_{opt} = \left\lceil \sqrt{\frac{2\delta_s}{\delta_c}\left(\frac{N}{k_r} + \gamma - 1\right)} \right\rceil \qquad (3.1)$$

where:

δ_s is the average time to take a state snapshot;

δ_c is the average time to execute the coasting forward operation;

N is the total number of committed events;

k_r is the number of rollbacks executed;

γ is the average rollback length.

¹ This has been the traditional behaviour of events’ execution. See Chapter 7 for a more detailed discussion of this topic.

² Rollback length describes how many (optimistically-executed) events are undone by a rollback operation. The average rollback length can be used as a measure of the amount of “wasted work” in an optimistic simulation run.

Algorithm 3.2 Optimal χ Selection

if n = 0 then
    χ_n ← χ_init
else if k_obs = 0 then
    χ_n ← ⌈(1 − ρ) χ_{n−1} + ρ χ_max⌉
else
    χ_n ← max(1, ⌈(1 − ρ) χ_{n−1} + ρ min(χ_min, χ_max)⌉)
end if

Under the same assumptions, the work in [140] proposes to observe, in a WCT interval T_obs, the number of rollback operations k_obs and the number of executed events R_obs (both committed and uncommitted). A numerical sequence of checkpointing intervals χ_n is generated, where the first element is given by:

$$\chi_{init} = \left\lceil \sqrt{2\,\frac{R_{obs}}{k_{obs}}\,\frac{\delta_s}{\delta_c}} \right\rceil \qquad (3.2)$$

The next values are then computed according to Algorithm 3.2, where:

ρ ∈ (0, 1) determines how much importance is given to the history of χ_n (weighted by 1 − ρ) rather than to more recent observations (weighted by ρ);

χ_min is the minimum threshold;

χ_max is the maximum threshold.
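As a worked illustration, Equation (3.2) and one step of Algorithm 3.2 can be written as two small functions. The function and parameter names are hypothetical, and the update rule is transcribed exactly as printed in Algorithm 3.2 (including the `min(χ_min, χ_max)` term). Note that Equation (3.2) is undefined when no rollback has been observed (k_obs = 0).

```python
import math


def chi_init(r_obs, k_obs, delta_s, delta_c):
    """Equation (3.2): first checkpointing interval (requires k_obs > 0)."""
    return math.ceil(math.sqrt(2.0 * (r_obs / k_obs) * (delta_s / delta_c)))


def next_chi(n, chi_prev, k_obs, rho, chi_min, chi_max, chi_first):
    """One step of Algorithm 3.2, transcribed as printed."""
    if n == 0:
        return chi_first
    if k_obs == 0:
        # No rollback observed: move the interval towards the maximum threshold.
        return math.ceil((1 - rho) * chi_prev + rho * chi_max)
    return max(1, math.ceil((1 - rho) * chi_prev + rho * min(chi_min, chi_max)))


# 1000 executed events, 10 rollbacks, snapshots twice as costly as coasting forward.
chi0 = chi_init(r_obs=1000, k_obs=10, delta_s=2.0, delta_c=1.0)          # 20
chi1 = next_chi(1, chi0, k_obs=0, rho=0.5, chi_min=2, chi_max=50, chi_first=chi0)  # 35
chi2 = next_chi(2, chi1, k_obs=4, rho=0.5, chi_min=2, chi_max=50, chi_first=chi0)  # 19
```

With ρ = 0.5 the history and the new term weigh equally; values of ρ closer to 0 make the sequence react more slowly.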

This scheme does not take into account the fact that the execution time of different types of events can vary. This aspect is captured in [152], where Event Sensitive State Saving (ESSS) is proposed. This technique emphasizes that it is convenient to take a state snapshot when the granularity of the next event³ increases. Then, starting from the model in [140], and classifying the events in N different classes, the simulation kernel groups in the same class n ∈ {1, ..., N} all the events showing a similar behaviour (in terms of required WCT for execution). A proper χ_opt is then selected depending on the most-occurring class. Assuming that each class n of events is associated with an event frequency f_n and a state saving probability p_n, the optimal checkpointing value is computed as the geometric average of all the classes:

$$\chi_{opt} = \left( \sum_{n=1}^{N} p_n f_n \right)^{-1} \qquad (3.3)$$

³ By granularity of an event, we mean the average execution time of one type of event, relative to the average event execution time computed as if all kinds of events took the same WCT to be executed.

This approach, therefore, tries to reduce the coasting forward time by avoiding the reprocessing of chains of events that include events requiring a large amount of WCT.
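Equation (3.3) can be evaluated directly once the per-class frequencies and state saving probabilities are known. The sketch below uses a hypothetical helper `esss_interval` and made-up numbers for three event classes; they are illustrative values, not measurements from the cited work.

```python
def esss_interval(classes):
    """Equation (3.3): classes is a list of (f_n, p_n) pairs, one per event class."""
    saving_rate = sum(f_n * p_n for f_n, p_n in classes)
    return 1.0 / saving_rate


# (event frequency f_n, state saving probability p_n) for three assumed classes:
# frequent cheap events are rarely checkpointed, rare expensive ones often.
classes = [(0.7, 0.05), (0.2, 0.25), (0.1, 0.5)]
chi_opt = esss_interval(classes)   # about 7.4 events between snapshots
```

Here the sum of p_n f_n is 0.135, so on average one snapshot is taken roughly every 7.4 events.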

A different approach is presented in [45], which regulates the checkpointing interval by using a heuristic algorithm, based on the periodic recalculation of the cost function:

$$E_c = C_{ss} + C_{cf} \qquad (3.4)$$

where:

C_ss is the average WCT to perform a state saving;

C_cf is the average WCT to execute the coasting forward operation.

Taking into account observation periods which are completely independent of the checkpointing intervals, the system periodically recomputes the value of Equation (3.4). If its value increases, then the value of χ is increased by one unit (up to a maximum threshold). If there is no fluctuation in the value of Equation (3.4), then χ is decreased by one unit. Therefore, if the model’s dynamics change, and this change is reflected in a variation of the rollback [...] operations changes, and the value of χ is adapted accordingly.

An additional approach in [133] proposes to observe the LPs’ event history, taking into account the variations between the timestamps of consecutive events, to determine the best moment for taking a snapshot. In particular, a rollback operation can involve any ST instant; specifically, it can fall into any interval bounded by the timestamps of two consecutive events. If the clock of LP i has value LVT_i and the next event’s timestamp is T_next, and the ST interval I = [LVT_i, T_next] is such that the difference T_next − LVT_i deviates positively (and non-negligibly) from the average value, there is a higher risk that a rollback might affect that ST interval I, and it is therefore convenient to take a log.
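The decision rule described above can be sketched as a running comparison between the next timestamp gap and the average gap seen so far. The class `GapMonitor` and the relative threshold `tolerance` are assumptions introduced here for illustration; the source does not specify how "non-negligible" is quantified.

```python
class GapMonitor:
    """Flags ST intervals that are noticeably longer than the average one."""

    def __init__(self, tolerance=0.5):
        self.tolerance = tolerance   # assumed relative threshold, not from the source
        self.mean_gap = 0.0
        self.samples = 0

    def should_log(self, lvt, t_next):
        gap = t_next - lvt
        # Take a log only if the gap exceeds the running average by the tolerance.
        decision = self.samples > 0 and gap > self.mean_gap * (1 + self.tolerance)
        # Update the running average of the gaps with the new observation.
        self.samples += 1
        self.mean_gap += (gap - self.mean_gap) / self.samples
        return decision


monitor = GapMonitor(tolerance=0.5)
intervals = [(0.0, 1.0), (1.0, 2.0), (2.0, 3.0), (3.0, 8.0)]
decisions = [monitor.should_log(lvt, t_next) for lvt, t_next in intervals]
```

Only the last interval, whose length is 5 against an average of 1, triggers a log in this example.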

In [134] a cost model is proposed to select the checkpointing position in an optimized way. It is based on a heuristic which tries to minimize the rollback length: the system decides to pay the cost of a checkpoint at a certain ST instant only if the estimate of its possible (future) restore cost is higher.

The model takes into account the following parameters:

• the position of the last checkpoint;

• the granularity of the events executed between the last checkpoint taken and the current LVT;

• the probability that state S will be restored in the future, due to a rollback operation.

These are combined in the equation:

$$C_R(S) = \begin{cases} \mu_s + P(S)\,\mu_s & \text{if } S \text{ is saved} \\ P(S)\left[\mu_s + \sum_{e \in E(S)} \mu_e\right] & \text{otherwise} \end{cases} \qquad (3.5)$$

where:

µ_s is the save & restore cost of state S;

µ_e is the granularity of event e, which is part of the cost of the coasting forward operation;

P(S) is the estimate of the probability that the current state S will be restored in the future, which depends on the application dynamics and on the interval I(S), which spans from the last checkpoint’s timestamp to the current LVT;

E(S) is the set of events executed in the interval I(S), which will be re-executed during a coasting forward operation.

Equation (3.5) is computed in both forms (as if a snapshot were taken, and as if it were not) before executing any event e. The result associated with the most convenient option determines whether the checkpoint is taken or not.
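The comparison between the two branches of Equation (3.5) can be sketched as follows. The function names and the numeric costs are hypothetical, and resolving ties in favour of skipping the checkpoint is an assumption of this sketch.

```python
def cost_if_saved(mu_s, p_restore):
    """First branch of Equation (3.5): pay the save now, maybe a restore later."""
    return mu_s + p_restore * mu_s


def cost_if_skipped(mu_s, p_restore, event_costs):
    """Second branch: a future restore also pays for coasting forward over E(S)."""
    return p_restore * (mu_s + sum(event_costs))


def should_checkpoint(mu_s, p_restore, event_costs):
    # Assumption: on a tie, the checkpoint is skipped.
    return cost_if_saved(mu_s, p_restore) < cost_if_skipped(mu_s, p_restore, event_costs)


# Six events of granularity 4 since the last checkpoint: 15 < 17, take the snapshot.
long_chain = should_checkpoint(mu_s=10.0, p_restore=0.5, event_costs=[4.0] * 6)
# Only two such events: 15 > 9, skipping the snapshot is cheaper.
short_chain = should_checkpoint(mu_s=10.0, p_restore=0.5, event_costs=[4.0] * 2)
```

The longer the chain of events since the last checkpoint, the more the second branch grows, which is what eventually makes taking a snapshot the cheaper option.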

In [166] we find a work which addresses a problem orthogonal to the choice of the checkpointing interval, but which is perfectly compatible with the aforementioned SSS approaches. This work proposes a transparent memory-management architecture targeted at optimistic synchronization, which allows the user to rely on dynamically allocated memory to store the LPs’ simulation states. This work is actually one of the bases upon which our solution in Chapter 4 is built, so we refer the reader to that chapter for a more specific discussion.

Other proposals oriented to transparency for checkpoint/restore operations in the context of general memory layouts cope with optimistic synchronization in [...] protection mechanisms offered by the Operating System, used to detect memory accesses and to trigger incremental copies of the accessed pages.

Another proposal which targets checkpoints of scattered-memory simulation states is the one in [156], which nevertheless offers a lower degree of transparency than [166, 144, 145], as the user has to explicitly notify the simulation kernel about which memory buffers are being used for storing the LPs’ state variables.

In document Techniques for Transparent Parallelization of Discrete Event Simulation Models (Page 97-104)