Experiment Setup - Autonomic management in a distributed storage system

7.4.1 The Test-Bed

The experiments reported here were conducted on a local area test-bed consisting of 16 dedicated hosts, each with a 3.00 GHz Intel® Pentium® 4 CPU and 1 GB of RAM. The hosts were connected to a dedicated switch and isolated from the rest of the network. A single overlay node was executed on each network host to ensure that the performance of the overlay network was not skewed by multiple overlay nodes competing for resources (CPU-time, memory and network bandwidth) within a host. A separate host, the workload-executor, ran the workload and conducted performance measurements. Each participating node monitored the number of bytes it sent and received as well as autonomic management details. To avoid measurements being skewed by collecting data from the individual hosts during an experimental run, monitoring data was kept locally and collected after each experiment finished. As monitoring data was time-stamped, the system clocks on all hosts were synchronised using NTP [63]. Information about the motivation for choosing this specific test-bed can be found in appendix A.1.5.

7.4.2 Derivation of User-Level Metrics

The experiments1 were carried out to measure the effects of the various policies on the user-level metrics (ULM) performance and network usage. Single values for both performance and network usage were computed by aggregating measurements for each experiment in order to compare the effects of the specific policies. To verify reproducibility, each experiment was repeated three times.

Measurements were aggregated over observation periods of 5 minutes. Performance measurements were derived from the execution of workload lookups. Performance was previously defined in section 4.4.1 as a combination of lookup time, lookup error rate and lookup error time2 collected during individual observation periods. Network usage was measured as the amount of data all nodes sent to the network during each individual observation period. The time during which monitoring data was gathered in one experimental run (experiment run time) was the time interval from the first lookup until the last, whether it was successful or not.

1 An experiment is specified by the combination of a specific churn pattern, workload and policy for managing maintenance scheduling.

To highlight the fact that a performance measurement was computed for each observation period, the performance is referred to as expected lookup time for the rest of this thesis. The motivation for this notion was that a lookup would have been expected to complete after some time if a fall-back mechanism retried failed lookups until they succeeded. Thus, for modelling the expected lookup time, t_expected, it is assumed that any given lookup succeeds, after a lookup time t_lt, with a probability p_success. Conversely, any given lookup may fail with a probability p_failure after a lookup error time, t_let, has passed. Every failure is followed by a retry, which is repeated n times. Thus, the expected lookup time is given by the weighted sum of all possible cases as shown in formula 7.1.

$$t_{expected} = t_{lt} \times p_{success} + \sum_{i=0}^{n} (t_{lt} + i \times t_{let}) \times p_{success} \times p_{failure}^{i} \tag{7.1}$$
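As an illustration, formula 7.1 can be evaluated with a few lines of Python. This is a minimal sketch rather than the code used in the experiments: the function name and argument names are hypothetical, and the failure probability is assumed to be the complement of the success probability. The sketch follows the formula literally, including the lower summation bound i = 0, so the i = 0 term of the sum contributes a second t_lt × p_success in addition to the leading term.

```python
def expected_lookup_time(t_lt, t_let, p_success, n):
    """Evaluate formula 7.1 literally (sum taken from i = 0 to n).

    t_lt      -- lookup time of a successful lookup
    t_let     -- lookup error time that passes before a failure is detected
    p_success -- probability that a single lookup succeeds
    n         -- number of retries considered
    """
    # Assumption: a lookup either succeeds or fails, so the two
    # probabilities are complementary.
    p_failure = 1.0 - p_success

    # Weighted sum over the retry cases: success after i failures.
    retry_sum = sum(
        (t_lt + i * t_let) * p_success * p_failure ** i
        for i in range(n + 1)
    )
    return t_lt * p_success + retry_sum
```

For example, `expected_lookup_time(1.0, 2.0, 0.5, 1)` adds the leading term 0.5, the i = 0 term 0.5 and the i = 1 term 0.75.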

This resulted in (derived) ULM monitoring data being available in the form of progressions of expected lookup times and network usages, one value per observation period. Including the repetitions, three progressions were available for each ULM. The individual expected lookup times and network usages for each observation period, including their repetitions, were used to create distributions of expected lookup times and network usages. In order to compare the effects of specific policies on an individual ULM, the arithmetic mean of the distribution was calculated.
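The aggregation described in this section can be sketched as follows. The sketch is illustrative only: the function names, the sample layout (timestamp in seconds relative to the first lookup, bytes sent) and the decision to skip observation periods without samples are assumptions, not details taken from the experimental software.

```python
from statistics import mean

OBSERVATION_PERIOD_S = 5 * 60  # observation periods of 5 minutes


def bytes_per_period(samples, period_s=OBSERVATION_PERIOD_S):
    """Sum the bytes sent by all nodes within each observation period.

    samples -- iterable of (timestamp_s, bytes_sent) pairs; timestamps are
               assumed to be relative to the first lookup of the run.
    Returns the progression of per-period totals in temporal order.
    """
    totals = {}
    for timestamp_s, bytes_sent in samples:
        index = int(timestamp_s // period_s)
        totals[index] = totals.get(index, 0) + bytes_sent
    # Assumption: periods without any sample are simply omitted.
    return [totals[index] for index in sorted(totals)]


def ulm_mean(progressions):
    """Pool the per-period values of all repetitions into one distribution
    and return its arithmetic mean."""
    pooled = [value for progression in progressions for value in progression]
    return mean(pooled)
```

With three repetitions per experiment, `ulm_mean` would receive three progressions, for instance `ulm_mean([[1, 2], [3, 4], [5, 6]])`.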

7.4.3 Churn Pattern Configurations

The four churn patterns were specified by pseudo-randomly selected values3, as:

• Low membership churn:

– t_on-line >> 2 [h]

– t_off-line: 157 ± 20 [s]

• High membership churn4:

– t_on-line: 200 ± 40 [s]

– t_off-line: 100 ± 20 [s]

• Locally varying membership churn: 25% of all P2P nodes were representative of dedicated servers which exhibited low churn. 75% of the P2P nodes were representative of user workstations which exhibited high churn.

• Temporally varying membership churn: a phase in which all nodes exhibited low churn, with a duration of 1000 [s], followed by a phase in which all nodes exhibited high churn, again with a duration of 1000 [s] and so forth.

The churn patterns were held constant between experiment repetitions.
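The churn patterns above could be generated along the following lines. This is a sketch under stated assumptions: the text does not say how the pseudo-random values were drawn, so a uniform distribution within the given bounds is assumed; all names are hypothetical; and a fixed seed stands in for whatever mechanism kept the patterns constant between repetitions.

```python
import random

# High membership churn, (mean, deviation) in seconds, from the list above.
HIGH_CHURN = {"on_line": (200, 40), "off_line": (100, 20)}


def sample_duration(mean_s, deviation_s, rng):
    """Draw one duration in [mean - deviation, mean + deviation].

    Assumption: values are uniformly distributed within the bounds.
    """
    return rng.uniform(mean_s - deviation_s, mean_s + deviation_s)


def temporal_phase(elapsed_s, phase_s=1000):
    """Temporally varying churn: alternating phases of 1000 s, starting low."""
    return "low" if int(elapsed_s // phase_s) % 2 == 0 else "high"


def locally_varying_profiles(node_count):
    """Locally varying churn: 25% low-churn nodes, the rest high-churn.

    Assumption: the low-churn share is rounded down to whole nodes.
    """
    low = int(node_count * 0.25)
    return ["low"] * low + ["high"] * (node_count - low)


# A fixed seed reproduces the same pattern in every repetition.
rng = random.Random(42)
on_line_s = sample_duration(*HIGH_CHURN["on_line"], rng)
```

For the 16 hosts of the test-bed, `locally_varying_profiles(16)` yields 4 low-churn and 12 high-churn nodes.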

7.4.4 Workload Configurations

The following workload specifications5 were derived from preliminary work, as reported in appendix A.1:

• Synthetic light weight workload: 10 lookups were issued in total; between two lookups a period of 300 seconds of inactivity was configured.

• Synthetic heavy weight workload: 6000 successive lookups were issued.

4 This was a random churn pattern amongst the highest churn patterns the experimental platform supported.

5 See section 7.3.2 for definitions and examples.

• Synthetic variable weight workload: 1000 lookups were issued in total; 100 successive lookups were followed by 300 seconds of inactivity.

• File system specific workload: a temporal sequence of 14576 P2P lookups derived from a file system trace. In order to represent original ASA semantics, the lookups were organised in sets of keys: those representative of keys for meta-data were looked up in parallel and those representative of keys for data were looked up in sequence. The lookups were spread over the experimental duration in accordance with the file system workload. More details are available in appendix A.1.4.
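The three synthetic workloads share one shape: bursts of successive lookups separated by periods of inactivity. A minimal sketch of such a schedule is given below; the generator name and event tuples are hypothetical, and it is assumed that no period of inactivity follows the final burst.

```python
def burst_schedule(total, burst, pause_s):
    """Yield ('lookup', index) and ('pause', seconds) events.

    total   -- number of lookups issued in total
    burst   -- number of successive lookups before a pause
    pause_s -- period of inactivity between bursts, in seconds
    """
    for index in range(total):
        yield ("lookup", index)
        burst_finished = (index + 1) % burst == 0
        # Assumption: the run ends with the last lookup, without a pause.
        if burst_finished and index + 1 < total:
            yield ("pause", pause_s)


# Parameters taken from the workload list above.
light = list(burst_schedule(total=10, burst=1, pause_s=300))
heavy = list(burst_schedule(total=6000, burst=6000, pause_s=0))
variable = list(burst_schedule(total=1000, burst=100, pause_s=300))
```

The file system specific workload is not covered by this sketch, since its timing follows the trace rather than a fixed pattern.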

7.4.5 Policy Parameter Configurations

Three different policies for scheduling maintenance operations were defined using the autonomic management mechanism6. The management mechanism consisted of an aggregation policy which balanced out the interval recommendations of sub-policies which analysed individual metrics in isolation. Each sub-policy determined an increase or decrease of the current interval proportional to the difference between the analysed metric value and its corresponding ideal value7. A threshold t determined which metric values were ignored and a constant factor k determined the rate of change. Values for t and k were configured specifically for the individual sub-policies but then had the same values for each of the individual maintenance operations8. NEMO_t, NEMO_k, ER_t, ER_k, LILT_t and LILT_k were referred to as policy parameters.
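The interplay of threshold, rate factor and aggregation can be sketched as below. This is an illustration under several assumptions, not the mechanism defined in sections 7.2 and 4.5: the threshold is assumed to apply to the distance between metric value and ideal value, the change is assumed to be k times that distance, the aggregation policy is assumed to average the recommendations, and all function names and the lower bound on the interval are hypothetical.

```python
def sub_policy_change(metric, ideal, t, k, direction):
    """Recommend a change of the maintenance interval for one metric.

    direction -- +1 for sub-policies that increase the interval,
                 -1 for sub-policies that decrease it.
    """
    difference = abs(metric - ideal)
    # Assumption: metric values within the threshold t of the ideal value
    # are ignored.
    if difference <= t:
        return 0.0
    # The change is proportional to the difference, scaled by k.
    return direction * k * difference


def aggregate_interval(current_interval, changes, minimum=0.0):
    """Balance out the sub-policy recommendations.

    Assumption: balancing is done by averaging the recommended changes;
    the result is kept at or above a (hypothetical) minimum interval.
    """
    balanced = sum(changes) / len(changes)
    return max(current_interval + balanced, minimum)
```

A usage example with made-up metric values: starting from an interval of 2 seconds, `aggregate_interval(2.0, [0.5, -0.1, 0.0])` averages one increase, one decrease and one ignored metric into a single new interval.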

6 See sections 7.2 and 4.5 for more details.

7 The NEMO specific sub-policies determined an increase, ER and LILT a decrease.

8 stabilize, fixNextFinger, checkPredecessor

• Policy 0: This policy left nodes unmanaged but still incurred the overhead of the management processes in order to allow comparison.

• Policy 1: This policy determined a new interval between any maintenance operation based on the operation-specific metrics outlined above. The policy parameters were derived from preliminary experiments (appendix A.1.3) with the objective of finding the most suitable parameter set.

• Policy 2: Like policy 1, this policy determined a new interval between any maintenance operation based on the operation-specific metrics outlined above. This policy was configured to ignore LILT metrics and to react aggressively to the other metrics.

All policies were evaluated every two seconds. Two seconds was also used as an initial maintenance interval. Thus the configuration of policy 0 resulted in a statically configured interval for each maintenance operation of two seconds. This maintenance interval was derived from the preliminary experiments reported in appendix A.1.1 as the most suitable static interval.
