Simulating Ad hoc Host Behaviours - Ad hoc Cloud Computing

6.4 Reliability

6.4.1 Simulating Ad hoc Host Behaviours

The primary aim of this experiment is to find the potential reliability an ad hoc cloud could offer in a realistic setting. To simulate an unreliable infrastructure realistically, we obtained Nagios monitoring data covering a period of 36 months from 650 hosts in the School of Informatics at the University of Edinburgh. An example of the data we received is shown in Figure 6.2.

[1294472199] HOST ALERT : host256 UP SOFT 1 PING CRITICAL
[1294472210] HOST ALERT : host259 UP SOFT 1 PING CRITICAL
[1294472220] HOST ALERT : host174 DOWN SOFT 3 PING WARNING
[1294473745] HOST ALERT : host271 UP SOFT 1 PING CRITICAL
[1294473756] HOST ALERT : host259 DOWN HARD 1 PING CRITICAL

Figure 6.2: Nagios Example Output

The example output contains several fields, but the three most important are the timestamp at which an event occurred, the hostname that the event relates to and the host state. For example, the first line of Figure 6.2 shows that host256 became available on 8/1/2011 at 7:36:39 AM (i.e. the epoch time 1294472199) and that this new state was determined using ping.
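As a minimal sketch of how these three fields can be extracted, the following Python snippet parses one line in the layout shown in Figure 6.2 and converts the epoch timestamp to a UTC date. The regular expression, the `HostEvent` type and the `parse_alert` function are illustrative assumptions, not part of the Nagios data tool described here, and raw Nagios log files may use a different delimiter layout.

```python
import re
from datetime import datetime, timezone
from typing import NamedTuple

# Pattern for the layout shown in Figure 6.2 (an assumption about the
# exported format; raw Nagios logs may be delimited differently).
ALERT_RE = re.compile(
    r"\[(?P<ts>\d+)\] HOST ALERT : (?P<host>\S+) (?P<state>UP|DOWN) "
    r"(?P<kind>SOFT|HARD) (?P<attempt>\d+) (?P<info>.*)"
)


class HostEvent(NamedTuple):
    """The three fields the analysis relies on (hypothetical type)."""
    time: datetime
    host: str
    state: str


def parse_alert(line: str) -> HostEvent:
    """Parse one HOST ALERT line into a timestamp, hostname and state."""
    match = ALERT_RE.fullmatch(line.strip())
    if match is None:
        raise ValueError(f"unrecognised alert line: {line!r}")
    when = datetime.fromtimestamp(int(match["ts"]), tz=timezone.utc)
    return HostEvent(when, match["host"], match["state"])


event = parse_alert("[1294472199] HOST ALERT : host256 UP SOFT 1 PING CRITICAL")
print(event.time.isoformat(), event.host, event.state)
# 2011-01-08T07:36:39+00:00 host256 UP
```

Interpreting the epoch value in UTC reproduces the date and time quoted above.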

We parsed this large amount of monitoring data by creating a Nagios data tool that calculated the host activity for every hour, i.e. the number of unique UP and DOWN state events for each host. Although hosts within Informatics are highly reliable and rarely fail or become unavailable, there were times when groups of hosts became sporadically available over short periods of time. In one of the most active hours, at least 30 hosts acted in a sporadic manner. We therefore used this set of hosts to simulate the behaviour of an ad hoc cloud; this hour was between approximately 04:45 and 05:40 am on the 13th of September 2012. By using monitoring data from an actual infrastructure, we can determine the reliability of an ad hoc cloud as if it were operated over the selected set of hosts at the selected time.
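The hourly activity calculation can be sketched roughly as follows. This is an illustration under stated assumptions rather than the Nagios data tool itself: events are given as hypothetical `(epoch_seconds, host, state)` tuples, hours are aligned to the clock, and "most active" is taken to mean the hour with the most distinct hosts reporting events.

```python
from collections import defaultdict

SECONDS_PER_HOUR = 3600


def hourly_activity(events):
    """Count unique state events per host for each clock-aligned hour.

    events: iterable of (epoch_seconds, host, state) tuples.
    Returns {hour_start_epoch: {host: unique_event_count}}.
    """
    buckets = defaultdict(lambda: defaultdict(set))
    for ts, host, state in events:
        hour_start = ts - ts % SECONDS_PER_HOUR
        # A set removes exact duplicates of the same event.
        buckets[hour_start][host].add((ts, state))
    return {
        hour: {host: len(seen) for host, seen in hosts.items()}
        for hour, hosts in buckets.items()
    }


def busiest_hour(activity):
    """Return the hour in which the most distinct hosts changed state."""
    return max(activity, key=lambda hour: len(activity[hour]))


# The five events from Figure 6.2.
events = [
    (1294472199, "host256", "UP"),
    (1294472210, "host259", "UP"),
    (1294472220, "host174", "DOWN"),
    (1294473745, "host271", "UP"),
    (1294473756, "host259", "DOWN"),
]
activity = hourly_activity(events)
print(busiest_hour(activity))  # 1294470000, the hour starting 07:00:00 UTC
```

The selected window of 04:45 to 05:40 is not clock-aligned, so the actual tool may well have bucketed events differently, for example with a sliding window.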

Figure 6.3 shows the availability map depicting the behaviour of the selected group of hosts during the selected hour. A host is initially assumed to be available until a red marker signifies that the host has become unavailable or has failed. A green marker signifies that the host has become available again; the downtime is the time between the two events, depicted by the dark grey area between the two markers.
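The pairing of markers into downtime periods can be expressed as a short sketch. The function name, the event tuples and the choice to close an unfinished outage at the end of the window are assumptions made for illustration.

```python
def downtime_intervals(events, window_end):
    """Pair DOWN and UP events for one host into downtime intervals.

    events: time-ordered (epoch_seconds, state) tuples for a single host.
    window_end: epoch second at which the observed window closes.
    The host is assumed to be available before its first event.
    """
    intervals = []
    down_since = None  # None means the host is currently available
    for ts, state in events:
        if state == "DOWN" and down_since is None:
            down_since = ts
        elif state == "UP" and down_since is not None:
            intervals.append((down_since, ts))
            down_since = None
        # Repeated DOWN or UP events do not change the current state.
    if down_since is not None:
        # Still down at the end of the window: close the interval there.
        intervals.append((down_since, window_end))
    return intervals


# Hypothetical events, in seconds from the start of a window.
host_events = [(100, "DOWN"), (160, "UP"), (200, "UP"), (300, "DOWN")]
intervals = downtime_intervals(host_events, window_end=360)
total_downtime = sum(up - down for down, up in intervals)
print(intervals, total_downtime)  # [(100, 160), (300, 360)] 120
```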

We include the time and date when each event occurred on the left-hand side of the availability map, with each blue area showing the number of events that occurred in each ten-minute period.

Most importantly, we also show the reliability of each host. This was calculated from the ratio of the total number of seconds of downtime to the total number of seconds the host was available, measured from when monitoring records began until the beginning of the selected hour. We also show the last three digits of the IP address assigned to the ad hoc guest that runs cloud jobs; we define this as the VM ID. These virtual machines may run on any EDIM1 host, depending on the order in which the installed ad hoc clients register with the ad hoc server.
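The reliability values in Figure 6.3 lie close to 100, which suggests that the downtime ratio is reported as a percentage of time available. One plausible reading, which is an assumption here and not a formula stated in the text, is sketched below; the function name and the sample figures are hypothetical.

```python
def reliability_percent(downtime_seconds, available_seconds):
    """Percentage of the monitored period during which a host was available.

    Assumed interpretation: the monitored period is the sum of the
    downtime and available seconds recorded before the selected hour.
    """
    monitored = downtime_seconds + available_seconds
    if monitored <= 0:
        raise ValueError("monitored period must be positive")
    return 100.0 * available_seconds / monitored


# Hypothetical host: one hour of downtime in a 30-day monitoring period.
thirty_days = 30 * 24 * 3600
value = reliability_percent(3600, thirty_days - 3600)
print(round(value, 3))  # 99.861
```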

To simulate the behaviour outlined in the availability map, we created a simulator daemon and added it to the VM Service project. This daemon takes the Nagios monitoring data for the selected hour and replays the UP and DOWN state events for each of the ad hoc hosts on the EDIM1 infrastructure. An ad hoc host could be simulated by instructing each EDIM1 node to shut down and boot up when a DOWN or UP event occurs respectively; however, this solution is impractical and difficult to manage remotely.
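The replay idea can be illustrated with a small sketch that preserves the gaps between recorded events. It is not the simulator daemon from the VM Service project: the `replay` function, its `apply_event` callback and the `speedup` parameter are hypothetical, and how a state change is actually applied is left to the callback.

```python
import time


def replay(events, apply_event, sleep=time.sleep, speedup=1.0):
    """Replay recorded state events with their original spacing.

    events: iterable of (epoch_seconds, host, state) tuples.
    apply_event: callback invoked as apply_event(host, state).
    sleep: injectable so the loop can be exercised without waiting.
    speedup: values above 1.0 compress the recorded timeline.
    """
    ordered = sorted(events)
    if not ordered:
        return
    previous = ordered[0][0]
    for ts, host, state in ordered:
        sleep((ts - previous) / speedup)  # wait out the recorded gap
        apply_event(host, state)
        previous = ts


# Dry run: record the delays instead of sleeping. Host names are made up;
# the gaps mirror the first three events in Figure 6.3 (60 s, then 10 s).
delays, applied = [], []
replay(
    [(0, "hostA", "DOWN"), (60, "hostB", "DOWN"), (70, "hostA", "UP")],
    apply_event=lambda host, state: applied.append((host, state)),
    sleep=delays.append,
)
print(delays)   # [0.0, 60.0, 10.0]
print(applied)  # [('hostA', 'DOWN'), ('hostB', 'DOWN'), ('hostA', 'UP')]
```

Passing the state change as a callback keeps the timing logic independent of how the infrastructure is told about each event.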


Instead, we informed the ad hoc server of the infrastructure state changes by modifying the VM Service project database. For example, when a DOWN event occurs, the respective ad hoc host and EDIM1 node are marked as unavailable by setting the host’s availability value to false. This triggers the ad hoc server to initiate the virtual machine migration process when it detects that an ad hoc host that was previously executing a cloud job is now unavailable. Similarly, when an UP state event occurs, the ad hoc host’s availability value is set to true, allowing the host to receive cloud jobs or to restore virtual machine checkpoints. By replaying availability data from an infrastructure that an ad hoc cloud could have been deployed on, we can reasonably gauge the reliability of our prototype in similar realistic settings.
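A minimal sketch of this database-driven approach is given below, using an in-memory SQLite database purely for illustration. The `hosts` table, its `name` and `available` columns and the `set_availability` function are assumptions; the actual VM Service schema and database engine are not described in this section.

```python
import sqlite3


def set_availability(conn, host, state):
    """Record a replayed state event as the host's availability flag."""
    available = 1 if state == "UP" else 0
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "UPDATE hosts SET available = ? WHERE name = ?",
            (available, host),
        )


# Hypothetical schema holding a single host that starts as available.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE hosts (name TEXT PRIMARY KEY, available INTEGER NOT NULL)"
)
conn.execute("INSERT INTO hosts VALUES ('hostA', 1)")

set_availability(conn, "hostA", "DOWN")
row = conn.execute(
    "SELECT available FROM hosts WHERE name = 'hostA'"
).fetchone()
print(row[0])  # 0
```

In this arrangement the server only has to poll or be notified of the flag, so no node needs to be powered off to exercise the migration path.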

[Availability map image not reproduced. Key: Job X; ad hoc host running job X becomes non-operational; ad hoc host becomes operational. Rows: date and time of each event, from 13/09/2012 04:45:23 to 13/09/2012 05:39:03. Columns: one per host, labelled with VM ID (170 to 199), EDIM ID and Reliability.]

Figure 6.3: Informatics Host Activity Availability Map

