
Chapter 3: Automating Communication Architecture Maintenance

3.2 Bi-Architecture Model

3.5.3 Validating Simulations

As mentioned above, we can accept our simulation results only if we validate our simulations. For this, we need to run experiments and simulations for the same scenario and compare the results. Unfortunately, we cannot run experiments for large scenarios because we do not have enough machines, which was the reason to use simulations in the first place. This appears to be a chicken-and-egg problem.

Fortunately, even with as few as ten machines, we can run some limited large-scale experiments. To do so, we use a virtualization-like approach in which we treat each user’s computer as a virtual computer that is mapped to one of the physical computers. One physical computer may have multiple virtual computers mapped to it. In fact, in experiments involving hundreds of users, a large number of virtual computers can be mapped to a single physical computer, since we have only ten of them. We could use off-the-shelf virtualization software to accomplish this. However, because we are using Windows machines and may need to virtualize many users on a single computer, this is not practical: a single Core2 desktop with 1 GB of memory can run only several virtual Windows machines before memory runs out. Instead, we added virtualization-like functionality to our collaborative framework that can map up to one hundred users onto a single computer before memory becomes an issue.

To illustrate how this virtualization-like solution helps us run experiments, consider an experiment with 100 users. Suppose that (a) user1 inputs all the commands and the remaining 99 users observe them, (b) unicast is used to transmit commands, (c) user1’s computer transmits to user100’s computer last, and (d) we are interested in user1’s local response times and user100’s remote response times. As explained above, to measure these response times properly, we must run user1 and user100 on dedicated machines. As a result, we are left with eight machines on which to run the remaining 98 users. We can divide these 98 users evenly, to within a few users, among the eight machines, so a machine may run as many as 13 users. While we cannot accurately measure the response times of these users, user1’s computer will still require time to transmit to them. Therefore, we are able to run certain near-ideal large-scale experiments.
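The mapping described above can be sketched as follows. This is only an illustration under stated assumptions: the function names, the round-robin placement, and the machine numbering are hypothetical and are not taken from the framework itself. It shows why, with user1 and user100 on dedicated machines, no shared machine needs to host more than 13 users.

```python
def assign_users(num_users=100, num_machines=10, dedicated=(1, 100)):
    """Map user IDs (1..num_users) to machine indices (0..num_machines-1).

    Hypothetical helper: each user in `dedicated` gets a machine of its own,
    and the remaining users are spread round-robin over the other machines.
    """
    mapping = {}
    # Give each measured user its own physical machine.
    for machine, user in enumerate(dedicated):
        mapping[user] = machine
    shared_machines = list(range(len(dedicated), num_machines))
    others = [u for u in range(1, num_users + 1) if u not in dedicated]
    # Spread the unmeasured users over the machines that are left.
    for i, user in enumerate(others):
        mapping[user] = shared_machines[i % len(shared_machines)]
    return mapping


def users_per_machine(mapping):
    """Count how many virtual users each physical machine hosts."""
    counts = {}
    for machine in mapping.values():
        counts[machine] = counts.get(machine, 0) + 1
    return counts


mapping = assign_users()
loads = users_per_machine(mapping)
print(max(loads.values()))  # heaviest load on any machine: 13
```

With 98 users on eight machines, two machines host 13 users and six host 12, which matches the bound stated in the text.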

Experiments in which multicast is used instead of unicast are more difficult. The reason is that our system dynamically measures performance parameters and, based on them, dynamically deploys multicast overlays. As a result, the multicast overlay that our system deploys may be one in which the path from user1 to user100 contains, as intermediate users, one or more of the 98 users running on the eight shared machines. For the same reason that we cannot rely on response times reported by these 98 users, we cannot rely on response times reported by any user to which these users forward commands. Once again, it seems impossible to validate through experiments the predictions made by our analytical model.

Fortunately, a workaround is possible that allows us to run some large-scale experiments with multicast. Unfortunately, the workaround requires us to impose even more limits on the scenarios we can use. Recall that our system can use previously measured parameter values to make architectural decisions at start time. Because of this functionality, we can force our system to deploy a multicast tree in which the path from user1 to user100 contains only users running on separate dedicated physical machines. We do this by fixing the values of some performance parameters that are provided to our system at start time. Moreover, we configure our self-optimizing system not to measure the values of the fixed parameters at runtime. Then, we compare the actual performance measurements with the simulated performance. Note that because we are fixing some values and ignoring their actual values at runtime, the experiment is less realistic than one in which the values are measured dynamically. However, fixing these values is the only way we can run experiments using the resources available to us. We next describe how we used this approach to validate our simulations.
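The condition that the workaround has to guarantee can be expressed as a small check. The sketch below assumes a hypothetical representation in which a multicast tree is a dictionary mapping each user to the user it receives commands from; the framework's own data structures are not described in this section.

```python
def path_from_root(parent, target):
    """Return the users on the tree path from the root down to `target`.

    `parent` maps each user to the user that forwards commands to it;
    the root (the inputting user) maps to None.
    """
    path = []
    node = target
    while node is not None:
        path.append(node)
        node = parent[node]
    path.reverse()
    return path


def path_is_measurable(parent, target, dedicated):
    """True if every user on the path to `target` runs on a dedicated machine."""
    return all(user in dedicated for user in path_from_root(parent, target))


# A tree in which user1 forwards to user2..user99 and user2 forwards to user100.
forced_tree = {1: None, 2: 1, 100: 2}
forced_tree.update({u: 1 for u in range(3, 100)})
dedicated_users = {1, 2, 100}

print(path_from_root(forced_tree, 100))                       # [1, 2, 100]
print(path_is_measurable(forced_tree, 100, dedicated_users))  # True
```

If a dynamically deployed overlay routed commands to user100 through one of the users on a shared machine, the same check would return False, and the reported response times could not be trusted.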

Simulation

Consider the same 100-user scenario as above, in which one user gives a PowerPoint presentation to 99 other users, and in which we fix the parameter values as follows. Suppose that user1, user2, and user100 are using a netbook, a P4 desktop, and a Core2 desktop, respectively, and that all of the other users are using Core2 desktops. Moreover, suppose that 1) user1 can communicate directly with all users except user100, 2) user2 can communicate directly with


we are running all of our computers on the same LAN, these communication restrictions do not reflect reality in our experiments, as all computers, and hence all users, can communicate directly with each other. Nevertheless, we can trick our system into believing in these communication restrictions by providing it with the following network latencies: 0ms latencies from user1 to all other users except user100; 0ms latencies from user2 to user1 and user100; and 1000000ms (effectively infinite) latencies from any other user to all other users. By doing this, we force the HMDM multicast scheme to create a multicast tree in which (a) user1 transmits first to user2 and then to user3 through user99 and (b) user2 transmits only to user100, as shown in Figure 3-9. The rest of the parameter values provided to the system at start time, such as processing and transmission costs, are assigned realistic values using the approach described in Appendix A. This scenario enables us to validate our simulations because we have a path from user1 to user100 that has only one intermediate node, user2. Moreover, we have a netbook, a P4 desktop, and a Core2 desktop on which we can run user1, user2, and user100, respectively. Hence, we have 1) isolated a path in the multicast overlay that our system would create if the conditions matched those we set up and 2) an overlay that we can map onto our ten computers.

Figure 3-9. Local Response Times of PowerPoint presenter. [Diagram labels: User1, User2, User3 … User98, User100; Netbook, P4 Desktop, Core2 Desktops.]
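The fixed latencies can be pictured as a table keyed by ordered pairs of users. The sketch below is an assumption-laden illustration, not the system's actual configuration format: the dictionary layout, the function name, and the parameter names are hypothetical, and the relay is taken to be user2, the single intermediate node on the path from user1 to user100.

```python
BLOCKED_MS = 1_000_000  # the "effectively infinite" latency quoted in the text


def build_latency_table(num_users=100, presenter=1, relay=2, last=100):
    """Return {(src, dst): latency_ms} for every ordered pair of distinct users.

    Hypothetical helper: every link starts out blocked, then the links that
    the forced multicast tree needs are set to 0 ms.
    """
    users = range(1, num_users + 1)
    table = {(s, d): BLOCKED_MS for s in users for d in users if s != d}
    # The presenter can reach everyone except the last user.
    for dst in users:
        if dst not in (presenter, last):
            table[(presenter, dst)] = 0
    # The relay can reach the presenter and the last user.
    table[(relay, presenter)] = 0
    table[(relay, last)] = 0
    return table


latencies = build_latency_table()
print(latencies[(1, 100)])  # 1000000: user1 may not send directly to user100
print(latencies[(2, 100)])  # 0: user2 is the only cheap route to user100
```

Given such a table, any latency-driven tree construction has only one inexpensive way to reach user100, which is the property the scenario relies on.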

We performed both simulations and experiments for this scenario. Since we were interested not only in whether the simulations predict absolute performance correctly but also in whether they predict trends correctly, we performed two sets of simulations and experiments, with 3 and 100 users, respectively. We report user1’s local response times for both sets, and user3’s and user100’s remote response times for the 3-user and 100-user sets, respectively.

We configured our system to calculate a new communication architecture only once, immediately after the presenter advances the presentation to the second slide. Moreover, since we are interested in whether or not our system can perform communication architecture changes dynamically, we configured our system to optimize only the communication architecture. In particular, we configured the system to keep the initial processing architecture, namely the centralized architecture in which the presenter’s computer is the master, throughout the session. Finally, we used a total order function that optimizes response times for as many users as possible.

The simulated and experimental response times for user1, user3, and user100 with unicast are shown in Figure 3-10. The simulated and experimental response times for these users with multicast are shown in Figure 3-11. We ran the experiment ten times and report the average response times. The simulations needed to be run only once since the values of all parameters are constant. This was not the case in our previous result, in which we simulated the performance of our system without fixing any parameter values. In that result, the computer types used by users other than the presenter were randomly assigned for each simulation. Moreover, the network latencies among the computers were assigned random, real values. Therefore, we ran forty simulations and reported the average.

We do not expect the simulated and measured response times to be exactly the same, for two reasons. First, the parameter values used in our simulations are average values, as explained in Appendix A. Hence, these values can differ slightly from the values during a particular experiment. Second, inherent to any experiment with distributed computers are fluctuations in network traffic and in non-collaborative processes, such as those of the operating system and anti-virus software. These fluctuations can affect performance from experiment to experiment. In our experience, reporting the average of ten experiments helps reduce but does not completely eliminate the fluctuations.

Figure 3-10. Simulation and experiment unicast response time comparison. [Plot "Unicast Simulation vs. Experiment": Response Time (ms), 0 to 1600, versus Num Users, 0 to 100; series USER 0 SIM, LAST USER SIM, USER 0 EXP, LAST USER EXP.]

Figure 3-11. Simulation and experiment multicast response time comparison. [Plot "Multicast Simulation vs. Experiment": Response Time (ms), 0 to 180, versus Num Users, 0 to 100; series USER 0 SIM, LAST USER SIM, USER 0 EXP, LAST USER EXP.]

As Figures 3-10 and 3-11 show, the simulated response times are close to the experimental response times. The largest difference was for the unicast remote response time of user100: the simulated value was 79.077ms less than the measured value. The next-largest difference between any simulated and experimental value was 10.667ms. Therefore, even though most values are close, our simulations can be noticeably wrong. Nevertheless, even with the differences between simulated and measured values, the simulations show that the self-optimizing system would have selected multicast instead of unicast, which would have improved the remote response time of user100 by 1276.31ms according to the experiments.
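The comparison itself amounts to averaging the repeated experimental runs and subtracting the result from the single simulated value. The following sketch illustrates that bookkeeping; the function names and the dictionary layout are hypothetical, and the numbers in the example are placeholders chosen for readability, not the measurements reported above.

```python
from statistics import mean


def compare(simulated_ms, runs_ms):
    """Return (measured mean, simulated minus measured), both in milliseconds."""
    measured = mean(runs_ms)
    return measured, simulated_ms - measured


def largest_gap(simulated, experiments):
    """Find the metric whose simulated value deviates most from the measured mean."""
    diffs = {name: compare(simulated[name], experiments[name])[1]
             for name in simulated}
    worst = max(diffs, key=lambda name: abs(diffs[name]))
    return worst, diffs[worst]


# Placeholder values only (NOT the thesis data).
simulated = {"local": 50.0, "remote": 200.0}
experiments = {"local": [48.0, 52.0], "remote": [230.0, 250.0]}

print(largest_gap(simulated, experiments))  # ('remote', -40.0)
```

A negative difference means the simulation underestimated the measured response time, as was the case for the unicast remote response time of user100.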

In document Towards self-optimizing frameworks for collaborative systems (Page 159-165)