We developed TestDeploy, an automated tool for testing and running the experiments.
We now describe the techniques that TestDeploy uses to address some of the key requirements.
5.2.1 Ensuring Synchronization via Phased Deployment
When evaluating large distributed systems, the participating clients often need to join a given experiment simultaneously. Synchronization issues may arise if care is not taken when deploying to a large number of remote nodes. Copying software and data files during deployment may take a variable amount of time on different machines, depending on their download capacities and on the amount of data that must be copied to each machine. For instance, seed peers in a BitTorrent test swarm must receive the file to be shared before the experiment starts, while leechers do not. If clients begin running the experiment as soon as their own deployment is complete, some clients may get an undesired head start. For instance, if leechers start before a single seed is deployed, there will be no content to download at the start of the experiment. Similarly, if one leecher begins the download before the other leechers, it will have no leechers to upload to for a time, and may record a lower upload capacity utilization than it would have if the experiment had been set up correctly.
TestDeploy addresses synchronization by splitting the experiment into three consecutive phases: setup, start and gather. The setup phase deploys software and data to the machines, the start phase starts the experiment on the machines, and the gather phase collects the logs. The deployment of software in the setup phase may take a variable amount of time on different machines. The start phase, by contrast, is a relatively quick operation, so waiting for setup to complete on all of the machines before entering the start phase allows all of the clients to join the experiment within just a few seconds of one another. As a side effect, a machine's failure to complete the heavier setup phase in a timely manner suggests that the machine may have become unexpectedly busy and may not be a good candidate to participate in the experiment. TestDeploy uses this additional information to filter out further poor candidates prior to the start of the experiment.
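The phased approach can be illustrated with a minimal Python sketch. It is not TestDeploy's actual code: the function name run_phased, the setup and start callables, and the setup_timeout_s parameter are all hypothetical, and a single timeout stands in for whatever "timely manner" means in a given deployment.

```python
from concurrent.futures import ThreadPoolExecutor, wait


def run_phased(hosts, setup, start, setup_timeout_s):
    """Run setup on every host, then start only the hosts that finished in time.

    `setup` and `start` are hypothetical callables that take a host name; in a
    real deployment they would copy files and launch the client remotely.
    """
    pool = ThreadPoolExecutor(max_workers=max(1, len(hosts)))
    try:
        # Setup phase: slow and variable, so run it on all hosts in parallel.
        futures = {pool.submit(setup, host): host for host in hosts}
        done, _late = wait(futures, timeout=setup_timeout_s)
        # Hosts that timed out or raised an error are filtered out.
        ready = [futures[f] for f in done if f.exception() is None]
        ready.sort(key=hosts.index)  # keep the caller's ordering
        # Start phase: quick, so all ready hosts join within a short window.
        for host in ready:
            start(host)
        return ready
    finally:
        # Do not block on hosts that are still stuck in setup.
        pool.shutdown(wait=False)
```

The design choice shown here is that no client is started until the setup phase has ended for every host, either by completion or by timeout, so that the slow and variable step never overlaps with the measured experiment.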
5.2.2 Detecting Stale Clients via White List Distribution
Since PlanetLab includes machines from multiple geographical regions and networks, it is possible for some machines to become cut off from parts of the Internet, but not from all of it. Such scenarios occur in practice in any large distributed network system, and we have observed them on PlanetLab as well. As a result, stale clients left over from earlier experiments that may not have been properly shut down can still be running, and could connect to some of the clients in the current experiment. Such a stale client may be, for example, a file-sharing peer that finished downloading a file in a prior experiment and could serve as an unexpected additional seed for the clients that it can contact. This seed would unfairly improve the download rate of the leechers that connect to it and would skew the measurements.
We detect and eliminate stale clients by distributing a white list. At the conclusion of the setup phase, TestDeploy generates a white list containing the IP addresses of only those machines that are participating in the current experiment and have successfully completed the setup phase. The white list is copied to the participating machines at the beginning of the start phase. All of our software is programmed to read the white list at startup and, at runtime, to refuse connections from any machine that is not on the white list.
In the P2P scenario, the white list is deployed and used by all participating entities, including seeds, leechers and trackers.
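The white-list mechanism can be sketched as follows. This is a minimal Python illustration, not the code used in the experiments: the function names and the one-address-per-line file format are assumptions made for the example.

```python
def build_whitelist(participants, setup_ok):
    """Return the IPs that are in this experiment AND completed setup."""
    return sorted(ip for ip in participants if ip in setup_ok)


def render_whitelist(ips):
    """Serialize the white list, one address per line (assumed format)."""
    return "".join(ip + "\n" for ip in ips)


def parse_whitelist(text):
    """Parse the white list as a client would at startup."""
    return {line.strip() for line in text.splitlines() if line.strip()}


def accept_connection(peer_ip, whitelist):
    """Runtime check: refuse any peer that is not on the white list."""
    return peer_ip in whitelist
```

A stale client from an earlier experiment is absent from the freshly generated list, so accept_connection returns False for it even when it is still running and reachable.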
5.2.3 Test Monitoring, Termination and Log Collection
To be able to run a batch of experiments, we automated the collection of logs from the nodes between experiments. Log collection occurs during the gather phase. Since BitTorrent clients periodically report their completion state to the BitTorrent tracker, we leveraged this information to determine when the nodes complete their download. We modified the tracker to generate a small file that contains the list of nodes that have completed the download. This completion file is regenerated every minute. Our deployment script periodically downloads and reads the completion file to determine when it is time to stop the experiment and collect the logs. Once the logs are copied from the machines and placed in a directory for the given experiment, the setup phase of the next experiment can begin. (To leverage this log-collection functionality for P2P streaming experiments, we modified the streaming clients to report their status to a tracker as well.)
Since some machines may become unreachable during the course of an experiment, we amended our software to collect the logs without waiting for all of the machines to complete. Log collection begins when a configurable fraction of the machines, typically 95-98%, has completed the test and a significant period, typically 5 minutes, passes without further completions. In practice, since we carefully select well-connected and healthy machines, it is rare for a machine not to complete, but the configured threshold allows extra flexibility and prevents interruptions when running a batch of tests.
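The termination rule can be written as a small predicate. The sketch below is in Python, and its function name, parameter names and defaults are illustrative. Collecting immediately once every machine has completed is an assumption that the text implies but does not state.

```python
def should_collect(completed, total, last_completion_ts, now,
                   min_fraction=0.95, quiet_s=300):
    """Decide whether to stop the experiment and gather logs.

    completed / total   -- counts taken from the tracker's completion file
    last_completion_ts  -- time (seconds) of the most recent completion, or None
    min_fraction        -- configurable threshold (the text cites 95-98%)
    quiet_s             -- required time with no new completions (about 5 min)
    """
    if total <= 0:
        return False
    if completed >= total:
        # Assumption: nothing is left to wait for, so collect right away.
        return True
    enough = completed / total >= min_fraction
    quiet = (last_completion_ts is not None
             and now - last_completion_ts >= quiet_s)
    return enough and quiet
```

A deployment script would evaluate this predicate each time it re-reads the completion file, resetting last_completion_ts whenever the number of completed nodes grows.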
5.3 BitTorrent-specific Deployment Features
5.3.1 Using Local Torrents
A BitTorrent client can be initialized either with a local torrent file or with a URL from which to download the torrent file. Since we want to measure strictly the download of the target file, we want to avoid measuring any contention caused by all of the clients trying to download the torrent file from a web server. (This may occur in experiments where the clients are started simultaneously and the torrent file itself is large enough to cause contention for web-server resources.) Therefore, we configured our clients to be initialized with a local torrent file, which is copied to the machines during the setup phase.
5.3.2 Cleaning Dot Files
Many BitTorrent clients, including both the original Python-based BitTorrent client and the Java-based Azureus clients, tend to leave hints from earlier runs in the form of dot files in the home directory. It took us a while to realize this, after which we added a procedure that removes such files during the setup phase so that these hints do not skew the new experiment.
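A cleanup step of this kind might look like the Python sketch below. The function name is hypothetical, and the text does not list the dot-file names, so the names in the usage comment are assumptions that must be verified for each client.

```python
import os
import shutil


def remove_dot_entries(home, names):
    """Delete the given dot files or dot directories directly under `home`.

    Returns the names that were actually removed. Names must be plain dot
    entries (no path separators) to avoid deleting anything outside `home`.
    """
    removed = []
    for name in names:
        if (not name.startswith(".") or name in (".", "..")
                or os.path.basename(name) != name):
            raise ValueError("not a plain dot entry: %r" % name)
        path = os.path.join(home, name)
        if os.path.isdir(path) and not os.path.islink(path):
            shutil.rmtree(path)      # per-client state directory
        elif os.path.lexists(path):
            os.remove(path)          # regular file or symlink
        else:
            continue                 # nothing left behind by this client
        removed.append(name)
    return removed


# Hypothetical usage; the directory names are assumptions, not from the text:
# remove_dot_entries(os.path.expanduser("~"), [".bittorrent", ".azureus"])
```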