As is commonly the case when one does something for the first time (in particular something that nobody else has done either), one learns a lot about how to do it and how not to do it, the risks involved, what is important and what is not, what to expect, etc. Here are some of our insights from the first round of Plat_Forms:
• 30-hour time frame: When we announced Plat_Forms, many people reacted quite strongly to the 30-hour time limit. They apparently considered it to mean a mandatory 30-hour-long stretch of continuous development, while we had just meant to say we did not want to keep the teams from following their own pace and style with respect to work times. Having seen it happen at the contest site, we now think it would be better to actually prescribe a break of 8 or perhaps even 10 hours during the night. We had the impression that some teams were lured into overtaxing their strength by the 30-hour formulation and that it may actually have hurt more than it helped.
• Quality of teams: It was an ambition of Plat_Forms to have only top-class teams, roughly from among the best 5% of such teams available on the market. The idea was that only in this case would the variability that was due to the people be so low that the variability that was due to the platform would become clearly visible. This goal may have been unrealistic from the start and we have clearly not met it: While the teams were obviously competent, they were by and large still in the “normal” rather than “superhuman” range. For what it’s worth as an indicator: three members (from team1 Perl, team2 Perl, and team9 Java) had only one or two years of professional experience (see Figure 3.2) and eight members (from team1 Perl, team2 Perl, team4 Java, team9 Java, and team7 PHP) estimated themselves to be not even among the top 20% of professional developers in terms of their capabilities (see Figure 3.4). It appears that this fact has reduced the number and clarity of platform differences found, but has not kept us from finding some differences, at least with respect to Perl and PHP; it looks problematic for Java.
• Size of task: As a consequence of the lack of superhuman teams, the PbT task was too large for the given time frame. Much of the analysis and comparison would have been easier had the completeness of the solutions been higher.
• Webservice requirements: Our main reason for including these requirements in the task was to provide ourselves with an affordable method for load testing (see the discussion in Section 9). This did not work out at all, which we should have expected: It should have been obvious that the teams would first concentrate on the “visible” functionality at the user interface level and that, as the task size was ambitious, many would not be able to complete all or some of the webservice requirements, and that making all webservice requirements MUST requirements (as we did) might not help. It did not. Unfortunately, this insight does not show how to escape from this problem. Presumably one has to go the hard way and plan for performing the load testing via the HTML user interface; may the Javascript gods be gentle.
• Dropout risk: Only while we waited for the remaining teams to arrive in the afternoon of January 24 did we fully realize what a crazy risk we were taking by having only three teams per platform. If only one of them did not arrive, we would hardly be able to say much about that platform at all. Fortunately, we had apparently found teams with a good attitude and they all appeared as planned. Since scaling the contest to four teams per platform is a major cost problem (see the next lesson), this issue needs to be handled with utmost care in future contests.
• Data analysis effort: Even though we had only three platforms in the contest this time (where six had been planned), we were almost overwhelmed by the amount and complexity of work to be done for the evaluation. Much of it, such as the completeness checks, was just an enormous amount of busywork. Other parts, such as the correct classification of file origin, were downright difficult. Still others, such as our attempt at runtime profiling, held surprising technical difficulties. Although we have spent a fair amount of resources on the evaluation, we are far from having done all we would have liked to. Many of our analyses are much more superficial than one would like, because of the additional complexity introduced by the heterogeneity of our raw material. As an example, consider the evaluation of the version archives. Although there were only three different versioning systems used, a far lower amount of variability than in many other places in the evaluation, we were not able to analyze on the level of number of lines added/deleted per revision, because such data was easily available only for CVS. It would have been possible, but very laborious, to get the same data for Subversion and Perforce as well.
• Server and network setup: As a minor note, while our technical setup at the conference site worked quite flawlessly, one of its aspects turned out to be problematic later: Each team had a full subnet with 256 IP addresses (firewalled against the subnets of the other teams) for their development machines plus server (typically about 4 of these IPs would be used) and we told them not only the address of that subnetwork, but also that we would assign IP 80 within it to their webserver and map it to the externally visible name teamX.plat-forms.org. When we later ran the teams’ webserver virtual machines at home, though, we wanted to use dynamic IP addresses. It turned out that some teams had hardcoded the IP somewhere, which made setting up the machines at home more difficult.
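To illustrate the last point, the following is a minimal sketch of how a solution could avoid a hardcoded server address. It is not taken from any team's code. The environment variable names `APP_SERVER_HOST` and `APP_TEAM_SUBNET`, the function names, and the example subnets are hypothetical; only the convention of 256 addresses per subnet with IP 80 assigned to the webserver comes from the setup described above.

```python
import ipaddress
import os

# From the contest setup: the webserver received IP 80 within the team's subnet.
WEBSERVER_HOST_INDEX = 80


def contest_webserver_address(subnet: str) -> str:
    """Return the address at index 80 of a 256-address subnet."""
    network = ipaddress.ip_network(subnet)
    if network.num_addresses != 256:
        raise ValueError(f"expected a subnet with 256 addresses, got {subnet}")
    return str(network[WEBSERVER_HOST_INDEX])


def resolve_server_host(environ=None) -> str:
    """Look up the server host from configuration instead of hardcoding it.

    Order of preference (all variable names are hypothetical):
    1. APP_SERVER_HOST: an explicit host name or address
    2. APP_TEAM_SUBNET: derive the address from the contest convention
    3. "localhost" as a fallback for a relocated machine
    """
    env = os.environ if environ is None else environ
    host = env.get("APP_SERVER_HOST")
    if host:
        return host
    subnet = env.get("APP_TEAM_SUBNET")
    if subnet:
        return contest_webserver_address(subnet)
    return "localhost"
```

With such a lookup, moving a virtual machine to a network with dynamic IP addresses would only require changing or removing one setting, for example `resolve_server_host({"APP_SERVER_HOST": "teamX.plat-forms.org"})`, rather than searching the source code for the address.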