The Singularity¹⁹: Origin of the Bots, by Jason Arbon

A long time ago, in a Google office far, far away… Chrome was at version 1. We could see the early data coming in and realized that there were quite a few issues where Chrome was rendering web pages differently than Firefox. The early method of measuring these differences was limited to tracking the incoming rate of bugs reported by users and seeing how many users complained of application compatibility problems when they uninstalled the browser after trying it out.

I wondered if there could be a more repeatable, automated, and quantifiable way of measuring how well we were doing in this area. Many folks before had tried to automatically diff screenshots of web pages across browsers, and some even tried to use fancy image and edge detection to identify exactly what was different between the renderings, but this often failed because you still end up with a lot of differences due to different images from ads, content changing, and so on.

Basic WebKit layout tests used only a single hash of the entire layout of the page, as shown in Figure 3.33. Even when a real issue was found, engineers still had little clue about what was technically broken in the application, as they had only a picture of the failure. The many false positives²⁰ often created more work for engineers than they saved.

¹⁹The singularity is a term often used to describe the moment computers surpass human intelligence. It should be an interesting time, and we are seeing glimpses of that today (http://en.wikipedia.org/wiki/Technological_singularity).

²⁰False positives are test failures that are not really failures of the product but of the testing software. They can become expensive, aggravate engineers, and quickly slow engineering productivity with fruitless investigations.

²¹The DOM is the Document Object Model, the internal representation of all that HTML under the hood of a web page. It contains all the little objects that represent buttons, text fields, images, and so on.

²²Firefox was used as the benchmark, as it more closely adhered to the HTML standard, and many sites had IE-specific code, which wasn't expected to render well in Chrome.

My mind kept coming back to the early, simple ChromeBot that crawled millions of URLs via Chrome browser instances across thousands of virtual machines using spare compute cycles in the data centers, looking for crashes of any sort. The tool was valuable as it caught crashes early on, and some functional testing of browser interaction was bolted on later. However, it had lost its shine and become primarily just a tool for catching the rare crash. What if we built a more ambitious version of this tool, one that interacted with the page itself instead of just the "chrome" around it, and just called it Bots?

So, I considered a different approach: going inside the DOM.²¹ I spent about a week putting together a quick experiment that would load many web pages one after the other, injecting JavaScript into the page to scrape out a map of the inner structure of the web page.

There were many smart people who were highly skeptical of this approach when I ran it past them. A small sampling of reasons people gave to suggest not trying this:

• Ads keep changing.

• Content on sites such as CNN.com keeps changing.

• Browser-specific code means pages will render differently on different browsers.

• Bugs in the browsers themselves cause differences.

• Such an effort requires overwhelming amounts of data.

All this just sounded like a fun challenge and if I failed, well, I could fail quietly. I'd also worked on another search engine in the past, so I probably had more confidence than I should have that the signals could be pulled from the noise. I realized I would have little internal competition on such a project. I pressed on quietly. At Google, data speaks. I wanted to see the data.

To run an experiment, I needed control data to compare the results with. The best resource was the actual testers driving this work. I chatted with the two test engineering leads who routinely drove vendor-testers manually through the top 500 or so websites in Chrome, looking for differences with Firefox.²² They said that at the time of launch, a little less than half of the top websites had some issues, but it had been steadily getting better to the point that they were few and far between: less than 5 percent of sites.


²³Just a hash of the elements returned from document.elementFromPoint(x, y) for an 800 x 1,000 section of the web pages. There are more efficient ways to do this, but it is simple and works for illustration.

I then constructed the experiment using WebDriver (the next-generation Selenium). WebDriver had better Chrome support and a cleaner API. I performed the first run to collect data using the early versions of Chrome through the current version to see if the machines would find a similar trend line. It simply loaded up the same top websites, and at every pixel checked to see which HTML element (not the RGB value) was visible at that point²³ and pushed this data to the server. This run was on my local machine and took about 12 hours, so I let it run overnight.
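The sampling step described above can be sketched roughly as follows. This is a minimal illustration, not the original experiment's code: it assumes the Selenium Python bindings' `driver.get` and `driver.execute_script` calls and the browser's `document.elementFromPoint`, while the grid step, the "TAG#id" signature format, and the function names `sample_element_grid` and `grid_hash` are hypothetical choices made for this example.

```python
import hashlib

# JavaScript executed inside the page. It walks a grid of points and records
# which element the browser reports at each one (the element, not its color).
# The 800 x 1000 area follows the footnote; the step size and the "TAG#id"
# signature are assumptions made for this sketch. elementFromPoint uses
# viewport coordinates, so the browser window must be at least that large,
# otherwise out-of-view points come back as empty strings.
SAMPLE_SCRIPT = """
var width = arguments[0], height = arguments[1], step = arguments[2];
var rows = [];
for (var y = 0; y < height; y += step) {
  var row = [];
  for (var x = 0; x < width; x += step) {
    var el = document.elementFromPoint(x, y);
    row.push(el ? el.tagName + (el.id ? '#' + el.id : '') : '');
  }
  rows.push(row);
}
return rows;
"""


def sample_element_grid(driver, url, width=800, height=1000, step=10):
    """Load url and return a grid of element signatures, one per sampled point."""
    driver.get(url)
    return driver.execute_script(SAMPLE_SCRIPT, width, height, step)


def grid_hash(grid):
    """Collapse a grid into one hex digest so two runs can be compared cheaply."""
    digest = hashlib.sha256()
    for row in grid:
        digest.update("|".join(row).encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()
```

With Selenium installed, `driver` would typically be something like `webdriver.Chrome()` or `webdriver.Firefox()`; any object exposing `get` and `execute_script` works here. Keeping the full grid, rather than only the hash, is what makes it possible to say where two renderings differ instead of only that they differ.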

The next day, the data looked good, and so I swapped Chrome out for Firefox and reran the same tests. Yes, there would be jitter from site content changing, but this was a first pass to see what the data might look like, and I would later run both in parallel. I came into the office in the morning to find my Windows desktop physically disconnected from every cable and pulled out away from the wall! I usually came in later than my desk neighbors, who gave me a strange look and said all they knew was that I was supposed to chat with the security folks. I can only imagine what they thought. The crawl had infected my machine with a virus with an unknown signature and it had started behaving very badly overnight. They asked if I wanted to try to remove any data from the machine in a controlled environment before they physically destroyed the drive. Thanks to my data in the cloud, I said they could just take the entire machine. I moved all runs to external VMs after that.

The data looked similar to the anecdotal data from the TEs (see Figure 3.34).

The machines independently produced data in 48 hours that was eerily similar to perhaps a year's worth of manual testing efforts. Eerily similar data.
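The text does not say how the per-build compatibility number was derived from the raw samples. One plausible way, shown purely as an assumption, is to compare the element grids captured for the same site in two browsers and count the share of sites that agree closely enough. The function names and the 0.95 threshold below are hypothetical.

```python
def grid_similarity(grid_a, grid_b):
    """Fraction of sampled points where both browsers report the same element."""
    total = 0
    same = 0
    for row_a, row_b in zip(grid_a, grid_b):
        for cell_a, cell_b in zip(row_a, row_b):
            total += 1
            if cell_a == cell_b:
                same += 1
    # Two empty grids are treated as identical.
    return same / total if total else 1.0


def compat_score(pairs, threshold=0.95):
    """Share of sites whose two grids agree on at least `threshold` of points.

    `pairs` is a list of (chrome_grid, firefox_grid) tuples, one per site.
    The threshold is an arbitrary value chosen for this illustration.
    """
    if not pairs:
        return 0.0
    passing = sum(1 for a, b in pairs if grid_similarity(a, b) >= threshold)
    return passing / len(pairs)
```

A tolerance below 1.0 is one simple way to absorb the jitter from ads and changing content that the skeptics pointed out; running both browsers in parallel, as the author later did, reduces that jitter at the source.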

FIGURE 3.34 Early data showing similarity between bot and human measures of quality.

[Chart: "Chrome AppCompat". x-axis: build (1.0.154.34, 2.0.172.33, 3.0.190.4, 3.0.192.1); y-axis: % compat with FF, 0.5 to 1.0.]

The data looked promising. A few days of coding and two nights of execution on a single machine seemed to quantify the work of many testers for more than a year. I shared this early data with my director, who will go unnamed. He thought it was cool, but he asked that we keep focus on our other experiments that were much further along. I did the Googley thing, and told him I'd put it on hold, but didn't. We had two fantastic interns that summer, and we roped them into productizing these runs and building richer views to visualize the differences. They also experimented with measuring the runtime event differences. Eric Wu and Elena Yang demo'd their work at the end of the summer and made everyone a believer that this approach had a lot of promise.

Tejas Shah was inspired by the data, and as the interns rolled off, he built an engineering team to take this experiment and make it real.

In document How Google Tests Software (Page 179-182)