Quality Bots Experiment - How Google Tests Software

What would testing look like if we forgot the state-of-the-art approaches and tools for testing and took on the mindset of a search engine’s infrastructure, with virtually free CPU, virtually free storage, and expensive brains to work on algorithms? It would look like bots: quality bots, to be exact.

After working on many projects at Google and chatting with many other engineers and teams, we realized we spent a lot of engineering brains and treasure hand-crafting and running regression suites. Maintaining automated test scenarios and manual regression execution is an expensive business. Not only is it expensive, but it is often slow. To make it worse, we end up looking for behavior we expect—what about the unexpected?

Perhaps due to the quality-focused engineering practices at Google, regression runs often show less than a 5 percent failure rate. Importantly, this work is also mind-numbing to our TEs, whom we interview for being highly curious, intelligent, and creative. We want to free them up to do the smarter testing that we hired them for: exploratory testing.

Google Search constantly crawls the Web; it keeps track of what it sees, figures out a way to order that data in vast indexes, ranks that data according to static and dynamic relevance (quality) scores, and serves the data up on demand in search result pages. If you think about it long enough, you can start to view the basic search engine design as an automated quality-scoring machine, which sounds a lot like the ideal test engine! We’ve built a test-focused version of this same basic system:

1. Crawl: The bots are crawling the Web now.¹³ Thousands of virtual machines, loaded with WebDriver automation scripts, drive the major browsers through many of the top URLs on the Web. As they crawl URL to URL like monkeys swinging vine to vine, they analyze the structure of the web pages they visit. They build a map of which HTML elements appear, where they appear, and how they appear.

13 The highest priority crawls execute on virtual machines from Skytap.com. Skytap provides a powerful virtual machine environment that lets developers connect directly to the virtual machine where the failure occurred and manage those debugging instances, all without leaving the browser. Developer focus and time are much more valuable than CPU cycles. Moving forward, Skytap enables bots to execute entirely on other users’ virtual machines and accounts, allowing access to their nonpublic staging servers.

2. Index: The crawlers post the raw data to the index servers. The index orders the information based on which browser was used and what time the crawl happened; it pre-computes basic statistics about the differences between each run, such as how many pages were crawled.

3. Ranking: When an engineer wants to view results for either a particular page across several runs or all pages for a single browser, the ranker does the heavy compute to figure out a quality score. The quality score is computed as a simple percent similarity score between the two pages, and it is also averaged over entire runs. A score of 100 percent means the pages are identical. Less than 100 percent means things are different, and the score is a measure of how different.

4. Results: Results are summarized on a bots dashboard (see Figure 3.27). Detailed results are rendered as a simple grid of scores for each page, showing the percent similarity (see Figures 3.28 and 3.29). For each result, the engineer can dig into visual differences, showing the detailed score on overlays of what was different between the runs, with the XPaths¹⁴ of the different elements and their positions (see Figure 3.30). Engineers can also view the average, minimum, and maximum historical scores for this URL, and so on.
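The ranking step lends itself to a small illustration. What follows is only a sketch of one way a "simple percent similarity score" could be computed, not the Bots implementation: it assumes each crawled page has been reduced to a map from an element's XPath to its position and size, and it counts an element as unchanged only if it appears in both crawls with identical geometry. The names `ElementMap`, `percent_similarity`, `run_similarity`, and `changed_xpaths` are hypothetical.

```python
from typing import Dict, List, Tuple

# Assumed representation (not from the source): XPath -> (x, y, width, height).
ElementMap = Dict[str, Tuple[int, int, int, int]]


def percent_similarity(before: ElementMap, after: ElementMap) -> float:
    """Percentage of elements found in both crawls with identical geometry."""
    all_paths = set(before) | set(after)
    if not all_paths:
        return 100.0  # two empty pages are trivially identical
    unchanged = sum(
        1
        for path in all_paths
        if path in before and path in after and before[path] == after[path]
    )
    return 100.0 * unchanged / len(all_paths)


def run_similarity(
    before_run: Dict[str, ElementMap], after_run: Dict[str, ElementMap]
) -> float:
    """Average the per-page score over every URL crawled in both runs."""
    urls = sorted(set(before_run) & set(after_run))
    if not urls:
        raise ValueError("the two runs share no URLs")
    total = sum(percent_similarity(before_run[u], after_run[u]) for u in urls)
    return total / len(urls)


def changed_xpaths(before: ElementMap, after: ElementMap) -> List[str]:
    """XPaths that were added, removed, moved, or resized between crawls."""
    return sorted(
        path
        for path in set(before) | set(after)
        if before.get(path) != after.get(path)
    )
```

With four elements on a page and one of them pushed down the page between builds, `percent_similarity` returns 75.0 and `changed_xpaths` names the element that moved, which is the kind of detail the visual diff overlays present. A production scorer would presumably weight elements by size or visibility; this sketch treats every element equally.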

14 XPaths are much like file paths, but work within web pages instead of file systems. They identify the parent-child relationships and other information that uniquely identifies an element within the DOM on a web page. See http://en.wikipedia.org/wiki/XPath.

FIGURE 3.27 Bot summary dashboard showing trends across Chrome builds.

FIGURE 3.28 Bot typical grid details view.

FIGURE 3.29 Bot grid sorted to highlight the largest differences.

The first official run of Bots caught an issue introduced between two Canary builds of Chrome. The bots executed automatically, and the TE looked at the results grid, which showed this URL to have dropped in percent similarity. The engineer was able to quickly file the issue based on the detail view in Figure 3.31, which highlighted the exact portion of the page that was different. Because these bots can test every build of Chrome,¹⁵ and each build contains only a few CLs, the engineer quickly isolated the offending code check-in.

A check-in¹⁶ into the WebKit codebase (bug 56859: reduce float iteration in logicalLeft/RightOffsetForLine) caused a regression¹⁷ that forced the middle div on this page to be rendered below the fold of the page. Issue 77261: ezinearticles.com layout looks broken on Chrome 12.0.712.0.

FIGURE 3.30 Bot visual diff inspection for page with no differences.

15 Chrome builds many times per day.

16 The check-in that caused the regression is at http://trac.webkit.org/changeset/81691.

17 The WebKit Bugzilla issue is at https://bugs.webkit.org/show_bug.cgi?id=56859. The issue is tracked in Chromium at http://code.google.com/p/chromium/issues/detail?id=77261.

As we predicted (and hoped), the data from the bots looks very much like the data we get from the manual equivalents, and in many ways, it’s better. Most of the web pages were identical, and even when they were different, quick inspection using the results viewer enables engineers to quickly note that there was nothing interesting (refer to Figure 3.29). Machines are now able to compute that no regressions happened. The significance of this should not be lost: we don’t need humans to slog through all these uninteresting web pages, some 90 percent of them. Test passes that used to take days can now be executed on the order of minutes, and they can be executed every day versus every week or so. Those testers are freed to look for more interesting bugs.

If we look at the view where the browser remains the same, but we vary a single website’s data over time, we now have something that tests a website, instead of just testing a browser. There are similar views for seeing a single URL across all browsers and across all test runs. This gives a web developer the opportunity to see all changes that have happened for her site. This means a web developer can push a new build, let the bots crawl it, and be pushed a grid of results showing what changed. At a glance, almost immediately, and with no manual testing intervention, the web developer can confirm that any changes the bots found are okay and can be ignored. Those that look like regressions can be turned into bugs that carry the data on which browsers were affected, what application version was running, and the exact HTML element information where the bug occurred.
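These views are essentially different pivots of the same score records. The sketch below is an illustration under assumed names, not the Bots data model: `Result`, `grid_for_browser`, `grid_for_url`, and `worst_first` are hypothetical, and each record is assumed to hold a URL, a browser, a run label (for example a build number), and a percent similarity score.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple

Grid = Dict[str, Dict[str, float]]  # row label -> {run label -> score}


class Result(NamedTuple):
    url: str
    browser: str
    run: str      # assumed label, e.g. a build number or crawl timestamp
    score: float  # percent similarity against the previous run


def grid_for_browser(results: Iterable[Result], browser: str) -> Grid:
    """Fix the browser: rows are URLs, columns are runs."""
    grid: Grid = defaultdict(dict)
    for r in results:
        if r.browser == browser:
            grid[r.url][r.run] = r.score
    return dict(grid)


def grid_for_url(results: Iterable[Result], url: str) -> Grid:
    """Fix the URL: rows are browsers, columns are runs."""
    grid: Grid = defaultdict(dict)
    for r in results:
        if r.url == url:
            grid[r.browser][r.run] = r.score
    return dict(grid)


def worst_first(grid: Grid) -> List[str]:
    """Order rows by their lowest score so the largest differences come first."""
    return sorted(grid, key=lambda row: min(grid[row].values()))
```

Sorting rows by their lowest score mirrors the idea behind the grid in Figure 3.29, where the largest differences are brought to the top; the actual sort key the dashboard uses is not stated in the text.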

FIGURE 3.31 View of first bug caught with first run of bots.

What about websites that are data-driven? YouTube and CNN are heavily data-driven sites, and their content changes all the time. Wouldn’t this confuse the bots? Not if the bots are aware of what the normal “jitter” in data for that site is, based on historical data. If run over run, only the article text and images change, the bots measure differences within a range that is normal for that site. If the site’s score moves outside of that range, say when an IFRAME is broken or the site moves to an entirely new layout, the bots can generate an alert and notify the web developer, who can determine that this is the new normal or file appropriate bugs if it was a new layout issue. An example of this small amount of noise can be seen in Figure 3.32, where CNET shows a small ad that appeared during the run on the right side, but not on the left. This noise is small and can be ignored via heuristics, or it can be marked “ignore” in seconds by a human who notices that this difference is an ad.
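One way to picture this jitter check is as a band around a site's historical scores. The sketch below is an assumption about how such a check might work, not a description of the Bots heuristics: it uses the mean and population standard deviation of past scores, and the `tolerance` and `floor` parameters, like the function names, are hypothetical.

```python
from statistics import mean, pstdev
from typing import Sequence, Tuple


def normal_range(
    history: Sequence[float], tolerance: float = 3.0, floor: float = 0.5
) -> Tuple[float, float]:
    """Return (low, high) bounds for scores considered normal for a site.

    `floor` keeps the band from collapsing to zero width when every
    historical score is identical.
    """
    center = mean(history)
    spread = max(pstdev(history), floor)
    return center - tolerance * spread, center + tolerance * spread


def is_outside_normal(
    score: float,
    history: Sequence[float],
    tolerance: float = 3.0,
    floor: float = 0.5,
) -> bool:
    """True when a new score falls outside the site's historical jitter."""
    low, high = normal_range(history, tolerance, floor)
    return score < low or score > high
```

For a news site whose recent scores were 92, 94, 93, 95, and 93, a new score of 93.5 stays inside the band, while a score of 61 (a broken IFRAME, say) falls well outside it and would raise an alert. A very stable site with a history of straight 100s gets a narrow band, so even a small drop is flagged.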

FIGURE 3.32 Bot visual diff inspection for page with noisy differences.

Now, what about all these alerts when they do occur? Does the tester or developer have to see them all? Nope. Experiments are underway to route these differences directly to crowd-sourced¹⁸ testers for quick evaluation to shield the core test and development teams from any noise. The crowd-sourced testers are asked to view the two versions of the web page and the differences found, and to either label each difference as a bug or ignore it because it looks like a new feature. This extra layer of filtering can further shield the core engineering team from noise.

18 Our friends at http://www.utest.com have been helpful in setting up these experiments. Their crowd of testers has been amazingly sharp and responsive. At times, the test results from their crowd have found more issues of higher quality than internal repetitive regression test runs.

How do we get the crowd-sourced voting data? We built some infrastructure to take the raw bot data where there are differences and deliver a simple voting page for crowd testers. We’ve run several experiments with crowd testers versus the standard manual review methods. The standard manual review methods take up to three days of latency across two days with onsite vendor testers to evaluate all 150 URLs for regressions. The bots flagged only six of the 150 URLs for investigation. The remaining 144 didn’t need any further evaluation. The six flagged URLs were then sent to crowd testers.

With bot data and the difference visualization tools, the average time for a crowd-sourced tester to evaluate a site as having a bug or being “okay” was only 18 seconds. Crowd testers successfully identified all six as nonissues, matching the results of the manual and expensive form of this validation.

Great, but this measures only the static versions of web pages. What about all the interactive parts of a page such as flying menus, text boxes, and buttons? Work is underway to tackle this problem by treating it much like movies. The bots automatically interact with interesting elements on the web page, and at each step, take another scrape, or picture, of the DOM. Then, these “movies” from each run can be compared frame by frame with the same difference technologies.
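The frame-by-frame idea can be sketched with the same kind of per-page comparison applied to each step of an interaction. This is an illustration only: the text says this work was still underway, so the representation of a frame as a map from XPath to element geometry, the rule that a missing frame scores 0, and the names `frame_similarity`, `compare_movies`, and `first_divergence` are all assumptions.

```python
from itertools import zip_longest
from typing import Dict, List, Optional, Tuple

# Assumed representation: one DOM scrape, XPath -> (x, y, width, height).
Frame = Dict[str, Tuple[int, int, int, int]]


def frame_similarity(a: Frame, b: Frame) -> float:
    """Percentage of elements with identical geometry in both frames."""
    paths = set(a) | set(b)
    if not paths:
        return 100.0
    same = sum(1 for p in paths if a.get(p) == b.get(p))
    return 100.0 * same / len(paths)


def compare_movies(before: List[Frame], after: List[Frame]) -> List[float]:
    """Score each interaction step; a step missing from one run scores 0."""
    scores = []
    for a, b in zip_longest(before, after):
        if a is None or b is None:
            scores.append(0.0)  # one run recorded no frame for this step
        else:
            scores.append(frame_similarity(a, b))
    return scores


def first_divergence(
    scores: List[float], threshold: float = 100.0
) -> Optional[int]:
    """Index of the first step whose score drops below the threshold."""
    for index, score in enumerate(scores):
        if score < threshold:
            return index
    return None
```

If a menu opens identically in both runs but its first item lands 20 pixels lower in the newer build, the first frame scores 100 and the second scores 50, so `first_divergence` points an engineer at the exact interaction step where the two runs part ways.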

There are several teams at Google already replacing much of their manual regression testing efforts with Bots and freeing their testers up to do more interesting work, such as exploratory testing that just wasn’t possible before.

Like all things at Google, we are taking it slowly to make sure the data is solid. The team aims to release this service and source code publicly, including options for self-hosting for testing on other folks’ VPNs if they prefer that to opening their staging URLs to the Internet.

The basic Bots code runs on both Skytap and Amazon EC2 infrastructure. The code has been open sourced (see the Google testing blog or Appendix C). Tejas Shah has been the tech lead on Bots from very early on and was joined by Eriel Thomas, Joe Mikhail, and Richard Bustamante. Please join them in pushing this experiment forward.

In document How Google Tests Software (Page 172-178)