Cross Channel Effects of Search Engine Advertising on Brick & Mortar Retail Sales: Insights from Multiple Large Scale Field Experiments on Google.com


[Preliminary: Please do not circulate]


Kirthi Kalyanam, Santa Clara University

John McAteer, Google Inc.

Jonathan Marek, Applied Predictive Technologies

James Hodges & Lifeng Lin, Division of Biostatistics, University of Minnesota ∗

Updated October 6, 2014

Abstract

We investigate the cross channel effects of an increase in search engine advertising on Google.com on sales in brick and mortar retail stores. We report on the results of 15 independent field experiments, conducted in co-operation with a cross section of 13 well-known U.S. multi-channel retailers, in which 76 product categories received a total increase of over $4 Million in search advertising spending in test markets. The data set available to us contains estimates of incremental store sales in advertised categories and incremental sales for the total store or a larger grouping of categories including the advertised category. For each experiment, the data set also contains an estimate of return on ad spend (ROAS). In order to estimate an average effect of each outcome across experiments we use a Hierarchical Bayesian (HB) model that improves the precision of the overall estimate by borrowing strength across experiments while accounting for variation within a study and heterogeneity in treatment effects across studies. The overall estimate from the Hierarchical Bayesian model provides causal evidence that increasing search engine advertising on Google.com raised sales in brick and mortar stores in the advertised categories by 1.27% for this population of retailers. We also find that total store or top level sales increased by 1.18%. The positive effect on total store or top level sales provides evidence that the increase in category sales is incremental to the retailer net of any cannibalization effect from other categories. The overall estimates are statistically significant and the posterior distributions show very little mass below zero. The overall estimate for ROAS indicates that on the margin every incremental dollar of search advertising yielded 2.5 dollars of incremental in store sales on average. The posterior distribution and typical margin data suggest that several retailers achieved break-even. Comparison to conventional benchmarks shows that the estimated effect sizes are economically meaningful. We estimate that if this population of retailers were to extend the test to the entire network of stores the incremental sales opportunity is $1.94 Billion. We examine the robustness of our findings to alternative assumptions about the p-value intervals and correlations underlying the data.

∗ Corresponding author: [email protected]. Kirthi Kalyanam is J.C.Penney Research Professor and Director of the Retail Management Institute at Santa Clara University. John McAteer is Vice President, US Sales at Google.com. Jonathan Marek is Senior Vice President at Applied Predictive Technologies. James Hodges and Lifeng Lin are respectively an Associate Professor and a graduate student in the Division of Biostatistics at the University of Minnesota. We are grateful to the retailers and Google for participating in the field experiments and for sharing the results for the analysis, irrespective of the outcomes; to APT for implementing the field experiments; and to Google teams for data collection and project management support of our analysis. We also thank Wes Hartmann, Don Lehmann, Gary Lilien, Sridhar Narayanan, Harikesh Nair, Arvind Rangaswamy, Navdeep Sahni, participants at the 2014 Marketing Dynamics Conference, seminar participants at Penn State University, executives from Wal-Mart’s eCommerce Division, The Clorox Company, Tesco Inc, Mattel Inc, and Waitrose (UK) for their comments. All remaining errors are our own.


1 Introduction

Search engine advertising, which refers to paid listings on search engines such as Google, Bing, and Yahoo, is an important and growing part of the advertising market. Figure 1 shows an example of the results of a search for the phrase “toto toilets” on the popular search engine Google. The sponsored ads on the top and on the right side of the page are examples of search engine advertising. The advertiser in position 1 on the top of the page is nationalbuilderssupply.com, an online-only retailer. The ads on the top right of the page (top of the right rail¹) are from Home Depot, a multi-channel home improvement retailer. Home Depot sells online via homedepot.com and through a national network of brick and mortar stores. As this example shows, search engine advertising is used by both online-only retailers and multi-channel retailers.

Search advertising has been the focus of a significant stream of literature in multiple fields including marketing, economics, and information systems. Several empirical studies have examined the relationship between position and online outcomes such as click-through rates, conversion rates and sales (Agarwal et al. [2008], Ghose and Yang [2009], Narayanan and Kalyanam [2014]). Rutz and Bucklin [2011] investigate spillovers from search ads on generic terms² to branded search terms. More recently Blake et al. [2013] use a large scale field experiment to investigate whether a well-known firm like eBay receives incremental online sales from search advertising.

Search engine advertising can also have an impact on offline outcomes such as brick and mortar retail sales. The offline impact is important for many practical and substantive reasons. First and foremost, the vast majority of retail sales occur offline. For example, according to the U.S. Census³, in 2012 eCommerce sales accounted for only 5.2% of total retail sales. However, it is believed that the influence of the web on retail shoppers is disproportionate to its share of sales. Forrester Research⁴ estimates that more than 50% of U.S. offline retail sales will be influenced by the web by 2017. This suggests that shoppers search online even when they plan to buy offline. Some recent reports suggest that the trend of researching online and buying in the store has accelerated. According to data reported by ShopperTrak, foot traffic to retailers over the holiday season dropped by over 50% over the three holiday seasons spanning 2011 to 2013 (Banjo and FitzGerald [2014]). These reports note that part of the motivation for shoppers is that shopping online is an efficient way to pre-shop before visiting a brick and mortar store. One approach to marketing to these cross channel shoppers is by advertising on search engines.

¹ See http://www.sempo.org/?page=glossary&hhSearchTerms=%22right+and+rail%22

² “Sunglasses” is a generic search term, also referred to as a top-of-the-funnel search term.

³ http://www2.census.gov/retail/releases/historical/ecomm/12q4.pdf

⁴ http://www.forrester.com/Forrester+Research+Online+Retail+Forecast+2012+To+2017+US/fulltext/-/E-RES90661

Substantively, it is not obvious that search engine advertising can incrementally increase sales for brick and mortar stores or that these effects can be measured. Brick and mortar retailers spend heavily⁵ on offline media such as television and weekly free-standing inserts in newspapers (FSI or circulars). It might be the case that offline shoppers simply use a search engine such as Google to navigate to the information they previously obtained from other media. This might be a likely scenario for well-known retailers or for shoppers who were influenced by offline media or have prior experience with the retailer. These shoppers might simply be using the search engine to navigate to the retailer’s web site to pre-shop a known retailer before a brick and mortar shopping trip. Under this scenario, because the shopper had a prior propensity to shop with this retailer, the incremental effect of search advertising on offline sales would be small. An alternative viewpoint is that offline shoppers might use a search engine such as Google to search for information (Ratchford et al. [2003]) or engage in comparison shopping (Zettelmeyer et al. [2006]). Media coverage of contemporary holiday shopping behavior suggests that nowadays shoppers are less likely to do price comparisons in person, “bouncing from store to store because they’ve made their decisions ahead of time [online]” (FitzGerald [2013]). In this case, search advertising might influence shoppers as they comparison shop online and generate incremental offline sales even for well-known retailers.

As this discussion suggests, whether or not advertising on search engines impacts offline sales is an empirical question. We have very little causal evidence⁶ regarding the following first order questions: (1) Is there a causal cross channel effect of advertising on a search engine like Google.com on sales at brick and mortar retail stores for well-known retailers? (2) How large is the effect size and is the return on investment meaningful? (3) Is the effect widespread across different categories? and (4) Are the effects heterogeneous across retailers and categories? Obtaining causal evidence regarding these questions on a popular search engine such as Google is an important first step.

⁵ For example, Table 2 provides information on the media spending of each retailer in our data set. The media spending is rank ordered from highest spend to lowest spend for each retailer. As this table shows, the retailers in our experiment spent heavily on newspapers, TV, free standing inserts, radio and direct mail.

⁶ Using a randomized experiment and a matched sample of shoppers, Lewis and Reiley [2009] studied the offline impact of display advertising on Yahoo!. In a field experiment, Sahni [2011] found that temporal spacing of advertising on a restaurant search web site increased the likelihood of offline leads generated by the advertisement. However, neither of these studies focuses on the impact of search engine advertising on offline store sales.

Obtaining causal and actionable estimates in this context is challenging. Figure 2 shows that brick and mortar retail sales can vary greatly from week to week⁷. Without a proper control, correlational estimates can be confounded with other trends. While endogeneity issues in modeling the relationship between sales and advertising are well known (Berndt [1991]), search advertising might present a particularly difficult context of contemporaneous correlation because the advertising is served in response to search queries: the higher the query volume (potentially due to other demand drivers such as seasonality or TV advertising), the higher the expenditure on search advertising. Multi-channel retailers are often organized into separate online and offline marketing teams and the estimates have to be credible to both groups who often have opposing incentives.

This paper obtains causal estimates of the cross-channel impact of online search engine advertising on brick and mortar sales from 15 independent field experiments representing 76 categories. The experiments were conducted between 2008 and 2011 by 13 well-known U.S. multi-channel retailers in collaboration with Google and Applied Predictive Technologies. They represent a variety of categories including apparel, baby products, electronics, toys, cosmetics, sporting goods, furniture, pet food, and home improvement. Collectively these retailers represent $236.39B in annual sales. In each experiment, the retailer increased search advertising spending (“heavy up”) for a randomly selected set of test markets for the entire duration of the test period while holding constant the regular search spending in control markets. Our data set includes estimates of incremental sales in the advertised categories, incremental sales at the total store or a top level aggregation of categories that include the advertised categories, and p-values or more commonly p-value intervals for these estimates. Our data set also includes estimates of return on ad spend (ROAS), a measure of return on investment that is the ratio of incremental sales to incremental advertising, but does not include p-values or standard errors for these estimates.

We estimate the overall effect across field experiments for all outcomes drawing from Hierarchical Bayesian random effect models used in evidence-based medicine (Babapulle et al. [2004], Berry et al. [2003], DerSimonian and Kacker [2007], Sutton and Higgins [2008]) that allow for heterogeneity in treatment effects both within and across retailers and categories. We find causal evidence that an increase in search engine advertising at the category level incrementally increased brick and mortar retail sales in the advertised categories. The overall estimate of incremental sales increase in the advertised categories was 1.27%. The posterior density for the overall estimate of sales increase has very little mass below zero. We also find that an increase in online search advertising incrementally increased top level brick and mortar store sales⁸ for these well-known retailers. The positive sales increase for the overall store indicates that the increase in the sales of advertised categories was net of any inter-category substitution. Investing in search ads, even at a modest experimental level, increased brick and mortar top level sales by an average of 1.18% in these experiments. Since the U.S. retail industry grew at a compound annual rate of 1.55%,⁹ this sales increase is meaningful from a growth perspective. A simple back of the envelope calculation shows that search engine advertising as executed in this paper would have increased the baseline growth rate of these retailers by 77.5%. Further, for this population of retailers we estimate that the top level percent sales increase projected to their total retail network generates $1.1 billion in incremental annual sales.

Since our data set does not contain the standard error or p-value of ROAS, we develop a method to estimate the standard error of ROAS based on the estimates of incremental sales. Our method is likely to be useful in other contexts where advertising spending is not measured at each individual store, which precludes obtaining an empirical distribution of ROAS across stores. We incorporate the ROAS estimates and standard errors in the HB analysis. The posterior distribution of ROAS has a mean of 2.50, is significantly different from zero, and has very little mass below zero. The posterior distribution shows that several retailers achieved break-even. Our random effects estimates show that incremental sales vary widely across retailers and categories. This suggests that even though the cross channel treatment effects are widespread, there might be benefits to adopting a category orientation. Our findings are robust to alternative assumptions about the p-value intervals and the correlations underlying the data.

The next section provides background on search advertising. Section 3 describes the participating retailers and their motivations. Section 4 describes the design of the field experiments. Section 5 presents the Hierarchical Bayesian Model followed by results including robustness analysis. Section 6 presents conclusions, limitations and opportunities for further research. Section 7 is an appendix that presents our derivations.

⁸ Defined as the total store sales or total sales for an aggregation of advertised and related non-advertised categories.

2 Background on Search Advertising

2.1 Search Engine Advertising and Online Outcomes

Search advertising is a large and rapidly growing market. For instance, Google reported revenues of almost $14.9 billion for the quarter ending September 30, 2013, with a growth of 12% over the same period in the previous year. The revenues from Google’s sites, primarily the search engine, accounted for 68% of these revenues.¹⁰ According to the Interactive Advertising Bureau, $18.4 billion was spent in the United States alone on search advertising in 2013. Search advertising is the largest component of the online advertising market, with 43% of all online advertising revenues in 2013.¹¹ Although it is a relatively new medium for advertising, search is the third-largest medium after TV and print, and surpassed radio in 2012.¹²

Several features of search advertising have made it a very popular online advertising format. Search ads are triggered by specific keywords (search phrases). Presumably shoppers research when they are ready to make a purchase and search advertising seems temporally close to purchase. Search advertising also has important targeting capabilities. For example consider an advertiser who is selling health insurance for families. Some of the search phrases related to health insurance could include “health insurance”, “family health insurance”, “discount health insurance” and “California health insurance”. The advertiser can specify that an ad will be shown only for the phrase “family health insurance”. Further, these ads can be geography-specific, with potentially different ads being served in different locations.

Prior research has investigated the impact of search advertising on online sales. Ghose and Yang [2009] and Agarwal et al. [2008] examined the impact of different search ads on various types of online outcomes such as click-thru rates, conversions, and sales. They showed that the top positions are not necessarily the most profitable positions. Yang and Ghose [2010] examined whether search advertising and organic links are complements or substitutes. Rutz and Bucklin [2011] investigate whether advertising by a brand on category level keywords creates spillovers by generating incremental subsequent searches for the brand name. Narayanan and Kalyanam [2014] obtain causal non-parametric estimates of position effects using a Regression Discontinuity approach. They showed that selection effects vary by position, that position effects are stronger on the weekdays and vary by advertiser, keyword quality, type of keyword, and match type.

¹⁰ These data were obtained from Google’s earnings report for Q3 2013, available at https://investor.google.com/earnings/2013/Q3_google_earnings.html (last accessed on October 31, 2013).

¹¹ http://www.iab.net/media/file/IAB_Internet_Advertising_Revenue_Report_FY_2013.pdf

¹² http://www.iab.net/media/file/IABInternetAdvertisingRevenueReportFY2012POSTED.pdf (last accessed on October 31, 2013).

Sahni [2011] examined the impact of repetition of ads on a restaurant search engine by randomizing ads at the individual level. Individuals in the control group saw a dummy ad, whereas individuals in the treatment got different repetitions of the ad. He obtained causal estimates of advertising and repetition effects and proposed a structural model for the repetition effects. A follow up paper, Sahni [2013] examined whether advertising effects spill over to non-advertised restaurants. As this discussion shows, prior research has not focused on the cross-channel impact of search engine advertising on sales at brick and mortar stores.

2.2 Search Engine Advertising and Offline Outcomes

Another stream of research that provides some motivation for our study is research on cross-channel shopping. As mentioned in the introduction, Forrester Research estimates that over 50% of U.S. retail sales are web influenced. Web influence is a broad construct that provides retailers with some directional evidence of the importance of cross-channel shopping. However, it does not provide specific causal evidence about the effectiveness of cross-channel advertising programs. Some recent reports suggest that the trend of researching online and buying in the store has accelerated. According to data reported by ShopperTrak, foot traffic to retailers over the holiday season dropped by over 50% over the three holiday seasons spanning 2011 to 2013 (Banjo and FitzGerald [2014]). These reports note that part of the motivation for shoppers is that shopping online is an efficient way to pre-shop before visiting a brick and mortar store. For example, prior to the Internet, automotive shoppers tended to visit 4 dealers whereas after the Internet they tend to visit 1 dealer (Zettelmeyer et al. [2006]). To the extent that these online shopping trips start at a search engine, it is plausible that search engine advertising can influence shoppers to consider a retailer. However, none of these studies provided causal estimates of the impact of search advertising on offline retail sales.

3 Participating Retailers

3.1 Description

Table 1 provides an overview of the experiments and Table 2 an overview of the multi-channel retailers who participated in the field experiments and provided data for this study. There are 13 participating retailers. Two of the retailers conducted two experiments each, giving a total of 15 field experiments. Although the retailer names are disguised, these retailers are very well known and are considered “household names” in the U.S. retail market, a virtual “who’s who” of retailing. The retailers participated in this study subject to the condition that retailer identities be kept confidential. Hence we do not provide descriptives such as annual sales, type of retail format (e.g., department store versus speciality store), focal categories or number of stores.

3.2 Motivations

Table 2 summarizes the information we obtained on the motivations of the participating retailers. A general theme that emerged was that the retailers had indirect measures of store impact but needed direct measures to inform spending decisions. For example, some retailers conducted exit interviews of visitors on their web site and found that several customers indicated that they would visit a brick and mortar store after the web site. The retailer could try to trace back the source of this web traffic and potentially link the offline sale, but these types of links do not produce causal estimates since there is no counterfactual or control to measure incrementality. Some retailers were concerned with declining newspaper circulation and the potential decline in the efficacy of spending on newspaper inserts and were interested in new forms of advertising that could potentially provide an alternative. Some retailers were interested in shaping new programs that garnered co-op advertising dollars for search engine advertising with manufacturers. They needed empirical proof that search engine advertising can indeed drive sales at brick and mortar stores for specific products and categories. Some retailers who spent heavily on TV and newspapers simply wondered if any of the traffic driven by search advertising could be incremental. Promotional retailers who executed several marketing instruments such as coupons or credit card offers were interested in isolating the impact of search advertising to determine the incremental effects. Many of these retailers were accustomed to using field experiments to obtain causal estimates to inform significant capital allocation decisions.

4 Design and Execution of the Field Experiments

Field experiments were designed to measure the in store impact of an increase in search advertising spending. Each experiment was designed and executed independently of the other experiments. Each experiment is a “heavy up” experiment in which search advertising was increased for test stores, and test-store sales were compared to sales in matched control stores. The experiments were conducted as a collaborative effort between the multi-channel retailer and Google, using the Applied Predictive Technologies (APT) Test and Learn platform. APT’s platform specializes in conducting field experiments for retailers, and its clients include most of the leading US retailers¹³. The platform is commonly used by C level decision makers in retail organizations to evaluate the return on investment for capital allocation decisions around marketing, merchandising, store operations and store improvements and has credibility among retail decision makers. In implementing the tests, the APT platform used built-in capabilities to detect outliers, extreme events, and overlapping events due to multiple tests, along with anomaly detection, to improve the robustness of the tests.

Figure 3 provides an overview of the design and execution process followed in each experiment. The retailer decision maker in consultation with Google selected the product categories that were the focus of the experiment. Based on Google’s geo-targeting capabilities, media markets (and the stores therein) were randomly assigned to the test condition and other markets were assigned to the control condition. The APT platform matched each test store to 10 stores in the control condition based on 2+ years history of store sales data and geo-demographic correlates of each store. The matching was verified so that the test and control store indexes trended similarly in the pre-test period. Figure 4 provides an illustration of comparing the indexes of test and control stores. The figure shows that the indexes for the groups of test and control stores track each other in the pre-test period, while the sales diverge in the treatment period.

¹³ More details about APT and its clients can be obtained from the firm’s web site at www.predictivetechnologies.com

For each experiment, based on variability of sales in the selected test stores, APT recommended the number of test stores, the minimum number of weeks, and the minimum percentage sales lift¹⁴ that the experiment could be expected to detect. Based on this minimum detectable sales lift, a minimum increase in search advertising spending was calculated. Google account teams translated this minimum increase in search advertising as increased bids on top-of-the-funnel keywords¹⁵ in the test markets. The bids were set so that the average position on the page was maintained between 1 and 3. The ads were spaced out so that the test budget was allocated evenly over the entire duration of each experiment. In the final step APT estimated and reported incremental sales at various levels of aggregation using the format shown in Figure 5. The data set for this study was assembled from these reports. Our analysis is restricted by what was reported.

Table 1 provides an overview of the field experiments. There were a total of 15 field experiments across 13 retailers. Two retailers conducted two field experiments each. The number of test markets and test stores varied across experiments. For example, experiment F involved 405 test stores and has the largest number of test stores in our analysis. Field experiment B on the other hand had 25 test stores, the smallest number of all the experiments. On average each experiment had a duration of about 4 weeks and involved 4 categories. The incremental search spending varied across retailers. For example, Retailer A spent $139,000 whereas Retailer B spent $685,000. In total the participating retailers spent $4,153,393 in increased search advertising. The retailers and Google incurred additional manpower costs to execute the experiments.

To provide perspective on the scale and scope of these field experiments, the last row in Table 1 reports totals for key metrics. The total number of test stores across field experiments was 2650. When estimating the overall average across experiments this larger sample size should be useful in improving the precision of the overall estimate. There were a total of 76 categories across experiments. This large number of categories should provide insight into whether the cross channel effects are widespread or limited to a narrower set of categories. It would also be useful in understanding heterogeneity in treatment effects across categories and retailers.

¹⁴ Defined in Section 5.1.1.

¹⁵ For example “Sunglasses” is a top-of-the-funnel keyword, as opposed to “Rayban Sunglasses” or “Rayban Aviator Sunglasses at Sunglass Hut”. Rutz and Bucklin [2011] refer to these top of the funnel keywords as generic keywords.

5 Hierarchical Bayesian Analysis: Methods and Results

Each retailer’s experiment targeted specific categories, so the retailers are naturally interested in results for the targeted categories. But they are also interested in whether their ad campaign produced positive or negative spillover effects outside the category, i.e., in effects at a more aggregated level than category, either the whole store or a larger grouping of categories that includes the target category. We present results for both the higher level (“top level”) and for categories. We are interested in obtaining an overall average estimate across experiments that takes into account within study variation and heterogeneity in treatment effects across studies and categories. This overall average estimate and its 95% confidence interval answer the question of whether there is causal evidence of a cross channel effect of search advertising for this population of retailers. This section describes Hierarchical Bayesian (HB) models to obtain the overall estimates.

5.1 Methods

5.1.1 Sales lift

For a given experiment, APT used the random sampling of test stores to estimate the sales lift for the whole retail chain for the top level and for categories. We present a formula for the top-level computation, but the same formula was used for category-specific lifts. For test store $i$, the lift was estimated as

$$L_i = \frac{AS_i - ES_i}{ES_i} \tag{1}$$

where $L_i$ is test store $i$’s estimated lift, $AS_i$ is its actual sales, and $ES_i$ is its expected sales based on its matched control stores.¹⁶ The combined estimate over all $n$ test stores, estimating sales lift for the whole retail chain, weighted test stores by expected sales:

$$L = \sum_i \frac{ES_i}{\sum_j ES_j} L_i = \frac{1}{n\,\overline{ES}} \sum_i (AS_i - ES_i), \tag{2}$$

where $\overline{ES}$ is the mean expected sales across the $n$ test stores, so that $n\,\overline{ES} = \sum_j ES_j$. The method of sampling test stores, matching control stores to them, and weighting test stores in the combined estimate implies a particular standard error for the experiment’s estimated sales lift, which accounts for the fact that some control stores were matched to more than one test store. When the experiments were conducted, the estimate and standard error were used to compute P-values to test whether the top-level or category sales lift differed from zero.

¹⁶ $ES_i = TPre_i \cdot \frac{CPost_i}{CPre_i}$, where $TPre_i$ is sales of test store $i$ in the pre-treatment period, and $CPost_i$ and $CPre_i$ are the sales of the matched control store in the post-treatment and pre-treatment periods respectively.
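The sales lift computation described in this subsection can be illustrated with a short sketch. This is only an illustration under stated assumptions, not APT's implementation: the `TestStore` class, its field names, and the sales figures are hypothetical, and each test store's matched controls are collapsed into a single pre-period and post-period control total, following the expected-sales definition given in the footnote.

```python
from dataclasses import dataclass


@dataclass
class TestStore:
    """Hypothetical container for one test store (all names are illustrative)."""
    t_pre: float   # test-store sales in the pre-treatment period (TPre_i)
    actual: float  # test-store sales in the treatment period (AS_i)
    c_pre: float   # matched-control sales, pre-treatment period (CPre_i)
    c_post: float  # matched-control sales, treatment period (CPost_i)


def expected_sales(store: TestStore) -> float:
    """ES_i: pre-period test sales scaled by the control stores' growth."""
    return store.t_pre * store.c_post / store.c_pre


def store_lift(store: TestStore) -> float:
    """Equation (1): L_i = (AS_i - ES_i) / ES_i."""
    es = expected_sales(store)
    return (store.actual - es) / es


def combined_lift(stores: list[TestStore]) -> float:
    """Equation (2): store lifts weighted by expected sales."""
    es = [expected_sales(s) for s in stores]
    total_es = sum(es)
    weighted = sum(e / total_es * store_lift(s) for e, s in zip(es, stores))
    # Equivalent form: total incremental sales over total expected sales.
    direct = sum(s.actual - e for e, s in zip(es, stores)) / total_es
    assert abs(weighted - direct) < 1e-12
    return direct


# Made-up numbers: both stores show a 2% lift over expected sales.
stores = [
    TestStore(t_pre=100.0, actual=112.2, c_pre=200.0, c_post=220.0),
    TestStore(t_pre=50.0, actual=51.0, c_pre=80.0, c_post=80.0),
]
lift = combined_lift(stores)
```

The internal assertion checks that the two forms of equation (2) agree, which is why the combined lift can be read as total incremental sales divided by total expected sales. The sketch does not attempt the standard error, which depends on how control stores are shared across test stores.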

The dataset available for this study did not include the standard error for the top-level or category tests. Rather, it included either a one-sided P-value for the t-statistic estimate/(standard error), or a range within which the one-sided P-value lay (e.g., 0 to 0.05, or 0.3 to 0.5; 0.5 is the largest possible P-value in a one-sided test). When the exact P-value was available, we could infer the standard error from the estimate and P-value (subject to round-off error). When only a range of P-values was available, our baseline Bayesian analysis treated the P-value as being equally likely to take any value in the range, i.e., we put a uniform prior on the reported range. Further details are given in Section 5.1.4 below.
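To make the back-calculation concrete, the sketch below recovers a standard error from an estimate and a one-sided P-value, and draws standard errors implied by a uniform distribution over a reported P-value range. It rests on assumptions that are ours rather than the paper's: the reference distribution is taken to be standard normal instead of a t distribution (the degrees of freedom are not reported here), the function names are hypothetical, and the simple Monte Carlo draw stands in for the prior used inside the full Bayesian model.

```python
import random
from statistics import NormalDist


def se_from_one_sided_p(estimate: float, p: float) -> float:
    """Invert p = 1 - Phi(|estimate| / se) for se (normal approximation)."""
    if not 0.0 < p < 0.5:
        raise ValueError("one-sided P-value must lie strictly between 0 and 0.5")
    z = NormalDist().inv_cdf(1.0 - p)  # test statistic implied by p
    return abs(estimate) / z


def sample_se_from_p_range(estimate: float, p_lo: float, p_hi: float,
                           n_draws: int = 1000, seed: int = 0) -> list[float]:
    """Draw P uniformly on [p_lo, p_hi] and map each draw to a standard error."""
    rng = random.Random(seed)
    draws = []
    for _ in range(n_draws):
        p = rng.uniform(p_lo, p_hi)
        # Keep p strictly inside (0, 0.5), where the inversion is defined.
        p = min(max(p, 1e-12), 0.5 - 1e-12)
        draws.append(se_from_one_sided_p(estimate, p))
    return draws


# Illustrative use: a 1.27% lift reported with a P-value "between 0 and 0.05".
se_draws = sample_se_from_p_range(1.27, 0.0, 0.05, n_draws=200)
```

Larger P-values map to larger standard errors, so a wide reported range such as 0.3 to 0.5 translates into a diffuse distribution over the standard error. That is one way to see why the category-level analysis carries less information than the original experiments did.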

For category-level results, few exact values were reported and the reported ranges of P-values were often wide. Thus, our category-level analysis is conservative in that it treats each experiment as if it produced less information than, in fact, it did.

5.1.2 Incremental Return on Ad Spend (ROAS)

The next outcome variable of interest to retailers (including decision makers in finance) is the return on investment due to the increased advertising. Of interest here are the incremental spending and sales over and above baseline spending and baseline sales. For the present series of experiments, this was available only at the top level. A very common return on investment measure, used by retailers in these experiments, was incremental sales due to incremental ad spending, defined as follows:

\[
\text{ROAS} = \frac{1}{ADS} \sum_i (AS_i - ES_i), \tag{3}
\]

where $ADS$ is the incremental ad spending for the whole experiment, which could not be disaggregated to individual test stores or categories. (Note that this measure is implicitly weighted by the sales of individual test stores, analogous to the weighting in the combined estimate of sales lift.)
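As a small numerical illustration of Equation (3), the sketch below computes ROAS from store-level sales. It assumes that AS_i and ES_i denote the actual and expected sales of test store i; the function name and all figures are hypothetical.

```python
def incremental_roas(actual_sales, expected_sales, incremental_ad_spend):
    """ROAS = sum_i (AS_i - ES_i) / ADS, with ADS at the experiment level."""
    if len(actual_sales) != len(expected_sales):
        raise ValueError("need one expected-sales value per test store")
    if incremental_ad_spend <= 0:
        raise ValueError("incremental ad spend must be positive")
    incremental_sales = sum(a - e for a, e in zip(actual_sales, expected_sales))
    return incremental_sales / incremental_ad_spend


# Hypothetical experiment with three test stores.
actual = [105_000.0, 98_500.0, 120_250.0]    # AS_i
expected = [103_000.0, 98_000.0, 118_000.0]  # ES_i
ad_spend = 1_900.0                           # ADS for the whole experiment
roas = incremental_roas(actual, expected, ad_spend)
```

Because the numerator is a plain sum of dollar differences, stores with larger sales contribute more to the result, which is the implicit weighting noted above.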

The data available for this study included estimated ROAS but did not report a P-value or a standard error for ROAS. We developed a method (technical details in Section 7.2 of the appendix) to estimate standard errors for experiment-specific ROAS estimates from the standard errors for sales lift.

5.1.3 Analysis of heterogeneous treatment effects using hierarchical Bayesian methods

In psychology and health care, combining randomized experiments to estimate an overall treatment effect for a population of interest is referred to as meta-analysis. Early in the development of meta-analytic methods it became clear that differences between experiments were often too large to be explained by chance or sampling variation within experiments, and that a simple combination of such heterogeneous experiments might be ill-advised. Meta-analysis has come to be understood by many as a way of doing regression analyses in which the units of analysis are experiments, reduced to summaries instead of individual measurements (Greenland [1994]), and allowing for heterogeneous treatment effects across experiments.

This approach uses the inherently hierarchical structure of a meta-analysis: the primary unit of analysis is an experiment, but each experiment’s results are aggregates of its subjects’ individual results (in our case, the results from individual test stores). This hierarchy is captured using hierarchical models with “error terms” at two levels, one describing variation between experiments and a second level describing variation between subjects within experiments (in our case, the standard errors associated with an experiment’s lift or ROAS estimate). The earliest popular meta-analysis method using a hierarchical model was the DerSimonian-Laird (1986) method, which assumed the true treatment effect varied between experiments and modeled those effects as draws from a normal distribution with mean $\mu$ and some variance, the object being to estimate $\mu$ and attach a suitable standard error. An obvious elaboration of this model is to add explanatory variables describing differences between experiments, thus reducing the unexplained between-experiment variation. In the analyses below, we have used as predictors impression share, click-through rate, and (for categories only) category strength, a measure of the category’s importance to the retailer.
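For readers unfamiliar with the DerSimonian-Laird approach mentioned above, the following sketch implements its standard method-of-moments form. It is included only to illustrate the random-effects idea; it is not the hierarchical Bayesian model used in this paper, and the function name and example inputs are hypothetical.

```python
import math


def dersimonian_laird(estimates, std_errors):
    """Random-effects pooled mean, its standard error, and tau^2."""
    k = len(estimates)
    if k < 2 or k != len(std_errors):
        raise ValueError("need at least two estimates with matching standard errors")

    # Fixed-effect (inverse-variance) weights and pooled mean.
    w = [1.0 / s ** 2 for s in std_errors]
    sum_w = sum(w)
    fixed_mean = sum(wi * yi for wi, yi in zip(w, estimates)) / sum_w

    # Cochran's Q and the method-of-moments between-experiment variance.
    q = sum(wi * (yi - fixed_mean) ** 2 for wi, yi in zip(w, estimates))
    c = sum_w - sum(wi ** 2 for wi in w) / sum_w
    tau2 = max(0.0, (q - (k - 1)) / c)

    # Random-effects weights add tau^2 to each within-experiment variance.
    w_star = [1.0 / (s ** 2 + tau2) for s in std_errors]
    mu = sum(wi * yi for wi, yi in zip(w_star, estimates)) / sum(w_star)
    se_mu = math.sqrt(1.0 / sum(w_star))
    return mu, se_mu, tau2


# Two hypothetical experiments that disagree by more than sampling error allows.
mu, se_mu, tau2 = dersimonian_laird([0.0, 4.0], [1.0, 1.0])
```

When the experiments disagree strongly, the estimated between-experiment variance is large and the standard error of the pooled mean widens accordingly.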

Meta-analyses using hierarchical models lend themselves naturally to Bayesian methods. In particular, Bayesian methods provide a simple way to handle some complications of the data from these experiments. For example, two retailers are represented in our dataset by two experiments, and presumably a retailer’s two experiments will tend to be more similar to each other than they are to other retailers’ experiments. In a Bayesian analysis, this presumed correlation is captured by treating retailer as another layer of the hierarchy and adding a random effect for that layer. Also, as noted, in inferring an experiment’s standard error from its estimate and P-value, when P-values were reported as ranges we could put a prior distribution on the reported range of P-values.

5.1.4 Models and computing

The models for top-level sales lift and ROAS are very similar. In this section we present the model for sales lift and then state the changes needed for ROAS. All models are presented with experiment-level predictors (covariates) included. In the current specifications, none of these predictors was significant (i.e., their 95% credible intervals included zero), so the tables, the forest plots, and combined estimates shown in Section 5.2 are from fits of these models that exclude the predictors.

Recall that two retailers had two experiments each. In the model description below, retailers are indexed by $i$ and experiments within retailers by $j$. The combined lift estimate from experiment $(i, j)$ is called $\text{Lift}_{ij}$. The model includes an error term capturing sampling and other variation within each experiment, with variance $\sigma^2_{ij}$. It also includes two random effects, one each for retailers and experiments within retailers, with variances $\lambda^2$ and $\tau^2$ respectively. The model includes two experiment-level predictors (covariates), “Impression share” ($x_{1,ij}$) and “Click-through rate” ($x_{2,ij}$). The model is as follows (with sales lift expressed as a fraction, i.e., a 1% sales lift is 0.01):

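\[
\begin{aligned}
&\text{Mean structure:} & \text{Lift}_{ij} \mid \theta_{ij}, \sigma^2_{ij} &\sim N(\theta_{ij}, \sigma^2_{ij}) \\
& & \theta_{ij} &= \eta + \alpha_i + \beta_{ij} + \xi_1 x_{1,ij} + \xi_2 x_{2,ij} \\
&\text{Random effects:} & \alpha_i \mid \lambda &\sim N(0, \lambda^2) && \text{Retailer} \\
& & \beta_{ij} \mid \tau &\sim N(0, \tau^2) && \text{Experiment within retailer} \\
&\text{Prior distributions:} & \lambda &\sim U(0, 10) \\
& & \tau &\sim U(0, 10) \\
& & \eta &\sim N(0, 1000) \\
& & \xi_1 &\sim N(0, 1000) \\
& & \xi_2 &\sim N(0, 1000) \\
&\text{For inferring standard errors:} & \sigma_{ij} &= |\text{Lift}_{ij}| \,/\, t_{n_{ij}-1}(1 - p_{ij}) \\
&\text{when P-value} \in \text{interval:} & p_{ij} &\sim U(p_{\text{lower},ij},\, p_{\text{upper},ij})
\end{aligned}
\tag{4}
\]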

In Equation (4), $\text{Lift}_{ij}$ is observed and thus treated as known in a Bayesian analysis, so the inferred standard error $\sigma_{ij}$ is known if the P-value is exact. Also, $t_{n_{ij}-1}(1 - p_{ij})$ is the $100 \times (1 - p_{ij})$ percentile of the $t$ distribution on $n_{ij} - 1$ degrees of freedom, where $p_{ij} \in [0, 1]$ and smaller $p_{ij}$ indicate “more significant”. (Section 7.1 discusses how we selected $n_{ij}$.)
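To make the two-level random-effects structure of Equation (4) concrete, the sketch below simulates lift estimates forward from the model for fixed, hypothetical values of $\eta$, $\lambda$, $\tau$ and a common within-experiment standard error. It omits the covariates, uses a single $\sigma$ for all experiments as a simplifying assumption, and is not the estimation code used in the paper.

```python
import random


def simulate_lifts(experiments_per_retailer, eta, lam, tau, sigma, seed=0):
    """Forward-simulate (retailer, experiment, true lift, observed lift) rows.

    experiments_per_retailer maps a retailer label to its number of experiments.
    """
    rng = random.Random(seed)
    rows = []
    for retailer, n_experiments in experiments_per_retailer.items():
        alpha = rng.gauss(0.0, lam)             # retailer effect, alpha_i
        for j in range(n_experiments):
            beta = rng.gauss(0.0, tau)          # experiment-within-retailer effect, beta_ij
            theta = eta + alpha + beta          # true lift for experiment (i, j)
            observed = rng.gauss(theta, sigma)  # reported lift with sampling error
            rows.append((retailer, j, theta, observed))
    return rows


# Design mirroring the study: 13 retailers, two of which ran two experiments.
design = {f"retailer_{r}": 1 for r in range(11)}
design.update({"retailer_11": 2, "retailer_12": 2})

# Hypothetical parameter values, with lift expressed as a fraction.
rows = simulate_lifts(design, eta=0.012, lam=0.003, tau=0.003, sigma=0.005)
```

Two experiments from the same retailer share the same $\alpha_i$, which is how the model induces the presumed similarity between them.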

The model for top-level ROAS is identical except for one change: replace the line in Equation (4) for inferring the standard error with (see Section 7.2 for the derivation)

\[
\sigma_{ij} = \frac{n_{ij}\, ES_{ij}}{ADS_{ij}} \times \frac{|\text{Lift}_{ij}|}{t_{n_{ij}-1}(1 - p_{ij})}. \tag{5}
\]

The model for category-specific sales lift builds on the preceding model by adding more structure. It includes a third covariate (predictor), “Category strength” ($x_{3,ijkl}$). Also, the categories have a hierarchical structure, with each of the 76 categories being placed in one of 18 “broad categories”. Broad categories are indexed by $k$; categories within broad categories are indexed by $l$. The model has an additional random effect for broad categories, with variance $\delta^2$, which induces similarity between the categories included in a broad category. The combined lift estimate for category $(k, l)$ in experiment $(i, j)$ is called $\text{Lift}_{ijkl}$. The rest of the model is

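\[
\begin{aligned}
&\text{Mean structure:} & \text{Lift}_{ijkl} \mid \theta_{ijkl}, \sigma^2_{ijkl} &\sim N(\theta_{ijkl}, \sigma^2_{ijkl}) \\
& & \theta_{ijkl} &= \eta + \alpha_i + \beta_{ij} + \gamma_k + \xi_1 x_{1,ijkl} + \xi_2 x_{2,ijkl} + \xi_3 x_{3,ijkl} + \varepsilon_{ijkl} \\
&\text{Random effects:} & \alpha_i \mid \lambda^2 &\sim N(0, \lambda^2) && \text{Retailer} \\
& & \beta_{ij} \mid \tau^2 &\sim N(0, \tau^2) && \text{Experiment within retailer} \\
& & \gamma_k \mid \delta^2 &\sim N(0, \delta^2) && \text{Broad category} \\
& & \varepsilon_{ijkl} \mid \rho^2 &\sim N(0, \rho^2) && \text{Category w/in broad category} \\
&\text{Prior distributions:} & \eta &\sim N(0, 1000), \quad \xi_1 \sim N(0, 1000), \quad \xi_2 \sim N(0, 1000), \quad \xi_3 \sim N(0, 1000) \\
& & \lambda &\sim U(01, 10), \quad \tau \sim U(0, 10), \quad \delta \sim U(01, 10), \quad \rho \sim U(0, 10) \\
&\text{For inferring standard errors:} & \sigma_{ijkl} &= |\text{Lift}_{ijkl}| \,/\, t_{n_{ijkl}-1}(1 - p_{ijkl}) \\
&\text{when P-value} \in \text{interval:} & p_{ijkl} &\sim U(p_{\text{lower},ijkl},\, p_{\text{upper},ijkl})
\end{aligned}
\tag{6}
\]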

All analyses were performed by Markov chain Monte Carlo (MCMC) implemented in the R system (version 3.0.2) using the rjags package (JAGS version 3.4.0, rjags version 3-11). Each analysis used three chains of length 200,000 iterations each, with starting values chosen by the rjags package (using specified seeds), with the first 100,000 draws discarded as burn-in and retained draws thinned by taking every second draw. Point estimates reported below are posterior medians; 95% posterior credible intervals are the 2.5th and 97.5th percentiles of the posterior distribution.
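The post-processing described above (discard burn-in, thin, then summarize by median and percentiles) can be sketched as follows. The actual analysis was run in R with rjags; this Python version is only an illustration, the helper names are hypothetical, and the percentile rule (linear interpolation between order statistics) is an assumption.

```python
import statistics


def retained_draws(chain, burn_in, thin):
    """Drop the first burn_in draws, then keep every thin-th draw."""
    return chain[burn_in::thin]


def percentile(sorted_values, pct):
    """Percentile by linear interpolation between order statistics."""
    position = pct / 100.0 * (len(sorted_values) - 1)
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = position - lower
    return sorted_values[lower] + fraction * (sorted_values[upper] - sorted_values[lower])


def summarize(chains, burn_in, thin):
    """Pool retained draws across chains; return median and 95% interval."""
    pooled = sorted(
        draw for chain in chains for draw in retained_draws(chain, burn_in, thin)
    )
    return statistics.median(pooled), percentile(pooled, 2.5), percentile(pooled, 97.5)


# With the settings reported above, each chain contributes 50,000 retained draws.
n_retained_total = 3 * len(retained_draws(range(200_000), 100_000, 2))

# Toy chains (not real MCMC output) to show the mechanics.
toy_chains = [list(range(200)) for _ in range(3)]
median, lower, upper = summarize(toy_chains, burn_in=100, thin=2)
```

Under these settings, 150,000 draws in total feed each reported median and credible interval.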

5.2 Results

5.2.1 Top-level sales lift

Results for sales lifts are presented as percents (in Section 5.1.4, the model and priors used lifts expressed as fractions). The first two columns in Table 3 present the reported sales lift and p-values/p-value intervals for the top-level sales, which is the data available to us from each experiment. The rest of the columns present results from the Bayesian analysis. Figure 6 is a forest plot for top-level sales lift, showing the original non-Bayesian estimates and intervals (gray boxes and lines) and estimates and intervals from the Bayesian analysis (blue boxes and lines). The overall estimate and interval are shown at the bottom as a blue diamond.

Figure 6 and Table 3 show the reported sales lift estimates are all positive and range from 0.01% to 8.80% with a simple average of 1.87%. The 95% intervals for experiments C, D, F, G, H, I, J, K, L, M do not contain zero. So we have multiple experiments that provide repeated causal evidence for a positive cross channel effect. These results for individual experiments describe the effects of these ad campaigns for individual retailers, but if they are combined in an overall estimate, with a proper accounting of variation within and between studies, they can tell us about the causal effect of search advertising in this population of retailers, and borrow strength across experiments. The estimated overall average sales lift from the Bayesian analysis, which accounts for within and between study variation,^{17} is 1.18%. The 95% credible interval of this overall estimate is 0.63% to 1.82% and does not contain zero. The posterior density of the overall estimate of sales lift shown in the left panel of Figure 8 shows very little mass below zero. The average sales lift from the population of experiments shows that there is a causal cross channel impact of search engine advertising on top level sales in brick and mortar stores. Since the U.S. retail industry grew at a compound annual growth rate (CAGR) of 1.55%^{18} during the time period in which the experiments were run, the magnitude of these sales increases is meaningful. A simple back of the envelope calculation shows that with search engine advertising campaigns executed as reported in this paper, our population of retailers would have had a CAGR of 2.75%, a 77.5% increase over the base growth rate.^{19} Further, for this population of retailers we estimate that the top level percent sales increase projected to their total retail network generates $1.1 billion in incremental annual sales.^{20}

In all of our models, the random effects can be interpreted as the difference between the average measure (sales lift or ROAS) in one specific class and the average measure in the entire experiment. In Table 3, the between-retailer random effects are the $\alpha_i$ in model (4), and the within-retailer random effects are the $\beta_{ij}$, and similarly for ROAS in Table 4. As for Table 5, the between-retailer random effects are the $\alpha_i$ in model (6), the within-retailer random effects are the $\beta_{ij}$, the between-category random effects are the $\gamma_k$, and the within-category random effects are the $\varepsilon_{ijkl}$.

The random effects reflect the distance of estimated sales lift or ROAS away from the overall estimate in our Bayesian analysis. Referring to the forest plot in Figure 6, we can take the top-level estimates in Table 3 as an example. Both between- and within-retailer random effects for experiment G are estimated as nearly zero, and this means that the difference between average sales lift in experiment G and the overall sales lift is almost zero. This inference can be confirmed by the estimate in the sales lift column in Table 3: the average sales lift in experiment G is estimated as 1.17, and the overall sales lift over all of the experiments is 1.18. As for the experiments A to F, almost all of the random effects are estimated as negative values, and this leads to the sales lift estimates being smaller than the overall sales lift; for the experiments H to N, the opposite holds. The estimated standard deviation describing variation between retailers is 0.32 percentage points (95% CI 0.02 to 1.36) and this suggests considerable heterogeneity across retailers.

Experiments C and F are from the same retailer and as expected the estimated between retailer random effect for both experiments is identical with a value of -0.15. The 95% CI ranges from -0.95 to 0.45 and this wide interval suggests considerable heterogeneity between the two experiments of the same retailer. The estimate for the within retailer random effect for experiment C is -0.16 with a 95% CI of -0.97 to 0.39. For experiment F the within retailer random effect is 0.01 with a 95% CI of -0.70 to 0.69. A considerable portion of the 95% CIs for these estimates does not overlap. The wide interval for the within retailer random effect estimate suggests considerable heterogeneity within the experiment, presumably due to differences across categories. Category differences are also a plausible reason for the differences across experiments within the same retailer. We offer this interpretation cautiously since the experiments were not conducted in the same time period. Experiments H and M are also from the same retailer and we get similar insights based on the between and within retailer estimates and 95% CIs.

^{17} The estimate and 95% CI are presented in the last row in Figure 6 labeled “Overall”.

^{18} Calculated based on data reported by the National Retail Federation.

^{19} The baseline CAGR is 1.55%. If sales in the base year were 100, then one year later sales would be 101.55. Applying the incremental lift of 1.18% to 101.55 we get 101.55 × 0.0118 = 1.20. So adding the incremental growth, sales one year later would be 101.55 + 1.20 = 102.75. So the new growth rate post search advertising is 2.75%. Compared to the pre-internet advertising growth rate of 1.55% the increase in growth was 77.5%. This back of the envelope analysis is quite simple in that it only estimates growth based on a single year.

^{20} This projection is obtained by taking each retailer's annual top level sales for the entire store network, multiplying by the Bayesian estimate of sales lift % and then summing over all retailers.
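The back-of-the-envelope growth calculation in footnote 19 can be reproduced with a few lines of arithmetic. The sketch below is only a check of that calculation; the function name is hypothetical and, like the footnote, it considers a single year.

```python
def growth_with_lift(base_growth_pct, lift_pct, base_sales=100.0):
    """Apply baseline growth, then the incremental sales lift, to base-year sales."""
    sales_without_ads = base_sales * (1.0 + base_growth_pct / 100.0)
    sales_with_ads = sales_without_ads * (1.0 + lift_pct / 100.0)
    new_growth_pct = (sales_with_ads / base_sales - 1.0) * 100.0
    relative_increase_pct = (new_growth_pct - base_growth_pct) / base_growth_pct * 100.0
    return sales_with_ads, new_growth_pct, relative_increase_pct


# Baseline CAGR of 1.55% and overall estimated sales lift of 1.18%.
sales, new_growth, relative_increase = growth_with_lift(1.55, 1.18)
```

With these inputs the growth rate rounds to 2.75%, and the relative increase over the 1.55% baseline comes out at roughly 77%, in line with the reported figure up to rounding.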

For most of the other experiments the estimates of the two random effects are quite similar. This arises because only two retailers have more than one experiment, and they have just two each, so that the Bayesian machinery will have difficulty allocating variation between these two sources. (Non-Bayesian machinery would have at least as much difficulty.) The variances of the two random effects were exchangeable a priori, so they are similar a posteriori and the machinery splits an experiment’s deviation from the overall average into two roughly equal pieces, attributed to the two random effects. However, the weak identification of the two random-effect variances has a negligible effect on the posterior distribution of the overall sales lift, because the sum of the two random effects, and thus the sum of the two variances, is well identified.

For the covariate Impression share, the estimated average increase in lift is 2.5 percentage points per 1-unit increase in impression share (95% CI -2.0 to 6.9). The sign is positive but the 95% CI contains zero. Higher impression share implies that the ads were served in a higher number of searches and hence had a greater “reach” among online shoppers. A positive relationship between reach and sales lift seems plausible. For the covariate Click-through rate, the estimated average increase in lift is 1.71 percentage points per increase of 0.1 in click-through rate (95% CI -4.66 to 7.87). The sign for this covariate is positive but the 95% interval contains zero. A higher click-through rate means more clicks and more traffic to the retailer web site. Some of these shoppers might be gathering information and pre-shopping prior to a visit to a brick and mortar store. Hence a positive relationship between click-through rate and sales lift seems plausible.

5.2.2 Top-level ROAS

The first column in Table 4 presents the reported ROAS for the top level in each study. This was the data available to us from each experiment. The Bayesian analysis is presented in the rest of the table. The standard errors or P-values for the top level ROAS were not calculated or reported. Figure 7 is a forest plot for top-level ROAS, showing the reported non-Bayesian estimates (gray boxes). The intervals (gray lines) were estimated using Equation 5 and the approach described in Section 7.2. The blue boxes and lines show the estimates and intervals from the Bayesian analysis. The overall estimate and interval are shown at the bottom as a blue diamond.

Figure 7 and Table 4 show the reported ROAS estimates are all positive and range from 0.01 to 14 with a simple average of 4.41. The 95% intervals for experiments L, J, C, N, K, G, and H do not contain zero, so we have multiple experiments that provide repeated evidence for a positive and statistically significant ROAS. The estimated overall average ROAS from the Bayesian analysis is 2.5. The 95% credible interval of this overall estimate is 1.03 to 4.48 and does not contain zero. The right panel of Figure 8 shows the posterior density of the overall estimate of ROAS. This density shows very little mass below zero. The overall ROAS from the population of experiments shows that the ROAS of search engine advertising on brick and mortar sales is positive. It is not uncommon for retailers to expect a ROAS of 4 (Holmes [2014]) from certain direct marketing activities such as catalog mailings. This benchmark suggests that the estimates obtained from this study are economically meaningful since the posterior density of our overall estimate has considerable mass around 4.

On average the retailers in our experiments had a gross margin of 35%.^{21} This implies that the break-even ROAS, the reciprocal of the gross margin, is about 2.85. The 95% CI of the overall estimate and the posterior density (Figure 8) indicate that there is considerable probability mass above 2.85. The implication for decision makers, including those with a financial perspective, is that even at these experimental spending levels there is a considerable probability of breaking even.
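The break-even reasoning can be made explicit: incremental sales of ROAS dollars per ad dollar yield ROAS × margin dollars of gross profit, so the campaign breaks even when ROAS equals the reciprocal of the margin. The sketch below shows this, together with how the probability of exceeding break-even could be read off a set of posterior draws. The helper names and the draws are hypothetical; they are not output from the model fitted in the paper.

```python
def break_even_roas(gross_margin):
    """ROAS at which incremental gross profit equals incremental ad spend."""
    if not 0.0 < gross_margin <= 1.0:
        raise ValueError("gross margin must be a fraction in (0, 1]")
    return 1.0 / gross_margin


def share_above(draws, threshold):
    """Fraction of draws exceeding the threshold."""
    return sum(1 for d in draws if d > threshold) / len(draws)


threshold = break_even_roas(0.35)

# Placeholder values standing in for posterior draws of overall ROAS.
hypothetical_draws = [1.2, 2.0, 2.5, 3.1, 4.4]
probability_of_break_even = share_above(hypothetical_draws, threshold)
```

With actual MCMC draws of the overall ROAS in place of the placeholder list, the same two functions would give the posterior probability of breaking even.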

The between retailers random effect standard deviation is estimated as 1.26 (95% CI 0.07 to 4.01). The between experiments within retailer standard deviation is estimated as 1.15 (95% CI 0.06 to 3.74). The between retailer estimate is somewhat higher than the within retailer estimate. With respect to random effects, retailer I with a between retailer estimate of 0.04 is the closest to the overall average of 2.5 but the 95% interval is quite wide. The insights for the between and within retailer random effects for the two retailers who had two experiments each are similar to the discussion on sales lift. Experiment C has a within retailer effect estimate of 0.38 whereas for experiment F the estimate is 0.06. Both estimates have wide 95% confidence intervals. These are presumably due to category differences. We get similar insights from comparing experiments H and M which are from the same retailer.

^{21} This estimate was obtained by the authors based on an analysis of the Profit and Loss statements of the retailers who participated in the experiments.

For the covariate impression share the estimated average increase in ROAS is 0.69 per 1 unit (10%) increase in impression share (95% CI -0.34 to 2.08 and includes zero). It is not obvious what the sign of this coefficient should be. For example the higher the impression share, the more the traffic generated but the more the retailer pays in terms of clicks. If sales increase at a rate faster than costs increase, then the numerator (sales) is increasing faster than the denominator (costs) and this would yield a positive coefficient. However saturation effects could set in mitigating these effects. For the covariate click-thru rate, for an increase of 0.1 (10%), the estimate is 0.97 and the 95% CI is -5.00 to 6.78 and includes zero. Google sells search ads using an auction format and the winner is selected based on a combination of bid and click thru rate. Google’s cost per click formula implies a negative relationship between click thru rate for the advertiser and the cost per click. So an advertiser with a higher click thru rate will presumably get more sales and also spend less on advertising and hence the impact on ROAS should be positive. This directionally is consistent with our estimate, but it seems that the 15 observations available are not enough to estimate this effect with precision.

5.2.3 Category-level sales lift

The first two columns in Table 5 present the reported sales lift for the category sales and the p-values or p-value intervals in each experiment. These two columns represent the data available to us. The other columns present the Bayesian analysis. The between-retailer random effects are the $\alpha_i$ in model (6), the within-retailer random effects are the $\beta_{ij}$, the between-category random effects are the $\gamma_k$, and the within-category random effects are the $\varepsilon_{ijkl}$. Table 6 groups these categories into 18 broad categories. Figure 11 is a forest plot for sales lift in the 18 broad categories, combining across experiments in the Bayesian analysis. The overall estimate and interval are shown at the bottom as a blue diamond.

Table 5 shows that a few category level lift estimates (category # 2, 3, 8, 13, 14, 15, 25, 28, 46, 51, 56, 62, 63, 67, 68, 69, 70, 71) are negative. But none of these negative estimates is significant and they typically have very wide confidence intervals. The rest of the estimates are positive and for many of them the 95% CI does not contain zero. So we have repeated evidence for a positive effect of search engine advertising on category sales in brick and mortar stores. The evidence is from multiple retailers and categories. Figure 9 is a forest plot of the Bayesian estimates of sales lift for all 76 categories.^{22} The last row of Table 5 shows that the estimated overall average sales lift from the Bayesian analysis is 1.27%. The 95% credible interval of this overall estimate is 0.30% to 2.34% and does not contain zero. Figure 10 shows the posterior density of the overall estimate of sales lift for the category level data. This density shows very little mass below zero. The average sales lift from the population of experiments shows that there is a causal cross channel impact of search engine advertising on category level sales in brick and mortar stores.

The between-retailer random effect shows how far the average category effect for this retailer is from the overall category average across all retailers. Since this is a retailer effect all the categories for the same retailer will have an identical estimate and 95% CI. The within-retailer random effect shows how far the average category effect for this experiment is from the average across all experiments for this retailer. As noted in the discussion of sales lift estimates, only two retailers had two experiments each and these effects may not be well identified. The between category random effect shows how far the effect for this broad category is for this retailer from the overall average across all retailers for this broad category. The within-category (category within a broad category effect in equation 6) shows how far this specific category is from the average of other categories within this broad category for this retailer. The random effects estimates confirm various sources of heterogeneity. The between retailer random effects estimate is 0.66 (95% CI 0.03 to 2.35). The between experiments within retailer random effect estimate is 0.55 (95% CI 0.03 to 2.06). The between broad categories random effect estimate is 0.57 (95% CI 0.03 to 1.82). The between categories within broad category random effect estimate is 0.57 (95% CI 0.03 to 1.30). It is interesting to note that these four random effects components all show significant variation and magnitude.

In order to obtain insights from these random effects in a more intuitive manner Figure 11 and Table 6 present the category level estimates aggregated into 18 broad categories. An interesting insight from the figure is that the estimates are positive for all the 18 broad categories. This suggests that the cross channel impact of search engine advertising on sales in brick and mortar stores is widespread across these different broad categories. Some of the broad categories with higher estimates such as furniture, home furnishings, kitchen and bath, small appliances and large appliances are categories where it is plausible that consumers might search more online before they purchased in a store.

^{22} Since only p-value intervals are reported for many estimates it is not possible to generate a forest plot of the reported estimates.

For the covariate Impression share, the estimated average increase in lift is 0.06 percentage points per 0.1 (10%) increase in impression share (95% CI -0.91 to 0.88). For the covariate Click-through rate, the estimated average increase in lift is 4.48 percentage points per 0.1 (10%) increase in click-through rate (95% CI -6.29 to 14.6). For the covariate Category strength, the estimated average increase in lift is 0.032 percentage points per 0.1 (10%) increase in category strength (95% CI -0.024 to 0.103). The interval does contain zero but we note that the 95% CI for this covariate is narrower than those of the other covariates. While the signs of these coefficients seem intuitive, these findings are at best directional.

5.3 Robustness

5.3.1 Various p-value scenarios

In the category-level data, most p-values were reported as an interval $(p_{\text{lower}}, p_{\text{upper}})$ for $0 < p_{\text{lower}} < p_{\text{upper}} < 1$. Our main analysis represented this uncertainty about the P-value using a uniform prior on this interval, for the purpose of deriving a standard error for each category/experiment’s sales lift. To check sensitivity of the results to this choice, we considered these alternatives:

• Pessimistic: Use $p_{\text{upper}}$ for this p-value; this gives the maximum standard error consistent with this interval.

• Optimistic: Use $p_{\text{lower}}$ for this p-value; this gives the minimum standard error consistent with this interval.

• Midpoint: Use the midpoint of the interval, i.e., $(p_{\text{lower}} + p_{\text{upper}})/2$.

• Sampling actual p-values (SAPV): Collect all the p-values reported as exact values (not intervals) from both the original top-level and category-level results. For each p-value reported as an interval, draw one sample from the subset of these exact P-values that is in the target interval.
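The four alternatives above amount to different rules for turning a reported interval into a single p-value. A minimal sketch of those rules follows; the function name, scenario labels, and example values are hypothetical, and the error raised when no exact p-value falls in the interval is an assumption about how that case might be handled.

```python
import random


def resolve_p_value(p_lower, p_upper, scenario, exact_p_values=(), rng=None):
    """Pick one p-value for an interval-reported result under a given scenario."""
    if not p_lower < p_upper:
        raise ValueError("expected p_lower < p_upper")
    if scenario == "pessimistic":
        return p_upper  # largest p-value, hence largest standard error
    if scenario == "optimistic":
        return p_lower  # smallest p-value, hence smallest standard error
    if scenario == "midpoint":
        return (p_lower + p_upper) / 2.0
    if scenario == "sapv":
        # Sample from the exactly reported p-values that fall in the interval.
        candidates = [p for p in exact_p_values if p_lower <= p <= p_upper]
        if not candidates:
            raise ValueError("no exact p-values fall inside the interval")
        rng = rng or random.Random()
        return rng.choice(candidates)
    raise ValueError(f"unknown scenario: {scenario}")


# Hypothetical pool of exactly reported p-values and one interval-reported result.
exact_pool = [0.01, 0.04, 0.12, 0.21, 0.27, 0.41]
sampled = resolve_p_value(0.05, 0.3, "sapv", exact_pool, rng=random.Random(1))
```

Each resolved p-value would then be converted to a standard error in the same way as an exactly reported one.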

Table 8 shows the resulting overall sales lift estimates and intervals. The 95% CI of overall sales lift excludes zero under all alternatives, and the overall sales lift estimate (posterior mean) is very similar under the uniform prior, mid-point, and SAPV alternatives. These results imply the main result is robust to our handling of p-values reported as intervals. These results also confirm that our handling of the p-values is conservative. Most of the other scenarios except the mid-point produce a higher estimate.^{23}

It may seem counterintuitive that the posterior SD for the overall effect is larger under the optimistic alternative than under the pessimistic alternative. However, the observed data, i.e., the lift estimates for individual category/experiments, have a fixed amount of variation which the model allocates to five components of variation. If we reduce the part of that variance allocated to within-experiment variation (as we did by switching from the pessimistic to the optimistic alternative), then a larger fraction of the observed variation must be allocated to the other components of variation (between-retailer, between experiment within retailer, etc.). Because these other components of variation are not suppressed by replication to the same extent as within-experiment variation, the net effect is a larger posterior standard deviation.

5.3.2 Top-level ROAS: Correlation between test stores of ESi

To derive a standard error for an experiment’s top-level ROAS (Appendix, Section 7.2, Equation 12), we needed to specify the correlation $r$ between expected sales $ES_i$ for pairs of test stores. This correlation arises from matching individual control stores to more than one test store to derive the test stores’ expected sales. Our main analysis ignored this correlation because it had a tiny effect. This section briefly considers the robustness of the results to this assumption.

The correlations are not directly available from the reported data. However, they can be estimated from the number of test stores and the total number of stores in the network; the estimates are listed in Table 9.

The resulting overall ROAS estimate considering the correlation was 2.49, with a 95% credible interval (1.02, 4.47). In our main analysis, ignoring the correlation, the estimate was 2.50 with 95% CI (1.03, 4.48). These results indicate that our analysis is robust to the correlation issue. If the assumed values of $r$ were doubled or tripled, the results would still change negligibly.

^{23} For the midpoint alternative the p-values are fixed and treated as known parameters. For the uniform prior alternative, there is an additional layer in the hierarchical model, and p-values are drawn from that prior. Therefore, the results in these two scenarios will not be exactly identical.

6 Conclusions, Limitations and Future Research

In this paper we investigate the cross channel effects of search engine advertising on sales in brick and mortar stores. Cross channel advertising is an important and growing topic. The number of channels is increasing and so is consumer shopping across channels. Retailers need to understand if advertising in one channel can generate sales in another channel. Since retailers sell an assortment consisting of multiple categories they also need to know how widespread these effects are, if a category orientation is beneficial, and the nature of the ROI from this type of advertising. Our paper takes a first step at investigating these first order questions.

Obtaining causal estimates of cross channel effects is difficult. Sales variability in brick and mortar stores, lack of adequate spending, and distinct organizational units for online and offline marketing create barriers. We conducted multiple randomized field experiments on Google.com and obtained causal estimates using the APT platform, which has found acceptance among C-level decision makers in leading retail organizations. In each experiment we increased advertising spending on top of the funnel keywords. We analyzed the estimates from multiple field experiments using a heterogeneous treatment effects model in a hierarchical Bayesian framework.

Across the experiments we found causal evidence that an increase in search advertising on Google.com in targeted categories caused an incremental increase in brick-and-mortar sales in the advertised categories. We also found that top line sales increased, providing evidence that the category sales increases are incremental net of any cannibalization effects from other categories. We estimate the total sales increase for the participating retailers is in excess of $1B. We find that these effects are widespread across a large number of categories. We also find that there is considerable heterogeneity across categories, which suggests a category orientation might be beneficial. Finally, we found that the ROI from this type of advertising has a positive probability of break-even and compares well to acceptable benchmarks. Hence overall these effects seem to be economically meaningful.

Our analysis also provides information about the likely sales lift and return on ad spend that other retailers would obtain in similar experiments. For a retailer considering an increase in search engine advertising, the predictive distributions for $L$ and ROAS, with new draws from their respective random-effect distributions for retailers (in Equations 4 and 6 respectively), describe the information available about the likely lift and return. Retailers can simulate from these distributions to inform their advertising spending decisions. Based on the findings of our experiments several of the participating retailers incorporated search engine advertising spending to influence offline sales. One of the retailers worked with a manufacturer to develop and implement a manufacturer-retailer co-operative program on search engine advertising.
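Simulating from the predictive distribution for a new retailer could look like the sketch below. It assumes the retailer has access to posterior draws of $\eta$, $\lambda$ and $\tau$ from the top-level lift model (Equation 4, without covariates); the function name and the draws shown are hypothetical placeholders, not results from the paper. The output is the predictive distribution of the new retailer's true lift, before any within-experiment sampling error.

```python
import random


def predictive_lift_draws(eta_draws, lambda_draws, tau_draws, seed=0):
    """For each posterior draw, add fresh retailer and experiment effects."""
    rng = random.Random(seed)
    predictions = []
    for eta, lam, tau in zip(eta_draws, lambda_draws, tau_draws):
        alpha_new = rng.gauss(0.0, lam)  # effect for a retailer not in the study
        beta_new = rng.gauss(0.0, tau)   # effect for that retailer's new experiment
        predictions.append(eta + alpha_new + beta_new)
    return predictions


# Placeholder posterior draws, in percentage points of lift.
eta_draws = [1.10, 1.25, 1.18, 0.95, 1.40]
lambda_draws = [0.30, 0.10, 0.45, 0.25, 0.35]
tau_draws = [0.28, 0.15, 0.40, 0.20, 0.30]
predicted = predictive_lift_draws(eta_draws, lambda_draws, tau_draws)
```

Because each prediction adds new random effects on top of the overall mean, the predictive distribution is wider than the posterior for the overall average lift.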

This paper also provides a proof of principle of meta-analyses like these, combining information from randomized experiments. The meta-analyses presented here are closer in spirit to the meta-analyses of randomized experiments in clinical research than are the more familiar meta-analyses combining results of econometric studies (Sethuraman and Tellis [1991], Farley and Lehmann [1986], Tellis [1988]).

The analyses presented here have several limitations that arise mainly from constraints in the available data. These constraints required a variety of workarounds, the most serious problem being the estimation of standard errors. Standard errors for experiment-specific estimates of top-level sales lift were readily available, but standard errors for ROAS had to be computed more indirectly, and standard errors for category-specific sales lift were subject to considerable uncertainty because only a range of P-values was available for each category in an experiment. Our results are robust to the assumptions made in these workarounds. Our method for calculating the standard errors of ROAS might be useful more generally in contexts where incremental advertising is not measured at the level of an individual store, which precludes generating an empirical distribution of ROAS.

There are other limitations to our analysis. Our estimates reflect an increase in advertising spending only on top-of-the-funnel keywords. Our results also reflect the spending levels in these experiments; higher spending levels could change the lift and ROI estimates. While we have estimated and documented heterogeneous treatment effects and conducted some preliminary correlational analysis of moderators such as campaign impression share, campaign average position, and category strength, further work is needed to obtain causal estimates and to understand the impact of other moderators. For example, does more online search lead to greater cross-channel effects of search engine advertising? We have only very indirect evidence on this question, based on the correlate impression share, and questions of this type require further research.


7 Appendix

7.1 Degrees of freedom of the t-test, for inferring standard errors

We inferred standard errors from reports of the estimated mean and P-value from a one-sided t-test. The degrees of freedom (df) of the t-test was {number of test stores + number of unique control stores – 2}, which was not available. (Recall that control stores could be matched to more than one test store.) Lacking the actual df, we used for $n_{ij}$ the total number of stores in the chain, which is necessarily larger. The inferred standard error is not sensitive to this choice for df in the range used here, as we now show.

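We show this for the combined lift estimate $L$. Note that

$$
\begin{aligned}
&\text{P-value} = \alpha \\
&\quad\text{if and only if } \Pr(L/SE > t_{\alpha,\mathrm{df}}) = \alpha, \text{ because the test was one-sided} \\
&\quad\text{if and only if } SE = L/t_{\alpha,\mathrm{df}},
\end{aligned}
\tag{7}
$$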

where $t_{\alpha,\mathrm{df}}$ is the $100(1-\alpha)$th percentile of the standard $t$ distribution with df degrees of freedom.

As df is increased, $t_{\alpha,\mathrm{df}}$ decreases, so SE increases. Thus, by using a df value larger than the correct but unknown df, we infer standard errors that are larger than the true values, which is conservative in the sense of acting as if each experiment was less informative than it actually was.

Further, $t_{\alpha,\mathrm{df}}$ is insensitive to df in the range we have used; thus, the inferred standard errors are insensitive. For example, with $\alpha = 0.05$, $t_{\alpha,\mathrm{df}}$ is 1.708 for df = 25; 1.671 for df = 60; 1.658 for df = 120; and 1.645 for df = ∞ (i.e., the normal distribution). Over this range of df, $t_{\alpha,\mathrm{df}}$ decreases by 3.7%, and the ends of this range differ by far more than our conservative df differs from the actual but unknown df.
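The sensitivity argument can be checked with a few lines of arithmetic. The sketch below applies Equation (7) using the critical values quoted above; the lift value of 0.02 is a hypothetical input chosen only for illustration.

```python
# Critical values quoted in the text (one-sided test, alpha = 0.05).
T_CRIT = {25: 1.708, 60: 1.671, 120: 1.658, float("inf"): 1.645}


def inferred_se(lift, t_crit):
    """Equation (7): SE = L / t_{alpha,df}."""
    return lift / t_crit


lift = 0.02  # hypothetical lift estimate reported with a one-sided P-value of 0.05
se_by_df = {df: inferred_se(lift, t) for df, t in T_CRIT.items()}
for df, se in se_by_df.items():
    print(f"df = {df}: inferred SE = {se:.5f}")

# Relative change in the critical value, and hence in SE, across the df range.
t_drop = (T_CRIT[25] - T_CRIT[float("inf")]) / T_CRIT[25]
se_rise = se_by_df[float("inf")] / se_by_df[25] - 1
print(f"t falls by {t_drop:.1%}; inferred SE rises by {se_rise:.1%}")
```

Because SE is inversely proportional to the critical value, overstating df can only enlarge the inferred standard error, and only by a few percent over this range.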

7.2 Standard errors for return on ad spending (ROAS)

The data available for this study did not include standard errors for individual experiments' ROAS estimates, and the corresponding P-values were reported in ranges. To avoid excessive conservatism in inferring standard errors for the ROAS estimates, we computed standard errors for ROAS as follows. The derivation has two steps. The first conditions on each experiment's average (over test stores) expected sales (i.e., acts as if average expected sales is known), while the second step removes that conditioning (i.e., treats average expected sales as a random variable).

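Conditioning on an experiment's average expected sales $\overline{ES}$, from Section 5.1,

$$
L = \frac{1}{n\,\overline{ES}} \sum_i (AS_i - ES_i) \quad\text{and}\quad ROAS = \frac{1}{ADS} \sum_i (AS_i - ES_i). \tag{8}
$$

Assume that $\sum_i (AS_i - ES_i) \sim N(\mu, \tau^2)$; then

$$
L \sim N\!\left(\frac{\mu}{n\,\overline{ES}},\ \frac{\tau^2}{n^2\,\overline{ES}^2}\right) \quad\text{and}\quad ROAS \sim N\!\left(\frac{\mu}{ADS},\ \frac{\tau^2}{ADS^2}\right). \tag{9}
$$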

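We have $\hat\sigma_L^2$, the estimated variance (square of the standard error) for an experiment's lift estimate; we want an estimate for the variance of ROAS, $\hat\sigma_R^2$. From Equation (9),

$$
\hat\sigma_L^2 = \frac{\hat\tau^2}{n^2\,\overline{ES}^2} \quad\text{so that}\quad \hat\tau^2 = n^2\,\overline{ES}^2\,\hat\sigma_L^2.
$$

Therefore

$$
\hat\sigma_R^2 = \frac{\hat\tau^2}{ADS^2} = \frac{n^2\,\overline{ES}^2}{ADS^2}\,\hat\sigma_L^2. \tag{10}
$$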

Conditional on $\overline{ES}$, then, we can estimate the standard error of ROAS. However, $\overline{ES}$ is itself a random variable because test stores were randomly sampled. To remove this conditioning, note that by Equation (8), conditional on the vector of expected sales for the test stores, $\mathbf{ES}$,

$$
ROAS \mid \mathbf{ES} \sim N\!\left(\frac{n\,\overline{ES}}{ADS}\,\theta,\ \frac{n^2\,\overline{ES}^2}{ADS^2}\,\sigma_L^2\right). \tag{11}
$$


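The variance of ROAS is thus

$$
\begin{aligned}
\mathrm{Var}(ROAS) &= \mathrm{Var}[E(ROAS \mid \mathbf{ES})] + E[\mathrm{Var}(ROAS \mid \mathbf{ES})] \\
&= \mathrm{Var}\!\left(\frac{n\,\overline{ES}}{ADS}\,\theta\right) + E\!\left(\frac{n^2\,\overline{ES}^2}{ADS^2}\,\sigma_L^2\right) \\
&= \frac{n^2\theta^2}{ADS^2}\,\mathrm{Var}(\overline{ES}) + \frac{n^2\sigma_L^2}{ADS^2}\,E(\overline{ES}^2) \\
&= \frac{n^2\theta^2}{ADS^2}\,\mathrm{Var}(\overline{ES}) + \frac{n^2\sigma_L^2}{ADS^2}\left\{\mathrm{Var}(\overline{ES}) + [E(\overline{ES})]^2\right\} \\
&= \frac{n^2\sigma_L^2}{ADS^2}\,[E(\overline{ES})]^2 + \frac{n^2(\theta^2 + \sigma_L^2)}{ADS^2}\,\mathrm{Var}(\overline{ES}).
\end{aligned}
\tag{12}
$$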

In the last line above, the first term is the conditional estimate of $\sigma_R^2$ with $E(\overline{ES})$ substituted for $\overline{ES}$. Thus the second term is the primary consequence of removing the conditioning on $\overline{ES}$.
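As a rough numerical illustration of Equations (10) and (12), the sketch below computes both the conditional and the unconditional variance of ROAS. The function name and all input values are hypothetical and chosen only to make the arithmetic easy to follow; they are not taken from any experiment in this study.

```python
import math


def roas_variance(n, ads, theta, var_lift, mean_es, var_mean_es):
    """Return (conditional, unconditional) variance of ROAS.

    conditional:   Equation (10), with E(ES-bar) substituted for ES-bar
    unconditional: Equation (12), which adds a term for Var(ES-bar)
    """
    scale = (n / ads) ** 2
    conditional = scale * var_lift * mean_es ** 2
    extra = scale * (theta ** 2 + var_lift) * var_mean_es
    return conditional, conditional + extra


# Hypothetical inputs: 50 test stores, $100,000 of incremental ad spend,
# a 2% lift with standard error 1%, and mean expected sales of $200,000
# per store whose average has a standard deviation of $5,000.
cond, uncond = roas_variance(
    n=50,
    ads=100_000.0,
    theta=0.02,
    var_lift=0.01 ** 2,
    mean_es=200_000.0,
    var_mean_es=5_000.0 ** 2,
)
print(f"conditional SE of ROAS:   {math.sqrt(cond):.4f}")
print(f"unconditional SE of ROAS: {math.sqrt(uncond):.4f}")
```

For these made-up inputs the two standard errors differ by well under one percent, because the term involving the variance of average expected sales is small relative to the first term.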

The data available for this study included, for each experiment, the sample mean and standard deviation of the $ES_i$, so we can estimate $\mathrm{Var}(\overline{ES})$ if we can make a plausible assumption about the correlation between $ES_i$ and $ES_j$ for test stores $i$ and $j$. To do this, we assumed the $ES_i$ were exchangeable with correlation $r$ between each pair of test stores. Therefore,

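$$
\mathrm{Var}(\overline{ES}) = \frac{1}{n^2}\,\mathrm{Cov}\!\left(\sum_{i=1}^{n} ES_i,\ \sum_{j=1}^{n} ES_j\right) = \frac{1}{n^2}\left[n\,\sigma_{ES}^2 + n(n-1)\,r\,\sigma_{ES}^2\right] = \sigma_{ES}^2\left[\frac{1}{n} + \frac{n-1}{n}\,r\right]. \tag{13}
$$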
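Equation (13) can be checked numerically by comparing the closed form with a direct sum over the full covariance matrix of exchangeable variables. The function names and the inputs in this sketch are hypothetical.

```python
def var_mean_es(sd_es, n, r):
    """Equation (13): variance of the mean of n exchangeable ES_i."""
    return sd_es ** 2 * (1.0 / n + (n - 1) / n * r)


def var_mean_es_bruteforce(sd_es, n, r):
    """Same quantity, summing every entry of the n-by-n covariance matrix."""
    total = sum(
        sd_es ** 2 * (1.0 if i == j else r)  # variance on the diagonal, r * variance elsewhere
        for i in range(n)
        for j in range(n)
    )
    return total / n ** 2


# Hypothetical inputs: 40 test stores, store-level SD of 30,000, r = 0.1.
closed_form = var_mean_es(30_000.0, 40, 0.1)
brute_force = var_mean_es_bruteforce(30_000.0, 40, 0.1)
print(closed_form, brute_force)
```

Setting `r = 0` recovers the familiar independent-sample result of variance divided by `n`, while `r = 1` leaves the variance of the mean equal to the variance of a single store.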

We roughly estimated the correlation coefficient $r$ as $n_{ij}/10$, where $n_{ij}$ is the number of control stores shared by the $i$th and $j$th test stores. Simulations in which control stores were assigned at random to test stores gave the estimates of $r$ in Table 7.
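A simulation of this kind might look like the sketch below. It assumes that each test store is matched to 10 control stores (suggested by the divisor above but not stated explicitly) drawn at random without replacement from a common pool; the pool size, the number of test stores, and the function name are hypothetical, and the actual simulation design behind Table 7 may differ.

```python
import random
from itertools import combinations


def simulate_r(n_test, n_pool, k=10, n_reps=200, seed=0):
    """Average of n_ij / k over all pairs of test stores and all replications.

    n_test: number of test stores; n_pool: number of candidate control stores;
    k: control stores matched to each test store (assumed to be 10).
    """
    rng = random.Random(seed)
    pool = range(n_pool)
    total = 0.0
    for _ in range(n_reps):
        # Assign k control stores at random to every test store.
        controls = [set(rng.sample(pool, k)) for _ in range(n_test)]
        # Fraction of shared controls for each pair of test stores.
        shared = [len(a & b) / k for a, b in combinations(controls, 2)]
        total += sum(shared) / len(shared)
    return total / n_reps


r_hat = simulate_r(n_test=20, n_pool=100)
print(f"estimated r: {r_hat:.3f}")
```

Under purely random assignment the expected shared fraction is `k / n_pool`, so the estimate should fall as the pool of candidate control stores grows.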

A standard error for each experiment's ROAS was obtained by substituting Table 7's estimated $r$'s and other known quantities (e.g., standard errors for sales lift) into Equation (12). The unconditional and conditional estimates of $\sigma_R^2$ (with $E(\overline{ES})$ substituted for $\overline{ES}$) were very similar. Meta-analyses using the two different estimates gave very similar results, so Section 5.2.2


References

Ashish Agarwal, Kartik Hosanagar, and Michael Smith. Location, location, location: An analysis of profitability of position in online advertising markets. Journal of Marketing Research, 46(6):1057–1073, 2008.

Mohan N Babapulle, Lawrence Joseph, Patrick Bélisle, James M Brophy, and Mark J Eisenberg. A hierarchical Bayesian meta-analysis of randomised clinical trials of drug-eluting stents. The Lancet, 364(9434):583–591, 2004.

Shelly Banjo and Drew FitzGerald. Stores confront new world of reduced shopper traffic. Wall Street Journal, January 2014.

Ernst R Berndt. The practice of econometrics: classic and contemporary. Addison-Wesley, Reading, MA, 1991.

Donald A Berry, Scott M Berry, John McKellar, and Thomas A Pearson. Comparison of the dose-response relationships of 2 lipid-lowering agents: a Bayesian meta-analysis. American Heart Journal, 145(6):1036–1045, 2003.

Thomas Blake, Chris Nosko, and Steven Tadelis. Consumer heterogeneity and paid search effectiveness: A large scale field experiment. NBER Working Paper, pages 1–26, 2013.

Rebecca DerSimonian and Raghu Kacker. Random-effects model for meta-analysis of clinical trials: an update. Contemporary Clinical Trials, 28(2):105–114, 2007.

John U Farley and Don R Lehmann. Generalizing about market response models: Meta-analysis in marketing. Lexington, MA: Lexington Books, 1986.

Drew FitzGerald. Retail sales on Thanksgiving, Black Friday rose 2.3reports., November 2013. URL http://online.wsj.com/news/articles/SB10001424052702304017204579230801763930942.

Anindya Ghose and Sha Yang. An empirical analysis of search engine advertising: Sponsored search in electronic markets. Management Science, 55(10):1605–1622, 2009.

Sander Greenland. Invited commentary: a critical look at some popular meta-analytic methods.