
Chapter 2. A SYSTEMATIC REVIEW OF THE RELIABILITY AND VALIDITY OF DISCRETE CHOICE

2.3 Methods

The systematic review process generally comprises five steps: the development of a protocol to guide the review, screening or inclusion criteria, quality appraisal, data extraction, and synthesis (Pullin and Stewart 2006). As the primary objective of this review is to examine the evidence on the reliability and validity of the DCE method, we selected studies that met the inclusion criteria and whose survey design was judged sufficiently robust to answer our review question, but did not further appraise the quality²⁴ of the selected articles given the limited evidence base. We sent the review protocol to six DCE experts and practitioners; three of them reviewed it and provided valuable comments on the selection criteria and search strategy.

2.3.1 Systematic review protocol and search strategy

We used the conceptual framework developed in section 2 to generate a set of search terms, which were combined into a search string formatted according to the requirements of the Web of Science (WoS) and EconLit databases. Following experts’ recommendations, we used a set of 24 references (Supplement 1) as a ‘test library’ to check whether the search strings captured the expected studies and, if not, which terms would have captured them and how many other relevant studies those new terms might add. We used an iterative checking process to validate the search terms and reduce the risk of missing relevant studies. The final search string (figure 2.1) was defined after 15 iterations and was judged sufficiently diverse to capture different phrasings of the reliability and validity of DCE. The search terms were chosen to balance the proportion of hits that are relevant (referred to in the systematic review literature as “specificity”) against the capture of all available literature (“sensitivity”). We conducted the initial search between 20 July and 20 August 2013 by entering the search terms into two databases: (i) WoS (https://webofknowledge.com/), one of the world's largest databases of scientific papers, and (ii) EconLit (http://www.aeaweb.org/econlit/icon.php), the database of the American Economic Association. The search was updated on 24-29 February 2016, using WoS only, since EconLit had returned only three additional includable articles in the initial search. After removing duplicates, articles were assessed against our inclusion criteria (see 3.2) first using titles and keywords, then abstracts, then full texts. At each stage any potentially includable studies were retained for the next stage. Included studies are described in the synthesis tables (Supplement 3), which report the type of validity and reliability checks, the good valued, the location, the sample design and sample size, the econometric methods used and the methods used to test for the equality of marginal willingness to pay (MWTP) / willingness to accept (MWTA) estimates.

²⁴ Quality appraisal involves the scoring of each relevant study against a set of pre-established criteria or “quality hierarchy”. These criteria often involve subjective judgements about the relative importance of different sources of bias (for more details, please see Pullin and Stewart 2006).


Figure 2.1: Search strings (combination of sub-strings from DCE and different approaches to reliability and validity testing using Boolean operators)
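The iterative check of the search string against the test library (section 2.3.1) can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions: the term groups and the three record titles below are invented placeholders, not the terms in figure 2.1 or the references in Supplement 1, and the matching is simple substring search rather than the query syntax of WoS or EconLit.

```python
def matches(query_groups, text):
    """True if the text contains at least one term from every group (AND of ORs)."""
    lowered = text.lower()
    return all(any(term in lowered for term in group) for group in query_groups)


def sensitivity(query_groups, test_library):
    """Share of test-library records captured by the query, plus the missed records."""
    missed = [record for record in test_library if not matches(query_groups, record)]
    captured = len(test_library) - len(missed)
    return captured / len(test_library), missed


# Hypothetical term groups: one group for the method, one for the test type.
HYPOTHETICAL_QUERY = [
    ["choice experiment", "choice model"],
    ["validity", "reliability"],
]

# Hypothetical titles standing in for the 24-reference test library.
HYPOTHETICAL_LIBRARY = [
    "Test-retest reliability of a choice experiment on wetland restoration",
    "Hypothetical bias and external validity in choice modelling",
    "Scope sensitivity in stated preference surveys",
]

rate, missed = sensitivity(HYPOTHETICAL_QUERY, HYPOTHETICAL_LIBRARY)
print(f"captured {rate:.0%} of the test library")
for record in missed:
    # Each missed record suggests terms to consider in the next iteration.
    print("missed:", record)
```

In this toy case the third title is missed because it contains no term from the first group, which is the kind of signal that would prompt adding a term such as "stated preference" in the next iteration and then checking how many additional hits the new term brings.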


2.3.2 Inclusion criteria, data extraction and synthesis

To be included in the review, studies had to satisfy the following criteria. They had to test for the validity or reliability of the DCE results, and had to be published between January 2003 and February 2016. The time span was restricted to capture the most recent studies, as DCE and SP techniques have advanced over the years and are evolving fast. The object of valuation or type of good being valued was restricted to non-market environmental goods or non-market environmental attributes of market goods, including both use and non-use values. “Non-market” refers to goods that do not have an observable market price and are not sold or bought directly in the market (e.g. the regulation of water or air quality, or recreational and spiritual benefits; see Millennium Ecosystem Assessment 2005). Non-market attributes of market goods include, for instance, the ecological component of certified coffee beans (e.g. Carlsson et al. 2010; Tonsor and Shupp 2011), where organic production may be supposed to produce public goods as well as private benefits to the consumer. Only original DCE applications were included in the analysis; benefit transfer studies, meta-analyses and discussion papers were excluded. Only papers in English were included.
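The criteria in the paragraph above can be read as a checklist applied to each candidate record. The following Python sketch shows one way such a checklist might be encoded; the `Record` fields and the example record are hypothetical, and the judgement-based criteria (for example whether a good is non-market) are assumed to have been coded by hand beforehand.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Record:
    """Hypothetical, hand-coded description of one candidate study."""
    published: date
    language: str
    tests_validity_or_reliability: bool
    non_market_environmental_good: bool
    original_dce_application: bool


# Publication window stated in the text: January 2003 to February 2016.
WINDOW_START = date(2003, 1, 1)
WINDOW_END = date(2016, 2, 29)


def failed_criteria(record):
    """Return the names of the inclusion criteria that the record does not meet."""
    checks = {
        "tests validity or reliability": record.tests_validity_or_reliability,
        "published 2003-01 to 2016-02": WINDOW_START <= record.published <= WINDOW_END,
        "non-market environmental good": record.non_market_environmental_good,
        "original DCE application": record.original_dce_application,
        "written in English": record.language.lower() == "english",
    }
    return [name for name, passed in checks.items() if not passed]


# Invented example: meets every criterion except the publication window.
example = Record(date(2002, 6, 1), "English", True, True, True)
print(failed_criteria(example))
```

Returning the list of failed criteria, rather than a single yes/no flag, keeps a record of why each study was excluded at a given screening stage.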

To be included, qualitative studies must have explicitly reported results in a manner that allows an assessment of reliability/validity to be made. Studies that carried out focus groups or other qualitative methods simply to assist in drafting DCE surveys were excluded. Studies that only included robustness checks (Smith 2007), which examine model fit or the robustness of results to different assumptions such as the treatment of unobserved heterogeneity or different model specifications (e.g. Campbell et al. 2011; Christie and Gibbons 2011; Torres et al. 2011), were also excluded. Instead, we focused on the design and administration of DCE surveys, and on how respondents perceive and answer them, rather than on data analysis. Similarly, we excluded studies that only tested common prior expectations such as the relationship between WTP estimates and income (Bateman et al. 2002). Such tests are routinely handled in data analyses and are ambiguous tests of validity²⁵. We excluded tests based on respondents’ self-reported certainty about their choices, since low certainty may represent a real feature of respondents’ preferences rather than a lack of validity. Likewise, we excluded comparisons of MWTP and MWTA estimates because the WTP-WTA disparity is not prima facie evidence of a lack of reliability of the DCE method but may instead reflect underlying preferences consistent with Hicksian theory (Kim et al. 2015b). Conversely, while comparing the effect of alternative survey administration modes on DCE results (e.g. Olsen 2009) rightly qualifies as reliability testing, it is beyond the scope of this systematic review, which focused on survey design.

²⁵ We distinguish such tests from those described in section 2.2.2, which concern assumptions on which the DCE

Different outcome elements were extracted from the included studies depending on the type of reliability or validity test. Reliability, criterion and convergent validity testing often produce comparisons of attribute parameters (or utility coefficients), MWTP/MWTA or compensating surplus estimates between split samples. When comparing attribute parameters between two samples, we included outcomes which used the sequential testing procedure of Swait and Louviere (1993) to account for differences in scale factors²⁶. In logit models, the scale parameter (inversely related to the variance of the error term) is jointly estimated and hence confounded with the attribute parameters in the utility function (Louviere et al. 2000a). Three tests for equality of MWTP/MWTA estimates were used in the reviewed studies: (i) comparing confidence intervals, (ii) a simple t-test, and (iii) the complete combinatorial method (Poe et al. 2005). The first two tests can give biased outcomes if normality assumptions are violated: t-tests in particular might underestimate the level of significance of differences in WTP (ibid.). Nevertheless, we included studies that used any of the three tests, but noted the approaches used by authors (Supplement 3). The studies are too heterogeneous to permit a quantitative meta-analysis. Instead, using the full synthesis tables (Supplement 3), we describe the state of evidence by highlighting the number of studies providing a yes or no answer to the questions of interest. We do not present effect sizes, which would be uninformative because both the context and the non-market environmental good being valued differed across studies.
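To make the third test concrete, the sketch below illustrates the idea behind the complete combinatorial method: every draw from one empirical WTP distribution is differenced against every draw from the other, and the share of non-positive differences is read as a one-sided significance level. This is a minimal illustration, not the procedure of any reviewed study. The draws are simulated from invented normal distributions (in practice they might come from bootstrap or parametric simulation of the estimated model), and doubling the smaller tail share to obtain a two-sided value is an assumption made here for illustration.

```python
import random
from itertools import product


def complete_combinatorial(draws_a, draws_b):
    """Share of all pairwise differences (a - b) that are non-positive."""
    non_positive = sum(1 for a, b in product(draws_a, draws_b) if a - b <= 0)
    return non_positive / (len(draws_a) * len(draws_b))


def two_sided_p(draws_a, draws_b):
    """Two-sided value obtained by doubling the smaller tail share (assumption)."""
    share = complete_combinatorial(draws_a, draws_b)
    return 2 * min(share, 1 - share)


# Invented MWTP draws for two split samples; the means and spreads are arbitrary.
rng = random.Random(0)
sample_a = [rng.gauss(12.0, 3.0) for _ in range(500)]
sample_b = [rng.gauss(10.0, 3.0) for _ in range(500)]

print("one-sided share of (a - b) <= 0:", complete_combinatorial(sample_a, sample_b))
print("two-sided value:", two_sided_p(sample_a, sample_b))
```

Because the comparison works directly on the empirical draws, it does not require the two WTP distributions to be normal, which is the concern raised above for confidence-interval comparisons and t-tests.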

²⁶ We note that, in addition to the Swait-Louviere sequential procedure, there are less common methods used in other fields (transport and health economics) to control for scale differences, such as the procedure proposed by Ben-Akiva and Morikawa (1990), in which observations from separate (groups of) choice tasks are used simultaneously to maximize a joint likelihood function, and the one-step estimation of the Ben-Akiva and Morikawa approach proposed by Bradley and Daly (1994), which can be implemented using a nested logit (the logit-based scaling approach).
