Conducting the Review - Experimental Software Engineering

Conducting the review means putting the review protocol into practice. This includes:

Identification of research. The main activity in this step involves specifying search strings and applying them to databases. However, it also includes manual searches in journals and conference proceedings, as well as searching researchers’ web sites or sending questions to researchers. Systematically searching for primary studies based on references to and from other studies is called “snowballing” [145].

The search strategy is a trade-off between finding all relevant primary studies and not getting an overwhelming number of false positives, which must be excluded manually [43]. A false positive is an outcome that is reported as positive when it should not be; in this case, a paper is found by the search and hence assumed to be of interest, but later turns out not to be relevant and therefore has to be removed. The search string is developed from the area to be covered and the research questions.
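As an illustration of how a search string may be derived from the research questions, the sketch below joins synonyms with OR and concept groups with AND. This convention, the function name `build_search_string`, and the example terms are assumptions made for illustration; each database has its own query syntax, so the result would normally need adapting.

```python
def build_search_string(term_groups):
    """Join synonyms with OR inside a group and join the groups with AND."""
    clauses = []
    for synonyms in term_groups:
        # Quote multi-word terms so they are treated as phrases.
        quoted = [f'"{term}"' if " " in term else term for term in synonyms]
        clauses.append("(" + " OR ".join(quoted) + ")")
    return " AND ".join(clauses)


# Hypothetical concept groups derived from a research question.
groups = [
    ["regression testing", "regression test"],
    ["selection", "prioritization"],
]
query = build_search_string(groups)
print(query)
# ("regression testing" OR "regression test") AND (selection OR prioritization)
```

Adding synonyms to a group widens the search (more hits, more false positives), while adding a group narrows it, which is the trade-off described above.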

Using multiple databases is a necessity to cover all relevant literature, but it also creates duplicates, which must be identified and removed. At the end, it must be accepted that the papers found are a sample of the population of all papers on a specific topic. The key issue is that the sample is indeed from the intended population.
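Duplicate removal can be partly automated. The following is a minimal sketch, assuming each search hit is a dictionary with a `title` and an optional `doi`; the record layout, the database names, and the DOIs are made up for illustration. Matching on normalized titles is a heuristic and will miss duplicates whose titles differ, so a manual check is still advisable.

```python
import re


def normalize_title(title):
    """Lowercase a title and collapse punctuation and whitespace."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def remove_duplicates(records):
    """Keep the first record seen for each DOI or normalized title."""
    seen = set()
    unique = []
    for record in records:
        keys = {("title", normalize_title(record["title"]))}
        if record.get("doi"):
            keys.add(("doi", record["doi"].lower()))
        is_duplicate = bool(keys & seen)
        seen |= keys  # remember every key, also those of duplicates
        if not is_duplicate:
            unique.append(record)
    return unique


# Hypothetical hits from two databases; the DOIs are invented.
hits = [
    {"title": "A Survey of Code Review", "doi": "10.1000/xyz1", "source": "DB-A"},
    {"title": "A survey of code review.", "doi": None, "source": "DB-B"},
    {"title": "Test Prioritization in Practice", "doi": "10.1000/xyz2", "source": "DB-A"},
]
unique_hits = remove_duplicates(hits)
print(len(hits), "hits,", len(unique_hits), "unique")
```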

Published primary studies tend to suffer from publication bias, which means that (in some sense) positive results are more likely to be published than negative results. Hence, grey literature, such as technical reports, theses, rejected publications, and work in progress, should also be searched for [96].

The search results and a log of the actions taken should be stored, preferably using a reference management system.

Selection of primary studies. The basis for the selection of primary studies is the inclusion and exclusion criteria. The criteria should be developed beforehand, to avoid bias. However, they may have to be adjusted during the course of the selection, since not all aspects of inclusion and exclusion are apparent in the planning stage.

The identified set of candidate studies is assessed against the selection criteria. For some studies, it is sufficient to read the title or abstract to judge the paper, while other papers need a more thorough analysis of, for example, the methodology or conclusions to determine their status. Structured abstracts [30] may help the selection process.

As the selection process is a matter of judgment, even with well-defined selection criteria, it is advised that two or more researchers assess each paper, or at least a random sample of the papers. The inter-rater agreement may then be measured using the Cohen Kappa statistic [36] and reported as part of the quality assessment of the systematic literature review. However, it should be noted that a relatively high Cohen Kappa statistic may be obtained because many papers found in the automatic search are easily excluded by the researchers when assessing them manually. Thus, it may be important to conduct the assessment in several steps, i.e., start by removing those papers that are obviously not relevant although found in the search.
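To make the agreement measure concrete, the sketch below computes Cohen's Kappa for two researchers who each classified the same papers. The function name and the decisions ("I" for include, "E" for exclude) are illustrative assumptions, not data from any study.

```python
from collections import Counter


def cohen_kappa(ratings_a, ratings_b):
    """Cohen's Kappa for two raters who judged the same items."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("both raters must judge the same non-empty set of items")
    n = len(ratings_a)
    # Proportion of items on which the raters agree.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Agreement expected by chance, from each rater's category frequencies.
    freq_a = Counter(ratings_a)
    freq_b = Counter(ratings_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical category throughout
    return (observed - expected) / (1 - expected)


# Hypothetical include (I) / exclude (E) decisions for ten papers.
rater_1 = ["I", "I", "E", "E", "E", "E", "I", "E", "E", "E"]
rater_2 = ["I", "E", "E", "E", "E", "E", "I", "E", "E", "I"]
kappa = cohen_kappa(rater_1, rater_2)
print(round(kappa, 2))  # about 0.52
```

Here the raters agree on 8 of 10 papers, but since 0.58 agreement is expected by chance, Kappa is considerably lower than the raw agreement of 0.8.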

Study quality assessment. Assessing the quality of the primary studies is important, especially when the studies report contradictory results. The quality of the primary studies may be used to analyze the cause of contradicting results or to weight the importance of individual studies when synthesizing results.

There is no universally agreed and applicable definition of “study quality”. Attempts to transfer quality criteria from medicine did not fit the quality range of software engineering studies [47].

The most practically useful means for quality assessment are checklists, even though their empirical underpinning may be weak. A study by Kitchenham et al. also showed that at least three reviewers are needed to make a valid assessment [105]. Checklists used in quality assessment of empirical studies are available in the empirical software engineering literature [96,105,145].

The quality assessment may lead to some primary studies being excluded, if the study quality is part of the selection criteria. It is also worth noting that the quality of the primary studies should be assessed, not the quality of the reporting. However, it is often hard to judge the quality of a study if it is poorly reported. Contact with the authors may be needed to find or clarify information that is lacking in the reports.

Data extraction and monitoring. Once the list of primary studies is decided, the data from the primary studies is extracted. A data extraction form is designed to collect the information needed from the primary study reports. If the quality assessment data is used for study selection, the extraction form is separated into two parts, one for quality data, which is filled out during quality assessment, and one for the study data to be filled out during data extraction.

The data extraction form is designed based on the research questions. For pure meta-analytical synthesis, the data is a set of numerical values, representing the number of subjects, object characteristics, treatment effects, confidence intervals, etc. For less homogeneous sets of studies, more qualitative descriptions of the primary studies must be included. In addition to the raw data, the name of the reviewer, the date of data extraction, and publication details are logged for each primary study.
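A data extraction form can be represented as a simple record type. The sketch below is one possible layout, assuming a quantitative synthesis; the class name `ExtractionRecord`, its field names, and the example values are hypothetical and would in practice follow from the research questions.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class ExtractionRecord:
    # Logged for every primary study.
    study_id: str
    reviewer: str
    extraction_date: date
    publication: str
    # Numerical data for meta-analytical synthesis (None if not reported).
    n_subjects: Optional[int] = None
    effect_size: Optional[float] = None
    ci_low: Optional[float] = None
    ci_high: Optional[float] = None
    # Free-text description for less homogeneous sets of studies.
    notes: str = ""

    def missing_fields(self):
        """Names of numerical fields the study report did not provide."""
        numeric = ("n_subjects", "effect_size", "ci_low", "ci_high")
        return [name for name in numeric if getattr(self, name) is None]


# Hypothetical record for one primary study.
record = ExtractionRecord(
    study_id="S1",
    reviewer="Reviewer A",
    extraction_date=date(2024, 1, 15),
    publication="Hypothetical conference paper",
    n_subjects=24,
)
print(record.missing_fields())
```

A list of missing fields of this kind indicates which values would have to be estimated or requested from the authors.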

The data extraction form should be piloted before being applied to the full set of primary studies. If possible, the data extraction should be performed independently by two researchers, at least for a sample of the studies, in order to assess the quality of the extraction procedure.

If a primary study is published in more than one paper, for example, if a conference paper is extended to a journal version, only one instance should be counted as a primary study. In most cases, the journal version is preferred, as it is the most complete, but both versions may be used in the data extraction. Supporting technical reports or communication with authors may also serve as data sources for the extraction.

Data synthesis. The most advanced form of data synthesis is meta-analysis. This refers to statistical methods being applied to analyze the outcome of several independent studies. Meta-analysis assumes that the synthesized studies are homogeneous, or that the cause of the inhomogeneity is well known [135]. A meta-analysis compares effect sizes and p values to assess the synthesized outcome. It is primarily applicable to replicated experiments, if any, due to the requirement on homogeneity.

Fig. 4.1 An example funnel plot for 12 hypothetical studies (axes: effect size and 1/variance)

In summary, the studies to be included in a meta-analysis must [135]:

• Be of the same type, for example, formal experiments

• Have the same test hypothesis

• Have the same measures of the treatment and effect constructs

• Report the same explanatory factors

Meta-analysis procedures involve three main steps [135]:

1. Decide which studies to include in the meta-analysis.

2. Extract the effect size from the primary study report, or estimate it if no effect size is published.

3. Combine the effect sizes from the primary studies to estimate and test the combined effect.

In addition to the primary study selection procedures presented above, the meta-analysis should include an analysis of publication bias. Such methods include the funnel plot, as illustrated in Fig. 4.1, where observed effect sizes are plotted against a measure of study size, for example, the inverse of the variance or another dispersion measure (see Sect. 10.1.2). The data points should scatter in a ‘funnel’ pattern if the set of primary studies is complete. Gaps in the funnel indicate that some studies were not published or not found [135].
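The coordinates of a funnel plot follow directly from the extracted data. The sketch below computes them using the inverse of the variance as the measure of study size; the function name and the numbers are illustrative assumptions, and the resulting pairs could be passed to any plotting tool.

```python
def funnel_points(effects, variances):
    """Return (effect size, 1/variance) pairs, one per primary study."""
    if len(effects) != len(variances):
        raise ValueError("one variance is needed for each effect size")
    return [(effect, 1.0 / var) for effect, var in zip(effects, variances)]


# Hypothetical effect sizes and variances for four studies.
points = funnel_points([0.5, 1.0, 1.5, 1.0], [0.25, 0.5, 0.25, 0.125])
for effect, precision in points:
    print(f"effect={effect:.2f}  1/variance={precision:.1f}")
```

Large studies (small variance) end up high in the plot and close to the combined effect, while small studies spread out lower down, which gives the funnel shape.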

The effect size is an indicator, independent of the unit or scale that is used in each of the primary studies. It depends on the type of study, but could typically be the difference between the mean values of each treatment. This measure must be normalized to allow for comparisons with other scales, that is, divided by the combined standard deviation [135].
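As a worked example of such a normalized measure, the sketch below computes a standardized mean difference. It assumes that the "combined standard deviation" is the pooled standard deviation of the two treatment groups, which is one common interpretation; the function name and the numbers are made up.

```python
import math


def standardized_mean_difference(mean_t, mean_c, sd_t, sd_c, n_t, n_c):
    """Difference between group means divided by the pooled standard deviation."""
    pooled_var = ((n_t - 1) * sd_t ** 2 + (n_c - 1) * sd_c ** 2) / (n_t + n_c - 2)
    return (mean_t - mean_c) / math.sqrt(pooled_var)


# Hypothetical study: treatment mean 12, control mean 10,
# standard deviation 4 in both groups, 20 subjects per group.
d = standardized_mean_difference(12.0, 10.0, 4.0, 4.0, 20, 20)
print(d)  # 0.5
```

Because the difference is expressed in standard deviations, the value 0.5 can be compared with effect sizes from studies that measured the effect on a different scale.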

The analysis assumes homogeneity between studies, and is then done with a fixed effects model. The meta-analysis estimates the true effect size by calculating an average value of the individual study effect sizes, which are averages themselves.

There are tests to identify heterogeneity, such as the Q test and the Likelihood Ratio test, which should be applied to ensure model conditions are met [135].
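The fixed effects estimate and the Q statistic can be sketched as follows. The sketch assumes inverse-variance weighting, a common choice that the text above does not prescribe; the function name and the data are illustrative.

```python
import math


def fixed_effect(effects, variances):
    """Inverse-variance weighted mean, its standard error, and the Q statistic."""
    weights = [1.0 / v for v in variances]
    total = sum(weights)
    pooled = sum(w * e for w, e in zip(weights, effects)) / total
    std_error = math.sqrt(1.0 / total)
    # Q sums the weighted squared deviations from the pooled effect.
    q = sum(w * (e - pooled) ** 2 for w, e in zip(weights, effects))
    return pooled, std_error, q


# Hypothetical effect sizes and variances for three studies.
pooled, std_error, q = fixed_effect([0.2, 0.4, 0.6], [0.04, 0.04, 0.04])
print(round(pooled, 3), round(std_error, 3), round(q, 3))
```

Under homogeneity, Q approximately follows a chi-square distribution with k - 1 degrees of freedom for k studies, so a Q value far above k - 1 suggests that the fixed effects model is not appropriate.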

For inhomogeneous data, there is a random effects model, which allows for variability due to an unknown factor that influences the effect sizes of the primary studies. This model provides estimates both for the sampling error, as does the fixed effects model, and for the variability in the inhomogeneous sub-populations.
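One widely used way to fit a random effects model is the DerSimonian-Laird estimator, sketched below. The choice of this estimator is an assumption made for illustration and is not prescribed by the text; the function name and the data are made up.

```python
import math


def random_effects(effects, variances):
    """Random effects estimate using the DerSimonian-Laird between-study variance."""
    k = len(effects)
    w = [1.0 / v for v in variances]
    sum_w = sum(w)
    fixed = sum(wi * e for wi, e in zip(w, effects)) / sum_w
    q = sum(wi * (e - fixed) ** 2 for wi, e in zip(w, effects))
    c = sum_w - sum(wi ** 2 for wi in w) / sum_w
    # Between-study variance; truncated at zero when Q is below k - 1.
    tau2 = max(0.0, (q - (k - 1)) / c) if c > 0 else 0.0
    # Each study's weight now reflects sampling error plus between-study variance.
    w_star = [1.0 / (v + tau2) for v in variances]
    pooled = sum(wi * e for wi, e in zip(w_star, effects)) / sum(w_star)
    std_error = math.sqrt(1.0 / sum(w_star))
    return pooled, std_error, tau2


# Two hypothetical studies with clearly different effect sizes.
re_pooled, re_std_error, tau2 = random_effects([0.0, 0.4], [0.01, 0.01])
print(round(re_pooled, 3), round(re_std_error, 3), round(tau2, 3))
```

In this example the between-study variance is larger than the sampling variance of either study, so the standard error of the combined effect is considerably wider than a fixed effects model would report.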

Fig. 4.2 An example forest plot for three hypothetical studies (rows: Study 1, Study 2, Study 3; horizontal axis from –0.2 to 0.3, labeled “Favors control” and “Favors intervention”)

Less formal methods for data synthesis include descriptive or narrative synthesis. These methods tabulate data from the primary studies in a manner that sheds light on the research question. As a minimum requirement on tabulated data, Kitchenham and Charters propose that the following items be presented [96]:

• Sample size for each intervention

• Estimates of effect size for each intervention with standard errors for each effect

• Difference between the mean values for each intervention, and the confidence interval for the difference

• Units used for measuring the effect

Statistical results may be visualized using forest plots. A forest plot presents the means and variance of the difference between treatments for each study. An example forest plot is shown in Fig. 4.2.
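The rows of a forest plot can be derived from the tabulated data. The sketch below computes, for each study, the difference between treatments and an approximate 95% confidence interval, assuming a normal approximation (z = 1.96); the function name and the study values are hypothetical.

```python
def forest_rows(studies, z=1.96):
    """Return (name, difference, lower, upper, crosses_zero) for each study."""
    rows = []
    for name, diff, std_error in studies:
        low = diff - z * std_error
        high = diff + z * std_error
        rows.append((name, diff, low, high, low <= 0.0 <= high))
    return rows


# Hypothetical (name, mean difference, standard error) triples.
rows = forest_rows([
    ("Study 1", 0.10, 0.05),
    ("Study 2", -0.05, 0.06),
    ("Study 3", 0.15, 0.04),
])
for name, diff, low, high, crosses_zero in rows:
    print(f"{name}: {diff:+.2f} [{low:+.3f}, {high:+.3f}] crosses zero: {crosses_zero}")
```

In the plot, each row becomes a point with a horizontal line for the interval; an interval that crosses zero means that the study on its own does not favor either treatment.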

Synthesizing inhomogeneous studies and mixed-method studies requires qualitative approaches. Cruzes and Dybå [39] surveyed secondary studies in software engineering that included synthesis of empirical evidence. They identified several synthesis methods, many of them from medicine, of which seven were used in software engineering. These methods are briefly introduced below. For more detail, refer to Cruzes and Dybå [39] and related references.

• Thematic analysis is a method that aims at identifying, analyzing and reporting patterns or themes in the primary studies. At minimum, it organizes and presents the data in rich detail, and interprets various aspects of the topic under study.

• Narrative synthesis, mentioned above, tells a ‘story’ which originates from the primary evidence. Raw evidence and interpretations are structured, using for example tabulation of data, groupings and clustering, or vote-counting as a descriptive tool. Narrative synthesis may be applied to studies with qualitative or quantitative data, or combinations thereof.

• The comparative analysis method is aimed at analyzing complex causal connections. It uses Boolean logic to explain relations between cause and effect in the primary studies. The analysis lists necessary and sufficient conditions in each of the primary studies and draws conclusions from presence/absence of independent variables in each of the studies. This is similar to Noblit and Hare’s [127] Line of argument synthesis, referred to by Kitchenham and Charters [96].

• The case survey method was originally defined for case studies, but may apply to inhomogeneous experiments too. It aggregates existing research by applying a survey instrument of specific questions to each primary study [114], similar to the data extraction mentioned above. The data from the survey is quantitative, and hence the aggregation is performed using statistical methods [108].

• Meta-ethnography translates studies into one another, and synthesizes the translations into concepts that go beyond individual studies. Interpretations and explanations in the primary studies are treated as data in the meta-ethnography study. This is similar to Noblit and Hare’s [127] Reciprocal translation and Refutational synthesis, referred to by Kitchenham and Charters [96].

• Meta-analysis is, as mentioned above, based on statistical methods to integrate quantitative data from several cases.

• Scoping analysis aims at giving an overview of the research in a field, rather than synthesizing the findings from the research. Scoping studies are also referred to as mapping studies, which are further discussed in Sect. 4.4.
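The Boolean reasoning behind the comparative analysis method in the list above can be illustrated with a small sketch. It treats each primary study as a case with present/absent conditions and checks whether a condition is necessary or sufficient for an outcome. This is a strongly simplified illustration; the function names, condition names, and cases are hypothetical.

```python
def is_necessary(cases, condition, outcome):
    """True if the condition is present in every case where the outcome is present."""
    return all(case[condition] for case in cases if case[outcome])


def is_sufficient(cases, condition, outcome):
    """True if the outcome is present in every case where the condition is present."""
    return all(case[outcome] for case in cases if case[condition])


# Hypothetical primary studies coded by presence/absence of two conditions.
cases = [
    {"tool_support": True, "training": True, "improved_quality": True},
    {"tool_support": True, "training": False, "improved_quality": False},
    {"tool_support": False, "training": True, "improved_quality": True},
    {"tool_support": False, "training": False, "improved_quality": False},
]
for condition in ("tool_support", "training"):
    print(
        condition,
        "necessary:", is_necessary(cases, condition, "improved_quality"),
        "sufficient:", is_sufficient(cases, condition, "improved_quality"),
    )
```

Note that both checks are vacuously true when no case has the outcome (or the condition), so the conclusions are only as strong as the set of studies that supports them.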

Independently of the synthesis method, a sensitivity analysis should take place to analyze whether the results are consistent across different subsets of studies. Subsets of studies may be, for example, high quality primary studies only, primary studies of a particular type, or primary studies with good reports that present all the detail needed.
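For a quantitative synthesis, a sensitivity analysis may be sketched as recomputing the combined effect for each subset and comparing the results. The example below assumes an inverse-variance weighted mean as the combined effect; the function names, field names, and study data are hypothetical.

```python
def pooled_effect(studies):
    """Inverse-variance weighted mean of the study effect sizes."""
    weights = [1.0 / s["variance"] for s in studies]
    return sum(w * s["effect"] for w, s in zip(weights, studies)) / sum(weights)


def sensitivity_analysis(studies, subsets):
    """Pooled effect for all studies and for each named subset."""
    results = {"all": pooled_effect(studies)}
    for label, keep in subsets.items():
        chosen = [s for s in studies if keep(s)]
        # An empty subset has no combined effect.
        results[label] = pooled_effect(chosen) if chosen else None
    return results


# Hypothetical primary studies with a quality rating and a study type.
studies = [
    {"effect": 0.2, "variance": 0.01, "quality": "high", "type": "experiment"},
    {"effect": 0.4, "variance": 0.01, "quality": "high", "type": "case study"},
    {"effect": 0.9, "variance": 0.01, "quality": "low", "type": "experiment"},
]
results = sensitivity_analysis(studies, {
    "high quality": lambda s: s["quality"] == "high",
    "experiments": lambda s: s["type"] == "experiment",
})
for label, value in results.items():
    print(label, round(value, 3))
```

In this made-up example, the combined effect drops noticeably when the low quality study is left out, which would indicate that the overall result is sensitive to study quality.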

In document Experimental Software Engineering (Page 68-73)