PART II RESEARCH DESIGN AND METHODS
CHAPTER 5 The conduct of the systematic review
The conduct of the systematic review
Prior studies have suggested that instruction in CT can help develop CT abilities of students (McMillan, 1987; Huber & Kuncel, 2016). Most of these studies are conducted in Western countries and with students whose first language is English. These studies also tended to claim that certain instructional approaches are more effective. However, there is no consolidated robust evidence that such instruction can benefit students in higher education whose first language is not English. And the evidence for the most effective approach is unclear.
The systematic review aims to synthesise empirical evidence of the effectiveness of such instruction for ELL students in higher education. And if found to be effective, to identify the most promising approach to teaching CT. This chapter describes the methods involved in conducting the systematic review. It details the steps used in the systematic review from the formulation of the key words, database search, and screening to the appraisal and synthesis of research reports.
5.1 Aim of the present review
The aim of the systematic review is to review and synthesise empirical research to establish whether there is any evidence that instruction in CT can help develop CT skills of EFL/ESL students in post-16 classroom. This age is chosen because students who are studying English as a second or a foreign language will generally have reached an acceptable level of proficiency in English, and so be able to access the instruction in the teaching modules in English.
5.2 Rationale for a systematic review
A systematic review synthesises research evidence based on a clearly pre-defined eligibility criteria and a set protocol. It comprehensively explores the field of research and includes all existing studies that fall within the eligibility criteria in order to answer a research question. As such it minimises the potential for selection bias and publication bias. A systematic review synthesises evidence from previous research already conducted and is particularly useful in areas where the evidence from existing studies is yet unclear (Torgerson, 2003; Petticrew & Roberts, 2006; HM Treasury, 2011). It is
also useful in scoping research in a field. For this reason, the thesis starts with a systematic review to look at the existing evidence.
The systematic review in this thesis starts with a protocol which made explicit the research question, the objectives of the review, the inclusion and exclusion criteria, the search and screening strategy, the quality appraisal and synthesis of evidence. A comprehensive search of relevant databases, in addition to a comprehensive hand search, is done in order to make sure not to exclude any studies that might be of relevance to the research question. The steps involved are explicit, transparent and replicable to ensure that a non-biased and objective assessment of the evidence is presented. Quality appraisal of each included study is an important aspect of a systematic review. This is to ensure that the evidence synthesised from the review is valid and credible. Lumping studies of different designs and different quality as though they were all of the same can lead to misleading results. In this study the criteria for judging each piece of work are clearly spelt out and the process is made transparent to allow for replication (Torgerson, 2003).
5.3 Research questions
The review aims to answer the following research questions:
RQ 1 Is there evidence that instruction in CT can help develop CT skills of ELL in higher education?
RQ2 What is/are the most promising approach(es) to teaching CT skills to ELL in higher education?
5.4 The search strategy
The search strategy involved searching 12 electronic databases and handsearches. This also included following up known studies. The first search was done in October 2014. In order to update this systematic review, another search of each database previously searched was done again in August 2016 and again in November 2018. Some studies that were found in the first search in some databases did not emerge in the second search. A specific search for those studies showed that they were no longer available in the databases where they were originally found.
The search was limited to the following: Date: After 1989
Source type: Conference Paper & Proceedings; Dissertations & Theses; Government & Official Publications; Scholarly Journals
Document type: Article; Conference Paper; Conference Proceeding;
Dissertation/Thesis; Government & Official Document Language: English
Following an initial review to test the sensitivity of the search terms, a standard and very inclusive statement of search terms was used for each database (adjusted to suit the idiosyncrasies of each). Many search iterations of the syntax were run with different search options and limiters to make sure no relevant studies were missed. This statement of search terms was tested, adjusted and retested iteratively to ensure that as little as possible of relevance was missed. In summary, the search was for any materials published or unpublished that mentioned CT skills (or synonym) as outcomes plus any causal term (or synonym) or any research design (such as randomised controlled trial, experiment or regression discontinuity) that would be appropriate for testing a causal model. A key element of the search was to gather grey literature as well, wherever possible to avoid publication bias. The search ‘syntax’ for each database is provided in Table 5.1.
Table 5.1 Search syntax used in each database
Name of database Search words ASSIA (Applied Social Sciences
Index and Abstracts) British Periodicals Social Science Database ERIC
International Bibliography of the Social Sciences (IBSS) Periodicals Archive Online ProQuest Dissertations &
Theses: UK & Ireland ProQuest Dissertations &
Theses Global Education Database
"critical thinking" OR "critical reasoning" OR "higher-order thinking" OR "rational thinking" OR "analytical thinking" OR "cognitive skills" OR argument* OR debate* OR "thinking skills" OR criticality OR "thinking skills" in [Document title - TI]
AND
"language teaching" OR "language learning" OR "foreign language" OR L2 OR L1 OR "second language" OR ESL OR EFL OR "target language" OR "English language" OR "language skills" in [Anywhere]
AND
intervention OR experiment* OR "quasi-
experiment*" OR "difference in differences" OR study OR "randomised controlled trial" OR "regression discontinuity" OR "factorial" OR "controlled study" in [Anywhere]
PsychINFO
British Education Index
"critical thinking" OR "critical reasoning" OR "higher-order thinking" OR "rational thinking" OR "analytical thinking" OR "cognitive skills" OR argument* OR debate* OR “thinking skills” OR criticality OR “thinking skills” [TI Title]
AND
"language teaching" OR "language learning" OR "foreign language" OR L2 OR L1 OR "second language" OR ESL OR EFL OR “target language” OR “English language” OR “language skills” [TX All Text]
AND
intervention OR experiment* OR "quasi-
experiment*" OR "difference in differences" OR study OR “randomised controlled trial” OR “regression discontinuity” OR “factorial” OR “controlled study” [TX All Text]
Web of Science "critical thinking" OR "critical reasoning" OR "higher-order thinking" OR "rational thinking" OR "analytical thinking" OR "cognitive skills" OR argument* OR debate* OR “thinking skills” OR criticality OR “thinking skills” [Title]
AND
"language teaching" OR "language learning" OR "foreign language" OR L2 OR L1 OR "second language" OR ESL OR EFL OR “target language” OR “English language” OR “language skills” [Topic] AND
intervention OR experiment* OR "quasi-
experiment*" OR "difference in differences" OR study OR “randomised controlled trial” OR “regression discontinuity” OR “factorial” OR “controlled study” [Topic]
JSTOR "critical thinking" OR "critical reasoning" OR "higher-order thinking" OR "rational thinking" OR "analytical thinking" OR "cognitive skills" OR argument* OR debate* OR “thinking skills” OR criticality OR “thinking skills” (item title) AND
"language teaching" OR "language learning" OR "foreign language" OR L2 OR L1 OR "second language" OR ESL OR EFL OR “target language” OR “English language” OR “language skills” (full- text)
AND
intervention OR experiment* OR "quasi-
experiment*" OR "difference in differences" OR study OR “randomised controlled trial” OR “regression discontinuity” OR “factorial” OR “controlled study” (full-text)
Other searches
In addition a comprehensive and thorough search using Google and Google Scholar was conducted. The search included a combination of the same search terms used in the databases. However, handsearching could not be done in the same systematic way that was done in the databases, so various combinations of search terms were done in order to be able to find as many articles as possible. In addition, reference sections of studies were also examined for more titles.
All relevant records from the electronic database searches were exported to Zotero (a free, open source reference manager software). Research reports from hand searches were then added to the records in Zotero.
5.5 Screening
Once relevant studies were identified, they were screened for duplicates and relevance first by titles and abstracts. This was done by applying a pre-set inclusion and exclusion criteria. Only studies that were of an experimental nature were included because the aim of the systematic review was to identify approaches that can improve CT skills. Therefore, only studies that can establish causation were considered. These would be randomised controlled trials, quasi-experiments using matched comparison design, regression discontinuity, propensity score matching, difference in difference or similar.
Inclusion criteria
Included studies must:
Be concerned with instruction in CT Test CT skills as an outcome
Be concerned with courses related to ESL (English as a second language) or EFL (English as a foreign language)
Be conducted in post-16 education setting
Be empirical (i.e. not opinion pieces or promotional literature)
Use experimental or quasi-experimental designs (e.g. randomised control trials, quasi-experiments using matched comparison design, regression discontinuity, propensity score matching, difference in difference or similar)
Be published between 1990 and 2018 Be published or reported in English
Exclusion criteria
The focus of this review is on courses where English is used as a foreign or second language.
Studies were excluded if they were:
Not empirical research (i.e. opinion pieces, instructional manuals/guidance about how to teach CT or promotional literature about CT or about the theory of CT) Not related to teaching English as a foreign or a second language
About teaching English for Academic Purposes (EAP) About the teaching of grammar and phonology
About teaching literature
Related to computers, technology, and the internet as learning tools
About testing English language skills, such as Test of English as a Foreign Language (TOEFL) and the International English Language Testing System (IELTS)
Related to primary and middle school students (under 16 years old) Related specifically to students with learning disability such as dyslexia Related to gifted students
Related to metacognitive skills Simply about assessment of CT skills
Very short intervention lasting less than a month
Short intervention spanning less than one month were excluded because instruction in CT is not a simple task and it requires time for students to get exposure to various skills and to absorb those skills. A meta-analysis by Niu, Behar-Horenstein, and Garvan (2013) shows that interventions that lasted more than 12 weeks reported a bigger gain than those that lasted less than 12 weeks. Also, interventions that lasted less than a month suggest that the pre- and post- tests would be very close. This might introduce practice effect and render the results less valid.
5.6 Data extraction
When all the studies retrieved have been screened for relevance and duplicates have been removed, the full paper/research report was read and relevant information was extracted.
The criteria are based on general guidelines for conducting systematic reviews (Torgerson, 2003; See, n.d.). They provide detailed guidance in the evaluation of each step of the research process. Detailed information about each study is provided in Appendix 2.
Data extraction involved noting information about all aspects of the research design which include matters pertaining to the sampling strategy, the randomisation process, and the instrument used to assess the outcome measure.
Research design
Since the review question is a causal one, it is essential that only studies that use experimental or quasi-experimental studies are analysed. To extract data about the research design, the following guiding questions were used:
Is it experimental or quasi-experimental research? Are groups randomised or matched?
How is randomisation carried out? Is randomisation done at the individual or group level?
If randomisation is done at group level, are there enough subgroups in the control and experimental groups?
Is there a control and an experimental group? Is there a pre-test and a post-test?
In case of essays or tests requiring open-ended answers, are raters blinded to ensure objectivity of results?
What is the duration of the intervention?
Sample
Examination of the sample involved extracting data about the sampling strategy, balance between the groups, and attrition rate using the following questions:
What is the sampling strategy (how was the sample identified; were they students from the researchers’ own class; were participants volunteers)?
What is the sample size?
What is the number in each group?
Is there attrition? What is the number of participants who dropped out from each group?
In the case of nonrandom allocation, is there any attempt at ensuring that the two groups are equivalent at baseline?
Outcome measures
Data was also extracted regarding the outcome measured and the instrument used to measure the outcome using the questions below:
What are the outcomes? How are they measured?
Are participants assessed using teacher-developed tests, researcher-developed tests, intervention-related tests or commercially produced tests?
Information was also extracted to establish the rigor with which the study was conducted, whether data was appropriately analysed, and whether the conclusions were warranted.
Rigor of the study
The rigor of the study is judged based on the following: Is the research question clearly stated?
Is the choice of research design appropriate for the research question(s)? In case of randomisation, is the procedure clearly explained?
Is the conclusion reached warranted by the evidence?
Do the research methods used eliminate any simpler alternative explanations for the findings of the study?
Is analysis appropriate?
Data was also extracted about the kind of analyses used and whether they were appropriate and if there is any evidence of data dredging. This helped to assess the trustworthiness of the study. In extracting the data about analysis the following questions were asked:
How is the data analysed?
Is the statistical analysis appropriate to the research design? Does the statistical analysis help answer the research question(s)?
Is there a pre-test and post-test comparison or is baseline equivalence established?
Is there evidence of data dredging (i.e. exhaustively searching for combinations of variables to explain a correlation which can lead to biased or false results)? Is effect size calculated? If not, are means and standard deviations presented to
allow for calculation of effect size?)
Results
The findings in each study were examined based on the following questions: What are the major findings regarding development of CT skills? Are all the results presented or are only favourable ones presented? Are they logical and convincing?
Comments and limitations of the studies
Data extraction also involved any comments that the two assessors found to affect the validity of the study. Comments were written on aspects of the study that might affect the internal and external validity of the experiment including sample size, level of dropout, blinding, and confounding variables. Comments were guided by the following question:
Were threats to validity, such as demoralisation, Hawthorne effect, regression to the mean, bias in treatment, experimenter effect, teacher effect, conflict of interests, or diffusion of treatment dealt with properly?
5.7 Judging the quality of studies
Assessment of the quality of studies is essential in judging the trustworthiness of the findings of the study. The data extracted in the previous steps facilitate this judgement. In this review, the ratings were completed by two raters (myself and my supervisor). We both rated all the 36 studies individually and then compared our ratings. Where there was a disparity we explained our ratings and an agreement was reached.
Each of the included studies is rated to judge the strength of its evidence using a set of criteria based on the design of the study, sample size, attrition, how outcomes are measured and threats to validity. This is done using the “Sieve” proposed by Gorard (2014). This ignores the source of any publication or the status of its author or funder as any guarantee of research quality. Instead the quality of evidence for each of the
included studies is judged by applying the “Sieve”. This step is essential since much of education policy so far has been based on incorrect, misleading or incomplete evidence.
The “Sieve” works by judging the study first by its design (see Table 5.2) and rating moves from left to right and from top to bottom. For example, a study with a good design will drop one star (from 4* to 3*) if the sample is small (e.g. under 50 in each arm). Or if a RCT starts with a large number but attrition is over 25% then it will drop from 4* to 2*. There is no magic number as to what the minimal attrition or sample size should be. It is based on judgement. For example, it is clear that a study with 100 participants individually randomised will be stronger than one with 20 children in two randomised classes (the number of cases here is 2 and not 100). Similarly attrition at 25% is more likely to affect the results of the study than an attrition of under 5%. Therefore, the “Sieve”, as Gorard, See, and Siddiqui (2017) contend, is left vague on purpose so that the reviewer would make a sensible judgement about the study being reviewed. The "Sieve" was used in this review as it is a simple and transparent way to judge the trustworthiness of research while allowing some flexibility to detect any evidence of bias in the study under examination.
Table 5.2 The "Sieve" to assist judging the trustworthiness of single research reports
Design Scale Dropout Outcomes Fidelity Validity Rating
Fair design for comparison Large number of cases per comparison group Minimal attrition, no evidence of impact on findings Standardised pre-specified independent outcome Clear intervention, uniform delivery No evidence of diffusion or other threat 4 Balanced comparison Medium number of cases per comparison group Some initial imbalance or attrition Pre-specified outcome, not standardised or not independent Clear intervention, unintended variation in delivery Little evidence of diffusion or other threat 3 Matched comparison Small number of cases per comparison group Initial imbalance or moderate attrition Not pre- specified but valid outcome Unclear intervention, with variation in delivery Evidence of experimenter effect, diffusion or other threat 2 Comparison with poor or no Very small number of cases per Substantial imbalance and/or high Outcome with issues of validity or Poorly specified intervention Strong indication of experimenter 1
equivalence comparison group
attrition appropriateness effect, diffusion or other threat No report of comparator A trivial scale of study, or N unclear Attrition not reported or too high for any comparison Too many outcomes, weak measures, or poor reliability No clearly defined intervention No consideration of threats to validity 0
Source: Gorard et al., 2017
The “Sieve” is based on six criteria. The design of a study is the most important to determine whether the findings are trustworthy or not and whether the research design