
This section will discuss the concepts of reliability and validity in research, specifically in relation to my own studies. In general, reliability refers to consistency of measurement, and validity refers to the extent to which an instrument measures what you intend it to measure. However, these terms can be divided into many sub-types, as will be seen below.

[Figure: pre-test and post-test scores on the dependent measure for the experimental and control groups.]

Figure 3.4: Possible quasi-experimental design outcome 4.

3.6.1 Reliability

Beginning with test-retest reliability, the issue is whether the concept you are measuring is being measured consistently across time (Christensen, 2000). If, for example, you measured your participants’ heights twice and got a different measurement each time, then unless they were on some sort of growth-spurt drugs, you could conclude that the way in which you are measuring height is not very reliable. Unless the construct you are measuring is theorised to be unstable, it is desirable that your measure provides the same results on different occasions.

How, then, can you ensure that your measures have high reliability? Changes in measurements can be the result of either systematic error (a changing factor of the situation that biases the results) or random error (Christensen, 2000). As the name suggests, random error happens for no particular reason and so cannot be controlled. The way in which to increase reliability, therefore, is to tightly control the experimental situation, ensuring that it is as similar as possible for all participants every time the study is run.

The main way in which reliability is maximised in the longitudinal study reported in this thesis is by ensuring that all participants complete the tasks under the same exam-style conditions at every testing point. This includes working alone, in silence, and having the same amount of time to complete the tasks. This control should reduce the chance of any distractions or time pressure interfering with performance. It would be a problem if, for example, the tasks were completed in silence at the first testing point and in a noisy environment at the second testing point – this could inhibit performance the second time round, giving the misleading impression that the participants had become systematically worse at reasoning, or it could hide any real improvements. As it is, any change that is found should be the product of an actual change in ability, not an error in measurement.

Internal reliability is the extent to which all items in a measure are related, i.e. consistently measuring the same construct (Heiman, 2002). If you have a measure made up of multiple items, it is a good idea that all items measure the same construct to some extent, although it is also desirable that each item brings something slightly unique as well. This ensures that the measure is useful, interpretable and not unnecessarily long. To take an example, suppose you have a 15-item task that is supposed to measure attitudes towards immigration, but three of the items actually measure attitudes towards emigration. Attitudes towards the two things may be separate and unrelated, so scores on the 12 immigration items may not be correlated with scores on the three emigration items, and this would give the task low internal reliability. Note how this is different from test-retest reliability – the immigration task may produce consistent results if re-administered every week, but the items within the task are not producing consistent responses.

Internal reliability can be measured with a split-half analysis or Cronbach’s alpha after the task has been completed by a number of participants (Novick & Lewis, 1967). In a split-half analysis, the items on a task are split into two groups, usually by alternating trials or at the midpoint, and participants’ scores on the two halves are then subjected to a correlation analysis. Cronbach’s alpha computes a similar correlation, but averages across every possible combination of test halves. The resulting correlation, or reliability coefficient, from either method is considered to be good when it is over .8 (Heiman, 2002), while a value below .7 would indicate questionable internal reliability, although these are just rules of thumb (George & Mallery, 2003). Split-half and Cronbach’s alpha analyses are reported for the reasoning measures used in later chapters.
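Both statistics can be computed directly from a participants-by-items score matrix. The sketch below is illustrative rather than a record of the analyses actually run for this thesis; the function names are mine, the odd–even split is one common choice, and the Spearman–Brown correction applied to the half-test correlation is a standard step not described above.

```python
import numpy as np

def split_half(scores):
    """Split-half reliability: correlate odd-item and even-item half-scores.

    The final line applies the Spearman-Brown correction, which adjusts the
    half-test correlation up to the reliability of the full-length test.
    """
    scores = np.asarray(scores, dtype=float)
    odd_half = scores[:, 0::2].sum(axis=1)   # items 1, 3, 5, ...
    even_half = scores[:, 1::2].sum(axis=1)  # items 2, 4, 6, ...
    r = np.corrcoef(odd_half, even_half)[0, 1]
    return 2 * r / (1 + r)

def cronbach_alpha(scores):
    """Cronbach's alpha for an (n_participants, n_items) score matrix."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]
    item_variances = scores.var(axis=0, ddof=1)
    total_variance = scores.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)
```

By the rules of thumb cited above, a coefficient above .8 from either function would indicate good internal reliability, and a value below .7 would be questionable.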

3.6.2 Validity

Validity is a slightly broader concept than reliability, encompassing external validity and internal validity. External validity is the extent to which a study generalises to other people and other situations, and it was touched upon in Section 3.4. It includes ecological validity – whether the study closely resembles a real-life situation – and temporal validity – whether the findings would apply in the past and future as well as the present (Christensen, 2000). Reasoning is a cognitive process, and cognitive psychology assumes that all humans are born with the same cognitive processing systems (Miller, 2011; Neisser, 1967). This means that findings related to cognitive processing are assumed to apply across time and across the species. That said, this assumes that the manipulation is the same. If other people, at other times, studied the current UK A level mathematics syllabus, then we could hope for the results to be the same. This is not to say that the results can generalise to students studying different mathematics syllabi in non-UK education systems, or in the past or future.

In terms of ecological validity, one issue is whether the measures of reasoning ability used in my research are relevant to the TFD claims about the sort of reasoning that is valued by the job market. Are pen-and-paper tasks valid measures of the types of reasoning skills that mathematics graduates might demonstrate in their future jobs? On the one hand, it is unlikely that the tasks closely resemble tasks that would be encountered in day-to-day life. On the other hand, it can be argued that an improvement in logical reasoning skills in general (as suggested by the TFD) would be demonstrated on any logical reasoning task because of the universal nature of cognitive processing (Miller, 2011). Since the TFD does not specify the exact types of logical reasoning skills that are improved by studying mathematics, this is the best we can hope for.

As Mook (1983) argued, however, ecological validity is often misunderstood and is not necessary if the research is designed to test a theory as opposed to generalising directly to the real world. In psychological research, ecological validity is often compromised for the sake of internal validity (absence of confounding variables). This allows a hypothesis derived from a theory to be tested effectively. It is simply not possible to control for all confounding variables in a real-world setting (and it is also difficult to find a real-world setting that is the same as every possible real-world setting to which you want to generalise; Mook, 1983). By accurately testing a hypothesis derived from a theory in a controlled artificial setting, it is possible to support, refute or refine that theory, and use the theory to explain real-world behaviour. In this thesis, I am testing the TFD, and it is that theory, not my data, which generalises to the real world. The TFD argues that studying mathematics improves general reasoning skills. If this is true, then performance on an ‘artificial’ task that requires reasoning should be changed as part of the umbrella development.

As stated previously, internal validity is essentially the absence of confounding extraneous variables. High internal validity means that a relationship between two factors can be taken to be a relationship between those two factors alone (Heiman, 2002). If a study has low internal validity, there may be extraneous variables that have not been controlled that are influencing the relationship (making them confounding variables). The best way to ensure that a study has high internal validity is to randomly assign participants to conditions so that any extraneous variables are balanced out and cannot systematically bias the results. Another safeguard is to measure any variables that you know may be confounding factors so that you can control for them in the analysis (Christensen, 2000).

As already stated, the quasi-experimental design of the longitudinal study means that participants are not randomly assigned to conditions. As a result, intelligence is likely to be a confounding factor in the study – it may differ between conditions and it is related to reasoning ability. To deal with this issue, intelligence will be measured so that its effects can be accounted for in the analysis.
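One simple way to account for a measured covariate is to regress its linear effect out of the outcome before comparing groups. The sketch below illustrates the general idea only, under my own assumptions; it is not the specific analysis used in the thesis, and the function name is hypothetical.

```python
import numpy as np

def residualise(outcome, covariate):
    """Remove the linear effect of a covariate (e.g. an intelligence score)
    from an outcome, returning residuals that can be compared across groups.
    """
    outcome = np.asarray(outcome, dtype=float)
    covariate = np.asarray(covariate, dtype=float)
    # Design matrix with an intercept column, then ordinary least squares.
    X = np.column_stack([np.ones_like(covariate), covariate])
    beta, *_ = np.linalg.lstsq(X, outcome, rcond=None)
    return outcome - X @ beta
```

If group differences in reasoning scores were driven entirely by intelligence, the residuals would be flat; any difference that remains in the residuals cannot be attributed to the measured covariate.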

Internal validity may be compromised by the quasi-experimental design of the longitudinal study. However, although external validity is arguably irrelevant because the study is testing the TFD rather than trying to generalise to real-world settings directly (Mook, 1983), the study does have fairly high external validity: the manipulation (mathematical study) is carried out ‘in the field’ rather than in a laboratory setting, and it deals with a cognitive process that is believed to be common across humans (Miller, 2011).

Chapter 4

Measures of reasoning

4.1 Introduction

The psychology of reasoning has been a strong and growing area of research since the 1960s, when Peter Wason first demonstrated that people systematically fail to behave logically on his famous Selection Task (Wason, 1966). Since that time, a huge amount of research has been conducted with the Wason Selection Task and a collection of other tasks. The aim of this chapter is to justify why the Conditional Inference task was chosen as the primary measure of logical reasoning in this thesis, and why the Belief Bias Syllogisms task was chosen as a secondary measure. In order to do this, the most commonly used tasks in the field are described and discussed.

The chapter is split into two broad sections: judgment and decision making tasks, and deductive reasoning tasks. The deductive reasoning section is further divided into three parts: disjunctive reasoning tasks, conditional reasoning tasks, and syllogisms tasks. After each of the tasks has been described, I will present an argument for why the Conditional Inference and Belief Bias Syllogisms tasks were chosen to measure reasoning in the studies presented in this thesis.