RELIABILITY AND VALIDITY - PASSIVE CERVICAL RANGE OF MOTION

PASSIVE CERVICAL RANGE OF MOTION

5.1.1 RELIABILITY AND VALIDITY

As previously stated, clinicians measure cervical ROM to assess whether range is limited or impaired, to indicate which structures could be causing the limitation or impairment, and to ‘objectively’ measure treatment progress. For a measure of ROM to do all of this, i.e. to be clinically useful, it must be consistently accurate. In other words, the measure needs to be valid and reliable [189].

Reliability and validity can be confusing concepts because of the various synonyms that are used, often interchangeably. For the purpose of this thesis, reliability is defined as the consistency of a measurement across time, patients or observers [190]. Validity is defined as the extent to which the method/tool measures what it is intended to measure [189, 190]. More recently this definition of validity has been widened to focus on the degree of confidence with which inferences can be made about the population the method/tool was used on; a shift of focus from the method/test to the population it is used on.

Several authors have used an analogy of shooting at a target to explain the concepts of reliability and validity, as presented in Figure 12 [189, 191]. In order to be defined as a ‘good shot’ one needs to be accurate and consistent when shooting at a target (A). There is no use in being consistently off-target (B) or inconsistently on-target (C).

Figure 12 – Target analogy for reliability and validity

5.1.1.1 Reliability

The theory of reliability is derived from the discipline of psychology and in particular Classical Test Theory [192]. This theory states that any observed measurement consists of a true value and an error value. It is very rare to find a truly consistent clinical measurement method; all methods contain some error. Only random errors are considered in reliability theory (systematic errors, i.e. predictable errors occurring in one direction only, are normally dealt with under the construct of validity).

Classical Test Theory gives the formula reliability = σ²t / (σ²t + σ²e), where σ²t is the true score variance and σ²e is the error score variance. The result is a unitless number that ranges from zero (all variance due to measurement error, or zero reliability) to one (all variance due to true score, or perfect reliability).
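As a minimal sketch of this ratio, the following Python function computes the reliability coefficient from the two variance components. The function name and the example variances are hypothetical and chosen only for illustration; in practice the true score variance is not observed directly and has to be estimated.

```python
def reliability_coefficient(true_var: float, error_var: float) -> float:
    """Classical Test Theory reliability: true variance / (true + error variance)."""
    if true_var < 0 or error_var < 0:
        raise ValueError("variances must be non-negative")
    total = true_var + error_var
    if total == 0:
        raise ValueError("total variance must be positive")
    return true_var / total


# Hypothetical variances (degrees squared), for illustration only.
print(reliability_coefficient(true_var=90.0, error_var=10.0))  # 0.9
```

With no error variance the function returns one, and with no true score variance it returns zero, matching the two extremes described above.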

Two categories of reliability have traditionally been constructed and tested for methods of ROM measurement: intra-observer reliability, the reliability within a single tester, and inter-observer reliability, the reliability between at least two examiners/populations/settings.


One helpful distinction that has been made, particularly with reference to the correct use of statistical techniques, is that between relative reliability and absolute reliability [193]. Relative reliability informs us whether the values in one set of measurements are ranked in the same order as those in a second set of measurements (also known as association). The limitation of this type of reliability is that readings do not have to agree to produce ‘perfect reliability’; it can therefore exaggerate the degree of reliability.

Absolute reliability is a more recent concept and is concerned with the degree to which repeated measurements vary for individuals (also known as agreement). It is expressed statistically using the Standard Error of Measurement (SEM) or Limits of Agreement (LoA) tests [194].
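The difference between association and agreement can be illustrated with a short sketch. The readings below are hypothetical: a second observer is assumed to read exactly 10 degrees higher than the first on every patient. The Pearson correlation, used here as one simple measure of association, is perfect even though no pair of readings agrees.

```python
from math import sqrt
from statistics import mean


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical rotation readings in degrees; observer B is always 10 degrees higher.
observer_a = [40.0, 50.0, 60.0, 70.0, 80.0]
observer_b = [a + 10.0 for a in observer_a]

association = pearson_r(observer_a, observer_b)
mean_difference = mean(b - a for a, b in zip(observer_a, observer_b))

print(association)      # perfect association
print(mean_difference)  # yet a constant 10 degree disagreement
```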

The Standard Error of Measurement (SEM) is the standard deviation of measurement errors and provides an estimate of the error around a ‘true score’ of a repeated test on an individual for interval data [195]. When the standard deviation and reliability co-efficient are known it can be calculated as follows:

SEM = SD × √(1 − r)

where SD = standard deviation and r = reliability co-efficient.
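A brief worked sketch of this calculation follows, assuming the commonly used form SEM = SD × √(1 − r). The function name and the input values are hypothetical and serve only to show the arithmetic.

```python
from math import sqrt


def standard_error_of_measurement(sd: float, r: float) -> float:
    """SEM = SD * sqrt(1 - r), with r a reliability coefficient between 0 and 1."""
    if sd < 0:
        raise ValueError("standard deviation must be non-negative")
    if not 0.0 <= r <= 1.0:
        raise ValueError("reliability coefficient must lie between 0 and 1")
    return sd * sqrt(1.0 - r)


# Hypothetical example: SD of 10 degrees and a reliability coefficient of 0.91.
sem = standard_error_of_measurement(sd=10.0, r=0.91)
print(round(sem, 2))
```

With these illustrative inputs the SEM is approximately 3 degrees; as r approaches one the SEM shrinks towards zero.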

Limits of Agreement tests combine a graphical technique with basic calculations that allow outliers and bias to be observed relatively quickly and easily. The difference between each pair of measurements is plotted against the mean of the pair; the mean and SD of the differences are then calculated, and finally the 95% limits of agreement, with confidence intervals, are derived [194, 195].
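The calculation steps can be sketched as follows. This is a minimal illustration, assuming the conventional 95% limits of mean difference ± 1.96 × SD of the differences; the function name and readings are hypothetical, and the plot and the confidence intervals around the limits are omitted.

```python
from statistics import mean, stdev


def limits_of_agreement(first, second):
    """Return (lower limit, mean difference, upper limit) for paired measurements."""
    if len(first) != len(second) or len(first) < 2:
        raise ValueError("need two equal-length series with at least two pairs")
    diffs = [a - b for a, b in zip(first, second)]
    bias = mean(diffs)       # mean difference between the two series
    spread = stdev(diffs)    # sample SD of the differences
    return bias - 1.96 * spread, bias, bias + 1.96 * spread


# Hypothetical paired ROM readings in degrees from two measurement occasions.
first = [50.0, 62.0, 71.0, 55.0, 68.0]
second = [52.0, 60.0, 70.0, 57.0, 66.0]

# Points for the plot: mean of each pair (x-axis) against its difference (y-axis).
plot_points = [((a + b) / 2, a - b) for a, b in zip(first, second)]

lower, bias, upper = limits_of_agreement(first, second)
print(round(lower, 2), round(bias, 2), round(upper, 2))
```

A mean difference far from zero would suggest bias between the two sets of readings, while wide limits would suggest poor agreement even when the mean difference is small.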


Figure 13 below displays an example of good agreement [194]. The mean difference is 0.42 percentage points (95% CI 0.13 to 0.70). The limits of agreement are -2.0 and 2.8.

Figure 13 - Limit of Agreement plot example [194]

The Intra-class Correlation Coefficient (ICC) reflects both measurement error and degree of consistency and is currently the most commonly used statistical technique to interpret reliability. It expresses the ratio of variance between subjects to total variance in scores. ICC has several versions [196], the use of which depends on assumptions made about the observers and population observed.
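To make the variance ratio concrete, the sketch below computes one of these versions, the one-way random effects, single-measurement ICC, often written ICC(1,1), from the between-subject and within-subject mean squares of a one-way analysis of variance. The choice of this version is an assumption made for illustration; other versions use different variance components. The function name and data are hypothetical.

```python
from statistics import mean


def icc_one_way(scores):
    """ICC(1,1) for a table with one row per subject and k repeated ratings per row."""
    n = len(scores)
    k = len(scores[0]) if n else 0
    if n < 2 or k < 2 or any(len(row) != k for row in scores):
        raise ValueError("need at least two subjects, each with the same k >= 2 ratings")
    grand_mean = mean(value for row in scores for value in row)
    row_means = [mean(row) for row in scores]
    # Between-subject and within-subject sums of squares.
    ss_between = k * sum((m - grand_mean) ** 2 for m in row_means)
    ss_within = sum((v - m) ** 2 for row, m in zip(scores, row_means) for v in row)
    ms_between = ss_between / (n - 1)
    ms_within = ss_within / (n * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)


# Hypothetical ROM readings in degrees: three subjects, two ratings each.
readings = [[40, 42], [50, 48], [60, 62]]
print(round(icc_one_way(readings), 3))
```

When every subject receives identical repeated ratings the within-subject mean square is zero and the coefficient equals one.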

5.1.1.2 Validity

Traditionally there have been three categories of validity discussed: content, criterion and construct. Various sub-categories have been proposed, often leading to confusion in this fundamental area. For the purposes of this thesis a brief outline of the three distinct categories is provided and readers are referred to specific texts for further reading on the fluctuations in definitions [189, 190].

Content validity, the scope of a method/tool, is seldom measured formally; instead the ‘face validity’ or clinical credibility of a method/tool is determined from expert opinion. Range of motion is widely accepted to be face-valid, i.e. a value of 20 degrees is less ROM than a value of 40 degrees, although no statistical evidence can be provided for this.

Criterion validity is concerned with comparing a method with a definitive ‘gold or criterion standard’: a method/tool or test that hypothetically has a sensitivity and specificity of 100% (no false positives and no false negatives). In practice there are no gold or criterion standards, so a gold standard can change if a more specific and/or sensitive method is found. With regard to cervical ROM, no gold standard exists and it is unlikely that one will ever be confirmed. Radiographs have been considered the closest method to a gold standard; however, the method has not undergone sufficient reliability and validity experimentation to be truly classed as one [101].

Criterion validity is divided into two types, concurrent and predictive, depending on when the method/tool is compared with the criterion. Concurrent criterion validity is established when the comparison is made at the same time, for example when ROM is measured using visual estimation and then using a goniometer immediately afterwards. Predictive criterion validity is established when the new method is applied at baseline and compared with outcomes at a later date. Because of the time delay and the resulting potential for bias, predictive criterion validity studies are rarely conducted.
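The idea of a criterion standard with 100% sensitivity and specificity, described above, can be made concrete with a small sketch. The function name and the counts are hypothetical; they simply show how false negatives lower sensitivity and false positives lower specificity when a method is compared against a criterion.

```python
def sensitivity_specificity(tp: int, fp: int, tn: int, fn: int):
    """Return (sensitivity, specificity) from counts against a criterion standard."""
    if min(tp, fp, tn, fn) < 0:
        raise ValueError("counts must be non-negative")
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative case")
    sensitivity = tp / (tp + fn)   # proportion of true cases detected
    specificity = tn / (tn + fp)   # proportion of non-cases correctly excluded
    return sensitivity, specificity


# A hypothetical perfect criterion standard: no false positives or false negatives.
print(sensitivity_specificity(tp=50, fp=0, tn=50, fn=0))   # (1.0, 1.0)

# A hypothetical imperfect method compared against that criterion.
print(sensitivity_specificity(tp=45, fp=10, tn=40, fn=5))  # (0.9, 0.8)
```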


Construct validity is concerned with the accuracy with which a method represents an attribute that cannot be directly observed. It is determined through deductive reasoning and assessment of convergence to similar methods/tools and divergence from different methods/tools. An example of this is ‘neck stiffness’. This is a construct: we cannot definitively prove that an individual has a ‘stiff’ neck. However, we might hypothesise that an individual who complains of a ‘stiff’ neck might be observed to have difficulty turning their head by a certain amount.

In order to optimise the reporting of this systematic review, the following sections are structured according to the PRISMA statement with modifications appropriate to the nature of the studies within the review [42].
