Background - The developing role of magnetic resonance imaging in Phase III multiple sclerosis

Providing a measure of brain lesion volume requires both lesion identification and delineation. A substantial number of techniques are now available to perform such an analysis, and the following criteria can be used to judge their performance: accuracy, reproducibility, reliability, efficiency and stability over time.

2.1 Accuracy

This refers to the extent to which a technique measures the truth. The accuracy of lesion volume measurement is hard to establish since no perfect gold standard measure exists with which to compare measured values. Approaches to determining the accuracy of a technique include: (i) the use of realistic phantoms where a true value is known (Tofts et al, 1997), (ii) the use of a simulated MR dataset (Evans et al, 1997), and (iii) comparison against a gold standard measure. At present, the recognised gold standard is manual outlining, performed in consensus by a group of experts or experienced raters (Filippi et al, 1998a).
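Where a reference value is available, for example from a phantom, accuracy can be summarised as the signed error of each measurement relative to that reference. The following is a minimal sketch only: the function names and the choice of a percentage error are assumptions made for illustration and are not taken from the studies cited above.

```python
from statistics import mean


def percent_volume_error(measured_ml, reference_ml):
    """Signed error of a measured volume relative to a reference value, in percent.

    Hypothetical helper: reference_ml stands for a known phantom volume or a
    consensus manual measurement, both expressed in millilitres.
    """
    if reference_ml <= 0:
        raise ValueError("reference volume must be positive")
    return 100.0 * (measured_ml - reference_ml) / reference_ml


def mean_percent_error(pairs):
    """Mean signed percentage error over (measured_ml, reference_ml) pairs.

    A value consistently above or below zero suggests systematic
    over-estimation or under-estimation by the technique.
    """
    return mean(percent_volume_error(m, r) for m, r in pairs)
```

A mean error close to zero does not on its own demonstrate accuracy, since large positive and negative errors can cancel; the spread of the individual errors should be inspected as well.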

2.2 Reproducibility

Reproducibility (or precision) refers to the degree to which repeated measurements on the same object are in agreement. The assessment of reproducibility is important, since it can define the extent of random measurement error that can be anticipated in an MRI study. There are multiple potential sources of variability in the measurement of T2 lesion volume (Plante & [...] analysis. Examples include repositioning errors (Gawne-Cain et al, 1996; Simon et al, 1997; Filippi et al, 1997c), inconsistent scanner performance (Wang et al, 1997a), variation in scanner model and field strength (Filippi et al, 1997a), variable motion and flow artifact, and variability in the segmentation technique itself where any human interaction is required (intra- and inter-observer measure-remeasure variability). With automated techniques, repeated measurements on an individual MR dataset should necessarily be highly reproducible. However, the robustness of any segmentation technique can only be fully assessed by the analysis of multiple scans that are acquired within a short period, with the patients leaving the scanner between measurements (scan-rescan reproducibility) (Simon et al, 1997; Filippi et al, 1998a). This forms an important part of technique validation.
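One common way to summarise scan-rescan reproducibility is the coefficient of variation of the repeated measurements for each patient, averaged over the sample. The sketch below assumes this summary statistic; the text above does not prescribe a particular one, and the function names and data layout are hypothetical.

```python
from statistics import mean, stdev


def coefficient_of_variation(volumes_ml):
    """Sample standard deviation of repeated volumes as a percentage of their mean."""
    if len(volumes_ml) < 2:
        raise ValueError("at least two repeated measurements are required")
    average = mean(volumes_ml)
    if average <= 0:
        raise ValueError("mean volume must be positive")
    return 100.0 * stdev(volumes_ml) / average


def mean_scan_rescan_cov(repeats_by_patient):
    """Average the per-patient coefficient of variation over a sample.

    repeats_by_patient is an assumed layout: a dict mapping a patient
    identifier to the list of lesion volumes (ml) from that patient's
    repeated scans.
    """
    return mean(coefficient_of_variation(v) for v in repeats_by_patient.values())
```

For instance, `mean_scan_rescan_cov({"p1": [9.0, 11.0], "p2": [20.0, 20.0]})` averages one patient with visible scan-rescan scatter and one with none.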

2.3 Reliability

Reliability can be defined as a measurement technique’s ability to discriminate between the different members of a sample population (Fleiss 1985; Streiner & Norman 1995). It describes the proportion of the variance in repeated measurements on a patient sample that is attributable to differences between patients. For a measurement technique to have perfect reliability, all the variance in repeated measurements must arise from systematic differences between subjects.

Since the aim of serial T2 lesion volume quantification is to discriminate between subjects and identify trends, reliability is an important consideration. An evaluation of a technique's reliability also allows the impact of measurement error on sample size requirements to be calculated (Fleiss 1985).
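The definition above, the proportion of variance attributable to differences between patients, corresponds to an intraclass correlation coefficient. The sketch below estimates it with a one-way random-effects analysis of variance and then applies the standard attenuation argument that the required sample size scales with 1/R when the reliability is R. Both the choice of estimator and the function names are assumptions for illustration; the text does not state which form was used.

```python
from math import ceil


def intraclass_correlation(measurements):
    """One-way random-effects ICC.

    measurements is a list with one entry per subject, each entry being a
    list of k repeated measurements (the same k for every subject).
    """
    n = len(measurements)
    k = len(measurements[0]) if n else 0
    if n < 2 or k < 2 or any(len(row) != k for row in measurements):
        raise ValueError("need at least 2 subjects, each with the same k >= 2 repeats")
    grand_mean = sum(sum(row) for row in measurements) / (n * k)
    subject_means = [sum(row) / k for row in measurements]
    # Mean square between subjects and mean square within subjects.
    ms_between = k * sum((m - grand_mean) ** 2 for m in subject_means) / (n - 1)
    ms_within = sum(
        (x - m) ** 2 for row, m in zip(measurements, subject_means) for x in row
    ) / (n * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)


def inflated_sample_size(n_error_free, reliability):
    """Sample size after allowing for measurement error, assuming n scales as 1/R."""
    if not 0 < reliability <= 1:
        raise ValueError("reliability must lie in (0, 1]")
    return ceil(n_error_free / reliability)
```

With perfect reliability (R = 1) the sample size is unchanged; at R = 0.5 a trial would need twice as many patients as the error-free calculation suggests.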

2.4 Efficiency

This describes the relative ease with which any technique can be applied in a treatment trial, particularly in terms of cost, computational and human resources. Many available segmentation techniques require a substantial level of human intervention, in either delineating lesion boundaries, or in the process of reviewing the effectiveness of more complex algorithm-derived lesion volumes. With MR protocols for phase III studies typically requiring analysis of many hundreds (or even thousands) of scans, efficiency is clearly at a premium. For more operator-dependent techniques, analysis of the entire dataset by a single observer may not be possible within a reasonable time frame, thereby introducing inter-observer inconsistency as a further potential source of variability. Considerable efforts are now being directed towards the development of more efficient quantitative techniques.

2.5 Stability over time

Measurement stability is a further important consideration. While it is imperative that the image acquisition methodology remains stable, consistency in the application of the segmentation technique is also essential. The likely occurrence of operator drift over time has already been demonstrated in two phase III clinical trials that used T2 lesion volume quantification to provide an outcome measure (Paty et al, 1993; IFNB MS Study Group 1995; Jacobs et al, 1996; Simon et al, 1998). In one study (Paty et al, 1993), a step decrease in lesion volume of about 9% was seen in the placebo arm after three years. This was attributed to a change of strategy by the single observer over the course of the study. Subsequent re-analysis confirmed that this step change was artificial (IFNB MS Study Group 1995). In a second, more recent study (Jacobs et al, 1996; Simon et al, 1998), a similar paradoxical reduction in T2 lesion volume in the placebo group was also identified. This was also attributed to measurement drift, disappearing on re-analysis with an improved quantification method based on manual outlining (Simon et al, 1998). The risk of such a step change in analysis technique application might be reduced by rigorous operator training and regular consistency checks on a representative dataset, to identify and address operator drift where this occurs. Furthermore, all the scans from an individual patient should be analysed by the same observer, ideally within a single session (Filippi et al, 1998a).
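A regular consistency check of the kind suggested above could be as simple as re-measuring a fixed reference dataset at intervals and flagging any result that departs from the baseline value by more than a chosen tolerance. The sketch below is illustrative only: the function name and the 5% default tolerance are assumptions, not values taken from the trials cited.

```python
def flag_operator_drift(baseline_ml, rechecks_ml, tolerance_percent=5.0):
    """Return (index, percent_change) for each re-measurement outside tolerance.

    baseline_ml is the lesion volume first recorded on the reference dataset;
    rechecks_ml holds later re-measurements of the same dataset, in order.
    The default tolerance is an arbitrary illustrative value.
    """
    if baseline_ml <= 0:
        raise ValueError("baseline volume must be positive")
    flagged = []
    for index, volume in enumerate(rechecks_ml):
        change = 100.0 * (volume - baseline_ml) / baseline_ml
        if abs(change) > tolerance_percent:
            flagged.append((index, change))
    return flagged
```

A step decrease of about 9%, as described for the placebo arm above, would be flagged at this tolerance, whereas fluctuations of around 1% would not.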

2.6 Manual outlining

Until recently, manual outlining remained the gold standard technique for quantifying T2 lesion volume. The process involves lesion identification by experienced observers, and subsequent computer-assisted delineation of lesion boundaries by manual tracing with a mouse or tracker-ball. The technique has been used in several clinical trials (Kastrukoff et al, 1990; Paty et al, 1993; Koopmans et al, 1993; Zhao et al, 1997; Simon et al, 1998). Manual outlining utilises human pattern recognition capabilities for discriminating lesion from artefact/normal anatomy. However, it suffers from two major disadvantages: (i) suboptimal reproducibility (Filippi et al, 1995d; Grimaud et al, 1996; Filippi et al, 1998c), and (ii) substantial time consumption. While comprehensive operator training has been shown to improve reproducibility to an extent (Filippi et al, 1998c), the high level of operator intervention results in intra- and inter-rater variability even for experienced observers. Furthermore, the analysis time is long at 30-60 minutes per scan. Therefore, many hundreds of hours of operator time are needed to analyse the large MRI datasets generated in phase III treatment trials. These limitations have led to attempts to develop quantitative techniques with a higher level of automation; the contour technique is one such approach. A study comparing the contour with the manual outlining technique is now presented.
