Improving the evaluation of
therapeutic interventions in
multiple sclerosis: development of a
patient-based measure of outcome
JC Hobart, A Riazi, DL Lamping,
R Fitzpatrick and AJ Thompson
HTA
Health Technology Assessment
NHS R&D HTA Programme
March 2004
JC Hobart,1,2* A Riazi,1 DL Lamping,3 R Fitzpatrick4 and AJ Thompson1
1 Neurological Outcome Measures Unit, Institute of Neurology, London, UK
2 Peninsula Medical School, Derriford Hospital, Plymouth, UK
3 Health Services Research Unit, London School of Hygiene and Tropical Medicine, UK
4 Division of Public Health and Primary Health Care, Institute of Health Sciences, University of Oxford, UK
* Corresponding author
Declared competing interests of authors: none
Published March 2004
This report should be referenced as follows:
Hobart JC, Riazi A, Lamping DL, Fitzpatrick R, Thompson AJ. Improving the evaluation of therapeutic interventions in multiple sclerosis: development of a patient-based measure of outcome. Health Technol Assess 2004;8(9).
Health Technology Assessment is indexed in Index Medicus/MEDLINE and Excerpta Medica/EMBASE.
… technologies is produced in the most efficient way for those who use, manage and provide care in the NHS.
Initially, six HTA panels (pharmaceuticals, acute sector, primary and community care, diagnostics and imaging, population screening, methodology) helped to set the research priorities for the HTA Programme. However, during the past few years there have been a number of changes in and around NHS R&D, such as the establishment of the National Institute for Clinical Excellence (NICE) and the creation of three new research programmes: Service Delivery and Organisation (SDO); New and Emerging Applications of Technology (NEAT); and the Methodology Programme.
This has meant that the HTA panels can now focus more explicitly on health technologies (‘health technologies’ are broadly defined to include all interventions used to promote health, prevent and treat disease, and improve rehabilitation and long-term care) rather than settings of care. Therefore the panel structure was replaced in 2000 by three new panels: Pharmaceuticals; Therapeutic Procedures (including devices and operations); and Diagnostic Technologies and Screening.
The HTA Programme will continue to commission both primary and secondary research. The HTA Commissioning Board, supported by the National Coordinating Centre for Health Technology Assessment (NCCHTA), will consider and advise the Programme Director on the best research projects to pursue in order to address the research priorities identified by the three HTA panels. The research reported in this monograph was funded as project number 95/01/03.
The views expressed in this publication are those of the authors and not necessarily those of the HTA Programme or the Department of Health. The editors wish to emphasise that funding and publication of this research by the NHS should not be taken as implicit support for any recommendations made by the authors.
HTA Programme Director: Professor Tom Walley
Series Editors: Dr Ken Stein, Professor John Gabbay, Dr Ruairidh Milne, Dr Chris Hyde and Dr Rob Riemsma
Managing Editors: Sally Bailey and Caroline Ciupek
The editors and publisher have tried to ensure the accuracy of this report but do not accept liability for damages or losses arising from material published in this report. They would like to thank the referees for their constructive comments on the draft document.
ISSN 1366-5278
© Queen’s Printer and Controller of HMSO 2004
This monograph may be freely reproduced for the purposes of private research and study and may be included in professional journals provided that suitable acknowledgement is made and the reproduction is not associated with any form of advertising.
Applications for commercial reproduction should be addressed to HMSO, The Copyright Unit, St Clements House, 2–16 Colegate, Norwich, NR3 1BQ.
Published by Gray Publishing, Tunbridge Wells, Kent, on behalf of NCCHTA.
Printed on acid-free paper in the UK by St Edmundsbury Press Ltd, Bury St Edmunds, Suffolk.
Criteria for inclusion in the HTA monograph series
Reports are published in the HTA monograph series if (1) they have resulted from work commissioned for the HTA Programme, and (2) they are of a sufficiently high scientific quality as assessed by the referees and editors.
Reviews in Health Technology Assessment are termed ‘systematic’ when the account of the search, appraisal and synthesis methods (to minimise biases and random errors) would, in theory, permit the replication of the review by others.
Abstract
Objectives: To develop a patient-based, disease-specific measure of the health impact of multiple sclerosis (MS) for use in clinical trials and clinical practice.
Data sources: People with MS. Members of the MS Society of Great Britain and Northern Ireland.
Methods: Standard psychometric methods were used to develop the Multiple Sclerosis Impact Scale (MSIS-29) in three stages. Stage 1 (item generation): questionnaire items were generated from 30 patient interviews on the impact of MS on their lives, expert opinion and literature review. Stage 2 (item reduction and scale generation): the questionnaire developed in stage 1 was administered by postal survey to 1530 randomly selected members of the MS Society. Standard item reduction techniques were used to develop a rating scale from the pool of questionnaire items. Stage 3 (psychometric evaluation): the questionnaire was evaluated for data quality, scaling assumptions, acceptability, reliability and validity in a separate postal survey of 1250 MS Society members. Responsiveness was evaluated in 55 people admitted to hospital for rehabilitation and intravenous steroid treatment of MS relapses.
Results: Stage 1 resulted in a 129-item questionnaire. Stage 2 resulted in a 29-item rating scale measuring the physical and psychological impact of MS. The MSIS-29 satisfied all recommended psychometric criteria for rigorous measurement. Data quality was excellent: missing data were low, item test–retest reliability was high and scale scores could be generated for over 98% of respondents. Item descriptive statistics, item convergent and discriminant validity, and factor analysis supported summing items to produce two summary scores. MSIS-29 physical and psychological scale scores showed good variability, low floor and ceiling effects, good internal consistency and test–retest reliability. Correlations with other measures and confirmation of hypotheses about group differences provided evidence for the validity of the MSIS-29 as a measure of the physical and psychological impact of multiple sclerosis. Effect sizes provided preliminary evidence for responsiveness.
Conclusions: The 29-item MSIS-29 is a rigorous new measure of the physical and psychological impact of MS. All psychometric criteria were satisfied and there is preliminary evidence of responsiveness. The MSIS-29 is particularly appropriate for use in clinical trials to evaluate therapeutic effectiveness from the patient’s perspective. Further critical evaluations of the MSIS-29 completed by people with neurologist-confirmed MS in different settings are suggested. Head-to-head comparisons of the psychometric properties of the MSIS-29 and other outcome measures for MS will help to determine the relative advantages of different instruments so that the choice of measures for studies can be evidence based.
List of abbreviations
Executive summary
1 Overview of report
2 Background
    Overview
    Evaluation of therapeutic interventions for MS
    Health outcomes measurement: history, concepts and theory
3 Development of the MSIS-29
    Overview
    Methods
    Results
4 Psychometric evaluation of the MSIS-29
    Overview
    Data quality, scaling assumptions, acceptability, reliability and validity
    Responsiveness
5 Discussion
    Overview
    Discussion of results
    Study limitations
    Implications for health care
    Recommendations for future research
    Conclusions
Acknowledgements
References
Appendix 1 The 129-item long-form questionnaire and item reduction strategy
Appendix 2 Multiple Sclerosis Impact Scale (MSIS-29)
Appendix 3 Instructions for administration and scoring the MSIS-29
Health Technology Assessment reports published to date
Health Technology Assessment Programme
List of abbreviations
AERA          American Educational Research Association
ANOVA         analysis of variance
APA           American Psychological Association
BI            Barthel Index
EDSS          Expanded Disability Status Scale
EQ-5D         EuroQol
ES            effect size
FAMS          Functional Assessment of MS
GHQ-12        General Health Questionnaire
GNDS/UKNDS    Guy’s (now UK) Neurological Disability Scale
HRQOL-MS      Health-related Quality of Life Questionnaire for MS
ICC           intraclass correlation coefficient
ION           Institute of Neurology
IR            Irene Richardson
MD            missing data
MEF           maximum endorsement frequency
MRI           magnetic resonance imaging
MS            multiple sclerosis
MSIS-29       Multiple Sclerosis Impact Scale
MSQLI         MS Quality of Life Inventory
NA            not applicable
NCME          National Council on Measurement in Education
NHNN          National Hospital for Neurology and Neurosurgery
PCA           principal components analysis
SE            standard error
SF-36         Medical Outcomes Study 36-Item Short Form Health Survey
SIP           Sickness Impact Profile
All abbreviations that have been used in this report are listed here unless the abbreviation is well known (e.g. NHS), or it has been used only once, or it is a non-standard abbreviation used only in figures/tables/appendices in which case the abbreviation is defined in the figure legend or at the end of the table.
Executive summary
Background
Multiple sclerosis (MS) is an incurable progressive neurological disorder that has a profound impact on people’s lives. Although a wide range of problems has been documented, the impact of MS from the individual’s perspective has not been systematically and directly measured. There is no outcome measure that incorporates patients’ own perspectives about the impact of MS that is sufficiently rigorous to be used in treatment trials, epidemiological studies and audit. This report describes the development and validation of a new instrument, the Multiple Sclerosis Impact Scale (MSIS-29), a rigorous measure of the physical and psychological impact of MS from the patient’s perspective.
Objectives
To develop a patient-based, disease-specific measure of the health impact of MS that is clinically useful, scientifically sound and suitable for use as an outcome measure in clinical trials and in routine clinical practice.
Methods
Standard psychometric methods were used to develop the MSIS-29 in three stages.
● Stage 1 (item generation): questionnaire items were generated from 30 patient interviews on the impact of MS on their lives, expert opinion and literature review.
● Stage 2 (item reduction and scale generation): the questionnaire developed in stage 1 was administered by postal survey to 1530 randomly selected members of the MS Society. Standard item reduction techniques were used to develop a rating scale.
● Stage 3 (psychometric evaluation): the rating scale was evaluated for data quality, scaling assumptions, acceptability, reliability and validity in a separate postal survey of 1250 MS Society members. Responsiveness was evaluated in 55 people admitted to hospital for rehabilitation and intravenous steroid treatment of MS relapses.
Results
● Stage 1: a pool of 129 items was generated.
● Stage 2: the item pool was reduced to a 29-item measure of the physical (20 items) and psychological (nine items) impact of MS: the MSIS-29.
● Stage 3: the MSIS-29 satisfied all recommended psychometric criteria for rigorous measurement. Data quality was excellent: missing data were low (maximum 3.9%), item test–retest reliability was high (r = 0.65–0.90) and scale scores could be generated for >98% of respondents. Item descriptive statistics, item convergent and discriminant validity, and factor analysis supported summing items to produce two summary scores. MSIS-29 physical and psychological scale scores showed good variability, low floor and ceiling effects, good internal consistency (Cronbach’s α ≥0.91) and test–retest reliability (intraclass correlation ≥0.87). Correlations with other measures, and confirmation of hypotheses about group differences, provided evidence for the validity of the MSIS-29 as a measure of the physical and psychological impact of multiple sclerosis. Effect sizes (physical scale = 0.82, psychological scale = 0.66) provided preliminary evidence for responsiveness.
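As an illustration of the internal consistency statistic quoted in the results, the following is a minimal Python sketch of Cronbach’s α for a set of item responses. It uses the standard formula, α = k/(k − 1) × (1 − sum of item variances / variance of the summed score), where k is the number of items. The function name `cronbach_alpha` and the example responses are hypothetical and are not data from this study.

```python
from statistics import variance


def cronbach_alpha(responses):
    """Cronbach's alpha for a list of respondents, each a list of item scores."""
    k = len(responses[0])
    if k < 2:
        raise ValueError("at least two items are needed")
    # Sample variance of each item across respondents.
    item_vars = [variance([row[i] for row in responses]) for i in range(k)]
    # Sample variance of the summed scale score.
    total_var = variance([sum(row) for row in responses])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)


# Hypothetical responses: 5 respondents x 3 items scored 1-5 (not study data).
example = [
    [1, 2, 1],
    [2, 2, 3],
    [3, 4, 3],
    [4, 4, 5],
    [5, 5, 5],
]
alpha = cronbach_alpha(example)
print(round(alpha, 3))
```

Values close to 1 indicate that the items vary together, which is the property that supports summing them into a single scale score.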
Conclusions and recommendations
The 29-item MSIS-29 is a rigorous new measure of the physical and psychological impact of MS. All psychometric criteria were satisfied and there is preliminary evidence of responsiveness. The MSIS-29 is particularly appropriate for use in clinical trials to evaluate therapeutic effectiveness from the patient’s perspective.
A limitation of the study is that the MS Society membership database was used to define the sampling frame; the percentage of people in the database with a neurologist-confirmed diagnosis of clinically definite MS, the disease type of those with MS and the representativeness of people who join charitable groups are unknown.
Critical evaluations of the MSIS-29 completed by people with neurologist-confirmed MS in different settings will identify its strengths and weaknesses, and further define its role in clinical practice and research. Head-to-head comparisons of the psychometric properties of the MSIS-29 and other outcome measures for MS will help to determine the relative advantages of different instruments so that the choice of measures for studies can be evidence based.
Chapter 1
Overview of report
This report describes the development and validation of the Multiple Sclerosis Impact Scale (MSIS-29), a rigorous new measure of the physical and psychological impact of multiple sclerosis (MS) from the patient’s perspective. The MSIS-29 was developed and tested in three stages. In stage 1 (item generation), a 129-item questionnaire was generated from 30 patient interviews, expert opinion and a review of the literature. In stage 2 (item reduction and scale generation), the 129-item questionnaire was administered by postal survey to 1530 randomly selected members of the MS Society to identify items for elimination on the basis of psychometric performance. This process generated the MSIS-29. In stage 3 (psychometric evaluation), a comprehensive evaluation of the psychometric properties of the MSIS-29 (data quality, scaling assumptions, acceptability, reliability, validity and responsiveness) was undertaken in a postal survey of 1250 MS Society members and 55 people admitted to hospital for rehabilitation or intravenous steroid treatment.
Chapter 2 describes the evaluation of therapeutic interventions for MS and, in some detail, the psychometric concepts and methods used to develop and validate the MSIS-29. Readers familiar with psychometric methods may prefer to omit this section.
Chapter 3 presents the methods and results of stages 1 (item generation) and 2 (item reduction and scale generation).
Chapter 4 presents the methods and results of stage 3 (psychometric evaluation).
Chapter 5 presents a discussion of study results, study limitations and the implications of the MSIS-29 for healthcare, and provides recommendations for future research.
The appendices include a copy of the MSIS-29 and instructions for administration and scoring of this measure.
Chapter 2
Background
Overview
This chapter describes the evaluation of therapeutic interventions for MS and the psychometric methods and concepts used. Readers familiar with psychometric methods may prefer to skip to Chapter 3 after the section on Health outcomes measurement.
Evaluation of therapeutic interventions for MS
MS is an incurable progressive neurological disorder that has a profound impact on individuals and their families. Although the incidence in the UK is relatively low (2500 new cases/year), the prevalence is much higher (85,000). This is because MS tends to begin in young age groups, is incurable and in the majority of people is progressive over many decades. Although MS has little effect on longevity, it has a major impact on physical function, employment and quality of life. It is a complex disorder with diverse effects, an unpredictable course, and variable manifestations that pose unique problems to patients and their families. Moreover, the cost of MS in the UK is estimated to be £1200 million per year1 and is expected to increase.2 Costs due to MS have been shown to increase as disability progresses.3,4 Psychosocial costs are less easily quantified, but no less real.
As MS is a major public health concern in Britain, beneficial interventions are to be welcomed. However, the outcomes of therapeutic interventions must be rigorously evaluated if policy decisions and clinical practice are to be evidence based. The need for more rigorous evaluation of treatments for MS has recently become critically important for several reasons. First, an increasing number of therapeutic pharmaceutical agents aimed at altering the course of MS is being introduced and their effectiveness needs to be determined.5 Second, because the relative benefits of different interventions are likely to be marginal, analyses of comparative effectiveness are necessary.6 Third, as treatments are expensive and may be required on a long-term basis, decisions about interventions based on short-term evaluation may have long-term economic implications. Fourth, as resources for the treatment of MS are required for other aspects of service provision, including rehabilitation and community support, resource allocation must be equitable. Finally, it is important that the current limited resources are allocated appropriately.
Evidence-based policy and clinical decision-making require rigorous measurement of outcomes. This information is of value when the outcomes that are evaluated are appropriate to patients and the instruments that are used are clinically useful and scientifically sound. Outcome measures in MS have traditionally focused on physiological parameters of disease and simple, easy to measure entities such as mortality, morbidity and duration of survival. Although these assessments are important, they only partly address patients’ concerns,7 offer little information about diverse clinical consequences, fail to address the personal impact of disease,8 and are of limited relevance in conditions that do not affect longevity. As new treatments for MS are aimed at altering its natural history or modifying its impact, traditional outcomes are inadequate in a comprehensive evaluation of therapeutic effectiveness.
Over the past two decades, outcome measurement in MS has relied heavily on the Expanded Disability Status Scale (EDSS).9 This is an observer (neurologist)-rated scale which grades ‘disability’ due to MS in 20 steps on a continuum from 0 (normal neurological examination) to 10 (death due to MS). The EDSS was developed on the basis of the extensive clinical experience of a neurologist specialising in MS. It addresses impairment (symptoms and signs) at the lower levels (0–3.5), mobility in the middle range (4.0–7.5), and upper limb (8.0–8.5) and bulbar function (9.0–9.5) in the higher levels. Although the EDSS evaluates disability, it was developed before psychometric methods became familiar to clinicians, was not based on recognised techniques of scale construction,10 and did not directly involve people with MS. More importantly, the EDSS is rated by neurologists rather than by patients themselves and has limited measurement properties.11,12
The lack of validated MS-specific measures has led to the use of generic measures, such as the Medical Outcomes Study 36-Item Short Form Health Survey (SF-36),13 Sickness Impact Profile (SIP)14 and EuroQol.15 Although generic measures have the advantage of enabling comparisons across diseases, it is increasingly recognised that they do not cover some areas of outcome that are highly relevant in specific diseases,16 and may have limited responsiveness.17 Psychometric limitations of the SF-36 in MS include significant floor and ceiling effects,18 limited responsiveness,18 underestimation of mental health problems,19 and a failure to satisfy assumptions for generating summary scores.20 Disease-specific instruments, consisting of items and domains of health that are specific to a particular disease, are more relevant and important to patients and clinicians and consequently are more likely to be responsive to subtle changes in outcome.7,17,21
Several MS-specific measures have been developed since the mid-1990s. These include the Functional Assessment of MS (FAMS),22 the MSQOL-54,23 the MS Functional Composite,24 the Leeds MSQoL scale,25 the Guy’s (now UK) Neurological Disability Scale (GNDS/UKNDS),26 the MS Quality of Life Inventory (MSQLI),27 and the health-related quality of life questionnaire for MS (HRQOL-MS).28 While all are encouraging, one limitation of these measures is that none was developed using the standard psychometric approach of reducing a large item pool generated de novo from people with MS. The FAMS and MSQOL-54 were developed by adding MS-specific items to existing measures, an approach that has been demonstrated to have some limitations.29 The HRQOL-MS was developed through factor analysis of items from two generic measures and one MS-specific measure, and the MSQLI combines a large number of existing disease-specific and generic instruments. Items for the GNDS were developed through expert clinical opinion rather than on the basis of interviews with people with MS. Consequently, an outcome measure that is MS specific and combines patient perspective with rigorous psychometric methods will complement existing instruments. The aim of this study was to develop such a measure.
Health outcomes measurement: history, concepts and theory
The scientific discipline of health measurement30 grew in response to the need to supplement clinical judgement with reliable and valid patient-based measures of health outcomes. Recently, there has been increasing recognition of the importance of assessing more patient-relevant consequences of disease, a practice that is now considered essential in a comprehensive evaluation of healthcare.16,31 As measurement of such outcomes will influence decisions that affect patient welfare, policy development and the expenditure of public funds, it is essential that rigorous measurement instruments are used in healthcare evaluation.32
A simple but useful classification considers health outcomes in neurology to be either physician- or patient-based. The most frequently studied physician-based outcomes in MS are magnetic resonance imaging (MRI) and relapse rate. Although these physician-based outcomes have the patient’s interests at heart, they only address the pathological basis of MS and evaluate health in terms of quantity. They do not provide a complete picture of disease impact as they offer limited information about the diverse clinical consequences of MS and fail to incorporate subjective assessments of health.16
Patient-based outcomes are the consequences of disease and treatment that are considered important to patients. Patients are the best source of information about therapeutic benefit defined in terms of functioning and well-being.33 As patients, their carers and their physicians differ in their interpretation of the impact of illness,34–39 it is important to elicit information from patients about which outcomes are important. This is supported by irrefutable evidence that patients can provide reliable and valid judgements of health status and the benefits of treatment.40,41 Indeed, patient report has been described as the ultimate measure of health status.42 In addition, the self-report method affords considerable methodological advantages over other methods of instrument administration.43 For example, large numbers of geographically disparate patients can be accessed by postal survey, thus reducing selection bias while minimising patient discomfort and research staff involvement.
It is common to consider measures apart from traditional indicators of biological functioning as a single category of quality of life measures.33 However, as quality of life encompasses factors not generally considered to be part of health per se (e.g. income and environment), the terms health-related quality of life,7 health status33 and functional status are commonly used interchangeably.44,45 The title of a measure should be as descriptive as possible of the construct measured.
Psychometric theory
Although health measurement as a distinct discipline emerged in the 1980s,46–48 it is derived from well-established theories and methods of measurement in the field of social sciences, the origins of which can be traced to the mid-1800s. The basic scientific principles of measurement were established by mathematical psychologists interested in the human being as a measuring instrument. By studying how people make subjective judgements about measurable physical stimuli (e.g. length, weight, loudness), they developed the science of psychophysics: the precise and quantitative study of how human judgements are made.49 The investigation of overt responses to physical stimuli requires precise methods, referred to as psychophysical methods, for presenting the stimuli and for measuring responses.50
The work of psychophysicists seems far removed from health measurement. In fact, it established the fundamental principles of subjective measurement, which are as relevant to judgements about health as to judgements about physical stimuli. The psychophysicists demonstrated three important findings about human judgement: that subjective judgement is a valid approach to measurement, that humans make judgements about abstract comparisons in an internally consistent manner, and that accurate judgements can be made on ratio rather than simple ordinal scales. It is notable that psychophysical methods are still used in neurology; thermal threshold testing is based on the principle of just noticeable differences in temperature detection, and audiometry on a person’s response to different sound frequencies.
While the psychophysicists were measuring subjective judgements about physical stimuli that could be independently and objectively measured and verified, experimental psychologists were attempting to measure human attributes for which there were no independent physical scales of measurement (e.g. intelligence, personality, attitudes).50 Darwin’s empirical demonstration of evolution in On the Origin of Species in 1859 was the impetus behind the study of individual differences in psychology.51 It was reasoned that if animals inherit ancestral characteristics, and if individual differences influence their ability to adapt and survive, so individual differences in humans would have functional significance and could be inherited. Galton, who followed Darwin and believed that the human race could be bettered through controlled mating (eugenics), realised that human characteristics must be measured in a standardised manner before their inheritance could be studied. He coined the term ‘mental test’ for any measure of a human attribute, and set about the large-scale testing of sensory discrimination and motor function in the belief that people with the most acute senses would be the most gifted and most knowledgeable.51 However, when Galton’s colleague Pearson developed and applied the correlation coefficient, it became clear that results from these simple sensory and motor tests bore almost no relationship to measures of intellectual achievement, such as school grades.52 This finding prompted the development of the mental test movement, with a widespread interest in the development and application of mental testing, and the measurement of individual differences.
A major advance in mental testing53 was made when Thurstone demonstrated that psychophysical scaling methods could be used to measure psychological attributes accurately.54,55 This finding prompted the development of psychological (or psychometric) scaling methods, which are defined as procedures for constructing scales for the measurement of psychological attributes.49 Spurred on by the practical need to measure diverse outcomes, the mental test movement flourished between 1930 and 1950 with the spread of standardised testing for assessing educational achievement, measuring attitudes and personality, and selecting and screening personnel. In addition, scientific interest in methods of testing led to the development of psychometrics as a prominent discipline within psychology and established the cornerstones of the scientific evaluation of measuring instruments based on reliability and validity testing.49,56
The growth and development of psychometrics required standards for the development and evaluation of measurement instruments. The first of these was introduced in 1954 by a committee of the American Psychological Association (APA).57 The following year similar guidelines were prepared by a committee representing the American Educational Research Association (AERA) and the National Council on Measurement in Education (NCME).58 Subsequently, standards have been published by the Committee to Develop Standards for Educational and Psychological Testing, which represents the APA, AERA and NCME,59–61 along with a commitment to the continual review of measurement standards in psychology and education.61
Thus, when healthcare evaluation needed methods
technology already existed. Since the 1970s, the focus of healthcare evaluation has moved to the measurement of function (the ability of patients to perform the daily activities of their lives), how patients feel and their own evaluation of their health in general.40 The primary source of this information is standardised surveys,46 for which psychometric techniques of scale construction are highly appropriate.40
Two studies in the USA confirmed the value of psychometric methods in assessing health outcomes. The Health Insurance Study,62 a randomised experiment conducted by The Rand Corporation between 1974 and 1981, demonstrated that psychometric methods can be used to generate reliable and valid measures for assessing changes in health status for both adults and children in the general population. Following on from this, the Medical Outcomes Study40,63 demonstrated that psychometric methods of scale construction and data collection were successful for measuring health status in samples of sick and elderly people. This study also demonstrated that psychometrically equivalent short-form measures could be constructed from the original longer forms,64 thereby reducing respondent and administrative burden and improving measurement efficiency. These two pivotal studies confirmed that psychometric methods, borrowed from the social sciences, generated scientifically sound and clinically useful health measures.
Psychometric theory posits that when a concept cannot be measured directly (e.g. health status), it can be measured by asking a series of questions or items, each of which addresses a different aspect of the same concept.65 Analysis of a large number of items generated by clearly defined standard techniques allows one to reduce the number of items and to construct scales.10 Instruments developed according to psychometric principles must then be formally evaluated to ensure that they measure the outcome of interest in a manner that is reliable (consistent, stable over time and reproducible), valid (measure what they purport to or are intended to measure) and responsive (able to detect clinically important change over time).10,52,61,66–72
Instrument development
Below is an overview of the psychometric methods with appropriate references. Further information on these methods is reported in the references cited. Development of an outcomes measurement instrument in accordance with psychometric
principles involves two stages: generation of a large item pool, followed by reduction of the initial item pool to form the final instrument. Items can be generated from a variety of sources, including patients, consensus opinion of experts in the field, literature review and critical review of existing measures. They are then pretested on a small sample to assess how easily they can be understood and completed, whether there are ambiguities of wording, whether there are any irrelevant, misleading or offensive items, and whether the content of each item is appropriate. Items are revised based on pretesting to produce a version to be evaluated in the preliminary field test. Pretesting is critical for identifying problems with a questionnaire, such as problems with question content, which can cause confusion with the overall meaning of an item, as well as
misinterpretation of individual terms or concepts. Pretesting is a broad term that incorporates many different methods or combinations of methods73 in the prefield and field testing phases. Examples of prefield techniques include respondent focus groups and cognitive laboratory interviews. The latter consist of one-to-one interviews using a structured questionnaire in which respondents describe their thoughts while answering the questions (these are also called ‘think aloud’ interviews). Field techniques include behaviour coding, respondent debriefings, interviewer debriefings, split-panel tests, analysis of item non-response rates and analysis of non-response distributions (for more information, see Ref. 73).
The purpose of the first field test is to reduce the number of items and to develop scales. The instrument is administered to a large sample of patients and results are analysed using standard psychometric techniques for item analysis.10,66 First, items with poor response rates and very high or low endorsement frequencies (proportion of people who endorse each response alternative) are eliminated. The remaining items are analysed using a variety of techniques including exploratory factor analysis to determine the underlying dimensions (factors) of the instrument. Items are analysed for redundancy, homogeneity and discrimination ability. Redundancy can be described as the extent to which a pair of items measures the same construct. Homogeneity refers to the extent to which all of the items tap different aspects of the same attribute, and not different parts of different traits.10,48,74 Discrimination ability is the extent to which an item (or scale) can discriminate between those individuals who differ in the construct being measured.48 Based on these results, items are retained or eliminated and grouped into subscales to produce a final version of the instrument.
Instrument evaluation
Instrument evaluation is the assessment of six scientific properties: data quality, scaling
assumptions, acceptability, reliability, validity and responsiveness.
Data quality
Indicators of data quality such as item non-response and missing scale scores determine the extent to which an instrument can be used successfully in a clinical setting. They reflect respondents’ understanding and acceptance of a measure and help to identify items that may be irrelevant, confusing or upsetting to patients.75 Data quality can be determined by calculating per cent missing data for items, item test–retest reproducibility and per cent computable scale scores. Item test–retest reproducibility is the degree to which an item of the questionnaire yields stable scores over time among respondents who are assumed not to have changed on the domains being assessed. When there are missing items, a scale score can be calculated provided that 50% or more of the items are completed. A
psychometrically sound method of imputing data is to replace missing items with a person-specific mean score, the average score across completed items for that respondent.13,76
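The 50% rule and person-specific mean imputation described above can be sketched as follows. This is a minimal illustration, not the authors' scoring code: the function names `scale_score` and `percent_missing` are hypothetical, missing responses are assumed to be coded as `None`, and the scale score is assumed to be a simple unweighted sum.

```python
from typing import Optional, Sequence


def scale_score(responses: Sequence[Optional[float]]) -> Optional[float]:
    """Sum one respondent's items, imputing missing items with the
    person-specific mean (assumption: missing items are coded as None).

    Returns None when fewer than 50% of the items were completed.
    """
    answered = [r for r in responses if r is not None]
    if not answered or len(answered) * 2 < len(responses):
        return None  # too much missing data: the scale score is not computable
    person_mean = sum(answered) / len(answered)
    # Each missing item contributes the respondent's own mean item score.
    return sum(person_mean if r is None else r for r in responses)


def percent_missing(column: Sequence[Optional[float]]) -> float:
    """Per cent missing data for one item across all respondents."""
    return 100.0 * sum(r is None for r in column) / len(column)
```

For example, a respondent who answers 1, 2 and 3 on a four-item scale and skips one item has a person-specific mean of 2, so the imputed scale score is 8; a respondent who answers only one of the four items receives no score.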
Scaling assumptions
Having developed an instrument in the manner outlined above, and having used factor analysis to group items into subscales, one can now make the following assumptions about the final version: first, that items are correctly grouped into scales and that items in the same scale measure the same construct; and secondly, that the items of each scale can be summed without weights to produce scale scores. These assumptions can be evaluated by examining five criteria.
Equivalence of item variances
If items of the same scale are summed to produce a score it is assumed that the responses to items do not require standardisation or weighting.77 This assumption relies on items being roughly parallel and therefore having symmetrical item-response distributions and exhibiting equivalent means and standard deviations.
Equivalence of corrected item–total correlations
If items of the same scale are summated to provide a score, it is assumed that each item in the same scale contains the same proportion of information about the construct being measured. This assumption is met if item–total correlations (correlation between the score of an individual item and the scale total score) are approximately equal. The item–total correlation is corrected for overlap by subtracting the item score from the scale total so that estimates of the item–total relationship are not spuriously inflated.78 Recently, Ware and colleagues79 stated that this criterion can be considered satisfied when values exceed 0.30, even if they vary. No empirical justification of this criterion is given.
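The correction for overlap can be illustrated with a short sketch. It assumes complete data held as a list of rows (one row per respondent, one column per item); the helper names `pearson` and `corrected_item_total` are hypothetical and are not taken from the report.

```python
from math import sqrt
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def corrected_item_total(data: Sequence[Sequence[float]]) -> List[float]:
    """Corrected item-total correlation for every item of one scale.

    data[r][i] is respondent r's score on item i (complete data assumed).
    """
    n_items = len(data[0])
    result = []
    for i in range(n_items):
        item = [row[i] for row in data]
        # Correct for overlap: correlate the item with the total of the
        # remaining items, not with a total that includes the item itself.
        rest = [sum(row) - row[i] for row in data]
        result.append(pearson(item, rest))
    return result
```

Each returned value can then be compared with whichever threshold is adopted (for example 0.30), and the spread of values across items gives a rough check on whether the correlations are approximately equal.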
Item convergent validity
If all items in a scale are measuring the same underlying intangible construct or ‘latent variable’,80 each item should be substantially linearly related to the total score computed from other items in that group. This criterion of item convergent validity is supported if an item
correlates substantially with its own scale. Different authorities interpret different values for corrected item–total correlations as substantial. These include 0.20 (Ref. 48), 0.30 (Ref. 10) and 0.40 (Ref. 81).
Item-discrimination criteria
If the items of an instrument are correctly grouped into scales, items within a particular grouping should correlate more highly with the concept they are hypothesised to represent than with the other concepts measured by the instrument. Hypothesised groupings of items are supported when correlations between an item and its own scale (item–own-scale correlation) are significantly higher than with other scales of the measure (item–other-scale correlation). The extent of this item-discrimination criterion can be gauged by calculating scaling success rates.82 A scaling success occurs when the item–own-scale correlation is two standard errors (SE) or more greater than the item–other-scale correlation (SE of a correlation coefficient = 1/√N).83 An overall scaling success rate for a scale is the percentage of item scaling successes relative to the total number of item–own-scale correlations. Correlations within 2 SE of the corresponding convergent correlations indicate limited item discrimination. Therefore, a definite scaling success is defined as item–own-scale correlations greater than item–other-scale correlations by 2 SE or more. Possible scaling success is defined as item–own-scale correlations greater than item–other-scale correlations by less than 2 SE. Possible scaling failure is defined as item–own-scale correlations less than item–other-scale correlations by less than 2 SE. Definite scaling failure is defined as item–own-scale correlations less than item–other-scale correlations by 2 SE or more.
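The four-way classification can be sketched in a few lines. This is an illustration under stated assumptions: the function names are hypothetical, SE is taken as 1/√N as in the text, the definite-failure rule is assumed to mirror the definite-success rule, and a difference of exactly zero (which the definitions do not cover) is treated here as a possible failure.

```python
from math import sqrt
from typing import Sequence, Tuple


def classify_scaling(r_own: float, r_other: float, n: int) -> str:
    """Classify one item-own-scale versus item-other-scale comparison."""
    se = 1 / sqrt(n)  # standard error of a correlation coefficient
    diff = r_own - r_other
    if diff >= 2 * se:
        return "definite success"
    if diff > 0:
        return "possible success"
    if diff > -2 * se:
        return "possible failure"  # assumption: diff == 0 also lands here
    return "definite failure"


def scaling_success_rate(pairs: Sequence[Tuple[float, float]], n: int) -> float:
    """Percentage of comparisons that are definite scaling successes.

    Assumption: the denominator is the number of comparisons passed in.
    """
    successes = sum(
        classify_scaling(own, other, n) == "definite success"
        for own, other in pairs
    )
    return 100.0 * successes / len(pairs)
```

With N = 100 the standard error is 0.1, so an item correlating 0.70 with its own scale and 0.40 with another scale is a definite success, whereas 0.50 against 0.40 is only a possible success.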
Factor analysis
Exploratory factor analysis is used to reduce the number of items and develop scales in the preliminary field test stage of the study. In this technique a number of decisions are taken that can have a substantial impact on the results and their interpretation. The resulting item structure of the instrument depends on choices regarding the factor model [principal components analysis (PCA) or common principal axis factor analysis], the number of factors that are appropriate, the rotation method selected and the other items that are included in the analysis.84 In addition, the interrelationship of variables is left unspecified and it is impossible to test directly alternative theoretical structures underlying the data. Consequently, confirmatory factor analysis will be performed at a later stage after the instrument has been developed to assess the underlying structure of the final instrument.
Acceptability
An instrument is considered acceptable when score distributions adequately represent the true
distribution of health status in the sample.81 Item score distributions are considered acceptable when four criteria are met: approximately equal endorsement across response categories;49 maximum endorsement frequencies, calculated as the percentage of responses for the most frequently endorsed response category, of less than 80% for dichotomous response options [for multipoint (polychotomous) response options, this criterion is lower, and there are no published guidelines];66 and minimal item floor and ceiling effects, calculated as the percentage of responses for the lowest and highest scores, respectively. Although there are no widely accepted criteria for maximum item floor and ceiling effects, two published recommendations are 75% (Ref. 85) and 90% (Ref. 86).
Scale score distributions are considered acceptable when four criteria are satisfied: scores should span the full scale range;46 mean scores should be situated near the scale midpoint;87 scale floor and ceiling effects, calculated as the percentage of responses for the minimum and maximum scores, respectively, are minimal; and score distributions are not excessively skewed.75 There are no widely accepted criteria for floor and ceiling effects and skewness for scales. Current recommendations are that scale floor and ceiling effects should not exceed 15% (Ref. 88) or 20% (Ref. 89) and that skewness statistics should be within the –1 to +1 range.85
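The floor, ceiling and skewness checks can be computed as in the sketch below. The function names are hypothetical, and the skewness formula shown is the simple moment-based version; statistical packages use several slightly different conventions, so values may differ a little from published ones.

```python
from typing import Sequence, Tuple


def floor_ceiling(scores: Sequence[float], minimum: float,
                  maximum: float) -> Tuple[float, float]:
    """Percentage of respondents at the minimum (floor) and maximum
    (ceiling) possible score."""
    n = len(scores)
    floor = 100.0 * sum(s == minimum for s in scores) / n
    ceiling = 100.0 * sum(s == maximum for s in scores) / n
    return floor, ceiling


def skewness(scores: Sequence[float]) -> float:
    """Moment-based skewness m3 / m2**1.5 (assumes scores are not all equal)."""
    n = len(scores)
    mean = sum(scores) / n
    m2 = sum((s - mean) ** 2 for s in scores) / n
    m3 = sum((s - mean) ** 3 for s in scores) / n
    return m3 / m2 ** 1.5
```

A scale on which 20% of respondents score the minimum would sit at or beyond the recommended floor limits quoted above, and a skewness value outside the –1 to +1 range would flag an excessively skewed distribution.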
Reliability
The reliability of an instrument is defined as the
extent to which it is free from random error.49 As reliability increases (or decreases), scores are more (or less) consistent and, therefore, measured variance reflects true variance in the construct (or random error). In keeping with this definition, reliability coefficients estimate the proportion of total score variance that is due to true score variance.49 In practice, the evaluation of reliability is in terms of two different aspects of a measure: internal consistency and reproducibility.82
Internal consistency is the extent to which items are interrelated.84 Three indicators of internal consistency can be derived: corrected item–total correlations, Cronbach’s α coefficients and homogeneity coefficients.
Corrected item–total correlations have been discussed above in the section on ‘Scaling assumptions’. The higher the correlation, the higher the variance shared by the item and the total score, and the higher the reliability of the item. Cronbach’s α provides an estimate of reliability based on all possible correlations between two sets of items within a scale.82 Although widely interpreted as such, strictly speaking α is not a measure of unidimensionality. Rather, α is a measure of level of mean intercorrelation weighted by variances. It will be higher when there is homogeneity of variances among items than when there is not. Furthermore, the formula for α also takes into account the number of items on the theory that the more items, the more reliable a scale will be.90–92 That is, when the number of items in a scale is higher, α will be higher even when the estimated average correlations are equal. Alpha coefficients exceeding 0.80 are considered acceptable for scales used to make group comparisons, whereas the more stringent criterion of 0.90–0.95 is required for scales used to make individual comparisons.10
As α coefficients are related to scale length,90–92 Ware and colleagues46 recommend that ‘homogeneity’ coefficients are also reported as indices of internal consistency. Homogeneity coefficients are simply the average item intercorrelations for scales; it is recommended that values exceed 0.30.87 They are of particular value when comparing the internal consistency of instruments with differing numbers of items within their subscales.
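The two internal consistency indices can be sketched as follows. The code uses the standard variance-based formula for Cronbach's α, α = k/(k − 1) × (1 − Σ item variances / variance of the total score), and takes the homogeneity coefficient as the mean of all pairwise inter-item correlations. The function names and the row-per-respondent data layout are assumptions made for illustration; complete data are assumed.

```python
from itertools import combinations
from math import sqrt
from statistics import mean, variance
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson product-moment correlation between two equal-length lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def cronbach_alpha(data: Sequence[Sequence[float]]) -> float:
    """Cronbach's alpha; data[r][i] is respondent r's score on item i."""
    k = len(data[0])
    item_vars = [variance([row[i] for row in data]) for i in range(k)]
    total_var = variance([sum(row) for row in data])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


def homogeneity(data: Sequence[Sequence[float]]) -> float:
    """Homogeneity coefficient: mean of all pairwise inter-item correlations."""
    k = len(data[0])
    cols = [[row[i] for row in data] for i in range(k)]
    return mean(pearson(cols[i], cols[j]) for i, j in combinations(range(k), 2))
```

Reporting both values side by side makes the dependence of α on scale length visible: two scales with the same homogeneity coefficient will show different α values if one has more items.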
Reproducibility
Reproducibility evaluates whether an instrument yields the same results on repeated assessments,
assuming that respondents have not changed on the domain being measured.93 Examples of reproducibility are parallel-forms, rater and test–retest reproducibility. Parallel-forms reproducibility is used when psychometrically identical versions of the same questionnaire are developed (to overcome the effects of memory or learning). Rater reproducibility is of importance in non-self-report measures, and is concerned with agreement between two or more ratings made by the same observer (intra-rater) or different observers (inter-rater) for the same patients. Thus, test–retest reproducibility is the most relevant form of reproducibility for patient-based outcome measures because parallel forms of measures do not usually exist and most measures are self-completed. It is examined by readministering the instrument to the same respondents after a specified period. If the results from the two time points have high agreement, the instrument demonstrates high test–retest reproducibility. Although there is no rule about the length of the test–retest interval, it needs to be sufficiently long to ensure that respondents are unlikely to recall their previous answers, but not so long that changes in health have occurred.10 Although the recommended range of the test–retest interval is between 2 and 14 days,66 this must be influenced by the nature of the study. Correlation coefficients are frequently used to measure test–retest reproducibility. This method has been criticised on the basis that the results may be highly correlated but systematically different.94 Therefore, an intraclass correlation coefficient (ICC), a measure of agreement, is recommended. This uses analysis of variance (ANOVA) to determine how much of the total variability in scores is due to true differences between individuals and how much to variability in measurement.95 Recommended minimum standards for
reproducibility are 0.80 for group comparisons and 0.90–0.95 for individual comparisons.10
Validity
Validity can be broadly defined as the extent to which an instrument measures the concept it purports or is intended to measure.61,96–98 Validity of measurement cannot be proven; rather, accumulating evidence is gathered, much as in a court case.72 There are three types of validity: content, criterion-related and construct.96
Content validity
This refers to how well an instrument covers the construct being measured. Appropriate methods of item generation and selection help to ensure content validity. For example, as only persons with
MS can truly define the aspects of health status affected by the disease, they act as the ultimate expert opinion. By involving a broad spectrum of MS patients and field testing large samples, omission of important domains and thus poor content validity are less likely. Nevertheless, it provides only weak evidence for the validity of a scale.
Criterion-related validity
This examines the degree to which a measure correlates with gold standard (criterion) measures obtained at a similar point in time (concurrent validity) or at a later time (predictive validity). Both types of criterion-related validity are expressed as correlations between the scale (predictor) and the criterion. However, as it is rare to find gold standard measures in the field of health status, more indirect approaches are recommended to evaluate validity.99
Construct validity
This process is used to establish the validity of a measurement instrument when no criterion or universe of content is accepted as entirely
adequate to define the attribute being measured.96 Construct validity involves testing hypotheses about how the instrument is expected to perform and examining the extent to which empirical data support these hypotheses.96 Although there are several methods for determining construct validity, two categories have been distinguished: internal and external construct validity,100 or psychometric and clinical tests of validity.101 In the absence of gold standard measures of health status, both types of validity should be evaluated, as they are independent, complementary and on their own insufficient.101
Internal construct validity involves statistical analyses of scale scores to determine whether hypotheses concerning the theoretical structure of the instrument are supported. These analyses include PCA, within-scale correlations and relative validity.101,102 Evidence for construct validity is provided if factor analysis confirms that the instrument consists of distinct scales that have items consistent with those hypothesised, and if item discrimination criteria are supported (see ‘Scaling assumptions’). Further evidence for construct validity is provided if correlations between the scales of an instrument conform to hypotheses about the magnitude and pattern of correlations. Relative validity assessment determines the degree to which the component scales of an instrument measure the underlying
groups that are of interest (e.g. those who are more or less disabled), the measurement precision of an instrument is quantified as the degree to which it separates these two groups (the difference between the mean scores) relative to the variance within the groups. F-statistics, derived from a one-way ANOVA, take both of these attributes into account as they indicate the ratio of between-groups (systematic) variance to within-group (error) variance.101 The higher the F-statistic, the greater the measurement precision. By comparing a number of instruments in the same sample, relative measurement precision is estimated as the ratio of pairwise F-statistics (F for one measure divided by F for another) and indicates, as a percentage, how much more (or less) precise one measure is compared with another at detecting group differences.64 In practice, the instrument with the largest F-statistic can be chosen as the arbitrary standard and assigned a relative measurement precision of 1. By comparing different scales, relative validity can be estimated by the ratio of pairwise F-statistics (F for one measure divided by F for another).
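The ratio of F-statistics can be sketched as follows. The one-way ANOVA F is computed directly from between-group and within-group sums of squares, and each measure's F is then divided by the largest F so that the reference measure has a relative precision of 1. The function names and the dictionary layout are assumptions made for illustration.

```python
from typing import Dict, Sequence


def f_statistic(groups: Sequence[Sequence[float]]) -> float:
    """One-way ANOVA F: between-groups mean square / within-group mean square."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


def relative_precision(f_by_measure: Dict[str, float]) -> Dict[str, float]:
    """Divide each F by the largest F, so the reference measure equals 1."""
    best = max(f_by_measure.values())
    return {name: f / best for name, f in f_by_measure.items()}
```

In the sketch, a measure whose F-statistic is one-quarter of the reference measure's F has a relative precision of 0.25, that is, it is 75% less precise at separating the same groups.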
In contrast, external (empirical) construct validity or clinical tests of validity examine the
relationships between the score on a given scale and external variables measured simultaneously or at a different point in time. This is an attempt to demonstrate that the instrument (1) measures what it is supposed to measure (convergent construct validity), (2) does not measure what it is not designed to measure (discriminant construct validity), (3) distinguishes between groups in predictable ways (group differences construct validity), and (4) produces results consistent with theoretical expectation (hypothesis testing).96,97
Responsiveness
Responsiveness is the ability of an instrument to measure clinically important change over time. While reliability and validity are the major determinants of the scientific robustness of a measure, the ability of an instrument to detect clinically significant change is also essential when evaluating the relative benefits of different interventions. This is particularly important when treatments are associated with small but significant benefits (a feature of current-day interventions in MS), which may be undetected by measures that are unresponsive. In such cases a clinically appropriate, reliable and valid, but unresponsive instrument is of limited value.
Although several methods have been used to assess the responsiveness of an instrument, there is
little consensus about which method is best and how results should be reported.103 The most common method of determining responsiveness is to examine the change scores following an intervention of known efficacy. Results are reported as an effect size, a standardised change score. There are different ways of calculating effect sizes, depending on whether the denominator is the standard deviation of baseline scores,104 the standard deviation of change scores (standardised response mean)105 or the standard error of change scores (t-statistics).106 These different methods of calculating effect sizes generate estimates of different magnitude and there is no consistent relationship between them.107 Responsiveness measures using effect sizes are termed prospective methods.108
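The three denominators can be compared side by side in a short sketch. It assumes paired baseline and follow-up scores for the same patients, defines change as follow-up minus baseline (a sign convention chosen here, not stated in the text), and uses sample standard deviations; the function name and dictionary keys are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev
from typing import Dict, Sequence


def responsiveness_indices(baseline: Sequence[float],
                           follow_up: Sequence[float]) -> Dict[str, float]:
    """Three standardised change scores sharing the same numerator."""
    change = [f - b for b, f in zip(baseline, follow_up)]
    mean_change = mean(change)
    sd_change = stdev(change)
    return {
        # mean change / SD of baseline scores
        "effect_size": mean_change / stdev(baseline),
        # mean change / SD of change scores
        "standardised_response_mean": mean_change / sd_change,
        # mean change / SE of change scores (paired t-statistic)
        "paired_t": mean_change / (sd_change / sqrt(len(change))),
    }
```

Because only the denominator differs, the three indices can rank the same instruments differently, which is one reason to report more than one of them when comparing measures.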
Another method of estimating the ability of instruments to detect change is by comparing change scores on a health status instrument with an external criterion of change, such as a transition question, also referred to as the global scale of change.109 In this method, either patients or clinicians assess the amount of change retrospectively using a transition question (e.g. 0 = no change, 1 = minimal improvement, 2 = moderate improvement, 3 = marked improvement). Responsiveness can then be determined in a number of ways, for example, correlating change scores with the transition question (high correlations indicate greater responsiveness).110 Alternatively, the minimum clinically important difference can be calculated111 by dividing the mean change score for minimally improved/deteriorated patients by the mean change score for unchanged patients. Finally, the coefficient proposed by Guyatt and colleagues112 can be calculated (mean change score in patients judged to have changed divided by the standard deviation of change scores in patients judged to have not changed). Norman and colleagues108 defined these as retrospective methods of examining responsiveness as they involve the determination of subgroups of patients on the basis of their degree of change, and then the retrospective computation of responsiveness. Recently, Norman and colleagues compared prospective and retrospective methods of reporting responsiveness108 and demonstrated that there is no consistent relationship between the results generated by the two methods.
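The coefficient of Guyatt and colleagues, as described above, can be sketched in a few lines. The sketch assumes one change score and one transition rating per patient, with a rating of 0 meaning 'no change' as in the example coding given in the text; the function name is hypothetical.

```python
from statistics import mean, stdev
from typing import Sequence


def guyatt_coefficient(change_scores: Sequence[float],
                       transition_ratings: Sequence[int]) -> float:
    """Mean change in patients judged to have changed, divided by the SD of
    change scores in patients judged not to have changed (rating 0)."""
    pairs = list(zip(change_scores, transition_ratings))
    changed = [c for c, t in pairs if t != 0]
    stable = [c for c, t in pairs if t == 0]
    return mean(changed) / stdev(stable)
```

The denominator needs at least two stable patients whose change scores are not all identical, otherwise the standard deviation is undefined or zero.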
As each method of reporting responsiveness has significant limitations, it is important that the relative responsiveness of competing measures is examined. This analysis is rarely undertaken.
Overview
This chapter outlines the development of the MSIS-29, a 29-item questionnaire designed to assess the impact of MS on people’s lives. The MSIS-29 was developed and tested in three stages. In stage 1 (item generation), a 129-item
questionnaire was generated from 30 patient interviews, expert opinion and a review of the literature. In stage 2 (item reduction and scale generation), the questionnaire was administered by postal survey to 1530 randomly selected members of the MS Society. Standard item reduction techniques were used to develop a 29-item scale (MSIS-29) measuring the physical (20 items) and psychological (nine items) impact of MS. In stage 3 (psychometric evaluation), six psychometric properties of the MSIS-29 (data quality, scaling assumptions, acceptability, reliability, validity and responsiveness) were evaluated in two studies. Data quality, scaling assumptions, acceptability, reliability and validity were evaluated in an independent sample of 1250 members of the MS Society. The responsiveness of the MSIS-29 was evaluated in 55 people with MS admitted for inpatient
rehabilitation or intravenous steroids for treatment of relapse. This chapter presents the methods and results of stages 1 and 2. The methods and results of stage 3 are presented in Chapter 4.
Methods
The MSIS-29 was developed at the Neurological Outcome Measures Unit of the Institute of
Neurology (ION)/National Hospital for Neurology and Neurosurgery (NHNN), Queen Square, London.
Item generation
Generation of an item pool
A pool of 129 items concerning the health impact of MS was generated from three sources:
semistructured interviews of people with MS, multidisciplinary expert opinion and a comprehensive literature review.
Thirty people with MS attending the MS clinical service of the NHNN consented to participate in semistructured interviews. They were selected to
represent as much diversity of illness as possible in terms of disability, duration of illness, age of onset and educational level. None of the patients who were asked to participate refused to be
interviewed. The sample included men and women in all diagnostic categories (i.e. primary progressive, secondary progressive and
relapsing–remitting MS), who represented the entire range of disability and illness duration and an age range similar to that of the British
population of people with MS. Table 1 presents the characteristics of the sample of patients who participated in the semistructured interviews. Interviews lasted for an average of 1 hour, and were tape recorded, transcribed and then content analysed. The interviews were carried out by a single investigator, Irene Richardson (IR), either at the patients’ homes or in a consultation room at NHNN. Statements relating to the health impact of MS on people’s lives were extracted from all interviews by IR and in parallel by one of the co-investigators (Ray Fitzpatrick) for approximately one-third of the interviews. The extraction process involved highlighting any phrase or sentence made by the patient that referred to the health impact of MS on their lives. Where two sets of statements had been extracted, comparisons were made. Agreement was high.
The extraction process was reviewed twice: first, to adjust the level of inclusiveness required, in particular to identify the areas covered by the interviews which did not relate to quality of life issues (e.g. people’s reaction to their diagnosis); and second, for the last eight interviews, only completely new items, which did not belong to any of the categories identified as irrelevant, were extracted. In total, 3750 health impact statements were extracted from the interviews (mean 125, range 64–212).
Extracted statements were then classified into 11 broad categories to facilitate presentation and readability (symptoms, activities of daily living, emotional impact of MS, doctor-related statements, drug side-effects, financial strain, required planning, public response, relapses, impact on and responses of significant others, and
wheelchair-related statements). Thus, these were the emergent themes of the interviews; these categories came about by simply content analysing the statements into a more manageable format. In each category, statements were further organised into subcategories (e.g. the broad category of symptoms included subcategories such as spasms, numbness or fatigue).
The elimination of redundant items was conducted in two stages (via a series of discussions among the study panel; see Acknowledgements section). First, redundant statements within each individual on a particular subcategory were eliminated. That is, statements made by the same individual with a high degree of overlap were discarded until only one relevant statement was retained. Next, redundant statements between patients were eliminated. For example, the statement ‘I have spasms’ was retained but ‘I have muscular spasm at night’ and ‘all my muscles were going into spasms’ were discarded (these three
statements were all made by different individuals). For both stages of elimination, it was decided to retain statements that were broader in content than specific in content, and which captured that particular subcategory succinctly. Any
disagreements were discussed among team
members and agreement was reached in all cases. Through this process, a first list of items was extracted containing 117 statements covering the whole range of issues raised by people with MS during the interviews. Items were again chosen to avoid idiosyncratic and highly specific responses. For example ‘can only walk short distances’ was chosen in preference to ‘can only walk 200 yards’. After discussions, it was agreed by all team members that statements regarding ‘coping with MS’, ‘positive impact of MS’ and ‘diagnosis’ statements were to be excluded, as the intention was for the questionnaire to focus on the impact of MS on their daily life.
TABLE 1 Characteristics of samples

Variable(a)                Semistructured   First field   Second field
                           interviews       test          test
N(b)                       30               766           713
Gender
  Female                   56               74            71
Age
  Mean (SD)                41 (12)          51 (12)       52 (12)
  Range                    23–70            2–87          18–82
Ethnicity
  White                    100              98            98
Years since MS onset
  Mean (SD)                12 (11)          19 (12)       19 (11)
  Range                    1–36             1–56          1–59
Mobility indoors
  Walks unaided            40               –(c)          32
  Walks with an aid        23               –             40
  Uses a wheelchair        37               –             28
Mobility
  Can walk                 NA               79            –
  Cannot walk              NA               21            –
Marital status
  Married                  77               66            70
Living with others                          83            81
Employment status
  Retired due to MS        63               54            56
  Employed                                  18            19
Type of MS
  Primary progressive      13.3             Unknown       Unknown
  Secondary progressive    43.4             Unknown       Unknown
  Relapsing–remitting      43.3             Unknown       Unknown

(a) All values are percentages unless specified otherwise.
(b) For whom both physical and psychological scale scores could be computed.
(c) Question not asked.
An additional 38 items were generated from the review of the literature and from interviews with health professionals at the NHNN who were involved in the care of people with MS (i.e. neurologists, neuropsychologists, nurses, occupational therapists, physiotherapists, social workers, speech and language therapists). There was no information to identify which of the 38 items were from healthcare professionals. However, all the items from the final measure (MSIS-29) can be referenced back to interviews from patients.
A preliminary 129-item questionnaire consisting of two sections was developed (see Appendix 1). Section 1 included items to evaluate how people perceive the impact of MS on various aspects of their lives. Section 2 included items to evaluate the extent of physical limitations of people with MS. These two sections were based on the most appropriate way to group the items without changing patients’ words. The time-frame specified for all items was the 2 weeks before completion of the questionnaire. Although any choice of time-frame is to some extent arbitrary, 2 weeks was chosen for three reasons. First, during pretesting, a number of patients commented that 2 weeks was the most appropriate time-frame. Second, MS clinicians commented that 2 weeks was the most clinically appropriate. Third, a 2-week time-frame is most suitable for use in clinical trials.
Response options
Examination of the content of the initial pool of 129 items indicated that two distinct question stems and response scales were required. The majority of items (n = 97) were best represented by the stem ‘How much have you been bothered by…’ with a five-point response option (1 = not at all, 2 = a little, 3 = moderately, 4 = quite a bit, 5 = extremely). The remaining items (n = 32) about activity limitations were best represented by the stem ‘How much has your MS limited your ability to…’ with a six-point response option (1 = not at all, 2 = a little, 3 = moderately, 4 = quite a bit, 5 = extremely, 6 = can’t do at all).
Pretesting
The preliminary questionnaire (including the instructions, item-stems, items and response options) was reviewed for content, wording and clinical appropriateness by patients, clinicians and researchers who were involved in its development. The preliminary version of the questionnaire was then pretested in two groups. First, it was pretested in an independent and heterogeneous sample of 20 people with MS who were attending the NHNN. These people were selected to be representative of the general MS population (see Table 2 for a breakdown of patient characteristics). They were asked to fill in the questionnaire in the presence of the project coordinator and to comment on it by identifying items and instructions that were unclear, ambiguous, irrelevant, misleading or offensive, and to make suggestions for alterations to the questionnaire. A formal cognitive laboratory interview with the ‘think-aloud’ approach was used. Second, the 30 people who had been interviewed (in the item generation stage) were all sent the preliminary questionnaire. These people were asked to fill in the questionnaire and to comment on it in the same manner as described above (the response rate was 70%; 21 patients returned the questionnaire). In addition, ten of these people were contacted by telephone to discuss the questionnaire. The final version of the questionnaire for the first field test consisted of 129 items (section 1, 97 items; section 2, 32 items). The questionnaire also contained socio-demographic questions and open-ended questions inviting comments or suggestions about the questionnaire.
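As an illustration only, the structure of the field-test questionnaire described above can be sketched as a small data structure. The class and function names below are hypothetical and do not come from the report. The pairing of section 1 with the ‘bothered by’ stem and section 2 with the ‘limited your ability to’ stem is inferred from the item counts given in the text (97 and 32).

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Section:
    """One questionnaire section (hypothetical representation)."""
    stem: str
    n_items: int
    options: Tuple[str, ...]  # option labels; response codes start at 1


FIVE_POINT = ("not at all", "a little", "moderately", "quite a bit", "extremely")
SIX_POINT = FIVE_POINT + ("can't do at all",)

# Pairing of sections and stems is an assumption based on the item counts.
SECTIONS = {
    1: Section("How much have you been bothered by...", 97, FIVE_POINT),
    2: Section("How much has your MS limited your ability to...", 32, SIX_POINT),
}


def is_valid_response(section: int, code: int) -> bool:
    """True if `code` is an allowed response code for the given section."""
    return 1 <= code <= len(SECTIONS[section].options)


def response_rate(returned: int, sent: int) -> float:
    """Percentage of questionnaires returned."""
    return 100.0 * returned / sent


total_items = sum(s.n_items for s in SECTIONS.values())
```

Under these assumptions, `total_items` reproduces the 129 items of the field-test version, and `response_rate(21, 30)` reproduces the 70% pretest return reported above.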
Item reduction and scale formation
The 129-item questionnaire was administered by postal survey to 1530 people, randomly selected and geographically stratified, from the membership database of the MS Society of Great Britain and Northern Ireland. This sampling frame has the advantage of being representative of people with MS in the MS Society membership
TABLE 2 Pretesting patient characteristics (n = 20)
Patients
Gender
  Male   8
  Female   12
Age (years)
  20–29   0
  30–39   8
  40–49   5
  50–59   5
  60–69   1
  70–79   1
Type of MS
  Relapsing–remitting   5
  Primary progressive   2
  Secondary progressive   13
Mobility indoors
  Walks unaided   6
  Walks with aid   7