Reliability

Basic Concepts

circumstance: a measure that is highly reliable when used with one group of people may be unreliable when used with a different group, for instance. For this reason it is more useful to evaluate how valid and reliable a measure is for a particular purpose and whether the levels of reliability and validity are acceptable in the context at hand. Reliability and validity are also discussed in Chapter 5, in the context of research design, and in Chapter 19, in the context of educational and psychological testing.

Reliability refers to how consistent or repeatable measurements are. For instance, if we give the same person the same test on two different occasions, will the scores be similar on both occasions? If we train three people to use a rating scale designed to measure the quality of social interaction among individuals, then show each of them the same film of a group of people interacting and ask them to evaluate the social interaction exhibited in the film, will their ratings be similar? If we have a technician measure the same part 10 times, using the same instrument, will the measurements be similar each time? In each case, if the answer is yes, we can say the test, scale, or instrument is reliable.

Much of the theory and practice of reliability was developed in the field of educational psychology, and for this reason, measures of reliability are often described in terms of evaluating the reliability of tests. But considerations of reliability are not limited to educational testing: the same concepts apply to many other types of measurements, including opinion polling, satisfaction surveys, and behavioral ratings.

The discussion in this chapter will be kept at a fairly basic level: the calculation of specific measures of reliability is discussed in more detail in Chapter 19, in connection with test theory. In addition, many of the measures of reliability draw on the correlation coefficient (also called simply the correlation), which is discussed in detail in Chapter 9, so beginning statisticians may want to concentrate on the logic of reliability and validity and leave the details of evaluating them until after they have mastered the concept of the correlation coefficient.

There are three primary approaches to measuring reliability, each useful in particular contexts and each having particular advantages and disadvantages:

• Multiple-occasions reliability
• Multiple-forms reliability
• Internal consistency reliability

Multiple-occasions reliability, sometimes called test-retest reliability, refers to how similarly a test or scale performs over repeated testings. For this reason it is sometimes referred to as an index of temporal stability, meaning stability over time. For instance, we might have the same person do a psychological assessment of a patient based on a videotaped interview, with the assessments performed two weeks apart based on the same taped interview. For this type of reliability to make sense, you must assume that the quantity being measured has not changed: hence the use of the same videotaped interview, rather than separate live interviews with a patient whose state may have changed over the two-week period. Multiple-occasions reliability is not a suitable measure for volatile qualities, such as mood state. It is also unsuitable if the focus of measurement may have changed over the time period between tests (for instance, if the student learned more about a subject between the testing periods) or may be changed as a result of the first testing (for instance, if a student remembers what questions were asked on the first test administration). A common technique for assessing multiple-occasions reliability is to compute the correlation coefficient between the scores from each occasion of testing: this is called the coefficient of stability.
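As a rough illustration of the coefficient of stability, the sketch below correlates two sets of scores from the same subjects on two occasions. The scores are invented for the example, and the pearson helper is a hypothetical name introduced here (the correlation coefficient itself is covered in Chapter 9); this is one plain way to do the computation, not a prescribed procedure.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least two scores")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    # Undefined (division by zero) if either list has no variation.
    return cov / sqrt(ss_x * ss_y)


# Hypothetical data: one rater scores the same six taped interviews
# on two occasions, two weeks apart.
occasion_1 = [12, 15, 9, 20, 17, 11]
occasion_2 = [13, 14, 10, 19, 18, 10]

coefficient_of_stability = pearson(occasion_1, occasion_2)
print(round(coefficient_of_stability, 3))
```

A value close to 1 would suggest that the scores are stable across the two occasions, under the assumption stated above that the quantity being measured has not changed in between.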

Multiple-forms reliability (also called parallel-forms reliability) refers to how similarly different versions of a test or questionnaire perform in measuring the same entity. A common type of multiple-forms reliability is split-half reliability, in which a pool of items believed to be homogeneous is created and half the items are allocated to form A and half to form B. If the two (or more) forms of the test are administered to the same people on the same occasion, the correlation between the scores received on each form is an estimate of multiple-forms reliability. This correlation is sometimes called the coefficient of equivalence. Multiple-forms reliability is important for standardized tests that exist in multiple versions: for instance, different forms of the SAT (Scholastic Aptitude Test, used to measure academic ability among students applying to American colleges and universities) are calibrated so the scores achieved are equivalent no matter which form is used.
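The split-half idea can be sketched in a few lines. The example below assumes a small invented data set in which each row holds one person's scores on six items, allocates alternating items to form A and form B (one arbitrary way to split the pool), and correlates the two form totals. The data, the alternating split, and the pearson helper are all assumptions made for illustration.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length score lists."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(ss_x * ss_y)


# Hypothetical data: six people, six items each scored from 1 to 5.
responses = [
    [4, 5, 4, 4, 5, 4],
    [2, 1, 2, 2, 1, 2],
    [3, 3, 4, 3, 3, 3],
    [5, 4, 5, 5, 5, 4],
    [1, 2, 1, 2, 1, 1],
    [3, 4, 3, 3, 4, 4],
]

# Allocate items 1, 3, 5 to form A and items 2, 4, 6 to form B.
form_a_totals = [sum(row[0::2]) for row in responses]
form_b_totals = [sum(row[1::2]) for row in responses]

coefficient_of_equivalence = pearson(form_a_totals, form_b_totals)
print(round(coefficient_of_equivalence, 3))
```

A different allocation of items to the two forms would generally give a somewhat different coefficient, which is why the choice of split matters when the items are not truly homogeneous.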

Internal consistency reliability refers to how well the items that make up a test reflect the same construct. To put it another way, internal consistency reliability measures how much the items on a test are measuring the same thing. This type of reliability may be assessed by administering a single test on a single occasion. Internal consistency reliability is a more complex quantity to measure than multiple-occasions or parallel-forms reliability, and several different methods have been developed to evaluate it: these are further discussed in Chapter 19. However, all depend primarily on the inter-item correlation, i.e., the correlation of each item on the scale with each other item. If such correlations are high, that is interpreted as evidence that the items are measuring the same thing and the various statistics used to measure internal consistency reliability will all be high. If the inter-item correlations are low or inconsistent, the internal consistency reliability statistics will be low and this is interpreted as evidence that the items are not measuring the same thing.

Two simple measures of internal consistency that are most useful for tests made up of multiple items covering the same topic, of similar difficulty, and that will be scored as a composite, are the average inter-item correlation and average item-total correlation. To calculate the average inter-item correlation, we find the correlation between each pair of items and take the average of all the correlations. To calculate the average item-total correlation, we create a total score by adding up scores on each individual item on the scale, then compute the correlation of each item with the total. The average item-total correlation is the average of those individual item-total correlations.
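The two averages just described can be sketched as follows. The function names and the response data are hypothetical, chosen only for this illustration; each row is assumed to hold one person's scores on every item, and the total score includes the item being correlated with it, as in the description above.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length score lists."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(ss_x * ss_y)


def average_inter_item_correlation(rows):
    """Mean of the correlations between every pair of items (columns)."""
    items = list(zip(*rows))  # transpose: one tuple of scores per item
    pairs = list(combinations(items, 2))
    return sum(pearson(a, b) for a, b in pairs) / len(pairs)


def average_item_total_correlation(rows):
    """Mean of the correlations between each item and the total score."""
    items = list(zip(*rows))
    totals = [sum(row) for row in rows]  # total score for each person
    return sum(pearson(item, totals) for item in items) / len(items)


# Hypothetical data: six people, six items each scored from 1 to 5.
responses = [
    [4, 5, 4, 4, 5, 4],
    [2, 1, 2, 2, 1, 2],
    [3, 3, 4, 3, 3, 3],
    [5, 4, 5, 5, 5, 4],
    [1, 2, 1, 2, 1, 1],
    [3, 4, 3, 3, 4, 4],
]

avg_inter_item = average_inter_item_correlation(responses)
avg_item_total = average_item_total_correlation(responses)
print(round(avg_inter_item, 3), round(avg_item_total, 3))
```

Both functions will fail with a division by zero if any item has identical scores for every person, since the correlation is undefined in that case.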

Split-half reliability, described above, is another method of determining internal consistency. This method has the disadvantage that, if the items are not truly homogeneous, different splits will create forms of disparate difficulty and the reliability coefficient will be different for each pair of forms. A method that overcomes this difficulty is Cronbach’s alpha (coefficient alpha), which is equivalent to the average of all possible split-half estimates. For more about Cronbach’s alpha, including a demonstration of how to compute it, see Chapter 19.
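For readers who want a preview, here is a minimal sketch of coefficient alpha using the commonly cited variance-based formula, alpha = k/(k - 1) × (1 - sum of item variances / variance of total scores), where k is the number of items. This formula is not taken from this chapter, and the function name and data are hypothetical; the Chapter 19 demonstration may present the computation differently.

```python
from statistics import variance


def cronbach_alpha(rows):
    """Coefficient alpha for rows of per-person item scores."""
    k = len(rows[0])  # number of items
    if k < 2:
        raise ValueError("alpha needs at least two items")
    # Sample variance of each item (column) across people.
    item_variances = [variance(item) for item in zip(*rows)]
    # Sample variance of each person's total score.
    total_variance = variance([sum(row) for row in rows])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)


# Hypothetical data: six people, six items each scored from 1 to 5.
responses = [
    [4, 5, 4, 4, 5, 4],
    [2, 1, 2, 2, 1, 2],
    [3, 3, 4, 3, 3, 3],
    [5, 4, 5, 5, 5, 4],
    [1, 2, 1, 2, 1, 1],
    [3, 4, 3, 3, 4, 4],
]

alpha = cronbach_alpha(responses)
print(round(alpha, 3))
```

Unlike a single split-half coefficient, this value does not depend on how the items happen to be divided between two forms.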

In document OReilly Statistics in a Nutshell A Desktop Quick Reference Aug 2008 pdf (Page 33-35)