Scale Development Research: A Content Analysis and Recommendations for Best Practices

Roger L. Worthington and Tiffany A. Whittaker

The Counseling Psychologist 2006; 34; 806
DOI: 10.1177/0011000006288127
http://tcp.sagepub.com

The online version of this article can be found at: http://tcp.sagepub.com/cgi/content/abstract/34/6/806

Published by: http://www.sagepublications.com

On behalf of: Division of Counseling Psychology of the American Psychological Association



Roger L. Worthington University of Missouri–Columbia

Tiffany A. Whittaker University of Texas at Austin

The authors conducted a content analysis on new scale development articles appearing in the Journal of Counseling Psychology during 10 years (1995 to 2004). The authors analyze and discuss characteristics of the exploratory and confirmatory factor analysis procedures in these scale development studies with respect to sample characteristics, factorability, extraction methods, rotation methods, item deletion or retention, factor retention, and model fit indexes. The authors uncovered a variety of specific practices that were at variance with the current literature on factor analysis or structural equation modeling. They make recommendations for best practices in scale development research in counseling psychology using exploratory and confirmatory factor analysis.

Counseling psychology has a rich tradition producing psychometrically sound instruments for applications in research, training, and practice. Many areas of scholarly inquiry in counseling psychology continue to be ripe for scale development research. In a special issue of the Journal of Counseling Psychology (JCP) on quantitative research methods, Dawis (1987) provided an overview of scale development techniques, Tinsley and Tinsley (1987) discussed the use of factor analysis, and Fassinger (1987) presented an overview of structural equation modeling (SEM). Although these articles continue to be cited in counseling psychology research, recent advances require updated information and a comprehensive overview of all three topics. More recently, Quintana and Maxwell (1999) and Martens (2005) provided comprehensive updates of SEM, but their focus was not specifically on its use in scale development research (see also Martens & Hasse, 2006 [this issue]; Weston & Gore, 2006 [TCP, special issue, part 1]).

The purpose of this article is threefold: (a) to provide an overview of the steps taken in the scale development process using exploratory factor analysis (EFA) and confirmatory factor analysis (CFA), (b) to assess current practices by reporting the results of a 10-year content analysis of scale development research in counseling psychology, and (c) to provide a set of recommendations for best practices in using EFA and CFA in scale development (for more on factor analysis, see Kahn, 2006 [TCP, special issue, part 1]). We assume the reader has basic knowledge of psychometrics, including principles of reliability (Helms, Henze, Sass, & Mifsud, 2006 [TCP, special issue, part 1]), validity (Hoyt, Warbasse, & Chu, 2006 [this issue]), and multivariate statistics (Sherry, 2006 [TCP, special issue, part 1]). We begin with an overview of EFA and CFA, followed by a discussion of the procedure we used in conducting our content analysis. We then embed the findings of our content analysis within more detailed discussions of EFA and CFA, identifying potential problems and highlighting best practices. We conclude with an integrative discussion of best practices and findings from the content analysis.

The authors contributed equally to the writing of this article. We would like to thank Jeffrey Andreas Tan for his assistance with the content analysis. Address correspondence to Roger L. Worthington, Department of Educational, School, and Counseling Psychology, University of Missouri, Columbia, MO 65211; e-mail: [email protected]

THE COUNSELING PSYCHOLOGIST, Vol. 34 No. 6, November 2006 806-838 DOI: 10.1177/0011000006288127

OVERVIEW OF EFA AND CFA

Factor analysis is a technique used to identify or confirm a smaller number of factors or latent constructs from a large number of observed variables (or items). There are two main categories of factor analysis: (a) exploratory and (b) confirmatory (Kahn, 2006 [TCP, special issue, part 1]). Although researchers may use factor analysis for a range of purposes, one of the most prevalent uses of factor-analytic techniques is to support the validity of newly developed tests or scales; that is, does the newly developed test or scale measure the intended construct(s)? More specifically, the application of factor analysis to a set of items may help researchers answer the following questions: How many factors or constructs underlie the set of items? What are the defining features or dimensions of the factors or constructs that underlie the set of items (Tabachnick & Fidell, 2001)?

EFA assesses the construct validity during the initial development of an instrument. After developing an initial set of items, researchers apply EFA to examine the underlying dimensionality of the item set. Thus, they can group a large item set into meaningful subsets that measure different factors. The primary reason for using EFA is that it allows items to be related to any of the factors underlying examinee responses. As a result, the developer can easily identify items that do not measure an intended factor or that simultaneously measure multiple factors, in which case they could be poor indicators of the desired construct and eliminated from further consideration.

When used for scale development, EFA becomes a combination of qualitative and quantitative methods, which can be either confusing or enlivening for researchers. We have found that novices (and some who are not novice) hope to have the statistical program produce the ultimate solution that will provide them with a set of empirically determined, indisputable dimensions or factors. However, effectively using EFA procedures requires researchers to use inductive reasoning, while patiently and subtly adjusting and readjusting their approach to produce the most meaningful results. Therefore, the process of scale development using EFA can become a relatively dynamic process of examination and revision, followed by more examination and revision, ultimately leading to a tentative rather than a definitive outcome.

The most current approach in conducting CFA is to use SEM. Prior to analyzing the data, a researcher must indicate (a) how many factors are present in an instrument, (b) which items are related to each factor, and (c) whether the factors are correlated or uncorrelated (issues that are revealed during the process of EFA). Because the items are generally constrained to load on only one factor in CFA, it is generally intended not to explore whether a given item measures no factors, one factor, or multiple factors but instead to evaluate or confirm the extent to which the researcher’s measurement model is replicated in the sample data. Thus, it is critical to have prior knowledge of the expected relationships between items and factors before conducting CFA (hence the term confirmatory). SEM is a powerful confirmatory technique because it allows the researcher greater control over the form of constraints placed on items and factors when analyzing a hypothesized model. Furthermore, as we discuss later, researchers can also use SEM to examine competing models to assess the extent to which one hypothesized model fits the data better than an alternative model. In our discussion, we provide information about the basic concepts and procedures necessary to use SEM in scale development research. For more advanced discussions of SEM, we refer readers to several existing books and articles (e.g., Bollen, 1989; Kline, 2005; Martens, 2005; Martens & Hasse, 2006; Quintana & Maxwell, 1999; Thompson, 2004).
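The three decisions listed above (number of factors, item-to-factor assignment, and whether factors correlate) can be written down as an explicit model specification before any data are analyzed. The following Python sketch is illustrative only. It assembles a specification in the text syntax used by the R package lavaan (`=~` for "is measured by", and `~~` with a `0*` prefix to fix a covariance at zero), and it enforces the typical CFA constraint that each item loads on one factor. The function name, factor names, and item names are hypothetical.

```python
def build_cfa_spec(factor_items, correlated=True):
    """Build a lavaan-style CFA specification string.

    factor_items: dict mapping each factor name to its list of item names.
    correlated:   if False, every factor covariance is fixed at zero.
    """
    # Each item may be assigned to only one factor (no cross-loadings).
    assigned = {}
    for factor, items in factor_items.items():
        for item in items:
            if item in assigned:
                raise ValueError(
                    f"{item} is assigned to both {assigned[item]} and {factor}"
                )
            assigned[item] = factor

    # One measurement line per factor: factor =~ item1 + item2 + ...
    lines = [f"{factor} =~ " + " + ".join(items)
             for factor, items in factor_items.items()]

    # Uncorrelated factors: fix each pairwise covariance at zero.
    if not correlated:
        names = list(factor_items)
        for i, first in enumerate(names):
            for second in names[i + 1:]:
                lines.append(f"{first} ~~ 0*{second}")
    return "\n".join(lines)


# Hypothetical three-factor instrument with three items per factor.
spec = build_cfa_spec(
    {
        "comfort": ["c1", "c2", "c3"],
        "openness": ["o1", "o2", "o3"],
        "learning": ["l1", "l2", "l3"],
    },
    correlated=True,
)
print(spec)
```

Writing the specification as data makes the confirmatory nature of the analysis visible: nothing about the factor structure is left for the software to discover.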

CONTENT-ANALYSIS PROCEDURE

To provide context for our discussion of scale development best practices, we conducted a content analysis of scale development articles in counseling psychology that reflect common practices. In this section, we provide an overview of the article-selection process used in our content analysis. We then integrate the findings of our content analysis into the remainder of the article as we review the literature and recommend best practices for scale development.

We reviewed scale development articles published in JCP in the 10 years between 1995 and 2004, inclusive (see appendix for a list of articles). We based our selection of articles on two central criteria: We included (a) only new scale development research articles (i.e., we excluded articles investigating only the reliability, validity, or revisions of existing scales) and (b) only articles that reported results from EFA or CFA. A paid graduate student assistant reviewed the tables of contents for each issue of JCP published during the specified time frame. We instructed the graduate student to err on the side of being overly inclusive, which resulted in the identification of 38 articles that used EFA or CFA to examine the psychometric properties of measurement instruments. The first author reviewed these articles and eliminated 15 that did not meet the selection criteria, resulting in 23 articles for our sample. Next, the first author and second author independently evaluated the 23 articles to identify and quantify the EFA and CFA characteristics. The only discrepancies in the independent evaluations of the articles were because of clerical errors in recording descriptive information (as opposed to disagreement in classification), which we jointly checked and verified.

We were interested in a number of characteristics of the studies. For studies reporting EFA procedures, we were interested in the following: (a) sample characteristics, (b) criteria for assessing the factorability of the correlation matrix, (c) extraction methods, (d) criteria for determining rotation method, (e) rotation methods, (f) criteria for factor retention, (g) criteria for item deletion, and (h) purposes and criteria for optimizing scale length (see Table 1). For studies reporting CFA procedures, we were interested in the following: (a) using SEM versus alternative methods as a confirmatory approach, (b) sample-size criteria, (c) fit indexes, (d) fit-index criteria, (e) cross-validation indexes, and (f) model-modification issues (see Table 2).

THE PROCESS OF SCALE DEVELOPMENT RESEARCH

There are various strategies used in scale construction, often described using somewhat differing labels for similar approaches. Brown (1983) summarized three primary strategies: logical, empirical, and homogeneous. Friedenberg (1995) identified a slightly different set of categories: logical-content or rational, theoretical, and empirical, in which the latter contains criterion group and factor analysis methods. The rational or logical approach simply uses the scale developer’s judgments to identify or construct items that are obviously related to the characteristic being measured. The theoretical approach uses psychological theory to determine the content of the scale items. Both the theoretical and the rational or logical approaches are no longer popular methods in scale development. The more rigorous empirical approach uses statistical analyses of item responses as the basis for item selection based on (a) predictive utility for a criterion group (e.g., depressives) or (b) homogeneous item groupings. The method described in this article is an empirical approach that employs factor analysis to form homogeneous item groupings.

TABLE 1: Characteristics of Exploratory Factor Analyses Used in Scale Development Studies Published in the Journal of Counseling Psychology (1995 to 2004)

Characteristic Frequency

Sample characteristics
  Convenience sample 5
  Purposeful sample of target group 10
  Convenience and purposeful sampling 6
Criteria used to assess factorability of correlation matrix
  Absolute sample size 1
  Item intercorrelations 1
  Participants per item ratio 3
  Bartlett’s test of sphericity 5
  Kaiser-Meyer-Olkin test of sample adequacy 7
  Unspecified 11
Extraction method
  Principal-components analysis 9
  Common-factors analysis
    Principal-axis factoring 6
    Maximum likelihood 3
    Unspecified 1
  Combination principal-components analysis and common-factors analysis 1
  Unspecified 1
Criteria for determining rotation method
  Subscale intercorrelations 2
  Theory 3
  Both 1
  Other 3
  Unspecified 12
Rotation method
  Orthogonal
    Varimax 8
    Unspecified 1
  Oblique
    Promax 1
    Oblimin 3
    Unspecified 4
  Both orthogonal and oblique 3
  Unspecified 1
Criteria for item deletion or retention
  Loadings 16
  Cross-loadings 13
  Communalities 0
  Item analysis 1
  Other 3
  Unspecified 2
  No items were deleted 2

A number of authors have recommended similar sequences of steps to be taken prior to using factor-analytic techniques (e.g., Anastasi, 1988; Dawis, 1987; DeVellis, 2003). We review these preliminary steps in the following section because, as is the case in most scientific endeavors, early mistakes in scale development often lead to problems later in the process. Once we have described all the steps in some detail, we address the extent to which the studies in our content analysis incorporated the steps in their designs.

Although there is little variation between models proposed by different authors, we rely primarily on DeVellis (2003) as the most current resource. Thus, the following description is only one of several similar models available and does not reflect a unitary best practice. DeVellis (2003) recommends the following steps in constructing new instruments: (a) Determine clearly what you want to measure, (b) generate an item pool, (c) determine the format of the measure, (d) have experts review the initial item pool, (e) consider inclusion of validation items, (f) administer items to a development sample, (g) evaluate the items, and (h) optimize scale length.

TABLE 1 (continued)

Criteria for factor retention
  Eigenvalues 18
  Scree plot 17
  Minimum proportion of variance accounted for by factor 2
  Number of items per factor 4
  Simple structure 5
  Conceptual interpretability 15
  Other 3
  Unspecified 2
Optimizing scale length
  None attempted 15
  Purpose
    Reduce total scale length 2
    Limit total items per factor 3
    Balance items per factor 2
  Criteria
    Redundant items 1
    Conceptually unrelated items 1
    Statistical invariance 1
    Cross-loadings 1
    Dropped items with lowest loadings 4
    Item content 2

NOTE: Values in each category may not sum to equal the total number of studies because some studies may have reported more than one criterion or approach.

TABLE 2: Characteristics of Confirmatory Factor Analyses Used in Scale Development Studies Published in the Journal of Counseling Psychology (1995 to 2004)

SEM versus FA as a confirmatory approach
  SEM used 14
  FA used 2
Typical SEM approaches
  Single-model approach 2
  Competing-models approach 8
    Nested models compared 4
    Nonnested or equivalent models compared 4
Sample-size criteria (SEM only)
  Participants per parameter 1
  Unspecified 13
Overall model fit
  Chi-square 12
  Chi-square and df ratio 6
Incremental fit indexes reported
  CFI 8
  PCFI 1
  IFI 2
  NFI 4
  NNFI/TLI 7
  RNI 1
Absolute fit indexes reported
  GFI 10
  AGFI 6
  RMSEA 6
  RMSEA with confidence intervals 1
  RMR 4
  SRMR 1
  Hoelter N 1
Predictive fit indexes reported
  AIC 2
  CAIC 1
  ECVI 2
  BIC 1
Fit index criteria
  Recommended cutoff 11
  Unspecified 3

In scale development, the first step is to define your construct clearly and concretely, using both existing theory and research to provide a sound conceptual foundation. This is sometimes more difficult than it may initially appear because it requires researchers to distinctly define the attributes of abstract constructs. Nothing is more difficult to measure than an ill-defined construct because it leads to the inclusion of items that may be only peripherally related to the construct of interest or to the exclusion of items that are important components of the content domain.

The next step is to generate a pool of items designed to tap the construct. Ultimately, the objective is to arrive at a set of items that clearly represent the construct of interest so that factor-analytic, data-reduction techniques yield a stable set of underlying factors that accurately reflect the construct. Items that are poorly worded or not central to a clearly articulated construct will introduce potential sources of error variance, reducing the strength of correlations among items, and will diminish the overall objectives of scale development (see Quintana & Minami, 2006 [this issue], on dealing with measurement error in meta-analyses). In general, researchers should write items so that they are clear, concise, readable, distinct, and reflect the scale’s purpose (e.g., produce responses that can be scored in a meaningful way in relation to the construct definition). DeVellis (2003) and Anastasi (1988) offer a host of recommendations for generating quality items and choosing a response format that are beyond the scope of this article. It suffices to say that the researcher should not take the quality of the item pool lightly, and a carefully planned approach to item generation is a critical beginning to scale development research.

TABLE 2 (continued)

Model modification
  Lagrange multiplier 3
  Wald statistic 0
  Item parceling 2

NOTE: Values in each category may not sum to equal the total number of studies because some studies may have reported more than one criterion or approach. AGFI = Adjusted Goodness-of-Fit Index; AIC = Akaike’s Information Criterion; BIC = Bayesian Information Criterion; CAIC = Consistent Akaike’s Information Criterion; CFI = Comparative Fit Index; ECVI = Expected Cross-Validation Index; FA = Common-Factors Analysis; GFI = Goodness-of-Fit Index; IFI = Incremental Fit Index; NFI = Normed Fit Index; NNFI/TLI = Nonnormed Fit Index or Tucker-Lewis Index; PCFI = Parsimony Comparative Fit Index; RMR = Root Mean-Square Residual; RMSEA = Root Mean-Square Error of Approximation; RNI = Relative Noncentrality Index; SEM = Structural Equation Modeling; SRMR = Standardized Root Mean-Square Residual.

Having the items reviewed by one or more groups of knowledgeable people (experts) to assess item quality on a number of different dimensions is another critical step in the process. At a minimum, expert review should involve an analysis of content validity (e.g., the extent to which a set of items reflects the content domain). Experts can also evaluate items for clarity, conciseness, grammar, reading level, face validity, and redundancy. Finally, it is also helpful at this stage for experts to suggest new items and to comment on the length of administration.

Although it is possible to include additional scales for participants to complete that may provide information about convergent and discriminant validity, we recommend that researchers limit such efforts at this stage of development. We recommend this for two reasons. First, it is wise to keep the total questionnaire length as short as possible and directly related to the study’s central purpose. The longer the questionnaire, the less likely potential participants will be to volunteer for the study or to complete all the items (Converse & Presser, 1986). Scale development studies sometimes include as many as 3 to 4 times the number of items that will eventually end up on the instrument, making inclusion of additional scales prohibitive. Second, there are several ways that items from other measures may interact with items designed for the new instrument to affect participant responses and, thus, to interfere in the scale development process. In particular, it would be very difficult, if not impossible, to control for order effects of different measures while testing the initial factor structure for the new scale. Randomly administering existing measures with the other instruments might contaminate participants’ responses on the items for the new scale, but administering the new items first to avoid contamination eliminates an important procedure commonly used when researchers use multiple self-report scales concurrently within a single study. Thus, we believe that it is important to avoid influencing item responses during the initial phase of scale development by limiting the use of additional measures. Although ultimately a matter of researcher judgment, assessing the convergent and discriminant validity (e.g., correlation with other measures) is an important step that we believe should occur later in the process of scale development.

Of the 23 studies in our content analysis, 14 reported a construct or scale definition that guided item generation, and all but 2 studies indicated that item generation was based on prior theoretical and empirical literature in the field. Occasionally, however, we found that articles provided only sparse details in the introductory material articulating the theoretical foundations for the research. The studies in our review used various item-generation approaches. All the approaches involved some form of rational item generation, with the primary variations involving the combination of rational and empirical approaches. Although the extensiveness and specific approaches of the procedures varied widely, only a few studies (n = 2) did not include (or failed to report) expert review of item sets prior to conducting EFA or CFA. Finally, our content analysis showed three typical patterns with respect to the inclusion of validity items during administration to the initial development sample: (a) administering only the scale items (no validity items being included), (b) assessing only social desirability along with the scale items, or (c) administering numerous other scales along with the scale items to provide additional evidence of convergent and discriminant validity.

THE ORDERING OF EFA AND CFA IN NEW SCALE DEVELOPMENT RESEARCH

Researchers typically use CFA after an instrument has already been assessed using EFA, and they want to know if the factor structure produced by EFA fits the data from a new sample. An alternative, less typical approach is to perform CFA to confirm a theoretically driven item set without the prior use of EFA. However, Byrne (2001) stated that “the application of CFA procedures to assessment instruments that are still in the initial stages of development represents a serious misuse of this analytic strategy” (p. 99). Furthermore, reporting the findings of a single CFA is of little advantage over conducting a single EFA. Specifically, research has shown that exploratory methods (i.e., principal-axis and maximum-likelihood factor analysis) are able to recover the correct factor model satisfactorily a majority of the time (Gerbing & Hamilton, 1996). In addition, a key validity issue is the replication of the hypothesized factor structure using a new sample. Thus, rather than produce a CFA that would ultimately need to be followed by a second CFA, the most logical approach would be to conduct an EFA followed by a CFA in all cases. Thus, when developing new scales, researchers should conduct an EFA first, followed by CFA. Regardless of how effectively the researcher believes item generation has reproduced the theorized latent variables, we believe that the initial validation of an instrument should involve empirically appraising the underlying factor structure (i.e., EFA).

Of the 23 new scale development articles we reviewed, a significant majority conducted EFA followed by CFA (n = 10) or only EFA without CFA (n = 8). One article reported using SEM following EFA, but the procedure was inconsistent with CFA. Two smaller subsets of articles reported only CFA (n = 2) or conducted CFA followed by EFA (n = 2). In the two studies in which EFA followed CFA, researchers had produced theoretically derived instruments that they believed required only a confirmation of the hypothesized factor structure (which proved wrong in both cases). As a result, when the hypothesized factor structure did not fit the data using SEM, the researchers reverted to EFA (using the same sample) as a means of uncovering the underlying factor structure, a somewhat questionable procedure that could have been avoided if they had relied on EFA in the first place. The studies that successfully used only CFA included one that reported only a single CFA and another that reported two consecutive CFAs (in which the second replicated the findings of the first).

EFA

Development sample characteristics. Representativeness in scale development research does not follow conventional wisdom; that is, it is not necessary to closely represent any clearly identified population as long as those who would score high and those who would score low are well represented (Gorsuch, 1997). Furthermore, one reason many scholars have consistently advocated for large samples in scale development research (see further on) is that scale variance attributable to specific participants tends to be cancelled by random effects as sample size increases (Tabachnick & Fidell, 2001). Nevertheless, samples that do not adequately represent the population of interest affect factor-structure stability and generalizability. When all participants are drawn from a particular source sharing certain characteristics (e.g., age, education, socioeconomic status, and racial and ethnic group), even large samples will not sufficiently control for the systematic variance produced by these characteristics. Thus, it is advisable to ensure the appropriateness of the development sample to the degree possible before conducting an EFA.

An important caveat with respect to sample characteristics is that in counseling psychology research, there are many potential populations whose members may be difficult to identify or from whom it is particularly difficult to solicit participation (e.g., lesbian, gay, bisexual, and transgender individuals and persons with disabilities). Under circumstances where a researcher believes that the sample characteristics might be at variance from unknown population characteristics, she or he may be forced to adjust to these unknowns and simply move forward with a sample that is adequate but not ideal (Worthington & Navarro, 2003).

In the studies we reviewed for the content analysis, some form of purposeful sampling from a specific target population was the most common approach, followed by a combination of convenience and purposeful sampling. Only about 25% of the studies used convenience sampling, most often with undergraduate student participants. Three of the studies we reviewed used split samples (i.e., a large sample split into two groups for separate analyses).
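A split-sample design of the kind just mentioned can be sketched as a random partition of participant identifiers into two halves, one for EFA and one for CFA. The Python sketch below is only an illustration under stated assumptions: the function name, the equal split, and the fixed seed are hypothetical choices, and the source does not describe how the reviewed studies performed their splits.

```python
import random


def split_sample(participant_ids, seed=0):
    """Randomly split participants into two non-overlapping halves.

    Returns (efa_ids, cfa_ids). With an odd number of participants,
    the second half receives the extra case.
    """
    ids = list(participant_ids)
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    rng.shuffle(ids)
    half = len(ids) // 2
    return ids[:half], ids[half:]


# Hypothetical sample of 400 participants numbered 0..399.
efa_ids, cfa_ids = split_sample(range(400), seed=1)
print(len(efa_ids), len(cfa_ids))
```

Each half must still satisfy the sample-size considerations discussed in the next section, so a split is only sensible when the full sample is large.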

Sample size. Sample size is an issue that has received considerable discussion in the literature. There are two central risks with using too few participants: (a) Patterns of covariation may not be stable, because chance can substantially influence correlations among items when the ratio of participants to items is relatively low; and (b) the development sample may not adequately represent the intended population (DeVellis, 2003). Comrey (1973) has been cited often as classifying a variety of sample sizes from very poor (N = 50) to excellent (N = 1,000) based solely on the number of participants in a sample and as recommending at least 300 cases for factor analysis. Gorsuch (1983) has also proposed guidelines for minimum ratios of participants to items (5:1 or 10:1), which have been widely cited in counseling psychology research. However, other authors have pointed out that these general guidelines may be misleading (MacCallum, Widaman, Zhang, & Hong, 1999; Tabachnick & Fidell, 2001; Velicer & Fava, 1998). In general, there is some agreement that larger sample sizes are likely to result in more stable correlations among variables and will result in greater replicability of EFA outcomes. Velicer and Fava (1998) produced evidence indicating that any ratio less than a minimum of three participants per item is inadequate, and there is additional evidence that factor saturation (the number of items per factor) and item communalities are the most important determinants of adequate sample size (Guadagnoli & Velicer, 1988; MacCallum et al., 1999). Thus, we offer four overarching guidelines: (a) Sample sizes of at least 300 are generally sufficient in most cases, (b) sample sizes of 150 to 200 are likely to be adequate with data sets containing communalities higher than .50 or with 10:1 items per factor with factor loadings at approximately |.4|, (c) smaller sample sizes may be adequate if all communalities are .60 or greater or with at least 4:1 items per factor and factor loadings greater than |.6|, and (d) sample sizes less than 100 or with fewer than 3:1 participant-to-item ratios are generally inadequate (Reise, Waller, & Comrey, 2000; Thompson, 2004). Note that this requires researchers to set a minimum sample size at the outset and to evaluate the need for additional data collection based on the outcomes of an initial EFA.
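The four guidelines can be read as a small decision rule. The Python sketch below is one possible reading, not a definitive implementation: the function name, the order in which the guidelines are checked (guideline d first), the treatment of "approximately |.4|" as "at least .40", and the "undetermined" outcome for cases the guidelines do not cover are all assumptions made for illustration.

```python
def sample_size_guideline(n, n_items, min_communality=None,
                          items_per_factor=None, typical_loading=None):
    """Return the letter of the guideline that applies: 'a', 'b', 'c', 'd',
    or 'undetermined' when the available information fits none of them.

    Communalities, items per factor, and loadings are only known after an
    initial EFA, so they are optional.
    """
    ratio = n / n_items  # participants per item

    # (d) Fewer than 100 cases or fewer than 3 participants per item.
    if n < 100 or ratio < 3:
        return "d"
    # (a) At least 300 cases.
    if n >= 300:
        return "a"

    have_structure = items_per_factor is not None and typical_loading is not None

    # (b) 150 or more cases with communalities above .50, or about
    #     10 items per factor with loadings near |.4| (assumed: >= .40).
    meets_b = (min_communality is not None and min_communality > 0.50) or (
        have_structure and items_per_factor >= 10
        and abs(typical_loading) >= 0.40)
    if n >= 150 and meets_b:
        return "b"

    # (c) Smaller samples with all communalities >= .60, or at least
    #     4 items per factor with loadings above |.6|.
    meets_c = (min_communality is not None and min_communality >= 0.60) or (
        have_structure and items_per_factor >= 4
        and abs(typical_loading) > 0.60)
    if meets_c:
        return "c"

    return "undetermined"


# Hypothetical study: 180 participants, 30 items, lowest communality .55.
print(sample_size_guideline(180, 30, min_communality=0.55))
```

Because guidelines (b) and (c) depend on communalities and loadings, the rule can only be fully evaluated after an initial EFA, which is why a minimum sample size has to be set at the outset and revisited afterward.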

In our content analysis, absolute magnitude of sample sizes and participant-per-item ratios were virtually the only references made with respect to sample size, and both varied widely. Absolute sample sizes varied from 84 to 411 (M = 258.95; SD = 100.80). Participant-per-item ratios varied from 2:1 to 35:1 (the modal ratio was 3:1). The authors addressed no other sample-size criteria when discussing the adequacy of their sample sizes.

Factorability of the correlation matrix. Although many people are familiar with the previously described standards regarding sample size, the factorability of a data set also has been related to the sizes of correlations in the matrix. Researchers can use Bartlett’s (1950) test of sphericity to estimate the probability that correlations in a matrix are 0. However, it is highly susceptible to the influence of sample size and likely to be significant for large samples with relatively small correlations (Tabachnick & Fidell, 2001). Thus, we recommend using this test only if there are fewer than about 5 cases per variable, but this becomes moot with samples containing fewer than three cases per variable (see earlier). In studies with cases-per-item ratios higher than 5:1, we recommend that researchers provide additional evidence for scale factorability.
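For readers who want to see what the test computes, the following Python sketch evaluates the standard chi-square approximation for Bartlett's test of sphericity, chi-square = -(N - 1 - (2p + 5)/6) ln|R| with p(p - 1)/2 degrees of freedom, where R is the p x p correlation matrix and N is the number of cases. The formula is the usual textbook form rather than something given in this article, and the function names and the example matrix are hypothetical. The sketch uses only the standard library, so it returns the statistic and degrees of freedom and leaves the p value to a chi-square table or a statistics package.

```python
import math


def determinant(matrix):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [list(map(float, row)) for row in matrix]
    n = len(a)
    det = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            return 0.0
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            det = -det  # a row swap flips the sign
        det *= a[col][col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
    return det


def bartlett_sphericity(corr, n_cases):
    """Return (chi_square, df) for Bartlett's test of sphericity."""
    p = len(corr)
    det = determinant(corr)
    if det <= 0:
        raise ValueError("correlation matrix is singular or not positive definite")
    chi_square = -(n_cases - 1 - (2 * p + 5) / 6) * math.log(det)
    df = p * (p - 1) // 2
    return chi_square, df


# Hypothetical 3-item correlation matrix with all correlations equal to .50.
R = [[1.0, 0.5, 0.5],
     [0.5, 1.0, 0.5],
     [0.5, 0.5, 1.0]]
chi_square, df = bartlett_sphericity(R, n_cases=100)
print(round(chi_square, 2), df)
```

The multiplier on ln|R| grows with N, which shows directly why the test tends to reach significance in large samples even when correlations are small.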

The Kaiser-Meyer-Olkin (KMO) measure of sampling adequacy is also useful for evaluating factorability. This measure of sampling adequacy accounts for the relationship of partial correlations to the sum of squared correlations. Thus, it indicates the extent to which a correlation matrix actually contains factors or simply chance correlations between a small subset of variables. Tabachnick and Fidell (2001) suggested that values of .60 and higher are required for good factor analysis.
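The overall KMO value is commonly computed as the sum of squared off-diagonal correlations divided by that sum plus the sum of squared off-diagonal partial correlations, where the partial correlations are obtained from the inverse of the correlation matrix. The Python sketch below follows that common definition; it is not taken from this article, and the function names and example matrix are hypothetical.

```python
import math


def invert(matrix):
    """Matrix inverse by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    aug = [list(map(float, row)) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pivot_value = aug[col][col]
        aug[col] = [v / pivot_value for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def kmo(corr):
    """Overall Kaiser-Meyer-Olkin measure for a correlation matrix."""
    inv = invert(corr)
    p = len(corr)
    sum_r2 = 0.0  # squared off-diagonal correlations
    sum_p2 = 0.0  # squared off-diagonal partial correlations
    for i in range(p):
        for j in range(p):
            if i != j:
                partial = -inv[i][j] / math.sqrt(inv[i][i] * inv[j][j])
                sum_r2 += corr[i][j] ** 2
                sum_p2 += partial ** 2
    if sum_r2 + sum_p2 == 0:
        raise ValueError("all off-diagonal correlations are zero")
    return sum_r2 / (sum_r2 + sum_p2)


# Hypothetical 3-item correlation matrix with all correlations equal to .50.
R = [[1.0, 0.5, 0.5],
     [0.5, 1.0, 0.5],
     [0.5, 0.5, 1.0]]
value = kmo(R)
print(round(value, 3), value >= 0.60)  # compare with the .60 benchmark
```

Values approach 1 when partial correlations are small relative to the raw correlations, which is the pattern expected when items share common factors.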

In our content analysis of scale development articles in JCP, the largest number of studies (n = 11) did not report using any criteria to assess the factorability of the correlation matrix. Although some studies (n = 5) reported using Bartlett’s test of sphericity, only one of those studies contained a cases-to-items ratio small enough to provide useful information on the basis of Bartlett’s test. Although other studies had cases-to-items ratios less than 5:1, they did not report using Bartlett’s test to assess scale factorability. Only 7 of the articles reported the value of KMO as a precursor to completing factor analysis, and a few articles (n = 3) used the participants-per-item ratio as the sole criterion.

Extraction methods. There are a variety of factor-extraction methods based on a number of statistical theories, but the two most commonly known and studied are principal-components analysis (PCA) and common-factors analysis (FA). There has been a protracted debate over the preferred use of PCA versus FA (e.g., principal-axis factoring, maximum-likelihood factoring) as exploratory procedures, which has yet to be resolved (Gorsuch, 2003). We do not intend to examine this debate in detail (see Multivariate Behavioral Research, 1990, Volume 25, Issue 1, for an extensive discussion of the pros and cons of both). However, it is important for researchers to understand the distinct purposes of each technique. The purpose of PCA is to reduce the number of items while retaining as much of the original item variance as possible. The purpose of FA is to understand the latent factors or constructs that account for the shared variance among items. Thus, the purpose of FA is more closely aligned with the development of new scales. In addition, although it has been shown that PCA and FA often produce similar results (Velicer & Jackson, 1990; Velicer, Peacock, & Jackson, 1982), there are several conditions under which FA has been shown to be superior to PCA (Gorsuch, 1990; Tucker, Koopman, & Linn, 1969; Widaman, 1993). Finally, compared with PCA, the outcomes of FA should more effectively generalize to CFA (Floyd & Widaman, 1995). Thus, although there may be other appropriate uses for PCA, we recommend FA for the development of new scales.

An example of the use of FA versus PCA in a simulated data set might illustrate the differences between these two approaches. Imagine that a researcher at a public university is interested in measuring campus climate for diversity. The researcher created 12 items to measure three different aspects of campus climate (each using 4 items): (a) general comfort or safety, (b) openness to diversity, and (c) perceptions of the learning environment. In a sample of 500 respondents, correlations among the 12 variables indicated that one item from each subset did not correlate with any other items on the scale (e.g., no higher than r = .12 for any bivariate pair containing these items). In FA, the three uncorrelated items appropriately drop out of the solution because of low factor loadings (loadings < .23), resulting in a three-factor solution (each factor retaining 3 items). In PCA, the three uncorrelated items load together on a fourth factor (loadings > .45). This example demonstrates that under certain conditions, PCA may overestimate factor loadings and result in erroneous decisions about the number of factors or items to retain.
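The contrast in this example can be reproduced in miniature. The sketch below does not use the simulated data described above, which are not available here; it builds a hypothetical population correlation matrix in which nine items form three uncorrelated factors (loadings of .70, so within-factor correlations of .49) and three stray items correlate only .05 with one another. It then compares unrotated PCA loadings with iterated principal-axis loadings. All function names and numeric values are assumptions for illustration.

```python
import numpy as np

def pca_loadings(corr, n_components):
    """Unrotated principal-component loadings (unities on the diagonal)."""
    vals, vecs = np.linalg.eigh(corr)
    top = np.argsort(vals)[::-1][:n_components]
    return vecs[:, top] * np.sqrt(vals[top])

def paf_loadings(corr, n_factors, n_iter=100):
    """Unrotated principal-axis loadings with iterated communalities."""
    corr = np.asarray(corr, dtype=float)
    # Start from squared multiple correlations as communality estimates.
    h2 = 1.0 - 1.0 / np.diag(np.linalg.inv(corr))
    for _ in range(n_iter):
        reduced = corr.copy()
        np.fill_diagonal(reduced, h2)
        vals, vecs = np.linalg.eigh(reduced)
        top = np.argsort(vals)[::-1][:n_factors]
        loadings = vecs[:, top] * np.sqrt(np.clip(vals[top], 0.0, None))
        h2 = np.sum(loadings ** 2, axis=1)
    return loadings

# Hypothetical population matrix: items 0-8 form three factors, items 9-11 are strays.
corr = np.eye(12)
for start, r in ((0, 0.49), (3, 0.49), (6, 0.49), (9, 0.05)):
    for i in range(start, start + 3):
        for j in range(start, start + 3):
            if i != j:
                corr[i, j] = r

stray = slice(9, 12)
pca_stray = np.abs(pca_loadings(corr, 4)[stray]).max()
paf_stray = np.abs(paf_loadings(corr, 4)[stray]).max()
print(f"{pca_stray:.2f} {paf_stray:.2f}")  # prints: 0.61 0.22
```

Under these assumptions the stray items load about .61 on a fourth principal component but only about .22 on a fourth common factor. The difference arises because PCA analyzes all of each item’s variance, including its unique variance, whereas principal-axis factoring analyzes only the variance the items share.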

We should also make clear that there are several techniques of FA, including principal-axis factoring, maximum likelihood, image factoring, alpha factoring, and unweighted and generalized least squares. Gerbing and Hamilton (1996) have shown that principal-axis factoring and maximum-likelihood approaches are relatively equal in their capacities to extract the correct model when the model is known in the population. However, Gorsuch (1997) points out that maximum-likelihood extractions result in occasional problems that do not occur with principal-axis factoring. Prior to the current use of SEM as a CFA technique, maximum-likelihood extraction had some advantages over other FA procedures as a confirmatory technique (Tabachnick & Fidell, 2001). For further discussion of less commonly used approaches, see Tabachnick and Fidell (2001).

Among the studies in our content analysis, most used some form of FA (n = 10), but a similar number used PCA (n = 9). One study used a combination of PCA and FA, and another did not report an extraction method. (Note: 2 of the 23 studies used only CFA and are not included in the figures reported earlier.) A cursory examination of the publication dates indicates that the majority of studies using PCA were published prior to the majority of those using FA, suggesting a trend away from PCA in favor of FA.

Criteria for determining rotation method. FA rotation methods include two basic types: orthogonal and oblique. Researchers use orthogonal rotations when the set of factors underlying a given item set is assumed or known to be uncorrelated. Researchers use oblique rotations when the factors are assumed or known to be correlated. A discussion of the statistical properties of the various types of orthogonal and oblique rotation methods is beyond the scope of this article (we refer readers to Gorsuch [1983] and Thompson [2004] for such discussions). In practice, researchers can determine whether to use an orthogonal versus oblique rotation during the initial FA based on either theory or data. However, if they discover that the factors appear to be correlated in the data when theory has suggested them to be uncorrelated, it is still most appropriate to rely on the data-based approach and to use an oblique rotation. Although, in some cases, both procedures might produce the same factor structure with the same data, using an orthogonal rotation with correlated factors tends to overestimate loadings (e.g., they will have higher values than with an oblique rotation; Loehlin, 1998). Thus, researchers may retain or reject some items inappropriately, and the factor structure may be more difficult to replicate during CFA.

Our content analysis showed that relatively few of the studies in our review reported an adequate rationale for selecting an orthogonal or oblique rotation method, with only 2 using subscale intercorrelations, 3 using theory, and 1 using both. Twelve studies did not specify the criteria used to select a rotation method, and 3 studies actually reported criteria irrelevant to the task (e.g., although the factors were correlated, the orthogonal solution matched the prior expectations for the factor solution). Also, 8 studies used orthogonal rotations despite reporting moderate to high correlations among factors, and 4 studies did not provide factor intercorrelations.

Criteria for factor retention. Researchers can use numerous criteria to estimate the number of factors for a given item set. The most widely known approaches were recommended by Kaiser (1958) and Cattell (1966) on the basis of eigenvalues, which may help determine the importance of a factor and indicate the amount of variance in the entire set of items accounted for by a given factor (for a more detailed explanation of eigenvalues, see Gorsuch, 1983). The iterative process of factor analysis produces successively less useful information with each new factor extracted in a set because each factor extracted after the first is based on the residual of the previous factor’s extraction. The eigenvalues produced will be successively smaller with each new factor extracted (accounting for smaller and smaller proportions of variance) until virtually meaningless values result. Thus, Kaiser (1958) believed that eigenvalues less than 1.0 reflect potentially unstable factors. Cattell (1966) used the relative values of eigenvalues to estimate the correct number of factors to examine during factor analysis, a procedure known as the scree test. Using the scree plot, a researcher examines the descending values of eigenvalues to locate a break in the size of eigenvalues, after which the remaining values tend to level off horizontally.

Parallel analysis (Horn, 1965) is another procedure for deciding how many factors to retain. Generally, when using parallel analysis, researchers randomly order the participants’ item scores and conduct a factor analysis on both the original data set and the randomly ordered scores. Researchers determine the number of factors to retain by comparing the eigenvalues determined in the original data set and in the randomly ordered data set. They retain a factor if the original eigenvalue is larger than the eigenvalue from the random data. This has been shown to work reasonably well when using FA (Humphreys & Montanelli, 1975) as well as PCA (Zwick & Velicer, 1986). Parallel analysis is not readily available in commonly used statistical software, but programs are available that conduct parallel analysis when using principal-axis factor analysis and PCA (see O’Connor, 2000).
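The procedure just described can be sketched in a few lines. The version below shuffles each item’s scores independently across respondents, repeats the shuffle several times, and compares the observed eigenvalues of the full correlation matrix with the mean eigenvalues from the shuffled data. The function name, the number of shuffles, and the use of the mean as the comparison value are assumptions for this example; it is not the O’Connor (2000) program, and a percentile of the random eigenvalues could be substituted for the mean.

```python
import numpy as np

def parallel_analysis(data, n_shuffles=100, seed=0):
    """Return (n_retain, observed_eigenvalues, mean_random_eigenvalues)."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data, dtype=float)
    n_items = data.shape[1]

    def sorted_eigenvalues(matrix):
        corr = np.corrcoef(matrix, rowvar=False)
        return np.sort(np.linalg.eigvalsh(corr))[::-1]

    observed = sorted_eigenvalues(data)
    random_eigs = np.empty((n_shuffles, n_items))
    for i in range(n_shuffles):
        # Permute each column separately to break the correlations among items.
        shuffled = np.column_stack([rng.permutation(col) for col in data.T])
        random_eigs[i] = sorted_eigenvalues(shuffled)
    random_mean = random_eigs.mean(axis=0)

    # Count leading factors whose observed eigenvalue exceeds the random one.
    n_retain = 0
    for obs, rand in zip(observed, random_mean):
        if obs <= rand:
            break
        n_retain += 1
    return n_retain, observed, random_mean
```

Stopping at the first factor that fails the comparison keeps the count from including a later eigenvalue that exceeds its random counterpart by chance.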

Approximating simple structure is another way to evaluate factor retention during EFA. According to McDonald (1985), the term simple structure has two radically different meanings that are often confused. A factor pattern has simple structure (a) if several items load strongly on only one factor and (b) if items have a zero correlation to other factors in the solution. SEM constrains the relationships between items and factors to produce simple structure as defined earlier (which will become important later). McDonald (1985) differentiates this from what he prefers to call approximate simple structure, often reported in counseling psychology research as if it were simple structure, which substitutes the word small (undefined) for the word zero (definitive) in the primary definition. Researchers can estimate approximate simple structure by using rotation methods during FA. In EFA, efforts to produce factor solutions with approximate simple structure are central to decisions about the final number of factors and about the retention and deletion of items in a given solution. If factors share items that cross-load too highly on more than one factor (e.g., > .32), the items are considered complex because they reflect the influence of more than one factor. Approximating simple structure can be achieved through item or factor deletion or both. SEM approaches to CFA assume simple structure, and very closely approximating simple structure during EFA will likely improve the subsequent results of CFA using SEM.

The larger the number of items on a factor, the more confidence one has that it will be a reliable factor in future studies. Thus, with a few minor caveats, some authors have recommended against retaining factors with fewer than three items (Tabachnick & Fidell, 2001). It is possible to retain a factor with only two items if the items are highly correlated (i.e., r > .70) and relatively uncorrelated with other variables. Under these conditions, it may be appropriate to consider other criteria (e.g., interpretability) in deciding whether to retain the factor or to discard it. Nevertheless, it may be best to revisit item-generation procedures to produce additional items intended to load on the factor (which would require a new EFA before moving on to the CFA).

Conceptual interpretability is the definitive factor-retention criterion. In the end, researchers should retain a factor only if they can interpret it in a meaningful way no matter how solid the evidence for its retention based on the empirical criteria earlier described. EFA is ultimately a combination of empirical and subjective approaches to data analysis because the job is not complete until the solution makes sense. (Note that this is not necessarily true for the criterion-group method of scale development.) At this stage, the researcher should conduct an analysis of the items within each factor to assess the extent to which the items make sense as a group. Although uncommon, it may be useful to submit the item-factor combinations to a small group of experts for external interpretation to avoid a situation in which a factor makes sense to the researcher eager for a viable scale but not to anybody else.

In our content analysis of JCP articles, it appeared that numerous researchers encountered problems reconciling their EFA findings with their conceptual interpretation of the factor solution and occasionally engaged in rationalizations that led to questionable practices. For example, researchers in one study selected a factor solution that fit their preconceived conceptualization of the scale although some of the factors were very highly intercorrelated (e.g., the data indicated fewer factors than the authors adopted). When a researcher desires a specific factor structure that is not adequately reproduced during EFA, the recommended practice would be (a) to adopt the factor solution supported by the data and engage in meaningful interpretation based on those findings and (b) to return to item generation and go back through earlier steps in the scale development process (including EFA). There were a few articles in our content analysis that inappropriately moved forward with CFA after making revisions that were not assessed by EFA.

Criteria for item deletion or retention. Although, on rare occasions, a researcher may retain all the initial items submitted to EFA, item deletion is a very common and expected part of the process. Researchers most often use the values of the item loadings and cross-loadings on the factors to determine whether items should be deleted or retained. Inevitably, this process is intertwined with the process of determining the number of factors that will be retained (described earlier). For example, in some instances, a researcher might be evaluating the relative value of several different factor solutions (e.g., 2, 3, or 4 factors). As such, deleting items before establishing the final number of factors could actually reduce the number of factors retained. On the other hand, unnecessarily retaining items that fail to contribute meaningfully to any of the potential factor solutions will make it more difficult to make a final decision about the number of factors to retain. Thus, the process we recommend is designed to retain potentially meaningful items early in the process and to optimize scale length only after the factor solution is clear.

Most researchers begin EFA with a substantially larger number of items than they ultimately plan to retain. However, there is considerable variation among studies in the proportion of items in the initial pool that are planned for deletion. We recommend that researchers wait until the last step in EFA to trim unnecessary items and focus primarily on empirical scale development procedures at this stage in the process so as not to confuse the purposes of these two similar activities (e.g., item deletion). Thus, researchers should base decisions about whether to retain or delete items at this stage on their contribution to the factor solution rather than on the final length of the scale. Most researchers use some guideline for a lower limit on item factor loadings and cross-loadings to determine whether to retain or delete items, but the criteria for determining the magnitude of loadings and cross-loadings have been described as a matter of researcher preference (Tabachnick & Fidell, 2001). Larger, more frequent cross-loadings will contribute to factor intercorrelations (requiring oblique rotation) and lesser approximations of simple structure (described earlier). Thus, to the degree possible, researchers should attempt to set their minimum values for factor loadings as high as possible and the absolute magnitude for cross-loadings as low as possible (without compromising scale length or factor structure), which will result in fewer cross-loadings of lower magnitudes and better approximations of simple structure. For example, researchers should delete items with factor loadings less than .32 or with cross-loadings that differ by less than .15 from the item’s highest factor loading. In addition, they should also delete items that contain absolute loadings higher than a certain value (e.g., .32) on two or more factors. However, we urge researchers to use caution when using cross-loadings as a criterion for item deletion until establishing the final factor solution because an item with a relatively high cross-loading could be retained if the factor on which it is cross-loaded is deleted or collapsed into another existing factor.
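The example cutoffs in this paragraph can be expressed as a small screening function. The sketch below is only one way to encode them, assuming each item is represented by its row of rotated loadings; the function name, the default thresholds (taken from the examples above), and the reason labels are hypothetical, and the flags are prompts for review rather than automatic deletion rules.

```python
def flag_item(loadings, min_loading=0.32, min_gap=0.15, cross_cutoff=0.32):
    """Return the reasons an item might be considered for deletion.

    loadings: the item's loadings on each factor in the rotated solution.
    """
    sizes = sorted((abs(value) for value in loadings), reverse=True)
    reasons = []
    if sizes[0] < min_loading:
        reasons.append("low primary loading")
    if len(sizes) > 1:
        # Gap between the highest loading and the largest cross-loading.
        if sizes[0] - sizes[1] < min_gap:
            reasons.append("small gap to cross-loading")
        if sizes[1] > cross_cutoff:
            reasons.append("loads on two or more factors")
    return reasons

print(flag_item([0.70, 0.10, 0.05]))  # prints: []
print(flag_item([0.45, 0.38, 0.00]))
```

Consistent with the caution above, a flag based on cross-loadings is best revisited once the number of factors is settled, because a cross-loading can disappear when the competing factor is dropped or merged.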

Item communalities after rotation can be a useful guide for item deletion as well. Remember that high item communalities are important for determining the factorability of a data set, but they can also be useful in evaluating specific items for deletion or retention because a communality reflects the proportion of item variance accounted for by the factors; it is the squared multiple correlation of the item as predicted from the set of factors in the solution (Tabachnick & Fidell, 2001). Thus, items with low communalities (e.g., less than .40) are not highly correlated with one or more of the factors in the solution.
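To make the definition concrete, the sketch below computes an item’s communality from its loadings. For uncorrelated (orthogonal) factors the communality is the sum of the squared loadings; for correlated factors the products of loadings are weighted by the factor correlations. The function name, the optional factor-correlation argument, and the example values are assumptions for illustration.

```python
def communality(loadings, factor_corr=None):
    """Proportion of an item's variance accounted for by the factors.

    loadings: the item's pattern loadings on each factor.
    factor_corr: optional factor correlation matrix (list of lists);
                 if omitted, the factors are treated as uncorrelated.
    """
    k = len(loadings)
    total = 0.0
    for a in range(k):
        for b in range(k):
            if factor_corr is None:
                phi = 1.0 if a == b else 0.0
            else:
                phi = factor_corr[a][b]
            total += loadings[a] * phi * loadings[b]
    return total

print(round(communality([0.6, 0.2]), 2))  # prints: 0.4
print(round(communality([0.5, 0.3]), 2))  # prints: 0.34
```

In this illustration the second item falls below the .40 example cutoff even though its highest loading (.50) would pass a typical loading criterion, which is why communalities can add information beyond loadings alone.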

In our content analysis, the most common criteria for item-deletion decisions were absolute values of item loadings and cross-loadings, which were often used in combination. None of the studies we reviewed reported using item communalities as a criterion for deletion, and one study used item-analysis procedures (e.g., contribution to internal consistency reliability). There were no items deleted in two studies, and two others did not specify the criteria for item deletion.

Optimizing scale length. Once the items have been evaluated, it is useful to assess the trade-off between length and reliability to optimize scale length. Longer scales of relatively highly correlated items are generally more reliable, but Converse and Presser (1986) recommended that questionnaires take no longer than 50 minutes to complete. In our experience, scales that take longer than about 15 to 30 minutes might become problematic, depending on the respondents, the intended use of the scale, and the respondents’ motivation regarding the purpose of the administration. Thus, scale developers may find it useful to examine the length of each subscale to determine whether it is a reasonable trade-off to sacrifice a small degree of internal consistency to shorten its length. Some statistical packages (e.g., SPSS) allow researchers to compare all the items on a given subscale to identify those that contribute the least to internal consistency, making item deletion with the goal of optimizing scale length relatively easy. Generally, when a factor contains more than the desired number of items, the researcher will have the option of deleting items that (a) have the lowest factor loadings, (b) have the highest cross-loadings, (c) contribute the least to the internal consistency of the scale scores, and (d) have low conceptual consistency with other items on the factor. The researcher should avoid scale-length optimization that degrades the quality of the factor structure, factor intercorrelations, item communalities, factor loadings, or cross-loadings. Ultimately, researchers must conduct a final EFA to ensure that the factor solution does not change after deleting items.
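The item-by-item comparison that such packages provide can be approximated with coefficient alpha computed after dropping each item in turn. The sketch below assumes the subscale is stored as a list of item columns (one list of respondent scores per item); the function names and the tiny data set are hypothetical and serve only to show the calculation.

```python
from statistics import variance

def cronbach_alpha(items):
    """Coefficient alpha for a list of item columns (one list of scores per item)."""
    k = len(items)
    totals = [sum(scores) for scores in zip(*items)]
    item_variance = sum(variance(column) for column in items)
    return k / (k - 1) * (1 - item_variance / variance(totals))

def alpha_if_item_deleted(items):
    """Alpha for the subscale with each item removed in turn."""
    return [cronbach_alpha(items[:i] + items[i + 1:]) for i in range(len(items))]

# Hypothetical subscale: three consistent items and one that behaves differently.
subscale = [
    [1, 2, 3, 4],
    [1, 2, 3, 4],
    [1, 2, 3, 4],
    [4, 1, 3, 2],
]
print([round(a, 2) for a in alpha_if_item_deleted(subscale)])
```

The item whose removal yields the highest alpha contributes the least to internal consistency. As the paragraph above notes, that statistic is only one of several considerations, and the factor solution should be rechecked after any deletion.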

CFA

SEM versus FA. SEM has become a widely used tool in explaining theoretical models within the social and behavioral sciences (see Martens, 2005; Martens & Haase, 2006; Quintana & Maxwell, 1999; Weston & Gore, 2006). CFA is one of the most popular uses of SEM. CFA is most commonly used during the scale development process to help support the validity of a scale following an EFA. In the past, a number of published studies have used FA or PCA procedures as confirmatory approaches (Gerbing & Hamilton, 1996). With the increasing availability of computer software, however, most researchers use SEM as the preferred approach for CFA.

In our content analysis, 14 of the studies used SEM as the confirmatory approach. In comparison, 2 studies used PCA as a confirmatory approach (these appeared before SEM was widely applied in counseling psychology research).

Typical SEM approaches. Once a researcher obtains a theoretically meaningful factor structure via EFA, the logical next step is to specify the resulting factor solution in the SEM confirmatory procedure; that is, if the researcher obtains a three-factor oblique factor structure in the EFA, specifying the same correlated three-factor model using SEM and finding good fit of the model to the data in a new sample will help support the factor-structure reliability and the validity of the scale. Another approach is to compare competing theoretically plausible models (e.g., different numbers of factors, inclusion or exclusion of specific paths). Thus, the researcher can compare the factor structure uncovered in the EFA with alternative models to evaluate which model best fits the data. The hypothesized model’s fitting the data better than alternative models is further evidence of construct validity. If an alternative model fits the data better than the hypothesized model, the investigator is obligated to explain how discrepancies between models affect construct validity and then to conduct another study to further validate the newly adopted model (or start over).

Testing nested or hierarchically related models is another typical SEM approach. A model is nested if it is a subset of another model to which it is compared. For example, suppose a researcher conducted a study on an eight-item, course-evaluation survey in which four items assess satisfaction with the readings and homework assigned in the course and the remaining four items assess satisfaction with the professor’s sensitivity to diversity, resulting in a two-factor correlated model. However, one could assume that the eight items on the survey assess overall satisfaction with the course, resulting in a one-factor model. If this one-factor model were compared with the correlated two-factor model, the one-factor (restricted) model would be nested within the two-factor (unrestricted) model because the correlation between the two factors in the two-factor model would be set to a value of 1.0 to form the one-factor model. When comparing nested models, researchers use a chi-square difference test to examine whether a significant loss in fit occurs when going from the unrestricted model to the nested (restricted) model (for the statistical formula, see Kline, 2005).
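In its standard form, the chi-square difference test subtracts the fit statistics and degrees of freedom of the two nested models. The worked degrees of freedom below apply to the eight-item example and assume that factor variances are fixed to 1, each item loads on one factor, and residuals are uncorrelated; the subscripts R (restricted) and U (unrestricted) are notation introduced here.

```latex
\Delta\chi^2 = \chi^2_{R} - \chi^2_{U}, \qquad \Delta df = df_{R} - df_{U}

% Eight observed items give 8(8 + 1)/2 = 36 unique variances and covariances.
% Two-factor model: 8 loadings + 8 residual variances + 1 factor correlation = 17 parameters
df_{U} = 36 - 17 = 19
% One-factor model: 8 loadings + 8 residual variances = 16 parameters
df_{R} = 36 - 16 = 20

\Delta df = 20 - 19 = 1
```

With 1 degree of freedom, a difference larger than 3.84 is significant at the .05 level, which would indicate that constraining the two factors to be a single factor produces a significant loss in fit.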

When structural equation models are not nested (i.e., one model is not a subset of another model), the chi-square difference test is an inappropriate method to assess model fit differences because neither of the two models can serve as a baseline comparison model. Still, there are instances when researchers compare nonhierarchically related models in terms of model fit, such as when testing different theoretical models posited to support the data. In this case, researchers may use fit indices to select among competing models. It is becoming more and more common to compare nonnested models using predictive fit indices (discussed further on), which indicate how well a model will cross-validate in future samples.

Some competing models may be equivalent models; that is, they are mathematically equivalent even when their parameter configurations appear different (MacCallum, Wegener, Uchino, & Fabrigar, 1993), so they yield the same chi-square test statistics and goodness-of-fit indices. Thus, theory should play the strongest role in selecting the appropriate model when comparing equivalent models.

Another SEM approach that may support the construct validity of a scale is called multiple-group analysis. In multiple-group analysis, the same structural equation model may be applied to the data for two or more distinct groups (e.g., male and female) to simultaneously test for invariance (model equivalency) across the groups by constraining different sets of model parameters to be equal in all groups (for more on conducting multiple-group analysis, see Bentler, 1995; Bollen, 1989; Byrne, 2001).

Of the 10 studies in the content analysis using a confirmatory SEM approach, 2 of them used the single-model approach wherein the model produced by the EFA was specified in a CFA, and 8 of the studies performed model comparisons. Of these 8 studies, 4 evaluated nested models, but only 3 of the 4 used the chi-square difference test when selecting among the nested models. All 4 of the studies used fit indices to select among nonnested competing models. Of the 4 studies comparing alternative nonnested models, 2 used predictive fit indices when selecting among the set of competing models. Researchers compared equivalent and nonequivalent models in 2 of the studies in the content analysis. One of these studies selected a nonequivalent model over 2 equivalent models based on higher values of the fit indices. In the second study, the authors relied on theory when selecting among 2 equivalent models.

Sample-size considerations. The statistical theory underlying SEM is asymptotic, which assumes that large sample sizes are necessary to provide stable parameter estimates (Bentler, 1995). Thus, some researchers have suggested that SEM analyses should not be performed on sample sizes smaller than 200, whereas others recommend minimum sample sizes between 100 and 200 participants (Kline, 2005). Another recommendation is that there should be between 5 and 10 participants per observed variable (Grimm & Yarnold, 1995); yet another guideline is that there should be between 5 and 10 participants per parameter to be estimated (Bentler & Chou, 1987). The findings are mixed in terms of which criterion is best because it depends on various model characteristics, including the number of indicator variables per factor (Marsh, Hau, Balla, & Grayson, 1998), estimation method (Fan, Thompson, & Wang, 1999), nonnormality of the data (West, Finch, & Curran, 1995), as well as the strength of the relationships among indicator variables and latent factors (Velicer & Fava, 1998). However, because there is a clear relationship between sample size and model complexity, we recommend that the researcher should account for the number of parameters to be estimated when considering sample size. Given ideal conditions (e.g., enough indicators per factor, high factor loadings, and normally distributed data), we recommend Bentler and Chou’s (1987) guideline of at least the 5:1 ratio of participants to number of parameters, with the ratio of 10:1 being optimal. In addition, we do not recommend using SEM on sample sizes smaller than 100 participants.
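Applying the participants-per-parameter guideline requires a parameter count. The sketch below counts free parameters for a basic CFA under stated assumptions (each item loads on one factor, factor variances are fixed to 1, residuals are uncorrelated) and then applies the 5:1 and 10:1 ratios together with the floor of 100 participants. The function names and the way the model is described (a list of items per factor) are hypothetical; models with cross-loadings or correlated residuals would have more parameters.

```python
def cfa_free_parameters(items_per_factor, correlated_factors=True):
    """Count free parameters in a simple-structure CFA with factor variances fixed to 1."""
    n_items = sum(items_per_factor)
    n_factors = len(items_per_factor)
    loadings = n_items
    residual_variances = n_items
    factor_covariances = n_factors * (n_factors - 1) // 2 if correlated_factors else 0
    return loadings + residual_variances + factor_covariances

def sample_size_range(n_parameters, minimum_ratio=5, optimal_ratio=10, floor=100):
    """Minimum and optimal sample sizes under the per-parameter guideline."""
    return (max(floor, minimum_ratio * n_parameters),
            max(floor, optimal_ratio * n_parameters))

n_params = cfa_free_parameters([4, 4, 4])   # three correlated factors, four items each
print(n_params, sample_size_range(n_params))  # prints: 27 (135, 270)
```

For this hypothetical 12-item, three-factor model, the guideline suggests at least 135 participants and ideally about 270, noticeably more than a 5-per-item rule (60 participants) would imply.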

Only one study in our content analysis reported using one of the earlier described criteria (5 to 10 participants per indicator) to establish an adequate sample size. The remainder of the studies did not specify whether they used particular criteria to evaluate the adequacy of the sample size to conduct SEM. However, we assessed the sample sizes for all the studies included in the content analysis and determined that the remaining studies met the 5:1 ratio of participants to parameters.

TABLE 3: Incremental, Absolute, and Predictive Fit Indices Used in Structural Equation Modeling

Fit Index                                                          Citation

Incremental fit indices
  Normed Fit Index (NFI)                                           Bentler & Bonett (1980)
  Incremental Fit Index (IFI)                                      Bollen (1989)
  Nonnormed Fit Index (NNFI) or Tucker-Lewis Index (TLI)           Tucker & Lewis (1973)
  Comparative Fit Index (CFI)                                      Bentler (1990)
  Parsimony Comparative Fit Index (PCFI)                           Mulaik et al. (1989)
  Relative Noncentrality Index (RNI)                               McDonald & Marsh (1990)

Absolute fit indices
  Chi-square/df ratio                                              Marsh, Balla, & McDonald (1988)
  Goodness-of-Fit Index (GFI)                                      Jöreskog & Sörbom (1984)
  Adjusted Goodness-of-Fit Index (AGFI)                            Jöreskog & Sörbom (1984)
  McDonald’s Fit Index (MFI) or McDonald’s Centrality Index (MCI)  McDonald (1989)
  Gamma hat                                                        Steiger (1989)
  Hoelter N                                                        Hoelter (1983)
  Root Mean Square Residual (RMR)                                  Jöreskog & Sörbom (1981)
  Standardized Root Mean Square Residual (SRMR)                    Bentler (1995)
  Root Mean-Square Error of Approximation (RMSEA)                  Steiger & Lind (1980)

Predictive fit indices
  Akaike’s Information Criterion (AIC)                             Akaike (1987)
  Consistent AIC (CAIC)                                            Bozdogan (1987)
  Bayesian Information Criterion (BIC)                             Schwarz (1978)
  Expected Cross-Validation Index (ECVI)                           Browne & Cudeck (1992)

Overall model fit. Researchers typically use a chi-square test statistic as a test of overall model fit in SEM. The chi-square test, however, is often criticized for its sensitivity to sample size (Bentler & Bonett, 1980; Hu & Bentler, 1999). The sample-size dependency of the chi-square test statistic has led to the proposal of numerous alternative fit indices that evaluate model fit, supplementing the chi-square test statistic. These fit indices may be classified as incremental, absolute, or predictive fit indices (Kline, 2005).

Incremental fit indices measure the improvement in a model’s fit to the data by comparing a specific structural equation model to a baseline structural equation model. The typical baseline comparison model is the null (or independence) model in which all the variables are independent of each other or uncorrelated (Bentler & Bonett, 1980). Absolute fit indices measure how well a structural equation model explains the relationships found in the sample data. Predictive fit indices (or information criteria) measure how well the structural equation model would fit in other samples from the same population (see Table 3 for examples of incremental, absolute, and predictive fit indices).

We should note that there are various recommendations about reporting these indices as well as suggested cutoff values for each of these fit indices (e.g., see Hu & Bentler, 1999; Kline, 2005). Researchers have commonly interpreted incremental fit index, goodness-of-fit index, adjusted goodness-of-fit index, and McDonald’s Fit Index (MFI) values greater than .90 as an acceptable cutoff (Bentler & Bonett, 1980). More recently, however, SEM researchers have advocated .95 as a more desirable level (e.g., Hu & Bentler, 1999). Values for the standardized root mean square residual (SRMR) less than .10 are generally indicative of acceptable model fit. Values for the root mean square error of approximation (RMSEA) at or less than .05 indicate close model fit, which is customarily considered acceptable. However, debate continues concerning the use of these indices and the cutoff values when fitting structural equation models (e.g., see Marsh, Hau, & Wen, 2004). One reason for this debate is that the findings are mixed in terms of which index is best, and their performance depends on various study characteristics, including the number of variables (Kenny & McCoach, 2003), estimation method (Fan et al., 1999; Hu & Bentler, 1998), model misspecification (Hu & Bentler, 1999), and sample size (Marsh, Balla, & Hau, 1996). Researchers should bear in mind that suggested cutoff criteria are general guidelines and are not necessarily definitive rules.
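Two of the indices discussed here can be computed directly from chi-square values, which may help readers check reported results. The sketch below uses commonly cited formulas for the RMSEA and the CFI; some programs use N rather than N - 1 in the RMSEA denominator, so small discrepancies from software output are possible. The function names and the chi-square values in the example are hypothetical.

```python
import math

def rmsea(chi_square, df, n):
    """Root mean square error of approximation for a model fit to n participants."""
    return math.sqrt(max(chi_square - df, 0.0) / (df * (n - 1)))

def cfi(chi_model, df_model, chi_baseline, df_baseline):
    """Comparative Fit Index relative to the null (independence) baseline model."""
    misfit_model = max(chi_model - df_model, 0.0)
    misfit_baseline = max(chi_baseline - df_baseline, misfit_model)
    if misfit_baseline == 0.0:
        return 1.0
    return 1.0 - misfit_model / misfit_baseline

# Hypothetical results: model chi-square = 50 (df = 25), baseline = 525 (df = 36), N = 201.
print(f"{rmsea(50, 25, 201):.3f} {cfi(50, 25, 525, 36):.3f}")  # prints: 0.071 0.949
```

In this illustration the CFI clears the older .90 guideline but falls just short of .95, and the RMSEA exceeds .05, which shows how the choice of cutoff can change the conclusion about the same model.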

According to Kline (2005), a minimum collection of these types of fit indices to report would consist of (a) the chi-square test statistic with corresponding degrees of freedom and level of significance, (b) the RMSEA (Steiger & Lind, 1980) with its corresponding 90% confidence interval, (c) the Comparative Fit Index (CFI; Bentler, 1990), and (d) the SRMR (Bentler, 1995). Hu and Bentler (1999) recommend using a two-index combination approach when reporting findings in SEM. More specifically, they recommend using the SRMR accompanied by one of the following indices: Nonnormed Fit Index, Incremental Fit Index, Relative Noncentrality Index, CFI, Gamma Hat, MFI, or RMSEA. Although there is evidence that Hu and Bentler’s (1999) joint criteria help minimize the possibility of rejecting the right model, there is also evidence that misspecified (incorrect) models could be considered acceptable when using the proposed cutoff criteria (Marsh et al., 2004). Thus, we adopt Kline’s (2005) recommendation with respect to the minimum fit indices to report. In addition, because structural equation models approximate truth, we further recommend that researchers compare competing theoretically plausible models whenever possible and report predictive fit indices (see Table 3) to ensure that the model will cross-validate in subsequent samples. Finally, and most important, researchers should always base their selections of the appropriate model on relevant theory.

In our content analysis, 12 of the 14 studies using SEM reported the chi-square statistic. All 14 studies reported at least two fit indices. We list the most commonly reported fit indices in these studies in Table 2. Although 7 articles reported the RMSEA, only 1 of these reported its corresponding 90% confidence interval (regarding confidence intervals around the RMSEA, see Quintana & Maxwell, 1999; for more on confidence intervals, see Henson, 2006 [TCP, special issue, part 1]). All but 3 studies assessed model fit using various suggested cutoff criteria (e.g., Bentler, 1990, 1992; Byrne, 2001; Comrey & Lee, 1992; Hu & Bentler, 1999; Kline, 2005; Quintana & Maxwell, 1999). Several of the studies were published after the seminal Hu and Bentler (1999) cutoff-criteria article yet still referred to the less stringent cutoff criteria suggested by previous researchers (e.g., .90 for incremental fit indices). Only 3 of the 8 studies in the content analysis comparing competing models (nested or nonnested) reported predictive fit indices.

Model modification. When structural equation models do not demonstrate good fit, researchers often modify (respecify) and subsequently retest models (MacCallum, Roznowski, & Necowitz, 1992). This results in the confirmatory approach’s reverting to an exploratory approach again, but that is of less consequence than not knowing the reasons behind poor model fit. Modification indices are sometimes used to either add or drop parameters in the process of model respecification. For example, the Lagrange Multiplier Modification index estimates the decrease in the chi-square test statistic that would occur if a parameter were to be freely estimated. More specifically, it indicates which parameters could be added to increase model fit by significantly decreasing the chi-square test statistic of overall fit. In contrast, the Wald statistic estimates the increase in the chi-square test statistic that would occur if a parameter were fixed to 0, which is essentially the same as dropping a nonsignificant parameter from the model (Kline, 2005). Researchers have examined the performance of these indices in terms of helping the researcher arrive at the correct structural equation model and have shown these indices to be inaccurate under certain conditions (e.g., Chou & Bentler, 2002; MacCallum, 1986). Thus, applied researchers are warned as to the accuracy of respecified models when modifications are made using the Lagrange Multiplier and the Wald statistic. In the end, theory should guide model respecification, and respecified models should be tested using new samples.
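
One way to keep theory in charge of respecification is to screen modification indices against both a statistical and a theoretical criterion. The sketch below is a hypothetical illustration: the parameter labels and index values are invented, the function name is an assumption, and the screen relies on the fact that a modification index for a single parameter is evaluated as a chi-square with 1 degree of freedom (critical value 3.841 at alpha = .05).

```python
CHI2_CRIT_1DF_05 = 3.841  # critical chi-square value for 1 df at alpha = .05


def candidate_modifications(mod_indices, theoretically_justified):
    """Return parameters worth considering for addition to the model.

    A parameter is kept only if its Lagrange Multiplier index exceeds the
    1-df critical value AND the researcher has flagged it as defensible on
    theoretical grounds. Results are sorted by index, largest first.
    """
    keep = [
        (name, mi)
        for name, mi in mod_indices.items()
        if mi > CHI2_CRIT_1DF_05 and name in theoretically_justified
    ]
    return sorted(keep, key=lambda pair: pair[1], reverse=True)


# Invented values standing in for output from an SEM program.
mod_indices = {
    "error covariance: item 3 with item 7": 12.4,
    "cross-loading: item 5 on Factor 2": 6.1,
    "error covariance: item 1 with item 2": 2.0,
}
justified = {"cross-loading: item 5 on Factor 2"}
print(candidate_modifications(mod_indices, justified))
```

In this example the largest index is set aside because no theoretical rationale was supplied for it, and any retained modification would still need to be tested in a new sample.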

Researchers may also modify models in terms of the unit of analysis used, such as item parcels. Parceling means either summing or averaging two or more items together to create parcels (sometimes referred to as bundles). These parcels are then used as the unit of analysis in SEM instead of the individual items. It is crucial, however, that researchers in the scale development process do not use item parceling, because item parcels can hide the true relationships among items in the scale (Cattell, 1974). In addition, model misspecification may be hidden when using item parceling (Bandalos & Finney, 2001).
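
The mechanics of parceling, and the loss of item-level information it entails, can be seen in a small sketch. The function name, item labels, and response values below are hypothetical and serve only to illustrate the summing or averaging described above.

```python
def make_parcels(responses, parcel_map, how="mean"):
    """Combine item scores into parcels by averaging or summing.

    responses:  list of dicts mapping item name -> score, one per respondent
    parcel_map: dict mapping parcel name -> list of item names
    how:        "mean" or "sum"
    """
    if how not in ("mean", "sum"):
        raise ValueError("how must be 'mean' or 'sum'")
    parcels = []
    for person in responses:
        row = {}
        for parcel, items in parcel_map.items():
            total = sum(person[item] for item in items)
            row[parcel] = total / len(items) if how == "mean" else total
        parcels.append(row)
    return parcels


# Two invented respondents with very different item-level patterns.
responses = [
    {"item1": 1, "item2": 5},
    {"item1": 3, "item2": 3},
]
print(make_parcels(responses, {"parcel1": ["item1", "item2"]}))
```

Both respondents receive the same parcel score even though their item responses differ, which illustrates how relationships among individual items can be obscured once parcels replace items as the unit of analysis.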

The data-driven methods for model respecification in SEM are more appropriate for fine-tuning a model than they are for large-scale respecification of severely misspecified initial models because multiple misspecification errors interact with each other, making respecification more difficult (Gerbing & Hamilton, 1996). For similar reasons, Gorsuch (1997) suggested that it is possible to use FA procedures as an appropriate alternative to adjusting the confirmatory model when finding misspecification, but this does not imply reversing the typical order of FA prior to SEM in scale development research. Finally, we highly recommend cross-validation of respecified structural equation models to establish predictive validity (MacCallum et al., 1992). Thus, another sample of data should be collected and the respecified model tested in a confirmatory approach.

Of the 14 studies conducting SEM, three examined modification indices (e.g., the Lagrange Multiplier) to assess if they should add parameters to the model to significantly improve the fit. In two of these three studies, the authors implemented modifications and retested the models. These two studies allowed the errors to covary, and one study also allowed the factors to covary. Neither of the two studies that modified the original structural equation model cross-validated the respecified model in a separate sample. Researchers in two of the studies in the content analysis used item parceling to avoid estimating a large number of parameters and to reduce random error, an approach we do not recommend.

CONCLUSIONS

In this article, we have examined common practices in counseling psychology scale development research using EFA and CFA techniques. We conducted a content analysis of new scale development articles in JCP during 10 years (1995 to 2004) to assess current practices in scale development. We used data from our content analysis to provide information about the typical procedures used in counseling psychology scale development research, and we compared these practices to current literature on EFA and CFA to make recommendations about best practices (which we summarize further on).

We found that counseling psychology scale development research employed a wide range of procedures. Although we did not conduct a formal trend analysis, our impression was that the content-analysis data indicated that counseling psychology scale development research became increasingly rigorous and sophisticated during the evaluation period, especially through the attenuation of PCA procedures and the increased employment of SEM as a confirmatory procedure. However, we also found a variety of practices that seemed at odds with the current literature on EFA and SEM, which indicated a need for even more rigor and standardization. Specifically, we found the use of the following new scale development practices to be problematic: (a) employing SEM prior to using EFA, (b) using criteria that varied widely (or were not reported) with respect to determining the adequacy of the sample for both EFA and SEM, (c) failing to report an adequate rationale for selecting orthogonal versus oblique rotation methods, (d) using orthogonal rotation methods during EFA despite clear evidence that the factors were moderately to highly correlated, (e) using inappropriate rationales or ignoring contrary data when identifying and reporting the final factor solution during EFA (e.g., ignoring high factor intercorrelations to retain a preferred factor structure), (f) using questionable criteria as the basis for decisions about item deletion or retention, (g) failing to consider the extent to which the final factor solution achieved adequate approximation of simple structure, (h) making revisions to item content or adding or deleting items between the conclusion of EFA and the initiation of SEM, (i) using criteria and fit indices that varied widely to determine overall model fit during SEM, (j) failing to report confidence intervals when using the RMSEA, (k) using item parcels (bundles) in scale development, and (l) failing to engage in additional cross-validation following model misspecification and modification during SEM.

We offer a number of caveats for the critique of scale development practices described earlier. First, some of these recommendations do not transfer directly to other approaches to empirical scale development (e.g., criterion group) and should be understood as primarily referring to the homogeneous item-grouping approach. Second, it is important to note that EFA is intended to be a flexible statistical procedure to produce the most interpretable solution, which can lead to acceptable variations in practice. Thus, some researchers may disagree on how stringently to use criteria to constrain the process of EFA, and we acknowledge that the subjective and interpretive aspects of scale development may justify variations that arise in specific contexts. Finally, the current literature on both EFA and SEM continues to contain debates and conflicting recommendations that may be at variance with our conclusions. We provide recommendations for best practices here to increase standardization and rigor rather than as a resolution of those ongoing debates and data-driven improvements in best practices.

RECOMMENDED BEST PRACTICES

1. Always provide a scale definition of the construct intended to be measured.

2. Use expert review of items prior to submitting them to EFA.

3. In general, EFA should precede CFA.

4. When using EFA, set a preestablished minimum sample size (≥ 100) and then evaluate the need for additional data collection on the basis of an initial EFA using communalities, factor saturation, and factor-loadings criteria: (a) sample sizes of 150 to 200 are likely to be adequate with data sets containing communalities higher than .50 or with 10:1 items per factor with factor loadings at approximately |.4| and (b) smaller sample sizes may be adequate if communalities are all .60 or greater or with at least 4:1 items per factor and factor loadings greater than |.6|.

5. Verify the factorability of data via a significant Bartlett’s test of sphericity (when the participants to items ratio is between 3:1 and 5:1), the KMO measure of sampling adequacy (values greater than .60), or both.

6. Recognize and understand the basic differences between PCA and FA extraction methods. For the purpose of scale development, FA is generally preferred over PCA in most instances.