SUMMARY OF FINDINGS - Comparing Multi-dimensional and Uni-dimensional Computer Adaptive Strategies in Psychological and Health Assessment

This study aimed to compare the efficiency of multi-dimensional CAT with that of uni-dimensional CAT based on the multi-dimensional graded response model, and to provide information about the optimal size of the item pool. To achieve these goals, item selection and ability estimation methods based on multi-dimensional graded response models were developed, and two studies were conducted, one based on simulated data and the other on real data. For both studies, a SAS program was developed to simulate computer adaptive testing in the context of psychological and health assessments. Five design factors were manipulated: 1) correlation between dimensions, 2) item pool size, 3) test length, 4) ability group, and 5) number of estimated dimensions. Five outcome measures were calculated: the Pearson and intra-class correlations between estimated ability and true ability, root mean squared error (RMSE), bias, and the standard error of the trait estimates. These were evaluated across three correlations between dimensions (0.0, 0.4, and 0.7), four item pool sizes (10, 20, 50, and 100 items), four test lengths (5, 10, 15, and 20 items), and four levels of each ability (θ ≤ −1, −1 < θ < 0, 0 ≤ θ < 1, and θ ≥ 1).
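To make three of the outcome measures concrete, the following is a minimal Python sketch (the study itself used SAS). The function name `outcome_measures` and the example values are hypothetical, and bias is assumed to be defined as estimated minus true ability, which matches the sign convention used later in this summary. The intra-class correlation and the standard error are omitted for brevity.

```python
import numpy as np

def outcome_measures(theta_true, theta_hat):
    """Summarise how well estimated abilities recover true abilities.

    Assumes bias = mean(estimated - true); names are illustrative only.
    """
    theta_true = np.asarray(theta_true, dtype=float)
    theta_hat = np.asarray(theta_hat, dtype=float)
    error = theta_hat - theta_true  # signed estimation error per examinee
    return {
        "bias": float(error.mean()),                    # mean signed error
        "rmse": float(np.sqrt(np.mean(error ** 2))),    # root mean squared error
        "pearson": float(np.corrcoef(theta_true, theta_hat)[0, 1]),
    }

# Hypothetical examinees whose estimates are pulled toward zero.
measures = outcome_measures([-1.5, -0.5, 0.5, 1.5], [-1.2, -0.4, 0.4, 1.2])
print(measures)
```

In this toy case the estimates are shrunk symmetrically, so the overall bias cancels out even though RMSE is non-zero, which is one reason the study also reports results separately by ability group.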

A multi-dimensional CAT is more complicated than a uni-dimensional CAT. The present study involved several design constraints: 1) a simple factor structure, in which each item loads on a single dimension, was assumed; 2) only Bayesian-based methods were used, since these methods take the correlations between dimensions into account; 3) only fixed-length CATs were considered, since these are currently used in most CAT assessments; and 4) other issues, such as exposure control and content balancing, are not typical problems in psychological and health assessment and were therefore not considered.

1. How does the correlation between dimensions affect the efficiency of multi-dimensional CAT?
2. Is a multi-dimensional CAT more efficient than a uni-dimensional CAT?
3. Is there any difference between the results at different levels of the trait?
4. What is the optimal size of the item pool?

The results of this simulation study provide evidence to answer each research question. The impact of the correlation between dimensions on the efficiency of multi-dimensional CAT was assessed by comparing the outcome measures under different correlations between dimensions. A modest effect of the correlation between dimensions on the outcome measures was observed, although the effect was found primarily for correlations of 0 versus 0.4. As the correlations between dimensions increased, the root mean squared error and the standard error of the estimates tended to decrease for all three dimensions, while the Pearson and intra-class correlations between true and estimated abilities tended to increase.

When each item loads on a single dimension and the dimensions are uncorrelated (correlation between dimensions = 0.0), the item selection and ability estimation procedures based on a multi-dimensional model are equivalent to those based on a uni-dimensional model. Based on a comparison of this condition with conditions in which the correlation between dimensions was greater than 0, the multi-dimensional CAT was more efficient than the uni-dimensional CAT. The gains in efficiency obtained by the multi-dimensional CAT depend on the correlations between dimensions. In general, the larger the magnitude of these correlations, the higher the gains in efficiency over uni-dimensional CAT. As Segall (1996) pointed out, the gain in efficiency can be attributed to 1) item selection and 2) ability estimation. As defined in Chapter 2.3, the Bayesian-based item selection and ability estimation methods take the correlation between dimensions into account, which leads to noticeable improvements in ability estimates.
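The role of the correlation can be illustrated with a rough Python sketch. It assumes a normal approximation in which posterior precision equals prior precision plus test information, a standard-normal prior with one common correlation `rho` for every pair of dimensions, and a diagonal information matrix (simple structure). The function name `posterior_se` and the information value of 4.0 per dimension are hypothetical and are not taken from the study.

```python
import numpy as np

def posterior_se(rho, info_per_dim):
    """Approximate posterior standard errors for each dimension.

    rho          : common prior correlation between dimensions (assumption)
    info_per_dim : test information accumulated on each dimension; with
                   simple structure the information matrix is diagonal
    """
    k = len(info_per_dim)
    prior_cov = np.full((k, k), rho, dtype=float)
    np.fill_diagonal(prior_cov, 1.0)  # unit prior variances (assumption)
    # Normal approximation: posterior precision = prior precision + information.
    precision = np.linalg.inv(prior_cov) + np.diag(info_per_dim)
    return np.sqrt(np.diag(np.linalg.inv(precision)))

# Same hypothetical test information under the three simulated correlations.
for rho in (0.0, 0.4, 0.7):
    print(rho, posterior_se(rho, [4.0, 4.0, 4.0]).round(3))
```

With `rho = 0.0` the prior covariance is the identity matrix, so each dimension is estimated exactly as it would be in a separate uni-dimensional analysis; with larger `rho` the standard errors in this sketch shrink, in line with the direction of the findings described above.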

The third research question was addressed by comparing the root mean squared error and bias under different ability levels. Not unexpectedly, ability level had an impact on the outcome

dimensional CAT, a multi-dimensional CAT provided more accurate estimates for examinees with average true ability values than for those with true ability values in the extreme range. This was explained by examining the test information function, which showed that more information for estimating ability was available in the middle range (−1 < θ < 1) than in the extreme range (θ < −1 and θ > 1), and that slightly more information was available at the higher ability range than at the lower ability range. In addition, the direction of any bias was as expected with Bayesian estimates. When true ability was negative (θ < 0), bias was positive, or θ̂ > θ. When true ability was positive (θ > 0), all bias values were negative, or θ̂ < θ. Therefore, ability was over-estimated for examinees with negative true ability values and under-estimated for examinees with positive true ability values. This is consistent with Bayesian estimation methods, which “shrink” estimates toward the mean of the prior distribution.
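The direction of this bias can be reproduced with a deliberately simplified Python sketch. It assumes a normal prior and a normal likelihood whose precision equals the test information, so the posterior mean is a weighted average of the data and the prior mean. This is a stand-in for the graded response model actually used, and the function name `expected_bias` and the information value of 4.0 are hypothetical.

```python
def expected_bias(theta_true, test_info, prior_mean=0.0, prior_var=1.0):
    """Expected bias of a posterior-mean estimate in a normal-normal model.

    Simplifying assumption: the data behave like one normal observation
    centred on theta_true with variance 1 / test_info.
    """
    weight = test_info / (test_info + 1.0 / prior_var)  # weight on the data
    expected_estimate = weight * theta_true + (1.0 - weight) * prior_mean
    return expected_estimate - theta_true

# Negative abilities are pulled up, positive abilities are pulled down.
for theta in (-1.5, -0.5, 0.5, 1.5):
    print(theta, round(expected_bias(theta, test_info=4.0), 3))
```

Under these assumptions the bias has the opposite sign to the true ability and grows with distance from the prior mean, while more test information reduces it.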

Information on the optimal item pool size was provided by plotting the outcome measures against the item pool size. The plots indicated that, for a short test (5 items), the optimal item pool size was 20 items; for longer tests (> 5 items), the optimal item pool size was 50 items. However, if item exposure control or content balancing were an issue, a larger item pool would be needed to achieve the same efficiency in ability estimates.

The results of this simulation study also provide compelling evidence for several further findings. The first significant finding came from a comparison of the outcome measures for the ability estimates on dimension 1, dimension 2, and dimension 3. Recall that one feature of the study was that items were administered one dimension at a time. Thus, the ability estimates for dimensions after dimension 1 were based not only on items in the administered dimension but also on the relationship between dimensions. The results showed that as the number of dimensions used for estimating ability increased, RMSE and the absolute values of bias decreased. Ability estimates for dimension 3 were the most accurate, followed by those for dimension 2 and then dimension 1. This supports the idea that ability on one dimension can be used to augment the information available to estimate ability on another dimension.
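One way to see why later dimensions benefit is to look at the prior for the next dimension once earlier dimensions are taken into account. The Python sketch below assumes an equicorrelated standard-normal prior and, as a strong simplification, treats the earlier abilities as known exactly rather than estimated with error. The function name `conditional_prior` is hypothetical.

```python
import numpy as np

def conditional_prior(rho, theta_known):
    """Mean and variance of the next dimension given earlier dimensions.

    Assumes a standard-normal prior with common correlation rho and that
    the earlier abilities are known without error (a simplification).
    """
    theta_known = np.asarray(theta_known, dtype=float)
    k = theta_known.size
    if k == 0:
        return 0.0, 1.0  # no earlier dimension: unconditional prior
    r_known = np.full((k, k), rho, dtype=float)
    np.fill_diagonal(r_known, 1.0)
    r_cross = np.full(k, rho, dtype=float)
    # Standard multivariate-normal conditioning.
    weights = np.linalg.solve(r_known, r_cross)
    mean = float(weights @ theta_known)
    var = float(1.0 - r_cross @ weights)
    return mean, var

print(conditional_prior(0.7, []))          # dimension 1
print(conditional_prior(0.7, [1.0]))       # dimension 2 given dimension 1
print(conditional_prior(0.7, [1.0, 1.0]))  # dimension 3 given dimensions 1 and 2
```

Under these assumptions the prior variance narrows with each additional dimension already measured, which is consistent with dimension 3 being estimated most accurately.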

The effects of item pool size and test length on ability estimates were similar to those found for uni-dimensional CAT (Van der Linden, 1997; Wang & Vispoel, 1998; Warm, 1989). Larger item pools and longer tests yielded more accurate and reliable estimates. The results indicated that the more items there were in the item pool, the smaller the root mean squared error and standard errors of the ability estimates, and the higher the Pearson and intra-class correlations. The absolute value of bias also tended to decrease as item pool size increased, although this trend was not consistent across all experimental conditions. For conditions with the same item pool size, longer tests produced smaller root mean squared error, absolute bias, and standard errors of ability estimates, and higher Pearson and intra-class correlations.

To investigate the significance of the manipulated factors, an ANOVA was conducted for RMSE and for the standard errors of ability estimates. In terms of effect size (proportion of variance explained), the main effect of test length accounted for most of the variance in RMSE (6%) and in the standard errors of estimates (93%).
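As an illustration of a proportion-of-variance effect size, the Python sketch below computes eta squared for a single factor as the between-group sum of squares divided by the total sum of squares. That the study used exactly this measure is an assumption, the function name `eta_squared` is hypothetical, and the RMSE values are made-up toy numbers, not results from the study.

```python
import numpy as np

def eta_squared(values, groups):
    """Proportion of total variance explained by one factor (eta squared)."""
    values = np.asarray(values, dtype=float)
    groups = np.asarray(groups)
    grand_mean = values.mean()
    ss_total = np.sum((values - grand_mean) ** 2)
    # Between-group sum of squares: group size times squared mean deviation.
    ss_between = sum(
        values[groups == g].size * (values[groups == g].mean() - grand_mean) ** 2
        for g in np.unique(groups)
    )
    return float(ss_between / ss_total)

# Toy RMSE values for two hypothetical conditions at each test length.
toy_rmse = [0.60, 0.62, 0.45, 0.47, 0.38, 0.40, 0.33, 0.35]
toy_length = [5, 5, 10, 10, 15, 15, 20, 20]
print(round(eta_squared(toy_rmse, toy_length), 3))
```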

The application component of the study investigated the same factors as the simulation study using real data from a uni-dimensional survey, “DASH”, and a two-dimensional survey, “SF-36”. The real data application showed effects similar to those observed in the simulation study: the accuracy of ability estimates improved with a longer test or with higher correlations between dimensions; ability estimates were more accurate for examinees with ability in the middle range (−1 < θ < 1) than for examinees with ability in the extreme range (θ < −1 or θ > 1); and ability was over-estimated for examinees with negative true ability values and under-estimated for examinees with positive true ability values.

In document Comparing Multi-dimensional and Uni-dimensional Computer Adaptive Strategies in Psychological and Health Assessment (Page 119-122)