comprehension of a Spanish language exam may utilize the same listening material followed by three or more questions. However, item sets may violate the fundamental local independence assumption of unidimensional item response theory (IRT) due to their shared content (Wainer & Kiely, 1987).
Our contributions are as follows: First, we introduce IRT and describe its benefits and methodology. Second, we apply IRT to Recognizing Textual Entailment (RTE) and show that evaluation sets consisting of a small number of sampled items can provide meaningful information about the RTE task. Our IRT analyses show that different items exhibit varying degrees of difficulty and discrimination power and that high accuracy does not always translate to high scores in relation to human performance. By incorporating IRT, we can learn more about dataset items and move past treating each test case as equal. Using IRT as an evaluation metric allows us to compare NLP systems directly to the performance of humans.
CTT statistics analyze test difficulty and discrimination at the item level, whereas the reliability of the test is analyzed at the test level. Classical analysis takes the test, not the test item, as its unit of analysis. Moreover, although the statistics generated are often generalized to similar students taking a similar test, strictly speaking they apply only to the students who took that test.
Two methods of item analysis were explored, the traditional CTT approach and the more recent IRT method. These methods have their strengths and weaknesses. The use of both methods may yield more information than only using one of the methods. Each method was explored individually and then the results from each method compared and contrasted. CTT and IRT methods identified common items for removal from the item pool. Each method also suggested some items that could be removed that were not indicated by the other method using the criteria outlined. The CTT-MAP analysis indicated that three items were more highly correlated with a total other than the hypothesised construct total. There were feasible explanations for the removal of all three items. IRT additionally highlighted items that had low information and could possibly be removed. This was preferable to the CTT approach of item reduction, where factor analysis is used and may result in small areas of a construct being covered. This problem is even more likely if some of the items have similar wordings, as these would be the strongest indicators of the factor and would be retained ahead of other items. Using IRT can also result in the items representing a small area of the construct. However, this is driven by a different theoretical approach to CTT, based upon items not discriminating well or not having much information.
This article is about FMCSA data and its analysis. The article responds to the two-part question: How does an Item Response Theory (IRT) model work differently ... or better than any other model? The response to the first part is a careful, completely non-technical exposition of the fundamentals of IRT models. It differentiates IRT models from other models by providing the rationale underlying IRT modeling and by using graphs to illustrate two key properties of data items. The response to the second part of the question, about the superiority of an IRT model, is, “it depends.” For FMCSA data, serious challenges arise from the complexity of the data and from the heterogeneity of the carrier industry. Questions are posed that will need to be addressed to determine the success of the actual model developed and of the scoring system.
As discussed above, conducting an EFA within the CTT framework can be considered problematic, as ordinal data are treated as equal-interval data. Also, Brown (2005) found that EFA is plagued by statistical artefacts such as overfactoring and Heywood cases, especially when a large number of items are factor analysed at the same time. The questionnaire data were thus re-examined from an IRT perspective. IRT analysis was conducted from both a unidimensional and a multidimensional perspective – although multidimensional item response theory (MIRT) was the primary focus.
Arbitrary weighting methods clearly affect the reliability of measures from a CTT perspective. Of present interest is how arbitrarily weighting items and test sections might affect a test with IRT analysis. Lukhele and Sireci (1995) discussed this problem in the context of the conversion of the writing skills section of the GED test from CTT analysis to an IRT analysis. Traditionally, the test had weights of .64 and .36 for the MC and CR sections, respectively, which were arbitrarily chosen to allow the essay section to adequately contribute without overly reducing the composite reliability (like most tests, we can assume the reliability of the CR section to be substantially less than that of the MC section). Lukhele and Sireci obtained the “unweighted” IRT marginal reliabilities (Green, Bock, Humphreys, Linn, & Reckase, 1984; Thissen & Orlando, 2001, p. 119) for each section, followed by a “maximum” reliability, in the IRT sense of the word, for the composite as computed by a simultaneous IRT analysis of all items. From these IRT marginal reliabilities, the traditional .64/.36 weights were applied to the trait estimates to compute a “weighted” composite reliability, which was nearly 6 percent less than the unweighted IRT “maximum” reliability. This loss in reliability can be in part attributed to weighting the test section marginal
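As a concrete illustration of how section weights interact with section reliabilities, the CTT reliability of a two-section weighted composite can be sketched as below. This is a minimal sketch with hypothetical section variances, reliabilities, and between-section correlation, not the actual GED figures.

```python
def composite_reliability(w, var, rel, r12):
    """CTT reliability of a two-section weighted composite.

    w:   (w1, w2) section weights
    var: (v1, v2) section score variances
    rel: (rho1, rho2) section reliabilities
    r12: correlation between the two section scores
    Measurement errors are assumed uncorrelated across sections.
    """
    w1, w2 = w
    v1, v2 = var
    rho1, rho2 = rel
    total_var = w1**2 * v1 + w2**2 * v2 + 2 * w1 * w2 * (v1 * v2) ** 0.5 * r12
    error_var = w1**2 * v1 * (1 - rho1) + w2**2 * v2 * (1 - rho2)
    return 1 - error_var / total_var

# Hypothetical values: a reliable MC section weighted .64 and a less
# reliable CR section weighted .36, correlated .6 with each other.
print(round(composite_reliability((0.64, 0.36), (100.0, 100.0), (0.92, 0.70), 0.6), 3))
```

Varying `r12` or the section reliabilities shows how placing a heavy weight on a less reliable section drags the composite reliability down.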
Linear factor analysis (FA) models can be reliably tested using test statistics based on residual covariances. We show that the same statistics can be used to reliably test the fit of item response theory (IRT) models for ordinal data (under some conditions). Hence, the fit of an FA model and of an IRT model to the same data set can now be compared. When applied to a binary data set, our experience suggests that IRT and FA models yield similar fits. However, when the data are polytomous ordinal, IRT models yield a better fit because they involve a higher number of parameters. But when fit is assessed using the root mean square error of approximation (RMSEA), similar fits are obtained again. We explain why. These test statistics have little power to distinguish between FA and IRT models; they are unable to detect that linear FA is misspecified when applied to ordinal data generated under an IRT model.
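For reference, the RMSEA used in such comparisons is a simple function of the model chi-square, its degrees of freedom, and the sample size; a minimal sketch (with made-up inputs) is:

```python
def rmsea(chisq, df, n):
    """Root mean square error of approximation:
    sqrt(max(chisq - df, 0) / (df * (n - 1)))."""
    return (max(chisq - df, 0.0) / (df * (n - 1))) ** 0.5

# Hypothetical fit: chi-square of 120 on 50 df with n = 400 examinees.
print(round(rmsea(120.0, 50, 400), 3))  # values below roughly .05-.08 are conventionally read as close fit
```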
Linearity and the assessment of change in severity
Two studies investigated whether the magnitude of cognitive dysfunction represented by each item on the cognitive scale was equal across the scale [1,49]. In a recent paper Balsis et al. also drew attention to the limitations associated with the traditional method of measuring cognitive dysfunction with the ADAS-cog. This study was not included in the review as it did not provide information on the individual items or subscales; however, its analysis of IRT scoring of the ADAS-cog is worth noting. Balsis et al. found that individuals with the same total score can have different degrees of cognitive impairment and, conversely, those with different total scores can have the same amount of cognitive impairment. These findings are supported by a similar study, also failing to meet inclusion criteria due to some use of non-English language measures and a lack of information on test/item information. Results indicate that participants with equal ADAS-cog scores had distinctly different levels of cognitive impairment. Equally, participants with the same estimated level of impairment had wide-ranging ADAS-cog scores. The same differences in scores did not reflect the same differences in level of cognitive impairment along the continuum of the test score range. Without equal intervals between adjacent test items, change scores may reflect different amounts of change for subjects with differing levels of severity, or may fail to identify change at all. Wouters et al. revised the ADAS-cog scoring based on the results of this IRT analysis by weighting the items in accordance with their measurement precision and by collapsing their categories until each category was hierarchically ordered, ensuring the number of errors increases with a decline along the continuum of cognitive ability. Examining difficulty hierarchies of the error categories within the items revealed some disordered item categories.
As the categories are only useful if they have a meaningful hierarchy of difficulty, these disordered categories were collapsed until all categories were correctly ordered in hierarchies of difficulty. This revision resulted in a valid one-to-one correspondence between the summed ADAS-cog scores and estimated levels of impairment.
These items were chosen by one of the authors (MGWD) because they were clinically relevant for this patient population and spanned the whole range of functional status represented by the ALDS item bank. All other respondents were interviewed by specially trained nurses or doctors. The respondents who had had a stroke were all presented with a single set of 21 items chosen by one of the authors (NW) to cover the lower end of the ALDS item bank (set B). The residents of supported housing, residential care or nursing homes and the respondents with Parkinson's disease were presented with one of four sets of 80 items (sets C, D, E and F), which were described previously. In these sets, each of 160 items covering the whole range of levels of functioning represented by the item bank was randomly allocated to two sets. Item sets C and D have half their items in common, as do sets D and E, sets E and F, and sets F and C. The data collection design is illustrated in Figure 1. The items that are in each set are indicated by the solid blocks. It can be seen that all sets except B contain items from the whole range of the item bank and that set B mainly contains items from the lower end of the range of functional status represented by the ALDS item bank. Further details of the item sets are given in Table 1. The items that are in each set, the number of respondents to whom each item was presented, and the number responding in each category are indicated in Table 2.
Item response theory (IRT) models, first introduced by Lord in 1952 for dichotomously scored items, were developed to circumvent the limitations of CTST. However, Lord (1980; Lord & Novick, 1968) proposed formulas that link the item difficulty and item discrimination indices of CTST and the two-parameter IRT model under the conditions that ability is normally distributed and there is no guessing (Lord). One example in which these formulas can be profitably used is in the area of computer adaptive testing. Given the cost associated with obtaining samples large enough to use IRT to screen items for the needed item banks, Lord's formulas, which can be used with smaller samples, could provide the information needed to screen items. Another is in the field testing of items that have been embedded in operational test forms in such a way as to have a multiple matrix sample design. The samples of students per item may not be sufficient to conduct IRT analyses. In this case, Lord's estimates may be sufficient to complete an item analysis of the embedded field-test items to determine which of them should be selected for a future operational form.
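A sketch of the kind of conversion these formulas provide is below. It assumes the commonly cited normal-ogive forms a = ρ/√(1 − ρ²) and b = Φ⁻¹(1 − p)/ρ, where p is the classical difficulty (proportion correct) and ρ the item-ability biserial correlation; the item values are hypothetical, and the exact formulas for a given application should be checked against Lord (1980).

```python
from statistics import NormalDist

def ctt_to_2pl(p, rho):
    """Approximate normal-ogive 2PL parameters from CTT item statistics,
    assuming normally distributed ability and no guessing.

    p:   proportion correct (classical difficulty)
    rho: item-ability biserial correlation (classical discrimination)
    Returns (a, b): IRT discrimination and difficulty.
    """
    a = rho / (1 - rho**2) ** 0.5
    b = NormalDist().inv_cdf(1 - p) / rho
    return a, b

# A fairly easy, moderately discriminating hypothetical item:
# the IRT difficulty b should come out negative (easy).
a, b = ctt_to_2pl(p=0.80, rho=0.50)
```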
This finding was important because it shows that, even after using questions from previously validated assessments to create a diagnostic physical science test and after a panel of experts revised and approved it, the assessment still had questions that were psychometrically problematic upon further analysis using IRT. This situation might be similar to what happens in many science classrooms, where science teachers, as content experts, prepare tests with questions that look satisfactory (that is, the test has face validity), but might include questions that could be interpreted by students in different ways. This study demonstrated that, given a large enough sample size, science teachers could use IRT concepts to post-validate and improve their diagnostic, end-of-course, and unit assessments.
and subject abilities being independent of the items being observed. Secondly, CTT researchers are forced to perform an arbitrary factor analysis on parameter values to realise the final index, but this never fully explains why certain factors are more important than others. It also fails to explain why, when using MCMC in IRT to model the underlying trait, and assuming that each parameter has equal importance and discriminatory power, it is possible to simulate the probable values of parameters and then extract the values that best fit the response pattern.
In order to check unidimensionality, the data were analyzed using the TESTFACT software. A factor analysis was carried out through this software and the eigenvalues were checked. It was found that a single dominant factor underlay the responses; therefore, the unidimensionality assumption was met. The following table shows the results. As can be observed, there was one major factor, which accounted for more than 17% of the variance of the scores in each subpart (Yen, 1985). This shows that every subpart had one major underlying factor. It should be mentioned that some subparts consisted of just 5 or 10 items; therefore, it would be acceptable if the percentages of variance for these subparts were low. Based on these results, the researchers concluded that the unidimensionality assumption for the IRT models held for the data used in this study.
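The eigenvalue screen described here can be approximated outside TESTFACT. The sketch below (assuming NumPy is available, and using simulated rather than the study's data) checks the share of variance carried by the largest eigenvalue of the inter-item correlation matrix.

```python
import numpy as np

def first_factor_share(responses):
    """Share of total variance carried by the largest eigenvalue of the
    inter-item correlation matrix (a rough unidimensionality screen).

    responses: examinees x items array of item scores.
    """
    corr = np.corrcoef(responses, rowvar=False)
    eigvals = np.linalg.eigvalsh(corr)  # ascending order
    return eigvals[-1] / eigvals.sum()

# Simulate responses driven by a single latent trait: the first factor
# should then carry a clearly dominant share of the variance.
rng = np.random.default_rng(0)
theta = rng.normal(size=(500, 1))          # one latent ability per examinee
items = theta + rng.normal(size=(500, 8))  # eight items loading on it
share = first_factor_share(items)
```

With equal loadings and unit noise as simulated here, the expected first-eigenvalue share is roughly 9/16; real scales with weaker loadings, like the 17% subparts above, can still show one factor dominating the remaining eigenvalues.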
Confirmatory factor analysis: As an initial step, we examined whether each of the four subscales was unidimensional by fitting a single-factor Spearman model to each of the four item sets (cf. Anderson & Gerbing, 1982; McDonald, 1999). The fit of the Spearman models was judged by means of the RMSEA and the Root Mean Square Standardised Residual (RMR). Each model was identified by constraining the factor variance to unity, and all error variances were specified to be uncorrelated. As a second step, responses to the 27 items comprising the SDCM, VD, BM and MP subscales were jointly subjected to a restricted maximum likelihood confirmatory factor analysis. The measurement model was equivalent to the model tested by Briscoe et al. (2006) and specified that items 1–8 define the SDCM factor, items 9–14 the VD factor, items 15–22 the BM factor and items 23–27 the MP factor. All error variances were specified to be uncorrelated and the model was identified by fixing the factor variances to unity. In accordance with Briscoe et al., the fit of the model to the data was evaluated by means of the likelihood chi-square test, the RMSEA, the Comparative Fit Index (CFI), the Normed Fit Index (NFI) and the Incremental Fit Index (IFI). Unrestricted factor analysis: As a third step, responses to the 27 items were subjected to an unrestricted maximum likelihood factor analysis, which is one form of common factor analysis (Cudeck, 2000). Although unrestricted, the analysis was conducted in a confirmatory spirit with the aim of identifying sources of misfit in the restricted confirmatory factor analysis (McDonald, 2005).
curvature properties like smoothness or monotonicity (which are usually ignored during IRT modeling) has led us to develop a semi-parametric dynamic item response model with a monotonic and smooth growth curve in Chapter 4. Finally, we have partially addressed a relatively less-explored and less-stressed aspect of IRT modeling: clustering ability curves. To the best of our knowledge, ability curves have never been clustered with the objective of identifying groups of students whose learning rates differ significantly from one another. This has motivated us to build a distance-based clustering method to cluster students based on differences in learning patterns and, eventually, to propose a model-based alternative as part of future work in Chapter 5.
The comparison of the IA and IRT analyses is summarised in Table 2. Note the discrepancy between the analyses for Q1b in terms of difficulty and discrimination; item Q1b is a poor item based on the IRT analysis, but a good item based on IA. Also note the similarity for Q1e; both analyses show that Q1e is an easy item. For the rest of the items, IRT gives more refined cutoff values for the discriminative ability of the items, while IA only shows that all the items are very good at discriminating between high and low scorers. It is also easy to decide on the difficulty of any item based on the IRT analysis results, because the values range from negative to positive, representing an intuitive progression from easy to difficult.
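The negative-to-positive difficulty scale referred to here comes from the item characteristic curve; a minimal two-parameter logistic sketch, with hypothetical parameter values:

```python
import math

def icc_2pl(theta, a, b):
    """Two-parameter logistic item characteristic curve: probability of
    a correct response at ability theta, with discrimination a and
    difficulty b."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# At average ability (theta = 0), an easy item (b = -1.5) is answered
# correctly far more often than a hard item (b = +1.5).
easy = icc_2pl(0.0, a=1.0, b=-1.5)  # about 0.82
hard = icc_2pl(0.0, a=1.0, b=1.5)   # about 0.18
```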
The current study thoroughly investigated the psychometric properties of scores from four measures: the SCS, the DDI, the EES, and the ERQ-S. More specifically, I conducted a series of analyses consistent with the work of Toland (2014). I calculated descriptive statistics and reliability estimates to compare values to previous research as well as to provide statistics to inform future research. Next, four CFAs were necessary to determine whether the four measures met the IRT assumption of unidimensionality. Finally, IRT analyses of each of the measures involved several steps to ascertain the intricacies of each measure's psychometric properties. It was beneficial to conduct this study despite preexisting evidence of strong psychometric properties generated via classical test theory procedures, because IRT analyses allow researchers to understand how individual differences in latent trait levels influence the precision of measures.
Note that the requirement of local independence does not hold in all cases: for instance, where some examinees have a higher expected test score than other examinees of the same ability (Hambleton & Swaminathan, 1985), or where the test is long and examinees learn while answering items. Accordingly, the items associated with one stimulus are likely to be more related to one another than to items associated with another stimulus (Kolen & Brennan, 2004). For example, when tests are composed of sets of items that are based on common stimuli, such as items on reading passages or diagrams, then Equation  is unlikely to hold. Even when the assumptions of local independence and unidimensionality are violated, however, both might hold closely enough for IRT to be used in many practical situations, such as in CAT. If unidimensionality is met, then the local independence assumption is also usually met (Hambleton et al., 1991). However, local independence can be met even when the unidimensionality assumption is not (Scherbaum et al., 2006; Kolen & Brennan, 2004).
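What local independence buys, formally, is that the probability of an entire response pattern, conditional on ability, factors into a product of per-item probabilities; a sketch with hypothetical 2PL items:

```python
import math

def joint_prob(pattern, theta, items):
    """Probability of a full response pattern under local independence:
    the product of per-item 2PL probabilities conditional on theta.

    pattern: list of 0/1 responses
    items:   list of (a, b) discrimination/difficulty pairs
    """
    prob = 1.0
    for u, (a, b) in zip(pattern, items):
        p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
        prob *= p if u == 1 else 1.0 - p
    return prob

# Three hypothetical items; when items share a stimulus (a reading
# passage, say), this factorization can fail.
items = [(1.2, -0.5), (0.8, 0.0), (1.5, 1.0)]
prob = joint_prob([1, 1, 0], theta=0.3, items=items)
```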