4.3 Reliability testing of the Cognitive Therapy Scale – Revised Pain
4.5.1 Comparison with other studies
One of the first studies to examine the reliability of the CTS was conducted by Dobson, Shaw and Vallis (118), in which four CBT experts rated 21 recordings. They found high internal consistency (Cronbach’s alpha .95), Pearson correlations for individual items ranging from .54 to .87, and a correlation of .94 for the total score. The same research group conducted another study in which five experts rated 10 recordings of trainee therapists and found lower reliability for individual items using an intra-class correlation coefficient (ICC .27 to .59) (119). The use of a Pearson correlation may have inflated agreement in the first study compared to the second, which used an ICC: a Pearson correlation is insensitive to systematic differences between raters, whereas an ICC can penalise them.
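The difference between the two statistics can be illustrated with a minimal sketch. The scores below are invented for illustration and are not taken from any of the studies cited; the ICC form used (two-way random effects, absolute agreement, single rater) is also an assumption, since the cited studies may have used other forms. Rater B scores every recording exactly two points higher than rater A, so the Pearson correlation is perfect while the ICC is considerably lower.

```python
import numpy as np

def pearson_r(a, b):
    """Pearson correlation between two raters' scores."""
    return float(np.corrcoef(a, b)[0, 1])

def icc_absolute_agreement(ratings):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    `ratings` has one row per recording and one column per rater.
    """
    x = np.asarray(ratings, dtype=float)
    n, k = x.shape
    grand = x.mean()
    row_means = x.mean(axis=1)
    col_means = x.mean(axis=0)
    # Mean squares from a two-way ANOVA without replication
    ms_rows = k * np.sum((row_means - grand) ** 2) / (n - 1)
    ms_cols = n * np.sum((col_means - grand) ** 2) / (k - 1)
    resid = x - row_means[:, None] - col_means[None, :] + grand
    ms_err = np.sum(resid ** 2) / ((n - 1) * (k - 1))
    return float(
        (ms_rows - ms_err)
        / (ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n)
    )

# Hypothetical scores: rater B is consistently two points more lenient
rater_a = np.array([2.0, 3.0, 4.0, 5.0, 6.0])
rater_b = rater_a + 2.0

r = pearson_r(rater_a, rater_b)
icc = icc_absolute_agreement(np.column_stack([rater_a, rater_b]))
print(f"Pearson r = {r:.2f}, ICC(2,1) = {icc:.2f}")
```

In this constructed case the raters rank the recordings identically, which is all the Pearson correlation measures, whereas the absolute-agreement ICC treats the constant two-point gap as disagreement.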
Blackburn et al (121) found an overall inter-rater reliability of 0.63 for the CTS-R (using an ICC). This is somewhat lower than the current study, which found a correlation of 0.82, and could be explained by two effects that have the potential to deflate reliability coefficients. First, if the sample has a wide variation in competence levels, raters will appear to be closer in agreement even when large differences occur; conversely, a narrower band of competence can artificially deflate agreement (161). Nearly all the therapists in the Blackburn trial were on higher-level CBT training and we would expect them to be similar in competency level, although ranges of the raw data were not presented, which precludes checking this assumption. Second, whilst four raters each rated 51 recordings in total, the sampling method meant that only 17 recordings were rated by each pair of raters. The correlation coefficients achieved between each pair of raters were averaged to produce the overall correlation coefficient. With more raters come more opportunities for agreement, and thus using average correlation coefficients from two raters may have deflated the agreement levels.
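The effect of a narrow band of competence can be sketched with a small simulation. All values here are hypothetical and chosen only for illustration: two raters score each recording as the therapist's true competence plus independent rater error of the same size in both scenarios, and only the spread of true competence differs. A simple one-way ICC is used for brevity, which is an assumption and not necessarily the form used by Blackburn et al.

```python
import numpy as np

def one_way_icc(ratings):
    """ICC(1,1) from a one-way ANOVA: rows are recordings, columns are raters."""
    x = np.asarray(ratings, dtype=float)
    n, k = x.shape
    row_means = x.mean(axis=1)
    ms_between = k * np.sum((row_means - x.mean()) ** 2) / (n - 1)
    ms_within = np.sum((x - row_means[:, None]) ** 2) / (n * (k - 1))
    return float((ms_between - ms_within) / (ms_between + (k - 1) * ms_within))

def simulate_icc(competence_sd, rater_error_sd=1.0, n_recordings=2000, seed=0):
    """Simulate two raters scoring recordings; all parameters are hypothetical."""
    rng = np.random.default_rng(seed)
    # True competence per recording, shared by both raters
    true_score = rng.normal(0.0, competence_sd, size=(n_recordings, 1))
    # Independent scoring error for each rater
    error = rng.normal(0.0, rater_error_sd, size=(n_recordings, 2))
    return one_way_icc(true_score + error)

icc_wide = simulate_icc(competence_sd=2.0)    # heterogeneous therapists
icc_narrow = simulate_icc(competence_sd=0.5)  # similar competence levels
print(f"wide range: {icc_wide:.2f}, narrow range: {icc_narrow:.2f}")
```

Under these assumptions the expected ICC is the competence variance divided by the sum of competence and error variance, that is 0.8 for the wide sample and 0.2 for the narrow one, even though the raters are equally precise in both.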
In contrast, a high level of agreement was found by a group of authors who modified the CTS over two stages for use in psychosis (125, 126). They removed several skills from the original CTS that they felt were inappropriate to the client group: pacing, empathic skills, and case conceptualisation. Two items, the use of cognitive interventions and the use of behavioural interventions, were combined into one, and a new item relating to the overall quality of the intervention was added. The resultant tool, the CTS-Psy, was 10 items long and underwent reliability testing by four raters assessing five audio recordings from trainees undergoing specialist CBT training for patients with psychosis. Inter-rater reliability was very good for the total score (ICC 0.94) and moderate to very good for individual competencies (ICC range 0.41 to 0.95). There was no discernible pattern in the reliability of individual competencies between the Haddock study and the current one. One reason for the high levels of correlation seen in the Haddock study could be the ‘intensive training’ for raters, in addition to a manualised approach to tool use. In support, Barber et al (86) observed that many studies appear to show greater levels of inter-rater reliability when the raters work together, which is typical practice within clinical trials but less so in clinical training programmes.
At the same time as the CTS-Psy was developed, the original CTS was revised into the CTS-R (121). These two tools were directly compared in a study by Gordon (87). In this study, a pool of nine raters was randomly selected to rate 26 recordings (two ratings per recording). The ICC for the CTS-R was relatively low at .38 (95% CI .01 to .67) but rose to .76 (CI .33 to .94) when the authors excluded raters who had not attended training sessions on the use of the CTS-R. This finding is in line with a study which demonstrated increased reliability of the CTS-R after a 3.5 hour training session (123). In this study, 24 students submitted two videotaped recordings to be rated, one prior to the raters’ training session and one subsequently. A pool of 10 raters then assessed approximately four tapes each. The Pearson’s correlation for the overall score rose from .44 pre-training to .67 post-training, although wide variation was seen in individual competency items, which was reduced by the training (pre-training range -0.04 to 0.59 and post-training range 0.26 to 0.62). No specific pattern can be seen between the reliability for the individual competency items in the Reichelt study and the current one, although the same item achieved the lowest reliability score: pacing/efficient use of time. This may reflect a poorly defined competency.
High levels of internal consistency were seen in the data presented in this thesis on the Cognitive Therapy Scale (86); this may imply a high degree of overlap between the items and skills. Alternatively, the internal consistency could have been inflated by the halo effect, where a rater has decided on a level of competency for a physiotherapist, is then influenced by this for each competency item, and clusters all scores around that level (121).
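How a halo effect could inflate internal consistency can be sketched as follows. This is a deliberately simplified simulation with invented parameters: item-level skills are assumed to be independent of one another (which is unlikely in practice), the number of items is hypothetical, and the halo is modelled as a single global impression added to every item score for a session.

```python
import numpy as np

def cronbach_alpha(scores):
    """Cronbach's alpha; rows are rated sessions, columns are scale items."""
    x = np.asarray(scores, dtype=float)
    k = x.shape[1]
    sum_item_var = x.var(axis=0, ddof=1).sum()
    total_var = x.sum(axis=1).var(ddof=1)
    return float(k / (k - 1) * (1.0 - sum_item_var / total_var))

def simulate_alpha(halo_weight, n_sessions=2000, n_items=10, seed=0):
    """Simulate item scores with an optional halo; all parameters are hypothetical."""
    rng = np.random.default_rng(seed)
    # Independent skill level on each item (a simplifying assumption)
    item_skill = rng.normal(size=(n_sessions, n_items))
    # One global impression per session, added to every item score
    impression = rng.normal(size=(n_sessions, 1))
    return cronbach_alpha(item_skill + halo_weight * impression)

alpha_no_halo = simulate_alpha(halo_weight=0.0)
alpha_halo = simulate_alpha(halo_weight=1.0)
print(f"no halo: {alpha_no_halo:.2f}, with halo: {alpha_halo:.2f}")
```

In this sketch the underlying item skills are identical in both runs, so any rise in alpha is produced by the shared global impression alone. A high alpha therefore cannot by itself distinguish genuinely overlapping skills from a halo effect.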
Overall, the reliability scores within this study are encouraging given that the raters did not have extended training in the use of the tool and did not work together. In addition, many of the reliability studies described used Pearson’s correlations, which could have inflated agreement compared to the ICC.
Whilst the CTS-R and CTS-R-Pain have high levels of total score reliability, reliability for individual items can be very variable. However, all results have to be interpreted with caution due to the small sample size and resultant wide confidence intervals, discussed further in the limitations section below.