Progression through GP selection
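| Row£ | Competencies scored 3 or 4 | Scores of 1 | Concerns | Pre-SJT outcome | SJT band$ | 3S* | Initial outcome |
|---|---|---|---|---|---|---|---|
| 21 | 4 | 0 | 0 | 1 | 3, 4 | | 1 |
| 21 | 4 | 0 | 0 | 1 | 1, 2 | | 2 |
| 22 | 4 | 1+ | 0 | 2 | 4 | | 1 |
| 22 | 4 | 1+ | 0 | 2 | 3 | | 2 |
| 22 | 4 | 1+ | 0 | 2 | 1, 2 | | 3 |
| 25 | 3 | 0, 1 | 0 | 2 | 4 | | 1 |
| 25 | 3 | 0, 1 | 0 | 2 | 3 | | 2 |
| 25 | 3 | 0, 1 | 0 | 2 | 1, 2 | | 3 |
| 23/24~ | 4 | 0 or 1+ | 1+ | 3 | 4 | | 2 |
| 23/24~ | 4 | 0 or 1+ | 1+ | 3 | 3 | | 3 |
| 23/24~ | 4 | 0 or 1+ | 1+ | 3 | 1, 2 | | 4 |
| 26 | 3 | 0, 1 | 1+ | 3 | 4 | | 2 |
| 26 | 3 | 0, 1 | 1+ | 3 | 3 | | 3 |
| 26 | 3 | 0, 1 | 1 | 3 | 1, 2 | | 4 |
| 26 | 3 | 0, 1 | 2+ | 3 | 1, 2 | 3 | N/A# |
| 27 | 3 | 2+ | 0 | 3 | 4 | | 2 |
| 27 | 3 | 2+ | 0 | 3 | 3 | | 3 |
| 27 | 3 | 2+ | 0 | 3 | 2 | not 3 | 4 |
| 27 | 3 | 2+ | 0 | 3 | 2 | 3 | 3 |
| 28 | 3 | 2+ | 1+ | 4 | 4 | | 3 |
| 28 | 3 | 2+ | 1+ | 4 | 1, 2, 3 | | 4 |
| 29 | 0, 1 | - | - | 4 | - | not 3 | 3 |
| 29 | 0, 1 | - | - | 4 | - | 3 | 4 |
| 30 | 2 | - | 0 | 3 | 4 | | 2 |
| 30 | 2 | - | 0 | 3 | 3 | | 3 |
| 30 | 2 | - | 0 | 3 | 2 | not 3 | 4 |
| 30 | 2 | - | 0 | 3 | 2 | 3 | 3 |
| 31 | 2 | - | 1+ | 4 | 4 | | 3 |
| 31 | 2 | - | 1+ | 4 | 1, 2, 3 | | 4 |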
Competencies: 4 = strong evidence, 3 = satisfactory, 2 = limited, and 1 = little evidence. Outcomes: 4 = not demonstrated, 3 = review unclear, 2 = review likely, 1 = demonstrated.
£ Within the Excel spreadsheet.
* Number of stations with 3 or 4 scores of 3 or 4: only used on a few occasions.
~ Row 23 has no scores of 1, and row 24 has scores of 1+, but they have the same initial outcome so should be merged.
# A42 should have $F$16>0 for concerns, and should be ‘not demonstrated’.
$ No-one with an SJT band score of 1 should be at Stage 3, so we don’t know why it is included.
• With 1 or 0 competencies with a mean of 3 or more, the outcome is ‘not demonstrated’ (unless there are 3 stations with 3 or 4 scores of 3 or 4, in which case the outcome is ‘review unclear’ for all SJT bands).
The thrust of this scoring system seems to be that competencies and SJT scores are most important, but there are three further determinants of outcomes:
1. Scores of 1 are so poor that they reduce the chance of being accepted.
2. Assessor concerns also reduce the chance of being accepted.
3. If a very poor performance (or a hawkish assessor) pulls down the competency scores at one station, a candidate’s chance of being accepted is increased if their scores on all other stations are high (i.e. at least 3/4 on all competencies).
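The first steps of the table can be sketched in Python. This is a minimal sketch of our reading of the table, not the spreadsheet's actual formulae: the function names `pre_sjt_outcome` and `usual_sjt_shift` are hypothetical, the second function encodes only the usual one-step SJT adjustment, and it deliberately ignores the 3S exceptions, row 29 (where the SJT band is not used) and the N/A branch.

```python
def pre_sjt_outcome(n_competencies_ok: int, n_scores_of_1: int, n_concerns: int) -> int:
    """Pre-SJT outcome: 1 = demonstrated ... 4 = not demonstrated.

    n_competencies_ok is the number of competencies (0 to 4) with a mean
    score of 3 or more; the row numbers in comments refer to the table.
    """
    if not 0 <= n_competencies_ok <= 4:
        raise ValueError("expected between 0 and 4 competencies")
    if n_competencies_ok <= 1:                  # row 29
        return 4
    if n_competencies_ok == 2:                  # rows 30 and 31
        return 3 if n_concerns == 0 else 4
    if n_competencies_ok == 3:
        if n_scores_of_1 <= 1:                  # rows 25 and 26
            return 2 if n_concerns == 0 else 3
        return 3 if n_concerns == 0 else 4      # rows 27 and 28
    if n_concerns > 0:                          # rows 23/24
        return 3
    return 1 if n_scores_of_1 == 0 else 2       # rows 21 and 22


def usual_sjt_shift(pre_outcome: int, sjt_band: int) -> int:
    """Usual adjustment only (an assumption that ignores the 3S exceptions).

    Band 4 improves the outcome by one step, Band 3 leaves it unchanged,
    and Bands 1 and 2 worsen it by one step; the result stays within 1..4.
    """
    shift = {4: -1, 3: 0, 2: 1, 1: 1}[sjt_band]
    return min(4, max(1, pre_outcome + shift))


# Example: three competencies with a mean of 3 or more, one score of 1,
# no concerns, SJT Band 4 (row 25 of the table).
example = usual_sjt_shift(pre_sjt_outcome(3, 1, 0), 4)
```

For the row 25 example the pre-SJT outcome is 2 (‘review likely’) and SJT Band 4 moves it to 1 (‘demonstrated’), as in the table.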
Candidates who are assessed as ‘review likely’ or ‘review unclear’ are discussed at a moderation session, when all assessors who have worked together meet at the end of the half-day session. The session is led by a moderator, and all of the assessors’ marks for the candidate are displayed. Moderators had slightly different approaches, but always asked the assessors to say in turn what behaviours they had observed and/or why they had made their judgements; the moderator then made an overall judgement. There was no discussion of the final judgement. Assessors are not aware of other assessors’ judgements except for the candidates who are being moderated. The focus is on competencies with means of less than three. One moderator always asked the assessor who gave the lowest marks to explain their reasons first. Moderators differed in how much agreement they sought before deciding. The candidate has to be deemed to pass all competencies, but seemed to be given the benefit of the doubt if they had high scores elsewhere. Also, when asked to justify marks, some assessors gave general comments, i.e. they were not focussed on competency judgements.
The impact of the moderation is unclear, and it would benefit from both statistical and sociolinguistic analyses. Superficially it allows the words and the behaviours of candidates to be discussed properly, rather than just numerical ratings, and that is admirable. A null hypothesis would be that the process contributes little beyond what was already contributed by the candidate’s overall mark; a negative hypothesis would be that the process of moderation exaggerates the importance of one or two utterances of a candidate, which are viewed as positive or negative indicators, and so makes the system less reliable. Whether a moderated decision has greater predictive validity than a decision based on the mean mark is the key question; ideally, answering it would require relating the various marks to a later outcome (MRCGP marks being the obvious one). However, those who fail do not go on to GP training, so no later outcome is available for them, and we have not explored this further.
2.3.2 How the final score is calculated
Table 2.3 shows the competencies that are assessed at each station. Each individual mark is in the range 1 to 4, and there are 13 marks (Y in Table 2.3), which give a total in the range 13 to 52. In 2015, this total score was used to rank candidates. ES has four marks and its total is in the range 4 to 16, whereas the raw totals for CS, CT&PS and PI are in the range 3 to 12. Prior to 2015, the latter were therefore rescaled to also be in the range 4 to 16. As a result, raw scores of 3, 4, 5, 6, 7, 8, 9, 10, 11 and 12 became rescaled marks of 4, 5, 7, 8, 9, 11, 12, 13, 15 and 16. Note that the rescaling is non-linear, with steps of one except between raw scores of 4 and 5, 7 and 8, and 10 and 11, which have a step size of 2 (from 5 to 7, 9 to 11 and 13 to 15). It is not clear what the rationale was for this non-linear rescaling. The grand total for the Stage 3 stations was then in the range 16 to 64.
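The rescaled marks can be checked with a short Python sketch. The table `LISTED` copies the values given in the text; the function `rescale` is a hypothetical candidate rule (multiply by 4/3 and round to the nearest integer) that happens to reproduce those values. The source does not state that this rule was used, so it should be read as an assumption.

```python
# Rescaled marks as listed in the text for raw station totals of 3 to 12.
LISTED = {3: 4, 4: 5, 5: 7, 6: 8, 7: 9, 8: 11, 9: 12, 10: 13, 11: 15, 12: 16}


def rescale(raw_total: int) -> int:
    """Candidate rule (an assumption): scale by 4/3, round to nearest integer."""
    return round(raw_total * 4 / 3)


# Size of the step in rescaled marks when the raw total rises by one.
steps = {raw: LISTED[raw] - LISTED[raw - 1] for raw in range(4, 13)}

# Raw totals that are reached by a step of 2 rather than 1.
double_steps = [raw for raw, step in steps.items() if step == 2]

# True if the candidate rule reproduces every listed value.
matches_listed = all(rescale(raw) == mark for raw, mark in LISTED.items())
```

Under this reading the steps of 2 are an artefact of rounding a linear 4/3 scaling to whole marks, since 4/3 of a raw mark is not an integer unless the raw total is a multiple of 3.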
Prior to 2015, the final score at the end of Stage 3 included the band scores for CPST and SJT and was calculated as the grand total for the Stage 3 stations (range 16 to 64) + 4 times the SJT Band score (range 4 to 16) + 4 times the CPST Band score (range 4 to 16), giving a final score at the end of Stage 3 in the range 24 to 96. This final score was not used to decide whether a candidate was offered a place, although it is used to determine the subsequent allocation of posts to those who are. Note that prior to 2015 Stage 2 was worth 32/96 = 33%, but in 2015 it was not included in the final score, except as a tie-break to create a unique rank in Oriel for offering purposes16.
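The arithmetic of the pre-2015 final score can be written out as a minimal Python sketch. The function name `final_score_pre_2015` and its argument names are hypothetical; only the formula and the ranges come from the text above.

```python
def final_score_pre_2015(stage3_total: int, sjt_band: int, cpst_band: int) -> int:
    """Grand total for Stage 3 plus 4 times each Stage 2 band score."""
    if not 16 <= stage3_total <= 64:
        raise ValueError("Stage 3 grand total must be in the range 16 to 64")
    for band in (sjt_band, cpst_band):
        if not 1 <= band <= 4:
            raise ValueError("band scores must be in the range 1 to 4")
    return stage3_total + 4 * sjt_band + 4 * cpst_band


# Lowest and highest possible final scores.
lowest = final_score_pre_2015(16, 1, 1)
highest = final_score_pre_2015(64, 4, 4)

# Maximum contribution of Stage 2 (SJT plus CPST) and its share of the maximum.
stage2_max = 4 * 4 + 4 * 4
stage2_share_percent = round(100 * stage2_max / highest)
```

The two extremes give 24 and 96, and the Stage 2 bands contribute at most 32 of the 96 marks, i.e. about 33%.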
In 2014, the lowest Stage 3 score that led to being offered a training place was 60/96 (the written test competency scores were 1, 1, 1 and 2 respectively: clearly, it was decided at moderation that this should not override the scores of 3 or 4 for the scenarios); the highest score of a rejected candidate was 79.
7.3 SUMMARY
Candidates pass through three Selection Stages, Stage 1 (Administrative Checks), Stage 2 (CPST and SJT), and Stage 3 (Selection Centre), and may fail (be rejected) at each stage. Between 2009 and 2015, Round 1 applications from UK graduates rose from 3,503 in 2009 to 4,318 in 2013, and then fell to 3,696 in 2015; non-UK graduate applications have halved in that time. The rate of offers being accepted has remained around 61% for UK graduates, with about 22% failing at Stages 1, 2 and 3, and 18% withdrawing. For non-UK graduates, the acceptance rate is much lower at about 24%, and perhaps falling; about 71% fail at Stages 1, 2 and 3 and just 6% withdraw.
The process of developing and testing Stage 2 questions has been outlined. CPST and SJT scores are each put into 4 Bands; a candidate in Band 1 on either test is not invited to Stage 3. The Band 2 threshold was raised in 2011, meaning that about 10% of candidates were excluded from Stage 3, instead of the previous 5%. However, in 2014 Round 1, five percent of candidates (334/6,688) met the criterion of both CPST and SJT Band 2 or higher but were not invited to Stage 3.
Stage 3 face validity was thought to be very high. There are four stations in Stage 3: three simulations/scenarios that we view as OSCE-style role plays and a written exercise. One assessor at each station awards marks for 3 or 4 competencies on a 1 to 4 scale (equating to something like: little, limited, satisfactory and strong evidence). These scores and the SJT band are converted to outcomes in a complex way: we have traced 29 branches of this algorithm. For each of the four competencies, the mean score is calculated. The number of competencies with a mean score of 3 or greater is the major factor in determining the initial outcome. However, the number of scores of 1 and any concerns raised by the assessors are also involved in the decision-making process. This initial outcome usually moves up with SJT Band 4, stays the same with Band 3, and moves down with Band 2. Sometimes whether there are 3 stations with 3 or 4 scores of 3 or 4 also affects the outcome. This algorithm leads to one of the following outcomes: 1 = Demonstrated, i.e. offered a post; 2 = Review likely; 3 = Review unclear; and 4 = Not demonstrated, i.e. rejected. Those with ‘Review likely’ or ‘Review unclear’ are discussed at a moderation session, with ‘Review likely’ more likely to be offered a post, so the final outcome is either ‘Demonstrated’ or ‘Not demonstrated’. In 2014 Round 1, 874/1,173 (75%) of candidates reviewed at moderation were offered places. It may be better to use a total score and no moderation instead of the current complex algorithm followed by moderation: some possibilities are modelled in Chapter 7. Chapter 3 considers the distinctiveness of the four competencies and four stations, which has important implications for this issue. We discuss possible changes in the discussion (Chapter 11).