4 The evaluation of a shoulder simulator
4.3 Setting standards for surgical performance
To break through the performance testing paradox¹⁶ and establish internationally accepted standards, the author has adopted an approach based upon work arising from the 5th Cambridge Conference on Medical Education, referred to by Newble and Southgate (76;78). The sequential steps for assessing clinical competence are, firstly, to determine the purpose (in this case, shoulder arthroscopy training) and then to define what is to be tested, e.g. orientation, navigation, pattern recognition. This leads to three steps:
1. Identify the desired clinical level of resolution. This provides the objectives.
2. For each problem, define the clinical tasks for which students are expected to be competent.
3. Prepare a blueprint to guide the selection of problems to be included in the assessment procedure.
¹⁶ The paradox refers to the assertion that one cannot test the performance of a system unless the performance of the individuals testing it is already known. Their performance, in turn, cannot be clearly defined without knowing the performance of the tools used to test them.
4.3.1 Building a testing methodology
There are three steps to gain support of the end-user community:
1. Select the test methods that are most appropriate to the clinical task being assessed.
2. Let the clinical task dictate the method by which it is tested. This can be difficult to achieve.
3. Recognise the practical constraints on selecting optimal examination methods.
When addressing methods of test administration and scoring, one must decide on the level of efficiency needed in the particular testing environment. This includes how a student’s performance is to be recorded or captured. It is then necessary to determine a method to assign scores for the examined cases and for all elements within the cases. Teams must take appropriate steps to ensure that the test provides unbiased measures of performance. They must evaluate the need for equating scores across different examinations. Finally, they should review the procedure to ensure that it has not been trivialised.
When determining what is to be tested in these trials, it is necessary to elicit the boundary of competence. Relevant parameters are outlined below.
Identify the desired level of resolution: This is designed for higher-level surgical trainees. During the first four years of their training, they are likely to complete a shorter module as part of their CCT preparation, and this may also be used as part of the shoulder master class for HST years 5–6 in preparation for CCST. It is not currently considered as a potential tool for revalidation training (i.e. at Consultant grade), though the principles are the same.
There are three levels of problems to resolve, starting with orientation, then navigation and then associated structure recognition (leading to pathology recognition). It is necessary to identify the issues within each of these. This has been developed as part of the ergonomics analysis of the system, in conjunction with the Helmholtz Institute in Aachen. The key issues are the frequency of occurrence, the importance, and the severity of the problems. The shoulder simulation has been developed to model a procedure that is difficult for trainees to view, as it occurs infrequently, but is becoming more common. Like any arthroscopic surgery, it is important as it offers a potential cure for painful conditions and amelioration of others, but it also carries the risk of severe complications, such as vascular or neurological injury.
Finding the clinical tasks within each problem: Specific terms are found, and the level of resolution appropriate to the expected performance of the trainee is described. The definitions, the boundaries of accepted thresholds for normal and abnormal, and the specified clinical management paths (algorithms¹⁷) are identified. To save time, this needs to be pragmatic.
Preparing a blueprint to guide task selection: Comprehensive content and competencies blueprints (checklists) are developed for the assessment procedure. This generates a multi-dimensional “grid” which considers the various factors to be evaluated, relating problems to categories of competence. The blueprints indicate the specific and critical tasks embedded within the problems which need to be tested. This forms the basis upon which the sample for testing is selected; although ideally random, the sample is, of course, going to be limited by the number of those able to test the system. For the assessment procedure, a high level of content validity should be assured. Once implemented, this will rely upon ‘expert committees’ who review the material.
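The blueprint “grid” described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three problem levels come from the text, but the competence categories, the task names and the helper functions (`build_blueprint`, `untested_cells`, `critical_tasks`) are hypothetical and are not taken from the simulator itself.

```python
from itertools import product

# Problem levels named in the text.
PROBLEMS = ["orientation", "navigation", "pattern recognition"]
# Hypothetical categories of competence, assumed for illustration only.
COMPETENCES = ["knowledge", "psychomotor skill", "decision-making"]


def build_blueprint(tasks):
    """Return a grid {(problem, competence): [task names]} with one entry per cell.

    A task whose problem or competence is not in the lists above raises KeyError.
    """
    grid = {cell: [] for cell in product(PROBLEMS, COMPETENCES)}
    for task in tasks:
        grid[(task["problem"], task["competence"])].append(task["name"])
    return grid


def untested_cells(grid):
    """List the grid cells that no task covers yet (gaps in content validity)."""
    return [cell for cell, names in grid.items() if not names]


def critical_tasks(tasks):
    """Names of the tasks flagged as critical in the blueprint."""
    return [task["name"] for task in tasks if task.get("critical", False)]


# Illustrative tasks; the names are placeholders, not real assessment items.
tasks = [
    {"name": "identify landmark", "problem": "orientation",
     "competence": "knowledge", "critical": True},
    {"name": "move scope to target", "problem": "navigation",
     "competence": "psychomotor skill", "critical": True},
    {"name": "classify lesion", "problem": "pattern recognition",
     "competence": "decision-making", "critical": False},
]

grid = build_blueprint(tasks)
gaps = untested_cells(grid)
```

In this sketch, an expert committee reviewing the blueprint would look at `gaps` to see which combinations of problem and competence are not yet sampled by any task.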
4.3.2 Selecting test methods
Although a wide range of methods for assessing clinical competence exists (79), only some of these methods are appropriate for the discrete tasks that can be performed using the simulator. The three steps are outlined below:
1. Selection of the most appropriate methods for the clinical task to be assessed. Where possible, the questions posed to candidates are structured to allow didactic, diagnostic yes/no answers. With the assessment of clinical skills, validity can only be achieved using multiple observations of performance.
2. The clinical task dictates the method by which it is tested. This returns to the principle that what is easy to test is not necessarily useful and what needs to be tested is not necessarily easy to test. No single method is capable of measuring all components; therefore, more than one test method is used within the simulator assessment procedure.
3. Practical constraints upon selecting optimum test methods. Clearly, this generation of the simulator can only test certain aspects of the simulation, hence the concept of building in scenarios. As a consequence, only the validation and verification (V&V) procedure for teaching and testing these discrete skills can be used to evaluate it.
4.3.3 Addressing the issues of test administration and scoring
There are six steps to follow with respect to test administration and scoring:
1. Decide the required level of efficiency: As mentioned above, more than one test is required, and these should ideally be arranged hierarchically so that the most efficient test is administered first. It is, however, important that the tests do not just reliably discriminate between candidates but that they discriminate on the basis of clinical competence. This returns to the issue of formative versus summative approaches. No operation is always a success, since every procedure carries with it the risk of complication and failure. The approach is one of providing an average percentage risk as part of informed consent and, as Poloniecki (76;80;81) points out, half the surgeons are, by definition, below the average results because that is characteristic of a normal distribution. One surgeon, of course, will have the worst result. The aim of the assessment is to demonstrate that the skills have been acquired in all the necessary competencies, and such skills are likely to be explicit, as Klein (76;82) suggests. Ideally, this will ultimately progress to adaptive testing, where each component of the assessment is selected on the basis of performance in previous components, although the complexity of this strategy needs to be built within the overall model. An alternative strategy would be sequential evaluation, which is more akin to the gaming model in which candidates achieve a certain level of performance and are tested at the next level only if they were successful at the previous one.
2. Decide how performance is measured and captured: Issues are raised such as security, training, and consistency between sites, and thus the centralised model of client-server architecture was developed. Pilot studies used a pen and paper approach. This method, however, is not viable for the progressive studies, for which an automated process was developed. It is necessary to introduce computer-based, multi-site, multi-station testing systems for examination, and this led to the drive to develop the Virtual Orthopaedic European University (VOEU) project detailed in Chapter 6.
3. Determining the methods of scoring the cases and the tasks within the cases: The case refers to a conceptual unit for assessment. It may not necessarily be an individual clinical case. With the development of the system to incorporate different pathologies, this is likely to become a suitable level for structuring the database, since it also allows for factors such as authorisation of use of clinical material, i.e., consent. Once the case scores are obtained, then the process of combining these for decision-making becomes all-important. The underlying principle is that, by combining the procedural experience of using the simulator with the form assessment, there is added educational value in doing the test. Unsuccessful sections can be repeated until the score demonstrates the pass / fail level of competence.
4. Steps taken to avoid bias: This is likely to be introduced when a minimum threshold is being set. In the earlier models, ‘time’ bias (generating pressure to complete tasks within a certain limited time) can be introduced into the simulation. The only reliable method to improve the statistics and overcome this is by observing a wide sample of performance, recording as many users as possible with an independent marking system. This ongoing recording system allowed the number of users to be increased on an ‘ad hoc’ basis, allowing asynchronous collection of results data, minimising the time pressures upon the user ‘test’ population, and permitting statistical analysis upon larger populations (76;83;84).
5. Evaluating the need to retain scores between examinations: Within the scope of this thesis, the aforementioned examination design considers the simulation in isolation. It does, however, relate it to the AIPES system referred to in detail below. It is through this framework that it may be possible to equate different simulation systems for their validity in performing the roles required. There is, however, the risk here of comparing apples with oranges, since simulators will be designed to test different systems for different purposes. The important comparison is of the same test performed upon different systems at different sites, to ensure an accurate correlation. Alternatively, the candidate could be asked to fix upon certain points or pathologies that are adequately represented in the different systems. This ‘anchor test’ could be introduced as part of the registration process for starting a new test. This effectively represents a generic ‘educational scenario’.
6. Ensuring that the test is not trivialised: This depends upon scoring the relative importance of the various components of the test with appropriate weighting. This is only significant when the test becomes part of a formal evaluation system such as the CCST.
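Two of the ideas in the steps above, sequential evaluation (step 1) and weighting the components of a test (step 6), can be illustrated together with a short Python sketch. It is not the scoring scheme used in the simulator: the component names (`accuracy`, `time`), the weights, the 0 to 1 score range and the pass mark of 0.7 are all assumptions chosen purely for illustration.

```python
def weighted_score(component_scores, weights):
    """Weighted mean of component scores, each assumed to lie between 0 and 1."""
    total_weight = sum(weights[name] for name in component_scores)
    weighted_sum = sum(score * weights[name]
                       for name, score in component_scores.items())
    return weighted_sum / total_weight


def sequential_evaluation(levels, results, pass_mark=0.7):
    """Score the levels in order and stop at the first one that is failed.

    levels  : ordered list of (level name, {component: weight})
    results : {level name: {component: score}}
    Returns a list of (level name, rounded score, passed) tuples.
    """
    report = []
    for level, weights in levels:
        if level not in results:
            break  # level not attempted yet
        score = weighted_score(results[level], weights)
        passed = score >= pass_mark
        report.append((level, round(score, 3), passed))
        if not passed:
            break  # later levels are only tested after a pass
    return report


# Hypothetical weights: accuracy counts twice as much as time at every level.
levels = [
    ("orientation", {"accuracy": 2, "time": 1}),
    ("navigation", {"accuracy": 2, "time": 1}),
    ("pattern recognition", {"accuracy": 2, "time": 1}),
]

# Hypothetical results for one candidate.
results = {
    "orientation": {"accuracy": 0.9, "time": 0.6},
    "navigation": {"accuracy": 0.5, "time": 0.8},
    "pattern recognition": {"accuracy": 0.9, "time": 0.9},
}

report = sequential_evaluation(levels, results)
```

With these made-up numbers the candidate passes orientation, fails navigation, and is not assessed on pattern recognition until the navigation section has been repeated successfully. Giving time a lower weight than accuracy is one possible way of limiting the ‘time’ bias mentioned in step 4.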