
5.2 Evaluation of data

5.2.1 Scoring - Step 4

5.2.1.2 Confidence interval scoring


Figure 16: a) End-of-trial data for the drug and comparator are shown. Data are in this case fitted to exponential distributions. b) The difference distribution is shown. 71 % of the subjects in the drug group experienced fewer minor hypoglycaemic events.


As mentioned at the beginning of section 5.2 Evaluation of data, another aspect of the evaluation is to capture tendencies in the data. Randomised clinical trials are often designed to answer questions about the efficacy of a drug, whereas adverse events are often summarised without any statistical analysis. To protect patients from potentially harmful adverse effects of a drug, a proactive approach is needed, in which the willingness to accept a false positive signal overrides other considerations. One is faced with two options: (i) the risk of pursuing a false positive signal, which will certainly have economic costs, or (ii) the risk of ignoring a signal, which might have both economic and human costs.

For events (e.g. responder rate, rare adverse events, etc.), the scoring method is based on confidence intervals (CI)^[113]. During a trial, not all subjects will experience an event, so a value is not obtained for each subject. The method described under difference distribution scoring can therefore not be reused, because probability density function (PDF) or cumulative distribution function (CDF) curves cannot be drawn or justified.

Instead, it is assumed that such events occur independently, with the same probability for each subject. This assumption is made because randomisation is expected to ensure similar groups. The confidence bounds, calculated by the exact method of Clopper and Pearson^[115], are used for scoring. This is a well-known and established method, and I will therefore focus on how it can be used in a clinical setting to capture tendencies in data that can be quantified in larger trials.

A chi-square test and Fisher’s exact test were also evaluated, with the following conclusions. The chi-square test cannot handle small numbers or the case of 0 events and was therefore discarded. Fisher’s exact test is suitable for small numbers and can handle the case of 0 events, but it yields a p-value, which was not found suitable for comparing two options because a p-value can be difficult to interpret. Clopper-Pearson’s exact method can handle both small and large sample sizes as well as the case of 0 events.

Rare events are defined as true or false, e.g. serious adverse events (SAE) leading to withdrawal from the trial, a responder rate, etc. An event is assumed to occur only once for each subject in the trial period, or at least the probability of the event occurring twice is negligible. Furthermore, we assume that such events occur independently with the same probability for each subject. The probability is assumed to be dependent only on the treatment, and the number of events in a trial is hence assumed to be binomially distributed.
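Under this binomial assumption, the Clopper-Pearson bounds can be written out explicitly. The notation is introduced here for illustration only and is not taken from the cited references: x is the observed number of events among the n subjects of one group, 1 - \alpha is the chosen confidence level, and p_L and p_U are the lower and upper bounds for the event probability p.

```latex
\begin{align}
  X &\sim \mathrm{Binomial}(n, p) \\
  \sum_{k=x}^{n} \binom{n}{k}\, p_L^{k} (1 - p_L)^{n-k} &= \frac{\alpha}{2},
      \qquad p_L = 0 \ \text{if } x = 0 \\
  \sum_{k=0}^{x} \binom{n}{k}\, p_U^{k} (1 - p_U)^{n-k} &= \frac{\alpha}{2},
      \qquad p_U = 1 \ \text{if } x = n
\end{align}
```

For x = 0 the second equation reduces to (1 - p_U)^n = \alpha/2, i.e. p_U = 1 - (\alpha/2)^{1/n}, so the case of 0 events still yields a well-defined interval.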

The number of events and the number of subjects in each group are known. The question is whether the probability, p, of an event per subject differs between drug and comparator. By calculating a score for each possible scenario, a scoring table is created, as seen in Figure 17. A score for the drug on a given criterion can then easily be read from the table based on clinical data.

Figure 17 shows a hypothetical trial with 500 subjects in each of the drug and comparator groups (the method is not restricted to equal group sizes).

The drug is inferior to the comparator on the criteria headache and anorexia, but non-inferior on the criterion injection site reactions.

Figure 17: The scoring table for a hypothetical trial is shown. Each little square is the result of a confidence interval (CI) scoring. Red means that the drug is inferior to the comparator; yellow means the drug is non-inferior and green means that the drug is superior. There were 10 events of headache in the drug group and two events in the comparator group. The drug is therefore considered inferior to the comparator on this criterion. Furthermore, the drug is considered inferior on the criterion anorexia and non-inferior on injection site reactions.
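A minimal sketch of how such a scoring table could be computed is given below. It rests on assumptions that are not stated in the text: the decision rule (the drug is scored inferior when its whole interval lies above the comparator's interval, superior when it lies wholly below, and non-inferior when the intervals overlap) is one plausible reading, it applies to adverse events where fewer events are better, and the confidence level behind Figure 17 is unknown, so `conf` is left as a free parameter. All function names are hypothetical, and the Clopper-Pearson bounds are found by simple bisection on the binomial tail probabilities rather than by any particular library routine.

```python
import math


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p), summed in log space for stability."""
    if k < 0:
        return 0.0
    if k >= n or p <= 0.0:
        return 1.0
    if p >= 1.0:
        return 0.0
    total = 0.0
    for i in range(k + 1):
        log_term = (math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
                    + i * math.log(p) + (n - i) * math.log1p(-p))
        total += math.exp(log_term)
    return min(total, 1.0)


def _root(f):
    """Bisection for the root of a function that increases in p on (0, 1)."""
    lo, hi = 0.0, 1.0
    for _ in range(100):
        mid = (lo + hi) / 2
        if f(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def clopper_pearson(events, n, conf=0.95):
    """Exact two-sided Clopper-Pearson bounds for `events` out of `n` subjects."""
    alpha = 1.0 - conf
    # Lower bound: P(X >= events | p) = alpha / 2 (defined as 0 when events == 0).
    lower = 0.0 if events == 0 else _root(
        lambda p: (1.0 - binom_cdf(events - 1, n, p)) - alpha / 2)
    # Upper bound: P(X <= events | p) = alpha / 2 (defined as 1 when events == n).
    upper = 1.0 if events == n else _root(
        lambda p: alpha / 2 - binom_cdf(events, n, p))
    return lower, upper


def compare(drug_ci, comp_ci):
    """ASSUMED rule for adverse events: non-overlapping intervals decide the score."""
    if drug_ci[0] > comp_ci[1]:
        return "inferior"      # drug interval lies entirely above the comparator's
    if drug_ci[1] < comp_ci[0]:
        return "superior"      # drug interval lies entirely below the comparator's
    return "non-inferior"      # intervals overlap


def score(drug_events, drug_n, comp_events, comp_n, conf=0.95):
    """Score one criterion from the event counts of the two groups."""
    return compare(clopper_pearson(drug_events, drug_n, conf),
                   clopper_pearson(comp_events, comp_n, conf))


def scoring_table(drug_n, comp_n, max_events, conf=0.95):
    """table[d][c] is the score for d drug events versus c comparator events."""
    drug_cis = [clopper_pearson(k, drug_n, conf) for k in range(max_events + 1)]
    comp_cis = [clopper_pearson(k, comp_n, conf) for k in range(max_events + 1)]
    return [[compare(d, c) for c in comp_cis] for d in drug_cis]


if __name__ == "__main__":
    # Headache example from Figure 17: 10 vs. 2 events, 500 subjects per group.
    for level in (0.95, 0.90, 0.80):
        print(level, score(10, 500, 2, 500, conf=level))
```

Computing the interval once per event count, rather than once per cell, keeps the table cheap to build, and rerunning it with a different `conf` shows how the choice of confidence level moves the boundaries between the three scores.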

To be able to capture tendencies in data, a proactive approach is taken, in that the level of confidence is lowered. This will, as mentioned, undeniably increase the rate of Type I errors; any signal detected must therefore be evaluated, and if there is any indication of a relation between the action of the drug and the adverse event, it must be investigated and quantified in future studies. As more data are gathered, the level of confidence can be raised to minimise the risk of Type I errors.

Some events occur more than once during a clinical trial, but not in the same manner as frequent events. In these situations, a modified version of the above-mentioned method could be used. It can be assumed that, within a very small time frame, all patients have the same probability of experiencing an event, and that this probability is constant over time.

In situations where no comparator or placebo is available, the background incidences for the untreated disease can be used as baseline values, both in the cases where new risks are introduced by the drug and in cases where the considered types of risk already exist in the untreated state, but are promoted by the drug. Once the objective scores are determined, the next natural step is an evaluation of the evidence and uncertainty surrounding these scores, and this leads to the next step.

In document Benefit-Risk Assessment in Drug Development (Page 90-93)