Preclinical In Vivo Antitumor Activity Experiments: Methodological Pitfalls and a New Framework for their Design and Analysis

Open Research Online

The Open University’s repository of research publications and other research outputs

Thesis

How to cite:

Porcu, Luca (2020). Preclinical In Vivo Antitumor Activity Experiments: Methodological Pitfalls and a New Framework for their Design and Analysis. PhD thesis The Open University.

© 2019 The Author

Version: Version of Record

Copyright and Moral Rights for the articles on this site are retained by the individual authors and/or other copyright owners. For more information on Open Research Online’s data policy on reuse of materials please consult the policies page.

PRECLINICAL IN VIVO ANTITUMOR ACTIVITY EXPERIMENTS: METHODOLOGICAL PITFALLS AND A NEW FRAMEWORK FOR THEIR DESIGN AND ANALYSIS

Thesis submitted for the degree of Doctor of Philosophy

at the Open University, UK

Discipline of Life Sciences

by

Luca Porcu

Istituto di Ricerche Farmacologiche Mario Negri IRCCS, Milan, Italy

Abstract

Aims: Poorly designed, analyzed and reported preclinical in vivo experiments (inVivoExp) raise ethical as well as scientific concerns. It could be hypothesized that the recurring failure of apparently promising interventions to improve outcome in clinical trials has been partially caused by poor quality of statistical design and analysis (QoStat) of inVivoExp. This project aimed to assess and correlate QoStat with clinical activity, and to improve the statistical framework used in inVivoExp.

Methods: A systematic search of Medline and EMBASE databases was carried out to identify epithelial ovarian cancer clinical trials assessing the antitumor activity of candidate compounds (CC) as monotherapy. For each eligible CC, a systematic search was carried out to identify scientific papers reporting inVivoExp on rats and mice, in which the CC was administered as monotherapy. An ad hoc checklist was used to assess QoStat of inVivoExp. QoStat was correlated to the clinical activity.

Results: Fifty-two eligible CCs and 121 inVivoExp were identified. In 45 out of 120 (37.5%) inVivoExp the method of treatment assignment was not specified. The randomization type was specified in 3 out of 74 (4.1%) inVivoExp and sample size was justified in 9 (7.4%) inVivoExp. If the primary outcome was tumor volume, the antitumor activity endpoint was declared in 14 out of 106 (13.2%) inVivoExp. The length of follow-up was specified in 43 (35.5%) inVivoExp. Outcome assessor was blinded in 5 (4.1%) inVivoExp. Inefficient statistical methods were often applied to analyze tumor growth data. A new statistical framework based on the Mann-Whitney statistic was proposed and applied to a specific tumor model.

Conclusions: QoStat of inVivoExp was so poor that the correlation with clinical activity was impossible. The magnitude of the biological signal was poorly estimated. The new statistical framework should be considered for the design and analysis of in vivo tumor growth studies.

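“Ringrazio gli uomini di essere così buoni, per aver dato tante manifestazioni d’amore, non riconosciuto” (“I thank people for being so good, for having given so many demonstrations of love, unrecognized”) Pier Paolo Pasolini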

Acknowledgements

In the following lines I thank the people without whom my project would not have been realized. Their contributions differed in quality and quantity, but they all shared the same values: kindness, openness and modesty. No real education can be given or received without these values.

Especially, many thanks to Dr. Roberta Frapolli. Her contribution to define the statistical design of the project was decisive. In addition, she taught me primary characteristics of in vivo experiments and animal models used to evaluate candidate compounds in Oncology. Moreover, whenever I had doubts about published in vivo experiments, she saved me, solving these doubts clearly and satisfactorily. Hundreds of times I asked her for guidance. I don’t remember one single occasion that she was not available or didn’t resolve my doubts.

Then, I thank Professor Maurizio D’Incalci and Professor Silvio Garattini. I would have never taken this prestigious PhD course without their support and encouragement. They are important, brilliant and well-known researchers, and yet they have been always available to speak to me and discuss my problems with kindness, openness and modesty.

I am very grateful to my supervisor, Dr. Valter Torri. My life would have been totally different without meeting him. In a very difficult moment of my life due to personal health problems, he welcomed me in his laboratory. Then he gave me the possibility and encouraged me to study science. This unique opportunity given to me by Dr. Torri, not only made me a scientist, but above all, was surely decisive in winning the battle against my health problems.

I wish to thank my examiners Dr. Nicholas Galwey and Dr. Ettore Beghi, as their support has been crucial in correcting relevant mistakes.

Special thanks to Dr. Mauro Cortellini, Dr. Alice Casagrande and Dr. Daniela Albertini. They contributed in retrieving data from biomedical literature and helped me to organize all the activities. Mauro helped me to manage my time giving me deadlines to meet during the project. I remember the first time I met Daniela. She had to prepare a thesis for a bachelor’s degree in biology. She was confused when she started to read publications of in vivo experiments. But after only one month, she became the master and I was the scholar!

I am very grateful to my nephew Marvin Tchangwa. He is 17 years old and lives in England. He corrected the drafts of my PhD thesis, finding some mistakes in my grammar. I suppose that he found the job boring, and yet he was kind enough to help me without ever losing patience with me. God bless him!

Finally, I am blessed with the comfort my mother and little dog Charlie gave me during this challenging period; I could not have survived without them! Their love helped me to overcome difficulties. In the first two years of the PhD project I was worried because it seemed impossible to achieve, whilst in the last year I had to greatly increase my dedication to it and consequently my mood was often cloudy. In all these situations, their presence reassured me.

Chapter 2

Aims

Chapter 3

Methods

3.1 Survey design

3.1.1 Stage 1: identification of eligible CCs and estimation of their clinical antitumor activity

3.1.2 Stage 2: systematic review of in vivo experiments and their methodological evaluation

3.2 Quality of statistical design and analysis checklist

3.2.1 Repetition and external validity

3.2.2 Internal validity

3.2.3 Statistical design

3.2.4 Sample size

3.2.5 Outcomes and their assessment

3.2.7 Attrition bias about tumor growth curves

3.2.8 Miscellanea

3.3 Sample size

3.4 Statistical analysis

Chapter 4

Results

4.1 Selection of clinical trials and CCs

4.2 Patient characteristics and assessment of clinical antitumor activity

4.3 Selection of preclinical in vivo antitumor activity studies

4.4 Tumor models used in preclinical in vivo antitumor activity studies

4.5 Assessment of the quality of statistical design and analysis

4.6 Correlation between preclinical quality and phase 2 activity

Chapter 5

Improving statistics of in vivo tumor growth curves

5.1 Concepts

5.1.1 Testing statistical hypotheses

5.1.2 Point and interval estimates

5.2 Non-parametric Two-Sample Tests

5.2.1 Definition of statistical tests

5.2.2 Censoring and missing data

5.2.3 Weighted non-parametric Two-Sample Tests

5.2.4 Stratified non-parametric Two-Sample Tests

5.2.5 Paired data

5.3 Statistical power and sample size determination

5.3.1 Location shift model

5.3.2 Asymptotic power

5.3.3 Other approximations and exact distribution of test statistic

5.4 An example of statistical analysis of tumor growth curves

5.5 Estimating the treatment effect

5.5.1 Introduction

5.5.2 Definition and properties of the estimator Δrmed

5.5.3 Definition and properties of the estimator Δrmean

5.5.4 Summary estimators and heterogeneity between time intervals

5.5.5 Estimating relative effects

Chapter 6

References

Appendix A

A.1 Preclinical search string used in the Medline database

A.2 Preclinical search string used in the EMBASE database

Appendix B

Appendix C

List of eligible clinical trials

Appendix D

List of eligible preclinical in vivo experiments

Appendix E

SAS MACRO programs

Appendix F

Tumor volumes (mm³) measured during the in vivo experiment with ML017/ET myxoid liposarcoma PDX

Glossary of symbols and abbreviations

1 - β Power of a hypothesis test

α Type I error

β Type II error

Δr Additive treatment effect at the time interval [tr, tr+1], r=0,…,K-1

Δrmean Estimator of the parameter Δr, r=0,…,K-1

Δrmed Estimator of the parameter Δr, r=0,…,K-1

Δ Vector of additive treatment effects at time intervals [tr, tr+1], r=0,…,K-1

Φ Standard Normal Cumulative Distribution Function

aAUC Adjusted area under the curve

AACR American Association for Cancer Research

ASCO American Society of Clinical Oncology

ANOVA Analysis of variance

ARRIVE Animal Research: Reporting In Vivo Experiments guidelines

AUC Area under the curve

C Control arm

CA-125 Cancer antigen 125

CC Candidate compound

CDX Cell line derived tumor xenograft

CI Confidence interval

CR Complete Response

Cr(x) Continuous probability distribution function of the control (C) arm, for all x ∈ (-∞, +∞), at the time interval [tr, tr+1], r=0,…,K-1

DCR Disease Control Rate

DNA Deoxyribonucleic acid

e&a eligible and assessed

ECOG Eastern Cooperative Oncology Group

EM Expectation maximization algorithm

EOC Epithelial ovarian cancer

EORTC European Organization for Research and Treatment of Cancer

EU European Union

FDA Food and Drug Administration

GCIG Gynecologic Cancer Intergroup

GOG Gynecologic Oncology Group

HR Hazard Ratio

InVivoExp Preclinical in vivo experiments

IQR Interquartile range

LCK Log cell kill

MACRO SAS MACRO program

MASS Morphology, attenuation, size, and structure criteria

M-H Mantel-Haenszel test

NCI USA National Cancer Institute

OR Odds Ratio

ORR Objective Response Rate

OS Overall Survival

PD Progressive Disease

PDX Patient derived tumor xenograft

PFS Progression-free Survival

PR Partial Response

PRISMA Preferred reporting items for systematic reviews and meta-analyses

PS Performance Status

PSA Prostate-Specific Antigen

QoStat Quality of statistical design and analysis

RCB Randomized Complete Block design

RCT Randomized and controlled clinical trial

RECIST Response Evaluation Criteria In Solid Tumors

RevMan Review Manager

RoB Risk of bias tool

SAS Statistical Analysis Software

SD Stable Disease

SE Standard error

STD Standard deviation

SWOG Southwest Oncology Group

SYRCLE Systematic review center for laboratory animal experimentation

T Treatment arm

TGD Tumor growth delay

TTP Time to progression

Tr(x) Continuous probability distribution function of the treatment (T) arm, for all x ∈ (-∞, +∞), at the time interval [tr, tr+1], r=0,…,K-1

T/C T/C ratio, where T and C are the means, or medians, of the tumor volumes of the treatment (T) and control (C) arms

UK United Kingdom

USA United States of America

WHO World Health Organization

Chapter 1

Introduction

1.1 Drug development in Oncology

Drug development is the process of bringing a new pharmaceutical drug to market once a candidate compound (CC, i.e. new chemical entity) has been identified. This process is essentially a set of applied methodologies that cover a wide range of objectives: the identification of targets, the identification of drug concentrations required for target inhibition and modulation, the assessment of drug pharmacokinetics and pharmacodynamics, and the assessment of safety, activity and favorable or negative effects on clinical endpoints. It is an interdisciplinary endeavour involving many professionals, including biologists, chemists, computer scientists, medical staff, statisticians, and regulatory experts. It is time consuming and expensive: it can take 10 to 15 years, at an average estimated cost exceeding US $1 billion (Morgan et al., 2011; Rick, 2015; DiMasi et al., 2016). It is also competitive: the purpose of drug development is to select, from millions of CCs, those that most effectively and safely offer clinical benefit. Finally, this process is made up of a preclinical testing phase, in which in silico, in vitro and in vivo models are used, and a human testing phase, in which studies are conducted on human beings (i.e. clinical trials).

What are the advantages of using preclinical models? First of all, they offer simplification and controllability, so that mechanistic insight into the impact of a CC on the evolution of a disease can be obtained. Second, biological science provides explicit justification for studying diseases abstracted from the entire human organism. For example, it is well known that the essential elements of tumor growth lie within cells. Cancer cells have defects in regulatory circuits that govern normal cell proliferation and homeostasis (Hanahan et al., 2000). Hence, isolated cells or cell cultures are suitable objects for cancer research. Third and last, ethical and economic considerations require the use of preclinical models. Preliminary information about a CC’s safety and efficacy profile must be collected before the CC can reasonably be administered to a human being. Without this preliminary information obtained from preclinical models, it would be unethical to test unproven chemicals in humans (Garattini et al., 2017). Of course, these models must not be so far removed from reality that they lose relevance to the ultimate goals of better preventing the disease or improving treatment, bearing in mind that the relevance of a result may not be evident initially.

In Oncology, the classic approach taken to identify chemotherapy drugs requires that the CC is first evaluated against a panel of malignant cell lines, such as those used by the USA National Cancer Institute (NCI-60; see https://dtp.cancer.gov/discovery_development/nci-60/cell_list.htm, Developmental Therapeutics Program, National Cancer Institute, last updated 8 May 2015, retrieved 1 September 2019). If the CC shows antitumor activity in the panel of cell lines, other in vitro studies are performed to determine its mechanism of action. Newer methods develop CCs in a different manner, namely by targeting specific molecules or pathways known to have a role in tumor growth. These CCs can be identified in different ways, such as the screening of small-molecule libraries or computer-assisted protein-structure-based design. A biochemical or cellular assay is then required to evaluate the effects on the molecular target, and those CCs that should undergo further development are selected.

Biological differences between primary tumors and the cancer cell lines derived from them limit the value of in vitro studies for the evaluation of CCs (Szakács et al., 2004). Once the target and mechanism of action have been identified using in vitro models, in vivo experiments are undertaken to ensure that inhibition of the target can be achieved at tolerated doses in vivo and to identify and validate predictive biomarkers of response. Chemotherapy drugs can be considered ‘targeted’ in that they inhibit DNA synthesis and the cell division apparatus. The rationale behind preclinical in vivo experimentation is to look for activity in in vivo models that would translate into some likelihood of activity in human disease. In vivo models are also required to evaluate CCs’ pharmacokinetics, CCs’ effects on biological processes such as invasion into neighboring tissues, angiogenesis and metastasis, and the relative effects of CCs against tumor cells compared to their toxicity in normal tissues (Ocana et al., 2010).

Once preclinical in vivo experiments have been successfully performed, CCs could be administered in humans for the first time. When CCs reach the clinical setting, drug development proceeds through a series of sequential clinical phases designed to assess their safety and efficacy. The standard clinical paradigm for the evaluation of CCs consists of a phase I trial to establish the optimal dose, a phase II trial to obtain preliminary evidence of activity and a phase III trial for comparison with the standard therapy. In this approach, the phases I and II (exploratory trials) are for gathering information and screening the CC; the phase III (confirmatory trial) is for a definitive comparison with the standard therapy.

In Oncology, phase II trials are an essential bridge between the small phase I trials, which determine the dose of antitumor CCs, and the large-scale and confirmatory phase III trials. The primary aim of phase II trials is to screen CCs for their biologic antitumor activity. Secondary aims are the preliminary evaluation of CCs’ safety profile and predictive biomarkers.

CCs’ antitumor activity is assessed using standardized response criteria. In solid tumors, the first international standardized response criteria were written and disseminated by the World Health Organization (WHO) in 1979 [World Health Organization. WHO Handbook for Reporting Results of Cancer Treatment Offset Publication No. 48. (WHO press, Geneva, 1979)]. Of primary relevance, the authors defined exactly what constitutes a response to treatment or progression of disease by standardizing the amount of tumor shrinkage necessary to qualify a patient for each of four categories: complete response (CR: disappearance of all known disease by two observations at least 4 weeks apart), partial response (PR: 50% or more decrease in total tumor size of the lesions that have been measured. No new lesions. No progression of any lesion), stable disease (SD: it cannot be established that the total size has decreased by at least 50%, nor has a 25% increase in the size of one or more measurable lesions been demonstrated) and progressive disease (PD: a 25% or more increase in the size of one or more measurable lesions or the appearance of new lesions). This answered an urgent need of medical oncology. Because single-arm, uncontrolled, Phase II trials were used to assess CCs’ antitumor activity, standards to compare responses across trials were urgently needed. Widespread application of the WHO criteria, however, brought to light some deficiencies/discrepancies. The reliability of the methodology both in terms of intraobserver as well as in terms of interobserver variability was questioned (Warr et al., 1984; Tonkin et al., 1985; Warr et al., 1985; Thiesse et al., 1997). Cooperative groups and pharmaceutical companies often ‘modified’ original WHO criteria to accommodate new technologies for human cancer imaging or to address areas that were unclear in the original document. 
For example, the Southwest Oncology Group (SWOG) published their version of the WHO criteria in 1992 (Green et al., 1992). As a major change, a larger increase in tumor size (50%) was required to define PD. In the same year, the European Organization for Research and Treatment of Cancer (EORTC) published its own version of the WHO criteria (Tumor eligibility and response criteria for phase II and III studies. Brussels: EORTC Data Center Manual 1992), defining minimum sizes for lesions from different organs to be considered as measurable. Because different versions of the original WHO criteria were used in clinical trials, the comparison of results across clinical trials became very unreliable. In 1994, several clinical research organizations began updating the WHO standards and, 6 years later, published a new version under the acronym RECIST (Response Evaluation Criteria In Solid Tumors; Therasse et al., 2000). The WHO and RECIST standards share the same principle, that is, standardizing the amount of tumor shrinkage necessary to qualify a patient for each of the previous four categories. In the RECIST criteria, the assessment of tumor lesions was simplified and better specified in order to address apparent deficiencies and lack of detail in the WHO criteria. RECIST requires that certain lesions are identified as the key lesions that will track disease change; it alters the definition of PR and PD and changes the way that lesions are measured (unidimensional, versus bidimensional in the WHO criteria); it also updates which imaging modalities are acceptable for measuring tumor size. In 2009, the RECIST criteria were updated to version 1.1 (Eisenhauer et al., 2009). The major changes were the following: the number of lesions to be assessed was reduced; the assessment of pathological lymph nodes was specified; the definition of PD was better specified; confirmation of response was no longer required in randomized trials; and, finally, what constitutes ‘unequivocal progression’ of non-measurable/non-target disease was explained.
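As an illustration only, the four WHO response categories summarized above can be written as a small decision rule. The following Python sketch is not taken from the WHO handbook or from this thesis: the function name `classify_who_response`, its arguments and the per-lesion size lists are hypothetical, the requirement that a CR be confirmed by two observations at least 4 weeks apart is ignored, and the `pd_threshold` parameter is an assumption introduced to show how the SWOG modification (50% instead of 25%) changes the classification.

```python
from typing import Sequence


def classify_who_response(
    baseline: Sequence[float],
    current: Sequence[float],
    new_lesions: bool = False,
    pd_threshold: float = 0.25,
) -> str:
    """Classify tumor response using the WHO thresholds quoted in the text.

    `baseline` and `current` hold the size of each measurable lesion
    (hypothetical inputs; same lesion order in both lists).
    Simplification: confirmation of CR after 4 weeks is not modelled.
    """
    if len(baseline) != len(current) or len(baseline) == 0:
        raise ValueError("baseline and current must be non-empty and of equal length")
    if any(b <= 0 for b in baseline):
        raise ValueError("baseline lesion sizes must be positive")

    # PD: new lesions, or an increase of at least pd_threshold in any lesion.
    grew = any(c >= (1 + pd_threshold) * b for b, c in zip(baseline, current))
    if new_lesions or grew:
        return "PD"
    # CR: disappearance of all known disease.
    if all(c == 0 for c in current):
        return "CR"
    # PR: decrease of 50% or more in total tumor size.
    if sum(current) <= 0.5 * sum(baseline):
        return "PR"
    # SD: neither a 50% decrease nor progression could be established.
    return "SD"
```

Calling the function with `pd_threshold=0.50` mimics the larger increase that SWOG required to define PD, so a lesion that grew by 25% is classified as PD under the original threshold but as SD under the modified one. This is one way to see why trials using different versions of the criteria were hard to compare.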

Whereas the RECIST criteria address many shortcomings of previous attempts to classify tumor response, they have limited utility in the evaluation of ovarian cancer. In recurrent ovarian cancer, a significant proportion of patients have only micro-nodular peritoneal carcinomatosis and ascites, which are non-measurable according to RECIST criteria. Because the RECIST criteria define tumor response on the basis of evaluation of measurable disease, their use is precluded in almost 50% of ovarian cancer patients (Rustin et al., 2004). To allow the inclusion of these patients, it was proposed that the CA-125 serum tumor marker could be utilised as a tumor response criterion. CA-125 is a high molecular weight glycoprotein which is raised in approximately 90% of patients with advanced epithelial ovarian cancer (Bast et al., 1983). Many more patients are evaluable according to CA-125 than by the computed tomography scanning used to assess standard (WHO or RECIST) response criteria (van der Burg et al., 1993; Pearl et al., 1994; Rustin et al., 1996). Moreover, measurement of CA-125 is less expensive and more comfortable for patients than computed tomography scanning. Characterized in 1981, the CA-125 antigen has several important roles in the routine management of ovarian cancer patients and could be used as a prognostic marker (Rustin et al., 2004). In 1996, Rustin et al. defined criteria for evaluating 50% and 75% response according to CA-125. Based on retrospective studies, the Gynecologic Cancer Intergroup (GCIG) proposed that a definition of ovarian tumor progression based on CA-125 doubling should be used in clinical trials of first-line therapies (Vergote et al., 2000).
In addition to increasing the number of eligible patients for a given trial, it was suggested that utilisation of a composite definition of progression based on both RECIST and CA-125 criteria (instead of only one or the other) would increase the statistical power for tests of differences between trial arms regarding PFS (Rustin et al., 2006). Thus, a public workshop sponsored by the US Food and Drug Administration, American Society of Clinical Oncology, and American Association for Cancer Research (FDA-ASCO-AACR) recommended CA-125 to be used as a surrogate marker of disease progression (Bast et al., 2007). They also proposed that CA-125 should be included as a part of a composite endpoint that includes radiological and clinical evaluation.

There are two main types of endpoints based on standardized response criteria: binary and time-to-event. Binary endpoints include the Objective Response Rate (ORR, i.e. the proportion of patients whose tumor exhibits a PR or CR), and the Disease Control Rate (DCR, i.e. the proportion of patients whose tumor exhibits SD, PR or CR). Time-to-event endpoints include the progression-free survival (PFS) and the time-to-progression (TTP): the former measures time-to-tumor progression or death whichever occurs first, while the latter treats death as a censoring event. Based on antitumor mechanism, different endpoints could be used to detect CC’s antitumor activity. Broadly, ORR is considered suitable for cytotoxic drugs but less suitable for cytostatic agents (Adjei et al., 2009; Sharma et al., 2012). PFS is said to be more informative for cytostatic agents (Seymour et al., 2010).
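The endpoint definitions above can be made concrete with a short sketch. The Python code below is only an illustration under stated assumptions: the function names, the argument names and the use of study days as the time unit are hypothetical and do not come from the thesis or from any response-criteria document. It computes ORR and DCR from a list of best responses, and shows the single difference between PFS and TTP, namely how a death without prior progression is coded.

```python
from collections import Counter
from typing import Optional, Sequence, Tuple


def response_rates(best_responses: Sequence[str]) -> Tuple[float, float]:
    """Return (ORR, DCR) from best responses coded as CR, PR, SD or PD."""
    if not best_responses:
        raise ValueError("at least one patient is required")
    counts = Counter(best_responses)
    n = len(best_responses)
    orr = (counts["CR"] + counts["PR"]) / n                 # CR or PR
    dcr = (counts["CR"] + counts["PR"] + counts["SD"]) / n  # CR, PR or SD
    return orr, dcr


def time_to_event(
    progression_day: Optional[int],
    death_day: Optional[int],
    last_followup_day: int,
    endpoint: str,
) -> Tuple[int, int]:
    """Return (time, event) for PFS or TTP; event is 1 if observed, 0 if censored."""
    if endpoint not in {"PFS", "TTP"}:
        raise ValueError("endpoint must be 'PFS' or 'TTP'")
    # Progression observed first (or without death): an event for both endpoints.
    if progression_day is not None and (death_day is None or progression_day <= death_day):
        return progression_day, 1
    # Death without prior progression: an event for PFS, a censoring time for TTP.
    if death_day is not None:
        return death_day, 1 if endpoint == "PFS" else 0
    # Neither progression nor death: censored at last follow-up.
    return last_followup_day, 0
```

For a patient who dies on day 120 without documented progression, `time_to_event(None, 120, 120, "PFS")` yields an event while `time_to_event(None, 120, 120, "TTP")` yields a censored observation, which is the distinction drawn in the text.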

A recent survey (Hay et al., 2014) demonstrated that Oncology has one of the highest attrition rates in the drug development process. Oncology drugs have the lowest likelihood of success from phase I; only around 1 in 15 (6.7%, n = 1,803) of all indication development paths in phase I were approved by the FDA. In particular, the phase II success rate (i.e. the probability of a drug moving from phase II to phase III) was estimated at 28.3%. The unsatisfactory positive predictive value of phase II trials (i.e. the low probability of reaching market approval from phase II) is explained by the following reasons:

o The strength of activity signal obtained in phase II Oncology trials is often too low to cause a clinical benefit in large-scale and confirmatory phase III trials

o The methodology applied to phase II Oncology trials is generally of low quality. First, there is a lack of surrogate biomarkers that can be measured earlier than survival, and that can predict phase III outcome more reliably than conventional response criteria based on tumor size variations. Correlation with clinical endpoints does not mean surrogacy. Exactly 30 years ago, Prentice formulated the criteria to demonstrate the surrogacy of a biomarker (Prentice, 1989). These criteria require thousands of patients enrolled in different clinical trials. To date, only a handful of biomarkers are accepted as established surrogate endpoints. In prostate cancer, for example, the prostate-specific antigen (PSA) decrease has been reasonably well validated in phase III studies of cytotoxic agents, although there is debate on using this biomarker in exploratory trials (Stadler, 2002; Williams, 2018). Second, poor quality statistical designs have traditionally been used in phase II Oncology trials. The traditional single-arm phase II Oncology trial uses a historical response rate as the reference point by which improved response rate is judged. Outcomes of single-arm phase II trials reflect some combination of treatment effect, random effect, and unknown differences between treated and historical control patients. Recommendations have been produced to use randomization to protect against selection bias in phase II Oncology trials (Booth et al., 2008; Ratain et al., 2009). Dose-ranging, controlled phase II trials should also receive considerable attention in order to determine the relationship between dose and CCs’ antitumor activity (Ratain, 2005; Michaelis et al., 2006). Finally, blinding techniques could be useful to prevent different types of bias (i.e. performance, assessment, and attrition biases; their definitions are reported in Table 1.2.1.1), especially for time-to-event endpoints such as PFS and TTP. Unfortunately, it is often difficult to apply blinding techniques in Oncology. For example, masking devices, routes of administration and side effects such as myelosuppression or nausea is often impossible and unethical.

o CCs are wrongly selected in the preclinical drug development or, at least, the positive predictive value (i.e. the probability of reaching market approval from the preclinical testing phase) of methodologies applied in the preclinical drug development, is unsatisfactory.

1.2 Principles of methodology of preclinical in vivo experiments

Animal experiments remain essential to understand the fundamental mechanisms underpinning malignancies and to discover and screen methods to prevent, diagnose and treat them. Given the limited usefulness and predictive capability of in silico and in vitro models, the use of animal models must continue (Garattini et al., 2017). In Europe, animal research is tightly controlled under the European Directive 2010/63/EU; ethical validity is usually judged in relation to the “three Rs” (i.e. Replacement, Reduction and Refinement) introduced by Russell and Burch in their book, “The Principles of Humane Experimental Technique”, first published in 1959 (Russell et al., 1992).

Michael Festing, one of the most prominent statisticians involved in animal research, reminds us that “the use of animals in biomedical research generates strong emotions, but everyone will surely agree that if they are used the experiments should be properly designed and cause the minimum amount of pain and distress” (Festing, 2010). And yet, a recent survey of 271 papers from academic organisations in the UK and USA involving work on live laboratory mice, rats or non-human primates found that the design, analysis and reporting of animal experiments could be improved (Kilkenny et al., 2009). The survey’s major findings are reported in Table 1.2.1 (Festing, 2010). That survey spawned a follow-up paper introducing the ARRIVE guidelines (Kilkenny et al., 2010). ARRIVE stands for Animal Research: Reporting In Vivo Experiments. The ARRIVE guidelines were published first in PLoS Biology and then in several other journals. These guidelines consist of a checklist of 20 items describing the minimum information that all scientific publications reporting research using animals should contain.

Survey finding | Percentage of studies
Purpose of the study not clearly stated in the introduction | 5
Did not clearly indicate how many separate experiments were done | 6
Failed clearly to identify the experimental unit | 13
Failed to state the sex of the animals | 26
Reported neither age nor weight of animals | 24
Failed to record the exact number of animals used (although in several cases an approximate number could be estimated) | 36
Failed to justify the sample sizes used | 100
Reports of the numbers of animals used differed between materials and methods and results sections | 35
Random allocation of animals reported | 12
Studies reporting blinding when qualitative scoring was used | 14
Studies where the statistical methods used were not clear or not reported | 4
Studies with numerical data which failed to present a measure of variation such as a standard deviation, standard error, or confidence interval | 17
Papers judged not to have used the correct statistical methods, or where the methods used were not clear |

Table 1.2.1 Primary findings of the survey of the quality of experimental design, statistical analysis and reporting of research using animals. 271 papers from academic organisations in the UK and USA were assessed (Festing, 2010)

The poorly designed, analysed and reported preclinical in vivo experiments raise ethical and scientific concerns about proper use of animals and reproducibility, respectively. Key methodological issues of preclinical in vivo experiments are shown in Figure 1.2.1. All these issues are put into doubt by variability of experimental results, measurements and biological models. They are summarised in the following sections.

1.2.1 Internal validity

Internal validity is the core issue. A preclinical in vivo experiment with poor internal validity is poorly reproducible, and its results are therefore viewed with suspicion by the scientific community. The situation is worse still: systematic reviews and meta-analyses of all available evidence from preclinical in vivo experiments carry little weight of evidence if the single in vivo experiments have poor internal validity. Adequate internal validity of a preclinical in vivo experiment means that the differences observed between groups of animals allocated to different interventions may, apart from random error, be attributed to the treatment under investigation (Jüni et al., 2001). By definition, random error can be quantified and controlled by the calculus of probability. Neither the calculus of probability nor other statistical tools can handle systematic error (bias) without unverified assumptions.
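The contrast between random and systematic error can be illustrated with a small simulation. The Python sketch below is a hypothetical example, not part of the thesis: the function `simulate_difference`, the normal outcome model, the unit standard deviation and the bias of 0.5 are all assumptions chosen for illustration. The treatment has no true effect, yet a fixed bias is added to every treated animal, as might happen with unblinded assessment.

```python
import random
import statistics
from typing import Tuple


def simulate_difference(
    n_per_group: int,
    true_effect: float = 0.0,
    bias: float = 0.0,
    sd: float = 1.0,
    seed: int = 0,
) -> Tuple[float, float]:
    """Simulate a two-arm experiment; return (observed difference, its standard error).

    Outcomes are drawn from normal distributions (an assumption). `bias` is a
    systematic error added to every measurement in the treated arm.
    """
    rng = random.Random(seed)
    control = [rng.gauss(0.0, sd) for _ in range(n_per_group)]
    treated = [rng.gauss(true_effect + bias, sd) for _ in range(n_per_group)]
    difference = statistics.fmean(treated) - statistics.fmean(control)
    # Standard error of a difference between two means with known, equal sd.
    standard_error = (2 * sd**2 / n_per_group) ** 0.5
    return difference, standard_error


if __name__ == "__main__":
    for n in (10, 100, 10_000):
        diff, se = simulate_difference(n, true_effect=0.0, bias=0.5, seed=1)
        print(f"n per group = {n:>6}: difference = {diff:.3f}, standard error = {se:.3f}")
```

As the number of animals per group grows, the standard error shrinks, but the observed difference settles near the bias rather than near the true effect of zero. Under these assumptions, a larger sample size makes the biased estimate more precise without making it more correct.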

Four types of bias threaten internal validity. Their definitions and possible solutions are reported in Table 1.2.1.1.

Type of bias | Definition | Solution
Selection bias | Treated and control groups differ prior to treatment in ways that matter for the outcomes under study | Randomization; allocation concealment; intention-to-treat analysis
Performance bias | Systematic differences in care between the treatment groups apart from the intervention under study | Blinding
Assessment/detection bias | Systematic differences between treatment groups in the assessment of study outcomes | Blinding
Attrition bias | Systematic differences between treatment groups in the number and the way animals are lost or exit from the experiment | Blinding; intention-to-treat analysis

Table 1.2.1.1 Types of bias threatening internal validity

To prevent selection bias, treatment allocation should be based on randomization. This means that each animal should be assigned an a priori determined probability of enrollment in a specific treatment or control group. This is not enough. To prevent selection bias, the allocation sequence should also be concealed from those assigning animals to intervention groups until the moment of assignment. In short, picking animals ‘at random’ from their cages carries the risk of conscious or subconscious manipulation, and is not a true and satisfactory method of randomization. To prevent performance, detection, and attrition bias, caregivers, researchers and outcome assessors should be blinded to the intervention each animal received during the experiment. Blinding may not be possible at all stages of an experiment, for example when the treatment under investigation is a surgical procedure or when the treatment safety profile unmasks the administered treatment. However, blinding of outcome assessment is almost always possible. In a retrospective review, 290 animal studies with intervention were classified by the use of randomization and blinding (Bebarta et al., 2003). The Odds Ratio (OR) of reporting a significant difference was 3.4 (95%CI: 1.7 to 6.9) for the studies in which randomization was not used compared to those in which randomization was used. The OR of reporting a significant difference was 3.2 (95%CI: 1.4 to 7.7) for the studies in which blinding was not used compared to those in which blinding was used. Finally, the OR of reporting a significant difference was 5.2 (95%CI: 2.0 to 13.5) for the studies in which neither technique was used compared to the studies in which both techniques were used. These results suggest that failure to blind and randomize may lead to bias.
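As an illustration of how odds ratios of this kind are obtained, the following minimal sketch computes an OR and a Wald-type 95% confidence interval from a 2x2 table. The counts are hypothetical and are not the Bebarta et al. data; the choice of the Wald (log-scale) interval is also an assumption, since the interval method used in that review is not stated here.

```python
import math

def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio and Wald 95% CI (log scale) for a 2x2 table.

    a: studies without randomization reporting a significant difference
    b: studies without randomization not reporting one
    c: studies with randomization reporting a significant difference
    d: studies with randomization not reporting one
    """
    odds_ratio = (a * d) / (b * c)
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper

# Hypothetical counts, for illustration only.
or_hat, ci_low, ci_high = odds_ratio_ci(a=60, b=20, c=30, d=30)
print(f"OR = {or_hat:.2f} (95% CI: {ci_low:.2f} to {ci_high:.2f})")
```

An interval that excludes 1, as in this hypothetical table, is what supports the statement that studies without randomization were more likely to report a significant difference.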

Intention-to-treat analysis is the analysis of data from all animals in the group to which they were randomly assigned, regardless of whether they completed the intervention. This statistical procedure is useful to prevent selection and attrition bias. For instance, suppose that animals that died from treatment toxicity are removed from the final analysis. It could be argued that only animals with specific characteristics are retained in the final analysis, so that the measure of treatment effect relative to the control group is biased. Risk of bias tools for animal intervention studies, such as SYRCLE’s RoB tool, are useful to evaluate the level of internal validity of in vivo experiments (Hooijmans et al., 2014).

1.2.2 Reproducibility

The ability to reproduce experiments is at the heart of science. Goodman et al. (2016) distinguish three meanings of this term, which are reported in Table 1.2.2.1.


Type of reproducibility | Definition
Methods | Methods reproducibility refers to the provision of enough detail about study procedures and data so the same procedures could, in theory or in actuality, be exactly repeated
Results | Results reproducibility refers to obtaining the same results from the conduct of an independent study whose procedures are as closely matched to the original experiment as possible
Inferential | Inferential reproducibility refers to the drawing of qualitatively similar conclusions from either an independent replication of a study or a reanalysis of the original study. Inferential reproducibility is not identical to results reproducibility or to methods reproducibility, because scientists might draw the same conclusions from different sets of studies and data or could draw different conclusions from the same original data, sometimes even if they agree on the analytical results

Table 1.2.2.1 Types of reproducibility

Scientists in the Haematology and Oncology department at the biotechnology firm Amgen in Thousand Oaks, California, tried to confirm published findings of ‘landmark studies’ in Oncology (Begley et al., 2012). Fifty-three papers were deemed ‘landmark’ studies (i.e. something completely new, such as fresh approaches to targeting cancers or alternative clinical uses for existing therapeutics). Scientific findings were confirmed in only 6 (11%) cases. This disappointing result could be due to the following reasons: poor internal validity, lack of good reporting and transparency, and poor control of biological variability. Lack of reproducibility in other laboratories may also be caused by treatment × environment interactions. For example, animal houses may differ in the physical environment, management, or microflora in such a way as to alter the relative treatment differences. These are the factors threatening reproducibility in science. A similar finding was reported by Prinz et al. (2011). The aim of the Prinz et al. study was to compare in-house results with published results for wet-lab experiments related to drug target identification and validation. Sixty-seven in-house projects within the oncology (47 projects, 70%), women’s health (12 projects, 18%) and cardiovascular (8 projects, 12%) indications were used to reproduce published data. In only 20 to 25% of the projects were the in-house findings completely in line with published data. In almost two-thirds of the projects, there were inconsistencies between in-house data and published data that either considerably prolonged the duration of the target validation process or, in most cases, resulted in termination of the projects because the evidence that was generated for the therapeutic hypothesis was insufficient to justify further investments into these projects.

1.2.3 Control of biological and experimental variability

Russell and Burch’s chapter on reduction, written in 1959, is largely concerned with the control of inter-individual biological variation through the use of inbred strains (Russell et al., 1992). The control of variability delivers enormous advantages in in vivo experimentation, which are reported in Table 1.2.3.1.

Type of benefit | Explanation
Power | Uncontrolled biological variability leads to increased numbers of false negative results. The noise (i.e. biological variability) prevails over the biological signal
Reproducibility | Uncontrolled biological and experimental variability leads to lack of methods and results reproducibility
Reduction | By controlling biological and experimental variability, the signal (i.e. treatment effect) to noise (i.e. variability) ratio is increased and fewer animals are necessary to detect the same treatment effect

Table 1.2.3.1 Benefits derived from the control of biological and experimental variability

One of the most widely suggested methods to control biological variability is the use of blocks. Simple randomization requires substantial numbers of animals in order to balance, by chance alone, all possible confounding factors (e.g. animal strain, age, gender, weight, housing). In randomized block designs, different sources of variability are distributed in a controlled manner across blocks, and individual animals are assigned to treatments at random within each block. Blocks could also be useful to assess reproducibility. Suppose that an experiment is executed at different times or in different laboratories. Times or laboratories could be used as blocks. If there is good agreement between these blocks, then this gives some assurance that the experiment is reproducible. Other useful statistical designs to control biological variability include Latin square, crossover and repeated-measures designs (Festing et al., 1998; Festing et al., 2002).
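A randomized block allocation can be sketched as follows. This is a minimal illustration under stated assumptions: the function name, the block labels and the animal identifiers are hypothetical, and each block is assumed to contain a multiple of the number of treatments so that arms are balanced within every block.

```python
import random

def randomized_block_allocation(blocks, treatments, seed=None):
    """Randomly assign treatments within each block.

    blocks: dict mapping a block label (e.g. a laboratory or a time period)
            to a list of animal IDs; each block must hold a multiple of
            len(treatments) animals.
    Returns a dict mapping animal ID -> treatment.
    """
    rng = random.Random(seed)
    allocation = {}
    for label, animals in blocks.items():
        if len(animals) % len(treatments) != 0:
            raise ValueError(f"block {label!r} cannot be balanced")
        # Each treatment appears equally often within the block.
        arms = list(treatments) * (len(animals) // len(treatments))
        rng.shuffle(arms)
        for animal, arm in zip(animals, arms):
            allocation[animal] = arm
    return allocation

# Hypothetical example: two laboratories used as blocks.
blocks = {
    "lab_A": ["A1", "A2", "A3", "A4"],
    "lab_B": ["B1", "B2", "B3", "B4"],
}
allocation = randomized_block_allocation(blocks, ["control", "treated"], seed=2020)
```

Fixing the seed makes the sequence reproducible for auditing. To preserve allocation concealment, the sequence would be generated and held by someone not involved in assigning animals.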


1.2.4 External validity

External validity could be defined as the extent to which the results of a preclinical in vivo experiment provide a correct basis for generalisations to the human condition. Ideally, a disease model should fully reproduce the clinical condition in a system that can be used for research and drug discovery. However, all preclinical models are imperfect and simplified replications of the clinical condition. The following reasons could explain the failed translation of in vivo experiments to the clinic:

o Differences between in vivo models and humans, testing the same treatment (e.g. pathophysiology of disease, comorbidities, age)

o Differences between the treatment administered in an in vivo experiment and that administered in humans (e.g. timing of the administration, dosing of the study treatment, use of co-medications)

o Differences in the outcome measures (e.g. in in vivo antitumor activity studies, tumor growth curves are usually used to detect the treatment effect of CCs, whereas in clinical trials time-to-progression could be used)

o Shortcomings of the clinical trial. For instance, clinical trials may have had insufficient statistical power to detect a true benefit of the treatment under study, or the same treatment may have been administered at later time points, when the window of opportunity had passed (Gladstone et al., 2002; Grotta, 2002).

Whereas the issues regarding internal validity are almost the same in all in vivo experiments, regardless of the disease under study, the external validity of an in vivo experiment is largely determined by disease-specific factors.

1.2.5 R as reduction

The number of animals used should be reduced to the minimum consistent with achieving the objectives of the preclinical in vivo experiment. Reduction, of course, lies squarely in the field of statistics. Table 1.2.5.1 reports the statistical techniques available to reduce sample size in in vivo experiments.

Statistical technique | Explanation
Increasing the signal/noise ratio | The number of animals is smaller if a larger treatment effect is targeted and/or the biological and experimental variability is reduced
Multi-arm designs | Multiple treatments could be evaluated in the same experiment. The control arm is the same for all active arms. Interactions between treatment factors could be fairly evaluated using a factorial design
Choosing appropriate endpoints | For instance, continuous endpoints are more powerful than categorical endpoints; repeated measures instead of single measures increase the power of common statistical tests
Using indirect evidence | Historical data could be combined with the in vivo experiment’s data using, for example, Bayesian techniques (Gelman et al., 2004). Moreover, historical data should be used to guide statistical design
Increasing statistical errors | The number of animals could be reduced by accepting more false positive (i.e. type I) and false negative (i.e. type II) errors (refer to Section 5.1.1 for their explanation). For instance, a type I error of 0.05 could be replaced by a type I error of 0.10, and a type II error of 0.20 by a type II error of 0.22. Moreover, in in vivo experiments screening CCs, two-tailed tests could be replaced by one-tailed tests
Adaptive designs | If data are analyzed at interim, decision rules such as stopping rules or sample size re-estimation could be applied. Hence, the number of animals used in in vivo experiments is better justified and, when the experiment is stopped early, reduced

Table 1.2.5.1 Statistical techniques to reduce sample size in in vivo experiments
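The effect of some of these choices on sample size can be sketched with the usual normal approximation for comparing two means, n = 2(z_alpha + z_beta)^2 / d^2 per group, where d is the standardized effect size. This formula and the function below are illustrative assumptions, not the framework developed in this thesis; the approximation is rough for the very small groups typical of in vivo experiments.

```python
import math
from statistics import NormalDist

def animals_per_group(effect_size, alpha=0.05, power=0.80, two_sided=True):
    """Approximate animals per group for a two-arm comparison of means.

    effect_size: difference in means divided by the common standard deviation.
    Uses the normal approximation n = 2 * (z_alpha + z_beta)^2 / d^2.
    """
    z = NormalDist().inv_cdf
    z_alpha = z(1 - alpha / 2) if two_sided else z(1 - alpha)
    z_beta = z(power)
    return math.ceil(2 * (z_alpha + z_beta) ** 2 / effect_size ** 2)

# Hypothetical standardized effect size of 1.0 under three sets of error rates.
baseline = animals_per_group(1.0)                          # alpha 0.05, beta 0.20, two-tailed
relaxed = animals_per_group(1.0, alpha=0.10, power=0.78)   # alpha 0.10, beta 0.22, two-tailed
one_tailed = animals_per_group(1.0, two_sided=False)       # alpha 0.05, beta 0.20, one-tailed
print(baseline, relaxed, one_tailed)
```

Halving the variability doubles the standardized effect size and, in this approximation, divides the required number of animals by four, which is why increasing the signal/noise ratio is listed first in the table.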

Apart from multi-arm designs, these statistical techniques are rarely applied in in vivo experiments. For example, randomized block designs are scarcely used (Festing, 2014). Adaptive designs are almost never applied in preclinical in vivo experimentation.

1.2.6 Publication bias

Systematic review and meta-analysis are techniques developed for the analysis of data from clinical trials. They may also be helpful in preclinical research. For instance, a systematic review and meta-analysis of all available evidence from preclinical studies should be performed before clinical trials are started.

If studies are published selectively on the basis of their results, even a meta-analysis based on a rigorous systematic review will be misleading. In a meta-analysis of 525 publications included in systematic reviews of 16 interventions tested in animal studies of acute ischaemic stroke, it was estimated that publication bias might account for around one-third of the efficacy reported in systematic reviews of animal stroke studies, and that a further 214 experiments, in addition to the 1,359 identified through rigorous systematic review, have been conducted but not reported (non-publication rate: 214 of 1,573, i.e. 14%) (Sena et al., 2010).

Nonpublication of the results of animal studies is unethical because the animals used are wasted: they do not contribute to accumulating knowledge. As a consequence, wrong paths could be taken in preclinical and clinical research:

o overstated biological effects may lead to further unnecessary in vivo experiments testing poorly founded hypotheses

o publication bias deprives researchers of the accurate data they need to estimate the potential of novel therapies in clinical trials.

The recognition of substantial publication bias in the clinical literature has led to the introduction of clinical trial registration systems to ensure that those summarising research findings are at least aware of all relevant clinical trials that have been performed (De Angelis et al., 2004). A central register of preclinical in vivo experiments performed should be kept along with their respective reference publications (van der Worp et al., 2010).

1.3 Statistical analysis of in vivo tumor growth curves

In preclinical in vivo experiments, antitumor activity is usually evaluated by estimating the tumor volume at different times after drug administration. In a typical experiment, rodents, usually mice or rats, are inoculated subcutaneously with tumor cells that are either isogenic, if the rodent is immunocompetent, or xenogenic (i.e. human tumor cells are inoculated), if the rodent is immunodeficient. Alternatively, tumor cells can be injected orthotopically, into the organ from which they originate. Tumors can also be induced by administration of carcinogens or by genetic manipulation (Zitvogel et al., 2016). Rodents that develop tumors reaching a predetermined volume are randomized into different treatment and control groups and drugs are administered. Rodents injected with tumor cells but showing no sign of tumor burden are usually sacrificed after inoculation. The volume of each tumor is measured at the start of treatment and periodically throughout the experiment. Rodents are sacrificed when their tumor volume reaches a maximum target volume, when a humane endpoint (i.e. the earliest predetermined physiological or behavioral sign used to avoid or stop distress, discomfort, or potential pain and suffering) is reached, or at the end of follow-up (administrative censoring). The resulting dataset consists of incomplete, repeated measures of tumor volume at common time points, from the start of treatment until the time at which the last rodent is sacrificed. An example of a tumor growth curve is reported in Figure 1.3.1.

Figure 1.3.1 Antitumor effects of AZD2171 (0.75, 1.5, 3 or 6 mg per kg per day) or vehicle on growth of MDA-MB-231 human breast tumor xenografts. Xenografts were established s.c. in athymic mice and allowed to reach a volume of 0.2 ± 0.01 cm³ (mean ± standard error) before treatment. Once-daily oral administration of AZD2171 or vehicle then commenced and was continued for the duration of the experiment. Points, mean from 10 to 11 mice; bars, standard error in one direction (Wedge et al., 2005)

Limitations of this method include lack of information about the effects of the CCs on metastases, or the process of metastatic spread. Also, in order to evaluate the mechanism(s) of action of a drug, rodents must be killed to allow molecular analysis of the resected tumor. In addition, although tumor growth curves with and without treatment reflect tumor response or delay in progression, these end points may not reflect selective effects against those tumor cells with high reproductive potential (e.g. putative stem cells) that are important in determining the long-term benefits of treatment (Ocana et al., 2010).

Tumor volumes are measured with a caliper on predetermined days, usually on a weekly basis. Imaging techniques, such as bioluminescence imaging, may be used to record changes in the volume of tumors that are not restricted to superficial sites and/or to provide information about drug-influenced biological processes (e.g. metastatic spread, expression of proteins). Details about imaging techniques and their use are reported in Ocana et al. (2010).

To analyse data series of tumor volumes at different time points, the common statistical practice is first to demonstrate that the treatment influences them, then to estimate the treatment effect. Statistical tests are used for the former problem (Lehmann et al., 2005), and unbiased estimators for the latter (Lehmann et al., 1998). Details about hypothesis testing and statistical estimation are reported in Section 4.3.1. The statistical approaches currently used to analyse data series of tumor volumes at different time points can be classified into the following categories:

o Data analysis at a selected time point

o Use of summary statistics to estimate treatment effect

o Substitution of data series with the time required to reach a target volume

o Use of multivariate methods

An overview of these statistical approaches is shown below.

1.3.1 Comparison of tumor growth curves at a selected time point

Control and treatment arms are compared at a selected time point, usually the time point at the end of follow-up. The statistical test at the selected time point could be parametric, namely the t test for two arms or ANOVA for more than two arms, or non-parametric, namely the Mann-Whitney test for two arms or the Kruskal-Wallis test for more than two arms. The T/C ratio, calculated at the selected time point, is a common measure of treatment effect (Corbett et al., 2003; Houghton et al., 2007). T and C are the means, or medians, of the tumor volumes of the treatment (T) and control (C) arms, respectively, at the selected time point. From a statistical point of view, this approach is inadequate, as explained in the following three points:

1) comparing control and treatment arms at a selected time point is a weaker comparison than that of the tumor growth curves over all times. It neither captures all the data nor addresses the different biological mechanisms underlying tumor growth. The suboptimal use of data series is well represented by the following examples:

1a. suppose that at the end of follow-up control and treatment arms have the same tumor volume distribution but previously, tumor growth was constant in the control arm while tumor volume was greatly reduced and then quickly increased in the treatment arm, as reported in Figure 1.3.3.1. Treatment effect is not formally recognized by this statistical approach

1b. suppose that, at the start of treatment, there is a tumor volume reduction in the treatment arm but, after a few days, the tumor growth curves of the control and treatment arms remain parallel to each other for the rest of follow-up. At the end of follow-up, a treatment effect could be formally recognized by this statistical approach although its biological relevance is poor

2) the choice of the time point could be data driven (e.g. most rodents in the control arm have just been sacrificed) or a priori (e.g. based on the planned treatment administration). In the first case, the comparison is constrained by specific events, such as animal sacrifice in the control arm, that weaken its interpretation and render it ambiguous. In the second case, the a priori choice of the time point creates difficulty because unreliable assumptions are needed (e.g. exponential growth with a determined growth rate in the control arm) to design the experiment and determine its sample size

3) attrition bias due to censoring animals (e.g. rodents previously sacrificed) could affect the formal comparison at the selected time point.
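The single-time-point approach criticized above can be sketched as follows: a T/C ratio of arm means and an exact two-sided Mann-Whitney test on the volumes recorded at the selected time point. The data are hypothetical, and the exact p-value is obtained by full enumeration of group relabellings, a choice made here for transparency with small groups rather than a description of any cited method.

```python
from itertools import combinations
from statistics import mean

def tc_ratio(treated, control):
    """T/C ratio: mean tumor volume of the treated arm over that of the control arm."""
    return mean(treated) / mean(control)

def mann_whitney_u(x, y):
    """U statistic for x: number of pairs with x_i > y_j, ties counted as 0.5."""
    return sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in x for b in y)

def exact_two_sided_p(x, y):
    """Exact permutation p-value for the Mann-Whitney U test.

    Enumerates every way of relabelling the pooled volumes, so it is only
    feasible for small group sizes.
    """
    pooled = list(x) + list(y)
    n_x = len(x)
    centre = n_x * len(y) / 2  # expected U under the null hypothesis
    observed = abs(mann_whitney_u(x, y) - centre)
    extreme = total = 0
    for idx in combinations(range(len(pooled)), n_x):
        chosen = set(idx)
        gx = [pooled[i] for i in idx]
        gy = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        total += 1
        if abs(mann_whitney_u(gx, gy) - centre) >= observed - 1e-12:
            extreme += 1
    return extreme / total

# Hypothetical tumor volumes (cm3) at the selected time point.
treated = [0.31, 0.42, 0.38, 0.55, 0.47]
control = [0.92, 1.10, 0.85, 1.30, 0.98]
ratio = tc_ratio(treated, control)
p_value = exact_two_sided_p(treated, control)
```

Only the volumes at one time point enter this calculation, and animals sacrificed earlier contribute nothing, which is precisely the loss of information and the attrition problem described in points 1) to 3).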

An even worse method is to repeat this approach at different time points, indicating all the times at which differences were significant. This procedure may be naively considered better because it uses the whole data series. On the contrary, it is a poor procedure: it inflates the type I error because of the multiple comparisons problem, and post-hoc corrections are difficult to apply because repeated measures are correlated and comparisons are usually underpowered.
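The size of this inflation can be illustrated with the textbook formula for the probability of at least one false positive among k tests, 1 - (1 - alpha)^k. The formula assumes independent tests, which is an assumption made only for illustration: repeated tumor volume measurements are correlated, so the true rate typically lies between alpha and this value.

```python
def familywise_error_rate(alpha, n_tests):
    """Probability of at least one false positive among n independent tests."""
    return 1 - (1 - alpha) ** n_tests

# Hypothetical experiment with tumor volumes tested at each of 10 time points.
for k in (1, 5, 10):
    print(k, round(familywise_error_rate(0.05, k), 3))
```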

1.3.2 Summary statistics

Per-experiment and per-animal summary statistics are commonly used to estimate treatment effect. Examples of per-experiment summary statistics are the minimal T/C ratio, which reflects the maximal tumor growth inhibition achieved (Hendriks et al., 1992), and the adjusted AUC ratio (aAUC ratio; Wu et al., 2010). The minimal T/C ratio is the minimum of the T/C ratios calculated at all time points. The aAUC ratio is defined as the ratio of the means of the aAUCs of the treatment and control groups, where aAUC is the per-animal area-under-the-curve (AUC) calculated up to the last time point available for the rodent, divided by the length of the interval between the start of treatment and the last time point with existing tumor volume measurements. Another example of a per-animal summary statistic is the Tnadir. It is defined as the minimum of the growth curve of a treated tumor relative to the tumor volume at the start of treatment (Ubezio, 2019).
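The summary statistics defined above can be sketched as follows. The trapezoidal rule for the per-animal AUC is an assumption made for illustration; the cited papers may define the area differently. All function names and data are hypothetical.

```python
from statistics import mean

def minimal_tc_ratio(treated_means, control_means):
    """Minimum over the time points of the T/C ratio of the arm means."""
    return min(t / c for t, c in zip(treated_means, control_means))

def adjusted_auc(days, volumes):
    """Per-animal aAUC: trapezoidal AUC up to the animal's last measurement,
    divided by the time elapsed since the start of treatment."""
    auc = sum((d1 - d0) * (v0 + v1) / 2
              for d0, d1, v0, v1 in zip(days, days[1:], volumes, volumes[1:]))
    return auc / (days[-1] - days[0])

def aauc_ratio(treated_animals, control_animals):
    """Ratio of the mean aAUC of treated animals to that of control animals.

    Each animal is a (days, volumes) pair; animals may have different
    follow-up lengths, e.g. because of early sacrifice.
    """
    t = mean(adjusted_auc(d, v) for d, v in treated_animals)
    c = mean(adjusted_auc(d, v) for d, v in control_animals)
    return t / c

def t_nadir(volumes):
    """Minimum of the growth curve relative to the volume at start of treatment."""
    return min(volumes) / volumes[0]

# Hypothetical animals: (measurement days, tumor volumes in cm3).
treated_animals = [([0, 7, 14], [0.2, 0.1, 0.3]), ([0, 7, 14], [0.2, 0.2, 0.4])]
control_animals = [([0, 7, 14], [0.2, 0.4, 0.8]), ([0, 7], [0.2, 0.6])]
ratio = aauc_ratio(treated_animals, control_animals)
```

Dividing each AUC by the animal's own follow-up length is what allows animals sacrificed early, such as the second control animal here, to contribute to the comparison.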

Per-experiment and per-animal summary statistics are usually easy to calculate and informative. However, their sampling distribution could be highly skewed and average values such as the aAUC ratio could suffer from suboptimal power with respect to multivariate methods.

1.3.3 Time-to-event endpoints

Control and treatment arms are compared in terms of the time in days for the tumors to reach a predefined target volume (tumor growth delay). For instance, the endpoint could be the doubling time of tumor volume, defined as the earliest day on which the tumor volume is at least twice as large as on the first day of treatment. The non-parametric log-rank test and the semi-parametric Cox regression model are available to detect and estimate treatment effect, respectively, in the presence of right-censored data. However, there are two major disadvantages when using this approach. First, the choice of the target volume at which to assess the delay is critical for this comparison (Begg, 1980). Second, it neither captures all of the data nor addresses the different biological mechanisms underlying tumor growth, as reported in Figure 1.3.3.1.

Figure 1.3.3.1 Repeated measures of tumor volume on two rodents, at the start of treatment until the doubling time. Tumor growth delay is the same but biological mechanisms are different
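The derivation of the doubling-time endpoint from a series of tumor volumes can be sketched as follows. The function name and the data are hypothetical; an animal whose tumor never doubles during follow-up is treated as right-censored at its last measurement.

```python
def doubling_time(days, volumes):
    """Return (time, event) for one animal.

    time:  days from the start of treatment to the earliest measurement at
           which the volume is at least twice the baseline volume, or to the
           last measurement if doubling was never observed.
    event: True if doubling was observed, False if right-censored.
    """
    target = 2 * volumes[0]
    for day, volume in zip(days[1:], volumes[1:]):
        if volume >= target:
            return day - days[0], True
    return days[-1] - days[0], False

# Two hypothetical animals with the same doubling time but different curves:
# steady growth versus regression followed by rapid regrowth.
steady = doubling_time([0, 7, 14, 21], [0.20, 0.26, 0.33, 0.41])
regrowth = doubling_time([0, 7, 14, 21], [0.20, 0.08, 0.15, 0.45])
censored = doubling_time([0, 7, 14], [0.20, 0.22, 0.25])
```

The first two animals yield identical (time, event) pairs although their growth patterns differ, which is the loss of information that the figure describes.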


1.3.4 Multivariate methods

The methods are called “multivariate” because they treat the series of tumor volumes on an animal as a single multivariate observation (Heitjan et al., 1993). They use the entire data series and permit detailed modelling of tumor growth curves and intra-animal correlation patterns, substantially improving the efficiency of testing and reducing sample size requirements. Furthermore, they provide more descriptive features that address mechanisms underlying tumor growth inhibition and maximize the biological information obtained from in vivo studies. Repeated-measures ANOVA, or Friedman repeated-measures ANOVA on ranks, can compare tumor growth curves after accounting for the correlation of measurements on the same tumor. Other multivariate models are reported in Heitjan et al. (1993). On the other hand, these multivariate methods could be criticized for the following reasons:

o when normality or homoscedasticity (i.e. same variance in all groups and at all times) is assumed, the assumption is often unreliable and, due to small samples, usually unverifiable

o when a correlation structure between repeated measures is assumed, it is often unreliable and, due to small samples, usually unverifiable

o when there are missing values, either data series are excluded, or imputation is used, or a correlation structure has to be specified. Due to informative missingness and small samples, imputation techniques could introduce biases into the analyzed data.

Sophisticated regression models have been proposed to fit tumor growth curves: a biexponential model (Demidenko, 2004; Liang et al., 2004), a linear exponential model (Demidenko, 2006), a non-parametric model (Liang, 2005) and a Bayesian model (Zhao et al., 2011). However, regression models to fit tumor growth curves have limits: due to the small sample size of preclinical in vivo experiments, assumptions can be verified only with great difficulty and, if an excessive number of parameters is used, overfitting occurs.
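To make the idea of a growth-curve regression concrete, the following sketch fits the simplest possible model, a single exponential, by least squares on the log scale. It is deliberately far simpler than the cited biexponential, linear exponential, non-parametric and Bayesian models and is not an implementation of any of them; the data are hypothetical.

```python
import math

def fit_exponential_growth(days, volumes):
    """Least-squares fit of log(volume) = log(v0) + rate * day.

    Returns (v0, rate): the fitted volume at day 0 and the growth rate per day.
    """
    logs = [math.log(v) for v in volumes]
    n = len(days)
    mean_d = sum(days) / n
    mean_l = sum(logs) / n
    sxx = sum((d - mean_d) ** 2 for d in days)
    sxy = sum((d - mean_d) * (l - mean_l) for d, l in zip(days, logs))
    rate = sxy / sxx
    v0 = math.exp(mean_l - rate * mean_d)
    return v0, rate

# Hypothetical control tumor growing exactly exponentially.
days = [0, 3, 6, 9, 12]
volumes = [0.2 * math.exp(0.1 * d) for d in days]
v0, rate = fit_exponential_growth(days, volumes)
fitted_doubling_time = math.log(2) / rate
```

Even this two-parameter model is estimated from only a handful of points per animal, which illustrates why richer models are hard to verify and prone to overfitting in small experiments.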

In addition to the statistical approaches reported in Sections 1.3.1-1.3.4, many statistical tests, unfortunately not combined with appropriate estimators (i.e. only p-values are obtained), have been proposed. Tan and colleagues (Tan et al., 2002) proposed a small-sample t-test via the EM (expectation maximization) algorithm. They assumed a multivariate normal distribution for the repeated log tumor volumes with a Toeplitz covariance matrix. Due to the strong model assumption, their method has limited application to preclinical in vivo experiments. Vardi et al. (2001) proposed a nonparametric two-sample U-test. The proposed methodology is a fully nonparametric approach.


Finally, Liang (2007) proposed a non-parametric approach to compare antitumor effects in two treatment groups. The approach yields a p-value only.

In conclusion, the statistical methods currently used to analyse preclinical in vivo tumor growth curves have several shortcomings: incomplete use of the data series, unreliable assumptions, poorly addressed biological mechanisms underlying different patterns of tumor growth, lack of statistical power and inferential estimators with inadequate statistical properties. This project aims to improve the statistical methodology applied to preclinical in vivo tumor growth curves and to overcome these shortcomings.


Chapter 2

Aims

There is no doubt that poorly designed, executed, analyzed and reported in vivo experiments raise ethical as well as scientific concerns. On the one hand, the weight of scientific evidence is reduced and no statistical method can completely repair this damage. On the other hand, research reproducibility, the fundamental assumption of science, is compromised. As a consequence of poor methodology applied to in vivo experiments, ‘wrong roads’ could be taken in preclinical research (Figure 2.1). Many laboratories spend time and money and use in vivo models in vain, trying to extend unreliable findings or apply them to different problems. The mean number of citations of the forty-seven landmark studies that were not reproduced by the scientists in the Haematology and Oncology department at the biotechnology firm Amgen in Thousand Oaks was about two hundred (range: 3 to 1,909 citations; Begley et al., 2012).

Figure 2.1 Consequence of poor methodology applied to in vivo experiments

It could be worse still. Methodological flaws in in vivo experiments could undermine the drug development process. It could be hypothesized that the recurring failure of apparently promising interventions to improve outcome in clinical trials has been partially caused by these flaws. For instance, several of these flaws could have introduced bias leading to false positive (i.e. type I) errors, and these false positive errors could have led to the wrong CCs being selected for clinical evaluation.

To the best of the author’s knowledge, the impact of methodological flaws in in vivo experiments on the drug development process has never been quantitatively investigated. In Oncology, the assessment of antitumor activity in in vivo experiments and clinical trials could be a useful way to detect and estimate this impact. CCs demonstrating better antitumor activity than no treatment or standard therapies (i.e. active controls) in preclinical cancer models (i.e. in silico, in vitro and in vivo models) are advanced to confirmatory testing in early (i.e. Phase I and II) clinical trials. Antitumor activity detected in preclinical cancer models is a fundamental prerequisite for advancing a CC from preclinical testing in the laboratory to clinical testing and for prioritizing the progress of CCs to clinical cancer trials. This prerequisite is based on the assumption that the activity of CCs in preclinical cancer models translates into at least some efficacy in human patients. In drug discovery and development, in vivo models have the greatest complexity and, above all, the greatest similarity to human patients among preclinical cancer models. At the same time, the assessment of antitumor activity is the first test bench of the clinical development of a CC after its dose has been defined. Therefore, detecting and estimating the correlation between the methodological quality of in vivo experiments whose primary objective is to assess the antitumor activity of CCs and the level of antitumor activity of the same CCs in phase II clinical trials could be a direct way to estimate the impact of methodological flaws in preclinical in vivo experiments on the process of drug discovery and development.

It is necessary to retrieve data about statistical design and analysis of preclinical in vivo tumor efficacy studies from scientific literature in order to address the previous issue. Hence, it is also possible to assess the methodological quality of in vivo tumor efficacy studies using the same data. To the best of the author’s knowledge, the methodological quality of preclinical in vivo tumor efficacy studies has never been qualitatively and quantitatively investigated.

Finally, the statistical design and analysis of experiments to study in vivo tumor growth curves is a primary issue. A new methodological framework, based on the Wilcoxon-Mann-Whitney test, will be introduced for the statistical design and analysis of these experiments.

More specifically, the project addresses the following interrelated aims:

1. to correlate the quality of statistical design and analysis of preclinical in vivo tumor efficacy studies with the level of antitumor activity estimated in phase II clinical trials

2. to evaluate the quality of statistical design and analysis of preclinical in vivo tumor efficacy studies

3. to improve the statistical design and analysis of experiments to study tumor growth curves.

Regarding the first and second aims, research will focus on epithelial ovarian cancer (EOC). EOC was chosen as tumor type for the following reasons:

1. EOC treatment has not been substantially changed in the last thirty years. A platinum-based chemotherapy is the mandatory first line treatment. Stability and simplicity of administered treatment has a favourable influence on the control of variability

2. this project is based at the Oncology Department, Istituto di Ricerche Farmacologiche Mario Negri IRCCS, Milan (Italy). A large number of EOC research studies have been performed in this department in the last thirty years. Specifically, good expertise and skills have been developed in animal models and translational research on this type of tumor.

A survey had to be designed in order to achieve the first two aims of the project. This task was difficult and time-consuming, and the survey design was profoundly amended. Two earlier survey designs were rejected because they could not be applied effectively, owing to methodological limits of the designs used for phase II trials in Oncology (e.g. phase II clinical trials in Oncology are generally single-arm trials) and to publication bias (i.e. in vivo experiments are not clearly identifiable in public assessment reports published by the European Medicines Agency and the Food and Drug Administration). The failed survey designs are described and discussed in Chapter 6.

Regarding the third aim, the project proposes a new methodological framework to study tumor growth curves. It is a general framework because it covers both statistical hypothesis testing and the theory of estimation.
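To make the two sides of such a framework concrete, the sketch below computes the standard Wilcoxon-Mann-Whitney U statistic for two groups of tumor volumes, together with the associated estimate of the probability that a treated animal has a smaller tumor than a control animal. It is only an illustration of the textbook test, not the framework developed in this thesis. The tumor volumes are hypothetical, the function names are introduced here for illustration, and the p-value uses the large-sample normal approximation without tie or continuity correction, which is rough for groups this small.

```python
from itertools import product
from math import erfc, sqrt


def mann_whitney_u(control, treated):
    """Return (U, p_index).

    U counts the (control, treated) pairs in which the treated value is
    smaller than the control value; ties contribute 0.5.
    p_index = U / (m * n) estimates P(treated < control) + 0.5 * P(tie).
    """
    u = 0.0
    for c, t in product(control, treated):
        if t < c:
            u += 1.0
        elif t == c:
            u += 0.5
    return u, u / (len(control) * len(treated))


def normal_approx_p_value(u, m, n):
    """Two-sided p-value from the normal approximation to U under the null
    hypothesis (no tie correction, no continuity correction)."""
    mean = m * n / 2
    sd = sqrt(m * n * (m + n + 1) / 12)
    z = (u - mean) / sd
    return erfc(abs(z) / sqrt(2))


# Hypothetical tumor volumes (mm^3) at a single time point; not real data.
control = [820, 910, 760, 1005, 880, 940]
treated = [430, 610, 390, 700, 520, 830]

u, p_index = mann_whitney_u(control, treated)
p_value = normal_approx_p_value(u, len(control), len(treated))
print(f"U = {u}, estimated P(treated < control) = {p_index:.3f}, p = {p_value:.4f}")
```

The U statistic serves the hypothesis-testing side, while `p_index` is an effect-size estimate on a probability scale; for small groups an exact permutation distribution would normally be preferred to the normal approximation.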

Chapter 3

Methods

3.1 Survey design

To achieve the first two aims of the project, a systematic survey of previous clinical and preclinical research has been performed using a sequential two-stage design. In the first stage, eligible CCs have been identified and estimates of their antitumor activity have been retrieved from the clinical research literature. In the second stage, in vivo experiments testing the antitumor activity of the identified CCs have been retrieved from the biomedical research literature. The quality of experimental design and statistical analysis of each preclinical in vivo experiment has been evaluated using an ad hoc checklist. Finally, the quality of experimental design and statistical analysis of preclinical in vivo experiments has been correlated with the estimates of clinical antitumor activity. If methodological flaws impact the Oncology drug discovery and development process, a positive correlation between the methodological quality of in vivo experiments and estimates of clinical antitumor activity could be expected.
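As an illustration of the final correlation step, the sketch below computes a rank correlation between a per-compound quality score and a clinical activity estimate. The choice of Spearman's coefficient is an assumption made here for illustration, since this section does not state which correlation measure is used. The scores and response rates are hypothetical, as are the function and variable names.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks, assigning tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical values, one entry per candidate compound; not real data.
quality_score = [3, 5, 2, 8, 6]                  # checklist score of in vivo experiments
response_rate = [0.05, 0.12, 0.08, 0.30, 0.18]   # clinical activity estimate

rho = spearman(quality_score, response_rate)
print(f"Spearman rho = {rho:.2f}")
```

A rank-based coefficient is used in this sketch because it does not assume a linear relationship between checklist scores and clinical activity estimates.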

Details about this two-stage design follow.

3.1.1 Stage 1: identification of eligible CCs and estimation of their clinical antitumor activity

A systematic search of the Medline and EMBASE databases has been carried out to identify clinical trials whose primary objective was to assess the antitumor activity of CCs. This systematic search was limited to clinical trials in EOC. The reasons for this choice are reported at the end of Chapter 2. Selection of EOC clinical trials and eligible CCs was based on the following criteria:

α. Eligibility criteria for clinical trials

α1. Inclusion criteria

• Histologically or cytologically confirmed diagnosis of epithelial ovarian cancer, fallopian tube cancer, or primary peritoneal cancer

• Assessment of CCs’ antitumor activity was the primary or co-primary objective. At least one antitumor activity endpoint was a primary or co-primary endpoint

• Women aged 18 years or older

• ECOG/WHO performance status (PS) 0-2 or GOG PS 0-2 (Oken et al., 1982; Rubin et al., 2004)

• Patients must have failed at least one prior line of platinum-based chemotherapy

• Study protocol approved by the independent ethics committees or institutional review boards of the participating institutions

• The final study report was published on 1st January 2010 or later

• The final study report was written in English

Note: if the inclusion criteria of an EOC clinical trial were broader but all patients evaluated for antitumor activity satisfied all of the α1 criteria above, the clinical trial was considered eligible. For instance, if the eligibility criteria allowed the enrollment of children but all patients actually enrolled were adults, the EOC clinical trial was considered eligible.
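The inclusion rule and the note above can be read as a predicate applied to the patients actually evaluated rather than to the protocol's stated criteria. The sketch below is a minimal illustration of that reading. The data structures and field names are hypothetical, and in practice eligibility was judged from published reports, not from patient-level records.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List


@dataclass
class Patient:
    female: bool
    age: int
    performance_status: int      # ECOG/WHO or GOG scale
    prior_platinum_lines: int    # failed prior platinum-based lines
    confirmed_diagnosis: bool    # EOC, fallopian tube or primary peritoneal cancer


@dataclass
class Trial:
    activity_is_primary_or_coprimary: bool
    ethics_approved: bool
    report_date: date
    report_language: str
    evaluated_patients: List[Patient] = field(default_factory=list)


CUTOFF = date(2010, 1, 1)  # final report published on 1st January 2010 or later


def patient_meets_criteria(p: Patient) -> bool:
    """Patient-level inclusion criteria."""
    return (p.confirmed_diagnosis and p.female and p.age >= 18
            and 0 <= p.performance_status <= 2
            and p.prior_platinum_lines >= 1)


def trial_is_eligible(t: Trial) -> bool:
    """Trial-level criteria plus the note: what matters is that every
    evaluated patient satisfies the patient-level criteria."""
    trial_level = (t.activity_is_primary_or_coprimary and t.ethics_approved
                   and t.report_date >= CUTOFF
                   and t.report_language == "English")
    return (trial_level and len(t.evaluated_patients) > 0
            and all(patient_meets_criteria(p) for p in t.evaluated_patients))


# Hypothetical example: only adults were evaluated, so the trial remains
# eligible even if its protocol would have admitted children.
adult = Patient(female=True, age=54, performance_status=1,
                prior_platinum_lines=2, confirmed_diagnosis=True)
trial = Trial(True, True, date(2014, 6, 1), "English", [adult, adult])
old_trial = Trial(True, True, date(2009, 12, 31), "English", [adult])
print(trial_is_eligible(trial), trial_is_eligible(old_trial))  # prints "True False"
```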

β. Eligibility criteria for CCs

β1. Inclusion criteria

• CC was evaluated in monotherapy as experimental treatment (i.e. active arm)

β2. Exclusion criteria

• Monoclonal antibodies, oncolytic viruses or reoviruses, vaccines, immunotherapeutic and endocrine CCs were excluded

• CC was administered in maintenance therapy

• CC was administered as standard treatment (i.e. control arm)

The following search string was used in Medline:

("Clinical Trial, Phase II"[Publication Type] OR “clinical trial phase 2” OR “clinical trial phase ii” OR “clinical study phase 2” OR “clinical study phase ii” OR “phase 2 clinical study” OR “phase 2 clinical studies�