Model selection and evaluation for all models

2 Methods and studies

2.1 Nonlinear mixed-effects modelling

2.1.5 Model selection and evaluation for all models

Model development is an iterative process in which models are evaluated, selected and updated before the process starts over again. Model evaluation is an important step that considers many different factors, such as the plausibility of parameter estimates, the stability of parameter estimation and model convergence. In addition, numerical/statistical and graphical methods are used to evaluate the models and ease the selection process. These criteria are outlined below.

2.1.5.1 Numerical and statistical evaluation of model performance

Objective function value

Model selection in NLME modelling is to a large extent guided by the OFV. As previously stated, parameters are estimated by minimising the -2LL, which corresponds to the OFV. A lower OFV indicates a better description of the data. For nested models (where the more complex model can be collapsed to the simpler one), the likelihood ratio test (LRT) can be used to compare the OFV between two models. The LRT assumes that the difference in OFV between the models follows a χ² distribution, which is defined by the degrees of freedom (the number of additional parameters in the more complex model). The resulting p-value can then be seen as the probability of observing a difference at least as large as the one between the models, given that the null hypothesis assumes no difference [109]. The significance level (α) is usually pre-specified and a value of 0.01 was selected in the current thesis. For an increase of 2 parameters (degrees of freedom = 2), this significance level corresponds to a difference in OFV of 9.21.
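The 9.21 threshold quoted above can be reproduced numerically. The following is a minimal sketch using only the Python standard library; the function names (`chi2_sf`, `lrt_critical_value`, `lrt_is_significant`) and the OFV values in the usage lines are illustrative assumptions, not part of NONMEM, PsN or the thesis workflow.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square distribution (integer df >= 1).

    Uses the recurrence Q(a + 1, z) = Q(a, z) + z**a * exp(-z) / Gamma(a + 1)
    for the regularised upper incomplete gamma function, with z = x / 2.
    """
    z = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-z)               # Q(1, z)
    else:
        a, q = 0.5, math.erfc(math.sqrt(z))    # Q(1/2, z)
    while a < df / 2.0:
        q += z ** a * math.exp(-z) / math.gamma(a + 1.0)
        a += 1.0
    return q


def lrt_critical_value(alpha, df):
    """OFV difference at which the LRT p-value equals alpha (bisection)."""
    lo, hi = 0.0, 1.0
    while chi2_sf(hi, df) > alpha:
        hi *= 2.0
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if chi2_sf(mid, df) > alpha:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def lrt_is_significant(ofv_reduced, ofv_full, df, alpha=0.01):
    """True if the drop in OFV from the reduced to the full model exceeds the threshold."""
    return (ofv_reduced - ofv_full) > lrt_critical_value(alpha, df)


# Hypothetical OFV values for two nested models differing by 2 parameters.
print(round(lrt_critical_value(0.01, 2), 2))
print(lrt_is_significant(ofv_reduced=1250.0, ofv_full=1238.0, df=2))
```

In practice a statistics library would normally supply the χ² quantile; the hand-written tail function is used here only to keep the sketch dependency-free.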

Akaike Information Criterion

If models are not nested, other criteria such as the Akaike Information Criterion (AIC) can be used. In addition to the -2LL, the AIC considers the number of parameters (p, Eq. 2.14) [119]. A lower AIC indicates a better description of the data.

$$\mathrm{AIC} = -2LL + 2 \cdot p \qquad \text{(Eq. 2.14)}$$
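As a small worked illustration of Eq. 2.14, the sketch below compares two non-nested models. The model names, OFV values and parameter counts are invented for the example.

```python
def aic(minus_2ll, n_params):
    """Akaike Information Criterion as in Eq. 2.14: -2LL + 2 * p."""
    return minus_2ll + 2 * n_params


# Hypothetical non-nested candidate models: (OFV, number of parameters).
candidates = {
    "model_a": (1240.0, 6),
    "model_b": (1236.5, 9),
}
scores = {name: aic(ofv, p) for name, (ofv, p) in candidates.items()}
best = min(scores, key=scores.get)
print(scores, best)
```

Here the model with the lower OFV is not preferred, because its three additional parameters outweigh the 3.5-point improvement in -2LL.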

2.1.5.2 Graphical evaluation of model performance

In addition to statistical methods, graphical evaluation tools are useful to identify trends in the model predictions or model misspecifications. In the current thesis, two kinds of graphical evaluation were applied: standard goodness of fit (GOF) graphics and visual predictive checks (VPCs).

Standard Goodness of fit graphics

GOF graphics are an important tool for fast initial evaluation of a model, and were used in all projects of the current thesis. GOF graphics commonly include a comparison of the predicted versus the observed concentrations, in which observations should be scattered evenly around the line of identity. The plots including population or individual predictions are informative for evaluating the appropriateness of the structural model or the stochastic model, respectively. In addition, residual-type diagnostics are useful to detect trends or model misspecification. Since the residuals (difference between observed and predicted concentration) are dependent on the magnitude of the prediction, conditional weighted residuals (CWRES) are a better alternative. CWRES are residuals which have been adjusted based on the FOCE approximation, and are therefore appropriate for graphical evaluation if the FOCE algorithm has been used [120]. CWRES versus population predictions are valuable for identifying concentration-dependencies in the data (assuming that the dependent variable is a concentration), and for assessing the appropriateness of the RUV model. CWRES versus time is also a helpful graphic for finding time-dependent trends, which may indicate whether a model misspecification appears in the absorption or the elimination phase. For both plots, the CWRES should be close to zero (within ± 2 SD) and randomly scattered around zero [121].

Visual predictive checks

The visual predictive check is a commonly used simulation-based graphical evaluation tool to evaluate the predictive performance of a model, and was used in all projects in the current thesis. The principle behind the VPC is to graphically evaluate the ability of a model to reproduce the observed data (i.e. predictive performance). This is done by simulating a large number of datasets (e.g. 1000) using the model to be evaluated. The percentiles of interest (commonly the 5th, 50th and 95th) and the confidence interval of the respective percentiles for the simulated concentrations are derived and then compared graphically with the same percentiles of the observed concentrations [122]. Commonly, the percentiles of the simulated and observed concentrations are derived for selected time ranges (bins) instead of at every time point, to ease the comparison. In a structured sampling design, the bins could refer to the time intervals around the planned sampling times. The binned percentiles of the simulated and observed data are thereafter compared graphically [123]. Since the VPC is displayed on the normal time scale, it is easy to identify which part of the PK profile is sub-optimally described (e.g. the absorption or elimination phase). If the PK (or PK/PD) is not dependent on clock time, time after dose is a commonly used time scale for VPCs after multiple dosing. For categorical data, such as when using a model to consider concentrations below the LLOQ, a categorical VPC is a useful tool to evaluate performance [118].
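The numerical part of a VPC (binning, percentiles and the interval around each simulated percentile) can be sketched as follows. This is a simplified illustration under stated assumptions, not the implementation used in the thesis: the mono-exponential toy model, the bin edges, the 200 replicates (instead of e.g. 1000, to keep the example fast) and all function names are hypothetical, and every bin is assumed to contain at least one sample.

```python
import math
import random
from bisect import bisect_right


def percentile(values, q):
    """Percentile with linear interpolation, q in [0, 100]."""
    s = sorted(values)
    pos = (len(s) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)


def bin_index(t, edges):
    """Index of the bin [edges[i], edges[i+1]) containing t (clamped to valid bins)."""
    return max(0, min(bisect_right(edges, t) - 1, len(edges) - 2))


def binned_percentiles(times, concs, edges, qs=(5, 50, 95)):
    """Percentiles of the concentrations within each time bin."""
    bins = [[] for _ in range(len(edges) - 1)]
    for t, c in zip(times, concs):
        bins[bin_index(t, edges)].append(c)
    return [[percentile(b, q) for q in qs] for b in bins]


def vpc_summary(times, observed, simulated, edges, qs=(5, 50, 95)):
    """Observed percentiles per bin and 95% intervals of the simulated percentiles."""
    obs = binned_percentiles(times, observed, edges, qs)
    sims = [binned_percentiles(times, rep, edges, qs) for rep in simulated]
    bands = []
    for b in range(len(edges) - 1):
        row = []
        for k in range(len(qs)):
            across = [s[b][k] for s in sims]
            row.append((percentile(across, 2.5), percentile(across, 97.5)))
        bands.append(row)
    return obs, bands


# Toy data: 30 hypothetical subjects sampled at 5 nominal times.
rng = random.Random(42)
times = [t for _ in range(30) for t in (0.5, 1.0, 2.0, 4.0, 8.0)]
edges = [0.0, 0.75, 1.5, 3.0, 6.0, 12.0]


def simulate():
    """One dataset from a hypothetical mono-exponential model with lognormal noise."""
    return [10.0 * math.exp(-0.2 * t) * rng.lognormvariate(0.0, 0.3) for t in times]


observed = simulate()                          # stand-in for real observations
simulated = [simulate() for _ in range(200)]   # replicates from the model
obs, bands = vpc_summary(times, observed, simulated, edges)
```

A plot would then overlay, per bin, the observed percentiles in `obs` on the shaded intervals in `bands`; an observed percentile falling outside its interval points to a part of the profile that the model describes poorly.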

2.1.5.3 Evaluation of uncertainty in parameter estimates

The precision of a model parameter can be derived by different methods. If the variance-covariance matrix is generated in NONMEM, the standard errors of the parameter estimates can be derived by taking the square root of the diagonal elements of the variance-covariance matrix. The relative standard error (%RSE) is commonly computed to evaluate parameter precision for fixed effects, and is derived from the final population parameter estimate (θ) and its standard error (SE(θ), Eq. 2.15). Usually, fixed-effects estimates with a %RSE below 30% are considered to be precisely estimated [109].

The %RSE for random-effects parameters on a standard deviation scale can be derived similarly from the final variance (ω²) and the standard error of this variance (SE(ω²), Eq. 2.16). Random-effects are commonly less precisely estimated and a %RSE of 40%-50% is acceptable [109]. In addition to the standard error from the covariance step, there are other approaches available for generating the parameter precision, such as bootstrap and log-likelihood profiling.

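$$\%\mathrm{RSE}(\theta) = 100 \cdot \frac{SE(\theta)}{\theta} \qquad \text{(Eq. 2.15)}$$

$$\%\mathrm{RSE}(\omega^2) = 100 \cdot \frac{SE(\omega^2)}{2 \cdot \omega^2} \qquad \text{(Eq. 2.16)}$$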
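Eq. 2.15 and Eq. 2.16 can be evaluated as in the sketch below. The factor 2 in Eq. 2.16 is consistent with a first-order (delta-method) approximation, SE(ω) ≈ SE(ω²) / (2 ∙ ω), which gives the relative standard error on the standard deviation scale. The function names and the numerical values are illustrative assumptions.

```python
import math


def standard_errors(cov):
    """Square roots of the diagonal of a variance-covariance matrix (list of lists)."""
    return [math.sqrt(cov[i][i]) for i in range(len(cov))]


def rse_fixed(theta, se_theta):
    """%RSE of a fixed-effects parameter (Eq. 2.15)."""
    return 100.0 * se_theta / theta


def rse_random_sd(omega2, se_omega2):
    """%RSE of a random-effects parameter on the standard deviation scale (Eq. 2.16)."""
    return 100.0 * se_omega2 / (2.0 * omega2)


# Hypothetical estimates: one fixed effect (theta) and one variance (omega^2).
cov = [[0.04, 0.0],
       [0.0, 0.0009]]
se_theta, se_omega2 = standard_errors(cov)
print(rse_fixed(2.0, se_theta))        # %RSE of theta
print(rse_random_sd(0.09, se_omega2))  # %RSE of omega on the SD scale
```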

Bootstrap method

Using the bootstrap method, a pre-specified number of new datasets are first generated from the original dataset by sampling individuals with replacement. The established model is then re-estimated on each new dataset to generate new parameter estimates, from which a confidence interval (e.g. the 95% confidence interval (95% CI)) can be derived. The confidence interval and the median of the bootstrap parameter estimates can thereafter be compared to the estimates from the original data [124], to provide information regarding the generalisability of the model (i.e. whether the model is too specific to the data or can be applied to other populations). The number of bootstrap datasets needed depends on the aim of the bootstrap. 200 datasets may be needed for generating the standard errors [124]. In the current thesis, 1000 new datasets were sampled in all projects and the bootstraps were generated using the software PsN [124].
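The resampling logic described above can be sketched in a few lines. This is not PsN's implementation: the `estimate` callback stands in for a full model re-estimation, and the toy estimator (a mean of hypothetical individual clearance values) and all names are assumptions made for illustration.

```python
import random


def percentile(values, q):
    """Percentile with linear interpolation, q in [0, 100]."""
    s = sorted(values)
    pos = (len(s) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)


def bootstrap_ci(individuals, estimate, n_boot=1000, level=95.0, seed=1):
    """Median and percentile confidence interval of a bootstrapped estimate.

    Each bootstrap dataset has the same number of individuals as the original
    and is drawn by sampling whole individuals with replacement.
    """
    rng = random.Random(seed)
    n = len(individuals)
    estimates = []
    for _ in range(n_boot):
        sample = [individuals[rng.randrange(n)] for _ in range(n)]
        estimates.append(estimate(sample))
    tail = (100.0 - level) / 2.0
    return (percentile(estimates, 50.0),
            percentile(estimates, tail),
            percentile(estimates, 100.0 - tail))


def mean(values):
    """Toy 'model': the estimate is simply the mean of individual values."""
    return sum(values) / len(values)


# Hypothetical individual clearance values (arbitrary units).
clearances = [0.8, 1.0, 1.2, 0.9, 1.1, 1.3, 0.7, 1.0]
median, lower, upper = bootstrap_ci(clearances, mean)
print(median, lower, upper)
```

The key design point is that the individual, not the single observation, is the resampling unit, so that all observations from one subject stay together.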

Log-Likelihood profiling

Log-likelihood profiling is first and foremost a method to assess the surface of the likelihood, to see whether the OFV of the final model refers to the global minimum. This approach may, however, also be used to generate confidence intervals for the parameter estimates which do not assume a specific distribution. The analysis is performed individually for each parameter estimate of interest, and the final model is initially estimated with the final parameter estimate. The model is thereafter re-estimated with the respective parameter fixed to a slightly different value (e.g. ±5% or ±20%) until the selected significant difference in likelihood (e.g. ΔOFV: 3.84, df=1, α=0.05) between the full and reduced model is achieved. When this difference has been attained, the lower and upper bounds of the 95% confidence interval for the parameter have been reached [109,125]. In the current thesis, log-likelihood profiling was applied using PsN to generate the confidence interval in project 2 (section 2.3.4) [124].
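The search for a profile-likelihood bound can be sketched as a stepping phase followed by bisection. This is a simplified illustration and not the algorithm used by PsN: `ofv_at` is a hypothetical callback that, in a real analysis, would re-estimate all other parameters with the profiled parameter fixed and return the resulting OFV. Here it is replaced by a quadratic toy OFV so that the result can be checked analytically, and the final estimate is assumed to be non-zero because the step is relative.

```python
import math


def profile_bound(ofv_at, estimate, direction, target=3.84, step=0.05,
                  tol=1e-8, max_steps=1000):
    """Value of the fixed parameter at which OFV rises by `target` above the minimum.

    direction = +1 searches for the upper bound, -1 for the lower bound.
    `step` is a fraction of the final estimate (e.g. 0.05 for 5%).
    """
    ofv_min = ofv_at(estimate)
    inner = outer = estimate
    for _ in range(max_steps):
        outer = outer + direction * step * abs(estimate)
        if ofv_at(outer) - ofv_min >= target:
            break
        inner = outer
    else:
        raise RuntimeError("target OFV difference not reached")
    # Refine between the last value below and the first value above the target.
    while abs(outer - inner) > tol:
        mid = 0.5 * (inner + outer)
        if ofv_at(mid) - ofv_min >= target:
            outer = mid
        else:
            inner = mid
    return 0.5 * (inner + outer)


# Toy OFV: normal likelihood with known variance 1, n = 25, final estimate 2.0.
def toy_ofv(theta):
    return 25.0 * (theta - 2.0) ** 2


lower = profile_bound(toy_ofv, 2.0, direction=-1)
upper = profile_bound(toy_ofv, 2.0, direction=+1)
print(lower, upper)
```

For this toy case the bounds should agree with 2.0 ± sqrt(3.84 / 25), which is the usual normal-theory interval; for a real NLME model the profile is generally asymmetric, which is what makes the method informative.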

2.1.5.4 Identification of influential individuals

Influential individuals may have a large impact on model selection or on parameter estimates [126]. Influential individuals can be identified by comparing individual OFV in the NONMEM output. Another approach is to use case-deletion diagnostics.

Case-deletion diagnostics

In case-deletion diagnostics, new datasets are created from the original dataset, in each of which one individual has been removed. The developed model is re-estimated on each new dataset, and the difference in OFV, the % change in parameter estimates or the precision of the parameter estimates is assessed [124]. An individual whose removal generates a relative change in parameter estimates of ±20% is considered an influential individual [127]. Case-deletion diagnostics were applied using PsN on the plasma protein binding model and CBG model in project 1 (section 2.2.4), as well as on the PK and PK/PD models in project 3 (section 2.4.5) of this thesis.
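The leave-one-individual-out procedure can be sketched as follows. This is an illustration rather than PsN's implementation: the `estimate` callback stands in for re-estimating the model, the toy estimator is a simple mean, the data are invented, and treating a change of at least 20% in absolute value as influential is an assumption about how the ±20% criterion is applied.

```python
def case_deletion(individuals, estimate, threshold=20.0):
    """Relative change (%) in the estimate when each individual is left out.

    Returns a list of (index, percent_change, is_influential) tuples.
    """
    full = estimate(individuals)
    results = []
    for i in range(len(individuals)):
        reduced = individuals[:i] + individuals[i + 1:]
        change = 100.0 * (estimate(reduced) - full) / full
        results.append((i, change, abs(change) >= threshold))
    return results


def mean(values):
    """Toy 'model': the estimate is simply the mean of individual values."""
    return sum(values) / len(values)


# Hypothetical individual parameter values; the last one is an outlier.
values = [1.0, 1.1, 0.9, 1.0, 5.0]
results = case_deletion(values, mean)
influential = [i for i, _, flag in results if flag]
print(influential)
```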

2.1.5.5 External model evaluation

An external model evaluation is a useful tool to evaluate the ability of a developed model to predict external data not used for model development. In the current thesis, external model evaluations were performed similarly to VPCs (section 2.1.5.2); the covariates (e.g. dose and body weight) of the external data and the model to be assessed were used to simulate 1000 new datasets with new concentration-time profiles. The percentiles of the observed and simulated concentrations were derived and subsequently compared graphically. An external model evaluation was performed for the plasma protein binding model in project 1 (2.2.4) and for the full adult semi-mechanistic PK model in project 2 (2.3.3).

In document Pharmacometric approaches to assess hydrocortisone therapy in paediatric patients with adrenal insufficiency (Page 49-53)