BERGQUIST, MANDY LEE. Caution Using Bootstrap Tolerance Limits with Application to Dissolution Specification Limits. (Under the direction of Dr. Marie Davidian.)
by
Mandy Lee Bergquist
A dissertation submitted to the Graduate Faculty of North Carolina State University
in partial fulfillment of the requirements for the Degree of
Doctor of Philosophy
STATISTICS
Raleigh, North Carolina 2006
APPROVED BY:
Dr. Marie Davidian Dr. Dennis Boos
Chair of Advisory Committee
Biography
Acknowledgements
Many of my friends and colleagues have contributed guidance, encouragement, and valuable insight towards completion of my degree. To all of them, I am grateful, and to the following, I owe a special word of thanks:
Marie Davidian (my advisor) for her guidance, time, and patience with my frequently unsuccessful attempts to juggle my dissertation and my work for GSK.
Dave Cooper (my manager at GSK) for supporting my desire for a PhD and encouraging me to complete it without relinquishing my job.
My PhD committee members for their ideas and helpful comments.
Steve Marron and Jack Brown for believing that I was “PhD material” and encouraging me to pursue it.
Vince and Dixie Bergquist (my parents) for encouraging me to keep trying when I thought that I would never finish.
My Creator and heavenly Father apart from whom I could not accomplish anything good.
Contents
List of Tables
List of Figures
1 Introduction
1.1 Dissolution Experiments
1.2 Dissolution Data Analysis
1.2.1 Common Approaches
1.2.2 Specification Limits
1.3 Tolerance Limits
1.3.1 Historical Development
1.3.2 Bootstrap Methods
1.4 Our Proposal
2 Theory
2.1 Model
2.2 Algorithm
2.3 Examples
2.3.1 Simulated Data
2.3.2 Tsong Data
2.4 Discussion
3 Simulations
3.1 Case I: A N(µ, σ²) Simple Random Sample
3.2 Case II: A Random Intercept Model
3.3 Case III: A Linear Mixed Model
3.4 Case IV: A Nonlinear Mixed Model
3.5 Case V: A Nonlinear Mixed Model with Random Lots
4 Conclusions
Appendices
A Tolerance Limits for a Simple Random Sample
A.1 Purpose
A.2 λ-Content Tolerance Limits
A.3 λ-Expectation Tolerance Limits
A.4 Distribution-Free Tolerance Limits
A.5 Normal Distribution Tolerance Limits
B Nonlinear Mixed Model Estimation
B.1 Stage 1: Obtain Individual Estimates $\beta^*_{ij}$
List of Tables
2.1 Parameters and Estimates for Weibull Example
3.1 Simple Random Sample: Effect of Increasing the Number of Bootstrap Samples on the Achieved Confidence Level
3.2 Simple Random Sample: Effect of Increasing the Sample Size on the Achieved Confidence Level
3.3 Simple Random Sample: Effect of Increasing the Stated Confidence Level on the Achieved Confidence Level
3.4 Simple Random Sample: Effect of Increasing the Percentile to Cover on the Achieved Confidence Level
3.5 Simple Random Sample: Effect of a Known Population Mean on the Achieved Confidence Level
3.6 Simple Random Sample: Effect of a Known Population Standard Deviation on the Achieved Confidence Level
3.7 Random Intercept Model: Achieved Confidence Levels
3.8 Linear Mixed Model: Effect of Increasing the Number of Units in a Sample on the Achieved Confidence Level
3.9 Linear Mixed Model: Effect of Increasing the Number of Bootstrap Samples on the Achieved Confidence Level
3.10 Linear Mixed Model: Effect of Increasing the Stated Confidence Level on the Achieved Confidence Level
3.11 Linear Mixed Model: Effect of Increasing the Percentile to Cover on the Achieved Confidence Level
3.12 Nonlinear Mixed Model: Achieved Confidence Levels
List of Figures
1.1 Typical Dissolution Profiles
1.2 USP XXIII Three-Stage Acceptance Sampling Plan
2.1 Simulated Data: Weibull Mean Model
2.2 Tsong Reference Data
Chapter 1
Introduction
A new drug application (NDA) submitted to the U.S. Food and Drug Administration (FDA) reports the physical and chemical properties of a pharmaceutical product’s formulation. These properties characterize the quality and performance of the product to ensure that patients receive safe and effective medications. Upon approval, the NDA specifications for the new product will become part of the United States Pharmacopoeia (USP). All manufactured lots of the product must meet the USP requirements in order to be released in the United States. This ensures that manufacturing yields a product equivalent to that which formed the basis of the NDA approval.
The chemistry, manufacturing, and controls section of NDAs often includes specifications on the dissolution of the active ingredient because appropriate dissolution is necessary for the bioavailability of many products. In conjunction with the FDA, the new drug sponsor defines acceptable ranges for the percent of drug released at selected time points. Immediate release formulations may be characterized by dissolution at a single time point that shows the release of most of the active ingredient in a short time frame. For extended release formulations, the FDA recommends specifications at a minimum of three time points chosen to represent the early, middle, and late segments of the dissolution profile.
characteristics throughout its shelf life. All formulations that meet the approved limits are assumed to perform similarly. A lot that does not meet the specification limits must be discarded at substantial loss to the product manufacturer. In contrast, if a product continues to yield dissolution values within the specification limits after minor changes in equipment, process, or manufacturing site, then these changes may not require additional costly bioavailability studies. Both manufacturers and the FDA have a substantial interest in the choice of dissolution specification limits. Well-chosen limits should allow for unavoidable variation among lots in order to avoid discarding large numbers of good lots; however, the specifications must also be able to exclude lots with substantially different dissolution characteristics in order to ensure the quality and safety of the product.
1.1
Dissolution Experiments
Dissolution experiments use an apparatus that processes six vessels simultaneously in a dissolution bath. Each vessel contains media into which a single dosage unit is dropped. A rotating basket or paddle agitates the contents of the vessel. At pre-specified times, e.g. 0.5, 1, 2, 4, 6, and 8 hours, a scientist samples an aliquot from each vessel and assays each aliquot separately, typically via chromatography or UV spectroscopy. Thus, one run of a dissolution experiment produces six dissolution profiles, each of which consists of measurements of the quantity of active ingredient released by a single dose under controlled conditions at various dissolution times.
Figure 1.1 shows a set of six typical dissolution profiles. While these are simulated data, the profiles reflect dissolution values that might be observed in practice. Individual dissolution values are usually expressed as a percentage of the product’s label claim. The experiment never produces values less than zero, but due to variability in the assay and the manufacturing process, observed dissolution values greater than 100% frequently occur. Variability among dissolution baths is generally negligible relative to other sources of error.
[Figure 1.1: Typical Dissolution Profiles. Simulated dissolution profiles from a Weibull model, plotted as % label claim against time in hours.]
scale-up or stability lots. Profiles from these lots may be measured at initial production and across time (e.g. 0, 3, 6, 9, 12, 18, and 24 months) in order to evaluate the stability of the release characteristics. Thus, the dissolution data available for developing specification limits may consist of as few as eighteen profiles, representing six tablets from each of three lots, or as many as hundreds of profiles representing multiple product lots measured at various post-production intervals.
1.2
Dissolution Data Analysis
1.2.1
Common Approaches
Freitag (2001) provides a detailed comparison of dissolution testing recommendations from various FDA guidance documents. In general, the FDA provides detailed guidance on testing conditions for dissolution experiments, on the number of individual dosage units to be tested from each lot, and on the number of sampling time points. Much less guidance is provided on statistical methods for analyzing the resulting data. FDA reviewers have written and contributed to a number of papers discussing various methods, but these papers do not provide official guidance and have not settled on any particular approach.
The FDA guidance documents recognize three categories of methods for evaluating the similarity of dissolution profiles. The simplest and most frequently used method relies on the similarity factor, $f_2$, proposed by Moore and Flanner (1996):
$$f_2 = 50 \log_{10}\left\{\left[1 + \frac{1}{n}\sum_{t=1}^{n} w_t (R_t - T_t)^2\right]^{-1/2} \times 100\right\}, \tag{1.1}$$
where $R_t$ and $T_t$ are the average percent dissolved at time $t$ for the reference and test lots, respectively, $n$ is the number of dissolution time points, and $w_t$ is an optional weight.
The $f_2$ factor has been heavily studied and frequently criticized since its recommendation by the FDA. It does not account for variability or correlation among time points and is sensitive to the number and times of the dissolution samples, particularly after the dissolution profiles reach an asymptote (Chow and Ki, 1997; Ju and Liaw, 1997; Polli, Rekhi, Augsburger and Shah, 1997). In 1998, Shah, Tsong, Sathe, and Liu proposed the calculation of a 90% confidence interval for $E(f_2)$ using a bootstrap method and suggested comparing the lower confidence limit to a pre-specified similarity value, such as 50. Liu, Ma, and Chow (1997) noted that there was no adequate analytical formula (either exact or approximate) for the sampling distribution of $f_2$. Thus, it is difficult to assess the type I and type II errors and to evaluate power, sample size, bias, and sensitivity of any tests based on $f_2$. Ma, Wang, Liu, and Tsong (2000) used simulations to investigate the type I error and power of the 90% confidence interval approach. In other simulations and examples, a number of authors have found the $f_2$ factor to be liberal in deciding for similarity (Chow and Ki, 1997; Ju and Liaw, 1997; Liu, Ma and Chow, 1997; Sierra-Cavazos and Berger, 1999), which perhaps has contributed to its popularity.
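As a small illustration of Equation 1.1, the following Python sketch computes the similarity factor from two mean profiles. The function name, the default of unit weights, and the example profiles are assumptions made for this illustration only; they are not taken from the FDA guidance or from Moore and Flanner.

```python
import math


def similarity_factor_f2(reference, test, weights=None):
    """Similarity factor f2 of Equation 1.1 for two mean percent-dissolved profiles."""
    if len(reference) != len(test) or not reference:
        raise ValueError("profiles must be non-empty and of equal length")
    n = len(reference)
    if weights is None:
        weights = [1.0] * n  # assumption: unweighted when no weights are supplied
    mean_sq_diff = sum(
        w * (r - t) ** 2 for w, r, t in zip(weights, reference, test)
    ) / n
    return 50.0 * math.log10(100.0 / math.sqrt(1.0 + mean_sq_diff))


# Hypothetical mean profiles (percent of label claim) at four time points.
ref = [20.0, 45.0, 70.0, 90.0]
tst = [18.0, 40.0, 66.0, 88.0]
f2_identical = similarity_factor_f2(ref, ref)
f2_close = similarity_factor_f2(ref, tst)
f2_shifted = similarity_factor_f2(ref, [r - 10.0 for r in ref])
```

Identical profiles give a value of 100, and a constant difference of 10 percentage points gives a value just under 50, which is consistent with the use of 50 as a similarity threshold mentioned above. Because the function uses only the mean profiles, it also makes the criticism concrete: unit-to-unit variability never enters the calculation.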
The FDA guidance documents also allow for the use of “model-independent” and “model-dependent” methods for comparing dissolution profiles. In this terminology, model-independent methods compare the dissolution profiles at given time points while model-dependent methods rely on fitting a curve to the data and comparing the parameters of the curve. Proposed model-independent methods include one-way ANOVA at each time point, two-way ANOVA, split-plot trend analysis, and an intersection-union test requiring equivalence at each time point (Sierra-Cavazos and Berger, 1999). FDA reviewers Chen and Tsong (1997) suggested two multivariate analysis approaches. The first used a Mahalanobis distance statistic to construct a joint predicted region for the responses from an individual dosage unit. This method considered a newly-sampled unit to be out-of-specification if its multivariate distance failed to lie within the predicted region. For their second method, Chen and Tsong proposed the use of simultaneous multiple-time point prediction limits using a Bonferroni correction to control the overall out-of-specification rate.
principal components to deal with the correlation among responses from the same dosage unit. In this paper, they found the principal components for the tablets from the reference lots and created joint (Bonferroni-adjusted) confidence intervals on the principal component scores. The dissolution values for a new dosage unit were projected onto the selected principal components from the reference lots. If the proportion of dosage units with out-of-specification measurements were large, then the lot would not be released. Adams et al. (2001) found the principal components analysis (PCA) to compare favorably with the similarity factor ($f_2$) and multivariate distance methods. As expected, PCA was less sensitive than the similarity factor to irrelevant irregularities in the dissolution profiles. To identify unacceptable lots, Adams et al. modified the PCA method, using a bootstrap to create a joint confidence region for the first two principal components of the reference lots.
Chow and Ki (1997) took a different approach to deal with the correlation among time points. They proposed an autoregressive time series model for the relative dissolution rate of the reference and test products. Their method created a confidence interval for the true relative dissolution rate and compared its endpoints to pre-specified limits.
Other attempts to account for the structure of the collected data fall under the FDA’s category of model-dependent methods. A number of models have been proposed for dissolution curves, including mechanistic models (Kervinen and Yliruusi, 1993; Crowder, 1996), and various forms of exponential, probit, Gompertz, logistic, and Weibull models (Tsong and Hammerstrom, 1994). Several authors, working with real dissolution data from specific pharmaceutical products, have found the Weibull, originally proposed by Langenbucher (1972), to fit dissolution curves well (Polli et al., 1997; Sathe, Tsong and Shah, 1996; Yuksel, Kanik and Baykara, 2000). The Weibull function used to simulate the data in Figure 1.1 takes the following form:
$$f(t, Q_{\max}, T_c, \theta) = Q_{\max}\left\{1 - \exp\left[\log\left(1 - \frac{c}{Q_{\max}}\right)\left(\frac{t}{T_c}\right)^{\theta}\right]\right\}, \tag{1.2}$$
where $Q_{\max}$ is an asymptote corresponding to the maximum dissolvable amount, $c$
and $\theta$ is the slope of the dissolution curve.
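To make the shape of Equation 1.2 concrete, here is a minimal Python sketch of the Weibull mean function. The function name and all parameter values are hypothetical and chosen only for illustration; they are not the values used to produce Figure 1.1.

```python
import math


def weibull_dissolution(t, q_max, t_c, theta, c):
    """Equation 1.2: mean percent dissolved at time t (needs 0 < c < q_max, t >= 0)."""
    if not 0.0 < c < q_max:
        raise ValueError("c must lie strictly between 0 and q_max")
    return q_max * (1.0 - math.exp(math.log(1.0 - c / q_max) * (t / t_c) ** theta))


# Hypothetical parameter values, evaluated at the sampling times of Section 1.1.
times = [0.5, 1, 2, 4, 6, 8]
profile = [
    weibull_dissolution(t, q_max=100.0, t_c=4.0, theta=1.2, c=63.2) for t in times
]
```

Substituting $t = T_c$ into Equation 1.2 gives $f = c$ directly, so under this parameterization the curve passes through the amount $c$ at time $T_c$, starts at zero, and rises monotonically toward $Q_{\max}$.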
For model-dependent dissolution testing, most authors followed the approach described by Tsong and Hammerstrom (1994). They fit a curve to the dissolution profile for each dosage unit from the reference lots and defined a similarity region by calculating a joint confidence region for the parameter values. To test a new lot, they fit a curve using the same function (e.g. Weibull, logistic, etc.) to each of the dissolution profiles from dosage units in the new lot and determined whether the average of the newly-estimated parameters fell within the joint confidence region. In 1996, Sathe, Tsong, and Shah suggested a modification to recognize explicitly the within- and among-lot variances in calculating the similarity region for the parameters. In deciding whether to release a lot, Tsong, Hammerstrom, and Chen (1997) suggested considering the proportion of dosage units with parameters within the similarity region rather than the average of the estimated parameters. All of these model-dependent methods rely on comparisons of parameter values from the fitted models. Thus, even assuming that an appropriate model has been identified, it is not clear that profiles with substantially different parameter values would necessarily have substantially different dissolution values or vice versa. A model-dependent approach that more clearly addresses the individual dissolution values from each dosage unit would be preferable.
For the interested reader, O’Hara et al. (1988) reviewed a number of similarity factor, model-independent, and model-dependent methods for comparing dissolution profiles, and Yuksel, Kanik, and Baykara (2000) compared a selection of dissolution data analysis methods via simulation.
1.2.2
Specification Limits
[Figure 1.2: USP XXIII Three-Stage Acceptance Sampling Plan.]
acceptance sampling plan such as that defined in USP XXIII (2000). As shown in Figure 1.2, in the first stage of this plan, the manufacturer samples and tests 6 dosage units. If all 6 units dissolve more than Q+5% of label claim, then the lot passes. Otherwise, the manufacturer samples an additional 6 dosage units. If the mean dissolution of the 12 units is greater than Q and all 12 units dissolve more than Q-15%, then the lot passes. If not, the manufacturer samples an additional 12 dosage units. The lot passes in the third stage only if all 24 units dissolve more than Q-15%, at least 22 units dissolve more than Q-5%, and the mean dissolution of the 24 units is greater than Q. Modified or extended release products typically have specifications at several different times, each of which has a corresponding Q value.
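The staged decision rule can be sketched in Python as follows. The criteria are transcribed from the paragraph above rather than from the compendial text, and the function name and the convention of supplying all 24 results up front are assumptions of this sketch.

```python
def usp_three_stage(units, q):
    """Return (passed, stage) for the three-stage plan as described in the text.

    `units` holds 24 percent-dissolved results in testing order; only the first
    6, 12, or 24 are used, depending on how far the plan has to proceed.
    """
    if len(units) < 24:
        raise ValueError("supply 24 results so that every stage can be evaluated")
    s1, s2, s3 = units[:6], units[:12], units[:24]
    # Stage 1: all 6 units above Q + 5.
    if all(u > q + 5 for u in s1):
        return True, 1
    # Stage 2: mean of 12 units above Q and every unit above Q - 15.
    if sum(s2) / 12 > q and all(u > q - 15 for u in s2):
        return True, 2
    # Stage 3: every unit above Q - 15, at least 22 above Q - 5, mean above Q.
    stage3 = (
        all(u > q - 15 for u in s3)
        and sum(u > q - 5 for u in s3) >= 22
        and sum(s3) / 24 > q
    )
    return stage3, 3


# Hypothetical lots tested against Q = 80.
result_stage1 = usp_three_stage([90.0] * 24, q=80)
result_stage2 = usp_three_stage([84.0] * 24, q=80)
result_stage3 = usp_three_stage([78.0] * 12 + [90.0] * 12, q=80)
result_fail = usp_three_stage([70.0] * 12 + [95.0] * 12, q=80)
```

Writing the rule out this way shows that a lot's fate depends on individual units as well as on the mean, which is the motivation for limits that describe where individual dosage units fall.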
data at a given time point to calculate a statistical tolerance interval; that is, an interval constructed to contain a specified proportion, λ, of the sampled population with a specified degree of confidence, 100γ%. For large values of λ and γ, with the assumption that the manufacturing and measuring processes remain in statistical control, the upper and lower bounds of the tolerance interval define a range within which most of the manufactured units should lie. Tolerance intervals directly address the central question of whether individual dosage units from the lot will have appropriate dissolution characteristics. Comparison of statistical tolerance limits to proposed specification limits enables a new drug sponsor to assess the likelihood of manufacturing lots routinely meeting the specification limits. The tolerance intervals also provide a basis for discussing proposed specification limits with the FDA.
1.3
Tolerance Limits
1.3.1
Historical Development
Statistical tolerance intervals were originally formulated by Wilks (1941) for monitoring manufacturing processes. Wilks used order statistics to find exact nonparametric tolerance limits for continuous univariate distributions. For simple random samples from continuous distributions, Wilks (1942) showed that the proportion of the population between two order statistics is independent of the population sampled, depending only on the chosen order statistics. His nonparametric method was extended to multi-dimensional cases by Wald (1943) and Tukey (1947) and to discontinuous distributions by Scheffé and Tukey (1945) and Tukey (1948).
generalized gamma, double exponential, Cauchy, logistic, uniform, Poisson, binomial, and negative-binomial distributions (Patel, 1986).
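The distribution-free property can be illustrated with the extreme order statistics. The Python sketch below uses two standard textbook results for a continuous population, stated here as background rather than quoted from this dissertation: the sample maximum exceeds at least a proportion λ of the population with confidence $1 - \lambda^n$, and the interval from the sample minimum to the sample maximum contains at least λ with confidence $1 - n\lambda^{n-1} + (n-1)\lambda^n$. The function names are hypothetical.

```python
def confidence_one_sided(n, content):
    """Confidence that the sample maximum exceeds at least `content` of the population."""
    return 1.0 - content ** n


def confidence_two_sided(n, content):
    """Confidence that (sample minimum, sample maximum) contains at least `content`."""
    return 1.0 - n * content ** (n - 1) + (n - 1) * content ** n


def smallest_n(confidence_fn, content, confidence):
    """Smallest sample size at which the extremes reach the requested confidence."""
    n = 2
    while confidence_fn(n, content) < confidence:
        n += 1
    return n


n_one_sided = smallest_n(confidence_one_sided, 0.95, 0.95)
n_two_sided = smallest_n(confidence_two_sided, 0.95, 0.95)
```

For 95% content with 95% confidence, these give 59 units for the one-sided limit and 93 units for the two-sided interval. Neither expression involves the population distribution, only the sample size, and both requirements are well above the eighteen profiles that may be available when dissolution specifications are developed.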
Until recently, most research focused on the normal distribution and various approximations. In 1946, Wald and Wolfowitz produced a computational formula for approximate two-sided limits for a normally distributed population with unknown mean and variance. Weissberg and Beatty (1960), Ellison (1964), and Howe (1969) examined the adequacy of the approximation and suggested refinements. The advent of greater computing power led to the publication of extensive tables of exact (via numerical integration) factors for normal tolerance limits (Odeh and Owen, 1980) and computing routines that revealed inaccuracies in the earlier approximations with small sample sizes (Eberhardt, Mee and Reeve, 1989). In contrast to the two-sided limits, exact one-sided limits for a normal distribution were constructed relatively easily from the noncentral t distribution (Johnson, Kotz and Balakrishnan, 1994; Owen, 1968). Approximations for both one-sided and two-sided limits are widely used in practice. Hahn and Meeker’s popular reference book (1991) provides factors for common cases as do some textbooks in engineering statistics and quality control, e.g. Montgomery (2001). SAS’s proc capability software uses the approximations recommended by Hahn (1970b).
Wallis (1951) extended the Wald-Wolfowitz tolerance limits for simple random samples to linear regression, i.e. for specified values of independent variables, x, Wallis constructed an interval that contained, with confidence 100γ%, at least a proportion λ of the conditional distribution of the dependent variable Y(x), given x. Several authors extended Wallis’s work to simultaneous tolerance intervals for regression (Limam and Thomas, 1988; Mee, Eberhardt and Reeve, 1991).
Recent interest in random effects and mixed effects models has led researchers to construct tolerance intervals in more complex situations. In the one-way ANOVA random effects model, the principal difficulty lies in the unknown ratio of the variances of the random effects, R. Lemon (1977) introduced very conservative one-sided tolerance limits for balanced data. Mee and Owen (1983) used a Satterthwaite approximation to derive a somewhat less conservative tolerance limit when R
They recommended using an upper confidence bound for the unknown R in order to obtain intervals with coverage at least as large as the nominal coverage. Mee (1984) extended this approximate method to two-sided tolerance intervals. Vangel (1992) formulated the balanced one-way ANOVA problem as an integral equation and approximated the solution. With an iterative numerical procedure, he obtained nearly nominal coverage even for small sample sizes. Beckman and Tietjen (1989) followed Wald and Wolfowitz’s approach and used numerical integration to table factors for two-sided tolerance intervals for multi-way balanced random effects models. Bhaumik and Kulkarni (1996) derived an exact method for both balanced and unbalanced one-way ANOVA random effects models provided the variance ratio was known. They substituted an approximately unbiased estimator for R when the variance ratio was unknown. With their estimated R value, the achieved confidence level fell short of the nominal confidence level, but approached the nominal confidence level as the number of groups increased and the estimation of R improved.
Recently, Hoffman and Kringle (2005) constructed two-sided tolerance intervals for random effects models with balanced or unbalanced data. They used existing large sample methods for forming confidence bounds on linear combinations of variance parameters. In simulations, their intervals maintained nominal content and coverage across sample sizes, but were conservative for small sample sizes. Alternatively, in the last few years, several authors have used generalized pivotal quantities to develop one- and two-sided tolerance intervals for mixed effects models with balanced and unbalanced data (Liao and Iyer, 2004; Krishnamoorthy and Mathew, 2004; Liao, Lin and Iyer, 2005).
1.3.2
Bootstrap Methods
Starting with the standard λ-content, 100γ%-confidence normal-theory tolerance limits, they used bootstrap sampling from the original data to estimate the actual content of the tolerance interval, $\lambda^*$. Then, they renamed the interval a $\lambda^*$ content-corrected tolerance interval with the original confidence level 100γ%. Note that they did not adjust the endpoints of the interval. While the method does give statisticians an idea of the actual content of a calculated tolerance interval, it does not permit the calculation of an interval with the desired content, except perhaps by trial and error.
Robustness to the distribution of the data can also be achieved by using nonparametric tolerance limits based on order statistics. The coverage levels of nonparametric intervals are known precisely, but they often cannot be set equal to a pre-specified level, e.g. 0.95, since the sample size determines which order statistics are available for constructing the interval. Beran and Hall (1993) showed that simple linear interpolation between the order statistics resulted in accurate, but possibly conservative, tolerance limits. Ho and Lee (2005) recommended a bootstrap calibration procedure to reduce the error of coverage in the simple linear interpolated intervals. These nonparametric intervals, however, are still limited by the sample size in the sense that the endpoint of the upper tolerance limit can be no greater than the largest order statistic, $X_{(n)}$.
Since nonparametric intervals often rely on the largest order statistic, they can be quite unstable, particularly when the underlying distribution is heavy-tailed or contains outliers. Horn (1992) addressed this issue for the specific case of a right skewed distribution with a lower bound and possible outliers. He created an upper tolerance limit using a more moderate (stable) order statistic times a scale factor. The scale factor involved the empirical distribution of an order statistic, which could be found by bootstrapping the original data. Horn’s limits had the advantage that they could be larger than X(n) when the scale factor was adequately large.
model with balanced or unbalanced data. Their procedure bootstraps the residuals from a linear mixed model, providing the benefit of a larger sample size over using only the data at a given time point. Shuong and Altan’s procedure does not require an assumed distribution for the residual errors, but it does assume a normal distribution for the other random effects. Shuong and Altan provided a case study applying their procedure to content uniformity data, but did not report any simulations to verify the accuracy of their procedure.
Bootstrapping tolerance intervals, however, can be quite difficult. The bootstrap quantile estimate itself demonstrates a poor rate of convergence, $O(n^{-1/4})$, where $n$ denotes the sample size (Singh, 1981). Two-sided percentile-type confidence intervals derived by bootstrapping the sample λ-quantile result in coverage error of size $O(n^{-1/2})$ (Falk and Kaufmann, 1991). (In comparison, the coverage error for a two-sided interval for the mean would be $O(n^{-1})$.) Discussing this result, de Angelis, Hall, and Young (1993) note that the poor coverage accuracy is not unique to the bootstrap percentile method, but rather is inherent to any confidence procedure based directly on order statistics because of the discreteness of the bootstrap distribution. The coverage accuracy of two-sided bootstrap tolerance intervals can be improved by smoothing the bootstrap, but in the case of a confidence limit on a quantile, the error will generally be no smaller than $O(n^{-2/3})$.
1.4
Our Proposal
Prompted by work with Shuong and Altan, we became interested in a bootstrap tolerance interval for a nonlinear mixed effect model as a possible solution to the problem of setting and evaluating dissolution specification limits. We formulated a nonlinear mixed model to capture information from the entire profile of each dosage unit and to model sources of variation in the data explicitly. By using information from the entire dissolution profile, instead of a subset of data at a few time points, we hoped to obtain more reliable inferences. For the tolerance interval, previous researchers relied on bootstrapping the original data or the residuals and observed problems with relatively low coverage. We use a parametric bootstrap to overcome the discreteness of resampling the data values in small to moderate samples. The disadvantage to our proposal lies in the computational difficulties with mixed models and resampling as well as potential increases in the required number of samples in order to fit a nonlinear mixed model.
Chapter 2
Theory
2.1
Model
Pharmaceutical product manufacturers measure dissolution on multiple dosage units from numerous lots. A good model for the resulting data should account for the nonlinear shape of the dissolution profiles as well as the primary sources of error in the observed dissolution values. In addition to assay variability and variation among dosage units, lot-to-lot variability may be substantial. Knowing the relative magnitudes of these sources of error is useful for troubleshooting the manufacturing process. The following model explicitly accounts for these three sources of error and the nonlinear profile of the dissolution data. Let $y_{ijk}$ denote dissolution measured at time $k$ for dosage unit $j$ in manufacturing lot $i$ with $k = 1, 2, \ldots, n_{ij}$, $j = 1, 2, \ldots, m_i$, and $i = 1, 2, \ldots, w$. Let $t_{ijk}$ be the corresponding sampling time. Assume
$$y_{ijk} = f(t_{ijk}, \beta_{ij}) + e_{ijk}, \tag{2.1}$$
where $f(t_{ijk}, \beta_{ij})$ is a nonlinear function of the dosage unit-specific regression parameters $\beta_{ij}$. The $e_{ijk}$ represent random error arising from analytical variation in the
is constant for all dosage units at all times. This assumption could be relaxed.
Inter-unit variation is modeled through the dosage unit-specific regression parameters $\beta_{ij}$ ($p \times 1$). By assumption,
$$\beta_{ij} = A_{ij}\beta + l_i + b_{ij}. \tag{2.2}$$
The population parameter $\beta$ represents the fixed effects while $A_{ij}$ denotes a “design” matrix of unit-specific covariates that may include effects for storage conditions or time on stability, i.e. months post-production. The $b_{ij}$ are independent, normally distributed random effects arising from differences among dosage units. The $l_i$ represent independent, normally distributed random effects arising from variability among lots. By assumption, the $l_i$ and $b_{ij}$ have mean zero and common covariance matrices $H$ and $D$, respectively. The random effects $e_{ijk}$, $b_{ij}$, and $l_i$ are assumed independent for all $i$, $j$, and $k$. These assumptions could be relaxed.
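A short simulation may help to fix the three levels of variation in Equations 2.1 and 2.2. The Python sketch below makes several simplifying assumptions that are not part of the model as stated: $A_{ij}$ is the identity, $H$ and $D$ are diagonal, the mean function is the Weibull form of Equation 1.2 with a fixed hypothetical value of $c$, and all numerical values are invented for illustration.

```python
import math
import random


def weibull_mean(t, beta, c=63.2):
    """Mean dissolution at time t for beta = (q_max, t_c, theta), as in Equation 1.2."""
    q_max, t_c, theta = beta
    return q_max * (1.0 - math.exp(math.log(1.0 - c / q_max) * (t / t_c) ** theta))


def simulate_lots(beta, sd_lot, sd_unit, sd_error, times, n_lots, n_units, rng):
    """Draw y_ijk = f(t_ijk, beta_ij) + e_ijk with beta_ij = beta + l_i + b_ij.

    Sketch assumptions: A_ij is the identity, and H and D are diagonal with
    standard deviations sd_lot and sd_unit for each component of beta.
    """
    data = []
    for i in range(n_lots):
        l_i = [rng.gauss(0.0, s) for s in sd_lot]        # lot effect, N(0, H)
        for j in range(n_units):
            b_ij = [rng.gauss(0.0, s) for s in sd_unit]  # unit effect, N(0, D)
            beta_ij = [b + l + u for b, l, u in zip(beta, l_i, b_ij)]
            for t in times:
                y = weibull_mean(t, beta_ij) + rng.gauss(0.0, sd_error)  # assay error
                data.append((i, j, t, y))
    return data


# Six tablets from each of three lots, sampled at the times given in Section 1.1.
rng = random.Random(2006)
profiles = simulate_lots(
    beta=(100.0, 4.0, 1.2), sd_lot=(1.0, 0.2, 0.02), sd_unit=(2.0, 0.3, 0.05),
    sd_error=1.5, times=[0.5, 1, 2, 4, 6, 8], n_lots=3, n_units=6, rng=rng,
)
```

Each tuple holds the lot index, the unit index, the sampling time, and the simulated response. All units in a lot share the same draw of $l_i$, each unit has its own $b_{ij}$, and every measurement has its own $e_{ijk}$, which is how the model separates lot-to-lot, unit-to-unit, and assay variation.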
2.2
Algorithm
Within this modeling framework, we wish to determine upper, U, and lower, L, tolerance limits to contain 100λ% of the population at a specified time $t_0$ with 100γ% confidence. Assume $Y(t_0, \beta_0, \sigma)$ is the dissolution for a randomly chosen tablet at time $t_0$ with intra-individual variance $\sigma^2$ and regression coefficient $\beta_0$. Following the model described above, the coefficient $\beta_0$ is a randomly sampled set of parameters from a normal distribution with mean $\beta$ and covariance matrix $H + D$. Let $p(t_0, L, U) = \Pr[L \le Y(t_0, \beta_0, \sigma) \le U]$. Then, the lower, L, and upper, U, bounds of the tolerance interval should satisfy
$$\Pr\{p(t_0, L, U) \ge \lambda\} \ge \gamma. \tag{2.3}$$
Since the model implies that the responses $y_{ijk}$ result from a convolution of a
than $100(\lambda+1)/2$% of the population with $100(\gamma+1)/2$% confidence and the lower tolerance limit L less than $100(\lambda+1)/2$% of the population with $100(\gamma+1)/2$% confidence. The resulting (asymmetric) interval (L, U) contains 100λ% of the population with at least 100γ% confidence.
The following algorithm produces the proposed tolerance interval (L, U):
1. Select starting values, $L^{(0)}$ and $U^{(0)}$, for L and U by calculating standard normal-theory tolerance limits for a simple random sample using the mean and standard deviation of observed dissolution values at $t_0$. (These are readily available in most statistical software packages.)
2. Estimate the individual regression parameters $\hat\beta_{ij}$ for each dissolution profile using ordinary least squares. Pool the residuals from each dosage unit to estimate the within-unit variance parameter $\sigma^2$.
3. Estimate the population parameters by their normal theory maximum likelihood estimates $\hat\beta$, $\hat H$, and $\hat D$. These estimates can be obtained with standard linear mixed effects modeling software or with an iterative EM algorithm referred to as the global two-stage method (Steimer, 1984; Davidian and Giltinan, 1995). Appendix B provides additional details on the model estimation in Steps 2 and 3.
4. Create $B_1$ new parametric bootstrap data sets, indexed by $h = 1, \ldots, B_1$. For each $h$, this requires generating responses $y_{ijk}^{(h)}$ corresponding to each of the observed $y_{ijk}$. Specifically, randomly select $b_{ij}^{(h)}$ from the $N(0, \hat D)$ distribution and $l_i^{(h)}$ from the $N(0, \hat H)$ distribution. Calculate $\beta_{ij}^{(h)} = A_{ij}\hat\beta + l_i^{(h)} + b_{ij}^{(h)}$. Then, create the responses $y_{ijk}^{(h)} = f(t_{ijk}, \beta_{ij}^{(h)}) + e_{ijk}^{(h)}$ using the generated $\beta_{ij}^{(h)}$ and randomly sampling $e_{ijk}^{(h)}$ from the $N(0, \hat\sigma^2)$ distribution. Repeating this process for $h = 1, \ldots, B_1$ results in $B_1$ bootstrap data sets, each of which corresponds to the original observed data.
6. For each bootstrap data set, estimate $p(t_0, L) = \Pr[L \le Y(t_0, \beta_0, \sigma)]$ and $p(t_0, U) = \Pr[Y(t_0, \beta_0, \sigma) \le U]$ as described in detail below.
7. Count the proportion of the $p(t_0, L)$ from the $B_1$ bootstrap data sets that are $\ge (\lambda+1)/2$, and count the proportion of the $p(t_0, U)$ that are $\ge (\lambda+1)/2$.
8. Systematically modify L and U, repeating Steps 6 and 7, to obtain tolerance interval limits such that the proportion of $p(t_0, L)$ that are $\ge (\lambda+1)/2$ is $\approx (1+\gamma)/2$ and the proportion of $p(t_0, U)$ that are $\ge (\lambda+1)/2$ is $\approx (1+\gamma)/2$.
The remainder of this section provides additional details for finding the upper tolerance limit U. The lower tolerance limit L can be found similarly. With minor modifications, the algorithm could also be used to calculate a symmetric tolerance interval (L, U).
In order to estimate p(t0, U) in Step 6, define the indicator function
\[
I\{Y(t_0, \beta_0, \sigma) \le U\} =
\begin{cases}
1 & \text{if } Y(t_0, \beta_0, \sigma) \le U \\
0 & \text{otherwise.}
\end{cases}
\tag{2.4}
\]
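Chu et al. (2001) explain how to estimate Pr{Y(t0, β0, σ) ≥ d} by considering the iterated expectation. Adapted to this situation,
\begin{align*}
p(t_0, U) &= \Pr\left[Y(t_0, \beta_0, \sigma) \le U\right] = E\left[I\{Y(t_0, \beta_0, \sigma) \le U\}\right] \\
&= E\left( E\left[ I\{Y(t_0, \beta_0, \sigma) \le U\} \mid \beta_0 = \tilde{\beta} \right] \right) \\
&= E\{\Psi_{t_0,U}(\tilde{\beta}, \sigma)\}.
\end{align*}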
The iterated expectation suggests first choosing a random dosage unit and conditioning on the dosage unit-specific parameter $\tilde{\beta}$. Then $\Psi_{t_0,U}(\tilde{\beta}, \sigma)$ is the within-unit
equivalently sample $l_0$ from the $N(0, \hat{H})$ distribution, sample $b_0$ from the $N(0, \hat{D})$ distribution, and calculate $\tilde{\beta} = \hat{\beta} + l_0 + b_0$. For a given $\tilde{\beta}$, $y_{ijk}$ follows a normal distribution with mean $f(t_0, \tilde{\beta})$ and variance $\sigma^2$. Therefore, the within-dosage unit expectation
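\begin{align*}
\Psi_{t_0,U}(\tilde{\beta}, \sigma) &= E\left[ I\{Y(t_0, \beta_0, \sigma) \le U\} \mid \beta_0 = \tilde{\beta} \right] \tag{2.5} \\
&= \Pr\left[ Y(t_0, \beta_0, \sigma) \le U \mid \beta_0 = \tilde{\beta} \right] \\
&= \Phi\left( \frac{U - f(t_0, \tilde{\beta})}{\sigma} \right)
\end{align*}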
where Φ is the cumulative distribution function of a standard normal. Substituting the estimate $\hat{\sigma}$ for σ yields an estimate of the within dosage unit expectation $\Psi_{t_0,U}(\tilde{\beta}, \sigma)$ for a given $\tilde{\beta}$. Then,
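\[
p(t_0, U) = E\{\Psi_{t_0,U}(\tilde{\beta}, \sigma)\} = \int \Psi_{t_0,U}(b, \sigma)\, dF^{N}_{\beta}(b),
\]
where $F^{N}_{\beta}(b)$ is the cumulative normal distribution function of the random coefficients $\tilde{\beta}$. We can estimate p(t0, U) by replacing the parameters with their estimates $\hat{\beta}$, $\hat{H}$, $\hat{D}$, and $\hat{\sigma}$ and approximating the integral by the empirical average
\[
\hat{p}(t_0, U) = \frac{1}{B_2} \sum_{r=1}^{B_2} \Psi_{t_0,U}(\tilde{\beta}_r, \hat{\sigma}), \tag{2.6}
\]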
where $\tilde{\beta}_r$, r = 1, 2, ..., B2, are independent samples from the $N(\hat{\beta}, \hat{H} + \hat{D})$ distribution. Note that estimation of p(t0, U) occurs in Step 6 and must be completed separately for each of the B1 bootstrap data sets created in Step 4. Thus, the estimates $\hat{\beta}$, $\hat{H}$, $\hat{D}$, and $\hat{\sigma}$ referenced above are actually $\hat{\beta}^{(h)}$, $\hat{H}^{(h)}$, $\hat{D}^{(h)}$, and $\hat{\sigma}^{(h)}$ when estimating p(t0, U) for bootstrap data set h.
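The empirical average in Equation 2.6 can be sketched in a few lines of Python. This is a minimal illustration, not the MATLAB code used for the examples: the function names (`cholesky`, `estimate_p_upper`, `linear_mean`) are hypothetical, the covariance argument stands for the sum $\hat{H} + \hat{D}$, and the numerical inputs are borrowed from the linear mixed model of Section 3.3 (with no lot effect) only because the exact answer is then available for comparison.

```python
import math
import random
from statistics import NormalDist


def cholesky(cov):
    """Lower-triangular Cholesky factor of a small symmetric positive definite matrix."""
    n = len(cov)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(cov[i][i] - s)
            else:
                low[i][j] = (cov[i][j] - s) / low[j][j]
    return low


def estimate_p_upper(f, t0, upper, beta_hat, cov_hat, sigma_hat, b2=1000, rng=None):
    """Equation 2.6: average Phi((U - f(t0, beta_r)) / sigma_hat) over B2 draws.

    cov_hat plays the role of H-hat + D-hat (an assumption of this sketch).
    """
    rng = rng or random.Random()
    low = cholesky(cov_hat)
    phi = NormalDist().cdf
    total = 0.0
    for _ in range(b2):
        # Draw beta_r ~ N(beta_hat, cov_hat) via the Cholesky factor.
        z = [rng.gauss(0.0, 1.0) for _ in beta_hat]
        beta_r = [
            m + sum(low[i][k] * z[k] for k in range(i + 1))
            for i, m in enumerate(beta_hat)
        ]
        total += phi((upper - f(t0, beta_r)) / sigma_hat)
    return total / b2


def linear_mean(t, beta):
    """Illustrative mean function: intercept plus slope times time."""
    return beta[0] + beta[1] * t


# Illustrative inputs (Section 3.3 values): t0 = 4, U = 40, sigma = 1.
p_hat = estimate_p_upper(
    linear_mean, 4.0, 40.0,
    beta_hat=[0.0, 8.0],
    cov_hat=[[1.0, 0.32], [0.32, 0.64]],
    sigma_hat=1.0,
    b2=20000,
    rng=random.Random(1),
)
# For a linear mean function the marginal distribution is N(32, 14.8),
# so the Monte Carlo estimate can be checked against the exact probability.
p_exact = NormalDist(32.0, math.sqrt(14.8)).cdf(40.0)
```

For a nonlinear mean function such as the Weibull or logistic curve, no closed form like `p_exact` is available, which is why the Monte Carlo average is needed in Step 6.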
The final step in the algorithm involves systematically modifying $U^{(c)}$, the current value of the upper tolerance limit, to find an upper tolerance limit U that is greater than 100(λ+1)/2% of the population with 100(γ+1)/2% confidence. For this step, let ε be the error margin, i.e. ε = 1/B1, where B1 is the number of bootstrap samples from Step 4. The goal is to find U such that
\[
\frac{\gamma+1}{2} \le \Pr\left[ p(t_0, U) \ge \frac{\lambda+1}{2} \right] < \frac{\gamma+1}{2} + \epsilon .
\]
At the first iteration, let $U_{small} = L^{(0)}$, the starting value for the lower tolerance limit found in Step 1, and let $U_{large} = U_{small} - 1$. To find $U^{(c+1)}$, consider the following three-part decision rule:
• If
\[
\frac{\gamma+1}{2} \le \Pr\left[ p(t_0, U^{(c)}) \ge \frac{\lambda+1}{2} \right] < \frac{\gamma+1}{2} + \epsilon,
\]
then $U^{(c)}$ is a suitable upper tolerance limit.
• If
\[
\Pr\left[ p(t_0, U^{(c)}) \ge \frac{\lambda+1}{2} \right] < \frac{\gamma+1}{2},
\]
then $U^{(c)}$ is too small. Set $U_{small} = U^{(c)}$. If $U_{large} < U_{small}$, i.e. we have not yet found a value that is too large, then set $U^{(c+1)} = U^{(c)} + 2$. If, instead, $U_{large} > U_{small}$, i.e. we have found a value that is too large, then let $U^{(c+1)}$ be the midpoint between $U^{(c)}$ and $U_{large}$.
• If
\[
\Pr\left[ p(t_0, U^{(c)}) \ge \frac{\lambda+1}{2} \right] \ge \frac{\gamma+1}{2} + \epsilon,
\]
then $U^{(c)}$ is larger than necessary. Set $U_{large} = U^{(c)}$ and let $U^{(c+1)}$ be the midpoint between $U^{(c)}$ and $U_{small}$.
The sequence U(0), U(1), U(2), . . . generated by this scheme converges to a suitable upper tolerance limit U.
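The three-part decision rule can be sketched as a short search routine. The sketch below is an illustration under stated assumptions: `prop_at_least` is a hypothetical callable that returns, for a candidate limit U, the proportion of the B1 bootstrap estimates of p(t0, U) that are at least (λ+1)/2 (Steps 6 and 7); the `max_iter` guard is an addition that is not part of the algorithm as described; and the usage example replaces the bootstrap proportion with a smooth stand-in function so that it runs on its own.

```python
from statistics import NormalDist


def find_upper_limit(prop_at_least, u_start, l_start, gamma, eps, max_iter=200):
    """Search for U with (gamma+1)/2 <= prop_at_least(U) < (gamma+1)/2 + eps."""
    target = (gamma + 1) / 2
    u_small = l_start          # starting lower limit L(0), assumed too small for U
    u_large = u_small - 1      # u_large < u_small flags "no too-large value found yet"
    u = u_start
    for _ in range(max_iter):
        prop = prop_at_least(u)
        if target <= prop < target + eps:
            return u                       # suitable upper tolerance limit
        if prop < target:                  # current value is too small
            u_small = u
            if u_large < u_small:
                u = u + 2                  # keep stepping upward
            else:
                u = (u + u_large) / 2      # bisect toward the too-large value
        else:                              # current value is larger than necessary
            u_large = u
            u = (u + u_small) / 2          # bisect toward the too-small value
    raise RuntimeError("no suitable limit found within max_iter iterations")


# Stand-in for the bootstrap proportion: a smooth increasing function of U.
stand_in = NormalDist(10.0, 1.0).cdf
u_found = find_upper_limit(stand_in, u_start=5.0, l_start=0.0, gamma=0.95, eps=0.001)
```

In the actual procedure the proportion is a step function of U with jumps of size 1/B1, which is why the error margin ε is set to 1/B1.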
2.3 Examples
2.3.1 Simulated Data
A number of models have been proposed for dissolution curves, including exponential, probit, Gompertz, and logistic models, any of which could be used with this method of constructing tolerance limits. For this example, we simulated dissolution profiles for sixty dosage units using the Weibull model in Equation 1.2 with observed dissolution times t = 0.5, 1, 2, 4, 6, 8, 12, and 16 hours. To imitate a stability study, twelve profiles from the same lot were simulated for each of the post-production times 0, 1, 3, 6, and 12 months. We let the constant c = 50 and varied the population parameters Qmax, Tc, and θ from one month to another, assigning them the values in Table 2.1. Following the previous notation, we generated the parameter βi for dosage unit i from the equation βi = Aiβ + bi where β was the vector containing all of the population parameters Qmax, Tc, and θ. Since we assumed all of the dosage units originated from the same lot, as might occur in a stability study, we did not need a model component for the random lot effect. The “design” matrix Ai selected the appropriate parameters for dosage unit i within a particular month. We randomly generated the bi, which add inter-unit variation, from a normal distribution with mean zero and covariance matrix
\[
D = \begin{pmatrix}
64 & -0.7 & -0.3 \\
-0.7 & 0.10 & -0.006 \\
-0.3 & -0.006 & 0.004
\end{pmatrix}.
\]
The responses $y_{ij}$ for each dosage unit were constructed as $y_{ij} = f(t_{ij}, \beta_i) + e_{ij}$, substituting the Weibull model from Equation 1.2 for the mean function $f(t_{ij}, \beta_i)$, where $\beta_i$ was the unit-specific parameter calculated above and $t_{ij}$ equaled the dissolution times 0.5, 1, 2, 4, 6, 8, 12, and 16 hours. The intra-unit errors $e_{ij}$ were independently generated from a normal distribution with mean zero and unit variance.
Table 2.1: Fixed effect parameters and estimates for Weibull simulated data.
Months   Qmax   Q̂max   Tc    T̂c    θ      θ̂
0        84     85      4.0   4.0   0.70   0.69
1        85     88      3.9   3.8   0.72   0.70
3        87     83      3.7   3.8   0.75   0.77
6        90     90      3.6   3.6   0.74   0.75
12       96     96      3.0   2.8   0.80   0.80
intra-unit variance was $\hat{\sigma}^2 = 0.944$, and the estimated covariance matrix was
\[
\hat{D} = \begin{pmatrix}
36 & -0.14 & -0.18 \\
-0.14 & 0.08 & -0.004 \\
-0.18 & -0.004 & 0.003
\end{pmatrix}.
\]
Using the algorithm in Section 2.2 with B1 = B2 = 1000 bootstrap samples in each iteration, we calculated asymmetric 95% tolerance intervals to contain 99% of the population dissolution values at each of the three dissolution times t0 = 0.5, 4, and 12 hours when stability time equaled 0 months. Fitting the model and calculating the three tolerance intervals in MATLAB® took 13 minutes on a 2 GHz Pentium 4 computer.
[Figure: simulated dissolution profiles in five panels (0, 1, 3, 6, and 12 months); x-axis: Time (hours); y-axis: % l.c.]
2.3.2 Tsong Data
For a second example, we apply our method to the frequently discussed dissolution data published by FDA reviewers Tsong and Hammerstrom (1994) and later by Chen and Tsong (1997). The data consist of dissolution values for 48 tablets at times t = 1, 2, 3, 4, 6, 8, and 10 hours. According to Chen and Tsong’s paper, the dissolution values for the first 36 tablets (graphed in Figure 2.2) were simulated based on the dissolution profile of one drug product while the dissolution values for the remaining 12 tablets (graphed in Figure 2.3) were simulated based on a second drug product. Following their example, we let the dissolution values from the first 36 tablets represent data from standard “reference” lots and use these dissolution profiles to calculate tolerance limits at the t0 = 1, 4, and 10 hour time points. Then, we compare the dissolution values from the last 12 tablets, which represent tablets from a “test” lot, against the calculated limits.
In this example, we modeled each individual tablet profile with a logistic curve,
\[
f(t_{ij}, \beta_i) = \frac{\beta_{1i}}{1 + \exp(\beta_{2i} + \beta_{3i} t_{ij})}, \tag{2.8}
\]
because this function appeared to fit the reference data better than the Weibull model did. As before, we calculated asymmetric tolerance limits to contain 99% of individual dissolution values at a given time point with 95% confidence. Fitting the model and calculating the three tolerance intervals in MATLAB® took 58 minutes on a 2 GHz Pentium 4 computer. The calculations for this example took much longer than those for the previous example largely because the EM algorithm did not converge as quickly. Almost 10% of the created bootstrap data sets were rejected because the EM algorithm failed to converge in 1000 iterations.
tolerance intervals. As a result, when we consider a single dissolution profile from the test data, there is little, if any, evidence that the dissolution values from this profile did not originate from the same population of dissolution values as the reference data. However, if we compare the entire set of 36 reference data values to the entire set of 12 test data values at a particular time point, we note obvious differences in the distribution of these values relative to the calculated tolerance limits. (For example, Figures 2.2 and 2.3 show the data values along with the calculated tolerance limits for t0 = 10 hours.)
It is arguable that we are not as interested in whether the dissolution of a single tablet met the specification limits at time t0 as we are in whether the collection of 12 observed dissolution values at t0 from the test batch could reasonably have been taken from the same distribution as the dissolution values from the reference batch. In this case, we might prefer to calculate a symmetric tolerance interval for the mean dissolution of 12 tablets. Our method of determining tolerance limits easily adapts to this situation, requiring only that we substitute estimation of $p_{\bar{y}}(t_0, L, U)$ for estimation of $p(t_0, L, U)$. In particular, this requires us to generate sets of 12 random tablet values for each new observation $\tilde{\beta}$ sampled from the $N(\hat{\beta}, \hat{D})$ distribution and calculate the mean of each set. Then, analogous to the p(t0, U) case, we find the conditional expectation $E\left[ I\{L \le \bar{y}_{12}(t_0, \beta_0, \sigma) \le U\} \mid \beta_0 = \tilde{\beta} \right]$ as in Equation 2.5, substituting the estimated standard deviation of a mean of 12 observations for σ. Finally, we estimate $p_{\bar{y}}(t_0, L, U)$ with the empirical average across all $\tilde{\beta}_r$, where $\tilde{\beta}_r$, r = 1, 2, ..., B2, are independent samples from the $N(\hat{\beta}, \hat{D})$ distribution. Tolerance limits could be calculated in a similar manner for any population statistic of interest.
For illustration, we calculated symmetric 95% tolerance limits to contain 99% of the mean dissolution values of twelve tablets at t0 = 1, 4, and 10 hours, obtaining the intervals (39.7, 48.6), (62.2, 71.4), and (78.6, 90.8), respectively. The means of the 12 dissolution values from the test batch were 36.5 at t0 = 1 hour, 67.9 at
[Figure 2.2: Tsong Reference Data; x-axis: Time (hours); y-axis: % Label Claim]
[Figure 2.3: Tsong Test Data; x-axis: Time (hours); y-axis: % Label Claim]
2.4 Discussion
This chapter described a nonlinear mixed effect model and bootstrap tolerance interval for evaluating dissolution specification limits. The model explicitly accounted for the three primary sources of variation in dissolution measurements: analytical variation, variation among dosage units from the same lot, and variation among manufacturing lots. The method of calculating tolerance intervals used all of the observed data in the dissolution profiles and allowed the calculation of a tolerance limit at any dissolution time, even if the time did not correspond to one of the times at which data were observed. Although the method may require additional data collection during drug development in order to fit an appropriate model and determine reliable dissolution specifications, the method results in specification limits set on the observed dissolution responses at a given dissolution time. Thus, this method requires minimal data collection and no model-fitting to test manufacturing lots against the established specifications. In addition, the method can be easily adapted to set specification limits for any population characteristic of interest.
Chapter 3
Simulations
The simulations summarized in this chapter explore the performance of tolerance limits constructed via parametric bootstrap resampling. Specifically, the simulations examine the achieved confidence levels (ACLs) of upper bootstrap tolerance limits in situations ranging from a simple random sample to a nonlinear mixed effect model. In most cases, the ACLs of the bootstrap tolerance limits are compared to the ACLs of tolerance limits calculated by SAS® proc capability. The calculations in SAS® assume the data consist of a simple random sample from a normally distributed population with unknown mean and variance. Appendix A provides a review of tolerance limits for simple random samples, including those calculated by SAS® (Section A.5). All simulations discussed in this chapter were written and executed using SAS® v.9.1 (SAS Institute, Cary, NC).
upper tolerance limit in proc capability as well as a one-sided upper tolerance limit using a bootstrap resampling method. Calculation of the bootstrap tolerance limit proceeded as follows:
1. Specify the proportion of the population, λ, that should lie below the tolerance limit and the desired confidence level, γ, of the tolerance limit.
2. Draw a sample, $y_1, y_2, ..., y_n$, from the $N(\mu, \sigma^2)$ distribution and calculate the usual sample mean, $\bar{y}$, and sample variance, $s^2$.
3. Create h = 1, 2, ..., B sets of parametric bootstrap parameter estimates. First, calculate a variance estimate, $s^2_{(h)} = s^2 X_{(h)}/(n-1)$, where $X_{(h)}$ is randomly sampled from a chi-squared distribution with (n−1) degrees of freedom. Then, randomly sample a mean estimate, $\bar{y}_{(h)}$, from the $N(\bar{y}, s^2/n)$ distribution.
4. For each of the h = 1, 2, ..., B sets of bootstrap parameter estimates, use the corresponding normal distribution function, $N(\bar{y}_{(h)}, s^2_{(h)})$, to find the upper 100λth percentile, $U_{\lambda(h)}$. Together, the B upper percentile values estimate the distribution function of the desired upper percentile.
5. Find the upper 100γ% confidence limit, $U_{TL}$, from the bootstrap empirical distribution function of the upper percentile values. Recall that the upper confidence limit, $U_{TL}$, on the 100λth percentile is an upper tolerance limit for the original normal distribution.
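Steps 2 through 5 can be sketched as follows. This is a minimal illustration in Python rather than the SAS program used for the simulations; the function name `bootstrap_upper_tolerance_limit` and the way the empirical quantile is indexed are assumptions of the sketch, and the chi-squared draw is obtained from the gamma distribution (a chi-squared variable with k degrees of freedom is a gamma variable with shape k/2 and scale 2).

```python
import math
import random
from statistics import NormalDist, mean, variance


def bootstrap_upper_tolerance_limit(y, lam, gamma, n_boot=1000, rng=None):
    """Parametric bootstrap upper tolerance limit for a normal simple random sample."""
    rng = rng or random.Random()
    n = len(y)
    ybar = mean(y)
    s2 = variance(y)                      # usual sample variance (divisor n - 1)
    z_lam = NormalDist().inv_cdf(lam)     # standard normal 100*lam-th percentile
    percentiles = []
    for _ in range(n_boot):
        # Step 3: bootstrap variance and mean estimates.
        x = rng.gammavariate((n - 1) / 2.0, 2.0)   # chi-squared with n - 1 df
        s2_h = s2 * x / (n - 1)
        ybar_h = rng.gauss(ybar, math.sqrt(s2 / n))
        # Step 4: upper percentile of N(ybar_h, s2_h).
        percentiles.append(ybar_h + z_lam * math.sqrt(s2_h))
    # Step 5: empirical 100*gamma% point of the bootstrap percentile values.
    percentiles.sort()
    idx = min(n_boot - 1, max(0, math.ceil(gamma * n_boot) - 1))
    return percentiles[idx]


# Illustrative use: one sample of 24 units from N(0, 16), with lam = 0.99.
data_rng = random.Random(0)
sample = [data_rng.gauss(0.0, 4.0) for _ in range(24)]
utl_95 = bootstrap_upper_tolerance_limit(sample, 0.99, 0.95, rng=random.Random(1))
utl_99 = bootstrap_upper_tolerance_limit(sample, 0.99, 0.99, rng=random.Random(1))
```

Repeating this calculation over many simulated samples and recording how often the limit exceeds the true 100λth percentile of the population gives the achieved confidence level reported in the tables.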
In each simulation, calculation of the parametric bootstrap tolerance limit was repeated for 1000 to 4000 samples. The achieved confidence level (ACL) was calculated by counting the proportion of the samples for which the $U_{TL}$ was greater than the
discarded. Results from some simulations appear in more than one table to make comparisons easier.
As expected, the normal-theory (NT) tolerance limits produce the correct ACLs for all combinations of λ, γ, and sample size (n). In contrast, the bootstrap method yields consistently low ACLs. The results in Table 3.1 suggest that increasing the number of bootstrap samples beyond 500-1000 provides no improvement in coverage. Increasing the sample size (n), however, improves the achieved coverage. This effect is seen in all of the tables, but summarized most clearly in Table 3.2.
For small sample sizes (e.g. 24), the bootstrap ACLs are markedly low. As the sample size increases, the bootstrap ACLs approach the stated confidence level,
γ. Interestingly, even a sample of 600 units may not be quite large enough to ensure adequate coverage. The bootstrap method performs well with samples of 1500 and 3000 units, but these sample sizes give such extensive knowledge of the population that distribution-free tolerance limits can be obtained for large values of λ and γ (Section A.4). Given that this is parametric bootstrap sampling from the correct distributional form, one might expect much smaller sample sizes to yield more accurate ACLs.
The bootstrap method yields low coverage at all of the stated confidence levels in Table 3.3; however, the magnitude of the difference between the stated confidence and the achieved confidence level decreases as the stated confidence level increases. This is good in practice since manufacturers tend to be interested in the most extreme confidence levels, e.g. 99% and 99.9%. Manufacturers also typically want to create tolerance intervals that cover a large proportion of the population, e.g. λ > 0.9. As shown in Table 3.4, the bootstrap method does quite well in achieving the stated confidence for an upper confidence limit on the median, i.e. an upper tolerance limit covering 50% of the population. As the percentile to cover increases, the ACL for the bootstrap method diverges further from the stated confidence level, although the magnitude of the shortfall is lessened by larger sample sizes and larger stated confidence levels.
Table 3.1: Simple Random Sample: Effect of Increasing the Number of Bootstrap Samples on the Achieved Confidence Level (ACL). In the table, bootstrap is abbreviated BT, and NT refers to normal-theory based intervals.
Columns: Simulation Runs, Sample Size (n), BT Samples, 100λ, 100γ, BT ACL, NT ACL
2000 24 250 99 95 88.7 95.3
2000 24 500 99 95 89.8 94.9
2000 24 1000 99 95 89.5 95.4
2000 24 5000 99 95 90.6 95.2
2000 24 10000 99 95 90.0 94.9
2000 24 20000 99 95 91.1 95.5
2000 120 250 99 95 92.7 94.7
2000 120 500 99 95 93.1 95.4
2000 120 1000 99 95 93.0 94.8
2000 120 5000 99 95 92.6 95.0
2000 120 10000 99 95 93.0 95.0
2000 120 20000 99 95 93.0 94.9
1000 60 250 99 97.5 93.6 97.3
1000 60 500 99 97.5 95.5 97.5
1000 60 1000 99 97.5 94.5 97.3
1000 60 5000 99 97.5 94.5 97.3
1000 60 10000 99 97.5 95.6 98.1
1000 60 20000 99 97.5 95.2 97.2
2000 24 250 99 99 95.8 99.0
2000 24 500 99 99 95.5 99.0
2000 24 1000 99 99 96.1 99.1
2000 24 5000 99 99 95.7 99.0
2000 24 10000 99 99 96.2 99.1
2000 24 20000 99 99 96.0 99.3
2000 120 250 99 99 97.9 99.1
2000 120 500 99 99 97.9 99.0
2000 120 1000 99 99 98.1 98.9
2000 120 5000 99 99 98.1 99.3
2000 120 10000 99 99 97.7 98.9
2000 120 20000 99 99 97.8 98.8
Note: Standard errors for the achieved confidence levels can be calculated as follows: $\text{s.e.}(\mathrm{ACL}) = \sqrt{\mathrm{ACL}(100 - \mathrm{ACL})/(\text{simulation runs})}$.
Table 3.2: Simple Random Sample: Effect of Increasing the Sample Size on the Achieved Confidence Level (ACL). In the table, bootstrap is abbreviated BT, and NT refers to normal-theory based intervals.
Columns: Simulation Runs, Sample Size (n), BT Samples, 100λ, 100γ, BT ACL, NT ACL
2000 24 1000 99 95 90.0 95.3
2000 60 1000 99 95 91.6 95.0
2000 120 1000 99 95 90.8 93.6
2000 600 1000 99 95 93.8 95.0
2000 1500 1000 99 95 94.9 95.3
2000 3000 1000 99 95 95.0 95.5
1000 24 1000 99 97.5 93.9 98.0
1000 60 1000 99 97.5 94.5 97.3
1000 120 1000 99 97.5 94.7 96.6
1000 600 1000 99 97.5 97.1 97.5
1000 1500 1000 99 97.5 97.5 98.3
1000 3000 1000 99 97.5 97.1 97.8
2000 24 1000 99 99 96.9 99.2
2000 60 1000 99 99 97.8 99.0
2000 120 1000 99 99 98.0 98.9
2000 600 1000 99 99 98.1 98.8
2000 1500 1000 99 99 98.9 99.2
2000 3000 1000 99 99 98.8 99.1
Note: Standard errors for the achieved confidence levels can be calculated as follows: $\text{s.e.}(\mathrm{ACL}) = \sqrt{\mathrm{ACL}(100 - \mathrm{ACL})/(\text{simulation runs})}$.
Table 3.3: Simple Random Sample: Effect of Increasing the Stated Confidence Level on the Achieved Confidence Level (ACL). In the table, bootstrap is abbreviated BT, and NT refers to normal-theory based intervals.
Columns: Simulation Runs, Sample Size (n), BT Samples, 100λ, 100γ, BT ACL, NT ACL
4000 24 1000 99 80 73.2 79.2
4000 24 1000 99 90 84.1 90.1
4000 24 1000 99 95 89.7 94.8
4000 24 1000 99 97.5 93.5 97.7
4000 24 1000 99 99 95.8 99.1
4000 24 1000 99 99.9 98.7 99.8
4000 60 1000 99 80 74.9 79.1
4000 60 1000 99 90 86.6 90.0
4000 60 1000 99 95 91.3 94.6
4000 60 1000 99 97.5 95.5 97.9
4000 60 1000 99 99 97.4 98.9
4000 60 1000 99 99.9 99.2 99.9
4000 120 1000 99 80 77.5 80.1
4000 120 1000 99 90 87.7 90.6
4000 120 1000 99 95 92.8 95.2
4000 120 1000 99 97.5 95.9 97.7
4000 120 1000 99 99 97.6 98.9
4000 120 1000 99 99.9 99.6 99.9
Note: Standard errors for the achieved confidence levels can be calculated as follows: $\text{s.e.}(\mathrm{ACL}) = \sqrt{\mathrm{ACL}(100 - \mathrm{ACL})/(\text{simulation runs})}$.
Table 3.4: Simple Random Sample: Effect of Increasing the Percentile to Cover on the Achieved Confidence Level (ACL). In the table, bootstrap is abbreviated BT, and NT refers to normal-theory based intervals.
Columns: Simulation Runs, Sample Size (n), BT Samples, 100λ, 100γ, BT ACL, NT ACL
2000 24 1000 50 95 93.6 94.1
2000 24 1000 80 95 91.1 95.1
2000 24 1000 90 95 90.0 95.1
2000 24 1000 95 95 90.6 95.9
2000 24 1000 99 95 89.5 95.2
2000 24 1000 99.5 95 89.5 95.5
2000 24 1000 50 99 98.2 99.0
2000 24 1000 80 99 97.1 99.0
2000 24 1000 90 99 96.1 98.8
2000 24 1000 95 99 96.8 99.0
2000 24 1000 99 99 96.0 99.0
2000 24 1000 99.5 99 96.3 99.1
1000 60 1000 50 97.5 97.2 97.4
1000 60 1000 80 97.5 96.4 98.3
1000 60 1000 90 97.5 95.5 97.6
1000 60 1000 95 97.5 95.1 98.2
1000 60 1000 99 97.5 94.5 97.3
1000 60 1000 99.5 97.5 94.1 97.3
2000 120 1000 50 95 94.3 94.7
2000 120 1000 80 95 94.4 95.8
2000 120 1000 90 95 92.6 94.7
2000 120 1000 95 95 91.7 94.2
2000 120 1000 99 95 92.9 95.0
2000 120 1000 99.5 95 92.7 95.2
2000 120 1000 50 99 98.9 99.0
2000 120 1000 80 99 98.4 99.0
2000 120 1000 90 99 97.8 98.7
2000 120 1000 95 99 98.3 99.3
2000 120 1000 99 99 98.1 99.0
2000 120 1000 99.5 99 97.9 99.1
Note: Standard errors for the achieved confidence levels can be calculated as follows: $\text{s.e.}(\mathrm{ACL}) = \sqrt{\mathrm{ACL}(100 - \mathrm{ACL})/(\text{simulation runs})}$.
Table 3.5: Simple Random Sample: Effect of a Known Population Mean on the Achieved Confidence Level (ACL). In the table, bootstrap is abbreviated BT.
Columns: Simulation Runs, Sample Size (n), BT Samples, 100λ, 100γ, BT ACL
2000 24 1000 90 95 89.7
1000 24 1000 90 97.5 93.9
2000 24 1000 90 99 95.8
2000 24 1000 99 95 90.0
1000 24 1000 99 97.5 94.3
2000 24 1000 99 99 95.5
2000 60 1000 90 95 90.7
1000 60 1000 90 97.5 96.5
2000 60 1000 90 99 97.4
2000 60 1000 99 95 92.5
1000 60 1000 99 97.5 94.3
2000 60 1000 99 99 97.0
Note: Standard errors for the achieved confidence levels can be calculated as follows: $\text{s.e.}(\mathrm{ACL}) = \sqrt{\mathrm{ACL}(100 - \mathrm{ACL})/(\text{simulation runs})}$. For example, $\sqrt{90(10)/2000} = 0.67$ and $\sqrt{95(5)/2000} = 0.49$.
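The standard error formula in the table notes is easy to evaluate directly. The helper below is a small sketch (the function name `acl_standard_error` is hypothetical) that reproduces the two worked values from the note, with the ACL expressed in percent.

```python
import math


def acl_standard_error(acl_percent, simulation_runs):
    """Binomial standard error of an achieved confidence level, in percentage points."""
    return math.sqrt(acl_percent * (100.0 - acl_percent) / simulation_runs)


se_90 = round(acl_standard_error(90, 2000), 2)   # 0.67
se_95 = round(acl_standard_error(95, 2000), 2)   # 0.49
```

As a reading aid, a bootstrap ACL of 89.5 against a stated level of 95 with 2000 runs (Table 3.1) is a shortfall of 5.5 percentage points, which is more than ten standard errors at the stated level.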
are the estimation of µ and σ² and the discreteness of the empirical distribution function produced by the bootstrap samples. Since increasing the number of bootstrap samples beyond 1000 did not noticeably improve the ACLs, the discreteness of the bootstrap-created empirical distribution function does not appear to be a primary source of the low coverage. To investigate the effect of estimating µ, we calculated upper bootstrap tolerance limits for a few cases assuming the population mean was known. Specifically, this simulation followed the steps described earlier except each bootstrap sample mean, $\bar{y}_{(h)}$, was set equal to the true mean, µ = 0, instead of being randomly sampled from the $N(\bar{y}, s^2/n)$ distribution. The results in Table 3.5 show no substantial improvement in the persistently low bootstrap ACLs. Thus, estimation of the mean response does not appear to cause the low coverage when the sample includes at least 24 units.
Table 3.6: Simple Random Sample: Effect of a Known Population Standard Deviation on the Achieved Confidence Level (ACL). In the table, bootstrap is abbreviated BT.
Columns: Simulation Runs, Sample Size (n), BT Samples, 100λ, 100γ, BT ACL
2000 24 1000 90 95 95.0
1000 24 1000 90 97.5 97.0
2000 24 1000 90 99 99.0
2000 24 1000 99 95 93.8
1000 24 1000 99 97.5 96.9
2000 24 1000 99 99 98.5
2000 60 1000 90 95 94.9
1000 60 1000 90 97.5 98.6
2000 60 1000 90 99 98.9
2000 60 1000 99 95 94.8
1000 60 1000 99 97.5 96.9
2000 60 1000 99 99 99.0
Note: Standard errors for the achieved confidence levels can be calculated as follows: $\text{s.e.}(\mathrm{ACL}) = \sqrt{\mathrm{ACL}(100 - \mathrm{ACL})/(\text{simulation runs})}$. For example, $\sqrt{95(5)/2000} = 0.49$ and $\sqrt{99(1)/2000} = 0.22$.
the original procedure except this time, the calculations substituted σ² = 16 for each value of $s^2_{(h)}$. As shown in Table 3.6, the bootstrap ACLs now match the stated confidence levels, even when the samples include only 24 units.
In summary, the primary problem with achieved coverage of the bootstrap tolerance limits can be traced to estimation of the variance in the initial sample. Because finding a tolerance limit requires knowledge of the extreme tail of the distribution, the bootstrap tolerance limit is particularly sensitive to the variance estimation. With an extremely large sample size, the estimate of the variance is almost the same as knowing the variance. Consequently, the bootstrap tolerance interval achieves the correct confidence.
achieved coverage is almost correct. Also, when the desired confidence is extreme (99.9%) and the quantile to cover is very large (λ = 0.99), the bootstrap tolerance interval approaches the correct confidence level. In this case, there is probably little difference in probability between relatively large differences in tolerance limit values. Having captured nearly all of the distribution below the upper tolerance limit, there is simply not much probability above the tolerance limit even if the limit is low.
In general, however, for reasonable sample sizes, we should expect the parametric bootstrap tolerance interval procedure to yield achieved coverage levels that are consistently slightly low. The additional simulations that follow confirm this hypothesis.
3.2 Case II: A Random Intercept Model
In the second set of simulations, the mean model was chosen to be a linear function with a random intercept ($\beta_{0i} \sim N(\mu = 0, \sigma_b^2 = 2)$) and a fixed slope ($\beta_1 = 8$). Specifically,
\[
y_{ij} = \beta_{0i} + 8 t_{ij} + e_{ij} \tag{3.1}
\]
for i = 1, 2, ..., m dosage units and j = 1, 2, ..., n measurements on each unit. The simulations varied the number of dosage units and the number of time points at which dissolution was measured for each dosage unit. For the simulation runs with eight time points, the measurement times ($t_{ij}$) were 0.5, 1, 2, 4, 6, 8, 10, and 12 hours for all dosage units. The simulation run with seventy-three time points used measurement times starting at 0 and continuing at intervals of 0.16 hour through 11.52 hours for all dosage units. The residual errors followed a normal distribution with mean µ = 0 and variance $\sigma_e^2 = 1$. The random intercepts and residual errors were assumed to be independent.
limit method outlined in Section 2.2 rather than the simplified procedure used for Case I in Section 3.1. With the random intercept model, the more complicated procedure was unnecessary, but it was used to mimic the nonlinear mixed model case that is of primary interest for evaluating dissolution specification limits. Each simulation included 500 samples (simulation runs), 1000 outer bootstrap samples for each run, and 1000 inner bootstrap samples for each outer bootstrap sample. No samples were discarded, and each simulation used a different starting seed.
For comparison, the simulation program also used the data at the t0 = 4 hour time point to calculate standard normal-theory tolerance intervals using SAS® proc capability. For this set of simulations, the marginal distribution at $t_{ij} = 4$ hours is $N(\mu = 32, \sigma^2 = \sigma_b^2 + \sigma_e^2 = 3)$. Thus, the normal-theory tolerance limits should achieve the correct confidence level.
Table 3.7 summarizes the results from the simulations with the random intercept model. The table includes the length of time to run each simulation. Each run required considerably more time than in previous simulations due to the necessity of fitting thousands of random intercept models. (The first, second, and fourth runs were completed using a faster processor than the third and fifth runs.) As in previous simulations, the normal-theory tolerance limits consistently achieved the correct confidence level while the bootstrap tolerance limits showed persistently low coverage. The coverage did not significantly improve with moderate increases in the number of dosage units or the number of measurement times.
3.3 Case III: A Linear Mixed Model
The third set of simulations used a linear mixed model with random intercept and random slope. Specifically,
\[
y_{ij} = \beta_{0i} + \beta_{1i} t_{ij} + e_{ij} \tag{3.2}
\]
for i = 1, 2, ..., m dosage units and j = 1, 2, ..., n measurements on each unit. The measurement times ($t_{ij}$) were 0.5, 1, 2, 4, 6, 8, 10, and 12 hours for all dosage
variance $\sigma_e^2 = 1$. The intercept and slope followed a bivariate normal distribution (independent of the $e_{ij}$) with mean vector $(\beta_0, \beta_1) = (0, 8)$ and covariance matrix
\[
D = \begin{pmatrix}
1 & 0.32 \\
0.32 & 0.64
\end{pmatrix}.
\]
Each simulation calculated two tolerance limits at time t0 = 4 hours: a one-sided upper tolerance limit and a one-sided lower tolerance limit. Both tolerance limits were calculated using the bootstrap tolerance interval method outlined in Section 2.2. Again, this procedure was used to mimic the nonlinear mixed model case, although the procedure could have been simplified by relying on the knowledge that the marginal distribution at time t0 = 4 hours is normal for this model. For comparison, the simulation also used the data at the t0 = 4 hour time point to calculate standard normal-theory tolerance limits using SAS® proc capability. Once again, the normal-theory tolerance limits should achieve the correct confidence level.
These simulations varied the number of simulation runs, the number of inner bootstrap samples, the number of outer bootstrap samples, the confidence level (γ), the percentile on which to create the confidence interval (λ), and the number of dosage units in each sample. Each simulation run required between ten and one hundred forty-five hours to complete. Typically, about three percent of the samples were discarded due to errors in fitting the mixed model, primarily lack of a positive definite covariance matrix.
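The normal marginal distribution at t0 = 4 hours mentioned above can be written out explicitly. The following worked calculation is a sketch that uses only the model in Equation 3.2, the stated mean vector and covariance matrix D, and a residual variance of $\sigma_e^2 = 1$; the entries of D are written $D_{11}$, $D_{12}$, and $D_{22}$ for illustration.

\begin{align*}
E[y_{ij} \mid t_{ij} = 4] &= \beta_0 + 4\beta_1 = 0 + 4(8) = 32, \\
\mathrm{Var}[y_{ij} \mid t_{ij} = 4] &= D_{11} + 2(4)D_{12} + 4^2 D_{22} + \sigma_e^2 \\
&= 1 + 2.56 + 10.24 + 1 = 14.8 .
\end{align*}

Under these assumptions, a single observation at t0 = 4 hours is distributed as N(32, 14.8), which is the distribution the normal-theory limits rely on.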
As in previous simulations, the normal-theory tolerance limits consistently achieved the stated confidence level while the bootstrap tolerance limits exhibited consistently low coverage. Increasing the number of dosage units in each sample from 18 to 48 increased the number of samples for which the linear mixed model converged. This decreased the percent of samples discarded, and may have contributed to the slightly improved coverage levels observed with 48 dosage units per sample as opposed to 18 dosage units per sample (Table 3.8). Using intermediate numbers of dosage units (24 and 36) in each sample, however, showed no improvement in coverage over including 18 dosage units in each sample.