and $\varphi(X_i, Y_j) = 1$ if $X_i < Y_j$ and 0 otherwise.
\[ Z = U + \frac{n_2(n_2+1)}{2} \qquad (3.15) \]
This means that tests based on $U$ are equivalent to tests based on $Z$. If $n$ is large, we use the standard normal distribution as an approximation,
\[ Z^* = \frac{Z - \mu_z}{\sigma_z} \qquad (3.16) \]
where $\mu_z = \frac{n_1(n_1+n_2+1)}{2}$ and $\sigma_z = \sqrt{\frac{n_1 n_2 (n_1+n_2+1)}{12}}$.
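The relation (3.15) between $U$ and $Z$ can be checked numerically. The following is a minimal sketch, not part of the original text. It assumes that $Z$ denotes the sum of the ranks of the $Y$-observations in the combined ordered sample, that the $Y$-sample has $n_2$ observations, and that there are no ties; none of this is stated explicitly in this excerpt. The function names and the data values are hypothetical.

```python
from itertools import product


def pair_count_u(x, y):
    """U = number of pairs (x_i, y_j) with x_i < y_j (no ties assumed)."""
    return sum(1 for xi, yj in product(x, y) if xi < yj)


def rank_sum_of_y(x, y):
    """Sum of the ranks of the y-values in the combined ordered sample.

    Assumption: all values are distinct, so every value has a unique rank.
    """
    combined = sorted(list(x) + list(y))
    rank = {value: position + 1 for position, value in enumerate(combined)}
    return sum(rank[yj] for yj in y)


# Hypothetical illustrative data, not taken from the thesis.
x = [1.2, 3.4, 5.6]
y = [2.3, 4.5, 6.7, 7.8]

n2 = len(y)                      # assumed size of the Y-sample
u = pair_count_u(x, y)
z = u + n2 * (n2 + 1) // 2       # right-hand side of (3.15)

print("U =", u)
print("Z from (3.15) =", z)
print("rank sum of Y =", rank_sum_of_y(x, y))
```

Under these assumptions the two quantities differ only by the constant $n_2(n_2+1)/2$, which is why a rejection rule stated in terms of $U$ translates directly into one stated in terms of $Z$.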
3.3 NPI for the Reproducibility Probability
The reproducibility of a test is an important characteristic of the practical relevance of test outcomes. Recently there has been substantial interest in the reproducibility probability (RP), where not only its estimation but also its actual definition and interpretation are not uniquely determined in the classical frequentist statistics framework. NPI is a frequentist statistics approach that makes few assumptions, enabled by the use of lower and upper probabilities to quantify uncertainty, and which explicitly focuses on future observations. The explicitly predictive nature of NPI provides a natural formulation of inferences on RP.
In Sections 3.4, 3.5 and 3.6, we introduce the use of the NPI approach to RP (NPI-RP) for some basic nonparametric tests [20]. Applying NPI, for either real-valued or Bernoulli data, enables inference by deriving lower and upper probabilities for the event that a future test, of similar size and under similar circumstances as the first test, will lead to the same conclusion as the first test, that is, rejection or non-rejection of the null-hypothesis. Generally, we will use the acronym NPI-RP for such inferences. It is important to emphasize that we focus on the conclusion of the future test with regard to the null-hypothesis, given the actual data of the first test; so we do not consider an exact repetition in terms of the same value for the test statistic of interest or even for the actual observations, nor do we opt to use only the information from the first test that the null-hypothesis was rejected or not. As the strength of the first test’s conclusion depends on the actual data, it seems logical and important to use those data for inference on the reproducibility of the test result, while the prediction of the test result in a future test is more naturally reflected by the corresponding final conclusion, that is, rejection or non-rejection of the null-hypothesis. In Section 3.7 we briefly comment on the possibility, within the NPI framework, to use only the conclusion of the first test, but we consider this of less importance than the approach followed throughout this chapter.
As is clear from the brief comments on the literature on RP in Section 1.4, there have been several different formulations of the RP problem within the classical theory of frequentist statistics, where typically properties of an assumed underlying population are estimated. However, the very nature of RP seems to be predictive: given the data from the first test, one would like to predict the overall test conclusion for a second test, if such a further test were to have the same sample size(s) and were performed under similar circumstances. Hence the NPI approach is attractive, as it is a framework of frequentist statistics that explicitly considers future observations which are exchangeable with the available data observations. We should point out that the NPI framework does not require the sample size(s) in the actual (first) and future (second) tests to be the same, but this seems a natural assumption in order to reflect reproducibility, and we will restrict attention to this situation in this thesis. We present NPI-RP for any possible result of the first test, so both when it leads to rejection and when it leads to non-rejection of the null-hypothesis. As is clear from the discussion in Section 1.4, in practice one is often particularly interested in the reproducibility of tests that led to rejection of the null-hypothesis, as this tends to be the practically most important scenario, e.g. leading to new medication being introduced. However, for a complete view we believe that the reproducibility of tests that did not reveal a significant effect is also important. So, while our discussions (including the examples in this thesis) will mostly focus on the reproducibility of tests in cases where the null-hypothesis is rejected, we also consider RP in cases of non-rejection of the null-hypothesis. The NPI for Bernoulli observations is used for the NPI approach to reproducibility for the sign test, presented in Section 3.4.
The NPI for multiple real-valued observations is used for the NPI approach to the
reproducibility of the one-sample signed-rank test and the two-sample rank sum test, presented in Sections 3.5 and 3.6.
Before we consider NPI-RP for the sign test, which is perhaps the most basic nonparametric statistical test, we need to comment briefly on assumptions underlying statistical tests. Generally, when a statistical test is applied there are some modelling assumptions which, ideally, should be checked. For example, Wilcoxon’s one-sample signed-rank test, which we consider in Section 3.5, assumes that the population from which the sample is drawn is symmetric about the median. This assumption is important for the distribution of the test statistic under the null-hypothesis and ideally should be checked whenever the test is applied.
In the NPI-RP approach, given the $n$ data observations from the first test, we consider all possible different orderings of $n$ future observations and the $n$ data observations for this test, and then calculate lower and upper probabilities for the event that the test statistic based on such $n$ future observations will lead to rejection or non-rejection of the null-hypothesis. When doing so, one could argue that we should consider, for each of the $\binom{2n}{n}$ possible orderings of the $n$ future observations among the $n$ data observations, whether or not it is reasonable to assume that the $n$ future observations could have come from a population that is symmetric about its median. While this could be done, e.g. by using an appropriate pre-test, we do not do this for three reasons.
First, we will typically consider quite small data sets (although the approach can be applied for all sample sizes), in which case such an underlying assumption, when formulated as the null-hypothesis of a pre-test, would be rejected for only a few test results. Secondly, implementing such a pre-test for the predicted future samples would severely complicate both computation and analytic derivation of the results presented in this chapter. Thirdly, and most importantly, while testing such assumptions, or at least good awareness of such assumptions, is indeed important for the actual (first) test, the further tests performed on all the predictive, and hence hypothetical, future data sets are mainly done to gain insight into the corresponding values of the test statistic and the corresponding test conclusions; as we do not base the practically important overall conclusion on a single test out of these predictive tests, whether or not the predicted data would actually support the underlying assumption is of less relevance. So, generally, we do not consider such underlying assumptions in this chapter, but we will assume that the method is only applied where such assumptions seem reasonable for the actual data from the first test, as is common when such tests are applied.
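The number of orderings of $n$ future observations among $n$ data observations referred to above can be illustrated by direct enumeration. The following is a minimal sketch under stated assumptions, not the implementation used in this thesis. It assumes no ties, so that the $n$ ordered data observations create $n+1$ intervals, and it represents an ordering by recording, for each future observation, the index of the interval it falls in, without distinguishing the future observations from each other. The function name `future_orderings` is hypothetical.

```python
from itertools import combinations_with_replacement
from math import comb


def future_orderings(n):
    """Enumerate orderings of n future observations among n data observations.

    Assumption: no ties, so the n ordered data observations split the real
    line into n + 1 intervals, indexed 0..n. An ordering is represented by
    the non-decreasing tuple of interval indices of the n future observations.
    """
    return list(combinations_with_replacement(range(n + 1), n))


n = 3
orderings = future_orderings(n)

print("number of orderings:", len(orderings))
print("binomial(2n, n):    ", comb(2 * n, n))
print("first few orderings:", orderings[:4])
```

The enumeration grows quickly with $n$, which gives some indication of why an additional pre-test on every predicted future sample would add considerably to the computational effort.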