6.2 The ‘rare event’ probability estimator
6.2.1 The scope of the algorithm
Problem (6.2) can be solved in a simple way with a Monte Carlo simulation. It consists in randomly drawing N i.i.d. new codewords $\{\tilde{x}_j\}_{j=1}^{N}$ (according to the distribution given by the secret p). Sequence y was forged beforehand, therefore these are codewords of innocent ‘users’. We would like to test whether z is a proper value for $\tau(y, p)$. The Monte Carlo simulation gives the estimate $\hat{P}(S_{inn} > z) = N^{-1} |\{j \mid s(\tilde{x}_j, y, p) > z\}|$. This is simple but not efficient, as $N = O(1/P(S_{inn} > z))$ for a given estimation accuracy. It becomes hardly tractable when the level S on the probability of false positive is lower than $10^{-9}$.
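The scaling $N = O(1/P)$ can be made concrete with a minimal sketch. It is not the traitor tracing setup: the scalar Gaussian sample and the identity score are stand-ins chosen for illustration, and the names `crude_monte_carlo` and `required_draws` are hypothetical. The second function uses the fact that the crude estimator has variance $P(1-P)/N$.

```python
import math
import random

def crude_monte_carlo(sample, score, z, N, rng):
    """Fraction of N i.i.d. draws whose score exceeds z."""
    hits = sum(1 for _ in range(N) if score(sample(rng)) > z)
    return hits / N

def required_draws(P, rel_std):
    """Draws needed to reach a relative standard deviation rel_std,
    derived from Var(P_hat) = P (1 - P) / N."""
    return (1.0 - P) / (P * rel_std ** 2)

# Toy stand-in (assumption): X is a scalar standard normal, s(x) = x.
rng = random.Random(0)
p_hat = crude_monte_carlo(lambda r: r.gauss(0.0, 1.0), lambda x: x,
                          z=1.0, N=20000, rng=rng)
print(p_hat)                      # to compare with P(X > 1), about 0.159
print(required_draws(1e-9, 0.1))  # about 1e11 draws for 10% relative std
```

For a probability of $10^{-9}$ and a 10% relative standard deviation, this back-of-the-envelope count is of the order of $10^{11}$ draws, which illustrates why the crude approach becomes impractical.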
This section presents a ‘rare event’ simulation estimating small probabilities (or large quantiles) more efficiently. The scope of the algorithm is in fact much broader than traitor tracing. The general problem is to estimate the probability $P = P(s(X) > \tau)$. The algorithm is an adaptive version of Importance Splitting, a.k.a. Multilevel Splitting. Let us denote the distribution of X by $f_X$ and its definition set by $\mathcal{X}$. Our algorithm needs three routines:
• a (pseudo) random generator of independent samples distributed as $f_X$,
• the score function $s(\cdot) : \mathcal{X} \to \mathbb{R}$,
• a random replicator $r(\cdot) : \mathcal{X} \to \mathcal{X}$ which leaves the distribution $f_X$ invariant. This means that i) $r(x)$ is random and ii) the output $r(X)$ is distributed as $f_X$ if the input X is distributed as $f_X$.
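As an illustration, the three routines could look as follows for the test problem of Fig. 6.1, where $X \sim \mathcal{N}(0, I_{20})$ and $s(X) = X^\top u/\|X\|$. This is only a sketch under assumptions: the choice of `U`, the strength `MU` and the form of the replicator are hypothetical. The replicator relies on the fact that if X and W are independent standard Gaussian vectors, then $(X + \mu W)/\sqrt{1 + \mu^2}$ is again standard Gaussian.

```python
import math
import random

DIM = 20                        # dimension of X in the problem of Fig. 6.1
U = [1.0] + [0.0] * (DIM - 1)   # hypothetical unit vector u
MU = 0.3                        # hypothetical replicator strength

def generate(rng):
    """Independent sample distributed as f_X = N(0, I_DIM)."""
    return [rng.gauss(0.0, 1.0) for _ in range(DIM)]

def score(x):
    """s(x) = <x, u> / ||x||, the cosine of the angle between x and u."""
    norm = math.sqrt(sum(v * v for v in x))
    return sum(a * b for a, b in zip(x, U)) / norm

def replicate(x, rng):
    """Random move that leaves N(0, I_DIM) invariant."""
    c = math.sqrt(1.0 + MU * MU)
    return [(v + MU * rng.gauss(0.0, 1.0)) / c for v in x]
```

A larger `MU` yields proposals further away from the input, at the price of a lower chance of staying inside a given rare set.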
6.2.2 Adaptive Importance Splitting with fixed effort
The idea of Importance Splitting is to consider a sequence of nested events $A_N \subset A_{N-1} \subset \dots \subset A_1 \subset A_0$. In our case, we define them as $A_j = \{x \in \mathcal{X} \mid s(x) > \tau_j\}$ with $-\infty = \tau_0 < \tau_1 < \dots < \tau_{N-1} < \tau_N = \tau$. In other words, we would like to estimate $P = P(A_N)$, which can be decomposed into:
$$P = P(A_N) = P(A_N|A_{N-1})\,P(A_{N-1}|A_{N-2}) \cdots P(A_1). \quad (6.5)$$
Estimating P by numerical simulation is difficult because $A_N$ is a rare event, i.e. its probability is small. The equation above breaks the problem into N easier ones because the conditional probabilities are much larger. The algorithm estimates each conditional probability with a crude Monte Carlo simulation: over $n_j$ independent samples $\{X_i^{(j)}\}_{i=1}^{n_j} \subset A_j$, we count the number $k_{j+1}$ of samples which also belong to $A_{j+1}$, and set $\hat{P}(A_{j+1}|A_j) = k_{j+1}/n_j$. In practice, $n_j = n$ at each iteration, where n is a parameter of the algorithm.
Our algorithm is adaptive because the subsets $\{A_j\}_j$ are defined by intermediate thresholds $\{\tau_j\}_{j=1}^{N-1}$ which are chosen adaptively: at the j-th iteration, we fix $k_{j+1} = k$, a parameter of the algorithm lower than n, by setting $\tau_{j+1}$ to the (k + 1)-th largest score observed in $\{s(X_i^{(j)})\}_{i=1}^{n}$. In that way, exactly k samples have a score larger than $\tau_{j+1}$.
If this intermediate threshold is larger than the target τ, the algorithm stops and we count the number $k_N$ of scores larger than τ (which is greater than or equal to k). Note that the intermediate thresholds are random variables, and so is N, the total number of iterations. In the end, the estimate of $P = P(A_N)$ is given by
$$\hat{P} = \prod_{j=1}^{N} \hat{P}(A_j|A_{j-1}) = \rho^{N-1}\,\frac{k_N}{n}, \quad (6.6)$$
with $\rho = k/n$.
The main difficulty is the generation of the random samples $\{X_i^{(j)}\}_{i=1}^{n} \subset A_j$. This set is composed of the k samples of the previous iteration which belong to $A_j$, plus n − k ‘fresh’ new samples. A ‘fresh’ sample is generated as follows: we pick a sample Z uniformly at random in $A_j$ (among the k samples we already have), and we apply T iterations of the following routine:
$$\text{If } Y = r(Z) \in A_j, \text{ then } Z \leftarrow Y. \quad (6.7)$$
The random replicator proposes a random vector Y, which is accepted (i.e. it replaces Z) if $Y \in A_j$. Over T iterations, by constantly checking that $Z \in A_j$, we render the replicator invariant to $f_{X|A_j}$, i.e. the distribution $f_X$ conditioned on $A_j$. Moreover, as $T \to \infty$, the ‘fresh’ output sample becomes statistically independent of the initial sample $Z \in A_j$. We repeat this process n − k times. In the end, we have n i.i.d. samples distributed as $f_{X|A_j}$ (k samples from the previous iteration and n − k ‘fresh’ samples), which we use to estimate the next conditional probability $\hat{P}(A_{j+1}|A_j)$.
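The steps of this section can be put together in a short sketch. It is written under assumptions and is not the author's implementation: the function name `adaptive_splitting` is hypothetical, and the demonstration uses a toy scalar problem (X standard normal, s(x) = x, τ = 4) whose true probability is known in closed form, with a Gaussian replicator of hypothetical strength `MU`.

```python
import math
import random

def adaptive_splitting(generate, score, replicate, tau, n, k, T, rng):
    """Sketch of adaptive splitting with fixed effort; returns
    (estimate, number of iterations N, intermediate thresholds)."""
    xs = [generate(rng) for _ in range(n)]
    ss = [score(x) for x in xs]
    rho, N, thresholds = k / n, 1, []
    while True:
        order = sorted(range(n), key=lambda i: ss[i], reverse=True)
        tau_next = ss[order[k]]            # (k+1)-th largest score
        if tau_next > tau:                 # target reached: last factor k_N / n
            k_N = sum(1 for s in ss if s > tau)
            return rho ** (N - 1) * k_N / n, N, thresholds
        thresholds.append(tau_next)
        survivors = [(xs[i], ss[i]) for i in order[:k]]
        fresh = []
        for _ in range(n - k):
            z, sz = rng.choice(survivors)  # uniform pick among the k survivors
            for _ in range(T):             # routine (6.7)
                y = replicate(z, rng)
                sy = score(y)
                if sy > tau_next:
                    z, sz = y, sy
            fresh.append((z, sz))
        xs = [x for x, _ in survivors + fresh]
        ss = [s for _, s in survivors + fresh]
        N += 1

# Toy problem (assumption): X ~ N(0, 1), s(x) = x, tau = 4.
MU = 0.5  # hypothetical replicator strength
def gen(rng): return rng.gauss(0.0, 1.0)
def s(x): return x
def r(x, rng): return (x + MU * rng.gauss(0.0, 1.0)) / math.sqrt(1.0 + MU * MU)

rng = random.Random(0)
p_hat, N, taus = adaptive_splitting(gen, s, r, tau=4.0, n=1000, k=500, T=20, rng=rng)
p_true = 0.5 * math.erfc(4.0 / math.sqrt(2.0))   # exact Gaussian tail
print(p_hat, p_true, N)
```

The sketch keeps the top k samples by sorting rather than by comparing scores with the threshold, which sidesteps ties between scores; with continuous scores both choices coincide.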
6.2.3 Properties
If we assume that the ‘fresh’ sample is always different from the input from which it has been derived (i.e. at least one of the T applications of the replicator was accepted as a new sample in $A_j$), then the estimator is unbiased: $E(\hat{P}) = P$.
In practice, T is a finite iteration number, therefore the samples are a priori not independent. We suppose that T is big enough to provide independence. This is the only approximation made in the proof of a Central Limit Theorem [48, 46]:
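$$\sqrt{n}\,(\hat{P} - P) \xrightarrow[n \to \infty]{\text{law}} \mathcal{N}\left(0,\; P^2\left((N-1)\frac{1-\rho}{\rho} + \frac{n - k_N}{k_N}\right)\right). \quad (6.8)$$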
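The limit (6.8) suggests an approximate confidence interval. The sketch below rests on two assumptions that the text does not state: the unknown P in the variance is replaced by its estimate, and the interval is two-sided at the 95% level (z = 1.96). The function name is hypothetical. The ratio $k_N/n$ is recovered from (6.6) as $\hat{P}/\rho^{N-1}$.

```python
import math

def splitting_confidence_interval(p_hat, n, k, N, z=1.96):
    """Plug-in interval from (6.8): p_hat * (1 -/+ z * relative std)."""
    rho = k / n
    ratio = p_hat / rho ** (N - 1)   # k_N / n, recovered from (6.6)
    var_term = (N - 1) * (1 - rho) / rho + (1 - ratio) / ratio
    rel_std = math.sqrt(var_term / n)
    return p_hat * (1 - z * rel_std), p_hat * (1 + z * rel_std)

# Values reported in the caption of Fig. 6.1.
lo, hi = splitting_confidence_interval(5.1e-11, n=2000, k=1000, N=35)
print(lo, hi)
```

With the values of Fig. 6.1, this gives roughly $[3.8, 6.4] \times 10^{-11}$, in line with the interval reported in the caption.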
We now measure the cost C of this algorithm by the number of calls to the routine computing the score function. Since $N \approx \log P/\log \rho$, we have
$$C = n + nT(N-1)(1-\rho) \approx nT\,\frac{1-\rho}{\log 1/\rho}\,\log 1/P. \quad (6.9)$$
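The two relations above are easy to evaluate numerically. The following sketch uses hypothetical function names and, for the cost, hypothetical parameters (n = 2000, T = 20, ρ = 1/2, P = $10^{-9}$) chosen only for illustration.

```python
import math

def expected_iterations(P, rho):
    """N ~ log P / log rho."""
    return math.log(P) / math.log(rho)

def cost_exact(n, T, N, rho):
    """Left-hand side of (6.9): C = n + n T (N - 1)(1 - rho)."""
    return n + n * T * (N - 1) * (1 - rho)

def cost_approx(n, T, rho, P):
    """Right-hand side of (6.9): n T (1 - rho) / log(1/rho) * log(1/P)."""
    return n * T * (1 - rho) / math.log(1 / rho) * math.log(1 / P)

print(expected_iterations(4.7e-11, 0.5))   # about 34.3
print(cost_exact(2000, 20, 30, 0.5))       # 582000.0
print(cost_approx(2000, 20, 0.5, 1e-9))    # about 5.98e5
```

For the problem of Fig. 6.1 ($P = 4.7 \times 10^{-11}$, ρ = 1/2), the first relation gives about 34.3 iterations, close to the N = 35 reported there.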
The key feature is that the cost is proportional to $\log 1/P$, whereas the cost of the crude Monte Carlo is proportional to $1/P$. A standard measurement in the ‘rare event’ literature is the cost weighted relative variance$^1$, encompassing both the cost and the accuracy of the estimator:
$$\frac{C \cdot V(\hat{P})}{P^2} \approx (\log P)^2\, T\, \frac{(1-\rho)^2}{(\log \rho)^2\, \rho}. \quad (6.10)$$
Usually, we set ρ > 1/2 and T = 20. With this setup, our algorithm has a lower cost weighted relative variance than the crude Monte Carlo simulation (i.e. $(1 - P)/P$) if $P \lesssim 10^{-3}$. The algorithm is thus dedicated to the estimation of small probabilities. Fig. 6.1 shows one estimation for a problem where the expression of the true probability is known. The algorithm succeeds in estimating a probability in the order of $10^{-11}$ with good accuracy with just 850,000 calls to the score function. A crude Monte Carlo simulation would have required more than $10^{12}$ calls.
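The crossover around $10^{-3}$ can be checked by evaluating (6.10) against $(1 - P)/P$. The sketch below assumes ρ = 1/2 (the ratio k/n used in Fig. 6.1) and T = 20; the function names are hypothetical.

```python
import math

def cwrv_splitting(P, rho, T):
    """Right-hand side of (6.10)."""
    return (math.log(P) ** 2) * T * (1 - rho) ** 2 / (math.log(rho) ** 2 * rho)

def cwrv_crude(P):
    """Cost weighted relative variance of crude Monte Carlo."""
    return (1 - P) / P

for P in (1e-2, 1e-3, 1e-6, 1e-11):
    print(P, cwrv_splitting(P, 0.5, 20), cwrv_crude(P))
```

At $P = 10^{-3}$ the two quantities are nearly equal (about 993 versus 999); crude Monte Carlo is better above this value, and the gap in favour of splitting grows quickly below it.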
$^1$ Some prefer to benchmark estimators with the computational efficiency, which is the inverse of the cost weighted relative variance.
[Plot omitted. Left panel, ‘Estimator A’: estimated probability versus threshold, showing the estimate, its confidence interval and the true value. Right panel: pdf of the true probability.]
Figure 6.1: Example of one simulation. Estimation problem: $s(X) = X^\top u/\|X\|$, with $\|u\| = 1$ and $X \sim \mathcal{N}(0, I_{20})$, τ = 0.95. The true probability is $P = 4.7 \times 10^{-11}$. Setup: k = 1000, n = 2000, T = 10. Results: $\hat{P} = 5.1 \times 10^{-11}$, confidence interval $[3.8, 6.4] \times 10^{-11}$, N = 35, C = 842,000.
6.2.4 Improvements
This algorithm has been improved by A. Guyader, N. Hengartner and E. Matzner-Løber [116]. They noticed that (6.10) is a decreasing function of ρ, therefore the algorithm is more efficient when ρ is set to its maximum value, $1 - 1/n$, obtained for $k = n - 1$. This means that from one iteration to another, they keep all the samples except the ‘last’ one, whose score is the lowest. They need to ‘refresh’ only this ‘last’ sample. The algorithm makes many more iterations, as the conditional probabilities equal $1 - 1/n$: the algorithm makes tiny steps towards $A_N$. This is a priori just a special case of our algorithm, but it has one huge advantage: the statistical properties of the estimator are proven for a finite n.
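Seen as the special case k = n − 1 of (6.6), this variant can be sketched as follows. This is an illustration under assumptions, not the implementation of [116]: the function name is hypothetical, and the demonstration reuses a toy scalar problem (X standard normal, s(x) = x, τ = 4) with a Gaussian replicator of hypothetical strength `MU`. When the lowest score exceeds τ, all n scores do, so $k_N = n$ and the estimate reduces to $(1 - 1/n)^{N-1}$.

```python
import math
import random

def last_particle(generate, score, replicate, tau, n, T, rng):
    """Splitting with k = n - 1: only the lowest-score sample is refreshed.
    Returns (estimate, number of iterations N)."""
    xs = [generate(rng) for _ in range(n)]
    ss = [score(x) for x in xs]
    refreshed = 0                          # equals N - 1 when the loop ends
    while True:
        worst = min(range(n), key=lambda i: ss[i])
        level = ss[worst]                  # current intermediate threshold
        if level > tau:
            return (1.0 - 1.0 / n) ** refreshed, refreshed + 1
        j = rng.randrange(n - 1)           # uniform pick among the others
        if j >= worst:
            j += 1
        z, sz = xs[j], ss[j]
        for _ in range(T):                 # routine (6.7) above the level
            y = replicate(z, rng)
            sy = score(y)
            if sy > level:
                z, sz = y, sy
        xs[worst], ss[worst] = z, sz
        refreshed += 1

# Toy problem (assumption): X ~ N(0, 1), s(x) = x, tau = 4.
MU = 0.5  # hypothetical replicator strength
def gen(rng): return rng.gauss(0.0, 1.0)
def s(x): return x
def r(x, rng): return (x + MU * rng.gauss(0.0, 1.0)) / math.sqrt(1.0 + MU * MU)

rng = random.Random(0)
p_hat, N = last_particle(gen, s, r, tau=4.0, n=200, T=20, rng=rng)
p_true = 0.5 * math.erfc(4.0 / math.sqrt(2.0))
print(p_hat, p_true, N)
```

As a consistency check on the closed form, the values reported in Fig. 6.2 (n = 200, N = 4,792) give $(1 - 1/200)^{4791} \approx 3.7 \times 10^{-11}$, which agrees with the estimate reported there.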
6.2.5 Byproducts
Finally, both algorithms have the following byproducts:
• At the end of the simulation, we have examples of ‘rare events’. This may help in understanding what provokes such an event.
• The final output is not only a single estimate, but also a mapping $\{(\tau_j, \rho^j)\}_{j=1}^{N-1}$ (see Fig. 6.1 for our algorithm and Fig. 6.2 for [116]). This is quite useful for drawing Receiver Operating Characteristic curves in hypothesis testing (we need one simulation per hypothesis). In the same way, the algorithm gives confidence intervals and the probability density function of the true probability knowing the estimate (see Fig. 6.1 for our algorithm and Fig. 6.2 for [116]).
[Plot omitted. Left panel, ‘Estimator B’: estimated probability versus threshold, showing the estimate, its confidence interval and the true value. Right panel: pdf of the true probability.]
Figure 6.2: Example of one simulation. Estimation problem: $s(X) = X^\top u/\|X\|$, with $\|u\| = 1$ and $X \sim \mathcal{N}(0, I_{20})$, τ = 0.95. The true probability is $P = 4.7 \times 10^{-11}$. Setup: n = 200, T = 30. Results: $\hat{P} = 3.7 \times 10^{-11}$, confidence interval $[1.9, 7.3] \times 10^{-11}$, N = 4,792, C = 138,000.
• There is a version of the algorithm for estimating extreme quantiles, i.e. estimating the value τ such that $P(s(X) > \tau)$ equals a given probability P [116].