In this section, we explain in detail how we use the hypothesis testing algorithm Choose-Hypothesis throughout this paper. In particular, the algorithm Choose-Hypothesis is used in the following places:
• In Step 4 of algorithm Learn-kmodal-simple we need an algorithm L↓_δ′ (resp. L↑_δ′) that learns a non-increasing (resp. non-decreasing) distribution to within total variation distance ε with confidence 1−δ′. Note that the corresponding algorithms L↓ and L↑ provided by Theorem 2.3 have confidence 9/10. To boost the confidence of L↓ (resp. L↑) we run the algorithm O(log(1/δ′)) times and use Choose-Hypothesis in an appropriate tournament procedure to select among the candidate hypothesis distributions.
• In Step 5 of algorithm Learn-kmodal-simple we need to select between two candidate hypothesis distributions (with the promise that at least one of them is close to the true conditional distribution). In this case, we run Choose-Hypothesis once to select between the two candidates.
• Also note that both algorithms Learn-kmodal-simple and Learn-kmodal generate an ε-accurate hypothesis with probability 9/10. We would like to boost the probability of success to 1−δ. To achieve this we again run the corresponding algorithm O(log(1/δ)) times and use Choose-Hypothesis in an appropriate tournament to select among the candidate hypothesis distributions.
We now formally describe the “tournament” algorithm to boost the confidence to 1−δ.
Lemma C.1. Let p be any distribution over a finite set W. Suppose that D_ε is a collection of N distributions over W such that there exists q ∈ D_ε with d_TV(p, q) ≤ ε. Then there is an algorithm that uses O(ε⁻² log N log(1/δ)) samples from p and with probability 1−δ outputs a distribution p′ ∈ D_ε that satisfies d_TV(p, p′) ≤ 6ε.
Devroye and Lugosi (Chapter 7 of [12]) prove a similar result by having all pairs of distributions in the cover compete against each other using their notion of a competition, but again there are some small differences: their approach chooses a distribution in the cover which wins the maximum number of competitions, whereas our algorithm chooses a distribution that is never defeated (i.e., won or achieved a draw against all other distributions in the cover). Instead we follow the approach from [9].
Proof. The algorithm performs a tournament by running the competition Choose-Hypothesis^p(h_i, h_j, ε, δ/(2N)) for every pair of distinct distributions h_i, h_j in the collection D_ε. It outputs a distribution q⋆ ∈ D_ε that was never a loser (i.e., won or achieved a draw in all its competitions). If no such distribution exists in D_ε, then the algorithm outputs “failure.”
By definition, there exists some q ∈ D_ε such that d_TV(p, q) ≤ ε. We first argue that with high probability this distribution q never loses a competition against any other q′ ∈ D_ε (so the algorithm does not output “failure”). Consider any q′ ∈ D_ε. If d_TV(p, q′) > 4ε, by Lemma B.1(ii) the probability that q loses to q′ is at most 2e^(−mε²/2) ≤ δ/(2N). On the other hand, if d_TV(p, q′) ≤ 4ε, the triangle inequality gives that d_TV(q, q′) ≤ 5ε and thus q draws against q′. A union bound over all N distributions in D_ε shows that with probability at least 1−δ/2, the distribution q never loses a competition.
We next argue that with probability at least 1−δ/2, every distribution q′ ∈ D_ε that never loses has small variation distance from p. Fix a distribution q′ such that d_TV(q′, p) > 6ε; Lemma B.1(i) implies that q′ loses to q with probability at least 1−2e^(−mε²/2) ≥ 1−δ/(2N). A union bound gives that with probability at least 1−δ/2, every distribution q′ that has d_TV(q′, p) > 6ε loses some competition.
Thus, with overall probability at least 1−δ, the tournament does not output “failure” and outputs some distribution q⋆ such that d_TV(p, q⋆) is at most 6ε. This proves the lemma.
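The tournament in the proof can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: distributions are represented as dictionaries mapping outcomes to probabilities, the names choose_hypothesis and tournament are hypothetical, and the competition is a Scheffé-style test whose draw threshold (5ε), decision thresholds (3ε/2) and sample count (2·ln(1/δ)/ε²) are assumed here rather than taken from this section. For simplicity the sketch also draws fresh samples for every competition.

```python
import math
import random
from itertools import combinations

def choose_hypothesis(sample, h1, h2, eps, delta):
    """Assumed Scheffe-style competition: 1 if h1 wins, 2 if h2 wins, 0 for a draw."""
    support = set(h1) | set(h2)
    # W1 is the set of outcomes where h1 puts strictly more mass than h2.
    w1 = {w for w in support if h1.get(w, 0.0) > h2.get(w, 0.0)}
    p1 = sum(h1.get(w, 0.0) for w in w1)
    p2 = sum(h2.get(w, 0.0) for w in w1)
    # p1 - p2 equals d_TV(h1, h2); close hypotheses draw without sampling.
    if p1 - p2 <= 5 * eps:
        return 0
    m = math.ceil(2 * math.log(1 / delta) / eps ** 2)  # assumed sample count
    tau = sum(sample() in w1 for _ in range(m)) / m    # empirical mass of W1
    if tau > p1 - 1.5 * eps:
        return 1
    if tau < p2 + 1.5 * eps:
        return 2
    return 0

def tournament(sample, hypotheses, eps, delta):
    """Return a hypothesis that never lost a competition, or None on failure."""
    n = len(hypotheses)
    lost = [False] * n
    for i, j in combinations(range(n), 2):
        result = choose_hypothesis(sample, hypotheses[i], hypotheses[j],
                                   eps, delta / (2 * n))
        if result == 1:
            lost[j] = True
        elif result == 2:
            lost[i] = True
    for i in range(n):
        if not lost[i]:
            return hypotheses[i]
    return None  # corresponds to the "failure" output

# Usage: the true distribution p is a fair coin; one candidate matches it.
candidates = [{0: 1.0}, {0: 0.5, 1: 0.5}, {1: 1.0}]
winner = tournament(lambda: random.randrange(2), candidates, eps=0.05, delta=0.1)
```

Returning the first candidate that never lost mirrors the proof: the argument only needs that some never-defeated distribution exists and that every never-defeated distribution is 6ε-close to p.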
We now explain how the above lemma is used in our context: Suppose we perform O(log(1/δ)) runs of a learning algorithm that constructs an ε-accurate hypothesis with probability at least 9/10. Then, with failure probability at most δ/2, at least one of the hypotheses generated is ε-close to the true distribution in variation distance. Conditioning on this good event, we have a collection of distributions with cardinality O(log(1/δ)) that satisfies the assumption of the lemma. Hence, using O((1/ε²)·log log(1/δ)·log(1/δ)) samples we can learn to accuracy 6ε and confidence 1−δ/2. The overall sample complexity is O(log(1/δ)) times the sample complexity of the learning algorithm run with confidence 9/10, plus this additional O((1/ε²)·log log(1/δ)·log(1/δ)) term.
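To make the O(log(1/δ)) count concrete, the following Python sketch computes the smallest number of runs r for which all runs fail with probability at most δ/2, assuming the runs are independent and each fails with probability at most 1/10, so that (1/10)^r ≤ δ/2. It also counts the pairwise competitions the tournament then performs. The function names are hypothetical, and the constants hidden in the O(·) sample bounds are not modeled.

```python
import math

def runs_needed(delta, fail_prob=0.1):
    """Smallest r with fail_prob**r <= delta / 2, assuming independent runs."""
    return max(1, math.ceil(math.log(2 / delta) / math.log(1 / fail_prob)))

def competitions(num_candidates):
    """Number of unordered pairs of candidates that compete in the tournament."""
    return num_candidates * (num_candidates - 1) // 2

r = runs_needed(0.01)   # pool size for overall confidence parameter delta = 0.01
pairs = competitions(r)
```

With δ = 0.01 this gives a pool of 3 candidates and 3 competitions, which illustrates why the tournament's contribution to the cost is quadratic in the pool size.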
In terms of running time, we make the following easily verifiable remarks: When the hypothesis testing algorithm Choose-Hypothesis is run on a pair of distributions that are produced by Birgé’s algorithm, its running time is polynomial in the succinct description of these distributions, i.e., in log²(n)/ε. Similarly, when Choose-Hypothesis is run on a pair of outputs of Learn-kmodal-simple or Learn-kmodal, its running time is polynomial in the succinct description of these distributions. More specifically, in the former case, the succinct description has bit complexity O(k·log²(n)/ε²) (since the output consists of O(k/ε) monotone intervals, and the conditional distribution on each interval is the output of Birgé’s algorithm for that interval). In the latter case, the succinct description has bit complexity O(k·log²(n)/ε), since the algorithm Learn-kmodal constructs only k monotone intervals. Hence, in both cases, each execution of the testing algorithm performs poly(k, log n, 1/ε) bit operations. Since the tournament invokes the algorithm Choose-Hypothesis O(log²(1/δ)) times (once for every pair of distributions in our pool of O(log(1/δ)) candidates), the upper bound on the running time follows.