PART III: SAMPLING, FEATURE SELECTION, AND SEARCH
9.4 Transition Estimation Error
This section shows how to bound the transition estimation error, which is formally defined in Assumption 2.28. This error arises from states that are included in the samples but whose transition probabilities are estimated imprecisely.
We first propose simple bounds on the transition estimation error. These simple bounds, without assuming any special structure, turn out to depend linearly on the number of samples, which can be very large. As we also show, these bounds are asymptotically tight and therefore indicate that the transition estimation error may be very high in some cases. To reduce the transition estimation error and the corresponding bound, we propose to use common random numbers — a technique popular in some simulation communities — to tighten the bounds and improve the performance (Glasserman & Yao, 1992).
The transition estimation error bounds hold only with a limited probability; it is always possible that the sample is misleading and that the transition probabilities are computed incorrectly. We demonstrate the importance of the transition estimation error in the reservoir management problem in Section 9.7.
The transition estimation error $\epsilon_s$, as we treat it, is independent of the state sampling error $\epsilon_p$; that is, the errors are additive. This section, therefore, assumes that for any $(s, a) \in \tilde{\Sigma}$ there is a $(s, a) \in \bar{\Sigma}$, and vice versa.
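Theorem 9.14. Assume that samples $\tilde{\Sigma}$ are available and that the number of samples for each state-action pair is at least $n$. Also assume that $\mathbf{e} = (0 \;\, \mathbf{1}^T)^T$. Then, the transition estimation error $\epsilon_s$ (Assumption 2.28) is bounded as:
\[
P[\epsilon_s(\psi) > \epsilon] \le Q\left( 2|\phi| \exp\left( -\frac{2 (\epsilon/(\psi \cdot \gamma))^2}{M_\phi}\, n \right),\; |\tilde{\Sigma}|_a \right) \le 2\, |\tilde{\Sigma}|_a\, |\phi| \exp\left( -\frac{2 (\epsilon/(\psi \cdot \gamma))^2}{M_\phi}\, n \right),
\]
where $|\tilde{\Sigma}|_a$ is the number of sampled state-action pairs,
\[ M_\phi = \max_{s \in \mathcal{S}} \|\phi(s)\|_\infty, \]
and
\[ Q(x, y) = 1 - (1 - x)^y. \]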
The proof of the theorem can be found in Section C.10. It closely resembles proofs in statistical learning theory (Bousquet et al., 2004), with one crucial difference. In statistical learning theory, the approximation error must be bounded for the set of all potential classification or regression functions, just as we need to bound the error over all sampled states and actions. The important difference is in the dependence of the errors. Because the classifiers are evaluated using the same samples, the errors over them are potentially dependent. In our setting, the samples are gathered independently for each sampled state, and this independence leads to somewhat tighter bounds that use the function Q instead of the union bound.
The function Q grows close to linearly in y when x is close to 0. A significant weakness of this bound is that it depends, close to linearly, on the number of samples and features. While the number of features is often small, the number of samples is usually very large. As a result, the bound can be very loose. Unfortunately, the transition estimation error is actually inevitable and is not only due to the looseness of the bounds as the following shows.
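The near-linear growth of Q can be checked numerically. The following Python sketch evaluates $Q(x, y) = 1 - (1 - x)^y$ next to the union bound $x \cdot y$. The function names `q_bound` and `union_bound` and the value of `x` are illustrative choices, not part of the original analysis.

```python
def q_bound(x: float, y: int) -> float:
    """Q(x, y) = 1 - (1 - x)^y: probability that at least one of y
    independent events, each of probability x, occurs."""
    return 1.0 - (1.0 - x) ** y


def union_bound(x: float, y: int) -> float:
    """Union bound x * y, capped at 1 because it bounds a probability."""
    return min(1.0, x * y)


if __name__ == "__main__":
    x = 1e-4  # assumed per-pair failure probability (illustrative value)
    for y in (1, 10, 100, 1_000, 10_000, 100_000):
        print(f"y={y:>6}  Q={q_bound(x, y):.6f}  union={union_bound(x, y):.6f}")
```

For small $x \cdot y$ the two columns nearly coincide, which is the near-linear regime; the gain from Q over the union bound only becomes visible once $x \cdot y$ approaches 1.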
Proposition 9.15. The bound in Theorem 9.14 is asymptotically tight with respect to $|\tilde{\Sigma}|_a$.
The proof of the proposition can be found in Section C.10. Interestingly, the bounds in [...] transition estimation error. With an increasing number of sampled states, the state selection error tends to decrease. However, the bounds on the transition estimation error will increase when the number of transitions estimated per state remains the same. We show next that this tradeoff can sometimes be circumvented by deriving bounds that are independent of the number of samples.
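The tradeoff can be made concrete with the generic Hoeffding inequality for averages of independent variables in [0, 1], used here as a stand-in for the constants of Theorem 9.14 (an assumption made only for illustration). Under the union bound over $k$ sampled state-action pairs, keeping the failure probability below $\delta$ at accuracy $t$ requires about $\ln(2k/\delta)/(2t^2)$ transition samples per pair, so the per-pair sample count must grow with $k$. The names `failure_bound` and `required_samples` are hypothetical.

```python
import math


def failure_bound(k: int, n: int, t: float) -> float:
    """Union bound over k pairs of the two-sided Hoeffding tail
    2 * exp(-2 * n * t^2) for an average of n variables in [0, 1]."""
    return min(1.0, 2.0 * k * math.exp(-2.0 * n * t * t))


def required_samples(k: int, t: float, delta: float) -> int:
    """Smallest n for which failure_bound(k, n, t) <= delta (for delta < 1)."""
    return math.ceil(math.log(2.0 * k / delta) / (2.0 * t * t))


if __name__ == "__main__":
    t, delta = 0.05, 0.05  # illustrative accuracy and confidence levels
    for k in (1, 100, 10_000, 1_000_000):
        n = required_samples(k, t, delta)
        print(f"k={k:>9}  n per pair={n:>5}  bound={failure_bound(k, n, t):.4f}")
```

If the per-pair sample count is instead held fixed while $k$ grows, `failure_bound` rises toward 1, which mirrors the increase described above.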
9.4.1 Common Random Numbers
The results above indicate that the transition estimation error may be very large when the number of sampled states and actions is large. The transition estimation error, and the corresponding bound, can be decreased by assuming that the Bellman residual of the solution is close to 0 for a small number of states and actions. This is, however, not true in general. Here, we instead propose to use common random numbers to estimate transitions; this approach is applicable to a large class of industrial problems.
Common random numbers are applicable to problems with external uncertainty, which is not affected by the actions taken. We use the random variable $\omega \in \Omega$ to denote this external uncertainty. In the reservoir management problem (see Section B.3), the variable $\omega$ may represent the weather, an external variable that cannot be influenced. In blood inventory management (see Section B.2), the variable $\omega$ represents the external uncertainty of supplies and demands, which is independent of the current state of the inventory.
We now specify how the state transition probabilities depend on the external source of stochasticity. Let the random variable $\omega$ be defined such that:
\[ P(s, a, z(s, a, \omega)) = P[\omega] \quad \forall s \in \mathcal{S},\ a \in \mathcal{A}, \]
where $z : \mathcal{S} \times \mathcal{A} \times \Omega \to \mathcal{S}$ is the deterministic transition function. This definition applies to discrete $\omega$; the definition for the continuous case would be similar.
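As a minimal sketch of this definition, the following Python code builds the transition distribution induced by a deterministic transition function and a discrete distribution over $\omega$. The toy reservoir dynamics and the names `z`, `CAPACITY`, `P_OMEGA`, and `transition_distribution` are assumptions chosen for illustration and are not the model of Section B.3. When several values of $\omega$ lead to the same next state, their probabilities are summed; when $z(s, a, \cdot)$ is one-to-one, the result reduces to $P(s, a, z(s, a, \omega)) = P[\omega]$.

```python
from collections import defaultdict
from typing import Dict

CAPACITY = 10  # hypothetical reservoir capacity (water-level units)
P_OMEGA = {0: 0.2, 1: 0.5, 2: 0.3}  # hypothetical inflow distribution P[omega]


def z(s: int, a: int, omega: int) -> int:
    """Deterministic transition: new level = old level + inflow - release,
    clipped to the range [0, CAPACITY]."""
    return max(0, min(CAPACITY, s + omega - a))


def transition_distribution(s: int, a: int) -> Dict[int, float]:
    """P(s, a, .) induced by z and P_OMEGA."""
    dist: Dict[int, float] = defaultdict(float)
    for omega, p in P_OMEGA.items():
        dist[z(s, a, omega)] += p
    return dict(dist)


if __name__ == "__main__":
    print(transition_distribution(5, 1))   # no clipping: one next state per omega
    print(transition_distribution(10, 0))  # clipping at capacity merges outcomes
```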
The ordinary samples in this setting are defined as:
\[ \tilde{\Sigma} \subseteq \{ (s, a, (s_1 \ldots s_n), r(s, a)) \mid s, s' \in \mathcal{S},\ a \in \mathcal{A} \}, \]
where $s_1 \ldots s_n$ are selected i.i.d. from the distribution $P(s, a)$ for every $s, a$ independently. The common random number samples are denoted as $\tilde{\Sigma}_c$ and defined as:
\[ \tilde{\Sigma}_c \subseteq \{ (s, a, (z(s, a, \omega_1) \ldots z(s, a, \omega_n)), r(s, a)) \mid s, s' \in \mathcal{S},\ a \in \mathcal{A} \}, \]
where $\omega_1 \ldots \omega_n \in \Omega$ are selected i.i.d. according to the appropriate distribution. This distribution is independent of the current state and action. The common random number samples are used in the construction of ABP and ALP in the same way as the regular samples. In practice, using common random numbers in reservoir management means that the weather is sampled independently of the water level in the reservoir. The constraints are then constructed for each water level, averaging the effects of the weather. Intuitively, this is desirable because the comparison between the values of various water levels is fairer when the weather is assumed to be identical.
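The difference between the two sampling schemes can be sketched in Python as follows. The toy reservoir dynamics and all names (`z`, `draw_omega`, `ordinary_samples`, `crn_samples`) are assumptions for illustration, and rewards are omitted for brevity. Ordinary sampling draws fresh weather outcomes for every state-action pair, whereas common random numbers draw $\omega_1, \ldots, \omega_n$ once and reuse them for every pair.

```python
import random
from typing import Dict, List, Tuple

CAPACITY = 10          # hypothetical reservoir capacity
INFLOWS = [0, 1, 2]    # hypothetical support of the weather variable omega

StateAction = Tuple[int, int]


def z(s: int, a: int, omega: int) -> int:
    """Deterministic transition: level + inflow - release, clipped to [0, CAPACITY]."""
    return max(0, min(CAPACITY, s + omega - a))


def draw_omega(rng: random.Random, n: int) -> List[int]:
    """Draw n i.i.d. weather outcomes, independent of state and action."""
    return [rng.choice(INFLOWS) for _ in range(n)]


def ordinary_samples(pairs: List[StateAction], n: int,
                     rng: random.Random) -> Dict[StateAction, List[int]]:
    """Fresh, independent draws of omega for every state-action pair."""
    return {(s, a): [z(s, a, w) for w in draw_omega(rng, n)] for s, a in pairs}


def crn_samples(pairs: List[StateAction], n: int,
                rng: random.Random) -> Dict[StateAction, List[int]]:
    """One shared sequence omega_1..omega_n reused for every state-action pair."""
    omegas = draw_omega(rng, n)
    return {(s, a): [z(s, a, w) for w in omegas] for s, a in pairs}


if __name__ == "__main__":
    pairs = [(3, 1), (6, 1)]  # two water levels, same release
    rng = random.Random(0)
    print("ordinary:", ordinary_samples(pairs, 5, rng))
    print("common  :", crn_samples(pairs, 5, rng))
```

In this toy model, as long as no clipping occurs, the common random number samples for the two water levels differ by exactly the difference in the initial levels, because both levels see the same weather.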
To simplify the notation, we use
\[ X_i(\omega_j) = \left| P(s_i, a_i)^T \phi_f - P(s_i, a_i, z(s_i, a_i, \omega_j))^T \phi_f \right| \]
for some feature $f$ and $i \in \tilde{\Sigma}$. These random variables are not necessarily independent, as they would be if the state transitions were sampled independently. Their values are unknown for the samples, since $P(s_i, a_i)^T \phi_f$ is not known. The random variables $X_i$ are, however, useful for theoretical analysis.
The following definition of a growth set provides a suitable structure for bounding the transition estimation error for common random numbers.
Definition 9.16 (Growth Set). The growth set for the number of states $m$ is defined as:
Note that the definition is for all states, not only the sampled ones, although it could be defined for the samples only.
The purpose of the growth set definition is more explanatory than practical at this point. It is a longstanding challenge in statistical learning theory to derive measures of complexity that both lead to sharp upper bounds and can be easily computed for a class of problems (Bousquet et al., 2004; Devroye et al., 1996). While the proposed growth set can be roughly estimated in some domains, it cannot be derived from the samples alone.
The following theorem shows the theoretical importance of using the common random numbers. The important difference is that the bound does not depend on the number of sampled states, but on the growth function instead.
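Theorem 9.17. Assume that samples $\tilde{\Sigma}$ are generated using common random numbers and that the number of common random number samples is $m$. Then:
\[ \{ I\{ z(s, a, \omega)\, v \ge \epsilon \} \mid \omega \in \Omega,\ v \in \mathcal{M}(\psi),\ \epsilon \in \mathbb{R}_+ \}. \]
Then, the transition estimation error $\epsilon_s$ (Assumption 2.28) is bounded as:
\[ P[\epsilon_s(\psi) > \epsilon] \le 2|\phi| \exp\left( -\frac{2 (\epsilon/(\psi \cdot \gamma \cdot |\tau(m)|))^2}{M_\phi}\, m \right), \]
where $|\tilde{\Sigma}|_a$ is the number of sampled state-action pairs and
\[ M_\phi = \max_{s \in \mathcal{S}} \|\phi(s)\|_\infty. \]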
The proof of the theorem can be found in Section C.10. The common random numbers in the samples introduce correlation among the constraints, which adds structure to the problem. This structure can then be used to bound the error.
It is not always trivial to determine the growth function of the set of functions. It is important to study the growth functions in particular domains, but it may also be possible to derive complexity measures that can be estimated directly from the sample. One such example for classification problems is the Rademacher complexity (Bousquet et al., 2004).