Definition 7.1. Let $(\Omega, \mathcal{A}, P)$ be a probability space. A real-valued random variable $U$ is a measurable mapping $U : \Omega \to \mathbb{R}$. A stochastic process is a collection $\{U_s : s \in S\}$ of random variables on $\Omega$ indexed by a set $S$.
More generally, we can define $B$-valued random variables for $B = \mathbb{R}^d$ or some abstract Banach space $B$. In such a case, a random variable is a measurable map from $(\Omega, \mathcal{A}, P)$ into $B$ equipped with the Borel $\sigma$-algebra generated by the open sets of $B$ [35].
A stochastic process is defined through its state space (that is, $\mathbb{R}$ or $B$), the index set $S$, and the joint distributions of the random variables. If $S$ is infinite, care should be taken to ensure measurability of events. In this course, we will omit these complications and assume that the necessary conditions hold to ensure measurability.
The first important stochastic process we study is the one that arises from averaging over i.i.d. data:
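Definition 7.2. An empirical process is a stochastic process $\{G_f\}$ indexed by functions $f \in \mathcal{F}$ of a function class $\mathcal{F}$ and defined as
\[
G_f \triangleq \frac{1}{n} \sum_{t=1}^{n} \big( \mathbb{E} f(Z) - f(Z_t) \big) = \mathbb{E}(f) - \hat{\mathbb{E}}(f),
\]
where $Z_1, \ldots, Z_n, Z$ are i.i.d. (Oftentimes in the literature, the normalization factor is $\frac{1}{\sqrt{n}}$ instead of $\frac{1}{n}$.)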
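As a minimal sketch of Definition 7.2, the following Python snippet evaluates $G_f$ on a simulated sample. The setup is an illustrative assumption and not part of the text: $Z$ is taken to be uniform on $[0, 1]$ so that $\mathbb{E} f(Z)$ is known in closed form, and the names `empirical_process` and `function_class` are hypothetical.

```python
import random


def empirical_process(f, expected_f, sample):
    """G_f = (1/n) * sum_t (E f(Z) - f(Z_t)), i.e. E(f) minus the sample mean of f."""
    n = len(sample)
    return sum(expected_f - f(z) for z in sample) / n


# Assumed setup: Z ~ Uniform[0, 1]; each entry pairs f with its known mean E f(Z).
function_class = {
    "identity": (lambda z: z, 0.5),          # E[Z] = 1/2
    "square": (lambda z: z * z, 1.0 / 3.0),  # E[Z^2] = 1/3
}

rng = random.Random(0)  # fixed seed so the sketch is reproducible
sample = [rng.random() for _ in range(1000)]

# One value of the process per index f in the class.
G = {name: empirical_process(f, mean, sample)
     for name, (f, mean) in function_class.items()}
```

Each `G[name]` is a single realization of $G_f$; redrawing `sample` gives a new realization of the whole collection $\{G_f\}$ at once, which is what makes it a stochastic process indexed by the class.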
Definition 7.3. A random variable $\epsilon$ taking values in $\{\pm 1\}$ with equal probability is called a Rademacher random variable.
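Definition 7.4. Let $\epsilon_1, \ldots, \epsilon_n \in \{\pm 1\}$ be independent Rademacher random variables. A Rademacher process is a stochastic process $\{S_a\}$ indexed by a set $F \subset \mathbb{R}^n$ of vectors $a \in F$ and defined as
\[
S_a \triangleq \frac{1}{n} \sum_{t=1}^{n} \epsilon_t a_t .
\]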
Given $z^n = \{z_1, \ldots, z_n\} \in \mathcal{Z}^n$ and a class $\mathcal{F}$ of functions $\mathcal{Z} \to \mathbb{R}$, we define the Rademacher process on $\mathcal{F}$ as
\[
S_f \triangleq \frac{1}{n} \sum_{t=1}^{n} \epsilon_t f(z_t)
\]
for $f \in \mathcal{F}$. Since $z^n$ is fixed, we may think of $a = (f(z_1), \ldots, f(z_n))$ as a vector that corresponds to $f$, matching the earlier definition. From this point of view, the behavior of the functions outside $z^n$ is irrelevant, and we may view the set
\[
F = \mathcal{F}|_{(z_1, \ldots, z_n)} = \{ (f(z_1), \ldots, f(z_n)) : f \in \mathcal{F} \} \tag{7.4}
\]
as the finite-dimensional projection of the function class $\mathcal{F}$ onto $z^n$.
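The projection in (7.4) and the process $S_a$ can be sketched in a few lines of Python. This is only an illustration: the points `zs`, the three functions, and the helper names `project` and `rademacher_process` are assumptions made up for the example.

```python
def project(function_class, zs):
    """Projection of the class onto z^n: one vector (f(z_1), ..., f(z_n)) per f."""
    return {name: tuple(f(z) for z in zs) for name, f in function_class.items()}


def rademacher_process(a, eps):
    """S_a = (1/n) * sum_t eps_t * a_t for a vector a and signs eps."""
    return sum(e * x for e, x in zip(eps, a)) / len(a)


def identity(z):
    return z


def threshold(z):
    return 1.0 if z >= 2 else -1.0


def far_away(z):
    # Agrees with identity on the sample points, differs elsewhere.
    return z if z <= 3 else 100.0


zs = [0.0, 1.0, 2.0, 3.0]
function_class = {"identity": identity, "threshold": threshold, "far_away": far_away}
projection = project(function_class, zs)

eps = (1, -1, 1, -1)  # one fixed draw of the Rademacher signs
S = {name: rademacher_process(a, eps) for name, a in projection.items()}
```

Here `identity` and `far_away` differ as functions but project to the same vector, so they give the same value of the process, which is the sense in which the behavior outside $z^n$ is irrelevant.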
The processes defined so far are averages of functions of independent random variables. We now bring in the notion of temporal dependence, which will play an important role for the analysis of sequential prediction problems.
Definition 7.5. Let $S = \{0, 1, 2, \ldots\}$. A stochastic process $\{U_s\}$ is a discrete-time martingale if
\[
\mathbb{E}\{ U_{s+1} \mid U_1, \ldots, U_s \} = U_s
\]
and $\mathbb{E}|U_s| < \infty$ for all $s \in S$. More generally, a stochastic process $\{U_s\}$ is a martingale with respect to another stochastic process $\{V_s\}$ if
\[
\mathbb{E}\{ U_{s+1} \mid V_1, \ldots, V_s \} = U_s
\]
and $\mathbb{E}|U_s| < \infty$. A stochastic process $\{U_s\}$ is a martingale difference sequence (MDS) if
\[
\mathbb{E}\{ U_{s+1} \mid V_1, \ldots, V_s \} = 0
\]
for some stochastic process $\{V_s\}$. Any martingale $\{V_s\}$ defines a martingale difference sequence $U_s = V_s - V_{s-1}$.
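The last claim can be checked in one line. Assuming $\{V_s\}$ is a martingale in the first sense of the definition (with respect to itself), the increments $U_{s+1} = V_{s+1} - V_s$ satisfy

\[
\mathbb{E}\{ U_{s+1} \mid V_1, \ldots, V_s \}
= \mathbb{E}\{ V_{s+1} \mid V_1, \ldots, V_s \} - V_s
= V_s - V_s
= 0 .
\]

The first equality uses that $V_s$ is determined by the conditioning variables, and the second is the martingale property.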
We now define a “dependent” version of the i.i.d. empirical process.
Definition 7.6. An empirical process with dependent data is a stochastic process $\{M_f\}$ indexed by functions $f \in \mathcal{F}$ of a function class $\mathcal{F}$ and defined as
\[
M_f \triangleq \frac{1}{n} \sum_{t=1}^{n} \Big( \mathbb{E}\{ f(Z_t) \mid Z_1, \ldots, Z_{t-1} \} - f(Z_t) \Big),
\]
where $(Z_1, \ldots, Z_n)$ is a discrete-time stochastic process with a joint distribution $P$.
Clearly, the sequence $\big\{ \mathbb{E}\{ f(Z_t) \mid Z_1, \ldots, Z_{t-1} \} - f(Z_t) \big\}$ is a martingale difference sequence for any $f$. Furthermore, the notion of an empirical process with dependent data boils down to the classical notion if $Z_1, \ldots, Z_n$ are i.i.d.
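Both statements follow directly from the definitions; a short worked check, using only the notation already introduced, is as follows. Conditioning the $t$-th term on the past gives

\[
\mathbb{E}\Big\{ \mathbb{E}\{ f(Z_t) \mid Z_1, \ldots, Z_{t-1} \} - f(Z_t) \;\Big|\; Z_1, \ldots, Z_{t-1} \Big\}
= \mathbb{E}\{ f(Z_t) \mid Z_1, \ldots, Z_{t-1} \} - \mathbb{E}\{ f(Z_t) \mid Z_1, \ldots, Z_{t-1} \}
= 0 ,
\]

which is the martingale difference property. If, in addition, $Z_1, \ldots, Z_n$ are i.i.d. with the same distribution as $Z$, then conditioning on the past has no effect, so

\[
\mathbb{E}\{ f(Z_t) \mid Z_1, \ldots, Z_{t-1} \} = \mathbb{E} f(Z)
\quad \text{and hence} \quad
M_f = \frac{1}{n} \sum_{t=1}^{n} \big( \mathbb{E} f(Z) - f(Z_t) \big) = G_f .
\]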
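When specifying martingales, we can talk more generally about filtrations, defined as an increasing sequence of $\sigma$-algebras
\[
\mathcal{A}_0 \subset \mathcal{A}_1 \subset \ldots \subset \mathcal{A} .
\]
A martingale is then defined as a sequence of $\mathcal{A}_s$-measurable random variables $U_s$ such that $\mathbb{E}\{ U_{s+1} \mid \mathcal{A}_s \} = U_s$.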
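Of particular interest is the dyadic filtration $\{\mathcal{A}_t\}$ on $\Omega = \{-1, 1\}^{\mathbb{N}}$ given by $\mathcal{A}_t = \sigma(\epsilon_1, \ldots, \epsilon_t)$, where the $\epsilon_t$'s are independent Rademacher random variables. Fix $\mathcal{Z}$-valued functions $z_t : \{\pm 1\}^{t-1} \to \mathcal{Z}$ for all $t \geq 1$. Then the random variables $z_t(\epsilon_1, \ldots, \epsilon_{t-1})$ are $\mathcal{A}_{t-1}$-measurable with respect to the dyadic filtration, and the discrete-time stochastic process
\[
\big\{ \epsilon_t z_t(\epsilon_1, \ldots, \epsilon_{t-1}) \big\}
\]
is a martingale difference sequence. Indeed,
\[
\mathbb{E}\{ \epsilon_t z_t(\epsilon_1, \ldots, \epsilon_{t-1}) \mid \epsilon_1, \ldots, \epsilon_{t-1} \} = 0 .
\]
A sequence $\mathbf{z} = (z_1, \ldots, z_n)$ of such functions is called a $\mathcal{Z}$-valued tree.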
Example 2. To give a bit of intuition about the tree and the associated martingale difference sequence, consider a scenario where we start with a unit amount of money and repeatedly play a fair game. At each stage, we flip a coin and either gain or lose half of our current amount. So, at the first step, we either lose 0.5 or gain 0.5. If we gain 0.5 (for a total of 1.5), the next differential will be $\pm 0.75$. If, however, we lost 0.5 at the first step, the next coin flip will result in a gain or loss of 0.25. It is easy to see that this defines a complete binary tree $\mathbf{z}$. Given any prefix, such as $(1, -1, 1)$, the gain (or loss) $z_4(1, -1, 1)$ at round 4 is determined. The sum $\sum_{t=1}^{n} \epsilon_t z_t(\epsilon_1, \ldots, \epsilon_{t-1})$ determines the total payoff.
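The betting game of Example 2 can be written out as a tree in Python. This is a sketch under the example's own rules (unit starting fortune, half of the current fortune at stake each round); the function names `stake` and `total_payoff` are hypothetical, and a tree is represented here simply as a function of the sign prefix.

```python
from itertools import product


def stake(prefix):
    """z_t(eps_1, ..., eps_{t-1}): half of the fortune held after the given prefix.

    Starting from a unit amount, each win multiplies the fortune by 1.5
    and each loss multiplies it by 0.5.
    """
    fortune = 1.0
    for eps in prefix:
        fortune *= 1.0 + 0.5 * eps
    return 0.5 * fortune


def total_payoff(eps):
    """sum_t eps_t * z_t(eps_1, ..., eps_{t-1}) along one path of coin flips."""
    return sum(e * stake(eps[:t]) for t, e in enumerate(eps))


# Stakes quoted in the example, plus the round-4 stake after the prefix (1, -1, 1).
first_round = stake(())
after_win = stake((1,))
after_loss = stake((-1,))
round_four = stake((1, -1, 1))

# Averaging the payoff over all 2^n equally likely paths illustrates fairness.
n = 3
paths = list(product((-1, 1), repeat=n))
mean_payoff = sum(total_payoff(p) for p in paths) / len(paths)
```

The average payoff over all paths is zero, as expected for a sum of martingale differences, even though the size of each stake depends on the path taken so far.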
We may view the martingale $\{U_s\}$ with
\[
U_s = \sum_{t=1}^{s} \epsilon_t z_t(\epsilon_1, \ldots, \epsilon_{t-1})
\]
as a random walk with symmetric increments $\pm z_s$ which depend on the path that got us to this point. Such martingales are known as the Walsh-Paley martingales. Interestingly enough, these martingales generated by the Rademacher random variables are, in some sense, “representative” of all the possible martingales with values in $\mathcal{Z}$. We will make this statement precise and use it to our advantage, as these tree-based martingales are much easier to deal with than general martingales.
A word about the notation. For brevity, we shall often write $z_t(\epsilon)$, where $\epsilon = (\epsilon_1, \ldots, \epsilon_n)$, but it is understood that $z_t$ only depends on the prefix $(\epsilon_1, \ldots, \epsilon_{t-1})$.
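Now, given a tree $\mathbf{z}$ and a function $f : \mathcal{Z} \to \mathbb{R}$, we define the composition $f \circ \mathbf{z}$ as a real-valued tree $(f \circ z_1, \ldots, f \circ z_n)$. Each $f \circ z_t$ is a function $\{\pm 1\}^{t-1} \to \mathbb{R}$ and
\[
\big\{ \epsilon_t f(z_t(\epsilon_1, \ldots, \epsilon_{t-1})) \big\}
\]
is a martingale difference sequence for any given $f$.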
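Definition 7.7. Let $\epsilon_1, \ldots, \epsilon_n \in \{\pm 1\}$ be independent Rademacher random variables. Given a tree $\mathbf{z}$, a stochastic process $\{T_f\}$ defined as
\[
T_f \triangleq \frac{1}{n} \sum_{t=1}^{n} \epsilon_t f(z_t(\epsilon_1, \ldots, \epsilon_{t-1}))
\]
will be called a tree process indexed by $\mathcal{F}$.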
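A small Python sketch may help make Definition 7.7 concrete. Everything specific here is an assumption chosen for illustration: the tree `walk_tree` (whose value is the running sum of past signs), the two-element class, and the helper name `tree_process`. For small $n$ the $2^n$ sign paths can be enumerated exactly rather than sampled.

```python
from itertools import product


def tree_process(f, tree, eps):
    """T_f = (1/n) * sum_t eps_t * f(z_t(eps_1, ..., eps_{t-1})).

    `tree` maps a prefix of signs (a tuple of length t-1) to the value z_t.
    """
    n = len(eps)
    return sum(e * f(tree(eps[:t])) for t, e in enumerate(eps)) / n


def walk_tree(prefix):
    """Hypothetical real-valued tree: position of a +/-1 walk after the prefix."""
    return float(sum(prefix))


def square(x):
    return x * x


def neg_square(x):
    return -x * x


n = 3
paths = list(product((-1, 1), repeat=n))  # all 2^n equally likely sign paths

# Each T_f has mean zero, since its summands form a martingale difference sequence.
mean_T = sum(tree_process(square, walk_tree, eps) for eps in paths) / len(paths)

# The supremum over the class is taken path by path, then averaged.
expected_sup = sum(
    max(tree_process(f, walk_tree, eps) for f in (square, neg_square))
    for eps in paths
) / len(paths)
```

Although every individual $T_f$ averages to zero, the expected supremum over the class is generally nonnegative and typically positive, because the maximizing $f$ is allowed to depend on the realized signs.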
Example 2, continued. Let $f$ be a function that gives the level of excitement of a person observing his fortune $U_s$ going up and down, as in the previous example. If the increment $z_t$ at the next round is large, the person becomes very happy (large $+f(z_t)$) upon winning the round, and very unhappy ($-f(z_t)$) upon losing. A person who does not care about the game might have a constant level $f(z_t) = 0$ throughout the game. On the other extreme, suppose someone becomes agitated when the increments $z_t$ become close to zero, thus having large $\pm f(z_t)$ ups and downs. Suppose $\mathcal{F}$ contains the profiles of a group of people observing the same outcomes. An interesting object of study is the largest cumulative level $\sup_{f \in \mathcal{F}} \sum_{t=1}^{n} \epsilon_t f(z_t(\epsilon_1, \ldots, \epsilon_{t-1}))$ after $n$ rounds.
We may view the tree process $T_f$ as a generalization of the Rademacher process $S_f$. Indeed, suppose $\mathbf{z} = (z_1, \ldots, z_n)$ is a sequence of constant mappings such that $z_t(\epsilon_1, \ldots, \epsilon_{t-1}) = z_t$ for any $(\epsilon_1, \ldots, \epsilon_{t-1})$. In this case, $T_f$ and $S_f$ coincide. Generally, however, the tree process can behave differently (in a certain sense) from the Rademacher process. Understanding the gap in behavior of the two processes will have implications for the understanding of learnability in the i.i.d. and adversarial models.
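The coincidence of $T_f$ and $S_f$ for constant trees can be verified by brute force on a small instance. The sketch below is illustrative only: the points `zs`, the function `f`, and the helper names are assumptions, and a tree is again represented as a function of the sign prefix.

```python
from itertools import product


def rademacher_process(f, zs, eps):
    """S_f = (1/n) * sum_t eps_t * f(z_t) for fixed points z_1, ..., z_n."""
    return sum(e * f(z) for e, z in zip(eps, zs)) / len(eps)


def tree_process(f, tree, eps):
    """T_f = (1/n) * sum_t eps_t * f(z_t(eps_1, ..., eps_{t-1}))."""
    return sum(e * f(tree(eps[:t])) for t, e in enumerate(eps)) / len(eps)


zs = (0.5, -1.0, 2.0)  # hypothetical fixed data points


def constant_tree(prefix):
    """Tree of constant mappings: z_t ignores the signs and returns zs[t-1]."""
    return zs[len(prefix)]


def f(z):
    return z * z


# Compare the two processes on every one of the 2^n sign paths.
paths = list(product((-1, 1), repeat=len(zs)))
coincide = all(
    tree_process(f, constant_tree, eps) == rademacher_process(f, zs, eps)
    for eps in paths
)
```

With a path-dependent tree in place of `constant_tree`, the values $f(z_t(\cdot))$ change from path to path, and no single vector $(f(z_1), \ldots, f(z_n))$ reproduces the process.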