In the literature there is a clear link between Markov chains and stochastic games. Markov chains were introduced by, and named after, the Russian mathematician Andrey Markov (1971). A Markov chain is a stochastic process that is memoryless. The Markov property states that the history before the present does not matter: the only thing relevant for the future is the present. We denote the random variable of a stochastic process at time $t$ by $X_t$ and a realized value of the variable by $x$. Mathematically, memorylessness has the following effect on a stochastic process.
\[ \Pr(X_{t+1} = x \mid X_1 = x_1, X_2 = x_2, \ldots, X_t = x_t) = \Pr(X_{t+1} = x \mid X_t = x_t) \tag{3.1} \]
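Equation (3.1) can be illustrated with a minimal simulation sketch. The two weather states and their transition probabilities below are hypothetical values chosen for illustration; they are not taken from Figure 3.1. The point of the sketch is that `next_state` receives only the current state, never the earlier history.

```python
import random

# Hypothetical transition probabilities for a two-state weather chain.
# These numbers are illustrative assumptions, not the values of Figure 3.1.
TRANSITIONS = {
    "sunny": {"sunny": 0.8, "rainy": 0.2},
    "rainy": {"sunny": 0.4, "rainy": 0.6},
}


def next_state(current, rng):
    """Sample X_{t+1} from the current state X_t only (Markov property)."""
    successors = list(TRANSITIONS[current])
    weights = [TRANSITIONS[current][s] for s in successors]
    return rng.choices(successors, weights=weights, k=1)[0]


def simulate(start, steps, seed=0):
    """Return a sample path X_0, ..., X_steps of the chain."""
    rng = random.Random(seed)
    path = [start]
    for _ in range(steps):
        # Only path[-1] is passed on; the rest of the history is ignored.
        path.append(next_state(path[-1], rng))
    return path


path = simulate("sunny", 10)
print(path)
```

Because the sampler has no access to the earlier part of the path, any process generated this way satisfies (3.1) by construction.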
Definition (aperiodic Markov chains): “A Markov chain is said to be aperiodic if all its states are aperiodic. Otherwise the chain is said to be periodic” (Häggström, 2002).
The Markov chain in Figure 3.1 is aperiodic: it is not the case that a sunny day always follows after a certain fixed number of rainy days. Aperiodicity and irreducibility are two properties that together form the basis of an interesting insight into Markov chains. When an aperiodic and irreducible Markov chain is run for a long time, it remains uncertain which state the chain occupies at any particular point in time. However, as time goes to infinity, the distribution over the states settles into a stationary distribution, which describes the probability of visiting each state in the long run. We therefore know with great precision the long-run frequency with which each state is visited. This leads to a theorem that forms an important part of this research:
Theorem: “For any irreducible and aperiodic Markov chain, there exists at least one stationary distribution” (Häggström, 2002).
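As a small sketch of what the stationary distribution looks like in practice, the code below repeatedly applies the transition matrix to an initial distribution until it stops changing. The matrix `P` is a hypothetical two-state weather chain (state 0 = sunny, state 1 = rainy), not the chain of Figure 3.1, and the function names are our own. The approach relies on the standard fact that an irreducible and aperiodic finite chain converges to its stationary distribution from any starting distribution.

```python
# Hypothetical two-state chain: state 0 = sunny, state 1 = rainy.
# The probabilities are illustrative assumptions.
P = [
    [0.8, 0.2],
    [0.4, 0.6],
]


def step(dist, P):
    """Advance a distribution by one time step: returns the row vector dist * P."""
    n = len(P)
    return [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]


def stationary_distribution(P, tol=1e-12, max_iter=10_000):
    """Approximate the stationary distribution by repeated multiplication."""
    n = len(P)
    dist = [1.0 / n] * n  # start from the uniform distribution
    for _ in range(max_iter):
        new_dist = step(dist, P)
        if max(abs(a - b) for a, b in zip(new_dist, dist)) < tol:
            return new_dist
        dist = new_dist
    return dist


pi = stationary_distribution(P)
print(pi)
```

For this toy matrix the result can be checked by hand: solving $\pi = \pi P$ with $\pi_0 + \pi_1 = 1$ gives $\pi = (2/3, 1/3)$.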
But how are Markov chains linked to stochastic games? Neyman states that “Markov chains and Markov decision processes are special cases of stochastic games” (Neyman, 2003a). Markov chains model the dynamics of a system: they specify the transition probabilities of a stochastic game. Between the Markov chain and the stochastic game sits the Markov decision process (MDP). The MDP is a reduction of the stochastic game to a single player, who is therefore able to control the play, and hence the corresponding payoffs, on his own. Filar and Vrieze (1997) describe the stochastic game in terms of a competitive MDP.
In the case of one player under the limiting average criterion they describe the MDP as follows. The player starts in initial state $s$ and plays stationary strategy $f$; the reward at time $t$ is denoted by $R_t$. The value of this irreducible limiting average MDP is defined as (Filar & Vrieze, 1997):
\[ v_\alpha(f) := \lim_{T \to \infty} \frac{1}{T+1} \sum_{t=0}^{T} E_{sf}[R_t] \]
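The limiting average above can be approximated numerically by evaluating the finite-horizon average for a large $T$. The sketch below assumes that the stationary strategy $f$ has already been fixed, so that the MDP reduces to a Markov chain with transition matrix `P_f` and expected immediate reward `r_f` per state. Both are hypothetical illustrative values, and the function name is our own.

```python
# Hypothetical chain induced by a fixed stationary strategy f.
P_f = [
    [0.8, 0.2],
    [0.4, 0.6],
]
# Hypothetical expected immediate reward in each state under f.
r_f = [1.0, 0.0]


def average_reward(P, r, start, T):
    """Return 1/(T+1) * sum_{t=0}^{T} E[R_t] when starting in state `start`."""
    n = len(P)
    # Distribution of the state at time t, starting with all mass on `start`.
    dist = [1.0 if i == start else 0.0 for i in range(n)]
    total = 0.0
    for _ in range(T + 1):
        # E[R_t] is the reward vector averaged over the time-t distribution.
        total += sum(d * x for d, x in zip(dist, r))
        dist = [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]
    return total / (T + 1)


for horizon in (10, 100, 10_000):
    print(horizon, average_reward(P_f, r_f, 0, horizon), average_reward(P_f, r_f, 1, horizon))
```

For this irreducible toy chain the averages from both initial states approach the same number, $2/3$, which is the stationary probability of state 0 multiplied by its reward of 1. This illustrates why the value of an irreducible limiting average MDP can be written without reference to the initial state.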
The individually rational player always wants to maximize his own payoff, so the problem is an optimal control problem in which the player wants to choose the stationary strategy $f$ that maximizes the value $v_\alpha(f)$.
Put differently, the player tries to control the process in such a manner that his own value is maximized (Blackwell, 1962). In the literature these MDP models are used not only for stochastic games but for a wide range of applications, including decision-theoretic planning and learning robot control. MDPs are the standard model for learning sequential decision making (Otterlo, 2009). Algorithms for finding optimal values of an MDP are divided into two categories: model-free and model-based algorithms. The first category is also known as reinforcement learning and generates approximations, while the second is exact and uses dynamic programming as its basis (Blackwell, 1962; Otterlo, 2009).
Because we are dealing with stochastic games in which we assume perfect information, all information needed to calculate an exact result is available. We shall therefore only look at model-based algorithms. These algorithms optimize value functions either by iterating over the value function (value iteration) or by changing the so-called policy (policy iteration). The policy of an MDP can be seen as a fixed pure strategy that is always taken when in a certain state. At the heart of these algorithms is the Bellman equation, which defines the value function recursively: the value of a state is expressed in terms of the values of its successor states (Otterlo, 2009). The equation is stated as (Otterlo, 2009):
\[ V^\pi(s) = E_\pi\{ r_t + V^\pi(s_{t+1}) \mid s_t = s \} = \sum_{s'} T(s, \pi(s), s') \left( R(s, a, s') + V^\pi(s') \right) \]
Here the policy is represented by $\pi$, the transitions by $T$, the current state by $s$, the rewards by $R$ and the current reward by $r$; the action $a$ is the one selected by the policy, $a = \pi(s)$. This Bellman equation becomes important when we arrive at an algorithm for Type II games. For now, it is most important to acknowledge that stochastic games can be seen as competitive MDPs in which Markov chains are responsible for the transition dynamics between the states.
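To make the recursive character of the Bellman equation concrete, the following sketch evaluates a fixed policy by applying the equation repeatedly until the values stop changing. It is an illustration under stated assumptions and not the algorithm developed later for Type II games. The three-state MDP, the action name `"go"`, the reward function and all function names are hypothetical. Because the equation is used here in the undiscounted form given above, the sketch assumes an absorbing terminal state with value zero, so that the expected total reward stays finite.

```python
# Hypothetical episodic MDP: states 0 and 1 are transient, state 2 is absorbing.
TERMINAL = 2

# A fixed policy pi: one action per non-terminal state.
policy = {0: "go", 1: "go"}

# T[(s, a)] maps each successor s' to its probability T(s, a, s').
T = {
    (0, "go"): {0: 0.5, 1: 0.5},
    (1, "go"): {2: 1.0},
}


def R(s, a, s_next):
    """Hypothetical reward R(s, a, s'): 2 for reaching the terminal state, else 1."""
    return 2.0 if s_next == TERMINAL else 1.0


def evaluate_policy(policy, T, R, tol=1e-10, max_sweeps=10_000):
    """Iterate V(s) <- sum_{s'} T(s, pi(s), s') * (R(s, a, s') + V(s'))."""
    V = {s: 0.0 for s in policy}
    V[TERMINAL] = 0.0  # the absorbing state yields no further reward
    for _ in range(max_sweeps):
        delta = 0.0
        for s, a in policy.items():
            new_v = sum(p * (R(s, a, s2) + V[s2]) for s2, p in T[(s, a)].items())
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v  # in-place update within the sweep
        if delta < tol:
            break
    return V


V = evaluate_policy(policy, T, R)
print(V)
```

The fixed point can be verified by hand for this toy example: $V^\pi(1) = 2$, and $V^\pi(0) = 0.5\,(1 + 2) + 0.5\,(1 + V^\pi(0))$ gives $V^\pi(0) = 4$. Policy iteration alternates such an evaluation step with an improvement step that changes the policy, whereas value iteration folds the maximization over actions directly into the update.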