1.3 Introduction to Learning Automata
1.3.1 Basic elements
A general block diagram of the interaction between environment and an automa- ton is provided in Fig. 1.1.
Environment
Learning Automaton
Reinforcement Action
Figure 1.1: Block diagram of the interaction between a Learning Automaton and environment
The learning process of learning automata can be simply described as follows [73]: during the interaction with environment, the automaton randomly selects an action from the action set that includes the possible actions the automaton can take
at its current state, based on a probability distribution. The probabilities of select- ing the actions are the same initially. Then the environment generates a response to the action, named reinforcement signal. According to the response, the automaton updates the action probability distribution. The algorithm used to update the action probabilities is called the learning algorithm or reinforcement scheme. Afterwards, a new action is selected according to the updated probability distribution. Regarding this action, a response is elicited by the environment, and the procedure is repeated. It can be seen that the implementation of learning automata mainly concentrates on three aspects: how to select actions, learning algorithm and reinforcement sig- nal, as given in Fig. 1.2. Unlike EAs that mainly apply the concept of biological evolutionary concept, learning automata methods are based on action selection and probability distribution.
LA Action
selection algorithmLearning Reinforcement signal
Figure 1.2: The three aspects of LA
The automaton The state X X={X ,X,,X } Input set R R={r,r,,r } or {( ′ ′)} Output set A A={a,a,,a } Transition Function F Output function G
Figure 1.3: The illustration of an automaton
A learning automaton is an adaptive decision-making unit situated in a random environment. The learning automaton learns the optimal action through repeated interactions with its environment. The illustration of an automaton is given in Fig. 1.3. Being an adaptive discrete machine, an automaton can be described as:
1.3 Introduction to Learning Automata 16
These entities are described precisely as follows:
• Thestateof the automaton at instantn, denoted byX(n), is an element of the finite set:
X ={X1, X2, . . . , XXN} (1.3.2)
• The output or action of an automaton at instant n, denoted by a(n), is an element of the finite set:
A={a1, a2, . . . , aaN} (1.3.3)
• The input to an automaton at instantn, denoted byr(n), is an element of a set
R, which could be either a finite set or an infinite set, such as an interval on the real line. Thus,
R ={r1, r2, . . . , rrN}orR={( ˜α,β˜)} (1.3.4) whereα˜andβ˜are real numbers.
• The probability of choosing actioniat instantn, denoted bypi(n), is an ele- ment of the finite set:
P ={p1, p2, . . . , paN} (1.3.5)
• F(·,·) is transition function, which determines the state at instant (n + 1) regarding the state and input at instantn:
X(n+ 1) =F[X(n), r(n)] (1.3.6)
or F is a mapping from X ×R → X. F could be either deterministic or
stochastic.
• Output functionG(·)determines the output of the automaton at any instantn
according to its current state:
a(n) =G[X(n)] (1.3.7)
orGis a mappingX →A. It is again either deterministic or stochastic.
• T represents the reinforcement scheme (learning algorithm), which deter- mines the action probabilities at instantn + 1 regarding the probabilities at instantn:
p(n+ 1) =T[p(n)] (1.3.8)
Basically, the automaton takes in a sequence of inputs and outputs a sequence
of actions with respect to the observation time n. The working of an automaton
can be described as follows: given an initial stateX(0), actiona(0)is determined by functionG. Based on current state, the response generated by the environment,
r(0), and the transition function determine the next state X(1). These operations are performed recursively, and in this way, the state sequence and action sequence are obtained according to any given input sequence [74].
In terms of functionsFandG, there are two types of automaton:
• Deterministic automaton. For this automaton, functions F and G are both
deterministic mappings. It means that with a given initial state and a given input, the succeeding state and action are uniquely specified. The mappings of the two functions can be represented either in the form of matrices or graphs if the input set is finite. Corresponding to each input, the matrix or the graph can indicate how the present state transfers to a new state. In this case, the transition matrices consist of elements that are only either 0 or 1.
• Stochastic automaton. For this automaton, at least one of the functionsFand
G is stochastic. In other words, given an initial state and an input, there is no certainty concerning the succeeding state and action, which are associated with probabilities. If transition functionFis stochastic, given the present state and input, the next state is at random, andFgives the conditional probabilities (or called transition probabilities) of reaching the various states. If the output functionGis stochastic, given the present state, the action taken is at random and is determined by the conditional probabilities provided byG.
If the conditional probabilities provided byFandGare constant,i.e.they are independent ofnand the input sequence, the stochastic automaton is referred to as a fixed-structure stochastic automaton. In this case, the elements of the
1.3 Introduction to Learning Automata 18
transition matrix are constants which take values in the interval [0,1], and each transition matrix is a stochastic matrix. On the other hand, if the transition probability at instantnis updated on the basis of the input at that instant, the automaton is called a variable-structure stochastic automaton. The elements of the transition matrix are in [0,1], but they are no longer constants any more, as they are updated withn.
The environment
Environment refers to the aggregate of all the external conditions and influences which would affect the automata [74]. The role of the environment is to establish the relationship between the actions taken by the automaton and the signal inputs to the automaton [69]. An environment can be mathematically defined as a triple
{A, C, R}. A represents a finite input set. R represents the output set, i.ethe en- vironment’s response to the action taken by the automaton. The environment is referred to as aP-model environment when its response belongs to the binary set
{0,1}; the environment is named asS-model environment when its response takes an arbitrary value in the closed segment [0,1]; Q-model environment refers to the environment whose response is one of a finite number of values in the interval [0,1]. The environment is characterized by C, which represents the penalty probability,
i.e.the likelihood that the application of an action to the environment will result in a penalty output. If the penalty probabilities are constant, the environment is said to be stationary; otherwise, it is nonstationary.
Reinforcement schemes
Reinforcement schemes are the methods used to update action probabilities at each instant. They were originally proposed to model animal learning [61], but later it was found that they can be applied in the field of learning automata successfully. Choosing reinforcement scheme is a crucial factor that affects the performance of learning automata [75]. In general, a reinforcement scheme can be represented by:
p(n+ 1) =T[p(n), r(n), a(n)] (1.3.9)
whereTis a learning operator;r(n)anda(n)represent the input to the automaton and the action taken by the automaton at instantn, respectively.
The basic idea behind a reinforcement scheme can be described as follows: if the automaton selects an actionai at instantnand receives a nonpenalty input, the action probability of the actionai,i.e. pi(n), will be increased, and the probabilities for all other actions will be decreased. In this case, the change occurring onpi(n)is
called reward. For a penalty input,pi(n)is decreased and other action probabilities are increased. At this moment, the change applied onpi(n)is called penalty.
If pi(n + 1)is a linear function of pi(n), the reinforcement scheme can be de- scribed as linear, such as Linear Reward-Inaction (LR−I) algorithm [74], which is
known to be very effective in many applications. TheLR−Iupdates the action prob-
abilities as follows (assuminga(n) =ai):
pi(n+ 1) =pi(n) + ˜λr(n)(1−pi(n))
pj(n+ 1) =pj(n)−λr˜ (n)pj(n), for allj 6=i (1.3.10) On the other hand, if pi(n+ 1)is a nonlinear function ofpi(n), the reinforcement scheme is said to be nonlinear. Several non-linear schemes have been suggested by Viswanathan and Narendra [76], Lakshmivarahan and Thathachar [77]. Sometimes, it is advantageous to use different schemes to update pi(n), while the selection of
the reinforcement scheme depends on the intervals the value ofpi(n)lies in.