Simple Reinforcement Learning Algorithm

Chapter 3. ALBidS: Adaptive Learning Strategic Bidding System

3.6. Main Agent

3.6.1. Simple Reinforcement Learning Algorithm

This sub-section presents the implementation of a simple version of a reinforcement learning algorithm. This version keeps only the basic characteristics of this type of algorithm, simplified as far as possible, in order to provide a fast response.

3.6.1.1. Context

Reinforcement learning is an area of machine learning in computer science, and therefore also a branch of Artificial Intelligence. It deals with the optimization of an agent’s performance in a given environment, in order to maximize a reward it is awarded [Tokic, 2010].

In this type of problem, an agent must decide the best action to take, based on its current state. When this step is repeated, the problem is known as a Markov Decision Process [Kaelbling et al., 1996]. This automated learning scheme implies that there is little need for a human expert who knows about the domain of application.

Much less time will be spent designing a solution, since there is no need for the definition of rules as with expert systems.

A reinforcement learning agent interacts with its environment in discrete time steps [Sutton and Barto, 1998].

At each time t, the agent receives an observation o_t, which typically includes the reward r_t. It then chooses an action a_t from the set of available actions. The environment moves to a new state s_{t+1} and the reward r_{t+1} associated with the transition (s_t, a_t, s_{t+1}) is determined. The goal of a reinforcement learning agent is to collect as much reward as possible. The agent can choose any action as a function of the history and it can even randomize its action selection.
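The interaction loop described above can be sketched as follows. This is a minimal illustration only: `ToyEnvironment`, `run_episode`, and the two policies are hypothetical names introduced here, and the two-state environment is an assumption that does not come from ALBidS.

```python
import random


class ToyEnvironment:
    """Hypothetical two-state environment, used only for illustration."""

    def __init__(self):
        self.state = 0

    def step(self, action):
        # Action 1 taken in state 0 pays a reward of 1; everything else pays 0.
        reward = 1.0 if (self.state == 0 and action == 1) else 0.0
        # Transition (s_t, a_t, s_{t+1}): the toy state simply alternates.
        self.state = 1 - self.state
        return self.state, reward


def run_episode(env, policy, steps, rng):
    """Run the agent-environment loop and return the total collected reward."""
    state = env.state
    total_reward = 0.0
    for _ in range(steps):
        action = policy(state, rng)        # a_t, chosen from the available actions
        state, reward = env.step(action)   # environment returns s_{t+1} and r_{t+1}
        total_reward += reward
    return total_reward


def random_policy(state, rng):
    # The agent may randomize its action selection.
    return rng.choice([0, 1])


def always_one(state, rng):
    # A fixed policy that ignores the state.
    return 1


if __name__ == "__main__":
    rng = random.Random(0)
    print(run_episode(ToyEnvironment(), always_one, 10, rng))
    print(run_episode(ToyEnvironment(), random_policy, 10, rng))
```

Passing the policy as a function keeps the loop independent of how actions are selected, which is the part that the learning algorithm is meant to improve.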

A reinforcement learning global view is presented in Figure 3.20.

Figure 3.20 – Reinforcement learning overview (adapted from [Sutton and Barto, 1998]).

When an agent's performance is compared to that of an agent which acts optimally from the beginning, the difference in their performances gives rise to the notion of regret [Auer et al., 2010]. In order to act near optimally, the agent must reason about the long-term consequences of its actions. The agent is not told which action to take, as in most forms of machine learning, but instead must discover which actions provide the most reward by trying them. In the most interesting and challenging cases, actions may affect not only the immediate reward, but also the next situation, and through that all subsequent rewards. These two characteristics, trial-and-error search and delayed reward, are the two most important distinguishing features of reinforcement learning.

One of the challenges that arises in reinforcement learning and not in other kinds of learning is the trade-off between exploration and exploitation. To obtain a lot of reward, a reinforcement learning agent must prefer actions that it has tried in the past and found to be effective in producing reward. But to discover which actions are the most profitable, it has to select actions that it has not tried before. The agent has to exploit what it already knows in order to obtain reward, but it also has to explore in order to make better action selections in the future. The dilemma is that neither exploitation nor exploration can be pursued exclusively without failing at the task. Thus, reinforcement learning is particularly well suited to problems which include a long-term versus short-term reward trade-off [Sutton, 1988].
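One common way of balancing the two is epsilon-greedy selection, sketched below. It is shown only to make the trade-off concrete; it is not the selection rule of the Simple Reinforcement Learning Algorithm, and the function name `epsilon_greedy` and its parameters are assumptions introduced for this example.

```python
import random


def epsilon_greedy(estimates, epsilon, rng):
    """Pick an action index: explore with probability epsilon, otherwise exploit.

    estimates: list of estimated rewards, one per action (higher is better).
    epsilon:   probability of exploring, between 0 and 1.
    rng:       a random.Random instance, passed in for reproducibility.
    """
    if rng.random() < epsilon:
        # Explore: try any action, including ones that currently look poor.
        return rng.randrange(len(estimates))
    # Exploit: take the action with the best estimate (first one on ties).
    return estimates.index(max(estimates))


if __name__ == "__main__":
    rng = random.Random(42)
    estimates = [0.2, 0.9, 0.5]
    picks = [epsilon_greedy(estimates, 0.1, rng) for _ in range(1000)]
    # Most picks should be action 1, with occasional exploratory picks.
    print({action: picks.count(action) for action in range(len(estimates))})
```

With epsilon equal to 0 the agent only exploits, and with epsilon equal to 1 it only explores; neither extreme learns well, which is the dilemma described above.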

3.6.1.2. Implementation

A reinforcement learning algorithm provides the means to choose, at a given moment and in given circumstances, the strategy that presents the best results for the current scenario. Given the responses proposed for each problem, the algorithm chooses the one that is most likely to provide the best results. To do so, it takes into account the past experience concerning each alternative strategy's answers, and their results, for each context.

For each context, the Simple Reinforcement Learning Algorithm starts with the same value of confidence for each strategy. As the simulation progresses the values of confidence are updated for each context, through the application of a reinforcement that is defined according to the strategies’ performance.

The confidence value C for each possible action (strategy), at the time t, is updated through the application of (3.1).

C_{t+1} = C_t + Reinf    (3.1)

where Reinf refers to the reinforcement of the considered action for time t+1. The reinforcement is calculated as in (3.2).

Reinf_i = p \times Dif_i    (3.2)

where Dif_i is the difference between a strategy's prediction and the actual price that was verified in the market. The difference value is subject to a penalization p when the predicted value is above the actual market price, in the case of seller agents, or below the market price, in the case of buyer agents. This is because if a seller's final bid is located above the market price, the agent will not be able to sell (see sub-section 2.2.1.1 – Spot Market). The same applies to buyers when bidding below the market price. In these cases, the penalization assumes the value 3; otherwise p is equal to 1, therefore having no influence on the difference value. Note that, according to the way this algorithm is implemented, the higher the difference between a prediction and the actual value, the higher the reinforcement. Therefore, the C value does not represent a confidence value in the strategy's success, but rather the confidence in its failure. So, the best action to choose in this scope is the one that presents the lowest C value.
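The update in (3.1) and (3.2) can be sketched in a few lines. This is a hedged illustration, not the ALBidS code: the function names are hypothetical, and Dif is taken here as the absolute difference between prediction and market price, which is an assumption since the text only speaks of "the difference".

```python
def reinforcement(prediction, market_price, is_seller, penalty=3.0):
    """Reinf = p * Dif, following (3.2)."""
    # Assumption: Dif is the absolute distance to the market price.
    dif = abs(prediction - market_price)
    if is_seller:
        # A seller bidding above the market price cannot sell.
        missed = prediction > market_price
    else:
        # A buyer bidding below the market price cannot buy.
        missed = prediction < market_price
    p = penalty if missed else 1.0
    return p * dif


def update_confidence(c, prediction, market_price, is_seller):
    """C_{t+1} = C_t + Reinf, following (3.1). Lower C means a better strategy."""
    return c + reinforcement(prediction, market_price, is_seller)


if __name__ == "__main__":
    # Seller strategies predicting 48.0 and 52.0 against a market price of 50.0.
    print(update_confidence(0.0, 48.0, 50.0, is_seller=True))  # below price: p = 1
    print(update_confidence(0.0, 52.0, 50.0, is_seller=True))  # above price: p = 3
```

Both predictions miss the market price by the same amount, but for a seller the one above the price accumulates three times more failure confidence, because that bid would not have been accepted.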

The Simple Reinforcement Learning Algorithm, similarly to the other reinforcement learning algorithms presented in the scope of this work, allows the attribution of a weight value to each strategy. This weight reflects the importance of a strategy to the system, i.e., a strategy with a larger weight will separate faster from the others, both in case of success and in case of failure.
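The text does not state how the weight enters the update. One possible reading, shown here purely as an assumption, is that the weight scales the reinforcement before it is added to C; the names `weighted_update` and `simulate` are hypothetical.

```python
def weighted_update(c, reinf, weight):
    """Assumed form: C_{t+1} = C_t + weight * Reinf, with weight in [0, 1]."""
    return c + weight * reinf


def simulate(reinforcements, weight, c0=0.0):
    """Return the sequence of C values obtained from a list of reinforcements."""
    values = [c0]
    for reinf in reinforcements:
        values.append(weighted_update(values[-1], reinf, weight))
    return values


if __name__ == "__main__":
    daily_reinforcements = [2.0, 4.0, 1.0]
    print(simulate(daily_reinforcements, weight=1.0))  # full weight
    print(simulate(daily_reinforcements, weight=0.5))  # half weight
```

Under this assumption, the same sequence of reinforcements moves the C value of a fully weighted strategy twice as far as that of a strategy with half the weight, so its curve separates from the others more quickly.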

3.6.1.3. Experimental Findings

In order to verify the correct performance of the Simple Reinforcement Learning Algorithm, three tests are presented. Each test considers a different simulation of fifteen consecutive days, including three seller agents, for whom the results are analyzed, and seven buyer agents. The simulations were performed considering solely the first period of the day. These simulations were kept simple on purpose, only to allow a demonstration of the algorithm's adequate performance. More elaborate case studies are presented in Chapter 4 - Case-Studying.

The first test presents a simulation using only three strategies. Figure 3.21 presents the bids each of the three strategies proposed, with equal weights for all strategies, and Figure 3.22 presents the confidence values that each strategy presents to the Simple Reinforcement Learning Algorithm throughout the days.

Figure 3.21 – Strategies’ bid proposals.

Figure 3.22 – Strategies’ confidence values.

Comparing Figures 3.21 and 3.22, it is visible that the Average 2 strategy was the one that presented values closest to the market price, and located below the market price, for the first two days. This is reflected in this strategy's lower failure confidence value for the following days (days 2 and 3), as the performance of a strategy in one day is used to update the values for the next. From day 3 onward, the Average 2 strategy started presenting bids above the market price. For that reason its failure confidence value started to rise, and Average 1, which surpassed the market price on only one day, overtook it as the best proposal strategy. The only exception is day 10, on which the failure confidence value of Average 1 very briefly rises above that of Average 2.

Figure 3.23 presents the incomes achieved by the Simple Reinforcement Learning Algorithm’s choice of Average 1’s bid proposal.

Figure 3.23 – Average 1’s achieved incomes.

From Figure 3.23 it is visible that Average 1's proposed bid was chosen and achieved positive incomes on all days on which it presented the best (smallest) failure confidence value, except for day 4. On that day it was chosen as the algorithm's final bid for being the best option (as seen in Figure 3.22), but since its proposed bid was located above the market price (as seen in Figure 3.21), it was unable to achieve incomes. On the other days on which Average 1 did not achieve incomes (days 2, 3 and 10), this was because it was not the strategy presenting the best confidence value; therefore its proposal was not chosen as the algorithm's response.

Regarding day 1, on which all strategies present equal confidence values, Average 1 was chosen as the algorithm's response, and consequently was able to achieve incomes, only by chance: in this case the choice is made randomly, and it simply "got lucky" in being chosen.
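The selection rule described above, lowest failure confidence with a random choice among ties, can be sketched as follows. The function name `choose_strategy` and the dictionary layout are assumptions made for this example, and the confidence values are invented for illustration rather than taken from the figures.

```python
import random


def choose_strategy(confidences, rng):
    """Return the name of the strategy with the lowest failure confidence.

    confidences: dict mapping strategy name to its current C value.
    rng:         a random.Random instance used to break ties.
    """
    lowest = min(confidences.values())
    # Several strategies may share the lowest value, as on the first day.
    candidates = [name for name, c in confidences.items() if c == lowest]
    return rng.choice(candidates)


if __name__ == "__main__":
    rng = random.Random(7)
    # Day 1: all strategies start with the same confidence, so the pick is random.
    day_1 = {"Average 1": 0.0, "Average 2": 0.0, "Linear Regression 1": 0.0}
    print(choose_strategy(day_1, rng))
    # Later day (illustrative values): the lowest C wins.
    later = {"Average 1": 4.0, "Average 2": 9.5, "Linear Regression 1": 12.0}
    print(choose_strategy(later, rng))
```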

Figure 3.24 presents the confidence values of the same three strategies, in exactly the same scenario, but this time with Linear Regression 1 being attributed a weight of 100%, while the other two strategies receive only half of that value. This test is intended to show the influence of the weight value.

Figure 3.24 – Strategies’ confidence values with Linear Regression 1’s higher weight.

Figure 3.24 shows that the difference between the confidence value of Linear Regression 1 and the confidence values of the other strategies is much higher than in the first test (Figure 3.22). It separated faster from the rest because it was attributed a higher weight.

The third test considers a much higher number of bid proposal strategies. Figure 3.25 presents the confidence values of these strategies to the Simple Reinforcement Learning Algorithm.

Figure 3.25 – Strategies’ confidence values in the third test.

From Figure 3.25 the evolution of the confidence values of the several strategies can be seen. This shows the capability of the Simple Reinforcement Learning Algorithm to consider a high number of alternative strategies.

In document Adaptive learning in agents behaviour: a framework for electricity markets simulation (Page 80-85)