Model Function Learning Rate - Reinforcement learning with motivations for realistic agents


7.4.2 Model Function Learning Rate

The testing configurations outlined in Table 7.16 are meant to show how the model-function approximation ANN learning rate (ηM) influences the learned policies. To clearly interpret the results of varying ηM, all other variables are held constant, as listed in Table 7.14. The results are shown in Figure 7.22 for Tarzan and in Figure 7.23 for Jane. Overall results are discussed in Section 7.5.

Test   1     2     3     4     5
ηM     0.9   0.7   0.5   0.3   0.1

Table 7.16: Outline of how ηM changes across the different testing configurations in Game Three with Dyna-Q and ε-greedy.
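The one-variable-at-a-time design described above can be sketched as a small helper that copies a baseline configuration and changes a single entry. This is only an illustrative sketch: the function name `one_factor_sweep`, the dictionary keys, and the baseline values in `BASE` are hypothetical placeholders, since the values actually held constant are those of Table 7.14. Only the five ηM values come from Table 7.16.

```python
from typing import Any, Dict, List


def one_factor_sweep(base: Dict[str, Any], name: str, values: List[Any]) -> List[Dict[str, Any]]:
    """Return one configuration per value, changing only the parameter `name`."""
    if name not in base:
        raise KeyError(f"unknown parameter: {name}")
    # Each configuration is a fresh copy of the baseline with one entry replaced.
    return [{**base, name: value} for value in values]


# Hypothetical baseline (placeholder values, NOT the contents of Table 7.14).
BASE = {"eta_m": 0.5, "planning_steps": 5, "eps_start": 0.3, "eps_end": 0.1}

# The five eta_M values of Table 7.16, one configuration per test.
ETA_M_TESTS = one_factor_sweep(BASE, "eta_m", [0.9, 0.7, 0.5, 0.3, 0.1])
```

Because every configuration differs from the baseline in exactly one entry, any change in the resulting policies can be attributed to that parameter.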

[Figure panels: (a) Percent reward comparing Tarzan η (Game 3, egreedy, dynaq, test30); (b) Percent in bounds comparing Tarzan η (Game 3, egreedy, dynaq, test30). Horizontal axis: tests t-1 to t-5; vertical axis: %, 0 to 100.]

Figure 7.22: Comparing the results of changing ηM for the agent Tarzan, where t-n defines the nth test according to Table 7.15. The testing configurations are run in Game Three using Dyna-Q and ε-greedy.

[Figure panels: (a) Percent reward comparing Jane η (Game 3, egreedy, dynaq, test30); (b) Percent in bounds comparing Jane η (Game 3, egreedy, dynaq, test30). Horizontal axis: tests j-1 to j-5; vertical axis: %, 0 to 100.]

Figure 7.23: Comparing the results of changing ηM for the agent Jane, where j-n defines the nth test according to Table 7.15. The testing configurations are run in Game Three using Dyna-Q and ε-greedy.

7.4.3 Dyna-Q Planning Steps

The testing configurations outlined in Table 7.17 are meant to show how the number of planning steps (pS) affects the learned policies. To clearly interpret the results of varying pS, all other variables are held constant, as listed in Table 5.13. The results are shown in Figure 7.24 for Tarzan and in Figure 7.25 for Jane. Overall results are discussed in Section 7.5.

Test   1   2   3   4   5    6    7
pS     0   1   2   5   10   20   50

Table 7.17: Outline of how pS changes across the different testing configurations in Game Three with Dyna-Q and ε-greedy.
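To make the role of pS concrete, the following is a minimal tabular sketch of a Dyna-Q step: one Q-learning update from real experience, followed by pS updates replayed from a learned model. It is not the implementation evaluated here, which approximates both the Q-function and the model function with ANNs; the function name, the deterministic table model, and the default `alpha` and `gamma` values are assumptions made for illustration.

```python
import random
from collections import defaultdict


def dyna_q_update(q, model, s, a, r, s_next, actions,
                  alpha=0.5, gamma=0.9, planning_steps=5, rng=random):
    """One real Q-learning update, then `planning_steps` simulated updates."""

    def backup(state, action, reward, next_state):
        # Standard Q-learning target: reward plus discounted best next value.
        best_next = max(q[(next_state, b)] for b in actions)
        q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])

    backup(s, a, r, s_next)        # learn from the real transition
    model[(s, a)] = (r, s_next)    # assumed deterministic table model
    seen = list(model.keys())
    for _ in range(planning_steps):
        # Planning: replay a previously observed state-action pair from the model.
        ps, pa = rng.choice(seen)
        pr, pn = model[(ps, pa)]
        backup(ps, pa, pr, pn)


q = defaultdict(float)   # Q-values default to 0.0
model = {}
dyna_q_update(q, model, "s0", "go", 1.0, "s1", ["go", "stay"], planning_steps=2)
```

With `planning_steps=0` the function reduces to plain Q-learning, which corresponds to Test 1 in Table 7.17. Larger values reuse the model more heavily, so any error in the model is also replayed more often.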

[Figure panels: (a) Percent reward comparing Tarzan pS (Game 3, egreedy, dynaq, test30); (b) Percent in bounds comparing Tarzan pS (Game 3, egreedy, dynaq, test30). Horizontal axis: tests t-1 to t-7; vertical axis: %, 0 to 100.]

Figure 7.24: Comparing the results of changing pS for the agent Tarzan, where t-n defines the nth test according to Table 7.15. The testing configurations are run in Game Three using Dyna-Q and ε-greedy.

[Figure panels: (a) Percent reward comparing Jane pS (Game 3, egreedy, dynaq, test30); (b) Percent in bounds comparing Jane pS (Game 3, egreedy, dynaq, test30). Horizontal axis: tests j-1 to j-7; vertical axis: %, 0 to 100.]

Figure 7.25: Comparing the results of changing pS for the agent Jane, where j-n defines the nth test according to Table 7.15. The testing configurations are run in Game Three using Dyna-Q and ε-greedy.

7.5 Discussion

The added complexity of Game Three over Games One and Two makes a direct comparison difficult. This section evaluates the most significant variables, compares the results of the different RL methods (Sarsa, Q-learning, and Dyna-Q), and compares the results between agents.

The difficulty in this scenario is not the delayed reward associated with opening the fridge, but rather the fact that an agent could do nothing and still gain social interaction through another agent's actions.

7.5.1 Parameters

Exploration Rate. The ε-greedy action selection algorithm showed predictable results, with the optimal values being εs = 0.3 and εe = 0.1 for both Sarsa (Figure 7.2) and Q-learning (Figure 7.11). The results from the agent Jane show slightly higher performance than those from the agent Tarzan, regardless of the reinforcement learning algorithm. The difference in results between agents is caused by their different threshold values.
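As an illustration of ε-greedy selection with a changing exploration rate, the sketch below assumes that εs and εe denote the exploration rate at the start and at the end of training, and that the rate moves linearly between them. Both the interpretation and the linear schedule are assumptions made for this example; the function names are hypothetical.

```python
import random


def epsilon_at(step, total_steps, eps_start=0.3, eps_end=0.1):
    """Linearly interpolate the exploration rate (assumed schedule)."""
    if total_steps <= 1:
        return eps_end
    # Clamp the progress fraction to [0, 1] so out-of-range steps are safe.
    frac = min(max(step / (total_steps - 1), 0.0), 1.0)
    return eps_start + frac * (eps_end - eps_start)


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=q_values.__getitem__)


# Action chosen halfway through a 101-step run, where epsilon is about 0.2.
action = epsilon_greedy([0.1, 0.9, 0.3], epsilon_at(50, 101))
```

Starting with a higher rate favours exploration early on, while the lower final rate lets the agent mostly exploit the learned Q-values.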

Reinforcement Learning Rate. The results comparing the learning rate for Sarsa (Figures 7.3 and 7.4) and for Q-learning (Figures 7.12 and 7.15) show very little performance difference between values of αs and αe. In these tests, the agent Jane shows higher performance than the agent Tarzan with both Sarsa and Q-learning.

Q-Function ANN Learning Rate. Results from using Sarsa (Figure 7.5 for Tarzan and Figure 7.6 for Jane) show a nearly equal median percent of reward for ηQ = 0.9, ηQ = 0.7, ηQ = 0.5, and ηQ = 0.3. Q-learning gives similar results, with slightly more variation (Figure 7.14 for Tarzan and Figure 7.15 for Jane). The results for Jane consistently show higher performance (percent of reward and percent of steps in bounds) than the results for Tarzan.

Motive Reward Factor. The minimum reward received by the agent from one of its motivations significantly affects the resulting percentages if mR < −0.1; see Figures 7.8 and 7.17. Note that Tarzan and Jane have different optimal values, caused by their different motivation thresholds.

Q-Function ANN Hidden Neurons. Testing has shown that it is best to use only one hidden neuron layer. With this game scenario, good policies are found with at least 10 and up to 100 hidden neurons.

Dyna-Q. All the results associated with Dyna-Q, in Figures 7.20, 7.21, 7.22, 7.23, 7.24, and 7.25, show worse performance than those of Q-learning, indicating that only a few of the parameter configurations were able to accurately train the model-function approximation. Certain parameter configurations still reach 100% of total reward with Dyna-Q, indicating that the model-function approximation was correct in some cases.

7.5.2 Consistency

Testing the consistency of the results involved repeating tests under identical conditions and comparing the outcomes. In this case, the values in Table 7.18 are used in 5 separate game tests, with results in Figures 7.26 and 7.27. The results show similar median percent of reward, but larger variation in the percent of steps in bounds.

RL Algorithm   HnQ     ηQ    mR      RL Rate              Exploration Rate
Q-learning     [100]   0.7   -0.05   αs = 0.9, αe = 0.7   εs = 0.3, εe = 0.1

Table 7.18: Outline of the variables used to test the consistency of Game Three's optimal parameters.

[Figure panels: Percent reward comparing consistency of Tarzan in Game 3; Percent reward comparing consistency of Jane in Game 3. Horizontal axis: repeated test, 1 to 5; vertical axis: %, 0 to 100.]

Figure 7.26: Comparing the percent of reward from policies trained using optimal parameters outlined in Table 7.18.

[Figure panels: Percent in bounds comparing consistency of Tarzan in Game 3; Percent in bounds comparing consistency of Jane in Game 3. Horizontal axis: repeated test, 1 to 5; vertical axis: %, 0 to 100.]

Figure 7.27: Comparing the percent of steps in bounds from policies trained using optimal parameters outlined in Table 7.18.
