7.3 Q Learning
7.4.2 Model Function Learning Rate
The testing configurations outlined in Table 7.16 are meant to show how the model-function approximation ANN learning rate (ηM) influences the learned policies. To clearly interpret the results of varying ηM, all other variables are held constant, as seen in Table 7.14. The results are shown in Figure 7.22 for Tarzan and in Figure 7.23 for Jane. Overall results are discussed in Section 7.5.
Test 1 2 3 4 5
ηM 0.9 0.7 0.5 0.3 0.1
Table 7.16: Outline of how ηM changes across the different testing configurations in Game Three with Dyna-Q and ε-greedy.
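The one-variable-at-a-time design described above can be sketched as follows. This is only an illustration: the class name `RunConfig`, its field names, and the baseline values are hypothetical placeholders (they are not the values from Table 7.14); only the five ηM values are taken from Table 7.16.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class RunConfig:
    # Hypothetical field names chosen for this sketch.
    eta_m: float          # model-function ANN learning rate (ηM)
    planning_steps: int   # Dyna-Q planning steps (pS)
    epsilon_start: float  # exploration rate at the start of training
    epsilon_end: float    # exploration rate at the end of training


# Placeholder baseline: NOT the values held constant in Table 7.14.
BASELINE = RunConfig(eta_m=0.5, planning_steps=5,
                     epsilon_start=0.3, epsilon_end=0.1)

# Values of ηM for tests 1 to 5, as listed in Table 7.16.
ETA_M_VALUES = [0.9, 0.7, 0.5, 0.3, 0.1]


def eta_m_sweep(baseline, values):
    """Build one configuration per ηM value, leaving every other field fixed."""
    return [replace(baseline, eta_m=value) for value in values]


configs = eta_m_sweep(BASELINE, ETA_M_VALUES)
```

Because only `eta_m` differs between the generated configurations, any difference in the resulting policies can be attributed to ηM alone.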
[Figure 7.22 plots: (a) Percent reward comparing Tarzan ηM (Game 3, egreedy, dynaq, test30); (b) Percent in bounds comparing Tarzan ηM (Game 3, egreedy, dynaq, test30). Axes: test (t-1 to t-5) and % (0 to 100).]
Figure 7.22: Comparing the results of changing ηM for the agent Tarzan, where t-n defines the nth test according to Table 7.16. The testing configurations are run in Game Three using Dyna-Q and ε-greedy.
[Figure 7.23 plots: (a) Percent reward comparing Jane ηM (Game 3, egreedy, dynaq, test30); (b) Percent in bounds comparing Jane ηM (Game 3, egreedy, dynaq, test30). Axes: test (j-1 to j-5) and % (0 to 100).]
Figure 7.23: Comparing the results of changing ηM for the agent Jane, where j-n defines the nth test according to Table 7.16. The testing configurations are run in Game Three using Dyna-Q and ε-greedy.
7.4.3 Dyna-Q Planning Steps
The testing configurations outlined in Table 7.17 are meant to show how the number of planning steps (pS) affects the learned policies. To clearly interpret the results of varying pS, all other variables are held constant, as seen in Table 5.13. The results are shown in Figure 7.24 for Tarzan and in Figure 7.25 for Jane. Overall results are discussed in Section 7.5.
Test 1 2 3 4 5 6 7
pS 0 1 2 5 10 20 50
Table 7.17: Outline of how pS changes across the different testing configurations in Game Three with Dyna-Q and ε-greedy.
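To make the role of the planning steps concrete, the following is a minimal tabular Dyna-Q sketch. It is not the implementation used in these tests, which approximates both the Q-function and the model function with ANNs; here a dictionary stands in for each, the environment is assumed deterministic, and the names `dyna_q_step`, `alpha`, and `gamma` are assumptions made for illustration.

```python
import random
from collections import defaultdict


def dyna_q_step(q, model, state, action, reward, next_state, actions,
                alpha=0.1, gamma=0.9, planning_steps=5, rng=random):
    """One real-experience update followed by `planning_steps` simulated ones."""

    def update(s, a, r, s2):
        # Q-learning backup towards the best action value in the next state.
        best_next = max(q[(s2, b)] for b in actions)
        q[(s, a)] += alpha * (r + gamma * best_next - q[(s, a)])

    # Direct reinforcement learning from the real transition.
    update(state, action, reward, next_state)

    # Model learning: remember what this state-action pair led to.
    model[(state, action)] = (reward, next_state)

    # Planning: replay previously seen pairs through the learned model.
    for _ in range(planning_steps):
        s, a = rng.choice(list(model.keys()))
        r, s2 = model[(s, a)]
        update(s, a, r, s2)


q_table = defaultdict(float)
learned_model = {}
dyna_q_step(q_table, learned_model, state=0, action="a", reward=1.0,
            next_state=1, actions=["a", "b"], planning_steps=2)
```

With `planning_steps=0` the procedure reduces to plain Q-learning, which corresponds to test 1 in Table 7.17. Each additional planning step is only as useful as the model is accurate, since simulated updates drawn from a wrong model push the Q-values in the wrong direction.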
[Figure 7.24 plots: (a) Percent reward comparing Tarzan pS (Game 3, egreedy, dynaq, test30); (b) Percent in bounds comparing Tarzan pS (Game 3, egreedy, dynaq, test30). Axes: test (t-1 to t-7) and % (0 to 100).]
Figure 7.24: Comparing the results of changing pS for the agent Tarzan, where t-n defines the nth test according to Table 7.17. The testing configurations are run in Game Three using Dyna-Q and ε-greedy.
[Figure 7.25 plots: (a) Percent reward comparing Jane pS (Game 3, egreedy, dynaq, test30); (b) Percent in bounds comparing Jane pS (Game 3, egreedy, dynaq, test30). Axes: test (j-1 to j-7) and % (0 to 100).]
Figure 7.25: Comparing the results of changing pS for the agent Jane, where j-n defines the nth test according to Table 7.17. The testing configurations are run in Game Three using Dyna-Q and ε-greedy.
7.5 Discussion
The added complexity of Game Three over Games One and Two makes a direct comparison difficult. This section evaluates the most significant variables, compares the results of the different RL methods (Sarsa, Q-learning, and Dyna-Q), and compares the difference in results between agents.
The difficulty in this scenario is not the delayed reward associated with opening the fridge, but rather the fact that an agent could be doing nothing and still gain social interaction through the other agent's actions.
7.5.1 Parameters
Exploration Rate. The ε-greedy action selection algorithm showed predictable results, with the optimal values being εs = 0.3 and εe = 0.1 for both Sarsa (Figure 7.2) and Q-learning (Figure 7.11). The results from the agent Jane show slightly higher performance than the results from the agent Tarzan, regardless of the reinforcement learning algorithm. The difference in results between agents is caused by their different threshold values.
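As a reminder of what the two exploration parameters control, here is a minimal sketch of ε-greedy action selection with a decaying exploration rate. It assumes that εs and εe denote the starting and ending exploration rates and that ε is interpolated linearly between them; neither assumption is confirmed in this section, and the function names are hypothetical.

```python
import random


def epsilon_at(step, total_steps, eps_start=0.3, eps_end=0.1):
    """Linearly interpolate the exploration rate from eps_start to eps_end."""
    if total_steps <= 1:
        return eps_end
    # Fraction of training completed, clamped to the range [0, 1].
    frac = min(max(step / (total_steps - 1), 0.0), 1.0)
    return eps_start + frac * (eps_end - eps_start)


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action index with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda i: q_values[i])
```

A call such as `epsilon_greedy(q_values, epsilon_at(step, total_steps))` would then explore more often early in training and rely increasingly on the learned Q-values later on.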
Reinforcement Learning Rate. The results comparing the learning rate for Sarsa (Figures 7.3 and 7.4) and for Q-learning (Figures 7.12 and 7.15) show very little performance difference between values of αs and αe. In these tests, the agent Jane shows higher performance than the agent Tarzan with both Sarsa and Q-learning.
Q-Function ANN Learning Rate. Results from using Sarsa (Figure 7.5 for Tarzan and Figure 7.6 for Jane) show a nearly equal median percent of reward for ηQ = 0.9, ηQ = 0.7, ηQ = 0.5, and ηQ = 0.3. Results from using Q-learning are similar, with slightly more variation (Figure 7.14 for Tarzan and Figure 7.15 for Jane). The results for Jane consistently show higher performance (percent of reward and percent of steps in bounds) than the results for Tarzan.
Motive Reward Factor. The minimum reward received by the agent from one of their motivations significantly affects the resulting percentages if mR < −0.1; see Figures 7.8 and 7.17. Note that Tarzan and Jane have different optimal values, caused by their different motivation thresholds.
Q-Function ANN Hidden Neurons. Testing has shown that it is best to use only one hidden neuron layer. In this game scenario, good policies are found with at least 10 and up to 100 hidden neurons.
Dyna-Q. All the results associated with Dyna-Q, in Figures 7.20 to 7.25, show worse performance than those of Q-learning, indicating that only a few of the parameter configurations were able to accurately train the model-function approximation. Certain parameter configurations still reach 100% of total reward with Dyna-Q, indicating that the model-function approximation was correct in some cases.
7.5.2 Consistency
Testing the consistency of the results involved repeating tests under identical conditions and comparing the results. In this case, the values in Table 7.18 are used in 5 separate game tests, with results in Figures 7.26 and 7.27. The results show similar median percent of reward across the repeated tests, but a larger variation in the percent of steps in bounds.
RL Algorithm   HnQ     ηQ    mR      RL Rate               Exploration Rate
Q-learning     [100]   0.7   -0.05   αs = 0.9, αe = 0.7    εs = 0.3, εe = 0.1
Table 7.18: Outline of the variables used to test the consistency of Game Three's optimal parameters.
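The comparison of repeated tests described above can be summarised numerically as well as graphically. The sketch below shows one way to do so, using the median and interquartile range of each repeated test and the spread of the medians across repeats. The function names are hypothetical, and no values from the actual tests are reproduced here.

```python
from statistics import median, quantiles


def summarize(values):
    """Return the median and interquartile range of one repeated test."""
    q1, _, q3 = quantiles(values, n=4)
    return {"median": median(values), "iqr": q3 - q1}


def median_spread(repeats):
    """Difference between the largest and smallest per-repeat median."""
    medians = [summarize(values)["median"] for values in repeats]
    return max(medians) - min(medians)
```

Each element of `repeats` is expected to hold the percentages (of reward, or of steps in bounds) from one repeated test, with at least two values per test. A small `median_spread` alongside large `iqr` values would match the pattern reported for the percent of steps in bounds: similar medians, but wide variation within each test.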
[Figure 7.26 plots: Percent reward comparing consistency of Tarzan in Game 3; Percent reward comparing consistency of Jane in Game 3. Axes: repeated test (1 to 5) and % (0 to 100).]
Figure 7.26: Comparing the percent of reward from policies trained using optimal parameters outlined in Table 7.18.
[Figure 7.27 plots: Percent in bounds comparing consistency of Tarzan in Game 3; Percent in bounds comparing consistency of Jane in Game 3. Axes: repeated test (1 to 5) and % (0 to 100).]
Figure 7.27: Comparing the percent of steps in bounds from policies trained using optimal parameters outlined in Table 7.18.