The update process should include new knowledge and already collected knowledge. Needless to say that the human decision

making process does not exclusively consider the last obtained reward but also considers a history of rewards.

Every time the agent obtains a reward r_i,j, for an executed action a_j in consequence of stimulus s_i, the corresponding somatic marker σ_i,j is updated.

It depends on the stimulus which rewards are even possible. This is to say that, the consequences of a decision concerning the choice of a job are different from e.g. the decision which potato chips to buy.

Every obtained reward is a component of an infinite time series. Therefore, an obvious starting point for the computation of somatic markers is the exponential smoothing function. Equation (2.2) shows the computation of the exponential smoothing (Winters (1960)).

r^t_i,j = ∆ · r_i,j^t + (1 − ∆) · r^t−1_i,j , r_i,j^t ∈ R_s_i, ∆ ∈ [0, 1] (2.2) This computation fulfils the combined request of including new knowledge r^t_i,jand collected knowledge r^t−1_i,j . There is also a weighting of new and collected knowledge but there is no adaption of the weighting parameter ∆. At the initial state, at which the agent does not have access to any knowledge, it would be preferable to weight new knowledge exclusively or at least to a higher extent than collected knowledge. In contrast to that, the weighting of collected knowledge should increase when the agent gathers reliable information in order to reduce the impact of outliers in the rewards. When the agent has reliable collected knowledge it is necessary to not exclusively consider collected knowledge but to take into account new knowledge as well. Otherwise, the agent is not able to relearn, because any new reward would be discarded.

In addition to the problem that a fixed weighting is used, the convergence of the resultant value can also be disadvantageous for the representation of knowledge. Under the assumption that a set of rewards R_s_i exists for every stimulus and that the initial value r^init_i,j ∈ [min(R_s_i), max(R_s_i)], the upper bound for r_i,j^t is defined by max(R_s_i), while the lower bound is defined by min(R_s_i), when an exponential smoothing is used. Therefore, rewards with values lying in the interval ] min(R_s_i), max(R_s_i)[ are not able to reach the upper or lower bound.

Figure 2.3 illustrates an exemplary case with a constant obtained reward r^t_i,j = 50 and a set R_s_i = {−100, −50, +50, +100}. It is observable that the impact of the obtained rewards is small at the beginning and negligible after some iterations. Furthermore, the best case value +100 cannot be reached.

Due to this fact, a single negative reward (e.g. at t=100) would have a major impact on the value r^t_i,j, even if 100 positive rewards were obtained previously.

2.4. Creation of Artificial Somatic Markers 23

-100 -50 0 50 100

0 10 20 30 40 50 60 70 80 90 100

rt i,j

t r_i,j^t = 0.5· r^ti,j+ 0.5· ri,j^t⁻¹

Figure 2.3: Exemplary case that shows the convergence problem when an exponential smoothing is used (r_i,j^init= 0).

In the same way, frequently obtained negative rewards (r^t_i,j = −50) do not lead to the worst case value of -100.

The convergence is unproblematic in the case that a binary set of rewards is used (Rsi = {−v, +v}). In such a case −v is interpreted as worst case and +v as best case. However, for human daily decisions there are often several advantageous and disadvantageous options and the positive or negative outcomes are of different quality as well. To express such different qualities R_s_i can be defined as follows: R_s_i = {−v₁, · · · , 0, · · · , +v_o}. It might happen that some actions won’t ever lead to the maximal or the minimal reward that is possible. This fact leads to the SM not converging against the maximum or minimum value but against another level, even if the frequency of e.g.

positive rewards is very high. Dependent on the task, it can be important that even actions which lead to rewards with a small magnitude adapt the somatic marker in such a way that the best/worst case value can be reached.

Otherwise, single outliers in the rewards have too much influence on the computation of collected knowledge. To construct an example, the following configuration is assumed:

• S = {s₁}

• A = {a₁}

• R_s₁ = {−1000, −50, +50, +1000}

• r^init_i,j = 0

Figure 2.4 shows an example in which the best case (+1000) and the worst case (-1000) never occur. The obtained rewards can be found on top of the figure. Until t = 18 the agent has obtained only positive rewards (+50) followed by two negative rewards (-50) at t = 19 and t = 20. In order to show the influence of the fixed weighting on the result, three different weightings

-100 -50 0 50 100

0 5 10 15 20

Reward

t Obtained rewards

-50 -25 0 25 50

0 5 10 15 20

rt i,j

t Reward value example

r^t_i,j= 0.5· ri,j^t + 0.5· r_i,j^t⁻¹ r^t_i,j= 0.9· ri,j^t + 0.1· r_i,j^t⁻¹ r^t_i,j= 0.1· r_i,j^t + 0.9· r_i,j^t⁻¹

Figure 2.4: Exemplary case that shows the influence of different weightings when an exponential smoothing is used.

are used for the computations. The used weightings are an equal weighting of new knowledge r^t_i,j and collected knowledge r_i,j^t−1, a weighting with a high consideration of new knowledge and a weighting with a high consideration of collected knowledge. It can be noticed that even when 19 positive rewards are obtained in a row the best case value +1000 is not reached due to the convergence. In consequence, the two negative rewards at t = 19 and t = 20 affect the value so much that the option is marked as being disadvantageous (negative value). The case in which new knowledge is weighted with 0.1 is an exception. However, a low weighting of new knowledge leads to the problem that the agent is not able to adapt its behaviour to new situations quickly.

In summary, the influence of the history of rewards is not sufficient to prevent that a few outliers change r^t_i,j drastically. A high weighting of collected knowledge counteracts this problem but leads to further problems.

The modifications of the computations presented in the next paragraph allow that r^t_i,j does not converge against 50 but against 1000 in such a case. This decreases the influence of outliers.

Furthermore, the example shows that the used weightings have a major influence on the result. As discussed previously, there are always cases in which a chosen fixed weighting is unfavourable. Therefore, the following modifications also include an adaptive weighting.

2.4. Creation of Artificial Somatic Markers 25

Case r^t_i,j

Unreliable knowledge 1 · r_i,j^t + 0 · r_i,j^t−1 Reliable knowledge 1 · r_i,j^t + 1 · r_i,j^t−1 Learning period w · r_i,j^t + ˆw · r_i,j^t−1, w + ˆw ∈]1; 2[

Table 2.2: Desired weightings for new and collected knowledge.

Criteria 2 and 3: The consideration of the rewards’ frequency and

In document A Decision Making Framework for Robot Companion Systems Usable by Non-Experts (Page 30-33)