
Controller Development

5.3 SDP Solution

Given the probability and cost of each transition, it is possible to calculate the statistically optimal action to perform in each state so as to minimise the overall cost. Considering a single time-step, the calculation is very simple. Firstly, the cost and probability matrices are multiplied on an element-by-element basis. The resulting matrix holds the probability-weighted cost of each transition. Summing its elements along the final state dimension then gives the expected (stochastic) cost of each action for each initial state. The optimal action for each initial state is the one that gives the lowest stochastic cost. This set of actions is known as the optimal policy.
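The single-step calculation can be sketched in a few lines of Python. This is a minimal illustration only: the matrices `P` and `C`, their `[initial state][action][final state]` layout, the numerical values and the function names are all assumptions made for the example and are not taken from the vehicle model.

```python
# Hypothetical toy problem: 2 initial states, 2 actions, 2 final states.
# P[s][a][s2] = probability of moving from state s to state s2 under action a.
# C[s][a][s2] = cost incurred by that transition.
P = [
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.5, 0.5], [0.1, 0.9]],
]
C = [
    [[1.0, 4.0], [2.0, 1.0]],
    [[3.0, 0.0], [5.0, 2.0]],
]


def stochastic_cost(P, C):
    """Element-by-element product of P and C, summed over the final state."""
    return [
        [sum(p * c for p, c in zip(P[s][a], C[s][a])) for a in range(len(P[s]))]
        for s in range(len(P))
    ]


def greedy_policy(Q):
    """For each initial state, pick the action with the lowest stochastic cost."""
    return [min(range(len(row)), key=row.__getitem__) for row in Q]


Q = stochastic_cost(P, C)   # Q[s][a] = expected single-step cost
policy = greedy_policy(Q)   # one action index per initial state
print(Q, policy)
```

Here `Q` plays the role of the stochastic cost of each action for each initial state, and `policy` is the single-step optimal policy for this toy data.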

If a second step is now considered, the vehicle will have moved from its initial state. The total cost requires calculation of the cost of the first step, but also the cost of the second step, given the probability of each transition during the first step. The probability distribution of this new state can be calculated using the transitional probability matrix. The cost of performing the optimal action in each new state has already been calculated, as that is a single time-step problem. Therefore, the expected cost of the second step can be calculated as the element-by-element product of the probability of each transition during the first step and the cost of the second step under the single-step optimal policy, summed over the final state dimension. This can then be added to the stochastic cost of each action for each initial state. The new policy is the one that minimises the overall cost over both steps.
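The two-step calculation can be illustrated with a small sketch. The two states ("battery ok", "battery low"), the two actions ("fuel cell off", "fuel cell on") and all numbers below are hypothetical and chosen only so that the single-step and two-step policies differ; they are not values from this work.

```python
# Hypothetical toy model. States: 0 = battery ok, 1 = battery low.
# Actions: 0 = fuel cell off, 1 = fuel cell on.
# P[s][a][s2] = transition probability, C[s][a][s2] = transition cost.
P = [
    [[0.3, 0.7], [0.9, 0.1]],
    [[0.0, 1.0], [0.6, 0.4]],
]
C = [
    [[0.0, 0.0], [1.0, 1.0]],
    [[10.0, 10.0], [3.0, 3.0]],
]
n_states, n_actions = len(P), len(P[0])

# Single step: probability-weighted cost summed over the final state.
Q1 = [
    [sum(p * c for p, c in zip(P[s][a], C[s][a])) for a in range(n_actions)]
    for s in range(n_states)
]
policy1 = [min(range(n_actions), key=Q1[s].__getitem__) for s in range(n_states)]
J1 = [Q1[s][policy1[s]] for s in range(n_states)]  # cost of the best single step

# Two steps: first-step cost plus the expected cost of the second step,
# weighted by the probability of each first-step transition.
Q2 = [
    [
        Q1[s][a] + sum(P[s][a][s2] * J1[s2] for s2 in range(n_states))
        for a in range(n_actions)
    ]
    for s in range(n_states)
]
policy2 = [min(range(n_actions), key=Q2[s].__getitem__) for s in range(n_states)]
print(policy1, policy2)
```

In this toy case the single-step policy turns the fuel cell off when the battery is ok, whereas the two-step policy keeps it on to avoid the penalty associated with the low battery state.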

The two-step policy may or may not be the same as the single step policy. For example, the single step policy may determine that it is optimal to turn the fuel cell off under almost all circumstances in order to prevent degradation and conserve fuel. However, in the second step, part of this policy may result in the battery voltage penalty being triggered due to demand from the motors. As a result, it would be beneficial to have the fuel cell producing power during the first step in order to prevent this from occurring. Therefore, it may be required to re-calculate the cost of the second step based on the new policy. The total cost would then need to be recalculated, which may result in yet another change to the policy.

This process is then repeated until the policy is unchanged.

SDP takes this idea of adding the immediate cost of each action to the anticipated future cost, weighted by the probability distribution of each transition, and extends it further into the future so as to be representative of the time-scales seen in the real world. There are two main methods: finite horizon and infinite horizon. Finite horizon methods assume a fixed number of time-steps and work in an equivalent way to the example given above. Infinite horizon methods, however, assume that the process continues for an infinite number of steps. As this would invariably result in infinite cost, a discount factor is applied exponentially to future steps so that the final cost converges.

The choice of discount factor will affect both the accuracy of the results and the time required for the cost to converge. A low discount factor may cause the solution to converge very quickly, reducing processing time, but the result may be unrepresentative of the typical time-scales seen in the real world and therefore produce a sub-optimal result. Too high a discount factor will cause convergence to take an excessive amount of time, which may not be necessary to obtain the optimal policy. Often the choice of discount factor is a compromise between calculation time and accuracy.
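As a rough guide to this trade-off, the weight given to a cost incurred k steps ahead is λ^k. The geometric series below gives an effective horizon, and the inequality gives the number of steps after which the weight falls below a tolerance ε. The symbol ε and the numerical values in the comments are illustrative assumptions, not values used in this work.

```latex
\sum_{k=0}^{\infty} \lambda^{k} = \frac{1}{1-\lambda},
\qquad
\lambda^{k} < \varepsilon \iff k > \frac{\ln \varepsilon}{\ln \lambda}
\quad (0 < \lambda < 1,\ 0 < \varepsilon < 1)
% Illustrative values, taking \varepsilon = 0.01:
%   \lambda = 0.90: 1/(1-\lambda) = 10 steps,  weight below 1% after 44 steps
%   \lambda = 0.99: 1/(1-\lambda) = 100 steps, weight below 1% after 459 steps
```

A discount factor closer to 1 therefore looks much further ahead, at the price of roughly proportionally more iterations before the discounted cost settles.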

More recently, a number of authors [52, 64] have used an alternative infinite horizon method which does not require a discount factor. This is known as Shortest Path Stochastic Dynamic Programming (SP-SDP) and uses a “terminal state”. The terminal state is an additional state added to the model which represents the end of the duty cycle. In order to do this, it has a 100% probability of transitioning to itself with no associated cost. As a result, no cost accumulates once the vehicle has entered the terminal state. This means that the solution will converge for any initial state that has a non-zero probability of eventually entering the terminal state given an infinite number of steps. Therefore, no discount factor (or a discount factor of 1) is required for the infinite horizon cost to converge.
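One way a terminal state might be appended to an existing transition model is sketched below. The function name `add_terminal_state`, the constant per-step probability `p_end` of the duty cycle ending, and the choice of zero cost for the transition into the terminal state are all assumptions made for illustration. The source does not specify how the probability of entering the terminal state is obtained.

```python
def add_terminal_state(P, C, p_end):
    """Return copies of P and C with an absorbing, zero-cost terminal state.

    P[s][a][s2] and C[s][a][s2] are transition probabilities and costs.
    p_end is a hypothetical per-step probability that the duty cycle ends.
    """
    n_states, n_actions = len(P), len(P[0])
    P_aug, C_aug = [], []
    for s in range(n_states):
        # Scale the existing transitions so that each row still sums to one.
        P_aug.append(
            [[p * (1.0 - p_end) for p in P[s][a]] + [p_end] for a in range(n_actions)]
        )
        # Assumption: entering the terminal state costs nothing.
        C_aug.append([list(C[s][a]) + [0.0] for a in range(n_actions)])
    # Terminal state: 100% probability of returning to itself, at no cost.
    P_aug.append([[0.0] * n_states + [1.0] for _ in range(n_actions)])
    C_aug.append([[0.0] * (n_states + 1) for _ in range(n_actions)])
    return P_aug, C_aug


# Hypothetical two-state, two-action model used only to exercise the function.
P = [[[0.3, 0.7], [0.9, 0.1]], [[0.0, 1.0], [0.6, 0.4]]]
C = [[[0.0, 0.0], [1.0, 1.0]], [[10.0, 10.0], [3.0, 3.0]]]
P_aug, C_aug = add_terminal_state(P, C, p_end=0.05)
print(P_aug[-1])  # terminal row: all probability stays in the terminal state
```

Because every original state now has a non-zero probability of entering the terminal state, the undiscounted expected cost from any state is finite in this toy model.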

5.3.1 Mathematical Description

Two techniques will be used in this work, the infinite horizon and shortest path SDP. A brief mathematical description of each is given below. Despite its apparent complexity, SDP can be described by two steps. The first is the policy evaluation where the expected costs of performing the current policy are calculated. The second step is the policy improvement step, where the policy is chosen as the set of actions which minimises the expected cost.

This process is then repeated until the policy converges.

5.3.1.1 Infinite Horizon

The MDP problem described has been solved using infinite horizon SDP. The objective is to find the optimal control policy, u = π(S), so as to minimise the total expected cost, Jπ(S0), over an infinite time horizon, where S0 is the initial state. The total expected cost is calculated using Equation 5.3.9, where λ ∈ [0, 1) is the one-second discount factor.

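$$J_\pi(S_0) = \lim_{K \to \infty} E\left\{ \sum_{k=0}^{K-1} \lambda^{k}\, \Gamma\big(S_k, \pi(S_k)\big) \right\} \tag{5.3.9}$$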

The optimal policy can be found using a policy iteration algorithm. This works by iteratively evaluating the current policy and then improving it until the policy converges. In the policy evaluation step (Equation 5.3.10), the expected cost under the current control policy, π, is calculated as the cost incurred during the current step plus the discounted expected cost of future steps from the new state, S′, that the vehicle has transitioned to.

$$J_\pi^{k+1}(S_i) = \Gamma\big(S_i, \pi(S_i)\big) + \lambda E\left\{ J_\pi^{k}(S') \right\} \tag{5.3.10}$$

The policy is then improved by finding the action which minimises the total expected cost (Equation 5.3.11).

$$\pi'(S_i) = \operatorname*{arg\,min}_{a \in A(S_i)} \left[ \Gamma(S_i, a) + \lambda E\left\{ J_\pi(S') \right\} \right] \tag{5.3.11}$$
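For a discrete state space, the expectation appearing in the evaluation and improvement steps can be written out as a probability-weighted sum over the possible next states. The notation P(S_j | S_i, a) for the transition probability is an assumption introduced here for illustration, with a = π(S_i) in the evaluation step.

```latex
E\left\{ J_\pi(S') \right\} = \sum_{j} P\left(S_j \mid S_i, a\right) J_\pi(S_j)
```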

This process is iterated until the policy remains unchanged for a number of improvement steps. The optimal policy π(S) is based on the state of the vehicle, and is causal and time-invariant, and can therefore be directly implemented in simulation or on board the vehicle.
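The evaluation and improvement steps can be combined into a short policy iteration sketch. This is an illustrative implementation under stated assumptions, not the code used in this work: the toy matrices `P` and `C`, the function names, the convergence tolerance, and the choice to stop as soon as one improvement step leaves the policy unchanged are all hypothetical. Γ(s, a) is taken here to be the probability-weighted transition cost.

```python
def expected_cost(P, C, s, a, J, lam):
    """Gamma(s, a) + lam * E{J(S')} for a single state-action pair."""
    return sum(
        p * (c + lam * J[s2]) for s2, (p, c) in enumerate(zip(P[s][a], C[s][a]))
    )


def evaluate_policy(P, C, policy, lam, tol=1e-9, max_sweeps=100_000):
    """Policy evaluation: repeat the cost update until the costs settle."""
    J = [0.0] * len(P)
    for _ in range(max_sweeps):
        J_new = [expected_cost(P, C, s, policy[s], J, lam) for s in range(len(P))]
        if max(abs(x - y) for x, y in zip(J_new, J)) < tol:
            return J_new
        J = J_new
    return J


def improve_policy(P, C, J, lam):
    """Policy improvement: choose the action minimising the expected cost."""
    return [
        min(range(len(P[s])), key=lambda a: expected_cost(P, C, s, a, J, lam))
        for s in range(len(P))
    ]


def policy_iteration(P, C, lam, max_iters=1000):
    policy = [0] * len(P)  # arbitrary initial policy
    J = evaluate_policy(P, C, policy, lam)
    for _ in range(max_iters):
        new_policy = improve_policy(P, C, J, lam)
        if new_policy == policy:  # policy unchanged: stop
            break
        policy = new_policy
        J = evaluate_policy(P, C, policy, lam)
    return policy, J


# Hypothetical toy model: states (battery ok, battery low),
# actions (fuel cell off, fuel cell on).
P = [[[0.3, 0.7], [0.9, 0.1]], [[0.0, 1.0], [0.6, 0.4]]]
C = [[[0.0, 0.0], [1.0, 1.0]], [[10.0, 10.0], [3.0, 3.0]]]
policy, J = policy_iteration(P, C, lam=0.9)
print(policy, J)
```

The returned policy is a simple lookup from state to action, which reflects why a causal, time-invariant policy of this kind is straightforward to implement.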

5.3.1.2 Terminal State

The solution to the terminal state problem is identical except for the fact that now no discount factor is required. That is to say λ = 1 in the equations above, and therefore the solution can be slightly simplified:

$$J_\pi(S_0) = \lim_{K \to \infty} E\left\{ \sum_{k=0}^{K-1} \Gamma\big(S_k, \pi(S_k)\big) \right\} \tag{5.3.12}$$

$$J_\pi^{k+1}(S_i) = \Gamma\big(S_i, \pi(S_i)\big) + E\left\{ J_\pi^{k}(S') \right\} \tag{5.3.13}$$

$$\pi'(S_i) = \operatorname*{arg\,min}_{a \in A(S_i)} \left[ \Gamma(S_i, a) + E\left\{ J_\pi(S') \right\} \right] \tag{5.3.14}$$

This solution will converge given the existence of a terminal state in the model. The terminal state has three requirements. Firstly, every initial state must be able to eventually transition to the terminal state given an infinite number of steps. Secondly, the terminal state must always transition back to itself. Thirdly, this self-transition must incur no cost. As a result, the terminal state is “absorbing”: the probability of being in the terminal state increases as the number of steps increases. As no cost is incurred in the terminal state, the solution will therefore converge.
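The absorbing behaviour can be illustrated by propagating a state distribution through a small Markov chain under a fixed policy. The matrix `T`, the per-state costs `g` and the function name `propagate` are hypothetical values chosen for this sketch. The last state is the terminal state.

```python
# Hypothetical chain under a fixed policy. T[s][s2] = transition probability.
# The last state is the terminal state: it returns to itself at no cost.
T = [
    [0.7, 0.2, 0.1],
    [0.3, 0.5, 0.2],
    [0.0, 0.0, 1.0],
]
g = [1.0, 3.0, 0.0]  # expected cost per step in each state


def propagate(T, g, start, n_steps):
    """Track the terminal-state probability and the accumulated expected cost."""
    n = len(T)
    dist = [0.0] * n
    dist[start] = 1.0
    terminal_prob, total_cost = [], 0.0
    for _ in range(n_steps):
        # Undiscounted expected cost of this step (lambda = 1).
        total_cost += sum(d * c for d, c in zip(dist, g))
        # Advance the state distribution by one step.
        dist = [sum(dist[s] * T[s][s2] for s in range(n)) for s2 in range(n)]
        terminal_prob.append(dist[-1])
    return terminal_prob, total_cost


terminal_prob, total_cost = propagate(T, g, start=0, n_steps=500)
print(terminal_prob[:5], total_cost)
```

In this toy chain the terminal probability never decreases from one step to the next, and the accumulated cost approaches a finite limit even though no discount factor is applied.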