Adaptivity and Robustness

As a final element of this chapter, the performance of the RL algorithm under fluctuations and permanent changes in its environment is analyzed.

Perceived fluctuations in system parameters are always present to a certain extent in physical environments. They are due either to actual changes in environmental parameters such as temperature or humidity, or to fluctuations in the measured values of parameters within the tolerances of the sensor devices. In addition, physical environments tend to experience permanent changes as a result of deterioration and wear of their components. In this section, an RL agent is presented with different scenarios intended to reproduce each of these changes to its environment in order to analyze its behavior.

The agent implemented for this purpose uses the SARSA algorithm with the Gaussian-update method introduced in 6.3.4 to estimate the value function.
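For orientation, the basic on-policy SARSA update can be sketched as follows. This is a minimal tabular sketch and not the Gaussian-update variant from 6.3.4, which is not reproduced here; the dictionary-based value table, the function names and the values of `alpha` and `gamma` are assumptions chosen for illustration only.

```python
import random


def epsilon_greedy(q, state, n_actions, epsilon, rng=random):
    """Pick a random action with probability epsilon, otherwise the greedy one.

    With epsilon = 0 this is a purely greedy policy, as used in the greedy runs.
    """
    if rng.random() < epsilon:
        return rng.randrange(n_actions)
    values = [q.get((state, a), 0.0) for a in range(n_actions)]
    return values.index(max(values))


def sarsa_update(q, s, a, r, s_next, a_next,
                 alpha=0.1, gamma=0.95, terminal=False):
    """One SARSA step: Q(s,a) += alpha * (r + gamma * Q(s',a') - Q(s,a)).

    q is a plain dict mapping (state, action) to a value (hypothetical layout).
    """
    target = r if terminal else r + gamma * q.get((s_next, a_next), 0.0)
    old = q.get((s, a), 0.0)
    q[(s, a)] = old + alpha * (target - old)
    return q[(s, a)]
```

Because the update uses the action actually chosen in the next state, the learned values reflect the policy being followed, including its exploration.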

In the first scenario, the agent has already completed the learning task with the standard parameters of the drive train model. Afterwards, a fluctuation of the mass of the vehicle, e.g. as a result of more passengers or a load being transported, and thus of its inertia 𝐽_𝑉, within ±10% of its original value is simulated. While the fluctuation of one parameter is simulated, all other parameters are held constant. The agent then performs ten greedy runs, i.e. without taking random actions, with a fluctuating mass and otherwise constant system parameters. The performance of the agent under these circumstances is shown in Fig. 6.29.

208 The duration of the standard SARSA-agent in the homogeneously discretized state space is used as benchmark.

209 As stated previously, judder is quantified in terms of the return computed after a synchronization maneuver. The return for a standard engagement without RL-control is regarded to generate 100% judder and serves as base of comparison.

Figure 6.29: Performance of the RL-agent under fluctuating vehicle mass 𝐽_𝑉 in comparison to actuation with a standard force ramp without RL-control. The original value of 𝐽_𝑉 is marked in red.

It can be observed that the agent reduces judder vibrations in comparison to a standard engagement ramp only when the fluctuations are relatively small and the inertia of the vehicle does not change substantially. The greater the offset from the original value, the less effective the agent is at suppressing the vibrations. In fact, for larger deviations from the standard value the RL-control is counter-productive.

A similar experiment is performed for a decreasing static friction coefficient 𝜇_𝑠𝑡, e.g. as a result of oil or dirt in the friction pairing. Again, an agent that has already completed a learning task with the standard parameters performs four greedy runs in an environment where the static friction has decreased. The result of the experiment is shown in Fig. 6.30.

[Plot for Fig. 6.29: return (y-axis, -25 to -5) over inertia J in kg m² (x-axis, 9 to 11); series: without RL control, with RL control]

Figure 6.30: Performance of the RL-agent for a decreasing static friction coefficient 𝜇_𝑠𝑡. The original value of 𝜇_𝑠𝑡 is marked in red.

The loss in effectiveness of judder suppression by the agent is observable. As in the previous experiment, the greater the deviation of the fluctuating parameter from its original value, the less effective the agent becomes. In fact, the RL-control is likely to cause judder instead of reducing it when the parameter value it was trained with is altered. It should be mentioned that the presence of e.g. oil in the friction pairing usually leads to a much more dramatic decrease of the friction coefficient than the one considered in this experiment.
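The two fluctuation experiments share one protocol: the learned policy is frozen, a single parameter is varied around its nominal value, and the return of greedy runs is compared with that of the standard engagement. A minimal sketch of this protocol is given below. The callables `run_greedy` and `run_baseline` are hypothetical stand-ins for the drive train simulation, and expressing judder as the ratio of the two returns is an assumption derived from the convention that the standard engagement counts as 100% judder.

```python
from statistics import mean
from typing import Callable, Dict, List


def relative_judder(return_rl: float, return_baseline: float) -> float:
    """Judder in percent of the baseline (assumed proportional to the return)."""
    return 100.0 * return_rl / return_baseline


def robustness_sweep(
    run_greedy: Callable[[float], float],
    run_baseline: Callable[[float], float],
    nominal: float,
    rel_range: float = 0.10,
    n_points: int = 5,
    n_runs: int = 10,
) -> List[Dict[str, float]]:
    """Evaluate a frozen policy while one parameter varies around its nominal value.

    run_greedy(value) and run_baseline(value) each return the episode return
    for the given parameter value; all other parameters are assumed constant.
    n_points must be at least 2.
    """
    results = []
    for i in range(n_points):
        # Evenly spaced values between nominal*(1 - rel_range) and nominal*(1 + rel_range).
        frac = -rel_range + 2.0 * rel_range * i / (n_points - 1)
        value = nominal * (1.0 + frac)
        ret_rl = mean(run_greedy(value) for _ in range(n_runs))
        ret_base = run_baseline(value)
        results.append({
            "value": value,
            "return_rl": ret_rl,
            "return_baseline": ret_base,
            "judder_pct": relative_judder(ret_rl, ret_base),
        })
    return results
```

A `judder_pct` above 100 would correspond to the counter-productive case described above, in which the RL-control performs worse than the standard force ramp.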

The second scenario regards the case in which the changes to the environment are permanent. Again, changes in the mass of the vehicle are considered first. A considerable permanent change in the mass of the vehicle is rather unlikely. However, the weight of the vehicle could be notably higher over an extended period of time as a result of more passengers, which could make the start-up more difficult or prolonged. Another option is the use of an existing implementation of the RL-controller for another type of vehicle with a different inertia. Therefore, the ability of the RL-agent to learn a successful strategy for the altered weight is analyzed. The result of the adaptivity experiment is shown in Fig. 6.31.

The agent retains its ability to learn to suppress judder vibrations under altered vehicle inertia. It is worth noting that both the number of episodes needed to reach convergence again and the return achieved afterwards remain almost unaffected by a decrease in the inertia, whereas even a relatively small increase affects both negatively.
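The number of episodes to convergence depends on how convergence is detected. The criterion used for the experiments is not restated in this section, so the following is only one possible sketch: it assumes that the learning process counts as converged once the moving average of the return stays within a relative band around its final value. The window length and tolerance are hypothetical choices.

```python
from typing import Optional, Sequence


def episodes_to_convergence(
    returns: Sequence[float], window: int = 20, tol: float = 0.05
) -> Optional[int]:
    """Return the first episode (1-based) from which the moving average of the
    return stays within tol * |final average| of its final value.

    Returns None if fewer than `window` episodes are available.
    """
    if len(returns) < window:
        return None
    # averages[k] is the mean over episodes k+1 .. k+window (1-based).
    averages = [
        sum(returns[i - window:i]) / window
        for i in range(window, len(returns) + 1)
    ]
    final = averages[-1]
    band = tol * abs(final)
    # Search backwards for the last moving average outside the band.
    for k in range(len(averages) - 1, -1, -1):
        if abs(averages[k] - final) > band:
            # Convergence starts with the next window, ending at episode k + 1 + window.
            return window + k + 1
    return window
```

With such a function, the returns logged during the second learning process could be reduced to the two quantities compared in the experiment: the episode count and the converged return.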

[Plot for Fig. 6.30: return (y-axis, -40 to -5) over x-axis values 0.39 to 0.43; series: without RL-control, with RL-control]

Figure 6.31: Adaptation of the RL-agent to a permanent change in the inertia 𝐽_𝑚 of the vehicle. Top: episodes to reach convergence in second learning process. Bottom: achieved return after convergence of second learning process. The return of the original learning process is marked in red.

Figure 6.32: Adaptation of the RL-agent to a permanent change in the static friction coefficient 𝜇_𝑠𝑡 of the vehicle. Top: episodes to reach convergence in second learning process. Bottom: achieved return after convergence of second learning process. The return of the original learning process is marked in red.

[Plots for Fig. 6.31: episodes (top, 0 to 600) and return (bottom, -9 to -5) over inertia J (9 to 11)]

[Plots for Fig. 6.32: episodes (top, 0 to 1000) and return (bottom, -10 to -4) over 𝜇_𝑠𝑡 (0.39 to 0.43)]

Lastly, the RL-agent is confronted with a permanently altered static friction coefficient, e.g. as a result of wear. The results of the experiment are contained in Fig. 6.32. The agent is able to learn a new strategy to suppress judder vibrations. However, the effectiveness of the algorithm is reduced by lower friction coefficients.

In summary, the ability of the agent to suppress judder vibrations under fluctuations and permanent changes in the environment is retained. However, its effectiveness is reduced for both.

In the first scenario, fluctuations of system parameters lead to a temporary loss of the correct mapping of actions to states. Unless these fluctuations are accounted for, e.g. through an adequate model, they cause a partial loss of the Markov property. The smaller the fluctuations, the “more Markov” the environment becomes and the better the performance of the agent remains.

In the second scenario, permanent changes to the environment do not prevent the agent from learning a new strategy to reduce judder vibrations. However, the effectiveness of the new strategy and the additional effort to learn it can suffer considerably. A possible explanation lies in the change in the vibration behavior of the system and the need for an adaptation of the RL framework to these changes. For example, a change in the mass of the vehicle leads to a change in its eigenfrequency, which might make it necessary to adjust the rate at which the agent sets actions in order to counteract judder vibrations.
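The effect of the inertia on the eigenfrequency can be illustrated with a strongly simplified estimate. It assumes a single-degree-of-freedom torsional oscillator with stiffness $c$ and inertia $J$, which is not the drive train model used in this work, so the numbers are only indicative.

```latex
\omega_0 = \sqrt{\frac{c}{J}}
\qquad\Longrightarrow\qquad
\frac{\omega_0'}{\omega_0} = \sqrt{\frac{J}{J'}}

J' = 1.1\,J:\quad \frac{\omega_0'}{\omega_0} = \frac{1}{\sqrt{1.1}} \approx 0.953
\quad (\text{about } -4.7\,\%)

J' = 0.9\,J:\quad \frac{\omega_0'}{\omega_0} = \frac{1}{\sqrt{0.9}} \approx 1.054
\quad (\text{about } +5.4\,\%)
```

Under this assumption, a ±10% change in inertia shifts the eigenfrequency by roughly 5%, so an action rate tuned to the original vibration period would no longer match the altered one exactly.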

Therefore, it is of great importance that fluctuations and changes in the environment are avoided as far as possible for the agent to learn an effective strategy in as few episodes as possible. The importance of this requirement becomes more evident in the following chapter, when the RL framework is applied to a physical environment in which the Markov property can never be fully ensured.

7 Implementation of the RL Framework on the IPEK Mini Hardware-in-the-Loop scaled down Test Bench

The IPEK Mini-Hardware-in-the-Loop test bench (Mini-HiL) is designed to provide a scaled down experimental environment for drive trains and drive train components. Due to its flexible architecture, different levels of the XiL-framework introduced in 2.3 can be realized in order to provide an adequate physical model of the drive train with the desired levels of partitioning, maturity and abstraction.

In this chapter, the conception of the physical test bench and a simulation model of it as new environments for the RL framework are presented. Afterwards, the Gaussian-update algorithm is implemented in this new environment and the results are discussed. Furthermore, new RL-algorithms are introduced for this new environment in order to assess whether they are better suited for implementation on the physical test bench. After a discussion of the results of all implemented approaches, the one considered the most promising is implemented on the physical test bench and the results are discussed.

In document Reinforcement Learning Framework for the self-learning Suppression of Clutch Judder in automotive Drive Trains (Page 123-129)