A Real-Data-Based Simulated User - Argumentation accelerated reinforcement learning

Recall that the goal of EURS is to learn the user’s usage pattern of some appliances and to give the user recommendations accordingly, so as to save the user money while minimising disruption. In this learning problem, the user is therefore the ‘environment’, and the RL-based EURS learns the best actions (in this case, the actions are recommendations) in each time slot of a day by interacting with this environment. Because many episodes of learning may be needed before a good policy can be found (as seen, for example, in the experiments on the Keepaway and Takeaway games in Section 4.2, on the Taxi problem in Section 5.2.1, and on the stochastic Wumpus World game in Section 5.2.2), it is unrealistic to have the system interact with real users for that long. It is therefore essential to design a simulated user to train the system before it is deployed in real households.

[Figure 6.1 plot. x-axis: Time (00:00 to 22:00); y-axis: Switching on probability (0.00 to 0.09).]

Figure 6.1: The probability distribution of the switching-on time of the washing machine. This distribution is obtained from the user’s usage data on 51 selected days.

To make the simulated user as lifelike as possible, we design it using a large amount of data collected from a real user. We use the longest dataset in the UK-DALE database [KK14], which contains power readings of 54 appliances over 470 days in one household. We consider only six appliances: the vacuum cleaner (hoover), dishwasher, washing machine, TV, kitchen electronics and PC. The kitchen electronics include a kettle, two toasters, two food mixers and a kitchen aid; according to the description of the UK-DALE database, the data of these electronics are collected by one sensor, so they are treated as one appliance in our work. We consider these six appliances because they consume considerable energy (average ≥ 0.1 kWh/day) and have quite flexible working times (unlike, e.g., the WiFi router, which typically works 24/7). We also consider only one specific type of day, in which the simulated user uses each of these six appliances once. In the dataset, 54 days are of this type, and most of them (51) are weekend days.

Figure 6.2: The probability distribution of the switching on time and working time of the TV set.

Note that we need to simulate a user in the following two aspects: (1) how the user plans his usage of each appliance, namely the original usage pattern of this user, and (2) how he responds to different recommendations in different situations. To simulate the first aspect, we assume that, on days of the selected type, the user has a fixed probability of using each appliance in each time slot, and that this probability can be computed directly from the user’s existing usage data. For example, the probability distribution of the user turning on the washing machine in different time slots is shown in Figure 6.1, and the probability distribution of the user switching on the TV in different time slots and using it for different lengths of time is shown in Figure 6.2. Figure 6.1 is two-dimensional because we do not need to consider the working time of the washing machine: we observe from the dataset that, in each usage, the washing machine works for three time slots and then turns off automatically. For example, the right-most bar in Figure 6.1 shows that the probability of switching on the washing machine between 21:00 and 21:30 (and using it for three time slots) is 0.01. Figure 6.2, however, is three-dimensional, because the working time of each usage must also be taken into account when computing the user’s usage pattern. For example, the bottom-left bar in Figure 6.2 shows that the probability of switching on the TV between 00:00 and 00:30 and using it for thirty minutes (one time slot) is 0.002. Three appliances switch off automatically after working for a fixed number of time slots: the washing machine, dishwasher and kitchen electronics work for three, two and one time slot(s), respectively, before they turn off.
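The time-slot bookkeeping described above can be sketched as follows. This is only an illustration under stated assumptions: the 30-minute slot length and the three fixed durations come from the text, while the function names, the appliance keys and the convention that the switch-off slot is exclusive are hypothetical choices made for this example.

```python
from datetime import time
from typing import Optional

SLOT_MINUTES = 30                            # one time slot is thirty minutes (from the text)
SLOTS_PER_DAY = 24 * 60 // SLOT_MINUTES      # 48 slots per day

# Fixed working times in slots, as stated in the text. The dictionary keys are
# hypothetical identifiers chosen for this sketch.
FIXED_DURATION_SLOTS = {
    "washing_machine": 3,
    "dishwasher": 2,
    "kitchen_electronics": 1,
}


def slot_of(t: time) -> int:
    """Return the index of the 30-minute slot that contains clock time t."""
    return (t.hour * 60 + t.minute) // SLOT_MINUTES


def slot_label(slot: int) -> str:
    """Format a slot index as 'HH:MM-HH:MM'."""
    start = slot * SLOT_MINUTES
    end = start + SLOT_MINUTES
    return (f"{start // 60:02d}:{start % 60:02d}-"
            f"{(end // 60) % 24:02d}:{end % 60:02d}")


def switch_off_slot(appliance: str, start_slot: int,
                    chosen_duration: Optional[int] = None) -> int:
    """Return the first slot in which the appliance is off again (exclusive end).

    Appliances with a fixed working time ignore chosen_duration; the others
    (e.g. the TV) need it. Assumption: runs that cross midnight are not
    wrapped around in this sketch.
    """
    duration = FIXED_DURATION_SLOTS.get(appliance, chosen_duration)
    if duration is None:
        raise ValueError(f"a working time is required for {appliance!r}")
    return start_slot + duration
```

For instance, `slot_of(time(21, 0))` identifies the slot of the right-most bar in Figure 6.1, and `switch_off_slot("washing_machine", ...)` adds the three-slot working time to it.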

More specifically, for each appliance $i$, $i = 1, \cdots, 6$, we compute $P_i(t_s, t_e)$, the probability of using appliance $i$ from time $t_s$ to time $t_e$, from the data collected on the days of the specific type in the dataset. At the beginning of each day, the simulated user uses these probability distributions to plan when to switch each appliance on and off. For example, given the distribution of the washing machine shown in Figure 6.1, the simulated user has a 1% probability of switching on the washing machine between 00:00 and 00:30, and an 8.5% probability of turning it on between 13:00 and 13:30. For simplicity, we assume that, on each day, the usage of any appliance is independent of the usage of the other appliances, and is also independent of the usage on other days.
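One way to read the procedure above is as an empirical frequency estimate followed by independent sampling per appliance. The sketch below illustrates that reading; it is not the thesis's implementation. The function names, the representation of a usage as a (switch-on slot, switch-off slot) pair, and the toy observations are all assumptions made for this example and are not values from UK-DALE.

```python
import random
from collections import Counter


def estimate_usage_distribution(observations):
    """Estimate the usage distribution of one appliance as relative frequencies.

    observations: list of (t_s, t_e) slot pairs, one per selected day.
    Returns a dict mapping each observed pair to its empirical probability.
    """
    if not observations:
        raise ValueError("at least one observed usage is required")
    counts = Counter(observations)
    total = sum(counts.values())
    return {pair: n / total for pair, n in counts.items()}


def sample_daily_plan(distributions, rng):
    """Draw one (t_s, t_e) pair per appliance, independently of the others.

    distributions: dict mapping appliance name to its estimated distribution.
    rng: a random.Random instance, passed in so that runs are reproducible.
    """
    plan = {}
    for appliance, dist in distributions.items():
        pairs = list(dist.keys())
        weights = list(dist.values())
        plan[appliance] = rng.choices(pairs, weights=weights, k=1)[0]
    return plan


# Toy observations, invented purely for illustration.
toy_observations = {
    "washing_machine": [(26, 29), (26, 29), (0, 3), (20, 23)],
    "tv": [(0, 1), (38, 44), (38, 44), (40, 46)],
}
toy_distributions = {
    name: estimate_usage_distribution(obs)
    for name, obs in toy_observations.items()
}
toy_plan = sample_daily_plan(toy_distributions, random.Random(0))
```

Calling `sample_daily_plan` once per simulated day with the same distributions reflects the assumption that usage on one day is independent of usage on other days.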

As for aspect (2), we describe how the user responds to recommendations in Section 6.5, because this is closely related to the transition function and the reward function of this problem.
