Deriving an optimal policy - Reinforcement Learning model

Chapter 3 State-space Machine Learning Algorithms for Classification

3.3 Reinforcement Learning model

3.3.6 Deriving an optimal policy

Optimality in RL hinges on Bellman’s “Principle of Optimality” [100]. Under an optimal policy, the sequence of decisions is optimal regardless of the state in which the agent begins. An optimal policy was derived for each of the three learning algorithms via a training/validation/testing paradigm for the RL Classifier and the RL-HMM hybrid (HyQ, HyF) model.

Learning occurred during the training phase and used a training dataset to obtain the optimal decisions (actions) to be taken at every state. These optimal actions demonstrated the behaviour used to construct a predictive relationship in the optimal policy. The reward in the training phase was the true class assignment (column 5 in Table 3-1); this was the supervised learning phase of the algorithm.

All of the learning algorithms collected rewards and processed them through the Bellman action value equation to calculate Q-values, 𝑄(𝑆𝑛) (Table 3-5). The optimal value for each state was the maximum of that state’s Q-values; collecting these maxima produced a V-table (Table 3-6), an example of which is shown in the second column of Table 3-12.

Table 3-12: V-table for the insider trading problem ε-greedy learning

State   V-table = max(Q-table)   Corresponding optimal action
1       1.1620                   1
2       0.1200                   2
3       8.0743                   1
4       0.6000                   1
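The maximization step that produces a V-table and its associated actions can be sketched in Python. This is a minimal illustration, not the thesis implementation: the `q_table` layout, the function name, and the assumption of two actions per state are hypothetical, and only the maximum Q-value in each row is taken from Table 3-12 (the non-maximal entries are placeholders).

```python
# Hypothetical Q-table: state -> {action: Q-value}.
# Only the maximum in each row comes from Table 3-12; the other
# entries are made-up placeholders for illustration.
q_table = {
    1: {1: 1.1620, 2: 0.4000},
    2: {1: 0.0500, 2: 0.1200},
    3: {1: 8.0743, 2: 2.0000},
    4: {1: 0.6000, 2: 0.1000},
}


def derive_v_table_and_policy(q_table):
    """Return (v_table, policy): the max Q-value and its action per state."""
    v_table, policy = {}, {}
    for state, action_values in q_table.items():
        # The optimal action is the one with the largest Q-value.
        best_action = max(action_values, key=action_values.get)
        policy[state] = best_action
        v_table[state] = action_values[best_action]
    return v_table, policy


v_table, policy = derive_v_table_and_policy(q_table)
```

With these inputs, `v_table` reproduces the second column of Table 3-12 and `policy` reproduces the third.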

On each iteration through the database, the resulting optimal values 𝑉_𝜋′ were compared to the optimal values generated by the previous learning iteration through the training dataset, 𝑉_𝜋. Optimality was calculated based on 𝜁:

𝜁 = 𝑉_𝜋′(𝑆𝑡) > 𝑉_𝜋(𝑆𝑡)    (3-1)

for all states after t observations.

The optimal stopping criterion was based on a convergence value, conv.

𝜁 ≥ 𝑐𝑜𝑛𝑣 ∗ 𝑁 (3-2)

When conv = 0.7, at least 70% of the states had to be visited and assigned a weight. If 𝜁 did not meet the stopping criterion, the learning agent went through another episode of searching. The convergence constraint was implemented to speed up processing. In the process of filling the Q-table, the weights on the significant states increased faster than the others and converged; however, this did not mean that the search space had been covered. The search therefore had to be constrained to ensure that a sufficient number of states were visited to produce good results.
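The stopping rule in (3-1) and (3-2) can be sketched as follows. The sketch rests on two assumptions that the text does not state outright: that 𝜁 counts the states whose value increased between successive iterations, and that N is the number of states. The function names and the default of 0.0 for a state with no previous value are hypothetical.

```python
def count_improved_states(v_new, v_old):
    """Assumed reading of (3-1): number of states whose value increased."""
    return sum(
        1 for state, value in v_new.items()
        if value > v_old.get(state, 0.0)  # unseen states assumed to start at 0
    )


def stopping_criterion_met(v_new, v_old, conv=0.7):
    """Check (3-2): zeta >= conv * N, with N taken as the number of states."""
    n_states = len(v_new)
    zeta = count_improved_states(v_new, v_old)
    return zeta >= conv * n_states


# Example: 3 of 4 states gained weight, and 3 >= 0.7 * 4 = 2.8.
v_old = {1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0}
v_new = {1: 1.1620, 2: 0.1200, 3: 8.0743, 4: 0.0}
done = stopping_criterion_met(v_new, v_old)
```

If `done` is false, the agent would run another episode of searching before the check is repeated.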

An optimal policy was the set of actions associated with the maximal values for each state once the algorithm had converged. An example of an optimal policy is in column 3 of Table 3-12.

In the case of the insider trading problem, the weights represent the stocks and traders involved in frequent high returns relative to other stocks and traders. This deviation is calculated by the temporal differencing calculation of the BAV. The deviation becomes apparent in the differencing of state in the equation: 𝑄(𝑗, 𝑏) − 𝑄(𝑖, 𝑎). This is also where the distinction in action emerges, that is, the action resulting in the highest reward. After training, the optimal actions for each algorithm are shown in Table 3-13.
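The role of the difference 𝑄(𝑗, 𝑏) − 𝑄(𝑖, 𝑎) can be illustrated with a standard temporal-difference update. This is a generic textbook form, not necessarily the exact BAV equation of Table 3-5: the learning rate `alpha`, the discount factor `gamma`, and the function name are assumptions introduced here for illustration.

```python
def td_update(q, i, a, reward, j, b, alpha=0.1, gamma=0.9):
    """Update Q(i, a) in place with a standard temporal-difference step.

    i, a: current state and action; j, b: next state and action.
    alpha and gamma are assumed hyperparameters, not values from the thesis.
    """
    # Deviation between the reward-adjusted next estimate and the current one.
    td_error = reward + gamma * q[j][b] - q[i][a]
    q[i][a] += alpha * td_error
    return td_error


# Example with a hypothetical two-state, two-action Q-table.
q = {1: {1: 0.0, 2: 0.0}, 2: {1: 1.0, 2: 0.0}}
err = td_update(q, i=1, a=1, reward=1.0, j=2, b=1, alpha=0.5, gamma=1.0)
```

State-action pairs that repeatedly lead to high rewards accumulate larger Q-values, which is how the heavily weighted states emerge.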

Table 3-13: Optimal policy for all learning algorithms

State   ε-greedy   Optimal learning   Boltzmann learning
1       1          2                  1
2       2          1                  1
3       1          2                  1
4       1          1                  1

Boltzmann learning was the most exploitative algorithm, selecting action 1 in every state, in contrast to optimal learning, which was the most balanced decision-maker and split its choices evenly between the two actions.

The next step was validation to select the optimal learning agent. In this stage, the optimal policy was assessed on the training set using the rewards in column 4 of Table 3-1. The policy was then evaluated as a classifier using a confusion matrix. The c% most heavily weighted states were labeled as positive samples, and the remaining states were considered to be negative samples. Since the ground truth vector used in the classification scheme assigned class by record, the assumption was that if a state was fraudulent, then all records belonging to that state were fraudulent too.

The best learning algorithm was selected using the “correct rate”, which was calculated as the sum of the diagonal of the confusion matrix; values on the diagonal are correctly predicted samples. The correct rate was used during the validation phase to identify the agent with the best ability to differentiate between positive and negative samples. In contrast, if the learning agent had been selected based on sensitivity, it could also have had a high false positive rate. Therefore the best all-around learner was promoted to the testing phase along with its optimal policy.
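The validation scoring described above can be sketched as follows. The helper names and the example data are hypothetical. The text defines the correct rate as the sum of the confusion-matrix diagonal; dividing by the total number of samples to obtain a proportion is an added assumption, controlled here by the `normalize` flag.

```python
def label_records(record_states, positive_states):
    """Predict 1 for every record whose state was labeled positive, else 0."""
    return [1 if state in positive_states else 0 for state in record_states]


def confusion_matrix(y_true, y_pred):
    """2x2 matrix indexed as matrix[true][predicted], for classes 0 and 1."""
    matrix = [[0, 0], [0, 0]]
    for true_label, predicted_label in zip(y_true, y_pred):
        matrix[true_label][predicted_label] += 1
    return matrix


def correct_rate(matrix, normalize=True):
    """Sum of the diagonal; optionally divided by the total (an assumption)."""
    diagonal = matrix[0][0] + matrix[1][1]
    if not normalize:
        return diagonal
    total = sum(sum(row) for row in matrix)
    return diagonal / total if total else 0.0


# Hypothetical example: six records, state 3 labeled positive.
record_states = [3, 3, 1, 2, 4, 1]
y_true = [1, 1, 0, 0, 0, 1]
y_pred = label_records(record_states, positive_states={3})
matrix = confusion_matrix(y_true, y_pred)
```

In this example five of the six records fall on the diagonal; the one miss is a fraudulent record in a state that was not flagged, which shows the effect of assigning class by state rather than by record.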

The testing dataset was independent of the training data; it was used to evaluate the quality of the classifier and its optimal policy. Testing was an off-policy method because the actions were not variable; they were fixed by the optimal policy derived from the most correct learner during the validation phase. These optimal actions were the decisions made in each state on a test database, and Q-values were calculated from them.

It was possible that states existed in the testing data that did not exist in the training data, so no optimal policy existed for those states. In this scenario, the model defaulted to the original learning algorithm’s decision-making process. If this scenario arose during initialization, the model selected a random action because no information had yet been collected to inform the decision maker. This completed the learning agent’s work. It was trained, validated and tested.
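The fallback logic for unseen states can be sketched as a three-step lookup. All names here are hypothetical: `learner_rule` stands in for whichever decision process the selected learner used (ε-greedy, optimal learning, or Boltzmann), and `greedy_rule` is only a placeholder for it.

```python
import random


def greedy_rule(action_values):
    """Placeholder learner rule: pick the action with the largest Q-value."""
    return max(action_values, key=action_values.get)


def select_test_action(state, policy, q_table, learner_rule, actions, rng=random):
    """Choose an action during testing.

    1. Use the fixed optimal policy when the state was seen in training.
    2. Otherwise fall back to the learner's own rule on collected Q-values.
    3. With no information at all (initialization), pick a random action.
    """
    if state in policy:
        return policy[state]
    if state in q_table:
        return learner_rule(q_table[state])
    return rng.choice(actions)


# Hypothetical example: state 5 is new but has Q-values; state 9 has nothing.
policy = {1: 1, 2: 2, 3: 1, 4: 1}
test_q_table = {5: {1: 0.2, 2: 0.9}}
actions = [1, 2]
known = select_test_action(1, policy, test_q_table, greedy_rule, actions)
unseen = select_test_action(5, policy, test_q_table, greedy_rule, actions)
blind = select_test_action(9, policy, test_q_table, greedy_rule, actions,
                           rng=random.Random(0))
```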

For the RL Classifier, the output was a V-table sorted by Q-values in which c% of states and their corresponding samples were labeled as positive.
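The RL Classifier output can be sketched as a sort followed by a cut-off. The function name is hypothetical, and rounding the number of positive states up with `math.ceil` is an assumption, since the text does not say how a fractional count is handled. The values are those of Table 3-12.

```python
import math


def label_top_states(v_table, c_percent):
    """Return the set of states in the top c% of the V-table by value."""
    ranked = sorted(v_table, key=v_table.get, reverse=True)
    # Assumption: round the number of positive states up.
    n_positive = math.ceil(len(ranked) * c_percent / 100)
    return set(ranked[:n_positive])


v_table_example = {1: 1.1620, 2: 0.1200, 3: 8.0743, 4: 0.6000}
top_quarter = label_top_states(v_table_example, c_percent=25)
top_half = label_top_states(v_table_example, c_percent=50)
```

With c = 25, only state 3 is labeled positive; with c = 50, states 3 and 1 are. Every record belonging to a positive state would then inherit the positive label.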

For the RL-HMM hybrid (HyQ, HyF) model, the output of the testing phase was a path of rewards collected by the agent. These rewards were considered to be “observations” made by the agent as it conducted an optimal search through the database. The next step was to uncover the states that generated the optimal rewards. These states were then flagged as fraudulent. The fraudulent states were uncovered by a Hidden Markov Model.

In document Financial Fraud Detection and Data Mining of Imbalanced Databases using State Space Machine Learning (Page 60-62)