

2.3 Machine Learning

2.3.2 Unsupervised and Reinforcement Learning

In this work we focused almost exclusively on supervised Machine Learning techniques. As we will see in Section 2.3.3, most ML methods commonly used in the HPC area deal with prediction problems involving labeled data, and therefore supervised algorithms are typically employed. Nevertheless, for the sake of completeness we briefly present two other paradigms of ML, namely unsupervised Machine Learning and reinforcement Machine Learning.

2.3.2.1 Unsupervised Learning

The main difference between unsupervised Machine Learning and the supervised approach is the lack of labels in the data [HS99,Bar89]. Unsupervised learning studies how systems can learn to represent particular input patterns in a way that reflects the statistical structure of the overall collection of input patterns [Day99]. There are no explicit target outputs, external evaluations or rewards. The only elements that unsupervised learning methods can use are the input patterns x_i, often assumed to be independent samples from an underlying unknown probability distribution P(x), and some explicit or implicit a priori information to determine what is relevant. Ghahramani provides a good high-level introduction to unsupervised learning from the perspective of statistical learning [Gha04]. An interesting survey on unsupervised algorithms for automation, classification and maintenance tasks can be found in [KMI+15].

The key concept underlying unsupervised learning is the possibility of learning a probabilistic model coherent with the input data. The model should be capable of estimating the probability distribution of a new input x_n given a series of previous inputs x_1, ..., x_{n-1}; that is, the model learns P(x_n | x_1, ..., x_{n-1}). At the heart of most probabilistic models there is again Bayes' theorem and its corollaries, as already introduced and discussed in Section 2.3.1.3. These models can then be used for outlier detection or monitoring, i.e. detecting unexpected outcomes. Other applications of such a learned model include classification tasks, communication and data compression [Mac03].
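As an illustration of how a learned density can be used for outlier detection, the following is a minimal sketch and not a method taken from the cited works. It assumes the inputs are independent, one-dimensional and roughly Gaussian, so that the predictive distribution for a new input reduces to the fitted density. The readings, the function names and the 3-standard-deviation cutoff are all hypothetical choices made for the example.

```python
import math
import statistics

def fit_gaussian(samples):
    """Estimate the mean and (sample) standard deviation of the inputs."""
    return statistics.fmean(samples), statistics.stdev(samples)

def log_density(x, mean, std):
    """Log of the Gaussian density N(x; mean, std^2)."""
    z = (x - mean) / std
    return -0.5 * z * z - math.log(std * math.sqrt(2.0 * math.pi))

def is_outlier(x, mean, std, k=3.0):
    """Flag x if it is less likely than a point k standard deviations away."""
    return log_density(x, mean, std) < log_density(mean + k * std, mean, std)

# Hypothetical unlabeled readings (assumption: i.i.d. and roughly Gaussian).
readings = [50.1, 49.8, 50.3, 50.0, 49.9, 50.2, 49.7, 50.0]
mean, std = fit_gaussian(readings)

typical_flag = is_outlier(50.1, mean, std)   # close to the bulk of the data
unusual_flag = is_outlier(58.0, mean, std)   # far from anything seen so far
```

No labels are involved at any point: the model is built from the inputs alone, and "unexpected" simply means "assigned a low probability by the learned model".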

In many real-world contexts there might be a mixture of labeled and unlabeled data. This can easily happen in domains where collecting data is cheap (e.g. the internet) but the labeling phase is much more expensive or time-consuming. The research area of semi-supervised learning focuses precisely on this kind of issue. The main aspect to be considered in semi-supervised learning is how the distribution of the unlabeled data influences the supervised learning problem [See00]. Delving into the details of semi-supervised learning is outside the scope of this work, but we can note that many of the proposed approaches try to infer a manifold, graph structure or tree structure from the unlabeled data and employ this information to determine how labels should be generalized to new unlabeled points [Kru64,JS02,ZGL+03].
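To make the graph-based idea concrete, the sketch below propagates two known labels over a small similarity graph. It is a simplified illustration of the general family of graph-based approaches, not a reproduction of any specific cited algorithm. The graph, the node numbering and the function name are hypothetical, and the graph is assumed to be given rather than inferred from data.

```python
def propagate_labels(neighbors, known, iterations=200):
    """Spread label scores over a graph.

    neighbors: dict mapping each node to the list of its adjacent nodes.
    known:     dict mapping labeled nodes to a score (1.0 = class A, 0.0 = class B).
    Unlabeled nodes repeatedly take the average score of their neighbors,
    while labeled nodes are kept fixed (clamped).
    """
    scores = {node: known.get(node, 0.5) for node in neighbors}
    for _ in range(iterations):
        for node, adjacent in neighbors.items():
            if node in known:
                continue  # labeled nodes keep their given score
            scores[node] = sum(scores[n] for n in adjacent) / len(adjacent)
    return scores

# Hypothetical graph: two tight clusters {0, 1, 2} and {3, 4, 5}
# joined by a single bridge edge between nodes 2 and 3.
graph = {
    0: [1, 2], 1: [0, 2], 2: [0, 1, 3],
    3: [2, 4, 5], 4: [3, 5], 5: [3, 4],
}
labeled = {0: 1.0, 5: 0.0}  # only two of the six points carry a label

scores = propagate_labels(graph, labeled)
predicted = {n: ("A" if s > 0.5 else "B") for n, s in scores.items()}
```

With only one labeled point per cluster, the structure of the unlabeled data decides the rest: nodes 1 and 2 end up in class A and nodes 3 and 4 in class B, because labels flow more easily within a cluster than across the bridge.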

2.3.2.2 Reinforcement Learning

Reinforcement-based learning systems perform actions and receive punishments (negative reinforcement) or prizes (positive reinforcement) [SB98, Sut92, TL00]. Consequently, the learner tries to take the actions that lead to the best possible result. Reinforcement-based learning systems can also be seen as agents operating in an environment and receiving positive, negative or neutral rewards according to their actions; the rewards are usually given by a trainer. The task of the agent is to learn from this indirect reward and choose the sequence of actions that guarantees the best overall outcome. An important aspect of reinforcement learning is the need to balance exploration and exploitation [SB98]. In order to execute a task optimally, a learning agent needs to exploit the “good” actions that it has learned through the reward mechanism, but at the same time the learning process requires trying multiple different actions, i.e. exploring the space of possible actions. Neither exploration nor exploitation can be pursued exclusively without failing at the task. The exploration/exploitation dilemma has been intensively studied by mathematicians for many decades [Bel56, BT95].
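One simple and widely used way to trade off exploration against exploitation is the epsilon-greedy rule, sketched below on a multi-armed bandit. This is an illustrative choice, not a strategy prescribed by the text; the arm means, the unit-variance Gaussian noise, the value of epsilon and the function name are all assumptions made for the example.

```python
import random

def run_bandit(true_means, epsilon, steps, seed=0):
    """Play a bandit with an epsilon-greedy agent.

    With probability epsilon the agent explores (random arm); otherwise it
    exploits the arm with the highest estimated reward so far.
    Returns the per-arm reward estimates and pull counts.
    """
    rng = random.Random(seed)
    n_arms = len(true_means)
    counts = [0] * n_arms
    estimates = [0.0] * n_arms
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(n_arms)                            # explore
        else:
            arm = max(range(n_arms), key=lambda a: estimates[a])   # exploit
        reward = rng.gauss(true_means[arm], 1.0)  # noisy reward from the environment
        counts[arm] += 1
        # Incremental update of the running average reward of this arm.
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
    return estimates, counts

# Hypothetical three-armed bandit; the last arm is the best on average.
estimates, counts = run_bandit([0.0, 0.5, 2.0], epsilon=0.1, steps=5000)
```

Setting epsilon to 0 gives pure exploitation, which can lock onto a mediocre arm after a few lucky rewards, while setting it to 1 gives pure exploration, which never uses what has been learned. Intermediate values let the agent find the best arm and then pull it most of the time.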

The most important components present in the vast majority of reinforcement learning systems are the following: 1) a policy, 2) a reward function and 3) a value function. A policy defines the behaviour of an agent, i.e. the action it will take in a given state (as perceived from the environment). In general, policies may be stochastic. The reward function specifies the goal: it maps every possible state (or state-action pair) to a single value representing the desirability of that state. The only goal of a reinforcement learning agent is to maximize the reward accumulated in the long run. While the reward function indicates the immediate attractiveness of a certain state, the value function takes a long-term point of view. The value of a state is the total amount of reward an agent can expect to accumulate in the future, starting from that state. Therefore the value function takes into account the rewards of the states that may follow. Rewards are given directly by the environment (the response obtained after reaching a certain state), whereas values are predictions of rewards; actions and decisions are made based on value judgments.
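The relationship between the three components can be sketched on a toy problem. The example below assumes a hypothetical five-state corridor with deterministic moves, a fixed "always move right" policy, a reward of 1 for entering the last state, and a discount factor of 0.9 (discounting is a common convention that the text does not specify). The value of each state is computed by repeatedly applying the relation "value = immediate reward + discounted value of the next state".

```python
GAMMA = 0.9   # discount factor (assumption: future rewards count slightly less)
GOAL = 4      # states are 0..4; state 4 is terminal

def policy(state):
    """Policy: the action taken in a given state (here: always move right)."""
    return +1

def step(state, action):
    """Deterministic environment dynamics on a corridor with walls at both ends."""
    return min(max(state + action, 0), GOAL)

def reward(next_state):
    """Reward function: immediate desirability of the state just reached."""
    return 1.0 if next_state == GOAL else 0.0

def evaluate_policy(sweeps=100):
    """Value function of the policy: expected discounted reward from each state."""
    values = [0.0] * (GOAL + 1)
    for _ in range(sweeps):
        for s in range(GOAL):  # the terminal state keeps value 0
            nxt = step(s, policy(s))
            values[s] = reward(nxt) + GAMMA * values[nxt]
    return values

values = evaluate_policy()
```

Only the move into the goal yields an immediate reward, yet every state has a positive value (about 0.729, 0.81, 0.9 and 1.0 for states 0 to 3), because the value looks ahead to the reward that the policy will eventually collect.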

A very good, though somewhat dated, survey on reinforcement learning can be found in [KLM96]. Gosavi [Gos09] considers more recent developments in reinforcement learning, with particular regard to its application in control theory and to agents performing decision-making. Both these aspects can be modeled as Markov decision problems [Put14], which were traditionally solved through Dynamic Programming. The author argues that reinforcement learning is a powerful tool for this kind of problem thanks to its ability to solve complex and large-scale Markov decision processes near-optimally.
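A minimal sketch of how a reinforcement learning method can solve a Markov decision process without using a model of it is tabular Q-learning, shown below. It is chosen here purely as a well-known representative and is not claimed to be the method discussed in the cited works. The corridor environment, the learning rate, the discount factor and the uniformly random behaviour policy are assumptions made for the example.

```python
import random

N_STATES = 5            # states 0..4; state 4 is the terminal goal
ACTIONS = (-1, +1)      # move left or move right
GAMMA, ALPHA = 0.9, 0.5 # discount factor and learning rate (assumed values)

def step(state, action):
    """Environment: the agent only observes the next state and the reward."""
    nxt = min(max(state + action, 0), N_STATES - 1)
    reward = 1.0 if nxt == N_STATES - 1 else 0.0
    return nxt, reward

def q_learning(episodes=300, max_steps=200, seed=0):
    """Learn action values from sampled transitions only."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state = 0
        for _ in range(max_steps):
            a = rng.randrange(len(ACTIONS))   # uniformly random exploration
            nxt, reward = step(state, ACTIONS[a])
            # Move the estimate towards reward + discounted best next value.
            target = reward + GAMMA * max(q[nxt])
            q[state][a] += ALPHA * (target - q[state][a])
            state = nxt
            if state == N_STATES - 1:
                break
    return q

q_table = q_learning()
# Greedy policy extracted from the learned action values.
greedy = [ACTIONS[max(range(len(ACTIONS)), key=lambda a: q_table[s][a])]
          for s in range(N_STATES - 1)]
```

Unlike Dynamic Programming, the update never consults the transition rules directly; it relies only on observed transitions, which is what makes this family of methods applicable when the full model is too large or unknown.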