Theoretical Analysis - Posterior Regularization for Learning with Side Information and Weak Supervision

11.1.4 Theoretical Analysis

For supervised learning, the probably approximately correct (PAC) model of learning gives us a theoretical understanding of generalization. For example, we know that as we increase the power of the function class we wish to learn, we need a larger training set in order to be confident of selecting an approximately correct function from that class. However, the situation is manageable: for finite classes the number of examples needed grows only logarithmically in the size of the class, and for infinite classes only linearly in the Vapnik-Chervonenkis dimension of the function class.
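The logarithmic dependence for finite classes can be made concrete with a small sketch. It assumes the standard textbook bound for a consistent learner in the realizable case, m ≥ (ln|H| + ln(1/δ)) / ε, which is not stated in this thesis; the function name and the values of ε and δ are illustrative only.

```python
import math


def finite_class_sample_bound(class_size: int, epsilon: float, delta: float) -> int:
    """Number of examples sufficient for a consistent learner over a finite
    class of `class_size` functions to have error at most `epsilon` with
    probability at least 1 - `delta` (realizable case, textbook bound)."""
    if class_size < 1:
        raise ValueError("class_size must be at least 1")
    if not (0.0 < epsilon < 1.0 and 0.0 < delta < 1.0):
        raise ValueError("epsilon and delta must lie strictly between 0 and 1")
    return math.ceil((math.log(class_size) + math.log(1.0 / delta)) / epsilon)


# Illustrative values: 10% error, 95% confidence, classes of growing size.
bounds = {
    size: finite_class_sample_bound(size, epsilon=0.1, delta=0.05)
    for size in (10**3, 10**6, 10**9)
}
for size, m in bounds.items():
    print(f"|H| = {size:>13,d}  ->  m >= {m}")
```

Under these assumptions, each thousandfold increase in the size of the class adds only about 69 examples to the bound, which is the sense in which the growth is logarithmic.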

There is no similar theory for the generalization performance of PR, GE or CoDL. It is not clear how much unlabeled data we need before we can expect that constraints that are satisfied on the training data will also be satisfied on a new sample.

In addition to sample complexity, it would be useful to know how much prior knowledge we need for a particular problem. For example, if the constraint set contains only a few distributions of labels for a large unlabeled sample, then intuitively we should not expect much benefit from adding further prior knowledge in the form of additional constraints.

Part III

Appendices

Appendix A

Probabilistic models

Throughout this thesis we are interested in estimating the parameters of probabilistic models. These are models of some quantities of interest that define a probability distribution over several outcomes. Perhaps the simplest example is a coin flip: for some particular coin, we want to be able to predict whether it will land “heads” or “tails” when we flip it. In reality, the process of flipping a coin is very complicated: it involves a physical environment which might be changing, a precise time at which the coin is flipped, and a person performing the action. Overwhelmed by this incredibly complicated system, we typically create a very simplistic model: every time a coin is flipped, the “heads” vs. “tails” outcome is a random event, which we represent with a random variable. Let y ∈ {heads, tails} be our notation for this random variable, which we model with a single free parameter 0 ≤ p(y = heads) ≤ 1, known as the “heads” probability. We assume that with probability p(y = heads) the coin will land “heads” and with probability p(y = tails) = 1 − p(y = heads) it will land “tails.” The probabilities of all other outcomes (the coin landing on its side and balancing, becoming wedged vertically, being stolen by a passer-by before it reaches the ground) are assumed to be zero. Furthermore, we assume that the probability p(y = heads) is the same every time we flip the coin. This assumption allows us to use observations of past coin flips to try to predict future coin flips. In machine learning jargon, coin flips are assumed to be independent and identically distributed (IID). This means that the outcomes of future coin flips do not depend on the outcomes of previous ones, and that all the coin flips have the same “heads” probability.
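The way the IID assumption lets past flips inform predictions about future ones can be illustrated with a minimal sketch. It simulates flips from a coin whose “heads” probability is an arbitrary illustrative value (0.7) and estimates p(y = heads) by the relative frequency of heads, which is the maximum-likelihood estimate under this model. The function names are hypothetical and chosen only for this example.

```python
import random


def flip_coins(p_heads: float, n: int, rng: random.Random) -> list:
    """Draw n IID flips; each lands "heads" with probability p_heads."""
    return ["heads" if rng.random() < p_heads else "tails" for _ in range(n)]


def estimate_p_heads(flips: list) -> float:
    """Relative frequency of "heads": the maximum-likelihood estimate of
    p(y = heads) when the flips are assumed IID."""
    if not flips:
        raise ValueError("need at least one observed flip")
    return sum(flip == "heads" for flip in flips) / len(flips)


rng = random.Random(0)          # fixed seed so the run is repeatable
true_p_heads = 0.7              # illustrative value, unknown to the learner
flips = flip_coins(true_p_heads, 10_000, rng)
p_hat = estimate_p_heads(flips)
print(f"estimated p(y = heads) = {p_hat:.3f}")
```

Because every flip is assumed to share the same “heads” probability, the estimate computed from past flips is also the model's prediction for the next one.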

Obviously, with a real coin in a real environment these assumptions are violated, but without making them it would be very hard to make any progress. Additionally, for the case of flipping a coin, the cost of trying to make a more complicated and powerful model outweighs the potential benefits. For phenomena where we can get more traction with relatively moderate increases in model complexity, we often use much more complicated models. However, we are always forced to make simplifying assumptions, even when we know that they will be grossly violated. A goal of this thesis is to present a way to include information that we have about the real world so that the simple models we have at our disposal can more effectively predict phenomena of interest.

A.1 Latent Variables, Generative and Discriminative Models

In the coin-flip example discussed above, we have a fully observable model. When some of the random variables are hidden from observation, we might create what is called a latent-variable model. For example, suppose that we cannot actually see the coin flip directly, but have the result communicated to us through a noisy channel. Because of noise in the channel, there is some probability that a result of “heads” will be communicated as “tails” and vice versa. Let x ∈ {heads, tails} be a random variable representing whether we receive “heads” or “tails” from the communication channel. We say that x is the observed variable, while y is the latent variable.

A generative model for this scenario would have three free parameters: the original heads probability p(y = heads), as well as two free parameters for the probabilities that describe the noisy channel p(x|y): the probability that we receive heads given the coin landed heads, p(x = heads|y = heads), and the probability that we receive heads given the coin landed tails, p(x = heads|y = tails).
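As a minimal sketch of such a generative model, the class below stores the three free parameters and uses them to evaluate the joint probability p(x, y) = p(y) p(x|y) and to sample a pair (x, y). The class name, method names and parameter values are hypothetical and serve only as an illustration.

```python
import random
from dataclasses import dataclass
from typing import Tuple

OUTCOMES = ("heads", "tails")


@dataclass(frozen=True)
class NoisyCoinModel:
    """Generative model of a coin flip y observed through a noisy channel x."""
    p_y_heads: float                 # p(y = heads)
    p_x_heads_given_y_heads: float   # p(x = heads | y = heads)
    p_x_heads_given_y_tails: float   # p(x = heads | y = tails)

    def _p_x_heads(self, y: str) -> float:
        if y == "heads":
            return self.p_x_heads_given_y_heads
        return self.p_x_heads_given_y_tails

    def joint(self, x: str, y: str) -> float:
        """p(x, y) = p(y) * p(x | y)."""
        p_y = self.p_y_heads if y == "heads" else 1.0 - self.p_y_heads
        p_x_heads = self._p_x_heads(y)
        p_x = p_x_heads if x == "heads" else 1.0 - p_x_heads
        return p_y * p_x

    def sample(self, rng: random.Random) -> Tuple[str, str]:
        """Draw the latent flip y first, then the received signal x."""
        y = "heads" if rng.random() < self.p_y_heads else "tails"
        x = "heads" if rng.random() < self._p_x_heads(y) else "tails"
        return x, y


# Illustrative parameter values, not taken from the thesis.
model = NoisyCoinModel(0.6, 0.9, 0.2)
total = sum(model.joint(x, y) for x in OUTCOMES for y in OUTCOMES)
x, y = model.sample(random.Random(0))
```

The three stored numbers determine the whole joint distribution, since p(y = tails) and p(x = tails|y) follow by subtracting from one.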

Potentially, we might not be interested in this full model, but only in making a decision about y given an observation x. A discriminative model for the system would directly model p(y|x): the probability that the coin landed “heads” vs. “tails” given what communication we receive from the noisy channel. This model now has only two free parameters: p(y = heads|x = heads) and p(y = heads|x = tails). Discriminative probabilistic models are also known as conditional models. Chapter 2 shows how our framework for including prior knowledge can be used with both generative and discriminative models.
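The relationship between the two parameterizations can be sketched with Bayes' rule: given the three generative parameters, the two discriminative parameters are p(y = heads|x) = p(y = heads) p(x|y = heads) / p(x). The function name and the numeric values below are hypothetical; a discriminative model would normally estimate the two conditional probabilities directly rather than derive them this way.

```python
def posterior_heads(p_y_heads: float,
                    p_x_heads_given_y_heads: float,
                    p_x_heads_given_y_tails: float,
                    x: str) -> float:
    """Return p(y = heads | x) from the generative parameters via Bayes' rule."""
    if x == "heads":
        likelihood_heads = p_x_heads_given_y_heads
        likelihood_tails = p_x_heads_given_y_tails
    else:
        likelihood_heads = 1.0 - p_x_heads_given_y_heads
        likelihood_tails = 1.0 - p_x_heads_given_y_tails
    numerator = p_y_heads * likelihood_heads
    evidence = numerator + (1.0 - p_y_heads) * likelihood_tails  # p(x)
    if evidence == 0.0:
        raise ValueError("observation x has zero probability under the model")
    return numerator / evidence


# Illustrative generative parameters, not taken from the thesis.
discriminative = {
    x: posterior_heads(0.6, 0.9, 0.2, x) for x in ("heads", "tails")
}
for x, p in discriminative.items():
    print(f"p(y = heads | x = {x}) = {p:.3f}")
```

The mapping goes only one way: the two conditional probabilities do not by themselves recover p(y = heads) or the channel probabilities, which is why the discriminative model has fewer free parameters.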
