Our goal is to determine under what conditions unsupervised representation learning enhances the performance of a subsequent supervised learner. This objective is pertinent to a range of common machine learning scenarios. Do the features learned by an autoencoder enhance the performance of a linear classifier compared to using the original inputs? Does a particular kernel function outperform a linear kernel when used with a hypothesis class of linear separators (recalling that kernel functions implicitly specify a representation space)? Do vector representations of words outperform one-hot unigram representations for natural language processing tasks?
As a step towards our goal, we first formalize the problem setting. We provide a comparison to a previous formalization of semi-supervised learning. We then state the objectives of unsupervised representation learning, and a high-level approach to determining the conditions under which these objectives are met.
2.3.1 Problem Setting
Let $\mathcal{X}$ and $\mathcal{Y}$ be input and target spaces respectively. Let $X$ and $Y$ be input and target random variables respectively. Let $\mu_{XY}$ be a probability distribution over $\mathcal{X} \times \mathcal{Y}$ and $\mu_X$ be its marginal distribution over $\mathcal{X}$. Let $p(\cdot)$ refer to the probability that a point drawn from $\mu_{XY}$ satisfies some condition, and let $p(x) := p(X = x)$. Let $S_{XY}$ be a sample drawn from $\mu_{XY}$ and let $S_X$ be a sample drawn from $\mu_X$. Let $\mathcal{H}$ be a hypothesis class whose elements are of type $h : \mathcal{X} \to \mathcal{Y}$.
Let $\mathcal{Z}$ be the representation space. Let $\mathcal{F}$ be a representation function class whose elements are of type $f : \mathcal{X} \to \mathcal{Z}$. In unsupervised representation learning, we use $\mathcal{F}$ and $S_X$ to learn some $f \in \mathcal{F}$. Let the representation variable be $Z := f(X)$. Let $\mathcal{G}$ be a hypothesis class whose elements are of type $g : \mathcal{Z} \to \mathcal{Y}$.
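Let $l : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}_+$ be a loss function. Let the risk of some $h \in \mathcal{H}$ be
$$R(h) := \mathbb{E}_{X,Y \sim \mu_{XY}}[l(h(X), Y)].$$
Similarly, let the risk of hypothesis $g \in \mathcal{G}$ using the representation function $f \in \mathcal{F}$ be
$$R(g \circ f) := \mathbb{E}_{X,Y \sim \mu_{XY}}[l(g(f(X)), Y)].$$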
[Figure 2.2 panels: "Semi-supervised learning" (left), labeled $\mathcal{H}$, $\mathcal{H}_s$ and $h^*$; "Unsupervised representation learning" (right), labeled $\mathcal{H}$, $\mathcal{G} \circ f$ and $h^*$.]
Figure 2.2: Relationship between semi-supervised learning as formalized in [Balcan and Blum, 2010] (left) and unsupervised representation learning (right). In semi-supervised learning, unlabeled data is used to prune the hypothesis class $\mathcal{H}$ to the subset $\mathcal{H}_s \subset \mathcal{H}$. In unsupervised representation learning, unlabeled data is used to discover $f$ and the hypothesis class changes to $\mathcal{G} \circ f$. In both cases we hope that the target function $h^*$ is within our new hypothesis class.
We are interested in computing and comparing these two quantities, $R(h)$ and $R(g \circ f)$. Of particular interest are the cases where $\mathcal{G}$ and $\mathcal{H}$ are the same, or are related to each other through a straightforward change of type signature. For example, $\mathcal{G}$ and $\mathcal{H}$ may both be the classes of linear separators for their respective input types.
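As a minimal sketch of the two risk definitions, the following computes $R(h)$ and $R(g \circ f)$ exactly on a small discrete distribution. Everything concrete here is an assumption made for illustration and does not come from the text: the XOR-style distribution `MU_XY`, the zero-one loss, the particular linear separators `h` and `g`, and the hand-picked representation `f` (which is fixed by hand rather than learned from $S_X$).

```python
# Hypothetical toy distribution mu_XY: uniform over four XOR-style points.
# Each entry is (x, y, probability), with y = +1 when the signs of x differ.
MU_XY = [((x1, x2), -x1 * x2, 0.25) for x1 in (-1, 1) for x2 in (-1, 1)]


def zero_one_loss(y_pred, y_true):
    """Loss l: Y x Y -> R+ (zero-one loss, chosen for illustration)."""
    return 0.0 if y_pred == y_true else 1.0


def risk(hypothesis, distribution, loss=zero_one_loss):
    """Exact risk: the expected loss under a finite distribution."""
    return sum(p * loss(hypothesis(x), y) for x, y, p in distribution)


def linear_separator(weights, bias):
    """Return a classifier v -> +1 if <weights, v> + bias >= 0 else -1."""
    def classify(v):
        score = sum(w * vi for w, vi in zip(weights, v)) + bias
        return 1 if score >= 0 else -1
    return classify


def f(x):
    """Hand-picked representation f: X -> Z that appends the product feature."""
    x1, x2 = x
    return (x1, x2, x1 * x2)


def compose(g, f):
    """Function composition g o f."""
    return lambda x: g(f(x))


h = linear_separator((1.0, 1.0), 0.0)        # an element of H, of type X -> Y
g = linear_separator((0.0, 0.0, -1.0), 0.0)  # an element of G, of type Z -> Y

risk_h = risk(h, MU_XY)
risk_gf = risk(compose(g, f), MU_XY)
print(risk_h, risk_gf)
```

Here `h` and `g` are both linear separators, differing only in the dimension of their input, which is the "change of type signature" case described above.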
2.3.2 Comparison to Semi-Supervised Learning
Semi-supervised learning involves using unlabeled data to help supplement scarce labeled data in a supervised learning problem. By this broad definition, unsupervised representation learning can be considered a kind of semi-supervised learning. However, the formal analysis of semi-supervised learning from the perspective of statistical learning theory by Balcan and Blum [2010] uses a somewhat narrower definition. Under that definition, unsupervised representation learning as we describe it is conceptually different. The differences between the two approaches are shown in Figure 2.2.
In semi-supervised learning, we aim to prune $\mathcal{H}$ to some $\mathcal{H}_s \subset \mathcal{H}$ using only unlabeled data. Supposing that there is some target function $h^* \in \mathcal{H}$ we are trying to learn, we also wish to ensure that $h^* \in \mathcal{H}_s$. The smaller hypothesis class may make learning a subsequent supervised task more straightforward and enable a tighter generalization error bound [Balcan and Blum, 2010]. However, if the target function lies outside the original hypothesis class, semi-supervised learning will not help to discover it.
In unsupervised representation learning, the hypothesis class changes and hence it is possible to learn hypotheses not included in the original hypothesis class. We learn some $f \in \mathcal{F}$ from unlabeled data, and then consider all hypotheses of the form $g \circ f$ for some $g \in \mathcal{G}$, where $\circ$ denotes function composition. Our new hypothesis class is denoted by $\mathcal{G} \circ f$. In this case, we hope that $h^* \in \mathcal{G} \circ f$. This is particularly useful when $h^* \notin \mathcal{H}$.
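The contrast between pruning a hypothesis class and changing it can be sketched on a one-dimensional toy problem. All specifics below are assumptions for illustration only: the sample `S_X`, the target `h_star`, the finite grid of threshold classifiers, the margin-based pruning rule (one possible notion of compatibility with unlabeled data, not necessarily the one analyzed by Balcan and Blum), and the hand-picked representation `f(x) = |x|`. Membership of the target is checked only through agreement on the sample.

```python
# Hypothetical unlabeled sample S_X: three clusters on the real line.
S_X = [-2.2, -2.0, -1.8, -0.2, 0.0, 0.2, 1.8, 2.0, 2.2]


def h_star(x):
    """Assumed target function: +1 far from the origin, -1 near it."""
    return 1 if abs(x) > 1 else -1


def threshold(t):
    """Threshold classifier v -> +1 if v > t else -1."""
    return lambda v: 1 if v > t else -1


THRESHOLDS = [i / 2 for i in range(-6, 7)]  # -3.0, -2.5, ..., 3.0

# H: thresholds applied directly to the raw input x.
H = {t: threshold(t) for t in THRESHOLDS}


def margin(t, sample):
    """Distance from the decision boundary t to the nearest sample point."""
    return min(abs(x - t) for x in sample)


# Semi-supervised pruning (illustrative rule): keep only thresholds whose
# boundary stays at least 0.5 away from every unlabeled point.
H_s = {t: h for t, h in H.items() if margin(t, S_X) >= 0.5}


def f(x):
    """Hand-picked representation f: X -> Z (not learned here)."""
    return abs(x)


# G o f: the same threshold classifiers, applied to f(x) instead of x.
G_of_f = {t: (lambda x, g=threshold(t): g(f(x))) for t in THRESHOLDS}


def contains_target(hypotheses, sample):
    """True if some hypothesis agrees with h_star on every sample point."""
    return any(all(h(x) == h_star(x) for x in sample)
               for h in hypotheses.values())


target_in_H = contains_target(H, S_X)
target_in_H_s = contains_target(H_s, S_X)
target_in_G_of_f = contains_target(G_of_f, S_X)
print(sorted(H_s), target_in_H, target_in_H_s, target_in_G_of_f)
```

In this sketch pruning shrinks the class from 13 thresholds to 4, but since no threshold on the raw input matches the target, no subset of them can either. Composing the same thresholds with `f` yields a class that does contain a match.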
2.3.3 Objectives of Unsupervised Representation Learning
We introduce a set of objectives of unsupervised representation learning. While these objectives are intuitive, our formalization is novel. They help us to precisely answer the question: when is unsupervised representation learning guaranteed to be useful? Once we have described these objectives, we describe conditions under which they can be achieved.
We would like to show that using a representation function learned from an unlabeled sample guarantees that the risk of some hypothesis in our hypothesis class using this representation is not too large, as shown in Objective 2.1.
Objective 2.1 (Risk upper bound for unsupervised representation learning + supervised learning). Fix $\mu_{XY}$, $\mathcal{F}$, $\mathcal{G}$ and $l$. Draw an unlabeled sample $S_X$ from $\mu_X$. Find some $f \in \mathcal{F}$ and $e^f_{\max}$ depending upon $S_X$, which with probability at least $1-\delta$ over samples $S_X$ satisfy
$$\min_{g \in \mathcal{G}} R(g \circ f) \le e^f_{\max}. \tag{2.1}$$
We would also like to guarantee that we cannot achieve a small risk using the original hypothesis class, as shown in Objective 2.2. We may not want to bother with unsupervised representation learning if the task is solvable using $\mathcal{H}$.¹
Objective 2.2 (Risk lower bound for supervised learning). Fix $\mu_{XY}$, $\mathcal{H}$ and $l$. Draw a labeled sample $S_{XY}$ from $\mu_{XY}$. Find some $e_{\min}$ depending upon $S_{XY}$, which with probability at least $1-\delta$ over samples $S_{XY}$ satisfies
$$\min_{h \in \mathcal{H}} R(h) \ge e_{\min}. \tag{2.2}$$
Finally, we would like to show that we can achieve smaller risk using unsupervised representation learning compared to the original hypothesis class. In formalizing this objective, it is useful to first define a risk gap.
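Definition 2.3 (Risk gap). Fix $\mu_{XY}$, $\mathcal{F}$, $\mathcal{G}$, $\mathcal{H}$ and $l$. Let the risk gap of some representation function $f$ be
$$\Delta R(f) := \min_{h \in \mathcal{H}} R(h) - \min_{g \in \mathcal{G}} R(g \circ f).$$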
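To make the definition concrete, the sketch below evaluates the risk gap for three candidate representation functions on a toy problem. The setup is assumed for illustration and is not taken from the text: a uniform XOR-style distribution, zero-one loss, and small finite grids of linear separators standing in for $\mathcal{H}$ and $\mathcal{G}$ so that both minima can be found by enumeration. The representation functions are fixed by hand rather than learned.

```python
from itertools import product

# Hypothetical toy distribution mu_XY: uniform over four XOR-style points,
# stored as (x, y, probability) with y = +1 when the signs of x differ.
MU_XY = [((x1, x2), -x1 * x2, 0.25) for x1 in (-1, 1) for x2 in (-1, 1)]


def risk(hypothesis, distribution):
    """Exact zero-one risk under a finite distribution."""
    return sum(p * (0.0 if hypothesis(x) == y else 1.0)
               for x, y, p in distribution)


def linear_class(dim):
    """Finite grid of linear separators on dim-dimensional inputs."""
    def make(weights, bias):
        def classify(v):
            score = sum(w * vi for w, vi in zip(weights, v)) + bias
            return 1 if score >= 0 else -1
        return classify
    return [make(w, b)
            for w in product((-1, 0, 1), repeat=dim)
            for b in (-1, 0, 1)]


def min_risk(hypotheses, distribution):
    """Smallest risk achieved by any hypothesis in a finite class."""
    return min(risk(h, distribution) for h in hypotheses)


def risk_gap(f, z_dim, distribution):
    """Risk gap: min over H of R(h) minus min over G of R(g o f)."""
    H = linear_class(2)
    G_of_f = [lambda x, g=g: g(f(x)) for g in linear_class(z_dim)]
    return min_risk(H, distribution) - min_risk(G_of_f, distribution)


# Three hand-picked representation functions and the dimension of Z for each.
candidates = {
    "product": (lambda x: (x[0], x[1], x[0] * x[1]), 3),
    "identity": (lambda x: x, 2),
    "constant": (lambda x: (0.0,), 1),
}

gaps = {name: risk_gap(f, dim, MU_XY) for name, (f, dim) in candidates.items()}
print(gaps)
```

Under these assumptions the gap is positive for the product feature, zero for the identity map (where $\mathcal{G} \circ f$ coincides with $\mathcal{H}$), and negative for the constant map, which discards the information needed to predict the target.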
We now formalize what it means for unsupervised representation learning to be useful in Objective 2.4.
Objective 2.4 (Positive risk gap). Fix $\mu_{XY}$, $\mathcal{F}$, $\mathcal{G}$, $\mathcal{H}$ and $l$. Draw an unlabeled sample $S_X$ from $\mu_X$ and draw a labeled sample $S_{XY}$ from $\mu_{XY}$. Find some $f \in \mathcal{F}$ and $e^f_{\max}$ depending upon $S_X$ and some $e_{\min}$ depending upon $S_{XY}$, which with probability at least $1-2\delta$ over pairs of samples $S_X$ and $S_{XY}$ satisfy
$$\Delta R(f) \ge e_{\min} - e^f_{\max} > 0. \tag{2.3}$$
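One way to read the probability $1-2\delta$ in Objective 2.4 is as a union bound over the failure events of Objectives 2.1 and 2.2. The derivation below is a sketch under the assumption that $e^f_{\max}$ and $e_{\min}$ are the same quantities that satisfy (2.1) and (2.2), each with probability at least $1-\delta$. The text does not spell this step out.

```latex
% Sketch: assumes (2.1) and (2.2) each hold with probability at least 1 - \delta.
\begin{align*}
\Pr\big[\text{(2.1) fails or (2.2) fails}\big]
  &\le \Pr\big[\text{(2.1) fails}\big] + \Pr\big[\text{(2.2) fails}\big]
   \le 2\delta,
\intertext{so with probability at least $1 - 2\delta$ both bounds hold, and then}
\Delta R(f)
  &= \min_{h \in \mathcal{H}} R(h) - \min_{g \in \mathcal{G}} R(g \circ f) \\
  &\ge e_{\min} - e^f_{\max}.
\end{align*}
```

The strict inequality $e_{\min} - e^f_{\max} > 0$ does not follow from the two bounds alone. It is an additional requirement that the lower bound on the original class exceeds the upper bound obtained with the learned representation.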
¹ It is possible that working with $\mathcal{H}$ is disadvantageous for computational and/or sample complexity reasons. However, in this chapter we focus on comparing the minimum risk achievable using $\mathcal{G} \circ f$ versus using $\mathcal{H}$.