When Unsupervised Representation Learning is Provably Useful

Our goal is to determine under what conditions unsupervised representation learning enhances the performance of a subsequent supervised learner. This objective is pertinent to a range of common machine learning scenarios. Do the features learned by an autoencoder enhance the performance of a linear classifier compared to using the original inputs? Does a particular kernel function outperform a linear kernel when used with an hypothesis class of linear separators (recalling that kernel functions implicitly specify a representation space)? Do vector representations of words outperform one-hot unigram representations for natural language processing tasks?

As a step towards our goal, we first formalize the problem setting. We provide a comparison to a previous formalization of semi-supervised learning. We then state the objectives of unsupervised representation learning, and a high-level approach to determining the conditions under which these objectives are met.

2.3.1 Problem Setting

Let $\mathcal{X}$ and $\mathcal{Y}$ be input and target spaces respectively. Let $X$ and $Y$ be input and target random variables respectively. Let $\mu_{XY}$ be a probability distribution over $\mathcal{X} \times \mathcal{Y}$ and $\mu_X$ be its marginal distribution over $\mathcal{X}$. Let $p(\cdot)$ refer to the probability that a point drawn from $\mu_{XY}$ satisfies some condition, and let $p(x) := p(X = x)$. Let $S_{XY}$ be a sample drawn from $\mu_{XY}$ and let $S_X$ be a sample drawn from $\mu_X$. Let $H$ be an hypothesis class whose elements are of type $h : \mathcal{X} \to \mathcal{Y}$.

Let $\mathcal{Z}$ be the representation space. Let $F$ be a representation function class whose elements are of type $f : \mathcal{X} \to \mathcal{Z}$. In unsupervised representation learning, we use $F$ and $S_X$ and learn some $f \in F$. Let the representation variable $Z := f(X)$. Let $G$ be an hypothesis class whose elements are of type $g : \mathcal{Z} \to \mathcal{Y}$.

Let $l : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}_+$ be a loss function. Let the risk of some $h \in H$ be

\[ R(h) := \mathbb{E}_{X,Y \sim \mu_{XY}}[l(h(X), Y)]. \]

Similarly, let the risk of hypothesis $g \in G$ using the representation function $f \in F$ be

\[ R(g \circ f) := \mathbb{E}_{X,Y \sim \mu_{XY}}[l(g(f(X)), Y)]. \]


[Figure 2.2 panels: "Semi-supervised learning" (showing $H$, $H_s$ and $h^*$) and "Unsupervised representation learning" (showing $H$, $G \circ f$ and $h^*$).]

Figure 2.2: Relationship between semi-supervised learning as formalized in [Balcan and Blum, 2010] (left) and unsupervised representation learning (right). In semi-supervised learning, unlabeled data is used to prune the hypothesis class $H$ to the subset $H_s \subset H$. In unsupervised representation learning, unlabeled data is used to discover $f$ and the hypothesis class changes to $G \circ f$. In both cases we hope that the target function $h^*$ is within our new hypothesis class.

We are interested in computing and comparing these two quantities. Of particular interest are the cases where $G$ and $H$ are the same, or are related to each other through a straightforward change of type signature. For example, $G$ and $H$ are both the classes of linear separators for their respective input types.
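To make the two risks concrete, the following minimal sketch computes R(h) and R(g∘f) exactly for a distribution with finite support, where the expectation reduces to a weighted sum. The toy distribution `mu_xy`, the zero-one loss, and the particular `h`, `f` and `g` are all hypothetical choices made for illustration; in particular, `f` is hand-picked here rather than learned from an unlabeled sample.

```python
from typing import Callable, Dict, Tuple

# Hypothetical toy distribution mu_XY over X = {0, 1, 2, 3} and Y = {0, 1}.
# Each (x, y) pair maps to its probability; here y is the parity of x.
mu_xy: Dict[Tuple[int, int], float] = {
    (0, 0): 0.25,
    (1, 1): 0.25,
    (2, 0): 0.25,
    (3, 1): 0.25,
}


def zero_one_loss(y_pred: int, y_true: int) -> float:
    """l(y', y) = 1 if the prediction is wrong, else 0."""
    return float(y_pred != y_true)


def risk(hyp: Callable, mu: Dict[Tuple[int, int], float],
         loss: Callable = zero_one_loss) -> float:
    """R(hyp) = E_{X,Y ~ mu}[loss(hyp(X), Y)], exact for finite support."""
    return sum(p * loss(hyp(x), y) for (x, y), p in mu.items())


def compose(g: Callable, f: Callable) -> Callable:
    """Return the hypothesis g o f, which maps inputs to targets."""
    return lambda x: g(f(x))


# h acts on raw inputs: predict 1 when x >= 2.
h = lambda x: int(x >= 2)
# f maps x to its parity (assumed representation, not a learned one).
f = lambda x: x % 2
# g acts on representations: predict 1 when z equals 1.
g = lambda z: int(z == 1)

risk_h = risk(h, mu_xy)                # R(h)
risk_gf = risk(compose(g, f), mu_xy)   # R(g o f)
print(risk_h, risk_gf)
```

Under these assumptions h misclassifies the inputs 1 and 2, while g∘f makes no errors, so the two risks differ even though h and g are both simple threshold rules on their respective input types.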

2.3.2 Comparison to Semi-Supervised Learning

Semi-supervised learning involves using unlabeled data to help supplement scarce labeled data in a supervised learning problem. By this broad definition, unsupervised representation learning can be considered a kind of semi-supervised learning. However, previous formal analysis of semi-supervised learning from the perspective of statistical learning theory proposed by Balcan and Blum [2010] has used a somewhat narrower definition of semi-supervised learning. In this case, unsupervised representation learning as we describe it is conceptually different. The differences between the two approaches are shown in Figure 2.2.

In semi-supervised learning, we aim to prune $H$ to some $H_s \subset H$ using only unlabeled data. Supposing that there is some target function $h^* \in H$ we are trying to learn, we also wish to ensure that $h^* \in H_s$. The smaller hypothesis class may make learning a subsequent supervised task more straightforward and enable a tighter generalization error bound [Balcan and Blum, 2010]. However, if the target function lies outside the original hypothesis class, semi-supervised learning will not help to discover it.

In unsupervised representation learning, the hypothesis class changes and hence it is possible to learn hypotheses not included in the original hypothesis class. We learn some $f \in F$ from unlabeled data, and then consider all hypotheses of the form $g \circ f$ for some $g \in G$, where $\circ$ denotes function composition. Our new hypothesis class is denoted by $G \circ f$. In this case, we hope that $h^* \in G \circ f$. This is particularly useful when $h^* \notin H$.
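The contrast between pruning H and changing the hypothesis class can be illustrated on a finite input space, where each hypothesis is identified with the labeling it induces. The sketch below is an assumption-laden toy: the input space {0,1}^2, the small weight grid standing in for "all linear separators", the XOR target, and the representation f(x) = (x1, x2, x1·x2) are all chosen for illustration, and f is hand-picked rather than learned from unlabeled data.

```python
from itertools import product

# Finite input space X = {0, 1}^2, enumerated in a fixed order.
POINTS = [(0, 0), (0, 1), (1, 0), (1, 1)]
# Small grid of weights and biases: a hypothetical stand-in for a class
# of linear separators.
GRID = [-2, -1, 0, 1, 2]


def labelings(feature_map, dim):
    """Labelings of POINTS realised by x -> [w . feature_map(x) + b > 0]."""
    realised = set()
    for params in product(GRID, repeat=dim + 1):
        w, b = params[:-1], params[-1]
        realised.add(tuple(
            int(sum(wi * zi for wi, zi in zip(w, feature_map(x))) + b > 0)
            for x in POINTS
        ))
    return realised


identity = lambda x: x
# Hand-picked representation (assumed, not learned).
f = lambda x: (x[0], x[1], x[0] * x[1])

H = labelings(identity, dim=2)   # linear separators on raw inputs
G_of_f = labelings(f, dim=3)     # linear separators on representations

# Target function h*: XOR of the two input bits, as a labeling of POINTS.
xor = tuple(x[0] ^ x[1] for x in POINTS)

print(xor in H, xor in G_of_f)
```

Because XOR is not linearly separable on the raw inputs, it lies outside H and therefore outside every pruned subset of H, whereas it is realised in the composed class by the weights (1, 1, -2) with zero bias. In this particular toy the composed class also contains H; that containment is a feature of the chosen f and is not guaranteed in general.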

20 Unsupervised Representation Learning to Provably Improve Task Performance

2.3.3 Objectives of Unsupervised Representation Learning

We introduce a set of objectives of unsupervised representation learning. While these objectives are intuitive, our formalization is novel. They help us to precisely answer the question: when is unsupervised representation learning guaranteed to be useful? Once we have described these objectives, we describe conditions under which they can be achieved.

We would like to show that using a representation function learned from an unlabeled sample guarantees that the risk of some hypothesis in our hypothesis class using this representation is not too large, as shown in Objective 2.1.

Objective 2.1 (Risk upper bound for unsupervised representation learning + supervised learning). Fix $\mu_{XY}$, $F$, $G$ and $l$. Draw an unlabeled sample $S_X$ from $\mu_X$. Find some $f \in F$ and $\epsilon^f_{\max}$ depending upon $S_X$, which with probability at least $1 - \delta$ over samples $S_X$ satisfies

\[ \min_{g \in G} R(g \circ f) \le \epsilon^f_{\max}. \tag{2.1} \]

We would also like to guarantee that we cannot achieve a small risk using the original hypothesis class, as shown in Objective 2.2. We may not want to bother with unsupervised representation learning if the task is solvable using $H$.¹

Objective 2.2 (Risk lower bound for supervised learning). Fix $\mu_{XY}$, $H$ and $l$. Draw a labeled sample $S_{XY}$ from $\mu_{XY}$. Find some $\epsilon_{\min}$ depending upon $S_{XY}$, which with probability at least $1 - \delta$ over samples $S_{XY}$ satisfies

\[ \min_{h \in H} R(h) \ge \epsilon_{\min}. \tag{2.2} \]

Finally, we would like to show that we can achieve smaller risk using unsupervised representation learning compared to the original hypothesis class. In formalizing this objective, it is useful to first define a risk gap.

Definition 2.3 (Risk gap). Fix $\mu_{XY}$, $F$, $G$, $H$ and $l$. Let the risk gap of some representation function $f$ be

\[ \Delta R(f) := \min_{h \in H} R(h) - \min_{g \in G} R(g \circ f). \]
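As an illustration of Definition 2.3, the following sketch computes the risk gap exactly for a small finite example with zero-one loss. The distribution, the finite grids of weights and biases defining H and G, and the representation f(x) = x^2 are all assumptions made for this example; f is fixed by hand rather than learned from an unlabeled sample.

```python
from itertools import product

# Hypothetical toy distribution: X uniform on {-2, -1, 1, 2}, Y = 1 iff |X| = 2.
MU_XY = {(-2, 1): 0.25, (-1, 0): 0.25, (1, 0): 0.25, (2, 1): 0.25}


def risk(hyp):
    """Exact zero-one risk R(hyp) under MU_XY."""
    return sum(p * float(hyp(x) != y) for (x, y), p in MU_XY.items())


def threshold_class(weights, biases):
    """Finite class of one-dimensional linear separators t -> [w*t + b > 0]."""
    return [lambda t, w=w, b=b: int(w * t + b > 0)
            for w, b in product(weights, biases)]


# H and G are the same kind of class, differing only in their input type.
WEIGHTS = [-1, 1]
BIASES = [-2.5, -1.5, 0, 1.5, 2.5]
H = threshold_class(WEIGHTS, BIASES)
G = threshold_class(WEIGHTS, BIASES)

f = lambda x: x * x  # hand-picked representation (assumed, not learned)

min_risk_H = min(risk(h) for h in H)
min_risk_Gf = min(risk(lambda x, g=g: g(f(x))) for g in G)
risk_gap = min_risk_H - min_risk_Gf   # Delta R(f)
print(min_risk_H, min_risk_Gf, risk_gap)
```

In this toy, no threshold on the raw input can label both -2 and 2 as positive while labeling -1 and 1 as negative, so the best hypothesis in H errs on one of the four points. After squaring, a single threshold separates the classes, giving a positive risk gap.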

We now formalize what it means for unsupervised representation learning to be useful in Objective 2.4.

Objective 2.4 (Positive risk gap). Fix $\mu_{XY}$, $F$, $G$, $H$ and $l$. Draw an unlabeled sample $S_X$ from $\mu_X$ and draw a labeled sample $S_{XY}$ from $\mu_{XY}$. Find some $f \in F$ and $\epsilon^f_{\max}$ depending upon $S_X$ and some $\epsilon_{\min}$ depending upon $S_{XY}$, which with probability at least $1 - 2\delta$ over pairs of samples $S_X$ and $S_{XY}$ satisfies

\[ \Delta R(f) \ge \epsilon_{\min} - \epsilon^f_{\max} > 0. \tag{2.3} \]
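The probability 1 − 2δ in Objective 2.4 is what one obtains by combining Objectives 2.1 and 2.2 with a union bound. The following sketch spells out that step, assuming the same δ is used in both objectives. The event names A (the bound in (2.1) holds) and B (the bound in (2.2) holds) are introduced here for illustration only.

```latex
\begin{align*}
\Pr[A \cap B]
  &= 1 - \Pr[A^{c} \cup B^{c}]
   \ge 1 - \Pr[A^{c}] - \Pr[B^{c}]
   \ge 1 - 2\delta, \\
\Delta R(f)
  &= \min_{h \in H} R(h) - \min_{g \in G} R(g \circ f)
   \ge \epsilon_{\min} - \epsilon^{f}_{\max}
   \quad \text{on the event } A \cap B.
\end{align*}
```

The union bound does not require the two samples to be independent. The strict inequality $\epsilon_{\min} - \epsilon^{f}_{\max} > 0$ in (2.3) is a separate requirement on the two bounds and does not follow from this argument.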

¹ It is possible that working with $H$ is disadvantageous for computational and/or sample complexity reasons. However, in this chapter we focus on comparing the minimum risk achievable using $G \circ f$ versus using $H$.

In document Learning Provably Useful Representations, with Applications to Fairness (Page 34-37)