Bilinear Prediction Using Low-Rank Models

(1)

Bilinear Prediction Using Low-Rank Models

Inderjit S. Dhillon Dept of Computer Science

UT Austin

26th International Conference on Algorithmic Learning Theory Banff, Canada

(2)

Outline

Multi-Target Prediction

Features on Targets: Bilinear Prediction Inductive Matrix Completion

1 _Algorithms

2 _{Positive-Unlabeled Matrix Completion} 3 _{Recovery Guarantees}

(3)

Sample Prediction Problems

Predicting stock prices

(4)

Regression

Real-valued responses (target)t

(5)

Linear Regression

(6)

Linear Regression: Least Squares

Choosex that minimizes

Jx= 1 2 n X i=1 (aT_i x−ti)2 Closed-form solution: x∗= (ATA)−1ATt.

(7)

Prediction Problems: Classification

Spam detection

(8)

Binary Classification

Categorical responses (target)t

Predict response for given input data (features)a

(9)

Linear Methods for Prediction Problems

Regression:

Ridge Regression: Jx= 1₂Pn_i₌₁(aT_i x−ti)2+λkxk22.

Lasso: Jx= 1₂Pni=1(aiTx−ti)2+λkxk1.

Classification:

Linear Support Vector Machines Jx= 1 2 n X i=1 max(0,1−tiaTi x) +λkxk22. Logistic Regression n

(10)

(11)

(12)

(13)

Modern Prediction Problems in Machine Learning

(14)

Modern Prediction Problems in Machine Learning

Ad-word Recommendation geico auto insurance geico car insurance car insurance geico insurance

need cheap auto insurance geico com

(15)

Modern Prediction Problems in Machine Learning

Wikipedia Tag Recommendation

Learning in computer vision

Machine learning Learning

(16)

Modern Prediction Problems in Machine Learning

(17)

Prediction with Multiple Targets

In many domains, goal is tosimultaneously predict multiple target

variables

Multi-target regression: targets are real-valued

(18)

Prediction with Multiple Targets

Applications

Bid word recommendation Tag recommendation

Disease-gene linkage prediction Medical diagnoses

Ecological modeling . . .

(19)

Prediction with Multiple Targets

Input dataai is associated with m targets,ti = (t

(1) i ,t (2) i , . . . ,t (m) i )

(20)

Multi-target Linear Prediction

Basic model: Treat targets independently

(21)

Multi-target Linear Prediction

Assume targetst(j) are independent Linear predictive model: ti ≈aT_i X

(22)

Multi-target Linear Prediction

Multi-target regression problem has a closed-form solution: VAΣ−A1U > AT = arg min X kT −AXk 2 F

(23)

Multi-target Linear Prediction

Multi-target regression problem has a closed-form solution: VAΣ−A1U > AT = arg min X kT −AXk 2 F

whereA=UAΣAV_AT is the thin SVD of A

In multi-label classification: Binary Relevance(independent binary classifier

(24)

Multi-target Linear Prediction: Low-rank Model

Exploit correlations between targetsT, where T ≈AX

Reduced-Rank Regression[A.J. Izenman, 1974] — model the

coefficient matrixX as low-rank

(25)

Multi-target Linear Prediction: Low-rank Model

X is rank-k

(26)

Multi-target Linear Prediction: Low-rank Model

X is rank-k

Linear predictive model: ti ≈aT_i X

Low-rank multi-target regression problem has a closed-form solution:

X∗= min X:rank(X)≤k kT −AXk 2 F = (

VAΣ−_A1UA>Tk ifA is full row rank,

VAΣ−_A1Mk otherwise,

whereA=UAΣAVAT is the thin SVD ofA,M =UA>T, andTk,Mk are

(27)

(28)

Multi-target Prediction with Missing Values

In many applications, several observations (targets) may bemissing

(29)

Modern Prediction Problems in Machine Learning

Ad-word Recommendation geico auto insurance geico car insurance car insurance geico insurance

need cheap auto insurance geico com

(30)

(31)

Multi-target Prediction with Missing Values

(32)

(33)

(34)

Bilinear Prediction

Augment multi-target prediction withfeatureson targets as well

Motivated by modern applications of machine learning — bioinformatics, auto-tagging articles

Need to modeldyadicor pairwise interactions

Move from linear models tobilinear models — linear in input features

(35)

(36)

(37)

Bilinear Prediction

(38)

Bilinear Prediction

Bilinear predictive model: Tij ≈aTi Xbj

Corresponding regression problem has a closed-form solution: VAΣ−A1U > ATUBΣ−B1V T B = arg min X kT −AXB >_k2 F

(39)

Bilinear Prediction: Low-rank Model

X is rank-k

(40)

Bilinear Prediction: Low-rank Model

X is rank-k

Bilinear predictive model: Tij ≈aT_i Xbj

Corresponding regression problem has a closed-form solution:

X∗= min X:rank(X)≤k kT −AXB >_k2 F = (

VAΣ−_A1UA>TkUBΣ−_B1VBT ifA,B are full row rank,

VAΣ−_A1MkΣ−_B1V_BT otherwise,

whereA=UAΣAVA>,B =UBΣBVB> are the thin SVDs of AandB,

M =U_A>TUB, andTk,Mk are the best rank-k approximations ofT

(41)

Modern Challenges in Multi-Target Prediction

Millions of targets

Correlations among targets Missing values

(42)

Modern Challenges in Multi-Target Prediction

Millions of targets (Scalable)

Correlations among targets (Low-rank)

(43)

(44)

Matrix Completion

Missing Value Estimation Problem

Matrix Completion: Recover a low-rank matrix from observed entries

Matrix Completion: exact recovery requiresO(knlog2(n)) samples,

under the assumptions of:

1 _{Uniform sampling} 2 _Incoherence

(45)

Inductive Matrix Completion

Inductive Matrix Completion: Bilinear low-rank prediction with missing values

Degrees of freedom inX are O(kd)

(46)

Algorithm 1: Convex Relaxation

1 Nuclear-norm Minimization:

min kXk∗

s.t. aT_i Xbj =Tij, (i,j)∈Ω

Computationally expensive

Sample complexity for exact recovery: O(kdlogdlogn)

Conditions for exact recovery: C1. Incoherence on A,B.

C2. Incoherence on AU∗ andBV∗,whereX∗=U∗Σ∗V∗T is the SVD of

the ground truth X∗

(47)

Algorithm 1: Convex Relaxation

Theorem (Recovery Guarantees for Nuclear-norm Minimization)

Let X∗ =U∗Σ∗V∗T ∈Rd×d be the SVD of X∗ with rank k. Assume A,B are

orthonormal matrices w.l.o.g., satisfying the incoherence conditions. Then if

Ωis uniformly observed with

|Ω| ≥O(kdlogdlogn),

the solution of nuclear-norm minimization problem is unique and equal to X∗ with high probability.

The incoherence conditions are

C1. max i∈[n]kaik 2 2≤ µd n , maxj∈[n]kbjk 2 2≤ µd n

(48)

Algorithm 2: Alternating Least Squares

Alternating Least Squares (ALS): min Y∈Rd1×k_Z_∈ Rd2×k X (i,j)∈Ω (aT_i YZTbj −Tij)2 Non-convex optimization

(49)

Algorithm 2: Alternating Least Squares

Computational complexity of ALS.

Ath-th iteration, fixingYh, solve the least squares problem forZh+1:

X (i,j)∈Ω (˜aT_i Z_hT₊₁bj)bja˜Ti = X (i,j)∈Ω Tijbj˜aTi

where ˜ai=YhTai. Similarly solve forYh when fixingZh.

1 _{Closed form:} _O₍_|_Ω_|_k2_d_×₍_nnz₍_A_{) +}_nnz₍_B₎₎_/_n₊_k3_d3_).

2 Vanilla conjugate gradient: O(|Ω|k×(nnz(A) +nnz(B))/n) per iteration.

3 _{Exploit the structure for conjugate gradient:}

X (i,j)∈Ω (˜aTi Z T bj)bja˜Ti =B T DA˜

whereD is a sparse matrix withDji= ˜aTi ZTbj, (i,j)∈Ω, and ˜A=AYh.

(50)

Algorithm 2: Alternating Least Squares

Theorem (Convergence Guarantees for ALS )

Let X∗ be a rank-k matrix with condition number β, and T =AX∗BT.

Assume A,B are orthogonal w.l.o.g. and satisfy the incoherence conditions.

Then ifΩis uniformly sampled with

|Ω| ≥O(k4β2dlogd),

then after H iterations of ALS,kY_HZT

H+1−X∗k2 ≤, where

H=O(log(kX∗kF/)).

The incoherence conditions are:

C1. max i∈[n]k aik22≤ µd n , maxj∈[n]k bjk22≤ µd n C2’. max i∈[n]kY T haik22≤ µ0k n , maxj∈[n]kZ T hbjk22≤ µ0k n ,

(51)

Algorithm 2: Alternating Least Squares

Proof sketch for ALS

Consider the case when the rankk = 1:

min

y∈Rd1,z∈Rd2 X

(i,j)∈Ω

(52)

Algorithm 2: Alternating Least Squares

Proof sketch for rank-1 ALS min

y∈Rd1,z∈Rd2 X

(i,j)∈Ω

(aT_i y zTbj−Tij)2

(a) LetX∗=σ∗y∗zT∗ be the thin SVD ofX∗ and assume AandB are

orthogonal w.l.o.g.

(b) In the absence of missing values, ALS = Power method.

∂kAy_hzTBT−Tk2 F ∂z = 2B T₍_BzyT hA T₋_TT₎_Ay h= 2(zkyhk 2₋_BT_TT_Ay h) zh+1←(ATTB)Tyh; normalizezh+1 y_h₊₁←(ATTB)zh+1 ; normalizeyh+1 Note thatAT_TB₌_AT_AX

∗BTB=X∗and the power method converges

(53)

Algorithm 2: Alternating Least Squares

Proof sketch for rank-1 ALS min y∈_Rd₁_,_z_∈ Rd2 X (i,j)∈Ω (aT_i y zTbj−Tij)2

(c) With missing values, ALS is a variant of power method with noise in each iteration zh+1 ←QR( X∗Tyh | {z } power method −σ∗N−1((yT∗yh)N−N˜)z∗ | {z } noise termg ) where N=P (i,j)∈ΩbjaTi yhyThaibTj , ˜N= P (i,j)∈ΩbjaiTyhyT∗aibTj . (d) GivenC1andC2’, the noise termg=σ∗N−1((yT∗yh)N−N˜)z∗

(54)

Algorithm 2: Alternating Least Squares

(e) GivenC1andC2’, the first iteratey₀ is well initialized, i.e. yT

0y∗≥0.9,

which guarantees the initial noise is small enough

(f) The iterates can then be shown to linearly converge to the optimal:

1−(zT_h₊₁z∗)2≤ 1 2(1−(y T hz∗)2) 1−(yT_h₊₁y_∗)2≤ 1 2(1−(z T h+1y∗)2)

(55)

Algorithm 2: Alternating Least Squares

(e) GivenC1andC2’, the first iteratey₀ is well initialized, i.e. yT

0y∗≥0.9,

which guarantees the initial noise is small enough

(f) The iterates can then be shown to linearly converge to the optimal:

1−(zT_h₊₁z∗)2≤ 1 2(1−(y T hz∗)2) 1−(yT_h₊₁y_∗)2≤ 1 2(1−(z T h+1y∗)2)

(56)

Inductive Matrix Completion: Sample Complexity

Sample complexity of Inductive Matrix Completion (IMC) and Matrix Completion (MC).

methods IMC MC

Nuclear-norm O(kdlognlogd) knlog2n (Recht, 2011)

ALS O(k4β2dlogd) k3β2nlogn (Hardt, 2014)

whereβ is the condition number of X

In most cases,nd

Incoherence conditions onA,B are required

Satisfied e.g. whenA,B are Gaussian (no assumption onX needed)

B. Recht.A simpler approach to matrix completion.The Journal of Machine Learning Research 12 : 3413-3430 (2011). M. Hardt.Understanding alternating minimization for matrix completion.Foundations of Computer Science (FOCS), IEEE 55th Annual Symposium, pp. 651-660 (2014).

(57)

Inductive Matrix Completion: Sample Complexity Results

All matrices are sampled from Gaussian random distribution.

Left two figures: fixk = 5,n = 1000 and changed.

Right two figures: fixk = 5,d = 50 and changen.

Darkness of the shading is proportional to the number of failures (repeated 10 times). 50 100 150 0 500 1000 1500 2000 2500 3000 d Number of Measurements 0 0.2 0.4 0.6 0.8 1 50 100 150 0 500 1000 1500 2000 2500 3000 d Number of Measurements 0 0.2 0.4 0.6 0.8 1 0 200 400 600 800 1000 200 300 400 500 600 700 800 900 1000 n Number of Measurements 0 0.2 0.4 0.6 0.8 1 0 200 400 600 800 1000 200 300 400 500 600 700 800 900 1000 n Number of Measurements 0 0.2 0.4 0.6 0.8 1

(58)

(59)

Modern Prediction Problems in Machine Learning

(60)

Bilinear Prediction: PU Learning

(61)

PU Learning

(62)

PU Inductive Matrix Completion

Guarantees so far assume observations are sampled uniformly What can we say about the case when observations are all 1’s (“positives”)?

(63)

PU Inductive Matrix Completion

Inductive Matrix Completion: min

X:kXk∗≤t

X

(i,j)∈Ω

(aT_i Xbj −Tij)2

Commonly used PU strategy: Biased Matrix Completion min X:kXk∗≤tα X (i,j)∈Ω (aT_i Xbj −Tij)2+ (1−α) X (i,j)6∈Ω (aT_i Xbj −0)2 Typically,α >1−α (α≈0.9).

(64)

PU Inductive Matrix Completion

Inductive Matrix Completion: min

X:kXk∗≤t

X

(i,j)∈Ω

(aT_i Xbj −Tij)2

Commonly used PU strategy: Biased Matrix Completion min X:kXk∗≤tα X (i,j)∈Ω (aT_i Xbj −Tij)2+ (1−α) X (i,j)6∈Ω (aT_i Xbj −0)2 Typically,α >1−α (α≈0.9).

We can show guarantees for the biased formulation

V. Sindhwani, S. S. Bucak, J. Hu, A. Mojsilovic.One-class matrix completion with low-density factorizations. ICDM, pp. 1055-1060. 2010.

(65)

PU Learning: Random Noise Model

(66)

PU Inductive Matrix Completion

A deterministic PU learning model

Tij =

(

1 ifMij >0.5,

(67)

PU Inductive Matrix Completion

A deterministic PU learning model

P( ˜Tij = 0|Tij = 1) =ρ andP( ˜Tij = 0|Tij = 0) = 1.

We are givenonlyT˜ but not T or M

(68)

Algorithm 1: Biased Inductive Matrix Completion

b X = min X:kXk∗≤tα X (i,j)∈Ω (aT_i Xbj −1)2+ (1−α) X (i,j)6∈Ω (aT_i Xbj −0)2 Rationale:

(a) Fixα= (1 +ρ)/2 and defineTbij =I

(AX Bb T)ij >0.5

(b) The above problem is equivalent to:

b X = min X:kXk∗≤t X i,j `α((AXBT)ij,T˜ij) where `α(x,T˜ij) =αT˜ij(x−1)2+ (1−α)(1−T˜ij)x2 (c) Minimizing`α loss is equivalent to minimizing the true error, i.e.

1 mn X ij `α((AXBT)ij,T˜ij) =C1 1 mnkTb−Tk 2 F +C2

(69)

Algorithm 1: Biased Inductive Matrix Completion

Theorem (Error Bound for Biased IMC)

Assume ground-truthX satisfieskXk∗ ≤t (whereM =AXBT). Define

b

Tij =I

(AX Bb T)_ij >0.5

,A= maxikaik andB= maxikbik. If α= 1+₂ρ,

then with probability at least 1−δ, 1 n2kT−Tbk 2 F =O ηplog(2/δ) n(1−ρ) + η tAB√log2d (1−ρ)n3/2 whereη= 4(1 + 2ρ).

(70)

(71)

Multi-target Prediction: Image Tag Recommendation

NUS-Wide Image Dataset

161,780 training images 107,879 test images

(72)

Multi-target Prediction: Image Tag Recommendation

H. F. Yu, P. Jain, P. Kar, and I. S. Dhillon.Large-scale Multi-label Learning with Missing Labels. In Proceedings of The 31st International Conference on Machine Learning, pp. 593-601 (2014).

(73)

Multi-target Prediction: Image Tag Recommendation

Low-rank Model withk = 50:

time(s) prec@1 prec@3 AUC

LEML(ALS) 574 20.71 15.96 0.7741

WSABIE 4,705 14.58 11.37 0.7658

LEML(ALS) 1,097 20.76 16.00 0.7718

(74)

Multi-target Prediction: Wikipedia Tag Recommendation

Wikipedia Dataset

881,805 training wiki pages 10,000 test wiki pages 366,932 features 213,707 tags

(75)

Multi-target Prediction: Wikipedia Tag Recommendation

LEML(ALS) 9,932 19.56 14.43 0.9086

WSABIE 79,086 18.91 14.65 0.9020

LEML(ALS) 18,072 22.83 17.30 0.9374

(76)

PU Inductive Matrix Completion: Gene-Disease Prediction

N. Natarajan, and I. S. Dhillon.Inductive matrix completion for predicting gene disease associations. Bioinformatics, 30(12), i60-i68 (2014).

(77)

PU Inductive Matrix Completion: Gene-Disease Prediction

0 20 40 60 80 100

0 0.1 0.2

Number of genes looked at

P(hidden gene among genes looked at)

Catapult Katz Matrix Completion on C IMC LEML Matrix Completion 0 5 10 15 20 25 30 0 0.5 1 1.5 2 2.5 Recall(%) Precision(%)

Predicting gene-disease associations in the OMIM data set (www.omim.org).

(78)

PU Inductive Matrix Completion: Gene-Disease Prediction

0 20 40 60 80 100

0 0.1

Catapult Katz Matrix Completion on C IMC LEML 0 20 40 60 80 100 0 0.1 0.2

Predicting genes for diseases withno training associations.

N. Natarajan, and I. S. Dhillon.Inductive matrix completion for predicting gene disease associations. Bioinformatics, 30(12), i60-i68 (2014).

(79)

Conclusions and Future Work

Inductive Matrix Completion:

Scales to millions of targets

Captures correlations among targets Overcomes missing values

Extension to PU learning

Much work to do:

Other structures: low-rank+sparse, low-rank+column-sparse (outliers)? Different loss functions?

Handling “time” as one of the dimensions — incorporating smoothness through graph regularization?

Incorporating non-linearities? Efficient (parallel) implementations? Improved recovery guarantees?

(80)

References

[1] P. Jain, and I. S. Dhillon. Provable inductive matrix completion. arXiv preprint

arXiv:1306.0626 (2013).

[2] K. Zhong, P. Jain, I. S. Dhillon. Efficient Matrix Sensing Using Rank-1 Gaussian

Measurements. In Proceedings of The 26th Conference on Algorithmic Learning Theory (2015).

[3] N. Natarajan, and I. S. Dhillon. Inductive matrix completion for predicting gene

disease associations. Bioinformatics, 30(12), i60-i68 (2014).

[4] H. F. Yu, P. Jain, P. Kar, and I. S. Dhillon. Large-scale Multi-label Learning with

Missing Labels. In Proceedings of The 31st International Conference on Machine Learning, pp. 593-601 (2014).

[5] C-J. Hsieh, N. Natarajan, and I. S. Dhillon. PU Learning for Matrix Completion. In

Proceedings of The 32nd International Conference on Machine Learning, pp. 2445-2453 (2015).