• No results found

Bilinear Prediction Using Low-Rank Models

N/A
N/A
Protected

Academic year: 2021

Share "Bilinear Prediction Using Low-Rank Models"

Copied!
80
0
0

Loading.... (view fulltext now)

Full text

(1)

Bilinear Prediction Using Low-Rank Models

Inderjit S. Dhillon Dept of Computer Science

UT Austin

26th International Conference on Algorithmic Learning Theory Banff, Canada

(2)

Outline

Multi-Target Prediction

Features on Targets: Bilinear Prediction Inductive Matrix Completion

1 Algorithms

2 Positive-Unlabeled Matrix Completion 3 Recovery Guarantees

(3)

Sample Prediction Problems

Predicting stock prices

(4)

Regression

Real-valued responses (target)t

(5)

Linear Regression

(6)

Linear Regression: Least Squares

Choosex that minimizes

Jx= 1 2 n X i=1 (aTi x−ti)2 Closed-form solution: x∗= (ATA)−1ATt.

(7)

Prediction Problems: Classification

Spam detection

(8)

Binary Classification

Categorical responses (target)t

Predict response for given input data (features)a

(9)

Linear Methods for Prediction Problems

Regression:

Ridge Regression: Jx= 12Pni=1(aTi x−ti)2+λkxk22.

Lasso: Jx= 12Pni=1(aiTx−ti)2+λkxk1.

Classification:

Linear Support Vector Machines Jx= 1 2 n X i=1 max(0,1−tiaTi x) +λkxk22. Logistic Regression n

(10)
(11)
(12)
(13)

Modern Prediction Problems in Machine Learning

(14)

Modern Prediction Problems in Machine Learning

Ad-word Recommendation geico auto insurance geico car insurance car insurance geico insurance

need cheap auto insurance geico com

(15)

Modern Prediction Problems in Machine Learning

Wikipedia Tag Recommendation

Learning in computer vision

Machine learning Learning

(16)

Modern Prediction Problems in Machine Learning

(17)

Prediction with Multiple Targets

In many domains, goal is tosimultaneously predict multiple target

variables

Multi-target regression: targets are real-valued

(18)

Prediction with Multiple Targets

Applications

Bid word recommendation Tag recommendation

Disease-gene linkage prediction Medical diagnoses

Ecological modeling . . .

(19)

Prediction with Multiple Targets

Input dataai is associated with m targets,ti = (t

(1) i ,t (2) i , . . . ,t (m) i )

(20)

Multi-target Linear Prediction

Basic model: Treat targets independently

(21)

Multi-target Linear Prediction

Assume targetst(j) are independent Linear predictive model: ti ≈aTi X

(22)

Multi-target Linear Prediction

Assume targetst(j) are independent Linear predictive model: ti ≈aTi X

Multi-target regression problem has a closed-form solution: VAΣ−A1U > AT = arg min X kT −AXk 2 F

(23)

Multi-target Linear Prediction

Assume targetst(j) are independent Linear predictive model: ti ≈aTi X

Multi-target regression problem has a closed-form solution: VAΣ−A1U > AT = arg min X kT −AXk 2 F

whereA=UAΣAVAT is the thin SVD of A

In multi-label classification: Binary Relevance(independent binary classifier

(24)

Multi-target Linear Prediction: Low-rank Model

Exploit correlations between targetsT, where T ≈AX

Reduced-Rank Regression[A.J. Izenman, 1974] — model the

coefficient matrixX as low-rank

(25)

Multi-target Linear Prediction: Low-rank Model

X is rank-k

(26)

Multi-target Linear Prediction: Low-rank Model

X is rank-k

Linear predictive model: ti ≈aTi X

Low-rank multi-target regression problem has a closed-form solution:

X∗= min X:rank(X)≤k kT −AXk 2 F = (

VAΣ−A1UA>Tk ifA is full row rank,

VAΣ−A1Mk otherwise,

whereA=UAΣAVAT is the thin SVD ofA,M =UA>T, andTk,Mk are

(27)
(28)

Multi-target Prediction with Missing Values

In many applications, several observations (targets) may bemissing

(29)

Modern Prediction Problems in Machine Learning

Ad-word Recommendation geico auto insurance geico car insurance car insurance geico insurance

need cheap auto insurance geico com

(30)
(31)

Multi-target Prediction with Missing Values

(32)
(33)
(34)

Bilinear Prediction

Augment multi-target prediction withfeatureson targets as well

Motivated by modern applications of machine learning — bioinformatics, auto-tagging articles

Need to modeldyadicor pairwise interactions

Move from linear models tobilinear models — linear in input features

(35)
(36)
(37)

Bilinear Prediction

(38)

Bilinear Prediction

Bilinear predictive model: Tij ≈aTi Xbj

Corresponding regression problem has a closed-form solution: VAΣ−A1U > ATUBΣ−B1V T B = arg min X kT −AXB >k2 F

(39)

Bilinear Prediction: Low-rank Model

X is rank-k

(40)

Bilinear Prediction: Low-rank Model

X is rank-k

Bilinear predictive model: Tij ≈aTi Xbj

Corresponding regression problem has a closed-form solution:

X∗= min X:rank(X)≤k kT −AXB >k2 F = (

VAΣ−A1UA>TkUBΣ−B1VBT ifA,B are full row rank,

VAΣ−A1MkΣ−B1VBT otherwise,

whereA=UAΣAVA>,B =UBΣBVB> are the thin SVDs of AandB,

M =UA>TUB, andTk,Mk are the best rank-k approximations ofT

(41)

Modern Challenges in Multi-Target Prediction

Millions of targets

Correlations among targets Missing values

(42)

Modern Challenges in Multi-Target Prediction

Millions of targets (Scalable)

Correlations among targets (Low-rank)

(43)
(44)

Matrix Completion

Missing Value Estimation Problem

Matrix Completion: Recover a low-rank matrix from observed entries

Matrix Completion: exact recovery requiresO(knlog2(n)) samples,

under the assumptions of:

1 Uniform sampling 2 Incoherence

(45)

Inductive Matrix Completion

Inductive Matrix Completion: Bilinear low-rank prediction with missing values

Degrees of freedom inX are O(kd)

(46)

Algorithm 1: Convex Relaxation

1 Nuclear-norm Minimization:

min kXk∗

s.t. aTi Xbj =Tij, (i,j)∈Ω

Computationally expensive

Sample complexity for exact recovery: O(kdlogdlogn)

Conditions for exact recovery: C1. Incoherence on A,B.

C2. Incoherence on AU∗ andBV∗,whereX∗=U∗Σ∗V∗T is the SVD of

the ground truth X∗

(47)

Algorithm 1: Convex Relaxation

Theorem (Recovery Guarantees for Nuclear-norm Minimization)

Let X∗ =U∗Σ∗V∗T ∈Rd×d be the SVD of X∗ with rank k. Assume A,B are

orthonormal matrices w.l.o.g., satisfying the incoherence conditions. Then if

Ωis uniformly observed with

|Ω| ≥O(kdlogdlogn),

the solution of nuclear-norm minimization problem is unique and equal to X∗ with high probability.

The incoherence conditions are

C1. max i∈[n]kaik 2 2≤ µd n , maxj∈[n]kbjk 2 2≤ µd n

(48)

Algorithm 2: Alternating Least Squares

Alternating Least Squares (ALS): min Y∈Rd1×kZ Rd2×k X (i,j)∈Ω (aTi YZTbj −Tij)2 Non-convex optimization

(49)

Algorithm 2: Alternating Least Squares

Computational complexity of ALS.

Ath-th iteration, fixingYh, solve the least squares problem forZh+1:

X (i,j)∈Ω (˜aTi ZhT+1bj)bja˜Ti = X (i,j)∈Ω Tijbj˜aTi

where ˜ai=YhTai. Similarly solve forYh when fixingZh.

1 Closed form: O(||k2d×(nnz(A) +nnz(B))/n+k3d3).

2 Vanilla conjugate gradient: O(|Ω|k×(nnz(A) +nnz(B))/n) per iteration.

3 Exploit the structure for conjugate gradient:

X (i,j)∈Ω (˜aTi Z T bj)bja˜Ti =B T DA˜

whereD is a sparse matrix withDji= ˜aTi ZTbj, (i,j)∈Ω, and ˜A=AYh.

(50)

Algorithm 2: Alternating Least Squares

Theorem (Convergence Guarantees for ALS )

Let X∗ be a rank-k matrix with condition number β, and T =AX∗BT.

Assume A,B are orthogonal w.l.o.g. and satisfy the incoherence conditions.

Then ifΩis uniformly sampled with

|Ω| ≥O(k4β2dlogd),

then after H iterations of ALS,kYHZT

H+1−X∗k2 ≤, where

H=O(log(kX∗kF/)).

The incoherence conditions are:

C1. max i∈[n]k aik22≤ µd n , maxj∈[n]k bjk22≤ µd n C2’. max i∈[n]kY T haik22≤ µ0k n , maxj∈[n]kZ T hbjk22≤ µ0k n ,

(51)

Algorithm 2: Alternating Least Squares

Proof sketch for ALS

Consider the case when the rankk = 1:

min

y∈Rd1,z∈Rd2 X

(i,j)∈Ω

(52)

Algorithm 2: Alternating Least Squares

Proof sketch for rank-1 ALS min

y∈Rd1,z∈Rd2 X

(i,j)∈Ω

(aTi y zTbj−Tij)2

(a) LetX∗=σ∗y∗zT∗ be the thin SVD ofX∗ and assume AandB are

orthogonal w.l.o.g.

(b) In the absence of missing values, ALS = Power method.

∂kAyhzTBT−Tk2 F ∂z = 2B T(BzyT hA TTT)Ay h= 2(zkyhk 2BTTTAy h) zh+1←(ATTB)Tyh; normalizezh+1 yh+1←(ATTB)zh+1 ; normalizeyh+1 Note thatATTB=ATAX

∗BTB=X∗and the power method converges

(53)

Algorithm 2: Alternating Least Squares

Proof sketch for rank-1 ALS min y∈Rd1,z Rd2 X (i,j)∈Ω (aTi y zTbj−Tij)2

(c) With missing values, ALS is a variant of power method with noise in each iteration zh+1 ←QR( X∗Tyh | {z } power method −σ∗N−1((yT∗yh)N−N˜)z∗ | {z } noise termg ) where N=P (i,j)∈ΩbjaTi yhyThaibTj , ˜N= P (i,j)∈ΩbjaiTyhyT∗aibTj . (d) GivenC1andC2’, the noise termg=σ∗N−1((yT∗yh)N−N˜)z∗

(54)

Algorithm 2: Alternating Least Squares

Proof sketch for rank-1 ALS min y∈Rd1,z Rd2 X (i,j)∈Ω (aTi y zTbj−Tij)2

(e) GivenC1andC2’, the first iteratey0 is well initialized, i.e. yT

0y∗≥0.9,

which guarantees the initial noise is small enough

(f) The iterates can then be shown to linearly converge to the optimal:

1−(zTh+1z∗)2≤ 1 2(1−(y T hz∗)2) 1−(yTh+1y)2≤ 1 2(1−(z T h+1y∗)2)

(55)

Algorithm 2: Alternating Least Squares

Proof sketch for rank-1 ALS min y∈Rd1,z Rd2 X (i,j)∈Ω (aTi y zTbj−Tij)2

(e) GivenC1andC2’, the first iteratey0 is well initialized, i.e. yT

0y∗≥0.9,

which guarantees the initial noise is small enough

(f) The iterates can then be shown to linearly converge to the optimal:

1−(zTh+1z∗)2≤ 1 2(1−(y T hz∗)2) 1−(yTh+1y)2≤ 1 2(1−(z T h+1y∗)2)

(56)

Inductive Matrix Completion: Sample Complexity

Sample complexity of Inductive Matrix Completion (IMC) and Matrix Completion (MC).

methods IMC MC

Nuclear-norm O(kdlognlogd) knlog2n (Recht, 2011)

ALS O(k4β2dlogd) k3β2nlogn (Hardt, 2014)

whereβ is the condition number of X

In most cases,nd

Incoherence conditions onA,B are required

Satisfied e.g. whenA,B are Gaussian (no assumption onX needed)

B. Recht.A simpler approach to matrix completion.The Journal of Machine Learning Research 12 : 3413-3430 (2011). M. Hardt.Understanding alternating minimization for matrix completion.Foundations of Computer Science (FOCS), IEEE 55th Annual Symposium, pp. 651-660 (2014).

(57)

Inductive Matrix Completion: Sample Complexity Results

All matrices are sampled from Gaussian random distribution.

Left two figures: fixk = 5,n = 1000 and changed.

Right two figures: fixk = 5,d = 50 and changen.

Darkness of the shading is proportional to the number of failures (repeated 10 times). 50 100 150 0 500 1000 1500 2000 2500 3000 d Number of Measurements 0 0.2 0.4 0.6 0.8 1 50 100 150 0 500 1000 1500 2000 2500 3000 d Number of Measurements 0 0.2 0.4 0.6 0.8 1 0 200 400 600 800 1000 200 300 400 500 600 700 800 900 1000 n Number of Measurements 0 0.2 0.4 0.6 0.8 1 0 200 400 600 800 1000 200 300 400 500 600 700 800 900 1000 n Number of Measurements 0 0.2 0.4 0.6 0.8 1

(58)
(59)

Modern Prediction Problems in Machine Learning

(60)

Bilinear Prediction: PU Learning

(61)

PU Learning

(62)

PU Inductive Matrix Completion

Guarantees so far assume observations are sampled uniformly What can we say about the case when observations are all 1’s (“positives”)?

(63)

PU Inductive Matrix Completion

Inductive Matrix Completion: min

X:kXk∗≤t

X

(i,j)∈Ω

(aTi Xbj −Tij)2

Commonly used PU strategy: Biased Matrix Completion min X:kXk∗≤tα X (i,j)∈Ω (aTi Xbj −Tij)2+ (1−α) X (i,j)6∈Ω (aTi Xbj −0)2 Typically,α >1−α (α≈0.9).

(64)

PU Inductive Matrix Completion

Inductive Matrix Completion: min

X:kXk∗≤t

X

(i,j)∈Ω

(aTi Xbj −Tij)2

Commonly used PU strategy: Biased Matrix Completion min X:kXk∗≤tα X (i,j)∈Ω (aTi Xbj −Tij)2+ (1−α) X (i,j)6∈Ω (aTi Xbj −0)2 Typically,α >1−α (α≈0.9).

We can show guarantees for the biased formulation

V. Sindhwani, S. S. Bucak, J. Hu, A. Mojsilovic.One-class matrix completion with low-density factorizations. ICDM, pp. 1055-1060. 2010.

(65)

PU Learning: Random Noise Model

(66)

PU Inductive Matrix Completion

A deterministic PU learning model

Tij =

(

1 ifMij >0.5,

(67)

PU Inductive Matrix Completion

A deterministic PU learning model

P( ˜Tij = 0|Tij = 1) =ρ andP( ˜Tij = 0|Tij = 0) = 1.

We are givenonlyT˜ but not T or M

(68)

Algorithm 1: Biased Inductive Matrix Completion

b X = min X:kXk∗≤tα X (i,j)∈Ω (aTi Xbj −1)2+ (1−α) X (i,j)6∈Ω (aTi Xbj −0)2 Rationale:

(a) Fixα= (1 +ρ)/2 and defineTbij =I

(AX Bb T)ij >0.5

(b) The above problem is equivalent to:

b X = min X:kXk∗≤t X i,j `α((AXBT)ij,T˜ij) where `α(x,T˜ij) =αT˜ij(x−1)2+ (1−α)(1−T˜ij)x2 (c) Minimizing`α loss is equivalent to minimizing the true error, i.e.

1 mn X ij `α((AXBT)ij,T˜ij) =C1 1 mnkTb−Tk 2 F +C2

(69)

Algorithm 1: Biased Inductive Matrix Completion

Theorem (Error Bound for Biased IMC)

Assume ground-truthX satisfieskXk∗ ≤t (whereM =AXBT). Define

b

Tij =I

(AX Bb T)ij >0.5

,A= maxikaik andB= maxikbik. If α= 1+2ρ,

then with probability at least 1−δ, 1 n2kT−Tbk 2 F =O ηplog(2/δ) n(1−ρ) + η tAB√log2d (1−ρ)n3/2 whereη= 4(1 + 2ρ).

(70)
(71)

Multi-target Prediction: Image Tag Recommendation

NUS-Wide Image Dataset

161,780 training images 107,879 test images

(72)

Multi-target Prediction: Image Tag Recommendation

H. F. Yu, P. Jain, P. Kar, and I. S. Dhillon.Large-scale Multi-label Learning with Missing Labels. In Proceedings of The 31st International Conference on Machine Learning, pp. 593-601 (2014).

(73)

Multi-target Prediction: Image Tag Recommendation

Low-rank Model withk = 50:

time(s) prec@1 prec@3 AUC

LEML(ALS) 574 20.71 15.96 0.7741

WSABIE 4,705 14.58 11.37 0.7658

Low-rank Model withk = 100:

time(s) prec@1 prec@3 AUC

LEML(ALS) 1,097 20.76 16.00 0.7718

(74)

Multi-target Prediction: Wikipedia Tag Recommendation

Wikipedia Dataset

881,805 training wiki pages 10,000 test wiki pages 366,932 features 213,707 tags

(75)

Multi-target Prediction: Wikipedia Tag Recommendation

Low-rank Model withk = 250:

time(s) prec@1 prec@3 AUC

LEML(ALS) 9,932 19.56 14.43 0.9086

WSABIE 79,086 18.91 14.65 0.9020

Low-rank Model withk = 500:

time(s) prec@1 prec@3 AUC

LEML(ALS) 18,072 22.83 17.30 0.9374

(76)

PU Inductive Matrix Completion: Gene-Disease Prediction

N. Natarajan, and I. S. Dhillon.Inductive matrix completion for predicting gene disease associations. Bioinformatics, 30(12), i60-i68 (2014).

(77)

PU Inductive Matrix Completion: Gene-Disease Prediction

0 20 40 60 80 100

0 0.1 0.2

Number of genes looked at

P(hidden gene among genes looked at)

Catapult Katz Matrix Completion on C IMC LEML Matrix Completion 0 5 10 15 20 25 30 0 0.5 1 1.5 2 2.5 Recall(%) Precision(%)

Predicting gene-disease associations in the OMIM data set (www.omim.org).

(78)

PU Inductive Matrix Completion: Gene-Disease Prediction

0 20 40 60 80 100

0 0.1

Number of genes looked at

P(hidden gene among genes looked at)

Catapult Katz Matrix Completion on C IMC LEML 0 20 40 60 80 100 0 0.1 0.2

Number of genes looked at

P(hidden gene among genes looked at)

Predicting genes for diseases withno training associations.

N. Natarajan, and I. S. Dhillon.Inductive matrix completion for predicting gene disease associations. Bioinformatics, 30(12), i60-i68 (2014).

(79)

Conclusions and Future Work

Inductive Matrix Completion:

Scales to millions of targets

Captures correlations among targets Overcomes missing values

Extension to PU learning

Much work to do:

Other structures: low-rank+sparse, low-rank+column-sparse (outliers)? Different loss functions?

Handling “time” as one of the dimensions — incorporating smoothness through graph regularization?

Incorporating non-linearities? Efficient (parallel) implementations? Improved recovery guarantees?

(80)

References

[1] P. Jain, and I. S. Dhillon. Provable inductive matrix completion. arXiv preprint

arXiv:1306.0626 (2013).

[2] K. Zhong, P. Jain, I. S. Dhillon. Efficient Matrix Sensing Using Rank-1 Gaussian

Measurements. In Proceedings of The 26th Conference on Algorithmic Learning Theory (2015).

[3] N. Natarajan, and I. S. Dhillon. Inductive matrix completion for predicting gene

disease associations. Bioinformatics, 30(12), i60-i68 (2014).

[4] H. F. Yu, P. Jain, P. Kar, and I. S. Dhillon. Large-scale Multi-label Learning with

Missing Labels. In Proceedings of The 31st International Conference on Machine Learning, pp. 593-601 (2014).

[5] C-J. Hsieh, N. Natarajan, and I. S. Dhillon. PU Learning for Matrix Completion. In

Proceedings of The 32nd International Conference on Machine Learning, pp. 2445-2453 (2015).

References

Related documents

Poulston et al: Using TF-IDF n-gram and word embedding cluster ensembles for author profiling: Notebook for PAN at CLEF 2017.. Working Notes of CLEF 2017 - Conference and Labs of

For addressing the first research topic, we analyze two differentiated time effects of blockbusters: (a) the concurrent effect (i.e., short-term effect), and (b)

The Republic of Macedonia as a small country with limited development opportunities is forced to use all resources that are available. Studies related to the

134. Iowa 2009) (concluding that the Guideline for child pornography offenses warrants little deference “because it is the result of congressional mandates, rather than

Clinical predictors Functional capacity Surgical risk Noninvasive testing Invasive testing Poor ( < 4 METs) Moderate or excellent ( > 4 METs) Minor or no clinical predictors‡

Our work in the area of multimedia information systems focuses on the integrated management and presentation of multimedia data by a DBMS.. The overall approach taken is driven by

This research was intended to investigate whether the cooperative script technique is effective to improve reading comprehension ability to thetenth grade students

A new data set has been developed by combining individual records of the London Travel Demand Survey (LTDS) with closely matched trip trajectories alongside their corresponding