Actionable Mining of Multi-relational Data using Localized Predictive Models

(1)

Actionable Mining of Multi-relational Data using Localized Predictive Models

Joydeep Ghosh

Schlumberger Centennial Chaired Professor

The University of Texas at Austin

joint work with

(2)

A Single (Affinity) Relation

•  {Users} x {Movies} → ratings (DYADIC data)
•  Limited side information
–  Missing not at random
–  Time
•  Dealing with predictive heterogeneity in large, multi-relational data
–  Multiple localized models
–  Side benefits

(5)

Simultaneous Co-clustering and Learning (SCOAL)

[Deodhar and Ghosh, TKDD ‘10]

[Figure: matrix of +1/-1 entries with missing values (?); customer attributes and product attributes feed a predictive model]

•  Simultaneous partitioning and prediction
•  Exploits heterogeneity but has a “chicken and egg” problem.

(6)

Regression Example

[Figure: response matrix with attribute values c and p along the margins]

 4     5    10     9     8     8
33    17    23.2  36    39    19
21    11    17.1  26    24    15
 3.1   4.5   8.5   6.6   4.5   6.1
 1     2.2   5     5.2   4     4
 9.5   4    13.5  14    13    10
16     9.1  14.9  19    17.5  12
 4     5     8.7   8     6.8   6.8

c: 1 7 4 2 0 2 3 3
p: 0 4 3 1 3 2

(7)

Regression Example

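[Figure: response matrix with rows and columns reordered, attribute values c and p along the margins]

 8.5   6.1   4.5   6.6   3.1   4.5
 5     4     2.2   5.2   1     4
 8.7   6.8   5     8     4     6.8
10     8     5     9     4     8
13.5  10     4    14     9.5  13
23.2  19    17    36    33    39
17.1  15    11    26    21    24

c: 2 0 3 1 2 7 4
p: 3 2 1 4 0 3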

(8)

Regression Example

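[Figure: reordered response matrix with one co-cluster annotated by the model c + 2p]

 8     6     4     6.6   3.1   4.5
 6     4     2     5.2   1     4
 9     7     5     8     4     6.8
10     8     5     9     4     8
13.5  10     4    14     9.5  13
23.2  19    17    36    33    39
17.1  15    11    26    21    24
14.9  12     9.1  19    16    17.5

c: 2 0 3 1 2 7 4 3
p: 3 2 1 4 0 3
Co-cluster model shown: c + 2p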

(9)

Regression Example

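[Figure: reordered response matrix with co-clusters annotated by the models c + 2p and 1 + c + p]

 8.5   6.1   4.5   6.6   3.1   4.5
 5     4     2.2   5.2   1     4
 8.7   6.8   5     8     4     6.8
10     8     5     9     4     8
13.5  10     4    14     9.5  13
23.2  19    17    36    33    39
17.1  15    11    26    21    24
14.9  12     9.1  19    16    17.5

c: 2 0 3 1 2 7 4 3
p: 3 2 1 4 0 3
Co-cluster models shown: c + 2p, 1 + c + p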

(10)

Reconstruction Errors

[Figure: two matrices of reconstruction errors, one for each reconstruction listed below]

Reconstructed with simultaneous co-clustering and regression

MSE = 7.9

Reconstructed with a single linear model: z = 1.2 + 3.6c + 1.5p
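The contrast between a local and a global model can be checked by hand on one co-cluster of the regression example. The sketch below is only illustrative: the nine responses and the attribute values are read from the flattened slide matrices, and the assumption that c = (2, 0, 3) indexes the first three rows and p = (3, 2, 1) the first three columns of the reordered matrix is mine, not stated on the slides.

```python
import numpy as np

# Assumed alignment: first three rows/columns of the reordered example matrix.
c = np.array([2.0, 0.0, 3.0])            # customer attribute, one value per row
p = np.array([3.0, 2.0, 1.0])            # product attribute, one value per column
block = np.array([[8.5, 6.1, 4.5],
                  [5.0, 4.0, 2.2],
                  [8.7, 6.8, 5.0]])      # observed responses in the co-cluster

# C[i, j] = c[i] and P[i, j] = p[j] for every cell of the block.
C, P = np.meshgrid(c, p, indexing="ij")

local_pred = C + 2.0 * P                 # co-cluster model from the slide: c + 2p
global_pred = 1.2 + 3.6 * C + 1.5 * P    # single global model from the slide

local_mse = float(np.mean((block - local_pred) ** 2))
global_mse = float(np.mean((block - global_pred) ** 2))
print(f"local model MSE on this block:  {local_mse:.3f}")
print(f"global model MSE on this block: {global_mse:.3f}")
```

Under these assumptions the local model fits the block far more closely than the global one, which is the heterogeneity that co-clustering is meant to exploit.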

(11)

Linear Regression (Global)

•  Binary $w_{uv}$ associated with each matrix entry (0 = missing)
•  Model parameters $\beta^T = [\beta_0, \beta_c^T, \beta_p^T, \beta_{c,p}^T]$
•  Covariates $x_{ij} = \{c_i, p_j, a_{ij}\}$: customer features, product features, joint features

[Figure: data matrix with rows i (e.g., customers/queries), columns j (e.g., products/ads), and entries $z_{ij}$ (e.g., ratings, CTR)]

(12)

Linear Regressions (Local)

•  ρ: Mapping from m rows to k row clusters
•  γ: Mapping from n columns to l column clusters
–  Total of k * l regression models.
•  Find co-clustering (ρ, γ) and models (β's) that minimize the total (regularized) squared error

$$\sum_{u,v} w_{uv}\,(z_{uv} - \hat{z}_{uv})^2 + \text{penalty}, \qquad \hat{z}_{uv} = \beta_{\rho(u)\gamma(v)}^T x_{uv}$$

•  Clustering based on the prediction cost, not cluster homogeneity!
•  Classification? Change loss function; predictive model.

(13)

SCOAL Meta-Algorithm

Input: Data: Z, Weights: W, Attributes: C, P
Output: Co-clustering (ρ, γ), models {β's}

Initialize ρ, γ
Iterate until convergence
•  Re-estimate model for each co-cluster
•  Re-estimate the co-clusters
–  Update row clusters: Assign each row to closest row cluster
–  Update col clusters: Assign each col to closest col cluster
Return ρ, γ, {β's}
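The meta-algorithm can be sketched in NumPy as an alternating minimization. This is a minimal illustration under stated assumptions, not the authors' implementation: the covariates are taken to be $x_{uv} = [1, c_u, p_v]$ (no joint features), the local models are ridge regressions with a hypothetical penalty `lam`, "closest" is read as smallest weighted squared error, and a fixed number of iterations stands in for a convergence test. All function names are my own.

```python
import numpy as np

def covariates(C, P):
    """Build x_uv = [1, c_u, p_v] for every cell, as an (m, n, d) array."""
    m, n = C.shape[0], P.shape[0]
    return np.concatenate([np.ones((m, n, 1)),
                           np.repeat(C[:, None, :], n, axis=1),
                           np.repeat(P[None, :, :], m, axis=0)], axis=2)

def predict(X, rho, gamma, betas):
    """Predict each cell with the model of the co-cluster it belongs to."""
    B = betas[rho[:, None], gamma[None, :]]      # (m, n, d) model per cell
    return np.einsum("uvd,uvd->uv", X, B)

def fit_block(X, z, w, lam):
    """Weighted ridge regression on the cells of one co-cluster."""
    Xw = X * w[:, None]
    return np.linalg.solve(Xw.T @ X + lam * np.eye(X.shape[1]), Xw.T @ z)

def scoal(Z, W, C, P, k, l, n_iter=20, lam=1e-3, seed=0):
    """Alternate between fitting k*l local models and reassigning rows/columns."""
    m, n = Z.shape
    rng = np.random.default_rng(seed)
    rho, gamma = rng.integers(k, size=m), rng.integers(l, size=n)
    X = covariates(C, P)
    d = X.shape[2]
    betas = np.zeros((k, l, d))
    for _ in range(n_iter):
        # Re-estimate the model of each non-empty co-cluster.
        for g in range(k):
            for h in range(l):
                rows, cols = np.flatnonzero(rho == g), np.flatnonzero(gamma == h)
                if rows.size and cols.size:
                    idx = np.ix_(rows, cols)
                    betas[g, h] = fit_block(X[idx].reshape(-1, d),
                                            Z[idx].ravel(), W[idx].ravel(), lam)
        # Assign each row to the row cluster with the smallest row error.
        row_err = np.stack([(W * (Z - predict(X, np.full(m, g), gamma, betas)) ** 2).sum(axis=1)
                            for g in range(k)], axis=1)
        rho = row_err.argmin(axis=1)
        # Assign each column to the column cluster with the smallest column error.
        col_err = np.stack([(W * (Z - predict(X, rho, np.full(n, h), betas)) ** 2).sum(axis=0)
                            for h in range(l)], axis=1)
        gamma = col_err.argmin(axis=1)
    return rho, gamma, betas
```

Calling `scoal(Z, W, C, P, k=1, l=1)` reduces to a single global ridge regression, which gives a convenient baseline when comparing against larger k and l.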

(14)

Preventing Overfits

•  Regularize Local Models
•  Smooth/share parameters across same row cluster
–  Ditto for columns
•  Shrinkage between global model and local models
•  Model Selection: find the right # of local models
–  e.g. use greedy bisecting algorithm (M-SCOAL)

(15)

Results (ERIM)

•  Also handily beats MLP, M5’,…

Algorithm                              Test Error (MSE)
Global Model (k=1, l=1)                4.24 (0.06)
CC (k=4, l=4)                          4.002 (0.056)
Co-Cluster then predict (k=4, l=4)     3.967 (0.034)
Smoothened SCOAL (k=4, l=4)            3.893 (0.052)

(16)

Market Segmentation and Structure (ERIM Data)

Coefficients of global model and sample co-cluster models

Attribute            Global          Cust Seg 3,     Cust Seg 4,     Cust Seg 1,
                                     Prod Seg 3      Prod Seg 4      Prod Seg 2
intercept             0.00 (1.00)    -0.42 (0.00)    -0.14 (0.00)     0.15 (0.00)
income               -0.02 (0.00)    -0.09 (0.31)    -0.03 (0.00)    -0.02 (0.42)
# members             0.03 (0.00)     0.03 (0.74)     0.04 (0.00)    -0.04 (0.06)
male head emp.        0.00 (0.42)    -0.06 (0.42)     0.00 (0.87)     0.05 (0.04)
female head emp.      0.00 (0.62)    -0.07 (0.31)     0.00 (0.45)     0.02 (0.46)
# visits              0.02 (0.00)    -0.11 (0.05)     0.01 (0.06)     0.11 (0.00)
total spent           0.10 (0.00)     0.48 (0.00)     0.03 (0.00)     0.09 (0.00)
price                -0.02 (0.00)    -0.75 (0.00)    -0.02 (0.00)     0.42 (0.00)
market share          0.17 (0.00)     0.43 (0.00)     0.09 (0.00)     0.16 (0.00)
# times advertised    0.10 (0.00)     0.48 (0.00)     0.04 (0.00)     0.04 (0.06)

Segment callouts on the slide: Cheapest, most popular products; Low market share; High income, large # visits

(17)

Leveraging multiple prediction models

–  Interpretability
–  Mining for the most certain predictions (KDD)
–  Robust versions: dropping outlier rows/cols (ICML)
–  Active learning

(18)

Active Learning with SCOAL

[Deodhar, Ghosh and Saar-Tsechansky, Information Systems Research’11]

•  Idea: Acquire labels in regions with poor model fits
•  Incrementally add models as more data becomes available.

[Figure: # local models learnt by global vs. local techniques]

(19)

Issues with SCOAL

•  Hard partitioning double-edged
–  Mixed memberships may be more natural
•  Cold-start problem (new user/product)
–  Need to couple covariates with latent variables
•  Need uniform way of dealing with other “side information” and missing data
–  Social network
–  Free text
–  Concept hierarchies

(20)

Latent Dirichlet Affinity Aware Bayesian Estimation

customers

products

models

(21)

Generative Model

(22)

Incorporating a Social Network

(23)

Affinities Not Missing at Random

W = binary variable (present/absent)

Rating, y → P(rated; w = 1) → W

(24)

Key Properties of Inferencing

Direct EM formulation intractable

Variational methods available, providing mean-field approximations

Iterative; linear complexity per iteration

Dynamic versions smoothen over time (Kalman filtering, etc.).

Non-parametric Bayesian versions possible, but need sampling.

(25)

•  Simultaneous Decomposition and Prediction (SDaP) Philosophy
–  Simple, fast, interpretable, scalable
•  Useful for many large-scale applications involving multi-relational data
–  Predicting affinities
•  Soft versions: Generative models capture richer info/constraints
–  Still efficient but more opaque
•  A lot more to be done!

(26)

References

•  D. Agarwal, “Recommender Problems for Content Optimization”, Talk at MMDS, June 10.

•  Meghana Deodhar and Joydeep Ghosh “SCOAL: A Framework for Simultaneous Co-clustering and Learning from Complex Data” ACM Transactions on Knowledge Discovery from Data, Oct 2010

•  Meghana Deodhar and Joydeep Ghosh “Mining for the Most Certain Predictions from Dyadic Data”. KDD’09

(27)

(28)

ERIM Marketing Dataset

•  Household panel data collected by A.C. Nielsen
•  1714 customers, 121 products from 6 product categories (ketchup, sugar, etc.)
•  Customer-product matrix cell values = # units purchased
–  Household Attributes: income, # residents, male head employed, female head employed, total visits, total expense
–  Product Attributes: market share, price, # times product was advertised
•  Properties
–  Fairly Sparse: 74.86% values are 0

(29)

Simultaneous Co-clustering and Classification

•  Elements of Z are class labels (2 class problem)
•  Logistic regression model relating attributes to class label
–  Log odds modeled as a linear combination of the attributes ($= \beta^T x_{ij}$)
•  Find (ρ, γ) and (β's) that minimize the total log loss

$$\sum_{u,v} w_{uv} \ln\left(1 + \exp\left(-z_{uv}\, \beta_{\rho(u)\gamma(v)}^T x_{uv}\right)\right)$$
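For the two-class case, the total log loss sums $w_{uv} \ln(1 + \exp(-z_{uv}\, \beta^T x_{uv}))$ over all cells, with β taken from the co-cluster of each cell. The sketch below evaluates that objective, assuming labels coded as -1/+1 and covariates stored as an (m, n, d) array; the function name and array layout are illustrative choices of mine, not from the slides.

```python
import numpy as np

def coclustered_log_loss(Z, W, X, rho, gamma, betas):
    """Weighted logistic loss where each cell uses the model of its co-cluster.

    Z: (m, n) labels in {-1, +1}; W: (m, n) weights (0 = missing);
    X: (m, n, d) covariates; rho, gamma: cluster index per row / column;
    betas: (k, l, d) one coefficient vector per co-cluster.
    """
    B = betas[rho[:, None], gamma[None, :]]         # (m, n, d) model per cell
    margin = Z * np.einsum("uvd,uvd->uv", X, B)     # z_uv * beta^T x_uv
    # logaddexp(0, -t) = ln(1 + exp(-t)), computed without overflow.
    return float((W * np.logaddexp(0.0, -margin)).sum())

# Tiny example: with all-zero models every observed cell costs ln 2.
Z = np.array([[1.0, -1.0], [-1.0, 1.0]])
W = np.array([[1.0, 1.0], [0.0, 1.0]])              # one missing entry
X = np.ones((2, 2, 1))
rho, gamma = np.array([0, 1]), np.array([0, 0])
betas = np.zeros((2, 1, 1))
loss = coclustered_log_loss(Z, W, X, rho, gamma, betas)
print(loss)
```

Swapping this loss (and a logistic fit per co-cluster) into the alternating scheme is what changing the loss function and predictive model amounts to in practice.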

(30)

(31)

Related Approaches for Modeling Dyadic Data

•  Predictive discrete latent factor modeling
–  [Agarwal and Merugu ‘07]
–  Response variable modeled as a sum of
   1.  Function of covariates (global structure)
   2.  Co-cluster specific constant (local structure)
•  Matrix factorization with side information
–  Regression based latent factor modeling [Agarwal and Chen ‘09]
   •  Cold-start problem
   •  Regression based priors on latent user/item factors
–  Spatio-temporal Kalman filtering [Lu et al. ‘09]
   •  MRF prior to capture spatial structure

(32)

PDLF Model

$$p(z_{ij} \mid x_{ij}) = \sum_{I=1}^{k} \sum_{J=1}^{l} \pi_{IJ}\, f_\phi\left(z_{ij};\ \beta^T x_{ij} + \delta_{IJ}\right)$$

•  Constrained mixture model
─  k * l components, $\pi_{IJ}$: mixture prior of the IJ-th component
•  Each component is a generalized linear model
─  $f_\phi$: exponential family, g: link function
•  Global trends $x_{ij}^T \beta$ shared across the components

(33)

PDLF vs. SCOAL

•  PDLF
–  Single global model + co-cluster constants
–  Robust even when data is limited

$$p(z_{ij} \mid x_{ij}, \rho, \gamma) = f_\phi\left(z_{ij};\ \beta^T x_{ij} + \delta_{\rho(i),\gamma(j)}\right), \quad [i]_1^m,\ [j]_1^n$$

•  SCOAL
–  k x l co-cluster models
–  Works well when large amount of data is available

$$p(z_{ij} \mid x_{ij}, \rho, \gamma) = f_\phi\left(z_{ij};\ \beta_{\rho(i),\gamma(j)}^T x_{ij}\right), \quad [i]_1^m,\ [j]_1^n$$

•  Complementary approaches

(34)

Active Learning

•  Learner selects instances to be labeled such that the generalization accuracy is improved the most
•  Example: Large scale recommender systems
–  Require large number of customers to rate many products
–  Obtaining ratings is expensive
•  Solution: Select those customers and products to query

(35)

Hierarchical BlockRank

•  Learning multiple local models involves tuning a lot of parameters
–  May not have enough data
–  Single global model may do better when training data is limited
•  Solution: Increase model complexity (# local models) as more labeled data is acquired
–  Begin with a single co-cluster
–  Perform model selection step every N acquisition steps
   •  Increase # co-clusters if validation set error reduces
   •  Use greedy, bisecting step to add one row/column cluster

(36)

Evaluation of the BlockRank Policy

(37)

Parallel SCOAL

[Joint work with Clinton Jones]

•  Expected to show maximum value on large datasets
–  Large heterogeneity
–  Sufficient data available for learning parameters
•  Distributed implementation using Map-Reduce framework
–  Developed on open source Hadoop platform
–  Experiments on Amazon EC2 clusters
•  SCOAL row/column cluster updates
–  Parallel distance computation for each matrix entry

(38)

Parallel SCOAL Pseudo Code

1. Learn co-cluster models
   Map:    <id, tuple> → <cc id, tuple>
   Reduce: <cc id, <tuple>> → learn cc model → <id, tuple>

2. Update row clusters
   Map:    <id, tuple> → compute distance → <row id, tuple>
   Reduce: <row id, <tuple>> → agg. dist. and assign → <id, tuple>

3. Update col clusters
   Map:    <id, tuple> → compute distance → <col id, tuple>
   Reduce: <col id, <tuple>> → agg. dist. and assign → <id, tuple>
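The row cluster update maps naturally onto a map step keyed by row id and a reduce step that aggregates per-row errors. The following single-process Python sketch mimics that data flow without Hadoop. The tuple layout `(row, col, z, x)`, the function names, and the use of squared error as the "distance" are assumptions made for illustration.

```python
from collections import defaultdict

def map_cell(cell, gamma, betas):
    """Map: for one observed cell, emit <row id, error under each row cluster>."""
    u, v, z, x = cell
    errors = []
    for g in range(len(betas)):
        beta = betas[g][gamma[v]]                       # model of co-cluster (g, gamma(v))
        pred = sum(b * xi for b, xi in zip(beta, x))
        errors.append((z - pred) ** 2)
    return u, errors

def reduce_row(row_id, error_lists):
    """Reduce: aggregate the errors of one row and assign the best row cluster."""
    totals = [sum(per_cluster) for per_cluster in zip(*error_lists)]
    best = min(range(len(totals)), key=totals.__getitem__)
    return row_id, best

def update_row_clusters(cells, gamma, betas):
    """Run map, group by key (the shuffle), then reduce."""
    grouped = defaultdict(list)
    for cell in cells:
        key, value = map_cell(cell, gamma, betas)
        grouped[key].append(value)
    return dict(reduce_row(key, values) for key, values in grouped.items())

# Two row clusters, one column cluster, one covariate per cell.
betas = [[[0.0]], [[1.0]]]
gamma = [0]
cells = [(0, 0, 0.0, [1.0]), (1, 0, 1.0, [1.0])]
assignment = update_row_clusters(cells, gamma, betas)
print(assignment)
```

The column update is symmetric: key the map output by column id and iterate over candidate column clusters instead.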

(39)

Results

(40)

Decoupled SCOAL

Grid Partitioning with SCOAL

Partitioning with D-SCOAL Global objective function requires

(41)

Future Directions

•  Exploiting additional structural information
–  Hierarchy/taxonomy on items
–  Social network on users
–  Spatial structure
•  Sparse co-cluster models and feature selection
–  Interaction between # models and model complexity
•  Robust error functions
–  Better at handling outliers

(42)

Row and Column Cluster Updates

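•  Objective function is a sum of row/column errors
–  Assign each row to a row cluster that minimizes the row error
–  Row cluster assignment for row u

$$\rho^{new}(u) = \arg\min_{g} \sum_{v=1}^{n} w_{uv}\,(z_{uv} - \hat{z}_{uv})^2, \qquad \text{where } \hat{z}_{uv} = \beta_{g\gamma(v)}^T x_{uv}$$

[Figure: row u scored against the models of three row clusters ($\beta_{11}, \beta_{12}$; $\beta_{21}, \beta_{22}$; $\beta_{31}, \beta_{32}$), giving row errors $e_1(u)$, $e_2(u)$, $e_3(u)$]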

(43)

Background

•  Difficult classification/regression problems may involve a heterogeneous population
•  Divide and conquer approach
–  Partition the population and model each partition separately
–  E.g., electric load forecasting
•  Advantages
–  Learning models on more homogeneous data
–  Improves accuracy

(44)

Motivation

•  Traditionally partitioning is a priori
–  Domain knowledge
–  Clustering algorithm
•  … but, a priori partitioning may be suboptimal
•  Solution: Interleaving partitioning and construction of prediction models

(45)

Approaches to Interactive Decomposition

•  Hard partitioning of input space for regression
–  CART [Breiman et al. ‘84]
–  Model trees (M5') [Wang and Witten ‘97]
•  Soft partitioning of input space for regression
–  Mixture of Experts [Jacobs and Jordan ‘91]
   •  Hierarchical versions
–  Fuzzy clustering and modeling [Wedel and Steenkamp ‘89]
•  Output space partitioning for modeling large number of classes

(46)

Incremental Model Selection (M-SCOAL)

[Figure: customers x products matrix]

•  Top down, bisecting, greedy algorithm

Run SCOAL with k=1, l=1
Repeat
   1. Split row/column cluster with highest error
   2. Initialize SCOAL with current partitioning
   3. Accept split if validation error is reduced
until no change in k and l

•  Gives better local minimum

(47)

Modeling Ordered Data

[Deodhar and Ghosh, Data Mining for Design and Mkt. Workshop, ICDM ‘08]

•  Dyadic data with implicit ordering along a mode
–  E.g., temporal marketing data
•  Segment the time axis
–  Dynamic programming (quadratic) or greedy (linear)

(48)

Results on Tensor ERIM Dataset

•  Dollars spent by households per week at 5 stores
–  1240 households X 50 weeks X 5 stores

Algorithm        Train R-sq.   Test MSE         Test R-sq.
Global           0.44          166.36 (1.27)    0.44
Cluster Models   0.45          156.06 (1.36)    0.48
SCOAL            0.58          142.08 (0.99)    0.52
CoSeg            0.60          132.94 (0.88)    0.55

(49)

Mining for the Most Reliable Predictions

[Deodhar and Ghosh, KDD ‘09]

•  Problem: Rank predictions by estimate of their accuracy
–  E.g., options market player strategy
•  SCOAL based ranking
–  Row-Col Ranking: Rank by estimated mean row error + col error
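One plausible reading of Row-Col Ranking is to score each candidate prediction (u, v) by the mean training error of row u plus the mean training error of column v, and then rank candidates from lowest to highest score. The sketch below implements that reading. The exact error estimator used in the KDD '09 paper is not given on the slide, so the mean squared error over observed cells is an assumption, as are the function and argument names.

```python
import numpy as np

def row_col_ranking(Z, W, Z_hat, candidates):
    """Order candidate cells from most to least reliable prediction.

    Z, Z_hat: (m, n) observed and fitted values; W: (m, n) weights (0 = missing);
    candidates: list of (row, col) pairs whose predictions are to be ranked.
    """
    sq_err = W * (Z - Z_hat) ** 2
    # Mean squared error over the observed cells of each row / column.
    row_err = sq_err.sum(axis=1) / np.maximum(W.sum(axis=1), 1)
    col_err = sq_err.sum(axis=0) / np.maximum(W.sum(axis=0), 1)
    scores = np.array([row_err[u] + col_err[v] for u, v in candidates])
    order = np.argsort(scores, kind="stable")
    return [candidates[i] for i in order]

Z = np.array([[1.0, 2.0], [3.0, 4.0]])
Z_hat = np.array([[1.0, 2.0], [3.0, 6.0]])     # only cell (1, 1) is poorly fitted
W = np.ones((2, 2))
ranked = row_col_ranking(Z, W, Z_hat, [(1, 1), (0, 0), (0, 1)])
print(ranked)
```

Predictions in rows and columns that the co-cluster models already fit well come first, which is the property a "most certain predictions" selection needs.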

(50)
