Actionable Mining of Multi-relational Data using Localized Predictive Models
Joydeep Ghosh
Schlumberger Centennial Chaired Professor
The University of Texas at Austin
joint work with
A Single (Affinity) Relation
• {Users} x {Movies} → ratings: DYADIC data
• Limited side information
– Missing not at random
– Time
• Top solutions:
– Ensembles of matrix factorizations (global) + collaborative filtering (local) (scalable? interpretable?)
[Figure: example user-by-movie ratings matrix. Movies: Star Wars, Heat, Titanic, Batman. Ratings in extraction order: Alice 5 2 5 4; Bob 5 3; Carol 2 4 2; Dave 5 1 5 ?]
Multi-Relational
From Arindam Banerjee, Sugato Basu, Srujana Merugu,
Overview
• Predicting affinities between two (or more) sets of entities
– Leveraging a variety of extra information
• Dealing with predictive heterogeneity in large, multi-relational data
– Multiple localized models
– Side benefits
Simultaneous Co-clustering and Learning (SCOAL)
[Deodhar and Ghosh, TKDD ‘10]
[Figure: customer-by-product matrix of known (1, -1) and unknown (?) entries, linked to a predictive model over customer attributes and product attributes]
• Simultaneous partitioning and prediction
• Exploits heterogeneity but has a “chicken and egg” problem.
Regression Example
[Figure sequence: a data matrix with attributes c and p, shown first unordered and then rearranged into co-clusters, with per-co-cluster linear models such as c + 2p and 1 + c + p]
Reconstruction Errors
[Figure: matrices of reconstruction errors]
Reconstructed with simultaneous co-clustering and regression
MSE = 7.9
Reconstructed with a single linear model z = 1.2 + 3.6c + 1.5p
Linear Regression (Global)
• Binary $w_{uv}$ associated with each matrix entry (0 = missing)
• Model parameters $\beta^T = [\beta_0, \beta_c^T, \beta_p^T, \beta_{c,p}^T]$
• Response $z_{ij}$ (e.g., ratings, CTR) for row i (e.g., customers/queries) and column j (e.g., products/ads)
• Covariates $x_{ij} = \{c_i, p_j, a_{ij}\}$: customer features, product features, joint features
• Model: $\hat{z}_{ij} = \beta^T x_{ij}$
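A minimal sketch of the global model, assuming the fit is ordinary least squares restricted to observed cells. The function names, the `[1, c_i, p_j]` feature layout (joint features omitted), and the small ridge term are illustrative assumptions, not taken from the slides.

```python
import numpy as np

def make_covariates(c, p):
    """Build x_ij = [1, c_i, p_j] from scalar row and column attributes."""
    c, p = np.asarray(c, dtype=float), np.asarray(p, dtype=float)
    m, n = len(c), len(p)
    return np.stack([np.ones((m, n)),
                     np.broadcast_to(c[:, None], (m, n)),
                     np.broadcast_to(p[None, :], (m, n))], axis=-1)

def fit_global_model(X, Z, W, ridge=1e-6):
    """Fit one linear model z_ij ~ beta^T x_ij on the observed cells only.

    X: (m, n, d) covariates; Z: (m, n) responses;
    W: (m, n) binary weights, 0 = missing.
    The ridge term is an assumption added for numerical stability.
    """
    observed = W > 0
    A = X[observed]                      # (num_observed, d)
    b = Z[observed]                      # (num_observed,)
    d = A.shape[1]
    return np.linalg.solve(A.T @ A + ridge * np.eye(d), A.T @ b)
```

Cells with weight 0 never enter the normal equations, so whatever placeholder value sits in a missing cell of Z has no effect on the fitted coefficients.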
Linear Regressions (Local)
• ρ: mapping from m rows to k row clusters
• γ: mapping from n columns to l column clusters
– Total of k * l regression models.
• Find the co-clustering (ρ, γ) and models (β’s) that minimize the total (regularized) squared error
• Clustering is based on the prediction cost, not cluster homogeneity!
• Classification? Change the loss function and the predictive model.
$$\sum_{u,v} w_{uv}\,(z_{uv} - \hat{z}_{uv})^2 + \text{penalty}, \qquad \hat{z}_{uv} = \beta^{\rho(u)\gamma(v)\,T} x_{uv}$$
SCOAL Meta-Algorithm
Input: Data: Z, Weights: W, Attributes: C, P
Output: Co-clustering (ρ, γ), models {β’s}
Initialize ρ, γ
Iterate until convergence
• Re-estimate model for each co-cluster
• Re-estimate the co-clusters
– Update row clusters: Assign each row to closest row cluster
– Update col clusters: Assign each col to closest col cluster
Return ρ, γ, {β’s}
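The meta-algorithm above can be sketched as an alternating minimization. This is an illustrative sketch under stated assumptions, not the authors' implementation: each co-cluster model is a ridge regression, initialization is random, "closest" means lowest weighted squared prediction error, and missing cells of Z hold a finite placeholder value. The names `predict`, `sse`, and `scoal` are hypothetical.

```python
import numpy as np

def predict(X, rho, gamma, betas):
    """z_hat_uv = beta^{rho(u), gamma(v)} . x_uv for every cell."""
    return np.einsum('uvd,uvd->uv', betas[rho[:, None], gamma[None, :]], X)

def sse(X, Z, W, rho, gamma, betas):
    """Weighted squared error over observed cells (penalty excluded)."""
    return float((W * (Z - predict(X, rho, gamma, betas)) ** 2).sum())

def scoal(X, Z, W, k, l, n_iter=50, ridge=1e-3, seed=0):
    """X: (m, n, d) covariates, Z: (m, n) responses, W: (m, n) 0/1 weights."""
    rng = np.random.default_rng(seed)
    m, n, d = X.shape
    rho = rng.integers(k, size=m)       # random initial row clusters
    gamma = rng.integers(l, size=n)     # random initial column clusters
    betas = np.zeros((k, l, d))
    for _ in range(n_iter):
        # Re-estimate the model of each co-cluster from its observed cells.
        for g in range(k):
            for h in range(l):
                cells = (rho[:, None] == g) & (gamma[None, :] == h) & (W > 0)
                if cells.any():
                    A, b = X[cells], Z[cells]
                    betas[g, h] = np.linalg.solve(
                        A.T @ A + ridge * np.eye(d), A.T @ b)
        # Update row clusters: error of row u under each row cluster g.
        pred = np.einsum('gvd,uvd->guv', betas[:, gamma, :], X)
        new_rho = (W * (Z - pred) ** 2).sum(axis=2).argmin(axis=0)
        # Update column clusters given the new row assignment.
        pred = np.einsum('uhd,uvd->huv', betas[new_rho], X)
        new_gamma = (W * (Z - pred) ** 2).sum(axis=1).argmin(axis=0)
        converged = (np.array_equal(new_rho, rho)
                     and np.array_equal(new_gamma, gamma))
        rho, gamma = new_rho, new_gamma
        if converged:
            break
    return rho, gamma, betas
```

With the models held fixed, each reassignment step can only lower the squared error, and with the assignments held fixed each refit can only lower the regularized error of its block, which is why the alternation settles into a local minimum. Calling `scoal(X, Z, W, 1, 1)` reduces to the single global model.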
Preventing Overfits
• Regularize local models
• Smooth/share parameters across same row cluster
– Ditto for columns
• Shrinkage between global model and local models
• Model selection: find the right # of local models
– e.g., use greedy bisecting algorithm (M-SCOAL)
Results (ERIM)
Algorithm                              Test Error (MSE)
Global Model (k=1, l=1)                4.24 (0.06)
CC (k=4, l=4)                          4.002 (0.056)
Co-Cluster then predict (k=4, l=4)     3.967 (0.034)
Smoothened SCOAL (k=4, l=4)            3.893 (0.052)
• Also handily beats MLP, M5’,…
Market Segmentation and Structure (ERIM Data)
Coefficients of global model and sample co-cluster models

Attribute            Global         Cust Seg 3,    Cust Seg 4,    Cust Seg 1,
                                    Prod Seg 3     Prod Seg 4     Prod Seg 2
intercept            0.00 (1.00)    -0.42 (0.00)   -0.14 (0.00)   0.15 (0.00)
income               -0.02 (0.00)   -0.09 (0.31)   -0.03 (0.00)   -0.02 (0.42)
# members            0.03 (0.00)    0.03 (0.74)    0.04 (0.00)    -0.04 (0.06)
male head emp.       0.00 (0.42)    -0.06 (0.42)   0.00 (0.87)    0.05 (0.04)
female head emp.     0.00 (0.62)    -0.07 (0.31)   0.00 (0.45)    0.02 (0.46)
# visits             0.02 (0.00)    -0.11 (0.05)   0.01 (0.06)    0.11 (0.00)
total spent          0.10 (0.00)    0.48 (0.00)    0.03 (0.00)    0.09 (0.00)
price                -0.02 (0.00)   -0.75 (0.00)   -0.02 (0.00)   0.42 (0.00)
market share         0.17 (0.00)    0.43 (0.00)    0.09 (0.00)    0.16 (0.00)
# times advertised   0.10 (0.00)    0.48 (0.00)    0.04 (0.00)    0.04 (0.06)

Slide callouts: Cheapest, most popular products; Low market share; High income, large # visits
Leveraging multiple prediction models
– Interpretability
– Mining for the most certain predictions (KDD)
– Robust versions: dropping outlier rows/cols (ICML)
– Active learning
Active Learning with SCOAL
[Deodhar, Ghosh and Saar-Tsechansky, Information Systems Research’11]
• Idea: Acquire labels in regions with poor model fits
• Incrementally add models as more data becomes available.
# Local models learnt by Global vs. local techniques
Issues with SCOAL
• Hard partitioning is double-edged
– Mixed memberships may be more natural
• Cold-start problem (new user/product)
– Need to couple covariates with latent variables
• Need a uniform way of dealing with other “side information” and missing data
– Social network
– Free text
– Concept hierarchies
Latent Dirichlet Affinity Aware Bayesian Estimation
[Figure labels: customers, products, models]
Generative Model
Incorporating a Social Network
Affinities Not Missing at Random
W = binary variable (present/absent)
Rating, y → P(rated; w = 1) → W
Key Properties of Inferencing
Direct EM formulation intractable
Variational methods available, providing mean-field approximations
Iterative; linear complexity per iteration
Dynamic versions smoothen over time (Kalman filtering etc.).
Non-parametric Bayesian versions possible, but need sampling.
• Simultaneous Decomposition and Prediction (SDaP) Philosophy
– Simple, fast, interpretable, scalable
• Useful for many large-scale applications involving multi-relational data
– Predicting affinities
• Soft versions: Generative models capture richer info/constraints
– Still efficient but more opaque
• A lot more to be done!
References
• D. Agarwal, “Recommender Problems for Content Optimization”, Talk at MMDS, June 10.
• Meghana Deodhar and Joydeep Ghosh, “SCOAL: A Framework for Simultaneous Co-clustering and Learning from Complex Data”, ACM Transactions on Knowledge Discovery from Data, Oct 2010.
• Meghana Deodhar and Joydeep Ghosh, “Mining for the Most Certain Predictions from Dyadic Data”, KDD’09.
ERIM Marketing Dataset
• Household panel data collected by A.C. Nielsen
• 1714 customers, 121 products from 6 product categories (ketchup, sugar, etc.)
• Customer-product matrix cell values = # units purchased
– Household attributes: income, # residents, male head employed, female head employed, total visits, total expense
– Product attributes: market share, price, # times product was advertised
• Properties
– Fairly sparse: 74.86% of values are 0
Simultaneous Co-clustering and Classification
• Elements of Z are class labels (2-class problem)
• Logistic regression model relating attributes to class label
– Log odds modeled as a linear combination of the attributes (= $\beta^T x_{ij}$)
• Find (ρ, γ) and (β’s) that minimize the total log loss
$$\sum_{u,v} w_{uv} \ln\left(1 + \exp\left(-z_{uv}\, \beta^{\rho(u)\gamma(v)\,T} x_{uv}\right)\right)$$
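A small sketch of how the total log loss could be evaluated for a given co-clustering, assuming class labels are coded as -1/+1 and that `betas` is stored as a (k, l, d) array of co-cluster coefficient vectors. The function name and array layout are illustrative assumptions.

```python
import numpy as np

def coclustered_log_loss(X, Z, W, rho, gamma, betas):
    """Total log loss: sum_uv w_uv * ln(1 + exp(-z_uv * beta^T x_uv)).

    X: (m, n, d) covariates; Z: (m, n) labels in {-1, +1};
    W: (m, n) binary weights; rho: (m,), gamma: (n,) cluster indices.
    Cells with w_uv = 0 need a finite placeholder label; the zero
    weight then removes them from the sum.
    """
    B = betas[rho[:, None], gamma[None, :]]            # (m, n, d)
    margin = Z * np.einsum('uvd,uvd->uv', B, X)
    # logaddexp(0, -t) equals ln(1 + exp(-t)) and avoids overflow.
    return float((W * np.logaddexp(0.0, -margin)).sum())
```

Swapping this loss in for the squared error, and logistic regression in for the linear model, leaves the alternating structure of the algorithm unchanged.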
Related Approaches for Modeling Dyadic Data
• Predictive discrete latent factor modeling
– [Agarwal and Merugu ‘07]
– Response variable modeled as a sum of
1. Function of covariates (global structure)
2. Co-cluster specific constant (local structure)
• Matrix factorization with side information
– Regression based latent factor modeling [Agarwal and Chen ‘09]
• Cold-start problem
• Regression based priors on latent user/item factors
– Spatio-temporal Kalman filtering [Lu et al. ‘09]
• MRF prior to capture spatial structure
PDLF Model
$$p(z_{ij} \mid x_{ij}) = \sum_{I=1}^{k} \sum_{J=1}^{l} \pi_{IJ}\, f_{\phi}(z_{ij};\, \mu_{ij}), \qquad g(\mu_{ij}) = \beta^T x_{ij} + \delta_{IJ}$$
• Constrained mixture model
─ k * l components, $\pi_{IJ}$: mixture prior of the IJ-th component
• Each component is a generalized linear model
─ $f_{\phi}$: exponential family, g: link function
• Global trends $x_{ij}^T \beta$ shared across the components
PDLF vs. SCOAL
• PDLF
– Single global model + co-cluster constants
– Robust even when data is limited
$$p(z_{ij} \mid x_{ij}, \rho, \gamma) = f_{\phi}\left(z_{ij};\, \beta^T x_{ij} + \delta_{\rho(i),\gamma(j)}\right), \quad [i]_1^m,\ [j]_1^n$$
• SCOAL
– k x l co-cluster models
– Works well when large amount of data is available
$$p(z_{ij} \mid x_{ij}, \rho, \gamma) = f_{\phi}\left(z_{ij};\, \beta^{\rho(i)\gamma(j)\,T} x_{ij}\right), \quad [i]_1^m,\ [j]_1^n$$
• Complementary approaches
Active Learning
• Learner selects instances to be labeled such that the generalization accuracy is improved the most
• Example: Large scale recommender systems
– Require large number of customers to rate many products
– Obtaining ratings is expensive
• Solution: Select those customers and products to query
Hierarchical BlockRank
• Learning multiple local models involves tuning a lot of parameters
– May not have enough data
– Single global model may do better when training data is limited
• Solution: Increase model complexity (# local models) as more labeled data is acquired
– Begin with a single co-cluster
– Perform model selection step every N acquisition steps
• Increase # co-clusters if validation set error reduces
• Use greedy, bisecting step to add one row/column cluster
Evaluation of the BlockRank Policy
Parallel SCOAL
[Joint work with Clinton Jones]
• Expected to show maximum value on large datasets
– Large heterogeneity
– Sufficient data available for learning parameters
• Distributed implementation using Map-Reduce framework
– Developed on open source Hadoop platform
– Experiments on Amazon EC2 clusters
• SCOAL row/column cluster updates
– Parallel distance computation for each matrix entry
Parallel SCOAL Pseudo Code
1. Learn co-cluster models
Map: <id, tuple> → <cc id, tuple>
Reduce: <cc id, <tuple>> → learn cc model → <id, tuple>
2. Update row clusters
Map: <id, tuple> → compute distance → <row id, tuple>
Reduce: <row id, <tuple>> → agg. dist. and assign → <id, tuple>
3. Update col clusters
Map: <id, tuple> → compute distance → <col id, tuple>
Reduce: <col id, <tuple>> → agg. dist. and assign → <id, tuple>
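To make the key/value flow of step 2 concrete, here is a single-process sketch. It is not Hadoop code: `map_reduce` is a hypothetical in-memory stand-in for one Map-Reduce round, and the tuple layout `(u, v, x_uv, z_uv)` and the `betas` array of shape (k, l, d) are assumptions made for illustration.

```python
from collections import defaultdict
import numpy as np

def map_reduce(records, mapper, reducer):
    """Tiny in-memory stand-in for one Map-Reduce round (not Hadoop)."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return {key: reducer(key, values) for key, values in groups.items()}

def update_row_clusters(cells, gamma, betas):
    """cells: iterable of (u, v, x_uv, z_uv) for observed entries."""
    def mapper(cell):
        u, v, x, z = cell
        # Squared error of this cell under every candidate row cluster g.
        errs = (z - betas[:, gamma[v], :] @ x) ** 2     # shape (k,)
        yield u, errs

    def reducer(u, errs_list):
        # Aggregate the per-cell errors of row u and pick the best cluster.
        return int(np.sum(errs_list, axis=0).argmin())

    return map_reduce(cells, mapper, reducer)
```

The map side touches each matrix entry independently, which is the part that parallelizes; the reduce side only sums k numbers per entry of a row before assigning it. The column update is symmetric, keyed on the column id.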
Results
Decoupled SCOAL
Grid Partitioning with SCOAL
Partitioning with D-SCOAL Global objective function requires
Future Directions
• Exploiting additional structural information
– Hierarchy/taxonomy on items
– Social network on users
– Spatial structure
• Sparse co-cluster models and feature selection
– Interaction between # models and model complexity
• Robust error functions
– Better at handling outliers
Row and Column Cluster Updates
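• Objective function is a sum of row/column errors
– Assign each row to a row cluster that minimizes the row error
– Row cluster assignment for row u
$$\rho^{\text{new}}(u) = \arg\min_{g} \sum_{v=1}^{n} w_{uv}\,(z_{uv} - \hat{z}_{uv})^2$$
where $\hat{z}_{uv}$ is computed with the model of co-cluster (g, γ(v)).
[Figure: row u scored against the models of each row cluster ($\beta^{11}, \beta^{12}$; $\beta^{21}, \beta^{22}$; $\beta^{31}, \beta^{32}$), giving row errors $e_1(u)$, $e_2(u)$, $e_3(u)$]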
Background
• Difficult classification/regression problems may involve a heterogeneous population
• Divide and conquer approach
– Partition the population and model each partition separately
– E.g., electric load forecasting
• Advantages
– Learning models on more homogeneous data
– Improves accuracy
Motivation
• Traditionally partitioning is a priori
– Domain knowledge
– Clustering algorithm
• … but, a priori partitioning may be suboptimal
• Solution: Interleaving partitioning and construction of prediction models
Approaches to Interactive Decomposition
• Hard partitioning of input space for regression
– CART [Breiman et al. ‘84]
– Model trees (M5') [Wang and Witten ‘97]
• Soft partitioning of input space for regression
– Mixture of Experts [Jacobs and Jordan ‘91]
• Hierarchical versions
– Fuzzy clustering and modeling [Wedel and Steenkamp ‘89]
• Output space partitioning for modeling large number of classes
[Figure axes: customers, products]
Incremental Model Selection (M-SCOAL)
• Top down, bisecting, greedy algorithm
Run SCOAL with k=1, l=1
Repeat
1. Split row/column cluster with highest error
2. Initialize SCOAL with current partitioning
3. Accept split if validation error is reduced
until no change in k and l
• Gives better local minimum
Modeling Ordered Data
[Deodhar and Ghosh, Data Mining for Design and Mkt. Workshop, ICDM ‘08]
• Dyadic data with implicit ordering along a mode
– E.g., temporal marketing data
• Segment the time axis
– Dynamic programming (quadratic) or greedy (linear)
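The dynamic-programming option can be sketched on a one-dimensional series. As a simplifying assumption, the cost of a segment here is the squared error around its own mean, standing in for the prediction error of a per-segment model; the function name and return format are illustrative.

```python
import numpy as np

def segment_time_axis(y, n_segments):
    """Split y into n_segments contiguous pieces with minimum total cost.

    Returns (bounds, cost), where bounds = [0, b_1, ..., len(y)] and
    segment s covers y[bounds[s]:bounds[s + 1]]. Needs n_segments <= len(y).
    Runs in O(n_segments * T^2) time, i.e. quadratic in the series length.
    """
    y = np.asarray(y, dtype=float)
    T = len(y)
    s1 = np.concatenate(([0.0], np.cumsum(y)))
    s2 = np.concatenate(([0.0], np.cumsum(y * y)))

    def cost(i, j):
        """Squared error of y[i:j] around its own mean (requires j > i)."""
        total = s1[j] - s1[i]
        return (s2[j] - s2[i]) - total * total / (j - i)

    # best[s, j]: minimum cost of covering y[:j] with exactly s segments.
    best = np.full((n_segments + 1, T + 1), np.inf)
    back = np.zeros((n_segments + 1, T + 1), dtype=int)
    best[0, 0] = 0.0
    for s in range(1, n_segments + 1):
        for j in range(s, T + 1):
            for i in range(s - 1, j):
                c = best[s - 1, i] + cost(i, j)
                if c < best[s, j]:
                    best[s, j], back[s, j] = c, i
    # Walk the back-pointers to recover the segment boundaries.
    bounds = [T]
    for s in range(n_segments, 0, -1):
        bounds.append(int(back[s, bounds[-1]]))
    return bounds[::-1], float(best[n_segments, T])
```

A greedy alternative would repeatedly split the segment whose best single cut reduces the cost the most, trading optimality for a cheaper pass over the time axis.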
Results on Tensor ERIM Dataset
• Dollars spent by households per week at 5 stores
– 1240 households X 50 weeks X 5 stores

Algorithm        Train R-sq.   Test MSE        Test R-sq.
Global           0.44          166.36 (1.27)   0.44
Cluster Models   0.45          156.06 (1.36)   0.48
SCOAL            0.58          142.08 (0.99)   0.52
CoSeg            0.60          132.94 (0.88)   0.55
Mining for the Most Reliable Predictions
[Deodhar and Ghosh, KDD ‘09]
• Problem: Rank predictions by estimate of their accuracy
– E.g., options market player strategy
• SCOAL based ranking
– Row-Col Ranking: Rank by estimated mean row error + col error
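One possible reading of Row-Col Ranking, sketched under the assumption that the mean row and column errors are estimated from squared training errors on observed cells. The function name and inputs are hypothetical, and the KDD ‘09 paper may define the estimates differently.

```python
import numpy as np

def row_col_ranking(sq_err, W, candidates):
    """Order candidate cells from most to least reliable.

    sq_err: (m, n) squared training errors; W: (m, n) binary, 1 = observed;
    candidates: list of (u, v) cells whose predictions are to be ranked.
    Rows or columns with no observed cells get an error of 0 here, which a
    real system would replace with a prior or a global average.
    """
    row_err = (W * sq_err).sum(axis=1) / np.maximum(W.sum(axis=1), 1)
    col_err = (W * sq_err).sum(axis=0) / np.maximum(W.sum(axis=0), 1)
    scores = np.array([row_err[u] + col_err[v] for u, v in candidates])
    order = np.argsort(scores, kind='stable')
    return [candidates[i] for i in order]
```

Predictions at the front of the returned list sit in rows and columns that the fitted models have reproduced well so far, so they are the ones to act on first.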