
Actionable Mining of Multi-relational Data using Localized Predictive Models


Academic year: 2021


(1)

Actionable Mining of Multi-relational Data using Localized Predictive Models

Joydeep Ghosh

Schlumberger Centennial Chaired Professor

The University of Texas at Austin

joint work with

(2)

A Single (Affinity) Relation

{Users} × {Movies} → ratings

DYADIC data

Limited side information

–  Missing not at random

–  Time

Top solutions:

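–  Ensembles of matrix factorizations (global) + collaborative filtering (local) (scalable? Interpretable?)

[Table: ratings by four users for Star Wars, Heat, Titanic and Batman, with missing entries and one entry (?) to predict. Surviving values, in order: Alice: 5, 2, 5, 4; Bob: 5, 3; Carol: 2, 4, 2; Dave: 5, 1, 5, ?]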

(3)

Multi-Relational

From Arindam Banerjee, Sugato Basu, Srujana Merugu,

(4)

Overview

Predicting affinities between two (or more) sets of entities

–  Leveraging a variety of extra information

Dealing with predictive heterogeneity in large, multi-relational data

–  Multiple localized models

–  Side benefits

(5)

Simultaneous Co-clustering and Learning (SCOAL)

[Deodhar and Ghosh, TKDD ‘10]

[Figure: matrix of +1/-1 labels with missing entries (?), linked to customer attributes and product attributes through a predictive model]

Simultaneous partitioning and prediction exploits heterogeneity but has a chicken-and-egg problem.

(6)

Regression Example

[Figure: example data matrix, annotated with a customer attribute c and a product attribute p]
(7)

Regression Example

[Figure: example data matrix arranged into blocks (co-clusters), annotated with attributes c and p]
(8)

Regression Example

[Figure: example data matrix arranged into blocks, with one block labeled by the model c + 2p; attributes c and p]
(9)

Regression Example

[Figure: example data matrix arranged into blocks, with blocks labeled by the models c + 2p and 1 + c + p; attributes c and p]
(10)

Reconstruction Errors

[Figure: two matrices of per-cell reconstruction errors]

Reconstructed with simultaneous co-clustering and regression

MSE = 7.9 Reconstructed with a single linear model

z = 1.2 + 3.6c + 1.5p

(11)

Linear Regression (Global)

Binary $w_{uv}$ associated with each matrix entry (0 = missing)

Model parameters $\beta^T = [\beta_0, \beta_c^T, \beta_p^T, \beta_{c,p}^T]$

Matrix entry $z_{ij}$: e.g., ratings, CTR

Rows $i$: e.g., customers/queries; columns $j$: e.g., products/ads

Covariates $x_{ij}^T = \{c_i, p_j, a_{ij}\}$: customer features, product features, joint features

(12)

Linear Regressions (Local)

ρ: Mapping from m rows to k row clusters

γ: Mapping from n columns to l column clusters

–  Total of k * l regression models.

Find co-clustering (ρ, γ) and models (βs) that minimize the total (regularized) squared error

$$\sum_{u,v} w_{uv}\,(z_{uv} - \hat{z}_{uv})^2 + \text{penalty}, \qquad \hat{z}_{uv} = \beta^{\rho(u)\gamma(v)\,T} x_{uv}$$

Clustering based on the prediction cost, not cluster homogeneity!

Classification? Change loss function; predictive model.
(13)

SCOAL Meta-Algorithm

Input: Data: Z, Weights: W, Attributes: C, P

Output: Co-clustering (ρ, γ), models {βs}

Initialize ρ, γ

Iterate until convergence

•  Re-estimate model for each co-cluster

•  Re-estimate the co-clusters

–  Update row clusters: Assign each row to closest row cluster

–  Update col clusters: Assign each col to closest col cluster

Return ρ, γ, {βs}
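The meta-algorithm above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes one scalar attribute per customer (`C[u]`) and per product (`P[v]`), covariates x = [1, c, p], binary weights, a ridge penalty `lam`, random initialization, and a fixed iteration count. The names `solve`, `fit`, `predict` and `scoal` are hypothetical.

```python
import random

def solve(A, b):
    """Solve the small dense system A x = b (Gauss-Jordan, partial pivoting)."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [a - f * c for a, c in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

def fit(cells, lam):
    """Ridge least squares over (x, z) pairs: (X^T X + lam I) beta = X^T z."""
    d = 3
    A = [[lam if i == j else 0.0 for j in range(d)] for i in range(d)]
    b = [0.0] * d
    for x, z in cells:
        for i in range(d):
            b[i] += x[i] * z
            for j in range(d):
                A[i][j] += x[i] * x[j]
    return solve(A, b)

def predict(beta, c, p):
    return beta[0] + beta[1] * c + beta[2] * p

def scoal(Z, W, C, P, k, l, lam=0.01, iters=10, seed=0):
    rng = random.Random(seed)
    m, n = len(Z), len(Z[0])
    rho = [rng.randrange(k) for _ in range(m)]      # row -> row cluster
    gamma = [rng.randrange(l) for _ in range(n)]    # column -> column cluster

    def err(u, v, beta):
        # Weighted squared error of one cell; missing cells (w = 0) cost nothing.
        return W[u][v] * (Z[u][v] - predict(beta, C[u], P[v])) ** 2

    history = []
    for _ in range(iters):
        # Re-estimate one model per co-cluster from its observed cells.
        betas = [[fit([([1.0, C[u], P[v]], Z[u][v])
                       for u in range(m) if rho[u] == g
                       for v in range(n) if gamma[v] == h and W[u][v]], lam)
                  for h in range(l)] for g in range(k)]
        # Assign each row, then each column, to the cluster with the lowest error.
        rho = [min(range(k), key=lambda g: sum(err(u, v, betas[g][gamma[v]])
                                               for v in range(n)))
               for u in range(m)]
        gamma = [min(range(l), key=lambda h: sum(err(u, v, betas[rho[u]][h])
                                                 for u in range(m)))
                 for v in range(n)]
        sse = sum(err(u, v, betas[rho[u]][gamma[v]])
                  for u in range(m) for v in range(n))
        penalty = lam * sum(b * b for row in betas for beta in row for b in beta)
        history.append(sse + penalty)
    return rho, gamma, betas, history
```

Each of the three steps can only lower the regularized objective, so `history` should be non-increasing; the result is a local minimum that depends on the initialization.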

(14)

Preventing Overfits

Regularize Local Models

Smooth/share parameters across same row cluster

–  Ditto for columns

Shrinkage between global model and local models

Model Selection: find the right # of local models

–  e.g. use greedy bisecting algorithm (M-SCOAL)
(15)

Results (ERIM)

•  Also handily beats MLP, M5’,…

Algorithm                               Test Error (MSE)
Global Model (k=1, l=1)                 4.24 (0.06)
CC (k=4, l=4)                           4.002 (0.056)
Co-Cluster then predict (k=4, l=4)      3.967 (0.034)
Smoothened SCOAL (k=4, l=4)             3.893 (0.052)

(16)

Market Segmentation and Structure (ERIM Data)

Attribute             Global          Cust Seg 3, Prod Seg 3    Cust Seg 4, Prod Seg 4    Cust Seg 1, Prod Seg 2
intercept             0.00 (1.00)     -0.42 (0.00)              -0.14 (0.00)              0.15 (0.00)
income                -0.02 (0.00)    -0.09 (0.31)              -0.03 (0.00)              -0.02 (0.42)
# members             0.03 (0.00)     0.03 (0.74)               0.04 (0.00)               -0.04 (0.06)
male head emp.        0.00 (0.42)     -0.06 (0.42)              0.00 (0.87)               0.05 (0.04)
female head emp.      0.00 (0.62)     -0.07 (0.31)              0.00 (0.45)               0.02 (0.46)
# visits              0.02 (0.00)     -0.11 (0.05)              0.01 (0.06)               0.11 (0.00)
total spent           0.10 (0.00)     0.48 (0.00)               0.03 (0.00)               0.09 (0.00)
price                 -0.02 (0.00)    -0.75 (0.00)              -0.02 (0.00)              0.42 (0.00)
market share          0.17 (0.00)     0.43 (0.00)               0.09 (0.00)               0.16 (0.00)
# times advertised    0.10 (0.00)     0.48 (0.00)               0.04 (0.00)               0.04 (0.06)

Cheapest, most popular products

Coefficients of global model and sample co-cluster models

Low market share

High income, large # visits

(17)

Leveraging multiple prediction models

Interpretability

Mining for the most certain predictions (KDD)

Robust versions: dropping outlier rows/cols (ICML)

Active learning

(18)

Active Learning with SCOAL

[Deodhar, Ghosh and Saar-Tsechansky, Information Systems Research’11]

Idea:

Acquire labels in regions with poor model fits

Incrementally add models as more data available.
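The first idea, acquiring labels where the current models fit poorly, can be illustrated with a simple heuristic. This sketch is an assumption made for illustration and is not the acquisition policy of the cited paper: it scores every unlabeled cell by the mean squared training error of the co-cluster it falls in and returns the `budget` highest-scoring cells. The function name `pick_queries` and its arguments are hypothetical.

```python
def pick_queries(sq_err, W, rho, gamma, k, l, budget):
    """Return up to `budget` unlabeled cells, worst-fitting co-clusters first.

    sq_err[u][v] is the squared error of the current model on cell (u, v);
    it is only read where W[u][v] == 1 (labeled cells).
    """
    sums = [[0.0] * l for _ in range(k)]
    counts = [[0] * l for _ in range(k)]
    for u, row in enumerate(W):
        for v, w in enumerate(row):
            if w:
                sums[rho[u]][gamma[v]] += sq_err[u][v]
                counts[rho[u]][gamma[v]] += 1

    def mean_err(g, h):
        # Co-clusters without any labeled cell get score 0.0 in this sketch.
        return sums[g][h] / counts[g][h] if counts[g][h] else 0.0

    unlabeled = [(u, v) for u, row in enumerate(W)
                 for v, w in enumerate(row) if not w]
    unlabeled.sort(key=lambda uv: mean_err(rho[uv[0]], gamma[uv[1]]),
                   reverse=True)
    return unlabeled[:budget]
```

Scoring empty co-clusters as 0.0 is a design choice of this sketch; a policy that favors exploration could rank them first instead.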

# Local models learnt by Global vs. local techniques

(19)

Issues with SCOAL

Hard partitioning double-edged

–  Mixed memberships may be more natural

Cold-start problem (new user/product)

–  Need to couple covariates with latent variables

Need uniform way of dealing with other side information and missing data

–  Social network

–  Free text

–  Concept hierarchies

(20)

Latent Dirichlet Affinity Aware Bayesian Estimation

customers

products

models

(21)

Generative Model

(22)

Incorporating a Social Network

(23)

Affinities Not Missing at Random

W = binary variable (present/absent)

Rating, y → P(rated; w = 1) → W

(24)

Key Properties of Inferencing

Direct EM formulation intractable

Variational methods available, providing mean-field approximations

Iterative; linear complexity per iteration

Dynamic versions smoothen over time (Kalman filtering, etc.).

Non-parametric Bayesian versions possible, but need sampling.

(25)

Simultaneous Decomposition and Prediction (SDaP) Philosophy

–  Simple, fast, interpretable, scalable

Useful for many large-scale applications involving multi-relational data

–  Predicting affinities

Soft versions: Generative models capture richer info/constraints

–  Still efficient but more opaque

A lot more to be done!

(26)

References

•  D. Agarwal, “Recommender Problems for Content Optimization”, Talk at MMDS, June 10.

•  Meghana Deodhar and Joydeep Ghosh, “SCOAL: A Framework for Simultaneous Co-clustering and Learning from Complex Data”, ACM Transactions on Knowledge Discovery from Data, Oct 2010

•  Meghana Deodhar and Joydeep Ghosh “Mining for the Most Certain Predictions from Dyadic Data”. KDD’09

(27)
(28)

ERIM Marketing Dataset

Household panel data collected by A.C. Nielsen

1714 customers, 121 products from 6 product categories

(ketchup, sugar, etc.)

Customer-product matrix cell values = # units purchased

–  Household Attributes – income, # residents, male head employed,

female head employed, total visits, total expense

–  Product Attributes – market share, price, # times product was

advertised

Properties

–  Fairly Sparse – 74.86% values are 0

(29)

Simultaneous Co-clustering and Classification

Elements of Z are class labels (2 class problem)

Logistic regression model relating attributes to class label

–  Log odds modeled as a linear combination of the attributes (= $\beta^T x_{ij}$)

Find (ρ, γ) and (βs) that minimize the total log loss (labels coded as $z_{uv} \in \{-1, +1\}$)

$$\sum_{u,v} w_{uv} \ln\left(1 + \exp\left(-z_{uv}\,\beta^{\rho(u)\gamma(v)\,T} x_{uv}\right)\right)$$
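As a minimal sketch of the classification objective, the function below evaluates a weighted logistic loss of the form sum over observed cells of ln(1 + exp(-z * beta^T x)), using the model of each cell's co-cluster. It assumes labels coded as -1/+1 and binary weights; the name `log_loss` and the nested-list data layout are hypothetical.

```python
import math

def log_loss(Z, W, X, betas, rho, gamma):
    """Total logistic loss over observed cells.

    Z[u][v] is a label in {-1, +1}, W[u][v] is 1 if the cell is observed,
    X[u][v] is the covariate vector of the cell, and betas[g][h] is the
    coefficient vector of co-cluster (g, h).
    """
    total = 0.0
    for u, row in enumerate(Z):
        for v, z in enumerate(row):
            if W[u][v]:
                beta = betas[rho[u]][gamma[v]]
                margin = z * sum(b * x for b, x in zip(beta, X[u][v]))
                # ln(1 + exp(-margin)), rearranged to avoid overflow
                # when |margin| is large.
                total += max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
    return total
```

Swapping this function in for the squared error is the only change the co-clustering loop needs in principle; fitting each logistic model would additionally require an iterative solver, which is not shown here.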

(30)
(31)

Related Approaches for Modeling Dyadic Data

Predictive discrete latent factor modeling

–  [Agarwal and Merugu ‘07]

–  Response variable modeled as a sum of

1.  Function of covariates (global structure)

2.  Co-cluster specific constant (local structure)

Matrix factorization with side information

–  Regression based latent factor modeling [Agarwal and Chen ‘09]

•  Cold-start problem

•  Regression based priors on latent user/item factors

–  Spatio-temporal Kalman filtering [Lu et al. ‘09]

•  MRF prior to capture spatial structure

(32)

PDLF Model

$$p(z_{ij} \mid x_{ij}) = \sum_{I=1}^{k} \sum_{J=1}^{l} \pi_{IJ}\, f_\phi(z_{ij};\ \mu_{ij}), \qquad g(\mu_{ij}) = \beta^T x_{ij} + \delta_{IJ}$$

Constrained mixture model

–  k * l components, $\pi_{IJ}$: mixture prior of the IJ-th component

Each component is a generalized linear model

–  $f_\phi$: exponential family, g: link function

Global trends $x_{ij}^T \beta$ shared across the components

(33)

PDLF vs. SCOAL

PDLF

–  Single global model + co-cluster constants

–  Robust even when data is limited

$$p(z_{ij} \mid x_{ij}, \rho, \gamma) = f_\phi\left(z_{ij};\ \beta^T x_{ij} + \delta_{\rho(i),\gamma(j)}\right), \quad [i]_1^m,\ [j]_1^n$$

SCOAL

–  k x l co-cluster models

–  Works well when large amount of data is available

$$p(z_{ij} \mid x_{ij}, \rho, \gamma) = f_\phi\left(z_{ij};\ \beta^{\rho(i),\gamma(j)\,T} x_{ij}\right), \quad [i]_1^m,\ [j]_1^n$$

Complementary approaches

(34)

Active Learning

Learner selects instances to be labeled such that the generalization accuracy is improved the most

Example:

Large scale recommender systems

–  Require large number of customers to rate many products

–  Obtaining ratings is expensive

Solution: Select those customers and products to query

(35)

Hierarchical BlockRank

Learning multiple local models involves tuning a lot of parameters

–  May not have enough data

–  Single global model may do better when training data is limited

Solution: Increase model complexity (# local models) as more labeled data is acquired

–  Begin with a single co-cluster

–  Perform model selection step every N acquisition steps

•  Increase # co-clusters if validation set error reduces

•  Use greedy, bisecting step to add one row/column cluster

(36)

Evaluation of the BlockRank Policy

(37)

Parallel SCOAL

[Joint work with Clinton Jones]

Expected to show maximum value on large datasets

–  Large heterogeneity

–  Sufficient data available for learning parameters

Distributed implementation using Map-Reduce framework

–  Developed on open source Hadoop platform

–  Experiments on Amazon EC2 clusters

SCOAL row/column cluster updates

–  Parallel distance computation for each matrix entry

(38)

Parallel SCOAL Pseudo Code

1. Learn co-cluster models

–  Map: <id, tuple> → <cc id, tuple>

–  Reduce (learn cc model): <cc id, <tuple>> → <id, tuple>

2. Update row clusters

–  Map (compute distance): <id, tuple> → <row id, tuple>

–  Reduce (agg. dist. and assign): <row id, <tuple>> → <id, tuple>

3. Update col clusters

–  Map (compute distance): <id, tuple> → <col id, tuple>

–  Reduce (agg. dist. and assign): <col id, <tuple>> → <id, tuple>
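The row-cluster update step can be mimicked in memory to show how the map and reduce phases divide the work. This is a sketch under assumptions, not the Hadoop implementation referred to above: `map_reduce` is a hypothetical single-process stand-in for the framework, each record is assumed to be a tuple `(u, v, x, z)` (row id, column id, covariate vector, observed value), and the co-cluster models `betas` are assumed to be linear.

```python
from collections import defaultdict

def map_reduce(records, mapper, reducer):
    """Run mapper over all records, group emitted values by key, then reduce."""
    groups = defaultdict(list)
    for rec in records:
        for key, value in mapper(rec):
            groups[key].append(value)
    return {key: reducer(key, values) for key, values in groups.items()}

def make_row_update(betas, gamma, k):
    """Build the mapper/reducer pair for the row-cluster update."""
    def mapper(rec):
        u, v, x, z = rec
        # Distance of this cell to each candidate row cluster g: squared error
        # under the model of co-cluster (g, gamma[v]).
        errs = [(z - sum(b * xi for b, xi in zip(betas[g][gamma[v]], x))) ** 2
                for g in range(k)]
        yield u, errs

    def reducer(u, values):
        # Aggregate per-cell distances over the row and assign the row to the
        # row cluster with the smallest total error.
        totals = [sum(col) for col in zip(*values)]
        return min(range(k), key=totals.__getitem__)

    return mapper, reducer
```

The column update is symmetric (key on `v`, loop over candidate column clusters), and the model-learning step keys each record on its co-cluster id so that one reducer fits one model.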

(39)

Results

(40)

Decoupled SCOAL

Grid Partitioning with SCOAL

Partitioning with D-SCOAL Global objective function requires

(41)

Future Directions

Exploiting additional structural information

–  Hierarchy/taxonomy on items

–  Social network on users

–  Spatial structure

Sparse co-cluster models and feature selection

–  Interaction between # models and model complexity

Robust error functions

–  Better at handling outliers
(42)

Row and Column Cluster Updates

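Objective function is a sum of row/column errors

–  Assign each row to a row cluster that minimizes the row error

–  Row cluster assignment for row u

$$\rho^{new}(u) = \arg\min_{g} \sum_{v=1}^{n} w_{uv}\,(z_{uv} - \hat{z}_{uv})^2$$

where $\hat{z}_{uv}$ is predicted with the model of co-cluster $(g, \gamma(v))$.

[Figure: row u evaluated against the models β11, β12, β21, β22, β31, β32 of three candidate row clusters, giving row errors e1(u), e2(u), e3(u)]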

(43)

Background

Difficult classification/regression problems may involve a heterogeneous population

Divide and conquer approach

–  Partition the population and model each partition separately

–  E.g., electric load forecasting

Advantages

–  Learning models on more homogeneous data

–  Improves accuracy

(44)

Motivation

Traditionally partitioning is a priori

–  Domain knowledge

–  Clustering algorithm

… but, a priori partitioning may be suboptimal

Solution: Interleaving partitioning and construction of prediction models

(45)

Approaches to Interactive Decomposition

Hard partitioning of input space for regression

–  CART [Breiman et al. ‘84]

–  Model trees (M5') [Wang and Witten ‘97]

Soft partitioning of input space for regression

–  Mixture of Experts [Jacobs and Jordan ‘91]

•  Hierarchical versions

–  Fuzzy clustering and modeling [Wedel and Steenkamp ‘89]

Output space partitioning for modeling large number of classes

(46)

Incremental Model Selection (M-SCOAL)

•  Top down, bisecting, greedy algorithm

Run SCOAL with k=1, l=1

Repeat

1. Split row/column cluster with highest error

2. Initialize SCOAL with current partitioning

3. Accept split if validation error is reduced

until no change in k and l

•  Gives better local minimum

[Figure: customers × products matrix]

(47)

Modeling Ordered Data

[Deodhar and Ghosh, Data Mining for Design and Mkt. Workshop, ICDM ‘08]

Dyadic data with implicit ordering along a mode

–  E.g., temporal marketing data

Segment the time axis

–  Dynamic programming (quadratic) or greedy (linear)
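The dynamic-programming option for segmenting the time axis can be sketched as follows. This is an illustration under assumptions rather than the method of the cited paper: the cost of a segment is taken to be the squared deviation of a 1-D series from the segment mean, standing in for whatever model-fit error a segment would have, and the function name `segment` is hypothetical. The recursion is quadratic in the number of time steps for each segment added.

```python
def segment(y, s):
    """Split y into s contiguous segments minimizing within-segment squared error.

    Returns (total_cost, [(start, end), ...]) with half-open index ranges.
    """
    T = len(y)
    if not 1 <= s <= T:
        raise ValueError("need 1 <= s <= len(y)")
    # Prefix sums give the cost of any segment y[i:j] in constant time.
    p1, p2 = [0.0], [0.0]
    for value in y:
        p1.append(p1[-1] + value)
        p2.append(p2[-1] + value * value)

    def cost(i, j):
        total = p1[j] - p1[i]
        return (p2[j] - p2[i]) - total * total / (j - i)

    inf = float("inf")
    # best[q][j]: minimal cost of splitting y[:j] into q segments.
    best = [[inf] * (T + 1) for _ in range(s + 1)]
    back = [[0] * (T + 1) for _ in range(s + 1)]
    best[0][0] = 0.0
    for q in range(1, s + 1):
        for j in range(q, T + 1):
            for i in range(q - 1, j):
                c = best[q - 1][i] + cost(i, j)
                if c < best[q][j]:
                    best[q][j] = c
                    back[q][j] = i
    # Walk the back-pointers from the end to recover the boundaries.
    bounds, j = [], T
    for q in range(s, 0, -1):
        i = back[q][j]
        bounds.append((i, j))
        j = i
    return best[s][T], bounds[::-1]
```

A greedy alternative would repeatedly split the segment whose best single cut reduces the cost the most; it is cheaper but not guaranteed to find the optimal boundaries.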

(48)

Results on Tensor ERIM Dataset

Dollars spent by households per week at 5 stores

–  1240 households X 50 weeks X 5 stores

Algorithm         Train R-sq.    Test MSE         Test R-sq.
Global            0.44           166.36 (1.27)    0.44
Cluster Models    0.45           156.06 (1.36)    0.48
SCOAL             0.58           142.08 (0.99)    0.52
CoSeg             0.60           132.94 (0.88)    0.55

(49)

Mining for the Most Reliable Predictions

[Deodhar and Ghosh, KDD ‘09]

Problem: Rank predictions by estimate of their accuracy

–  E.g., options market player strategy

SCOAL based ranking

–  Row-Col Ranking: Rank by estimated mean row error + col error
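Row-Col ranking as stated above can be sketched in a few lines. The sketch assumes that the "estimated mean row error" and "col error" are the mean squared training errors over the observed cells of a row and of a column; how the cited paper estimates them may differ. The function name `row_col_ranking` is hypothetical.

```python
def row_col_ranking(sq_err, W, candidates):
    """Order candidate cells (u, v) from most to least reliable prediction.

    sq_err[u][v] is the squared training error on cell (u, v), read only
    where W[u][v] == 1. A cell's score is mean row error + mean column error.
    """
    m, n = len(W), len(W[0])

    def mean(values):
        # Rows or columns without observed cells are treated as least reliable.
        return sum(values) / len(values) if values else float("inf")

    row_err = [mean([sq_err[u][v] for v in range(n) if W[u][v]])
               for u in range(m)]
    col_err = [mean([sq_err[u][v] for u in range(m) if W[u][v]])
               for v in range(n)]
    return sorted(candidates, key=lambda uv: row_err[uv[0]] + col_err[uv[1]])
```

For example, with candidates `[(0, 1), (2, 0)]` the cell whose row and column were fitted with the smaller average error is returned first.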

(50)

References
