Learning Generative Models

(1)


Adrian Barbu

(2)

(3)

Markov Random Fields

• Example problem

– Given noisy image

– Find clean version

• One approach:

– Observations = pixels Y=Y₁,…,Y_n

– Hidden variables = clean image X=X₁,…,X_m

• How to describe a “clean image”?

(4)

Markov Random Field

• Undirected Graph:

– A node is conditionally independent of every other node given its direct neighbors

• Probability Distribution

– A set of cliques C

– A set of potential functions on the cliques

(5)

Cliques

• Clique:

– A complete subgraph G’=(V’,E’) of G=(V,E)

– Complete means fully connected

• Maximal clique

– A clique such that there is no other clique that includes it

(6)

Markov Random Field Probability

Given:

• An undirected graph G

• Define a set C of cliques of G

• Define a set of potential functions

– Encourage certain configurations

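MRF probability: $P(x) = \frac{1}{Z} \prod_{c \in C} \psi_c(x_c)$, where $\psi_c$ is the potential function of clique $c$ and $Z$ is the normalization constant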

(7)

Exponential Form

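• Since all clique potentials $\psi_c(x_c)$ are positive, can use exponential form: $\psi_c(x_c) = \exp(f_c(x_c))$

• $f_c$ is also called potential

• Obtain exponential form of the probability: $P(x) = \frac{1}{Z} \exp\left(\sum_{c \in C} f_c(x_c)\right)$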
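The exponential form can be checked by brute force on a very small graph. The sketch below is only an illustration: the 3-node chain, the binary labels, and the value of `COUPLING` are assumptions that do not come from the slides. It enumerates all configurations to compute Z exactly, which is only feasible for tiny graphs.

```python
import itertools
import math

# Hypothetical toy graph: 3 binary nodes in a chain 0-1-2, pairwise cliques only.
EDGES = [(0, 1), (1, 2)]
COUPLING = 0.8  # assumed strength; positive values favour equal neighbours


def clique_energy(xi, xj):
    """Exponent of one pairwise clique: +COUPLING if labels agree, else -COUPLING."""
    return COUPLING if xi == xj else -COUPLING


def unnormalized(x):
    """exp(sum over cliques of the exponent), i.e. a product of positive potentials."""
    return math.exp(sum(clique_energy(x[i], x[j]) for i, j in EDGES))


def mrf_distribution(n_nodes=3):
    """Enumerate all 2^n configurations to obtain Z and P(x) exactly."""
    configs = list(itertools.product([0, 1], repeat=n_nodes))
    weights = [unnormalized(x) for x in configs]
    z = sum(weights)
    return {x: w / z for x, w in zip(configs, weights)}


P = mrf_distribution()
```

Smooth configurations such as `(0, 0, 0)` receive more probability than `(0, 1, 0)` because both cliques agree.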

(8)

Markov Random Fields

• Usual Applications:

– Observations Y=Y₁,…,Y_n

– Hidden variables X=X₁,…,X_m

• Generative Model

– Must enumerate all possible observation sequences

– Must assume for practical purposes:

• Conditional independence of the observations

• Each hidden variable X_i depends only on the corresponding Y_i

(9)

Conditional Random Fields

• Cliques only over the hidden variables

• Clique potentials depend on Y

• Discriminative model:

– X_i can depend on many Y’s

– Easier to learn

– For sequential data, fast inference

(10)

Conditional Random Field

• Definition

– An undirected GM globally conditioned on Y

• Discriminative model:

• Potential functions over a clique set C

– E.g. maximal cliques

– Dependent on the observations Y

– Learned from training data

(11)

Learning

• Training data:

• Log-likelihood

• Derivative w.r.t. $\lambda_j$

• $\tilde{p}$ is the empirical distribution of the training data

(12)

Training the CRF

• Gradient Ascent

• Compute

• Approximate
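For a log-linear CRF the gradient of the conditional log-likelihood is the observed feature count minus the feature count expected under the model. The sketch below illustrates this on a toy chain with exact enumeration. The two features, the data, the parameter name `lam`, and the learning rate are all assumptions made for the example, not the model on the slides.

```python
import itertools
import math


def features(x, y):
    """F_0: labels agreeing with the observations; F_1: equal neighbouring labels."""
    agree = sum(1.0 for xi, yi in zip(x, y) if xi == yi)
    smooth = sum(1.0 for a, b in zip(x, x[1:]) if a == b)
    return [agree, smooth]


def score(x, y, lam):
    return sum(l * f for l, f in zip(lam, features(x, y)))


def log_likelihood_and_grad(data, lam):
    """Exact conditional log-likelihood and gradient, enumerating all labelings."""
    ll, grad = 0.0, [0.0] * len(lam)
    for x, y in data:
        configs = list(itertools.product([0, 1], repeat=len(y)))
        weights = [math.exp(score(c, y, lam)) for c in configs]
        z = sum(weights)
        ll += score(x, y, lam) - math.log(z)
        observed = features(x, y)
        for j in range(len(lam)):
            expected = sum(w / z * features(c, y)[j] for c, w in zip(configs, weights))
            grad[j] += observed[j] - expected  # empirical minus model expectation
    return ll, grad


def train(data, steps=50, rate=0.1):
    """Plain gradient ascent on the log-likelihood."""
    lam = [0.0, 0.0]
    for _ in range(steps):
        _, grad = log_likelihood_and_grad(data, lam)
        lam = [l + rate * g for l, g in zip(lam, grad)]
    return lam


# Hypothetical training pairs (hidden labels x, observations y).
DATA = [((0, 0, 0, 1), (0, 1, 0, 1)),
        ((1, 1, 1, 1), (1, 1, 0, 1)),
        ((0, 0, 1, 1), (0, 0, 1, 1))]
```

Enumeration costs 2^n per example, so realistic models need the approximations listed on the slides.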

(13)

Discriminative Random Fields

• Overview

– Conditional Random Field for image analysis

– 2D graph structure

– Pairwise cliques

– Binary labels

• CRF Probability

– A_i = association potentials

– I_ij = interaction potentials

(14)

Potential Functions

• Association Potential

• P is a GLM (Generalized Linear Model)

– Logistic Regression with nonlinear transformation of the input y

• Can use other learning algorithms:

(15)

Potential Functions

• Interaction Potential

– Ising potential x_ix_j

– Learned potential based on logistic regression

– $0 \le K \le 1$ is a fudge factor

(16)

Learning

• Maximum Likelihood Estimation

– Requires evaluating the normalization constant Z, which is NP-hard

– Approximation by sampling (e.g. MCMC)

– Approximation by Mean Field or pseudo-likelihood

• Pseudo-likelihood approximation

(17)

Learning

• Pseudo-likelihood

with the constraint that $0 \le K \le 1$.

• Gradient ascent

– Initialize A by learning

• assume x_i are independent
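Pseudo-likelihood replaces the joint probability with a product of per-site conditionals, so Z is never needed. A minimal sketch follows, assuming a simplified Ising-style model with labels in {-1, +1}, a single association weight `h`, and a single interaction weight `beta`. These names and the model are illustrative stand-ins, not the DRF potentials themselves.

```python
import math


def neighbors(i, j, rows, cols):
    """4-connected neighbours of site (i, j) inside the grid."""
    for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        a, b = i + di, j + dj
        if 0 <= a < rows and 0 <= b < cols:
            yield a, b


def log_pseudo_likelihood(x, y, h, beta):
    """Sum over sites of log p(x_i | neighbours, y).

    With local field s_i = h*y_i + beta*sum_j x_j, the conditional is
    p(x_i | ...) = 1 / (1 + exp(-2 * x_i * s_i)).
    """
    rows, cols = len(x), len(x[0])
    total = 0.0
    for i in range(rows):
        for j in range(cols):
            s = h * y[i][j] + beta * sum(x[a][b] for a, b in neighbors(i, j, rows, cols))
            total += -math.log1p(math.exp(-2.0 * x[i][j] * s))
    return total
```

Each term only involves a site and its neighbours, so the cost is linear in the number of sites.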

(18)

Inference

• Given new test image y, find optimal x

Approaches:

• MAP (Maximum A Posteriori) estimation

– Maximize the posterior probability

– For binary labels, can use Min Cut/Max Flow algorithm

– Results are not very good for large 

• Maximum Posterior Marginal (MPM)

– Maximize the marginals

(19)

Inference

Approaches

• Iterated Conditional Modes

– Greedy

1. Start with initial labels x

2. Given current labels, maximize the conditionals at all sites
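The two ICM steps can be sketched on a binary grid. The model below is an assumed Ising-style stand-in (labels in {-1, +1}, data weight `h`, smoothness weight `beta`), and sites are updated one after another in raster order rather than all at once.

```python
def icm(y, h=1.0, beta=1.0, max_sweeps=10):
    """Iterated Conditional Modes on a grid of +/-1 labels (illustrative model)."""
    rows, cols = len(y), len(y[0])
    x = [row[:] for row in y]  # step 1: start from the observed labels
    for _ in range(max_sweeps):
        changed = False
        for i in range(rows):
            for j in range(cols):
                s = h * y[i][j]
                for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                    a, b = i + di, j + dj
                    if 0 <= a < rows and 0 <= b < cols:
                        s += beta * x[a][b]
                best = 1 if s >= 0 else -1  # step 2: maximize the local conditional
                if best != x[i][j]:
                    x[i][j] = best
                    changed = True
        if not changed:  # no label changed: a local optimum was reached
            break
    return x
```

Because every update is greedy, the result depends on the initial labels and is only a local optimum.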

(20)

Man-made Structure Detection

• Detect man-made structures in images

• Training set:

– 108 training, 129 testing images

– Size 256×384

– Images divided in 16×16 blocks

– 3,004 structured blocks

– 36,269 unstructured blocks

• Features

– Compute gradient orientation histograms at 3 different scales

– Intra-scale features= moments of the histograms at each scale

– Inter-scale features = angle between corresponding peaks of the histograms at two scales

(21)

Man-made Structure Detection

• Learning

– f_i are quadratic functions

– Equivalent to polynomial kernel of degree 2

– i = 4 nearest neighbors

(22)

Man-made Structure Detection

• Result using linear classifiers (linear decision boundary)

• Comparison with MRF and Logistic Regression

(23)

Man-made Structure Detection

Input image Logistic Regression

(24)

Man-made Structure Detection

(25)

Binary Image Denoising

• 4 images 64×64

• 2 noise models:

– Gaussian

– Mixture of two Gaussians

(26)

Binary Image Denoising

Logistic Regression

MRF

(27)

Binary Image Denoising

• Quantitative Evaluation

• DRF just barely better than MRF for image denoising

• Visually better

(28)

Discriminative Learning of MRFs

(29)

The MAP Estimation Problem

• Estimation problem:

Given input data y, solve $\hat{x} = \arg\max_x p(x \mid y)$

• Example: Image denoising

– Given noisy image y, find denoised image x

• Issues

– Modeling: How to approximate $p(x \mid y)$?

– Computing: How to find x fast?

Noisy image y

(30)

MAP Estimation Issues

• Popular approach:

– Find a very accurate model

– Find best optimum x of that model

• Problems with this approach

– Hard to obtain good $p(x \mid y)$

– Desired solution needs to be at global maximum

– For many models $p(x \mid y)$, the global maximum cannot be obtained in any reasonable time.

– Using suboptimal algorithms to find the maximum leads to suboptimal solutions

(31)

Markov Random Fields

• Bayesian Models:

– Markov Random Field (MRF) prior

• E.g. Image Denoising model

– Gaussian Likelihood

– Fields of Experts MRF prior

– Differential Lorentzian

– Image filters J_i

Image Filters J_i

(32)

MAP Estimation (Inference) in MRF

• Exact inference is too hard

– For the Potts model, one of the simplest MRFs, it is already NP-hard (Boykov et al, 2001)

• Approximate inference is suboptimal

– Gradient descent

– Iterated Conditional Modes (Besag 1986)

– Belief Propagation (Yedidia et al, 2001)

– Graph Cuts (Boykov et al, 2001)

(33)

Gradient Descent for Fields of Experts

• Energy function:

• Analytic gradient (Roth & Black, ‘05)

• Gradient descent iterations

– 3000 iterations with small step size

– Takes more than 30 min per image on a modern PC
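Gradient descent on an energy of this kind can be sketched in one dimension. The example assumes a Gaussian data term and a Lorentzian-type prior on the responses of a single finite-difference filter. The filter, `sigma`, `alpha`, and `step` are illustrative choices, not the learned Fields of Experts filters or the settings of Roth & Black.

```python
import math


def energy(x, y, sigma, alpha):
    """Gaussian data term plus a Lorentzian-type penalty on neighbouring differences."""
    data = sum((a - b) ** 2 for a, b in zip(x, y)) / (2.0 * sigma ** 2)
    prior = sum(alpha * math.log(1.0 + 0.5 * (x[k + 1] - x[k]) ** 2)
                for k in range(len(x) - 1))
    return data + prior


def gradient(x, y, sigma, alpha):
    """Analytic gradient of energy() with respect to x."""
    g = [(a - b) / sigma ** 2 for a, b in zip(x, y)]
    for k in range(len(x) - 1):
        r = x[k + 1] - x[k]                   # filter response at position k
        psi = alpha * r / (1.0 + 0.5 * r * r)  # derivative of the penalty at r
        g[k + 1] += psi
        g[k] -= psi
    return g


def denoise(y, sigma=1.0, alpha=1.0, step=0.1, iterations=4):
    """A few fixed-step gradient descent iterations starting from the noisy signal."""
    x = list(y)
    for _ in range(iterations):
        g = gradient(x, y, sigma, alpha)
        x = [a - step * b for a, b in zip(x, g)]
    return x
```

With a small fixed step the energy decreases at each iteration, which is why many iterations are needed to approach a minimum.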

(34)

Training the MRF

• Gradient update in model parameters

– Minimize KL divergence between learned prior and true probability

– Gradient ascent in log-likelihood

– Need to know Normalization Constant Z

– E_X from training data

– Z and E_p obtained by MCMC

– Slow to train

• Training the FOE prior

– Contrastive divergence (Hinton)

• An approximate ML technique

• Initialize at data points and run a fixed number of iterations

– Takes about two days

(35)

Going to Real-Time Performance

• Wainwright (2006)

– In computation-limited settings, MAP estimation is not the best choice

– Some biased models could compensate for the fast inference algorithm

• How much can we gain from biased models?

• Fast denoising approach:

– 1-4 gradient descent iterations (not 3000)

– Takes less than a second per image

(36)

Active Random Field

• Active Random Field = A pair (M,A) of

– a MRF model M, with parameters $\theta_M$

– a fast and suboptimal inference algorithm A with parameters $\theta_A$

• They cannot be separated since they are trained together

• E.g. Active FOE for image denoising

– Fields of Experts model

– Algorithm: 1-4 iterations of gradient descent

(37)

Training the Active Random Field

• Discriminative training

• Training examples = pairs

– inputs y_i + desired outputs t_i

• Training=optimization

• Loss function L

– Aka benchmark measure

– Evaluates accuracy on training set

– End-to-end training:

• covers entire process from input image to final result

(38)

Training Active Fields of Experts

• Training set

– 40 images from the Berkeley dataset (Martin 2001)

– Same as Roth and Black 2005

• Separate training for each noise level $\sigma$

• Loss function L = PSNR

– Same measure used for reporting results

$\sigma$ is the standard deviation of the noise
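PSNR is computed from the mean squared error between the denoised result and the target. A minimal sketch, assuming 8-bit images with peak value 255 that are passed as flat lists of pixel values (the function name and signature are illustrative):

```python
import math


def psnr(denoised, target, peak=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized pixel lists."""
    n = len(denoised)
    mse = sum((d - t) ** 2 for d, t in zip(denoised, target)) / n
    if mse == 0.0:
        return math.inf  # identical images
    return 10.0 * math.log10(peak ** 2 / mse)
```

An error with standard deviation 25 gives about 20.17 dB, which is the level of the noisy input at that noise setting.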

(39)

Training 1-Iteration ARF, $\sigma$=25

Grow the filters

1. Start with one filter, size 3x3

– Train until no improvement

– We found the particle in this subspace

2. Add another filter initialized with zeros

– Retrain to find the new mode

3. Repeat step 2 until there are 5 filters

4. Increase filters to 5x5

– Retrain to find new mode

5. Repeat step 2 until there are 13 filters

Figure: PSNR on training (blue) and testing (red) data while training the 1-iteration ARF, $\sigma$=25. Horizontal axis: training steps (×10000); vertical axis: PSNR. Marked stages: 1 filter 3x3, 2 filters 3x3, 5 filters 3x3, 5 filters 5x5, 6 filters 5x5, 13 filters 5x5.

(40)

Results

(41)

Standard Test Images

Lena Barbara Boats

(42)

Evaluation, Standard Test Images

$\sigma_{noise}$=25 29.67 29.01 31.03 28.78 28.65 30.89

Overcomplete DCT (Elad et al, 2006)

29.92 29.84 31.82 29.17 27.57 31.20

Globally Trained Dictionary (Elad et al, 2006) 30.42 29.73 32.15 29.28 29.60 31.32 KSVD (Elad et al, 2006) 28.99 28.90 30.14 28.66 27.10 30.15

Active FOE, 1 iteration

29.66 29.51 31.18 29.14 27.59 30.86

Active FOE, 4 iterations

30.16 29.21 31.40 29.37 29.13 31.69

Wavelet Denoising (Portilla et al, 2003)

31.15 30.16 32.86 29.91 30.72 32.08 BM3D (Dabov et al, 2007) 29.45 29.31 30.80 28.99 27.49 30.66

29.58 29.45 31.04 29.08 27.57 30.76

29.38 29.20 31.11 28.72 27.04 30.82

FOE (Roth & Black, 2005)

(43)

Evaluation, Berkeley Dataset

68 images from the Berkeley dataset

• Not used for training, not overfitted by other methods.

• Roth & Black ‘05 also evaluated on them.

(44)

Evaluation, Berkeley Dataset

Average PSNR on 68 images from the Berkeley dataset, not used for training.

1: Wiener Filter

2: Nonlinear diffusion

3: Non-local means (Buades et al, 2005)

4: FOE model, 3000 iterations

5,6,7,8: Our algorithm with 1,2,3 and 4 iterations

9: Wavelet based denoising (Portilla et al, 2003)

10: Overcomplete DCT (Elad et al, 2006)

(45)

Speed-Performance Comparison



Figure: denoising performance plotted against speed in frames per second, $\sigma_{noise}$=25.

(46)

The FRAME model

(47)

Maximum Entropy Principle

• Entropy:

• Properties:

– Measure of randomness

– Maximum for the uniform distribution

– Given mean and variance, is maximum for the normal distribution

• Maximum Entropy Principle
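The first two properties can be checked numerically for discrete distributions. The sketch below uses the natural logarithm; the example distributions are arbitrary.

```python
import math


def entropy(p):
    """Shannon entropy -sum_i p_i log p_i, with 0 log 0 taken as 0."""
    return -sum(q * math.log(q) for q in p if q > 0.0)


uniform = [0.25] * 4
skewed = [0.7, 0.1, 0.1, 0.1]
```

Here `entropy(uniform)` equals log 4, the largest value possible on four outcomes, while `entropy(skewed)` is smaller.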

(48)

Maximum Entropy

• Maximum entropy model

– Given some observed constraints F_j

– The most general model given the constraints

• Maximum entropy model given training data:

(49)

Maximum Entropy

• Observe features on a training set (x_i, y_i)

• The Maximum Entropy Model is a Gibbs Distribution

• If we only need to model x

(50)

Texture Modeling

• Want a probabilistic model for texture

• Given an image I and a filter $F^{(\alpha)}$, obtain the filtered image $I^{(\alpha)} = F^{(\alpha)} * I$

• Obtain the histogram of $I^{(\alpha)}$

• If f(I) is the true distribution of the images I, define the marginal at $v \in D$

(51)

Texture Modeling

Assumptions:

• Texture is homogeneous

• Texture can be captured by local filters

Obtain Max Entropy Formulation

– Uncountable number of constraints

(52)

Maximum Entropy for Texture

• Obtain

• $p(I; \Lambda_K, S_K)$ is a function of

– The potential functions $\Lambda_K = (\lambda_1, \ldots, \lambda_K)$

– The filters $S_K = (F_1, \ldots, F_K)$

(53)

Filters for Texture Modeling

• Five types of filters:

– Intensity filter (), captures DC component

– Isotropic center-surround filters (Laplacian of Gaussian)

– Gabor filters

T=2,4,6,8,10,12, $\theta$=0,30,60,90,120,150

– Spectrum analyzers

• Filter selection to choose a small number of filters

(54)

Synthesizing Textures

1. Initialize $I_{syn}$ as a uniform white noise texture

2. Repeat $w \times M \times N$ times

   1. Randomly pick location $v \in D$

   2. For all possible discrete val, compute p(I(v)=val | I(-v)) using $p(I; \Lambda_K, S_K)$

   3. Sample I(v) from p(val | I(-v))

• w is usually at least 4
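The sampling loop above is a Gibbs sampler. The sketch below follows the same steps on a toy model. The 4 gray levels, the horizontal and vertical difference filters, and the `POTENTIAL` table are assumptions standing in for the learned FRAME filters and potential functions, and the energy is recomputed in full at every step for clarity rather than speed.

```python
import math
import random

LEVELS = 4                        # assumed number of discrete gray levels
POTENTIAL = [0.0, 0.5, 1.5, 3.0]  # hypothetical potential, indexed by |filter response|


def energy(img):
    """Sum of potentials of horizontal and vertical difference-filter responses."""
    rows, cols = len(img), len(img[0])
    total = 0.0
    for i in range(rows):
        for j in range(cols):
            if j + 1 < cols:
                total += POTENTIAL[abs(img[i][j + 1] - img[i][j])]
            if i + 1 < rows:
                total += POTENTIAL[abs(img[i + 1][j] - img[i][j])]
    return total


def synthesize(rows=8, cols=8, w=4, seed=0):
    rng = random.Random(seed)
    # Step 1: uniform white noise initialization.
    img = [[rng.randrange(LEVELS) for _ in range(cols)] for _ in range(rows)]
    for _ in range(w * rows * cols):                      # step 2: w x M x N updates
        i, j = rng.randrange(rows), rng.randrange(cols)   # 2.1: random location
        energies = []
        for val in range(LEVELS):                         # 2.2: conditional for each val
            img[i][j] = val
            energies.append(energy(img))
        low = min(energies)
        weights = [math.exp(low - e) for e in energies]   # proportional to exp(-energy)
        img[i][j] = rng.choices(range(LEVELS), weights=weights)[0]  # 2.3: sample
    return img
```

A practical implementation would update only the filter responses affected by the changed pixel instead of recomputing the whole energy.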

(55)

Animal Fur Example

Figure panels: original image; synthesized with 0 filters; synthesized with 1 filter

(56)

Filtered Images

• The images obtained using the six selected filters:

– Laplacian of Gaussian

– 4 Gabor Cosine

(57)

Histograms

(58)

Potentials

(59)

Other Examples

(60)

Other Examples

(61)

Other Examples

• Observed texture (fabric) and synthesized using 3 filters:

– Two Spectrum analyzers

– One Laplacian of Gaussian

(62)

Sparse FRAME

• FRAME model where the features are localized

– Response of a basis function at a certain location, orientation and scale

(63)

Active Basis Model

Ying Nian Wu, UCLA

Joint work with Z. Si, S.C. Zhu

(64)

Lasso









$Y = \beta_1 X_1 + \ldots + \beta_p X_p + \epsilon$

$|Y - \sum_{j=1}^{p} \beta_j X_j|^2 + \lambda \sum_{j=1}^{p} |\beta_j|$



Tibshirani: lasso

Efron, Hastie, Johnstone, Tibshirani: LARS
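One simple way to minimize the lasso objective is cyclic coordinate descent with soft thresholding. This is a swapped-in solver for illustration, not the LARS algorithm cited above. The sketch assumes the objective |Y - sum_j beta_j X_j|^2 + lam * sum_j |beta_j| with no factor of 1/2, which is why the threshold is `lam / 2`.

```python
def soft_threshold(value, threshold):
    """Shrink value towards zero by threshold; exactly zero inside the band."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_coordinate_descent(columns, y, lam, sweeps=100):
    """Minimize |y - sum_j beta_j X_j|^2 + lam * sum_j |beta_j|.

    columns[j] is the regressor X_j given as a list of n numbers.
    """
    beta = [0.0] * len(columns)
    residual = list(y)  # y - sum_j beta_j X_j, with beta = 0 initially
    for _ in range(sweeps):
        for j, xj in enumerate(columns):
            norm_sq = sum(v * v for v in xj)
            if norm_sq == 0.0:
                continue
            # Correlation of X_j with the residual that excludes X_j's own term.
            rho = sum(v * r for v, r in zip(xj, residual)) + beta[j] * norm_sq
            new = soft_threshold(rho, lam / 2.0) / norm_sq
            delta = new - beta[j]
            if delta != 0.0:
                residual = [r - delta * v for r, v in zip(residual, xj)]
                beta[j] = new
    return beta
```

The soft threshold sets small coefficients exactly to zero, which is how the penalty performs variable selection.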

(65)

Non-convex regularization









$Y = \beta_1 X_1 + \ldots + \beta_p X_p + \epsilon$

Fan, Li: SCAD

Zhang: MCP

$|Y - \sum_{j=1}^{p} \beta_j X_j|^2 + \sum_{j=1}^{p} s_\lambda(\beta_j)$



(66)









$Y = \beta_1 X_1 + \ldots + \beta_p X_p + \epsilon$

Variable selection in linear regression

• Matching pursuit: forward selection

– Mallat, Zhang

• Bayesian variable selection
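Matching pursuit performs forward selection greedily: at each step it picks the regressor that best explains the current residual and subtracts its contribution. A minimal sketch, with regressors passed as plain lists (the function name and the return convention are illustrative):

```python
def matching_pursuit(columns, y, n_steps):
    """Greedy forward selection; columns[j] is the regressor X_j as a list."""
    beta = [0.0] * len(columns)
    residual = list(y)
    for _ in range(n_steps):
        best_j, best_coef, best_gain = None, 0.0, 0.0
        for j, xj in enumerate(columns):
            norm_sq = sum(v * v for v in xj)
            if norm_sq == 0.0:
                continue
            inner = sum(v * r for v, r in zip(xj, residual))
            gain = inner * inner / norm_sq  # reduction in squared residual norm
            if gain > best_gain:
                best_j, best_coef, best_gain = j, inner / norm_sq, gain
        if best_j is None:  # residual is orthogonal to every regressor
            break
        beta[best_j] += best_coef
        residual = [r - best_coef * v for r, v in zip(residual, columns[best_j])]
    return beta, residual
```

With an overcomplete set of regressors, a single well-matched element can explain a signal that would otherwise need several.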

(67)

$Y_m = \beta_{m,1} X_1 + \ldots + \beta_{m,p} X_p + \epsilon_m$

$(Y_m, m = 1, \ldots, M)$: Signals

$(X_1, \ldots, X_p)$: Dictionary, basis (overcomplete, $p > n$)

$\sum_{m=1}^{M} \{ |Y_m - \sum_{j=1}^{p} \beta_{m,j} X_j|^2 + \sum_{j=1}^{p} s_\lambda(\beta_{m,j}) \}$

Joint minimization over both $(\beta_{m,j})$ and $(X_j)$

Learning the sparse code

Redundant dictionary → sparse representation

(68)

$Y_m = \beta_{m,1} X_1 + \ldots + \beta_{m,p} X_p + \epsilon_m$

$(Y_m, m = 1, \ldots, M)$: Signals

$(X_1, \ldots, X_p)$: Dictionary or codebook

$\sum_{m=1}^{M} \{ |Y_m - \sum_{j=1}^{p} \beta_{m,j} X_j|^2 + \sum_{j=1}^{p} s_\lambda(\beta_{m,j}) \}$

Learning algorithm

Signal encoding: Given $(X_j)$, for each $Y_m$, infer $(\beta_{m,j})$

Dictionary re-learning: Given $(\beta_{m,j})$, update $(X_j)$
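The alternation between signal encoding and dictionary re-learning can be sketched with a deliberately simplified code. The example assumes each signal is encoded by a single unit-norm atom, so a hard one-atom constraint stands in for the sparsity penalty, and each atom is refit by least squares to the signals that selected it. All function names are illustrative.

```python
import math


def normalize(v):
    n = math.sqrt(sum(a * a for a in v))
    return [a / n for a in v] if n > 0 else list(v)


def encode(signal, atoms):
    """Signal encoding with a 1-sparse code: best single atom and its coefficient."""
    inners = [sum(a * s for a, s in zip(atom, signal)) for atom in atoms]
    j = max(range(len(atoms)), key=lambda k: abs(inners[k]))
    return j, inners[j]


def total_error(signals, atoms):
    """Sum of squared reconstruction errors under the 1-sparse encoding."""
    err = 0.0
    for s in signals:
        j, c = encode(s, atoms)
        err += sum((a - c * b) ** 2 for a, b in zip(s, atoms[j]))
    return err


def learn_dictionary(signals, atoms, rounds=10):
    atoms = [normalize(a) for a in atoms]
    for _ in range(rounds):
        codes = [encode(s, atoms) for s in signals]  # encoding step
        for j in range(len(atoms)):                  # dictionary re-learning step
            used = [(c, s) for (k, c), s in zip(codes, signals) if k == j]
            if not used:
                continue  # keep atoms that no signal selected
            # Least-squares direction for atom j given the fixed coefficients.
            fit = [sum(c * s[d] for c, s in used) for d in range(len(atoms[j]))]
            if any(fit):
                atoms[j] = normalize(fit)
    return atoms
```

Each round can only lower or keep the total reconstruction error, since refitting an atom and then re-encoding never does worse than the previous code.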

(69)

$Y_m = \beta_{m,1} X_1 + \ldots + \beta_{m,p} X_p + \epsilon_m$

$(Y_m, m = 1, \ldots, M)$: 12×12 natural image patches, or 144-dimensional vectors

$(X_1, \ldots, X_p)$: Dictionary or codebook of regressors, p = 288

Olshausen, Field

Dictionary

Localized, elongated, oriented Wavelets

Basis functions

Elementary signals Atoms

(70)

• Allow each basis to move locally

• Change of notation

$Y_m = \beta_{m,1} X_1 + \ldots + \beta_{m,p} X_p + \epsilon_m$

$I_m = c_{m,1} B_1 + \ldots + c_{m,p} B_p + U_m$

Active Basis Model: n-stroke template, n = 40 to 60, box = 100x100

Basis elements $B_{x,s,\alpha}$

$I_m = \sum_{i=1}^{n} c_{m,i} B_{x_i + \Delta x_{m,i},\, s,\, \alpha_i + \Delta\alpha_{m,i}} + U_m$

(71)

n-stroke template, n = 40 to 60, box = 100x100

$I_m = \sum_{i=1}^{n} c_{m,i} B_{x_i + \Delta x_{m,i},\, s,\, \alpha_i + \Delta\alpha_{m,i}} + U_m$

(72)

(73)

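Active Basis Model

$I_m = \sum_{i=1}^{n} c_{m,i} B_{x_i + \Delta x_{m,i},\, s,\, \alpha_i + \Delta\alpha_{m,i}} + U_m$

$(x_i, \alpha_i) = \arg\max_{x,\alpha} \sum_{m=1}^{M} \max_{\Delta x, \Delta\alpha} |\langle U_m, B_{x + \Delta x,\, s,\, \alpha + \Delta\alpha} \rangle|^2$

$(\Delta x_{m,i}, \Delta\alpha_{m,i}) = \arg\max_{\Delta x, \Delta\alpha} |\langle U_m, B_{x_i + \Delta x,\, s,\, \alpha_i + \Delta\alpha} \rangle|^2$

$c_{m,i} = \langle U_m, B_{x_i + \Delta x_{m,i},\, s,\, \alpha_i + \Delta\alpha_{m,i}} \rangle$

$U_m \leftarrow U_m - c_{m,i} B_{x_i + \Delta x_{m,i},\, s,\, \alpha_i + \Delta\alpha_{m,i}}$

Constrained sparsity

Collaborative/simultaneous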

(74)

Active Basis Model

(75)

• Active basis templates

– Arg-max inference and explaining away, no reweighting,

– Residual images neutralize existing elements, same set of training examples

• Adaboost templates

– No arg-max inference or explaining away inhibition

– Lack of basis deformation results in many near-duplicate bases

# of negatives: 10556 7510 4552 1493 12217 double # elements

same # elements

(76)

(77)

Learning Active Basis Models from Non-Aligned Images

(78)

(79)

(80)

Compositional Sparse Coding

Ying Nian Wu, UCLA

Based on joint work with Yi Hong, Zhangzhang Si, Wenze Hu, Song-Chun Zhu

(81)

Sparse coding model

$I_m = \sum_{i=1}^{N} c_{m,i} B_i + U_m$

$I_m = \sum_{i=1}^{N} c_{m,i} B_{x_{m,i},\, s_{m,i},\, \alpha_{m,i}} + U_m$

Rewrite active basis model in packed form

$I_m = \sum_{i=1}^{n} c_{m,i} B_{X_m + x_i + \Delta x_{m,i},\, s,\, \alpha_i + \Delta\alpha_{m,i}} + U_m = C_m B_{X_m} + U_m$

$B_{X_m} = (B_{X_m + x_i + \Delta x_{m,i},\, s,\, \alpha_i + \Delta\alpha_{m,i}},\ i = 1, \ldots, n)$

Represent image by a dictionary of active basis models

$(B^{(t)},\ t = 1, \ldots, T)$

$I_m = \sum_{k=1}^{K} C_{m,k} B^{(t_{m,k})}_{X_{m,k},\, S_{m,k},\, A_{m,k}} + U_m$

(82)

strokes → characters

(83)

letters → words (alphabet → dictionary)

Laplace: CONSTANTINOPLE (S. Geman)

Atomic decomposition in harmonic analysis

atoms → molecules

Compositionality: S. Geman, Potter, Chi

Zhu, Mumford: And-Or graph

Why compositional sparse code?

Sparser and more symbolic representation

Signal recovery

Classification and recognition

(84)

Learned dictionary of composition patterns from a training image

Generalize to testing images

One compositional pattern → many groups by spatial translation, rotation, scaling, deformation

Highly overcomplete/redundant dictionary

(85)

$Y_m = \beta_{m,1} X_1 + \ldots + \beta_{m,p} X_p + \epsilon_m$

$Y_m = C_{m,1} X_{G_1} + \ldots + C_{m,K} X_{G_K} + \epsilon_m$

Compositional Sparse Code

• How to learn dictionary of candidate groups (patterns)?

• An ill-advised strategy

(1) Perform variable selection for each response vector

(2) Discover groups (compositional patterns)

Step (1) is an early decision or commitment: the result is sparse but may not be patterned

• A better strategy

(86)

Learning algorithm: specify number and size of templates

The first 7 iterations

Learning in the 10th iteration

Image encoding: Given $(B^{(t)},\ t = 1, \ldots, T)$, encode each $I_m$ by $I_m = \sum_{k=1}^{K} C_{m,k} B^{(t_{m,k})}_{X_{m,k},\, S_{m,k},\, A_{m,k}} + U_m$

Dictionary re-learning: Given $(X_{m,k}, S_{m,k}, A_{m,k}, t_{m,k})$, update each $B^{(t)}$ by shared variable selection

(87)

(88)

(89)

(90)

(91)

(92)

(93)

(94)

(95)

(96)

(97)

(98)

(99)

(100)

(101)

15 training images: 61.63 ± 2.2%

30 training images: 68.49 ± 0.9%

(102)

Medical Imaging Application

• HIV virus in 3D electron microscopy images

• Find shape of glycoproteins on its boundary

(103)

Applications in Medical Imaging

• Generative models are powerful

– Need few training examples

– Great interpretation power

• Few applications in medical imaging

• Grand Challenge:

Find a visual dictionary for medical images (e.g. CT scans)

– Level 1: Gabor-like elements

– Level 2: Object parts like in the compositional sparse code

• Difficulties:

– 3D data would need thousands of filter directions

(104)

References

• J. Lafferty, A. McCallum, and F. Pereira. Conditional random fields: probabilistic models for segmenting and labeling sequence data. In ICML, 2001.

• Hanna M. Wallach. Conditional Random Fields: An Introduction. Technical Report MS-CIS-04-21. University of Pennsylvania, 2004.

• S. Kumar and M. Hebert. Discriminative Random Fields. IJCV, 2006

• SC Zhu, Y Wu, D Mumford. Filters, random fields and maximum entropy (FRAME): Towards a unified theory for texture modeling. IJCV, 1998

• Y.N. Wu, Z.Z. Si, H.F. Gong, S.C. Zhu. Learning Active Basis Model for Object Detection and Recognition. IJCV, 2010

• Y. Hong, Z.Z. Si, W.Z Hu, S.C. Zhu and Y.N. Wu. Unsupervised Learning of Compositional Sparse Code for Natural Image