Learning Generative Models
Adrian Barbu
Markov Random Fields
• Example problem
– Given noisy image
– Find clean version
• One approach:
– Observations = pixels Y=Y1,…,Yn
– Hidden variables = clean image X=X1,…,Xm
• How to describe a “clean image”?
Markov Random Field
• Undirected Graph:
– A node is conditionally independent of every other node given its direct
neighbors
• Probability Distribution
– A set of cliques C
– A set of potential functions on the cliques
Cliques
• Clique:
– A complete subgraph G’=(V’,E’) of G=(V,E)
– Complete means fully connected
• Maximal clique
– A clique such that there is no other clique that includes it
Markov Random Field Probability
Given:
• An undirected graph G
• Define a set C of cliques of G
• Define a set of potential functions
– Encourage certain configurations
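MRF probability:
$P(x) = \frac{1}{Z} \prod_{c \in C} \psi_c(x_c)$, where $Z = \sum_x \prod_{c \in C} \psi_c(x_c)$ is the normalization constant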
Exponential Form
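• Since all potential functions are positive, can use exponential form
• $\psi_c(x_c) = \exp(\phi_c(x_c))$; $\phi_c$ is also called potential
• Obtain exponential form of the probability: $P(x) = \frac{1}{Z} \exp\left(\sum_{c \in C} \phi_c(x_c)\right)$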
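To make the clique-potential definition concrete, here is a minimal sketch that evaluates an MRF distribution by brute-force enumeration. It assumes the exponential form with a positive sign, exp(sum of potentials), and a tiny graph so that Z can be summed exactly; the function name `mrf_distribution`, the 3-node chain and the `agree` potential are illustrative assumptions, not taken from the slides.

```python
import itertools
import math

def mrf_distribution(n_nodes, cliques, potentials, labels=(0, 1)):
    """Return {configuration: probability} for a small discrete MRF.

    cliques: list of tuples of node indices.
    potentials: list of functions phi_c(values) -> float, one per clique,
        used in the exponential form psi_c = exp(phi_c).
    """
    weights = {}
    for x in itertools.product(labels, repeat=n_nodes):
        # Unnormalized weight: product of exp(phi_c) = exp(sum of phi_c).
        score = sum(phi(tuple(x[i] for i in c))
                    for c, phi in zip(cliques, potentials))
        weights[x] = math.exp(score)
    z = sum(weights.values())  # partition function Z (exhaustive sum)
    return {x: w / z for x, w in weights.items()}

# Hypothetical 3-node chain whose pairwise potentials reward equal neighbors.
agree = lambda v: 1.0 if v[0] == v[1] else 0.0
p = mrf_distribution(3, [(0, 1), (1, 2)], [agree, agree])
```

The exhaustive sum over all configurations is only feasible for a handful of nodes, which is why the rest of the deck turns to approximations.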
Markov Random Fields
• Usual Applications:
– Observations Y=Y1,…,Yn
– Hidden variables X=X1,…,Xm
• Generative Model
– Must enumerate all possible observation sequences
– Must assume for practical purposes:
• conditional independence of the observations
• Each hidden variable Xi depends only on the corresponding Yi
Conditional Random Fields
• Cliques only over the hidden variables
• Clique potentials depend on Y
• Discriminative model:
– Xi can depend on many Y’s
– Easier to learn
– For sequential data, fast inference
Conditional Random Field
• Definition
– An undirected GM globally conditioned on Y
• Discriminative model:
• Potential functions over a clique set C
– E.g. maximal cliques
– Dependent on the observations Y
– Learned from training data
Learning
• Training data:
• Log-likelihood
• Derivative w.r.t. $\lambda_j$
• $\tilde{p}$ is the empirical distribution of the training data
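For reference, the standard CRF log-likelihood and its gradient can be written out as below. This is a sketch in the common log-linear parameterization; the feature functions $F_j$, the weights $\lambda_j$ and the indexing of the training pairs are assumed notation, with x the hidden labels and y the observations as elsewhere in these slides.

```latex
% Assumed log-linear CRF: hidden labels x, observations y, features F_j, weights \lambda_j.
p(x \mid y; \lambda) = \frac{1}{Z(y)} \exp\Big( \sum_j \lambda_j F_j(x, y) \Big)

% Log-likelihood of training data \{(x^{(k)}, y^{(k)})\}_{k=1}^{N}
L(\lambda) = \sum_{k=1}^{N} \Big[ \sum_j \lambda_j F_j(x^{(k)}, y^{(k)}) - \log Z(y^{(k)}) \Big]

% Gradient: empirical feature counts minus model-expected feature counts
\frac{\partial L}{\partial \lambda_j}
  = \sum_{k=1}^{N} F_j(x^{(k)}, y^{(k)})
  - \sum_{k=1}^{N} E_{p(X \mid y^{(k)}; \lambda)} \big[ F_j(X, y^{(k)}) \big]
```

The first sum is N times the expectation of $F_j$ under the empirical distribution of the training data; the second requires inference under the current model, which is the expensive part of each gradient step.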
Training the CRF
• Gradient Ascent
• Compute
• Approximate
Discriminative Random Fields
• Overview
– Conditional Random Field for image analysis
– 2D graph structure
– Pairwise cliques
– Binary labels
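• CRF Probability
$P(x \mid y) = \frac{1}{Z} \exp\left(\sum_{i} A_i(x_i, y) + \sum_{i} \sum_{j \in N_i} I_{ij}(x_i, x_j, y)\right)$
– Ai = association potentials
– Iij = interaction potentials
Potential Functions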
• Association Potential
• P is a GLM (Generalized Linear Model)
– Logistic Regression with nonlinear transformation of the input y
• Can use other learning algorithms:
Potential Functions
• Interaction Potential
– Ising potential $x_i x_j$
– Learned potential based on logistic regression
– 0 ≤ K ≤ 1 is a fudge factor
Learning
• Maximum Likelihood Estimation
– Needs the evaluation of the normalization constant Z, which is NP-hard
– Approximation by sampling (e.g. MCMC)
– Approximation by Mean Field or pseudo-likelihood
• Pseudo-likelihood approximation
Learning
• Pseudo-likelihood
with the constraint that 0 ≤ K ≤ 1.
• Gradient ascent
– Initialize A by learning
• assume xi are independent
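The pseudo-likelihood idea replaces the joint probability with a product of per-site conditionals, so Z is never needed. The sketch below illustrates this on a plain Ising model with labels in {-1, +1}, not on the DRF itself; the field `h`, the coupling `beta` and the function names are assumptions made for the example.

```python
import numpy as np

def neighbor_sum(x):
    """Sum of the 4-connected neighbors at every site (zero padding at the border)."""
    s = np.zeros(x.shape, dtype=float)
    s[1:, :] += x[:-1, :]
    s[:-1, :] += x[1:, :]
    s[:, 1:] += x[:, :-1]
    s[:, :-1] += x[:, 1:]
    return s

def log_pseudo_likelihood(x, h, beta):
    """Sum over sites of log p(x_i | neighbors) for an Ising model.

    p(x_i | neighbors) = exp(x_i * a_i) / (exp(a_i) + exp(-a_i)),
    with a_i = h_i + beta * (sum of neighboring labels).
    """
    a = h + beta * neighbor_sum(x)
    # logaddexp(a, -a) = log(exp(a) + exp(-a)), computed stably.
    return float(np.sum(x * a - np.logaddexp(a, -a)))
```

Each term depends only on a site and its neighbors, so the objective and its gradient with respect to `beta` or `h` are cheap to evaluate, at the cost of being an approximation to the true likelihood.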
Inference
• Given new test image y, find optimal x
Approaches:
• MAP (Maximum A Posteriori) estimation
– Maximize the posterior probability
– For binary labels, can use Min Cut/Max Flow algorithm
– Results are not very good for large
• Maximum Posterior Marginal (MPM)
– Maximize the marginals
Inference
Approaches
• Iterated Conditional Modes
– Greedy
1. Start with initial labels x
2. Given current labels, maximize the conditionals at all sites
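The two ICM steps above can be sketched as follows for binary labels in {-1, +1}. This is a minimal illustration assuming an Ising-style local conditional, where the score of a label at site i is x_i * (unary_i + beta * sum of neighbor labels); `unary`, `beta` and the sequential sweep order are assumptions of the sketch rather than details from the slides.

```python
import numpy as np

def icm(unary, beta, x0, max_sweeps=20):
    """Greedy coordinate-wise label updates; returns a local optimum."""
    x = x0.copy()
    rows, cols = x.shape
    for _ in range(max_sweeps):
        changed = False
        for r in range(rows):
            for c in range(cols):
                # Sum of the current labels of the 4-connected neighbors.
                s = 0.0
                if r > 0:
                    s += x[r - 1, c]
                if r < rows - 1:
                    s += x[r + 1, c]
                if c > 0:
                    s += x[r, c - 1]
                if c < cols - 1:
                    s += x[r, c + 1]
                # The conditional is maximized by the sign of the local field.
                a = unary[r, c] + beta * s
                new = 1 if a >= 0 else -1
                if new != x[r, c]:
                    x[r, c] = new
                    changed = True
        if not changed:  # no label changed in a full sweep: converged
            break
    return x
```

Because every update can only increase the local conditional, the procedure converges quickly, but the result depends on the initial labels.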
Man-made Structure Detection
• Detect man-made structures in images
• Training set:
– 108 training, 129 testing images
– Size 256×384
– Images divided in 16×16 blocks
– 3004 structured blocks
– 36,269 unstructured blocks
• Features
– Compute gradient orientation histograms at 3 different scales
– Intra-scale features= moments of the histograms at each scale
– Inter-scale features = angle between corresponding peaks of the histograms at two scales
Man-made Structure Detection
• Learning
– $f_i$ are quadratic functions
– Equivalent to polynomial kernel of degree 2
– $N_i$ = 4 nearest neighbors
Man-made Structure Detection
• Result using linear classifiers (linear decision boundary)
• Comparison with MRF and Logistic Regression
Man-made Structure Detection
Input image Logistic Regression
Man-made Structure Detection
Binary Image Denoising
• 4 images 64×64
• 2 noise models:
– Gaussian
– Mixture of two Gaussians
Binary Image Denoising
Logistic Regression
MRF
Binary Image Denoising
• Quantitative Evaluation
• DRF just barely better than MRF for image denoising
• Visually better
Discriminative Learning of MRFs
The MAP Estimation Problem
• Estimation problem:
Given input data y, solve $x = \arg\max_x p(x \mid y)$
• Example: Image denoising
– Given noisy image y, find denoised image x
• Issues
– Modeling: How to approximate $p(x \mid y)$?
– Computing: How to find x fast?
Noisy image y
MAP Estimation Issues
• Popular approach:
– Find a very accurate model
– Find best optimum x of that model
• Problems with this approach
– Hard to obtain a good $p(x \mid y)$
– Desired solution needs to be at global maximum
– For many models, the global maximum cannot be obtained in any reasonable time.
– Using suboptimal algorithms to find the maximum leads to suboptimal solutions
Markov Random Fields
• Bayesian Models:
– Markov Random Field (MRF) prior
• E.g. Image Denoising model
– Gaussian Likelihood
– Fields of Experts MRF prior
– Differential Lorentzian
– Image filters Ji
Image Filters Ji
MAP Estimation (Inference) in MRF
• Exact inference is too hard
– For the Potts model, one of the simplest MRFs, it is already NP-hard (Boykov et al, 2001)
• Approximate inference is suboptimal
– Gradient descent
– Iterated Conditional Modes (Besag 1986)
– Belief Propagation (Yedidia et al, 2001)
– Graph Cuts (Boykov et al, 2001)
Gradient Descent for Fields of Experts
• Energy function:
• Analytic gradient (Roth & Black, ‘05)
• Gradient descent iterations
– 3000 iterations with a small step size
– Takes more than 30 min per image on a modern PC
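A minimal sketch of gradient-descent MAP denoising with a Gaussian likelihood and Lorentzian penalties on filter responses is shown below. To stay short it swaps the learned Fields of Experts filters for plain horizontal and vertical finite differences, and the expert weight, step size and function names are assumptions; it illustrates the structure of the energy and its analytic gradient, not the model of Roth & Black.

```python
import numpy as np

def energy_and_grad(x, y, sigma, weight=1.0):
    """Energy = |x - y|^2 / (2 sigma^2) + weight * sum log(1 + d^2 / 2),
    where d runs over horizontal and vertical pixel differences of x."""
    resid = x - y
    energy = np.sum(resid ** 2) / (2 * sigma ** 2)
    grad = resid / sigma ** 2
    for axis in (0, 1):
        d = np.diff(x, axis=axis)                  # filter responses
        energy += weight * np.sum(np.log1p(0.5 * d ** 2))
        g = weight * d / (1.0 + 0.5 * d ** 2)      # derivative of log(1 + d^2/2)
        # Transpose of the difference operator: scatter g back to both pixels.
        if axis == 0:
            grad[1:, :] += g
            grad[:-1, :] -= g
        else:
            grad[:, 1:] += g
            grad[:, :-1] -= g
    return energy, grad

def denoise(y, sigma, n_iter=4, step=0.1):
    """Run a fixed number of gradient descent steps starting from the noisy image."""
    x = np.asarray(y, dtype=float).copy()
    for _ in range(n_iter):
        _, grad = energy_and_grad(x, y, sigma)
        x -= step * grad
    return x
```

With a small fixed step the energy decreases slowly, which is why thousands of iterations are needed to approach a minimum of the full model.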
Training the MRF
• Gradient update in model parameters
– Minimize KL divergence between learned prior and true probability
– Gradient ascent in log-likelihood
– Need to know Normalization Constant Z
– EX from training data
– Z and Ep obtained by MCMC
– Slow to train
• Training the FOE prior
– Contrastive divergence (Hinton)
• An approximate ML technique
• Initialize at data points and run a fixed number of iterations
– Takes about two days
Going to Real-Time Performance
• Wainwright (2006)
– In computation-limited settings, MAP estimation is not the best choice
– Some biased models could compensate for the fast inference algorithm
• How much can we gain from biased models?
• Fast denoising approach:
– 1-4 gradient descent iterations (not 3000)
– Takes less than a second per image
Active Random Field
• Active Random Field = A pair (M,A) of
– an MRF model M, with parameters $\theta_M$
– a fast and suboptimal inference algorithm A with parameters $\theta_A$
• They cannot be separated since they are trained together
• E.g. Active FOE for image denoising
– Fields of Experts model
– Algorithm: 1-4 iterations of gradient descent
Training the Active Random Field
• Discriminative training
• Training examples = pairs
– inputs $y_i$ + desired outputs $t_i$
• Training=optimization
• Loss function L
– Aka benchmark measure
– Evaluates accuracy on training set
– End-to-end training:
• covers entire process from input image to final result
Training Active Fields of Experts
• Training set
– 40 images from the Berkeley dataset (Martin 2001)
– Same as Roth and Black 2005
• Separate training for each noise level
• Loss function L = PSNR
– Same measure used for reporting results
is the standard deviation of
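As a reference point for the loss, PSNR for 8-bit images can be computed as in the sketch below. It uses the common definition 10 log10(peak^2 / MSE) with peak = 255, which is an assumption about the exact variant used here; the function name `psnr` is illustrative.

```python
import numpy as np

def psnr(clean, estimate, peak=255.0):
    """Peak signal-to-noise ratio in dB between two images of equal shape."""
    err = np.asarray(clean, dtype=float) - np.asarray(estimate, dtype=float)
    mse = np.mean(err ** 2)        # mean squared pixelwise error
    if mse == 0:
        return float("inf")        # identical images
    return float(10.0 * np.log10(peak ** 2 / mse))
```

Under this definition an image whose error has root-mean-square value 25 scores 20 log10(255/25), about 20.2 dB, so the denoising results reported later should be read against that baseline for noise level 25.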
Training 1-Iteration ARF, σ=25
Grow the filters
1. Start with one filter, size 3x3
– Train until no improvement
– We found the particle in this subspace
2. Add another filter initialized with zeros
– Retrain to find the new mode
3. Repeat step 2 until there are 5 filters
4. Increase filters to 5x5
– Retrain to find new mode
5. Repeat step 2 until there are 13 filters
Figure: PSNR on training (blue) and testing (red) data while training the 1-iteration ARF, σ=25. Axes: steps (×10^4) versus PSNR. Marked stages: 1 filter 3x3, 2 filters 3x3, 5 filters 3x3, 5 filters 5x5, 6 filters 5x5, 13 filters 5x5.
Results
Standard Test Images
Lena Barbara Boats
Evaluation, Standard Test Images
noise=25
Overcomplete DCT (Elad et al, 2006)              29.67  29.01  31.03  28.78  28.65  30.89
Globally Trained Dictionary (Elad et al, 2006)   29.92  29.84  31.82  29.17  27.57  31.20
KSVD (Elad et al, 2006)                          30.42  29.73  32.15  29.28  29.60  31.32
Active FOE, 1 iteration                          28.99  28.90  30.14  28.66  27.10  30.15
Active FOE, 2 iterations                         29.45  29.31  30.80  28.99  27.49  30.66
Active FOE, 3 iterations                         29.58  29.45  31.04  29.08  27.57  30.76
Active FOE, 4 iterations                         29.66  29.51  31.18  29.14  27.59  30.86
Wavelet Denoising (Portilla et al, 2003)         30.16  29.21  31.40  29.37  29.13  31.69
BM3D (Dabov et al, 2007)                         31.15  30.16  32.86  29.91  30.72  32.08
FOE (Roth & Black, 2005)                         29.38  29.20  31.11  28.72  27.04  30.82
Evaluation, Berkeley Dataset
68 images from the Berkeley dataset
• Not used for training, not overfitted by other methods.
• Roth & Black ‘05 also evaluated on them.
Evaluation, Berkeley Dataset
Average PSNR on 68 images from the Berkeley dataset, not used for training.
1: Wiener Filter
2: Nonlinear diffusion
3: Non-local means (Buades et al, 2005)
4: FOE model, 3000 iterations
5,6,7,8: Our algorithm with 1, 2, 3 and 4 iterations
9: Wavelet based denoising (Portilla et al, 2003)
10: Overcomplete DCT (Elad et al, 2006)
Speed-Performance Comparison
noise=25
Figure: speed-performance plot; axis label: frames per second.
The FRAME model
Maximum Entropy Principle
• Entropy: $H(p) = -\sum_x p(x) \log p(x)$
• Properties:
– Measure of randomness
– Maximum for the uniform distribution
– Given mean and variance, is maximum for the normal distribution
• Maximum Entropy Principle
Maximum Entropy
• Maximum entropy model
– Given some observed constraints Fj
– The most general model given the constraints
• Maximum entropy model given training data:
Maximum Entropy
• Observe features on a training set $(x_i, y_i)$
• The Maximum Entropy Model is a Gibbs Distribution
• If we only need to model x
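The step from the maximum entropy principle to a Gibbs distribution can be sketched with Lagrange multipliers. The notation below (features $F_j$, observed averages $\bar{F}_j$, multipliers $\lambda_j$) is assumed for the derivation, and the sign of $\lambda_j$ is a convention that some presentations flip.

```latex
% Maximize entropy subject to matching the observed feature averages.
\max_p \; H(p) = -\sum_x p(x) \log p(x)
\quad \text{s.t.} \quad \sum_x p(x) F_j(x) = \bar{F}_j \;\; (j = 1, \dots, k),
\qquad \sum_x p(x) = 1

% Stationarity of the Lagrangian with respect to p(x):
-\log p(x) - 1 + \lambda_0 + \sum_j \lambda_j F_j(x) = 0

% Solving for p(x) and absorbing \lambda_0 into the normalization:
p(x) = \frac{1}{Z(\lambda)} \exp\Big( \sum_j \lambda_j F_j(x) \Big), \qquad
Z(\lambda) = \sum_x \exp\Big( \sum_j \lambda_j F_j(x) \Big)
```

The multipliers are then chosen so that the model expectations of the features equal the observed averages, which is equivalent to maximum likelihood within this exponential family.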
Texture Modeling
• Want a probabilistic model for texture
• Given an image I and a filter F
obtain
• Obtain the histogram of I
• If f(I) is the true distribution of the images I, define the marginal at v ∈ D
Texture Modeling
Assumptions:
• Texture is homogeneous
• Texture can be captured by local filters
Obtain Max Entropy Formulation
– Uncountable number of constraints
Maximum Entropy for Texture
• Obtain
• $p(I; \Lambda_K, S_K)$ is a function of
– The potential functions $\Lambda_K = (\lambda_1, \dots, \lambda_k)$
– The filters $S_K = (F_1, \dots, F_k)$
Filters for Texture Modeling
• Five types of filters:
– Intensity filter δ(), captures DC component
– Isotropic center-surround filters (Laplacian of Gaussian)
– Gabor filters, T=2,4,6,8,10,12, θ=0,30,60,90,120,150
– Spectrum analyzers
• Filter selection to choose a small number of filters
Synthesizing Textures
1. Initialize $I_{syn}$ as a uniform white noise texture
2. Repeat w×M×N times
   1. Randomly pick location v ∈ D
   2. For all possible discrete values val, compute p(I(v)=val | I(-v)) using $p(I; \Lambda_K, S_K)$
   3. Sample I(v) from p(val | I(-v))
• w is usually at least 4
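The sampling loop above is a single-site Gibbs sampler. The sketch below shows that loop for a generic model p(I) proportional to exp(-energy(I)); the `energy` callable stands in for the FRAME potentials applied to filter-response histograms, and its name, the sign convention and the argument layout are assumptions. For clarity it re-evaluates the full energy for each candidate value, which a practical implementation would replace with local updates.

```python
import numpy as np

def gibbs_synthesize(energy, shape, levels, sweeps=4, seed=0):
    """Sample an image from p(I) proportional to exp(-energy(I))."""
    rng = np.random.default_rng(seed)
    M, N = shape
    img = rng.choice(levels, size=shape)            # step 1: uniform white noise
    for _ in range(sweeps * M * N):                 # step 2: w x M x N site visits
        r, c = rng.integers(M), rng.integers(N)     # 2.1: random location v
        e = np.empty(len(levels))
        for k, val in enumerate(levels):            # 2.2: energy of each candidate
            img[r, c] = val
            e[k] = energy(img)
        p = np.exp(-(e - e.min()))                  # shift for numerical stability
        p /= p.sum()                                # p(I(v) = val | I(-v))
        img[r, c] = rng.choice(levels, p=p)         # 2.3: sample the new value
    return img
```

A usage example with a toy energy: `gibbs_synthesize(lambda im: 50.0 * np.sum(im != 1), (4, 4), np.array([0, 1]), sweeps=20)` drives every pixel to 1, because any other value is exponentially unlikely under that energy.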
Animal Fur Example
Original image Synthesized with 0 filters Synthesized with 1 filter
Filtered Images
• The images obtained using the six selected filters:
– Laplacian of Gaussian
– 4 Gabor Cosine
Histograms
Potentials
Other Examples
• Observed texture (fabric) and synthesized using 3 filters:
– Two Spectrum analyzers
– One Laplacian of Gaussian
Sparse FRAME
• FRAME model where the features are localized
– Response of a basis function at a certain location, orientation and scale
Active Basis Model
Ying Nian Wu, UCLA
Joint work with Z. Si, S.C. Zhu
Lasso
$Y = \beta_1 X_1 + \dots + \beta_p X_p + \epsilon$
$|Y - \sum_{j=1}^{p} \beta_j X_j|^2 + \lambda \sum_{j=1}^{p} |\beta_j|$
Tibshirani: lasso
Efron, Hastie, Johnstone, Tibshirani: LARS
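One simple way to minimize the lasso objective is proximal gradient descent (ISTA), sketched below. This is not the LARS algorithm cited above; it is swapped in because it fits in a few lines. The objective is taken to be |Y - X beta|^2 + lam * sum |beta_j|, and the function names and iteration count are assumptions.

```python
import numpy as np

def soft_threshold(z, t):
    """Proximal operator of t * |.|: shrink every entry toward zero by t."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_ista(X, Y, lam, n_iter=500):
    """Minimize |Y - X beta|^2 + lam * sum_j |beta_j| by ISTA.

    X: (n, p) matrix whose columns are the regressors X_j; Y: (n,) response.
    """
    L = 2.0 * np.linalg.norm(X, 2) ** 2    # Lipschitz constant of the gradient
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = 2.0 * X.T @ (X @ beta - Y)  # gradient of the squared error term
        beta = soft_threshold(beta - grad / L, lam / L)
    return beta
```

For an orthonormal design the solution is a soft-thresholding of the least-squares coefficients at lam/2, which shows directly how the penalty sets small coefficients exactly to zero.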
Non-convex regularization
$Y = \beta_1 X_1 + \dots + \beta_p X_p + \epsilon$
$|Y - \sum_{j=1}^{p} \beta_j X_j|^2 + \sum_{j=1}^{p} s(\beta_j)$
Fan, Li: SCAD
Zhang: MCP
Variable selection in linear regression
• Matching pursuit: forward selection
– Mallat, Zhang
• Bayesian variable selection
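Matching pursuit as forward selection can be sketched in a few lines: at each step pick the dictionary element most correlated with the current residual and subtract its contribution. The sketch assumes unit-norm columns in `X`; the function name and return values are illustrative.

```python
import numpy as np

def matching_pursuit(X, Y, n_steps):
    """Greedy sparse approximation of Y by columns of X.

    X: (n, p) dictionary with unit-norm columns; Y: (n,) signal.
    Returns the coefficient vector and the final residual.
    """
    residual = np.asarray(Y, dtype=float).copy()
    beta = np.zeros(X.shape[1])
    for _ in range(n_steps):
        corr = X.T @ residual                # response of every element
        j = int(np.argmax(np.abs(corr)))     # forward selection: best match
        beta[j] += corr[j]
        residual -= corr[j] * X[:, j]        # explain away the selected part
    return beta, residual
```

Each step removes the projection of the residual onto one unit vector, so the residual norm never increases; the number of steps controls the sparsity of the code.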
$Y_m = \beta_{m,1} X_1 + \dots + \beta_{m,p} X_p + \epsilon_m$
$(Y_m, m = 1, \dots, M)$: Signals
$(X_1, \dots, X_p)$: Dictionary, basis (overcomplete, $p > n$)
$\sum_{m=1}^{M} \{ |Y_m - \sum_{j=1}^{p} \beta_{m,j} X_j|^2 + \sum_{j=1}^{p} s(\beta_{m,j}) \}$
Joint minimization over both $(\beta_{m,j})$ and $(X_j)$
Learning the sparse code
Redundant dictionary sparse representation
$Y_m = \beta_{m,1} X_1 + \dots + \beta_{m,p} X_p + \epsilon_m$
Signals: $(Y_m, m = 1, \dots, M)$
Codebook or dictionary: $(X_1, \dots, X_p)$
$\sum_{m=1}^{M} \{ |Y_m - \sum_{j=1}^{p} \beta_{m,j} X_j|^2 + \sum_{j=1}^{p} s(\beta_{m,j}) \}$
Learning algorithm
Signal encoding: Given $(X_j)$, for each $Y_m$, infer $(\beta_{m,j})$
Dictionary re-learning: Given $(\beta_{m,j})$, update $(X_j)$
$Y_m = \beta_{m,1} X_1 + \dots + \beta_{m,p} X_p + \epsilon_m$
$(Y_m, m = 1, \dots, M)$: 12×12 natural image patches or 144-dimensional vectors
$(X_1, \dots, X_p)$: Dictionary or codebook of regressors, p = 288
Olshausen, Field
Dictionary
Localized, elongated, oriented Wavelets
Basis functions
Elementary signals Atoms
• Allow each basis to move locally
• Change of notation
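$Y_m = \beta_{m,1} X_1 + \dots + \beta_{m,p} X_p + \epsilon_m$
$I_m = c_{m,1} B_1 + \dots + c_{m,p} B_p + U_m$
Active Basis Model
n-stroke template, n = 40 to 60, box = 100x100
$B_{x,s,\alpha}$
$I_m = \sum_{i=1}^{n} c_{m,i} B_{x_i + \Delta x_{m,i},\, s,\, \alpha_i + \Delta\alpha_{m,i}} + U_m$
$(x_i, \alpha_i) = \arg\max_{x,\alpha} \sum_{m=1}^{M} \max_{\Delta x, \Delta\alpha} |\langle U_m, B_{x + \Delta x,\, s,\, \alpha + \Delta\alpha} \rangle|^2$
$(\Delta x_{m,i}, \Delta\alpha_{m,i}) = \arg\max_{\Delta x, \Delta\alpha} |\langle U_m, B_{x_i + \Delta x,\, s,\, \alpha_i + \Delta\alpha} \rangle|^2$
$c_{m,i} = \langle U_m, B_{x_i + \Delta x_{m,i},\, s,\, \alpha_i + \Delta\alpha_{m,i}} \rangle$
$U_m \leftarrow U_m - c_{m,i} B_{x_i + \Delta x_{m,i},\, s,\, \alpha_i + \Delta\alpha_{m,i}}$
Constrained sparsity
Collaborative/simultaneous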
Active Basis Model
• Active basis templates
– Arg-max inference and explaining away, no reweighting,
– Residual images neutralize existing elements, same set of training examples
• Adaboost templates
– No arg-max inference or explaining away inhibition
– Lack of basis deformation results in many near-duplicate bases
# of negatives: 10556 7510 4552 1493 12217 double # elements
same # elements
Learning Active Basis Models from Non-Aligned Images
Compositional Sparse Coding
Ying Nian Wu, UCLA Based on joint work with
Yi Hong, Zhangzhang Si, Wenze Hu, Song-Chun Zhu
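Sparse coding model
$I_m = \sum_{i=1}^{N} c_{m,i} B_i + U_m$
$I_m = \sum_{i=1}^{N} c_{m,i} B_{x_{m,i},\, s_{m,i},\, \alpha_{m,i}} + U_m$
Rewrite active basis model in packed form
$I_m = \sum_{i=1}^{n} c_{m,i} B_{X_m + x_i + \Delta x_{m,i},\, s,\, \alpha_i + \Delta\alpha_{m,i}} + U_m = C_m B_{X_m} + U_m$
$B_{X_m} = (B_{X_m + x_i + \Delta x_{m,i},\, s,\, \alpha_i + \Delta\alpha_{m,i}},\; i = 1, \dots, n)$
Represent image by a dictionary of active basis models
$(B^{(t)}, t = 1, \dots, T)$
$I_m = \sum_{k=1}^{K} C_{m,k} B^{(t_{m,k})}_{X_{m,k},\, S_{m,k},\, A_{m,k}} + U_m$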
strokes → characters
letters → words (alphabet → dictionary)
Laplace: CONSTANTINOPLE (S. Geman)
Atomic decomposition in harmonic analysis
atoms → molecules
Compositionality: S. Geman, Potter, Chi
Zhu, Mumford: And-Or graph
Why compositional sparse code?
Sparser and more symbolic representation
Signal recovery
Classification and recognition
Learned dictionary of composition patterns from a training image
Generalize to testing images
One compositional pattern → many groups by spatial translation, rotation, scaling, deformation → highly overcomplete/redundant dictionary
$Y_m = \beta_{m,1} X_1 + \dots + \beta_{m,p} X_p + \epsilon_m$
$Y_m = C_{m,1} X_{G_1} + \dots + C_{m,K} X_{G_K} + \epsilon_m$
Compositional Sparse Code
• How to learn dictionary of candidate groups (patterns)?
• An ill-advised strategy
(1) Perform variable selection for each response vector
(2) Discover groups (compositional patterns)
Step (1) is an early decision or commitment: sparse, but may not be patterned
• A better strategy
Learning algorithm: specify number and size of templates
The first 7 iterations
Learning in the 10th iteration
Image encoding: Given $(B^{(t)}, t = 1, \dots, T)$, encode each $I_m$ by $I_m = \sum_{k=1}^{K} C_{m,k} B^{(t_{m,k})}_{X_{m,k},\, S_{m,k},\, A_{m,k}} + U_m$
Dictionary re-learning: Given $(X_{m,k}, S_{m,k}, A_{m,k}, t_{m,k})$, update each $B^{(t)}$ by shared variable selection
15 training images: 61.63 ± 2.2%
30 training images: 68.49 ± 0.9%
Medical Imaging Application
• HIV virus in 3D electron microscopy images
• Find shape of glycoproteins on its boundary
Applications in Medical Imaging
• Generative models are powerful
– Need few training examples
– Great interpretation power
• Few applications in medical imaging
• Grand Challenge: Find a visual dictionary for medical images (e.g. CT scans)
– Level 1: Gabor-like elements
– Level 2: Object parts like in the compositional sparse code
• Difficulties:
– 3D data would need thousands of filter directions
References
• J. Lafferty, A. McCallum, and F. Pereira. Conditional random fields: probabilistic models for segmenting and labeling sequence data. In ICML, 2001.
• Hanna M. Wallach. Conditional Random Fields: An Introduction. Technical Report MS-CIS-04-21. University of Pennsylvania, 2004.
• S. Kumar and M. Hebert. Discriminative Random Fields. IJCV, 2006.
• S.C. Zhu, Y. Wu, D. Mumford. Filters, random fields and maximum entropy (FRAME): Towards a unified theory for texture modeling. IJCV, 1998.
• Y.N. Wu, Z.Z. Si, H.F. Gong, S.C. Zhu. Learning Active Basis Model for Object Detection and Recognition. IJCV, 2010.
• Y. Hong, Z.Z. Si, W.Z. Hu, S.C. Zhu and Y.N. Wu. Unsupervised Learning of Compositional Sparse Code for Natural Image