Nijmegen
The following full text is a publisher's version.
For additional information about this publication click this link.
http://hdl.handle.net/2066/90034
Please be advised that this information was generated on 2017-12-06 and may be subject to
change.
Neural D ecoding w ith Hierarchical Generative M odels
M arcel A. J. v a n G erv enRadboud University Nijmegen, Institute fo r Computing and Information Sciences, 6525 A J Nijmegen, the Netherlands, and Radboud University Nijmegen, Institute for Brain, Cognition and Behaviour, 6525 E N Nijmegen, the Netherlands
F loris P. de Lange
Radboud University Nijmegen, Institute fo r Brain, Cognition and Behaviour, 6525 E N Nijmegen, the Netherlands
Tom H eskes
Radboud University Nijmegen, Institute fo r Computing and Information Sciences, 6525 A J Nijmegen, the Netherlands, and Radboud University Nijmegen, Institute for Brain, Cognition and Behaviour, 6525 E N Nijmegen, the Netherlands
R ecent research h a s sh o w n th a t re co n stru c tio n of perceiv ed im ag es b a se d o n h e m o d y n am ic re sp o n se as m e a su re d w ith fu n c tio n a l m ag n etic reso n an ce im a g in g (fM RI) is sta rtin g to becom e feasible. In th is letter, w e ex p lo re re c o n stru c tio n b a se d o n a le a rn e d h ierarch y of featu res b y em p lo y in g a hie rarch ica l gen e rativ e m o d e l th a t co n sists of c o n d itio n a l restricted B o ltzm an n m achines. In a n u n su p e rv ise d p h a se , w e le a rn a h ierarch y of featu res from data, a n d in a su p e rv ise d p h ase, w e le a rn h o w b r a in ac tiv ity p red icts th e states of th o se features. R eco n stru ctio n is a ch iev ed b y sa m p lin g from th e m o d e l, c o n d itio n e d o n b ra in activity. We sh o w th a t b y u sin g th e h ierarch ical g en era tiv e m o d el, w e can o b ta in g o o d -q u ality reco n stru ctio n s of v isu a l im ages of h a n d w ritte n d ig its p re se n te d d u rin g a n fM R I sca n n in g session.
1 I n tr o d u c tio n _____________________________________________________ Recent developm ents in cognitive neuroscience have sh o w n th at it is possi ble to infer m ental state from n eu ro im ag in g d ata (Haxby et al., 2001; Thirion et al., 2006; Kay, N aselaris, Prenger, & G allant, 2008; M itchell et al., 2008; M iyaw aki et al., 2008; N aselaris, Prenger, Kay, Oliver, & G allant, 2009). These break th ro u g h s in neu ral decoding, p o p u larized as b rain reading, hold m uch prom ise as a n ew ap p ro ach for stu d y in g h u m an b rain function (see H assabis et al., 2009; Stokes, T hom pson, C usack, & D uncan , 2009, for
a n u m b er of exam ples). D ecoding can be distin g u ish ed into classification, identification, an d reconstruction (Kay & G allant, 2009). The aim of classifi cation is to infer to w hich of a sm all n u m b er of stim ulus classes a particular brain state belongs. The goal of identification is to determ ine w hich stim ulu s from a candidate set of possible stim uli explains the observed brain state. This is typically achieved by tem plate m atching, w here the observed brain state is com pared w ith the predicted b rain state of stim uli in the can d idate set. R econstruction, finally, uses the observed b rain state in order to reconstruct the actual stim ulus rath er th an choosing from a class or set of potential stim uli.
A n early decoding exam ple is the study by H axby et al. (2001), w hich show ed th at different stim ulus categories can be classified reliably using fMRI. M ore recently, Kay et al. (2008) and M itchell et al. (2008) show ed th at the identification of previously unseen stim uli by com paring predicted hem odynam ic response w ith m easured hem odynam ic response is feasible. T hirion et al. (2006) show ed th at perceived or im agined contrast patterns can be reconstructed from retinotopic inform ation. Finally, M iyaw aki et al. (2008) an d N aselaris et al. (2009) d em o n strated th a t the reconstruction of perceived im ages from m easured hem odynam ic response is possible by decoding m ultivariate activation patterns. N ote th a t reconstruction is m uch h ard er th an either classification or identification since it requires inference ab o u t w hich stim ulus (from a sheer infinite set of possible stim uli) caused the observed b rain activity. In contrast, classification and identification boils d o w n to the selection of one stim ulus from a restricted set of possible stim uli.
In this study, our goal is to b u ild a reconstruction m odel th at aim s to m im ic neu ral encoding an d to in v ert this m odel in order to reconstruct perceived stim uli from m easured b ra in activity—in our case, hem odynam ic responses in visual cortex. A n influential hypothesis ab o u t neural encoding is the predictive coding hypothesis, w hich states th at the b rain tries to infer the causes of its sensations (H elm holtz, 1867; Barlow, 1961; Rao & Ballard, 1998). This principle, together w ith hierarchical organization as a key organizational principle of the b rain (Zeki & Ship p , 1988; Fellem an & Van Essen, 1991), suggests th at the h u m a n b rain m ay em body a hierarchical generative m odel w here to p -d o w n drive along the hierarchy encodes our prior beliefs ab o u t the presence or absence of abstract causes an d w here b o tto m -u p d rive along the hierarchy encodes sensory inform ation (Lee & M um ford, 2003; Friston, 2005).
O u r reconstruction m odel is k n o w n as a deep belief n etw o rk —a hier archical generative m odel w hose b u ild in g blocks are kn o w n as restricted B oltzm ann m achines (Sm olensky, 1986; H inton, O sindero, & Teh, 2006) an d w hose latent causes (i.e., stim ulus features) are learned from data. This app ro ach is different from m ost existing reconstruction studies w here re construction is based on predefined features (e.g., m anually constructed im age patches or fixed G abor filters). This has as an advan tag e th at the m odel can be ad a p te d to d ata sets w ith different stim ulus characteristics.
latent causes
o o o o o
O
w, W,To o o o o
O
A T w2 : w2t Co o o o o
o
W t ; Jw tOOOOO
o
O'-i
i 3 a 0 1 •< sensory inputFigure 1: The hierarchical generative model implements how latent causes h explain sensory input v conditional on observed brain activity z. Reconstruction proceeds by sampling from the model while clamping z. Dashed arcs represent interactions that are used during training but are not part of the generative model.
z
h
v
The w o rk presented in Fujiw ara, M iyaw aki, an d K am itani (2009) is another recent exam ple w here features are lea rn ed from d ata using canonical corre lation analysis, albeit w ith o u t m aking use of a hierarchical architecture as em ployed in this letter. The hierarchical natu re of ou r m odel allow s low- level reconstructions to be influenced by com plex im age features. Such deep m odels do m ore justice to the hierarchical organization of the neocortex and sh o u ld lead to im proved reconstruction perform ance since w e m ay benefit from the relation betw een com plex im age features an d response properties of neurons in higher cortical areas (H edge & Van Essen, 2000).
We show th at our hierarchical generative m odel is able to reconstruct in d iv id u al h an d w ritte n digits w ith lo w reconstruction error w here recon struction quality, as determ in ed by a behavioral experim ent, im proves w h en the hierarchy consists of m ultiple layers.
2 H ierarchical G en erativ e M o d e l ____________________________________ The aim of the hierarchical generative m odel, sh o w n in Figure 1, is to pro v ide a m odel of the interactions betw een the stim ulus v, latent causes h, and b ra in activity z. We start from the a ssu m p tio n th at the laten t causes can be m odeled in term s of a hierarchy of processing units th a t detect increasingly com plex features (statistical invariances), analogous to the hierarchical or ganization of visual cortex (Fellem an & Van Essen, 1991). The key idea of our stu d y is to use an u n su p erv ise d learn in g phase to learn the laten t
causes th at best explain observed d ata a n d to use a sup erv ised learning phase in order to learn h o w observed hem odynam ic response is linked to these latent causes. The resulting hierarchical generative m odel captures how b rain activity arises from the invariances in our environm ent. Recon struction is achieved by sam pling from the m odel conditional on observed b rain activity.
2.1 U n su p e rv ise d L earning Phase. In the un su p erv ise d learning phase, w e disregard the conditional p a rt of the m odel an d focus only on learning a hierarchy of features. O ne w ay to represent such a feature hierarchy is in term s of a deep belief n etw o rk (H inton et al., 2006), w hich consists of sm aller m odules, kno w n as restricted B oltzm ann m achines. We first describe the theory beh in d (restricted) B oltzm ann m achines an d subsequently describe how they can be u sed to com pose a deep belief netw ork.
A B oltzm ann m achine (H inton & Sejnowski, 1983) is a n etw o rk of sym m etrically coupled units th a t associates a scalar energy to each state x of the variables of interest:
exp(- E (x))
p(x) = ---Z--- ’ ( )
w here Z = ^ x e x p ( -E(x)) is the p artitio n function and E(x) = — 1 xTWx — b Tx is the energy of a state w ith w e ig h t m atrix W an d bias term s b. A Boltz m an n m achine norm ally consists of stochastic binary (conditional Bernoulli) units w here the probability of a u n it xi being active given the states of the rem aining units is given by the follow ing Gibbs sam pling u p d a te rule:
p(xi = 1 | x_i) = a
|^;
+Y2
w ijx jj
, (2.2) w ith sigm oid function a (x) = (1 + e x p ( -x )) -1.Param eter learning becom es interesting w h e n som e of the units are vis ible and the rem aining units are hidden: x = (v, h). The h id d e n units h th en act as latent variables th at m odel distributions over the visible state vectors v th at cannot be m odeled by direct pairw ise interactions betw een visible units. The average g rad ien t of the log likelihood over a train in g set
D = (v"}”^ 1 w ith respect to one of the m odel p aram eters 0 is th en given by
E p ( > J * m = E, ( > J M ) — E p i i m ) . (2.3)
w here p is the m odel distribution, p is the em pirical distribution, a n d F (v) = — log ^ 2h exp(— E(v, h)) is the free energy. Learning in B oltzm ann m achines
is hard since it takes a long tim e to reach the equilibrium d istrib u tio n and the learning signal is noisy, as it is the difference of tw o sam p led expecta tions. Fortunately, learning becom es easier w h en one m akes use of restricted Boltzm ann m achines (RBMs) (Sm olensky, 1986), w here interactions are re stricted to taking place only betw een visible a n d h id d en units such th at they can be described in term s of a bip artite graph.
L earning in restricted B oltzm ann m achines can be carried out m ore ef ficiently u sing the n otion of contrastive divergence (H inton, 2002; H inton et al., 2006). The energy function of a restricted B oltzm ann m achine is bilinear,
E(v, h) = - h T W v - cTv - b Th, (2.4)
such th a t the free energy of the in p u t can be com puted efficiently u sing the distributive law of probability theory:
F (v) = — l o g £ exp(hr W v + cT v + b T h)
= - c Tv - log £ r [ e xP h A b i + ^ WijVj hi,...,h„ i y y j
= ~ cT v -
J2
logJ2
expI
h iI
bi +J2
' (2.5)F urtherm ore, for a restricted B oltzm ann m achine, w e readily obtain the follow ing conditionals:
p (h I v ) =
n
p (hi I v ) =n
°
I
(2hi -im
bi+m
'p (v | h) = ["] p (vj | h) = Y [ a |^(2vj — 1^ c; + Y 1 w ijh i J J ■
The factorization enjoyed by RBMs brings ab o u t tw o benefits. First, Ep ( - - ) can be com puted analytically. Second, the set of variables in (v, h) can be sam p led in tw o substeps in each step of a Gibbs sam pling chain. We sam ple h given v an d th en a new v given h, starting from a training exam ple vi (by sam pling from the em pirical d istrib u tio n p), such th a t after k steps, we obtain vk+i ~ p(v | hk). The id ea of k-step contrastive divergence (CD-k) involves an approxim ation th a t introduces som e bias in the gradient. We ru n the chain for only k steps to obtain the follow ing p aram eter u p d a te
after seeing an exam ple v1,
A 9 a dF (vk+1^ ^ (2.6)
39 39
w here we replaced the averages over all in p u ts in equation 2.3 by sin gle sam ples vk+i an d vi. Instead of sam pling, it is also possible to use a m ean field ap p ro ach w here expected activations instead of sam p led states are p ro p ag ated th ro u g h the m odel (Welling & H in to n , 2002). We use this strategy w hen sam pling h id d en layer activations. The average g rad ien t is obtained by averaging A 9 over m an y exam ples. D espite the bias introduced by these approxim ations, contrastive divergence w orks well in practice.
The feature hierarchy can be constructed by stacking restricted B oltzm ann m achines on top of each other, w here the o u tp u t of the previous RBM acts as in p u t to the next RBM. The RBMs are trained using contrastive divergence an d the stacked m odel, kn o w n as a deep belief n etw o rk (DBN), is fine-tuned u sing the w ake-sleep algorithm (H inton, D ayan, Frey, & N eal, 1995; H in to n et al., 2006). The obtained m odel can be used to detect fea tures by forw ard p ro p ag atio n of activations, b u t it can also be in terpreted as a generative m odel th a t consists of a top-level associative m em ory that can be used to generate sam ples y (H inton, 2007). In other w o rd s, given a state of the associative m em ory defined by the u p p e r tw o layers, we can reconstruct the input. In our experim ents, we will use gray-scale im ages as in p u t to our m odel. O ne w ay to achieve this u sing stochastic b in ary units is by m a p p in g real values to b inary activation probabilities (H inton et al., 2006). F urtherm ore, we will penalize large p aram eter values by a d d in g a sm all w eight decay term to the p aram eter u p d ates to avoid overfitting.
2.2 S u p e rv ise d L earning Phase. In order to use a DBN for neural de coding, w e need to condition the m odel on observed b rain activity z d u rin g reconstruction. O ne w ay to achieve this is to m ake use of conditional re stricted B oltzm ann m achines (Salakhutdinov, M nih, & H in to n , 2007; Taylor, H inton, & Roweis, 2006) such th a t m odel param eters becom e a function of z (Bengio, 2009). H ere, w e assum e th at the bias term s b and c becom e linearly d ep e n d en t on z sim ply by w riting the energy function as
It is assu m ed th a t z also includes a constant to m odel arb itrary offsets as in a stan d ard RBM. It follow s th at the conditional free energy becom es
E (v, h | z) = - h TW v — z T C v — zT Bh ■ (2.7)
k
(2.8) w hose derivatives can readily be c om puted (Taylor et al., 2006).
Figure 2: The experimental procedure consists of three stages. During the un supervised learning phase, m any stimuli are presented in order to learn their latent causes. During the supervised learning phase, a small num ber of stimuli are presented, together w ith brain activity, which was measured w hen subjects observed the stimuli. Finally, a stim ulus that has n o t been previously seen by the model is reconstructed by sampling from the hierarchical generative model conditional on measured brain activity.
In the sup erv ised learning phase, w e replace the restricted B oltzm ann m achines th a t have previously been train ed in the u n su p erv ise d phase by conditional restricted B oltzm ann m achines. L earning proceeds as before, b u t n o w w e keep the interactions fixed and u p d a te only the bias term s based on the observed brain activity. E ssentially this allow s our observa tions to m o d u late the probability th at certain feature detectors are active or inactive. In the follow ing, w e will u p d a te the bias term s only for the top-level associative m em ory—the top tw o layers in the hierarchy.
2.3 R econstruction. In order to reconstruct perceived stim uli, w e use the follow ing p rocedure (see Figure 2). First, w e learn the feature hierarchy in an u n su p e rv ised m an n er u sing th o u san d s of stim uli. Second, w e learn in a sup erv ised m anner ho w the biases are influenced by observed b rain activ ity th a t is acquired w hile presenting a sm all n u m b er of stim uli. Finally, w e generate a reconstruction using the hierarchical generative m odel by per form ing conditional sam pling. T hat is, w e perform one Gibbs sam pling step in the top-level associative m em ory conditional on brain activity and a sec ond unconditional G ibbs sam pling step in the top-level associative m em ory, an d th en w e p ropagate expectations back to the in p u t layer. R econstruction error betw een a stim ulus v an d its reconstruction r consisting of N pixels is
quantified in term s of city-block distance: e(v, r) = N ^ N=1 IV — ri I • In or der to obtain a m ore subjective assessm ent of reconstruction perform ance, w e asked five subjects to rate h o w w ell reconstructions m atch the stim uli.
3 E x p e r im e n t_______________________________________________________ Stim uli consisted of 2106 h an d w ritten gray-scale digits at a 28 x 28 pixel resolution taken from the train in g set of the M NIST database (h ttp ://y a n n .le c u n .c o m /e x d b /m n is t). We selected 1000 h an d w ritte n 6s an d 1000 h an d w ritte n 9s in order to train a hierarchical generative m odel (m odel stim uli). This subset of the available d a ta w as fo u n d to be sufficiently large for m odel estim ation. A dditionally, for the im aging experim ent (see below ), w e selected 53 h an d w ritte n 6s a n d 53 h an d w ritten 9s (presentation stim uli). The choice for these tw o digits w as based on the consideration th at the variatio n w ith in an d betw een these tw o digit classes w as quite large. The form er ensures th a t reconstruction w ill be nontrivial, w hile the latter allow s us to assess m ore easily if digit specific features are learned. The m odel stim uli w ere do w n sam p led from 28 x 28 pixel to 16 x 16 pixel im ages. The p resentation stim uli w ere scaled to fit the full visual field.
We collected 106 trials in one participant. In each trial, a h an d w ritten 6 or 9 w as presented to the subject. The character rem ained visible for 12.5 seconds and flickered at a rate of 6 H z on a black background. In order to ensure sustained attention d u rin g the entire scanning session, the subject's task w as to m aintain fixation to a fixation dot an d to detect a brief (33 ms) change in color from red to green and back occurring once and random ly w ith in a trial. D etection w as indicated by pressing a b u tto n w ith the right- h an d th u m b as fast as possible. Trials w ere sep arated by a 12.5 second intertrial interval. The 106 trials w ere p artitio n ed ran d o m ly into four runs interspersed w ith 30 second rest periods.
Blood-oxygenation-level d ep en d en t (BOLD) sensitive functional im ages w ere obtained by m eans of a Siem ens 3T MRI system u sing a 32-channel coil for signal reception. We used a single-shot g rad ien t EPI sequence w ith a repetition tim e (TR) of 2500 ms, echo tim e (TE) of 30 m s, an d isotropic voxel size of 2 x 2 x 2 m m . F unctional im ages w ere acquired in 42 axial slices in ascending order. A high-resolution anatom ical im age w as acquired using an MP-RAGE sequence (TE/TR = 3^39/2250 ms; 176 sagittal slices, w ith isotropic voxel size of 1 x 1 x 1 mm).
Functional data w ere preprocessed an d analyzed w ith in the fram ew ork of SPM5 (Statistical P aram etric M apping, w w w .fil.ion.ucl.ac.uk/spm ). F unctional b rain volum es w ere m otion-corrected a n d coregistered w ith the anatom ical scan. F unctional d ata w ere d etren d ed an d high-pass-filtered. The volum es acquired 10 to 15 seconds after trial onset w ere averaged in order to obtain an estim ate of the steady-state response in in d iv id u al voxels. A s in p u t to our reconstruction m odel, w e u sed the 1000 voxels th a t show ed
(a)
total num ber of hidden units
(b)
Figure 3: Reconstruction error as a function of the num ber of included voxels (a) and as a function of the num ber of hidden units for models consisting of one, two, or three hidden layers (b).
the largest difference betw een task a n d rest conditions according to a stan d ard general linear m odel (GLM) analysis. As expected, the selected voxels w ere alm ost exclusively located in the occipital lobe, w hich contains the m ain cortical visual areas.
W hen training the m odels in the u n su p erv ised phase, w e ran the con trastive divergence algorithm (CD-1) follow ed by the w ake-sleep algorithm for bo th 100 iterations using a learning rate of 0.1 an d a w eig h t p en alty of 0.001 on all 2000 m odel stim uli. O ne iteration consisted of a single pass over the training data, w hich w as p artitio n ed into m inibatches of 100 trials. All reconstructions w ere com puted u sing a leave-one-out cross-validation schem e. For each of the 106 presen tatio n stim uli, all 105 rem aining stim uli w ere used to train a m odel in the sup erv ised phase by ru n n in g CD-1 for 100 iterations u sing a sm all learning rate of 0.0001 and a w eig h t penalty of 0.001. Subsequently, a reconstruction w as p ro d u ced for the rem aining stim ulus using Gibbs sam pling as described above.
4 R e s u l t s ___________________________________________________________ We start by com puting reconstruction error w h e n predicting the gray value of in d iv id u al pixels directly from BOLD response. This can be interpreted as u sing just the visible layer of an RBM, w here gray values are taken to be the posterior probabilities of the h id d e n u n it activations an d w hich is equivalent to solving a set of in d e p en d en t logistic regression problem s. Figure 3a show s the decrease in reconstruction error as the n u m b er of included voxels increases. N ote the slight increase in error w h e n all voxels are included, possibly d ue to overfitting. R econstruction error averaged over all im ages based on 1000 in clu d ed voxels w as 0.063.
stimulus pixel level one layer two layers three layers
Figure 4: Image reconstructions for the first 10 images for all different models.
We no w m ove on to the hierarchical generative m odel. O ur p rim a ry goal is to determ ine w h eth er a d d in g a layer of h id d en units im proves the recon struction. Figure 3b depicts the average reconstruction error as a function of the nu m b er of h id d e n u n its for one, tw o, or three h id d en layers. A lthough there is quite som e variability in the reconstruction error, it is clear th a t the reconstruction error decreases as a function of the n u m b er of h id d e n units. Furtherm ore, m odels w ith tw o or three h id d en layers seem to outperform m odels consisting of one h id d en layer. M inim um reconstruction error for one-, tw o-, an d three-layer m odels w as 0.085, 0.081, an d 0.082, respectively.
N ote th a t the reconstruction error for the hierarchical generative m o d els is larger th a n th a t of the pixel-level reconstructions. H ow ever, it is also w ell k n o w n th a t error m easures such as city-block distance do n o t accu rately capture the quality of the reconstructions as perceived by hum ans an d alternative m etrics have been defined th a t try to incorporate p ro p er ties of the h u m an visual system (Wang, Bovik, Sheikh, & Sim oncelli, 2004). This is also ap p are n t in the reconstructions sh o w n in Figure 4. We q u an tified the subjective experience of reconstruction perform ance by asking five naive subjects to ran k each of the reconstructions according to how w ell they m atch the stim uli. The h istogram in Figure 5 show s th at on av erage, the pixel-level m odel an d the one-layer m odel give less preferred reconstructions com pared to the tw o-layer an d three-layer reconstructions. Differences in preference am ong all m odels w ere significant as com puted by a W ilcoxon ran k su m test (p = 0.01).
In order to gain an u n d e rsta n d in g of w h a t invariances are represented by our m odel, w e can visualize the features th at have been learned by the h id d en units sim ply by representing the elem ents of the w eig h t m atrix be longing to a h id d e n un it as a 16 x 16 im age. The features learned in the first layer of the hierarchical m odel resem ble G abor filters; th at is, the features are optim ally responsive for som e location, orientation, an d spatial frequency (one of these features is sh o w n in the top left of Figure 6). Features learned by layers higher u p the hierarchy are m ore of a d istrib u ted nature. The hi erarchical generative m odel also allow s us to probe w hich voxels code for
B Q B B B B B B B B
B Q B B D Q D n B O
□ □ □ □ □ □ □ □ □ □
□ □ □ □ □ □ □ □ □ □
B B B B B B B B B B
250 200 150 100 50 0
best good worse
preferred order worst I pixel level I one layer I two layers three layers
Figure 5: Preferred order of the models determ ined from their reconstructions by five naive subjects.
w hich feature. For exam ple, Figure 6 show s positive and negative voxel con tributions to one of the feature nodes in a one-layer m odel. Both early visual an d lateral occipital cortex contribute to this feature. A s w o u ld be expected, activity in early visual cortex decreases the likelihood of the feature being ac tive, in line w ith the feature coding (am ong others) for the absence of visual stim ulation in the visual field. Involvem ent of the lateral occipital com plex is w ell in line w ith previous findings th at this region is one of the sites involved in the encoding of letters and letter strings (Vinckier et al., 2007). 5 D is c u s s io n _______________________________________________________ We have d em o n strated th a t hierarchical generative m odels can be u sed to obtain good-quality reconstructions of im ages based on m easured hem o d y nam ic response. We have also sh o w n th at reconstructions obtained b y deep (m ultilayered) m odels lead to generally preferred solutions. F urtherm ore, hierarchical generative m odels can be used to probe h o w in d iv id u al voxels influence the activation an d deactivation of features an d , as a consequence, to learn ab o u t the neural encoding of certain stim ulus characteristics. In spection of anatom ical localization of feature voxels show s th at bo th early visual an d lateral occipital cortex contribute to the reconstruction of digits.
O u r m otivation for u sing hierarchical generative m odels w as tw ofold. First, w e w an te d to learn stim ulus features from d a ta instead of using a fixed basis set. This allow s the m odel to a d a p t to the statistics of the data at h and. This said, it m ay be possible to use a large data set of n atu ral im age patches in order to learn a generic basis set th at can be ap p lied to any d ata set (Lee, Grosse, R anganath, & N g , 2009). Second, w e w an ted
Figure 6: Anatomical localization of voxels that contribute to one of the features in a one-layer model. The corresponding feature is shown in the top left panel. The gray blob near the calcarine sulcus represents negative param eter values, and the gray blob in lateral occipital cortex, indicated by the arrow, represents positive param eter values.
to use a hierarchical architecture such th a t com plex features could also influence reconstruction perform ance. A lth o u g h deep m odels gave better reconstructions th a n shallow m odels, there w as no clear relation betw een layers in the visual hierarchy and layers in the m odel. Voxels in early visual cortex seem ed to be m ost predictive for each layer in the m odel.
In order to assess reconstruction perform ance, w e have u se d b o th city block distance, an d a subjective m easure of reconstruction perform ance. In term s of city-block distance, pixel-level reconstructions gave the best per form ance, w hereas in term s of the subjective m easure, a tw o-layer m odel gave the best perform ance. The reason for this ap p aren t contradiction is due to the fact th at city-block distance, a lth o u g h useful for testing convergence,
is n o t an optim al m easure of reconstruction perform ance. R econstructions u sing deep m odels are obtained as com positions of the in d iv id u al feature activations. Since these features have som e fixed spatial layout, they m ay activate pixels th a t w ere n o t active in the original m odel. F urtherm ore, al th o u g h deep m odels generally lead to high-quality sm ooth reconstructions, they can in som e cases generate a reconstruction th at belongs to the w rong stim ulus class. For exam ple, stim uli 2 and 5 in Figure 4 are erroneously reconstructed as a 9 instead of a 6, w hereas the pixel-level reconstructions still show som e correspondence w ith the original im age. This occasional m isclassification w ill induce a large contribution to the city-block distance. The subjective m easure, in contrast, w ill be tolerant of sm all differences (e.g., rotations or translations) betw een an original im age an d its recon struction w hile penalizing the n o nsm ooth appearance of the pixel-level reconstructions.
In ou r experim ent, a sim ple block design w as used w here stim uli w ere presented as flashing stim uli for prolonged periods. This yielded high- quality reconstructions because w e could m ake use of the h ig h signal-to- noise ratio th at is afforded by the steady-state response. A t the sam e tim e, this prohibited the presen tatio n of a large n u m b er of stim uli an d therefore necessitated the use of a restricted stim ulus set consisting of h an d w ritte n 6s an d 9s only. This also sim plified the reconstruction problem since the stim ulu s set could be m odeled by a relatively sm all set of features. O ther studies have u sed m ore ra p id designs w here flashing stim uli w ere presented for only brief periods (Kay et al., 2008), th u s allow ing the p resentation of larger an d m ore variable stim ulus sets. It rem ains an open question w h eth er ac curate reconstruction can be obtained u sing our m odel in such settings.
We have also m ade use of a GLM analysis in order to identify voxels th at show significant differences betw een task an d rest conditions. This allow ed us to reduce the n u m b er of voxels th a t w ere used as in p u t to the m odel an d reduce the negative effect of overfitting. O ur focus on a sm all subset of voxels m ade the in terp retatio n of the relation betw een those voxels and the learned features difficult. In terpretation m ay be facilitated by including m ore voxels an d u sing regularizers th at induce sm oothness and sparseness (van G erven, Cseke, de Lange, & H eskes, 2010). Together w ith the use of retinotopic m ap p in g , this allow s one to probe m ore accurately w hich cortical areas code for w hich im age features.
In conclusion, w e have dem o n strated th a t hierarchical generative m odels can be used for neu ral decoding an d offer a new w in d o w into the brain. If perform ance could be generalized to im agined stim uli, w e w ill come closer to a system th a t is able to "read th o u g h ts" (Kay & G allant, 2009). There is reason to believe th at such a generalization is feasible d ue to the fact th a t perceived an d im agined stim uli can lead to sim ilar neu ral activation p attern s (Roland & G ulyas, 1995; M ellet, Petit, M azoyer, Denis, & T zourio, 1998; Koch, 2004; Thirion, D uchesnay, H u b b a rd , D ubois, Poline, Lebihan, et al., 2006; Stokes et al., 2009).
A c k n o w led g m en ts __________________________________________________ We gratefully acknow ledge the su p p o rt of the D utch technology fo undation STW (project n u m b er 07050), the N eth erlan d s O rganization for Scientific Research N W O (Veni g ra n t 451.09.001 an d Vici g ran t 639.023.604), an d the BrainGain Sm art Mix Program m e of the N eth erlan d s M inistry of Economic A ffairs and the N eth erlan d s M inistry of Education, C ulture an d Science.
R e fe re n c e s _________________________________________________________ Barlow, H. (1961). Possible principles underlying the transformation of sensory mes
sages. In W. Rosenblith (Ed.), Sensory Communication. Cambridge, MA: MIT Press. Bengio, Y. (2009). Learning deep architectures for AI. Foundations and Trends in
M achine.Learning, 2(1), 1-127.
Felleman, D. J. & Van Essen, D. C. (1991). Distributed hierarchical processing in the primate cerebral cortex. Cereb.Cortex, 1,1-47.
Friston, K. (2005). A theory of cortical responses. Philos. Trans. R. Soc. Lond. B Biol. Sci, 360(1456), 815-836.
Fujiwara, Y., Miyawaki, Y., & Kamitani, Y. (2009). Estimating image bases for visual image reconstruction from human brain activity. In Y. Bengio, D. Schuurmans, J. Lafferty, C. K. I. Williams, & A. Culotta (Eds.), Advances in neural information processing systems, 22 (pp. 576-584). Cambridge, MA: MIT Press.
Hassabis, D., Chu, C., Rees, G., Weiskopf, N., Molyneux, P. D., & Maguire, E. A. (2009). Decoding neuronal ensembles in the human hippocampus. Current Biology, 19, 546-554.
Haxby, J., Gobbini, M., Furey, M., Ishai, A., Schouten, J., & Pietrini, P. (2001). Distributed and overlapping representations of faces and objects in ventral temporal cortex. Science, 293, 2425-2430.
Hedge, J. & Van Essen, D. C. (2000). Selectivity for complex shapes in primate visual area V2. J. Neurosci., 20(RC61), 1-6.
Helmholtz, H. (1867). Handbuch der Physiologischen Optik. New York: Dover. Hinton, G. E. (2002). Training products of experts by minimizing contrastive
divergence. NeuraLCompujrtion, 14(8), 1711-1800.
Hinton, G. E. (2007). To recognize shapes, first learn to generate images. In P. Cisek, T. Drew, & J. Kalaska (Eds.), Computational neuroscience: Theoretical insights into brain function. Burlington, MA: Elsevier.
Hinton, G. E., Dayan, P., Frey, B. J., & Neal, R. M. (1995). The wake-sleep algorithm for unsupervised neural networks. Science, 268, 1158-1161.
Hinton, G. E., Osindero, S., & Teh, Y. (2006). A fast learning algorithm for deep belief nets. N syrsLC o m M a tS n 1 8 ,1527-1554.
Hinton, G. E. & Sejnowski, T. J. (1983). Optimal perceptual inference. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE.
Kay, K. N. & Gallant, J. L. (2009). I can see what you see. Nature Neuroscience, 12(3), 245-246.
Kay, K., Naselaris, T., Prenger, R. J., & Gallant, J. L. (2008). Identifying natural images from human brain activity. N ature 452, 352-355.
Koch, C. (2004). The quest for consciousness: A neurobiological approach. Greenwood Village, CO: Roberts & Company.
Lee, H., Grosse, R., Ranganath, R., & Ng, A. (2009). Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In Proceedings of the 26th Annual International Conference on Machine Learning
(pp. 609-616). New York: ACM.
Lee, T. S. & Mumford, D. (2003). Hierarchical Bayesian inference in the visual cortex. J .O p .S o e .A m .A , 20(7), 1434-1448.
Mellet, E., Petit, L., Mazoyer, B., Denis, M., & Tzourio, N. (1998). Reopening the men tal imagery debate: Lessons from functional anatomy. N yro lm a se 8(2), 129-139. Mitchell, T. M., Shinkareva, S. V., Carlson, A., Chang, K.-M., Malave, V. L., Mason,
R. A., et al. (2008). Predicting human brain activity associated with the meanings of nouns. Science, 320(5880), 1191-1195.
Miyawaki, Y., Uchida, H., Yamashita, O., Sato, M., Morito, Y., Tanabe, H. C., et al. (2008). Visual image reconstruction from human brain activity using a combination of multiscale local image decoders. Neuron, 60(5), 915-929.
Naselaris, T., Prenger, R. J., Kay, K. N., Oliver, M., & Gallant, J. L. (2009). Bayesian re construction of natural images from human brain activity. Neuron, 63(6), 902-915. Rao, R. P., & Ballard, D. H. (1998). Predictive coding in the visual cortex: A functional
interpretation of some extra-classical receptive field effects. Nat. Neurosci., 2,
79-87.
Roland, P. E., & Gulyas, B. (1995). Visual memory, visual imagery, and visual recognition of large field patterns by the human brain: Functional anatomy by positron emission tomography. Cerebral Cortex, 5, 79-93.
Salakhutdinov, R., Mnih, A., & Hinton, G. (2007). Restricted Boltzmann machines for collaborative filtering. In Proceedings of the 24th International Conference on Machine Learning. San Francisco: Morgan Kaufmann.
Smolensky, P. (1986). Information processing in dynamical systems: Foundations of harmony theory. In D. E. Rumelhart & J. L. McClelland (Eds.), Parallel distributed processing: Vol. 1: Foundations (pp. 194-281). Cambridge, MA: MIT Press. Stokes, M., Thompson, R., Cusack, R., & Duncan, J. (2009). Top-down activation
of shape-specific population codes in visual cortex during mental imagery. J. Neurosci., 29(5), 1565-1572.
Taylor, G. W., Hinton, G. E., & Roweis, S. (2006). Modeling human motion using binary latent variables. In J. C. Platt, D. Koller, Y. Singer, & S. Roweis (Eds.),
Advances in neural information processing systems, 20. Cambridge, MA: MIT Press. Thirion, B., Duchesnay, E., Hubbard, E., Dubois, J., Poline, J., Lebihan, D., et al.
(2006). Inverse retinotopy: inferring the visual content of images from brain activation patterns. NeurgImO£e 33(4), 1104-1116.
van Gerven, M. A. J., Cseke, B., de Lange, F. P., & Heskes, T. (2010). Efficient Bayesian multivariate fMRI analysis using a sparsifying spatio-temporal prior.
Ne^ rolm^ r e, 50(1), 150-161.
Vinckier, F., Dehaene, S., Jobert, A., Dubus, J. P., Sigman, M., & Cohen, L. (2007). Hierarchical coding of letter strings in the ventral stream: Dissecting the inner organization of the visual word-form system. N euron 5 5 ,143-156.
Wang, Z., Bovik, A. C., Sheikh, H. R., & Simoncelli, E. P. (2004). Image quality assessment: From error visibility to structural similarity. IEEE Transactions on ImageProeessing, 13(4), 600-612.
Welling, M., & Hinton, G. E. (2002). A new learning algorithm for mean field Boltzmann machines. In ICANN '02: Proceedings of the International Conference on Artificial Neural Networks (pp. 351-372). Berlin: Springer-Verlag.
Zeki, S., & Shipp, S. (1988). The functional logic of cortical connections. Nature, 335,
311-331.