Contrast Sensitivity Functions in Autoencoders

(1)

Contrast Sensitivity Functions in Autoencoders

Qiang Li

Image Processing Lab, Parc Cientific

_{v )}

Universitat de Val `encia, Spain

Alex Gomez-Villa

Computer Vision Center

_{v )}

Universitat Aut `onoma de Barcelona, Spain

Marcelo Bertalm´ıo

Grupo de Im ´agenes y Visi ´on

_{v )}

Instituto de ´Optica CSIC, Madrid, Spain

Jes ´

us Malo

Image Processing Lab, Parc Cientific

_{v )}

Universitat de Val `encia, Spain

Artificial neural networks (ANNs) trained to solve low-level visual tasks have been shown to develop biological-like features. Examples include human-like receptive fields and being subject to human-like visual illusions. In this work we report that these artificial systems may also develop human-like Contrast Sensitivity Functions (CSFs).

Following the interest in accurate comparison between humans and ANNs, we use psychophysical ideas to extend a recently proposed eigenanalysis to define the CSFs of such artificial systems. Results show that low-pass sensitivity to chromatic gratings and wider band-pass sensitivity to achromatic gratings may appear in convolutional autoencoders trained to enhance the retinal signals. Similarly, 3D versions of these networks may develop a spatio-temporal CSF with the basic diamond shape and bandwidth of the human window of visibility. Moreover, masking-like saturated responses to the contrast of spatio-chromatic and spatio-temporal gratings may emerge in these autoencoders.

However, the consideration of a range of ANN models trained for the same statistical goal shows that deeper architec-tures or better accuracy in the goal do not necessarily lead to more human CSFs or more human masking-like saturations. We argue that the nontrivial interplay between goal and architecture in the optimization prevents against premature con-clusions about the organization principles of the visual system if sensible architectures are not adopted. This suggests that goal-driven derivations of psychophysical phenomena should include more realistic units than those considered in conventional CNNs.

Keywords: Spatio-chromatic Contrast Sensitivity, Spatio-temporal Contrast Sensitivity, Convolutional Autoencoders, Deblurring and Denoising, Natural Images, Statistical goals, Architectures

1. Introduction

Automatic differentiation is an extremely powerful tool that is at the core of the current explosion of deep-learning (Goodfellow, Bengio, & Courville, 2016). It allows the optimization of complex models without the burden of the explicit computation of the derivatives (Martinez, Cyriac, Batard, Bertalm´ıo, & Malo,2018). The flexibility of the resulting models explains their success in visual neuroscience (Kriegeskorte,2015;Yamins & DiCarlo,2016;Cadena et al.,2019). However, more important than the accuracy of the

doi: Received: March 15, 2021 © CC BY-NC-ND

(2)

specific models, the advent of machine learning descriptions of neural function stress the need to clarify one central point (Richards & al.,

2019): the strong interaction between goal and architecture. Marr’s computational and algorithmic levels of analysis of vision (Marr & Poggio,1976;Marr,1982) are not that independent: a biologically sensible goal (at the computational level) may lead to non-biological behaviors if the wrong architecture (at the algorithmic level) is chosen (Martinez, Bertalm´ıo, & Malo,2019). As a result, it is not surprising that the behavior of current deep artificial neural networks (ANNs) departs from human behavior (Wichmann et al.,2017;

Geirhos et al.,2019) despite they achieve human-like accuracy in the goal (Geirhos, Meding, & Wichmann,2020). For instance, relying on the universal approximation property of deep architectures in a specific goal may lead to the proposition of too deep (unrealistic) models that display non-human behavior in other tasks. This could be avoided by using more realistic layers/neurons that may require shallower architectures (Bertalm´ıo et al.,2020). This interaction between goal and architecture may complicate the discussion about the organization principles of the visual system. In this context, the analysis of the similarities and differences between biological and deep-learning systems is interesting because it may reveal fundamental limitations of the conventional ANN analogy to study vision.

This work follows (Gomez-Villa, Martin, Vazquez, & Bertalmio,2019;Gomez-Villa, Martin, Vazquez, Bertalm´ıo, & Malo,2020), where we did the first systematic study of color illusions in ANNs by subjecting the networks to classical psychophysical experiments. In order to explain the human-like nature of some of the illusions found in artificial systems, in (Gomez-Villa et al.,2020) we proposed a novel eigenanalysis of the networks. This analysis revealed the emergence of human-like chromatic channels and filters. That work on visual illusions of ANNs was focused on Convolutional Neural Networks (CNNs) and specifically on CNN-Autoencoders. The goal of autoencoders is minimizing the error of the decoded signal in the representation domain of the input (Goodfellow et al.,

2016). Such simple low-level goal in the spatial domain may be reasonable for early vision stages like the Lateral Geniculate Nucleous (LGN) (Martinez-Otero, Molano, Wang, Sommer, & Hirsch,2014), and it is known that spatio-chromatic contrast sensitivity functions (CSFs) can be derived from the responses of the LGN cells (Ingling & Martinez-Uriegas,1983;Martinez-Uriegas,1994;Shapley & Hawken,2011).

Our aims here are (1) checking how the spatio-chromatic CSFs of CNN autoencoders may depend on the architecture even if the organization principle is fixed, and (2) the eventual emergence of similar human-like CSFs in 3D-CNNs trained with spatio-temporal sig-nals. In order to do so we connect the eigenanalysis of the networks proposed in (Gomez-Villa et al.,2020) with classical psychophysical definitions of the CSF, and we assess the nonlinear response of CNNs for gratings of different contrasts.

The exploration of contrast response of CNNs trained on natural scenes updates classical statistical accounts of the CSFs (Atick & Redlich,1992; Atick, Li, & Redlich,1992). These classical explanations were based on clever observations about the 2nd order properties of natural images, but relied on linear filtering models. Therefore, in principle, deep-learning models have two advantages: (i) the use of actual image data instead of (idealized) models of images, and (ii) the use of flexible networks that overcome the linear constraint and hence may develop masking saturations.

The eventual advantages of current deep-learning models seem obvious, but one may ask: do more general (nonlinear) architectures necessarily lead to more human CSFs?, or does better accuracy in the goal imply more human CSFs and masking behavior?. These are the kind of questions addressed by our experiments.

The structure of the paper is as follows. Section 2 extends the theory proposed in (Gomez-Villa et al.,2020) to obtain the CSFs of artificial systems with a low-contrast analysis based on classical psychophysics. Section 3 shows the main empirical findings of the work: the emergence of human-like CSFs and contrast saturation in CNN-autoencoders with shallow architectures under the considered error minimization goal. Interestingly, resemblance with human behavior reduces for deeper architectures. Finally, Section 4 discusses the implications of the results: on the one hand, statements about the goal or organization principles at an abstract level are problematic because the final behavior very much depends on the algorithmic level (or selected architecture). On the other hand, special care has to be taken in using deep models in vision science: their ability for function approximation without the appropriate architecture constraints does not guarantee human behavior. This suggests a word of caution against premature interpretations of the behavior obtained through deep architectures in biological problems.

(3)

2. Methods: Estimating Contrast Sensitivity in Autoencoders

Here we consider different linear characterizations of the autoencoders including the eigenanalysis proposed in (Gomez-Villa et al.,2020). In this way we propose a procedure to estimate the autoencoder CSF that is more connected to the conventional definition of the CSF in human observers.

Autoencoders

Autoencoders are ANNs that transform the signal into an inner representation through an encoder and a decoder transforms this inner representation back in the input domain.

x

S(x, θ)

##

y (1)

In Eq.1 we do not made explicit the encoding and decoding operations, i.e. x and y are in the image space. In this work we will not make any assumption on the nature of the inner representation of the autoencoder. This is because the basic goal function in autoencoders (reconstruction error) is defined in the image domain, shared by input and response. Moreover, with the appropriate stimuli, the CSF characterization can be defined in this image domain.

Following (Gomez-Villa et al.,2019,2020) we focus on Convolutional Autoencoders trained to enhance blurred and noisy signals coming from natural scenes. In this context, given a clean image, xc, the input to the neural system would be a distorted version of the

clean signal: x = H ·xc+ n, where H is blurring operator related to the optics of the eye and n is the noise associated to the response of

the photodetectors. Both H and n are unknown to the neural system. The goal of the network at this early stage is inferring xcfrom x.

Accurate models of LGN cells show that this may be one of the goals of the biological processing after retinal detection (Martinez-Otero et al.,2014).

In the supervised learning setting of ANNs, the parameters θ of the network are selected so that the reconstruction error ε = |xc− S(x, θ)|2is minimized over a set of training images (Goodfellow et al.,2016). Of course, supervised learning and parameter

updates using backpropagation may not be biologically plausible (Lillicrap, Santoro, Marris, Akerman, & Hinton,2020). However, our initial aim here is looking for statistical explanations of human frequency sensitivity and hence ANNs can be seen as convenient tools to optimize the selected goal. With this focus on the goal, the specific learning algorithm is not as important as ensuring that the final network actually fulfills the goal. We will see that the situation may not be that simple because networks optimizing the same goal with equivalent performance may display human or non-human CSFs depending on their architecture.

Contrast Sensitivity Function and Linearized Autoencoders

The CSF describes the linear response of human viewers for low-contrast sinusoids (Campbell & Robson,1968; Mullen,1985;

Kelly,1979). In that linear setting, the CSF describes an input-output mapping where an input sinusoid of frequency f , the basis function bf, leads to an output, yf_{, with attenuated contrast (or standard deviation, σ) given by, σ(y}f_{) = CSF(f ) σ(b}f

). In the case of humans the slope is obtained from contrast thresholds because there is no access to the output. However, for autoencoders the computation of the output is straightforward, yf = S(bf, θ). Therefore, given an input sinusoid, one could check the attenuation factor by comparing the standard deviation of output and input:

CSF(f ) = σ(S(b

f

, θ))

σ(bf) (2)

This ratio should be computed for low-contrast sinusoids to keep the (eventually) nonlinear autoencoder in the low-energy range. For chromatic sinusoids the deviations have to be computed separately over the achromatic, red-green, and blue-yellow color chan-nels (Mullen, 1985). In this work we use the Jameson and Hurvich opponent color space (Hurvich & Jameson, 1957) to generate achromatic and purely chromatic gratings and to decompose the corresponding responses.

(4)

Of course, plain attenuation for sinusoids in Eq.2may not provide a full description of the action of nonlinear systems. In principle it is not obvious why we should perform the analysis in a specific basis. Therefore one should check to what extent sinusoids are indeed eigenfunctions of the system.

A way to test this point is linearizing the response of the autoencoders in the low-contrast regime and check that it is shift invariant. Using a Taylor expansion, the response for low-contrast images can be approximated by the Jacobian around the origin (the zero-contrast image, 0):

y = S(x, θ)

y_low = S(0 + xlow, θ) ≈ S(0, θ) + ∇xS(0, θ) · xlow (3)

y_low ≈ ∇xS(0, θ) · xlow

where we assumed that the response for zero-contrast images is zero. If the behavior of the system at this low-energy regime is shift invariant, the Jacobian matrix can be diagonalized as ∇xS(0, θ) = B · λ · B−1, with extended oscillatory basis functions in the columns

of B (and rows of B−1). Fourier basis and cosine basis are examples of extended (non-local) oscillatory functions that diagonalize shift invariant systems. The reason for this result is equivalent to the emergence of cosine basis when computing the principal components of stationary signals (shift invariant autocorrelation) (Clarke,1981). As a result, the slope of the response for low-contrast sinusoids (the CSF) will be related to the eigendecomposition of the Jacobian of the system at 0. Let’s compute the response for a sinusoid in this Taylor/Fourier setting to see the relation. A basis function bf with specific frequency f is orthogonal to all rows (sinusoids) in B−1 except that of the same frequency, i.e. B−1· bf = δf0f. And this delta selects the corresponding column (of frequency f ) among all the columns in B:

yf = S(bf, θ) ≈ B · λ · B−1· bf ≈ B · λ · δf0f

≈ λfbf (4)

So the slope of the response for basis functions of frequency f is λf(the corresponding eigenvalue of the Jacobian of the autoencoder).

As a result, for systems with shift invariance in the low-contrast regime, the eigenvalues of the linear approximation of the system (eigenvalues of the Jacobian) are conceptually similar to the CSF. Direct comparison of the eigenvalue spectrum with the CSF may not be simple because the eigenfunctions may differ from Fourier sinusoids and they are given in a specific order depending on the diagonalization algorithm.

Nevertheless, if the linearized version of the system (the Jacobian at 0) is shift invariant, which can be seen from a convolutional structure in the Jacobian matrix, oscillatory waves are eigenfunctions of the system, and hence Eq.2may provide a good description of the behavior of the system.

Alternative linear characterizations of the autoencoders

A 2D cartoon of the degradation and restoration process can illustrate alternative characterizations of the neural networks opti-mized to enhance the retinal signal (see Fig.1). In this diagram, two-pixel natural scenes (the left panel) follow a PDF obtained from independent t-student sources mixed by a matrix that introduces strong correlation between the luminance of the pixels. This kind of two-pixel representations is common to describe the statistics of natural images (Simoncelli & Olshausen,2001), and mixtures of sparse components is a widely accepted model for natural scenes (Hyv¨arinen, Hurri, & Hoyer,2009;Oord & Schrauwen,2014;Malo,2020), and appropriate enough for this illustration. In this diagram the low-frequency direction corresponds to the main diagonal (where the two pixels have the same luminance) and the high-frequency direction is orthogonal (for images where one of the pixels is brighter than the other). The zero-contrast image is at the crossing point of the frequency axes.

(5)

Figure 1: Degradation of visual data and linear and nonlinear strategies for enhancement in the spatial domain. The axes of the plots represent the luminance in each of the photodetectors of two-pixel images(Olshausen01). The left panel represents the PDF of natural scenes: marginal heavy tailed distributions of oscillatory functions mixed to have strong correlation in the pixel domain. Optical blur (second panel) implies contraction of the high frequencies. Additive retinal noise implies the convolution by the PDF of the noise leading to the PDF in the third panel. The restoration problem may include (i) generic nonlinear solutions such as the one represented by the PDF at the right-top, and (ii) linear solutions as the one at the right-bottom.

Optical blur implies the attenuation of high-frequency versus low-frequency components and hence the contraction of the dataset as shown in the second panel. Assuming linear and noisy photoreceptors, the PDF of the retinal response results from the convolution of the PDF of the blurred images with the PDF of the noise (the function with circular support). The result (third panel) is the input to the autoencoder, whose goal is recovering the distribution at the first panel. Linear solutions are limited to global scaling of the domain (for instance by inverting the contraction introduced by the blur), while nonlinear solutions may twist the domain in arbitrary ways.

In this setting, the computation of the CSF according to Eq.2means putting low-contrast sinusoids (e.g. the samples highlighted in green in the third panel) through the system, and checking the amplitude of the output (green dots at the panels at the right) over the directions of the input. This nonlinear example illustrates the fact that the behavior can be contrast dependent (see the different twist in the concentric contours). This graphical view illustrates the difference between three possible linear characterizations, with y = M · x:

• The optimal linear solution: the matrix M that better relates the input x with the desired output xc. This is the M that

minimizes the expected value E{|xc− M · x|2}. Assuming a representative set of N clean/distorted pairs stacked in the matrices

Xc= [x (1) c x (2) c . . . x (N )

c ] and X = [x(1)x(2). . . x(N )], the optimal RMSE solution is given by the pseudoinverse:

M = Xc· X† (5)

• Globally linearized network: the matrix M that better describes the nonlinear behavior of the network over the whole set of natural images. This is the M that minimizes E{|S(x, θ) − M · x|2}. Assuming a representative set of N input/output pairs

stacked in the matrices X = [x(1)x(2). . . x(N )], and Y = [y(1)y(2). . . y(N )], the solution is given by the pseudoinverse:

M = Y · X† (6)

• Locally linearized network at 0: the matrix M that better describes the nonlinear behavior of the network for low-contrast images. This is the M that minimizes E{|∇xS(0, θ) · xlow− M · xlow|2}. Of course, this could be empirically approximated by

M = Ylow· X†low, but in this case the obvious exact solution is:

(6)

While the optimal linear solution (or the optimal linear network) is a convenient reference to describe the problem, the other two options are different characterizations of the autoencoder. Eq.6summarizes the behavior of the network in a single matrix, and Eq.7is a description only valid around 0, and hence more closely connected to the low-contrast regime of the CSF. The eigenanalysis cited for ∇xS in Eq.4can be applied for the three matrix characterizations, but it is important to note the differences between them.

The Jacobian of cascades of linear+nonlinear layers (as in autoencoders based on Convolutional Neural Networks) can be obtained either analytically (Martinez et al.,2018), or it can be obtained via automatic differentiation or alternative methods based on system identification (Berardino, Ball´e, Laparra, & Simoncelli,2017). However, the above procedures are tedious, so in (Gomez-Villa et al.,

2020) we took the more straightforward approach represented by Eq.6.

The different linear characterizations considered in this section and the example in Fig.1illustrate that the behavior of a nonlinear autoencoder over high contrast images may be substantially different from the threshold behavior. Therefore, the eigenanalysis used in (Gomez-Villa et al.,2020) and the corresponding attenuation of sinusoids through the matrix given by Eq.6, will also be compared with the result of Eq.2.

3. Results

Here we analyze the behavior of a range of autoencoders trained for the same statistical goal (enhancement of blurred/noisy retinal signals) using different architectures. First we show that they indeed fulfill the goal. Then, we analyze their global linear approximation according to Eq.6by showing their shift-invariant nature, their spatio-chromatic and spatio-temporal receptive fields, and their extended oscillatory eigenfunctions. Finally, we apply Eq.2both to the nonlinear systems and to their linearized versions to get the corresponding CSFs.

3.1 Architectures and performance

In this work we consider 2D CNNs that act on spatio-chromatic signals and 3D CNNs that act on spatio-temporal signals (color images and achromatic videos).

The spatio-chromatic models follow the networks studied in (Gomez-Villa et al.,2019,2020): autoencoders with 2D convolutional layers made of 8 feature maps with kernels of spatial size 5 × 5 and sigmoids or Rectified Linear Units (ReLU) as activation functions. In every case, the last (reconstruction) layer has 3 features (the RGB color channels) so that the input and output domains are the same. We train a range of progressively deeper architectures: 2 layers, 4 layers, 6 layers, and 8 layers. We also consider a really deep architecture made of pre-trained State-Of-The-Art (SOTA) solutions to the considered problems: first we apply the denoising network (Zhang, Zuo, Chen, Meng, & Zhang,2017), and then the deblurring network (Tao, Gao, Shen, Wang, & Jia,2018). This cascade represents a sophisticated network with 17+21 = 38 layers. Finally, we consider the best linear solution, Eq.5, as a convenient reference.

The spatio-temporal models follow the same spirit, in this case also including convolution in the temporal dimension: autoencoders with 3D convolutional layers made of 8 feature maps with kernels of size 5 × 5 × 5 and sigmoid or ReLU activation functions. In every case, the last (reconstruction) layer has 1 feature map (the luminance). We consider architectures of different depth (2 layers, 4 layers, 6 layers, and 8 layers), and also the optimal linear network as useful reference.

Training of the spatio-chromatic models is done over the same dataset used in (Gomez-Villa et al.,2019,2020): the Large Scale Visual Recognition Challenge 2014 CLS-LOC validation dataset (which contains 50 k images), leaving 10 k images for validation purposes. This dataset is a subset of the whole ImageNet dataset (Russakovsky et al.,2015). We take 128 × 128 images and assume a sampling frequency f s = 70cpd, i.e. we assume images subtend 3.6 deg. The cut off frequency of the Gaussian blur kernel is f s/4 = 17cpd, and the variance of the zero-mean Gaussian noise is σ = 15 in the [0,255] scale of RGB channels.

(7)

function in all cases and all the models are implemented using Tensorflow (Abadi & Zheng,2016). We train our models using ADAM stochastic gradient descent (Kingma & Ba,2017) with a batch size of 64 examples, momentum of 0.9, weight decay of 0.001. The maximum number of epochs is set to 100 and we stop the optimization early if there is no improvement in the validation set after 10 consecutive evaluations.

The spatio-temporal models are trained over patches of size 32 × 32 × 25 from the movie The Stranger (Welles, 1946). We use 10000 patches for training and 1000 patches for validation. We assume spatial sampling of 30 cpd and temporal sampling of 25 Hz. Degradation used is the same as in the spatio-chromatic case, here applied frame-wise to the luminance channel. Loss function, implementation and learning method are the same as in the spatio-chromatic case. Training data and code are available at

http://isp.uv.es/code/visioncolor/autoencoderCSF.html

Table 1 and Figures2and3 show that the considered networks indeed learn to solve the spatio-chromatic and spatio-temporal problems. While the table shows the values of the goal function over the whole validation set, the figures show representative visual results of the transmitted signals together with the goal values for specific stimuli of the validation set. Results include the original stimuli, the input (the degraded retinal responses), the responses of the considered nonlinear networks, and the optimal linear solutions. In both cases (spatio-chromatic and spatio-temporal) the optimal linear network computed from the pseudoinverse, Eq.5, clearly improves the retinal signal in RMSE and the high frequency noise is (visually) less apparent. The nonlinear networks (regardless of the activation functions) further improve the linear results both in RMSE and visually (edges are sharper and noise has been reduced). The table shows that the performance of the linearized versions of the networks, Eq.6, is better than the optimal linear network but worse than the nonlinear networks. This means that the response of the nonlinear networks introduces a sort of regularization that makes the resulting matrices more appropriate to solve the problem. Linearized responses are not shown in the figures because all reconstructions are visually similar and do not convey much qualitative information.

Increasing the depth of the CNN architectures improves the performance with regard to the shallower configuration: note the improvement from 2 to 4 layers both in RMSE and visually for every activation function. This is true for both the spatio-chromatic and the spatio-temporal cases. However, the improvement saturates and adding extra layers (with no modification of the training strategy) does not necessarily reduce the reconstruction error. This is noticeable in the RMSE scores, but not that relevant in the visual results. In some specific cases (e.g. 6 and 8 layers architectures with sigmoid activation) the networks introduce a shift in the chromatic content as seen in Fig.2(top row). This specific artifact will induce a clear distortion in one of the chromatic CSFs as shown below.

The saturation of the performance with the depth of the architecture is a known effect in deep learning (Glorot & Bengio,2010;

Priyanka & Wang,2019). It is the reason for the development of architectural/training tricks to keep the performance high in validation when increasing the number of free parameters (Komatsu & Gonsalves,2020;He, Zhang, Ren, & Sun,2016;Bengio, Lamblin, Popovici, & Larochelle,2006). We didn’t explore these specific tricks because of two reasons: (1) the aim of the paper is not pure goal optimization but studying the eventual emergence of human behavior in the networks, and (2) the considered cases already show improvement of the performance from the shallower configurations to deeper configurations and the posterior decay in performance is small (and negligible in visual terms in most of the cases).

As shown below, these cases are enough to show how human-like properties may appear in the networks, and how sensitive this result may be with regard to modifications in the architecture.

Finally, the considered State-Of-The-Art (SOTA) system with a total of 38 layers does get the better RMSE and visual results. These SOTA architectures were carefully handcrafted to this specific end. Note that in Table 1 the global linearization of the SOTA network is not that different from the linearized versions of the other networks. The key difference (in terms of the goal) is not in the linear approximation of the networks but in the different nonlinear behavior they have.

We didn’t use the spatio-chromatic SOTA system for the enhancement of spatio-temporal signals because a frame-wise application of the 2D network would not exploit the 3D structure of natural movies. Naive frame-wise application of 2D networks could introduce

(8)

Figure 2: Representative responses of CNNs for two spatio-chromatic stimuli. Numbers indicate the values of the goal function (RMSE distance in RGB with regard to the original stimulus).

artifacts when studying the response to spatio-temporal sinusoids so we preferred to exclude 2D models from the spatio-temporal study focused on 3D CNNs. Again, the considered 3D architectures reveal the fulfillment of the goal and display a range of performance values which are enough to discuss the emergence of human-like spatio-temporal bandwidths.

3.2 Linearization and Eigenanalysis

Figure4shows representative results for the linearization of the considered networks. The top row recalls the results reported in (Gomez-Villa et al., 2020) for the 2D spatio-chromatic networks with two convolutional layers and sigmoid activation, and the bottom row shows the results for the 3D spatio-temporal networks with two convolutional layers and sigmoid activation. The equivalent results for the other networks are qualitatively similar.

(9)

Figure 3: Representative responses of CNNs to a spatio-temporal stimulus. Numbers indicate the values of the goal function (RMSE distance in luminance with regard to the original sequence). Data from The Stranger(Welles, 1946), including this image, can be freely used since the movie is in the Public Domain, and online available athttps://archive.org/details/TheStranger720p.

Following the discussion in Section 2, the matrices M that result from Eq.6(panels A and E) have a clear convolutional structure: rows with the same shape up to a shift in the spatial or the spatio-temporal domain. Note that in this matrix notation the input and output have been vectorized, as shown in panel A. This structure in M reveals shift invariance in the receptive fields. Representative receptive fields are shown in panels B and F, where the row highlighted in green in the matrix has been reshaped to be represented back into the spatial or the spatio-temporal domain. All the linearized networks present these kind of center-surround receptive fields (a red-green example in panel B, and an on-center/off-surround with distinct temporal responses in different spatial locations in panel F). These receptive fields are replicated with the corresponding shift along the rows of the matrices M . As anticipated in Section 2, following (Clarke,1981), convolutional-like matrices lead to extended oscillatory eigenfunctions as illustrated in panels C and G. The spatio-chromatic functions in panel C are similar to Cosine basis and interestingly, as noted in (Gomez-Villa et al.,2020), the basis

(10)

Spatio-Chrom. Spatio-temp.

RMSE in RGB RMSE in Lumin.

Degradation 30.5 ± 0.2 16.12 ± 0.05

Linear Net 22.7 ± 0.3 11.1 ± 0.1

CNNs Sigmoid ReLU Sigmoid ReLU

Nonlinear Linearized Nonlinear Linearized Nonlinear Linearized Nonlinear Linearized 2-Layers 21.7 ± 0.4 21.0 ± 0.3 20.6 ± 0.5 20.4 ± 0.2 8.3 ± 0.1 10.2 ± 0.1 7.7 ± 0.1 10.0 ± 0.1 4-Layers 19.2 ± 0.5 20.2 ± 0.3 18.8 ± 0.5 20.3 ± 0.3 7.6 ± 0.1 10.0 ± 0.1 7.8 ± 0.1 10.2 ± 0.1 6-Layers 20.9 ± 0.4 21.9 ± 0.3 19.1 ± 0.5 20.5 ± 0.3 7.9 ± 0.1 10.0 ± 0.1 7.2 ± 0.2 10.2 ± 0.1 8-Layers 20.0 ± 0.4 21.0 ± 0.3 19.1 ± 0.5 20.4 ± 0.4 7.8 ± 0.1 10.1 ± 0.1 8.9 ± 0.1 10.5 ± 0.1

SOTA Nonlinear Linearized

17.9 ± 0.4 20.2 ± 0.3

Table 1: Performance of the networks for the spatio-chromatic problem (left) and the spatio-temporal problem (right). Initial value of the goal function (RMSE), and result obtained by different architectures and activation functions. All scores refer to the average over 100 realizations over randomly selected 80% of the validation set, with the corresponding standard deviation.

Figure 4: Linearization of spatio-chromatic networks (top) and spatio-temporal networks (bottom). The matrices M (in panels A and E), the receptive fields (panels B and F), the eigenfunctions (in panels C and G), and the eigenvalues (panels D and H).

(11)

Spatio-Chrom. Spatio-temp. Linear Fract. Linear Fract.

(in %) (in %)

CNNs Sigmoid ReLU Sigmoid ReLU

2-Layers 90.6 92.9 92.2 92.1

4-Layers 94.1 94.8 92.1 91.7

6-Layers 94.3 94.6 92.7 91.6

8-Layers 94.4 94.8 92.4 92.6

SOTA 93.9

Table 2: Linear fraction of the networks for the spatio-chromatic problem (left) and the spatio-temporal problem (right). Results are computed for the validation set.

display oscillations in human-like red-green and blue-yellow opponent directions. See the original report for further analysis of the chromatic directions and color matching functions. The spatio-temporal functions in panel G include extended textures (both in space and time), with flicker and motion, but they cannot be identified with Fourier or Cosine waves. However, it is clear that they display characteristic spatio-temporal frequencies (medium to low spatial frequencies at the top versus high spatio-temporal frequencies at the bottom). The functions in panels C and G are ordered according to the eigenvalues (in panels D and H). This means that the systems favour medium and low spatio-temporal frequencies.

As stressed in (Gomez-Villa et al., 2020), this eigenanalysis makes no other assumption than the linear approximation of the networks. The structure of the matrices found and the order in the oscillating eigenfunctions reveal the suitability of Fourier-like basis and the higher relevance of medium and low frequencies. Fourier basis are not imposed: they just become evident after the eigenanalysis of the linearized versions of the networks. This justifies further study of the nets (both the original nonlinear networks and the corresponding linearized versions) through the conventional ratio of output/input signals assumed in psychophysics, Eq.2.

Following (Gomez-Villa et al.,2020) we computed the fraction of the nonlinear response of the CNNs that can be explained by the linear approximation, i.e. 1 −E{|S(x)−M ·x|}_E{|S(x)|} . This linear fraction, which is 100% for a linear system and decreases with the intrinsic nonlinearity of the network, is given in Table 2. Results show that the considered problems (and the corresponding networks) are mostly linear. Moreover, increasing the depth (increasing the number of parameters and potential flexibility) does not mean increasing the effective nonlinearity of the networks. This suggests that the networks are not overfitting the data.

3.3 Stimulation with gratings

In the experiments below we stimulate the considered architectures with purely achromatic, and isoluminant red-green and blue-yellow gratings according to the cardinal directions of a psychophysically meaningful color opponent space (Hurvich & Jameson,1957). See Fig.5for a representative subset of the gratings used to feed the networks for the estimation of the spatio-chromatic CSFs. The full set included gratings of 36 spatial frequencies linearly spaced in the range [0.5, 20.5] cpd and 9 contrasts linearly spaced in the range [0.07,0.6]. The average color was the white of the system with 30 cd/m2_.

The stimuli for the computation of the spatio-temporal CSFs were moving sinusoids with 16 spatial frequencies in the range [0,15] cpd, 10 temporal frequencies in the range [0,10] Hz, and 9 contrasts in the range [0.1,1]. The average luminance was set to 45 cd/m2_.

(12)

Figure 5: Representative spatio-chromatic stimuli to feed the networks.

3.4 Spatio-chromatic CSFs and contrast responses

Figure6shows the attenuation factors suffered by the different gratings (achromatic and chromatic) depending on their frequency and contrast. Results are shown for all the nonlinear architectures with either sigmoid or ReLU activation functions.

While some of the frequency responses (those for shallower networks) share features with human CSFs, the result strongly depends on the architecture.

The result of the 2-layer networks is particularly interesting: the opponent chromatic channels display a narrow low-pass behavior, while the achromatic channel has substantially wider bandwidth and attenuates a bit the low frequencies (particularly for low contrasts). Another human-like feature is the fact that the attenuation increases with contrast. This means that the contrast response curves reduce their slope (saturate) with contrast, as is the case in human observers for achromatic (Legge & Foley,1980;Legge,1981) and chro-matic (Martinez-Uriegas,1997) gratings. Note that the slope (attenuation factor) decreases with contrast in both 2-layer architectures. In the sigmoid case the effect is strong in all channels but in the ReLU case it is also apparent in the achromatic channel. Of course, in these two-layer examples the agreement with human behavior is not complete: the maximum of the achromatic CSF is shifted towards high frequencies, and the decay at low frequencies is smaller than in humans.

Nevertheless, the resemblance of the behavior in the shallower architectures is remarkable in a learning scenario where no biological constraint has been introduced. As a result, considering only the 2-layer networks one could think that the selected goal (enhancement of retinal signals) is actually guiding the biological organization. However, the human-like features present in the 2-layers result disappear for other architectures that have similar or better performance in the goal.

For instance, in deeper architectures with ReLU the qualitative differences between the achromatic and chromatic channels go away: all sensitivities turn low-pass and a clearly non-human peak appears at high frequencies for high contrast sinusoids. This artifact in the achromatic channel is stronger in deeper architectures with sigmoid activation. Moreover, for 6- and 8-layers nets with sigmoids the sensitivity of the red-green channel is so small that does not appear in the plots. This explains the chromatic shift for these networks

(13)

Figure 6: Spatio-chromatic CSFs for gratings of different contrast. Different CNN autoencoders and the State-Of-The-Art (SOTA) system are considered. No normalization to the output of Eq.2has been applied.

in the colors of the images in Fig.2.

More interesting is the non-human nature of the CSFs of the SOTA system despite it is the network that excels in the goal. This is a compelling example of the fact that a better performance in terms of a sensible statistical goal does not necessarily imply the emergence of human-like behavior.

The presence of different sensitivity curves for different contrasts in Fig.6implies that the contrast response for a sinusoid may be nonlinear. Note for instance the points highlighted with circles (that correspond to sinusoids with f = 2 cpd). Variation of the attenuation factor with the contrast for the different shades of the circles implies that the slope changes. This is more clearly illustrated in Fig.7, that explicitly reproduces representative examples of these curves for achromatic sinusoids (in black), red-green sinusoids (in red), and blue-yellow sinusoids (in blue).

The 2-layer architectures (e.g. left column in Fig.7) display a saturating behavior similarly to the compressive contrast response in humans (Legge & Foley,1980;Legge,1981;Martinez-Uriegas,1997). However, deeper architectures display non-human (expansive) nonlinearities (as in the chromatic cases of the 8-layers ReLU) or roughly linear behavior (as in the 6-layer sigmoid). Note that in this case, the small gain of the red-green channel explains the chromatic distortion shown in Fig.2. Note that the contrast response of the SOTA system also departs from the linear behavior happening at the low-energy regime. The pattern found in the nonlinearities is not correlated with the achievement of higher accuracies in the goal.

The last spatio-chromatic result shows the CSFs obtained from the stimulation of the linearized models with gratings. This means applying the output/input ratios, similarly to Eq.2, to the matrices obtained from the nonlinear models via Eq.6. The result correspond-ing to the optimal linear network (obtained from Eq.5) is included as a convenient reference. Figure8shows these results together with

(14)

Figure 7: Representative examples of nonlinear responses for achromatic (top row) and chromatic (middle and bottom rows) sinusoids with f = 2 cpd. The linear behavior at the low-contrast regime has been plotted with dashed line as useful reference to check if the response is expansive or compressive with regard to the threshold behavior.

the achromatic CSF of the OSA Standard Spatial Observer (Watson & Malo,2002), and classic results of the RG and BY chromatic CSFs (Mullen,1985).

The linear network, the 2-layer networks and the SOTA system are the cases where the achromatic CSF displays a more clear band-pass result. The reduction of sensitivity for achromatic low frequencies is not as strong as in humans but there is a rough resemblance of the results in these cases. In the other CNN cases the achromatic channel becomes low-pass. Again, the 6- and 8-layers architectures with sigmoid activation display the artifact in red-green channel. In summary the linearized models corresponding to deeper CNNs display qualitatively less-human CSFs.

3.5 Spatio-temporal results

Figure9shows the attenuation factors found for moving sinusoids of low and high contrast in the plane (fx, ft). Similarly to the

spatio-chromatic case we find a contrast-dependent behavior (which is very strong in deeper networks), so we also provide the average attenuation factors over contrast. The CSF result of the optimal linear network and the human CSF in (Kelly,1979) have also been included as reference.

(15)

Figure 8: Spatio-chromatic CSFs for linearized networks. The optimal linear network (top-left) and the human CSFs from (Watson02,Mullen85) at the bottom-left are also included as reference. The artificial results (from the output/input ratios) were in the same range as in Fig.6. However, in each case, the three CSFs were normalized by the maximum of the achromatic channel for a better comparison with the human data, also normalized in that way.

The main feature of the spatio-temporal human window of visibility is its diamond shape, with smaller spatial bandwidth for higher temporal frequencies (or speeds) (Kelly,1979;Watson,2013). The linear solution and the 2-layer architectures have an average result with human-like diamond shape of human-like bandwidth. Deeper architectures, particularly with ReLU activations have less-human CSFs (non-diamond shape or reduced bandwidth).

The contrast dependence of the attenuation of sinusoids may reflect the saturation of human motion sensors tuned to specific (fx, ft) (Simoncelli & Heeger,1998;Morgan, Chubb, & Solomon,2006). In order to check this point, Fig.10shows the responses of

the networks for two spatio-temporal frequencies compared to the linear responses derived from the CSF.

In every case, the response for the lower spatial frequency is bigger, as in humans, and the nonlinear responses are mainly saturating. However, for deeper ReLU architectures (those departing from the diamond shape in the CSF) there is a strong deviation from the low-energy behavior.

4. Discussion

Summary of results

In the experiments we trained CNN autoencoders to minimize the reconstruction error of blurred and noisy retinal signals over a set of natural scenes. We considered the reconstruction of spatio-chromatic signals and spatio-temporal achromatic sequences with architectures made of 2D and 3D layers with conventional point-wise activation functions. We analyzed linearized versions of the networks and we found shift-invariant center-surround receptive fields and extended waves as eigenfunctions. These eigenfunctions show chromatic oscillations in cardinal directions of human-like color opponent spaces. The corresponding eigenvalues show that the networks are focused on medium and low spatio-temporal frequencies. This suggests a human-like frequency sensitivity in the CNNs

(16)

(17)

Figure 10: Representative examples of nonlinear responses for the different networks. In each architecture, the curves have been normalized to the maximum of the nonlinear response (i.e. the response for fx= 3.75cpd and ft= 3Hz).

and confirm that it is possible to represent their response through the ratio of the output/input amplitudes of spatio-chromatic and spatio-temporal gratings.

The stimulation of the networks with achromatic, RG and BY isoluminant sinusoids and moving sinusoids shows that: (i) 2D shal-low autoencoders display narrow shal-low-pass behavior in the chromatic channels and wider band-pass behavior in the achromatic channel. (ii) 3D shallow autoencoders display a diamond-shaped bandwidth in the spatio-temporal Fourier domain. (iii) Shallow autoencoders (both 2D and 3D) display saturating contrast responses for spatio-chromatic and spatio-temporal sinusoids. These human-like features of 2-layer architectures disappear for more complicated architectures despite the fact that these deeper architectures may have better performance in terms of the statistical goal. Particularly interesting is the case of a deep (38 layers) State-Of-The-Art system (Zhang et al.,2017;Tao et al.,2018) that excels in the goal but gets really noisy and non-human CSFs. Only after global linearization it shows human-like CSFs at the cost of a significant drop in the performance.

Relation to previous work

Our results revisit classical work on the statistical grounds of the CSFs (Atick & Redlich,1992;Atick et al.,1992) in light of the new possibilities provided by automatic differentiation.

From the technical point of view, a number of assumptions that had to be done in the 90’s, either have been confirmed with the use of large data sets, or are not necessary with the use of more flexible models. In particular, regarding the signal, Atick et al. assumed trans-lation invariance, independence between color and space-time, and 2nd order retrans-lations (autocorretrans-lation with 1/|f |2_{decay). Moreover,}

regarding the model, they restricted themselves to linear solutions similar to Wiener filters. More recent studies with colorimetrically calibrated scenes (Gutmann, Laparra, Hyv¨arinen, & Malo,2014) have confirmed the correctness of the color/space independence as-sumption. However, the focus on the power spectrum and the linear solutions has proven to be too limited for denoising (Guti´errez, Ferri, & Malo,2006). Adaptive (nonlinear) models that take into account additional features of the signal are required. Nevertheless, the nonlinear networks considered here turn out to be roughly translation invariant (see Fig.4) as expected from their convolutional architectures.

(18)

Other technical difference is in the formulation of the statistical goal: Atick et al. maximized the mutual information between the clean signal and the response I(xc, y); while here we minimized the RMSE between the clean signal and the response, |Xc− Y |2.

These goals are exactly equivalent when the difference between clean signal and the response is Gaussian, which is not the case in general. However, note that these goals are always related because the limit |Xc− Y |2→ 0 implies I(xc, y) → ∞.

Beyond the spatio-chromatic case, in our work we check the emergence of the CSF with spatio-temporal signals, which was mentioned but not addressed by Atick et al. Finally, the consideration of nonlinear models allows us to explore the consequences of the error-minimization goal in the saturation of the contrast responses which, of course, was not possible in the linear framework of Atick et al.

Simultaneously with our results, other group has been working independently on the emergence of CSFs in ANNs, which they will present at the next VSS meeting (Akbarinia, Morgenstern, & Gegenfurtner,2021). Their results are restricted to the spatial CSF (no chromatic nor spatio-temporal cases) and are based on networks trained for higher-level goals such as classification. Therefore, their results are complementary to ours, focused on lower-level signal enhancement at earlier visual stages such as the LGN. Bandwidth reduction at early stages may be a necessary preprocessing that simplifies downstream decision making, and hence it appears when imposing higher level goals.

Goal and architecture are not independent

More important than the technical generalizations over (Atick & Redlich,1992; Atick et al.,1992) is that the current freedom to explore the different linear and nonlinear architectures stresses the relevance of the architectural constraints. The conventional interpretation of of the efficient coding hypothesis (Barlow,1961) is the following: obtaining human-like results from certain statistical goal seems to suggest that the human visual system has been shaped by this goal. However, it is important to realize that the results have been obtained via the optimization of certain model. In the case of Atick et al. it was a single model (the linear filter), but in our case here we tried a range of models (architectures). Since the results for the different architectures is markedly different, the conclusion can not be about the goal, but about specific combinations of goal-and-architecture. Our results illustrate the fact that the computational and the algorithmic levels of analysis of visual processing systems (Marr & Poggio,1976;Marr,1982) are not independent. This prevents about premature conclusions about the organization principles at the abstract computational level if sensible architectures are not adopted.

Beyond accuracy

Human-like CSFs are obtained for shallow autoencoders (2-layers), or even linear networks, despite deeper architectures achieve similar or better performance in the goal. Previous literature has warned about the limitations of a single accuracy/performance measure to identify human-like behavior. Achieving similar performance on a task does not guarantee that two models actually use the same strategy (or algorithm) (Firestone,2020). For instance, different strategies may become evident if performance degrades in different ways when changing the experimental setting (Wichmann et al.,2017;Geirhos et al.,2019). Therefore, additional checks different from the optimization goal have to be done in order to confirm the human-like behavior of a model. Examples include verifying additional psychophysics not included in the goal (Martinez et al.,2019), or disaggregating the results checking the consistency between model and humans in individual trials, not on averages over the data set (Geirhos et al.,2020).

Individual Non-Euclidean distances from the optimization of average Euclidean distance

Another interesting consequence of our results is that a Euclidean measure (the RMSE) aggregated over the set of natural images leads to systems that measure individual differences in non-Euclidean ways. Note that (in the systems that we trained) given two input signals x and x + ∆x, the difference between the corresponding responses is ∆y ≈ M · ∆x. As a network should assess the difference between the two signals from ∆y, the perceived difference for the system will depend on λ (the eigenspectrum of M ) or on the CSF. In this way, when measuring differences between individual images, the perceptual metric of the network will be non-Euclidean. For

(19)

instance, high-frequency distortions will be less relevant for the network than medium or low-frequency distortions. Even though here we did not check the correlation between the image distortions perceived by humans and networks, the observed CSFs in the networks are consistent with the fact that RMSE is not a good representation of human distortion metrics (Wang & Bovik,2009;Laparra, Mu˜noz, & Malo,2010;Hepburn, Laparra, Malo, & Santos,2020).

The emergence of a non-Euclidean distance from the minimization of the average Euclidean distance is a counter intuitive conse-quence of the highly nonuniform distribution of natural scenes: departures in less populated regions of the image space (e.g. in high frequency directions) have to be underrated to favour the average match to the data in highly populated regions.

Final remarks

In visual neuroscience, deep models are emerging as the new standard to reproduce the activity of visual areas under natu-ral scene stimulation. On the one hand, conventional deep models driven by object recognition goals reproduce the response from V1 (Kriegeskorte,2015;Güçlü & Gerven,2015) to IT (Cadieu et al.,2014;Yamins et al.,2014). On the other hand, deep networks are powerful enough to fit the mappings between stimuli and measured responses (Prenger, Wu, David, & Gallant,2004;Antol´ık, Hofer, Bednar, & Mrsic-Flogel,2016; Batty et al., 2017). These two approaches (goal-driven and measurement-driven deep models) have been thoroughly compared in V1 and were found to be superior to linear filter-banks and simple linear-nonlinear models (Cadena et al.,2019). However, more recently, the same team has shown that linear-nonlinear models with general Divisive Normalization make a significant step towards the performance of state-of-the-art CNN with interpretable parameters (Burg et al.,2021).

In our goal-driven case for visual psychophysics, the emergence of human-like spatio-chromatic and spatio-temporal CSFs for certain CNN autoencoders generalizes in different ways previous statistical explanations of the CSF based on linear models (Atick & Redlich,1992;Atick et al.,1992). However, we find a strong dependence of the CSFs on the architecture for similar performance mea-sures. Our results suggest that when checking a statistical goal, one should be careful about the architecture of the considered model. Similarly to the move from conventional CNNs in (Cadena et al.,2019) to more realistic divisive normalization models in (Burg et al.,

2021), we think that future goal-driven derivations of visual psychophysics should include more realistic architectures too. Examples include divisive normalization with interpretable interaction between parametrized features (Martinez et al.,2018,2019) and general-izations of Wilson-Cowan interactions (Bertalm´ıo et al.,2020). The deep learning framework is excellent to consider additional factors in the goal that shapes the organization of the system. For instance, what would be the impact of information constraints on contrast responses and CSFs?. Learning frameworks with rate-distortion constraints are already available (Ball´e, Laparra, & Simoncelli,2017), and we advocate for the study of their artificial psychophysics using realistic and interpretable architectures as opposed to conventional CNNs.

References

Abadi, M., & Zheng, X. (2016). Tensorflow: Large-scale machine learning on heterogeneous distributed systems.

Akbarinia, A., Morgenstern, Y., & Gegenfurtner, K. (2021). Contrast sensitivity is formed by visual experience and task demands. In J. Vision: Proc. Vis. Sci. Soc. VSS-2021.

Antol´ık, J., Hofer, S. B., Bednar, J. A., & Mrsic-Flogel, T. D. (2016). Model constrained by visual hierarchy improves prediction of neural responses to natural scenes. PLOS Computational Biology, 12(6), doi:10.1371/journal.pcbi.1004927.

Atick, J., Li, Z., & Redlich, A. (1992). Understanding retinal color coding from first principles. Neural Comp., 4(4), 559-572. Atick, J., & Redlich, A. (1992). What does the retina know about natural scenes? Neural Comp., 4(2), 196-210.

Ball´e, J., Laparra, V., & Simoncelli, E. P. (2017). End-to-end optimized image compression. In Int. conf. learn. repres. (ICLR). Barlow, H. (1961). Possible principles underlying the transformation of sensory messages. In W. Rosenblith (Ed.), Sensory

(20)

Batty, E., Merel, J., Brackbill, N., Heitman, A., Sher, A., Litke, A. M., et al. (2017). Multilayer recurrent network models of primate retinal ganglion cell responses. In 5th int. conf. learn. repres., ICLR. OpenReview.net.

Bengio, Y., Lamblin, P., Popovici, D., & Larochelle, H. (2006). Greedy layer-wise training of deep networks. In Proc. 19th NIPS (p. 153–160). Cambridge, MA, USA: MIT Press.

Berardino, A., Ball´e, J., Laparra, V., & Simoncelli, E. (2017). Eigen-distortions of hierarchical representations. In Proc. NIPS 17 (p. 3533–3542).

Bertalm´ıo, M., Gomez-Villa, A., Mart´ın, A., Vazquez, J., Kane, D., & Malo, J. (2020). Evidence for the intrinsically nonlinear nature of receptive fields in vision. Scientific Reports, 10, 16277.

Burg, M. F., Cadena, S. A., Denfield, G. H., Walker, E. Y., Tolias, A. S., Bethge, M., et al. (2021). Learning divisive normalization in primary visual cortex. bioRxiv. Available fromhttps://www.biorxiv.org/content/early/2021/03/08/767285

, doi:10.1101/767285

Cadena, S., Denfield, G., Walker, E., Gatys, L., Tolias, A., Bethge, M., et al. (2019). Deep convolutional models improve predictions of macaque v1 responses to natural images. PLoS Comput. Biol., 15(4), e1006897.

Cadieu, C., Hong, H., Yamins, D., Pinto, N., Ardila, D., Solomon, E., et al. (2014). Deep neural networks rival the representation of primate it cortex for core visual object recognition. PLoS Comput. Biol., 10(12), e1003963.

Campbell, F., & Robson, J. (1968). Application of Fourier analysis to the visibility of gratings. Journal of Physiology, 197, 551–566. Clarke, R. (1981). Relation between the karhunen lo`eve and cosine transforms. IEE Proceedings F (Comm. Radar Sig. Proc.), 128,

359-361.

Firestone, C. (2020). Performance vs. competence in human–machine comparisons. Proc. Nat. Acad. Sci., 117(43), 26562–26571. Geirhos, R., Meding, K., & Wichmann, F. A. (2020). Beyond accuracy: quantifying trial-by-trial behaviour of CNNs and humans by

measuring error consistency. arXiv: 2006.16736.

Geirhos, R., Rubisch, P., Michaelis, C., Bethge, M., Wichmann, F. A., & Brendel, W. (2019). Imagenet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness. In Int. Conf. Learn. Repr. https://arxiv.org/abs/1811.12231. Glorot, X., & Bengio, Y. (2010). Understanding the difficulty of training deep feedforward neural networks. In Y. W. Teh & M.

Titter-ington (Eds.), Proc. 13th Int. Conf. Artif. Intell. Stats. (Vol. 9, pp. 249–256). JMLR Workshop and Conference Proceedings. Gomez-Villa, A., Martin, A., Vazquez, J., & Bertalmio, M. (2019). Convolutional neural networks can be deceived by visual illusions.

In Proc. IEEE Conf. Comp. Vis. Patt. Recogn. (pp. 12309–12317).

Gomez-Villa, A., Martin, A., Vazquez, J., Bertalm´ıo, M., & Malo, J. (2020, Nov). Color illusions also deceive CNNs for low-level vision tasks: Analysis and implications. Vision Research, 176, 156-174.

Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep learning. MIT Press. (http://www.deeplearningbook.org) Güçlü, U., & Gerven, M. A. J. van. (2015). Deep neural networks reveal a gradient in the complexity of neural representations across

the ventral stream. Journal of Neuroscience, 35(27), 10005–10014.

Guti´errez, J., Ferri, F. J., & Malo, J. (2006). Regularization operators for natural images based on nonlinear perception models. IEEE Trans. Im. Proc., 15(1), 189–200.

Gutmann, M. U., Laparra, V., Hyv¨arinen, A., & Malo, J. (2014). Spatio-chromatic adaptation via higher-order canonical correlation analysis of natural images. PloS ONE, 9(2), e86481.

He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In 2016 IEEE conf. comp. vis. patt. recogn. (cvpr)(p. 770-778).

Hepburn, A., Laparra, V., Malo, J., & Santos, R. (2020). Perceptnet: A human visual system inspired neural network for estimating perceptual distance. In 2020 IEEE Int. Conf. Im. Proc. (ICIP) (p. 121-125).

(21)

Hyv¨arinen, A., Hurri, J., & Hoyer, P. (2009). Natural image statistics: A probabilistic approach to early computational vision. Heidelberg, Germany: Springer-Verlag.

Ingling, C. R., & Martinez-Uriegas, E. (1983). The relationship between spectral sensitivity and spatial sensitivity for the primate r-g X-channel. Vision Research, 23(12), 1495-1500.

Kelly, D. H. (1979). Motion and vision. ii. stabilized spatio-temporal threshold surface. J. Opt. Soc. Am., 69(10), 1340–1349. Kingma, D. P., & Ba, J. (2017). Adam: A method for stochastic optimization. arXiv:1412.6980.

Komatsu, R., & Gonsalves, T. (2020). Comparing U-Net based models for denoising color images. AI, 1(4), 465–486. , doi:10.3390/ai1040029

Kriegeskorte, N. (2015). Deep neural networks: A new framework for modeling biological vision and brain information processing. Ann. Rev. Vis. Sci., 1(1), 417-446.

Laparra, V., Mu˜noz, J., & Malo, J. (2010). Divisive normalization image quality metric revisited. JOSA A, 27(4), 852–864. Legge, G. E. (1981). A power law for contrast discrimination. Vision Research, 21(4), 457-467.

Legge, G. E., & Foley, J. M. (1980). Contrast masking in human vision. J. Opt. Soc. Am., 70(12), 1458–1471.

Lillicrap, T., Santoro, A., Marris, L., Akerman, C., & Hinton, G. (2020). Backpropagation and the brain. Nat. Rev. Neurosci., 21(6), 335-346.

Malo, J. (2020). Spatio-chromatic information available from different neural layers via gaussianization. J. Math. Neurosci., 10(18), 10.1186/s13408-020-00095-8.

Marr, D. (1982). Vision: A computational approach. San Francisco, CA, USA: Freeman & Co.

Marr, D., & Poggio, T. (1976). From understanding computation to understanding neural circuitry (AI Memo No. AIM-357). MIT. Martinez, M., Bertalm´ıo, M., & Malo, J. (2019). In praise of artifice reloaded: Caution with natural image databases in modeling vision.

Front. Neurosci. doi: 10.3389/fnins.2019.00008.

Martinez, M., Cyriac, P., Batard, T., Bertalm´ıo, M., & Malo, J. (2018). Derivatives and inverse of cascaded linear+nonlinear neural models. PLoS ONE, 13(10), doi:10.1371/journal.pone.0201326.

Martinez-Otero, L., Molano, M., Wang, X., Sommer, F., & Hirsch, J. (2014). Statistical wiring of thalamic receptive fields optimizes spatial sampling of the retinal image. Neuron, 81(4), 943-956.

Martinez-Uriegas, E. (1994). Visual science and engineering: Models and applications. In K. D. H (Ed.), (p. 117-187). CRC Press. Martinez-Uriegas, E. (1997). Color detection and color contrast discrimination thresholds. Proc. OSA Meeting, 81.

Morgan, M., Chubb, C., & Solomon, J. (2006). Predicting the motion after-effect from sensitivity loss. Vision Research, 46(15), 2412-2420.

Mullen, K. T. (1985). The CSF of human colour vision to red-green and yellow-blue chromatic gratings. J. Physiol., 359, 381–400. Oord, A. van den, & Schrauwen, B. (2014). The student-t mixture as a natural image patch prior with application to image compression.

Journal of Machine Learning Research, 15(60), 2061-2086.

Prenger, R., Wu, M. C.-K., David, S. V., & Gallant, J. L. (2004). Nonlinear v1 responses to natural scenes revealed by neural network analysis. Neural Networks, 17(5), 663-679. , doi:https://doi.org/10.1016/j.neunet.2004.03.008

Priyanka, S., & Wang, Y. (2019). Fully symmetric convolutional network for effective image denoising. Applied Sciences, 9(4). Richards, B., & al. et. (2019, Oct). A deep learning framework for neuroscience. Nature Neuroscience, 22, 1761–1770.

Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., et al. (2015). Imagenet large scale visual recognition challenge. Int. J. Comput. Vis., 115, 211–252.

Shapley, R., & Hawken, M. J. (2011). Color in the cortex: single- and double-opponent cells. Vision Research, 51(7), 701-717. (Vision Research 50th Anniversary Issue: Part 1)

Simoncelli, E., & Heeger, D. (1998). A model of neuronal reponses in visual area MT. Vis. Res., 38(5), 743-761.

(22)

Tao, X., Gao, H., Shen, X., Wang, J., & Jia, J. (2018). Scale-recurrent network for deep image deblurring. In Proc. ieee conf. comp. vis. patt. recogn.(pp. 8174–8182).

Wang, Z., & Bovik, A. C. (2009). Mean squared error: Love it or leave it? A new look at signal fidelity measures. IEEE Signal Processing Magazine, 26(1), 98–117.

Watson, A. B. (2013). High frame rates and human vision: A view through the window of visibility. SMPTE Motion Imaging Journal, 122(2), 18-32.

Watson, A. B., & Malo, J. (2002). Video quality measures based on the standard spatial observer. In IEEE proc. int. conf. im. proc. (Vol. 3, p. III-III).

Welles, O. (1946). The stranger. RKO Radio Picures, CC Public Domain Mark 1.0, https://archive.org/details/TheStranger720p. Wichmann, F. A., Janssen, D. H. J., Geirhos, R., Aguilar, G., Sch¨utt, H. H., Maertens, M., et al. (2017). Methods and measurements to

compare men against machines. Electronic Imaging, 36-45(10).

Yamins, D., & DiCarlo, J. (2016). Using goal-driven deep learning models to understand sensory cortex. Nat. Neurosci., 19, 356–365. Yamins, D., Hong, H., Cadieu, C. F., Solomon, E. A., Seibert, D., & DiCarlo, J. J. (2014). Performance-optimized hierarchical models

predict neural responses in higher visual cortex. Proc. Nat. Acad. Sci., 111(23), 8619–8624. , doi:10.1073/pnas.1403112111 Zhang, K., Zuo, W., Chen, Y., Meng, D., & Zhang, L. (2017). Beyond a gaussian denoiser: Residual learning of deep CNN for image