Beyond Implicit Regularization: Avoiding Overfitting via Regularizer Mirror Descent

(1)

Avoiding Overfitting via Regularizer Mirror Descent

Navid Azizan^{1 2} Sahin Lale³ Babak Hassibi³

Abstract

It is widely recognized that, despite perfectly interpolating the training data, deep neural networks (DNNs) can still generalize well due in part to the

“implicit regularization” induced by the learning algorithm. Nonetheless, “explicit regularization”

(or weight decay) is often used to avoid overfitting, especially when the data is known to be corrupted.

There are several challenges with using explicit regularization, most notably unclear convergence properties. In this paper, we propose a novel vari- ant of the stochastic mirror descent (SMD) algorithm, called regularizer mirror descent (RMD), for training DNNs. The starting point for RMD is a cost which is the sum of the training loss and any convex regularizer of the network weights. For highly-overparameterized models, RMD provably converges to a point “close” to the optimal solution of this cost. The algorithm imposes virtually no additional computational burden compared to stochastic gradient descent (SGD) or weight decay, and is parallelizable in the same manner that they are. Our experimental results on training sets that contain some errors suggest that, in terms of generalization performance, RMD outperforms both SGD, which implicitly regularizes for the

`2norm of the weights, and weight decay, which explicitly does so. This makes RMD a viable option for training with regularization in DNNs.

In addition, RMD can be used for regularizing the weights to be close to a desired weight vector, which is particularly relevant for continual learning.

1Massachusetts Institute of Technology, Cambridge, Mas- sachusetts, USA²Stanford University, Stanford, California, USA

3California Institute of Technology, Pasadena, California, USA.

Correspondence to: Navid Azizan <[email protected]>.

Presented at the ICML 2021 Workshop on Overparameterization:

1. Introduction

Today’s deep neural networks are typically highly- overparameterized and therefore have such a large capacity that they can easily overfit training data to zero training error (Zhang et al.,2016). Furthermore, it is now widely recognized that such networks can still generalize well despite (over)fitting the training data (Bartlett et al.,2020;Belkin et al.,2018;2019), which is, in part, attributed to the “implicit regularization” (Gunasekar et al.,2018a;Azizan &

Hassibi,2019;Neyshabur et al.,2015) effect of the optimization algorithms used for training, such as stochastic gradient descent (SGD). However, in many cases, especially when the training data is known to include corrupted samples, it is still highly desirable to avoid overfitting the training data through some form of regularization (Goodfellow et al., 2016;Kukaˇcka et al.,2017). This can be done through early stopping, explicit regularization of the network parameters, or weight decay. However, the main challenge with these ap- proaches is that their convergence properties are not known and they do not come with any performance guarantees.

In this paper, we propose a novel optimization algorithm for training deep neural networks with any desired convex regularizer, which comes with provable convergence guarantees.

The algorithm, called regularizer mirror descent (RMD), is designed based on the implicit regularization property of stochastic mirror descent (SMD). It is computationally efficient, as it imposes no additional overhead compared to the standard SGD, and can run in mini-batches and/or be distributed in the same manner as in SGD. We evaluate the performance of RMD in achieving the desired regularization effect and generalization error, using a ResNet-18 neural network architecture on a CIFAR-10 dataset with some frac- tion of the labels corrupted. The results show that, in terms of generalization performance, RMD outperforms the SGD, which regularizes the `2norm of the weights, as well as the weight decay algorithm, which explicitly does so.

2. Background

We review some background about stochastic gradient methods and different forms of regularization.

(2)

2.1. Stochastic Gradient Descent

Let Li(w) denote the loss on the data point i for a weight vector w ∈ R^p. For a training set consisting of n data points, the total loss isPn

i=1Li(w), which is typically attempted to be minimized by stochastic gradient descent (Robbins

& Monro,1951) or one of its variants (such as mini-batch, distributed, adaptive, with momentum, etc.). Denoting the model parameters at the t-th time step by wt∈ R^pand the index of the chosen data point by i, the update rule of SGD can be simply written as

wt= wt−1− η∇Li(wt−1), t ≥ 1, (1) where η is the so-called learning rate, w0is the initialization, and ∇Li(·) is the gradient of the loss. When trained with SGD, typical deep neural networks (which have many more parameters than the number of data points) achieve zero training error (Zhang et al.,2016), or, in other words,

“interpolate” the training data (Ma et al.,2018).

2.2. Explicit Regularization

As mentioned earlier, it is often desirable to avoid (over)fitting the training data to zero error, e.g., when the data has some corrupted labels. In such scenarios, it is beneficial to augment the loss function with a (convex and differentiable) regularizer ψ : R^p→ R, and consider

min

w λ

n

X

i=1

L_i(w) + ψ(w), (2)

where λ ≥ 0 is a hyper-parameter that controls the strength of regularization relative to the loss function. A simple and common choice of regularizer is ψ(w) = ¹₂kwk², which is commonly referred to as weight decay.

Note that the bigger λ is, the more effort in the optimization is spent on minimizingPn

i=1L_i(w). Since Li(w) ≥ 0, the lowest these terms can get is zero, and thus, for λ → ∞, the problem would be equivalent to the following:

minw ψ(w)

s.t. L_i(w) = 0, i = 1, . . . , n.

(3)

2.3. Implicit Regularization

In recent years, several papers have noted that, even without imposing any explicit regularization in the objective, i.e., by optimizing only the loss functionPn

i=1L_i(w), there is still an implicit regularization induced by the optimization algorithm used for training (Gunasekar et al.,2018a;b;Azizan

& Hassibi,2019). In particular, it has been shown that SGD tends to converge to interpolating solutions with minimum

`₂norm (Gunasekar et al.,2018a;Poggio et al.,2020), i.e., minw kwk2

s.t. Li(w) = 0, i = 1, . . . , n.

More generally, it has been shown (Azizan & Hassibi,2019;

Azizan et al.,2021) that stochastic mirror descent (SMD), whose update rule is defined for a differentiable strictly- convex “potential function” ψ(·) as

∇ψ(wt) = ∇ψ(wt−1) − η∇Li(wt−1), (4) with proper initialization (w0≈ 0)¹and sufficiently small learning rate converges to the solution of²

minw ψ(w)

s.t. Li(w) = 0, i = 1, . . . , n.

(5)

Note that this is equivalent to the case of explicit regularization with λ → ∞, i.e., problem (3).

3. Proposed Method: Regularizer Mirror Descent (RMD)

When it is undesirable to reach zero training error, e.g., due to the presence of corrupted samples in the data, one cannot rely on the implicit regularization to avoid overfitting. On the other hand, the explicit regularization methods do not come with any convergence guarantees. In this section, we propose a new algorithm, called Regularizer Mirror Descent (RMD), which provably regularizes the weights for any desired differentiable strictly-convex regularizer.

We are interested in solving the explicitly-regularized optimization problem (2). Let us define an auxiliary variable z ∈ Rⁿ with elements z[1], . . . , z[n]. The optimization problem (2) can be transformed into the following form:

minw,z λ

n

X

i=1

z²[i]

2 + ψ(w) s.t. z[i] =p

2L_i(w), i = 1, . . . , n.

(6)

The objective of this optimization problem is a strictly- convex function ˆψw

z

= ψ(w) + ^λ₂kzk², and there are n equality constraints. Note that this new form is similar to the implicitly-regularized optimization problem (5), which is solved by a stochastic mirror descent algorithm.

Thus, we can define an SMD algorithm for solving this

1For practical reasons, one might not be able to set the initial weight vector exactly to zero. However, it can be initialized ran- domly around zero, which is common practice, and that would be of no major consequence.

2See Section5for a more precise statement.

(3)

Algorithm 1 Regularizer Mirror Descent (RMD) Require: λ, η, w0

1: Initialization: w ← w0, z ← 0 2: repeat

3: Take a data point i 4: c ← η ˆ`⁰

z[i] −p2Li(w) 5: w ← ∇ψ⁻¹

∇ψ(w) +√ ^c

2L_i(w)∇Li(w)

6: z[i] ← z[i] −_λ^c

7: until convergence 8: return w

new problem. To do so, in order to enforce the constraints z[i] =p2Li(w), we define a “constraint-enforcing” loss

`ˆ

z[i] −p2Li(w)

, where ˆ`(·) is a differentiable and convex function with a unique root at 0 (an example is the square loss ˆ`(·) = ^(·)₂²).

At time t, when the i-th training sample is chosen for updat- ing the model, the update rule of RMD is given as:

∇ψ(wt) = ∇ψ(wt−1) + ct,i

p2L_i(w_t−1)∇Li(wt−1), z_t[i] = z_t−1[i] − c_t,i

λ ,

zt[j] = zt−1[j], ∀j 6= i, (7)

where ˆ`⁰(·) is the derivative of the constraint-enforcing loss function, ct,i= η ˆ`⁰

zt−1[i] −p2Li(wt−1)

and the vari- ables are initialized with w0= 0 and z0= 0. Note that because of strict convexity of the regularizer ψ(·), its gradient

∇ψ(·) is an invertible function, and the above update rule is well-defined. Algorithm1summarizes the procedure.³This iterative algorithm, as will be shown in Section5, provably solves the optimization problem (2).

4. Experimental Results

In this section, we evaluate the regularization performance of RMD in comparison with both implicit regularization induced by the standard SGD and explicit regularization performed through weight decay. The results indicate that RMD outperforms both of the alternatives by a significant margin, in terms of the generalization capability, thus mak- ing it a viable option for regularization.

4.1. Setup

Dataset. To test the ability of different regularization methods in avoiding overfitting, we need a training set that

3Code for Regularizer Mirror Descent (RMD) is available at https://github.com/SahinLale/

RegularizerMirrorDescent

does not consist entirely of clean data. Thus, we took the popular CIFAR-10 dataset (Krizhevsky & Hinton,2009), which has 10 classes and n = 50, 000 training data points, and corrupted 25% of the data points, by assigning them a random label. Since, for each of those images, there is a 0.9 chance of being assigned a wrong label, roughly 0.9 × 25% = 22.5% of the training data has incorrect labels.

Network Architecture. We use a standard ResNet-18 (He et al.,2016) architecture, which is commonly used for the CIFAR-10 dataset. The network has 18 layers, and around 11 million parameters. Thus, it qualifies as highly overparameterized. We do not make any changes to the network.

Algorithms. We use three different algorithms for optimization/regularization.

1) Standard SGD (implicit regularization): First, we train the network with the standard (mini-batch) SGD. While there is no explicit regularization, this is still known to induce an implicit regularization, as discussed in Section2.3.

2) Explicit regularization via weight decay: This time, we train the network with an explicit `2-norm regularization, through weight decay.

3) Explicit regularization via RMD: Finally, we train the network with RMD, which is provably regularizing with an `2norm. Note that we choose the same regularization parameter λ for both weight decay and RMD.

In all three cases, we train in mini batches—mini-batch RMD is summarized in Algorithm2.

4.2. Results

The training and test accuracies for all the three methods are given in Table1. As expected, because the network is highly overparameterized, SGD interpolates the training data and achieves almost 100% training accuracy. The test accuracy for SGD is pretty low, just shy of 80%, perhaps unsurpris- ingly, given that it has fit the corrupted samples, too. We varied the regularizer parameter for weight decay over a large range but were only able to achieve marginally better generalization performance than SGD. RMD, on the other hand, significantly outperforms SGD and weight decay, and achieves a test accuracy of almost 87%. This performance was not too sensitive to λ in the ranges we considered.

5. Convergence Guarantees

Here, we state the main theoretical result. See AppendixD for the details. Let us define a new learning problem over parameters ˆw = w

z

∈ R^p+n with the model fˆ_i( ˆw) =p2L_i(w) − z[i], ˆy_i = 0, and the loss ˆL_i( ˆw) =

(4)

Table 1. Comparison between RMD and the two baselines, (1) implicit regularization induced by SGD and (2) explicit regularization through weight decay

Algorithm Lambda Training Accuracy

Test Accuracy

SGD N/A 99.70% 79.42%

RMD 0.9 70.20% 85.43%

1.0 71.81% 85.71%

1.1 72.90% 86.12%

1.2 74.38% 86.92%

1.3 75.04% 86.35%

1.4 75.16% 86.62%

1.5 75.84% 85.74%

1.6 77.56% 85.72%

Weight Decay 0.005 69.95% 75.59%

0.008 78.69% 62.11%

0.01 83.28% 60.92%

0.05 97.20% 72.43%

0.8 99.38% 80.56%

0.9 99.46% 79.80%

1 99.65% 79.69%

1.2 99.65% 80.62%

`ˆ ˆ

yi− ˆfi( ˆw)

= ˆ`

z[i] −p2Li(w)

for i = 0, . . . , n.

Note that in this new problem, we now have p + n parameters and n constraints/data points, and since p n, we have p + n n, and we are still in the highly-overparameterized regime (even more so). Therefore, we can also define the set of interpolating solutions for the new problem as

W =ˆ n

w ∈ Rˆ ^p+n| ˆfi(w) = ˆyi, i = 1, . . . , no . (8)

Let us define a potential function ˆψ ( ˆw) = ψ(w) +^λ₂kzk² and a corresponding SMD

∇ ˆψwt

zt

= ∇ ˆψwt−1

zt−1

− η∇ ˆLi

wt−1

zt−1

,

initialized at ˆw₀=w0

~0

. It is straightforward to verify that this update rule is indeed equivalent to the update rule of RMD given in (7).

Let us also define ˆ

w^∗= arg min

w∈Rˆ ^p+n

Dψˆ( ˆw, ˆw0)

s.t. fˆ_i( ˆw) = ˆy_i, i = 1, . . . , n.

(9)

Plugging Dψˆ( ˆw, ˆw0) = Dψ(w, w0) +^λ₂kzk²and ˆfi( ˆw) = p2L_i(w) − z[i] into (9), it is easy to see that it is equivalent to (6) for w₀= 0.

Assumption 1 Denote the initial point by wˆ₀ =

w0

~0

. There exists wˆ ∈ Wˆ and a region B =ˆ w⁰

z⁰

∈ R^p+n| Dψˆ

ˆ w,w⁰

z⁰

≤

containing ˆ

w₀, such thatD_L_ˆ

i

ˆ w,w⁰

z⁰

≥ 0, i = 1, . . . , n, for all

w⁰ z⁰

∈ ˆB.

Assumption 2 Consider the region ˆB in Assumption 1.

The ˆfi(·) have bounded gradient and Hessian on the convex hull of ˆB, i.e.,

∇ ˆfi

w⁰ z⁰

≤ γ, and α ≤ λ_min

Hfˆ_i

w⁰ z⁰

≤ λ_max

Hfˆ_i

w⁰ z⁰

≤ β, i = 1, . . . , n, for allw⁰

z⁰

∈ conv ˆB.

Theorem 1 Consider the set of interpolating solutions ˆW defined in(8), the closest such solution ˆw^∗defined in(9), and the RMD iterates given in(7) initialized atw0

~0

, where every data point is revisited after some steps. Under As- sumptions1and2, for sufficiently small step size, i.e., for anyη > 0 for which ˆψ(·) − η ˆLi(·) is strictly convex on ˆB for all i, the following holds.

1. The iterates converge tow_∞ z∞

∈ ˆW.

2. D_ψ_ˆ

ˆ w^∗,w_∞

z_∞

= o().

6. Conclusion and Outlook

We presented Regularizer Mirror Descent (RMD), a novel efficient algorithm for training deep neural networks. The starting point for RMD is a cost which is the sum of the training loss and any differentiable strictly-convex regularizer of the network weights. For highly-overparameterized models, RMD provably converges to a point “close” to the optimal solution of this cost. The algorithm can be readily applied to any deep neural network and enjoys the same parallelization properties as SGD. We demonstrated that RMD outperforms both the implicit regularization induced by SGD, and the explicit regularization performed via weight decay, by a significant margin.

Given that RMD enables training any network efficiently with a desired regularizer, it opens up several avenues for fu- ture research. In particular, an extensive experimental study of the effect of different regularizers on different datasets and different architectures would be instrumental to uncov- ering the role of regularization in modern learning problems.

(5)

References

Azizan, N. and Hassibi, B. Stochastic gradient/mirror descent: Minimax optimality and implicit regularization. In International Conference on Learning Representations (ICLR), 2019.

Azizan, N., Lale, S., and Hassibi, B. Stochastic mirror descent on overparameterized nonlinear models. IEEE Transactions on Neural Networks and Learning Systems, arXiv:1906.03830, 2021.

Bartlett, P. L., Long, P. M., Lugosi, G., and Tsigler, A.

Benign overfitting in linear regression. Proceedings of the National Academy of Sciences, 117(48):30063–30070, 2020.

Belkin, M., Hsu, D. J., and Mitra, P. Overfitting or per- fect fitting? risk bounds for classification and regression rules that interpolate. In Advances in Neural Information Processing Systems, volume 31, 2018.

Belkin, M., Hsu, D., Ma, S., and Mandal, S. Reconciling modern machine-learning practice and the classical bias–

variance trade-off. Proceedings of the National Academy of Sciences, 116(32):15849–15854, 2019.

Farajtabar, M., Azizan, N., Mott, A., and Li, A. Orthogonal gradient descent for continual learning. In International Conference on Artificial Intelligence and Statistics, pp.

3762–3773. PMLR, 2020.

Goodfellow, I., Bengio, Y., and Courville, A. Regularization for deep learning. Deep learning, pp. 216–261, 2016.

Gunasekar, S., Lee, J., Soudry, D., and Srebro, N. Charac- terizing implicit bias in terms of optimization geometry.

In International Conference on Machine Learning, pp.

1827–1836, 2018a.

Gunasekar, S., Lee, J. D., Soudry, D., and Srebro, N. Im- plicit bias of gradient descent on linear convolutional networks. In Advances in Neural Information Processing Systems, volume 31, 2018b.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778, 2016.

Kirkpatrick, J., Pascanu, R., Rabinowitz, N., Veness, J., Des- jardins, G., Rusu, A. A., Milan, K., Quan, J., Ramalho, T., Grabska-Barwinska, A., et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the national academy of sciences, 114(13):3521–3526, 2017.

Krizhevsky, A. and Hinton, G. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.

Kukaˇcka, J., Golkov, V., and Cremers, D. Regulariza- tion for deep learning: A taxonomy. arXiv preprint arXiv:1710.10686, 2017.

Lopez-Paz, D. and Ranzato, M. A. Gradient episodic mem- ory for continual learning. In Advances in Neural Infor- mation Processing Systems, volume 30, 2017.

Ma, S., Bassily, R., and Belkin, M. The power of interpola- tion: Understanding the effectiveness of SGD in modern over-parametrized learning. In Proceedings of the 35th In- ternational Conference on Machine Learning, volume 80, pp. 3325–3334. PMLR, 2018.

Neyshabur, B., Tomioka, R., and Srebro, N. In search of the real inductive bias: On the role of implicit regularization in deep learning. In ICLR (Workshop), 2015. URLhttp:

//arxiv.org/abs/1412.6614.

Poggio, T., Banburski, A., and Liao, Q. Theoretical issues in deep networks. Proceedings of the National Academy of Sciences, 117(48):30039–30045, 2020.

Robbins, H. and Monro, S. A stochastic approximation method. The annals of mathematical statistics, pp. 400–

407, 1951.

Zhang, C., Bengio, S., Hardt, M., Recht, B., and Vinyals, O.

Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530, 2016.

(6)

A. Special Cases of RMD

A.1. Special Case: q-norm Potential

An important special case of RMD is when the potential function ψ(·) is chosen to be the `_q norm, i.e., ψ(w) = ¹_qkwk^q_q =

1 q

Pp

k=1|w[k]|^q, for a positive integer q. Let the current gradient be denoted by g := ∇Li(wt−1). In this case, the update rule can be written as:









 wt[k] =

|wt−1[k]|^q−1sign(wt−1[k]) + η ˆ`⁰

zt−1[i] −p2Li(wt−1) p2Li(wt−1) g[k]

1 q−1×

sign

|wt−1[k]|^q−1sign(wt−1[k]) + η ˆ`⁰

zt−1[i] −p2Li(wt−1) p2L_i(w_t−1) g[k]

, ∀k z_t[i] = z_t−1[i] −η

λ

`ˆ⁰

z_t−1[i] −p

2L_i(w_t−1) ,

zt[j] = zt−1[j], ∀j 6= i, (10)

where w_t[k] denotes the k-th element of wt(the weight vector at time t), and g[k] is the k-th element of the current gradient g. Note that for this choice of potential function, the update rule is separable, in the sense that the update for the k-th element of the weight vector requires only the k-th element of the weight and gradient vectors. This allows for efficient parallel implementation of the algorithm, which is crucial for large-scale tasks. One can further choose ˆ`(·) = ^(·)₂², which implies ˆ`⁰(·) = (·) and simplifies the updates to the following form:











w_t[k] =

|wt−1[k]|^q−1sign(w_t−1[k]) + η

1 q−1×

sign

|wt−1[k]|^q−1sign(wt−1[k]) + η

z_t−1[i] −p2L_i(w_t−1) p2L_i(w_t−1) g[k]

, ∀k zt[i] = zt−1[i] − η

λ

zt−1[i] −p

2Li(wt−1) ,

zt[j] = zt−1[j], ∀j 6= i, (11)

Even among the family of q-norm RMD algorithms, there can be a wide range of effects on the weights for different q. We discuss some important examples in more detail.

`1 norm regularization promotes sparsity in the weights. Sparsity is often desirable for reducing the storage and/or computational load, given the massive size of today’s deep neural networks. However, since `1-norm is neither differentiable nor strictly convex, we propose to use ψ(w) = ₁₊¹ kwk¹⁺₁₊for some small > 0. While most sparsification and pruning methods for neural networks are post hoc, an advantage of this method is that it optimizes for sparsity while training the network.

`∞norm regularization promotes bounded and small range of weights. With this choice of potential, the weights tend to concentrate around a small interval. This is often desirable in various implementations of neural networks since it provides a small dynamic range for quantization of weights, which reduces the production cost and computational complexity. However, since `∞is not differentiable, one can choose a large value for q, e.g., q = 10 and implement ψ(w) = ₁₀¹kwk¹⁰₁₀to achieve the desirable regularization effect of `∞.

`₂norm still promotes small weights, similar to `1norm, but to a lesser extent. The update rule is:

(7)











wt[k] = wt−1[k] + η ˆ`⁰

z_t−1[i] −p2L_i(w_t−1)

p2Li(wt−1) g[k], ∀k

z_t[i] = z_t−1[i] − η ˆ`⁰

λ ,

z_t[j] = z_t−1[j], ∀j 6= i. (12)

A.2. Special Case: Negative Entropy Potential

One can choose the potential function ψ(·) to be the negative entropy, i.e., ψ(w) =Pp

k=1w[k] log(w[k]). For this particular choice, the Bregman divergence reduces to the Kullback–Leibler divergence. Let the current gradient be denoted by g := ∇Li(wt−1). The update rule can be written as:











wt[k] = wt−1[k] exp



 η ˆ`⁰



 ∀k

zt[i] = zt−1[i] − η ˆ`⁰

λ ,

z_t[j] = z_t−1[j], ∀j 6= i, (13)

This update rule requires the weights to be positive but one can use the magnitude of the weights.

A.3. Reduction of RMD to SMD for λ → ∞

Here, we show that, for λ → ∞, RMD reduces to the standard SMD, which jives with the fact that the optimization problem it solves, i.e., (2), for λ → ∞ reduces to the optimization problem that SMD solves, i.e., (5).

Note that when λ → ∞, the update rule for z_tin (7) vanishes, and we have z_t= 0 for all t. Therefore, the update becomes

∇ψ(wt) = ∇ψ(w_t−1) + η`⁰

−p2Li(wt−1)

p2Li(wt−1) ∇Li(w_t−1). (14)

For `(·) = ^(·)₂², we have `⁰(·) = (·), and the update rule further reduces to

∇ψ(wt) = ∇ψ(wt−1) − η∇Li(wt−1), (15)

which is precisely the update rule for SMD.

B. Regularization for Continual Learning

It is often desirable to regularize the weights to remain close to a particular weight vector. This is particularly useful for continual learning, where one seeks to learn a new task while trying not to “forget” the previous task as much as possible (Lopez-Paz & Ranzato,2017;Kirkpatrick et al.,2017;Farajtabar et al.,2020). In this section, we show that our algorithm can be readily used for such settings by initializing w0to be the desired weight vector and suitably choosing a notion of closeness.

Augmenting the loss function with a regularization term that promotes closeness to some desired weight vector w^reg, one can pose the optimization problem as

minw λ

n

X

i=1

Li(w) + kw − w^regk². (16)

(8)

More generally, using a Bregman divergence D_ψ(·, ·) corresponding to a differentiable strictly-convex potential function ψ : R^p→ R, one can pose the problem as

minw λ

n

X

i=1

Li(w) + Dψ(w, w^reg). (17)

Note that Bregman divergence is defined as Dψ(w, w^reg) := ψ(w) − ψ(w^reg) − ∇ψ(w^reg)^T(w − w^reg), is non-negative, and convex in its first argument. Due to strict convexity of ψ, we also have Dψ(w, w^reg) = 0 iff w = w^reg. For the choice of ψ(w) = ¹₂kwk², for example, the Bregman divergence reduces to the usual Euclidean distance Dψ(w, w0) = ¹₂kw − w^regk². Same as in Section3, we can define an auxiliary variable z ∈ Rⁿ, and rewrite the problem as

minw,z λ

n

X

i=1

z²[i]

2 + Dψ(w, w^reg) s.t. z[i] =p

2Li(w), i = 1, . . . , n.

(18)

It can be easily shown that the objective of this optimization problem is a Bregman divergence, i.e., Dψˆ

w z

,w^reg

0

, corresponding to a potential function ˆψw

z

= ψ(w) +^λ₂kzk². As will be discussed in Section5, this is exactly in a form that an SMD algorithm with the choice of potential function ˆψ, initialization w0= w^regand z0= 0, and a sufficiently small learning rate will solve. In other words, Algorithm1with initialization w0= w^regprovably solves the regularized problem (17).

C. Additional Details on the Experiments

In this appendix, we provide further details on the experiments for the purpose of reproducing the results, and also provide additional results.

C.1. Setup

Dataset. To test the ability of different regularization methods in avoiding overfitting, we need a training set that does not consist entirely of clean data. Thus, we took the popular CIFAR-10 dataset (Krizhevsky & Hinton,2009), which has 10 classes and n = 50, 000 training data points, and corrupted 25% of the data points, by assigning them a random label. Since, for each of those images, there is a 9/10 chance of being assigned a wrong label, roughly 9/10 × 25% = 22.5% of the training data has incorrect labels.

Network Architecture. We use a standard ResNet-18 (He et al.,2016) deep neural network, which is commonly used for the CIFAR-10 dataset. The network has 18 layers, and around 11 million parameters. Thus, it qualifies as highly overparameterized. We do not make any changes to the network.

Algorithms. We use three different algorithms for optimization/regularization.

1. Standard SGD (implicit regularization): First, we train the network with the standard (mini-batch) SGD. While there is no explicit regularization, this is still known to induce an implicit regularization, as discussed in Section2.3.

2. Explicit regularization via weight decay: This time, we train the network with an explicit `2-norm regularization, through weight decay.

3. Explicit regularization via RMD: Finally, we train the network with RMD, which is provably regularizing with an `2

norm. Note that we choose the same regularization parameter λ for both weight decay and RMD.

Mini Batch. For all the three algorithms, we train in mini batches. The size of mini batches that we use is 128, which is a common choice for CIFAR-10.

(9)

Algorithm 2 Mini-batch Regularizer Mirror Descent (RMD) Require: λ, η, w0

1: repeat

2: Take a mini batch B 3: L(w) ←¯ _|B|¹ P

i∈BLi(w), ∇ ¯L(w) ← _|B|¹ P

i∈B∇Li(w), and ¯z ← _|B|¹ P

i∈Bz[i]

4: c ← η ˆ`⁰

¯ z −p

2 ¯L(w) 5: w ← ∇ψ⁻¹

∇ψ(w) +√ ^c

2 ¯L(w)∇ ¯L(w)

6: for i ∈ B do

7: z[i] ← z[i] − _λ^c 8: end for

9: until convergence 10: return w

Learning Rate. We used three different learning rates for each of the algorithms: 0.001, 0.01 and 0.1. Among all, 0.1 provided the best convergence behavior for all three algorithms. The reported results are thus for 0.1 learning rate.

Stopping Criterion. In order to determine the stopping point for each of the algorithms, we use the following stopping criteria.

1. For SGD, we train until the training data is interpolated, i.e., 100% training accuracy, similar as in (Azizan et al.,2021).

2. Note that it is not feasible to determine the stopping criterion based on the training accuracies for the explicit regularization via weight decay or RMD. Therefore, for the explicit regularization via weight decay, we consider the change in the total loss over the training set. We stop the training if the change in the total loss is less than 0.1% over 200 consecutive epochs.

3. For RMD, we know that the algorithm eventually interpolates the new manifold ˆW, i.e., fits the constraints in (6). Thus, we can use the total change in the the constraints, i.e.,

n

X

i=1

z[i] −p 2L_i(w)

.

and we stop the training if this summation improves less than 0.1% over 200 consecutive epochs.

C.2. Results

The training and test accuracies for the three algorithms with various values of λ are summarized in Table2. As mentioned before, because the network is highly overparameterized, SGD expectedly interpolates the training data and achieves almost 100% training accuracy. The test accuracy for SGD is just shy of 80%. Varying the regularization parameter λ for weight decay, it can only achieve marginally better generalization performance than SGD, while often performing worse. It is also worth noting that different runs of weight decay with the same hyper parameters often result in (sometimes drastically) different test accuracies (e.g. see λ = 0.009 in the table). RMD, on the other hand, significantly outperforms SGD and all the different runs of weight decay, achieving a top test accuracy of almost 87%. This performance is not too sensitive to λ in the ranges we considered.

D. Convergence Guarantees

In this section, we provide convergence guarantees for RMD under certain assumptions, motivated by the implicit regularization property of stochastic mirror descent, recently established in (Azizan & Hassibi,2019;Azizan et al.,2021).

Let us denote the training dataset by {(xi, y_i) : i = 1, . . . , n}, where xi ∈ R^dare the inputs, and yi ∈ R are the labels.

The output of the model on data point i is denoted by a function fi(w) := f (x_i, w) of the parameter w ∈ R^p. The loss on data point i can then be expressed as L_i(w) = `(y_i− f_i(w)) with `(·) being convex and having a global minimum at zero

(10)

Table 2. Comparison between RMD and the two baselines, (1) implicit regularization induced by SGD and (2) explicit regularization through weight decay

Algorithm Lambda Training Accuracy Test Accuracy

SGD N/A 99.70% 79.42%

RMD 0.9 70.20% 85.43%

1.0 71.81% 85.71%

1.1 72.90% 86.12%

1.2 74.38% 86.92%

1.3 75.04% 86.35%

1.4 75.16% 86.62%

1.5 75.84% 85.74%

1.6 77.56% 85.72%

1.6 78.36% 85.69%

Weight Decay 0.005 69.95% 75.59%

0.005 70.07% 64.26%

0.006 71.29% 81.32%

0.006 71.11% 79.64%

0.006 71.25% 80.62%

0.007 74.50% 65.94%

0.008 78.69% 62.11%

0.0085 79.77% 65.86%

0.009 79.54% 60.98%

0.009 79.26% 73.80%

0.0095 80.25% 66.58%

0.01 83.28% 60.92%

0.02 92.47% 69.93%

0.03 96.47% 70.79%

0.04 96.71% 69.80%

0.05 97.20% 72.43%

0.08 97.97% 74.06%

0.7 99.64% 79.96%

0.8 99.38% 80.56%

0.9 99.46% 79.80%

1 99.65% 79.69%

1.2 99.65% 80.62%

1.3 99.75% 80.61%

(11)

(examples include square loss, Huber loss, etc.). Since we are mainly concerned with highly overparameterized models (the so-called interpolating regime), where p n, there are (infinitely) many parameter vectors w that can perfectly fit the training data points, and we can define

W = {w ∈ R^p| f_i(w) = y_i, i = 1, . . . , n} = {w ∈ R^p| L_i(w) = 0, i = 1, . . . , n}. (19) Let w^∗∈ W denote the interpolating solution that is closest to the initialization w0in Bregman divergence, i.e.,

w^∗= arg min

w

Dψ(w, w0)

s.t. fi(w) = yi, i = 1, . . . , n.

(20)

It has been shown that, for a linear model f (xi, w) = x^T_iw, and for a sufficiently small learning rate η > 0, the iterates of SMD (4) with potential function ψ(·), initialized at w0, converge to w^∗(Azizan & Hassibi,2019).

When initialized at w0= arg min_wψ(w) (which is the origin for all norms, for example), the convergence point becomes the minimum-norm interpolating solution, i.e.,

w^∗= arg min

w

ψ(w)

s.t. fi(w) = yi, i = 1, . . . , n.

(21)

While for nonlinear models, the iterates of SMD do not necessarily converge to w^∗, it has been shown that for highly- overparameterized model, under certain conditions, this still holds in an approximate sense (Azizan et al.,2021). In other words, the iterates converge to an interpolating solution w∞∈ W which is “close” to w^∗. More formally, the result from (Azizan et al.,2021) along with its assumptions can be stated as follows.

Let us define DLi(w, w⁰) := L_i(w) − L_i(w⁰) − ∇L_i(w⁰)^T(w − w⁰), which is defined in a similar way to a Bregman divergence for the loss function. The difference though is that, unlike the potential function of the Bregman divergence, due to the nonlinearity of f_i(·), the loss function Li(·) = `(y_i− f_i(·)) need not be convex (even when `(·) is).

Assumption 3 ((Azizan et al.,2021)) Denote the initial point by w₀. There existsw ∈ W and a region B = {w⁰ ∈ R^p| Dψ(w, w⁰) ≤ } containing w₀, such thatD_L_i(w, w⁰) ≥ 0, i = 1, . . . , n, for all w⁰∈ B.

Assumption 4 ((Azizan et al.,2021)) Consider the region B in Assumption 3. Thefi(·) have bounded gradient and Hessian on the convex hull ofB, i.e., k∇fi(w⁰)k ≤ γ, and α ≤ λ_min(H_f_i(w⁰)) ≤ λ_max(H_f_i(w⁰)) ≤ β, i = 1, . . . , n, for allw⁰∈ conv B.

Theorem 2 ((Azizan et al.,2021)) Consider the set of interpolating solutions W = {w ∈ R^p | f (xi, w) = yi, i = 1, . . . , n}, the closest such solution w^∗ = arg min_w∈WD_ψ(w, w₀), and the SMD iterates given in (4) initialized at w0, where every data point is revisited after some steps. Under Assumptions3and4, for sufficiently small step size, i.e., for any η > 0 for which ψ(·) − ηL_i(·) is strictly convex on B for all i, the following holds.

1. The iterates converge tow_∞∈ W.

2. D_ψ(w^∗, w_∞) = o().

Motivated by the above result, we now return to RMD and its corresponding optimization problem.

Let us define a learning problem over parameters w z

∈ R^p+n with ˆfi

w z

= p2Li(w) − z[i], ˆyi = 0, and Lˆi

w z

= ˆ`

ˆ yi− ˆfi

w z

= ˆ`

z[i] −p2Li(w)

for i = 0, . . . , n. Note that in this new problem, we now have p + n parameters and n constraints/data points, and since p n, we have p + n n, and we are still in the highly-overparameterized regime (even more so). Therefore, we can also define the set of interpolating solutions for the new problem as

W =ˆ w z

∈ R^p+n| ˆf_i(w) = ˆy_i, i = 1, . . . , n

. (22)

(12)

Let us define a potential function ˆψw z

= ψ(w) +^λ₂kzk²and a corresponding SMD

∇ ˆψw_t zt

= ∇ ˆψw_t−1 zt−1

− η∇ ˆLi

w_t−1 zt−1

, (23)

initialized atw0

0

. It is straightforward to verify that this update rule is indeed equivalent to the update rule of RMD given in (7).

On the other hand, from (20), we have ˆ

w^∗= arg min

w,z

Dψˆ

w z

,w0

0

s.t. fˆi(w z

) = ˆyi, i = 1, . . . , n.

(24)

Plugging Dψˆ

w z

,w₀

0

= D_ψ(w, w₀) +^λ₂kzk²and ˆf_iw z

=p2L_i(w) − z[i] into (24), it is easy to see that it is equivalent to (18) for w₀= w^reg, and equivalent to (6) for w₀= 0.

The formal statement of the theorem follows from a direct application of Theorem2.

Assumption 5 Denote the initial point by w0

0

. There exists w z

∈ Wˆ and a region Bˆ =

w⁰ z⁰

∈ R^p+n| Dψˆ

w z

,w⁰

z⁰

≤

containing w₀ 0

, such that D_L_ˆ

i

w z

,w⁰

z⁰

≥ 0, i = 1, . . . , n, for all

w⁰ z⁰

∈ ˆB.

Assumption 6 Consider the region ˆB in Assumption1. The ˆf_i(·) have bounded gradient and Hessian on the convex hull of ˆB, i.e.,

∇ ˆfi

w⁰ z⁰

≤ γ, and α ≤ λmin

Hfˆ_i

w⁰ z⁰

≤ λmax

Hfˆ_i

w⁰ z⁰

≤ β, i = 1, . . . , n, for all

w⁰ z⁰

∈ conv ˆB.

Theorem 3 Consider the set of interpolating solutions ˆW defined in (8), the closest such solution ˆw^∗defined in(9), and the RMD iterates given in(7) initialized atw0

0

, where every data point is revisited after some steps. Under Assumptions1and 2, for sufficiently small step size, i.e., for anyη > 0 for which ˆψ(·) − η ˆLi(·) is strictly convex on ˆB for all i, the following holds.

1. The iterates converge tow_∞ z_∞

∈ ˆW.

2. D_ψ_ˆ

ˆ w^∗,w_∞

z_∞

= o().