On Agnostic PAC Learning using $L_2$-polynomial Regression and Fourier-based Algorithms

Mohsen Heidari

Purdue University

Wojciech Szpankowski

Purdue University

Abstract—We develop a framework using Hilbert spaces as a proxy to analyze PAC learning problems with structural properties. We consider a joint Hilbert space incorporating the relation between the true label and the predictor under a joint distribution $D$. We demonstrate that agnostic PAC learning with 0-1 loss is equivalent to an optimization in the Hilbert space domain. With our model, we revisit the PAC learning problem using methods based on least-squares such as $L_2$ polynomial regression and the low-degree algorithm of [1]. We study learning with respect to several hypothesis classes such as half-spaces and polynomial-approximated classes (i.e., functions approximated by a fixed-degree polynomial). We prove that (under some distributional assumptions) such methods obtain generalization error up to $2P_{opt}$, with $P_{opt}$ being the optimal error of the class. Hence, we show the tightest bound on generalization error when $P_{opt} \le 0.2$.

I. INTRODUCTION

We study binary classification using polynomial regression from the agnostic PAC learning perspective [2], [3]. In this problem, multiple training instances are generated IID according to an underlying distribution $D$ on the feature-label set $\mathcal{X} \times \{-1, 1\}$. In addition, we are given a hypothesis class with respect to which the learning process takes place. If $P_{opt}$ is the minimum error attained using the given class, then the objective of the learning algorithm is to output, with high probability, a classifier whose generalization error is not greater than $P_{opt} + \epsilon$.

To gain computational efficiency or analytical tractability, many conventional learning methods such as the support-vector machine (SVM) rely on intermediate loss functions other than the natural 0-1 loss. Square loss is one example; it is the basis for $L_2$-polynomial regression and for a variant of SVM known as LS-SVM [4]. The well-known “low-degree” algorithm [1] is also known to be in this category of algorithms [5]. Such methods have been analyzed for many PAC learning problems. Under the realizability assumption, where $P_{opt} = 0$, the $L_2$-polynomial regression and the low-degree algorithm are PAC learners for a variety of hypothesis classes [6]–[8]. Under the agnostic setting, where $P_{opt} > 0$, the current results are not that promising. The best known results for $L_2$-polynomial regression (and the low-degree algorithm under the uniform distribution) are $8P_{opt}$ and $\frac{1}{4} + P_{opt}(1 - P_{opt})$ for classes such as half-spaces or polynomial-approximated classes [3], [5].

In this paper, we develop a framework using Hilbert spaces as a proxy to analyze such problems. We consider a joint Hilbert space incorporating the relation between the true label and the predictor under the joint distribution $D$. This is unlike conventional analysis using Hilbert spaces, which focuses only on the predictors with the marginal $D_X$ on the features. As a byproduct, we improve the above-mentioned bounds and show that the generalization error of $L_2$-polynomial regression and the low-degree algorithm is less than $2P_{opt}$. This bound improves upon the previous bounds when $P_{opt} \le 0.2$. We show that methods based on square loss are suitable for learning classes with appropriate geometrical properties.
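The threshold of roughly 0.2 can be checked by comparing the bounds directly. The short derivation below is a sketch that ignores the additive $\epsilon$ and sample-size terms; writing $p$ for $P_{opt}$ is shorthand introduced here, not notation from the paper.

```latex
\begin{align*}
2p \le \tfrac{1}{4} + p(1 - p)
  &\iff p^2 + p - \tfrac{1}{4} \le 0 \\
  &\iff p \le \frac{\sqrt{2} - 1}{2} \approx 0.207 \quad (p \ge 0), \\
2p \le 8p &\quad \text{for every } p \ge 0 .
\end{align*}
```

So, under these simplifications, $2P_{opt}$ is the smallest of the three bounds whenever $P_{opt}$ is below about 0.207, which is consistent with the stated range $P_{opt} \le 0.2$.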

A. Our approach

We develop our framework by constructing two Hilbert spaces, one with respect to the true underlying distribution $D$ and the other with respect to the empirical one. The first one is $L_2(D)$, that is, all real-valued functions $f$ on $\mathcal{X} \times \mathcal{Y}$ such that $\mathbb{E}[f(X, Y)^2] < \infty$. The second one is $L_2(\hat{D})$, with $\hat{D}$ being the empirical distribution of the training set. With this formulation, the true label $Y$ and the training labels are understood as members of these spaces, and the generalization error of any classifier $c$ equals $\frac{1}{4}\|Y - c\|_{2,D}^2$.

Similarly, when the distance is calculated in the second Hilbert space, we obtain a characterization of the empirical error. Hence, minimizing the generalization (or empirical) error is equivalent to minimizing the distance between $Y$ and the classifier $c$ in the first (or second) Hilbert space. We argue that the mentioned hypothesis classes have appropriate structures that allow us to derive lower bounds on the minimum error $P_{opt}$. For instance, given $k$, the polynomial-approximated class is characterized by the subspace of $L_2(D)$ spanned by polynomials of degree up to $k$. With this structure, finding $P_{opt}$ is equivalent to finding the minimum distance of $Y$ to the subspace spanned by polynomials of degree up to $k$. As for the learning algorithms, we argue that the low-degree algorithm and $L_2$-polynomial regression have suitable structures, which we use to derive our upper bounds on their generalization errors. For instance, in the case of $L_2$ polynomial regression, the error of any classifier of the form $\mathrm{sign}[p(x) - \theta]$, with $\theta$ chosen appropriately, is bounded from above by $\frac{1}{2}\|Y - p\|_2^2$. Hence, minimizing the square loss as in $L_2$-regression yields an error less than $2P_{opt}$.


B. Summary of the Results

In this work, we first present a more general version of the low-degree algorithm incorporating non-uniform but product probability distributions. We refer to this generalization as the Fourier algorithm. With our framework, we study learning with respect to three well-known hypothesis classes. The first class is half-spaces, consisting of all the Boolean-valued functions of the form $c(x) = \mathrm{sign}\big[\sum_{j=1}^{d} w_j x_j - \theta\big]$. The second class is called polynomial-approximated functions. Given a positive integer $k$ and $\epsilon > 0$, it consists of Boolean-valued functions that are approximated by a degree-$k$ polynomial with square error up to $\epsilon^2$. The third class is a generalization of the second. We use our framework to analyze learning these hypothesis classes using $L_2$-polynomial regression and the Fourier algorithm. Below is the summary of our results:

1) The $L_2$ polynomial regression with degree $k$ outputs a hypothesis $\hat{g}$ whose generalization error has the following properties:

• For learning polynomial-approximated classes, it is less than $2P_{opt} + 3\epsilon$ (Theorem 1).

• For learning half-spaces, when the marginal $D_X$ is uniform over the unit sphere in $\mathbb{R}^d$, it is less than $2P_{opt} + 3\epsilon$ (Theorem 3).

• For learning generalized concentrated classes, under any distribution, it is less than $2P_{opt}$ (Theorem 4).

2) If the marginal $D_X$ is a product probability distribution on $\{-1, 1\}^d$, then with probability $(1 - \delta)$, the Fourier algorithm outputs a hypothesis such that its generalization error is less than $2P_{opt} + 2\epsilon$ for learning polynomial-approximated classes.

C. Related Works

The low-degree algorithm was introduced by [1] with PAC learning guarantees under the uniform distribution over $\{-1, 1\}^d$. This algorithm, which is based on the Fourier expansion on the Boolean cube, has been used in various problems [6], [8], [9]. The $L_2$ polynomial regression, along with its $L_1$ counterpart, was introduced by [5] for learning with respect to polynomial-approximated classes, $k$-juntas, and half-spaces. Learning with respect to such classes has been studied extensively in the literature [5], [10]–[12]. Among such classes, learning with respect to half-spaces is the most challenging. In the case of proper agnostic PAC learning, where the algorithm’s predictor must be a half-space, it is an NP-hard problem [13], [14]. Even without the proper restriction, the problem is NP-hard. That said, under distributional assumptions, polynomial-time algorithms have been introduced [5], [15], [16]. Among them are the improper learning algorithms based on regression methods such as $L_1$ or $L_2$ polynomial regression [1], [5]. In particular, [5] proved that $L_1$ polynomial regression learns a range of hypothesis classes such as half-spaces (under distributional assumptions) and polynomial-approximated classes.

II. PRELIMINARIES

Notation: The input set is denoted by $\mathcal{X}$, which is a subset of $\mathbb{R}^d$ for some positive integer $d$. The output set is denoted by $\mathcal{Y}$, which is a subset of $\mathbb{R}$. In binary classification, $\mathcal{Y} = \{-1, 1\}$. For shorthand, the random vectors in $\mathbb{R}^d$ are denoted by $X = (X_1, X_2, \ldots, X_d)$. Further, for any ordered subset $J = \{j_1, j_2, \ldots, j_m\}$, by $X_J$ denote the random vector $(X_{j_1}, X_{j_2}, \ldots, X_{j_m})$. Similarly, by $x_J$ denote the vector $(x_{j_1}, x_{j_2}, \ldots, x_{j_m})$. For a pair of functions $f, g$ on $\mathcal{X}$, the notation $f \equiv g$ means that $f(x) = g(x)$ for all $x \in \mathcal{X}$. Lastly, for any natural number $\ell$, the set $\{1, 2, \ldots, \ell\}$ is denoted by $[\ell]$.

A. A Hilbert Space Representation

We first develop a Hilbert space formulation for the binary classification problem. Let $D$ be a joint probability distribution on the input-output set $\mathcal{X} \times \mathcal{Y}$. In this paper, it is assumed that the marginal $D_X$ of any joint distribution $D$ on $\mathcal{X} \times \mathcal{Y}$ has finite moments. Consider a Hilbert space of all real-valued functions $f : \mathcal{X} \times \mathcal{Y} \mapsto \mathbb{R}$ which are in $L_2(D)$, that is, $\mathbb{E}_D[f(X, Y)^2] < \infty$. The inner product between two members $f, g$ is defined as

$$\langle f, g \rangle_D \triangleq \mathbb{E}_D[f(X, Y)\, g(X, Y)].$$

Given any integer $p > 0$ and distribution $D$, the $p$-norm of a function $f$ is defined as

$$\|f\|_{p,D} \triangleq \big(\mathbb{E}_D[|f(X, Y)|^p]\big)^{1/p}.$$

Given any training sample $S = \{(x_i, y_i) : i = 1, 2, \ldots, n\}$, let $\hat{D}$ denote its empirical distribution, that is, the uniform distribution on $S$ and zero outside of it. Associated with this distribution, we consider the Hilbert space $L_2(\hat{D})$ with the inner product and norms defined based on the empirical distribution $\hat{D}$. We use this formulation to study the binary classification problem where $\mathcal{Y} = \{-1, 1\}$. Therefore, the generalization error of any predictor $c : \mathcal{X} \mapsto \{-1, 1\}$ can be written in terms of the inner products as

$$P_D\{Y \neq c(X)\} = \frac{1}{2} - \frac{1}{2}\langle Y, c \rangle_D = \frac{1}{4}\|Y - c\|_{2,D}^2, \tag{1}$$

where, with slight abuse of notation, $Y$ is understood as the mapping $(x, y) \mapsto y$ and $c$ is understood as a mapping on $\mathcal{X} \times \mathcal{Y}$ which depends only on $X$. Similarly, the empirical error of $c$ is equal to

$$\hat{P}_{\hat{D}}\{Y \neq c(X)\} = \frac{1}{2} - \frac{1}{2}\langle Y, c \rangle_{\hat{D}} = \frac{1}{4}\|Y - c\|_{2,\hat{D}}^2.$$
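The three expressions for the empirical error can be compared numerically. The following is a minimal sketch, assuming labels and predictions are stored as NumPy arrays with entries in $\{-1, 1\}$; the function names are illustrative and not taken from the paper.

```python
import numpy as np


def zero_one_error(y, c):
    """Fraction of samples on which the label and the prediction disagree."""
    return float(np.mean(y != c))


def error_via_inner_product(y, c):
    """1/2 - 1/2 <Y, c>, with the inner product taken under the empirical distribution."""
    return 0.5 - 0.5 * float(np.mean(y * c))


def error_via_distance(y, c):
    """1/4 ||Y - c||_2^2 under the empirical distribution."""
    return 0.25 * float(np.mean((y - c) ** 2))


# Synthetic labels and predictions, used only to exercise the identity.
rng = np.random.default_rng(0)
y = rng.choice([-1, 1], size=1000)
c = rng.choice([-1, 1], size=1000)
```

The identity holds because $Y c = 1$ when the two agree and $-1$ otherwise, while $(Y - c)^2$ is $0$ or $4$ in the same two cases.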

The goal now is to derive bounds on the minimum generalization error when learning with respect to various hypothesis classes. In Section III we describe $L_2$-polynomial regression and the Fourier algorithm, in Section IV we study polynomial-approximated classes, and finally in Section V we discuss half-spaces and more general hypothesis classes that have structural properties.

III. PAC LEARNING WITH $L_2$-POLYNOMIAL REGRESSION

We employ a PAC learning algorithm using $L_2$-polynomial regression. Given a training set, the objective of the polynomial regression is to minimize the empirical square loss over all polynomials of degree up to $k$. This process can be implemented by stochastic gradient descent or by solving a linear system of equations. We describe how this polynomial regression can be used for PAC learning. Let $\hat{p}$ be the output of the polynomial regression. The idea is to shift the polynomial $\hat{p}$ by a threshold $\theta$ and take its sign. This process is demonstrated as Algorithm 1.

Algorithm 1 PAC Learning with $L_2$-Polynomial Regression

Input: Degree parameter $k$, and training samples $\{(x^{(i)}, y^{(i)}), i \in [n]\}$.

1: Find a polynomial $\hat{p}$ of degree up to $k$ that minimizes $\frac{1}{n}\sum_{i}\big(y^{(i)} - p(x^{(i)})\big)^2$.

2: Find $\theta \in [-1, 1]$ such that the empirical error of $\mathrm{sign}[\hat{p}(x) - \theta]$ is minimized.

3: return $\hat{g} \equiv \mathrm{sign}[\hat{p} - \theta]$.
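The three steps of Algorithm 1 can be sketched in NumPy as follows. This is an illustrative sketch, not the authors' implementation: it assumes a monomial basis, solves the least-squares problem with `numpy.linalg.lstsq`, adopts the convention $\mathrm{sign}(0) = +1$, and uses hypothetical function names.

```python
import itertools

import numpy as np


def monomial_features(X, k):
    """Evaluate every monomial of total degree at most k on the rows of X."""
    n, d = X.shape
    columns = [np.ones(n)]  # the constant (degree-0) monomial
    for degree in range(1, k + 1):
        for idx in itertools.combinations_with_replacement(range(d), degree):
            columns.append(np.prod(X[:, list(idx)], axis=1))
    return np.column_stack(columns)


def sign(v):
    """Sign function with the assumed convention sign(0) = +1."""
    return np.where(v >= 0, 1, -1)


def l2_polynomial_regression(X, y, k):
    """Sketch of Algorithm 1; returns the classifier, the coefficients, and theta."""
    # Step 1: least-squares fit over polynomials of degree up to k.
    Phi = monomial_features(X, k)
    coef, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    p_hat = Phi @ coef

    # Step 2: the empirical error is piecewise constant in theta and can only
    # change at the fitted values, so it is enough to try those values
    # (clipped to [-1, 1]) together with the two endpoints.
    candidates = np.unique(np.clip(np.append(p_hat, [-1.0, 1.0]), -1.0, 1.0))
    errors = [np.mean(sign(p_hat - t) != y) for t in candidates]
    theta = float(candidates[int(np.argmin(errors))])

    # Step 3: the returned hypothesis g_hat = sign[p_hat - theta].
    def g_hat(X_new):
        return sign(monomial_features(np.atleast_2d(X_new), k) @ coef - theta)

    return g_hat, coef, theta
```

A call such as `g_hat, coef, theta = l2_polynomial_regression(X, y, k=2)` expects `X` of shape `(n, d)` and `y` with entries in $\{-1, 1\}$. The number of monomials grows roughly like $d^k$, so this direct construction is only practical for small $d$ and $k$.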

A. Fourier-Based Learning Algorithm

We present another variant of $L_2$ polynomial regression, known as the low-degree (Fourier) algorithm [1]. Although this algorithm is more efficient than the polynomial regression, it requires the binary input set $\mathcal{X} = \{-1, 1\}^d$. The low-degree algorithm was originally designed for the uniform distribution on the Boolean cube. In this paper, we present a more general version of it that incorporates non-uniform but product probability distributions on $\{-1, 1\}^d$ [17]. In this approach, the objective is to find an estimate of the polynomial $p^*$ that minimizes the square loss $\|Y - p\|_{2,D}$ under the true distribution. This method is based on the Fourier expansion on the Boolean cube [18] and is summarized in the following. Under a product probability distribution on $\{-1, 1\}^d$, any bounded real-valued function can be written as

$$f(x) = \sum_{S \subseteq [d]} f_S\, \psi_S(x),$$

where the $f_S$’s are the Fourier coefficients, calculated as $f_S \triangleq \langle f, \psi_S \rangle$ for every subset $S \subseteq [d]$. Further, the parity $\psi_S$ is a monomial defined as

$$\psi_S(x) = \prod_{j \in S} \frac{x_j - \mu_j}{\sigma_j},$$

with $\mu_j$ and $\sigma_j$ being the mean and standard deviation of $X_j$, respectively. As the distribution is unknown, these quantities are estimated in the algorithm.

As a result, we can write the Fourier decomposition of the optimal polynomial $p^*$. For that, we have the following statement:

Fact 1. Let $D$ be a probability distribution with the marginal $D_X$ that is a product probability distribution on $\{-1, 1\}^d$. Then, the optimal polynomial $p^*$ admits the following Fourier decomposition

$$p^* = \sum_{S \subseteq [d] : |S| \le k} \langle Y, \psi_S \rangle\, \psi_S.$$

Algorithm 2 Fourier-Based Learning

Input: Training samples $\{(x^{(i)}, y^{(i)}), i \in [n]\}$.

1: Compute the empirical mean $\hat{\mu}_j$ and standard deviation $\hat{\sigma}_j$ of each feature.

2: For every $S \subseteq [d]$ with $|S| \le k$, construct the empirical parity as $\hat{\psi}_S(x) = \prod_{j \in S} \frac{x_j - \hat{\mu}_j}{\hat{\sigma}_j}$.

3: Compute the empirical Fourier coefficients $a_S$, for every $S$ with at most $k$ elements, as $a_S = \frac{1}{n}\sum_{i=1}^{n} y^{(i)}\, \hat{\psi}_S(x^{(i)})$.

4: Construct and return the function $\hat{\Pi}_Y$ as $\hat{\Pi}_Y(x) \triangleq \sum_{S : |S| \le k} a_S\, \hat{\psi}_S(x)$.

With that decomposition, the idea behind the Fourier algorithm is to compute an empirical estimate of $\langle Y, \psi_S \rangle$. This is demonstrated as Algorithm 2.
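The four steps of Algorithm 2 translate almost line by line into NumPy. The sketch below is illustrative only: it assumes no feature is constant on the sample (so every empirical standard deviation is nonzero), indexes features from 0, and uses hypothetical names. The algorithm returns the real-valued function $\hat{\Pi}_Y$; using its sign as the classifier is an assumption made here, not something the algorithm box states.

```python
import itertools

import numpy as np


def fourier_algorithm(X, y, k):
    """Sketch of Algorithm 2: empirical low-degree Fourier projection of Y."""
    n, d = X.shape
    # Step 1: empirical mean and standard deviation of each feature.
    mu_hat = X.mean(axis=0)
    sigma_hat = X.std(axis=0)
    # All subsets S of {0, ..., d-1} with at most k elements.
    subsets = [S for r in range(k + 1) for S in itertools.combinations(range(d), r)]

    def parity(Z, S):
        # Step 2: empirical parity, the product over j in S of (z_j - mu_j) / sigma_j.
        if not S:
            return np.ones(len(Z))  # the empty product
        cols = list(S)
        return np.prod((Z[:, cols] - mu_hat[cols]) / sigma_hat[cols], axis=1)

    # Step 3: empirical Fourier coefficients a_S = (1/n) sum_i y_i psi_S(x_i).
    coeffs = {S: float(np.mean(y * parity(X, S))) for S in subsets}

    def pi_hat(Z):
        # Step 4: the returned function, a weighted sum of empirical parities.
        Z = np.atleast_2d(Z)
        return sum(a * parity(Z, S) for S, a in coeffs.items())

    return pi_hat, coeffs
```

Unlike the regression in Algorithm 1, no linear system is solved here: each coefficient is a single empirical average, which is where the efficiency gain comes from.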

In the following lemma, which is proved in the supplementary part, we derive bounds for estimating the optimal polynomial $p^*$.

Lemma 1. Let $D$ be a probability distribution with the marginal $D_X$ that is a product probability distribution on $\{-1, 1\}^d$. Given $\delta \in (0, 1)$, with probability at least $(1 - \delta)$, the following inequality holds

$$\|p^* - \hat{\Pi}_Y\|_2 \le O\left(\sqrt{\frac{d^k c_k}{(k-1)!\, n}\log\frac{4 d^k}{(k-1)!\, \delta}}\right), \tag{2}$$

where $c_k \triangleq \max_{S \subseteq [d], |S| \le k}\|\psi_S\|_\infty^2$ and $n$ is the number of samples.

IV. POLYNOMIALLY APPROXIMATED CLASS

In this section, we study agnostic PAC learning with respect to concept classes whose members are approximated by fixed-degree polynomials. We adopt the Hilbert space representation in Section II-A to analyze PAC learning using Algorithms 1 and 2. We start with the following formulation:

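Definition 1. Given $\epsilon \in [0, 1]$, $k \in \mathbb{N}$, and any probability distribution $D_X$ on $\mathcal{X}$, a concept class $\mathcal{C}$ of functions $c : \mathcal{X} \mapsto \{-1, 1\}$ is $(\epsilon, k)$-approximated if

$$\sup_{c \in \mathcal{C}}\, \inf_{p \in \mathcal{P}_k} \mathbb{E}\big[(c(X) - p(X))^2\big] \le \epsilon^2,$$

where $\mathcal{P}_k$ is the set of all polynomials of degree up to $k$.

We consider agnostic PAC learning with respect to $\mathcal{C}$ and under the 0-1 loss function. The minimum generalization error and empirical error of $\mathcal{C}$ are, respectively, defined as

$$P_{opt} \triangleq \min_{c \in \mathcal{C}} P_D\{Y \neq c(X)\}, \qquad \hat{P}_{opt} \triangleq \min_{c \in \mathcal{C}} \hat{P}_{\hat{D}}\{Y \neq c(X)\}.$$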


We use the Hilbert space representation in Section II-A and provide a lower bound on $P_{opt}$.

Lemma 2. The minimum generalization error attainable by any $(\epsilon, k)$-approximated concept class $\mathcal{C}$ is bounded from below as

$$P_{opt} \ge \frac{1}{2} - \frac{1}{2}\|p^*\|_{1,D},$$

where $p^* \triangleq \arg\min_{p \in \mathcal{P}_k} \mathbb{E}_D\big[(Y - p(X))^2\big]$.

We show in Section III-A that the lower bound in Lemma 2 helps to prove our results for the low-degree algorithm.

A. PAC Learning Bounds

Next, we analyze Algorithms 1 and 2 for this class and prove the first main result of the paper.

Theorem 1. Given $\epsilon > 0$ and $k \in \mathbb{N}$, the degree-$k$ $L_2$ polynomial regression agnostically PAC learns any $(\epsilon, k)$-approximated concept class with expected error up to

$$2P_{opt} + 3\epsilon + \sqrt{\frac{2(d^k + 1)}{n}\log\frac{en}{d^k + 1}},$$

where $d$ is the number of input variables and $n$ is the sample size.
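To get a feel for how large $n$ must be before the last term in Theorem 1 becomes small, the term can be evaluated directly. The helper below is a hypothetical convenience function, not part of the paper; it assumes the natural logarithm and rejects $n < d^k + 1$, where the expression is not informative.

```python
import math


def vc_complexity_term(d, k, n):
    """sqrt( 2 (d^k + 1) / n * log( e n / (d^k + 1) ) ), the sample-size term in Theorem 1."""
    v = d ** k + 1  # the VC-dimension bound used for sign-of-polynomial classifiers
    if n < v:
        raise ValueError("the term is only informative when n >= d^k + 1")
    return math.sqrt(2.0 * v / n * math.log(math.e * n / v))
```

For fixed $d$ and $k$ the term decreases as $n$ grows beyond $d^k + 1$, at a rate of roughly $\sqrt{(d^k/n)\log(n/d^k)}$.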

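Proof. To derive an upper bound on the empirical error of $\hat{g}$, we first consider a weaker version of the algorithm. The idea is to select $\theta$ randomly instead of optimizing it as in the algorithm. For that, we establish the following lemma.

Lemma 3. Suppose $\theta$ is a random variable with the probability density function $f_\theta(t) = 1 - |t|$, for $t \in [-1, 1]$. Then, the following bound holds for any polynomial $p$:

$$\mathbb{E}_\theta\Big[\hat{P}\big\{Y \neq \mathrm{sign}[p(X) - \theta]\big\}\Big] \le \frac{1}{2}\|Y - p\|_{2,\hat{D}}^2.$$

Proof. Note that $y \neq \mathrm{sign}(p(x) - \theta)$ if $\theta$ is between $y$ and $p(x)$. Hence, the expected empirical error of $\mathrm{sign}[p(X) - \theta]$ with respect to the random $\theta$ equals

$$\mathbb{E}_\theta\Big[\hat{P}\big\{Y \neq \mathrm{sign}[p(X) - \theta]\big\}\Big] = \frac{1}{n}\sum_i \mathbb{E}_\theta\Big[\mathbb{1}\big\{y_i \neq \mathrm{sign}(p(x_i) - \theta)\big\}\Big] = \frac{1}{n}\sum_i \underbrace{P\big\{\theta \in [p(x_i), y_i] \cup [y_i, p(x_i)]\big\}}_{P_i}. \tag{3}$$

Next, we show that $P_i \le \frac{1}{2}(y_i - p(x_i))^2$ for all $(x_i, y_i)$’s. Suppose $y_i = 1$. If $p(x_i) > 1$, then $P_i = 0$ as $\theta \le 1$. If $p(x_i) \in [0, 1]$, then

$$P_i = P\big\{\theta \in [p(x_i), 1]\big\} = \int_{p(x_i)}^{1} (1 - t)\, dt = \frac{1}{2}\big(1 - p(x_i)\big)^2 = \frac{1}{2}\big(y_i - p(x_i)\big)^2.$$

If $p(x_i) \in [-1, 0]$, then

$$P_i = P\big\{\theta \in [p(x_i), 1]\big\} = \int_{p(x_i)}^{1} (1 - |t|)\, dt = \frac{1}{2} + \int_{p(x_i)}^{0} (1 + t)\, dt = \frac{1}{2} - p(x_i) - \frac{1}{2}\big(p(x_i)\big)^2 \le \frac{1}{2}\big(1 + |p(x_i)|\big)^2 = \frac{1}{2}\big(y_i - p(x_i)\big)^2.$$

Lastly, if $p(x_i) < -1$, then $P_i = 1$ because $\theta \ge -1$. In this case also $P_i \le \frac{1}{2}(y_i - p(x_i))^2$. The case for $y_i = -1$ follows by symmetry. Hence, we obtain the following inequality

$$\mathbb{E}_\theta\Big[\hat{P}\big\{Y \neq \mathrm{sign}[p(X) - \theta]\big\}\Big] \le \frac{1}{n}\sum_i \frac{1}{2}\big(y_i - p(x_i)\big)^2.$$

The proof is complete by noting that the right-hand side equals $\frac{1}{2}\|Y - p\|_{2,\hat{D}}^2$.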
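The pointwise inequality $P_i \le \frac{1}{2}(y_i - p(x_i))^2$ can also be checked numerically using the closed-form CDF of the density $1 - |t|$. The sketch below is a sanity check rather than part of the proof; the function names and the grid of test values are arbitrary choices made here.

```python
import numpy as np


def triangular_cdf(t):
    """CDF of the density f(t) = 1 - |t| on [-1, 1], extended by 0 and 1 outside."""
    t = np.clip(t, -1.0, 1.0)
    return np.where(t < 0, 0.5 * (1 + t) ** 2, 1 - 0.5 * (1 - t) ** 2)


def misclassification_probability(p_value, y):
    """P_i: the probability that theta falls between p(x_i) and y_i."""
    return np.abs(triangular_cdf(y) - triangular_cdf(p_value))


def lemma3_slack(p_value, y):
    """Bound minus probability; Lemma 3 says this is never negative."""
    return 0.5 * (y - p_value) ** 2 - misclassification_probability(p_value, y)


# Values of p(x_i) covering all four cases treated in the proof.
p_grid = np.linspace(-3.0, 3.0, 601)
```

The slack is zero when $p(x_i)$ lies between $0$ and $y_i$, which matches the equality obtained in the proof for $p(x_i) \in [0, 1]$ and $y_i = 1$.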

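Consequently, from the lemma and due to the fact that $\theta$ in the algorithm is selected to minimize the empirical error, we obtain that

$$\hat{P}\big\{Y \neq \hat{g}(X)\big\} \le \frac{1}{2}\|Y - \hat{p}\|_{2,\hat{D}}^2, \tag{4}$$

where $\hat{p}$ is the output of $L_2$-polynomial regression and $\hat{g} \equiv \mathrm{sign}[\hat{p} - \theta]$, as in Algorithm 1. Let $c$ be the predictor with minimum generalization error in the $(\epsilon, k)$-approximated concept class. Let $p$ be a degree-$k$ polynomial such that $\|c - p\|_2 \le \epsilon$. Since $\hat{p}$ minimizes the empirical 2-norm, the right-hand side of (4) satisfies

$$\frac{1}{2}\|Y - \hat{p}\|_{2,\hat{D}}^2 \le \frac{1}{2}\|Y - p\|_{2,\hat{D}}^2. \tag{5}$$

We proceed by taking the expectation of the empirical error with respect to the random training samples. From (4) and (5) we obtain the following inequalities

$$\begin{aligned}
\mathbb{E}\Big[\hat{P}\big\{Y \neq \hat{g}(X)\big\}\Big] &\le \frac{1}{2}\mathbb{E}\Big[\|Y - p\|_{2,\hat{D}}^2\Big] = \frac{1}{2}\|Y - p\|_{2,D}^2 \\
&\overset{(a)}{\le} \frac{1}{2}\Big(\|Y - c\|_{2,D} + \|p - c\|_{2,D}\Big)^2 \\
&\le \frac{1}{2}\Big(\|Y - c\|_{2,D} + \epsilon\Big)^2 \\
&\overset{(b)}{\le} \frac{1}{2}\Big(\|Y - c\|_{2,D}^2 + 4\epsilon + \epsilon^2\Big) \\
&\overset{(c)}{\le} 2P_{opt} + \frac{5}{2}\epsilon,
\end{aligned} \tag{6}$$

where (a) holds from Minkowski’s inequality for the 2-norm, (b) holds as $\|Y - c\|_{2,D} \le 2$, and (c) holds because of the second equality in (1), the fact that $P_{opt} = P\{Y \neq c(X)\}$, and $\epsilon \le 1$.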

Next, we connect the empirical error of $\hat{g}$ to its generalization error. Note that the Vapnik–Chervonenkis (VC) dimension of all functions of the form $\mathrm{sign}[p]$ for some polynomial $p$ of degree up to $k$ does not exceed $d^k + 1$. Therefore, from VC theory (see Corollary 3.19 in [19]), for any $\delta$, with probability at least $(1 - \delta)$, the following inequality holds

$$P\big\{Y \neq \hat{g}(X)\big\} \le \hat{P}\big\{Y \neq \hat{g}(X)\big\} + \sqrt{\frac{2(d^k + 1)}{n}\log\frac{en}{d^k + 1}} + \sqrt{\frac{\log\frac{1}{\delta}}{2n}}. \tag{7}$$

Set $\delta = \exp\{-\frac{1}{2}n\epsilon^2\}$. Therefore, the proof is complete by taking the expectation and combining it with the last bound in (6).
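The final combination step can be written out explicitly. The lines below are a sketch of the arithmetic only, assuming the forms of (6) and (7) as stated; they show how this choice of $\delta$ turns the confidence term into $\epsilon/2$ and how the constant $3$ in Theorem 1 arises.

```latex
\begin{align*}
\delta = \exp\Big\{-\tfrac{1}{2} n \epsilon^2\Big\}
  &\;\Longrightarrow\;
  \sqrt{\frac{\log\frac{1}{\delta}}{2n}}
  = \sqrt{\frac{\tfrac{1}{2} n \epsilon^2}{2n}}
  = \frac{\epsilon}{2}, \\
\underbrace{2P_{opt} + \tfrac{5}{2}\epsilon}_{\text{from (6)}}
  + \underbrace{\tfrac{1}{2}\epsilon}_{\text{from (7)}}
  &= 2P_{opt} + 3\epsilon .
\end{align*}
```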

PAC bounds for the Fourier algorithm: Next, we employ the low-degree (Fourier) algorithm (Algorithm 2) for PAC learning with respect to the polynomially approximated hypothesis class.

Theorem 2. Let $D$ be a joint probability distribution with marginal $D_X$ that is a product probability distribution on $\{-1, 1\}^d$. Then, for any $\delta \in [0, 1]$, with probability at least $1 - \delta$, the Fourier-based algorithm agnostically PAC learns any $(\epsilon, k)$-approximated concept class with generalization error up to

$$2P_{opt} + 2\epsilon + O\left(\sqrt{\frac{d^k c_k}{(k-1)!\, n}\log\frac{4 d^k}{(k-1)!\, \delta}}\right), \tag{8}$$

where $c_k \triangleq \max_{S \subseteq [d], |S| \le k}\|\psi_S\|_\infty^2$.

Corollary 1. If the expected value of each $X_j$ satisfies $|\mu_j| \le 1 - \frac{1}{k}$, then the generalization error of the Fourier algorithm is upper bounded by

$$2P_{opt} + 2\epsilon + O\left(\sqrt{\frac{\sqrt{k}\,(ed)^k}{n}\Big(k\log\frac{ed}{k} + \log\frac{2\sqrt{k}}{\delta}\Big)}\right).$$

V. LEARNING OTHER HYPOTHESIS CLASSES

In this section, we extend our results to two other types of concept classes. The first one is called half-spaces and the other one is a generalized version of the concentrated hypothesis classes.

A. Half-spaces

In this section, we consider learning another class of functions called half-spaces. More precisely, a half-space is a Boolean-valued function of the form

$$c(x) = \mathrm{sign}\Big[a_0 + \sum_{j=1}^{d} a_j x_j\Big], \quad \forall x \in \mathbb{R}^d,$$

where $a_j \in \mathbb{R}$. We start with a lower bound on the optimal classification error of the class.

Lemma 4. Let $D$ be any joint probability distribution on $\mathbb{R}^d \times \{-1, 1\}$ with marginal $D_X$ that is the uniform distribution on $S^{d-1}$ or jointly Gaussian on $\mathbb{R}^d$. Then, for any $\epsilon > 0$, the minimum generalization error of learning with respect to half-spaces satisfies the following lower bound

$$P_{opt} \ge \frac{1}{2} - \frac{1}{2}\|p^*\|_{1,D_X},$$

where $p^*$ is a polynomial of degree up to $O(1/\epsilon^4)$ minimizing $\|Y - p\|_{2,D}$ among all such polynomials.

The proof of the lemma follows from Lemma 2 and the result of [5] (Theorem 6) on the sign function. This result is stated as

Lemma 5 ([5]). Let $X$ be a random variable with uniform distribution on $S^{d-1}$ or jointly Gaussian on $\mathbb{R}^d$. Then, for any $\epsilon > 0$, there exists a polynomial $p$ of degree $O(1/\epsilon^4)$ such that

$$\mathbb{E}\big[(p(X) - \mathrm{sign}(X))^2\big] \le \epsilon^2.$$

This lemma makes a connection between half-spaces and the polynomial-approximated class. That said, in the following theorems we show our results for PAC learning using Algorithm 1.

Theorem 3. Let $D$ be any joint probability distribution on $\mathbb{R}^d \times \{-1, 1\}$, with marginal $D_X$ that is uniform on the unit sphere or jointly Gaussian. Then, $L_2$-polynomial regression PAC learns half-spaces with expected generalization error up to

$$2P_{opt} + 3\epsilon + \sqrt{\frac{d^{O(1/\epsilon^4)}}{n}\log\frac{n}{d^{O(1/\epsilon^4)}}}.$$

B. Generalized approximated class

Lastly, we finish this paper by extending our results to a more general hypothesis class. Fix a set of functions $e_1(x), e_2(x), \ldots, e_m(x)$ and let $\mathcal{H}$ be a Hilbert space spanned by these functions. Let $\mathcal{C}$ be a class of functions, each of which is approximated by elements of $\mathcal{H}$ with square error up to $\epsilon$, that is,

$$\inf_{h \in \mathcal{H}} \|c - h\|_{2,D} \le \epsilon,$$

for any $c \in \mathcal{C}$. As a special case, suppose the $e_i$’s are all the functions of the form $e(x) = \prod_{j \in [d]} x_j^{\alpha_j}$, where the $\alpha_j$’s are non-negative integers adding up to $k$. Then $\mathcal{C}$ is an $(\epsilon, k)$-approximated class as in Section IV.

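Theorem 4. Suppose $\mathcal{A}$ is any algorithm that, given $n$ training instances, finds a function $\hat{h} \in \mathcal{H}$ so that the empirical loss $\|Y - \hat{h}\|_{2,\hat{D}}$ is minimized. Then, the predictor $\mathrm{sign}[\hat{h}]$ learns $\mathcal{C}$ with expected generalization error up to

$$2P_{opt} + 3\epsilon + O\left(\sqrt{\frac{VC(\mathcal{C})}{n}\log\frac{n}{VC(\mathcal{C})}}\right),$$

where $VC(\mathcal{C})$ is the VC dimension of $\mathcal{C}$.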
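One possible instance of the algorithm $\mathcal{A}$ in Theorem 4 is ordinary least squares over the span of the chosen functions. The sketch below is an assumption-laden illustration: the basis is a hypothetical example, the fit uses `numpy.linalg.lstsq`, and the convention $\mathrm{sign}(0) = +1$ is chosen here rather than taken from the paper.

```python
import numpy as np


def fit_span_classifier(basis, X, y):
    """Least-squares fit in span{e_1, ..., e_m}, followed by a sign."""
    # Design matrix whose columns are the basis functions evaluated on X.
    E = np.column_stack([e(X) for e in basis])
    weights, *_ = np.linalg.lstsq(E, y, rcond=None)

    def h_hat(Z):
        return np.column_stack([e(Z) for e in basis]) @ weights

    def predictor(Z):
        # Assumed convention: sign(0) = +1.
        return np.where(h_hat(Z) >= 0, 1, -1)

    return predictor, h_hat


# Hypothetical basis: the constant, the two coordinates, and their product.
basis = [
    lambda Z: np.ones(len(Z)),
    lambda Z: Z[:, 0],
    lambda Z: Z[:, 1],
    lambda Z: Z[:, 0] * Z[:, 1],
]
```

Because a sign disagreement forces $|y - \hat{h}(x)| \ge 1$, the empirical error of `predictor` never exceeds the empirical square loss of `h_hat`, which is the kind of relation the Hilbert space argument exploits.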

ACKNOWLEDGEMENT

This work was supported in part by NSF Center on Science of Information Grants 0939370 and NSF Grants CCF-1524312, CCF-2006440, CCF-2007238, and Google Research Award.


REFERENCES

[1] N. Linial, Y. Mansour, and N. Nisan, “Constant depth circuits, Fourier transform, and learnability,” J. ACM, vol. 40, no. 3, pp. 607–620, 1993.

[2] L. G. Valiant, “A theory of the learnable,” Communications of the ACM, vol. 27, no. 11, pp. 1134–1142, Nov. 1984.

[3] M. J. Kearns, R. E. Schapire, and L. M. Sellie, “Toward efficient agnostic learning,” Machine Learning, vol. 17, no. 2-3, pp. 115–141, 1994.

[4] J. Suykens and J. Vandewalle, Neural Processing Letters, vol. 9, no. 3, pp. 293–300, 1999.

[5] A. T. Kalai, A. R. Klivans, Y. Mansour, and R. A. Servedio, “Agnostically learning halfspaces,” in Proc. 46th Annual IEEE Symp. Foundations of Computer Science (FOCS’05), Oct. 2005, pp. 11–20.

[6] E. Mossel, R. O’Donnell, and R. A. Servedio, “Learning functions of k relevant variables,” J. Comput. Syst. Sci., vol. 69, no. 3, pp. 421–434, 2004.

[7] E. Mossel, R. O’Donnell, and R. A. Servedio, “Learning juntas,” in Proc. ACM Symp. on Theory of Computing, 2003, pp. 206–212.

[8] E. Blais, R. O’Donnell, and K. Wimmer, “Polynomial regression under arbitrary product distributions,” Machine Learning, vol. 80, no. 2-3, pp. 273–294, 2010.

[9] M. Heidari, G. I. Shamir, and W. Szpankowski, “Fourier-based universal learning,” Journal of Machine Learning Research (JMLR), 2020.

[10] A. R. Klivans, P. M. Long, and R. A. Servedio, “Learning halfspaces with malicious noise,” Journal of Machine Learning Research, vol. 10, no. 12, 2009.

[11] A. Birnbaum and S. S. Shwartz, “Learning halfspaces with the zero-one loss: time-accuracy tradeoffs,” in Advances in Neural Information Processing Systems, 2012, pp. 926–934.

[12] I. Diakonikolas, T. Gouleakis, and C. Tzamos, “Distribution-independent PAC learning of halfspaces with Massart noise,” in Advances in Neural Information Processing Systems, 2019, pp. 4749–4760.

[13] A. Klivans and P. Kothari, “Embedding hard learning problems into Gaussian space,” in Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques (APPROX/RANDOM 2014). Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, 2014.

[14] V. Guruswami and P. Raghavendra, “Hardness of learning halfspaces with noise,” SIAM Journal on Computing, vol. 39, no. 2, pp. 742–765, 2009.

[15] P. Awasthi, M. F. Balcan, and P. M. Long, “The power of localization for efficiently learning linear separators with noise,” Journal of the ACM, vol. 63, no. 6, pp. 1–27, Feb. 2017.

[16] A. Daniely, “A PTAS for agnostically learning halfspaces,” in Conference on Learning Theory, 2015, pp. 484–502.

[17] M. L. Furst, J. C. Jackson, and S. W. Smith, “Improved learning of AC0 functions,” in COLT, vol. 91, 1991, pp. 317–325.

[18] R. O’Donnell, Analysis of Boolean Functions. Cambridge University Press, 2014.

[19] M. Mohri, A. Rostamizadeh, and A. Talwalkar, Foundations of Machine Learning. MIT Press, 2018.