Detecting adversarial manipulation using inductive Venn-ABERS predictors

(1)

ContentslistsavailableatScienceDirect

Neurocomputing

journalhomepage: www.elsevier.com/locate/neucom

Detecting

adversarial

manipulation

using

inductive

Venn-ABERS

predictors

Jonathan

Peck

a ,b ,∗

_,

_Bart

_Goossens

c

_,

_Yvan

_Saeys

a ,b

a Department of Applied Mathematics, Computer Science and Statistics, Ghent University, Ghent 90 0 0, Belgium b Data Mining and Modeling for Biomedicine, VIB Inﬂammation Research Center, Ghent 9052, Belgium c Department of Telecommunications and Information Processing, Ghent University, Ghent 90 0 0, Belgium

a

r

t

i

c

l

e

i

n

f

o

Article history: Received 9 July 2019 Revised 29 October 2019 Accepted 4 November 2019 Available online xxx Keywords: Adversarial robustness Conformal prediction Supervised learning Deep learning

a

b

s

t

r

a

c

t

InductiveVenn-ABERSpredictors(IVAPs)areatypeofprobabilisticpredictorswiththetheoretical guar-anteethattheirpredictionsareperfectlycalibrated.Inthispaper,weproposetoexploitthiscalibration propertyforthedetectionofadversarialexamplesinbinaryclassiﬁcationtasks.Byrejectingpredictions iftheuncertaintyoftheIVAPistoohigh,weobtainanalgorithmthatisbothaccurateontheoriginal testsetandresistanttoadversarialexamples.Thisrobustnessisobservedonadversarialsforthe under-lyingmodelas wellasadversarials thatweregeneratedbytaking theIVAPintoaccount.The method appearstooffercompetitiverobustnesscomparedtothestate-of-the-artinadversarialdefenseyetitis computationallymuchmoretractable.

1. Introduction

Modern machine learningmethods are able to achieve excep-tional empirical performance in many classification tasks [1,2] . However, they usually give only point predictions: a typical ma-chinelearningalgorithmforclassificationof,say,imagesfromthe ImageNet data set [3] will output only a single label given any image without reportinghowaccurate thispredictionis.In com-plex domains suchasmedicine,it ishighlydesirableto havenot just a prediction buta measure of confidence in thisprediction aswell. Many state-of-the-artmachine learningmethods provide some semblanceof such a measure by transforming their inputs intovectorsofprobabilitiesoverthedifferentpossibleoutputsand then selecting as the point predictionthe output with the high-est probability. Usually, these probabilities are obtained by first training a scoring classifier which assigns to each input a vector of scores, one score per class. These scores are then turned into probabilitiesviasomemethodofcalibration.Themostpopular cal-ibrationmethodisPlatt’sscaling [4] ,whichfitsa logisticsigmoid tothescoresoftheclassifiers.

Contrarytopopularbelief,however,theprobabilityvectorsthat resultfrommethodssuchasPlatt’sscalingcannotbereliably

inter-∗ _{Corresponding author at: Department of Applied Mathematics, Computer Sci-}

ence and Statistics, Ghent University, Ghent 90 0 0, Belgium.

E-mail addresses: [email protected] (J. Peck), [email protected] (B. Goossens), [email protected] (Y. Saeys).

pretedasmeasures of confidence [5] .The phenomenon of adver-sarialmanipulationillustratesthisproblem:forvirtuallyallcurrent machine learningclassifiers,itis possibleto constructadversarial examples [6] . These are inputs which clearly belong to a certain classaccordingto human observersandwhich arehighlysimilar to inputs on which the model performs very well. Despite this, themodelmisclassifiestheadversarialexampleandassignsavery highprobability to the resulting (incorrect) label. In fact, the in-putsneednotevenbesimilartoanynaturalinput;machine learn-ing models willalso sometimes classify unrecognizable inputsas belongingto certain classeswith veryhigh confidence [7] . Fig. 1 showsexamplesbothofimperceptibleadversarialperturbationsas wellasunrecognizable imageswhichareclassifiedasfamiliar ob-jects withhighconfidence.These observations clearlyundermine the idea that one can use such calibrated scores as measures of modelconfidence.Itbegsthequestion:

Do thereexist machine learningalgorithms which yield prov-ably valid measures of conﬁdence in their predictions? If so, couldthesemodels beusedtodetectadversarialmanipulation ofinputdata?

Theanswertothefirstquestionisknowntobeaffirmative:the algorithms inquestionare calledconformal predictors [8–10] .Our contributionhereistoshow thatsuch validconfidencemeasures can in fact be used to detect adversarial examples. In particular, there exist methods which can turn any scoring classifier intoa probabilisticpredictorwithvalidityguarantees;thisconstructionis https://doi.org/10.1016/j.neucom.2019.11.113

(2)

2 J. Peck, B. Goossens and Y. Saeys / Neurocomputing xxx (xxxx) xxx

Fig. 1. Illustrations of adversarial and fooling images. These examples highlight the fact that the conﬁdence scores output by deep neural network classiﬁers are unreliable.

knownasaninductiveVenn-ABERSpredictororIVAP [11] .By mak-ing use of the conﬁdence estimates output by the IVAPs, many state-of-the-artmachine learning models can be made robust to adversarialmanipulation.

1.1.Relatedwork

The reliability of machine learning techniques in adversarial settings has been the subject of much research for a number of years already [12–15] . Early work in this field studied how a linear classifier for spam could be tricked by carefully crafted changesinthecontentsofspame-mails,withoutsignificantly al-teringthereadabilityofthemessages.Morerecently,Szegedyetal. [14] showedthatdeepneuralnetworksalsosufferfromthis prob-lem.Sincethiswork,research interestinthe phenomenonof

ad-versarial examples has increased substantially and many attacks and defenses have been proposed [6,16,17] . The defenses can be broadlycategorizedasfollows:

• Detectormethods [18–21] .Thesedefensesconstructa detector

which augments the underlying classification model with the capability to detect whetheran input is adversarial ornot. If thedetectorsignalsapossibleadversarial,themodelprediction isconsideredunreliableonthat instanceandtheclassification resultisflagged.

• Denoisingmethods [22–25] .Here,thegoalistorestorethe ad-versarial examples to their original, uncorrupted versionsand thenperformclassiﬁcationonthesecleanedsamples.

• Other methods [6,22,26–28] . Another class of defenses per-formsneitherexplicitdetectionnorﬁltering.Rather,theaimis tomakethemodelinherentlymorerobusttomanipulationvia Pleasecitethisarticleas:J.Peck,B.GoossensandY. Saeys,DetectingadversarialmanipulationusinginductiveVenn-ABERSpredictors,

(3)

dataaugmentation,regularization,modiﬁedoptimization algo-rithmsorspecialarchitectures.

Despitethelargenumberofdefensesthathavebeenproposed so far,atthetime ofthiswritingonlyone techniqueis generally accepted ashaving any noticable effect [29] : adversarial training [27,28] .However,eventhismethodcurrentlyhastoolimited suc-cess. TheMadrydefense [27] ,forinstance,achieveslessthan50% adversarial accuracy on the CIFAR-10 data set [30] even though state-of-the-art clean accuracy is over 95% [31] . Moreover, many recentlyproposed defensessufferfromaphenomenon called gra-dient masking [32] . Here, the defense protects the model against adversarials by obfuscatingits gradient information.This is com-monlydonebyintroducingnon-differentiablecomponentssuchas randomization or JPEG compression. However, such tactics only renderthemodelrobustagainstgradient-basedattackswhich cru-ciallyrelyonthisinformationtosucceed.Attacksthatdonotneed gradientsorthatcancopewithimpreciseapproximations(suchas BPDA [32] orblack-boxattacks [33,34] )willnotbedeterredby gra-dientmaskingdefenses.

We are not the first to consider the hypothesis that adver-sarial examples are due to faulty confidence measures. Most ex-isting research in this area has focused on utilizing the uncer-tainty estimates provided by Bayesian deep learning [35] , espe-cially Monte Carlo dropout [36–38] and other stochastic regular-izationtechniques [5] .On theother hand,ourapproachis decid-edly frequentist and hence differs significantly from this related work. We are not awareof much other work inthisarea that is alsofrequentist(e.g.Grosseetal. [20] ),asthefieldofdeep learn-ing (andmodeluncertaintyin particular)appears tobe primarily Bayesian.Althoughsignificantprogressisbeingmade [39] ,scaling theBayesianmethodstostate-of-the-artdeepneuralnetworks re-mainsanopenproblem.TheBayesianapproachofintegratingover weighted likelihoods also suffers from certain pathologies which maydiminishitsusefulnessintheadversarialsetting [40] .Itisour hopethatthemethoddetailedhere,whichdrawsuponthe power-fulframeworkofconformalprediction [9] ,can serveasascalable andeffectivealternativetoBayesiandeeplearning.

Thedefenseweproposeinthisworkfallsunderthecategoryof detectormethods.Wearewell-awareofthebadtrackrecordthat thesemethodshave:forexample,CarliniandWagner [41] evaluate manyrecentlyproposedadversarialdetectormethodsandﬁndthat they can beeasily bypassed.However, by followingthe adviceof CarliniandWagner [41] andAthalyeetal. [32] intheevaluationof ourdetector,we hope toshow convincinglythat ourmethodhas thepotentialtobeastrong,scalableandrelativelysimpledefense againstadversarialmanipulation.Inparticular,wetestourdetector onadversarialexamplesgeneratedusingexistingattacksthattake thedetectorintoaccountaswellasanovelwhite-boxattackthat isspeciﬁcallytailoredtoourdefense.Weconcludethatthedefense weproposeinthisworkremainsrobustevenwhentheattacksare adapted tothedefenseanditappears tobecompetitive withthe defenseproposedbyMadryetal. [27] .

The presentworkis an extension ofourESANN 2019 submis-sion [42] .Themainadditionsareasfollows:

• IncludetheZeroesvsonesdataset.

• PerformanablationstudytotestthesensitivityoftheIVAPto itshyperparameters.

• Evaluatethedefenseinthe2normaswellinsteadofonly∞. • Thediscussionsoftheresultshasbeensigniﬁcantlyextended. • WehavereleasedareferenceimplementationonGitHub1_. • Include qualitative examples of adversarials produced by our

white-boxattack.

1_{https://github.com/saeyslab/binary-ivap}

1.2.Organization

The rest of the paper is organised as follows. Section 2 pro-vides the necessary background about supervised learning, con-formal prediction and IVAPs which will be used in the sequel. Section 3 describesthedefensemechanismwhichweproposefor thedetectionofadversarialinputs. Thisdefenseisexperimentally evaluatedonseveraltasksin Section 4 .Experimentalcomparisons totheMadrydefense arecarriedout in Section 5 . Section 6 con-tainstheconclusionandavenuesforfuturework.Thecodeforour referenceimplementationcanbefoundat

https://github.com/saeyslab/binary-ivap

2. Background

We consider the typical supervised learning setup for classi-ﬁcation. There is a measurable objectspace X and a measurable

label space Y. We let Z=X× Y. There is an unknown probabil-itymeasure P on _{Z which} we aim to estimate. In particular,we haveaclassofX→Y functionsH andadatasetS=

{

(

xi,yi

)

|

i= 1,...,m

}

ofi.i.d.samplesfromP.Ourgoalistoﬁndafunctionin

H whichﬁtsthedatabest: f=argminf∈HRS

(

f

)

. Here,RS istheempiricalrisk

RS

(

f

)

= 1 m m i=1

(

xi,yi,f

)

,

where is the loss function. In principle, this can be any non-negativeX× Y× H→Rfunction.Its purpose isto measure how “close” thefunctionfistotheground-truthyilocallyatthepoint

xi.Ourobjectiveistoﬁnda minimizerf inH for theaverageof theloss over thedata set S. Fork-ary classiﬁcation,a commonly usedlossfunctionisthelogloss

log

(

x,y,f

)

=− logpy

(

x; f

)

.

Here,py(x;f)istheprobabilitythatfassignstoclassyfortheinput

x.Ideally, fora discriminative model thisshould be equal to the conditional probability Pr[y

|

x] underthe measure P. In thecase ofascoringclassiﬁer,theseprobabilitiesarecomputedbyﬁttinga scoringfunctiongtothedatafollowedby acalibrationprocedure

s.Theﬁnalclassiﬁcationisthencomputedas

f

(

x

)

=argmaxi=1,...,ksi

(

g

(

x

))

. (1) InthecaseofPlatt’sscaling [4] ,whichisinfacta logistic regres-sionontheclassiﬁerscores,s_iisthesoftmaxfunction

si

(

z

)

=

exp

(

w_iz+bi

)

jexp

(

wjz+bj

)

.

Here,wiandbiarelearnedparameters.

2.1. Conformalprediction

Itisknown [5,36] thatthecalibratedscoresproducedby scor-ingclassiﬁerssuchas (1) cannotactually beusedasreliable esti-matesoftheconditionalprobabilityPr[y

|

x]:itispossibleto gen-erate inputsfor any given classiﬁer such that the model consis-tentlyassignshighprobabilitytothewrongclasses.Itmakessense, then,tolookfora typeofmachine learningalgorithm whichhas provableguaranteesontheconﬁdenceofitspredictions.Thisleads usnaturallytothestudyofconformalpredictors [8] ,whichhold ex-actlythispromise.

Ingeneral, aconformal predictorisafunction

which,when given a conﬁdence level

ε

∈ [0, 1], a bag2 _of _instances _B₌ 2 A bag or multiset is a collection of objects where the order is unimportant (like

(4)

4 J. Peck, B. Goossens and Y. Saeys / Neurocomputing xxx (xxxx) xxx z1,...,znandanewobjectx∈X,outputsaset

ε

(

B,x

)

⊆ Y.

In-tuitively,thissetcontainsthoselabelswhichthepredictorbelieves withatleast1−

ε

conﬁdencecouldbethetruelabeloftheinput samplebasedonthebagofexamplesgiventoit.Thesepredictors mustsatisfytwoproperties [8,10] :

1.Nestedness.If

ε

1≥

ε

2then

ε1

(

B,x

)

⊆

ε2

(

B,x

)

.

2. Exactvalidity.Ifthetruelabelofxisy,theny_∈

ε(B,x)with probabilityatleast1−

ε

.

The exact validity property is too much to ask from a deter-ministic predictor [8] . Instead, the algorithms we consider here are only conservatively valid. Speciﬁcally, let

ω

₌z1,z2,... be an inﬁnitesequence ofsamples froman exchangeabledistribution P. A distribution is said to be exchangeable if for every sequence

z1,...,zn and permutation

φ

of the integers

{

1,...,n

}

it holds that

Pr[z1,...,zn]=Pr[z_φ(1),...,zφ(n)].

That is, each permutation of a sequence of samples from P is equallylikely.Inparticular,i.i.d.samplesarealwaysexchangeable. Deﬁnetheerrorafterseeingnsamplesatsigniﬁcancelevel

ε

as

errε_n

(

,

ω

)

=

1 i f yn∈/

ε

(

z1,...,zn−1,xn

)

,

0 otherwise.

The quantities errε_n

(

,

ω

)

are random variables depending on

ω

~ P∞.Exact validitymeans thesevariables are allindependent andBernoulli distributed withparameter

ε

. Conservativevalidity meanstheyaredominatedindistributionbyindependentBernoulli random variables. That is, there exist two sequences of random variables

ξ

1,

ξ

2,... and

η

1,

η

2,... suchthat

1.each

ξ

nisindependentandBernoullidistributedwith parame-ter

ε

;

2. errεn

(

,

ω

)

≤

η

nalmostsurelyforalln;

3. the jointdistributionof

η

1,...,

ηn

equals that of

ξ

1,...,

ξn

for alln.

This implies the following asymptotic conservative validity property,bythelawoflargenumbers:

lim n→∞ 1 n n i=1

errε_i

(

,

ω

)

≤

ε

almostsurely.

Algorithm 1 is the general conformal prediction algorithm. It

Algorithm1:Theconformalpredictionalgorithm.

Input:Non-conformitymeasure

,signiﬁcancelevel

ε

,bag ofexamplesz1,...,zn⊆ Z,objectx∈X

Output:Predictionregion

ε

(

z1,. . .,zn,x

)

1foreachy∈Y do 2 z_n₊₁←

(

x,y

)

3 fori=1,...,n+1do 4

αi

←

(

z1,...,zn

\

zi,zi

)

5 end 6 py←_n₊₁1 #

{

i=1,...,n+1

|

αi

≥

αn

₊₁

}

7end 8return

{

y∈Y

|

p_y>

ε}

takes as a parameter a non-conformity measure

. In general, a non-conformity measure is any measurable real-valued func-tion whichtakes a sequence of samples z1,...,zn along withan additional sample z and maps them to a non-conformity score

(

z1,...,zn,z

)

.Thisscoreisintendedtomeasurehowmuchthe samplezdiffersfromthegivenbagofsamples.Theconformal pre-diction algorithm is always conservatively valid regardless ofthe

choice of non-conformity measure3_. _However, _the _predictive

eﬃ-ciencyofthealgorithm— thatis,thesizeofthepredictionregion

ε₍_B,_x₎ _{— can} _vary_considerably _with_different _choices_for

_._If

thenon-conformitymeasureischosensuﬃcientlypoorly,the pre-diction regions mayevenbe equalto theentiretyofY.Although thisisclearlyvalid,itisuselessfromapracticalpointofview.

Algorithm 1 determinesapredictionregionforanewinputx_∈ Xbasedonabagofoldsamplesbyiteratingovereverylabely∈Y

andcomputinganassociatedp-valuepy.Thisvalueistheempirical fractionofsamplesinthebag(includingthenew“virtualsample” (x,y))withanon-conformityscorethatisatleastaslarge asthe non-conformityscore of(x,y). Bythresholdingthesep-valueswe obtain aset ofcandidatelabels y1,...,yt such thateach possible combination

(

x,y1

)

,...,

(

x,yt

)

is“suﬃcientlyconformal” totheold samplesatthegivenlevelofconﬁdence.

2.2. InductiveVenn-ABERSpredictors

Of particular interest to us here will be the inductive Venn-ABERSpredictorsorIVAPs [11] .Thesearerelatedtoconformal pre-dictorsbuttheytakeadvantageofthepredictiveeﬃciencyofsome otherinductivelearningrule(suchasaneuralnetworkorsupport vectormachine).TheIVAPalgorithmisshownin Algorithm 2 for

Algorithm2:TheinductiveVenn-ABERSpredictionalgorithm forbinaryclassiﬁcation.

Input:bagofexamplesz1,...,zn⊆ Z,objectx∈X, learningalgorithmA

Output:Pairofprobabilities

(

p0,p1

)

1 Dividethebagoftrainingexamplesz₁,...,z_nintoaproper trainingsetz1,...,zmandacalibrationsetzm+1,...,zn. 2 RunthelearningalgorithmAonthepropertrainingsetto

obtainascoringruleF.

3 foreachexamplez_i=

(

x_i,y_i

)

inthecalibrationsetdo 4 s_i←F

(

x_i

)

5 end 6 s←F

(

x

)

7 Fitisotonicregressionto

{

(

s_m₊₁,y_m₊₁

)

,...,

(

sn,yn

)

,

(

s,0

)

}

obtainingafunction f0.

8 Fitisotonicregressionto

{

(

s_m₊₁,y_m₊₁

)

,...,

(

s_n,y_n

)

,

(

s,1

)

}

obtainingafunction f1.

9

(

p₀,p₁

)

←

(

f₀

(

s

)

,f₁

(

s

))

10 return

(

p₀,p₁

)

the case of binary classiﬁcation. The output of the IVAP algo-rithm isa pair(p0, p1) where 0≤ p0 ≤ p1 ≤ 1. Thesequantities canbe interpreted aslower andupperboundsonthe probability Pr[y₌1

|

x]_, that is, p0≤ Pr[y=1

|

x]≤ p1. The widthofthis in-terval, p1− p0,canbeusedasareliablemeasureofconﬁdencein theprediction.Although Algorithm 2 can onlybeused forbinary classiﬁcation, itis possible toextend itto the multi-classsetting [44] .Thisislefttofuturework.

IVAPsareavariantoftheconformalpredictionalgorithmwhere thenon-conformitymeasureisbased onanisotonicregressionof the scores which the underlying scoring classifier assigns to the calibration data pointsaswell asthe newinput to be classified. Isotonic (or monotonic) regression aims to fit a non-decreasing free-formlinetoasequenceofobservationssuchthatthelinelies asclosetotheseobservationsaspossible. Fig. 2 showsanexample ofisotonicregressionappliedtoa2Dtoydataset.

3 One might even use Bayesian methods in the computation of _{. In particular,}

might be computed using integration over weighted likelihoods determined by a parametric model together with some prior. This combination of Bayesian inference with frequentist methods has been called Frasian inference [43] .

(5)

Fig. 2. An example of isotonic regression on a 2D data set. The isotonic ﬁt is shown as a red line. This image was produced with the isoreg function from the R lan- guage [45] .

Inthecaseof Algorithm 2 ,theisotonicregressionisperformed asfollows. Lets1,...,sk be the scores assignedto thecalibration points. First,these pointsare sortedin increasing orderand du-plicates are removed, obtaining a sequences₁≤ · · · ≤ st.We then deﬁnethemultiplicityofs_jas

wj=#

{

i

|

si=sj

}

.

The“averagelabel” correspondingtosomescores_jis y_j= 1

wj

i:si=sj yi.

The cumulative sum diagram (CSD) is computed as the set of points Pi=

i j=1 wj, i j=1 y_jwj

=

(

Wi,Yi

)

fori=1,...,t.Forthesepoints,thegreatestconvexminorant(GCM) is computed. Fig. 3 showsan example ofa GCM computedfora givenset ofpointsinthe plane.Formally, theGCMofa function

f:U_→_Ris themaximal convex functiong:I_→_Rdeﬁned ona closed interval IcontainingU such that g(u) ≤ f(u) forall u ∈U

[46] .Itcan bethought ofasthe“lowestpart” of theconvexhull ofthegraphoff.

Thevalueats_ioftheisotonicregressionisnowdeﬁnedasthe slopeoftheGCMbetweenWi−1andWi.Thatis,iffistheisotonic regressionandgistheGCM,then

f

(

s_i

)

=g

(

Wi

)

− g

(

Wi−1

)

Wi− Wi−1 =

g

(

Wi

)

− g

(

Wi−1

)

wi

.

We leavethevaluesoffatother pointsbesides thes_iundeﬁned, aswewillneverneedthemhere.

3. DetectingadversarialmanipulationusingIVAPs

IVAPs enjoy a type ofvalidity property which we detail here. A random variable P is said tobe perfectly calibrated foranother randomvariableYifthefollowingequalityholdsalmostsurely:

E[Y

|

P]=P. (2)

Fig. 3. Example of the greatest convex minorant of a set of points, shown as a red line. This image was produced with the fdrtool package [47] .

LetP0,P1 berandomvariablesrepresentingtheoutputofanIVAP trainedonasetofi.i.d.samplesandevaluatedona newrandom sampleXwithlabelY_∈{0,1}.Thenthereexistsarandomvariable

S∈{0,1}suchthatPS isperfectlycalibratedforY.Hence,foreach predictionatleastoneoftheprobabilitesP0,P1 outputbyanIVAP isalmost surelyequal to theconditional expectation ofthe label giventheprobability.Sincethelabelisbinary,therelation (2) can berewrittenas

Pr[Y=1

|

PS]=PS. (3)

Eq. (3) expressesthevalidity propertyforIVAPs. Ifthedifference

p1− p0 issuﬃcientlysmallforsomeinstancex,sothatp0,p1≈ p, then (3) allows usto deduce that x haslabel 1 withprobability approximatelyp.Thevalidityproperty (3) holdsforIVAPsaslong asthedataareexchangeable,whichisthecasewheneverthedata arei.i.d.

Our proposed method for detecting adversarial examples is basedontheseobservations:weusep1− p0asaconﬁdence mea-sureforthepredictionreturnedbytheunderlyingscoringclassiﬁer andthenreturnthe appropriate labelbasedonthesevalues.The pseudocodeisshownin Algorithm 3 .Thegeneralideaistousean IVAP to obtain the probabilities p0 and p1 foreach test instance

x.If these probabilities lie close enough to each other according to the precision parameter

β

, then we assume to have enough conﬁdenceinourprediction;otherwise,we returnaspeciallabel REJECTsignifyingthatwe donottrusttheoutputoftheclassiﬁer onthisparticular sample.The precision parameter

β

istuned by maximizingYouden’s index [48] onaheld-outvalidationset con-sistingofcleanandadversarialsamples.Incasewetrustour pre-diction,weuse p= p₁

1−p0+p1 asanestimateoftheprobabilitythat

thelabelis1(asinVovketal. [11] ).Ifp>0.5,wereturn1; oth-erwise,we return 0.This method ofquantifying uncertainty and rejecting unreliablepredictions was infact already mentioned in Vovketal. [11] ,butto ourknowledgenobody hasyet attempted touseitforprotectionagainstadversarialexamples.

Animportantconsiderationinthedeploymentof Algorithm 3 is thecomputational complexityof theapproach.Thisisclearly de-terminedbytheunderlyingmachinelearningmodel,asweare re-quired to run its learningalgorithm andperform inference with theresultingmodelforeach sampleinthecalibration setaswell as for each new test sample. The overhead caused by the IVAP

(6)

Algorithm3:DetectingadversarialmanipulationwithIVAPs.

Input:precision

β

∈[0,1],bagofexamplesz1,...,zn⊆ Z, objectx∈X,learningalgorithmA

Output:Anelementoftheset_Y_∪

{

REJECT

}

. 1Usealgorithm2tocompute p₀,p₁forx. 2if p₁− p₀≤

β

then 3 Set p←₁_−pp1 0+p1. 4 if p>0.5then 5 return1 6 else 7 return0 8 end 9else 10 returnREJECT 11end

construction consistsofthe isotonic regressions, which are dom-inatedbythecomplexityofsortingthesamplesinthecalibration set,aswell asother operationsthat take lineartime. The scores ofthecalibrationsamplescanbeprecomputedandstoredforfast accesslater,reducingthecomplexityoftheinferencestep.To sum-marize,letTA(n)andTF(n)denotethetimecomplexityofrunning the learning algorithm and the scoring rule on a set of n sam-ples,respectively.Then we candetermine thetime complexityof Algorithm 3 asfollowsforabagofnsamples:

• Calibration. Initially, when the IVAP is calibrated, we run the learning algorithm on the proper training set and compute scoresforeachofthecalibrationsamples.ThistakesO

(

TA

(

n

)

+

TF

(

n

))

time.

• Inference. To compute the probabilities p0, p1 for a new test sample, we run thescoringrule andperformtwo isotonic re-gressions.ThiscanbedoneintimeO

(

TF

(

1

)

+nlogn

)

.

The overhead incurred by the IVAP in the calibration phase (which only needs to be performed once) is proportional to the complexityof the learning algorithm and scoring rule when ex-ecutedon thegiven bagofsamples. Whenperforming inference, theoverheadcomparedtosimplyrunningtheunderlyingmodelis log-linearinthesizeofthecalibrationset.

Itisalsoimportanttoconsiderhowthecalibrationsetandthe underlyingmodelaffecttheperformance oftheIVAP.Tothebest ofourknowledge,theexistingliteratureonIVAPsdoesnotprovide quantitativeanswerstothesequestions.Vovketal. [11] notethat thevalidity property (3) always holds for the IVAP regardless of theperformance ofthe underlyingmodel,aslong asthe calibra-tionset consists of independent samplesfrom the data distribu-tion.Theyalsopointoutthat,for“large” calibrationsets,the differ-encep1− p0willbesmall.Itwouldbeinterestingtohaveuniform convergenceboundsthat quantifyhow quickly p1− p0 converges tozeroas thecalibration setgrows larger, butwe are not aware ofanysuch results.Most likely,such boundswill depend on the accuracyoftheunderlyingmodel,wheretheconvergenceisfaster withamoreaccuratemodel.

3.1.White-boxattackforIVAPs

AspointedoutbyCarliniandWagner [41] ,itisnotsufficientto demonstratethatanewdefenseisrobustagainst existingattacks. Tomakeanyseriousclaimsofadversarialrobustness,wemustalso developandtesta white-boxattack thatwasdesignedto specifi-callytargetmodelsprotectedwithourdefense.Letxbeanyinput thatisnotrejectedandclassifiedasyby thedetector.Ourattack mustthenfindx˜∈X suchthat

1.

x− ˜x

pisassmallaspossible; 2. x˜isnotrejectedbythedetector; 3. x˜isclassiﬁedas1− y.

Here,

u

p is the general p norm of the vector u. Common choicesforpin theliterature are p∈{1,2, ∞} [27] ; we will fo-cuson2and∞normsinthiswork.

Following Carlini and Wagner [41] , we design a differentiable function thatis minimizedwhenx˜liesclosetox,is not rejected bythedetectorandisclassiﬁedas1− y.Todothiseﬃciently,note thatwheneveranewsamplehasthesamescoreasanoldsample inthecalibrationset,itwillhavenoeffectontheisotonic regres-sionandtheresultofapplying Algorithm 3 toitwillbethesame asapplyingthealgorithmtotheoldsample.Hence,wesearchfor asample(si,yi)inthecalibrationsetsuchthat

1. y_i₌1_{− y}; 2. f1

(

si

)

− f0

(

si

)

≤

β

;

3. theconﬁdence p= f₁(s_i)

1− f0(si)+f1(si) satisﬁes p > 0.5if yi=1 and

p≤ 0.5otherwise.

From among all samples that satisfy these conditions, we choosetheonewhichminimizesthefollowingexpression:

x− xi

22+c

(

s

(

x

)

− si

)

2.

Here,s(x)isthescoreassignedtoxbytheclassiﬁerandcisa con-stantwhichtradesoff therelativeimportanceofreducingthesize oftheperturbationvsfoolingthedetector.Wechoosethese sam-plesinthiswaysoasto“warm-start” ourattack:weseekasample thatliesascloseaspossibletotheoriginal xinX-spacebutalso hasascoresiwhichliesascloseaspossibletos(x).Thisway,we hope to minimize the amount of effortrequired to optimizeour adversarialexample.

Havingﬁxedsuchasample(si,yi),wesolvethefollowing opti-mizationproblem:

min δ∈[−1,1]d

δ

2

2+c

(

s

(

x+

δ

)

− si

)

2 subjecttox+

δ

∈[0,1]d. (4) Theconstantccanbedeterminedviabinarysearch,asinthe Car-lini&Wagnerattack [16] .Inourcase,weareapplyingtheattackto astandardneural networkwithoutanynon-differentiable compo-nents,meaningthescorefunctionsisdifferentiable.Theresulting problem (4) can therefore be solved usingstandard gradient de-scentmethodssincethefunctiontobeminimizedisdifferentiable aswell.Inparticular,weusetheAdamoptimizer [49] tominimize ourobjective function.The constraintx₊

δ

_∈[0_,1]d _can _easily _be enforced by clippingthe valuesofx+

δ

back intothe [0,1]d hy-percubeaftereachiterationofgradientdescent.

Note that our custom white-box attack will not suffer from anygradientmaskingintroducedbyourdefense. Indeed,we only relyongradientscomputedfromtheunderlyingmachinelearning model.Aslongastheunprotectedclassiﬁer doesnot mask gradi-ents(whichitshouldnot,sincetheunprotectedclassiﬁerswillbe vanillaneuralnetworkstrainedinthestandardway),this informa-tionwillbeuseful.

Asdescribed,theabovemethodofselectinga calibration sam-pleandsolving (4) onlyconsidersadversarialsthatareminimalin

2 distance.However, ∞-boundedperturbationsare alsoof inter-est,soinourexperimentsweevaluateboththeoriginal2 formu-lationaswellasan∞variantwherewereplace

·

22 by

·

∞.

4. Experiments

Toverifytherobustnessof Algorithm 3 against adversarial ex-amples, we perform experiments with several machine learning classiﬁcation tasks. The following questions are of interest to us here:

(7)

Fig. 4. Illustration of the different data splits. The training data is split into proper training data on which the scoring classiﬁer is trained and calibration data on which the IVAP ﬁts the isotonic regression. The test data is split into proper test data, which is used to determine the overall accuracy of the resulting algorithm, and validation data. Both of these subsets of test data are used to generate adversarial proper test data and adversarial validation data. The adversarial proper test data is used for evaluating the robustness of the resulting algorithm, whereas the adversarial validation data is used together with the validation data to tune β. All splits are 80% for the left branch and 20% for the right branch.

1. HowdoestheIVAPhandleadversarialexamplestargetedatthe original model (Section 4.1 )? Ideally, the IVAP should be im-munetothese.

2. Canwegenerateadversarialexamplesspeciﬁcallytobypassthe IVAP(Sections 4.2 and 4.3 )?AspointedoutbyCarliniand Wag-ner [16] ,itis not enoughforan adversarial defenseto be ro-bustagainstan obliviousadversary;wemustalsoestablish ro-bustnessagainstan adversarythatisawareofourdefense.We thereforeadaptexistingadversarialattackstotargetour detec-torandwealsoevaluatetheIVAPagainstourcustomwhite-box attack.

3. Howmuchadversarialdataisactuallyneededinthevalidation set (Section 4.4 )?As it stands, we augmentthe validationset withadversarialexamples generatedby avarietyofattacksin ordertotunethe

β

parameter.However,wewantourdefense toberobustnotonlyagainstattackswhichithasbeenexposed tobutalsotoattackswhichhaveyettobeinvented.We there-forecompareourresultstoasettingwhereonlyoneadversarial attackisusedforthevalidationset.

ThegoaloftheseexperimentsistoshowthattheIVAPcan han-dlebotholdandnewadversarialexamplesandthatitsrobustness generalizesbeyondthelimitedsetofattackswhichithasseen.As such,we consideranumberofdifferentscenarios foreach classi-ﬁcationtask,detailedbelow.

4.1. Robustnessofthedetectortotheoriginaladversarialexamples

Werunadversarialattackalgorithmsonboththevalidationset and theproper test set, creatingan adversarial validation set and anadversarialpropertestset(see Fig. 4 fordetailsonourdifferent datasplits).Theattacksweemployedwereprojectedgradient de-scent withrandom restarts [27] ,DeepFool [50] ,local search [51] , the single pixel attack [52] , NewtonFool [53] , fast gradient sign [6] and the momentum iterative method[54] .This means for each test setwegeneratesevennewdatasetsconsistingofallthe ad-versarial examples the attacks were able to ﬁnd on the test set fortheoriginalmodel.Wethenrun Algorithm 3 onthe adversar-ial validationsetaswell asthe regularvalidationset.Ourgoalis to maximize Youden’s index(deﬁned below) on both the regular andadversarialvalidationsets.We choose

β

such that thisindex ismaximized.Wecanthenusethepropertestsetandthe adver-sarialpropertestsettojudgethecleanandadversarialaccuracyof theresultingalgorithmwiththetunedvalueof

β

.

4.2. Robustnessofthedetectortonewadversarialexamples

We attempt to generatenew adversarials which take the de-tector intoaccountandevaluate theperformance similarlytothe

Table 1

Summary of performance indicators for the CNNs used in our experiments.

Task Clean accuracy Adversarial accuracy

Zeroes vs ones 100% 2.31%

Cats vs dogs 89.87% 1.8%

T-shirts vs trousers 99.75% 3.43% Airplanes vs automobiles 96.69% 3.75%

ﬁrst scenario.The resulting samplesare called adapted adversari-als. Theyare generated by applying all of theadversarial attacks mentioned above to Algorithm 3 . Forthe gradient-based attacks, we use the naturalevolution strategies approach [34,55] to esti-matethegradient,therebyavoidinganypossiblegradientmasking issues [32] .

4.3.Fullywhite-boxattack

We run our custom adversarial attack (4) forthe detector on theentirepropertestsetandevaluateitsperformance.Thisattack isspeciﬁcallydesignedtotargettheIVAPandshouldrepresentan absoluteworst-casescenarioforthedefense.

4.4.Ablationstudy

We estimate

β

based on adversarials generated only by the DeepFoolmethodonthevalidationsetfortheoriginalmodeland thenevaluatetheresultingdetectorasbefore.Thepurposeofthis scenarioistoestimatehow“future-proof” ourmethodis,by eval-uating its robustness against attacks that played no role in the tuning of

β

. Our choice for DeepFool was motivatedby the ob-servations that it is efficientand rathereffective, butit is by no means the strongest attack in our collection. It is less efficient than fast gradient sign, but this methodwas never meant to be aserious attack;it wasonly meantto demonstrateexcessive lin-earity of DNNs [6] . It is weaker than NewtonFool, for example, sincethismethodutilizes second-orderinformationofthe model whereasDeepFoolislimitedtofirst-orderapproximations.Itisalso weakerthansomefirst-orderattackssuchasthemomentum itera-tivemethod,whichwasatthetopoftheNIPS2017adversarial at-tackcompetition [54,56] .WethereforefeelthatDeepFoolisagood choiceforevaluatinghowrobustourdefenseistofutureattacks.

4.5.Metrics

For all scenarios, we report the accuracy, true prositive rate (TPR), truenegativerate(TNR),falsepositive rate(FPR)andfalse negativerate(FNR)ofthedetectors.Thesearedeﬁnedasfollows:

Accuracy=TA_m+TR, TPR=_TATA +FR, TNR= TR TR+FA, FPR= FA FA+TR, FNR= FR FR+TA.

Here,misthetotal numberofsamples,TAisthenumberof cor-rectpredictionsthedetectoraccepted,TR isthenumberof incor-rectpredictionsthedetectorrejected,FAisthenumberofincorrect predictionsthedetectoracceptedandFRisthenumberofcorrect predictionsthedetectorrejected.Youden’sindexisdeﬁnedas J=TPR+TNR− 1=TPR− FPR.

It is deﬁnedfor every point on a ROC curve. Graphically, it cor-respondstothedistancebetweentheROCcurve andthe random chanceline [48] .

Thesemetricsarecomputedfordifferentscenarios:

• Clean.Thisreferstothepropertestdataset,withoutany mod-iﬁcations.

(8)

8 J. Peck, B. Goossens and Y. Saeys / Neurocomputing xxx (xxxx) xxx Table 2

Summary of performance indicators for the detectors on the different tasks.

Task Data Size Accuracy TPR TNR FPR FNR

Zeroes vs ones Clean 1692 98.52% 98.51% 100.0% 0.0% 1.49% Adversarial 6918 46.07% 0.0% 100.0% 0.0% 100.0% Adapted 5858 97.41% 0.0% 99.06% 0.94% 100.0% Custom 2 1692 0.0% 0.0% 0.0% 100.0% 100.0%

Custom ∞ 1692 0.0% 0.0% 0.0% 100.0% 100.0%

Cats vs dogs Clean 3988 72.94% 75.9% 48.85% 51.15% 24.1% Adversarial 22,599 52.56% 71.25% 38.86% 61.14% 28.75% Adapted 17,187 59.27% 100.0% 59.18% 40.82% 0.0% Custom 2 3952 1.64% 100.0% 0.0% 100.0% 0.0%

Custom ∞ 3145 1.69% 100.0% 0.0% 100.0% 0.0%

T-shirts vs trousers Clean 1600 96.88% 96.86% 97.44% 2.56% 3.14% Adversarial 7076 50.99% 0.0% 99.97% 0.03% 100.0% Adapted 5923 77.53% 0.0% 77.99% 22.01% 100.0% Custom 2 1600 0.0% 0.0% 0.0% 100.0% 100.0%

Custom ∞ 1600 0.0% 0.0% 0.0% 100.0% 100.0%

Airplanes vs automobiles Clean 1600 74.38% 73.21% 96.3% 3.7% 26.79% Adversarial 8256 58.14% 0.0% 100.0% 0.0% 100.0% Adapted 6586 93.99% 0.0% 94.82% 5.18% 100.0% Custom 2 1600 0.0% 0.0% 0.0% 100.0% 100.0%

Custom ∞ 1600 0.0% 0.0% 0.0% 100.0% 100.0% • Adversarial. Here,themetrics arecomputedon theadversarial

test set. This data set is constructed by running existing ad-versarial attacksagainsttheunderlyingneural networkonthe propertestset.TheydonottaketheIVAPintoaccount. • Adapted. Thisissimilarto theAdversarialscenario butthe

at-tacksaremodiﬁedtotaketheIVAPintoaccount.

• Custom p. Here, we compute the metrics for adversarial ex-amplesgeneratedusingourcustomp white-boxattackonthe propertestset.Wereportresultsforp=2andp=∞.

4.6.Implementationdetails

The implementations of the adversarialattacks, including the gradientestimatorusingthenaturalevolutionstrategies,were pro-videdbytheFoolboxlibrary [57] .Theimplementationofthe Venn-ABERSpredictorwasprovidedbyToccaceli [58] .Thedifferentdata splitswe perform fortheseexperimentsare illustrated schemati-cally in Figure 4 .Almost all neuralnetworkswere trained for50 epochsusingtheAdamoptimizer [49] withdefaultparameters in theKerasframework [59] usingtheTensorFlow backend [60] .The exceptionistheCNNforthecatsvsdogstask,whichwastrained for100 epochs. No regularization ordata augmentation schemes wereused;the onlypreprocessingweperformis anormalization oftheinputpixelstothe[0,1]rangeaswellasresizingtheimages inone ofthetasks.Descriptions ofall theCNNarchitecturescan befoundin Appendix A .

4.7.Datasets

Here,wedetailthedifferentdatasetsusedforourexperiments. Note that these are all binary classiﬁcation problems, since the IVAPconstructionasformulatedbyVovketal. [11] onlyworksfor binarytasks.ExtendingtheIVAPto multiclassproblemshasbeen discussed,forexample,inManokhin [44] ;usingamulticlassIVAP inadversarialdefensewillbeexploredinfuturework.

Zeroesvs ones.Ourfirst taskisabinaryclassification problem based on the MNIST data set [61] . We take the original MNIST data set and filter out only the images displaying either a zero or a one. The pixel values are normalized to lie in the interval [0,1]. We then run our experiments as described above, using a convolutional neural network (CNN) as the scoring classifier for Algorithm 3 .

Catsvsdogs.Thesecondtaskweconsideristheclassiﬁcationof catsanddogsintheAsirradataset [62] .Thisdatasetconsistsof 25,000JPEGcolorimageswherehalfcontaindogsandhalfcontain cats.AgainwetrainaCNNasthescoringclassiﬁerfor Algorithm 3 . Intheoriginalcollection,not allimageswereofthesamesize.In ourexperiments,weresizedallimagesto64× 64pixelsand nor-malizedthepixelvaluesto[0,1]tofacilitateprocessingbyour ma-chinelearningpipeline.

T-shirts vs trousers. The third task is classifying whether a 28× 28grayscale imagecontainsa picture ofa T-shirtor apair of trousers. Similarly to the MNISTzeroes vsones task,we take the Fashion-MNIST data set [63] andﬁlterout thepictures of T-shirtsandtrousers.Again,thepixelvaluesarenormalizedto[0,1] andaCNNisusedforthisclassiﬁcationproblem.

Airplanesvsautomobiles.Ourfourthtaskisbasedonthe CIFAR-10 data set [30] , which consists of 60,000 RGB images 32 × 32 pixels insize. Weﬁlter out theimagesof airplanes and automo-bilesandtrainaCNNtodistinguishthesetwoclasses.

4.8. Resultsanddiscussion

The baseline accuracies of the unprotectedCNNs are givenin Table 1 forthedifferenttasks.Wecanseethat theyperform rea-sonablyaccuratelyoncleantestdata,butasexpectedthe adversar-ial accuracyisvery low. Table 2 givesthe cleanaccuraciesofthe IVAPs foreach of the tasks. From these results,we can see that theclean accuracyoftheprotectedmodelinvariablysuffersfrom the addition of the IVAP. However, in almost all tasks, the false positive andfalse negative ratesare relatively low. Ourdetectors arethereforecapableofacceptingcorrectpredictionsandrejecting mistakenonesfromtheunderlying modelonclean data.The ex-ceptionisthecatsvsdogstask,wherethefalsepositiverateis un-usuallyhighat51.15%.However,webelievetheseresultsmightbe improvedby usingabetter underlyingCNN.Inourinitial experi-ments,weusedthesameCNNforthistaskaswedidforairplanes vs automobiles and trained it for only 50 epochs. Using a more complexCNNandtrainingitlongersigniﬁcantlyimprovedthe re-sults,soweareoptimisticthattheresultsweobtainedcanbe im-provedevenfurtherby adaptingtheunderlyingCNNandtraining regime.

Theresultspresentedhereareobservationsfromasinglerunof ourexperiments.Multiplerunsoftheexperimentsdidnotproduce signiﬁcantlydifferentresults.

(9)

Fig. 5. The ROC curves for the different detectors. The orange dashed line is the random chance line. The red solid line is the ROC curve for our detector. The dotted black line is the cut-off corresponding to the optimal value of β, which is deﬁned here as the value that maximizes Youden’s index.

Fig. 6. Selections of adversarial examples which can fool our detectors, generated by the custom 2 white-box attack.

Robustness to existing attacks. The ﬁrst question we would like to answer when evaluating any novel adversarial defense is whether it can defend against existing attacks that were not

Fig. 7. Selections of adversarial examples which can fool our detectors, generated by the custom ∞ white-box attack.

adaptedspeciﬁcallytofoolit.Thisisabareminimumofstrength onedesires:ifexistingattackscanalreadybreakthedefense,then itisuseless.In Table 2 ,weevaluate ourIVAPdefenseagainst the

(10)

Fig. 8. Selections of adversarial examples which can fool our ablated detectors, gen- erated by the custom 2 white-box attack.

Fig. 9. Selections of adversarial examples which can fool our ablated detectors, gen- erated by the custom ∞ white-box attack.

suiteofattacksdescribedin Section 4.1 .Theresultsareshownfor theAdversarialandAdaptedscenarios.Whenwetransfer adversar-ialsgeneratedfortheunderlying modeltotheIVAP,theaccuracy alwaysdegradessigniﬁcantly.However,itisimportanttonotethat all unprotected CNNs have very low accuracy on these samples. Assuch, inalmost all cases,theIVAP hasa veryhightrue nega-tiverateasmostpredictionsareinerror.Theexceptionisthecats vsdogs task, where the true negativerate is rather low on the transferredadversarials.Theaccuracyisstillmuchhigherthanthe unprotectedmodel,though (52.56%vs 1.8%),so theIVAP did sig-niﬁcantlyimprovetherobustnessagainstthistransfer attack.The adversarialsgeneratedusingexisting attacksadaptedto the pres-enceoftheIVAPalsodegradetheaccuracy,butnotasseverelyas thetransferattack.Truenegativeratesremainhighacrossalltasks andthe level of accuracyis, at the very least,much better than withoutIVAPprotection.

Robustness to novel attacks. To more thoroughly evaluate the strength ofour defense, it is necessary to also test it against an adaptiveadversaryutilizinganattack thatis specificallydesigned to circumvent our defense. For this reason, Table 2 also shows tworowsof testresults foreachtask,wherewe run ourcustom white-boxattack for both the 2 and ∞ norms. At first glance, itappearsasthoughourattack iscompletelysuccessfulandfully breaks our defense, yielding 0% accuracy. To investigate this fur-ther, Figs. 6 and 7 show randomselectionsoffiveadversarial ex-amplesthatmanagedtofoolthedifferentdetectors,generatedby ourwhite-boxattack.Empiricalcumulativedistributionsofthe dis-tortion levels introduced by our white-box attack are shown in Figs. 10 and 11 .Visuallyinspectingthegeneratedadversarialsand looking atthe distributions of the distortion levels reveals cause foroptimism: although some perturbations are still visually im-perceptible,thereisa significantincreaseintheoveralldistortion requiredto fool the detectors. This is especially thecase for the

∞ adversarials,wherealmost all perturbationsare obvious.Even the2adversarialsaregenerallyveryunsubtleexceptfortheones generatedonthecatsvsdogstask.We believethisisdueto the fact that the Asirra images are larger than the CIFAR-10 images (64 × 64vs32 × 32), sothat the 2 perturbations can be more “spreadout” acrossthepixels.Regardless,ifonecomparesthe ad-versarialimagestotheoriginals,theperturbationsarestillobvious becausetheadversarialsareunusuallygrainyandblurred.

Sensitivityof hyperparameters. Forthe experiments carriedout in Table 2 ,theIVAPwasonlyexposedtoadversarialattackswhich ithadalreadyseen:therejectionthreshold

β

wastunedon sam-ples perturbed using thesame adversarialattacks asit wasthen tested against. Fig. 5 shows theROC curves forthe different de-tectorsalong withtheoptimalthresholds. Animportantquestion that arisesnow is how well the IVAP fares if we test it against attacks that differ fromthe onesused to tune its hyperparame-ters, since in practice it is doubtful that we will be able to an-ticipatewhat attacksanadversarymayuse.Moreover,evenifwe cananticipatethis, theset ofpossibleattacksmaybe solarge as to make it computationally infeasible to use them all. Therefore, wealsotesttheperformance oftheIVAPwhen

β

istunedon ad-versarialsgeneratedonlybytheDeepFoolattackinsteadofthefull suiteofattacks.Weobtainedidenticalvaluesof

β

fortheZeroesvs onesandCatsvsdogstasks.Theresultsfortheothertaskswhere

β

wasdifferent are shown in Table 3 . The conclusionsare simi-lar:theaccuracydegradesboth oncleanandadversarialdata,but cleanaccuracyisstillhighandtruenegativeratesarehighon ad-versarial data.The custom white-boxattack again appearshighly successful. Figs. 12 and 13 plottheempirical cumulative distribu-tions ofthe adversarialdistortionlevels forthe ablateddetectors and Figs. 8 and 9 showselectionsof customwhite-box adversar-ials which fooled the ablated detectors. These results show that robustnessto our2 white-boxattack is signiﬁcantly lower com-paredtothenon-ablateddetectors.The∞-boundedperturbations arestillverynoticable,however.Asanillustrativeexample, Fig. 14 shows histograms of the differences p1− p0 between the upper andlower probabilitiesfor cleanandDeepFool adversarial exam-plesonthezeroesvsonestask.Althoughthereisasmallamount ofoverlap,thevastmajorityofadversarialexampleshaveamuch higherdifference than cleansamples: cleandata hasa difference closeto 0%,whereas adversarialsare almostalways closeto 50%. Thetunedmodelthresholdisplacedsothatthemajorityofclean samplesarestill allowedthroughbutvirtuallyall adversarialsare rejected.

5. Comparisonwithothermethods

Inthissection,wecompareourIVAPmethodtotheMadry de-fense [27] which, to the best of our knowledge, is the strongest state-of-the-artdefenseatthetimeofthiswriting.Speciﬁcally,we apply the Madry defense to each of the CNNs used in the tasks from Section 4 andcomparetheaccuraciesofthedifferent meth-ods. We also perform transfer attacks where we try to fool our IVAPusingadversarialsgeneratedfortheMadrydefenseandvice versa.

The Madrydefense is aform ofadversarialtrainingwhere,at eachiterationofthelearningalgorithm,the_∞projectedgradient descentattack isusedtoperturballsamplesinthecurrent mini-batch4_._The _model_is_then _updated_based_on_this_perturbed _batch insteadoftheoriginal.Tofacilitatefaircomparison,wedonotuse thepre-trainedmodelsandweights publishedby Madry et al. for the MNISTandCIFAR-10data sets,since theseare meant forthe full 10-classproblemswhereas ourdefense onlyconsiders a sim-plerbinarysubproblem.Rather,wemodiﬁedthenetwork architec-tures for thesetasks so they correspond to our own CNNs from Section 4 andtrainedthemaccordingtotheprocedureoutlinedby Madryetal. [27] .

Table 4 showstheresultsweobtainedwiththeMadrydefense aswellastheparametersettingsweusedforthePGDattack.The

4_{Madry et al. stress that the}

∞ variant be used and not, e.g., an 2 one. We

follow their advice and perform adversarial training exclusively with the ∞ version.

The implementation we used was taken directly from https://github.com/MadryLab/ mnist _ challenge .

(11)

Fig. 10. Empirical cumulative distributions of the adversarial distortion produced by our 2 white-box attack. The distance metrics are 2 (left) and ∞ (right).

Fig. 11. Empirical cumulative distributions of the adversarial distortion produced by our ∞ white-box attack. The distance metrics are 2 (left) and ∞ (right). Table 3

Summary of performance indicators for the ablated detectors on the different tasks.

Task Data Size Accuracy TPR TNR FPR FNR

T-shirts vs trousers Clean 1,600 97.81% 97.89% 94.87% 5.13% 2.11% Adversarial 7,076 49.31% 3.81% 93.02% 6.98% 96.19% Adapted 5,901 56.8% 0.0% 56.93% 43.07% 100.0% Custom 2 1,600 0.0% 0.0% 0.0% 100.0% 100.0%

Custom ∞ 1,600 0.0% 0.0% 0.0% 100.0% 100.0%

Airplanes vs automobiles Clean 1,600 93.62% 95.39% 60.49% 39.51% 4.61% Adversarial 8,256 52.63% 0.0% 90.52% 9.48% 100.0% Adapted 6,580 69.18% 0.0% 69.73% 30.27% 100.0% Custom 2 1,600 0.0% 0.0% 0.0% 100.0% 100.0%

Custom ∞ 1,598 0.0% 0.0% 0.0% 100.0% 100.0%

Table 4

Summary of parameters and performance indicators for the Madry defense on the machine learning tasks we considered. Step sizes were always set to 0.01 and random start was used each time.

Task Clean accuracy Adversarial accuracy ε Steps

Zeroes vs ones 99.88% 97.1% 0.3 40

Cats vs dogs 54.94% 3.5% 0.3 10

T-shirts vs trousers 97.12% 85% 0.3 40 Airplanes vs automobiles 82.06% 29.19% 0.3 10 defense performs verywell onthezeroes vsonesandT-shirtsvs trouserstasks.However,onairplanesvsautomobilesthe adversar-ialaccuracyislowandonthecatsvsdogstaskboththecleanand

Table 5

Results of the Madry-to-IVAP transfer attack, where adversarial examples generated for the Madry defense were tested against our IVAPs.

Task Accuracy TPR TNR FPR FNR

Zeroes vs ones 36.05% 30.91% 100.0% 0.0% 69.09% Cats vs dogs 61.64% 65.08% 52.3% 47.7% 34.92% T-shirts vs trousers 72.75% 71.02% 84.47% 15.53% 28.98% Airplanes vs automobiles 45.81% 34.53% 95.92% 4.08% 65.47%

adversarialaccuraciesare low.Infact, forthe lattertask,the ad-versarialaccuracyisalmostthesameasifnodefensewasapplied atall.

(12)

Fig. 12. Empirical cumulative distributions of the adversarial distortion produced by our 2 white-box attack on the ablated detectors. The distance metrics are 2 (left) and ∞ (right).

Fig. 13. Empirical cumulative distributions of the adversarial distortion produced by our ∞ white-box attack on the ablated detectors. The distance metrics are 2 (left) and ∞ (right).

Fig. 14. Histograms of the differences p 1 − p 0 for clean and adversarial data along

with the tuned model threshold β. We used the DeepFool attack here to generate adversarials for the zeroes vs ones model.

In Table 6 we transfer the white-box adversarials forour de-fensetotheMadrydefense.The Madrydefense appearsquite ro-busttoourIVAPadversarialsforT-shirtsvstrousersandairplanes vsautomobiles,butmuchlesssoonzeroesvsones.Theaccuracy

Table 6

Results of the IVAP-to-Madry transfer attack. Here, adversarial examples generated for the IVAP defense by our custom white-box attack are transferred to the Madry defense. We test adversarials generated by both the 2 and ∞ variants of our attack.

Task Accuracy ( 2 ) Accuracy ( ∞ )

Zeroes vs ones 46.28% 66.78%

Cats vs dogs 55.26% 55.48%

T-shirts vs trousers 94.94% 93.88%

Airplanes vs automobiles 79.94% 78.38%

onthe adversarialsforthecatsvsdogstaskiscloseto theclean accuracy, so theseappear to have little effect on the Madry de-fense.

Table 5 showswhathappenswhenwetransferadversarialsfor theMadrydefensetoourIVAPdetectors.Weobservethat forthe zeroes vs ones andairplanes vsautomobiles, the accuracy drops below50%.However,false positiveratesontheseadversarialsare very low,sothe detectorssuccessfullyreject manyincorrect pre-dictions.Ontheother tasks,accuracyremainsrelativelyhigh. The falsepositiverateisratherhighforthecatsvsdogstask,butlow forT-shirtsvstrousers.

Fig. 15 plotsthe empiricaldistributions ofthe adversarial dis-tortion levels of the Madry adversarials, both for the 2 and ∞ Pleasecitethisarticleas:J.Peck,B.GoossensandY. Saeys,DetectingadversarialmanipulationusinginductiveVenn-ABERSpredictors,

(13)

Fig. 15. Empirical cumulative distribution curves for the adversarial distortion of the Madry adversarials.

Fig. 16. Selections of adversarial examples which can fool the Madry defense, gen- erated by the ∞ PGD attack.

distances.Comparingthisﬁgurewith Figs. 10 and 11 ,weseethat ourcustom white-boxadversarialsfortheIVAP defense generally requirehigher 2 and∞ levels ofdistortionthan theMadry ad-versarials. Fig. 16 plotsselectionsofadversarialsgeneratedforthe Madrydefensesbythe ∞ PGDattack.The visuallevelsof distor-tion appearcomparabletoourdefense (forthe∞ variantofour attack),exceptinthecaseofthecatsvsdogstaskwherethe per-turbationsfortheMadrydefenseseemmuchlarger.Itisimportant to note, however,that the performance of theMadry defense on this taskisclose to randomguessing, whereas the IVAP still ob-tainsover70%cleanaccuracy.

Weareleadtotheconclusionthatourdefenseinfactachieves higher robustnesson thesetasks than the Madrydefense. More-over,ourdefenseappearstoscalemuchbetter:adversarially train-ing a deep CNN with ∞ projected gradient descent quickly be-comes computationallyintractable asthedataincrease in dimen-sionalityandtheCNNincreasesincomplexity.Ontheother hand, ourIVAPdefensecantakeanexistingCNN(trainedusingeﬃcient, standard methods)and quicklyaugmentit withuncertainty esti-mates based on a calibration set and a data set of adversarials for theunderlying model. The resultingalgorithm appears to re-quirehigherlevels ofdistortion thanthe Madrydefense inorder to be fooledeffectivelyandit seemsto achieve higherclean and adversarialaccuracy onmore realistictasks. Finally,we note that our method introduces only a single extra scalar parameter, the

threshold

β

, whichcan be easily manually tuned ifdesired. It is also possible to tune

β

based on other objectives than Youden’s indexwhichwasusedhere.Ontheotherhand,theMadrydefense requiresapotentiallylengthyadversarialtrainingprocedure when-everthedefense’sparametersaremodiﬁed.

6. Conclusion

We have proposed using inductive Venn-ABERS predictors (IVAPs)toprotectmachinelearningmodelsagainstadversarial ma-nipulationofinputdata.Ourdefenseusesthewidthofthe uncer-tainty interval produced by theIVAP as a measureof conﬁdence in the prediction of the underlying model, where the prediction isrejectedincasethisintervalistoo wide.Theacceptable width isahyperparameterofthealgorithmwhich canbeestimated us-inga validationset.Theresultingalgorithmis nolonger vulnera-bletotheoriginaladversarialexamplesthatfooledtheunprotected modelandattackswhichtakethedetectorintoaccounthavevery limitedsuccess.Wedonotbelievethissuccesstobedueto gradi-entmasking, asthedefense remainsrobust evenwhen subjected toattacksthat do notsufferfromthisphenomenon. Ourmethod appearstobecompetitivetothedefenseproposedbyMadryetal. [27] ,whichisstate-of-the-artatthetimeofthiswriting.

Thealgorithm proposedhereislimitedto binaryclassiﬁcation problems.However,thereexistmethodstoextendtheIVAPtothe multiclasssetting [44] .Ourfuturework willfocusonusingthese techniques to generalize our detector to multiclass classiﬁcation problems.

Ofcourse, the mere fact that our own white-boxattack fails against theIVAP defense doesnot constitute hard proof that our methodissuccessful.Wethereforeinvitethecommunityto scruti-nizeourdefenseandtoattempttodevelopstrongerattacksagainst it.Ourcodeisavailableat https://github.com/saeyslab/binary-ivap . Theperformance oftheIVAPdefenseisstillnotidealatthisstage since clean accuracy is noticeably reduced. However, we believe these findings represent a significant step forward. As such, we suggest that the community further look into methods from the fieldofconformalpredictioninordertoachieveadversarial robust-ness at scale. To ourknowledge, we are the first to apply these methods tothis problem, although theidea has beenmentioned elsewherealready [64, Section 9.3, p133] .

DeclarationofCompetingInterest

Theauthorsdeclarethattheyhavenoknowncompeting ﬁnan-cialinterestsorpersonal relationships thatcould have appearedto inﬂuencetheworkreportedinthispaper.

(14)

Acknowledgements

WegratefullyacknowledgethesupportoftheNVIDIA corpora-tionwhoprovideduswithaTitanXpGPUtoconductour exper-imentsundertheGPUGrantProgram.JonathanPeckissponsored byaPh.D.fellowshipoftheResearchFoundation-Flanders(FWO). YvanSaeysisanISACMarylouIngramScholar.

AppendixA. DescriptionoftheNeuralNetworks

Inthissection,wegive detaileddescriptions oftheCNNsused in our experiments. These descriptions are Python code for the Kerasframework.Notethatweassumethefollowingimportshave beenloaded:

A1.Zeroesvsones

A2.Catsvsdogs

(15)

A3. T-shirtsvstrousers

A4. Airplanesvsautomobiles

References

[1] Y. LeCun , Y. Bengio , G. Hinton , Deep learning, Nature 521 (7553) (2015) 436 . [2] J. Schmidhuber , Deep learning in neural networks: an overview, Neural Netw.

61 (2015) 85–117 .

[3] J. Deng , W. Dong , R. Socher , L.-J. Li , K. Li , L. Fei-Fei , Imagenet: a large-scale hierarchical image database, in: Proceedings of the IEEE Conference on Com- puter Vision and Pattern Recognition, Ieee, 2009, pp. 248–255 .

[4] J. Platt , Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods, Adv. Large Margin Classif. 10 (3) (1999) 61–74 . [5] Y. Gal , Uncertainty in Deep Learning, University of Cambridge (2016) . [6] I.J. Goodfellow, J. Shlens, C. Szegedy, Explaining and harnessing adversarial ex-

amples, CoRR, abs/ 1412.6572 (2015).

[7] A. Nguyen , J. Yosinski , J. Clune , Deep neural networks are easily fooled: High conﬁdence predictions for unrecognizable images, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 427–436 . [8] G. Shafer , A. Gammerman , V. Vovk , Algorithmic Learning in a Random World,

Springer, 2005 .

[9] A. Gammerman , V. Vovk , Hedging predictions in machine learning, Comput. J. 50 (2) (2007) 151–163 .

[10] G. Shafer , V. Vovk , A tutorial on conformal prediction, J. Mach. Learn. Res. 9 (Mar) (2008) 371–421 .

[11] V. Vovk , I. Petej , V. Fedorova , Large-scale probabilistic predictors with and without guarantees of validity, in: Advances in Neural Information Processing Systems, 2015, pp. 892–900 .

[12] N. Dalvi , P. Domingos , Mausam , S. Sanghai , D. Verma , Adversarial classiﬁcation, in: Proceedings of the Tenth ACM SIGKDD International Conference on Knowl- edge Discovery and Data Mining, ACM, 2004, pp. 99–108 .

[13] M. Barreno , B. Nelson , R. Sears , A.D. Joseph , J.D. Tygar , Can machine learning be secure? in: Proceedings of the ACM Symposium on Information, Computer and Communications Security, ACM, 2006, pp. 16–25 .

[14] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, R. Fer- gus, Intriguing properties of neural networks, arXiv: 1312.6199 (2013). [15] B. Biggio , F. Roli , Wild patterns: Ten years after the rise of adversarial machine

learning, Pattern Recognit. 84 (2018) 317–331 .

[16] N. Carlini , D. Wagner , Towards evaluating the robustness of neural networks, in: Proceedings of the IEEE Symposium on Security and Privacy (SP), IEEE, 2017, pp. 39–57 .

[17] N. Papernot , P. McDaniel , I. Goodfellow , S. Jha , Z.B. Celik , A. Swami , Practical black-box attacks against machine learning, in: Proceedings of the ACM Asia Conference on Computer and Communications Security, in: ASIA CCS ’17, ACM, 2017, pp. 506–519 .

[18] Z. Gong, W. Wang, W.-S. Ku, Adversarial and clean data are not twins, arXiv: 1704.04960 (2017).