A Study on Differential Private Online Learning

(1)

http://www.scirp.org/journal/jcc ISSN Online: 2327-5227

ISSN Print: 2327-5219

A Study on Differential Private

Online Learning

Weilin Nie, Cheng Wang

Huizhou University, Huizhou, China

Abstract

Online learning algorithms are very attractive, in which iterations are applied efficiently instead of solving some optimization problems. In this paper, on-line learning with protecting privacy is considered. A perturbation term is added into the classical online algorithms to obtain the differential privacy property. Firstly the distribution for the perturbation term is deduced, and then an error analysis for the new algorithms is performed, which shows the convergence and learning rate. From the error analysis, a choice for the para-meters for differential privacy can be found theoretically.

Keywords

Online Learning, Differential Privacy, Output Perturbation, Error Decomposition, Learning Rate

1. Introduction

Online learning is widely used recently in computer sciences, due to its effi- ciency in calculation and well theoretical results. Compared with the classical batch learning in learning theory, online algorithms update the output only according to the last sample point. So such algorithms are very effective to

handle the practical problems and have been studied in [1] [2] [3] [4] [5] and

etc. However, as the technologic development of data analysis, there are risks for applying such algorithms on a big data set. A commonly used notion for mea-

suring the risk is differential privacy [6]. Little references on this topic can be

found except for [7]. There the authors conducted an analysis for online convex

programming. Choice for the parameters of differential privacy and utilities analysis are presented for algorithms such as implicit gradient descent and genera- lized infinitesimal gradient descent. In this paper, the line of work begins with

[8] is considered which can be thought as a kernel online learning algorithm.

How to cite this paper: Nie, W.L. and Wang, C. (2017) A Study on Differential Private Online Learning. Journal of Com-puter and Communications, 5, 28-33.

https://doi.org/10.4236/jcc.2017.52004

Received: December 18, 2016 Accepted: February 18, 2017 Published: February 21, 2017

http://creativecommons.org/licenses/by/4.0/

(2)

2. Fundamental Principles

Our setting for online learning is introduced as follow. Let the input space X

be a compact metric space, and output Y∈ −

[

M M,

]

for some M >0 as a

regression problem. Denote Z:= ×X Y as the sample space. Assume there is a

probability measure ρ_on Z, which can be decomposed to marginal dis-

tribution ρX on X and conditional distribution ρ

( )

y x on Y at x∈X .

Then the regression function is defined by

( )

d

Y

fρ =

∫

y ρ y x (1)

which is indeed the conditional expectation of y given x. The regression fun-

ction minimizes the least square generalization error (see [9] for more details)

( )

(

( )

)

2

: d

Z

f =

∫

f x −y ρ

 (2)

So learning algorithms always aim to approximate the regression function

based on samples

{

zt =

(

x yt, t

)

}

_t₌_0,1,2,_, which are drawn independently from

distribution ρ. Let K X: ×X →R be a Mercer Kernel, and K is the in-

duced reproducing kernel Hilbert space (RKHS, [10]), i.e., the completion of

{

}

span Kx:x∈X where Kx

( )

x′ =K x x

(

, ′

)

for any x x, ′∈X with respect to

the inner product

,

(

,

)

K

x x K x x

K K

_′ = ′ . The corresponding norm in _K is

denoted as ⋅K. Now our online learning algorithm as

( )

(

)

1 , 0,1, 2,

t t t t t t x_t t t

f₊ = f −η _ f x −y K +λ f _ t=  (3)

with f0 =0. Here η >t 0 is the step size and λ >t 0 is the regularization

parameter.

When applying this online algorithm on private data set, it may leak some

sensitive information. To deal with this privacy problem, Dwork et al. introdu-

ced differential privacy in [11]. Which can be described as follow. For the sample

space Z introduced above, the Hamming distance between two sample sets

{

1, 2

}

m

z z ∈Z is

(

1 2

)

{

1, 2,

}

d z z, =# i=1,,m z: i≠z i (4)

Definition 1 A random algorithm : m Range

( )

A Z → A is -differential private if for every two data sets z z1, 2 satisfying d

(

z z1, 2

)

=1, and every set

( )

(

1

)

(

( )

2

)

Range A z Range A z

∈ ∩

 , there holds

( )

{

1

}

{

( )

2

}

Pr A z ∈ ≤ ⋅e Pr A z ∈ (5)

To endow our online algorithm the differential privacy property, a perturbation term is added into the output of (3), that is,

,

t t t

f _ = f +b (6)

where bt takes value in R with distribution to be determined in following

analysis.

Differential private online learning has already been studied in [7], there the

(3)

Here our algorithm is different, which is based on the Mercer kernels. Our pur-

pose in this paper is to firstly provide the explicit density function for b and

then conduct an error analysis for (6), which reveals the learning rate.

3. Differential Privacy Analysis

In this section, a detail analysis for the perturbation term bt in algorithm (6)

will be conducted. Firstly recall the useful definition of sensitivity and lemma

proposed in [11].

Definition 2 denote ∆ft as the maximum infinite norm of difference betw-

een the outputs when changing the last sample point in z. Let

{

(

)

}

0

, t

i i _i

z= x y ₌

and z=

{

(

x y1, 1

) (

, x y2, 2

)

,,

(

xt−1,yt−1

) (

, xt,yt

)

}

, ft and ft derived from (3)

accordingly, it is clear that

,

: sup

t t t

z z

f f f _∞

∆ = − ₍₇₎

Then a similar result to [11] is:

Lemma 1 Assume ∆ft is bounded by Ct >0, and bt has density function

proportion to exp

t b C   ₋        

, then algorithm (6) provides -differential privacy.

Proof. For all possible output function r, and z z, differ in last element, then

{

,

}

{

}

Pr Pr exp

t

t t t

b

t r f

f r b r f

C  −  = = = − ∝ −      (8) and

{

,

}

{

}

Pr Pr exp

t

t t t

b

t r f

f r b r f

C

 − 

= = = − ∝ − 

 



 ₍₉₎

So by triangle inequality,

{

,

}

{

,

}

{

,

}

Pr Pr Pr

f_t f_t Ct

t t t

f r f r e e f r

−

= ≤ = × ≤ =



   (10)

Then the lemma is proved by a union bound. 

It is obvious that if finding the upper bound for ∆ft, the distribution for bt

can be derived. Set t 1

(

t t0

)

θ

η = + and

(

)

1

0

1

t t t

θ

λ −

= + for some t0 >0 and

0< <θ 1. Moreover, denote κ=supx x, ′∈XK x x

(

, ′

)

(as K is Mercer Kernel

on compact metric space X). The next lemma is taken from [1] to bound f_{t K}.

Lemma 2 If 2

0 1

tθ ≥κ + , then for all t∈, there holds

t K t

M

f κ

λ

≤ ₍₁₁₎

Now the main result for differential privacy for algorithm (6) follows.

Theorem 1 When choosing t 1

(

t t0

)

θ

η = + and

(

)

1 0

1

t t t

θ

λ −

= + for some

1 1 2< <θ and

2 0 1

tθ ≥κ + , let the density function of bt is 1 exp t b C α            with

2Ct

(4)

(

)

(

)

2 2 1 0 2 1 1 t M C

t t θ

κ κ

−

+ =

− + (12)

then the algorithm (6) provides _{-differential privacy.}

Proof. From (3) there holds

( )

(

)

1 1 1 1 1 ₁ 1 1

t t t t t t x_t t t

f = f₋ −η₋ _ f₋ x₋ −y₋ K ₋ +λ₋ f₋ _ (13)

and

(

)

(

)

1 1 1 1 1 ₁ 1 1

t t t t t t x_t t t f f₋ η₋ f₋ x₋ y₋ K λ₋ f₋

−

 

= − _ − + _₍₁₄₎

Then

(

)

(

)

(

( )

)

1 1 1 1 ₁ 1 1 1 ₁

t t t t t t x_t t t t x_t

f −f =η₋ _ f₋ x₋ −y ₋ K ₋ − f₋ x₋ −y₋ K ₋ _ (15)

From the above lemma 1

1 t K t M f κ λ − −

≤ _{for all}t. By the reproducing proper-

ty that f x

( )

= f K, x K ≤ f K Kx K ≤κ f K (see [9]),

2 1

1

2

t t K t t

M

f f η κ M κ

λ

− −

 

− ≤ _ + _

  (16)

Therefore

(

)

(

)

2 2 2 1 , 0 2 1 sup 1

t t t

z z

M

f f f

t t θ

κ κ

− ∞

+ ∆ = − ≤

− + (17)

Set Bt to be the right hand side in lemma 1 then the theorem is proved. 

4. Error Analysis

In this section, fρ∈K is assumed for simplicity. It will be shown that ft

obtained from (6) still converge to regression function f_ρ by choosing appro-

priate parameter  under the choice of ηt and λt as in the theorem in the

last section. To this end, an error decomposition is needed. Denote operators

:

t K K

L  → as Lt

( )

f = f x K

( )

t x_t for t=0,1, 2,, and I as the identity

operator. It is easy to verify that 2

t

L ≤

κ

. Notice that f_ρ∈_K, the following

decomposition can be deduced:

(

)(

)

(

)

1

t t t t t t t t x_t t t

f₊ − f_ρ = I−η λI−ηL f − f_ρ +η _y K − λI+L f_ρ_ (18)

(

)

1

(

1

)

1

t t t t t t t t

A f f_ρ B A A ₋ f₋ f_ρ B₋ B

= − + = _ − + _+ =_₍₁₉₎

(

)

[

]

1 0 0 1 1 1 0

t t t t t t t

A A− A f fρ B A B− A A− A B

=  − + + + +  (20)

Here At = −I η λt tI−ηtLt and Bt =ηty Kt x_t −

(

λtI+Lt

)

fρ. In the follow-

ing the first term is called initial error and second one is sample error. The initial

error is easy to bound from the analysis above. Since t0 is such that

2 0 1

tθ ≥κ + , A_t is a positive operator with At ≤ −1 η λt t = + −

(

t t0 1

) (

t+t0

)

.

1 0 0 1 0 0

0 0 0 0 0 0 ( ) 1 .

t t _K t t _K

t

j _K

K

A A A f f A A A f f

j t f f j t t f t t ρ ρ ρ ρ − − = − ≤ ⋅ − + −

≤ Π −

+

= +

(5)

For the sample error, it is more difficult and the Pinelis-Bernstein inequality

[4] will be applied.

Lemma 3 Let ξi be a martingale difference sequence in a Hilbert space. Sup-

pose that almost surely ξ ≤i B and

2 2 1

1

t

i i t

i= − ξ ≤σ

∑

 for some constants

, t 0, 1, 2,

Bσ > t= . Then for any 0< <δ 1, with probability at least 1−δ ,

there holds 1 2 2 ln 3 t i t i B ξ σ δ =     ≤ _ + _{  }    

∑

(21)

Now the error bounds for sample error can be derived. Notice that

(

2

)

(

2 ₁

)

t K t t K t K

B ≤η _Mκ+ κ +λ f_ρ _≤η _Mκ+ κ + f_ρ _. Set

1 1, 1, 2, ,

i A At t A Bi i i t

ξ = ₋  ₋ =  , then

(

2

)

0

1 1 1

0

1

i K t t i i K i K

i t

A A A B M f

t t ρ

ξ _≤ ₋ ₋ _≤ + − η₋  κ₊ κ ₊ 

 

+

 (22)

(

2

)

(

2

)

0 1 0

1 1 1 1

1 1

K K

i t

M f M f

t t λ− κ κ ρ t t λ κ κ ρ

   

= _ + + _≤ _ + + _

+ + (23)

(

2 ₁

)

_.

t M fρ _K

η  κ κ 

= _ + + _ (24)

And 2 2

(

2

)

1

1 1

t

i i t

i= − X K ≤tη Mκ+ κ + fρ K

∑

 . So for any 0< <δ 1, with

probability at least 1−δ ,

(

)

(

)

2 1 2 1 2 8 2 1 ln 3 8 2 1 ln 3 t

i t _K

i K

K

t M f

M f

t

ρ

ρ θ

ξ η κ κ

δ κ κ δ = −     ≤ _ + + _{ }        ≤ _ + + _{ }   

∑

(25)

Note that

(

2 ₁

)

t K t K

B ≤η _Mκ+ κ + f_ρ _, hence

(

2

)

1 1 1 0 1 2

11 2

1 ln

3

t t t t t K K

B A B A A A B M f

tθ κ κ ρ δ

− − −

 

 

+ + + ≤ _ + + _{ }

  

  (26)

Combining the initial error, sample error bounds and applying Markov in-

equality for the fact that bt+1 =Ct+1 , the total error estimation is obtained.

Theorem 2 Choose η λt, t and bt as in the theorem in the last section, with

confidence 1−δ

(

0< <δ 1

)

, there holds

1, 1 2

1 4

t _K

f f C

t

ρ θ _δ

+ − ≤  − (27)

where constant

(

2

)

(

2

)

0

11

4 1 1

3

K K

C_ = κ + κM +t f_ρ + κM+ κ + f_ρ .

5. Conclusion

In this paper, analysis is performed for the differential privacy (Theorem 1) and generalization property (Theorem 2) for the online differential private learning algorithm (6). Under the choice of parameters in our theorems, the algorithm

(6) can provide -differential privacy and keep learning rate close to 1 2, for

any >0. However, this error bound is not satisfactory enough. It might be an

(6)

future work.

Fund

This work is supported by NSFC (Nos. 11326096, 11401247), NSF of Guangdong Province in China (No. 2015A030313674), Foundation for Distinguished Young Talents in Higher Education of Guangdong, China (No. 2013LYM_0089), Na-tional Social Science Fund in China (No. 15BTJ024), Planning Fund Project of Humanities and Social Science Research in Chinese Ministry of Education (No. 14YJAZH040) and Doctor Grants of Huizhou University (No. C511.0206).

References

[1] Ying, Y. and Zhou, D.X. (2006) Online Regularized Classification Algorithms. IEEE

Transactions on Information Theory, 11, 4775-4788.

https://doi.org/10.1109/TIT.2006.883632

[2] Smale, S. and Zhou, D.X. (2009) Online Learning with Markov Sampling. Analysis

and Applications, 7, 87-113. https://doi.org/10.1142/S0219530509001293

[3] Guo, Z.C. and Wang, C. (2011) Online Regression with Unbounded Sampling.

In-ternational Journal of Computer Mathematics, 88, 2936-2941.

https://doi.org/10.1080/00207160.2011.587510

[4] Tarrés, P. and Yao, Y. (2014) Online Learning as Stochastic Approximations of Re-gularization Paths. IEEE Transactions on Information Theory, 60, 5716-5735. [5] Ying, Y. and Zhou, D.X. (2015) Online Pairwise Learning Algorithms with Kernels.

Computer Science, 10, 441-449.

[6] Dwork, C., McSherry, F., Nissim, K. and Smith, A. (2006) Calibrating Noise to Sen-sitivity in Private Data Analysis. Theory of Cryptography, Springer, 265-284.

https://doi.org/10.1007/11681878_14

[7] Jain, P., Kothari, P. and Thakurta, A. (2012) Differential Private Online Learning.

JMLR: Workshop and Conference Proceedings, 2012, 1-34.

[8] Smale, S. and Yao, Y. (2006) Online Learning Algorithms. Foundations of

Compu-tational Mathematics, 6, 145-170. https://doi.org/10.1007/s10208-004-0160-z

[9] Cucker, F. and Smale, S. (2002) On the Mathematical Foundations of Learning.

Bulletin of the American Mathematical Society, 39, 1-49.

https://doi.org/10.1090/S0273-0979-01-00923-5

[10] Aronszajn, N. (1950) Theory of Reproducing Kernels. Transactions of the American

Mathematical Society, 68, 337-404.

https://doi.org/10.1090/S0002-9947-1950-0051437-7

[11] Dwork, C. (2006) Differential Privacy. ICALP, Springer, 1-12.

(7)

Submit or recommend next manuscript to SCIRP and we will provide best service for you:

Accepting pre-submission inquiries through Email, Facebook, LinkedIn, Twitter, etc. A wide selection of journals (inclusive of 9 subjects, more than 200 journals)

Providing 24-hour high-quality service User-friendly online submission system Fair and swift peer-review system

Efficient typesetting and proofreading procedure

Display of the result of downloads and visits, as well as the number of cited articles Maximum dissemination of your research work

Submit your manuscript at: http://papersubmission.scirp.org/