The Parzen Window method : in terms of two vectors and one matrix

(1)

Contents lists available atScienceDirect

Pattern Recognition Letters

journal homepage:www.elsevier.com/locate/patrec

The Parzen Window method: In terms of two vectors and one matrix

Hamse Y. Mussa

a,b,∗

_{, John B.O. Mitchell}

a

_{, Avid M. Afzal}

b

a_{EaStCHEM School of Chemistry and Biomedical Sciences Research Complex, University of St Andrews, North Haugh, St Andrews KY16 9ST, Scotland, UK} b_{Centre for Molecular Informatics, Department of Chemistry, University of Cambridge, Lensﬁeld Road, Cambridge CB2 1EW, England, UK}

a r t i c l e

i n f o

Article history:

Received 15 February 2015 Available online 18 June 2015

Keywords:

Probability density function Kernel functions Parzen Window

a b s t r a c t

Pattern classification methods assign an object to one of several predefined classes/categories based on fea-tures extracted from observed attributes of the object (pattern). WhenLdiscriminatory features for the pat-tern can be accurately determined, the patpat-tern classification problem presents no difficulty. However, precise identification of the relevant features for a classification algorithm (classifier) to be able to categorize real world patterns without errors is generally infeasible. In this case, the pattern classification problem is often cast as devising a classifier that minimizes the misclassification rate. One way of doing this is to consider both the pattern attributes and its class label as random variables, estimate the posterior class probabilities for a given pattern and then assign the pattern to the class/category for which the posterior class probability value estimated is maximum. More often than not, the form of the posterior class probabilities is unknown.

The so-called Parzen Window approach is widely employed to estimate class-conditional probability (class-speciﬁc probability) densities for a given pattern. These probability densities can then be utilized to estimate the appropriate posterior class probabilities for that pattern. However, the Parzen Window scheme can be-come computationally impractical when the size of the training dataset is in the tens of thousands andL is also large (a few hundred or more). Over the years, various schemes have been suggested to ameliorate the computational drawback of the Parzen Window approach, but the problem still remains outstanding and unresolved.

In this paper, we revisit the Parzen Window technique and introduce a novel approach that may circumvent the aforementioned computational bottleneck. The current paper presents the mathematical aspect of our idea. Practical realizations of the proposed scheme will be given elsewhere.

1. Introduction

In mathematical pattern recognition, the problem of pattern clas-sification entails assigning an object – based on a number of spe-cific features of the object – to one of a finite set of predefined classes/categories

ωj

, wherej=1, 2,…,J, withJbeing the number of classes/categories of interest. Typically the object (or simply the pat-tern) is represented by anL-dimensional vectorxwhose elements,

(

x1,x2, . . . ,xL),are values assumed to contain the appropriate infor-mation about the speciﬁc pattern features utilized to accurately clas-sify the pattern represented byx.

WhenLdiscriminatory features for a pattern can be determined accurately, the pattern classiﬁcation problem presents no diﬃculty:

∗ _{Corresponding author at: EaStCHEM School of Chemistry and Biomedical Sciences} Research Complex, University of St Andrews, North Haugh, St Andrews KY16 9ST, Scotland, UK. Tel.:+44 7951519416; fax: +44 1223763076.

E-mail address:[email protected](H.Y. Mussa).

it reduces to a simple look-up table scheme. However, identifying the relevant features to classify realistic patterns without classification errors is generally impossible. Thus, the pattern classification prob-lem is often cast as the task of finding a classifier that minimizes the misclassification rate[1]. One popular way of achieving this objective is to treat both the pattern vectorxand the class label

ωj

as ran-dom variables. In this case, the posterior class probabilitiesp(

ωj

|x) for a given patternxis computed; then patternxis assigned to the class, for which thep(

ωj

|x) value is maximum[1–6]. (In the last step it is being assumed that all misclassiﬁcation errors are equally bad

[1,3,4].)

However, in practice, the form of the functionp(

ωj

|x) is unknown; instead,Npatternsx_i and their corresponding correct class labels yi∈

{

ω

1,

ω

2, . . . ,

ωJ

}

– i.e.,D=

{

(

xi,yi)

}

Ni=1,assumed to constitute a representative dataset of the joint probability density functionp(

ωj

, x) for

ωj

andx– are usually available. It is from these prototype pat-ternsxiand their corresponding class labelsyithat one tries to esti-matep(

ωj

|x).

http://dx.doi.org/10.1016/j.patrec.2015.06.002

(2)

According to basic probability rules[1–7],

p

(ω

j

|

x

)

p

(

x

)

=p

(ω

j,x

)

=p

(

x

|

ω

j

)

p

(ω

j

)

(1)

These rules allow one to modularize the estimation problem and es-timatep(

ωj

|x) (and, of course,p(x)) in terms ofp(x|

ωj

) andp(

ωj

):

p

(ω

j

|

x

)

=

p

(

x

|

ω

j

)

p

(ω

j

)

p

(

x

)

, (2)

whereby we may have a better chance of being able to estimate p(x|

ωj

) andp(

ωj

) fromDthan estimatingp(

ωj

|x) directly fromD. In the denominator,p

(

x

)

=J

j=1p

(

x

|

ω

j)p

(ωj)

.

In the Bayesian statistics framework,p(

ωj

) is referred to as the class prior probability, which is the probability that a member of class

ωj

will occur. The functionp(x|

ωj

) is called the class-conditional probability density function, i.e. the probability density of observ-ing patternxgiven thatxis a member of class

ωj

. The denominator term on the right hand side ofEq. 2is often called the “evidence” or “marginal likelihood”. For the purpose of this paper we can afford to simply view this term as a normalization factor.

If there is evidence that the number of prototype patterns per class is an indication of the importance of that class, then a sensible ap-proximation ofp(

ωj

) can be

p

(ω

j

)

=

N

i=1

ν

ij

N , (3)

where

ν

i

j=1 if theith prototypexibelongs to class

ωj

, otherwise

ν

i

j=0; andNis as described before. Nonetheless,p(

ωj

) is typically assumed to be uniform, i.e.,p

(ωj)

=1

J,whereJis as deﬁned before. Estimatingp(x|

ωj

) fromDis not straightforward[1–5]. In the last half-century, a plethora of methods have been proposed for estimat-ingp(x|

ωj

) based onD,the so-called training set. There are ample excellent reviews and text books on this topic; for example, the two books – one by Hand[4]and the other by Murphy[8]– give adequate and accessible descriptions of the bulk of these approaches devised in recent (and not so recent) years.

In this paper we are concerned with one particular approach that is widely thought to be apropos to the task of estimatingp(x|

ωj

) from a representative training dataset: the so-called Parzen Window method[1,2,4,9,10], also known as Parzen estimator, Potential func-tion technique[10], to name but a few.

In the preceding discussion and in the rest of the paper, the terms “class”, “label”, “class label” and “category” are employed inter-changeably. For notational simplicity we usex,xiand

ωj

both as ran-dom variables and their realizations. Furthermore we follow (in line with the current trend in machine learning and statistics) the conve-nient – but not necessarily correct – practice of using the term “den-sity” for both a discrete random variable’s probability function and for the probability function of a continuous random variable. An im-plicit assumption being made throughout the paper is that all spaces, matrices, vectors, functions and variables (discrete or not),etc., are real.

2. Literature review

The Parzen Window approach can approximatep(x|

ωj

) by a sim-ple formula[1,2,10]:

ˆ

p

(

x

|

ω

j

)

= 1

N

i=1

ν

ij N

i=1

K

(

x,xi;

λ)ν

ij, (4)

wherexi∈Ddenotes prototype patterns and

ν

ijis as deﬁned before. K(x,xi;

λ

) – commonly known as the kernel function – is a two-variable function with speciﬁc properties, which are abundantly cov-ered in the statistical pattern recognition literature[1,2,9,10]. At any

rate, it might be helpful to think ofK(x,xi;

λ

) as a measure of simi-larity returning how similar patternsxandxiare,

λ

being a tunable (smoothing) parameter. In other words, the kernel function peaks at x₌x_iand decays away elsewhere; the

λ

parameter,inter alia, has an important role in determining the rate of the decay.

Eq. 4indicates that ˆp

(

x

|

ω

j)is formed from the superposition of kernel functionK(x_i,x;

λ

) values at the given prototype patternsx_i for class

ωj

. The Parzen Window method is powerful in the sense that, with enough representative data points (prototypes/references), its estimate of the class conditional probability density converges top(x|

ωj

) (see Ref.[1], Chapter 4). AlthoughEq. 4is conceptually simple and capable of providing a good estimate ofp(x|

ωj

), it can computationally suffer from the requirements that all the proto-types/referencesxifor class

ωj

must be retained in main-memory to compute ˆp

(

x

|

ωj)

,the estimate ofp(x|

ωj

). Furthermore, considerable CPU-time may be required each time this method is used to estimate p(x|

ωj

) to classify a novel pattern. The fact that the size of the refer-ence pattern vectorsxican be easily in the hundreds (or more) may exacerbate the main-memory and CPU-time requirements.

Over the years, various schemes have been developed to address the computational drawback of this otherwise elegant and power-ful method. For example, one of these schemes entails – see Ref.[1]

(Chapter 4), Ref.[4](Chapter 2) and Ref.[10](Chapters 6) for detailed technical and practical discussions – expressing the kernel function as a ﬁnite series expansion

K

(

x,xi;

λ)

= M

m=1

φ

m

(

x

)φ

m

(

xi

)

, (5)

which renders the right hand side ofEq. 4

1

N

i=1

ν

ij N

i=1

K

(

x,xi;

λ)ν

ij= 1

N

i=1

ν

ij N

i=1

ν

i j

M

m=1

φ

m

(

x

)φ

m

(

xi

),

(6)

with

{

φm

}

M

m=1 being appropriate basis functions (not necessarily polynomials) deﬁned in the feature space in which the pattern vec-torsxandxireside.

FromEqs. 4and6, we have

ˆ

p

(

x

|

ω

j

)

= M

m=1 cm_ω

j

φ

m

(

x

),

(7)

where

cm_ω

j =

1

N

i=1

ν

ij N

i=1

ν

i j

φ

m

(

xi

)

,

with

ν

i

j=1 if patternxibelongs to class

ωj

, otherwise

ν

ij=0. This scheme certainly removes the reference patterns’ storage problem. However, it can create a computational problem of its own, in particular when bothMandLare large, which is often the case in real world classiﬁcation problems. ComputingMbasis functions ofL variables to classify a new patternxis not a trivial computational task

[1–4,10–12].

Another approach – albeit a particular case of the scheme above – is that proposed by Specht[13]. It was based on a Taylor series ex-pansion of

ρ

(x,x_i;

λ

) (seeEq. 8), such that anrth order polynomial in Lvariables was required to estimate and store

(

L+_Lr

)

terms to approxi-mate ˆp

(

x

|

ω

j)[10,13,14]. For short but “insightful” descriptions of the relationship between an appropriate value ofrand the smoothing pa-rameter

λ

, see Ref.[1](Chapter 4) and Ref.[14](Chapter 4). In princi-ple, Specht’s scheme has a strong appeal of simplicity providing the number of terms required in the Taylor series can be held to a practi-cal limit. Unfortunately, bothrandLcan be large in current realistic classiﬁcation problems[1,4,10].

(3)

Thus, the motivation for this paper is to introduce yet another scheme that might be able to circumvent the aforementioned com-putational bottleneck problem, while retaining the estimation power and conceptual simplicity of the Parzen Window method. This work is conﬁned to Parzen Window based approaches, in which the kernel function is – or can be expressed – in the form

K

(

x,xi;

λ)

=A f

(

x;

λ)ρ(

x,xi;

λ)

f

(

xi;

λ),

(8)

whereA_>0;f(x;

λ

) andf(x_i;

λ

) are any real functions deﬁned in the feature space;

ρ

(x,xi;

λ

) is a polynomial inxandxi; and

λ

is as de-ﬁned before. The kernel functions that are or can be written in the form above are ubiquitous nowadays in data analysis[4,6,8]. They have most often been successfully applied to discrete data; for this reason we decided to conﬁne attention to the discrete case. For illus-trative purposes, we focus on binary data, i.e.,xl=0 or 1 denoting absence or presence of featurex_l in the pattern vector, respectively. That is to say, bothx(test pattern vector) andxi(reference/prototype pattern vector) reside in a binary feature spaceX=

{

0,1

}

L_.

The extension of the proposed scheme to continuous data – i.e., X=RL_{– is straightforward.}

One ﬁnal, but important remark is that Specht’s approach and our proposed scheme are arguably similar in spirit. However there are crucial differences: unlike Specht’s formulation, our scheme does not estimate

(

L+_Lr

)

terms, it does not retain

(

L+_Lr

)

terms in main-memory, nor does the variablerfeature in the ﬁnal form of our algorithm – instead, in our case, only twoL-dimensional vectors and oneL-by-L matrix are required to retain in main-memory. The two vectors and the matrix can notably be highly sparse. We will brieﬂy expound on this assertion shortly.

3. Proposed method and discrete Parzen Window approach

As a concrete example, we use the widely utilized kernel func-tion (albeit in cheminformatics[15,16], and references therein) intro-duced by Aitchison and Aitken (AA–kernel)[17,18],

K

(

x,xi;

λ)

=

λ

L

₁₋

_λ

λ

d(x,xi)

, (9)

where 0.5<

λ

<1 andd(x,xi) denotes the number of components in whichxandxidisagree. This dissimilarity measured(x,xi) can be conveniently expressed as[4]

d

(

x,xi

)

=xTx−2xTxi+xTixi (10)

In passing, the AA–kernel is basically a discrete analogue of an isotropic Gaussian kernel[17,18].

FromEqs. 9and10, and the fact thatelnw₌_w_,_{we have}

K

(

x,xi;

λ)

=

λ

Leln( 1−λ

λ )(xTx+xTixi)×e−2 ln(

1−λ λ )xTxi =

λ

L_e−α(xT_x₊_xT

ixi)×e2αxTxi

=

λ

L_e−α(xT_x₎ ×e2αxT_x

i×_e−α(xTixi), (11)

where

α

₌ln

(

₁λ

−λ

)

. The terme2αx

T_x

ican be written as

e2αxT_x i =

∞

r=0

(

2

α)

r

r!

(

x T_x

i

)

r= ∞

r=0

γ

r

(

xTxi

)

r, (12)

where

γ

r=(2α) r r! .

InsertingEq. 12intoEq. 11yields

K

(

x,xi;

λ)

=

λ

Le−α(x

T_x₎

_∞

r=0

γ

r

(

xTxi

)

r

e−α(xT

ixi) ₍₁₃₎

(cf.Eq. 8).

FromEqs. 13and4, we have

ˆ

p

(

x

|

ω

j

)

=

λ

L_e−α(xT_x₎

N

i=1

ν

ij N

i=1

ν

i je−α(x

T ixi)

∞

r=0

γ

r

(

xTxi

)

r (14)

Now we come to the main contribution of this paper: removing the requirement for retaining all the reference/prototype patterns for class

ωj

in main-memory to compute ˆp

(

x

|

ωj)

in order to es-timate ˆp

(ωj

|

x

)

to classify a new pattern x. However, ﬁrst we sim-plify the notation by deﬁning these variablesN_ω_j,a,zandz, which will be consistently used throughout, as follows:N_ω_j=N

i=1

ν

ij;

βi

= e−α(xTixi)_;_a₌Nωj

i=1

βi

;z= N_ω

j

i=1xi; andz=

N_ω j

i=1

βi

xi, where Nωj refers to the number of patterns in the training dataset that belong to class

ωj

;N,

α

and

ν

i

jare as described before. In our new notation, Eq. 14becomes

ˆ

p

(

x

|

ω

j

)

=B

γ

0

Nωj

i=1

β

i+

∞

r=1

γ

r Nωj

i=1

β

i

(

xTxi

)

r

=B

γ

0a+

∞

r=1

γ

r Nωj

i=1

(

xT

_(β

ixi

))(

xTxi

)

r−1

, (15)

whereB₌λLe−α(xTx)

N_ω j

.

The main contribution of the paper is formulating Eq. 15 in terms of

γ

r,a,z,zand anL-by-Lmatrix,Qwhich will be deﬁned shortly. The task of this formulation basically amounts to expressing N_ω

j

i=1

(

xT

(βi

xi))(xTxi)r−1 in terms ofz,z andQ. In doing this, we hope to ameliorate the computational drawback of the Parzen Win-dow method based on kernel functions in the form given inEq. 8.

4. Results

When r=1, the task is trivial: Nωj

i=1

(

xT

(βi

xi))(xTxi)r−1 reduces to

N_ω_j

i=1 xT

_(β

ixi

)

=xTz (16)

where, by deﬁnition (seeSection 3),z=Nωj i=1

(βi

xi).

However, whenr> 1, the task can be taxing. To this end, we make use of a simple – but useful – proposition (Proposition 1, whose proof is provided inAppendices AandB) to demonstrate that N_ω

j

i=1

(

xT

(βi

xi))(xTxi)r−1forr>1 can be written as (seeEq. 19in Appendix C)

Nωj

i=1

(

xT

(β

ixi

))(

xTxi

)

r−1=

(

xTQx

)(

xTz

)

r−2, (17)

whereQis just anL-by-Lmatrix, (seeEq. 19). InsertingEqs. 16and17intoEq. 15results in

ˆ

p

(

x

|

ω

j

)

=B

γ

0a+

γ

1

(

xTz

)

+

(

xTQx

)

∞

r=2

γ

r

(

xTz

)

r−2

=B

γ

0a+

γ

1

(

xTz

)

+

4

α

2

(

_eμ₋

μ

₋₁

)

μ

2

(

x

T_Qx

₎

,

(18)

where B=λL_e−α(xTx)

N_ω j

; ∞_r₌₂

γ

r

(

xTz

)

r−2=4α

2₍_eμ₋_μ₋₁₎

μ2 , with

μ

= 2

α(

xT_z

₎

_,

_γ

r=(2α) r

r! ,

α

=ln

(

1−λλ

)

and 0.5<

λ

<1

Evidently,Eq. 18illustrates that it is not necessary to retain all ref-erence/prototype patterns for a given class in main-memory to com-pute the value of ˆp

(

x

|

ω

j)for a test patternx; instead, all that is re-quired is anL-by-LmatrixQand twoLsize vectors (zandz), which are computed once and then retained in main-memory. This was the objective we set out to achieve in this paper.

(4)

value ofLis large. The fundamental reason for this sparsity is that in a high-dimensional reference pattern vectorxithere is the potential of many of its components being zero – i.e., many of the features are very likely to be absent inx_i.

In passing, if we are dealing with continuous data, wherebyxi∈ RL_,_{the vectors}_z,_z_{and matrix}_Q_{could be dense. Nonetheless, storing} Qcan still be computationally cheaper than retainingN_ω_jreference patternsxiper class in main-memory – providing thatNωj>L.

The current paper presents the mathematical aspect of our idea. Practical realizations of the proposed scheme will be given elsewhere.

5. Conclusion

The Parzen Window method is a powerful tool for estimating class conditional probability density functions. However, it can suffer from a severe computational bottleneck when the training dataset is large. Over the years, attempts have been made to rectify this computa-tional drawback of the method. To the best of our knowledge the is-sue has remained unsolved. In this paper we have proposed a novel scheme, which we hope contributes to our endeavor of alleviating the computational bottleneck from which the Parzen Window algorithm suffers when the training dataset is large.

Acknowledgments

We thank the BBSRC for funding this research through Grant

BB/I00596X/1. JBOM thanks the Scottish Universities Life Sciences Al-liance (SULSA) for ﬁn ancial support.

Appendix A

Proposition 1. Let ai and bi be real variables ∈ [0, ∞). Then the product of

(

n_i₌₁bi)(n_i₌₁ai)q _{can be given as}n

i=1bi(ai)q+ q

j=1

(

n

i=1ai)j−1

i=1aini₌ibi

(

ai

)

q−j,wherei=1,2, . . . ,nwith

nandq∈Z+_.

Proof. Let us suppose, with no loss of generality, that n=4 and q=3. In this scenario, we have the product

(

b1+b2+b3+b4

)

×

(

a1+a2+a3+a4

)

×

(

a1+a2+a3+a4

)

×

(

a1+a2+a3+a4

)

. For the sake of clarity, we write the product as

(

d1+d2+d3+ d4

)

×

(

c1+c2+c3+c4

)

×

(

a1+a2+a3+a4

)

×

(

b1+b2+b3+b4

)

, whereai=ci=di,withibeing 1, 2, 3, 4. To labour the obvious, we express

(

d1+d2+d3+d4

)

×

(

c1+c2+c3+c4

)

×

(

a1+a2+a3+ a4

)

×

(

b1+b2+b3+b4

)

as follows

(

d1+d2+d3+d4

)

×

(

c1+c2+c3+c4

)

×

(

a1+a2+a3+a4

)

×

(

b1+b2+b3+b4

)

=b1a1c1d1+b2a2c2d2+b3a3c3d3+b4a4c4d4

+d1

(

b2a2c2+b3a3c3+b4a4c4

)

+d2

(

)

+d3

(

)

+d4

(

)

+d1

(

c1

(

b2a2+b3a3+b4a4

)

+c2

(

b1a1+b3a3+b4a4

)

+c3

(

b1a1+b2a2+b4a4

)

+c4

(

b1a1+b2a2+b3a3

))

+d2

(

c1

(

b2a2+b3a3+b4a4

)

+c2

(

b1a1+b3a3+b4a4

)

+c3

(

b1a1+b2a2+b4a4

)

+c4

(

b1a1+b2a2+b3a3

))

+d3

(

c1

(

b2a2+b3a3+b4a4

)

+c2

(

b1a1+b3a3+b4a4

)

+c3

(

b1a1+b2a2+b4a4

)

+c4

(

b1a1+b2a2+b3a3

))

+d4

(

c1

(

b2a2+b3a3+b4a4

)

+c2

(

b1a1+b3a3+b4a4

)

+c3

(

b1a1+b2a2+b4a4

)

+c4

(

b1a1+b2a2+b3a3

))

+d1[c1

(

a1

(

b2+b3+b4

)

+a2

(

b1+b3+b4

)

+a3

(

b1+b2+b4

)

+a4

(

b1+b2+b3

))

+. . .

+c4

(

a1

(

b2+b3+b4

)

+a2

(

b1+b3+b4

)

+a3

(

b1+b2+b4

)

+a4

(

b1+b2+b3

))

] +d2[c1

(

a1

(

b2+b3+b4

)

+a2

(

b1+b3+b4

)

+a3

(

b1+b2+b4

)

+a4

(

b1+b2+b3

))

+. . . +c4

(

a1

(

b2+b3+b4

)

+a2

(

b1+b3+b4

)

+a3

(

b1+b2+b4

)

+a4

(

b1+b2+b3

))

] +d3[c1

(

a1

(

b2+b3+b4

)

+a2

(

b1+b3+b4

)

+a3

(

b1+b2+b4

)

+a4

(

b1+b2+b3

))

+. . . +c4

(

a1

(

b2+b3+b4

)

+a2

(

b1+b3+b4

)

+a3

(

b1+b2+b4

)

+a4

(

b1+b2+b3

))

] +d4[c1

(

a1

(

b2+b3+b4

)

+a2

(

b1+b3+b4

)

+a3

(

b1+b2+b4

)

+a4

(

b1+b2+b3

))

+. . . +c4

(

a1

(

b2+b3+b4

)

+a2

(

b1+b3+b4

)

+a3

(

b1+b2+b4

)

+a4

(

b1+b2+b3

))

], (1)

which can be written more compactly as:

4

i=1 bi

n

i=1 ai

3 =

4

i=1

bi

(

ai

)

3₊ 4

i=1 ai

4

i₌_i

bi

(

ai

)

2

+ 4

i=1 ai

4

i=1 ai

4

i₌i biai

+

4

i=1 ai

24

i=1 ai

4

i₌_i

bi (2)

wherei=1,2,3,4; note that inEq. 2we have made use of fact that, by deﬁnition,ai=ci=di. It readily follows that for the general case

n

i=1 bi

n

i=1 ai

q

=

n

i=1

bi

(

ai

)

q+ n

i=1 ai

n

i₌i

bi

(

a_i

)

q−1

+

n

i=1 ai

n

i=1 ai

n

i₌_i

bi

(

ai

)

q−2+. . .

. . .+

n

i=1 ai

q−1 _n

i=1 ai

n

i₌_i

bi, (3)

or in a more compact form

n

i=1 bi

n

i=1 ai

q

=

n

i=1

bi

(

ai

)

q+

q

j=1

n

i=1 ai

j−1

×

i=1 ai

n

i₌i

bi

(

a_i

)

q−j, (4)

which can be immediately proved by induction, seeAppendix B. This completes the proof of the proposition.

ClearlyEq. 3can be rearranged to give

n

i=1

bi

(

ai

)

q=

n

i=1 bi

n

i=1 ai

q

−

n

i=1 ai

n

i₌i

bi

(

a_i

)

q−1

−

n

i=1 ai

n

i=1 ai

n

i₌_i

bi

(

ai

)

q−2−. . .

. . .−

n

i=1 ai

q−1n

i=1 ai

n

i₌_i

(5)

Remark 1.FromEqs. 1–3, one does not fail to observe that, ifnis large (i.e. asymptotically),

n

i=1 ai

q−1 _n

i=1 ai

n

i₌_i

bi>

n

i=1 ai

q−2 _n

i=1 ai

n

i₌_i

biai

>

n

i=1 ai

q−3 _n

i=1 ai

n

i₌_i

bi

(

ai

)

2

>

n

i=1 ai

q−4n

i=1 ai

n

i₌i bi

(

a_i

)

3

> . . . >

n i=1 ai n

i₌_i

bi

(

ai

)

q−1>

n

i=1 bi

(

ai

)

q

(6)

Furthermore, a closer look at the terms above also reveals that one might – albeit asymptotically – view

n

i=1 ai

q−1n

i=1 ai

n

i₌i bi,

n

i=1 ai

q−2n

i=1 ai

n

i₌i bia_i,

. . . , n i=1 ai n

i₌_i

bi

(

ai

)

q−1,

n

i=1 bi

(

ai

)

q

as a geometric progression whose common ratio is in the interval (0,1).

Remark 2.Whennis large (that is, asymptotically), another impor-tant inference that one could glean fromEqs. 1–3is that

n

i₌₁

bi

(

ai

)

m=

n

i₌₂

bi

(

ai

)

m=

n

i₌₃

bi

(

ai

)

m=. . .=

n

i₌_n

bi

(

ai

)

m, (7)

and even more so in our pattern recognition context whose basic credo is that all reference/prototype patternsxifor a given class are similar.i=1,2, . . . ,nandm=q−l,wherel=1,2, . . . ,q−1

Proposition 1 , Remarks 1and2essentially form the theoretical basis for the algorithm proposed in this work.

Based onRemark 1, it is justiﬁable to assume thatEq. 5can be well approximated as

n

i=1

bi

(

ai

)

q≈

n

i=1 bi

n

i=1 ai

q

−

n

i=1 ai

q−1n

i=1 ai

n

i₌i bi

−

n

i=1 ai

q−2 _n

i=1 ai

n

i₌i

bia_i (8)

Due to space constraints, practical realization and validation ofEq. 8

will be given in application journals on pattern recognition and cheminformatics.

Appendix B

Proof ofEq. 4by induction.

Base case: whenq₌1_,fromEq. 4we have

n

i=1 bi

n

i=1 ai

1 =

n

i=1 bi

(

ai

)

1

+ 1 j=1

n i=1 ai

j−1 _n

i=1 ai

n

i₌_i

bi

(

ai

)

1−j (9)

Clearly the left hand side of the equation can be expressed as n

i=1biai+ni=1aini₌ibi and the right hand side of the equation isn_i₌₁biai+iaini₌ibi. ThusEq. 9holds forq=1.

General case:

Assume thatEq. 4is correct for some positive integerk∈Z+_{. From} Eq. 4we have

n

i=1 bi

n

i=1 ai

k

=

n

i=1 bi

(

ai

)

k

+

k

j=1

n

i=1 ai

j−1

i ai

n

i₌_i

bi

(

ai

)

k−j (10)

We now show that the equation above holds also fork+1. Starting with the left hand side, we have

n i=1 bi n i=1 ai

k

_n

i=1 ai =

n

i=1

bi

(

ai

)

k+ k j=1

n i=1 ai

j−1

× n i=1 ai n

i₌_i

bi

(

ai

)

k−j

n

i=1 ai ,

(11)

where we have made use of the assumption thatEq. 10is correct; hence

n i=1 bi n i=1 ai

k

_n

i=1 ai = n i=1 bi

(

ai

)

k

n i=1 ai + k j=1 ×

n

i=1 ai j i ai n

i₌i

bi

(

a_i

)

k−j,

(12)

wheren_i₌₁bi(ai)k

n i=1ai

can be expressed as

n

i=1 bi

(

ai

)

k

n

i=1 ai =

n

i=1

bi

(

ai

)

k+1+ n

i=1 ai

n

i₌i

bi

(

a_i

)

k (13)

InsertingEq. 13intoEq. 12results in

n

i=1 bi

n

i=1 ai

k

_n

i=1 ai =

n

i=1

bi

(

ai

)

k+1

+

n i=1 ai n

i₌_i

bi

(

ai

)

k+

k j=1

n i=1 ai j i ai n

i₌_i

bi

(

ai

)

k−j

(14)

[n_i₌₁aini₌ibi

(

ai

)

k+kj=1

(

n

i=1ai)j

iaini₌ibi

(

ai

)

k−j] is

ba-sicallyk_j₌₀

(

n_i₌₁ai)j

iaini=ibi

(

ai

)

k−j,which can be rewritten

as k_j+₌1₁

(

n_i₌₁ai)j−1

i=1aini=ibi

(

ai

)

(k+1)−j. Inserting this ex-pression inEq. 14and rewriting

(

n_i₌₁ai)k

₍

n

i=1ai)on the left hand side ofEq. 14as

(

n_i₌₁ai)k+1_yield

n

i=1 bi

n

i=1 ai

k+1 =

n

i=1

bi

(

ai

)

k+1

+

k+1

j=1

n

i=1 ai

j−1

i=1 ai

n

i₌_i

bi

(

ai

)

(k+1)−j

, (15)

which is what we set out to prove. Thus,Eq. 4(and for that matter

Eq. 3) inAppendix Ahold for all values ofq∈Z+_.

Appendix C

In this work:bi=xT

(βi

xi);bi=xT

(βi

xi

)

;ai=xTxi;ai=xTxi;

(6)

as described in the main text (see Section 3). Hence, from Eq. 8

we have Nωj

i=1

(

xT

_(β

ixi

))(

xTxi

)

r−1≈cr−1

xT

Nωj

i=1

(β

ixi

)

−cr−2

Nωj

i=1 xTxi

Nωj

i=i

xT

(β

ixi

)

−cr−3

Nωj

i=1 xT_x

i Nωj

i=i

(

xT

_(β

ix_i

))(

xTx_i

)

(16)

wherec=

(

xTNωj i=1xi).

Making use of the two deﬁnitions z₌Nωj

i=1

(βi

xi) and z=

N_ω j

i=1xigiven inSection 3in the main text and with some algebraic manipulation (such as the fact thatgT_h₌_hT_b_{for any vectors}_g_and h), the three lines on the right hand side ofEq. 16can be recast in a more simple and familiar form:

• Line 1:cr−1

₍

_xTNωj

i=1

(βi

xi))=

(

xTz

)

r−1xTz

• Line 2:cr−2₌

₍

_xT_z

₎

r−2_,_whereas

N_ω

j

i=1 xTxi

N_ω

j

i=i

xT

(β

ixi

)

=xTYx,

whereY=

(

Nωj i=1xi

N_ω_j

i=i

(βi

xi

)

T

)

is anL-by-Lmatrix.

• Line 3:cr−3₌

₍

_xT_z

₎

r−3_,_whereas

Nωj

i=1 xT_x

i Nωj

i=i

(

xT

_(β

ix_i

))(

xTx_i

)

=xT

Nωj

i=1 xixT

Nωj

i=i

(β

ixi

)

xT_i

x

=xT Nωj

i=1

xi

(

xTSix

)

, (17)

whereSi= N_ω

j

i=i

((βi

xi

)

x T

i

)

is anL-by-Lmatrix.

According toEq. 7, the matricesS1,S2, . . . ,SN_ω

jcan be considered equivalent/similar. Hence,Eq. 17modiﬁes to

Nωj

i=1 xT_x

i Nωj

i=i

(

xT

_(β

ix_i

))(

xTx_i

)

=

(

xTz

)(

xTSx

)

, (18)

whereS=Siand

(

xTz

)

=xT N_ω

j i=1xi

Expressed in terms ofY,S,zandz,Eq. 16becomes

Nωj

i=1

(

xT

(β

ixi

))(

xTxi

)

r−1

≈

(

xT_z

₎

r−1

₍

_xT_z

₎

₋

₍

_xT_z

₎

r−2

₍

_xT

₍

_Y₊_S

₎

_x

₎

=

(

xT_z

₎

r−2

₍₍

_xT_z

₎₍

_xT_z

₎

₋_xT

₍

_Y₊_S

₎

_x

₎

=

(

xT_z

₎

r−2

₍₍

_xT_z_zT_x

₎

₋_xT

₍

_Y₊_S

₎

_x

₎

=

(

xT_z

₎

r−2

₍

_xT_Qx

_),

₍₁₉₎

whereQ=zzT₋

₍

_Y₊_S

₎

_,_an_L-by-L_matrix.

Note that

(

xT_z

₎

r−2_≥₀_,

₍

_xT_z_zT_x

₎

_≥_xT

₍

_Y₊_S

₎

_x; _hence,

(

xT_z

₎

r−2

₍

_xT_Qx

₎

_≥_{0 as it should be – because, by deﬁnition,} N_ω

j

i=1

(

xT

(βi

xi))(xTxi)r−1≥0.

References

[1]R. Duda, P.E. Hart, D.G. Stork, Pattern Classiﬁcation, Wiley, New York, NY, USA, 2000.

[2]T.Y. Young, T. Calvert, Classiﬁcation, Estimation, Pattern Recognition, American Elsevier, New York, NY, USA, 1974.

[3]B.D. Ripley, Pattern Recognition and Neural Networks, Cambridge University Press, Cambridge, UK, 1996.

[4]D.J. Hand, Discriminant and Classiﬁcation, Wiley, New York, NY, USA, 1981.

[5]A.R. Webb, Statistical Pattern Recognition, Wiley, Chicester, UK, 2002.

[6]C.M. Bishop, Pattern Recognition and Machine Learning, Springer, New York, NY, USA, 2006.

[7]K. Koch, Introduction to Bayesian Statistics, Springer–Verlag, Berlin, Germany, 2007.

[8]K.P. Murphy, Machine Learning: A Probabilistic Perspective, MIT Press, Cambridge, MA, USA, 2012.

[9]E. Parzen, On estimation of a probability density function and mode, Ann. Math. Stat. 33 (1962) 1065–1076.

[10]W.S. Meisel, Computer-oriented Approaches to Pattern Recognition, Academic Press, New York, NY, USA, 1972.

[11]S. Fiori, Probability density function learning by unsupervised neurons, Int. J. Neu-ral Syst. 11 (2001) 399–417.

[12]S. Fiori, Non-symmetric pdf estimation by artiﬁcial neurons: application to sta-tistical characterization of reinforced composites, IEEE Trans. Neural Netw. 14 (2003) 959–962.

[13]D.P. Specht, Generation of polynomial discriminant functions for pattern recogni-tion, IEEE Trans. Electron Comput. 16 (1967) 308–319.

[14]H.C. Andrews, Mathematical Techniques in Pattern Recognition, Wiley, New York, NY, USA, 1972.

[15]G. Harper, J. Bradshaw, J.C. Gittins, D.V.S. Green, A.R. Leach, Prediction of biolog-ical activity for high-throughput screening using binary kernel discrimination, J. Chem. Inf. Comp. Sci. 41 (2001) 1295–1300.

[16]R. Lowe, H.Y. Mussa, J.B.O. Mitchell, R.C. Glen, Classifying molecules using a sparse probabilistic kernel binary classiﬁer, J. Chem. Inf. Model. 51 (2011) 1539–1544.

[17]J. Aitchison, C.G.G. Aitken, Multivariate binary discrimination by the kernel method, Biometrika 63 (1976) 413–420.