Contents lists available atScienceDirect
Pattern Recognition Letters
journal homepage:www.elsevier.com/locate/patrec
The Parzen Window method: In terms of two vectors and one matrix
Hamse Y. Mussa
a,b,∗, John B.O. Mitchell
a, Avid M. Afzal
baEaStCHEM School of Chemistry and Biomedical Sciences Research Complex, University of St Andrews, North Haugh, St Andrews KY16 9ST, Scotland, UK bCentre for Molecular Informatics, Department of Chemistry, University of Cambridge, Lensfield Road, Cambridge CB2 1EW, England, UK
a r t i c l e
i n f o
Article history:
Received 15 February 2015 Available online 18 June 2015
Keywords:
Probability density function Kernel functions Parzen Window
a b s t r a c t
Pattern classification methods assign an object to one of several predefined classes/categories based on fea-tures extracted from observed attributes of the object (pattern). WhenLdiscriminatory features for the pat-tern can be accurately determined, the patpat-tern classification problem presents no difficulty. However, precise identification of the relevant features for a classification algorithm (classifier) to be able to categorize real world patterns without errors is generally infeasible. In this case, the pattern classification problem is often cast as devising a classifier that minimizes the misclassification rate. One way of doing this is to consider both the pattern attributes and its class label as random variables, estimate the posterior class probabilities for a given pattern and then assign the pattern to the class/category for which the posterior class probability value estimated is maximum. More often than not, the form of the posterior class probabilities is unknown.
The so-called Parzen Window approach is widely employed to estimate class-conditional probability (class-specific probability) densities for a given pattern. These probability densities can then be utilized to estimate the appropriate posterior class probabilities for that pattern. However, the Parzen Window scheme can be-come computationally impractical when the size of the training dataset is in the tens of thousands andL is also large (a few hundred or more). Over the years, various schemes have been suggested to ameliorate the computational drawback of the Parzen Window approach, but the problem still remains outstanding and unresolved.
In this paper, we revisit the Parzen Window technique and introduce a novel approach that may circumvent the aforementioned computational bottleneck. The current paper presents the mathematical aspect of our idea. Practical realizations of the proposed scheme will be given elsewhere.
© 2015 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).
1. Introduction
In mathematical pattern recognition, the problem of pattern clas-sification entails assigning an object – based on a number of spe-cific features of the object – to one of a finite set of predefined classes/categories
ωj
, wherej=1, 2,…,J, withJbeing the number of classes/categories of interest. Typically the object (or simply the pat-tern) is represented by anL-dimensional vectorxwhose elements,(
x1,x2, . . . ,xL),are values assumed to contain the appropriate infor-mation about the specific pattern features utilized to accurately clas-sify the pattern represented byx.WhenLdiscriminatory features for a pattern can be determined accurately, the pattern classification problem presents no difficulty:
∗ Corresponding author at: EaStCHEM School of Chemistry and Biomedical Sciences Research Complex, University of St Andrews, North Haugh, St Andrews KY16 9ST, Scotland, UK. Tel.:+44 7951519416; fax: +44 1223763076.
E-mail address:[email protected](H.Y. Mussa).
it reduces to a simple look-up table scheme. However, identifying the relevant features to classify realistic patterns without classification errors is generally impossible. Thus, the pattern classification prob-lem is often cast as the task of finding a classifier that minimizes the misclassification rate[1]. One popular way of achieving this objective is to treat both the pattern vectorxand the class label
ωj
as ran-dom variables. In this case, the posterior class probabilitiesp(ωj
|x) for a given patternxis computed; then patternxis assigned to the class, for which thep(ωj
|x) value is maximum[1–6]. (In the last step it is being assumed that all misclassification errors are equally bad[1,3,4].)
However, in practice, the form of the functionp(
ωj
|x) is unknown; instead,Npatternsxi and their corresponding correct class labels yi∈{
ω
1,ω
2, . . . ,ωJ
}
– i.e.,D={
(
xi,yi)}
Ni=1,assumed to constitute a representative dataset of the joint probability density functionp(ωj
, x) forωj
andx– are usually available. It is from these prototype pat-ternsxiand their corresponding class labelsyithat one tries to esti-matep(ωj
|x).http://dx.doi.org/10.1016/j.patrec.2015.06.002
According to basic probability rules[1–7],
p
(ω
j|
x)
p(
x)
=p(ω
j,x)
=p(
x|
ω
j)
p(ω
j)
(1)These rules allow one to modularize the estimation problem and es-timatep(
ωj
|x) (and, of course,p(x)) in terms ofp(x|ωj
) andp(ωj
):p
(ω
j|
x)
=p
(
x|
ω
j)
p(ω
j)
p
(
x)
, (2)whereby we may have a better chance of being able to estimate p(x|
ωj
) andp(ωj
) fromDthan estimatingp(ωj
|x) directly fromD. In the denominator,p(
x)
=Jj=1p
(
x|
ω
j)p(ωj)
.In the Bayesian statistics framework,p(
ωj
) is referred to as the class prior probability, which is the probability that a member of classωj
will occur. The functionp(x|ωj
) is called the class-conditional probability density function, i.e. the probability density of observ-ing patternxgiven thatxis a member of classωj
. The denominator term on the right hand side ofEq. 2is often called the “evidence” or “marginal likelihood”. For the purpose of this paper we can afford to simply view this term as a normalization factor.If there is evidence that the number of prototype patterns per class is an indication of the importance of that class, then a sensible ap-proximation ofp(
ωj
) can bep
(ω
j)
=N
i=1
ν
ijN , (3)
where
ν
ij=1 if theith prototypexibelongs to class
ωj
, otherwiseν
ij=0; andNis as described before. Nonetheless,p(
ωj
) is typically assumed to be uniform, i.e.,p(ωj)
=1J,whereJis as defined before. Estimatingp(x|
ωj
) fromDis not straightforward[1–5]. In the last half-century, a plethora of methods have been proposed for estimat-ingp(x|ωj
) based onD,the so-called training set. There are ample excellent reviews and text books on this topic; for example, the two books – one by Hand[4]and the other by Murphy[8]– give adequate and accessible descriptions of the bulk of these approaches devised in recent (and not so recent) years.In this paper we are concerned with one particular approach that is widely thought to be apropos to the task of estimatingp(x|
ωj
) from a representative training dataset: the so-called Parzen Window method[1,2,4,9,10], also known as Parzen estimator, Potential func-tion technique[10], to name but a few.In the preceding discussion and in the rest of the paper, the terms “class”, “label”, “class label” and “category” are employed inter-changeably. For notational simplicity we usex,xiand
ωj
both as ran-dom variables and their realizations. Furthermore we follow (in line with the current trend in machine learning and statistics) the conve-nient – but not necessarily correct – practice of using the term “den-sity” for both a discrete random variable’s probability function and for the probability function of a continuous random variable. An im-plicit assumption being made throughout the paper is that all spaces, matrices, vectors, functions and variables (discrete or not),etc., are real.2. Literature review
The Parzen Window approach can approximatep(x|
ωj
) by a sim-ple formula[1,2,10]:ˆ
p
(
x|
ω
j)
= 1N
i=1
ν
ij N
i=1
K
(
x,xi;λ)ν
ij, (4)wherexi∈Ddenotes prototype patterns and
ν
ijis as defined before. K(x,xi;λ
) – commonly known as the kernel function – is a two-variable function with specific properties, which are abundantly cov-ered in the statistical pattern recognition literature[1,2,9,10]. At anyrate, it might be helpful to think ofK(x,xi;
λ
) as a measure of simi-larity returning how similar patternsxandxiare,λ
being a tunable (smoothing) parameter. In other words, the kernel function peaks at x=xiand decays away elsewhere; theλ
parameter,inter alia, has an important role in determining the rate of the decay.Eq. 4indicates that ˆp
(
x|
ω
j)is formed from the superposition of kernel functionK(xi,x;λ
) values at the given prototype patternsxi for classωj
. The Parzen Window method is powerful in the sense that, with enough representative data points (prototypes/references), its estimate of the class conditional probability density converges top(x|ωj
) (see Ref.[1], Chapter 4). AlthoughEq. 4is conceptually simple and capable of providing a good estimate ofp(x|ωj
), it can computationally suffer from the requirements that all the proto-types/referencesxifor classωj
must be retained in main-memory to compute ˆp(
x|
ωj)
,the estimate ofp(x|ωj
). Furthermore, considerable CPU-time may be required each time this method is used to estimate p(x|ωj
) to classify a novel pattern. The fact that the size of the refer-ence pattern vectorsxican be easily in the hundreds (or more) may exacerbate the main-memory and CPU-time requirements.Over the years, various schemes have been developed to address the computational drawback of this otherwise elegant and power-ful method. For example, one of these schemes entails – see Ref.[1]
(Chapter 4), Ref.[4](Chapter 2) and Ref.[10](Chapters 6) for detailed technical and practical discussions – expressing the kernel function as a finite series expansion
K
(
x,xi;λ)
= M
m=1
φ
m(
x)φ
m(
xi)
, (5)which renders the right hand side ofEq. 4
1
N
i=1
ν
ij N
i=1
K
(
x,xi;λ)ν
ij= 1N
i=1
ν
ij N
i=1
ν
i jM
m=1
φ
m(
x)φ
m(
xi),
(6)with
{
φm
}
Mm=1 being appropriate basis functions (not necessarily polynomials) defined in the feature space in which the pattern vec-torsxandxireside.
FromEqs. 4and6, we have
ˆ
p
(
x|
ω
j)
= M
m=1 cmω
j
φ
m(
x),
(7)where
cmω
j =
1
N
i=1
ν
ij N
i=1
ν
i jφ
m(
xi)
,with
ν
ij=1 if patternxibelongs to class
ωj
, otherwiseν
ij=0. This scheme certainly removes the reference patterns’ storage problem. However, it can create a computational problem of its own, in particular when bothMandLare large, which is often the case in real world classification problems. ComputingMbasis functions ofL variables to classify a new patternxis not a trivial computational task[1–4,10–12].
Another approach – albeit a particular case of the scheme above – is that proposed by Specht[13]. It was based on a Taylor series ex-pansion of
ρ
(x,xi;λ
) (seeEq. 8), such that anrth order polynomial in Lvariables was required to estimate and store(
L+Lr)
terms to approxi-mate ˆp(
x|
ω
j)[10,13,14]. For short but “insightful” descriptions of the relationship between an appropriate value ofrand the smoothing pa-rameterλ
, see Ref.[1](Chapter 4) and Ref.[14](Chapter 4). In princi-ple, Specht’s scheme has a strong appeal of simplicity providing the number of terms required in the Taylor series can be held to a practi-cal limit. Unfortunately, bothrandLcan be large in current realistic classification problems[1,4,10].Thus, the motivation for this paper is to introduce yet another scheme that might be able to circumvent the aforementioned com-putational bottleneck problem, while retaining the estimation power and conceptual simplicity of the Parzen Window method. This work is confined to Parzen Window based approaches, in which the kernel function is – or can be expressed – in the form
K
(
x,xi;λ)
=A f(
x;λ)ρ(
x,xi;λ)
f(
xi;λ),
(8)whereA>0;f(x;
λ
) andf(xi;λ
) are any real functions defined in the feature space;ρ
(x,xi;λ
) is a polynomial inxandxi; andλ
is as de-fined before. The kernel functions that are or can be written in the form above are ubiquitous nowadays in data analysis[4,6,8]. They have most often been successfully applied to discrete data; for this reason we decided to confine attention to the discrete case. For illus-trative purposes, we focus on binary data, i.e.,xl=0 or 1 denoting absence or presence of featurexl in the pattern vector, respectively. That is to say, bothx(test pattern vector) andxi(reference/prototype pattern vector) reside in a binary feature spaceX={
0,1}
L.The extension of the proposed scheme to continuous data – i.e., X=RL– is straightforward.
One final, but important remark is that Specht’s approach and our proposed scheme are arguably similar in spirit. However there are crucial differences: unlike Specht’s formulation, our scheme does not estimate
(
L+Lr)
terms, it does not retain(
L+Lr)
terms in main-memory, nor does the variablerfeature in the final form of our algorithm – instead, in our case, only twoL-dimensional vectors and oneL-by-L matrix are required to retain in main-memory. The two vectors and the matrix can notably be highly sparse. We will briefly expound on this assertion shortly.3. Proposed method and discrete Parzen Window approach
As a concrete example, we use the widely utilized kernel func-tion (albeit in cheminformatics[15,16], and references therein) intro-duced by Aitchison and Aitken (AA–kernel)[17,18],
K
(
x,xi;λ)
=λ
L 1−λ
λ
d(x,xi), (9)
where 0.5<
λ
<1 andd(x,xi) denotes the number of components in whichxandxidisagree. This dissimilarity measured(x,xi) can be conveniently expressed as[4]d
(
x,xi)
=xTx−2xTxi+xTixi (10)In passing, the AA–kernel is basically a discrete analogue of an isotropic Gaussian kernel[17,18].
FromEqs. 9and10, and the fact thatelnw=w,we have
K
(
x,xi;λ)
=λ
Leln( 1−λλ )(xTx+xTixi)×e−2 ln(
1−λ λ )xTxi =
λ
Le−α(xTx+xTixi)×e2αxTxi
=
λ
Le−α(xTx) ×e2αxTxi×e−α(xTixi), (11)
where
α
=ln(
1λ−λ
)
. The terme2αxTx
ican be written as
e2αxTx i =
∞
r=0
(
2α)
rr!
(
x Txi
)
r= ∞
r=0
γ
r(
xTxi)
r, (12)where
γ
r=(2α) r r! .InsertingEq. 12intoEq. 11yields
K
(
x,xi;λ)
=λ
Le−α(xTx)
∞r=0
γ
r(
xTxi)
re−α(xT
ixi) (13)
(cf.Eq. 8).
FromEqs. 13and4, we have
ˆ
p
(
x|
ω
j)
=λ
Le−α(xTx)N
i=1
ν
ij N
i=1
ν
i je−α(xT ixi)
∞
r=0
γ
r(
xTxi)
r (14)Now we come to the main contribution of this paper: removing the requirement for retaining all the reference/prototype patterns for class
ωj
in main-memory to compute ˆp(
x|
ωj)
in order to es-timate ˆp(ωj
|
x)
to classify a new pattern x. However, first we sim-plify the notation by defining these variablesNωj,a,zandz, which will be consistently used throughout, as follows:Nωj=Ni=1
ν
ij;βi
= e−α(xTixi);a=Nωji=1
βi
;z= Nωj
i=1xi; andz=
Nω j
i=1
βi
xi, where Nωj refers to the number of patterns in the training dataset that belong to classωj
;N,α
andν
ijare as described before. In our new notation, Eq. 14becomes
ˆ
p
(
x|
ω
j)
=Bγ
0Nωj
i=1
β
i+∞
r=1
γ
r Nωj
i=1
β
i(
xTxi)
r=B
γ
0a+∞
r=1
γ
r Nωj
i=1
(
xT(β
ixi
))(
xTxi)
r−1, (15)
whereB=λLe−α(xTx)
Nω j
.
The main contribution of the paper is formulating Eq. 15 in terms of
γ
r,a,z,zand anL-by-Lmatrix,Qwhich will be defined shortly. The task of this formulation basically amounts to expressing Nωj
i=1
(
xT(βi
xi))(xTxi)r−1 in terms ofz,z andQ. In doing this, we hope to ameliorate the computational drawback of the Parzen Win-dow method based on kernel functions in the form given inEq. 8.4. Results
When r=1, the task is trivial: Nωj
i=1
(
xT(βi
xi))(xTxi)r−1 reduces toNωj
i=1 xT
(β
ixi
)
=xTz (16)where, by definition (seeSection 3),z=Nωj i=1
(βi
xi).However, whenr> 1, the task can be taxing. To this end, we make use of a simple – but useful – proposition (Proposition 1, whose proof is provided inAppendices AandB) to demonstrate that Nω
j
i=1
(
xT(βi
xi))(xTxi)r−1forr>1 can be written as (seeEq. 19in Appendix C)Nωj
i=1
(
xT(β
ixi))(
xTxi)
r−1=(
xTQx)(
xTz)
r−2, (17)whereQis just anL-by-Lmatrix, (seeEq. 19). InsertingEqs. 16and17intoEq. 15results in
ˆ
p
(
x|
ω
j)
=Bγ
0a+γ
1(
xTz)
+(
xTQx)
∞
r=2
γ
r(
xTz)
r−2=B
γ
0a+γ
1(
xTz)
+4
α
2(
eμ−μ
−1)
μ
2(
xTQx
)
,(18)
where B=λLe−α(xTx)
Nω j
; ∞r=2
γ
r(
xTz)
r−2=4α2(eμ−μ−1)
μ2 , with
μ
= 2α(
xTz)
,γ
r=(2α) r
r! ,
α
=ln(
1−λλ)
and 0.5<λ
<1Evidently,Eq. 18illustrates that it is not necessary to retain all ref-erence/prototype patterns for a given class in main-memory to com-pute the value of ˆp
(
x|
ω
j)for a test patternx; instead, all that is re-quired is anL-by-LmatrixQand twoLsize vectors (zandz), which are computed once and then retained in main-memory. This was the objective we set out to achieve in this paper.value ofLis large. The fundamental reason for this sparsity is that in a high-dimensional reference pattern vectorxithere is the potential of many of its components being zero – i.e., many of the features are very likely to be absent inxi.
In passing, if we are dealing with continuous data, wherebyxi∈ RL,the vectorsz,zand matrixQcould be dense. Nonetheless, storing Qcan still be computationally cheaper than retainingNωjreference patternsxiper class in main-memory – providing thatNωj>L.
The current paper presents the mathematical aspect of our idea. Practical realizations of the proposed scheme will be given elsewhere.
5. Conclusion
The Parzen Window method is a powerful tool for estimating class conditional probability density functions. However, it can suffer from a severe computational bottleneck when the training dataset is large. Over the years, attempts have been made to rectify this computa-tional drawback of the method. To the best of our knowledge the is-sue has remained unsolved. In this paper we have proposed a novel scheme, which we hope contributes to our endeavor of alleviating the computational bottleneck from which the Parzen Window algorithm suffers when the training dataset is large.
Acknowledgments
We thank the BBSRC for funding this research through Grant
BB/I00596X/1. JBOM thanks the Scottish Universities Life Sciences Al-liance (SULSA) for fin ancial support.
Appendix A
Proposition 1. Let ai and bi be real variables ∈ [0, ∞). Then the product of
(
ni=1bi)(ni=1ai)q can be given asni=1bi(ai)q+ q
j=1
(
ni=1ai)j−1
i=1aini=ibi
(
ai)
q−j,wherei=1,2, . . . ,nwithnandq∈Z+.
Proof. Let us suppose, with no loss of generality, that n=4 and q=3. In this scenario, we have the product
(
b1+b2+b3+b4)
×(
a1+a2+a3+a4)
×(
a1+a2+a3+a4)
×(
a1+a2+a3+a4)
. For the sake of clarity, we write the product as(
d1+d2+d3+ d4)
×(
c1+c2+c3+c4)
×(
a1+a2+a3+a4)
×(
b1+b2+b3+b4)
, whereai=ci=di,withibeing 1, 2, 3, 4. To labour the obvious, we express(
d1+d2+d3+d4)
×(
c1+c2+c3+c4)
×(
a1+a2+a3+ a4)
×(
b1+b2+b3+b4)
as follows(
d1+d2+d3+d4)
×(
c1+c2+c3+c4)
×
(
a1+a2+a3+a4)
×(
b1+b2+b3+b4)
=b1a1c1d1+b2a2c2d2+b3a3c3d3+b4a4c4d4+d1
(
b2a2c2+b3a3c3+b4a4c4)
+d2(
b1a1c1+b3a3c3+b4a4c4)
+d3(
b1a1c1+b2a2c2+b4a4c4)
+d4(
b1a1c1+b2a2c2+b3a3c3)
+d1
(
c1(
b2a2+b3a3+b4a4)
+c2(
b1a1+b3a3+b4a4)
+c3(
b1a1+b2a2+b4a4)
+c4(
b1a1+b2a2+b3a3))
+d2(
c1(
b2a2+b3a3+b4a4)
+c2(
b1a1+b3a3+b4a4)
+c3(
b1a1+b2a2+b4a4)
+c4(
b1a1+b2a2+b3a3))
+d3(
c1(
b2a2+b3a3+b4a4)
+c2(
b1a1+b3a3+b4a4)
+c3(
b1a1+b2a2+b4a4)
+c4(
b1a1+b2a2+b3a3))
+d4(
c1(
b2a2+b3a3+b4a4)
+c2(
b1a1+b3a3+b4a4)
+c3(
b1a1+b2a2+b4a4)
+c4(
b1a1+b2a2+b3a3))
+d1[c1(
a1(
b2+b3+b4)
+a2(
b1+b3+b4)
+a3(
b1+b2+b4)
+a4(
b1+b2+b3))
+. . .+c4
(
a1(
b2+b3+b4)
+a2(
b1+b3+b4)
+a3(
b1+b2+b4)
+a4(
b1+b2+b3))
] +d2[c1(
a1(
b2+b3+b4)
+a2(
b1+b3+b4)
+a3(
b1+b2+b4)
+a4(
b1+b2+b3))
+. . . +c4(
a1(
b2+b3+b4)
+a2(
b1+b3+b4)
+a3(
b1+b2+b4)
+a4(
b1+b2+b3))
] +d3[c1(
a1(
b2+b3+b4)
+a2(
b1+b3+b4)
+a3(
b1+b2+b4)
+a4(
b1+b2+b3))
+. . . +c4(
a1(
b2+b3+b4)
+a2(
b1+b3+b4)
+a3(
b1+b2+b4)
+a4(
b1+b2+b3))
] +d4[c1(
a1(
b2+b3+b4)
+a2(
b1+b3+b4)
+a3(
b1+b2+b4)
+a4(
b1+b2+b3))
+. . . +c4(
a1(
b2+b3+b4)
+a2(
b1+b3+b4)
+a3
(
b1+b2+b4)
+a4(
b1+b2+b3))
], (1)which can be written more compactly as:
4i=1 bi
n
i=1 ai
3 =
4
i=1
bi
(
ai)
3+ 4
i=1 ai
4
i=i
bi
(
ai)
2+ 4
i=1 ai
4
i=1 ai
4
i=i biai
+
4i=1 ai
24
i=1 ai
4
i=i
bi (2)
wherei=1,2,3,4; note that inEq. 2we have made use of fact that, by definition,ai=ci=di. It readily follows that for the general case
ni=1 bi
n
i=1 ai
q
=
n
i=1
bi
(
ai)
q+ n
i=1 ai
n
i=i
bi
(
ai)
q−1+
n
i=1 ai
n
i=1 ai
n
i=i
bi
(
ai)
q−2+. . .. . .+
ni=1 ai
q−1 n
i=1 ai
n
i=i
bi, (3)
or in a more compact form
ni=1 bi
n
i=1 ai
q
=
n
i=1
bi
(
ai)
q+q
j=1
ni=1 ai
j−1
×
i=1 ai
n
i=i
bi
(
ai)
q−j, (4)which can be immediately proved by induction, seeAppendix B. This completes the proof of the proposition.
ClearlyEq. 3can be rearranged to give
n
i=1
bi
(
ai)
q= ni=1 bi
n
i=1 ai
q
−
n
i=1 ai
n
i=i
bi
(
ai)
q−1−
n
i=1 ai
n
i=1 ai
n
i=i
bi
(
ai)
q−2−. . .. . .−
ni=1 ai
q−1n
i=1 ai
n
i=i
Remark 1.FromEqs. 1–3, one does not fail to observe that, ifnis large (i.e. asymptotically),
ni=1 ai
q−1 n
i=1 ai
n
i=i
bi>
ni=1 ai
q−2 n
i=1 ai
n
i=i
biai
>
ni=1 ai
q−3 n
i=1 ai
n
i=i
bi
(
ai)
2>
ni=1 ai
q−4n
i=1 ai
n
i=i bi
(
ai)
3> . . . >
n i=1 ai n
i=i
bi
(
ai)
q−1>n
i=1 bi
(
ai)
q(6)
Furthermore, a closer look at the terms above also reveals that one might – albeit asymptotically – view
ni=1 ai
q−1n
i=1 ai
n
i=i bi,
ni=1 ai
q−2n
i=1 ai
n
i=i biai,
. . . , n i=1 ai n
i=i
bi
(
ai)
q−1,n
i=1 bi
(
ai)
qas a geometric progression whose common ratio is in the interval (0,1).
Remark 2.Whennis large (that is, asymptotically), another impor-tant inference that one could glean fromEqs. 1–3is that
n
i=1
bi
(
ai)
m=n
i=2
bi
(
ai)
m=n
i=3
bi
(
ai)
m=. . .=n
i=n
bi
(
ai)
m, (7)and even more so in our pattern recognition context whose basic credo is that all reference/prototype patternsxifor a given class are similar.i=1,2, . . . ,nandm=q−l,wherel=1,2, . . . ,q−1
Proposition 1 , Remarks 1and2essentially form the theoretical basis for the algorithm proposed in this work.
Based onRemark 1, it is justifiable to assume thatEq. 5can be well approximated as
n
i=1
bi
(
ai)
q≈ ni=1 bi
n
i=1 ai
q
−
ni=1 ai
q−1n
i=1 ai
n
i=i bi
−
ni=1 ai
q−2 n
i=1 ai
n
i=i
biai (8)
Due to space constraints, practical realization and validation ofEq. 8
will be given in application journals on pattern recognition and cheminformatics.
Appendix B
Proof ofEq. 4by induction.
Base case: whenq=1,fromEq. 4we have
ni=1 bi
n
i=1 ai
1 =
n
i=1 bi
(
ai)
1+ 1 j=1
n i=1 aij−1 n
i=1 ai
n
i=i
bi
(
ai)
1−j (9)Clearly the left hand side of the equation can be expressed as n
i=1biai+ni=1aini=ibi and the right hand side of the equation isni=1biai+iaini=ibi. ThusEq. 9holds forq=1.
General case:
Assume thatEq. 4is correct for some positive integerk∈Z+. From Eq. 4we have
ni=1 bi
n
i=1 ai
k
=
n
i=1 bi
(
ai)
k+
k
j=1
ni=1 ai
j−1
i ai
n
i=i
bi
(
ai)
k−j (10)We now show that the equation above holds also fork+1. Starting with the left hand side, we have
n i=1 bi n i=1 aik
n
i=1 ai =
ni=1
bi
(
ai)
k+ k j=1 n i=1 aij−1
× n i=1 ai n
i=i
bi
(
ai)
k−jn
i=1 ai ,
(11)
where we have made use of the assumption thatEq. 10is correct; hence
n i=1 bi n i=1 aik
ni=1 ai = n i=1 bi
(
ai)
k n i=1 ai + k j=1 × ni=1 ai j i ai n
i=i
bi
(
ai)
k−j,(12)
whereni=1bi(ai)k
n i=1aican be expressed as
n
i=1 bi
(
ai)
k ni=1 ai =
n
i=1
bi
(
ai)
k+1+ n
i=1 ai
n
i=i
bi
(
ai)
k (13)InsertingEq. 13intoEq. 12results in
ni=1 bi
n
i=1 ai
k
n
i=1 ai =
n
i=1
bi
(
ai)
k+1+
n i=1 ai ni=i
bi
(
ai)
k+k j=1
n i=1 ai j i ai ni=i
bi
(
ai)
k−j(14)
[ni=1aini=ibi
(
ai)
k+kj=1(
ni=1ai)j
iaini=ibi
(
ai)
k−j] isba-sicallykj=0
(
ni=1ai)jiaini=ibi
(
ai)
k−j,which can be rewrittenas kj+=11
(
ni=1ai)j−1i=1aini=ibi
(
ai)
(k+1)−j. Inserting this ex-pression inEq. 14and rewriting(
ni=1ai)k(
ni=1ai)on the left hand side ofEq. 14as
(
ni=1ai)k+1yield ni=1 bi
n
i=1 ai
k+1 =
n
i=1
bi
(
ai)
k+1+
k+1j=1
ni=1 ai
j−1
i=1 ai
n
i=i
bi
(
ai)
(k+1)−j, (15)
which is what we set out to prove. Thus,Eq. 4(and for that matter
Eq. 3) inAppendix Ahold for all values ofq∈Z+.
Appendix C
In this work:bi=xT
(βi
xi);bi=xT(βi
xi)
;ai=xTxi;ai=xTxi;as described in the main text (see Section 3). Hence, from Eq. 8
we have Nωj
i=1
(
xT(β
ixi
))(
xTxi)
r−1≈cr−1 xTNωj
i=1
(β
ixi)
−cr−2
Nωji=1 xTxi
Nωj
i=i
xT
(β
ixi)
−cr−3
Nωji=1 xTx
i Nωj
i=i
(
xT(β
ixi
))(
xTxi)
(16)
wherec=
(
xTNωj i=1xi).Making use of the two definitions z=Nωj
i=1
(βi
xi) and z=Nω j
i=1xigiven inSection 3in the main text and with some algebraic manipulation (such as the fact thatgTh=hTbfor any vectorsgand h), the three lines on the right hand side ofEq. 16can be recast in a more simple and familiar form:
• Line 1:cr−1
(
xTNωji=1
(βi
xi))=(
xTz)
r−1xTz• Line 2:cr−2=
(
xTz)
r−2,whereas Nωj
i=1 xTxi
Nω
j
i=i
xT
(β
ixi)
=xTYx,whereY=
(
Nωj i=1xiNωj
i=i
(βi
xi)
T)
is anL-by-Lmatrix.• Line 3:cr−3=
(
xTz)
r−3,whereas Nωji=1 xTx
i Nωj
i=i
(
xT(β
ixi
))(
xTxi)
=xT
Nωj
i=1 xixT
Nωj
i=i
(β
ixi)
xTix
=xT Nωj
i=1
xi
(
xTSix)
, (17)whereSi= Nω
j
i=i
((βi
xi)
x Ti
)
is anL-by-Lmatrix.According toEq. 7, the matricesS1,S2, . . . ,SNω
jcan be considered equivalent/similar. Hence,Eq. 17modifies to
Nωji=1 xTx
i Nωj
i=i
(
xT(β
ixi
))(
xTxi)
=(
xTz)(
xTSx)
, (18)whereS=Siand
(
xTz)
=xT Nωj i=1xi
Expressed in terms ofY,S,zandz,Eq. 16becomes
Nωj
i=1
(
xT(β
ixi))(
xTxi)
r−1≈
(
xTz)
r−1(
xTz)
−(
xTz)
r−2(
xT(
Y+S)
x)
=
(
xTz)
r−2((
xTz)(
xTz)
−xT(
Y+S)
x)
=
(
xTz)
r−2((
xTzzTx)
−xT(
Y+S)
x)
=
(
xTz)
r−2(
xTQx),
(19)whereQ=zzT−
(
Y+S)
,anL-by-Lmatrix.Note that
(
xTz)
r−2≥0,(
xTzzTx)
≥xT(
Y+S)
x; hence,(
xTz)
r−2(
xTQx)
≥0 as it should be – because, by definition, Nωj
i=1
(
xT(βi
xi))(xTxi)r−1≥0.References
[1]R. Duda, P.E. Hart, D.G. Stork, Pattern Classification, Wiley, New York, NY, USA, 2000.
[2]T.Y. Young, T. Calvert, Classification, Estimation, Pattern Recognition, American Elsevier, New York, NY, USA, 1974.
[3]B.D. Ripley, Pattern Recognition and Neural Networks, Cambridge University Press, Cambridge, UK, 1996.
[4]D.J. Hand, Discriminant and Classification, Wiley, New York, NY, USA, 1981.
[5]A.R. Webb, Statistical Pattern Recognition, Wiley, Chicester, UK, 2002.
[6]C.M. Bishop, Pattern Recognition and Machine Learning, Springer, New York, NY, USA, 2006.
[7]K. Koch, Introduction to Bayesian Statistics, Springer–Verlag, Berlin, Germany, 2007.
[8]K.P. Murphy, Machine Learning: A Probabilistic Perspective, MIT Press, Cambridge, MA, USA, 2012.
[9]E. Parzen, On estimation of a probability density function and mode, Ann. Math. Stat. 33 (1962) 1065–1076.
[10]W.S. Meisel, Computer-oriented Approaches to Pattern Recognition, Academic Press, New York, NY, USA, 1972.
[11]S. Fiori, Probability density function learning by unsupervised neurons, Int. J. Neu-ral Syst. 11 (2001) 399–417.
[12]S. Fiori, Non-symmetric pdf estimation by artificial neurons: application to sta-tistical characterization of reinforced composites, IEEE Trans. Neural Netw. 14 (2003) 959–962.
[13]D.P. Specht, Generation of polynomial discriminant functions for pattern recogni-tion, IEEE Trans. Electron Comput. 16 (1967) 308–319.
[14]H.C. Andrews, Mathematical Techniques in Pattern Recognition, Wiley, New York, NY, USA, 1972.
[15]G. Harper, J. Bradshaw, J.C. Gittins, D.V.S. Green, A.R. Leach, Prediction of biolog-ical activity for high-throughput screening using binary kernel discrimination, J. Chem. Inf. Comp. Sci. 41 (2001) 1295–1300.
[16]R. Lowe, H.Y. Mussa, J.B.O. Mitchell, R.C. Glen, Classifying molecules using a sparse probabilistic kernel binary classifier, J. Chem. Inf. Model. 51 (2011) 1539–1544.
[17]J. Aitchison, C.G.G. Aitken, Multivariate binary discrimination by the kernel method, Biometrika 63 (1976) 413–420.