PROBABILISTIC POTENTIAL FUNCTION NEURAL NETWORK CLASSIFIER
Gursel Serpen, Lloyd G. Allred and Krzysztof J. Cios
Electrical Engineering & Computer Science Department
The University of Toledo, Toledo, OH 43606 [email protected]
ABSTRACT
A novel probabilistic potential function neural network classifier algorithm is proposed to deal with classes which are multi-modally distributed and formed from sets of disjoint pattern clusters. The proposed classifier has a number of desirable properties which distinguish it from other neural network classifiers: 1) it is fast, 2) it can learn multi-mode probability density functions, 3) it does not require guessing the network topology but instead adapts topologically to the classification problem at hand in a dynamic way as the training progresses, 4) it discovers clustering properties of the training data and adapts to a minimal network topology in terms of needed computational resources, 5) it implements an incremental learning procedure and hence does not disturb the previous state of the network but simply adds to it, and 6) it can form potentially optimal decision boundaries. A complete description of the algorithm in terms of its architecture and pseudocode is presented in the paper.
1. Introduction
Artificial neural networks (ANNs), which represent a subset of the algorithms in statistical pattern recognition theory, are paradigms designed for maximum computational efficiency [14]. There is a large set of neural network paradigms in the literature addressing pattern recognition problems. Each of these neural pattern classifiers has shortcomings which make it inapplicable to pattern classification tasks that require the neural network paradigm to possess the following properties:
1. train on-line (fast learning) and classify in real-time (parallel computation),
2. form classification boundaries which optimally separate classes that are likely to be formed from a set of disconnected subclasses in the pattern space; the joint probability density function (PDF) of a particular class is likely to have many modes,
3. do not require an initial guess for the network topology, but rather topologically adapt to the particular instance of the classification problem at hand in a dynamic way as the training progresses,
4. discover clustering properties of training data and adapt to a minimal network topology in terms of needed computational resources,
5. implement an incremental learning procedure and hence do not disturb the previous state of the network but simply add to it (learning is accomplished simply by adding new nodes to the existing network to learn the new training pattern), and
6. form optimal decision boundaries which approximate those of a theoretical Bayesian classifier.
We next review four important ANN paradigms which have been successfully applied to pattern classification tasks, assess them against the six criteria outlined above and briefly explain their shortcomings. These ANN paradigms are the Multi-Layer Feedforward (MLF) network [15], the Radial Basis Function (RBF) network [10, 11], the Learning Vector Quantization (LVQ) network [6] and the Probabilistic Neural Network (PNN) [12].
The MLF with the backpropagation learning rule is the most widely known neural network algorithm for pattern classification. The lack of efficient techniques to determine the topology of the network for a given problem and the slow learning speed make this paradigm unsuitable for real-time implementations [15]. It is well documented in the literature that the number of hidden layer nodes plays a very important role in the ability of the network to partition the pattern space, and currently there are no well-defined analytical procedures to specify the number of hidden layer nodes for a particular problem except in a number of limited cases [1, 2, 3, 4, 5, 7].
RBF neural networks can be trained up to three orders of magnitude faster than MLFs for the same type of problems [8, 11]. Initialization of the network, however, requires the clustering properties of the data to be analyzed and understood well; this analysis is typically performed using an unsupervised learning algorithm such as k-means and is essential to determine the number of hidden layer nodes and their parameter settings. The training process requires a matrix inversion to compute the weights from the hidden layer to the output layer. A lack of understanding of the clustering properties of the data may cause an inappropriate network topology to be specified, which in turn will cause the network performance to suffer significantly.
LVQ networks can be trained very efficiently compared with other neural paradigms such as MLF and RBF networks [6]. This paradigm suffers significant performance degradation if the codebook vectors cannot be initialized optimally, for which no well-defined procedure exists. The effect of initialization on the network performance gets worse if the class distributions are disjoint and possibly even intermingled.
The PNN possesses a number of useful characteristics, such as ease of setup, simplicity of training and the ability to perform in real time for most applications once the network is trained [12]. The PNN paradigm requires a pattern layer node to be created for each training pattern (the weight vector of the newly created node is set to the training pattern), which might result in very large node counts in the pattern layer for some realistic size problems.
This paper introduces a new neural network algorithm which has the potential to perform better than any of the four paradigms discussed above for a stochastic pattern classification problem and, at the same time, does not suffer from the shortcomings associated with each paradigm. In brief, the proposed neural network algorithm, called the Probabilistic Potential Function Neural Network (PPFNN), theoretically possesses all six properties listed above.
2. Probabilistic Potential Function Neural Network
In a typical stochastic pattern classification problem, noise or other real-life imperfections will cause the classes to overlap in the pattern space, making it impossible to assign a given pattern to a particular class with certainty. In this case, a probability value for class membership of a pattern can be computed to determine the class to which the pattern most likely belongs. The PPFNN implements an algorithm to compute the probability of class membership for a given pattern and assigns the pattern to the class associated with the highest probability value. A formal mathematical statement of the problem is presented in the Appendix.
2.1. PPFNN Architecture
The PPFNN employs four layers to implement the stochastic decision making rule. The first layer is the pattern layer, whose nodes simply distribute the incoming signal values to the hidden layer nodes without any weighting. Nodes in the hidden layer loosely represent the cluster centers in the data set and are connected to output layer nodes through modifiable weights, $w_{ij} = \gamma_{ij}^{k}$, where $\gamma_{ij}^{k}$ is an element in a sequence of positive reals (e.g., the harmonic sequence $\{1/k\}$, $k = 1, 2, \ldots$) for training pattern k, hidden node i and output node j (see Tou & Gonzalez [13] for the set of conditions $\gamma_{ij}^{k}$ must satisfy).
The output layer has as many nodes as there are classes. The fourth and final layer is essentially a MAXNET [9]. The topology of the PPFNN is given in Figure 1.
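As a rough illustration of the four-layer computation described in this section, the following Python sketch evaluates one pattern. It is only a sketch under stated assumptions: the function name `ppfnn_forward` and the array layout (one row of `centers` and one row of `weights` per hidden node) are hypothetical, and the MAXNET subnet is replaced by a plain argmax, which yields the same winner-take-all result without modeling the MAXNET dynamics.

```python
import numpy as np

def ppfnn_forward(x, centers, weights, alpha):
    """Return (winner-take-all output, clipped class scores) for one pattern x.

    centers: (H, d) array, one potential-function center per hidden node.
    weights: (H, M) array, hidden-to-output weights for M classes.
    alpha:   spread parameter of the exponential potential function.
    """
    x = np.asarray(x, dtype=float)
    centers = np.asarray(centers, dtype=float)
    weights = np.asarray(weights, dtype=float)
    # Hidden layer: exp(-alpha * ||x - x_k||^2) for every hidden node center x_k.
    hidden = np.exp(-alpha * np.sum((centers - x) ** 2, axis=1))
    # Output layer: weighted sum passed through a saturating non-linearity on [0, 1].
    scores = np.clip(hidden @ weights, 0.0, 1.0)
    # Stand-in for MAXNET (assumption): 1 for the largest score, 0 elsewhere.
    one_hot = np.zeros_like(scores)
    one_hot[np.argmax(scores)] = 1.0
    return one_hot, scores
```

Because each hidden node depends only on the input pattern and its own center, the hidden activations are computed here in one vectorized step, which mirrors the per-layer parallelism of the architecture.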
Input layer nodes simply carry the pattern values to hidden layer nodes. Hidden layer nodes implement a function of the form $\exp(-\alpha \|x - x_k\|^2)$, where α is a spread parameter of the exponential function centered at $x_k$. Nodes in the output layer are connected to hidden layer nodes through modifiable weights calculated during training. Output layer nodes sum the incoming weighted signals and pass the weighted sum through a non-linearity which outputs the actual value of the signal if the signal value belongs to the interval [0,1], outputs a 0 if the signal value is less than 0 and outputs a 1 if the signal value is greater than 1. The fourth and final layer is a subnet, MAXNET, which chooses the node with the highest input excitation value and sets its output to 1, while setting the outputs of the remaining nodes to zero. Each node in the MAXNET layer is connected to only a single node in the output layer without any weighting.
2.2. Training Procedure
Network parameters to be set include α, whose value needs to be determined heuristically based on empirical observations on the actual data set whenever possible, and the number of nodes in all four layers of the network.
The number of nodes in the pattern layer is equal to the dimension of the feature vectors. The hidden layer will initially have one node centered around a randomly chosen training pattern. There will be M nodes in the output layer, one for each of the M classes. Nodes in the output layer are connected to the hidden layer nodes through modifiable weights, $-\gamma_{ij}^{k}$ or $+\gamma_{ij}^{k}$, depending on the class membership of the training pattern. The number of nodes in the MAXNET layer is equal to the number of nodes in the output layer. Each node in the MAXNET layer will be connected to a single node in the output layer with a fixed weight value of +1. The supervised training procedure for the neural network consists of the steps outlined in Figure 2.
Fig. 1: Topology of PPFNN.
2.3. Critique of the PPFNN Algorithm
The PPFNN algorithm theoretically holds the promise to satisfy all six design criteria listed in the Introduction. A brief explanation of each criterion, and of why and how the PPFNN meets it, follows. Training the network requires generating an output for the training pattern and simply adding a new hidden layer node if the network is not able to classify the training pattern correctly. The overall training step is straightforward and promises to be very fast. This fast training speed is likely to translate into real-time learning for most applications. The data flows from the pattern layer to the MAXNET layer unidirectionally, and computations associated with a specific node in a given layer can be performed in parallel with computations associated with the rest of the nodes in that layer.
The initial configuration of the network has a single node in the hidden layer. As the training progresses, new hidden layer nodes are added to the network as needed. Note that a compact cluster is likely to contribute one hidden layer node (or a couple more nodes, depending on the spread of the potential function as defined by the parameter α) if the cluster center (or any pattern sufficiently close to it) happens to be in the training set. Once a potential function (with proper spread) is placed for a training pattern belonging to a particular cluster, any other training pattern coming from that cluster will be classified correctly by the algorithm and hence there will not be a need to create a new hidden layer node. The implication is that, on average, only a small number of hidden layer nodes will need to be created for a compact pattern cluster, no matter how many patterns belong to that cluster. Therefore, the learning algorithm does have the potential to induce a minimal network topology, depending on how well the training patterns represent the actual class distributions.
The PPFNN algorithm implements incremental learning in the sense that learning does not disturb the existing network topology. Parameters computed during earlier learning cycles do not need to be recomputed each time a new training pattern is presented, since a new pattern (and the cluster it belongs to) is learned by simply adding a new hidden layer node and its associated connections while preserving the existing network topology.
Identification of an optimal value for the parameter α, which determines the spread of the potential functions, will require an understanding of the clustering properties of the training data. Because the potential function has the form $\exp(-\alpha \|x - x_k\|^2)$, its spread widens as α decreases: α should take a small value (wide spread) if the clusters are spread over a large region in the pattern space and a large value (narrow spread) if the clusters are tightly packed in the pattern space.
In the worst case, a new hidden layer node is likely to be created for each training pattern if the value of α is too large, which results in a very narrow spread for the potential function, while the training data clusters are spread wide. Even then, the resulting network topology will be no larger than that of the PNN. If the spread is set too wide (that is, α is too small) when the actual training data clusters are compact, two or more clusters belonging to two different classes might reside in the subspace where a specific potential function is placed. This is not a desirable situation, since the network will not be able to learn the distinction between those two clusters.
Fig. 2: Pseudocode for the PPFNN Algorithm
0. Initialize the PPFNN. Assume a value for α.
1. Present a new feature vector (index is k) and compute network output.
2. If the network classifies the vector correctly for each class,
      no action needed.
   Else
      A. Add a new hidden layer node (index is i),
      B. Center the potential function represented by the new hidden layer node around this vector, and
      C. Repeat for each class (index is j):
         If pattern belongs to the class and function, $f_j^k$, is positive,
            no action needed.
         Else if pattern does not belong to the class and function, $f_j^k$, is negative,
            no action needed.
         Else if pattern belongs to the class and function, $f_j^k$, is negative,
            connect output of hidden node i to the output node for class j through a weight of $+\gamma_{ij}^{k}$.
         Else if pattern does not belong to the class and function, $f_j^k$, is positive,
            connect output of hidden node i to the output node for class j through a weight of $-\gamma_{ij}^{k}$.
3. Repeat the procedure until all training patterns are processed.
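The training procedure of Figure 2 can be sketched in Python roughly as follows. This is a minimal sketch rather than the authors' implementation, and several details are assumptions: the class name `PPFNNSketch` and its methods are hypothetical, the network starts with no hidden nodes (so the first pattern presented creates the first node instead of a randomly chosen one), the weights follow the harmonic sequence 1/k, and a function value of exactly zero is treated as "not positive", a case the pseudocode leaves open.

```python
import numpy as np

class PPFNNSketch:
    """Illustrative PPFNN trainer; names and tie-handling are assumptions."""

    def __init__(self, n_classes, alpha):
        self.n_classes = n_classes
        self.alpha = alpha
        self.centers = []   # one center per hidden node
        self.weights = []   # one weight row (length n_classes) per hidden node

    def raw_outputs(self, x):
        """Weighted sums f_j(x) before the [0, 1] clipping non-linearity."""
        if not self.centers:
            return np.zeros(self.n_classes)
        diff = np.asarray(self.centers) - np.asarray(x, dtype=float)
        hidden = np.exp(-self.alpha * np.sum(diff ** 2, axis=1))
        return hidden @ np.asarray(self.weights)

    def predict(self, x):
        """Clip to [0, 1], then pick the largest output (argmax replaces MAXNET)."""
        return int(np.argmax(np.clip(self.raw_outputs(x), 0.0, 1.0)))

    def fit(self, X, y):
        for k, (x, label) in enumerate(zip(X, y), start=1):
            f = self.raw_outputs(x)
            gamma = 1.0 / k                      # harmonic sequence element
            w = np.zeros(self.n_classes)
            for j in range(self.n_classes):
                if j == label and f[j] <= 0:     # own class, function not positive
                    w[j] = +gamma
                elif j != label and f[j] > 0:    # other class, function positive
                    w[j] = -gamma
            if np.any(w != 0):                   # misclassified for some class
                self.centers.append(np.array(x, dtype=float))
                self.weights.append(w)
        return self
```

A single pass over the data adds a hidden node only when some class function has the wrong sign, so earlier nodes and weights are never modified. With well-separated compact clusters and a suitable α, this sketch would be expected to add roughly one node per cluster.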
3. Conclusions
The Probabilistic Potential Function Neural Network is a stochastic decision making algorithm designed for classification tasks which have sufficiently large overlap between class distributions. The proposed algorithm promises the following desired properties: fast training, the ability to form class partition boundaries for cases where classes are multi-modally distributed and formed from a set of disjoint pattern clusters, no need for an initial guess of the network topology, dynamic adaptation to a minimal network topology for a particular problem, incremental learning, and the implementation of Bayes optimal decision surfaces. The parameter α, which determines the spread of the potential functions placed around a subset of the training patterns, needs to be set initially. In most cases, the value of this parameter will depend on the clustering properties of the training data. A cursory analysis of the training data is likely to lead to a near-optimal value for the parameter α.
Acknowledgements
The authors gratefully acknowledge the support for Dr. Gursel Serpen by AFOSR under SFR Program.
References
[1] K. J. Cios, and N. Liu, “Machine Learning in Generation of a Neural Network Architecture: A Continuous ID3 Approach,” IEEE Transactions on Neural Networks, Vol. 3, No. 2, pp. 280-291, 1992.
[2] G. Cybenko, “Approximation by Superpositions of Sigmoidal Functions,” Mathematics of Control, Signals and Systems, No. 2, pp. 303-314, 1989.
[3] K. Funahashi, “On the Approximate Realization of Continuous Mappings by Neural Networks,” Neural Networks, No. 2, pp. 183-192, 1989.
[4] K. Hornik, M. Stinchcombe & H. White, “Multilayer Feedforward Networks are Universal Approximators,” Neural Networks, Vol. 2, pp. 359-366, 1989.
[5] Y. Ito, “Approximation of Continuous Functions on Rd by Linear Combinations of Shifted Rotations of a Sigmoid Function with and without Scaling,” Neural Networks, Vol. 5, pp. 105-115, 1992.
[6] T. Kohonen, “Improved Versions of Learning Vector Quantization,” IJCNN Proceedings, Vol. 1, pp. 545-550, 1991.
[7] M. Leshno, V. Ya. Lin, A. Pinkus, & S. Schocken, “Multilayer Feedforward Networks with Nonpolynomial Activation Function can Approximate any Function,” Neural Networks, Vol. 6, pp. 861-867, 1993.
[8] R. P. Lippmann, “Neural Networks, Bayesian a posteriori Probabilities, and Pattern Classification,” From Statistics to Neural Networks: Theory and Pattern Recognition Applications, NATO ASI Series, Vol. 136, pp. 83-104, 1994.
[9] Y-H. Pao, Adaptive Pattern Recognition and Neural Networks, Addison-Wesley Publishing Company, Reading: MA, 1989.
[10] E. Parzen, “On Estimation of a Probability Density Function and Mode,” Annals of Mathematical Statistics, Vol. 33, pp. 1065-1076, 1962.
[11] F. Poggio, “Regularization Theory, Radial Basis Functions and Networks,” From Statistics to Neural Networks: Theory and Pattern Recognition Applications, NATO ASI Series, Vol. 136, pp. 83-104, 1994.
[12] D. F. Specht, “Probabilistic Neural Networks for Classification, Mapping, or Associative Memory,” IJCNN, Vol. I, pp. 525-532, July 1987.
[13] J. T. Tou & R. C. Gonzalez, Pattern Recognition Principles, Addison-Wesley Publishing Company: Reading, Massachusetts, 1981.
[14] P. J. Werbos, “Links Between Artificial Neural Networks (ANN) and Statistical Pattern Recognition,” in Artificial Neural Networks and Statistical Pattern Recognition: Old and New Connections, I. K. Sethi and A. K. Jain (Editors), pp. 11-31, 1991.
[15] P. J. Werbos, The Roots of Backpropagation: From Ordered Derivatives to Neural Networks and Political Forecasting, John Wiley & Sons, Inc. New York, NY, 1994.
Appendix
Mathematical Model (see reference [13] for further details)
Let $p(\omega_i / x)$ be the probability that $x$ belongs to the class $\omega_i$, where $i$ is the index for classes. The stochastic decision making rule to identify the class the pattern belongs to is given by
$$x \in \omega_i \quad \text{if} \quad p(\omega_i / x) > p(\omega_j / x) \quad \forall j \neq i. \tag{A-1}$$
The conditional probabilities employed in the decision rule are obtained by the following formula:
$$p(\omega_i / x) \approx \tilde{f}_i^k(x) = \begin{cases} 0 & \text{if } -\infty < f_i^k(x) < 0 \\ f_i^k(x) & \text{if } 0 \le f_i^k(x) \le 1 \\ 1 & \text{if } 1 < f_i^k(x) < \infty \end{cases} \tag{A-2}$$
where $k$ indicates the $k$-th training pattern processed and $i$ is the index for classes. For a given class, the function $f^k(x)$ can be computed with the iterative formula given by
$$f^k(x) = f^{k-1}(x) \pm \gamma_k K(x, x_k) \tag{A-3}$$
where the coefficients $\gamma_k$, $k = 1, 2, \ldots$, can be obtained from the harmonic sequence $\{1/k\}$, $k = 1, 2, \ldots$ (see reference [13] for the conditions this coefficient must satisfy). The potential function, $K(x, x_k)$, for any sample pattern point $x_k$ is defined by
$$K(x, x_k) = \sum_{i=1}^{\infty} \lambda_i^2 \varphi_i(x) \varphi_i(x_k) \tag{A-4}$$
where $\lambda_i$, $i = 1, 2, \ldots$, are real numbers (not zero) chosen to make the potential function bounded for $x_k \in \bigcup_{j=1}^{M} \omega_j$, and $\varphi_i(x)$, $i = 1, 2, \ldots$, are orthonormal functions. A suitable choice for a potential function is of the form
$$K(x, x_k) = \exp\left[-\tfrac{1}{2}(x - x_k)^T C^{-1} (x - x_k)\right] \tag{A-5}$$
where $x_k$ is the mean vector (training pattern) and $C$ is the covariance matrix of the pattern class. If any two elements of the vector $x$ are statistically independent and the elements have the same variance, $\sigma^2$, then Equation A-5 can be algebraically manipulated to yield
$$K(x, x_k) = \exp\left[-\alpha \|x - x_k\|^2\right], \tag{A-6}$$
where $\alpha = 1/(2\sigma^2)$.
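The step from Equation A-5 to Equation A-6 can be made explicit. The short derivation below is a sketch which reads the stated conditions (independent elements with a common variance) as $C = \sigma^2 I$, where $I$ is the identity matrix; the numerical values at the end are illustrative only and do not come from any data set.

```latex
% Assumption: independent elements with common variance, i.e. C = \sigma^2 I.
\begin{align*}
C = \sigma^2 I \;&\Rightarrow\; C^{-1} = \frac{1}{\sigma^2} I, \\
(x - x_k)^T C^{-1} (x - x_k) &= \frac{1}{\sigma^2} (x - x_k)^T (x - x_k)
  = \frac{\|x - x_k\|^2}{\sigma^2}, \\
K(x, x_k) &= \exp\left[-\frac{\|x - x_k\|^2}{2\sigma^2}\right]
  = \exp\left[-\alpha \|x - x_k\|^2\right], \qquad \alpha = \frac{1}{2\sigma^2}.
\end{align*}
% Illustrative values: \sigma = 0.5 gives \alpha = 2, so a pattern at
% distance \|x - x_k\| = 1 from the center has K = e^{-2} \approx 0.135.
```

Since α is inversely proportional to $\sigma^2$, a larger α corresponds to a narrower potential function.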