MURDOCH RESEARCH REPOSITORY
http://dx.doi.org/10.1109/TENCON.1996.608775
Zaknich, A. and Attikiouzel, Y. (1996) Modified probabilistic
neural network hardware implementation schemes. In:
Proceedings of the 1996 IEEE Region 10 TENCON - Digital Signal
Processing Applications Conference, 26 - 29 November, Perth,
Western Australia, pp 167 - 172.
http://researchrepository.murdoch.edu.au/17957/
Copyright © 1996 IEEE
Personal use of this material is permitted. However, permission to reprint/republish
this material for advertising or promotional purposes or for creating new collective
Modified Probabilistic Neural Network Hardware Implementation Schemes
Anthony Zaknich, Member IEEE, Associate Member of AES
Yianni Attikiouzel, Senior Member IEEE, Fellow IEE, Fellow IEAust
Centre for Intelligent Information Processing Systems (CIIPS),
Department of Electrical and Electronic Engineering,
The University of Western Australia,
Nedlands 6907, Western Australia
Abstract
The modified probabilistic neural network for nonlinear time series analysis was developed and introduced in 1991. It effectively represents a simple family of clustering methods for reducing the size of Specht's general regression neural network while retaining all its benefits. Three hardware implementation schemes for the most basic form of the modified probabilistic neural network are described. The first is an optoelectronic implementation and the other two are Very Large Scale Integration designs: a virtual implementation and a fully parallel implementation.
The Modified Probabilistic Neural Network (MPNN) [1,2,3] is similar to Specht's General Regression Neural Network (GRNN) [4] and they are both closely related to the Probabilistic Neural Network (PNN) [5]. The MPNN and the GRNN are general regression or function mapping algorithms which can be implemented as three layer feed forward neural networks. They can be trained very easily and quickly and as such they can be very useful for general nonlinear signal processing applications.
If the yi are allowed to be individual real valued scalars, equation (1) becomes exactly Specht's GRNN, which incorporates each and every training vector pair {xi -> yi} into its architecture (xi is a single training vector in the input space and yi is the associated desired scalar output). If it can be assumed that there is only one centre in the input space per output yi, then a convenient general model to use for all forms of the MPNN and the GRNN is:

$$\hat{y}(x) = \frac{\sum_{i=1}^{M} Z_i \, y_i \, f_i(x - \mathrm{centx}_i, \sigma)}{\sum_{i=1}^{M} Z_i \, f_i(x - \mathrm{centx}_i, \sigma)} \qquad (2)$$

where:
$f_i(x - \mathrm{centx}_i, \sigma)$ is a Parzen radial basis function (RBF).
$\mathrm{centx}_i$ is the centre or mean vector for class i in the input space (real valued or quantised).
$\sigma$ is the single learning or smoothing parameter chosen during network training.
$y_i$ is the output related to $\mathrm{centx}_i$ (real valued or quantised).
$M$ is the number of unique centres i in the MPNN structure.
$Z_i$ is the number of input training vectors $x_j$ associated with $\mathrm{centx}_i$.
$NS = \sum_{i=1}^{M} Z_i$ is the total number of training vectors.
Equation (2) can be derived from the GRNN equation (1) through the following approximation:

$$\sum_{j=1}^{Z_i} f_i(x - x_j, \sigma) \approx Z_i \, f_i(x - \mathrm{centx}_i, \sigma) \qquad (3)$$

This is a reasonable approximation if the xj are close together in a relatively small local space and can be adequately represented by a single centre vector centxi. The key to the practical application of the general MPNN equation (2) is the method of selecting the yi and of grouping (clustering) the associated input vectors in each class i. Various methods for this have been developed previously by Zaknich et al [1,2]. Given that a satisfactory grouping has been determined for a particular nonlinear signal processing problem, equation (2) can be implemented as a general MPNN nonlinear signal processing machine.
The simplest Parzen RBF, and the one most convenient to use for hardware implementations, is defined by equation (4):

$$f_i(x - \mathrm{centx}_i, \sigma) = \begin{cases} 1, & \text{when all } |x_k - \mathrm{centx}_{ik}| \le \sigma \\ 0, & \text{else} \end{cases} \qquad (4)$$

where $x_k$ are the elements of the p-dimensional vector $x = (x_1, x_2, \ldots, x_k, \ldots, x_p)^T$.
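A minimal software sketch of equations (2) and (4) may help make the data flow concrete before the hardware schemes are described. It assumes NumPy. The function names and the toy centres, outputs and counts are hypothetical illustrations, and returning None when no centre matches is an assumption, since the text does not say how that case is handled.

```python
import numpy as np

def tophat_rbf(x, centres, sigma):
    """Equation (4): 1 where every |x_k - centx_ik| <= sigma, else 0."""
    return np.all(np.abs(centres - x) <= sigma, axis=1).astype(float)

def mpnn_output(x, centres, y, z, sigma):
    """Equation (2): Z-weighted average of the outputs y_i of matching centres.

    Returns None when no centre matches (zero denominator); this fallback
    is an assumption made for the sketch.
    """
    f = tophat_rbf(np.asarray(x, dtype=float), centres, sigma)
    denom = np.sum(z * f)
    if denom == 0:
        return None
    return float(np.sum(z * y * f) / denom)

# Hypothetical toy network: M = 3 centres in a p = 2 dimensional input space.
centres = np.array([[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]])
y = np.array([10.0, 20.0, 50.0])   # output y_i associated with each centre
z = np.array([3.0, 1.0, 2.0])      # Z_i, training vectors merged into each centre

result = mpnn_output([0.5, 0.5], centres, y, z, sigma=0.6)
print(result)  # (3*10 + 1*20) / (3 + 1) = 12.5
```

Only the first two centres lie within sigma of the input in every dimension, so the output is the average of their yi values weighted by Zi.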
The main advantage that a suitable neural network hardware implementation can offer is the ability to directly implement the parallelism and concurrency of the neural network algorithm. According to Farhat [6] there are two broad directions in which hardware implementation research seems to be going. The first is toward connection machines or general-purpose parallel computers in which a large number of digital central processing units are interconnected to perform the parallel computations in Very Large Scale Integration (VLSI) hardware. Some of the more standard Digital Signal Processing (DSP) processors and transputers, or special processors designed with parallelisation in mind, can be used for these machines [7]. The other is toward analog hardware where a large number of simple processing units or neurones are connected through modifiable weights such that their phase-space dynamic behaviour has useful signal processing functions associated with it. The latter approach is dominated by analog optoelectronic hardware implementations because optics offers massive interconnectivity and parallelism while the electronics side offers flexibility, high gain and decision making. Ultimately, a full optical solution would be the most desirable and could end up being the winning approach. However, according to other researchers [8], CMOS digital VLSI technology is also promising because of its outstanding success with integration scale and ultra-low power dissipation in computer logic and memory devices. This paper offers very basic MPNN designs which are based on optoelectronic, virtual and fully parallel VLSI electronics technologies.

Optoelectronic Implementation
Optoelectronic implementations are typically analog in nature. Farhat [6,9] introduces a few basic optoelectronic devices which can be used to develop the MPNN. This can be done easily because the MPNN is much less demanding than other neural networks in its implementation requirements. The basic building blocks include the light emitting array (LEA), the 2-D spatial light modulator (SLM), the photo diode array (PDA) and the anamorphic lens system (cylindrical and spherical lenses in tandem). Amongst other things they can be used to perform dot products on 2-D arrays of weight vectors or, in our case, training vector centres.
Since light signals vary from zero to some upper positive intensity it is most convenient to use unsigned positive arithmetic for signal representation. The negative signal components are dealt with by introducing a fixed positive bias to
make all signal components positive for processing. The appropriate bias is then subtracted from the final result to restore the negative parts if necessary.
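As a brief check of why the bias can simply be subtracted at the output, suppose (as an assumption for illustration; the symbol b does not appear in the original) that every stored output yi is offset by the same fixed bias b. Because equation (2) is a weighted average, the offset passes through unchanged, writing $f_i$ for $f_i(x - \mathrm{centx}_i, \sigma)$:

$$\frac{\sum_{i=1}^{M} Z_i (y_i + b) f_i}{\sum_{i=1}^{M} Z_i f_i} = \frac{\sum_{i=1}^{M} Z_i y_i f_i}{\sum_{i=1}^{M} Z_i f_i} + b = \hat{y}(x) + b$$

This covers only the output offset. The effect of the bias on the normalised input vectors is not addressed by this identity.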
A basic optoelectronic implementation design is depicted in Figure 1. For simplicity of representation the vector xi is taken to be equivalent to centxi. The positive analog signal (original signal plus DC bias) passes through an analog delay line having (p-1-ext) taps. This is added to ext tap values from an output analog delay line, forming a (p-1)-dimensional positive valued input vector. The input vector needs to be normalised to have unity magnitude for reasons specified later. However, to preserve the original signal energy scaling, an energy feature is added to the input vector before normalisation. The extra energy feature can be computed as follows:
All vectors x and xi are normalised to unit length, ie. ||x|| = 1 and ||xi|| = 1, by the following formulation:

$$x = \frac{x}{\sqrt{x \cdot x}} \ \text{ if } x \cdot x > 0, \text{ and zero if } x \cdot x = 0.$$

The resulting unity magnitude x vector has a dimension of p. It is necessary to do this unity normalisation to be able to exploit the following relationship:

$$(x - x_i)^T (x - x_i) = 2 - 2\, x \cdot x_i = r^2$$

where r is the Euclidean distance from vector x to vector xi.
If an RBF similar to equation (4) is to be used, a maximum radius of σ around the input vector x can be defined by specifying an equivalent dot product threshold as follows:

$$x \cdot x_i = \frac{2 - \sigma^2}{2} = \text{threshold}$$

All vectors xi producing dot products having values greater than or equal to the threshold value can be said to be within a Euclidean distance σ of x. This is a very useful result since it is very easy to build a dot product function with optical components.
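The equivalence between the dot product threshold and the distance test can be checked numerically. The sketch below assumes NumPy; the function names, the vector dimension of 8, the value σ = 0.5 and the random test data are all hypothetical choices made for illustration.

```python
import numpy as np

def normalise(v):
    """Scale v to unit length; the zero vector is returned unchanged."""
    v = np.asarray(v, dtype=float)
    energy = np.dot(v, v)
    return v / np.sqrt(energy) if energy > 0 else v

def within_radius_by_dot(x, xi, sigma):
    """Dot product test: x.xi >= (2 - sigma^2) / 2."""
    threshold = (2.0 - sigma ** 2) / 2.0
    return np.dot(x, xi) >= threshold

def within_radius_by_distance(x, xi, sigma):
    """Direct test: Euclidean distance r <= sigma."""
    return np.linalg.norm(x - xi) <= sigma

rng = np.random.default_rng(0)
sigma = 0.5
agree = True
for _ in range(1000):
    # Positive valued vectors, as produced after adding the DC bias.
    x = normalise(rng.uniform(0.0, 1.0, size=8))
    xi = normalise(rng.uniform(0.0, 1.0, size=8))
    r2 = np.dot(x - xi, x - xi)
    # The identity r^2 = 2 - 2 x.xi holds for unit length vectors.
    agree &= bool(np.isclose(r2, 2.0 - 2.0 * np.dot(x, xi)))
    # Both membership tests should select the same vectors.
    agree &= bool(within_radius_by_dot(x, xi, sigma)
                  == within_radius_by_distance(x, xi, sigma))
print(agree)
```

The two tests are algebraically identical only for unit length vectors, which is why the normalisation step cannot be skipped.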
The elements of the x vector are fed to the first LEA, which produces light beam intensities in proportion to the element magnitudes. These beams are then directed with a lens system onto a two-dimensional SLM encoded with the vectors xi. Each of the light beams from the elements of x is directed at all the respective cells of the SLM representing the same elements in each xi.
The light passing through the SLM is then focused onto a PDA with another lens system such that the resulting outputs from the PDA represent all the dot products x.xi. The output current from each of the M photo diodes (PDs) is fed to its own comparator with a common threshold setting. If the threshold is exceeded the comparator turns on, else it stays off. Each of the activated comparators turns on the respective cell of the following LEA, which produces a light beam of fixed intensity while the other cells remain off. At this stage the M light beams from the last LEA are simultaneously directed at two vertical one-dimensional SLMs, each with M cells. One of them has the Zi yi information encoded in it and the other the Zi information. The light passing through each SLM is focused onto a single PD, producing the two summations Σ Zi yi fi(x) and Σ Zi fi(x). Since the RBF is a flat top hat type, the summations involve simple accumulations of those quantities which are within the required distance of x. Other RBFs can be accommodated by simply modulating the light intensities from the last LEA as a function of r or the dot product x.xi. The summations are finally divided to produce the output ŷ(x).

As mentioned before, the DC bias can be removed from the result at this stage if necessary.

VLSI Implementation
VLSI technology is quite satisfactory for the implementation of two basic MPNN hardware designs using equation (2) and equation (4). The first design is a virtual digital design and the second a fully parallel design. The RBF of equation (4) selects training centres which are within a specified city block of the input vector x and applies a straight linear weighted average to them to compute the output ŷ(x).
Although this approach is the least accurate, it is the most convenient to implement in hardware. If the training cluster centres are kept close, the accuracy can be improved at the expense of increasing network size. [...]ly the MPNN memory and [...] are the most crucial ones. [...] that the memory and systems control and the σ optimiser functions are performed by a standard serial type host computer, since these are the least time critical functions if the adaptive operation is not so important. If all the training and cluster centre allocations are done by the host computer, which then downloads the training to the hardware, a minimal hardware size and complexity can be achieved.
A virtual MPNN machine design is best done with digital circuitry. It has a single central processing unit (CPU) fed by a random access memory (RAM) which holds the network parameters. For each input vector x the machine cycles through the whole occupied memory from i = 1 to i = M, feeding the relevant memory values to the processor. The processor computes and accumulates the ith sub[...] then produces an [...]
The hardware system is designed to work with unsigned binary integer words with values from 0 to (2^bits - 1), where bits equals the maximum number of word bits. Zero is represented by 2^(bits-1), positive values range between 2^(bits-1) and (2^bits - 1), and negative values range between (2^(bits-1) - 1) and 0. As indicated in the diagram above, the RAM is addressable from i = 1 to i = M, and at each address location i the training vector centre xi, the number of ith training vectors Zi, and the product of Zi times the desired output yi are stored. These parameters are computed beforehand by the host computer using any suitable method and then simply downloaded to their respective RAM locations. Centres xi are computed as the mean rounded integer values of the ith input training vectors and the outputs yi are the means of their associated desired outputs. The xi [...]ally the same or a little [...] device. For example, if the [...] bit device then the [...] need not be larger than 8 bits. The Zi RAM size needs to be large enough to take the largest expected number of training vectors and the Zi yi RAM big enough to take the highest Zi times the maximum yi value. Typically the output DAC size is the same as the input ADC size, which is also the maximum yi size.
[...] number of word bits as the number of upper bits plus the number of lower bits, ie. bits = (bitsupper + bitslower). The upper bits of each of the elements of the vectors xi and x are compared using a logic AND gate. If all the upper bits match then they can be said to be within the same city block, where the city block size is defined by the lower bits, which also defines the σ. For example, if the 3 lower bits of the word are excluded from the matching then the city block size is 111binary = 7decimal and the matching vectors can be said to be within 7 or closer in any of their dimensions. This is a good method provided it is acceptable to allocate values for σ as only powers of 2, ie. 0, 2, 4, 8, 16 etc., which correspond to 0, 1, 2, 3, 4 etc. numbers of lower bits respectively. Since acceptable σ's tend to be fairly broad ranging this is not a serious limitation. On average this approach should produce acceptable results, but some individual results can be biased if the particular x vector elements have lower bit values which are greater than half the city block distance as defined above. For example, if an element of x has the lower bits 111binary it will only match with centres having lower bits less than or equal to 111binary, and those centres above that but still close (eg. 1[...]binary) will be ignored. This can be fixed to some extent if the following logic is included. If the upper bit of the lower bits of an element of x is a 0 then apply the method as normal. If it is a 1 then reduce the number of upper matching bits by 1 (ie. increase the lower bits by 1). This ensures that sufficient matching vectors either side of x are included, but the effective σ is now variable by 1 bit. Whichever approach is used, the value of σ is set by preselecting the number of lower memory bits for each element of the x vectors which are ignored by the AND gate comparator. When the AND function determines a match, the gate passing the Zi and Zi yi values to the accumulators is activated. The optimal σ can be determined quite quickly by the host computer by running some real data through the system for test values of σ = 0, σ = 2, σ = 4 etc. until a satisfactory mean squared error (mse) is achieved during training.
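The upper bit matching rule and its correction can be sketched in software as follows. This is only an illustration under stated assumptions: shift and compare operations stand in for the gate-level comparator, the function names are hypothetical, and the 8 bit example values are invented.

```python
def upper_bits_match(x, centre, lower_bits):
    """Basic comparator: every element must agree in its upper bits."""
    return all((xk >> lower_bits) == (ck >> lower_bits)
               for xk, ck in zip(x, centre))

def adaptive_match(x, centre, lower_bits):
    """Variant with the correction described in the text: when the top bit
    of the ignored lower bits of an element of x is 1, one more bit is
    ignored for that element."""
    for xk, ck in zip(x, centre):
        ignored = lower_bits
        if lower_bits > 0 and (xk >> (lower_bits - 1)) & 1:
            ignored += 1
        if (xk >> ignored) != (ck >> ignored):
            return False
    return True

# Hypothetical 8 bit example with the 3 lower bits ignored.
x = [0b00010111, 0b01100100]        # 23, 100
centre = [0b00011000, 0b01100101]   # 24, 101

print(upper_bits_match(x, centre, 3))  # False: 23 and 24 fall in different blocks
print(adaptive_match(x, centre, 3))    # True: one extra bit is ignored for x
```

The correction does not catch every near neighbour. For instance, 31 and 32 still differ in their upper bits after one more bit is ignored, which is consistent with the text's remark that the problem is fixed only to some extent.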
The final output value ŷ(x) is computed at the end of the cycle by dividing the final value of the Zi yi accumulator by the final value of the Zi accumulator. The accumulator sizes must be sufficient to accept the highest expected totals to avoid overflow; otherwise, overflow circuitry must be included. The highest expected Zi yi accumulator value is +/- (M maximum(Zi) maximum(yi)) and the highest Zi accumulator value is +/- (M maximum(Zi)). Finally, the divider must be able to handle these highest inputs to result in peak values of ŷ(x) close to +/- (maximum(yi)).
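The worst-case totals above translate directly into accumulator word lengths. The following sketch is an assumption-laden illustration: the function name, the choice of one extra bit for the sign, and the example network size are all hypothetical.

```python
def accumulator_bits(m, max_z, max_y):
    """Word lengths (magnitude bits plus one assumed sign bit) for the
    Zi*yi and Zi accumulators, from the worst-case totals
    M*max(Zi)*max(yi) and M*max(Zi)."""
    zy_peak = m * max_z * max_y
    z_peak = m * max_z
    return zy_peak.bit_length() + 1, z_peak.bit_length() + 1

# Hypothetical sizing: 1024 centres, up to 255 training vectors per centre,
# and |yi| up to 127.
zy_bits, z_bits = accumulator_bits(m=1024, max_z=255, max_y=127)
print(zy_bits, z_bits)  # 26 19
```

Under these assumed figures the Zi yi accumulator already needs 26 bits, which illustrates how quickly the totals grow with network size.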
A fairly easy way to implement any virtual MPNN design is to use a standard DSP chip family such as the Texas Instruments TMS320C30 or TMS320C40 for the CPU and the memory. These types of chips have the standard multiply and accumulate functions needed by the MPNN and they can also perform other arithmetic functions such as division. If this type of implementation is not computationally fast enough with one DSP chip, then an easy solution would be to break the MPNN into parallel sections and implement them with virtual machines similar to that described above but without their final divider stages. The subtotals from these machines can be transferred to external accumulation and division circuitry to compute the final output. This strategy would introduce very little extra delay. Due to the likelihood of amassing very large final accumulator values, it would probably be best to consider a floating point CPU design for very large network sizes.

A Parallel VLSI Hardware Design
Although the virtual design can be parallelised to achieve faster throughput there may be some applications that require much faster throughput than can be easily accommodated. In that case the fastest implementation requires a fully parallel hardware design. The parallel hardware design is similar to the virtual design except for the fact that for each new x input all the filter computations are done simultaneously in parallel hardware rather than via a
CPU in a computation cycle. Figure 3 shows the main elements of the design.
The memory, comparator and divider parts can be implemented with digital VLSI technology quite easily; however, the parallel accumulators may be better implemented in analog form. Either way the design is quite simple using the RBF according to equation (4). The xi comparators are simply AND gates with appropriate input buffers fed by the high bits of all the elements of the vector x. The value of xi is programmed into the comparator hardware by setting the input buffers to be inverting or non inverting to match the correct binary bit code, ie. inverting = logic 0 while non inverting = logic 1. Clearly only those comparators with the correct bit match at their input will output logic 1's, which enable the appropriate Zi and Zi yi memories to feed into the parallel accumulators. As before, the value of σ is set by preselecting the number of lower memory bits for each element of the x vectors which are ignored by the AND gate comparator.

Conclusions
It has been shown how it is possible to easily implement the simplest forms of the GRNN and MPNN algorithms specifically for general nonlinear signal processing applications in real-time with optoelectronic and VLSI hardware technologies. These designs can be extended for more complex RBFs which can produce better results with fewer training vectors or centres. However, equation (4) can produce quite satisfactory results if comparatively more training vectors or centres are used to compensate for the poorer RBF.
References
[1] Zaknich, Anthony and Attikiouzel, Yianni, "Time Series Characterisation Schemes for the Modified Probabilistic Neural Network", Australian Journal of Intelligent Information Processing Systems, Vol. 2, No. 2, Winter 1995, pp. 1-11.
[2] Zaknich, Anthony, deSilva, Christopher and Attikiouzel, Yianni, "The probabilistic neural network for nonlinear time series analysis", IEEE International Joint Conference on Neural Networks (IJCNN), Singapore, 17-21st November 1991, pp. 1530-1535.
[3] Zaknich, Anthony and Attikiouzel, Yianni, "Automatic optimization of the modified probabilistic neural network for pattern recognition and time series analysis", Proceedings of the First Australian and New Zealand Conference on Intelligent Information Systems, Perth, Western Australia, 1-3rd December, 1993, pp. 152-156.
[4] Specht, D. F., "A general regression neural network", IEEE Transactions on Neural Networks, Vol. 2, No. 6, November 1991, pp. [...]
[5] Specht, D. F., "Probabilistic neural networks", Neural Networks, Vol. 3, 1990, pp. 109-118.
[6] Farhat, N. H., "Optoelectronic neural networks and learning machines", IEEE Circuits and Devices Magazine, September, 1989, pp. 32-41.
[7] Atlas, L. E., Suzuki, Y., "Digital systems for artificial neural networks", IEEE Circuits and Devices Magazine, November 1989, pp. 20-24.
[8] Masaki, A., Hirai, Y. and Yamada, M., "Neural networks in CMOS: A case study", IEEE Circuits and Devices Magazine, July 1990, pp. 12-17.
[9] Farhat, N. H., "Optoelectronic analogs of self-programming neural nets: architecture and methodologies for implementing fast stochastic learning by simulated annealing", Applied Optics, Vol. 26, No. 23, December 1987, pp. 5093-5103.
FIGURE 1: MPNN Optoelectronic Design. (Diagram labels: analog input signal; analog delay line, p-1-ext taps long; 1-D output delay line, ext taps long.)
FIGURE 2: MPNN Virtual Digital VLSI Hardware Design. (Diagram labels: 1-D delay line, p-ext taps long; memory, i = 1 to i = M; index through memory; repeat cycle for each x input; enable at each memory index; enable after each cycle; AND gate; CPU.)
FIGURE 3: MPNN VLSI Parallel Hardware Design. (Diagram labels: 1-D delay line, p-ext taps long; x vector, dimension p; 1-D output delay line, ext taps long; divider. Notes: only the memories enabled by the comparators are accumulated; a comparator activates when the upper bits of x and xi match; σ = lower bits size.)