Modified probabilistic neural network hardware implementation schemes

MURDOCH RESEARCH REPOSITORY

http://dx.doi.org/10.1109/TENCON.1996.608775

Zaknich, A. and Attikiouzel, Y. (1996) Modified probabilistic neural network hardware implementation schemes. In: Proceedings of the 1996 IEEE Region 10 TENCON - Digital Signal Processing Applications Conference, 26 - 29 November, Perth, Western Australia, pp 167 - 172.

http://researchrepository.murdoch.edu.au/17957/

Copyright © 1996 IEEE

Personal use of this material is permitted. However, permission to reprint/republish

this material for advertising or promotional purposes or for creating new collective

1996 IEEE TENCON - Digital Signal Processing Applications

Modified Probabilistic Neural Network Hardware Implementation Schemes

Anthony Zaknich, Member IEEE, Associate Member of AES

Yianni Attikiouzel, Senior Member IEEE, Fellow IEE, Fellow IEAust

Centre for Intelligent Information Processing Systems (CIIPS), Department of Electrical and Electronic Engineering, The University of Western Australia, Nedlands 6907, Western Australia

Abstract

The modified probabilistic neural network for nonlinear time series analysis was developed and introduced in 1991. It effectively represents a simple family of clustering methods for reducing the size of Specht's general regression neural network while retaining all its benefits. Three hardware implementation schemes for the most basic form of the modified probabilistic neural network are described. The first is an optoelectronic implementation and the other two are Very Large Scale Integration designs: a virtual implementation and a fully parallel implementation.

The Modified Probabilistic Neural Network (MPNN) [1,2,3] is similar to Specht's General Regression Neural Network (GRNN) [4], and both are closely related to the Probabilistic Neural Network (PNN) [5]. The MPNN and the GRNN are general regression or function mapping algorithms which can be implemented as three-layer feedforward neural networks. They can be trained very easily and quickly, and as such they can be very useful for general nonlinear signal processing applications.

If the yi are allowed to be individual real valued scalars, equation (1) becomes exactly Specht's GRNN, which incorporates each and every training vector pair {xi -> yi} into its architecture (xi is a single training vector in the input space and yi is the associated desired scalar output). If it can be assumed that there is only one centre in the input space per output yi, then a convenient general model to use for all forms of the MPNN and the GRNN is:

$$\hat{y}(x) = \frac{\sum_{i=1}^{M} Z_i\, y_i\, f_i\big((x - \mathrm{centx}_i), \sigma\big)}{\sum_{i=1}^{M} Z_i\, f_i\big((x - \mathrm{centx}_i), \sigma\big)} \qquad (2)$$

where:

$f_i(\cdot)$ is a Parzen radial basis function (RBF).

$\mathrm{centx}_i$ is the centre or mean vector for class i in the input space (real valued or quantised).

$\sigma$ is the single learning or smoothing parameter chosen during network training.

$y_i$ is the output related to $\mathrm{centx}_i$ (real valued or quantised).

$M$ is the number of unique centres i in the MPNN structure.

$Z_i$ is the number of input training vectors $x_j$ associated with $\mathrm{centx}_i$.

$NS = \sum_{i=1}^{M} Z_i$ is the total number of training vectors.

Equation (2) can be derived from the GRNN equation (1) through the following approximation:

$$\sum_{j=1}^{Z_i} f_i\big((x - x_j), \sigma\big) \approx Z_i\, f_i\big((x - \mathrm{centx}_i), \sigma\big) \qquad (3)$$

This is a reasonable approximation if the xj are close together in a relatively small local space and can be adequately represented by a single centre vector centxi. The key to the practical application of the general MPNN equation (2) lies in the method of selecting the yi and grouping (clustering) the associated input vectors in each class i. Various methods for this have been developed previously by Zaknich et al. [1,2]. Given that a satisfactory grouping has been determined for a particular nonlinear signal processing problem, equation (2) can be implemented as a general MPNN nonlinear signal processing machine.

The simplest Parzen RBF, and the most convenient to use for hardware implementations, is defined by equation (4).

$$f_i\big((x - \mathrm{centx}_i), \sigma\big) = \begin{cases} 1, & \text{when all } |x_k - \mathrm{centx}_{ik}| \le \sigma \\ 0, & \text{else} \end{cases} \qquad (4)$$

where:

$x_k$ are the elements of the p-dimensional vector $x = (x_1, x_2, \ldots, x_k, \ldots, x_p)^T$.
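As an illustration only, the weighted-average model and the flat top hat RBF can be sketched in a few lines of Python. The sketch assumes the output is the sum of Zi yi fi(x) divided by the sum of Zi fi(x), as described for the hardware designs, and that the RBF returns 1 when every element of x is within σ of the centre. The function names, the sample data and the `None` result for an input with no matching centre are assumptions of this sketch, not part of the published designs.

```python
from typing import Optional, Sequence


def top_hat_rbf(x: Sequence[float], centre: Sequence[float], sigma: float) -> int:
    """Flat top hat RBF: 1 when every |x_k - centre_k| <= sigma, else 0."""
    return int(all(abs(xk - ck) <= sigma for xk, ck in zip(x, centre)))


def mpnn_output(x, centres, y, z, sigma) -> Optional[float]:
    """Ratio of the sum of Z_i*y_i*f_i(x) to the sum of Z_i*f_i(x)."""
    num = 0.0
    den = 0.0
    for c_i, y_i, z_i in zip(centres, y, z):
        f_i = top_hat_rbf(x, c_i, sigma)
        num += z_i * y_i * f_i
        den += z_i * f_i
    # No centre in range: the ratio is undefined. Returning None is an
    # assumption of this sketch; the paper does not specify this case.
    return num / den if den else None


# Hypothetical data: three centres with outputs y and vector counts z.
centres = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
y = [10.0, 20.0, 99.0]
z = [3, 1, 4]
print(mpnn_output((0.5, 0.5), centres, y, z, sigma=0.6))  # prints 12.5
```

Only the first two centres lie within σ = 0.6 of the input in every dimension, so the result is (3 x 10 + 1 x 20) / (3 + 1).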

The main advantage that a suitable neural network hardware implementation can offer is the ability to directly implement the parallelism and concurrency of the neural network algorithm. According to Farhat [6] there are two broad directions in which hardware implementation research seems to be going. The first is toward connection machines or general-purpose parallel computers, in which a large number of digital central processing units are interconnected to perform the parallel computations in Very Large Scale Integration (VLSI) hardware. Some of the more standard Digital Signal Processing (DSP) processors and transputers, or special processors designed with parallelisation in mind, can be used for these machines [7]. The other is toward analog hardware, where a large number of simple processing units or neurones are connected through modifiable weights such that their phase-space dynamic behaviour has useful signal processing functions associated with it. The latter approach is dominated by analog optoelectronic hardware implementations because optics offers the massive interconnectivity and parallelism while the electronics side offers the flexibility, high gain and decision making. Ultimately, a full optical solution would be the most desirable and could end up being the winning approach. However, according to other researchers [8], CMOS digital VLSI technology is also promising because of its outstanding success with integration scale and ultra-low power dissipation in computer logic and memory devices. This paper offers very basic MPNN designs which are based on optoelectronic, virtual and fully parallel VLSI electronics technologies.

Optoelectronic Implementation

Optoelectronic implementations are typically analog in nature. Farhat [6,9] introduces a few basic optoelectronic devices which can be used to develop the MPNN. This can be done easily because the MPNN is much less demanding than other neural networks in its implementation requirements. The basic building blocks include the light emitting array (LEA), the 2-D spatial light modulator (SLM), the photo diode array (PDA) and the anamorphic lens system (cylindrical and spherical lenses in tandem). Amongst other things they can be used to perform dot products on 2-D arrays of weight vectors or, in our case, training vector centres.

Since light signals vary from zero to some upper positive intensity, it is most convenient to use unsigned positive arithmetic for signal representation. The negative signal components are dealt with by introducing a fixed positive bias to make all signal components positive for processing. The appropriate bias is then subtracted from the final result to restore the negative parts if necessary.

A basic optoelectronic implementation design is depicted in Figure 1. For simplicity of representation the vector xi is taken to be equivalent to centxi. The positive analog signal (original signal plus DC bias) passes through an analog delay line having (p-1-ext) taps. This is added to ext tap values from an output analog delay line, forming a (p-1)-dimensional positive valued input vector. The input vector needs to be normalised to have a unity magnitude for reasons specified later. However, to preserve the original signal energy scaling, an energy feature is added to the input vector before normalisation. The extra energy feature can be computed as follows:

All vectors x and xi are normalised to unit length, ie. ||x|| = 1 and ||xi|| = 1, by the following formulation:

$$x = \frac{x}{\sqrt{x \cdot x}}, \text{ if } x \cdot x > 0, \text{ and zero if } x \cdot x = 0.$$

The resulting unity magnitude x vector has a dimension of p. It is necessary to do this unity normalisation to be able to exploit the following relationship:

$$(x - x_i)^T (x - x_i) = (2 - 2\, x \cdot x_i) = r^2$$

where r is the Euclidean distance from vector x to vector xi.
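The relationship follows in one line once both vectors have unit length. A short worked expansion, using only the unit-norm assumption stated above:

```latex
\begin{aligned}
(x - x_i)^T (x - x_i) &= x^T x - 2\, x^T x_i + x_i^T x_i \\
                      &= 1 - 2\, x \cdot x_i + 1 \\
                      &= 2 - 2\, x \cdot x_i = r^2
\end{aligned}
```

A larger dot product therefore always means a smaller Euclidean distance, which is why a distance test can be replaced by a dot product test.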

If an RBF similar to equation (4) is to be used, a maximum radius of σ around the input vector x can be defined by specifying an equivalent dot product threshold as follows:

$$x \cdot x_i = \frac{2 - \sigma^2}{2} = \text{threshold}$$

All vectors xi producing dot products having values greater than or equal to the threshold value can be said to be within a Euclidean distance σ of x. This is a very useful result since it is very easy to build a dot product function with optical components.
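The equivalence between the distance test and the dot product test can be checked numerically. The following Python sketch normalises two vectors to unit length and applies the threshold (2 - σ²) / 2; the function names and the sample vectors are illustrative assumptions.

```python
import math


def normalise(v):
    """Scale v to unit length; the zero vector is returned unchanged."""
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v] if norm > 0 else list(v)


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(p * q for p, q in zip(a, b))


def within_sigma(x, xi, sigma):
    """True when unit vectors x and xi are within Euclidean distance sigma."""
    threshold = (2.0 - sigma ** 2) / 2.0
    return dot(x, xi) >= threshold


x = normalise([3.0, 4.0])
near = normalise([3.0, 4.1])   # almost the same direction as x
far = normalise([4.0, -3.0])   # orthogonal to x, so r = sqrt(2)
print(within_sigma(x, near, 0.1), within_sigma(x, far, 0.1))
```

With σ = 0.1 the threshold is 0.995, so only the nearly parallel vector passes; raising σ above sqrt(2) would also admit the orthogonal one.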

The elements of the x vector are fed to the first LEA, which produces light beam intensities in proportion to the element magnitudes. These are then directed with a lens system at a two-dimensional SLM encoded with the vectors xi. Each of the light beams representing the elements of x is directed at all the respective cells of the SLM representing the same elements in each xi. The light passing through the SLM is then focused onto a PDA with another lens system such that the resulting outputs from the PDA represent all the dot products x.xi. The output current from each of the M photo diodes (PDs) is fed to its own comparator with a common threshold setting. If the threshold is exceeded the comparator turns on, else it stays off. Each of the activated comparators turns on the respective cell of the following LEA, which produces a light beam of fixed intensity while the other cells remain off. At this stage the M light beams from the last LEA are simultaneously directed at two vertical one-dimensional SLMs, each with M cells. One of them has the Zi yi information encoded in it and the other the Zi information. The light passing through each SLM is focused onto a single PD, producing the two summations Σ Zi yi fi(x) and Σ Zi fi(x). Since the RBF is a flat top hat type, the summations involve simple accumulations of those quantities which are within the required distance of x. Other RBFs can be accommodated by simply modulating the light intensities from the last LEA as a function of r or the dot product x.xi. The summations are finally divided to produce the output ŷ(x). As mentioned before, the DC bias can be removed from the result at this stage if necessary.
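The signal path just described can be mimicked in software as a rough functional model. The Python sketch below follows the stages in order: dot products, a comparator bank with a common threshold, two weighted summations, a division, and removal of the DC bias. It assumes the stored outputs carry the same fixed bias as the signal; that assumption, the function name and the sample values are illustrative and not taken from the paper.

```python
def optoelectronic_mpnn(x, centres, z, y_biased, threshold, bias):
    """Functional model of the optical signal path (not a device model)."""
    # SLM plus photo diode array: one dot product x.xi per stored centre.
    dots = [sum(a * b for a, b in zip(x, c)) for c in centres]
    # Comparator bank with a common threshold switches the last LEA cells.
    on = [1 if d >= threshold else 0 for d in dots]
    # Two 1-D SLMs weight the on/off beams; each photo diode sums its beams.
    sum_zy = sum(f * zi * yi for f, zi, yi in zip(on, z, y_biased))
    sum_z = sum(f * zi for f, zi in zip(on, z))
    if sum_z == 0:
        return None  # no comparator fired; this handling is an assumption
    # Divider, then removal of the DC bias carried by the stored outputs.
    return sum_zy / sum_z - bias


# Hypothetical unit-length centres and outputs; a bias of 8 keeps y positive.
centres = [(1.0, 0.0), (0.8, 0.6), (0.0, 1.0)]
z = [2, 2, 5]
y = [-1.0, 3.0, 7.0]
bias = 8.0
y_biased = [yi + bias for yi in y]
print(optoelectronic_mpnn((1.0, 0.0), centres, z, y_biased, 0.75, bias))  # prints 1.0
```

Because the result is a weighted average, a constant bias added to every stored output passes through unchanged and can be subtracted once at the end, which is consistent with removing the bias only at the final stage.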

VLSI Implementation

VLSI technology is quite satisfactory for the implementation of two basic MPNN hardware designs using equation (2) and equation (4). The first design is a virtual digital design and the second a fully parallel design. The RBF of equation (4) selects training centres which are within a specified city block of the input vector x and applies a straight linear weighted average to them to compute the output ŷ(x). Although this approach is the least accurate, it is the most convenient to implement in hardware. If the training cluster centres are kept close together, the accuracy can be improved at the expense of increasing network size.

[...] the MPNN memory and [...] are the most crucial ones. [...] that the memory and systems control and the σ optimiser functions are performed by a standard serial type host computer, since these are the least time critical functions if the adaptive operation is not so important. If all the training and cluster centre allocations are done by the host computer, which then downloads the training to the hardware, a minimal hardware size and complexity can be achieved.

A virtual MPNN machine design is best done with [...] circuitry. It has a single central processing unit (CPU) fed by a random access memory (RAM) which holds the network parameters. For each input vector x the machine cycles through the whole occupied memory from i = 1 to i = M, feeding the relevant memory values to the processor. [...] computes and accumulates the ith sub[...] then produces an [...]

The hardware system is designed to work with unsigned binary integer words with values from 0 to (2^bits - 1), where bits equals the maximum number of word bits. Zero is represented by 2^(bits-1), positive values range between 2^(bits-1) and (2^bits - 1), and negative values range between (2^(bits-1) - 1) and 0. As indicated in Figure 2, the RAM is addressable from i = 1 to i = M, and at each address location i the training vector centre xi, the number of ith training vectors Zi, and the product of Zi times the desired output yi are stored. These parameters are computed beforehand by the host computer using any suitable method and then simply downloaded to their respective RAM locations. Centres xi are computed as the mean rounded integer values of the ith input training vectors, and the outputs yi are the means of their associated desired outputs. The xi [...]ally the same or a little [...] device. For example, if the [...] bit device then the [...] need not be larger than 8 bits. The Zi RAM size needs to be large enough to take the largest expected number of training vectors, and the Zi yi RAM big enough to take the highest Zi times the maximum yi value. Typically the output DAC size is the same as the input ADC size, which is also the maximum yi size.
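The number representation and the host-side preparation of the memory contents can be sketched as follows. This Python example assumes the training vectors have already been grouped by some clustering method, and that each memory row holds the centre xi, the count Zi and the product Zi yi. Storing Zi yi as the plain sum of the desired outputs is an assumption that holds when yi is the unrounded mean; all function names and values are hypothetical.

```python
def to_offset_binary(value: int, bits: int) -> int:
    """Map a signed integer to an unsigned word with zero at 2**(bits-1)."""
    word = value + (1 << (bits - 1))
    if not 0 <= word < (1 << bits):
        raise ValueError("value does not fit in the word size")
    return word


def from_offset_binary(word: int, bits: int) -> int:
    """Inverse mapping: recover the signed value from the unsigned word."""
    return word - (1 << (bits - 1))


def build_ram(groups):
    """groups: one (input_vectors, desired_outputs) pair per centre i.

    Returns one row per centre: (centre xi, Zi, Zi*yi).
    """
    ram = []
    for vectors, outputs in groups:
        z_i = len(vectors)
        # Element-wise mean, rounded to an integer word. Python's round()
        # rounds exact halves to the nearest even integer.
        centre = tuple(round(sum(col) / z_i) for col in zip(*vectors))
        zy_i = sum(outputs)  # Zi * mean(outputs) equals the plain sum
        ram.append((centre, z_i, zy_i))
    return ram


print(to_offset_binary(0, 8), to_offset_binary(-128, 8), to_offset_binary(127, 8))
print(build_ram([([(130, 140), (132, 142)], [10, 12])]))
```

For 8-bit words, zero maps to 128, the most negative value to 0 and the most positive value to 255, matching the ranges given above.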

[...] number of word bits as the number of upper bits plus the number of lower bits, ie. bits = (bits_upper + bits_lower). The upper bits of each of the elements of the vectors xi and x are compared using a logic AND gate. If all the upper bits match then they can be said to be within the same city block, where the city block size is defined by the lower bits, which also define the σ. For example, if the 3 lower bits of the word are excluded from the matching then the city block size is 111 binary = 7 decimal, and the matching vectors can be said to be within 7 or closer in any of their dimensions. This is a good method provided it is acceptable to allocate values for σ as only powers of 2, ie. 0, 2, 4, 8, 16 etc., which correspond to 0, 1, 2, 3, 4 etc. numbers of lower bits respectively. Since acceptable σ's tend to be fairly broad ranging this is not a serious limitation. On average this approach should produce acceptable results, but some individual results can be biased if the particular x vector elements have lower bit values which are greater than half the city block distance as defined above. For example, if an element of x has the lower bits 111 binary it will only match with centres having lower bits less than or equal to 111 binary, and those centres above that but still close (eg. 1 binary) will be ignored. This can be fixed to some extent if the following logic is included. If the upper bit of the lower bits of an element of x is a 0 then apply the method as normal. If it is a 1 then reduce the number of upper matching bits by 1 (ie. increase the lower bits by 1). This ensures that sufficient matching vectors either side of x are included, but the effective σ is now variable by 1 bit. Whichever approach is used, the value of σ is set by preselecting the number of lower memory bits for each element of the x vectors which are ignored by the AND gate comparator. When the AND function determines a match, the gate passing the Zi and Zi yi values to the accumulators is activated. The optimal σ can be determined quite quickly by the host computer by running some real data through the system for test values of σ = 0, σ = 2, σ = 4 etc. until a satisfactory mean squared error (mse) is achieved during training.
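The upper-bit matching rule and its refinement can be expressed with shifts and masks. The Python sketch below is one reading of the logic described above: the basic rule compares the words after discarding the lower bits, and the refined rule discards one more bit whenever the top ignored bit of the x element is 1. The function names are hypothetical.

```python
def upper_bits_match(x_k: int, c_k: int, lower_bits: int) -> bool:
    """Basic rule: words match when they agree above the ignored lower bits."""
    return (x_k >> lower_bits) == (c_k >> lower_bits)


def upper_bits_match_adaptive(x_k: int, c_k: int, lower_bits: int) -> bool:
    """Refined rule: ignore one more bit when the top ignored bit of x_k is 1."""
    if lower_bits > 0 and (x_k >> (lower_bits - 1)) & 1:
        lower_bits += 1
    return (x_k >> lower_bits) == (c_k >> lower_bits)


def centre_matches(x, centre, lower_bits, rule=upper_bits_match):
    """A centre is selected only if every element matches (the AND gate)."""
    return all(rule(xk, ck, lower_bits) for xk, ck in zip(x, centre))


# 23 = 10111 binary and 24 = 11000 binary differ by 1 but sit in
# different blocks when the 3 lower bits are ignored.
print(upper_bits_match(23, 24, 3))           # prints False
print(upper_bits_match_adaptive(23, 24, 3))  # prints True
```

The refinement helps only in some cases, in line with the remark that it fixes the bias "to some extent": for example 31 and 32 still fail to match with 3 lower bits, because they also differ in the next bit up.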

The final output value ŷ(x) is computed at the end of the cycle by dividing the final value of the Zi yi accumulator by the final value of the Zi accumulator. The accumulator sizes must be sufficient to accept the highest expected totals to avoid overflow; otherwise, overflow circuitry must be included. The highest expected Zi yi accumulator value is +/- (M maximum(Zi) maximum(yi)) and the highest Zi accumulator value is +/- (M maximum(Zi)). Finally, the divider must be able to handle these highest inputs to result in peak values of ŷ(x) close to +/- (maximum(yi)).
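The worst-case totals translate directly into accumulator word lengths. A small Python sketch, assuming a signed accumulator with one sign bit and using hypothetical network sizes:

```python
def accumulator_bits(m: int, max_z: int, max_y: int = 1) -> int:
    """Bits for a sign bit plus the magnitude M * max(Zi) * max(|yi|)."""
    peak = m * max_z * max_y
    return peak.bit_length() + 1


# Hypothetical sizes: 1000 centres, Zi up to 255, |yi| up to 127.
print(accumulator_bits(1000, 255, 127))  # Zi*yi accumulator: prints 26
print(accumulator_bits(1000, 255))       # Zi accumulator: prints 19
```

These figures are upper bounds, since they assume every centre matches at once with the largest possible Zi and yi.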

A fairly easy way to implement any virtual MPNN design is to use a standard DSP chip family such as the Texas Instruments TMS320C30 or TMS320C40 for the CPU and the memory. These types of chips have the standard multiply and accumulate functions needed by the MPNN, and they can also perform other arithmetic functions such as division. If this type of implementation is not computationally fast enough with one DSP chip, an easy solution would be to break the MPNN into parallel sections and implement them with virtual machines similar to that described above but without their final divider stages. The subtotals from these machines can be transferred to external accumulation and division circuitry to compute the final output. This strategy would introduce very little extra delay. Due to the likelihood of amassing very large final accumulator values, it would probably be best to consider a floating point CPU design for very large network sizes.
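Splitting the network into sections works because both accumulations are plain sums, so subtotals can be combined in any grouping before the single division. The Python sketch below models each section as a virtual machine without a divider, using upper-bit matching on integer words; the memory layout (centre, Zi, Zi yi), the function names and the sample values are assumptions for illustration.

```python
def section_subtotals(x, section, lower_bits):
    """One virtual machine without a divider: returns (sum Zi*yi, sum Zi)."""
    sum_zy = 0
    sum_z = 0
    for centre, z_i, zy_i in section:
        # Upper-bit match on every element, as done by the AND gate.
        if all((xk >> lower_bits) == (ck >> lower_bits)
               for xk, ck in zip(x, centre)):
            sum_zy += zy_i
            sum_z += z_i
    return sum_zy, sum_z


def combined_output(x, sections, lower_bits):
    """External accumulation and division over all section subtotals."""
    totals = [section_subtotals(x, s, lower_bits) for s in sections]
    sum_zy = sum(t[0] for t in totals)
    sum_z = sum(t[1] for t in totals)
    return sum_zy / sum_z if sum_z else None  # None: no centre matched


# Hypothetical memory rows: (centre xi, Zi, Zi*yi).
ram = [((16, 8), 2, 20), ((23, 15), 3, 60), ((64, 64), 1, 5), ((17, 9), 1, 10)]
x = (20, 12)
print(combined_output(x, [ram], 3))                # one machine: prints 15.0
print(combined_output(x, [ram[:2], ram[2:]], 3))   # two sections: prints 15.0
```

In a real system each call to `section_subtotals` would run on its own processor, and only two numbers per section would need to be transferred to the external circuitry.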

A Parallel VLSI Hardware Design

Although the virtual design can be parallelised to achieve faster throughput, there may be some applications that require much faster throughput than can be easily accommodated. In that case the fastest implementation requires a fully parallel hardware design. The parallel hardware design is similar to the virtual design except that, for each new x input, all the filter computations are done simultaneously in parallel hardware rather than via a CPU in a computation cycle. Figure 3 shows the main elements of the design.

The memory, comparator and divider parts can be implemented with digital VLSI technology quite easily; however, the parallel accumulators may be better implemented in analog form. Either way the design is quite simple using the RBF according to equation (4). The xi comparators are simply AND gates with appropriate input buffers fed by the high bits of all the elements of the vector x. The value of xi is programmed into the comparator hardware by setting the input buffers to be inverting or non-inverting to match the correct binary bit code, ie. inverting = logic 0 while non-inverting = logic 1. Clearly, only those comparators with the correct bit match at their input will output logic 1's, which enable the appropriate Zi and Zi yi memories to feed into the parallel accumulators. As before, the value of σ is set by preselecting the number of lower memory bits for each element of the x vectors which are ignored by the AND gate comparator.
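The programmed comparator can be modelled bit by bit. In the Python sketch below, each upper bit of a stored centre word selects a non-inverting buffer (stored bit 1) or an inverting buffer (stored bit 0), and the AND of all buffered input bits gives the comparator output. The function names and the 8-bit example word are illustrative assumptions.

```python
def program_buffers(centre_word: int, bits: int, lower_bits: int):
    """Buffer settings for the upper bits of one centre word, MSB first.

    True = non-inverting (stored bit is 1), False = inverting (stored bit is 0).
    """
    return [bool((centre_word >> b) & 1)
            for b in range(bits - 1, lower_bits - 1, -1)]


def comparator(x_word: int, buffers, bits: int) -> int:
    """AND of the buffered upper bits of x_word; 1 only when all bits agree."""
    out = 1
    for offset, non_inverting in enumerate(buffers):
        bit = (x_word >> (bits - 1 - offset)) & 1
        out &= bit if non_inverting else 1 - bit
    return out


# Centre 10110101 binary with the 3 lower bits ignored: upper bits 10110.
buffers = program_buffers(0b10110101, bits=8, lower_bits=3)
print(buffers)
print(comparator(0b10110010, buffers, 8))  # same upper bits: prints 1
print(comparator(0b10111010, buffers, 8))  # different upper bits: prints 0
```

For a multi-element vector, the outputs of the per-element comparators would themselves be ANDed so that a centre is enabled only when every element matches.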

Conclusions

It has been shown how the simplest forms of the GRNN and MPNN algorithms can be easily implemented for general nonlinear signal processing applications in real time with optoelectronic and VLSI hardware technologies. These designs can be extended to more complex RBFs, which can produce better results with fewer training vectors or centres. However, equation (4) can produce quite satisfactory results if comparatively more training vectors or centres are used to compensate for the poorer RBF.

References

[1] Zaknich, Anthony and Attikiouzel, Yianni, "Time Series Characterisation Schemes for the Modified Probabilistic Neural Network", Australian Journal of Intelligent Information Processing Systems, Vol. 2, No. 2, Winter 1995, pp. 1-11.

[2] Zaknich, Anthony, deSilva, Christopher and Attikiouzel, Yianni, "The probabilistic neural network for nonlinear time series analysis", IEEE International Joint Conference on Neural Networks (IJCNN), Singapore, 17-21st November 1991, pp. 1530-1535.

[3] Zaknich, Anthony and Attikiouzel, Yianni, "Automatic optimization of the modified probabilistic neural network for pattern recognition and time series analysis", Proceedings of the First Australian and New Zealand Conference on Intelligent Information Systems, Perth, Western Australia, 1-3rd December, 1993, pp. 152-156.

[4] Specht, D. F., "A general regression neural network", IEEE Transactions on Neural Networks, Vol. 2, No. 6, November 1991, pp.

[5] Specht, D. F., "Probabilistic neural networks", Neural Networks, Vol. 3, 1990, pp. 109-118.

[6] Farhat, N. H., "Optoelectronic neural networks and learning machines", IEEE Circuits and Devices Magazine, September 1989, pp. 32-41.

[7] Atlas, L. E. and Suzuki, Y., "Digital systems for artificial neural networks", IEEE Circuits and Devices Magazine, November 1989, pp. 20-24.

[8] Masaki, A., Hirai, Y. and Yamada, M., "Neural networks in CMOS: A case study", IEEE Circuits and Devices Magazine, July 1990, pp. 12-17.

[9] Farhat, N. H., "Optoelectronic analogs of self-programming neural nets: architecture and methodologies for implementing fast stochastic learning by simulated annealing", Applied Optics, Vol. 26, No. 23, December 1987, pp. 5093-5103.

FIGURE 1: MPNN Optoelectronic Design. (Diagram labels: analog input signal; analog delay line, p-1-ext taps long; 1-D output delay line, ext taps long.)

FIGURE 2: MPNN Virtual Digital VLSI Hardware Design. (Diagram labels: 1-D delay line; memory indexed from i = 1 to i = M; AND gate; CPU. Index through memory, enabling at each memory index and after each cycle; repeat the cycle for each x input.)

FIGURE 3: MPNN VLSI Parallel Hardware Design. (Diagram labels: 1-D delay line; 1-D output delay line, ext taps long; x vector, dimension p; divider. Only the memories enabled by the comparators are accumulated. A comparator activates when the upper bits of x and xi match; σ = lower bit size.)