Distributed Computation - Pattern Recognition Using Associative Memories

Using fMRI, it has been shown that different areas of the ventral stream are used when recognising different categories of object [57]. From these results it was concluded that, rather than the brain containing a separate area for each category, the representations of objects are in fact distributed. The mechanism or representation used for the distribution was not known; however, it was noted that the representation appears to be organised in a way that reflects the differences between categories. It is proposed that the distinguishing factors between representations of different object categories may be similar to those proposed by Tanaka [101], including changes in luminosity in particular areas of an image, or specific sub-shapes such as circles and squares.

On the other hand, it has been found in [83] that the stimulus from each of a rat’s whiskers is processed by a small but separate module, or group of neurons, in the rat’s brain. The authors also concluded that the timing of spikes (electrical pulses from one neuron to another) plays a large part in the encoding of information. Previously it had been assumed that the timing of spikes was not sufficiently accurate, and that only the number of spikes in a given time could be used to pass information [93]. This difference in findings suggests that some functions in the brain may be performed by exclusive modules, whereas others, particularly higher-order functions, are performed in parallel using a distributed representation.

As discussed in Section 2.3.2, local representations are a straightforward way to store content in a number of nodes, where each individual node stores one piece of information. A local representation is ideal for parallel processing, as the content of each node is independent of the other nodes and therefore each node may be processed separately by dedicated hardware or software [50]. While this may be easy to implement, it is more susceptible to errors from noisy inputs, and each module must be implemented specifically for a node.

Some systems built with neural networks use a distributed vector representation instead [50]. Using a distributed representation, content is stored as a pattern of activity across a number of nodes rather than in the activation of a single node. This gives various benefits, such as the ability to generalise and gracefully degrade with a partial or noisy input, and the use of a number of simple homogeneous nodes that can be processed in parallel [84].
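As a rough illustration of graceful degradation, the following Python sketch compares a local (one-hot) code with a distributed code when one set bit of the input is lost. The item names, the bit patterns, and the helper functions `overlap` and `best_match` are hypothetical and chosen only for this example; they are not taken from [50] or [84].

```python
def overlap(u, v):
    """Number of positions where both binary vectors have a bit set."""
    return sum(x & y for x, y in zip(u, v))


def best_match(probe, patterns):
    """Name of the stored pattern that shares the most set bits with the probe."""
    return max(patterns, key=lambda name: overlap(probe, patterns[name]))


# Hypothetical local codes: one node per item.
local = {
    "cat": [1, 0, 0],
    "dog": [0, 1, 0],
    "car": [0, 0, 1],
}

# Hypothetical distributed codes: each item is a pattern over eight nodes.
distributed = {
    "cat": [1, 1, 1, 0, 0, 0, 0, 0],
    "dog": [0, 0, 1, 1, 1, 0, 0, 0],
    "car": [0, 0, 0, 0, 0, 1, 1, 1],
}

# Damage the input for "cat" by clearing one set bit in each representation.
local_probe = [0, 0, 0]
distributed_probe = [1, 1, 0, 0, 0, 0, 0, 0]

local_overlaps = {name: overlap(local_probe, code)
                  for name, code in local.items()}
distributed_overlaps = {name: overlap(distributed_probe, code)
                        for name, code in distributed.items()}
recovered = best_match(distributed_probe, distributed)
```

With the local code the damaged input no longer overlaps with any stored item, whereas the damaged distributed input still overlaps most strongly with the intended pattern.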

2.4.1 Tensor Product Production System

Productions are simple rules that may be used in various fields, such as artificial intelligence and expert systems. They contain antecedents and consequents—the consequents “fire” if all the antecedents of a production are met.

In its most basic form, a production system may be thought of as being similar to a state machine, with the antecedent of a rule being the current state, and the consequent of a rule becoming the new state. In this form productions are limited to having only a single antecedent, and are referred to as arity-1 rules. Productions with two antecedents are arity-2 rules, and so on.
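The behaviour of productions can be sketched symbolically in a few lines of Python. This is only an illustration of the firing rule described above, not of any particular production system; the rule format (a set of antecedents paired with one consequent), the token names, and the function `fire` are assumptions made for the example.

```python
def fire(rules, facts):
    """Return the consequents of every rule whose antecedents are all present."""
    return {consequent for antecedents, consequent in rules
            if antecedents <= facts}


# Arity-1 rules behave like state transitions: current state -> new state.
arity1_rules = [({"a"}, "b"), ({"b"}, "d")]

# An arity-2 rule fires only when both of its antecedents are present.
arity2_rules = [({"a", "b"}, "c")]

state = {"a"}
first_step = fire(arity1_rules, state)        # {"b"}
second_step = fire(arity1_rules, first_step)  # {"d"}

unmet = fire(arity2_rules, {"a"})             # set(): only one antecedent met
met = fire(arity2_rules, {"a", "b"})          # {"c"}
```

Feeding the result of one call back in as the next set of facts gives the state-machine behaviour of arity-1 rules.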

Dolan and Smolensky developed a simple production system in 1989, using tensor products [29]. The system required third-order tensors (or three-dimensional matrices), and successfully operated on a given set of productions. The experiment was limited to a set of six productions, and six tokens (out of a possible 35), represented by seven-bit vectors. Despite the very limited size of the experiment, the results were encouraging, particularly those involving the injection of faults: the system was able to continue operating successfully for a reasonable number of injected faults.

2.4.2 Parallel Distributed Computation

Parallel Distributed Processing (PDP) was a concept presented by the PDP Research Group at the University of California, San Diego [90]. Essentially this work expands upon the use of distributed vector representations, developing and expounding upon a number of models suited to varying uses.

Figure 2.9: Example non-deterministic finite state automaton (states a, b, c, d, e and f)

Parallel Distributed Computation (PDC) is a similar idea presented by Austin [5], specifically applying to CMMs. PDC has an important feature that is not shared by PDP—the ability to process more than one computation simultaneously, using only a single neural network.

An alternative way to consider a CMM, rather than as an associative memory, is as a state machine [5]. This extends the use of associations from a form of storage to a means of computation. Operation of the network is exactly the same (an input vector is presented, and the associated vector is output), but viewing it in this way can suggest different applications.

Developing Dolan and Smolensky’s production system further, Austin [6] proposed a two-layer CMM to implement productions. One of the particular aims of this work was to develop a neural network-based production system that could be practically implemented on available systems. Kustrin and Austin [66] extended the notion, creating a CMM-based system capable of performing connectionist propositional logic, and demonstrating its ability to resolve queries using the trained axioms. This work was limited, however, in that it provided no way for a production to have more than one consequent. If a rule like this were to exist, then distinguishing individual vectors in a superimposed output would have been a problem.

As well as a potential for further application, some of the capabilities of CMMs become more apparent. When considering a particular application of CMMs, for instance performing pattern recognition, the benefits of parallel operation seem obvious: if two images are processed at once, then the overall execution time will be halved compared to only processing one at a time. Similarly, in a classical system designed to process a finite state automaton, parallel operation will reduce the processing time, but sequential operation is sufficient. When considering a CMM to be a state machine, however, the ability to operate correctly on multiple states simultaneously becomes not only desirable, but essential if the limitation shown in Kustrin and Austin’s work is to be avoided.

Figure 2.10: A CMM as a state machine (input state → CMM → output state)

(a)
a  1 0 1 0 0
b  0 1 0 1 0
c  1 0 0 0 1
d  0 1 1 0 0
e  1 0 0 1 0
f  0 1 0 0 1

(b)
a ∨ b = 11110
a ∨ c = 10101
b ∨ c = 11011
e ∨ f = 11011

Figure 2.11: (a) Baum codes generated with n = 5, s = 2, p1 = 2, p2 = 3, and (b) demonstrating a potential issue distinguishing overlapping vectors.

As an example, consider the states and transitions in Figure 2.9. A system may traverse the state space using a number of algorithms, for instance breadth-first or depth-first search. The state transitions can be trained into a CMM, representing each state token by a distributed vector and then associating the appropriate vectors. Operation of the state machine is iterative, with the output of a recall operation becoming the input to the next iteration, as shown in Figure 2.10.

If the input vector a is presented to an associative memory that converges to a single output state, such as the Hopfield network [54], only a single output state would be selected: either b or c. As we saw in Section 2.3.5.1, however, both states may be recalled simultaneously when using a CMM. Due to the nature of matrix memories, all of the output neurons involved in mapping a to b, as well as a to c, would fire; after applying an appropriate threshold, the output would then be the superposition of the two states b and c.
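The simultaneous recall of b and c can be reproduced with a minimal Python sketch. It assumes the usual binary CMM training rule (the logical OR of the outer products of each output and input vector) and a Willshaw threshold equal to the weight of the input vector. The vectors are the codes for a, b and c from Figure 2.11a, and only the two transitions named above (a to b, and a to c) are trained; the function names `train` and `recall` are illustrative.

```python
# Codes for the tokens a, b and c, copied from Figure 2.11a.
CODES = {
    "a": [1, 0, 1, 0, 0],
    "b": [0, 1, 0, 1, 0],
    "c": [1, 0, 0, 0, 1],
}


def train(matrix, inp, out):
    """Store the association inp -> out by OR-ing in their outer product."""
    for row, out_bit in enumerate(out):
        for col, in_bit in enumerate(inp):
            matrix[row][col] |= out_bit & in_bit


def recall(matrix, inp, threshold):
    """Sum the weights selected by the input, then apply a fixed threshold."""
    sums = [sum(w & i for w, i in zip(row, inp)) for row in matrix]
    return [1 if s >= threshold else 0 for s in sums]


n = 5
cmm = [[0] * n for _ in range(n)]
train(cmm, CODES["a"], CODES["b"])  # transition a -> b
train(cmm, CODES["a"], CODES["c"])  # transition a -> c

# Willshaw threshold: the weight (number of set bits) of the input vector.
output = recall(cmm, CODES["a"], threshold=sum(CODES["a"]))

expected = [b | c for b, c in zip(CODES["b"], CODES["c"])]
```

Here `output` is 11011, which equals the superposition of b and c held in `expected`.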

Both the L-max and L-wta thresholds require knowing the exact weight of an output vector, with the L-wta method also requiring knowledge about the distribution of bits set to one. If multiple vectors are superimposed then neither the weight nor the bit distribution can be guaranteed, as vectors may overlap (as can be seen in Figure 2.11b with a ∨ c). This leaves the original Willshaw threshold as the only appropriate option, in this case using a threshold value equal to the weight of a single input vector. The problem now faced is to be able to distinguish between the superimposed vectors that are output in such a situation. Using the tokens in Figure 2.11a, the superposition of vectors b and c is 11011. The superposition of vectors e and f, however, is exactly the same. Although this is a contrived example, it serves as a demonstration of the potential difficulty. A solution to this problem will be described in Section 4.3.
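The ambiguity can be checked directly with a short Python sketch. It assumes the usual construction of Baum codes, in which token number x sets bit (x mod p) in each section of length p; with section lengths 2 and 3 this reproduces the six codes of Figure 2.11a. The function names `baum_code` and `superimpose`, and the numbering of the tokens a to f from zero, are choices made for this example.

```python
def baum_code(index, sections):
    """Binary code with one bit set per section, at position index mod length."""
    code = []
    for length in sections:
        code.extend(1 if pos == index % length else 0 for pos in range(length))
    return code


def superimpose(u, v):
    """Logical OR of two binary vectors."""
    return [x | y for x, y in zip(u, v)]


# Tokens a..f numbered 0..5, with section lengths p1 = 2 and p2 = 3.
codes = {token: baum_code(i, (2, 3)) for i, token in enumerate("abcdef")}

b_or_c = superimpose(codes["b"], codes["c"])
e_or_f = superimpose(codes["e"], codes["f"])
ambiguous = b_or_c == e_or_f

# Overlap also changes the weight of a superposition.
weight_a_or_b = sum(superimpose(codes["a"], codes["b"]))
weight_a_or_c = sum(superimpose(codes["a"], codes["c"]))
```

Both superpositions come out as 11011, so `ambiguous` is true. The weights of a ∨ b and a ∨ c are 4 and 3 respectively, which is why a threshold that depends on a known output weight cannot be relied upon.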
