TWO CELLULAR ARCHITECTURES FOR INTEGRATED IMAGE SENSING AND PROCESSING ON A SINGLE CHIP


Gail Erten, IC Tech, Inc., 2157 University Park Dr., Okemos MI 48864, [email protected]

Fathi M. Salam, Electrical Engineering, Michigan State University, East Lansing MI 48824, [email protected]

ABSTRACT:

Two architectures for a programmable image processor with on-chip light sensing capability are described. The first is a VLSI implementation of a cellular neural network. The second is a distributed dual-structure mutation of the first architecture. The distributed dual architecture leverages the speed of silicon against the large silicon area requirements. Moreover, the integrated nature of the dual-structure design significantly reduces the bottleneck and computational overload caused by data transfer from the sensory focal plane to the image processor. The paper also describes VLSI chip prototypes and test results.

1. Introduction: Image Processing with Cellular Neural Networks (CNN)

CNN is a hybrid of Cellular Automata and Neural Networks (hence the name Cellular Neural Networks) and it incorporates the best features of both concepts. Its continuous time feature allows real-time signal processing, and its local interconnection feature makes physical realization in VLSI possible. Its grid-like structure is suitable for the solution of a high-order system of first order nonlinear differential equations on-line and in real-time. In summary, CNN can be viewed as an analog nonlinear dynamic processor array [Chua88]. The basic unit of CNN is called a cell. Each cell receives input from its immediate neighbors (and itself via feedback), and also from external sources (e.g., the sensor array points and/or previous layers).

The canonical CNN equation summarizes these relationships:

$$\tau_{ij}\,\dot{x}_{ij}(t) = -x_{ij}(t) + \sum_{kl \in N_r(ij)} A_{ij;kl}\big(y_{kl}(t), y_{ij}(t)\big) + \sum_{kl \in N_r(ij)} B_{ij;kl}\big(u_{kl}(t), u_{ij}(t)\big) + I_{ij}$$

Equation (1)

$$y_{ij} = f(x_{ij})$$

Equation (2) and

$$I_{ij} = I$$

Equation (3)

where u is the input, x is the state, and y is the output associated with the cell (i,j), $\tau_{ij}$ is the time constant of the cell (i,j) determined by R and C values, $I_{ij}$ is the bias current, $N_r$ is the neighborhood of connectivity of the cell (i,j) such that $kl \in N_r$, and A and B represent the cloning templates.

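Figure 1: Architecture of a physical realization of the cellular neural network canonical equations. (Diagram labels recovered from the original: inputs u_ij and u_{i+k,j+l}; outputs y_ij and y_{i+k,j+l}; template entries a_00 ... a_kl and b_00 ... b_kl applied through multipliers; state node x_ij with capacitor C, resistor R and bias current I; output function f( ); feedforward and feedback paths.)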

In a typical CNN, local connections between the neighbors (feedback weights, or the entries of the matrix A in Equation 1), along with connections from the sensory array (input weights, or entries of the matrix B in Equation 1), form the programmable cloning templates. Cloning templates to perform numerous types of visual processing tasks have been developed. Each template is specific to a particular application, e.g., a cloning template for edge detection [Lee94] or binocular stereo [Park94]. Cellular neural networks are attractive in image processing because of their programmability: one needs to change only the template to perform a different iconic task.
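The role of the cloning templates in Equations (1)-(3) can be made concrete with a small numerical sketch. The code below integrates Equation (1) with forward Euler steps under several assumptions that are not taken from the paper: the templates are linear (each entry is a plain multiplier), the output function f is a piecewise-linear saturation to [-1, 1], cells outside the array contribute zero, and the function names and template values are hypothetical.

```python
from typing import List

Grid = List[List[float]]


def saturate(x: float) -> float:
    """Assumed output function f: clip the state to the range [-1, 1]."""
    return max(-1.0, min(1.0, x))


def cnn_step(x: Grid, u: Grid, A: Grid, B: Grid, bias: float,
             tau: float, dt: float) -> Grid:
    """One forward-Euler step of tau * dx/dt = -x + sum(A*y) + sum(B*u) + I."""
    rows, cols = len(x), len(x[0])
    r = len(A) // 2  # neighborhood radius (1 for a 3 x 3 template)
    y = [[saturate(v) for v in row] for row in x]
    new_x = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            total = -x[i][j] + bias
            for k in range(-r, r + 1):
                for l in range(-r, r + 1):
                    ii, jj = i + k, j + l
                    if 0 <= ii < rows and 0 <= jj < cols:  # zero boundary
                        total += A[k + r][l + r] * y[ii][jj]
                        total += B[k + r][l + r] * u[ii][jj]
            new_x[i][j] = x[i][j] + dt * total / tau
    return new_x


def cnn_run(u: Grid, A: Grid, B: Grid, bias: float = 0.0, tau: float = 1.0,
            dt: float = 0.05, steps: int = 400) -> Grid:
    """Start from a zero state and iterate until (approximately) settled."""
    x = [[0.0] * len(u[0]) for _ in u]
    for _ in range(steps):
        x = cnn_step(x, u, A, B, bias, tau, dt)
    return x


if __name__ == "__main__":
    image = [[0.0, 0.0, 1.0, 1.0] for _ in range(3)]
    no_feedback = [[0.0] * 3 for _ in range(3)]
    # Illustrative horizontal-difference template, not one from [Lee94].
    edge_b = [[0.0, 0.0, 0.0], [1.0, 0.0, -1.0], [0.0, 0.0, 0.0]]
    for row in cnn_run(image, no_feedback, edge_b):
        print([round(v, 3) for v in row])
```

Changing only `edge_b` (and, for feedback operations, the A template) changes the operation performed, which is the programmability argument made above.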

2. VLSI Adaptations of the CNN Model:

2.1. Motivation

The CNN model described in Equations (1-3) is not suitable for direct VLSI implementation. Below we give a qualitative discussion that motivates the models we implemented:

In an integrated circuit implementation of the CNN model, the summation equation is a current based computation, as the circuit model in Figure 1 suggests. By Kirchhoff's current law, all currents coming into the node that defines the state of the cell ( x ) must add to zero. Because the intrinsic resistance values ( R ) are very large and capacitance values are very small, little charge is required to maintain a particular voltage. This also means that the current required to alter the voltage value of the state is very small. Consequently, noise currents are of sufficient magnitude to make a significant difference. When one adds to that the fact that the transistor characteristics can

To overcome these obstacles, we introduced a self-normalizing feedback structure for the computation of the inputs and feedforward template. This borrows a crucial element from the resistive grid. The feedback in the feedforward input weight multiplication process is necessary to keep the state node ( x ) from quickly accumulating or losing charge and saturating to a power rail. In the circuit implementation shown in Figure 2:

1. both u (unsigned) and x (signed) are represented as analog voltages u and x;

2. template weights can be digitally stored and are programmable; and

3. the transfer function f ( ) in Equation 2 can in fact become a programmable tri-level saturation function that produces zero output in a region of the input (rather than only at a point).

The particular equations that incorporate the characteristics of the circuit structure are listed below.

$$C\,\dot{x}_{ij}(t) = -\frac{x_{ij}(t)}{R} + \sum_{kl \in N_r(ij)} A_{kl}\, y_{kl}(t) + \sum_{kl \in N_r(ij)} B_{kl} \tanh\!\left[\frac{\kappa q}{2kT}\big(u_{kl}(t) - x_{ij}(t)\big)\right] + I_{ij}$$

Equation (4)

$$y = f(x) = \tanh\!\left(\frac{\kappa q}{2kT}\,\big(x - V_{REF}\big)\right)$$

Equation (5) and

$$I_{ij} = I$$

Equation (6)

where the hyperbolic tangent (tanh) describes a basic transconductance circuit.

Note that the input (feedforward) weights (B) are implemented in a negative feedback structure, for the reasons explained qualitatively above. This manner of implementation was chosen because (i) the feedback structure stabilizes the state (x) node, which otherwise has a tendency to charge all the way up or down to a rail limited by the power supply, and (ii) the structure is far more resilient to manufacturing related issues such as transistor mismatches. The modified CNN model is shown in Figure 2. Some assumptions have been made about the polarity of the weights, namely that B is nonnegative and A is nonpositive.
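The stabilizing effect of the negative feedback structure can be illustrated numerically. The following sketch integrates a single cell of Equation (4) with the feedback template A set to zero; the normalized gain (standing in for the factor κq/2kT), the unit capacitance, the infinite leak resistance and the function name are all assumptions chosen for illustration, not values from the chips described here.

```python
import math
from typing import Sequence


def settle_state(u: Sequence[float], b: Sequence[float], gain: float = 1.0,
                 r: float = float("inf"), c: float = 1.0, bias: float = 0.0,
                 dt: float = 0.01, steps: int = 5000) -> float:
    """Forward-Euler integration of Equation (4) for one cell with A = 0.

    u    -- input voltages seen by the cell
    b    -- nonnegative feedforward gains (entries of B)
    gain -- assumed normalized stand-in for kappa*q/(2kT)
    """
    x = 0.0
    for _ in range(steps):
        # Each transconductance amplifier pulls x toward its own input.
        drive = sum(bk * math.tanh(gain * (uk - x)) for uk, bk in zip(u, b))
        leak = x / r  # evaluates to 0.0 when r is infinite
        x += dt * (drive - leak + bias) / c
    return x


if __name__ == "__main__":
    inputs = [0.40, 0.50, 0.62]
    gains = [1.0, 2.0, 1.0]
    weighted_mean = sum(g * v for g, v in zip(gains, inputs)) / sum(gains)
    print(settle_state(inputs, gains), weighted_mean)
```

For inputs that stay within the roughly linear region of the tanh, the state settles close to the gain-weighted mean of the inputs instead of running to a rail, which is the self-normalizing behavior the structure is meant to provide.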

Figure 2: Adaptation of the CNN model to the VLSI domain. It is assumed that all input (feedforward) weights, i.e., all entries of B, are nonnegative and all feedback weights, i.e., all entries of A, are nonpositive. (Diagram labels recovered from the original: inputs u_ij and u_{i+k,j+l}; states x_ij and x_{i+k,j+l}; template entries a_00 ... a_kl and b_00 ... b_kl; V_ref; capacitor C, resistor R and bias current I; feedforward and feedback paths.)

2.2. Integrated Light Sensing and Processing in the Cellular Realm

An overall integrated light sensing and processing architecture that one can propose based on the cellular neural network paradigm consists of an array of identical sensor processor cells, each of which contains:

1. a photodiode and circuits for active light sensing,

2. transconductance amplifiers for feedforward template weight multiplication,

3. wide range transconductance amplifiers for feedback template weights,

4. analog/single bit digital local dynamic memory cells,

5. a data bus for transfers,

6. local programmable logic, and

7. read/write data controls.

Within this structure, one can perform a variety of operations of the cellular paradigm using, for instance, a pair of 3 x 3 cloning templates. Each operation can be completed in about 30 nsec. over the entire image. This implies a processing rate of over 30 million frames/sec per template instruction. The operations include all known convolution operations, plus connected component detection and manipulation allowed for by the feedback weights. Example operations are edge location, morphology operators such as dilation, thinning, and erosion, light adaptation, scratch removal, texture, color and shape analysis. In addition, by using the initialization of the states by previously obtained frames, it is also possible to use the two template array to implement temporal operations, such as motion analysis, e.g., local velocity detection, motion direction detection.


or a digital signal processor, depending on the computing needs of the particular application at hand. The program memory can be internal, where a more compact address gets horizontally coded into bit level microinstructions that determine template values and data transfers within the architecture. In the same figure we also illustrate the contents of a cell in detail. The feedforward and feedback weights, which also integrate the transfer function, are shown in outline. In addition, the local logic and memory functions are illustrated. There is a main data bus across which data transfers can occur. Two way connections are made between the data bus and the two analog and four digital memory units, the state (x) and input (u). Also, a connection is made from the logic output and the reference voltage to the data bus. The logic can be implemented by programming, similar to programmable logic arrays. The emerging reconfigurable logic arrays represent another option.

Figure 3: This figure illustrates a fully configured cellular architecture element, or the cell. For the sake of an uncluttered simple diagram, the figure shows a 5 x 5 array and many of the elements of the cell have been omitted. (Diagram labels recovered from the original: program memory; row select; sensor timing and control; row address; DSP, CPU or µC; cellular array; output to display; input from external sensor, if used as processor. Cell detail: photodiode with reset; input u, state x and V_ref; feedforward and feedback blocks; logic; digital memory; analog memory a_1 and a_2; data bus; plus data storage, controls, and programmable logic.)


2.3. Implementation of Several Cellular Model Circuits

A CMOS chip containing several types of cellular units was designed, simulated and laid out. The design was submitted for manufacturing to the MOSIS 2 micron ORBIT ANALOG process. The die is of the so-called TINYCHIP size, a 2.3 mm x 2.3 mm area, bonded in a 40 pin DIP.

The implemented chip contains eleven types of cells. All outputs are available through wide range followers, and all programmable cell parameters, i.e., feedforward and feedback weights and bias, have 3 bit resolution (plus sign bit). These weights are set externally through pins. Figure 4 shows the chip's layout. Additional figures show the schematic contents of the cells.

Figure 4: Layout of the VLSI adapted cellular neural network models implemented and tested. (Regions labeled in the original layout: column of eleven followers; four one feedforward, one feedback cells with initialization; programmable feedback cells; initializable cells; cells with programmable bias; cell with no feedback.)

Selected cells are described in detail in the following sections. Moreover, Figures 5-8 illustrate these cells schematically.

2.3.1. Four One Feedforward – One Feedback Cells with Initialization

At the top of the chip, there are four one input, one output cells. These are four versions of the simple one input – one output cell illustrated in Figure 2. The four versions arise from the four possible combinations of +/- terminal connections, all of which have been implemented. Moreover, all four cells include a state initialization mechanism, so the state can be initialized at any voltage between 0 and 5 Volts. This is equivalent to defining an initialization point x(t=0) in Equation (4). The schematics of the four cells are illustrated in a compact manner in Figure 5. Both the feedforward and the feedback amplifiers have three bit programmable gain.

Figure 5: The one input – one output cell with initialization. (Diagram labels recovered from the original: input (u), state (x), V_ref, V_initialize, initialize, and selectable +/- and -/+ amplifier terminals.)

2.3.2. Two Programmable Type (Positive or Negative) Feedback Cells

Two three input, one output cells with a selectable kind (+ or -) of feedback were implemented. One of these also allows for initialization of the state in a manner identical to the one illustrated in Figure 5. Figure 6 shows this type of cell without initialization. Note that in this cell, there is an added delay of a pass-through transistor in the feedback of the state. The two signals (the state node x and Vref) are directed to the terminals of the feedback weight amplifier based on the feedback sign select bit. All amplifiers have three bit programmable gain.

Figure 6: Programmable Type Feedback Cell. (Diagram labels recovered from the original: inputs (u₁), (u₂), (u₃) and bias (sign), each on +/- amplifier terminals; state (x); feedback sign select; V_ref.)

2.3.3. Two Hardwired Feedback and State Initializable Cells.

Two three input, one output cells, illustrated in Figure 7, were included with initialization. These cells differ only slightly from the ones described in Figure 6: the type of feedback is hardwired rather than programmable. Two combinations of +/- terminals are possible and both have been implemented. Both amplifiers have three bit programmable gain.

Figure 7: Positive and negative feedback hardwired and state initializable cells. (Diagram labels recovered from the original: inputs (u₁), (u₂), (u₃) and bias (sign) on +/- amplifier terminals; state (x); V_ref; V_initialize; initialize.)

2.3.4. Two (+/-) Hardwired Feedback and Floating State Cells

These cells are identical to the ones described in Figure 7 but do not contain state initialization circuits. Initial value of the state is thus undefined. These cells are illustrated in Figure 8 (a).

2.3.5. One Feedforward only Cell

This is a three input one output cell with programmable bias. The feedback component is eliminated. Thus the cell diverges from the cellular neural network paradigm and instead is useful for feedforward operations only. This cell is illustrated in Figure 8 (b).

Figure 8: Additional cells. (a) Cells with feedback, feedforward and bias. (b) Cell without feedback. (Diagram labels recovered from the original: in both panels, inputs (u1), (u2), (u3) and bias (sign) on +/- amplifier terminals and state (x); panel (a), titled "Cells with programmable bias", also shows V_ref.)

2.4. Sample Tests: Feedforward Weights and Image Convolution

Convolution is a very common image processing step that precedes many vision tasks. It is often used to generate a second (different) image from the original in which desired features are enhanced and/or undesired characteristics (e.g., noise) are suppressed. Convolution can be described as a continuous spatial or temporal operation, but its application to sampled images is discrete and involves a convolution kernel with discrete values. This discrete convolution operation, denoted by ⊗, between an image I and a kernel k can be described as:

$$I_{x,y} \otimes k = \sum_{i=-N}^{N} \sum_{j=-N}^{N} I_{x+i,\,y+j}\; k_{i,j}$$

Equation (7)

where I is the input image, x and y are the two dimensional image coordinates, and k is a (2N+1) x (2N+1) square kernel.
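As a reference point for the hardware discussion that follows, Equation (7) can be written out directly. The sketch below is a plain software rendering; treating pixels outside the image as zero is an assumption, since the text does not specify a boundary rule, and the function name is hypothetical.

```python
from typing import List

Image = List[List[float]]


def convolve(image: Image, kernel: Image) -> Image:
    """Evaluate Equation (7) at every pixel; k is (2N+1) x (2N+1)."""
    n = len(kernel) // 2
    rows, cols = len(image), len(image[0])
    out = [[0.0] * cols for _ in range(rows)]
    for x in range(rows):
        for y in range(cols):
            acc = 0.0
            for i in range(-n, n + 1):
                for j in range(-n, n + 1):
                    xi, yj = x + i, y + j
                    # Assumed boundary rule: pixels outside the image are zero.
                    if 0 <= xi < rows and 0 <= yj < cols:
                        acc += image[xi][yj] * kernel[i + n][j + n]
            out[x][y] = acc
    return out


if __name__ == "__main__":
    ramp = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
    box = [[1.0] * 3 for _ in range(3)]
    print(convolve(ramp, box)[1][1])  # sum of all nine pixels
```

Note that the index convention of Equation (7) pairs I at offset (+i, +j) with k at (i, j), so no kernel flip is applied.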

$\sum_{kl \in N_r(ij)} B_{ij;kl}\big(u_{kl}(t), u_{ij}(t)\big)$ in Equation (1) and the sum in Equation (7).

Thus, if one sets I = 0 and A = 0, the steady state solution to Equation (1) is

$$x_{ij}(t) = \sum_{kl \in N_r(ij)} B_{ij;kl}\big(u_{kl}(t), u_{ij}(t)\big)$$

which is in fact equivalent to $I_{x,y} \otimes k$ with kernel entries $k_{i,j}$.

Convolution operations pertinent to this project are constrained by the analog VLSI hardware implementation. Even though this may appear to be a disadvantage, in reality, it creates the opportunity to implement a normalized convolution operation.

The described VLSI adapted model performs a normalized operation, which replaces the summation of feedforward weight times input value terms in the canonical model of Equation (1), namely

$$\sum_{kl \in N_r(ij)} B_{kl}\big(u_{kl}(t)\big)$$

with

$$\sum_{kl \in N_r(ij)} B_{kl} \tanh\!\left[\frac{\kappa q}{2kT}\big(u_{kl}(t) - x_{ij}(t)\big)\right]$$

in Equation (4).

We already discussed qualitatively the reasons for this modification in the previous section.

Normalized convolution, denoted by $\otimes_n$, is very similar to the conventional convolution operation, except that the output is normalized by the sum of kernel entries. Because kernel entries are the same across the whole image, the result is essentially division by a normalization factor common to the entire image. The advantage is that the dynamic range of the resulting image is essentially the same as that of the initial image.

$$I_{x,y} \otimes_n k = \frac{\sum_{i=-N}^{N} \sum_{j=-N}^{N} I_{x+i,\,y+j}\; k_{i,j}}{\sum_{i=-N}^{N} \sum_{j=-N}^{N} k_{i,j}}$$

Equation (8)

For the implementation of normalized convolution described above, the amplifier inputs are loaded with voltages from the image, and the amplifier gains are set from the entries of the kernel. Since all conductances (or gain values) are necessarily positive, implementing negative kernel values with this approach requires the ability to define negative input voltages and to select negative polarity for the input to any transconductance amplifier for which a negative kernel value is desired. This cuts the dynamic range of images in half, although the range of kernel entries remains unaffected.
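A software rendering of Equation (8) makes the dynamic range argument easy to check. The sketch below divides by the sum of kernel entries, as the equation states. The optional `by_magnitude` flag divides by the sum of absolute values instead; that variant is only an assumption about how positive gains combined with sign-selected inputs might behave, and is not a formula given in the text. Zero padding at the borders and all names are likewise assumptions.

```python
from typing import List

Image = List[List[float]]


def normalized_convolve(image: Image, kernel: Image,
                        by_magnitude: bool = False) -> Image:
    """Equation (8): weighted sum divided by a kernel-wide normalization."""
    flat = [w for row in kernel for w in row]
    # Equation (8) uses sum(k); by_magnitude is a hypothetical alternative.
    norm = sum(abs(w) for w in flat) if by_magnitude else sum(flat)
    if norm == 0:
        raise ValueError("kernel entries sum to zero; cannot normalize")
    nr, nc = len(kernel) // 2, len(kernel[0]) // 2
    rows, cols = len(image), len(image[0])
    out = [[0.0] * cols for _ in range(rows)]
    for x in range(rows):
        for y in range(cols):
            acc = 0.0
            for i in range(-nr, nr + 1):
                for j in range(-nc, nc + 1):
                    xi, yj = x + i, y + j
                    if 0 <= xi < rows and 0 <= yj < cols:  # zero padding
                        acc += image[xi][yj] * kernel[i + nr][j + nc]
            out[x][y] = acc / norm
    return out


if __name__ == "__main__":
    flat_image = [[0.5] * 5 for _ in range(5)]
    smooth = [[1.0, 2.0, 1.0], [2.0, 4.0, 2.0], [1.0, 2.0, 1.0]]
    # Interior pixels of a constant image keep their value.
    print(normalized_convolve(flat_image, smooth)[2][2])
```

A kernel such as the edge kernel 1 0 -1 sums to zero, so Equation (8) as written is undefined for it; the `by_magnitude` option is included only to show one way a signed kernel could still be normalized.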

Figure 9: Digital input accommodation via parallel transistor cascading. (Diagram labels recovered from the original: Vin (4 bits), MSB to LSB, with transistor W/L ratios of 1, 1/2, 1/4 and 1/8.)

Figure 10: Feedforward cell. The input u is always nonnegative but has a signed representation. Positive entries of the B template select u+ and negative ones select u-. (Diagram labels recovered from the original: u_ij, u_ij+ and u_ij-; a mux controlled by b_00 (sign); b_00 bits from MSB to LSB with W/L ratios of 2, 1, ½ and ¼; state x_ij. MSB: most significant bit. LSB: least significant bit.)

In the tests we performed on the actual VLSI circuits, the entries of the B template in Equations (1) and (4) use the four bit plus sign representation shown in Figure 9 and take on values in the range of -3.75 to 3.75 in increments of 0.25. The cell is equivalent to the feedforward configuration shown in Figure 8 (a), provided that the gains of the amplifiers are set by the B template. With more area and more bits, higher resolution and dynamic range would be possible. Other parameters are as follows:

0 < u < 1
u+ = Vref + u
u- = Vref - u
-3.75 < b < 3.75
u_min < x < u_max

Vref is the representation of zero in this architecture. One suggested value for Vref is the midpoint between the power supply rails GND and Vdd.

Implementation of sign is based on selection of u+ or u-. Note that this also creates the opportunity to implement negative u. In other words, it is possible to accommodate u < 0 values in this structure as well. In that case, again

-1 < u < 1
u+ = Vref + u
u- = Vref - u

Specific realization of the -3.75 to 3.75 range, realizing different values of template B in the architecture:

    b        b sign (V)   b3 (MSB)   b2     b1     b0 (LSB)
    -3.75    0 V          1 V        1 V    1 V    1 V
    -2.00    0 V          1 V        0 V    0 V    0 V
    -1.00    0 V          0 V        1 V    0 V    0 V
    -0.50    0 V          0 V        0 V    1 V    0 V
    -0.25    0 V          0 V        0 V    0 V    1 V
     0.25    5 V          0 V        0 V    0 V    1 V
     0.50    5 V          0 V        0 V    1 V    0 V
     1.00    5 V          0 V        1 V    0 V    0 V
     2.00    5 V          1 V        0 V    0 V    0 V
     3.75    5 V          1 V        1 V    1 V    1 V
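The mapping in the table above follows a simple pattern: the magnitude is a four bit binary count of 0.25 steps (b3 = 2, b2 = 1, b1 = 0.5, b0 = 0.25), each active bit is driven at 1 V, and the sign line is at 0 V for negative and 5 V for positive entries. A small sketch of that mapping is given below; the function names are hypothetical, and assigning the positive sign level to b = 0 is an assumption, since the table does not list a zero entry.

```python
from typing import List, Tuple

STEP = 0.25          # weight resolution
BIT_ON_VOLTS = 1.0   # bias level applied to an active magnitude bit
SIGN_POS_VOLTS = 5.0
SIGN_NEG_VOLTS = 0.0


def encode_weight(b: float) -> Tuple[float, List[float]]:
    """Return (sign voltage, [b3, b2, b1, b0] voltages) for a B entry."""
    steps = round(abs(b) / STEP)
    if abs(abs(b) / STEP - steps) > 1e-9 or steps > 15:
        raise ValueError("b must be a multiple of 0.25 within +/-3.75")
    # Assumption: b = 0 is given the positive sign level.
    sign_v = SIGN_NEG_VOLTS if b < 0 else SIGN_POS_VOLTS
    bits = [(steps >> shift) & 1 for shift in (3, 2, 1, 0)]
    return sign_v, [BIT_ON_VOLTS if bit else 0.0 for bit in bits]


def decode_weight(sign_v: float, bit_v: List[float]) -> float:
    """Inverse mapping, useful for checking a programmed template."""
    ratios = (2.0, 1.0, 0.5, 0.25)
    magnitude = sum(r for r, v in zip(ratios, bit_v) if v > 0.5)
    return magnitude if sign_v > 2.5 else -magnitude


if __name__ == "__main__":
    for value in (-3.75, -0.5, 0.25, 2.0):
        print(value, encode_weight(value))
```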

where Vdd = 5 V and logic 1 = 5 V. We can select 1 V as bias to operate the bias transistors around threshold. 1

1 Note that the state node (x) is referenced to V ref and does not need a signed representation. Positive entry values for the feedback template “A” can be used to steer x to the + terminal and negative entries steer it to the - terminal of the wide range differential amplifier. In the four bit


The feedback configuration of the feedforward weights provides for some safety against the tendency of active current computation units to be attracted to a power supply rail.

Two real images were captured using a CCD camera and presented to the chip three inputs at a time. Three weights for the inputs are set to represent the one dimensional vertical edge kernel of 1 0 -1. Each pixel is presented in this fashion. The whole image was presented three pixels at a time along the horizontal direction to the model circuit shown in Figure 8(b) with the bias amplifier disabled. The result is shown in Figure 11.

Figure 11: Hardware test results from a three input one output cell with no feedback connections (see Figure 8(b) for schematics of the tested circuit). (Panels in the original, for each of the two test images Edge 1 and Edge 2: input, output, and thresholded output.)

The following are rough area estimates based on the full implementation of the cellular paradigm as shown in Figure 3.

In determining the dimensions of the preprototype, we used the implemented models as a guide. Results are in Table 1. The area estimate is given in units of lambda, where lambda equals the technology size, i.e., lambda = 2 µ in 2 µ CMOS technology. Note that as the technology scales, the area will do so accordingly. For instance, for a 1.2 µ process, the size estimate should be multiplied by a scale factor of 0.36, or more precisely 1.44/4.0, which equals (1.2 µ)^2/(2.0 µ)^2. The size estimate also includes a buffer analog storage for the pixel entries.

Area per cell (in lambda sq.)

    Light sensor plus buffers                         10,000 sq. λ
    Feedforward weights, bias and data transfers      75,000 sq. λ
    Feedback weights                                  40,000 sq. λ
    Local logic                                       25,000 sq. λ
    Routing overhead (50%)                            75,000 sq. λ
    Total (one block only)                           225,000 sq. λ   (475 x 475 square)
    Total (16 x 16 blocks)                        57,600,000 sq. λ   (7600 λ x 7600 λ)

I/O pin count estimate

    16 I/O (per column)                16
    Templates                          76
    Power                               8
    Column select / read write         10
    Data xfer / clocks / control       20
    Total                             130

Table 1: Area estimates for implementing the full CNN architecture

These data show that a 16x16 pixel chip can be implemented using a roughly 2.0 mm x 2.0 mm area in 0.25µ technology. Pin count is also within most standard package limits. A one third inch chip (8.0 mm x 8.0 mm) in the same technology would be able to contain 64 x 64 pixels.
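The arithmetic behind Table 1 and the chip size figures above can be reproduced in a few lines. The sketch below follows the convention stated earlier that lambda equals the technology size; the dictionary keys and function names are hypothetical, and the per-cell side is computed as an exact square root rather than the rounded 475 λ used in the table.

```python
import math

# Per-cell area contributions from Table 1, in square lambda.
CELL_AREA_LAMBDA2 = {
    "light sensor plus buffers": 10_000,
    "feedforward weights, bias and data transfers": 75_000,
    "feedback weights": 40_000,
    "local logic": 25_000,
}
ROUTING_OVERHEAD = 0.5  # 50% of the core area


def cell_area_lambda2() -> float:
    """Total area of one cell including routing overhead."""
    core = sum(CELL_AREA_LAMBDA2.values())
    return core * (1.0 + ROUTING_OVERHEAD)


def array_side_mm(cells_per_side: int, lambda_um: float) -> float:
    """Side length of a square array of cells, in millimeters."""
    side_lambda = cells_per_side * math.sqrt(cell_area_lambda2())
    return side_lambda * lambda_um / 1000.0


def area_scale(from_um: float, to_um: float) -> float:
    """Factor by which area changes when moving between technologies."""
    return (to_um / from_um) ** 2


if __name__ == "__main__":
    print(cell_area_lambda2())                # square lambda per cell
    print(round(array_side_mm(16, 0.25), 2))  # 16 x 16 array side, mm
    print(round(array_side_mm(64, 0.25), 2))  # 64 x 64 array side, mm
    print(round(area_scale(2.0, 1.2), 2))     # 2 micron to 1.2 micron
```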

There are several comments to be made at this point:

(1) A commercial camera offers about an order of magnitude higher resolution.

(2) Only a small portion of the cell is light sensitive. Consequently, the sensor would have a very small fill factor. Even for operation at a quite low resolution, the chip would require a mechanism for focusing the light incident over the whole cell onto the light sensitive area.

(3) The sensitivity of CMOS sensors to light (or quantum efficiency) is lowered as the technology shrinks. Thus using 0.25 micron technology may not be feasible.

(4) 3x3 kernels may be too small and restrictive.


3. A New Paradigm: The Dual Distributed Architecture

3.1. Motivation

In order to build a commercially viable product, one must pay close attention to the findings of the previous section. Commercial objectives would generally dictate that:

(1) The light sensing resolution of the imaging system be close to that available commercially.

(2) The fill factor be significantly higher.

(3) The sensor performance be maintained with shrinking technology size.

(4) A set of programmable kernel sizes be available, rather than only a 3x3 kernel.

The cellular neural network paradigm, as is, is not likely to allow us to reach these goals in a cost effective manner using currently available technology. We therefore undertake to think about image sensing and processing outside of it.

Since our goal is to build a system in which one can leverage the speed of silicon against the large area requirements of a cellular processor, we view again the components of the cell as they are listed in Section 2, namely:

1. a photodiode and circuits for active light sensing,

2. transconductance amplifiers for feedforward template weight multiplication,

3. wide range transconductance amplifiers for feedback template weights,

4. analog/single bit digital local dynamic memory cells,

5. a data bus for transfers,

6. local programmable logic, and

7. read/write data controls.

At this point, we take a systematic approach to meeting our objectives itemized above. We note that:

(1) To improve the resolution one needs to shrink the cell.

(2) To improve the fill factor, one needs to increase the light sensitive areas.

(3) To keep the sensor operational, one may need to build the sensors (or the whole chip) using a larger technology which again means that one needs to make the cell yet smaller.

(4) To increase the kernel size, one may need to build yet a larger cell that connects to more of its neighbors.

Item four runs counter to all the rest until we decide to break up the cell into two, leading to the division of the array into (1) sensory convolution unit and (2) logic unit. This concept is illustrated in Figure 12.

Figure 12: This figure illustrates the dual distributed cellular architecture and its two elements. For the sake of an uncluttered simple diagram, the figure shows a 5 x 5 array and many of the elements of the cell have been omitted. (Diagram labels recovered from the original: Programmable Convolution Pixel Array (PCPA) with row select, column select, row and column address, timing and control, and program memory; pixel detail with photodiode, reset, input u, outputs from neighboring cells, and data bus to process area. Programmable Cellular Logic Array (PCLA) with row select, column select, row and column address, timing and control, and program memory; cell detail with logic, digital memory, and data bus. External processor.)

In this manner, one can also implement (in the feedforward connections) arbitrary size kernels, including a kernel that is as large as the sensor itself, but only one such kernel at a time. In summary, the described architecture contains two distinct structures: (i) PCPA, a light sensing grid with convolution capability, which could also include some short term memory, and (ii) PCLA, a digital logic and memory area on which a "mental picture or representation of the image" is recreated. This is in some very coarse sense analogous to the concept of the retina and the visual cortex being separated from one another, where the latter constructs a mental picture from the information it receives from the former. Both areas compute in parallel and communicate with each other as needed in a serial but random access fashion. Also, it is primarily the digital logic array (PCLA) that communicates with a conventional digital signal processor, microprocessor or microcontroller.


3.2. The Signed Output Pixel (SOP)

For the functioning of the convolution circuits, the light sensor should produce signed outputs, i.e., there needs to be a (+) and a (-) pixel output. The positive (+) output is fed into the amplifier if the amplifier holds a positive kernel value. Correspondingly, the negative (-) pixel output is used for a negative kernel value. A one dimensional example is given below.

Suppose that one needs to implement a one dimensional edge kernel of [ 1 0 –1]. The equivalent arithmetic operation is the sum pixel[i-1] + (– pixel[i+1]). Since current summation is used, one needs to add the current for the positive representation of pixel[i-1] to the current for negative representation of pixel[i+1].
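The one dimensional example above can be traced numerically. The sketch below uses the signed representation u+ = Vref + u and u- = Vref - u given earlier; modeling each amplifier as an ideal transconductance that sources a current proportional to its input minus Vref is an assumption made for illustration, as are the function names and the values of `v_ref` and `gm`.

```python
from typing import Sequence, Tuple


def signed_outputs(u: float, v_ref: float) -> Tuple[float, float]:
    """Return the (+) and (-) pixel output voltages for a light value u."""
    return v_ref + u, v_ref - u


def edge_response(pixels: Sequence[float], i: int,
                  v_ref: float = 2.5, gm: float = 1.0) -> float:
    """Summed current for the kernel [1 0 -1] centered on pixel i."""
    plus_left, _ = signed_outputs(pixels[i - 1], v_ref)
    _, minus_right = signed_outputs(pixels[i + 1], v_ref)
    # Assumed ideal amplifiers: current = gm * (input voltage - v_ref).
    current_left = gm * (plus_left - v_ref)     # kernel entry +1
    current_right = gm * (minus_right - v_ref)  # kernel entry -1
    return current_left + current_right


if __name__ == "__main__":
    row = [0.2, 0.2, 0.2, 0.8, 0.8]
    print([round(edge_response(row, i), 3) for i in range(1, len(row) - 1)])
```

Under these assumptions the summed current is proportional to pixel[i-1] - pixel[i+1], so it is zero over uniform regions and nonzero across the step.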

Figure 13 shows the layout of such a single signed output active sensor. The corresponding schematic of the signed sensor is illustrated in Figure 14, where the positive and negative pixel representations are marked. Note that the additional follower is necessary since the photocurrent is very small and the photosensing node should not be perturbed.

Figure 13: Layout of the signed sensor. Active area is a square of 100 λ x 100 λ. The whole cell covers a 195 λ x 130 λ area. The (grounded) METAL2 layer covering the areas of the cell that should not be exposed to light is not shown.


Figure 14: Schematic of the signed sensor.

3.3. Programmable Convolution Pixel Array (PCPA)

Figure 15 shows a representative drawing of the schematic of one element of the sensor array (PCPA). Each such pixel cell can be addressed by the combination of row and column select signals. Weight bits can thus also be written to selected pixels. The pixel midpoint determines the "0" value, i.e., no light, of the signed pixel value representation. Resetting the pixel charges the pixel output to that value. A dedicated GND signal need not be routed, since a metal layer covering the circuits (with openings at light sensitive areas) can be grounded. Each cell, shown schematically in Figure 15, occupies about a 350 λ by 350 λ square area and contains:

1. one signed output pixel (SOP),

2. four D flip flops for storage of three bit weight and sign,

3. four multiplexers,

4. one wide range transconductance amplifier connected in the unity follower configuration, and

5. a three input AND gate.

(Figure 14 labels recovered from the original schematic: pixel_reset, inv_ref, pix_ref, v_folbias, sign, pixel_out, (+) pixel output, (-) pixel output; the transistors are drawn with W = 22 µ and L = 2 µ; the photosensor model can be replaced by the photodiode symbol.)

period of time. Several data and control signals are routed to each cell. Control signals common to all cells are (1) clock weight, (2) reset photodiode, and (3) sign. Addressing signals are row and column. Data signals common to all cells are (1) pixel reference, (2) inverter reference, (3) weight high, (4) weight low, (5) weight MSB, (6) weight MID, and (7) weight LSB. Furthermore, elements of each row have a common row select signal and elements of each column have a common column select signal. All outputs are summed on a common line, column read. The cell design is shown in Figure 15.

Figure 15: Schematic diagram of a cell of the PCPA.

The photo sensor was illustrated in Figure 14. Four weight bits (three magnitude and one sign) are stored in D-flip-flops. The weight bits are clocked in only when both the row and the column are selected, as well as when the weight clock bit is active. The outputs of the weight flip flops select between the weighthigh and weightlow. The wide range transconductance amplifier at the output stage has three bias transistors with W/L ratios of ¼ ½ and 1, each corresponding to the magnitude bits. If all weights are zero, the pixel does not contribute to the output. If any of the weights is non-zero, based on the sign, the pixel makes a positive or negative current contribution to the overall reading. The data signal, weighthigh also determines the gain on the photosensor. The D flip flops could be replaced with dynamic memory to save silicon area.

(Figure 15 labels recovered from the original schematic: weighthigh, weightlow, biasbit0, biasbit1, biasbit2, pix_ref, inv_ref, pixel_reset, pixel_readout, setweight, column, row, signbit; sign with follower; D flip-flops (CLK, D, Q, QBAR) and multiplexers (A, B, SA, SB).)

Figure 16: Layout of a single cell with areas of the circuit elements marked.

In Figure 17, a PCPA layout of 5 x 5 pixels is illustrated. The 2 micron CMOS technology design with its pads fits into an area of 2.3 mm x 2.3 mm and is embedded in a 40 pin DIP chip package. All 25 pixel outputs are summed on a common output line. One can select different signs and weights for each of the cells and thus perform arbitrary convolution operations. Arbitrary 4-bit kernels up to 5 x 5 can be programmed.
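The way a programmed PCPA turns 25 pixel values into a single reading can be sketched as follows. Each cell holds three magnitude bits, mapped to the bias transistor W/L ratios of 1, ½ and ¼ described above, plus a sign bit. Treating the common output line as a plain signed weighted sum of pixel values is an idealization assumed here; the actual circuit response may include normalization and nonlinearity. All names are hypothetical.

```python
from typing import List, Tuple

# W/L ratios of the three bias transistors: MSB, MID, LSB.
BIT_RATIOS = (1.0, 0.5, 0.25)

# One stored weight: ((msb, mid, lsb), negative_sign).
Weight = Tuple[Tuple[int, int, int], bool]


def cell_weight(bits: Tuple[int, int, int], negative: bool) -> float:
    """Signed gain contributed by one cell's stored weight bits."""
    magnitude = sum(bit * ratio for bit, ratio in zip(bits, BIT_RATIOS))
    return -magnitude if negative else magnitude


def pcpa_readout(pixels: List[List[float]],
                 weights: List[List[Weight]]) -> float:
    """Idealized reading on the common line: sum of weight * pixel."""
    total = 0.0
    for pixel_row, weight_row in zip(pixels, weights):
        for pixel, (bits, negative) in zip(pixel_row, weight_row):
            total += cell_weight(bits, negative) * pixel
    return total


if __name__ == "__main__":
    off: Weight = ((0, 0, 0), False)
    weights = [[off] * 5 for _ in range(5)]
    weights[2][1] = ((1, 0, 0), False)  # +1 left of center
    weights[2][3] = ((1, 0, 0), True)   # -1 right of center
    image = [[0.0, 0.0, 1.0, 1.0, 1.0] for _ in range(5)]
    print(pcpa_readout(image, weights))
```

Cells whose three magnitude bits are all zero contribute nothing, which matches the description that an unweighted pixel does not contribute to the output.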


3.4. Programmable Cellular Logic Array (PCLA)

The output of a convolution operation performed by the PCPA is an analog voltage value. While this rapid arbitrary size kernel convolution operation is extremely useful, there are many early vision operations that require further processing of the results of many convolution operations. The programmable cellular logic processor is a means of implementing this stage. The cellular logic processor is a binary (or digital) set of programmable logic elements arranged on a cellular grid. The size of the PCPA and the size of the PCLA could be different, where the latter grid functions as a scratch-pad for a set of early vision operations, such as shape detection, contour following, and pattern matching, performed using the outputs of the former grid, potentially at different resolutions.

Furthermore, each element of the PCLA can be addressed like a memory array where the PCPA results can be stored and processed. Added functionality would result from the ability to transfer bits between arbitrary locations of the cellular logic grid.

A simple implementation is illustrated in Figure 18.

(a) 3 x 3 arrangement (b) contents of one element

Figure 18: Minimal implementation of an element of the PCLA. Note that in (a) only the center cell connections are shown to prevent clutter. Also, as shown in (b), the cell output has to be staged to prevent race conditions.

Each cell of the PCLA minimally accepts inputs from (1) the outputs of its immediate neighbors, and (2) the PCPA. Furthermore, added functionality could be realized (1) with increased connectivity, where each cell receives more inputs, such as those from additional cells in the neighborhood, or from arbitrary cells on the array via memory-like addressing; and (2) with increased cell memory, where the results of previous operations can be stored in the cell. The logic within the cell should be programmable. In fact, the program will change many times during the operation. Sample logic functions could be AND, OR, INVERT, XOR, etc., applied to a subset of the available inputs.
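The minimal cell described above can be sketched in software as a synchronous update of a binary grid. The sketch is only illustrative: the function names, the zero-valued boundary, and the two example programs are assumptions. Writing results into a separate buffer stands in for the staged cell output, so that no cell reads a neighbor that has already been updated in the same step.

```python
def pcla_step(state, pcpa_bits, logic):
    """Apply one programmed logic function to every cell of the array.

    state: 2-D list of 0/1 cell outputs from the previous step.
    pcpa_bits: 2-D list of 0/1 thresholded PCPA outputs, same shape.
    logic: function(neighbors, pcpa_bit) -> 0 or 1, the current program.
    Cells outside the grid are read as 0 (assumed boundary condition).
    """
    rows, cols = len(state), len(state[0])
    staged = [[0] * cols for _ in range(rows)]  # next outputs, kept separate
    for r in range(rows):
        for c in range(cols):
            neighbors = [
                state[r + dr][c + dc]
                if 0 <= r + dr < rows and 0 <= c + dc < cols
                else 0
                for dr in (-1, 0, 1)
                for dc in (-1, 0, 1)
                if (dr, dc) != (0, 0)
            ]
            staged[r][c] = int(bool(logic(neighbors, pcpa_bits[r][c])))
    return staged


def load(neighbors, pcpa_bit):
    """Program 1: copy the PCPA bit into the cell."""
    return pcpa_bit


def boundary(neighbors, pcpa_bit):
    """Program 2: keep set cells that have at least one clear neighbor."""
    return pcpa_bit and not all(neighbors)


image = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
blank = [[0] * 5 for _ in range(5)]
loaded = pcla_step(blank, image, load)
outline = pcla_step(loaded, image, boundary)
for row in outline:
    print(row)
```

Changing the program between the two steps mirrors the remark that the cell logic will be reprogrammed many times during an operation. Here the second step leaves only the outline of the 3 x 3 block.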

The PCLA would benefit from the use of the emerging reconfigurable field programmable gate arrays (FPGAs). This is a relatively new field; however, researchers are keenly looking at ways to capitalize on the inherent flexibility of these devices to build a better computing paradigm.

[Figure 18(b) labels: thresholded output from the PCPA; outputs of (8) immediate neighbors; LOGIC; output of cell.]


One specific concept in this domain, termed the plastic cell architecture (PCA), brings forth a new circuit type that is laid out as an array of identical computing elements, or cells, which can dynamically reconfigure themselves for specific problems [Nagami98]. This computing paradigm offers a feature beyond the common reconfigurable FPGA concept. So far, it has been possible to reconfigure circuits only via software downloaded to one or more FPGAs, after which the chips directly execute the prescribed functions as hardwired circuits. The added feature is the ability of one circuit to dynamically configure another circuit. The resulting processing array is able to create specialized cells, which in turn allows a cellular array like the PCLA to configure itself based on the outputs of its neighbors, or of itself. This level of data-driven performance allows very complex functions to be implemented from very simple rules.

3.5. Communication between the PCPA and PCLA

It is evident that, in addition to power and ground signals, many other common data, control, and address signals need to be distributed all across the chip. As is the case with many photosensor chips, the PCPA portion of the chip also needs to be covered with a layer of metal, with openings only at the photosensitive areas, so that the packaged chip can be exposed to light. Typically, one would ground the layer of metal that covers the whole chip. A similar layer could also be used on the PCLA portion of the dual architecture to carry the output of the PCPA to all points of the logic array.

The row select and column select signals of the PCPA are reminiscent of memory addressing schemes. A specific pixel is selected only when both signals are logic high. These signals are used only to load the weights into the cell.

Elements of the PCLA can also be addressed as one would address memory cells. A row of cells could be thought of as a long word. A set of write and shift operations could replace the need to route multiple address signals across the PCLA.
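One way to picture the write-and-shift idea is to treat the array as a stack of row-wide words that is filled from one edge. The sketch below is a hypothetical illustration only: the function names and the choice to write into row 0 and shift toward higher row indices are assumptions, since the text does not fix a particular scheme.

```python
def shift_in_row(array, word):
    """Shift every row down by one position and write `word` into row 0.

    The last row is discarded, as in a shift register of row-wide words.
    """
    if len(word) != len(array[0]):
        raise ValueError("word length must match the row width")
    return [list(word)] + [list(row) for row in array[:-1]]


def load_array(rows, cols, words):
    """Fill a rows x cols array using only write-and-shift operations."""
    array = [[0] * cols for _ in range(rows)]
    for word in words:
        array = shift_in_row(array, word)
    return array


# The first word written ends up in the last row after three shifts.
print(load_array(3, 2, [[1, 0], [0, 1], [1, 1]]))
# [[1, 1], [0, 1], [1, 0]]
```

In this picture only a write line for the first row and a global shift signal are needed, rather than a separate address for every row.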

4. Conclusions and Future Work

With the advent of CMOS light sensors, the implementation of integrated sensor-processor devices has taken a big step towards realization. The integration of CMOS-based light sensors with locally connected arrays will form very powerful, low-power integrated sensor-processor arrays that will meet the visual computation challenges of the twenty-first century. In this context, the two architectures presented in this paper offer two potential roadmaps.

For successful commercial realization, an integrated image sensor-processor must be packaged as a viable marketable product or a cost-effective problem solution option. Actual demonstrations should not only prove functionality but also address how the investment in the full-scale development of these systems can return high dividends to those who take the risk.

IC Tech has demonstrated the feasibility of several neurally inspired image processing systems to address specific opportunities in emerging consumer vision markets. This paper presented two such architectures, together with a discussion of each and some test results.


Acknowledgement:

This work was supported in part by the Department of the Air Force through SBIR contract numbers F29601-97-C-0107 and F29601-98-C-0023.

References:

[Anton93] M. Antonini, M. Barlaud, P. Mathieu, and I. Daubechies. "Image coding using wavelet transforms." IEEE Transactions on Image Processing, Vol. 1, No. 2, pp. 205-220, April 1992.

[Ballard82] D.H. Ballard and C.M. Brown. Computer Vision, Prentice Hall (1982).

[Chua93] L.O. Chua and T. Roska. "The CNN Paradigm." IEEE Transactions on Circuits and Systems, Vol. 40, No. 3, March 1993.

[Chua95] International Symposium on Circuits and Systems, ISCAS'95 Tutorial on Cellular Neural Nets, L.O. Chua and T. Roska, Organizers, Seattle, WA.

[Chua88] L.O. Chua, L. Yang. Cellular Neural Networks: Theory and Applications, IEEE Transactions on Circuits and Systems, Vol 35, pp. 1257-1290, 1988.

[Cognex95] Product literature available from Cognex Corporation, Massachusetts.

[Delb89] T. Delbrück, "Bump circuits for computing similarity and dissimilarity of analog voltages," CNS Memo 10, California Institute of Technology.

[Delb94] T. Delbrück, G. Erten, and F. Salam. Tutorial on Low Power VLSI Design, Tutorial Session in Midwest Circuits, Systems and Signal Processing Conference, Lafayette, Louisiana (1994).

[Erten93] G. Erten. An Analog VLSI Architecture for Stereo Correspondence. Doctoral Dissertation, California Institute of Technology, (1993).

[Erten95] Erten, G., Oh, J., Kay, S., and Salam, F.M. (1995). "Designing a Real-Time Image Processing Interface using ASIC and FPGA devices." IC Tech, Inc. Internal Report # 95-0715.

[Erten96] G. Erten, R. Goodman. Analog VLSI Chip for Stereo Correspondence between 2D Images, IEEE Transactions on Neural Networks, March 1996.

[Fossum96] E.R. Fossum. "Novel sensor enables low-power miniaturized imagers." Photonics Spectra, January 1996.

[Hutch88] J. Hutchinson, C. Koch, J. Luo, C. Mead. Computing motion using analog and binary resistive networks. IEEE Computer, pp. 52-63 (March 1988).

[Kawa91] S. Kawahito, J. Takahashi, A. Ashiki, T. Nakamura. A Parallel Signal Processing Technique Using MOS Current-Mode Analog Circuits for Smart Image Sensors. 1991 IEEE Transducers, pp. 326-329.

[Lee93] J.C. Lee, B.J. Sheu, W.C. Fang, R. Chellappa. VLSI Neuroprocessors for Motion Detection. IEEE Transactions on Neural Networks, 4,2 (1993).

[Lee94] C. Lee, J. Pineda de Gyvez. Single Layer CNN Simulator, Proceedings of the 1994 IEEE International Symposium on Circuits and Systems (ISCAS), Vol. 6, pp. 217-220.

[Mahow92] M. Mahowald. VLSI Analogs of Neuronal Visual Processing. Doctoral Dissertation. California Institute of Technology, (1992).

[Mead89a] C. Mead. Adaptive Retina. Analog VLSI Implementation of Neural Systems, C. Mead, M. Ismail, editors. Kluwer Academic Publishers, Boston, Massachusetts (1989).


[Moore91] A. Moore, J. Allman, R.M. Goodman. A real-time neural system for color constancy. IEEE Transactions on Neural Networks, 2, 237-247 (1991).

[Nagami98] K. Nagami, K. Oguri, T. Shiozawa, H. Ito, and R. Konishi. Plastic Cell Architecture: Towards reconfigurable computing for general purpose. IEEE Symposium on Field-Programmable Custom Computing Machines, Napa, California, April 14-17, 1998.

[Oku90] M. Okutomi, T. Kanade. A locally adaptive window for signal matching. Proceedings of the IEEE Third Conference on Computer Vision, pp. 190-199 (1990).

[Park94] S. Park, S.J. Min, S.I. Chae. Stereo Correspondence with Discrete-Time Cellular Neural Networks, Proceedings of the 1994 IEEE International Symposium on Circuits and Systems (ISCAS), Vol. 6, pp. 225-228.

[Pryz91] K. Przytula, W. Lin, and V. Kumar, Partitioned implementation of neural networks on mesh connected array processors, VLSI Signal Processing IV, H. Moscovitz, K. Yao, and R. Jain, Editors. IEEE Press (1991).

[Rodrig95] A. Rodriguez-Vazquez, S. Espejo, and R. Dominguez-Castro. "VLSI Implementation of the Analogic CNN Universal Machine Including On-Chip Photosensors" in Microsystems Technology for Multimedia Applications, IEEE Press (1995).

[Salam88] F. Salam, "A Formulation For The Design of Neural Processors," The Proceedings of the 2nd IEEE ICNN, San Diego, CA, July 24-27, 1988, pp. I-173-180.

[VVL95] Product literature: VVL 1070 from VLSI Vision Ltd. (1995).

[Wade98] W. Wade. SIA sees 1.8% drop in 1998 chip sales. Semiconductor Business News, CMP Media, Inc. June 3, 1998.