Evaluating Vision Processing Techniques - Semantic Labelling for Prosthetic Vision

In this section, we demonstrate how new vision processing techniques for prosthetic vision may be applied within our framework. We show how SPV rendering can be used to qualitatively evaluate and compare processing methods in terms of potential functional outcomes. We describe the implementation of a bounding box object detector and an image segmentation system, in the context of processing stages in our framework. Either or both of these techniques may be applied in a practical prosthetic vision device.

§3.3 Evaluating Vision Processing Techniques 35

3.3.1 Object Detection

Current bionic vision technology provides low visual acuity for any significant field of view [22]. This can impede the recognition of objects or symbolic information in the environment. Allowing the user to enlarge (zoom) the image of the environment can increase acuity, but at the cost of field of view. Users may have difficulty locating an object of interest with a narrow field of view, but may not be able to identify it with a wider field of view and corresponding reduction in acuity.

Systems with manually controlled field of view allow the user to ‘scan’ a scene to find and identify objects of interest [245, 321]. In this section we propose a system that can detect and localise potential objects of interest, such as street signs, within the input image. Automatically restricting the field of view to an area where a sign is present would effectively automate the manual scanning process. This is similar to the face-based fixation system of [127], which enables face and expression recognition through automatic zooming on human faces. Our system could improve the ability of users to read signs at a distance, an activity associated with quality of life in the VF-14 index of functional impairment [271, 272].

We take an image of the environment from a camera as input, and scan it using a sliding window approach. We then use a binary classifier to detect the presence of an object at each window location, a widely used technique for localising instances of objects in images [299, 74, 331]. Each window is a rectangular subset of the pixels in the image, defined by the width, height and 2D position of the window. We vary these window parameters to enumerate all windows in the image and use each window as an input to a classifier. Windows in which the classifier detects an object of interest can then be downsampled to compute phosphene intensities. When multiple windows are detected, a user interface (such as gaze tracking) may be used to select which to sample. We illustrate this in Figure 3.4.

We set window parameters assuming a 640×480 pixel input image. Windows are square, with width and height both set to 24, 32 or 48 pixels. Position offsets are scanned in a grid, separated by 8 pixels horizontally and vertically. We exclude windows that extend outside the boundaries of the image, resulting in 12,770 windows per image. Note that windows are classified independently of each other, and can be processed in parallel.
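The window enumeration described above can be sketched as follows. This is a minimal illustration rather than the implementation used here: the function names `sliding_windows` and `crop` are hypothetical, offsets are assumed to start at the top-left corner of the image, and the total it yields for a full-size image depends on these boundary conventions, so it is not guaranteed to reproduce the count quoted above.

```python
from typing import Iterator, Tuple

# (x, y, size): top-left corner and side length of a square window, in pixels.
Window = Tuple[int, int, int]


def sliding_windows(width: int, height: int,
                    sizes=(24, 32, 48), stride: int = 8) -> Iterator[Window]:
    """Yield every square window that lies fully inside a width x height image.

    Assumption: offsets start at (0, 0) and advance by `stride` pixels.
    """
    for size in sizes:
        # range() stops before any offset that would push the window past the edge.
        for y in range(0, height - size + 1, stride):
            for x in range(0, width - size + 1, stride):
                yield (x, y, size)


def crop(image, window: Window):
    """Return the pixels of one window from an image stored as a list of rows."""
    x, y, size = window
    return [row[x:x + size] for row in image[y:y + size]]
```

Because each window is produced independently, the resulting sequence could be split across workers before classification.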

We use the method of [331] trained on a set of images of street signs to classify windows. With an efficient parallel implementation, classification on large numbers of windows can be performed in real-time [212]. Each window is normalised by resampling with bicubic interpolation to 32×32 pixels for feature computation.

[Figure 3.4 block diagram. Labels: RGB image, Sample window (for each window), Classifier, Trained model, Select, User interface, Sampling, Phosphene locations, Phosphene intensities.]

Figure 3.4: Our object detection method as a processing stage in our framework. The sampling component is identical to our sampling system described earlier in this chapter.

We compute a HOG feature vector [74, 331] on the RGB pixel intensities of each window, with nine orientation bins over 16×16 pixel blocks of four 8×8 pixel cells, as in [74]. The resulting feature vectors are then classified by an AdaBoost cascade classifier [299], which we train on 20,000 positive and 100,000 negative example windows* from the NICTA Road Scene Database [212]. HOG features yield good discriminative power for street signs, with a similar system to ours achieving a detection rate of 98.8%, with a false positive rate of $10^{-10}$ on real-world images [212].
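To give a sense of the size of the feature vector passed to the classifier, the arithmetic for the HOG layout above can be sketched as below. The helper `hog_descriptor_length` is hypothetical, and the block stride of one cell (8 pixels) is an assumption borrowed from the common configuration of [74]; the text does not state the stride used here.

```python
def hog_descriptor_length(win_w: int, win_h: int, cell: int = 8,
                          block_cells: int = 2, bins: int = 9,
                          block_stride_cells: int = 1) -> int:
    """Number of values in a HOG descriptor for one window.

    Assumption: blocks of block_cells x block_cells cells slide by
    block_stride_cells cells, so neighbouring blocks overlap.
    """
    cells_x, cells_y = win_w // cell, win_h // cell
    blocks_x = (cells_x - block_cells) // block_stride_cells + 1
    blocks_y = (cells_y - block_cells) // block_stride_cells + 1
    # Each block concatenates one histogram per cell.
    return blocks_x * blocks_y * block_cells * block_cells * bins


# A 32x32 normalised window: 4x4 cells, 3x3 overlapping blocks, 36 values per block.
print(hog_descriptor_length(32, 32))  # 324
```

Under these assumptions each normalised 32×32 window is described by a few hundred values, which keeps per-window classification cheap.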

Note that this system is reliant on an accurate and robust classifier. If the classifier fails to detect an object in a practical setting, there is no benefit to this system. The types of object may be limited to those that are easily detected with a sliding window approach. Furthermore, the classifier must be trained on specific examples of objects, which limits its flexibility. Recent work with convolutional neural networks has improved the accuracy and robustness of object detection [110], but large quantities of training data are required.

3.3.2 Image Segmentation

The bounding box method addresses the limited acuity of prosthetic vision devices by dynamically changing the field of view to assist the user in identifying objects. However, prosthetic vision devices are also limited by the number of discrete levels of stimulation that may be produced [171]. Changing the field of view alone may not improve a user’s ability to identify objects with low visual contrast.

In this section, we use an image segmentation algorithm to automatically divide an image into a set of regions by appearance. We can then convert the shape of a region, or set of regions, to a binary image, and sample this to produce phosphene intensities. This approach relies on only two distinct levels of intensity (on and off) for each phosphene, so it is not limited by the contrast in the input image or the specific number of discrete levels of stimulation. We expect that image segmentation will enable faster and more accurate symbol recognition than the bounding box method, by effectively improving the contrast of symbols in the resulting phosphene image.

Figure 3.5: A simplified example of graph-based image segmentation. A graph is formed where each node (circle) corresponds to an image pixel, and edges connect adjacent nodes. Edges are weighted by the colour difference between the nodes they connect, and edges with high weight are depicted as dashed lines. By removing edges along a closed path, the graph is split into multiple regions. The path is selected such that the resulting regions have a lower total edge weight, removing edges with high weight.

In addition to the camera image, this system takes a fixation point as input. This allows future prosthetic vision systems to measure the user’s eye gaze direction as an alternative to requiring the user to turn their head to focus on an object of interest. The fixation point is provided as a pixel location in the image, and in this section we simulate reading or scanning behaviour by moving it through the image. In systems where eye gaze tracking is not available, the fixation point may simply be fixed in the centre of the image, or controlled by another form of user input.

We use the method of [94] to segment the image, which we briefly describe here and in Algorithm 3.1. The image is modelled as a graph (illustrated in Figure 3.5), with one node for each pixel in the original image. Let $G = (V, E)$ be an undirected graph, with vertices $v \in V$ corresponding to image pixels to be segmented, and edges $(v_i, v_j) \in E$ corresponding to pairs of neighbouring vertices. Each edge $(v_i, v_j) \in E$ has a corresponding weight $w(v_i, v_j)$, which is a non-negative measure of the dissimilarity between elements $v_i$ and $v_j$.

A segmentation $S$ is a partition of the vertices $V$ into components $C \subseteq V$. We desire a segmentation $S = (C_1, \ldots, C_r)$ such that pixels within a component are similar to each other, that is, the appearance difference between pixels in a component is small. The overall internal difference of a component $C$ is defined as the largest edge weight in the minimum spanning tree of the component, $\mathrm{MST}(C, E)$. That is,

$$\mathrm{Int}(C) = \max_{e \in \mathrm{MST}(C, E)} w(e).$$


Figure 3.6: (a) The original image, zoomed to show detail; (b) Part of the original image corresponding to (a), segmented with the method of [94], with segments shown in different colours; (c) result of fixation point region selection and merging.

Components can be merged while maintaining a good segmentation if an edge between them has weight no greater than the internal difference (plus some threshold) of each component. That is, if there exists an edge $e \in E$ connecting vertices in components $C_1$ and $C_2$, the components can be merged if

$$w(e) \le \min\bigl(\mathrm{Int}(C_1) + \tau(C_1),\; \mathrm{Int}(C_2) + \tau(C_2)\bigr),$$

where $\tau(C)$ is a threshold function [94].

We follow [94] in using a threshold function of the form τ(C) = k/|C|, where |C| is the number of vertices in C, and k is some constant parameter. Starting with a segmentation such that every vertex is in its own component, components can be repeatedly merged in order to achieve a good segmentation with a small number of components. The threshold parameter k influences the size of the resulting regions such that larger values of k cause a preference for larger regions. We set k = 500 as we found this value to yield qualitatively good results. We illustrate a typical segmentation in Figure 3.6.
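As a worked illustration of the merging criterion with $\tau(C) = k/|C|$, consider two hypothetical components. The sizes, internal differences and edge weights below are invented for the example; only $k = 500$ is taken from the text.

```latex
\begin{align*}
|C_1| &= 100, & \mathrm{Int}(C_1) &= 4, & \tau(C_1) &= 500/100 = 5,
  & \mathrm{Int}(C_1) + \tau(C_1) &= 9,\\
|C_2| &= 10,  & \mathrm{Int}(C_2) &= 1, & \tau(C_2) &= 500/10 = 50,
  & \mathrm{Int}(C_2) + \tau(C_2) &= 51.
\end{align*}
\[
\min(9, 51) = 9, \qquad
w(e) = 8 \le 9 \;\Rightarrow\; \text{merge}, \qquad
w(e) = 12 > 9 \;\Rightarrow\; \text{do not merge}.
\]
```

The smaller component tolerates a much heavier connecting edge, but the merge is governed by the stricter of the two bounds, which in this example comes from the larger component.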

Edge weights are calculated by smoothing the image with a Gaussian filter, then finding the L2 (Euclidean) distance between adjacent pixel intensities in RGB space, as per [94]. The Gaussian filter compensates for image noise or other artefacts. For the examples shown in this section, we found $\sigma = 0.1$ to be sufficient. Each vertex $v_i$ has a corresponding intensity vector $I_i$, with elements corresponding to red, green and blue pixel intensity after applying the filter. For an edge $e = (v_i, v_j)$, the weight is $w(e) = \lVert I_i - I_j \rVert$.
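A minimal sketch of the edge weight computation follows. It makes two assumptions that are not stated in the text: the image is taken to be already smoothed, and only 4-connected neighbours (right and below) are linked, whereas other neighbourhoods are possible. The function name `grid_edges` is hypothetical.

```python
import math


def grid_edges(image):
    """Build weighted edges between adjacent pixels of an RGB image.

    `image` is a list of rows of (r, g, b) tuples, assumed to be smoothed
    already. Vertex index is y * width + x. Returns (weight, i, j) tuples,
    where weight is the Euclidean distance between the two colours.
    """
    height, width = len(image), len(image[0])
    edges = []
    for y in range(height):
        for x in range(width):
            i = y * width + x
            if x + 1 < width:   # neighbour to the right
                edges.append((math.dist(image[y][x], image[y][x + 1]), i, i + 1))
            if y + 1 < height:  # neighbour below
                edges.append((math.dist(image[y][x], image[y + 1][x]), i, i + width))
    return edges
```

Placing the weight first in each tuple means the list can be sorted directly into non-decreasing weight order.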

For most natural images, this algorithm could find tens or hundreds of regions. Note in Figure 3.6 that the sign is segmented into many regions, some of which are superfluous to understanding the meaning of the sign. We add a user interface to allow a prosthetic vision user to select which region is displayed. Ideally, if the user can ‘scan’ across the field of view, regions can be displayed in sequence, enabling an interface similar to scanning a magnifier across text for low-vision reading.


Algorithm 3.1: Segmentation algorithm of [94]

Input: Graph $G = (V, E)$ with $n$ vertices and $m$ edges
Output: Segmentation of $V$ into components $S = (C_1, \ldots, C_r)$

Sort $E$ into $\pi = (o_1, \ldots, o_m)$ by non-decreasing edge weight
Start with segmentation $S^0$, where each vertex is in its own component
for $q = 1$ to $m$ do
    Construct $S^q$ given $S^{q-1}$:
    Let $v_i$ and $v_j$ denote the vertices connected by the $q$-th sorted edge: $o_q = (v_i, v_j)$.
    Let $C_i^{q-1}$ be the component of $S^{q-1}$ containing $v_i$.
    Let $C_j^{q-1}$ be the component of $S^{q-1}$ containing $v_j$.
    if $C_i^{q-1} \ne C_j^{q-1}$ and $w(o_q) \le \min\bigl(\mathrm{Int}(C_i^{q-1}) + \tau(C_i^{q-1}),\; \mathrm{Int}(C_j^{q-1}) + \tau(C_j^{q-1})\bigr)$ then
        $S^q$ is obtained from $S^{q-1}$ by merging $C_i^{q-1}$ and $C_j^{q-1}$.
    else
        $S^q = S^{q-1}$.
    end if
end for
$S = S^m$.
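Algorithm 3.1 can be sketched with a union-find (disjoint-set) structure, as below. This is an illustrative sketch and not the code used to produce the results here: the function name `segment_graph` and the `(weight, i, j)` edge layout are assumptions. Because edges are visited in non-decreasing weight order, the edge that joins two components is the heaviest edge in the spanning tree of the merged component, so $\mathrm{Int}(C)$ can be updated in constant time.

```python
def segment_graph(num_vertices, edges, k=500.0):
    """Graph-based segmentation in the style of Algorithm 3.1.

    `edges` is an iterable of (weight, i, j) tuples. Returns a list that maps
    each vertex to the representative vertex of its component.
    """
    parent = list(range(num_vertices))
    size = [1] * num_vertices
    internal = [0.0] * num_vertices  # Int(C) per root; 0 for a single vertex

    def find(v):
        # Follow parent links to the root, shortening the path as we go.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    for weight, i, j in sorted(edges):  # non-decreasing edge weight
        a, b = find(i), find(j)
        if a == b:
            continue  # both endpoints are already in the same component
        # tau(C) = k / |C| for each of the two components.
        bound = min(internal[a] + k / size[a], internal[b] + k / size[b])
        if weight <= bound:
            if size[a] < size[b]:
                a, b = b, a  # attach the smaller tree under the larger one
            parent[b] = a
            size[a] += size[b]
            internal[a] = weight  # heaviest spanning-tree edge so far
    return [find(v) for v in range(num_vertices)]
```

For example, with edges `[(1.0, 0, 1), (1.0, 2, 3), (10.0, 1, 2)]` and `k=2`, the two light edges are merged but the heavy edge is rejected, leaving two components; with the default `k=500` all four vertices end up in one component.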

We add an input to our system, the fixation point, which is used to select which region to display in the output phosphenes. The fixation point is a single pixel location in the image, corresponding to a vertex we denote $v_f$. In a prosthetic vision system, it may be a fixed location in the user’s field of view (such as the centre of the camera image), or it could be based on user input. It is possible to combine a retinal implant system with eye gaze tracking [12]. Computing a fixation point from eye gaze direction could provide a natural interface for the user to scan their environment for signs.

The segmentation should group adjacent pixels of similar colour, which we expect to correspond to parts of symbols on signs or in text. Signs often have large areas of a single colour, but these can be segmented into multiple regions due to minor variations in intensity over the area (for example, due to shadows). To improve the chance of capturing the area of the entire sign in these cases, we added a post-processing stage that merges regions of similar average colour.

We compute the average colour $A_i$ of a region $C_i$ as the arithmetic mean of the pixel intensity vectors in the region:

$$A_i = \frac{1}{|C_i|} \sum_{j \mid v_j \in C_i} I_j.$$

The region containing the fixation point, $C_f \ni v_f$, is merged with each region $C_i$ that is adjacent to $C_f$ and whose difference in average colour $\lVert A_i - A_f \rVert$ does not exceed some threshold. Where the elements of $I$ are integers in the range $[0, 255]$, we merge where $\lVert A_i - A_f \rVert \le 70$, which was found empirically to give good results on tested images.
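The region selection and merging step can be sketched as follows. The function name `merge_fixated_region`, the label-map representation and the use of 4-connectivity to decide adjacency are assumptions made for illustration. The sketch performs a single pass over the regions bordering $C_f$, as described above, and does not merge iteratively.

```python
import math


def merge_fixated_region(image, labels, fixation, threshold=70.0):
    """Return the set of region ids forming the merged fixated region.

    `image` is a list of rows of (r, g, b) tuples, `labels` a same-sized list
    of rows of region ids, and `fixation` an (x, y) pixel location.
    """
    height, width = len(labels), len(labels[0])

    # Accumulate per-region colour sums and pixel counts, then take the mean.
    sums, counts = {}, {}
    for y in range(height):
        for x in range(width):
            region = labels[y][x]
            total = sums.setdefault(region, [0.0, 0.0, 0.0])
            for c in range(3):
                total[c] += image[y][x][c]
            counts[region] = counts.get(region, 0) + 1
    mean = {r: tuple(v / counts[r] for v in s) for r, s in sums.items()}

    fx, fy = fixation
    fixated = labels[fy][fx]

    # Collect regions sharing a 4-connected border with the fixated region.
    neighbours = set()
    for y in range(height):
        for x in range(width):
            if labels[y][x] != fixated:
                continue
            for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
                if 0 <= nx < width and 0 <= ny < height and labels[ny][nx] != fixated:
                    neighbours.add(labels[ny][nx])

    # Keep neighbours whose mean colour is within the threshold of A_f.
    merged = {fixated}
    for region in neighbours:
        if math.dist(mean[region], mean[fixated]) <= threshold:
            merged.add(region)
    return merged
```

Returning a set of region ids, rather than relabelling the map, leaves the original segmentation available when the fixation point moves.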

Note that this is a simplistic approach to avoid oversegmentation of signs, sufficient for our proof of concept. A more robust segmentation method may be required to achieve good results in practical scenarios. For example, video segmentation methods such as [104, 58, 303] use temporal information to potentially yield more stable segmentations, but with correspondingly increased computation and memory requirements. Superpixel methods [4, 164] may yield better performance, but require a more complex region merging step (such as a graph-based approach), because they generate regions of consistent shape and size and may therefore need to oversegment large objects to capture their boundaries accurately.

[Figure 3.7 block diagram. Labels: RGB image, Fixation point, Segmentation, Merge regions, Binarisation, Sampling, Phosphene locations, Phosphene intensities.]

Figure 3.7: Our image segmentation method as a processing stage in our framework. Note that phosphene locations here are, as in our sampling method, set according to the stimulation stage and considered fixed.

We convert the final result of segmentation and region merging to phosphene intensities such that the shape of the fixated region $C_f$ is communicated to the user. A binary image is generated, such that maximum intensity is assigned to pixel locations in $C_f$, and zero intensity to all other pixels. An example of this binary image is shown in Figure 3.6 (c). In this example, the background of the sign is merged into a single region, resulting in the key information of the sign being communicated as the shape of the region. The binary image is then rescaled such that the region is as large as possible without distorting its aspect ratio, while being entirely within the image area sampled by our sampling method. We then apply our sampling method from Section 3.2.2 to the rescaled image to compute phosphene intensities. This is illustrated in Figure 3.7, showing inputs and outputs as they relate to our overall framework.
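The binarisation and the choice of rescaling factor can be sketched as below. The names `binarise` and `fit_scale`, the maximum intensity of 255 and the use of the region's bounding box to decide the fit are assumptions made for illustration; the resampling itself and the sampling method of Section 3.2.2 are not reproduced.

```python
def binarise(labels, merged_ids, on=255):
    """Binary image: `on` inside the fixated region, zero everywhere else."""
    return [[on if region in merged_ids else 0 for region in row]
            for row in labels]


def fit_scale(mask, target_w, target_h):
    """Largest uniform scale at which the region fits a target area.

    Uses the bounding box of the non-zero pixels, so the aspect ratio of the
    region is preserved when the same factor is applied to both axes.
    """
    xs = [x for row in mask for x, value in enumerate(row) if value]
    ys = [y for y, row in enumerate(mask) if any(row)]
    if not xs:
        raise ValueError("mask contains no region pixels")
    box_w = max(xs) - min(xs) + 1
    box_h = max(ys) - min(ys) + 1
    # The tighter of the two axes limits how far the region can grow.
    return min(target_w / box_w, target_h / box_h)
```

Taking the minimum of the horizontal and vertical ratios keeps the whole region inside the sampled area while making it as large as that constraint allows.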


Figure 3.8: (a) The input image; (b) the SPV result, with 1024 phosphene locations, from processing (a) with sampling only; (c) windows in which signs were detected in (a); (d) zoomed windows from (c) and corresponding SPV results (also with 1024 locations) with our processing method.

3.3.3 Results

We present typical SPV results, using photographs as input and our SPV rendering method to produce the output. An arrangement of 1024 phosphenes in a 32×32 rectangular grid is used. We use this high resolution to illustrate clear comparisons between our methods and simple downsampling as a baseline. We expect that in a practical device, both methods could improve functional outcomes and both could be implemented. However, due to the requirement of training, the bounding box detector is not as flexible as the segmentation method. In this section, we focus on the breadth of situations where the segmentation method could be applicable.

We illustrate example outputs from our object detector method in Figure 3.8. Note that when the entire field of view is rendered with SPV in (b), it is not possible to locate or identify the signs in the original image. Sampling within a limited field of view as in (d) allows the signs to be identified in the resulting SPV renderings. Note that three signs are detected and, as described earlier, we assume a user interface is present allowing a user to select which window to sample.

Figure 3.9: (a) The original image; (b) rendered output from processing (a) with sampling only; (c) rendered output from our segmentation processing stage with fixation point within the sign. Both SPV results use 1024 phosphene locations.

We tested this approach using a set of street images captured using a vehicle-mounted camera†. We selected 25 images at random, all containing standard street signs, and produced rendered SPV images from each using our processing method as in Figure 3.8. We asked a normally sighted volunteer to identify the type of sign (pedestrian crossing, speed limit, etc.) from the SPV image. We found that using sampling only, the sign could not be identified in any SPV images. However, when the volunteer used our object detection method, they correctly identified the signs in 72% of the SPV images. This suggests this method may be beneficial, but an in-depth study is required to determine whether to implement it in a practical system.

As discussed above, we expect our segmentation method to be more flexible in real-world scenarios. In Figure 3.9 we show a rendered result from our segmentation method using the same input image as in Figure 3.8. Note that no training for specific objects was required. The contrast is enhanced relative to the object detection method, with phosphenes in $C_f$ (the fixated region after merging) set to maximum intensity. This may make the segmentation approach more robust to variation between implanted devices and patients, as the number of discrete levels of intensity may vary [171].

A number of example images were tested with our segmentation method. While formal studies have not yet been performed to measure functional outcomes of the system, we can use our SPV rendering to show that in many cases the original symbols can be easily recognised from the rendered output. Significantly, the examples shown here are from a range of real-world situations, with varying lighting conditions, size of object, and contrast present in the images.

Figure 3.10: (a) The original image; (b) rendered output with sampling only; (c) and (d) rendered output from our segmentation processing method, with two different fixation points selected within the sign area. All SPV results use 1024 phosphene locations.

An example of a typical situation where our segmentation processing method may be useful is shown in Figure 3.10. Even with a high-acuity retinal implant, simply downsampling the image to produce phosphene intensities would yield a stimulus similar to that shown in (b), which does not clearly convey the information in the sign, or even that a sign is present. Applying our segmentation method, the phosphenes rendered in (c) and (d) can be generated by scanning the fixation point around the image. These renderings emphasise the important symbolic information
