Combined Tracking and Super Resolution - Model based methods for locating, enhancing and recognising low resolution objects in video

Wide area surveillance tasks require high resolution images of the object of interest while only acquiring low resolution video of the scene. This chapter proposes a new method for combining model-based tracking and super resolution in the context of large scale surveillance. The key idea is the use of a deformable 3D object model for both tracking and super resolution. Unlike most existing super resolution techniques, the proposed method increases only the resolution of the object rather than the entire scene without using interpolation techniques.

A common super resolution algorithm is super resolution optical flow (Baker and Kanade, 1999). This method interpolates each frame to twice its size and uses optical flow to register previous and consecutive frames, which are then warped into a reference coordinate system. The super-resolved image is calculated as the average across these warped frames. However, the initial interpolation step introduces artificial random noise which is difficult to remove. Secondly, the optical flow is calculated between previous and consecutive frames, which prevents its use as an online stream processing algorithm. Also, accurate image registration requires precise motion estimation (Barreto et al., 2005), which in turn affects the quality of the super-resolved image (Zhao and Sawhney, 2002). Most optical flow methods fail in low textured areas and cannot be used to register non-planar and non-rigid objects. A recent technique proposed by Gautama and van Hulle (2002) calculates sub-pixel optical flow between several consecutive frames (with non-planar and non-rigid moving objects); however, it is unable to estimate an accurate dense flow field, which is needed for accurate image warping.
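As a rough illustration of the interpolate, register, warp and average pipeline described above, the following Python sketch may help. It is not the implementation of Baker and Kanade: interpolation is replaced by simple pixel replication, and the optical flow step is assumed to have already produced integer shifts on the upsampled grid. All function names are hypothetical.

```python
def upsample2(img):
    """Double the image size by pixel replication (crude stand-in for interpolation)."""
    out = []
    for row in img:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out

def shift(img, dy, dx):
    """Warp by an integer shift with wrap-around (stand-in for flow-based warping)."""
    h, w = len(img), len(img[0])
    return [[img[(r - dy) % h][(c - dx) % w] for c in range(w)] for r in range(h)]

def average_registered(frames, shifts):
    """Upsample each frame, undo its (assumed known) shift, then average."""
    warped = [shift(upsample2(f), -dy, -dx) for f, (dy, dx) in zip(frames, shifts)]
    h, w = len(warped[0]), len(warped[0][0])
    return [[sum(f[r][c] for f in warped) / len(warped) for c in range(w)]
            for r in range(h)]
```

Because every frame is enlarged before registration, any error introduced by the enlargement step is carried into the average, which is the weakness noted in the text.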

Solving all of these issues in the general case is difficult, as the general problem of super resolution is numerically ill-posed and computationally complex (Farsiu et al., 2004). A specific issue is addressed here: simultaneous tracking and super resolution of known object types (e.g. faces, license plates) acquired by low resolution video. The use of an object-specific 3D mesh overcomes the failure of optical flow in low textured images. Interpolation is avoided and a 3D mesh is used to track, register and warp the object of interest. Using the 3D object mask to estimate translation and rotation parameters between two frames is equivalent to calculating a dense, sub-pixel accurate optical flow field and subsequently warping into a reference coordinate system. The 3D mesh is subdivided such that each triangle is smaller than a pixel when projected into the image, which makes super resolution possible (Smelyanskiy et al., 2000) and allows for sub-pixel accurate image registration and warping. In addition, such a fine mesh improves the tracking performance for low-resolution objects. Each triangle then accumulates the average colour value across several registered images, and a high resolution 3D model is created online during tracking. This approach differs from classical super resolution techniques in that the resolution is increased at the model level rather than at the image level.
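The claim that pose parameters imply a dense, sub-pixel flow field can be illustrated with a minimal Python sketch. It assumes a simple pinhole camera and a rotation about a single axis. The focal length, image centre, vertex coordinates and function names are all hypothetical and not taken from the chapter.

```python
import math

def rotate_y(point, angle):
    """Rotate a 3D point about the y axis by `angle` radians."""
    x, y, z = point
    c, s = math.cos(angle), math.sin(angle)
    return (c * x + s * z, y, -s * x + c * z)

def project(point, focal_px=500.0, centre=(160.0, 120.0)):
    """Pinhole projection of a camera-space point to pixel coordinates."""
    x, y, z = point
    return (focal_px * x / z + centre[0], focal_px * y / z + centre[1])

def pose_flow(vertices, pose_a, pose_b):
    """Per-vertex image displacement between two poses (angle, translation)."""
    flow = []
    for v in vertices:
        uv = []
        for angle, t in (pose_a, pose_b):
            r = rotate_y(v, angle)
            uv.append(project((r[0] + t[0], r[1] + t[1], r[2] + t[2])))
        flow.append((uv[1][0] - uv[0][0], uv[1][1] - uv[0][1]))
    return flow

# Hypothetical planar patch 10 units in front of the camera.
vertices = [(-1.0, -1.0, 0.0), (1.0, -1.0, 0.0), (0.0, 1.0, 0.0)]
pose_a = (0.0, (0.0, 0.0, 10.0))
pose_b = (0.0, (0.005, 0.0, 10.0))  # small sideways translation
flow = pose_flow(vertices, pose_a, pose_b)  # about 0.25 px in x per vertex
```

Every mesh vertex receives a displacement, including vertices in untextured regions, so the density of the flow field is set by the density of the mesh rather than by image texture.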

Furthermore, only the object of interest is tracked and super-resolved rather than the entire scene, which reduces computation costs. Lastly, the use of a deformable mask mesh allows for tracking of non-rigid objects, like human faces.

This chapter is organised as follows: The image formation process is described in Section 4.1 and, based on this, the proposed method is outlined in Section 4.2. Next, the 3D tracking approach and the model based super resolution method are introduced in Sections 4.3 and 4.4 respectively. An extension to non-planar and non-rigid objects is proposed in Section 4.5. The experimental evaluations of the combined tracking and super resolution method are presented in Section 4.6.

4.1 The Image Formation Process

The image formation process is important for understanding the necessity for super resolution. When taking a picture with a digital camera, the resulting image is captured from the (high resolution) 3D world and projected onto the CCD chip - the image plane.

During this process the high resolution image I^high is sub-sampled, warped and blurred resulting in a degraded low resolution image I^low:

I^low = A I^high + h    (4.1)

where I^high and I^low are the high and low-resolution images respectively. The degrading matrix A represents image warp, blur and image sampling; h models the uncertainties due to noise.

Figure 4.1: The basic principles of the image formation process: The high resolution 3D object is warped from the world coordinate system into the camera coordinate system and is finally projected onto the image plane. After this process the resulting low resolution image is warped, blurred and sub-sampled.

This image formation process (Faugeras, 1993) is illustrated in Figure 4.1. The 3D object is warped from the world coordinate system into the camera coordinate system and then projected into the 2D image plane. The resulting image is sub-sampled as an effect of the finite number of pixels on the imaging chip. Furthermore, the image is degraded by blur, caused by the optical system of the camera and by motion, and by additional random noise. This image formation process can be described by Equation 4.1, where the degrading matrix A is used to model all possible degradations. However, the matrix A is unknown and hard to estimate.
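A minimal Python sketch of Equation 4.1 is given below, under strong simplifying assumptions: the operator A is reduced to a box blur combined with sub-sampling (block averaging), the warp is omitted, and h is drawn as zero-mean Gaussian noise. The function name and parameters are hypothetical.

```python
import random

def degrade(high, factor, noise_sigma=0.0, seed=0):
    """Simulate I_low = A I_high + h with A = box blur + sub-sampling."""
    rng = random.Random(seed)
    rows, cols = len(high) // factor, len(high[0]) // factor
    low = []
    for r in range(rows):
        row = []
        for c in range(cols):
            block = [high[r * factor + i][c * factor + j]
                     for i in range(factor) for j in range(factor)]
            mean = sum(block) / len(block)                   # blur + sub-sample (A)
            row.append(mean + rng.gauss(0.0, noise_sigma))   # additive noise (h)
        low.append(row)
    return low

high = [[0, 0, 1, 1],
        [0, 0, 1, 1],
        [0, 1, 1, 1],
        [0, 1, 1, 1]]
low = degrade(high, 2)  # [[0.0, 1.0], [0.5, 1.0]]
```

In practice A is unknown, as the text notes, so a fixed block average of this kind is only a convenient stand-in for experimentation.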

Figure 4.2: (a) and (c) show high resolution images of a cube being projected on a low resolution grid representing the image plane. Depending on the position on the grid different low resolution appearances result as shown in (b) and (d).

Another effect of the sub-sampling process that occurs when a 3D object is projected onto the 2D imaging chip is shown in Figure 4.2. The low resolution image depends on the number of pixels on the imaging chip and on the size and position of the 3D object in front of the camera. An imaging chip with fewer pixels produces a lower resolution image than an imaging chip with more pixels. The position on the imaging chip onto which the object is projected also changes the appearance of the low resolution image, as illustrated in Figure 4.2.

The high resolution cubes in Figures 4.2(a) and 4.2(c) are projected onto different parts of the imaging chip. The image formation process is modelled by averaging over the number of high resolution points that fall within each pixel. The resulting effect is most prominent along the edges of the cube. Depending on whether the black edge falls in between pixels and depending on the colour of adjacent pixels, different shades of grey result in the low resolution image. Using this effect, the key idea of the proposed approach is to assume that the 3D shape of the high resolution object is known. The 3D object model is projected back into the image, low resolution images are created and then used to reconstruct the appearance of the high resolution 3D object.
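The dependence of the low resolution appearance on sub-pixel position can be reproduced with a one-dimensional Python sketch. It assumes a white scanline (1.0) containing a single black edge (0.0) and models each pixel as the average of the high resolution samples that fall within it. The sizes and the function name are hypothetical.

```python
def project_line(width, edge_start, edge_width, factor):
    """Average a 1D white scanline with one black edge onto coarse pixels."""
    high = [0.0 if edge_start <= i < edge_start + edge_width else 1.0
            for i in range(width)]
    return [sum(high[p:p + factor]) / factor for p in range(0, width, factor)]

aligned = project_line(16, 4, 4, 4)  # edge covers exactly one pixel
shifted = project_line(16, 6, 4, 4)  # edge straddles two pixels

print(aligned)  # [1.0, 0.0, 1.0, 1.0]
print(shifted)  # [1.0, 0.5, 0.5, 1.0]
```

The same edge therefore appears as one black pixel or as two grey pixels depending only on where it lands on the pixel grid, which is the information the proposed approach exploits.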

4.2 Method Overview

Using the effects of the image formation process as outlined in Section 4.1, the proposed method reconstructs the high resolution appearance of a known 3D object as illustrated in Figure 4.3.

It is assumed that the 3D object is known and that a 3D model of that object is available. This is a realistic assumption given that a large number of different 3D object models are freely available on the internet or can be created with 3D software tools like Google SketchUp¹. Such a 3D model is used within a model-based tracking approach to estimate translation and rotation parameters between consecutive frames. The model based tracking approach allows for accurate tracking of non-planar and non-rigid objects. Once the pose parameters of the current frame are estimated, the 3D object is projected back into the image and, instead of using traditional texture mapping techniques, the 3D model is textured by projecting every mask triangle into the image and assigning it a single colour value.

¹ http://sketchup.google.com

[Figure 4.3 panel labels: Tracking; Textured 3D Model; Super-Resolved 3D Model; Frame i+1]

Figure 4.3: The basic outline of the proposed tracking and super resolution approach. The object of interest is tracked across several frames. Assuming that the type of object is known, the 3D model of the object is projected back into every image and every quad or triangle of the 3D model is assigned a single colour value. The super-resolved texture is then calculated as the mean across several frames.

In order to achieve a 3D object with a high resolution texture, every mask triangle has to be smaller than a pixel when projected into the image (Smelyanskiy et al., 2000). This is achieved by subdividing the 3D object model using standard computer graphics methods. Depending on the size of the object and the resolution of the image, the 3D object mask triangles are subdivided until they are smaller than a pixel when projected into the image.
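The subdivision criterion can be sketched in Python as follows. The chapter does not name a specific subdivision scheme, so this sketch assumes midpoint subdivision (each triangle is split into four) and interprets "smaller than a pixel" as "longest projected edge shorter than one pixel". It operates on already projected 2D triangles, and all names are hypothetical.

```python
import math

def midpoint(p, q):
    """Midpoint of two points given as coordinate tuples."""
    return tuple((a + b) / 2.0 for a, b in zip(p, q))

def subdivide(tri):
    """Split one triangle into four using its edge midpoints."""
    a, b, c = tri
    ab, bc, ca = midpoint(a, b), midpoint(b, c), midpoint(c, a)
    return [(a, ab, ca), (ab, b, bc), (ca, bc, c), (ab, bc, ca)]

def longest_edge(tri):
    """Length of the longest edge of a triangle."""
    a, b, c = tri
    return max(math.dist(a, b), math.dist(b, c), math.dist(c, a))

def subdivide_below_pixel(triangles, max_edge_px=1.0):
    """Subdivide until every triangle edge is shorter than max_edge_px."""
    done, pending = [], list(triangles)
    while pending:
        tri = pending.pop()
        if longest_edge(tri) < max_edge_px:
            done.append(tri)
        else:
            pending.extend(subdivide(tri))
    return done

# A projected triangle with legs of 4 pixels needs three levels: 4**3 = 64 triangles.
fine = subdivide_below_pixel([((0.0, 0.0), (4.0, 0.0), (0.0, 4.0))])
```

Each level halves the edge lengths, so the number of triangles grows by a factor of four per level. This is why the required mesh density depends on how large the object appears in the image.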

As described in Section 4.1, the number of pixels that the object covers within the whole image depends on the size of the imaging chip, the optical lens, the size of the object itself and the distance between object and camera. As the object or the camera moves, the object may be projected onto different pixels of the imaging chip in different frames.

In Figure 4.3, the black edges surrounding the gradient on the front side of the cube are projected nearly exactly into pixel centres, resulting in 14 black pixels on either side of the cube in the image plane in frame i. The movement of the cube in front of the camera results in sub-pixel movements on the image plane. The black edges of the cube now fall between pixels of the imaging chip, resulting in grey edge pixels in frame i + 1. As a result, the two 3D models of the cube in Figure 4.3 are textured differently for each frame. Over time, each mask triangle of the model accumulates different colour values, and the super-resolved 3D model is calculated as the mean colour value of each triangle. Without loss of generality, Figure 4.3 only shows the projection and super resolution of one side of the cube; the same is true for non-planar and/or non-rigid objects.
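The per-triangle accumulation can be sketched in Python as an incremental mean, which allows the texture to be updated one frame at a time. This is a minimal sketch rather than the chapter's implementation: it assumes grey values instead of colour, and it assumes that only the triangles observed in a frame are updated. The class and method names are hypothetical.

```python
class TriangleTexture:
    """Running mean of the value observed for each mask triangle."""

    def __init__(self, n_triangles):
        self.mean = [0.0] * n_triangles
        self.count = [0] * n_triangles

    def update(self, observations):
        """Fold one frame in; `observations` maps triangle index to grey value."""
        for idx, value in observations.items():
            self.count[idx] += 1
            # Incremental mean: no need to store earlier frames.
            self.mean[idx] += (value - self.mean[idx]) / self.count[idx]

texture = TriangleTexture(2)
texture.update({0: 0.0, 1: 1.0})  # frame i: triangle 0 lies on a black edge
texture.update({0: 0.5})          # frame i + 1: the same triangle now appears grey
print(texture.mean)               # [0.25, 1.0]
```

Since each update needs only the current frame and the stored mean, the model can be refined during tracking without access to future frames.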

The super-resolved 3D model is created online during tracking and improves with every frame, whereas super resolution optical flow incorporates both previous and consecutive frames, which prohibits its use as an online stream processing algorithm. Furthermore, using an object-specific 3D model in a combined tracking and super resolution approach inverts the image formation process in Equation 4.1. The subdivided 3D mesh represents the high-resolution object I^high that is down-sampled by projection into the image plane. The finer the mesh, the higher the resolution of I^high and the higher the possible increase in resolution. Thus, interpolation, the first step of the optical flow algorithm, is unnecessary and the resulting super-resolved 3D model is less blurred whilst maintaining the same resolution increase. This in turn makes deblurring (the last step of the optical flow algorithm) unnecessary. Lastly, using the 3D mesh for tracking is equivalent to image registration, warping and the estimation of a dense flow field, which comprise steps 2 and 3 of the optical flow algorithm.
