System Setup and Algorithms - Mobile Interaction with Large Multimedia Information Spaces

We have prototypically implemented the interaction techniques. In the following, we describe the hardware setup, as well as the implemented algorithms.

4.5.1 Hardware

Figure 4.11 shows our prototype. We have attached an Aaxa L1 laser pico projector to a Microsoft Kinect with hook-and-loop tape and use the combination as a mobile camera-projector unit. The projector has a resolution of 800 × 600 pixels. The Microsoft Kinect features a pair of depth-sensing range cameras (320 × 240 pixels), an infrared structured light source and a regular RGB color camera (640 × 480 pixels). To support hassle-free document recognition, we have attached a megapixel webcam with autofocus to the unit. Kinect, webcam and pico projector are calibrated and aligned.

The mobile camera-projector unit can be further mounted onto a strong suction cup, which also features a handle. Thus the unit can be easily carried in one hand by using the handle. Moreover, it can be attached to basically any flat surface, even vertical surfaces or ceilings to achieve a top-down projection.

Figure 4.11: Hardware prototype using a Microsoft Kinect, mounted on a suction cup. The pico projector is placed on top of the Kinect. We have added a webcam on the right hand side for document recognition.

4.5. System Setup and Algorithms 143

4.5.2 Algorithms

In the following, we describe the algorithms used to track objects, support the spatial interaction and recognize physical documents.

4.5.2.1 Object Tracking and Interaction Support

As projection surfaces, we currently consider flat surfaces of 3D objects. We model them as 2D planes in 3D space. To support a robust tracking of arbitrary objects, independent of varying lighting conditions, we aimed at using solely the depth image in our tracking algorithm. The algorithm is thus less complex than other approaches [Lowe, 2004], yet robust and highly efficient due to its simplicity. Algorithm 1 depicts a pseudocode representation of the algorithm.

First, a threshold is applied to the depth image to filter out any background objects (line 2). A blob detection for the objects in the scene is carried out (line 3). As a simple example, Figure 4.12 (left) shows only one object (here: a piece of paper), which is held in hand. Figure 4.12 (right) shows the corresponding depth image. We isolate the object from the scene (here: to discard the hand) in three steps, which are carried out for each detected blob (line 4):
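The first two operations (lines 2 and 3 of Algorithm 1) can be sketched in a few lines. This is only an illustrative sketch in plain Python on a toy depth grid, not the OpenCV-based implementation; the function names, the depth values and the cut-off of 1200 are assumptions made for the example.

```python
from collections import deque

def threshold_depth(depth, max_depth):
    """Binary mask of pixels nearer than max_depth (0 is treated as 'no reading')."""
    return [[1 if 0 < d < max_depth else 0 for d in row] for row in depth]

def detect_blobs(mask):
    """Group 4-connected foreground pixels; returns one set of (row, col) per blob."""
    rows, cols = len(mask), len(mask[0])
    seen = [[False] * cols for _ in range(rows)]
    blobs = []
    for r in range(rows):
        for c in range(cols):
            if mask[r][c] and not seen[r][c]:
                blob, queue = set(), deque([(r, c)])
                seen[r][c] = True
                while queue:
                    y, x = queue.popleft()
                    blob.add((y, x))
                    for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                        if (0 <= ny < rows and 0 <= nx < cols
                                and mask[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
                blobs.append(blob)
    return blobs

# Toy depth image (assumed units: millimetres): two near objects, far background.
depth = [
    [2000, 2000, 2000, 2000, 2000, 2000],
    [2000,  800,  800, 2000, 2000, 2000],
    [2000,  800,  800, 2000,  900, 2000],
    [2000, 2000, 2000, 2000,  900, 2000],
]
mask = threshold_depth(depth, max_depth=1200)
blobs = detect_blobs(mask)
print(len(blobs), sorted(len(b) for b in blobs))  # 2 [2, 4]
```

Because only depth values are inspected, the result does not depend on scene lighting, which is the property the tracking algorithm relies on.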

1. Breaking up weakly connected components: the objective of this step is to detect weak connections between objects in the image and eliminate them to finally isolate the target object (i.e. the paper in Fig. 4.12). A weak connection is a thin line in the input image that joins areas which would otherwise be separate, so that they technically form one large blob (e.g. the piece of paper and the arm in Fig. 4.12). The separation is done with four basic image operations. First, an AND mask of the detected blob is applied to the image, to discard other blobs and therefore focus only on the current blob (line 5). The image is then blurred heavily, which results in lower gray values for the thin connections. Then a binary threshold is applied, eliminating the blurred borders (line 6). Finally, morphological open and close operators are applied to sharpen the object borders (line 7).

2. Detecting inner points of the target object: the resulting image of step 1 contains isolated objects (i.e. paper and hand in Fig. 4.12 are now two separate blobs). However, due to the image operations, the area and consequently the contour have been reduced. A further blob detection now enables the detection of the reduced area (line 8). The algorithm chooses the largest blob as the desired projection target (line 9).

3. Mapping inner to original corner points: a rotation-invariant minimum bounding rectangle of the corresponding blob is calculated. The corner points of this bounding rectangle serve as the input points for the next step: the inner corner points are finally mapped to the original object corners by considering the contour of the object recognized in Figure 4.12. The bounding rectangle (and thus the inner corner points) is iteratively expanded by a fixed factor to approach the contour of the original target object (lines 10-14). Once the distance between the corners and the contour is smaller than a certain threshold, the corners of the target object have been found. The algorithm then stores the detected target object and starts over for the remaining blobs.
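The iterative expansion in step 3 can be illustrated as follows. This is a simplified sketch under stated assumptions: it uses an axis-aligned rectangle given as (centre x, centre y, width, height) instead of a rotation-invariant minimum bounding rectangle, it checks the distance before each expansion, and the factor 1.05, the threshold 1.0 and the iteration cap are example values that do not come from the implementation.

```python
import math

def corners_of(rect):
    """Corner points of an axis-aligned rectangle (cx, cy, w, h)."""
    cx, cy, w, h = rect
    return [(cx - w / 2, cy - h / 2), (cx + w / 2, cy - h / 2),
            (cx + w / 2, cy + h / 2), (cx - w / 2, cy + h / 2)]

def corner_distance(contour, corners):
    """Largest distance from any rectangle corner to its nearest contour point."""
    return max(min(math.dist(c, p) for p in contour) for c in corners)

def expand_to_contour(rect, contour, factor=1.05, eps=1.0, max_iter=200):
    """Grow rect uniformly about its centre until its corners reach the contour."""
    for _ in range(max_iter):  # cap guards against overshooting the contour
        corners = corners_of(rect)
        if corner_distance(contour, corners) < eps:
            return corners
        cx, cy, w, h = rect
        rect = (cx, cy, w * factor, h * factor)
    return corners_of(rect)

def rectangle_contour(x0, y0, x1, y1):
    """Integer sample points along the outline of the original (larger) object."""
    pts = [(x, y0) for x in range(x0, x1 + 1)] + [(x, y1) for x in range(x0, x1 + 1)]
    pts += [(x0, y) for y in range(y0, y1 + 1)] + [(x1, y) for y in range(y0, y1 + 1)]
    return pts

contour = rectangle_contour(0, 0, 100, 60)   # outline of the original blob
inner = (50.0, 30.0, 70.0, 42.0)             # reduced area left after step 1
found = expand_to_contour(inner, contour)
```

With these example values, the returned corners lie within about one pixel of the true corners (0, 0), (100, 0), (100, 60) and (0, 60). A smaller factor gives a tighter fit at the cost of more iterations.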

Algorithm 1 Object Tracking for LightBeam

 1: procedure TRACKOBJECTS(grayImage, objects)        ▷ objects serves as output set.
 2:     threshold(grayImage)                           ▷ Apply depth threshold.
 3:     blobs ← detectBlobs(grayImage)
 4:     for each blob in blobs do
 5:         img ← and(blob, grayImage)                 ▷ Remove other blobs.
 6:         binaryThreshold(blurHeavily(img))
 7:         dilate(erode(img))
 8:         reducedBlobs ← detectBlobs(img)
 9:         lBlob ← getLargestBlob(reducedBlobs)       ▷ Within original blob.
10:         contour ← getContour(blob)
11:         repeat
12:             targetObject ← lBlob
13:             lBlob ← expandArea(lBlob, factor)      ▷ Uniform expansion by a fixed factor.
14:             corners ← getCornerPoints(boundingRectangle(lBlob))
15:         until distance(contour, corners) < threshold   ▷ If near to original blob contour.
16:         objects.add(targetObject)
17:     end for
18:     return objects


Figure 4.12: Left: color image of a paper, held in hand. Its four corners are detected and indicated by four colored dots. Right: depth image after thresholding and blob detection. The red mark designates the thin connection, which the algorithm removes for object detection.
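The removal of such a thin connection can be reproduced on a toy binary mask. The sketch below is an assumption-laden illustration in plain Python: a 3 × 3 mean filter stands in for the heavy blur, the threshold level 0.8 is an example value, and the subsequent morphological open and close operators are omitted.

```python
def box_blur(mask):
    """3x3 mean filter with zero padding; returns values in [0, 1]."""
    rows, cols = len(mask), len(mask[0])
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            total = 0
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < rows and 0 <= cc < cols:
                        total += mask[rr][cc]
            out[r][c] = total / 9.0
    return out

def binary_threshold(img, level):
    """Keep only pixels whose blurred value reaches the given level."""
    return [[1 if v >= level else 0 for v in row] for row in img]

def count_blobs(mask):
    """Count 4-connected foreground regions with an iterative flood fill."""
    rows, cols = len(mask), len(mask[0])
    seen, count = set(), 0
    for r in range(rows):
        for c in range(cols):
            if mask[r][c] and (r, c) not in seen:
                count += 1
                stack = [(r, c)]
                seen.add((r, c))
                while stack:
                    y, x = stack.pop()
                    for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                        if (0 <= ny < rows and 0 <= nx < cols
                                and mask[ny][nx] and (ny, nx) not in seen):
                            seen.add((ny, nx))
                            stack.append((ny, nx))
    return count

# Two 5x5 objects (think: paper and hand) joined by a one-pixel-wide bridge.
mask = [[0] * 13 for _ in range(7)]
for r in range(1, 6):
    for c in list(range(0, 5)) + list(range(8, 13)):
        mask[r][c] = 1
for c in range(5, 8):
    mask[3][c] = 1

separated = binary_threshold(box_blur(mask), level=0.8)
print(count_blobs(mask), count_blobs(separated))  # 1 2
```

The bridge pixels never reach the threshold after blurring, so the two objects fall apart. Each remaining blob is also smaller than the original object (3 × 3 instead of 5 × 5 here), which is why the inner corner points have to be mapped back to the original contour afterwards.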

In its current version, the algorithm is implemented using OpenCV. It is important to note that line 9 in Alg. 1 restricts projection surfaces to be no smaller than a user’s hand, assuming that objects in the scene are used for tangible interaction. The algorithm needs to be adapted to support smaller projection surfaces.

In combination with the depth information for the detected object contour, we model and track the detected objects as 2D planes in 3D space. The projection is mapped using a homography, correcting any perspective errors. We also analyze the optical flow within the regions of the blobs in the RGB image. This allows us to detect whether an object has been rotated. Additional interaction devices such as the pen in Figure 4.10 are tracked based on their color. As mentioned earlier, more sophisticated approaches such as touch have been described elsewhere [Harrison et al., 2011] and are out of the scope of this thesis.
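To make the homography mapping concrete, the following sketch estimates a homography from four point correspondences and applies it to a point. It is a plain-Python illustration that assumes the bottom-right matrix entry is fixed to 1 and that no three points are collinear; the corner coordinates are invented example values and the function names are hypothetical, not part of the implementation.

```python
def solve_linear(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                for k in range(col, n + 1):
                    m[r][k] -= f * m[col][k]
    return [m[i][n] / m[i][i] for i in range(n)]

def find_homography(src, dst):
    """3x3 homography mapping four src points onto four dst points."""
    a, b = [], []
    for (x, y), (u, v) in zip(src, dst):
        a.append([x, y, 1, 0, 0, 0, -u * x, -u * y])
        b.append(u)
        a.append([0, 0, 0, x, y, 1, -v * x, -v * y])
        b.append(v)
    h = solve_linear(a, b)
    return [h[0:3], h[3:6], [h[6], h[7], 1.0]]

def apply_homography(H, point):
    """Map a 2D point through H, including the perspective division."""
    x, y = point
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return ((H[0][0] * x + H[0][1] * y + H[0][2]) / w,
            (H[1][0] * x + H[1][1] * y + H[1][2]) / w)

# Content corners in normalized coordinates and (assumed) detected object corners.
src = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
dst = [(120.0, 80.0), (520.0, 100.0), (500.0, 400.0), (100.0, 380.0)]
H = find_homography(src, dst)
centre = apply_homography(H, (0.5, 0.5))  # where the content centre lands
```

Warping the projected content with such a matrix makes it appear undistorted on a tilted surface, since each content corner lands on the corresponding tracked object corner.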

4.5.2.2 Document Recognition

The system automatically recognizes paper documents to support the rich interactions described in the mobile document interaction scenario. The recognition uses FACT [Liao et al., 2010], which utilizes local natural features [Lowe, 2004] to identify ordinary paper documents without any special markers. The current FACT implementation can operate at about 0.5 fps for recognizing a frame of 640 × 480 pixels on a PC with a quad-core 2.8 GHz CPU and 4 GB RAM. Considering that users usually do not change documents very quickly during their tasks, this recognition speed is acceptable for practical use. The FACT implementation had to deal with various difficulties due to only using data from an RGB camera, e.g. small document tilting angles or interference of overlaid projections with the original natural features.

FACT provides an interface which accepts captured camera images and returns the detected digital version of the document (i.e. a PDF). Figure 4.13 illustrates how the communication between LightBeam and FACT works. We leverage the capabilities of the Kinect depth camera to overcome these difficulties and enhance the camera image before passing it to FACT. The 3D pose estimation based on the depth image is independent of the document’s natural features and thus the system is robust to insufficient feature correspondence. Moreover, a rectification of the color images based on the 3D pose decreases the perspective distortion and allows for greater tilting angles. Last, the pose estimation and the document recognition can be carried out in two separate threads, each updating the world model asynchronously. Therefore, from the user's perspective, the system is able to locate specific document content in 3D space in real time.
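The two-thread arrangement can be sketched as follows. This is a hypothetical illustration using Python's standard threading module; the class WorldModel, the loop functions, the file name example.pdf and the sleep durations are assumptions for the example and do not reflect the actual LightBeam or FACT interfaces.

```python
import threading
import time

class WorldModel:
    """Shared state that both threads update under a lock (hypothetical)."""
    def __init__(self):
        self._lock = threading.Lock()
        self._state = {"pose": None, "document": None}

    def update(self, **fields):
        with self._lock:
            self._state.update(fields)

    def snapshot(self):
        with self._lock:
            return dict(self._state)

def pose_loop(world, stop):
    """Fast loop: stands in for depth-based pose estimation on every frame."""
    frame = 0
    while True:
        frame += 1
        world.update(pose={"frame": frame})
        if stop.is_set():
            break
        time.sleep(0.001)

def recognition_loop(world, stop):
    """Slow loop: stands in for querying the recognizer with a rectified image."""
    while True:
        time.sleep(0.05)                      # placeholder for a slow query
        world.update(document="example.pdf")  # hypothetical recognition result
        if stop.is_set():
            break

world, stop = WorldModel(), threading.Event()
threads = [threading.Thread(target=pose_loop, args=(world, stop)),
           threading.Thread(target=recognition_loop, args=(world, stop))]
for t in threads:
    t.start()
time.sleep(0.1)
stop.set()
for t in threads:
    t.join()
state = world.snapshot()
```

The pose entry is refreshed many times while a single recognition query is pending, so the projection can follow the document even though recognition itself is slow.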

[Figure 4.13 diagram: LightBeam (Object Tracking / Pose Estimation, Projection Mapping / Rectification) sends the Rectified Camera Image to FACT (Document Recognition); FACT returns an Update.]

Figure 4.13: LightBeam separates the document recognition into two threads: it continuously estimates the 3D pose of a document and asynchronously queries FACT with rectified camera images. FACT then sends the recognized document back to LightBeam.
