3. Marker-based tracking
3.3 Multi-marker setups (marker fields)
The four corners of a marker determine the pose of a camera, as previously explained. However, additional points stabilise a tracking system, improve the accuracy and enable the system to discard outliers. Especially if there is noise, additional reference points improve robustness and accuracy [94, 95]. There are three main approaches to increasing the number of points used for pose detection:
- to use more than four points per marker;
- to use more than one marker;
- to use natural features in addition to the marker.
The first approach improves stability only slightly: the points are still distributed over a physically small area, so they add almost nothing to the stability of the pose. A common problem in tracking is that the field-of-view of cameras is narrow, especially in mobile devices. If the user moves the camera, the marker soon falls out of view. A wider field-of-view does not help either if the user rotates the camera. Therefore, a single-marker tracking system restricts the permissible movements of the user, as the camera must see the marker all the time.
This is an unwanted limitation in many applications and therefore, the second and third options are commonly preferred. AR systems habitually use them to increase robustness and usability. In this section, we discuss the second option
and describe how an AR system can define and track a multi-marker setup. The third approach will be covered in Section 5.3 Hybrid tracking.
A tracking system can cope with larger camera motion if the user distributes several markers in different directions. When the system detects and tracks each marker individually, the information related to each marker is lost as soon as the system is unable to detect the marker. Multi-marker systems (aka marker fields) combine the information from all markers, and therefore these systems are more robust and accurate. For example, multi-marker systems can handle partial occlusions and deduce the location of a marker even if it is invisible, as long as they detect some other markers belonging to the marker field.
A multi-marker setup or marker field is a system that uses several markers jointly to estimate the camera pose. A system where each marker is individually used to calculate its relative pose to the camera is not a multi-marker system even if several markers are used.
In order to deduce the location of a non-detected marker, a tracking system needs to know the relative position of the marker compared to the others. Either the relative location of the markers can be predefined, or the system can allow free distribution of the markers and deduce the configuration of markers as it detects them.
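As a sketch of how a known marker configuration lets the system place a marker it cannot see, the following snippet composes 4×4 homogeneous transforms: the pose of a detected marker relative to the camera, multiplied by the (predefined or reconstructed) pose of the hidden marker relative to the detected one, gives the hidden marker's pose relative to the camera. The helper names and the example numbers are illustrative assumptions, and plain nested lists are used instead of a matrix library to keep it self-contained.

```python
def matmul(a, b):
    """Product of two 4x4 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def translation(x, y, z):
    """4x4 homogeneous transform that only translates."""
    return [[1, 0, 0, x], [0, 1, 0, y], [0, 0, 1, z], [0, 0, 0, 1]]


def locate_hidden_marker(T_cam_visible, T_visible_hidden):
    """Pose of a non-detected marker relative to the camera.

    T_cam_visible:    pose of a detected marker relative to the camera.
    T_visible_hidden: pose of the hidden marker relative to the detected
                      one, taken from the (predefined or reconstructed)
                      marker field.
    """
    return matmul(T_cam_visible, T_visible_hidden)


# Detected marker A: 0.5 m in front of the camera, rotated 90 degrees about z.
T_cam_A = [[0, -1, 0, 0.0],
           [1, 0, 0, 0.0],
           [0, 0, 1, 0.5],
           [0, 0, 0, 1]]
# In the marker field, marker B lies 0.2 m along marker A's x axis.
T_A_B = translation(0.2, 0.0, 0.0)

T_cam_B = locate_hidden_marker(T_cam_A, T_A_B)
print([row[3] for row in T_cam_B[:3]])  # -> [0.0, 0.2, 0.5]
```

Because marker A is rotated, B's 0.2 m offset appears along the camera's y axis rather than its x axis; the order of multiplication matters.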
3.3.1 Predefined multi-marker setups
Predefined multi-marker setups are widely used, and support for them is a standard feature in marker-based toolkits and libraries. For instance, ARToolKit [82], ARTag [71], ALVAR [19] and StudierStube Tracker [85, 96] offer support for a multi-marker setup. The planar multi-marker setup approach is the same as using a big marker with more than four points. Now the marker field is "a big marker" and markers in it are "sub features".
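To make the "big marker" view concrete, here is a minimal sketch, assuming square markers of equal size laid out on the z = 0 plane of the field: it expands a hypothetical layout (marker id to centre position) into 3D corner coordinates and pairs them with the image corners of whichever markers are detected. The combined point lists could then be handed to any perspective-n-point solver, for example OpenCV's `cv2.solvePnP`; the function names, corner order and detection format here are assumptions, not the API of any of the toolkits listed above.

```python
def marker_corners(cx, cy, size):
    """Corners of a square marker centred at (cx, cy) on the z = 0 plane,
    in marker-field coordinates, clockwise (y up) from the top-left."""
    h = size / 2.0
    return [(cx - h, cy + h, 0.0), (cx + h, cy + h, 0.0),
            (cx + h, cy - h, 0.0), (cx - h, cy - h, 0.0)]


def field_object_points(layout, size):
    """layout: dict marker_id -> (cx, cy). Returns marker_id -> 4 corners."""
    return {mid: marker_corners(cx, cy, size) for mid, (cx, cy) in layout.items()}


def correspondences(field, detections):
    """Pair 3D field points with 2D image corners of all detected markers.

    detections: dict marker_id -> list of four (u, v) image corners, in the
    same corner order as marker_corners. Markers not in the field are ignored.
    """
    obj, img = [], []
    for mid, corners_2d in detections.items():
        if mid in field:
            obj.extend(field[mid])
            img.extend(corners_2d)
    return obj, img


# Hypothetical layout: three 10 cm markers, 30 cm apart.
layout = {0: (0.0, 0.0), 1: (0.3, 0.0), 2: (0.0, 0.3)}
field = field_object_points(layout, size=0.1)
# Made-up detections: markers 0 and 2 are seen; marker 7 is not in the field.
detections = {0: [(10, 10), (50, 10), (50, 50), (10, 50)],
              2: [(10, 80), (50, 80), (50, 120), (10, 120)],
              7: [(0, 0), (1, 0), (1, 1), (0, 1)]}
obj, img = correspondences(field, detections)
print(len(obj))  # -> 8
```

Eight correspondences from two markers already give a better-conditioned pose than the four corners of a single marker, and the pose remains defined when marker 1 is occluded.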
A multi-marker system can use a non-planar predefined marker field as well; for example, markers may cover the sides of a cube, or some of the markers may be on a wall. ARToolKit, ARTag and ALVAR, for instance, all support non-planar multi-marker setups. Non-planar multi-marker setups provide tracking information for a larger-scale environment than a single-marker system, and they cope with larger camera movements than planar systems.
Markers attached to 3D objects allow the system to recognise them from different angles, which is desirable with tangible user interfaces, for example.
The problem with non-planar setups is that in practice it is difficult to measure the physical position and orientation of each marker relative to the others. This calibration process is often time consuming and inaccurate if done by hand [8]. It is possible to use external aids for measuring the marker locations, for example a tacheometer (as in [97]), but vision-based reconstruction approaches are more interesting from the viewpoint of an augmented reality system, since an AR system contains a camera and a computational unit anyway.
In the ideal realisation of a multi-marker system, the user can place markers freely on site without any predefined constraints, and then the system creates the marker field based on observations of the marker locations. This is called automatic reconstruction of multi-marker setups.
3.3.2 Automatic reconstruction of multi-marker setups
In automatic reconstruction of multi-marker setups, a system needs to determine the 3D coordinates of markers based on observations (2D images). This is a classical structure from motion (SfM) problem, which differs from the general case in that (some or all of) the features used for 3D reconstruction come from markers rather than arbitrarily from the environment. It is sufficient to model only the locations of the markers and leave the rest of the scene unmodelled.
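A basic building block of SfM is triangulation: recovering a 3D point from its 2D observations in two images with known camera poses. The sketch below uses the simple midpoint method (one of several triangulation techniques, chosen here for brevity, not necessarily what any cited system uses): the same marker corner is back-projected as two viewing rays, and the function returns the midpoint of the shortest segment between them. Camera centres and ray directions are assumed to be given in a common world frame.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def triangulate_midpoint(c1, d1, c2, d2):
    """Midpoint of the shortest segment between rays c1 + s*d1 and c2 + t*d2.

    c1, c2: camera centres in world coordinates.
    d1, d2: viewing-ray directions of the same marker corner (world frame),
            e.g. obtained by back-projecting the detected corner pixel.
    """
    w0 = [p - q for p, q in zip(c1, c2)]
    a, b, c = dot(d1, d1), dot(d1, d2), dot(d2, d2)
    d, e = dot(d1, w0), dot(d2, w0)
    denom = a * c - b * b
    if abs(denom) < 1e-12:
        raise ValueError("rays are (nearly) parallel: too little parallax")
    s = (b * e - c * d) / denom      # parameter of the closest point on ray 1
    t = (a * e - b * d) / denom      # parameter of the closest point on ray 2
    p1 = [ci + s * di for ci, di in zip(c1, d1)]
    p2 = [ci + t * di for ci, di in zip(c2, d2)]
    return [(x + y) / 2.0 for x, y in zip(p1, p2)]


# Two cameras 2 m apart both looking at a corner at (1, 0, 1).
print(triangulate_midpoint((0, 0, 0), (1, 0, 1), (2, 0, 0), (-1, 0, 1)))
# -> [1.0, 0.0, 1.0]
```

The parallel-ray check reflects why images with small parallax contribute little to the reconstruction.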
Researchers have applied several visual methods successfully to the SfM problem. An AR application designer could apply any of these methods for marker field reconstruction. A good overview of the basic methods can be found in [72]. In the following, we discuss the most common approaches used in AR.
Since the SfM problem is computationally demanding, many of the algorithms work offline. A common approach is either to create the 3D map in a separate process at the beginning of the application or to perform the reconstruction gradually.
Researchers have successfully applied the Kalman filter to SfM. One example is a recursive two-step method to recover structure and motion from image sequences based on Kalman filtering [98]. The algorithm consists of two major steps.
The first step of this algorithm estimates the object's pose with an extended Kalman filter (EKF). In the second step, each feature point's 3D position is estimated with a separate EKF.
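The idea of one filter per feature point can be illustrated with a deliberately simplified sketch. Assumptions: each measurement is a direct noisy 3D estimate of the point (for example from triangulation) with independent noise per axis, so the filter reduces to three scalar linear Kalman filters. The method in [98] uses an extended Kalman filter with a non-linear projection model, which this sketch does not reproduce; the class and parameter names are hypothetical.

```python
class PointKalman:
    """Per-axis linear Kalman filter for one static 3D feature point.

    Simplified assumption: measurements are direct 3D estimates of the point
    with independent per-axis noise (not the projection-based EKF of [98]).
    """

    def __init__(self, initial, var0=1.0, meas_var=0.01, process_var=0.0):
        self.x = list(initial)   # position estimate
        self.p = [var0] * 3      # per-axis estimate variance
        self.r = meas_var        # measurement noise variance
        self.q = process_var     # > 0 lets the estimate follow a moving point

    def update(self, z):
        for k in range(3):
            self.p[k] += self.q                       # predict: point assumed static
            gain = self.p[k] / (self.p[k] + self.r)   # Kalman gain
            self.x[k] += gain * (z[k] - self.x[k])    # correct with measurement
            self.p[k] *= (1.0 - gain)                 # uncertainty shrinks
        return self.x


f = PointKalman((0.0, 0.0, 0.0), var0=1.0, meas_var=1.0)
f.update((2.0, 2.0, 2.0))
print([round(c, 3) for c in f.update((2.0, 2.0, 2.0))])  # -> [1.333, 1.333, 1.333]
```

Setting `process_var` above zero keeps the variance from collapsing to zero, so the gain stays positive and the estimate can follow a point that moves; this is one reason Kalman-type approaches can adapt to dynamic changes.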
Simultaneous localisation and mapping (SLAM) is an approach where a map of the unknown environment is built simultaneously whilst tracking the camera pose.
Researchers use it widely in mobile robotics, and some have adopted it for augmented reality as well. Several SLAM implementations for AR have been reported; one of the first was [99]. Developers may implement SLAM in a separate thread, so that the reconstruction works incrementally: a coarse map is used at the beginning, and its accuracy improves over time.
Figure 39. Augmented reality game Tower Defence for Symbian smartphone.
The SLAM-type approach is suitable even for mobile devices with limited computational capacity, as the 3D map of the environment is built incrementally. For example, [100] presents a real-time algorithm for mobile robotics that can recover the 3D trajectory of a monocular camera moving rapidly through a previously unknown scene.
Mobile games have a lot of potential for the mass market, which makes mobile AR an interesting research area. In addition to games, companies and research groups provide several tools for mobile AR development. For example, CellaGames provides a marker-based tracking library for the Symbian platform [66] for AR (game) development. This SMMT library (SLAM Multi-marker Tracker for Symbian) uses a SLAM-type approach (as its name suggests) to calculate the marker field at the beginning of the application. The Tower Defence game demo (see Figure 39) is one application that uses it. For marker-based AR, a SLAM-type approach allows markers to be set up freely.
The idea of automatic real-time calibration without any preparation is old, and researchers have carried out several implementations. For example, in [8] we presented a system where calibration is a real-time process and where the user can lay markers randomly in suitable places and start tracking immediately. This system allows the user to place markers in any 3D arrangement, even at arbitrary angles and on slanting planes. The accuracy of the system improves during use, as it updates the transformation matrices dynamically. In our system, the calibration of a marker field can also be implemented as a separate calibration stage; the user can save the results and use them later with another application. In our system, we created a graph of markers and used graph optimisation to create a marker field.
Our approach is well suited to situations where the marker field as a whole cannot be seen, but parts of it create chains of markers bridging one area to another. For example, it is suited to installations where marker fields extend from one room to another along a corridor, or where marker fields circulate around an object (Figure 40). In a situation where the whole set of markers is visible simultaneously, bundle adjustment would probably optimise the whole marker set better.
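The chaining idea can be sketched as follows. This is not the implementation of [8]; it is a hypothetical illustration that assumes markers lie on one floor plane (planar poses x, y, θ) and that each image in which two markers are visible yields a measured pose of one marker relative to the other. A breadth-first traversal of the marker graph then propagates poses outwards from a root marker, which gives an initial marker field; the graph optimisation step that would distribute errors around loops is omitted.

```python
import math
from collections import deque


def compose(a, b):
    """Planar pose composition a * b; poses are (x, y, theta)."""
    ax, ay, at = a
    bx, by, bt = b
    return (ax + math.cos(at) * bx - math.sin(at) * by,
            ay + math.sin(at) * bx + math.cos(at) * by,
            at + bt)


def invert(a):
    """Inverse of a planar pose."""
    x, y, t = a
    c, s = math.cos(t), math.sin(t)
    return (-c * x - s * y, s * x - c * y, -t)


def chain_marker_field(edges, root):
    """edges: list of (a, b, pose_of_b_in_a) measured from images where a and b
    are both visible. Returns every reachable marker's pose in the root frame."""
    adj = {}
    for a, b, rel in edges:
        adj.setdefault(a, []).append((b, rel))
        adj.setdefault(b, []).append((a, invert(rel)))
    poses = {root: (0.0, 0.0, 0.0)}
    queue = deque([root])
    while queue:                                   # breadth-first spanning tree
        cur = queue.popleft()
        for nxt, rel in adj.get(cur, []):
            if nxt not in poses:
                poses[nxt] = compose(poses[cur], rel)
                queue.append(nxt)
    return poses


edges = [(0, 1, (1.0, 0.0, 0.0)),           # marker 1 is 1 m ahead of marker 0
         (1, 2, (1.0, 0.0, math.pi / 2)),   # corridor turns left at marker 2
         (2, 3, (1.0, 0.0, 0.0))]           # marker 3 is only ever seen with 2
poses = chain_marker_field(edges, root=0)
print(tuple(round(c, 3) for c in poses[3]))  # -> (2.0, 1.0, 1.571)
```

Marker 3 is never seen together with marker 0, yet its pose in the root frame follows from the chain, which is what makes this kind of graph suitable for corridors and room-to-room installations.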
3.3.3 Bundle adjustment
Given a number of images taken from different viewpoints, bundle adjustment is the problem of simultaneously solving for the 3D coordinates describing the scene geometry, the relative motion of the camera and the optical characteristics of the camera.
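Let us assume that we have $m$ images taken with one or more cameras. Let $\{X_i \mid i = 1, \dots, n\}$ be a set of 3D points. We denote the 2D coordinates of point $X_i$ in image $j$ by $x_{ij}$, and the projection matrix associated with image $j$ by $P_j$. In the ideal case, $x_{ij} = P_j X_i$ (in homogeneous coordinates, up to scale). However, real measurements are subject to noise. Therefore, the problem is to find the maximum likelihood estimate (assuming Gaussian image noise) for the parameters $\{X_i\}$ and $\{P_j\}$ that minimises the reprojection error between the measured image points $x_{ij}$ and the reprojected points $P_j X_i$:

$$\min_{\{P_j\},\,\{X_i\}} \sum_{i,j} d\left(P_j X_i,\; x_{ij}\right)^2,$$

where $d(\cdot,\cdot)$ is the Euclidean image distance and the sum runs over the pairs $(i, j)$ for which point $X_i$ is observed in image $j$. Finding a solution for this is a bundle adjustment problem. In most cases, the bundle adjustment problem can be formulated as a non-linear least squares problem and solved, for example, with the Levenberg-Marquardt method.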
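The reprojection error described above can be written out directly. A minimal sketch, assuming each $P_j$ is a 3×4 matrix stored as nested lists and that observations are kept in a dictionary keyed by $(i, j)$ (a data layout chosen here for illustration): a bundle adjuster would minimise this value over all cameras and points, for example with Levenberg-Marquardt.

```python
def project(P, X):
    """Project 3D point X with a 3x4 matrix P; returns inhomogeneous (u, v)."""
    Xh = list(X) + [1.0]
    u, v, w = (sum(P[r][c] * Xh[c] for c in range(4)) for r in range(3))
    return u / w, v / w


def reprojection_error(cameras, points, observations):
    """Sum of squared distances d(P_j X_i, x_ij)^2 over all observations.

    cameras:      list of 3x4 projection matrices P_j
    points:       list of 3D points X_i
    observations: dict (i, j) -> measured image point x_ij
    Only pairs (i, j) where point i was actually seen in image j contribute.
    """
    total = 0.0
    for (i, j), (mu, mv) in observations.items():
        u, v = project(cameras[j], points[i])
        total += (u - mu) ** 2 + (v - mv) ** 2
    return total


P0 = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0]]   # normalised camera at origin
err = reprojection_error([P0], [(1.0, 2.0, 4.0)], {(0, 0): (0.25, 0.6)})
print(round(err, 6))  # -> 0.01
```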
If a system has $n$ points in $m$ images and each camera has 11 degrees of freedom, then the system has $3n + 11m$ parameters. Thus, it needs to factorise (and sometimes even invert) matrices of size $(3n + 11m) \times (3n + 11m)$. As $n$ and/or $m$ increase, the processing time and memory required to solve the problem grow polynomially (for a dense factorisation, with the cube of the number of parameters). General solutions to this problem are [72]: data reduction, interleaving and sparse methods.
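The growth of the problem is easy to check numerically. This small helper is a worked example of the $3n + 11m$ count (with an optional 6 degrees of freedom per camera for pre-calibrated cameras); it prints the parameter count and the size of the dense matrix for two hypothetical problem sizes.

```python
def ba_problem_size(n_points, m_images, cam_dof=11):
    """Number of unknowns in bundle adjustment (3 per point plus cam_dof per
    camera) and the number of entries in the dense square matrix."""
    params = 3 * n_points + cam_dof * m_images
    return params, params * params


for n, m in [(100, 10), (1000, 100)]:
    p, entries = ba_problem_size(n, m)
    print(f"n={n}, m={m}: {p} parameters, {p}x{p} matrix ({entries} entries)")
# -> n=100, m=10: 410 parameters, 410x410 matrix (168100 entries)
# -> n=1000, m=100: 4100 parameters, 4100x4100 matrix (16810000 entries)
```

Increasing both $n$ and $m$ tenfold multiplies the number of matrix entries by 100 and, for a dense factorisation whose cost grows with the cube of the matrix dimension, the work by roughly 1000.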
In data reduction, the system uses only some of the images (reduce m), or only some of the key points (reduce n). It may skip images with small parallax and reduce redundant data. The system includes only key images that best represent the data (based on a heuristic method). If a marker field is being created, it keeps the images containing several markers in different view angles, and retains enough images to cover the whole marker field. Generally, robust features covering the whole scene sparsely are used. Augmented reality systems can use data reduction easily: they reduce the number of features naturally, as they use only the corner points of markers for reconstruction.
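One possible heuristic for choosing key images is sketched below; it is an illustrative assumption, not a method from the cited literature. Each image is represented only by the set of markers detected in it: images that see a single marker are skipped (they carry no relative-pose information), and the remaining images are picked greedily until every marker is covered. A real system would also weigh viewing angles and make sure the chosen images link the markers together.

```python
def select_key_images(visible, min_markers=2):
    """Greedy data reduction for marker-field reconstruction (hypothetical).

    visible: dict image_id -> set of marker ids detected in that image.
    Images seeing fewer than min_markers markers are skipped. Then, while some
    marker is uncovered, the image covering the most uncovered markers is kept.
    """
    candidates = {img: ms for img, ms in visible.items() if len(ms) >= min_markers}
    uncovered = set().union(*candidates.values()) if candidates else set()
    keys = []
    while uncovered:
        best = max(candidates, key=lambda img: len(candidates[img] & uncovered))
        keys.append(best)
        uncovered -= candidates[best]
    return keys


visible = {"a": {1, 2}, "b": {1, 2, 3}, "c": {3, 4}, "d": {4}, "e": {2, 3}}
print(select_key_images(visible))  # -> ['b', 'c']
```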
Interleaving means minimising the reprojection error by varying only one set of parameters at a time. In interleaving, the system alternates between minimising the reprojection error by varying the cameras and minimising it by varying the points. It estimates each point independently assuming fixed cameras, and similarly it estimates each camera independently assuming fixed points. Thus, the biggest matrix in the minimisation problem is the 11 × 11 matrix of a single camera's parameters. Interleaving minimises the same cost function as the original problem, and thereby finds the same unique solution, if it exists. However, it takes longer to converge [72], which limits its use in real-time AR applications.
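The alternation can be illustrated with a toy one-dimensional analogue (an assumption made for brevity, not real camera geometry): points have positions $p_i$, cameras have offsets $t_j$, and each observation is $x_{ij} = p_i - t_j$. Each half-step solves every point, or every camera, on its own with the other set held fixed, and camera 0 is pinned at zero to remove the gauge freedom. In this noise-free example the camera error only halves per iteration, which hints at the slow convergence mentioned above.

```python
def interleaved_fit(obs, n_points, m_cams, iters=200):
    """Toy 1-D interleaving: minimise sum (p_i - t_j - x_ij)^2 by alternation.

    obs: dict (i, j) -> x_ij. Every point and every camera j >= 1 must have
    at least one observation. Camera 0 is held fixed at t_0 = 0.
    """
    p = [0.0] * n_points
    t = [0.0] * m_cams
    for _ in range(iters):
        # Points step: cameras fixed, each p_i is a 1-parameter least squares.
        for i in range(n_points):
            vals = [x + t[j] for (pi, j), x in obs.items() if pi == i]
            p[i] = sum(vals) / len(vals)
        # Camera step: points fixed, each t_j solved on its own (t_0 stays 0).
        for j in range(1, m_cams):
            vals = [p[i] - x for (i, cj), x in obs.items() if cj == j]
            t[j] = sum(vals) / len(vals)
    return p, t


true_p, true_t = [1.0, 2.0, 4.0], [0.0, 0.5]
obs = {(i, j): true_p[i] - true_t[j] for i in range(3) for j in range(2)}
p, t = interleaved_fit(obs, n_points=3, m_cams=2)
print([round(v, 6) for v in p], [round(v, 6) for v in t])
# -> [1.0, 2.0, 4.0] [0.0, 0.5]
```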
Sparse methods exploit the knowledge that the interaction between parameters is sparse (e.g. only some of the features appear in each image). In this case, the matrices in the optimisation problem have large empty blocks (zeros). Using the sparse variant of the Levenberg-Marquardt algorithm has been shown to bring clear computational benefits compared to the standard version [101].
A widely used method for sparse bundle adjustment is the one described in [101].
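The sparsity is easy to quantify. The sketch below, which assumes 2 residual rows per observation and 3 parameters per point, counts the structurally non-zero entries of the Jacobian: each observation of point $i$ in image $j$ touches only the parameters of that point and that camera. The problem size is hypothetical.

```python
def jacobian_sparsity(observations, n_points, m_images, cam_dof=11):
    """Fraction of structurally non-zero entries in the BA Jacobian.

    Each observation (i, j) yields 2 residual rows that depend only on the
    3 parameters of point i and the cam_dof parameters of camera j.
    """
    rows = 2 * len(observations)
    cols = 3 * n_points + cam_dof * m_images
    nonzeros = 2 * (3 + cam_dof) * len(observations)
    return nonzeros / (rows * cols)


# Hypothetical problem: 500 points, 30 images, each point seen in 5 images.
obs = [(i, (i + k) % 30) for i in range(500) for k in range(5)]
print(f"{jacobian_sparsity(obs, 500, 30):.4f}")  # -> 0.0077
```

The fraction equals $(3 + 11)/(3n + 11m)$, so it shrinks as the problem grows, which is where sparse solvers pay off. General-purpose optimisers can exploit the same structure; SciPy's `least_squares`, for instance, accepts a sparsity pattern through its `jac_sparsity` argument.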
In a general bundle adjustment problem, the internal camera parameters are unknown, e.g. in 3D reconstruction based on historical images from unknown cameras. In AR applications, the user can in most cases calibrate the camera beforehand, so that the system knows the internal camera parameters, which simplifies the bundle adjustment problem. With a pre-calibrated camera, the number of free camera parameters is reduced from 11 to 6: the camera's translation $(x, y, z)$ and its rotation $R$, described by three rotation angles.
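With a pre-calibrated camera, projection needs only the six pose parameters. The sketch below assumes a pinhole model with a known intrinsic matrix $K$ and no lens distortion, and builds the rotation from three angles in the $R = R_z R_y R_x$ order; that angle convention, the function names and the intrinsic values are assumptions, as libraries and papers differ.

```python
import math


def rotation_xyz(ax, ay, az):
    """R = Rz(az) @ Ry(ay) @ Rx(ax) from three angles in radians
    (an assumed convention)."""
    cx, sx = math.cos(ax), math.sin(ax)
    cy, sy = math.cos(ay), math.sin(ay)
    cz, sz = math.cos(az), math.sin(az)
    return [
        [cz * cy, cz * sy * sx - sz * cx, cz * sy * cx + sz * sx],
        [sz * cy, sz * sy * sx + cz * cx, sz * sy * cx - cz * sx],
        [-sy, cy * sx, cy * cx],
    ]


def project_calibrated(K, pose, X):
    """Project world point X with known intrinsics K (3x3) and a 6-DOF pose
    (ax, ay, az, tx, ty, tz): x ~ K (R X + t). Only the pose would be
    refined by bundle adjustment."""
    ax, ay, az, tx, ty, tz = pose
    R = rotation_xyz(ax, ay, az)
    t = (tx, ty, tz)
    Xc = [sum(R[r][c] * X[c] for c in range(3)) + t[r] for r in range(3)]
    u, v, w = (sum(K[r][c] * Xc[c] for c in range(3)) for r in range(3))
    return u / w, v / w


K = [[800.0, 0.0, 320.0], [0.0, 800.0, 240.0], [0.0, 0.0, 1.0]]  # hypothetical
print(project_calibrated(K, (0.0, 0.0, 0.0, 0.0, 0.0, 2.0), (0.1, 0.0, 0.0)))
# -> (360.0, 240.0)
```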
3.3.4 Dynamic multi-marker systems
The marker field may be dynamic: the user can move, remove or add markers while the application is in use. In this case, the system needs to be able to reconfigure the marker field dynamically at run time, not only in the initialisation phase.
The Kalman filter approach is able to adapt to dynamic changes and is often used for dynamic marker fields.
However, a more common situation is that an application has two types of markers: static ones that define the coordinate frame for camera tracking, and dynamic ones to manipulate objects and for user interactions. In this case, the system creates a multi-marker setup with all static markers, and the camera pose is calculated relative to these markers. In addition, the system detects and tracks all dynamic markers and calculates their individual pose relative to the camera.
An example of this kind of dynamic multi-marker system is an application where a static marker field defines a table surface, and users can use dynamic markers to pick, move and drop objects. Dynamic markers can also be used for defining interaction actions, e.g. proximity actions, deformations, etc.
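A sketch of the static/dynamic split, assuming the tracker already provides the camera-relative poses of the static marker field and of one dynamic marker as 4×4 rigid transforms: the dynamic marker's pose in field (table) coordinates is the inverse of the field pose multiplied by the dynamic marker pose. A proximity action then reduces to a distance test; the `near` function, the radius and the example numbers are hypothetical.

```python
import math


def translation(x, y, z):
    return [[1.0, 0.0, 0.0, x], [0.0, 1.0, 0.0, y],
            [0.0, 0.0, 1.0, z], [0.0, 0.0, 0.0, 1.0]]


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def invert_rigid(T):
    """Inverse of a rigid transform [R t; 0 1] is [R^T  -R^T t; 0 1]."""
    Rt = [[T[c][r] for c in range(3)] for r in range(3)]          # R transposed
    t = [-sum(Rt[r][c] * T[c][3] for c in range(3)) for r in range(3)]
    return [Rt[0] + [t[0]], Rt[1] + [t[1]], Rt[2] + [t[2]], [0.0, 0.0, 0.0, 1.0]]


def dynamic_in_field(T_cam_field, T_cam_dyn):
    """Pose of a dynamic marker in the static marker-field (table) frame."""
    return matmul(invert_rigid(T_cam_field), T_cam_dyn)


def near(T_field_dyn, target_xyz, radius):
    """Hypothetical proximity action: is the dynamic marker within `radius`
    of a virtual object placed at target_xyz in field coordinates?"""
    pos = [T_field_dyn[r][3] for r in range(3)]
    return math.dist(pos, target_xyz) <= radius


# Poses as a tracker might report them (made-up numbers):
T_cam_field = translation(0.0, 0.0, 1.0)   # table marker field 1 m away
T_cam_dyn = translation(0.1, 0.0, 1.0)     # hand-held marker on the table
T_field_dyn = dynamic_in_field(T_cam_field, T_cam_dyn)
print(near(T_field_dyn, (0.12, 0.0, 0.0), radius=0.05))  # -> True
```

Expressing the dynamic marker in field coordinates makes the interaction independent of where the camera happens to be.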