Many algorithms have been proposed over the last few decades to solve this kind of optimisation problem. They can be classified into two main categories. The first category comprises linear approaches, which are based on least-squares optimisation: an algebraic cost function is minimised, typically by means of the singular value decomposition (SVD) [82, 84]. This approach is efficient and yields a closed-form solution. Its main drawback, however, is that the quantity being minimised is not geometrically or statistically meaningful [25].
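As a minimal sketch of the linear category, the following fragment minimises an algebraic cost ||A x|| subject to ||x|| = 1 by taking the right singular vector associated with the smallest singular value. The line-fitting problem and the function names are illustrative assumptions chosen for brevity; they are not taken from the cited works.

```python
import numpy as np

def solve_homogeneous_least_squares(A):
    """Return the unit vector x that minimises the algebraic cost ||A x||."""
    # The rows of vt are the right singular vectors, ordered by decreasing
    # singular value, so the last row minimises ||A x|| over unit vectors.
    _, _, vt = np.linalg.svd(A)
    return vt[-1]

def fit_line_algebraic(points):
    """Fit a line a*x + b*y + c = 0 to 2D points (hypothetical example)."""
    pts = np.asarray(points, dtype=float)
    # Each point contributes one linear constraint [x, y, 1] . [a, b, c] = 0.
    A = np.column_stack([pts, np.ones(len(pts))])
    return solve_homogeneous_least_squares(A)

# Noise-free points on the line y = 2x + 1.
points = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
line = fit_line_algebraic(points)
```

The solution is obtained in closed form and is defined only up to sign. The residual A x is an algebraic quantity, not a distance measured in the image, which is the drawback noted above.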
The second category comprises iterative methods, in which algorithms such as Levenberg-Marquardt, Gauss-Newton, Newton, gradient descent or conjugate gradient are used to minimise a geometric cost function iteratively [199]. In this category the cost function has a geometric meaning and, under the assumption of Gaussian noise, its minimiser can be shown to be statistically optimal. However, the core problem with these methods, beyond their dependence on a good initialisation, is the high probability of converging to a local minimum or even to an infeasible solution.
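To illustrate the iterative category, the sketch below applies Gauss-Newton to a deliberately small geometric cost: the sum of squared distances between rotated source points and their matched targets, with the planar rotation angle as the only unknown. The problem, the function names and the tolerance are assumptions made for illustration only.

```python
import math

def rotate(theta, p):
    """Rotate the 2D point p by the angle theta (radians)."""
    c, s = math.cos(theta), math.sin(theta)
    return (c * p[0] - s * p[1], s * p[0] + c * p[1])

def gauss_newton_rotation(src, dst, theta0, max_iters=50, tol=1e-12):
    """Estimate theta minimising sum ||R(theta) p - q||^2 over matches (p, q)."""
    theta = theta0
    for _ in range(max_iters):
        c, s = math.cos(theta), math.sin(theta)
        jtj = 0.0  # J^T J, a scalar because there is a single parameter
        jtr = 0.0  # J^T r
        for p, q in zip(src, dst):
            rx, ry = rotate(theta, p)
            res = (rx - q[0], ry - q[1])  # geometric residual
            # Derivative of R(theta) p with respect to theta.
            jac = (-s * p[0] - c * p[1], c * p[0] - s * p[1])
            jtj += jac[0] ** 2 + jac[1] ** 2
            jtr += jac[0] * res[0] + jac[1] * res[1]
        step = -jtr / jtj  # Gauss-Newton update
        theta += step
        if abs(step) < tol:
            break
    return theta

src = [(1.0, 0.0), (0.0, 2.0), (-1.5, 0.5)]
true_theta = 0.3
dst = [rotate(true_theta, p) for p in src]
estimate = gauss_newton_rotation(src, dst, theta0=0.0)
```

Started close to the true angle, the iteration converges in a few steps. With an initial guess exactly half a turn away, the gradient of this cost vanishes, so the update is zero up to rounding and the iteration can stop at a stationary point that is not the minimiser, which is a simple instance of the dependence on initialisation.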
As a powerful alternative, convex optimisation forms a third category that makes it possible to get around the problems of both linear and iterative approaches [25]. Its cost function is geometrically meaningful and has a single global minimum [94].
1.4 Visual SLAM and Visual Odometry
Imagine a digital camera moving through its environment and acquiring a sequence of images. By exploiting the rich information in these images, the camera motion can be estimated by aligning the frames with each other and applying multiple view geometry. Estimating this motion amounts to recovering the trajectory, which is built up from the estimated camera positions and orientations at successive time steps. Specifically, the motion of image points (also known as image features) can be used to determine both the trajectory of the camera and the three-dimensional structure of the scene. Two main categories of motion estimation can be distinguished: "visual odometry - VO" (visual motion estimation) and "visual simultaneous localisation and mapping - visual SLAM (vSLAM)".
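The way a trajectory is built up from successive motion estimates can be sketched as follows. The example is restricted to planar motion, with each pose written as (x, y, theta) and each relative motion expressed in the current camera frame. This is a simplifying assumption for illustration, as are the function names; the relative motions are given directly instead of being estimated from images.

```python
import math

def compose(pose, motion):
    """Apply a relative motion (dx, dy, dtheta), expressed in the current
    camera frame, to a global pose (x, y, theta)."""
    x, y, theta = pose
    dx, dy, dtheta = motion
    c, s = math.cos(theta), math.sin(theta)
    return (x + c * dx - s * dy, y + s * dx + c * dy, theta + dtheta)

def integrate_trajectory(relative_motions, start=(0.0, 0.0, 0.0)):
    """Chain relative motions into the list of global poses."""
    trajectory = [start]
    for motion in relative_motions:
        trajectory.append(compose(trajectory[-1], motion))
    return trajectory

# Hypothetical input: move one unit forward, then turn left by 90 degrees,
# repeated four times, which traces a unit square.
square = [(1.0, 0.0, math.pi / 2)] * 4
trajectory = integrate_trajectory(square)
```

Because every pose is obtained from the previous one, an error in any relative motion propagates to all later poses.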
Although the dividing line between the two categories is not sharply defined, certain properties of the employed algorithm determine its category. Specifically, if the algorithm relies on image feature matching between pairs of images, it belongs to the "visual odometry - VO" category (visual motion estimation). If, however, the matching is performed between a map of the scene structure and the current image, it is identified as "visual simultaneous localisation and mapping - vSLAM" [210]. Our approach in this thesis belongs to the former category. Indeed, throughout this thesis we discuss how image features matched between images can be used as a general tool for motion estimation via convex optimisation.
Some basic notions about the SLAM category are worth noting. In the SLAM framework, the problem is to estimate the motion of a moving vehicle as it continuously observes and maps its unknown environment using sensors, which do not necessarily include cameras. When cameras are the only exteroceptive sensor, the approach is called visual SLAM (vSLAM); in some applications it is referred to as vision-based SLAM. Many studies in the last decade have used visual systems as the only external perception for SLAM systems [47, 103, 153, 162], owing to the rich information that cameras provide.
The first notions of SLAM appeared during the period 1985-1990, when Chatila and Laumond [34] and Smith et al. [181] proposed a mapping and localisation framework. Some time later, the problem took the name SLAM (simultaneous localisation and mapping). The key feature of SLAM is its capacity to build a global map of the environment and to use this map to deduce the system's own location at any time step [62]. To do so successfully, the system must possess exteroceptive sensors (range lasers, sonar, cameras or GPS) and proprioceptive sensors (encoders, accelerometers and gyroscopes). However, all these sensors are noisy and have limited range. Their ability to estimate the vehicle position accurately is therefore compromised, since errors are cumulative.
The simultaneous localisation and mapping (SLAM) algorithm has been formulated extensively for indoor and outdoor ground vehicle applications. The problem is stated as follows: starting from an initial position, the vehicle navigates through an unknown environment and obtains a set of sensor measurements at each position. The final aim is to process these measurements to estimate the vehicle position while concurrently building a map of the environment. SLAM is known to be computationally expensive, especially in a 6DOF implementation. Enlarging the state to include the sensor errors, and handling the sampling rates of inertial sensors, aggravates the problem further. In more detail, as the number of features in the system state increases, the computational cost grows rapidly and it becomes difficult to maintain frame-rate operation. To solve this problem, old features can be removed from the state to keep the number of features stable. However, once old features are removed, previously mapped areas can no longer be recognised in the future. In the context of visual odometry, this is not considered a problem.
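The trade-off behind removing old features can be sketched with a map of bounded size that discards its oldest entries. The class name, the capacity and the feature identifiers are hypothetical, and no particular SLAM implementation is implied.

```python
from collections import OrderedDict

class BoundedFeatureMap:
    """Keeps at most max_features landmarks, dropping the oldest first."""

    def __init__(self, max_features):
        self.max_features = max_features
        self._features = OrderedDict()  # feature id -> 3D position

    def add(self, feature_id, position):
        """Insert or refresh a feature, then evict the oldest if over capacity."""
        self._features[feature_id] = position
        self._features.move_to_end(feature_id)
        while len(self._features) > self.max_features:
            self._features.popitem(last=False)  # remove the oldest feature

    def knows(self, feature_id):
        """Return True if the feature is still part of the state."""
        return feature_id in self._features

    def __len__(self):
        return len(self._features)

# Hypothetical run: ten features observed in sequence, capacity of five.
feature_map = BoundedFeatureMap(max_features=5)
for i in range(10):
    feature_map.add(i, (float(i), 0.0, 1.0))
```

After this run the state size stays at the chosen bound, but the earliest features are gone, so a return to the area where they were observed could not be matched against the map.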
In contrast, the visual odometry approach (visual motion estimation) relies on consecutive pairs of images to estimate only the relative motion, neglecting scene