3.6 Simulation Results
3.6.1 Systematic Error
The MeanShift-based tracker searches for regions of the image with similar color characteristics in order to maximize a similarity metric based on the distance to a target color-histogram. The tracker has no concept of the physical shape of the tracked object; hence, it is essentially tracking the convex hull of the apparent contour of the object in the image frame. The centroid of the apparent contour may not coincide with the projection of the physical center of mass of the tracked object. Therefore, the tracker may introduce a systematic bias into the estimate of the object’s 3D position, which is based on the computed centroid in each camera view. Model-based trackers provide an alternative that overcomes this systematic error. Given a CAD model, model-based trackers use identifiable features, such as line edges or particular keypoint features on the object’s surface, to orient a virtual model of the object within each camera view.
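The color-histogram similarity that drives the MeanShift search can be illustrated with a small sketch. This section does not name the metric, so the example assumes the Bhattacharyya coefficient, a common choice for MeanShift tracking; the function names and the 4-bin histograms are hypothetical.

```python
import math

def normalize(hist):
    """Scale a histogram so that its bins sum to 1."""
    total = float(sum(hist))
    if total == 0.0:
        raise ValueError("histogram is empty")
    return [h / total for h in hist]

def bhattacharyya_coefficient(p, q):
    """Similarity in [0, 1] between two normalized histograms."""
    return sum(math.sqrt(pi * qi) for pi, qi in zip(p, q))

def histogram_distance(p, q):
    """Distance derived from the coefficient (0 means identical)."""
    bc = min(1.0, bhattacharyya_coefficient(p, q))  # guard against round-off
    return math.sqrt(1.0 - bc)

# Hypothetical 4-bin color models (assumed values, for illustration only).
target = normalize([8, 1, 1, 0])       # target color model
candidate_a = normalize([7, 2, 1, 0])  # region with similar colors
candidate_b = normalize([0, 1, 1, 8])  # region with different colors

# A tracker of this kind would prefer the candidate with the smaller distance.
print(histogram_distance(target, candidate_a))
print(histogram_distance(target, candidate_b))
```

Because the metric only compares color distributions, nothing in it encodes the shape of the object, which is consistent with the centroid bias described above.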
Tracking a physical model of the object permits model-based trackers to estimate the center of mass of the object more precisely, which may not coincide with the centroid of the apparent contour of the object. Model-based trackers are prominent in the field of visual-servoing, in which cameras are used to guide robots interacting with 3D objects within their view. Comport et al. demonstrate how a visual-servo framework can be used to track object features such as straight line edges in order to infer the 3D orientation and position of a known object [35]. Pieropan et al. use RGB-D cameras to construct a 3D model of the object to be tracked [112]. The algorithm models the object by an inscribed cuboid and learns BRISK feature descriptions of each face of the cuboid from the target object. By matching visible keypoints on the object to the learnt faces of the 3D object model, the orientation and centroid of the object cuboid are tracked.
Figure 3.5: Semilog plot of RMSE (in pixels) performance in the image-domain of visual trackers: proposed trackers (UKF and IMM variants), state-of-the-art TLD [78], and an alternative color-based tracker [105]. Horizontal axis: test scenario; vertical axis: image-domain RMSE (pixels), logarithmic scale. Camera sample rate: 20 Hz.
Figure 3.6: Semilog plot of RMSE (in m) performance in the world-domain of visual trackers: proposed trackers (UKF and IMM variants), state-of-the-art TLD [78], and an alternative color-based tracker [105]. Horizontal axis: test scenario. Camera sample rate: 20 Hz.
Test Scenario    Proposed UKF Tracker                Proposed IMM Tracker
                 Image RMSE  3D RMSE  Frames lost    Image RMSE  3D RMSE  Frames lost
                 (pixels)    (m)                     (pixels)    (m)
CG0I0            1.85728     0.00222  0              1.24879     0.00168  0
FM0.0I0          3.83396     0.00427  0              3.42458     0.00388  0
FM0.5I0          5.20128     0.00558  0              4.31347     0.00477  0
FM1.0I0          6.82793     0.00748  0              5.16160     0.00561  0
FM1.5I0          8.80483     0.00929  0              5.91680     0.00646  0
FM2.0I0          27.34134    0.44964  370            6.58141     0.00719  0
LD0I0            1.52338     0.19418  0              1.04194     0.19251  0
LD1I0            1.54125     0.18730  0              1.05174     0.19482  0
LD2I0            1.54424     0.16730  0              1.04943     0.16693  0
LD3I0            1.57998     0.14412  0              1.07110     0.14493  0
OC0I0            2.98911     0.01773  9              2.72483     0.01923  8
OC1I0            2.41757     0.00318  3              2.05014     0.00290  3
OC2I0            2.85883     0.01543  59             2.36748     0.01532  9
FM0.5-OC0-I0     5.76043     0.01873  3              5.00987     0.01843  3
FM0.5-OC1-I0     5.40373     0.00591  8              4.55332     0.00577  8
FM0.5-OC2-I0     28.43920    1.52156  179            4.64167     0.01192  15
Table 3.2: Performance comparison of proposed visual-trackers. The RMSE for the triangulated 3D trajectory of the target object is presented alongside the average RMSE of the image trajectories. The visual-trackers may lose lock on the target; hence, the number of frames in which the tracker lost the target is shown. IMU sample rate: 100 Hz; camera sample rate: 20 Hz.
Test Scenario    Color-based particle filter [105]   TLD [78]
                 Image RMSE  3D RMSE  Frames lost    Image RMSE  3D RMSE  Frames lost
                 (pixels)    (m)                     (pixels)    (m)
CG0I0            10.34412    0.02131  165            19.85577    0.02762  362
FM0.0I0          4.95430     0.01022  0              21.69398    0.03212  108
FM0.5I0          5.79461     0.01222  0              25.16244    0.04174  54
FM1.0I0          6.33667     0.01211  197            22.87521    0.05441  118
FM1.5I0          6.88781     0.01337  47             19.50455    0.02856  267
FM2.0I0          8.41893     0.01346  341            21.99306    0.03720  177
LD0I0            2.04186     0.19210  0              15.21141    0.18120  19
LD1I0            2.32111     0.17916  0              17.81229    0.19933  7
LD2I0            2.03279     0.16868  0              16.63727    0.13731  271
LD3I0            2.03388     0.14477  0              18.74799    0.15071  17
OC0I0            4.95574     0.02103  18             27.43972    0.10352  162
OC1I0            4.30165     0.01184  7              22.23004    0.03483  261
OC2I0            5.28539     0.01724  173            23.23003    0.06858  244
FM0.5-OC0-I0     6.39345     0.01993  24             24.70154    0.03724  166
FM0.5-OC1-I0     6.22938     0.01351  217            20.70481    0.02908  309
FM0.5-OC2-I0     6.17545     0.01639  131            23.90602    0.05844  197
Table 3.3: Performance comparison of color-based and feature-based visual-trackers. The visual trackers presented are the adaptive color-based particle filter [105] and the Track-Learn-Detect method of [78]. The RMSE for the triangulated 3D trajectory of the target object is presented alongside the average RMSE of the image trajectories. The visual-trackers may lose lock on the target; hence, the number of frames in which the tracker lost the target is shown. Note that two entries of [105] have been omitted, as the tracker completely fails to track the target object and the RMSE values are on the order of 10^15. Camera sample rate: 20 Hz.
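The quantities reported in Tables 3.2 and 3.3 can be illustrated with a minimal sketch of an RMSE computation that also counts lost frames. The text does not state how lost frames enter the RMSE, so the sketch assumes that a lost frame is reported as `None` and excluded from the error; the function name and the sample trajectory are hypothetical.

```python
import math

def rmse_and_lost_frames(estimates, ground_truth):
    """Return (RMSE over tracked frames, number of lost frames).

    `estimates` holds one coordinate tuple per frame, or None where the
    tracker reported no lock (an assumed convention, not from the text).
    """
    squared_errors = []
    lost = 0
    for est, truth in zip(estimates, ground_truth):
        if est is None:
            lost += 1
            continue
        # Squared Euclidean distance between estimate and ground truth.
        squared_errors.append(sum((e - t) ** 2 for e, t in zip(est, truth)))
    if not squared_errors:
        return float("nan"), lost
    return math.sqrt(sum(squared_errors) / len(squared_errors)), lost

# Hypothetical image-domain trajectory in pixels (illustrative values only).
truth = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
track = [(0.0, 0.0), (1.0, 2.0), None, (3.0, -2.0)]
error, lost = rmse_and_lost_frames(track, truth)
print(error, lost)
```

The same function applies to 3D positions in metres, since it makes no assumption about the number of coordinates per sample.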
Figure 3.7: Occlusion sequence, frames #12-16. The simulated IMU moves right to left, with an occlusion region in the middle of the trajectory.
3.7 Conclusion
Vision-based systems are the leaders in motion-capture accuracy; hence, visual-tracking is an integral part of the larger visual-inertial sensor fusion scheme if it is to achieve a state-of-the-art level of precision. This chapter has briefly reviewed a select subset of contemporary visual-tracking algorithms, remarking on the particular application of interest where appropriate. Popular filter frameworks were then presented as the backbone of many visual-tracking algorithms. A class of robust visual-trackers (i.e. MeanShift), known as a low-complexity method for addressing motion blur and partial occlusions, was then analyzed in more detail as a part of the visual-tracker proposed to tackle the specific visual-inertial sensor-fusion application at hand. The proposed visual-tracker was presented with two different filter frameworks (i.e. UKF and IMM) and evaluated on a series of test scenarios alongside the state-of-the-art TLD [78] tracking-by-detection scheme and an adaptive color-based tracker [105]. The IMM variant of the proposed tracker was demonstrated to be a suitable tracking algorithm for the motion capture application in mind, capable of handling fast motion as well as occlusions. The importance of camera calibration was also emphasized, with several of the test scenarios examining the effects of lens distortion on the ability of the visual-tracking algorithm to appropriately triangulate the target object. The simulation environment used to generate the test scenarios is examined in detail in Chapter 4, which also explores visual-inertial sensor fusion using the proposed visual-tracker.
Chapter 4
Visual-Inertial Sensor Fusion
4.1 Introduction
Visual-inertial sensor fusion is a powerful tool for many industries: it allows medical practitioners to better understand and diagnose illnesses; it allows engineers to design more flexible and immersive virtual reality environments; and it allows film directors to more fully capture motion in a scene [8, 52, 145]. The complementary nature of visual and inertial sensors is well-touted throughout these industries. The faster sampling rate of inertial sensors fits lock-and-key with the higher accuracy of visual sensors, unlocking the potential for algorithms capable of tracking high-velocity objects through cluttered environments.
Sensor-fusion involves the use of different sensing modalities to design robust and accurate systems. Augmented reality is one such application, requiring visual, inertial, and other classes of sensors in order to reliably mesh together virtual and real-world elements. Sensor-fusion typically comes up in the development of human-computer interfaces (HCI), which warrant a greater degree of robustness based on the naturalness of the interface.
Any system that attempts to integrate multiple sensing modalities will encounter challenges that single-sensor systems do not face. The main challenges are: sensor synchronization, as measurements may require temporal and spatial alignment to a common time-axis and reference frame; multi-rate sensors, as the sampling rates of the sensors may differ or not be integer multiples of each other; data imperfection, as each sensor may follow a different noise model; and data correlations, as distributed sensors may still be subject to correlated noise or system parameters. All these issues pose a serious challenge to the sensor-fusion algorithm and need to be resolved in both an efficient and robust manner.
Figure 4.1: Illustration of simulation environment
A main difficulty in developing sensor-fusion algorithms is debugging them effectively. In many sensor-fusion applications, ground-truth data may be difficult or even impossible to acquire. A common barrier to the development and comparison of different sensor-fusion methods is the need for holistic performance metrics that assess an algorithm’s robustness, resource expense, and precision. The scope of sensor-fusion applications makes it difficult to isolate individual components for analysis. For these reasons, a simulation environment capable of generating the various sensing information while preserving the subtle data correlations between different sensors can be invaluable to the development of effective algorithms.
A simulation environment is presented, which applies directly to visual-inertial sensor fusion applications [137]. The environment is able to generate both the raw camera frames as well as all the pertinent information from a nine-axis IMU with tri-axial accelerometers, gyroscopes, and magnetometers. The simulation environment provides the ability to inject different noise sources in order to faithfully reproduce real-world sensor data.
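As an illustration of noise injection, the following sketch corrupts a clean accelerometer signal with white noise and a slowly drifting bias. The noise model, the function name, and all parameter values are assumptions made for this example; they are not taken from the simulation environment of [137].

```python
import random

def corrupt_accelerometer(clean, dt, noise_std, bias_walk_std, seed=0):
    """Add white noise and a random-walk bias to clean samples (m/s^2).

    Assumed noise model: additive Gaussian noise on every sample plus a
    bias that drifts as a random walk with steps scaled by sqrt(dt).
    """
    rng = random.Random(seed)  # fixed seed keeps the run repeatable
    bias = 0.0
    noisy = []
    for a in clean:
        bias += rng.gauss(0.0, bias_walk_std) * dt ** 0.5
        noisy.append(a + bias + rng.gauss(0.0, noise_std))
    return noisy

dt = 0.01                # 100 Hz IMU sample period
clean = [0.0] * 500      # stationary sensor, 5 s of data
noisy = corrupt_accelerometer(clean, dt, noise_std=0.05, bias_walk_std=0.01)
print(len(noisy), min(noisy), max(noisy))
```

Setting both standard deviations to zero returns the clean signal unchanged, which gives a convenient baseline when isolating the effect of each noise source.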
Synchronization is a key component of many distributed sensing networks. In order to combine different sensor data optimally, the data needs to be temporally aligned. Different approaches exist for synchronizing data. The data can be transmitted to a centralized module which processes all sensing information; however, this approach can be computationally intensive and can be daunting for real-time applications. Alternatively, dedicated modules could provide an external trigger to all sensing devices in order to synchronize the capture of information; however, this approach requires extra hardware, which may encumber the portability of the fusion system. To boost the flexibility of the sensor-fusion system, many approaches have also been developed to handle out-of-sequence measurements (OOSM) or temporally misaligned information within the sensor-fusion algorithm.
A visual-inertial sensor fusion algorithm capable of handling de-synchronized measurements will be presented and validated using the aforementioned simulation environment. The resulting algorithm will be capable of fusing raw camera frame information, which has a relatively low sample rate, with higher rate inertial information in real-time without the need for additional hardware for synchronization.
This chapter is organized as follows. Section 4.2 reviews visual-inertial sensor-fusion, and Section 4.3 introduces a simulation environment for visual-inertial sensor fusion. The individual modules of the simulation environment are validated using the visual-tracker outlined in Chapter 3 and inertial-only trackers. The visual-inertial sensor fusion framework is then detailed in Section 4.4, followed by the application of the algorithm to asynchronous sensor data in Section 4.5. Concluding remarks are given in Section 4.6.
4.2 Background
The field of visual-inertial sensor fusion has many diverse applications [8, 52, 145], and correspondingly a large amount of research effort has gone towards developing robust, precise systems.
Different frameworks exist for visual-inertial sensor fusion; the two most prominent groups are loosely-coupled and tightly-coupled sensor fusion schemes. The groups are divided based on how they integrate the two sensing modalities. Tightly-coupled sensor fusion algorithms combine the raw visual (i.e. camera frames) and inertial (e.g. accelerometer, gyroscope, and magnetometer) measurements in a unified filter stage [89]. The loosely-coupled approach, in contrast, tends to process each modality independently prior to the fusion; hence, the filter framework exhibits a characteristic cascade structure [36, 134]. The choice of approach is largely dependent on the application at hand. The tightly-coupled approach is used extensively for visual odometry, in which the camera and inertial sensors are anchored to a common platform, such as in virtual-reality applications. The loosely-coupled approach is better suited to applications where the visual and inertial sensors do not share the same frame of reference, such as motion capture applications with external cameras and strapdown inertial sensors [145].
There are many challenges faced by sensor-fusion algorithms as a result of combining disparate sensing modalities. Data imperfection in the form of sensor noise is ever-present and varies between the sensors employed, requiring different measurement models. The various sensors may also be correlated in inconspicuous ways, exposing the pitfalls of data incest, in which un-modelled correlations lead to over-confidence in the measurements. Operational timing is a prominent issue in multi-sensor networks; data alignment or registration is an important step towards fusing information in a consistent manner across the various sensing modalities. Temporally misaligned and/or out-of-sequence observations across the sensors are particularly difficult to handle, especially for real-time applications, where a naïve approach of buffering information prior to performing some form of bundle adjustment may be infeasible with regard to computational resources (e.g. time and memory) [82].
Visual sensors (i.e. cameras) typically have low sampling rates due to the computational burden of image-processing for real-time applications. As a result, camera frames with fast target object motion will be corrupted by motion blur, which degrades the visual system’s ability to precisely estimate the pose and position of the object. Visual sensors are also highly dependent on line-of-sight and will have trouble tracking in cluttered environments, where extraneous items in the environment may block or occlude the view of the target object.
Inertial sensors provide a natural complement to visual sensors. Since the IMU would be attached to the target object, the inertial sensor does not suffer from the equivalent of occlusions faced by cameras. The high sampling rate of inertial sensors also provides key information for estimating the pose of the target under fast motion. However, the inertial sensors suffer from bias noise, which limits the sensor’s contribution at low target velocities.
A probabilistic framework is a well-established approach to sensor-fusion, using probability distributions to model uncertainty in the data measurements. However, many other frameworks have been developed in order to intuitively address data and model uncertainty, including fuzzy-reasoning and evidential frameworks [82].
Kalman filter-based approaches have dominated the probabilistic approaches to sensor-fusion. However, the classic KF-based approaches are prone to filter divergence in the presence of outliers or spurious data. This behaviour can affect the robustness of visual systems, which may be subject to complete occlusions. Ligorio et al. compared a DLT-based and an error-driven EKF, and reported pitfalls of the frameworks in the event of loss of line-of-sight to the target object [89]. Bleser et al. use an outlier rejection scheme for an EKF-based approach to visual-inertial sensor fusion [18]. The outlier rejection scheme is based on the covariance-weighted ℓ2 norm of the EKF innovation. A scheme is also implemented to detect filter divergence using the Frobenius norm of the state covariance matrix, as well as by monitoring the norm of the orientation quaternion in case it deviates sufficiently from unit norm, indicating filter error.
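The outlier test and divergence checks described above can be sketched as follows. This is a minimal illustration for a 2-D image measurement, not the implementation of [18]: the chi-square gate, the covariance limit, the quaternion tolerance, and all function names are assumptions.

```python
import math

def weighted_norm_sq_2d(innovation, S):
    """Covariance-weighted squared l2 norm v^T S^-1 v for a 2-D innovation."""
    (a, b), (c, d) = S
    det = a * d - b * c
    if a <= 0.0 or det <= 0.0:
        raise ValueError("innovation covariance must be positive definite")
    v0, v1 = innovation
    # Apply the closed-form inverse of the 2x2 matrix S to the innovation.
    w0 = (d * v0 - b * v1) / det
    w1 = (-c * v0 + a * v1) / det
    return v0 * w0 + v1 * w1

def is_outlier(innovation, S, gate=9.21):
    """Reject a measurement whose weighted norm exceeds the gate.

    9.21 is roughly the 99% quantile of a chi-square distribution with
    2 degrees of freedom; the threshold is an assumption of this sketch.
    """
    return weighted_norm_sq_2d(innovation, S) > gate

def frobenius_norm(P):
    """Frobenius norm of a matrix given as a list of rows."""
    return math.sqrt(sum(x * x for row in P for x in row))

def has_diverged(P, quaternion, max_cov_norm=1e3, quat_tol=0.05):
    """Flag divergence from covariance growth or a non-unit quaternion."""
    q_norm = math.sqrt(sum(q * q for q in quaternion))
    return frobenius_norm(P) > max_cov_norm or abs(q_norm - 1.0) > quat_tol

S = [[4.0, 0.0], [0.0, 1.0]]       # hypothetical innovation covariance (pixels^2)
print(is_outlier((2.0, 1.0), S))   # weighted norm 2.0, inside the gate
print(is_outlier((8.0, 0.0), S))   # weighted norm 16.0, outside the gate
```

A rejected measurement would typically be skipped for that update step, so that the filter coasts on its prediction while the target is occluded.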
Vision-based systems are highly accurate at low target velocities but degrade at higher velocities, whereas inertial-based systems are well suited to higher velocities but degrade at lower velocities; this complementarity compels the use of multi-rate systems. Armesto et al. generalize the classic EKF and UKF visual-inertial sensor fusion approaches to multi-rate systems and validate the system on a mobile-robot application where motion is restricted to a single plane [5].
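The multi-rate idea can be illustrated with a minimal one-dimensional Kalman filter that predicts at the IMU rate and corrects only when a camera sample arrives. This is a simplified linear sketch, not the formulation of [5]; the 100 Hz and 20 Hz rates follow the sample rates quoted for Table 3.2, while the class name, the noise variances, and the noise-free inputs are assumptions.

```python
class MultiRateFilter1D:
    """1-D position/velocity Kalman filter with asynchronous updates."""

    def __init__(self, accel_var, camera_var):
        self.p, self.v = 0.0, 0.0            # position and velocity estimates
        self.P = [[1.0, 0.0], [0.0, 1.0]]    # state covariance
        self.accel_var = accel_var           # assumed accelerometer variance
        self.camera_var = camera_var         # assumed camera position variance

    def predict(self, accel, dt):
        """Propagate the state with one inertial sample."""
        self.p += self.v * dt + 0.5 * accel * dt * dt
        self.v += accel * dt
        (p00, p01), (p10, p11) = self.P
        q = self.accel_var
        # P <- F P F^T + Q with F = [[1, dt], [0, 1]].
        self.P = [
            [p00 + dt * (p01 + p10) + dt * dt * p11 + q * dt ** 4 / 4.0,
             p01 + dt * p11 + q * dt ** 3 / 2.0],
            [p10 + dt * p11 + q * dt ** 3 / 2.0,
             p11 + q * dt * dt],
        ]

    def update(self, z):
        """Correct the state with one camera position measurement."""
        (p00, p01), (p10, p11) = self.P
        s = p00 + self.camera_var
        k0, k1 = p00 / s, p10 / s            # Kalman gain for H = [1, 0]
        y = z - self.p                       # innovation
        self.p += k0 * y
        self.v += k1 * y
        self.P = [[(1.0 - k0) * p00, (1.0 - k0) * p01],
                  [p10 - k1 * p00, p11 - k1 * p01]]

imu_rate, camera_rate = 100, 20
dt = 1.0 / imu_rate
steps_per_frame = imu_rate // camera_rate    # 5 IMU samples per camera frame

true_p, true_v, accel = 0.2, 0.0, 1.0        # filter starts with a 0.2 m error
f = MultiRateFilter1D(accel_var=0.01, camera_var=1e-4)
camera_updates = 0
for k in range(1, imu_rate + 1):             # simulate one second
    true_p += true_v * dt + 0.5 * accel * dt * dt
    true_v += accel * dt
    f.predict(accel, dt)                     # every IMU sample
    if k % steps_per_frame == 0:             # camera frame available
        f.update(true_p)
        camera_updates += 1
print(f.p, true_p, camera_updates)
```

The inputs here are noise-free so that the behaviour is easy to inspect: the initial position error is removed by the camera updates, while the inertial predictions carry the estimate between frames.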
Synchronization of multiple sensors of varying domains is a key issue in distributed sensing networks. The problem is also known as temporal alignment, and its solution is a prerequisite for successful sensor-fusion. Dynamic programming is a conventional approach to aligning different sensor data samples. Dynamic time-warping (DTW) attempts to minimize an appropriately chosen norm (e.g. ℓ2) between the samples of two sensors while satisfying weak constraints on the monotonicity of the matched samples, boundary conditions (e.g. first samples are synchronized), and continuity of the samples [98].
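A minimal sketch of dynamic time-warping for two scalar sample sequences is given below. It uses the textbook dynamic-programming recurrence with the absolute difference as the per-sample cost; the function name is hypothetical, and the formulation in [98] may differ in its constraints.

```python
def dtw_distance(a, b):
    """Accumulated |a_i - b_j| cost of the best monotonic alignment."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] is the best accumulated cost aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0  # boundary condition: the first samples are matched
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # Monotonicity and continuity: only step from the three
            # neighbouring cells (advance a, advance b, or advance both).
            cost[i][j] = d + min(cost[i - 1][j],
                                 cost[i][j - 1],
                                 cost[i - 1][j - 1])
    return cost[n][m]

slow = [0.0, 1.0, 2.0]          # sensor sampled at a lower rate
fast = [0.0, 1.0, 1.0, 2.0]     # same signal with one repeated sample
print(dtw_distance(slow, fast))
```

The table has one cell per pair of samples, so both time and memory grow with the product of the sequence lengths, which is one reason DTW is usually applied offline or over short windows.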
The problem of out-of-sequence measurements (OOSM) is related to temporal alignment and arises when sensor measurements arrive after the filter has been updated to a later time-step. Bar-Shalom et al. addressed the issue for KF-based fusion approaches by developing a means to incorporate single-lag OOSMs into the current filter estimate [13]. The work is generalized in [14] to multi-lag OOSM, for measurements arriving more than a single time step after the filter update.
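To make the OOSM problem concrete, the sketch below shows a naïve baseline that buffers every measurement and re-runs a scalar Kalman filter in time order whenever a new one arrives. This is explicitly not the update of [13] or [14], which avoids reprocessing; the random-walk model, the parameter values, and all names are assumptions of this example.

```python
def run_filter(measurements, x0=0.0, p0=1.0, q=0.01, r=0.1):
    """Scalar random-walk Kalman filter over (time, value) pairs."""
    x, p, t_prev = x0, p0, 0.0
    for t, z in sorted(measurements):   # process strictly in time order
        p += q * (t - t_prev)           # predict: variance grows with time
        k = p / (p + r)                 # Kalman gain
        x += k * (z - x)                # update with the measurement
        p *= (1.0 - k)
        t_prev = t
    return x, p

class ReplayFilter:
    """Keeps all measurements and re-filters on every arrival."""

    def __init__(self):
        self.buffer = []
        self.latest_time = 0.0
        self.state = (0.0, 1.0)         # matches the run_filter defaults

    def add(self, t, z):
        """Store a measurement; return True if it arrived out of sequence."""
        out_of_sequence = t < self.latest_time
        self.buffer.append((t, z))
        self.latest_time = max(self.latest_time, t)
        # Replaying the whole buffer is simple but its cost grows with time.
        self.state = run_filter(self.buffer)
        return out_of_sequence

rf = ReplayFilter()
arrivals = [(0.1, 1.0), (0.3, 1.2), (0.2, 1.1)]   # third sample arrives late
flags = [rf.add(t, z) for t, z in arrivals]
print(flags, rf.state)
```

The result matches in-order processing by construction, but the growing buffer and repeated filtering illustrate why dedicated OOSM updates are preferred for real-time use.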