Repeatable Condition-Invariant Visual Odometry for Sequence-Based Place Recognition

Full text

(1)Repeatable Condition-Invariant Visual Odometry for Sequence-Based Place Recognition Arren Glover1,2 , Edward Pepperell2 , Gordon Wyeth2 , Ben Upcroft2,3 and Michael Milford2,3 1 iCub Facility, Italian Institute of Technology, Genova, Italy 2 Queensland University of Technology, Brisbane, Australia 3 Australian Centre for Robotic Vision, Queensland University of Technology, Brisbane, Australia [email protected], [email protected]. Abstract This paper describes a vision-only system for place recognition in environments that are traversed at different times of day, when changing conditions drastically affect visual appearance, and at different speeds, where places aren’t visited at a consistent linear rate. The major contribution is the removal of wheel-based odometry from the previously presented algorithm (SMART), allowing the technique to operate on any camera-based device; in our case a mobile phone. While we show that the direct application of visual odometry to our nighttime datasets does not achieve a level of performance typically needed, the VO requirements of SMART are orthogonal to typical usage: firstly only the magnitude of the velocity is required, and secondly the calculated velocity signal only needs to be repeatable in any one part of the environment over day and night cycles, but not necessarily globally consistent. Our results show that the smoothing effect of motion constraints is highly beneficial for achieving a locally consistent, lighting-independent velocity estimate. We also show that the advantage of our patch-based technique used previously for frame recognition, surprisingly, does not transfer to VO, where SIFT demonstrates equally good performance. Nevertheless, we present the SMART system using only vision, which performs sequence-base place recognition in extreme low-light conditions where standard 6-DOF VO fails and that improves place recognition performance over odometry-less benchmarks, approaching that of wheel odometry.. 1. Introduction. For low cost and easy set-up applications it is desirable to employ a single camera on a robot or autonomous vehicle and have it localise and navigate over long periods of time. Both motion estimation (visual odometry (VO)) and place recognition are required for such a task and in good conditions robotic navigation systems have shown success on routes up to 1000 km in length [Cummins and Newman, 2011]. Many state-of-the-art systems rely on point features for both VO and place recognition, which achieve. (a). (b). (c). (d). Figure 1: The condition-invariant place recognition system presented here matches between (a) day-time and (b) night-time conditions. In (b) we also show the experimentation with novel patch-based features for VO. The framematches over time are non-linear and noisy (c) as a route is traversed at different speeds but is linearised (d) by accounting for the speed variation using VO. The stronger linear diagonal line is more easily matched by a sequencebased place recognition system. best performance when using crisp, clean, high-resolution imagery [Geiger et al., 2011]. Unfortunately robots in realworld environments, such as self driving cars, must also contend with poor weather conditions and illumination change that drastically reduce the information available to the visual navigation system, conditions under which many current state-of-the-art systems fail catastrophically [Milford and Wyeth, 2012]. In particular, the problem of place recognition between sunny days and stormy nights can be extremely difficult (even for humans [Milford et al., 2014a]) due to the significant change in appearance of an image. Lighting is not evenly cast across a scene making many areas too dark to use for recognition (see Figure 1), and due to camera hardware limitations, image blur becomes more pronounced in low-light conditions [Milford et al., 2013]. One approach to recognition in such difficult conditions is to consider a sequence of images [Milford and Wyeth, 2012; Pepperell et al., 2014a], in which the probability of a successful match is increased only if the majority of consecutive frames all have some partial similarity. However, sequence-based techniques must be extended to account.

(2) for route traversal at different rates which occur, for example, due to stopping at traffic lights or due to different walking speeds. In order to develop a vision-only route navigation system, we leverage the power of sequence matching and incorporate a speed signal from the camera to account for speed variation throughout sequences. Previously the SMART place recognition system [Pepperell et al., 2014a] did so using the wheel-speed signal from the cars on-board diagnostics (OBD) to account for route traversal at different rates. The performance was improved over previous methods, but was not a vision-only system and therefore could not be used on, for example, hand-held devices. Simply replacing the OBD functionality with an off-theshelf VO system is not a simple task; most state-of-theart VO algorithms have only been demonstrated successfully on high-resolution, relatively well-illuminated imagery. However, the advantage of SMART for this problem is that it does not necessarily need a full six degree of freedom (6-DOF) solution but only requires a consistent forward velocity signal (as we assume the route is identical between traversals). We modify the open-source LIBVISO2 library [Geiger et al., 2011] for our sequence-based approach and use it to generate a speed signal to normalise the frame rate of route traversals. In the paper we present the use of image normalisation, patch-based feature matching and nonholonomic motion constraints together with their effects on the consistency of the speed signal. The image normalisation and patch-based features have been used previously in [Milford et al., 2014a], which acheived strong results in day-to-night visual recognition. We therefore evaluate their applicability for day-night motion calculation. In total six variations of VO are processed from which the absolute translational speed signal is extracted from the sophisticated 6-DOF estimation of LIBVISO2. Our approach is therefore vastly different to patch-based odometry previously published [Milford et al., 2014b]. Finally the result is combined with sequence-based place recognition for vision-only SMART. The paper reads as following: Section 2 describes related work, Section 3 describes the system overview and the algorithms, Section 4 describes the datasets on which the algorithms were tested, and Section 5 presents the results, which are discussed in Section 6.. 2. Background. Currently the most common base-line appearance-based place recognition system is FAB-MAP [Cummins and Newman, 2011]. Despite impressive performance on day-time datasets, it has been shown to have poor performance under extreme condition change [Milford and Wyeth, 2012] because the underlying feature detection and representation often fails when lighting drastically changes. Some solutions have altered the feature-type or matching algorithm [Neubert et al., 2014; McManus et al., 2013], however simple image comparisons have been shown to work well when emphasis is placed on the sequence of image matches [Milford and Wyeth, 2008; Milford and Wyeth, 2012; Sünderhauf et al., 2013b; Pepperell et al., 2013], rather than a single match of high. certainty. In particular, SeqSLAM [Milford and Wyeth, 2012] has shown surprising performance on imagery taken during a sunny day and a stormy night. Extending this method, SMART [Pepperell et al., 2013; Pepperell et al., 2014a] normalised the linear sequence searching algorithm with wheel speed information to account for differing sequence rates. The only other method to do something similar was [Lynen et al., 2014] who explicitly accounting for varying inter-frame distances. Both these methods relied on specific hardware making the method not completely visual and not useable on alternate hardware, for example on hand-held devices or head-mounted cameras. Standard monocular visual odometry (VO) is performed by calculating the change in camera position between consecutive frames by first calculating the point correspondences between two images [Hartley and Zisserman, 2003]. The open-source LIBVISO2 [Geiger et al., 2011] library is the culmination of multi-view geometry theory and practical application to provide an open-source solution to 6-DOF monocular VO. The out-of-the-box version of LIBVISO2 uses SIFT [Lowe, 2004] to calculate image correspondence and has been successful in several applications using high-resolution, crisp, day-time imagery [Lenz et al., 2011; Scherer et al., 2012; Geiger et al., 2012]. VO at night is a sparsely investigated area of research: [Milford et al., 2014b] has used a sum-of-absolutedifference (SAD) matching on single sub-image focused on the road segment in-front of the vehicle, [Mcmanus et al., 2014] has adapted traditional VO techniques to scanning LIDARs for low-light teach and repeat experiments, and [Krajnik et al., 2013] has performed path repetition between day and night following maps of SURF keypoints that were generated in good conditions. In the last, the translational robot speed was still measured using wheel encoders. Finally [Meilland et al., 2011] has shown dense tracking methods also have potential in conditions with changing lighting, at least over relatively short distances. This paper further investigates the problem of visual odometry at night for use in a place recognition system that can operate in changing lighting conditions and in environments that aren’t necessarily always traversed at the same rate. We take an orthogonal approach in that only the velocity magnitude is required for sequence-matching rather than an accurate global 6-DOF map, however we still need to calculate the motion from vision alone. To do so we use the LIBVISO2 toolbox but investigate the impact on performance when using motion constraints and replace SIFT with patch-based features. The proposed patch-based method is much different from previous simplistic velocity estimates [Milford et al., 2014b] that used only a single patch to estimate rotation and translation, as instead an array of patches placed over the full image provides the input to the camera pose estimate. Finally we present a fully vision-based augmentation of the SMART system [Pepperell et al., 2014a], which overcomes current limitations of sequence-based systems.. 3. Algorithm. The algorithm consists of two processes: visual speed calculation using LIBVISO2 and sequence-based place.

(3) Imaget-n. Loop. Extract Features. Feature Matching. Extract Features. Intrinsic Camera Parameters. 8-point RANSAC. Convert Points from 2D to 3D. Motion Constraints. Calculate Rotation and Translation. (a). (b). Final Rotation and Translation. Imaget. Accumulated Distance Since Last Image Addition > Constant SMART Distance. List of SMART Images (Night) Add Imaget. Low-res SAD Comparison: Sequence Length 30. Sliding Window List of SMART Frames (Day). Figure 2: An overview of the VO and place recognition system. Highlighted boxes indicate components of the original LIBVISO2 algorithm modified to be used with the patch-based method and motion constraints. The SMART place recognition algorithm normalises frame rate between two dataset traverses with varied speeds based on the VO output. Note: we use the inherent ground plane assumption of LIBVISO2 to calculate absolute scale in monocular VO. recognition using SMART. We describe the implementation of the proposed augmentations to feature tracking for low-light conditions: image processing, patch-based matching and the implementation of motion constraints. We also state the parameters used with SMART.. 3.1. Figure 3: An example (b) of the results of image normalisation. A 12×12p patch was used, the size of which is indicated by the white rectangle in (a).. Summary of Data-flow in Standard Monocular Odometry and SMART. The system overview shown in Figure 2 is split into VO (from LIBVISO2) and place recognition (from SMART) systems with alterations for experimentation highlighted by the shaded boxes. VO is performed by extracting features, either point-based SIFT features or patch-based features, from two images, finding feature matches between the two images, and calculating the Fundamental Matrix using standard 8-point RANSAC. The Fundamental Matrix with the most inliers is converted to the Essential Matrix using the camera calibration parameters. The final rotation and translation is calculated from the Essential matrix, with checks for validity and estimation of scale using a ground plane assumption (standard from the LIBVISO2 library). The speed signal is extracted from the magnitude of the change in position from the motion estimate, i.e. the Euclidean of the translation vector. Place recognition is performed by first forming a list of. locations separated by a constant distance, as calculated from the visual speed signal. Image comparison is performed between two sets of distance-normalised frames using a low-resolution sum-of-absolute-difference (SAD) whole image metric and the final image matching is accepted only if a sequence of images, prior to the current, all achieve a matching score above a threshold (see [Pepperell et al., 2014a] for details on the SMART system). The exact implementation of the patch-based normalisation and matching used in this paper differs from [Milford et al., 2014a] and for completeness is described also in this paper, where necessary. For more details refer to [Milford et al., 2014a]. We also describe a method for applying motion constraints in the LIBVISO2 system by ensuring the final rotation and translation is possible given a nonholonomic vehicle (inspired by [Scaramuzza et al., 2011]).. 3.2. Image Normalisation. Image normalisation (see Figure 3) is a pre-processing step that achieves some invariance to lighting by normalising the contrast at a local level in the image, and is performed before feature extraction and matching. The normalised image I 0 is calculated for each pixel location x, y: ( I −µ xy xy σxy > σT 0 σxy (1) Ixy = random σxy ≤ σT where µxy and σxy are the mean and standard deviation of the pixel intensity of the original image I in a square region that (x, y) is centred within, and σT = 0.1 is a threshold on the variance within the region. The algorithm differs from the previous implementation (see [Milford et al., 2014a]) due to the possibility of pixel randomisation. However, this step greatly reduces the chance of any patches matching within low variance (i.e. bland) areas of the image because the normalised image consists of random noise which varies between consecutive images, making patch matching improbable. A similar effect was previously achieved with a pixel mask.. 3.3. Patch Correspondence. Calculating patch correspondences replaces the SIFT feature tracking used in LIBVISO2 and produces a list of points shifted between the image at time t − n and time t. Patch locations are uniform across the image t such that.

(4)

(5) θ

(6)

(7) − φ

(8) < 3◦ 2. (3). |β | < 0.5◦. (4). ◦. |γ| < 1.5. (5). ◦. (6). (a). |δ | < 3. where β , γ, φ , δ , and θ are the roll, pitch, yaw, angle of elevation and angle of azimuth respectively. The motion constraints were combined with LIBVISO2 by calculating the rotation and translation within the RANSAC loop as indicated by the extended dashed box in Figure 2. Motion was discarded when outside the constraint bounds resulting in RANSAC finding the motion with the highest number of inliers that complies with the constraints.. (b). (c). (d). (e). 3.5 Figure 4: Frame to frame motion is calculated using uniformly spaced patches on the normalised images by (a) searching a set area in the frame at time t − n, which is defined by (b) a set patch size and position in the frame at time t. (c) The heat plot shows the SAD score as the patch at time t is shifted within the boundaries of the patch at time t − 1. The resulting patch match between time t (d) and time t − n (e) can occur with difficult-to-match textures. a regular grid of size 20×12 covers the image with a 50% overlap of patches in both axes (Figure 4(b)). SAD values are compared between each patch and the image at time t − n in a constrained region 80 pixels either side of the patch and 10 pixels both above and below the patch, with the centre of the region equal to the centre of the patch at time t (Figure 4(a)). The resulting SAD calculations form a two dimensional landscape of size 160×20 (see Figure 4(c)) from which a quality ratio grat is calculated as: grat. g2 = g1. (2). where g1 is the lowest value SAD score in the landscape and g2 is the second lowest SAD score that is at least 2 pixels away from the position of g1 . Patches with a grat < 1.04325 (as used in [Milford et al., 2014a]) are discarded ensuring only patches which have a clearly unique shift position are used to calculate motion. The point correspondences which are used by LIBVISO2 to calculate the fundamental matrix are considered as the centre of all patches at time t to their corresponding shifted position at time t − n.. 3.4. Non-holonomic Constraints. The non-holonomic constraints use the same motion model as described in [Scaramuzza et al., 2011], namely that the angle of azimuth of the translation should be half the yaw angle. However, it was found that the identical application of the constraints often failed in low-light conditions due to overly tight bounds; in many cases no image transform could be found. As such, the constraints were relaxed and simplified:. Sequence Reset. Error accumulates more quickly in VO if smaller motions are calculated more often; to prevent this LIBVISO2 only accepts motion over a certain threshold. In such cases, the image at the next time step is compared to the image at time t − n, where n is the number of frames since acceptable motion. Due to the difficult nature of the night datasets it was often found there would repeatedly be too few inliers. In cases of large values of n, the image at time t shared no visual features with the image at time t − n, and it was not possible for the system to recover. To counteract the problem, n was reset to 0 upon reaching a threshold (of 4). By resetting n small portions of motion could be lost from the overall path; however the system was able to recover from the complete failure. In constrast to global route matching techniques, local resets are acceptable because SMART only requires local consistency and a reset will only affect performance within a small section of the route.. 3.6. SMART Implementation. For full details and parameters used in the SMART algorithm, see [Pepperell et al., 2013]. The maximum horizontal image comparison offset was increased to 4 pixels on the Porto Antico dataset to account for larger viewpoint variation. The speed-normalised image sequences were searched at an angle of 45◦ with a sequence length equivalent to 30 m (30 frames for Surfers and 100 frames for Porto Antico). Sky blackening was also used.. 4. Experiments and Datasets. The vision-only sequence-based place recognition system was evaluated on three datasets. The day-time only Kitti dataset was used to verify the novel feature methods were comparable to SIFT under ideal conditions, with a well characterised camera. Place recognition experiments between day and night were performed using the Surfers and Porto Antico datasets, examples of the poor imagery at night can be seen in Figure 3 and Figure 1 respectively. Importantly, the Porto Antico dataset does not have wheel odometry and therefore only visual information is available for SMART. Night-time imagery is challenging for both frame correspondence and VO (see Figure 6). Both these datasets had speed variation as well as.

(9) Antico Path Integration. 150 100 50 0. (a) -50 -100 -150 100 0 -100. -100. -150. (b). 7000. (d). Ground Truth Example Sequence Search. 12000. 10000. Frame Number (Night). 5000. 4000. 3000. 2000. complete stops as seen by the non-linear/changing slopes in Figure 7. VO was computed using: SIFT, SIFT with patch-normalised images, and patch-based features; all algorithms were also evaluated under non-holonomic constraints. Intrinsic camera parameters were estimated using calibration software where necessary.. Kitti Odometry. The Kitti VO dataset [Geiger et al., 2013] (only dataset 0) consists of 3.7 km of suburban street driving in Karlsruhe, Germany filmed with a high-resolution video camera (down-sampled to 620×188p) mounted to a station wagon. The dataset is day-time only (see Figure 5(a)), has accurate intrinsic camera parameters, and has a full pose ground truth (see Figure 5(b)). A ground-truth speed signal was generated by differentiating the ground truth pose.. Surfers. The Surfers dataset [Pepperell et al., 2014a] consists of 3.5 km of urban roads in Surfers Paradise, Australia recorded with a GoPro Hero3 (down-sampled to 480×200p) mounted on the inside dashboard of an unmodified car (see Figure 5(d)). The dataset has both day-time and night-time traverses and has speed ground truth taken from the OBD.. Porto Antico. The Porto Antico dataset is approximately 1 km of walking in Genova, Italy (Figure 5(c)), taken with a handheld iPhone 5 (down-sampled to 480×270p) with only visual feedback for stabilisation. The camera always. 8000. 6000. 4000. 2000. 1000. 0 0. 4.3. 100. Surfers. 0. 4.2. 50. 14000. 6000. Frame Number (Night). Figure 5: Three datasets were used for experimentation. An example image (a) and the ground truth patch (b) of the vehicle-based Kitti odometry dataset, the route taken (c) on the Porto Antico walking dataset (see Figure 1 for example imagery), and the route taken (d) on the Surfers dataset (see Figure 3 for example imagery).. 4.1. 0. Figure 6: The night-time conditions make it difficult for the standard 6-DOF LIBVISO2 algorithm (Unconstrained SIFT) resulting in extremely poor path integration on the Porto Antico dataset, i.e. compared to Figure 5(c). Porto Antico. (c). -50. 2000. 4000 6000 Frame Number (Day) (a). 8000. 0. 5000 10000 Frame Number (Day) (b). 15000. Figure 7: The ground truth (blue dotted) image correspondence between day and night traversals of (a) Porto Antico and (b) Surfers. The slope change and non-linear sections make the datasets challenging for naive sequencebased place recognition techniques. However it is important to note that the for SMART the sequence only has to be locally linear (exemplified by the solid black lines) and a single VO failure doesn’t cause catastrophic failure (as opposed to techniques that require global accuracy, such as [Lynen et al., 2014]). Note: for clarity of the illustration the local linear examples are depicted 10× larger than actually used. pointed forward but with unavoidable variation in position. The dataset has day-time and night-time traverses, but no ground truth.. 5. Results. We present a validation of the patch-based method on the Kitti dataset before looking at the effect of the image processing techniques on the linearisation of the frame traversal rate for Surfers and Porto Antico. Finally place recognition performance is presented for the latter two datasets.. 5.1. Speed-signal Validation. The LIBVISO2 algorithm demonstrated typical results on the Kitti dataset, taking into account the low resolution imagery, irrespective of the differing image pre-processing and features, as seen by the path integration somewhat.

(10) Odometry Veri-cation. Surfers SUnc'd Ex. Seq MSE error. Night Distance (m). Path Integration (m). 8000. PN-hN'd Ex. Seq MSE error. 6000. 4000. 2000. 0. Speed Measurement (m/s). 0. 2000. 4000 6000 Day Distance (m). 8000. 0. 2000. 4000 6000 Day Distance (m). 8000. (a) Porto Antico 1200 1000. Uncon'd Error: 0.0524 Non-hol Error: 0.0537 SIFT. Uncon'd Error: 0.0514 Non-hol Error: 0.0555. SN-h Ex. Seq MSE Histogram. SUnc'dN'd Ex. Seq MSE Histogram. SN-hN'd Ex. Seq MSE Histogram. PUn'dN'd Ex. Seq MSE Histogram. PN-hN'd Ex. Seq MSE Histogram. 800. Uncon'd Error: 0.0446 Non-hol Error: 0.0450. SIFT-normalised. SUnc'd Ex. Seq MSE Histogram. 600. Patches-normalised. 400 200 0 1200 1000. Night Distance (m). Figure 8: The integrated path (solid lines) of the nonholonomic versions of the VO processed on the Kitti dataset should follow the ground truth (red dotted line). The visual speed signal matches the acceleration of the ground truth, with the patch-based method having the smallest overall mean-square error. The first 2000 frames of error are shown, indicated by the square on the map.. 800 600 400 200 0 1200 1000. tracing the ground truth in Figure 8. However, accumulated errors in the integration are irrelevant to the speed signal (which is the critical output required for sequencebased place recognition) as it must only be locally repeatable. The speed signal qualitatively follows the ground truth speed indicating the potential for an accurate visual speed signal using the proposed methods. The proposed patch-based features had a smaller error than SIFT implementations. The non-holonomic constraints did not significantly improve the speed signal error.. 5.2. Frame Sequence Linearisation. The velocity estimate along a dataset traversal is used to linearise the frame sequence such that a sequence-based place recognition system can perform a successful 45◦ search in frame-space. By taking into account the distance along the trajectory that each frame was taken we evaluate how well each VO method linearised the ground-truth frame association. On the Surfers dataset we found that all variations of SIFT and Patches, with normalisation and/or motion constraints, performed comparably, two of which are shown in Figure 9(a). The MSE histogram is displayed such that the column on the right of the figure indicates the count of highly linear (low MSE) segments with bars further on the left indicating less-linear segments. The patchbased method produced a slightly higher count of lesslinear segments however the difference is not significant. SIFT showed adequate feature tracking without any visual augmentation despite the night-time conditions. The Porto Antico dataset offered even more challenging conditions on which the different methods produced varying results, see Figure 9(b). Qualitatively the SIFT-based datasets (top four plots) match more closely to the example linear segments compared to the patch-based methods (bottom two plots). The MSE also indicates patches pro-. 800 600 400 200 0 0. 500 Day Distance (m). 1000. 0. 500 Day Distance (m). 1000. (b) Figure 9: The linearised (a) Surfers dataset using unconstrained SIFT and for constrained Patches (for brevity) in which search sequences (solid black) almost completely overlap the linearised frame sequence (blue dots) resulting in a strong potential for place recognition, and (b) Porto Antico dataset with all VO methods. The MSE histogram (solid red) is a count of linear 30 m long sequence segments; bar density further to the right indicates lower fitting error. Note: the example sequences shown are 10× larger than those actually used.. duced less linearisation as seen by the higher bars on the left-side of the histogram. Image normalisation (middle left) produced a higher count of highly linear segments compared to vanilla SIFT (top left), however the histogram tail was equally large. By far the biggest influence on the MSE was the non-holonomic constraints (right column), which reduced the tail of the histogram, i.e. the number of highly non-linear segments was reduced. We observed that a major component of the poor linearisation exhibited by the constrained patch-based method (bottom right of Figure 9(b)) is due to the absolute scale difference between day and night datasets, resulting in a search slope 6= 45◦ . In addition we see that the relative speed over the dataset is similar to the SIFT variant (Figure 10) but has a constant deficit, which also indicates an incorrect scaling factor..

(11) Porto Antico Night-time Speed. Surfers. 0.45. 100 SIFT Unc'd Patch N-hN'd. 0.4. 90 80 70. 0.3. Precision (%). Distance / Frame (m). 0.35. 0.25 0.2. 60 No Odometry SIFT Uncon'd SIFT Uncon'd Norm'd Patches Uncon'd Norm'd SIFT Non-hol SIFT Non-hol Norm'd Patches Non-hol Norm'd OBD. 50 40 30. 0.15. 20. 0.1. 10 0. 0.05. 0. 10. 20. 30. 40. 50 Recall (%). SUn'd 61.2% 78.2%. SUn'dN'd PUn'dN'd 64.2% 69.4% 79.7% 75.1%. 60. 70. 80. 90. 100. 0 0. 1000. 2000. 3000 4000 5000 Frame Number. 6000. 7000. 8000. Figure 10: A comparison between SIFT and patch methods across the night-time Porto Antico dataset indicates a similar motion estimate, however the patch-based method is consistently underestimating the scale. For presentation clarity the speed signal has been filtered with a moving average.. 5.3. Recall at: 99% Prec 95% Prec. None 20.4% 25.1%. SN-h 64.5% 75.3%. SN-hN'd 66.6% 76.1%. PN-hN'd 66.3% 73.9%. OBD 91.3% 94.2%. Figure 11: Place recognition performance of all six VO algorithms, the ground truth OBD signal and the noodometry case on the Surfers dataset. The theoretical ideal is the top right corner achieving 100% recall at 100% precision, however in practice, even with very accurate wheel odometry, just over 90% recall can be achieved.. Condition and Speed Invariant Place Recognition. Porto Antico 100 90 80 70. Precision (%). The final performance of vision-only SMART is evaluated using the low-resolution SAD score, instead of the ground truth that was used above to evaluate linearisation. An example of an image comparison difference matrix on which sequence searching is performed can be see in Figure 1. We evaluate the precision and recall achieved as the matching threshold is varied. The baseline performance of SMART is measured assuming the frame rate is proportional to the speed and identical for both day and night. On average the speed signal improves the recall at 99% precision by a factor of 3 from the baseline 20% to 65% compared to a maximum possible recall using the OBD of 91% (see Figure 11). We therefore confirm the successful implementation of a vision-only place recognition system capable of dealing with drastic changes in lighting and large variation in traversal speed. However, as reflected by the linearisation results, the difference between the various VO methods was not significant, indicating the possibility of using SIFT for night-time VO. The Porto Antico dataset was challenging for visual speed generation due to the hand-held collection method introducing shake, blur, inconsistent height and inconsistent focal point. Some pose variation was also present between day and night that introduced further challenges for image comparison. Without an odometry signal the system could not solve the dataset (i.e. maximum precision was only 84% with 20% recall, see Figure 12). Once again, introducing VO of any type resulted in a significant improvement in precision-recall performance. The nonholonomic SIFT methods achieved over 30% recall at 99% precision, while SIFT on its own was only able to achieve 9%. The image normalisation improved standard SIFT to almost 17% precision but normalisation benefits were not additive with improvements from the non-holonomic constraints. Reflecting the linearisation results, patch-based methods under performed compared to SIFT.. 60 50 No Odometry SIFT Uncon'd SIFT Uncon'd Norm'd Patches Uncon'd Norm'd SIFT Non-hol SIFT Non-hol Norm'd Patches Non-hol Norm'd. 40 30 20 10 0 0. Recall at: 99% Prec 95% Prec. 10. None -% -%. 20. SUn'd 8.91% 9.04%. 30. 40. 50 Recall (%). SUn'dN'd PUn'dN'd 16.8% 13.7% 32.1% 15.7%. 60. SN-h 30.5% 47.9%. 70. SN-hN'd 30.0% 44.1%. 80. 90. 100. PN-hN'd 19.0% 21.1%. Figure 12: Place recognition precision-recall for the Porto Antico dataset.. 6. Discussion and Future Work. The first surprising result is the fact that the patch-based method did not perform as well as SIFT for the Antico dataset (Figure 12) while showing equal performance on the Kitti and Surfers dataset. We proposed that part of the issue was due to an absolute scale factor difference. The scale calculation relies on a certain number of tracked features being detected to form a ground plane. We believe that due to a subtle interaction between the dataset, estimation of height and tilt values, and fixed locations for patchsearch, that the ground plane was incorrectly estimated and therefore the scale also. Systems-interaction analysis and improvement would therefore improve the scale accuracy up to SIFT for the Antico dataset. The improvement on the Antico dataset due to nonholonomic constraints, despite the fact the dataset was not recorded with a non-holonomic vehicle, can be explained as the non-holonomic constraints behaved as a standard filter for the constant small up/down motions. Due to the limited requirements of the VO method we are mostly only interested in the component of forward motion and without.

(12) constraints, small motions contributed to an overestimation of the motion in some sections of the dataset; as indicated by the longer overall path lengths for unconstrained motion in Figure 9(b). For the (easier) Kitti and Surfers datasets, the image normalisation procedure did not significantly boost the performance of SIFT, however in the more difficult Porto Antico dataset performance was significantly improved. If SIFT is used in the furture, it would be worthwhile to more thoroughly analyse the affect of normalisation when the visual data is extremely poor. We demonstrated that patches from place-recognition tasks could be successfully adapted to the image motion problem, but found that even in poor lighting conditions SIFT performed equally well (or better). This result is perhaps not surprising when considering the appearance change between two frames is far less in a VO context, in comparison to the large time intervals in a place recognition application. When considering the (completely unoptimised) computation time of patches was larger than the (fully-optimised) SIFT, the results would suggest that SIFT is the better option. We propose that, while somewhat unfruitful, the experiments and analysis of adapting the lighting invariant patch-based techniques to the night-time VO problem is still a useful contribution to the robotics community. The major contribution of this paper is the validation of a vision-only sequence-based place recognition method that can be successfully applied even in conditions in which a traditional application of VO fails (recall Figure 6). The reason for the success is the simple forward motion requirement combined with the fact that VO only need be accurate for local regions of the route. Small sections of poorly calculated VO does not catastrophically effect overall performance. The performance of the sequence-based place recognition drastically improved over the no-odometry case and we believe this result is a significant step forward addressing several of the shortcomings of current sequence-based place recognition techniques. We demonstrated the vision-only system on two particularly difficult datasets with both day-to-night lighting change and multiple changes in speed, including complete stops. We now have a purely visual sequence-based place recognition system which accounts for speed variations while retaining all of the other benefits of sequence-based techniques [Naseer et al., 2014; Johns and Yang, 2013; Sunderhauf et al., 2013a]. However the application of condition-invariant patches for VO could be further explored to produce a more consistent scale. The application of a more general filtering approach to replace nonholonomic constraints might also further improve handheld datasets. Such an approach is only viable as the highly. accurate 6-DOF motion is not required. We can also incorporate the VO with recent extensions to SMART matching across multiple-lane roads [Pepperell et al., 2014b; Pepperell et al., 2015]. In addition the evaluation of a direct VO approach similar to [Meilland et al., 2011] would be a beneficial study to further evaluate VO techniques for night-time conditions. The integration of multi-view geometry and sequence-based place recognition also allows us to explore solutions to pose variation by estimating the local lateral motion.. Acknowledgments This research was supported by an Australian Research Council Future Fellowship FT140101229, Discovery Project DP110103006 and DECRA DE120100995.. References [Cummins and Newman, 2011] Mark Cummins and Paul Newman. Appearance-only SLAM at large scale with FAB-MAP 2.0. The International Journal of Robotics Research, 30(9):1100–1123, 2011. [Geiger et al., 2011] Andreas Geiger, Julius Ziegler, and Christoph Stiller. StereoScan: Dense 3d reconstruction in real-time. 2011 IEEE Intelligent Vehicles Symposium (IV), pages 963–968, June 2011. [Geiger et al., 2012] Andreas Geiger, Martin Lauer, Frank Moosmann, Benjamin Ranft, Holger Rapp, Christoph Stiller, and Julius Ziegler. Team AnnieWAY’s Entry to the 2011 Grand Cooperative Driving Challenge. IEEE Transactions on Intelligent Transportation Systems, 13(3):1008–1017, September 2012. [Geiger et al., 2013] Andreas Geiger, Philip Lenz, Christoph Stiller, and Raquel Urtasun. Vision meets Robotics: The KITTI Dataset. The International Journal of Robotics Research, 32(October 2011):1–6, 2013. [Hartley and Zisserman, 2003] Richard Hartley and Andrew Zisserman. Multiple view geometry in computer vision. Cambridge university press, 2003. [Johns and Yang, 2013] E Johns and G Z Yang. Feature Co-occurrence Maps: Appearance-based Localisation Throughout the Day. In International Conference on Robotics and Automation, Karlsruhe, Germany, 2013. IEEE. [Krajnik et al., 2013] Tomas Krajnik, Sol Pedre, and Libor Preucil. Monocular Navigation for Long-Term Autonomy. Internaltional Conference on Advanced Robotics, 2013. [Lenz et al., 2011] Philip Lenz, Julius Ziegler, Andreas Geiger, and Martin Roser. Sparse scene flow segmen-.

(13) tation for moving object detection in urban environments. In 2011 IEEE Intelligent Vehicles Symposium (IV), number Iv, pages 926–932. Ieee, June 2011. [Lowe, 2004] D Lowe. Distinctive image features from scale-invariant keypoints. International journal of computer vision, 60(2):91–110, 2004. [Lynen et al., 2014] Simon Lynen, Michael Bosse, Paul Furgale, and Roland Siegwart. Placeless PlaceRecognition. In International Conference on 3D Vision, 2014. [McManus et al., 2013] Colin McManus, Paul Furgale, and Timothy D. Barfoot. Towards lighting-invariant visual navigation: An appearance-based approach using scanning laser-rangefinders. Robotics and Autonomous Systems, 61(8):836–852, August 2013. [Mcmanus et al., 2014] Colin Mcmanus, Ben Upcroft, and Paul Newman. Scene Signatures: Localised and Point-less Features for Localisation. In Robotics: Science and Systems, Berkeley, USA, 2014. [Meilland et al., 2011] M Meilland, A Comport, and Patrick Rives. Real-time dense visual tracking under large lighting variations. In British Machine Vision Conference, 2011. [Milford and Wyeth, 2008] M Milford and G Wyeth. Mapping a suburb with a single camera using a biologically inspired SLAM system. IEEE Transactions on Robotics, 24(5):1038–1053, 2008. [Milford and Wyeth, 2012] M Milford and G Wyeth. SeqSLAM: Visual Route-Based Navigation for Sunny Summer Days and Stormy Winter Nights. In IEEE International Conference on Robotics and Automation, St Paul, United States, 2012. IEEE. [Milford et al., 2013] Michael Milford, Ian Turner, and Peter Corke. Long exposure localization in darkness using consumer cameras. In The IEEE International Conference on Robotics and Automation. IEEE, 2013. [Milford et al., 2014a] Michael Milford, Walter Scheirer, Eleonora Vig, Arren Glover, Oliver Baumann, Jason Mattingley, and David Cox. Condition-Invariant, TopDown Visual Place Recognition. In The IEEE International Conference on Robotics and Automation, 2014. [Milford et al., 2014b] Michael Milford, Eleonora Vig, Walter Scheirer, and David Cox. Vision-based Simultaneous Localization and Mapping in Changing Outdoor Environments. Journal of Field Robotics, 31(5):780– 802, September 2014. [Naseer et al., 2014] Tayyab Naseer, Luciano Spinello, W Burgard, and Cyrill Stachniss. Robust Visual Robot Localization Across Seasons using Network Flows. In Conference on Artificial Intelligence, 2014.. [Neubert et al., 2014] Peer Neubert, Niko Sünderhauf, and Peter Protzel. Superpixel-based appearance change prediction for long-term navigation across seasons. Robotics and Autonomous Systems, August 2014. [Pepperell et al., 2013] Edward Pepperell, Peter I Corke, and Michael J Milford. Towards Persistent Visual Navigation using SMART. In Australasian Conference on Robotics and Automation (ACRA), Sydney, Australia, 2013. ARAA. [Pepperell et al., 2014a] Edward Pepperell, Peter I Corke, and Michael J Milford. All-Environment Visual Place Recognition with SMART. In IEEE Conference on Robotics and Automation, pages 1612–1618, Hong Kong, China, 2014. [Pepperell et al., 2014b] Edward Pepperell, Peter I Corke, and Michael J Milford. Towards Vision-Based Pose- and Condition-Invariant Place Recognition along Routes. In Proceedings of the Australasian Conference on Robotics and Automation (ACRA), Melbourne, Australia, 2014. ARAA. [Pepperell et al., 2015] Edward Pepperell, Peter Ian Corke, and Michael John Milford. Automatic Image Scaling for Place Recognition in Changing Environments. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), Seattle, United States, 2015. IEEE. [Scaramuzza et al., 2011] Davide Scaramuzza, Andrea Censi, and Kostas Daniilidis. Exploiting motion priors in visual odometry for vehicle-mounted cameras with non-holonomic constraints. In 2011 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 4469–4476, San Francisco, CA, USA, September 2011. Ieee. [Scherer et al., 2012] Sebastian Scherer, Joern Rehder, Supreeth Achar, Hugh Cover, Andrew Chambers, Stephen Nuske, and Sanjiv Singh. River mapping from a flying robot: state estimation, river detection, and obstacle mapping. Autonomous Robots, 33(1-2):189–214, April 2012. [Sunderhauf et al., 2013a] N Sunderhauf, P Neubert, and P Protzel. Predicting the Change A Step Towards LifeLong Operation in Everyday Environments. In Robotics Challenges and Vision Workshop at Robotics Science and Systems, Berlin, 2013. [Sünderhauf et al., 2013b] Niko Sünderhauf, Peer Neubert, and Peter Protzel. Are We There Yet? Challenging SeqSLAM on a 3000 km Journey Across All Four Seasons. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2013..

(14)

No results found