
Chapter 2 Literature Review and Background

2.3. Hardware, software supports and datasets for evaluation

2.3.3. CV datasets for evaluation

The proposed vision-based AR system consists of several CV-based technologies, as set out in Section 2.2. Many image databases exist for various CV research problems, and each of these problems requires particular evaluation metrics for assessing the performance of the applied algorithms. Some of these databases are categorised and archived online for public use (e.g. CV Datasets on the web [17] and CVonline: Image Databases [18]). In this section, the datasets for evaluating 3D reconstruction/mapping and loop closure detection methods are reviewed.

[17] CV Datasets on the web: http://www.cvpapers.com/datasets.html


3D reconstruction / mapping

Two kinds of 3D reconstruction or localisation and mapping methods are available in this proposed work for learning a specific target environment: one is SfM, which takes a set of RGB images as input; the other is RGBD-based SLAM, which takes RGBD data as input. 3D reconstruction and 3D localisation/mapping are not the same task. Visual SLAM applications, as the name implies, focus on learning the environment from the obtained visual information and locating the sensor with respect to the map they have built. SfM-based applications, meanwhile, focus more on 3D reconstruction, which estimates 3D geometric information from the images in order to create a virtual 3D model, either a meshed model or a point cloud. Thus the accuracy of data produced by CV-based methods for 3D reconstruction is generally evaluated by comparing the created models against the ground truth. Schöning & Heidemann (2015) state that the ground truth data in most benchmarks or evaluations of multi-image 3D reconstruction is acquired by traditional terrestrial 3D laser scanners and light detection and ranging (LIDAR) systems, such as Zoller + Fröhlich’s IMAGER 5003 laser scanner in Strecha et al. (2008), Zoller + Fröhlich’s IMAGER 5006h and IMAGER 5010 terrestrial laser scanners in Kersten & Lindstaedt (2012) and the ATOS Compact Scan 2M 3D scanner in Mousavi et al. (2015). Further, Schöning & Heidemann (2015)’s benchmark imposes two requirements: 1) it includes real scene photographs as well as photographs taken in a controlled indoor environment; 2) a ground truth is available. They examined several multi-view datasets and finally chose the datasets fountain-P11 and Herz-Jesu-P8 (with integrated LIDAR 3D triangle meshes as ground truth, as shown in Figure 2-20) and [...] ([...] Geometry Group, 2004) for a controlled indoor environment. Schöning & Heidemann (2015) then use an iterative closest point algorithm (Besl & McKay, 1992) to align and register the model with the ground truth. The minimal distance from every point of the registered ground truth model to any triangular face of the reconstructed mesh is computed. The mean and the standard deviation of all these distances are used to compare accuracy between different reconstruction methods, and the computation time is also considered.
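The distance statistics described above can be illustrated with a minimal sketch. It is not Schöning & Heidemann's implementation: it assumes the two models are already registered (the ICP step is skipped), and it approximates the point-to-triangle distance by the distance to the nearest sampled point of the reconstructed model. The function names and the toy coordinates are hypothetical.

```python
import numpy as np

def nearest_distances(ground_truth_pts, model_pts):
    """For each ground-truth point, return the distance to the closest model point.

    Simplification (assumption): the reconstructed surface is represented by
    sampled points rather than triangular faces, and both clouds are assumed
    to be aligned already.
    """
    # Pairwise differences, shape (n_gt, n_model, 3); suitable for small clouds only.
    diff = ground_truth_pts[:, None, :] - model_pts[None, :, :]
    dists = np.linalg.norm(diff, axis=2)
    return dists.min(axis=1)

def accuracy_summary(ground_truth_pts, model_pts):
    """Mean and standard deviation of the nearest distances."""
    d = nearest_distances(ground_truth_pts, model_pts)
    return float(d.mean()), float(d.std())

# Hypothetical toy data: three ground-truth points and a slightly perturbed model.
gt_pts = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
model_pts = np.array([[0.0, 0.0, 0.1], [1.0, 0.0, 0.0], [0.0, 1.0, 0.3]])
mean_d, std_d = accuracy_summary(gt_pts, model_pts)
print(f"mean distance = {mean_d:.4f}, standard deviation = {std_d:.4f}")
```

A lower mean indicates a reconstruction that lies closer to the ground truth on average, while the standard deviation indicates how evenly the error is spread over the model.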

Figure 2-20: Diffuse rendering of the integrated LIDAR 3-D triangle meshes for the datasets fountain-P11 (upper) and Herz-Jesu-P8 (lower) (Strecha et al., 2008).

However, a well-reconstructed meshed model is not necessary in the present thesis. The basic task of markerless AR tracking is closer to a SLAM problem, in which the accurate pose of the user viewport is required and what needs to be “reconstructed” is a reference map of the target environment consisting of both geometric information and recognisable visual features, i.e. the point cloud of keypoints. In this case, a meshed ground truth model does not meet the requirement. In fact, the visual information can only be extracted by CV methods, and it is hard to obtain so-called “real values” from other types of sensors as ground truth. However, as introduced in Section 2.2.3, both SfM and SLAM methods contain the processes of camera pose estimation and map creation with 3D point clouds (which are further developed into a dense model in SfM-based applications), and the resulting accuracies of these two processes are highly dependent on each other. Therefore the evaluation criterion designed for SLAM systems, which usually uses the associated camera pose of each image as ground truth, is considered instead; Strecha et al. (2008)’s dataset also provides the ground truth camera poses along with the model.

Since RGBD data can also be used as input in this proposal, Sturm et al. (2012)’s benchmark for the evaluation of RGBD SLAM systems is one of the options. It provides 39 RGBD image sequences of an office environment and an industrial hall, recorded with a Microsoft Kinect together with highly accurate, time-synchronised ground truth camera poses from a motion capture system. The authors state that this is the first RGBD dataset suitable for the evaluation of visual SLAM systems and propose two evaluation approaches: 1) evaluating the end-to-end performance of the whole system by comparing its output (map or trajectory) with the ground truth; 2) comparing the estimated camera motion against the true trajectory. The accuracy is then measured with the relative pose error and the absolute trajectory error. Assume P₁, ..., P_n ∈ SE(3) is the sequence of poses from the estimation and Q₁, ..., Q_n ∈ SE(3) is the sequence from the ground truth. The relative pose error at time step i is defined as

\[
\mathrm{RPE}_i := \left( Q_i^{-1} Q_{i+\Delta} \right)^{-1} \left( P_i^{-1} P_{i+\Delta} \right) \tag{2.13}
\]

where Δ is a fixed time interval. The absolute trajectory error at time step i is defined as

\[
\mathrm{ATE}_i := Q_i^{-1} S P_i \tag{2.14}
\]

where the rigid-body transformation S corresponds to the least-squares solution that maps the estimated trajectory P₁:n onto the ground truth trajectory Q₁:n. The errors over all time indices are then evaluated by computing the root mean squared error (RMSE), which gives more influence to outliers than the mean error because each error is squared before averaging.
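The two error definitions can be sketched with 4x4 homogeneous pose matrices. This is a minimal illustration, not the benchmark's own evaluation tool: the alignment S is passed in (identity by default, i.e. the trajectories are assumed to be pre-aligned) rather than estimated by least squares, the RMSE is taken over the translational part of each error matrix (an assumption made here for simplicity), and the helper names and toy trajectories are hypothetical.

```python
import numpy as np

def relative_pose_error(P, Q, i, delta=1):
    """RPE_i = (Q_i^-1 Q_{i+delta})^-1 (P_i^-1 P_{i+delta}) for 4x4 pose matrices."""
    gt_motion = np.linalg.inv(Q[i]) @ Q[i + delta]
    est_motion = np.linalg.inv(P[i]) @ P[i + delta]
    return np.linalg.inv(gt_motion) @ est_motion

def absolute_trajectory_error(P, Q, i, S=None):
    """ATE_i = Q_i^-1 S P_i; S defaults to identity (assumption: pre-aligned)."""
    S = np.eye(4) if S is None else S
    return np.linalg.inv(Q[i]) @ S @ P[i]

def translational_rmse(errors):
    """RMSE over the translation norms of a list of 4x4 error matrices."""
    norms = np.array([np.linalg.norm(E[:3, 3]) for E in errors])
    return float(np.sqrt(np.mean(norms ** 2)))

def translation(x, y, z):
    """Hypothetical helper: a pose with identity rotation and the given position."""
    T = np.eye(4)
    T[:3, 3] = [x, y, z]
    return T

# Hypothetical toy trajectories: the estimate jumps 0.1 too far in the first step.
Q = [translation(0, 0, 0), translation(1.0, 0, 0), translation(2.0, 0, 0)]
P = [translation(0, 0, 0), translation(1.1, 0, 0), translation(2.1, 0, 0)]

rpe = [relative_pose_error(P, Q, i) for i in range(len(P) - 1)]
ate = [absolute_trajectory_error(P, Q, i) for i in range(len(P))]
rpe_rmse = translational_rmse(rpe)
ate_rmse = translational_rmse(ate)
print(f"RPE RMSE = {rpe_rmse:.4f}, ATE RMSE = {ate_rmse:.4f}")
```

In this toy case the relative error appears only in the first step, where the drift occurs, whereas the absolute error persists at every later pose; this is the difference between measuring local motion accuracy and global trajectory consistency.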

An alternative benchmark for RGBD SLAM is given in Handa et al. (2014). Their dataset is collected from two different environments: a living room and an office room. As in Sturm et al. (2012), all RGBD image sequences are associated with a ground truth trajectory; in addition, the sequences from the living room scene have camera pose information associated with a 3D polygonal model. Thus, these sequences can be used to benchmark both camera trajectory estimation and 3D reconstruction. One of the latest surveys of RGBD datasets, Cai et al. (2017), compared Sturm et al. (2012)’s benchmark dataset with Handa et al. (2014)’s dataset, commenting that the latter “is more challenging and realistic since it covers large areas of office space and the camera motions are not restricted”.

Loop closure detection

Another CV-based key technique applied in the present work is visual loop closure detection. From a visual perspective, finding a loop closure can be expressed as deciding whether there is sufficient similarity between the current image and a map image (Liu & Zhang, 2013). In general, datasets collected for studying algorithm performance on navigation and mapping (i.e. SLAM) can also be used for loop closure detection, such as some sequences in Sturm et al. (2012)’s RGBD dataset (e.g. [freiburg1_room]). More specifically, several studies targeting loop closing have used SLAM datasets to evaluate their methods, as described below. Cummins & Newman (2008) tested their FAB-MAP system on their New College and City Centre image collections [19], which are composed of images of outdoor urban environments collected by a mobile robot, the corresponding coordinates of each image derived from interpolated GPS, the ground truth “mask” used for assessing the image-to-image correspondence matrix generated by the loop closing algorithm, an aerial photo for visualising results and camera calibration information. Another dataset [20] for the loop closing problem is provided in Angeli et al. (2008) as supplemental material; it includes an indoor image sequence with strong perceptual aliasing and a long outdoor image sequence. The dataset also contains the ground truth image-to-image correspondence matrix and camera calibration information. Glover et al. (2010) introduced an appearance-based SLAM system for multiple times of day and collected their dataset [21] from a selection of streets in the suburb of St. Lucia, with corresponding GPS data, for their experiments. The visual data were collected by traversing a route at five different times during the day to capture the difference in appearance between early morning and late afternoon. The route was traversed another five times two weeks later, for a total of ten datasets.

[19] FAB-MAP – dataset: http://www.robots.ox.ac.uk/~mobile/IJRR_2008_Dataset/data.html

[20] Cognitive Robotics at ENSTA :: Loop Closure Detection – dataset: http://cogrob.ensta-paristech.fr/loopclosure.html

[21] St Lucia Multiple Times of Day dataset – dataset:

As mentioned in Section 2.2.4, visual loop closure detection can be considered an image retrieval problem. Therefore the evaluation metrics designed for information retrieval systems are usually used for assessing loop closing algorithms. One of the most commonly used metrics is the precision-recall curve, which has been presented in several loop closing-related works for performance evaluation, such as Cummins & Newman (2008) and Liu & Zhang (2013). Precision is a measure of result relevancy, while recall is a measure of how many truly relevant results are returned (Pedregosa et al., 2011). They can be defined as the positive predictive value and the true positive rate respectively, as shown below:

\[
\mathrm{precision} = \frac{tp}{tp + fp} \tag{2.15}
\]

\[
\mathrm{recall} = \frac{tp}{tp + fn}
\]

where tp refers to true positives, fp to false positives and fn to false negatives, defined as follows:

                    Prediction: positive    Prediction: negative
Truth: positive     true positive (tp)      false negative (fn)
Truth: negative     false positive (fp)     true negative (tn)

Precision and recall are typically inversely related. A system with high recall but low precision returns many predicted results, but most of them are incorrect when compared to the truth; a system with high precision but low recall returns very few results, but most of them are correct when compared to the truth. Hence precision-recall curves can be used to find an appropriate trade-off between precision and recall, assisting in selecting algorithms for different requirements (e.g. high precision at lower recall or high recall at lower precision). An example of precision-recall curves is given in Pedregosa et al. (2011), as shown in Figure 2-21.

Figure 2-21: An example of multi-class precision-recall curves (Pedregosa et al., 2011).
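The trade-off traced by a precision-recall curve can be illustrated by sweeping a decision threshold over similarity scores. The sketch below is a plain-Python illustration, not the scikit-learn routine behind the figure; the scores, labels, thresholds and the function name are hypothetical, and returning a precision of 1.0 when nothing is predicted positive is a convention assumed here.

```python
def precision_recall_points(scores, labels, thresholds):
    """Return (threshold, precision, recall) triples, one per threshold.

    scores: similarity score of each candidate match.
    labels: True if the candidate is a real match according to the ground truth.
    A candidate is predicted positive when its score is >= the threshold.
    """
    points = []
    for t in thresholds:
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and not y)
        fn = sum(1 for s, y in zip(scores, labels) if s < t and y)
        # Convention (assumption): precision is 1.0 when no positive is predicted.
        precision = tp / (tp + fp) if (tp + fp) else 1.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        points.append((t, precision, recall))
    return points

# Hypothetical scores and ground-truth labels for five candidate matches.
scores = [0.9, 0.8, 0.6, 0.4, 0.2]
labels = [True, True, False, True, False]
curve = precision_recall_points(scores, labels, thresholds=[0.3, 0.5, 0.7])
for t, p, r in curve:
    print(f"threshold = {t:.1f}: precision = {p:.2f}, recall = {r:.2f}")
```

With these toy values, raising the threshold from 0.3 to 0.7 lifts precision from 0.75 to 1.00 while recall drops from 1.00 to 0.67, which is the trade-off the curve is meant to expose.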

In the loop closure detection problem, recall is the proportion of true loop closures (revisits to a mapped place) that the system detects, while precision is the proportion of detected loop closures that are real loop closures. Whether a loop closure occurs is generally judged by the similarity between images, and the decisions can be arranged in an image-to-image correspondence matrix. Entry (i,j) of this correspondence matrix is set to 1 if image i and image j were determined to be taken at the same place, or 0 otherwise. Thus the ground truth correspondence matrices provided in Angeli et al. (2008)’s and Cummins & Newman (2008)’s datasets can be used to inspect the performance of the algorithms.
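A comparison between a detected correspondence matrix and a ground truth one could be sketched as follows. This is an illustration under stated assumptions rather than the evaluation procedure of either dataset: only pairs with i > j are counted (a later image matched against an earlier one), so the diagonal and mirrored entries are ignored, and the function name and both 4x4 matrices are hypothetical.

```python
import numpy as np

def loop_closure_precision_recall(predicted, ground_truth):
    """Precision and recall of a binary image-to-image correspondence matrix."""
    predicted = np.asarray(predicted, dtype=bool)
    ground_truth = np.asarray(ground_truth, dtype=bool)
    # Assumption: count each image pair once, using the strict lower triangle.
    mask = np.tril(np.ones(predicted.shape, dtype=bool), k=-1)
    tp = int(np.sum(predicted & ground_truth & mask))
    fp = int(np.sum(predicted & ~ground_truth & mask))
    fn = int(np.sum(~predicted & ground_truth & mask))
    # Convention (assumption): precision is 1.0 when nothing is detected.
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Hypothetical ground truth: image 2 revisits image 0, image 3 revisits image 1.
gt_matrix = [[0, 0, 0, 0],
             [0, 0, 0, 0],
             [1, 0, 0, 0],
             [0, 1, 0, 0]]
# Hypothetical detections: (2,0) is correct, (3,0) is wrong, (3,1) is missed.
pred_matrix = [[0, 0, 0, 0],
               [0, 0, 0, 0],
               [1, 0, 0, 0],
               [1, 0, 0, 0]]
precision, recall = loop_closure_precision_recall(pred_matrix, gt_matrix)
print(f"precision = {precision:.2f}, recall = {recall:.2f}")
```

Here one of the two detections is correct and one of the two true loop closures is found, so both precision and recall are 0.5.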

In document User-oriented markerless augmented reality framework based on 3D reconstruction and loop closure detection (Page 115-124)