Two-view geometry epipolar geometry - Robust convex optimisation techniques for autonomous vehi

Red, green and blue are considered additive colours. That is, the colours are added together to form the final colour image. This can be explained by the fact that these are the dominant colours that our eyes can see. By combining them, we can effectively generate almost every colour our eyes can see.

2.9 Two-view geometry - epipolar geometry

Views are defined as images taken by one or more cameras at different locations [82, 122, 200]. If the same camera is used, each view has the same image size and the same calibration parameters. In two-view geometry, the two images are captured either by a single camera that is shifted from one place to another, or by two cameras at two different positions in space [89]. Multiple views imply that a single camera is used to capture multiple images, which form a single image sequence, from n different positions.

It is possible to estimate correspondences between image points of the same 3D scene points at each camera position assuming that the scene is relatively static.

Now, let us consider an arbitrary 3D scene point X in the scene (Figure 2.7), with homogeneous coordinates X̂ = [X⊤ 1]⊤. From (2.26), this scene point is projected to the image point xi = PiX̂ in the left image Îi and to the image point xj = PjX̂ in the right image Îj.

Hence, xi and xj are corresponding (matching) image points of the same 3D point X.

Also, it can be seen from Figure 2.7 that the scene point X, the matching image points xi and xj, and the camera centres Ci and Cj lie in the same plane, denoted as π. In the case where xj is known, our task is to recover its corresponding xi. The distance between the two camera centres is called the baseline. Since xi lies on the plane π, the search can be restricted to the line where the left image plane πi intersects the plane π. This important line is called the epipolar line.

Now the question is: what is the geometrical relationship between the two points xi and xj? This is discussed in the following section.

2.9.1 The essential matrix

Fig. 2.7 Two-view geometry

Let us consider the scenario where the point X is seen by two cameras Pi and Pj from different locations. The points xi and xj ∈ R3 therefore represent the coordinates of the projections of the same 3D scene point X. As stated before, each camera has its own reference frame (Section 2.5.1). Let Xi ∈ R3 and Xj ∈ R3 be the 3D coordinates of X relative to the reference frames of cameras Pi and Pj, respectively. They are related by the rigid body motion:

Xj = RXi+ T (2.27)

where R ∈ SO(3) is the rotation matrix and T ∈ R3 is the translation. Here we assume that the second camera Pj is the reference frame. The transformation g = (R, T) ∈ SE(3) describes the location and the orientation of the camera Pi relative to this frame. Equation (2.27) may be written using the image points xi and xj as:

λjxj = Rλixi+ T (2.28)

where λi and λj are again the depths. In order to eliminate these depths, one may left-multiply both sides of (2.28) by [T]×. Since [T]×T = 0, this gives:

λj[T ]×xj = [T ]×Rλixi (2.29)

Since [T]×xj = T × xj is perpendicular to the vector xj, the inner product xj⊤[T]×xj is zero. Left-multiplying (2.29) by xj⊤ therefore gives xj⊤[T]×Rλixi = 0. We know that the depth λi is positive (λi > 0), so:

xj⊤[T]×Rxi = 0 (2.30)

This equation is of great importance in computer vision and is called the essential constraint or epipolar constraint.
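The derivation of the epipolar constraint can be checked numerically. The following is a minimal sketch, assuming NumPy is available; the pose, the scene point and the helper names `skew` and `rot_z` are illustrative choices, not values taken from the thesis.

```python
import numpy as np

def skew(t):
    """Return [t]x, the 3x3 matrix such that skew(t) @ v == np.cross(t, v)."""
    return np.array([[0.0, -t[2], t[1]],
                     [t[2], 0.0, -t[0]],
                     [-t[1], t[0], 0.0]])

def rot_z(angle):
    """Rotation about the Z-axis, used here only to build an example pose."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

# Hypothetical relative pose (R, T) and a scene point expressed in frame i.
R = rot_z(0.1)
T = np.array([0.5, 0.1, 0.0])
X_i = np.array([0.3, -0.2, 4.0])
X_j = R @ X_i + T                      # rigid body motion (2.27)

# Normalised image points: X = lambda * x, with the depth lambda = Z.
x_i = X_i / X_i[2]
x_j = X_j / X_j[2]

# Left-hand side of the epipolar constraint x_j^T [T]x R x_i.
residual = x_j @ skew(T) @ R @ x_i
print(residual)
```

For noise-free points the residual vanishes up to floating-point rounding, whatever the (positive) depths are.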

The geometric interpretation of this equation is shown in Figure 2.7. The camera centres Ci and Cj, the 3D scene point X and its image points xi and xj are coplanar (i.e. lie in the same plane π). Therefore (2.30) is simply the co-planarity constraint expressed in the reference frame of camera Pj. Let the matrix E = [T]×R ∈ R3×3; then (2.30) can be rewritten as:

xj⊤Exi = 0 (2.31)

Matrix E is called the essential matrix. This matrix encodes the relative translation T and rotation R between the two cameras Pi and Pj [2, 72, 82, 89, 122, 200]. Note that E deals with points expressed in the camera coordinate frame. In this thesis, we define a space for such matrices in R3×3, called the essential space and denoted by ℰ:

ℰ = { [T]×R | R ∈ SO(3), T ∈ R3 } ⊂ R3×3 (2.32)

Here [T]× is the 3 × 3 skew-symmetric matrix of T, which satisfies [T]×v = T × v for any v ∈ R3. Now let us state the following theorem [82, 89, 200]:

Theorem 1. A non-zero matrix E ∈ R3×3 is an essential matrix if and only if E has a singular value decomposition (SVD) E = UΣV⊤, where Σ = diag{σ, σ, 0} for some σ ∈ R+ and U, V ∈ SO(3).

A proof of this theorem is given in [123].
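One direction of Theorem 1 is easy to illustrate numerically: a matrix built as [T]×R has two equal singular values and one zero singular value. The sketch below assumes NumPy; the random pose and the helper `skew` are illustrative assumptions only.

```python
import numpy as np

def skew(t):
    """Return [t]x, the 3x3 matrix such that skew(t) @ v == np.cross(t, v)."""
    return np.array([[0.0, -t[2], t[1]],
                     [t[2], 0.0, -t[0]],
                     [-t[1], t[0], 0.0]])

rng = np.random.default_rng(0)

# Random rotation: orthogonalise a random matrix, then fix the sign so det = +1.
Q, _ = np.linalg.qr(rng.standard_normal((3, 3)))
R = Q * np.sign(np.linalg.det(Q))
T = rng.standard_normal(3)

E = skew(T) @ R
singular_values = np.linalg.svd(E, compute_uv=False)
print(singular_values, np.linalg.norm(T))
```

In this construction the common non-zero singular value σ equals the norm of T, since [T]× has singular values (∥T∥, ∥T∥, 0) and multiplying by a rotation does not change them.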

2.9.2 Pose extraction from the essential matrix

Clearly, when the relative rotation R and the relative translation T between the two cameras are known, the essential matrix follows immediately from E = [T]×R. The question now is how to estimate the pose (R, T) when only the essential matrix E is known. This problem is known as pose recovery from the essential matrix. It involves two steps: first, the essential matrix is estimated from the corresponding image points; then, the translation and rotation are recovered from the estimated E. Before performing the second step, however, we have to make sure that the recovered E is really an essential matrix, i.e. that it lies in the essential space, E ∈ ℰ.

First, let us show how to estimate the matrix E itself. Equations (2.30) and (2.31) imply that corresponding image points are related by E. Thus, it is possible to estimate E given these correspondences. This can be done using what is called the eight-point algorithm.

The eight-point linear algorithm

Longuet-Higgins [117] was the first to develop the 8-point algorithm in computer vision. This algorithm estimates the essential matrix using 8 pairs of matching points across two views [2]. The 3 × 3 essential matrix E can be derived from the essential constraint (2.31):

$$
\mathbf{x}_j^\top E\, \mathbf{x}_i =
\begin{pmatrix} x_j & y_j & z_j \end{pmatrix}
\begin{bmatrix} e_1 & e_2 & e_3 \\ e_4 & e_5 & e_6 \\ e_7 & e_8 & e_9 \end{bmatrix}
\begin{pmatrix} x_i \\ y_i \\ z_i \end{pmatrix} = 0 \tag{2.33}
$$

Let the vector e ∈ R9 contain the elements of the essential matrix:

e = (e1, e2, e3, e4, e5, e6, e7, e8, e9)⊤ ∈ R9 (2.34)

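If we have n correspondences, then the matrix A ∈ Rn×9 is given by:

$$
A = \begin{bmatrix}
x_{j1}x_{i1} & x_{j1}y_{i1} & x_{j1}z_{i1} & y_{j1}x_{i1} & y_{j1}y_{i1} & y_{j1}z_{i1} & z_{j1}x_{i1} & z_{j1}y_{i1} & z_{j1}z_{i1} \\
x_{j2}x_{i2} & x_{j2}y_{i2} & x_{j2}z_{i2} & y_{j2}x_{i2} & y_{j2}y_{i2} & y_{j2}z_{i2} & z_{j2}x_{i2} & z_{j2}y_{i2} & z_{j2}z_{i2} \\
\vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots \\
x_{jn}x_{in} & x_{jn}y_{in} & x_{jn}z_{in} & y_{jn}x_{in} & y_{jn}y_{in} & y_{jn}z_{in} & z_{jn}x_{in} & z_{jn}y_{in} & z_{jn}z_{in}
\end{bmatrix} \tag{2.35}
$$

For each correspondence, the epipolar constraint (2.31) can be formulated as the inner product of the corresponding row of A with e. Stacking the n constraints leads to the linear equation in the entries of e: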

Ae = 0 (2.36)

This linear equation may be solved for the vector e. For the solution to be unique (up to scale), the rank of the matrix A needs to be exactly eight. This requires at least 8 corresponding points, i.e. n ≥ 8. However, even with a sufficient number of corresponding points, the linear equation (2.36) may have no exact non-zero solution because of noise. In this case, the available option is to recover the vector e that minimises ∥Ae∥² subject to ∥e∥ = 1, which is the right singular vector of A associated with its smallest singular value.
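The linear step described above can be sketched as follows. This is a minimal illustration assuming NumPy, not the structured algorithm of Appendix C; the function name `eight_point_linear` and the synthetic, noise-free test data are assumptions made for the example.

```python
import numpy as np

def eight_point_linear(x_i, x_j):
    """Linear estimate of E from n >= 8 correspondences.

    x_i, x_j: (n, 3) arrays of homogeneous image points in normalised
    (calibrated) coordinates. Returns a 3x3 matrix of unit Frobenius norm.
    """
    n = x_i.shape[0]
    # Row k of A is the row-wise flattened outer product of x_j[k] and x_i[k],
    # which reproduces the ordering of the nine products in (2.35).
    A = np.einsum("na,nb->nab", x_j, x_i).reshape(n, 9)
    # The unit vector minimising ||A e|| is the last right singular vector.
    _, _, Vt = np.linalg.svd(A)
    return Vt[-1].reshape(3, 3)        # row-major e1..e9, as in (2.33)

# Synthetic test data: hypothetical pose and 12 points in front of both cameras.
rng = np.random.default_rng(1)
c, s = np.cos(0.2), np.sin(0.2)
R_true = np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])
T_true = np.array([0.5, 0.1, 0.05])
X_i = rng.uniform([-1.0, -1.0, 4.0], [1.0, 1.0, 8.0], size=(12, 3))
X_j = X_i @ R_true.T + T_true          # (2.27) applied to every point
x_i = X_i / X_i[:, 2:3]
x_j = X_j / X_j[:, 2:3]

E_est = eight_point_linear(x_i, x_j)
print(E_est)
```

The estimate is determined only up to scale and sign, so it should be compared with [T]×R after normalising both matrices. With noisy correspondences the result is generally not an exact essential matrix.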

Before the relative pose is extracted from the recovered matrix E, this matrix must be a valid essential matrix, i.e. it must lie in the essential space ℰ (2.32). Enforcing this involves orthogonally projecting it onto the essential space. Let us first consider the following theorem [82, 122]:

Theorem 2. Let H ∈ R3×3 be a real matrix with SVD H = U diag(λ1, λ2, λ3)V⊤, where U, V ∈ SO(3) and λ1 ≥ λ2 ≥ λ3. The optimal essential matrix E ∈ ℰ, i.e. the one that minimises the cost function ∥E − H∥²f, is given by E = U diag(σ, σ, 0)V⊤, where σ = (λ1 + λ2)/2. The subscript f designates the Frobenius norm.

The matrix E is recovered only up to an unknown scale, since the corresponding image points are expressed in homogeneous coordinates. A typical way to deal with this ambiguity is to select the E whose non-zero singular values are 1, i.e. E = U diag{1, 1, 0}V⊤.
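The projection of Theorem 2 and the unit-scale normalisation can be sketched in a few lines. The code below assumes NumPy; the function name `project_to_essential`, its `unit_scale` flag and the example matrix H are hypothetical and serve only as an illustration.

```python
import numpy as np

def project_to_essential(H, unit_scale=True):
    """Replace the singular values of H by (sigma, sigma, 0).

    With unit_scale=True, sigma = 1 (the normalised choice described above);
    otherwise sigma = (lambda1 + lambda2) / 2 as in Theorem 2.
    """
    U, lam, Vt = np.linalg.svd(H)      # lam is sorted in decreasing order
    sigma = 1.0 if unit_scale else (lam[0] + lam[1]) / 2.0
    # The third columns of U and V are multiplied by zero, so their signs
    # (and hence the signs of det U and det V) do not affect the product.
    return U @ np.diag([sigma, sigma, 0.0]) @ Vt

# Hypothetical noisy estimate, e.g. the output of a linear eight-point step.
H = np.array([[0.10, -1.00, 0.20],
              [0.90, 0.05, -0.30],
              [0.10, 0.40, 0.02]])

E_unit = project_to_essential(H)
E_mean = project_to_essential(H, unit_scale=False)
print(np.linalg.svd(E_unit, compute_uv=False))
```

The two variants differ only by a positive scale factor, which is consistent with E being defined up to scale.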

Now after estimating E, let us consider the following theorem [82, 122]:

Theorem 3. For a non-zero essential matrix E = U diag(1, 1, 0)V⊤ ∈ ℰ, there exist four possible choices for the relative pose (R, T), where R ∈ SO(3) and T ∈ R3:

• [R|T] = [UWV⊤ | +u3] or [R|T] = [UWV⊤ | −u3], or

• [R|T] = [UW⊤V⊤ | +u3] or [R|T] = [UW⊤V⊤ | −u3],

where

$$
W = \begin{bmatrix} 0 & -1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 1 \end{bmatrix}
$$

is the rotation by an angle of π/2 around the Z-axis, and u3 is the last column of U. A proof of this theorem can be found in [82].

So we end up with a four-fold ambiguity. This ambiguity may be resolved by reconstructing a scene point X, which lies in front of both cameras in only one of the four solutions. This is known as the cheirality constraint: a point visible in an image must lie in front of the camera and not behind it. In other words, only one combination of R and T guarantees that the depths of the reconstructed 3D points are positive. The other three solutions are therefore infeasible and are discarded [72, 82, 84, 122, 200].
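The four candidate poses and the cheirality test can be sketched as follows. This is an illustrative sketch assuming NumPy, not the procedure of Appendix C: the function names (`candidate_poses`, `depths`, `recover_pose`), the least-squares depth computation from (2.28) and the test pose are all assumptions made for the example.

```python
import numpy as np

W = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])

def skew(t):
    """Return [t]x, the 3x3 matrix such that skew(t) @ v == np.cross(t, v)."""
    return np.array([[0.0, -t[2], t[1]],
                     [t[2], 0.0, -t[0]],
                     [-t[1], t[0], 0.0]])

def candidate_poses(E):
    """Return the four (R, T) candidates of Theorem 3."""
    U, _, Vt = np.linalg.svd(E)
    # Force U and V into SO(3); negating a factor only flips the sign of E,
    # and E is itself defined only up to sign.
    if np.linalg.det(U) < 0:
        U = -U
    if np.linalg.det(Vt) < 0:
        Vt = -Vt
    u3 = U[:, 2]
    return [(R, T) for R in (U @ W @ Vt, U @ W.T @ Vt) for T in (u3, -u3)]

def depths(R, T, x_i, x_j):
    """Solve lambda_j x_j = lambda_i R x_i + T (2.28) for (lambda_i, lambda_j)."""
    M = np.column_stack((-R @ x_i, x_j))
    lam, *_ = np.linalg.lstsq(M, T, rcond=None)
    return lam

def recover_pose(E, x_i, x_j):
    """Return the candidate for which both depths are positive, else None."""
    for R, T in candidate_poses(E):
        lam_i, lam_j = depths(R, T, x_i, x_j)
        if lam_i > 0 and lam_j > 0:
            return R, T
    return None

# Hypothetical ground-truth pose (unit translation) and one correspondence.
c, s = np.cos(0.2), np.sin(0.2)
R_true = np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])
T_true = np.array([1.0, 0.2, 0.1])
T_true = T_true / np.linalg.norm(T_true)
X_i = np.array([0.3, -0.2, 5.0])
X_j = R_true @ X_i + T_true
x_i, x_j = X_i / X_i[2], X_j / X_j[2]

R_rec, T_rec = recover_pose(skew(T_true) @ R_true, x_i, x_j)
print(R_rec, T_rec)
```

A single correspondence is used here for brevity. With noisy data, a more robust variant would typically count positive depths over many correspondences and keep the candidate with the most votes.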

A structured form of the eight-point algorithm is given in Appendix C (Algorithm 4, page 298).

2.9.3 The fundamental matrix

The primary aim of two-view geometry is to recover the relative pose (R, T) between two views, even when the cameras are not calibrated. Let us again consider the rigid body motion between two views [2, 72, 82, 122, 200]:

λjxj = Rλixi + T (2.37)

where λi and λj are again the unknown depths. Let K be the calibration matrix; then, by multiplying both sides of (2.37) by K, we get:

λjKxj = KRλixi + KT, (2.38)

λjx′j = KRK⁻¹λix′i + KT, (2.39)

λjx′j = KRλiK⁻¹x′i + T′, (2.40)

where x′i = Kxi, x′j = Kxj and T′ = KT. Notice that the coordinates in x′i and x′j are now in pixels. Similarly to the case in (2.29) and (2.30), the depths λi and λj may be eliminated from (2.40) by taking the inner product of both sides with the cross product (x′j × T′), which is perpendicular to both x′j and T′:

x′j⊤[T′]×KRK⁻¹x′i = 0 (2.41)

Then:

x′j⊤Fx′i = 0, F = [T′]×KRK⁻¹ ∈ R3×3 (2.42)

Matrix F is called the fundamental matrix and x′j⊤Fx′i = 0 is called the epipolar constraint. Note that when K = I, the fundamental matrix is equal to the essential matrix [T′]×R. The relationship between the fundamental and essential matrices is then given by:

E = K⊤FK (2.43)

where the equality is understood up to a non-zero scale factor, since F is only defined up to scale. The fundamental matrix F is therefore another algebraic representation of epipolar geometry (Figure 2.7) [82]. Unlike E, however, it deals with image points measured in pixels. It is therefore of great importance for motion estimation, since the image point locations in two frames are measured in pixels. More detail about estimating this matrix is given in Appendix C (Section C.1, page 293).
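The relationship (2.43) can be illustrated numerically by building F from E and checking the epipolar constraint on pixel coordinates. The sketch assumes NumPy; the calibration matrix K, the pose and the scene point are hypothetical values chosen for the example, and F is formed as K⁻⊤EK⁻¹, which is the form implied by (2.43).

```python
import numpy as np

def skew(t):
    """Return [t]x, the 3x3 matrix such that skew(t) @ v == np.cross(t, v)."""
    return np.array([[0.0, -t[2], t[1]],
                     [t[2], 0.0, -t[0]],
                     [-t[1], t[0], 0.0]])

# Hypothetical calibration matrix (focal length and principal point in pixels).
K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])

# Hypothetical relative pose and scene point expressed in frame i.
c, s = np.cos(0.2), np.sin(0.2)
R = np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])
T = np.array([0.5, 0.1, 0.05])
X_i = np.array([0.3, -0.2, 5.0])
X_j = R @ X_i + T

E = skew(T) @ R
K_inv = np.linalg.inv(K)
F = K_inv.T @ E @ K_inv                # so that K^T F K = E, cf. (2.43)

# Pixel coordinates x' = K x of the normalised image points.
xp_i = K @ (X_i / X_i[2])
xp_j = K @ (X_j / X_j[2])

residual = xp_j @ F @ xp_i             # epipolar constraint in pixels
print(residual)
```

Any non-zero multiple of F satisfies the same constraint, which is why the comparison between E and K⊤FK is meaningful only up to scale in general.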

In document Robust convex optimisation techniques for autonomous vehicle vision-based navigation (Page 65-70)