
Chapter 2 Literature Review and Background

2.2. Computer vision methods in AR

2.2.1. Camera and camera calibration

A camera is an optical device that captures light reflected from the environment, serving a function similar to that of the human eye. The sensed image data can take the form of individual photographs or of image sequences constituting videos. In computer vision, most single-lens camera devices can be simplified to a monocular pinhole camera model (see the dashed box in Figure 2-10) (Hartley & Zisserman, 2003). In contrast to a monocular camera, a stereoscopic camera has two or more separate lenses to simulate human binocular vision and to capture 3D images. Technically, however, a stereoscopic camera can also be described by a set of monocular camera models in which each lens is replaced by an individual pinhole camera.

Figure 2-10: The perspective projection procedure of a pinhole camera model, where upper-case X, Y, Z denote camera coordinates and lower-case x, y denote image coordinates.

The ideal pinhole camera model consists mainly of an optical centre (a.k.a. projection centre) and an image plane, which together define a 3D reference frame for expressing the spatial relationships between the camera and the objects around it. This local reference frame is called the camera reference frame. As can be seen in Figure 2-10, the coordinate system has its origin at the optical centre, its X-Y plane parallel to the image plane, and its Z-axis along the optical axis, perpendicular to the image plane. The location of the image plane is described by its shortest distance to the optical centre, known as the focal length, and by the point where the optical axis intersects the image plane, referred to as the image centre or principal point. Figure 2-10 also presents the projection of a 3D point (expressed in the camera coordinate system) onto the 2D image plane: the light reflected from the 3D point P passes through the image plane on its way to the optical centre, and the intersection p with the image plane is the projected image point of P.
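The projection described above can be illustrated with a minimal sketch. It assumes the ideal pinhole geometry of Figure 2-10 (origin at the optical centre, Z along the optical axis, image plane at the focal length); the function name and the numeric values are hypothetical and chosen only for illustration.

```python
import numpy as np

def project_pinhole(point_cam, focal_length):
    """Project a 3D point (camera reference frame) onto the image plane.

    Assumes an ideal pinhole camera: the image point is where the ray from
    the 3D point to the optical centre crosses the plane Z = focal_length.
    """
    X, Y, Z = np.asarray(point_cam, dtype=float)
    if Z <= 0:
        raise ValueError("the point must lie in front of the camera (Z > 0)")
    # Similar triangles: x / f = X / Z and y / f = Y / Z.
    return np.array([focal_length * X / Z, focal_length * Y / Z])

# Hypothetical point 2 m in front of a camera with a 50 mm focal length.
p = project_pinhole([0.2, -0.1, 2.0], focal_length=0.05)
print(p)
```

The result is expressed in the same metric units as the focal length and is measured from the principal point; converting to pixel coordinates requires the intrinsic parameters discussed in the next subsection.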

2.2.1.1 Camera calibration

In practice, real camera devices perform a perspective projection that maps a 3D scene to 2D images, and this projection is controlled by the intrinsic camera parameters. As the name indicates, these are intrinsic properties of the camera device, so pictures taken with the same camera share the same intrinsic parameters. They include the focal length and the principal point mentioned above, and additionally the lens distortion, which is caused by lens imperfections or introduced intentionally by a fisheye lens to create a wide panoramic or hemispherical image (Horenstein, 2005). Camera calibration refers to the process of finding these parameters. It is important for visual AR applications because the AR process relies on the intrinsic camera parameters to insert virtual objects into the scene of the input images and to project the augmentations correctly onto the user's display (Baggio, 2012).
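How the intrinsic parameters enter the projection can be sketched as follows. This is an illustration only: it assumes the commonly used formulation with separate focal lengths in pixels (fx, fy), a principal point (cx, cy) and a two-coefficient polynomial radial distortion model (k1, k2), which is not necessarily the exact formulation used later in this thesis. All numeric values are hypothetical.

```python
import numpy as np

def project_with_intrinsics(point_cam, fx, fy, cx, cy, k1=0.0, k2=0.0):
    """Map a 3D point in the camera frame to pixel coordinates (u, v)."""
    X, Y, Z = np.asarray(point_cam, dtype=float)
    # Normalised image coordinates (image plane at unit distance).
    xn, yn = X / Z, Y / Z
    # Radial distortion scales the normalised point by a polynomial in r^2.
    r2 = xn * xn + yn * yn
    radial = 1.0 + k1 * r2 + k2 * r2 * r2
    xd, yd = xn * radial, yn * radial
    # Focal lengths scale to pixels; the principal point shifts the origin.
    return np.array([fx * xd + cx, fy * yd + cy])

# Hypothetical intrinsics for a 640x480 image.
ideal = project_with_intrinsics([0.2, -0.1, 2.0], 525.0, 525.0, 319.5, 239.5)
barrel = project_with_intrinsics([0.2, -0.1, 2.0], 525.0, 525.0, 319.5, 239.5,
                                 k1=-0.2)
print(ideal, barrel)
```

With a negative k1 (barrel distortion) the projected pixel moves towards the principal point, which is the kind of systematic offset that calibration is meant to capture.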

The intrinsic camera parameters can be provided by the manufacturer or computed using a known calibration target (e.g. the planar chessboard pattern shown in Figure 2-11 (Heikkila & Silvén, 1997; Zhang, 2000)). Several implementations of camera calibration are available, such as the camera calibration toolbox for Matlab (Bouguet, 2004) and the camera_calibration sample code provided with the OpenCV library (Bradski, 2000). The main idea of camera calibration is to take several images of a set of annotated 3D points from different viewpoints and to determine their projected points in the images. Specifically, if OpenCV is used for calibration, the 3D points are taken from the inner corners of the black and white squares of the chessboard pattern. A pattern reference frame is defined. Since the pattern is flat, the Z-axis of the reference frame is assumed to be perpendicular to the pattern plane, and all points on the pattern are located at Z = 0. The X and Y axes are assumed to be aligned with the grid of the pattern, so the 3D positions of the corner points can be determined from the actual size of a square.

Figure 2-11: Black-and-white 9x6 chessboard pattern provided by the OpenCV library.
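The pattern reference frame described above can be made concrete with a short sketch that generates the 3D corner positions. It assumes that 9x6 counts the inner corners of the pattern and uses a hypothetical square size of 25 mm; the function name is likewise an illustrative choice.

```python
import numpy as np

def chessboard_object_points(cols, rows, square_size):
    """Return the 3D inner-corner positions in the pattern reference frame.

    The X and Y axes follow the grid of the pattern and every corner lies on
    the plane Z = 0. Points are ordered row by row, with X varying fastest.
    """
    points = np.zeros((rows * cols, 3), dtype=np.float32)
    points[:, :2] = np.mgrid[0:cols, 0:rows].T.reshape(-1, 2)
    return points * square_size

# Hypothetical pattern: 9x6 inner corners, 0.025 m squares.
corners_3d = chessboard_object_points(9, 6, 0.025)
print(corners_3d.shape)
print(corners_3d[:3])
```

An array of this form, repeated once per calibration image, is the kind of input that OpenCV's `cv2.calibrateCamera` takes as object points, paired with the corner positions detected in each image.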

The basic principle of camera calibration involves taking known 3D points, measuring the 2D image points and finding the intrinsic camera parameters from those correspondences. The mathematical details are described in Section 3.2.3.
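As a deliberately simplified illustration of this principle, the sketch below recovers the focal lengths and principal point by linear least squares from synthetic 3D-2D correspondences. It assumes that the 3D points are already expressed in the camera reference frame and that there is no lens distortion. A real calibration procedure (e.g. Zhang, 2000) must additionally estimate the pose of the pattern in every image, so this is not the method described in Section 3.2.3. All names and values are hypothetical.

```python
import numpy as np

def estimate_intrinsics(points_cam, pixels):
    """Fit fx, fy, cx, cy from camera-frame 3D points and their pixels.

    Uses the linear relations u = fx * (X / Z) + cx and v = fy * (Y / Z) + cy.
    """
    points_cam = np.asarray(points_cam, dtype=float)
    pixels = np.asarray(pixels, dtype=float)
    xn = points_cam[:, 0] / points_cam[:, 2]
    yn = points_cam[:, 1] / points_cam[:, 2]
    ones = np.ones_like(xn)
    # Two independent least-squares problems, one per image axis.
    (fx, cx), *_ = np.linalg.lstsq(np.column_stack([xn, ones]),
                                   pixels[:, 0], rcond=None)
    (fy, cy), *_ = np.linalg.lstsq(np.column_stack([yn, ones]),
                                   pixels[:, 1], rcond=None)
    return fx, fy, cx, cy

# Synthetic data generated with hypothetical "true" intrinsics.
rng = np.random.default_rng(0)
pts = np.column_stack([rng.uniform(-0.5, 0.5, 20),
                       rng.uniform(-0.5, 0.5, 20),
                       rng.uniform(1.0, 3.0, 20)])
true_fx, true_fy, true_cx, true_cy = 525.0, 525.0, 319.5, 239.5
pix = np.column_stack([true_fx * pts[:, 0] / pts[:, 2] + true_cx,
                       true_fy * pts[:, 1] / pts[:, 2] + true_cy])
estimate = estimate_intrinsics(pts, pix)
print(estimate)
```

Because the synthetic pixels are noise-free, the fit reproduces the generating parameters up to floating-point error; with measured corner positions the same fit would return a least-squares compromise.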

2.2.1.2 Kinect sensor calibration

In the present research, the Microsoft Kinect 1.0 is used to obtain RGBD input data (unless otherwise noted, all references to the "Kinect" in this section concern the Microsoft Kinect 1.0 product). The Kinect combines a monocular colour camera, an Infra-Red (IR) camera and an IR speckle projector to provide conventional colour images and per-pixel depth information at a fixed frame rate (30 fps for the Kinect). The sensor data are read and stored as colour and depth images using the open-source software framework OpenNI. The calibration of the colour camera is very similar to the approach described in Section 2.2.1.1. The depth information is determined by using the IR camera and the speckle projector as a pseudo-stereo pair, and it can likewise be calibrated by detecting a chessboard in the IR image, as described in Burrus (2012) and Reimann (2015). In fact, it is not necessary to calibrate a Kinect by hand, since the OpenNI camera driver provides default intrinsic camera models with reasonably accurate focal lengths. Lens distortion is ignored because the Kinect uses low-distortion lenses.

In addition to the intrinsic parameter calibration discussed so far, it should be noted that the colour camera and the depth camera generally run concurrently, but the acquired videos may be slightly out of sync. Asynchronous colour-depth frame pairs can be dropped by checking the difference between their timestamps (measured in microseconds by OpenNI). It should also be noted that there is a spatial offset between the lenses of the colour and depth cameras, so the fields of view of the two cameras differ slightly. An example is given in the upper part of Figure 2-12. This can be solved by mapping depth pixels to their corresponding colour pixels through a registration process, which is supported by some devices, such as the Kinect; the calculations can be performed in hardware and accessed through the OpenNI API. The registration result is shown in the lower part of Figure 2-12. Alternatively, a custom calibration of the RGBD camera can be performed to achieve more rigorous results. A full set of calibration methods is presented in Herrera et al. (2011) and Zhang & Zhang (2014).
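The timestamp check mentioned above can be sketched in a few lines. The sketch assumes that the frames already arrive as colour-depth pairs with timestamps in microseconds; the tolerance of half a frame period at 30 fps is a hypothetical choice, not a value taken from this research, and no OpenNI calls are shown.

```python
def keep_synchronised_pairs(colour_stamps_us, depth_stamps_us, max_diff_us):
    """Return the indices of frame pairs whose timestamps are close enough.

    A pair is kept when the absolute difference between its colour and depth
    timestamps does not exceed max_diff_us; all other pairs are dropped.
    """
    kept = []
    for index, (t_colour, t_depth) in enumerate(
            zip(colour_stamps_us, depth_stamps_us)):
        if abs(t_colour - t_depth) <= max_diff_us:
            kept.append(index)
    return kept

# Hypothetical timestamps (microseconds); the third pair is badly out of sync.
colour = [0, 33_333, 66_700, 100_000]
depth = [1_000, 33_400, 90_000, 100_500]
half_frame_us = 1_000_000 // 60  # half of a 30 fps frame period
kept_pairs = keep_synchronised_pairs(colour, depth, half_frame_us)
print(kept_pairs)
```

A tighter tolerance drops more pairs but reduces the misalignment between colour and depth for moving scenes or a moving camera.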

Figure 2-12: The unregistered depth image (upper) and the registered depth image (lower) with their corresponding colour image captured by the Kinect.
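The geometric idea behind the registration shown in Figure 2-12 can be sketched for a single depth pixel. On the Kinect this mapping is performed by the device and accessed through OpenNI, so the code below is only an illustration of the principle: back-project the depth pixel to 3D, move it into the colour camera frame, and project it again. The intrinsic matrices, the rotation R and the translation t between the two cameras are hypothetical values.

```python
import numpy as np

def register_depth_pixel(u, v, depth_m, K_depth, K_colour, R, t):
    """Map a depth pixel (u, v) with metric depth to colour-image pixels."""
    fx, fy = K_depth[0, 0], K_depth[1, 1]
    cx, cy = K_depth[0, 2], K_depth[1, 2]
    # Back-project the pixel to a 3D point in the depth camera frame.
    p_depth = np.array([(u - cx) * depth_m / fx,
                        (v - cy) * depth_m / fy,
                        depth_m])
    # Rigid transform from the depth camera frame to the colour camera frame.
    p_colour = R @ p_depth + t
    # Perspective projection with the colour camera intrinsics.
    uvw = K_colour @ p_colour
    return uvw[:2] / uvw[2]

# Hypothetical setup: identical intrinsics, parallel cameras 2.5 cm apart.
K = np.array([[525.0, 0.0, 319.5],
              [0.0, 525.0, 239.5],
              [0.0, 0.0, 1.0]])
R = np.eye(3)
t = np.array([-0.025, 0.0, 0.0])
near = register_depth_pixel(319.5, 239.5, 1.0, K, K, R, t)
far = register_depth_pixel(319.5, 239.5, 4.0, K, K, R, t)
print(near, far)
```

The pixel shift depends on depth: nearby points move further than distant ones, which is why the registration cannot be replaced by a single fixed image offset.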

In document User-oriented markerless augmented reality framework based on 3D reconstruction and loop closure detection (Page 63-68)