Techniques for 3D reconstructions - Visual, spatial and temporal quality in video based reconst

Traditional approaches to representing a virtual user within an immersive collaborative virtual environment have sought to connect an artistically generated character (avatar) to a tracked human body. As computational power has increased, and with the development of newer, higher quality, multipoint tracking systems, the approach has remained similar, but refined to ever finer levels of detail. Connecting these user tracked avatars to animation models further expands the range and scale of the avatar features that can be por- trayed by determining appropriate animation sequences that can be triggered to represent identified emotional states [63]. However, the result is an iconic virtual representation of the tracked user that bases the gross features on actual tracked data and fine detail on implied simulation.

recreate the 3D visual appearance of a moving thing. The use of such techniques is growing in applications such as film and TV production, sports com- mentary and collaborative virtual environment applications/tele-presence, particularly as a way of capturing and representing participating actors. A virtual human created by such means is fundamentally different to a conventional CGI avatar, most notably because it approximates a faithful reproduc- tion rather than an iconic one. Of particular interest was the use of VBR to reconstruct people in a remote space to support real time non-verbal interac- tion in collaborative, interactive 3D environments. While it is important to ensure virtual characters are recognizable and engaging, especially in immersive settings [94], conventional avatars are not synthetic virtual characters, but iconic representations of real people. To ensure that communication and collaboration are as effective as possible it is thought desirable that the human participant within the immersive environment has a similar empathetic engagement, with the virtual character, to that they would experience if face- to-face with the other participant. Fundamentally, the capture of features, such as facial expression, eye-gaze, body positioning, etc, impart important visual cues that enhance the flow of communication between participants [84]. Failure to meet these conditions leads to an abnormal representation that debilitates the real occupant as they cannot emotionally relate to their virtual colleague [75].

In conventional approaches each image is projected into the reconstruction space, either as a segmented silhouette, or a profile curve derived from it. The geometric shell is then defined as the surface of the volume enclosed by the sum of each projected shape. This process is commonly referred to

as space carving [44, 42]. The resultant shell is tessellated to form a set of 3D vertices and indices denoting draw-able triangles, in some cases these are optimized into efficient, render-able geometries [100, 30, 58]. Further evaluation of the vertices maps texture data from the camera images to the form before the reconstruction is passed into a 3D rendering process for graphical display from any viewpoint [56]. In theory, if this can be achieved fast enough, with the computational processing taking place between each frame refresh, a new geometric form and texture set can be displayed for each rendered frame, thereby achieving the goal of real-time animated 3D reconstruction [1]. This creation of an approximation to the visual hull is the most established method of video based reconstruction. Laurentini [44] and Matusik [58] demonstrate how projection of multiple, vectorized silhouette edges enables the computation and shading of visual hulls (geometric shells denoting the form surface) which reconstruct 3D humans. Yue [103], further expands these approaches to incorporate human body part segmentation, thereby overcoming the difficulties within visual hull approaches to construct concave regions and applying the techniques to reconstruction of articulated human reconstruction. These polyhedral approaches are further developed by Franco and Boyer [1, 24] to demonstrate how, with sufficient performance, these reconstructions can be utilized in mixed reality environments.

These geometric approaches seek to resolve the 3D reconstruction as a holistic whole which generates a 3D model that can be viewed from any angle. As such they are the most popular techniques for human reconstruction. However, these approaches impose high computational complexity inherent

in the back-projection of points to all silhouettes. Attempts to overcome this have predominantly sought to adapt or optimize heavy weight CPU based processes to utilize cluster networks or GPU processing for speedup of critical sub-processes [102]. However, this approach does not inherently scale as more computational resources are added and significantly increases system costs [56, 93]. In addition, the formation of a 3D form, enacted prior to the texturing and display, disconnects two stages of the process that are effectively utilising the same data. The result of this is that discrepancies in the generated form, either due to mis-matched camera calibrations, or localised depth artefacts, are further emphasised during the 3D rendering process when source images are applied to the surface within a texturing process. Floating Textures [20] seeks to reduce this problem by applying texture through view-point weighted, projective texturing which seeks to match the best image source to the viewer position. Further elaboration of this technique also allows for depth based assessment of projective texture occlusion to reduce additional effects such as ghosting within the textured form.

A common alternative to polyhedral hull forming is to use projected source images to populate a volume space comprised of voxels [13]. This is typically enacted as a pre-processing stage, delivering a 3D texture (voxel map) containing the reconstruction form’s occupancy, and in some cases, localized colour values. In a few instances, this volume model is rendered directly, with the populated colour voxels denoting the surface texture of the reconstruction form [66]. However, it is more common to skin a binary voxel set, which simply denotes occupancy, typically using a derivation of the

marching cubes algorithm and apply texture to the resultant geometric form [13, 25, 56, 30, 26, 90, 102]. In recent years the emphasis of research in these techniques appears to have shifted to concentrate more on the mapping of texture to a surface of sufficient quality [20, 57]. Yue further expands these approaches to incorporate human body part segmentation, thereby overcoming the difficulties within visual hull approaches to construct concave regions and applying the techniques to reconstruction of articulated human reconstruction [103].

Depth Based Image Rendering (DIBR) techniques seek to minimize the computational load by attempting to create a reconstruction that is valid from a single viewpoint, rather than solving a complete model [89]. Typically these approaches project source image(s) into the 3D environment where they are mapped onto an intermediate geometry located at the approximate spatial position of the reconstruction target [16]. Whilst effective these are often limited solutions that constrain the motion of the observer within tight

proximity to the position of image capture [21]. These approaches have

an inherent lower computational overhead than view-independent geometric solutions with high texture fidelity, but generally lack the three dimensional depth of geometric solutions.

High Dynamic Range (HDR) imaging [53] is a potentially enhancing tech- nology that could improve the fidelity of reconstruction. The extended range of image fidelity, which enhances contrast in shadowed regions, offers a signif- icant improvement in the potential problem areas for person reconstruction such as eye sockets, where shadows hinder the capture of features such as eye (gaze) direction. Benefits have already been demonstrated for narrow

baseline stereoscopic imaging [54] and recent innovations are seeing adapta- tions of the techniques to support video capture [92] which would support video based 3D reconstruction. 3D reconstruction based on the combination of depth and HDR imaging is reported [36], suggesting that processing of the HDR composition is possible within the latency constraints for real-time reconstruction for a single HDR imaging device. However, it is uncertain if this would be possible for multiple video streams. The capture and initial processing of imagery is outside the scope of this research, but the deliv- ered imagery would be direct replacements for the imagery current acquired through traditional video cameras.

In document Visual, spatial and temporal quality in video based reconstruction of people: Achieving, prototyping and evaluating (Page 43-48)