3.2 Building Shape Refinement for Elevation Models

3.2.5 Deep Learning-Based Approaches

Most conventional approaches to DSM improvement still rely on assumptions, such as a specific shape of man-made constructions, or equal height or uniform spectral information within one object polygon. However, urban planning does not follow a single pattern in building construction. An approach that can reconstruct accurate elevation with the true silhouettes of terrestrial objects, without relying on pre-defined knowledge or assumptions, is therefore of great interest. In this section, we review the deep learning-based methodologies that have achieved promising results for the depth information reconstruction task.

3.2.5.1 Depth Image Reconstruction from Single Image

As deep learning techniques have emerged over the past 10 years, new approaches for remote sensing image processing have achieved significant breakthroughs. However, most of these approaches work with spectral imagery, while depth image processing has not yet been well investigated with these techniques, especially when it comes to satellite data.

In computer vision, by contrast, several attempts have been made to generate, restore, and enhance depth images using CNNs. A first attempt at applying CNNs to depth estimation was made by Eigen et al. [218] and followed by Eigen et al. [115], where the authors performed coarse-to-fine learning of two and three convolutional networks in stages, respectively, to transform a monocular color input image into a geometrically meaningful output image at a higher resolution. Tian et al. [219] trained a CNN on patches cropped by a large window centered at each pixel of the raw RGB image. The authors motivated the large window by the need for each pixel to receive sufficiently wide contextual information from the surrounding area. Li et al. [220] tackled the problem of depth prediction from a single color image by regression in a CNN coupled with a CRF, which served as a post-processing refinement step. Using the proposed CNN, the method learned the mapping from multi-scale image patches to depth at the super-pixel level. The super-pixels were then refined to the pixel level by the hierarchical CRF. Unlike the above method, Liu et al. [221] explored the strength of an end-to-end deep structured CNN which learns the unary and pairwise potentials of a continuous CRF, enforcing local consistency in the output image. In contrast to standard methods, it feeds the network an image consisting of small regions of homogeneous pixels. The method can also work with single pixels, but this is computationally inefficient. It delivers predictions with sharper transitions than previous studies, but with a mosaic appearance.
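
The per-pixel patch cropping described for Tian et al. [219] can be illustrated with a minimal sketch. It uses plain Python lists to stay dependency-free. The function name `crop_centered_patch`, the odd window size, and the border handling (clamping to the nearest edge pixel) are assumptions made for illustration and are not taken from [219].

```python
def crop_centered_patch(image, row, col, window):
    """Return a window x window patch centered at (row, col).

    `image` is a 2D list of pixel values. Positions outside the image are
    filled by clamping to the nearest border pixel (an assumed padding
    scheme, not necessarily the one used in the cited work).
    """
    if window % 2 == 0:
        raise ValueError("window must be odd so the pixel sits at the center")
    half = window // 2
    height, width = len(image), len(image[0])
    patch = []
    for r in range(row - half, row + half + 1):
        rr = min(max(r, 0), height - 1)  # clamp the row index to the image
        patch.append([
            image[rr][min(max(c, 0), width - 1)]  # clamp the column index
            for c in range(col - half, col + half + 1)
        ])
    return patch


# Toy 4 x 4 single-channel image with values 0..15 in row-major order.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
corner_patch = crop_centered_patch(image, 0, 0, 3)
inner_patch = crop_centered_patch(image, 1, 1, 3)
```

A larger `window` gives each pixel more surrounding context at the cost of more computation per pixel, which is the trade-off the authors point to.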

Zhu et al. [222] trained a model for depth estimation consisting of two parts: a pre-trained VGG [36] and two fully connected layers of their own design. In this network, gradient descent optimization is applied only to the last five convolutional layers. Although these methods are able to generate depth images relatively close to the ground truth, the object edges lack sharpness and the appearance of objects in the image is very coarse. Jeon et al. [223] aimed at solving a problem similar to ours regarding depth image enhancement. They explored a multi-scale Laplacian pyramid-based neural network and structure-preserving loss functions to progressively reduce noise and holes from coarse to fine scales.

The development of GANs [65] helped to achieve impressive results in high-quality image generation tasks. There have already been many studies on the mapping of images between different domains, such as black-and-white images to color, or satellite images to maps [67]. Recently, some works proposed learning object representations in three-dimensional space based on different variations of the GAN architecture. These methods typically use autoencoder networks [224,225] combined with a generative adversarial approach to generate 3D objects. Wu et al. [226] modeled 3D shapes from a random input vector by using a variant of GAN with volumetric convolutions. Although the algorithm produces 3D objects with high quality and fine-grained details, the final grid has limited resolution. Rezende et al. [227] introduced a general framework to learn 3D structures from 2D observations with a 3D-2D projection mechanism. However, the proposed projection mechanism minimizes the discrepancy between the observed mask and the reprojected predictions either through a learned or fixed reprojection function. Recently, Yang et al. [228] proposed an automatic completion of 3D shapes from a single depth image using GANs. The architecture combines conditional Generative Adversarial Networks (cGANs) [66] with autoencoders to generate accurate 3D structures of objects.

The method learns both local geometric details and the global 3D context of the scene to infer occluded objects from the scene layout. However, designing a network that can efficiently learn both components is a non-trivial task [229]. All of these studies learn a single object reconstruction based on existing libraries of individual objects and are able to produce a probability for occupancy at each discrete position in the 3D voxel space. Yet the computational and spatial complexities of such voxelized representations significantly limit the output resolution.
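
The resolution limit of voxelized representations follows from the cubic growth of the grid. The following is a minimal sketch of that arithmetic, assuming one 32-bit float occupancy probability per voxel. The function name `voxel_grid_bytes` and the listed resolutions are illustrative choices, not values reported in the cited works.

```python
def voxel_grid_bytes(resolution, bytes_per_voxel=4):
    """Memory of a dense resolution^3 occupancy grid.

    Assumes one value per voxel; 4 bytes corresponds to a 32-bit float
    occupancy probability (an assumption made for this illustration).
    """
    return resolution ** 3 * bytes_per_voxel


# Doubling the side length multiplies the memory by eight.
for resolution in (32, 64, 128, 256):
    mebibytes = voxel_grid_bytes(resolution) / 2 ** 20
    print(f"{resolution}^3 grid: {mebibytes:.3f} MiB per output sample")
```

This counts only a single output volume. Intermediate volumetric feature maps with many channels grow at the same cubic rate, which is why such networks are typically restricted to coarse grids.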

In contrast to the computer vision field, height image generation from a single input image has so far rarely been addressed in the remote sensing community. Mou et al. [230] tackled this problem with an end-to-end fully convolutional-deconvolutional network architecture, IM2HEIGHT, encompassing residual learning. The authors demonstrated that skip connections are very important for remote sensing tasks because they preserve detailed boundaries and edges for the representation of minuscule objects in remote sensing images. Unlike our task, the method was developed for processing aerial images, which are in general more accurate and detailed than satellite data. Moreover, their approach is able to reconstruct only an nDSM. Ghamisi et al. [231] proposed a cGAN-based IMG2DSM network for simulating DSMs from single optical images consisting of near-infrared, red, and green bands. Their generated output was a three-channel image in which the same DSM was copied three times. The training was done on high-resolution aerial images with a Ground Sampling Distance (GSD) below 10 cm. Their experiment showed that the presented network generalized well on test data whose spatial-spectral information resembled that of the training dataset, but produced relatively poor results on a new region that was not covered by the training data and was acquired by a different platform.
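
The three-channel DSM target described for IMG2DSM can be sketched as follows. This is only an illustration of the data layout: the helper names `replicate_channels` and `collapse_channels`, and the choice to recover a single height map by averaging the channels, are assumptions and not details confirmed by [231].

```python
def replicate_channels(dsm, channels=3):
    """Copy a single-channel height map into `channels` identical channels.

    `dsm` is a 2D list of heights; the result has shape H x W x channels.
    """
    return [[[height] * channels for height in row] for row in dsm]


def collapse_channels(stacked):
    """Reduce an H x W x C output back to one height map by averaging.

    Averaging is an assumed post-processing choice for this sketch.
    """
    return [[sum(pixel) / len(pixel) for pixel in row] for row in stacked]


# Toy 2 x 2 height map in meters.
dsm = [[10.0, 12.5], [0.0, 3.0]]
target = replicate_channels(dsm)
recovered = collapse_channels(target)
```

Replicating the height map lets an image-to-image network designed for three-channel outputs be reused without architectural changes, at the price of predicting redundant channels.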

3.2.5.2 Depth Image Reconstruction from Multiple Data Sources

As mentioned above, learning-based prediction of continuous values in remote sensing has started to evolve only recently. As a result, the classical remote sensing idea of integrating multiple data sources to compensate for the limited information in a single image has also begun to spread to deep learning-based methodologies for surface model reconstruction. However, only a few such methods have been developed so far.

Costante et al. [232] developed a CNN-based method which reconstructed a monocular Digital Elevation Model (DEM) from interferometric images. The amplitude and phase components of complex SAR images served as the main inputs to the encoder-decoder architecture. The output was a DEM re-projected into radar coordinates. The network was trained by minimizing the pixel-wise linear RMSE as the objective. The network was able to estimate elevation statistics resembling the ground truth. However, a significant smoothing effect was also present in the reconstructed elevation model.
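
A pixel-wise RMSE objective of the kind mentioned above can be written as a minimal sketch. Reading "linear" as an error computed on raw elevation values rather than in log space is an assumption, and the function name `pixelwise_rmse` is hypothetical.

```python
import math


def pixelwise_rmse(predicted, target):
    """Root mean squared error over all pixels of two equally sized maps.

    Both inputs are 2D lists of elevation values; the error is computed
    on the raw values (assumed meaning of "linear" RMSE).
    """
    if len(predicted) != len(target) or any(
        len(p_row) != len(t_row) for p_row, t_row in zip(predicted, target)
    ):
        raise ValueError("predicted and target must have the same shape")
    squared_errors = [
        (p - t) ** 2
        for p_row, t_row in zip(predicted, target)
        for p, t in zip(p_row, t_row)
    ]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Toy example: one pixel is off by 4 m, the other three are exact.
predicted = [[1.0, 2.0], [3.0, 4.0]]
target = [[1.0, 2.0], [3.0, 8.0]]
error = pixelwise_rmse(predicted, target)
```

Because the error is averaged over all pixels, a prediction that spreads a sharp height step over several pixels can still score well, which is consistent with the smoothing effect noted above.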

Paschalidou et al. [233] explored multi-view geometry constraints from multi-view aerial images to model the physical processes of perspective projection and occlusion within a learning approach. More specifically, a CNN architecture responsible for estimating surface probabilities from correlated nearby images was integrated with an MRF that aggregated the physics of perspective projection and occlusion across all viewpoints.

3.2.5.3 Depth Image Reconstruction in a Multi-Task Context

Image analysis tasks, whether classification, semantic segmentation, or regression, are related to each other and can share some common aspects. As a result, learning one task can help in learning others. As already reviewed in Section 2.2.6, multi-task learning has been successfully integrated into many computer vision applications [104–107].

In remote sensing, the method proposed by Srivastava et al. [234] is the only known multi-task deep learning-based approach developed for both semantic segmentation map prediction and nDSM generation from single monocular images. The authors used a joint loss function for CNN training, which is a linear combination of a dense image classification loss and a regression loss responsible for DSM error minimization. The model is trained by alternating between the two losses. However, the major drawback of the proposed method is that, in the training phase, the network requires pixel-wise labeled segmentation masks as input, which are not widely available.
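
A joint loss formed as a linear combination of a classification term and a regression term can be sketched as follows. The concrete choices here (cross-entropy for classification, mean absolute error for height regression, and the weight `regression_weight`) are assumptions for illustration and are not confirmed to match [234].

```python
import math


def joint_loss(class_probs, labels, heights_pred, heights_true,
               regression_weight=0.5):
    """Return (total, classification, regression) losses over a pixel list.

    class_probs: per-pixel class probability lists (e.g. softmax outputs).
    labels: per-pixel ground-truth class indices.
    heights_pred / heights_true: per-pixel predicted and reference heights.
    Cross-entropy, L1 error, and the weighting are assumed choices.
    """
    n = len(labels)
    classification = sum(
        -math.log(probs[label]) for probs, label in zip(class_probs, labels)
    ) / n
    regression = sum(
        abs(pred - true) for pred, true in zip(heights_pred, heights_true)
    ) / n
    total = classification + regression_weight * regression
    return total, classification, regression


# Two toy pixels: one uncertain class prediction, one confident and correct.
class_probs = [[0.5, 0.5], [1.0, 0.0]]
labels = [0, 0]
heights_pred = [2.0, 4.0]
heights_true = [1.0, 2.0]
total, classification, regression = joint_loss(
    class_probs, labels, heights_pred, heights_true
)
```

Returning the two terms separately makes it straightforward to alternate optimization between them, as described above, while still monitoring the combined objective.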