CHAPTER 4: 3D RECONSTRUCTION OF TRANSIENT OBJECTS

4.1 Approach

4.1.2 Voting-based Scale Estimation

At this point, we have obtained an initial absolute depth estimate for each person relative to the camera that observes them. Next, the method estimates an initial placement of the detections into the reconstruction space, while at the same time obtaining an initial absolute scale estimate for the scene. If the scene scale $s$ (e.g., the length of 1 meter in the reconstruction space) were known, the 3D neck point of person $i$ in the reconstruction space could be calculated as

$$P_i(s) = s R_i^T N_i + C_i, \tag{4.4}$$

where $N_i \in \mathbb{R}^3$ is the estimated 3D position of the neck point relative to the observing camera, $R_i \in \mathbb{R}^{3 \times 3}$ is the scene-to-camera rotation matrix, and $C_i \in \mathbb{R}^3$ is the 3D position of the camera in the reconstruction space.
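
As an illustration of Eq. (4.4), the following minimal Python sketch places a neck point into the reconstruction space. The function name `place_neck_point` and the plain-list representation of vectors and matrices are assumptions made for this example; they are not taken from the actual implementation.

```python
def place_neck_point(s, R, N, C):
    """Sketch of Eq. (4.4): P(s) = s * R^T N + C.

    s: scene scale (reconstruction units per meter)
    R: 3x3 scene-to-camera rotation, as a list of rows
    N: neck point in the camera frame, in meters
    C: camera position in the reconstruction space
    """
    # Component k of R^T N is the dot product of column k of R with N.
    Rt_N = [sum(R[r][k] * N[r] for r in range(3)) for k in range(3)]
    return [s * Rt_N[k] + C[k] for k in range(3)]


# Hypothetical example: identity rotation, person 5 m in front of the camera.
identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
print(place_neck_point(2, identity, [0, 0, 5], [1, 2, 3]))
```

Increasing $s$ moves the placed point further from $C_i$ along the same viewing ray, which is the property the later scale voting relies on.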

In principle, $s$ could be determined from a known absolute distance between two points in the reconstruction space, e.g., the width of a building or the distance between two cameras. Alternatively, if the cameras were synchronized, an individual could be triangulated from detections in multiple views, and the scale could be chosen as that which best matches this 3D point. Lacking known distances, I propose to instead leverage approximate semantic triangulation. The idea here is that, given enough input images, and especially in well-traveled areas, there is a high probability that at least two individuals in different images will be observed in nearby locations, and at similar heights above the ground. The method samples a range of scale hypotheses for the 3D reconstruction and scores each based on the observed person correspondences.

Pairwise Approximate Triangulation: More explicitly, consider the 3D neck placements $P_i(s)$ and $P_j(s)$ (Eq. (4.4)) for two individuals at some scene scale $s$. Recall that, by convention, the $y$ axis defines the vertical span of the scene, and the $xz$ plane defines the horizontal space. Two individuals are identified as standing “nearby” if they are within some fixed absolute distance $\tau_{xz}$ in the horizontal space. In addition, the individuals are said to be standing at similar heights if their neck points are within some fixed absolute distance $\tau_y$ in the vertical space. Taking $\Delta P_{ij}(s) = P_i(s) - P_j(s)$, let $M_{ij}(s)$ denote the binary indicator function that determines whether persons $i$ and $j$ are approximately triangulated at scale $s$:

$$M_{ij}(s) = \left( \|\Delta P_{ij}^{xz}(s)\| < s\tau_{xz} \right) \wedge \left( |\Delta P_{ij}^{y}(s)| < s\tau_y \right), \tag{4.5}$$

where $\|\Delta P_{ij}^{xz}(s)\|$ and $|\Delta P_{ij}^{y}(s)|$ denote the horizontal and vertical distances between the neck points, respectively.
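
The pairwise indicator $M_{ij}(s)$ can be sketched in a few lines of Python. The function name `approx_triangulated` is hypothetical, and the default thresholds are the values of 1.5 m and 0.1 m reported later in this section; the points are assumed to be already placed in the reconstruction space, with $y$ as the vertical axis.

```python
import math


def approx_triangulated(P_i, P_j, s, tau_xz=1.5, tau_y=0.1):
    """Sketch of the pairwise indicator M_ij(s).

    P_i, P_j: neck points [x, y, z] in reconstruction units (y is vertical)
    s: scene scale (reconstruction units per meter)
    tau_xz, tau_y: horizontal and vertical thresholds in meters
    """
    dx, dy, dz = (a - b for a, b in zip(P_i, P_j))
    horizontal = math.hypot(dx, dz)  # distance in the xz plane
    vertical = abs(dy)               # height difference
    # The thresholds are given in meters, so they are scaled by s
    # to compare against distances measured in reconstruction units.
    return horizontal < s * tau_xz and vertical < s * tau_y
```

Because the thresholds are multiplied by $s$, the test is always applied at the same metric tolerance regardless of the hypothesized scale.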

The value $M_{ij}(s)$ is computed for all pairs of detected people in separate images. An individual is successfully triangulated at scale $s$ if any pairwise approximate triangulation was successful, and if they satisfy a visibility constraint ($V_i(s)$, explained below):

$$M_i(s) = V_i(s) \wedge \left( \bigvee_j (I_i \neq I_j) \wedge V_j(s) \wedge M_{ij}(s) \right), \tag{4.6}$$

where $I_i$ denotes the image in which person $i$ was detected.
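
To make the logical structure of Eq. (4.6) concrete, the sketch below evaluates the per-person indicator given precomputed inputs. The names `person_triangulated`, `image_ids`, `visible`, and `pair_match` are all assumptions for illustration; `pair_match(i, j)` stands in for the pairwise test $M_{ij}(s)$ at a fixed scale.

```python
def person_triangulated(i, image_ids, visible, pair_match):
    """Sketch of Eq. (4.6) for person i at one fixed scale.

    image_ids: image_ids[k] is the image in which person k was detected
    visible:   visible[k] is the visibility indicator V_k(s)
    pair_match: callable (i, j) -> bool giving M_ij(s)
    """
    if not visible[i]:
        return False
    # Person i needs at least one visible partner from a different image.
    # j == i is excluded automatically because the image ids are equal.
    return any(
        image_ids[j] != image_ids[i] and visible[j] and pair_match(i, j)
        for j in range(len(image_ids))
    )
```

This direct form costs one pass over all detections per person; a spatial index could reduce that, but the sketch keeps the one-to-one correspondence with the equation.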

Visibility Constraint: An important constraint in the scale estimation is that the line segment from $C_i$ to $P_i(s)$ should not intersect with structures such as walls. This constraint may be violated if $s$ is too large, which pushes $P_i(s)$ further from the observing camera. Accordingly, $V_i(s)$ is an indicator function denoting whether the detection of person $i$ is possible at scale $s$ given the free space of the static parts of the scene. In practice, $V_i(s)$ is computed by voxelizing the SfM 3D point cloud with a fixed voxel size of one meter ($s$ units in the reconstruction space). Ray-tracing is then performed from $C_i$ along ray $R_i^T N_i$ to compute the first point of intersection with a filled voxel. Denote the distance from $C_i$ to this voxel as $v_i(s)$. $V_i(s)$ is then defined as

$$V_i(s) = \left( s\|N_i\| < v_i(s) \right). \tag{4.7}$$
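
A rough Python sketch of the visibility test in Eq. (4.7) follows. It is a simplification under stated assumptions: the ray is marched in fixed steps of a quarter voxel rather than traversed exactly, so it can miss a voxel that the ray only clips at a corner. The function names and the `max_dist` cutoff are hypothetical, and `ray_dir` is assumed to be the direction $R_i^T N_i$ already expressed in reconstruction coordinates.

```python
import math


def voxelize(points, voxel_size):
    """Return the set of integer voxel indices occupied by the points."""
    return {tuple(math.floor(c / voxel_size) for c in p) for p in points}


def first_hit_distance(filled, C, ray_dir, voxel_size, max_dist):
    """Distance from C to the first filled voxel along ray_dir (approximate)."""
    norm = math.sqrt(sum(d * d for d in ray_dir))
    unit = [d / norm for d in ray_dir]
    step = 0.25 * voxel_size  # assumed step size; smaller is more accurate
    for k in range(1, int(max_dist / step) + 1):
        t = k * step
        key = tuple(math.floor((C[a] + t * unit[a]) / voxel_size) for a in range(3))
        if key in filled:
            return t
    return math.inf  # nothing hit within max_dist


def is_visible(s, N, C, ray_dir, points, max_dist):
    """Sketch of Eq. (4.7): V(s) = (s * ||N|| < v(s))."""
    filled = voxelize(points, s)  # one-meter voxels are s units wide
    v = first_hit_distance(filled, C, ray_dir, s, max_dist)
    return s * math.sqrt(sum(n * n for n in N)) < v
```

For a person 5 m from the camera and a wall roughly 10 reconstruction units away, the test passes at $s = 1$ but fails at $s = 2$, where the placed point would land at or beyond the wall.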

Scale Scoring: A hypothesized scale $s$ is scored by taking a weighted aggregate of all $M_i(s)$:

$$S(s) = \sum_i w_i M_i(s). \tag{4.8}$$

Setting $w_i = 1$ is equivalent to counting the successfully triangulated individuals at scale $s$. I have experimentally found slightly better performance by weighting individuals by the inverse of the number of detections in their associated image, i.e., $w_i = 1/N_{I_i}$, where $N_{I_i}$ is the total number of detections in image $I_i$. This weighting mitigates the ambiguity of person placement in crowded areas, where incorrect scales can still yield valid triangulations due to the overall person density.

Figure 4.3: Scale scoring curve for a model of the Pantheon. The peak is chosen as the initial scale estimate. (Horizontal axis: scale, in reconstruction units per meter; vertical axis: score.)
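
The weighted score of Eq. (4.8) with $w_i = 1/N_{I_i}$ can be sketched as follows. The function name `scale_score` and its two parallel-list inputs are assumptions made for this example.

```python
from collections import Counter


def scale_score(image_ids, triangulated):
    """Sketch of Eq. (4.8) with weights w_i = 1 / N_{I_i}.

    image_ids:    image_ids[i] is the image in which person i was detected
    triangulated: triangulated[i] is the indicator M_i(s) at one scale
    """
    detections_per_image = Counter(image_ids)  # N_I for every image I
    return sum(
        1.0 / detections_per_image[img]
        for img, ok in zip(image_ids, triangulated)
        if ok
    )


# Two people in image 0 (one triangulated), one triangulated person in image 1.
print(scale_score([0, 0, 1], [True, False, True]))  # 0.5 + 1.0 = 1.5
```

With this weighting, each image contributes at most 1 to the score, so a single crowded image cannot dominate the vote.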

Finally, an initial voting-based estimate of the scene scale is obtained by sampling a range of possible scales and selecting the scale hypothesis with the highest score $S(s)$. For purposes of implementation, this range is generated by assuming that the vertical span of the SfM point cloud is between 1 and 1000 meters. The method starts at the smallest possible scale and tests all scales in the range, stepping at 2% increments in $s$. Also at this stage, the approach only considers individuals having all five torso joints detected with at least 30% confidence. I use absolute horizontal and vertical thresholds of $\tau_{xz} = 1.5$ m and $\tau_y = 0.1$ m; an example scoring curve using these parameters is shown in Fig. 4.3. In practice, I have found that the voting approach is not too sensitive to the value of these parameters; the “nearby” and “similar height” heuristics should merely reflect how a pedestrian might characterize these terms for someone passing them on the street. Besides, the main point of this stage is to obtain an approximate initial scene scale, and I show in my experiments that the method can tolerate an initial scale error of at least 15%.
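
The sampling loop described above might look like the following sketch. Two points are assumptions rather than statements from the text: that the scale is expressed in reconstruction units per meter, so a point cloud spanning $h$ units vertically implies $s = h / \text{(span in meters)}$, and that the 2% increments are multiplicative (each hypothesis is 1.02 times the previous one). The function names and the `score_fn` callable are hypothetical.

```python
def candidate_scales(vertical_span_units, min_m=1.0, max_m=1000.0, step=0.02):
    """Generate scale hypotheses (reconstruction units per meter).

    The smallest scale corresponds to the point cloud spanning max_m meters,
    the largest to it spanning min_m meters.
    """
    s = vertical_span_units / max_m
    s_max = vertical_span_units / min_m
    scales = []
    while s <= s_max:
        scales.append(s)
        s *= 1.0 + step  # assumed multiplicative 2% step
    return scales


def best_scale(scales, score_fn):
    """Return the hypothesis with the highest score S(s)."""
    return max(scales, key=score_fn)


# Hypothetical usage with a toy score that peaks near s = 1.
scales = candidate_scales(10.0)
print(best_scale(scales, lambda s: -abs(s - 1.0)))
```

Under the multiplicative reading, covering three orders of magnitude takes on the order of 350 hypotheses, and the chosen peak is at most about 1% away from any scale in the range, well inside the 15% tolerance mentioned above.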
