CHAPTER 2: RELATED WORK
2.1 Touch Sensing Technology
3.2.2 Runtime Touch Detection
3.2.2.1 Projection Space Touch Detection
3.2.2.1.1 Algorithm
Suppose that C cameras capture images of an actual touch or hover event (i.e. not noise or erroneous contours), but it is unknown whether they correspond to a touch or to a hover. Let E denote this event, and let ECi = {xCi
1 , x Ci
2 , . . . , xCni} denote the pixels in the camera contour of camera Ci,
where 1 ≤ i ≤ C (Figure 3.7). Via the lookup table, each camera pixel xCi
j can be converted to
a 3D point XTCH3
Ci,j on the touch surface S; let X
Ci = {XTCH3 Ci,1 , X TCH3 Ci,2 , . . . , X TCH3 Ci,n } be the set of
3D points on S from camera Ci’s contour. Furthermore, for the present discussion, we assume
that the lookup table contains perfect correspondence information. If E represents a touch, then the actual 3D points corresponding to the user’s finger are located on the touch surface S, which implies that the sets {XC1, XC2, . . . , XCC} of the 3D correspondences of observed camera pixels
comprise 3D points on the same 3D region of S. If E instead represents a hover, then these 3D user input points are located off the surface. However, by definition, all of the camera-to-3D points XTCH3
Ci,j are constrained to the surface, and so the sets X
different (but potentially overlapping) 3D regions ofS. This is exactly equivalent to the behavior shown in Figures 3.8 and 3.9.
To simplify determining whether and to what extent the sets XCi overlap, let us instead consider
the conversions of ECi from camera space to projector space: that is, let UP Ci = {u
P 1, u
P
2, . . . , uPn}
be the projector contour pixels in projector P corresponding to the camera contour pixels of cam- era Ci. Since the camera pixel xCji corresponds to the 3D touch surface point XTCH
3
Ci,j , and x Ci j
corresponds to projector pixel uPj, it follows that uPj corresponds to XTCH3
Ci,j . The same conclusion
about the contour overlap for touch events versus hovers applies here, but the advantage is that this holds in a two-dimensional coordinate space (PRO2), where such computations are more direct.
In other words, if E represents a touch, the sets {UCP
1, U P
C2, . . . , U P
CC} will comprise the same 2D
regions in the coordinate space of projectorP, whereas if E represents a hover, these sets will com- prise different (but potentially overlapping) 2D regions. This holds for the entire set of projectors {P1, P2, . . . , PP}.
Compared to the analysis of the 3D regions XCi, this is a much simpler operation that can be
performed directly on the pixels of images. To compute the amount of overlap, we create the projector response mask RP with the same dimensions as the projector P, where RP
u,v is set to be
the number of camera-to-projector contours containing pixel (u, v)P. More formally,
RPu,v = C X i=1 1 if (u, v)P ∈ UP Ci 0 if (u, v)P 6∈ UP Ci (3.3)
If camera Ci contains some pixel (x, y)Ci that corresponds to projector P pixel (u, v)P, so that
(u, v)P ∈ UP
Ci, we say that camera Ci contributes(u, v)
P. Thus, informally, the projector response
mask RP encodes the number of cameras that contributed each projector pixel through its camera- to-projector contour. An example is shown in Figure 3.12.
Based on this formulation, RP
u,v ≥ 1 for every projector pixel for which at least one camera-
to-projector contour contains the pixel (u, v)P, and RP
u,v = 0 if no camera-to-projector contour
contains this pixel. Thus, we can represent the aforementioned conclusion regarding touch versus hover imagery as the following: E contains a touch if RPu,v 6= 0 =⇒ RP
u,v = C for all pixels
(u, v)P. This means that all camera-to-projector contours are identical, which implies that the corresponding 3D points on the touch surface S exist on the same 3D region of S. If this statement does not hold—that is, ∃(u, v)P | (RP
u,v ≥ 1 ∧ RPu,v < C)—then these 3D points occupy distinct
(but potentially overlapping) regions of S and thus represent a hover.
1 1 1 (a) 1 1 1 2 2 2 3 (b) 1 1 1 2 2 2 3 (c)
Figure 3.12: Example projector response masks, consisting of potentially overlapping camera-to- projector contours. In each example, the number of cameras contributing to each contour is shown. (a) The three camera-to-projector contours show no overlap, suggesting a hover. (b) The contours show some degree of overlap. (c) The contours show a high degree of overlap, suggesting a touch.
In practice, we must loosen this requirement. First, due to the relationships between the cameras, a projector P, and the touch-sensitive surface, it may be the case that not all C cameras have a correspondence at pixel (u, v)P. Instead, we consider the number of cameras that are known to have correspondences at (u, v)P, denoted N (u, v)P, where N (u, v)P ≤ C. This value can be computed directly from the lookup table, which encodes missing correspondences as null entries. Now, we state that E contains a touch if RPu,v 6= 0 =⇒ RP
u,v = N (u, v)P. Informally, this
Furthermore, we cannot assume that the lookup tables contain perfect correspondences: in practice, the camera-to-projector contours of an actual touch may exhibit a small amount of inconsistency, being composed of several close overlapping regions instead of a single consistent one, due to inac- curacies in calibration, lookup table construction, or infrared imagery segmentation. However, we can still expect camera-to-projector contours from hovers to show a higher degree of dissimilarity. Instead of asserting strict equality among the contours, we employ a weighted scoring scheme that considers how they overlap. If a large number of cameras—but not all—contributed a large number of projector response mask pixels—but not all—this is still evidence of a touch event. However, if only one or two cameras out of a possible four contributed each projector pixel, this suggests a hover. Thus, we consider the ratio of projector response mask pixels contributed by i cameras to the number of pixels contributed by at least one camera (i.e. the area of the projector response mask contour in pixels).
More formally, we create a weight vector w = {w1, w2, . . . , wC}, where weight wi applies to
regions of the projector response mask RP for which i cameras contained a contour pixel with a correspondence in projector P. Let A(RP) refer to the number of nonzero pixels in RP, and let
A(RP, i) be the number of pixels in the response mask equal to i, meaning i cameras contributed.
Finally, let N (RP) ⊆ {1, 2, . . . , C} be the set of nonzero values in the response mask—that is, i ∈ N (RP) =⇒ ∃(u, v)P | RP
u,v = i. Thus, the maximum number of cameras that could
contribute to the projector response mask is given by max N (RP). The multi-camera agreement scoreS(RP) is defined as S(RP) = 0 N (RP) = ∅ N (RP) X i wi A(RP, i) A(RP) N (RP) X i wi N (RP) 6= ∅ (3.4)
Informally, this represents:
• the weighted sum of the ratio between the number of pixels contributed by i cameras (repre- sented by A(RP, i)) to the number of pixels contributed by at least one camera (A(RP)),
• divided by the sum of the weights.
This provides a confidence score in the range [0, 1] for a projector response mask. In the event that all C camera-to-projector contours are identical, then N (RP) = {C}, A(RP, C) = A(RP), and the multi-camera agreement is equal to 1. If no cameras contributed any pixels to the projector response mask, the score is set to 0. As the size of regions for which the projector response mask is less than max N (RP) increases, the multi-camera agreement decreases as dictated by the weight values. For example, if w1 < w2 < · · · < wC, then pixels for which the projector response mask is
equal to 2 contribute less to the score than those for which the projector response mask is equal to 3. This places a penalty on regions for which only a small number of cameras contributed projector response mask pixels and gives higher weight to regions with a larger number of contributing cameras. Thus, higher scores provide higher confidence of an observation of a touch, while lower scores suggest a hover.