POSTRACK: A Low Cost Real-Time Motion Tracking System for VR Application


Jaeyong Chung, Namgyu Kim, Gerard Jounghyun Kim, and Chan-Mo Park

VR Laboratory, Department of Computer Science and Engineering, Pohang University of Science and Technology

San 31, Hyoja-dong, Nam-gu, Pohang, Kyung-buk, 790-784, Korea

Abstract

One of the obstacles to the proliferation of VR technology into digital content is the expensive, intrusive, cumbersome and brittle nature of the sensors required to detect the user’s intent. While optical tracking, being wireless with respect to the user’s body, has been regarded as one solution to this usability problem, the traditional problems of establishing marker correspondence and resolving occlusions in real time still remain. One line of effort addresses these problems by simply adding more hardware (or processing power), which makes the tracking system too expensive for general use, while another looks to track human body parts directly, but usually suffers from the inability to track point features. In this paper, we present a relatively inexpensive (e.g. it runs on a high-end PC) but reasonably accurate real-time optical motion tracking system, called “POSTRACK”, that can still find a wide range of applications in VR. POSTRACK uses four cheap 8-bit grayscale cameras fitted with infrared LEDs, and the user wears several (1~5) highly reflective markers. The markers are “designed” to be very easy to wear (snap-on), and even fashionable. The four cameras are calibrated using well-known algorithms, and the initial marker assignments are found by a simple heuristic based on normal human posture, while the fundamental matrices are used to find blob correspondences and compute the 3D positions of the markers. Since the snap-on markers are rather large and their positions are computed from their image centroids (which constantly change), the computed 3D position data are jittery, more so than with trackers that use much smaller markers or with magnetic trackers. While the technology is old (i.e. it uses standard vision and stereo algorithms), the engineering of POSTRACK strikes a good balance between cost and capability, and we demonstrate this through results from using it for gesture motion recognition and orientation tracking for 3D pointing.

1. Introduction

One of the obstacles to the proliferation of VR technology into digital content is the expensive, intrusive, cumbersome and brittle nature of the sensors required to detect the user’s intent. While optical tracking, being wireless with respect to the user’s body, has been regarded as one solution to the usability problem, the traditional problems of establishing marker correspondence and resolving occlusions in real time remain difficult. One approach to solving these problems is to employ expensive hardware (e.g. wireless magnetic trackers, high-resolution cameras, etc.) and/or increase the computing power, which makes the tracking system too expensive for general usage and everyday applications [10][11]. Another major approach looks to eliminate the need for markers by directly tracking parts of the body, for instance the hands, face, and limbs; however, this requires further processing for feature detection and usually suffers from the inability to track point features [6][7].


We present a relatively inexpensive (e.g. it runs on a high-end PC with a minimal hardware set-up) but reasonably accurate (e.g. within a few centimeters) optical motion tracking system, called “POSTRACK”, that can still find a wide range of applications in VR. While the technology is old (i.e. it uses standard vision and stereo algorithms), we believe that the engineering of the proposed tracking system strikes a good balance between cost and capability. We demonstrate this by applying POSTRACK to two different VR applications, namely gesture motion recognition and object selection, and by evaluating its utility (performance/usability vs. cost) against the popular (but expensive) magnetic tracker.

2. Related Work: Vision based Tracking

Lately, there has been rekindled interest in vision-based tracking for the next generation of user interfaces. Vision-based tracking used to suffer from the classical problems of establishing marker or feature correspondences and handling occlusion, which can now be partially resolved in real time by using fast computers. The ever-increasing computing power of desktop computers seems to be the major cause of this revived interest. The most popular 3D trackers used in VR applications are the magnetic and ultrasonic types, both of which are usually cumbersome to use because they are not wireless (wireless versions are much more expensive) and, moreover, are still too expensive for everyday desktop applications (ranging from several hundred to a few hundred thousand dollars). Most commercial motion capture systems employ vision-based methods because they offer wireless convenience, but they require a large number of special high-precision infrared cameras and heavy computing power to reach the accuracy and speed needed for professional animation production, the main application area of motion capture [10][11]. In terms of both cost and set-up, these systems are not fit for general VR interfaces, which need only several tracking points and lower accuracy.

Another major approach looks to eliminate the need for markers by directly tracking parts of the body, for instance the hands, face, and limbs; however, this requires further processing for feature detection and usually suffers from the inability to track point or detailed features [6][7]. As both cameras and computers become cheaper relative to their capabilities, many augmented/virtual reality systems are starting to build their own (IR-based) optical tracking frameworks with reasonable cost, accuracy and speed [1][2][3][5]. Our work shares the same goal.

3. POSTRACK

POSTRACK runs on a standard PC with an 800 MHz Pentium III processor and uses four cheap 8-bit grayscale cameras fitted with infrared LEDs. We opted for four cameras to account for possible marker occlusion while keeping the overall cost relatively low (two cameras, or more than four, could also have been used). For video capture, we use a PCI frame grabber that can acquire four channels of images from four NTSC analog cameras at about 24 frames per second. For tracking, the user wears one or more retro-reflective markers. As a marker has no orientation of its own, tracking one marker amounts to position tracking only; orientation tracking is possible by tracking more than two markers and calculating their relative positions. The markers are made of a material called ScotchLite, whose reflectance is about a thousand times higher than that of everyday materials. The markers are also “designed” to be very easy to wear (reflective material wrapped around a flexible snap-on metal band), and even fashionable (5 markers cost about $8). Figure 1 shows the IR camera and markers.

After calibrating the cameras (described in the next section), the 2D centers of gravity of the markers (in the respective image spaces) are calculated. The matching markers among the four captured images are found using the epipolar constraint, and then the 3D positions of the markers are computed. When using multiple markers, the marker assignments (e.g. marker 1 is for the right ankle) are maintained by a prediction-based algorithm after a heuristic initialization at the beginning of tracking.
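The centroid step can be illustrated with a minimal sketch. It assumes a grayscale image stored as a list of rows of 0-255 integers, the threshold of about 248 mentioned in Section 3.2, and 4-connected blobs; the function name `blob_centroids` is hypothetical, and the median filtering step is omitted for brevity.

```python
from collections import deque


def blob_centroids(image, threshold=248):
    """Return the (x, y) centroid of each 4-connected blob of bright pixels.

    `image` is a list of rows of gray levels; a pixel belongs to a blob
    when its value is strictly above `threshold` (an assumed convention).
    """
    height, width = len(image), len(image[0])
    seen = [[False] * width for _ in range(height)]
    centroids = []
    for y0 in range(height):
        for x0 in range(width):
            if seen[y0][x0] or image[y0][x0] <= threshold:
                continue
            # Flood-fill one blob while accumulating its pixel coordinates.
            queue = deque([(x0, y0)])
            seen[y0][x0] = True
            sum_x = sum_y = count = 0
            while queue:
                x, y = queue.popleft()
                sum_x += x
                sum_y += y
                count += 1
                for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
                    if (0 <= nx < width and 0 <= ny < height
                            and not seen[ny][nx]
                            and image[ny][nx] > threshold):
                        seen[ny][nx] = True
                        queue.append((nx, ny))
            centroids.append((sum_x / count, sum_y / count))
    return centroids
```

Because the centroid is an average over every pixel in the blob, a change in the blob outline from frame to frame shifts the result, which is one way to see why large markers give jittery positions.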

3.1. Camera Calibration

Camera calibration is carried out using the calibration functions of Intel’s Open Source Computer Vision Library (OpenCV) [4]. The OpenCV calibration functions calculate the intrinsic and extrinsic camera parameters by having the four cameras observe known visual features, for example the vertices of a black and white pattern image. The calibration process operates by numerical optimization, and in our case it works well (converges to a good solution) because the approximate locations and orientations of the cameras are already known (at the top four corners of a rectangular volume, see Figure 2). Slight modifications were made to the OpenCV calibration functions to make them work with multiple cameras and black and white images.

3.2. Marker Tracking

When lit by the infrared LEDs, the markers appear as white blobs in the four images obtained by the capture board. Therefore, after a simple thresholding operation (e.g. keeping only pixels whose gray level is higher than about 248, 255 being the maximum), only the marker images are left. A median filter is applied to remove any scattered noise. Figure 3 shows two markers (on the user’s wrists) seen from the four cameras (before thresholding). The epipolar constraint states that the corresponding marker in the other image must lie somewhere on the epipolar line (in Figure 3, the epipolar line for one marker is shown in green and the other in red). Using this constraint, we can find the matching markers by evaluating the following equation, which represents the epipolar constraint [12].

Fig. 1. (a) A camera mounted with IR LEDs and an IR filter. (b) Retro-reflective markers (upper: snapped closed; lower: opened). (c) A user wearing five markers on the ankles, wrists and belly.


$p_l^{T} F p_r = 0$, where

F is called the fundamental matrix and can be obtained as $F = M_r^{-T} E M_l^{-1}$. $p_l$ is the position of the marker in one image and $p_r$ is the position of the same marker in the other image, both in pixel coordinates. $M_r$ and $M_l$ are the matrices of the intrinsic parameters of the two cameras.

E is called the essential matrix that establishes the link between the epipolar constraint and the extrinsic parameters of the stereo system. Since POSTRACK uses four cameras, we consider candidate image pairs that produce the least value for the above equation.
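As a rough illustration of this matching step, the sketch below evaluates the residual of the epipolar constraint for every candidate pair of blobs in two images and keeps the pair with the smallest value. The function names are hypothetical, the fundamental matrix `F` is assumed to be already known from calibration, and the greedy per-blob choice is a simplification (it does not prevent two blobs from picking the same partner).

```python
def epipolar_residual(p_l, F, p_r):
    """Absolute value of p_l^T F p_r for two pixel positions (x, y).

    Both points are lifted to homogeneous coordinates (x, y, 1).
    F is a 3x3 matrix given as a list of three rows.
    """
    a = (p_l[0], p_l[1], 1.0)
    b = (p_r[0], p_r[1], 1.0)
    return abs(sum(a[i] * F[i][j] * b[j]
                   for i in range(3) for j in range(3)))


def match_blobs(left_blobs, right_blobs, F):
    """For each left blob, return the index of the right blob whose
    epipolar residual is smallest (greedy, one decision per left blob)."""
    return [
        min(range(len(right_blobs)),
            key=lambda j: epipolar_residual(p, F, right_blobs[j]))
        for p in left_blobs
    ]


# Illustrative F for an idealized, rectified camera pair: the residual
# reduces to |y_r - y_l|, so matching blobs lie on the same image row.
F_RECTIFIED = [[0.0, 0.0, 0.0],
               [0.0, 0.0, -1.0],
               [0.0, 1.0, 0.0]]

left = [(100.0, 50.0), (120.0, 200.0)]
right = [(80.0, 201.0), (70.0, 49.0)]
matches = match_blobs(left, right, F_RECTIFIED)
```

With four cameras, the same residual could be evaluated over each camera pair, keeping the pairing with the least value as the text describes.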

Fig. 2. (a) Four cameras mounted on the ceiling (two front cameras are shown, the other two are in the rear in symmetric positions). (b) The graphical view of the calibration process result.

Fig. 3. Epipolar lines for two markers on the user’s wrists.

The initial marker assignments are found by a simple heuristic based on normal human posture; that is, they are inferred from a known initial pose (e.g. the blob in the lower left portion of the image is the left foot).

4. Application 1: Gesture Motion Recognition

We applied POSTRACK as the input to a gesture motion recognition module that serves as an interface for navigating a 3D virtual environment. This is perhaps the simplest application of POSTRACK, as it tracks only one marker placed on the user’s wrist. The user makes continuous motions by moving the arm, and the marker positions tracked by POSTRACK are used as input to the gesture recognition module. This is a typical demonstration of VR technology that incorporates more natural interaction for higher usability and a more stimulating experience. We modeled four basic motion gestures for navigating in a 3D environment: forward, stop, turn left, and turn right. The motion gestures are shown in Figure 4.


Fig. 4. Motion gestures for VE navigation and the trajectories of the four 3D motion gestures.

4.1. Gesture Recognition

Aside from recognizing the gesture itself, moving gestures create another subproblem: detecting the starting and ending points of the intended gesture in the midst of the position data that are streaming in. One simple solution is to define a “still” state and, for instance, require the user to be stationary for a few seconds to signal the start and the end of a motion command. To avoid such inconvenience, our method looks for a meaningful motion pattern in a stream of data contained within a finite data search window. The data search window starts at a minimum length (e.g. 0.25 seconds, or 5 frames at a 20 Hz sampling rate) from the current frame, and grows to a predefined maximum (e.g. 3 seconds, or 60 frames at a 20 Hz sampling rate). The predefined minimum and maximum lengths of the search window are based on the heuristic that a given gesture command requires at least the minimum amount of time to be carried out and must not exceed the maximum amount of time to be completed.

The varying length of the motion command is handled by normalizing the sampled data. That is, the motion command template is either truncated or elongated to fit the input data size before the match algorithm is applied. Thus, as long as the duration of the motion command is kept within a reasonable bound, it will be recognized. The above search and match process is repeated at every data sampling period (at about 20 Hz). Once a gesture is recognized, the data can be further analyzed for additional input properties such as speed or acceleration. However, being time dependent, this simple algorithm cannot handle gestures that are similar in part, for instance a “C” and an “O” motion: a “C” gesture would be recognized in the midst of giving an “O” gesture.
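The search-window growth and the length normalization can be sketched as follows. This is only an illustration under stated assumptions: the helper names `candidate_windows` and `resample` are hypothetical, the window sizes are the example values from the text (5 to 60 frames at 20 Hz), and linear interpolation is used here as one possible way to “truncate or elongate” a template.

```python
def candidate_windows(stream_len, min_len=5, max_len=60):
    """Yield (start, end) index pairs for windows that end at the newest
    sample and grow from min_len to max_len frames."""
    end = stream_len
    for length in range(min_len, min(max_len, stream_len) + 1):
        yield end - length, end


def resample(template, n):
    """Linearly resample a list of (x, y, z) samples to exactly n samples.

    Assumption: interpolation stands in for the paper's unspecified
    truncation/elongation of the template.
    """
    if n == 1 or len(template) == 1:
        return [template[0]] * n
    scale = (len(template) - 1) / (n - 1)
    out = []
    for k in range(n):
        pos = k * scale
        i = min(int(pos), len(template) - 2)  # left neighbor index
        t = pos - i                           # fractional offset in [0, 1]
        out.append(tuple(a + t * (b - a)
                         for a, b in zip(template[i], template[i + 1])))
    return out
```

At each sampling period, every window produced by `candidate_windows` would be compared against each template resampled to that window's length, so the matcher always compares sequences of equal size.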

Given a data search window, a particular motion gesture is found through a correlation-based match algorithm. Correlation analysis measures how well a regression fits the sampled data. The formula shown in Figure 6 computes a measure of the quality of the fit between the input and the template motion data, and the correlation coefficient is computed separately in the x, y, and z dimensions. If the correlation coefficient is higher than a predefined threshold (e.g. 0.85, with +1 representing a perfect correlation), a match is deemed to have occurred. By using correlation analysis, the correct classification rate is made much less sensitive to slight tracking errors. In addition, the recognition is independent of the size and location of the gesture, so there is no need to normalize the data spatially.

Fig. 5. Searching for a match in the data search window.

$$
\mathrm{Corr}_m(u) = \frac{n \sum_{i=0}^{n} T_m(i)\, I_m(u+i) - \sum_{i=0}^{n} T_m(i) \sum_{i=0}^{n} I_m(u+i)}{\sqrt{\left[ n \sum_{i=0}^{n} T_m(i)^2 - \left( \sum_{i=0}^{n} T_m(i) \right)^2 \right] \left[ n \sum_{i=0}^{n} I_m(u+i)^2 - \left( \sum_{i=0}^{n} I_m(u+i) \right)^2 \right]}}
$$

where m = x, y, z; u = current time index - n; T = template data; I = input data.

Fig. 6. Computing the correlation coefficient.
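A minimal sketch of the per-axis correlation test is given below. It uses the standard Pearson form with n taken as the number of samples in the window. Two details are assumptions rather than facts from the paper: a match requires all three axes to exceed the threshold, and an axis with no variation is treated as uncorrelated. The function names are hypothetical.

```python
import math


def axis_correlation(t, s):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(t)
    sum_t, sum_s = sum(t), sum(s)
    sum_ts = sum(a * b for a, b in zip(t, s))
    sum_tt = sum(a * a for a in t)
    sum_ss = sum(b * b for b in s)
    numerator = n * sum_ts - sum_t * sum_s
    spread = (n * sum_tt - sum_t ** 2) * (n * sum_ss - sum_s ** 2)
    if spread <= 0:
        # Assumed convention: a constant axis gives no evidence of a match.
        return 0.0
    return numerator / math.sqrt(spread)


def is_match(template, window, threshold=0.85):
    """True when the x, y and z correlations all exceed the threshold.

    `template` and `window` are equal-length lists of (x, y, z) samples.
    """
    return all(
        axis_correlation([p[m] for p in template],
                         [q[m] for q in window]) > threshold
        for m in range(3)
    )
```

Because the coefficient is unchanged when a sequence is shifted or scaled by a positive factor, a gesture drawn larger or at a different place in the room still correlates with its template, which is consistent with the size and location independence noted above.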

4.2. Comparative System Performance

We tabulated and compared the gesture recognition rates obtained with POSTRACK and with the FASTRAK magnetic tracker. The inputs were captured simultaneously (i.e. the user wore the marker and the magnetic tracker receiver at the same time) for a fair comparison. Figure 7 shows the comparison; there is no significant difference in the recognition rate. Gesture recognition, by its algorithmic nature, is quite insulated from small tracking errors, and thus a system like POSTRACK proves to be a sufficient yet cost-effective data acquisition system for VR.


Fig. 7. The comparative recognition rate between POSTRACK and FASTRAK.

5. Application 2: Object Selection by 3D Pointing

As a somewhat more challenging application, we applied POSTRACK to 3D pointing for object selection in virtual environments. 3D pointing, also known as the virtual pointer metaphor or ray casting [9], is one of the most popular interaction techniques in VR. To extract the orientation of the ray, two markers are tracked, one on the wrist (see Figure 8) and one placed on a switching device (the switch is needed to confirm the selection after pointing).

5.1. 2D Cursor Position Calculation

The direction of the ray is simply the direction vector between the two marker positions. Object selection by ray casting is in fact a 2D interaction technique, because the selection is made on the image plane onto which the 3D object and the end of the ray are projected. The position of the cursor (or the end of the ray) is simply the point where the virtual ray intersects the image plane.
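The cursor computation amounts to a ray-plane intersection, sketched below. The coordinate frame is an assumption made for illustration: the image plane is taken to be the plane z = `plane_z` in tracker coordinates, and the ray is assumed to start at the wrist marker and pass through the marker on the switching device. The function name `cursor_on_plane` is hypothetical.

```python
def cursor_on_plane(wrist, switch, plane_z=0.0):
    """Intersect the ray from `wrist` through `switch` with the plane
    z = plane_z and return the (x, y) hit point, or None if there is none.

    Both markers are (x, y, z) positions in the same units.
    """
    direction = tuple(s - w for w, s in zip(wrist, switch))
    if direction[2] == 0:
        return None  # ray is parallel to the image plane
    t = (plane_z - wrist[2]) / direction[2]
    if t < 0:
        return None  # user is pointing away from the plane
    return (wrist[0] + t * direction[0], wrist[1] + t * direction[1])
```

For example, `cursor_on_plane((0, 0, 200), (10, 5, 180))` follows the direction (10, 5, -20) for t = 10 and lands at (100, 50) on the plane; the result would then be scaled from physical units to screen pixels.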

Fig. 8. Virtual pointing and the marker placement.


5.2. Comparative System Performance

Again, we compared the user-perceived system performance by running a selection test using POSTRACK and the FASTRAK magnetic tracker. We had subjects make pointing selections on six randomly generated objects and measured the object selection time. The same tests were repeated for five different object sizes. Figure 9 shows the result of the comparative performance test. The test was conducted on a 240 cm by 180 cm projection display screen with 800 x 600 pixel resolution. The user was about 208 cm away from the display screen. The size of the level 1 object was 20 x 15 pixels, and that of the level 5 object was 60 x 55 pixels. The results show that the selection performance starts to degrade with level 2 objects (about 15 cm wide). This is comparable to a desktop display situation where the user is located about 40 cm away from the monitor and is looking at a 3 cm wide object (or big icon). For most VR applications, such accuracy is sufficient for all practical purposes. Moreover, the jittery optically tracked data (due to the constantly changing centroids of the irregularly shaped blobs) can be smoothed for better performance.
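The comparison between the projection screen and the desktop situation can be checked by computing the visual angle each object subtends at the viewer’s eye. The short calculation below uses only the distances and widths quoted above; the function name `visual_angle_deg` is hypothetical.

```python
import math


def visual_angle_deg(width_cm, distance_cm):
    """Visual angle, in degrees, of an object of the given width seen
    head-on from the given distance."""
    return math.degrees(2 * math.atan(width_cm / (2 * distance_cm)))


screen_angle = visual_angle_deg(15, 208)   # 15 cm object on the screen
desktop_angle = visual_angle_deg(3, 40)    # 3 cm icon on a monitor
```

Both angles come out slightly above 4 degrees (roughly 4.1 and 4.3), which supports treating the two viewing situations as comparable.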

Fig. 9. The comparative selection time between POSTRACK and FASTRAK.

6. Other Applications: Motion Evaluation for Training and Animation

We have applied POSTRACK to other applications, although we did not compare its performance against the magnetic tracker in these cases. POSTRACK is used in our VR-based motion/dance training system called “Just Follow Me (JFM)” [8]. JFM uses a visual interface called the Ghost metaphor, in which the motion of the trainer is visualized in real time as a ghost moving out of the trainee’s body. The trainee, who sees the motion from a chosen viewpoint, is to “follow” the ghostly master as closely as possible in both timing and position. As a motion training system, it must capture the motion of the user and compare it to the reference motion data. Ideally, we would track all the joints in the user’s body; however, we track only five body points (the wrists, the ankles and the belly), which is sufficient to evaluate the closeness of two human body motions. Like many VR applications, the motion evaluation scheme of JFM requires only approximate motion data, since it is not necessary for the user to reproduce the master’s motion exactly, but only within a certain “tolerable” bound. However, the timing of postures is an important and implicit evaluation criterion, so a fair (~15 Hz) sampling rate is required of the tracking system. POSTRACK can track five markers at about 15 Hz.


POSTRACK is also used for real-time animation, using inverse kinematics to control the limbs of an avatar with a small number of sensors (or markers). Again, a sampling rate of about 15 Hz is required to produce natural-looking animation.


Fig. 10. Just Follow Me: the dance version. As the user tries to imitate a character on the screen, the user’s motion is captured and compared to the reference motion.

7. Conclusion

In this paper, we have presented POSTRACK, a low-cost (less than $400 excluding the PC) real-time optical motion tracking system that achieves a sampling rate of up to about 15 Hz for five markers. As illustrated by the applications introduced in this paper, many VR applications do not require pinpoint (or pixel-level) pointing accuracy, but may still require a sampling rate of at least about 15 Hz so that the tracked object can be rendered smoothly. We believe that the engineering of POSTRACK strikes a good balance between cost and capability. The popular wired magnetic trackers are not only cumbersome to use, but also sensitive to metallic objects, and they often require a custom calibration process. A system like POSTRACK can open doors to creating more stimulating interfaces for many existing applications and bring VR out to the general public.

8. References

[1] Madritsch F., Gervautz M., “CCD-camera based optical beacon tracking for virtual and augmented reality”, Eurographics, vol. 15, no. 3, pp. 207-216, 1996

[2] Dorfmuller K., “Robust tracking for augmented reality using retro-reflective markers”, Computers & Graphics, vol. 23, no. 6, pp. 795-800, 1999

[3] Ribo M., Pinz A., Fuhrmann A. L., “A New Optical Tracking System for Virtual and Augmented Reality Applications”, IEEE Instrumentation and Measurement Technology Conference, Budapest, Hungary, May 21-23, 2001

[4] Intel, “Open Source Computer Vision Library Reference Manual”, 2001

[5] Chang C., Tsai W., “Vision based Tracking and Interpretation of Human Leg Movement for Virtual Reality Applications”, IEEE Trans. on Circuits and Systems for Video Technology, vol. 11, no. 1, 2001

[6] Fujiyoshi H., Lipton A., “Real Time Human Motion Analysis by Image Skeletonization”, Proc. of IEEE Intl. Conf. on the Face and Gesture Analysis, 1998

[7] Wren C., et al., “Pfinder: Real Time Tracking of Human Body”, IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 19, no. 7, pp. 780-785, 1997

[8] Yang U., “Just Follow Me: A VR based Motion Training System”, Emerging Technologies, ACM SIGGRAPH, 2001

[9] Poupyrev I., Weghorst S., “Egocentric Object Manipulation in Virtual Environments: Empirical Evaluation of Interaction Techniques”, Computer Graphics Forum, pp. 41-52, EUROGRAPHICS, 1998

[10] Vicon, http://www.vicon.com/, 2001

[11] MotionAnalysis, http://www.motionanalysis.com/, 2001

[12] Trucco E., Verri A., “Introductory Techniques for 3-D Computer Vision”, Prentice Hall, 1998

9. Acknowledgement

The work presented in this paper has been supported in part by the Korea Ministry of Education’s BK21 Project and the Korea Science and Engineering Foundation (KOSEF), and separately by the KOSEF-supported Virtual Reality Research Center.
