Implementation of a Real System - Bayesian 3D multiple people tracking using multiple indoor ca

In a room that is approximately 6 (W) × 4 (L) × 3 (H) m³, four cameras are installed near the corners of the room, as shown in Figure 7.1. Six microphones are located at the positions shown in Table 3.4.

In the past, a real tracking system using multiple cameras was implemented with only one camera connected to each computer, because the processing power of a single computer was limited. Each computer processed its camera's images and extracted the visual features, which were then collected by a central computer assigned to tracking. The central computer processed the transmitted visual features to extract the final information of interest about the targets [58]. However, the use of multiple computers, together with the necessary interfaces between them, required a great deal of time and effort to function properly.

[Figure 7.2 diagram labels: Presonus Firestation (A/D); 4-channel video capturing board (Videum4400); PC serial ports; IEEE 1394 port; VISCA interface; analog composite video interface; 1394 interface.]

Figure 7.2: The system configuration.

This thesis implements a tracking system based on a single PC. The prototype of this system came from Huang [31], who implemented a PC-based acoustic source localization system using time differences of arrival (TDOAs). This system worked quite well.

In the system developed in this research, all cameras and microphones are connected to the PC, as shown in Figure 7.2. One Videum4400 card, which has four analog video inputs, is installed in the PC to capture the four synchronized camera images. Four SONY EVI-D30 cameras are connected to the Videum4400 card. The control port of one camera is connected to a serial port of the PC, and the control ports of the other cameras are connected in a daisy chain. All the cameras are controlled from the PC through the SONY VISCA control interface.

Six microphones are connected to an audio A/D converter, which captures the acoustic signals synchronously at a sampling rate of 44.1 kHz or 48 kHz and sends them to the PC via an IEEE 1394 interface. The open-source PortAudio library [57] is used to drive this interface and deliver the audio streams to the PC.

The user interface of the system is shown in Figure 7.3. It is divided into two parts: input monitoring and output monitoring. The left side is designed primarily for monitoring the input signals from the cameras and microphones. It also shows the intermediate status of visual feature detection, which indicates how well the visual features are being detected. The three windows in the figure show the results of the Viola-Jones detector and corner detection. The top right of the output monitoring part shows the status of the entire system and rough details of each sub-system. In the figure, the entire system is working in off-line mode using AVI files; Camera 1 and Camera 2 ran both Viola-Jones and corner detection and detected objects, but Camera 3 did not detect any object. The black box in the middle shows the object followed by the extra camera, which is designed to steer toward one object. It does not show any image in the figure because the system is working in off-line mode. The bottom part shows the accumulated x and y locations of the tracked objects since the system was started. The screenshot of the user interface was taken while the system was processing the 46th frame after it started.

The tracking performance is demonstrated using one extra camera, whose control port is connected to a control port on the PC and whose video port is connected to a monitor. During the demonstration, the PC steers the camera to follow one person or the current speaker.
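Steering the extra camera toward a tracked person amounts to converting the estimated 3D location into pan and tilt angles. The following is a minimal sketch of that geometry, not the implementation used in this system. The function name `pan_tilt_to_target` and the coordinate convention (x and y spanning the floor, z as height, pan measured from the +x axis) are assumptions made for illustration.

```python
import math


def pan_tilt_to_target(camera_pos, target_pos):
    """Return (pan, tilt) in degrees that point a camera at a 3D target.

    Assumed convention (hypothetical, not taken from the thesis):
    x and y span the floor, z is height, pan is measured
    counter-clockwise from the +x axis, and tilt is positive upward.
    """
    dx = target_pos[0] - camera_pos[0]
    dy = target_pos[1] - camera_pos[1]
    dz = target_pos[2] - camera_pos[2]
    pan = math.degrees(math.atan2(dy, dx))
    # Tilt is the elevation above the horizontal distance to the target.
    tilt = math.degrees(math.atan2(dz, math.hypot(dx, dy)))
    return pan, tilt
```

In practice the resulting angles would still have to be offset by the mounting orientation of the camera and clipped to its mechanical range before being sent as a control command.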

As an intermediate step, the system saves the video streams from the four cameras in AVI format and the audio data from the six microphones in PCM format. These data are then read by MATLAB for algorithm development.

7.2.1 Synchronization between Cameras and Microphones

The acoustic capturing board and the multiple-video capturing board do not operate synchronously. To implement joint tracking with acoustic and visual data, the two data types must be synchronized. Since neither board embeds a time code in its stream, the synchronization has to be done manually. The initial synchronized starting point for the audio and video streams is obtained by delaying or advancing the audio stream relative to the video frames. This is done by changing the size of the input buffer of the acoustic stream.
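The effect of delaying or advancing the audio stream can be illustrated on recorded data. The sketch below is an off-line analogue of the buffer-size adjustment, not the mechanism used with the capture board: it shifts one channel of samples by a manually chosen offset. The names `offset_in_samples` and `shift_audio` are hypothetical, and the same offset would be applied to all six channels.

```python
def offset_in_samples(offset_seconds, sample_rate):
    """Convert a manually measured audio/video offset to whole samples."""
    return round(offset_seconds * sample_rate)


def shift_audio(samples, offset):
    """Delay (offset > 0) or advance (offset < 0) a list of audio samples.

    Delaying prepends `offset` silent samples; advancing discards the
    first `-offset` samples. This is an illustrative assumption about
    how the alignment could be reproduced on stored data.
    """
    if offset >= 0:
        return [0] * offset + list(samples)
    return list(samples[-offset:])
```

For example, a measured lead of 10 ms at 44.1 kHz corresponds to `offset_in_samples(0.01, 44100)`, that is, 441 samples.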

Once the two kinds of data have been captured by their boards and passed to the feature-detection programs, the video capture frame rate drops automatically if the PC cannot process all the captured frames. Even when the video frame rate decreases, however, the audio interface continues to write the audio stream into the input buffer, which can cause the audio buffer to overflow. To solve this overflow problem, this research deletes a certain number of audio frames, as illustrated in Figure 7.4.
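The frame-deletion step can be sketched as follows. This is one possible policy under stated assumptions, not the rule used in this research: one audio frame is kept per processed video frame, and the remaining audio frames that arrived in the same interval are discarded. The function names and the parameters `frame_len` (samples per audio frame) and `video_fps` (the reduced video frame rate) are hypothetical.

```python
def audio_frames_to_drop(sample_rate, frame_len, video_fps):
    """Audio frames to discard per processed video frame (assumed policy).

    During one video frame interval, sample_rate / video_fps samples
    arrive, i.e. sample_rate / (video_fps * frame_len) audio frames.
    One of them is kept for TDOA estimation; the rest are dropped.
    """
    frames_per_video_frame = sample_rate / (video_fps * frame_len)
    return max(0, int(frames_per_video_frame) - 1)


def select_audio_frames(audio_frames, n_drop):
    """Keep one audio frame, then skip n_drop frames, repeatedly."""
    return audio_frames[::n_drop + 1]
```

With 1024-sample frames at 44.1 kHz and a video rate that has fallen to 10 frames per second, this policy drops three audio frames for each one it keeps. The fractional remainder is ignored here, so a real system would need to account for it to avoid slow drift between the two streams.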

[Figure 7.4 diagram: an audio stream of frames 1, 2, ..., t-1, t, t+1, each marked as used or not used (dropped), aligned against the video frames.]

Figure 7.4: The synchronization between audio streams and video frames.

Each rectangular block in the audio stream in the figure represents one frame to be used for TDOA estimation. If the video frame rate decreases, a pre-calculated number of audio frames, shown as white rectangles in the figure, are deleted. Alternatively, the audio stream could be down-sampled to avoid the overflow, but a high down-sampling rate distorts the input audio signals and prevents accurate estimation of the TDOAs.
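One way to see part of the cost of down-sampling is through the time resolution of a TDOA estimate. The sketch below is an illustrative calculation, not an analysis from this thesis. It assumes that TDOAs are resolved to whole sample lags without interpolation and that the speed of sound is about 343 m/s; the function name `tdoa_resolution` is hypothetical.

```python
SPEED_OF_SOUND = 343.0  # m/s in air at about 20 degrees C (assumed value)


def tdoa_resolution(sample_rate, factor=1):
    """Return (time step in s, path-difference step in m).

    `factor` is the down-sampling factor. Assumes TDOAs are resolved
    only to whole sample lags, with no sub-sample interpolation.
    """
    dt = factor / sample_rate
    return dt, SPEED_OF_SOUND * dt
```

Under these assumptions, one sample lag at 44.1 kHz corresponds to a path difference of about 7.8 mm, and down-sampling by a factor of four coarsens this to about 31 mm, which is significant in a room only a few meters across.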
