8.1 Future work
8.1.2 The pedestrian detection domain
Pedestrian detection has made considerable progress in the last decade, especially in terms of accuracy. A remaining problem of the pedestrian detection techniques we use is that top performance is reached on only a few datasets, and thus in a limited context. The performance of a detector is therefore limited by the data it is trained on (training set bias). We experienced this when comparing the performance of ACF trained on INRIA and on Caltech in figure 5.14.
This makes it difficult to choose which model to use in a real-life application. In the previous subsection, we already introduced the possibility of combining models trained on different datasets, but there too we use information from a single dataset to determine the Confidence and Complementarity values. Although Deep Learning techniques are known to be able to generalise over a very large dataset, their computational cost is too high for sliding window evaluation. Therefore, in most cases a "traditional" pedestrian detection approach is used to propose a number of windows, which are then classified by the Deep Learning classifier. The flaws of the baseline detector will therefore also be present here. A solution to this bias should be found, making it possible to create a pedestrian detector whose model "perfectly" generalises the representation of a pedestrian. A naive solution could be to train a detector on the training data of multiple datasets, e.g. both INRIA and Caltech. We would not expect such a "general" detector to reach the same performance as a detector trained specifically for one kind of images, but this should be tested to know for sure.
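The inheritance of baseline flaws in such a two-stage pipeline can be illustrated with a minimal sketch. The box format, the scores, the thresholds and the function name `two_stage_detect` are all hypothetical and chosen for illustration only; no real detector or Deep Learning classifier is involved.

```python
from typing import Callable, List, Tuple

Box = Tuple[int, int, int, int]  # (x, y, width, height); hypothetical box format


def two_stage_detect(
    proposals: List[Tuple[Box, float]],
    rescore: Callable[[Box], float],
    proposal_threshold: float,
    final_threshold: float,
) -> List[Tuple[Box, float]]:
    """Rescore baseline proposals with a second-stage classifier."""
    # Stage 1: only windows the baseline detector scores high enough survive.
    kept = [box for box, score in proposals if score >= proposal_threshold]
    # Stage 2: the (expensive) classifier only ever sees the surviving windows.
    rescored = [(box, rescore(box)) for box in kept]
    return [(box, score) for box, score in rescored if score >= final_threshold]


# Toy data (assumption): the baseline scores a true pedestrian low and clutter high.
pedestrian: Box = (40, 20, 32, 64)
clutter: Box = (200, 30, 32, 64)
baseline_proposals = [(pedestrian, 0.1), (clutter, 0.8)]

# Hypothetical second-stage scores: this classifier would get both windows right.
second_stage_scores = {pedestrian: 0.95, clutter: 0.05}

detections = two_stage_detect(
    baseline_proposals,
    rescore=second_stage_scores.__getitem__,
    proposal_threshold=0.5,
    final_threshold=0.5,
)
print(detections)  # the pedestrian is missed: it never reached stage 2
```

In this toy setting the second stage removes the false positive of the baseline, but it cannot recover the pedestrian that the baseline failed to propose. Lowering `proposal_threshold` recovers it at the cost of classifying more windows.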
Since each detector is trained on only a sub domain (a limited dataset) of the problem domain (all possible scenarios with pedestrians), it seems impossible to generalise its performance on a sub domain to the whole domain. In the future work of our combination approach [26], we suggested determining the combination parameters based on features retrieved from the image, or even selecting the detector to use for each window to classify. For example, if a detector performs well on highly textured or high-contrast images, it can be assigned a higher confidence in a combination, or be the detector of choice. Having information about the strengths and weaknesses of all detectors can be of great help in selecting the "best" detector(s) for an application.

In our publication "Faster and more intelligent object detection by combining OpenCL and KR" [24], we proposed different levels of integrating Computer Vision and Knowledge Representation (KR), a branch of Artificial Intelligence used to describe and reason with knowledge. Knowledge about the scene can be greatly beneficial for scene understanding, which is not yet applied in the literature. A first possible way of combining these two domains is a cascade, in which algorithms from the computer vision domain are first used to retrieve data about the scene, which is then compared to a model, or several models, described in a knowledge representation language. For example, pedestrian detection and tracking can be used to obtain information about the players on a basketball court, which can then be interpreted by KR-models to find out which strategies are used, whether the rules are violated, etc. Note however that the correctness of this system depends on the accuracy of the computer vision algorithms used. For such an integration to work properly, a broad range of the most accurate computer vision algorithms needs to be run, with the corresponding computational requirements, such that the knowledge base has all the necessary (correct) information to work with. This flaw makes a cascaded approach challenging to use in practical applications.
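The dependence of a cascade on the accuracy of the vision stage can be made concrete with a small sketch based on the basketball example. Plain Python stands in for a knowledge representation language here, and the `TrackedPlayer` type, the team labels and the function `teams_over_limit` are hypothetical; only the limit of five players per team on court is taken from the rules of the game.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List

MAX_PLAYERS_PER_TEAM = 5  # basketball allows five players per team on court


@dataclass(frozen=True)
class TrackedPlayer:
    track_id: int
    team: str  # assumed output of a (hypothetical) team-classification step


def teams_over_limit(players: List[TrackedPlayer]) -> List[str]:
    """Rule stage: report teams with more players on court than allowed."""
    counts = Counter(player.team for player in players)
    return sorted(team for team, n in counts.items() if n > MAX_PLAYERS_PER_TEAM)


# Vision stage output (toy data): six "home" players and five "away" players.
players = [TrackedPlayer(i, "home") for i in range(6)]
players += [TrackedPlayer(10 + i, "away") for i in range(5)]

print(teams_over_limit(players))      # the violation is reported
print(teams_over_limit(players[1:]))  # one missed detection hides it
```

The rule itself is trivial. Whether its conclusion is correct depends entirely on the detections it receives, which is the weakness of the cascaded approach described above.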
To address this problem, a second possible manner of integration can be used, in which a bidirectional interaction exists between the two domains. For example, when a car is detected at a zebra crossing, the knowledge base can be used to form the hypothesis that a pedestrian should be found walking on this zebra crossing. It is however possible that the pedestrian detector does not find such a pedestrian, in which case the search can be repeated at a lower threshold or with a better performing pedestrian detector on a limited area (around the zebra crossing) to find support for this hypothesis, or the hypothesis can be changed (e.g. maybe it is just a traffic jam). In this manner, a valid hypothesis can be made, and the most accurate algorithms, which are commonly the most computationally intensive, are only run if needed. A correct interpretation of the scene can be beneficial for many applications. Another example is a bank (or other company) with a security door. A person may only pass this door in certain circumstances, such as being staff, being accompanied by staff, being cleaning staff during a limited time-window, etc. These rules can be described using KR, which in turn can trigger the appropriate algorithms
to validate the reason why someone passed the door, using computer vision (face recognition for staff, tracking of pedestrians to obtain an optimal viewpoint for such recognition algorithms, etc.). In this example, the computer vision algorithms are steered from KR.
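The zebra crossing example of bidirectional interaction can be sketched as follows. This is a minimal illustration under assumptions: the detector output is represented by precomputed scored windows, a second pass at a lower threshold stands in for rerunning a (possibly better) detector, and the thresholds, box format and function names are hypothetical.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x, y, width, height); hypothetical box format
ScoredWindow = Tuple[Box, float]


def centre_inside(box: Box, region: Box) -> bool:
    """True if the centre of box lies inside region."""
    x, y, w, h = box
    rx, ry, rw, rh = region
    cx, cy = x + w / 2, y + h / 2
    return rx <= cx <= rx + rw and ry <= cy <= ry + rh


def detect_in_region(
    windows: List[ScoredWindow], region: Box, threshold: float
) -> List[ScoredWindow]:
    """Keep windows that score at least threshold and lie inside region."""
    return [(b, s) for b, s in windows if s >= threshold and centre_inside(b, region)]


def check_pedestrian_hypothesis(
    windows: List[ScoredWindow],
    crossing: Box,
    default_threshold: float = 0.5,  # assumed operating point
    relaxed_threshold: float = 0.2,  # assumed relaxed operating point
) -> Tuple[str, List[ScoredWindow]]:
    """Look for support for 'a pedestrian is on the crossing'."""
    hits = detect_in_region(windows, crossing, default_threshold)
    if hits:
        return "supported", hits
    # No support yet: search again, more permissively, but only on the crossing.
    hits = detect_in_region(windows, crossing, relaxed_threshold)
    if hits:
        return "supported_after_relaxed_search", hits
    # Still nothing: the knowledge base should revise the hypothesis.
    return "revise_hypothesis", []


crossing: Box = (100, 200, 200, 100)
windows = [((150, 180, 40, 80), 0.3), ((500, 180, 40, 80), 0.3)]
status, hits = check_pedestrian_hypothesis(windows, crossing)
print(status, hits)
```

Both windows have the same weak score, but only the one on the crossing is accepted, because the relaxed search is restricted to the area where the hypothesis predicts a pedestrian.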
Note that the scene knowledge captured in the knowledge representation may also help to improve the accuracy of computer vision algorithms, since detections that do not comply with this information (e.g. the kind of distortions in the image, or the size pedestrians have in relation to other objects) can be assumed to be incorrect.
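As an illustration of the size constraint, the sketch below rejects detections whose implied real-world height is implausible for a pedestrian. It assumes that a reference object of roughly known height (here a car of about 1.5 m) stands at a similar depth as the candidate pedestrians; the plausible height range and all names are assumptions, and a real system would rather rely on camera calibration or ground plane estimation.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x, y, width, height); hypothetical box format


def implied_height_m(box: Box, reference_box: Box, reference_height_m: float) -> float:
    """Height in metres implied by comparing pixel heights at similar depth."""
    return box[3] / reference_box[3] * reference_height_m


def filter_by_size(
    detections: List[Box],
    reference_box: Box,
    reference_height_m: float,
    min_height_m: float = 1.0,  # assumed lower bound for a pedestrian
    max_height_m: float = 2.2,  # assumed upper bound for a pedestrian
) -> List[Box]:
    """Discard detections whose implied height is implausible."""
    return [
        box
        for box in detections
        if min_height_m
        <= implied_height_m(box, reference_box, reference_height_m)
        <= max_height_m
    ]


car: Box = (300, 220, 150, 60)  # 60 px high, assumed to be about 1.5 m
candidates = [
    (100, 200, 30, 70),   # implies about 1.75 m: plausible
    (400, 50, 90, 200),   # implies about 5 m: rejected
    (250, 260, 10, 20),   # implies about 0.5 m: rejected
]
plausible = filter_by_size(candidates, car, reference_height_m=1.5)
print(plausible)
```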