Both case studies have verified the effectiveness of the proposed probabilistic inference framework in improving the performance of concept detectors by using their output as evidence. We have seen how domain knowledge and application context act beneficially in media interpretation by favoring the co-occurrence of evidence that are known from experience to co-exist. Given that in all examined cases the improvement in perfor- mance derives mainly from the incorporation of knowledge and context to the analysis process, we may rightfully claim that the proposed framework can be used to improve the performance of any set of concept detectors that produce a probabilistic output.
Going a bit deeper in Section 3.3 particularly interesting have been the results of Section 3.3.3.1, which led us to the conclusion that the amount and nature of the se- mantic information that can be used to enhance image interpretation depends largely on the special characteristics of the domain. More specifically, although using the infor- mation from the knowledge structure KD and the causality relations Wij ∈ X obtained
from context was proven to be useful in all cases, the semantic constraints originating from the domain were only able to facilitate image interpretation when the imposed rules were sufficiently concrete. For instance, the disjointness between “Tennis” and all other category concepts of the PS domain expresses a rather strict distinction that is suggested by knowledge. On the contrary, attempts to incorporate semantic constraints that, although valid from the point of logic, were less strict from the visual inference point of view didn’t result in performance improvements. Another interesting experi- mental finding was observed in Section3.3.3.4, showing that a sufficiently large amount of training data is required for approximating the prior and conditional probabilities using frequency information. Indeed, it was evident that the availability of realistic prior and conditional probabilities for the BN nodes is particularly important for the efficiency of our framework. Learning them from data was only possible when there were enough training samples to learn from. However, given that the manual anno- tation of images is a cumbersome procedure, especially at region level, this option is not always feasible. In such cases a potential solution to the problem could be to mine the necessary annotations from social sites like Flickr that are being populated with hundreds of user tagged images on a daily basis. This was actually our basic motivation for developing the framework for scalable object detection, presented in Chapter 4.
Similar conclusions have also been drawn from the experimental study presented in Section 3.4, that was conducted with the goal to examine whether the proposed probabilistic inference framework can be used to effectively combine evidence extracted across media. Our experimental findings in Section 3.4.2.2 have indeed verified that there are many cases where the high-level concepts contained in a multi-modal resource can only be extracted if evidence is considered across media. Moreover, it has been shown in Section 3.4.2.2 that the information coming from the domain knowledge is particularly useful when dealing with heterogeneous types of content, even if provided in a very simplistic and rough form. Interesting were also the results of Section3.4.2.2
showing that when performing cross media analysis, the generative models are more suitable for incorporating explicit knowledge and outperform the discriminative models that lack a straightforward way to benefit from such knowledge. Finally, a drawback related to the amount of annotation effort was also revealed in this case study posing the requirement to deeply model the analysis context, both in terms of engineering the domain ontology and producing the necessary cross media annotations. This is a critical requirement that makes our framework appropriate for cases where this effort is justified by the added value in the application, or in cases where social media can be used to mine the necessary annotations.
Scalable object detection by
leveraging social media
In this chapter we present an approach that leverages social media for the effortless learning of object detectors [150]. We are motivated by the fact that the increased training cost of methods demanding manual annotation, limits their ability to easily scale in different types of objects and domains. At the same time, the rapidly growing social media applications have made available a tremendous volume of tagged images, which could serve as a solution for this problem [151]. However, the nature of annota- tions (i.e. global level) and the noise existing in the associated information (due to lack of structure, ambiguity, redundancy and emotional tagging), prevents them from being readily compatible (i.e. accurate region level annotations) with the existing methods for training object detectors. We overcome this deficiency by using the collective knowl- edge aggregated in social sites to automatically determine a set of image regions that can be associated with a certain object [152], [153], [154].
4.1
Description of the proposed approach
Machine learning algorithms for object detection fall within two main categories that are characterized by the annotation granularity of their learning samples. The algorithms that are designed to learn from strongly annotated samples [155], [156], [157] (i.e. samples in which we know the exact location of an object within an image) and the algorithms that learn from weakly annotated samples [158], [11], [9], [66] (i.e. samples in
which we known that an object is depicted in the image, but its location is unknown). In the first case, the goal is to learn a mapping from visual features fi to semantic
labels ci (e.g. a face [155], [157] or a car [156]) given a training set made of pairs
(fi, ci). New images are annotated by using the learned mapping to derive the semantic
labels that correspond to the visual features of the new image. On the other hand, in the case of weakly annotated training samples the goal is to estimate the joint probability distribution between the visual features fi and the semantic labels ci given
a training set made of pairs between sets {(f1, . . . , fn), (c1, . . . , cm)}. New images are
annotated by choosing the semantic labels that maximize the learned joint probability distribution given the visual features of the new image. Some indicative works that fall within the weakly supervised framework include the ones relying on aspect models like probabilistic Latent Semantic Analysis (pLSA) [158], [159] and Latent Dirichlet Allocation (LDA) [160], [161], which are typically used for estimating the necessary joint probability distribution.
While model parameters can be estimated more efficiently from strongly annotated samples, such samples are very expensive to obtain raising scalability problems. On the contrary, weakly annotated samples can be easily obtained in large quantities from social networks but the estimation of model parameters is far more difficult. Motivated by this fact our work aims at combining the advantages of both strongly supervised (learn model parameters more efficiently) and weakly supervised (learn from samples obtained at low cost) methods, by allowing the strongly supervised methods to learn from training samples that can be mined from collaborative tagging environments. The problem we consider is essentially a multiple-instance learning problem in noisy context, where we try to exploit the noise reduction properties that characterize massive user contributions, given that they encode the collective knowledge of multiple users. Indeed, Flickr hosts a series of implicit links between images that can be mined using criteria such as geo-location information, temporal proximity between the image timestamps, or images associated with the same event. The goal of this work is to exploit the social aspect of the contributed content at the level of tags. More specifically, given that in social tagging environments the generated annotations may be considered to be the result of the collaboration among individuals, we can reasonably expect that tag assignments are filtered by the collaborative effort of the users, yielding more consistent annotations. In this context, drawing from a large pool of weakly annotated images,
our goal is to benefit from the knowledge aggregated in social tagging systems in order to automatically determine a set of image regions that can be associated with a certain object.
In order to achieve this goal, we consider that if the set of weakly annotated images is properly selected, the most populated tag-“term” and the most populated visual-“term” will be two different representations (i.e. textual and visual) of the same object. We define tag-“terms” to be sets of tag instances grouped based on their semantic affinity (e.g. synonyms, derivatives, etc.). Respectively, we define visual-“terms” to be sets of region instances grouped based on their visual similarity (e.g. clustering using the regions’ visual features). The most populated tag-“term” (i.e. the most frequently appearing tag, counting also its synonyms, derivatives, etc.) is used to provide the semantic label of the object that the developed classifier is trained to recognize, while the most populated visual-“term” (i.e. the most populated cluster of image regions) is used to provide the set of positive samples for training the classifier in a strongly supervised manner. Our method relies on the fact that due to the common background that most users share, the majority of them tend to contribute relevant tags when faced with similar types of visual content [162]. Given this fact, it is expected that as the pool of the weakly annotated images grows, the most frequently appearing “term” in both tag and visual information space will converge into the same object.
In the following we describe the general architecture of our approach and provide technical details for the independent analysis components. Subsequently, we provide some theoretical insight on the convergence properties of our approach and present the experimental findings that are used to support our claims.