3.3 Search Relevance based on the Semantic Web
3.3.1 Approaches using Semantic Models as Relevance Source
One of the early attempts to use a semantic model was made by Croft [54], in which domain knowledge is modeled as a thesaurus of concepts, each of which has a name, relationships to other concepts, and a list of more or less ad-hoc rules for recognizing the concept in a document. The relationship types are limited to synonymy, hypernymy, hyponymy, meronymy and similarity. Using these relationships, word occurrences are replaced by word senses (i.e. concepts), which are then used to expand both queries and documents during indexing and retrieval. Later, Voorhees [232] carried out experiments that exploit the semantics contained in WordNet for query expansion, with the aim of improving retrieval effectiveness by indexing with word senses. The experiments are based on TREC collections, and all terms in the query are expanded by a combination of synonyms, hypernyms and hyponyms. Although performance on short queries improves, no significant improvement is achieved for long queries, which indicates that the approach is not robust. Gonzalo et al. [84] used WordNet senses to index the documents and performed retrieval based on these senses. However, their work relies on manually disambiguated queries and documents, which limits the applicability of the approach. Despite this, retrieval performance improved by 30%, which shows the effectiveness of a semantic model. Further improvements are reported in more recent follow-up work such as [120, 145, 207]. Although thesaurus-based approaches have been shown to be effective at resolving ambiguity in both queries and documents, their scope is limited to that purpose because of the nature of the conceptual information contained in those models, i.e. words and word senses. In other words, for any unambiguous query and document, their power to predict relevance is similar to what a keyword-based technique can achieve, because matching word senses is nothing but matching words when there is no ambiguity [14].
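The thesaurus-based expansion described above can be illustrated with a minimal sketch. The `Concept` class, the toy `THESAURUS` entries and the `expand_query` function are hypothetical names introduced here for illustration only; they are not taken from [54] or [232], and a real system would draw its relations from a resource such as WordNet and would need a disambiguation step first.

```python
from dataclasses import dataclass, field


@dataclass
class Concept:
    """A thesaurus entry: a named word sense with typed links to other concepts."""
    name: str
    synonyms: set = field(default_factory=set)
    hypernyms: set = field(default_factory=set)
    hyponyms: set = field(default_factory=set)


# Hypothetical toy thesaurus; the entries are illustrative assumptions.
THESAURUS = {
    "car": Concept("car", synonyms={"automobile"},
                   hypernyms={"vehicle"}, hyponyms={"taxi"}),
    "bank": Concept("bank", synonyms={"depository"},
                    hypernyms={"institution"}),
}


def expand_query(terms, thesaurus=THESAURUS,
                 relations=("synonyms", "hypernyms", "hyponyms")):
    """Expand every query term with the related terms of its concept.

    Terms without a thesaurus entry are kept unchanged, which mirrors the
    observation that an unambiguous, uncovered term gains nothing.
    """
    expanded = []
    for term in terms:
        expanded.append(term)
        concept = thesaurus.get(term)
        if concept is None:
            continue
        for relation in relations:
            # Sort for a deterministic order, since sets are unordered.
            expanded.extend(sorted(getattr(concept, relation)))
    # Remove duplicates while preserving first-occurrence order.
    return list(dict.fromkeys(expanded))


print(expand_query(["cheap", "car"]))
```

Restricting `relations` to `("synonyms",)` gives a more conservative expansion, which is one way to experiment with the trade-off between recall and query drift that the long-query results hint at.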
The use of expressive ontologies as a semantic model to move Web search beyond keyword-based capabilities has also been an often-considered scenario in the area of the Semantic Web [149, 224]. Although the idea looks appealing at first sight because of the power of ontologies in describing domain-specific knowledge, many challenges exist in practice. In particular, a common form of ontology-based search assumes that all information is stored in a knowledge base (KB) in a specific format (e.g. RDF) conforming to an expressive ontology (e.g. OWL), and that users specify (semi-)structured queries (e.g. SPARQL) for their information needs. Semantic relevance is mainly obtained from the underlying query engine, which receives the query and executes it on the KB. In this respect, this line of approaches belongs to the spectrum of data retrieval (in the sense of relational databases) rather than IR (Sec. 2.1). From this perspective, some prominent approaches include semantic portals [41, 154, 216] and YAGO-NAGA knowledge discovery [115, 116]. However, these approaches use the actual semantic information (i.e. ontological axioms) mainly for structured query processing, while ranking according to relevance is mostly treated as a matter of minor importance. For instance, [216] employs a ranking scheme for ontology triples based on the term frequency of an entity label in a relation type. Similarly, [116] applies a language-model-based ranking built on the probabilities of ontology triples occurring on the witness pages (e.g. Wikipedia pages) processed by the YAGO extraction algorithm. There are two main drawbacks that limit the applicability of these approaches to Web search. First, they assume a deterministic retrieval step (e.g. structured query processing) that is decoupled from the actual ranking scheme, which probabilistically determines the relevance to the user's information need. Moreover, these ranking schemes mostly ignore the relevance of textual content to the information need by focusing only on triple-based probabilities; that is, the similarity measure between the user query and the ontology results is not clearly justified in comparison to standard IR-based relevance techniques. Second, and more importantly, these approaches implicitly assume that all information can be fully represented as a formal knowledge base, which is not affordable given the current volume of unstructured information worldwide. For example, [191] argues on the basis of experiments that, with state-of-the-art methods, converting one terabyte of unstructured data would take 388 years. In addition, unstructured data contains more information than its structured representation, and an inevitable loss of information occurs when it is replaced with a relatively small number of axioms. Furthermore, more and more unstructured data is becoming part of Web databases, in which both structured and unstructured forms of information are substantial, leading to more hybrid data. Therefore, assuming the dominance of one form of information over the other is highly impractical.
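The decoupling of deterministic retrieval from ranking can be made concrete with a small sketch. Everything here is an assumption made for illustration: the in-memory dictionary `KB`, its witness counts, and the functions `match` and `rank_by_witness_share` are hypothetical, and the scoring is a deliberately simplified stand-in for triple-based probabilities rather than the model of [116] or [216].

```python
# Hypothetical toy KB: (subject, predicate, object) -> number of witness pages.
# The counts are invented for illustration only.
KB = {
    ("Einstein", "bornIn", "Ulm"): 12,
    ("Einstein", "wonPrize", "NobelPrize"): 30,
    ("Bohr", "wonPrize", "NobelPrize"): 10,
    ("Bohr", "bornIn", "Copenhagen"): 5,
}


def match(pattern, kb):
    """Deterministic step: a triple either satisfies the pattern or it does not.

    `None` in the pattern acts as a wildcard, loosely like a query variable.
    """
    return [
        triple for triple in kb
        if all(p is None or p == value for p, value in zip(pattern, triple))
    ]


def rank_by_witness_share(results, kb):
    """Separate ranking step: score each matched triple by its share of
    witness occurrences among the matched triples.

    No document text is consulted, which is the limitation discussed above.
    """
    total = sum(kb[triple] for triple in results)
    if total == 0:
        return [(triple, 0.0) for triple in results]
    scored = [(triple, kb[triple] / total) for triple in results]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


results = match((None, "wonPrize", "NobelPrize"), KB)
for triple, score in rank_by_witness_share(results, KB):
    print(triple, score)
```

Note that the ranking function never sees the user's original wording or any textual content, so two triples with equal witness counts are indistinguishable however well the underlying pages address the information need.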
As another line of work, there are ontology-based approaches that keep the textual information (i.e. documents) and the KB separate, with semantic annotations relating the text to a semantic model. The KIM semantic annotation platform, for instance, focuses on the automatic annotation of documents at large scale, and a ranking model is applied on top of a Lucene-based¹ IR system to index and retrieve based on annotations. Complementary to KIM, [40, 228] propose a ranking model that adapts the VSM by using semantic annotation frequencies as weights instead of term frequencies. Recently, Meij et al. [162] presented an approach that adopts Lavrenko's generative model of relevance to construct conceptual language models (CLM), employing document-level annotations as concepts in the representation space (i.e. relevance). It is unique in that semantic relevance is modeled exactly as the concepts annotating documents, and both documents and queries are considered random samples generated from the concepts. However, one problem of this model arises in the estimation of the so-called generative concept model, which is trained as a unigram model derived from all relevant documents annotated with a specific concept. Since these documents may contain many terms not necessarily related to the concept, expectation-maximization (EM) based training is also used to obtain more exact probabilities, similar to the model-based feedback method presented in [254]. The model thus requires all relevant documents of each concept to be given in advance, which is only possible for document corpora of limited scope such as those in the medical domain (e.g. TREC Genomics [92] or the CLEF domain-specific track [177]). This prevents its application to more general scenarios such as ad-hoc document retrieval.
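The idea of weighting a vector space model by annotation frequencies instead of term frequencies can be sketched as follows. This is a minimal illustration under stated assumptions: the concept identifiers, the document collection, and the function names are hypothetical, and plain cosine similarity over raw annotation counts is a simplification chosen here, not necessarily the weighting used in [40, 228].

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as Counters."""
    dot = sum(weight * v[concept] for concept, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an unannotated query or document matches nothing
    return dot / (norm_u * norm_v)


def rank_documents(query_annotations, doc_annotations):
    """Rank documents by similarity of their annotation-frequency vectors.

    Each vector dimension is a concept, and its weight is the number of
    times the concept annotates the text (instead of a term frequency).
    """
    query_vector = Counter(query_annotations)
    scores = {
        doc_id: cosine(query_vector, Counter(annotations))
        for doc_id, annotations in doc_annotations.items()
    }
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)


# Hypothetical annotated collection; concept identifiers are invented.
DOCS = {
    "d1": ["Person:Einstein", "Person:Einstein", "City:Ulm"],
    "d2": ["City:Ulm", "City:Copenhagen"],
    "d3": [],
}

for doc_id, score in rank_documents(["Person:Einstein"], DOCS):
    print(doc_id, round(score, 3))
```

The sketch also makes the preprocessing bottleneck visible: a document without annotations, such as `d3`, can never be retrieved, whatever its text says.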
Broadly speaking, these semantic-annotation-based approaches have a bottleneck similar to that of the above-mentioned ontology-based approaches: all documents must be annotated in a preprocessing step, since they are used as training data to estimate the degree of semantic relevance. In addition, the experiments in [162] reveal that CLM yields improvements very similar to those of the standard relevance models and even performs worse on some of the datasets.
Figure 3.3: Discriminative semantic relevance