Further support for the addition of relevant query terms comes from the methodology to be used in the data analysis. The analysis involves the examination of the collocations and resulting lexical networks of the core terms refugee(s) and asylum seeker(s), and the interrelations in meaning/use that they may reveal. Arguably, these interrelations will become clearer if the study also takes into account the collocational patterns and lexical networks of the related terms. For example, terrorism registers as a very strong keyword when two sample corpora drawn from the database using the core query are compared to the written BNC Sampler.4 That is, terrorism seems to be strongly associated with topics related to the terms refugee(s) or asylum seeker(s), or, at the very least, to be present in texts containing one or both of these core query terms. It would be helpful, therefore, to examine what other terms (i.e. entities, concepts, states or processes) terrorism tends to be associated with in the corpus. Another example is the case of asylum. As one of the groups in focus is those who seek asylum, it seems beneficial to examine its collocational networks in the corpus to be constructed, in order to trace possible links between its different uses. These relations can, of course, also be examined in a representative general corpus, but there are also arguments for examining such associations within the same corpus. The collocational relations established within the specialised corpus can yield additional insights, as they would reveal the use of the term terrorism not in a diverse (albeit representative) range of genres and text types, but in the same clearly specified range of texts in which the associations of the core terms themselves were established (see McEnery 2006). To put it simply, the associations would be compared against the same background.
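As an illustration of the kind of collocational analysis described above, the following sketch counts the collocates of a node term within a fixed-size window over a token stream. The toy corpus text and the window size of four tokens are illustrative assumptions, not values from the study:

```python
from collections import Counter

def collocates(tokens, node, window=4):
    """Count terms co-occurring with `node` within +/- `window` tokens."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[tokens[j]] += 1
    return counts

corpus = ("the asylum seekers crossed the border while the asylum "
          "application was pending at the border office").split()
print(collocates(corpus, "asylum").most_common(3))
```

A real study would rank collocates by a significance measure such as log-likelihood or MI rather than raw frequency, but the windowed counting step is the same.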
Although the pseudo RF techniques described in this section can improve retrieval performance over not using pseudo RF, the problem remains that it is a variable technique: some queries will be improved, others will be harmed. Several of the authors mentioned indicate that uncovering more details about the collection statistics, the documents being used for RF, and query characteristics may help predict which queries should be used for pseudo RF. For example, Lindquist et al. [LGF97] investigated various parameters for automatic RF using the vector-space model and found that optimal performance was gained using between 5 and 20 documents and between 1 and 20 terms for feedback. They also provide support for weighting new query terms against original query terms, using within-document term frequency, and thresholding the query terms (only performing relevance feedback on queries that have terms with a high idf value). This leads to the suggestion that certain characteristics of a term may be good predictors of how the query is likely to improve given expansion by that term, which may be useful in pseudo feedback.
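The idf-thresholding idea can be made concrete as a simple gating rule: apply pseudo RF only when the query contains at least one sufficiently rare term. A minimal sketch, where the toy collection and the threshold value are illustrative assumptions:

```python
import math

def idf(term, docs):
    """Inverse document frequency of a term over a set-of-terms collection."""
    df = sum(1 for d in docs if term in d)
    return math.log(len(docs) / df) if df else 0.0

def should_feedback(query_terms, docs, idf_threshold=1.0):
    """Gate pseudo relevance feedback: run it only if the query
    contains at least one high-idf (rare, discriminative) term."""
    return any(idf(t, docs) >= idf_threshold for t in query_terms)

docs = [{"the", "cat", "sat"}, {"the", "dog", "ran"},
        {"the", "cat", "slept"}, {"the", "quantum", "computer"}]
print(should_feedback(["the"], docs))      # common term only -> False
print(should_feedback(["quantum"], docs))  # rare term -> True
```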
With input of contextual terms, Bobo works in two rounds. In round I, a query is issued using, by default, the combination of query terms and contextual terms, or just the contextual terms if they are unlikely to co-occur much with the query terms. Then, from the results, the top k documents (full pages or snippets) satisfying certain quality conditions, e.g., the number of terms contained in each seed, are selected as seeds. Optionally, seeds can be cleaned by removing the contained query terms to reduce the background noise of individual seeds, or purified by removing possibly irrelevant seeds to improve overall concentration. The contextual terms themselves can be used as an elf seed, a special document allowing negative terms, which functions as explicit feedback.
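The seed selection and cleaning steps might be sketched as follows. This is a simplified stand-in for Bobo's actual quality conditions; the toy snippets, term sets, and thresholds are all assumptions:

```python
def select_seeds(results, query_terms, contextual_terms, k=3, min_terms=2):
    """Keep up to k result snippets containing at least min_terms of the
    contextual terms (a stand-in quality condition), then clean each
    seed by dropping the query terms to reduce background noise."""
    seeds = []
    for snippet in results:
        toks = snippet.lower().split()
        if sum(t in toks for t in contextual_terms) >= min_terms:
            seeds.append([t for t in toks if t not in query_terms])
        if len(seeds) == k:
            break
    return seeds

results = ["jaguar speed engine torque test",
           "jaguar habitat rainforest prey range",
           "jaguar price dealer offers"]
seeds = select_seeds(results, query_terms={"jaguar"},
                     contextual_terms=["habitat", "prey", "rainforest"])
print(seeds)  # only the wildlife snippet qualifies, with "jaguar" removed
```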
One contributing factor to this misleading behaviour of summaries is the lack of coherence, a well-known problem of sentence-extraction approaches. Our query-biased approach is based on the belief that coherence problems can be tackled by customising the summary to the query. For the purposes of a time-limited interactive search task, users seek relevance clues in order to achieve their goals, especially the context in which query terms are used in the documents. We intend to analyse our results further and identify the cases where summaries lead users to false-positive answers, our aim being to improve user interaction.
The central idea behind our approach is to combine the orthogonal information sources of the translation model and the language model to expand query terms in context. The translation model proposes expansion candidates, and the query language model performs a selection in the context of the surrounding query terms. Thus, in combination, the persistent problems of term ambiguity and query drift can be addressed. One of the goals of this article is to show that existing SMT technology is readily applicable to this task. We apply SMT to large parallel data with queries on the source side and snippets of clicked search results on the target side. Snippets are short text fragments that represent the parts of the result pages that are most relevant to the queries, for example, in terms of query term matches. Although the use of snippets instead of the full documents makes our approach efficient, it introduces noise because text fragments are used instead of full sentences. However, we show that state-of-the-art statistical machine translation (SMT) technology is in fact robust and flexible enough to capture the peculiarities of the language pair of user queries and result snippets. We evaluate our system in a comparative, extrinsic evaluation on a real-world Web search task. We compare our approach to the expansion system of Cui et al. (2002), which is trained on the same user-log data and has been shown to produce significant improvements over the local feedback technique of Xu and Croft (1996) in a standard evaluation on TREC data. Our extrinsic evaluation is done by embedding the expansion systems in a real-world search engine and comparing the two systems based on the search results that are triggered by the respective query expansions. Our results show that the combination of the translation and language models of a state-of-the-art SMT system produces high-quality rewrites and outperforms the expansion model of Cui et al. (2002).
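The division of labour between the two models can be illustrated with a toy noisy-channel score: the translation model proposes candidates, and a bigram language model picks the one that fits the surrounding query context. All probabilities below are invented for illustration, not learned from query-snippet data:

```python
# Toy translation probabilities p(candidate | query term) and a bigram
# language model p(word | previous word). Purely illustrative numbers.
translation = {"jaguar": {"car": 0.4, "cat": 0.6}}
bigram_lm = {("fast", "car"): 0.3, ("fast", "cat"): 0.01}

def best_expansion(prev_word, query_term):
    """Pick the candidate maximising translation prob * LM prob in context."""
    candidates = translation.get(query_term, {})
    return max(candidates,
               key=lambda c: candidates[c] * bigram_lm.get((prev_word, c), 1e-6))

print(best_expansion("fast", "jaguar"))  # the context "fast" selects "car"
```

Without the language model, "cat" would win on translation probability alone; the context term disambiguates, which is exactly the term-ambiguity problem the combination is meant to address.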
In addition to the scores assigned to sentences, information from the query submitted by the user was also employed in computing the final score for each sentence. A query score was thus computed, intended to represent the distribution of query words in a sentence. The rationale for this choice was that, by allowing users to see the context in which the query terms occurred, they could better judge the relevance of a document to the query. The computation of that score was based on the distribution of query terms in each sentence, on the belief that the larger the number of query terms in a sentence, the more likely that the sentence conveyed a significant amount of the information need expressed in the query. The actual measure of the significance of a sentence in relation to a specific query was derived by dividing the square of the number of query terms included in that sentence by the total number of terms comprising the query.
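The query score just described can be written down directly; a minimal sketch, in which whitespace tokenisation is a simplifying assumption:

```python
def query_score(sentence, query):
    """Square of the number of query terms present in the sentence,
    divided by the total number of terms in the query."""
    q_terms = query.lower().split()
    s_toks = set(sentence.lower().split())
    hits = sum(1 for t in q_terms if t in s_toks)
    return hits ** 2 / len(q_terms)

print(query_score("the cat sat on the mat", "cat mat dog"))  # 2**2 / 3
```

Squaring the hit count rewards sentences that cover several query terms at once over sentences that each match a single term.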
Traditional information retrieval techniques can give poor results on the Web, with its vast scale and highly variable content quality. Recently, however, it was found that Web search results can be much improved by using the information contained in the link structure between pages. The two best-known algorithms that do this are HITS and PageRank. The latter is used in the highly successful Google search engine. The heuristic underlying both of these approaches is that pages with many inlinks are more likely to be of high quality than pages with few inlinks, given that the author of a page will presumably include in it links to pages that s/he believes are of high quality. Given a query (a set of words or other query terms), HITS invokes a traditional search engine to obtain a set of pages relevant to it, expands this set with its inlinks and outlinks, and then attempts to find two types of pages: hubs (pages that point to many pages of high quality) and authorities (pages of high quality). Because this computation is
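The hub/authority computation that HITS performs on the expanded page set can be sketched as a simple power iteration; the toy link graph below is an assumption for illustration:

```python
def hits(links, iters=50):
    """links: dict page -> set of pages it points to.
    Returns (hub, authority) score dicts after power iteration."""
    pages = set(links) | {p for outs in links.values() for p in outs}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iters):
        # a page's authority is the sum of the hub scores of pages linking to it
        auth = {p: sum(hub[q] for q in pages if p in links.get(q, ()))
                for p in pages}
        # a page's hub score is the sum of the authority scores of its targets
        hub = {p: sum(auth[q] for q in links.get(p, ())) for p in pages}
        # normalise so the scores stay bounded
        na = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        nh = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        auth = {p: v / na for p, v in auth.items()}
        hub = {p: v / nh for p, v in hub.items()}
    return hub, auth

links = {"h1": {"a1", "a2"}, "h2": {"a1", "a2"}, "h3": {"a2"}}
hub, auth = hits(links)
print(max(auth, key=auth.get))  # "a2", the page with the most inlinks
```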
shot query’, but is instead an iterative process during which users reformulate a query, change user preferences, or change query terms while looking for satisfactory results. Online searching not only involves the input of various terms, but also depends on the ability and experience of the individual performing the search. The digital library user needs to learn how to use the query language and which strategies to use in a specific online environment. A good search strategy develops over time, and it requires not only an understanding of a searching paradigm, but also knowledge of the task domain. Our data shows that users start their search with a very simple query using keywords. Later, they apply more sophisticated searches which involve changing preferences, such as limiting the source (publication) or date of publication, or using a ‘search within field’ option (full text, abstract, title or author). In this paper, we present changes in users’ searching strategies triggered by their evaluation of results.
Michael Jordan, retirement, effect, Chicago Bulls”, which achieves a better MAP of 0.2095. When carefully analyzing these terms, one finds that the meaning of Michael Jordan is more precise than that of NBA Star, and hence MAP improves by 14% when NBA Star is removed. Yet, interestingly, the performance after removing Michael Jordan is not as bad as one might expect. This might be because Michael Jordan is a famous NBA star who played for the Chicago Bulls. However, what if other terms such as reason and effect are excluded? There is no explicit clue to help users determine which terms are effective in an IR system, especially when they lack experience of searching documents in a specific domain. Without a comprehensive understanding of the document collection to be retrieved, it is difficult for users to generate appropriate queries. As the effectiveness of a term in IR depends not only on how much information it carries in a query (subjectivity from users) but also on what documents there are in the collection (objectivity from corpora), it is important to measure the effectiveness of query terms automatically. Such measurement is useful for selecting effective and ineffective query terms, which can benefit many IR applications such as query formulation and query expansion.
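A crude automatic measure along these lines could combine a term's weight inside the query (the subjective side) with its discriminative power over the collection (the objective side). The sketch below uses query term frequency times idf; the toy collection, query, and the formula itself are illustrative assumptions, not the measure of any particular paper:

```python
import math

def term_effectiveness(term, query_terms, docs):
    """Toy effectiveness score: idf over the collection (objective side)
    weighted by the term's relative frequency in the query (subjective side)."""
    df = sum(1 for d in docs if term in d)
    idf = math.log((len(docs) + 1) / (df + 1))
    qtf = query_terms.count(term) / len(query_terms)
    return qtf * idf

docs = [{"jordan", "bulls", "retirement"}, {"bulls", "chicago"},
        {"nba", "star"}, {"nba", "star", "news"}, {"nba", "games"}]
query = ["jordan", "retirement", "nba", "star"]
scores = {t: term_effectiveness(t, query, docs) for t in set(query)}
print(sorted(scores, key=scores.get, reverse=True))
```

Under this toy measure the rare, specific terms ("jordan", "retirement") score above the frequent, generic ones ("nba", "star"), mirroring the intuition in the passage above.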
Information Retrieval is a field of computer science that has seen tremendous change in the past two decades. After the advent of the World Wide Web, access to information became convenient because of search engines. An Information Retrieval system consists of the following components: i) the query interface, where the user poses the information need; ii) the index file, which is created by the indexing process; and iii) the matching process, which finds the relevant documents in the index file. The performance of an Information Retrieval system depends on these three components, so an improvement in one component can have a significant impact on overall retrieval performance. In many of the query expansion approaches proposed in the literature, the system automatically selects the terms that will then be added to the initial query, and the user has no control over the query expansion process. Here we propose an interactive query expansion approach in which the user gets help from the system in selecting the terms to be added to the initial query. Our experiments have also revealed that not all the terms generated by the system are relevant for expansion; however, the user decides which terms are added and which are discarded. In this experiment we use a domain ontology for interactive query expansion and compare the performance of our system with the traditional approach.
as unsupervised learning. This paper presents a study of various clustering algorithms. Much work has been done on fuzzy information retrieval to model uncertainty and imprecision in IR. Michal Kozielski proposed a method which uses fuzzy c-means clustering to cluster XML documents. The method clusters feature vectors encoding XML documents at different structural levels. Razaz and M. Sch. proposed a ranking algorithm based on fuzzy c-means clustering techniques. The first stage of the algorithm is to construct the max-product transitive closure of the association matrix, which is then used as a fuzzy similarity relation in a clustering procedure. Guadalupe J. Torres, Ram B. Basnet, Andrew H. Sung, Srinivas Mukkamala and Bernardete M. Ribeiro propose an algorithm for performing similarity analysis among different clustering algorithms. Several papers use the fuzzy c-means clustering algorithm to re-rank documents after retrieval, but they do not form document clusters with fuzzy c-means in the preprocessing steps, as is done in this paper. Previous papers emphasise the computation of the document-query overlap score: if a document contains a larger number of query terms, the document is considered more relevant to the query. The terminology used to describe the proposed scheme is explained below.
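For reference, fuzzy c-means itself can be sketched in a few lines: unlike hard k-means, each point receives a graded membership in every cluster, and centres are membership-weighted means. The 1-D toy data and the deterministic initialisation at the data extremes are simplifying assumptions:

```python
def fuzzy_cmeans(points, c=2, m=2.0, iters=100):
    """Fuzzy c-means on 1-D points. Returns (centers, memberships)."""
    lo, hi = min(points), max(points)
    # deterministic initialisation: spread centres across the data range
    centers = [lo + i * (hi - lo) / (c - 1) for i in range(c)]
    u = []
    for _ in range(iters):
        # membership u[k][i] of point k in cluster i
        u = []
        for x in points:
            d = [abs(x - ci) or 1e-12 for ci in centers]
            u.append([1.0 / sum((d[i] / d[j]) ** (2 / (m - 1)) for j in range(c))
                      for i in range(c)])
        # centres become membership-weighted means
        n = len(points)
        centers = [sum(u[k][i] ** m * points[k] for k in range(n)) /
                   sum(u[k][i] ** m for k in range(n))
                   for i in range(c)]
    return centers, u

pts = [1.0, 1.2, 0.9, 8.0, 8.3, 7.9]
centers, u = fuzzy_cmeans(pts)
print([round(ci, 1) for ci in centers])  # centres near the two groups
```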
            Precision   Recall      F1
Method a      39.23%    85.99%    53.9%
Method b      67.75%    81.31%    73.9%
Change       +28.52%    -4.68%   +20.0%

The improvement is largely due to the use of our approach to extract CSS and correct the speech recognition errors in the CSS components. More detailed analysis of long queries in Table 3 reveals that our method performs worse than the baseline method in recall. This is mainly due to errors in extracting and breaking CSS into basic components. Although we used the multi-tier mapping approach to reduce the errors from speech recognition, its improvement is insufficient to offset the loss in recall due to errors in extracting CSS. On the other hand, for the short query cases, without the errors in breaking CSS, our system is more effective than the baseline in recall. It is noted that in both cases, our system performs significantly better than the baseline in terms of precision and F1
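The F1 values above follow from the corresponding precision and recall values via the harmonic mean; a quick consistency check:

```python
def f1(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)

print(round(100 * f1(0.3923, 0.8599), 1))  # method a: 53.9
print(round(100 * f1(0.6775, 0.8131), 1))  # method b: 73.9
```

Because F1 is a harmonic mean, method b's large precision gain outweighs its small recall loss, which is why the overall F1 still rises by 20 points.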
during daily medical care. The physician should be able to find a potentially useful article within one or two queries, leaving enough time for critical appraisal. Observation of the search process during daily medical care is crucial for identifying the tools that actually work in this setting. We therefore created an online information portal that could monitor the complete search process without interfering with the search. Physicians working at our teaching hospital are accustomed to using online information sources and have all received some education in evidence-based medicine. They are therefore likely to use a wide array of queries and search tools. We performed an observational study of queries sent to PubMed during daily medical care to answer the following questions. To what extent are search tools used, and does the use of these tools improve article selection for further reading? How many articles should be retrieved by a query to enhance the chance that one will be selected for further reading? What is the relationship between the number of terms, the articles retrieved by a
Given an initial set of objects, we first divide the objects into groups based on their global distribution. We can refine this partitioning further by dividing each existing group based on the local distribution of a subset of objects. This process may be repeated, taking a smaller subset each time, until no further improvements are possible. Finally, we obtain a hierarchical set of groups. This is roughly the basic idea of the DAHC-tree, where such a model is adapted to disk. In other words, given a query object, we can reduce the search space by gradually considering a subset of objects with a more relevant distribution.
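A toy version of this repeated refinement, for 1-D objects, with a median split standing in for the distribution-based division (the splitting rule and stopping size are illustrative assumptions, not the DAHC-tree's actual criteria):

```python
def divisive(points, min_size=2):
    """Recursively split a group at its median until groups reach
    min_size; returns a nested list mirroring the hierarchy."""
    if len(points) <= min_size:
        return sorted(points)
    pts = sorted(points)
    mid = len(pts) // 2
    return [divisive(pts[:mid], min_size), divisive(pts[mid:], min_size)]

print(divisive([9, 1, 8, 2, 7, 3, 10, 0]))
# → [[[0, 1], [2, 3]], [[7, 8], [9, 10]]]
```

At query time, such a hierarchy lets the search descend only into the branch whose value range is relevant, which is the search-space reduction the passage describes.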
stored. Specifically, the distance to each node that exists within a certain range is calculated, and POI information is saved in each node, generating islands. After finding nodes by extending a network from the query point, the distance to the POI is calculated using the nodes' islands, and the results are then returned. The island technique can regulate the amount of storage by adjusting the range used for island formation, so it offers a more flexible structure than VN3 in terms of POI and network data updates. Both VN3 and the island technique are studies of POI-based pre-computation techniques, which calculate and save the distance from the POI to a network node. In the event of changes in a POI's location information, inter-node and node-POI distances must be recalculated, making updates costly. In VN3, search performance increases as the POIs become less dense; because the network information needed for distance calculation increases, however, update performance drops. In the island technique as well, as the range 'R' increases, search performance can improve, but changes in a POI's location information will then require distance recalculation at many nodes, which is very disadvantageous for updates. Furthermore, the above methods are not applicable to a moving object whose location information keeps changing.
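The island pre-computation step (distances from a POI to all network nodes within range R) can be sketched as a radius-bounded Dijkstra search; the toy road graph below is an assumption for illustration:

```python
import heapq

def build_island(graph, poi_node, R):
    """Precompute distances from a POI to every network node within
    range R (the POI's 'island') using Dijkstra's algorithm.
    graph: dict node -> list of (neighbour, edge_length)."""
    dist = {poi_node: 0.0}
    heap = [(0.0, poi_node)]
    while heap:
        d, n = heapq.heappop(heap)
        if d > dist.get(n, float("inf")):
            continue  # stale heap entry
        for nb, w in graph.get(n, ()):
            nd = d + w
            if nd <= R and nd < dist.get(nb, float("inf")):
                dist[nb] = nd
                heapq.heappush(heap, (nd, nb))
    return dist

graph = {"p": [("a", 2), ("b", 5)], "a": [("c", 2)], "b": [("c", 1)], "c": []}
print(build_island(graph, "p", 4))  # nodes within range 4 of the POI
```

Growing R enlarges each island (faster queries, more storage), while shrinking it does the opposite, which is the storage/search trade-off described above.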
classification can help information providers to understand users’ needs based on the categories searched by the users. To build the domain corpus, most query classification systems use ontologies, Wikipedia category sources, graph databases, etc. In this system, a graph database and an ontology are built as the domain corpus for the query classification process by using Neo4j and the Web Ontology Language (OWL). The Web Query Classification Algorithm (WQCA), with five steps, is implemented as a web service using XML web service technology. The proposed system classifies each domain term of a user query into its relevant categories according to the WQCA algorithm, using the different domain corpora. Finally, the system compares the performance of the Neo4j-based and OWL-based WQCA to show the effectiveness of using a graph database in the query classification process.
Kaushik Chakrabarti, Michael Ortega, Kriengkrai Porkaew and Sharad Mehrotra, in “Query Refinement in Similarity Retrieval Systems”, show how the EasyAsk system supports a wide variety of features such as approximate word matching, word stemming, synonyms and other word associations. It also recognised phrases and supported comparisons, which it translated into appropriate SQL conditions. The model of padding the query with synonyms was easily extensible to allow scaling down of node weights to account for approximate matches or synonyms. They also stated that synonyms were particularly useful in the context of matching metadata. Xiaoou Tang, Ke Liu, Jingyu Cui, Fang Wen and Xiaogang Wang showed that existing linguistically-related methods find either synonyms or other linguistically related words from a thesaurus, or find words frequently co-occurring with the query keywords.
To evaluate the success of this approach, some sample search-based tasks were identified. A query with a single term (a naïve query) was initially generated to retrieve links to documents which may be of interest. The first 150 links were examined and the retrieved documents were classified to form the sets Relevant and Irrelevant. These sets were then used to synthesise a new query, via the expression-building process described in Section 3. The resulting queries were then re-submitted to the search engine, and the first 150 documents returned were again manually classified.

Figure 1: The singular value decomposition process as used in Latent Semantic Analysis
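The paper's expression-building process is defined in its Section 3; purely as an illustration of synthesising a query from the two classified sets, one simple scheme picks the terms whose document frequency in Relevant most exceeds that in Irrelevant (the toy document sets are assumptions):

```python
def synthesise_query(relevant, irrelevant, n=2):
    """Pick the n terms whose document frequency in the Relevant set
    most exceeds their frequency in the Irrelevant set."""
    vocab = set().union(*relevant, *irrelevant)
    def score(t):
        return (sum(t in d for d in relevant) / len(relevant)
                - sum(t in d for d in irrelevant) / len(irrelevant))
    return sorted(vocab, key=score, reverse=True)[:n]

relevant = [{"latent", "semantic", "analysis"}, {"semantic", "indexing"}]
irrelevant = [{"football", "analysis"}, {"football", "scores"}]
result = synthesise_query(relevant, irrelevant)
print(result)  # discriminative terms from the Relevant set
```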
Francesco et al. (2013) suggested the use of a structured representation with named, weighted word pairs to expand the initial query. This technique relies on relevance feedback to obtain a new query. The authors automatically extracted the Weighted Word Pairs representation from documents using term extraction. The TREC-8 dataset, comprising about 520,000 documents on 50 topics, was adopted to test the model. The results showed that their method retrieved more relevant documents than a method using a list of words only.
Abstract. This paper describes the collaborative participation of Dublin City University and Trinity College Dublin in LogCLEF 2010. Two sets of experiments were conducted. First, different aspects of the TEL query logs were analysed after extracting user sessions of consecutive queries on a topic. The relation between the queries and their length (number of terms) and position (first query or further reformulations) was examined in a session with respect to query performance estimators such as query scope, IDF-based measures, simplified query clarity score, and average inverse document collection frequency. Results of this analysis suggest that only some estimator values show a correlation with query length or position in the TEL logs (e.g. similarity score between collection and query). Second, the relation between three attributes was investigated: the user’s country (detected from IP address), the query language, and the interface language. The investigation aimed to explore the influence of the three attributes on the user’s collection selection. Moreover, the investigation involved assigning different weights to the three attributes in a scoring function that was used to re-rank the collections displayed to the user according to the language and country. The results of the collection re-ranking show a significant improvement in Mean Average Precision (MAP) over the original collection ranking of TEL. The results also indicate that the query language and interface language have more influence than the user’s country on the collections selected by the users.