
A SYSTEMATIC STUDY OF MULTI-LEVEL QUERY UNDERSTANDING

BY YANEN LI

DISSERTATION

Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Computer Science

in the Graduate College of the

University of Illinois at Urbana-Champaign, 2014

Urbana, Illinois

Doctoral Committee:

Professor ChengXiang Zhai, Chair

Professor Jiawei Han

Professor Bruce R. Schatz

Professor Dan Roth


Abstract

Search and information retrieval technologies have significantly transformed the way people seek information and acquire knowledge from the internet. To further improve the search accuracy and usability of the current-generation search engines, one of the most important research challenges is for a search engine to accurately understand a user’s intent or information need underlying the query.

This thesis presents a systematic study of query understanding. I propose a conceptual framework with different levels of query understanding, where the levels have natural logical dependencies. I then present my studies addressing important research questions in this framework.

First, I addressed query spelling correction, a major type of query alteration, by modeling all major types of spelling errors with a generalized Hidden Markov Model. Second, query segmentation is the most important type of query linguistic signal; I proposed a probabilistic model to identify query segmentations using clickthrough data. Third, synonym finding is an important challenge for semantic annotation of queries; I proposed a compact clustering framework to mine entity attribute synonyms for a set of inputs jointly with multiple information sources. Finally, for dynamic query understanding, I introduced the horizontal skipping bias, which is unique to the query auto-completion (QAC) process, and proposed a novel two-dimensional click model for the QAC process with emphasis on this behavior.


Acknowledgments

First of all, I want to express my deepest gratitude to my advisor, Professor ChengXiang Zhai, who guided me through my whole Ph.D. study. This thesis would not have been possible without his advice. Professor Zhai introduced me to the wonderful world of Information Retrieval and Text Mining; he later guided me to focus on the important topic of multi-level query understanding in web search. In the pursuit of my Ph.D. thesis, I have always been inspired by his passion, vision and knowledge in the field. I am also heartily thankful for the way Professor Zhai treats his students. He worked with me as a peer and colleague: he always encouraged me to think critically and independently, and to defend and improve my ideas through discussion with him. From him I have learned how to be an independent researcher.

I also want to thank my doctoral committee members, Professor Jiawei Han, Professor Dan Roth, Professor Bruce R. Schatz and Doctor Bo-June (Paul) Hsu, for their valuable guidance on my study and research, as well as their constructive suggestions on this dissertation. I owe gratitude to my colleagues and friends Huizhong Duan, Hongning Wang, Hongbo Deng, Anlei Dong, Yi Chang and Kuansan Wang, who offered valuable help with my research. I also want to thank all my colleagues in the Database and Information System (DAIS) group, and all my friends at the University of Illinois at Urbana-Champaign.

I would like to thank my parents for their unconditional love, patience and encouragement throughout the pursuit of my Ph.D. thesis. Finally, I am grateful to my wife Chenchen Feng and my son Mason Li for their love, faith and confidence in me. Without their support my achievements would have been impossible.


Table of Contents

Chapter 1 Introduction

Chapter 2 Related Work

2.1 Query Spelling Correction and Query Reformulation

2.2 Query Segmentation

2.3 Synonym Mining

2.4 Query Auto-Completion

2.5 Query Log and Other Resources

Chapter 3 Query Spelling Correction by a Hidden Markov Model

3.1 Introduction

3.2 Problem Setup and Challenges

3.3 A Generalized HMM for Query Spelling Correction

3.3.1 The gHMM Model Structure

3.3.2 Generalization of HMM Scoring Function

3.3.3 Discriminative Training

3.3.4 Query Correction Computation

3.4 Experiments and Results

3.4.1 Dataset Preparation

3.4.2 Evaluation Metrics

3.4.3 Overall Effectiveness

3.4.4 Results by Error Types

3.4.5 gHMM for Working Set Construction

3.4.6 Efficiency

3.5 Conclusions and Future Works

Chapter 4 Query Segmentation Using Clickthrough

4.1 Introduction

4.2 Problem Setup

4.3 Query Segmentation

4.3.1 Model Parameter Estimation by EM

4.3.2 Utilizing Other Resources

4.4 Segmentation Experiments

4.4.1 Data Preparation and Evaluation Metrics

4.4.3 Results on the 1000-query Dataset

4.4.4 Effect of the Penalty Factor

4.5 Integrated Language Model

4.5.1 Oracle Ranker

4.5.2 Integrated Language Model

4.6 Retrieval Experiments

4.6.1 Evaluation Metrics and Baselines

4.6.2 Retrieval Results

4.7 Conclusions and Future Works

Chapter 5 Mining Entity Attribute Synonyms via Compact Clustering

5.1 Introduction

5.2 Problem Definition

5.3 Compact Clustering

5.3.1 Similarity Kernels

5.3.2 Basic Model

5.3.3 Standard Model

5.3.4 Solving the Standard Model

5.3.5 Incorporating Additional Information

5.4 Experiments and Results

5.4.1 Datasets

5.4.2 Evaluation Metrics

5.4.3 Baselines

5.4.4 Entity Mentions

5.4.5 Arbitrary Strings

5.4.6 Ambiguous Attribute Values

5.4.7 Contribution of Similarity Kernels

5.4.8 Adding Noisy Labeled Data

5.5 Conclusions and Future Works

Chapter 6 Modeling Query Auto-completion by a Two-dimensional Click Model

6.1 Introduction

6.2 Data and User Behavior Analysis

6.2.1 A High-Resolution QAC Log

6.2.2 QAC User Behavior Analysis

6.3 Modeling Clicks in Query Auto-Completion

6.3.1 QAC Click Bias Assumptions

6.3.2 Model Formulation

6.3.3 Click and Conditional Probabilities

6.3.4 The Form of Distributions

6.3.5 Features

6.3.6 Model Estimation via E-M Algorithm

6.4.2 Evaluating the Relevance Model

6.4.3 Relevance Model Performance by Query Length

6.4.4 Validating the H Model: Automatic Labeling by TDCM

6.4.5 Validating the D Model

6.4.6 Understanding User Behavior via Feature Weights

6.5 Conclusion and Future Work

Chapter 7 Summary and Future Directions


Chapter 1

Introduction

Search and information retrieval technologies have significantly transformed the way people seek information and acquire knowledge from the internet. The effectiveness of Web search engines such as Google, Bing and Yahoo significantly affects the quality of our life and our productivity. To further improve the search accuracy and usability of the current-generation search engines, one of the most important research challenges is for a search engine to accurately understand the user’s intent or information need behind a query. Accurate understanding of user-issued queries also enables new types of applications that help the user make decisions and finish tasks directly, resulting in a great increase in productivity.

However, accurate query understanding is not an easy task due to the following challenges. First, a query often contains misspelled or misused words, which leads to a gap between the ideal query in a user’s mind and the ill-formed query received by the search engine. Second, the linguistic structure of a query is never explicitly observed. A user query is usually short and ambiguous. It often has no standard grammar or has idiosyncratic grammar. Further, there is usually no capitalization or punctuation in a query. This lack of linguistic structure makes it hard to infer the semantics of a query with traditional Natural Language Processing techniques. Third, the intention of a user query is very difficult to infer in some complex situations. One example is the partial query: the user asks the search engine to suggest query completions dynamically, in real time, given only a short prefix.

To systematically improve query intent understanding, I propose a conceptual framework where there are different levels of query understanding. In this framework, the highest level of query understanding means knowing precisely the user’s interests, the complete linguistic and semantic structure of the query, and the temporal/spatial constraints. In the ideal case, a query can be transformed into an equivalent SQL-like structured query that rigorously defines all aspects of the user’s intention. Information retrieval with such a query representation will be similar to searching a web database with a structured query, and can thus be much more accurate than the current retrieval paradigm, which depends on the bag-of-words query representation. However, we may not always be able to infer such a deep understanding of a query accurately, so lower levels of query understanding are also required to improve robustness. These levels of query intent understanding are listed as follows:

• Query Alteration. Queries issued by users usually contain errors and misused words/phrases. Although a user might have a clear intent in her mind, inferring the query’s intent in this case becomes difficult because of the edit distance or vocabulary gap between the user’s ideal query and the query issued to the search engine. Query reformulation is to automatically find alternative forms of a query that eliminate or reduce such a gap. Effective query reformulation can help improve information retrieval in two ways. First, it can help infer the user’s intent even if the query is ill-formed. Second, retrieval models can be enhanced by transforming the query to its top reformulations. There are several types of query alterations, including query spelling correction, which is to transform a misspelled query into the correct form; query expansion, which is to expand the original query by adding related terms so as to make the query intent more evident; and query rewriting, which is to transform the original query into a new form that is more representative. Note that on this level of query intent understanding, a user query is represented by a bag of words. This query representation has proven to be very effective for ad-hoc retrieval, where relevant documents are returned for a keyword-based query.

Natural Language texts, web search queries are characterized by lack of explicit linguistic structures such as quotation, capitalization, punctuation and standard grammar. This level of query intent understanding aims at discovering the latent linguistic signals of a user query, such as segmentation (phrase boundaries), part-of-speech tags and capitalization. Successful discovery of latent query linguistic signals can help improve ad-hoc retrieval. For example, query segmentation can reduce the number of candidate terms to score when terms are generalized to phrases instead of individual words. It will also improve the scoring function by leveraging the proximity constraint in a segmentation. Furthermore, this level of query intent understanding builds the foundation for deeper-level understanding and representation of query intent: semantic understanding and interpretation of queries.

• Semantic Annotation of Queries. The bag-of-words query representation has been a great success in document retrieval, where relevant documents are returned for the query. However, as search has expanded to many other types of applications, the bag-of-words representation is not sufficient to support the requirements of these applications. One such application is entity search. Nowadays the web contains a wealth of structured data, such as various entity databases, web tables, etc. There is a growing trend in search engines to match unstructured user queries to these structured data sources. In entity-centric search, schema annotation of queries is required to match the schema of the structured data sources. Another application is to present direct answers and facts in response to queries. Examples include the instant answer box of modern web search engines, and computational knowledge engines like Wolfram Alpha. In order to understand the intention of the user and judge whether a direct answer should be triggered, the query has to be transformed into semantic components, which are further processed and matched against the knowledge bases. This level of query intent understanding goes beyond the bag-of-words representation. It aims at deciphering the semantic structure of queries, that is, the meaning of every query segment and the relations among them. It involves tasks such as target type classification, which is to infer the category/domain of the query; named entity and attribute recognition and disambiguation; schema matching to a catalog, which is to match a query to a predefined catalog schema such as product tables; and semantic role labelling in queries.

• Dynamic Understanding and Representation of Queries. In the above levels of query understanding, it is assumed that we have the whole query in advance. However, in some application scenarios this assumption may not hold. For example, in query auto-completion, the objective is to predict the user’s preferred query from a partial query prefix, dynamically and in real time. Because only very limited information is exposed by the partial query, contextual information must be taken into account to predict the user’s real information need. Such context includes the user’s query history, the user’s short-term interaction with the search engine, and external knowledge from other knowledge bases.

In this thesis, I will present several studies that I have conducted to address important research questions for advancing different levels of query understanding. First, I addressed query spelling correction, a major type of query reformulation, by modeling all major types of spelling errors with a generalized Hidden Markov Model [56]. Subsequently a Latent Structural SVM model was proposed to model the same problem [30]. Second, query segmentation is the most important type of query linguistic signal; I proposed a probabilistic model to identify query segmentations using clickthrough data [57]. Third, synonym finding is an important challenge for semantic annotation of queries; I proposed a compact clustering framework to mine entity attribute synonyms for a set of inputs jointly with multiple information sources [59]. Subsequently I applied a similar clustering framework to detect synonymous query templates for attribute intents [58]. Finally, in the direction of dynamic understanding and representation of queries, I introduced a new kind of user behavior called “Horizontal Skipping Bias”, which is unique to the query auto-completion process. I then proposed a two-dimensional click model to model the query auto-completion process. My research in this thesis is summarized as follows.

• Query Spelling Correction by a Generalized Hidden Markov Model. Queries issued by web search engine users usually contain errors, which makes inferring the query’s intent difficult. As an important type of query alteration, query correction aims at transforming the potentially misspelled query into its correct form. Existing methods in the literature have two major drawbacks. First, they are unable to handle important types of spelling errors, such as concatenation and splitting. Second, they usually employ a heuristic filtering step to select a working set of top-K candidates for final scoring, leading to non-optimal predictions. In [56] I addressed both limitations by proposing a novel generalized Hidden Markov Model with discriminative training that can not only handle all the major types of spelling errors in a single unified framework, but also efficiently evaluate all the candidate corrections to ensure that a globally optimal correction is found. I also built a query speller system called CloudSpeller [55], which won second place in the Microsoft Speller Challenge [4].

• Query Segmentation using Clickthrough. One difficulty in deeper-level query intent understanding is that web search queries usually lack explicit linguistic signals such as quotation, capitalization, punctuation and standard grammar. Successful detection of such latent query linguistic signals can help improve retrieval performance. I addressed the identification of the most important type of query linguistic signal: query segmentation. Existing segmentation models either use labeled data to predict the segmentation boundaries, for which the training data is expensive to collect, or employ an unsupervised strategy based only on a large text corpus, which might be inaccurate because of the lack of relevance information. To address these limitations, in [57] I proposed a probabilistic model to exploit clickthrough data for query segmentation. I further studied how to properly interpret the segmentation results and utilize them to improve retrieval accuracy. Specifically, I proposed an integrated language model based on the standard bigram language model to utilize the probabilistic structure obtained through query segmentation.

• Entity Attribute Synonyms Mining. If the query and the structured data sources are all written in well-formed text, the task of semantic annotation of queries is manageable. However, a big challenge of query semantic annotation is to handle variation in text expression. Users may use many different alternative forms of the same entity when they input a query. For example, people usually issue “LOTR show time” instead of “lord of the rings show time”, as typing the short form of the entity is more convenient. Discovering such alternative surface forms of entities and attributes is crucial for improving query semantic annotation, and thus for advancing query intent understanding and retrieval. However, most previous approaches focused on a single feature, such as distributional similarity or query-entity clicks. In addition, previous methods usually look for synonyms one entity at a time, ignoring the information provided by the entire set of inputs. In [59] I proposed a compact clustering framework to identify synonyms for a set of entity attributes jointly. Signals from multiple sources of information are integrated for finding synonyms.

• Modeling Query Auto-completion by a Two-dimensional Click Model. Notice that the above levels of query understanding are static, meaning that we have to know the entire query in advance. However, in many scenarios this is not possible: users want to be assisted when they have given only a small query hint. This is called dynamic query understanding. One example is predicting users’ intended queries from partial queries in the task of query auto-completion. For this purpose, in [54] I introduced a new kind of user behavior called “Horizontal Skipping Bias”, which is unique to the query auto-completion process. Based on this discovery, I proposed a novel Two-Dimensional Click Model of users’ behavior in QAC; the resulting relevance model ranks QAC suggestions significantly better than most of the existing click models.

The rest of the thesis is organized as follows. In Chapter 2, I review the studies related to this thesis. Chapters 3, 4, 5, and 6 present the approaches for Query Spelling Correction, Query Segmentation, Entity Attribute Synonyms Mining and Query Auto-Completion, respectively. Finally, I summarize my thesis work and point out some potential future directions in Chapter 7.


Chapter 2

Related Work

In this chapter, we review related work in the existing literature on these topics: (1) query spelling correction and query reformulation; (2) query segmentation; (3) synonym mining; (4) query auto-completion; (5) query log and other resources.

2.1 Query Spelling Correction and Query Reformulation

Query spelling correction has long been an important research topic [50]. Traditional spellers focused on dealing with non-word errors caused by misspelling a known word as an invalid word form. A common strategy at that time was to utilize a trusted lexicon and certain distance measures, such as Levenshtein distance [52]. The lexicon in traditional spellers is usually small due to the high cost of constructing it manually. Consequently, many valid word forms such as human names and neologisms are rarely included in the lexicon. Later, statistical generative models were introduced for spelling correction, in which the error model and the n-gram language model are identified as two critical components. Brill and Moore demonstrated that a better statistical error model is crucial for improving a speller’s accuracy [17]. But building such an error model requires a large set of manually annotated word correction pairs, which is expensive to obtain. Whitelaw et al. alleviated this problem by leveraging the Web to automatically discover misspelled/corrected word pairs [97].
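The lexicon-plus-distance strategy described above can be illustrated with a minimal sketch. This is not any specific system from the cited literature: the function names, the toy lexicon and the `max_distance` threshold are assumptions chosen for illustration only.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertion, deletion and substitution."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def correct(word: str, lexicon: set, max_distance: int = 2) -> str:
    """Return the closest lexicon entry, or the word itself if none is close.

    The threshold max_distance is a hypothetical parameter, not from the source.
    """
    if word in lexicon:
        return word
    # Ties are broken alphabetically so the result is deterministic.
    best = min(lexicon, key=lambda w: (levenshtein(word, w), w))
    return best if levenshtein(word, best) <= max_distance else word


toy_lexicon = {"government", "garment", "governor"}
print(correct("goverment", toy_lexicon))
```

Such a speller can only propose words that are in the lexicon, which is why names and neologisms missing from a small hand-built lexicon are a problem for this approach.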


With the advent of the Web, research on spelling correction has received much more attention, particularly the correction of search engine queries. This setting raises many research challenges that do not exist in traditional spelling correction. More specifically, there are many more types of spelling errors in search queries, such as misspelling, concatenation/splitting of query words, and misuse of legitimate yet inappropriate words. Research in this direction includes utilizing large web corpora and query logs [23, 27, 6], training phrase-based error models from clickthrough data [82] and developing additional features [32]. However, two important challenges are under-addressed in these approaches: correcting splitting and concatenation errors, and ensuring a complete search of the candidate space to evaluate an effective scoring function.

Query spelling correction also shares similarities with many other NLP tasks, such as speech recognition and machine translation. In many of these applications, HMM has been found very useful [46, 89].

Query reformulation is a broader topic which naturally subsumes query spelling correction. Besides correcting the misspelled query, query reformulation also needs to modify an ineffective query so that it is more suitable for the search intent. For this purpose, many research topics have been studied. Query expansion expands the query with additional terms to enrich the query formulation [99, 72, 68]. Other query reformulation methods replace inappropriate query terms with effective keywords to bridge vocabulary gaps [95]. In particular, there is a research attempt [35] to use a unified model for a broad set of query refinements such as correction, segmentation and even stemming. However, it treats query correction and splitting/merging as separate tasks, which does not reflect real-world queries. It also has very limited ability for query correction. For example, it only allows a one-letter difference in deletion/insertion/substitution errors.


2.2 Query Segmentation

Query segmentation models have been studied in recent literature [43, 47, 15, 101, 84, 36]. Initially, the mutual information (MI) between adjacent words in a query was employed to segment queries with a cutoff [43, 47]. The major drawback of MI-based methods is that they are unable to detect multi-word or phrase-based dependencies. Compared with MI-based models, supervised query segmentation approaches can achieve higher accuracies [15, 101]. For example, by considering every boundary between two consecutive query words as a binary decision variable, Bergsma and Wang [15] train the weights of a linear decision function with a set of syntactic and shallow semantic features extracted from labeled data. However, their focus on noun phrase features may not be appropriate for the segmentation of web queries. Furthermore, acquiring training labels demands a great deal of manual effort that may not scale to the web. As another supervised learning approach, Yu and Shi [101] apply conditional random fields to obtain good query segmentation performance. However, their method relies on field information features specific to databases, which are not available for general unstructured web queries. Moreover, the evaluation was conducted only on synthetic data, which is less desirable than real query data.
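The MI-with-cutoff idea can be sketched as follows, using pointwise mutual information between adjacent words. This is an illustrative reading, not the exact formulation of [43, 47]: the counts, the `cutoff` value and the function names are assumptions made up for the example.

```python
import math


def pmi(w1, w2, unigram, bigram, total):
    """Pointwise mutual information log(P(w1,w2) / (P(w1) P(w2))) from raw counts."""
    joint = bigram.get((w1, w2), 0)
    if joint == 0:
        return float("-inf")  # unseen pair: always treated as a boundary
    return math.log(joint * total / (unigram[w1] * unigram[w2]))


def segment(query, unigram, bigram, total, cutoff):
    """Place a segment boundary wherever adjacent-word PMI falls below cutoff."""
    words = query.split()
    if not words:
        return []
    segments = [[words[0]]]
    for prev, cur in zip(words, words[1:]):
        if pmi(prev, cur, unigram, bigram, total) >= cutoff:
            segments[-1].append(cur)  # strong association: same segment
        else:
            segments.append([cur])    # weak association: start a new segment
    return [" ".join(s) for s in segments]


# Toy counts, invented for illustration.
unigram = {"new": 1000, "york": 200, "times": 500, "subscription": 50}
bigram = {("new", "york"): 150, ("york", "times"): 80, ("times", "subscription"): 1}
print(segment("new york times subscription", unigram, bigram, total=100000, cutoff=2.0))
```

Because each decision looks only at one adjacent pair, the sketch makes the stated drawback visible: a longer phrase is kept together only if every adjacent pair inside it happens to clear the cutoff.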

Tan and Peng [84] introduce a generative model in the unsupervised setting by adopting n-gram frequency counts from a large text corpus and computing the segment scores via expectation maximization (EM). It also utilizes Wikipedia as another term in the minimum description length objective function. A similar probabilistic model is also proposed in [102], but this model focuses on parsing noun phrases and is thus not generally applicable to web queries. Our work is also related to retrieval models that capture higher-order dependencies of query terms. There are several research attempts to incorporate term dependency in the query or document into retrieval models [65]. For example, some attempts have been made to add proximity heuristics to the vector space model or the generative query language model [67, 86]. However, these methods rely on heuristics, which is not a principled way of incorporating term dependency. More unified higher-order language models have been studied by Srikanth et al. (Biterm LM) [81]. However, their assumption that every position is dependent is too strong. In fact, word dependency is stronger within a semantic unit than across units, which is what we assume in our work. Language models with query syntactics [33] assume a structure on the query, but they are too complex to estimate accurately. More importantly, the query syntactic models usually take only the top (most likely) query structure in the modeling process. However, it is more appropriate to assess the probability of all possible segmentations if multiple structures have comparable probabilities of representing the query.

2.3 Synonym Mining

There is a rich body of work on the general topic of automatic synonym discovery. This research topic can be divided into sub-areas, including finding word synonyms, entity/attribute synonyms, and related query identification. Identifying word-level synonyms from text is a traditional topic in the NLP community. Such synonyms can be discovered using simple dictionary-based methods such as WordNet and Wikipedia redirects; distributional similarity based methods [60, 61]; and approximate string matching approaches [69]. In this work, we are interested in finding entity attribute synonyms, which usually have more domain context than plain words.

Researchers have employed several similarity metrics to find synonyms from web data. Such similarities include distributional similarity [60, 61], coclick similarity [24, 20], pointwise mutual information [88], and co-occurrence statistics [11]. Unlike these works, our work introduces a novel similarity metric called categorical pattern similarity for jointly finding synonyms from a set of attributes.

Although several similarity metrics have been introduced to find synonyms, most previous approaches use only a single metric in their model. [20] tries to combine multiple metrics, but the thresholds for the individual metrics are chosen manually, which leads to a high-precision but potentially low-recall approach.

2.4 Query Auto-Completion

Query auto-completion is the process of searching for a user’s preferred queries given the prefix the user has issued. Most existing work focuses on relevance ranking. For this purpose, traditional QAC engines rely on query popularity counts. However, with popularity counts alone it is impossible to return queries matching a user’s specific preferences, such as location and freshness in time. Recent QAC models employ learning-based strategies that incorporate several global and personal features [10, 79]. But there is no consensus on how to optimally train the relevance model.
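A popularity-count baseline of the kind mentioned above can be sketched in a few lines. The log contents, the function name `complete` and the cutoff `k` are hypothetical; a production QAC engine would use an index such as a trie rather than a linear scan.

```python
from collections import Counter


def complete(prefix, query_counts, k=3):
    """Return the k most frequent past queries that start with the prefix."""
    matches = [(q, c) for q, c in query_counts.items() if q.startswith(prefix)]
    # Sort by descending count, then alphabetically to make ties deterministic.
    matches.sort(key=lambda qc: (-qc[1], qc[0]))
    return [q for q, _ in matches[:k]]


# Toy query log, invented for illustration.
log = Counter(["weather", "weather", "web md", "weather chicago", "web md", "web md"])
print(complete("we", log, k=2))
```

Every user typing the same prefix receives the same list, which is the limitation noted above: nothing in the counts reflects an individual user’s location or the freshness of a query.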

The QAC process is very personal in nature, so it is almost impossible to obtain a labeled dataset by third-party annotation. Existing methods use clicks as a relevance surrogate, and train a model that tries to maximize the clicks. The straightforward way is to utilize only the data for the last prefix, and use the skip-above and skip-next hypotheses to obtain a set of labels. We can then use learning-based algorithms to train a model that linearly combines a set of features. Most recently, [79] introduced a different strategy, which exploits all suggested queries for all simulated prefixes of the clicked query. However, this automatic labeling strategy might be problematic, since it may introduce many false negative examples where the user did not look down the list at all. Had she examined the list, she would have clicked a query. So there is a lot of uncertainty in the labeled examples introduced by this method.
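One common reading of the skip-above and skip-next hypotheses is sketched below for the suggestion list shown at the last prefix: the clicked suggestion is a positive example, suggestions ranked above it are negatives (skip-above), and the suggestion immediately below it is a negative (skip-next). This interpretation and the function name are assumptions for illustration; the source does not spell out the exact labeling rule.

```python
def label_last_prefix(suggestions, clicked_index):
    """Derive click-based labels for one suggestion list.

    Returns a dict mapping suggestion -> 1 (positive) or 0 (negative).
    Suggestions further below the click are left unlabeled, since under this
    reading there is no evidence that the user examined them.
    """
    labels = {}
    for i, suggestion in enumerate(suggestions):
        if i == clicked_index:
            labels[suggestion] = 1   # clicked: treated as relevant
        elif i < clicked_index:
            labels[suggestion] = 0   # skip-above: examined but skipped
        elif i == clicked_index + 1:
            labels[suggestion] = 0   # skip-next: assumed examined, not chosen
    return labels


shown = ["facebook", "fandango", "fafsa", "fargo", "farmers insurance"]
print(label_last_prefix(shown, clicked_index=2))
```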

Besides relevance modeling, previous work has addressed other aspects of QAC. For example, [12, 38] studied the space efficiency of indexes for QAC. [96, 41] investigated efficient algorithms for QAC. [29] addressed the problem of suggesting query completions even if the prefix is misspelled. And [8] studied context-sensitive QAC for mobile search. QAC is a complex process where a user goes through a series of interactions with the QAC engine before clicking on a query. Deciphering the user behavior in QAC is an interesting and challenging task. Despite its importance, little research has been done in this direction, mainly because of the lack of suitable QAC logs. In this work we are the first to collect a high-resolution QAC log and attempt to model the user behaviors.

Modeling query auto-completion is closely related to click models. In the field of document retrieval, the main purpose of modeling a user’s clicks is to infer the intrinsic relevance between the query and the document by explaining the position bias. The position bias assumption was first introduced by Granka et al. [34], stating that a document at a higher rank tends to attract more clicks. Richardson et al. [75] attempted to model the true relevance of documents in lower positions by imposing a multiplicative factor. Later the examination hypothesis was formalized in [26], with a key assumption (Cascade Assumption) that a user will click on a document if and only if that document has been examined and it is relevant to the query. Several extensions were subsequently proposed, such as the User Browsing Model (UBM) [31], the Bayesian Browsing Model [62], the General Click Model [105] and the Dynamic Bayesian Network model (DBN) [22]. Despite the abundance of click models, no existing click model can be applied directly to QAC without considerable modification. The click model most similar to our work is [104], which models users’ clicks on a series of queries in a session. However, because of the major differences between QAC and document retrieval, our model structure is very different from [104]. To the best of our knowledge, our work is the first click model for the QAC process.
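To make the examination idea concrete, the following sketch computes click probabilities under the textbook cascade model, in which the user scans results from top to bottom, clicks the first result found relevant, and then stops. This is a simplified illustration under that stated assumption, not the exact formulation of [26] or of any later model in the list above; the relevance values are made up.

```python
def cascade_click_probs(relevance):
    """Click probability per rank under a simple cascade assumption.

    relevance[i] is the probability that the result at rank i is found relevant
    once examined. A result is examined only if every result above it was
    examined and not clicked, so
        P(click at i) = relevance[i] * prod_{j < i} (1 - relevance[j]).
    """
    probs = []
    p_examined = 1.0  # the top result is always examined
    for r in relevance:
        probs.append(p_examined * r)
        p_examined *= (1.0 - r)  # continue only if this result was not clicked
    return probs


print(cascade_click_probs([0.5, 0.5, 0.2]))
```

In the example the second result is as relevant as the first but receives half as many clicks, which is the position bias that click models try to separate from intrinsic relevance.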

2.5 Query Log and Other Resources

To better understand query intents, we have to leverage users’ short-term and long-term interactions with the search engine. Such activities are usually recorded in the query log. Query logs have been utilized for diverse applications such as query segmentation [15, 84], query reformulation [44, 78], relevance ranking [74, 42, 5] and query clustering [100]. Recently, [85, 83] modeled long-term query logs using language models to improve personalized search. Other types of large-scale resources can also be exploited to tackle the challenging problem of query intent understanding. These resources include the web n-gram language model [3], knowledge bases such as Freebase and Wikipedia, and large-scale web page corpora such as ClueWeb [1, 2].


Chapter 3

Query Spelling Correction by a Hidden Markov Model

3.1 Introduction

Queries issued by web search engine users often contain errors and misused words or phrases. Although a user might have a clear intent in mind, inferring the query’s intent in this case becomes difficult because of the edit distance or vocabulary gap between the user’s ideal query and the query issued to the search engine. Query reformulation aims to automatically find alternative forms of a query that eliminate or reduce this gap. Query reformulation has proved very effective in improving the performance of information retrieval. There are several types of query reformulation, including query spelling correction, query expansion and query rewriting. In this chapter we focus on one important type of query reformulation: query spelling correction.

The ability to automatically correct potentially misspelled queries has become an indispensable component of modern search engines. People make spelling errors frequently. In particular, search engine users are more likely to commit misspellings in their queries because in most scenarios they are exploring unfamiliar content. Automatic spelling correction for queries helps the search engine better understand the users’ intents and can therefore improve the quality of the search experience. However, query spelling correction is not an easy task, especially under strict efficiency constraints. In Table 3.1 we summarize the major types of misspellings in real search engine queries. Users not only make typos on single words (insertion, deletion and substitution), but can also easily mess up word boundaries (concatenation and splitting). Moreover, different types of misspelling can be committed in the same query, making it even harder to correct. Unfortunately, no existing query spelling correction approaches in the literature are able to correct all major types of errors, especially splitting and concatenation errors. To the best of my knowledge, the only work that can potentially address this problem is [35], in which a Conditional Random Field (CRF) model is proposed to handle a broad set of query refinements. However, this work considers query correction and splitting/merging as different tasks, hence it is unable to correct queries with mixed types of errors, such as substitution and splitting errors in one query. In fact, splitting and merging are two important error types in query spelling correction, and a major research challenge of query spelling correction is to accurately correct all major types of errors simultaneously.

Table 3.1: Major Types of Query Spelling Errors

Type               Example         Correction
In-Word
  Insertion        esspresso       espresso
  Deletion         vollyball       volleyball
  Substitution     comtemplate     contemplate
  Mis-use          capital hill    capitol hill
Cross-Word
  Concatenation    intermilan      inter milan
  Splitting        power point     powerpoint

Another major difficulty in automatic query spelling correction is the huge search space. Theoretically, any sequence of characters could potentially be the correction of a misspelled query. It is clearly intractable to enumerate and evaluate all possible sequences for the purpose of finding the correct query. Thus a more feasible strategy is to search in the space of all combinations of candidate words that are in a neighborhood of each query word based on edit distance. The assumption is that a user’s spelling error on a single word is unlikely to be too dramatic, so the correction is most likely in the neighborhood defined by edit distance. Unfortunately, even in this restricted space, the current approaches still cannot enumerate and evaluate all the candidates because their scoring functions involve complex features that are expensive to compute. As a result, a separate filtering step must first be used to prune the search space so that the final scoring can be done on a small working set of candidates. Take [32] as an example of a two-stage method: in the first stage, a Viterbi or A* search algorithm is used to generate a small set of the most promising candidates, and in the second stage different types of features of the candidates are computed and a ranker is employed to score the candidates. However, this two-stage strategy has a major drawback in computing the complete working set. Since the filtering stage uses a non-optimal objective function to ensure efficiency, it is quite possible that the best candidate is filtered out in the first stage, especially because we cannot afford a large working set since the correction must be done online while a user is entering a query. The inability to search the complete space of candidates leads to non-optimal correction accuracy.

In this chapter, we propose a generalized Hidden Markov Model (gHMM) for query spelling correction that can address the deficiencies of the existing approaches discussed above. The proposed gHMM can model all major types of spelling errors, thus enabling consideration of multiple types of errors in query spelling correction. In the proposed gHMM, the hidden states represent the correct forms of words, and the outcomes are the observed (potentially) misspelled terms. In addition, each state is associated with a type, indicating a merging, splitting or in-word transformation operation. The proposed HMM is generalized in the sense that it allows adjustment of both emission probabilities and transition probabilities to accommodate non-optimal parameter estimation. Unfortunately, such an extension of the HMM makes it impossible to use a standard EM algorithm for parameter estimation. To solve this problem, we propose a perceptron-based discriminative training method to train the parameters of the HMM.

Moreover, a Viterbi-like search algorithm for top-K paths is designed to efficiently obtain a small number of highly confident correction candidates. This algorithm can handle splitting/merging of multiple words. It takes into account major types of local features such as the error model, the language model, and state type information. The error model is trained on a large set of query correction pairs from the web, and a web-scale language model is obtained by leveraging the Microsoft Web N-gram service [3].

We conducted an extensive evaluation of the proposed gHMM. For this purpose, we have constructed a query correction dataset from real search logs, which has been made publicly available. Experimental results verify that the gHMM can effectively correct all major types of query spelling errors. They also reveal that the gHMM can run as efficiently as the commonly used noisy channel model, while achieving much better results in obtaining the candidate space of query corrections. Therefore, in addition to being used as a standalone query correction module, the proposed gHMM can also be used as a first-stage filtering module to more effectively support other, more complicated scoring functions such as those using complex global features.

3.2 Problem Setup and Challenges

Formally, let $\Sigma$ be the alphabet of a language and $L \subset \Sigma^+$ be a large lexicon of the language. We define the query spelling correction problem as:

Given a query $q \in \Sigma^+$, generate the top-K most effective corrections $Y = (y_1, y_2, \ldots, y_k)$, where $y_i \in L^+$ is a candidate correction and $Y$ is sorted according to the probability of $y_i$ being the correct spelling of the target query.

It is worth noting that from a search engine perspective, the ideal output $Y'$ should be sorted according to the probability of $y_i$ retrieving the most satisfying results in search. However, in practice it is very difficult to measure this satisfaction because, unlike in ad hoc retrieval where the query is given in its correct form, here the real query is unknown. As a result, different corrections could simply lead to queries with different meanings, and it would be very subjective to determine which query actually satisfies the user. In this work, we are mostly concerned with the lexical and semantic correctness of queries, with the assumption that correcting misspelled query terms will most likely lead to improved retrieval accuracy.

The problem of query spelling correction is significantly harder than traditional spelling correction. Previous research shows that approximately 10-15% of search queries contain spelling errors [27]. First, it is difficult to cover all the different types of errors. The spelling errors generally fall into one of the following four categories: (1) in-word transformation, e.g. insertion, deletion, or misspelling of characters. This type of error is the most frequent in web queries, and it is not uncommon for up to 3 or 4 letters to be misspelled; (2) mis-use of a valid word, e.g. “persian golf” → “persian gulf”. This is also a type of in-word transformation error; (3) concatenation of multiple words, e.g. “unitedstatesofamerica” → “united states of america”; (4) splitting a word into parts, e.g. “power point slides” → “powerpoint slides”. Among all these types, the splitting and concatenation errors are especially challenging to correct. Indeed, no existing approaches in the academic literature can correct these two types of errors. Yet it is important to correct all types of errors, because users might commit different types of errors, or even commit them at the same time. A main goal of this work is to develop a new HMM framework that can model and correct all major types of errors including splitting and concatenation.

Second, it is difficult to ensure a complete search of the candidate space because the candidate space is very large. Existing work addresses this challenge with a two-stage method, which searches for a small set of candidates with simple scoring functions and then re-ranks these candidates. Unfortunately, the simple scoring function used in the first stage cannot ensure that the candidate corrections nominated in the first stage always contain the best correction. Thus, no matter how effective the final scoring function is, we may miss the best correction simply because of the use of two separate stages. In this chapter, we address this challenge by developing a generalized HMM that can both be scored efficiently, to ensure a complete search of the candidate space, and accurately correct all types of errors in a unified way.


3.3 A Generalized HMM for Query Spelling Correction

Our algorithm accepts a query as input, and then generates a small list of ranked corrections as output by a generalized Hidden Markov Model (gHMM). It is trained by a discriminative method with labeled spelling examples. Given a query, it scores candidate spelling corrections in a one-stage fashion and outputs the top-K corrections, without using a re-ranking strategy. Other components of our algorithm include a large clean lexicon, the error model and the language model. In this section we will focus on the gHMM model structure, its discriminative training, and the efficient computation of spelling corrections.

3.3.1 The gHMM Model Structure

We propose a generalized HMM to model the spelling correction problem. We call it a generalized HMM because there are several important differences between it and the standard HMM, which will be explained later. Without loss of generality, let an input query be $q = q_{[1:n]}$ and a corresponding correction be $y = y_{[1:m]}$, where $n$ and $m$ are the lengths of the query and the correction, which might or might not be equal. Here we introduce a hidden state sequence $z = z_{[1:n]} = (s_1, s_2, \ldots, s_n)$, in which $z$ and $q$ have the same length. An individual state $s_i$ is represented by a phrase corresponding to one or more terms in the correction $y_{[1:m]}$. Together, the phrases representing $z$ are equal to $y$. Therefore, finding the best-K corrections $Y = (y_1, y_2, \ldots, y_k)$ is equivalent to finding the best-K state sequences $Z = (z_1, z_2, \ldots, z_k)$. In addition, there is a type $t$ associated with each state, indicating the operation such as substitution, splitting or merging. Also, in order to facilitate the merging state we introduce a NULL state. The NULL state is represented by an empty string, and it does not emit any phrase. There can be multiple consecutive NULL states followed by a merging state. Table 3.2 summarizes the state types and the spelling errors they correspond to. Having the hidden states defined, the hypothesized process of observing a misspelled query is as follows:


1. sample a state $s_1$ and state type $t_1$ from the state space $\Omega$ and the type set $T$;

2. emit a word $q_1$, or the empty string if $s_1$ is a NULL state, according to the type-specific error model;

3. transit to $s_2$ with type $t_2$ according to the state transition distribution, and emit another word, or multiple words in $q_{[1:n]}$ if $s_2$ is a merging state;

4. continue until the whole (potentially) misspelled query $q$ is observed.

Table 3.2: State Types in gHMM

State Type                Operation                          Spelling Errors
In-word Transformation    Deletion                           Insertion
                          Insertion                          Deletion
                          Substitution                       Substitution
Mis-use                   Transformation                     Word Mis-use
Merging                   Merge Multiple Words               Splitting
Splitting                 Split one Word to Multiple Words   Concatenation

Figure 3.1 illustrates our gHMM model with a concrete example. In this example, there are three potential errors with different error types: “goverment” → “government” (substitution), “home page” → “homepage” (splitting), and “illinoisstate” → “illinois state” (concatenation). The state path shown in Figure 3.1 is one of the state sequences that can generate the query. Take state $s_3$ for example: $s_3$ is represented by the phrase homepage. Since $s_3$ is a merging state, it emits the phrase home page with probability $P(\text{home page} \mid \text{homepage})$, and $s_3$ is reached from state $s_2$ with probability $P(s_3 \mid s_2)$. With this model, we are able to come up with arbitrary corrections instead of limiting ourselves to an incomplete set of queries from the query log. By simultaneously modeling the misspellings on word boundaries, we are able to correct the query in a more integrated manner.
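The state path of this example can be written down as a small data structure. The following is only an illustrative sketch, not the implementation used in this work: the class name `State`, the type labels and the helper `correction_from_states` are hypothetical names introduced here. It shows that one state is kept per query word and that the correction is recovered by concatenating the phrases of the non-NULL states.

```python
from dataclasses import dataclass

# State type labels; the exact strings are illustrative choices.
IN_WORD, MERGING, SPLITTING, NULL = "in-word", "merging", "splitting", "NULL"

@dataclass(frozen=True)
class State:
    phrase: str  # corrected phrase; empty string for a NULL state
    type: str    # one of the labels above

def correction_from_states(states):
    """Concatenate the phrases of all non-NULL states into the correction y."""
    return " ".join(s.phrase for s in states if s.type != NULL)

# One state per word of the query "goverment home page of illinoisstate".
path = [
    State("government", IN_WORD),
    State("", NULL),                    # "home" is absorbed by the next state
    State("homepage", MERGING),         # emits the two query words "home page"
    State("of", IN_WORD),
    State("illinois state", SPLITTING),
]

print(correction_from_states(path))
```

Keeping the state sequence the same length as the query is what lets the later dynamic program advance one query word at a time, even when words are merged or split.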

Figure 3.1: Illustration of the gHMM Model

[Figure content: the query “goverment home page of illinoisstate” is generated by the state sequence s1 = government (type: in-word transformation), s2 = NULL (type: NULL), s3 = homepage (type: merging), s4 = of (type: in-word transformation), s5 = illinois state (type: splitting).]

3.3.2 Generalization of HMM Scoring Function

For a standard HMM [73], let $\theta = \{A, B, \pi\}$ be the model parameters of the HMM, representing the transition probabilities, emission probabilities and initial state probabilities respectively. Given a list of query words $q_{[1:n]}$ (obtained by splitting on whitespace), the state sequence $z^* = (s^*_1, s^*_2, \ldots, s^*_n)$ that best explains $q_{[1:n]}$ can be calculated by:

$$z^* = \arg\max_{z} P(z \mid q_{[1:n]}, A, B, \pi) \tag{3.1}$$

However, theoretically the phrase in a state can be chosen arbitrarily, so estimating $\{A, B, \pi\}$ in such a large space is almost impossible in the standard HMM framework. In order to overcome this difficulty, the generalized Hidden Markov Model proposed in this work generalizes the standard HMM as follows: (1) the gHMM introduces a state type for each state, which indicates the correction operation and can reduce the search space effectively; (2) it adopts feature functions to parameterize the measurement of the probability of a state sequence given a query. Such treatment can not only map the transition and emission probabilities to feature functions with a small set of parameters, but can also add additional feature functions such as those incorporating state type information. Another important benefit of the feature function representation is that we can use discriminative training on the model with labeled spelling corrections, which leads to a more accurate estimation of the parameters.

Formally, in our gHMM model, there is a one-to-one relationship between the states in a state sequence and the words in the original query. For a given query $q = q_{[1:n]}$ and sequence of states $z = (s_1, s_2, \ldots, s_n)$, we define a context $h_i$ for every state in which an individual correction decision is made. The context is defined as $h_i = \langle s_{i-1}, t_{i-1}, s_i, t_i, q_{[1:n]} \rangle$, where $s_{i-1}, t_{i-1}, s_i, t_i$ are the previous and current state and type decisions and $q_{[1:n]}$ are all the query words.

The generalized HMM model measures the probability of a state sequence by defining feature vectors on the state pairs. A feature vector is a function that maps a context-state pair to a $d$-dimensional vector. Each component of the feature vector is an arbitrary function operating on $(h, z)$. In particular, in this study we define two kinds of feature vectors. One is $\phi_j(s_{i-1}, t_{i-1}, s_i, t_i)$, $j = 1 \ldots d$, which measures the interdependency of adjacent states. We can map this function to a kind of transition probability measurement. The other kind of feature function, $f_k(s_i, t_i, q_{[1:n]})$, $k = 1 \ldots d'$, measures the dependency between the state and its observation. We can consider it as a kind of emission probability from the standard HMM point of view. Such a feature vector representation of the HMM was introduced by Collins [25] and successfully applied to the POS tagging problem.

Specifically, we have designed several feature functions as follows. We define a function $\phi(s_{i-1}, t_{i-1}, s_i, t_i)$ as

$$\phi_1(s_{i-1}, t_{i-1}, s_i, t_i) = \log P_{LM}(s_i \mid s_{i-1}, t_{i-1}, t_i) \tag{3.2}$$

to measure the language model probabilities of two consecutive states, where $P_{LM}(s_i \mid s_{i-1})$ is the bigram probability calculated using the Microsoft Web N-gram Service [3]. The computation of $P_{LM}(s_i \mid s_{i-1})$ may depend on the state types, such as in a merging state.


The emission feature functions $f_k(s_i, t_i, q_{[1:n]})$ depend on the query words and state type, measuring the emission probability of a state. For example, we define

$$f_1(s_i, t_i, q_{[1:n]}) = \begin{cases} \log P_{err}(s_i, q_i) & \text{if } q_i \text{ is in-word transformed to } s_i \text{ and } q_i \notin \text{Lexicon } L \\ 0 & \text{otherwise} \end{cases} \tag{3.3}$$

as a function measuring the emission probability given that the state type is in-word transformation and $q_i$ is out of dictionary, e.g. “goverment” → “government”. $P_{err}(s_i, q_i)$ is the emission probability computed by an error model, which measures the probability of mistyping “government” as “goverment”. Similarly, we define

$$f_2(s_i, t_i, q_{[1:n]}) = \begin{cases} \log P_{err}(s_i, q_i) & \text{if } t_i \text{ is splitting and } q_i \in \text{Lexicon } L \\ 0 & \text{otherwise} \end{cases} \tag{3.4}$$

to capture the emission probability if the state is of splitting type and $q_i$ is in the dictionary, e.g. “homepage” → “home page”, and

$$f_3(s_i, t_i, q_{[1:n]}) = \begin{cases} \log P_{err}(s_i, q_i) & \text{if } t_i \text{ is Mis-use and } q_i \in \text{Lexicon } L \\ 0 & \text{otherwise} \end{cases} \tag{3.5}$$

to get the emission probability if a valid word is transformed to another valid word. Note that in Equations (3.3), (3.4) and (3.5) we use the same error model $P_{err}(s_i, q_i)$ (see Section ?? for detail) to model the emission probabilities from merging, splitting errors etc. in the same way as in-word transformation errors. However, we assign different weights to the transformation probabilities resulting from different error types via discriminative training on a set of labeled query-correction pairs.
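To make the three emission features concrete, the sketch below encodes Equations (3.3) to (3.5) directly. It is a toy illustration under stated assumptions: the lexicon, the table `P_ERR` and its probability values are made up for the example and stand in for the trained error model, and each function receives the single aligned query word rather than the full query $q_{[1:n]}$.

```python
import math

# Tiny stand-in for the clean lexicon L (assumed contents).
LEXICON = {"government", "homepage", "home", "page", "capital", "capitol"}

# Hypothetical error-model probabilities P_err(state phrase, query word).
P_ERR = {
    ("government", "goverment"): 0.01,
    ("home page", "homepage"): 0.02,
    ("capitol", "capital"): 0.005,
}

def log_p_err(state_phrase, query_word):
    """Log error-model probability, with a small floor for unseen pairs."""
    return math.log(P_ERR.get((state_phrase, query_word), 1e-9))

def f1(s, t, q_word):
    # Eq. (3.3): in-word transformation of an out-of-lexicon query word.
    return log_p_err(s, q_word) if t == "in-word" and q_word not in LEXICON else 0.0

def f2(s, t, q_word):
    # Eq. (3.4): splitting state whose query word is itself in the lexicon.
    return log_p_err(s, q_word) if t == "splitting" and q_word in LEXICON else 0.0

def f3(s, t, q_word):
    # Eq. (3.5): a valid word replaced by another valid word.
    return log_p_err(s, q_word) if t == "mis-use" and q_word in LEXICON else 0.0

print(f1("government", "in-word", "goverment"))
print(f3("capitol", "mis-use", "capital"))
```

Because each feature is zero outside its own error type, the learned weight of each feature acts as a per-type scaling of the shared error model.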

Overall, we have designed a set of feature functions that all rely on local dependencies, ensuring that the top-K state sequences can be computed efficiently by Dynamic Programming.

After establishing the feature vector representation, the log-probability of a state sequence and its corresponding types, $\log P(z, t \mid q_{[1:n]})$, is proportional to:

$$\text{Score}(z, t) = \sum_{i=1}^{n} \sum_{j=1}^{d} \lambda_j \phi_j(s_{i-1}, t_{i-1}, s_i, t_i) + \sum_{i=1}^{n} \sum_{k=1}^{d'} \mu_k f_k(s_i, t_i, q_{[1:n]}) \tag{3.6}$$

where $\lambda_j, \mu_k$ are the component coefficients that need to be estimated. The best state sequence can be found by:

$$(z^*, t^*) = \arg\max_{z, t} \text{Score}(z, t) \tag{3.7}$$

Note that the form of Score(z, t) is similar to the objective function of a Conditional Random Field model [51], but with the important difference that there is no normalization term in our model. This difference also enables the efficient search of top-K state sequences (equivalent to top-K corrections) using Dynamic Programming, which will be introduced shortly.
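The scoring function of Eq. (3.6) is a weighted sum of local feature values, which the following sketch spells out. It is a minimal illustration, assuming a toy bigram table and a toy emission feature (both made up here), a start symbol `<s>` for the state before $s_1$, and an explicit position argument `i` in the emission features in place of the implicit alignment between $s_i$ and $q_i$.

```python
# Toy bigram log-probabilities standing in for the web-scale language model.
BIGRAM_LOGP = {("<s>", "government"): -2.0, ("government", "of"): -1.0}

def phi_lm(prev_s, prev_t, s, t):
    """Transition feature: log P_LM(s | prev_s), flat penalty for unseen bigrams."""
    return BIGRAM_LOGP.get((prev_s, s), -5.0)

def f_changed(s, t, query, i):
    """Emission feature: 0 if the query word is kept, -1 if it is rewritten."""
    return 0.0 if s == query[i] else -1.0

def score(states, types, query, phis, fs, lam, mu):
    """Eq. (3.6): sum over positions of weighted transition and emission features."""
    total = 0.0
    prev_s, prev_t = "<s>", "START"
    for i, (s, t) in enumerate(zip(states, types)):
        total += sum(l * phi(prev_s, prev_t, s, t) for l, phi in zip(lam, phis))
        total += sum(m * f(s, t, query, i) for m, f in zip(mu, fs))
        prev_s, prev_t = s, t
    return total

query = ["goverment", "of"]
types = ["in-word", "in-word"]
corrected = score(["government", "of"], types, query, [phi_lm], [f_changed], [1.0], [2.0])
unchanged = score(["goverment", "of"], types, query, [phi_lm], [f_changed], [1.0], [2.0])
print(corrected, unchanged)
```

Since every term depends only on the current state, its predecessor and the query, the score of a sequence can be built up incrementally, which is the property the decoding algorithm relies on.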

3.3.3 Discriminative Training

Motivated by ideas introduced in [25], we propose a perceptron algorithm to train the gHMM model. To the best of our knowledge, this is the first attempt to use a discriminative approach to train an HMM for the problem of query spelling correction. We now describe how to estimate the parameters $\lambda_j, \mu_k$ from a set of <query, spelling correction> pairs. The estimation procedure follows the perceptron learning framework. Take $\lambda_j$ for example. We first set all the $\lambda_j$ at random. For each query $q^i$, we search for the most likely state sequence with types, $z^i_{[1:n_i]}, t^i_{[1:n_i]}$, using the current parameter settings. This search process is described in Algorithm 2 with $K = 1$. After that, if the best decoded sequence is not correct, we update $\lambda_j$ by simple addition: we promote $\lambda_j$ by adding the $\phi_j$ values computed between the query and the labeled correction $y'$, and demote $\lambda_j$ by subtracting the sum of all $\phi_j$ values computed between the query and the top-ranked prediction. We repeat this process for several iterations until convergence. Finally, in steps 11 and 12, we average all $\lambda^{o,i}_j$ over the iterations to get the final estimate of $\lambda_j$, where $\lambda^{o,i}_j$ is the stored value of the parameter $\lambda_j$ after the $i$-th training example is processed in iteration $o$. A similar procedure applies to $\mu_k$. The detailed steps are listed in Algorithm 1. Note that in steps 7 and 8 the feature functions $\phi_j(q^i, y'^i, t'^i)$ and $f_k(q^i, y'^i, t'^i)$ depend on unknown types $t'^i$, which are inferred by computing the best word-level alignment between $q^i$ and $y'^i$. This discriminative training algorithm converges after several iterations.
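The promote/demote update and the final averaging can be sketched on a deliberately simplified problem. The code below is an assumption-laden illustration rather than the training code of this work: each query is a single word, `decode` enumerates a hand-written candidate list instead of running Algorithm 2 with K = 1, the two features (in-lexicon, unchanged) are invented for the example, and weights start at zero instead of random values so that the run is deterministic.

```python
# Toy lexicon and candidate lists (assumed for illustration).
LEXICON = {"government", "of", "off"}
CANDIDATES = {"goverment": ["goverment", "government"], "of": ["off", "of"]}

def features(query, cand):
    """Two toy features: candidate is in the lexicon; candidate equals the query."""
    return [1.0 if cand in LEXICON else 0.0, 1.0 if cand == query else 0.0]

def decode(query, w):
    """Stand-in for top-1 decoding: best candidate under the current weights."""
    return max(CANDIDATES[query],
               key=lambda c: sum(wi * xi for wi, xi in zip(w, features(query, c))))

def train_averaged_perceptron(examples, dim, epochs=5):
    w = [0.0] * dim
    total = [0.0] * dim   # running sum of the weights stored after each example
    steps = 0
    for _ in range(epochs):
        for query, gold in examples:
            pred = decode(query, w)
            if pred != gold:
                g, p = features(query, gold), features(query, pred)
                # Promote the labeled correction, demote the wrong prediction.
                w = [wi + gi - pi for wi, gi, pi in zip(w, g, p)]
            total = [ti + wi for ti, wi in zip(total, w)]
            steps += 1
    return [ti / steps for ti in total]  # averaged parameters

examples = [("goverment", "government"), ("of", "of")]
w_avg = train_averaged_perceptron(examples, dim=2, epochs=5)
print(w_avg, [decode(q, w_avg) for q, _ in examples])
```

Averaging over all stored weight vectors damps the oscillation that the raw perceptron shows while the two examples pull the weights in different directions.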

3.3.4 Query Correction Computation

Once the optimal parameters are obtained by the discriminative training procedure introduced above, the final top-K corrections can be computed directly, avoiding the need for a separate stage of candidate re-ranking. Because the feature functions rely only on local dependencies, the top-K corrections can be searched efficiently via Dynamic Programming. This procedure involves three major steps: (1) candidate state generation; (2) score function evaluation; (3) filtering.

At the first step, for each word in query $q$, we generate a set of state candidates with types. The phrase representations of such states are in Lexicon $L$ and within edit distance $\delta$ of the query word. Then a set of state sequences is created by combining these states.


Algorithm 1: Discriminative Training of gHMM

input : A set of <query, spelling correction> pairs $q^i_{[1:n_i]}, y'^i_{[1:m_i]}$ for $i = 1 \ldots n$
output: Optimal estimate of $\hat{\lambda}_j, \hat{\mu}_k$, where $j \in \{1, \ldots, d\}$, $k \in \{1, \ldots, d'\}$

1  Init: set $\hat{\lambda}_j, \hat{\mu}_k$ to random numbers;
2  for $o \leftarrow 1$ to $O$ do
3      for $i \leftarrow 1$ to $n$ do
           /* identify the best state sequence and the associated types of the i-th query with the current parameters via Algorithm 2: */
4          $z^i_{[1:n_i]}, t^i_{[1:n_i]} = \arg\max_{u_{[1:n_i]}, t_{[1:n_i]}} \text{Score}(u, t)$
           /* where $u_{[1:n_i]} \in S^{n_i}$, and $S^{n_i}$ is the set of all possible state sequences given $q^i_{[1:n_i]}$ */
5          if $z^i_{[1:n_i]} \neq y'^i_{[1:m_i]}$ then
6              update and store every $\lambda_j, \mu_k$ according to:
7              $\lambda_j = \lambda_j + \sum_{i=1}^{n_i} \phi_j(q^i, y'^i, t'^i) - \sum_{i=1}^{n_i} \phi_j(q^i, z^i, t^i)$
8              $\mu_k = \mu_k + \sum_{i=1}^{n_i} f_k(q^i, y'^i, t'^i) - \sum_{i=1}^{n_i} f_k(q^i, z^i, t^i)$
9          else
10             Do nothing
   /* Average the final parameters by: */
11 $\hat{\lambda}_j = \sum_{o=1}^{O} \sum_{i=1}^{n} \lambda^{o,i}_j / nO$, where $j \in \{1, \ldots, d\}$
12 $\hat{\mu}_k = \sum_{o=1}^{O} \sum_{i=1}^{n} \mu^{o,i}_k / nO$, where $k \in \{1, \ldots, d'\}$
13 return parameters $\hat{\lambda}_j, \hat{\mu}_k$;


In addition, for each state sequence we have created, we also create another state sequence by adding a NULL state at the end, facilitating a (potential) following merging state. It is important to note that if the δ is too small, it will compromise the final results due to the premature pruning of state sequences. In this work δ = 3 is chosen in order to introduce adequate possible state sequences.
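The in-word part of candidate state generation can be sketched as a lexicon lookup bounded by edit distance. This is a simplified illustration, assuming a tiny made-up lexicon and a brute-force scan; the function names are hypothetical, and NULL, merging and splitting candidates are left out.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b (row-by-row dynamic program)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]

def candidate_states(query_word, lexicon, delta=3):
    """Lexicon entries within edit distance delta of the query word, nearest first."""
    scored = [(edit_distance(w, query_word), w) for w in lexicon]
    return [w for d, w in sorted(scored) if d <= delta]

lexicon = {"government", "governments", "banana"}
print(candidate_states("goverment", lexicon, delta=3))
```

A linear scan over the lexicon is only workable for a toy vocabulary; a production system would need an index that retrieves the neighborhood without comparing against every entry.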

At the score function evaluation step, we update the scores for each state sequence according to Eq. (3.6). The evaluation differs for sequences with different ending state types. Firstly, for a sequence ending with a NULL state, we do not evaluate the scoring function. Instead, we only need to keep track of the state representation of its previous state. Secondly, a sequence ending with a merging state merges the previous one or more consecutive NULL states, and the scoring function takes into account the information stored in the previous NULL states. For instance, for $\phi_1(s_{i-1}, t_{i-1} = \text{NULL}, s_i, t_i = \text{merging})$ we have

$$\phi_1(s_{i-1}, \text{NULL}, s_i, \text{merging}) = \log P_{LM}(s_i \mid s_{i-2}) \tag{3.8}$$

i.e. we skip the NULL state and pass the previous state representation to the merging state. In this way, we can evaluate the scoring function over multiple consecutive NULL states followed by a merging state, which enables correction by merging multiple query words. Thirdly, for a sequence ending with a splitting state, the score is accumulated over all bigrams within the splitting state. For example,

$$\phi_1(s_{i-1}, t_{i-1}, s_i, t_i = \text{splitting}) = \log P_{LM}(w_1 \mid s_{i-1}) + \sum_{j=1}^{k-1} \log P_{LM}(w_{j+1} \mid w_j) \tag{3.9}$$

where $s_i = w_1 w_2 \ldots w_k$. On the other hand, the evaluation of $f_k(s_i, t_i, q_{[1:n]})$ is easier because the query word is used to calculate these functions.
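The type-dependent evaluation of the language model feature can be sketched as follows. This is an illustration under assumptions: the bigram table holds made-up values, NULL states are stored as empty strings, the sentence start is denoted `<s>`, and the bigram context is taken to be the last word of the preceding phrase, which is a simplification chosen here rather than something stated above.

```python
# Toy bigram log-probabilities log P_LM(word | previous word); assumed values.
BIGRAM_LOGP = {
    ("government", "homepage"): -2.0,
    ("of", "illinois"): -1.5,
    ("illinois", "state"): -0.5,
}

def log_p_lm(word, prev_word):
    return BIGRAM_LOGP.get((prev_word, word), -10.0)

def phi1(prev_phrases, phrase, state_type):
    """LM feature of the state `phrase`, given the phrases of all earlier states."""
    if state_type == "NULL":
        return 0.0  # nothing is scored until the following merging state
    # Skip back over NULL states to the last real phrase, as in Eq. (3.8).
    context = next((p for p in reversed(prev_phrases) if p), "<s>")
    words = phrase.split()
    total = log_p_lm(words[0], context.split()[-1])
    # Inside a splitting state, add the bigrams between its own words, Eq. (3.9).
    for left, right in zip(words, words[1:]):
        total += log_p_lm(right, left)
    return total

print(phi1(["government", ""], "homepage", "merging"))
print(phi1(["government", "", "homepage", "of"], "illinois state", "splitting"))
```

For a single-word state the loop body never runs, so the same function covers in-word, mis-use and merging states as the special case k = 1.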

At the final step, we filter out most of the state sequences and keep only the top-K state sequences at each position corresponding to each query word. In sum, we have proposed and implemented a Dynamic Programming algorithm (see Algorithm 2) for efficiently computing the top-K state sequences (corrections). If there are $n$ words in a query, and the maximum number of candidate states for each query word is $M$, the computational complexity of finding the top-K corrections is $O(n \cdot K \cdot M^2)$.

Algorithm 2: Decoding Top-K Corrections

input : A query $q_{[1:n]}$, parameters $\vec{\lambda}, \vec{\mu}$
output: top K state sequences with highest likelihood

   /* Z[i, s_i]: top K state sequences for sub-query $q_{[1:i]}$ that end with state s_i. For each z ∈ Z[i, s_i], phrase denotes the representation and score denotes the likelihood of z given $q_{[1:i]}$. */
   /* Z[i]: top state sequences for all Z[i, s_i]. */
1  Init Z[0] = {}
2  for i ← 1 to n do
       /* for term q_i, get all candidate states */
3      S ← s_i, ∀ s_i : edit_dist(s_i, q_i) ≤ δ, s_i has type s_i.type
4      for s_i ∈ S do
5          for z ∈ Z[i − 1] do
6              a ← new state sequence
7              a.phrase ← z.phrase ∪ {s_i}
8              update a.score according to s_i.type and Eq. (3.6), Eq. (3.8) and Eq. (3.9)
9              Z[i, s_i] ← a
           /* delay truncation for NULL states */
10         if s_i.type ≠ NULL and i ≠ n then
11             sort Z[i, s_i] by score
12             truncate Z[i, s_i] to size K
13 sort Z[n] by score
14 truncate Z[n] to size K
15 return Z[n];
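A stripped-down version of the top-K dynamic program is sketched below. It is not Algorithm 2 itself: NULL, merging and splitting states are omitted, the function and table names are hypothetical, and the language model and emission scores are toy values. It keeps, for every candidate state at the current position, the K best partial sequences ending in that state, which is sufficient for an exact top-K result when the score is a sum of terms over adjacent states.

```python
import heapq

def top_k_decode(query, candidates, local_score, k=3):
    """Return the k best (score, state sequence) pairs for the query."""
    beams = {None: [(0.0, [])]}  # ending state -> up to k (score, path) pairs
    for word in query:
        new_beams = {}
        for s in candidates(word):
            extended = [(score + local_score(prev, s, word), path + [s])
                        for prev, paths in beams.items()
                        for score, path in paths]
            new_beams[s] = heapq.nlargest(k, extended, key=lambda x: x[0])
        beams = new_beams
    finals = [item for paths in beams.values() for item in paths]
    return heapq.nlargest(k, finals, key=lambda x: x[0])

# Toy candidate lists and scores (assumed values).
CANDS = {"goverment": ["goverment", "government"], "of": ["of", "off"]}
LM = {(None, "government"): -1.0, (None, "goverment"): -6.0,
      ("government", "of"): -0.5, ("government", "off"): -3.0,
      ("goverment", "of"): -2.0, ("goverment", "off"): -4.0}

def local_score(prev, s, word):
    emission = 0.0 if s == word else -0.5  # flat penalty for rewriting a word
    return LM.get((prev, s), -10.0) + emission

result = top_k_decode(["goverment", "of"], CANDS.__getitem__, local_score, k=2)
print(result)
```

With n query words, at most M candidates per word and K sequences kept per state, each position extends at most K·M partial sequences into M states, which matches the $O(n \cdot K \cdot M^2)$ bound given above.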


3.4 Experiments and Results

In order to test the effectiveness and efficiency of our proposed gHMM model, in this section we conduct extensive experiments on two web query spelling datasets. We first introduce the datasets and describe the evaluation metrics. Then we compare our model with other baselines in terms of accuracy and runtime.

3.4.1 Dataset Preparation

The experiments are conducted on two query spelling correction datasets. One is the TREC dataset, based on the publicly available TREC queries (2008 Million Query Track). This dataset contains 5892 queries and the corresponding corrections annotated by the MSR Speller Challenge [4] organizers. There can be more than one plausible correction for a query. In this dataset only 5.3% of queries are judged as misspelled.

We have also annotated another dataset that contains 4926 MSN queries, where for each query there is at most one correction. Three experts were involved in the annotation process. For each query, we consult the spellers of two major search engines (i.e. Google and Bing). If they agree on the returned result (including the case in which the query is left unchanged), we take it as the corrected form of the input query. If the two results differ, at least one human expert manually annotates the most likely corrected form of the query. Finally, about 13% of queries are judged as misspelled in this dataset, which is close to the error rate of real web queries. This dataset is publicly available to all researchers.

We divide the TREC and MSN datasets into training and test sets evenly. Our gHMM model as well as the baselines are trained on the training sets and finally evaluated on the TREC test set containing 2947 queries and MSN test set containing 2421 queries.


3.4.2 Evaluation Metrics

We evaluate our system based on the evaluation metrics proposed in the Microsoft Speller Challenge [3], including expected precision, expected recall and expected F1 measure.

As used in previous discussions, $q$ is a user query and $Y(q) = (y_1, y_2, \ldots, y_k)$ is the set of system outputs with posterior probabilities $P(y_i \mid q)$. Let $S(q)$ denote the set of plausible spelling variations annotated by the human experts for $q$. Expected precision is computed as:

$$\text{precision} = \frac{1}{|Q|} \sum_{q \in Q} \sum_{y \in Y(q)} I_p(y, q) P(y \mid q) \tag{3.10}$$

where $I_p(y, q) = 1$ if $y \in S(q)$, and 0 otherwise. Expected recall is defined as:

$$\text{recall} = \frac{1}{|Q|} \sum_{q \in Q} \sum_{a \in S(q)} I_r(Y(q), a) / |S(q)| \tag{3.11}$$

where $I_r(Y(q), a) = 1$ if $a \in Y(q)$ for $a \in S(q)$, and 0 otherwise. The expected F1 measure can be computed as:

$$F1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}} \tag{3.12}$$
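The three metrics can be computed in a few lines, as the following sketch shows. The data layout (dictionaries keyed by query) and the two example queries with their posteriors are assumptions made for illustration; they are not taken from the datasets used in this chapter.

```python
def expected_metrics(outputs, gold):
    """outputs: {query: {correction: posterior}}; gold: {query: set of corrections}."""
    n = len(gold)
    # Eq. (3.10): posterior mass that falls on annotated corrections.
    precision = sum(p for q in gold
                    for y, p in outputs[q].items() if y in gold[q]) / n
    # Eq. (3.11): fraction of annotated corrections that appear in the output.
    recall = sum(sum(1 for a in gold[q] if a in outputs[q]) / len(gold[q])
                 for q in gold) / n
    # Eq. (3.12): harmonic mean, guarded against a zero denominator.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

outputs = {
    "goverment": {"government": 0.75, "goverment": 0.25},
    "power point": {"power point": 1.0},
}
gold = {
    "goverment": {"government"},
    "power point": {"powerpoint", "power point"},
}
precision, recall, f1 = expected_metrics(outputs, gold)
print(precision, recall, f1)
```

Precision rewards concentrating posterior probability on annotated corrections, whereas recall only checks whether each annotated correction appears anywhere in the output list, which is why returning more candidates can raise recall without affecting precision much.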

3.4.3 Overall Effectiveness

We first investigate the overall effectiveness of the gHMM model. For suitable query spelling correction baselines, especially approaches that can handle all types of query spelling errors, we first considered using the CRF model proposed in [35]. This method aims at a broad range of query refinements and hence might also be applicable to query correction. However, we decided not to compare against this model for the following reasons. Firstly, we communicated with the authors of [35] and learned that their program is not reusable. Secondly, as mentioned in Section 5.1, this work suffers from several drawbacks for query spelling correction: (1) it is unable to correct queries with mixed types of errors, such as substitution and splitting errors in one query, because the model treats query correction and splitting/merging as different tasks; (2) the model only allows a 1-character error for substitution/insertion/deletion, and the error model is trained on <query, correction> examples that contain only 1-character errors. Such a design is oversimplified for real-world queries, in which errors of more than 1 character are quite common. In fact, among the queries that contain spelling errors in the MSN dataset, about 40.6% contain errors of more than 1 character. Therefore the model in [35] is expected to have inferior performance.

For the reasons stated above, the best baseline method that we can compare with is the system that achieved the best performance in the Microsoft Speller Challenge [63] (we call it Lueck-2011). This system relies on candidate corrections from third-party toolkits such as hunspell and the Microsoft Word Breaker Service [93], and it re-ranks the candidates by a simple noisy channel model. We communicated with the author and obtained the corrections by running the Web API of this baseline approach. We also include a simple baseline called Echo, which just echoes the original query as the correction response with posterior probability 1. It reflects the basic performance of a naive method. Experiments are conducted on the TREC and MSN datasets.

We report the results of all methods in Table 3.3. In this experiment, up to the top 10 corrections are used in all approaches. The results in Table 3.3 indicate that gHMM outperforms Lueck-2011 significantly on recall and F1 on the TREC dataset. Lueck-2011 has a small advantage on precision, possibly due to better handling of unchanged queries. On the MSN dataset, which is considered harder since it has more misspelled queries, gHMM also achieves a high precision of 0.910 and recall of 0.966, both significantly better than those of Lueck-2011 (0.896 and 0.921). On another important performance metric, the F1 on misspelled queries (F1 Mis), gHMM outperforms Lueck-2011 by a large margin (0.550 vs. 0.391 on TREC and 0.566 vs. 0.363 on MSN). These results demonstrate that gHMM is very effective overall for handling all types of spelling errors in search queries.

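Table 3.3: gHMM Compared to Baselines

| Dataset | Method     | Precision | Recall | F1    | F1 Mis |
|---------|------------|-----------|--------|-------|--------|
| TREC    | Echo       | 0.949     | 0.876  | 0.911 | N/A    |
| TREC    | Lueck-2011 | 0.963     | 0.932  | 0.947 | 0.391  |
| TREC    | gHMM       | 0.960     | 0.976  | 0.968 | 0.550  |
| MSN     | Echo       | 0.869     | 0.869  | 0.869 | N/A    |
| MSN     | Lueck-2011 | 0.896     | 0.921  | 0.908 | 0.363  |
| MSN     | gHMM       | 0.910     | 0.966  | 0.937 | 0.566  |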
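As a quick consistency check on Table 3.3, the reported F1 values agree with the harmonic mean of the reported precision and recall, F1 = 2PR / (P + R). The sketch below recomputes them from the table. The function and variable names are illustrative, and it assumes nothing about how precision and recall themselves were computed.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    total = precision + recall
    return 2.0 * precision * recall / total if total else 0.0

# (dataset, method, precision, recall, reported F1), copied from Table 3.3.
TABLE_3_3 = [
    ("TREC", "Echo",       0.949, 0.876, 0.911),
    ("TREC", "Lueck-2011", 0.963, 0.932, 0.947),
    ("TREC", "gHMM",       0.960, 0.976, 0.968),
    ("MSN",  "Echo",       0.869, 0.869, 0.869),
    ("MSN",  "Lueck-2011", 0.896, 0.921, 0.908),
    ("MSN",  "gHMM",       0.910, 0.966, 0.937),
]

def max_f1_deviation(rows):
    """Largest absolute gap between recomputed and reported F1."""
    return max(abs(f1_score(p, r) - reported) for _, _, p, r, reported in rows)

if __name__ == "__main__":
    for dataset, method, p, r, reported in TABLE_3_3:
        print(f"{dataset:5s} {method:11s} recomputed={f1_score(p, r):.3f} reported={reported:.3f}")
    print("max deviation:", max_f1_deviation(TABLE_3_3))
```

Every recomputed value matches the reported one to within rounding at three decimal places.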

3.4.4 Results by Error Types

We further break down the results by manually classified error types, so that we can see more clearly the distribution of spelling error types and how well our gHMM model addresses each of them. We present the results of this analysis in Table 3.4, for our model only, on both datasets. The top 40 corrections are used since this setting achieves the best results. The breakdown shows that most queries fall into the "no error" group, which is easier to handle than the other three types. As a result, the overall excellent performance is mostly due to the system performing very well on the "no error" group. Indeed, the system has substantially lower precision on queries with the other three types of errors. The concatenation errors seem to be the hardest to correct, followed by the splitting errors, and the in-word transformation errors (insertion, deletion, substitution, and word misuse) seem to be relatively easier.

Table 3.4: Results by Spelling Error Type

| Dataset | Error Type     | %Queries | Precision | Recall | F1    |
|---------|----------------|----------|-----------|--------|-------|
| TREC    | no error       | 94.9     | 0.990     | 0.982  | 0.986 |
| TREC    | transformation | 3.3      | 0.388     | 0.840  | 0.531 |
| TREC    | concatenation  | 1.3      | 0.348     | 0.877  | 0.498 |
| TREC    | splitting      | 0.5      | 0.500     | 0.792  | 0.613 |
| MSN     | no error       | 86.9     | 0.978     | 1.0    | 0.989 |
| MSN     | transformation | 11.1     | 0.493     | 0.762  | 0.599 |
| MSN     | concatenation  | 1.7      | 0.150     | 0.600  | 0.240 |
| MSN     | splitting      | 0.6      | 0.429     | 0.571  | 0.490 |

Note: the percentages of queries might sum to more than 100% since one query may contain multiple types of errors.

3.4.5 gHMM for Working Set Construction

Since the gHMM can efficiently search the complete candidate space and compute the top-K spelling corrections in a one-stage manner, it is interesting to test its effectiveness for constructing a working set of candidate corrections, which enables more complex scoring functions to be used for spelling correction. For this purpose, we compare gHMM with the commonly used noisy channel model, whose parameters, namely the error model probabilities and the bigram language model probabilities, are estimated by the procedure described in previous sections. We use recall to measure the completeness of the constructed working set, because it represents the percentage of true corrections recovered given the number of predicted corrections. Table 3.5 shows the recall for different numbers of outputs. The recall of gHMM increases steadily as the number of outputs grows. By outputting only the top-5 corrections, gHMM reaches a recall of 0.969 on TREC and 0.964 on MSN. In contrast, the noisy channel model has a substantial gap in terms of recall compared to gHMM. This result demonstrates the effectiveness of gHMM in constructing a more complete working set of candidate corrections, which other re-ranking approaches can use to further improve correction accuracy.
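The working-set evaluation can be sketched as a recall-at-K computation. The version below is a simplification under stated assumptions: each query has exactly one ground-truth correction, and recall at K is the fraction of queries whose ground truth appears among the top K candidates. The queries, candidate lists, and ground truths are hypothetical and do not come from the TREC or MSN datasets.

```python
def recall_at_k(ranked_candidates, ground_truth, k):
    """Fraction of queries whose true correction is in the top-k candidates.

    ranked_candidates: dict mapping query -> list of candidates, best first.
    ground_truth: dict mapping query -> its single true correction.
    """
    if not ground_truth:
        return 0.0
    hits = sum(
        1
        for query, truth in ground_truth.items()
        if truth in ranked_candidates.get(query, [])[:k]
    )
    return hits / len(ground_truth)

# Hypothetical working sets produced by some candidate generator.
TOY_CANDIDATES = {
    "britny spears": ["britny spears", "britney spears", "brittany spears"],
    "power point": ["power point", "powerpoint"],
    "newyork times": ["newyork times", "new york times"],
    "weather": ["weather"],
}
# Hypothetical ground-truth corrections (one per query).
TOY_TRUTH = {
    "britny spears": "britney spears",
    "power point": "powerpoint",
    "newyork times": "new york times",
    "weather": "weather",
}

if __name__ == "__main__":
    for k in (1, 2, 3):
        print(f"recall@{k} = {recall_at_k(TOY_CANDIDATES, TOY_TRUTH, k):.2f}")
```

A generator whose recall saturates at a small K hands a later re-ranker a short list that still contains the right answer, which is the property this comparison measures.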

3.4.6 Efficiency

The runtime requirement of query correction is very stringent. Theoretically, the gHMM with local feature functions can search for the top-K corrections efficiently using our proposed top-K Viterbi algorithm. Here we make a direct comparison between the runtime of gHMM and
