3.1.3 Problem Statement
Exact MIPS. We study two variants of exact MIPS. The first finds, for each vector q ∈ Q, the k vectors of P with the largest inner product with q. Here k is an application-defined parameter. As discussed previously, this problem arises in recommender systems, where we want to retrieve the most relevant items (vectors of P) for each user (vector of Q).
Definition 3.1 (Top-k-MIPS). Given an integer k > 0, find for every q ∈ Q the set J ⊆ [n] of the k columns of P that attain the k largest values of $q^\top P$. Ties are broken arbitrarily.
Note that if Q has only one column (i.e., contains a single vector), the Top-k-MIPS problem is equivalent to top-k scoring with the linear scoring function $f(p) = q^\top p$ [Fagin et al., 2001]. In the general case, in which Q has multiple columns, Top-k-MIPS is equivalent to multi-query top-k scoring. In the literature, MIPS is usually defined for a single query, i.e., Q = (q). In this work, we focus on the general case in which Q contains multiple query vectors, i.e., the queries may arrive in batches. Our methods can also be used in a streaming setting, in which Q contains a single query vector. By reversing the roles of Q and P, we can also find the top-k queries for each probe vector.
The second problem, termed Above-θ-MIPS, asks to retrieve all pairs of vectors with inner product above some application-defined threshold θ. This problem is useful, for example, to determine all high-confidence facts in an open relation extraction scenario.
Definition 3.2 (Above-θ-MIPS). Given a threshold θ > 0, determine the set of large entries
$$\{\, (i, j) \in [m] \times [n] \mid [Q^\top P]_{ij} \ge \theta \,\}.$$
A simple solution to the above problems is to first compute $Q^\top P$ and then select the entries above the threshold (for Above-θ-MIPS) or the k largest entries per row (for Top-k-MIPS). We refer to this approach as Naive; it has time complexity O(mnr) and is infeasible for large problem instances. Recently, a number of algorithms for exact MIPS have been proposed [Curtin and Ram, 2014, Curtin et al., 2013, Ram and Gray, 2012]; all of these methods are based on suitable tree-based indexes built on P (see Section 3.7).
Approximate MIPS. Exact MIPS methods usually offer only limited speedup compared to naive search. Thus there has been significant interest in designing methods for approximate MIPS [Bachrach et al., 2014, Neyshabur and Srebro, 2015, Shrivastava and Li, 2014a,b]. Such methods trade result quality for lower computational cost. In many applications, high-quality approximate results are acceptable. For example, in recommender systems, finding good recommendations fast may be preferable to finding the best recommendations slowly.
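To make the Naive baseline described above concrete, the following is a minimal sketch in plain Python. It assumes that query and probe vectors are given as lists of floats rather than as matrix columns, and that indices are 0-based (the definitions use [m] and [n]). The function names `naive_top_k` and `naive_above_theta` are hypothetical and chosen for illustration only.

```python
import heapq


def inner(q, p):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(q, p))


def naive_top_k(queries, probes, k):
    """For each query, return the indices of the k probes with the
    largest inner product, in decreasing order of inner product."""
    result = []
    for q in queries:
        # One row of Q^T P: the inner product of q with every probe.
        scores = [inner(q, p) for p in probes]
        top = heapq.nlargest(k, range(len(probes)), key=scores.__getitem__)
        result.append(top)
    return result


def naive_above_theta(queries, probes, theta):
    """Return all (query index, probe index) pairs whose inner
    product is at least theta."""
    return [
        (i, j)
        for i, q in enumerate(queries)
        for j, p in enumerate(probes)
        if inner(q, p) >= theta
    ]


queries = [[1.0, 0.0], [1.0, 2.0]]
probes = [[2.0, 1.0], [0.0, 3.0], [1.0, 1.0]]
print(naive_top_k(queries, probes, 2))          # [[0, 2], [1, 0]]
print(naive_above_theta(queries, probes, 3.5))  # [(1, 0), (1, 1)]
```

Both functions evaluate all mn inner products of length r, which matches the O(mnr) cost stated above; the selection step adds only lower-order work per query.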
There are multiple conceivable ways to measure the quality of the results of an approximate MIPS algorithm with respect to a query q. A commonly used metric is recall, the fraction of the true results (the ones that an exact MIPS algorithm would produce) that appear in the result set produced by the approximate algorithm. Note that for Top-k-MIPS, both approximate and exact methods produce exactly k results, so that recall (the fraction of all true results that are returned) and precision (the fraction of returned results that are true) coincide.
For the Top-k-MIPS problem, recall indicates how many true results appear in the approximate top-k result for the query. However, the recall value gives no indication of the quality of the remaining (“false”) vectors in the approximate top-k list. To see why this might be of importance, consider again the recommender system scenario. Generally, we prefer methods that give good false results over methods that give bad false results, and recall does not distinguish these two cases. To formalize this intuition, denote by $s_1, s_2, \ldots, s_k$ the values of the inner products in the exact solution of a Top-k-MIPS problem in decreasing order, and by $\hat{s}_1, \hat{s}_2, \ldots, \hat{s}_k$ the corresponding result of an approximate algorithm. A measure that captures the difference between the result of the exact and the approximate method in absolute terms is the root mean square error (RMSE, [Bachrach et al., 2014]), defined as:
$$\mathrm{RMSE} = \sqrt{\frac{1}{k} \sum_{i=1}^{k} (s_i - \hat{s}_i)^2}. \tag{3.1}$$
Alternatively, we can quantify the difference relatively using the average relative error (ARE):
$$\mathrm{ARE} = \frac{1}{k} \sum_{i=1}^{k} \frac{s_i - \hat{s}_i}{s_i}. \tag{3.2}$$
We define the recall/RMSE/ARE for a set of queries by taking the average of the recall/RMSE/ARE over all queries.
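The three quality measures and their per-query averaging can be sketched as follows. This is an illustrative sketch, not the evaluation code used in this work: the function names are hypothetical, the score lists are assumed to be sorted in decreasing order and of equal length k, and the ARE computation assumes that no exact score is zero.

```python
import math


def recall(exact_ids, approx_ids):
    """Fraction of the exact result set that also appears in the
    approximate result set (assumes a non-empty exact set)."""
    exact = set(exact_ids)
    return len(exact & set(approx_ids)) / len(exact)


def rmse(exact_scores, approx_scores):
    """Root mean square error between exact and approximate inner
    products, paired by rank (Eq. 3.1)."""
    k = len(exact_scores)
    total = sum((s - sh) ** 2 for s, sh in zip(exact_scores, approx_scores))
    return math.sqrt(total / k)


def are(exact_scores, approx_scores):
    """Average relative error, paired by rank (Eq. 3.2).
    Assumes every exact score is nonzero."""
    k = len(exact_scores)
    return sum((s - sh) / s for s, sh in zip(exact_scores, approx_scores)) / k


def mean_over_queries(metric, exact_per_query, approx_per_query):
    """Average a per-query metric over a set of queries."""
    values = [metric(e, a) for e, a in zip(exact_per_query, approx_per_query)]
    return sum(values) / len(values)


# One query with k = 2: the approximate method finds the best vector
# but returns a weaker vector in second place.
print(recall([3, 7], [3, 9]))        # 0.5
print(are([4.0, 2.0], [4.0, 1.0]))   # 0.25
print(rmse([4.0, 2.0], [4.0, 1.0]))  # sqrt(0.5), roughly 0.707
```

In this example recall reports only that one of the two results is wrong, whereas RMSE and ARE also reflect how far the score of the false result lies from the true second-best score.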
An approximate MIPS method that provides approximation guarantees takes as input an error bound on either recall, RMSE, or ARE and produces an approximate result that satisfies the specified bound (always or, in some cases, with high probability). Unfortunately, many of the existing approximate methods do not provide such guarantees and proceed in a best-effort manner instead. In Section 3.4, we propose a number of novel approximate methods that do provide error guarantees.