Algorithm for recall budget constraint (given ρ )

Information Extraction

Algorithm 8 Algorithm for recall budget constraint (given $\rho$)

Sort the entries in increasing order of precisions $p_w$; let $w_1, \dots, w_n$ be the entries in sorted order and $p_1 \le \dots \le p_n$ be the corresponding precisions.
Let $S_\ell = \{w_i : i \le \ell\}$, and $S_0 = \emptyset$.
Initialize $i = 1$.
while $i \le n$ do
    if $F_{\bar{S}_i} \ge F_{\bar{S}_{i-1}}$ then
        if $R_{\bar{S}_i} \ge \rho$ then
            $i = i + 1$, continue.
        else
            return $S_{i-1}$.
        end if
    end if
end while
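The greedy procedure of Algorithm 8 can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, each entry is assumed to be given by its precision $p_i$ and frequency $f_i$, and residual recall and F-score are computed from the sums $\sum p_i f_i$ and $\sum f_i$ over the remaining entries. Two behaviours are assumptions that the listing leaves open: the sketch stops as soon as the F-score would decrease, and it returns everything removed so far if the loop runs to the end.

```python
from typing import List, Sequence, Tuple


def residual_f_and_recall(good_kept: float, total_kept: float,
                          good_all: float) -> Tuple[float, float]:
    """Return (F-score, recall) of the entries that remain after refinement."""
    recall = good_kept / good_all if good_all > 0 else 0.0
    denom = good_all + total_kept
    # F = 2 / (1/R + 1/P) simplifies to 2 * good_kept / (good_all + total_kept).
    f_score = 2.0 * good_kept / denom if denom > 0 else 0.0
    return f_score, recall


def recall_budget_greedy(precisions: Sequence[float],
                         freqs: Sequence[float],
                         rho: float) -> List[int]:
    """Return the indices of the entries to remove, in removal order."""
    # Visit entries in increasing order of precision.
    order = sorted(range(len(precisions)), key=lambda i: precisions[i])
    good_all = sum(p * f for p, f in zip(precisions, freqs))
    good_kept = good_all
    total_kept = float(sum(freqs))
    prev_f, _ = residual_f_and_recall(good_kept, total_kept, good_all)

    removed: List[int] = []
    for idx in order:
        new_good = good_kept - precisions[idx] * freqs[idx]
        new_total = total_kept - freqs[idx]
        new_f, new_r = residual_f_and_recall(new_good, new_total, good_all)
        # Assumption: stop when F would decrease or the recall budget is violated.
        if new_f < prev_f or new_r < rho:
            break
        removed.append(idx)
        good_kept, total_kept, prev_f = new_good, new_total, new_f
    return removed
```

For example, with precisions `[0.1, 0.2, 0.9]`, equal frequencies of 10, and `rho = 0.8`, only the lowest-precision entry is removed, because removing the second one would drop the residual recall to 0.75.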

S returned by our algorithm satisfies

$$F_{\bar{S}} \ge \frac{2\sum_{i \in \bar{S}^*} p_i f_i}{\sum_{i \in \bar{S}^*} f_i + \sum_{i} p_i f_i + f_{\max}/p_{\ell+1}}.$$

Proof. The algorithm orders the elements according to their precision values and selects them in this order until the recall budget is exhausted or there is no further improvement in F-score. Let $p_i = p_{w_i}$ and $f_i = f_{w_i}$, where $p_1 \le \dots \le p_n$. Let $f_{\max} = \max\{f_1, f_2, \dots, f_n\}$, $S_i = \{w_j : 1 \le j \le i\}$ and $\bar{S} = A \setminus S$. Let $r_\ell = \sum_{i \in \bar{S}_\ell} p_i f_i = \sum_{i > \ell} p_i f_i$. By definition, $r^* + f_{\max} \ge r_\ell \ge r^*$.

Note that recall $R_{\bar{S}_\ell} = r_\ell / \sum_i p_i f_i \ge r^* / \sum_i p_i f_i = R_{\bar{S}^*} \ge \rho$. Due to the monotonicity check, the algorithm will return a solution with F-score $\ge F_{\bar{S}_\ell}$. Hence it suffices to give a lower bound on $P_{\bar{S}_\ell}$.

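To do this observe that

(1) $$\sum_{i \in \bar{S}_\ell \setminus \bar{S}^*} p_i f_i - \sum_{i \in \bar{S}^* \setminus \bar{S}_\ell} p_i f_i = \sum_{i \in \bar{S}_\ell} p_i f_i - \sum_{i \in \bar{S}^*} p_i f_i \le f_{\max},$$

and

(2) $$\sum_{i \in \bar{S}_\ell \setminus \bar{S}^*} f_i \le \frac{\sum_{i \in \bar{S}_\ell \setminus \bar{S}^*} p_i f_i}{p_{\ell+1}} \le \frac{\sum_{i \in \bar{S}^* \setminus \bar{S}_\ell} p_i f_i + f_{\max}}{p_{\ell+1}} \le \sum_{i \in \bar{S}^* \setminus \bar{S}_\ell} f_i + \frac{f_{\max}}{p_{\ell+1}}.$$

From (1) and (2),

$$\sum_{i \in \bar{S}_\ell} f_i \le \sum_{i \in \bar{S}^*} f_i + \frac{f_{\max}}{p_{\ell+1}}.$$

Hence

$$P_{\bar{S}_\ell} = \frac{r_\ell}{\sum_{i \in \bar{S}_\ell} f_i} \ge \frac{r^*}{\sum_{i \in \bar{S}^*} f_i + f_{\max}/p_{\ell+1}},$$

and

$$F_{\bar{S}_\ell} = \frac{2}{1/R_{\bar{S}_\ell} + 1/P_{\bar{S}_\ell}} \ge \frac{2}{1/R_{\bar{S}^*} + 1/P_{\bar{S}_\ell}} \ge \frac{2\sum_{i \in \bar{S}^*} p_i f_i}{\sum_{i \in \bar{S}^*} f_i + \sum_{i} p_i f_i + f_{\max}/p_{\ell+1}}.$$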
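The final inequality of the proof may be easier to follow once the reciprocals are written out. The expansion below is a sketch that uses only the definitions of residual recall and precision appearing in the proof, together with $r^* = \sum_{i \in \bar{S}^*} p_i f_i$, which is how $r^*$ is used there.

```latex
\begin{align*}
\frac{1}{R_{\bar{S}^*}} &= \frac{\sum_{i} p_i f_i}{r^*},
&
\frac{1}{P_{\bar{S}_\ell}} &\le \frac{\sum_{i \in \bar{S}^*} f_i + f_{\max}/p_{\ell+1}}{r^*},
\\[4pt]
\frac{2}{1/R_{\bar{S}^*} + 1/P_{\bar{S}_\ell}}
&\ge \frac{2\, r^*}{\sum_{i} p_i f_i + \sum_{i \in \bar{S}^*} f_i + f_{\max}/p_{\ell+1}}.
\end{align*}
```

Setting the error term $f_{\max}/p_{\ell+1}$ to zero in the last expression gives $2 r^* / (\sum_i p_i f_i + \sum_{i \in \bar{S}^*} f_i)$, which is exactly $F_{\bar{S}^*}$ under the same definitions.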

This lower bound differs from the optimal F-score $F_{\bar{S}^*}$ only by the addition of the error term $f_{\max}/p_{\ell+1}$ to the denominator. Individual frequencies are likely to be small when the given corpus and the dictionary are large. At the same time, $\ell$ and hence $p_{\ell+1}$ are determined solely by the recall budget. Therefore the error term $f_{\max}/p_{\ell+1}$ is likely to be much smaller than the denominator for a large dictionary.

Our experiments confirm this informal argument.

Optimal F-Score without constraints. Another surprising property of the algorithm we just described is that while it is not necessarily optimal in general, without the recall budget (i.e. with $\rho = 0$) this algorithm finds the solution with the globally optimal F-score. Naturally, the optimal solution can also be found using the slightly more involved Algorithm 7 with $k = n$. The proof of this claim can be found in [151].
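The claim for $\rho = 0$ can be checked empirically on tiny instances. The sketch below is only a sanity check and not a substitute for the proof in [151]. It assumes each entry is a hypothetical pair `(good, total)` of occurrence counts, so that its precision is `good / total`, and it assumes the greedy stops at the first step where the F-score would decrease. Exact rational arithmetic avoids floating-point ties; all function names are illustrative.

```python
from fractions import Fraction
from itertools import combinations
from typing import List, Sequence, Tuple

Entry = Tuple[int, int]  # (good occurrences, total occurrences), total >= 1


def f_score(kept: Sequence[Entry], good_all: int) -> Fraction:
    """Residual F-score 2r / (good_all + f) of the kept entries."""
    r = sum(g for g, _ in kept)
    f = sum(t for _, t in kept)
    return Fraction(2 * r, good_all + f) if good_all + f > 0 else Fraction(0)


def greedy_unconstrained(entries: Sequence[Entry]) -> Fraction:
    """Remove entries in increasing precision order while F does not decrease."""
    ordered: List[Entry] = sorted(entries, key=lambda e: Fraction(e[0], e[1]))
    good_all = sum(g for g, _ in ordered)
    best = f_score(ordered, good_all)
    for i in range(1, len(ordered) + 1):
        candidate = f_score(ordered[i:], good_all)
        if candidate < best:
            break
        best = candidate
    return best


def brute_force(entries: Sequence[Entry]) -> Fraction:
    """Best residual F-score over all subsets of kept entries (exponential)."""
    good_all = sum(g for g, _ in entries)
    n = len(entries)
    return max(
        f_score([entries[i] for i in keep], good_all)
        for size in range(n + 1)
        for keep in combinations(range(n), size)
    )
```

Comparing `greedy_unconstrained(entries)` with `brute_force(entries)` on small random inputs is a cheap way to gain confidence in the statement before reading the proof.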

5.4.2 Refinement Optimization for Multiple Dictionaries

The optimization problem becomes harder in the multiple dictionary case. In Section 5.4.2.1 we show that even for the simple firstname-lastname rule (rule R₄ in Figure 2.4) the optimization problem under the size constraint is NP-hard; this problem was shown to be polynomial-time solvable for a single dictionary. The case of the recall constraint has already been shown to be NP-hard even for a single dictionary. Then in Section 5.4.2.2 we discuss some efficient algorithms that we evaluate experimentally.

5.4.2.1 NP-hardness for Size Constraint

We give a reduction from the $k'$-densest subgraph problem in bipartite graphs, which has been proved to be NP-hard in [48]. Here the input is a bipartite graph $H(U, V, E)$ with $n'$ vertices and $m'$ edges, and an integer $k' < n'$. The goal is to select a subset of vertices $W \subseteq U \cup V$ such that $|W| = k'$ and the subgraph induced on $W$ has the maximum number of edges. We will denote the set of edges in the induced subgraph on $W$ (every edge in the subgraph has both its endpoints in $W$) by $E(W)$.
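To make the source problem of the reduction concrete, the following Python sketch computes the number of induced edges for a vertex subset and solves the densest-subgraph problem by exhaustive search. It is an illustration only: the function names are hypothetical, vertices are assumed to be strings, and the search is exponential, which is consistent with the problem being NP-hard.

```python
from itertools import combinations
from typing import Iterable, List, Set, Tuple

Edge = Tuple[str, str]  # (vertex in U, vertex in V)


def induced_edge_count(edges: Iterable[Edge], w: Set[str]) -> int:
    """Number of edges with both endpoints in w, i.e. the size of E(W)."""
    return sum(1 for u, v in edges if u in w and v in w)


def densest_subgraph_bruteforce(vertices: List[str], edges: List[Edge],
                                k: int) -> Tuple[Set[str], int]:
    """Try every subset of exactly k vertices and keep the densest one."""
    best_set: Set[str] = set()
    best_count = -1
    for subset in combinations(vertices, k):
        count = induced_edge_count(edges, set(subset))
        if count > best_count:
            best_set, best_count = set(subset), count
    return best_set, best_count
```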

For simplicity, first we prove a weaker claim: removing a subset $S$ such that the size of $S$ is exactly $k$ (as opposed to at most $k$) is NP-hard. Intuitively, the vertices correspond to entries and the edges correspond to occurrences. We show that if the induced subgraph on a subset of vertices of size at most $k'$ has a large number of edges, then removing the entries in the complement of this subset leaves exactly this induced subgraph, which gives a large residual F-score.

The complexity of the unconstrained multiple dictionary case ($k = n$ or recall budget $= 0$) is an interesting

Given an instance of the $k'$-densest subgraph problem, we create an instance of the dictionary refinement problem as follows. The vertices in $U$ and $V$ respectively correspond to the entries in the firstname and lastname dictionaries in the firstname-lastname rule. Every edge $(u, v) \in E$ corresponds to a unique provenance expression $\phi_{u,v} = uv$, where the entries $u$ and $v$ are chosen from these two dictionaries respectively. For each $(u, v) \in E$, there is one result with label 1 (Good), and one with label 0 (Bad). The parameter $k$ in the dictionary refinement problem is $k = n' - k'$. We show that there is a subset $W \subseteq U \cup V$ such that $|W| = k'$ and $|E(W)| \ge q$ if and only if there is a subset $S$ for the dictionary refinement problem such that $|S| = k$ and

$$F_{\bar{S}} \ge \frac{2}{\frac{m'}{q} + 2}.$$

The residual precision in the above reduction is a constant for all choices of $S$, and therefore the residual F-score is a monotone function of the residual recall. Hence the above reduction does not work for the relaxed constraint $|S| \le k$ (the residual recall is always maximized at $S = \emptyset$, i.e. when $k = 0$, independent of the $k'$-densest subgraph solution).
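The relationship between the edges that survive and the residual F-score in this reduction can be sketched as follows. The code is a hypothetical illustration, not part of the original construction: it assumes an edge's two results (one Good, one Bad) survive exactly when both endpoint entries are kept, so that the residual precision is always 1/2 and the residual recall is the fraction of surviving edges.

```python
from fractions import Fraction
from typing import Iterable, Set, Tuple

Edge = Tuple[str, str]  # (firstname entry, lastname entry)


def residual_f_score(edges: Iterable[Edge], kept: Set[str]) -> Fraction:
    """Residual F-score when only the entries in `kept` remain."""
    edge_list = list(edges)
    m = len(edge_list)
    # An edge survives only if both of its entries are kept.
    q = sum(1 for u, v in edge_list if u in kept and v in kept)
    if q == 0:
        return Fraction(0)
    precision = Fraction(q, 2 * q)  # one Good and one Bad result per edge
    recall = Fraction(q, m)         # surviving Good results over all Good results
    return 2 / (1 / recall + 1 / precision)
```

With three edges of which two survive, the function returns 4/7, matching $2 / (m'/q + 2)$ for $m' = 3$ and $q = 2$; keeping every entry always gives the largest value, which illustrates why the exact-size constraint is needed here.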

Outline of reduction for $|S| \le k$. To strengthen the above reduction to work for the relaxed constraint $|S| \le k$, we do the following. We retain the graph with a good and a bad occurrence for every edge as before. In addition, we add $s = mn$ Good results that are unrelated to anything. These results will make the differences in the recall between solutions tiny (while preserving monotonicity in the size of $E(W)$). For every original entry $u \in U \cup V$ corresponding to the vertices in the graph, we add $s$ Bad results $(u, u_i)$, $1 \le i \le s$, connected to it. The entries $\bigcup_{u \in U \cup V} \{u_i : 1 \le i \le s\}$ are called auxiliary-bad entries. This way the precision will be roughly equal to $\frac{1}{n-k}$ (since these results dominate the total count), and hence solutions with a smaller number of entries removed will have noticeably lower precision. It is also true that removing any of the auxiliary-bad entries will have a tiny effect, so any optimal solution for the refinement problem will always remove the entries corresponding to graph vertices $U \cup V$. The complete reduction is given in the appendix (Section A.3.3). This proves the following theorem:

Theorem 5.7. Maximization of the residual F-score for multiple dictionary refinement under size

In document Provenance and Uncertainty (Page 117-120)