The previous section proposed several changes to the system. To support the characteristics of the new dataset, the modified definition of tags, and the block/tab nature of the interface, we need to modify the infrastructure accordingly in order to provide a responsive system.

The first difference is how we index the events (i.e. the instances) and how we store the Co-occurrence Matrix. Since each request is now based on a block, the index and the matrix are also built per block. Our indexing structure is shown in Figure 6.8(a): for each block (such as “Content”) we have a corresponding indexing field, and in each field we index each possible value (such as “flight”) with a posting list of the events that have the ⟨block, value⟩ tag. The Co-occurrence Matrix is also built as an inverted index, called the Supporting Index, as in Figure 6.8(b). There are two retrieval fields in the Supporting Index: Co-existing Set (tagset) and Block. In the Co-existing Set field, each indexing tag has a posting list of tags that co-occur with that tag. In the Block field, each indexing block has a posting list of tags that appear in that block. By introducing the block names as either retrieval fields or indexing terms in a retrieval field, we can easily get the tags of a block by specifying the block name in both the Event Index and the Supporting Index.
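The two structures described above can be sketched with plain in-memory dictionaries. This is only an illustration under stated assumptions: the function names, the dictionary layout, and the representation of a tag as a `(block, value)` tuple are hypothetical, and the real system presumably uses a full-text inverted index rather than Python sets.

```python
from collections import defaultdict


def build_event_index(events):
    """Build index[block][value] -> set of event ids (the posting list).

    `events` maps an event id to an iterable of (block, value) tags.
    """
    index = defaultdict(lambda: defaultdict(set))
    for event_id, tags in events.items():
        for block, value in tags:
            index[block][value].add(event_id)
    return index


def build_supporting_index(events):
    """Build the two retrieval fields of the supporting index.

    "tagset": tag   -> set of tags that co-occur with it in some event
    "block":  block -> set of tags that appear in that block
    """
    tagset = defaultdict(set)
    block = defaultdict(set)
    for tags in events.values():
        tags = set(tags)
        for tag in tags:
            block[tag[0]].add(tag)
            tagset[tag] |= tags - {tag}
    return {"tagset": tagset, "block": block}
```

With such a layout, looking up `index["Content"]["flight"]` returns the events tagged ⟨Content, flight⟩, and `supporting["block"]["Content"]` returns every tag of the Content block.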

(a) Event Index Structure (b) Supporting Index Structure

Figure 6.8: Index Structures

We defined ontological axioms for tag entailment, which speed up both the preprocessing and the online computation. When building the supporting index, naively, we could query the event index for every pair of tags ⟨p1, v1⟩, ⟨p2, v2⟩ ∈ T. However, we note that some combinations will never result in co-occurrences. Therefore, we carefully picked 30 tags (e.g. ⟨AccessType, Email⟩) or tag groups (e.g. ⟨Content, *⟩), and created a 30 × 30 disjoint matrix, as illustrated in Figure 6.9, in which the cell (i, j) indicates the disjointness (1 means they are disjoint) between the ith and jth tags (or groups). Disjointness was manually determined based on semantic conditions of the blocks and tags, resulting in 13% of the cells being marked as disjoint. For example, the tag ⟨AccessType, Web⟩ (f in Figure 6.9) is disjoint with the tag group ⟨Filename, *⟩ (l in Figure 6.9) because no files can co-occur with a web access in an event. Note that the matrix is symmetric, but the diagonal is not always disjoint. In particular, multi-valued properties such as content, server names (extracted from email recipients and senders), contacts, and e-mail to/cc/bcc are not disjoint with themselves.

When building the supporting index, the disjoint matrix is used to prune unnecessary queries to the event index. This disjointness is also used for pruning online computation: for a given block, when any of the tags or tag groups disjoint with it are in the context, we can directly ignore that block, without even querying the supporting index.
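The disjointness check can be sketched as a lookup in a small 0/1 matrix. The 3 × 3 matrix below is a toy stand-in for the 30 × 30 matrix of Figure 6.9; its values, the `GROUPS` mapping, and all function names are assumptions made for illustration only.

```python
# Row/column index of each tag or tag group; "*" marks a whole-block group.
GROUPS = {
    ("AccessType", "Web"): 0,
    ("Filename", "*"): 1,
    ("Content", "*"): 2,
}

# Toy symmetric matrix: 1 means the two tags (or groups) never co-occur.
DISJOINT = [
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
]


def group_of(tag):
    """Return the matrix index of a tag: exact tag first, then its block group."""
    if tag in GROUPS:
        return GROUPS[tag]
    return GROUPS.get((tag[0], "*"))


def is_disjoint(a, b):
    """True if the matrix marks tags a and b as disjoint."""
    i, j = group_of(a), group_of(b)
    if i is None or j is None:
        return False  # tags outside the matrix are never pruned
    return DISJOINT[i][j] == 1


def block_is_pruned(block, context):
    """True if the block can be skipped because a context tag is disjoint with it."""
    return any(is_disjoint((block, "*"), t) for t in context)
```

For a context containing ⟨AccessType, Web⟩, `block_is_pruned("Filename", context)` is true in this toy setup, so the Filename block would be skipped before any index lookup.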

Besides the disjointness axioms, we also use the Supporting Index (for the same purpose as the Co-occurrence Matrix in Section 4.2) to reduce the number of tags that will require frequency (fR) queries. Given a block B and a context T = {t1, t2, ...}, we issue a boolean query “block=B AND tagset=t1 AND tagset=t2 ...” to the supporting index, and the result will be a set of tags in block B that are likely to co-occur with the context. This pruning strategy is called Precomputed Candidate Set (PCS), since the result from the precomputed supporting index is a set of candidates.
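The boolean query can be read as an intersection of posting lists. The sketch below assumes the supporting index is a dictionary with `"block"` and `"tagset"` fields holding Python sets; the function name `pcs_candidates` and the toy data are hypothetical.

```python
def pcs_candidates(supporting, block, context):
    """Evaluate "block=B AND tagset=t1 AND tagset=t2 ..." by set intersection."""
    candidates = set(supporting["block"].get(block, set()))
    for tag in context:
        # Keep only the tags that co-occur with this context tag.
        candidates &= supporting["tagset"].get(tag, set())
    return candidates


# Toy supporting index (invented values, for illustration only).
FLIGHT, HOTEL, NEWS = ("Content", "flight"), ("Content", "hotel"), ("Content", "news")
EMAIL, ALICE = ("AccessType", "Email"), ("Contact", "alice")
SUPPORTING = {
    "block": {"Content": {FLIGHT, HOTEL, NEWS}},
    "tagset": {EMAIL: {FLIGHT, HOTEL}, ALICE: {FLIGHT, NEWS}},
}
```

Because the intersection only checks pairwise co-occurrence, a returned tag co-occurs with each context tag separately but not necessarily with all of them in the same event, which is why the result is a candidate set that still needs frequency queries.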

In some cases, we find that the context is actually very selective, and only a few instances match the context. Instead of issuing queries from the candidate set, we can directly count the tags that appear in these matched instances. Notably, this will be inefficient when |Inst(T)| is very large and |B|, i.e. the number of unique tags in this block, is relatively small. Thus in practice, we use the following Conditional Instance Processing (CIP) rule: if |B|/|Inst(T)| > αB, we process the matched instances directly; otherwise, we use the naive approach, which issues a conjunctive query for every tag t in the dataset. Here αB is a constant parameter for each block B. In practice we let αContent = 10, and αB = 100 for all the other blocks. The Content block is handled specially because it is a multi-value block: accessing each instance would add counts to multiple tags at the same time. We estimated αB in a very experimental manner, and we think further investigation is needed.
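The CIP rule can be sketched as a single branch on the ratio |B|/|Inst(T)|. The code below is an illustration only: events are held in a dictionary, a conjunctive query is simulated by a linear scan, the naive branch is restricted to the tags of the block, and the function names and the early return for an empty match set are assumptions.

```python
from collections import Counter

# Thresholds quoted in the text: 10 for Content, 100 for every other block.
ALPHA = {"Content": 10}
DEFAULT_ALPHA = 100


def matching_instances(events, context):
    """Return the tag sets of all events that contain every context tag."""
    context = set(context)
    return [set(tags) for tags in events.values() if context <= set(tags)]


def count_with_cip(block, block_tags, context, events):
    """Return (strategy, counts) for the tags of `block` under `context`."""
    matched = matching_instances(events, context)
    if not matched:
        return "empty", Counter()  # guard added for the sketch
    if len(block_tags) / len(matched) > ALPHA.get(block, DEFAULT_ALPHA):
        # CIP branch: scan the few matched instances and tally their tags.
        counts = Counter(t for tags in matched for t in tags if t[0] == block)
        return "instances", counts
    # Naive branch: one conjunctive query (context AND tag) per tag.
    counts = Counter()
    for tag in block_tags:
        n = len(matching_instances(events, set(context) | {tag}))
        if n:
            counts[tag] = n
    return "queries", counts
```

Both branches produce the same counts; the rule only decides which one is cheaper for the given block size and context selectivity.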

CIP and PCS are each efficient in different cases, so we combine these two approaches into a third one (CIP+PCS): we use the precomputed candidate set and apply a modified conditional instance processing rule. On every request, if |T| > 1 ∧ |candPCS(B, T)|/|Isub(T)| > αB, we process matched instances; otherwise we use the supporting index to get a smaller candidate set. Note that the key difference between this combined approach and the CIP-only approach is whether we use the precomputed candidates from the supporting index or use all the tags from a given block as candidates. Also, because PCS is guaranteed to provide all the co-occurring tags if |T| = 1, we do not consider using CIP in this case.
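The combined rule reduces to one predicate over three sizes. In this sketch, `num_matched` stands for |Isub(T)|, taken here to be the number of instances matching the context; that reading, the function name, the returned labels, and the guard against an empty match set are all assumptions.

```python
def choose_strategy(num_context_tags, num_candidates, num_matched, alpha_b):
    """Pick between instance processing and candidate queries (CIP+PCS).

    num_context_tags: |T|
    num_candidates:   |candPCS(B, T)|, size of the precomputed candidate set
    num_matched:      number of instances matching the context
    alpha_b:          per-block threshold (10 for Content, 100 otherwise)
    """
    if (
        num_context_tags > 1
        and num_matched > 0  # guard added for the sketch
        and num_candidates / num_matched > alpha_b
    ):
        return "process_instances"
    return "query_candidates"
```

With a single context tag the function always returns `"query_candidates"`, matching the observation that PCS already yields exactly the co-occurring tags in that case.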