The previous section proposed several changes to the system. To support the characteristics
of the new dataset, the modified definition of tags, and the block/tab nature of the
interface, we need to modify the infrastructure accordingly in order to provide a responsive
system.
The first difference is how we index the events (i.e. the instances) and how we store the
Co-occurrence Matrix. Since each request is now made per block, the index and the
matrix are also built per block. Our indexing structure is shown in
Figure 6.8(a), where for each block (such as “Content”) we have a corresponding indexing
field, and in each field we index each possible value (such as “flight”) with a posting list
of events that have the ⟨block, value⟩ tag. The Co-occurrence Matrix is also built as an
inverted index, called the Supporting Index, as in Figure 6.8(b). There are two retrieval fields
in the Supporting Index: Co-existing Set (tagset) and Block. In the Co-existing Set field,
each indexing tag has a posting list of tags that co-occur with that tag. In the Block field,
each indexing block has a posting list of tags that appear in that block. By introducing
the block names as either retrieval fields or indexing terms in a retrieval field, we can
easily get the tags of a block by specifying the block name from both the Event Index and
the Supporting Index.
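The two index structures can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the events are made-up toy data, plain dictionaries of sets stand in for a real inverted index, and the function names `build_event_index` and `build_supporting_index` are hypothetical.

```python
from collections import defaultdict

# Toy events (made-up data): each event is a set of (block, value) tags.
EVENTS = {
    1: {("AccessType", "Email"), ("Content", "flight"), ("Content", "hotel")},
    2: {("AccessType", "Web"), ("Content", "flight")},
    3: {("AccessType", "Email"), ("Content", "hotel")},
}

def build_event_index(events):
    """Event Index: one field per block, mapping value -> set of event ids."""
    index = defaultdict(lambda: defaultdict(set))
    for event_id, tags in events.items():
        for block, value in tags:
            index[block][value].add(event_id)
    return index

def build_supporting_index(events):
    """Supporting Index with two retrieval fields:
    'tagset' maps a tag to the tags that co-occur with it,
    'block' maps a block name to the tags that appear in that block."""
    tagset = defaultdict(set)
    block = defaultdict(set)
    for tags in events.values():
        for tag in tags:
            block[tag[0]].add(tag)
            tagset[tag].update(tags - {tag})  # a tag is not listed with itself
    return {"tagset": tagset, "block": block}

event_index = build_event_index(EVENTS)
supporting_index = build_supporting_index(EVENTS)
```

With this layout, the tags of a block are available from either structure: the keys of `event_index["Content"]` and the set `supporting_index["block"]["Content"]` both enumerate the Content tags.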
Figure 6.8: Index Structures. (a) Event Index Structure; (b) Supporting Index Structure.
We defined ontological axioms for tag entailment, which speed up both the preprocessing
and the online computation. When building the supporting index, naively, we could
query the event index for every pair of tags ⟨p1, v1⟩, ⟨p2, v2⟩ ∈ T. However, we note that some combinations will never result in co-occurrences. Therefore, we carefully picked 30 tags (e.g. ⟨AccessType, Email⟩) or tag
groups (e.g. ⟨Content, *⟩), and created a 30 × 30 disjoint matrix, as illustrated in Figure
6.9, in which the cell (i, j) indicates the disjointness (1 means they are disjoint) between
the ith and jth tags (or groups). Disjointness was manually determined based on semantic
conditions of the blocks and tags, resulting in 13% of the cells being marked as disjoint.
For example, the tag ⟨AccessType, Web⟩ (f in Figure 6.9) is disjoint with the tag group ⟨Filename, *⟩ (l in Figure 6.9) because no files can co-occur with a web access in an event. Note that the matrix is symmetric, but the diagonal is not always disjoint. In
particular, multi-valued properties such as content, server-names (extracted from email
recipients and senders), contacts, and e-mail to/cc/bcc are not disjoint with themselves.
When building the supporting index, the disjoint matrix is used to prune unnecessary
queries to the event index. This disjointness is also used for pruning online computation:
each block has a set of disjoint tags and tag groups, and when any of these disjoint ones are in the context, we can directly ignore that
block, without even querying the supporting index.
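The online pruning step can be sketched as follows. This is a minimal illustration, not the actual 30 × 30 matrix: only the Web/Filename cell and the Content diagonal follow the text, all other cell values are placeholders, and the helper names (`entry_of`, `is_disjoint`, `block_is_pruned`) are hypothetical.

```python
# Illustrative 3 x 3 excerpt of a disjoint matrix (1 means disjoint).
# A tag group is written (block, "*") and matches any tag of that block.
ENTRIES = [("AccessType", "Web"), ("Filename", "*"), ("Content", "*")]
DISJOINT = [
    [1, 1, 0],  # <AccessType, Web>  (diagonal value is a placeholder)
    [1, 1, 0],  # <Filename, *>      (diagonal value is a placeholder)
    [0, 0, 0],  # <Content, *>: multi-valued, so not disjoint with itself
]

def entry_of(tag):
    """Map a tag to its matrix row: exact match first, then its block's group."""
    block, _ = tag
    for candidate in (tag, (block, "*")):
        if candidate in ENTRIES:
            return ENTRIES.index(candidate)
    return None  # tag is not covered by the matrix

def is_disjoint(tag_a, tag_b):
    """True only if both tags are covered and their cell is marked 1."""
    i, j = entry_of(tag_a), entry_of(tag_b)
    return i is not None and j is not None and DISJOINT[i][j] == 1

def block_is_pruned(block, context):
    """True if any context tag is disjoint with the whole block,
    in which case the supporting index need not be queried for it."""
    return any(is_disjoint((block, "*"), tag) for tag in context)
```

For a context containing ⟨AccessType, Web⟩, `block_is_pruned("Filename", context)` returns `True`, so the Filename block is skipped, while the Content block is still processed.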
Besides the disjointness axioms, we also use the Supporting Index (for the same pur-
pose as the Co-occurrence Matrix in Section 4.2) to reduce the number of tags that will
require frequency (fR) queries. Given block B and context T = {t1, t2, ...}, we issue a boolean query “block=B AND tagset=t1 AND tagset=t2 ...” to the supporting index, and
the result will be a set of tags in block B that are likely to co-occur with the context.
This pruning strategy is called Precomputed Candidate Set (PCS), since the result
from the precomputed supporting index is a set of candidates.
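The boolean query behind PCS amounts to intersecting posting lists. The sketch below assumes the supporting index is held as two dictionaries of sets; the toy index contents and the name `pcs_candidates` are hypothetical and only serve the illustration.

```python
# Toy supporting index (made-up data), with the two retrieval fields
# described in the text: 'block' and 'tagset'.
EMAIL, WEB = ("AccessType", "Email"), ("AccessType", "Web")
FLIGHT, HOTEL = ("Content", "flight"), ("Content", "hotel")

SUPPORTING_INDEX = {
    "block": {"AccessType": {EMAIL, WEB}, "Content": {FLIGHT, HOTEL}},
    "tagset": {
        EMAIL: {FLIGHT, HOTEL},
        WEB: {FLIGHT},
        FLIGHT: {EMAIL, WEB, HOTEL},
        HOTEL: {EMAIL, FLIGHT},
    },
}

def pcs_candidates(supporting_index, block, context):
    """Evaluate 'block=B AND tagset=t1 AND tagset=t2 ...' by intersecting
    the block's posting list with the posting list of every context tag."""
    candidates = set(supporting_index["block"].get(block, set()))
    for tag in context:
        candidates &= supporting_index["tagset"].get(tag, set())
    return candidates
```

Because the index records pairwise co-occurrence only, the result for a context of several tags is a candidate set: each returned tag co-occurs with every context tag individually, but the frequency query still has to confirm that they occur together.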
In some cases, we find that the context is actually very selective, and there are only a
few instances that match the context. Instead of issuing queries from the candidate set,
we can directly count the tags that appear in these matched instances. Notably, this will
be inefficient when |Inst(T)| is very large and |B|, i.e. the number of unique tags in this block, is relatively small. Thus in practice, we use the following Conditional Instance
Processing (CIP) rule: if |B|/|Inst(T)| > αB, we process the matched instances directly; otherwise, we use the naive approach,
which issues a conjunctive query for every tag t in the dataset. Here αB is a constant
parameter for each block B. In practice we let αContent = 10, and αB = 100 for all the
other blocks. The Content block is handled specially because it is a multi-value block:
accessing each instance adds counts to multiple tags at the same time. We estimated
αB empirically, and we think further investigation and [...]
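A minimal sketch of the CIP decision follows, reading the rule as "process the matched instances when |B|/|Inst(T)| exceeds αB". The α values are the ones given in the text; the function names and the handling of an empty match set are assumptions made for this illustration.

```python
from collections import Counter

ALPHA = {"Content": 10}  # alpha for the Content block, as given in the text
DEFAULT_ALPHA = 100      # alpha for all the other blocks

def use_instance_processing(block, block_size, num_matched):
    """CIP rule: scan the matched instances when |B| / |Inst(T)| > alpha_B."""
    if num_matched == 0:
        return True  # assumption: with nothing to scan, scanning is cheapest
    return block_size / num_matched > ALPHA.get(block, DEFAULT_ALPHA)

def count_tags_in_instances(block, matched_events):
    """Count the block's tags directly from the matched events.
    One event may add counts to several tags of a multi-value block."""
    counts = Counter()
    for tags in matched_events:
        counts.update(tag for tag in tags if tag[0] == block)
    return counts
```

For example, a block with 5000 unique tags and 20 matched instances gives a ratio of 250, which exceeds both thresholds, so the instances are scanned instead of issuing one query per tag.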
CIP and PCS are each efficient in different cases, so we combine these two approaches
into a third one (CIP+PCS): we use the precomputed candidate set and
apply a modified conditional instance processing rule. On every request, if |T| > 1 ∧
|cand_PCS(B, T)|/|Isub(T)| > αB, we process the matched instances; otherwise we use the supporting
index to get a smaller candidate set. Note that the key difference between this
combined approach and the CIP-only approach is whether we use the precomputed candidates
from the supporting index or use all the tags from a given block as candidates.
Also, because PCS is guaranteed to provide all the co-occurring tags if |T| = 1, we do not consider using CIP in this case.
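The combined rule can be written as a single decision function. This is a sketch under assumptions: the caller is assumed to already know the size of the PCS candidate set and the number of matched instances, and the name `choose_strategy` and its return labels are hypothetical.

```python
def choose_strategy(context_size, num_pcs_candidates, num_matched, alpha):
    """CIP+PCS rule: scan the matched instances only when the context has
    more than one tag and |cand_PCS(B, T)| / |Isub(T)| > alpha_B; otherwise
    fall back to the precomputed candidate set."""
    if context_size > 1 and num_matched > 0:
        if num_pcs_candidates / num_matched > alpha:
            return "instances"
    return "pcs"
```

With a single context tag the function always returns `"pcs"`, matching the observation that PCS already yields all co-occurring tags when |T| = 1.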