
4.6 Aboutness Systems: Example of the Flat Document Vector Space Model

4.6.1 Background

In the plain vector space model, documents in a collection are viewed as vectors in a vector space [Manning et al., 2008] that has one axis for each term in the collection. If we represent each document in the collection by the bag of keywords it contains, the document can be considered a point in this vector space and represented as the vector to this point. Such a vector space representation has been used as the foundation of many types of information retrieval operations, from calculating relevance rankings to document clustering, and has been very successful as a means to represent information [van Rijsbergen, 2004].

In the model, d and q are vectors of weighted or binary index terms t. A term can be a word or any other descriptor for the information the document contains. If the terms are weighted, then these weights are normally based on term frequencies and are values between 0 and 1. If they are unweighted, then each entry is either 1, if the term appears in the document, or 0, if it does not. However, the weighting scheme is immaterial for our discussion. As we are just discussing an example for our methodology, we only consider unweighted index terms. Weighted index terms can be analysed analogously.
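The unweighted representation can be illustrated with a minimal sketch. The toy collection, the function name `binary_vector`, and the choice of alphabetical order as the canonical term ordering are assumptions made for illustration only; they are not taken from the cited works.

```python
def binary_vector(doc_terms, vocabulary):
    """Return a 0/1 vector over the vocabulary: 1 where the term occurs."""
    present = set(doc_terms)  # bag of keywords reduced to presence/absence
    return [1 if term in present else 0 for term in vocabulary]


# Hypothetical toy collection (illustrative only).
docs = {
    "d1": ["xml", "retrieval", "xml"],
    "d2": ["vector", "space", "retrieval"],
}

# One axis per term in the collection; alphabetical order is an assumed
# canonical ordering so that each term has the same index in every vector.
vocabulary = sorted({t for terms in docs.values() for t in terms})
vectors = {name: binary_vector(terms, vocabulary) for name, terms in docs.items()}
```

Repeated occurrences of a term (such as "xml" in d1) do not change the entry, which is what distinguishes the unweighted case from a frequency-based weighting.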

Following [Baeza-Yates and Ribeiro-Neto, 1999], let the query vector $\vec{q}$ be $(u_1, \dots, u_m)$ and the document vector $\vec{d}$ be $(t_1, \dots, t_m)$, where m is the number of index terms in the collection. The terms are given some canonical ordering so that each term can be found at the same index in all vectors. The similarity of $\vec{q}$ and $\vec{d}$ can be calculated in many ways. In Salton's original model [Salton et al., 1975], the relevance of a document d given a query q is estimated using the cosine of the angle between the two vectors of d and q:

$$\mathrm{rsv}(d, q) = \frac{\sum_{i=1}^{m} t_i \times u_i}{\sqrt{\sum_{i=1}^{m} t_i^2 \times \sum_{i=1}^{m} u_i^2}}$$

Since $0 \le t_i \le 1$ and $0 \le u_i \le 1$, rsv varies between 0 and 1.
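The cosine-based retrieval status value can be sketched as follows. The function name `rsv` mirrors the notation above; returning 0.0 for an all-zero vector is an assumption of this sketch, since the formula itself is undefined in that case.

```python
import math


def rsv(d, q):
    """Cosine of the angle between document vector d and query vector q."""
    if len(d) != len(q):
        raise ValueError("d and q must range over the same m index terms")
    dot = sum(t * u for t, u in zip(d, q))
    norm = math.sqrt(sum(t * t for t in d) * sum(u * u for u in q))
    # Assumption: treat an empty document or query as having no similarity.
    return dot / norm if norm else 0.0


# Binary vectors over four hypothetical index terms.
score_shared = rsv([1, 0, 0, 1], [1, 0, 0, 0])    # one shared term
score_disjoint = rsv([1, 0, 0, 0], [0, 1, 1, 0])  # no shared term
```

With non-negative entries, vectors that share no term are perpendicular and score 0, while identical vectors score 1.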

Next, we translate the vector representations into situations.

4.6.2 Translation

In the translation part, we first develop the map function for the model we are analysing. In this case, we need to express the behaviour of a simple bag-of-keywords indexing approach.

To this end, we define χ(d) as the descriptor set that is returned by the indexing process as a representation of document d. In our case, let χ(d) be a set of index terms, which correspond to the n non-zero entries in the vector for d, while χ(q) corresponds to the non-zero entries for query q. For the translation of the simple vector space we use the basic infon language, as defined on page 42. Index terms are then directly translated into infons, and the set of all index-term infons is the document situation.

map(χ(d)) = { ⟨⟨Value, t; 1⟩⟩ | t ∈ χ(d) }

Any document situation S representing a document d is thus the set of value infons of index terms of d. As presented in Section 4.3, a common short form for such infons is simply ⟨⟨t⟩⟩. This translation is similar to the one for vector spaces in [Huibers, 1996]. The translation of a query to a query situation is defined in a similar way.
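The translation can be sketched in a few lines. Encoding an infon as the tuple `("Value", t, 1)` and the helper names `chi` and `map_to_situation` are assumptions of this sketch, chosen to mirror the notation above rather than any implementation in the cited works.

```python
def chi(vector, vocabulary):
    """Descriptor set: the index terms at the non-zero entries of a vector."""
    return {term for term, entry in zip(vocabulary, vector) if entry != 0}


def map_to_situation(descriptors):
    """Translate a descriptor set into a situation, i.e. a set of value infons."""
    # Assumed encoding: the infon <<Value, t; 1>> becomes the tuple ("Value", t, 1).
    return frozenset(("Value", t, 1) for t in descriptors)


# Hypothetical vocabulary and document vector (illustrative only).
vocabulary = ["retrieval", "space", "vector", "xml"]
situation = map_to_situation(chi([1, 0, 0, 1], vocabulary))
```

A query situation would be obtained the same way from the query vector.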

Next we need to define the operators used in the rules from Section 4.4: equivalence, composition, containment and preclusion. Again, we can reuse what has already been defined in [Huibers, 1996] and [Wong et al., 2001]. In particular, we reuse the algorithm in [Huibers et al., 1996a] for parameter replacement:

Definition 5 The notation S(x,y) represents the replacement of the parameter x in S with the parameter y. The properties of the parameter exchange are defined as follows:

S(w,x)(y,z) =def (S(w,x))(y,z).

Using this notation, we can define the operators according to [Huibers et al., 1996a]:

• Equivalence: Given two situations S and T, S ≡ T =def (ϕ ∈ S ⇔ ϕ ∈ T), where ϕ is any infon based on all keywords in the document collection. In terms of vectors, this means that the underlying vectors for S and T are identical.

• Composition: Given two situations S and T, S ⊗ T =def (S ∪ T)(p1,r1,...,pn,rn)(q1,s1,...,qn,sn), with p, q and r, s being parameters used in S and T respectively. With respect to vectors, composition creates a new vector that has a non-zero entry wherever either of the underlying vectors for S and T has a non-zero entry.

• Containment: Given two situations S and T, S → T =def (ϕ ∈ S → ϕ ∈ T), where ϕ is any infon based on all keywords in the document collection. In the vector representation, where the underlying vector for S has a zero entry the underlying vector for T also has a zero entry and there is at least one non-zero entry that both share.

• Preclusion: Not applicable, as vectors always have a distance to each other, and vectors into the negative information space are not defined. rsv has to be greater than or equal to 0 and less than or equal to 1. The simple vector space model is therefore not able to express anti-aboutness beyond simple non-aboutness, as we will see later. Simple anti-aboutness would mean that, given a situation S and another situation T, the vectors are perpendicular, or S ⊥ T =def (ϕ ∈ S → ϕ ∉ T ∧ ϕ ∈ T → ϕ ∉ S). This would be equivalent to S / T.
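For keyword-only situations, the operators above reduce to elementary set operations, as the following minimal sketch illustrates. It assumes situations are sets of short-form infons (plain term strings) without parameters, so the parameter replacement in the composition reduces to a plain union; the function names are illustrative, and `contained` follows the set-based definition exactly as written above.

```python
def equivalent(s, t):
    """S is equivalent to T: every infon is in S exactly when it is in T."""
    return s == t


def compose(s, t):
    """Composition of S and T: union of infons (no parameters to replace here)."""
    return s | t


def contained(s, t):
    """Containment as written above: every infon of S is also an infon of T."""
    return s <= t


def perpendicular(s, t):
    """S and T are perpendicular: they share no infon."""
    return s.isdisjoint(t)


# Hypothetical situations in short form (illustrative only).
S = frozenset({"xml", "retrieval"})
T = frozenset({"xml", "retrieval", "evaluation"})
U = frozenset({"vector"})
```

Under these assumptions, `contained(S, T)` and `perpendicular(S, U)` both hold, and `compose(S, U)` has an entry wherever either S or U has one.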

According to the containment definition, any document is surface-contained in any other if and only if it contains only infons from the other document. Deep containment is an addition to the simple vector space model.

Next we discuss the rules that help us define the behaviour of a model. The approach presented here differs from that of [Huibers, 1996] in that the analysis of the model repeats all the rules from Section 4.4, not just some of them. Huibers concentrates on the rules that prove completeness and soundness of the set of reasoning rules that describe the model. As discussed in Section 4.4, the approach presented here is therefore akin to the inductive analysis of [Wong et al., 2001], where all rules are considered relevant as functional benchmarks of a model's reasoning behaviour. It is important to understand detailed aspects of the reasoning in terms of conservative monotonicity, anti-aboutness behaviour, etc. In particular, one needs to understand which reasoning rules do not hold or hold only in certain circumstances, as this reasoning behaviour is highly informative for understanding experimental behaviour, as outlined in Chapter 8.
