Similarity table calculation - Back-End Services

Chapter 4 Designing a Framework for Self-Revision E-Course Mate-

4.6 Back-End Services

4.6.2 Similarity table calculation

Khairil Imran Bin and Nor Aniza [77] suggest that popular approaches to recom- mending related learning materials to students are collaborative filtering, content- based filtering and hybrid filtering. Collaborative filtering recommends learning materials based on similarity of students’ preferences, while content-based filtering makes recommendations according to similar content of learning materials [52]. The hybrid filtering approach uses both approaches to make recommendations. The first prototype of the SRECMATs framework focused on finding an appropriate technique to automatically recommend related materials based on similar content (particularly slides and past exam papers). Therefore, a content-based filtering approach was considered, which would produce automatic recommendations with no participation by students.

To recommend learning materials, the degree of similarity or similarity score between each pair of materials was calculated and kept in a relational database, as presented in Figure 4.14. Calculating the degree of similarity between online materials was another challenge. In this research, two components – term weighting and similarity calculation – were used to calculate the similarity between each pair of documents.

Term Weighting

Term weighting is a process of identifying the features of a document, calculating their weight and representing them through the information retrieval (IR) model. Before identifying features and calculating their weight, it was necessary to consider which IR models should be used. In this thesis, the classical IR model was used as described below.

Beazy-Yates [9] states that the IR model consists of four main parts. It can be defined as a framework to represent user information needs (Q) and documents in the collection (D). The notation R(Q, D) or R(D1, D2) is used to calculate similarity between two documents. The three classical IR models considered in this research were:

• Boolean Model: A simple model for retrieving information based on set

theory and boolean algebra [121]. It considers whether or not an index of terms is present in a document. Weightings of index terms are all presented

in binary form, which can be computed as presented in equation (4.1). Leti

be a term appearing in the documentj; the weight of the termiis computed

by w(i,j)= ⎧⎪⎪⎪ ⎪⎨ ⎪⎪⎪⎪ ⎩ 1 if i∈j, 0 if i∉j. (4.1)

The Boolean model has a major drawback, insofar as the retrieval techniques are based only on a binary decision on whether a document is related or unrelated. Hence, it does not deal with a partial matching between documents.

• Vector Model: This is an algebraic model that represents a querying of doc-

uments in the form of a vector containing the weightings of features, allow- ing partial matching of documents [9]. Weightings of terms in the document

are assigned as non-binary data to compute the degree of similarity between documents. For example, using term frequency as a feature for representing documents, the weight of termiin a documentj(w(i,j)) is a frequency of term

iwhich appeared in the documentj, as presented in equation (4.2).

w(i,j)=freq(i, j). (4.2)

The vector model of documentj can therefore be represented as:

Ð→_j ₌_w

(i,j)= (w(1,j), w(2,j), . . . , w(n,j)). (4.3)

• Probabilistic Model: This is a retrieval model based on a probabilistic

framework. Similarity between document and query is calculated by the prob- ability that the document appears in the answer set, divided by the proba- bility that the document does not appear in the answer set. The answer set can initially be guessed and then adjusted later. The common ranking for a probabilistic model is a BM25 score. Major disadvantages include the need to guess whether the documents are related or unrelated, and lack of matching the frequency of candidate terms occurring in documents.

In this research, the vector model was chosen to represent documents, since most content-based recommenders use relatively simple but effective retrieval models [76, 78, 79, 90]. The vector model is a promising technique for retrieving relevant e- materials based on their degree of similarity. It requires features with weightings to

characterise each document. For example, in Figure 4.19, setsS andP represent a

lecture slide and a past exam paper, and the table below shows terms that appear in both documents, as well as their frequency.

Figure 4.19: Example of terms in lecture slide documentS and past exam paperP for similarity calculation

Assume that a raw term frequency is used as a feature to represent the documents. theÐ→S and Ð→P can be written as:

Ð→_S _{= (}_w

(Cat,S), w(Dog,S), w(Rat,S),w_{(F ish,S)}) = (5,1,1,3).

Ð→_P _{= (}_w

(Cat,P), w(Dog,S), w(Rat,S), w(Bird,P)) = (2,1,5,7).

The first version of the SRECMATs prototype used a classical term weighting, which was the term frequency with inverse document frequency (TF-IDF) provided

by the open-source JATEtoolkit7 [169], as a feature of each document. TF-IDF is

a measure of the frequency of a candidate term appearing in the target document (TF), with the number of documents that contain the candidate term (IDF). The following equations show how TF-IDF is computed.

For the candidate termiin the documentj, the normalised term frequencyTF(i, j) is derived by dividing the raw frequencyfreq(i, j)by the maximum of raw frequency

of all terms in documentj as maxl freq(l, j).

TF(i, j) = freq(i, j) maxl freq(l, j)

. (4.4)

From Figure 4.19, the use of the normalised term frequency of TF(i, S) and

TF(i, P)as weightings of terms for Ð→S and Ð→P can be computed as follows:

Ð→_S _{= (}5 5, 1 5, 1 5, 3 5) = (1,0.2,0.2,0.6). Ð→_P _{= (}2 7, 1 7, 5 7, 7 7) = (0.28,0.14,0.71,1).

LetN be the total number of documents andni the number of documents in which

candidate term iappears. The inverse document frequency of term i,IDF(i) [135] is computed by:

IDF(i) =logN

. (4.5)

From Figure 4.19, the inverse document frequencyIDF(i) of bothÐ→S and Ð→P can be computed as follow: Ð→_S _{= Ð→}_P _{= (}_log2 2,log 2 2,log 2 2,log 2 1) = (0,0,0,0.69).

The common term weighting score TF-IDF(i, j)[135] is computed by:

From Figure 4.19, the final Ð→S and Ð→P based on TF-IDF(i, j) are computed as follow: Ð→ S = (1×0,0.2×0,0.2×0,0.6×0.69) = (0,0,0,0.41). Ð→_P _{= (}₀_.₂₈_×₀_,₀_.₁₄_×₀_,₀_.₇₁_×₀_,₀_.₁_×₀_.₆₉_{) = (}₀_,₀_,₀_,₀_.₆₉₎_. Similarity calculation

In order to calculate the degree of relevance between two learning materials (in the form of VSM), a commonly-used technique for content-based filtering known as cosine similarity was used [92]. Cosine similarity is a measure of the distance (cosine of the angle) between each two pairs of documents. Each document is considered as a vector, where each dimension corresponds with the weighting of a term appearing in a document. The classical term weighting scheme TF-IDF was used to weight terms in the first prototype, as shown in equation (4.6). For example, the degree of similarity between a lecture slide and a past exam paper,Sim(S, P), was calculated through a cosine similarity formula as follows:

Sim(S, P) = Ð→_S _{⋅ Ð→}_P ∣Ð→S∣∣Ð→P∣ = ∑ni=1w(i,S)×w(i,P) √ ∑n i=1w 2 (i,S)× √ ∑n i=1w 2 (i,P) , (4.7)

where w(i,j) denotes a weight of term i in learning materials j and

Ð→_j _{is a weight}

of term vector (w(1,j), w(2,j), . . . , w(n,j)) with its magnitude ∣Ð→j∣. Details of the

similarity calculations and term weighting components considered are discussed in Chapter 6.

From Figure 4.19 (using the raw frequency of terms), the cosine similarity between

Ð→_S _and Ð→_P _{can be calculated as follows:}

Sim(S, P) = √ (5×2) + (1×1) + (1×5) 52+₁2+₁2+₃2×√₂2+₁2+₅2+₇2

≈0.3.

4.7 Summary

The literature review presented in Chapter 2 and the survey results described in Chapter 3 enabled the development of a software framework to deliver online materials to students and support their revision by indexing and linking uploaded materials. Another reason for designing this framework was to reduce lecturer workloads, since it allows students greater flexibility in navigating through online materials during their revision. With the SRECMATs system, students can access materials anytime and anywhere, as well as browsing, searching and navigating through online materials for specific information in a single place. Thus, they can make maximum use of the online course materials, especially during the revision period.

The SRECMATs framework is composed of front-end and back-end services. The front-end services deal with the user interface and tool functionality, while the back- end services provide the technology that drives the system. The SRECMATs framework contains three main features that support students: direct access (keyword browsing and keyword searching), access to quick overviews and easy access to related materials.

Since the materials commonly available on University of Warwick course websites (Sitebuilder) are uploaded in PDF format and simply converted to plain text format, the back-end service then uses NLP as a technique to develop the front-end features of the SRECMATs framework. The back-end services comprise two major

components: candidate term extraction and similarity table calculation. These two components were implemented based on open source iText library, Apache Open NLP tool and JATEtoolkit. iText contains a library for converting PDF documents to plain text. The Apache Open NLP tool contains a library with regard to data pre-processing for sentence segmentation, tokenisation, part-of-speech tagging, stemming and lemmatisation, and stop-words filtering. The JATEtoolkit contains a library for the statistical approach, including term frequency (TF) and inverse document frequency (IDF). In addition, information in the similarity table is calculated from cosine similarity, based on a vector space model.

This chapter has discussed background studies on components relating to the SRECMATs framework. The next chapter will evaluate the SRECMATs prototype built from the SRECMATs framework.

In document Enhancing online course materials for self revision in higher education (Page 132-140)