Relation Disagreement

Having created an algorithm for inferring relationships, we naturally allow such relationships to be shown within Centruflow. It would be remiss of us, however, not to offer some mechanism for users to disagree with an inferred relationship, as without such a mechanism there would be no easy means to 'manually correct' the system. This section therefore develops an additional function that takes into account a user's input that they disagree with an inferred relationship. We have not integrated this function with our distance metric because it is not presently a metric itself, so combining the two would mean that the distance metric would no longer be a metric either.

Relation disagreement deals with the number of times an inferred relationship is disagreed with. The greater this number, the more the distance should be decreased, as the inferred relationship is deemed incorrect. More formally, let R and U be disjoint sets, where R is the set of all resources and U is the set of all users. We define RelationshipDisagreements to be a subset of R × R × U. A relationship disagreement is between two resources and is recorded against the user who raised it, that is, it is a triple (r1, r2, u).

The value maximumDisagreement can be defined as the largest disagreementCount value over all combinations of r1, r2 ∈ R.

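$$\mathit{maximumDisagreement} \in \mathbb{N}$$

$$\mathit{maximumDisagreement} = \max\{\mathit{disagreementCount}(r_1, r_2) \mid r_1 \in R \land r_2 \in R\}$$

$$\mathit{disagreementCount} : R \times R \to \mathbb{N}$$

$$\mathit{disagreementCount}(r_1, r_2) = \left|\{u \in U \mid (r_1, r_2, u) \in \mathit{RelationshipDisagreements}\}\right|$$

$$\mathit{relationDisagreement} : R \times R \to [0, 1]$$

$$\mathit{relationDisagreement}(r_1, r_2) = \frac{\mathit{disagreementCount}(r_1, r_2)}{\mathit{maximumDisagreement}}$$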

The value calculated by this function is therefore based solely on the number of disagreements entered by users. A user may disagree with each edge only once, so no single user can inflate the count for an edge.
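The definitions above can be sketched in Java roughly as follows. This is a minimal illustration rather than the Centruflow implementation: the class and method names are hypothetical, resources and users are represented by plain string identifiers, pairs are treated as ordered, and returning 0 when no disagreements exist is an assumption made to avoid dividing by zero.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Illustrative sketch only; names are assumptions, not Centruflow code. */
final class RelationDisagreementSketch {
    // Key: the ordered pair (r1, r2); value: the set of users who disagreed with it.
    private final Map<List<String>, Set<String>> disagreements = new HashMap<>();

    /** Records (r1, r2, user). Returns false if this user had already disagreed with the pair. */
    boolean disagree(String r1, String r2, String user) {
        return disagreements
                .computeIfAbsent(List.of(r1, r2), k -> new HashSet<>())
                .add(user); // a set keeps at most one disagreement per user
    }

    /** Number of distinct users who disagreed with the pair (r1, r2). */
    int disagreementCount(String r1, String r2) {
        Set<String> users = disagreements.get(List.of(r1, r2));
        return users == null ? 0 : users.size();
    }

    /** Largest disagreementCount over all recorded pairs. */
    int maximumDisagreement() {
        int max = 0;
        for (Set<String> users : disagreements.values()) {
            max = Math.max(max, users.size());
        }
        return max;
    }

    /** disagreementCount / maximumDisagreement, a value in [0, 1]. */
    double relationDisagreement(String r1, String r2) {
        int max = maximumDisagreement();
        // Assumption: with nothing recorded yet, report 0 instead of dividing by zero.
        return max == 0 ? 0.0 : (double) disagreementCount(r1, r2) / max;
    }
}
```

Storing the disagreeing users in a set is what enforces the one-disagreement-per-user rule: a repeated call to `disagree` with the same arguments leaves the count unchanged.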

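When a company is only bootstrapping its tagging system, this relationDisagreement function is likely to be rather unfair: because the count is normalised by maximumDisagreement, a single disagreement can produce the maximum value of 1 while few disagreements have been recorded. Over time, as more disagreements accumulate, it should become fairer.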
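As a hypothetical illustration (the counts are invented for the example and are not taken from any deployment), suppose that early on a single user has disagreed with the pair (a, b) and no other disagreements exist, while later the most disputed pair has 20 disagreements and (a, b) still has one:

$$\text{early: } \mathit{relationDisagreement}(a, b) = \frac{1}{1} = 1 \qquad \text{later: } \mathit{relationDisagreement}(a, b) = \frac{1}{20} = 0.05$$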

Chapter 4

Software Implementation

As previously discussed, this thesis focused on the development of algorithms to calculate user trust and the inferred distance between nodes. Having developed these algorithms in the previous chapter, we then needed to implement them in our programming language of choice, Java.

Once these algorithms were implemented, our focus shifted to the extension of the Centruflow framework, as well as the development of a Centruflow Server. In addition, there was the tagging architecture to implement, which spans both client and server. The remainder of this chapter discusses the implementation details at a relatively high level.

4.1 Implementation of Similarity and Trust Functions

With the mathematical algorithms developed in chapter 3, it was time to implement them in code. Our plan was to embed this functionality inside the Centruflow Server, as our primary goal was to enable a Centruflow client to quickly find out which nodes have the strongest inferred relationships with other nodes. Two main approaches were put forward for how these functions could be developed: brute-force calculation and on-demand calculation. Both are discussed below.

4.1.1 Brute-Force Computation

The brute-force approach simply suggests that all trust and resource distance (similarity) functions run on a set schedule, perhaps once a day at 3:00am. All possible calculations should be run at this time, with the calculated data exported to a relational database. This would allow queries from the Centruflow client related to user trust and inferred relationships to be handled by simple SQL queries to a relational database (and cached in memory within the Centruflow Server).
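A nightly batch of this kind might be sketched in Java as below. The sketch makes several assumptions: the class and method names are hypothetical, the distance function is passed in as a placeholder for the algorithms of chapter 3, and an in-memory map stands in for the relational database table.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.LocalTime;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.ToDoubleBiFunction;

/** Illustrative sketch only; names are assumptions, not Centruflow code. */
final class BruteForceBatchSketch {

    /** Time from 'now' until the next occurrence of 'runAt' (later today or tomorrow). */
    static Duration delayUntilNextRun(LocalDateTime now, LocalTime runAt) {
        LocalDateTime next = now.toLocalDate().atTime(runAt);
        if (!next.isAfter(now)) {
            next = next.plusDays(1);
        }
        return Duration.between(now, next);
    }

    /** Computes the distance for every ordered pair of distinct resources. */
    static Map<List<String>, Double> computeAll(List<String> resources,
                                                ToDoubleBiFunction<String, String> distance) {
        Map<List<String>, Double> table = new HashMap<>();
        for (String r1 : resources) {
            for (String r2 : resources) {
                if (!r1.equals(r2)) {
                    table.put(List.of(r1, r2), distance.applyAsDouble(r1, r2));
                }
            }
        }
        return table;
    }

    /** Runs 'batch' at the next 'runAt' and then every 24 hours. */
    static ScheduledExecutorService scheduleNightly(Runnable batch, LocalTime runAt) {
        ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor();
        long initialDelay = delayUntilNextRun(LocalDateTime.now(), runAt).toMillis();
        executor.scheduleAtFixedRate(batch, initialDelay,
                Duration.ofDays(1).toMillis(), TimeUnit.MILLISECONDS);
        return executor;
    }
}
```

Note that `computeAll` produces n(n - 1) entries for n resources, so the amount of work grows quadratically with the number of resources.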

This is only a good approach for small tagging systems, as the downside of such a solution is its lack of scalability. Centruflow is primarily aimed at enterprises, and there is no cap on the number of potential users (and hence tags). In particular, there are companies wanting to expose their Centruflow installations over the Internet, which could lead to an explosion of users and taggings. We do not believe that the brute-force approach can handle these scenarios, and as such we would only recommend our tagging system with this approach to small enterprises. To enable universal use of the tagging system, an alternative approach is needed, such as that explained in the next section.

4.1.2 On-Demand Calculation

The on-demand solution attempts to allow for more 'up to date' trust and resource distance values by calculating the necessary information only when it is needed. This does not mean that the approach avoids caching or pre-computation altogether; it is simply that the data expires more frequently and is not calculated in a batch process that runs nightly.

There is, however, a problem with this approach too: the resource distance algorithm relies on the user trust calculation, meaning that to calculate resource distance, the server must know the trust values for all users who have tagged the resources in question. Depending on the number of users who have tagged a particular resource, this could be a limiting factor in ensuring a good response time.

One way around this issue is to pre-compute the user trust values as part of a nightly batch process, and then to calculate resource distances on demand. This approach helps, given that our ultimate goal is to respond to the Centruflow client request in a minimal amount of time.
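The on-demand idea, with cached values that expire, might look like the following Java sketch. All names are hypothetical. The injected distance function is a placeholder that is assumed to read user trust values already pre-computed by the nightly batch, and the clock is injected only so that expiry can be exercised without waiting.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.LongSupplier;
import java.util.function.ToDoubleBiFunction;

/** Illustrative sketch only; names are assumptions, not Centruflow code. */
final class OnDemandDistanceSketch {

    /** A cached distance together with the time at which it was computed. */
    private static final class Entry {
        final double value;
        final long computedAtMillis;

        Entry(double value, long computedAtMillis) {
            this.value = value;
            this.computedAtMillis = computedAtMillis;
        }
    }

    private final Map<List<String>, Entry> cache = new ConcurrentHashMap<>();
    // Placeholder for the resource distance algorithm; assumed to use pre-computed trust values.
    private final ToDoubleBiFunction<String, String> distance;
    private final long timeToLiveMillis;
    private final LongSupplier clockMillis;

    OnDemandDistanceSketch(ToDoubleBiFunction<String, String> distance,
                           long timeToLiveMillis, LongSupplier clockMillis) {
        this.distance = distance;
        this.timeToLiveMillis = timeToLiveMillis;
        this.clockMillis = clockMillis;
    }

    /** Returns the cached distance, recomputing it if it is missing or has expired. */
    double get(String r1, String r2) {
        long now = clockMillis.getAsLong();
        List<String> key = List.of(r1, r2);
        Entry entry = cache.get(key);
        if (entry == null || now - entry.computedAtMillis >= timeToLiveMillis) {
            entry = new Entry(distance.applyAsDouble(r1, r2), now);
            cache.put(key, entry);
        }
        return entry.value;
    }
}
```

In a server, the clock argument would typically be `System::currentTimeMillis`. Two concurrent requests for the same expired pair may both recompute it in this sketch, which is harmless here but wasteful for expensive calculations.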

4.1.3 Our Approach

We chose to design our software such that we could easily switch implementations should a better idea be suggested, but for the purpose of this thesis we implemented the brute-force approach. This choice was made primarily because it is simple to implement, but as noted in section 6.3, we need to more thoroughly plan and implement an on-demand approach in the future. In addition, implementing the weaker brute-force approach had no influence on any of our results, as both approaches would be expected to produce the same result for any given query.
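One common way to make such implementations interchangeable in Java is to hide them behind a shared interface. The sketch below is an assumption about how this could be arranged, not a description of the actual Centruflow Server design; the interface, class names, and the injected distance function are all hypothetical.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.ToDoubleBiFunction;

/** Hypothetical common interface that callers depend on. */
interface DistanceService {
    double distance(String r1, String r2);
}

/** Brute-force variant: every pair is computed up front and then only looked up. */
final class PrecomputedDistanceService implements DistanceService {
    private final Map<List<String>, Double> table = new HashMap<>();

    PrecomputedDistanceService(List<String> resources,
                               ToDoubleBiFunction<String, String> fn) {
        for (String r1 : resources) {
            for (String r2 : resources) {
                if (!r1.equals(r2)) {
                    table.put(List.of(r1, r2), fn.applyAsDouble(r1, r2));
                }
            }
        }
    }

    @Override
    public double distance(String r1, String r2) {
        Double value = table.get(List.of(r1, r2));
        if (value == null) {
            throw new IllegalArgumentException("No pre-computed distance for this pair");
        }
        return value;
    }
}

/** On-demand variant: the distance is computed when it is requested. */
final class LazyDistanceService implements DistanceService {
    private final ToDoubleBiFunction<String, String> fn;

    LazyDistanceService(ToDoubleBiFunction<String, String> fn) {
        this.fn = fn;
    }

    @Override
    public double distance(String r1, String r2) {
        return fn.applyAsDouble(r1, r2);
    }
}
```

Because both variants apply the same underlying function, they return the same value for any pair of known resources, which matches the expectation that the choice of approach does not affect the results.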

In terms of implementation, our approach would not cause a 'nightly calculation downtime', as all calculations would be performed in memory or in a temporary database, with only the last action being to overwrite all data in the production refined tags database table. This limits the period in which users might fail to retrieve useful tagging data but, to be fair, it still leaves a time window during the overwrite in which users may encounter such a problem.
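The 'compute off to the side, publish last' idea can be illustrated with an in-memory analogue in Java. This is a different technique from overwriting a database table: it swaps a reference to an immutable snapshot, so readers see either the old data or the new data. The class name and the string-keyed map are hypothetical choices made for the sketch.

```java
import java.util.Map;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

/** Illustrative in-memory analogue only; not the database-backed Centruflow design. */
final class RefinedTagsStoreSketch {
    // Readers always see one complete snapshot; it starts out empty.
    private final AtomicReference<Map<String, Double>> live =
            new AtomicReference<>(Map.of());

    /** The snapshot currently visible to readers. */
    Map<String, Double> current() {
        return live.get();
    }

    /**
     * Runs the batch to completion first, then publishes the result in a single step.
     * If the batch throws, the previous snapshot stays in place.
     */
    void rebuild(Supplier<Map<String, Double>> batch) {
        Map<String, Double> fresh = Map.copyOf(batch.get()); // immutable copy of the new data
        live.set(fresh);
    }
}
```

With a relational table, a comparable effect would need support from the database itself, for example performing the overwrite inside a transaction; whether that removes the window entirely depends on the database and its isolation settings.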

In document Improving Centruflow using semantic web technologies : a thesis presented in partial fulfillment of the requirements for the degree of Master of Science in Computer Science at Massey University, Palmerston North, New Zealand (Page 78-82)