4.3 Implementation
4.3.3 Ranking Layer
This layer ranks the candidate resources obtained in the generation layer. The candidates are sorted by the value of a semantic similarity function, which measures the similarity between the initial resource and each candidate resource. The current implementation of the framework includes three ranking algorithms, although it is not limited to them.
Fig. 4.4 Example of a category graph for the resource Mole Antonelliana (candidate resources are not included for space reasons).
Similarly to the algorithms of the generation layer, the ranking algorithms are based on the semantic relationships in Linked Data.
Transversal LDSD Ranking
The transversal LDSD ranking algorithm calculates the Linked Data Semantic Distance (LDSD) between an initial resource and each of the candidate resources obtained in the generation layer. The LDSD distance, initially proposed by Passant [66], is based on the number of direct and indirect links between two resources. The similarity of two resources (r1, r2) is measured by combining four properties: the direct input links, the direct output links, the indirect input links, and the indirect output links. Equation 4.1 illustrates the basic form of the LDSD distance.
\[
\mathrm{LDSD}(r_1, r_2) = \frac{1}{1 + C_{dout} + C_{din} + C_{iout} + C_{iin}} \tag{4.1}
\]

Here Cdout is the number of direct output links (from r1 to r2), Cdin is the number of direct input links (from r2 to r1), Ciin is the number of indirect input links, and Ciout is the number of indirect output links.
Unlike the implementation developed by Passant, which is limited to links from a specific domain, the LDSD function implemented in Allied takes into account all resources from the dataset. However, it can be restricted to particular types of links, whether or not they belong to a specific domain, by supplying a set of forbidden links.
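As an illustration, Equation 4.1 can be evaluated directly once the four link counts are known. The following is a minimal sketch, not the Allied implementation; the function name and the sample counts are hypothetical.

```python
def ldsd(c_dout: int, c_din: int, c_iout: int, c_iin: int) -> float:
    """Basic LDSD of Equation 4.1, computed from the four link counts."""
    counts = (c_dout, c_din, c_iout, c_iin)
    if any(c < 0 for c in counts):
        raise ValueError("link counts must be non-negative")
    return 1.0 / (1 + sum(counts))


# Two resources that share no links are at the maximum distance of 1.0.
print(ldsd(0, 0, 0, 0))

# Hypothetical counts: 1 direct output, 0 direct input,
# 2 indirect output, and 1 indirect input link, giving 1 / (1 + 4).
print(ldsd(1, 0, 2, 1))
```

Because the counts appear in the denominator, a larger number of links yields a smaller distance, so closely linked resources obtain values near 0.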
The SPARQL query that counts direct input and output links between an initial resource (<inURI>) and a resource (<crURI>) from the set of candidate resources is presented in Listing 4.7. The SPARQL query that counts indirect input and output links between the same pair of resources is presented in Listing 4.8.
SELECT DISTINCT count(?p) WHERE {
  # output links
  { <inURI> ?p <crURI> . }
  # input links
  UNION
  { <crURI> ?p <inURI> . }
}
Listing 4.7 The SPARQL query to count input and output direct links.
SELECT DISTINCT count(?p) WHERE {
  # input links
  { ?o ?p <inURI> . ?o ?p <crURI> . }
  UNION
  { ?o ?p <inURI> . <crURI> ?p ?o . }
  # output links
  UNION
  { <inURI> ?p ?o . ?o ?p <crURI> . }
  UNION
  { <inURI> ?p ?o . <crURI> ?p ?o . }
}
Listing 4.8 The SPARQL query to count input and output indirect links.
Using these SPARQL queries, the transversal ranking algorithm calculates the LDSD for each pair of resources composed of an initial resource and each of the resources obtained from the generation layer.
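The ranking step can be sketched over a small in-memory set of triples instead of a SPARQL endpoint. The sketch below mirrors the graph patterns of Listings 4.7 and 4.8 and sorts candidates by ascending distance; the `ex:` URIs, the function names, and the `forbidden` parameter are assumptions made for illustration, and the ascending sort order is also an assumption (a smaller distance is taken to mean a more similar resource).

```python
from typing import AbstractSet, Iterable, List, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


def link_counts(triples: Iterable[Triple], in_uri: str, cr_uri: str,
                forbidden: AbstractSet[str] = frozenset()) -> Tuple[int, int, int, int]:
    """Return (Cdout, Cdin, Ciout, Ciin), mirroring Listings 4.7 and 4.8."""
    # Drop triples whose predicate is in the set of forbidden links.
    kept = {t for t in triples if t[1] not in forbidden}
    # Direct links: <inURI> ?p <crURI> (output) and <crURI> ?p <inURI> (input).
    c_dout = sum(1 for s, _, o in kept if s == in_uri and o == cr_uri)
    c_din = sum(1 for s, _, o in kept if s == cr_uri and o == in_uri)
    # Indirect output: <inURI> ?p ?o with ?o ?p <crURI> or <crURI> ?p ?o.
    c_iout = sum(((o, p, cr_uri) in kept) + ((cr_uri, p, o) in kept)
                 for s, p, o in kept if s == in_uri)
    # Indirect input: ?o ?p <inURI> with ?o ?p <crURI> or <crURI> ?p ?o.
    c_iin = sum(((s, p, cr_uri) in kept) + ((cr_uri, p, s) in kept)
                for s, p, o in kept if o == in_uri)
    return c_dout, c_din, c_iout, c_iin


def rank_candidates(triples: Iterable[Triple], in_uri: str, candidates: List[str],
                    forbidden: AbstractSet[str] = frozenset()) -> List[Tuple[str, float]]:
    """Score each candidate with Equation 4.1 and sort by ascending distance."""
    triples = list(triples)
    scored = [(cr, 1.0 / (1 + sum(link_counts(triples, in_uri, cr, forbidden))))
              for cr in candidates]
    return sorted(scored, key=lambda pair: pair[1])


# Hypothetical toy dataset.
TRIPLES = [
    ("ex:Mole", "ex:location", "ex:Turin"),
    ("ex:Museum", "ex:location", "ex:Turin"),
    ("ex:Mole", "ex:hosts", "ex:Museum"),
    ("ex:Mole", "ex:type", "ex:Building"),
    ("ex:Tower", "ex:type", "ex:Building"),
]

for uri, distance in rank_candidates(TRIPLES, "ex:Mole",
                                     ["ex:Museum", "ex:Tower", "ex:Palace"]):
    print(uri, round(distance, 3))
```

In this toy dataset ex:Museum is ranked first (one direct and one indirect link), ex:Tower second (one indirect link), and ex:Palace last (no links). Passing `forbidden={"ex:type"}` removes the only link to ex:Tower, which then also receives the maximum distance of 1.0.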
HyProximity Ranking
The HyProximity ranking algorithm is based on the similarity measure defined by Stankovic et al. [72]. This measure can be used to calculate both transversal and hierarchical similarities. In its general form, shown in Equation 4.2, HyProximity is the inverse of the distance between two resources, balanced with a weighting function. In this equation, d(r1, r2) is the distance function between the resources r1 and r2, while p(r1, r2) is the weighting function, which is used to weight different distances.
\[
\mathrm{hyP}(r_1, r_2) = \frac{p(r_1, r_2)}{d(r_1, r_2)} \tag{4.2}
\]
Based on the structural relationships (hierarchical and transversal), different distance and weighting functions may be used to calculate the HyProximity similarity:

Hierarchical HyProximity The definition of this similarity function relies on the work of Stankovic et al. [72]. It depends on the maximum distance of categories from the initial resource, as in the hierarchical generator algorithm (Section 4.3.2). In particular, d(ir, ri) = maxDistance, where ir is the initial resource, ri is each of the candidate resources generated by the hierarchical algorithm, and maxDistance is the maximum distance of the broader categories from the base ones. The weighting function is defined in Equation 4.3, which is an adaptation of the information content function defined by Seco et al. [73]. In this equation, hypo(C) is the number of descendants of a category C and |C| is the total number of categories in the category graph of C. This function was chosen because it reduces the computational complexity of the information content compared with other functions that use external corpora [74].
\[
p(C) = 1 - \frac{\log(\mathrm{hypo}(C) + 1)}{\log(|C|)} \tag{4.3}
\]
Transversal HyProximity In this similarity function, d(ir, ri) = maxDistance if the generator of resources is hierarchical; otherwise d(ir, ri) = 1 for resources connected to the initial resource through direct transversal links, or d(ir, ri) = 2 for indirect transversal links. The weighting function is defined in Equation 4.4: ptransv(r1, r2) depends on n, the number of resources connected over a specific property, and on M, the total number of resources. In Allied, this algorithm is not limited to a specific property, and it can optionally be configured to support a set of forbidden or allowed links, in a similar way to that shown in Section 4.3.2 for the generation layer. The number of direct and indirect links was calculated with SPARQL queries. The value of M was fixed to the number of resources contained in DBpedia.8
\[
p_{transv}(r_1, r_2) = -\log\left(\frac{n}{M}\right) \tag{4.4}
\]
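Equations 4.2 to 4.4 can be combined in a few lines of code. The following is a minimal sketch under stated assumptions rather than the Allied implementation: the function names and numeric inputs are hypothetical, and the natural logarithm is assumed because the base is not specified in the text (the ratio in Equation 4.3 does not depend on the base, whereas the value of Equation 4.4 does).

```python
import math


def hyproximity(weight: float, distance: float) -> float:
    """Equation 4.2: the weighting p(r1, r2) divided by the distance d(r1, r2)."""
    if distance <= 0:
        raise ValueError("distance must be positive")
    return weight / distance


def hierarchical_weight(hypo: int, total_categories: int) -> float:
    """Equation 4.3: 1 - log(hypo(C) + 1) / log(|C|)."""
    if hypo < 0 or total_categories < 2:
        raise ValueError("need hypo >= 0 and at least two categories")
    return 1.0 - math.log(hypo + 1) / math.log(total_categories)


def transversal_weight(n: int, m: int) -> float:
    """Equation 4.4: -log(n / M), assuming 0 < n <= M."""
    if not 0 < n <= m:
        raise ValueError("need 0 < n <= M")
    return -math.log(n / m)


def transversal_distance(direct: bool) -> int:
    """Distance 1 for direct transversal links and 2 for indirect ones."""
    return 1 if direct else 2


# Hierarchical case: a category with 9 descendants in a graph of 100
# categories, reached with maxDistance = 2 (all values hypothetical).
print(hyproximity(hierarchical_weight(9, 100), 2))

# Transversal case: n = 10 resources out of M = 1000, indirect link.
print(hyproximity(transversal_weight(10, 1000), transversal_distance(False)))
```

In both weighting functions, rarer items receive a larger weight: a category with few descendants, or a link shared by few resources, contributes more to the similarity than a very common one.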