Semantic Identification of Scientific
Collaboration Networks for Social Network
Analysis
Hector G. Ceballos, Juan C. Lavariega, Gustavo A. Parada, and Manuel Rodriguez-Mancha
Tecnologico de Monterrey, Campus Monterrey Av. Eugenio Garza Sada 2501, Monterrey, N.L. 64849 {ceballos,lavariega,A00812667,A00805445}@itesm.mx
Abstract. We present an approach for identifying scientific collaboration networks from repositories of research activities and products using Semantic Web technologies. Our approach permits isolating multiple aspects or levels of collaboration through OWL definitions and exposes them as Linked Data compliant with Social Network Analysis formats. We illustrate the advantages of using query rewriting for extracting and persisting collaboration networks from a relational database.
Keywords: Scientific Collaboration, Linked Data, SNA Tools, Query Rewriting, Ontology, RDB2RDF.
1 Introduction
Networks are everywhere. From the World Wide Web to networks in economics, networks of disease, information transmission and even terrorist networks, the imagery of networks pervades modern culture [9]. There has been much research in this space across disciplines, since scholars have examined social networks through a variety of perspectives and using different statistical approaches.
Recent years have witnessed a dramatic increase in the availability of network datasets that comprise thousands and sometimes even millions of vertices, mainly as a consequence of the widespread availability of electronic databases, and even more important, the Internet. Networks of scientific collaborations, for example, can now be recorded in real time through electronic databases like Medline and the Science Citation Index [8, 3].
Nevertheless, the information used for Social Network Analysis (SNA) must be extracted, in the best scenario, from databases that were not designed for such analysis, and in the worst scenario, it is distributed among multiple heterogeneous unstructured or semi-structured data sources. Heterogeneity of sources and formats makes ontologies an ideal representation for integrating and reconciling information.
Some ontology-based approaches have already been proposed in this sense. For instance, Ahmedi et al. [1] show how to extract the co-authorship relation from a publication repository in order to apply Social Network Analysis techniques. Ahmedi's approach extends the FOAF ontology, which represents relationships between persons in a broad sense. Nevertheless, the nature of such relationships is not specified, and the scope of the extracted social network cannot be delimited for correlating different aspects of the co-authorship network.
In this work we propose a light-weight extension of the SWRC ontology [10] for identifying the different collaboration relations underlying the semantic representation of research activities. Our approach facilitates analyzing the evolution of collaboration networks (including co-authorship networks) by delimiting network stages through publication date. Furthermore, we provide a method for extracting research social networks from large databases through a D2R Server¹ using query rewriting techniques.
2 Background
In this section we review some ontologies that capture the semantics of research activities and describe the formats that SNA tools use for processing network structures.
2.1 Ontologies for representing Research Activities
Our approach for identifying scientific collaboration depends on measuring the joint participation of people in research activities and products, a well-accepted practice in co-authorship studies [8]. We extend this principle to research activities like groups and projects that have not been considered before due to the lack of information. Nevertheless, the emergence of research social networks like ResearchGate² and Mendeley³ will promote the proliferation of this information.
Furthermore, the VIVO project⁴ can be used for indexing institutional repositories of research activities using semantic web formats, i.e. producing native Linked Data. VIVO provides an open source semantic web application that permits capturing institutional research profiles using a reference ontology [7]. This ontology extends the FOAF ontology for representing people, and the BIBO⁵ and Dublin Core ontologies for representing bibliographic metadata.
Similarly, the AKT project has developed a series of knowledge technologies supported by an ontology that comprises research activities. AKT's reference ontology [2] is used by the ReSIST RKB Explorer⁶ for providing a visualization of researchers, organizations/groups and publications. This platform has also been used by several universities for publishing linked data of their faculty's scientific production.

1 D2R Server. http://d2rq.org/d2r-server
2 ResearchGate. http://www.researchgate.net/
3 Mendeley. http://www.mendeley.com/
4 VIVO: an interdisciplinary network. http://vivoweb.org/
5 The Bibliographic Ontology. http://bibliontology.com/
6
Finally, the Semantic Web Research Communities ontology (SWRC) is another proposal for representing research communities through their participants, activities and products [10]. SWRC extends the Dublin Core ontology for representing publications and it has an extension for representing the different roles a person can play in a research community.
All these projects (and their ontologies) have a proper representation of scientific collaboration. SWRC and AKT represent authorship through the dc:contributor and akt:has-author properties, respectively, whereas VIVO uses the class vivo:Authorship. Similarly, SWRC uses swrc:member for representing participation in research groups, whereas VIVO uses the class vivo:Position.
2.2 Network structures for SNA
The transformation from collaborative activities and products into network structures is illustrated in Figure 1. Figure 1 (a) shows an authorship network where authors are represented by ovals (A1, A2, A3), papers are represented by boxes (P1, P2, P3), and directed edges denote authorship. In order to identify co-authorship relations, we need to identify and count the papers in which every pair of authors participates. The result is shown in Figure 1 (b), where undirected arcs denote the collaboration relationship and are labeled with the list and count of co-authored papers.
Fig. 1. Transformation from joint activities/products (a) to a weighted network (b).
The weighted network shown in Figure 1 (b) is an input for SNA tools, where the relevant information is: nodes (authors), ties (co-authorships), and tie strength (weights). The list of joint papers or joint activities is used for delimiting the collaboration network, as we will explain later.
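The transformation from Figure 1 (a) to Figure 1 (b) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the authorship data below is hypothetical (it is not the actual content of Figure 1), and the function name coauthorship_edges is ours.

```python
from collections import defaultdict
from itertools import combinations


def coauthorship_edges(authorship):
    """Map each unordered author pair to the list of papers they share."""
    joint = defaultdict(list)
    for paper, authors in sorted(authorship.items()):
        # Sorting the authors fixes the order inside each pair, so
        # (A1, A2) and (A2, A1) are never counted as two different ties.
        for a, b in combinations(sorted(set(authors)), 2):
            joint[(a, b)].append(paper)
    return dict(joint)


# Hypothetical authorship data (paper -> authors), for illustration only.
authorship = {"P1": ["A1", "A2"], "P2": ["A1", "A2", "A3"], "P3": ["A3"]}

for (a, b), papers in sorted(coauthorship_edges(authorship).items()):
    # Tie strength is the number of joint papers; the list itself is kept
    # so that the network can later be delimited by activity.
    print(a, b, len(papers), papers)
```

Note that the single-author paper P3 contributes no tie, and that keeping the list of joint papers (rather than only the count) is what later allows slicing the network by activity attributes.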
Network structures (graphs) are represented in three main formats: adjacency matrices, adjacency lists and vertex pairs. Adjacency matrices are N×N matrices representing the collaboration between N persons; the full square matrix is needed for directed graphs, whereas a triangular one suffices for undirected graphs. In adjacency lists, designed for representing directed graphs, the source node is followed by the list of the nodes that are the targets of every arc starting from that node. Vertex pairs are a special case of adjacency lists where only two nodes are represented on each line; they can be used for representing directed or undirected graphs.
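As a rough illustration of these three formats, the following Python sketch encodes one small undirected weighted network in each of them. The weights are hypothetical and the function names are ours; for the undirected case the adjacency list simply records each tie in both directions.

```python
def adjacency_matrix(nodes, weights):
    """N x N matrix; symmetric because the network is undirected."""
    index = {n: i for i, n in enumerate(nodes)}
    matrix = [[0] * len(nodes) for _ in nodes]
    for (a, b), w in weights.items():
        matrix[index[a]][index[b]] = w
        matrix[index[b]][index[a]] = w
    return matrix


def adjacency_list(nodes, weights):
    """Each node followed by the nodes it is tied to."""
    neighbours = {n: [] for n in nodes}
    for a, b in sorted(weights):
        neighbours[a].append(b)
        neighbours[b].append(a)
    return neighbours


def vertex_pairs(weights):
    """One (source, target, weight) triple per tie."""
    return [(a, b, w) for (a, b), w in sorted(weights.items())]


# Hypothetical weighted ties between three authors.
nodes = ["A1", "A2", "A3"]
weights = {("A1", "A2"): 2, ("A1", "A3"): 1, ("A2", "A3"): 1}

print(adjacency_matrix(nodes, weights))
print(adjacency_list(nodes, weights))
print(vertex_pairs(weights))
```

Since the matrix is symmetric, storing only its upper (or lower) triangle would lose no information, which is the reason a triangular matrix is enough for undirected networks.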
3 Identifying Scientific Collaboration Networks
We use the method illustrated in Figure 2 for extracting different types of collaboration networks from a research corporate memory, and introduce a light-weight ontology for materializing these networks.
Fig. 2. Collaboration networks extraction and persistence.
As shown in Figure 2, the research corporate memory is stored in a relational database where person names are already normalized, which means that each person is represented by a unique identifier across the entire repository. The extraction method consists of the following steps: 1) mapping research activities in the relational database, 2) defining collaboration networks, 3) rewriting these definitions in terms of the available mappings, 4) extracting information of collaboration networks, and 5) optionally, materializing them. Next we describe the proposed scientific collaboration ontology and then provide more details on each step.
3.1 The Ontology SWRC-Collab
We introduce a slim ontology for representing scientific collaboration relationships as an extension of the SWRC ontology (version 0.7.1), shown in Figure 3. SWRC-Collab is composed of only two classes: Collaboration, which represents collaboration between two persons or organizations, and CollabActivity, which points to the activity or product on which these two agents collaborate. Publications, projects, products and research groups are examples of collaboration activities defined in SWRC. The relationship between the first three kinds of objects and a person or organization is represented by the Dublin Core property dc:contributor, whereas membership in organizations like research groups is denoted by the property swrc:member. Note that the class Collaboration uses the same property (collaborator) for linking both participants, hence it only supports undirected networks.
Fig. 3. Collaboration networks extraction and codification methodology.
3.2 Mapping Research Activities
We start by generating mappings from the relational database to SWRC schemas using the D2RQ language⁷. As recommended by ontology-based information integration methodologies like MOMIS [4], only those tables and properties related to research activities were mapped. For instance, publications, their authors, and their publication date were mapped for identifying co-authorship network dynamics.
Given that D2R Server does not provide inference support, we overloaded class mappings for representing the class hierarchy. That is, each table was mapped not only to its most specific class but also to every superclass of that class. For instance, journal articles were mapped to both swrc:Document and swrc:Article.
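The overloading of class mappings amounts to walking up the class hierarchy and emitting one mapping per class found. The Python sketch below shows the idea only; the hierarchy fragment is a hypothetical simplification (the real SWRC hierarchy is larger and may contain intermediate classes), and classes_to_map is our own helper, not part of D2RQ.

```python
# Hypothetical, simplified fragment of a class hierarchy (child -> parent).
PARENT = {
    "swrc:Article": "swrc:Document",
    "swrc:InProceedings": "swrc:Document",
    "swrc:Book": "swrc:Document",
    "swrc:InBook": "swrc:Document",
}


def classes_to_map(cls, parent=PARENT):
    """Return cls followed by all its ancestors: one class mapping is needed for each."""
    chain = [cls]
    while chain[-1] in parent:
        chain.append(parent[chain[-1]])
    return chain


# A table holding journal articles needs a mapping to each of these classes.
print(classes_to_map("swrc:Article"))  # ['swrc:Article', 'swrc:Document']
```

With such mappings in place, a query for instances of the upper-level class also returns the rows of every subclass table, which compensates for the missing inference support.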
3.3 Defining Collaboration Networks
The collaboration aspect to analyze is delimited by declaring collaboration networks as OWL concepts. A collaboration network is defined as a Collaboration subclass and it is delimited by constraints on: 1) the nature of the collaboration activity, and 2) the attributes of these activities. Every person or organization participating in a collaborative activity subsumed by the definition is included in the network. Note that: i) a person or organization may belong to multiple collaboration networks, and ii) activities performed by a single person do not denote collaboration, i.e. they do not produce any edges.
7
For instance, the following definitions delimit the boundaries of the collaboration network by constraining the nature of the collaborative activity:
Coauthorship ≡ Collaboration ⊓ ∀collab_on.Document
ResearchProjectCollab ≡ Collaboration ⊓ ∀collab_on.ResearchProject
ResearchGroupCollab ≡ Collaboration ⊓ ∀collab_on.ResearchGroup

Note that in SWRC, ResearchProject is a subclass of Project, whereas ResearchGroup is a subclass of Organization.
Participants of a collaboration network are identified by the instances of the respective Collaboration subclass. For instance, for extracting the overall co-authorship network we use the query pattern: ?c rdf:type Coauthorship. A collaboration network can then be refined in two ways: 1) by defining a new subclass with additional constraints, or 2) by expressing additional constraints in the query pattern used for identifying its participants. The first approach allows determining whether one collaboration network is contained in another, but it is limited by the expressivity of the available concept constructors (and by the capabilities of the concept reasoner). It also permits validating that a collaboration network definition is consistent, so that an empty result can be attributed to the absence of participants rather than to a faulty definition. The second approach, on the other hand, allows expressing richer constraints through SPARQL filters, but these refinements cannot be persisted in RDF, as we explain below. For instance, if we want to delimit the collaboration network to co-authorship of journal articles published in 2003, we build a query pattern like:
?c rdf:type Coauthorship .
?c collab_on ?a .
?a rdf:type swrc:Article .
?a dc:date ?d .
FILTER (regex(str(?d), "2003"))
swrc:Article distinguishes articles published in journals and it is disjoint with other swrc:Document subclasses, such as swrc:InProceedings. The publication year is filtered using a regular expression on the property dc:date, which cannot be expressed using concept constructors. An alternative solution would be having an additional property for the publication year only.
As can be seen, we can identify co-authorship networks for a given period of time by using the attribute dc:date. Other collaboration networks whose dynamics can be analyzed are: 1) joint participation in research projects (using properties swrc:startDate and swrc:endDate), and 2) design of products (swrc:creationDate).
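The effect of delimiting a network by activity type and period can be mimicked outside SPARQL. The Python sketch below filters a list of activity records before counting joint participation. It is an illustration under assumptions: the records are hypothetical, dates are assumed to start with a four-digit year (the regular expression in the query pattern above is looser, since it matches the year anywhere in the date string), and sliced_network is our own name.

```python
from collections import Counter
from itertools import combinations


def sliced_network(activities, activity_type=None, years=None):
    """Weighted ties restricted to one activity type and/or a set of years."""
    weights = Counter()
    for act in activities:
        if activity_type is not None and act["type"] != activity_type:
            continue
        # Assumption: the date starts with a four-digit year, e.g. "2003-05-01".
        if years is not None and int(act["date"][:4]) not in years:
            continue
        for pair in combinations(sorted(set(act["contributors"])), 2):
            weights[pair] += 1
    return dict(weights)


# Hypothetical activity records, for illustration only.
activities = [
    {"id": "P1", "type": "Article", "date": "2003-05-01", "contributors": ["A1", "A2"]},
    {"id": "P2", "type": "InProceedings", "date": "2003-09-12", "contributors": ["A1", "A2", "A3"]},
    {"id": "P3", "type": "Article", "date": "2004-01-20", "contributors": ["A2", "A3"]},
]

print(sliced_network(activities))                     # overall network
print(sliced_network(activities, "Article", {2003}))  # journal articles, 2003
```

Comparing successive slices (for example one per year) gives the stages of the network whose evolution is to be analyzed.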
SPARQL queries shown in Figure 4 can be used for generating a representation of collaboration networks in terms of nodes and vertex pairs. The edge query generates vertex pairs that represent the network structure. The collaboration relationship can be replaced by the corresponding class or query pattern in the WHERE clause of the query.
In the edge query, the filter str(?p1) > str(?p2) avoids generating the same tuple in inverse order, i.e. (?p1,?p2,?weight) and (?p2,?p1,?weight), and also discards pairs in which both variables are bound to the same participant. Collaboration strength between ?p1 and ?p2 is represented by the variable ?weight. The node query can additionally extract collaborators' attributes, previously mapped from the relational database.

# Edge (ties) query
SELECT ?p1 ?p2 (count(distinct ?c) AS ?weight)
WHERE {
  ?c rdf:type Collaboration .
  ?c collaborator ?p1 .
  ?c collaborator ?p2 .
  FILTER (str(?p1) > str(?p2))
}
GROUP BY ?p1 ?p2

# Node query
SELECT DISTINCT ?p
WHERE {
  ?c rdf:type Collaboration .
  ?c collaborator ?p
}

Fig. 4. SPARQL queries for extracting nodes and edges from collaboration networks.
The SWRC-Collab ontology, along with some collaboration network definitions, can be found on the project website⁸.
3.4 Rewriting Collaboration Networks
Next we use query rewriting for expressing collaboration networks in terms of SWRC schemas. Query rewriting requires that: 1) the class Collaboration as well as the properties collaborator and collab_on be replaced by the corresponding SWRC schemas, and 2) constraints represented by collaboration definitions be introduced in the query.
For instance, Figure 5 shows the rewritten version of the edge query of the co-authorship network for articles published in journals in the period 2003–2005. Similarly, the query for extracting information about collaborators (nodes) must be re-expressed using the respective collaboration relation.
SELECT ?p1 ?p2 (count(distinct ?c) AS ?weight)
WHERE {
  ?c rdf:type swrc:Article .
  ?c dc:contributor ?p1 .
  ?c dc:contributor ?p2 .
  ?c dc:date ?date
  FILTER (regex(str(?date), "2003") || regex(str(?date), "2004") || regex(str(?date), "2005"))
  FILTER (str(?p1) > str(?p2))
}
GROUP BY ?p1 ?p2
Fig. 5. Rewritten edges query for the co-authorship network on Journals (2003-2005).
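The shape of a rewritten edge query like the one in Figure 5 can be reproduced by simple template substitution. The Python sketch below is not a query rewriting engine and is not the implementation used in this work; it only shows, under assumptions, how the activity class, the participation property and the date constraint end up in the query text. The function names and parameters are hypothetical, and the activity is assumed to be the subject of the participation property.

```python
def year_filter(var, years):
    """Build a FILTER that matches any of the given years in a date literal."""
    tests = ['regex(str(?%s), "%d")' % (var, year) for year in years]
    return "FILTER (" + " || ".join(tests) + ")"


def rewrite_edge_query(activity_class, link_property, date_property=None, years=()):
    """Return the text of an edge query expressed directly over SWRC schemas."""
    patterns = [
        "?c rdf:type %s ." % activity_class,
        "?c %s ?p1 ." % link_property,
        "?c %s ?p2 ." % link_property,
    ]
    years = list(years)
    if date_property and years:
        patterns.append("?c %s ?date ." % date_property)
        patterns.append(year_filter("date", years))
    # Keep a single ordering of each pair and drop self-pairs.
    patterns.append("FILTER (str(?p1) > str(?p2))")
    return (
        "SELECT ?p1 ?p2 (count(distinct ?c) AS ?weight)\n"
        "WHERE {\n  " + "\n  ".join(patterns) + "\n}\n"
        "GROUP BY ?p1 ?p2"
    )


# Co-authorship on journal articles, 2003-2005 (compare with Figure 5).
print(rewrite_edge_query("swrc:Article", "dc:contributor", "dc:date", range(2003, 2006)))
```

Calling the same function with "swrc:ResearchGroup" and "swrc:member" and no date constraint would give a candidate edge query for collaboration in research groups, provided the property direction assumed above holds in the mapped data.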
8 SWRC-Collab Ontology. http://semtech.mty.itesm.mx/ontologies/
3.5 Extraction and persistence
Information obtained from these queries through a SPARQL end-point is transformed into XML for its analysis (cf. Section 2.2). Additionally, collaboration networks can be materialized in RDF by executing a SPARQL CONSTRUCT query like the one shown in Figure 6. This query materializes the co-authorship network using the corresponding query pattern. The queries in Figure 4 can be used for obtaining weighted edges and nodes from materialized networks.
PREFIX collab: <http://semtech.mty.itesm.mx/ontologies/swrc-collab.owl#>
CONSTRUCT {
  _:c rdf:type collab:Coauthorship .
  _:c collab:collaborator ?p1 .
  _:c collab:collaborator ?p2 .
  _:c collab:collab_on ?a
}
WHERE {
  ?a rdf:type swrc:Document .
  ?a dc:contributor ?p1 .
  ?a dc:contributor ?p2 .
  FILTER (str(?p1) > str(?p2))
}
Fig. 6. SPARQL CONSTRUCT query for persisting the co-authorship network.
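The transformation of query results into XML can be sketched as follows for the GEXF format read by Gephi. This is a minimal sketch, assuming the node and edge query results have already been fetched into plain Python structures; it emits only a small subset of GEXF 1.2 as we understand it (nodes with labels, undirected weighted edges), the function name to_gexf is ours, and the sample data is hypothetical.

```python
import xml.etree.ElementTree as ET

GEXF_NS = "http://www.gexf.net/1.2draft"


def to_gexf(nodes, edges):
    """nodes: {id: label}; edges: iterable of (source, target, weight)."""
    gexf = ET.Element("gexf", {"xmlns": GEXF_NS, "version": "1.2"})
    graph = ET.SubElement(gexf, "graph", {"defaultedgetype": "undirected"})
    nodes_el = ET.SubElement(graph, "nodes")
    for node_id, label in sorted(nodes.items()):
        ET.SubElement(nodes_el, "node", {"id": node_id, "label": label})
    edges_el = ET.SubElement(graph, "edges")
    for i, (source, target, weight) in enumerate(edges):
        ET.SubElement(
            edges_el,
            "edge",
            {"id": str(i), "source": source, "target": target, "weight": str(weight)},
        )
    return ET.tostring(gexf, encoding="unicode")


# Hypothetical results of the node and edge queries.
nodes = {"A1": "Author One", "A2": "Author Two", "A3": "Author Three"}
edges = [("A2", "A1", 2), ("A3", "A1", 1), ("A3", "A2", 1)]
print(to_gexf(nodes, edges))
```

Additional node attributes (such as the discipline used for coloring) would require the GEXF attribute declarations, which are omitted here.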
4 Evaluation and Results
The university research corporate memory we use as test-bed contains information about professors, students, research groups and publications, and satisfies the author name normalization requirement [5]. Through D2RQ mappings we identified 15,118 documents and 7,576 different authors from a MySQL database. The set of documents is made up of 4,662 journal articles (swrc:Article), 7,055 articles in proceedings (swrc:InProceedings), 1,274 books (swrc:Book) and 2,127 book chapters (swrc:InBook).
Collaboration networks were modeled and tested with dummy instances using Protégé⁹. We then used the D2R Server SPARQL end-point for querying the relational database. The size of some collaboration networks is shown in Table 1. This table also compares the time required for extracting both the list of vertex pairs and the list of nodes using rewritten queries from: 1) models loaded in memory and 2) the SPARQL end-point directly. Query times are the average of five runs on a laptop with an Intel i5 processor and 4 GB of RAM running Windows 7. As can be seen, execution time is greater when queries are sent directly to the end-point. Nevertheless, the second choice enables processing very large networks without loading models in memory, hence avoiding memory overhead.
Additionally, a partial co-authorship network was formatted as a .gexf file¹⁰. Figure 7 shows an extract of the co-authorship network visualized through Gephi.
9 The Protégé Ontology Editor and Acquisition System. http://protege.stanford.edu/
10
Table 1. Size of collaboration networks.
Collaboration Network        Nodes    Edges     In-Memory    End-point
Coauthorship                 7,273    21,094    7.7 sec.     50.64 sec.
Coauthorship on Journal      3,549    9,559     8.9 sec.     15.57 sec.
Coauthorship on 2009-2010    2,730    7,665     7.8 sec.     139.76 sec.
This network shows only professors, i.e. it excludes students and external authors, and it emphasizes the most prolific authors. Nodes are labeled with the author name, colored by the scientific discipline of the author, and sized by their average weighted degree. In order to add this information to the .gexf file we declared the respective D2RQ mappings for the class swrc:Person.
Fig. 7. Visualization of the Co-authorship network with Gephi (snapshot).
5 Discussion
Although we selected the SWRC ontology, our approach can use other reference ontologies for scientific activities like AKT or VIVO. As shown, by expressing collaboration schemas in terms of upper-level classes (e.g. Documents), we enable the selection and delimitation of collaboration networks on lower-level classes (e.g. Articles, Books) and through their attributes (e.g. publication date).
In traditional approaches like [6, 1], identifying collaboration networks requires using OWL inference and SWRL, which makes them impractical and non-scalable.
Ereteo et al. [6] acknowledge that computing SNA metrics from networks with millions of actors stored in CORESE “is out of reach today”. Our query rewriting method does not have this problem because it uses a relational database (RDB) as back-end for effectively identifying relationships between actors and lets a specialized tool like Gephi perform such calculations.
Processing networks with millions of nodes and edges is possible through specialized software like Gephi, Pegasus¹¹ and Pajek-XXL¹². Although Gephi is focused on visualization, it is capable of processing networks with thousands of nodes and millions of edges¹³. Pegasus and Pajek-XXL, on the other hand, specialize in processing networks with millions of nodes in parallel. The co-authorship network, the largest network we obtained from our test database (see Table 1), was processed and visualized using Gephi.
Additionally, in approaches like [1] the resulting co-authorship network cannot be sliced, by publication type or year for instance, because it does not link the collaboration with its supporting activity. Besides, our approach does not normalize tie strength because SNA tools already provide normalization facilities.
Finally, although we query a single database in the case study, a union of federated subqueries would permit consolidating multiple sources, even when they are codified with different ontologies. Besides, integrating multiple collaboration layers (co-authorship and collaboration in projects, for instance) would only require expressing the corresponding Collaboration subclass and letting a query rewriting engine calculate the equivalent queries.
6 Conclusions
We presented an approach for identifying scientific collaboration networks from repositories of research activities and products using Semantic Web technology. Our approach uses a light-weight ontology that permits identifying the activity/product that supports the collaboration relationship, allowing the isolation of multiple aspects or levels of collaboration that can be analyzed and contrasted using SNA tools.
We illustrated how to extract such collaboration networks from a relational database using query rewriting techniques. Nevertheless, our approach can also be used for analyzing collaboration patterns underlying semantic web portals like VIVO or publication repositories codified using the AKT reference ontology.
Information produced through our methodology is already being used by sociologists at our institution for analyzing brokerage patterns in scientific networks and for recommending inter-institutional collaboration to our researchers.
11 Carnegie Mellon University. Project Pegasus. http://www.cs.cmu.edu/~pegasus/
12 Pajek/Pajek-XXL. http://mrvar.fdv.uni-lj.si/pajek/
13
7 Acknowledgements
The authors thank Tecnologico de Monterrey and CONACYT for sponsoring this research through grants 0020PRY058 and CB-2011-01-167460, respectively.
References
1. Ahmedi, L., Abazi-Bexheti, L., Kadriu, A.: A Uniform Semantic Web Framework for Co-Authorship Networks. In Proceedings of the 2011 IEEE Ninth International Conference on Dependable, Autonomic and Secure Computing (2011) 958–965
2. Alani, H., Dasmahapatra, S., O’Hara, K., Shadbolt, N.: Identifying communities of practice through ontology network analysis. IEEE Intelligent Systems 18(2) (2003) 18–25
3. Barabasi, A.L., Jeong, H., Neda, Z., Ravasz, E., Schubert, A., Vicsek, T.: Evolution of the social network of scientific collaborations. Physica A 311 (2002) 590–614
4. Beneventano, B., Bergamaschi, S.: The MOMIS methodology for integrating heterogeneous data sources. In Proceedings of 18th IFIP World Computer Congress (2004)
5. Cantu, F., Ceballos, H.G., Mora, S.P., Escoffie, M.A.: A Knowledge-based information system for managing research programs and value creation in a university environment. In Proceedings of Americas Conference on Information Systems - AMCIS (2005) 781–791
6. Ereteo, G., Limpens, F., Gandon, F., Corby, O., Buffa, M., Lietzelman, M., Sander, P.: Semantic Social Network Analysis: A Concrete Case. Chapter in Handbook of Research on Methods and Techniques for Studying Virtual Communities: Paradigms and Phenomena. IGI Global (2011) 122–156
7. Mitchell, S., Chen, S., Ahmed, M., Lowe, B., Markes, P., Rejack, N., Corson-Rikert, J., He, B., Ding, Y.: The VIVO Ontology: Enabling Networking of Scientists. In Proceedings of the ACM WebSci’11 (2011)
8. Newman, M. E.: The structure of scientific collaboration networks. Proceedings of the National Academy of Sciences, 98(2) (2001) 404–409
9. Newman, M., Barabasi A.L., Watts D. J.: The Structure and Dynamics of Networks. Princeton University Press (2011)
10. Sure, Y., Bloehdorn, S., Haase, P., Hartmann, J., Oberle, D.: The SWRC Ontology - Semantic Web for Research Communities. In Proceedings of the 12th Portuguese Conference on Artificial Intelligence (EPIA 2005). Springer, Covilha, Portugal (2005)