An Analysis of Collaborative Patterns in Large-Scale Ontology Development Projects


Sean M. Falconer

Stanford University Stanford, CA, USA

Tania Tudorache

Natalya F. Noy

ABSTRACT

Today, distributed teams collaboratively create and maintain more and more ontologies. To support this type of ontology development, software engineers are introducing a new generation of tools. However, we know relatively little about how existing large-scale collaborative ontology development works and what user workflows the tools must support. In this paper, we analyze our experience in supporting several such projects. We describe a visual and interactive project-management tool that we have developed, which helps ontology developers explore historical ontology change and discussion data. We present the results of qualitative and quantitative studies of the collaborative activity associated with three large-scale ontology-development projects. Based on the analysis, we conclude that domain and ontology experts have different patterns of ontology editing behavior, which has important implications for ontology-development tools.

Categories and Subject Descriptors

H.5.3 [Information Interfaces and Presentation]: Group and Organization Interfaces - Computer-supported cooperative work, Web-based interaction, Evaluation/methodology; D.2.8 [Software Engineering]: Metrics - Process metrics

General Terms

Measurement, Human Factors

Keywords

Collaboration patterns, Role identification, Authoring tools, Collaborative and social approaches to knowledge management and acquisition, Knowledge acquisition tools, Knowledge engineering and modelling methodologies

1. INTRODUCTION

In the past, the development of ontologies has mostly been restricted to individuals or small groups. However, knowledge-intensive applications today depend on large-scale ontologies. As a result, ontologies are becoming so large that no single individual can possibly maintain and develop the entire terminology. Large biomedical ontologies, containing tens of thousands of classes, such as the Gene Ontology (GO), are possible only through collaborative development. Indeed, most large-scale ontology-development projects now involve a collaborative effort [12].

To support these large-scale projects, a growing number of collaborative ontology editors now exist [8]; e.g., the NeON toolkit¹, Collaborative Protégé [14], Semantic wikis [3], and Knoodl². Despite the number of tools available, they are still very much in their infancy. The tools are for the most part generic and it is not obvious how they can be used to support different collaborative workflows for different projects.

In this paper, we take some of the first steps to learn more about the ways in which different collaborative ontology development projects currently work. What are the similarities and differences in terms of the roles that participants in the projects fulfill? What characteristics distinguish the role of a contributor? Is there a relationship between these roles and the roles of those contributing to open-source software (OSS) or projects such as Wikipedia? Can we re-use techniques, software, and approaches from these communities? Finally, can a deeper understanding of these projects and roles inform tool and ontology development?

We need to answer these questions in order to develop better and more flexible tools. To address them, we first need tools to help explore the collaborative activity of ontology projects. Thus, we developed a visual analysis tool that helps users explore the collaborative activity (changes and notes) associated with an ontology project. The supported visualizations allow an individual to study the history of their ontology. In this paper, we present this analysis tool and its use in studying the collaborative patterns of three different ontology development projects with varying workflows and scale. Based on the observed patterns, we apply clustering and statistical analysis to identify the implicit roles of ontology authors. We also analyze the relationship between changes that occur in the ontology and how users communicate in distributed environments. We found that different user roles, such as domain specialist versus ontology expert, lead to different behavioral patterns in terms of a user’s collaborative editing workflow. We discuss the implications of this observation and other findings for tool design and ontology development.

2. THREE LARGE-SCALE BIOMEDICAL ONTOLOGIES

To study the changes and collaborative patterns that users follow in collaborative ontology development, we collected data from three diverse projects: (1) the National Cancer Institute’s Thesaurus (NCI Thesaurus); (2) the World Health Organization’s (WHO) International Classification of Diseases, revision 11 (ICD-11); and (3) the Biomedical Resources Ontology (BRO), being developed under the auspices of the US National Institutes of Health (NIH). The projects vary significantly, both in their scale and in the workflow that they adopt for making and publishing changes. Thus, our analysis is rather general and it is tied to real ontology-development efforts. We now briefly describe the three projects.

Figure 1: A screenshot of the Change-Analysis plugin for the NCI Thesaurus. Panel (1) is a class-tree representation of the NCI Thesaurus. The gray numbers in the tree represent the total number of changes in the selected branch of the ontology, while the bold numbers represent the number of changes specifically for the selected term. Panel (2) lists all the changes associated with the selected term.

¹ http://www.neon-project.org

The National Cancer Institute’s Thesaurus (NCI Thesaurus) [13] has over 80,000 classes and has been in development for several years. It is a reference vocabulary covering areas for clinical care, translational and basic research, and cancer biology. A multidisciplinary team of editors works to edit and update the terminology based on their respective areas of expertise, following a well-defined workflow. A lead editor reviews all changes made by the editors. The lead editor accepts or rejects the changes and publishes a new version of the NCI Thesaurus. The NCI Thesaurus is an OWL ontology, which uses many OWL primitives, such as defined classes and restrictions, and which defines domains and ranges for object properties.

The International Classification of Diseases (ICD) revision 11 (ICD-11)³ is the standard diagnostic classification that is used to encode information relevant to epidemiology, health management, and clinical use. Health officials use ICD in all United Nations member countries to compile basic health statistics, to monitor health-related spending, and to inform policy makers. As a result, ICD is an essential resource for health care all over the world. ICD traces its origins to the 19th century and has since been revised at regular intervals. The current in-use version, ICD-10, the 10th revision of the ICD, contains more than 20,000 terms.

The development of ICD-11 represents a major change in the revision process. Previous versions were developed by relatively small groups of experts in face-to-face meetings. ICD-11 is being developed via a Web-based process with many experts contributing to, evaluating, and reviewing the content online. It is also the first version to use OWL as its representational format. Unlike the NCI Thesaurus, the ICD-11 ontology is in the early phases of development.

³ http://www.who.int/classifications/icd/ICDRevision/

The Biomedical Resources Ontology (BRO) originated in the Biositemaps project⁴ developed by the Biositemaps Working Group of the NIH National Centers for Biomedical Computing.⁵ The Biositemaps project is a mechanism for researchers working in biomedicine to publish meta-data about biomedical data, tools, and services. Applications can then aggregate this information for purposes such as semantic search. BRO is the enabling technology, a controlled terminology for describing the resource types, areas of research, and activity of a biomedical related resource. BRO is being developed by a small group of editors, who use a Web-based interface to modify the ontology and to carry out discussions to reach consensus on their modeling choices.

These projects represent important and active development efforts in the biomedical informatics field. All three projects use Collaborative Protégé for ontology development. Collaborative Protégé is a plugin to the open-source ontology and knowledge-base editor Protégé [14]. The plugin uses a client-server model, where authors make contributions via the client and the server stores and manages these contributions. Users can hold discussions, chat, provide notes on ontology components, and make changes. Among the three projects, only the NCI project has a well-defined workflow that is also enforced by a custom editing plugin for Protégé. The ICD-11 and BRO projects do not have any formally defined roles or workflow. In the two latter projects, all users have the ability to modify the ontology in any way.

We used information about two types of collaborative activity in our analysis: (1) change logs provided the information on each change that the users performed, who performed the change, and when; and (2) notes that users added to classes and properties in the ontology, along with the information on who created the note and when. Users add the notes to describe their rationale for changes or to discuss modeling issues pertaining to a particular class to reach consensus.

Collaborative Protégé stores ontology changes and discussions, as well as the metadata associated with the changes and discussions, as instances in the Protégé Change and Annotation Ontology (ChAO) [9]. Change types are ontology classes in ChAO and changes in the domain ontology are instances of these classes. Similarly, notes that users attach to classes or threaded discussions, which the users can have within the Protégé tool, are also stored in ChAO.

⁴ http://biositemaps.ncbcs.org/

For our studies, we used these three domain ontologies and the change data and notes associated with each project over certain time periods. For the NCI Thesaurus, the change data consist of data collected from October 5th, 2009 to April 14th, 2010. During that period, there were a total of 43,702 changes. A total of 10 authors were involved; however, individual contributions ranged from as few as 259 changes to as many as 14,220. The authors of this project do not make use of the discussion feature of Collaborative Protégé, thus no notes were available. The ICD-11 change data consist of changes and notes collected from November 2009 to May 2010. A total of 19 authors created 14,554 changes and 4,768 notes. Like those of the NCI Thesaurus, individual author contributions ranged greatly, from as few as one change to as many as 7,707. Finally, the BRO project includes data collected from February to March 2010. This project is much smaller than the previous two: there are only five authors involved, a total of 762 changes, and 373 notes. Contributions also range greatly, from 17 to 368 changes.

Before describing our analysis, we describe a tool that we developed for assisting in the project management of collaborative ontology development. Such tools are essential for understanding where changes occur in an ontology, who is making the changes, what areas of the ontology are active, and how much impact a change may have. We use this tool in our qualitative analysis to inspect author behavior.

3. THE CHANGE-ANALYSIS PLUGIN

We developed a visual analysis tool to assist with the management of collaborative ontology-development projects. The tool, called the Change-Analysis Plugin, is a plugin for the Protégé editor. It works with Collaborative Protégé to manage ontology changes in any ontology. The plugin provides visual representations of the change and note data associated with the ontology.

The “Concept changes” view (Figure 1) enables users to see all the ontology changes that are stored in ChAO for an ontology. The gray numbers in the tree represent the total number of changes for the selected branch in the ontology. For example, there are 1231 changes in the subtree rooted at “Unit_by_Category” (the top term in the tree). The bold numbers represent the number of changes that have occurred directly at the selected class. As shown in the figure, there were 9 changes for the selected term. A user can apply filters to this view to show changes made by certain authors or over a certain time period.
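The two numbers shown per class in this view can be illustrated with a small sketch. It is not the plugin's implementation: the function name, the toy hierarchy, and the counts are hypothetical, and the hierarchy is assumed to be a tree with a single parent per class.

```python
def change_counts(children, direct_changes, root):
    """Return {class: (direct, subtree_total)} for every class under root.

    children: dict mapping a class name to a list of its subclasses.
    direct_changes: dict mapping a class name to its own change count.
    Assumes the hierarchy is a tree (one parent per class).
    """
    counts = {}

    def visit(node):
        direct = direct_changes.get(node, 0)
        # Subtree total = own changes plus the totals of all subclasses.
        total = direct + sum(visit(child) for child in children.get(node, []))
        counts[node] = (direct, total)
        return total

    visit(root)
    return counts


# Hypothetical toy hierarchy; names and numbers are illustrative only.
children = {"Unit": ["Unit_of_Mass", "Unit_of_Length"], "Unit_of_Mass": ["Gram"]}
direct = {"Unit": 9, "Unit_of_Mass": 2, "Gram": 4, "Unit_of_Length": 1}
counts = change_counts(children, direct, "Unit")
print(counts["Unit"])  # (bold number, gray number) for the root
```

In this toy data the first element of each pair plays the role of the bold number (changes made directly at the class) and the second the gray number (all changes in the branch).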

Another view provides a visual representation of changes over time for each specific author. In this visualization, a stacked-area chart displays the number of change contributions that an author has made over different time intervals (days, weeks, months and years). The area in the chart corresponds to the number of changes in a specific time interval. This view enables project managers to get a quick overview of the level of activity of each member of the team over time.

We also provide a graph-based visualization of the author dependency network. The nodes in the network correspond to the authors. Two nodes are linked (there is a dependency between the authors) if the two authors have modified either the same term or two terms that are related via an ontology relationship (i.e., subclass or property relation). We consider both explicit and implicit dependencies. Two authors have an explicit dependency if they have edited the same term. For example, Figure 2(a) shows that authors 2, 5, 7, and 8 have made changes to some of the same terms, while the other authors have made completely independent changes. The graph defines a social network describing the relationship between authors and potentially overlapping changes.

Two authors have an implicit dependency if they have modified terms that are related in the ontology through a subclass or property relationship. For example, suppose two authors did not make changes to the same term, but one author modified the direct superclass of a term that was modified by the other author. These two authors have an implicit dependency. Users can specify how close the two terms need to be to each other in order for the dependency to exist. For instance, one may create a network where two authors have an implicit dependency only when they modified two terms that are directly linked to each other. Alternatively, one may define a network where authors are linked if they modified two terms that are at most $n$ links apart in the ontology graph. As we include changes made further and further away, the social network becomes more tightly bound (see Figure 2(b)).
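The definitions of explicit and implicit dependencies can be sketched as follows. This is a minimal illustration under stated assumptions rather than the plugin's code: the function names, the `max_links` parameter, the author identifiers, and the toy ontology are all hypothetical, and the ontology is treated as an undirected graph over terms.

```python
from collections import deque
from itertools import combinations


def terms_within(graph, start, max_links):
    """Terms reachable from start in at most max_links hops (breadth-first)."""
    seen = {start: 0}
    queue = deque([start])
    while queue:
        term = queue.popleft()
        if seen[term] == max_links:
            continue
        for neighbor in graph.get(term, ()):
            if neighbor not in seen:
                seen[neighbor] = seen[term] + 1
                queue.append(neighbor)
    return set(seen)


def dependency_edges(edits, graph, max_links=0):
    """Return {(author_a, author_b): "explicit" | "implicit"}.

    edits: dict author -> set of terms the author modified.
    graph: undirected adjacency dict over terms (subclass/property links).
    max_links=0 yields explicit dependencies only.
    """
    edges = {}
    for a, b in combinations(sorted(edits), 2):
        if edits[a] & edits[b]:
            edges[(a, b)] = "explicit"  # both edited the same term
            continue
        near_a = set()
        for term in edits[a]:
            near_a |= terms_within(graph, term, max_links)
        if near_a & edits[b]:
            edges[(a, b)] = "implicit"  # edited terms at most max_links apart
    return edges


# Hypothetical chain: Disease - Cancer - Lung_Cancer.
graph = {
    "Disease": ["Cancer"],
    "Cancer": ["Disease", "Lung_Cancer"],
    "Lung_Cancer": ["Cancer"],
}
edits = {"a1": {"Cancer"}, "a2": {"Cancer"}, "a3": {"Lung_Cancer"}, "a4": {"Disease"}}
explicit_only = dependency_edges(edits, graph, max_links=0)
one_link = dependency_edges(edits, graph, max_links=1)
two_links = dependency_edges(edits, graph, max_links=2)
```

In the toy data, raising `max_links` adds edges without removing any, which mirrors the observation that the network becomes more tightly bound as more distant changes are included.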

In addition to providing the different visualization mechanisms for exploring the change data, the Change-Analysis Plugin provides similar views for exploring the note data. We used the plugin to perform qualitative analysis of the data. By using the different views available in the tool and by systematically exploring the data across all three projects, we were able to make several important observations about possible patterns of collaboration.

4. QUALITATIVE FINDINGS

We explored the ontology changes and notes left by developers in our three data sets (Section 2). During the exploration, we tried to extract patterns of consistent author behavior for each of the ontology projects. We were able to make several observations about users’ behavior.

First, we noticed that in each of the three projects, we could differentiate the authors by where in the class hierarchy they performed their changes. It appeared that certain authors made edits mostly within a single sub-hierarchy, whereas a few others did not have an obvious pattern and their changes appeared throughout different subject areas. We also noted that some authors appeared consistently to make changes at a higher level in the ontology hierarchy, whereas others primarily made changes at the leaf level.

Second, we investigated the overlap between authors and the changes that they make. We noticed that although most changes associated with a particular term were performed by a single author, some terms had changes by multiple authors. Moreover, some authors seemed to participate in more of these overlapping edits. We used the graph-based visualization available in the Change-Analysis Plugin (e.g., Figure 2(a)), as well as the “Concept changes” view (e.g., Figure 1), to explore these relationships. Finally, by analyzing the types of changes committed by authors across the various projects, we observed that some authors primarily made certain types of changes, such as deletions, whereas others primarily made additions or edits.

In the next section, we test the hypothesis that these different patterns of behavior were associated with a given author’s expertise or role in the project. We wished to confirm whether these observational characteristics really did exist, and whether these characteristics could be used to describe different user roles in a collaborative development project.

5. IDENTIFYING USER ROLES

Identifying the roles that different authors play across different projects allows us to better understand the process of collaborative ontology development and the similarities among different projects, and helps us build better tools to support different user behavior. We define an author’s role as a set of expected behaviors. To analyze what roles exist, we first created a feature-vector representation of an author based on the observations that we previously derived from the qualitative exploration of the data (Section 4). Using these vectors, we applied clustering to derive logical groupings for the authors. We then applied statistical analysis to determine what characteristics make each cluster unique.

Figure 2: Explicit and implicit dependency graph for authors in a release of the NCI Thesaurus. (a) Explicit dependency graph. (b) Implicit and explicit dependency graph. Explicit dependencies are represented by solid edges while implicit dependencies are represented by dotted edges. Implicit dependencies are included for changes up to five degrees of separation.

5.1 Representing authors

Let $A$ represent the set of all authors. We represent an author $a \in A$ as a vector $\vec{a} = (C_{del}, C_{add}, C_{mov}, C_{pro}, M, L, D, O, CE)$. The first four features represent the percentage of change contributions of a given type (deletions, additions, moves, and property changes) relative to all change contributions for an author $a$. The other five features are numerical representations of the qualitative characteristics that we described in Section 4, such as the location of the changes. Table 1 provides information about these features.
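As an illustration of the first four features, the sketch below computes the change-type ratios from one author's change log. The change-type labels and the function name are hypothetical; in particular, changes of other types are assumed to count toward the total, so the four ratios need not sum to one.

```python
from collections import Counter

# Hypothetical labels for the four change types used in the feature vector.
CHANGE_TYPES = ("delete", "add", "move", "property")


def change_type_ratios(change_log):
    """Return (C_del, C_add, C_mov, C_pro) for one author's change log.

    change_log: list of change-type labels, one entry per change.
    """
    total = len(change_log)
    if total == 0:
        return (0.0, 0.0, 0.0, 0.0)
    counts = Counter(change_log)
    return tuple(counts[change_type] / total for change_type in CHANGE_TYPES)


# Illustrative log: 6 additions, 2 deletions, 1 move, 1 property change.
log = ["add"] * 6 + ["delete"] * 2 + ["move"] + ["property"]
ratios = change_type_ratios(log)
print(ratios)
```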

We measure the centrality of an author, $CE$, by using the author dependency network that we described in Section 3 (see Figure 2(a) for an example). We first calculate the degree of an author node as defined by the explicit author-dependency network. We combine this value, by averaging, with the author-node degrees calculated based on five different implicit dependency networks. We create each of these networks by considering changes of distance 1, 2, 3, 4, and 5 jumps away from a given changed term. Finally, we normalize this average degree by dividing each value by the largest average. That is, the most highly connected or “central” author across all networks will have a value of one, while a very unconnected, less “central” author will have a value closer to zero.
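A minimal sketch of this normalization follows. It assumes a plain arithmetic mean of the node degrees over all networks, which is one reading of "combine"; the function name and the toy edge sets are hypothetical.

```python
def centrality(authors, networks):
    """Average node degree over several networks, scaled so the maximum is 1.

    authors: iterable of author identifiers.
    networks: list of edge sets; each edge is an (author, author) pair.
    Assumption: the degrees are combined with a plain arithmetic mean.
    """
    average_degree = {}
    for author in authors:
        degrees = [sum(author in edge for edge in edges) for edges in networks]
        average_degree[author] = sum(degrees) / len(networks)
    largest = max(average_degree.values(), default=0)
    if not largest:
        return {author: 0.0 for author in average_degree}
    return {author: value / largest for author, value in average_degree.items()}


# Toy data: one explicit network and a single implicit network (the paper uses five).
explicit = {("a1", "a2")}
implicit_1 = {("a1", "a2"), ("a1", "a3")}
ce = centrality(["a1", "a2", "a3", "a4"], [explicit, implicit_1])
print(ce)
```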

5.2 Repeated K-means

Before analyzing the author feature vectors, we discarded data on authors who made fewer than 10 changes so as not to skew the analysis. We classify these authors as experimental contributors; there were six such authors. They were present only in the ICD-11 project and we believe they are mostly users who were curious about the project, experimented with a few changes, and then never came back.

To analyze the authors’ contributions and to uncover the roles, we applied repeated K-means clustering to divide the authors into different groups based on their similarity. Following the method outlined by Liu and Keselj [4], we applied the K-means method repeatedly for values $k = 2$ up to $k = 8$. We set a maximum of 8 clusters in order to avoid a large number of single-entity clusters. For each $k$, we evaluated the quality of the clusters using metrics for calculating the cluster compactness and separation [7]. After applying repeated K-means and measuring the quality of the clusters, we found $k = 5$ to be the optimal number of clusters.
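The model-selection loop can be sketched in plain Python as below. This is an illustration under assumptions, not the procedure of [4] or the metrics of [7]: the seeding is a deterministic farthest-point heuristic chosen only to make the example reproducible, and the quality score (mean distance to the assigned centroid divided by the smallest distance between centroids, lower is better) is a simple stand-in for compactness and separation. All function names and the toy points are hypothetical.

```python
import math


def farthest_point_seeds(points, k):
    """Deterministic seeding: start at the first point, then repeatedly add
    the point farthest from all seeds chosen so far (an assumed heuristic)."""
    seeds = [points[0]]
    while len(seeds) < k:
        seeds.append(max(points, key=lambda p: min(math.dist(p, s) for s in seeds)))
    return seeds


def kmeans(points, k, max_iter=100):
    """Lloyd's algorithm; returns (centroids, labels)."""
    centroids = farthest_point_seeds(points, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                      for p in points]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an emptied cluster keeps its previous centroid
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return centroids, labels


def quality(points, centroids, labels):
    """Stand-in score: compactness / separation (lower is better)."""
    compactness = sum(math.dist(p, centroids[lab])
                      for p, lab in zip(points, labels)) / len(points)
    separation = min(math.dist(a, b)
                     for i, a in enumerate(centroids) for b in centroids[i + 1:])
    return compactness / separation if separation else math.inf


def repeated_kmeans(points, k_min=2, k_max=8):
    """Run K-means for each k in the range and return (best_k, scores)."""
    scores = {}
    for k in range(k_min, min(k_max, len(points) - 1) + 1):
        scores[k] = quality(points, *kmeans(points, k))
    return min(scores, key=scores.get), scores


# Toy 2-D feature vectors forming three well-separated groups.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (20, 0), (20, 1), (21, 0), (21, 1)]
best_k, scores = repeated_kmeans(points)
print(best_k)
```

On real author vectors one would normally also restart K-means from several random seeds for each value of k and keep the best run.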

5.3 Analysis

Once we established the groupings for the authors from the three projects, we analyzed what characteristics made a cluster unique and tied the particular authors together. To determine these characteristics, we performed multiple Analysis of Variance (ANOVA) calculations to determine the statistically significant features of each cluster. That is, for each of the nine features outlined in Table 1, we performed ANOVA to check for statistical difference in that particular feature when compared across all five clusters.
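For one feature, the one-way ANOVA test statistic can be computed as in the sketch below. The function name and the sample values are hypothetical. Turning the F statistic into a p-value additionally requires the F distribution, for example via `scipy.stats.f_oneway`, which is not used here so that the example stays dependency-free.

```python
def anova_f(groups):
    """One-way ANOVA F statistic for a list of groups of numeric values."""
    n_total = sum(len(group) for group in groups)
    k = len(groups)
    grand_mean = sum(sum(group) for group in groups) / n_total
    means = [sum(group) / len(group) for group in groups]
    # Variation of the group means around the grand mean.
    ss_between = sum(len(group) * (mean - grand_mean) ** 2
                     for group, mean in zip(groups, means))
    # Variation of individual values around their own group mean.
    ss_within = sum((x - mean) ** 2
                    for group, mean in zip(groups, means) for x in group)
    return (ss_between / (k - 1)) / (ss_within / (n_total - k))


# Illustrative values of one feature for three clusters (not the paper's data).
f_stat = anova_f([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
print(f_stat)
```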

5.4 Results

There were statistically significant differences ($p < 0.05$) across all features except the multi-author change feature (see Table 2). Except for relative depth, values in the table closer to 0 indicate a low activity level for a particular feature, whereas values closer to 1 represent a higher level of activity. For relative depth, values above 1 indicate that most edits occur deep in the hierarchy, while values lower than 1 indicate that edits occur closer to the root level.

ANOVA tells us only whether there is a statistically significant difference between some clusters for a given feature; it does not tell us where the difference is. To pinpoint the differences, we applied the Tukey range test, which compares all pairs of means for a given independent variable (i.e., feature in our author data set). Using the results, we determined sets of characteristics that describe each of the clusters as well as cluster labels (see Table 3). We discovered that the five clusters correspond to five clearly discernible roles that users fulfill in the collaborative ontology-development process. These roles are: (1) ontology expert, (2) content manager, (3) domain expert, (4) central domain expert, and (5) content editor. We will now describe how these roles varied between the projects and how the different roles relate to the workflow that each project uses.

5.5 Findings

Our analysis indicates that, despite all three projects having different approaches to collaboration and change management, there are interesting commonalities in terms of author roles. Every cluster consisted of authors from multiple projects, except for the “Central domain expert” role, but no cluster contained authors from all three projects. We believe that this distribution is largely the result of the different approaches that each project takes to managing changes and the different scopes of the ontologies. For instance, the ICD-11 project has many more contributors than either the NCI Thesaurus or BRO. As with the NCI Thesaurus, ICD-11 has specific authors with specific areas of expertise, thus both projects have a large number of authors in the “Domain expert” cluster. However, like BRO, ICD-11 also has authors who are experts in ontology classification and are in charge of organizing the hierarchy.

Table 1: Summary of author features used in author vector representation.

| Symbol | Feature | Explanation |
|---|---|---|
| $C_{del}$ | Deletion | Ratio of deletion changes to all changes committed. |
| $C_{add}$ | Addition | Ratio of terms added to all changes committed. |
| $C_{mov}$ | Move | Ratio of terms moved to all changes committed. |
| $C_{pro}$ | Property change | Ratio of terms where a property has been added/modified to all changes committed. |
| $M$ | Multi-author change | Number of times an author edits a term that is also edited by another author divided by the total number of terms modified by the author. |
| $L$ | Leaf changes | Number of times an author edits a leaf concept term divided by the total number of terms modified by the author. |
| $D$ | Relative depth | The average depth of a term change relative to the average depth of any given concept term in the ontology. |
| $O$ | One hierarchy | The percentage of changes an author makes that are restricted to one level of the ontology hierarchy. |
| $CE$ | Centrality | The average centrality of an author. |

Table 2: Feature comparison of means across all five clusters. Features correspond to those described in Table 1. The cluster columns report means.

| Feature | Cluster 1 | Cluster 2 | Cluster 3 | Cluster 4 | Cluster 5 | p-value |
|---|---|---|---|---|---|---|
| Deletion | 0.068 | 0.436 | 0.0858 | 0.040 | 0.077 | 0.023 |
| Addition | 0.352 | 0.220 | 0.800 | 0.464 | 0.270 | <0.001 |
| Move | 0.256 | 0.010 | 0.018 | 0.189 | 0.005 | <0.001 |
| Property change | 0.307 | 0.242 | 0.238 | 0.265 | 0.757 | 0.001 |
| Multi-author change | 0.528 | 0.469 | 0.127 | 0.051 | 0.328 | 0.060 |
| Leaf changes | 0.443 | 0.818 | 0.783 | 0.628 | 0.687 | 0.001 |
| Relative depth | 0.833 | 1.35 | 1.58 | 1.33 | 0.591 | 0.001 |
| Centrality | 0.889 | 0.448 | 0.537 | 0.753 | 0.973 | <0.001 |
| One hierarchy | 0.470 | 0.970 | 0.936 | 0.855 | 0.543 | <0.001 |

Interestingly, the ICD-11 project does not appear to have a designated “Content editor” role. BRO and the NCI Thesaurus have much smaller, rigidly constructed teams where specific editors monitor changes and modify existing content. In contrast, ICD-11 currently does not have a formal quality-assurance process. ICD-11 is also in the early stages of its development, so most of the effort may currently be more organizational and focused on adding content.

6. COLLABORATION AND CHANGES

In addition to changes, we also analyzed the data on the notes that authors added to ontology elements, either to provide rationale for their changes or to carry out discussions among the authors. In analyzing the notes, we were interested in answering the following research questions:

Q1 : Is there a relationship between changes and author discussions with respect to a specific ontology term?

Q2 : Do people who make a lot of changes also participate in a lot of discussions?

6.1 Analysis

To answer these questions, we used a quantitative analysis approach where we measured the correlation between change and discussion activity. Unfortunately, since the editors of the NCI Thesaurus do not use the notes feature of Collaborative Protégé, we had to exclude this project from our analysis. In all tests of correlation, we used the Pearson correlation coefficient, which measures the linear dependence between two variables or series. It yields a value between +1 and -1, where positive values indicate a positive correlation (i.e., values go up and down together) and negative values indicate a negative correlation (i.e., values go up in one series and down in the other, and vice versa).
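For reference, the standard definition of the Pearson coefficient for two series $x_1, \dots, x_n$ and $y_1, \dots, y_n$ with means $\bar{x}$ and $\bar{y}$ is given below; the symbols are introduced here for illustration and do not appear elsewhere in the paper.

```latex
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
```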

6.2 Results

To answer the first research question regarding the relationship between change and author discussion, we began by computing binary change and note vectors for all terms in the ICD-11 and BRO ontologies. For each term in the ontologies, we computed two binary vectors representing the changes and notes that occurred. A term has a value of 0 in the change vector if no change was ever recorded for that term, and a value of 1 if a change did take place. The note vector was constructed in a similar manner. Given these vectors, we can test whether the presence of change activity is correlated with discussion activity.

In ICD-11 we found that the two activities were positively correlated with a coefficient of 0.841 (p-value < 0.001). Similarly, in BRO the two series were positively correlated with a coefficient of 0.274 (p-value < 0.001).

We also wished to test whether a greater number of changes for a term was related to a greater amount of note activity. To test this hypothesis, we again created change and note vectors for all terms in ICD-11 and BRO. However, this time, instead of assigning values of 0 and 1, a term in the change vector has a value of $N$, where $N$ is the number of changes that occurred involving the particular term. We computed a similar vector for note counts. Given these two vectors for each ontology, we can test whether a greater number of changes is correlated with greater note activity.
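The binary and count-based comparisons can be sketched as follows. The per-term counts are invented for illustration and do not come from ICD-11 or BRO; the function names are hypothetical, and the Pearson coefficient is implemented directly from its standard definition.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric series."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


def activity_vectors(terms, changes, notes, binary):
    """Build per-term change and note vectors.

    changes, notes: dict term -> number of recorded changes / notes.
    binary=True maps any positive count to 1; binary=False keeps the counts.
    """
    def value(count):
        return int(count > 0) if binary else count

    change_vector = [value(changes.get(term, 0)) for term in terms]
    note_vector = [value(notes.get(term, 0)) for term in terms]
    return change_vector, note_vector


# Hypothetical per-term activity.
terms = ["t1", "t2", "t3", "t4", "t5"]
changes = {"t1": 3, "t2": 1, "t4": 2}
notes = {"t1": 2, "t4": 1, "t5": 1}

bin_changes, bin_notes = activity_vectors(terms, changes, notes, binary=True)
cnt_changes, cnt_notes = activity_vectors(terms, changes, notes, binary=False)
r_binary = pearson(bin_changes, bin_notes)
r_counts = pearson(cnt_changes, cnt_notes)
print(r_binary, r_counts)
```

The binary vectors capture only whether a term saw any activity, while the count vectors also reflect how much activity it saw, so the two coefficients can differ for the same data.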

Interestingly, we found that in ICD-11, the number of changes and the number of notes for a term were still highly correlated (coefficient = 0.543, p-value < 0.001), although less so than in the binary comparison. In BRO, the correlation between these two vectors was similar to the binary version (coefficient = 0.258, p-value < 0.001).

Table 3: Summary of roles

| Cluster | Role | Characteristics | Primary activity | Size |
|---|---|---|---|---|
| Cluster 1 | Ontology expert | Highly central author, makes changes over multiple hierarchies, involved in fewer leaf changes than domain experts, but performs a lot of movement changes in the hierarchy. | Organizational | 4 authors: 3 ICD, 1 BRO |
| Cluster 2 | Content manager | Edits mostly in one sub-hierarchy, low centrality, performs few movement changes, but a high number of deletions. | Hierarchy clean-up | 4 authors: 3 ICD, 1 NCI |
| Cluster 3 | Domain expert | Edits mostly deep within one hierarchy, low centrality, few moves, but lots of concept additions. | Content creation | 12 authors: 5 ICD, 7 NCI |
| Cluster 4 | Central domain expert | Edits are restricted primarily to one sub-hierarchy; however, unlike domain experts, these authors are much more central and their changes occur at a higher level in the hierarchy. They also perform more movement operations than domain experts. | Management and content creation of a specific area of the ontology | 2 authors: 2 ICD |
| Cluster 5 | Content editor | Highly central author, makes changes over multiple hierarchies, lots of leaf changes, and a high number of property changes. | Editing of existing content | 6 authors: 2 NCI, 4 BRO |

To address the second research question, we computed the correlation between the number of changes an author makes and the number of notes the author creates. In ICD-11 we found that these two series were highly correlated with a coefficient of 0.953 (p-value < 0.001). However, in BRO, there was no correlation between the number of changes made by an author and the number of notes.

6.3 Findings

As we mentioned, we found a correlation between change and discussion activity. This correlation in itself is perhaps not particularly surprising, but it is surprising how strongly correlated the levels of activity are. In particular, in the ICD-11 project, a term that is changed more often is also likely to be the focus of more discussion. This high correlation between the change and the discussion activity may be related to the open editing process for ICD-11. Since there are many authors contributing to the project, there may be greater social pressure to document their changes as well as to provide feedback about the changes made by an individual. This type of social pressure is well documented within the open-source software community. Since everything a developer produces is freely available to the public, there is social pressure to make the source code as readable, maintainable, and stable as possible [10].

Also interesting is the difference between the ICD-11 and BRO projects in terms of the correlation between authors’ change and note activity. One possible explanation for this difference is that the workflows of the two projects are different. BRO is a much smaller project, involving far fewer authors. There are authors with far more notes than changes. Interestingly, all these authors are classified as “Content editors” in Table 3, a role that we did not find in ICD-11. This low correlation between the change and note activity may be a characteristic of this particular ontology developer role. However, without more data, we cannot test this hypothesis.

7. DISCUSSION

Our analysis of author roles indicated that domain experts (including central ones) edit primarily within a single hierarchy of the ontology, whereas ontology experts and content editors work over multiple areas of the ontology. This difference in behavior is an important finding that has potential implications for tool and workflow design. Ontology-editing tools like Protégé load and display the entire ontology. For a domain expert, the ontology editor most likely contains large amounts of extraneous information that only hinders the content contributions of the user. Perhaps we need a role-based editing workflow that displays only the relevant part of the ontology to a domain expert.
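One way to picture such a role-based view is a simple subtree filter. The following is a minimal sketch, not a description of Protégé or of any existing tool: the function `visible_terms`, the `children` map, and the `expert_branches` assignment are all hypothetical names and data introduced for illustration.

```python
from collections import deque

def visible_terms(children, roots):
    """Return every term reachable from the given roots (roots included)."""
    seen = set()
    queue = deque(roots)
    while queue:
        term = queue.popleft()
        if term in seen:
            continue  # guard against terms with several parents
        seen.add(term)
        queue.extend(children.get(term, ()))
    return seen

# Hypothetical ontology fragment: term -> direct subclasses.
children = {
    "Thing": ["Disease", "Procedure"],
    "Disease": ["Infectious disease", "Neoplasm"],
    "Infectious disease": ["Tuberculosis"],
    "Procedure": ["Surgery"],
}
# Hypothetical role assignment: the branch each domain expert edits.
expert_branches = {"expert_1": ["Infectious disease"]}

view = visible_terms(children, expert_branches["expert_1"])
```

Here `view` contains only the expert's own branch and its descendants. An editor built this way could load and render just those terms while leaving the rest of the hierarchy collapsed.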

It may be useful to analyze each author’s historical change and note activity to help compute that author’s degree of interest (DOI) [2]. A user’s DOI is a predictive measure about the topics in which a user is potentially interested. Tools such as Mylyn⁶ combine DOI with task context to reduce user-interface complexity by displaying only certain parts of a software project’s source code during development. A similar approach could use the change data to filter or highlight important parts of the ontology. This information could also be used to help reduce ontology load times and memory consumption. Furthermore, the change data associated with domain experts also helps distinguish different topic areas of the ontology. These topic areas are a potential starting point for the modularization of an ontology.
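As a rough illustration of how change and note logs might feed such a measure, the sketch below scores each ontology branch by one author's share of weighted activity. This is a simple frequency-based stand-in under our own assumptions, not the DOI function of [2] and not Mylyn's model. The event format, the `note_weight` parameter, and the 0.25 threshold are hypothetical.

```python
from collections import Counter

def degree_of_interest(events, author, note_weight=0.5):
    """Score each ontology branch by one author's share of weighted activity.

    events: iterable of (author, branch, kind), with kind "change" or "note".
    Returns a dict branch -> score in [0, 1]. Scores sum to 1 when the author
    has any weighted activity; otherwise the dict is empty.
    """
    weights = {"change": 1.0, "note": note_weight}
    totals = Counter()
    for who, branch, kind in events:
        if who == author:
            totals[branch] += weights[kind]
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {branch: score / grand_total for branch, score in totals.items()}

# Hypothetical activity log.
events = [
    ("expert_1", "Infectious disease", "change"),
    ("expert_1", "Infectious disease", "change"),
    ("expert_1", "Infectious disease", "note"),
    ("expert_1", "Neoplasm", "note"),
    ("expert_2", "Procedure", "change"),
]
doi = degree_of_interest(events, "expert_1")
# Branches to highlight (or load eagerly) for this author.
focus = [branch for branch, score in doi.items() if score >= 0.25]
```

A tool could use `focus` to decide which branches to highlight or load first. A more faithful DOI model would probably also decay old activity over time, which this sketch leaves out.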

There appears to be some similarity between collaborative ontology development and the open-source software (OSS) and Wikipedia communities. Many researchers have started to analyze the roles of contributors in OSS projects. Nakakoji and colleagues classified members of the OSS community into eight different roles [6], while Xu and colleagues reduced this number to four [15]. The four roles that Xu and colleagues describe include only users that make contributions directly to the software of the project: Project Leaders, Core Developers, Co-developers, and Active Users. The core developer role is somewhat analogous to the domain expert role in ontology development. Both are the main content creators for a project. Project leaders in OSS are the project administrators: they guide the vision and direction of the project. We believe a similar role most likely exists for collaborative ontology-development projects. However, in the three projects that we analyzed, we did not identify this role from the cluster analysis. The lack of an explicit project leader role may be in part due to the project leaders for these ontology development efforts not being active contributors in terms of changes and notes. For example, in the ICD-11 project, there is a revision steering group as well as project managers who help manage and coordinate the project, but do not make active change contributions.

Also in OSS research, Bird et al. examined the relationship between mailing list activity and development activity [1]. Similar to our analysis of change and note activity, their findings indicate that there is a strong relationship between these two types of activities.

Liu and Ram analyzed contributions made by users of Wikipedia, following a similar clustering methodology to the one that we have described here [5]. They discovered six different roles: All-round Editors, Watchdogs, Starters, Content Justifiers, Copy Editors, and Cleaners. There are some similarities between these roles and the roles we uncovered. For example, in ontology development, domain experts are the primary starters or creators of content, while the content managers work to help keep the ontology clean.

8. RELATED WORK

Although research on collaborative ontology development is still very new, there has been some early work that is relevant to our research. The most similar work is an early investigation into the distributed, collaborative ontology engineering process and the capabilities of Collaborative Protégé by Schober and colleagues [11]. The authors carried out a study where they observed users and analyzed the communication and interactions of the users inside and outside the tool while enriching the content of an existing ontology. The study was informal and observational. The authors were primarily interested in exploring the technical aspects of multiple authors using Collaborative Protégé. Similar to our analysis, the authors found large differences in the level of activity and contributions of authors. They also informally observed a general trend that people who chatted a lot also made more changes. Finally, the authors also used informal analysis of the notes to propose different user roles based on a person’s discussion behavior. The roles proposed were “commenter”, “chatter”, and “editor”. Users corresponding to the “editor” role tended to make many comments, often creating tasks for other users. This observation corresponds with our analysis of the “Content editor” role in BRO.

9. CONCLUSION AND FUTURE WORK

Collaborative ontology development is a growing area of research, and both the tools that support this process and our general understanding of how these projects function are in the early stages. Our results indicate that contributors have clearly discernible roles, which we can identify by analyzing users’ editing activities. To the best of our knowledge, this paper presents the first analytical study of the roles and the connection between change and discussion activity that exist within different collaborative ontology development projects. The results indicate a need for the creation of role-based workflows in ontology development environments. Moreover, change data could be used for filtering and highlighting important information in an ontology, and as a potential first step for ontology modularization.

In the future, we hope to have access to more data to help strengthen and generalize the analysis. One limitation of our current analysis is that we had data only for a small set of projects, representing a limited set of users. Another limitation is the fact that all the projects that we studied used the Collaborative Protégé framework and the available tool features may have biased the patterns of activity. However, as we have mentioned in Section 2, Collaborative Protégé does not enforce any particular workflow, and its default setting allows all users to perform any operation in the ontology; therefore, we believe that the biasing is minimal. These three projects are ongoing, and in particular, the ICD-11 project will see substantial growth in terms of the number of authors contributing over the next year. As we collect more data, we plan to investigate the relationship between changes, collaboration, and contribution quality. We will also try to analyze similar data from collaborative editing available in other tools. We believe our current and future analysis will be important contributions for helping inform tool design in the collaborative ontology development space.

10. REFERENCES

[1] C. Bird, D. Pattison, R. D’Souza, V. Filkov, and P. Devanbu. Latent social structure in open source projects. In Proceedings of the 16th ACM SIGSOFT International Symposium on Foundations of Software Engineering, pages 24–35, 2008.

[2] S. K. Card and D. Nation. Degree-of-interest trees: a component of an attention-reactive user interface. In Proceedings of the Working Conference on Advanced Visual Interfaces, pages 231–245, 2002.

[3] M. Krötzsch, D. Vrandecic, and M. Völkel. Semantic MediaWiki. In ISWC, pages 935–942, 2006.

[4] H. Liu and V. Kešelj. Combined mining of Web server logs and web contents for classifying user navigation patterns and predicting users’ future requests. Data Knowl. Eng., 61(2):304–330, 2007.

[5] J. Liu and S. Ram. Who Does What: Collaboration Patterns in the Wikipedia and Their Impact on Data Quality. In Proceedings of 19th Annual Workshop on Information Technologies and Systems, 2009.

[6] K. Nakakoji, Y. Yamamoto, Y. Nishinaka, K. Kishida, and Y. Ye. Evolution patterns of open-source software systems and communities. In Proceedings of the International Workshop on Principles of Software Evolution, pages 76–85, 2002.

[7] K. Niu, S. B. Zhang, and J. L. Chen. An Initializing Cluster Centers Algorithm Based on Pointer Ring. 6th International Conference on Intelligent Systems Design and Applications, 1:655–660, 2006.

[8] N. F. Noy, A. Chugh, and H. Alani. The CKC Challenge: Exploring Tools for Collaborative Knowledge Construction. IEEE Intelligent Systems, 23(1):64–68, 2008.

[9] N. F. Noy, A. Chugh, W. Liu, and M. A. Musen. A Framework for Ontology Evolution in Collaborative Environments. In ISWC, 2006.

[10] E. S. Raymond. The Cathedral & the Bazaar: Musings on Linux and Open Source by an Accidental Revolutionary. O’Reilly Media, 2001.

[11] D. Schober, J. Malone, and R. Stevens. Observations in collaborative ontology editing using Collaborative Protégé. In Workshop on Collaborative Construction, Management and Linking of Structured Knowledge, 2009.

[12] A. Sebastian, N. F. Noy, T. Tudorache, and M. A. Musen. A Generic Ontology For Collaborative Ontology-Development Workflows. In EKAW, 2008.

[13] N. Sioutos, S. d. Coronado, M. W. Haber, F. W. Hartel, W.-L. Shaiu, and L. W. Wright. NCI Thesaurus: A semantic model integrating cancer-related clinical and molecular information. J. of Biomedical Informatics, 40(1):30–43, 2007.

[14] T. Tudorache, N. F. Noy, S. Tu, and M. A. Musen. Supporting Collaborative Ontology Development in Protégé. In ISWC, pages 17–32, 2008.

[15] J. Xu, Y. Gao, S. Christley, and G. Madey. A Topological Analysis of the Open Source Software Development Community. In Hawaii International Conference on System Sciences, Los Alamitos, CA, USA, 2005. IEEE Computer Society.
