
CHAPTER 3: RESEARCH METHODS

3.4 Research Question

3.4.1 Data harvesting

The WoS database was used to carry out our analyses. Because Business and Management as a discipline often falls within the purview of the social sciences, the SSCI (Social Sciences Citation Index) database of the WoS was queried for a 30-year period, from 1980 to 2010.

The SSCI database was queried as of January 2011. The total number of documents in the business and management categories was 245 records, which consisted of articles (209 documents), reviews (10 documents), letters (4 documents), corrections (1 document), proceedings papers (10 documents), book reviews (6 documents), editorial materials (3 documents), and notes (1 document). The records were further filtered to include only ‘Articles’, because the study focuses on artifacts that represent prominent new research. This left 209 harvested records, which fell into multiple WoS categories. However, because journals published by the “Academic Journals” publishing house have been disqualified by MyRA (http://www.ippp.um.edu.my/images/ippp/doc/myRA%20II.pdf, Section C, Publications), all articles published by that publisher were removed. This narrowed our record set further to 160 records. Admittedly, the dataset is small, but it fully represents all the peer-reviewed articles indexed in the well-recognized indexing system, the WoS. Additionally, a smaller record set gave us the added advantage of being able to clean the data meticulously by hand. The multiple WoS categories arise because articles that fall under Business and Management subjects might also be tagged with other subject categories. For example, an article might be in the main category of ‘Planning & Development’ but also be tagged in the Business or Management category because it has business or management relevance.
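The two filtering steps described above (keep only articles, then drop a disqualified publisher) can be illustrated with a short sketch. This is not the procedure used in the study, which worked from a WoS export and cleaned records by hand; the field names `doc_type` and `publisher` and the sample records are hypothetical.

```python
def filter_records(records, excluded_publishers):
    """Keep only 'Article' records whose publisher is not excluded."""
    # Step 1: keep only records whose document type is 'Article'.
    articles = [r for r in records if r["doc_type"] == "Article"]
    # Step 2: drop articles from any excluded publisher.
    return [r for r in articles if r["publisher"] not in excluded_publishers]


# Hypothetical sample records, for illustration only.
sample = [
    {"title": "Paper A", "doc_type": "Article", "publisher": "Publisher X"},
    {"title": "Paper B", "doc_type": "Review", "publisher": "Publisher X"},
    {"title": "Paper C", "doc_type": "Article", "publisher": "Academic Journals"},
]

kept = filter_records(sample, {"Academic Journals"})
print([r["title"] for r in kept])
```

Applying the steps in this order mirrors the text: the document-type filter comes first, and the publisher exclusion is applied to what remains.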

The data were saved in delimited form and imported into spreadsheet software (MS Excel). The data were cleaned by checking the bibliographic data provided in the WoS. When a record was uncertain, it was cross-checked against the actual journal abstract or article. Wherever available, the author’s online curriculum vitae (CV) was also checked. This thorough cleaning of the records minimized data redundancy and errors.
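As a rough illustration of working with a delimited export, the sketch below reads tab-delimited text and flags rows with empty fields so they can be cross-checked by hand. It is only an assumption about how such a check might look; the study used a spreadsheet, and the column names `AU`, `TI`, and `PY` and the sample rows here are hypothetical.

```python
import csv
import io

# Hypothetical tab-delimited export with a header row.
raw = "AU\tTI\tPY\nLee, W.\tPaper A\t1999\nSmith, J.\tPaper B\t\n"


def load_records(text):
    """Parse tab-delimited text into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def rows_with_missing_fields(records):
    """Return the indices of records that have at least one empty field."""
    return [
        i
        for i, r in enumerate(records)
        if any(not (v or "").strip() for v in r.values())
    ]


records = load_records(raw)
print(rows_with_missing_fields(records))
```

Flagged rows are candidates for the manual cross-check against the journal abstract or article; the code does not attempt to fill in missing values.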

Author name disambiguation is a difficult task because some authors, at different times in their careers, represent themselves with different name variations. For example, sometimes they write their full name and sometimes they refer to themselves with only initials and a surname. Although indexing databases such as WoS and Scopus are now standardizing author names, analysts still encounter earlier bibliographic data with author name variations that are difficult to disambiguate. Another issue arises with authors whose names are identical to those of other authors. This is difficult to detect, especially when the authors work in the same subject area. Authors also move from one institution to another, or they may represent more than one institution simultaneously. These realities further complicate the problem because they could lead to authors and their publications being related incorrectly. If two authors have the same name but are actually different individuals, failing to identify them separately would combine their publications as if they all belonged to one author. Conversely, if the same author appears under different name variations and is taken to be several different authors, his or her publications would be split across those name variations.
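Both kinds of error described above can be made concrete with a small sketch that groups names by surname and first initial. Names that share a key are only candidates for manual checking: they may be one author writing under several variations, or several authors who happen to share initials. The helper names `name_key` and `candidate_groups` and all author names below are invented for illustration, and this is not the procedure used in the study.

```python
from collections import defaultdict


def name_key(name):
    """Reduce 'Surname, Given Names' to 'surname, first initial'."""
    surname, _, given = name.partition(",")
    given = given.strip()
    initial = given[0].lower() if given else ""
    return f"{surname.strip().lower()}, {initial}"


def candidate_groups(names):
    """Group names sharing a key; keep only groups with several variants."""
    groups = defaultdict(set)
    for n in names:
        groups[name_key(n)].add(n)
    return {k: sorted(v) for k, v in groups.items() if len(v) > 1}


# Invented names: the 'Lee' entries may be one, two, or three people.
names = ["Lee, Wei Ming", "Lee, W.M.", "Lee, Wendy", "Smith, John"]
print(candidate_groups(names))
```

The key is deliberately coarse: it over-groups rather than under-groups, so that no possible match is missed before the manual check.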

For large datasets, the goal becomes reducing errors due to name variations to a minimum, and several algorithms have been written for this purpose. Newman (2001c), who conducted a co-authorship analysis with massive datasets, presented his results without data cleaning, giving upper and lower limits instead. In a microanalysis such as the present study, the aim is to create the most accurate dataset possible. The only practical way to accomplish this is to clean the data manually, checking each record for anomalies.

Prior to 2008, WoS indexed only the initials and surnames of authors, rather than full names. This made it difficult to identify authors and to rule out the possibility that someone with a similar or identical name might be mistaken for someone else. For example, at first look, ‘Ahmad, N.’ and ‘Ahmad, N.’ appear to be identical names for the same person.

However, investigating their full names reveals:

a. Ahmad, N. -- Ahmad, Nursilah
b. Ahmad, N. -- Ahmad, Nobaya

With the authors’ first names, it becomes clear that they are two different authors. Authors with the same name who belong to the same institution, faculty, or department are considered in our dataset to be the same author. Where names were the same but the authors belonged to different institutions, or where names were similar (with slight variance, such as Ramesh, M.Y. and Ramesh, M.) but both belonged to the same department and institution, the records were investigated by checking for full names. However, because WoS indexed only surnames and initials prior to 2008, the actual article abstracts on journal websites were reviewed for articles published before 2008. I also had to check the websites of journals, authors, or institutions to gain further details. The possibility also exists that authors with the same name but representing different institutions are actually the same person. This can be the case for two reasons: first, they might represent more than one institution, and second, they might have moved to another institution and therefore now represent a new entity. Manually checking the bibliographic details can identify such multiple representation of authors. Inter-person identity can also distinguish authors from one another. For this reason, one way an author can be discriminated from others with similar names, especially when dealing with larger datasets, is to identify his or her co-authors (Kang et al., 2009).
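The decision rule in this paragraph, together with the co-author hint, can be sketched as follows. This is a simplified illustration under stated assumptions and does not reproduce any published algorithm: the record fields (`name`, `institution`, `coauthors`), the function `compare_records`, and all sample values are hypothetical. The sketch only compares identical name strings, so similar names such as the Ramesh example always go to manual review.

```python
def compare_records(a, b):
    """Return (decision, shared_coauthors) for two author records."""
    # Shared co-authors are reported as a hint for the manual check.
    shared = sorted(set(a["coauthors"]) & set(b["coauthors"]))
    # Same name and same institution: treated as the same author.
    if a["name"] == b["name"] and a["institution"] == b["institution"]:
        return "same author", shared
    # Anything else is left for manual investigation.
    return "check manually", shared


# Hypothetical records, for illustration only.
rec1 = {"name": "Lee, W.", "institution": "University A",
        "coauthors": ["Tan, K.", "Wong, S."]}
rec2 = {"name": "Lee, W.", "institution": "University B",
        "coauthors": ["Tan, K."]}
rec3 = {"name": "Lee, W.", "institution": "University A",
        "coauthors": []}

print(compare_records(rec1, rec3))
print(compare_records(rec1, rec2))
```

A shared co-author does not settle the question on its own; it simply raises the likelihood that two records at different institutions refer to one person who moved or holds more than one affiliation.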

Apart from author name variations, authors’ institutional representation was the next largest issue to resolve. Prior to 2007, WoS did not explicitly identify each author with his or her institution. Hence, the only way to ascertain an author’s institutional representation was to look at the actual bibliographic details of each paper on the journal’s website and connect each author with an institution. Some authors, mostly university faculty members, moved from one institution to another. For these authors, it was difficult to pinpoint the institution to which they belonged at the time their articles were written, especially when an author’s CV was not online. Many of the authors who had publications prior to 2000 were no longer active, and their email addresses (if any) were invalid. For publications prior to 2000, some publishers did not identify authors with their respective institutions even on the actual papers. In such cases, alternative methods were tried, such as searching for the author’s other publications on the web and in WoS. In most cases, this exercise was successful, and I was able to allocate the correct institution to each author.

In document Structure of research collaboration networks – case studies on Malaysia / Sameer Kumar (Page 89-93)