Chapter 6: Operational definition of ‘disciplines’

6.3. Content-based disciplines

6.3.1. Concept classification

The first approach seeks to group together various concepts or keywords. To do so, three aspects need to be considered.

1. Concepts or keywords need to be extracted.

2. A system needs to be established that measures the proximity or relevance of these concepts to one another.

3. A method is needed to group the concepts.

6.3.1.1. Concept extraction

Many different approaches to automatically extracting concepts or keywords have been proposed. The majority are sophisticated and involved processes (Kaur and Gupta 2010, Parameswaran, Garcia-Molina et al. 2010, Metke-Jimenez and Karimi 2015).

Many of the most commonly used unsupervised algorithms for extracting concepts or keywords are based on TextRank, a co-occurrence graph-based extractive algorithm (Mihalcea and Tarau 2004, Barrios, López et al. 2016, Allahyari, Pouriyeh et al. 2017). TextRank is directly analogous to PageRank. It uses Part-of-Speech (POS) tagging to identify candidate concepts. Where these concepts co-occur in the same sentence, links are established between them, and the resulting graph is used to compute each concept's TextRank score.
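The scoring step described above can be sketched as follows. This is a minimal illustration rather than the implementation used in this research: a short, hypothetical stop-word list stands in for POS tagging, links are formed between words that share a sentence, and the function names and the damping value of 0.85 are assumptions.

```python
import re
from itertools import combinations

# Hypothetical stop-word list; a real pipeline would use POS tagging instead.
STOP_WORDS = {"a", "an", "the", "of", "in", "on", "and", "or", "to",
              "is", "are", "for", "with", "by", "that", "this"}


def candidate_words(sentence):
    """Return lower-cased word tokens that are not stop words."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    return [w for w in tokens if w not in STOP_WORDS]


def textrank(text, damping=0.85, iterations=50):
    """Score words with a PageRank-style iteration on a co-occurrence graph."""
    neighbours = {}
    for sentence in re.split(r"[.!?]+", text):
        words = sorted(set(candidate_words(sentence)))
        for w in words:
            neighbours.setdefault(w, set())
        # Link every pair of candidate words that share a sentence.
        for u, v in combinations(words, 2):
            neighbours[u].add(v)
            neighbours[v].add(u)

    scores = {w: 1.0 for w in neighbours}
    for _ in range(iterations):
        # Each word receives an equal share of every neighbour's score.
        scores = {
            w: (1 - damping) + damping * sum(
                scores[n] / len(neighbours[n]) for n in neighbours[w]
            )
            for w in neighbours
        }
    return scores


scores = textrank("Graph algorithms rank nodes. Graph methods rank keywords. Cats sleep.")
ranked = sorted(scores, key=scores.get, reverse=True)
```

Words that co-occur with many others (here "graph" and "rank") accumulate higher scores than words that appear in a single short sentence.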

The Natural Language ToolKit (NLTK), a platform for building Python programs to process text (Bird, Loper et al., Bird, Klein et al. 2009), provides an implementation of Rapid Automatic Keyword Extraction (RAKE). The method is similar to TextRank but does not require POS analysis. Instead, it establishes candidate concepts as N-grams delimited by ‘stop words’ and punctuation (Rose, Engel et al. 2010). An individual word's score is derived from its co-occurrence degree and its frequency; for higher-order N-grams, the score is simply the sum of its members’ scores (Rose, Engel et al. 2010). It should be noted that neither of these approaches uses more advanced techniques such as synonym detection, spelling correction, or the handling of alternative spellings.
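A minimal sketch of the RAKE scoring idea is given below. It is not the NLTK-based implementation referred to above: the stop-word list is a hypothetical placeholder, the function names are illustrative, and the word score is assumed to be the ratio of co-occurrence degree to frequency, which is one of the metrics described by Rose, Engel et al. (2010).

```python
import re
from collections import defaultdict

# Hypothetical stop-word list; real implementations ship a much longer one.
STOP_WORDS = {"a", "an", "the", "of", "in", "on", "and", "or", "to",
              "is", "are", "for", "with", "by", "that", "this"}


def rake_phrases(text):
    """Split text into candidate phrases at punctuation and stop words."""
    phrases = []
    for fragment in re.split(r"[^\w\s-]+", text.lower()):
        current = []
        for word in fragment.split():
            if word in STOP_WORDS:
                if current:
                    phrases.append(current)
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(current)
    return phrases


def rake_scores(text):
    """Score each candidate phrase as the sum of its words' degree/frequency."""
    phrases = rake_phrases(text)
    frequency = defaultdict(int)
    degree = defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            frequency[word] += 1
            # Degree counts co-occurrences within the phrase, including itself.
            degree[word] += len(phrase)
    word_score = {w: degree[w] / frequency[w] for w in frequency}
    return {" ".join(p): sum(word_score[w] for w in p) for p in phrases}


scores = rake_scores(
    "Keyword extraction is a task. Rapid automatic keyword extraction is fast."
)
```

Because a phrase's score is the sum of its members' scores, longer candidate phrases tend to rank above single words.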

Both approaches were used in this research to explore whether concept-based classification was possible.

6.3.1.2. Proximity

Having established the concepts and their scores in every abstract, it was necessary to group the concepts together into disciplines. A network of concepts was the most straightforward approach. The top three concepts are taken from every abstract and are defined as connected by virtue of appearing in the same abstract. Only the top three are considered because including every noun would produce a network that is far too densely connected and would create false connections between abstracts.
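The construction of the concept network can be sketched as below. The input format (one dictionary of concept scores per abstract), the function names, and the choice to weight a link by the number of abstracts in which both concepts appear are assumptions made for illustration; the toy scores are not data from this research.

```python
from collections import Counter
from itertools import combinations


def top_concepts(scores, k=3):
    """Return the k highest-scoring concepts, breaking ties alphabetically."""
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [concept for concept, _ in ranked[:k]]


def build_concept_network(abstract_scores, k=3):
    """Link the top-k concepts of each abstract; weight = shared abstracts."""
    edges = Counter()
    for scores in abstract_scores:
        for u, v in combinations(sorted(top_concepts(scores, k)), 2):
            edges[(u, v)] += 1
    return edges


# Toy example: two abstracts with made-up concept scores.
abstracts = [
    {"graph theory": 9, "community detection": 7, "modularity": 5, "noise": 1},
    {"modularity": 8, "community detection": 6, "bibliometrics": 4},
]
network = build_concept_network(abstracts)
```

In the toy example, "noise" is excluded because it falls outside the top three concepts of its abstract, while "community detection" and "modularity" are linked with weight 2.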

6.3.1.3. Identify groups

Once the network of concepts had been created, a network community detection algorithm could be used to separate the concepts into groups. Whilst many algorithms would be suitable, the Louvain Modularity algorithm is well suited to dealing with large networks efficiently and effectively (Blondel, Guillaume et al. 2008). The algorithm is designed to maximise the modularity, Q, which measures the density of links inside communities compared to links between communities and is given by the following expression.

\[ Q = \frac{1}{2m} \sum_{ij}^{N} \left[ A_{ij} - \frac{k_i k_j}{2m} \right] \delta(c_i, c_j) \tag{6.1} \]

where Q is the modularity, $A_{ij}$ is the weight of the link between nodes i and j, $k_i$ is the sum of the weights of the links attached to node i, m is the sum of all link weights, and $\delta(c_i, c_j)$ is the Kronecker delta function, which is equal to one if the community of node i, $c_i$, is the same as the community of node j, $c_j$, and zero otherwise.

The Louvain Modularity algorithm consists of two steps that are repeated iteratively, starting from the assumption that there are as many communities as there are nodes. The first step is to calculate, for each node i, the potential modularity gain of placing it in the same community as each of its neighbours j. Once all potential modularity gains are found, node i is placed in the community that would yield the largest modularity increase, provided that the increase is positive. The second step is to construct a new network with the communities as the new nodes. The two steps are repeated until the modularity no longer increases. Finally, there is a tuning parameter that alters the size of the communities.
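To make the modularity expression and the first step concrete, the sketch below computes Q for a given partition and runs a simple local-moving pass. It is an illustration under stated assumptions, not the implementation used in this research: each undirected link is stored once as `(i, j): weight`, self-loops are not handled, Q is recomputed from scratch for every trial move (a practical implementation would use an incremental gain formula), and the second (aggregation) step and the tuning parameter are omitted. All names are hypothetical.

```python
from collections import defaultdict


def modularity(edges, community):
    """Modularity Q for undirected weighted edges and a node -> community map."""
    m = sum(edges.values())
    degree = defaultdict(float)
    internal = defaultdict(float)  # sum of A_ij over ordered pairs inside c
    for (i, j), w in edges.items():
        degree[i] += w
        degree[j] += w
        if community[i] == community[j]:
            internal[community[i]] += 2 * w  # counts (i, j) and (j, i)
    total = defaultdict(float)  # sum of k_i over the nodes in c
    for node, k in degree.items():
        total[community[node]] += k
    return sum(internal[c] / (2 * m) - (total[c] / (2 * m)) ** 2 for c in total)


def local_moving(edges):
    """First Louvain step: move nodes to the neighbouring community with best Q."""
    nodes = sorted({n for edge in edges for n in edge})
    neighbours = defaultdict(set)
    for i, j in edges:
        neighbours[i].add(j)
        neighbours[j].add(i)
    community = {n: n for n in nodes}  # start with one community per node
    improved = True
    while improved:
        improved = False
        for node in nodes:
            best_c, best_q = community[node], modularity(edges, community)
            for nb in sorted(neighbours[node]):
                trial = dict(community)
                trial[node] = community[nb]
                q = modularity(edges, trial)
                if q > best_q + 1e-12:  # accept strict improvements only
                    best_c, best_q = community[nb], q
            if best_c != community[node]:
                community[node] = best_c
                improved = True
    return community


# Toy network: two triangles joined by a single link.
edges = {("a", "b"): 1, ("b", "c"): 1, ("a", "c"): 1,
         ("d", "e"): 1, ("e", "f"): 1, ("d", "f"): 1, ("c", "d"): 1}
partition = local_moving(edges)
q = modularity(edges, partition)
```

On this toy network the pass separates the two triangles into two communities, giving Q = 5/14.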

6.3.1.4. Findings

The resulting network of concepts is shown in Figure 6.2. The node sizes are linearly proportional to their degrees. The node colours represent different communities/partitions, which should define individual fields. 3,368 different disciplines were identified, and the 45 largest partitions comprised only 34.82% of the concepts. The largest discipline contained 0.87% of all concepts. Many communities were individual abstracts, shown at the periphery of the network. Clearly, a coarser resolution (larger communities) is necessary, since in many cases this result suggests that an abstract is its own field.

Figure 6.2. The University of Bath publications' network of concepts, 2000-2017. The concepts are formed from communities of words. These communities are detected using the Louvain algorithm.
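Summary statistics of the kind reported for these partitions (the number of communities, and the share of concepts held by the largest communities) can be derived from a node-to-community mapping. The sketch below assumes such a mapping is available as a dictionary; the function name and the toy partition are illustrative and are not the data behind the figures.

```python
from collections import Counter


def community_shares(community, top_n):
    """Summarise how concepts are distributed across detected communities."""
    sizes = Counter(community.values())
    total = sum(sizes.values())
    largest = sizes.most_common(top_n)
    return {
        "communities": len(sizes),
        "largest_share": largest[0][1] / total,
        "top_n_share": sum(size for _, size in largest) / total,
    }


# Toy partition: ten concepts spread over four communities.
toy_partition = {
    "c0": 0, "c1": 0, "c2": 0, "c3": 0,
    "c4": 1, "c5": 1, "c6": 1,
    "c7": 2, "c8": 2,
    "c9": 3,
}
summary = community_shares(toy_partition, top_n=2)
```

A low share for the largest communities indicates a fragmented partition, whereas a single dominant share indicates that one giant community has formed.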

Figure 6.3 shows the result of the Louvain algorithm (Blondel, Guillaume et al. 2008) tuned with its resolution parameter set to 2. The node sizes are linearly proportional to their degrees. The node colours represent different communities/partitions, which should define individual fields. 2,562 different disciplines were identified: the largest contained 23.67% of all concepts, the second largest only 1.53%, and the third largest 0.98%.

Another problem arises: most communities join a single large cluster. Therefore, most papers would be classified under that cluster.

Figure 6.3. The University of Bath publications' network of concepts, 2000-2017. The concepts are formed from communities of words. These communities are detected using the Louvain algorithm. In comparison to Figure 6.2, this figure shows the communities using a tuning parameter of 2, making the communities larger. However, a giant community forms (yellow).

Furthermore, as this method does not classify abstracts into any identifiable discipline, but instead groups concepts together, it would be extremely difficult to validate. The groups remain abstract and would have to be verified individually, as there is no reference classification to compare them against. For these reasons, this approach is not a suitable candidate for determining disciplines from content. Instead, an approach based on pre-determined disciplines that can be validated is necessary, making Machine Learning classification an ideal approach.

In document Sustaining Interdisciplinary Research: A Multilayer Perspective (Page 97-101)