Figure 23 – The topic modelling process (Source: Uys et al. 2008)

Topic modelling serves as an extremely useful mechanism for identifying and characterising various concepts embedded within a document collection, enabling a researcher to navigate and understand this corpus in a topic-guided manner (Blei and Lafferty 2007, Uys et al. 2008). For this reason, the topic modelling approach based on LDA was utilised as a mechanism to provide additional insight into the concepts of innovation capability. This insight would come from an additional perspective on the literature – that of the LDA model.

5.3.4.1.2 CAT and its outputs

The Corpus Analysis Toolkit is an LDA-based topic modelling software application developed by Indutech (Pty) Ltd. It is continuously being refined and improved based on the outcomes of its utilisation. At the time when this analysis was performed, the CAT outputs were as follows:

• Topic-word matrix – presents the topics as calculated by the model and a list of the words associated with each of the topics.

• Document-topic matrix – represents a probable allocation of documents to topics in the form of a mixing ratio that is based on the occurrences of words per document.

chance.

The first 2 outputs are based on the LDA model, while the latter 2 are based on separate techniques that were implemented to provide additional understanding of the analysed corpus.
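
As a rough illustration of the two LDA-based outputs, the sketch below represents a toy topic-word matrix and derives a document-topic mixing ratio from per-document counts. All values, the topic names and the `mixing_ratio` helper are hypothetical; this is not CAT's internal implementation, only one simple way to picture the data.

```python
# Toy topic-word matrix: each topic maps words to probabilities (hypothetical values).
topic_word = {
    "topic_0": {"knowledge": 0.5, "network": 0.3, "community": 0.2},
    "topic_1": {"process": 0.6, "product": 0.4},
}


def mixing_ratio(assignment_counts):
    """Normalise per-topic word counts for one document into proportions."""
    total = sum(assignment_counts.values())
    if total == 0:
        return {topic: 0.0 for topic in assignment_counts}
    return {topic: count / total for topic, count in assignment_counts.items()}


# Hypothetical number of word occurrences attributed to each topic in one document.
doc_counts = {"topic_0": 30, "topic_1": 10}

# One row of the document-topic matrix: the proportions sum to 1.
doc_topic_row = mixing_ratio(doc_counts)
print(doc_topic_row)  # {'topic_0': 0.75, 'topic_1': 0.25}
```

Each row of the document-topic matrix can then be read as "this document is roughly 75% topic 0 and 25% topic 1".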

5.3.4.2 Objectives for CAT analysis

The statistical nature of the CAT output provides an objective perspective on the literature that a human reader cannot achieve. Further, this technique is traditionally applied to gain an understanding of a large, previously unstudied corpus. In this case, however, the corpus had already been studied in detail by the author, with Appendix E presenting a summary thereof. The objectives were therefore not centred on understanding, but rather on generating a new and objective perspective on innovation capability. The specific objectives of this analysis were to:

• Identify the core concepts pertaining to innovation capability according to the LDA-based topic modelling process.

• Through the execution of 3 separate CAT runs, specifying 5, 10 and 20 topics respectively to be identified from the corpus, identify hierarchical structure within the topics of innovation capability.

• By relating topics to one another based on their inclusion of similar words, obtain an improved understanding of the interrelations that may exist between the concepts of innovation capability.

• Provide a framework by which to compare and evaluate the content and structure of the ICMM v1.

These objectives align strongly with the objectives of the overall ICMM v1 refinement process, particularly in terms of improving the overall structuring of the model. While, to a certain degree, the outputs of this analysis could assist with ensuring comprehensiveness of content, the identification of interrelations between innovation capability themes (referred to as CAT topics), and of the inherent hierarchy therein, would prove vital in improving the structure of the first maturity model.
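
One simple way to relate topics by their shared words, and to link the topics of a finer run to those of a coarser run, is a set-overlap measure. The sketch below uses Jaccard similarity; this measure, the function names and the toy word lists are assumptions made for illustration and are not taken from CAT or from the analysis itself.

```python
def jaccard(words_a, words_b):
    """Share of words two topics have in common (0 = none, 1 = identical sets)."""
    a, b = set(words_a), set(words_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def closest_parent(child_words, parent_topics):
    """Return the label of the coarse topic that overlaps most with a fine topic."""
    return max(parent_topics, key=lambda label: jaccard(child_words, parent_topics[label]))


# Hypothetical topics from a coarse run (few topics) ...
coarse = {
    "A": ["knowledge", "network", "community", "practice"],
    "B": ["process", "product", "quality", "project"],
}
# ... and one topic from a finer run (more topics).
fine_topic = ["knowledge", "community", "members", "practice"]

print(jaccard(fine_topic, coarse["A"]))    # 0.6
print(closest_parent(fine_topic, coarse))  # A
```

Repeating the `closest_parent` lookup for every topic in the finer run would give a candidate hierarchy that the researcher can then inspect and correct by hand.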

5.3.4.3 Preparation for analysis

The nature of this LDA-based topic modelling process is such that the outputs may be improved by tweaking 2 specific parameters:

• The list of stop words – words deemed to be irrelevant such as “a”, “and” and “like”.

• The number of topics that the model must extract – a pre-specified variable stating the number of topics into which the LDA model must categorise the corpus.

The reason for tweaking the stop words was, firstly, to eliminate the noise words (such as “accesstotech” and “scientificinterpreter”) that occur as a result of erroneous text capturing when exporting to the portable

to the corpus that they portrayed little meaning (such as “journal”, “manufacture”, “business”, “profit”, etc.). The third and final reason was to remove names that may bias the interpretation of the topics because they were attached to certain concepts (such as “ibm”, “ideo”, “seagate”, “christensen”, “chesbrough”, etc.).

These stop words were identified over several CAT runs. After each run, the above-mentioned rules were applied to identify additional stop words, which were then added to the stop word list. By the fifth CAT run, the words representing the topics were deemed sufficiently free of noise, “meaningless” words and names.
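
The stop word refinement described above can be pictured as a growing set that is applied to the tokenised text before each run. The following is a minimal sketch under stated assumptions: the tokeniser, the function names and the sample sentence are hypothetical, and only the example stop words are taken from the text.

```python
import re


def tokenize(text):
    """Lower-case the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())


def remove_stop_words(tokens, stop_words):
    """Drop every token that appears in the stop word set."""
    return [token for token in tokens if token not in stop_words]


# Start from generic stop words, then extend the set after each inspection round.
stop_words = {"a", "and", "like"}
stop_words |= {"accesstotech", "scientificinterpreter"}  # noise from text capture
stop_words |= {"journal", "business"}                    # words portraying little meaning
stop_words |= {"ibm", "christensen"}                     # names that may bias labels

# Hypothetical sample sentence, not taken from the corpus.
text = "Christensen and IBM: a journal like accesstotech on disruptive business innovation"
tokens = remove_stop_words(tokenize(text), stop_words)
print(tokens)  # ['on', 'disruptive', 'innovation']
```

In practice each extension of the set would follow a manual inspection of the topic words produced by the previous run.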

Unlike the stop words, the number of topics was not tweaked iteratively. As mentioned previously, it was decided to perform 3 CAT runs, one for each of 5, 10 and 20 topics, in order to establish whether hierarchy exists within the topics. The 20-topic run was selected as the largest run because it was close to the maximum number of topics that was possible. The nature of the LDA implementation is such that if more topics are requested than are inherent within the corpus, the results tend to degenerate (Blei et al. 2003). When this occurs, the outputs are completely nonsensical: all of the requested topics contain the same words in the same order, depicting only 1 relevant topic. Test runs were performed with 30, 25 and 22 topics, each of which degenerated. At 20 topics, the CAT run no longer degenerated and the results appeared promising.
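
The degeneration check described above (all topics listing the same words in the same order) is easy to automate. The sketch below is illustrative only: `is_degenerate`, `largest_usable_topic_count` and the stand-in `fake_run` are hypothetical names, and `fake_run` merely mimics the reported behaviour rather than running an LDA model.

```python
def is_degenerate(topics):
    """True when every topic lists the same words in the same order."""
    if len(topics) < 2:
        return False
    first = topics[0]
    return all(topic == first for topic in topics[1:])


def largest_usable_topic_count(candidates, run_model):
    """Try candidate topic counts from largest to smallest; return the first that works."""
    for k in sorted(candidates, reverse=True):
        if not is_degenerate(run_model(k)):
            return k
    return None


def fake_run(k):
    """Stand-in for a real topic modelling run (assumption: degenerates above 20 topics)."""
    if k > 20:
        return [["innovation", "management"] for _ in range(k)]
    return [[f"word{i}", "innovation"] for i in range(k)]


print(largest_usable_topic_count([30, 25, 22, 20], fake_run))  # 20
```

With a real model in place of `fake_run`, the same loop would reproduce the manual search over 30, 25, 22 and 20 topics.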

5.3.4.4 Post-run processing

Once the 3 CAT runs had been performed, various analyses were performed to improve the interpretation of, and add value to, the results. The most influential of these analyses will be briefly discussed.

5.3.4.4.1 Topic labelling

This is a basic process whereby the researcher (the author in this case) assigns a label to a topic that is represented as a list of words. These words are listed in descending order of their probability of being representative of the specific topic. This ordering makes it easier for the researcher to identify relevant words, although the words with the highest probability are not necessarily the most relevant.

Consider the words “innovation” and “management”. These words, and a few others, appeared in virtually all of the topics in each of the 3 CAT runs. They therefore had no influence on determining the most relevant topic labels and were ignored in the process.
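
The filtering step described above can be sketched as follows: words that occur in nearly every topic are set aside, and the remaining words are presented by probability as labelling candidates. The 90% threshold, the highest-first ordering, the function names and the toy probabilities are all assumptions of this sketch; in the analysis itself the labelling was done manually by the author.

```python
def ubiquitous_words(topics, threshold=0.9):
    """Words that occur in at least `threshold` (a fraction) of all topics."""
    counts = {}
    for words in topics.values():
        for word in words:
            counts[word] = counts.get(word, 0) + 1
    return {word for word, count in counts.items() if count / len(topics) >= threshold}


def labelling_candidates(topics):
    """Per topic: words ordered by probability (highest first), minus ubiquitous words."""
    ignore = ubiquitous_words(topics)
    return {
        name: [w for w in sorted(words, key=words.get, reverse=True) if w not in ignore]
        for name, words in topics.items()
    }


# Hypothetical topics mapping each word to its probability within the topic.
topics = {
    "t1": {"innovation": 0.30, "knowledge": 0.20, "network": 0.10},
    "t2": {"innovation": 0.25, "quality": 0.15, "tqm": 0.05},
    "t3": {"innovation": 0.20, "ideas": 0.18, "market": 0.12},
}

print(labelling_candidates(topics))
# {'t1': ['knowledge', 'network'], 't2': ['quality', 'tqm'], 't3': ['ideas', 'market']}
```

The researcher would still choose the final label from these candidates, since a high probability does not by itself make a word relevant.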

Table 10 presents an extract from the 20-topic CAT run that illustrates the topic labelling process. Words that were used in the labelling are highlighted in blue, while the post-processing stop words (such as “innovation”) are highlighted in grey. Note that for each of the 20 topics from which this particular extract was made, CAT generated 40 words to describe the topic, again in order of their probability of describing the particular topic.

Innovation

Process Change Knowledge Networks Management and Opportunity Identification

Relative Coverage: 6.5% 6.3% 5.9% 7.6% 9.9%

Relative Dependence: 6.7% 7.0% 5.4% 8.5% 6.7%

Topic words (CAT output):

innovation innovation knowledge innovation innovation

product management innovation process organization

management learning community management process

process strategy network products strategic

organizational product practice product ideas

entirely standards process networks market old paradigm

quality market people processes market

tqm pioneer management projects product

research strategic members organization products

innovation process capabilities important technology external

study products organization project growth

innovations service time innovation process management

adoption innovation management information success organizations

innovation

management loop complexity ideas organizational

relationship disruptive work produce managing opportunities

organization organisation communities factors radical

structure processes process result approach

practices change change successful time
