Chapter III ~ Description of Methodologies & Related Literature
3.2. Text Mining Methodology
3.2.2. Text Mining Methods: a review
Content Analysis is a systematic and replicable Knowledge Discovery Technique for categorizing many words of text into a few categories according to explicit rules set out from the outset, leading to meaningful patterns and models (Ur-Rahman & Harding 2012; Choudhary et al. 2009; Montabon et al. 2007; Choudhary et al. 2006; Klassen & Whybark 1999; Carley 1993). It builds on the grounded theory approach, which aims to discover the ideas underpinning the main concepts within a focal field of research (Länsisalmi et al. 2004). It also allows the systematic evaluation and deeper understanding of the themes conveyed within written and recorded communication (Kolbe & Burnett 1991), such as a Corporate Sustainability Report. In addition, Carley (1993) outlines that Content Analysis aims to code the content and distribute it into representative predefined categories and relations. Montabon et al. (2007) note that Content Analysis has had limited application within the field of operations management, and agree with Frohlich (2002) on the need for its innovative use in this area. Although defined slightly differently, the approach adopted within this section resembles that of a case study, which is one of the “most powerful research methods” (Voss et al. 2002), especially since it focuses on comprehending, understanding and interpreting (Gephart 2004) the dynamics that exist within multiple scenarios (Eisenhardt 1989).
Data validity and reliability play an integral role in research and could not be overlooked at this stage. As per Yin (1984) and Voss et al. (2002), construct validity requires input from various sources of evidence and is assured at the data collection stage. Internal validity is then established at the data analysis stage, wherein the patterns are matched and explained. External validity is embedded throughout the research design through the implementation of replication logic across the various case studies. Finally, data reliability is addressed at the data collection stage through the protocols integrated therein. However, qualitative research must first be outlined in order to understand the implementation of the adopted and developed methodology.
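The rule-based categorization that defines Content Analysis in this subsection can be illustrated with a minimal sketch. The category names and keyword sets below are hypothetical and are not the coding scheme used in this research; they only show how explicit rules, fixed before any text is read, distribute words into a few predefined categories.

```python
import re

# Hypothetical coding scheme (an assumption for illustration only):
# each predefined category is mapped to keywords fixed before reading the text.
CODING_RULES = {
    "environment": {"emissions", "waste", "recycling", "energy"},
    "social": {"community", "safety", "training", "diversity"},
    "economic": {"cost", "revenue", "profit", "investment"},
}


def code_text(text, rules=CODING_RULES):
    """Count how many words of `text` fall into each predefined category."""
    # Lower-case the text and keep alphabetic runs only, dropping delimiters.
    words = re.findall(r"[a-z]+", text.lower())
    counts = {category: 0 for category in rules}
    for word in words:
        for category, keywords in rules.items():
            if word in keywords:
                counts[category] += 1
    return counts


sample = "We cut emissions and waste while investing in safety training."
print(code_text(sample))
```

Because the rules are explicit, a second coder applying the same scheme to the same text obtains the same counts, which is the sense in which the technique is replicable.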
SCM is a continuing co-determinant that extends beyond technical or journalistic comprehension (New 1997), which makes the choice of methodology in the field more difficult. The trade-offs between data quality and efficiency must also be taken into consideration (Voss et al. 2002). However, since qualitative research passes through stages of induction, discovery and exploration, followed by theory/hypothesis testing (Johnson & Onwuegbuzie 2004), the following approach seemed to be the most appropriate. Stemming from the processes of building theory from case studies, as outlined by Patton & Appelbaum (2003), Eisenhardt (1989) and Yin (1984), the steps of the Text Mining approach are:
a) Case Selection & identification
b) Instrument & protocol crafting
c) Entering the field – Combination of Data Collection & Data Analysis
d) Analysis of data
e) Hypotheses shaping
f) Enfolding literature
g) Theoretical saturation, closure and contribution
These steps are expanded upon in chapter 4, section 4.2, where they form the structure of the subsections and act as the backbone of the Text Mining methodology. An important step in conducting the Text Mining is identifying the software options available for the Content Analysis of the data, so that the most appropriate software, with features aligned with the objectives of this research, can be selected. The options are compared in Table 17.
Table 17 – Comparison of Text Mining & Content Analysis Software

| Software Name | Strengths | Weaknesses |
|---|---|---|
| Atlas.ti (Scientific Software Development GmbH 2016) | Codes are independent entities. Ability to hyperlink within software to simplify coding process. No tool required for inter-coder agreement. Visibility for tree structure. Flat code list. Taxonomical structure for codes & memos. | Limited trial version up to 10 documents with a maximum of 100 quotations and limited data export. |
| Aquad (Huber 2017) | Multiple data types (text, images, videos…etc.). Many retrieval methods. Relationship | Complex GUI. |
| Carrot 2 (Osinski 2016) | Automatically clusters into thematic categories. Accesses many online databases and thesauruses for linguistic thematic analysis. | Limited applicability. |
| Coding Analysis Toolkit (CAT) (Texifter 2007) | Imports Atlas data, based on Qualitative Data Analysis Program (QDAP). Works well with qualitative data collected from interview transcripts, field notes, open ended survey answers. | Uses keystrokes instead of mouse clicks. Better suited for group work. Requires coding and manual |
| KH Coder (Slashdot Media n.d.) | Corpus linguistics. Multi lingual. Collocation statistics & cluster analysis. MySQL, R, Snowball stemmer back-end tools. | Only takes raw texts. Japanese backbone slows down process with clunky & confusing interface. Constantly crashed when dealing with data. |
| Natural Language Toolkit (NLTK) (NLTK Project 2016) | Suite of libraries and programs. Facilitates symbolic & statistical analysis. Natural Language processing. | Operates as an add-on to R. Requires understanding of R programming language. |
| Nvivo (QSR International Pty Ltd 2016) | Provides features to organize accumulated data and analyze them quantitatively. Best utilized for analysis of transcribed interviews, downloaded journal papers, notes, memos and annotations. | Constantly crashes on Windows. Complicated GUI and confusing features that are not intuitive. |
| Open NLP (The Apache Software Foundation 2010) | Maximum Entropy and Perceptron based machine learning. Toolkit for Natural Language text processing. Supports common NLP tasks through segmenting sentences, tagging parts of speech, extracting named entities, tokenization and co-reference resolution. | |
| Pattern (Python Software Foundation 2016) | Web mining module for Python. Accesses online databases and works as an expansion pack to R-program. | Works as an add-on to R program. Operates in Python language. More suited for web research. Requires knowledge of R programming. |
| QDA Miner Lite (Provalis Research 2016) | Works well with coding, annotating, retrieving and analyzing small and large collections of documents and images. | Limited trial version does not permit use of features to full extent and limits number of reports. |
| SAS (SAS Advanced Products Solutions 2016) | Web mining with annotation. Machine learning with GUIs. Tailored software package to client needs. | Expensive software. Limited trial available. |
| TAMS (Weinstein 2012) | Analyser that works on GNU, Linux & Mac OS. Inductive coding approach. | Weak GUI and processing tools. |
Subsequently, the steps and procedures to be adopted in implementing the Text Mining must be outlined and established. As described in the literature (Lee et al. 2015; Genovese et al. 2013; Carbone et al. 2012; Ghadge et al. 2012; Ur-Rahman & Harding 2012; Tate et al. 2010; Choudhary et al. 2009; Hearst 1999) and adopted in previous studies, the process for the Text Mining is shown in Figure 18.
Text Mining commences with the collection of the documents for analysis. These documents are then transformed into a format suitable for the analysis software and loaded into it. The documents are then retrieved and pre-processed through the removal of unwanted text and delimiters. “Stop words” are also removed; these are identified from a “stop list” of words that recur throughout the text without adding any value to its context. As per Carley (1993), this minimizes storage, facilitates auto-coding and simplifies text generation. Finally, the text is truncated to the roots or “stems” of the words using a stemming approach based on the stemming dictionaries built into the Text Mining software, such that words with different endings are grouped under the main stem of the word; e.g. implementation and implementing are grouped under the stem implement. Subsequently, the Text Mining process is initiated such that mining and report generation are carried out simultaneously, facilitating the identification of patterns, trends and associations. The outcomes are then collated and utilised within the MCDA section of this research.
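The pre-processing steps described above (delimiter removal, stop-word removal and stemming) can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the procedure of any package in Table 17: the stop list is a short hypothetical one, and the stemmer is a crude suffix stripper standing in for the stemming dictionaries built into Text Mining software. The final function counts stem frequencies as a simple stand-in for pattern identification.

```python
import re
from collections import Counter

# Hypothetical stop list (assumption); real tools ship much longer ones.
STOP_WORDS = {"the", "of", "and", "is", "are", "a", "an", "in", "to", "for", "with"}

# Suffixes are tried in this order; a crude stand-in for a stemming dictionary.
SUFFIXES = ("ation", "ing", "ed", "s")


def tokenize(text):
    """Lower-case the text and drop punctuation, digits and other delimiters."""
    return re.findall(r"[a-z]+", text.lower())


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Discard tokens that appear in the stop list."""
    return [t for t in tokens if t not in stop_words]


def stem(token):
    """Strip the first matching suffix, keeping a stem of at least three letters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Tokenize, remove stop words, then stem each remaining token."""
    return [stem(t) for t in remove_stop_words(tokenize(text))]


def term_frequencies(documents):
    """Count stems across all documents."""
    counts = Counter()
    for doc in documents:
        counts.update(preprocess(doc))
    return counts


docs = [
    "The implementation of the audit is complete.",
    "Implementing audits for suppliers.",
]
print(preprocess(docs[0]))
print(term_frequencies(docs))
```

In this sketch, implementation and implementing both reduce to implement, so their occurrences are counted together. A suffix stripper this simple mishandles many words (for example, it would reduce process to proces), which is why dedicated tools rely on fuller stemming algorithms or dictionaries.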