
Natural Language Processing and Analysis Overview

Natural Language Processing (NLP) is a branch of Artificial Intelligence and Linguistics devoted to making computers understand statements or words written in human languages. It came into existence to ease the user’s work and to satisfy the wish to communicate with computers in natural language. The goal of NLP is to accommodate one or more specialities of an algorithm or system, and the assessment of NLP on an algorithmic system allows for the integration of language understanding and language generation (Jurafsky and Martin, 2008).

To answer the research question, the author performed the NLP analysis using word vectors. These vectors were created by training Google's word2vec algorithm on the interview data. The resulting word vectors were used to calculate the similarity (also referred to as “association”) between specifically selected words in the interviews. The researcher trained the algorithm separately on the interview transcripts of the project managers with LSE and of those without LSE, so that the results for the two groups could be compared.

ANALYSIS of SELECTED WORD PAIRS in the INTERVIEW TRANSCRIPTS


Data Pre-processing

Compared with data in other fields, the data used in Natural Language Processing (NLP) is inherently unstructured and requires extensive pre-processing. As mentioned earlier, two of the many reasons for this are that 1) the data is user-provided and 2) the data is textual in nature. For this analysis, the recorded interviews were transcribed manually into Word documents. Because the transcripts are text written by humans for humans, they required appropriate manipulation before use in NLP (Collobert and Weston, 2008). The following pre-processing steps were performed to prepare the transcribed interview data for NLP analysis.

1. Convert each paragraph in each document into a list of words - This is performed so that the algorithm treats each word as a separate entity and can find relationships among all the words in the data.

The word2vec algorithm requires a very specific type of input. The input text needs to be formatted as lists of words, separated by commas. For instance, the sentence, "I have been a project manager for 10 years," needs to be converted to ["I", "have", "been", "a", "project", "manager", "for", "10", "years"], where [ ] bracket notation indicates a list of elements. Another example: ['chuck', 'already', 'answered', 'previous', 'question'].
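The conversion described above can be sketched in a few lines of Python. This is a minimal illustration, not the program used in the study; the function name `paragraph_to_words` is hypothetical, and the sketch assumes that words are separated by whitespace.

```python
def paragraph_to_words(paragraph):
    """Split a paragraph into a list of word strings."""
    # str.split() with no arguments splits on any run of whitespace
    # and ignores leading and trailing whitespace.
    return paragraph.split()


sentence = "I have been a project manager for 10 years"
words = paragraph_to_words(sentence)
print(words)
# ['I', 'have', 'been', 'a', 'project', 'manager', 'for', '10', 'years']
```

Each paragraph becomes one such list, and the full input to word2vec is then a list of these lists.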

These bracketed word sets are, in essence, the words of a sentence or clause grouped together. Once the data is pre-processed, word2vec learns a vector for each word from the words that appear near it within the same bracketed set. The association of a given word pair is then calculated as the cosine similarity between the two word vectors, which yields a numerical association result.
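Cosine similarity compares two word vectors by the angle between them: it is the dot product of the vectors divided by the product of their lengths. The sketch below shows the calculation in plain Python. The three-dimensional vectors and their names are invented for illustration only; the vectors learned by word2vec are much longer and come from training, not from hand-picked numbers.

```python
import math


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical vectors, chosen by hand purely to illustrate the formula.
vec_project = [1.0, 2.0, 0.0]
vec_manager = [2.0, 4.0, 0.0]  # points in the same direction as vec_project
vec_weather = [0.0, 0.0, 3.0]  # orthogonal to vec_project

same_direction = cosine_similarity(vec_project, vec_manager)
orthogonal = cosine_similarity(vec_project, vec_weather)
print(same_direction)  # approximately 1: strongest possible association
print(orthogonal)      # 0.0: no association
```

Values close to 1 indicate that two words are used in similar contexts, while values close to 0 indicate little relationship.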

2. Remove paragraphs with interview guide or interviewer text. The presence of any non-interviewee language would bias the data.

In order to analyse the data effectively, the textual content sourced from the interviewer and the interviewee must be separated. In providing a comparative study of the language of the interviewees, it is imperative to isolate their language.
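One way to isolate the interviewee language is sketched below. It assumes that every paragraph in a transcript begins with a speaker label such as "Interviewer:" or "Interviewee:". These labels, the sample transcript, and the function name are assumptions made for the example; the source does not state how speakers were marked in the transcripts.

```python
# Hypothetical speaker label; the real transcripts may mark speakers differently.
INTERVIEWEE_PREFIX = "Interviewee:"


def keep_interviewee_text(paragraphs):
    """Keep only paragraphs spoken by the interviewee, without the label."""
    kept = []
    for paragraph in paragraphs:
        text = paragraph.strip()
        if text.startswith(INTERVIEWEE_PREFIX):
            # Drop the label itself so that it is not counted as a word.
            kept.append(text[len(INTERVIEWEE_PREFIX):].strip())
    return kept


# Invented transcript fragment, for illustration only.
transcript = [
    "Interviewer: How long have you managed projects?",
    "Interviewee: I have been a project manager for 10 years.",
    "Interviewer: Thank you.",
]
interviewee_only = keep_interviewee_text(transcript)
print(interviewee_only)
# ['I have been a project manager for 10 years.']
```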


3. Remove special characters - All special characters were removed for the purpose of this analysis. Special characters, such as !, @, #, $, %, etc., can be disruptive to the string splitting process involved in Natural Language Processing analysis.

To perform an NLP analysis, text held in strings must be separated into lists of words, as described earlier. A computer program written in Python can perform this string splitting by relying on assumptions about what delimits a word. By default, Python's string splitting treats whitespace as the boundary between words: wherever a space is found in a string, the text to the left and right of that space is separated into two distinct words. Any other character that marks the end of a word or phrase, such as a punctuation mark, is problematic because it stays attached to the neighbouring word and can cause errors in splitting the words during pre-processing. Special characters may also be indicative of typographical errors. For these reasons, all special characters were removed for the purpose of this analysis.
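The effect of special characters on splitting can be seen in a short sketch. It assumes English text and treats every character other than an ASCII letter, digit, or whitespace as special; the exact character set removed in the study is not stated, so the pattern below is an assumption, as is the function name.

```python
import re


def remove_special_characters(text):
    """Delete every character that is not a letter, digit, or whitespace."""
    # Assumed definition of "special": anything outside A-Z, a-z, 0-9, whitespace.
    return re.sub(r"[^A-Za-z0-9\s]", "", text)


raw = "Is the budget #1 priority @ 100%?!"

# Splitting the raw text leaves special characters attached to the words.
print(raw.split())
# ['Is', 'the', 'budget', '#1', 'priority', '@', '100%?!']

# Removing the special characters first gives clean word tokens.
cleaned_words = remove_special_characters(raw).split()
print(cleaned_words)
# ['Is', 'the', 'budget', '1', 'priority', '100']
```

Deleting characters outright joins hyphenated words (for example, "pre-processing" becomes "preprocessing"); replacing them with a space instead would split such words in two. Either choice is defensible, and this sketch simply picks deletion.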

4. Convert words to lowercase. Python, the language used to call the word2vec algorithm, treats uppercase letters as different from lowercase letters.

In NLP analysis, each distinct string is treated as a separate word, so two occurrences are counted as the same word only if their spellings match exactly, including case. "Project" and "project" are therefore considered two different words by Python. To ensure that all occurrences of a word are represented by the same entry, the words were converted to lowercase for the purposes of the analysis.
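A short Python sketch illustrates why the conversion matters. The word list is invented for the example.

```python
# Python string comparison is case-sensitive.
print("Project" == "project")  # False

# Invented word list, for illustration only.
words = ["Project", "project", "PROJECT", "Manager"]
distinct_before = len(set(words))

# str.lower() maps every uppercase letter to its lowercase form.
lowered = [word.lower() for word in words]
distinct_after = len(set(lowered))

print(lowered)          # ['project', 'project', 'project', 'manager']
print(distinct_before)  # 4
print(distinct_after)   # 2
```

Before conversion the three spellings of "project" count as three different words; afterwards they count as one.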

5. Remove stop-words. It is standard practice to remove words such as "the," "and," and "or" from the text before using it for NLP. In fact, several Python packages come preloaded with lists of stop-words for many languages.

In natural language processing, some words carry more meaning than others. The word "the," for instance, is often unnecessary for gauging the meaning of a sentence, even for a human observer. Therefore, to ensure only the meaningful words are analysed, the standard English stop-words were removed from the text of the transcripts for this analysis.
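Stop-word removal amounts to filtering each word list against a set of words to discard. The sketch below uses a small hand-written set so that it runs without any extra packages; this set is an assumption made for illustration and is much shorter than the standard English stop-word lists shipped with NLP packages.

```python
# Small illustrative stop-word set; NOT the list used in the study.
STOP_WORDS = {"i", "have", "been", "a", "for", "the", "and", "or"}


def remove_stop_words(words):
    """Return the words that are not in the stop-word set, in order."""
    return [word for word in words if word not in STOP_WORDS]


# Lowercased word list from the earlier example sentence.
words = ["i", "have", "been", "a", "project", "manager", "for", "10", "years"]
meaningful = remove_stop_words(words)
print(meaningful)
# ['project', 'manager', '10', 'years']
```

Because the comparison is case-sensitive, this step only works as intended after the words have been converted to lowercase.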

In summary, the texts of the interview transcripts were pre-processed according to the data pre-processing procedures described above. Once the text data was pre-processed, the word2vec algorithm was invoked from the Python program and trained on the bracketed word sets [ ] separately for each group of participants: those deemed to have LSE and those deemed not to have LSE. Each word pair consists of one search word and one comparison word. The word vectors that word2vec learns reflect how words co-occur within the same bracketed word sets, and the similarity (also referred to as “association” in this context) of the search word with the comparison word is calculated as the cosine similarity between their vectors, giving a numerical result for each group. For details, please refer to section 7.5.2 (Similarity metric).
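The per-group comparison can be sketched as follows, assuming the gensim library (version 4 or later) as the word2vec implementation. The source does not name the package or the training settings that were used, so the library choice, the hyperparameters, the function names, and the toy word lists below are all assumptions made for illustration.

```python
def train_and_score(sentences, search_word, comparison_word):
    """Train word2vec on one group's word lists and return the cosine
    similarity of a word pair, or None if either word was never seen."""
    # Imported here so the rest of the sketch loads without gensim installed.
    from gensim.models import Word2Vec

    # Illustrative settings only; not the settings used in the study.
    model = Word2Vec(sentences=sentences, vector_size=100, window=5,
                     min_count=1, workers=1, seed=1)
    if search_word not in model.wv or comparison_word not in model.wv:
        return None
    return float(model.wv.similarity(search_word, comparison_word))


def compare_groups(groups, search_word, comparison_word, scorer=train_and_score):
    """Score the same word pair separately for each group of participants."""
    return {name: scorer(sentences, search_word, comparison_word)
            for name, sentences in groups.items()}


# Invented, already pre-processed word lists for the two groups.
groups = {
    "with_lse": [["project", "risk", "planning"], ["risk", "team", "project"]],
    "without_lse": [["project", "deadline"], ["team", "deadline", "budget"]],
}
```

Calling `compare_groups(groups, "project", "risk")` trains one model per group and returns a dictionary with one similarity value per group (or `None` where a word does not occur in that group's text). With word lists as small as these the learned vectors are not meaningful; the sketch only shows the shape of the procedure.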