

Chapter 3: Method

3.3 Forum Text Mining

For the following mining and analysis, we use the titles and message texts stored in the MySQL database mentioned in the previous section. A title combined with its message is defined as one document in the analysis. We analyse text from Vigrid with each line in the text file vigridtvedtnet.txt defined as one complete message or document. We find words that are typical for a forum by finding the words that are used frequently in the collected text from that forum; such a word does not have to be used in every single thread. We also find pairs of words that tend to occur in the same message.

3.3.1 TF and IDF

To find words that are typical for a document, which in this thesis corresponds to a forum message, one finds the term frequency (TF) of a term (word or expression) in the document and multiplies it by the inverse document frequency (IDF). In this thesis a term is always a single word. If a word is used in nearly all forum entries and therefore has a high document frequency, it is given less weight through a low inverse document frequency. We call the product of TF and IDF TF-IDF. IDF is basically the ratio of the total number of documents to the number of documents that include the term (word) we are interested in. When only the order of the TF-IDF values of the words in a forum is important, and not the values themselves, it is common to use the logarithm of this raw ratio [23]. In this project we therefore define IDF as the logarithm of the raw ratio, which appears to be the usual definition in research. See the formula in Figure 4.

Figure 4: Definition of logarithmic IDF
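
The formula of Figure 4 is not reproduced in this text. A form consistent with the description in section 3.3.1 would be the following, where the symbols N (total number of documents) and n_t (number of documents containing term t) are chosen here for illustration, and the base of the logarithm is left open because the text does not state it:

```latex
\mathrm{IDF}(t) = \log\!\left(\frac{N}{n_t}\right)
```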

3.3.2 GTF and NGTF

Since the messages in a forum are often very short, it is more interesting and useful to apply the global term frequency (GTF), a term we have introduced in this project, instead of the term frequency (TF), which only counts words in single messages. GTF is the number of occurrences of a term (word) in all the text we have downloaded from a specific forum. TF values, as defined in traditional TF-IDF, are of little use to us because many of the messages are very short. We are interested in how words are used in a forum as a whole, not in how they are used in each single message.

Analysing each single message would in any case be tedious, since a forum often has thousands of messages. To get a GTF-IDF value, we multiply the GTF of a word by the IDF of the same word. IDF is then defined in the traditional way, as described in section 3.3.1 and shown in Figure 4. IDF is not used as much in the analysis as first intended. We explain why after introducing NGTF.
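
As an illustration of how GTF and IDF relate, the counting can be sketched as follows. This is a minimal Python sketch, not the Java program used in this thesis. The function names, the tokenizer and the sample messages are hypothetical, and the natural logarithm is an assumption, since the base is not fixed in the text.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Hypothetical tokenizer: lowercase runs of Unicode letters.
    return re.findall(r"[^\W\d_]+", text.lower())


def gtf_and_idf(documents):
    """Return (gtf, idf) for a list of documents (one forum message each)."""
    gtf = Counter()  # occurrences of each word in the whole forum
    df = Counter()   # number of documents that contain each word
    for doc in documents:
        tokens = tokenize(doc)
        gtf.update(tokens)
        df.update(set(tokens))  # count each word once per document
    n = len(documents)
    # Assumption: natural logarithm of (total documents / documents with word).
    idf = {word: math.log(n / df[word]) for word in df}
    return gtf, idf


# Made-up sample messages, one document per list entry.
docs = ["odin odin thor", "thor freya", "odin thor"]
gtf, idf = gtf_and_idf(docs)
gtf_idf = {word: gtf[word] * idf[word] for word in gtf}
```

In this made-up sample, "thor" occurs in every document, so its IDF and therefore its GTF-IDF are 0, which is the down-weighting effect described above.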


When comparing word frequencies in two forums of different sizes, we normalize the GTF value to get a value that we call normalized GTF, or NGTF. The NGTF of a word is its GTF divided by the maximum GTF of any non-stop word in the forum, as shown in Figure 5.

Figure 5: Definition of NGTF
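
The formula of Figure 5 is not reproduced in this text. Written out from the verbal definition, and using S as an assumed symbol for the set of stop words, it would read:

```latex
\mathrm{NGTF}(t) = \frac{\mathrm{GTF}(t)}{\max_{w \notin S} \mathrm{GTF}(w)}
```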

As a consequence of the definition, NGTF always has a value in the closed interval [0,1].

Although the theoretical minimum is 0, that value never appears in our result lists because the analysis contains no words that are never used in the forum. We have also excluded all words that are used fewer than ten times in a forum, because they are practically unimportant for the analysis of a forum in which other words are used hundreds or even thousands of times.

We have implemented word counting in Java, so for each forum we get the words with their accompanying GTF and IDF values in a CSV (comma-separated values) or TSV (tab-separated values) file. We use formulas in Microsoft Excel 2010 to compute the NGTF and NGTF-IDF values, because this lets us first manually delete frequent stop words that remain after our Java word count program has removed the stop words on its list. The stop words that remain in the CSV/TSV file are unimportant and uninteresting words that we did not think of in advance. In this way we ensure that NGTF is defined with a non-stop word as the most frequent word in the forum, as shown in Figure 5.

The NGTF-IDF of a word is the product of the NGTF and the IDF of that word. NGTF-IDF is sometimes used for ordering the words in the most-frequent-words tables for single forums in chapter 4 of this thesis. IDF is meant as a factor that gives stop words a lower rank in the NGTF-IDF list. (N)GTF-IDF is not really suited for comparison between forums. NGTF works better, because an NGTF ratio can be read directly as “used more in forum 1 than in forum 2 (if the two forums were of the same size)”.
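
The spreadsheet step can be illustrated with a short Python sketch. It is not the Excel workbook used in this thesis. The function name `ngtf_table`, the sample counts and the sample IDF values are made up, and the stop-word set stands in for the manual deletion of remaining stop words.

```python
def ngtf_table(gtf, idf, stop_words, min_count=10):
    """Return rows (word, GTF, NGTF, NGTF-IDF), sorted by NGTF descending."""
    # Drop stop words and words used fewer than min_count times.
    kept = {w: c for w, c in gtf.items()
            if c >= min_count and w not in stop_words}
    if not kept:
        return []
    # The maximum is taken over non-stop words only.
    max_gtf = max(kept.values())
    rows = []
    for word, count in kept.items():
        ngtf = count / max_gtf
        rows.append((word, count, ngtf, ngtf * idf[word]))
    rows.sort(key=lambda row: row[2], reverse=True)
    return rows


# Made-up values for illustration only.
gtf = {"the": 900, "odin": 400, "thor": 100, "rare": 3}
idf = {"the": 0.01, "odin": 0.5, "thor": 1.2, "rare": 4.0}
rows = ngtf_table(gtf, idf, stop_words={"the"})
```

Because the stop word "the" is removed before the maximum is taken, the most frequent non-stop word "odin" gets NGTF 1, and "rare" is dropped by the ten-occurrence threshold.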

3.3.3 Forum Word Count Comparison

By means of NGTF ratios we look for words that are used much more in one forum than in the forum it is compared to. From the words that are more frequent in one forum than in another, we can discover which topics are discussed more in one forum than in the other. The words are stored together with their NGTF ratios in a CSV/TSV file for each forum, ready for manual analysis.

NGTF values are necessary because two forums often differ in size, so that GTF values are not very suitable for comparison. Two forums are compared by finding all words that occur in both forums and then computing the ratio of the NGTF of each such word in forum 1 to the NGTF of the same word in forum 2, and vice versa. We refer to these ratios as the f1/f2 NGTF ratio and the f2/f1 NGTF ratio, respectively, where f stands for “forum”.

We could instead compare the GTF of a word in one forum with the GTF of the same word in the other forum to find out how many times more often the word is actually used in one forum than in the other. However, if one forum is much larger than the other, the GTF ratios do not fully make sense for comparison. If a specific word is used, for instance, 1000 times in a forum A with 100000 words in total and 1000 times in a forum B with 200000 words in total, the GTF ratio equals 1, but we do not get a correct picture of the situation if we say that the word is used with the same frequency in both forums. This is true for absolute frequency (GTF), but not for relative frequency or for what we have used, normalized frequency (NGTF). Normalization is therefore necessary to simulate that the two forums are of the same size when comparing them. The effect is as if the contents of the smaller forum were repeated until it reaches the size of the bigger one.
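
The comparison can be sketched in Python as follows. This is an illustration under stated assumptions, not the program used in this thesis: the function names and the counts are hypothetical, and the input dictionaries are assumed to contain only non-stop words that passed the frequency threshold.

```python
def ngtf(gtf):
    """Normalize GTF values by the largest GTF in the forum."""
    max_gtf = max(gtf.values())
    return {word: count / max_gtf for word, count in gtf.items()}


def ngtf_ratios(gtf1, gtf2):
    """Return rows (word, f1/f2 ratio, f2/f1 ratio) for shared words."""
    n1, n2 = ngtf(gtf1), ngtf(gtf2)
    shared = n1.keys() & n2.keys()  # words present in both forums
    rows = [(w, n1[w] / n2[w], n2[w] / n1[w]) for w in shared]
    rows.sort(key=lambda row: row[1], reverse=True)
    return rows


# Made-up counts: forum B is larger, so its most frequent word has a higher GTF.
forum_a = {"odin": 2000, "rune": 1000, "viking": 500}
forum_b = {"odin": 4000, "rune": 1000, "viking": 2000}
rows = ngtf_ratios(forum_a, forum_b)
```

In this made-up example "rune" has the same GTF in both forums, so its GTF ratio is 1, while its f1/f2 NGTF ratio is 2 because the word is relatively more prominent in the smaller forum.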

The comparative word analysis also includes finding all words that occur in one of the forums but not in both. These forum-unique words are ranked by GTF, so that we can more easily find the words that best characterize how the forum differs from the one it is compared to. The important and interesting words are normally the frequent ones, and sometimes also the medium-frequent ones. With these words we can find topics that are discussed in one forum and not at all in the other. We may get a few false positives among the forum-unique words, because a word in one forum may be used fewer than 10 times in the other forum and has then been excluded there. If a word is used, for instance, 9 times in one forum and 100 times in another, it is at least used considerably more in the latter forum, so the result is not completely wrong.
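
A small Python sketch can show how such false positives arise. The function name `unique_words` and the counts are hypothetical, and the sketch assumes that the ten-occurrence threshold is applied to each forum before the word lists are compared.

```python
def unique_words(gtf1, gtf2, min_count=10):
    """Words kept in forum 1 but absent from forum 2's kept list, by GTF."""
    kept1 = {w: c for w, c in gtf1.items() if c >= min_count}
    kept2 = {w for w, c in gtf2.items() if c >= min_count}
    only1 = [(w, c) for w, c in kept1.items() if w not in kept2]
    only1.sort(key=lambda row: row[1], reverse=True)  # most frequent first
    return only1


# Made-up counts: "rune" is used 9 times in forum 2, below the threshold.
forum1 = {"odin": 300, "rune": 100, "ship": 40}
forum2 = {"odin": 500, "rune": 9, "axe": 70}
only_in_forum1 = unique_words(forum1, forum2)
```

Here "rune" is reported as unique to forum 1 although forum 2 uses it 9 times, which is the kind of false positive described above.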

Results can be seen in sections 4.1-4.18.

3.3.4 Word Colocation Analysis

The tendency of two specific words to occur together in the same forum message can be measured by a joint odds ratio as defined in Figure 6. X and Y in pXY are binary variables representing whether word X and word Y, respectively, occur in a forum message. The value 0 means that the word is not in a specific message, while 1 means that it is. pXY is then the probability of that combination in a forum; p11, for example, is the probability that both words are in the same message. More precisely, it is the relative frequency of messages in which the combination occurs.

Odds are the ratio of the probability that some event happens to the probability that it does not happen.

In our case, the odds of a word are the frequency of messages with that word divided by the frequency of messages without that word. This can be described briefly as

odds(Y) = p(Y=1) / p(Y=0)

where Y represents the presence of a specific word in a forum message.

The odds ratio is the odds of an event in one group divided by the odds of the same event in another group [24]. In our application of the odds ratio, the two groups are the messages that contain another specific word X and the messages that do not contain word X, respectively. In Figure 6, odds(Y) for the group where word X is in the message (X=1) is to the left of the big slash (the numerator of the joint odds ratio). Odds(Y) for the group where word X is not in the message (X=0) is to the right of the big slash (the denominator of the joint odds ratio).

The ratio of these two odds is in a sense the odds for word X, but more formally and precisely it is the definition of the joint odds ratio for X and Y. This can be simplified to the fraction to the right of the equals sign in Figure 6. The numerator is the product of the probabilities (frequencies) that both words are in the message and that neither word is in the message, respectively. The denominator is the product of the probabilities (frequencies) that only word X is in the message and that only word Y is in the message, respectively.

Figure 6: Joint odds ratio, formula from [24]
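
The formula of Figure 6 is not reproduced in this text. Written out from the description in this section, with the first index of p referring to X and the second to Y, it would take the following form. The symbol OR is an assumed name for the joint odds ratio.

```latex
\mathrm{OR}(X, Y) = \frac{p_{11} / p_{10}}{p_{01} / p_{00}}
                  = \frac{p_{11}\, p_{00}}{p_{10}\, p_{01}}
```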

For the probability values we can use relative, normalized or absolute frequencies as we wish, since they all give the same result: any common factor cancels in the fraction. We therefore use absolute frequencies (GTFs) for simplicity in our Java program for calculating odds ratios. It is important that p10 and p01 in Figure 6 do not equal 0, or the odds ratio becomes infinite [24]. To avoid infinite values, we set pXY, where X ≠ Y, to 1 when its value is really 0. The odds ratios defined this way are not formally correct, but we get a sensible ordering of the odds ratios of the different word pairs when they are compared to each other.
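
The computation with absolute frequencies and the zero replacement can be sketched in Python as follows. This is not the Java program used in this thesis. The function name and the sample messages are hypothetical, and counting messages (rather than word occurrences) for the four cells is an assumption about how the absolute frequencies are obtained.

```python
def joint_odds_ratio(messages, x, y):
    """Joint odds ratio of words x and y over messages given as word sets."""
    n11 = n10 = n01 = n00 = 0
    for words in messages:
        has_x, has_y = x in words, y in words
        if has_x and has_y:
            n11 += 1  # both words in the message
        elif has_x:
            n10 += 1  # only x in the message
        elif has_y:
            n01 += 1  # only y in the message
        else:
            n00 += 1  # neither word in the message
    # Replace a zero count by 1 in the denominator cells to avoid infinity.
    n10 = n10 or 1
    n01 = n01 or 1
    return (n11 * n00) / (n10 * n01)


# Made-up messages, each reduced to its set of words.
messages = [{"odin", "thor"}, {"odin", "thor"}, {"odin"},
            {"thor", "freya"}, {"freya"}, {"freya"}]
ratio = joint_odds_ratio(messages, "odin", "thor")
```

For the made-up messages the four counts are 2, 1, 1 and 2, which gives an odds ratio of 4. Dividing all four counts by the number of messages would leave the result unchanged.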

The odds ratio of each word pair in a forum is computed to find how closely the two words are related in the sense of colocation. Unfortunately, the algorithm runs slowly if we analyse thousands or even tens of thousands of distinct words in a forum. It can take days to finish if we need the odds ratios for all possible pairs of words. Therefore, words that are used only a few times are excluded from the analysis. How many words we had to exclude is stated together with the results in section 4.20.

The ratio of the odds ratios of the same word pair in two different forums can then be computed to compare how much more closely the two words are related in one forum than in the other. This of course requires that the same colocation (co-occurrence) word pair is present in both of the forums being compared.
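
This cross-forum comparison can be sketched in Python under the assumption that the odds ratios of each forum are already available as a dictionary keyed by word pair. The function name, the pair representation (alphabetically ordered tuples) and the values are hypothetical. Skipping pairs whose odds ratio in the second forum is 0 is a choice made for this sketch and is not taken from the thesis.

```python
def odds_ratio_ratios(or1, or2):
    """Return rows (pair, OR in forum 1 / OR in forum 2) for shared pairs."""
    shared = or1.keys() & or2.keys()  # pairs present in both forums
    rows = [(pair, or1[pair] / or2[pair])
            for pair in shared
            if or2[pair] != 0]  # assumption: skip pairs that would divide by 0
    rows.sort(key=lambda row: row[1], reverse=True)
    return rows


# Made-up odds ratios per forum.
forum1_or = {("odin", "thor"): 8.0, ("rune", "stone"): 3.0}
forum2_or = {("odin", "thor"): 2.0, ("sea", "ship"): 5.0}
rows = odds_ratio_ratios(forum1_or, forum2_or)
```

Only the pair that occurs in both forums is compared. In this made-up example the two words are four times more closely related in forum 1 than in forum 2.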