4.4.2.2 Constructing the networks for the Twitter data

The construction of the semantic networks from the Twitter data differed from that of the news media data in two important ways. First, instead of the top 1,000 words, only the top 100 TFIDF words were selected to create the semantic network. This difference stems from the lower number of unique words appearing in tweets relative to most other corpora (each document, here a tweet, contains far fewer words than the average article). Second, because tweets often contain a single thematic idea, words that appear together within the same document (tweet) were treated as semantically related, or “co-occurring,” an approach that is considerably more cost-efficient than the moving window approach.

Choosing the tokens. The extraction of relevant tokens was done in a similar fashion to the process used for the news data. First, the data was pre-processed, including a lemmatization of the tokens (using NLTK WordNetLemmatizer; Bird et al., 2009) and the removal of stop-words. As with the news media corpus, tokens were chosen using the TFIDF measure, by selecting the top 100 words in each candidate’s corpus.
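The token selection step can be sketched in a few lines of Python. This is an illustrative sketch only, not the code used in the study: it assumes the tokens are already lemmatized and stop-word filtered, and it assumes that a word's corpus-level score is the sum of its per-document TFIDF values, which is one of several possible aggregation choices. The function name `top_tfidf_terms` and the toy tokens are hypothetical.

```python
import math
from collections import Counter


def top_tfidf_terms(docs, k=100):
    """Return the k terms with the highest summed TF-IDF score.

    docs: list of token lists, assumed already lemmatized and
    stop-word filtered (an assumption of this sketch).
    """
    docs = [doc for doc in docs if doc]  # skip empty documents
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    scores = Counter()
    for doc in docs:
        counts = Counter(doc)
        for term, count in counts.items():
            tf = count / len(doc)                # relative term frequency
            idf = math.log(n_docs / df[term])    # plain (unsmoothed) IDF
            scores[term] += tf * idf             # assumed aggregation: sum
    # Highest score first; ties broken alphabetically for determinism.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]


# Hypothetical toy "tweets", for illustration only.
docs = [
    ["tax", "cut", "job"],
    ["tax", "plan", "job"],
    ["tax", "border", "wall"],
]
top = top_tfidf_terms(docs, k=3)
```

In this toy corpus the word "tax" appears in every document, so its IDF is zero and it is never selected, which is the behavior that makes TFIDF useful for picking distinctive tokens.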

Defining co-occurrence. Because tweets often contain a single thematic idea, one can reasonably argue that two words appearing together within the same tweet are also semantically related and should be considered as “co-occurring.” Thus, unlike the analysis of the news data, and due to the relatively small text size of tweets, words in the Twitter corpus were viewed as related if they appeared in the same document, following a simple bag-of-words approach. Therefore, to analyze the Twitter data, the first step was to construct a term-document matrix. To illustrate, the following three short texts can be used:

Text 1 - Ant Bat Cat Dog
Text 2 - Dog Eel Fox Goat
Text 3 - Goat Hog Ibis Ant

The corresponding term-document matrix for these three texts is:

          A  B  C  D  E  F  G  H  I
Text 1    1  1  1  1  0  0  0  0  0
Text 2    0  0  0  1  1  1  1  0  0
Text 3    1  0  0  0  0  0  1  1  1

Figure 21: The term-document matrix for the three short sentences (“Ant Bat Cat Dog,” “Dog Eel Fox Goat,” “Goat Hog Ibis Ant”).
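The term-document matrix for the three short texts can be reproduced with a minimal Python sketch. The function name `term_document_matrix` is hypothetical, and the sketch assumes binary entries (a word either appears in a text or it does not), which matches the matrix shown in Figure 21.

```python
texts = {
    "Text 1": "Ant Bat Cat Dog",
    "Text 2": "Dog Eel Fox Goat",
    "Text 3": "Goat Hog Ibis Ant",
}


def term_document_matrix(texts):
    """Build a binary term-document matrix (rows: texts, columns: terms)."""
    # Bag-of-words view: each text is reduced to its set of words.
    tokenized = {name: set(text.split()) for name, text in texts.items()}
    # Alphabetically sorted vocabulary defines the column order.
    vocab = sorted(set().union(*tokenized.values()))
    matrix = {
        name: [1 if term in tokens else 0 for term in vocab]
        for name, tokens in tokenized.items()
    }
    return vocab, matrix


vocab, matrix = term_document_matrix(texts)
for name, row in matrix.items():
    print(name, row)
```

Each printed row corresponds to one row of the figure, with the columns ordered alphabetically from Ant to Ibis.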

This matrix can then be used for the construction of the word co-occurrence matrix and the network creation using the following steps.

Normalization of the co-occurrence matrix. A normalization of the term-document matrix was needed to allow for comparison between the matrices and networks drawn for different candidates, and to convert the term-document matrix into a co-occurrence matrix. Normalization using cosine similarity is considered “best practice.” It is also equivalent to the Ochiai coefficient used for the news data and was therefore chosen for this analysis (Zhou & Leydesdorff, 2016).

For each pair of words among the 100 words used to construct the term-document matrix, cosine similarity was calculated as:

$$\mathrm{Cosine}(A,B) = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2}\,\sqrt{\sum_{i=1}^{n} B_i^2}}$$

with A referring to the items in the column related to word 1, B referring to the items in the column related to word 2, and n being the number of documents. In more descriptive terms, for each pair of words, the cosine similarity measure estimates the extent to which these words “share” documents. The more documents two words share, the more related they are. The measure is also normalized by the overall frequency of the words, as more frequently used words are expected to co-appear with other words more often. The resulting matrix is similar to the matrix constructed for the news media, with rows and columns representing a set of 100 unique words, and each cell representing their normalized relatedness.

          A  B  C  D  E  F  G  H  I
Text 1    1  1  1  1  0  0  0  0  0
Text 2    0  0  0  1  1  1  1  0  0
Text 3    1  0  0  0  0  0  1  1  1

Figure 22: The term-document matrix for the three short sentences (“Ant Bat Cat Dog,” “Dog Eel Fox Goat,” “Goat Hog Ibis Ant”).

To illustrate, Figure 22 presents the term-document matrix for the three short texts offered earlier. The transformation of this matrix can be done in the following manner (examples for the calculation shown only for three unique co-occurrences):

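$$\mathrm{Cosine}(A,B) = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2}\,\sqrt{\sum_{i=1}^{n} B_i^2}} = \frac{1 \cdot 1 + 0 \cdot 0 + 1 \cdot 0}{\sqrt{1+0+1}\,\sqrt{1+0+0}} \approx 0.71$$

$$\mathrm{Cosine}(A,D) = \frac{\sum_{i=1}^{n} A_i D_i}{\sqrt{\sum_{i=1}^{n} A_i^2}\,\sqrt{\sum_{i=1}^{n} D_i^2}} = \frac{1 \cdot 1 + 0 \cdot 1 + 1 \cdot 0}{\sqrt{1+0+1}\,\sqrt{1+1+0}} = 0.5$$

$$\mathrm{Cosine}(A,E) = \frac{\sum_{i=1}^{n} A_i E_i}{\sqrt{\sum_{i=1}^{n} A_i^2}\,\sqrt{\sum_{i=1}^{n} E_i^2}} = \frac{1 \cdot 0 + 0 \cdot 1 + 1 \cdot 0}{\sqrt{1+0+1}\,\sqrt{0+1+0}} = 0$$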
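The three worked calculations above can be checked with a short Python sketch that applies the same cosine formula to every pair of columns in the example term-document matrix. The helper name `cosine` and the dictionary layout are choices made for this illustration, not part of the original analysis.

```python
import math

terms = ["A", "B", "C", "D", "E", "F", "G", "H", "I"]
rows = [
    [1, 1, 1, 1, 0, 0, 0, 0, 0],  # Text 1
    [0, 0, 0, 1, 1, 1, 1, 0, 0],  # Text 2
    [1, 0, 0, 0, 0, 0, 1, 1, 1],  # Text 3
]


def cosine(u, v):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# One column vector per word: its presence or absence in each text.
columns = {t: [row[j] for row in rows] for j, t in enumerate(terms)}

# Normalized co-occurrence matrix, stored as a dict keyed by word pair.
similarity = {
    (s, t): cosine(columns[s], columns[t]) for s in terms for t in terms
}

print(round(similarity[("A", "B")], 2))  # 0.71
print(round(similarity[("A", "D")], 2))  # 0.5
print(round(similarity[("A", "E")], 2))  # 0.0
```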

This process results in the following matrix, on the left of Figure 23.

Figure 23: The normalized co-occurrence matrix for the three short sentences (“Ant Bat Cat Dog,” “Dog Eel Fox Goat,” “Goat Hog Ibis Ant”) and the corresponding semantic network.

Following the same process for all available dyads, a matrix can be drawn to represent the relationship between all unique words. This normalized co-occurrence matrix can then be converted to a graph object for further analysis, as exemplified by the network structure on the right of Figure 23. More detailed examples of actual semantic networks drawn from political candidates’ social media activity can be seen in Section 5.2 in the following chapter, focusing on networks that exhibit high and low diversity scores—the calculation of which is explained in the following section.
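As a rough illustration of the conversion to a graph object, the sketch below builds a weighted, undirected adjacency structure from the example columns. It is a minimal sketch under stated assumptions: pairs with zero similarity are assumed to have no edge, and a plain dictionary stands in for the graph object (a graph library such as networkx could hold the same weighted edges). The names `build_graph` and `cosine` are hypothetical.

```python
import math
from itertools import combinations

# Column vectors from the example term-document matrix (Texts 1 to 3).
columns = {
    "A": [1, 0, 1], "B": [1, 0, 0], "C": [1, 0, 0],
    "D": [1, 1, 0], "E": [0, 1, 0], "F": [0, 1, 0],
    "G": [0, 1, 1], "H": [0, 0, 1], "I": [0, 0, 1],
}


def cosine(u, v):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def build_graph(columns):
    """Return {word: {neighbor: weight}} for all pairs with similarity > 0."""
    graph = {term: {} for term in columns}
    for s, t in combinations(columns, 2):
        weight = cosine(columns[s], columns[t])
        if weight > 0:  # assumption: zero similarity means no edge
            graph[s][t] = weight
            graph[t][s] = weight  # undirected: store both directions
    return graph


graph = build_graph(columns)
n_edges = sum(len(neighbors) for neighbors in graph.values()) // 2
print(n_edges, sorted(graph["A"]))
```

In this example, words that appear in two texts (A, D, and G) act as bridges between the three clusters of words.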

In document Exploring Thematic Diversity In News Coverage And Social Media Activity Of Political Candidates Using Unsupervised Machine Learning (Page 164-167)