Using random networks as benchmark for diversity estimation

4.4.2.3 Using random networks as benchmark for diversity estimation

success, different networks with different numbers of nodes, edges, density, and degree distributions need to be compared, a process not without difficulties (van Wijk, Stam, & Daffertshofer, 2010). The main problem is differentiating between network properties that might be random, which can lead to spurious relationships with the outcome variable, and network properties that result from fundamental design principles of the observed semantic networks (Maslov, Sneppen, & Zaliznyak, 2004; Squartini & Garlaschelli, 2011; Stouffer, Camacho, Jiang, & Nunes Amaral, 2007). Additionally, in both the news media and social media, the semantic networks are created from different corpora, with different numbers of texts and different length distributions for each set of texts. Such discrepancies suggest that corpus characteristics could affect the observed relationships (despite the number of nodes being identical). Moreover, graph feature controls (such as number of texts, length of texts, or density) are expected to be correlated, at least to some degree, with the structural features of the network. Therefore, controlling for these differences (for example, by adding them as covariates in a regression model) might increase multicollinearity and make the interpretation of the specific coefficients problematic in terms of effect size and direction. Additionally, inserting these variables as covariates into the regression models assumes a linear relationship between corpus-level variables and diversity, an assumption that might not accurately characterize their relationship.

Thus, in order to address these problems and to improve the robustness of the comparative method, I use configuration models to generate random network models with an identical number of nodes and degree distributions for each semantic network to serve as a benchmark (Squartini & Garlaschelli, 2011). This strategy can help identify non-trivial and significant structural features of the semantic networks before examining their consequences and antecedents.

First, I calculated the graph-level statistics for each semantic network, as described earlier, which resulted in an observed diversity score for each network. Then, for each of these networks, I generated 100 random surrogate networks to be used as a benchmark. Each of the randomized networks was created with the configuration model method. For each network, the configuration model algorithm removes all edges but keeps the “stubs” of each edge intact. It then chooses an edge stub randomly and connects it to another random stub. This process is iterated until all stubs are connected. Following this, all edge weights from the original network model are collected. Each rewired edge is then randomly assigned a weight from the observed weight vector until all edges have received a weight. The result is essentially a rewiring of the edges between all nodes and their weight scores, while keeping the number of nodes, density, and even degree and node-strength sequence constant.
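As an illustration only, the stub-matching and weight-reassignment steps described above can be sketched in plain Python. This is a minimal sketch under stated assumptions, not the code used in this study: the function names `configuration_rewire` and `degree_sequence` and the edge-list representation are hypothetical, and the sketch keeps any self-loops or repeated edges that random stub matching may produce, since the text does not specify how these were handled.

```python
import random
from collections import Counter


def configuration_rewire(weighted_edges, seed=None):
    """Return a randomized edge list that keeps every node's degree.

    weighted_edges: list of (u, v, weight) tuples for the observed network.
    Hypothetical helper for illustration; self-loops and multi-edges
    produced by the random matching are kept (an assumption).
    """
    rng = random.Random(seed)

    # Remove all edges but keep their stubs: one stub per edge endpoint.
    stubs = [node for u, v, _ in weighted_edges for node in (u, v)]

    # Shuffling the stubs and pairing neighbours in the shuffled list
    # connects each stub to another stub chosen uniformly at random.
    rng.shuffle(stubs)
    pairs = list(zip(stubs[0::2], stubs[1::2]))

    # Collect the observed weights and hand them out in random order.
    weights = [w for _, _, w in weighted_edges]
    rng.shuffle(weights)

    return [(u, v, w) for (u, v), w in zip(pairs, weights)]


def degree_sequence(edges):
    """Count the stubs attached to each node (a self-loop counts twice)."""
    degrees = Counter()
    for u, v, _ in edges:
        degrees[u] += 1
        degrees[v] += 1
    return degrees
```

Calling `configuration_rewire(edges, seed=k)` for k = 0, ..., 99 would yield an ensemble of 100 surrogates for one observed network. In this sketch the degree sequence and the set of observed weights are preserved by construction; the sketch does not by itself enforce any further constraint.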

This process was repeated 100 times for each network. The mean and standard deviation of each graph-level indicator over the ensemble of random networks were then calculated. Finally, the diversity score for each graph was calculated as the Z-score for thematic diversity: the mean diversity of the random networks was subtracted from the observed network diversity, and the difference was divided by the standard deviation of diversity over the entire ensemble of random networks. The normalized diversity indicator in this method, therefore, represents the difference between diversity in the observed network and the expected graph-level indicators in random networks with identical numbers of nodes and density. I present the results of this process in the next chapter following the general results of the semantic network analysis.
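The normalization step can be written out as a short Python sketch. The function name `normalized_diversity` is hypothetical, and the use of the sample standard deviation (n - 1 in the denominator) is an assumption, since the text does not state which estimator was used.

```python
from statistics import mean, stdev


def normalized_diversity(observed_score, random_scores):
    """Z-score of the observed diversity against the random ensemble.

    observed_score: diversity of the observed semantic network.
    random_scores: diversity scores of its random surrogate networks.
    Uses the sample standard deviation (an assumption, see lead-in).
    """
    ensemble_mean = mean(random_scores)
    ensemble_sd = stdev(random_scores)
    if ensemble_sd == 0:
        # All surrogates have the same diversity, so no Z-score is defined.
        raise ValueError("random ensemble has zero variance")
    return (observed_score - ensemble_mean) / ensemble_sd
```

A negative value means the observed network is less diverse than its random benchmark; a positive value means it is more diverse.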

Interestingly, the randomized networks showed some very consistent features. First, as Figure 26 shows, the spread of the diversity estimations was much larger for the observed semantic networks than for the random networks. This indicates that the diversity of random, incoherent networks has a firmer lower bound. In other words, the diversity of random networks results only from basic network properties, which might not be dictated by the organizational features of the network.

Figure 26: A scatterplot for the diversity estimations for the observed and random-generated networks.

This argument is strengthened by the diversity estimations for the random networks being generally, and fairly consistently, higher than the diversity estimations for the observed networks. This can be seen in Figure 27, which shows the histogram of the gap between the diversity estimation of the observed network and the diversity estimation for the random-generated networks. The prevalence of negative values in this histogram indicates that in most cases (n=295), and aside from the smallest networks (n=35), the thematic diversity of randomly generated textual data was higher than a more coherent text would produce. This is also supported by Figure 28, which plots the difference between the diversity estimations of the observed and random networks against the number of articles from which the data was drawn.

Figure 27: A histogram for the values of (Observed Diversity – Random Diversity).

Figure 28: A scatterplot for the diversity gap between observed and random-generated networks, and the number of articles from which data was drawn.

These findings can be seen as supportive of the suggested method. If more coherent and monothematic campaigns are expected to produce less diverse semantic networks, then a randomly connected network, representative of randomly and incoherently generated text, should show high levels of diversity. I discuss these results in further detail in Chapter 6 of this dissertation. In addition, the models presented in the following chapter were also estimated on a sample that includes only the networks that showed lower diversity than random; these estimates yielded similar and stronger results.

In document Exploring Thematic Diversity In News Coverage And Social Media Activity Of Political Candidates Using Unsupervised Machine Learning (Page 171-176)