Summary - 2018_Larsen.pdf

The motivation for implementing the experiments presented in Chapter 2 stemmed primarily from a desire to document and understand the general behavior of random forests when trained on different feature spaces under a variety of classification scenarios induced by different combinations of stochastic block model types and parameter combinations. The workflow for these experiments is summarized in the “synthetic” branch of figure 1.1. Our results are summarized in figures 2.2, 2.3, and 2.4. Unsurprisingly, for all but a few cases in the core-periphery experiments (figure 2.4), the use of network statistics as a feature space results in classifiers of discriminatory power much greater than those trained on the flattened adjacency matrix feature space.

0 25 50 75 100

Random Assortative Disassortative Ordered Core−periphery

Graph Type

Accur

acy (%)

Feature Space Flattened Adjacency Matrices Network Statistics

Figure 2.2: Summary of Classification Accuracies when Varying Number of Communities.

Results from tables 2.7 and 2.13.

Recall that sets of experiments in sections 2.3.2 and 2.5.1 examined random forests’ discriminatory power when classifying between models of the same type (i.e. random, assortative, disassortative, ordered, and core-periphery) with the same numbers of nodes and general parameters in the probability matrix. The only differences between these models was the number of communities. As seen in figure 2.2 not only did the use of network statistics result in substantial gains for all graph types, the relative accuracies between graph types was also approximately the same. In other words, classifiers on the flattened adjacency matrix space that performed better for a given graph type versus a different graph type still performed better for the given graph type when trained on the network statistics space.

The second set of experiments (sections 2.3.3 and 2.5.2) examined the discriminatory power of random forests when graph types are the same with same numbers of communities and nodes, but with different underlying edge probabilities. As noted before, when using network statistics as a feature set, the discriminatory power of random forests substantially increases. In many cases random forests achieved perfect classification accuracy scores. Interestingly, the relative change in accuracy when using network statistics as opposed to flattened adjacency matrices as a feature space does not necessarily remain the same as it did for figure 2.2. For example, consider the results

0 25 50 75 100

dense comparison sparse comparison spectrum

Accur

acy (%)

Flattened Adjacency Space

0 25 50 75 100

dense comparison sparse comparison spectrum

GraphType

Assortative Random Network Statistics Space

0 25 50 75 100

dense comparison L0:L2 sparse comparison spectrum

Accur acy (%) 0 25 50 75 100

dense comparison L0:L2 sparse comparison spectrum

GraphType

Disassortative Ordered

Figure 2.3: Summary of Classification Accuracies when Varying SBM Probability Matrices.

Results from tables 2.8, 2.9, 2.14, & 2.15. The lefthand column summarizes random forest classification accuracies on the flattened adjacency matrix feature space for experiments using all graph types, excluding core-periphery graphs. The righthand column does the same for the network statistics feature space.

for assortative and random graphs shown in the top row of figure 2.3. When trained on flattened adjacency matrices, random forests had better overall accuracy in discriminating different instances of random graphs than when classifying different assortative graphs. However, when using network statistics as a feature space, higher classification rates occur for the assortative graphs. This trend reversal also occurs when comparing disassortative and ordered graphs (second row of figure 2.3).

While the use of network statistics as a feature space lead to substantial gains in discriminatory power, it should be noted that not all gains were equal across graph types and classification scenarios. In general, it appears that the discriminatory power of random forests does depend on graph type and, as already noted by Caceres et al. (2016), the difference between the underlying graph parameters. From these results, we now have a working idea of under which circumstances random forests can best classify graphs generated by SBMs and can explore random forests’ capability as a model selection criterion according to the second workflow outlined in figure 1.1.

0 25 50 75 100 L0 − L2 L0 & L6 L3 − L5 L3 & L7 Accur acy (%) Core−Periphery, Exp. 1 0 25 50 75 100 L3 & L10 L8 & L9 Accur acy (%) Core−Periphery, Exp. 2 0 25 50 75 100 L0, L12, L13 L0, L3, L11 L3, L14, L15 L5, L16 L8, L18 Label Set Accur acy (%) Core−Periphery, Exp. 3

Figure 2.4: Summary of Classification Accuracies when Varying SBM Probability Matrices (core-periphery graphs).Results from tables providing accuracy scores from both feature spaces for the sub-experiments using core-periphery graph types. Red columns correspond to results from classifiers trained on the flattened adjacency matrix feature space, blue columns correspond to those trained on the network statistics feature space.

CHAPTER 3

Model Selection Using Random Forests

This chapter explores the question posed at the beginning of this thesis: givenN candidate stochastic block models for a particular networks data set, is it possible for a random forests classifier trained on a set of instance graphs generated from theseN SBMs to select the best fit to a real- world network? Our goal is to determine whether this well-known and relatively easy to implement machine-learning classification technique can serve as a comparable method for fitting generative models to a given network’s physical and probabilistic structure. We have divided this chapter into two sections. The first provides an overview of the tools and techniques used to determine theN candidate models. This section also describes the real data set used for our experiments. The final section presents our findings in comparing the model selected by random forest to the “gold standard” model selected using the criteria described below.

In document 2018_Larsen.pdf (Page 35-39)