Metropolitan Area Identification

(1)

(2)

Outlining the problem

•  The delineation of metropolitan areas in its current form is fairly arbitrary.

•  The current delineation also misses information provided by overlapping communities.

•  We wish to apply network community detection algorithms to data with 100,885 edges and 3,110 nodes.

(3)

Network Basics

•  Networks are used to represent connections between entities.

•  Networks are represented with nodes and edges.

•  Edges can be weighted or unweighted, and directional or bidirectional.

•  Degree: the number of edges connected to a node.

•  Geodesic distance: the fewest number of edges on a path between two nodes.
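
The two definitions above can be illustrated with a minimal sketch. The toy graph and the helper names `degree` and `geodesic_distance` are made up for illustration, and the graph is assumed to be unweighted and bidirectional.

```python
from collections import deque

# Hypothetical toy network as an unweighted, bidirectional adjacency list.
graph = {
    "A": ["B", "C"],
    "B": ["A", "C", "D"],
    "C": ["A", "B"],
    "D": ["B", "E"],
    "E": ["D"],
}

def degree(graph, node):
    """Number of edges connected to `node`."""
    return len(graph[node])

def geodesic_distance(graph, source, target):
    """Fewest number of edges on any path from source to target (BFS).

    Returns None when the two nodes are not connected.
    """
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbor in graph[node]:
            if neighbor == target:
                return dist + 1
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, dist + 1))
    return None

print(degree(graph, "B"))                  # 3
print(geodesic_distance(graph, "A", "E"))  # 3 (A -> B -> D -> E)
```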

(4)

Configuration Model

•  A model for unweighted networks.

•  Describes what a random graph would look like given a particular degree distribution.

•  Derived by dividing each edge into two stubs and looking at the probability that two stubs point at each other.

•  The result is the expected number of connections between two nodes $i$ and $j$ being

$$\frac{d_i d_j}{d_t}$$

•  where

$$d_t = \sum_i d_i$$
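
As a rough sketch of the formula above, the expected number of connections can be tabulated for every pair of nodes. The degree sequence and the function name `expected_connections` are hypothetical, and $d_i$ is taken to be the degree of node $i$.

```python
def expected_connections(degrees):
    """Expected number of edges between each pair (i, j): d_i * d_j / d_t."""
    d_t = sum(degrees)  # total degree, the sum of d_i over all nodes
    n = len(degrees)
    return [[degrees[i] * degrees[j] / d_t for j in range(n)] for i in range(n)]

degrees = [2, 3, 2, 2, 1]  # hypothetical degree sequence, d_t = 10
expected = expected_connections(degrees)
print(expected[0][1])  # 2 * 3 / 10 = 0.6
```

One useful sanity check: each row of the table sums to that node's own degree, so the model preserves the degree sequence in expectation.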

(5)

Community detection

•  Deciding which nodes should be grouped together is a central question in network theory.

•  These groupings are called communities, and their delineation has been the subject of numerous papers.

•  A community is a group of nodes that have a higher density of connections among themselves than with the rest of the network.

(6)

Modularity

•  The most popular way of judging whether a community division is good is modularity.

•  Modularity measures the amount by which the connections among nodes in a community exceed the expected strength given by the configuration model.

•  Mathematically:

$$Q = \frac{1}{d_t} \sum_{ij} \left( A_{ij} - \frac{d_i d_j}{d_t} \right) \delta_{ij}$$
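
The modularity sum can be sketched directly in code. This assumes the usual reading of the symbols: $A_{ij}$ is the adjacency matrix and $\delta_{ij}$ is 1 when nodes $i$ and $j$ share a community and 0 otherwise. The toy graph (two triangles joined by one edge) and the function name `modularity` are illustrative only.

```python
def modularity(adjacency, communities):
    """Q = (1/d_t) * sum_ij (A_ij - d_i*d_j/d_t) * delta_ij.

    `communities[i]` is the community label of node i; delta_ij is 1 when
    nodes i and j carry the same label and 0 otherwise.
    """
    n = len(adjacency)
    degrees = [sum(row) for row in adjacency]
    d_t = sum(degrees)
    q = 0.0
    for i in range(n):
        for j in range(n):
            if communities[i] == communities[j]:
                q += adjacency[i][j] - degrees[i] * degrees[j] / d_t
    return q / d_t

# Hypothetical toy graph: two triangles joined by the single edge (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
adjacency = [[0] * 6 for _ in range(6)]
for i, j in edges:
    adjacency[i][j] = 1
    adjacency[j][i] = 1

split = modularity(adjacency, [0, 0, 0, 1, 1, 1])   # one community per triangle
merged = modularity(adjacency, [0, 0, 0, 0, 0, 0])  # everything in one community
print(split, merged)
```

Splitting the graph into its two triangles scores higher than lumping all nodes together, which scores zero.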

(7)

CCME

•  An extractive approach based on the Continuous Configuration Model.

•  Designed for community detection in weighted networks.

•  The Continuous Configuration Model combines the configuration model with a strength calculation.

•  This yields expressions for the expected weight and for the expected connection.

•  Combined with the probability of connection from the configuration model, this gives the expected value of a connection between nodes i and j in a random graph:

$$S_{ij} = E(w_{ij}) = S_i S_j$$

(8)

CCME Continued

•  The method also defines a kappa to capture the variance of the expected weight of a connection.

•  The method forms communities by comparing how connected a node is to an already established community.

•  It does so by approximating a p-value using the central limit theorem (CLT), with a variance defined by the method.

•  This method does not account for self-loops, which led to a two-stage adjustment for this data:

•  One stage that identifies communities

•  One stage that identifies Statistically Significant Self Commuting
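
The CLT step above can be sketched generically. This is not the CCME implementation: the mean and variance would come from the continuous configuration model, and the function name `clt_p_value`, its parameters, and the numbers used here are placeholders.

```python
import math

def clt_p_value(observed, expected_mean, expected_variance):
    """Upper-tail p-value from a normal (CLT) approximation.

    `observed` stands for the total connection weight between a node and a
    community; the mean and variance are assumed to be supplied by the
    null model.
    """
    z = (observed - expected_mean) / math.sqrt(expected_variance)
    # Survival function of the standard normal: P(Z >= z).
    return 0.5 * math.erfc(z / math.sqrt(2))

# Placeholder numbers: a node two standard deviations above expectation.
p = clt_p_value(observed=14.0, expected_mean=10.0, expected_variance=4.0)
print(p)
```

A small p-value suggests the node is more strongly connected to the community than a random graph would predict.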

(9)

Dampened Modularity

•  An extractive approach designed to mimic modularity results.

•  Exists in 3 stages:

•  Stage 1: Gathering, which consists of maximizing:

•  Stage 2: Ensuring connectivity, done by choosing a threshold and ensuring all nodes are at least that connected to their respective community

•  Stage 3: Filtering, which gets rid of redundant communities

(θx−1ωij ij

(10)

Calculating Dampening Factor

•  The best current method for finding the proper dampening factor is to tune it according to a quality measure.

•  In an unweighted graph with non-overlapping communities, modularity is the reasonable quality measure to use.

•  For these metropolitan areas, the best measure is instead one that accounts for overlapping communities. The measure used in this project was

(11)

Refining Communities

•  Because the gathering stage only requires a strong connection with one node, it does not guarantee a strong connection with the rest of the community.

•  Thus we mimic the method in the OSLOM paper: connectivity with the community is calculated for each node, and the least connected node is removed if it is not above a threshold t.

•  Thus for a community of n nodes the calculation
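
The refinement loop described above might look roughly like the following sketch. The connectivity measure (share of a node's total weight that stays inside the community), the function names, and the toy weights are all assumptions made for illustration; the slides do not specify them.

```python
def connectivity(weights, node, community):
    """Share of `node`'s total edge weight that goes to other community members.

    This particular measure is an assumption, not taken from the source.
    """
    total = sum(weights[node].values())
    if total == 0:
        return 0.0
    inside = sum(w for other, w in weights[node].items()
                 if other in community and other != node)
    return inside / total

def refine(weights, community, t):
    """Repeatedly drop the least connected node while it is not above t."""
    community = set(community)
    while len(community) > 1:
        scores = {node: connectivity(weights, node, community) for node in community}
        weakest = min(scores, key=scores.get)
        if scores[weakest] > t:
            break  # every remaining node is above the threshold
        community.remove(weakest)
    return community

# Hypothetical weighted network: node "d" mostly connects outside the group.
weights = {
    "a": {"b": 5, "c": 4, "d": 1},
    "b": {"a": 5, "c": 6},
    "c": {"a": 4, "b": 6},
    "d": {"a": 1, "e": 9},
    "e": {"d": 9},
}
print(sorted(refine(weights, {"a", "b", "c", "d"}, t=0.5)))  # ['a', 'b', 'c']
```

Connectivity is recomputed after every removal, since dropping one node changes how connected the remaining nodes are to the community.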

(12)

Filtering

•  Several ways of filtering similar communities were attempted, all based on a Jaccard similarity value, which for sets A and B is equal to

$$\frac{|A \cap B|}{|A \cup B|}$$

•  One method involved merging communities together, but this had issues with maintaining connectivity.

•  A second method involved selecting all communities with a certain minimum threshold similarity and choosing the best one according to
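
A minimal sketch of Jaccard-based filtering follows. The greedy selection rule and the use of community size as the stand-in `quality` function are assumptions; the criterion actually used to pick the best community is not given here.

```python
def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two sets."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)

def filter_similar(communities, threshold, quality):
    """Keep one community out of each group of near-duplicates.

    Communities are visited from best to worst `quality`; one is kept only
    if its similarity to every already kept community is below `threshold`.
    """
    kept = []
    for community in sorted(communities, key=quality, reverse=True):
        if all(jaccard(community, other) < threshold for other in kept):
            kept.append(community)
    return kept

# Hypothetical communities; the first two overlap heavily.
communities = [{"a", "b", "c"}, {"a", "b", "c", "d"}, {"x", "y"}]
print(jaccard(communities[0], communities[1]))  # 3 / 4 = 0.75
result = filter_similar(communities, threshold=0.5, quality=len)
print(result)
```

Because this approach only discards whole communities rather than merging them, it sidesteps the connectivity issue noted for the merging method.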

(13)

Other Uses and Modifications

•  The DMA could also be used as a way of collecting candidate nodes for a community, on which another algorithm such as CCME then performs community detection.

•  It can also use the approach laid out in the OSLOM paper to deal with dynamic community detection.

(14)

CCME Results

•  The CCME produced 214 communities with an average size of 11.01 counties and a standard deviation of 10.67 counties

(15)

DMA Results

•  The DMA produced 347 communities with a mean size of 6.06055 counties and a standard deviation of 3.044 counties

(16)

Comparison to MSA

•  The MSA has 919 communities with a mean size of 1.94 counties and a standard deviation of 2.26 counties

(17)

Observations and Potential Adjustments

•  The first major observation is that the CCME produces communities much larger than those of the other two methods.

•  The DMA produces a distribution of community sizes very different from that of the other two methods. This could be the result of it not really detecting communities, or of a bad quality measure.

•  The DMA also at times produced communities that did not contain their seed node, which could mean a bad choice of seed nodes or that those nodes are Statistically Significant

(18)