Metropolitan Area Identification
Outlining the problem
¡ The delineation of metropolitan areas in its current form is fairly arbitrary
¡ In its current form it also misses the information provided by overlapping communities
¡ We wish to apply network community detection algorithms to data with 100,885 edges and 3,110 nodes
Network Basics
¡ Networks are used to represent connections
between entities
¡ Networks are represented with nodes and edges
¡ The edges can be either weighted or un-weighted, and either directed or undirected
¡ Degree: the number of edges connected to a node
¡ Geodesic distance: the fewest number of edges on a path between two nodes
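As a small illustration of the two definitions above, the following sketch builds a toy un-weighted, undirected network and computes a degree and a geodesic distance with a breadth-first search. The edge list and the helper names `degree` and `geodesic_distance` are made up for this example and are not part of the project data or code.

```python
from collections import deque

# Toy undirected, un-weighted network (hypothetical data).
edges = [("A", "B"), ("B", "C"), ("C", "D"), ("A", "C")]

# Adjacency list: node -> set of neighboring nodes.
adjacency = {}
for u, v in edges:
    adjacency.setdefault(u, set()).add(v)
    adjacency.setdefault(v, set()).add(u)


def degree(node):
    """Number of edges connected to `node`."""
    return len(adjacency[node])


def geodesic_distance(source, target):
    """Fewest number of edges on any path from source to target (BFS)."""
    distances = {source: 0}
    queue = deque([source])
    while queue:
        current = queue.popleft()
        if current == target:
            return distances[current]
        for neighbor in adjacency[current]:
            if neighbor not in distances:
                distances[neighbor] = distances[current] + 1
                queue.append(neighbor)
    return None  # target cannot be reached from source
```

Here `degree("C")` counts the edges to A, B and D, and `geodesic_distance("A", "D")` follows the path A, C, D.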
Configuration Model
¡ A model for un-weighted networks
¡ Describes what a random graph would look like with a particular degree distribution
¡ Calculated by dividing each edge into stubs and looking at the probability that two stubs would point at each other
¡ The result is the expected number of connections between two nodes $i$ and $j$ being
$$\frac{d_i d_j}{d_t}$$
¡ where $d_i$ is the degree of node $i$ and
$$d_t = \sum_i d_i$$
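A minimal sketch of this calculation, assuming the degrees are already known: it returns the expected number of connections $d_i d_j / d_t$ for every pair of nodes. The function name `expected_connections` and the toy degree values are illustrative only.

```python
def expected_connections(degrees):
    """Expected number of connections d_i * d_j / d_t for every node pair.

    `degrees` maps each node to its degree; d_t is the sum of all degrees.
    """
    d_t = sum(degrees.values())
    return {
        i: {j: degrees[i] * degrees[j] / d_t for j in degrees}
        for i in degrees
    }


# Hypothetical degree sequence for a four-node network.
toy_degrees = {"A": 2, "B": 2, "C": 3, "D": 1}
toy_expected = expected_connections(toy_degrees)
```

With these toy degrees $d_t = 8$, so the expected number of connections between A and C is $2 \cdot 3 / 8 = 0.75$.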
Community detection
¡ Deciding which nodes should be grouped together is a central question in network theory
¡ These groupings are called communities, and their delineation has been the subject of numerous papers
¡ A community is a group of nodes that have a denser set of connections with one another than with the rest of the network
Modularity
¡ The most popular way of deciding on a good community division is through the use of modularity
¡ Modularity measures the amount by which the connection strength between nodes in a community exceeds the expected strength given by the configuration model
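¡ Mathematically:
$$Q = \frac{1}{d_t} \sum_{ij} \left( A_{ij} - \frac{d_i d_j}{d_t} \right) \delta_{ij}$$
where $\delta_{ij}$ is 1 when nodes $i$ and $j$ are in the same community and 0 otherwise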
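The modularity score can be sketched directly from its definition: sum, over pairs of nodes in the same community, the difference between the observed adjacency entry and the configuration model expectation, then divide by the total degree. This is a plain illustrative implementation for a small un-weighted, undirected adjacency matrix; the function name `modularity` and the toy network are assumptions made for the example.

```python
def modularity(adjacency_matrix, communities):
    """Q = (1/d_t) * sum_ij (A_ij - d_i*d_j/d_t) * delta_ij.

    `adjacency_matrix` is a symmetric list of lists of 0/1 entries.
    `communities[i]` is the community label of node i; delta_ij is 1
    exactly when communities[i] == communities[j].
    """
    n = len(adjacency_matrix)
    degrees = [sum(row) for row in adjacency_matrix]
    d_t = sum(degrees)
    q = 0.0
    for i in range(n):
        for j in range(n):
            if communities[i] == communities[j]:
                q += adjacency_matrix[i][j] - degrees[i] * degrees[j] / d_t
    return q / d_t


# Hypothetical network: two triangles (0,1,2) and (3,4,5) joined by edge 2-3.
two_triangles = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [0, 0, 1, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
]
q_split = modularity(two_triangles, [0, 0, 0, 1, 1, 1])
q_single = modularity(two_triangles, [0, 0, 0, 0, 0, 0])
```

Splitting the toy network into its two triangles gives $Q = 5/14$, while placing every node in one community gives $Q = 0$, so the split is the better division under this measure.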
CCME
¡ An extractive approach based on the Continuous
Configuration Model
¡ Designed for community detection in weighted networks
¡ The Continuous Configuration Model combines the configuration model with a strength calculation
¡ This results in an expected weight and an expected connection for each pair of nodes
¡ When combined with the probability of connection from the configuration model, this gives the expected value of a connection between nodes $i$ and $j$ in a random graph:
$$S_{ij} = E(w_{ij}) = \frac{S_i S_j}{S_t}$$
where $S_i$ is the strength of node $i$ and $S_t = \sum_i S_i$, by analogy with $d_t$
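A rough sketch of the expected weight calculation, under the assumption that the product of strengths is normalized by the total strength $S_t = \sum_i S_i$ in the same way that $d_i d_j$ is normalized by $d_t$. The helper names and the toy weights are hypothetical, and this is not the CCME reference implementation.

```python
def node_strengths(weights):
    """Strength of each node: the sum of the weights of its edges.

    `weights` maps an undirected edge (i, j) to its weight.
    """
    strengths = {}
    for (i, j), w in weights.items():
        strengths[i] = strengths.get(i, 0.0) + w
        strengths[j] = strengths.get(j, 0.0) + w
    return strengths


def expected_weight(strengths, i, j):
    """Expected weight S_i * S_j / S_t (normalization by S_t is assumed)."""
    s_t = sum(strengths.values())
    return strengths[i] * strengths[j] / s_t


# Hypothetical weighted edges between three nodes.
toy_weights = {("A", "B"): 4.0, ("B", "C"): 1.0, ("A", "C"): 3.0}
toy_strengths = node_strengths(toy_weights)
```

For the toy data the strengths are 7, 5 and 4, so the expected weight between A and B is $7 \cdot 5 / 16$.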
CCME Continued
¡ The method also defines a parameter kappa to capture the variance of the expected weight of a connection.
¡ The method forms communities by comparing how connected a node is to an already established community.
¡ It does this by approximating a p-value using the central limit theorem (CLT), with a variance defined by the model
¡ This method does not account for self loops, which leads to an adjustment for the data that has two stages:
¡ One that identifies communities
¡ One that identifies Statistically Significant Self Commuting
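The CLT step can be illustrated with a generic normal approximation: given the observed connection between a node and a community, plus the mean and variance expected in a random graph, the p-value is the upper tail of the corresponding normal distribution. In this sketch the mean and variance are simply passed in as numbers; how CCME derives them (including the role of kappa) is not reproduced here, and the function name `clt_p_value` is made up for the example.

```python
import math


def clt_p_value(observed, expected, variance):
    """Upper-tail p-value of `observed` under Normal(expected, variance).

    A small p-value means the node is more strongly connected to the
    community than a random graph would suggest.
    """
    z = (observed - expected) / math.sqrt(variance)
    # Survival function of the standard normal, written with erfc.
    return 0.5 * math.erfc(z / math.sqrt(2.0))


# Hypothetical numbers: observed weight 12, expected 8, variance 4 (z = 2).
p_example = clt_p_value(12.0, 8.0, 4.0)
```

A node whose observed connection equals the expectation gets a p-value of 0.5, while the example above, two standard deviations over the expectation, gets a p-value of roughly 0.023.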
Dampened Modularity
¡ Extractive approach designed to mimic modularity results.
¡ Exists in 3 stages:
¡ Stage 1: Gathering. Consists of maximizing a quantity involving $\theta^{x-1}\omega_{ij}$
¡ Stage 2: Ensuring connectivity. Done by choosing a threshold and ensuring every node is at least that connected to its respective community
¡ Stage 3: Filtering. Getting rid of redundant communities
Calculating Dampening Factor
¡ The best current method for choosing the proper dampening factor is to tune it according to a quality measure
¡ In an un-weighted graph with non-overlapping communities, modularity is the reasonable quality measure to use
¡ In the case of these metropolitan areas, the best measure is instead one that accounts for overlapping communities, and this project used such a measure
Refining Communities
¡ Because the gathering stage only requires a strong connection with one node, it does not guarantee a strong connection with the rest of the community
¡ We therefore mimic the method in the OSLOM paper: connectivity with the community is calculated for each node, and the least connected node is removed if it is not above a threshold t.
¡ Thus for a community of n nodes the calculation
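The refinement loop described on this slide can be sketched as follows: score every node's connectivity with the community, remove the least connected node if its score is not above the threshold, and repeat until the weakest remaining node passes. The connectivity score used here (the share of a node's total edge weight that goes to other community members) is an assumption chosen for illustration, as are the function names and the toy weights.

```python
def connectivity(node, community, weights):
    """Share of `node`'s total edge weight that goes to other members.

    `weights[u][v]` is the weight of the edge between u and v.
    This particular score is an illustrative assumption.
    """
    neighbors = weights.get(node, {})
    total = sum(neighbors.values())
    if total == 0:
        return 0.0
    inside = sum(w for other, w in neighbors.items()
                 if other in community and other != node)
    return inside / total


def refine(community, weights, t):
    """Repeatedly drop the least connected node while it is not above t."""
    community = set(community)
    while len(community) > 1:
        scores = {n: connectivity(n, community, weights) for n in community}
        weakest = min(scores, key=scores.get)
        if scores[weakest] > t:
            break  # every remaining node is above the threshold
        community.remove(weakest)
    return community


# Hypothetical weights: D is tied mostly to E, which is outside the community.
toy_weights = {
    "A": {"B": 5, "C": 5, "D": 1},
    "B": {"A": 5, "C": 5},
    "C": {"A": 5, "B": 5},
    "D": {"A": 1, "E": 9},
    "E": {"D": 9},
}
refined = refine({"A", "B", "C", "D"}, toy_weights, t=0.5)
```

In the toy data only a tenth of D's weight goes to the community, so D is removed and A, B and C remain.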
Filtering
¡ We have attempted several ways of filtering similar communities, all based on a Jaccard similarity value, which for sets $A$ and $B$ is equal to
$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$
¡ One method involved merging communities together, but this had issues with maintaining connectivity
¡ A second method involved selecting all communities above a certain minimum similarity threshold and keeping only the best one among them
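A minimal sketch of Jaccard-based filtering, assuming communities are plain sets of node labels. The greedy `filter_similar` helper is one possible reading of the second method: it visits communities from best to worst and keeps a community only if it is not too similar to one already kept. The `quality` argument stands in for whatever measure ranks the communities; using `len` in the example is purely a placeholder.

```python
def jaccard(a, b):
    """Jaccard similarity |A intersect B| / |A union B| of two sets."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def filter_similar(communities, quality, min_similarity):
    """Keep the best community out of each group of similar communities.

    Communities are visited from highest to lowest `quality`; a candidate
    is dropped when its Jaccard similarity with an already kept community
    is at least `min_similarity`.
    """
    kept = []
    for candidate in sorted(communities, key=quality, reverse=True):
        if all(jaccard(candidate, other) < min_similarity for other in kept):
            kept.append(candidate)
    return kept


# Hypothetical communities; `len` is only a placeholder quality measure.
toy_communities = [{"A", "B", "C"}, {"A", "B", "C", "D"}, {"X", "Y"}]
filtered = filter_similar(toy_communities, quality=len, min_similarity=0.5)
```

The first two toy communities have a Jaccard similarity of 0.75, so only the larger one survives, alongside the unrelated community of X and Y.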
Other Uses and Modifications
¡ The DMA could also be used as a way of collecting potential nodes for a community, on which another algorithm such as CCME then performs community detection.
¡ It can also use the approach laid out in the OSLOM paper to deal with dynamic community detection
CCME Results
• The CCME produced 214 communities with an average size of 11.01 counties and a standard deviation of 10.67 counties
DMA Results
¡ The DMA produced 347 communities with a mean size of 6.06055 counties and a standard deviation of 3.044 counties
Comparison to MSA
• The MSA has 919 communities with a mean size of 1.94 and a standard deviation of 2.26
Observations and Potential Adjustments
¡ The first major observation is that the CCME produces communities much larger than those of the other two methods
¡ The DMA produces a distribution of community sizes very different from those of the other two methods, which could be a result of it not really detecting communities or of a bad quality measure
¡ The DMA also at times produced communities that did not contain their seed node, which could mean a bad choice of seed nodes or that those nodes are Statistically Significant