USING DYNAMIC COMMUNITY DETECTION TO TRACK
TOPICS ON TWITTER
by Kai Nichols
ABSTRACT
Since 2008, many people have proposed methods to track topics on social media using a variety of natural language processing (NLP) techniques. The use of Dynamic Community Detection (DCD) for topic tracking combines the pros of many other methods. It allows for the unsupervised detection of topics, as groups of key terms, and the tracking of the evolution of those topics over time. DCD has previously been applied to social media topic tracking on a small scale, but the application to larger, more complex, and richer social networks, such as Twitter, has not previously been tackled. In this paper I will propose a method for using DCD to track trends from streaming Twitter content.
TABLE OF CONTENTS
ABSTRACT
LIST OF FIGURES AND TABLES
LIST OF ABBREVIATIONS
ACKNOWLEDGMENTS
CHAPTER 1 INTRODUCTION
CHAPTER 2 BACKGROUND AND RELATED WORK
2.1 NLP and Short Form Text
2.1.1 Usage
2.1.2 Language Quality
2.1.3 Sampling
2.2 Social Media Trend Detection
2.2.1 Popular Papers
2.2.2 Other Papers
2.2.3 NLP Dynamic Community Detection
2.3 Static Community Detection
2.3.1 Traditional Methods
2.3.2 Popular Methods
2.4 Dynamic Community Detection
2.4.1 Instant Optimal
2.5 NLP and Networks
CHAPTER 3 METHODOLOGY
3.1 Pre-Processing
3.2 Dynamic Community Detection
3.2.1 Static Community Detection
3.2.2 Matching
CHAPTER 4 EXPERIMENTS AND RESULTS
4.1 Dataset
4.2 Community Properties
4.3 Top Events
4.4 Discussion
4.4.1 Improvements
CHAPTER 5 CONCLUSION
REFERENCES CITED
A.1 Appendix A Code
A.2 Static Community Detection
A.3 Match Communities
B.1 Appendix B Example Top Communities
B.2 Method 1
B.2.1 Sports
B.2.2 Politics
B.2.3 Music
B.2.5 Award Shows
B.2.6 Noise
B.3 Method 2
B.3.1 Sports
B.3.2 Politics
B.3.3 Music
B.3.4 Video Games
B.3.5 Awards Shows
B.3.6 General News
B.3.7 Noise
LIST OF FIGURES AND TABLES
Figure 2.1 Formal Transformations of Communities (1. Maintain, 2. Contraction, 3. Growth, 4. Death, 5. Birth, 6. Split, 7. Merge)
Figure 4.1 Total Number of Posts per Hour vs Number of Posts after Pre-Processing
Figure 4.2 Number of Posts Available per Day in January
Figure 4.3 Histograms of Co-occurrence Frequency (left) and Words per Post (right)
Figure 4.4 Transforms per Day Broken Down by Type
Figure 4.5 Histogram of Community Length/Lifespan (left) and Average Community Size (right)
Figure 4.6 Community Lifespan vs Average Size
Figure 4.7 Comparison of Noise to Relevant Information in Detected Communities
Figure 4.8 Comparison of Community Topic for Relevant Communities
Table 3.1 Post Pre-Processing Examples
Table 3.2 Transformation Conditions
LIST OF ABBREVIATIONS
Dynamic Community Detection . . . DCD
Word Co-Occurrence Network . . . WCN
Clique Percolation Method . . . CPM
Clique Percolation Method with Weights . . . CPMw
Natural Language Processing . . . NLP
ACKNOWLEDGMENTS
I would like to thank my co-workers at Seagate Technology for getting me interested in working with Social Media data, my friends and family for their support throughout this project, and the staff of the Writing Center for their assistance.
CHAPTER 1 INTRODUCTION
Around 2008, people were beginning to use the wealth of information available on social media and were in awe at the growth of the platforms. Now, social media is an integral part of many US citizens’ day-to-day lives: 70% of US adults use Facebook [1] and one tenth of the US population uses Twitter on a daily basis [2]. Social media users create a continuous stream of content on every topic imaginable. People talk about their day-to-day lives, the news, entertainment, sports, politics, and more. Social media has created a wealth of opportunities for advertising and marketing, has allowed people to connect with each other regionally and worldwide, and allows fast dissemination of information during disasters [3].
Many methods have been proposed to detect and/or track topics on social media [4–7]. Some aim to do event detection, such as detecting breaking news events. Some aim to do emerging trend detection, predicting some upcoming trend. There is a wealth of research in this area because knowing what people are talking about and how those topics vary over time is of great use to many groups; that information has been used in marketing [8, 9], natural disaster response [10], tracking flu epidemics [11], and news reporting [12, 13].
Existing topic tracking falls mainly into the category of trend detection. Most work is concerned not with what is popular, but with what is newly popular. I’d argue that whether a topic is still popular or ceasing to be popular is also useful. In the example of tracking flu epidemics, it is certainly useful to know if there is a flu epidemic in the area, but once that is known it is still useful to know whether its effect on the area is increasing or decreasing. Moreover, existing methods focus mainly on detecting single trending terms, when many topics are composed of multiple terms and a single term may be used in multiple contexts. They also often don’t account for the quickly evolving language of the internet. Throughout the lifetime of a topic, words may enter and leave while the core meaning and content stay the same.
Dynamic Community Detection (DCD) allows for topic tracking on social media without many of these drawbacks. It allows overlapping communities of key terms, allows terms to enter and leave a topic, and tracks the temporal patterns of all topics, not just those that are “trending”. Cazabet et al. first proposed using DCD for trend detection on social media in 2012 using the tags on the Japanese social media site Nico Nico Douga [4]. I will expand upon their work by applying DCD to Twitter data and creating an end-to-end method for feature extraction, DCD, and topic exploration.
I will first review the existing work on analyzing social media data, social media topic tracking, community detection, and language networks in Chapter 2. Then, in Chapter 3, I will outline my methodology for DCD based on word co-occurrence networks, which comprises a feature extraction methodology for the rich and complex Twitter data and a DCD algorithm. Following the methodology, in Chapter 4, I will run my topic tracking algorithm on approximately 2 weeks of Twitter data and show its ability to detect and track the popular topics. I will also discuss the advantages, failings, and possible improvements of the method. This will be followed by my concluding remarks in Chapter 5.
CHAPTER 2
BACKGROUND AND RELATED WORK
2.1 NLP and Short Form Text
2.1.1 Usage
In February 2020, Twitter reported 152 million average daily active users, with an average of 31 million in the US [2]. Since 2014, there have been about 6,000 tweets per second, 350,000 tweets per minute, 500 million tweets per day, and 200 billion tweets per year [14]. This implies that the average user tweets about 3 times per day. A 2007 paper found that most posts on Twitter fell into 4 categories: sharing daily activities, conversations, sharing information/URLs, and reporting or discussing news [15]. A 2009 study categorized trends on Twitter and found 27.25% to be entertainment related, 19.53% to be sports news, 12% to be internet memes, and 12% to be technology news [16].
2.1.2 Language Quality
Twitter allows its users to tweet up to 280 characters [17]. Various works have shown a correlation between lexical errors and the quality of content on the web [18]. Rello et al. examined the lexical quality of various social media sites versus that of news sites and general websites. They define lexical quality as the average number of misspellings for a set of frequently misspelled words. They showed the lexical quality of social media to be better than sites like Wikipedia or CNN, but worse than the overall quality of the web. However, the quality of social media was increasing, whereas that of the web as a whole was decreasing. Of all social media sites, the top 5 for lexical quality were Twitter, Yahoo Answers, Facebook, Friendster, and Fotolog, all with a lexical quality greater than that of the overall web [19].
2.1.3 Sampling
Twitter currently provides two APIs for getting streaming tweet data: the sample stream and the Decahose stream. The sample stream allows developers to stream at least a 1% random sample of new public tweets and is available free of charge. The Decahose stream is an enterprise API that allows developers to stream about a 10% random sample of all new tweets. Twitter also offers a Firehose API option, which delivers 100% of all new tweets, but little information about it is publicly available [20]. In their 2013 paper, Morstatter et al. analyze whether the streaming API provides a representative sample of the full Firehose stream by comparing the results for a number of NLP tasks. First, they find that the coverage (the percent of the full data) offered by the streaming API varies day by day, never dropping below 1%, but averaging 43.5%. Later they look at topic analysis, comparing results from running LDA on the streaming API data vs random samples of the Firehose data. They find the results on the sample to be a good representation of the full dataset. To quantify this, they use the Jensen–Shannon divergence, which measures similarity between probability distributions, and compute its mean for different days of data which have different levels of coverage. For the minimum coverage they observe a mean divergence of 0.024 with a standard deviation of 0.019. For the maximum coverage they observe a mean of 0.016 with a standard deviation of 0.018. In all the samples the maximum divergence achieved is 0.15 [21].
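To make the divergence values above easier to interpret, the following is a minimal sketch of the Jensen–Shannon divergence between two discrete distributions. It is not taken from Morstatter et al.: the base-2 logarithm (which bounds the value between 0 and 1) and the two example topic-word distributions are assumptions made here for illustration.

```python
import math

def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) in bits; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence: mean KL divergence of p and q from their midpoint m."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

# Hypothetical word distributions for one topic over a four-word vocabulary.
firehose_topic = [0.40, 0.30, 0.20, 0.10]
sample_topic = [0.35, 0.30, 0.25, 0.10]

print(js_divergence(firehose_topic, sample_topic))  # small value: similar distributions
print(js_divergence([1.0, 0.0], [0.0, 1.0]))        # disjoint distributions
```

Identical distributions give a divergence of 0 and fully disjoint ones give 1 under this convention, so values such as 0.016 to 0.024 sit very close to the "identical" end of the scale.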
2.2 Social Media Trend Detection
2.2.1 Popular Papers
“TwitterMonitor: Trend Detection over the Twitter Stream”, published in 2010, was one of the earlier papers to propose a trend detection scheme for Twitter stream data. TwitterMonitor first identifies bursty keywords, keywords with a spike in their number of mentions. It then groups those bursty keywords into trends based on their co-occurrences. The concept of burstiness as a method for detecting trends is the main contribution of this paper. This method only looks at groups of keywords that are seeing a sudden growth in the number of mentions. It does not continue to track a trend if it maintains a high popularity or if it starts to drop [5].
2.2.2 Other Papers
In their 2012 paper, Lau et al. reject the idea of using keywords or n-grams for trend analysis. Instead they use topic models, which learn clusters of terms rather than just looking at single keywords or n-grams, and look at changes in co-occurrence rather than term frequency. They use an LDA model, which provides them with a set of topics, each with its own word distribution, and a topic distribution for each document. They fit this model at consecutive time steps, using the previous model to construct priors. This guarantees a one-to-one correspondence between topics for adjacent model updates. The evolution of each topic can be tracked through the changes in its word distribution [6]. However, LDA uses word co-occurrence frequencies at the document level, which has been found to make it less effective on short texts [7].
2.2.3 NLP Dynamic Community Detection
In their 2012 paper, Cazabet et al. look at using DCD for trend detection to address tracking trends over both long and short periods of time. They apply DCD to a word co-occurrence network based on tags on the Japanese social network Nico Nico Douga and find it to be able to detect communities lasting varying amounts of time and of varying sizes. They also note the ability to see terms enter and leave a community. They suggest the future work of adapting this problem to a social network, such as Twitter, without built-in keywords or tags and richer variation of content [22].
In 2017, Konstantinidis et al. began to tackle trend detection using DCD on Twitter. However, they define their communities as discussions between groups of users. They study these discussions between users on a specific topic by creating a graph with edges being defined by a user directing content to another user via the @ symbol. They also contribute a method for ranking the communities based on community size, presence of influential users, and the text being talked about within the discussions [23].
However, how to track and detect trends on Twitter using DCD on only the content of posts is still an open question.
2.3 Static Community Detection
2.3.1 Traditional Methods
Traditional methods for community detection fall into two main categories, agglomerative and divisive. Agglomerative methods iteratively add edges to form communities, whereas divisive methods iteratively remove edges to form communities.
The most traditional method for community detection is hierarchical clustering, which is agglomerative. Hierarchical clustering involves calculating a weight between every pair of vertices then taking all vertices and joining them into communities in order of the calculated edge weights. This process creates a tree wherein a “slice” through this tree gives a set of communities.
A traditional divisive method uses the minimum-cut algorithm to divide a network into a set number of communities.
Both of these methods require a user-set number of communities and can find structure where there is none [24].
2.3.2 Popular Methods
With the rise of social networks and the world wide web, discovering community structure in these large networks became a hot research area, causing a surge in community detection methods. In 2002, Girvan and Newman published the seminal paper “Community structure in social and biological networks”, which uses a divisive method based on their developed concept of “edge betweenness”, a generalization of vertex betweenness [25].
In 2004, Girvan and Newman introduced the concept of modularity, which is a measure of quality of a division of a network into communities. It is a difference between the fraction of edges that are in a community and the expected fraction if the edges were distributed randomly. This metric sparked a number of community detection methods focused on modularity maximization [24].
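As an illustration, the verbal definition of modularity above is commonly written as follows. The symbols $e_{cc}$ and $a_c$ are introduced here and do not appear elsewhere in this document: $e_{cc}$ is the fraction of all edges that have both ends in community $c$, and $a_c$ is the fraction of all edge ends attached to vertices in community $c$, so that $a_c^2$ is the expected within-community fraction under random placement.

```latex
Q = \sum_{c} \left( e_{cc} - a_c^{2} \right)
```

A value of $Q$ near 0 indicates a division no better than random, while larger values indicate stronger community structure.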
In 1998, the concept of clique overlap was introduced by Everett and Borgatti [26]. This was later expanded for community detection by Palla et al. in 2005, who introduced the clique percolation method (CPM) for uncovering overlapping community structure. The assumption is that a typical community consists of several fully-connected subgraphs. A k-clique community is a union of all k-cliques, complete subgraphs of size k, that can be reached from each other through a series of adjacent k-cliques, where adjacency means sharing k-1 nodes. For weighted graphs, Palla et al. proposed removing all links with weight less than some threshold w* and considering the remaining links to be unweighted [27]. In 2007, Farkas et al. proposed a weighted extension of CPM, CPMw (Clique Percolation Method with Weights). Here a threshold value, I, is set, and adjacent k-cliques can only be joined if the resulting intensity, the geometric mean of the weights, is greater than the threshold value [28].
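The k-clique community definition above can be sketched in a few lines of plain Python. This is a brute-force illustration on a toy graph, not the implementation used by Palla et al. or in this work; the function names, the toy edge list, and the union-find bookkeeping are all assumptions made for the example.

```python
from itertools import combinations

def k_clique_communities(edges, k):
    """Unweighted clique percolation: union k-cliques that share k-1 nodes."""
    neighbors = {}
    for u, v in edges:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)

    # Brute-force k-clique enumeration; fine for a toy graph, too slow for real networks.
    cliques = [
        frozenset(nodes)
        for nodes in combinations(sorted(neighbors), k)
        if all(b in neighbors[a] for a, b in combinations(nodes, 2))
    ]

    # Union-find over cliques: adjacent cliques (k-1 shared nodes) join one community.
    parent = list(range(len(cliques)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(cliques)), 2):
        if len(cliques[i] & cliques[j]) == k - 1:
            parent[find(i)] = find(j)

    communities = {}
    for i, clique in enumerate(cliques):
        communities.setdefault(find(i), set()).update(clique)
    return sorted(communities.values(), key=sorted)

def threshold_edges(weighted_edges, w_star):
    """Weighted adaptation by thresholding: drop links lighter than w_star, ignore the rest of the weights."""
    return [(u, v) for u, v, w in weighted_edges if w >= w_star]

# Toy graph: triangles abc and bcd share two nodes; triangle def shares only d with them.
toy = [("a", "b"), ("a", "c"), ("b", "c"), ("b", "d"), ("c", "d"),
       ("d", "e"), ("d", "f"), ("e", "f")]
print(k_clique_communities(toy, 3))
```

With k = 3 the toy graph yields two communities that overlap in node d, which is the property that makes the method attractive when one term can belong to several topics.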
Another popular method for community detection has been label propagation, wherein each node is initialized with a unique label and at every iteration each node may change its label based on the values of its neighbors [29–31].
2.4 Dynamic Community Detection
DCD aims at discovering and describing the life-cycle of communities within a changing network. A number of formal transformations of communities have been defined including Birth, Death, Growth, Contraction, Merge, and Split, which can be seen in Figure 2.1. DCD faces the main challenge of community instability solved in most cases by “smoothing” the evolution of communities [32].
DCD methods can be split into three categories: instant-optimal, temporal trade-off, and cross-time [32].
Figure 2.1: Formal Transformations of Communities (1. Maintain, 2. Contraction, 3. Growth, 4. Death, 5. Birth, 6. Split, 7. Merge)
Instant-optimal community detection detects communities in snapshots and matches the communities found at different steps. This leverages previous static community detection methods; however, communities can be unstable since it only considers one time step at a time. Temporal trade-off community detection detects communities at time t based on changes from time t-1. This method avoids some of the instability in instant-optimal CD, but can’t be easily parallelized due to its sequential nature and has a risk of an avalanche effect: long-term drift of communities compared to what a static detection method would find. Cross-time community detection considers all steps simultaneously, detecting communities based on both past and future evolutions. This method doesn’t suffer from instability or community drift, but requires novel approaches and cannot handle real-time community detection [32].
In my case only instant optimal and temporal trade-off are relevant in practice. Cross time implies you have all the time points you will be working with already collected, whereas tracking topics over time implies you only have current and past data.
2.4.1 Instant Optimal
Instant Optimal approaches can be broken up into three main categories: similarity-based approaches, core-node-based approaches, and multi-step matching approaches. Similarity-based approaches match communities based on a function used to quantify the similarity between communities in different time-steps. Core-node-based approaches identify “core nodes” for each community, based on measures such as centrality or term frequency. Communities at adjacent time-steps are matched by the presence of the core node. This does not allow for detecting splits or merges. Multi-step matching looks at more than just adjacent time-steps when matching communities. This allows for detection of resurgence of communities, when one dies but then is born again [32].
As I would like to be able to detect all community transformations without too much computational complexity, I focused on similarity-based approaches, including those proposed in [33–35]. Hopcroft et al. propose matching communities using the formula:
$$\mathrm{match}(C, C') = \min\left( \frac{|C \cap C'|}{|C|}, \frac{|C \cap C'|}{|C'|} \right).$$
If $\mathrm{match}(C, C')$ is close to 1, $C'$ is considered a continuation of $C$ [33]. Palla et al. propose using the CPM to find communities at each snapshot and matching the communities by running CPM on the joint graph of snapshots t and t+1 [34]. Takaffoli et al. propose a set of rules to match communities for the transformations of continuation, birth, death, split, and merge [35].
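The match score of Hopcroft et al. is simple enough to sketch directly. The function below follows the formula above; the two example communities are hypothetical term sets chosen for illustration, and the handling of empty sets is an assumption.

```python
def match(c, c_prime):
    """Match score: the smaller of the two overlap fractions between communities."""
    c, c_prime = set(c), set(c_prime)
    if not c or not c_prime:
        return 0.0  # assumption: an empty community matches nothing
    overlap = len(c & c_prime)
    return min(overlap / len(c), overlap / len(c_prime))

# Hypothetical communities of key terms on two consecutive days.
yesterday = {"wilder", "fury", "rematch", "boxing"}
today = {"wilder", "fury", "rematch", "heavyweight"}

print(match(yesterday, today))  # 3 shared terms out of 4 on each side
```

Taking the minimum means a small community that is fully contained in a much larger one still scores low, so the score only approaches 1 when the two communities are close to identical.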
2.4.2 Temporal Trade-off Community Detection
iLCD (intrinsic, Longitudinal Community Detection) uses a multi-agent simulation to detect communities in a dynamic network. There are two types of agents: node agents and community agents. Node agents have a label, a list of other node agents they are connected to, and a list of communities they belong to. They also have five actions: to create a new community, to bond with another agent, to remove a bond with another agent, to ask to integrate into a community, and to ask to ban a node from a community. A community agent has a label, a list of node agents in its community, and a value of seclusion, measuring the quality of the community. They have four actions: decide for the integration of a node, decide for the banning of a node agent, decide whether or not to integrate another community, and to die. With the addition or removal of an edge (n1, n2) three actions are triggered: the node agents will ask for integration to their neighbors’ communities containing n1 and n2, the community agents decide whether or not to integrate the node agents, and the modified communities try to merge with neighboring communities [36].
In “A Real-Time Detecting Algorithm for Tracking Community Structure of Dynamic Networks”, the authors describe an incremental algorithm with low computing complexity. They define four types of edge updates each resulting in four possible operations. Inner community edges are two nodes connected by an edge that already exists and are in the same community. They cause no change to the community structure. Cross-community edges are when the two nodes already exist and belong to different communities. They cause either no change or the combination of two communities into one. Half-new edges are when one of the nodes doesn’t exist before the edge. They cause the new node to be assigned to an existing community or the creation of a new community. Finally, new edges are when both nodes don’t exist before the edge. They cause the two nodes to either be assigned to an existing community or create a new community. In the case of multiple possible operations the choice is made based on modularity gain [37].
Tiles is an easily parallelizable DCD algorithm. Tiles uses label propagation to diffuse changes to a node’s surroundings and update its neighbors’ community membership. It defines two levels of involvement in a community: core, involved in a triangle with other nodes in the same community, and peripheral, a one-hop neighbor of a core node. Only core nodes can spread community membership to their neighbors. Communities are grown through PeripheralPropagation, where a new node becomes part of an established community, and CorePropagation, where a node becomes a core member if it connects to two core nodes or a new community is formed if a new triangle of nodes not belonging to a community is formed [38].
2.5 NLP and Networks
Word co-occurrence networks (WCN) are used in a variety of applications, including part-of-speech tagging, semantic analysis, and topic detection. Co-occurrence refers to two words appearing within some window of each other; that window can be N words, a sentence, a paragraph, or an entire document. WCN can be directed or undirected. The edges can be weighted or unweighted. If weighted, the raw number of co-occurrences can be used, or a metric such as log-likelihood ratio, mutual information, or the Dice score [39].
In their 2018 paper, Garg and Kumar study the structure of word co-occurrence networks for microblogs, specifically looking at Twitter data. They show that the short form text of Twitter data creates a WCN with a different structure than traditional WCN. Some observations of note include that the node degrees of a microblog WCN follow a power law, unlike a traditional WCN, which follows a two-regime power law; that microblog WCN are scale free; and that microblog WCN are disassortative [40].
CHAPTER 3 METHODOLOGY
3.1 Pre-Processing
For each time slice, I first perform pre-processing on a corpus of posts to transform it into a word co-occurrence graph.
First, I filter the text content of each post to reduce each post to only relevant words. There are 3 main sections to this filtering: removing punctuation, breaking up URLs, and removing stop-words:
• I process all punctuation based on type and context. All punctuation that occurs mid-word, such as apostrophes and commas between numbers, is replaced with nothing. This preserves the original meaning of the word while removing the punctuation. All hyphens and forward slashes are replaced with a space. This allows the two words connected by that punctuation to be considered a bi-gram, and to match with situations in which the punctuation is not used. All other punctuation is replaced with a period. I use the period to denote words that are next to each other but should not be considered a bi-gram. This includes words separated by a period, comma, semi-colon, etc.
• I break up URLs under the assumption that some of the words in a URL will be useful. For example, I would like to know if a certain domain is often being mentioned. I replace all forward slashes with a space, and I remove all words containing http or www. I do not remove the com (or gov, org, net, etc.), so that the website will be represented as a bi-gram even if the domain name itself is just one word.
• Stop-words are words filtered out during an NLP task. In this case, there are three types of words I remove. First, I remove all words in the NLTK English stop-word list, 179 common words (ex: I, me, all, have, is). These words provide very little information about what is being talked about, yet make up a large proportion of the text. Second, I remove words shorter than 3 characters and longer than 30 characters. Words with a length of one or two contain almost no information. Almost no words in the English language are longer than 30 characters, so I assume these words to be part of a URL (see Table 3.1, example 2). The final set of words I remove are ‘meta-content’, words being used to talk about tweeting and Twitter (ex: followed, unfollowed, rt, retweet, tweet, etc.).
Table 3.1: Post Pre-Processing Examples
Original | Processed
https://twitter.com | twitter com
https://docs.google.com/document/d/1nHtnWzFrIMF1L9d0ItyLzLZwkMv2GjmglrXG-Zw9HZY/edit# | docs google com document . zw9hzy edit
Deontay Wilder confirms rematch against Tyson Fury on Twitter https://t.co/VdiMZZk6PP | wilder . fury . deontay wilder confirms rematch . tyson fury . twitter . vdimzzk6pp
RT @thatchicknickk: On a jury panel all week and just found out the reporter who types everything during the trial makes around $600 an hour | jury panel . week . found . reporter . types everything . trial makes around . 600 . hour
June issue is out peeps. We are excited about this month’s issue. We have Crooked Colours, Dom Youdan, Allipha | june issue . peeps . excited . month . issue . crooked colours . dom youdan . allipha
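The filtering steps described above can be approximated with a short function. This is a simplified sketch, not the code used to produce Table 3.1: the stop-word set is a tiny hand-picked stand-in for the NLTK list, the regular expressions are assumptions about how each rule might be expressed, and details such as the handling of @mentions and possessives are not reproduced, so it will not match every row of the table.

```python
import re

# Assumption: a tiny stand-in for the 179-word NLTK English stop-word list.
STOP_WORDS = {"i", "me", "all", "have", "is", "we", "are", "out",
              "about", "this", "the", "on", "a", "and"}
# Meta-content terms named in the text.
META_WORDS = {"followed", "unfollowed", "rt", "retweet", "tweet"}

def preprocess(post):
    """Reduce a post to kept words, with '.' marking a break between bi-grams."""
    text = post.lower()
    # Inside URLs, treat dots like slashes so "twitter.com" stays a bi-gram.
    text = re.sub(r"\S*(?:http|www)\S*", lambda m: m.group(0).replace(".", "/"), text)
    # Mid-word punctuation is deleted: apostrophes in words, commas inside numbers.
    text = re.sub(r"(?<=\w)['’](?=\w)", "", text)
    text = re.sub(r"(?<=\d),(?=\d)", "", text)
    # Hyphens and forward slashes become spaces so both sides can form a bi-gram.
    text = re.sub(r"[-/]", " ", text)
    # Every other punctuation mark becomes a stand-alone break marker.
    text = re.sub(r"[^\w\s]", " . ", text)

    kept = []
    for token in text.split():
        drop = (
            token == "."
            or "http" in token
            or "www" in token
            or token in STOP_WORDS
            or token in META_WORDS
            or not 3 <= len(token) <= 30
        )
        if not drop:
            kept.append(token)
        elif kept and kept[-1] != ".":
            kept.append(".")  # a removed token also breaks the bi-gram
    if kept and kept[-1] == ".":
        kept.pop()
    return " ".join(kept)

print(preprocess("https://twitter.com"))
print(preprocess("June issue is out peeps. We are excited"))
```

Replacing each removed token with a single break marker is a design choice in this sketch: it keeps words that were never adjacent in the original post from being counted as a bi-gram after filtering.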
Once I have my cleaned corpus of posts, I use sklearn’s CountVectorizer model to create a sparse binary matrix in $\mathbb{B}^{\text{posts} \times \text{features}}$. Here I set the maximum number of features (bi-grams) to 10,000, so each matrix includes only the top 10,000 bi-grams. For different time slices, different features may be chosen, though there will be broad overlap. This results in a sparse matrix $X_t$ for time $t$, where $X_{t,ij} = 1$ if $\mathrm{feat}_{t,j}$ occurs in post $i$ and $0$ otherwise.
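I use this feature count matrix to construct the WCN. Because $X_t$ is a binary matrix with one row per post and one column per feature, this can be formulated as:
$$G_{ij} = (X_t^{T} X_t)_{ij} = \sum_{k=1}^{n} X_{t,ki} X_{t,kj} = \sum_{\text{post}_k \in \text{posts}} \mathbb{1}\left[ (\text{word } i \text{ in post}_k) \text{ and } (\text{word } j \text{ in post}_k) \right]$$
where $n$ is the number of posts.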
Each entry of this matrix represents the co-occurrence between $\mathrm{feat}_i$ and $\mathrm{feat}_j$. This matrix is symmetric, and the diagonal is the total number of times feature $i$ occurs. Since the network is undirected, I take the edges of my word co-occurrence network from the upper triangle of $G$. Moreover, I do not need the values on the diagonal, which represent term occurrences, so I use only the values that are strictly in the upper triangle.
Word co-occurrence graphs tend to be heavily positively skewed, with the mode often being one. I exclude all edges with a weight of one and two from my graphs.
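The construction above can be illustrated without any matrix library. The sketch below counts, for every pair of features, the number of posts containing both, which equals the corresponding off-diagonal entry of the co-occurrence matrix for a binary post-by-feature matrix. It is a plain-Python stand-in for the sparse matrix product, and the function name, the `min_weight` parameter, and the example posts are assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_edges(post_features, min_weight=3):
    """Count posts in which each feature pair co-occurs; keep edges with weight >= min_weight."""
    counts = Counter()
    for features in post_features:
        # set() makes the per-post indicator binary; sorted() gives each pair one
        # canonical orientation, mirroring the strict upper triangle of G.
        for pair in combinations(sorted(set(features)), 2):
            counts[pair] += 1
    # Dropping weights of one and two corresponds to min_weight = 3.
    return {pair: weight for pair, weight in counts.items() if weight >= min_weight}

# Hypothetical bi-gram features extracted from four posts.
posts = [
    ["tyson fury", "deontay wilder"],
    ["deontay wilder", "tyson fury"],
    ["tyson fury", "deontay wilder", "tyson fury"],
    ["tyson fury", "crooked colours"],
]
print(cooccurrence_edges(posts))
```

Only the pair that co-occurs in three posts survives the weight filter; the pair seen once is discarded, as are repeated mentions within a single post.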
3.2 Dynamic Community Detection
I considered a variety of DCD approaches. First, I used a Temporal Trade-off approach, similar to that in [36]. The algorithm is as follows. First, every node can possibly enter a community if one of its neighbors is in that community; the node sets a threshold for entering that community based on the weight of the connection its immediate neighbors have to that community. Then, for each time-step, I iterate through all edges in the WCN and, for each edge, update the communities based on whether the nodes can enter or leave a community as a result of the update of this edge. Finally, new communities are created when a clique of size k is formed and those nodes were not already sharing a community. This method suffered from temporal instability, so I moved on to other methods. Second, I considered an instant optimal approach using CPMw as my static community detection. For the matching of communities across time, I tested both a rule-based approach similar to [35] and a multi-time step community detection approach similar to [34]. They produced similar results, with small variations in speed and quality, and I ended up choosing the rule-based approach.
3.2.1 Static Community Detection
At each time step I use CPMw (Clique Percolation Method with Weights) to find overlapping communities. A method that generates overlapping communities is useful for topic modeling because when clustering groups of words there are many words that may belong to multiple clusters. This may be due to the word having multiple meanings or being used in multiple contexts. The k-clique method is also not dependent on any randomness; the k-cliques found for some network will be the same every time.
The main parameter of the static CPMw method is the threshold, I. If too small of an I is chosen, many barely related communities will merge together. If too high of an I is chosen, the model will not include communities that may be important, but have low edge weights compared to others. The same I must be used for multiple time slices of data. If it is varied, the model may end up excluding some community one day and not the next day.
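To make the effect of the threshold concrete, the sketch below computes the intensity of a clique as the geometric mean of its edge weights and filters cliques against a threshold. This is one simplified reading of how the threshold acts, not the CPMw implementation used here; the function names, the weight dictionary keyed by unordered pairs, and the example cliques are assumptions.

```python
import math
from itertools import combinations

def clique_intensity(clique, weights):
    """Geometric mean of the edge weights inside a clique."""
    edge_weights = [weights[frozenset(pair)] for pair in combinations(clique, 2)]
    return math.prod(edge_weights) ** (1 / len(edge_weights))

def cliques_above_threshold(cliques, weights, intensity_threshold):
    """Keep only the cliques whose intensity exceeds the threshold I."""
    return [c for c in cliques if clique_intensity(c, weights) > intensity_threshold]

# Hypothetical co-occurrence weights for two 3-cliques of terms.
strong = ("fury", "wilder", "rematch")
weak = ("week", "found", "trial")
weights = {frozenset(pair): 8 for pair in combinations(strong, 2)}
weights.update({frozenset(pair): 3 for pair in combinations(weak, 2)})

print(cliques_above_threshold([strong, weak], weights, 5))  # only the heavily weighted clique
print(cliques_above_threshold([strong, weak], weights, 2))  # both cliques
```

Raising the threshold from 2 to 5 drops the lightly weighted clique entirely, which is the trade-off described above between merging barely related communities and missing low-weight ones.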
3.2.2 Matching
After determining the communities at each iteration, I determine which communities match each other across time. I do this by taking each adjacent set of communities and considering the potential transformations of growth, contraction, split, merge, death, and birth. I created conditions under which I could define these transformations as occurring. First, I defined an ordering for the transformations, where a transformation at a higher level rules out all transformations at the levels below it, and two transformations on the same level cannot both have their conditions fulfilled. This ordering is:
1. Merge, Split
2. Maintain
3. Grow, Decline
4. Birth, Death
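I then defined metrics under which each of these transformations could happen. Those are detailed in Table 3.2. $C_t$ are the communities at time $t$, $C_{t,i}$ is the $i$th community at time $t$, and $1 - \epsilon$ is some threshold value.
Table 3.2: Transformation Conditions
Transformation | Rule | Formula
merge | $\bigcup_{i \in S} C_{t-1,i} \approx C_{t,j}$ | $\forall i \in S,\ \frac{|C_{t-1,i} \cap C_{t,j}|}{|C_{t-1,i}|} > 1 - \epsilon$ and $\sum_{i \in S} \frac{|C_{t-1,i} \cap C_{t,j}|}{|C_{t,j}|} > 1 - \epsilon$
split | $C_{t-1,i} \approx \bigcup_{j \in S} C_{t,j}$ | $\forall j \in S,\ \frac{|C_{t-1,i} \cap C_{t,j}|}{|C_{t,j}|} > 1 - \epsilon$ and $\sum_{j \in S} \frac{|C_{t-1,i} \cap C_{t,j}|}{|C_{t-1,i}|} > 1 - \epsilon$
maintain | $C_{t-1,i} \approx C_{t,j}$ | $\frac{2|C_{t-1,i} \cap C_{t,j}|}{|C_{t-1,i}| + |C_{t,j}|} > 1 - \epsilon$
grow | $C_{t-1,i} \cap C_{t,j} \approx C_{t-1,i}$ | $\frac{|C_{t-1,i} \cap C_{t,j}|}{|C_{t-1,i}|} > 1 - \epsilon$
decline | $C_{t-1,i} \cap C_{t,j} \approx C_{t,j}$ | $\frac{|C_{t-1,i} \cap C_{t,j}|}{|C_{t,j}|} > 1 - \epsilon$
birth | This community did not exist before | New exists with no corresponding old
death | This community does not exist anymore | Old exists with no corresponding new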
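The pairwise conditions in Table 3.2 can be sketched as simple set operations. The code below is an illustrative reading of the table, not the matching code from Appendix A: the function names and the tolerance value 0.2 are hypothetical, the merge test assumes the summed overlap is normalized by the size of the new community, and communities are assumed to be non-empty.

```python
def classify_pair(old, new, eps=0.2):
    """Label one (old, new) community pair as maintain, grow, decline, or None."""
    old, new = set(old), set(new)
    shared = len(old & new)
    # Maintain is checked first because it ranks above grow and decline in the ordering.
    if 2 * shared / (len(old) + len(new)) > 1 - eps:
        return "maintain"
    if shared / len(old) > 1 - eps:
        return "grow"      # nearly all of old survives inside a larger new
    if shared / len(new) > 1 - eps:
        return "decline"   # new is nearly a subset of old
    return None

def is_merge(olds, new, eps=0.2):
    """Merge test: each old community lies mostly inside new, and together they cover new."""
    new = set(new)
    overlaps = [len(set(old) & new) for old in olds]
    each_inside = all(o / len(old) > 1 - eps for o, old in zip(overlaps, olds))
    return len(olds) > 1 and each_inside and sum(overlaps) / len(new) > 1 - eps

print(classify_pair({1, 2, 3}, {1, 2, 3, 4, 5, 6}))       # old is contained in a larger new
print(is_merge([{1, 2, 3}, {4, 5, 6}], {1, 2, 3, 4, 5, 6}))
```

A split test would mirror `is_merge` with the roles of the old and new communities exchanged, and birth and death fall out as the new and old communities left unmatched by every other rule.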
The magnitude of the intersection between two communities, the number of features they share, is easy to calculate. Following the same logic as constructing the co-occurrence matrix, I can construct the matrix $S$. First I construct a binary matrix $C_t$ such that each row represents a community and each column represents a feature. Each element $S_{ij} = |C_{t,i} \cap C_{t-1,j}|$ is then given by
$$S = C_t C_{t-1}^{T}$$
However, constructing the union did not have as apparent a solution, so some approximations were made between the rule and the formula. These approximations, however, do not account for overlap between communities.
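The intersection computation can be sketched with plain lists standing in for the sparse matrices. The helper names and the example communities below are hypothetical; the point is only that a product of two binary membership matrices yields the pairwise intersection sizes.

```python
def membership_matrix(communities, features):
    """Binary matrix: row i marks which features belong to community i."""
    return [[1 if f in community else 0 for f in features] for community in communities]

def intersection_sizes(c_t, c_prev):
    """S = C_t C_{t-1}^T, so S[i][j] counts the features shared by community i at t and j at t-1."""
    return [[sum(a * b for a, b in zip(row_t, row_prev)) for row_prev in c_prev]
            for row_t in c_t]

# Hypothetical feature vocabulary and communities on two consecutive days.
features = ["fury", "wilder", "rematch", "grammys", "album"]
previous = [{"fury", "wilder", "rematch"}, {"grammys", "album"}]
current = [{"fury", "wilder"}, {"grammys", "album", "rematch"}]

S = intersection_sizes(membership_matrix(current, features),
                       membership_matrix(previous, features))
print(S)
```

Each entry of `S` can then be divided by the relevant community size to obtain the ratios used in the transformation conditions.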
CHAPTER 4
EXPERIMENTS AND RESULTS
4.1 Dataset
For the dataset, I use Twitter Stream data for January 2019 from the Internet Archive [41]. It has been limited to only tweets with the language marked as English. I only look at the content of the tweets. I note in Figure 4.1 that after pre-processing I only retain around 40-50% of the original tweets, but the percentage is relatively constant throughout a day. This loss occurs because many tweets contain no bi-grams free of stop-words.
Figure 4.1: Total Number of Posts per Hour vs Number of Posts after Pre-Processing
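To illustrate why a tweet can be dropped during pre-processing, the following sketch extracts bi-grams that contain no stop-words and keeps only tweets with at least one such bi-gram. It is an assumption-laden illustration rather than the pre-processing code used here: the whitespace tokenizer, the tiny `STOP_WORDS` set, and the function names are all hypothetical.

```python
# Hypothetical, deliberately tiny stop-word list for illustration only.
STOP_WORDS = {"the", "a", "is", "to", "and", "of", "in", "it", "i", "you"}


def bigrams_without_stop_words(text):
    """Return adjacent word pairs in which neither word is a stop-word."""
    tokens = text.lower().split()
    return [f"{a} {b}" for a, b in zip(tokens, tokens[1:])
            if a not in STOP_WORDS and b not in STOP_WORDS]


def retained(tweets):
    """Keep only tweets that yield at least one usable bi-gram."""
    return [t for t in tweets if bigrams_without_stop_words(t)]


tweets = ["Tom Brady is the greatest", "it is the end", "ok"]
print(retained(tweets))
```

Under these assumptions, very short tweets and tweets dominated by function words produce no features and are removed.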
I then looked at how much data is available each day in Figure 4.2. Some days are missing data entirely and some have less data than expected, so I chose to focus on the 15-day period from January 13th to January 27th, where a stable amount of data is available on consecutive days.
Figure 4.2: Number of Posts Available per Day in January
For that time range I then looked at the distribution of co-occurrence frequency (excluding terms which only co-occurred one or two times) and the number of bi-grams per post in Figure 4.3. I use these numbers to set the threshold $I$ for the time period.
Figure 4.3: Histograms of Co-occurrence Frequency (left) and Words per Post (right)
Using my DCD algorithm with a time-step of one day, I find 2133 disjoint community pieces (if a community merges or splits, there is one piece for before and one for after the merge or split). This corresponds to 2251 community paths, where a path runs from community birth to death and the portions before a split or after a merge are duplicated across paths. About half of these community paths, 1120, last longer than one day.
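The distinction between pieces and paths can be made concrete with a small sketch. It assumes the pieces are stored as a directed acyclic graph in which each piece maps to the pieces that follow it after a merge or split. The `successors` dictionary and the function name are hypothetical and are not taken from the code in Appendix A.

```python
def count_paths(successors):
    """Count birth-to-death paths in a DAG of community pieces.

    `successors` maps every piece to the list of pieces that follow it;
    pieces with no successors end in a death.
    """
    has_parent = {c for children in successors.values() for c in children}
    births = [p for p in successors if p not in has_parent]
    memo = {}

    def paths_from(piece):
        if piece not in memo:
            children = successors.get(piece, [])
            memo[piece] = 1 if not children else sum(paths_from(c) for c in children)
        return memo[piece]

    return sum(paths_from(b) for b in births)


# A and B merge into C, C later splits into D and E, F lives and dies alone.
pieces = {"A": ["C"], "B": ["C"], "C": ["D", "E"], "D": [], "E": [], "F": []}
print(len(pieces), count_paths(pieces))
```

In this toy example the shared piece C is counted once as a piece but appears in four different paths, which is how the number of paths can exceed the number of pieces.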
4.2 Community Properties
I observe the types of transformations happening to communities each day. The most common transformations are, in descending order, Birth, Death, and Maintain. A full overview of their occurrences can be found in Figure 4.4.
Figure 4.4: Transforms per Day Broken Down by Type
I then looked at the lifespan and size distributions for the communities, found in Figure 4.5. As stated before, only about half of the communities last longer than one day. The lengths decline more slowly after that. The one-day communities are likely the result of a small news event or some discussion between a group of users. Longer communities are likely about more ongoing events or interests of users that are discussed regularly. In my later analysis, I will focus mainly on these communities with a longer lifespan. The average size of a community is the mean of the number of terms at each time-step of its life. The sizes of the communities are also heavily weighted towards the smallest value (3 in my case). Communities reach an average size of up to 227.5 terms. The maximum size achieved by a community at any single time-step is 562 terms. These large communities are often the overlap of several popular topics.
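The lifespan and average size defined above can be computed from a per-community history. The sketch below assumes a history dictionary that maps each time-step to either a list of terms or a string marker such as "death", similar in spirit to the `C` dictionary in Appendix A. The function name and example data are hypothetical.

```python
from statistics import mean


def lifespan_and_average_size(history):
    """Return (number of time-steps alive, mean number of terms per step).

    String entries are treated as markers (for example "death") and
    do not count towards the lifespan or the size.
    """
    sizes = [len(terms) for terms in history.values()
             if not isinstance(terms, str)]
    return len(sizes), (mean(sizes) if sizes else 0)


history = {
    0: ["super bowl", "tom brady", "first super"],
    1: ["super bowl", "tom brady", "first super",
        "brady posted", "posted instagram"],
    2: "death",
}
print(lifespan_and_average_size(history))
```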
Figure 4.5: Histogram of Community Length/Lifespan (left) and Average Community Size (right)
As can be seen in Figure 4.6, communities that last longer tend to be made up of more terms, but only up to a limit. For communities lasting longer than a week, the average size begins to decrease again. I hypothesize that in the short term communities with more words can live longer by covering many bases, but in the long term they only survive if they contain continuously relevant terms, and keeping only those terms while discarding any extra helps them live longer. However, with only about 2 weeks of data, these averages have extremely high margins of error. I assume that these would decrease with more data, but it is possible that there is no relation between size and lifespan.
Figure 4.6: Community Lifespan vs Average Size
4.3 Top Events
Based on an online crowd-sourced rank of the top events from January 2019 [42], eight out of the top ten events were also detected by my method. They are in order as follows:
1. Government Shutdown and Immigration
• 25 communities were related to the government shutdown
• 8 communities were related to Donald Trump’s immigration policies
2. Bird Box
• 2 communities mentioned the movie Bird Box. In one community it is the sole topic. In the other, it is mentioned along with other popular entertainment
3. Instagram Egg
• 2 communities are about the Instagram egg
4. Renewed Calls for The Prosecution of R. Kelly
• 1 community is about R. Kelly being dropped by Sony Music
5. Tidying Up with Marie Kondo
• There are no communities about this topic
6. Brexit
• 6 communities are about Brexit
7. Polar Vortex
• There are no communities about this topic. The polar vortex occurred towards the end of the time frame studied; it formed January 24, 2019 and brought snowfall January 27-29th.
8. Ted Bundy
• 6 communities were about the Ted Bundy biopic
9. Pelosi/Trump Feud
• 19 communities talked about Pelosi and Trump.
10. People announcing runs for president
• 4 communities were about Kamala Harris announcing she was running for president.
• There were no communities about Julian Castro or Kirsten Gillibrand announcing they were running for president
I then looked at the top communities using two methods. Method 1 considers the communities lasting longer than one day and ranks them by the mean term frequency of the terms they contain, highest first. Method 2 considers communities lasting longer than 3 days whose size is always less than 30 terms. For these, I take the average of the sum of term frequencies per day and order based on that.
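The two ranking methods can be sketched as scoring functions. This is one possible reading of the description, not the code used for the results: each community is assumed to be a list of per-day term lists, `freq` is assumed to be a single dictionary of term frequencies (per-day frequencies would be a natural refinement), and all names and example values are hypothetical.

```python
from statistics import mean


def method1_score(days, freq):
    """Mean term frequency over all terms on all days; needs more than 1 day."""
    if len(days) <= 1:
        return None
    return mean(freq[t] for terms in days for t in terms)


def method2_score(days, freq, min_days=3, max_size=30):
    """Average daily sum of term frequencies for long, small communities."""
    if len(days) <= min_days or any(len(terms) >= max_size for terms in days):
        return None
    return mean(sum(freq[t] for t in terms) for terms in days)


def rank(communities, freq, score):
    """Return the names of eligible communities, highest score first."""
    scored = ((score(days, freq), name) for name, days in communities.items())
    eligible = [(s, name) for s, name in scored if s is not None]
    return [name for s, name in sorted(eligible, reverse=True)]


freq = {"super bowl": 90, "tom brady": 30, "need advice": 500, "a": 10, "b": 20}
communities = {
    "bowl": [["super bowl", "tom brady"], ["super bowl"]],
    "noise": [["need advice"]],
    "long": [["a"], ["a"], ["a"], ["a", "b"]],
}
print(rank(communities, freq, method1_score))
print(rank(communities, freq, method2_score))
```

Neither score distinguishes relevant topics from frequent noise, so a community made of very common terms can still rank highly once it passes the lifespan filter.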
I looked at the top 100 communities based on each of these ranking methods and found the following. First, a large proportion of the top communities were what I considered noise. These are things such as people talking about daily life, memes, advertisements, horoscopes, giveaways, and amalgamations of several topics. Relevant topics are those which focus on one area whose knowledge has potential value. A breakdown of relevant vs noise can be found in Figure 4.7.
For all relevant communities in the top 100, I then manually sorted them into categories. The breakdown for this can be found in Figure 4.8. All of these communities fell into 6 categories: General News, Award Shows, Video Games, Music, Politics, and Sports. All but Award Shows likely represent the main categories reflected year-round. Film award season is November through February, so the Award Shows category is likely not found outside of those months. Of the categories, some broke down further. Sports consisted mostly of football-related communities, which is likely tied to which sports are in season. Politics consisted mostly of U.S. politics and some U.K. politics.
4.4 Discussion
This method successfully captures the top topics and a broad variety of topics from Twitter Stream data. It has the advantage of tracking overlapping trends over both short and long periods of time. It is relatively efficient, with 2 weeks of posts taking less than an hour to process, so it can be run as data becomes available and could potentially handle more data. The ability to define the evolution of a trend opens up the potential for more than just emerging topic detection, including detection of other temporal patterns.
4.4.1 Improvements
The main potential improvements to this method lie in better keyword extraction. There is a large amount of irrelevant noise in Twitter data that must be filtered out to maintain relevant and useful communities. A more comprehensive list of stop-words would be the simplest improvement, but more advanced NLP methods could be used for filtering out the noise.
Moreover, there are issues related to the community matching. Some communities matched between time-steps are not related. This happens often when a small community grows to a large community. All of the original community may be a subset of the new larger community, but the new larger community is not a single cohesive topic. That issue can then compound, causing drift over time if the large community then shrinks or splits, resulting in a continuous community that switches between topics.
Another potential improvement would be a better ranking of the relevance or importance of a community. Having noise in the dataset would be less of an issue if the resulting communities could be sorted in such a way that the noise was at the bottom and the useful topics at the top.
In reality, any users of this kind of method would likely have an area of interest more specific than the entirety of the data on Twitter. First filtering the data down to the subset that contains keywords from a list of relevant keywords would decrease the amount of noise in the dataset and allow a higher percentage of useful topics to be found and tracked.
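The keyword pre-filter suggested above could look like the following sketch. It was not part of the experiments; the function name, the whitespace tokenization, and the example keyword list are assumptions chosen for illustration.

```python
def filter_by_keywords(tweets, keywords):
    """Keep tweets containing at least one keyword as a whole word."""
    wanted = {k.lower() for k in keywords}
    return [t for t in tweets if wanted & set(t.lower().split())]


tweets = [
    "Government shutdown enters another week",
    "good morning everyone",
    "Brexit deal vote delayed",
]
print(filter_by_keywords(tweets, ["shutdown", "brexit"]))
```

The filtered tweets would then pass through the same pre-processing and community detection steps as the full stream.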
CHAPTER 5
CONCLUSION
I have shown a method for applying Dynamic Community Detection to Twitter Stream data that is able to detect relevant topics being discussed on Twitter and track their evolution over time. More work can still be done to improve the keyword extraction and community matching in the algorithm. However, this method is a good starting point for a feature extraction and DCD method applicable to large corpora of unstructured text such as Twitter data.
This method can benefit marketers, advertisers, journalists, and politicians who are looking to understand what people are talking about in their area of interest and how those topics evolve over time. Moreover, this method is not limited to Twitter, but applies to any large and continuously growing textual corpus such as news sites, online forums, or product reviews.
There are many extensions of this method that I would recommend for future work. First, this method could be extended to include a tool for users to interact with, sort, and filter through the detected communities. Second, more analysis of the temporal patterns of the communities could be done, namely classification and prediction. Classification of the temporal patterns would allow users to see which topics are currently “trending” or declining in popularity, and prediction would show which topics are about to do so.
REFERENCES CITED
[1] Pew Research Center. 10 facts about Americans and Facebook. URL https://www.pewresearch.org/fact-tank/2019/05/16/facts-about-americans-and-facebook/.
[2] Document. URL https://www.sec.gov/Archives/edgar/data/1418091/000141809120000019/twtrq419ex992.htm.
[3] Dina Fine Maron. How Social Media Is Changing Disaster Response. URL https://www.scientificamerican.com/article/how-social-media-is-changing-disaster-response/.
[4] Remy Cazabet, Hideaki Takeda, Masahiro Hamasaki, and Frédéric Amblard. Using dynamic community detection to identify trends in user-generated content. Social Network Analysis and Mining, 2:1–11, December 2012. doi: 10.1007/s13278-012-0074-8.
[5] Michael Mathioudakis and Nick Koudas. TwitterMonitor: trend detection over the twitter stream. In SIGMOD Conference, 2010. doi: 10.1145/1807167.1807306.
[6] Jey Han Lau, Nigel Collier, and Timothy Baldwin. On-line Trend Analysis with Topic Models: #twitter Trends Detection Topic Model Online. In Proceedings of COLING 2012, pages 1519–1534, Mumbai, India, December 2012. The COLING 2012 Organizing Committee. URL https://www.aclweb.org/anthology/C12-1093.
[7] Qiang Jipeng, Qian Zhenyu, Li Yun, Yuan Yunhao, and Wu Xindong. Short Text Topic Modeling Techniques, Applications, and Performance: A Survey. arXiv:1904.07695 [cs], April 2019. URL http://arxiv.org/abs/1904.07695. arXiv: 1904.07695.
[8] Juan Miguel Carrascosa, Roberto González, Rubén Cuevas, and Arturo Azcorra. Are trending topics useful for marketing? visibility of trending topics vs traditional advertisement. In Proceedings of the first ACM conference on Online social networks, COSN ’13, pages 165–176, Boston, Massachusetts, USA, October 2013. Association for Computing Machinery. ISBN 978-1-4503-2084-9. doi: 10.1145/2512938.2512948. URL https://doi.org/10.1145/2512938.2512948.
[9] Jure Leskovec, Lada A. Adamic, and Bernardo A. Huberman. The dynamics of viral marketing. May 2007. URL https://doi.org/10.1145/1232722.1232727.
[10] Hoe Yun Kwon and Young Ok Kang. Risk analysis and visualization for detecting signs of flood disaster in Twitter. Spatial Information Research, 24(2):127–139, April 2016. ISSN 2366-3294. doi: 10.1007/s41324-016-0014-1. URL https://doi.org/10.1007/s41324-016-0014-1.
[11] Eiji Aramaki, Sachiko Maskawa, and Mizuki Morita. Twitter catches the flu: detecting influenza epidemics using Twitter. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP ’11, pages 1568–1576, Edinburgh, United Kingdom, July 2011. Association for Computational Linguistics. ISBN 978-1-937284-11-4.
[12] Mariam Adedoyin-Olowe, Mohamed Medhat Gaber, Carlos M. Dancausa, Frederic Stahl, and João Bártolo Gomes. A rule dynamics approach to event detection in Twitter with its application to sports and politics. Expert Systems with Applications, 55:351–360, August 2016. ISSN 0957-4174. doi: 10.1016/j.eswa.2016.02.028. URL http://www.sciencedirect.com/science/article/pii/S0957417416300598.
[13] David H. Weaver and Lars Willnat. Changes in U.S. Journalism. Journalism Practice, 10(7):844–855, October 2016. ISSN 1751-2786. doi: 10.1080/17512786.2016.1171162. URL https://doi.org/10.1080/17512786.2016.1171162.
[14] Twitter Usage Statistics - Internet Live Stats. URL https://www.internetlivestats.com/twitter-statistics/.
[15] Akshay Java, Xiaodan Song, Tim Finin, and Belle Tseng. Why we Twitter: Understanding microblogging usage and communities. of the 9th WebKDD and 1st SNA, 43: 56–65, January 2007. ISSN 978-1-59593-848-0. doi: 10.1145/1348549.1348556.
[16] Marc Cheong. ‘What are you Tweeting about?’: A survey of Trending Topics within Twitter.
[17] Giving you more characters to express yourself. URL https://blog.twitter.com/official/en_us/topics/product/2017/Giving-you-more-characters-to-express-yourself.html.
[18] Irit Gelman and Anthony Barletta. A “quick and dirty” website data quality indicator. pages 43–46, January 2008. doi: 10.1145/1458527.1458538.
[19] Luz Rello and Ricardo Baeza-Yates. Social Media is NOT that Bad! The Lexical Quality of Social Media. ICWSM 2012 - Proceedings of the 6th International AAAI Conference on Weblogs and Social Media, January 2012.
[20] Overview. URL https://developer.twitter.com/en/docs/tweets/sample-realtime/overview.
[21] Fred Morstatter, Jürgen Pfeffer, Huan Liu, and Kathleen M. Carley. Is the Sample Good Enough? Comparing Data from Twitter’s Streaming API with Twitter’s Firehose. In Seventh International AAAI Conference on Weblogs and Social Media, June 2013. URL https://www.aaai.org/ocs/index.php/ICWSM/ICWSM13/paper/view/6071.
[22] Remy Cazabet. Using dynamic community detection to identify trends in user-generated content. Social Network Analysis and Mining. URL https://www.academia.edu/2741129/Using_dynamic_community_detection_to_identify_trends_in_user-generated_content.
[23] Konstantinos Konstantinidis, Symeon Papadopoulos, and Ioannis Kompatsiaris. Exploring Twitter communication dynamics with evolving community analysis. PeerJ Computer Science, 3:e107, February 2017. doi: 10.7717/peerj-cs.107.
[24] Aaron Clauset, M. E. J. Newman, and Cristopher Moore. Finding community structure in very large networks. Physical Review E, 70(6):066111, December 2004. ISSN 1539-3755, 1550-2376. doi: 10.1103/PhysRevE.70.066111. URL http://arxiv.org/abs/cond-mat/0408187. arXiv: cond-mat/0408187.
[25] M. Girvan and M. E. J. Newman. Community structure in social and biological networks. Proceedings of the National Academy of Sciences, 99(12):7821–7826, June 2002. ISSN 0027-8424, 1091-6490. doi: 10.1073/pnas.122653799. URL https://www.pnas.org/content/99/12/7821.
[26] Martin G. Everett and Stephen P. Borgatti. Analyzing Clique Overlap. CONNECTIONS, 21:49–61, 1998.
[27] Gergely Palla, Imre Derényi, Illés Farkas, and Tamás Vicsek. Uncovering the overlapping community structure of complex networks in nature and society. Nature, 435(7043): 814–818, June 2005. ISSN 1476-4687. doi: 10.1038/nature03607. URL https://www.nature.com/articles/nature03607.
[28] Illes J. Farkas, Daniel Abel, Gergely Palla, and Tamas Vicsek. Weighted network modules. New Journal of Physics, 9(6):180–180, June 2007. ISSN 1367-2630. doi: 10.1088/1367-2630/9/6/180. URL http://arxiv.org/abs/cond-mat/0703706. arXiv: cond-mat/0703706.
[29] James P. Bagrow and Erik M. Bollt. Local method for detecting communities. Physical Review E, 72(4):046108, October 2005. doi: 10.1103/PhysRevE.72.046108. URL https://link.aps.org/doi/10.1103/PhysRevE.72.046108.
[30] Usha Nandini Raghavan, Reka Albert, and Soundar Kumara. Near linear time algorithm to detect community structures in large-scale networks. Physical Review E, 76(3):036106, September 2007. ISSN 1539-3755, 1550-2376. doi: 10.1103/PhysRevE.76.036106. URL http://arxiv.org/abs/0709.2938. arXiv: 0709.2938.
[31] Gennaro Cordasco and Luisa Gargano. Community Detection via Semi-Synchronous Label Propagation Algorithms. arXiv:1103.4550 [physics], March 2011. doi: 10.1504/.. 045103. URL http://arxiv.org/abs/1103.4550. arXiv: 1103.4550.
[32] Giulio Rossetti and Rémy Cazabet. Community Discovery in Dynamic Networks: a Survey. ACM Computing Surveys, 51(2):1–37, February 2018. ISSN 03600300. doi: 10.1145/3172867. URL http://arxiv.org/abs/1707.03186. arXiv: 1707.03186.
[33] John Hopcroft, Omar Khan, Brian Kulis, and Bart Selman. Tracking evolving communities in large linked networks. Proceedings of the National Academy of Sciences, 101(suppl 1):5249–5253, April 2004. ISSN 0027-8424, 1091-6490. doi: 10.1073/pnas.0307750100. URL https://www.pnas.org/content/101/suppl_1/5249.
[34] Gergely Palla, Albert-László Barabási, and Tamás Vicsek. Quantifying social group evolution. Nature, 446(7136):664–667, April 2007. ISSN 0028-0836, 1476-4687. doi: 10.1038/nature05670. URL http://www.nature.com/articles/nature05670.
[35] Mansoureh Takaffoli, Farzad Sangi, Justin Fagnan, and Osmar R. Zaiane. MODEC - Modeling and Detecting Evolutions of Communities. In ICWSM, 2011.
[36] Remy Cazabet, Frédéric Amblard, and Chihab Hanachi. Detection of Overlapping Communities in Dynamical Social Networks. pages 309–314, August 2010. doi: 10.1109/SocialCom.2010.51.
[37] Jiaxing Shang, Lianchen Liu, Feng Xie, Zhen Chen, Jiajia Miao, Xuelin Fang, and Cheng Wu. A Real-Time Detecting Algorithm for Tracking Community Structure of Dynamic Networks. arXiv:1407.2683 [physics], July 2014. URL http://arxiv.org/abs/1407.2683. arXiv: 1407.2683.
[38] Giulio Rossetti, Luca Pappalardo, Dino Pedreschi, and Fosca Giannotti. Tiles: an online algorithm for community discovery in dynamic social networks. Machine Learning, 106 (8):1213–1241, August 2017. ISSN 1573-0565. doi: 10.1007/s10994-016-5582-8. URL https://doi.org/10.1007/s10994-016-5582-8.
[39] Rada Mihalcea and Dragomir Radev. Graph-based Natural Language Processing and Information Retrieval, April 2011. URL /core/books/graphbased-natural-language-processing-and-information-retrieval/216B4D2C3F82BF04C0CC383CD3760C19. ISBN: 9780521896139 9780511976247. Publisher: Cambridge University Press.
[40] Daniel Preoţiuc-Pietro, P. K. Srijith, Mark Hepple, and Trevor Cohn. Studying the Temporal Dynamics of Word Co-occurrences: An Application to Event Detection. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16), pages 4380–4387, Portorož, Slovenia, May 2016. European Language Resources Association (ELRA). URL https://www.aclweb.org/anthology/L16-1694.
[41] Archive Team. archiveteam-twitter-stream-2019-01, 2019. URL https://archive.org/details/archiveteam-twitter-stream-2019-01.
[42] Things That Were A Thing: January 2019 Edition. URL //www.ranker.com/list/january-2019-trending-topics/sammy-leary.
A.1 Appendix A Code
A.2 Static Community Detection
import itertools
import multiprocessing

import networkx as nx
import numpy as np
from scipy import sparse
from scipy.stats import gmean

# Intensity of a subgraph: geometric mean of its edge weights.
intensity = lambda g: gmean([g.edges[e]['weight'] for e in g.edges()])

def is_clique(row, i):
    # row is row i of S; each value counts the connected node pairs
    # (out of 3 x 3) between triangle i and triangle k[1].
    # Relies on the globals Graph, triangles and I being set.
    l = []
    for k, v in row.todok().items():
        if v == 9:
            l.append((i, k[1]))
        elif v == 8:
            if intensity(Graph.subgraph(triangles[i].union(triangles[k[1]]))) > I:
                l.append((i, k[1]))
    return l

def get_communities(X):
    global Graph, triangles
    # get word co-occurrence
    G = sparse.csr_matrix(X.transpose()) @ X
    # subtracting the boolean pattern twice drops pairs that co-occurred
    # only once or twice; adding 2 restores the remaining counts
    # (np.bool was removed from NumPy, so the builtin bool is used)
    G = G - G.astype(bool)
    G = G - G.astype(bool)
    G.data += 2
    G_tri = sparse.triu(G, k=1, format="csr")
    # make graph
    Graph = nx.Graph()
    Graph.add_weighted_edges_from([(k[0], k[1], w)
                                   for k, w
                                   in list(G_tri.todok().items())])
    # get all triangles whose intensity is at least I
    triangles = set(frozenset([n, nbr, nbr2])
                    for n in Graph
                    for nbr, nbr2 in itertools.combinations(Graph[n], 2)
                    if nbr in Graph[nbr2]
                    and gmean([Graph[n1][n2]['weight']
                               for n1, n2
                               in itertools.combinations([n, nbr, nbr2], 2)]) >= I)
    triangles = list(triangles)

    # get similarity between triangles
    # (triangle-by-node indicator matrix, built directly in CSR form
    # instead of assigning the rows and data of a LIL matrix)
    rows = np.repeat(np.arange(len(triangles)), 3)
    cols = np.array([n for t in triangles for n in t], dtype=int)
    Tri = sparse.csr_matrix((np.ones(len(cols), dtype=int), (rows, cols)),
                            shape=(len(triangles), G.shape[0]))
    S = Tri @ G.astype(bool).astype(int) @ Tri.T

    # match which to join
    pool = multiprocessing.Pool(4)
    join = pool.starmap(is_clique, [(S[i], i) for i in range(S.shape[0])])
    pool.close()
    pool.join()
    join = [item for sublist in join for item in sublist if item]

    # join
    Join_G = nx.Graph()
    Join_G.add_edges_from(join)
    joins = list(nx.connected_components(Join_G))
    # get communities
    communities = [set().union(*[triangles[i] for i in j]) for j in joins]

    # binary community-by-feature matrix
    C = np.array([[int(i in c) for i in range(G.shape[1])] for c in communities])
    C = sparse.csr_matrix(C)
    return C
A.3 Match Communities
import numpy as np
from scipy import sparse

# days and load_data are defined elsewhere and are not shown in this listing.
# Three lines (A.23, A.60, A.101) are absent from the source listing; the
# lines marked "reconstructed" below are inferred from the parallel branches.
C = {}
stats = {}
old_C = None
old_feat = None
threshold = 0.55
I = 30

for i, day in enumerate(days):
    stats[day] = {"merge": 0,
                  "split": 0,
                  "grow": 0,
                  "decline": 0,
                  "maintain": 0,
                  "birth": 0,
                  "death": 0}
    X, new_feat = load_data(day)

    new_C = get_communities(X)

    if i == 0:
        for j, c in enumerate(new_C.tolil().rows):
            C[j] = {}
            C[j][i] = [new_feat[w] for w in c]  # reconstructed
        old_C = new_C
        old_feat = new_feat
    else:
        # create feature space to compare past and present communities
        intersecting_words = set(new_feat).intersection(set(old_feat))
        words_old_C = [old_feat.index(word) for word in intersecting_words]
        words_new_C = [new_feat.index(word) for word in intersecting_words]

        # get intersection between old and new communities
        intersect = new_C[:, words_new_C] @ old_C[:, words_old_C].T  # (new x old)

        # get total number of terms in each community
        old_total = (old_C[:, words_old_C] @ old_C[:, words_old_C].T).diagonal()
        new_total = (new_C[:, words_new_C] @ new_C[:, words_new_C].T).diagonal()
        # if zero set to one because we will be dividing
        old_total[old_total == 0] = 1
        new_total[new_total == 0] = 1

        # convert communities to lists of terms
        # (dtype=object because the lists have different lengths)
        new_C = np.array([[new_feat[w] for w in c] for c in new_C.tolil().rows],
                         dtype=object)
        old_C = np.array([[old_feat[w] for w in c] for c in old_C.tolil().rows],
                         dtype=object)

        # keep track of what communities haven't been matched
        old_unmatched = set(np.arange(0, len(old_total)))
        new_unmatched = set(np.arange(0, len(new_total)))

        reorder_C = [[] for j in range(len(old_C))]
        next_ind = len(old_C)
        # first find merges
        for j, c in enumerate(new_C):
            matches = np.where(intersect[j][:, list(old_unmatched)].toarray().flatten()
                               / old_total[list(old_unmatched)] > threshold)[0]
            matches = np.array(list(old_unmatched))[matches]
            if len(matches) > 1:
                if sum([intersect[j, m] / new_total[j] for m in matches]) > threshold:
                    # merge found
                    C[next_ind] = {}
                    C[next_ind][i] = new_C[j]  # reconstructed
                    for m in matches:
                        old_unmatched.remove(m)
                        C[m][i] = "merged to {}".format(next_ind)
                    new_unmatched.remove(j)
                    reorder_C.append(new_C[j])
                    next_ind += 1
                    stats[day]["merge"] += 1
        # then find splits
        mask = list(old_unmatched)
        for j, c in enumerate(old_C[mask]):
            j = mask[j]
            matches = np.where(intersect[list(new_unmatched)][:, j].toarray().flatten()
                               / new_total[list(new_unmatched)] > threshold)[0]
            matches = np.array(list(new_unmatched))[matches]
            if len(matches) > 1:
                if sum([intersect[m, j] / old_total[j] for m in matches]) > threshold:
                    # split found
                    split_loc = [v for v in range(next_ind, next_ind + len(matches))]
                    C[j][i] = "split into {}".format(split_loc)
                    for m in matches:
                        new_unmatched.remove(m)
                        C[next_ind] = {}
                        C[next_ind][i] = new_C[m]
                        reorder_C.append(new_C[m])
                        next_ind += 1
                    old_unmatched.remove(j)
                    stats[day]["split"] += 1
        # then all containing one to one
        mask = list(new_unmatched)
        for j, c in enumerate(new_C[mask]):
            j = mask[j]
            m = np.argmax(2 * intersect[j][:, list(old_unmatched)]
                          / (np.array(old_total[list(old_unmatched)]) + new_total[j]))
            m = np.array(list(old_unmatched))[m]
            if 2 * intersect[j, m] / (old_total[m] + new_total[j]) > threshold:
                # maintain found
                old_unmatched.remove(m)
                new_unmatched.remove(j)
                C[m][i] = new_C[j]
                reorder_C[m] = new_C[j]
                stats[day]["maintain"] += 1  # reconstructed
            elif intersect[j, m] / old_total[m] > threshold:
                # grow found
                old_unmatched.remove(m)
                new_unmatched.remove(j)
                C[m][i] = new_C[j]
                reorder_C[m] = new_C[j]
                stats[day]["grow"] += 1
            else:
                m = np.argmax(intersect[j][:, list(old_unmatched)]
                              / np.repeat(new_total[j], len(old_unmatched)))
                m = np.array(list(old_unmatched))[m]
                if intersect[j, m] / new_total[j] > threshold:
                    # then declines
                    old_unmatched.remove(m)
                    new_unmatched.remove(j)
                    C[m][i] = new_C[j]
                    reorder_C[m] = new_C[j]
                    stats[day]["decline"] += 1

        # add births
        for j in new_unmatched:
            C[next_ind] = {}
            C[next_ind][i] = new_C[j]
            reorder_C.append(new_C[j])
            next_ind += 1
            stats[day]["birth"] += 1
        # mark deaths
        for j in old_unmatched:
            if i - 1 in C[j]:
                if not isinstance(C[j][i - 1], str):
                    C[j][i] = "death"
                    stats[day]["death"] += 1

        # re-order to match communities
        old_C = [[(new_feat[w] in c) for w in range(len(new_feat))] for c in reorder_C]
        old_C = sparse.csr_matrix(old_C)
        # old_C is now expressed in the feature space of new_feat; this update
        # is an assumed fix, since the source listing ends without it
        old_feat = new_feat
B.1 Appendix B Example Top Communities
B.2 Method 1
B.2.1 Sports
• want nfc, chiefs super, call cost, trip super, saints trip, missed call, refs missed, rematch refs, cost saints, nfc championsh, championship rematch, nfc championship, super bowl, bowl want, rams chiefs
• brady greatest, brady posted, greatest time, super bowl, posted instagram, first super, tom brady
• sean mcvay, morning everyone, bill belichick, brady patriots, bowl appearances, showing super, last time, brady super, championship game, iphone android, answering questions, brady showing, brady afc, last night, tom brady, great game, everyone except, play super, bowl every, high school, going super, new team, super bowl, rams super, england patriots, team great, every year, 9th super, chiefs saints, patriots super, afc championship, patriots going, patriots team, tony romo, another super, good morning, bob kraft, new england, first super
• bowl halftime, travis scott, big boi, scott big, super bowl, perform super, halftime show
B.2.2 Politics
• new york, white house, options strike, strike iran, york times
• donald trump, national disgrace, fbi investigated, president trump, russian agent, trump russian
• lie congress, cohen lie, black eye, eye arm, mueller office, report trump, directed cohen, michael cohen, trump directed, president trump, buzzfeed report, told cohen, trump told, arm sling
• president united, resign position, leader senate, senatemajldr resign, citizen united, position leader, states america, states call, senate silencing, call senatemajldr, united states
• etc.
B.2.3 Music
• another cancer, might push, direction another, emotions might, shawn mendes, one direction, say thank, push one, back together, everyone say
• bestfanarmy btsarmy, iheartawards bestfanarmy, bestfanarmy 5sosfam, btsarmy bts twt, vote iheartawards, btsarmy iheartawards
• iheartawards bestcoversong, stilltheone harry styles, bestcoversong stilltheone, harry styles kaceymusgraves, iheartawards stilltheone, bestcoversong iheartawards
• etc.
B.2.4 Video Games
• need backup, lvl vortex, backup lvl, 100 vortex, lvl 120, battle need, lvl 100, vortex dragon, 100 celeste
• year new, lunar new, overwatch lunar, new year
• etc.
B.2.5 Award Shows
• congratulations leading, fan army, best picture, actor nominees, black panther, best fan, nominees oscarnoms, nominated best, leading actress, actress nominees, picture nominees, theacademy congratulations, congratulations best
B.2.6 Noise
• usually turn, need advice, turn need
• ever really, worst thing, ever seen, best thing, ever happened, might best, leave alone, really leave, thing ever, ever done, cutest thing
• mickey mouse, originalfunko originalfunko, originalfunko chance, tea party, betty boop, chance win
B.3 Method 2
B.3.1 Sports
• brady greatest, brady posted, greatest time, super bowl, posted instagram, first super, tom brady
• national baseball, fame day, first player, edgar martinez, congratulations mariano, unanimous hall, halladay edgar, elected baseball, mike mussina, rivera first, elected hall, derek jeter, mariano rivera, player unanimously, rivera hall, curt schilling, first unanimous, roy halladay, andy pettitte, barry bonds, schilling deserves, hall famer, hall fame, unanimously elected, baseball hall
B.3.2 Politics
• trans people, banning trans, burden trans, people military, people burden
• mike pence, things matter, border wall, trump martin, martin luther, closed monday, government shutdown, observance martin, monday january, eve mlk, pence says, end day, donald trump, says trump, king inspiring, mlk day, pence trump, king day, begin end, luther king
• workers getting, 800000 government, still getting, mitch mcconnell, getting paid, realdonaldtrump nancy, democrats congress, federal workers, mcconnell getting, pelosi getting, nancy pelosi, million illegals, paid people, partying beach, people working, government workers
• deal table, jeremy corbyn, theresa may, axing human, human rights, mrs may, may take, take deal, act brexit, rights act, prime minister
B.3.3 Music
• album made, made dawn, mini album, official photo, dawn official, track list, eternal sunshine, 세븐틴 you made my dawn, dawn ver, sunshine ver, you made my dawn ymmd, seventeen 세븐틴, photo eternal, pledis 17 seventeen, photo dawn, seventeen 6th, 6th mini
• arianagrande 7rings, got 7rings, gee thanks, make big, big deposits, thanks bought, want got, gloss poppin, next weeks, thanks jus, got want, see want, got arianagrande, hair gee
• think might, keep music, thought might, might new, music thought, may new, see music, think may, music think, new itsmuyinza, itsmuyinza song
• taylorswift13 taylornation13, iheartawards taylorswift13, delicate bestmusicvideo, taylorswift13 delicate, bestmusicvideo iheartawards, vote taytaylorswift13, vote delicate
• behind scenes, 방탄소년단 bts, taehyung bts twt, bts taehyung, hyundai bts, bts behind
• etc.
B.3.4 Video Games
• need backup, lvl vortex, backup lvl, 100 vortex, lvl 120, battle need, lvl 100, vortex dragon, 100 celeste
• call duty, playstation ps4live, duty black, check broadcast, black ops, fortnite live, broadcast playstation, ps4live fortnite
• begun let, kingss6 dragon, together tap, era together, dragon campaign, campaign begun, clash kingss6, new era, let step, tap official, step new
B.3.5 Award Shows
• congratulations leading, fan army, best picture, actor nominees, black panther, best fan, nominees oscarnoms, nominated best, leading actress, actress nominees, picture nominees, theacademy congratulations, congratulations best
• nominate felipeneto, favorite brazilian, personality kcanominees, felipeneto favorite, felipenetovotos felipeneto, felipeneto kcanominees, brazilian personality, kcanominees kca2019
• mvawards2019 manyvids, year awards, awards vote, avn avnawards, came voted, watched came
B.3.6 General News
• three chicago, police officer, sentenced years, jason van, six years, police officers, dyke sentenced, officer jason, van dyke, laquan mcdonald, years prison, former chicago, chicago police, sentenced months, nearly years
• 2019 couture, van herpen, iris van, spring 2019, couture collection, dior spring, elie saab, herpen spring, ss19 couture
• eclipse leo, full moon, see super, eclipse 2019, moon lunar, moon 2019, super wolf, lunar eclipse, tonight total, sunday night, eclipse tonight, super blood, moon eclipse, blood moon, blood wolf, last night, night sky, moon tonight, eclipse happening, every year, moon last, tonight last, end world, january 21st, total lunar, tonight super, time lapse, wolf blood, wolf moon, night time
B.3.7 Noise
• new video, posted new, video facebook
• chance win, mickey mouse, originalfunko chance, originalfunko originalfunko
• best friend, friend since, middle school, point yes, prove point, school trying, since middle, still best