
In this section we provide details of our effort in building a five-faceted social media search system. The rest of the paper will describe our efforts in this direction.

4.1 Datasets Used

We explain our approach with empirical evidence on two real-world datasets that cover two very different types of social media: blogs and microblogs.

Blogs Dataset. We used the BLOG06 test data collection used in the TREC06 Blog track to perform preliminary evaluations and test our hypothesis. This set was created


Table 2. Labels Assigned by Human Expert to Blog Clusters

No | Topic | Keywords | Cluster
1 | Gambling | poker, casino, gamble | 2
2 | Birthdays | birthday, whatdoesyourbirthdatemeanquiz | 3
3 | Classifieds | classifieds, coaster, rollerblade, honda | 4
4 | Podcast | podcasts, radio, scifitalk | 5
5 | Christmas | christmas, santa, tree, gomer | 6
6 | Finance | cash, amazon, payday | 7
7 | Trash | dumb, rated, adult | 8, 10
8 | Comics | comic, seishiro | 9
9 | Thanksgiving | thankful, food, dinner, shopping | 11
10 | Movies | movie, harry, potter, rent | 13
11 | IT | code, project, server, release, wordpress | 16
12 | Iraq War | war, iraq, media, bush, american | 17
13 | Cars | car, drive, money, kids | 18
14 | Books | book, art, science, donor | 19
15 | Others | dating, cards, nba, napolean, halloween, homework, hair | 1, 12, 14, 15, 20

Table 3. Hashtags Used to Crawl Tweet Dataset

Purpose | Hashtags
Occupy Wall Street | #occupywallstreet, #ows, #occupyboston, #occupywallst, #occupytogether, #teaparty, #nypd, #occupydc, #occupyla, #usdor, #occupyboston, #occupysf, #solidarity, #citizenradio, #gop, #sep17, #occupychicago, #15o

and distributed by the University of Glasgow. It contains a crawl of blog feeds, and associated permalink³ and homepage documents (from late 2005 and early 2006). We took 10,000 permalink documents for our study. We extracted 339 distinct blogs from these documents. We use a small dataset for ease of manually computing the concepts and also verifying the identified relationships.

Tweets Dataset. Using an existing Twitter event monitoring system, TweetTracker [18], we collected tweets discussing the Occupy Wall Street (OWS) movement. The tweets were crawled over the course of about three months, from September 14, 2011 to December 13, 2011. A full list of the parameters used is presented in Table 3. The dataset consists of 8,292,822 tweets generated by 852,240 unique users. As in the case of blogs, we identify a smaller set of tweets whose topic can be determined and evaluated manually. These are the tweets posted on days when the daily tweet count exceeded the average daily tweet traffic (102,380) by more than one standard deviation (77,287). More information on how these days are identified is presented later in the paper. In total there were 10 such days in our dataset, which spans 90 days of the OWS movement.
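The day-selection rule can be sketched in a few lines of Python. This is a minimal illustration, not the code used in the study: the function name `high_traffic_days`, the toy counts, and the choice of the population standard deviation are all assumptions. With the figures reported above, the cutoff works out to roughly 102,380 + 77,287 = 179,667 tweets per day.

```python
from statistics import mean, pstdev


def high_traffic_days(daily_counts):
    """Return the days whose tweet count exceeds mean + one standard deviation."""
    counts = list(daily_counts.values())
    # Assumption: population standard deviation over all days considered.
    # The text does not say whether a sample or population estimate was used.
    threshold = mean(counts) + pstdev(counts)
    return [day for day, n in daily_counts.items() if n > threshold]


# Hypothetical toy counts, not taken from the dataset.
toy_counts = {"day1": 100, "day2": 100, "day3": 100, "day4": 100, "day5": 400}
flagged = high_traffic_days(toy_counts)
```

In the toy data the mean is 160 and the standard deviation is 120, so only the day with 400 tweets clears the cutoff of 280.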

4.2 Topic Extraction

For clustering the blogs extracted from the TREC dataset, we used an LDA implementation where the number of concepts or topics was set to 20. We presented the top 15 words from each cluster's word distribution to a small group of human evaluators and asked them to manually map the clusters to real-world concepts. We picked the most common topic as the single best label for a cluster. A few clusters to which no clear semantic label could be assigned were mapped to a generic concept called Others.

³ A permalink, or permanent link, is a URL that points to a specific blog or forum entry after it has passed from the front page to the archives.
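The label-selection step in Section 4.2 amounts to a majority vote over the evaluators' answers. The sketch below shows one way to express it; the function name `consensus_label` is hypothetical, and treating a tie for first place as "no clear label" is an assumption, since the text does not say how ties were resolved.

```python
from collections import Counter


def consensus_label(labels, fallback="Others"):
    """Pick the most common evaluator label for one cluster."""
    # Ignore empty answers (an evaluator who could not name a concept).
    votes = Counter(label for label in labels if label)
    if not votes:
        return fallback
    ranked = votes.most_common(2)
    top_label, top_count = ranked[0]
    # Assumption: a tie for first place means no clear semantic label.
    if len(ranked) > 1 and ranked[1][1] == top_count:
        return fallback
    return top_label


label = consensus_label(["Gambling", "Gambling", "Finance"])
```

Clusters that fall through to the fallback correspond to the generic concept in the last row of Table 2.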

4.3 Learning Profiles

Fig. 3. Entropy of Blogs

Figure 3 plots the average number of posts for blogs having entropy within a given interval. We can see that blogs with a relatively small number of posts can also have high entropy, while blogs with a large number of posts can have low entropy values. We have manually verified that blogs with high post frequency but low entropy show high weights for a few concepts in our list, while blogs with high entropy map to several concepts but with lower strengths. Bloggers who are devoted to a few concepts can be classified as specialists, and bloggers whose blogs show high entropy can be classified as generalists. However, the classification task becomes difficult if the blogger has only a few posts for a concept.
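To make the specialist/generalist distinction concrete, the sketch below computes the entropy of a blog's concept weights. It assumes Shannon entropy (base 2) over the normalized weights, which this section does not state explicitly, and the cutoff `threshold=1.0` is a purely hypothetical value chosen for illustration.

```python
import math


def topic_entropy(weights):
    """Shannon entropy (in bits) of a blog's normalized concept weights."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must contain at least one positive value")
    probs = [w / total for w in weights if w > 0]
    return -sum(p * math.log2(p) for p in probs)


def blogger_type(weights, threshold=1.0):
    """Label a blogger by entropy; the threshold is an illustrative assumption."""
    return "generalist" if topic_entropy(weights) > threshold else "specialist"


even_blog = [1, 1, 1, 1]   # weight spread evenly over four concepts
focused_blog = [9, 1]      # weight concentrated on one concept
```

A blog spread evenly over four concepts has an entropy of 2 bits, while one that puts 90% of its weight on a single concept stays below 0.5 bits.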

4.4 Mapping Tweets to Topics

As our Twitter dataset is related to the Occupy Wall Street movement, the Purpose of the dataset was pre-determined and hence we did not attempt to learn it. We use LDA with Gibbs sampling to discover a coherent topic from the tweets of each selected day. The top 15 words describing these topics are presented in Table 4. By analyzing these topics we find a clear indication of the significant events associated with the OWS movement. For example, around 100 protesters were arrested by the police in Boston on October 11th, and the prominent keywords “boston, police, occupy, wall, street” in the topic for Oct 11 indicate this clearly. To check the veracity of the extracted events, we compared them to the human-edited OWS timeline published on Wikipedia.
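For readers unfamiliar with Gibbs-sampling LDA, the following is a compact, generic collapsed Gibbs sampler written with the standard library only. It is a sketch of the general technique, not the implementation used for Table 4: the function names, the hyperparameters `alpha` and `beta`, the iteration count, and the pre-tokenized input format are all assumptions.

```python
import random
from collections import Counter


def gibbs_lda(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA over tokenized documents."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * num_topics for _ in docs]         # tokens per (doc, topic)
    topic_word = [Counter() for _ in range(num_topics)]  # tokens per (topic, word)
    topic_total = [0] * num_topics                       # tokens per topic
    assignments = []

    # Random initial topic assignment for every token.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_d.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assignments.append(z_d)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                k = assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Unnormalized conditional probability of each topic.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                # Resample the topic and restore the counts.
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1
    return topic_word, doc_topic


def top_words(word_counts, k=15):
    """Return the k most frequent words assigned to one topic."""
    return [w for w, c in word_counts.most_common() if c > 0][:k]
```

Each tweet would be passed in as a list of tokens, and calling `top_words` on a topic's word counts yields a keyword list of the kind shown in Table 4. How a single topic was chosen for each day is not specified here, so that step is left out.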

Tweet Dataset Mapping with OWS Timeline. To ensure that our approach accurately presents relevant dates to the user, we compare the important dates generated by our method with those known to be distinguished dates in the Occupy movement. Our

Table 4. Topics identified using LDA for the important days

Date | Topic
Oct 11 | boston, police, occupy, wall, street, people, protesters, movement, protest, news, obama, media, jobs, 99, arrested
Oct 13 | wall, occupy, street, protesters, mayor, bloomberg, park, eviction, defend, tomorrow, people, support, zuccotti, call, police
Oct 14 | wall, street, occupy, park, people, protesters, police, movement, tomorrow, global, live, support, today, news, 99
Oct 15 | square, times, people, police, occupy, protesters, street, wall, world, nypd, live, movement, arrested, today, sq
Oct 16 | occupy, arrested, wall, people, street, protesters, movement, police, protests, protest, obama, support, west, world, times
Oct 17 | occupy, wall, street, people, movement, obama, protesters, support, police, protests, protest, world, 99, nypd, party
Oct 18 | occupy, wall, street, people, obama, protesters, movement, debate, cain, support, protest, 99, romney, protests, party
Oct 26 | police, oakland, occupy, protesters, people, street, march, gas, wall, tear, tonight, movement, live, support, cops
Nov 15 | park, police, nypd, protesters, zuccotti, occupy, press, nyc, people, live, street, wall, eviction, bloomberg, raid
Nov 30 | lapd, police, live, free, reading, psychic, media, people, protesters, occupy, arrested, city, cops, calling, eviction

Table 5. List of dates with tweets exceeding one standard deviation and the Wikipedia article’s explanation of the day’s events. Average daily tweets: 102,380.52, Standard Deviation: 77,287.02.

Date | #Tweets | Wikipedia Justification
2011-10-11 | 186,816 | Date not mentioned in article.
2011-10-13 | 200,835 | Mayor Bloomberg tells protesters to leave Zuccotti Park so that it can be cleaned.
2011-10-14 | 228,084 | Zuccotti Park cleaning postponed.
2011-10-15 | 376,660 | Protesters march on military recruitment office to protest military spending.
2011-10-16 | 194,421 | President Obama issues support for the OWS movement.
2011-10-17 | 193,332 | Journalist fired for supporting OWS movement.
2011-10-18 | 185,743 | Date not mentioned in article.
2011-10-26 | 220,571 | Scott Olsen, Military veteran and OWS protester, hospitalized by police at OWS event.
2011-11-15 | 488,439 | NYPD attempts to clear Zuccotti Park.
2011-11-30 | 203,846 | Police arrest protesters in Occupy Los Angeles.

method defines an important date as one on which the number of tweets generated exceeds the average number of tweets generated per day, taken over all of the dates considered for the movement, plus one standard deviation. To generate a list of important dates in the movement against which to compare our method, we scraped all of the dates mentioned in the Wikipedia article “Timeline of Occupy Wall Street”⁴. Table 5 shows the dates we identified as important alongside Wikipedia’s explanation for the importance of that date. In all but two instances, the dates we discovered match a landmark date discussed in the Wikipedia article.
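The comparison against the timeline reduces to a membership check of each flagged date in the set of timeline dates. The sketch below replays it with the dates from Table 5. The `timeline_dates` set contains only the dates that Table 5 reports as mentioned in the article, not the full scraped timeline, and the function name `compare_with_timeline` is hypothetical.

```python
# Dates flagged by the tweet-volume threshold (Table 5).
flagged_dates = [
    "2011-10-11", "2011-10-13", "2011-10-14", "2011-10-15", "2011-10-16",
    "2011-10-17", "2011-10-18", "2011-10-26", "2011-11-15", "2011-11-30",
]

# Assumption: only the timeline dates that Table 5 shows as mentioned.
timeline_dates = {
    "2011-10-13", "2011-10-14", "2011-10-15", "2011-10-16",
    "2011-10-17", "2011-10-26", "2011-11-15", "2011-11-30",
}


def compare_with_timeline(flagged, timeline):
    """Split flagged dates into those found in the timeline and those not."""
    matched = [d for d in flagged if d in timeline]
    unmatched = [d for d in flagged if d not in timeline]
    return matched, unmatched


matched, unmatched = compare_with_timeline(flagged_dates, timeline_dates)
```

Here eight of the ten flagged dates are matched, and the two unmatched dates are October 11 and October 18, the rows marked "Date not mentioned in article" in Table 5.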