SSIIM - Seminários de Sistemas Inteligentes, Interacção e Multimédia, MIEIC
Social Network Mining
Eduarda Mendes Rodrigues
Assistant Professor
DEI-FEUP, Universidade do Porto
http://www.fe.up.pt/~eduarda [email protected]
Social Media Landscape
• People
– the individual is at the center of the social web
• Social media networks
– explicit and implicit social ties
– interaction among millions of people
• User-generated content
– rich source of collective knowledge
– diffusion of information and opinions
Information Retrieval and Social Media
• Properties of social media
– Scale: millions of active users, millions of posts per day
– Real-time: breaking news, information novelty
– Duplicates: information diffusion (re-tweets, cross-posts, etc.)
– Content quality: spelling, grammar, punctuation, emoticons, etc.
– Social fabric: information credibility, opinion leaders, topic experts
• Some challenges: relevance and ranking
– Social vs. non-social content
– Novelty detection
Information Credibility
• Several newspapers picked up the fake photos
• Wrongly indexed by search engines based on the news stories
Social Media Mining
…and patterns are left behind!
Social Media Mining
Can social network analysis enrich the content analysis?
Can the content analysis help explain the social network structure and dynamics?
Content Analysis
§ text features
§ topic analysis
§ clustering and classification
Social Network Analysis
§ user activity statistics
§ interaction patterns
§ social network metrics
§ community detection
§ visualization
Current Research
• Data mining and IR in social media
– social network mining
– text classification, opinion mining
– micro-blog search
• Network visualization
– layout and clustering algorithms
– design of interactive tools
• Data journalism
– information extraction from news
– real-time social media analytics
Social Media Networks
• Explicit social ties
– Friends on Facebook
– Followers on Twitter
– Professional contacts on LinkedIn
– ...
• Implicit social ties
– Like, favorite, repin
– Reply, retweet, share
– Comment, review
– Tag, rate, vote
Implicit Networks for Social Media Mining
• Discussion groups (usenet newsgroups)
– Can we identify posts with answers in Q&A groups?
– Can we predict agreement and disagreement in debate groups?
• Community Q&A
– What types of questions are posted?
Discussion Group Communities
• Discussion groups are extremely valuable sources of information
• Identifying the polarity of people’s opinions about certain topics is useful for business intelligence
• People seeking information through newsgroup search may want to be pointed at answers to their questions
Implicit Networks in Discussion Groups
[Figure: a discussion thread, its thread structure, and the resulting social network graph with weighted links (e.g. w=2)]
Mining Patterns of Social Interaction
Author Networks
• Reply-to Network: connects authors who reply to other authors
• Thread Participation Network: connects authors who co-participate in threads
• Text Similarity Network: connects authors of similar content
Thread Networks
• Common Authors Network: connects threads that have common authors
• Text Similarity Network: connects threads of similar content
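The author networks listed above can be sketched in a few lines. The following is a minimal illustration, not the method used in the cited work: the message tuples, author names and variable names are all hypothetical, and link weights are assumed to be simple counts (number of replies, number of shared threads).

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy data: (message id, thread id, author, id of the message replied to).
messages = [
    ("m1", "t1", "ana", None),
    ("m2", "t1", "bob", "m1"),
    ("m3", "t1", "ana", "m2"),
    ("m4", "t1", "bob", "m3"),
    ("m5", "t2", "carl", None),
    ("m6", "t2", "ana", "m5"),
]

author_of = {mid: author for mid, _, author, _ in messages}

# Reply-to network: directed link (replier -> replied-to author), weighted by reply count.
reply_to = Counter()
for mid, _, author, parent in messages:
    if parent is not None and author_of[parent] != author:
        reply_to[(author, author_of[parent])] += 1

# Thread participation network: undirected link between authors who post in the
# same thread, weighted by the number of threads they share.
authors_in_thread = {}
for _, thread, author, _ in messages:
    authors_in_thread.setdefault(thread, set()).add(author)

co_participation = Counter()
for authors in authors_in_thread.values():
    for a, b in combinations(sorted(authors), 2):
        co_participation[(a, b)] += 1

print(dict(reply_to))
print(dict(co_participation))
```

Here "bob" replies to "ana" twice, so that reply-to link gets weight 2, while each pair of co-participating authors shares exactly one thread.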
Feature Sets
Supervised Learning (Linear SVM)
Message Categories
§ Agreement, Disagreement, Insult
§ Question, Answer
B. Fortuna, E. Mendes Rodrigues, N. Milic-Frayling. Improving the Classification of Newsgroup Messages through Social Network Analysis. ACM 16th Intl. Conf. on Information and Knowledge Management, CIKM 2007.
Mining Patterns of Social Interaction
Topic Experts
Reply-to network at distance 2 for the most prolific authors of the talk.politics.guns (LEFT) and microsoft.public.internetexplorer.general (RIGHT) newsgroups.
Analysis of CQA Communities
• CQA services aim to build a large knowledge base of questions and answers, on any topic, and make it available through search
Challenge: content quality!
[Figure: a question and its answers, labelled with years between 2002 and 2010]
Is the community sharing knowledge?
User Intent & Question Types
Mendes Rodrigues, E., Milic-Frayling, N., Sharing Knowledge or Socializing? Characterizing User Intent in Community Question Answering, Proceedings of the 2009 ACM International Conference on Information and Knowledge Management, CIKM ’09.
Mining Question Types
• Automatic classification problem
– Social vs. non-social questions
• Feature sets
– Question features
Content (tf.idf scores for single terms and n-grams), message length
– Thread features
Responsiveness, user participation, presence of URLs in answers
– Tags and topic features
Aggregate information about specificity of tag or topic
– Social network features for users involved in the thread
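As an illustration of the content features, tf.idf weights for single terms can be computed as below. This is a sketch under stated assumptions: the toy questions, the whitespace tokenizer and the weighting variant (raw term frequency times log(N / df)) are choices made for the example, not details given in the slides.

```python
import math
from collections import Counter

# Hypothetical toy questions; texts are illustrative only.
questions = [
    "how do i install python on windows",
    "what is your favorite movie",
    "how do i fix a python import error",
]

def tokenize(text):
    # Assumption: lowercase whitespace tokenization is enough for the example.
    return text.lower().split()

def tf_idf(docs):
    """Return one {term: weight} dict per document, weight = tf * log(N / df)."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))  # document frequency: count each term once per document
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return vectors

vectors = tf_idf(questions)
message_lengths = [len(tokenize(q)) for q in questions]
```

Terms that occur in few questions (such as "movie") receive a higher weight than terms shared by several questions (such as "python"). The resulting sparse vectors, together with the message length, could then be passed to any standard classifier.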
Social Network Structure
• The community ecosystem evolved in a way that encouraged interactions of a social nature
– 84.5% of questions are non-social and 6.5% are social
– Over time, the percentage of social questions and respective answers and comments increased significantly
• How social are individual users?
• Social score:
– S(u) = |social| / |non-social|
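The social score can be computed directly from per-user counts. In this sketch, |social| and |non-social| are assumed to be the numbers of social and non-social questions attributed to user u; the user ids and counts are hypothetical, and returning None when a user has no non-social questions is a choice made here, since the slide does not say how that case is handled.

```python
def social_score(n_social, n_non_social):
    """S(u) = |social| / |non-social|; None when the ratio is undefined."""
    if n_non_social == 0:
        return None  # assumption: undefined case is not specified in the source
    return n_social / n_non_social

# Hypothetical per-user counts: (social questions, non-social questions).
counts = {"u1": (3, 12), "u2": (0, 5), "u3": (4, 0)}
scores = {user: social_score(s, n) for user, (s, n) in counts.items()}
print(scores)
```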
Social Network Structure
• Users with high degree post a large percentage of social questions
• Users who answer and comment on social threads have dense in-neighborhoods
Social Network Analysis
• Mapping and measurement of relationships and flows between entities that include people
• Views social relationships in terms of network theory consisting of nodes and links
– node: “actor” on which relationships act
– link: relationship connecting nodes
Social Network Analysis
Social network graphs can be analysed using a number of metrics including:
• cohesion of the network or sub-network
measures the ease with which connections can be made
• density of the network or sub-network
measures the robustness of the connections
• centrality of the nodes
gives a rough indication of the social power of a node in the network
- degree
- betweenness
- closeness
Degree Centrality
Count of the number of links to other nodes in the network
Higher degree of a node might indicate that the node is a hub in the network
Most connected does not mean most powerful!
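Degree centrality amounts to counting each node's distinct neighbors. The sketch below uses a hypothetical undirected graph (two triangles joined through node "d") and treats links as unweighted.

```python
from collections import defaultdict

# Hypothetical undirected graph: triangles a-b-c and e-f-g, bridged by node d.
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d"),
         ("d", "e"), ("e", "f"), ("e", "g"), ("f", "g")]

def degree_centrality(edges):
    """Map each node to the number of distinct nodes it is linked to."""
    neighbors = defaultdict(set)
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)
    return {node: len(adj) for node, adj in neighbors.items()}

degree = degree_centrality(edges)
print(degree)
```

In this graph "c" and "e" have the highest degree (3), while "d" has degree 2 even though every path between the two triangles passes through it, which illustrates why the most connected node is not necessarily the most powerful.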
Betweenness Centrality
Number of shortest paths between each node pair that a node is on
Boundary spanners that bridge between groups have high betweenness
High betweenness generally indicates a powerful position in the network!
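One common way to compute betweenness on an unweighted graph is Brandes' algorithm, sketched below for undirected links. The example graph is hypothetical (two triangles bridged by node "d"), and the scores are left unnormalized: each unordered node pair contributes the fraction of its shortest paths that pass through the node.

```python
from collections import defaultdict, deque

def betweenness_centrality(edges):
    """Unnormalized betweenness for an undirected, unweighted graph (Brandes)."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    score = {node: 0.0 for node in adj}
    for source in adj:
        # BFS from source: count shortest paths (sigma) and record predecessors.
        sigma = {node: 0 for node in adj}
        sigma[source] = 1
        dist = {source: 0}
        preds = defaultdict(list)
        order = []
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate path dependencies from the farthest nodes back to the source.
        delta = {node: 0.0 for node in adj}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != source:
                score[w] += delta[w]
    # Every unordered pair was visited from both endpoints, so halve the totals.
    return {node: value / 2 for node, value in score.items()}

# Hypothetical graph: triangles a-b-c and e-f-g, bridged by node d.
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d"),
         ("d", "e"), ("e", "f"), ("e", "g"), ("f", "g")]
betweenness = betweenness_centrality(edges)
print(betweenness)
```

Node "d" is the boundary spanner here: all nine shortest paths between the two triangles pass through it, so it gets the highest score despite having only two links.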
Closeness Centrality
Mean shortest path between a node and all other nodes in the network reachable from it
Reflects the ability of a node in accessing information through the network
Low closeness generally indicates high visibility of what’s going on in the network!
© Will Ockenden
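Following the definition above (mean shortest path length to all reachable nodes, so lower is better), closeness can be sketched with one breadth-first search per node. The graph is a hypothetical example, links are treated as unweighted, and returning None for isolated nodes is an assumption of this sketch.

```python
from collections import defaultdict, deque

def mean_shortest_path(edges):
    """Map each node to its mean shortest-path length to the nodes reachable from it."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    result = {}
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:  # breadth-first search gives shortest hop counts
            v = queue.popleft()
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
        others = [d for node, d in dist.items() if node != source]
        result[source] = sum(others) / len(others) if others else None
    return result

# Hypothetical graph: triangles a-b-c and e-f-g, bridged by node d.
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d"),
         ("d", "e"), ("e", "f"), ("e", "g"), ("f", "g")]
closeness = mean_shortest_path(edges)
print(closeness)
```

Note that some libraries report the reciprocal of this quantity, in which case higher values indicate a more central node.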
Centrality Measures and Node Roles
Social network graph
• Peripheral – below average centrality (C)
• Central connector – above average centrality (D)
Visual Signatures of Social Roles
Answerer
• Outward links to local isolates
• Relative absence of triangles
• Few intense links
Connector
• Links from local isolates often inward only
• Dense, many triangles
• Numerous intense links
Originator
• Links from local isolates often inward only
• Sparse, few triangles
• Few intense links
Welser, H., Smith, M., Gleave, E. and Fisher, D. Visualizing the Signatures of Social Roles in Online Discussion Groups. Journal of Social Structure, vol. 8, 2007.
Network Visualization
Visualization should support knowledge discovery and communication
How good is a network visualization?
Ideally…
• Every node is visible
• The degree of every node can be counted
• It is possible to follow every link from source to destination
• Clusters and outliers are identifiable
NetViz Nirvana!!!
C. Dunne and B. Shneiderman, “Improving graph drawing readability by incorporating readability metrics: A software tool for network analysts,” University of Maryland, HCIL Tech Report HCIL-2009-13, May 2009.
How good is a network visualization?
Challenge: real networks are often very complex structures.
Interpretation of the network structure often requires visualizing additional information about the nodes and links.
Standard layout algorithms don’t help much when the size of the network is above a few hundred nodes and the network is relatively dense in the number of links.
Some Visualization Approaches
• Overview of the network
• Zoom and details on demand
• Dynamically filter nodes and links
• Integrate metrics and visualization
Network Analysis and Visualization Process Model
• Define Analysis Goals
• Collect Network Data
• Interpret Data
– Apply data filters
– Choose network layout
– Adjust visual properties
• Refining / adjusting goals after the first look at the data
• Analysis may require additional data
• Discovery may trigger further analyses
D. L. Hansen, D. Rotman, E. M. Bonsignore, N. Milic-Frayling, E. Mendes Rodrigues, M. Smith, and B. Shneiderman, “Do you know the way to SNA?: A process model for analyzing and visualizing social media data.” University of Maryland Tech Report HCIL-2009-17.
Flickr Related Tags Network – “Mouse”
[Figure: related-tags network for “mouse”, with regions labelled Computer and Mickey]
TEAM
Connected Action: Marc Smith
Microsoft Research: Natasa Milic-Frayling, Tony Capone
University of Porto: Eduarda Mendes Rodrigues
University of Maryland: Ben Shneiderman, Cody Dunne
Stanford University: Jure Leskovec
University of Washington: Eric Gleave
Cornell University: Vladimir Barash
NodeXL Project
Open source project at: http://nodexl.codeplex.com
Social Network Analysis add-in for MS Excel: makes graph theory as easy as a bar chart, with integrated analysis of social media sources.
REACTION Project
http://dmir.inesc-id.pt/project/Reaction
• Computational journalism
Intensive use of software tools for news research, production and presentation
• What is the impact on the routines of newsrooms?
• What effect will these tools have on the quality of news and the productivity of journalists?
Retrieval, Extraction and Aggregation Computing Technology for Integrating
Data Journalism – Implicit News Networks
• Information extraction from thousands of online news articles
• SAPO Labs developed NLP technology for Named Entity Recognition in news (Verbetes service)
• Relationship extraction based on co-occurrence
Pedro Passos Coelho,407,128
Silvio Berlusconi,271,106
Aníbal Cavaco Silva,234,98
…
'Paulo Bento' and 'Cristiano Ronaldo' co-occurred in 72 news articles
'Paulo Bento' and 'Bruno Alves' co-occurred in 39 news articles
'Paulo Bento' and 'Raul Meireles' co-occurred in 37 news articles
…
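Co-occurrence-based relationship extraction can be sketched as pair counting over the entities recognized in each article. The entity sets below are hypothetical stand-ins for the output of a named entity recognizer (the Verbetes service itself is not called here), and the counts are toy values, not the figures shown above.

```python
from collections import Counter
from itertools import combinations

# Hypothetical recognizer output: one set of named entities per news article.
articles = [
    {"Paulo Bento", "Cristiano Ronaldo", "Bruno Alves"},
    {"Paulo Bento", "Cristiano Ronaldo"},
    {"Paulo Bento", "Raul Meireles"},
    {"Pedro Passos Coelho"},
]

def count_cooccurrences(articles):
    """Count, for each entity pair, the number of articles mentioning both."""
    pairs = Counter()
    for entities in articles:
        # Sorting makes (a, b) and (b, a) map to the same key.
        for a, b in combinations(sorted(entities), 2):
            pairs[(a, b)] += 1
    return pairs

cooccurrences = count_cooccurrences(articles)
for (a, b), n in cooccurrences.most_common():
    print(f"'{a}' and '{b}' co-occurred in {n} news articles")
```

Each pair with a non-zero count becomes a weighted link in the implicit news network, with entities as nodes.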
Data Journalism – Implicit News Networks
• News social networks
– Named entity extraction
– Entity co-occurrences
– Interactive visualization
• Applications
– Investigative journalism
– Review of the week
– User engagement
Data Journalism – Opinion Mining
[Diagram components: TwitterEcho crawler; query; opinion mining module with a dictionary of names, a sentiment lexicon and a rule-based classifier; stats]
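A rule-based classifier that combines a dictionary of names with a sentiment lexicon might look like the sketch below. All resources and rules here are hypothetical: the slides do not give the project's dictionary, lexicon or rule set, so the single negation rule and the word lists are assumptions made for illustration.

```python
# Hypothetical resources; the actual dictionary, lexicon and rules are not in the slides.
NAMES = {"ronaldo": "Cristiano Ronaldo", "bento": "Paulo Bento"}
LEXICON = {"great": 1, "brilliant": 1, "awful": -1, "terrible": -1}
NEGATIONS = {"not", "never", "no"}

def classify(tweet):
    """Return (mentioned entities, polarity label) for one tweet."""
    tokens = [t.strip(".,!?#@").lower() for t in tweet.split()]
    entities = sorted({NAMES[t] for t in tokens if t in NAMES})
    score = 0
    for i, token in enumerate(tokens):
        polarity = LEXICON.get(token, 0)
        # Rule: a negation word directly before a sentiment word flips its polarity.
        if polarity and i > 0 and tokens[i - 1] in NEGATIONS:
            polarity = -polarity
        score += polarity
    label = "positive" if score > 0 else "negative" if score < 0 else "neutral"
    return entities, label

print(classify("Ronaldo was brilliant!"))
print(classify("Bento was not great"))
```

Aggregating the labels per entity over a stream of tweets would yield the kind of statistics shown at the end of the pipeline.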
Data Journalism - Twitteuro
• Real-time social media monitoring
– Big data crawling and analytics
– Entity extraction
– Interactive visualization
• Journalism applications
– Event reporting (#Euro 2012)
Project Themes
• Survey paper
– on mining social media data for business intelligence (e.g. brand management; targeted advertising; new product development)
– on opinion mining techniques for social media content and applications
– on community detection techniques for implicit social networks
• Social media visualization widgets
– visualization for tracking the propagation of Twitter memes
– spatio-temporal visualization of tweets with named entities