Big Data Science: Fundamental,
Techniques, and Challenges
(Data Mining on Big Data)
2014. 6. 27.
By Neil Y. Yen
Presented by Incheon Paik
University of Aizu
Japan
Tutorial, IEEE SERVICE 2014
Anchorage, Alaska
• Growth of communication channels
e-mails (10-20 times / day), messenger (3-4 hours / day), social media (80% of a day), and more to come
• Power of social media
an integrated portal to interact with
• Changes in human behaviors
information-sharing, experience crowdsourcing, and knowledge cultivation
Background
Data everywhere
• Wikipedia (30 million pages; 80 million edits)
• YouTube (100 million videos; 150 million accesses)
• Blogosphere (250 million blogs; 500 million views) – the number of posts is decreasing
• Twitter (30 billion tweets; 5 million shares)
• Facebook (900 million objects; 250 million uploads)
• Yahoo! Answers / Chiebukuro (知恵袋) (1.7 billion questions; 900 million answers)
• Flickr (5 billion photos)
Background
Data everywhere
[Figure: timeline 1990, 2000, 2010, 2015 -; labels "Entry, Industry-oriented, Focus" and "Platform, Person-oriented, Comprehensive"]
Background
More Data
More Information
( a foreseeable growth of the Internet and media )
More Data
More Complexity
( scalability, integrity, consistency of data )
More Data
More Heterogeneity
Part I: Data Mining on Social network
• Introduction to social network
• Emerging models for social network
Data Mining on Social network
Introduction to social network
• Primary participants
• Individuals (or objects) – Node(s)
• Connections (or correlations) – Link(s)
• Social networks can be interpreted as “phenomenon derived by individuals with diverse interactions among them.”
• From another point of view, the Earth can be seen as an electronic nervous system, implemented as a conceptual network of nodes and links:
• nodes such as laptops, smartphones, satellites, etc.
• links such as cable lines, signals, etc. that keep the nodes connected
• Communication networks: many non-identical components with diverse connections between them
Data Mining on Social network
Introduction to social network
• Consider many kinds of networks:
• social, technological, business, economic, content,…
• These networks tend to share certain informal properties:
• large scale, continual growth
• distributed, democratic growth: vertices “decide” who to link to
• mixture of local and long-distance connections
• abstract notions of distance: geographical, content, social,…
• Main concerns
• Do natural networks share quantitative universals?
• What would these universals be?
• How can they be well modeled, analyzed, and explained?
• These phenomena are studied within social network theory, and many of them can be explained through link analysis
Data Mining on Social network
Introduction to social network
• Connected participants:
• how many, and how large
• Network diameter:
• maximum (worst-case) or average
• exclude infinite distances? (disconnected components)
• the small-world phenomenon
• Clustering:
• to what extent links tend to cluster locally
• what is the balance between local and long-distance connections
• what roles do the two types of links play
• Degree distribution:
• what is the typical degree in the network
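The four measurements listed above (connected components, diameter, clustering, degree distribution) can be sketched on a toy graph. This is a minimal illustration, not a method from the tutorial: the adjacency-dict representation, the function names, and the six-node example graph are all assumptions. Infinite distances between disconnected components are excluded, which is one of the options the slide raises.

```python
from collections import deque


def bfs_distances(adj, source):
    """Shortest-path lengths (in hops) from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def connected_components(adj):
    """How many components there are, and how large each one is."""
    seen, components = set(), []
    for node in adj:
        if node not in seen:
            component = set(bfs_distances(adj, node))
            seen |= component
            components.append(component)
    return components


def diameter_stats(adj):
    """(worst-case, average) path length over reachable pairs only."""
    lengths = [d for s in adj
               for t, d in bfs_distances(adj, s).items() if t != s]
    return max(lengths), sum(lengths) / len(lengths)


def local_clustering(adj, node):
    """Fraction of a node's neighbor pairs that are themselves linked."""
    nbrs = list(adj[node])
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(1 for i in range(k) for j in range(i + 1, k)
                if nbrs[j] in adj[nbrs[i]])
    return 2.0 * links / (k * (k - 1))


def degree_histogram(adj):
    """Map each degree value to the number of nodes having it."""
    hist = {}
    for node in adj:
        degree = len(adj[node])
        hist[degree] = hist.get(degree, 0) + 1
    return hist


# Hypothetical toy graph: triangle a-b-c, a tail c-d, and a separate pair e-f.
adj = {
    "a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"},
    "e": {"f"}, "f": {"e"},
}
components = connected_components(adj)
worst_case, average = diameter_stats(adj)
```

On real networks the all-pairs BFS used here is too slow; sampling source nodes is a common workaround.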
Data Mining on Social network
Introduction to social network
• Probabilistic and/or statistical models for modeling the generation of network(s)
• Parameters of interest:
• network size
• degree of vertex
• the connections
• Statements are always statistical in nature:
• with high probability, the diameter is small
• on average, the degree distribution has a heavy tail
Data Mining on Social network
Emerging models for social network
• Random graphs – Erdős-Rényi model (1960):
• Few components and small diameter
• No high clustering and heavy-tailed degree distributions
• A well-studied and understood mathematical model in general case
• Random graphs – Watts & Strogatz model (1998):
• Few components, small diameter and high clustering
• No heavy-tailed degree distributions
• Scale-free Networks:
• Few components, small diameter and heavy-tailed distribution
• No high clustering
• Hierarchical networks:
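Two of the generative models above can be sketched in a few lines: an Erdős-Rényi random graph, and a preferential-attachment process, which is one common way to obtain a scale-free network. This is a simplified sketch under stated assumptions: the function names, the seeding with a small complete graph, and the parameter values are illustrative choices, not details from the tutorial.

```python
import random


def erdos_renyi(n, p, rng):
    """G(n, p): every pair of nodes is linked independently with probability p."""
    adj = {i: set() for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < p:
                adj[i].add(j)
                adj[j].add(i)
    return adj


def preferential_attachment(n, m, rng):
    """Each new node links to m distinct existing nodes, chosen with
    probability proportional to their current degree."""
    adj = {i: set() for i in range(n)}
    targets = []  # node i appears in this list deg(i) times
    # Seed with a complete graph on m + 1 nodes so every degree is nonzero.
    for i in range(m + 1):
        for j in range(i + 1, m + 1):
            adj[i].add(j)
            adj[j].add(i)
            targets += [i, j]
    for new in range(m + 1, n):
        chosen = set()
        while len(chosen) < m:
            chosen.add(rng.choice(targets))
        for old in chosen:
            adj[new].add(old)
            adj[old].add(new)
            targets += [new, old]
    return adj


rng = random.Random(0)  # fixed seed so runs are repeatable
er_graph = erdos_renyi(200, 0.02, rng)
pa_graph = preferential_attachment(200, 2, rng)

# Compare the largest degree in each graph; heavy tails show up as hubs.
er_max_degree = max(len(nbrs) for nbrs in er_graph.values())
pa_max_degree = max(len(nbrs) for nbrs in pa_graph.values())
```

Both graphs have a similar average degree (about 4), so any difference in the maximum degree reflects the shape of the degree distribution rather than the density.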
Data Mining on Social network
Emerging models for social network
Case I: The Internet
• Nodes: computers, routers
• Links: physical lines
[Figure labels: cluster, end device, connector]
Data Mining on Social network
Emerging models for social network
Case II: The Actor Network
• Nodes: actors
• Links: shared cast membership (two actors appeared in the same film)
[Figure: example films Flatliners (1990) and The River Wild (1994)]
Data Mining on Social network
Emerging models for social network
Case III: Co-authorship Network
• Nodes: authors
• Links: co-authoring or co-editing academic publications
Data Mining on Social network
Emerging models for social network
Case IV: Academic Citation Network
• Nodes: authors
• Links: citations of academic publications
Data Mining on Social network
Emerging models for social network
Case V: Food Network
R.J. Williams, N.D. Martinez, Nature (2000)
• Nodes: trophic species
• Links: interactions among selected trophic species
Case VI: Sexual Contact Network
Liljeros et al., Nature (2001)
• Nodes: human beings, categorized by sex
• Links: sexual relationships
Data Mining on Social network
Mining the social network(s)
• Heterogeneous, multi-relational data represented as a graph
• Nodes as objects
• Heterogeneous objects need to be considered
• Attributes of objects matter
• Consider sub-classes of objects and their corresponding labels
• Edges as links
• Different types of links may exist on the same graph
• Weighted graph, dual-weighted graph, or others
• Links represent relationships and interactions between objects
• What we ultimately want to know is the meaning of the links
• Understanding the meaning(s) of a link helps identify the relationship between objects
Data Mining on Social network
Mining the social network(s)
• Conventional approaches in machine learning and data mining assume that the data is “a random sample of homogeneous objects from a single relation”
• However, real-world datasets tend to be “multi-relational, heterogeneous, and semi-structured,” which differs substantially from the traditional assumptions
• Link mining is therefore an emerging field of research at the intersection of network and link analysis, hypertext and web mining, graph mining, relational learning, and inductive logic programming
• Simply put, it is a multi-disciplinary field of study, although most of its core concepts are derived from existing methods.
Data Mining on Social network
Mining the social network(s) – taxonomy of link mining
• Object-Related Tasks
• Link-based object ranking
• Link-based object classification
• Object clustering (group detection)
• Object identification (entity resolution)
• Link-Related Tasks
• Link prediction
• Link re-construction
Data Mining on Social network
Mining the social network(s) – methods to link mining
• Properties: Scale free [Barabasi '99], Clustering [Watts-Strogatz '98], Navigation [Adamic-Adar '03, Liben-Nowell '05], Bipartite cores [Kumar et al. '99], Network Motifs [Milo et al. '02], Communities [Newman '99], Conductance [Mihail-Papadimitriou '06], Hubs and authorities [Page et al. '98, Kleinberg '99]
• PageRank [Page et al. '99], Hyperlink-Induced Topic Search [Kleinberg '99], EigenRumor [Fujimura '05]
• Models: Preferential attachment [Barabasi '99], Small-world [Watts-Strogatz '98], Copying model [Kleinberg et al. '01], Heuristically optimized tradeoffs [Fabrikant et al. '02], Congestion [Mihail et al. '03], Searchability [Kleinberg '02], Bowtie [Broder et al. '00], Transit-stub [Zegura '97], Jellyfish [Tauro et al. '01]
• Path-Oriented: Neighborhood Selection, Swarm Intelligence
• Efficiency-Oriented: Greedy approach, SSON (Semantic Social Overlay Network)
Data Mining on Social network
Mining the social network(s)
• Link is defined as the relationship among data
• Two kinds of linked networks
• homogeneous vs. heterogeneous
• Homogeneous networks
• Single object type and single link type
• Single model social networks (e.g., friends)
• WWW: a collection of linked Web pages
• Heterogeneous networks
• Multiple object and link types
• Medical network: patients, doctors, disease, contacts, treatments
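One way to make the homogeneous vs. heterogeneous distinction concrete is a graph whose nodes and links both carry a type label. The sketch below is only an illustration: the `TypedGraph` class, the node names, and the link types such as `treated_by` are hypothetical and do not come from the tutorial.

```python
class TypedGraph:
    """Undirected graph whose nodes and links each carry a type label."""

    def __init__(self):
        self.node_type = {}   # node -> object type
        self.links = set()    # (u, link_type, v), stored in both directions

    def add_node(self, node, node_type):
        self.node_type[node] = node_type

    def add_link(self, u, link_type, v):
        self.links.add((u, link_type, v))
        self.links.add((v, link_type, u))

    def neighbors(self, node, link_type=None):
        """Neighbors of node, optionally restricted to one link type."""
        return {v for (u, t, v) in self.links
                if u == node and (link_type is None or t == link_type)}

    def is_homogeneous(self):
        """True when there is a single object type and a single link type."""
        object_types = set(self.node_type.values())
        link_types = {t for (_, t, _) in self.links}
        return len(object_types) <= 1 and len(link_types) <= 1


# Homogeneous: one object type (person), one link type (friend).
friends = TypedGraph()
for person in ("ann", "bob", "eve"):
    friends.add_node(person, "person")
friends.add_link("ann", "friend", "bob")
friends.add_link("bob", "friend", "eve")

# Heterogeneous: a hypothetical medical network with several types.
medical = TypedGraph()
medical.add_node("patient_1", "patient")
medical.add_node("dr_lee", "doctor")
medical.add_node("flu", "disease")
medical.add_node("rest", "treatment")
medical.add_link("patient_1", "treated_by", "dr_lee")
medical.add_link("patient_1", "diagnosed_with", "flu")
medical.add_link("flu", "treated_with", "rest")
```

Filtering `neighbors` by link type is what lets a mining method treat "diagnosed with" differently from "treated by", which a plain adjacency list cannot express.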
Data Mining on Social network
Mining the social network(s)
• Link-based ranking exploits the link structure of a graph to order or prioritize the set of objects within the graph
• Web information analysis
• PageRank and HITS are typical link-based ranking approaches
• Link-based ranking is considered a core technique in mining network structure (and likewise in social network analysis)
• It is applied to rank participants in terms of centrality
• Degree centrality vs. eigenvector/power centrality
• Rank objects relative to one or more relevant objects in the graph vs. rank objects over time in dynamic graphs
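The contrast between degree centrality and eigenvector centrality can be seen on a small example. The sketch below assumes an undirected, connected, non-bipartite graph (so plain power iteration converges); the graph, the function names, and the fixed iteration count are illustrative assumptions.

```python
def degree_centrality(adj):
    """Score each node by its number of links."""
    return {node: len(nbrs) for node, nbrs in adj.items()}


def eigenvector_centrality(adj, iterations=200):
    """Power iteration: a node's score is proportional to the sum of its
    neighbors' scores, so links to important nodes count for more."""
    x = {node: 1.0 for node in adj}
    for _ in range(iterations):
        nxt = {node: sum(x[nbr] for nbr in adj[node]) for node in adj}
        norm = sum(v * v for v in nxt.values()) ** 0.5
        x = {node: v / norm for node, v in nxt.items()}
    return x


# Hypothetical graph: triangle a-b-c with a chain c-d-e hanging off it.
adj = {
    "a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"},
    "d": {"c", "e"}, "e": {"d"},
}
degree = degree_centrality(adj)
eigen = eigenvector_centrality(adj)
```

Nodes `a` and `d` both have degree 2, yet `a` ends up with the higher eigenvector score because both of its neighbors are well connected, while one of `d`'s neighbors is the leaf `e`.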
Data Mining on Social network
Mining the social network(s) – the PageRank Algorithm by Brin & Page (1998)
• PageRank is essentially “citation counting”, but improves over simple counting
• Considering the indirect citations
• Smoothing of citations
• PageRank can also be interpreted as random surfing
[Figure: pages A, B, C, D with scores P(A), P(B), P(C), P(D); arrows labeled "deriving from" and "referring to"]
Data Mining on Social network
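Mining the social network(s) – the PageRank Algorithm by Brin & Page (1998)
Random surfing model: at any page,
• with prob. $\alpha$, randomly jump to a page
• with prob. $(1 - \alpha)$, randomly pick a link to follow

"Transition matrix" for the example graph over pages d1, d2, d3, d4:
$$M = \begin{pmatrix} 0 & 0 & 1/2 & 1/2 \\ 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 1/2 & 1/2 & 0 & 0 \end{pmatrix}$$

$$p_{t+1}(d_i) = (1 - \alpha) \sum_{d_j \in IN(d_i)} m_{ji}\, p_t(d_j) + \alpha \sum_{k=1}^{N} \frac{1}{N}\, p_t(d_k)$$
(the second term is the same as $\alpha / N$)

Stationary ("stable") distribution, so we ignore time:
$$p(d_i) = \sum_{k=1}^{N} \left[ \frac{1}{N}\alpha + (1 - \alpha)\, m_{ki} \right] p(d_k)$$
$$\vec{p} = (\alpha I + (1 - \alpha) M)^{T}\, \vec{p}, \qquad I_{ij} = 1/N$$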
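The random surfing model can be run directly as a power iteration. The sketch below uses the four-page transition matrix from this slide (rows d1 to d4) and writes the jump probability as `alpha`; the value 0.15, the tolerance, and the function name are assumptions made for illustration.

```python
def pagerank(M, alpha=0.15, tol=1e-12, max_iter=1000):
    """Iterate p(d_i) = alpha / N + (1 - alpha) * sum_k m_ki * p(d_k).

    M is row-stochastic: M[k][i] is the probability of following a link
    from page k to page i. alpha is the random-jump probability.
    """
    n = len(M)
    p = [1.0 / n] * n
    for _ in range(max_iter):
        new_p = [
            alpha / n + (1 - alpha) * sum(M[k][i] * p[k] for k in range(n))
            for i in range(n)
        ]
        if sum(abs(a - b) for a, b in zip(new_p, p)) < tol:
            return new_p
        p = new_p
    return p


# Transition matrix of the example graph: d1 -> {d3, d4}, d2 -> d1,
# d3 -> d2, d4 -> {d1, d2}.
M = [
    [0.0, 0.0, 0.5, 0.5],
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.5, 0.5, 0.0, 0.0],
]
scores = pagerank(M)
ranking = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
```

Because every row of `M` sums to 1 and the start vector sums to 1, the scores remain a probability distribution at every step, which is why the jump term can be written simply as `alpha / n`.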
Data Mining on Social network
Mining the social network(s)
• Another task, link-based classification, predicts the category of an object based on its attributes, its links, and the attributes of linked objects in the graph(s)
• Here we may need to take multi-modal, multi-layered graphs and their corresponding attributes into account when designing methods for link and object mining
• Web: predict the category of a web page, based on words that occur on the page, links between pages, anchor text, HTML tags, etc.
• Citation: predict the topic of a paper, based on word occurrence, citations, co-citations
• Communication: predict whether a communication contact is by email, phone call or mail
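A very small version of link-based classification combines two kinds of evidence: the object's own attributes and the known labels of the objects it links to. The sketch below is a simple additive vote, assumed here for illustration; the keyword sets, paper ids, and the `link_weight` parameter are hypothetical, and practical systems typically use learned models instead.

```python
from collections import Counter


def classify(node, words, links, labels, keywords, link_weight=1.0):
    """Pick the category with the most combined attribute and link evidence."""
    scores = Counter()
    for category, vocabulary in keywords.items():
        # Attribute evidence: words on the object that match the category.
        scores[category] += len(words[node] & vocabulary)
    for neighbor in links.get(node, ()):
        if neighbor in labels:
            # Link evidence: each labeled neighbor votes for its own category.
            scores[labels[neighbor]] += link_weight
    # Ties are broken alphabetically so the result is deterministic.
    return max(sorted(scores), key=lambda category: scores[category])


# Hypothetical citation example: paper p4 cites three labeled papers.
keywords = {
    "data mining": {"graph", "mining", "ranking"},
    "databases": {"query", "index", "transaction"},
}
labels = {"p1": "data mining", "p2": "data mining", "p3": "data mining"}
words = {"p4": {"index", "query", "graph"}}
links = {"p4": ["p1", "p2", "p3"]}

with_links = classify("p4", words, links, labels, keywords)
words_only = classify("p4", words, links, labels, keywords, link_weight=0.0)
```

In this example the words alone point to one topic while the citations point to another, which is the situation where using link information changes the prediction.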
Data Mining on Social network
Mining the social network(s)
• Group detection
Cluster the nodes in the graph into groups that share common characteristics
• Web – Identifying communities
• Citation – identifying research communities
• Entity resolution
To predict when two objects are the same, based on their attributes and their links
• Web – predict when two sites are mirrors of each other
• Citation – predicting when two citations are referring to the same paper
• Epidemics – predicting when two disease strains are the same or similar
Data Mining on Social network
Mining the social network(s)
• Entity resolution has traditionally been treated as a pair-wise resolution problem: each pair of references is resolved based on the similarity of their attributes (e.g., association rules or models from data mining)
• All these methods consider the importance of links
• Links in entity resolution
• Collective resolution: one resolution decision affects another if a connection exists between them
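The difference between attribute-only and link-aware resolution can be sketched with a pair-wise score that mixes attribute similarity with the overlap of linked objects. This is an illustrative assumption, not the tutorial's method: Jaccard similarity, the equal weighting, and the author-reference data are all made up for the example, and true collective resolution would additionally propagate decisions between pairs.

```python
def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def match_score(x, y, attrs, links, link_weight=0.5):
    """Blend attribute similarity with similarity of linked objects."""
    attr_sim = jaccard(attrs[x], attrs[y])
    link_sim = jaccard(links[x], links[y])
    return (1 - link_weight) * attr_sim + link_weight * link_sim


# Hypothetical author references: name tokens as attributes, co-authors as links.
attrs = {
    "r1": {"j", "smith"},
    "r2": {"john", "smith"},
    "r3": {"j", "smith"},
}
links = {
    "r1": {"lee", "wang"},
    "r2": {"lee", "wang"},
    "r3": {"garcia"},
}

same_coauthors = match_score("r1", "r2", attrs, links)
same_name_only = match_score("r1", "r3", attrs, links)
```

On attributes alone, `r1` and `r3` look identical; once co-author links are included, `r1` and `r2` become the more likely match.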
Data Mining on Social network
Mining the social network(s) – link prediction
• Predict whether a relationship exists between two participants in the graph, based on their attributes and all correlated links
• Web: predict if there will be a link between two pages
• Citation: predicting if a paper will cite another paper
• Epidemics: predicting who a patient’s contacts are
• Applied Methods
• Often viewed as a binary classification problem
• Local conditional probability model, based on structural and attribute features
• Difficulty: sparseness of existing links
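Structural features for link prediction can be as simple as counting shared neighbors. The sketch below scores every currently unlinked pair with two well-known heuristics, common neighbors and the Adamic-Adar index, and ranks the candidates; it is a stand-in for the feature step only (no classifier is trained), and the toy graph and function names are assumptions.

```python
import math
from itertools import combinations


def common_neighbors(adj, u, v):
    """Number of nodes linked to both u and v."""
    return len(adj[u] & adj[v])


def adamic_adar(adj, u, v):
    """Shared neighbors weighted by 1 / log(degree): rare neighbors count more."""
    return sum(1.0 / math.log(len(adj[w]))
               for w in adj[u] & adj[v] if len(adj[w]) > 1)


def rank_candidate_links(adj, score):
    """Order all currently unlinked pairs from most to least likely."""
    candidates = [(u, v) for u, v in combinations(sorted(adj), 2)
                  if v not in adj[u]]
    return sorted(candidates, key=lambda pair: score(adj, *pair), reverse=True)


# Hypothetical undirected graph.
adj = {
    "a": {"b", "c"}, "b": {"a", "c", "d"}, "c": {"a", "b", "d"},
    "d": {"b", "c", "e"}, "e": {"d"},
}
ranked = rank_candidate_links(adj, common_neighbors)
best_pair = ranked[0]
```

Such scores can be used directly as a ranking, or fed as features into the binary classifier mentioned above; the sparseness problem shows up as most candidate pairs scoring zero.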
Data Mining on Social network
Mining the social network(s) – link estimation
• Predict the number of links of a connected participant
• Web: predict the authority of a page based on the number of in-links; identify hubs based on the number of out-links
• Citation: predict the impact of a paper based on the number of citations
• Epidemics: predict the number of people that will be infected based on the infectiousness of a disease
• Predict the number of participants reachable by a given participant
• Web: predict the number of pages retrieved by crawling a site
• Citation: predict the number of citations of a particular author in a specific journal
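The quantities being estimated here (in-links, out-links, reachable participants) are easy to compute when the graph is fully observed. The sketch below only counts them on a small directed graph; predicting them for unseen or partially observed graphs would need a model on top. The site structure, page names, and function names are hypothetical.

```python
from collections import deque


def in_out_degrees(out_links):
    """Return (in-degree, out-degree) dictionaries for a directed graph."""
    in_degree = {node: 0 for node in out_links}
    for source, targets in out_links.items():
        for target in targets:
            in_degree[target] = in_degree.get(target, 0) + 1
    out_degree = {node: len(targets) for node, targets in out_links.items()}
    return in_degree, out_degree


def reachable_count(out_links, start):
    """Number of other nodes a crawl starting at `start` would retrieve."""
    seen = {start}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in out_links.get(u, ()):
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return len(seen) - 1


# Hypothetical web site: page -> pages it links to.
out_links = {
    "home": ["about", "blog"],
    "about": ["home"],
    "blog": ["post1", "post2", "home"],
    "post1": ["blog"],
    "post2": [],
    "orphan": ["home"],
}
in_degree, out_degree = in_out_degrees(out_links)
pages_from_home = reachable_count(out_links, "home")
```

Here a high in-degree plays the role of authority and a high out-degree the role of a hub, matching the Web example in the list above.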
Conclusion
• Big Data
Big Opportunity? or Big Problem?
What is your target or objective?
How will it be done? – but do not forget the human
• Striking a balance is a challenging issue…
Infrastructure (storage), Management (governance, analysis), Search (value discovery), Security (transparency vs. privacy), Applications (human-centered)
• Next … ?
building strategic alliances – industry, academia, and government worldwide
creating opportunities rather than waiting to find them