Tutorial, IEEE SERVICE 2014 Anchorage, Alaska

(1)

Big Data Science: Fundamental,

Techniques, and Challenges

(Data Mining on Big Data)

2014. 6. 27.

By Neil Y. Yen

Presented by Incheon Paik

University of Aizu

Japan


(2)

• Growth of communication channels

e-mails (10-20 times / day), messenger (3-4 hours / day), social media (80% of a day), and more to come

• Power of social media

an integrated portal to interact with

• Changes in human behaviors

information-sharing, experience crowdsourcing, and knowledge cultivation

Background

Data everywhere

(3)

• Wikipedia ( 30 million pages ; 80 million edits )

• YouTube ( 100 million videos ; 150 million accesses )

• Blogosphere ( 250 million blogs ; 500 million views ) – the number of posts is decreasing

• Twitter ( 30 billion tweets ; 5 million shares )

• Facebook ( 900 million objects ; 250 million uploads )

• Yahoo! Answers / 知恵袋 (Chiebukuro) ( 1.7 billion questions ; 900 million answers )

• Flickr ( 5 billion photos)


Background

Data everywhere

(4)

Background

Data everywhere

(5)

1990 2000 2010 2015 -

Entry, Industry-oriented, Focus

Platform, Person-oriented, Comprehensive

Background

(6)

More Data → More Information

( a foreseeable growth of the Internet and media )

More Data → More Complexity

( scalability, integrity, consistency of data )

More Data → More Heterogeneity

(7)

Part I: Data Mining on Social network

•

Introduction to social network

•

Emerging models for social network

(8)

Data Mining on Social network

Introduction to social network

• Primary participants

• Individuals (or objects) – Node(s)

• Connections (or correlations) – Link(s)

• Social networks can be interpreted as “a phenomenon derived from individuals with diverse interactions among them.”

(9)

Data Mining on Social network

• Primary participants

• Individuals (or objects) – Node(s)

• Connections (or correlations) – Link(s)

• From another point of view, the Earth is an electronic nervous system, implemented as a conceptual network of nodes and links:

• nodes such as laptops, smartphones, satellites, etc.

• links such as cable lines, signals, etc. that connect the nodes

• Communication networks: Many non-identical components with diverse connections between them

(10)

Data Mining on Social network

• Consider many kinds of networks:

• social, technological, business, economic, content,…

• These networks tend to share certain informal properties:

• large scale, continual growth

• distributed, democratic growth: vertices “decide” who to link to

• mixture of local and long-distance connections

• abstract notions of distance: geographical, content, social,…

• Main concerns

• Do natural networks share quantitative universals?

• What would these universals be?

• How can they be well modeled, analyzed, and explained?

• All these phenomena follow the theories of social networks, and can often be explained through link analysis

(11)

Data Mining on Social network

• Connected participants:

• how many, and how large

• Network diameter:

• maximum (worst-case) or average

• exclude infinite distances? (disconnected components)

• the small-world phenomenon

• Clustering:

• to what extent do links tend to cluster locally

• what is the balance between local and long-distance connections

• what roles do the two types of links play

• Degree distribution:

• what is the typical degree in the network
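The four measures above can be sketched on a toy graph. This is a minimal illustration under stated assumptions, not a reference implementation: the graph is undirected and stored as an adjacency dictionary, infinite distances between disconnected components are simply excluded, and all function names and the toy data are hypothetical.

```python
from collections import deque

def bfs_distances(adj, source):
    """Shortest-path lengths (in hops) from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def components(adj):
    """Connected components: how many, and how large."""
    seen, comps = set(), []
    for node in adj:
        if node not in seen:
            comp = set(bfs_distances(adj, node))
            seen |= comp
            comps.append(comp)
    return comps

def diameter(adj):
    """(worst-case, average) distance over connected pairs only."""
    lengths = [d for s in adj
               for t, d in bfs_distances(adj, s).items() if t != s]
    return max(lengths), sum(lengths) / len(lengths)

def clustering(adj, node):
    """Fraction of a node's neighbour pairs that are themselves linked."""
    nbrs = list(adj[node])
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(1 for i in range(k) for j in range(i + 1, k)
                if nbrs[j] in adj[nbrs[i]])
    return 2.0 * links / (k * (k - 1))

def degree_distribution(adj):
    """Map each degree to the number of nodes having that degree."""
    counts = {}
    for node in adj:
        d = len(adj[node])
        counts[d] = counts.get(d, 0) + 1
    return counts

# Toy undirected graph with two components (assumed data).
adj = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
    "e": {"f"},
    "f": {"e"},
}

print(len(components(adj)), diameter(adj), degree_distribution(adj))
```

On real networks the all-pairs search in `diameter` becomes expensive, so the average is usually estimated from a sample of source nodes.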

(12)

Data Mining on Social network

• Probabilistic and/or statistical models

to describe and manage how network(s) are generated

• Various parameters to be concerned:

• network size

• degree of vertex

• the connections

• Statements are always statistical in nature:

• with high probability, diameter is small

• on average, degree distribution has heavy tail

(13)

Part I: Data Mining on Social network


(14)

Emerging models for social network

• Random graphs – Erdős–Rényi model (1960):

• Few components and small diameter

• No high clustering and heavy-tailed degree distributions

• A well-studied and understood mathematical model in general case

• Random graphs – Watts & Strogatz model (1998):

• Few components, small diameter and high clustering

• No heavy-tailed degree distributions

• Scale-free Networks:

• Few components, small diameter and heavy-tailed distribution

• No high clustering

• Hierarchical networks:
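As a hedged illustration of the first model in the list above, the sketch below generates an Erdős–Rényi random graph in its G(n, p) form (each possible link is present independently with probability p) and measures two of the properties the slide mentions. The function names, the parameter values, and the seed are assumptions chosen for illustration only.

```python
import random

def erdos_renyi(n, p, seed=None):
    """G(n, p): include each of the n*(n-1)/2 possible links with probability p."""
    rng = random.Random(seed)
    adj = {v: set() for v in range(n)}
    for u in range(n):
        for v in range(u + 1, n):
            if rng.random() < p:
                adj[u].add(v)
                adj[v].add(u)
    return adj

def mean_degree(adj):
    """Average number of links per node; expected value is p * (n - 1)."""
    return sum(len(nbrs) for nbrs in adj.values()) / len(adj)

def average_clustering(adj):
    """Mean local clustering coefficient (nodes with degree < 2 count as 0)."""
    total = 0.0
    for nbrs in adj.values():
        k = len(nbrs)
        if k < 2:
            continue
        ordered = sorted(nbrs)
        links = sum(1 for i, a in enumerate(ordered)
                    for b in ordered[i + 1:] if b in adj[a])
        total += 2.0 * links / (k * (k - 1))
    return total / len(adj)

# Assumed illustrative parameters.
graph = erdos_renyi(n=200, p=0.05, seed=1)
print("mean degree:", mean_degree(graph))
print("average clustering:", average_clustering(graph))
```

In this model the expected clustering coefficient is roughly p itself, which is why a sparse random graph of this kind shows no high clustering, and its degrees concentrate around p(n - 1) rather than forming a heavy tail.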

(15)

Data Mining on Social network

Case I: The Internet

• Nodes: computers, routers

• Links: physical lines

…

[Figure legend: cluster, end device, connector]

(16)

Data Mining on Social network

Case II: The Actor Network

• Nodes: actors

• Links: appearing in the same cast

Example films: Flatliners (1990), The River Wild (1994)

(17)

Data Mining on Social network

Case III: Co-authorship Network

• Nodes: authors

• Links: co-authoring or co-editing academic publications

(18)

Data Mining on Social network

Case IV: Academic Citation Network

• Nodes: authors

• Links: citations of academic publications

(19)

Data Mining on Social network

Case V: Food Network

R.J. Williams, N.D. Martinez Nature (2000)

• Nodes: trophic species

• Links: interactions among selected trophic species

Case VI: Sexual Contact Network

Liljeros et al. Nature (2001)

• Nodes: human beings, categorized by sex

• Links: sexual relationships

(20)

Part I: Data Mining on Social network


(21)

Mining the social network(s)

• Heterogeneous, multi-relational data represented as a graph

• Nodes as objects

• Heterogeneous objects need to be considered

• Attributes of objects matter

• Considering sub-classes of objects and their corresponding labels

• Edges as links

• Different types of link may exist on same graph

• Weighted graph, dual-weighted graph, or others

• Links represent relationships and interactions between objects

• What we mainly want to know is the meaning of links

• Understanding the meaning(s) of link can help identify the relationship between objects

(22)

Data Mining on Social network

• Conventional approaches applied in machine learning and data mining assume “a random sample of homogeneous objects from a single relation”

• However, real-world datasets are typically “multi-relational, heterogeneous, and semi-structured,” which differs sharply from the traditional assumptions

• So, link mining represents “an emerging field of research that lies at the intersection of network and link analysis, hypertext and web mining, graph mining, relational learning and inductive logic programming”

• Simply speaking, it is a multi-disciplinary field of study although most of its core concepts are derived from the existing methods.

(23)

Data Mining on Social network

Mining the social network(s) – taxonomy of link mining

• Object-Related Tasks

• Link-based object ranking

• Link-based object classification

• Object clustering (group detection)

• Object identification (entity resolution)

• Link-Related Tasks

• Link prediction

• Link re-construction

(24)

Mining the social network(s) – methods to link mining

• Properties: Scale-free [Barabasi ’99], Clustering [Watts-Strogatz ’98], Navigation [Adamic-Adar ’03, Liben-Nowell ’05], Bipartite cores [Kumar et al. ’99], Network motifs [Milo et al. ’02], Communities [Newman ’99], Conductance [Mihail-Papadimitriou ’06], Hubs and authorities [Page et al. ’98, Kleinberg ’99]

• PageRank [Page et al. ’99], Hyperlink-Induced Topic Search [Kleinberg ’99], EigenRumor [Fujimura ’05]

• Models: Preferential attachment [Barabasi ’99], Small-world [Watts-Strogatz ’98], Copying model [Kleinberg et al. ’01], Heuristically optimized trade-offs [Fabrikant et al. ’02], Congestion [Mihail et al. ’03], Searchability [Kleinberg ’02], Bowtie [Broder et al. ’00], Transit-stub [Zegura ’97], Jellyfish [Tauro et al. ’01]

• Path-Oriented: Neighborhood Selection, Swarm Intelligence

• Efficiency-Oriented: Greedy approach, SSON (Semantic Social Overlay Network),

(25)

• Link is defined as the relationship among data

• Two kinds of linked networks

• homogeneous vs. heterogeneous

• Homogeneous networks

• Single object type and single link type

• Single model social networks (e.g., friends)

• WWW: a collection of linked Web pages

• Heterogeneous networks

• Multiple object and link types

• Medical network: patients, doctors, disease, contacts, treatments

(26)

Data Mining on Social network

• Link-based ranking primarily exploits the link structure of a graph to order or prioritize the set of objects within the graph

• Web information analysis

• PageRank and HITS are typical approaches inspired by link-based ranking

• Link-based ranking is considered a core technique in mining the network structure (as it is in social network analysis)

• It is applied to rank participants in terms of centrality

• Degree centrality vs. eigenvector/power centrality

• Rank objects relative to one or more relevant objects in the graph vs. rank objects over time in dynamic graphs
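The contrast between degree centrality and eigenvector centrality can be sketched as follows. This is an illustrative sketch that assumes a connected, undirected graph stored as an adjacency dictionary; the function names and the toy graph are hypothetical. The power iteration adds each node's own score at every step (iterating with A + I instead of A), a standard shift that keeps the same leading eigenvector while also converging on bipartite graphs such as trees.

```python
from math import sqrt

def degree_centrality(adj):
    """Fraction of the other nodes each node is directly linked to."""
    n = len(adj)
    return {v: len(adj[v]) / (n - 1) for v in adj}

def eigenvector_centrality(adj, iters=200, tol=1e-10):
    """Power iteration on A + I; a node scores high if its neighbours do."""
    x = {v: 1.0 for v in adj}
    for _ in range(iters):
        new = {v: x[v] + sum(x[u] for u in adj[v]) for v in adj}
        norm = sqrt(sum(val * val for val in new.values()))
        new = {v: val / norm for v, val in new.items()}
        if max(abs(new[v] - x[v]) for v in adj) < tol:
            return new
        x = new
    return x

# Toy graph (assumed data): "c" is a hub; "a" and "d" both have one link.
adj = {
    "a": {"b"},
    "b": {"a", "c"},
    "c": {"b", "d", "e", "f"},
    "d": {"c"},
    "e": {"c"},
    "f": {"c"},
}

deg = degree_centrality(adj)
eig = eigenvector_centrality(adj)
for node in sorted(adj):
    print(node, round(deg[node], 3), round(eig[node], 3))
```

Nodes "a" and "d" are indistinguishable by degree, but "d" is attached to the hub and therefore receives the higher eigenvector score.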

(27)

Data Mining on Social network

Mining the social network(s) – the PageRank Algorithm by Brin & Page (1998)

• PageRank is essentially “citation counting”, but improves over simple counting

• Considering the indirect citations

• Smoothing of citations

• PageRank can also be interpreted as random surfing

[Figure: pages A, B, C, D with scores P(A), P(B), P(C), P(D); arrows labeled “referring to” and “deriving from”]

(28)

Mining the social network(s) – the PageRank Algorithm by Brin & Page (1998)

Random surfing model: at any page,

With prob. $\alpha$, randomly jumping to a page

With prob. $(1 - \alpha)$, randomly picking a link to follow

“Transition matrix” over the pages $d_1, d_2, d_3, d_4$:

$$M = \begin{pmatrix} 0 & 0 & 1/2 & 1/2 \\ 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 1/2 & 1/2 & 0 & 0 \end{pmatrix}$$

$$p_{t+1}(d_i) = (1 - \alpha) \sum_{d_j \in IN(d_i)} m_{ji}\, p_t(d_j) + \alpha \sum_{k=1}^{N} \frac{1}{N}\, p_t(d_k)$$

The second term is the same as $\alpha / N$. For the stationary (“stable”) distribution we ignore time:

$$p(d_i) = \sum_{k=1}^{N} \left[ \frac{1}{N}\,\alpha + (1 - \alpha)\, m_{ki} \right] p(d_k)$$

$$p = \left( \alpha I + (1 - \alpha) M \right)^{T} p, \qquad I_{ij} = 1/N$$
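The random surfing model above can be sketched as a power iteration over the slide's four-page transition matrix. This is a minimal illustration, not the original Brin & Page implementation: the jump probability is called `alpha` here and set to 0.15 purely as an assumption, and the function name, tolerance, and iteration cap are hypothetical.

```python
def pagerank(M, alpha=0.15, tol=1e-12, max_iter=1000):
    """Iterate p(d_i) = alpha/N + (1 - alpha) * sum_k m_ki * p(d_k) to a fixed point.

    M[k][i] is the probability of following a link from page k to page i;
    every row of M is assumed to sum to 1 (no pages without out-links).
    """
    n = len(M)
    p = [1.0 / n] * n  # start from the uniform distribution
    for _ in range(max_iter):
        new_p = [
            alpha / n + (1 - alpha) * sum(M[k][i] * p[k] for k in range(n))
            for i in range(n)
        ]
        if max(abs(a - b) for a, b in zip(new_p, p)) < tol:
            return new_p
        p = new_p
    return p

# Transition matrix from the slide (rows: from d1..d4, columns: to d1..d4).
M = [
    [0.0, 0.0, 0.5, 0.5],
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.5, 0.5, 0.0, 0.0],
]

scores = pagerank(M)
for name, score in zip(["d1", "d2", "d3", "d4"], scores):
    print(name, round(score, 4))
```

Because the scores always sum to 1, the random-jump term reduces to the constant `alpha / n`, which is the simplification the code relies on.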

(29)

Data Mining on Social network

• Another model, link-based classification, predicts the category of an object based on its attributes, its links, and the attributes of correlated objects in the graph(s)

• Here we may need to take multi-modal, multi-layered graphs and their corresponding attributes into account when designing methods for link and object mining

• Web: Predict the category of a web page, based on words that occur on the page, links between pages, anchor text, html tags, etc.

• Citation: Predict the topic of a paper, based on word occurrence, citations, co-citations

• Communication: Predict whether a communication contact is by email, phone call or mail
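The citation example above can be sketched with a deliberately simple scoring rule: each category gets one point per matching word in the paper and `link_weight` points per cited paper already labeled with that category. This is only a hedged illustration of combining attributes with links; the scoring rule, the function name, the keyword lists, and the toy data are all assumptions, and practical link-based classifiers are usually learned models.

```python
from collections import Counter

def classify(node, words, links, labels, keywords, link_weight=1.0):
    """Pick the category with the best combined attribute and link score."""
    scores = Counter()
    # Attribute evidence: overlap between the node's words and each vocabulary.
    for category, vocab in keywords.items():
        scores[category] += len(words[node] & vocab)
    # Link evidence: votes from neighbours whose category is already known.
    for nbr in links.get(node, ()):
        if nbr in labels:
            scores[labels[nbr]] += link_weight
    # Sorting first makes tie-breaking deterministic (alphabetical).
    return max(sorted(scores), key=lambda c: scores[c])

# Toy citation data (assumed).
keywords = {
    "databases": {"query", "index"},
    "networks": {"graph", "routing"},
}
labels = {"p1": "databases", "p2": "databases", "p3": "networks"}
words = {"p4": {"graph", "routing", "query"}}
links = {"p4": ["p1", "p2"]}  # p4 cites p1 and p2

print(classify("p4", words, links, labels, keywords))
print(classify("p4", words, links, labels, keywords, link_weight=0.0))
```

With the link votes switched off, the words alone favor "networks"; the two citations to database papers tip the combined score toward "databases".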

(30)

Data Mining on Social network

• Group detection

Cluster the nodes in the graph into groups that share common characteristics

• Web – Identifying communities

• Citation – identifying research communities

• Entity resolution

To predict when two objects are the same, based on their attributes and their links

• Web – predict when two sites are mirrors of each other

• Citation – predicting when two citations are referring to the same paper

• Epidemics – predicting when two disease strains are the same or similar

(31)

Data Mining on Social network

• Entity resolution has traditionally been treated as a pair-wise resolution problem: each pair of objects is resolved based on the similarity of their attributes (e.g., via association rules or other data mining models)

• All these methods consider the importance of links

• Links in entity resolution

• Collective resolution: one resolution decision affects another if connection exists among them

(32)

Data Mining on Social network

Mining the social network(s) – link prediction

• Predict whether a relationship exists between two participants in the graph, based on their attributes and all correlated links

• Web: predict if there will be a link between two pages

• Citation: predicting if a paper will cite another paper

• Epidemics: predicting who a patient’s contacts are

• Applied Methods

• Often viewed as a binary classification problem

• Local conditional probability model, based on structural and attribute features

• Difficulty: sparseness of existing links
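The structural features mentioned above can be sketched with three common neighbourhood-based scores: common neighbours, the Jaccard coefficient, and the Adamic-Adar index. The sketch only ranks candidate pairs by a score; feeding such scores into a binary classifier is left out. The graph is assumed undirected, and the function names and toy data are hypothetical.

```python
from itertools import combinations
from math import log

def common_neighbors(adj, u, v):
    """Number of nodes linked to both u and v."""
    return len(adj[u] & adj[v])

def jaccard(adj, u, v):
    """Shared neighbours as a fraction of all neighbours of u or v."""
    union = adj[u] | adj[v]
    return len(adj[u] & adj[v]) / len(union) if union else 0.0

def adamic_adar(adj, u, v):
    """Shared neighbours, each weighted by 1 / log(its degree).

    A shared neighbour of two distinct nodes has degree >= 2, so log() > 0.
    """
    return sum(1.0 / log(len(adj[z])) for z in adj[u] & adj[v])

def rank_candidate_links(adj, score):
    """Order all currently unlinked pairs from most to least likely."""
    pairs = [(u, v) for u, v in combinations(sorted(adj), 2)
             if v not in adj[u]]
    return sorted(pairs, key=lambda pair: score(adj, *pair), reverse=True)

# Toy undirected graph (assumed data).
adj = {
    "a": {"b", "c"},
    "b": {"a", "c", "d"},
    "c": {"a", "b", "d"},
    "d": {"b", "c", "e"},
    "e": {"d"},
}

for name, score in [("common neighbours", common_neighbors),
                    ("Jaccard", jaccard),
                    ("Adamic-Adar", adamic_adar)]:
    print(name, rank_candidate_links(adj, score)[0])
```

The sparseness noted above shows up here as well: most unlinked pairs in a large graph share no neighbours at all and receive a score of zero, so positive examples are rare.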

(33)

Data Mining on Social network

Mining the social network(s) – link estimation

• Predict the number of links of a connected participant

• Web: predict the authority of a page based on the number of in-links; identifying hubs based on the number of out-links

• Citation: predicting the impact of a paper based on the number of citations

• Epidemics: predicting the number of people that will be infected based on the infectiousness of a disease

• Predict the number of participants reachable by a given participant

• Web: predicting number of pages retrieved by crawling a site

• Citation: predicting the number of citations of a particular author in a specific journal

(34)

Conclusion

• Big Data → Big Opportunity? or Big Problem?

What is your target or objective?

How will it be done? – but do not forget the human

• Striking a balance is a challenging issue…

Infrastructure (storage), Management (governance, analysis), Search (value discovery), Security (transparency vs. privacy), Applications (human-centered)

• Next … ?

building strategic alliances – industry, academia, and government worldwide

making opportunities rather than waiting to find them