Tutorial, IEEE SERVICE 2014 Anchorage, Alaska

(1)

Big Data Science: Fundamental,

Techniques, and Challenges

(Data Mining on Big Data)

2014. 6. 27.

By Neil Y. Yen

Presented by Incheon Paik

University of Aizu

Japan

Tutorial, IEEE SERVICE 2014

Anchorage, Alaska

(2)

Growth of communication channels

• e-mail (10-20 times / day)

• messenger (3-4 hours / day)

• social media (80% of a day)

• and more to come

Power of social media

an integrated portal to interact with

Changes in human behavior

information-sharing, experience crowdsourcing, and knowledge cultivation

Background

Data everywhere
(3)

• Wikipedia ( 30 million pages ; 80 million edits )

• YouTube ( 100 million videos ; 150 million accesses )

• Blogosphere ( 250 million blogs ; 500 million views ) – the number of posts is decreasing

• Twitter ( 30 billion tweets ; 5 million shares )

• Facebook ( 900 million objects ; 250 million uploads )

• Yahoo! Answers / Yahoo! Chiebukuro (知恵袋) ( 1.7 billion questions ; 900 million answers )

• Flickr ( 5 billion photos)

Background

Data everywhere
(4)

Background

Data everywhere
(5)

[Figure: timeline with marks at 1990, 2000, 2010, and 2015 onward; labels "Entry, Industry-oriented, Focus" and "Platform, Person-oriented, Comprehensive"]

Background

(6)

More Data

More Information

( a foreseeable growth of the Internet and media )

More Data

More Complexity

( scalability, integrity, consistency of data )

More Data

More Heterogeneity

(7)

Part I: Data Mining on Social network

Introduction to social network

Emerging models for social network

(8)

Data Mining on Social network

Introduction to social network

• Primary participants

• Individuals (or objects) – Node(s)

• Connections (or correlations) – Link(s)

• A social network can be interpreted as a “phenomenon derived from individuals with diverse interactions among them.”

(9)

Data Mining on Social network

Introduction to social network

• From another point of view, the Earth can be seen as an electronic nervous system, implemented as a conceptual network of nodes and links:

• nodes such as laptops, smartphones, satellites, etc.

• links such as cable lines, signals, etc. that connect the nodes

• Communication networks: Many non-identical components with diverse connections between them

(10)

Data Mining on Social network

Introduction to social network

• Consider many kinds of networks:

• social, technological, business, economic, content,…

• These networks tend to share certain informal properties:

• large scale, continual growth

• distributed, democratic growth: vertices “decide” who to link to

• mixture of local and long-distance connections

• abstract notions of distance: geographical, content, social,…

• Main concerns

• Do natural networks share quantitative universals?

• What would these universals be?

• How can they be well modeled, analyzed, and explained?

• All of these phenomena follow social network theory and can often be explained through link analysis

(11)

Data Mining on Social network

Introduction to social network

• Connected participants:

• how many, and how large

• Network diameter:

• maximum (worst-case) or average

• exclude infinite distances? (disconnected components)

• the small-world phenomenon

• Clustering:

• to what extent links tend to cluster locally

• what is the balance between local and long-distance connections

• what roles do the two types of links play

• Degree distribution:

• what is the typical degree in the network
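The measures listed on this slide can be computed directly on a small graph. The following is a minimal sketch in plain Python; the toy graph and all function names are assumptions made for illustration and do not come from the tutorial.

```python
from collections import Counter, deque

# Hypothetical toy graph with two components, stored as an adjacency dict.
graph = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
    "e": {"f"},
    "f": {"e"},
}

def bfs_distances(adj, source):
    """Hop distance from source to every node reachable from it."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return dist

def components(adj):
    """Connected components as a list of node sets (how many, how large)."""
    seen, result = set(), []
    for v in adj:
        if v not in seen:
            comp = set(bfs_distances(adj, v))
            seen |= comp
            result.append(comp)
    return result

def diameter(adj):
    """Worst-case finite distance; infinite (disconnected) pairs are excluded."""
    return max(max(bfs_distances(adj, v).values()) for v in adj)

def local_clustering(adj, v):
    """Fraction of pairs of v's neighbours that are themselves linked."""
    nbrs = list(adj[v])
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(1 for i in range(k) for j in range(i + 1, k)
                if nbrs[j] in adj[nbrs[i]])
    return 2.0 * links / (k * (k - 1))

comps = components(graph)
diam = diameter(graph)
avg_clustering = sum(local_clustering(graph, v) for v in graph) / len(graph)
degree_counts = Counter(len(nbrs) for nbrs in graph.values())

print("component sizes:", sorted(len(c) for c in comps))
print("diameter (finite pairs only):", diam)
print("average clustering:", round(avg_clustering, 3))
print("degree distribution:", dict(degree_counts))
```

Excluding unreachable pairs is one of the two options the slide raises for the diameter; the other is to report it per component.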

(12)

Data Mining on Social network

Introduction to social network

• Probabilistic and/or statistical models are used to describe and control how network(s) are generated

• Various parameters to consider:

• network size

• vertex degree

• the connections

• Statements are always statistical in nature:

• with high probability, diameter is small

• on average, degree distribution has heavy tail

(13)

Part I: Data Mining on Social network

Introduction to social network

Emerging models for social network

(14)

Data Mining on Social network

Emerging models for social network

• Random graphs – Erdős-Rényi model (1960):

• Few components and small diameter

• No high clustering or heavy-tailed degree distributions

• A well-studied and well-understood mathematical model in the general case

• Random graphs – Watts & Strogatz model (1998):

• Few components, small diameter and high clustering

• No heavy-tailed degree distributions

• Scale-free Networks:

• Few components, small diameter and heavy-tailed distribution

• No high clustering

• Hierarchical networks:
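To make the contrast between a random graph and a scale-free network concrete, the sketch below generates both in plain Python. It assumes the G(n, p) form of the Erdős-Rényi model and a simple preferential-attachment growth rule for the scale-free case; the function names, the seed, and the parameters n = 200 and m = 2 are illustrative choices, not values from the tutorial.

```python
import random
from collections import Counter

def erdos_renyi(n, p, rng):
    """G(n, p): include each of the n*(n-1)/2 possible links independently."""
    edges = set()
    for u in range(n):
        for v in range(u + 1, n):
            if rng.random() < p:
                edges.add((u, v))
    return edges

def preferential_attachment(n, m, rng):
    """Grow a graph: each new node links to m distinct existing nodes,
    chosen with probability proportional to their current degree."""
    # Seed graph (an assumption of this sketch): a clique on m + 1 nodes.
    edges = {(u, v) for u in range(m + 1) for v in range(u + 1, m + 1)}
    # Each node appears in `ends` once per incident link, so a uniform
    # draw from `ends` is a degree-proportional draw over nodes.
    ends = [v for e in edges for v in e]
    for new in range(m + 1, n):
        targets = set()
        while len(targets) < m:
            targets.add(rng.choice(ends))
        for t in targets:
            edges.add((t, new))
            ends.extend((t, new))
    return edges

def degrees(n, edges):
    """Degree of every node, including nodes without links."""
    deg = Counter({v: 0 for v in range(n)})
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return deg

rng = random.Random(0)
n, m = 200, 2
pa_edges = preferential_attachment(n, m, rng)
# Match the expected link density so the two graphs are comparable.
er_edges = erdos_renyi(n, 2 * len(pa_edges) / (n * (n - 1)), rng)
pa_deg, er_deg = degrees(n, pa_edges), degrees(n, er_edges)

print("max degree, preferential attachment:", max(pa_deg.values()))
print("max degree, Erdos-Renyi:", max(er_deg.values()))
```

Comparing the largest degrees (or plotting the two degree histograms) is a quick way to look for the heavy tail that the random graph lacks.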

(15)

Data Mining on Social network

Emerging models for social network

Case I: The Internet

• Nodes: computers, routers

• Links: physical lines

[Figure legend: cluster, end device, connector]

(16)

Data Mining on Social network

Emerging models for social network

Case II: The Actor Network

• Nodes: actors

• Links: the rest of the cast (two actors are linked if they appear in the same film)

Example films: Flatliners (1990), The River Wild (1994)

(17)

Data Mining on Social network

Emerging models for social network

Case III: Co-authorship Network

• Nodes: authors

• Links: coauthor/coedit on academic publications

(18)

Data Mining on Social network

Emerging models for social network

Case IV: Academic Citation Network

• Nodes: authors

• Links: citations of academic publications

(19)

Data Mining on Social network

Emerging models for social network

Case V: Food Web

R.J. Williams, N.D. Martinez, Nature (2000)

• Nodes: trophic species

• Links: interactions among selected trophic species

Case VI: Sexual Contact Network

Liljeros et al., Nature (2001)

• Nodes: people, categorized by sex

• Links: sexual relationships

(20)

Part I: Data Mining on Social network

Introduction to social network

Emerging models for social network

(21)

Data Mining on Social network

Mining the social network(s)

• Heterogeneous, multi-relational data represented as a graph

• Nodes as objects

• Heterogeneous objects must be considered

• Attributes of objects matter

• Considering sub-classes of objects and their corresponding labels

• Edges as links

• Different types of links may exist in the same graph

• Weighted graph, dual-weighted graph, or others

• Links represent relationships and interactions between objects

• What we want to know is the meaning of the links

• Understanding the meaning(s) of a link helps identify the relationship between objects
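One possible way to hold such a graph in memory is sketched below: nodes carry a type and free-form attributes, and links carry a type and a weight. The class names, field names, and the author/paper example are hypothetical and only illustrate the representation described on this slide.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Node:
    node_id: str
    node_type: str          # e.g. "author" or "paper"

@dataclass
class Link:
    source: str
    target: str
    link_type: str          # e.g. "writes" or "cites"
    weight: float = 1.0

@dataclass
class HeteroGraph:
    nodes: dict = field(default_factory=dict)       # node_id -> Node
    attributes: dict = field(default_factory=dict)  # node_id -> attribute dict
    links: list = field(default_factory=list)

    def add_node(self, node_id, node_type, **attrs):
        self.nodes[node_id] = Node(node_id, node_type)
        self.attributes[node_id] = attrs

    def add_link(self, source, target, link_type, weight=1.0):
        if source not in self.nodes or target not in self.nodes:
            raise KeyError("both endpoints must be added first")
        self.links.append(Link(source, target, link_type, weight))

    def neighbors(self, node_id, link_type=None):
        """Targets of outgoing links from node_id, optionally of one type."""
        return [link.target for link in self.links
                if link.source == node_id
                and (link_type is None or link.link_type == link_type)]

g = HeteroGraph()
g.add_node("alice", "author", topic="data mining")
g.add_node("p1", "paper", year=2013)
g.add_node("p2", "paper", year=2014)
g.add_link("alice", "p1", "writes")
g.add_link("alice", "p2", "writes")
g.add_link("p2", "p1", "cites", weight=2.0)

print(g.neighbors("alice", "writes"))
print(g.neighbors("p2", "cites"))
```

Keeping the link type on every edge is what lets later mining steps ask what a link means rather than only whether it exists.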

(22)

Data Mining on Social network

Mining the social network(s)

• Conventional approaches in machine learning and data mining assume “a random sample of homogeneous objects from a single relation”

• However, real-world datasets are typically “multi-relational, heterogeneous, and semi-structured,” which differs sharply from this traditional assumption

• Link mining is therefore an emerging field of research at the intersection of network and link analysis, hypertext and web mining, graph mining, relational learning, and inductive logic programming

• Simply put, it is a multi-disciplinary field of study, although most of its core concepts are derived from existing methods.

(23)

Data Mining on Social network

Mining the social network(s) – taxonomy of link mining

• Object-Related Tasks

• Link-based object ranking

• Link-based object classification

• Object clustering (group detection)

• Object identification (entity resolution)

• Link-Related Tasks

• Link prediction

• Link re-construction

(24)

Data Mining on Social network

Mining the social network(s) – methods to link mining

Properties: Scale-free [Barabasi '99], Clustering [Watts-Strogatz '98], Navigation [Adamic-Adar '03, Liben-Nowell '05], Bipartite cores [Kumar et al. '99], Network motifs [Milo et al. '02], Communities [Newman '99], Conductance [Mihail-Papadimitriou '06], Hubs and authorities [Page et al. '98, Kleinberg '99]

• PageRank [Page et al. '99], Hyperlink-Induced Topic Search [Kleinberg '99], EigenRumor [Fujimura '05]

Models: Preferential attachment [Barabasi '99], Small-world [Watts-Strogatz '98], Copying model [Kleinberg et al. '01], Heuristically optimized trade-offs [Fabrikant et al. '02], Congestion [Mihail et al. '03], Searchability [Kleinberg '02], Bowtie [Broder et al. '00], Transit-stub [Zegura '97], Jellyfish [Tauro et al. '01]

• Path-Oriented: Neighborhood Selection, Swarm Intelligence

• Efficiency-Oriented: Greedy approach, SSON (Semantic Social Overlay Network),

(25)

Data Mining on Social network

Mining the social network(s)

• A link is defined as a relationship among data objects

• Two kinds of linked networks

• homogeneous vs. heterogeneous

• Homogeneous networks

• Single object type and single link type

• Single-mode social networks (e.g., friends)

• WWW: a collection of linked Web pages

• Heterogeneous networks

• Multiple object and link types

• Medical network: patients, doctors, disease, contacts, treatments

(26)

Data Mining on Social network

Mining the social network(s)

• Link-based ranking exploits the link structure of a graph to order or prioritize the objects within it

• Web information analysis

• PageRank and HITS are typical link-based ranking approaches

• Link-based ranking is considered a core technique in mining network structure (and likewise in social network analysis)

• It is applied to rank participants in terms of centrality

• Degree centrality vs. eigenvector/power centrality

• Ranking objects relative to one or more relevant objects in the graph vs. ranking objects over time in dynamic graphs
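The difference between degree centrality and eigenvector centrality shows up even on a six-node graph. The sketch below is an illustration under assumptions: the toy graph and function names are invented, and eigenvector centrality is approximated by power iteration on A + I (adding each node's own score), a common trick that keeps the same ranking while avoiding oscillation.

```python
# Hypothetical toy graph: a triangle (a, b, c) attached to a small star (d, e, f).
adj = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c", "e", "f"},
    "e": {"d"},
    "f": {"d"},
}

def degree_centrality(adj):
    """Number of links, normalised by the maximum possible (n - 1)."""
    n = len(adj)
    return {v: len(nbrs) / (n - 1) for v, nbrs in adj.items()}

def eigenvector_centrality(adj, iterations=200):
    """Power iteration on A + I; scores are scaled so the maximum is 1."""
    score = {v: 1.0 for v in adj}
    for _ in range(iterations):
        new = {v: score[v] + sum(score[w] for w in adj[v]) for v in adj}
        norm = max(new.values())
        score = {v: s / norm for v, s in new.items()}
    return score

deg = degree_centrality(adj)
eig = eigenvector_centrality(adj)

for v in sorted(adj):
    print(v, round(deg[v], 2), round(eig[v], 3))
```

Here c and d have the same degree, yet c scores higher on eigenvector centrality because its neighbours are themselves well connected; a even outranks d despite having fewer links.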

(27)

Data Mining on Social network

Mining the social network(s) – the PageRank Algorithm by Brin & Page (1998)

• PageRank is essentially “citation counting”, but improves over simple counting

• Considering the indirect citations

• Smoothing of citations

• PageRank can also be interpreted as random surfing

[Figure: pages A, B, C, D with scores P(A), P(B), P(C), P(D); arrows labeled "deriving from" and "referring to"]

(28)

Data Mining on Social network

Mining the social network(s) – the PageRank Algorithm by Brin & Page (1998)

Random surfing model: at any page,

• with probability $\alpha$, jump to a randomly chosen page

• with probability $(1 - \alpha)$, follow a randomly chosen link on the page

Transition matrix for the example graph with pages d1, d2, d3, d4:

$$M = \begin{bmatrix} 0 & 0 & 1/2 & 1/2 \\ 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 1/2 & 1/2 & 0 & 0 \end{bmatrix}$$

$$p_{t+1}(d_i) = (1 - \alpha) \sum_{d_j \in IN(d_i)} m_{ji}\, p_t(d_j) + \alpha \sum_{k} \frac{1}{N}\, p_t(d_k)$$

The last term is the same as $\alpha / N$. For the stationary (“stable”) distribution we ignore time:

$$p(d_i) = \sum_{k} \left[ \frac{1}{N}\,\alpha + (1 - \alpha)\, m_{ki} \right] p(d_k)$$

$$p = \left( \alpha I + (1 - \alpha) M \right)^{T} p, \qquad I_{ij} = 1/N$$
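The random-surfing update can be run as a simple power iteration on the four-page example, where row k of M holds the probabilities of following a link from page d(k+1). This is a minimal sketch: the jump probability alpha = 0.15, the tolerance, and the function name are assumptions chosen for illustration, not values given in the slides.

```python
# Transition matrix of the four-page example (rows: d1, d2, d3, d4).
M = [
    [0.0, 0.0, 0.5, 0.5],
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.5, 0.5, 0.0, 0.0],
]

def pagerank(M, alpha=0.15, tol=1e-12, max_iter=1000):
    """Iterate p(d_i) = alpha/N + (1 - alpha) * sum_k m_ki * p(d_k)."""
    n = len(M)
    p = [1.0 / n] * n                      # start from the uniform distribution
    for _ in range(max_iter):
        new = [alpha / n + (1 - alpha) * sum(M[k][i] * p[k] for k in range(n))
               for i in range(n)]
        if sum(abs(a - b) for a, b in zip(new, p)) < tol:
            return new
        p = new
    return p

alpha = 0.15                               # assumed random-jump probability
ranks = pagerank(M, alpha)

for name, score in zip(["d1", "d2", "d3", "d4"], ranks):
    print(name, round(score, 4))
```

The random jump is what guarantees a unique stationary distribution; in this example d1 ends up with the highest score, and d3 and d4 tie because each receives half of d1's outgoing weight.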

(29)

Data Mining on Social network

Mining the social network(s)

• Another task, link-based classification, predicts the category of an object based on its attributes, its links, and the attributes of linked objects in the graph(s)

• Methods for link and object mining may therefore need to account for multi-modal, multi-layered graphs and their corresponding attributes

• Web: Predict the category of a web page, based on words that occur on the page, links between pages, anchor text, html tags, etc.

• Citation: Predict the topic of a paper, based on word occurrence, citations, co-citations

• Communication: Predict whether a communication contact is by email, phone call or mail
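As a rough illustration of the citation case, the sketch below scores each candidate topic by combining attribute evidence (shared title words) with link evidence (topics of cited and citing papers). The data, the additive scoring rule, and the `link_weight` parameter are all assumptions made for this example; it is a simple neighbour-voting scheme, not a method taken from the tutorial.

```python
from collections import Counter

# Hypothetical toy citation data: paper -> set of title words.
words = {
    "p1": {"graph", "ranking"},
    "p2": {"graph", "clustering"},
    "p3": {"protein", "folding"},
    "p4": {"gene", "protein"},
    "p5": {"graph", "protein"},      # ambiguous on words alone
}
cites = {("p5", "p1"), ("p5", "p2"), ("p1", "p2"), ("p4", "p3")}
known = {"p1": "networks", "p2": "networks", "p3": "biology", "p4": "biology"}

def neighbors(paper):
    """Papers linked to `paper` by a citation in either direction."""
    return {b if a == paper else a for a, b in cites if paper in (a, b)}

def classify(paper, link_weight=1.0):
    """Return (predicted label, per-label scores) for one paper."""
    scores = Counter()
    # Attribute evidence: word overlap with each labelled paper.
    for other, label in known.items():
        if other != paper:
            scores[label] += len(words[paper] & words[other])
    # Link evidence: labels of papers it cites or is cited by.
    for other in neighbors(paper):
        if other in known:
            scores[known[other]] += link_weight
    return max(sorted(scores), key=scores.get), dict(scores)

label, scores = classify("p5")
print(label, scores)
```

On words alone p5 is tied between the two topics; its citations to p1 and p2 are what tip the prediction.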

(30)

Data Mining on Social network

Mining the social network(s)

Group detection

Cluster the nodes in the graph into groups that share common characteristics

• Web – Identifying communities

• Citation – identifying research communities

Entity resolution

To predict when two objects are the same, based on their attributes and their links

• Web – predict when two sites are mirrors of each other

• Citation – predicting when two citations are referring to the same paper

• Epidemics – predicting when two disease strains are the same or similar

(31)

Data Mining on Social network

Mining the social network(s)

• Entity resolution has traditionally been treated as a pairwise problem: two references are resolved based on the similarity of their attributes (e.g., using association rules or other data mining models)

• All these methods consider the importance of links

• Links in entity resolution

• Collective resolution: one resolution decision affects another if the objects involved are connected
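A small sketch of how links can enter a pairwise resolution decision: the score below mixes name similarity with co-author overlap. The references, the Jaccard measure, the 0.5 mixing weight, and the 0.4 threshold are assumptions for illustration; a truly collective method would additionally propagate each decision to connected pairs, which this sketch does not do.

```python
from itertools import combinations

def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

# Hypothetical author references: a name string plus co-author links.
refs = {
    "r1": {"name": "jane smith", "coauthors": {"k ito", "m chen"}},
    "r2": {"name": "j smith", "coauthors": {"k ito", "m chen", "r diaz"}},
    "r3": {"name": "j smyth", "coauthors": {"p novak"}},
}

def similarity(x, y, link_weight=0.5):
    """Weighted mix of attribute similarity and link similarity."""
    attr = jaccard(refs[x]["name"].split(), refs[y]["name"].split())
    link = jaccard(refs[x]["coauthors"], refs[y]["coauthors"])
    return (1 - link_weight) * attr + link_weight * link

threshold = 0.4   # assumed decision threshold
matches = [(x, y) for x, y in combinations(sorted(refs), 2)
           if similarity(x, y) >= threshold]

print(matches)
```

By name alone, "j smith" is equally close to "jane smith" and "j smyth"; the shared co-authors are what separate the two cases.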

(32)

Data Mining on Social network

Mining the social network(s) – link prediction

• Predict whether a relationship exists between two participants in the graph, based on their attributes and related links

• Web: predict if there will be a link between two pages

• Citation: predicting if a paper will cite another paper

• Epidemics: predicting who a patient’s contacts are

• Applied Methods

• Often viewed as a binary classification problem

• Local conditional probability model, based on structural and attribute features

• Difficulty: sparseness of existing links
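Structural features for link prediction are often simple neighbourhood scores. The sketch below ranks currently unlinked pairs by the number of common neighbours and by the Adamic-Adar score, which down-weights shared neighbours that have many links. The toy graph and function names are assumptions for illustration; in practice such scores would be inputs to the classifier mentioned above rather than the final prediction.

```python
import math
from itertools import combinations

# Hypothetical toy graph as an adjacency dict.
adj = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"a", "c", "e"},
    "e": {"d"},
}

def common_neighbors(adj, u, v):
    """Number of nodes linked to both u and v."""
    return len(adj[u] & adj[v])

def adamic_adar(adj, u, v):
    """Shared neighbours, each weighted by 1 / log(degree).
    A shared neighbour always has degree >= 2, so the log is positive."""
    return sum(1.0 / math.log(len(adj[z])) for z in adj[u] & adj[v])

# Candidate links: every pair of nodes that is not linked yet.
candidates = [(u, v) for u, v in combinations(sorted(adj), 2)
              if v not in adj[u]]
ranked = sorted(candidates, key=lambda pair: adamic_adar(adj, *pair),
                reverse=True)

for u, v in ranked:
    print(u, v, common_neighbors(adj, u, v), round(adamic_adar(adj, u, v), 3))
```

Most candidate pairs score low or zero, which reflects the sparseness difficulty noted above: the positive class is tiny compared with the set of possible links.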

(33)

Data Mining on Social network

Mining the social network(s) – link estimation

• Predict the number of links of a connected participant

• Web: predict the authority of a page based on the number of in-links; identifying hubs based on the number of out-links

• Citation: predicting the impact of a paper based on the number of citations

• Epidemics: predicting the number of people that will be infected based on the infectiousness of a disease

• Predict the number of participants reachable from a given participant

• Web: predicting number of pages retrieved by crawling a site

• Citation: predicting the number of citations of a particular author in a specific journal

(34)

Conclusion

Big Data

Big Opportunity? or Big Problem?

What is your target or objective?

How will it be done? – but do not forget the human

Striking a balance is a challenging issue…

Infrastructure (storage), Management (governance, analysis), Search (value discovery), Security (transparency vs. privacy), Applications (human-centered)

Next … ?

Building strategic alliances among industry, academia, and government worldwide

Creating opportunities before setting out to find them
