
Computational Discovery in Evolving Complex Networks


Computational Discovery in

Evolving Complex Networks

Yongqin Gao

(2)

Yongqin Gao December 2006 Dissertation Defense

Outline

• Background

• Methodology for Computational Discovery

• Problem Domain – OSS Research

• Process I: Data Mining

• Process II: Network Analysis

• Process III: Computer Simulation

• Process IV: Research Collaboratory

• Contributions

(3)

Background

• Network research is gaining more attention

– Internet

– Communication network

– Social network

– Software developer network

– Biological network

• Understanding the evolving complex network

– Goal I: Search

– Goal II: Prediction

(4)


Computational Discovery

Our Methodology

[Figure: the computational discovery cycle. Components: Data Mining, Network Analysis, Computer Simulation, Research Collaboratory. Actors: Researcher, Community Members. Flows: Initialization, Discovery, Assessment, Revision, Feedback, Contribution, Reference.]

(5)

Problem Domain

• Open Source Software Movement

– What is OSS

• Free to use, modify and distribute; source code is available and modifiable

• Potential advantages over commercial software: Potentially high quality; Fast development; Low cost

– Why study OSS (Goal)

• Software engineering — new development and coordination methods

• Open content — model for other forms of open, shared collaboration

• Complexity — successful example of self-organization/emergence

(6)


Glory of OSS

(7)

Problem Domain

• SourceForge.net community

– One of the biggest OSS development communities

– 134,751 registered projects

(8)


Problem Domain

• Our Data Set

– 25 monthly dumps since January 2003.

– 460 GB in total, growing at 25 GB/month.

– Every dump has about 100 tables.

– Largest table has up to 30 million records.

• Experiment Environment

– Dual Xeon 3.06 GHz, 4 GB memory, 2 TB storage

– Linux 2.4.21-40.ELsmp with PostgreSQL 8.1

(9)

Related Research

• OSS research

– W. Scacchi, “Free/open source software development practices in the computer game community”, IEEE Software, 2004.

– K. Crowston, H. Annabi and J. Howison, “Defining open source software project success”, 24th International Conference on Information Systems, Seattle, 2003.

• Complex networks

– L.A. Adamic and B.A. Huberman, “Scaling behavior of the world wide web”, Science, 2000.

– M.E.J. Newman, “Clustering and preferential attachment in growing networks”, Physical Review E, 2001.

(10)


Process I: Data Mining

• Related Research:

– S. Chawla, B. Arunasalam and J. Davis, “Mining open source software (OSS) data using association rules network”, PAKDD, 2003.

– D. Kempe, J. Kleinberg and E. Tardos, “Maximizing the spread of influence through a social network”, SIGKDD, 2003.

– C. Jensen and W. Scacchi, “Data mining for software process discovery in open source software development communities”, Workshop on Mining Software Repositories, 2004.

(11)

Process I: Data Mining

[Figure: the data mining process. Stages: Data Preparation, Data Purging, Feature Selection, Algorithm Application. Data: Database, Raw data, Relevant data.]

(12)


Process I: Data Mining

• Data Preparation

– Data discovery

• Locating the information

– Data characterization

• Activity features: user categorization

• Network features

– Data assembly

• Data Purging

– Handling data inconsistency

• Unifying the date representation by loading the data into a single repository

– Handling data pollution

• Removing “inactive” projects

• Feature Selection

– This method is used to remove dependent or insignificant features.

– NMF (Non-negative Matrix Factorization)

(13)

Process I: Data Mining

• Result I

– Significant features

• By feature selection, we can identify the significant feature set describing the projects.

• Activity features: “file_releases”, “followup_msg”, “support_assigned”, “feature_assigned” and task related features

• Network features: “degrees”, “betweenness” and “closeness”

(14)


Process I: Data Mining

• Distribution-based clustering (Christley, 2005)

– Clusters according to the distribution of features instead of the values of individual features

– We assume every entity (project) has an underlying distribution of the feature set (activity features)

– Uses a statistical hypothesis test

• Non-parametric test

• Fisher’s contingency-table test is used

– Joachim Krauth, “Distribution-free statistics: an application-oriented approach”, Elsevier Science Publishers, 1988.

(15)

Process I: Data Mining

• Procedure:

While (still unclustered entities)
    Put all unclustered entities into one cluster
    While (some entities not yet pairwise compared)
        A = Pick entity from cluster
        For each other entity, B, in cluster not yet compared to A
            Run statistical test on A and B
            If significant result
                Remove B from cluster
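The procedure above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the dissertation's implementation: the statistical test is passed in as a function, and the `chi_square_stat` helper is a simple chi-square homogeneity statistic used as a stand-in for the Fisher contingency-table test named earlier. The project names and activity counts are hypothetical.

```python
from typing import Callable, Dict, List, Sequence

Counts = Sequence[int]


def chi_square_stat(a: Counts, b: Counts) -> float:
    """Chi-square homogeneity statistic for two count vectors.

    Stand-in (assumption) for the Fisher contingency-table test on the slide.
    """
    total_a, total_b = sum(a), sum(b)
    if total_a == 0 or total_b == 0:
        return 0.0  # nothing to compare
    grand = total_a + total_b
    stat = 0.0
    for x, y in zip(a, b):
        column = x + y
        if column == 0:
            continue  # category unused by both entities
        expected_a = total_a * column / grand
        expected_b = total_b * column / grand
        stat += (x - expected_a) ** 2 / expected_a
        stat += (y - expected_b) ** 2 / expected_b
    return stat


def distribution_cluster(
    entities: Dict[str, Counts],
    differs: Callable[[Counts, Counts], bool],
) -> List[List[str]]:
    """Greedy clustering that follows the procedure above."""
    unclustered = list(entities)
    clusters: List[List[str]] = []
    while unclustered:
        cluster = list(unclustered)  # put all unclustered entities into one cluster
        i = 0
        while i < len(cluster):
            a = cluster[i]  # pick A; members before it were already compared with it
            survivors = [
                b for b in cluster[i + 1:]
                if not differs(entities[a], entities[b])
            ]
            cluster = cluster[: i + 1] + survivors  # drop every B that differs from A
            i += 1
        clusters.append(cluster)
        unclustered = [e for e in unclustered if e not in cluster]
    return clusters


# Hypothetical activity counts over three activity categories.
projects = {
    "p1": [10, 10, 10],
    "p2": [12, 9, 11],
    "p3": [30, 0, 0],
    "p4": [28, 1, 0],
}
# 5.991 is the chi-square critical value for 2 degrees of freedom at alpha = 0.05.
result = distribution_cluster(projects, lambda a, b: chi_square_stat(a, b) > 5.991)
print(result)
```

Passing the test in as `differs` keeps the clustering loop independent of the particular hypothesis test, so an exact contingency-table test could be substituted without changing the procedure.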

(16)


Process I: Data Mining

• Result II

• Unsupervised learning

– Distribution-based method used to cluster the project history using the activity distribution

– We named the clusters using ID and the results are shown in the table

– High support and confidence in evaluation

Cluster ID    Size
1             89709
2             9191
3             2060
Total         100960

(17)

Process I: Data Mining

• Two sample distributions from different categories

• Unbalanced feature distribution could be “unpopular”

• Balanced feature distribution could be “popular”

[Figure: two bar charts of activity counts per Activity Category (1 to 15). Cluster 1: 20, 1641, 3488, 22, 0, 312, 736, 229, 1510, 534, 82, 121, 28, 0, 4. Cluster 3: 134, 3781, 8435, 431, 0, 2179, 2537, 667, 9169, 7134, 601, 2411, 1651, 0, 399.]

(18)


Process I: Data Mining

• Discoveries in Process I

– Significant feature set selection

• Network features are important

• Further inspection in next process

– Distribution based predictor

• Based on the activity feature distribution

• Prediction of the “popularity” based on the balance of the activity feature distribution

• Benefit of these discoveries

– For collaboration based communities, these discoveries can help in resource allocation optimization.

(19)

Process II: Network Analysis

• Why network analysis

– Assess the importance of the network measures to the whole network and to individual entities in the network

– Inspect the developing patterns of these network measures

• Network analysis

– Structure analysis

– Centrality analysis

– Path analysis

(20)


Process II: Network Analysis

• Related research:

– P. Erdős and A. Rényi, “On random graphs”, Publicationes Mathematicae, 1959.

– D.J. Watts and S.H. Strogatz, “Collective dynamics of ‘small-world’ networks”, Nature, 1998.

– A.-L. Barabási and R. Albert, “Emergence of scaling in random networks”, Science, 1999.

– Y. Gao, “Topology and evolution of the open source software community”, Master Thesis, 2003.

(21)

Process II: Network Analysis

• Structure Analysis

– Understanding the influence of the network structure to individual entities in the network

– Inspected measures

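• Approximate diameter

$$D = \frac{\log(N/z_1)}{\log(z_2/z_1)} + 1$$

where $N$ is the network size and $z_1$, $z_2$ are the average numbers of first and second neighbors.

• Approximate clustering coefficient

$$\frac{1}{C} = 1 + \frac{(\mu_2 - \mu_1)(\nu_2 - \nu_1)^2}{\mu_1 \nu_1 (2\nu_1 - 3\nu_2 + \nu_3)}$$

where $\mu_n$ and $\nu_n$ are the $n$-th moments of the two degree distributions of the bipartite network.

• Component distribution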
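As a small worked example of the approximate diameter, D = log(N/z1) / log(z2/z1) + 1, the sketch below evaluates the formula directly. The function name and the input values are hypothetical and chosen only to make the arithmetic easy to follow; they are not SourceForge.net measurements.

```python
import math


def approx_diameter(n: int, z1: float, z2: float) -> float:
    """Approximate diameter D = log(N / z1) / log(z2 / z1) + 1.

    n  : number of nodes
    z1 : average number of first neighbors
    z2 : average number of second neighbors (must exceed z1)
    """
    if n <= 0 or z1 <= 0 or z2 <= z1:
        raise ValueError("need n > 0 and z2 > z1 > 0")
    return math.log(n / z1) / math.log(z2 / z1) + 1


# Hypothetical network: 1000 nodes, 10 first neighbors, 100 second neighbors.
# log(100) / log(10) + 1 = 3
print(approx_diameter(1000, 10, 100))
```

Because N enters only through a logarithm, the estimate grows slowly with network size, which is consistent with small diameters in large networks.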

(22)


Process II: Network Analysis

• Conversion among C-NET, P-NET and D-NET
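One way to picture the conversion is as a projection of a two-mode membership network. The sketch below assumes (this is an interpretation, not stated on the slide) that C-NET is the bipartite developer-project network, D-NET links developers who share a project, and P-NET links projects that share a developer. The membership pairs are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def project_networks(memberships):
    """Derive D-NET and P-NET edge sets from (project, developer) pairs."""
    devs_of = defaultdict(set)      # project -> developers
    projects_of = defaultdict(set)  # developer -> projects
    for proj, dev in memberships:
        devs_of[proj].add(dev)
        projects_of[dev].add(proj)

    d_net = set()  # developers are nodes, shared projects are links
    for devs in devs_of.values():
        d_net.update(combinations(sorted(devs), 2))

    p_net = set()  # projects are nodes, shared developers are links
    for projs in projects_of.values():
        p_net.update(combinations(sorted(projs), 2))
    return d_net, p_net


# Hypothetical C-NET given as (project, developer) membership pairs.
c_net = [
    ("7028", "dev46"), ("7028", "dev51"), ("7028", "dev99"),
    ("6882", "dev47"), ("6882", "dev58"),
    ("15850", "dev46"), ("15850", "dev58"),
]
d_net, p_net = project_networks(c_net)
print(sorted(d_net))
print(sorted(p_net))
```

Each edge is stored as a sorted pair so that an undirected link appears only once.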

(23)

Process II: Network Analysis

• Result I

– Approximate Diameters

• D-NET: between (5,7) while network size ranged from 151,803 to 195,744.

• P-NET: between (6,8) while network size ranged from 123,192 to 161,798.

– Approximate Clustering Coefficient

• D-NET: between (0.85, 0.95)

• P-NET: between (0.65, 0.75)

(24)


Process II: Network Analysis

(25)

Process II: Network Analysis

• Centrality Analysis

– Understanding the importance of individual entities to the global network structure

– Inspected measures:

• Average Degrees

• Degree Distributions

• Betweenness

$$B(v) = \sum_{s \neq v \neq t \in V} \frac{\sigma_{st}(v)}{\sigma_{st}}$$

• Closeness

$$C(v) = \frac{1}{\sum_{t \in V} d_G(v, t)}$$

(26)


Process II: Network Analysis

• Result II

– Average Degrees

• Developer degree in C-NET: 1.4525

• Project degree in C-NET: 1.7572

• Developer degree in D-NET: 12.3100

• Project degree in P-NET: 3.8059

(27)

Process II: Network Analysis

(28)


Process II: Network Analysis

• Result II (Degree distributions in D-NET and P-NET)

(29)

Process II: Network Analysis

• Result II

– Average Betweenness

• P-NET: 0.2669e-003

– Average Closeness

• P-NET: 0.4143e-005

– Normally these two measures yield very small values in large networks (N > 10,000).
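For readers who want to see how the two measures are computed, here is a minimal sketch for an unweighted, undirected graph, using closeness as the reciprocal of the summed shortest-path distances and betweenness as the sum over node pairs of the fraction of shortest paths that pass through a node. It assumes unordered pairs and no normalization, which may differ from the normalization behind the averages reported above; the graphs are hypothetical toy examples.

```python
from collections import deque
from itertools import combinations


def bfs(graph, source):
    """Return (distance, shortest-path count) from source to each reachable node."""
    dist = {source: 0}
    sigma = {source: 1}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in graph[u]:
            if w not in dist:          # first time w is reached
                dist[w] = dist[u] + 1
                sigma[w] = 0
                queue.append(w)
            if dist[w] == dist[u] + 1:  # u lies on a shortest path to w
                sigma[w] += sigma[u]
    return dist, sigma


def closeness(graph, v):
    """1 / (sum of shortest-path distances from v to reachable nodes)."""
    dist, _ = bfs(graph, v)
    total = sum(dist.values())
    return 1.0 / total if total else 0.0


def betweenness(graph, v):
    """Sum over unordered pairs {s, t} of the share of shortest paths through v."""
    dist_v, sigma_v = bfs(graph, v)
    total = 0.0
    for s, t in combinations([n for n in graph if n != v], 2):
        dist_s, sigma_s = bfs(graph, s)
        if t not in dist_s or v not in dist_s or t not in dist_v:
            continue  # s, t, v are not all connected
        if dist_s[v] + dist_v[t] == dist_s[t]:
            total += sigma_s[v] * sigma_v[t] / sigma_s[t]
    return total


# Hypothetical toy graphs as adjacency lists.
path = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
diamond = {"a": ["b", "c"], "b": ["a", "d"], "c": ["a", "d"], "d": ["b", "c"]}
print(closeness(path, "b"), betweenness(path, "b"))
print(betweenness(diamond, "b"))
```

This brute-force version runs one breadth-first search per node pair and is meant for illustration only; networks with more than 100,000 nodes call for a more efficient algorithm.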

(30)


Process II: Network Analysis

• Path Analysis

– Understanding the developing patterns of the network structure and individual entities in the network

– Inspected measures:

• Active Developer Percentage

• Average Degrees

• Diameters

• Clustering coefficients

• Betweenness

(31)

Process II: Network Analysis

(32)


Process II: Network Analysis

(33)

Process II: Network Analysis

• Result III (Average degrees in D-NET and P-NET)

(34)


Process II: Network Analysis

• Result III (Diameters in D-NET and P-NET)

(35)

Process II: Network Analysis

• Result III (Clustering coefficients for D-NET and P-NET)

(36)


Process II: Network Analysis

• Result III (Average betweenness and closeness for P-NET)

(37)

Process II: Network Analysis

Measures                              D-NET   P-NET   C-NET
Average Degree                        Yes     Yes     Yes
Diameter                              Yes     Yes     N/A
Clustering Coefficient                Yes     Yes     N/A
Degree Distribution                   Yes     Yes     Yes
Major Component                       N/A     Yes     N/A
Average Betweenness                   Yes     Yes     N/A
Average Closeness                     Yes     Yes     N/A
Active Entity Size Development        Yes     Yes     Yes
Average Degree Development            Yes     Yes     Yes
Diameter Development                  Yes     Yes     N/A
Clustering Coefficient Development    Yes     Yes     N/A
Average Betweenness Development       Yes     Yes     N/A
Average Closeness Development         Yes     Yes     N/A
Component Distribution                N/A     Yes     N/A

(38)


Process II: Network Analysis

• Discoveries in Process II:

– Measures of structure analysis and centrality analysis all indicate very high connectivity of the network.

– Measures of path analysis reveal the developing patterns of these measures (life cycle behavior).

• Benefits of these discoveries

– High connectivity in a network is an important feature for information propagation and failure tolerance. Understanding this discovery can help us improve our practices in collaboration networks and communication networks.

– Understanding the developing patterns of these network measures provides us a method to monitor network

(39)

Process III: Computer Simulation

• Related Research:

– P.J. Kiviat, “Simulation, technology, and the decision process”, ACM Transactions on Modeling and Computer Simulation, 1991.

– A.-L. Barabási and R. Albert, “Emergence of scaling in random networks”, Science, 1999.

– R. Axtell, R. Axelrod, J. Epstein and M. Cohen, “Aligning simulation models: A case study and results”, Computational and Mathematical Organization Theory, 1996.

– Y. Gao, “Topology and evolution of the open source software community”, Master Thesis, 2003.

(40)


Process III: Computer Simulation

• Iterative simulation method

– Empirical dataset

– Model

– Simulation

• Verification and validation

– More measures

– More methods

[Figure: iterative simulation cycle linking Empirical Data Collection, Model and Simulation; arrows labeled Description, Characterization, Generation, Adjustment, Verification, Validation]

(41)

Process III: Computer Simulation

• Previous iterated models (master thesis):

– Adapted ER Model

– BA Model

– BA Model with fitness

– BA Model with dynamic fitness

• Iterated models in this study

– Improved Model Four (Model I)

– Constant user energy (Model II)

– Dynamic user energy (Model III)

(42)


Process III: Computer Simulation

• Model I

– Realistic stochastic procedures.

• New developers at every time step, based on a Poisson distribution

• Initial fitness based on a log-normal distribution

– Updated procedure for the weighted project pool (for preferential selection of projects).
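The ingredients of Model I can be illustrated with a toy simulation. This is a hedged sketch, not the validated model: the arrival rate, the log-normal parameters, the probability of creating a project, and the weight `fitness * (members + 1)` are all assumptions introduced here for illustration.

```python
import math
import random


def poisson(rng: random.Random, lam: float) -> int:
    """Sample a Poisson-distributed count using Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def simulate(steps: int, arrival_rate: float = 2.0,
             create_prob: float = 0.2, seed: int = 0):
    """Toy growth model: Poisson arrivals, log-normal fitness, weighted pool."""
    rng = random.Random(seed)
    fitness = [rng.lognormvariate(0.0, 1.0)]  # one initial project
    members = [set()]
    n_devs = 0
    for _ in range(steps):
        for _ in range(poisson(rng, arrival_rate)):  # new developers this step
            dev = n_devs
            n_devs += 1
            if rng.random() < create_prob:
                # The developer creates a project with a log-normal initial fitness.
                fitness.append(rng.lognormvariate(0.0, 1.0))
                members.append({dev})
            else:
                # Weighted project pool: fitter and larger projects are preferred.
                weights = [f * (len(m) + 1) for f, m in zip(fitness, members)]
                chosen = rng.choices(range(len(members)), weights=weights)[0]
                members[chosen].add(dev)
    return fitness, members, n_devs


fitness, members, n_devs = simulate(steps=50)
print(n_devs, len(members), max(len(m) for m in members))
```

Rebuilding the weights at every choice keeps the pool in step with project sizes, which is what makes the selection preferential.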

(43)

Process III: Computer Simulation

(44)


Process III: Computer Simulation

(45)

Process III: Computer Simulation

(46)

Yongqin Gao December 2006 Dissertation Defense

Process III: Computer Simulation

(47)

Process III: Computer Simulation

(48)


Process III: Computer Simulation

• Model II

– New addition: user energy.

– User energy

• the “fitness” parameter for the user

• Every time a new user is created, an energy level is randomly generated for the user

• The energy level is used to decide whether a user will take an action during each time step.

(49)

Process III: Computer Simulation

(50)


Process III: Computer Simulation

(51)

Process III: Computer Simulation

• Model III

– New addition: dynamic user energy.

– Dynamic user energy

• Decaying with respect to time

• Self-adjustable according to the roles the user is taking in various projects.
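A minimal sketch of how a dynamic user energy might be updated is shown below. The geometric decay factor, the per-role boost and the clamp to 1.0 are hypothetical choices made for illustration; the slides state only that energy decays with time and adjusts to the user's roles. Setting `decay=1.0` and `role_boost=0.0` gives a constant energy, as in Model II.

```python
import random


def update_energy(energy: float, roles: int,
                  decay: float = 0.95, role_boost: float = 0.02) -> float:
    """Decay the energy over one time step, then raise it per role held."""
    return min(1.0, energy * decay + role_boost * roles)


def takes_action(rng: random.Random, energy: float) -> bool:
    """A user acts in a time step with probability equal to the energy level."""
    return rng.random() < energy


rng = random.Random(0)
energy = rng.random()  # energy level drawn when the user is created
history = []
for step in range(20):
    history.append(takes_action(rng, energy))
    energy = update_energy(energy, roles=0)  # a user with no roles only decays
print(round(energy, 4), sum(history))
```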

(52)


Process III: Computer Simulation

(53)

Process III: Computer Simulation

Model I (more realistic distributions)
Measure | Patterns in Data | Simulated Patterns
Developer Distribution | Power Law (large tail) | Power Law (small tail)
Project Distribution | Power Law (small tail) | Power Law (large tail)
Average Degrees | Increasing | Increasing
Clustering Coefficient | Decreasing | Decreasing
Diameter | Decreasing | Decreasing
Average Betweenness | Decreasing | Decreasing
Average Closeness | Decreasing | Decreasing

Model II (constant user energy)
Measure | Patterns in Data | Simulated Patterns
Developer Distribution | Power Law (large tail) | Power Law (large tail)
Project Distribution | Power Law (small tail) | Power Law (reasonable tail)
Average Degrees | Increasing | Increasing
Clustering Coefficient | Decreasing | Decreasing
Diameter | Decreasing | Decreasing
Average Betweenness | Decreasing | Decreasing
Average Closeness | Decreasing | Decreasing

Model III (dynamic user energy)
Measure | Patterns in Data | Simulated Patterns
Developer Distribution | Power Law (large tail) | Power Law (large tail)
Project Distribution | Power Law (small tail) | Power Law (small tail)
Average Degrees | Increasing | Increasing
Clustering Coefficient | Decreasing | Decreasing
Diameter | Decreasing | Decreasing
Average Betweenness | Decreasing | Decreasing
Average Closeness | Decreasing | Decreasing

(54)


Process III: Computer Simulation

• Discoveries in Process III

– Expanding the network models for modeling evolving complex networks (more parameters)

– Providing a validated model to simulate the community network at SourceForge.net

• Benefits of these discoveries

– Expanded network models can benefit other researchers in complex networks.

– Validated model for SourceForge.net can be used to study other OSS communities or similar collaboration networks.

(55)

Process IV: Research Collaboratory

• Related Research:

– G. Chin Jr. and C. Lansing, “The biological sciences collaboratory”, Mathematics and Engineering Techniques in Medicine and Biological Sciences, 2004.

– L. Koukianakis, “A system for hybrid learning and hybrid psychology”, Cybernetics and Information Technologies, Systems and Applications, 2003.

(56)


Process IV: Research Collaboratory

• What is a Collaboratory?

– An elaborate collection of data, information, analytical toolkits and communication technologies

– A new networked organizational form that also includes social processes, collaboration techniques and agreements on norms, principles, values, and rules

(57)

Process IV: Research Collaboratory

(58)


Process IV: Research Collaboratory

• Data tier - schema design

[Figure: timeline of schemas SF0103, SF0205, SF0305, SF0405, SF0505, SF0605, SF0705, SF0805. Every schema is a database dump from SourceForge.net.]

(59)

Process IV: Research Collaboratory

• Data tier - connection pool

[Figure: data-tier connection pool. Components: Logic Tier, Connection Request, Connection Assigner, Connection Pool, Persistent Links, Timeline.]

(60)


Process IV: Research Collaboratory

• Presentation Tier

– Various access methods

– Documentation and references

– Community support

– Wiki interface

(61)

Process IV: Research Collaboratory

• Logic Tier

– Interactive web query system

• Authorized users can submit queries to the back-end repository through the web query system

• Results are provided as files in various formats

– Dynamic web schema browser

• Authorized users can access the dynamic schema of the repository through the schema browser

(62)


Process IV: Research Collaboratory

• Utilization reports

– Monthly statistics (June 2006)

• Total queries submitted: 16,947

• Total data files retrieved: 13,343

• Total bytes of query data downloaded: 26,684,556,278

• Programmable access method

– Programmable access method should be provided for complicated access

(63)

Process IV: Research Collaboratory

• Results in Process IV

– Designing, implementing and maintaining a research collaboratory for OSS related research.

• Benefits of these results

– OSS researchers can access one of the most complete data sets on the development of an OSS community.

– By providing the community service to OSS researchers, the collaboratory can help in sparking, improving and promoting research ideas about OSS.

(64)


Contributions

• Designed and demonstrated a computational discovery methodology to study evolving complex networks using research on OSS as a representative problem domain

• Understanding the OSS movement by applying the methods.

– Process I: data mining

• Identifying significant features to describe a project

• Using distribution based clustering to generate a distribution based predictor to predict the “popularity” of a project

– Process II: network analysis

• Introducing a more complete analysis to inspect a more complete data set from SourceForge.net.

• Discovering high connectivity and possible life cycle behaviors in both the network structure and individuals in the network

– Process III: computer simulation

• Introducing more parameters in modeling evolving complex networks

• Generating a “fit” model to replicate the evolution of the SourceForge.net community.

– Process IV: research collaboratory

• Designing, implementing and maintaining a research collaboratory to host the SourceForge.net data set and provide community support for OSS related research.

(65)

Publications to-date

• Y. Gao, G. Madey and V. Freeh, “Modeling and simulation of the open source software community”, ADSC, San Diego, 2005.

• Y. Gao and G. Madey, “Project development analysis of the oss community using st mining”, NAACSOS, Notre Dame, 2005.

• S. Christley, Y. Gao, J. Xu and G. Madey, “Public goods theory of the open source software development community”, Agent, Chicago, 2004.

• Y. Gao, Y. Huang and G. Madey, “Data Mining Project History in Open Source Software Communities”, NAACSOS, Pittsburgh, 2004.

• J. Xu, Y. Gao, J. Goett and G. Madey, “A Multi-model Docking Experiment of Dynamic Social Network Simulations”, Agent, Chicago, 2003.

• Y. Gao, V. Freeh, and G. Madey, “Analysis and Modeling of the Open Source Software Community”, NAACSOS, Pittsburgh, 2003.

• Y. Gao, V. Freeh, and G. Madey, “Conceptual Framework for Agent-based Modeling and Simulation”, NAACSOS, Pittsburgh, 2003.

• G. Madey, V. Freeh, R. Tynan and Y. Gao, “Agent-based modeling and simulation of collaborative social networks”, AMCIS, Tampa, 2003.

• Y. Gao, V. Freeh and G. Madey, “Topology and evolution of the open source software community”, SwarmFest, Notre Dame, 2003.

(66)


Publication Plan

• Chapter III (data mining)

– Journal of Machine Learning Research

– Journal of Systems and Software

• Chapter IV (network analysis)

– Journal of Network and Systems Management

– Journal of Social Structure

• Chapter V (computer simulation)

– Spring Simulation Conference 2007 (under review)

– IEEE Computing in Science and Engineering

• Chapter VI (research collaboratory)

– CITSA 2007

(67)

Conclusion and Future Work

• Cyclic computational discovery method for studying evolving complex networks

• Study of Open Source Software by applying this method

• Future work:

– Maintaining and expanding the collaboratory

– Verifying the SourceForge.net discoveries against further accumulated database dumps from SourceForge.net

– Applying our simulation model to other software development communities

– Extending our methodology to other evolving complex networks such as the Internet, communication networks and various social networks

(68)


Acknowledgement

• My advisor: Dr. Madey

• My committee members:

– Dr. Flynn

– Dr. Striegel

– Dr. Wood

• My colleagues:

– Scott Christley, Yingping Huang, Tim Schoenharl, Matt Van Antwerp, Ryan Kennedy, Alec Pawling and Jin Xu

• SourceForge.net managers:

– Jeff Bates, VP of OSTG Inc.

– Jay Seirmarco, GM of SourceForge.net.

• US NSF CISE/IIS-Digital Society & Technology, under Grant No. 0222829.

(69)
(70)


Case Study II

[Figure: OSS Developer Network (Part). Developers are nodes / Projects are links. 24 Developers, 5 Projects, 2 hub Developers.]

Links listed by project:

Project 15850: dev[46]-dev[83], dev[46]-dev[48], dev[46]-dev[56], dev[46]-dev[58]
Project 6882: dev[58]-dev[47], dev[47]-dev[79], dev[47]-dev[52], dev[47]-dev[55]
Project 7028: dev[46]-dev[99], dev[46]-dev[51], dev[46]-dev[57]
Project 7597: dev[46]-dev[45], dev[46]-dev[72], dev[46]-dev[55], dev[46]-dev[58], dev[46]-dev[61], dev[46]-dev[64], dev[46]-dev[67], dev[46]-dev[70]
Project 9859: dev[46]-dev[49], dev[46]-dev[53], dev[46]-dev[54], dev[46]-dev[59]

(71)

Process I: Data Mining

• Characteristics of data set

– Massive

– Incomplete, noisy, redundant

– Complex structures, unstructured

• Classic analysis tools are often inadequate and inefficient for analyzing these data, especially in exploratory research

• What is DM (Data mining)

– Nontrivial extraction of implicit, previously unknown and potentially useful information from data.

(72)


Process I: Data Mining

• Feature Selection

– Given a non-negative n x m matrix V, find factors W (n x r) and H (r x m) such that V ≈ W * H

– This is called the non-negative matrix factorization (NMF) of the matrix V

– NMF can be used on multivariate data to reduce the dimension of the data set

– By using NMF, we can reduce dimension from
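To make the factorization concrete, the sketch below approximates a small non-negative matrix V by W * H using the classic multiplicative update rules for the squared-error objective. This is one common way to compute an NMF and is an assumption here; the slides do not say which algorithm was used. The matrix, the rank r and the iteration count are hypothetical.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def squared_error(v, w, h):
    """Squared Frobenius norm of V - W * H."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0])))


def nmf(v, r, iterations=200, seed=0, eps=1e-9):
    """Factor a non-negative n x m matrix V into W (n x r) and H (r x m)."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + eps for _ in range(r)] for _ in range(n)]
    h = [[rng.random() + eps for _ in range(m)] for _ in range(r)]
    for _ in range(iterations):
        # H <- H * (W^T V) / (W^T W H), element-wise
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[a][j] * num[a][j] / (den[a][j] + eps) for j in range(m)]
             for a in range(r)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][a] * num[i][a] / (den[i][a] + eps) for a in range(r)]
             for i in range(n)]
    return w, h


# Hypothetical 3 x 2 data matrix of rank 1, reduced to r = 1 latent feature.
v = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
w, h = nmf(v, r=1)
print(squared_error(v, w, h))
```

Each row of W then describes an entity with r values instead of m, which is the dimension reduction the slide refers to.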

(73)

Why NMF?

• Feature extraction methods

– Linear methods are simpler and more completely understood.

– Nonlinear methods are more general and more difficult to analyze.

• Linear methods:

– ICA: Independent Component Analysis

– Matrix decomposition: PCA, SVD, NMF

• In practice, NMF is popular and simple.

• Dimensionality reduction is effective if the loss of information due to mapping to a lower-dimensional space is less than the gain due to simplifying the problem.

(74)


Process I: Data Mining

• Feature-based Clustering

– Grouping data into K number of clusters based on features.

– The distance metric used is the Euclidean distance.

– Hierarchical K-Means is used.

• The result is a binary tree.

• The root is the whole data set and the leaf clusters are

the fine-grained clusters, which are the resulting K

(75)

Process I: Data Mining

• Case Study Result II

• Unsupervised learning

– K-Means method used to cluster the project history using the features we

selected

– We named the clusters using ID and the results are shown in the table

– Evaluation shows the result is not acceptable

Cluster ID    Size
1             6201
2             98
3             64824
4             2
5             4
6             29724
7             4
8             10
9             9
10            84
Total         100960

(76)


Process I: Data Mining

[Figure: decision tree for user categorization. Tests: Admin_flags?, Grantcvs?, Assigned?, Activities? (Yes/No branches). Outcomes: Administrator, Core developer, Co-developer, Active user, Lurker. Source tables: User_group, artifact, Forum, People_job, Project_task, Doc_data and other tables, combined by UNION into the User_project_act table.]

(77)
(78)


Clustering Result Evaluation

• Evaluation test set generation

– Popular/unpopular projects

– Stratified sampling to make 500 projects

• Feature sets used

– Popular feature set

– Activity Feature set (Page 34, Table 3.2)

– Network Feature set (Page 35, Table 3.3)

• Generating rules for the test sets

(79)

Popularity Definition

Feature            Description
Page_views         Number of views of the pages
Subdomain_views    Number of views of the subdomain
Site_views         Number of views of the website
Downloads          Number of downloads
Developers         Number of core developers

(80)


Why K-Means?

• The algorithm has remained extremely popular because it converges extremely quickly in practice. In fact, many have observed that the number of iterations is typically much less than the number of points.

• K-Means is more successful on large data sets (size > 1000, dimension > 2) than GA and evolutionary algorithms

• CLIQUE is sensitive to noise

• CURE is not scalable: O(n^2 log n)

• CLARANS & BIRCH are not good for high-dimensional data

• D. Arthur, S. Vassilvitskii (2006): "How Slow is the k-means Method?," Proceedings of the 2006 Symposium on Computational Geometry (SoCG).

(81)

K-Means

• It maximizes inter-cluster (or minimizes intra-cluster) variance, but does not ensure that the result is a global minimum of variance. Multiple runs are needed.

• Elbow criterion
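The need for multiple runs can be illustrated with a small K-Means sketch that restarts from several random initializations and keeps the run with the lowest within-cluster sum of squared distances. This is plain (non-hierarchical) K-Means on hypothetical 2-D points, written for illustration only; the number of runs and the seed are arbitrary.

```python
import random


def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def assign(points, centroids):
    """Label each point with the index of its nearest centroid."""
    return [min(range(len(centroids)), key=lambda c: sq_dist(p, centroids[c]))
            for p in points]


def kmeans_once(points, k, rng, iterations=50):
    centroids = rng.sample(points, k)  # random initial centroids
    for _ in range(iterations):
        labels = assign(points, centroids)
        updated = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                updated.append(tuple(sum(x) / len(members) for x in zip(*members)))
            else:
                updated.append(centroids[c])  # keep the centroid of an empty cluster
        if updated == centroids:
            break  # converged to a local optimum
        centroids = updated
    labels = assign(points, centroids)
    sse = sum(sq_dist(p, centroids[lab]) for p, lab in zip(points, labels))
    return sse, labels, centroids


def kmeans(points, k, runs=10, seed=0):
    """Run K-Means several times and keep the lowest-variance result."""
    rng = random.Random(seed)
    return min((kmeans_once(points, k, rng) for _ in range(runs)),
               key=lambda result: result[0])


# Two hypothetical, well-separated groups of points.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
sse, labels, centroids = kmeans(points, k=2)
print(labels, round(sse, 3))
```

Repeating the procedure for increasing k and plotting the best `sse` against k is the usual way to apply the elbow criterion.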

(82)


Distribution Categories

Category    Feature
1           File release
2           New message
3           Followup message
4           Artifact request
5           Todo request
6           Support request
7           Feature request
8           Patch request
9           Bug reports
10          Bug assigned
11          Patch assigned
12          Feature assigned
13          Support assigned
14          Todo assigned
15          Artifact assigned

(83)

Process III: Computer Simulation

[Figure: simulation model procedure (flowchart). Elements: Start, New Users, User List, User Action (Create, Join, Drop, Idle), User_Project Links, Project List, Project Pool Update, Weighted Project Pool, End of Simu? (Yes/No), Stop.]

(84)


Process III: Computer Simulation

• Poisson Process:

– The Poisson distribution expresses the probability of a number of events occurring in a fixed period of time if these events occur with a known average rate and independently of the time since the last event.

– Probability mass function:

$$f(k; \lambda) = \frac{\lambda^k e^{-\lambda}}{k!}$$

(85)

Process III: Computer Simulation

(86)


Process III: Computer Simulation

Kolmogorov-Smirnov test

– Used to determine whether two underlying one-dimensional distributions differ.

– Two one-sided K-S test statistics are given by

$$D_n^+ = \max_x \left( F_n(x) - F(x) \right)$$

$$D_n^- = \max_x \left( F(x) - F_n(x) \right)$$
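The two one-sided statistics, D+ as the largest excess of the empirical distribution function F_n over the reference distribution F and D- as the largest excess of F over F_n, can be computed from a sorted sample as sketched below. The sketch assumes a continuous reference CDF, and the sample and the uniform reference are hypothetical.

```python
def ks_one_sided(sample, cdf):
    """Return (D+, D-) for a sample against a continuous reference CDF.

    D+ = max over x of F_n(x) - F(x);  D- = max over x of F(x) - F_n(x),
    where F_n is the empirical distribution function of the sample.
    """
    xs = sorted(sample)
    n = len(xs)
    if n == 0:
        raise ValueError("sample must not be empty")
    # F_n jumps to (i + 1) / n at the i-th sorted point (0-based).
    d_plus = max((i + 1) / n - cdf(x) for i, x in enumerate(xs))
    # Just below the i-th sorted point, F_n still equals i / n.
    d_minus = max(cdf(x) - i / n for i, x in enumerate(xs))
    return d_plus, d_minus


def uniform_cdf(x):
    """CDF of the uniform distribution on [0, 1]."""
    return min(1.0, max(0.0, x))


d_plus, d_minus = ks_one_sided([0.7, 0.1, 0.4], uniform_cdf)
print(d_plus, d_minus)
```

The usual two-sided statistic is the larger of the two values.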

(87)
(88)


Similar Publications

• Chapter III (data mining)

– JMLR: G. Hamerly, E. Perelman..Using machine learning to guide simulation (Feb. 2006)

– JSS: S. Kim, J. Yoon..Shape-based retrieval in time-series database (Feb. 2006)

• Chapter IV (network analysis)

– JNSM: Special Issue Self-Managing Systems and Networks

– JoSS: The Journal of Social Structure (JoSS) is an electronic journal of the International Network for Social Network Analysis (INSNA)

• Chapter V (computer simulation)

– SSC 2007: simulation co

– IEEE/CSE: E. Luijten..Fluid simulation with monte carlo algorithm (2006 Vol. 8, Issue 2)

• Chapter VI (research collaboratory)

– CITSA 2007: L. Koukianakis..A system for hybrid learning and hybrid psychology (2005)

– JCSA: S. Chen, K. Wen..An Integrated System for Cancer-Related Genes Mining from Biomedical Literatures (2006)
