
Message based MVC and High Performance Multi core Runtime


Academic year: 2020


(1)

Message-based MVC and High Performance Multi-core Runtime

Xiaohong Qiu

(2)

Session Outline

My Brief Background

Education and Work Experiences

Ph.D. Thesis Research

Message-based MVC Architecture for

Distributed and Desktop Applications

Recent Research Project

(3)

My Brief Background I

1987 ─ 1991 Computer Science program at Beihang University

– CS was viewed as a promising field to get into at the time

– Four years of foundation courses, computer hardware and software courses, labs, projects, and an internship. Programming languages used included assembly language, Basic, Pascal, Fortran 77, Prolog, Lisp, and C. Programming environments were DOS, Unix, Windows, and Macintosh.

1995 ─ 1998 Computer Science graduate program at Beihang University

– Graduate Research Assistant at the National Lab of Software Development Environment

– Participated in the team project SNOW (shared memory network of workstations), working on an improved algorithm for the parallel I/O subsystem based on the two-phase method and MPI I/O.

1991 ─ 1998 Faculty at Beihang University

(4)

My Brief Background II

• 1998 ─ 2000 M.S., Computer Information Science program at Syracuse University

• 2000 ─ 2005 Ph.D., Computer Information Science program at Syracuse University

– The thesis project involved surveying, designing, and evaluating a new paradigm for the next generation of rich media software applications that unifies legacy desktop and Internet applications with automatic collaboration and universal access capabilities. Attended conferences to present research papers and exhibit projects.

• Awarded a Syracuse University Fellowship from 1998 to 2001 and named Outstanding Graduate Student of the College of Electrical Engineering and Computer Science in 2005

• May 2005 ─ present Visiting Researcher at Community Grids Lab, Indiana University

• June ─ November 2006 Software Project Lead at Anabas Inc.

(5)

Message-based MVC (M-MVC)

Research Background

Architecture of Message-based MVC

Collaboration Paradigms

SVG Experiments

Performance Analysis

(6)

Research Background

Motivations

– CPU speed (Moore’s law) and network bandwidth (Gilder’s law) continue to improve, bringing fundamental changes

– Internet and Web technologies have evolved into a global information infrastructure for sharing of resources

– Applications are getting increasingly sophisticated

– Internet collaboration enabling virtual enterprises

Large-scale distributed computing

Requires a new application architecture that is adaptable to fast technology changes, with properties such as simplicity, reusability, scalability, reliability, and performance

• General area is technology support for Synchronous and Asynchronous Resource Sharing

e-learning (e.g. video/audio conferencing)

e-science (e.g. large-scale distributed computing)

e-business (e.g. virtual organizations)

e-entertainment (e.g. online game)

Research on a generic model of building applications

Application domains

Distributed (Web)

Service Oriented Architecture and Web Services

Desktop (Client)

Model-View-Controller (MVC) paradigm

Internet collaboration

(7)

Architecture of Message-based MVC

• A comparison of MVC, Web Service Pipeline, and Message-based MVC

• Features of Message-based MVC Paradigm

– M-MVC is a general approach for building applications with a message-based paradigm

– It emphasizes a universal modularized service model with messaging linkage

– Converges desktop application, Web application, and Internet collaboration

• MVC and Web Services are fundamental architectures for desktop and Web applications

• Web Service pipeline model provides the general collaboration architecture for distributed applications

• M-MVC is a uniform architecture integrating the above models

– M-MVC allows automatic collaboration, which simplifies the architecture design

[Figure: (a) MVC model with Model, View, and Controller, where messages contain control information; (b) three-stage pipeline decomposing the SVG browser into Semantic, High Level UI, and Raw UI Display stages linked through input and output ports, with events as messages in one direction and rendering as messages in the other]

(8)

Collaboration Paradigms I

SMMV vs. MMMV as MVC interactive patterns

Flynn’s Taxonomy classifies parallel computing platforms into four types: SISD, MISD, SIMD, and MIMD.

SIMD – A single control unit dispatches instructions to each processing unit.

MIMD – Each processor is capable of executing a different program independent of the other processors. It enables asynchronous processing.

SMMV generalizes the concept of SIMD

MMMV generalizes the concept of MIMD

In practice, SMMV and MMMV patterns can be applied in both asynchronous and synchronous applications, and thus form general collaboration paradigms

[Figure: (a) Single Model Multiple View: one Model linked to View 1 through View n; (b) Model 1 through Model m linked to View 1 through View n]
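The Single Model Multiple View pattern can be sketched as a small publish/subscribe program. This is a minimal in-process illustration, not the thesis implementation: the `Broker`, `Model`, and `View` classes and the `events` and `rendering` topic names are hypothetical, and the toy broker only stands in for a real messaging system such as NaradaBrokering.

```python
from collections import defaultdict


class Broker:
    """Toy in-process publish/subscribe broker (hypothetical stand-in)."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        self._subscribers[topic].append(handler)

    def publish(self, topic, message):
        # Deliver the message to every handler registered for the topic.
        for handler in list(self._subscribers[topic]):
            handler(message)


class Model:
    """Single shared model: consumes UI events, publishes rendering updates."""

    def __init__(self, broker):
        self.broker = broker
        self.state = []
        broker.subscribe("events", self.on_event)

    def on_event(self, event):
        self.state.append(event)
        self.broker.publish("rendering", {"applied": event, "count": len(self.state)})


class View:
    """One of many views: sends UI events, records rendering messages."""

    def __init__(self, name, broker):
        self.name = name
        self.broker = broker
        self.rendered = []
        broker.subscribe("rendering", self.rendered.append)

    def user_input(self, event):
        self.broker.publish("events", event)


broker = Broker()
model = Model(broker)
views = [View("master", broker), View("other", broker)]
views[0].user_input("mousedown")  # every view receives the same update
```

Because the views talk to the model only through messages, adding another participant is a matter of subscribing one more `View`. A Multiple Model Multiple View variant would give each view its own `Model` instance and replicate the event stream to all of them.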

(9)

Collaboration Paradigms II

Monolithic collaboration

– CGL applications of PowerPoint, OpenOffice and data visualization

Collaboration paradigms deployed with M-MVC model

– SMMV (e.g. Instructor led learning)

– MMMV (e.g. Participatory learning)

[Figure: three deployment diagrams connecting master and other clients through NaradaBrokering: monolithic SVG browser clients as identical programs receiving identical events; master and other client Views sharing one Model as Web Service; master and other client Views each with its own Model as Web Service]

(10)

SVG Experiments I

Monolithic SVG Experiments

Collaborative SVG Browser

Collaborative SVG Chess game

Players

(11)

SVG Experiments II

Decomposed SVG browser into stages of pipeline

T0: The time at which a given user event, such as a mouse click, is sent from the View to the Model.

T1: A given user event such as a mouse click can generate multiple associated DOM change events transmitted from the Model to the View. T1 is the arrival time at the View of the first of these.

T2: This is the arrival of the last of these events from the Model and the start of the processing of the set of events in the GVT tree

T3: This is the start of the rendering stage

T4: This is the end of the rendering stage

[Figure: timing points T0 to T4 in the decomposed SVG browser. The View (client) holds the input (UI events), the output (rendering), the GVT tree, and a mirrored DOM tree; the Model (service) holds the JavaScript and the DOM tree before and after mutation. Event processors on Machines A, B, and C exchange events through the broker (NaradaBrokering notification service).]
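The timing points above translate directly into the intervals reported in the performance tables. The following is a minimal sketch, assuming timestamps in milliseconds on a common clock; the `EventTrace` class, its field names, and the sample values are illustrative, and treating T2 as the "last return" time is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EventTrace:
    """Timestamps (ms) for one user event; names are hypothetical."""
    t0: float  # event sent from View to Model
    t1: float  # first DOM change event arrives at the View
    t2: float  # last DOM change event arrives; GVT processing starts
    t3: float  # rendering starts
    t4: float  # rendering ends


def intervals(trace: EventTrace) -> dict:
    """Derive the measured intervals from one trace."""
    return {
        "first_return": trace.t1 - trace.t0,   # T1 - T0
        "last_return": trace.t2 - trace.t0,    # assumed to match the last-return column
        "rendering": trace.t4 - trace.t3,      # duration of the rendering stage
        "end_rendering": trace.t4 - trace.t0,  # T4 - T0
    }


# Made-up timestamps, for illustration only.
trace = EventTrace(t0=0.0, t1=20.0, t2=50.0, t3=60.0, t4=340.0)
result = intervals(trace)
```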

(12)

Performance Analysis

• Average Performance of Mouse Events

All times are in milliseconds. "All mouse events" is the average over mousedown, mousemove, and mouseup events.

| No | Distance | NB location | Mousedown events: first return – send time, T1-T0 (mean ± error) | stddev | All mouse events: first return – send time, T1-T0 (mean ± error) | stddev | All mouse events: last return – send time, T’1-T0 (mean ± error) | stddev | All mouse events: end rendering, T4-T0 (mean ± error) | stddev |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Switch connects | Desktop server | 33.6 ± 3.0 | 14.8 | 37.9 ± 2.1 | 18.7 | 48.9 ± 2.7 | 23.7 | 294.0 ± 20.0 | 173.0 |
| 2 | Switch connects | High-end Desktop server | 18.0 ± 0.57 | 2.8 | 18.9 ± 0.89 | 9.07 | 31.0 ± 1.7 | 17.6 | 123.0 ± 8.9 | 91.2 |
| 3 | Office area | Linux server | 14.9 ± 0.65 | 2.8 | 21.0 ± 1.3 | 10.2 | 43.9 ± 2.6 | 20.5 | 414.0 ± 24.0 | 185.0 |
| 4 | Within-City (Campus area) | Linux cluster node server | 20.0 ± 1.1 | 4.8 | 29.7 ± 1.5 | 13.6 | 49.5 ± 3.0 | 26.3 | 334.0 ± 22.0 | 194.0 |
| 5 | Inter-City | Solaris server | 17.0 ± 0.91 | 4.3 | 24.8 ± 1.6 | 12.8 | 48.4 ± 3.0 | 23.3 | 404.0 ± 20.0 | 160.0 |
| 6 | Inter-City | Solaris server | 20.0 ± 1.3 | 6.4 | 29.6 ± 1.7 | 15.3 | 50.5 ± 3.4 | 26.0 | 337.0 ± 22.0 | 189.0 |
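The "mean ± error" and "stddev" columns can be produced from raw per-event timings along the following lines. This is a sketch under a stated assumption: the slides do not define "error", so it is taken here to be the standard error of the mean (sample standard deviation divided by the square root of the number of events). The `summarize` helper and the sample values are illustrative.

```python
import math
import statistics


def summarize(samples):
    """Return (mean, error, stddev) for a list of timings in milliseconds."""
    mean = statistics.mean(samples)
    stddev = statistics.stdev(samples)          # sample standard deviation
    error = stddev / math.sqrt(len(samples))    # assumed: standard error of the mean
    return mean, error, stddev


# Made-up timings, for illustration only.
mean, error, stddev = summarize([10.0, 20.0, 30.0, 40.0])
print(f"{mean:.1f} ± {error:.1f} (stddev {stddev:.1f})")
```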

(13)

Performance Analysis II

• Immediate bouncing back event

All times are in milliseconds and are averages over all mouse events (mousedown, mousemove, and mouseup).

| Test No | Distance | NB location | Bounce back – send time (mean ± error) | stddev | First return – send time, T1-T0 (mean ± error) | stddev | Last return – send time, T’1-T0 (mean ± error) | stddev | End rendering, T4-T0 (mean ± error) | stddev |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Switch connects | Desktop server | 36.8 ± 2.7 | 19.0 | 52.1 ± 2.8 | 19.4 | 68.0 ± 3.7 | 25.9 | 405.0 ± 23.0 | 159.0 |
| 2 | Switch connects | High-end Desktop server | 20.6 ± 1.3 | 12.3 | 29.5 ± 1.5 | 13.8 | 49.5 ± 3.1 | 29.4 | 158.0 ± 12.0 | 109.0 |
| 3 | Office area | Linux server | 24.3 ± 1.5 | 11.0 | 36.3 ± 1.9 | 14.2 | 54.2 ± 2.9 | 21.9 | 364.0 ± 22.0 | 166.0 |
| 4 | Within-City (Campus area) | Linux cluster node server | 15.4 ± 1.1 | 7.6 | 26.9 ± 1.6 | 11.6 | 46.7 ± 2.9 | 20.6 | 329.0 ± 25.0 | 179.0 |
| 5 | Inter-City | Solaris server | 18.1 ± 1.3 | 8.8 | 31.8 ± 2.2 | 14.5 | 54.6 ± 4.9 | 32.8 | 351.0 ± 27.0 | 179.0 |
| 6 | Inter-City | Solaris server | 21.7 ± 1.4 | 9.8 | 37.8 ± 2.7 | 19.3 | 55.6 ± 3.4 | 23.6 | 364.0 ± 25.0 | 176.0 |

(14)

Performance Analysis III

• Basic NB performance in 2 hops and 4 hops

All times are in milliseconds.

| No | 2 hops (mean ± error) | stddev | 4 hops (View – Broker – Model – Broker – View) (mean ± error) | stddev |
|---|---|---|---|---|
| 1 | 7.65 ± 0.61 | 3.78 | 13.4 ± 0.98 | 6.07 |
| 2 | 4.46 ± 0.41 | 2.53 | 11.4 ± 0.66 | 4.09 |
| 3 | 9.16 ± 0.60 | 3.69 | 16.9 ± 0.79 | 4.85 |
| 4 | 7.89 ± 0.61 | 3.76 | 14.1 ± 1.1 | 6.95 |
| 5 | 7.96 ± 0.60 | 3.68 | 14.0 ± 0.74 | 4.54 |
| 6 | 7.96 ± 0.60 | 3.67 | 16.8 ± 0.72 | 4.47 |

(15)

Comparison of performance results to highlight the importance of the client

[Figure: two histograms of events per 5 ms bin against time T1-T0 (milliseconds), each with series for all events, mousedown, mouseup, and mousemove]

NB on Model; Model and View on two desktop 1.5 GHz PCs; local switch network connection.

NB on View; Model and View on two desktop PCs with “high-end” graphics Dell (3 GHz Pentium) for View; 1.5 GHz Dell for Model; local switch network connection.

(16)

Comparison of performance results with local and remote NB locations

[Figure: two histograms of events per 5 ms bin against time T1-T0 (milliseconds), each with series for all events, mousedown, mouseup, and mousemove]

NB on local 2-processor Linux server; Model and View on two 1.5 GHz desktop PCs; local switch network connection.

NB on 8-processor Solaris server ripvanwinkle; Model and View on two 1.5 GHz desktop PCs; remote network connection through routers.

(17)

Observations

• The client to server and back transit time is only 20% of the total processing time in the local examples.

• The overhead of the Web service decomposition is not directly measured in the tests shown in these tables.

• The changes in T1-T0 in each row reflect the different network transit times as we move the server from local to organization locations.

• The overhead of NaradaBrokering itself is 5-15 milliseconds, depending on the operating mode of the Broker, in simple stand-alone measurements. It consists of forming message objects, serialization, and network transit time with four hops (client to broker, broker to server, server to broker, broker to client).

• The contribution of NaradaBrokering to T1-T0 is about 30 milliseconds in preliminary measurements, due to the extra thread scheduling inside the operating system and interfacing with the complex SVG application.

• We expect the main impact to be the algorithmic effect of breaking the code into two, the network and broker overhead, and thread scheduling from the OS

• We expect our architecture will work dramatically better on multi-core chips
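The "only 20%" observation can be checked against the two local (switch-connected) rows of the mouse-event table. The slides do not say which interval counts as the transit time, so this sketch assumes it is the last return time T’1-T0 divided by the end-of-rendering time T4-T0; the `transit_share` helper is illustrative.

```python
def transit_share(return_ms, end_rendering_ms):
    """Fraction of total processing time spent before the last event returns."""
    return return_ms / end_rendering_ms


# (T'1-T0, T4-T0) means in milliseconds for rows 1 and 2 of the table of
# average mouse-event performance (desktop and high-end desktop servers).
local_rows = {1: (48.9, 294.0), 2: (31.0, 123.0)}
shares = {no: round(transit_share(*row), 3) for no, row in local_rows.items()}
print(shares)
```

Under that assumption the two local rows give roughly 17% and 25%, which brackets the 20% figure quoted above.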

(18)

Summary of Thesis Research

• Proposing an “explicit Message-based MVC” paradigm (M-MVC) as the general architecture of Web applications

• Demonstrating an approach of building “collaboration as a Web service” through monolithic SVG experiments.

• Bridging the gap between desktop and Web applications by leveraging the existing desktop application with a Web service interface through “M-MVC in a publish/subscribe scheme”.

– As an experiment, we convert a desktop application into a distributed system by modifying the architecture from method-based MVC into message-based MVC.

• Proposing Multiple Model Multiple View (MMMV) and Single Model Multiple View (SMMV) collaboration as the general architecture of the “collaboration as a Web service” model.

(19)

High Performance Multi-core Runtime

Multi-core architectures are expected to be the future of “Moore’s Law”, with single chip performance coming from parallelism with multiple cores rather than from increased clock speed and sequential architecture improvements

This implies parallelism should be used in all applications and not just the familiar scientific and engineering areas

The runtime could be message passing for all cases. It is interesting to compare and try to unify the runtimes for MPI (classic scientific technology), Objects, and Services, which are all message based

We have finished an analysis of Concurrency and

(20)

Research Question: What is “core” multicore runtime and its performance?

• Many parallel and/or distributed programming models are supported by a runtime consisting of long-running or dynamic threads exchanging messages

– Those coming from distributed computing often have overheads of a millisecond or more when ported to multicore (see M-MVC thesis results earlier)

• Need microsecond-level performance on all models, like the best MPI

• Examination of Microsoft CCR suggests this will be possible

– Current CCR spawning threads in MPI mode has 2-4 microsecond overhead

– Two-way service style messages take around 30 microseconds

(21)

Intel Fall 2005 Multicore Roadmap

(22)

Summary of CCR and DSS Project

CCR is a message-based runtime supporting interacting concurrent threads with high efficiency

• Replaces CLR Thread Pool with Iteration

DSS is a Service (not a Web Service) environment designed for Robotics (which has many control and analysis modules implemented as services and linked by workflow)

DSS is built on CCR and released by Microsoft

We used a 2-processor 2-core AMD Opteron and a 2-processor 2-core Intel Xeon and looked at CCR and DSS performance

• For CCR we chose message patterns similar to those used in MPI

• For DSS we chose simple one-way and two-way message exchange between 2 services

This is a first step in examining the possibility of linking science and more general runtimes and seeing if we can get very high performance in all cases

(23)

Implementing CCR Performance Measurements

– CCR is written in C# and we built a suite of test programs in this language

– Multi-threaded performance analysis tools

• On the AMD machine, there is the free CodeAnalyst Performance Analyzer

– It allows one to see how work is assigned to threads, but it cannot look at the microsecond resolution needed for this work

• Intel thread analyzer (VTune) does not currently support C# or Java

• Microsoft Visual Studio 2005 Team Suite Performance Analyzer (no support for WOW64 or x64 yet)

– We looked at several thread message exchange patterns similar to basic Exchange and Shift in MPI

– We took a basic computation whose smallest unit took about 1.4 (AMD) to 1.5 (Intel) microseconds

– We typically ran 10^7 such units on each core to take 14 or 15 seconds

– We divided this run into from 1 to 10^7 stages, where at the end of each stage the threads sent messages (in various patterns) to the next threads that continued computation
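The structure of such a staged run can be sketched as follows. This is not the CCR test suite (which was written in C#): it is a hypothetical Python analogue in which `queue.Queue` objects stand in for ports, each thread sends to its right-hand neighbour with wraparound, and a trivial loop stands in for the computation unit. Python threads do not run CPU work in parallel, so the sketch shows only the control flow, not realistic timings.

```python
import queue
import threading


def run_stages(num_threads=4, total_units=1000, stages=10):
    """Split a fixed amount of work per thread into `stages` stages.

    After each stage every thread posts a message to the next thread's
    port and then waits for the message addressed to itself.
    Returns the number of messages each thread received.
    """
    ports = [queue.Queue() for _ in range(num_threads)]
    received = [0] * num_threads
    units_per_stage = total_units // stages

    def worker(rank):
        acc = 0.0
        for stage in range(stages):
            for _ in range(units_per_stage):  # stand-in computation unit
                acc += 1.0
            # Post to the neighbour first (never blocks on an unbounded
            # queue), then read from this thread's own port.
            ports[(rank + 1) % num_threads].put((rank, stage))
            ports[rank].get()
            received[rank] += 1

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return received


counts = run_stages()
```

Timing `run_stages` for increasing `stages` with `total_units` held fixed mirrors the experiment design: the extra run time per added stage is the per-stage messaging overhead.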

(24)
(25)

[Figure: messages flowing through Port0 to Port3 into Thread0 to Thread3 over successive stages, with one stage marked]

Pipeline, which is the simplest loosely synchronous execution in CCR. Note CCR supports a thread spawning model

MPI usually uses fixed threads with message rendezvous

(26)

[Figure: messages passing through Port0 to Port3 between Thread0 to Thread3 and arriving at EndPort]

(27)

Exchanging Messages with 1D Torus Exchange topology for loosely synchronous execution in CCR

[Figure: Thread0 to Thread3 write exchange messages to Port0 to Port3 and then read messages from them]

(28)

[Figure: four message patterns among Thread0 to Thread3 and Port0 to Port3: (a) Pipeline, (b) Shift, (c) Two Shifts, (d) Exchange]

(29)

Fixed amount of computation (4·10^7 units) divided into 4 cores and from 1 to 10^7 stages on HP Opteron Multicore. Each stage separated by reading and writing CCR ports in Pipeline mode

[Figure: time (seconds) against stages (millions), annotated "Overhead = Computation" and "Computation Component if no Overhead"]

8.04 microseconds per stage averaged from 1 to 10 million stages
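The per-stage figure can be turned into a simple timing model. The sketch below assumes run time grows linearly with the number of stages (fixed computation plus a constant overhead per stage) and reads "Overhead = Computation" as the stage count at which total overhead equals the computation time; both are assumptions, and the function names are illustrative. The inputs are the values quoted for the AMD machine: 10^7 units per core at about 1.4 microseconds each, and 8.04 microseconds per stage.

```python
def total_time_s(units_per_core, unit_us, stages, overhead_us):
    """Run time in seconds: fixed computation plus per-stage overhead."""
    compute_us = units_per_core * unit_us
    return (compute_us + stages * overhead_us) / 1e6


def break_even_stages(units_per_core, unit_us, overhead_us):
    """Stage count at which total overhead equals total computation."""
    return units_per_core * unit_us / overhead_us


no_overhead = total_time_s(10**7, 1.4, 1, 0.0)      # computation only
at_max = total_time_s(10**7, 1.4, 10**7, 8.04)      # 10 million stages
crossover = break_even_stages(10**7, 1.4, 8.04)
print(f"{no_overhead:.1f} s, {at_max:.1f} s, {crossover / 1e6:.2f} million stages")
```

Under this model the computation alone takes about 14 seconds, the run with 10 million stages takes about 94 seconds, and overhead matches computation at roughly 1.7 million stages.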

(30)

Fixed amount of computation (4·10^7 units) divided into 4 cores and from 1 to 10^7 stages on Dell Xeon Multicore. Each stage separated by reading and writing CCR ports in Pipeline mode

[Figure: time (seconds) against stages (millions) for the 4-way Pipeline pattern with 4 dispatcher threads on Dell Xeon, annotated "Overhead = Computation"]

12.40 microseconds per stage averaged from 1 to 10 million stages

(31)

Summary of Stage Overheads for AMD Machine

(32)

Summary of Stage Overheads for Intel Machine

These are stage switching overheads for a set of runs with different levels of parallelism and different message patterns; each stage takes about 30 microseconds. AMD overheads are in parentheses

(33)

AMD Bandwidth Measurements

• Previously we measured latency, as the measurements corresponded to small messages. We did a further set of measurements of bandwidth by exchanging larger messages of different sizes between threads

• We used three types of data structures for receiving data

– Array in thread equal to message size

– Array outside thread equal to message size

– Data stored sequentially in a large array (“stepped” array)

(34)

Intel Bandwidth Measurements

For bandwidth, Intel did better than AMD, especially when one exploited cache on chip with small transfers

For both AMD and Intel, each stage executed a computational task after copying data arrays of size 10^5 (labeled small), 10^6 (labeled large) or 10^7 double words. The last column is an approximate value in microseconds of the compute time for each stage. Note that copying 100,000 double precision words per core at a

(35)

Typical bandwidth measurements showing effect of cache with slope change

5,000 stages with run time plotted against size of double array copied in each stage from thread to stepped locations in a large array on Dell Xeon Multicore

[Figure: time (seconds) against array size (millions of double words) for the 4-way Pipeline pattern with 4 dispatcher threads on Dell Xeon, with the cache-related slope change marked]

Total Bandwidth 1.0 Gigabytes/Sec up to one million double words and 1.75 Gigabytes/Sec up to 100,000 double words
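The relation between array size, run time, and bandwidth can be sketched as below. Several things here are assumptions rather than statements from the slides: a double word is taken as 8 bytes, a gigabyte as 10^9 bytes, and "total bandwidth" as the sum over the 4 threads. The function names are illustrative, and the sketch does not try to reproduce the plotted run times, which also include computation.

```python
BYTES_PER_DOUBLE = 8  # assumed size of one double precision word


def total_bandwidth_gb_s(threads, stages, words_per_copy, seconds):
    """Aggregate copy bandwidth in gigabytes (10^9 bytes) per second."""
    total_bytes = threads * stages * words_per_copy * BYTES_PER_DOUBLE
    return total_bytes / seconds / 1e9


def copy_time_s(threads, stages, words_per_copy, bandwidth_gb_s):
    """Time spent copying at a given aggregate bandwidth."""
    total_bytes = threads * stages * words_per_copy * BYTES_PER_DOUBLE
    return total_bytes / (bandwidth_gb_s * 1e9)


# 4 threads, 5,000 stages, 100,000 double words per copy at 1.75 GB/s.
t_small = copy_time_s(4, 5000, 100_000, 1.75)
# The same volume of data moved in 16 s would correspond to 1.0 GB/s.
bw = total_bandwidth_gb_s(4, 5000, 100_000, 16.0)
```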

(36)

Timing of HP Opteron Multicore as a function of number of simultaneous two-way service messages processed (November 2006 DSS Release)

• CGL measurements of Axis 2 show about 500 microseconds; DSS is 10 times better

(37)

References

• Thesis for download

– http://grids.ucs.indiana.edu/~xqiu/dissertation.html

• Thesis project

– http://grids.ucs.indiana.edu/~xqiu/research.html

• Publications and Presentations

– http://grids.ucs.indiana.edu/~xqiu/publication.html

• NaradaBrokering Open Source Messaging System

– http://www.naradabrokering.org

• Information about Community Grids Lab project and publications

– http://grids.ucs.indiana.edu/ptliupages/

• Xiaohong Qiu, Geoffrey Fox, Alex Ho, Analysis of Concurrency and Coordination Runtime CCR and DSS for Parallel and Distributed Computing, technical report, November 2006
